# Dataset

1. Data used: Seattle Checkouts by Title (6.62 GB after extraction) [https://www.kaggle.com/city-of-seattle/seattle-checkouts-by-title]

2. Description: This dataset includes a monthly count of Seattle Public Library checkouts by title for physical and electronic items. The dataset begins with checkouts that occurred in April 2005.

# Initiating Spark

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName("simple data clustering").getOrCreate()

In [4]:
# Check whether the Spark session was created successfully
print(spark)

<pyspark.sql.session.SparkSession object at 0x000001DEE587A9B0>


# Loading Dataset

In [5]:
df = spark.read.csv("D:/Repos/Resource/Lib-Checkout/checkouts-by-title.csv", header=True, inferSchema=True)

In [6]:
#counting rows

df.count()

32723545

In [7]:
#show schema

df.schema

StructType(List(StructField(UsageClass,StringType,true),StructField(CheckoutType,StringType,true),StructField(MaterialType,StringType,true),StructField(CheckoutYear,IntegerType,true),StructField(CheckoutMonth,IntegerType,true),StructField(Checkouts,IntegerType,true),StructField(Title,StringType,true),StructField(Creator,StringType,true),StructField(Subjects,StringType,true),StructField(Publisher,StringType,true),StructField(PublicationYear,StringType,true)))

In [8]:
# Register the DataFrame as a temporary SQL view

df.createOrReplaceTempView('libchecksout')

# Training Machine Learning

As an example, we train a clustering model that groups the MaterialType values by their total number of checkouts.

### 1. Retrieve Data 

In [21]:
# Aggregate the total checkouts per MaterialType to use as training data

datas = spark.sql("SELECT MaterialType, SUM(Checkouts) AS Total FROM libchecksout GROUP BY MaterialType LIMIT 10000")

In [22]:
datas.show(10)

+-------------------+--------+
|       MaterialType|   Total|
+-------------------+--------+
|          MICROFORM|     178|
|              GLOBE|     741|
|REGPRINT, SOUNDDISC|     412|
|               BOOK|54356026|
|      ER, VIDEOCASS|       3|
|           VIDEOREC|    1943|
|        UNSPECIFIED|     320|
| PICTURE, VIDEODISC|      34|
|          MAP, VIEW|       2|
|            SECTION|       3|
+-------------------+--------+
only showing top 10 rows



In [23]:
datas.count()

66
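
To make the aggregation step concrete, the sketch below runs the same GROUP BY query against a tiny in-memory SQLite table instead of Spark. The sample rows and the helper name `total_checkouts_by_type` are hypothetical (they are not taken from the dataset or the notebook), and SQLite is only a stand-in chosen so the example runs without a Spark installation.

```python
import sqlite3

# Hypothetical sample rows (MaterialType, Checkouts); not real dataset values.
sample_rows = [
    ("BOOK", 5),
    ("BOOK", 7),
    ("SONG", 2),
    ("GLOBE", 1),
    ("SONG", 4),
]

def total_checkouts_by_type(rows):
    """Run the notebook's aggregation query on an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE libchecksout (MaterialType TEXT, Checkouts INTEGER)")
        conn.executemany("INSERT INTO libchecksout VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT MaterialType, sum(Checkouts) AS Total "
            "FROM libchecksout GROUP BY MaterialType LIMIT 10000"
        )
        # One (MaterialType, Total) pair per group.
        return dict(cursor.fetchall())
    finally:
        conn.close()

totals = total_checkouts_by_type(sample_rows)
print(totals)
```

The query returns one row per MaterialType, so the `LIMIT 10000` has no effect in the notebook: the aggregated result contains only 66 rows.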

### 2. Assembling Vector

In [24]:
# Assembling Vector
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["Total"],
    outputCol='features')

datas = assembler.transform(datas)
datas.show()

+--------------------+--------+-------------+
|        MaterialType|   Total|     features|
+--------------------+--------+-------------+
|           MICROFORM|     178|      [178.0]|
|               GLOBE|     741|      [741.0]|
| REGPRINT, SOUNDDISC|     412|      [412.0]|
|                BOOK|54356026|[5.4356026E7]|
|       ER, VIDEOCASS|       3|        [3.0]|
|            VIDEOREC|    1943|     [1943.0]|
|         UNSPECIFIED|     320|      [320.0]|
|  PICTURE, VIDEODISC|      34|       [34.0]|
|           MAP, VIEW|       2|        [2.0]|
|             SECTION|       3|        [3.0]|
|                SONG| 1311618|  [1311618.0]|
|           SOUNDCASS|  330665|   [330665.0]|
|            COMPFILE|      22|       [22.0]|
|        NONPROJGRAPH|       1|        [1.0]|
|                 KIT|   43737|    [43737.0]|
|SOUNDCASS, VIDEOCASS|      47|       [47.0]|
|              VISUAL|  110539|   [110539.0]|
|               CHART|       4|        [4.0]|
|        ER, VIDEOREC|     305|      [305.0]|

### 3. Training Model 

In [29]:
# Train model
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(datas)
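
To illustrate what fitting k-means with k=2 does on a single feature, here is a minimal pure-Python sketch of Lloyd's algorithm applied to the 19 totals visible in the table above. It is not Spark's implementation: Spark initializes centers with k-means|| by default, while this sketch assumes the two centers start at the smallest and largest value so that the run is deterministic. The function name `kmeans_1d` is hypothetical.

```python
# Totals copied from the rows displayed in the table above
# (a 19-row subset of the 66 aggregated rows).
totals = [178, 741, 412, 54356026, 3, 1943, 320, 34, 2, 3,
          1311618, 330665, 22, 1, 43737, 47, 110539, 4, 305]

def kmeans_1d(values, max_iter=20):
    """Two-cluster Lloyd's algorithm on scalars; returns (centers, labels)."""
    # Assumed initialization: start the two centers at the extremes.
    centers = [float(min(values)), float(max(values))]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(2), key=lambda c: (v - centers[c]) ** 2) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(2):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break  # converged: the centers stopped moving
        centers = new_centers
    return centers, labels

centers, labels = kmeans_1d(totals)
print(centers)
print(labels)
```

On this subset only BOOK ends up in the second cluster, because its total is orders of magnitude larger than every other value. The cluster ids themselves are arbitrary, and the result on the full 66 rows may differ.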

### 4. Prediction

In [30]:
# Make a prediction
predictions = model.transform(datas)
predictions.show(5)

+-------------------+--------+-------------+----------+
|       MaterialType|   Total|     features|prediction|
+-------------------+--------+-------------+----------+
|          MICROFORM|     178|      [178.0]|         0|
|              GLOBE|     741|      [741.0]|         0|
|REGPRINT, SOUNDDISC|     412|      [412.0]|         0|
|               BOOK|54356026|[5.4356026E7]|         1|
|      ER, VIDEOCASS|       3|        [3.0]|         0|
+-------------------+--------+-------------+----------+
only showing top 5 rows



### 5. Evaluate

In [31]:
# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

Silhouette with squared euclidean distance = 0.9779913142714661
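
The score above is the mean silhouette coefficient computed with squared Euclidean distance. As a rough illustration of that quantity, the sketch below computes it by brute force for scalar points. The input values are hypothetical, the function names are made up for this example, and scoring a single-member cluster as 0 is an assumed convention. Spark's evaluator is implemented differently, so this sketch is not expected to reproduce the 0.978 printed above.

```python
def squared_distance(x, y):
    return (x - y) ** 2

def mean_silhouette(values, labels):
    """Brute-force mean silhouette for scalar points with squared Euclidean distance."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            # Assumed convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it only affects the divisor).
        a = sum(squared_distance(v, o) for o in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(squared_distance(v, o) for o in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical example: two tight groups that lie far apart.
values = [1.0, 2.0, 3.0, 101.0, 102.0, 103.0]
labels = [0, 0, 0, 1, 1, 1]
score = mean_silhouette(values, labels)
print(score)
```

The coefficient ranges from -1 to 1. Values close to 1 mean that points sit much nearer to their own cluster than to the neighbouring one, which is the case for the two well-separated groups in this example.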


# Visualization

In [19]:
import pixiedust

Pixiedust database opened successfully


Table SPARK_PACKAGES created successfully


In [32]:
display(predictions)

![Histogram of the clustering result](../Resource/clustering_histogram.png)

![Line chart of the clustering result](../Resource/clustering_line_chart.png)