## Clustering using SparkML


<p style='color: red'>The purpose of this lab is to show you how to use SparkML to cluster data.</p>


## __Table of Contents__
<ol>
  <li>
    <a href="#Objectives">Objectives
    </a>
  </li>
  <li>
    <a href="#Datasets">Datasets
    </a>
  </li>
  <li>
    <a href="#Setup">Setup
    </a>
    <ol>
      <li>
        <a href="#Installing-Required-Libraries">Installing Required Libraries
        </a>
      </li>
      <li>
        <a href="#Importing-Required-Libraries">Importing Required Libraries
        </a>
      </li>
    </ol>
  </li>
  <li>
    <a href="#Examples">Examples
    </a>
    <ol>
      <li>
        <a href="#Task-1---Create-a-spark-session">Task 1 - Create a spark session
        </a>
      </li>
      <li>
        <a href="#Task-2---Load-the-data-in-a-csv-file-into-a-dataframe">Task 2 - Load the data in a csv file into a dataframe
        </a>
      </li>
      <li>
        <a href="#Task-3---Create-a-feature-vector">Task 3 - Create a feature vector
        </a>
      </li>
      <li>
        <a href="#Task-4---Create-a-clustering-model">Task 4 - Create a clustering model
        </a>
      </li>
      <li>
        <a href="#Task-5---Print-Cluster-Details">Task 5 - Print Cluster Details
        </a>
      </li>
    </ol>
  </li>
  <li>
    <a href="#Exercises">Exercises
    </a>
    <ol>
      <li>
        <a href="#Exercise-1---Create-a-spark-session">Exercise 1 - Create a spark session
        </a>
      </li>
      <li>
        <a href="#Exercise-2---Load-the-data-in-a-csv-file-into-a-dataframe">Exercise 2 - Load the data in a csv file into a dataframe
        </a>
      </li>
      <li>
        <a href="#Exercise-3---Create-a-feature-vector">Exercise 3 - Create a feature vector
        </a>
      </li>
      <li>
        <a href="#Exercise-4---Create-a-clustering-model">Exercise 4 - Create a clustering model
        </a>
      </li>
      <li>
        <a href="#Exercise-5---Print-Cluster-Details">Exercise 5 - Print Cluster Details
        </a>
      </li>
    </ol>
  </li>
</ol>


## Objectives

After completing this lab you will be able to:

 - Use PySpark to connect to a Spark cluster.
 - Create a Spark session.
 - Read a CSV file into a dataframe.
 - Use the KMeans algorithm to cluster the data.
 - Stop the Spark session.





## Datasets

In this lab you will use the following datasets:

 - Modified version of Wholesale customers dataset. Original dataset available at https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
 - Seeds dataset. Available at https://archive.ics.uci.edu/ml/datasets/seeds


----


## Setup

### Installing Required Libraries


In [8]:
!pip install pyspark==3.1.2 -q
!pip install findspark -q

### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [9]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

# Import functions/classes for SparkML

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

from pyspark.sql import SparkSession

## Examples


## Task 1 - Create a spark session


In [10]:
# Create a SparkSession
# Ignore any warnings printed by the SparkSession command

spark = SparkSession.builder.appName("Clustering using SparkML").getOrCreate()

## Task 2 - Load the data in a csv file into a dataframe


Download the data file


In [12]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/customers.csv


--2024-06-18 07:02:39--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/customers.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 198.23.119.245
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|198.23.119.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8909 (8.7K) [text/csv]
Saving to: ‘customers.csv’


2024-06-18 07:02:40 (170 MB/s) - ‘customers.csv’ saved [8909/8909]



Load the dataset into a Spark dataframe


In [14]:
# Using the spark.read.csv function, we load the data into a dataframe.
# header=True indicates that there is a header row in our csv file.
# inferSchema=True tells Spark to automatically determine the data types of the columns.

# Load customers dataset
customer_data = spark.read.csv("customers.csv", header=True, inferSchema=True)
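
To make the two options above more concrete, here is a minimal plain-Python sketch of the idea, not Spark's implementation. It reads a small in-memory sample (the first two data rows shown in this lab), uses the first line as column names, and casts each value. The names `SAMPLE_CSV`, `infer_value`, and `read_csv_with_inference` are hypothetical helpers introduced only for this illustration. Note that Spark infers one type per column, whereas this sketch casts value by value.

```python
import csv
import io

# Hypothetical sample: the header and first two rows of customers.csv as shown in this lab.
SAMPLE_CSV = """Fresh_Food,Milk,Grocery,Frozen_Food
12669,9656,7561,214
7057,9810,9568,1762
"""


def infer_value(text):
    """Try int, then float, else keep the string (a simplified stand-in for inferSchema)."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def read_csv_with_inference(handle):
    """Use the first line as column names (like header=True) and cast every value."""
    reader = csv.DictReader(handle)
    return [{name: infer_value(value) for name, value in row.items()} for row in reader]


rows = read_csv_with_inference(io.StringIO(SAMPLE_CSV))
print(rows[0])
```

Without type inference every value would stay a string, which is why the schema printed in this lab reports integer columns rather than string columns.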

Print the schema of the dataset


In [None]:
# Each row in this dataset is about a customer. The columns indicate the orders placed
# by a customer for Fresh_Food, Milk, Grocery, and Frozen_Food.

In [15]:
customer_data.printSchema()

root
 |-- Fresh_Food: integer (nullable = true)
 |-- Milk: integer (nullable = true)
 |-- Grocery: integer (nullable = true)
 |-- Frozen_Food: integer (nullable = true)



Show top 5 rows from the dataset


In [16]:
customer_data.show(n=5, truncate=False)

+----------+----+-------+-----------+
|Fresh_Food|Milk|Grocery|Frozen_Food|
+----------+----+-------+-----------+
|12669     |9656|7561   |214        |
|7057      |9810|9568   |1762       |
|6353      |8808|7684   |2405       |
|13265     |1196|4221   |6404       |
|22615     |5410|7198   |3915       |
+----------+----+-------+-----------+
only showing top 5 rows



## Task 3 - Create a feature vector


In [21]:
# Assemble the features into a single vector column
feature_cols = ['Fresh_Food', 'Milk', 'Grocery', 'Frozen_Food']
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
customer_transformed_data = assembler.transform(customer_data)

In [23]:
customer_transformed_data.show(5)

+----------+----+-------+-----------+--------------------+
|Fresh_Food|Milk|Grocery|Frozen_Food|            features|
+----------+----+-------+-----------+--------------------+
|     12669|9656|   7561|        214|[12669.0,9656.0,7...|
|      7057|9810|   9568|       1762|[7057.0,9810.0,95...|
|      6353|8808|   7684|       2405|[6353.0,8808.0,76...|
|     13265|1196|   4221|       6404|[13265.0,1196.0,4...|
|     22615|5410|   7198|       3915|[22615.0,5410.0,7...|
+----------+----+-------+-----------+--------------------+
only showing top 5 rows
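
Conceptually, the assembler takes the listed input columns of each row and packs them, in order, into one numeric vector. The following plain-Python sketch illustrates that idea for a single row; `assemble` and `first_row` are hypothetical names used only here, and the real `VectorAssembler` produces a Spark vector column rather than a Python list.

```python
# Column names taken from the customers dataset in this lab.
feature_cols = ["Fresh_Food", "Milk", "Grocery", "Frozen_Food"]


def assemble(row, input_cols):
    """Return the selected columns of one row as a list of floats, in the given order."""
    return [float(row[col]) for col in input_cols]


# First row of the customers dataset, written as a plain dictionary.
first_row = {"Fresh_Food": 12669, "Milk": 9656, "Grocery": 7561, "Frozen_Food": 214}
features = assemble(first_row, feature_cols)
print(features)  # [12669.0, 9656.0, 7561.0, 214.0]
```

The order of `feature_cols` determines the order of the values in the vector, which matches the truncated `features` column shown in the output above.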



You must tell the KMeans algorithm how many clusters to create from your data.


In [17]:
number_of_clusters = 3
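
This lab fixes the number of clusters in advance. One common way to compare candidate values, shown here only as a hedged sketch, is to look at the within-cluster cost: the sum of squared distances from each point to its nearest center. The toy points and hand-picked centers below are hypothetical and are not taken from the lab's datasets.

```python
import math


def within_cluster_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)


# Hypothetical toy data and hand-picked centers, for illustration only.
toy_points = [[1.0, 1.0], [1.0, 3.0], [9.0, 9.0], [9.0, 11.0]]
cost_k1 = within_cluster_cost(toy_points, [[5.0, 6.0]])
cost_k2 = within_cluster_cost(toy_points, [[1.0, 2.0], [9.0, 10.0]])
print(cost_k1, cost_k2)
```

The cost generally keeps falling as more clusters are added, so it is usually read as a guide (where does the improvement level off?) rather than minimized outright.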

## Task 4 - Create a clustering model


Create a KMeans clustering model


In [19]:
kmeans = KMeans(k = number_of_clusters)


Train/fit the model on the dataset


In [22]:
model = kmeans.fit(customer_transformed_data)
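
To give a feel for what fitting does, here is a minimal plain-Python sketch of Lloyd's algorithm, the classic k-means iteration: assign each point to its nearest center, then move each center to the mean of its points, and repeat. It is not Spark's implementation. In particular, using the first `k` points as initial centers is an assumption made here for reproducibility, and the toy data is hypothetical.

```python
import math


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def fit_kmeans(points, k, max_iter=20):
    """Lloyd's algorithm; the first k points serve as initial centers (sketch assumption)."""
    centers = [list(p) for p in points[:k]]
    for _ in range(max_iter):
        # Assignment step: group points by their nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its group; keep it if the group is empty.
        new_centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:
            break  # Centers stopped moving, so the assignment is stable.
        centers = new_centers
    return centers


# Hypothetical toy data: two well-separated groups in two dimensions.
toy_points = [[1.0, 1.0], [9.0, 9.0], [1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
toy_centers = fit_kmeans(toy_points, k=2)
print(toy_centers)
```

Once centers are known, predicting a cluster for a row amounts to calling something like `nearest` on its feature vector.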


## Task 5 - Print Cluster Details


Your model is now trained. Use it to assign each customer to a cluster.


In [24]:
# Make predictions on the dataset
predictions = model.transform(customer_transformed_data)

In [25]:
# Display the results
predictions.show(5)

+----------+----+-------+-----------+--------------------+----------+
|Fresh_Food|Milk|Grocery|Frozen_Food|            features|prediction|
+----------+----+-------+-----------+--------------------+----------+
|     12669|9656|   7561|        214|[12669.0,9656.0,7...|         2|
|      7057|9810|   9568|       1762|[7057.0,9810.0,95...|         2|
|      6353|8808|   7684|       2405|[6353.0,8808.0,76...|         2|
|     13265|1196|   4221|       6404|[13265.0,1196.0,4...|         2|
|     22615|5410|   7198|       3915|[22615.0,5410.0,7...|         1|
+----------+----+-------+-----------+--------------------+----------+
only showing top 5 rows



Display how many customers there are in each cluster.


In [26]:
predictions.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   74|
|         2|  322|
|         0|   44|
+----------+-----+
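
The grouping above is a plain frequency count of the `prediction` column. A rough plain-Python equivalent, using a hypothetical list of labels in place of the column, could look like this:

```python
from collections import Counter

# Hypothetical cluster labels, standing in for the prediction column.
labels = [2, 2, 2, 2, 1, 0, 1, 2]

# Count how many rows received each label, similar to groupBy('prediction').count().
cluster_sizes = Counter(labels)
for label, size in sorted(cluster_sizes.items()):
    print(label, size)

# The sizes should add up to the number of rows that were clustered.
total_rows = sum(cluster_sizes.values())
```

The same sanity check applies to the output above: 74 + 322 + 44 = 440, so the three cluster sizes together account for 440 clustered rows.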



In [27]:
#stop spark session
spark.stop()

# Seed Clustering


### Exercise 1 - Create a spark session


Create a SparkSession with the appName "Seed Clustering SparkML"


In [29]:
spark = SparkSession.builder.appName("Seed Clustering SparkML").getOrCreate()

### Exercise 2 - Load the data in a csv file into a dataframe


In [30]:
#download seed dataset
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/seeds.csv


--2024-06-18 07:11:42--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/seeds.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8973 (8.8K) [text/csv]
Saving to: ‘seeds.csv’


2024-06-18 07:11:42 (155 MB/s) - ‘seeds.csv’ saved [8973/8973]



Load the seed dataset


In [31]:
seed_data = spark.read.csv("seeds.csv", header=True, inferSchema=True)

Print the schema of the dataset


In [32]:
seed_data.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length of kernel: double (nullable = true)
 |-- width of kernel: double (nullable = true)
 |-- asymmetry coefficient: double (nullable = true)
 |-- length of kernel groove: double (nullable = true)



Show top 5 rows of the data set


In [34]:
seed_data.show(5)

+-----+---------+-----------+----------------+---------------+---------------------+-----------------------+
| area|perimeter|compactness|length of kernel|width of kernel|asymmetry coefficient|length of kernel groove|
+-----+---------+-----------+----------------+---------------+---------------------+-----------------------+
|15.26|    14.84|      0.871|           5.763|          3.312|                2.221|                   5.22|
|14.88|    14.57|     0.8811|           5.554|          3.333|                1.018|                  4.956|
|14.29|    14.09|      0.905|           5.291|          3.337|                2.699|                  4.825|
|13.84|    13.94|     0.8955|           5.324|          3.379|                2.259|                  4.805|
|16.14|    14.99|     0.9034|           5.658|          3.562|                1.355|                  5.175|
+-----+---------+-----------+----------------+---------------+---------------------+-----------------------+
only showing top 5 rows

### Exercise 3 - Create a feature vector


Assemble all columns into a single vector


In [36]:
features_cols = ["area","perimeter","compactness","length of kernel","width of kernel","asymmetry coefficient","length of kernel groove"]
assembler = VectorAssembler(inputCols=features_cols, outputCol="features")
seed_transformed_data = assembler.transform(seed_data)

### Exercise 4 - Create a clustering model


Create 7 clusters


In [39]:
number_of_clusters =  7
kmeans = KMeans(k = number_of_clusters)
model = kmeans.fit(seed_transformed_data)

### Exercise 5 - Print Cluster Details


In [None]:
predictions =  model.transform(seed_transformed_data)

In [42]:
predictions.show(n=5)

+-----+---------+-----------+----------------+---------------+---------------------+-----------------------+--------------------+----------+
| area|perimeter|compactness|length of kernel|width of kernel|asymmetry coefficient|length of kernel groove|            features|prediction|
+-----+---------+-----------+----------------+---------------+---------------------+-----------------------+--------------------+----------+
|15.26|    14.84|      0.871|           5.763|          3.312|                2.221|                   5.22|[15.26,14.84,0.87...|         0|
|14.88|    14.57|     0.8811|           5.554|          3.333|                1.018|                  4.956|[14.88,14.57,0.88...|         0|
|14.29|    14.09|      0.905|           5.291|          3.337|                2.699|                  4.825|[14.29,14.09,0.90...|         0|
|13.84|    13.94|     0.8955|           5.324|          3.379|                2.259|                  4.805|[13.84,13.94,0.89...|         0|
|16.14|    14

In [43]:
predictions.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   14|
|         6|   31|
|         3|   56|
|         5|    6|
|         4|   28|
|         2|   21|
|         0|   54|
+----------+-----+



In [44]:
#stop spark session
spark.stop()