### Clustering with Apache Spark
Clustering is an unsupervised learning technique that groups similar data points into clusters. It discovers patterns in the data based on the similarities between points, without requiring labels.
**Applications of Clustering**
- Customer Segmentation
- Anomaly Detection
- Image Segmentation (Computer Vision)
- Recommendation System

**Steps for implementing a clustering algorithm using SparkML**
1. Import the libraries
2. Start a Spark session
3. Load the data into a Spark DataFrame
4. Select the features you want to use for clustering
5. Assemble the features into a single vector
6. Train the K-means model
7. Make predictions
8. Show the cluster assignments
9. Finally, don't forget to stop the Spark session

In [1]:
import findspark
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans

In [2]:
# Create the Spark session
spark = SparkSession.builder.appName('Clustering with Apache Spark').getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/12/10 15:40:17 WARN Utils: Your hostname, omar, resolves to a loopback address: 127.0.1.1; using 192.168.1.4 instead (on interface wlo1)
25/12/10 15:40:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/10 15:40:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
%%bash
# Download the dataset only if it is not already present in data/
fileName='customers.csv'
if test -f "data/$fileName"; then
    echo 'file already exists'
else
    echo 'Downloading the file'
    # Make sure the target directory exists before moving the file into it
    mkdir -p data
    if wget -d https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/customers.csv; then
        echo 'file is downloaded successfully'
        mv "$fileName" data/
    else
        echo 'Something went wrong'
    fi
fi

file already exists


In [4]:
# Load the data into a Spark DataFrame
df = spark.read.csv('data/customers.csv', header=True, inferSchema=True)
df.printSchema()

root
 |-- Fresh_Food: integer (nullable = true)
 |-- Milk: integer (nullable = true)
 |-- Grocery: integer (nullable = true)
 |-- Frozen_Food: integer (nullable = true)



In [None]:
print(f'size of data is {df.count()}')

In [None]:
df.columns

In [None]:
# Assemble all columns into a single feature vector
assembler = VectorAssembler(inputCols=df.columns, outputCol='features')
df = assembler.transform(df)
df.columns
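
To make the assembling step concrete, here is a plain-Python sketch of what it conceptually does to one row: the selected column values are packed, in order, into a single numeric vector. The `assemble` helper is hypothetical and only mimics the idea; it is not how `VectorAssembler` is implemented. The sample row uses the values of the first record as it appears in this notebook's `show` output.

```python
# Conceptual illustration only: `assemble` is a hypothetical helper,
# not part of Spark. It mimics packing several columns into one vector.
def assemble(row, input_cols):
    """Return the values of `input_cols` from `row` as a list of floats."""
    return [float(row[col]) for col in input_cols]

input_cols = ['Fresh_Food', 'Milk', 'Grocery', 'Frozen_Food']
row = {'Fresh_Food': 12669, 'Milk': 9656, 'Grocery': 7561, 'Frozen_Food': 214}

features = assemble(row, input_cols)
print(features)
```

Because every column of the DataFrame is passed as `inputCols`, the resulting vector has one entry per original column.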


In [None]:
df.show(5)

In [12]:
# Initialize the K-means algorithm
# You must choose the number of clusters (k) up front
clusters = 3
k_mean = KMeans(k=clusters)

# Train the model
model = k_mean.fit(df)
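
As a rough intuition for what training does, the sketch below implements the basic K-means loop (Lloyd's algorithm) in plain Python on six made-up 2-D points: assign each point to its nearest center, move each center to the mean of its points, and repeat until nothing changes. It also prints the within-cluster sum of squared distances for a few values of k, which is one common way to compare candidate cluster counts. All names here (`kmeans`, `cost`, `toy_points`) are illustrative assumptions; Spark's `KMeans` is a distributed implementation with its own initialization strategy, so this is not its actual code.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, k, max_iter=20, seed=0):
    """Basic Lloyd's algorithm; returns the final list of centers."""
    centers = random.Random(seed).sample(points, k)  # k distinct starting points
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Move each center to the mean of its group (keep it if the group is empty)
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if updated == centers:  # assignments are stable, so stop early
            break
        centers = updated
    return centers

def cost(points, centers):
    """Within-cluster sum of squared distances to the nearest center."""
    return sum(squared_distance(p, centers[nearest(p, centers)]) for p in points)

# Made-up toy data: two visually obvious groups
toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

for k in (1, 2, 3):
    print(k, round(cost(toy_points, kmeans(toy_points, k)), 3))

toy_centers = kmeans(toy_points, k=2)
toy_labels = [nearest(p, toy_centers) for p in toy_points]
print(toy_labels)
```

On data like this, the cost drops sharply when going from one cluster to two; a large drop followed by small improvements is often read as a hint about a reasonable k.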

In [14]:
# Make predictions: adds a 'prediction' column with each row's cluster index
predictions = model.transform(df)

In [15]:
predictions.show(5)

+----------+----+-------+-----------+--------------------+----------+
|Fresh_Food|Milk|Grocery|Frozen_Food|            features|prediction|
+----------+----+-------+-----------+--------------------+----------+
|     12669|9656|   7561|        214|[12669.0,9656.0,7...|         0|
|      7057|9810|   9568|       1762|[7057.0,9810.0,95...|         0|
|      6353|8808|   7684|       2405|[6353.0,8808.0,76...|         0|
|     13265|1196|   4221|       6404|[13265.0,1196.0,4...|         0|
|     22615|5410|   7198|       3915|[22615.0,5410.0,7...|         2|
+----------+----+-------+-----------+--------------------+----------+
only showing top 5 rows


In [None]:
# Count the number of observations assigned to each of the three clusters
predictions.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   49|
|         2|   60|
|         0|  331|
+----------+-----+
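
When reading these cluster sizes, keep in mind that K-means compares points by Euclidean distance, so columns with larger values weigh more heavily in the result. A common optional preprocessing step is to standardize each feature to zero mean and unit variance before clustering. The plain-Python sketch below shows the idea on the five rows displayed earlier; the `standardize` helper is hypothetical. In a Spark pipeline, the `StandardScaler` transformer from `pyspark.ml.feature` would be the usual tool, although its default settings are not identical to this sketch and scaling may not suit every dataset.

```python
from statistics import mean, pstdev

# Five rows copied from the notebook output: Fresh_Food, Milk, Grocery, Frozen_Food
rows = [
    (12669, 9656, 7561, 214),
    (7057, 9810, 9568, 1762),
    (6353, 8808, 7684, 2405),
    (13265, 1196, 4221, 6404),
    (22615, 5410, 7198, 3915),
]

def standardize(rows):
    """Hypothetical helper: rescale every column to mean 0 and std 1."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]  # population standard deviation
    return [
        tuple((value - mu) / sigma for value, mu, sigma in zip(row, mus, sigmas))
        for row in rows
    ]

scaled = standardize(rows)
for row in scaled:
    print([round(v, 2) for v in row])
```

After this transformation every column is on a comparable scale, so no single product category dominates the distance computation simply because its raw numbers are larger.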



In [5]:
spark.stop()