# Lab6: Spark ML: Clustering and Recommendation System

## Tasks
1. **Install and Configure PySpark:**
    - Install the PySpark library.
    - Import necessary modules from PySpark.

2. **Start Spark Session:**
    - Initialize a Spark session for clustering and recommendation tasks.

3. **Load and Simulate Telecom Data:**
    - Define a schema for the telecom dataset.
    - Create a Spark DataFrame with sample telecom data.

4. **Display the Dataset:**
    - Show the contents of the telecom dataset.

5. **Clustering:**
    - Combine features into a single vector using `VectorAssembler`.
    - Build and train a KMeans clustering model.
    - Assign clusters to customers and display the cluster assignments.

6. **Recommendation System (ALS):**
    - Simulate user-item interaction data for the recommendation system.
    - Define a schema for the recommendation dataset.
    - Create a Spark DataFrame with the recommendation data.
    - Initialize and train the ALS model.
    - Generate and display recommendations for customers.
    - Generate and display recommendations for plans.

7. **Project - Customer Segmentation Analysis:**
    - Load the telecom segmentation dataset from an Excel file.
    - Convert the Pandas DataFrame to a Spark DataFrame.
    - Display the schema and preview the data.
    - Select features for clustering.
    - Assemble features into a single vector column.
    - Apply KMeans clustering to the dataset.
    - Assign clusters to each customer and display the resulting clusters.
    - Evaluate clustering performance using the Silhouette Score.

8. **Practice Case Study - Clustering and Recommendation:**
    - Repeat the clustering and recommendation tasks with a new set of sample data.

9. **General Deployment Considerations:**
    - Discuss scalability, performance optimization, monitoring, and A/B testing for the deployed models.

10. **Stop Spark Session:**
     - Stop the Spark session after completing all tasks.




## Spark ML for Clustering and Recommendation Systems

These notes detail the development and deployment of Clustering models and Recommendation Systems using Spark ML.

### I. Clustering Models with Spark ML

**A. Overview:**

Clustering in Spark ML involves grouping similar data points together based on their features.  Spark ML provides algorithms like K-Means, Gaussian Mixture Models (GMM), and Bisecting K-Means. The choice of algorithm depends on the data distribution and desired cluster characteristics.

**B. Data Preparation:**

1. **Data Loading:** Load data into a Spark DataFrame. Ensure data is properly formatted and features are numerical. Categorical features need encoding (e.g., `StringIndexer` followed by `OneHotEncoder`) before use in most clustering algorithms.

2. **Feature Scaling:** Scale features so that those with larger numeric ranges do not dominate the distance calculations. Common scaling methods include standardization (z-score normalization, `StandardScaler`) and min-max scaling (`MinMaxScaler`).

3. **Feature Selection:**  Select relevant features for clustering.  Dimensionality reduction techniques (PCA) can be applied to reduce noise and improve performance, especially with high-dimensional data.
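To make the scaling point concrete, here is a minimal plain-Python sketch (not Spark code) that z-scores two columns taken from the lab's sample telecom data and compares distances before and after. The helper names `standardize` and `euclidean` are made up for this illustration. In a Spark pipeline, `StandardScaler` would do the scaling on an assembled vector column; note that by default it divides by the standard deviation without centering.

```python
import math

# Two features from the lab's sample telecom data: (Age, MonthlyUsage).
rows = [(20, 500), (45, 1000), (30, 600), (25, 300), (40, 900)]

def standardize(rows):
    """Z-score each column: subtract the mean, divide by the sample standard deviation."""
    scaled_columns = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        # Sample standard deviation (n - 1 in the denominator).
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / (len(col) - 1))
        scaled_columns.append([(x - mean) / std for x in col])
    return list(zip(*scaled_columns))

def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

scaled = standardize(rows)

# Distance between customers 1 and 2 before and after scaling.
raw_distance = euclidean(rows[0], rows[1])
scaled_distance = euclidean(scaled[0], scaled[1])

# Share of the raw distance explained by the MonthlyUsage gap alone.
usage_share = abs(rows[1][1] - rows[0][1]) / raw_distance
print(f"raw: {raw_distance:.2f}, scaled: {scaled_distance:.2f}, usage share: {usage_share:.3f}")
```

On the raw values the `MonthlyUsage` gap accounts for nearly the whole distance, so `Age` barely influences which customers look similar. After standardization both features contribute on a comparable scale.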

**C. Model Development:**

1. **Algorithm Selection:** Choose an appropriate algorithm:
    * **K-Means:** Simple, efficient, and widely used. Requires specifying the number of clusters (k).
    * **Gaussian Mixture Models (GMM):** Assumes data points are generated from a mixture of Gaussian distributions.  Can identify clusters with different shapes and sizes.
    * **Bisecting K-Means:**  A hierarchical clustering algorithm that repeatedly bisects the largest cluster.

2. **Parameter Tuning:** Tune parameters with a grid search: fit the model for each candidate setting and compare an evaluation metric such as the Silhouette Score. Key parameters (PySpark names) include:
    * **K-Means:** `k`, `maxIter`, `initMode`, `tol`
    * **GMM:** `k`, `maxIter`, `tol`
    * **Bisecting K-Means:** `k`, `maxIter`, `minDivisibleClusterSize`

3. **Model Training:** Fit the chosen algorithm to the preprocessed data.
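As a rough illustration of what fitting K-Means involves, the following plain-Python sketch implements Lloyd's algorithm: assign each point to its nearest center, move each center to the mean of its points, and repeat. It is a teaching sketch, not Spark's distributed implementation. The function name `kmeans` and its `max_iter` and `tol` arguments are assumptions chosen to echo the roles of `maxIter` and `tol` above, and the initial centers are simply the first two points rather than a k-means|| initialization.

```python
import math

def kmeans(points, centers, max_iter=20, tol=1e-4):
    """Lloyd's algorithm: alternate assignment and update until centers stop moving."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(old)  # leave an empty cluster's center in place
        # Stop once no center moved by more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, labels

# (MonthlyUsage, DataUsage) for the five sample customers in this lab.
points = [(500, 50), (1000, 200), (600, 80), (300, 30), (900, 150)]
centers, labels = kmeans(points, centers=[points[0], points[1]])
print(labels)   # [0, 1, 0, 0, 1]
print(centers)
```

With these starting centers the low-usage customers (1, 3, 4) end up in one cluster and the high-usage customers (2, 5) in the other. Different starting centers can give different results, which is why the lab's `KMeans` calls fix a `seed`.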


**D. Model Evaluation:**

1. **Silhouette Score:** Measures how similar a data point is to its own cluster compared to other clusters. Values range from -1 to 1, and higher scores indicate better-separated clusters. This is the metric that Spark ML's `ClusteringEvaluator` computes.

2. **Calinski-Harabasz Index:** Ratio of between-cluster dispersion to within-cluster dispersion. Higher scores indicate better clustering.

3. **Davies-Bouldin Index:**  Measures the average similarity between each cluster and its most similar cluster.  Lower scores indicate better clustering.
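The Silhouette Score can be worked through by hand on a tiny dataset. The sketch below is plain Python, with a hypothetical helper name `silhouette_score`. It uses squared Euclidean distance, which is the default distance measure of `ClusteringEvaluator`, and it treats singleton clusters as contributing 0, which is a common convention assumed here rather than a statement about Spark's behavior.

```python
def squared_euclidean(a, b):
    """Squared straight-line distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def silhouette_score(points, labels, distance=squared_euclidean):
    """Mean silhouette coefficient (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # assumed convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so divide by len - 1).
        a = sum(distance(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(distance(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # two tight, well-separated groups
bad = silhouette_score(points, [0, 1, 0, 1])   # each cluster mixes both groups
print(f"good: {good:.3f}, bad: {bad:.3f}")
```

The well-separated labeling scores close to 1, while the mixed labeling scores below 0, which matches the "higher is better" reading above.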

**E. Deployment:**

1. **Model Persistence:** Save the trained model to disk for later use (e.g., `kmeans_model.write().overwrite().save(path)`), and reload it with `KMeansModel.load(path)`.

2. **Model Prediction:** Use the saved model to predict cluster assignments for new data.

3. **Integration:** Integrate the clustering model into a larger application or pipeline.



### II. Recommendation Systems with Spark ML

**A. Overview:**

Recommendation systems aim to predict user preferences for items. Spark ML provides collaborative filtering through the Alternating Least Squares (ALS) algorithm.

**B. Data Preparation:**

1. **Data Loading:**  Load user-item interaction data into a Spark DataFrame.  This typically includes user IDs, item IDs, and ratings or implicit feedback (e.g., purchase history).

2. **Data Splitting:** Split the data into training, validation, and test sets.
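The splitting step can be sketched in plain Python as a seeded shuffle followed by slicing. The function name `split_interactions` and the 80/10/10 weights are assumptions for illustration. On a Spark DataFrame the usual tool is `DataFrame.randomSplit`, which samples each row independently and therefore yields only approximately the requested proportions.

```python
import random

def split_interactions(rows, weights=(0.8, 0.1, 0.1), seed=42):
    """Shuffle rows reproducibly, then cut them into train/validation/test slices."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = round(n * weights[0])
    val_end = train_end + round(n * weights[1])
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]

# Ten made-up (CustomerID, PlanID, Rating) rows, for illustration only.
interactions = [(user, user % 3 + 1, float(user % 5 + 1)) for user in range(1, 11)]

train, validation, test = split_interactions(interactions)
print(len(train), len(validation), len(test))   # 8 1 1
```

With very small datasets such as the samples in this lab, a split can leave some customers or plans with no training rows at all. That is the situation `coldStartStrategy="drop"` is meant to handle at evaluation time.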


**C. Model Development:**

1. **Algorithm Selection:** ALS is a common choice for collaborative filtering.

2. **Parameter Tuning:** Optimize parameters using the validation set. Key parameters:
    * **rank:** Number of latent factors.
    * **maxIter:** Maximum number of iterations.
    * **regParam:** Regularization parameter.
    * **alpha:** Confidence weight for implicit feedback; used only when `implicitPrefs=True`.

3. **Model Training:** Train the ALS model on the training data.
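To show what the latent factors controlled by `rank` are used for, here is a small plain-Python sketch. A trained ALS model holds one factor vector per user and one per item, and a predicted rating is the dot product of the two. The factor values below are invented for illustration (rank 2) and are not output from this lab's model; `predict` and `top_n` are hypothetical helper names.

```python
# Hypothetical rank-2 latent factors, invented for illustration.
user_factors = {1: [1.0, 0.5], 2: [0.5, 1.5]}
item_factors = {1: [2.0, 1.0], 2: [0.5, 2.0], 3: [1.0, 2.0]}

def predict(user_id, item_id):
    """Predicted rating = dot product of the user's and the item's latent vectors."""
    return sum(u * v for u, v in zip(user_factors[user_id], item_factors[item_id]))

def top_n(user_id, n):
    """Rank all items for one user by predicted rating, highest first."""
    scored = [(item_id, predict(user_id, item_id)) for item_id in item_factors]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

print(top_n(1, 2))   # [(1, 2.5), (3, 2.0)]
print(top_n(2, 2))   # [(3, 3.5), (2, 3.25)]
```

The `(item, score)` pairs have the same shape as the entries that `recommendForAllUsers` returns in its `recommendations` column. A fitted `ALSModel` exposes its learned vectors through the `userFactors` and `itemFactors` DataFrames.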


**D. Model Evaluation:**

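1. **Root Mean Squared Error (RMSE):** Square root of the mean squared difference between predicted and actual ratings. Large errors are penalized more heavily, and lower values indicate better performance.
2. **Mean Absolute Error (MAE):** Mean of the absolute differences between predicted and actual ratings. It is less sensitive to outliers than RMSE, and lower values indicate better performance.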
3. **Precision and Recall:** Evaluate the quality of recommendations based on the top-N recommendations.
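The three metrics above are simple enough to compute by hand, which helps when checking evaluator output. The sketch below is plain Python on made-up ratings; the function names are assumptions for this example. For rating predictions in Spark, `RegressionEvaluator` with `metricName="rmse"` or `"mae"` covers the first two.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error between two equal-length rating lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user's ranked recommendation list."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up ratings for four user-item pairs.
actual = [4.0, 2.0, 5.0, 3.0]
predicted = [3.5, 2.5, 4.0, 3.0]
print(f"RMSE: {rmse(actual, predicted):.4f}")   # RMSE: 0.6124
print(f"MAE:  {mae(actual, predicted):.4f}")    # MAE:  0.5000

# Plans ranked for one customer vs. the plans that customer actually liked.
precision, recall = precision_recall_at_k(recommended=[1, 3, 2], relevant={1, 2}, k=2)
print(precision, recall)   # 0.5 0.5
```

RMSE comes out higher than MAE here because the single error of 1.0 is squared before averaging.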

**E. Deployment:**

1. **Model Persistence:** Save the trained model.

2. **Real-time Recommendations:** Deploy the model for real-time predictions using a serving layer (e.g., REST API).

3. **Batch Recommendations:** Generate recommendations in batches for offline analysis or updates.



### III. General Deployment Considerations

* **Scalability:** Spark's distributed computing capabilities are crucial for handling large datasets.
* **Performance:** Optimize data pipelines and model parameters for efficient processing.
* **Monitoring:** Monitor the model's performance over time and retrain as needed.
* **A/B Testing:** Experiment with different models and parameters to find the optimal configuration.

## Case Study - Clustering and Recommendation

In [None]:
#Install and Configure PySpark
!pip install pyspark
from pyspark.sql import SparkSession

#Start Spark session
spark = SparkSession.builder.appName("TelecomClusteringRecommendation").getOrCreate()



In [None]:
#Load and Simulate Telecom Data
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

# Simulate a telecom dataset
data = [
    (1, 20, 500, 15, 50, 3),
    (2, 45, 1000, 50, 200, 5),
    (3, 30, 600, 20, 80, 2),
    (4, 25, 300, 10, 30, 1),
    (5, 40, 900, 40, 150, 4),
]
schema = StructType([
    StructField("CustomerID", IntegerType(), True),
    StructField("Age", IntegerType(), True),
    StructField("MonthlyUsage", IntegerType(), True),
    StructField("CallFrequency", IntegerType(), True),
    StructField("DataUsage", IntegerType(), True),
    StructField("PlanID", IntegerType(), True),
])
df = spark.createDataFrame(data, schema=schema)

In [None]:
# Display data
df.show()

+----------+---+------------+-------------+---------+------+
|CustomerID|Age|MonthlyUsage|CallFrequency|DataUsage|PlanID|
+----------+---+------------+-------------+---------+------+
|         1| 20|         500|           15|       50|     3|
|         2| 45|        1000|           50|      200|     5|
|         3| 30|         600|           20|       80|     2|
|         4| 25|         300|           10|       30|     1|
|         5| 40|         900|           40|      150|     4|
+----------+---+------------+-------------+---------+------+



In [None]:
#Clustering
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

In [None]:
#Combine features into a single vector
assembler = VectorAssembler(
    inputCols=["Age", "MonthlyUsage", "CallFrequency", "DataUsage"],
    outputCol="features"
)
data_with_features = assembler.transform(df)

In [None]:
#Build and train the KMeans model
kmeans = KMeans(k=3, seed=123, featuresCol="features", predictionCol="cluster")
kmeans_model = kmeans.fit(data_with_features)

In [None]:
#Assign clusters to customers
clusters = kmeans_model.transform(data_with_features)
clusters.select("CustomerID", "features", "cluster").show()

+----------+--------------------+-------+
|CustomerID|            features|cluster|
+----------+--------------------+-------+
|         1|[20.0,500.0,15.0,...|      2|
|         2|[45.0,1000.0,50.0...|      1|
|         3|[30.0,600.0,20.0,...|      2|
|         4|[25.0,300.0,10.0,...|      0|
|         5|[40.0,900.0,40.0,...|      1|
+----------+--------------------+-------+



In [None]:
#Recommendation System (ALS)
from pyspark.ml.recommendation import ALS

In [None]:
#Simulate user-item interaction data (customer plan usage)
recommendation_data = [
    (1, 1, 4.0),
    (1, 2, 2.0),
    (2, 1, 5.0),
    (2, 3, 3.0),
    (3, 2, 4.0),
    (3, 3, 1.0),
    (4, 1, 3.0),
    (4, 2, 5.0),
    (5, 3, 4.0),
]
rec_schema = StructType([
    StructField("CustomerID", IntegerType(), True),
    StructField("PlanID", IntegerType(), True),
    StructField("Rating", FloatType(), True),
])
rec_df = spark.createDataFrame(recommendation_data, schema=rec_schema)

In [None]:
# Initialize and train ALS model
als = ALS(
    userCol="CustomerID",
    itemCol="PlanID",
    ratingCol="Rating",
    maxIter=10,
    regParam=0.1,
    rank=10,
    coldStartStrategy="drop"
)
als_model = als.fit(rec_df)

In [None]:
# Generate recommendations for customers
customer_recommendations = als_model.recommendForAllUsers(3)
customer_recommendations.show(truncate=False)

+----------+------------------------------------------------+
|CustomerID|recommendations                                 |
+----------+------------------------------------------------+
|1         |[{1, 3.8783436}, {3, 2.6226692}, {2, 2.0328445}]|
|2         |[{1, 4.859957}, {2, 3.4273922}, {3, 3.0245786}] |
|3         |[{2, 3.8877108}, {1, 2.56783}, {3, 1.0166105}]  |
|4         |[{2, 4.847688}, {1, 3.025969}, {3, 1.1368906}]  |
|5         |[{1, 5.419569}, {3, 3.8903399}, {2, 2.561105}]  |
+----------+------------------------------------------------+



In [None]:
# Generate recommendations for plans
plan_recommendations = als_model.recommendForAllItems(3)
plan_recommendations.show(truncate=False)

+------+------------------------------------------------+
|PlanID|recommendations                                 |
+------+------------------------------------------------+
|1     |[{5, 5.419569}, {2, 4.859957}, {1, 3.8783436}]  |
|2     |[{4, 4.847688}, {3, 3.8877108}, {2, 3.4273922}] |
|3     |[{5, 3.8903399}, {2, 3.0245786}, {1, 2.6226692}]|
+------+------------------------------------------------+



## Project - Customer Segmentation Analysis

Telecom companies operate in a competitive market, where retaining customers and optimizing services is key to success. The goal of this analysis is to segment customers into meaningful clusters based on their usage patterns. These clusters can help in better understanding customer behavior and personalizing services.

The objective is to use KMeans clustering to group customers based on their usage metrics:

- Monthly bill amount
- Average call duration
- Data usage in GB
- Number of calls
- Customer tenure

The output clusters will help the telecom company understand behavioral patterns and develop targeted strategies for customer satisfaction and revenue growth.

In [None]:
#Install PySpark and Required Libraries
!pip install pyspark openpyxl



In [None]:
# Import the Spark session and the ML classes used for clustering
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

In [None]:
#Start a Spark session
spark = SparkSession.builder.appName("TelecomClustering").getOrCreate()

In [None]:
#Load the Dataset
import pandas as pd

# Upload the dataset file to Colab first; the path below assumes it is named 'telecom_segmentation_data.xlsx'
data = pd.read_excel("/content/telecom_segmentation_data.xlsx")

In [None]:
#Convert Pandas DataFrame to Spark DataFrame
df = spark.createDataFrame(data)

In [None]:
#Display the schema and preview the data
df.printSchema()
df.show()

root
 |-- Customer_ID: long (nullable = true)
 |-- Monthly_Bill_Amount: double (nullable = true)
 |-- Average_Call_Duration: double (nullable = true)
 |-- Segment: string (nullable = true)
 |-- Data_Usage_GB: double (nullable = true)
 |-- Number_of_Calls: long (nullable = true)
 |-- Customer_Tenure: long (nullable = true)

+-----------+-------------------+---------------------+---------+-----------------+---------------+---------------+
|Customer_ID|Monthly_Bill_Amount|Average_Call_Duration|  Segment|    Data_Usage_GB|Number_of_Calls|Customer_Tenure|
+-----------+-------------------+---------------------+---------+-----------------+---------------+---------------+
|          1|  45.09054491313347|                  2.0|Low Usage|19.35246582352076|            118|              9|
|          2|  46.92616129684993|    2.757929853910177|Low Usage|47.58500101408589|            125|              1|
|          3|  38.72679256500254|    2.475610187744059|Low Usage|36.86770314875885|            

In [None]:
#Select Features for Clustering
feature_cols = ["Monthly_Bill_Amount", "Average_Call_Duration", "Data_Usage_GB", "Number_of_Calls", "Customer_Tenure"]

In [None]:
#Assemble features into a single vector column
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_features = assembler.transform(df)

In [None]:
#Apply KMeans Clustering
kmeans = KMeans(featuresCol="features", predictionCol="cluster", k=3, seed=42)  # Specify the number of clusters (k)
kmeans_model = kmeans.fit(df_features)

In [None]:
#Assign Clusters to Each Customer
df_clusters = kmeans_model.transform(df_features)

In [None]:
#Display the resulting clusters
df_clusters.select("Customer_ID", "features", "cluster").show()

+-----------+--------------------+-------+
|Customer_ID|            features|cluster|
+-----------+--------------------+-------+
|          1|[45.0905449131334...|      2|
|          2|[46.9261612968499...|      2|
|          3|[38.7267925650025...|      2|
|          4|[44.2037271587046...|      0|
|          5|[43.3013117286886...|      2|
|          6|[46.8687213044576...|      0|
|          7|[47.6177325720739...|      2|
|          8|[35.1646500738022...|      0|
|          9|[43.8348782960322...|      2|
|         10|[35.6541057370489...|      0|
|         11|[41.6296933786249...|      2|
|         12|[38.3050918061458...|      2|
|         13|[38.3303700075245...|      2|
|         14|[38.5922849972231...|      2|
|         15|[40.8763641618155...|      2|
|         16|[37.9065198609228...|      2|
|         17|[40.7944639934784...|      2|
|         18|[37.8656987473993...|      0|
|         19|[35.4259241839322...|      0|
|         20|[39.1956963322939...|      2|
+-----------+--------------------+-------+

In [None]:
#Evaluate Clustering Performance using Silhouette Score
evaluator = ClusteringEvaluator(featuresCol="features", predictionCol="cluster", metricName="silhouette")
silhouette_score = evaluator.evaluate(df_clusters)

print(f"Silhouette Score: {silhouette_score:.4f}")

Silhouette Score: 0.5630


## Practice Case Study - Clustering and Recommendation



```
data = [
    (1, 20, 500, 15, 50, 3),
    (2, 45, 1000, 50, 200, 5),
    (3, 30, 600, 20, 80, 2),
    (4, 25, 300, 10, 30, 1),
    (5, 40, 900, 40, 150, 4),
    (6, 22, 700, 25, 100, 3),
    (7, 50, 1200, 60, 250, 5),
    (8, 35, 800, 30, 120, 4),
    (9, 28, 400, 12, 40, 1),
    (10, 42, 1100, 45, 180, 5)
]
```



In [None]:
# Install and Configure PySpark (if not already installed)
!pip install pyspark

In [None]:
# Import the Spark session, schema types, and ML classes used below
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.recommendation import ALS



In [None]:
# Start Spark session
spark = SparkSession.builder.appName("TelecomClusteringRecommendation").getOrCreate()

# Load and Simulate Telecom Data
# Sample data
data = [
    (1, 20, 500, 15, 50, 3),
    (2, 45, 1000, 50, 200, 5),
    (3, 30, 600, 20, 80, 2),
    (4, 25, 300, 10, 30, 1),
    (5, 40, 900, 40, 150, 4),
    (6, 22, 700, 25, 100, 3),
    (7, 50, 1200, 60, 250, 5),
    (8, 35, 800, 30, 120, 4),
    (9, 28, 400, 12, 40, 1),
    (10, 42, 1100, 45, 180, 5)
]

schema = StructType([
    StructField("CustomerID", IntegerType(), True),
    StructField("Age", IntegerType(), True),
    StructField("MonthlyUsage", IntegerType(), True),
    StructField("CallFrequency", IntegerType(), True),
    StructField("DataUsage", IntegerType(), True),
    StructField("PlanID", IntegerType(), True),
])

In [None]:
df = spark.createDataFrame(data, schema=schema)

In [None]:
# Display the dataset
print("Sample Telecom Dataset:")
df.show()

#Clustering
# Combine features into a single vector
assembler = VectorAssembler(
    inputCols=["Age", "MonthlyUsage", "CallFrequency", "DataUsage"],
    outputCol="features"
)

data_with_features = assembler.transform(df)

Sample Telecom Dataset:
+----------+---+------------+-------------+---------+------+
|CustomerID|Age|MonthlyUsage|CallFrequency|DataUsage|PlanID|
+----------+---+------------+-------------+---------+------+
|         1| 20|         500|           15|       50|     3|
|         2| 45|        1000|           50|      200|     5|
|         3| 30|         600|           20|       80|     2|
|         4| 25|         300|           10|       30|     1|
|         5| 40|         900|           40|      150|     4|
|         6| 22|         700|           25|      100|     3|
|         7| 50|        1200|           60|      250|     5|
|         8| 35|         800|           30|      120|     4|
|         9| 28|         400|           12|       40|     1|
|        10| 42|        1100|           45|      180|     5|
+----------+---+------------+-------------+---------+------+



In [None]:
# Build and train the KMeans model
kmeans = KMeans(k=3, seed=123, featuresCol="features", predictionCol="cluster") # setting k=3
kmeans_model = kmeans.fit(data_with_features)

In [None]:
# Assign clusters to customers
clusters = kmeans_model.transform(data_with_features)
print("Cluster assignments:")
clusters.select("CustomerID", "features", "cluster").show()

Cluster assignments:
+----------+--------------------+-------+
|CustomerID|            features|cluster|
+----------+--------------------+-------+
|         1|[20.0,500.0,15.0,...|      1|
|         2|[45.0,1000.0,50.0...|      0|
|         3|[30.0,600.0,20.0,...|      1|
|         4|[25.0,300.0,10.0,...|      1|
|         5|[40.0,900.0,40.0,...|      2|
|         6|[22.0,700.0,25.0,...|      2|
|         7|[50.0,1200.0,60.0...|      0|
|         8|[35.0,800.0,30.0,...|      2|
|         9|[28.0,400.0,12.0,...|      1|
|        10|[42.0,1100.0,45.0...|      0|
+----------+--------------------+-------+



In [None]:
#Recommendation System (ALS)
#Simulate user-item interaction data
recommendation_data = [
    (1, 1, 4.0), (1, 2, 2.0), (2, 1, 5.0), (2, 3, 3.0), (3, 2, 4.0),
    (3, 3, 1.0), (4, 1, 3.0), (4, 2, 5.0), (5, 3, 4.0), (6, 2, 3.0),
    (7, 3, 5.0), (8, 1, 2.0), (9, 2, 4.0), (10,3, 5.0)
]


rec_schema = StructType([
    StructField("CustomerID", IntegerType(), True),
    StructField("PlanID", IntegerType(), True),
    StructField("Rating", FloatType(), True),
])

rec_df = spark.createDataFrame(recommendation_data, schema=rec_schema)

In [None]:
# Initialize and train ALS model
als = ALS(
    userCol="CustomerID",
    itemCol="PlanID",
    ratingCol="Rating",
    maxIter=10,
    regParam=0.1,
    rank=10,
    coldStartStrategy="drop"
)

als_model = als.fit(rec_df)

In [None]:
# Generate recommendations for customers
customer_recommendations = als_model.recommendForAllUsers(3)
print("Recommendations for all users:")
customer_recommendations.show(truncate=False)

Recommendations for all users:
+----------+------------------------------------------------+
|CustomerID|recommendations                                 |
+----------+------------------------------------------------+
|10        |[{3, 4.926397}, {1, 3.2402215}, {2, 1.7122594}] |
|1         |[{1, 3.85063}, {3, 2.7237918}, {2, 2.0484934}]  |
|2         |[{1, 4.8521023}, {2, 3.6179414}, {3, 3.0074117}]|
|3         |[{2, 3.885463}, {1, 2.6564176}, {3, 1.0096402}] |
|4         |[{2, 4.8412123}, {1, 3.0279834}, {3, 1.2310266}]|
|5         |[{3, 3.9411175}, {1, 2.5921772}, {2, 1.3698076}]|
|6         |[{2, 2.95763}, {1, 2.1223497}, {3, 0.985676}]   |
|7         |[{3, 4.926397}, {1, 3.2402215}, {2, 1.7122594}] |
|8         |[{1, 1.9709389}, {2, 1.4556934}, {3, 1.2793583}]|
|9         |[{2, 3.9435065}, {1, 2.8298}, {3, 1.3142346}]   |
+----------+------------------------------------------------+



In [None]:
# Generate recommendations for plans
plan_recommendations = als_model.recommendForAllItems(3)
print("Recommendations for all plans:")
plan_recommendations.show(truncate=False)


spark.stop()

Recommendations for all plans:
+------+-----------------------------------------------+
|PlanID|recommendations                                |
+------+-----------------------------------------------+
|1     |[{2, 4.8521023}, {1, 3.85063}, {10, 3.2402215}]|
|2     |[{4, 4.8412123}, {9, 3.9435065}, {3, 3.885463}]|
|3     |[{10, 4.926397}, {7, 4.926397}, {5, 3.9411175}]|
+------+-----------------------------------------------+

