# Lab 4: Spark ML for Machine Learning (Regression)

# Tasks

- Installation and Configuration
  - Installs the required dependencies:
    - openjdk-8-jdk-headless for Java support.
    - Spark binaries from Apache Spark's official source.
    - Python libraries pyspark and findspark.
  - Configures environment variables for Java and Spark.

- Spark Session Initialization
  - Initializes a Spark session.

- Dataset Preparation
  - Creates a simulated dataset related to telecom.
  - Defines a schema for the dataset using PySpark's StructType and StructField.
  - Converts the simulated data into a PySpark DataFrame using the schema.

- Data Display
  - Displays the dataset using the show() method.

- Feature Engineering
  - Uses VectorAssembler to combine multiple feature columns (CallDuration, MessagesSent, DataUsage) into a single feature vector.
  - Imports StandardScaler as an optional scaling step; the pipelines in this notebook do not apply it.

- Pipeline Creation
  - Creates a PySpark ML pipeline that chains the preprocessing stages (feature assembly and, where needed, string indexing) with the regression model.

- Train-Test Split
  - Splits the dataset into training and testing sets so that the model is evaluated on data it was not trained on.

- Model Training
  - Trains a LinearRegression model on the training data.

- Model Evaluation
  - Evaluates the trained regression model on the test dataset.
  - Calculates RMSE (Root Mean Squared Error) and, in the later sections, R² to assess performance.

- Predictions
  - Applies the trained regression model to the test data to predict target values (e.g., Revenue). Displays the predictions alongside the true values.

- Result Display
  - Prints the predictions and the evaluation metrics.

## Spark ML and Spark MLlib for Regression

**Spark MLlib (Legacy):**

* **Overview:** In this notebook, "Spark MLlib" refers to the original RDD-based machine learning API in Apache Spark (the `pyspark.mllib` package). It still works, but it has been in maintenance mode since Spark 2.0, and the DataFrame-based API (`pyspark.ml`, called "Spark ML" here) is now the primary API. MLlib uses RDDs (Resilient Distributed Datasets) for data representation.
* **Regression Algorithms:** The RDD-based API offers linear regression (including ridge and lasso variants) and tree-based regressors such as decision trees. It also includes logistic regression and Support Vector Machines (SVM), but those are classification algorithms rather than regression algorithms. These algorithms are exposed as lower-level APIs that require more manual configuration.
* **Data Handling:** Because it uses RDDs, data manipulation and preprocessing involve a series of transformations and actions on the RDDs, which can be less intuitive and less efficient than working with DataFrames.
* **Limitations:** MLlib's API is generally considered less user-friendly and flexible than Spark ML's. In maintenance mode it still receives bug fixes, but no new features are added to it.


**Spark ML:**


* **Overview:** Spark ML is the newer and preferred machine learning API in Spark. It uses DataFrames as its primary data structure, which makes data manipulation and preprocessing more convenient. Because DataFrames carry a schema, Spark can optimize operations on them, which helps performance on structured data.
* **Pipeline API:** Spark ML introduces a Pipeline API, which lets users define a sequence of data transformation and model training steps. This makes workflows more organized, repeatable, and modular.
* **Estimators and Transformers:** Spark ML represents algorithms as *Estimators* and *Transformers*. An Estimator is fit to data and produces a model; a Transformer converts one DataFrame into another. A fitted model is itself a Transformer, so it can be applied to new data with the same `transform` call. This gives every stage a consistent API.
* **Feature Engineering:** Spark ML provides a rich set of tools for feature engineering, including feature scalers, one-hot encoders, and vector assemblers. Good feature preparation often has a large effect on model performance.
* **Regression Algorithms:** Spark ML includes several algorithms suited for regression, such as Linear Regression, Generalized Linear Regression, Decision Tree Regression, Random Forest Regression, and Gradient-Boosted Trees Regression (GBT Regression). All of them work directly on DataFrames.
* **Model Evaluation:** Spark ML provides evaluation metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared to assess model accuracy.
* **Hyperparameter Tuning:** Cross-validation over a parameter grid can be used to choose the model's hyperparameters, i.e. settings that are fixed before training rather than learned from the data. Tuning them often improves model performance.
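
The Estimator/Transformer split described above can be illustrated with a toy, pure-Python analogy. The class names `MeanCenterer` and `MeanCenterModel` are hypothetical and are not part of Spark; they only mimic the pattern in which `fit` learns something from data and returns a model whose `transform` applies it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MeanCenterModel:
    """Plays the role of a fitted model: a Transformer holding learned state."""
    mean: float

    def transform(self, values: List[float]) -> List[float]:
        # Apply the learned mean to any dataset, including unseen data.
        return [v - self.mean for v in values]


class MeanCenterer:
    """Plays the role of an Estimator: fit() learns from data and returns a model."""

    def fit(self, values: List[float]) -> MeanCenterModel:
        return MeanCenterModel(mean=sum(values) / len(values))


train = [100.0, 200.0, 300.0]
model = MeanCenterer().fit(train)               # learns the mean from training data
centered_new = model.transform([250.0, 150.0])  # reuses it on new data
print(centered_new)
```

The same shape appears in the notebook's code: `pipeline.fit(train_data)` returns a model, and `model.transform(test_data)` applies it to data the model has not seen.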


**Regression in Spark ML:**

1. **Data Preparation:** Load the data into a DataFrame, handle missing values, and perform feature engineering (e.g., one-hot encoding categorical variables, scaling numerical features).

2. **Vector Assembler:** Combine features into a single vector column (required by many algorithms).

3. **Model Selection:**  Choose an appropriate regression algorithm (Linear Regression, Decision Trees, Random Forests, GBT, etc.). The choice depends on data characteristics (linearity, non-linearity, number of features) and desired performance.

4. **Pipeline Creation:** Create a pipeline that includes feature transformations (vector assembler, scalers, encoders) and the selected regression model. This ensures a reusable workflow.

5. **Train-Test Split:** Split the data into training and testing sets.

6. **Model Training:** Train the model on the training set using the pipeline.

7. **Model Evaluation:** Use the trained model to predict on the testing set and evaluate its performance using appropriate metrics (RMSE, R-squared, MAE, etc.).

8. **Hyperparameter Tuning:**  Use CrossValidator or TrainValidationSplit to fine-tune the model's hyperparameters for optimal performance.

9. **Deployment:**  Save the trained model for deployment and later use.


**Key Advantages of Spark ML:**

* **Scalability:** Handles large datasets efficiently.
* **Ease of Use:** DataFrame-based API makes data handling and preprocessing straightforward.
* **Pipeline API:** Streamlines model development and deployment.
* **Rich Feature Set:** Includes a wide variety of algorithms and feature engineering tools.

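Choosing between MLlib and ML: for new work, prefer Spark ML (the DataFrame-based API), which is the primary API and the one that receives new features. The RDD-based MLlib API is not formally deprecated, but it has been in maintenance mode since Spark 2.0.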


This notebook explores the use of Spark ML and Spark MLlib for regression tasks. It provides a comparison between the two libraries, highlighting the advantages of Spark ML and demonstrating a typical regression workflow using Spark ML.

## Case Study - predicting customer revenue

The objective of this case study is to build a basic regression model using Spark ML to predict customer revenue based on their usage metrics, such as call duration, number of messages sent, and data usage.

The dataset is a simulated telecom dataset with the following columns:

- UserID: A unique identifier for each customer.
- CallDuration (numeric): Total minutes of calls made by the customer in a specific period.
- MessagesSent (numeric): Total number of messages sent by the customer.
- DataUsage (numeric): Total data consumed by the customer (in GB).
- Revenue (numeric): The revenue generated by the customer (target variable).

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz

!tar xzf spark-3.5.3-bin-hadoop3.tgz

In [None]:
!pip install pyspark
!pip install -q findspark



In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.3-bin-hadoop3"

In [None]:
import pyspark
print(pyspark.__version__)

3.5.3


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, FloatType, IntegerType

# Initialize SparkSession
spark = SparkSession.builder.appName("TelecomML").getOrCreate()

In [None]:
# Simulated telecom dataset
# Numeric values are written as floats to match the FloatType fields in the schema below
data = [
    (1, 200.0, 30.0, 15.0, 0.5),  # UserID, CallDuration, MessagesSent, DataUsage, Revenue
    (2, 100.0, 20.0, 10.0, 0.3),
    (3, 300.0, 50.0, 20.0, 0.8),
    (4, 150.0, 25.0, 12.0, 0.4),
    (5, 250.0, 40.0, 18.0, 0.7),
]

schema = StructType([
    StructField("UserID", IntegerType(), True),
    StructField("CallDuration", FloatType(), True),
    StructField("MessagesSent", FloatType(), True),
    StructField("DataUsage", FloatType(), True),
    StructField("Revenue", FloatType(), True),
])

df = spark.createDataFrame(data, schema=schema)

In [None]:
# Show dataset
df.show()

+------+------------+------------+---------+-------+
|UserID|CallDuration|MessagesSent|DataUsage|Revenue|
+------+------------+------------+---------+-------+
|     1|       200.0|        30.0|     15.0|    0.5|
|     2|       100.0|        20.0|     10.0|    0.3|
|     3|       300.0|        50.0|     20.0|    0.8|
|     4|       150.0|        25.0|     12.0|    0.4|
|     5|       250.0|        40.0|     18.0|    0.7|
+------+------------+------------+---------+-------+



In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml import Pipeline

# Assemble features into a single vector
feature_columns = ["CallDuration", "MessagesSent", "DataUsage"]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")

# Split into training and testing data
train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)
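
`StandardScaler` is imported above but not added to the pipeline. As a rough illustration of what it would compute, the sketch below standardizes the `CallDuration` values in plain Python. It assumes Spark's documented defaults (`withStd=True`, `withMean=False`, i.e. divide by the sample standard deviation without centering); the variable names are illustrative only.

```python
import statistics

# CallDuration values from the simulated dataset
call_duration = [200.0, 100.0, 300.0, 150.0, 250.0]

# Sample (n - 1) standard deviation, which StandardScaler is documented to use
std = statistics.stdev(call_duration)

# Default behaviour assumed here: scale to unit standard deviation, no centering
scaled = [v / std for v in call_duration]

# With centering switched on (withMean=True in Spark), the mean is subtracted first
mean = statistics.mean(call_duration)
centered_scaled = [(v - mean) / std for v in call_duration]

print(round(std, 4))
print([round(v, 4) for v in scaled])
print([round(v, 4) for v in centered_scaled])
```

Scaling matters most when features live on very different ranges, as CallDuration (hundreds) and DataUsage (tens) do here, and when the model is regularized.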

In [None]:
from pyspark.ml.regression import LinearRegression

# Initialize Linear Regression model
lr = LinearRegression(featuresCol="features", labelCol="Revenue")
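
For reference, the objective that `LinearRegression` minimizes can be written as below. This is a sketch based on the Spark documentation's description of squared-error loss with elastic-net regularization. The symbols ($n$ rows, feature vectors $\mathbf{x}_i$, labels $y_i$, weights $\mathbf{w}$, intercept $b$) are notation introduced here, with $\lambda$ standing for `regParam` and $\alpha$ for `elasticNetParam`.

```latex
\min_{\mathbf{w},\, b}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(\mathbf{w}^{\top}\mathbf{x}_i + b - y_i\bigr)^2
\;+\;
\lambda\Bigl(\alpha\,\lVert\mathbf{w}\rVert_1 + \frac{1-\alpha}{2}\,\lVert\mathbf{w}\rVert_2^2\Bigr)
```

With the defaults used in this notebook (`regParam` is left at 0.0), the penalty term vanishes and the fit reduces to ordinary least squares.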

In [None]:
# Create a pipeline
pipeline = Pipeline(stages=[assembler, lr])

In [None]:
# Train the model
model = pipeline.fit(train_data)

In [None]:
# Make predictions
predictions = model.transform(test_data)

In [None]:
# Show predictions
predictions.select("UserID", "features", "Revenue", "prediction").show()

+------+-----------------+-------+------------------+
|UserID|         features|Revenue|        prediction|
+------+-----------------+-------+------------------+
|     3|[300.0,50.0,20.0]|    0.8|0.8999999761581466|
+------+-----------------+-------+------------------+



In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

# Initialize evaluator
evaluator = RegressionEvaluator(labelCol="Revenue", predictionCol="prediction", metricName="rmse")

In [None]:
# Calculate RMSE
rmse = evaluator.evaluate(predictions)
print(f"Root Mean Square Error (RMSE): {rmse}")

Root Mean Square Error (RMSE): 0.09999996423721769
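
Because the 80/20 split left a single row in the test set, the RMSE above is just the absolute error on that one prediction. The plain-Python check below reproduces it from the displayed values. It assumes the label is stored as a 32-bit float (the schema uses `FloatType`), which is why 0.8 is first rounded to single precision.

```python
import math
import struct

# Prediction shown in the table for UserID 3
prediction = 0.8999999761581466

# Revenue is a FloatType column, so 0.8 is held as the nearest 32-bit float
label = struct.unpack("f", struct.pack("f", 0.8))[0]

errors = [prediction - label]
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(rmse)
```

A one-row test set gives a very noisy estimate, so this number says little about how well the model would generalize.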


## Project on Regression using Spark ML - predicting bandwidth allocation

**Problem Statement**

**Title:** Predicting 5G Bandwidth Allocation Based on Quality of Service Metrics

In the rapidly evolving domain of 5G networks, efficient allocation of bandwidth to users is critical to maintaining optimal Quality of Service (QoS). The goal of this project is to develop a regression model using Spark ML to predict the bandwidth allocated to users based on various QoS metrics.

The Allocated Bandwidth is influenced by factors such as:

- Signal strength
- Latency
- Required bandwidth
- Resource allocation percentage
- Type of application (e.g., video calls, streaming, emergency services).

By accurately predicting the allocated bandwidth, telecom operators can:

- Optimize resource utilization
- Ensure equitable and efficient bandwidth distribution
- Enhance the overall user experience

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType
from pyspark.sql.functions import regexp_replace, col

# Initialize Spark Session
spark = SparkSession.builder.appName("5G_Bandwidth_Allocation").getOrCreate()

# Load the dataset (set file_path to the CSV's actual location in Colab)
file_path = "/content/Quality of Service 5G.csv"
spark_df = spark.read.csv(file_path, header=True, inferSchema=True)

spark_df.show()

+--------------+-------+-------------------+---------------+-------+------------------+-------------------+-------------------+
|     Timestamp|User_ID|   Application_Type|Signal_Strength|Latency|Required_Bandwidth|Allocated_Bandwidth|Resource_Allocation|
+--------------+-------+-------------------+---------------+-------+------------------+-------------------+-------------------+
|9/3/2023 10:00| User_1|         Video_Call|        -75 dBm|  30 ms|           10 Mbps|            15 Mbps|                70%|
|9/3/2023 10:00| User_2|         Voice_Call|        -80 dBm|  20 ms|          100 Kbps|           120 Kbps|                80%|
|9/3/2023 10:00| User_3|          Streaming|        -85 dBm|  40 ms|            5 Mbps|             6 Mbps|                75%|
|9/3/2023 10:00| User_4|  Emergency_Service|        -70 dBm|  10 ms|            1 Mbps|           1.5 Mbps|                90%|
|9/3/2023 10:00| User_5|      Online_Gaming|        -78 dBm|  25 ms|            2 Mbps|             3 Mb

In [None]:
from pyspark.sql.functions import regexp_replace, col

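# Strip unit suffixes from specific columns and cast them to numeric types.
# Caveat: the Kbps/Mbps suffix is dropped without converting units, so
# "100 Kbps" and "10 Mbps" become 100.0 and 10.0 on different scales.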
cleaned_df = spark_df \
    .withColumn("Signal_Strength", regexp_replace(col("Signal_Strength"), " dBm", "").cast("int")) \
    .withColumn("Latency", regexp_replace(col("Latency"), " ms", "").cast("int")) \
    .withColumn("Required_Bandwidth", regexp_replace(col("Required_Bandwidth"), " (Kbps|Mbps)", "").cast("float")) \
    .withColumn("Allocated_Bandwidth", regexp_replace(col("Allocated_Bandwidth"), " (Kbps|Mbps)", "").cast("float")) \
    .withColumn("Resource_Allocation", regexp_replace(col("Resource_Allocation"), "%", "").cast("int"))

# Show the cleaned DataFrame
cleaned_df.show()


+--------------+-------+-------------------+---------------+-------+------------------+-------------------+-------------------+
|     Timestamp|User_ID|   Application_Type|Signal_Strength|Latency|Required_Bandwidth|Allocated_Bandwidth|Resource_Allocation|
+--------------+-------+-------------------+---------------+-------+------------------+-------------------+-------------------+
|9/3/2023 10:00| User_1|         Video_Call|            -75|     30|              10.0|               15.0|                 70|
|9/3/2023 10:00| User_2|         Voice_Call|            -80|     20|             100.0|              120.0|                 80|
|9/3/2023 10:00| User_3|          Streaming|            -85|     40|               5.0|                6.0|                 75|
|9/3/2023 10:00| User_4|  Emergency_Service|            -70|     10|               1.0|                1.5|                 90|
|9/3/2023 10:00| User_5|      Online_Gaming|            -78|     25|               2.0|                3
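
The cleaning step above strips the `Kbps`/`Mbps` suffix without converting between units, so values such as `100 Kbps` and `10 Mbps` end up as 100.0 and 10.0. One possible fix is to normalize everything to Mbps before casting. The helper below is a hypothetical plain-Python sketch (`to_mbps` is not part of the notebook), assuming 1 Mbps = 1000 Kbps; in Spark the same logic would need to be expressed with column functions or a UDF.

```python
import re

# Divisors that convert each unit to Mbps; assumes decimal units (1 Mbps = 1000 Kbps)
_DIVISORS = {"kbps": 1000.0, "mbps": 1.0}
_PATTERN = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*(Kbps|Mbps)\s*$", re.IGNORECASE)


def to_mbps(text: str) -> float:
    """Parse strings like '100 Kbps' or '1.5 Mbps' into a value in Mbps."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unrecognized bandwidth value: {text!r}")
    value, unit = match.groups()
    return float(value) / _DIVISORS[unit.lower()]


samples = ["10 Mbps", "100 Kbps", "1.5 Mbps", "120 Kbps"]
converted = [to_mbps(s) for s in samples]
print(converted)
```

Applying such a conversion would change the feature and label scales, so the metrics reported later in this section would no longer be comparable with a rerun.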

In [None]:
# Print the number of rows and columns
print(f"Total Rows: {spark_df.count()}")
print(f"Total Columns: {len(spark_df.columns)}")

Total Rows: 400
Total Columns: 8


In [None]:
from pyspark.sql.functions import col,isnan, when, count

cleaned_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in cleaned_df.columns]).show()


+---------+-------+----------------+---------------+-------+------------------+-------------------+-------------------+
|Timestamp|User_ID|Application_Type|Signal_Strength|Latency|Required_Bandwidth|Allocated_Bandwidth|Resource_Allocation|
+---------+-------+----------------+---------------+-------+------------------+-------------------+-------------------+
|        0|      0|               0|              0|      0|                 0|                  0|                  0|
+---------+-------+----------------+---------------+-------+------------------+-------------------+-------------------+



In [None]:
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Encode categorical column
indexer = StringIndexer(inputCol="Application_Type", outputCol="Application_Index")
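
`StringIndexer` maps each distinct string to a numeric index. By default (`stringOrderType="frequencyDesc"`) the most frequent label gets index 0. The plain-Python emulation below illustrates that ordering on a few made-up labels; it assumes ties are broken alphabetically, as documented for Spark 3.x, and is not Spark's implementation.

```python
from collections import Counter

# Illustrative labels only; the counts are made up for this sketch
labels = ["Streaming", "Video_Call", "Streaming", "Online_Gaming",
          "Video_Call", "Streaming", "Voice_Call"]

counts = Counter(labels)

# Sort by descending frequency, then alphabetically to break ties
ordered = sorted(counts, key=lambda label: (-counts[label], label))
index_of = {label: float(i) for i, label in enumerate(ordered)}

print(index_of)
```

The resulting index is an arbitrary code, so a linear model will treat it as an ordered quantity. One-hot encoding the index (Spark provides `OneHotEncoder` for this) is a common alternative.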

In [None]:
# Assemble features into a single vector
feature_columns = ["Signal_Strength", "Latency", "Required_Bandwidth", "Resource_Allocation", "Application_Index"]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")

In [None]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline

# Initialize Linear Regression model
lr = LinearRegression(featuresCol="features", labelCol="Allocated_Bandwidth")

In [None]:
# Create a pipeline
pipeline = Pipeline(stages=[indexer, assembler, lr])

In [None]:
# Split data into training and testing sets
train_data, test_data = cleaned_df.randomSplit([0.8, 0.2], seed=42)
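
`randomSplit` assigns each row to a split at random according to the weights, so the resulting sizes are only approximately 80/20. The sketch below emulates that behaviour in plain Python with the standard `random` module. It does not reproduce Spark's random number generator, so the exact rows chosen will differ from the notebook's; `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, train_weight=0.8, seed=42):
    """Assign each row independently to train or test (emulation, not Spark's RNG)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(400))  # stand-in for the 400 rows of the dataset
train, test = random_split(rows)
print(len(train), len(test))  # sizes vary around 320 / 80 depending on the seed
```

Fixing the seed, as the notebook does with `seed=42`, makes the split repeatable between runs on the same data.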

In [None]:
# Train the model
model = pipeline.fit(train_data)

In [None]:
# Make predictions
predictions = model.transform(test_data)

# Show predictions
predictions.select("features", "Allocated_Bandwidth", "prediction").show()

+--------------------+-------------------+-------------------+
|            features|Allocated_Bandwidth|         prediction|
+--------------------+-------------------+-------------------+
|[-76.0,32.0,12.0,...|               14.0| 13.686394859125464|
|[-74.0,29.0,10.0,...|               12.0|   5.19777683669836|
|[-69.0,9.0,1.2000...|                1.3|   5.56509596082433|
|[-77.0,31.0,11.0,...|               13.0| 15.912271788979972|
|[-68.0,8.0,1.1000...|                1.2|-0.9547421177030131|
|[-84.0,33.0,3.400...|                3.7|0.27815617734280096|
|[-82.0,36.0,4.0,8...|                4.6| 11.171584753116925|
|[-79.0,29.0,10.5,...|               12.3|  12.32637927087034|
|[-86.0,31.0,3.599...|                3.9| 0.5951779176391199|
|[-90.0,50.0,500.0...|              550.0|  505.9951888442168|
|[-88.0,30.0,1.0,6...|                1.0|-4.9864736569116275|
|[-82.0,35.0,3.0,8...|                3.5|  6.135364392929517|
|[-86.0,18.0,155.0...|              180.0| 172.03014393

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

In [None]:
# Evaluate predictions
evaluator = RegressionEvaluator(labelCol="Allocated_Bandwidth", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)

In [None]:
# Calculate R-squared
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})

print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R2): {r2}")

Root Mean Squared Error (RMSE): 7.159055459189656
R-squared (R2): 0.9989080949744653
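
For reference, the two metrics reported above are defined as follows, where $y_i$ is the true label, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the true labels and $n$ the number of test rows (notation introduced here):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2}{\sum_{i=1}^{n}\bigl(y_i - \bar{y}\bigr)^2}
```

RMSE is expressed in the units of the label, while $R^2$ is relative to the variance of the labels. Here a few large labels (such as the 550.0 row) dominate that variance, which is one reason $R^2$ can be close to 1 even though several small-bandwidth rows are predicted poorly.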


## Practice Case Study - bandwidth prediction on custom dataset



```
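# Columns: UserID, Signal_Strength, Latency, Required_Bandwidth,
#          Resource_Allocation, Application_Type, Allocated_Bandwidth
data = [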
    (1, -70.0, 20.0, 10.0, 5.0, "Video Call", 5.0),
    (2, -65.0, 15.0, 5.0, 3.0, "Streaming", 3.5),
    (3, -75.0, 25.0, 20.0, 10.0, "Emergency", 18.0),
    (4, -80.0, 30.0, 10.0, 5.0, "Streaming", 6.0),
    (5, -60.0, 10.0, 2.0, 2.0, "Video Call", 2.0),
    (6, -72.0, 22.0, 12.0, 6.0, "Video Call", 7.0),
    (7, -68.0, 18.0, 8.0, 4.0, "Streaming", 4.0),
    (8, -78.0, 28.0, 15.0, 7.0, "Emergency", 15.0),
    (9, -63.0, 12.0, 3.0, 1.0, "Video Call", 1.5),
    (10, -73.0, 23.0, 13.0, 6.5, "Streaming", 8.0),
]
```



To build a regression model with Spark ML on this dataset, participants should work through the following steps:

1. **Data Loading and Preparation:** Load the dataset into a Spark DataFrame.  Ensure data types are correct (numeric for features and the target variable). Handle missing values appropriately (imputation or removal).  Perform any necessary data cleaning, like removing units from numerical columns if they exist, and converting them to the appropriate numerical data type.


2. **Feature Engineering:** If needed, create new features or transform existing ones. For instance, you might create interaction terms between features or apply transformations to better represent the data. Encode any categorical variables (e.g., application type in the 5G example) using methods like StringIndexer, creating numerical representations for them.


3. **Vector Assembler:**  Combine the relevant features into a single vector column. This is a crucial step as many ML algorithms in Spark expect a single vector column of input features.


4. **Data Splitting:** Divide the dataset into training and testing sets.  This is standard practice to assess how well your model generalizes to unseen data.


5. **Model Selection:** Choose a suitable regression algorithm.  Linear Regression is a good starting point, especially if you believe there's a linear relationship between the features and the target variable.  Other options include Decision Tree Regression, Random Forest Regression, and Gradient-Boosted Trees (GBT) for more complex relationships.


6. **Pipeline Creation (Recommended):** Build a pipeline that includes feature transformations and the chosen regression model. Pipelines make the entire workflow manageable, reproducible, and easier to tune.


7. **Model Training:** Train the model on the training data using the pipeline.


8. **Model Evaluation:** Evaluate the model's performance on the testing data. Common evaluation metrics for regression are RMSE, MAE, and R-squared. Use a RegressionEvaluator to calculate these metrics.


9. **Hyperparameter Tuning (Important):** Fine-tune the model's hyperparameters (parameters that aren't learned from the data itself).  Techniques like cross-validation help optimize these parameters for better performance.
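
To make the cost of hyperparameter tuning concrete, the sketch below enumerates a small parameter grid in plain Python and counts the model fits a k-fold cross-validation would need. The candidate values for `regParam` and `elasticNetParam` are arbitrary examples. The counting (one fit per fold per combination, plus a final refit of the best combination on all training data) reflects how Spark's `CrossValidator` is documented to work.

```python
from itertools import product

# Example candidate values (assumptions for illustration, not tuned recommendations)
grid = {
    "regParam": [0.0, 0.01, 0.1],
    "elasticNetParam": [0.0, 0.5, 1.0],
}
num_folds = 3

# Every combination of the candidate values is one setting to evaluate
param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]

fits_during_search = len(param_maps) * num_folds
total_fits = fits_during_search + 1  # final refit of the best setting

for params in param_maps[:3]:
    print(params)
print(len(param_maps), fits_during_search, total_fits)
```

With only 10 rows in the practice dataset, each fold would hold very few rows, so cross-validated scores would be noisy.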

In [None]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, col
from pyspark.sql.types import StructType, StructField, FloatType, IntegerType, StringType
from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler, StringIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

In [None]:
# Initialize Spark Session
spark = SparkSession.builder.appName("5G_Bandwidth_Allocation").getOrCreate()

data = [
    (1, -70.0, 20.0, 10.0, 5.0, "Video Call", 5.0),
    (2, -65.0, 15.0, 5.0, 3.0, "Streaming", 3.5),
    (3, -75.0, 25.0, 20.0, 10.0, "Emergency", 18.0),
    (4, -80.0, 30.0, 10.0, 5.0, "Streaming", 6.0),
    (5, -60.0, 10.0, 2.0, 2.0, "Video Call", 2.0),
    (6, -72.0, 22.0, 12.0, 6.0, "Video Call", 7.0),
    (7, -68.0, 18.0, 8.0, 4.0, "Streaming", 4.0),
    (8, -78.0, 28.0, 15.0, 7.0, "Emergency", 15.0),
    (9, -63.0, 12.0, 3.0, 1.0, "Video Call", 1.5),
    (10, -73.0, 23.0, 13.0, 6.5, "Streaming", 8.0),
]

schema = StructType([
    StructField("UserID", IntegerType(), True),
    StructField("Signal_Strength", FloatType(), True),
    StructField("Latency", FloatType(), True),
    StructField("Required_Bandwidth", FloatType(), True),
    StructField("Resource_Allocation", FloatType(), True),
    StructField("Application_Type", StringType(), True),
    StructField("Allocated_Bandwidth", FloatType(), True),
])

In [None]:
spark_df = spark.createDataFrame(data, schema=schema)

In [None]:
# Encode categorical column
indexer = StringIndexer(inputCol="Application_Type", outputCol="Application_Index")

In [None]:
# Assemble features into a single vector
feature_columns = ["Signal_Strength", "Latency", "Required_Bandwidth", "Resource_Allocation", "Application_Index"]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")

In [None]:
# Initialize Linear Regression model
lr = LinearRegression(featuresCol="features", labelCol="Allocated_Bandwidth")

In [None]:
# Create a pipeline
pipeline = Pipeline(stages=[indexer, assembler, lr])

In [None]:
# Split data into training and testing sets
train_data, test_data = spark_df.randomSplit([0.8, 0.2], seed=42)

In [None]:
# Train the model
model = pipeline.fit(train_data)

In [None]:
# Make predictions
predictions = model.transform(test_data)

In [None]:
# Show predictions
predictions.select("features", "Allocated_Bandwidth", "prediction").show()

+--------------------+-------------------+------------------+
|            features|Allocated_Bandwidth|        prediction|
+--------------------+-------------------+------------------+
|[-75.0,25.0,20.0,...|               18.0|17.352974344514656|
|[-72.0,22.0,12.0,...|                7.0|  9.25509820505565|
+--------------------+-------------------+------------------+



In [None]:
# Evaluate predictions
evaluator = RegressionEvaluator(labelCol="Allocated_Bandwidth", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)

In [None]:
# Calculate R-squared
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})

print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R2): {r2}")

Root Mean Squared Error (RMSE): 1.658931902354863
R-squared (R2): 0.9090229733338604
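
The two metrics above can be reproduced by hand from the two test rows in the predictions table. The plain-Python check below uses the displayed label and prediction values with the standard RMSE and R² formulas.

```python
import math

# (Allocated_Bandwidth, prediction) pairs from the predictions table
rows = [
    (18.0, 17.352974344514656),
    (7.0, 9.25509820505565),
]

labels = [y for y, _ in rows]
mean_label = sum(labels) / len(labels)

ss_res = sum((y - y_hat) ** 2 for y, y_hat in rows)  # residual sum of squares
ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares

rmse = math.sqrt(ss_res / len(rows))
r2 = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.6f}")
print(f"R2:   {r2:.6f}")
```

With only two test rows, both numbers are very sensitive to which rows happen to land in the test split.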
