#Lab 5 : Spark ML for Classification

#Tasks

- Installation and Setup
Installs the PySpark library using pip install pyspark.
Imports necessary modules from PySpark, including SparkSession.

- Spark Session Initialization
Initializes a Spark session.

- Dataset Preparation
Simulates a telecom fraud detection dataset with fields such as:
TransactionID (integer)
TransactionType (string)
CallDuration (float)
TransactionFrequency (integer)
DeviationScore (integer)
Fraudulent (integer, binary: 0 or 1).
Defines a schema for the dataset using StructType and StructField.
Creates a PySpark DataFrame using the simulated data and defined schema.

- Data Display
Uses df.show() to display the initial dataset.

- Data Preprocessing
Imports modules for data transformation: StringIndexer, VectorAssembler, and Pipeline.
Converts the categorical column TransactionType into a numerical index using StringIndexer.
Combines multiple feature columns (TypeIndex, CallDuration, TransactionFrequency, DeviationScore) into a single feature vector using VectorAssembler.

- Pipeline Creation
Creates a PySpark ML pipeline for data preprocessing, combining transformations like string indexing and feature vector assembly.

- Train-Test Split
Splits the data into training and testing sets using the randomSplit function.

- Model Training
Trains a RandomForestClassifier on the preprocessed training data.

- Model Evaluation
Evaluates the trained model on the test dataset.
Calculates accuracy using the MulticlassClassificationEvaluator.

- Predictions
Applies the trained model to the test dataset to generate predictions.
Displays predictions alongside true labels.

- Result Display
Shows the results of predictions and evaluates the model's performance.


**Introduction:**

Apache Spark's MLlib library provides a powerful and scalable platform for machine learning tasks, including classification. PySpark, the Python API for Spark, allows users to leverage this functionality with the familiar Python syntax.  This is particularly valuable for handling large datasets that wouldn't fit into the memory of a single machine.

**Key Concepts:**

* **Resilient Distributed Datasets (RDDs):**  RDDs are Spark's low-level abstraction: fault-tolerant collections of elements partitioned across a cluster and processed in parallel.  The original `pyspark.mllib` API works on RDDs directly, while the `pyspark.ml` API used in this lab works on DataFrames, which are executed on top of RDDs.  DataFrames are now preferred for their schema enforcement and query optimization, but understanding the underlying RDD concept is still useful.
* **DataFrames and Datasets:**  These are higher-level abstractions built on top of RDDs. They offer improved performance, schema enforcement, and optimization opportunities.  The typed Dataset API is available only in Scala and Java, so in PySpark the DataFrame is the recommended way to work with structured data in Spark ML.
* **Pipelines:** MLlib provides pipelines for chaining multiple stages of data transformation and model building into a single workflow.  This keeps the stages modular and reusable and makes the workflow easier to reproduce.
* **Estimators and Transformers:**  MLlib algorithms are implemented as estimators or transformers.  An estimator learns from data: its `fit()` method takes a DataFrame and returns a fitted model (e.g., `LogisticRegression`, `DecisionTreeClassifier`, `StringIndexer`).  A transformer's `transform()` method takes a DataFrame and returns a new DataFrame with added columns (e.g., `VectorAssembler`, or a fitted model appending a `prediction` column).  Every fitted model is itself a transformer.
* **Feature Engineering:**  Feature engineering plays a crucial role in the success of any machine learning model. PySpark provides tools for feature extraction, transformation, and selection, including methods for handling categorical features (one-hot encoding, string indexer), numerical features (scaling, standardization), and feature importance calculation.
* **Model Evaluation:**  Evaluating model performance is essential. PySpark offers various metrics like accuracy, precision, recall, F1-score, AUC-ROC, and others for classification tasks.  Understanding which metric is most relevant for your problem is critical.
* **Hyperparameter Tuning:**  Finding good hyperparameters for your chosen model is crucial.  Spark MLlib has built-in grid search, driven either by k-fold cross-validation (`ParamGridBuilder` with `CrossValidator`) or by a single train-validation split (`TrainValidationSplit`).  Random search is not built in; it has to be implemented by sampling the parameter combinations yourself.
* **Model Persistence:**  Trained models can be saved to disk and loaded back later, eliminating the need to retrain every time.  This enables model reuse and deployment in production environments.
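
The estimator, transformer, and pipeline contract described above can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's implementation: the class names `MeanCenterer`, `MeanCentererModel`, and `MiniPipeline` are hypothetical, and rows are plain dictionaries standing in for DataFrame rows.

```python
class MeanCentererModel:
    """Fitted model: a transformer that subtracts a learned mean."""

    def __init__(self, column, mean):
        self.column, self.mean = column, mean

    def transform(self, rows):
        # Return new rows with an added column; the input rows are left untouched.
        return [{**r, self.column + "_centered": r[self.column] - self.mean} for r in rows]


class MeanCenterer:
    """Estimator: fit() learns the column mean and returns a fitted model."""

    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        values = [r[self.column] for r in rows]
        return MeanCentererModel(self.column, sum(values) / len(values))


class MiniPipeline:
    """Chains stages: each estimator is fitted, then applied, in order."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators expose fit(); plain transformers are used as they are.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return fitted


train = [{"CallDuration": 2.0}, {"CallDuration": 4.0}, {"CallDuration": 6.0}]
(model,) = MiniPipeline([MeanCenterer("CallDuration")]).fit(train)

# The mean learned from the training rows is reused on unseen rows.
scored = model.transform([{"CallDuration": 5.0}])
print(model.mean, scored)
```

The point of the split is that anything learned from data (here, the mean) is captured once in `fit()` and then applied unchanged to new data in `transform()`.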


**Classification Algorithms in PySpark MLlib:**

Several common classification algorithms are available in PySpark's MLlib:

* **Logistic Regression:**  A linear model that predicts the probability of a categorical outcome.  It's widely used due to its simplicity and interpretability.
* **Decision Trees:** Tree-based models that partition the data based on features to create a hierarchy of decisions leading to a predicted class.  They are relatively easy to interpret but can be prone to overfitting.
* **Random Forests:** An ensemble of decision trees that improve predictive accuracy and reduce overfitting. They are highly versatile and robust.
* **Gradient-Boosted Trees (GBTs):** Another ensemble method that sequentially builds trees, where each tree corrects the errors of the previous ones.  GBTs are often among the top-performing classification algorithms.  In Spark, `GBTClassifier` supports binary classification only.
* **Support Vector Machines (SVMs):**  Find the hyperplane that separates the classes with the largest margin.  They work well in high-dimensional spaces but can be computationally expensive.  Spark provides a linear SVM (`LinearSVC`) for binary classification.
* **Naive Bayes:**  A probabilistic classifier based on Bayes' theorem with the assumption of feature independence.  It's simple and efficient, particularly suitable for text classification.
* **Multilayer Perceptron (MLP):**  A neural network architecture that can model complex non-linear relationships.  It can be very powerful but requires careful tuning and often substantial computational resources.
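
As a quick reference, the sketch below maps each algorithm in the list to its class name in `pyspark.ml.classification` and records whether the Spark implementation accepts more than two classes. The dictionary and the helper `candidates` are illustrative names, not part of Spark, and the binary-only restriction for `GBTClassifier` and `LinearSVC` reflects Spark 3.x, so it is worth re-checking against the version you run.

```python
# Algorithm name -> (class in pyspark.ml.classification, accepts multiclass labels)
SPARK_CLASSIFIERS = {
    "Logistic Regression": ("LogisticRegression", True),
    "Decision Tree": ("DecisionTreeClassifier", True),
    "Random Forest": ("RandomForestClassifier", True),
    "Gradient-Boosted Trees": ("GBTClassifier", False),
    "Linear SVM": ("LinearSVC", False),
    "Naive Bayes": ("NaiveBayes", True),
    "Multilayer Perceptron": ("MultilayerPerceptronClassifier", True),
}


def candidates(num_classes):
    """Return the class names usable for a label with `num_classes` distinct values."""
    return [cls for cls, multiclass in SPARK_CLASSIFIERS.values()
            if multiclass or num_classes == 2]


print(candidates(2))  # all seven classes
print(candidates(3))  # the five multiclass-capable classes
```

For a multiclass problem, a binary-only classifier can still be used by wrapping it in `OneVsRest` from the same module.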


**Workflow:**

A typical workflow in PySpark for classification involves:

1. **Data Loading and Preparation:** Loading the data into a Spark DataFrame or Dataset.
2. **Data Cleaning and Preprocessing:**  Handling missing values, outliers, and converting data types as needed.
3. **Feature Engineering:** Creating or transforming features to improve model performance.
4. **Feature Selection:**  Choosing relevant features to include in the model.
5. **Splitting Data:** Partitioning the data into training and testing sets.
6. **Model Training:** Selecting and training a classification model.
7. **Hyperparameter Tuning:** Optimizing the model's hyperparameters.
8. **Model Evaluation:** Evaluating the model's performance using metrics relevant to the task.
9. **Deployment:** Saving the selected model and deploying it.

Understanding these concepts and tools will enable you to effectively use PySpark for building and deploying robust classification models for large datasets.

Let us look at a scenario where we build an ML classification model using Spark ML for fraud detection in telecom transactions.


##Case Study - fraud detection in telecom transactions

This scenario focuses on classifying telecom transactions as either fraudulent or non-fraudulent based on features such as transaction type (domestic or international), call duration, frequency of transactions, and how far the activity deviates from the user's typical patterns.

The dataset contains:

- TransactionType: Indicates whether the transaction is domestic or international.

- CallDuration: Duration of the call in minutes.

- TransactionFrequency: Number of transactions in the past week.

- DeviationScore: A computed score indicating how much the transaction deviates from typical behavior.

- Fraudulent: The target label (1 for fraudulent, 0 for non-fraudulent).

In [1]:
#Install and Configure PySpark
!pip install pyspark
from pyspark.sql import SparkSession



In [2]:
#Start Spark session
spark = SparkSession.builder.appName("TelecomFraudDetection").getOrCreate()

In [3]:
#Load and Simulate Telecom Fraud Dataset
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

In [4]:
#Simulate a fraud detection dataset
data = [
    (1, "domestic", 5.5, 10, 3, 0),
    (2, "international", 15.2, 50, 20, 1),
    (3, "domestic", 3.1, 5, 2, 0),
    (4, "domestic", 25.4, 100, 50, 1),
    (5, "international", 12.3, 30, 10, 0),
    (6, "domestic", 2.0, 1, 1, 0),
    (7, "international", 50.5, 200, 100, 1),
]
schema = StructType([
    StructField("TransactionID", IntegerType(), True),
    StructField("TransactionType", StringType(), True),
    StructField("CallDuration", FloatType(), True),
    StructField("TransactionFrequency", IntegerType(), True),
    StructField("DeviationScore", IntegerType(), True),
    StructField("Fraudulent", IntegerType(), True),
])
df = spark.createDataFrame(data, schema=schema)

In [5]:
#Display data
df.show()

+-------------+---------------+------------+--------------------+--------------+----------+
|TransactionID|TransactionType|CallDuration|TransactionFrequency|DeviationScore|Fraudulent|
+-------------+---------------+------------+--------------------+--------------+----------+
|            1|       domestic|         5.5|                  10|             3|         0|
|            2|  international|        15.2|                  50|            20|         1|
|            3|       domestic|         3.1|                   5|             2|         0|
|            4|       domestic|        25.4|                 100|            50|         1|
|            5|  international|        12.3|                  30|            10|         0|
|            6|       domestic|         2.0|                   1|             1|         0|
|            7|  international|        50.5|                 200|           100|         1|
+-------------+---------------+------------+--------------------+--------------+----------+

In [6]:
#Data Preprocessing
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline

In [7]:
#Convert categorical columns to numerical
type_indexer = StringIndexer(inputCol="TransactionType", outputCol="TypeIndex")

#Combine features into a single vector
assembler = VectorAssembler(
    inputCols=["TypeIndex", "CallDuration", "TransactionFrequency", "DeviationScore"],
    outputCol="features"
)

In [8]:
#Create a pipeline for preprocessing
pipeline = Pipeline(stages=[type_indexer, assembler])
preprocessed_data = pipeline.fit(df).transform(df)

#Display preprocessed data
preprocessed_data.select("features", "Fraudulent").show()

+--------------------+----------+
|            features|Fraudulent|
+--------------------+----------+
|  [0.0,5.5,10.0,3.0]|         0|
|[1.0,15.199999809...|         1|
|[0.0,3.0999999046...|         0|
|[0.0,25.399999618...|         1|
|[1.0,12.300000190...|         0|
|   [0.0,2.0,1.0,1.0]|         0|
|[1.0,50.5,200.0,1...|         1|
+--------------------+----------+
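
The first element of each feature vector is `TypeIndex`. By default `StringIndexer` assigns indices by descending label frequency, so the most common value gets 0.0. The plain-Python sketch below mimics that ordering for the seven simulated rows. It is an illustration rather than Spark's code, and the alphabetical tie-break it applies is an assumption based on Spark 3.x behaviour.

```python
from collections import Counter

# TransactionType column of the seven simulated rows, in order.
transaction_types = [
    "domestic", "international", "domestic", "domestic",
    "international", "domestic", "international",
]


def fit_string_index(values):
    """Map each label to an index: most frequent first, ties alphabetical."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


mapping = fit_string_index(transaction_types)
type_index = [mapping[v] for v in transaction_types]
print(mapping)     # {'domestic': 0.0, 'international': 1.0}
print(type_index)  # [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0]
```

"domestic" appears four times and "international" three times, which is why domestic rows start with 0.0 in the `features` column above.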



In [9]:
#Train a Classification Model
from pyspark.ml.classification import RandomForestClassifier

In [10]:
#Split data into training and test sets
train, test = preprocessed_data.randomSplit([0.8, 0.2], seed=123)
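
`randomSplit` assigns rows to the splits at random, so the weights are expected proportions rather than exact sizes. With only seven rows the test set can end up with one or two rows, or none. The following is a plain-Python sketch of the idea, assuming independent per-row draws. It uses Python's `random` module, so it will not reproduce the rows Spark selects for `seed=123`, and `bernoulli_split` is a hypothetical helper name.

```python
import random


def bernoulli_split(rows, train_weight, seed):
    """Assign each row to train with probability `train_weight`, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(1, 8))  # stand-ins for the seven TransactionIDs
for seed in range(5):
    train_rows, test_rows = bernoulli_split(rows, 0.8, seed)
    print(f"seed={seed}: {len(train_rows)} train rows, {len(test_rows)} test rows")
```

On a dataset this small, the split sizes and the class mix of the test set change noticeably from seed to seed, which is worth keeping in mind when reading the metrics that follow.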

In [11]:
#Initialize and train the Random Forest classifier
rf = RandomForestClassifier(featuresCol="features", labelCol="Fraudulent")
model = rf.fit(train)

In [12]:
#Evaluate the Model
predictions = model.transform(test)
predictions.select("features", "Fraudulent", "prediction").show()

+--------------------+----------+----------+
|            features|Fraudulent|prediction|
+--------------------+----------+----------+
|[0.0,3.0999999046...|         0|       0.0|
|   [0.0,2.0,1.0,1.0]|         0|       0.0|
+--------------------+----------+----------+



In [13]:
#Evaluate using MulticlassClassificationEvaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="Fraudulent", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)

print(f"Model Accuracy: {accuracy:.2f}")

Model Accuracy: 1.00
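
The test set above contains two rows, both non-fraudulent, so an accuracy of 1.00 says nothing about how well fraud is detected. Counting the confusion-matrix cells for the two recorded predictions makes this visible. The helpers below are a plain-Python sketch with hypothetical names, not a Spark API.

```python
def confusion_counts(labels, predictions, positive=1):
    """Count true/false positives and negatives for a binary label."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    return tp, fp, fn, tn


def safe_ratio(numerator, denominator):
    """Return the ratio, or None when the denominator is zero (undefined)."""
    return numerator / denominator if denominator else None


# The two test rows from the prediction table: Fraudulent and prediction.
labels = [0, 0]
predictions = [0, 0]

tp, fp, fn, tn = confusion_counts(labels, predictions)
accuracy = safe_ratio(tp + tn, len(labels))
recall = safe_ratio(tp, tp + fn)       # share of fraud cases that were caught
precision = safe_ratio(tp, tp + fp)    # share of fraud alerts that were correct
print(accuracy, recall, precision)     # 1.0 None None
```

Recall and precision for the fraud class are undefined here because the test set has no fraudulent rows and the model raised no fraud alerts.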


# Project on Classification using Spark ML - Network Slicing Recognition

The objective is to build a machine learning classification model to predict the type of network slice (slice Type) based on the provided features. Accurate classification helps in:

- Automating network resource allocation.
- Enhancing the efficiency of service-specific configurations.
- Supporting real-time decision-making for optimizing LTE/5G network performance.

#Steps to Build the Classification Model

- Data Loading and Inspection:
Load the dataset into a PySpark DataFrame and inspect for any anomalies or issues in the data.

- Data Preprocessing:
Column Renaming: Resolve naming issues (e.g., the dot in the column name "Industry 4.0", which Spark would otherwise read as nested-field access).
Handle Missing Values: Impute or remove rows with null or invalid data.
Feature Assembly: Combine all relevant features into a single vector using VectorAssembler.
Label Encoding: Convert the target variable (slice Type) into numerical labels using StringIndexer.

- Train-Test Split:
Partition the dataset into training (80%) and testing (20%) subsets to evaluate model performance.

- Model Training:
Train a classification model using Random Forest, which is robust and handles numerical and categorical features well.

- Model Evaluation:
Use metrics like accuracy to evaluate the model’s performance on the test set using MulticlassClassificationEvaluator.

In [25]:
# Import required PySpark libraries
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [26]:
# Start Spark session
spark = SparkSession.builder.appName("NetworkSlicingClassification").getOrCreate()

In [27]:
# Load the dataset
file_path = "/content/network_slicing_recognition.csv"
data = spark.read.csv(file_path, header=True, inferSchema=True)

data.show(5)

+---------------+----+----------------+------------+---+------+---+-------+------------+----------+------------+-----------+-------------+-----------------+--------------------+----------+----------+
|LTE/5g Category|Time|Packet Loss Rate|Packet delay|IoT|LTE/5G|GBR|Non-GBR|AR/VR/Gaming|Healthcare|Industry 4.0|IoT Devices|Public Safety|Smart City & Home|Smart Transportation|Smartphone|slice Type|
+---------------+----+----------------+------------+---+------+---+-------+------------+----------+------------+-----------+-------------+-----------------+--------------------+----------+----------+
|             14|   0|          1.0E-6|          10|  1|     0|  0|      1|           0|         0|           0|          0|            1|                0|                   0|         0|         3|
|             18|  20|           0.001|         100|  0|     1|  1|      0|           1|         0|           0|          0|            0|                0|                   0|         0|         1|


In [42]:
# Preprocessing
# Rename problematic columns to avoid errors
data = data.withColumnRenamed("Industry 4.0", "Industry_4_0")

# Assemble features into a single vector
feature_columns = [
    "Time", "Packet Loss Rate", "Packet delay", "IoT", "LTE/5G", "GBR", "Non-GBR",
    "AR/VR/Gaming", "Healthcare", "Industry_4_0", "IoT Devices", "Public Safety",
    "Smart City & Home", "Smart Transportation", "Smartphone"
]

assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
data = assembler.transform(data)
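
Renaming one column by hand works here, but the same fix can be applied to every column. A dot in a column name is the usual source of trouble because Spark reads it as nested-field access unless the name is wrapped in backticks. The sketch below derives safe names for all headers in the dataset. `safe_name` is a hypothetical helper, and the commented loop shows one way the mapping might be applied.

```python
import re

# Column headers as they appear in the CSV file.
columns = [
    "LTE/5g Category", "Time", "Packet Loss Rate", "Packet delay", "IoT",
    "LTE/5G", "GBR", "Non-GBR", "AR/VR/Gaming", "Healthcare", "Industry 4.0",
    "IoT Devices", "Public Safety", "Smart City & Home",
    "Smart Transportation", "Smartphone", "slice Type",
]


def safe_name(name):
    """Replace every run of non-alphanumeric characters with one underscore."""
    return re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_")


renames = {old: safe_name(old) for old in columns}

# Distinct headers must stay distinct after cleaning.
assert len(set(renames.values())) == len(columns)

print(renames["Industry 4.0"])       # Industry_4_0
print(renames["Smart City & Home"])  # Smart_City_Home

# With a DataFrame `data`, the mapping could be applied like this:
# for old, new in renames.items():
#     data = data.withColumnRenamed(old, new)
```

If every column were renamed this way, `feature_columns` and the `slice Type` label column would need the new names as well.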


In [43]:
# Convert target variable to indexed labels
label_indexer = StringIndexer(inputCol="slice Type", outputCol="label")
data = label_indexer.fit(data).transform(data)

In [44]:
# Split data into training and testing sets
train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)

In [45]:
# Train a Random Forest Classifier
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=20, maxDepth=5)
model = rf.fit(train_data)

In [46]:
# Make predictions
test_predictions = model.transform(test_data)
test_predictions.select("features", "label", "prediction").show(5)

+--------------------+-----+----------+
|            features|label|prediction|
+--------------------+-----+----------+
|(15,[1,2,3,6,13],...|  2.0|       2.0|
|(15,[1,2,3,6,13],...|  2.0|       2.0|
|(15,[1,2,3,6,11],...|  2.0|       2.0|
|(15,[1,2,3,6,8],[...|  2.0|       2.0|
|(15,[1,2,4,5,14],...|  0.0|       0.0|
+--------------------+-----+----------+
only showing top 5 rows
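
Because `StringIndexer` orders labels by frequency, the `label` and `prediction` values above are indices, not the original `slice Type` values: index 2.0 does not necessarily mean slice type 2. A fitted `StringIndexerModel` exposes the original values, in index order, through its `labels` attribute. The sketch below shows the lookup in plain Python. The `labels` list is a made-up example, since the real ordering depends on the class frequencies in the dataset, and `decode` is a hypothetical helper.

```python
# Hypothetical content of a fitted indexer's `labels` attribute (index order).
# The real ordering depends on how often each slice type occurs in the data.
labels = ["1", "2", "3"]


def decode(prediction, labels):
    """Map a predicted index (a float such as 2.0) back to its original label."""
    return labels[int(prediction)]


predictions = [2.0, 2.0, 2.0, 2.0, 0.0]  # prediction column from the table above
decoded = [decode(p, labels) for p in predictions]
print(decoded)  # ['3', '3', '3', '3', '1']
```

To do this in the notebook, the fitted indexer would have to be kept in a variable instead of being chained directly into `transform`. Spark also offers the `IndexToString` transformer in `pyspark.ml.feature` for the same purpose.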



In [47]:
# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(test_predictions)
print(f"Model Accuracy: {accuracy:.2f}")

Model Accuracy: 1.00


In [48]:
# Stop the Spark session
spark.stop()

## Practice Case Study: customer churn prediction in telecom

This case study focuses on predicting customer churn in a telecom company. Customer churn, the rate at which customers stop doing business with a company, is a critical metric for businesses. Predicting which customers are likely to churn allows companies to implement retention strategies and reduce customer loss.

The dataset used in this analysis includes various customer features potentially related to churn. These features can be broadly categorized into:

- Demographics: Customer age, gender
- Account Information: Tenure (how long they've been a customer), contract type, monthly charges, total charges
- Service Usage: Number of phone lines, multiple lines, internet service, online security, tech support, streaming services, etc.
- Payment Information: Payment method


Workflow:

- Data Loading and Preparation: Load the simulated data into a Spark DataFrame.
- Data Cleaning and Preprocessing: Handle missing values (if any) and convert categorical features to numerical representations using StringIndexer.
- Feature Engineering: Create a feature vector using VectorAssembler.
- Feature Scaling: Standardize the assembled feature vector using StandardScaler.
- Model Selection and Training: Train multiple classification models (Logistic Regression, Random Forest, Gradient-Boosted Trees) to predict churn.
- Hyperparameter Tuning: Use CrossValidator to optimize model hyperparameters.
- Model Evaluation: Evaluate the performance of the models using suitable metrics (accuracy, precision, recall, F1-score, AUC-ROC).
- Model Selection: Choose the best performing model based on the evaluation.
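
To make the cost of the tuning step concrete: a parameter grid is the Cartesian product of the candidate values, and `CrossValidator` fits every combination once per fold before refitting the best combination on the full training set. The following plain-Python sketch of that bookkeeping uses the same candidate values and fold count as the Random Forest grid in the code further down. It only counts fits and does not train anything.

```python
from itertools import product

grid = {"numTrees": [10, 20], "maxDepth": [5, 10]}
num_folds = 2

# Cartesian product of the candidate values, one dict per combination.
param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]
for params in param_maps:
    print(params)

# One fit per combination per fold, plus one final refit of the best combination.
total_fits = len(param_maps) * num_folds + 1
print(f"{len(param_maps)} combinations x {num_folds} folds + 1 refit = {total_fits} fits")
```

Adding a third parameter with three candidate values would triple the number of combinations, so grids grow quickly.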



```
data = [
    (1, "Male", 34, 2, "Month-to-month", 65.5, 1300.0, 0),
    (2, "Female", 45, 12, "One year", 85.2, 10224.0, 0),
    (3, "Male", 22, 1, "Month-to-month", 55.1, 551.0, 1),
    (4, "Female", 58, 7, "Two year", 95.4, 6678.0, 0),
    (5, "Male", 30, 3, "Month-to-month", 70.3, 2109.0, 1)
]
```



In [14]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType, DoubleType
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

In [15]:
#Spark Configuration:
spark = SparkSession.builder.appName("TelecomChurnPrediction").getOrCreate()

data = [
    (1, "Male", 34, 2, "Month-to-month", 65.5, 1300.0, 0),
    (2, "Female", 45, 12, "One year", 85.2, 10224.0, 0),
    (3, "Male", 22, 1, "Month-to-month", 55.1, 551.0, 1),
    (4, "Female", 58, 7, "Two year", 95.4, 6678.0, 0),
    (5, "Male", 30, 3, "Month-to-month", 70.3, 2109.0, 1)
]

schema = StructType([
    StructField("CustomerID", IntegerType(), True),
    StructField("Gender", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Tenure", IntegerType(), True),
    StructField("Contract", StringType(), True),
    StructField("MonthlyCharges", DoubleType(), True),
    StructField("TotalCharges", DoubleType(), True),
    StructField("Churn", IntegerType(), True),
])

In [16]:
df = spark.createDataFrame(data, schema=schema)
df.show()

+----------+------+---+------+--------------+--------------+------------+-----+
|CustomerID|Gender|Age|Tenure|      Contract|MonthlyCharges|TotalCharges|Churn|
+----------+------+---+------+--------------+--------------+------------+-----+
|         1|  Male| 34|     2|Month-to-month|          65.5|      1300.0|    0|
|         2|Female| 45|    12|      One year|          85.2|     10224.0|    0|
|         3|  Male| 22|     1|Month-to-month|          55.1|       551.0|    1|
|         4|Female| 58|     7|      Two year|          95.4|      6678.0|    0|
|         5|  Male| 30|     3|Month-to-month|          70.3|      2109.0|    1|
+----------+------+---+------+--------------+--------------+------------+-----+



In [17]:
#Data Preprocessing:
indexers = [StringIndexer(inputCol=col, outputCol=col + "Index") for col in ["Gender", "Contract"]]
assembler = VectorAssembler(inputCols=["GenderIndex", "Age", "Tenure", "ContractIndex", "MonthlyCharges", "TotalCharges"], outputCol="features")
pipeline = Pipeline(stages=indexers + [assembler])
df = pipeline.fit(df).transform(df)
df.show()

+----------+------+---+------+--------------+--------------+------------+-----+-----------+-------------+--------------------+
|CustomerID|Gender|Age|Tenure|      Contract|MonthlyCharges|TotalCharges|Churn|GenderIndex|ContractIndex|            features|
+----------+------+---+------+--------------+--------------+------------+-----+-----------+-------------+--------------------+
|         1|  Male| 34|     2|Month-to-month|          65.5|      1300.0|    0|        0.0|          0.0|[0.0,34.0,2.0,0.0...|
|         2|Female| 45|    12|      One year|          85.2|     10224.0|    0|        1.0|          1.0|[1.0,45.0,12.0,1....|
|         3|  Male| 22|     1|Month-to-month|          55.1|       551.0|    1|        0.0|          0.0|[0.0,22.0,1.0,0.0...|
|         4|Female| 58|     7|      Two year|          95.4|      6678.0|    0|        1.0|          2.0|[1.0,58.0,7.0,2.0...|
|         5|  Male| 30|     3|Month-to-month|          70.3|      2109.0|    1|        0.0|          0.0|[0.0,3

In [18]:
#Feature Scaling:
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=True)
scalerModel = scaler.fit(df)
df = scalerModel.transform(df)
df.show()

+----------+------+---+------+--------------+--------------+------------+-----+-----------+-------------+--------------------+--------------------+
|CustomerID|Gender|Age|Tenure|      Contract|MonthlyCharges|TotalCharges|Churn|GenderIndex|ContractIndex|            features|      scaledFeatures|
+----------+------+---+------+--------------+--------------+------------+-----+-----------+-------------+--------------------+--------------------+
|         1|  Male| 34|     2|Month-to-month|          65.5|      1300.0|    0|        0.0|          0.0|[0.0,34.0,2.0,0.0...|[-0.7302967433402...|
|         2|Female| 45|    12|      One year|          85.2|     10224.0|    0|        1.0|          1.0|[1.0,45.0,12.0,1....|[1.09544511501033...|
|         3|  Male| 22|     1|Month-to-month|          55.1|       551.0|    1|        0.0|          0.0|[0.0,22.0,1.0,0.0...|[-0.7302967433402...|
|         4|Female| 58|     7|      Two year|          95.4|      6678.0|    0|        1.0|          2.0|[1.0,58
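
The first entry of each `scaledFeatures` vector can be checked by hand. `StandardScaler` with `withMean=True` and `withStd=True` subtracts the column mean and divides by the sample standard deviation (denominator n - 1). The following is a plain-Python check for the `GenderIndex` column of the five rows:

```python
from statistics import mean, stdev

gender_index = [0.0, 1.0, 0.0, 1.0, 0.0]  # GenderIndex for customers 1 to 5

mu = mean(gender_index)      # 0.4
sigma = stdev(gender_index)  # sample standard deviation, sqrt(0.3)

scaled = [(x - mu) / sigma for x in gender_index]
print([round(z, 4) for z in scaled])
# [-0.7303, 1.0954, -0.7303, 1.0954, -0.7303]
```

These values match the leading entries of `scaledFeatures` in the output above (-0.7302967433402... for customer 1 and 1.09544511501033... for customer 2).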

In [22]:
#Train-Test Split:
train, test = df.randomSplit([0.9, 0.1], seed=42)

In [23]:
#Model Evaluation:
# The metric is area under the ROC curve (AUC), not accuracy.
# rawPredictionCol="prediction" feeds the hard 0/1 predictions to the evaluator.
evaluator = BinaryClassificationEvaluator(labelCol="Churn", rawPredictionCol="prediction", metricName="areaUnderROC")

def evaluate_model(model, test_data):
    """Score test_data with the model and print the area under the ROC curve."""
    predictions = model.transform(test_data)
    auc = evaluator.evaluate(predictions)
    print(f"Area under ROC: {auc:.2f}")

In [24]:

# Initialize and train Logistic Regression model
lr = LogisticRegression(featuresCol="scaledFeatures", labelCol="Churn")
lr_model = lr.fit(train)

# Evaluate Logistic Regression
print("Logistic Regression:")
evaluate_model(lr_model, test)

# Initialize and train Random Forest Classifier
rf = RandomForestClassifier(featuresCol="scaledFeatures", labelCol="Churn")
rf_model = rf.fit(train)

# Evaluate Random Forest
print("\nRandom Forest:")
evaluate_model(rf_model, test)

# Initialize and train Gradient-Boosted Trees Classifier
gbt = GBTClassifier(featuresCol="scaledFeatures", labelCol="Churn", maxIter=10)
gbt_model = gbt.fit(train)

# Evaluate Gradient-Boosted Trees
print("\nGradient-Boosted Trees:")
evaluate_model(gbt_model, test)


#Hyperparameter Tuning using CrossValidator (example with Random Forest)
paramGrid = (ParamGridBuilder()
             .addGrid(rf.numTrees, [10, 20])
             .addGrid(rf.maxDepth, [5, 10])
             .build())

crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=2)  # Adjust numFolds as needed

cvModel = crossval.fit(train)
print("\nRandom Forest with Cross Validation:")
evaluate_model(cvModel.bestModel, test)

Logistic Regression:
Area under ROC: 0.50

Random Forest:
Area under ROC: 0.50

Gradient-Boosted Trees:
Area under ROC: 0.50

Random Forest with Cross Validation:
Area under ROC: 0.50
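
The `BinaryClassificationEvaluator` in this case study is given the `prediction` column, which holds hard 0/1 labels, as its raw-prediction input. Area under the ROC curve measures how well scores rank positives above negatives, so it loses information when the scores are already thresholded. The sketch below computes AUC in plain Python as the share of positive/negative pairs ranked correctly, with ties counting one half. The labels and scores are invented for illustration and are not taken from the churn data.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        return None  # undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]      # hypothetical model scores
hard = [1.0 if s >= 0.5 else 0.0 for s in scores]  # thresholded predictions

print(round(auc(labels, scores), 3))  # 0.889
print(round(auc(labels, hard), 3))    # 0.833
```

Leaving `rawPredictionCol` at its default, `rawPrediction`, would let the evaluator work with the continuous scores that the classifiers output.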
