<a href="https://colab.research.google.com/github/amien1410/colab-notebooks/blob/main/Colab_Pyspark_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## PySpark Logistic Regression Tutorial: Predicting Customer Churn 📊

Welcome to this tutorial! We're going to dive into predicting customer churn using **PySpark** and **Logistic Regression**. 🚀

Customer churn is a big deal for businesses. It means customers are leaving. By predicting which customers might churn, companies can take action to keep them. 🎯

We'll be using a dataset from Kaggle. It contains information about customers, including:

*   **Names**: The name of the customer.
*   **Age**: The customer's age.
*   **Total\_Purchase**: The total amount the customer has spent.
*   **Account\_Manager**: Whether the customer has an account manager (1 for yes, 0 for no).
*   **Years**: The number of years the customer has been with the company.
*   **Num\_Sites**: The number of websites the customer uses.
*   **Onboard\_date**: The date the customer joined.
*   **Location**: The customer's location.
*   **Company**: The customer's company.
*   **Churn**: Whether the customer churned (1 for yes, 0 for no). This is what we'll try to predict! 💪

Get ready to learn how to use PySpark for this exciting machine learning task! 🎉

In [1]:
#@title Download the dataset from Kaggle

from google.colab import drive
drive.mount('/content/drive')

!pip install kaggle
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'
!kaggle datasets download -d brycepeakega/generalassemblywelcome5k
!unzip -q "/content/generalassemblywelcome5k.zip"

Mounted at /content/drive
Dataset URL: https://www.kaggle.com/datasets/brycepeakega/generalassemblywelcome5k
License(s): unknown
Downloading generalassemblywelcome5k.zip to /content
  0% 0.00/5.99M [00:00<?, ?B/s]
100% 5.99M/5.99M [00:00<00:00, 550MB/s]


In this cell, we're getting set up to work with some data. 🚀

*   `from google.colab import drive`: This line imports a tool to connect Google Colab to your Google Drive. 🤝
*   `drive.mount('/content/drive')`: This line actually makes the connection. Now, Colab can see and use files stored in your Google Drive. 📂
*   `!pip install kaggle`: We're installing the Kaggle library. Kaggle is a platform for data science competitions and datasets. 🏆
*   `import os`: This imports the 'os' module, which helps us interact with the computer's operating system. 💻
*   `os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'`: This tells the Kaggle CLI where to look for your API credentials (a `kaggle.json` file). Here it points at a `kaggle` folder inside your Google Drive. 🔑
*   `!kaggle datasets download -d brycepeakega/generalassemblywelcome5k`: This command downloads the dataset used in this tutorial from Kaggle as `generalassemblywelcome5k.zip`. 📥
*   `!unzip -q "/content/generalassemblywelcome5k.zip"`: The dataset arrives as a zip file. This line unzips it quietly (`-q`) so we can access the data inside. 📦
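The download step only works if the Kaggle CLI can find a `kaggle.json` token in the configured folder. Here is a minimal sketch of a check you could run first. The helper name `has_kaggle_token` is hypothetical (it is not part of the Kaggle library), and the fallback path simply mirrors the one used in the cell above.

```python
import os
from pathlib import Path


def has_kaggle_token(config_dir: str) -> bool:
    """Return True if a kaggle.json file exists in config_dir (hypothetical helper)."""
    return (Path(config_dir) / "kaggle.json").is_file()


# Use the configured directory if set, otherwise the path from the cell above
config_dir = os.environ.get("KAGGLE_CONFIG_DIR", "/content/drive/MyDrive/kaggle")
if not has_kaggle_token(config_dir):
    print(f"No kaggle.json found in {config_dir}; the download step would likely fail.")
```

If the message appears, upload your `kaggle.json` to that Drive folder before running the download cell.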

In [2]:
#@title Import the libraries, start Spark Session and load the dataset

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline

# Start Spark session
spark = SparkSession.builder \
    .appName("ChurnPredictionLogisticRegression") \
    .getOrCreate()

# Load data
df = spark.read.csv("/content/customer_churn.csv", header=True, inferSchema=True)

# Print schema and preview data
df.printSchema()
df.show(5)

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)

+----------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|           Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|             Company|Churn|
+----------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|Cameron Williams|42.0|       11066.8|              0| 7.22|      8.0|2013-08-30 07:00:40|10265 Elizabeth M...|          Harvey LLC|    1|
|   Kevin Mueller|41.0|      1

Let's break down the code you just ran: 👇

*   `from pyspark.sql import SparkSession`: This line imports the necessary class to create a Spark session. Think of a Spark session as your main entry point to using Spark's functionality. 🚪
*   `from pyspark.ml.feature import StringIndexer, VectorAssembler`: We're importing tools for preparing our data. `StringIndexer` helps convert text categories into numbers, and `VectorAssembler` combines different columns into a single feature vector that machine learning models can use. 🛠️
*   `from pyspark.ml.classification import LogisticRegression`: This imports the Logistic Regression algorithm from Spark's machine learning library. This is the model we'll use for prediction. 🧠
*   `from pyspark.ml.evaluation import BinaryClassificationEvaluator`: This imports a tool to evaluate how well our binary classification model (like predicting churn, which is either yes or no) performs. ✅
*   `from pyspark.ml import Pipeline`: This imports the `Pipeline` class, which allows us to chain multiple data processing and machine learning steps together. This makes our workflow organized and repeatable. 🏗️
*   `spark = SparkSession.builder.appName("ChurnPredictionLogisticRegression").getOrCreate()`: This is where we create or get an existing Spark session. We give it a name ("ChurnPredictionLogisticRegression") so we can identify it. ✨
*   `df = spark.read.csv("/content/customer_churn.csv", header=True, inferSchema=True)`: This line loads our data from the CSV file `/content/customer_churn.csv` into a Spark DataFrame. `header=True` tells Spark that the first row contains the column names, and `inferSchema=True` tells Spark to scan the data and work out the data type of each column instead of reading everything as strings. 📝
*   `df.printSchema()`: This displays the structure of our DataFrame, showing the column names and their inferred data types. It's like looking at the blueprint of our data. 🗺️
*   `df.show(5)`: This shows the first 5 rows of our DataFrame. It's a quick way to peek at the actual data. 👀

In [10]:
#@title Select features and target, assemble features into vector

# Numeric columns used directly as features (Names, Onboard_date and Company are left out)
feature_cols = ['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites']

# Optional: Index categorical 'Location' (if you think it adds signal)
location_indexer = StringIndexer(inputCol='Location', outputCol='Location_indexed', handleInvalid='keep')

# Add the indexed location to feature columns
feature_cols.append('Location_indexed')

# Assemble features into a single vector column
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')

Let's break down the data preprocessing steps: 👇

*   `feature_cols = ['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites']`: This list names the numeric columns we feed to the model as-is. 'Names', 'Onboard\_date', and 'Company' are left out, and 'Churn' is excluded because it is the target. 📋
*   `location_indexer = StringIndexer(inputCol='Location', outputCol='Location_indexed', handleInvalid='keep')`: This creates a `StringIndexer` that maps each distinct 'Location' string to a numeric index in a new column called 'Location\_indexed'. `handleInvalid='keep'` puts values that were not seen during fitting into one extra bucket instead of raising an error. 🔤➡️🔢
*   `feature_cols.append('Location_indexed')`: This adds the indexed location to our feature list. Keep in mind that the model will treat this index as an ordinary number, so this step is optional and may or may not add signal. 🔄
*   `assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')`: This creates a `VectorAssembler`. It takes all the columns in `feature_cols` and combines them into a single vector column named 'features'. Logistic Regression in PySpark expects the input features in this vector format. 💪

We don't need a `StringIndexer` for the target: the schema shows that 'Churn' is already an integer column holding 0 or 1. 🔢
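To build intuition for what `StringIndexer` does, here is a plain-Python sketch of its default behaviour as I understand it: the most frequent value gets index 0, ties are broken alphabetically, and with `handleInvalid='keep'` unseen values share one extra index. The function names and the toy city values are hypothetical, and this is an illustration rather than Spark's actual implementation.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


def transform_keep(mapping, value):
    """Look up a value; unseen values go to one extra bucket, like handleInvalid='keep'."""
    return mapping.get(value, float(len(mapping)))


# Hypothetical toy values standing in for the 'Location' column
train_locations = ["Austin", "Boston", "Austin", "Chicago", "Boston", "Austin"]
mapping = fit_string_index(train_locations)
print(mapping)                            # {'Austin': 0.0, 'Boston': 1.0, 'Chicago': 2.0}
print(transform_keep(mapping, "Denver"))  # 3.0 (the extra "unseen" bucket)
```

Notice that the indices only reflect frequency order. They carry no real numeric meaning, which is why indexed categories are often one-hot encoded before being used in a linear model.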

In [12]:
#@title Split the dataset into training and testing data

train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)

Let's break down the data splitting process: 👇

*   `train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)`: This line takes our `df` DataFrame and randomly splits it into two parts: `train_data` (roughly 80% of the rows) and `test_data` (roughly 20%). The split is random per row, so the proportions are approximate rather than exact. The `seed=42` makes the split repeatable when you rerun the same code on the same data, which is good for reproducibility. 🎲

We now have two separate DataFrames: one for training our model and one for testing its performance. 🛠️✅
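Why are the split sizes only approximate? The plain-Python sketch below mimics the idea of a seeded random split: every row gets its own random draw, so the sizes land near the requested fractions without hitting them exactly. The function `random_split` is a hypothetical stand-in for illustration, not Spark's implementation.

```python
import random


def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test with an independent, seeded random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))
train, test = random_split(rows)
print(len(train), len(test))  # close to 800 and 200, but usually not exact

# Same seed, same split
train_again, _ = random_split(rows)
print(train == train_again)   # True
```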

In [11]:
#@title Build Logistic Regression model
lr = LogisticRegression(featuresCol='features', labelCol='Churn')

Let's break down this step: 👇

*   `lr = LogisticRegression(featuresCol='features', labelCol='Churn')`: This line creates an instance of the `LogisticRegression` model. We're telling it to use the column named 'features' as the input features and the column named 'Churn' as the target variable (what we want to predict). 'Churn' is already a numeric 0/1 column, so it can be used as the label directly. 🎯

We've now initialized our Logistic Regression model, ready to be trained! 💪
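Under the hood, a binary logistic regression turns a weighted sum of the features into a probability with the sigmoid function, then compares that probability with a threshold (0.5 by default in Spark). Here is a minimal sketch of that calculation. The function names, weights, and feature values are made up for illustration and are not taken from the trained model.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_churn(features, weights, intercept, threshold=0.5):
    """Return (probability of churn, predicted label) for one customer."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, 1 if p > threshold else 0


# Hypothetical weights and features, chosen so that z = 1.0
p, label = predict_churn([2.0, 1.0], weights=[1.0, -0.5], intercept=-0.5)
print(round(p, 3), label)  # p is about 0.731, so the predicted label is 1
```

Training the model means finding the weights and intercept that make these probabilities match the observed churn labels as closely as possible.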

In [13]:
#@title Create ML pipeline
pipeline = Pipeline(stages=[location_indexer, assembler, lr])

Let's break down the pipeline creation: 👇

*   `pipeline = Pipeline(stages=[location_indexer, assembler, lr])`: This line creates a Spark ML `Pipeline`. A pipeline allows us to combine multiple `Estimators` and `Transformers` into a single workflow. 🔄
    *   `location_indexer`: This is the `StringIndexer` that converts 'Location' into the numeric 'Location\_indexed' column.
    *   `assembler`: This is the `VectorAssembler` that combines our features into a single vector column.
    *   `lr`: This is our `LogisticRegression` model.

The pipeline will execute these stages in order: first it indexes 'Location', then it assembles the features, and finally it applies the logistic regression model. The order matters because the assembler needs 'Location\_indexed' to exist, and the model needs the 'features' column. ✨

In [14]:
#@title Train the model
model = pipeline.fit(train_data)

Let's break down the model training step: 👇

*   `model = pipeline.fit(train_data)`: This line trains the entire pipeline using the `train_data` DataFrame. When you call `.fit()` on a `Pipeline`, it sequentially runs the `.fit()` method on each `Estimator` (like `StringIndexer` and `LogisticRegression`) and the `.transform()` method on each `Transformer` (like `VectorAssembler`) within the pipeline. The result is a `PipelineModel`, which is a trained pipeline. 🚀

The `model` variable now holds our trained pipeline, ready to make predictions! 💪

In [15]:
#@title Predict on test data
predictions = model.transform(test_data)

Let's break down the prediction step: 👇

*   `predictions = model.transform(test_data)`: This line uses our trained `model` (which is a `PipelineModel`) to make predictions on the `test_data` DataFrame. The `.transform()` method applies all the stages in the trained pipeline (including feature assembly and the trained logistic regression model) to the new data. 🚀

The result is a new DataFrame called `predictions` that includes the original columns from `test_data` plus additional columns generated by the pipeline, such as the predicted churn label and the raw prediction scores. ✨

In [16]:
#@title Evaluate the model using AUC
evaluator = BinaryClassificationEvaluator(labelCol='Churn', rawPredictionCol='rawPrediction', metricName='areaUnderROC')
roc_auc = evaluator.evaluate(predictions)
print(f"ROC AUC Score: {roc_auc:.4f}")

ROC AUC Score: 0.8795


Let's break down the model evaluation step: 👇

*   `evaluator = BinaryClassificationEvaluator(labelCol='Churn', rawPredictionCol='rawPrediction', metricName='areaUnderROC')`: This line creates a `BinaryClassificationEvaluator`. We configure it to use the 'Churn' column as the true labels, the 'rawPrediction' column (generated by the logistic regression model) for the prediction scores, and specify that we want to calculate the 'areaUnderROC' metric (AUC). The AUC measures the ability of the classifier to distinguish between positive and negative classes. A higher AUC indicates better performance. ✅
*   `roc_auc = evaluator.evaluate(predictions)`: This line runs the evaluator on our `predictions` DataFrame to calculate the ROC AUC score. 📊
*   `print(f"ROC AUC Score: {roc_auc:.4f}")`: This line prints the calculated ROC AUC score, formatted to four decimal places, so we can see how well our model performed. 🎯

Let's break down the ROC AUC Score: 👇

*   **ROC AUC Score: 0.8795** ✨: This score, standing at **0.8795**, is a key metric for evaluating binary classification models like the one we built to predict customer churn.
*   **What it means**: The Area Under the Receiver Operating Characteristic Curve (ROC AUC) essentially measures the model's ability to distinguish between the two classes (churned vs. not churned).
*   **Interpretation**: A score of 0.5 suggests the model performs no better than random chance, while a score of 1.0 indicates a perfect model. Our score of **0.8795** is quite good! It means our model has a strong ability to differentiate between customers who are likely to churn and those who are not. 💪
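*   **In simpler terms**: If you pick one churned customer and one non-churned customer at random, the model gives the churned customer the higher churn score about 88% of the time. Note that this measures how well the model *ranks* customers, which is not the same thing as accuracy. 🎯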
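The ranking view of ROC AUC can be computed by hand on a tiny example. The sketch below counts, over all (churned, not churned) pairs, how often the churned customer gets the higher score. The function name and the toy labels and scores are hypothetical, and Spark computes the metric from the ROC curve rather than with this pairwise loop, so treat this as an illustration of the idea only.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = churned, 0 = stayed
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.3, 0.4, 0.6, 0.1]
print(roc_auc(labels, scores))  # 5 of 6 pairs are ranked correctly, about 0.833
```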

In [17]:
#@title Show predictions
predictions.select("Churn", "prediction", "probability").show(10)

+-----+----------+--------------------+
|Churn|prediction|         probability|
+-----+----------+--------------------+
|    0|       0.0|[0.95450919377362...|
|    0|       0.0|[0.87345478839347...|
|    0|       0.0|[0.96324976241585...|
|    0|       0.0|[0.90264082322965...|
|    0|       0.0|[0.96275299577163...|
|    0|       0.0|[0.90119262811257...|
|    0|       0.0|[0.98999787282630...|
|    0|       0.0|[0.97989748156890...|
|    0|       0.0|[0.86721806205680...|
|    0|       0.0|[0.55978175465318...|
+-----+----------+--------------------+
only showing top 10 rows



Let's break down the prediction display: 👇

*   `predictions.select("Churn", "prediction", "probability")`: This selects the original 'Churn' column (the actual value), the 'prediction' column (the model's predicted churn value, either 0.0 or 1.0), and the 'probability' column. The probability is a two-element vector: the first value is the estimated probability of class 0 (no churn) and the second is the estimated probability of class 1 (churn). 📊
*   `.show(10)`: This displays the first 10 rows of the selected columns, allowing us to quickly inspect the model's predictions and compare them to the actual values. 📋

This gives us a glimpse into how our model is performing on individual instances in the test set. ✨
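Looking at ten rows only goes so far. A common next step is to count how the actual and predicted labels line up across the whole test set. Here is a small plain-Python sketch of that tally, using made-up (actual, predicted) pairs. The function name `confusion_counts` is hypothetical. In the notebook you could obtain real pairs from `predictions.select("Churn", "prediction").collect()`, which I have not run here.

```python
from collections import Counter


def confusion_counts(pairs):
    """Tally true/false positives and negatives from (actual, predicted) pairs."""
    counts = Counter(tp=0, fp=0, tn=0, fn=0)
    for actual, predicted in pairs:
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts


# Hypothetical toy pairs: (actual Churn, predicted Churn)
pairs = [(0, 0), (0, 0), (1, 1), (1, 0), (0, 1), (1, 1)]
c = confusion_counts(pairs)
precision = c["tp"] / (c["tp"] + c["fp"])  # of predicted churners, how many churned
recall = c["tp"] / (c["tp"] + c["fn"])     # of actual churners, how many we caught
print(dict(c))                             # {'tp': 2, 'fp': 1, 'tn': 2, 'fn': 1}
print(round(precision, 3), round(recall, 3))  # 0.667 0.667
```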

## Wrapping Up: What We've Learned and What's Next! 🎉

We've successfully built a customer churn prediction model using PySpark's Logistic Regression! Here's a quick recap of what we covered:

*   **Spark Session**: We learned how to start a Spark session, which is our gateway to using PySpark. 🚪
*   **Data Loading and Exploration**: We loaded a CSV dataset into a Spark DataFrame and explored its schema and initial rows. 📊
*   **Data Preprocessing**: We prepared our data by indexing the 'Location' column and assembling features into a vector format suitable for machine learning models. 🛠️
*   **Model Building**: We initialized a Logistic Regression model. 🧠
*   **Pipeline Creation**: We built a pipeline to chain our preprocessing and modeling steps together. 🏗️
*   **Model Training**: We trained the pipeline model on our training data. 🚀
*   **Prediction**: We used the trained model to make predictions on the test data. ✅
*   **Evaluation**: We evaluated the model's performance using the ROC AUC score. 🎯
*   **Prediction Inspection**: We looked at some individual predictions to see the actual vs. predicted churn and associated probabilities. 👀

### What's Next? 🤔

This tutorial provides a solid foundation. Here are some exciting next steps you could explore:

*   **Feature Engineering**: Create new features from existing ones (e.g., customer engagement metrics, duration since last activity).
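*   **Handling Categorical Features**: This tutorial feeds the `StringIndexer` output for 'Location' into the model as a plain number. If you believe categorical columns like 'Location' or 'Company' contain valuable information for predicting churn, consider one-hot encoding the indexed values with `OneHotEncoder` (called `OneHotEncoderEstimator` in Spark 2.3 and 2.4).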
*   **Hyperparameter Tuning**: Experiment with different hyperparameters for the Logistic Regression model to potentially improve performance (e.g., regularization parameter `regParam`, elastic net mixing `elasticNetParam`).
*   **Other Algorithms**: Try other PySpark ML classification algorithms like Decision Trees, Random Forests, or Gradient Boosted Trees to see if they yield better results.
*   **Cross-Validation**: Implement cross-validation for more robust model evaluation.
*   **Model Interpretation**: Explore techniques to understand which features are most important for the model's predictions.
*   **Saving and Loading**: Learn how to save your trained model and load it later for making new predictions.

### Where Can This Be Applied? 💡

Customer churn prediction is valuable across many industries:

*   **Telecommunications**: Predicting which customers are likely to switch providers.
*   **Subscription Services (Streaming, SaaS)**: Identifying users at risk of canceling their subscriptions.
*   **Banking and Finance**: Predicting customers who might close accounts.
*   **E-commerce**: Predicting customers who might stop making purchases.

### Areas for Improvement 🚧

While our model performed well, there are always areas to improve:

*   **Data Quality**: Ensure the data is clean and accurate. Missing values or inconsistencies can impact model performance.
*   **Feature Importance**: Understand which features are driving the predictions. This can help in business decision-making.
*   **Class Imbalance**: If churned customers are a small percentage of the total, the dataset might be imbalanced. Techniques like oversampling or undersampling could be explored.
*   **Deployment**: For real-world applications, the model needs to be deployed to make predictions on new data.
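To make the undersampling idea concrete, here is a plain-Python sketch that randomly drops rows from the larger class until both classes are the same size. The function name `undersample` and the toy rows are hypothetical. In practice you would apply something like this to the training data only, so the test set keeps its real-world class balance.

```python
import random


def undersample(rows, label_of, seed=42):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


# Hypothetical toy rows: (customer_id, churn) with 8 stayers and 2 churners
rows = [(i, 0) for i in range(8)] + [(i, 1) for i in range(8, 10)]
balanced = undersample(rows, label_of=lambda row: row[1])
print(len(balanced))                       # 4
print(sum(churn for _, churn in balanced)) # 2 churners, so 2 stayers remain
```

The trade-off is that undersampling throws data away, which can hurt when the dataset is small to begin with.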

Keep experimenting and building! Happy coding! 😊