<a href="https://colab.research.google.com/github/amien1410/colab-notebooks/blob/main/Colab_Pyspark_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## PySpark Logistic Regression Tutorial: Predicting Customer Churn 📊

Welcome to this tutorial! We're going to dive into predicting customer churn using **PySpark** and **Logistic Regression**. 🚀

Customer churn is a big deal for businesses. It means customers are leaving. By predicting which customers might churn, companies can take action to keep them. 🎯

We'll be using a dataset from Kaggle. It contains information about consumers, including:

*   **Names**: The name of the customer.
*   **Age**: The customer's age.
*   **Total\_Purchase**: The total amount the customer has spent.
*   **Account\_Manager**: Whether the customer has an account manager (1 for yes, 0 for no).
*   **Years**: The number of years the customer has been with the company.
*   **Num\_Sites**: The number of websites the customer uses.
*   **Onboard\_date**: The date the customer joined.
*   **Location**: The customer's location.
*   **Company**: The customer's company.
*   **Churn**: Whether the customer churned (1 for yes, 0 for no). This is what we'll try to predict! 💪

Get ready to learn how to use PySpark for this exciting machine learning task! 🎉

In [1]:
from google.colab import drive
drive.mount('/content/drive')

!pip install kaggle
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'
!kaggle datasets download -d brycepeakega/generalassemblywelcome5k
!unzip -q "/content/generalassemblywelcome5k.zip"

Mounted at /content/drive
Dataset URL: https://www.kaggle.com/datasets/brycepeakega/generalassemblywelcome5k
License(s): unknown
Downloading generalassemblywelcome5k.zip to /content
  0% 0.00/5.99M [00:00<?, ?B/s]
100% 5.99M/5.99M [00:00<00:00, 550MB/s]


In this cell, we're getting set up to work with some data. 🚀

*   `from google.colab import drive`: This line imports a tool to connect Google Colab to your Google Drive. 🤝
*   `drive.mount('/content/drive')`: This line actually makes the connection. Now, Colab can see and use files stored in your Google Drive. 📂
*   `!pip install kaggle`: We're installing the Kaggle library. Kaggle is a platform for data science competitions and datasets. 🏆
*   `import os`: This imports the 'os' module, which helps us interact with the computer's operating system. 💻
*   `os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'`: This sets up a special location where Kaggle will look for your credentials. We're telling it to look inside your Google Drive. 🔑
*   `!kaggle datasets download -d brycepeakega/generalassemblywelcome5k`: This command downloads the dataset used in this tutorial from Kaggle into `/content`. 📥
*   `!unzip -q "/content/generalassemblywelcome5k.zip"`: The download arrives as a zip archive. This line unzips it quietly (`-q`) so we can access the files inside, including `customer_churn.csv`. 📦
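The download only works if the Kaggle CLI can find your API token, a file named `kaggle.json`, in the directory that `KAGGLE_CONFIG_DIR` points to. As a minimal sketch, you could check for the token before downloading. The helper name `kaggle_credentials_present` is hypothetical and not part of the Kaggle library:

```python
import os
from pathlib import Path


def kaggle_credentials_present(config_dir: str) -> bool:
    """Return True if a kaggle.json token file exists in config_dir (hypothetical helper)."""
    return (Path(config_dir) / "kaggle.json").is_file()


# Fall back to the Drive folder used in this notebook if the variable is not set
config_dir = os.environ.get("KAGGLE_CONFIG_DIR", "/content/drive/MyDrive/kaggle")

if not kaggle_credentials_present(config_dir):
    print(f"No kaggle.json found in {config_dir}; upload your API token there first.")
```

Running this before the `!kaggle datasets download` line gives a clearer message than the CLI's own authentication error.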

In [2]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline

# Start Spark session
spark = SparkSession.builder \
    .appName("ChurnPredictionLogisticRegression") \
    .getOrCreate()

# Load data
df = spark.read.csv("/content/customer_churn.csv", header=True, inferSchema=True)

# Print schema and preview data
df.printSchema()
df.show(5)

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)

+----------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|           Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|             Company|Churn|
+----------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|Cameron Williams|42.0|       11066.8|              0| 7.22|      8.0|2013-08-30 07:00:40|10265 Elizabeth M...|          Harvey LLC|    1|
|   Kevin Mueller|41.0|      1

Let's break down the code you just ran: 👇

*   `from pyspark.sql import SparkSession`: This line imports the necessary class to create a Spark session. Think of a Spark session as your main entry point to using Spark's functionality. 🚪
*   `from pyspark.ml.feature import StringIndexer, VectorAssembler`: We're importing tools for preparing our data. `StringIndexer` helps convert text categories into numbers, and `VectorAssembler` combines different columns into a single feature vector that machine learning models can use. 🛠️
*   `from pyspark.ml.classification import LogisticRegression`: This imports the Logistic Regression algorithm from Spark's machine learning library. This is the model we'll use for prediction. 🧠
*   `from pyspark.ml.evaluation import BinaryClassificationEvaluator`: This imports a tool to evaluate how well our binary classification model (like predicting churn, which is either yes or no) performs. ✅
*   `from pyspark.ml import Pipeline`: This imports the `Pipeline` class, which allows us to chain multiple data processing and machine learning steps together. This makes our workflow organized and repeatable. 🏗️
*   `spark = SparkSession.builder.appName("ChurnPredictionLogisticRegression").getOrCreate()`: This is where we create or get an existing Spark session. We give it a name ("ChurnPredictionLogisticRegression") so we can identify it. ✨
*   `df = spark.read.csv("/content/customer_churn.csv", header=True, inferSchema=True)`: This line loads our data from the unzipped CSV file at `/content/customer_churn.csv` into a Spark DataFrame. `header=True` tells Spark that the first row holds the column names, and `inferSchema=True` tells Spark to scan the data and work out the data type of each column instead of reading everything as strings. 📝
*   `df.printSchema()`: This displays the structure of our DataFrame, showing the column names and their inferred data types. It's like looking at the blueprint of our data. 🗺️
*   `df.show(5)`: This shows the first 5 rows of our DataFrame. It's a quick way to peek at the actual data. 👀

In [3]:
# 'Churn' is already an integer 0/1 column; StringIndexer maps its values to a numeric 'label' column
label_indexer = StringIndexer(inputCol="Churn", outputCol="label")

# Select numeric feature columns only: VectorAssembler rejects string and timestamp columns
feature_cols = [name for name, dtype in df.dtypes if dtype in ("int", "double") and name != "Churn"]

# Handle categorical features (if any)
# Example: low-cardinality text columns such as 'Gender' or 'Contract' would be listed here
# categorical_cols = ['Gender', 'Contract']  # Replace with your categorical columns
categorical_cols = []  # None indexed here: Names, Location and Company are free text and are left out of the features

indexers = [StringIndexer(inputCol=col, outputCol=col + "_indexed") for col in categorical_cols]

# Replace categorical columns with indexed versions in features list
indexed_feature_cols = [col + "_indexed" if col in categorical_cols else col for col in feature_cols]

# Assemble features into a single vector
assembler = VectorAssembler(inputCols=indexed_feature_cols, outputCol="features")

Let's break down the data preprocessing steps: 👇

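*   `label_indexer = StringIndexer(inputCol="Churn", outputCol="label")`: This creates a `StringIndexer` that maps the values of the 'Churn' column (our target variable) to numeric indices in a new column named 'label'. In this dataset 'Churn' is already an integer 0/1, so the step mainly copies the target into a numeric 'label' column. Note that `StringIndexer` assigns index 0 to the most frequent value by default, so 'label' matches 'Churn' only if non-churners are the majority class. 🔢
*   `feature_cols`: This list holds the input columns for the model and never includes the 'Churn' target. `VectorAssembler` accepts only numeric, boolean, and vector columns, so only the numeric columns (Age, Total\_Purchase, Account\_Manager, Years, Num\_Sites) can be passed to it directly. The string columns (Names, Location, Company) and the Onboard\_date timestamp have to be dropped or encoded first. 📋
*   `categorical_cols = []`: We initialize an empty list for categorical columns. The string columns in this dataset look like free text (names, addresses, company names) rather than a small set of categories, so none are indexed here. If your dataset had columns like 'Gender' or 'Contract' with a handful of text values, you would list them here. 📝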
*   `indexers = [StringIndexer(inputCol=col, outputCol=col + "_indexed") for col in categorical_cols]`: If we had categorical columns listed, this would create a `StringIndexer` for each of them to convert their text values into numerical indices. The new indexed columns would have "_indexed" added to their original name. 🔤➡️🔢
*   `indexed_feature_cols = [col + "_indexed" if col in categorical_cols else col for col in feature_cols]`: This updates our list of feature columns. If a column was identified as categorical, we use its new indexed name (e.g., 'Gender\_indexed'); otherwise, we use the original column name. This ensures our feature list contains the numerical representations of any categorical features. 🔄
*   `assembler = VectorAssembler(inputCols=indexed_feature_cols, outputCol="features")`: This creates a `VectorAssembler`. This tool takes all the specified input columns (our `indexed_feature_cols`) and combines them into a single vector column named 'features'. Logistic Regression in PySpark expects the input features to be in this vector format. 💪
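Because `VectorAssembler` accepts only numeric, boolean, and vector inputs, it helps to see which columns of this dataset qualify. The sketch below uses plain Python on `(name, type)` pairs that mirror what `df.dtypes` would report for the schema printed earlier. The helper `numeric_feature_columns` and the `NUMERIC_TYPES` set are illustrative names, not PySpark APIs:

```python
# (name, type) pairs mirroring df.dtypes for the schema shown earlier in this tutorial
schema_dtypes = [
    ("Names", "string"),
    ("Age", "double"),
    ("Total_Purchase", "double"),
    ("Account_Manager", "int"),
    ("Years", "double"),
    ("Num_Sites", "double"),
    ("Onboard_date", "timestamp"),
    ("Location", "string"),
    ("Company", "string"),
    ("Churn", "int"),
]

# Spark type names treated as numeric in this sketch
NUMERIC_TYPES = {"int", "bigint", "float", "double"}


def numeric_feature_columns(dtypes, label_col):
    """Keep numeric columns and drop the label column (hypothetical helper)."""
    return [name for name, dtype in dtypes if dtype in NUMERIC_TYPES and name != label_col]


print(numeric_feature_columns(schema_dtypes, "Churn"))
# → ['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites']
```

On the real DataFrame you would pass `df.dtypes` instead of the hand-written list.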

In [4]:
# Split data
train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)

Let's break down the data splitting process: 👇

*   `train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)`: This line takes our `df` DataFrame and randomly splits it into two parts: `train_data` (roughly 80% of the rows) and `test_data` (roughly 20%). The weights are target proportions rather than exact counts, because each row is assigned at random. The `seed=42` makes the split repeatable when you rerun the code on the same data, which is good for reproducibility. 🎲

We now have two separate DataFrames: one for training our model and one for testing its performance. 🛠️✅
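To build intuition for a seeded random split, here is a plain-Python analogy. It is not Spark's implementation, just a sketch of the same idea: each row gets an independent random draw, so the split sizes are close to the requested proportions but not exact, and the same seed reproduces the same split. The function name `bernoulli_split` and the 1,000 stand-in rows are assumptions for illustration:

```python
import random


def bernoulli_split(rows, train_fraction, seed):
    """Send each row to train with probability train_fraction, otherwise to test."""
    rng = random.Random(seed)  # seeded generator, so the draws are repeatable
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))  # stand-in for DataFrame rows
train_a, test_a = bernoulli_split(rows, 0.8, seed=42)
train_b, test_b = bernoulli_split(rows, 0.8, seed=42)

print(len(train_a), len(test_a))  # close to 800 / 200, not necessarily exact
print(train_a == train_b)         # same seed, same split
```

In the notebook itself, `train_data.count()` and `test_data.count()` show the actual sizes of the two DataFrames.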

In [5]:
# Build Logistic Regression model
lr = LogisticRegression(featuresCol='features', labelCol='label')

Let's break down this step: 👇

*   `lr = LogisticRegression(featuresCol='features', labelCol='label')`: This line creates an instance of the `LogisticRegression` model. We're telling it to use the column named 'features' as the input features and the column named 'label' as the target variable (what we want to predict). 🎯

We've now initialized our Logistic Regression model, ready to be trained! 💪
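Under the hood, logistic regression computes a weighted sum of the features and passes it through the sigmoid function to get a probability between 0 and 1. A row is predicted as churn when that probability exceeds a threshold (0.5 by default in Spark's `LogisticRegression`). The sketch below shows the arithmetic in plain Python. The weights, bias, and feature values are made up for illustration and are not fitted to this dataset:

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_churn_probability(weights, bias, features):
    """Probability of churn: sigmoid of the weighted sum of features plus bias."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients for two features, e.g. Years and Num_Sites
weights = [0.5, 1.2]
bias = -12.0
customer = [7.22, 8.0]

probability = predict_churn_probability(weights, bias, customer)
prediction = 1 if probability > 0.5 else 0
print(probability, prediction)

print(sigmoid(0.0))  # → 0.5, the decision boundary
```

Training the model amounts to finding the weights and bias that make these probabilities agree with the observed 'label' values as closely as possible.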