# Customer Churn Prediction with PySpark

This project uses PySpark to perform classification on a telecom dataset to predict customer churn. It demonstrates big data handling, feature engineering, and machine learning with Spark's MLlib.

## Importing Libraries

We start by importing necessary Python and PySpark libraries for data handling and modeling.

In [10]:
# Import necessary libraries
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession

## Data Loading

We load the customer churn dataset using pandas.

In [30]:
# Load dataset using pandas
df = pd.read_csv("churn.csv")

# Preview the first few rows
df.head()

| | Unnamed: 0 | Names | Age | Total_Purchase | Account_Manager | Years | Num_Sites | Churn |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Cameron Williams | 42.0 | 11066.8 | 0 | 7.22 | 8.0 | 1 |
| 1 | 1 | Kevin Mueller | 41.0 | 11916.22 | 0 | 6.5 | 11.0 | 1 |
| 2 | 2 | Eric Lozano | 38.0 | 12884.75 | 0 | 6.67 | 12.0 | 1 |
| 3 | 3 | Phillip White | 42.0 | 8010.76 | 0 | 6.71 | 10.0 | 1 |
| 4 | 4 | Cynthia Norton | 37.0 | 9191.58 | 0 | 5.56 | 9.0 | 1 |


## Exploratory Data Analysis (EDA)

Before we build a machine learning model, we explore the dataset to:

- Understand the structure of the data
- Check for missing values
- Examine statistical summaries
- Review the distribution of the target variable (`Churn`)

In [36]:
# Summary statistics
df.describe()

| | Unnamed: 0 | Age | Total_Purchase | Account_Manager | Years | Num_Sites | Churn |
|---|---|---|---|---|---|---|---|
| count | 900.0 | 900.0 | 900.0 | 900.0 | 900.0 | 900.0 | 900.0 |
| mean | 449.5 | 41.816667 | 10062.824033 | 0.481111 | 5.273156 | 8.587778 | 0.166667 |
| std | 259.951919 | 6.12756 | 2408.644532 | 0.499921 | 1.274449 | 1.764836 | 0.372885 |
| min | 0.0 | 22.0 | 100.0 | 0.0 | 1.0 | 3.0 | 0.0 |
| 25% | 224.75 | 38.0 | 8497.1225 | 0.0 | 4.45 | 7.0 | 0.0 |
| 50% | 449.5 | 42.0 | 10045.87 | 0.0 | 5.215 | 8.0 | 0.0 |
| 75% | 674.25 | 46.0 | 11760.105 | 1.0 | 6.11 | 10.0 | 0.0 |
| max | 899.0 | 65.0 | 18026.01 | 1.0 | 9.15 | 14.0 | 1.0 |


In [38]:
# Check for missing values
df.isnull().sum()

Unnamed: 0         0
Names              0
Age                0
Total_Purchase     0
Account_Manager    0
Years              0
Num_Sites          0
Churn              0
dtype: int64

In [40]:
# Distribution of target variable
df["Churn"].value_counts()

Churn
0    750
1    150
Name: count, dtype: int64
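
The counts above show an imbalanced target: 150 of the 900 customers churned. As a quick sanity check (a minimal sketch that uses only the counts printed above and is not part of the original notebook), the churn rate and the accuracy of a trivial "always predict 0" baseline can be computed directly:

```python
# Class counts copied from the value_counts() output above
counts = {0: 750, 1: 150}

total = sum(counts.values())
churn_rate = counts[1] / total

# A model that always predicts the majority class ("no churn")
# is right for every non-churner and wrong for every churner.
majority_baseline_accuracy = max(counts.values()) / total

print(f"Churn rate: {churn_rate:.4f}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.4f}")
```

The churn rate agrees with the `Churn` mean of 0.166667 in the summary statistics. Since predicting "no churn" for everyone is already right about 83% of the time, plain accuracy is a weak yardstick here, and a ranking metric such as AUC is a useful complement.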

## Converting to Spark DataFrame

After completing our exploratory data analysis with pandas, we convert the dataset into a Spark DataFrame.

This allows us to take advantage of PySpark’s distributed machine learning pipeline tools.

In [43]:
# Start a Spark session
spark = SparkSession.builder.appName("ChurnPrediction").getOrCreate()

# Convert pandas DataFrame to Spark DataFrame
spark_df = spark.createDataFrame(df)

# Show structure and first few records
spark_df.printSchema()
spark_df.show(5)

root
 |-- Unnamed: 0: long (nullable = true)
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: long (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Churn: long (nullable = true)

+----------+----------------+----+--------------+---------------+-----+---------+-----+
|Unnamed: 0|           Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|Churn|
+----------+----------------+----+--------------+---------------+-----+---------+-----+
|         0|Cameron Williams|42.0|       11066.8|              0| 7.22|      8.0|    1|
|         1|   Kevin Mueller|41.0|      11916.22|              0|  6.5|     11.0|    1|
|         2|     Eric Lozano|38.0|      12884.75|              0| 6.67|     12.0|    1|
|         3|   Phillip White|42.0|       8010.76|              0| 6.71|     10.0|    1|
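|         4|  Cynthia Norton|37.0|       9191.58|              0| 5.56|      9.0|    1|
+----------+----------------+----+--------------+---------------+-----+---------+-----+
only showing top 5 rows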

## Preprocessing for Machine Learning

Before training a model, we need to prepare the data by:

- Renaming the target column to `label`
- Selecting relevant numeric features
- Assembling features into a single vector column (`features`)

In [48]:
from pyspark.ml.feature import VectorAssembler

# Rename the label column
spark_df = spark_df.withColumnRenamed("Churn", "label")

# Define feature columns
feature_cols = ["Age", "Total_Purchase", "Account_Manager", "Years", "Num_Sites"]

# Assemble features into a single column
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data = assembler.transform(spark_df).select("features", "label")

# Show a sample of the resulting DataFrame
data.show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[42.0,11066.8,0.0...|    1|
|[41.0,11916.22,0....|    1|
|[38.0,12884.75,0....|    1|
|[42.0,8010.76,0.0...|    1|
|[37.0,9191.58,0.0...|    1|
+--------------------+-----+
only showing top 5 rows
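
Conceptually, `VectorAssembler` concatenates the listed columns, in order, into one numeric vector per row. The plain-Python sketch below mimics that for the first record in the preview; `assemble` is a hypothetical helper written for illustration only and is not part of PySpark.

```python
feature_cols = ["Age", "Total_Purchase", "Account_Manager", "Years", "Num_Sites"]

# First record from the data preview, as a plain dict
row = {
    "Names": "Cameron Williams",
    "Age": 42.0,
    "Total_Purchase": 11066.8,
    "Account_Manager": 0,
    "Years": 7.22,
    "Num_Sites": 8.0,
    "Churn": 1,
}

def assemble(record, columns):
    """Hypothetical stand-in for VectorAssembler: pick the columns in order, as floats."""
    return [float(record[name]) for name in columns]

features = assemble(row, feature_cols)
print(features)  # [42.0, 11066.8, 0.0, 7.22, 8.0]
```

The integer `Account_Manager` flag becomes `0.0`, which matches the `[42.0,11066.8,0.0...` prefix in the Spark output, and columns that are not listed (such as `Names`) are simply left out of the vector.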


## Splitting the Data

We split the data into training and testing sets.

- **Training Set (70%)**: Used to fit the machine learning model.
- **Testing Set (30%)**: Used to evaluate how well the model performs on unseen data.

In [51]:
# Split the data into training and testing sets (70/30)
train_data, test_data = data.randomSplit([0.7, 0.3], seed=42)

# Check the size of each set
print("Training Records:", train_data.count())
print("Testing Records:", test_data.count())

Training Records: 633
Testing Records: 267
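
Note that the split is 633/267 rather than exactly 630/270. `randomSplit` assigns rows at random, so the weights are expected proportions, not exact quotas. The plain-Python sketch below imitates that behavior under the assumption of one independent uniform draw per row. It is an illustration rather than Spark's actual implementation, and `random_split` is a hypothetical helper, so its counts will not match the ones above.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row to train or test with one independent uniform draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(900))  # stand-in for the 900 customer records
train, test = random_split(rows, train_weight=0.7, seed=42)

print("Training Records:", len(train))
print("Testing Records:", len(test))
```

Fixing the seed makes the split reproducible, but the sizes still only approximate the 70/30 target.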


## Training a Logistic Regression Model

We use PySpark’s `LogisticRegression` to build a simple binary classification model that predicts customer churn.

The model learns to separate customers likely to churn (`label = 1`) from those likely to stay (`label = 0`) based on the selected features.
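
As a rough illustration of that decision rule: binary logistic regression computes a weighted sum of the features, passes it through the sigmoid function to get a probability of churn, and predicts `1` when that probability exceeds a threshold (0.5 by default in Spark's `LogisticRegression`). The weights, intercept, and helper names below are made up for the example; they are not the fitted model's values.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_churn(features, weights, intercept, threshold=0.5):
    """Return (probability of churn, predicted label) for one customer."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    p_churn = sigmoid(z)
    return p_churn, 1 if p_churn > threshold else 0

# Hypothetical coefficients for [Age, Total_Purchase, Account_Manager, Years, Num_Sites]
weights = [0.05, 0.0, 0.4, 0.5, 1.2]
intercept = -18.0

p, label = predict_churn([42.0, 11066.8, 0.0, 7.22, 8.0], weights, intercept)
print(f"P(churn) = {p:.3f}, prediction = {label}")
```

Training amounts to choosing the weights and intercept so that these probabilities agree with the observed labels as closely as possible.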

In [54]:
from pyspark.ml.classification import LogisticRegression

# Initialize logistic regression model
lr = LogisticRegression(featuresCol='features', labelCol='label')

# Fit the model to training data
lr_model = lr.fit(train_data)



## Making Predictions and Evaluating the Model

After training the model, we evaluate its performance on the test set using:

- **AUC (Area Under the ROC Curve)**: Measures how well the model distinguishes between the two classes.
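- **Prediction samples**: Review a few predictions to observe the output. The `probability` vector lists the probability of `label = 0` first, so the truncated values shown are the predicted probabilities of *not* churning.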

In [57]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Predict on test data
predictions = lr_model.transform(test_data)

# Show example predictions
predictions.select("label", "prediction", "probability").show(5)

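# Evaluate using AUC (area under the ROC curve, the evaluator's default metric, stated explicitly here)
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")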
auc = evaluator.evaluate(predictions)
print(f"AUC: {auc:.4f}")

+-----+----------+--------------------+
|label|prediction|         probability|
+-----+----------+--------------------+
|    1|       0.0|[0.81979276108790...|
|    1|       0.0|[0.64336714680380...|
|    1|       1.0|[0.43756059264380...|
|    1|       1.0|[0.38762905932159...|
|    1|       0.0|[0.50779432812282...|
+-----+----------+--------------------+
only showing top 5 rows
AUC: 0.9405
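
AUC has a direct interpretation: it is the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, with ties counted as half. The sketch below computes it by brute force on a tiny made-up example (the labels and scores are illustrative, not taken from the test set). `pairwise_auc` is a hypothetical helper, not a PySpark function.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

# Toy data: 1 = churn, 0 = no churn; scores play the role of P(churn)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 0.75
```

Because AUC depends only on how examples are ranked, it can be high even when the default 0.5 threshold misses churners, as in the sample rows above, where three of the five churners are predicted as `0.0`.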


## Conclusion

In this project, we built a machine learning pipeline using PySpark to predict customer churn in a telecom dataset.

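- We explored the data with pandas: 900 records, no missing values, and 150 churners (about 17%).
- We converted the dataset into a Spark DataFrame to use Spark's machine learning tools.
- We trained a logistic regression model on five numeric customer features.
- The model achieved an AUC of **0.9405** on the 267-record test set, indicating that it ranks churners above non-churners well. AUC does not depend on a decision threshold, so precision and recall at the chosen threshold are worth checking separately.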

Although the results are promising, further improvements could be made by:

- Testing different classification models (e.g., Random Forest, Gradient Boosted Trees)
- Incorporating categorical variables with proper encoding
- Applying hyperparameter tuning to optimize performance

This notebook demonstrates the end-to-end PySpark workflow on a small, 900-record dataset; the same Spark ML steps (feature assembly, splitting, training, and evaluation) apply to datasets too large for a single machine.