<a href="https://colab.research.google.com/github/amien1410/colab-notebooks/blob/main/Colab_Pyspark_Churn_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#@title Install Kaggle modules and download the dataset

from google.colab import drive
drive.mount('/content/drive')

!pip install kaggle
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'
!kaggle datasets download -d halimedogan/churn-dataset
!unzip -q "/content/churn-dataset.zip"

Mounted at /content/drive
Dataset URL: https://www.kaggle.com/datasets/halimedogan/churn-dataset
License(s): unknown
Downloading churn-dataset.zip to /content
  0% 0.00/262k [00:00<?, ?B/s]
100% 262k/262k [00:00<00:00, 627MB/s]


## 📊 Analyzing Customer Churn with PySpark

Hey there! Let's dive into this customer churn dataset. You know the drill – keeping existing customers happy is way more cost-effective than finding new ones. For us folks in the banking world, understanding *why* a client might pack their bags and leave is gold. If we can get ahead of churn, we can cook up some sweet loyalty programs and retention campaigns to keep our customers sticking around.

Here's a quick rundown of what we're working with in this dataset, keeping an eye out for features that might be handy in our PySpark analysis:

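*   **Surname**: The customer's last name. It carries no useful signal for churn, so we won't be feeding it into our models (the same goes for the `RowNumber` and `CustomerId` identifier columns).
*   **CreditScore**: At first glance this one doesn't look strongly tied to churn, but we'll keep it around and reuse it later when we build interaction features in PySpark.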
*   **Geography**: This *could* be interesting. Location can definitely play a role in customer behavior. We'll want to consider this in our feature engineering – maybe one-hot encoding for regions? 🌍
*   **Gender**: Let's see if gender has any sway. Another candidate for categorical feature handling in PySpark. 🚻
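*   **Age**: Definitely a key one! It's tempting to assume older customers are more rooted, but that's a hypothesis to check against the data rather than a given. Either way, this is a strong candidate for our models. 🎂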
*   **Tenure**: How long a customer's been with us. Makes sense that longer tenure might mean more loyalty. 🕰️
*   **NumOfProducts**: How many banking products a customer uses. More products could imply more entanglement with the bank. 🛒
*   **HasCrCard**: Do they have a credit card with us? This might indicate a stronger tie to the bank. Another one to consider for feature vectors. 💳
*   **IsActiveMember**: Are they actively using their account? Seems intuitive that active users are less likely to leave. ✅
*   **EstimatedSalary**: Like account balance, higher salaries might mean less likelihood to leave. We'll want to scale this one properly. 💰
*   **Balance**: A really important feature! Customers with more cash in their accounts might be less likely to jump ship. This is probably a strong predictor. 💲
*   **Exited** (Our Target Variable!): This is what we're trying to predict – whether the customer churned or not. This will be our label column for training our PySpark machine learning models. 🎯

So, when we get into the PySpark code, we'll be focusing on transforming and vectorizing these features (CreditScore, Geography, Gender, Age, Tenure, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Balance) to build a model to predict `Exited`.

Let's fire up the Spark context and get to it! 🔥

## 🛠️ Setting up our PySpark Environment

Before we can jump into analyzing the data and building our churn prediction model, we need to make sure we have all the necessary tools in place. This involves installing `pyspark`, which lets us run Spark locally inside the notebook, along with `sparkmagic`. (Strictly speaking, `sparkmagic` is meant for talking to remote Spark clusters and isn't used by the local session we create below.) We'll also import a few other helpful libraries for data manipulation, visualization, and machine learning with PySpark.

This setup ensures we have a smooth workflow as we move through the data preprocessing, feature engineering, and model training stages.

In [2]:
#@title Install and Import Libraries

!pip install sparkmagic
!pip install pyspark

# libraries
import warnings
# import findspark
import pandas as pd
import seaborn as sns
from pyspark.ml.classification import GBTClassifier, LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder, StandardScaler
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

Collecting sparkmagic
  Downloading sparkmagic-0.22.0.tar.gz (45 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.3/45.3 kB 1.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting hdijupyterutils>=0.6 (from sparkmagic)
  Downloading hdijupyterutils-0.22.0.tar.gz (5.1 kB)
  Preparing metadata (setup.py) ... done
Collecting autovizwidget>=0.6 (from sparkmagic)
  Downloading autovizwidget-0.22.0.tar.gz (9.0 kB)
  Preparing metadata (setup.py) ... done
Collecting requests_kerberos>=0.8.0 (from sparkmagic)
  Downloading requests_kerberos-0.15.0-py2.py3-none-any.whl.metadata (12 kB)
Collecting jupyter>=1 (from hdijupyterutils>=0.6->sparkmagic)
  Downloading jupyter-1.1.1-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting jedi>=0.16 (from ipython>=4.0.2->sparkmagic)
  Downloading 

## 🔥 Firing up Spark and Loading the Data

Now that our libraries are in place, it's time to get Spark going! We'll create a Spark session, which is the entry point for any Spark functionality. Then, we'll load our churn dataset (`churn2.csv`) into a Spark DataFrame. Because we run with `master("local[*]")`, Spark uses all the cores of the Colab machine rather than a real cluster, but the same DataFrame code would carry over to a distributed setup. Finally, we'll take a quick peek at the first few rows to make sure everything loaded correctly.

In [3]:
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark_df = spark.read.csv('/content/churn2.csv', inferSchema=True, header=True)
spark_df.show(10)

+---------+----------+--------+-----------+---------+------+---+------+---------+-------------+---------+--------------+---------------+------+
|RowNumber|CustomerId| Surname|CreditScore|Geography|Gender|Age|Tenure|  Balance|NumOfProducts|HasCrCard|IsActiveMember|EstimatedSalary|Exited|
+---------+----------+--------+-----------+---------+------+---+------+---------+-------------+---------+--------------+---------------+------+
|        1|  15634602|Hargrave|        619|   France|Female| 42|     2|      0.0|            1|        1|             1|      101348.88|     1|
|        2|  15647311|    Hill|        608|    Spain|Female| 41|     1| 83807.86|            1|        0|             1|      112542.58|     0|
|        3|  15619304|    Onio|        502|   France|Female| 42|     8| 159660.8|            3|        1|             0|      113931.57|     1|
|        4|  15701354|    Boni|        699|   France|Female| 39|     1|      0.0|            2|        0|             0|       93826.63|

## 👀 Getting to Know Our Data: Initial Exploration

With our Spark DataFrame ready, it's time for some initial exploration. We'll start by checking the dimensions of the dataset (number of rows and columns), examining the schema to understand the data types of each column, and generating some descriptive statistics. This gives us a foundational understanding of our data before we dive deeper into feature engineering and modeling.

In [4]:
print("Shape: ", (spark_df.count(), len(spark_df.columns)))
spark_df.printSchema()
spark_df.describe().show()

Shape:  (10000, 14)
root
 |-- RowNumber: integer (nullable = true)
 |-- CustomerId: integer (nullable = true)
 |-- Surname: string (nullable = true)
 |-- CreditScore: integer (nullable = true)
 |-- Geography: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tenure: integer (nullable = true)
 |-- Balance: double (nullable = true)
 |-- NumOfProducts: integer (nullable = true)
 |-- HasCrCard: integer (nullable = true)
 |-- IsActiveMember: integer (nullable = true)
 |-- EstimatedSalary: double (nullable = true)
 |-- Exited: integer (nullable = true)

+-------+------------------+-----------------+-------+-----------------+---------+------+------------------+------------------+-----------------+------------------+-------------------+-------------------+-----------------+-------------------+
|summary|         RowNumber|       CustomerId|Surname|      CreditScore|Geography|Gender|               Age|            Tenure|          Balance|    

## 📊 Checking the Target Variable Distribution

Before we build any models, it's always a good idea to understand the distribution of our target variable. In this case, we want to see the split between customers who `Exited` (churned) and those who didn't. This helps us understand if we have a balanced dataset or if we need to consider techniques for handling imbalanced data later on.

In [5]:
spark_df.groupby("exited").count().show()

+------+-----+
|exited|count|
+------+-----+
|     1| 2037|
|     0| 7963|
+------+-----+
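
To put numbers on the imbalance, here is a small plain-Python sketch (no Spark needed) that takes the two counts from the output above and derives the churn rate, the accuracy of a trivial "nobody churns" model, and one common "balanced" class-weight heuristic. The weighting formula is an assumption for illustration; the notebook itself doesn't apply it.

```python
# Counts copied from the groupby output above.
counts = {0: 7963, 1: 2037}
total = sum(counts.values())

# Share of customers who churned.
churn_rate = counts[1] / total

# Accuracy of a trivial model that always predicts "stayed" (class 0).
baseline_accuracy = counts[0] / total

# "Balanced" class weights: total / (n_classes * class_count).
# One common heuristic, shown for illustration only.
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"churn rate: {churn_rate:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print({label: round(w, 4) for label, w in class_weights.items()})
```

Roughly one customer in five churned, so any model we train should be judged against that majority-class baseline rather than against 50%.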



## Exploring Tenure and Churn

Let's see if the length of time a customer has been with the bank (their `Tenure`) has any influence on whether they churn (`Exited`). We'll calculate the average tenure for customers who stayed versus those who left to see if there's a noticeable difference.

In [6]:
spark_df.groupby("exited").agg({"tenure": "mean"}).show()

+------+-----------------+
|exited|      avg(tenure)|
+------+-----------------+
|     1|4.932744231713304|
|     0|5.033278914981791|
+------+-----------------+



## Delving into Numerical Features

Now, let's focus on our numerical features. We'll select these columns and get a quick snapshot of their descriptive statistics. This helps us understand the spread and central tendency of these important variables.

In [7]:
#Selection and summary statistics of all numeric variables
num_cols = [col[0] for col in spark_df.dtypes if col[1] != 'string']
spark_df.select(num_cols).describe().show()

+-------+------------------+-----------------+-----------------+------------------+------------------+-----------------+------------------+-------------------+-------------------+-----------------+-------------------+
|summary|         RowNumber|       CustomerId|      CreditScore|               Age|            Tenure|          Balance|     NumOfProducts|          HasCrCard|     IsActiveMember|  EstimatedSalary|             Exited|
+-------+------------------+-----------------+-----------------+------------------+------------------+-----------------+------------------+-------------------+-------------------+-----------------+-------------------+
|  count|             10000|            10000|            10000|             10000|             10000|            10000|             10000|              10000|              10000|            10000|              10000|
|   mean|            5000.5|  1.56909405694E7|         650.5288|           38.9218|            5.0128|76485.88928799961|        

## Exploring Categorical Features

Let's now turn our attention to the categorical variables in our dataset. We'll select these columns and run `describe()` on them. For string columns this only reports the non-null count plus the alphabetical minimum and maximum (mean and standard deviation come back as `NULL`), so treat it as a quick sanity check rather than a frequency breakdown. It's still a useful step before we encode these features for our machine learning model.

In [8]:
#Selection and summary of all categorical variables
cat_cols = [col[0] for col in spark_df.dtypes if col[1] == 'string']
spark_df.select(cat_cols).describe().show()

+-------+-------+---------+------+
|summary|Surname|Geography|Gender|
+-------+-------+---------+------+
|  count|  10000|    10000| 10000|
|   mean|   NULL|     NULL|  NULL|
| stddev|   NULL|     NULL|  NULL|
|    min|  Abazu|   France|Female|
|    max| Zuyeva|    Spain|  Male|
+-------+-------+---------+------+



## Analyzing Numerical Features by Churn Status

Let's investigate how the average values of our numerical features differ between customers who churned (`Exited`=1) and those who didn't (`Exited`=0). Variables whose group means differ noticeably are candidates for strong predictors of churn. Keep in mind that `num_cols` still includes the `RowNumber` and `CustomerId` identifiers, so their averages below aren't meaningful.

In [9]:
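# mean of numerical variables relative to the target variable
# (loop variable is named c so it doesn't shadow pyspark.sql.functions.col, which a later cell imports)
for c in [name.lower() for name in num_cols]:
    spark_df.groupby("exited").agg({c: "mean"}).show()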

+------+-----------------+
|exited|   avg(rownumber)|
+------+-----------------+
|     1|4905.917525773196|
|     0|5024.694964209469|
+------+-----------------+

+------+--------------------+
|exited|     avg(customerid)|
+------+--------------------+
|     1|1.5690051964653904E7|
|     0|1.5691167881702876E7|
+------+--------------------+

+------+-----------------+
|exited| avg(creditscore)|
+------+-----------------+
|     1|645.3514972999509|
|     0|651.8531960316463|
+------+-----------------+

+------+-----------------+
|exited|         avg(age)|
+------+-----------------+
|     1| 44.8379970544919|
|     0|37.40838879819164|
+------+-----------------+

+------+-----------------+
|exited|      avg(tenure)|
+------+-----------------+
|     1|4.932744231713304|
|     0|5.033278914981791|
+------+-----------------+

+------+-----------------+
|exited|     avg(balance)|
+------+-----------------+
|     1|91108.53933726063|
|     0|72745.29677885193|
+------+-----------------+

+---
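
Raw means on very different scales are hard to compare by eye. As a rough sketch, the plain-Python snippet below takes four pairs of group means copied from the output above and expresses each gap relative to the "stayed" group. The helper `relative_gap` is a hypothetical name introduced here, not part of the notebook.

```python
# Group means copied from the output above: (churned, stayed).
means = {
    "creditscore": (645.3514972999509, 651.8531960316463),
    "age": (44.8379970544919, 37.40838879819164),
    "tenure": (4.932744231713304, 5.033278914981791),
    "balance": (91108.53933726063, 72745.29677885193),
}

def relative_gap(churned, stayed):
    """Gap between the churned and stayed means, as a fraction of the stayed mean."""
    return (churned - stayed) / stayed

gaps = {name: relative_gap(c, s) for name, (c, s) in means.items()}

# Print the largest absolute gap first.
for name, gap in sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:12s} {gap:+.1%}")
```

On these numbers `balance` and `age` stand out, while `tenure` and `creditscore` barely move. A gap in means is only a hint, not a significance test, so it's worth confirming with the model later.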

## Checking for Missing Values

Data cleaning is a vital step before we can train any machine learning model. Let's check if there are any missing values in our dataset. Identifying and handling these nulls is important for ensuring the quality of our data and the performance of our model. We'll count the number of missing values in each column and display the results.

In [10]:
#Missing Values
from pyspark.sql.functions import when, count, col
spark_df.select([count(when(col(c).isNull(), c)).alias(c) for c in spark_df.columns]).toPandas().T

Unnamed: 0,0
RowNumber,0
CustomerId,0
Surname,0
CreditScore,0
Geography,0
Gender,0
Age,0
Tenure,0
Balance,0
NumOfProducts,0
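
The `count(when(col(c).isNull(), c))` idiom can look odd at first: `when` without an `otherwise` yields null wherever the condition is false, and `count` skips nulls, so only the rows where the column *is* null get counted. Here's a plain-Python sketch of that logic on a few made-up rows (the rows and the `null_counts` helper are hypothetical, with `None` standing in for a Spark `NULL`).

```python
# Hypothetical toy rows; None plays the role of a Spark NULL.
rows = [
    {"age": 42, "balance": 0.0},
    {"age": None, "balance": 83807.86},
    {"age": 39, "balance": None},
    {"age": None, "balance": 125510.82},
]

def null_counts(rows):
    """Count missing values per column, mirroring count(when(col.isNull(), ...))."""
    result = {}
    for c in rows[0].keys():
        # when(...) keeps a marker only where the value is null ...
        markers = [c if row[c] is None else None for row in rows]
        # ... and count(...) counts only the non-null markers.
        result[c] = sum(1 for m in markers if m is not None)
    return result

print(null_counts(rows))
```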


## Standardizing Column Names

To ensure consistency and make it easier to work with our DataFrame, let's convert all column names to lowercase. This is a common practice in data processing pipelines.

In [11]:
spark_df = spark_df.toDF(*[c.lower() for c in spark_df.columns])
spark_df.show(5)

+---------+----------+--------+-----------+---------+------+---+------+---------+-------------+---------+--------------+---------------+------+
|rownumber|customerid| surname|creditscore|geography|gender|age|tenure|  balance|numofproducts|hascrcard|isactivemember|estimatedsalary|exited|
+---------+----------+--------+-----------+---------+------+---+------+---------+-------------+---------+--------------+---------------+------+
|        1|  15634602|Hargrave|        619|   France|Female| 42|     2|      0.0|            1|        1|             1|      101348.88|     1|
|        2|  15647311|    Hill|        608|    Spain|Female| 41|     1| 83807.86|            1|        0|             1|      112542.58|     0|
|        3|  15619304|    Onio|        502|   France|Female| 42|     8| 159660.8|            3|        1|             0|      113931.57|     1|
|        4|  15701354|    Boni|        699|   France|Female| 39|     1|      0.0|            2|        0|             0|       93826.63|

## Feature Interaction

Let's move on to creating some new features! We'll start by dropping columns that aren't relevant for our model. Then, we'll engineer some interaction terms by combining existing features. These new features might help our model pick up on more nuanced patterns in the data that could be related to churn.

In [12]:
# Feature Interaction
spark_df = spark_df.drop('rownumber', "customerid", "surname")
spark_df = spark_df.withColumn('creditscore_salary', spark_df.creditscore / spark_df.estimatedsalary)
spark_df = spark_df.withColumn('creditscore_tenure', spark_df.creditscore * spark_df.tenure)
spark_df = spark_df.withColumn('balance_salary', spark_df.balance / spark_df.estimatedsalary)
spark_df.show(5)

+-----------+---------+------+---+------+---------+-------------+---------+--------------+---------------+------+--------------------+------------------+------------------+
|creditscore|geography|gender|age|tenure|  balance|numofproducts|hascrcard|isactivemember|estimatedsalary|exited|  creditscore_salary|creditscore_tenure|    balance_salary|
+-----------+---------+------+---+------+---------+-------------+---------+--------------+---------------+------+--------------------+------------------+------------------+
|        619|   France|Female| 42|     2|      0.0|            1|        1|             1|      101348.88|     1|0.006107615594765329|              1238|               0.0|
|        608|    Spain|Female| 41|     1| 83807.86|            1|        0|             1|      112542.58|     0|0.005402399696186101|               608|0.7446769036217226|
|        502|   France|Female| 42|     8| 159660.8|            3|        1|             0|      113931.57|     1|0.004406153623618106| 
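
One thing to watch with ratio features is a zero denominator. Depending on the Spark version and its ANSI settings, dividing by zero may yield a null or raise an error, so it's worth deciding up front how such rows should be treated. The plain-Python sketch below shows the "return a missing value" choice; `safe_ratio` is a hypothetical helper, and the first two rows are copied from the output above.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is 0 or missing."""
    if numerator is None or not denominator:
        return None
    return numerator / denominator

# First two rows from the output above: (creditscore, balance, estimatedsalary).
rows = [(619, 0.0, 101348.88), (608, 83807.86, 112542.58)]
for creditscore, balance, salary in rows:
    print(safe_ratio(creditscore, salary), safe_ratio(balance, salary))

# A hypothetical customer with a recorded salary of 0 yields None, not an exception.
print(safe_ratio(650, 0.0))
```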

## Bucketization: Transforming Age into Categories

To potentially capture different patterns across age groups, let's transform the numerical `age` feature into categorical bins. We'll define specific age ranges and assign each customer to an `age_cat` based on their age. This can sometimes help the model by providing a different perspective on this feature.

In [13]:
# Bucketization / binning: numeric to categorical
spark_df.select("age").summary("count", "min", "25%", "50%", "75%", "max").show()
# Buckets are [0, 35), [35, 55), [55, 75) and [75, 95]; the observed maximum age is 92
bucketizer = Bucketizer(splits=[0, 35, 55, 75, 95], inputCol="age", outputCol="age_cat")
spark_df = bucketizer.setHandleInvalid("keep").transform(spark_df)
# Shift the 0-based bucket index so the categories run from 1 to 4
spark_df = spark_df.withColumn('age_cat', spark_df.age_cat + 1)

+-------+-----+
|summary|  age|
+-------+-----+
|  count|10000|
|    min|   18|
|    25%|   32|
|    50%|   37|
|    75%|   44|
|    max|   92|
+-------+-----+
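
To make the bucket boundaries concrete, here's a plain-Python sketch of the rule `Bucketizer` applies to these splits: each bucket is a half-open interval `[lo, hi)`, except the last one, which also includes its upper bound. The `age_bucket` helper is a hypothetical stand-in written for illustration, and the out-of-range handling reflects my reading of the `Bucketizer` documentation rather than anything shown in this notebook.

```python
from bisect import bisect_right

SPLITS = [0, 35, 55, 75, 95]  # same boundaries as the Bucketizer above

def age_bucket(age, splits=SPLITS):
    """Return the 0-based bucket index for age, using half-open intervals [lo, hi)."""
    if age < splits[0] or age > splits[-1]:
        raise ValueError(f"{age} is outside [{splits[0]}, {splits[-1]}]")
    if age == splits[-1]:
        # The last bucket also includes its upper bound.
        return len(splits) - 2
    return bisect_right(splits, age) - 1

for age in (18, 34, 35, 42, 92):
    # +1 mirrors the notebook's shift to 1-based age_cat values.
    print(age, age_bucket(age) + 1)
```

Note that a 35-year-old lands in the second bucket, not the first. With the quartiles shown above (32, 37, 44), most customers fall into the first two buckets.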



## Converting Age Categories to Integer Type

We've successfully bucketized the `age` column into `age_cat`. Since these categories are discrete, let's convert the data type of the `age_cat` column from double to integer for better data representation.

In [14]:
# converting double values to integer
spark_df = spark_df.withColumn("age_cat", spark_df["age_cat"].cast("integer"))

## Encoding Categorical Features: Gender

Now it's time to handle our categorical features. We'll start by encoding the `gender` column. Using PySpark's `StringIndexer`, we'll convert the 'Male' and 'Female' categories into numerical labels. This is a necessary step before we can use this feature in our machine learning model. We'll also convert the resulting column to an integer type and drop the original string column.

In [15]:
indexer = StringIndexer(inputCol="gender", outputCol="gender_label")
# Fit the indexer once and reuse the transformed DataFrame
temp_sdf = indexer.fit(spark_df).transform(spark_df)
temp_sdf.show(5)
spark_df = temp_sdf.withColumn("gender_label", temp_sdf["gender_label"].cast("integer"))
spark_df = spark_df.drop('gender')
spark_df.show(5)

+-----------+---------+------+---+------+---------+-------------+---------+--------------+---------------+------+--------------------+------------------+------------------+-------+------------+
|creditscore|geography|gender|age|tenure|  balance|numofproducts|hascrcard|isactivemember|estimatedsalary|exited|  creditscore_salary|creditscore_tenure|    balance_salary|age_cat|gender_label|
+-----------+---------+------+---+------+---------+-------------+---------+--------------+---------------+------+--------------------+------------------+------------------+-------+------------+
|        619|   France|Female| 42|     2|      0.0|            1|        1|             1|      101348.88|     1|0.006107615594765329|              1238|               0.0|      2|         1.0|
|        608|    Spain|Female| 41|     1| 83807.86|            1|        0|             1|      112542.58|     0|0.005402399696186101|               608|0.7446769036217226|      2|         1.0|
|        502|   France|Female|
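
Why does `Female` come out as `1.0`? By default `StringIndexer` orders labels by descending frequency, so the most common label gets index 0, which suggests `Male` is the more frequent value here. The plain-Python sketch below mimics that ordering on a tiny made-up column; `frequency_index` is a hypothetical helper, and the alphabetical tie-break is my understanding of the documented behavior.

```python
from collections import Counter

def frequency_index(values):
    """Map each label to an index: most frequent first, ties broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical toy column, not the real gender counts.
toy_gender = ["Male", "Female", "Male", "Male", "Female"]
print(frequency_index(toy_gender))
```

Because the mapping depends on the data the indexer was fitted on, the same fitted indexer should be reused for any new data we score later.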

## Encoding Categorical Features: Geography

Following the same process as with 'gender', we'll now encode the `geography` column. We'll use `StringIndexer` to map the different countries to numerical indices. This is another crucial step to prepare this categorical feature for our machine learning model. We'll then convert the resulting column to an integer type and remove the original string column.

In [16]:
indexer = StringIndexer(inputCol="geography", outputCol="geography_label")
# Fit the indexer once and reuse the transformed DataFrame
temp_sdf = indexer.fit(spark_df).transform(spark_df)
temp_sdf.show(5)
spark_df = temp_sdf.withColumn("geography_label", temp_sdf["geography_label"].cast("integer"))
spark_df = spark_df.drop('geography')

+-----------+---------+---+------+---------+-------------+---------+--------------+---------------+------+--------------------+------------------+------------------+-------+------------+---------------+
|creditscore|geography|age|tenure|  balance|numofproducts|hascrcard|isactivemember|estimatedsalary|exited|  creditscore_salary|creditscore_tenure|    balance_salary|age_cat|gender_label|geography_label|
+-----------+---------+---+------+---------+-------------+---------+--------------+---------------+------+--------------------+------------------+------------------+-------+------------+---------------+
|        619|   France| 42|     2|      0.0|            1|        1|             1|      101348.88|     1|0.006107615594765329|              1238|               0.0|      2|           1|            0.0|
|        608|    Spain| 41|     1| 83807.86|            1|        0|             1|      112542.58|     0|0.005402399696186101|               608|0.7446769036217226|      2|           1|  
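
One caveat with a plain integer index for `geography`: a linear model will read 0, 1, 2 as an ordered scale, even though countries have no natural order. One-hot encoding avoids that, and it's what the imported `OneHotEncoder` is for. As a rough plain-Python sketch of the idea, assuming three distinct geography values and a drop-last convention (which I believe is `OneHotEncoder`'s default), each index becomes a small indicator vector. The `one_hot` helper is hypothetical and returns dense lists, whereas Spark would produce sparse vectors.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index."""
    if not 0 <= index < num_categories:
        raise ValueError(f"index {index} is outside 0..{num_categories - 1}")
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

# Assuming geography has three categories, indices run 0..2.
for idx in range(3):
    print(idx, one_hot(idx, 3))
```

With drop-last, the final category is represented by the all-zeros vector, which keeps the indicator columns from being perfectly collinear.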