<a href="https://colab.research.google.com/github/amien1410/colab-notebooks/blob/main/Colab_Pyspark_Churn_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#@title Install Kaggle modules and download the dataset

from google.colab import drive
drive.mount('/content/drive')

!pip install kaggle
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'
!kaggle datasets download -d halimedogan/churn-dataset
!unzip -q "/content/churn-dataset.zip"

Mounted at /content/drive
Dataset URL: https://www.kaggle.com/datasets/halimedogan/churn-dataset
License(s): unknown
Downloading churn-dataset.zip to /content
  0% 0.00/262k [00:00<?, ?B/s]
100% 262k/262k [00:00<00:00, 627MB/s]


## 📊 Analyzing Customer Churn with PySpark

Hey there! Let's dive into this customer churn dataset. You know the drill – keeping existing customers happy is way more cost-effective than finding new ones. For us folks in the banking world, understanding *why* a client might pack their bags and leave is gold. If we can get ahead of churn, we can cook up some sweet loyalty programs and retention campaigns to keep our customers sticking around.

Here's a quick rundown of what we're working with in this dataset, keeping an eye out for features that might be handy in our PySpark analysis:

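*   **Surname**: The customer's last name. It identifies a person rather than describing behavior, so it's pretty much noise for our analysis (same goes for `RowNumber` and `CustomerId`). We won't be feeding this into our models.
*   **CreditScore**: At first glance this one doesn't look strongly tied to churn, so we'll leave it out of our first feature set for PySpark. It's worth revisiting once we've checked it against `Exited`.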
*   **Geography**: This *could* be interesting. Location can definitely play a role in customer behavior. We'll want to consider this in our feature engineering – maybe one-hot encoding for regions? 🌍
*   **Gender**: Let's see if gender has any sway. Another candidate for categorical feature handling in PySpark. 🚻
*   **Age**: Definitely a key one! Older customers often seem more rooted. This is a strong candidate for our models. 🎂
*   **Tenure**: How long a customer's been with us. Makes sense that longer tenure might mean more loyalty. 🕰️
*   **NumOfProducts**: How many banking products a customer uses. More products could imply more entanglement with the bank. 🛒
*   **HasCrCard**: Do they have a credit card with us? This might indicate a stronger tie to the bank. Another one to consider for feature vectors. 💳
*   **IsActiveMember**: Are they actively using their account? Seems intuitive that active users are less likely to leave. ✅
*   **EstimatedSalary**: As with account balance (next item), higher salaries might mean less likelihood to leave. The values run into six figures, so we'll want to scale this one before modeling. 💰
*   **Balance**: A really important feature! Customers with more cash in their accounts might be less likely to jump ship. This is probably a strong predictor. 💲
*   **Exited** (Our Target Variable!): This is what we're trying to predict – whether the customer churned or not. This will be our label column for training our PySpark machine learning models. 🎯

So, when we get into the PySpark code, we'll be focusing on transforming and vectorizing these features (Geography, Gender, Age, Tenure, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Balance) to build a model to predict `Exited`.

Let's fire up the Spark context and get to it! 🔥


## 🛠️ Setting up our PySpark Environment

Before we can jump into analyzing the data and building our churn prediction model, we need to make sure we have all the necessary tools in place. The key piece is `pyspark`, which is all we need to run Spark locally inside the notebook. The cell also installs `sparkmagic`, a set of tools for working with remote Spark clusters through Livy; the code in this notebook doesn't rely on it. We'll also import a few other helpful libraries for data manipulation, visualization, and machine learning with PySpark.

This setup ensures we have a smooth workflow as we move through the data preprocessing, feature engineering, and model training stages.

In [2]:
#@title Install and Import Libraries

!pip install sparkmagic
!pip install pyspark

# libraries
import warnings
# import findspark
import pandas as pd
import seaborn as sns
from pyspark.ml.classification import GBTClassifier, LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder, StandardScaler
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

Collecting sparkmagic
  Downloading sparkmagic-0.22.0.tar.gz (45 kB)
  Preparing metadata (setup.py) ... done
Collecting hdijupyterutils>=0.6 (from sparkmagic)
  Downloading hdijupyterutils-0.22.0.tar.gz (5.1 kB)
  Preparing metadata (setup.py) ... done
Collecting autovizwidget>=0.6 (from sparkmagic)
  Downloading autovizwidget-0.22.0.tar.gz (9.0 kB)
  Preparing metadata (setup.py) ... done
Collecting requests_kerberos>=0.8.0 (from sparkmagic)
  Downloading requests_kerberos-0.15.0-py2.py3-none-any.whl.metadata (12 kB)
Collecting jupyter>=1 (from hdijupyterutils>=0.6->sparkmagic)
  Downloading jupyter-1.1.1-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting jedi>=0.16 (from ipython>=4.0.2->sparkmagic)
  Downloading 

## 🔥 Firing up Spark and Loading the Data

Now that our libraries are in place, it's time to get Spark going! We'll create a Spark session, which is the entry point for any Spark functionality. Then, we'll load our churn dataset (`churn2.csv`) into a Spark DataFrame. With `master("local[*]")`, Spark runs on all the cores of the Colab machine rather than on a cluster, but the same DataFrame code would carry over to a distributed setup. Finally, we'll take a quick peek at the first few rows to make sure everything loaded correctly.

In [3]:
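# Start (or reuse) a local Spark session that uses every available core
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Read the CSV with a header row and let Spark infer the column types
spark_df = spark.read.csv('/content/churn2.csv', inferSchema=True, header=True)
spark_df.show(10)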

+---------+----------+--------+-----------+---------+------+---+------+---------+-------------+---------+--------------+---------------+------+
|RowNumber|CustomerId| Surname|CreditScore|Geography|Gender|Age|Tenure|  Balance|NumOfProducts|HasCrCard|IsActiveMember|EstimatedSalary|Exited|
+---------+----------+--------+-----------+---------+------+---+------+---------+-------------+---------+--------------+---------------+------+
|        1|  15634602|Hargrave|        619|   France|Female| 42|     2|      0.0|            1|        1|             1|      101348.88|     1|
|        2|  15647311|    Hill|        608|    Spain|Female| 41|     1| 83807.86|            1|        0|             1|      112542.58|     0|
|        3|  15619304|    Onio|        502|   France|Female| 42|     8| 159660.8|            3|        1|             0|      113931.57|     1|
|        4|  15701354|    Boni|        699|   France|Female| 39|     1|      0.0|            2|        0|             0|       93826.63|

## 👀 Getting to Know Our Data: Initial Exploration

With our Spark DataFrame ready, it's time for some initial exploration. We'll start by checking the dimensions of the dataset (number of rows and columns), examining the schema to understand the data types of each column, and generating some descriptive statistics. This gives us a foundational understanding of our data before we dive deeper into feature engineering and modeling.

In [4]:
print("Shape: ", (spark_df.count(), len(spark_df.columns)))
spark_df.printSchema()
spark_df.describe().show()

Shape:  (10000, 14)
root
 |-- RowNumber: integer (nullable = true)
 |-- CustomerId: integer (nullable = true)
 |-- Surname: string (nullable = true)
 |-- CreditScore: integer (nullable = true)
 |-- Geography: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tenure: integer (nullable = true)
 |-- Balance: double (nullable = true)
 |-- NumOfProducts: integer (nullable = true)
 |-- HasCrCard: integer (nullable = true)
 |-- IsActiveMember: integer (nullable = true)
 |-- EstimatedSalary: double (nullable = true)
 |-- Exited: integer (nullable = true)

+-------+------------------+-----------------+-------+-----------------+---------+------+------------------+------------------+-----------------+------------------+-------------------+-------------------+-----------------+-------------------+
|summary|         RowNumber|       CustomerId|Surname|      CreditScore|Geography|Gender|               Age|            Tenure|          Balance|    