<a href="https://colab.research.google.com/github/bdh777psu/UCSD-MLE_Bootcamp_Projects/blob/main/Customer_Churn_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mini-Project: An End-to-End Churn Prediction Model Using AWS

Churn prediction matters to most businesses, especially those with subscription-based models or high customer turnover. Churn occurs when customers stop using a product or service. Predicting it is important for several reasons:

1. **Revenue Impact:**
   - Retaining existing customers is often more cost-effective than acquiring new ones. Losing customers means losing the associated revenue. Churn prediction helps businesses identify customers at risk of leaving so that proactive measures can be taken to retain them.

2. **Resource Allocation:**
   - By predicting which customers are likely to churn, businesses can allocate resources more efficiently. They can concentrate efforts and resources on retaining high-value customers who are at a higher risk of leaving, rather than applying blanket retention strategies.

3. **Customer Experience Improvement:**
   - Understanding why customers churn reveals areas that need improvement, such as product quality, customer service, or competitive pressure. Identifying and addressing these issues can improve the overall customer experience.

In this mini-project, you'll build an end-to-end churn prediction model using Amazon SageMaker, a fully managed service for building, training, and deploying machine learning models at scale in a production environment. Follow the instructions in [this AWS blog post](https://aws.amazon.com/blogs/machine-learning/build-tune-and-deploy-an-end-to-end-churn-prediction-model-using-amazon-sagemaker-pipelines/) to build an end-to-end churn prediction model using AWS.

## Customer Churn Model Using XGBoost Framework

### 1. Customer Retention Retail Dataset

This dataset can be used to identify which marketing strategies, based on consumer behaviour, a retail store can adopt to increase customer retention.

The data comes from an online tea retail store that sells tea of different flavors across various cities in India. It describes the store's customers, their orders, quantity ordered, order frequency, city, etc. The dataset is large enough to support meaningful analysis.

Reference: https://www.kaggle.com/uttamp/store-data

In [1]:
%%html
<style>
table {float:left}
</style>

| Column | Description |
|--|--|
| custid | Computer-generated ID that identifies the customer throughout the database |
| retained | 1 if the customer is assumed to be active, 0 otherwise |
| created | Date when the contact was created in the database (when the customer joined) |
| firstorder | Date when the customer placed their first order |
| lastorder | Date when the customer placed their last order |
| esent | Number of emails sent |
| eopenrate | Number of emails opened divided by number of emails sent |
| eclickrate | Number of emails clicked divided by number of emails sent |
| avgorder | Average order size for the customer |
| ordfreq | Number of orders divided by customer tenure |
| paperless | 1 if the customer subscribed to paperless communication (online only) |
| refill | 1 if the customer subscribed to automatic refill |
| doorstep | 1 if the customer subscribed to doorstep delivery |
| train | 1 if the customer is in the training database |
| favday | Customer's favorite delivery day |
| city | City where the customer resides |

### 2. Import Packages and Constants

Install the shap and smdebug packages if they are not already installed, then restart the kernel.

In [2]:
!conda install -c conda-forge shap --yes
!pip install smdebug --upgrade

/bin/bash: line 1: conda: command not found
Collecting smdebug
  Downloading smdebug-1.0.34-py2.py3-none-any.whl (280 kB)
Collecting boto3>=1.10.32 (from smdebug)
  Downloading boto3-1.34.82-py3-none-any.whl (139 kB)
Collecting pyinstrument==3.4.2 (from smdebug)
  Downloading pyinstrument-3.4.2-py2.py3-none-any.whl (83 kB)
Collecting pyinstrument-cext>=0.2.2 (from pyinstrument==3.4.2->smdebug)
  Downloading pyinstrument_cext-0.2.4.tar.gz (4.8 kB)
  Preparing metadata (setup.py) ... done
Collecting botocore<1.35.0,>=1.34.82 (from boto3>=1.10.32->smdebug)
  Downloading botocore-1.34.82-py3-none-any.whl (12.

In [3]:
import re
import s3fs
import shap
import time
import boto3
import pandas as pd
import numpy as np

from itertools import islice
import matplotlib.pyplot as plt

import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput
from sagemaker.debugger import DebuggerHookConfig,CollectionConfig
from sagemaker.debugger import rule_configs, Rule
from smdebug.trials import create_trial
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner
)

ModuleNotFoundError: No module named 's3fs'

In [None]:
# Replace this value with the name of the S3 bucket you created

default_bucket = "customer-churn-sm-pipeline"

In [None]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
region = sagemaker_session.boto_region_name

### 3. Preprocess Data

In [None]:
def preprocess_data(file_path):
    df = pd.read_csv(file_path)
    # Convert the order dates to datetime; unparseable values become NaT
    df["firstorder"] = pd.to_datetime(df["firstorder"], errors="coerce")
    df["lastorder"] = pd.to_datetime(df["lastorder"], errors="coerce")
    # Drop rows with null values
    df = df.dropna()
    # Days between the first order and the last order
    df["first_last_days_diff"] = (df["lastorder"] - df["firstorder"]).dt.days
    # Days between the first order and the creation of the customer record
    df["created"] = pd.to_datetime(df["created"])
    df["created_first_days_diff"] = (df["created"] - df["firstorder"]).dt.days
    # Drop the identifier and the raw date columns
    df = df.drop(columns=["custid", "created", "firstorder", "lastorder"])
    # One-hot encode favday and city; dtype=int keeps the indicators as 0/1
    # instead of booleans, so the exported CSV stays purely numeric
    df = pd.get_dummies(df, prefix=["favday", "city"], columns=["favday", "city"], dtype=int)
    return df
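
To see what the date-difference and one-hot steps produce, here is a minimal sketch on a tiny in-memory frame. The sample rows and the helper name `derive_features` are made up for illustration; only the column names come from the table in section 1.

```python
import pandas as pd

# Hypothetical sample rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "created": ["2012-09-28", "2010-12-19"],
    "firstorder": ["2013-08-11", "2011-04-01"],
    "lastorder": ["2013-08-20", "not a date"],
    "favday": ["Monday", "Friday"],
})


def derive_features(df):
    """Illustrative helper: parse dates, drop bad rows, add day differences."""
    out = df.copy()
    for col in ["created", "firstorder", "lastorder"]:
        # Unparseable values become NaT so that dropna() can remove them.
        out[col] = pd.to_datetime(out[col], errors="coerce")
    out = out.dropna()
    out["first_last_days_diff"] = (out["lastorder"] - out["firstorder"]).dt.days
    out["created_first_days_diff"] = (out["created"] - out["firstorder"]).dt.days
    out = out.drop(columns=["created", "firstorder", "lastorder"])
    # dtype=int keeps the indicator columns numeric (0/1) rather than boolean.
    return pd.get_dummies(out, columns=["favday"], prefix=["favday"], dtype=int)


features = derive_features(toy)
print(features)
```

The second row is dropped because its `lastorder` cannot be parsed. Note that `created_first_days_diff` is negative whenever the record was created before the first order.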

In [None]:
storedata = preprocess_data(f"s3://{default_bucket}/data/storedata_total.csv")

In [None]:
storedata.head()

### 4. Split Train, Test and Validation Datasets

In [None]:
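def split_datasets(df):
    # Move the label into the first column of the array
    y = df.pop("retained")
    X_pre = df
    y_pre = y.to_numpy().reshape(len(y), 1)
    feature_names = list(X_pre.columns)
    X = np.concatenate((y_pre, X_pre), axis=1)
    # Shuffle the rows, then split 70% / 15% / 15% into train, validation and test
    np.random.shuffle(X)
    train, validation, test = np.split(X, [int(0.7 * len(X)), int(0.85 * len(X))])
    return feature_names, train, validation, test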
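
The two cut points passed to `np.split` are what produce the 70/15/15 partition. The sketch below repeats the idea on a toy array and, as an assumption not present in the notebook, uses a seeded generator so that the shuffle is repeatable between runs. The seed value and the toy data are arbitrary.

```python
import numpy as np

# Toy matrix of 20 rows; column 0 stands in for the label.
data = np.arange(40).reshape(20, 2)

# A seeded generator makes the shuffle repeatable (the seed is arbitrary).
rng = np.random.default_rng(42)
shuffled = rng.permutation(data)  # shuffles rows and returns a copy

# Cut points at 70% and 85% of the rows split the array into three parts.
n = len(shuffled)
train, validation, test = np.split(shuffled, [int(0.7 * n), int(0.85 * n)])

print(train.shape, validation.shape, test.shape)
```

Because `rng.permutation` returns a copy, the original array is left untouched, unlike the in-place `np.random.shuffle`.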

In [None]:
feature_names,train,validation,test = split_datasets(storedata)

In [None]:
pd.DataFrame(train).to_csv(f"s3://{default_bucket}/data/train/train.csv",header=False,index=False)
pd.DataFrame(validation).to_csv(f"s3://{default_bucket}/data/validation/validation.csv",header=False,index=False)
pd.DataFrame(test).to_csv(f"s3://{default_bucket}/data/test/test.csv",header=False,index=False)
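
The files are written with `header=False` and `index=False` because the SageMaker built-in XGBoost algorithm expects CSV training data with the label in the first column and no header row. As a rough illustration, the following sketch writes two made-up rows to an in-memory buffer instead of S3 and shows the resulting layout.

```python
import io

import numpy as np
import pandas as pd

# Toy rows: label first, then two feature columns (values are made up).
rows = np.array([[1, 0.5, 3.0],
                 [0, 0.1, 7.0]])

buffer = io.StringIO()
# header=False and index=False leave only the raw values in the output.
pd.DataFrame(rows).to_csv(buffer, header=False, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# The first field of every line is the label.
first_line = csv_text.splitlines()[0]
label = float(first_line.split(",")[0])
```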

### 5. Hyperparameter Tuning (HPO)

In [None]:
s3_input_train = TrainingInput(
    s3_data=f"s3://{default_bucket}/data/train/",content_type="csv")
s3_input_validation = TrainingInput(
    s3_data=f"s3://{default_bucket}/data/validation/",content_type="csv")

In [None]:
fixed_hyperparameters = {
    "eval_metric":"auc",
    "objective":"binary:logistic",
    "num_round":"100",
    "rate_drop":"0.3",
    "tweedie_variance_power":"1.4"
}

In [None]:
sess = sagemaker.Session()
container = sagemaker.image_uris.retrieve("xgboost",region,"0.90-2")

estimator = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    hyperparameters=fixed_hyperparameters,
    instance_type="ml.m4.xlarge",
    output_path="s3://{}/output".format(default_bucket),
    sagemaker_session=sagemaker_session
)

In [None]:
hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "min_child_weight": ContinuousParameter(1, 10),
    "alpha": ContinuousParameter(0, 2),
    "max_depth": IntegerParameter(1, 10),
}

In [None]:
objective_metric_name = "validation:auc"
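
The `validation:auc` objective is the area under the ROC curve measured on the validation channel. One way to read it is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up labels and scores; the function name `pairwise_auc` is illustrative and is not part of SageMaker or XGBoost.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Assumes both classes are present.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.2]
auc = pairwise_auc(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

The quadratic loop is only meant to make the definition concrete; it is not how the metric would be computed on a large validation set.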

In [None]:
tuner = HyperparameterTuner(
    estimator,objective_metric_name,hyperparameter_ranges,max_jobs=10,max_parallel_jobs=2)

In [None]:
tuner.fit({
    "train":s3_input_train,
    "validation":s3_input_validation
    },include_cls_metadata=False)

In [None]:
tuning_job_result = boto3.client("sagemaker").describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)

In [None]:
job_count = tuning_job_result["TrainingJobStatusCounters"]["Completed"]
print("%d training jobs have completed" %job_count)

In [None]:
from pprint import pprint

if tuning_job_result.get("BestTrainingJob",None):
    print("Best Model found so far:")
    pprint(tuning_job_result["BestTrainingJob"])
else:
    print("No training jobs have reported results yet.")

In [None]:
best_hyperparameters = tuning_job_result["BestTrainingJob"]["TunedHyperParameters"]

In [None]:
best_hyperparameters

### 7. XGBoost Model with SageMaker Debugger

In [None]:
hyperparameters = {**fixed_hyperparameters,**best_hyperparameters}
save_interval = 5
base_job_name = "demo-smdebug-xgboost-churn-classification"
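
The merge relies on dictionary unpacking, where later mappings override earlier ones, so a tuned value replaces any fixed value with the same key. A minimal sketch with made-up values (the tuning API reports tuned hyperparameters as strings, which is why the sample values are strings too):

```python
# Made-up values for illustration only.
fixed = {"objective": "binary:logistic", "num_round": "100", "eta": "0.3"}
tuned = {"eta": "0.12", "max_depth": "6"}

# Later mappings win on key conflicts, so tuned values take precedence.
merged = {**fixed, **tuned}
print(merged)
```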

In [None]:
container = sagemaker.image_uris.retrieve("xgboost",region,"0.90-2")

In [None]:
estimator = sagemaker.estimator.Estimator(
    container,
    role,
    base_job_name=base_job_name,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path="s3://{}/output".format(default_bucket),
    sagemaker_session=sess,
    hyperparameters=hyperparameters,
    max_run=1800,
    debugger_hook_config = DebuggerHookConfig(
        s3_output_path=f"s3://{default_bucket}/debugger/",  # Required
        collection_configs=[
            CollectionConfig(
                name="metrics",
                parameters={
                    "save_interval": "5"
                }),
            CollectionConfig(
                name="feature_importance", parameters={"save_interval": "5"}
            ),
            CollectionConfig(name="full_shap", parameters={"save_interval": "5"}),
            CollectionConfig(name="average_shap", parameters={"save_interval": "5"}),
        ]
    ),
    rules=[
        Rule.sagemaker(
            rule_configs.loss_not_decreasing(),
            rule_parameters={
                "collection_names": "metrics",
                "num_steps": "10",
            },
        ),
    ]
)

In [None]:
estimator.fit(
        {"train":s3_input_train,"validation":s3_input_validation},wait=False
    )

In [None]:
for _ in range(36):
    job_name = estimator.latest_training_job.name
    client = estimator.sagemaker_session.sagemaker_client
    description = client.describe_training_job(TrainingJobName=job_name)
    training_job_status = description["TrainingJobStatus"]
    rule_job_summary = estimator.latest_training_job.rule_job_summary()
    rule_evaluation_status = rule_job_summary[0]["RuleEvaluationStatus"]
    print(
        "Training job status: {}, Rule Evaluation Status: {}".format(
            training_job_status, rule_evaluation_status
        )
    )
    if training_job_status in ["Completed", "Failed", "Stopped"]:
        break
    time.sleep(10)
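
The polling pattern used above can be factored into a small helper. This is only a sketch: the function name `wait_for_terminal_status` and the canned status sequence are assumptions for illustration, and the callable stands in for a real `describe_training_job` lookup so the example runs without AWS access.

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_terminal_status(get_status, max_polls=36, delay_seconds=10.0):
    """Poll get_status() until it returns a terminal state or polls run out.

    Returns the last status seen, which may be non-terminal on timeout.
    """
    status = None
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            break
        time.sleep(delay_seconds)
    return status


# Stand-in for describe_training_job: replays a canned sequence of statuses.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), delay_seconds=0)
print(final_status)
```

With the defaults of 36 polls and a 10 second delay, the helper gives up after about six minutes, which matches the loop in the notebook.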

### 8. Analyze Debugger Output

In [None]:
estimator.latest_training_job.rule_job_summary()

In [None]:
s3_output_path = estimator.latest_job_debugger_artifacts_path()
trial = create_trial(s3_output_path)

In [None]:
trial.tensor_names()

In [None]:
trial.tensor("average_shap/f1").values()

In [None]:
MAX_PLOTS = 35


def get_data(trial, tname):
    """
    For the given tensor name, walks through all the iterations
    for which you have data and fetches the values.
    Returns the set of steps and the values.
    """
    tensor = trial.tensor(tname)
    steps = tensor.steps()
    vals = [tensor.value(s) for s in steps]
    return steps, vals


def match_tensor_name_with_feature_name(tensor_name, feature_names=feature_names):
    feature_tag = tensor_name.split("/")
    for ifeat, feature_name in enumerate(feature_names):
        if feature_tag[-1] == "f{}".format(str(ifeat)):
            return feature_name
    return tensor_name


def plot_collection(trial, collection_name, regex=".*", figsize=(8, 6)):
    """
    Takes a `trial` and a collection name, and
    plots all tensors that match the given regex.
    """
    fig, ax = plt.subplots(figsize=figsize)
    tensors = trial.collection(collection_name).tensor_names
    matched_tensors = [t for t in tensors if re.match(regex, t)]
    for tensor_name in islice(matched_tensors, MAX_PLOTS):
        steps, data = get_data(trial, tensor_name)
        ax.plot(steps, data, label=match_tensor_name_with_feature_name(tensor_name))

    ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
    ax.set_xlabel("Iteration")
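
As the matching function above implies, tensor names end in a positional tag such as `f1`, which has to be translated back to a column name for readable plot legends. The following sketch does the same lookup with a regular expression instead of a loop; the helper name `feature_for_tensor` and the shortened feature list are hypothetical.

```python
import re


def feature_for_tensor(tensor_name, feature_names):
    """Map a tensor name ending in f<index> to the feature at that index.

    Falls back to the tensor name when no valid index is found.
    """
    match = re.search(r"(?:^|/)f(\d+)$", tensor_name)
    if match:
        index = int(match.group(1))
        if index < len(feature_names):
            return feature_names[index]
    return tensor_name


names = ["esent", "eopenrate", "eclickrate"]  # shortened list for illustration
shap_label = feature_for_tensor("average_shap/f1", names)
metric_label = feature_for_tensor("validation-auc", names)
print(shap_label, metric_label)
```

Names without a feature tag, such as metric tensors, pass through unchanged, so the same helper can label every collection.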

In [None]:
plot_collection(trial, "metrics")

In [None]:
def plot_feature_importance(trial, importance_type="weight"):
    SUPPORTED_IMPORTANCE_TYPES = ["weight", "gain", "cover", "total_gain", "total_cover"]
    if importance_type not in SUPPORTED_IMPORTANCE_TYPES:
        raise ValueError(f"{importance_type} is not one of the supported importance types.")
    plot_collection(trial, "feature_importance", regex=f"feature_importance/{importance_type}/.*")

In [None]:
plot_feature_importance(trial, importance_type="cover")

### SHAP

In [None]:
plot_collection(trial, "average_shap")


### Global Explanations

In [None]:
shap_values = trial.tensor("full_shap/f0").value(trial.last_complete_step)
shap_no_base = shap_values[:, :-1]
shap_base_value = shap_values[0, -1]
shap.summary_plot(shap_no_base, plot_type="bar", feature_names=feature_names)

In [None]:
shap_base_value
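
The bar summary plot ranks features by the mean absolute SHAP value across rows. The sketch below reproduces that ranking by hand on a made-up matrix that follows the same layout as `full_shap` in the cell above: one column per feature, with the base value in the final column.

```python
import numpy as np

# Made-up SHAP matrix: 3 rows, 2 features, final column is the base value.
shap_values = np.array([
    [0.2, -0.5, 1.0],
    [-0.4, 0.1, 1.0],
    [0.0, 0.3, 1.0],
])
feature_names = ["esent", "avgorder"]

contributions = shap_values[:, :-1]  # per-feature contributions
base_value = shap_values[0, -1]      # same value in every row

# Global importance: average magnitude of each feature's contribution.
mean_abs = np.abs(contributions).mean(axis=0)
ranking = sorted(zip(feature_names, mean_abs), key=lambda pair: pair[1], reverse=True)
print(ranking)
```

Taking the absolute value first matters: positive and negative contributions would otherwise cancel and hide features that push predictions strongly in both directions.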

In [None]:
train_shap = pd.DataFrame(train[:,1:],columns=feature_names)

In [None]:
shap.summary_plot(shap_no_base, train_shap)

### Local Explanations

In [None]:
shap.initjs()

In [None]:
shap.force_plot(
    shap_base_value,
    shap_no_base[100, :],
    train_shap.iloc[100, :],
    link="logit",
    matplotlib=False,
)
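
With a `binary:logistic` objective, the SHAP contributions are expressed in log-odds. Adding a row's contributions to the base value gives the raw margin, and the logistic function turns that margin into a probability, which is the scale that `link="logit"` displays. A minimal sketch with made-up numbers (the helper name `margin_to_probability` is illustrative):

```python
import math


def margin_to_probability(base_value, contributions):
    """Add SHAP contributions to the base value and apply the logistic function."""
    margin = base_value + sum(contributions)
    return 1.0 / (1.0 + math.exp(-margin))


# Made-up values: a base log-odds of 1.0 and three feature contributions
# that sum to -1.0, so the margin is 0 and the probability is about 0.5.
probability = margin_to_probability(1.0, [0.4, -0.9, -0.5])
print(round(probability, 3))
```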

In [None]:
N_ROWS = shap_no_base.shape[0]
N_SAMPLES = min(100, N_ROWS)
sampled_indices = np.random.randint(N_ROWS, size=N_SAMPLES)

In [None]:
shap.force_plot(
    shap_base_value,
    shap_no_base[sampled_indices, :],
    train_shap.iloc[sampled_indices, :],
    link="logit",
)