# **CUSTOMER CHURN PREDICTION**
*CUSTOMER CHURN PREDICTION IS AN IMPORTANT TASK FOR ANY KIND OF BUSINESS. CHURN IS THE MEASURE OF CUSTOMERS LEAVING YOUR BUSINESS. IF IT IS NOT STUDIED PROPERLY, IT CAN PUT YOU OUT OF BUSINESS, BUT IF IT IS STUDIED AND ANALYSED PROPERLY, IT CAN HELP YOUR BUSINESS A LOT.*

![](https://blog.hubspot.com/hubfs/what-is-customer-churn.jpg)

**STEP 1) IMPORTING REQUIRED LIBRARIES**

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.tree import plot_tree, export_text
from tensorflow.keras.losses import BinaryCrossentropy
from sklearn.metrics import accuracy_score, f1_score
#pip install pytorch_tabular[all]
#pip install pytorch_tabular
from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import DataConfig, OptimizerConfig, TrainerConfig, ExperimentConfig
%matplotlib inline

**STEP 2) READING THE DATA**

In [2]:
df = pd.read_csv("../input/churn-dataset/churn_data.csv")
df.sample(5)

In [3]:
df.isna().sum()

In [4]:
df.dtypes

In [5]:
df[df.TotalCharges==' '].shape

In [6]:
#TAKING AN EXPLICIT COPY SO THAT LATER IN-PLACE EDITS DO NOT TRIGGER SettingWithCopyWarning
df1 = df[df.TotalCharges!=' '].copy()
df1.shape

*CHANGING DTYPE OF TotalCharges TO NUMERIC*

In [7]:
df1.TotalCharges = pd.to_numeric(df1.TotalCharges)
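
An alternative to filtering the blank strings first is to let `pd.to_numeric` mark them as missing and drop them afterwards. The sketch below uses a made-up three-row frame (`toy` is not part of the dataset) and assumes the blanks are the only unparseable values.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# errors="coerce" turns strings that cannot be parsed (such as ' ') into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Dropping the NaN rows then has the same effect as filtering out ' '
toy = toy.dropna(subset=["TotalCharges"])
print(n_missing, toy.shape)
```

This keeps the cleaning in one place and also catches other malformed entries, at the cost of hiding what exactly was wrong with them.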

*DROPPING customerID column BECAUSE IT IS USELESS FOR NOW*

In [8]:
df1.drop('customerID',axis='columns',inplace=True)

In [9]:
df1['Churn'].value_counts()
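
Raw counts are easier to interpret as shares, because the share of the larger class is the accuracy a model would get by always predicting that class. A minimal sketch on made-up labels (the 7/3 split is an assumption for illustration, not the dataset's real ratio):

```python
import pandas as pd

# Made-up labels; the real ratio comes from df1['Churn']
churn = pd.Series(["No"] * 7 + ["Yes"] * 3, name="Churn")

# normalize=True returns fractions instead of counts
shares = churn.value_counts(normalize=True)

# Accuracy of a "model" that always predicts the most common class
majority_share = float(shares.max())
print(shares)
print(majority_share)
```

Any accuracy reported later is worth comparing against this majority-class share.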

**STEP 3) EXPLORATORY DATA ANALYSIS**

In [10]:
plt.xlabel("Tenure")
plt.ylabel("Number Of Customers")
plt.title("Tenure VS No of Customers (Churned or Not Churned)")
plt.hist([df1[df1.Churn=='Yes'].tenure,df1[df1.Churn=='No'].tenure],label=['Churn=Yes','Churn=No'])
plt.legend();

In [11]:
plt.xlabel("Monthly Charges")
plt.ylabel("Number Of Customers")
plt.title("Monthly Charges VS No of Customers (Churned or Not Churned)")
plt.hist([df1[df1.Churn=='Yes'].MonthlyCharges,df1[df1.Churn=='No'].MonthlyCharges],label=['Churn=Yes','Churn=No'])
plt.legend();

In [12]:
plt.xlabel("Total Charges")
plt.ylabel("Number Of Customers")
plt.title("Total Charges VS No of Customers (Churned or Not Churned)")
plt.hist([df1[df1.Churn=='Yes'].TotalCharges,df1[df1.Churn=='No'].TotalCharges],label=['Churn=Yes','Churn=No'])
plt.legend();

In [13]:
df1.sample(4)

**STEP 4) PREPROCESSING THE DATA (IMPUTING, SCALING & ENCODING)**

In [14]:
def print_unique_col_values(df):
    lst = []
    for column in df:
        if df[column].dtypes=='object':
            print(f'{column}: {df[column].unique()}')
            #USING THE df ARGUMENT RATHER THAN THE GLOBAL df1
            lst.append([column]+list(df[column].unique()))
    return lst

In [15]:
a = print_unique_col_values(df1)

In [16]:
print(a)

In [17]:
df1.replace('No internet service','No',inplace=True)
df1.replace('No phone service','No',inplace=True)

In [18]:
lst_a = print_unique_col_values(df1)

In [19]:
lst_a

In [20]:
for cols in lst_a:
    if len(cols)==3:
        #ASSIGNING BACK INSTEAD OF A CHAINED inplace REPLACE, WHICH MAY NOT UPDATE df1
        df1[cols[0]] = df1[cols[0]].replace({'Yes': 1,'No': 0})

In [21]:
df1.sample(2)

In [22]:
enc_cols = ['InternetService', 'Contract', 'PaymentMethod']

In [23]:
#sparse_output REPLACES THE OLDER sparse ARGUMENT IN scikit-learn >= 1.2
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

encoder.fit(df1[enc_cols])

In [24]:
encoded_cols = list(encoder.get_feature_names_out(enc_cols))
encoded_cols

In [25]:
df1[encoded_cols] = encoder.transform(df1[enc_cols])
df1.sample(10)

In [26]:
df1 = df1.drop(['InternetService', 'Contract', 'PaymentMethod'], axis=1)

In [27]:
df1.replace('Male', 1 ,inplace=True)
df1.replace('Female', 0 ,inplace=True)

In [28]:
df1.sample(2)

In [29]:
scl_cols = ['tenure', 'TotalCharges', 'MonthlyCharges']

In [30]:
scaler = MinMaxScaler()

#FITTING THE SCALER TO THE NUMERIC COLUMNS
scaler.fit(df1[scl_cols])

#REPLACING THE NUMERIC COLUMNS WITH THEIR SCALED VALUES
df1[scl_cols] = scaler.transform(df1[scl_cols])
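
With its default settings, `MinMaxScaler` maps each column to the range [0, 1] using (x - min) / (max - min). The sketch below checks that formula by hand on made-up tenure values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up tenure values, one column
tenure = np.array([[1.0], [12.0], [72.0]])

# What the scaler produces
scaled = MinMaxScaler().fit_transform(tenure)

# The same result computed by hand: (x - min) / (max - min)
manual = (tenure - tenure.min()) / (tenure.max() - tenure.min())

print(scaled.ravel())
print(np.allclose(scaled, manual))
```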

In [31]:
input_cols = list(df1.columns)
input_cols.remove('Churn')
input_cols

In [32]:
target_col = 'Churn'

In [33]:
inputs = df1[input_cols]
target = df1[target_col].values

In [34]:
df1.sample(2)

*LET'S LOOK AT THE CORRELATIONS BETWEEN THE DIFFERENT COLUMNS*

In [35]:
plt.figure(figsize=(18,18))
sns.heatmap(df1.corr(), center=0, annot=True, linewidths=1);
plt.title('Heatmap of Correlations');

*IT LOOKS LIKE TENURE IS THE MOST IMPORTANT FACTOR*
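
A large heatmap is hard to read by eye, so one way to back up a reading like this is to rank the columns by the absolute value of their correlation with the target. The sketch below uses a tiny made-up frame; the `noise` column is hypothetical and only there for contrast.

```python
import pandas as pd

# Made-up data: short tenure goes with churn, noise is unrelated
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 60, 70, 72],
    "noise":  [5, 1, 4, 2, 6, 3],
    "Churn":  [1, 1, 1, 0, 0, 0],
})

# Correlation of every feature with the target
corr_with_target = toy.corr()["Churn"].drop("Churn")

# Rank by strength, ignoring the sign
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

On the real data the same three lines can be run on `df1` to see which column ranks first.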

**STEP 5) SPLITTING THE DATA INTO TRAIN AND TEST**

In [36]:
X_train, X_test, y_train, y_test = train_test_split(inputs,target,test_size=0.15,random_state=5)
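
When the classes are imbalanced, passing `stratify` to `train_test_split` keeps the churn ratio roughly the same in both parts. This is an optional variation, not what the cell above does; the arrays below are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 200 rows, 25% positive class
X = np.arange(200).reshape(-1, 1)
y = np.array([0] * 150 + [1] * 50)

# stratify=y preserves the class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.15, random_state=5, stratify=y
)

print(y.mean(), y_tr.mean(), y_te.mean())
```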

**STEP 6) TRAINING MODELS AND EVALUATING**

In [37]:
def model_eval(model, X_train, y_train, X_test, y_test):
    
    #PREDICTIONS ON THE TRAINING INPUTS
    train_preds = model.predict(X_train)
    #TRAINING ACCURACY SCORE
    train_acc = accuracy_score(y_train, train_preds)
    
    #PREDICTIONS ON THE TEST INPUTS
    test_preds = model.predict(X_test)
    #TEST ACCURACY SCORE
    test_acc = accuracy_score(y_test, test_preds)
    
    #PRINTING TRAINING AND TEST ACCURACIES
    print('Train ACC: {}, Test ACC: {}'.format(train_acc, test_acc))

In [38]:
#DECLARING THE MODEL OBJECT
model = XGBClassifier(random_state=42)

#FITTING THE MODEL TO THE TRAINING DATA
model.fit(X_train, y_train)

In [39]:
model_eval(model, X_train, y_train, X_test, y_test)
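
Accuracy alone can look good on imbalanced data even when few churners are caught, so it may be worth reporting the F1 score next to it (`f1_score` is already imported above). A sketch on made-up labels and predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 3 churners out of 10
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
# Made-up predictions that catch only one of the three churners
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

acc = accuracy_score(y_true, y_pred)   # 8 of 10 correct
f1 = f1_score(y_true, y_pred)          # precision 1, recall 1/3

print(acc, f1)
```

Here the accuracy is 0.8 while the F1 score for the churn class is only 0.5, which shows how the two metrics can tell different stories.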

*XGBClassifier gives a fairly good test accuracy of 79 percent*

In [40]:
#DECLARING THE MODEL OBJECT
model1 = LogisticRegression(random_state=42)

#FITTING THE MODEL TO THE TRAINING DATA
model1.fit(X_train, y_train)

In [41]:
model_eval(model1, X_train, y_train, X_test, y_test)

*Our LogisticRegression model performs slightly better, at approximately 80% test accuracy*

In [42]:
#DECLARING THE MODEL OBJECT
model2 = RandomForestClassifier(n_jobs=-1, random_state=42, n_estimators=350)

#FITTING THE MODEL TO THE TRAINING DATA
model2.fit(X_train, y_train)

In [43]:
model_eval(model2, X_train, y_train, X_test, y_test)

*RandomForestClassifier also performs well, with 80% test accuracy*

**BONUS STEP) IN THIS STEP, I AM TRYING OUT A NEW FRAMEWORK CALLED PYTORCH TABULAR. IT IS A DEEP LEARNING FRAMEWORK BUILT ON TOP OF PYTORCH AND PYTORCH LIGHTNING, AND IT PROVIDES DEEP LEARNING MODELS FOR WORKING WITH TABULAR DATA. CHECK IT OUT:**

https://pytorch-tabular.readthedocs.io/en/latest/

In [44]:
data = df1.copy()
num_col_names = scl_cols
cat_col_names = list(set(input_cols) - set(scl_cols))

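#HOLDING OUT 20% AS THE TEST SET, THEN SPLITTING THE REST INTO TRAIN AND VALIDATION
#SO THAT THE THREE SETS DO NOT SHARE ANY ROWS (60% / 20% / 20%)
train, test = train_test_split(data, test_size=0.2, random_state=42)
train, val = train_test_split(train, test_size=0.25, random_state=42)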
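
For a three-way split it is worth checking that the train, validation and test sets share no rows, since any overlap inflates the validation and test scores. A minimal sketch on a made-up frame (the `_part` names are hypothetical and chosen so they do not clash with the notebook's variables):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with 100 rows
toy = pd.DataFrame({"x": range(100)})

# First hold out the test set, then carve validation out of the remainder
train_part, test_part = train_test_split(toy, test_size=0.2, random_state=42)
train_part, val_part = train_test_split(train_part, test_size=0.25, random_state=42)

sizes = (len(train_part), len(val_part), len(test_part))

# Row indices that appear in more than one set (should be empty)
tr, va, te = set(train_part.index), set(val_part.index), set(test_part.index)
overlap = (tr & va) | (tr & te) | (va & te)

print(sizes, len(overlap))
```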

In [45]:
train.sample(4)

In [46]:
cat_col_names

*CONFIGS FOR THE TABULAR MODEL*

In [53]:
data_config = DataConfig(
    target=['Churn'], #target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
)
trainer_config = TrainerConfig(
    auto_lr_find=True, # Runs the LRFinder to automatically derive a learning rate
    batch_size=150,
    max_epochs=100,
    gpus=1, # Number of GPUs to use. -1 means all available GPUs, None means CPU
)
optimizer_config = OptimizerConfig()

model_config = CategoryEmbeddingModelConfig(
    task="classification",
    layers="1024-512-512",  # Number of nodes in each layer
    activation="LeakyReLU", # Activation between each layers
    learning_rate = 1e-5
)

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

**TRAINING THE MODEL**

In [54]:
tabular_model.fit(train=train, validation=val)


*WE GOT A VALIDATION ACCURACY OF 74% WITH DEFAULT PARAMETERS*

**EVALUATING THE MODEL ON TEST SET**

In [55]:
result = tabular_model.evaluate(test)

*OUR TESTING ACCURACY IS ALMOST AS GOOD AS OUR ABOVE MODELS*

In [56]:
pred_df = tabular_model.predict(test)
pred_df.head()
