# Telco Customer Churn - Data preparation

In this notebook, we prepare the data for later analysis.

### Things to install
pip install imbalanced-learn  
pip install scikit-learn

!pip install imbalanced-learn scikit-learn

Load packages

In [1]:
# !conda install -c numba/label/dev numba
# !pip install pandas_profiling imbalanced-learn scikit-learn

In [2]:
import os
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
from _utils import data_utils
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.cross_decomposition import PLSRegression

In [3]:
OUPUT_PATH = "./output/"
os.makedirs(OUPUT_PATH, exist_ok=True)  # create the output directory if it is missing

# STEP 1: EDA & Data preparation

Using the [Telco Customer Churn data from Kaggle](https://www.kaggle.com/blastchar/telco-customer-churn), we clean up the data as demonstrated in [Telecom Customer Churn Prediction](https://www.kaggle.com/pavanraj159/telecom-customer-churn-prediction).

In [4]:
telcom = pd.read_csv(
    "https://data.atoti.io/notebooks/telco-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv"
)
# perform data clean up
telcom = data_utils.data_cleanup(telcom)

In [5]:
ProfileReport(telcom)



## Data processing & split train / test

We do some data preprocessing to handle categorical variables. We create:
- Binary variables for categorical variables with only 2 levels
- Dummy variables for those with more than 2 levels

We also split the data into train and test sets.
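
The encoding itself happens inside `data_utils.data_preprocessing`, whose source is not shown in this notebook. As a rough sketch of the two rules listed above, assuming a toy frame with made-up values (the frame `toy` and its contents are illustrative, not taken from the dataset), the idea could look like this:

```python
import pandas as pd

# Hypothetical toy data: one 2-level column and one 3-level column.
toy = pd.DataFrame(
    {
        "Partner": ["Yes", "No", "Yes"],
        "Contract": ["Month-to-month", "One year", "Two year"],
    }
)

# Columns with exactly two levels become a single 0/1 column.
binary_cols = [c for c in toy.columns if toy[c].nunique() == 2]
# Columns with more than two levels become one dummy column per level.
multi_cols = [c for c in toy.columns if toy[c].nunique() > 2]

encoded = toy.copy()
for col in binary_cols:
    # sort=True makes the codes follow the sorted levels ("No" -> 0, "Yes" -> 1).
    codes, _ = pd.factorize(encoded[col], sort=True)
    encoded[col] = codes

encoded = pd.get_dummies(encoded, columns=multi_cols, dtype=int)
print(encoded)
```

The real helper may name or order the resulting columns differently; only the binary-versus-dummy distinction is meant to carry over.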

We create a few new columns in preparation for the machine learning output.  
In the actual churn data the outcome is already known, so `ChurnProbability` is fixed: 1.0 for customers who have churned and 0.0 for those who have not.  
`ChurnPredicted` is simply the actual churn in this base use case.

In [6]:
# The statistics are based on the previous month, so the churn outcome is already known:
# the probability is 1.0 for customers who churned and 0.0 otherwise.
telcom["ChurnProbability"] = np.where(telcom["Churn"] == "Yes", 1.0, 0.0)
telcom["ChurnPredicted"] = telcom["Churn"]

First, we separate the target from the other columns and split the data, holding out 5% of the rows as the test set for the machine learning algorithms.

In [7]:
# Target columns
target_col = ["Churn"]

# separating dependent and independent variables
cols = [c for c in telcom.columns if c not in target_col]

X_train, X_test, Y_train, Y_test = train_test_split(
    telcom[cols], telcom[target_col], test_size=0.05, random_state=0
)
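
As a small illustration of what `test_size=0.05` and `random_state=0` do, here is a sketch on a made-up frame of 1000 rows (the frame `toy` and its columns are hypothetical). It checks the size of each subset and that repeating the call with the same seed gives the same rows:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, one feature and a "Churn" target.
toy = pd.DataFrame(
    {"x": np.arange(1000), "Churn": np.tile(["No", "No", "No", "Yes"], 250)}
)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["x"]], toy[["Churn"]], test_size=0.05, random_state=0
)
# Same arguments and same seed: the split should be identical.
X_tr2, X_te2, _, _ = train_test_split(
    toy[["x"]], toy[["Churn"]], test_size=0.05, random_state=0
)

print(len(X_tr), len(X_te))
same_split = X_te.index.equals(X_te2.index)
print("Reproducible split:", same_split)
```

`train_test_split` also accepts a `stratify` argument to keep the class proportions equal in both subsets; the notebook does not use it, so the churn rate may differ slightly between train and test.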

We add a column indicating whether a row belongs to the training set or the test set.

In [8]:
# Use assign to get new frames instead of modifying the split results in place
X_train = X_train.assign(Subset="Train")
X_test = X_test.assign(Subset="Test")

telcom = pd.concat(
    [pd.concat([X_train, Y_train], axis=1), pd.concat([X_test, Y_test], axis=1)], axis=0
).reset_index(drop=True)

Later, the binary features are matched back to the rows of `telcom` by position, so we save this dataset to CSV now, in its current row order, for the main analysis.

In [9]:
telcom.to_csv(os.path.join(OUPUT_PATH, "tranformed_customer_churn.csv"), index=False)
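
To make the "matching by position" idea concrete, here is a minimal sketch with made-up data (the frames `raw` and `features` are hypothetical stand-ins for the saved dataset and the encoded features). It assumes both frames keep the same row order and a fresh `0..n-1` index:

```python
import pandas as pd

# Hypothetical stand-in for the saved dataset.
raw = pd.DataFrame(
    {
        "CustomerID": ["A", "B", "C"],
        "Contract": ["One year", "Two year", "One year"],
    }
)

# Hypothetical stand-in for the encoded features, built in the same row order.
features = pd.get_dummies(raw[["Contract"]], dtype=int)

# With identical 0..n-1 indexes, a column-wise concat lines rows up by position.
matched = pd.concat(
    [raw.reset_index(drop=True), features.reset_index(drop=True)], axis=1
)
print(matched)
```

If either frame were shuffled or filtered in between, rows would be paired with the wrong customer, which is why the row order of the saved file matters.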

In [10]:
# Perform data processing separately on train and test
telcom_train = telcom[telcom["Subset"] == "Train"].reset_index(drop=True)
telcom_test = telcom[telcom["Subset"] == "Test"].reset_index(drop=True)

# Columns to ignore for model training
ignore_col = ["CustomerID", "ChurnPredicted", "ChurnProbability_1.0", "Subset"]

binary_df_train = data_utils.data_preprocessing(telcom_train, ignore_col, target_col)
binary_df_test = data_utils.data_preprocessing(telcom_test, ignore_col, target_col)

In [11]:
binary_df_train = binary_df_train[
    [c for c in binary_df_train.columns if c not in ["TenureGroup_Tenure_0-12"]]
]
binary_df_test = binary_df_test[
    [c for c in binary_df_test.columns if c not in ["TenureGroup_Tenure_0-12"]]
]

drop_cols_train = [
    c
    for c in binary_df_train.columns
    if "CustomerID" in c or "Churn" in c or "Subset" in c
]
drop_cols_test = [
    c
    for c in binary_df_test.columns
    if "CustomerID" in c or "Churn" in c or "Subset" in c
]

train_X = binary_df_train.drop(drop_cols_train, axis=1)
train_Y = binary_df_train[target_col]

test_X = binary_df_test.drop(drop_cols_test, axis=1)
test_Y = binary_df_test[target_col]

In [12]:
binary_df = pd.concat([binary_df_train, binary_df_test]).reset_index(drop=True)

### Save the data

In [13]:
binary_df.to_csv(os.path.join(OUPUT_PATH, "all_df.csv"), index=False)
binary_df_train.to_csv(os.path.join(OUPUT_PATH, "train_df.csv"), index=False)
binary_df_test.to_csv(os.path.join(OUPUT_PATH, "test_df.csv"), index=False)

## Reduce dimension

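We use PLS-DA (partial least squares discriminant analysis), fitted here with scikit-learn's `PLSRegression` on the churn target, to: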
   - Reduce the dimension of the data
   - Eliminate existing correlations in the data

In [14]:
plsda = PLSRegression(n_components=len(train_X.columns), scale=False)
plsda.fit(train_X, train_Y)

PLSRegression(n_components=32, scale=False)

In [15]:
train_X_ = pd.DataFrame(
    plsda.x_scores_,
    columns=["LV" + str(i + 1) for i in range(plsda.x_scores_.shape[1])],
)

In [16]:
variance_X = np.var(train_X_, axis=0)
explained_variance = round(variance_X / np.sum(variance_X) * 100, 2)

In [17]:
signif_thres = round(100 / len(explained_variance[explained_variance > 0]), 2)

# Keep the components whose share of variance reaches the threshold
is_signif = explained_variance >= signif_thres
signif_component = list(explained_variance.index[is_signif])
signif_component_var = round(explained_variance[is_signif].sum())

print("Relevant latent variables: {}".format(signif_component))
print("Variance explained: {}%".format(signif_component_var))

Relevant latent variables: ['LV1', 'LV2', 'LV3']
Variance explained: 58%
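
The selection rule above keeps a latent variable when its share of the variance is at least the average share, where the average is taken over the components with a non-zero share. A worked example with made-up percentages (the values in `shares` are hypothetical, not the notebook's results) may make the arithmetic clearer:

```python
import pandas as pd

# Hypothetical variance shares in percent, one per latent variable.
shares = pd.Series(
    [40.0, 25.0, 15.0, 10.0, 6.0, 4.0, 0.0],
    index=["LV" + str(i + 1) for i in range(7)],
)

# Average share among components that explain any variance at all: 100 / 6.
n_positive = int((shares > 0).sum())
threshold = round(100 / n_positive, 2)

selected = list(shares.index[shares >= threshold])
selected_total = shares[shares >= threshold].sum()

print("Threshold: {}%".format(threshold))
print("Selected: {}".format(selected))
print("Variance explained: {}%".format(selected_total))
```

Here `LV3` is dropped because its 15% share falls below the 16.67% average, even though it is not negligible. Note that these shares describe the variance of the X scores, not the variance of the target.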


We refit the PLS-DA model, keeping only the first three components, as they are the only ones whose share of the variance reaches the threshold.

In [18]:
plsda = PLSRegression(n_components=len(signif_component), scale=False)
plsda.fit(train_X, train_Y)

PLSRegression(n_components=3, scale=False)

Then, we project the train and test data into the latent variable space, using the model fitted on the training set only.

In [19]:
train_X_transf = pd.DataFrame(
    plsda.transform(train_X),
    columns=["LV" + str(i + 1) for i in range(plsda.x_scores_.shape[1])],
)

test_X_transf = pd.DataFrame(
    plsda.transform(test_X),
    columns=["LV" + str(i + 1) for i in range(plsda.x_scores_.shape[1])],
)

In [20]:
train_df_transf = pd.concat([train_X_transf, train_Y], axis=1).reset_index(drop=True)
test_df_transf = pd.concat([test_X_transf, test_Y], axis=1).reset_index(drop=True)
binary_df_transf = pd.concat([train_df_transf, test_df_transf]).reset_index(drop=True)

### Save the transformed data

In [21]:
binary_df_transf.to_csv(os.path.join(OUPUT_PATH, "all_df_transf.csv"), index=False)
train_df_transf.to_csv(os.path.join(OUPUT_PATH, "train_df_transf.csv"), index=False)
test_df_transf.to_csv(os.path.join(OUPUT_PATH, "test_df_transf.csv"), index=False)