# EDA Notebook

This notebook explores the data through plots of the categorical and numerical features and, at the end, checks feature importance with SHAP. SHAP values offer a theoretically grounded, consistent way to attribute a prediction to individual features, and they work particularly well with tree-based models.

## 1. Importing libraries

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import LabelEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from category_encoders import TargetEncoder
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import classification_report, f1_score, precision_recall_curve, roc_auc_score, RocCurveDisplay
import shap
import warnings

# Config
%matplotlib inline
pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')
shap.initjs()

## 2.1 Loading data

In [None]:
# Load train and test data
df_train = pd.read_csv('../data/raw/train.csv')
df_test = pd.read_csv('../data/raw/test.csv')

## 2.2 Checking data

In [None]:
df_train.info()

In [None]:
# Check quantity of rows and columns
print('Number of rows: ', df_train.shape[0])
print('Number of columns: ', df_train.shape[1])

In [None]:
# Check null values
df_train.isnull().sum()

The training set is small: 5634 rows and 21 variables, with no values reported as *missing*. Only 3 of the 21 variables are numerical.
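
A caveat on the null check: `isnull()` only counts real NaN values, so a text column that holds blank strings still reports zero missing values. The sketch below shows this on a small hypothetical frame (not the real data); the blank `TotalCharges` entry is an assumption, made to mirror the `" "` replacement applied to that column later in this notebook.

```python
import pandas as pd

# Hypothetical rows, not the real data: one TotalCharges entry is a blank string
toy = pd.DataFrame({
    'TotalCharges': ['29.85', ' ', '108.15'],
    'tenure': [1, 0, 3],
})

# isnull() does not see the blank string as missing
null_count = int(toy.isnull().sum().sum())

# Counting whitespace-only strings reveals the hidden missing value
blank_count = int(toy['TotalCharges'].str.strip().eq('').sum())

print('NaN values:', null_count)
print('Blank strings in TotalCharges:', blank_count)
```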

Let's begin the exploratory analysis with the target variable and then move on to the categorical and numerical features. We start with a bar chart of the target variable, Churn.

## 2.3 Target variable

In [None]:
# Plot target variable
#plt.figure(figsize=(4,3))
# Count each class once and plot the counts directly
churn_counts = df_train['Churn'].value_counts()
ax = sns.barplot(x=churn_counts.index, y=churn_counts.values)

total = float(len(df_train))

for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2., height, '{:1.2f}'.format(height/total), 
            fontsize=12, color='black', ha='center', va='bottom')

plt.title('Churn Rate')
plt.show()

The plot shows that most of the company's customers, about 73%, did not cancel their contract. That is good news for the company, but for our model it means the target variable is imbalanced, which needs to be addressed. We will deal with this later; for now we continue with the exploratory analysis and cleaning of the other variables.
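
One common way to account for this kind of imbalance with gradient-boosted trees is to weight the positive class by the ratio of negative to positive examples. The sketch below computes that ratio on a hypothetical label vector built to match the 73/27 split (not the real data). `scale_pos_weight` is an XGBoost parameter; whether this is the correction applied later is an assumption.

```python
import pandas as pd

# Hypothetical labels with the 73/27 split seen above (not the real data)
churn = pd.Series([0] * 73 + [1] * 27, name='Churn')

# Share of each class
class_share = churn.value_counts(normalize=True)

# Ratio of negatives to positives, usable as XGBClassifier(scale_pos_weight=...)
n_neg = int((churn == 0).sum())
n_pos = int((churn == 1).sum())
scale_pos_weight = n_neg / n_pos

print(class_share)
print(round(scale_pos_weight, 2))
```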

In [None]:
df_train['Churn'] = df_train['Churn'].map({'Yes': 1,'No': 0})

## 2.3 Categorical variables (EDA)

In [None]:
# casting senior citizen to object
df_train['SeniorCitizen'] = df_train['SeniorCitizen'].astype('object')

# casting total charges to float: use a dot as the decimal separator, then coerce
# anything non-numeric (including blank strings such as " ") to NaN
df_train['TotalCharges'] = df_train['TotalCharges'].astype(str).str.replace(',', '.', regex=False)
df_train['TotalCharges'] = pd.to_numeric(df_train['TotalCharges'], errors='coerce').astype('float64')

In [None]:
# Drop customerID
df_train.drop('customerID', axis=1, inplace=True)

In [None]:
# Select only categorical variables
df_train_cat = df_train.select_dtypes(include=['object'])

# Select only numerical variables (a new frame without the target column)
df_train_quant = df_train.select_dtypes(exclude=['object']).drop(columns='Churn')

In [None]:
df_train_cat.describe()

Most of the categorical features have only two or three unique values, so a bar chart is a good way to see how the values are distributed. We loop over the categorical variables and draw one bar chart for each.

In [None]:
# One count plot per categorical variable, annotated with the share of rows in each bar
fig, axes = plt.subplots(4, 4, figsize=(20, 20))
n_rows = len(df_train_cat)
for ax, col in zip(axes.flatten(), df_train_cat.columns):
    sns.countplot(x=col, data=df_train_cat, ax=ax)
    ax.yaxis.set_ticklabels([])
    ax.set_title(col)
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.tick_params(axis='x', rotation=45)
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x() + p.get_width()/2., height, '{:1.2f}'.format(height/n_rows),
                fontsize=12, color='black', ha='center', va='bottom')
plt.tight_layout()
plt.show()

In [None]:
# group all categorical variables by churn, then plot a horizontal stacked bar chart of the counts
fig, axes = plt.subplots(4, 4, figsize=(30, 20))
for ax, col in zip(axes.flatten(), df_train_cat.columns):
    df_train_cat.groupby([col, df_train['Churn']]).size().unstack().plot(kind='barh', stacked=True, ax=ax)
    ax.set_title(col)
    ax.set_xlabel('')
    ax.set_ylabel('')
plt.tight_layout()
plt.show()

The two plots above show that most of the categorical features are imbalanced, which is clearer in the stacked bar plot. We still need to understand how important these variables really are for predicting the target variable.

Some categorical attributes take values such as "No phone service" and "No internet service". These extra levels complicate both the visualizations and the machine learning algorithms.

Instead of keeping these values, let's replace them with "No". The data becomes more organized, and the final dataset has fewer columns once the dummy variables are created.
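
The replacement described above can be sketched as follows. The frame holds hypothetical rows (not the real data) with column names taken from the dataset; on the notebook's data the same call could be applied to `df_train`.

```python
import pandas as pd

# Toy frame with hypothetical rows; column names follow the dataset
toy = pd.DataFrame({
    'MultipleLines': ['No phone service', 'Yes', 'No'],
    'OnlineSecurity': ['No internet service', 'No', 'Yes'],
    'Contract': ['One year', 'Two year', 'Month-to-month'],
})

# Collapse both "no service" labels into a plain "No"
toy_clean = toy.replace({'No phone service': 'No', 'No internet service': 'No'})

# Each service column now has two levels instead of three
print(toy_clean.nunique())
```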

## 2.4 Numerical variables (EDA)

In [None]:
df_train_quant.describe()

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, col in zip(axes, df_train_quant.columns):
    # Density of each numerical feature, split by churn status
    # (kdeplot replaces the deprecated distplot)
    sns.kdeplot(df_train.loc[df_train.Churn == 0, col], linestyle='-', label='No churn', ax=ax)
    sns.kdeplot(df_train.loc[df_train.Churn == 1, col], linestyle='--', label='Churn', ax=ax)
    ax.yaxis.set_ticklabels([])
    ax.set_xlabel('')
    ax.set_title(col)
    ax.legend()

# Show the plot
plt.tight_layout()
plt.show()

The density plots above show that the numerical variables are not normally distributed, so we will need to normalize them before training linear machine learning models. We can also see that, for tenure and MonthlyCharges, the distribution for customers who churned differs from that of customers who did not.
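
One way to put the numerical features on a comparable scale before fitting a linear model is standardization. The sketch below uses scikit-learn's `StandardScaler` on randomly generated values (hypothetical, not the real data). Note that standardizing centers and rescales each column; it does not make a skewed distribution normal.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-ins for two numerical columns (not the real data)
rng = np.random.default_rng(42)
toy_num = pd.DataFrame({
    'tenure': rng.integers(0, 73, size=200),
    'MonthlyCharges': rng.uniform(18, 120, size=200),
})

# Center each column on 0 and scale it to unit variance
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy_num), columns=toy_num.columns)

print(scaled.agg(['mean', 'std']).round(2))
```

In a real workflow the scaler would be fitted on the training data only, for example as a step inside the pipeline.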

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes = axes.flatten()

for ax, box in zip(axes, df_train[df_train_quant.columns]):
    sns.boxplot(x='Churn',
                y=box, 
                ax=ax,
                data=df_train
    )

# Show the plot
plt.show()

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes = axes.flatten()

# Iterate over the quantitative variables to draw the violin plots
# (split=True only applies when a hue variable is set, so it is omitted)
for ax, violin in zip(axes, df_train[df_train_quant.columns]):
    sns.violinplot(x='Churn',
                   y=violin, 
                   ax=ax,
                   data=df_train)

# Show the plot
plt.show()

The box and violin plots show the distribution of each numerical variable for each class of the target variable. None of the three variables has many outliers.
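
The visual impression can be backed by a count. The sketch below applies the usual 1.5 × IQR rule, the same rule a box plot uses for its whiskers by default, to a hypothetical series (not the real data). `count_iqr_outliers` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

def count_iqr_outliers(s):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((s < lower) | (s > upper)).sum())

# Hypothetical values with one obvious outlier
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
n_outliers = count_iqr_outliers(toy)

print(n_outliers)
```

On the notebook's data the helper could be applied column by column, for example with `df_train_quant.apply(count_iqr_outliers)`.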

In [None]:
sns.pairplot(df_train, 
             vars=list(df_train_quant.columns),
             hue='Churn', 
             diag_kind='hist', 
             corner=True, 
             plot_kws=dict(alpha=0.5))

plt.tight_layout()
plt.show()

The pairplot above again shows that the numerical variables are not normally distributed, and that TotalCharges has a strong positive relationship with tenure and MonthlyCharges. To check this relationship, we plot a correlation heatmap below.

In [None]:
sns.heatmap(df_train[df_train_quant.columns].corr(method='pearson'), 
            annot=True, 
            linewidths=0.3, 
            square=True)

# Show the plot
plt.show()

## 3.1 Model Pipeline

Now we will create a pipeline to train and test the model on the original dataset, reloaded from the raw files. We will use this to check feature importance with SHAP.

In [None]:
X_train = pd.read_csv('../data/raw/train.csv')
X_test = pd.read_csv('../data/raw/test.csv')

In [None]:
X_train['Churn'] = X_train['Churn'].map({'Yes': 1,'No': 0})
X_test['Churn'] = X_test['Churn'].map({'Yes': 1,'No': 0})

In [None]:
y_train = X_train['Churn']
X_train = X_train.drop('Churn', axis=1)

y_test = X_test['Churn']
X_test = X_test.drop('Churn', axis=1)

In [None]:
# Helpers for cleaning TotalCharges; each works on a copy and returns a column vector

def replace_commas(X):
    # Use a dot as the decimal separator
    X = X.copy()
    X['TotalCharges'] = X['TotalCharges'].apply(lambda x: str(x).replace(',', '.'))
    return np.array(X['TotalCharges']).reshape(-1, 1)

def casting_to_float(X):
    # Non-numeric entries become NaN
    X = X.copy()
    X['TotalCharges'] = pd.to_numeric(X['TotalCharges'], errors='coerce').astype('float64')
    return np.array(X['TotalCharges']).reshape(-1, 1)

def replace_spaces(X):
    # Blank or whitespace-only strings become NaN
    X = X.copy()
    X['TotalCharges'] = X['TotalCharges'].replace(r'^\s*$', np.nan, regex=True)
    return np.array(X['TotalCharges']).reshape(-1, 1)
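
Cleaning functions like these can be plugged into a scikit-learn pipeline with `FunctionTransformer`. The sketch below is one possible wiring, not the notebook's final pipeline: `clean_total_charges` is a hypothetical helper that combines the steps above, and the input rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def clean_total_charges(X):
    # Hypothetical helper: comma decimals -> dots, anything non-numeric -> NaN
    col = pd.DataFrame(X).iloc[:, 0].astype(str).str.replace(',', '.', regex=False)
    return pd.to_numeric(col, errors='coerce').to_numpy().reshape(-1, 1)

# Clean the column first, then fill the resulting NaN with the median
total_charges_pipe = make_pipeline(
    FunctionTransformer(clean_total_charges),
    SimpleImputer(strategy='median'),
)

# Made-up rows covering a comma decimal, a blank string and a regular value
toy = pd.DataFrame({'TotalCharges': ['29,85', ' ', '108.15', '1889,5']})
out = total_charges_pipe.fit_transform(toy)

print(out.ravel())
```

A pipeline like this could then be passed to a `ColumnTransformer` as the transformer for `['TotalCharges']`.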

In [None]:
num_vars = ['tenure', 'MonthlyCharges']

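# Object columns plus SeniorCitizen (not stored as an object column) are treated as categorical.
# TotalCharges is read as text but holds numbers, so it is left out of the categorical list.
cat_vars = [
    var for var in X_train.columns
    if (X_train[var].dtype == "object" or var == 'SeniorCitizen')
    and var not in ['Churn', 'customerID', 'TotalCharges']
]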

In [None]:
# Create a simple pipeline for numerical and categorical features with an XGBoost classifier
pipe = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', SimpleImputer(strategy='median'), num_vars),
        ('cat', TargetEncoder(cols=cat_vars), cat_vars)
    ], remainder='drop')),
    ('classifier', xgb.XGBClassifier(random_state=42))
])

# Fit the pipeline
pipe.fit(X_train, y_train)

In [None]:
pipe.score(X_test, y_test)
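
For a classifier, `pipe.score` reports accuracy. With about 73% of customers not churning, a model that always predicts "No" would already reach roughly 0.73, so accuracy alone says little here. The sketch below compares accuracy with F1 and ROC AUC on hypothetical labels and scores (not results from this notebook).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hypothetical true labels and predicted churn probabilities (not real results)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_proba = np.array([0.1, 0.2, 0.3, 0.2, 0.4, 0.1, 0.6, 0.4, 0.7, 0.9])

# Turn probabilities into class predictions with a 0.5 threshold
y_pred = (y_proba >= 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_proba)

print(f'accuracy={acc:.3f}  f1={f1:.3f}  roc_auc={auc:.3f}')
```

With the fitted pipeline, the same metrics would come from `f1_score(y_test, pipe.predict(X_test))` and `roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])`.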