<a href="https://colab.research.google.com/github/desireeHim/praktika_kt/blob/main/praktika_kt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import files
# upload csv file
files.upload()

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# reading in csv file and getting a glimpse what it looks like
input_file = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
input_file.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


Given the data and the objective (predicting the Churn column), this is a binary classification problem.

In [3]:
# seeing what are all the columns since head limits the number of columns displayed
input_file.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [4]:
# for cleaning, I want to see all the possible values of each column
for column in input_file.columns:
  print(f'\'{column}\' has values:\n{input_file[column].value_counts()}\n\n')

'customerID' has values:
7590-VHVEG    1
3791-LGQCY    1
6008-NAIXK    1
5956-YHHRX    1
5365-LLFYV    1
             ..
9796-MVYXX    1
2637-FKFSY    1
1552-AAGRX    1
4304-TSPVK    1
3186-AJIEK    1
Name: customerID, Length: 7043, dtype: int64


'gender' has values:
Male      3555
Female    3488
Name: gender, dtype: int64


'SeniorCitizen' has values:
0    5901
1    1142
Name: SeniorCitizen, dtype: int64


'Partner' has values:
No     3641
Yes    3402
Name: Partner, dtype: int64


'Dependents' has values:
No     4933
Yes    2110
Name: Dependents, dtype: int64


'tenure' has values:
1     613
72    362
2     238
3     200
4     176
     ... 
28     57
39     56
44     51
36     50
0      11
Name: tenure, Length: 73, dtype: int64


'PhoneService' has values:
Yes    6361
No      682
Name: PhoneService, dtype: int64


'MultipleLines' has values:
No                  3390
Yes                 2971
No phone service     682
Name: MultipleLines, dtype: int64


'InternetService' has values:
Fib

We have an imbalanced dataset, and the class we want to predict, 'Yes', is the minority class. Some of the columns have three unique values; to reduce the complexity of the dataset I merge 'No' with 'No internet service' or 'No phone service', and in columns with binary values I convert 'No' to 0 and 'Yes' to 1.
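The imbalance can be quantified directly from the target column. The snippet below is a minimal sketch on a made-up stand-in for the Churn column (the values are illustrative, not taken from the dataset); `class_share`, `minority_label` and `minority_share` are names introduced here.

```python
import pandas as pd

# Toy stand-in for the Churn column; the values are made up for illustration.
churn = pd.Series(['No', 'No', 'No', 'Yes', 'No', 'Yes', 'No', 'No'], name='Churn')

# Relative frequency of each class; a share far from 0.5 signals imbalance.
class_share = churn.value_counts(normalize=True)
print(class_share)

# Label and share of the least frequent class.
minority_label = class_share.idxmin()
minority_share = class_share.min()
```

On the real data the same call would be `input_file['Churn'].value_counts(normalize=True)`.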

In [5]:
# for simplicity, columns with binary textual values are converted to 0 and 1
# first make a copy so that nothing changes in the original data
cleaned = input_file.copy(deep=True)
# dropping the ID column, because it isn't needed for the model to train or make predictions
cleaned.drop('customerID', axis=1, inplace=True)
# results are assigned back to the column: replace(inplace=True) on a selected column
# is chained assignment and does not update the dataframe under pandas copy-on-write
cleaned['gender'] = cleaned['gender'].replace({'Male': 0, 'Female': 1})
cleaned['Partner'] = cleaned['Partner'].replace({'No': 0, 'Yes': 1})
cleaned['Dependents'] = cleaned['Dependents'].replace({'No': 0, 'Yes': 1})
cleaned['PhoneService'] = cleaned['PhoneService'].replace({'No': 0, 'Yes': 1})
cleaned['MultipleLines'] = cleaned['MultipleLines'].replace({'No': 0, 'No phone service': 0, 'Yes': 1})
# DSL and Fiber optic are both mapped to 1 (customer has internet service)
cleaned['InternetService'] = cleaned['InternetService'].replace({'No': 0, 'DSL': 1, 'Fiber optic': 1})
cleaned['OnlineSecurity'] = cleaned['OnlineSecurity'].replace({'No': 0, 'No internet service': 0, 'Yes': 1})
cleaned['OnlineBackup'] = cleaned['OnlineBackup'].replace({'No': 0, 'No internet service': 0, 'Yes': 1})
cleaned['DeviceProtection'] = cleaned['DeviceProtection'].replace({'No': 0, 'No internet service': 0, 'Yes': 1})
cleaned['TechSupport'] = cleaned['TechSupport'].replace({'No': 0, 'No internet service': 0, 'Yes': 1})
cleaned['StreamingTV'] = cleaned['StreamingTV'].replace({'No': 0, 'No internet service': 0, 'Yes': 1})
cleaned['StreamingMovies'] = cleaned['StreamingMovies'].replace({'No': 0, 'No internet service': 0, 'Yes': 1})
cleaned['PaperlessBilling'] = cleaned['PaperlessBilling'].replace({'No': 0, 'Yes': 1})
cleaned['Churn'] = cleaned['Churn'].replace({'No': 0, 'Yes': 1})

# one-hot encoding some columns with categorical values
cleaned['Contract'] = cleaned['Contract'].astype('category')
one_hot_encoded = pd.get_dummies(cleaned['Contract'])
cleaned = pd.concat([cleaned, one_hot_encoded], axis=1)
cleaned.drop('Contract', axis=1, inplace=True)

cleaned['PaymentMethod'] = cleaned['PaymentMethod'].astype('category')
one_hot_encoded = pd.get_dummies(cleaned['PaymentMethod'])
cleaned = pd.concat([cleaned, one_hot_encoded], axis=1)
cleaned.drop('PaymentMethod', axis=1, inplace=True)

cleaned.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,...,MonthlyCharges,TotalCharges,Churn,Month-to-month,One year,Two year,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check
0,1,0,1,0,1,0,0,1,0,1,...,29.85,29.85,0,1,0,0,0,0,1,0
1,0,0,0,0,34,1,0,1,1,0,...,56.95,1889.5,0,0,1,0,0,0,0,1
2,0,0,0,0,2,1,0,1,1,1,...,53.85,108.15,1,1,0,0,0,0,0,1
3,0,0,0,0,45,0,0,1,1,0,...,42.3,1840.75,0,0,1,0,1,0,0,0
4,1,0,0,0,2,1,0,1,0,0,...,70.7,151.65,1,1,0,0,0,0,1,0
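The two one-hot encoding steps in the cell above could also be written as a single `pd.get_dummies` call with the `columns` argument. This is a sketch on a made-up three-row frame, not the notebook's code; note that this form prefixes each new column with the source column name, so the resulting column names differ from those shown above.

```python
import pandas as pd

# Toy frame reusing the two categorical column names from the notebook;
# the rows are made up for illustration.
toy = pd.DataFrame({
    'tenure': [1, 34, 2],
    'Contract': ['Month-to-month', 'One year', 'Month-to-month'],
    'PaymentMethod': ['Electronic check', 'Mailed check', 'Mailed check'],
})

# One call encodes both columns and drops the originals;
# the prefix records which source column each dummy came from.
encoded = pd.get_dummies(toy, columns=['Contract', 'PaymentMethod'], dtype=int)
print(encoded.columns.tolist())
```

The prefixes make it easier to tell contract dummies from payment-method dummies once the frame has 25 columns.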


In [6]:
# checking if there are any columns with more than 50% of empty fields or empty values
print("dataframe shape before: ", cleaned.shape)
missing_percentage_cols = cleaned.isin(['',' ', np.nan, 'null', 'NULL']).mean(axis=0)
# checking if there are any rows with more than 50% of empty fields or empty values
missing_percentage_rows = cleaned.isin(['',' ', np.nan, 'null', 'NULL']).mean(axis=1)
cleaned = cleaned.loc[missing_percentage_rows <= 0.5, missing_percentage_cols <= 0.5]
print("dataframe shape after: ", cleaned.shape)

dataframe shape before:  (7043, 25)
dataframe shape after:  (7043, 25)


Luckily, no rows or columns were dropped.

In [7]:
# checking if everything is numerical
cleaned.dtypes

# changing column TotalCharges datatype to float
#cleaned['TotalCharges'] = cleaned['TotalCharges'].astype('float64')

# got error, some column values are missing, replacing empty strings or missing values with -1
cleaned['TotalCharges'].replace(['',' ', np.nan, 'null', 'NULL'], -1, inplace=True)

# trying again
cleaned['TotalCharges'] = cleaned['TotalCharges'].astype('float64')
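An alternative way to do the same conversion is `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into NaN so they can be counted before being filled. The sketch below uses a made-up three-value series; the -1 fill mirrors the choice made above and is not a recommendation.

```python
import pandas as pd

# Toy stand-in for TotalCharges as read from the CSV (strings, one blank entry).
total_charges = pd.Series(['29.85', ' ', '1889.5'], name='TotalCharges')

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
as_float = pd.to_numeric(total_charges, errors='coerce')
n_missing = int(as_float.isna().sum())

# Filling with -1 mirrors the notebook's choice; another fill value could be used.
filled = as_float.fillna(-1)
print(n_missing, filled.tolist())
```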

In [8]:
from sklearn.model_selection import train_test_split

# I separate the class we want to predict from rest of the data
y = cleaned['Churn']
cleaned.drop(['Churn'], axis=1, inplace=True)
X = cleaned
# I do train/test split, with 80% data used for training and 20% as test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
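Because the classes are imbalanced, one option is to pass `stratify=y` so that train and test keep the same class ratio. The split above does not do this, so the reported numbers come from a non-stratified split; the following is only a sketch on made-up labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 75 negatives and 25 positives.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 75 + [1] * 25)

# stratify=y_toy keeps the 75/25 class ratio in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```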


Since this is a classification problem with labelled data, I am using supervised learning models. I chose one simpler and one more complex model: K-nearest neighbours (KNN) and Random Forest (RF).
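KNN is distance-based, so columns on a large scale (such as TotalCharges) can dominate the 0/1 columns. The models in this notebook are trained on unscaled features; the sketch below, on made-up data, shows one way scaling could be combined with KNN using scikit-learn's `make_pipeline`. The toy data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data: one binary feature and one large-scale feature (like a charge amount).
X_toy = np.column_stack([rng.integers(0, 2, 200), rng.uniform(0, 8000, 200)])
y_toy = X_toy[:, 0].astype(int)  # the label depends only on the binary feature

# Without scaling, the large-scale column dominates the Euclidean distance.
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_toy, y_toy)
acc_raw = knn_raw.score(X_toy, y_toy)

# With scaling, both columns contribute comparably to the distance.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_toy, y_toy)
acc_scaled = knn_scaled.score(X_toy, y_toy)

# Training-set accuracy only; this toy comparison is not a proper evaluation.
print(acc_raw, acc_scaled)
```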

In [9]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# initializing the model, with k=5
knn_model = KNeighborsClassifier(n_neighbors=5)
# training the model
knn_model.fit(X_train, y_train)
# predicting test results
knn_preds = knn_model.predict(X_test)
# evaluating the model performance
knn_report = classification_report(y_test, knn_preds)
print("KNN Classification report:\n", knn_report)

# initializing the model, with max depth 5
rf_model = RandomForestClassifier(max_depth=5, random_state=42)
# training the model
rf_model.fit(X_train, y_train)
# predicting test results
rf_preds = rf_model.predict(X_test)
# evaluating the model performance
rf_report = classification_report(y_test, rf_preds)
print("RF Classification report:\n", rf_report)

KNN Classification report:
               precision    recall  f1-score   support

           0       0.83      0.88      0.85      1036
           1       0.59      0.48      0.53       373

    accuracy                           0.78      1409
   macro avg       0.71      0.68      0.69      1409
weighted avg       0.76      0.78      0.77      1409

RF Classification report:
               precision    recall  f1-score   support

           0       0.81      0.94      0.87      1036
           1       0.69      0.40      0.50       373

    accuracy                           0.79      1409
   macro avg       0.75      0.67      0.69      1409
weighted avg       0.78      0.79      0.77      1409



The RF model is good at predicting class 0 (Churn value 'No') but performs poorly on the class we actually want to predict: recall for class 1 is only 0.40. This is likely because the data is imbalanced and the minority class is the one we want to predict. SMOTE might help here.
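The core idea of SMOTE is to create a synthetic minority sample on the line segment between a minority point and one of its minority-class neighbours. The sketch below is a simplified illustration with made-up points: it always uses the single nearest neighbour, whereas the imblearn implementation picks randomly among several nearest neighbours. `synthetic_point` is a hypothetical helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

def synthetic_point(points, index, rng):
    """Interpolate between points[index] and its nearest minority neighbour."""
    base = points[index]
    distances = np.linalg.norm(points - base, axis=1)
    distances[index] = np.inf              # exclude the point itself
    neighbour = points[np.argmin(distances)]
    lam = rng.uniform(0.0, 1.0)            # random position along the segment
    return base + lam * (neighbour - base)

new_point = synthetic_point(minority, 0, rng)
print(new_point)
```

Every generated point lies between two existing minority samples, which is why the method adds plausible but not real observations.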

In [10]:
from imblearn.over_sampling import SMOTE

# repeating previous steps, but this time using SMOTE
# initializing SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# initializing the model
knn_model_smote = KNeighborsClassifier(n_neighbors=5)
# training the model
knn_model_smote.fit(X_train_smote, y_train_smote)
# predicting test results
knn_preds = knn_model_smote.predict(X_test)
# evaluating the model performance
knn_report = classification_report(y_test, knn_preds)
print("KNN with SMOTE Classification report:\n", knn_report)

# initializing the model
rf_model_smote = RandomForestClassifier(max_depth=5, random_state=42)
# training the model
rf_model_smote.fit(X_train_smote, y_train_smote)
# predicting test results
rf_preds_smote = rf_model_smote.predict(X_test)
# evaluating the model performance
rf_report = classification_report(y_test, rf_preds_smote)
print("RF with SMOTE Classification report:\n", rf_report)

KNN with SMOTE Classification report:
               precision    recall  f1-score   support

           0       0.87      0.72      0.78      1036
           1       0.47      0.70      0.56       373

    accuracy                           0.71      1409
   macro avg       0.67      0.71      0.67      1409
weighted avg       0.76      0.71      0.72      1409

RF with SMOTE Classification report:
               precision    recall  f1-score   support

           0       0.91      0.75      0.82      1036
           1       0.53      0.79      0.64       373

    accuracy                           0.76      1409
   macro avg       0.72      0.77      0.73      1409
weighted avg       0.81      0.76      0.77      1409



Overall accuracy is a little lower, but recall and f1-score for class 1 are better now (for RF, recall rose from 0.40 to 0.79 and f1-score from 0.50 to 0.64), meaning the models capture more of the positive instances. Next I try hyperparameter tuning to improve performance further, optimizing the weighted f1 score, which combines precision and recall.
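As a worked check of how these scores relate, the f1-score is the harmonic mean of precision and recall, and the weighted f1 averages the per-class f1-scores by support. The sketch below recomputes both from the rounded values in the KNN with SMOTE report above; because the inputs are already rounded, such a recomputation can differ from a report in the last digit.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 of the KNN with SMOTE report: precision 0.47, recall 0.70.
f1_class1 = f1(0.47, 0.70)

# Weighted f1 averages the per-class f1-scores by support (1036 and 373 rows).
f1_weighted = (0.78 * 1036 + 0.56 * 373) / (1036 + 373)

print(round(f1_class1, 2), round(f1_weighted, 2))  # 0.56 0.72
```

Since class 0 has almost three times the support of class 1, the weighted score is driven mostly by the majority class.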

In [11]:
from sklearn.model_selection import GridSearchCV

param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9]
}

# initializing grid search for knn model, 5 cross validations and we want to find the best f1 weighted score
grid_search_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5, scoring='f1_weighted')
grid_search_knn.fit(X_train_smote, y_train_smote)
# get best model
best_knn_model = grid_search_knn.best_estimator_
print("Best parameters for KNN: ", grid_search_knn.best_params_)
# evaluate
pred = best_knn_model.predict(X_test)
knn_report = classification_report(y_test, pred)
print("KNN with SMOTE and grid search Classification Report:\n", knn_report)

param_grid_rf = {
    'n_estimators': [75, 100, 125],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [2, 3, 5, None],
    'min_samples_split': [2, 4, 6]
}

# initializing grid search for rf model, 5 cross validations and we want to find the best f1 weighted score
grid_search_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=5, scoring='f1_weighted')
grid_search_rf.fit(X_train_smote, y_train_smote)
# get best model
best_rf_model = grid_search_rf.best_estimator_
print("Best parameters for RF: ", grid_search_rf.best_params_)
# evaluate
pred = best_rf_model.predict(X_test)
rf_report = classification_report(y_test, pred)
print("RF with SMOTE and grid search Classification Report:\n", rf_report)

Best parameters for KNN:  {'n_neighbors': 3}
KNN with SMOTE and grid search Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.73      0.79      1036
           1       0.47      0.68      0.56       373

    accuracy                           0.71      1409
   macro avg       0.67      0.70      0.67      1409
weighted avg       0.76      0.71      0.73      1409

Best parameters for RF:  {'criterion': 'entropy', 'max_depth': None, 'min_samples_split': 6, 'n_estimators': 125}
RF with SMOTE and grid search Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.86      0.85      1036
           1       0.60      0.59      0.59       373

    accuracy                           0.79      1409
   macro avg       0.72      0.72      0.72      1409
weighted avg       0.78      0.79      0.79      1409



Trying the same models again, but this time undersampling the majority class, because oversampling the minority class with SMOTE creates new synthetic data, which adds some noise and may affect the models' performance.
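The undersampling method used next, Tomek links, removes majority-class points that sit right next to a minority-class point. A Tomek link is a pair of points from different classes that are each other's nearest neighbour. The sketch below finds such pairs by brute force on made-up one-dimensional data; it illustrates the idea and is not the imblearn implementation.

```python
import numpy as np

# Made-up 1-D points with class labels; 0 is the majority class.
X_toy = np.array([[0.0], [0.8], [2.0], [2.2], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1])

# Pairwise distances, with the diagonal masked so a point is not its own neighbour.
dist = np.abs(X_toy - X_toy.T)
np.fill_diagonal(dist, np.inf)
nearest = dist.argmin(axis=1)

# A Tomek link is a pair of mutual nearest neighbours with different labels.
tomek_pairs = [
    (i, int(j)) for i, j in enumerate(nearest)
    if i < j and nearest[j] == i and y_toy[i] != y_toy[j]
]

# Undersampling drops the majority-class member of each link.
drop = {i if y_toy[i] == 0 else j for i, j in tomek_pairs}
keep = [i for i in range(len(y_toy)) if i not in drop]
print(tomek_pairs, keep)  # [(2, 3)] [0, 1, 3, 4]
```

Only borderline majority points are removed, so the class ratio changes much less than with SMOTE.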

In [12]:
from imblearn.under_sampling import TomekLinks

# initializing tomeklinks and doing undersampling
tl = TomekLinks()
X_under, y_under = tl.fit_resample(X_train, y_train)
# initializing the model
knn_model_tl = KNeighborsClassifier(n_neighbors=5)
# training the model
knn_model_tl.fit(X_under, y_under)
# predicting test results
knn_preds = knn_model_tl.predict(X_test)
# evaluating the model performance
knn_report = classification_report(y_test, knn_preds)
print("KNN with tomekLinks Classification report:\n", knn_report)

# initializing the model
rf_model_tl = RandomForestClassifier(max_depth=5, random_state=42)
# training the model
rf_model_tl.fit(X_under, y_under)
# predicting test results
rf_preds = rf_model_tl.predict(X_test)
# evaluating the model performance
rf_report = classification_report(y_test, rf_preds)
print("RF with tomekLinks Classification report:\n", rf_report)

KNN with tomekLinks Classification report:
               precision    recall  f1-score   support

           0       0.84      0.83      0.84      1036
           1       0.55      0.56      0.56       373

    accuracy                           0.76      1409
   macro avg       0.70      0.70      0.70      1409
weighted avg       0.76      0.76      0.76      1409

RF with tomekLinks Classification report:
               precision    recall  f1-score   support

           0       0.86      0.89      0.87      1036
           1       0.66      0.58      0.62       373

    accuracy                           0.81      1409
   macro avg       0.76      0.74      0.75      1409
weighted avg       0.80      0.81      0.81      1409



TomekLinks seems to give slightly better overall results (higher accuracy and weighted f1-score), so I'll keep the undersampled data and now try to improve the models with hyperparameter tuning via grid search.

In [13]:
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9]
}

# initializing grid search for knn model, 5 cross validations and we want to find the best f1 weighted score
grid_search_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5, scoring='f1_weighted')
grid_search_knn.fit(X_under, y_under)
# get best model
best_knn_model = grid_search_knn.best_estimator_
print("Best parameters for KNN: ", grid_search_knn.best_params_)
# evaluate
pred = best_knn_model.predict(X_test)
knn_report = classification_report(y_test, pred)
print("KNN with tomekLinks and grid search Classification Report:\n", knn_report)

param_grid_rf = {
    'n_estimators': [75, 100, 125],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [2, 3, 5, None],
    'min_samples_split': [2, 4, 6]
}

# initializing grid search for rf model, 5 cross validations and we want to find the best f1 weighted score
grid_search_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=5, scoring='f1_weighted')
grid_search_rf.fit(X_under, y_under)
# get best model
best_rf_model = grid_search_rf.best_estimator_
print("Best parameters for RF: ", grid_search_rf.best_params_)
# evaluate
pred = best_rf_model.predict(X_test)
rf_report = classification_report(y_test, pred)
print("RF with tomekLinks and grid search Classification Report:\n", rf_report)

Best parameters for KNN:  {'n_neighbors': 9}
KNN with tomekLinks and grid search Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.87      0.86      1036
           1       0.61      0.54      0.57       373

    accuracy                           0.79      1409
   macro avg       0.72      0.71      0.72      1409
weighted avg       0.78      0.79      0.78      1409

Best parameters for RF:  {'criterion': 'entropy', 'max_depth': None, 'min_samples_split': 6, 'n_estimators': 100}
RF with tomekLinks and grid search Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.87      0.86      1036
           1       0.63      0.60      0.61       373

    accuracy                           0.80      1409
   macro avg       0.74      0.74      0.74      1409
weighted avg       0.80      0.80      0.80      1409



So far I think the best model is RF with undersampling: it has the highest accuracy (0.81 untuned, 0.80 after grid search) and, more importantly, a good balance between precision and recall. While its F1 score for class 1 (Yes) is not the best, it also has a decent F1 score for class 0.
Next I try PCA to see whether reducing the dimensionality of the data gives better results with grid search.
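When `PCA` is given a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance ratio reaches that share. The sketch below shows this on made-up data with deliberately redundant columns; the toy data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data: 5 features, where the last two are near-copies of the first two.
base = rng.normal(size=(300, 3))
X_toy = np.column_stack([
    base,
    base[:, 0] + 0.01 * rng.normal(size=300),
    base[:, 1] + 0.01 * rng.normal(size=300),
])

# Standardize first so no feature dominates because of its scale.
X_scaled = StandardScaler().fit_transform(X_toy)

# Keep the fewest components that explain at least 90% of the variance.
pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, cumulative)
```

On the notebook's data, `pca.n_components_` after fitting would show how many of the 24 features' worth of dimensions were kept.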

In [14]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# performing PCA to maybe try and reduce dimensionality and find some patterns
# standardizing data before PCA
scaler = StandardScaler()
train_transformed = scaler.fit_transform(X_train)
test_transformed = scaler.transform(X_test)

# create PCA and use as many components needed to explain 90% of the variance
pca = PCA(0.9)
train_pca = pca.fit_transform(train_transformed)
test_pca = pca.transform(test_transformed)

param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9]
}

# initializing grid search for knn model, 5 cross validations and we want to find the best f1 weighted score
grid_search_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5, scoring='f1_weighted')
grid_search_knn.fit(train_pca, y_train)
# get best model
best_knn_model = grid_search_knn.best_estimator_
print("Best parameters for KNN: ", grid_search_knn.best_params_)
# evaluate
pred = best_knn_model.predict(test_pca)
knn_report = classification_report(y_test, pred)
print("KNN Classification Report:\n", knn_report)

param_grid_rf = {
    'n_estimators': [75, 100, 125],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [2, 3, 5, None],
    'min_samples_split': [2, 4, 6]
}

# initializing grid search for rf model, 5 cross validations and we want to find the best f1 weighted score
grid_search_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=5, scoring='f1_weighted')
grid_search_rf.fit(train_pca, y_train)
# get best model
best_rf_model = grid_search_rf.best_estimator_
print("Best parameters for RF: ", grid_search_rf.best_params_)
# evaluate
pred = best_rf_model.predict(test_pca)
rf_report = classification_report(y_test, pred)
print("RF Classification Report:\n", rf_report)

Best parameters for KNN:  {'n_neighbors': 9}
KNN Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.86      0.85      1036
           1       0.58      0.54      0.56       373

    accuracy                           0.77      1409
   macro avg       0.71      0.70      0.70      1409
weighted avg       0.77      0.77      0.77      1409

Best parameters for RF:  {'criterion': 'gini', 'max_depth': None, 'min_samples_split': 6, 'n_estimators': 125}
RF Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.90      0.86      1036
           1       0.62      0.44      0.52       373

    accuracy                           0.78      1409
   macro avg       0.72      0.67      0.69      1409
weighted avg       0.77      0.78      0.77      1409



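The best model is the RF model trained on undersampled data with parameters {'criterion': 'entropy', 'max_depth': None, 'min_samples_split': 6, 'n_estimators': 100}, because it combined balanced precision (0.63) and recall (0.60) for predicting class 1 with an overall accuracy of 80%. Predicting the majority class 0 was good on all the models. Note that the untuned RF on the same undersampled data scored marginally higher on this test split (class 1 f1-score 0.62 vs 0.61, accuracy 81%), so the grid search did not improve on it here.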