# Lab | Cross Validation
## Instructions
### Apply SMOTE for upsampling the data

- Use logistic regression to fit the model and compute the accuracy of the model.
- Use decision tree classifier to fit the model and compute the accuracy of the model.
- Compare the accuracies of the two models.
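
To make the upsampling step concrete, here is a minimal pure-Python sketch of the idea behind SMOTE: each synthetic minority point is placed on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, the toy data, and the parameters `k` and `seed` are assumptions made for illustration; this is not how `imblearn.over_sampling.SMOTE` is implemented, and the lab itself should use the library class.

```python
import random
from collections import Counter
from math import dist


def smote_sketch(points, labels, minority, k=2, seed=0):
    """Add synthetic minority points until both classes have the same size.

    Illustrative only: assumes exactly two classes and at least two
    minority samples.
    """
    rng = random.Random(seed)
    minority_pts = [p for p, lab in zip(points, labels) if lab == minority]
    n_majority = len(points) - len(minority_pts)
    n_needed = n_majority - len(minority_pts)

    synthetic = []
    for _ in range(n_needed):
        i = rng.randrange(len(minority_pts))
        base = minority_pts[i]
        # k nearest minority neighbours of the chosen sample (excluding itself).
        neighbours = sorted(
            (p for j, p in enumerate(minority_pts) if j != i),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))

    return points + synthetic, labels + [minority] * len(synthetic)


# Hypothetical toy data: six majority points (class 0), three minority points (class 1).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (1.1, 1.1), (0.8, 0.9), (1.3, 1.2),
          (3.0, 3.0), (3.4, 3.2), (3.1, 3.6)]
labels = [0] * 6 + [1] * 3

new_points, new_labels = smote_sketch(points, labels, minority=1)
print(Counter(labels), "->", Counter(new_labels))
```

Because the synthetic points are interpolated rather than copied, they stay inside the region already occupied by the minority class, which is the main difference from plain random oversampling.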

### Apply TomekLinks for downsampling

- Remember that TomekLinks does not make the two classes equal in size; it only removes majority-class points that form a Tomek link with a minority-class point (the two points are each other's nearest neighbours).
- Use logistic regression to fit the model and compute the accuracy of the model.
- Use decision tree classifier to fit the model and compute the accuracy of the model.
- Compare the accuracies of the two models.
- You can also apply the algorithm a second time and check how the class imbalance changes compared with the first pass.
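
The behaviour described in the list above can be sketched in pure Python on toy data. A Tomek link is a pair of points from different classes that are each other's nearest neighbour; the sketch drops the majority-class member of each link and repeats the pass once. The helper names `tomek_links` and `remove_majority_in_links` and the 1-D data are hypothetical, chosen only for illustration; the lab itself should use `imblearn.under_sampling.TomekLinks`.

```python
from collections import Counter
from math import dist


def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def remove_majority_in_links(points, labels):
    """Drop the majority-class member of every Tomek link (one pass)."""
    majority = Counter(labels).most_common(1)[0][0]
    drop = {k for pair in tomek_links(points, labels) for k in pair
            if labels[k] == majority}
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Toy 1-D data: class 0 is the majority, class 1 the minority.
points = [(0.0,), (1.0,), (2.5,), (3.0,), (3.4,), (5.0,), (5.3,), (8.0,)]
labels = [0, 0, 0, 0, 1, 0, 1, 0]

pts1, lab1 = remove_majority_in_links(points, labels)  # first pass
pts2, lab2 = remove_majority_in_links(pts1, lab1)      # second pass
print(Counter(labels), Counter(lab1), Counter(lab2))
```

On this toy data the majority class shrinks from 6 to 4 points after the first pass and to 3 after the second, while the minority class keeps its 2 points, so the classes move closer together but are never forced to be equal.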

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('files_for_lab/Customer-Churn.csv')

In [3]:
df.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


In [4]:
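# Treat blank strings as missing values so the column can be cast to float.
df['TotalCharges'] = df['TotalCharges'].replace(' ', np.nan)
df['TotalCharges'] = df['TotalCharges'].astype(float)
# Impute missing values with the median; assigning the result avoids the
# chained in-place fillna, which recent pandas versions warn about.
median_value = df['TotalCharges'].median()
df['TotalCharges'] = df['TotalCharges'].fillna(median_value)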

df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   float64
 15  Churn             7043 non-null   int64  
dtypes: float64(2), int64(3), object(11)
memory

# Benchmark Logistic Regression Model

In [6]:
X = df[['tenure','SeniorCitizen','MonthlyCharges','TotalCharges']]
y = df['Churn'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


model = LogisticRegression()


model.fit(X_train, y_train)


y_pred = model.predict(X_test)


print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.93      0.88      1036
           1       0.70      0.48      0.57       373

    accuracy                           0.81      1409
   macro avg       0.76      0.70      0.72      1409
weighted avg       0.80      0.81      0.79      1409



# Benchmark Decision Tree Classifier

In [7]:
X = df[['tenure','SeniorCitizen','MonthlyCharges','TotalCharges']]
y = df['Churn'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


model = DecisionTreeClassifier()


model.fit(X_train, y_train)


y_pred = model.predict(X_test)


print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.81      0.82      1036
           1       0.50      0.52      0.51       373

    accuracy                           0.73      1409
   macro avg       0.66      0.67      0.66      1409
weighted avg       0.74      0.73      0.74      1409



# Logistic regression after SMOTE

In [8]:
from imblearn.over_sampling import SMOTE

X = df[['tenure','SeniorCitizen','MonthlyCharges','TotalCharges']]
y = df['Churn'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


smote = SMOTE()
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)


model_smote = LogisticRegression()
model_smote.fit(X_train_smote, y_train_smote)

y_pred_smote = model_smote.predict(X_test)
print("Classification Report after SMOTE:")
print(classification_report(y_test, y_pred_smote))

Classification Report after SMOTE:
              precision    recall  f1-score   support

           0       0.90      0.73      0.81      1036
           1       0.51      0.77      0.61       373

    accuracy                           0.74      1409
   macro avg       0.70      0.75      0.71      1409
weighted avg       0.80      0.74      0.76      1409



# Decision Tree Classifier after SMOTE

In [9]:
X = df[['tenure','SeniorCitizen','MonthlyCharges','TotalCharges']]
y = df['Churn'] # Target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

smote = SMOTE() 
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

model_smote = DecisionTreeClassifier()
model_smote.fit(X_train_smote, y_train_smote)

y_pred_smote = model_smote.predict(X_test)
print("Classification Report after SMOTE:")
print(classification_report(y_test, y_pred_smote))

Classification Report after SMOTE:
              precision    recall  f1-score   support

           0       0.83      0.75      0.79      1036
           1       0.45      0.57      0.50       373

    accuracy                           0.70      1409
   macro avg       0.64      0.66      0.64      1409
weighted avg       0.73      0.70      0.71      1409



# Logistic regression after TomekLinks

In [10]:
from imblearn.under_sampling import TomekLinks

X = df[['tenure','SeniorCitizen','MonthlyCharges','TotalCharges']]
y = df['Churn'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


tomek = TomekLinks()
X_train_tomek, y_train_tomek = tomek.fit_resample(X_train, y_train)


model_tomek = LogisticRegression()
model_tomek.fit(X_train_tomek, y_train_tomek)

y_pred_tomek = model_tomek.predict(X_test)
print("Classification Report after TomekLinks:")
print(classification_report(y_test, y_pred_tomek))

Classification Report after TomekLinks:
              precision    recall  f1-score   support

           0       0.85      0.88      0.86      1036
           1       0.62      0.55      0.58       373

    accuracy                           0.79      1409
   macro avg       0.73      0.72      0.72      1409
weighted avg       0.79      0.79      0.79      1409



# Decision Tree Classifier after TomekLinks

In [11]:
from imblearn.under_sampling import TomekLinks

X = df[['tenure','SeniorCitizen','MonthlyCharges','TotalCharges']]
y = df['Churn'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


tomek = TomekLinks()
X_train_tomek, y_train_tomek = tomek.fit_resample(X_train, y_train)


model_tomek = DecisionTreeClassifier()
model_tomek.fit(X_train_tomek, y_train_tomek)

y_pred_tomek = model_tomek.predict(X_test)
print("Classification Report after TomekLinks:")
print(classification_report(y_test, y_pred_tomek))

Classification Report after TomekLinks:
              precision    recall  f1-score   support

           0       0.83      0.78      0.81      1036
           1       0.48      0.57      0.52       373

    accuracy                           0.72      1409
   macro avg       0.66      0.67      0.66      1409
weighted avg       0.74      0.72      0.73      1409



# Conclusion

|  | Logistic Regression | Decision Tree | Logistic Regression (SMOTE) | Decision Tree (SMOTE) | Logistic Regression (TomekLinks) | Decision Tree (TomekLinks) |
|----------|----------|----------|----------|----------|----------|----------|
| Accuracy | 0.81 | 0.73 | 0.74 | 0.70 | 0.79 | 0.72 |

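The benchmark logistic regression achieves the highest accuracy (0.81). Accuracy alone is misleading on this imbalanced data, however: the benchmark model's recall on the churn class is only 0.48, whereas logistic regression after SMOTE reaches 0.77 at the cost of a lower accuracy (0.74).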
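
As a rough sanity check on these accuracy figures, the short calculation below uses only the test-set class counts (the `support` column) and the class-1 recall values printed in the logistic regression reports above. It shows the accuracy a model would get by always predicting "no churn" and roughly how many churners each logistic model catches. The variable names are illustrative, and the churner counts are approximate because the reported recalls are rounded to two decimals.

```python
# Test-set class counts, taken from the "support" column of the reports above.
n_no_churn, n_churn = 1036, 373
n_total = n_no_churn + n_churn

# Accuracy of a model that always predicts the majority class (no churn).
baseline_accuracy = n_no_churn / n_total
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")

# Class-1 recall of each logistic regression model, as printed in the reports.
recall_churn = {"benchmark": 0.48, "SMOTE": 0.77, "TomekLinks": 0.55}
caught = {name: round(recall * n_churn) for name, recall in recall_churn.items()}
for name, n in caught.items():
    print(f"{name}: about {n} of {n_churn} churners identified")
```

The baseline comes out at roughly 0.735, so the decision tree accuracies in the table are at or below what a constant prediction would achieve, which is one reason to look at recall and F1 alongside accuracy.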