# Instructions
* 1. Load the dataset and explore the variables.
* 2. We will try to predict the variable Churn using a logistic regression on the variables tenure, SeniorCitizen, and MonthlyCharges.
* 3. Extract the target variable.
* 4. Extract the independent variables and scale them.
* 5. Build the logistic regression model.
* 6. Evaluate the model.
* 7. Even a simple model will give us more than 70% accuracy. Why?
* 8. Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique based on nearest neighbors that adds new points between existing points. Apply imblearn.over_sampling.SMOTE to the dataset. Build and evaluate the logistic regression model. Is there any improvement?
* 9. Tomek links are pairs of very close instances of opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, facilitating the classification process. Apply imblearn.under_sampling.TomekLinks to the dataset. Build and evaluate the logistic regression model. Is there any improvement?

# Load the dataset and explore the variables.

In [1]:
# Import libraries and dependencies

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

In [2]:
customer_data = pd.read_csv("files_for_lab/customer_churn.csv")
customer_data

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


In [3]:
customer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


# Extract the target variable.

In [4]:
X = customer_data[['SeniorCitizen', 'MonthlyCharges', 'tenure']]
y = customer_data['Churn']

In [5]:
display(X.head(3))
display(y.head(3)) #sanity checks

Unnamed: 0,SeniorCitizen,MonthlyCharges,tenure
0,0,29.85,1
1,0,56.95,34
2,0,53.85,2


0     No
1     No
2    Yes
Name: Churn, dtype: object

In [6]:
y.value_counts(normalize=True) #this means that we have an imbalanced target variable

No     0.73463
Yes    0.26537
Name: Churn, dtype: float64

# Train/Test Split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [8]:
y_train.value_counts(normalize=True)

No     0.733582
Yes    0.266418
Name: Churn, dtype: float64

In [9]:
y_test.value_counts(normalize=True) #we don't mind the imbalance here

No     0.738822
Yes    0.261178
Name: Churn, dtype: float64

# Scaling the independent variables

In [10]:
scaler = MinMaxScaler()  # we chose MinMaxScaler because it does not assume that our data are normally distributed

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
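
`MinMaxScaler` rescales each column using the minimum and maximum learned from the training data, and the same training statistics are then applied to the test set. A rough NumPy sketch of that arithmetic, on made-up numbers rather than the churn data:

```python
import numpy as np

# Made-up training and test values for a single feature (e.g. tenure in months).
train = np.array([[1.0], [34.0], [2.0], [45.0]])
test = np.array([[24.0], [72.0]])

# "fit": learn the per-column minimum and maximum from the training set only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# "transform": apply the training statistics to both sets.
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled.ravel())
print(test_scaled.ravel())  # test values may fall outside [0, 1]
```

Because the test set is scaled with the training minimum and maximum, a test value larger than anything seen in training ends up above 1. That is expected and avoids leaking test information into the scaler.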

# Build the logistic regression model

In [11]:
model = LogisticRegression()
model.fit(X_train, y_train)

pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
print(classification_report(y_test, pred_test))
print(classification_report(y_train, pred_train))

              precision    recall  f1-score   support

          No       0.82      0.90      0.86      1041
         Yes       0.62      0.46      0.53       368

    accuracy                           0.78      1409
   macro avg       0.72      0.68      0.69      1409
weighted avg       0.77      0.78      0.77      1409

              precision    recall  f1-score   support

          No       0.82      0.91      0.87      4133
         Yes       0.66      0.46      0.54      1501

    accuracy                           0.79      5634
   macro avg       0.74      0.69      0.70      5634
weighted avg       0.78      0.79      0.78      5634



### Evaluate the model

* Test accuracy is almost 80% (0.78), yet the per-class F1 scores are clearly imbalanced: 0.86 for No versus 0.53 for Yes. The overall accuracy is carried by the majority class. Recall for Yes is only 0.46, so the model misses more than half of the customers who actually churn. In other words, our model has learned the 'No's of Churn much better than the 'Yes's.
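
F1 is the harmonic mean of precision and recall, so a low recall drags it down even when precision is moderate. A quick check using the rounded test-set values from the first report above (the function name `f1` is only illustrative):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision/recall values copied from the test-set classification report.
f1_no = f1(0.82, 0.90)
f1_yes = f1(0.62, 0.46)

print(round(f1_no, 2), round(f1_yes, 2))
```

Because the inputs are already rounded to two decimals, this only approximately reproduces what `classification_report` computes from the raw counts.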

# Even a simple model will give us more than 70% accuracy. Why?

>More than 73% of the values in our target variable are "No", so a model that predicted "No" for every customer would already reach about 73% accuracy. Any model that learns the "No" class well and gets even a small share of the "Yes" cases right will therefore end up above 70%.
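
This can be checked with a majority-class baseline, i.e. a "model" that always answers No. A minimal sketch using the test-set class counts from the reports above (1041 No, 368 Yes); the helper name `majority_baseline_accuracy` is hypothetical:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)

# Labels rebuilt from the test-set support shown in the classification report.
labels = ["No"] * 1041 + ["Yes"] * 368
baseline = majority_baseline_accuracy(labels)

print(f"Always predicting 'No' gives accuracy {baseline:.3f}")
```

Any accuracy figure for this dataset is best read against that baseline rather than against 50%.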

# Synthetic Minority Oversampling Technique (SMOTE)

In [12]:
sm = SMOTE(k_neighbors=3)

X_train_SMOTE, y_train_SMOTE = sm.fit_resample(X_train, y_train)
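
To make the idea of "new points between existing points" concrete, here is a simplified NumPy sketch of the interpolation step: pick a minority sample, pick one of its k nearest minority neighbours, and place a synthetic point somewhere on the segment between them. This is an illustration on toy data, not imblearn's actual implementation, and `smote_like_samples` is a hypothetical helper.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))   # random minority sample
        j = rng.choice(neighbours[i])  # one of its k nearest neighbours
        lam = rng.random()             # position along the segment, in [0, 1)
        new_points.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new_points)

# Toy minority class with two already-scaled features.
X_min = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.8, 0.9], [0.9, 0.8]])
synthetic = smote_like_samples(X_min, n_new=4, k=2)
print(synthetic)
```

With a small `k` the synthetic points stay close to local clusters. With a very large `k` a sample can be paired with a distant neighbour, so new points are spread more widely across the minority region.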

In [13]:
model = LogisticRegression()
model.fit(X_train_SMOTE, y_train_SMOTE)

pred_train_SMOTE = model.predict(X_train_SMOTE)
pred_test_SMOTE = model.predict(X_test)
print(classification_report(y_test, pred_test_SMOTE))
print(classification_report(y_train_SMOTE, pred_train_SMOTE))

              precision    recall  f1-score   support

          No       0.88      0.72      0.79      1041
         Yes       0.47      0.71      0.57       368

    accuracy                           0.72      1409
   macro avg       0.67      0.72      0.68      1409
weighted avg       0.77      0.72      0.73      1409

              precision    recall  f1-score   support

          No       0.74      0.73      0.73      4133
         Yes       0.73      0.74      0.74      4133

    accuracy                           0.73      8266
   macro avg       0.73      0.73      0.73      8266
weighted avg       0.73      0.73      0.73      8266



#### Changing the number of neighbors from 3 to 1000

In [14]:
sm_1000 = SMOTE(k_neighbors=1000)

X_train_SMOTE_1000, y_train_SMOTE_1000 = sm_1000.fit_resample(X_train, y_train)

In [15]:
model = LogisticRegression()
model.fit(X_train_SMOTE_1000, y_train_SMOTE_1000)

pred_train_SMOTE_1000 = model.predict(X_train_SMOTE_1000)
pred_test_SMOTE_1000 = model.predict(X_test)
print(classification_report(y_test, pred_test_SMOTE_1000))
print(classification_report(y_train_SMOTE_1000, pred_train_SMOTE_1000))

              precision    recall  f1-score   support

          No       0.87      0.72      0.79      1041
         Yes       0.47      0.69      0.56       368

    accuracy                           0.71      1409
   macro avg       0.67      0.70      0.67      1409
weighted avg       0.76      0.71      0.73      1409

              precision    recall  f1-score   support

          No       0.77      0.74      0.76      4133
         Yes       0.75      0.78      0.77      4133

    accuracy                           0.76      8266
   macro avg       0.76      0.76      0.76      8266
weighted avg       0.76      0.76      0.76      8266



#### Conclusion
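* Test accuracy is slightly lower than for the basic logistic regression model (0.72 versus 0.78). This, in my opinion, is a fair trade-off: on the resampled training set the 'Yes's and the 'No's are now almost perfectly balanced (F1 of 0.74 and 0.73), and on the test set recall for 'Yes' rises from 0.46 to 0.71 while its F1 improves from 0.53 to 0.57. The predictive strength of our model is still decent, and what the model has learned is much more balanced, which is very valuable. This is with k_neighbors = 3. With a much greater value (k_neighbors = 1000), the training accuracy and F1 scores go up slightly (from about 0.73 to 0.76), but the test results are essentially unchanged (accuracy 0.71, F1 for 'Yes' 0.56).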
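
To put the trade-off in terms of customers, the rounded test-set precision and recall for 'Yes' can be turned into approximate counts of churners caught and false alarms raised. The figures below are estimates only, since they are derived from values already rounded to two decimals in the reports above.

```python
# Rounded 'Yes' metrics copied from the test-set reports (baseline vs SMOTE, k_neighbors=3).
support_yes = 368
baseline = {"precision": 0.62, "recall": 0.46}
smote = {"precision": 0.47, "recall": 0.71}

def approx_counts(metrics, support):
    """Approximate true positives and false positives from rounded metrics."""
    true_pos = round(metrics["recall"] * support)
    false_pos = round(true_pos / metrics["precision"] - true_pos)
    return true_pos, false_pos

tp_base, fp_base = approx_counts(baseline, support_yes)
tp_smote, fp_smote = approx_counts(smote, support_yes)

print("baseline: churners caught ~", tp_base, "| false alarms ~", fp_base)
print("SMOTE:    churners caught ~", tp_smote, "| false alarms ~", fp_smote)
```

Whether the extra churners caught are worth the extra false alarms depends on the relative cost of missing a churner versus contacting a loyal customer, which the notebook does not specify.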

# Tomek links

In [16]:
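# sampling_strategy='all' removes both members of every Tomek link, not only the majority-class one
tl = TomekLinks(sampling_strategy='all')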
X_train_tl, y_train_tl = tl.fit_resample(X_train, y_train)
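
A Tomek link is a pair of points from opposite classes that are each other's nearest neighbour. The following NumPy sketch shows one way such pairs can be found on toy data; `find_tomek_links` is a hypothetical helper and not the imblearn code.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j), with i < j, that are mutual nearest
    neighbours and carry different labels."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # ignore the distance of a point to itself
    nearest = dists.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Toy data: points 2 and 3 are very close to each other but have opposite labels.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.5, 0.5], [0.55, 0.5], [1.0, 1.0]])
y = np.array(["No", "No", "No", "Yes", "Yes"])

links = find_tomek_links(X, y)
print(links)
```

Dropping only the 'No' member of each pair corresponds to removing the majority side of the link. In the notebook, `sampling_strategy='all'` drops both members, which matches the training report below, where both classes lose 389 samples (4133 to 3744 and 1501 to 1112).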

In [17]:
model = LogisticRegression()
model.fit(X_train_tl, y_train_tl)

pred_train_tl = model.predict(X_train_tl)
pred_test_tl = model.predict(X_test)
print(classification_report(y_test, pred_test_tl))
print(classification_report(y_train_tl, pred_train_tl))

              precision    recall  f1-score   support

          No       0.83      0.90      0.86      1041
         Yes       0.63      0.46      0.53       368

    accuracy                           0.79      1409
   macro avg       0.73      0.68      0.70      1409
weighted avg       0.77      0.79      0.78      1409

              precision    recall  f1-score   support

          No       0.87      0.93      0.90      3744
         Yes       0.70      0.52      0.60      1112

    accuracy                           0.84      4856
   macro avg       0.78      0.73      0.75      4856
weighted avg       0.83      0.84      0.83      4856



#### Conclusion
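* Finally, this undersampling technique has increased our model's accuracy on the training set (0.84) and, marginally, on the test set (0.79 versus 0.78 for the basic model). The trade-off here is the balance between 'Yes's and 'No's: the F1 scores on the resampled training set are 0.90 for No and 0.60 for Yes, and on the test set they are 0.86 and 0.53, the same as for the basic model. So the model predicts the 'No's very accurately but still misses most of the 'Yes's (test recall 0.46). I would say that this model performs marginally better than the basic logistic regression model but is still outperformed by the SMOTE model when it comes to identifying churners.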