# Step 1: Reading the data and analyzing it 

In [1]:
import pandas as pd

df = pd.read_csv(r'C:\Users\wailb\Desktop\IronHack LABS\imbalanced-data-processing\files_for_lab\customer_churn.csv')

df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


### Statistics:

* SeniorCitizen: Binary (mostly non-senior citizens).

* Tenure: Ranges from 0 to 72 months. The median tenure is 29 months, indicating a mix of new and long-term customers.

* MonthlyCharges: Ranges from _18.25 USD_ to _118.75 USD_, with a median of _70.35 USD_.
This variation suggests diverse service packages among customers.

* TotalCharges: After conversion to numeric, shows a wide range (_18.80 USD_ to _8684.80 USD_), reflecting the varied tenure and monthly charges.
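
The numeric conversion of `TotalCharges` is not shown in the cells above. A minimal sketch of one common way to do it, assuming the column is read as strings and that unparseable entries (such as blanks) should become `NaN`; the `toy` frame and its values are made up for illustration and are not the real data:

```python
import pandas as pd

# Toy frame standing in for the churn data (illustrative values only).
toy = pd.DataFrame({
    "tenure": [0, 1, 34],
    "MonthlyCharges": [52.55, 29.85, 56.95],
    "TotalCharges": [" ", "29.85", "1889.5"],
})

# errors="coerce" turns strings that cannot be parsed (e.g. blanks) into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# describe() gives min, median (50%) and max, the figures quoted in the list above.
summary = toy[["tenure", "MonthlyCharges", "TotalCharges"]].describe()
print(summary.loc[["min", "50%", "max"]])
```

Applied to the real `df`, the same two calls would be expected to reproduce the ranges and medians listed above.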

# Step 2: Building Predictive Model 
###### Using Logistic Regression

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Extracting target variable 'Churn'
y = df['Churn'].map({'Yes': 1, 'No': 0}) # Converting categorical to numerical

# Extracting independent variables
X = df[['tenure', 'SeniorCitizen', 'MonthlyCharges']]

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling the independent variables
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Building the Logistic Regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Predictions
y_pred = model.predict(X_test_scaled)

# Evaluating the model
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

conf_matrix, class_report

(array([[957,  79],
        [192, 181]], dtype=int64),
 '              precision    recall  f1-score   support\n\n           0       0.83      0.92      0.88      1036\n           1       0.70      0.49      0.57       373\n\n    accuracy                           0.81      1409\n   macro avg       0.76      0.70      0.72      1409\nweighted avg       0.80      0.81      0.80      1409\n')

### Model Evaluation:

##### Confusion Matrix:

* True Negatives (TN): 957
* False Positives (FP): 79
* False Negatives (FN): 192
* True Positives (TP): 181

##### Classification Report:

* Accuracy: 81% (Correct predictions overall)
* Precision: 0.83 for class 0 (No Churn), 0.70 for class 1 (Churn)
* Recall: 0.92 for class 0, 0.49 for class 1
* F1-Score: 0.88 for class 0, 0.57 for class 1
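
To make the link between the two lists explicit, the churn-class figures can be recomputed by hand from the four confusion-matrix counts. This is a small sketch; the helper name `churn_metrics` is introduced here for illustration and is not part of scikit-learn:

```python
def churn_metrics(tn, fp, fn, tp):
    """Return accuracy plus precision, recall and F1 for the positive (churn) class."""
    precision = tp / (tp + fp)   # of all predicted churners, how many churned
    recall = tp / (tp + fn)      # of all actual churners, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts copied from the confusion matrix above.
metrics = churn_metrics(tn=957, fp=79, fn=192, tp=181)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The rounded values match the class 1 row of the classification report (0.70, 0.49, 0.57) and the overall accuracy of 0.81.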

#### Conclusion :  
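An accuracy of 81% looks strong for a simple logistic regression model, but it largely reflects class imbalance rather than the sophistication of the model: 1036 of the 1409 test customers (about 74%) did not churn, so a model that always predicts "No Churn" would already reach roughly that accuracy. The more telling figure is the churn recall of 0.49: the model misses just over half of the customers who actually churn (192 of 373).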
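
As a quick check on how much of the accuracy comes from class imbalance alone, the arithmetic below compares the model with a trivial rule that always predicts "No Churn". The counts are taken from the support column and confusion matrix above; the variable names are only illustrative:

```python
# Test-set class counts from the support column of the classification report.
n_no_churn, n_churn = 1036, 373
n_total = n_no_churn + n_churn

# A rule that always predicts "No Churn" is right on every non-churner and wrong on every churner.
baseline_accuracy = n_no_churn / n_total

# Logistic regression accuracy: (TN + TP) / total, from the confusion matrix.
model_accuracy = (957 + 181) / n_total

print(f"majority-class baseline: {baseline_accuracy:.3f}")
print(f"logistic regression:     {model_accuracy:.3f}")
```

The gap between the two numbers is the part of the accuracy that the model actually earns.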

# Applying Synthetic Minority Oversampling Technique (SMOTE)

In [6]:
from imblearn.over_sampling import SMOTE

# Applying SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)

# Checking the balance after SMOTE
balance_check = y_train_smote.value_counts()

# Re-building the Logistic Regression model with SMOTE-applied data
model_smote = LogisticRegression()
model_smote.fit(X_train_smote, y_train_smote)

# Predictions with SMOTE model
y_pred_smote = model_smote.predict(X_test_scaled)

# Evaluating the SMOTE model
conf_matrix_smote = confusion_matrix(y_test, y_pred_smote)
class_report_smote = classification_report(y_test, y_pred_smote)

balance_check, conf_matrix_smote, class_report_smote


(Churn
 0    4138
 1    4138
 Name: count, dtype: int64,
 array([[762, 274],
        [ 85, 288]], dtype=int64),
 '              precision    recall  f1-score   support\n\n           0       0.90      0.74      0.81      1036\n           1       0.51      0.77      0.62       373\n\n    accuracy                           0.75      1409\n   macro avg       0.71      0.75      0.71      1409\nweighted avg       0.80      0.75      0.76      1409\n')

### Model Evaluation with SMOTE:

##### Confusion Matrix:

* True Negatives (TN): 762
* False Positives (FP): 274
* False Negatives (FN): 85
* True Positives (TP): 288

##### Classification Report:

* Accuracy: 75% (down from 81% for the initial model without SMOTE, but this is not the whole story).

* Precision for No Churn (0): 0.90 (90% of the customers predicted as No Churn did not churn).

* Precision for Churn (1): 0.51 (Only about half of the customers flagged as churners actually churned, so false alarms are common).

* Recall for No Churn (0): 0.74 (About three quarters of the actual No Churn cases were correctly identified).

* Recall for Churn (1): 0.77 (A considerable improvement over the initial model's 0.49, showing better identification of actual Churn cases).

* F1-Score for No Churn (0): 0.81

* F1-Score for Churn (1): 0.62


#### Conclusion :

Accuracy dropped from 81% to 75% after applying SMOTE, but churn recall rose from 0.49 to 0.77: the model now misses 85 churners instead of 192. The price is lower churn precision (0.51 versus 0.70), that is, 274 false alarms instead of 79. Since identifying churn cases is often the primary objective in churn prediction, this showcases the importance of looking beyond accuracy when evaluating models, especially in imbalanced dataset scenarios.
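
For intuition about where the extra minority samples come from, here is a simplified NumPy sketch of the core SMOTE idea: a synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is not imbalanced-learn's implementation; the function `synthesize`, the toy points and the choice `k=2` are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points in a 2-D feature space (illustrative values).
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def synthesize(points, k=2, rng=rng):
    """Create one synthetic point by interpolating toward a random near neighbour."""
    i = rng.integers(len(points))
    # Distances from the chosen point to all others; exclude the point itself.
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf
    neighbours = np.argsort(dists)[:k]
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])

new_point = synthesize(minority)
print(new_point)
```

Because every synthetic point lies between two existing minority samples, oversampling of this kind fills in the minority region of feature space rather than duplicating rows.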

# Applying Tomek Links

In [7]:
from imblearn.under_sampling import TomekLinks

# Applying TomekLinks for under-sampling
tl = TomekLinks()
X_train_tl, y_train_tl = tl.fit_resample(X_train_scaled, y_train)

# Checking the balance after applying TomekLinks
balance_check_tl = y_train_tl.value_counts()

# Re-building the Logistic Regression model with TomekLinks-applied data
model_tl = LogisticRegression()
model_tl.fit(X_train_tl, y_train_tl)

# Predictions with TomekLinks model
y_pred_tl = model_tl.predict(X_test_scaled)

# Evaluating the TomekLinks model
conf_matrix_tl = confusion_matrix(y_test, y_pred_tl)
class_report_tl = classification_report(y_test, y_pred_tl)

balance_check_tl, conf_matrix_tl, class_report_tl


(Churn
 0    3725
 1    1496
 Name: count, dtype: int64,
 array([[914, 122],
        [168, 205]], dtype=int64),
 '              precision    recall  f1-score   support\n\n           0       0.84      0.88      0.86      1036\n           1       0.63      0.55      0.59       373\n\n    accuracy                           0.79      1409\n   macro avg       0.74      0.72      0.72      1409\nweighted avg       0.79      0.79      0.79      1409\n')

### Model Evaluation with Tomek Links:

##### Confusion Matrix:

* True Negatives (TN): 914
* False Positives (FP): 122
* False Negatives (FN): 168
* True Positives (TP): 205

##### Classification Report:

* Accuracy: 79% (an improvement compared to the SMOTE-adjusted model, and slightly below the initial model's 81%).

* Precision for No Churn (0): 0.84 (Indicating a solid performance in correctly predicting the no churn cases).

* Precision for Churn (1): 0.63 (The share of predicted churners who actually churned).

* Recall for No Churn (0): 0.88 (A high rate of correctly identifying actual No Churn cases).

* Recall for Churn (1): 0.55 (An improvement over the initial model but not as high as with SMOTE, showing a more balanced approach).

* F1-Score for No Churn (0): 0.86

* F1-Score for Churn (1): 0.59
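
To compare the three models side by side, the sketch below rebuilds the headline numbers from the confusion matrices reported in this notebook. The row labels and column names are chosen here for readability and are not produced by scikit-learn:

```python
import pandas as pd

# (TN, FP, FN, TP) copied from the three confusion matrices in this notebook.
counts = {
    "baseline": (957, 79, 192, 181),
    "SMOTE": (762, 274, 85, 288),
    "Tomek Links": (914, 122, 168, 205),
}

rows = []
for name, (tn, fp, fn, tp) in counts.items():
    rows.append({
        "model": name,
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "churn_precision": tp / (tp + fp),
        "churn_recall": tp / (tp + fn),
        "missed_churners": fn,   # false negatives
        "false_alarms": fp,      # false positives
    })

comparison = pd.DataFrame(rows).set_index("model").round(2)
print(comparison)
```

Reading across the rows makes the trade-off visible: each gain in churn recall is paid for with more false alarms.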

#### Conclusion :
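Tomek Links removed 413 majority-class samples from the training set (4138 down to 3725), so the classes remain imbalanced (3725 versus 1496). Compared with the initial model, churn recall improved modestly (0.49 to 0.55) at a small cost in accuracy (81% to 79%) and churn precision (0.70 to 0.63), placing this model between the initial one and the SMOTE one. This indicates that under-sampling of this kind refines the model mainly by cleaning up the boundary between the classes rather than by rebalancing them.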
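
A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The sketch below finds such pairs with plain NumPy on a toy 1-D dataset; it is a simplified illustration and not imbalanced-learn's implementation, and the function name `tomek_links` and the data are assumptions:

```python
import numpy as np

# Toy 1-D data: class 0 on the left, class 1 on the right, one close pair at the border.
X = np.array([[0.0], [1.0], [2.9], [3.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    # Pairwise Euclidean distances; the diagonal is masked so a point is never its own neighbour.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)
    return [
        (i, int(j))
        for i, j in enumerate(nearest)
        if nearest[j] == i and y[i] != y[j] and i < j
    ]

links = tomek_links(X, y)
print(links)  # [(2, 3)]
```

Dropping the majority-class member of each link (index 2 in this toy example) widens the gap between the classes, which is consistent with the output above, where only class 0 shrank.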