# Telecom Churn Analysis

## Problem Statement

The telecom industry is highly competitive, and customer churn is a significant concern. Retaining existing customers is often more cost-effective than acquiring new ones. This case study aims to build a predictive model to identify customers who are likely to churn, allowing the company to take proactive measures to retain them.

The primary objective is to develop a machine learning model that can predict customer churn with high accuracy. The model will use historical data to identify patterns or characteristics of customers who have churned in the past.

#### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

## Importing data

In [2]:
df = pd.read_csv('train.csv')

### Feature Engineering

In [3]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,3704-IEAXF,Female,0,Yes,Yes,72,No,No phone service,DSL,No,...,No,Yes,Yes,Yes,Two year,No,Credit card (automatic),53.65,3784.0,0
1,5175-AOBHI,Female,0,No,No,4,Yes,No,DSL,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,46.0,193.6,1
2,6922-NCEDI,Male,0,No,Yes,56,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,One year,Yes,Mailed check,21.2,1238.65,0
3,3143-ILDAL,Male,0,No,No,56,Yes,Yes,Fiber optic,No,...,No,Yes,No,Yes,Month-to-month,Yes,Electronic check,94.45,5124.6,1
4,0872-NXJYS,Female,0,No,No,9,Yes,No,Fiber optic,No,...,No,No,No,Yes,Month-to-month,Yes,Electronic check,79.55,723.4,1


In [4]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df = df.drop(['customerID'], axis = 1)
df.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,Yes,72,No,No phone service,DSL,No,Yes,No,Yes,Yes,Yes,Two year,No,Credit card (automatic),53.65,3784.0,0
1,Female,0,No,No,4,Yes,No,DSL,No,No,No,No,No,No,Month-to-month,Yes,Mailed check,46.0,193.6,1
2,Male,0,No,Yes,56,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,One year,Yes,Mailed check,21.2,1238.65,0
3,Male,0,No,No,56,Yes,Yes,Fiber optic,No,Yes,No,Yes,No,Yes,Month-to-month,Yes,Electronic check,94.45,5124.6,1
4,Female,0,No,No,9,Yes,No,Fiber optic,No,No,No,No,No,Yes,Month-to-month,Yes,Electronic check,79.55,723.4,1


In [5]:
# Checking individual categories of our categorical columns
for col in df.describe(include = 'object').columns:
    print(col)
    print(df[col].unique())
    print('-'*30)

gender
['Female' 'Male']
------------------------------
Partner
['Yes' 'No']
------------------------------
Dependents
['Yes' 'No']
------------------------------
PhoneService
['No' 'Yes']
------------------------------
MultipleLines
['No phone service' 'No' 'Yes']
------------------------------
InternetService
['DSL' 'No' 'Fiber optic']
------------------------------
OnlineSecurity
['No' 'No internet service' 'Yes']
------------------------------
OnlineBackup
['Yes' 'No' 'No internet service']
------------------------------
DeviceProtection
['No' 'No internet service' 'Yes']
------------------------------
TechSupport
['Yes' 'No' 'No internet service']
------------------------------
StreamingTV
['Yes' 'No' 'No internet service']
------------------------------
StreamingMovies
['Yes' 'No' 'No internet service']
------------------------------
Contract
['Two year' 'Month-to-month' 'One year']
------------------------------
PaperlessBilling
['No' 'Yes']
------------------------------
Paymen

In [6]:
df.columns

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
       'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5282 entries, 0 to 5281
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            5282 non-null   object 
 1   SeniorCitizen     5282 non-null   int64  
 2   Partner           5282 non-null   object 
 3   Dependents        5282 non-null   object 
 4   tenure            5282 non-null   int64  
 5   PhoneService      5282 non-null   object 
 6   MultipleLines     5282 non-null   object 
 7   InternetService   5282 non-null   object 
 8   OnlineSecurity    5282 non-null   object 
 9   OnlineBackup      5282 non-null   object 
 10  DeviceProtection  5282 non-null   object 
 11  TechSupport       5282 non-null   object 
 12  StreamingTV       5282 non-null   object 
 13  StreamingMovies   5282 non-null   object 
 14  Contract          5282 non-null   object 
 15  PaperlessBilling  5282 non-null   object 
 16  PaymentMethod     5282 non-null   object 


In [8]:
df['Churn'] = df['Churn'].astype('object')
df['SeniorCitizen'] = df['SeniorCitizen'].astype('object')


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5282 entries, 0 to 5281
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            5282 non-null   object 
 1   SeniorCitizen     5282 non-null   object 
 2   Partner           5282 non-null   object 
 3   Dependents        5282 non-null   object 
 4   tenure            5282 non-null   int64  
 5   PhoneService      5282 non-null   object 
 6   MultipleLines     5282 non-null   object 
 7   InternetService   5282 non-null   object 
 8   OnlineSecurity    5282 non-null   object 
 9   OnlineBackup      5282 non-null   object 
 10  DeviceProtection  5282 non-null   object 
 11  TechSupport       5282 non-null   object 
 12  StreamingTV       5282 non-null   object 
 13  StreamingMovies   5282 non-null   object 
 14  Contract          5282 non-null   object 
 15  PaperlessBilling  5282 non-null   object 
 16  PaymentMethod     5282 non-null   object 


In [10]:
# Divide the columns into 3 categories, one ofor standardisation, one for label encoding and one for
# one hot encoding
label_encoding = ['PaperlessBilling','PhoneService','Dependents','Partner','Churn','SeniorCitizen',
                 'gender']
num_cols = ["tenure", 'MonthlyCharges', 'TotalCharges']
one_hot_encoding = ['PaymentMethod','Contract','StreamingMovies','StreamingTV','TechSupport',
                   'DeviceProtection','OnlineBackup','InternetService','MultipleLines','OnlineSecurity']

In [11]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])


In [12]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
for col in label_encoding:
    df[col] = label_encoder.fit_transform(df[col])


In [13]:
df = pd.get_dummies(df, columns=one_hot_encoding, drop_first=True)


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5282 entries, 0 to 5281
Data columns (total 31 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   gender                                 5282 non-null   int32  
 1   SeniorCitizen                          5282 non-null   int32  
 2   Partner                                5282 non-null   int32  
 3   Dependents                             5282 non-null   int32  
 4   tenure                                 5282 non-null   float64
 5   PhoneService                           5282 non-null   int32  
 6   PaperlessBilling                       5282 non-null   int32  
 7   MonthlyCharges                         5282 non-null   float64
 8   TotalCharges                           5274 non-null   float64
 9   Churn                                  5282 non-null   int32  
 10  PaymentMethod_Credit card (automatic)  5282 non-null   uint8  
 11  Paym

In [15]:
df.isnull().sum()

gender                                   0
SeniorCitizen                            0
Partner                                  0
Dependents                               0
tenure                                   0
PhoneService                             0
PaperlessBilling                         0
MonthlyCharges                           0
TotalCharges                             8
Churn                                    0
PaymentMethod_Credit card (automatic)    0
PaymentMethod_Electronic check           0
PaymentMethod_Mailed check               0
Contract_One year                        0
Contract_Two year                        0
StreamingMovies_No internet service      0
StreamingMovies_Yes                      0
StreamingTV_No internet service          0
StreamingTV_Yes                          0
TechSupport_No internet service          0
TechSupport_Yes                          0
DeviceProtection_No internet service     0
DeviceProtection_Yes                     0
OnlineBacku

In [16]:
df.dropna(inplace= True)

In [17]:
df.isnull().sum()

gender                                   0
SeniorCitizen                            0
Partner                                  0
Dependents                               0
tenure                                   0
PhoneService                             0
PaperlessBilling                         0
MonthlyCharges                           0
TotalCharges                             0
Churn                                    0
PaymentMethod_Credit card (automatic)    0
PaymentMethod_Electronic check           0
PaymentMethod_Mailed check               0
Contract_One year                        0
Contract_Two year                        0
StreamingMovies_No internet service      0
StreamingMovies_Yes                      0
StreamingTV_No internet service          0
StreamingTV_Yes                          0
TechSupport_No internet service          0
TechSupport_Yes                          0
DeviceProtection_No internet service     0
DeviceProtection_Yes                     0
OnlineBacku

## Now we will develop some predictive models and compare them.

We will develop Logistic Regression, Decision Tree, Random Forest and Gradient Boosting

In [18]:
# Select features (X) and target (y)
X = df.drop('Churn', axis=1)
y = df['Churn']

# Encode categorical variables (create dummy variables)
# X = pd.get_dummies(X, drop_first=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

**1. Logistic Regression**

In [19]:
# Create a logistic regression model
model = LogisticRegression()

# Train the model on the training data
model.fit(X_train, y_train)


In [20]:
# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate the model's performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.8104864181933038

Confusion Matrix:
 [[1039  109]
 [ 191  244]]

Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.91      0.87      1148
           1       0.69      0.56      0.62       435

    accuracy                           0.81      1583
   macro avg       0.77      0.73      0.75      1583
weighted avg       0.80      0.81      0.80      1583



As you can see that the accuracy is quite low, and as it's an imbalanced dataset, we shouldn't consider Accuracy as our metrics to measure the model, as Accuracy is cursed in imbalanced datasets.


Hence, we need to check recall, precision & f1 score for the minority class, and it's quite evident that the precision, recall & f1 score is too low for Class 1, i.e. churned customers.

### We have already seen that our data is highly inbalanced the chrnuers were around 26% where as Non-churnurs were 74%.Let's overcome this and try to improve our model accuracy.

In [21]:
from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)


In [22]:
percentage_churn_0 = np.mean(y_train_resampled == 0) * 100
percentage_churn_1 = np.mean(y_train_resampled == 1) * 100

print(f"Percentage of Churn 0: {percentage_churn_0:.2f}%")
print(f"Percentage of Churn 1: {percentage_churn_1:.2f}%")


Percentage of Churn 0: 50.00%
Percentage of Churn 1: 50.00%


In [23]:
# Train the model on the upsampled training data
model.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred_lr = model.predict(X_test)

# Evaluate the model's performance (e.g., using accuracy, precision, recall, F1-score)
from sklearn.metrics import classification_report, accuracy_score

print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))
print("Classification Report:\n", classification_report(y_test, y_pred_lr))


Accuracy: 0.7555274794693619

Confusion Matrix:
 [[866 282]
 [105 330]]
Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.75      0.82      1148
           1       0.54      0.76      0.63       435

    accuracy                           0.76      1583
   macro avg       0.72      0.76      0.72      1583
weighted avg       0.79      0.76      0.77      1583



**2. Decision Tree Classifier**

In [24]:
from sklearn.tree import DecisionTreeClassifier

# Create a Decision Tree Classifier model
model_dt=DecisionTreeClassifier(criterion = "gini",random_state = 100,max_depth=6, min_samples_leaf=8)

# Train the model on the training data
model_dt.fit(X_train,y_train)


In [25]:
# Make predictions on the test data
y_preddt = model.predict(X_test)

# Evaluate the model's performance
print("Accuracy:", accuracy_score(y_test, y_preddt))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_preddt))
print("\nClassification Report:\n", classification_report(y_test, y_preddt))


Accuracy: 0.7555274794693619

Confusion Matrix:
 [[866 282]
 [105 330]]

Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.75      0.82      1148
           1       0.54      0.76      0.63       435

    accuracy                           0.76      1583
   macro avg       0.72      0.76      0.72      1583
weighted avg       0.79      0.76      0.77      1583



### Let's see score after upsampling.

In [26]:
smote = SMOTE()
X_train_resampled1, y_train_resampled1 = smote.fit_resample(X_train, y_train)

# Train the model on the upsampled training data
model_dt.fit(X_train_resampled1, y_train_resampled1)

# Make predictions on the test set
y_pred_dt = model_dt.predict(X_test)

# Evaluate the model's performance
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_dt))
print("Classification Report:\n", classification_report(y_test, y_pred_dt))


Accuracy: 0.7195198989260897

Confusion Matrix:
 [[793 355]
 [ 89 346]]
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.69      0.78      1148
           1       0.49      0.80      0.61       435

    accuracy                           0.72      1583
   macro avg       0.70      0.74      0.70      1583
weighted avg       0.79      0.72      0.73      1583



### Let's experiment with different hyperparameters for our Decision Tree model.

In [27]:
from sklearn.model_selection import  StratifiedKFold, GridSearchCV

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

# Apply SMOTE for oversampling
smote = SMOTE()
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Create a Decision Tree Classifier model
model_dt = DecisionTreeClassifier(random_state=100)

# Define hyperparameters to tune
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Use StratifiedKFold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=100)

# Perform hyperparameter tuning using GridSearchCV
grid_search = GridSearchCV(model_dt, param_grid, cv=cv, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_resampled, y_train_resampled)

# Get the best model from the grid search
best_model = grid_search.best_estimator_

# Make predictions on the test data
y_pred_dt_hp = best_model.predict(X_test)

# Evaluate the model's performance
print("Accuracy:", accuracy_score(y_test, y_pred_dt_hp))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_dt_hp))
print("\nClassification Report:\n", classification_report(y_test, y_pred_dt_hp))


Accuracy: 0.7431279620853081

Confusion Matrix:
 [[593 187]
 [ 84 191]]

Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.76      0.81       780
           1       0.51      0.69      0.58       275

    accuracy                           0.74      1055
   macro avg       0.69      0.73      0.70      1055
weighted avg       0.78      0.74      0.75      1055



**3. Random Forest Classifier**

In [28]:
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest Classifier model
model_rf=RandomForestClassifier(n_estimators=100, criterion='gini', random_state = 100,max_depth=6, 
                                                                        min_samples_leaf=8)

# Train the model on the training data
model_rf.fit(X_train,y_train)

In [29]:
# Make predictions on the test data
y_predrf = model_rf.predict(X_test)

# Evaluate the model's performance
print("Accuracy:", accuracy_score(y_test, y_predrf))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_predrf))
print("\nClassification Report:\n", classification_report(y_test, y_predrf))

Accuracy: 0.8056872037914692

Confusion Matrix:
 [[715  65]
 [140 135]]

Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.92      0.87       780
           1       0.68      0.49      0.57       275

    accuracy                           0.81      1055
   macro avg       0.76      0.70      0.72      1055
weighted avg       0.79      0.81      0.79      1055



### Let's see score after upsampling.

In [30]:
# Train the model on the upsampled training data
model_rf.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred_rf = model_rf.predict(X_test)

# Evaluate the model's performance
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))


Accuracy: 0.7440758293838863

Confusion Matrix:
 [[567 213]
 [ 57 218]]
Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.73      0.81       780
           1       0.51      0.79      0.62       275

    accuracy                           0.74      1055
   macro avg       0.71      0.76      0.71      1055
weighted avg       0.80      0.74      0.76      1055



### Let's experiment with different hyperparameters for our Random forest model.

In [31]:
from sklearn.model_selection import GridSearchCV

# Create a Random Forest Classifier
model_rf = RandomForestClassifier(random_state=100)

# Define the parameter grid to search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create GridSearchCV
grid_search = GridSearchCV(model_rf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the grid search to the data
grid_search.fit(X_train_resampled, y_train_resampled)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Train the model with the best hyperparameters
best_model_rf = RandomForestClassifier(
    random_state=100,
    n_estimators=best_params['n_estimators'],
    max_depth=best_params['max_depth'],
    min_samples_split=best_params['min_samples_split'],
    min_samples_leaf=best_params['min_samples_leaf']
)

best_model_rf.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred_rf_hp = best_model_rf.predict(X_test)

# Evaluate the model's performance
print("Accuracy:", accuracy_score(y_test, y_pred_rf_hp))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_rf_hp))
print("Classification Report:\n", classification_report(y_test, y_pred_rf_hp))


Best Hyperparameters: {'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Accuracy: 0.7725118483412322

Confusion Matrix:
 [[629 151]
 [ 89 186]]
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.81      0.84       780
           1       0.55      0.68      0.61       275

    accuracy                           0.77      1055
   macro avg       0.71      0.74      0.72      1055
weighted avg       0.79      0.77      0.78      1055



The resampled model appears to perform better for predicting churn compared to the original random forest model. It has a higher F1-Score for Churn = 1, indicating a better balance between precision and recall for customers who are likely to churn.

**4. Gradient Boosting**

In [32]:
from sklearn.ensemble import GradientBoostingClassifier

# Create a Gradient Boosting Classifier model
model_gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=100)

# Train the model on the training data
model_gb.fit(X_train, y_train)

# Make predictions on the test data
y_predgb = model_gb.predict(X_test)

# Evaluate the model's performance
print("Accuracy:", accuracy_score(y_test, y_predgb))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_predgb))
print("Classification Report:\n", classification_report(y_test, y_predgb))


Accuracy: 0.8037914691943128

Confusion Matrix:
 [[692  88]
 [119 156]]
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.89      0.87       780
           1       0.64      0.57      0.60       275

    accuracy                           0.80      1055
   macro avg       0.75      0.73      0.74      1055
weighted avg       0.80      0.80      0.80      1055



### Let's see score after upsampling.

In [33]:
# Train the model on the upsampled training data
model_gb.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test data
y_pred_gb = model_gb.predict(X_test)

# Evaluate the model's performance
print("Accuracy:", accuracy_score(y_test, y_pred_gb))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_gb))
print("Classification Report:\n", classification_report(y_test, y_pred_gb))

Accuracy: 0.7535545023696683

Confusion Matrix:
 [[580 200]
 [ 60 215]]
Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.74      0.82       780
           1       0.52      0.78      0.62       275

    accuracy                           0.75      1055
   macro avg       0.71      0.76      0.72      1055
weighted avg       0.81      0.75      0.77      1055



The resampled model appears to perform better for predicting churn compared to the original random forest model. It has a higher F1-Score for Churn = 1, indicating a better balance between precision and recall for customers who are likely to churn.

### Model Evaluation

**1. Logistic Regression:**

- Original:
- Accuracy: 0.8105
- F1-score for churn (1): 0.61

**After applying SMOTE (oversampling):**
- Accuracy: 0.7555
- F1-score for churn (1): 0.63

**2. Decision Tree Classifier:**

- Original:
- Accuracy: 0.7555
- F1-score for churn (1): 0.63

**After applying SMOTE (oversampling):**
- Accuracy: 0.7195
- F1-score for churn (1): 0.61

**After hyperparameter tuning:**
- Accuracy: 0.7431
- F1-score for churn (1): 0.58

**3. Random Forest Classifier:**

- Original:
- Accuracy: 0.8057
- F1-score for churn (1): 0.57

**After applying SMOTE (oversampling):**
- Accuracy: 0.7441
- F1-score for churn (1): 0.62

**After hyperparameter tuning:**
- Accuracy: 0.7725
- F1-score for churn (1): 0.61

**4. Gradient Boosting:**

- Original:
- Accuracy: 0.8038
- F1-score for churn (1):  0.60

**After applying SMOTE (oversampling):**
- Accuracy: 0.7536
- F1-score for churn (1): 0.62

### Conclusion

Based on these results, it appears that the original Logistic Regression model had the highest accuracy and reasonably good F1-scores for both classes. The Decision Tree Classifier performed well, especially after hyperparameter tuning. The Random Forest Classifier showed promise with a good F1-score after tuning, while Gradient Boosting had a lower overall accuracy.