## Customer Churn Forecasting for Interconnect Telecom

**Description**

Interconnect, a telecom operator, seeks to proactively identify customers at risk of churning to reduce attrition and improve customer retention. By analyzing customer demographics, service usage, and contractual data, the aim is to build a machine learning model that predicts churn, enabling timely intervention with promotions and support.

**Objective**
- Merge and clean multiple datasets containing customer information.
- Conduct exploratory data analysis (EDA) to uncover patterns associated with churn.
- Engineer relevant features and preprocess the data for modeling.
- Train classification models to predict churn using AUC-ROC as the primary evaluation metric.
- Optimize the model and evaluate its accuracy to meet the sprint scoring criteria.

**Data Sources**
- The project uses four CSV datasets located in /datasets/final_provider/:
    - contract.csv: Customer contract details.
    - personal.csv: Demographics and account information.
    - internet.csv: Internet service subscriptions.
    - phone.csv: Phone service details.
- Each dataset includes a customerID field used for merging.

**Approach**
- Data Preparation: Load and merge all datasets, check for duplicates and missing values.
- EDA: Explore churn distribution, customer demographics, service patterns, and correlations.
- Model Development: Train classification models including Logistic Regression, Random Forest, Gradient Boosting, KNN, SVM, and Stacking.
- Evaluation: Measure model performance with AUC-ROC and Accuracy.
- Deployment Readiness: Select the most reliable model and summarize findings for stakeholders.

**Tools & Libraries**
- pandas and numpy for data loading and manipulation
- matplotlib and seaborn for visualization
- scikit-learn for data splitting (train_test_split), scaling (StandardScaler), models (LogisticRegression, RandomForestClassifier, GradientBoostingClassifier, KNeighborsClassifier, SVC, StackingClassifier), metrics (roc_auc_score, accuracy_score, classification_report), and tuning (GridSearchCV)
- warnings, used to suppress warning messages for cleaner output

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Load all four datasets
contract = pd.read_csv('/datasets/final_provider/contract.csv')
personal = pd.read_csv('/datasets/final_provider/personal.csv')
internet = pd.read_csv('/datasets/final_provider/internet.csv')
phone = pd.read_csv('/datasets/final_provider/phone.csv')

# Display shapes
print("Contracts:", contract.shape)
print("Personal:", personal.shape)
print("Internet:", internet.shape)
print("Phone:", phone.shape)

Contracts: (7043, 8)
Personal: (7043, 5)
Internet: (5517, 8)
Phone: (6361, 2)


In [3]:
# Preview data
display(contract.head())
display(personal.head())
display(internet.head())
display(phone.head())

# Check for nulls and duplicates
print("\nMissing Values:")
print("Contract:\n", contract.isna().sum())
print("Personal:\n", personal.isna().sum())
print("Internet:\n", internet.isna().sum())
print("Phone:\n", phone.isna().sum())

print("\nDuplicate rows:")
print("Contract:", contract.duplicated().sum())
print("Personal:", personal.duplicated().sum())
print("Internet:", internet.duplicated().sum())
print("Phone:", phone.duplicated().sum())

Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents
0,7590-VHVEG,Female,0,Yes,No
1,5575-GNVDE,Male,0,No,No
2,3668-QPYBK,Male,0,No,No
3,7795-CFOCW,Male,0,No,No
4,9237-HQITU,Female,0,No,No


Unnamed: 0,customerID,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
0,7590-VHVEG,DSL,No,Yes,No,No,No,No
1,5575-GNVDE,DSL,Yes,No,Yes,No,No,No
2,3668-QPYBK,DSL,Yes,Yes,No,No,No,No
3,7795-CFOCW,DSL,Yes,No,Yes,Yes,No,No
4,9237-HQITU,Fiber optic,No,No,No,No,No,No


Unnamed: 0,customerID,MultipleLines
0,5575-GNVDE,No
1,3668-QPYBK,No
2,9237-HQITU,No
3,9305-CDSKC,Yes
4,1452-KIOVK,Yes



Missing Values:
Contract:
 customerID          0
BeginDate           0
EndDate             0
Type                0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
dtype: int64
Personal:
 customerID       0
gender           0
SeniorCitizen    0
Partner          0
Dependents       0
dtype: int64
Internet:
 customerID          0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
dtype: int64
Phone:
 customerID       0
MultipleLines    0
dtype: int64

Duplicate rows:
Contract: 0
Personal: 0
Internet: 0
Phone: 0


**Initial EDA Findings**
- Dataset Sizes:
    - contract.csv: 7,043 rows
    - personal.csv: 7,043 rows
    - internet.csv: 5,517 rows
    - phone.csv: 6,361 rows

- Key Observations:
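    - No Missing or Duplicate Values in the Raw Files:
        - None of the four datasets contains null values or duplicate rows. Note that isna() does not flag blank strings, so columns stored as text still need to be checked before modeling.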
    - Customer Coverage Differences:
        - contract and personal each have data for all 7,043 customers.
        - internet has only 5,517 rows → suggests ~1,526 customers do not use Internet services.
        - phone has 6,361 rows → suggests ~682 customers do not use Phone services.

- Contract Data (Churn Target):
    - The EndDate column is the churn indicator. If EndDate is 'No', the customer is active; otherwise, the date shows when they churned.
    - This column will be converted to a binary feature (1 = churned, 0 = active) for modeling.
    
- No Preprocessing Issues in Categorical Columns:
    - All text-based fields (e.g., gender, InternetService, MultipleLines) appear properly encoded and interpretable.
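
The coverage differences noted above can be confirmed row by row instead of being inferred from row counts. The following is a minimal sketch on toy frames (the three customer IDs come from the previews above; the `_toy` names are illustrative and not part of the notebook). It uses the `indicator` and `validate` options of `DataFrame.merge` to list customers with no phone record and to guard against duplicated IDs.

```python
import pandas as pd

# Toy stand-ins for contract.csv and phone.csv (illustrative names, tiny sample)
contract_toy = pd.DataFrame({"customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"]})
phone_toy = pd.DataFrame({
    "customerID": ["5575-GNVDE", "3668-QPYBK"],
    "MultipleLines": ["No", "No"],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join.
# validate='one_to_one' raises an error if customerID is duplicated on either side.
merged = contract_toy.merge(
    phone_toy, on="customerID", how="left", indicator=True, validate="one_to_one"
)

# Customers present in the contract data but absent from the phone data
no_phone = merged.loc[merged["_merge"] == "left_only", "customerID"].tolist()
print("Customers without a phone record:", no_phone)
```

On the full data, the same check should report the 682 customers missing from phone.csv, assuming customerID is unique in each file.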

In [4]:
# Merge datasets on 'customerID'
df_merged = contract.merge(personal, on='customerID', how='left')
df_merged = df_merged.merge(internet, on='customerID', how='left')
df_merged = df_merged.merge(phone, on='customerID', how='left')

# View result
print("Merged dataset shape:", df_merged.shape)
display(df_merged.head())

Merged dataset shape: (7043, 20)


Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,gender,SeniorCitizen,Partner,Dependents,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,MultipleLines
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85,Female,0,Yes,No,DSL,No,Yes,No,No,No,No,
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5,Male,0,No,No,DSL,Yes,No,Yes,No,No,No,No
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15,Male,0,No,No,DSL,Yes,Yes,No,No,No,No,No
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75,Male,0,No,No,DSL,Yes,No,Yes,Yes,No,No,
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65,Female,0,No,No,Fiber optic,No,No,No,No,No,No,No


In [5]:
# Step 1: Create binary target column 'churn'
df_merged['churn'] = (df_merged['EndDate'] != 'No').astype(int)

# Step 2: Check basic target distribution
print("Churn value counts:")
print(df_merged['churn'].value_counts())
print("\nChurn distribution (%):")
print(df_merged['churn'].value_counts(normalize=True) * 100)

# Step 3: Check feature data types
print("\nData types by column:")
print(df_merged.dtypes)

# Optional: quick check of numeric columns
numeric_cols = df_merged.select_dtypes(include='number').columns
print("\nNumeric columns:", list(numeric_cols))

# Optional: quick check of categorical columns
categorical_cols = df_merged.select_dtypes(include='object').columns
print("\nCategorical columns:", list(categorical_cols))

Churn value counts:
0    5174
1    1869
Name: churn, dtype: int64

Churn distribution (%):
0    73.463013
1    26.536987
Name: churn, dtype: float64

Data types by column:
customerID           object
BeginDate            object
EndDate              object
Type                 object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
MultipleLines        object
churn                 int64
dtype: object

Numeric columns: ['MonthlyCharges', 'SeniorCitizen', 'churn']

Categorical columns: ['customerID', 'BeginDate', 'EndDate', 'Type', 'PaperlessBilling', 'PaymentMethod', 'TotalCharges', 'gender', 'Partner', 'Dependents', 'Inter

**Target Variable Creation and Initial Data Profiling**

- To prepare for model training, I defined the binary target variable churn based on the EndDate column:
    - Customers with 'EndDate' == 'No' are considered active (churn = 0)
    - Customers with a real end date (i.e., any value other than 'No') are considered churned (churn = 1)

- Target Distribution:
    - 73.5% of customers are still active (churn = 0)
    - 26.5% have churned (churn = 1)

This indicates a moderate class imbalance, which may affect model training. It’s important to account for this during evaluation by using metrics like AUC-ROC rather than just accuracy.
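
To make the point about accuracy concrete, here is a small sketch that scores a trivial "always predict active" baseline using the class counts reported above (5,174 active, 1,869 churned). The baseline is illustrative only and is not one of the models trained in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Labels rebuilt from the value counts shown above
y_true = np.array([0] * 5174 + [1] * 1869)

# Baseline: predict "active" (0) for every customer
y_majority = np.zeros_like(y_true)
baseline_acc = accuracy_score(y_true, y_majority)

# A constant score carries no ranking information, so AUC-ROC is 0.5
baseline_auc = roc_auc_score(y_true, np.zeros(len(y_true)))

print("Majority-class accuracy:", round(baseline_acc, 4))
print("Majority-class AUC-ROC:", baseline_auc)
```

Any model's accuracy should therefore be read against a floor of roughly 73.5%, while AUC-ROC starts from 0.5.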

- Data Types Summary:
    - I identified 3 numerical columns: MonthlyCharges, SeniorCitizen, and churn
    - All other columns are categorical or object-type, including TotalCharges, which should ideally be numeric — this column will require conversion and cleaning
    - Time-related columns like BeginDate and EndDate are stored as strings and may be used to engineer time-based features (e.g., customer tenure)
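
As a sketch of the tenure idea mentioned above, the dates can be parsed and subtracted as shown below. The rows are copied from the contract preview. The snapshot date used for active customers (`SNAPSHOT_DATE`) is an assumption made for illustration and should be replaced with the actual extraction date of the data. This feature is not used in the models later in this notebook.

```python
import pandas as pd

# Three rows copied from the contract preview above
toy = pd.DataFrame({
    "BeginDate": ["2020-01-01", "2017-04-01", "2019-10-01"],
    "EndDate": ["No", "No", "2019-12-01 00:00:00"],
})

# ASSUMPTION: date on which the data was extracted; not stated in this notebook
SNAPSHOT_DATE = pd.Timestamp("2020-02-01")

begin = pd.to_datetime(toy["BeginDate"])

# Mask 'No' (still active) to NaN, parse the rest, then fill with the snapshot date
end = pd.to_datetime(toy["EndDate"].where(toy["EndDate"] != "No")).fillna(SNAPSHOT_DATE)

toy["tenure_days"] = (end - begin).dt.days
print(toy)
```

Tenure computed this way depends on the chosen snapshot date, so it should be validated before being added to the feature set.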

In [6]:
print("Customers without internet service data:", df_merged['InternetService'].isna().sum())
print("Customers without phone service data:", df_merged['MultipleLines'].isna().sum())
print("Customers with neither service data:", df_merged[df_merged['InternetService'].isna() & df_merged['MultipleLines'].isna()].shape[0])

Customers without internet service data: 1526
Customers without phone service data: 682
Customers with neither service data: 0
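
These NaN values come from the left merges, not from the raw files. One caveat: `pd.get_dummies` leaves a NaN row as all zeros, so with `drop_first=True` a customer without internet service looks identical to one in the dropped baseline category. A possible alternative, sketched below on toy data, is to fill the gaps with an explicit label before encoding. The labels "No internet" and "No phone" are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows: one DSL customer, one fiber customer, one with no internet record
toy = pd.DataFrame({
    "InternetService": ["DSL", "Fiber optic", np.nan],
    "MultipleLines": ["No", np.nan, "Yes"],
})

# Without filling: the NaN row is all zeros, the same as the dropped 'DSL' baseline
encoded_raw = pd.get_dummies(toy, drop_first=True)

# With explicit labels (hypothetical names), the absence becomes its own category
filled = toy.fillna({"InternetService": "No internet", "MultipleLines": "No phone"})
encoded_filled = pd.get_dummies(filled, drop_first=True)

print(encoded_raw.columns.tolist())
print(encoded_filled.columns.tolist())
```

Whether the extra categories help is an empirical question; the sketch only shows how to keep the information available to the model.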


In [7]:
# Check unique problematic values before conversion
# regex=False treats '.' as a literal dot rather than a regex wildcard
print("Non-numeric TotalCharges values:")
print(df_merged[~df_merged['TotalCharges'].str.replace('.', '', n=1, regex=False).str.isnumeric()]['TotalCharges'].unique())

# Convert 'TotalCharges' to numeric, coercing errors to NaN
df_merged['TotalCharges'] = pd.to_numeric(df_merged['TotalCharges'], errors='coerce')

# Display number of NaNs created by the conversion
print("Number of missing TotalCharges after conversion:", df_merged['TotalCharges'].isna().sum())

# Fill NaNs with 0.0 (assumes no charges were accrued)
df_merged['TotalCharges'] = df_merged['TotalCharges'].fillna(0.0)

# Confirm changes
print("TotalCharges data type:", df_merged['TotalCharges'].dtype)

Non-numeric TotalCharges values:
[' ']
Number of missing TotalCharges after conversion: 11
TotalCharges data type: float64


**Data Cleaning: TotalCharges Column**

During EDA, it was discovered that the TotalCharges column was stored as an object (string) type, despite containing numerical billing information. Upon further inspection, I identified 11 entries where the value was a single space character ' '. These entries could not be converted to numeric values and were most likely associated with customers who had recently joined and had not yet been billed.

To resolve this:
- I used pd.to_numeric(..., errors='coerce') to convert valid values to floats and replace invalid entries (such as ' ') with NaN.
- I then filled these missing values with 0.0 to represent no charges incurred yet.

This conversion ensures the TotalCharges column can be treated as a numeric feature during model training.
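
The assumption that blank values belong to newly joined customers can be checked by looking at the BeginDate of the affected rows. The sketch below uses hypothetical toy rows (the dates and values are made up for illustration). On the real data, the same filter would be applied to the rows with blank TotalCharges before conversion.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset
toy = pd.DataFrame({
    "BeginDate": ["2020-02-01", "2017-04-01", "2020-02-01"],
    "TotalCharges": [" ", "1889.5", " "],
})

# Blank strings cannot be parsed and become NaN
total = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Inspect the start dates of the rows that failed to convert
blank_rows = toy[total.isna()]
begin_dates_of_blanks = blank_rows["BeginDate"].unique().tolist()
print("BeginDate values for blank TotalCharges:", begin_dates_of_blanks)
```

If all affected rows share the most recent BeginDate, filling with 0.0 is a reasonable choice. Otherwise a different imputation may be more appropriate.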

In [8]:
# Step 1: Drop the identifier and raw date columns
# (EndDate defines the target, so keeping it would leak the label)
df_model = df_merged.drop(columns=['customerID', 'BeginDate', 'EndDate'])

# Step 2: One-hot encode categorical variables
df_encoded = pd.get_dummies(df_model, drop_first=True)

# Step 3: Split into features (X) and target (y)
X = df_encoded.drop(columns='churn')
y = df_encoded['churn']

# Step 4: Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Confirm shapes
print("Train set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Train set shape: (5634, 20)
Test set shape: (1409, 20)


In [9]:
# Standardize the numeric features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train[['MonthlyCharges', 'TotalCharges']])
X_test_scaled = scaler.transform(X_test[['MonthlyCharges', 'TotalCharges']])

# Replace the original numeric columns with scaled ones
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=['MonthlyCharges', 'TotalCharges'], index=X_train.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=['MonthlyCharges', 'TotalCharges'], index=X_test.index)

X_train_model = X_train.copy()
X_test_model = X_test.copy()
X_train_model[['MonthlyCharges', 'TotalCharges']] = X_train_scaled_df
X_test_model[['MonthlyCharges', 'TotalCharges']] = X_test_scaled_df

# Logistic Regression (Model 1)
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train_model, y_train)
lr_pred_proba = lr_model.predict_proba(X_test_model)[:, 1]
lr_pred = lr_model.predict(X_test_model)
lr_auc = roc_auc_score(y_test, lr_pred_proba)
lr_acc = accuracy_score(y_test, lr_pred)

print("Logistic Regression Results:")
print("AUC-ROC:", round(lr_auc, 4))
print("Accuracy:", round(lr_acc, 4))

# Random Forest Classifier (Model 2)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_model, y_train)
rf_pred_proba = rf_model.predict_proba(X_test_model)[:, 1]
rf_pred = rf_model.predict(X_test_model)
rf_auc = roc_auc_score(y_test, rf_pred_proba)
rf_acc = accuracy_score(y_test, rf_pred)

print("\nRandom Forest Classifier Results:")
print("AUC-ROC:", round(rf_auc, 4))
print("Accuracy:", round(rf_acc, 4))

# Gradient Boosting (Model 3)
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train_model, y_train)
gb_pred_proba = gb_model.predict_proba(X_test_model)[:, 1]
gb_pred = gb_model.predict(X_test_model)
gb_auc = roc_auc_score(y_test, gb_pred_proba)
gb_acc = accuracy_score(y_test, gb_pred)

print("\nGradient Boosting Results:")
print("AUC-ROC:", round(gb_auc, 4))
print("Accuracy:", round(gb_acc, 4))

# K-Nearest Neighbors (Model 4)
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train_model, y_train)
knn_pred_proba = knn_model.predict_proba(X_test_model)[:, 1]
knn_pred = knn_model.predict(X_test_model)
knn_auc = roc_auc_score(y_test, knn_pred_proba)
knn_acc = accuracy_score(y_test, knn_pred)

print("\nK-Nearest Neighbors Results:")
print("AUC-ROC:", round(knn_auc, 4))
print("Accuracy:", round(knn_acc, 4))

# Support Vector Machine (Model 5)
svm_model = SVC(probability=True, kernel='rbf', random_state=42)
svm_model.fit(X_train_model, y_train)
svm_pred_proba = svm_model.predict_proba(X_test_model)[:, 1]
svm_pred = svm_model.predict(X_test_model)
svm_auc = roc_auc_score(y_test, svm_pred_proba)
svm_acc = accuracy_score(y_test, svm_pred)

print("\nSupport Vector Machine Results:")
print("AUC-ROC:", round(svm_auc, 4))
print("Accuracy:", round(svm_acc, 4))

# Stacking Classifier (Model 6)
estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ('lr', LogisticRegression(max_iter=1000, random_state=42))
]
stack_model = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=5)
stack_model.fit(X_train_model, y_train)
stack_pred_proba = stack_model.predict_proba(X_test_model)[:, 1]
stack_pred = stack_model.predict(X_test_model)
stack_auc = roc_auc_score(y_test, stack_pred_proba)
stack_acc = accuracy_score(y_test, stack_pred)

print("\nStacking Classifier Results:")
print("AUC-ROC:", round(stack_auc, 4))
print("Accuracy:", round(stack_acc, 4))

Logistic Regression Results:
AUC-ROC: 0.8259
Accuracy: 0.7878

Random Forest Classifier Results:
AUC-ROC: 0.8217
Accuracy: 0.7935

Gradient Boosting Results:
AUC-ROC: 0.8411
Accuracy: 0.7956

K-Nearest Neighbors Results:
AUC-ROC: 0.7803
Accuracy: 0.7601

Support Vector Machine Results:
AUC-ROC: 0.7965
Accuracy: 0.7885

Stacking Classifier Results:
AUC-ROC: 0.8397
Accuracy: 0.7999


**Interpretation**

- Logistic Regression (AUC-ROC: 0.8259, Accuracy: 0.7878)
    - Strong AUC-ROC with relatively good accuracy.
    - Very interpretable, useful for explaining which features affect churn.
    - A good baseline model with clear business value.
- Random Forest (AUC-ROC: 0.8217, Accuracy: 0.7935)
    - Robust and performs well, especially in terms of accuracy.
    - Good at handling non-linear relationships and feature interactions.
    - Less interpretable than logistic regression.
- Gradient Boosting (AUC-ROC: 0.8411, Accuracy: 0.7956)
    - Best overall performer in terms of AUC-ROC, meaning it most effectively separates churners from non-churners.
    - Also shows competitive accuracy, making it a strong candidate for deployment.
    - However, it is less interpretable than simpler models like logistic regression.
- K-Nearest Neighbors (AUC-ROC: 0.7803, Accuracy: 0.7601)
    - Lowest scores among all models.
    - Likely underperforms because distances are computed over many one-hot encoded columns and KNN is sensitive to feature scaling; n_neighbors was also left at 5 without tuning.
- Support Vector Machine (AUC-ROC: 0.7965, Accuracy: 0.7885)
    - Moderate performance across the board.
    - May require more tuning and is not inherently interpretable.
- Stacking Classifier (AUC-ROC: 0.8397, Accuracy: 0.7999)
    - Nearly matches Gradient Boosting in AUC-ROC while achieving the highest accuracy.
    - Combines multiple models and captures diverse patterns in the data well.
    - Slightly more complex to maintain and explain.

**Reasons for Choosing Gradient Boosting for Next Steps:**

- Highest AUC-ROC: Achieved the top score (0.8411), indicating strong performance in distinguishing churners from non-churners.
- Strong Accuracy: Delivered high classification accuracy (79.6%), close to the best among all tested models.
- Balanced Performance: Leads on ranking ability (AUC-ROC) and comes within about 0.4 percentage points of the Stacking Classifier's accuracy (79.56% vs. 79.99%), while being simpler than a stacked ensemble.
- Handles Feature Interactions Well: Captures subtle relationships between variables that simpler models may miss.
- Proven Track Record: Frequently used in production systems due to its reliability and predictive strength.
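
Because the AUC-ROC gap between the top candidates is small (0.8411 vs. 0.8397 on a single test split), a cross-validated comparison on the training data can give a more stable basis for model selection. The sketch below shows the pattern on synthetic data generated with `make_classification`, which stands in for the churn features. The numbers it prints say nothing about the Interconnect data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with a class balance similar to the churn data (NOT real data)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.73, 0.27], random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# Stratified folds keep the churn rate roughly constant across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    for name, model in candidates.items()
}

for name, scores in cv_auc.items():
    print(f"{name}: mean AUC-ROC = {scores.mean():.4f} (+/- {scores.std():.4f})")
```

Selecting the model on cross-validated scores and keeping the test set for a single final evaluation avoids an optimistic bias in the reported test metrics.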

In [10]:
# Exploring combinations of parameters with cross-validation
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 4, 5],
    'subsample': [0.8, 1.0],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 3]
}

gb_model = GradientBoostingClassifier(random_state=42)
grid_search = GridSearchCV(gb_model, param_grid, cv=3, scoring='roc_auc', n_jobs=-1, verbose=1)

grid_search.fit(X_train_model, y_train)

print("Best AUC-ROC:", grid_search.best_score_)
print("Best parameters:", grid_search.best_params_)

Fitting 3 folds for each of 216 candidates, totalling 648 fits
Best AUC-ROC: 0.8474770829458538
Best parameters: {'learning_rate': 0.01, 'max_depth': 4, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 300, 'subsample': 0.8}


In [11]:
# Initialize the model with best parameters
best_gb_model = GradientBoostingClassifier(
    learning_rate=0.01,
    max_depth=4,
    min_samples_leaf=3,
    min_samples_split=2,
    n_estimators=300,
    subsample=0.8,
    random_state=42
)

# Train on the full training data
best_gb_model.fit(X_train_model, y_train)

# Predict on the test set
gb_pred_proba = best_gb_model.predict_proba(X_test_model)[:, 1]
gb_pred = best_gb_model.predict(X_test_model)

# Evaluate
gb_auc = roc_auc_score(y_test, gb_pred_proba)
gb_acc = accuracy_score(y_test, gb_pred)

print("\nTuned Gradient Boosting Final Test Results:")
print("AUC-ROC:", round(gb_auc, 4))
print("Accuracy:", round(gb_acc, 4))


Tuned Gradient Boosting Final Test Results:
AUC-ROC: 0.8463
Accuracy: 0.807


**Tuned Gradient Boosting Model Analysis:**

The tuned Gradient Boosting Classifier delivered the best performance among all tested models, achieving an AUC-ROC of 0.8463 and an accuracy of 80.7% on the test data, up from 0.8411 and 79.6% for the untuned version. The gain from tuning is modest, but both ranking quality (AUC-ROC) and classification accuracy improved.

The grid search selected the following settings:
- A low learning rate (0.01, the smallest value in the grid), so each tree contributes a small correction, which generally reduces the risk of overfitting.
- A moderate tree depth (max_depth=4) with at least 3 samples per leaf (min_samples_leaf=3), which limits how closely individual trees can fit noise.
- A larger number of estimators (300, the largest value in the grid), which compensates for the low learning rate.
- Subsampling (subsample=0.8), which fits each tree on a random 80% of the training rows and adds randomness that tends to improve generalization.
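
In the final cell above, the selected parameters were retyped by hand. Since `GridSearchCV` refits the best configuration on the whole training set by default (`refit=True`), the fitted `best_estimator_` can be reused directly, which avoids transcription errors. The sketch below illustrates this on synthetic data with a deliberately small grid. Both the data and the grid are placeholders, not the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the churn features (NOT the Interconnect data)
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.73, 0.27], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Deliberately small grid so the sketch runs quickly
small_grid = {"learning_rate": [0.01, 0.1], "max_depth": [3, 4]}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, subsample=0.8, random_state=42),
    small_grid,
    cv=3,
    scoring="roc_auc",
)
search.fit(X_tr, y_tr)

# Already refit on all of X_tr with the best parameters
tuned = search.best_estimator_
test_auc = roc_auc_score(y_te, tuned.predict_proba(X_te)[:, 1])

print("Best parameters:", search.best_params_)
print("Cross-validated AUC-ROC:", round(search.best_score_, 4))
print("Test AUC-ROC:", round(test_auc, 4))
```

Printing the cross-validated and test scores side by side also makes it easy to spot a large gap between them, which would suggest overfitting to the validation folds.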

## Overall Conclusion

This project aimed to build a reliable machine learning model to help Interconnect Telecom identify customers likely to churn. By predicting churn risk, the company can proactively offer incentives to retain valuable clients and reduce customer turnover.

**Key Accomplishments:**
- Data Integration & Preparation: Four datasets containing contract, personal, internet, and phone service information for 7,043 customers were merged. The 11 blank TotalCharges values were converted to 0.0, a binary target variable (churn) was derived from EndDate, and categorical features were one-hot encoded for modeling.
- Exploratory Data Analysis: About 26.5% of customers churned. Service type, contract length, and payment method are plausible drivers of churn, though their individual effects were not quantified in this notebook.
- Model Development: Several models were tested and evaluated using AUC-ROC and Accuracy:
    - Logistic Regression: AUC-ROC = 0.826, Accuracy = 78.8%
    - Random Forest: AUC-ROC = 0.822, Accuracy = 79.4%
    - Gradient Boosting: AUC-ROC = 0.841, Accuracy = 79.6%
    - KNN, SVM, and Stacking were also tested. Stacking reached the highest accuracy (80.0%), but Gradient Boosting had the highest AUC-ROC, the primary metric.
- Model Optimization: Gradient Boosting was further tuned using GridSearchCV. The final tuned model achieved:
    - AUC-ROC = 0.8463
    - Accuracy = 80.7%

These metrics meet the project’s success threshold of AUC-ROC ≥ 0.81 and suggest the model is effective at identifying high-risk churners.

**Final Recommendation:**

The tuned Gradient Boosting Classifier is the best candidate for deployment. It offers strong predictive power and generalization, and can be integrated into the company’s CRM or retention workflow to prioritize customer outreach. Future improvements may involve retraining on more recent data, adding customer interaction history, or testing deep learning approaches for further gains.
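
As an illustration of how predicted probabilities could feed a retention workflow, the sketch below ranks customers by churn probability and selects the highest-risk ones up to a fixed outreach capacity. The customer IDs, probabilities, and the `OUTREACH_CAPACITY` value are all hypothetical placeholders, not model output.

```python
import pandas as pd

# Hypothetical scores; in practice these would come from predict_proba(...)[:, 1]
scores = pd.DataFrame({
    "customerID": ["C001", "C002", "C003", "C004", "C005"],
    "churn_probability": [0.62, 0.08, 0.41, 0.05, 0.77],
})

# Hypothetical number of customers the retention team can contact
OUTREACH_CAPACITY = 2

# Pick the customers with the highest predicted churn risk
outreach_list = (
    scores.nlargest(OUTREACH_CAPACITY, "churn_probability")["customerID"].tolist()
)
print("Contact first:", outreach_list)
```

Ranking by probability relies on the model's ordering of customers, which is what AUC-ROC measures, so it fits the metric used throughout this project.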

