# Task-2: Customer Churn Prediction System 📉

This project focuses on predicting customer churn using machine learning.
Using the Telco Customer Churn dataset, we build a classification model to identify customers who are likely to stop using the service.

📌 The **objective** is to:

- Analyze customer behavior and service usage

- Predict churn probability for each customer

- Identify key churn drivers

- Support business decision-making using dashboards

# 🔧 Importing Required Libraries

In this step, we import the Python libraries used for:

- Data manipulation: ```pandas, numpy```

- Model persistence: ```pickle```

- Data splitting, preprocessing, and evaluation: ```scikit-learn```

- Classification model: ```XGBoost```

In [1]:
import pandas as pd
import numpy as np
import pickle

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score
from xgboost import XGBClassifier


# 📂 Loading the Dataset

The Telco Customer Churn dataset is loaded into a pandas DataFrame.
Each row represents a customer, and each column represents customer demographics, services, billing details, and churn status.

Key columns used:

- ```Churn``` → Target variable

- ```tenure```, ```MonthlyCharges```, ```TotalCharges``` → Numerical features

In [2]:
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# 🔍 Initial Data Exploration

This step helps in understanding the dataset structure by:

- Viewing sample records

- Checking the number of rows and columns

- Inspecting data types

This ensures we understand what preprocessing steps are required.

In [3]:
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


(7043, 21)

# 🧹 Data Cleaning

Before training the model, the data must be cleaned.

In this step:

- ```TotalCharges``` is converted from string to numeric format; entries that cannot be parsed, such as blank strings, become NaN (missing)

- Rows with missing values are removed

- ```customerID``` is dropped since it does not help in prediction

In [4]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df.dropna(inplace=True)
df.drop('customerID', axis=1, inplace=True)
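
The effect of `errors='coerce'` is easiest to see on a small made-up frame. The sketch below uses hypothetical values (not rows from the dataset) and assumes a blank string is what prevents a `TotalCharges` entry from parsing:

```python
import pandas as pd

# Hypothetical toy data: TotalCharges stored as strings, one entry blank
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "TotalCharges": ["29.85", " ", "108.15"],
})

# Unparseable strings become NaN instead of raising an error
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Drop the affected rows and the identifier column
cleaned = toy.dropna().drop(columns="customerID")
print(n_missing, cleaned.shape)
```

Counting the missing values before calling `dropna` makes it visible how many customers are being discarded.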

# 🔁 Encoding the Target Variable

Machine learning models require numerical values.
The target variable Churn is converted into binary form:

- Yes → 1

- No → 0

In [5]:
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
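
One thing to watch with `map`: any label that is missing from the dictionary silently turns into NaN. The sketch below uses made-up labels (the lowercase `'yes'` is a hypothetical mismatch, not something observed in the dataset) to show a check that could follow the mapping:

```python
import pandas as pd

labels = pd.Series(["Yes", "No", "No", "yes"])  # last value is a deliberate mismatch
encoded = labels.map({"Yes": 1, "No": 0})

# Count labels that did not match any dictionary key
unmapped = int(encoded.isna().sum())
print(encoded.tolist(), unmapped)
```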

# 🔄 Encoding Categorical Features

Most features in the dataset are categorical.
To make them usable for machine learning, one-hot encoding is applied.

In [6]:
df = pd.get_dummies(df, drop_first=True)
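
With `drop_first=True`, the first level of each categorical column is dropped, since it is implied when all remaining dummy columns are zero. Numeric columns pass through unchanged. A minimal sketch on a hypothetical two-column frame (the values are illustrative):

```python
import pandas as pd

# Hypothetical toy frame: one categorical column, one numeric column
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "tenure": [1, 12, 24],
})

encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

Here "Month-to-month" becomes the baseline level, so only the "One year" and "Two year" indicator columns remain next to `tenure`.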


# ⚖️ Scaling Numerical Features

Numerical features have different value ranges.
To put them on a comparable scale, standard scaling (zero mean, unit variance) is applied to:

- ```tenure```

- ```MonthlyCharges```

- ```TotalCharges```

In [7]:
num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
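
Note that this cell fits the scaler on the full dataset, before the train-test split, so the test rows influence the mean and standard deviation. A common alternative is to fit on the training portion only and reuse those statistics for the test portion. The sketch below illustrates that order on synthetic data; the values and variable names are illustrative, not the notebook's:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two of the numerical columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "tenure": rng.integers(1, 72, size=100),
    "MonthlyCharges": rng.uniform(20, 120, size=100),
})

# Split first so the scaler never sees the held-out rows
train, test = train_test_split(toy, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from training rows
test_scaled = scaler.transform(test)        # reuse the training statistics
print(train_scaled.mean(axis=0).round(6), test_scaled.shape)
```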


# ✂️ Splitting Features and Target

The dataset is divided into:

- ```X``` → Input features

- ```y``` → Target variable (Churn)

In [8]:
X = df.drop("Churn", axis=1)
y = df["Churn"]



# 🧪 Train-Test Split

To evaluate the model on unseen data, the dataset is split into two sets, stratified on the target so that both keep the same churn rate:

- Training set (80%)

- Testing set (20%)

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
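
The `stratify=y` argument keeps the class proportions the same in both sets. A small check on synthetic labels (a 25% positive rate chosen for illustration, not the dataset's actual rate):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 25 positives and 75 negatives
y = np.array([1] * 25 + [0] * 75)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both means equal the overall positive rate
print(y_tr.mean(), y_te.mean())
```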

# 🤖 Building the Churn Prediction Model

An XGBoost Classifier is used due to its strong performance on structured tabular data.

The model is initialized with fixed hyperparameters: 200 trees with a maximum depth of 5, a learning rate of 0.1, and 80% row and column subsampling per tree.

In [10]:
xgb = XGBClassifier(
    n_estimators=200,
    max_depth=5,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="logloss",
    random_state=42
)



# 🧠 Training the Model

The model is trained using the training dataset.

During this step, the model learns patterns that differentiate churned and non-churned customers.

In [11]:

xgb.fit(X_train, y_train)

# 🔮 Making Predictions

The trained model is used to:

- Predict churn class (0 or 1)

- Predict churn probability for each customer

In [12]:
y_pred = xgb.predict(X_test)
y_prob = xgb.predict_proba(X_test)[:, 1]
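
The class labels and the probabilities are linked by a cutoff: for a binary model, the default class prediction corresponds to a 0.5 cutoff on the positive-class probability. Because the probabilities are kept, a different cutoff could be applied later to trade precision for recall. A minimal sketch on made-up numbers (the `classify` helper is hypothetical, and the arrays only stand in for `y_test` and `y_prob`):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

def classify(probs, threshold=0.5):
    """Label a customer as churn (1) when the probability exceeds the threshold."""
    return (probs > threshold).astype(int)

# Made-up labels and probabilities
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.6, 0.45, 0.2, 0.7, 0.35, 0.3, 0.2, 0.1, 0.05])

for threshold in (0.5, 0.3):
    y_hat = classify(y_prob, threshold)
    print(threshold,
          recall_score(y_true, y_hat),
          round(precision_score(y_true, y_hat), 2))
```

On this toy data, lowering the cutoff catches more churners at the cost of more false alarms.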


# 📏 Model Evaluation

Model performance is evaluated using:

- Confusion Matrix

- Precision, Recall, F1-score

- ROC-AUC score

These metrics help assess how well the model identifies churners.

In [13]:
print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))

              precision    recall  f1-score   support

           0       0.83      0.87      0.85      1033
           1       0.59      0.52      0.55       374

    accuracy                           0.77      1407
   macro avg       0.71      0.69      0.70      1407
weighted avg       0.77      0.77      0.77      1407

ROC-AUC: 0.8254150467720309
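
The cell above prints the classification report and ROC-AUC; the confusion matrix itself can be obtained with scikit-learn's `confusion_matrix`. The sketch below uses made-up labels rather than the notebook's `y_test` and `y_pred`, and shows how precision and recall for the churn class follow from the matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted churners who actually churned
recall = tp / (tp + fn)     # share of actual churners the model caught
print(tn, fp, fn, tp, round(precision, 2), recall)
```

Read this way, the recall of 0.52 for class 1 in the report above means that roughly half of the 374 churners in the test set were flagged.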


# 💾 Saving Trained Model and Preprocessing Objects

In this step, the trained machine learning components are saved using pickle so they can be reused later without retraining the model.

* The XGBoost model is saved to make churn predictions during deployment.

* The scaler is saved to ensure new input data is scaled in the same way as training data.

* The feature columns list is saved to maintain consistency between training and prediction features.

***Saving these objects is essential for:***

- Deploying the model in a Streamlit web app

- Generating predictions for new customers

- Avoiding feature mismatch errors

In [18]:
with open("xgb_model.pkl", "wb") as f:  # context manager closes the file after writing
    pickle.dump(xgb, f)
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)
with open("model_columns.pkl", "wb") as f:
    pickle.dump(X.columns, f)

print("✅ Model, scaler, and columns saved successfully")

✅ Model, scaler, and columns saved successfully
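
At prediction time, new input has to end up with the same columns, in the same order, as the training features. The sketch below illustrates that alignment step on made-up data; `align_features` is a hypothetical helper, and the column names are a small illustrative subset, not the full saved list:

```python
import pandas as pd

def align_features(raw: pd.DataFrame, model_columns) -> pd.DataFrame:
    """One-hot encode raw input and match the training column layout."""
    encoded = pd.get_dummies(raw)
    # Add columns the input lacks (filled with 0) and drop unseen extras
    return encoded.reindex(columns=list(model_columns), fill_value=0)

# Illustrative stand-in for the pickled column list
model_columns = ["tenure", "MonthlyCharges",
                 "Contract_One year", "Contract_Two year"]

new_customer = pd.DataFrame({
    "tenure": [5],
    "MonthlyCharges": [70.5],
    "Contract": ["Two year"],
})

aligned = align_features(new_customer, model_columns)
print(aligned.to_dict("records"))
```

`drop_first` is left off here because a single input row contains only one category per feature; reindexing against the saved columns then discards the dummy for the baseline level. The saved scaler would still need to be applied to the numerical columns before calling the model.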


# 🧠 Creating Churn Risk Categories

Customers are segmented into risk groups based on churn probability:

- Low Risk: probability up to 0.4

- Medium Risk: probability above 0.4 and up to 0.7

- High Risk: probability above 0.7

In [19]:
# Create dashboard dataframe from TEST data only
dashboard_df = X_test.copy()

dashboard_df["Actual_Churn"] = y_test.values
dashboard_df["Churn_Probability"] = y_prob

dashboard_df["Churn_Risk"] = pd.cut(
    dashboard_df["Churn_Probability"],
    bins=[0, 0.4, 0.7, 1],
    labels=["Low Risk", "Medium Risk", "High Risk"]
)
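
By default, `pd.cut` treats each bin as open on the left and closed on the right, so a probability of exactly 0.4 lands in Low Risk and exactly 0.7 in Medium Risk. A quick check on made-up probabilities:

```python
import pandas as pd

# Made-up probabilities, including the two bin edges
probs = pd.Series([0.05, 0.40, 0.41, 0.70, 0.95])

risk = pd.cut(
    probs,
    bins=[0, 0.4, 0.7, 1],
    labels=["Low Risk", "Medium Risk", "High Risk"],
)
print(risk.tolist())
```

With these defaults, a probability of exactly 0 would fall outside the first bin and become NaN unless `include_lowest=True` is passed.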



# 📤 Exporting Data for Power BI Dashboard

The final dataset with churn probability and risk categories is exported as a CSV file.
This file is used to create an interactive Power BI dashboard.

In [20]:

dashboard_df.to_csv("churn_dashboard_data.csv", index=False)
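
Before building visuals, it can help to check that the risk groups separate churners as expected. The sketch below runs on made-up rows that reuse two of the exported column names; the numbers are illustrative only:

```python
import pandas as pd

# Made-up rows with the same column names as the exported file
toy = pd.DataFrame({
    "Actual_Churn": [0, 0, 1, 0, 1, 1],
    "Churn_Risk": ["Low Risk", "Low Risk", "Medium Risk",
                   "Medium Risk", "High Risk", "High Risk"],
})

# Number of customers and observed churn rate per risk group
summary = (
    toy.groupby("Churn_Risk")["Actual_Churn"]
       .agg(customers="count", churn_rate="mean")
)
print(summary)
```

If the observed churn rate did not rise from Low to High Risk on the real data, the bin edges would be worth revisiting.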
