<a id='0'></a>
# Libraries And Their Usages
<div style='border-width: 2px;
              border-bottom-width:4px;
              border-bottom-color:#ADD8E6;
              border-bottom-style: solid;'></div>

### Data Manipulation & Analysis
- **numpy**: For numerical computations.
- **pandas**: To manipulate and analyze data in tabular formats.

### Data Visualization
- **seaborn**: For creating statistical plots.
- **matplotlib.pyplot**: For general-purpose plotting.
- **plotly.graph_objs, plotly.express, plotly.graph_objects, plotly.subplots**: Interactive visualizations.
- **graphviz**: Visualize decision trees.

### Data Preprocessing
- **sklearn.preprocessing**: Tools like `StandardScaler`, `OneHotEncoder`, `LabelEncoder`, and `MinMaxScaler` for scaling and encoding.
- **sklearn.impute.SimpleImputer**: Handle missing values.
- **sklearn.compose.ColumnTransformer**: Apply transformations on column subsets.

### Feature Selection
- **mlxtend.feature_selection.SequentialFeatureSelector**: Sequential feature selection methods.

### Machine Learning Models
- **sklearn.ensemble.RandomForestClassifier, GradientBoostingClassifier**: Ensemble learning models.
- **sklearn.tree.DecisionTreeClassifier**: Decision tree model.
- **sklearn.linear_model.LogisticRegression**: Logistic regression model.
- **sklearn.neighbors.KNeighborsClassifier**: K-nearest neighbors model.
- **xgboost.XGBClassifier**: Gradient boosting for high-performance machine learning.

### Model Evaluation
- **sklearn.metrics**: Tools for classification (e.g., `roc_auc_score`, `confusion_matrix`, `classification_report`).
- **sklearn.model_selection**: Tools like `train_test_split`, `KFold`, `GridSearchCV`, and `RandomizedSearchCV` for model evaluation and parameter tuning.

### Pipeline and Hyperparameter Optimization
- **sklearn.pipeline.Pipeline**: Build pipelines for streamlined workflows.
- **hyperopt**: Bayesian optimization for hyperparameter tuning.

### Miscellaneous
- **sqlite3**: SQLite database operations.
- **folium, folium.plugins.MarkerCluster**: Interactive maps and geospatial data visualization.
- **IPython.display**: Display images and HTML objects in notebooks.

### Suppressing Warnings
- **warnings**: Suppress unnecessary warnings during execution.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import graphviz
from sklearn.preprocessing import StandardScaler,OneHotEncoder,LabelEncoder,MinMaxScaler
from sklearn.compose import ColumnTransformer
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.model_selection import train_test_split,GridSearchCV,cross_val_score,KFold,RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import classification_report,confusion_matrix ,roc_auc_score,ConfusionMatrixDisplay,accuracy_score,precision_score,recall_score,f1_score,precision_recall_curve
from sklearn.metrics import mean_squared_error
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
import sqlite3
import folium
from folium.plugins import MarkerCluster
from sklearn.neighbors import KNeighborsClassifier
from plotly.graph_objs import *
import plotly.express as px 
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.pipeline import Pipeline
from hyperopt import fmin, tpe, hp, Trials
import warnings
from IPython.display import Image, display,IFrame
from plotly.offline import plot
from xgboost import XGBClassifier

# Suppress all warnings
warnings.filterwarnings('ignore')
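
`ColumnTransformer`, `SimpleImputer`, and `Pipeline` are imported above. As a minimal sketch of how they could fit together, the block below imputes and scales numeric columns and one-hot encodes a categorical one. The `toy` frame and its values are made up for illustration; only the column names mirror the churn data used later.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy frame (values are illustrative, not taken from the dataset)
toy = pd.DataFrame({
    "Age": [25.0, 40.0, np.nan, 55.0],
    "Total Spend": [100.0, 300.0, 200.0, 500.0],
    "Gender": ["Male", "Female", "Female", "Male"],
})

numeric_cols = ["Age", "Total Spend"]
categorical_cols = ["Gender"]

# Numeric columns: fill missing values with the median, then scale to [0, 1]
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])

# Route each column subset to its own transformer
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

transformed = preprocess.fit_transform(toy)
print(transformed.shape)  # two scaled numeric columns plus two one-hot columns
```

Bundling the steps this way keeps imputation, scaling, and encoding in one object that can be fitted on training data and reused on new data.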

<a id='1'></a>
# Data Exploration
<div style='border-width: 2px;
              border-bottom-width:4px;
              border-bottom-color:#ADD8E6;
              border-bottom-style: solid;'></div>

In [None]:
df = pd.read_csv("churn-training.csv")
df.head()

In [None]:
# plt.style.use('dark_background')
warnings.filterwarnings("ignore", category=UserWarning, message="findfont: Font family")

In [None]:
df.describe().T.style.background_gradient(cmap="Blues")

In [None]:
# Dataset Info
print("#" * 50)
print("DATASET INFO")
print("#" * 50)
print(df.info())
print("\n" + "-" * 50)

# Dataset Shape and Size
print("DATASET SHAPE AND SIZE")
print("-" * 50)
print(f"Shape of the dataset: {df.shape}")
print(f"Size of the dataset: {df.size}")
print("\n" + "-" * 50)

# Amount of Types
print("DATA TYPES COUNT")
print("-" * 50)
print(df.dtypes.value_counts())
print("\n" + "-" * 50)

# Types of Features
print("FEATURE DATA TYPES")
print("-" * 50)
print(df.dtypes)
print("\n" + "-" * 50)

# Number of Every Item in Every Column
print("VALUE COUNTS PER COLUMN")
print("-" * 50)
for col in df.columns:
    print(f"Counts of unique items in '{col}':")
    print(df[col].value_counts())
    print("\n" + "-" * 50)


<a id='2'></a>
# Data Cleaning
<div style='border-width: 2px;
              border-bottom-width:4px;
              border-bottom-color:#ADD8E6;
              border-bottom-style: solid;'></div>

## Strategies:
### 1. Null Value Checking and Removal
### 2. Duplicate Checking and Removal
### 3. Outlier Handling

In [None]:
# Check the Null Values :
df.isna().mean()

In [None]:
df=df.drop("CustomerID",axis=1)
df.columns

In [None]:
# Drop Any Null Values If Present :

df = df.dropna()
df.isna().mean()
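
Dropping rows is the simplest option when few values are missing. If too many rows would be lost, imputation is an alternative. The sketch below is one possible approach on a made-up `toy` frame (column names borrowed from the dataset, values hypothetical): median for numeric columns, most frequent value for the rest.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "Tenure": [12.0, np.nan, 30.0, 6.0],
    "Gender": ["Male", "Female", None, "Female"],
})

# Share of missing values per column, same check as df.isna().mean()
missing_share = toy.isna().mean()
print(missing_share)

imputed = toy.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        # Numeric column: replace missing values with the column median
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        # Non-numeric column: replace missing values with the most frequent value
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```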

In [None]:
# Duplicated Values Check :
duplicated_features=df.duplicated().sum()
print("Number of duplicates ----->>> ",duplicated_features)

df = df.drop_duplicates()
duplicated_features=df.duplicated().sum()
print("Number of duplicates after cleaning ----->>> ",duplicated_features)

In [None]:
# Identify Numerical Features
features = df.select_dtypes(include="number").columns

# Check for Outliers in Each Feature
print("#" * 100)
print("OUTLIER ANALYSIS")
print("#" * 100)

for col in features:
    # Calculate Quartiles and IQR
    Q1_col, Q3_col = df[col].quantile([0.25, 0.75])
    iqr = Q3_col - Q1_col
    low_limit = Q1_col - 1.5 * iqr
    upper_limit = Q3_col + 1.5 * iqr
    
    # Identify Outliers
    outlier = [x for x in df[col] if (x > upper_limit or x < low_limit)]
    
    # Display Results
    if len(outlier) == 0:
        print(f"✅ No outliers in '{col}' feature.")
    else:
        print(f"❌ Outliers detected in '{col}' feature.")
    
    print(f"🔹 Q1 (25th percentile) of {col}: {Q1_col}")
    print(f"🔹 Q3 (75th percentile) of {col}: {Q3_col}")
    print(f"🔹 IQR (Interquartile Range): {iqr}")
    print(f"🔹 Lower Limit: {low_limit}")
    print(f"🔹 Upper Limit: {upper_limit}")
    print(f"🔹 Outliers: {outlier}")
    print(f"🔹 Number of Outliers: {len(outlier)}")
    print("-" * 90)
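
The same 1.5 × IQR rule can be written as a boolean mask, which avoids the Python-level loop over every value and makes it easy to filter or count outliers. This is a sketch on a made-up series; `iqr_outlier_mask` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical values; 95 sits far away from the rest
toy = pd.Series([10, 12, 11, 13, 12, 95], name="Support Calls")
mask = iqr_outlier_mask(toy)

print("Outliers:", toy[mask].tolist())
print("Number of outliers:", int(mask.sum()))
```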

In [None]:
# columns=["Age","Tenure","Usage Frequency","Support Calls","Payment Delay","Total Spend","Last Interaction"]
# for col in columns :
#     fig2 = px.box(df, y=df[col],color='Churn', title=col + "_Distribution")
#     filename="box.html"
#     plot(fig2, filename=filename, auto_open=False)
#     display(IFrame(filename, width=800, height=600))
#     print("="*100)

## Note: The generated file is too big to upload to GitHub, so this cell stays commented out :(

In [None]:
# Updated Color Palette (Blue-Themed)
color_palette = ['#1E90FF', '#00BFFF', '#4682B4', '#5F9EA0', '#87CEEB', '#6495ED']

# Observation Between Age & Payment Delay
print("🔹 Positive Relation:")
fig = px.scatter(
    df, 
    x='Age', 
    y='Payment Delay', 
    color='Payment Delay', 
    color_discrete_sequence=color_palette, 
    trendline='ols'
)
fig.update_layout(
    title="Age vs. Payment Delay",
    xaxis_title="Age",
    yaxis_title="Payment Delay",
    title_font=dict(size=16),
    title_x=0.5
)
fig.show()
print("=" * 75)

# Observation Between Age & Total Spend
print("🔹 Negative Relation:")
fig = px.scatter(
    df, 
    x='Age', 
    y='Total Spend', 
    color='Total Spend', 
    color_discrete_sequence=color_palette, 
    trendline='ols'
)
fig.update_layout(
    title="Age vs. Total Spend",
    xaxis_title="Age",
    yaxis_title="Total Spend",
    title_font=dict(size=16),
    title_x=0.5
)
fig.show()
print("=" * 75)


In [None]:
# Plot a histogram for every numerical column:
df.hist(figsize=(25,25),color="b");

In [None]:
# Histogram: Churn by Subscription Type
fig = px.histogram(df, x="Churn", color="Subscription Type")
fig.update_layout(
    bargap=0.2,
    title="Subscription Type vs. Churn",
    legend_title="Subscription Type",
    width=800,
    height=600
)
fig.show()

print("=" * 100)

# Histogram: Churn by Gender
fig = px.histogram(df, x="Churn", color="Gender")
fig.update_layout(
    bargap=0.2,
    title="Gender vs. Churn",
    legend_title="Gender",
    width=800,
    height=600
)
fig.show()

print("=" * 100)

# Histogram: Payment Delay by Contract Length
fig = px.histogram(df, x="Payment Delay", color="Contract Length")
fig.update_layout(
    bargap=0.2,
    title="Contract Length vs. Payment Delay",
    legend_title="Contract Length",
    width=800,
    height=600
)
fig.show()

In [None]:
# Float Data Observation
# plt.figure(figsize=(25, 25), dpi=250)
# sns.set(style="whitegrid")
# sns.set_palette("coolwarm")
# sns.pairplot(df.select_dtypes("number"), plot_kws={'alpha': 0.6, 's': 80})

# Commented so the Notebook can be uploaded to GitHub :'(

In [None]:
for col in df.select_dtypes("number"):
    if col != "Churn":
        with sns.axes_style("white"):
            sns.set(style="whitegrid")
            sns.set_palette("Oranges")
            sns.jointplot(x=df[col], y=df["Churn"], kind="hex")

<a id='3'></a>
# Modeling
<div style='border-width: 2px;
              border-bottom-width:4px;
              border-bottom-color:#ADD8E6;
              border-bottom-style: solid;'></div>

In [None]:
## Define All Models Used :

gridsearch1=GridSearchCV(estimator=RandomForestClassifier(),             # the model used
                        param_grid={"n_estimators":[50,100,160],         # Number of trees in the forest
                        "max_depth":[50,120,180],                        # Maximum depth of each tree
                        "max_features":[2,3,6]} ,                        # Number of features considered at each split
                         cv=3,
                         return_train_score=False,
                         scoring='accuracy')

gridsearch3=GridSearchCV(estimator=LogisticRegression(max_iter=200),
                        param_grid = {'C': [0.1, 1, 10], 'penalty': ['l2']},
                         cv=3,
                         return_train_score=False,
                         scoring='accuracy')

models={
    "LogisticRegression":gridsearch3,
    "RandomForestClassifier":gridsearch1,
    "DecisionTreeClassifier":DecisionTreeClassifier(max_depth=5,max_features=6,random_state=42)}

In [None]:
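# One-hot encode the categorical columns. dtype=int keeps the dummy columns numeric;
# recent pandas versions return booleans by default, which select_dtypes('number') would drop.
df = pd.get_dummies(df, dtype=int)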

# Selecting Numerical Features :
numerical_features = df.select_dtypes(include=['number'])

scaler = MinMaxScaler()
scaled_numerical_features = scaler.fit_transform(numerical_features)

# Create a DataFrame From Scaled Numerical Features
scaled_numerical_df = pd.DataFrame(scaled_numerical_features, columns=numerical_features.columns)

scaled_numerical_df

In [None]:
# Splitting The Data Into Train & Test :

# Separate the features from the target column
x_class = scaled_numerical_df.drop(columns="Churn")
y_class = scaled_numerical_df["Churn"]

x_train,x_test,y_train,y_test=train_test_split(x_class,y_class,test_size=0.3,random_state=42)
print("x_train shape : ",x_train.shape)
print("x_test shape : ",x_test.shape)
print("y_train shape : ",y_train.shape)
print("y_test shape : ",y_test.shape)
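
Two common refinements to a split like this are stratifying on the target, so both parts keep the same churn ratio, and fitting the scaler on the training part only, so no information from the test rows reaches training. The sketch below shows both on synthetic data; the array names and the 80/20 class balance are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 100 rows, 3 features, 20% positive class
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the class ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both parts
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Positive share in train:", y_train.mean())
print("Positive share in test :", y_test.mean())
```

Test values can fall slightly outside [0, 1] with this approach, which is expected because the scaler never saw them.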

In [None]:
for model_name, model in models.items():
    # Fit the model to the training data
    model.fit(x_train, y_train)
    
    # Make predictions on the test & train data
    y_pred = model.predict(x_test)
    y_train_pred = model.predict(x_train)
    
    # Calculate accuracy and mean squared error for train and test data
    # (for 0/1 labels, MSE equals the misclassification rate, i.e. 1 - accuracy)
    acc_train = model.score(x_train, y_train)
    acc_test = model.score(x_test, y_test)
    mse_train = mean_squared_error(y_train, y_train_pred)
    mse_test = mean_squared_error(y_test, y_pred)
    
    # Evaluate additional metrics
    f1 = f1_score(y_test, y_pred, average='binary')
    auc_score = roc_auc_score(y_test, model.predict_proba(x_test)[:, 1])  # AUC is computed from class-1 probabilities, not hard labels
    precision = precision_score(y_test, y_pred, average='macro')
    recall = recall_score(y_test, y_pred, average='macro')
    
    # Calculate confusion matrix and classification report
    cm = confusion_matrix(y_test, y_pred)
    classif_report = classification_report(y_test, y_pred)
    
    # Display heatmap of the confusion matrix
    plt.figure(figsize=(7, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title(f"Confusion Matrix for {model_name}")
    plt.xlabel(f"Predicted by {model_name}")
    plt.ylabel("Truth")
    plt.show()
    
    # Print evaluation metrics
    print("=" * 60)
    print(f"Model: {model_name}")
    print(f"Train Accuracy: {acc_train * 100:.2f}%")
    print(f"Test Accuracy: {acc_test * 100:.2f}%")
    print(f"Train Mean Squared Error (MSE): {mse_train:.4f}")
    print(f"Test Mean Squared Error (MSE): {mse_test:.4f}")
    print(f"F1 Score: {f1 * 100:.2f}%")
    print(f"AUC Score: {auc_score * 100:.2f}%")
    print(f"Precision: {precision * 100:.2f}%")
    print(f"Recall: {recall * 100:.2f}%")
    print("\nClassification Report:")
    print(classif_report)
    print("=" * 60 + "\n")
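
ROC AUC measures how well a model ranks positives above negatives, so it is normally computed from scores or probabilities. Passing hard 0/1 predictions collapses the ranking to two levels and usually gives a different number. A small sketch with made-up labels and probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth and predicted probabilities for class 1
y_true = np.array([0, 0, 1, 1, 1])
proba = np.array([0.2, 0.4, 0.3, 0.45, 0.9])

# Hard labels at the usual 0.5 threshold
hard_labels = (proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(f"AUC from probabilities: {auc_from_proba:.4f}")
print(f"AUC from hard labels  : {auc_from_labels:.4f}")
```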


In [None]:
# Actual & Predicted Values for Every Model (testing samples):

for model_name, model in models.items():
    # The models were already fitted in the evaluation loop, so only predict here
    y_pred = model.predict(x_test)
    print(f"👉🏻Model: {model_name}")
    
    # Create a dataframe to display actual and predicted values
    # (named comparison_df so the original df is not overwritten)
    comparison_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
    print(comparison_df.head(10))
    print("=" * 70)

<a id='99'></a>
## Random Forest Classifier
<div style='border-width: 2px;
              border-bottom-width:4px;
              border-bottom-color:#ADD8E6;
              border-bottom-style: solid;'></div>

In [None]:
best_estimator = gridsearch1.best_estimator_
print("Best estimator:", best_estimator)
feature_importances2 = best_estimator.feature_importances_
feature_names = x_class.columns

# Create a DataFrame to display feature importances

importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances2})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
importance_df.style.background_gradient(cmap="Blues")
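
Besides `best_estimator_`, a fitted `GridSearchCV` exposes `best_params_`, `best_score_`, and `cv_results_`, which show how every parameter combination performed. The sketch below uses synthetic data from `make_classification` and a deliberately small grid so it runs quickly; `X_demo`, `y_demo`, and `search` are illustrative names, not objects from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the churn dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 20], "max_depth": [3, 5]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")

# One row per parameter combination, ranked by mean cross-validated accuracy
results = pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
print(results.sort_values("rank_test_score"))
```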

<a id='100'></a>
## Decision Tree Classifier Feature Importances
<div style='border-width: 2px;
              border-bottom-width:4px;
              border-bottom-color:#ADD8E6;
              border-bottom-style: solid;'></div>

In [None]:
font_properties = {
    'family': 'serif',
    'color': 'blue',
    'weight': 'bold',
    'size': 45}

decision_tree_model = models["DecisionTreeClassifier"]
plt.figure(figsize=(85,75),dpi=150)
tree.plot_tree(decision_tree_model, filled=True, feature_names=list(x_class.columns), node_ids=True, fontsize=42)
plt.title("Decision Tree Classifier",fontdict=font_properties)
plt.show()

In [None]:
feature_importances1 = decision_tree_model.feature_importances_
feature_names = x_class.columns

importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances1})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
importance_df.style.background_gradient(cmap="Blues")

<a id='101'></a>
## Logistic Regression
<div style='border-width: 2px;
              border-bottom-width:4px;
              border-bottom-color:#ADD8E6;
              border-bottom-style: solid;'></div>

In [None]:
best_estimator = gridsearch3.best_estimator_
print("Best estimator:", best_estimator)

coefficients = best_estimator.coef_[0] # Use the logistic regression coefficients as a signed measure of importance
feature_names = x_class.columns

feature_importances = pd.DataFrame({
    'Feature': feature_names,
    'Importance': coefficients})

feature_importances = feature_importances.sort_values(by='Importance', ascending=False)
feature_importances.style.background_gradient(cmap="Blues")
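
Logistic regression coefficients are signed: a large negative value pushes the prediction away from churn just as strongly as a large positive value pushes toward it. Sorting by the raw value therefore puts strong negative effects at the bottom. One way to rank by strength of effect is to sort on the absolute value, sketched here with hypothetical coefficient values:

```python
import pandas as pd

# Hypothetical coefficients for illustration only (not results from this notebook)
coef_table = pd.DataFrame({
    "Feature": ["Support Calls", "Tenure", "Age"],
    "Coefficient": [2.1, -3.4, 0.2],
})

# Rank by magnitude while keeping the sign visible in the Coefficient column
coef_table["AbsCoefficient"] = coef_table["Coefficient"].abs()
ranked = coef_table.sort_values("AbsCoefficient", ascending=False).reset_index(drop=True)

print(ranked)
```

Comparing magnitudes like this is only meaningful when the features share a common scale, as they do after min-max scaling.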

<a id='102'></a>
## Saving And Loading The Model
<div style='border-width: 2px;
              border-bottom-width:4px;
              border-bottom-color:#ADD8E6;
              border-bottom-style: solid;'></div>

In [None]:
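import pickle

# Persist the fitted, tuned random forest (not the RandomForestClassifier class itself)
with open('RandomForestClassifier.sav', 'wb') as file:
    pickle.dump(gridsearch1.best_estimator_, file)
print("Model Saved .......")

# Load it back to confirm the file can be read
with open('RandomForestClassifier.sav', 'rb') as file:
    my_object_loaded = pickle.load(file)
print("Model Loaded .......")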
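
A quick way to confirm that a saved model survives the round trip is to compare predictions before and after loading. The sketch below does this in memory with `pickle.dumps` and `pickle.loads` on a small forest trained on synthetic data; the data and variable names are assumptions for illustration.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the churn dataset)
X, y = make_classification(n_samples=200, n_features=6, random_state=42)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

# Serialize the fitted model instance, then restore it
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same_predictions)
```

Pickled models should only be loaded from trusted sources and, ideally, with the same scikit-learn version that created them.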

<a id='103'></a>
# Summary
<div style='border-width: 2px;
              border-bottom-width:4px;
              border-bottom-color:#ADD8E6;
              border-bottom-style: solid;'></div>

<div class="alert alert-block alert-info">
    <h4>Observation</h4>
    First up, this notebook is all about figuring out why customers might leave a service (that's what we call "churn" in business speak). Next, I dug into the data to understand what's going on with the customers. I looked at stuff like how age relates to late payments, how much people spend, and what kind of subscriptions they have. Think of it like being a detective, looking for patterns and clues in customer behavior. I also cleaned up the data by dropping rows with missing information and removing duplicate entries, kind of like organizing a messy drawer.
    <br>
    <br>
    Here's where the cool tech stuff comes in. I used three different types of prediction models (Random Forest, Logistic Regression, and Decision Tree). I tuned the Random Forest and Logistic Regression with a grid search over a few hyperparameters, like fine-tuning a car engine for the best performance. I checked how well each model worked using accuracy, F1, AUC, precision, and recall - basically asking "how good are you at predicting who's going to leave?"
    <br>
    <br>
    Finally, I figured out which customer characteristics are the most important in predicting whether someone will leave or stay. Each model gave its own take on what matters most. I wrapped it all up by saving the best model (the Random Forest one) so it can be used again later. It's like writing down a winning recipe - you want to keep it for future use! The whole thing is super practical, giving you both the insights into why customers leave and a tool to predict who might leave next.
</div>