![](https://uruit.com/blog/wp-content/uploads/2020/11/Churn1-1024x724.jpg)

# Business Problem

- You are asked to develop a machine learning model that can predict customers who will leave the company.
- You are expected to perform the necessary data analysis and feature engineering steps before developing the model.

# Dataset Story

- The Telco customer churn dataset contains information about a fictitious telecom company that provided home phone and Internet services to 7043 customers in California in the third quarter. It shows which customers left, stayed, or signed up for service.

- The dataset consists of 21 variables and 7043 observations.

- **customerID :** Unique customer ID
- **gender :** Gender of the customer
- **SeniorCitizen :** Whether the customer is a senior citizen (1, 0)
- **Partner :** Whether the customer has a partner (Yes, No), for example a spouse or someone they live with
- **Dependents :** Whether the customer has dependents (Yes, No), for example a child, mother, father, or grandmother
- **tenure :** Number of months the customer has stayed with the company
- **PhoneService :** Whether the customer has phone service (Yes, No)
- **MultipleLines :** Whether the customer has more than one line (Yes, No, No phone service)
- **InternetService :** Customer's Internet service type (DSL, Fiber optic, No)
- **OnlineSecurity :** Whether the customer has online security (Yes, No, No Internet service)
- **OnlineBackup :** Whether the customer has online backup (Yes, No, No Internet service)
- **DeviceProtection :** Whether the customer has device protection (Yes, No, No Internet service)
- **TechSupport :** Whether the customer receives technical support (Yes, No, No Internet service)

- **StreamingTV :** Whether the customer has streaming TV (Yes, No, No Internet service), i.e. whether the customer uses the Internet service to stream television programs from a third-party provider

- **StreamingMovies :** Whether the customer has streaming movies (Yes, No, No Internet service), i.e. whether the customer uses the Internet service to stream movies from a third-party provider

- **Contract :** Duration of the customer's contract (Month-to-month, One year, Two year)
- **PaperlessBilling :** Whether the customer has a paperless bill (Yes, No)
- **PaymentMethod :** Customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- **MonthlyCharges :** Amount charged to the customer monthly
- **TotalCharges :** Total amount charged to the customer
- **Churn :** Whether the customer churned (Yes, No), i.e. customers who left within the last month or quarter.

- Each row represents a unique customer. The variables cover the services each customer uses, their account details, and demographic data.
- Services that customers signed up for => phone, multiple lines, internet, online security, online backup, device protection, technical support, and TV and movie streaming.
- Customer account information => how long they have been a customer, contract, payment method, paperless billing, monthly charges and total charges.
- Demographic information about customers => gender, age range, and whether they have partners and dependents.

# Road Map

- **1. Import Required Libraries**

- **2. Adjusting Row Column Settings**

- **3. Loading the Dataset**

- **4. Exploratory Data Analysis**

- **5. Capturing / Detecting Numeric and Categorical Variables**

- **6. Analysis of Categorical Variables**

- **7. Analysis of Numerical Variables**

- **8. Analysis of Numeric Variables by Target**

- **9. Analysis of Categorical Variables by Target**

- **10. Examining the Logarithm of the Dependent Variable**

- **11. Correlation Analysis**

- **12. Feature Engineering**

- **13. Missing Value Analysis**

- **14. Outlier Analysis**

- **15. Base Model**

- **16. Comparison of Metrics for Different Models Before Feature Engineering**

- **17. Feature Importance For Base Model**

- **18. Feature Extraction**

- **19. Encoding**

- **20. Standardization Process**

- **21. Creating Model**

- **22. Comparison of Metrics for Different Models After Feature Engineering**

- **23. Feature Importance**

- **24. Metric Improvement Comparison After Feature Engineering**

- **25. Hyperparameter Optimization**

# 1. Import Required Libraries

In [None]:
!pip install pydotplus

In [None]:
!pip install skompiler

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import joblib
import graphviz
import pydotplus
import plotly.graph_objects as go

from scipy import stats
from datetime import date
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import GridSearchCV, cross_validate, RandomizedSearchCV, validation_curve
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, StandardScaler, RobustScaler
from skompiler import skompile

import warnings
warnings.simplefilter(action="ignore")

# 2. Adjusting Row Column Settings

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# 3. Loading the Dataset

In [None]:
df = pd.read_csv("/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")

# 4. Exploratory Data Analysis

In [None]:
def check_df(dataframe):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
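    # Quantiles are only defined for numeric columns; object columns raise an error in recent pandas versions
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1], numeric_only=True).T)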
    # print("##################### Head #####################")
    # dataframe.head()
    # print("##################### Tail #####################")
    # print(dataframe.tail(head))

check_df(df)

In [None]:
df.head()

In [None]:
df['Churn'].value_counts()

In [None]:
# We changed the type of the TotalCharges variable.

df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors='coerce')
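As a minimal sketch of what `errors='coerce'` does, the toy Series below (made-up values, not rows from the dataset) contains a blank string. It is converted to `NaN` instead of raising an error, which is why missing values can show up in `TotalCharges` only after this conversion.

```python
import pandas as pd

# Toy values for illustration only; the blank string mimics an empty charge field.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # numeric dtype after conversion
print(converted.isna().sum())  # how many entries could not be parsed
```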

In [None]:
df['TotalCharges'].dtypes

In [None]:
df["SeniorCitizen"].dtypes

In [None]:
# We changed the type of the SeniorCitizen variable.

df["SeniorCitizen"] = df["SeniorCitizen"].astype("O")

In [None]:
df.info()

In [None]:
# We changed the representation of the Churn variable in the dataset from "Yes" and "No" to 1 and 0.

df["Churn"] = df["Churn"].apply(lambda x : 1 if x == "Yes" else 0)
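The same recoding can also be written as a vectorized comparison. The sketch below uses toy labels (not the dataset) to show that both forms give the same 0/1 values.

```python
import pandas as pd

# Toy labels for illustration only.
churn = pd.Series(["Yes", "No", "No", "Yes"])

# Row-wise lambda, as in the cell above.
via_apply = churn.apply(lambda x: 1 if x == "Yes" else 0)

# Vectorized comparison followed by a cast to integer.
via_compare = (churn == "Yes").astype(int)

print(via_apply.tolist())
print((via_apply == via_compare).all())
```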

In [None]:
df.head()

In [None]:
df['gender'].value_counts()

In [None]:
df.info()

# 5. Capturing / Detecting Numeric and Categorical Variables

In [None]:
import pandas as pd

class DataFrameAnalyzer:
    """
    A class to analyze the columns of a DataFrame and categorize them into different types
    based on their characteristics (e.g., categorical, numerical, and cardinal).
    
    Attributes
    ----------
    dataframe : pd.DataFrame
        The DataFrame to analyze.
    cat_th : int, optional
        The threshold for a variable to be considered as categorical but numeric (default is 10).
    car_th : int, optional
        The threshold for a variable to be considered categorical but cardinal (default is 20).
        
    Methods
    -------
    grab_col_names() -> tuple
        Returns the names of categorical, numeric, and categorical but cardinal variables in the DataFrame.
    """

    def __init__(self, dataframe: pd.DataFrame, cat_th: int = 10, car_th: int = 20):
        """
        Initializes the DataFrameAnalyzer with a DataFrame and optional thresholds for categorization.
        
        Parameters
        ----------
        dataframe : pd.DataFrame
            The DataFrame to analyze.
        cat_th : int, optional
            The threshold for a variable to be considered as categorical but numeric (default is 10).
        car_th : int, optional
            The threshold for a variable to be considered categorical but cardinal (default is 20).
        """
        self.dataframe = dataframe
        self.cat_th = cat_th
        self.car_th = car_th
    
    def grab_col_names(self):
        """
        Categorizes the columns of the DataFrame into categorical, numerical, and cardinal variables.

        Returns
        -------
        tuple
            A tuple containing four lists:
            - cat_cols: Categorical variables (including numeric but categorical variables).
            - num_cols: Numeric variables.
            - cat_but_car: Categorical variables that are cardinal.
            - num_but_cat: Numeric variables that appear categorical.
        """
        # Identifying categorical variables
        cat_cols = [col for col in self.dataframe.columns if self.dataframe[col].dtypes == "O"]
        
        # Identifying numeric variables that have fewer unique values (treated as categorical)
        num_but_cat = [col for col in self.dataframe.columns if self.dataframe[col].nunique() < self.cat_th and self.dataframe[col].dtypes != "O"]
        
        # Identifying categorical variables that have more unique values (treated as cardinal)
        cat_but_car = [col for col in self.dataframe.columns if self.dataframe[col].nunique() > self.car_th and self.dataframe[col].dtypes == "O"]
        
        # Finalizing categorical columns list (including numeric columns treated as categorical)
        cat_cols += num_but_cat
        cat_cols = [col for col in cat_cols if col not in cat_but_car]

        # Identifying numeric variables that are not in num_but_cat
        num_cols = [col for col in self.dataframe.columns if self.dataframe[col].dtypes != "O"]
        num_cols = [col for col in num_cols if col not in num_but_cat]

        # Printing some useful information
        print(f"Observations: {self.dataframe.shape[0]}")
        print(f"Variables: {self.dataframe.shape[1]}")
        print(f'cat_cols: {len(cat_cols)}')
        print(f'num_cols: {len(num_cols)}')
        print(f'cat_but_car: {len(cat_but_car)}')
        print(f'num_but_cat: {len(num_but_cat)}')

        return cat_cols, num_cols, cat_but_car, num_but_cat

# Example usage
# Assuming `df` is a DataFrame
analyzer = DataFrameAnalyzer(df)
cat_cols, num_cols, cat_but_car, num_but_cat = analyzer.grab_col_names()
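
To see how the two thresholds act, here is a small sketch that applies the same rules to a toy frame. The column names (`id`, `plan`, `flag`, `charge`) are hypothetical and chosen only so that each category gets exactly one column.

```python
import pandas as pd

# Toy frame for illustration only; string columns are stored with object dtype.
toy = pd.DataFrame({
    "id": pd.Series([f"c{i}" for i in range(30)], dtype=object),  # 30 unique strings
    "plan": pd.Series(["a", "b", "c"] * 10, dtype=object),        # 3 unique strings
    "flag": [0, 1] * 15,                                          # numeric, 2 unique values
    "charge": [i * 1.5 for i in range(30)],                       # numeric, 30 unique values
})

cat_th, car_th = 10, 20  # same defaults as DataFrameAnalyzer

# Numeric columns with few unique values are treated as categorical.
num_but_cat = [c for c in toy.columns if toy[c].dtype != "O" and toy[c].nunique() < cat_th]

# Object columns with many unique values are treated as cardinal (ID-like).
cat_but_car = [c for c in toy.columns if toy[c].dtype == "O" and toy[c].nunique() > car_th]

cat_cols = [c for c in toy.columns if toy[c].dtype == "O" and c not in cat_but_car] + num_but_cat
num_cols = [c for c in toy.columns if toy[c].dtype != "O" and c not in num_but_cat]

print(cat_cols, num_cols, cat_but_car, num_but_cat)
```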


In [None]:
cat_cols

In [None]:
num_cols

In [None]:
cat_but_car

# 6. Analysis of Categorical Variables

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

class CategoricalSummary:
    """
    A class to provide a summary for categorical columns in a DataFrame.
    
    Attributes
    ----------
    dataframe : pd.DataFrame
        The DataFrame containing the data to be analyzed.
    
    Methods
    -------
    __init__(dataframe):
        Initializes the CategoricalSummary class with a DataFrame.
    
    show_summary(col_name):
        Prints the summary of the specified categorical column, including value counts and their ratios.
    
    plot_distribution(col_name):
        Plots a countplot for the specified categorical column.
    """
    
    def __init__(self, dataframe):
        """
        Initializes the CategoricalSummary class with a DataFrame.
        
        Parameters
        ----------
        dataframe : pd.DataFrame
            The DataFrame containing the data to be analyzed.
        """
        self.dataframe = dataframe
    
    def show_summary(self, col_name):
        """
        Prints the value counts and their ratios for a specified categorical column.
        
        Parameters
        ----------
        col_name : str
            The name of the categorical column to summarize.
        """
        value_counts = self.dataframe[col_name].value_counts()
        ratio = 100 * value_counts / len(self.dataframe)
        summary = pd.DataFrame({col_name: value_counts, "Ratio": ratio})
        print(summary)
        print("##########################################")
    
    def plot_distribution(self, col_name):
        """
        Plots the distribution of a specified categorical column using a countplot.
        
        Parameters
        ----------
        col_name : str
            The name of the categorical column to plot.
        """
        sns.countplot(x=self.dataframe[col_name], data=self.dataframe)
        plt.show(block=True)

# Usage example: build the helper once, then reuse it for every categorical column
summary = CategoricalSummary(df)
for col in cat_cols:
    # Print value counts and ratios for the column
    summary.show_summary(col)

    # Plot the distribution of the column
    summary.plot_distribution(col)


In [None]:
def cat_summary(dataframe, col_name, plot=False):
    print(pd.DataFrame({col_name: dataframe[col_name].value_counts(),
                        "Ratio": 100 * dataframe[col_name].value_counts() / len(dataframe)}))
    print("##########################################")
    if plot:
        sns.countplot(x=dataframe[col_name], data=dataframe)
        plt.show(block=True)

In [None]:
for col in cat_cols:
    cat_summary(df, col, plot=True)

# 7. Analysis of Numerical Variables

In [None]:
def num_summary(dataframe, numerical_col, plot=False):
    quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
    print(dataframe[numerical_col].describe(quantiles).T)

    if plot:
        dataframe[numerical_col].hist(bins=20)
        plt.xlabel(numerical_col)
        plt.title(numerical_col)
        plt.show(block=True)

In [None]:
for col in num_cols:
    num_summary(df, col, plot=True)

# 8. Analysis of Numeric Variables by Target

In [None]:
def target_summary_with_num(dataframe, target, numerical_col):
    print(dataframe.groupby(target).agg({numerical_col: "mean"}), end="\n\n\n")

In [None]:
for col in num_cols:
    target_summary_with_num(df, "Churn", col)

# 9. Analysis of Categorical Variables by Target

In [None]:
def target_summary_with_cat(dataframe, target, categorical_col):
    print(categorical_col)
    print(pd.DataFrame({"TARGET_MEAN": dataframe.groupby(categorical_col)[target].mean(),
                        "Count": dataframe[categorical_col].value_counts(),
                        "Ratio": 100 * dataframe[categorical_col].value_counts() / len(dataframe)}), end="\n\n\n")


In [None]:
for col in cat_cols:
    target_summary_with_cat(df, "Churn", col)
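
Because `Churn` is coded as 0/1, the group mean printed as `TARGET_MEAN` is simply the churn rate within each category. A minimal sketch on toy rows (made-up values, not results from the dataset):

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "One year"],
    "Churn": [1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 target per group = share of churned customers in that group.
rate = toy.groupby("Contract")["Churn"].mean()
print(rate)
```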

# 10. Examining the Logarithm of the Dependent Variable

In [None]:
np.log1p(df["Churn"]).hist(bins=50)
plt.show(block=True)
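
Note that `Churn` is binary at this point, so `np.log1p` can only return two distinct values and the histogram has two bars. A quick check on toy labels:

```python
import numpy as np
import pandas as pd

# Toy 0/1 labels for illustration only.
churn = pd.Series([0, 1, 0, 0, 1])

# log1p(0) = 0 and log1p(1) = ln(2), so only two values can appear.
logged = np.log1p(churn)
print(sorted(logged.unique()))
```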

# 11. Correlation Analysis

In [None]:
df.info()

In [None]:
corr = df[num_cols].corr()

In [None]:
corr

In [None]:
def high_correlated_cols(dataframe, plot=False, corr_th=0.70):
    # Correlation is only defined for numeric columns
    corr = dataframe.corr(numeric_only=True)
    cor_matrix = corr.abs()
    # Keep the upper triangle so each pair is checked once (np.bool was removed from NumPy; use bool)
    upper_triangle_matrix = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
    drop_list = [col for col in upper_triangle_matrix.columns if any(upper_triangle_matrix[col] > corr_th)]
    if plot:
        sns.set(rc={"figure.figsize": (12, 12)})
        corr_values = corr.round(2)
        sns.heatmap(corr, cmap="RdBu", annot=corr_values)
        plt.show(block=True)
    return drop_list

In [None]:
high_correlated_cols(df, plot=True)
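
The upper-triangle mask is the key step: it makes sure every pair of columns is compared once and a column is never compared with itself. A small sketch on a toy frame, where `b` is an exact multiple of `a` and `c` is only weakly related to both (hypothetical values):

```python
import numpy as np
import pandas as pd

# Toy numeric frame for illustration only.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],
    "c": [5, 1, 4, 2, 6, 3],
})

corr_abs = toy.corr().abs()

# True strictly above the diagonal, False elsewhere.
mask = np.triu(np.ones(corr_abs.shape), k=1).astype(bool)
upper = corr_abs.where(mask)

# A column is flagged if it correlates strongly with any earlier column.
drop_list = [col for col in upper.columns if (upper[col] > 0.70).any()]
print(drop_list)
```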

# 12. Feature Engineering

**In this section, we will perform the following feature engineering operations:**

- Missing Value Detection
- Outlier Detection
- Feature Extraction

# 13. Missing Value Analysis

In [None]:
df.isnull().sum()

In [None]:
def missing_values_table(dataframe, na_name=False, plot=False):
    na_columns = [col for col in dataframe.columns if dataframe[col].isnull().sum() > 0]
    n_miss = dataframe[na_columns].isnull().sum().sort_values(ascending=False)
    ratio = (dataframe[na_columns].isnull().sum() / dataframe.shape[0] * 100).sort_values(ascending=False)
    missing_df = pd.concat([n_miss, np.round(ratio, 2)], axis=1, keys=['n_miss', 'ratio'])
    print(missing_df, end="\n")
    
    if plot:
        # Plotting the missing values
        plt.figure(figsize=(10, 8))
        bars = plt.bar(missing_df.index, missing_df['ratio'], color='purple')
        plt.xlabel('Features')
        plt.ylabel('Percentage of Missing Values')
        plt.title('Missing Values by Feature')
        
        for bar in bars:
            yval = bar.get_height()
            plt.text(bar.get_x() + bar.get_width() / 2, yval, f'{yval:.2f}%', ha='center', va='bottom')
        
        plt.xticks(rotation=90)
        plt.tight_layout()
        plt.show(block=True)
    
    if na_name:
        return na_columns

In [None]:
na_columns = missing_values_table(df, na_name=True, plot=True)

In [None]:
def missing_vs_target(dataframe, target, na_columns, plot=False):
    temp_df = dataframe.copy()
    for col in na_columns:
        temp_df[col + '_NA_FLAG'] = np.where(temp_df[col].isnull(), 1, 0)
    na_flags = temp_df.loc[:, temp_df.columns.str.contains("_NA_")].columns
    for col in na_flags:
        print(pd.DataFrame({"TARGET_MEAN": temp_df.groupby(col)[target].mean(),
                            "Count": temp_df.groupby(col)[target].count()}), end="\n\n\n")
        if plot:
            # Plotting the target mean by NA flag
            plt.figure(figsize=(6, 4))
            temp_df.groupby(col)[target].mean().plot(kind='bar', color='purple')
            plt.xlabel(col)
            plt.ylabel('Target Mean')
            plt.title(f'Target Mean by {col}')
            plt.xticks(rotation=0)
            plt.tight_layout()
            plt.show(block=True)
            print("######################################################################")

In [None]:
missing_vs_target(df, "Churn", na_columns, plot=True)

In [None]:
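# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas versions
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())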

In [None]:
df.isnull().sum()
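
One reason to fill with the median rather than the mean is that it is less sensitive to very large values. A minimal sketch on made-up charges:

```python
import numpy as np
import pandas as pd

# Toy charges for illustration only; one value is missing and one is very large.
charges = pd.Series([20.0, 50.0, np.nan, 80.0, 4000.0])

# median() skips NaN by default, so it is computed from the four known values.
filled = charges.fillna(charges.median())

print(charges.median(), charges.mean())
print(filled.tolist())
```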

# 14. Outlier Analysis

In [None]:
def outlier_thresholds(dataframe, col_name, q1=0.05, q3=0.95):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit
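
The defaults here are the 5% and 95% quantiles rather than the usual 25% and 75% quartiles, which gives much wider limits. The sketch below uses a hypothetical helper `thresholds` with the same formula on the integers 1 to 100 to compare the two choices.

```python
import pandas as pd

def thresholds(series, q1, q3):
    """Same formula as outlier_thresholds, applied to a single Series."""
    low_q = series.quantile(q1)
    high_q = series.quantile(q3)
    spread = high_q - low_q
    return low_q - 1.5 * spread, high_q + 1.5 * spread

# Toy data for illustration only: the integers 1..100.
values = pd.Series(range(1, 101))

wide_low, wide_up = thresholds(values, 0.05, 0.95)      # defaults used in this notebook
narrow_low, narrow_up = thresholds(values, 0.25, 0.75)  # conventional IQR rule

print(round(wide_low, 2), round(wide_up, 2))
print(round(narrow_low, 2), round(narrow_up, 2))
```

With the wider limits, only very extreme values are flagged, so fewer observations are capped later on.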

In [None]:
def check_outlier(dataframe, col_name, plot=False):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    outliers = dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)]
    if not outliers.empty:
        if plot:
            plt.figure(figsize=(8, 6))
            sns.boxplot(x=dataframe[col_name])
            plt.title(f'Outliers in {col_name}')
            plt.show()
        return True
    else:
        return False

In [None]:
def replace_with_thresholds(dataframe, variable, q1=0.05, q3=0.95):
    # Pass the function's own q1/q3 through instead of hard-coded values
    low_limit, up_limit = outlier_thresholds(dataframe, variable, q1=q1, q3=q3)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

In [None]:
for col in num_cols:
    # Check each column once and reuse the result
    has_outliers = check_outlier(df, col)
    print(col, has_outliers)
    if has_outliers:
        replace_with_thresholds(df, col)

# 15. Base Model

In [None]:
dff = df.copy()

In [None]:
cat_cols = [col for col in cat_cols if col not in ["Churn"]]

In [None]:
cat_cols

In [None]:
# One-Hot-Encoding

def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe

In [None]:
dff = one_hot_encoder(dff, cat_cols)
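
As a quick illustration of the `drop_first` flag, the sketch below encodes a toy frame (made-up rows) both ways. With `drop_first=True` the first level of each encoded column is dropped, so one fewer dummy column is produced.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "tenure": [1, 12, 24],
})

full = pd.get_dummies(toy, columns=["Contract"], drop_first=False)
reduced = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```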

In [None]:
dff.head()

In [None]:
dff.shape

In [None]:
# Scaling with RobustScaler (centers on the median and scales by the IQR)

scaler = RobustScaler()

In [None]:
dff[num_cols] = scaler.fit_transform(dff[num_cols])
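
With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range. The sketch below checks this by hand on a toy column (made-up values with one large outlier).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column for illustration only.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Manual version: (x - median) / (Q3 - Q1)
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```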

In [None]:
dff.head()

In [None]:
# Creating the Dependent Variable.

y = dff["Churn"]

In [None]:
# Creating Independent Variables.

X = dff.drop(["Churn","customerID"], axis=1)

In [None]:
models = [('LR', LogisticRegression(random_state=12345)),
          ('KNN', KNeighborsClassifier()),
          ('CART', DecisionTreeClassifier(random_state=12345)),
          ('RF', RandomForestClassifier(random_state=12345)),
          ('XGB', XGBClassifier(random_state=12345)),
          ("LightGBM", LGBMClassifier(random_state=12345)),
          ("CatBoost", CatBoostClassifier(verbose=False, random_state=12345))]

In [None]:
base_models_metrics = []

In [None]:
import pickle

for name, model in models:
    cv_results = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1", "roc_auc", "precision", "recall"])
    
    accuracy = round(cv_results['test_accuracy'].mean(), 4)
    auc = round(cv_results['test_roc_auc'].mean(), 4)
    recall = round(cv_results['test_recall'].mean(), 4)
    precision = round(cv_results['test_precision'].mean(), 4)
    f1 = round(cv_results['test_f1'].mean(), 4)
    
    base_models_metrics.append({
        "Model": name,
        "Accuracy": accuracy,
        "AUC": auc,
        "Recall": recall,
        "Precision": precision,
        "F1": f1
    })

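    # cross_validate fits clones of the estimator, so fit on the full data before saving
    model.fit(X, y)

    # Save the fitted model as a pickle file
    with open(f"/kaggle/working/{name}_model.pkl", "wb") as model_file:
        pickle.dump(model, model_file)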
    
    print(f"########## {name} ##########")
    print(f"Accuracy: {accuracy}")
    print(f"AUC: {auc}")
    print(f"Recall: {recall}")
    print(f"Precision: {precision}")
    print(f"F1: {f1}")
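
`cross_validate` fits clones of the estimator on each fold, so the object passed in is not itself fitted by the call. This is worth keeping in mind before persisting a model. A minimal sketch on synthetic data (generated with `make_classification`, not the churn dataset):

```python
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.utils.validation import check_is_fitted

# Synthetic data for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000)
cv_results = cross_validate(model, X_toy, y_toy, cv=5, scoring=["accuracy", "roc_auc"])

# The original estimator is still unfitted after cross-validation.
try:
    check_is_fitted(model)
    fitted_after_cv = True
except NotFittedError:
    fitted_after_cv = False
print(fitted_after_cv)

# An explicit fit is needed to obtain a usable model.
model.fit(X_toy, y_toy)
print(model.predict(X_toy[:3]))
```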
  
