# 🧩 1. Introduction

## 📌 1.1 Problem Definition

Customer churn is a major challenge for subscription-based businesses, especially in the telecommunications industry.  
Churn occurs when customers discontinue their service, which leads to significant long-term revenue loss.

The goal of this project is to develop a **machine learning model to predict customer churn**.  
Early identification of high-risk customers helps companies:

- 🔹 Reduce churn through targeted retention actions  
- 🔹 Improve customer satisfaction  
- 🔹 Personalize marketing and retention strategies  
- 🔹 Optimize operational and marketing costs  

Predictive churn modeling is therefore a key component of data-driven customer retention.

---

## 📊 1.2 Dataset Overview

This dataset contains customer information from a fictional telecommunications company based in California.  
It includes customer demographics, service subscriptions, billing patterns, contract types, and churn status.

---

## 📁 1.3 Dataset Summary

- **📌 7,043 customers**  
- **📌 21 features**  
- **🎯 Target variable:** Churn (Yes/No)

# 📚 Dataset Story

The **Telco Customer Churn** dataset represents information from a fictitious telecommunications company that provides home phone and internet services to **7,043 customers** living in California.  
The data covers customer activity during the **third quarter**, including which customers:

- 🔹 Stayed with the company  
- 🔹 Left the service (churned)  
- 🔹 Or signed up for service  

The dataset contains **21 variables** and **7,043 unique customer records**, offering a comprehensive view of customer demographics, services used, account details, and churn behavior.

---

## 🧾 Variable Description (Modified & Clarified)

Each variable is explained below, grouped by theme:

### 🔑 Customer Information
- **customerID**: Unique identifier assigned to each customer
- **gender**: Customer gender
- **SeniorCitizen**: Indicates whether the customer is a senior citizen (1 = Yes, 0 = No)
- **Partner**: Whether the customer has a partner (Yes/No)
- **Dependents**: Whether the customer has dependents such as children or elderly family members (Yes/No)

### 📅 Customer Lifecycle
- **tenure**: Number of months the customer has stayed with the company

### 📞 Phone & Internet Services
- **PhoneService**: Whether the customer has phone service (Yes/No)
- **MultipleLines**: Whether the customer has multiple phone lines (Yes/No/No phone service)
- **InternetService**: Type of internet service (DSL, Fiber optic, No)

### 🔐 Security & Support Services
- **OnlineSecurity**: Online security add-on (Yes/No/No internet service)
- **OnlineBackup**: Cloud backup service (Yes/No/No internet service)
- **DeviceProtection**: Device protection plan (Yes/No/No internet service)
- **TechSupport**: Technical support add-on (Yes/No/No internet service)

### 🎬 Entertainment Services
- **StreamingTV**: Streaming TV service usage (Yes/No/No internet service)
- **StreamingMovies**: Streaming movie service usage (Yes/No/No internet service)

### 📄 Contract & Billing Details
- **Contract**: Contract term (Month-to-month, One year, Two year)
- **PaperlessBilling**: Whether billing is paperless (Yes/No)
- **PaymentMethod**: Customer's payment type
  - Electronic check
  - Mailed check
  - Bank transfer (automatic)
  - Credit card (automatic)

### 💵 Financial Information
- **MonthlyCharges**: Monthly amount billed to the customer
- **TotalCharges**: Total amount billed over the entire tenure

### 🎯 Target Variable
- **Churn**: Indicates whether the customer left the company (Yes/No)

---

## 🧩 What Each Row Represents

Each row describes a **single customer** and carries information from three main categories:

### 1️⃣ **Services Subscribed**
Phone, multiple lines, internet, online security, online backup, device protection, tech support, TV streaming, and movie streaming.

### 2️⃣ **Account Information**
Contract duration, tenure, monthly charges, total charges, billing preference, and payment method.

### 3️⃣ **Demographics**
Gender, senior status, partner status, and dependents.

---

# 🔧 1. Import Required Libraries

In [None]:
# Install missingno first (notebook shell command)
!pip install missingno

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, roc_auc_score, classification_report,
    confusion_matrix, RocCurveDisplay, ConfusionMatrixDisplay
)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, StandardScaler, RobustScaler
from xgboost import XGBClassifier

# Show floats with two decimal places in pandas output
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# 📥 2. Loading the Dataset

In [None]:
df = pd.read_csv("/kaggle/input/telco-customer-churn/Telco-Customer-Churn.csv")
df.head()

# 🔍 3. Exploratory Data Analysis (EDA)


In [None]:
def data_overview(df):
    """
    Quick overview of a pandas DataFrame.
    Prints:
    - Descriptive statistics
    - Missing values
    - Data information
    - Dataset shape
    """

    print("### Descriptive Statistics ###\n")
    print(df.describe().T)
    print("--" * 50)

    print("\n### Missing Values ###\n")
    print(df.isnull().sum())
    print("--" * 10)

    print("\n### Data Information ###\n")
    df.info()
    print("--" * 10)

    print("\n### Dataset Shape ###\n")
    print(df.shape)

data_overview(df)

In [None]:
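# TotalCharges is read as text; list the rows where it is blank
print(df[df['TotalCharges'].str.strip() == ""])

# Convert to numeric; values that cannot be parsed (the blanks) become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')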
df.info()
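
The effect of `errors='coerce'` is easiest to see on a tiny example. The sketch below uses a made-up Series (`raw_total` is a hypothetical stand-in, not the real column) to show that blank or whitespace-only strings end up as `NaN` while valid numbers are parsed normally.

```python
import pandas as pd

# Toy stand-in for the raw TotalCharges column (values are made up)
raw_total = pd.Series(["29.85", " ", "1889.5", ""])

# Strings that cannot be parsed as numbers are turned into NaN
parsed_total = pd.to_numeric(raw_total, errors="coerce")

print(parsed_total)
print("Missing after conversion:", parsed_total.isna().sum())
```

Those `NaN` values are what the missing-value steps later in the notebook have to deal with.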

In [None]:
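# Check the target for blank entries, then encode it as 1 (Yes) / 0 (No)
print(df[df['Churn'].str.strip() == ""])
df["Churn"] = df["Churn"].apply(lambda x : 1 if x == "Yes" else 0)
# Treat the 0/1 SeniorCitizen flag as a categorical variable
df["SeniorCitizen"] = df["SeniorCitizen"].astype("object")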


# 🔢 4. Classifying Variables: Numerical vs Categorical

In [None]:
def grab_col_names(dataframe, cat_th=10, car_th=20):
    # Categorical columns
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]

    # Numeric but categorical
    num_but_cat = [col for col in dataframe.columns
                   if dataframe[col].nunique() < cat_th and dataframe[col].dtypes != "O"]

    # Categorical but cardinal
    cat_but_car = [col for col in dataframe.columns
                   if dataframe[col].nunique() > car_th and dataframe[col].dtypes == "O"]

    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    # Numeric columns
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f"cat_cols: {len(cat_cols)}")
    print(f"num_cols: {len(num_cols)}")
    print(f"cat_but_car: {len(cat_but_car)}")
    print(f"num_but_cat: {len(num_but_cat)}")

    return cat_cols, cat_but_car, num_cols

cat_cols, cat_but_car, num_cols = grab_col_names(df)
print("Cat_cols : ", cat_cols)
print("Cat_but_car : ", cat_but_car)
print("Num_cols : ", num_cols)

# 🔠 5. Analysis of Categorical Variables

In [None]:
def cat_summary(dataframe, col_name, plot=False):
    print(pd.DataFrame({col_name: dataframe[col_name].value_counts(),
                        "Ratio": 100 * dataframe[col_name].value_counts() / len(dataframe)}))
    print("##########################################")
    if plot:
        sns.countplot(x=dataframe[col_name], data=dataframe)
        plt.show(block=True)
for col in cat_cols:
    cat_summary(df , col , True)

# 🧮 6. Understanding Numerical Features

In [None]:
def num_summary(dataframe, numerical_col, plot=False):
    quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
    print(dataframe[numerical_col].describe(quantiles).T)

    if plot:
        dataframe[numerical_col].hist(bins=20)
        plt.xlabel(numerical_col)
        plt.title(numerical_col)
        plt.show(block=True)

for col in num_cols:
    num_summary(df , col , True)

# 📌 7. Analysis of Categorical Variables by Target

In [None]:
def target_summary_with_cat(dataframe, target, categorical_col):
    print(categorical_col)
    print(pd.DataFrame({"TARGET_MEAN": dataframe.groupby(categorical_col)[target].mean(),
                        "Count": dataframe[categorical_col].value_counts(),
                        "Ratio": 100 * dataframe[categorical_col].value_counts() / len(dataframe)}), end="\n\n\n")

for col in cat_cols:
    target_summary_with_cat(df, "Churn", col)
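
Because `Churn` is encoded as 0/1, the mean of the target within a category is simply that category's churn rate. A minimal sketch on made-up rows (the `toy` frame is hypothetical) illustrates what `TARGET_MEAN` reports:

```python
import pandas as pd

# Made-up rows: three month-to-month customers, two on a two-year contract
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Two year",
                 "Two year", "Month-to-month"],
    "Churn": [1, 0, 0, 0, 1],
})

# Mean of a 0/1 target per group = share of churned customers in that group
churn_rate = toy.groupby("Contract")["Churn"].mean()
print(churn_rate)
```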



# 📌 8. Analysis of Numerical Variables by Target

In [None]:
def target_summary_with_num(dataframe, target, numerical_col):
    print(dataframe.groupby(target).agg({numerical_col: "mean"}), end="\n\n\n")
    
for col in num_cols:
    target_summary_with_num(df, "Churn", col)

# 📈 9. Correlation Heatmap & Analysis

In [None]:
corr = df.select_dtypes(include=["number"]).corr()
corr

In [None]:
def high_correlated_cols(dataframe, plot=False, corr_th=0.70):
    # 1) Keep only the numeric columns
    corr = dataframe.select_dtypes(include=["number"]).corr()
    
    # 2) Absolute correlation matrix
    cor_matrix = corr.abs()
    
    # 3) Upper triangle only (bool is used instead of np.bool)
    upper_triangle_matrix = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
    
    # 4) Collect the columns whose correlation exceeds the threshold
    drop_list = [col for col in upper_triangle_matrix.columns if any(upper_triangle_matrix[col] > corr_th)]
    
    # 5) Optional heatmap (seaborn and matplotlib are imported at the top of the notebook)
    if plot:
        sns.set(rc={"figure.figsize": (12, 12)})
        corr_values = corr.round(2)
        sns.heatmap(corr, cmap="RdBu", annot=corr_values)
        plt.show()
    
    return drop_list


high_correlated_cols(df, plot=True)
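
The upper-triangle trick keeps each pair of columns once and ignores the diagonal, so only one column of a highly correlated pair is flagged. The sketch below reproduces the idea on a made-up three-column frame (`toy` and its values are assumptions chosen for illustration).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],  # almost a multiple of "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to "a" and "b"
})

abs_corr = toy.corr().abs()

# True strictly above the diagonal, False elsewhere
mask = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
upper = abs_corr.where(mask)

# A column is flagged if it correlates above 0.70 with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.70).any()]
print(to_drop)
```

Only `"b"` is flagged here; `"a"` stays because its own column in the upper triangle is empty.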


# 🛠️ 12. Feature Engineering

In this section, we will apply several feature engineering steps to enhance the quality and predictive power of the dataset.  
These steps help the model better understand patterns and relationships within the data.

---
## 🚨 12.1 Outlier Detection  
Outliers can negatively affect model performance.  
We will detect and treat outliers in numerical variables using appropriate statistical methods.


---

## 🔍 12.2 Missing Values Detection  
We identify and handle missing values to prevent biases and errors during model training.

---

## 🧪 12.3 Feature Extraction  
New features will be created from existing variables to strengthen the model's learning capability.  
This may include transformations, ratios, categorization, or domain-driven feature creation.


# 🚨 12.1 Outlier Detection

In [None]:
def outlier_thresholds(dataframe, col_name, q1=0.05, q3=0.95):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit


def check_outlier(dataframe, col_name, plot=False):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    outliers = dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)]
    if not outliers.empty:
        if plot:
            plt.figure(figsize=(8, 6))
            sns.boxplot(x=dataframe[col_name])
            plt.title(f'Outliers in {col_name}')
            plt.show()
        return True
    else:
        return False


def replace_with_thresholds(dataframe, variable, q1=0.05, q3=0.95):
    # Forward the caller's quantiles instead of hard-coding them
    low_limit, up_limit = outlier_thresholds(dataframe, variable, q1=q1, q3=q3)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

In [None]:
for col in num_cols:
    has_outlier = check_outlier(df, col)
    print(col, has_outlier)
    if has_outlier:
        replace_with_thresholds(df, col)
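
To make the thresholds concrete, here is a small sketch on made-up numbers using the same 5%/95% quantiles and the 1.5 multiplier. It uses `Series.clip` as a compact equivalent of the two `.loc` assignments; the `values` Series is purely illustrative.

```python
import pandas as pd

# 1..100 plus one extreme value (made-up data)
values = pd.Series(list(range(1, 101)) + [1000], dtype="float64")

q_low, q_high = values.quantile(0.05), values.quantile(0.95)
spread = q_high - q_low

low_limit = q_low - 1.5 * spread
up_limit = q_high + 1.5 * spread

# Values outside [low_limit, up_limit] are pulled back to the nearest limit
capped = values.clip(lower=low_limit, upper=up_limit)

print(low_limit, up_limit)
print(capped.max())
```

With limits this wide, only the extreme value of 1000 is capped; the ordinary values pass through unchanged.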

# 🔍 12.2 Missing Values Detection

In [None]:
df.isnull().sum()

In [None]:
def missing_values_table(dataframe, na_name=False):
    na_columns = [col for col in dataframe.columns if dataframe[col].isnull().sum() > 0]

    n_miss = dataframe[na_columns].isnull().sum().sort_values(ascending=False)
    ratio = (dataframe[na_columns].isnull().sum() / dataframe.shape[0] * 100).sort_values(ascending=False)
    missing_df = pd.concat([n_miss, np.round(ratio, 2)], axis=1, keys=['n_miss', 'ratio'])
    print(missing_df, end="\n")

    if na_name:
        return na_columns

na_columns = missing_values_table(df, True )

In [None]:
def missing_vs_target(dataframe, target, na_columns):
    temp_df = dataframe.copy()

    # Create a missing-value flag for each column that has NaNs
    for col in na_columns:
        temp_df[col + "_NA_FLAG"] = np.where(temp_df[col].isnull(), 1, 0)

    # Select only the NA_FLAG columns
    na_flags = temp_df.loc[:, temp_df.columns.str.contains("_NA_")].columns

    # Target mean and row count for each NA_FLAG value
    for col in na_flags:
        print(
            pd.DataFrame({
                "TARGET_MEAN": temp_df.groupby(col)[target].mean(),
                "Count": temp_df.groupby(col)[target].count()
            }),
            end="\n\n\n"
        )


missing_vs_target(df , "Churn" ,na_columns)

In [None]:
df["tenure_group"] = pd.cut(
    df["tenure"],
    bins=[-1, 12, 24, 36, 48, 60, 72],
    labels=["0-12", "12-24", "24-36", "36-48", "48-60", "60-72"]
)

df["TotalCharges"] = df.groupby("tenure_group")["TotalCharges"].transform(
    lambda x: x.fillna(x.mean())
)


df.drop("tenure_group", axis=1, inplace=True)
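
The imputation above fills each missing `TotalCharges` with the mean of customers in the same tenure band rather than the global mean. A minimal sketch on made-up rows (the `toy` frame and its three bins are assumptions for illustration) shows the mechanics:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "tenure": [1, 5, 10, 30, 33, 36],
    "TotalCharges": [20.0, np.nan, 40.0, 900.0, 1100.0, np.nan],
})

# Temporary tenure bands used only for the group-wise fill
toy["tenure_group"] = pd.cut(
    toy["tenure"], bins=[-1, 12, 24, 36], labels=["0-12", "12-24", "24-36"]
)

# Each NaN is replaced by the mean of its own tenure band
toy["TotalCharges"] = (
    toy.groupby("tenure_group", observed=True)["TotalCharges"]
       .transform(lambda s: s.fillna(s.mean()))
)

print(toy)
```

The short-tenure gap is filled with 30.0 and the long-tenure gap with 1000.0, which a single global mean would not distinguish.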


In [None]:
df.isnull().sum()

In [None]:
df.head()

In [None]:
df["avg_monthly_spend"] = df["TotalCharges"] / df["tenure"].replace(0, 1)
df.head()

# 🧪 12.3 Feature Extraction

In [None]:
# 1) Tenure Features
df["tenure_group"] = pd.cut(
    df["tenure"],
    bins=[-1, 12, 24, 36, 48, 60, 72],
    labels=["0-1 year", "1-2 years", "2-3 years", "3-4 years", "4-5 years", "5-6 years"]
)

df["tenure_year"] = (df["tenure"] // 12).clip(upper=6)
df["loyalty_score"] = df["tenure"] / 72

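# include_lowest=True keeps tenure 0 (score 0.0) in the first bin instead of NaN
df["loyalty_level"] = pd.cut(
    df["loyalty_score"],
    bins=[0, 0.20, 0.40, 0.60, 0.80, 1.0],
    labels=["Very Low", "Low", "Medium", "High", "Very High"],
    include_lowest=True
)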

# 2) Spending Features
df["avg_monthly_spend"] = df["TotalCharges"] / df["tenure"].replace(0, 1)
df["price_sensitivity"] = df["MonthlyCharges"] / df["MonthlyCharges"].mean()
df["expected_total_if_stayed"] = df["MonthlyCharges"] * (72 - df["tenure"])
df["charge_growth"] = df["MonthlyCharges"] - df["avg_monthly_spend"]

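# 3) Service Features
# Full list of service columns (kept for reference; only the streaming subset is used below)
service_cols = ["PhoneService", "OnlineSecurity", "OnlineBackup",
                "DeviceProtection", "TechSupport", "StreamingTV",
                "StreamingMovies"]

# Count "Yes" answers; summing the raw string columns does not give a numeric count
stream_cols = ["StreamingTV", "StreamingMovies"]
df["streaming_services"] = (df[stream_cols] == "Yes").sum(axis=1)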



# 4) Family Features
df["has_family"] = ((df["Partner"] == "Yes") | (df["Dependents"] == "Yes")).astype(int)
df["family_size"] = (df["Partner"].map({"Yes": 1, "No": 0}) +
                     df["Dependents"].map({"Yes": 1, "No": 0}))

df["single_flag"] = ((df["Partner"] == "No") & (df["Dependents"] == "No")).astype(int)
df["has_kids"] = (df["Dependents"] == "Yes").astype(int)
df["couple_flag"] = (df["Partner"] == "Yes").astype(int)

df["family_monthly_contract"] = df["has_family"] * (df["Contract"] == "Month-to-month").astype(int)



df["family_loyalty"] = df["has_family"] * df["tenure"]

# 5) Risk Features
# Service columns still hold "Yes"/"No" strings here, so compare against "No" rather than 0
df["fiber_no_support"] = (df["InternetService"] == "Fiber optic").astype(int) * (df["TechSupport"] == "No").astype(int)
df["security_risk"] = (df["OnlineSecurity"] == "No").astype(int) * (df["MonthlyCharges"] > df["MonthlyCharges"].median()).astype(int)
df["new_customer_high_charge"] = (df["tenure"] < 6).astype(int) * (df["MonthlyCharges"] > df["MonthlyCharges"].median()).astype(int)


In [None]:
df.head()
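
At this stage the service columns still contain text ("Yes", "No", "No internet service"), so any count of subscribed services has to compare against `"Yes"` first. A minimal sketch with made-up rows (the `toy` frame is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "StreamingTV": ["Yes", "No", "No internet service", "Yes"],
    "StreamingMovies": ["Yes", "Yes", "No internet service", "No"],
})

stream_cols = ["StreamingTV", "StreamingMovies"]

# The comparison yields booleans; summing across columns counts the True values per row
toy["streaming_services"] = (toy[stream_cols] == "Yes").sum(axis=1)

print(toy)
```

The same pattern would extend to any list of Yes/No service columns.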

# 🔤 13. Encoding

In [None]:
cat_cols , cat_but_car , num_cols = grab_col_names(df)
print("Cat_cols : " ,cat_cols)
print("Cat_but_car : " , cat_but_car)
print("Num_cols : " ,num_cols)

In [None]:
def label_encoder(dataframe, binary_col):
    labelencoder = LabelEncoder()
    dataframe[binary_col] = labelencoder.fit_transform(dataframe[binary_col])
    return dataframe

binary_cols = [col for col in df.columns if df[col].dtype not in [int, float]
               and df[col].nunique() == 2]

binary_cols
for col in binary_cols:
    df = label_encoder(df, col)

In [None]:
df.head()

In [None]:
def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe
cat_cols = [col for col in cat_cols if col not in binary_cols and col not in ["Churn"]]

df = one_hot_encoder(df, cat_cols)

# ⚙️ 14. Feature Scaling with StandardScaler

In [None]:
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
df.head()
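
One caveat: fitting the scaler on the full DataFrame means the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training part only. The sketch below shows that ordering on synthetic data; the arrays and the name `toy_scaler` are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features and a random binary target
rng = np.random.default_rng(1)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y_toy = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=1
)

toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)  # statistics learned from training rows only
X_te_scaled = toy_scaler.transform(X_te)      # test rows reuse the training statistics

print(X_tr_scaled.mean(axis=0), X_te_scaled.mean(axis=0))
```

The training features end up with mean 0 exactly, while the test features are only approximately centered, which is expected.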

# 🤖 15. Baseline Modeling: Logistic Regression

In [None]:
# -----------------------------
# 1) X and y
# -----------------------------
y = df["Churn"]
X = df.drop(["Churn", "customerID"], axis=1)

# -----------------------------
# 2) Train-Test Split 
# -----------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)

# -----------------------------
# 3) Logistic Regression 
# -----------------------------
log_model = LogisticRegression(max_iter=2000)
log_model.fit(X_train, y_train)

y_pred_log = log_model.predict(X_test)
y_prob_log = log_model.predict_proba(X_test)[:, 1]

log_acc = accuracy_score(y_test, y_pred_log)
log_auc = roc_auc_score(y_test, y_prob_log)

print("==== Logistic Regression ====")
print("Accuracy:", log_acc)
print("AUC:", log_auc)
print("\nClassification Report:\n", classification_report(y_test, y_pred_log))
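
A single train/test split gives one estimate of performance. `cross_val_score`, which is already imported, can give a rough idea of how much that estimate varies across folds. The sketch below runs it on synthetic data from `make_classification`; the sample size, feature count, and class ratio are arbitrary assumptions, not properties of the Telco dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, mildly imbalanced binary problem (parameters chosen arbitrarily)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, weights=[0.73, 0.27], random_state=1
)

# 5-fold cross-validated ROC AUC for a logistic regression baseline
cv_auc = cross_val_score(
    LogisticRegression(max_iter=2000), X_syn, y_syn, cv=5, scoring="roc_auc"
)

print("AUC per fold:", cv_auc)
print("Mean / std:", cv_auc.mean(), cv_auc.std())
```

In the notebook, the same call could be applied to `X_train` and `y_train` in place of the synthetic arrays.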

# 🔍 16. Visualizing the Confusion Matrix

In [None]:
# Use the logistic regression predictions computed in the previous cell
cm = confusion_matrix(y_test, y_pred_log)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap="Blues")
plt.title("Confusion Matrix - Logistic Regression")
plt.show()

# 🚀 17. Advanced Modeling with XGBoost

In [None]:
# -----------------------------
# 4) XGBoost Model
# -----------------------------

# scale_pos_weight to offset class imbalance
neg, pos = np.bincount(y_train)  # counts of class 0 and class 1
scale_pos_weight = neg / pos

xgb_model = XGBClassifier(
    n_estimators=400,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="binary:logistic",
    eval_metric="logloss",
    scale_pos_weight=scale_pos_weight,
    random_state=1,
    n_jobs=-1
)

xgb_model.fit(X_train, y_train)

y_pred_xgb = xgb_model.predict(X_test)
y_prob_xgb = xgb_model.predict_proba(X_test)[:, 1]

xgb_acc = accuracy_score(y_test, y_pred_xgb)
xgb_auc = roc_auc_score(y_test, y_prob_xgb)

print("\n\n==== XGBoost ====")
print("Accuracy:", xgb_acc)
print("AUC:", xgb_auc)
print("\nClassification Report:\n", classification_report(y_test, y_pred_xgb))
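
The `scale_pos_weight` value used above is the ratio of negative to positive training examples. A tiny sketch with a made-up label vector shows the arithmetic:

```python
import numpy as np

# Made-up labels: 6 negatives (0) and 2 positives (1)
y_toy = np.array([0, 0, 0, 1, 0, 0, 1, 0])

# np.bincount returns the count of each integer label, in label order
neg, pos = np.bincount(y_toy)
scale_pos_weight = neg / pos

print(neg, pos, scale_pos_weight)
```

With three times as many negatives as positives, the ratio is 3.0.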


In [None]:
cm_xgb = confusion_matrix(y_test, y_pred_xgb)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_xgb)
disp.plot(cmap="Blues")
plt.title("Confusion Matrix - XGBoost")
plt.show()

# 📉 18. ROC Curve Comparison


In [None]:
fig, ax = plt.subplots(figsize=(6, 6))

# Draw both curves on the same axes; the legend adds each model's AUC automatically
RocCurveDisplay.from_estimator(
    log_model, X_test, y_test, name="Logistic Regression", ax=ax
)
RocCurveDisplay.from_estimator(
    xgb_model, X_test, y_test, name="XGBoost", ax=ax
)

ax.plot([0, 1], [0, 1], "--", color="gray")
ax.set_title("ROC Curve Comparison")
plt.show()
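
The AUC shown in the legend is the area under the curve traced by (false positive rate, true positive rate) pairs as the decision threshold moves. A minimal sketch on four made-up scores shows that `roc_curve` followed by `auc` gives the same number as `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve, one per distinct threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Trapezoidal area under those points
area = auc(fpr, tpr)

print(area, roc_auc_score(y_true, y_score))
```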