Andrew Marasco \
Customer Churn Prediction Model \
Flatiron School Capstone Project #1 \
January, 2026

## Step 0: Setup and Imports

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, RocCurveDisplay,
    precision_recall_curve, PrecisionRecallDisplay
)

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)


## Step 1: Data Acquisition/Understanding

Cloning GitHub repo into Colab Notebook

In [5]:
!git clone https://github.com/andrewmarasco/Capstone_Project__Customer_Churn_Binary_Classification.git


Cloning into 'Capstone_Project__Customer_Churn_Binary_Classification'...
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 5 (delta 0), reused 5 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (5/5), 334.48 KiB | 1.37 MiB/s, done.


In [6]:
!ls

Capstone_Project__Customer_Churn_Binary_Classification	sample_data


In [7]:
!ls Capstone_Project__Customer_Churn_Binary_Classification

archive.zip  Telco_Cusomer_Churn.csv


Loading Telco Customer Dataset into Colab

In [12]:
import pandas as pd

DATA_PATH = (
    "Capstone_Project__Customer_Churn_Binary_Classification/"
    "Telco_Cusomer_Churn.csv"
)

df = pd.read_csv(DATA_PATH)

df.shape, df.head()

((7043, 21),
    customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
 0  7590-VHVEG  Female              0     Yes         No       1           No   
 1  5575-GNVDE    Male              0      No         No      34          Yes   
 2  3668-QPYBK    Male              0      No         No       2          Yes   
 3  7795-CFOCW    Male              0      No         No      45           No   
 4  9237-HQITU  Female              0      No         No       2          Yes   
 
       MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
 0  No phone service             DSL             No  ...               No   
 1                No             DSL            Yes  ...              Yes   
 2                No             DSL            Yes  ...               No   
 3  No phone service             DSL            Yes  ...              Yes   
 4                No     Fiber optic             No  ...               No   
 
   TechSupport StreamingTV Streamin

Checking Dataset Schema

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


Checking Target Distribution

In [14]:
df["Churn"].value_counts()
df["Churn"].value_counts(normalize=True)


Churn,proportion
No,0.73463
Yes,0.26537
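Only the last expression in a notebook cell is rendered, so the raw counts from the first `value_counts()` call are not displayed above. A minimal sketch of one way to see counts and proportions side by side follows. The `churn` Series is a toy stand-in with made-up values, and `target_summary` is a hypothetical name, not something defined in the notebook.

```python
import pandas as pd

# Toy stand-in for df["Churn"]; these four values are illustrative only.
churn = pd.Series(["No", "No", "Yes", "No"], name="Churn")

# Both Series are indexed by class label, so the DataFrame constructor
# aligns them into one table with a row per class.
target_summary = pd.DataFrame({
    "count": churn.value_counts(),
    "proportion": churn.value_counts(normalize=True),
})

print(target_summary)
```

On the real data the same pattern would be applied to `df["Churn"]` directly.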


Data Quality Scan

In [15]:
df.isna().sum().sort_values(ascending=False)

Column,Missing count
customerID,0
gender,0
SeniorCitizen,0
Partner,0
Dependents,0
tenure,0
PhoneService,0
MultipleLines,0
InternetService,0
OnlineSecurity,0
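Note that `isna()` only counts true missing values; an empty or whitespace-only string in a text column is reported as non-null. The sketch below shows one possible complementary check. `count_blank_strings` is a hypothetical helper and `toy` is made-up data, neither comes from the notebook.

```python
import pandas as pd

# Made-up frame: one text column contains a whitespace-only entry.
toy = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "108.15"],
    "Contract": ["One year", "Two year", "Month-to-month"],
    "tenure": [1, 0, 2],
})

def count_blank_strings(frame):
    """Count empty or whitespace-only entries in each non-numeric column."""
    counts = {}
    for name, col in frame.items():
        if pd.api.types.is_numeric_dtype(col):
            continue  # numeric columns cannot hold blank strings
        counts[name] = int(col.astype(str).str.strip().eq("").sum())
    return pd.Series(counts)

blank_counts = count_blank_strings(toy)
print(blank_counts)
```

Running a check like this next to `df.isna().sum()` can surface columns that look complete but hold unusable text values.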


## Step 2: Data Cleaning

Diagnosing the 'TotalCharges' Issue (stored as object dtype instead of numeric)

In [16]:
df["TotalCharges"].head(20)

index,TotalCharges
0,29.85
1,1889.5
2,108.15
3,1840.75
4,151.65
5,820.5
6,1949.4
7,301.9
8,3046.05
9,3487.95


In [17]:
df["TotalCharges"].value_counts().head(10)

TotalCharges,count
(blank),11
20.2,11
19.75,9
20.05,8
19.9,8
19.65,8
19.55,7
45.3,7
20.15,6
20.25,6


In [18]:
# Converting TotalCharges to numeric, coercing blanks to NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Checking how many NaNs we now have
df["TotalCharges"].isna().sum()


np.int64(11)
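To make the effect of `errors="coerce"` concrete, here is a small sketch on a made-up Series (`raw` and `converted` are illustrative names, and the values are not taken from the dataset beyond their general shape). Entries that cannot be parsed as numbers become `NaN`, and the result is a float column.

```python
import pandas as pd

# Made-up strings: two parseable numbers and one blank entry.
raw = pd.Series(["29.85", " ", "1889.5"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

print(converted)
print("NaN count:", converted.isna().sum())
```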

In [19]:
df.loc[df["TotalCharges"].isna(), ["tenure", "MonthlyCharges", "TotalCharges"]]


index,tenure,MonthlyCharges,TotalCharges
488,0,52.55,NaN
753,0,20.25,NaN
936,0,80.85,NaN
1082,0,25.75,NaN
1340,0,56.05,NaN
3331,0,19.85,NaN
3826,0,25.35,NaN
4380,0,20.0,NaN
5218,0,19.7,NaN
6670,0,73.35,NaN
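The rows shown above all have `tenure` equal to 0, which suggests the missing totals belong to customers who have not yet been billed. The following is a sketch of one possible way to verify that pattern and, if it holds, fill the gaps with 0. This is an assumption about a reasonable treatment, not necessarily what the notebook does (it may instead rely on the `SimpleImputer` imported in Step 0). The `toy` frame is made-up data.

```python
import numpy as np
import pandas as pd

# Made-up frame mimicking the pattern: NaN totals only where tenure is 0.
toy = pd.DataFrame({
    "tenure": [0, 5, 0],
    "MonthlyCharges": [52.55, 20.0, 80.85],
    "TotalCharges": [np.nan, 100.0, np.nan],
})

missing = toy["TotalCharges"].isna()

# Check the assumption before acting on it.
all_new_customers = bool((toy.loc[missing, "tenure"] == 0).all())

filled = toy.copy()
if all_new_customers:
    # Assumed rule: a customer with zero tenure has accrued no charges.
    filled.loc[missing, "TotalCharges"] = 0.0

print(filled)
```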


Dropping the Identifier Column (customerID)

In [20]:
df = df.drop(columns=["customerID"])

Encoding Target Cleanly

In [21]:
df["Churn"].value_counts()

Churn,count
No,5174
Yes,1869


In [24]:
y = (df["Churn"] == "Yes").astype(int)
X = df.drop(columns=["Churn"])

y.mean()

np.float64(0.2653698707936959)

Splitting Data Before EDA to Avoid Peeking at the Test Set

In [25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=RANDOM_STATE  # seed defined in Step 0
)

X_train.shape, X_test.shape, y_train.mean(), y_test.mean()

((5634, 19),
 (1409, 19),
 np.float64(0.2653532126375577),
 np.float64(0.2654364797728886))
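The near-identical class means in the train and test sets come from `stratify=y`, which keeps the class proportions roughly equal across the two splits. A minimal sketch on synthetic data illustrates the effect. `X_toy`, `y_toy`, and the 26.5% positive rate are assumptions chosen to resemble the target balance above, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 265 positives (about 26.5%).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 265 + [0] * 735)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,
    stratify=y_toy,  # preserve the class ratio in both splits
    random_state=0,
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Without `stratify`, the two rates can drift apart by chance, which matters more as the minority class gets smaller.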

## Step 3: Exploratory Data Analysis

Churn Rate By Key Categorical Features

In [None]:
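def churn_rate_by_feature(X, y, feature):
    """Return the mean churn rate per category of `feature`, highest first."""
    temp = X[[feature]].copy()
    # .values attaches the target by position, independent of the index
    temp["churn"] = y.values
    return (
        temp.groupby(feature)["churn"]
        .mean()
        .sort_values(ascending=False)
    )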
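As a usage sketch, the helper can be exercised on a tiny made-up frame. The function body is repeated here so the example runs on its own. `X_toy`, `y_toy`, and the category labels are illustrative assumptions; on the real data the call would presumably use `X_train` and `y_train` with a column such as `Contract`.

```python
import pandas as pd

def churn_rate_by_feature(X, y, feature):
    """Mean churn rate per category of `feature`, highest first."""
    temp = X[[feature]].copy()
    temp["churn"] = y.values
    return temp.groupby(feature)["churn"].mean().sort_values(ascending=False)

# Made-up data: four customers, one of whom churned.
X_toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
})
y_toy = pd.Series([1, 0, 0, 0])

rates = churn_rate_by_feature(X_toy, y_toy, "Contract")
print(rates)
```

Because the target is a 0/1 indicator, the group mean is the share of churners in each category.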