# Customer Churn Prediction

Vincent Luong

## Introduction: Predicting Customer Churn in Subscription-Based Services

In recent years, monthly subscription-based services have surged in popularity, spanning industries such as streaming (e.g., Netflix, Spotify), SaaS (Software as a Service), fitness (e.g., Peloton), and e-commerce (e.g., Amazon Prime). These models offer convenience and consistent revenue streams, making them an attractive business strategy. However, they also introduce a critical metric to monitor: **customer churn**.

**Customer Churn** refers to the percentage of customers who cancel or stop renewing their subscriptions during a given time period. High churn rates can significantly impact revenue and long-term growth, especially in competitive markets where acquiring new customers is often more expensive than retaining existing ones.
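
As a minimal sketch of the definition above, churn for a period can be computed as the customers lost divided by the customers at the start of that period. The function name and the figures below are illustrative assumptions, not values from the dataset.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return churn for one period as a percentage of the starting customer base."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * customers_lost / customers_at_start


# Hypothetical month: 2,000 subscribers at the start, 90 of whom cancel.
monthly_churn = churn_rate(2000, 90)
print(f"{monthly_churn:.1f}%")  # 4.5%
```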

To mitigate churn and retain consumers, we can implement strategies such as:
- Personalized offers and retention campaigns
- Improving onboarding and customer support
- Monitoring engagement metrics to intervene before customers churn

To stay ahead of potential losses, we can utilize data-driven churn prediction models that help us act proactively rather than reactively. These models analyze customer behavior and identify individuals who are at a high risk of leaving, allowing companies to intervene before it's too late. Common machine learning approaches used for predicting churn include:
- **Logistic Regression**: A simple and interpretable baseline model for binary classification
- **Decision Trees and Random Forests**: Useful for capturing nonlinear patterns and feature importance
- **Gradient Boosting Machines**: Models such as XGBoost and LightGBM are powerful ensemble methods with strong predictive performance
- **Neural Networks**: Applied for complex, high-dimensional data scenarios
- **Survival Analysis**: Useful for modeling when a customer will churn, not just if

In this project, we aim to develop a machine learning model to predict whether a customer is likely to churn based on historical subscription and behavioral data introduced below. This prediction can empower businesses to make informed decisions that reduce churn and enhance customer lifetime value.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Data

The [Telco Customer Churn dataset 11.1.3+](https://www.kaggle.com/datasets/blastchar/telco-customer-churn/data), originally provided by IBM and hosted on Kaggle, offers detailed information about a telecommunications company's customers and their subscription behavior. It contains 7,043 customer records with 33 features, covering demographics, account details, and service usage patterns. The dataset has been updated to include more comprehensive information, providing deeper insights into customer behavior and churn factors.

Features Include:

🔑 Identifiers & Location
1. `CustomerID`: A unique ID that identifies each customer
2. `Count`: A value used in reporting/dashboarding to sum up the number of customers in a filtered set
3. `Country`: The country of the customer's primary residence
4. `State`: The state of the customer's primary residence
5. `City`: The city of the customer's primary residence
6. `Zip Code`: The zip code of the customer's primary residence
7. `Lat Long`: The combined latitude and longitude of the customer's primary residence
8. `Latitude`: The latitude of the customer's primary residence
9. `Longitude`: The longitude of the customer's primary residence
    
👤 Demographics
1. `Age`: A numerical feature indicating the age of a customer
2. `Married`: A binary feature indicating whether the customer is married (Yes or No)
3. `Dependents`: Whether the customer has dependents or not (Yes or No)
4. `Number of Dependents`: A numerical feature indicating the number of dependents a customer has
5. `Referred a Friend`: A binary feature indicating whether a customer has referred a friend (Yes or No)
6. `Number of Referrals`: A numerical feature indicating the number of referrals a customer has given
   
⏳ Customer Tenure
1. `Tenure Months`: Indicates the total number of months that the customer has been with the company by the end of the quarter specified above

📞 Services Subscribed
1. `Phone Service`: Indicates if the customer subscribes to home phone service with the company (Yes, No)
2. `Multiple Lines`: Indicates if the customer subscribes to multiple telephone lines with the company (Yes, No)
3. `Internet Service`: Indicates if the customer subscribes to Internet service with the company (No, DSL, Fiber Optic, Cable)
4. `Online Security`: Indicates if the customer subscribes to an additional online security service provided by the company (Yes, No)
5. `Online Backup`: Indicates if the customer subscribes to an additional online backup service provided by the company (Yes, No)
6. `Device Protection`: Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company (Yes, No)
7. `Tech Support`: Indicates if the customer subscribes to an additional technical support plan from the company with reduced wait times (Yes, No)
8. `Streaming TV`: Indicates if the customer uses their Internet service to stream television programming from a third-party provider (Yes, No). The company does not charge an additional fee for this service.
9. `Streaming Movies`: Indicates if the customer uses their Internet service to stream movies from a third-party provider (Yes, No). The company does not charge an additional fee for this service.

💳 Billing & Payment
1. `Contract`: Indicates the customer's current contract type (Month-to-Month, One Year, Two Year)
2. `Paperless Billing`: Indicates if the customer has chosen paperless billing (Yes, No)
3. `Payment Method`: Indicates how the customer pays their bill (Bank Withdrawal, Credit Card, Mailed Check)
4. `Monthly Charges`: Indicates the customer's current total monthly charge for all their services from the company.
5. `Total Charges`: Indicates the customer's total charges, calculated to the end of the quarter specified above.

🔍 Churn-Related Information
1. `Churn Label`: Yes = the customer left the company this quarter. No = the customer remained with the company. Directly related to Churn Value.
2. `Churn Value`: 1 = the customer left the company this quarter. 0 = the customer remained with the company. Directly related to Churn Label.
3. `Churn Score`: A value from 0-100 that is calculated using the predictive tool IBM SPSS Modeler. The model incorporates multiple factors known to cause churn. The higher the score, the more likely the customer will churn.
4. `CLTV`: Customer Lifetime Value. A predicted CLTV is calculated using corporate formulas and existing data. The higher the value, the more valuable the customer. High value customers should be monitored for churn.
5. `Churn Reason`: A customer's specific reason for leaving the company. Directly related to Churn Category.

In [3]:
df_telco = pd.read_excel('data/Telco_customer_churn.xlsx').drop(columns=['Count','Country','State', 'Zip Code','Lat Long','Latitude','Longitude','Payment Method','Churn Label'])
df_demo = pd.read_excel('data/Telco_customer_churn_demographics.xlsx').drop(columns=['Gender','Senior Citizen','Dependents','Under 30','Count']).rename(columns={'Customer ID':'CustomerID'})
df_services = pd.read_excel('data/Telco_customer_churn_services.xlsx').drop(columns=['Count','Quarter','Tenure in Months','Phone Service','Multiple Lines','Internet Type','Internet Service','Online Security','Online Backup','Device Protection Plan','Streaming TV','Streaming Movies','Contract','Paperless Billing','Monthly Charge','Total Charges']).rename(columns={'Customer ID':'CustomerID'})
df_churn = pd.read_excel('data/Telco_customer_churn_status.xlsx').drop(columns=['Count','Quarter','Churn Label','Churn Value','Churn Reason','Churn Score','CLTV']).rename(columns={'Customer ID':'CustomerID'})
df = pd.merge(left=df_telco,right=df_demo, on='CustomerID').merge(right=df_services, on='CustomerID').merge(right=df_churn,on='CustomerID').rename(columns={'Churn Value':'Churn'}).drop(columns=['CustomerID','City','Tech Support'])
demo_graph_var = ['Gender','Age','Senior Citizen', 'Married', 'Dependents','Number of Dependents']
services_var = ['Referred a Friend','Number of Referrals','Tenure Months','Offer','Phone Service','Avg Monthly Long Distance Charges','Multiple Lines','Internet Service','Avg Monthly GB Download','Online Security','Online Backup','Device Protection','Premium Tech Support','Streaming TV','Streaming Movies','Streaming Music','Unlimited Data','Contract','Paperless Billing','Payment Method','Monthly Charges','Total Charges','Total Refunds','Total Extra Data Charges','Total Long Distance Charges','Total Revenue']
con_dat = df[demo_graph_var+services_var+['Churn Category']+['Churn']]

con_dat.head()

| | Gender | Age | Senior Citizen | Married | Dependents | Number of Dependents | Referred a Friend | Number of Referrals | Tenure Months | Offer | ... | Paperless Billing | Payment Method | Monthly Charges | Total Charges | Total Refunds | Total Extra Data Charges | Total Long Distance Charges | Total Revenue | Churn Category | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 37 | No | No | No | 0 | No | 0 | 2 | | ... | Yes | Credit Card | 53.85 | 108.15 | 0.0 | 0 | 20.94 | 129.09 | Competitor | 1 |
| 1 | Female | 19 | No | No | Yes | 2 | No | 0 | 2 | | ... | Yes | Bank Withdrawal | 70.7 | 151.65 | 0.0 | 0 | 18.24 | 169.89 | Other | 1 |
| 2 | Female | 31 | No | No | Yes | 2 | No | 0 | 8 | | ... | Yes | Bank Withdrawal | 99.65 | 820.5 | 0.0 | 0 | 97.2 | 917.7 | Other | 1 |
| 3 | Female | 23 | No | Yes | Yes | 3 | No | 0 | 28 | Offer C | ... | Yes | Bank Withdrawal | 104.8 | 3046.05 | 0.0 | 0 | 136.92 | 3182.97 | Other | 1 |
| 4 | Male | 38 | No | No | Yes | 1 | No | 0 | 49 | | ... | Yes | Bank Withdrawal | 103.7 | 5036.3 | 0.0 | 0 | 2172.17 | 7208.47 | Competitor | 1 |


In [22]:
con_dat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 34 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Gender                             7043 non-null   object 
 1   Age                                7043 non-null   int64  
 2   Senior Citizen                     7043 non-null   object 
 3   Married                            7043 non-null   object 
 4   Dependents                         7043 non-null   object 
 5   Number of Dependents               7043 non-null   int64  
 6   Referred a Friend                  7043 non-null   object 
 7   Number of Referrals                7043 non-null   int64  
 8   Tenure Months                      7043 non-null   int64  
 9   Offer                              7043 non-null   object 
 10  Phone Service                      7043 non-null   object 
 11  Avg Monthly Long Distance Charges  7043 non-null   float

In [21]:
df_services.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 14 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   CustomerID                         7043 non-null   object 
 1   Referred a Friend                  7043 non-null   object 
 2   Number of Referrals                7043 non-null   int64  
 3   Offer                              7043 non-null   object 
 4   Avg Monthly Long Distance Charges  7043 non-null   float64
 5   Avg Monthly GB Download            7043 non-null   int64  
 6   Premium Tech Support               7043 non-null   object 
 7   Streaming Music                    7043 non-null   object 
 8   Unlimited Data                     7043 non-null   object 
 9   Payment Method                     7043 non-null   object 
 10  Total Refunds                      7043 non-null   float64
 11  Total Extra Data Charges           7043 non-null   int64

From the dataset above, we observe that the columns `Churn Label` and `Churn Value` have a one-to-one relationship: they encode the same outcome, once as text and once as a number. Keeping `Churn Label` alongside the target would be redundant, and using it as a feature would leak the target and artificially inflate model performance. To avoid this, we remove `Churn Label` and use only `Churn Value` as our target variable during the train/test split.
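
One way to confirm such a one-to-one relationship is to check that each label maps to exactly one numeric value. The sketch below uses a small made-up frame rather than the real data, so the values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the two churn columns; values are made up for illustration.
toy = pd.DataFrame({
    "Churn Label": ["Yes", "No", "No", "Yes", "No"],
    "Churn Value": [1, 0, 0, 1, 0],
})

# If the columns are redundant, each label corresponds to exactly one numeric value.
values_per_label = toy.groupby("Churn Label")["Churn Value"].nunique()
is_one_to_one = bool((values_per_label == 1).all())

# Equivalent direct check: re-encode the label and compare with the numeric column.
matches = bool((toy["Churn Label"].map({"Yes": 1, "No": 0}) == toy["Churn Value"]).all())
print(is_one_to_one, matches)  # True True
```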

## Exploratory Data Analysis

**Exploratory Data Analysis** (EDA) is the process of investigating and summarizing the key characteristics of a dataset before applying any modeling techniques. It helps uncover patterns, spot anomalies, identify missing values, test assumptions, and gain insights into the structure of the data. Through visualizations and statistical summaries, EDA helps us make informed, data-driven decisions about data cleaning, feature engineering, and model selection, ensuring that the data is well-understood and ready for analysis.

### Data Shape and Data Types

Below, we examine the structure of the dataset by displaying its shape, which includes the number of rows and columns, as well as the data types of each feature. This helps us understand what kind of preprocessing may be required.

In [4]:
con_dat.shape

(7043, 34)

In [5]:
con_dat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 34 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Gender                             7043 non-null   object 
 1   Age                                7043 non-null   int64  
 2   Senior Citizen                     7043 non-null   object 
 3   Married                            7043 non-null   object 
 4   Dependents                         7043 non-null   object 
 5   Number of Dependents               7043 non-null   int64  
 6   Referred a Friend                  7043 non-null   object 
 7   Number of Referrals                7043 non-null   int64  
 8   Tenure Months                      7043 non-null   int64  
 9   Offer                              7043 non-null   object 
 10  Phone Service                      7043 non-null   object 
 11  Avg Monthly Long Distance Charges  7043 non-null   float

Above, we see that `Total Charges` is read as an object, but we need it as a float value.

In [6]:
con_dat['Total Charges'] = pd.to_numeric(con_dat['Total Charges'], errors='coerce')
con_dat['Total Charges'].head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  con_dat['Total Charges'] = pd.to_numeric(con_dat['Total Charges'], errors='coerce')


0     108.15
1     151.65
2     820.50
3    3046.05
4    5036.30
Name: Total Charges, dtype: float64
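
The conversion above can be illustrated on a toy column. With `errors='coerce'`, any entry that cannot be parsed as a number becomes `NaN`, which is why missing values can appear after the cast. The warning shown above typically arises when a frame was created as a slice of another frame; taking an explicit `.copy()` before assigning is one common way to avoid it. The data below is hypothetical.

```python
import pandas as pd

# Toy column mimicking numbers stored as text, with one blank entry (hypothetical data).
raw = pd.DataFrame({"Total Charges": ["108.15", "151.65", " ", "820.5"]})

# Work on an explicit copy so the assignment targets an independent frame.
clean = raw.copy()
clean["Total Charges"] = pd.to_numeric(clean["Total Charges"], errors="coerce")

# The blank entry could not be parsed, so it is now NaN.
n_missing = int(clean["Total Charges"].isna().sum())
print(clean["Total Charges"].dtype, n_missing)  # float64 1
```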

#### Missing Values

We will analyze missing values to ensure the quality and reliability of our data before performing any analysis or building predictive models.  Missing values can lead to biased results, reduce model accuracy, and cause errors during processing. Identifying and handling them appropriately helps maintain the integrity of the dataset and ensures that the insights or predictions we generate are based on complete and meaningful information.

In [7]:
con_dat.isnull().sum()

Gender                                  0
Age                                     0
Senior Citizen                          0
Married                                 0
Dependents                              0
Number of Dependents                    0
Referred a Friend                       0
Number of Referrals                     0
Tenure Months                           0
Offer                                   0
Phone Service                           0
Avg Monthly Long Distance Charges       0
Multiple Lines                          0
Internet Service                        0
Avg Monthly GB Download                 0
Online Security                         0
Online Backup                           0
Device Protection                       0
Premium Tech Support                    0
Streaming TV                            0
Streaming Movies                        0
Streaming Music                         0
Unlimited Data                          0
Contract                          

#### Missing Values Rationale

From the dataset, we observe some **non-response bias**, as certain customers did not provide a reason for why they churned. Note that in the rows shown below, `Churn Category` is also empty for customers who have not churned (`Churn` = 0), which is expected. Aside from this, the remaining features appear relatively clean, with no missing values present.

In contrast, there are a total of **11** missing values for `Total Charges`. These result from casting the column from object to float with `errors='coerce'`, which turns entries that cannot be parsed as numbers into `NaN`. We explore them further below:

In [8]:
con_dat[con_dat['Total Charges'].isnull()]

| | Gender | Age | Senior Citizen | Married | Dependents | Number of Dependents | Referred a Friend | Number of Referrals | Tenure Months | Offer | ... | Paperless Billing | Payment Method | Monthly Charges | Total Charges | Total Refunds | Total Extra Data Charges | Total Long Distance Charges | Total Revenue | Churn Category | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2234 | Female | 43 | No | Yes | No | 0 | Yes | 2 | 0 | | ... | Yes | Bank Withdrawal | 52.55 | | 0.0 | 0 | 0.0 | 525.5 | | 0 |
| 2438 | Male | 24 | No | No | No | 0 | No | 0 | 0 | | ... | No | Credit Card | 20.25 | | 0.0 | 0 | 131.6 | 334.1 | | 0 |
| 2568 | Female | 40 | No | Yes | No | 0 | Yes | 8 | 0 | | ... | No | Mailed Check | 80.85 | | 0.0 | 0 | 310.9 | 1119.4 | | 0 |
| 2667 | Male | 39 | No | Yes | Yes | 1 | Yes | 5 | 0 | | ... | No | Credit Card | 25.75 | | 0.0 | 0 | 228.3 | 485.8 | | 0 |
| 2856 | Female | 64 | No | Yes | No | 0 | Yes | 2 | 0 | | ... | No | Credit Card | 56.05 | | 0.0 | 0 | 0.0 | 560.5 | | 0 |
| 4331 | Male | 56 | No | Yes | Yes | 1 | Yes | 5 | 0 | | ... | No | Credit Card | 19.85 | | 0.0 | 0 | 155.1 | 353.6 | | 0 |
| 4687 | Male | 22 | No | Yes | Yes | 2 | Yes | 3 | 0 | | ... | No | Credit Card | 25.35 | | 0.0 | 0 | 363.7 | 617.2 | | 0 |
| 5104 | Female | 23 | No | Yes | Yes | 3 | Yes | 4 | 0 | Offer E | ... | No | Credit Card | 20.0 | | 0.0 | 0 | 200.5 | 400.5 | | 0 |
| 5719 | Male | 38 | No | Yes | Yes | 2 | Yes | 5 | 0 | Offer E | ... | Yes | Credit Card | 19.7 | | 0.0 | 0 | 462.3 | 659.84 | | 0 |
| 6772 | Female | 25 | No | Yes | Yes | 3 | Yes | 6 | 0 | Offer E | ... | No | Credit Card | 73.35 | | 0.0 | 0 | 55.9 | 789.4 | | 0 |


From the above dataframe, we see that the customers with missing `Total Charges` all have **0** `Tenure Months` and a nonzero `Monthly Charges`. This suggests that they are brand-new customers, so we will impute `Total Charges` as **0** for those rows and treat them as new customers who have not been charged yet.
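
The rationale above can be sketched as a guarded imputation: first confirm that every missing total belongs to a zero-tenure customer, then fill only those rows. The frame below is a made-up stand-in, and filling via a boolean mask with `.loc` is one possible approach, not necessarily the one used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two brand-new customers (tenure 0) with missing totals.
toy = pd.DataFrame({
    "Tenure Months": [0, 12, 0, 5],
    "Monthly Charges": [52.55, 70.70, 20.25, 99.65],
    "Total Charges": [np.nan, 848.40, np.nan, 498.25],
})

# Confirm that every missing total belongs to a customer with zero tenure.
missing = toy["Total Charges"].isna()
assert (toy.loc[missing, "Tenure Months"] == 0).all()

# Fill only those rows with 0; all other rows are left untouched.
toy.loc[missing, "Total Charges"] = 0.0
print(toy["Total Charges"].tolist())  # [0.0, 848.4, 0.0, 498.25]
```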

#### 0 Values For Numerical Features

Features such as `Monthly Charges`, `Total Charges`, or `Tenure Months` would be suspicious if their values were 0, so we check for such values to decide whether imputation is necessary.

In [9]:
con_dat[con_dat['Tenure Months'] == 0]

| | Gender | Age | Senior Citizen | Married | Dependents | Number of Dependents | Referred a Friend | Number of Referrals | Tenure Months | Offer | ... | Paperless Billing | Payment Method | Monthly Charges | Total Charges | Total Refunds | Total Extra Data Charges | Total Long Distance Charges | Total Revenue | Churn Category | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2234 | Female | 43 | No | Yes | No | 0 | Yes | 2 | 0 | | ... | Yes | Bank Withdrawal | 52.55 | | 0.0 | 0 | 0.0 | 525.5 | | 0 |
| 2438 | Male | 24 | No | No | No | 0 | No | 0 | 0 | | ... | No | Credit Card | 20.25 | | 0.0 | 0 | 131.6 | 334.1 | | 0 |
| 2568 | Female | 40 | No | Yes | No | 0 | Yes | 8 | 0 | | ... | No | Mailed Check | 80.85 | | 0.0 | 0 | 310.9 | 1119.4 | | 0 |
| 2667 | Male | 39 | No | Yes | Yes | 1 | Yes | 5 | 0 | | ... | No | Credit Card | 25.75 | | 0.0 | 0 | 228.3 | 485.8 | | 0 |
| 2856 | Female | 64 | No | Yes | No | 0 | Yes | 2 | 0 | | ... | No | Credit Card | 56.05 | | 0.0 | 0 | 0.0 | 560.5 | | 0 |
| 4331 | Male | 56 | No | Yes | Yes | 1 | Yes | 5 | 0 | | ... | No | Credit Card | 19.85 | | 0.0 | 0 | 155.1 | 353.6 | | 0 |
| 4687 | Male | 22 | No | Yes | Yes | 2 | Yes | 3 | 0 | | ... | No | Credit Card | 25.35 | | 0.0 | 0 | 363.7 | 617.2 | | 0 |
| 5104 | Female | 23 | No | Yes | Yes | 3 | Yes | 4 | 0 | Offer E | ... | No | Credit Card | 20.0 | | 0.0 | 0 | 200.5 | 400.5 | | 0 |
| 5719 | Male | 38 | No | Yes | Yes | 2 | Yes | 5 | 0 | Offer E | ... | Yes | Credit Card | 19.7 | | 0.0 | 0 | 462.3 | 659.84 | | 0 |
| 6772 | Female | 25 | No | Yes | Yes | 3 | Yes | 6 | 0 | Offer E | ... | No | Credit Card | 73.35 | | 0.0 | 0 | 55.9 | 789.4 | | 0 |


In [10]:
con_dat[con_dat['Total Charges'] == 0]

| | Gender | Age | Senior Citizen | Married | Dependents | Number of Dependents | Referred a Friend | Number of Referrals | Tenure Months | Offer | ... | Paperless Billing | Payment Method | Monthly Charges | Total Charges | Total Refunds | Total Extra Data Charges | Total Long Distance Charges | Total Revenue | Churn Category | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|



In [12]:
con_dat[con_dat['Monthly Charges'] == 0].head()

| | Gender | Age | Senior Citizen | Married | Dependents | Number of Dependents | Referred a Friend | Number of Referrals | Tenure Months | Offer | ... | Paperless Billing | Payment Method | Monthly Charges | Total Charges | Total Refunds | Total Extra Data Charges | Total Long Distance Charges | Total Revenue | Churn Category | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


As seen from the dataframes generated above, some customers have **0** `Tenure Months`. This is plausible for this feature, as it most likely means that they are brand-new customers.

Additionally, there are some **NaN** values for `Total Charges`. This likely indicates that the customer hasn't been charged yet. Therefore, we impute these **NaN** values as 0 and, moving forward, assume that these customers haven't been charged.

In [17]:
# Impute missing totals with 0, matching the rationale above (new customers, not yet charged).
# Reassigning on an explicit copy avoids the SettingWithCopyWarning raised by in-place fills on a slice.
con_dat = con_dat.copy()
con_dat['Total Charges'] = con_dat['Total Charges'].fillna(0)

pd.DataFrame(con_dat['Total Charges'])


| | Total Charges |
|---|---|
| 0 | 108.15 |
| 1 | 151.65 |
| 2 | 820.50 |
| 3 | 3046.05 |
| 4 | 5036.30 |
| ... | ... |
| 7038 | 1419.40 |
| 7039 | 1990.50 |
| 7040 | 7362.90 |
| 7041 | 346.45 |


#### Unique Features

Unique features help us better understand the structure and variability of our dataset. Identifying the number of unique values in each feature helps determine whether a feature is categorical, binary, or continuous, and whether it's suitable for analysis or modeling. It also helps us detect potential issues such as constant columns with no variability, which provide little to no predictive power and may be removed. Understanding uniqueness ensures we handle features appropriately during preprocessing and model training.

In [14]:
con_dat.select_dtypes(include='object').nunique()

Gender                  2
Senior Citizen          2
Married                 2
Dependents              2
Referred a Friend       2
Offer                   6
Phone Service           2
Multiple Lines          3
Internet Service        3
Online Security         3
Online Backup           3
Device Protection       3
Premium Tech Support    2
Streaming TV            3
Streaming Movies        3
Streaming Music         2
Unlimited Data          2
Contract                3
Paperless Billing       2
Payment Method          3
Churn Category          5
dtype: int64

In [15]:
con_dat.select_dtypes(include=['int64', 'float64']).nunique()

Age                                    62
Number of Dependents                   10
Number of Referrals                    12
Tenure Months                          73
Avg Monthly Long Distance Charges    3584
Avg Monthly GB Download                50
Monthly Charges                      1585
Total Charges                        6531
Total Refunds                         500
Total Extra Data Charges               16
Total Long Distance Charges          6110
Total Revenue                        6996
Churn                                   2
dtype: int64
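
Constant columns can also be flagged programmatically from the unique-value counts. The sketch below uses a small hypothetical frame; the column names mirror the dataset, but the values are made up.

```python
import pandas as pd

# Toy frame with one constant column (hypothetical values).
toy = pd.DataFrame({
    "State": ["California"] * 4,
    "Contract": ["Month-to-Month", "One Year", "Two Year", "One Year"],
    "Churn": [1, 0, 0, 1],
})

# Columns with a single distinct value carry no information for a model.
n_unique = toy.nunique(dropna=False)
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)  # ['State']
```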

From the dataset, we observe that both `Country` and `State` contain only a single unique value: `Country` is "United States" and `State` is "California." This indicates that all customers are located in the same region. Since these features do not offer any variability, they are unlikely to contribute meaningful predictive power and can be safely dropped from the analysis.

Additionally, `CustomerID` is a unique key that identifies each customer, so it carries no predictive information. We can also remove this column below.

In [16]:
# These columns were already dropped when the data was loaded, so ignore labels that are absent.
con_dat = con_dat.drop(columns=['CustomerID', 'Country', 'State'], errors='ignore')
con_dat.head()

Additionally, the `Lat Long` feature is redundant, as the dataset already includes separate `Latitude` and `Longitude` columns. Keeping it would introduce unnecessary noise without adding meaningful information, so we will drop the `Lat Long` feature from the dataset.

In [None]:
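# errors='ignore' keeps the cell runnable if the column was already removed at load time.
con_dat = con_dat.drop(columns=['Lat Long'], errors='ignore')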
con_dat.head()

Lastly, we observe that the features `Churn Label` and `Churn Value` convey the same information, with `Churn Label` using categorical values **(Yes/No)** and `Churn Value` using numerical values **(1/0)**. To simplify modeling and ensure consistency, we will drop the `Churn Label` feature and use `Churn Value` as our target variable.

In [None]:
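# errors='ignore' keeps the cell runnable if the column was already removed at load time.
con_dat = con_dat.drop(columns=['Churn Label'], errors='ignore')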
con_dat.head()

In [None]:
con_dat.select_dtypes(include='object').nunique()

### Churn Reasons Distributions

Below, we analyze the reasons why customers have churned in order to gain deeper insights that can help improve customer retention strategies.

In [None]:
con_dat['Churn Reason'].value_counts()

In [None]:
# Set figure size
plt.figure(figsize=(12, 6))

# Plot
sns.countplot(y='Churn Reason', data=con_dat, order=con_dat['Churn Reason'].value_counts().index)

# Add titles and labels
plt.title('Distribution of Churn Reasons')
plt.xlabel('Number of Customers')
plt.ylabel('Churn Reason')
plt.tight_layout()
plt.show()

From the visualizations above, we observe that many customers churn due to issues related to customer support and more attractive offers from competitors.

#### Summary Statistics

Below are the summary statistics for the dataset. They provide a quick overview of the central tendency, spread, and distribution of numerical features. They help identify outliers, anomalies, and potential data entry errors, such as unexpected negative values. Differences between metrics like the mean and median can reveal skewed distributions, which may require transformation before modeling. Summary statistics also allow for easy comparison between features, guiding decisions on normalization or feature scaling. Overall, this step ensures that we understand the structure and quality of our data before moving forward with analysis or modeling.

In [None]:
con_dat.describe()
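
To illustrate how a gap between the mean and the median signals skew, here is a small sketch on made-up charges. The values are hypothetical and chosen so that a single large entry pulls the mean well above the median.

```python
import pandas as pd

# Hypothetical right-skewed charges: most values are small, one is very large.
charges = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0, 500.0])

mean, median = charges.mean(), charges.median()
skewness = charges.skew()  # sample skewness; positive means a long right tail

print(f"mean={mean:.2f} median={median:.2f}")  # mean=108.33 median=32.50
```

A mean far above the median, together with positive skewness, is the kind of pattern that may call for a transformation such as a logarithm before modeling.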

From the summary statistics, we can see that `Count` has $\sigma$ = 0 with a min and max of 1, meaning that `Count` is a static number that is always **1**. Since this variable provides no information and would only add noise to our model, we will exclude it.

In [None]:
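# errors='ignore' keeps the cell runnable if the column was already removed at load time.
con_dat = con_dat.drop(columns=['Count'], errors='ignore')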
con_dat.head()

### Data Visualization

#### Distribution of Churn Population

We visualize the distribution of the churn population as it helps us understand the balance between churned and retained customers, which is crucial for building effective predictive models.  If the dataset is imbalanced (e.g., many more non-churned than churned customers), it can bias the model toward the majority class, leading to misleading accuracy and poor performance in detecting actual churn. Visualization also provides an immediate, intuitive grasp of class proportions, helps guide decisions like resampling (oversampling or undersampling), and highlights whether churn is a significant concern for the business.

In [None]:
sns.countplot(x='Churn Value', data=con_dat)

The graph above shows class imbalance in the dataset, so we need to take steps to ensure our model doesn't become biased toward the majority class. Left unaddressed, this imbalance can lead to misleading accuracy and poor performance in identifying actual churners. To address it, we can apply resampling techniques such as oversampling the minority class or undersampling the majority class. Additionally, we should use evaluation metrics like precision, recall, F1-score, and AUC rather than relying solely on accuracy. These steps will help create a more balanced and reliable churn prediction model.
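
As a rough sketch of the oversampling idea, the example below measures class proportions on a toy target and then resamples the minority class with replacement until both classes have the same size. The data, the column `x`, and the choice of naive random oversampling are assumptions for illustration; other resampling strategies exist.

```python
import pandas as pd

# Toy target with a 3:1 imbalance (hypothetical values).
toy = pd.DataFrame({"x": range(8), "Churn": [0, 0, 0, 0, 0, 0, 1, 1]})

# Class proportions make the imbalance explicit.
shares = toy["Churn"].value_counts(normalize=True)

# Naive random oversampling: resample the minority class with replacement
# until it matches the majority class size.
majority = toy[toy["Churn"] == 0]
minority = toy[toy["Churn"] == 1]
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, minority_up], ignore_index=True)

print(shares.to_dict())  # {0: 0.75, 1: 0.25}
print(balanced["Churn"].value_counts().to_dict())
```

Resampling of this kind should be applied to the training split only, so that the test set keeps the original class proportions.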

#### Distribution of Numerical Features

We will now visualize the distribution of the numerical features in our dataset.


In [None]:
numeric_features = [
    'Zip Code',
    'Latitude',
    'Longitude',
    'Tenure Months',
    'Monthly Charges',
    'Churn Score',
    'CLTV',
    'Total Charges'
]

# --- HISTOGRAMS ---
plt.figure(figsize=(14, 10))
for i, col in enumerate(numeric_features, 1):
    plt.subplot(3, 3, i)
    plt.hist(con_dat[col], bins=30, edgecolor='black')
    plt.title(f'{col} Histogram')
    plt.xlabel(col)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

##### Zip Code Discussion

Intuitively, we expect `Zip Code` to have little predictive power, as zip codes identify a customer's location rather than describe their behavior. We will check the correlation between `Zip Code` and churn to see whether there is a statistically significant relationship.

In [None]:
con_dat['Zip Code'].nunique()

In [None]:
con_dat['Zip Code'].corr(con_dat['Churn Value'])

Since the correlation of **0.003** is very close to 0 even with **n=7043** observations, we find no evidence of a meaningful linear relationship between `Zip Code` and `Churn Value`, so we will drop this feature.
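
The significance of a correlation this small can be checked with the usual t statistic for a Pearson coefficient, $t = r\sqrt{n-2}/\sqrt{1-r^2}$. The sketch below plugs in the values quoted above and approximates the two-sided p-value with the normal tail, which is an assumption that is reasonable only because n is large. It tests linear association only; a zip code is a nominal code, so a linear correlation is at best a rough screen.

```python
import math

r, n = 0.003, 7043  # correlation and sample size quoted above

# t statistic for H0: no linear correlation, with n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# With n this large, the t distribution is close to standard normal,
# so the two-sided p-value is approximated with the normal tail.
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"t={t_stat:.3f}, p={p_value:.2f}")  # t=0.252, p=0.80
```

A p-value this far above common thresholds such as 0.05 is consistent with treating the correlation as indistinguishable from zero.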

In [None]:
con_dat.drop('Zip Code', axis=1, inplace=True)
con_dat.head()

### Distribution of Churn Population of Given Features

#### Distribution of Churn Population Given City

Below, we visualize how many customers churned within each category. We start by displaying the top 10 cities:

In [None]:
# Get the top 10 cities by number of customers
top_10_cities = con_dat['City'].value_counts().head(10).index

# Filter the dataset to include only those top 10 cities
top_10_df = con_dat[con_dat['City'].isin(top_10_cities)]

# Preview the result
top_10_df['City'].value_counts()

In [None]:
#churn rate by city
churn_rate_by_city = (
    top_10_df.groupby('City')['Churn Value']
    .mean()
    .sort_values(ascending=False)
)

#plot the charts
plt.figure(figsize=(12, 6))
sns.barplot(
    x=churn_rate_by_city.values,
    y=churn_rate_by_city.index,
    color='skyblue'
)
plt.title('Churn Rate by City (Top 10 Most Frequent Cities)')
plt.xlabel('Churn Rate')
plt.ylabel('City')
plt.xlim(0, 1)
plt.tight_layout()
plt.show()

Among the 10 cities with the most customers, *San Diego* has the highest churn rate. This suggests that there is an association between `City` and `Churn`.

### Churn Distribution Across Categorical Features

The dataset originally contains 24 categorical features. We have removed 3 of them (`CustomerID`, `Country`, and `State`) due to their lack of predictive value, and previously visualized churn rates by `City`. Below, we explore the relationship between selected categorical features and the target variable `Churn` using bar charts. 

These visualizations help us uncover potential patterns or trends that may influence customer churn.

In [None]:
# Identify categorical columns (excluding ID-like or geographic ones)
categorical_cols = [
    'Gender', 'Senior Citizen', 'Partner', 'Dependents',
    'Phone Service', 'Multiple Lines', 'Internet Service',
    'Online Security', 'Online Backup', 'Device Protection',
    'Tech Support', 'Streaming TV', 'Streaming Movies',
    'Contract', 'Paperless Billing', 'Payment Method'
]

# Set global plot style
sns.set(style="whitegrid")

# Plot each categorical variable vs Churn
for col in categorical_cols:
    plt.figure(figsize=(6, 4))
    sns.countplot(data=con_dat, x=col, hue='Churn Value',
                  palette={0: 'blue', 1: 'red'})
    plt.title(f'Churn by {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=30)
    plt.legend(title='Churn')
    plt.tight_layout()
    plt.show()

The plots above suggest that churn is associated with several features: the split between churned and retained customers differs visibly across the levels of `Phone Service`, `Dependents`, `Partner`, `Senior Citizen`, `Multiple Lines`, `Online Security`, `Online Backup`, `Device Protection`, `Tech Support`, `Contract`, `Paperless Billing`, and `Payment Method`. For some levels churn is more common, and for others it is less common. These are statistical associations that our model may be able to pick up.
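Count plots show raw counts, so a category with many customers can look like it has more churn even when its churn rate is lower. One way to compare categories on an equal footing is a row-normalised cross-tabulation. The sketch below uses made-up data; `toy` and `churn_share` are hypothetical names, and the same call could be applied to any column in `categorical_cols`.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "Churn Value": [1, 1, 0, 0, 0, 0, 0, 1],
})

# Row-normalised crosstab: each row sums to 1, and column 1 is the churn rate
churn_share = pd.crosstab(toy["Contract"], toy["Churn Value"], normalize="index")
print(churn_share)
```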

### Data Cleaning

**Data Cleaning** is the process of identifying and correcting errors, inconsistencies, and inaccuracies in a dataset to improve its quality and ensure reliable analysis. This involves handling missing values, correcting data types, removing duplicate or irrelevant entries, standardizing formats, and resolving inconsistencies. Clean data is essential for building accurate and trustworthy models, as poor-quality data can lead to misleading insights and reduced model performance.

We will first split the data to prevent any **data leakage** from occurring. **Data leakage** occurs when information from outside the training dataset unintentionally influences the model during training. This can happen when preprocessing steps like scaling, encoding, or imputing are applied before the train-test split, or when target-related information is included in the features. Data leakage leads to overly optimistic evaluation results, so the model's measured performance does not generalize to real-world data.
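As an illustration of the leak-free pattern, a preprocessing step can be fitted on the training split only and then reused unchanged on the test split. This is a minimal sketch on synthetic data; all names (`X_demo`, `scaler`, and so on) are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(200, 1))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Leak-free: the scaling statistics are learned from the training split only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training mean and std

print("mean used for scaling:", scaler.mean_[0])
print("training mean:        ", X_tr.mean())
```

Fitting the scaler on the full dataset instead would let the test rows influence the mean and standard deviation, which is exactly the kind of leakage described above.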

### Feature Engineering



#### Data Train/Test Split Creation

We will split the data into an 80/20 train/test split, stratified on the target so that both sets keep the same churn proportion.

In [None]:
from sklearn.model_selection import train_test_split

# Feature matrix: every column except the target
X = con_dat.drop('Churn Value', axis=1)
# Target vector
y = con_dat['Churn Value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=123)
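To see what `stratify` does, the sketch below splits a synthetic, imbalanced target and compares the positive rate in each split. The data and names (`X_demo`, `y_demo`) are made up for illustration; only the `train_test_split` arguments mirror the call above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 25% positives (illustrative only)
y_demo = np.array([1] * 250 + [0] * 750)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=123
)

# With stratify, the positive rate is preserved in both splits
print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```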

#### Churn Reason Imputing

In our **EDA**, we saw that `Churn Reason` contains missing values alongside an existing "Don't know" category, which may reflect non-response bias or unknown reasons for churning. We will impute the missing values into the "Don't know" category.

In [None]:
X_train['Churn Reason'].isna().sum()

In [None]:
(X_train['Churn Reason'] == "Don't know").sum()

In [None]:
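# Assign the result back instead of using inplace=True on a column selection,
# which is unreliable under pandas copy-on-write behavior
X_train['Churn Reason'] = X_train['Churn Reason'].fillna("Don't know")
X_test['Churn Reason'] = X_test['Churn Reason'].fillna("Don't know")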

X_train['Churn Reason'].head()

Imputing missing values in `Churn Reason` with "Don't know" is a practical approach: it preserves all rows, avoids model errors from NaN values, and stays interpretable by clearly marking that no reason was provided. Because the fill value is a fixed label applied after splitting, nothing is learned from the test set, so the imputation itself does not introduce leakage. It may also capture a meaningful pattern, as customers who don't provide a reason could behave differently from those who do.

## Baseline Model

Before applying the full preprocessing and trying a variety of models, we create a baseline model so that we have something to compare the more complex, preprocessed models against.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Quick one-hot encoding of every non-numeric column
X_train_enc = pd.get_dummies(X_train, drop_first=True)
X_test_enc  = pd.get_dummies(X_test, drop_first=True)

# Keep exactly the training columns; columns missing from the test set are filled with 0
X_train_enc, X_test_enc = X_train_enc.align(X_test_enc, join='left', axis=1, fill_value=0)

# class_weight="balanced" weights each class inversely to its frequency
baseline_model = LogisticRegression(max_iter=1000, class_weight="balanced", n_jobs=-1)
baseline_model.fit(X_train_enc, y_train)

In [None]:
y_pred = baseline_model.predict(X_test_enc)
y_proba = baseline_model.predict_proba(X_test_enc)[:, 1]

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=4))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
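The `align` call in the baseline matters because `pd.get_dummies` is applied to the training and test sets separately, so the two encoded frames can end up with different columns. The sketch below shows the effect on made-up data; the frames `train` and `test` and their category values are hypothetical.

```python
import pandas as pd

# Toy frames: "Two year" appears only in train, "Lifetime" only in test
train = pd.DataFrame({"Contract": ["Monthly", "Yearly", "Two year"]})
test = pd.DataFrame({"Contract": ["Monthly", "Yearly", "Lifetime"]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# Keep exactly the training columns: columns seen only in test are dropped,
# columns missing from test are added and filled with 0
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

print(list(test_enc.columns))
```

After alignment both frames share the training column layout, which is what a fitted model expects at prediction time.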

#### One Hot Encoding

Below we one-hot encode the categorical variables so that they are represented as numeric columns the model can use.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

encoder = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_cols)
    ],
    remainder='passthrough'
)

X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

#### Standard Scaler

Below we also set up a Standard Scaler, which standardizes each numeric feature to have a mean of 0 and a standard deviation of 1. This rescales the values but does not change the shape of their distribution.

In [None]:
from sklearn.preprocessing import StandardScaler

new_numeric_features = [
    'Latitude',
    'Longitude',
    'Tenure Months',
    'Monthly Charges',
    'Churn Score',
    'CLTV'
]
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), new_numeric_features),
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_cols)
    ],
    remainder='passthrough'
)