# Customer Churn Prediction

## Overview of Process

1. Data Loading: Load the dataset to understand its structure and contents.
2. Data Exploration: Explore the dataset to identify patterns, missing values, and anomalies.
3. Data Preprocessing: Clean and prepare the data for modeling, including handling missing values, encoding categorical variables, and normalizing data if necessary.
4. Feature Selection: Select relevant features that contribute most to the prediction of customer churn.
5. Model Building: Choose an appropriate algorithm to build the churn prediction model. Common choices include Logistic Regression, Decision Trees, Random Forest, and Gradient Boosting.
6. Model Evaluation: Evaluate the model using appropriate metrics (like accuracy, precision, recall, F1-score, and ROC-AUC) to ensure its reliability.
7. Model Tuning: Fine-tune the model parameters to improve performance.
8. Prediction & Analysis: Use the model to predict churn and analyze the results to gain business insights.

## Dataset
The [dataset](https://www.kaggle.com/datasets/blastchar/telco-customer-churn) contains various features related to customer accounts, including demographic information, account details, and the churn label indicating whether the customer has left the company:
```
customerID: Identifier for the customer
gender: Customer's gender (male/female)
SeniorCitizen: Whether the customer is a senior citizen (1) or not (0)
Partner: Whether the customer has a partner (Yes/No)
Dependents: Whether the customer has dependents (Yes/No)
tenure: Number of months the customer has stayed with the company
PhoneService: Whether the customer has phone service (Yes/No)
MultipleLines: Whether the customer has multiple lines (Yes/No/No phone service)
InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
OnlineSecurity: Whether the customer has online security (Yes/No/No internet service)
OnlineBackup: Whether the customer has online backup (Yes/No/No internet service)
DeviceProtection: Whether the customer has device protection (Yes/No/No internet service)
TechSupport: Whether the customer has tech support (Yes/No/No internet service)
StreamingTV: Whether the customer has streaming TV (Yes/No/No internet service)
StreamingMovies: Whether the customer has streaming movies (Yes/No/No internet service)
Contract: The contract term of the customer (Month-to-month, One year, Two year)
PaperlessBilling: Whether the customer has paperless billing (Yes/No)
PaymentMethod: The customer’s payment method
MonthlyCharges: The amount charged to the customer monthly
TotalCharges: The total amount charged to the customer
Churn: Whether the customer churned (Yes/No)
```

### Data Loading and Exploration

In [1]:
import pandas as pd

# Load the dataset
file_path = 'C:/Users/dhill/Downloads/Customer Churn/WA_Fn-UseC_-Telco-Customer-Churn.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
data.head()


| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [2]:
data.describe()

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.85 |
| max | 1.0 | 72.0 | 118.75 |


In [3]:
data.describe(include=object)

| | customerID | gender | Partner | Dependents | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043.0 | 7043 |
| unique | 7043 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 4 | 6531.0 | 2 |
| top | 7590-VHVEG | Male | No | No | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | | No |
| freq | 1 | 3555 | 3641 | 4933 | 6361 | 3390 | 3096 | 3498 | 3088 | 3095 | 3473 | 2810 | 2785 | 3875 | 4171 | 2365 | 11.0 | 5174 |


### Data preprocessing

- Handle Missing Values: Check for and handle any missing values in the dataset.
- Encode Categorical Variables: Convert categorical variables to numeric format since machine learning models require numerical input.
- Normalize Numerical Features: Scale numerical features to ensure they have equal weight in the model.
- Split the Data: Divide the dataset into training and testing sets to evaluate the model's performance.


Most columns had no missing values. The TotalCharges column, however, was read as text because 11 rows contain blank strings. Converting the column to numeric turns those blanks into NaN, and we fill the 11 resulting missing values with the column mean.
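The conversion and fill can be illustrated on a few values. This is a minimal sketch: the Series below is a made-up miniature of the column, and the names `raw`, `numeric`, and `filled` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the TotalCharges column; a single space stands in for a blank entry
raw = pd.Series(["29.85", " ", "108.15", " ", "151.65"], name="TotalCharges")

# errors="coerce" converts anything that cannot be parsed as a number into NaN
numeric = pd.to_numeric(raw, errors="coerce")
n_missing = int(numeric.isna().sum())

# Fill the gaps with the mean of the values that did parse
filled = numeric.fillna(numeric.mean())

print(n_missing)
print(filled.tolist())
```

Mean imputation is simple but not the only option. Customers with blank charges may share something in common (for example, being very new), in which case a constant such as 0 could be a more faithful fill.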

Next, we'll proceed with encoding the categorical variables. We'll use one-hot encoding for nominal categorical variables without intrinsic ordering (like InternetService) and label encoding for binary variables (like Partner, Dependents, PhoneService, etc.). This process will transform these variables into a format that can be provided to machine learning algorithms.
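As a small illustration of the two encodings, the sketch below uses a made-up three-row frame that borrows column names from the dataset. It uses `pd.get_dummies` for brevity, which is a swapped-in alternative to scikit-learn's `OneHotEncoder`.

```python
import pandas as pd

# Hypothetical toy frame reusing three column names from the dataset
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "InternetService": ["DSL", "Fiber optic", "No"],
    "MultipleLines": ["No phone service", "No", "Yes"],
})

# Binary Yes/No column: map directly to 1/0
toy["Partner"] = toy["Partner"].map({"Yes": 1, "No": 0})

# Columns with three unordered levels: one indicator column per level
encoded = pd.get_dummies(toy, columns=["InternetService", "MultipleLines"], dtype=int)

print(encoded.columns.tolist())
```

Columns such as MultipleLines have a third level ("No phone service"), so treating them as binary would force an arbitrary numeric code onto that level. One-hot encoding avoids implying an order.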

After encoding, we'll normalize the numerical features (tenure, MonthlyCharges, and TotalCharges) to ensure they contribute equally to the model's training process.

Finally, we'll split the data into training and testing sets. Common choices are an 80-20 or a 70-30 split; this notebook uses 70-30.
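Because churners are the minority class, it can help to stratify the split so both partitions keep the same churn rate. The sketch below shows this on made-up data; the arrays and the 25% positive rate are assumptions chosen for illustration, and `stratify` is an option the notebook itself does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 200 rows, 3 numeric features, 25% positives
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([1] * 50 + [0] * 150)

# stratify=y_demo keeps the class ratio (approximately) equal in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```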

In [4]:

# Check for missing values in the dataset
missing_values = data.isnull().sum()

# Check for empty strings in 'TotalCharges' as it should be numeric
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
missing_values_in_total_charges = data['TotalCharges'].isnull().sum()

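# Fill missing 'TotalCharges' with the mean value of the column
# (assign back instead of using inplace=True on a column selection, which newer pandas versions may not apply to `data`)
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].mean())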

(missing_values, missing_values_in_total_charges)


(customerID          0
 gender              0
 SeniorCitizen       0
 Partner             0
 Dependents          0
 tenure              0
 PhoneService        0
 MultipleLines       0
 InternetService     0
 OnlineSecurity      0
 OnlineBackup        0
 DeviceProtection    0
 TechSupport         0
 StreamingTV         0
 StreamingMovies     0
 Contract            0
 PaperlessBilling    0
 PaymentMethod       0
 MonthlyCharges      0
 TotalCharges        0
 Churn               0
 dtype: int64,
 11)

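The next cell completes the preprocessing:

- Categorical variables are encoded.
- Numerical features (tenure, MonthlyCharges, TotalCharges) are standardized.
- The dataset is split into training (70%) and testing (30%) sets, resulting in 4,930 training samples and 2,113 testing samples.

Note that the scaler is fit on the full dataset before the split, so the test rows influence the scaling statistics. Fitting it on the training rows only would avoid this leakage.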
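One way to keep encoding and scaling confined to the training rows is to wrap them in a `ColumnTransformer` inside a `Pipeline`, so that `fit` only ever sees training data. The following is a sketch under stated assumptions, not the notebook's method: the frame `df`, the target `churn`, and the way churn is simulated are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# Hypothetical synthetic frame that borrows a few column names from the dataset
df = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=n),
    "MonthlyCharges": rng.uniform(18, 119, size=n),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], size=n),
})
# Simulated target: churn is made more likely for short tenures
p_churn = 1 / (1 + np.exp((df["tenure"].to_numpy() - 20) / 10))
churn = (rng.uniform(size=n) < p_churn).astype(int)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["tenure", "MonthlyCharges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    df, churn, test_size=0.3, random_state=42, stratify=churn
)
model.fit(X_tr, y_tr)  # scaler and encoder statistics come from X_tr only

n_features = model.named_steps["preprocess"].transform(X_te).shape[1]
test_accuracy = model.score(X_te, y_te)
print(n_features, test_accuracy)
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the preprocessing is refit inside every cross-validation fold.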

In [5]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

# Define the columns for one-hot encoding and label encoding
one_hot_columns = ['InternetService', 'Contract', 'PaymentMethod']
label_columns = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'Churn']

# Map Yes/No and Male/Female to 1/0; 'No phone service' and 'No internet service' are coded as 2 here (a later cell switches those columns to one-hot encoding)
data[label_columns] = data[label_columns].replace({'Yes': 1, 'No': 0, 'Male': 1, 'Female': 0 ,'No phone service':2, 'No internet service':2})

# Apply one-hot encoding to the one-hot columns
one_hot_encoder = OneHotEncoder()
one_hot_encoded = one_hot_encoder.fit_transform(data[one_hot_columns])

# Drop the original one-hot columns and add the new one-hot encoded columns
data.drop(one_hot_columns, axis=1, inplace=True)
data = data.join(pd.DataFrame(one_hot_encoded.toarray(), columns=one_hot_encoder.get_feature_names_out()))

# Define features and target
X = data.drop(['customerID', 'Churn'], axis=1)
y = data['Churn']

# Normalize the numerical features
scaler = StandardScaler()
X[['tenure', 'MonthlyCharges', 'TotalCharges']] = scaler.fit_transform(X[['tenure', 'MonthlyCharges', 'TotalCharges']])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Display the shapes of the training and testing sets
(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


((4930, 26), (2113, 26), (4930,), (2113,))

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Initialize the Logistic Regression model
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
log_reg.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = log_reg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

(accuracy, precision, recall, roc_auc)


(0.812588736393753, 0.6854166666666667, 0.573170731707317, 0.7375275360940744)

The Logistic Regression model has been successfully trained and evaluated. Here are the performance metrics:

- Accuracy: 81.26% - The percentage of total predictions that were correct.
- Precision: 68.54% - The percentage of positive predictions that were actually positive.
- Recall: 57.32% - The percentage of actual positive cases that were correctly predicted.
- ROC-AUC Score: 73.75% - Represents the model's ability to distinguish between positive and negative classes. Because it is computed here from hard 0/1 predictions rather than predicted probabilities, it equals the average of recall and specificity (balanced accuracy).
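The first three definitions can be worked through by hand from confusion-matrix counts. The counts below are made up for illustration and are not taken from the model above.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration
tp, fp, fn, tn = 30, 10, 20, 140

accuracy_demo = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision_demo = tp / (tp + fp)                  # share of predicted churners who did churn
recall_demo = tp / (tp + fn)                     # share of actual churners who were caught
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(accuracy_demo, precision_demo, recall_demo, f1_demo)
```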

These metrics provide a solid starting point. Depending on your specific needs, you might want to improve precision (to minimize false positives) or recall (to minimize false negatives). Further model tuning, feature engineering, or even exploring more complex models could enhance these results.
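One lever for trading precision against recall is the decision threshold applied to the predicted probabilities. The sketch below runs on synthetic data from `make_classification` (an assumption, not the Telco dataset) and also computes ROC-AUC from probabilities, which is how that metric is usually calculated.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data (roughly 27% positives)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.73, 0.27], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of the positive class

# ROC-AUC from scores rather than from hard 0/1 predictions
auc_from_proba = roc_auc_score(y_te, proba)
print("ROC-AUC:", auc_from_proba)

# Lowering the threshold flags more cases as positive, so recall cannot decrease
recalls = {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, pred)
    print(threshold, precision_score(y_te, pred), recalls[threshold])
```

Which threshold is appropriate depends on the relative cost of contacting a customer who would have stayed versus missing one who leaves.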

In [7]:
# Correct the encoding for 'MultipleLines' and similar columns with more than two categories
additional_one_hot_columns = ['MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
one_hot_columns += additional_one_hot_columns  # Add these columns to the one-hot encoding list

# Remove these columns from label encoding list
for column in additional_one_hot_columns:
    if column in label_columns:
        label_columns.remove(column)

# Re-encode the categorical variables
data = pd.read_csv(file_path)  # Reload the data to reset changes
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')  # Handle 'TotalCharges' as before
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].mean())  # Handle missing values as before
data[label_columns] = data[label_columns].replace({'Yes': 1, 'No': 0, 'Male': 1, 'Female': 0})  # Re-apply label encoding

# Apply one-hot encoding to the updated list of one-hot columns
one_hot_encoded = one_hot_encoder.fit_transform(data[one_hot_columns])

# Drop the original one-hot columns and add the new one-hot encoded columns
data.drop(one_hot_columns, axis=1, inplace=True)
data = data.join(pd.DataFrame(one_hot_encoded.toarray(), columns=one_hot_encoder.get_feature_names_out()))

# Redefine features and target without 'customerID'
X = data.drop(['customerID', 'Churn'], axis=1)
y = data['Churn']

# Normalize the numerical features again
X[['tenure', 'MonthlyCharges', 'TotalCharges']] = scaler.fit_transform(X[['tenure', 'MonthlyCharges', 'TotalCharges']])

# Split the data into training and testing sets again
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Retrain the Logistic Regression model
log_reg.fit(X_train, y_train)

# Make predictions on the testing set and evaluate the model again
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

(accuracy, precision, recall, roc_auc)


(0.8111689540937056, 0.681912681912682, 0.5714285714285714, 0.7360066833751044)

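With the three-level columns one-hot encoded, the results are nearly unchanged: accuracy 81.12%, precision 68.19%, recall 57.14%, and ROC-AUC 73.60%.

To improve the model's performance, we can consider the following steps: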

- Feature Engineering: Create new features or modify existing ones to better capture the underlying patterns in the data.
- Model Tuning: Adjust the model's hyperparameters to find the optimal settings.
- Advanced Models: Explore more complex models like Decision Trees, Random Forests, or Gradient Boosting Machines.

Given these options, we could start with model tuning for the Logistic Regression model to see if we can improve the performance metrics. Tuning involves adjusting hyperparameters like the regularization strength (C) and the type of solver used.

### Hyperparameter tuning for the Logistic Regression model

We'll use Grid Search with cross-validation to find the best combination of hyperparameters. 

The primary hyperparameters we'll focus on are:

- `C`: Inverse of regularization strength; smaller values specify stronger regularization.
- `solver`: Algorithm to use in the optimization problem.

We'll define a range of values for C and a set of solvers to try. Then, we'll perform Grid Search to find the best combination based on cross-validated performance. This process can help improve the model by finding a balance between bias and variance, potentially leading to better generalization on unseen data.
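To make the role of `C` concrete, the sketch below fits the same model at three values of `C` on synthetic data (an assumption, not the Telco dataset) and records the size of the coefficient vector. Smaller `C` means stronger L2 regularization and therefore smaller coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data, used only to show the effect of C
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)

coef_norms = {}
for C in (0.01, 1, 100):
    clf = LogisticRegression(C=C, solver="lbfgs", max_iter=1000).fit(X_syn, y_syn)
    # Euclidean norm of the fitted coefficients (the intercept is not penalized)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

print(coef_norms)
```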

### Tuning process

In [8]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['newton-cg', 'lbfgs', 'liblinear']
}

# Initialize the Grid Search model
grid_search = GridSearchCV(LogisticRegression(max_iter=1000, random_state=42), param_grid, cv=5, scoring='accuracy')

# Perform grid search on the training data
grid_search.fit(X_train, y_train)

# Best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

(best_params, best_score)


({'C': 1, 'solver': 'liblinear'}, 0.8032454361054766)

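The hyperparameter tuning process identified the best parameters for the Logistic Regression model as C=1 and solver='liblinear', with a best cross-validated accuracy of approximately 80.32%. Since C=1 is also scikit-learn's default, the tuned model differs from the earlier one only in its solver.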
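Before settling on a winner, it can be worth looking at how close the candidates are. The sketch below, on synthetic data (an assumption) and with a reduced grid, pulls the mean and standard deviation of the fold scores from `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data standing in for the training set
X_syn, y_syn = make_classification(n_samples=600, n_features=10, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 1, 100], "solver": ["lbfgs", "liblinear"]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_syn, y_syn)

# One row per candidate: mean and spread of the five fold scores
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "param_solver", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results.to_string(index=False))
```

When the top candidates differ by less than the fold-to-fold standard deviation, the choice between them is largely noise.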

Now that we have the best parameters, let's retrain the Logistic Regression model using these parameters and evaluate its performance on the test set to see if there's an improvement. 

In [9]:
# Retrain the model with the best parameters
optimized_log_reg = LogisticRegression(C=1, solver='liblinear', max_iter=1000, random_state=42)
optimized_log_reg.fit(X_train, y_train)

# Make predictions on the testing set
y_pred_optimized = optimized_log_reg.predict(X_test)

# Evaluate the optimized model
accuracy_optimized = accuracy_score(y_test, y_pred_optimized)
precision_optimized = precision_score(y_test, y_pred_optimized)
recall_optimized = recall_score(y_test, y_pred_optimized)
roc_auc_optimized = roc_auc_score(y_test, y_pred_optimized)

(accuracy_optimized, precision_optimized, recall_optimized, roc_auc_optimized)


(0.8121154756270705, 0.683991683991684, 0.573170731707317, 0.7372026498042757)

The performance of the optimized Logistic Regression model is as follows:

- Accuracy: 81.21% (a slight improvement from 81.12% before tuning)
- Precision: 68.40% (improved from 68.19%)
- Recall: 57.32% (improved from 57.14%)
- ROC-AUC Score: 73.72% (improved from 73.60%)

The gains are marginal: the accuracy difference corresponds to two of the 2,113 test customers.

### Exploring feature engineering to potentially improve the model's performance

**Feature engineering** involves creating new features or modifying existing ones to better capture the underlying patterns in the data. 
Here are some common approaches:
- **Interaction Terms**: Create new features that are combinations of existing features, which might reveal new insights when their effects are combined.
- **Polynomial Features**: Generate polynomial and interaction features. Sometimes relationships between features and the target are not linear and can be better captured with polynomial terms.
- **Binning**: Convert continuous variables into categorical variables by grouping them into bins. This can be useful for capturing non-linear relationships.
- **Aggregations**: For datasets with multiple entries per subject, creating aggregate features (like mean, sum, max) can provide useful insights.


Given the nature of our dataset, let's explore the creation of interaction terms and binning as our feature engineering strategies. 

Specifically, we could look at interactions between tenure and other features like **MonthlyCharges or TotalCharges**, as the length of time a customer has been with the company could influence the impact of charges on their decision to churn. 

We can also bin the tenure feature to categorize customers as **'new', 'medium-term', 'long-term', and 'very-long-term'**.
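The bin edges are expressed in months, so they apply to the unscaled tenure values. A minimal sketch with made-up tenures (the Series `tenure_months` is hypothetical):

```python
import pandas as pd

# Hypothetical tenure values in months (unscaled)
tenure_months = pd.Series([0, 5, 12, 13, 36, 48, 60, 72], name="tenure")

# Right-closed bins: (-1, 12], (12, 36], (36, 60], (60, 1000]
tenure_group = pd.cut(
    tenure_months,
    bins=[-1, 12, 36, 60, 1000],
    labels=["new", "medium-term", "long-term", "very-long-term"],
)

# One indicator column per bin
tenure_dummies = pd.get_dummies(tenure_group, prefix="tenure", dtype=int)

print(tenure_group.tolist())
```

The lower edge of -1 ensures that customers with a tenure of 0 months fall into the first bin.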

In [10]:
from sklearn.preprocessing import PolynomialFeatures

# Create interaction terms
interaction_transformer = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = interaction_transformer.fit_transform(X[['tenure', 'MonthlyCharges', 'TotalCharges']])

# Convert the interaction terms into a DataFrame
interaction_columns = interaction_transformer.get_feature_names_out(['tenure', 'MonthlyCharges', 'TotalCharges'])
X_interactions_df = pd.DataFrame(X_interactions, columns=interaction_columns)

# Binning the 'tenure' feature
tenure_bins = pd.cut(X['tenure'], bins=[-1, 12, 36, 60, 1000], labels=['new', 'medium-term', 'long-term', 'very-long-term'])
X_binned = pd.get_dummies(tenure_bins)

# Combine the original features with the new engineered features
X_engineered = pd.concat([X, X_interactions_df, X_binned], axis=1)

# Split the engineered dataset into training and testing sets
X_train_eng, X_test_eng, y_train_eng, y_test_eng = train_test_split(X_engineered, y, test_size=0.3, random_state=42)

# Retrain the Logistic Regression model with the engineered features
log_reg_eng = LogisticRegression(max_iter=1000, random_state=42)
log_reg_eng.fit(X_train_eng, y_train_eng)

# Make predictions and evaluate the model
y_pred_eng = log_reg_eng.predict(X_test_eng)
accuracy_eng = accuracy_score(y_test_eng, y_pred_eng)
precision_eng = precision_score(y_test_eng, y_pred_eng)
recall_eng = recall_score(y_test_eng, y_pred_eng)
roc_auc_eng = roc_auc_score(y_test_eng, y_pred_eng)

(accuracy_eng, precision_eng, recall_eng, roc_auc_eng)


(0.8045433033601515,
 0.6792873051224945,
 0.5313588850174216,
 0.7188958167777166)

After incorporating the engineered features, including interaction terms and binned tenure, the performance of the Logistic Regression model is as follows:

- Accuracy: 80.45% (a slight decrease from the optimized model's 81.21%)
- Precision: 67.93% (a slight decrease from 68.40%)
- Recall: 53.14% (a decrease from 57.32%)
- ROC-AUC Score: 71.89% (a decrease from 73.72%)

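These results suggest that the specific feature engineering strategies applied didn't improve the model's performance and slightly decreased every metric.

Two details of the cell above limit what can be concluded. First, `X['tenure']` had already been standardized, so it lies roughly between -1.3 and 1.6; with bin edges given in months (12, 36, 60), every customer lands either in the 'new' bin or outside all bins, and the other three indicator columns are all zeros. Second, `PolynomialFeatures` returns the three input columns along with their pairwise products, so `tenure`, `MonthlyCharges`, and `TotalCharges` appear twice in `X_engineered`. Binning the unscaled tenure and keeping only the product columns would be a fairer test.

This outcome highlights a key aspect of feature engineering: it's an iterative and experimental process, where not all changes lead to improvements.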