# Assessing Customer Churn Using Machine Learning

The telecommunications (telecom) sector in India is changing rapidly: new telecom businesses keep entering the market, and many customers switch between providers. "Churn" refers to customers or subscribers ceasing to use a company's services or products. Understanding the factors that influence customer retention, and using them to predict churn, is crucial for telecom companies that want to improve their service quality and customer satisfaction. As the data scientist on this project, you aim to explore how customer behavior and demographics in the Indian telecom sector relate to churn, using two datasets from four major telecom partners (Airtel, Reliance Jio, Vodafone, and BSNL):

- `telecom_demographics.csv` contains information related to Indian customer demographics:

| Variable             | Description                                                 |
|----------------------|-------------------------------------------------------------|
| `customer_id`        | Unique identifier for each customer.                        |
| `telecom_partner`    | The telecom partner associated with the customer.           |
| `gender`             | The gender of the customer.                                 |
| `age`                | The age of the customer.                                    |
| `state`              | The Indian state in which the customer is located.          |
| `city`               | The city in which the customer is located.                  |
| `pincode`            | The pincode of the customer's location.                     |
| `registration_event` | When the customer registered with the telecom partner.      |
| `num_dependents`     | The number of dependents (e.g., children) the customer has. |
| `estimated_salary`   | The customer's estimated salary.                            |

- `telecom_usage.csv` contains information about the usage patterns of Indian customers:

| Variable   | Description                                                  |
|------------|--------------------------------------------------------------|
| `customer_id` | Unique identifier for each customer.                         |
| `calls_made` | The number of calls made by the customer.                    |
| `sms_sent`   | The number of SMS messages sent by the customer.             |
| `data_used`  | The amount of data used by the customer.                     |
| `churn`    | Binary variable indicating whether the customer has churned or not (1 = churned, 0 = not churned).|


### Project instructions
Does Logistic Regression or Random Forest produce a higher accuracy score in predicting telecom churn in India?

In [101]:
# Import libraries and methods/functions
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score

### 1. Loading and exploring data

In [102]:
# load datasets
df_demographics = pd.read_csv('telecom_demographics.csv')
#df_demographics.info()
#print(df_demographics.head())

df_usage = pd.read_csv('telecom_usage.csv')
#df_usage.info()
#print(df_usage.head())

# merge datasets
churn_df = pd.merge(df_demographics, df_usage, on='customer_id')
#churn_df.info()

# calculate proportion of customers who have churned
print(f"Number of customers churned: {churn_df['churn'].sum()}")
print(f"Proportion of customers churned: {churn_df['churn'].mean()}")

Number of customers churned: 1303
Proportion of customers churned: 0.20046153846153847
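
The merge above relies on `customer_id` being unique in both files and present in both. A small sketch of how that assumption could be checked, using hypothetical toy frames (`demo`, `usage`) rather than the real CSVs; `validate` and `indicator` are standard `pd.merge` parameters:

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files
demo = pd.DataFrame({"customer_id": [1, 2, 3], "age": [26, 74, 54]})
usage = pd.DataFrame({"customer_id": [1, 2, 4], "churn": [0, 1, 0]})

# validate raises a MergeError if customer_id is duplicated on either side;
# indicator adds a _merge column showing which side each row came from
merged = pd.merge(demo, usage, on="customer_id", how="outer",
                  validate="one_to_one", indicator=True)

# Rows that exist in only one of the two frames
unmatched = merged[merged["_merge"] != "both"]
print(f"Rows without a match: {len(unmatched)}")
```

With the default inner join used in the notebook, unmatched customers would be dropped silently, so a check like this is only a diagnostic and does not change the merged result.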


In [103]:
#churn_df.describe()

### 2. Processing data
- define target variable
- convert categorical data to numerical representation
- handle datetime columns
- remove redundant features
- standardize relevant features

In [104]:
# Define target variable
target = churn_df['churn']

In [105]:
# Determine categorical variables
categorical_features = ['telecom_partner', 'gender', 'state', 'city']
#for feature in categorical_features:
#    display(churn_df[feature].value_counts())

# convert categorical variables to numeric
dummies = pd.get_dummies(churn_df[categorical_features], drop_first=True)
#dummies.shape
#dummies.head()

# change registration_event to datetime
churn_df['registration_event'] = pd.to_datetime(churn_df['registration_event'])

# calculate days_since_registration 
churn_df['days_since_registration'] = (datetime.now() - churn_df['registration_event']).dt.days

# Drop irrelevant columns

# drop registration_event column
churn_df = churn_df.drop('registration_event', axis='columns')

# drop column 'pincode'
#churn_df['pincode'].value_counts()
churn_df = churn_df.drop('pincode', axis='columns')

# drop columns 'churn' and 'customer_id'
churn_df = churn_df.drop(['churn', 'customer_id'], axis='columns')
print(churn_df.shape)
churn_df.head()

(6500, 11)


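|   | telecom_partner | gender | age | state | city | num_dependents | estimated_salary | calls_made | sms_sent | data_used | days_since_registration |
|---|-----------------|--------|-----|-------|------|----------------|------------------|------------|----------|-----------|-------------------------|
| 0 | Airtel | F | 26 | Himachal Pradesh | Delhi | 4 | 85979 | 75 | 21 | 4532 | 1985 |
| 1 | Airtel | F | 74 | Uttarakhand | Hyderabad | 0 | 69445 | 35 | 38 | 723 | 1314 |
| 2 | Airtel | F | 54 | Jharkhand | Chennai | 2 | 75949 | 70 | 47 | 4688 | 1319 |
| 3 | Reliance Jio | M | 29 | Bihar | Hyderabad | 3 | 34272 | 95 | 32 | 10241 | 1123 |
| 4 | Vodafone | M | 45 | Nagaland | Bangalore | 4 | 34157 | 66 | 23 | 5246 | 1990 |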
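
Because `days_since_registration` is computed against `datetime.now()`, its raw values change every day the notebook is rerun. A minimal sketch of a reproducible variant, assuming a fixed snapshot date; both the `reference_date` value and the toy registration dates are made up for illustration:

```python
import pandas as pd

# Hypothetical registration dates; the real column comes from the CSV
reg = pd.Series(pd.to_datetime(["2020-01-01", "2021-06-15", "2022-12-31"]))

# Assumed snapshot date, chosen only for illustration
reference_date = pd.Timestamp("2023-01-01")

# Whole days between registration and the fixed snapshot date
days_since_registration = (reference_date - reg).dt.days
print(days_since_registration.tolist())
```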


In [106]:
# scale relevant numeric columns
numeric_columns = ['age', 'num_dependents', 'estimated_salary', 'calls_made', 'sms_sent', 'data_used', 'days_since_registration']
scaler = StandardScaler()
scaled_num = pd.DataFrame(
    scaler.fit_transform(churn_df[numeric_columns]),
    columns=numeric_columns,
    index=churn_df.index)

In [107]:
features_scaled = pd.concat([scaled_num, dummies], axis=1)

In [108]:
print("Features shape:", features_scaled.shape)
print("Target shape:", target.shape)
display(features_scaled.head())

Features shape: (6500, 43)
Target shape: (6500,)


Unnamed: 0,age,num_dependents,estimated_salary,calls_made,sms_sent,data_used,days_since_registration,telecom_partner_BSNL,telecom_partner_Reliance Jio,telecom_partner_Vodafone,gender_M,state_Arunachal Pradesh,state_Assam,state_Bihar,state_Chhattisgarh,state_Goa,state_Gujarat,state_Haryana,state_Himachal Pradesh,state_Jharkhand,state_Karnataka,state_Kerala,state_Madhya Pradesh,state_Maharashtra,state_Manipur,state_Meghalaya,state_Mizoram,state_Nagaland,state_Odisha,state_Punjab,state_Rajasthan,state_Sikkim,state_Tamil Nadu,state_Telangana,state_Tripura,state_Uttar Pradesh,state_Uttarakhand,state_West Bengal,city_Chennai,city_Delhi,city_Hyderabad,city_Kolkata,city_Mumbai
0,-1.22297,1.436539,0.011981,0.846076,-0.222385,-0.159488,1.506077,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
1,1.696304,-1.411346,-0.428423,-0.496344,0.938056,-1.454896,-0.399641,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
2,0.47994,0.012596,-0.255181,0.678273,1.552407,-0.106434,-0.385441,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,-1.040515,0.724568,-1.365302,1.517286,0.528489,1.782094,-0.942103,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,-0.067424,1.436539,-1.368365,0.544031,-0.085862,0.083337,1.520278,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
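
The scaler above is fitted on all 6,500 rows before the split, so statistics from the later test rows feed into the training features. One possible alternative, sketched here on a hypothetical toy frame, is to wrap `StandardScaler` and the already imported `OneHotEncoder` in a `ColumnTransformer` that is fitted on the training rows only. The toy data and column subsets are assumptions, not the notebook's actual pipeline:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame that mimics a few columns of the merged data
toy = pd.DataFrame({
    "age": [26, 74, 54, 29, 45, 33, 61, 38],
    "calls_made": [75, 35, 70, 95, 66, 12, 48, 80],
    "telecom_partner": ["Airtel", "Airtel", "BSNL", "Reliance Jio",
                        "Vodafone", "BSNL", "Vodafone", "Airtel"],
    "gender": ["F", "F", "F", "M", "M", "M", "F", "M"],
})
toy_target = pd.Series([0, 1, 0, 0, 1, 0, 0, 1], name="churn")

num_cols = ["age", "calls_made"]
cat_cols = ["telecom_partner", "gender"]

# Scale numeric columns, one-hot encode categorical ones;
# categories unseen during fitting are encoded as all zeros
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_target, test_size=0.25, random_state=42
)

# Fit on the training rows only, then reuse the fitted transformer on the test rows
X_tr_prepared = preprocessor.fit_transform(X_tr)
X_te_prepared = preprocessor.transform(X_te)
print(X_tr_prepared.shape, X_te_prepared.shape)
```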


### 3. Train-test-split

In [109]:
# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    features_scaled, 
    target, 
    test_size=0.2,      # 20% for testing
    random_state=42,    # ensures reproducibility
)

print("Train shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)

Train shape: (5200, 43) (5200,)
Test shape: (1300, 43) (1300,)


In [110]:
# investigate the class imbalance in train and test set
print(f"Proportion of customers churned in train_set: {y_train.mean()}")
print(f"Proportion of customers churned in test_set: {y_test.mean()}")

Proportion of customers churned in train_set: 0.19807692307692307
Proportion of customers churned in test_set: 0.21
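
The split above does not stratify on the target, so the churn proportion can drift between the training and test sets. A minimal sketch of a stratified split, using synthetic labels with a 20% positive rate instead of the real data; `stratify` is a standard `train_test_split` parameter:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 20% positive rate (not the real churn data)
rng = np.random.default_rng(42)
y = np.zeros(1000, dtype=int)
y[:200] = 1
rng.shuffle(y)
X = rng.normal(size=(1000, 3))  # placeholder features

# stratify=y keeps the class proportions equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"train: {y_tr.mean():.3f}, test: {y_te.mean():.3f}")
```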


### 4. Training the models and getting predictions

In [111]:
# instantiate and fit each model

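# logistic regression
# (fitted here with RidgeClassifier, a ridge-penalized linear classifier, not sklearn's LogisticRegression)
log_reg = RidgeClassifier(random_state = 42)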
log_reg.fit(X_train, y_train)
logreg_pred = log_reg.predict(X_test)
print("Show value counts of predictions")
print(np.unique(logreg_pred, return_counts=True))

# random forest
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)
print("Show value counts of predictions")
print(np.unique(rf_pred, return_counts=True))

Show value counts of predictions
(array([0]), array([1300]))
Show value counts of predictions
(array([0, 1]), array([1299,    1]))


### 5. Assessing the models

In [112]:
# confusion matrices
## logistic regression
print("Performance of logistic regression")
cm = confusion_matrix(y_test, logreg_pred)
# print confusion matrix with labels
cm_df = pd.DataFrame(
    cm,
    index=[f"Actual {label}" for label in log_reg.classes_],
    columns=[f"Predicted {label}" for label in log_reg.classes_]
)
print(cm_df)
print(classification_report(y_test, logreg_pred))

## random forest
print("Performance of random forest")
cm = confusion_matrix(y_test, rf_pred)
cm_df = pd.DataFrame(
    cm,
    index=[f"Actual {label}" for label in rf_clf.classes_],
    columns=[f"Predicted {label}" for label in rf_clf.classes_]
)
print(cm_df)
print(classification_report(y_test, rf_pred))

Performance of logistic regression
          Predicted 0  Predicted 1
Actual 0         1027            0
Actual 1          273            0
              precision    recall  f1-score   support

           0       0.79      1.00      0.88      1027
           1       0.00      0.00      0.00       273

    accuracy                           0.79      1300
   macro avg       0.40      0.50      0.44      1300
weighted avg       0.62      0.79      0.70      1300

Performance of random forest
          Predicted 0  Predicted 1
Actual 0         1026            1
Actual 1          273            0
              precision    recall  f1-score   support

           0       0.79      1.00      0.88      1027
           1       0.00      0.00      0.00       273

    accuracy                           0.79      1300
   macro avg       0.39      0.50      0.44      1300
weighted avg       0.62      0.79      0.70      1300



## Preliminary conclusion:
Neither model copes with the class imbalance: logistic regression predicts 0 for every observation, and the random forest predicts 0 for all but one. The accuracy of 79% looks good but is misleading, because simply predicting 0 for every observation yields the same score (1027 of the 1300 test customers did not churn).
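
To make the point concrete, the sketch below rebuilds the test labels from the support counts reported above (1027 non-churners, 273 churners) and scores a majority-class baseline. `DummyClassifier` is used purely as an illustration; it is not part of the original analysis:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Labels rebuilt from the test-set support reported above: 1027 zeros, 273 ones
y_true = np.array([0] * 1027 + [1] * 273)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by this baseline

# Always predicts the most frequent class; fitted on the same labels
# only to learn which class that is
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
print(f"accuracy: {acc:.2f}, balanced accuracy: {bal_acc:.2f}")
```

Balanced accuracy averages the recall of both classes, so a model that never predicts churn scores only 0.5 on it despite the high plain accuracy.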

### Solution:
First, I have to handle the class imbalance in the training set. The simplest fix is to undersample the majority class. Churn occurs in about 20% of cases and the training set contains more than 5,000 observations, so undersampling still leaves enough data (roughly 2,000 observations) to train the models.

### 6a. Undersampling majority class

In [113]:
np.random.seed(42)

# undersample X_train and y_train
# Find the class counts
class_counts = y_train.value_counts()
print("Class distribution before undersampling:\n", class_counts)

# Identify majority and minority classes
majority_class = class_counts.idxmax()
minority_class = class_counts.idxmin()
print(f"Majority class: {majority_class}, Minority class: {minority_class}")

# Get indices of the majority and minority class
majority_indices = y_train[y_train == majority_class].index
minority_indices = y_train[y_train == minority_class].index

# Randomly sample majority class to match minority size
undersampled_majority_indices = np.random.choice(
    majority_indices,
    size=len(minority_indices),
    replace=False
)

print(f"Number of samples from majority class: {len(undersampled_majority_indices)}")

# Combine undersampled majority + all minority
undersampled_indices = np.concatenate([undersampled_majority_indices, minority_indices])

# Subset the training data
X_train_under = X_train.loc[undersampled_indices]
y_train_under = y_train.loc[undersampled_indices]

print(f'Shape of train data: {X_train_under.shape}')
print("Class distribution after undersampling:\n", y_train_under.value_counts())

Class distribution before undersampling:
 0    4170
1    1030
Name: churn, dtype: int64
Majority class: 0, Minority class: 1
Number of samples from majority class: 1030
Shape of train data: (2060, 43)
Class distribution after undersampling:
 0    1030
1    1030
Name: churn, dtype: int64
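
The same undersampling could be written more compactly with pandas. This is a sketch on synthetic labels that only reproduce the class counts printed above; it draws a different random subset than the `np.random.choice` version, so results would not match exactly:

```python
import pandas as pd

# Synthetic labels with the same class counts as the training set above
y = pd.Series([0] * 4170 + [1] * 1030, name="churn")

# Size of the smaller class
n_minority = y.value_counts().min()

# Draw n_minority rows from each class without replacement
balanced_idx = y.groupby(y).sample(n=n_minority, random_state=42).index
y_under = y.loc[balanced_idx]
print(y_under.value_counts())
```

On the real data, `X_train.loc[balanced_idx]` would select the matching feature rows, because the features and the target share the same index.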


### 6b. Fit the models again

In [114]:
# instantiate and fit each model

# logistic regression
log_reg = RidgeClassifier(random_state = 42)
log_reg.fit(X_train_under, y_train_under)
logreg_pred = log_reg.predict(X_test)
print("Show value counts of predictions")
print(np.unique(logreg_pred, return_counts=True))

# random forest
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train_under, y_train_under)
rf_pred = rf_clf.predict(X_test)
print("Show value counts of predictions")
print(np.unique(rf_pred, return_counts=True))

Show value counts of predictions
(array([0, 1]), array([646, 654]))
Show value counts of predictions
(array([0, 1]), array([648, 652]))


In [115]:
# confusion matrices
## logistic regression
print("Performance of logistic regression")
print(f"Precision: {precision_score(y_test, logreg_pred)}")
print(f"Recall: {recall_score(y_test, logreg_pred)}")
cm = confusion_matrix(y_test, logreg_pred)
# print confusion matrix with labels
cm_df = pd.DataFrame(
    cm,
    index=[f"Actual {label}" for label in log_reg.classes_],
    columns=[f"Predicted {label}" for label in log_reg.classes_]
)
print(cm_df)
print(classification_report(y_test, logreg_pred))

## random forest
print("Performance of random forest")
print(f"Precision: {precision_score(y_test, rf_pred)}")
print(f"Recall: {recall_score(y_test, rf_pred)}")
cm = confusion_matrix(y_test, rf_pred)
cm_df = pd.DataFrame(
    cm,
    index=[f"Actual {label}" for label in rf_clf.classes_],
    columns=[f"Predicted {label}" for label in rf_clf.classes_]
)
print(cm_df)
print(classification_report(y_test, rf_pred))

Performance of logistic regression
Precision: 0.2018348623853211
Recall: 0.4835164835164835
          Predicted 0  Predicted 1
Actual 0          505          522
Actual 1          141          132
              precision    recall  f1-score   support

           0       0.78      0.49      0.60      1027
           1       0.20      0.48      0.28       273

    accuracy                           0.49      1300
   macro avg       0.49      0.49      0.44      1300
weighted avg       0.66      0.49      0.54      1300

Performance of random forest
Precision: 0.200920245398773
Recall: 0.47985347985347987
          Predicted 0  Predicted 1
Actual 0          506          521
Actual 1          142          131
              precision    recall  f1-score   support

           0       0.78      0.49      0.60      1027
           1       0.20      0.48      0.28       273

    accuracy                           0.49      1300
   macro avg       0.49      0.49      0.44      1300
weighted avg 

## Conclusion:
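After undersampling, the two models perform almost identically: both reach an accuracy of 0.49 on the test set, so neither wins on accuracy.
I am mostly interested in identifying the customers who actually churn, which means that I accept more False Positives in exchange for more True Positives. Thus I need the model with the higher recall.  
Logistic regression has a marginally higher recall (0.484 vs. 0.480, a difference of one correctly identified churner), so it performs best by this criterion. Both models have a precision of about 0.20, which is close to the churn rate in the test set (0.21).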
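
The comparison can be reproduced directly from the confusion-matrix counts printed above. `summarize` is a hypothetical helper written for this sketch, not a scikit-learn function:

```python
def summarize(tn, fp, fn, tp):
    """Return accuracy, precision, and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Counts copied from the two confusion matrices after undersampling
logreg = summarize(tn=505, fp=522, fn=141, tp=132)
forest = summarize(tn=506, fp=521, fn=142, tp=131)

for name, scores in [("logistic regression", logreg), ("random forest", forest)]:
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Both models classify 637 of the 1300 test customers correctly, which is why the accuracy is tied and recall is the deciding metric.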