# Customer Churn Prediction System for Telecom Services

## Overview
This project aims to predict customer churn for a telecom service provider. By analyzing historical customer data, the model identifies customers who are likely to leave (churn), allowing the company to take proactive steps to retain them and improve customer satisfaction.

## Features
- **Predicts churn likelihood** based on customer demographics, service details, and billing information.
- **Processes raw data** by cleaning text, handling missing values, and converting categorical variables into numerical format.
- **Builds a predictive model** using logistic regression and evaluates its performance with cross-validation.
- **Achieves a ROC AUC of about 0.858** on the held-out test set, helping the company prioritize retention efforts effectively.

## Data Processing
The project uses customer data containing:
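- **Demographics**: Gender, senior citizen status, partner, dependents
- **Service details**: Tenure, phone service, internet service, streaming services, and add-ons such as online security and tech support
- **Billing information**: Contract type, payment method, monthly charges, total charges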
- **Churn status**: Whether the customer has left the service

Data is preprocessed by:
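- Cleaning text data (lowercasing column names and string values, replacing spaces with underscores)
- Handling missing values (`totalcharges` entries that cannot be parsed as numbers are coerced to NaN and filled with 0)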
- Encoding categorical features into numerical values

## Model Building
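- The dataset is **split into training and testing sets** (80% / 20%, `random_state=1`).
- A **logistic regression model** is trained on one-hot encoded features to identify churn patterns.
- The model is evaluated using **5-fold cross-validation** on the training portion, scored by ROC AUC, to check that performance is stable across folds.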
- Predictions are generated to rank customers by their likelihood to churn.
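
The ranking step in the last bullet is not shown in the notebook cells. A minimal sketch of how it could look, assuming the predicted probabilities are already available as an array. The `customerid` values and the scores below are made up purely for illustration, and the cut-off of two customers is an arbitrary assumption:

```python
import numpy as np
import pandas as pd

# Illustrative inputs only: hypothetical customer IDs and made-up probabilities,
# standing in for the output of predict(df, dv, model).
customers = pd.DataFrame({'customerid': ['a-001', 'b-002', 'c-003', 'd-004']})
y_pred = np.array([0.12, 0.81, 0.47, 0.66])

# Attach the scores and sort so the most at-risk customers come first.
ranked = (
    customers
    .assign(churn_probability=y_pred)
    .sort_values('churn_probability', ascending=False)
    .reset_index(drop=True)
)

# Keep the top-k customers for a retention campaign (k=2 is an assumption).
top_k = ranked.head(2)
print(top_k)
```

Sorting by probability rather than by a hard yes/no label lets the retention budget decide how far down the list to go.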

## Results
- The model achieves a **ROC AUC of about 0.858** on the held-out test set (a ranking metric, not classification accuracy).
- This allows the telecom company to:
  - Identify at-risk customers
  - Implement targeted retention strategies
  - Improve overall customer experience
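
The evaluation cells in this notebook score the model with `roc_auc_score`, which measures how well the predicted probabilities rank churners above non-churners. Accuracy, by contrast, depends on a decision threshold. The toy example below uses made-up labels and scores, with a 0.5 threshold as an assumption, to show that the two metrics can differ on the same predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels and scores, for illustration only (not project results).
y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.10, 0.20, 0.30, 0.40, 0.35, 0.45])

# AUC: fraction of (churner, non-churner) pairs ranked in the right order.
auc = roc_auc_score(y_true, y_prob)

# Accuracy: fraction of correct hard labels at an assumed 0.5 threshold.
acc = accuracy_score(y_true, (y_prob >= 0.5).astype(int))

print('AUC: %.3f, accuracy: %.3f' % (auc, acc))
```

Here 7 of the 8 churner/non-churner pairs are ordered correctly (AUC 0.875), while every score falls below 0.5, so all customers are labeled as staying and accuracy is 4/6.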

## Initial Setup and Imports

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

## Data Preprocessing Code

In [12]:
# Load and clean data
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Clean categorical columns
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')

# Handle numeric data
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)

# Convert target variable
df.churn = (df.churn == 'yes').astype(int)
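
The `errors='coerce'` step matters because `totalcharges` is loaded as text whenever some entries are not valid numbers. A small self-contained illustration on made-up values, where `'_'` stands in for a blank entry after the space-to-underscore cleaning above (the exact form of such entries in the real file is an assumption):

```python
import pandas as pd

# Made-up values; '_' mimics a blank entry after spaces were replaced.
s = pd.Series(['29.85', '_', '108.15'])

# Unparseable strings become NaN instead of raising an error.
numeric = pd.to_numeric(s, errors='coerce')
n_missing = int(numeric.isnull().sum())

# Same fill strategy as the notebook: replace NaN with 0.
filled = numeric.fillna(0)
print(n_missing, filled.tolist())
```

Filling with 0 is the choice made in this notebook; counting `n_missing` first is a cheap way to see how many rows that choice affects.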

## Feature Definition

In [None]:
# Split features into numerical and categorical
numerical = ['tenure', 'monthlycharges', 'totalcharges']

categorical = [
    'gender',
    'seniorcitizen',
    'partner',
    'dependents',
    'phoneservice',
    'multiplelines',
    'internetservice',
    'onlinesecurity',
    'onlinebackup',
    'deviceprotection',
    'techsupport',
    'streamingtv',
    'streamingmovies',
    'contract',
    'paperlessbilling',
    'paymentmethod',
]

## Model Functions

In [13]:
def train(df_train, y_train, C=1.0):
    dicts = df_train[categorical + numerical].to_dict(orient='records')
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)
    model = LogisticRegression(C=C, max_iter=10000)
    model.fit(X_train, y_train)
    return dv, model

def predict(df, dv, model):
    dicts = df[categorical + numerical].to_dict(orient='records')
    X = dv.transform(dicts)
    y_pred = model.predict_proba(X)[:, 1]
    return y_pred

## Model Validation

In [14]:
# Split data
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

# Cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = []

for train_idx, val_idx in kfold.split(df_full_train):
    df_train = df_full_train.iloc[train_idx]
    df_val = df_full_train.iloc[val_idx]
    
    y_train = df_train.churn.values
    y_val = df_val.churn.values
    
    dv, model = train(df_train, y_train, C=1.0)
    y_pred = predict(df_val, dv, model)
    
    auc = roc_auc_score(y_val, y_pred)
    scores.append(auc)
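
The loop above collects one AUC per fold in `scores` but never reports them. One common way to summarize such a list is the mean and standard deviation across folds. In this sketch the `summarize` helper is hypothetical, and the values in `example_scores` are placeholders, not results from this notebook:

```python
import numpy as np

def summarize(scores):
    """Return (mean, standard deviation) of a list of per-fold scores."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std()

# Placeholder values for illustration only; in the notebook, pass `scores`.
example_scores = [0.80, 0.82, 0.84, 0.86, 0.88]
mean_auc, std_auc = summarize(example_scores)
print('AUC: %.3f +- %.3f' % (mean_auc, std_auc))
```

A small standard deviation relative to the mean suggests the model's performance does not depend much on which rows land in the validation fold.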

## Final Model and Evaluation

In [16]:
# Train final model on full training data
dv, model = train(df_full_train, df_full_train.churn.values, C=1.0)

# Evaluate on test set
y_pred = predict(df_test, dv, model)
y_test = df_test.churn.values
auc = roc_auc_score(y_test, y_pred)
auc

0.8583788336745859

## Save the model

In [23]:
import pickle

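# Save the trained model to disk
C = 1.0  # regularization strength used to train the final model above
output_file = f'model_C={C}.bin'
output_file

# Store the vectorizer together with the model: both are needed to predict
with open(output_file, 'wb') as f_out:
    pickle.dump((dv, model), f_out)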

## Load the model

In [None]:
# Retrieve the trained model from disk
input_file = 'model_C=1.0.bin'

with open(input_file, 'rb') as f_in: 
    dv, model = pickle.load(f_in)

dv, model

(DictVectorizer(sparse=False), LogisticRegression(max_iter=10000))

In [None]:
# Checking model on one sample
customer = {
    'gender': 'female',
    'seniorcitizen': 0,
    'partner': 'yes',
    'dependents': 'no',
    'phoneservice': 'no',
    'multiplelines': 'no_phone_service',
    'internetservice': 'dsl',
    'onlinesecurity': 'no',
    'onlinebackup': 'yes',
    'deviceprotection': 'no',
    'techsupport': 'no',
    'streamingtv': 'no',
    'streamingmovies': 'no',
    'contract': 'month-to-month',
    'paperlessbilling': 'yes',
    'paymentmethod': 'electronic_check',
    'tenure': 1,
    'monthlycharges': 29.85,
    'totalcharges': 29.85
}

X = dv.transform([customer])
y_pred = model.predict_proba(X)[0, 1]

print('input:', customer)
print('output:', y_pred)

input: {'gender': 'female', 'seniorcitizen': 0, 'partner': 'yes', 'dependents': 'no', 'phoneservice': 'no', 'multiplelines': 'no_phone_service', 'internetservice': 'dsl', 'onlinesecurity': 'no', 'onlinebackup': 'yes', 'deviceprotection': 'no', 'techsupport': 'no', 'streamingtv': 'no', 'streamingmovies': 'no', 'contract': 'month-to-month', 'paperlessbilling': 'yes', 'paymentmethod': 'electronic_check', 'tenure': 1, 'monthlycharges': 29.85, 'totalcharges': 29.85}
output: 0.628486084102498


## Making requests

In [None]:
import requests

url = 'http://localhost:9696/predict'

# Example user
customer = {
    'gender': 'female',
    'seniorcitizen': 0,
    'partner': 'yes',
    'dependents': 'no',
    'phoneservice': 'no',
    'multiplelines': 'no_phone_service',
    'internetservice': 'dsl',
    'onlinesecurity': 'no',
    'onlinebackup': 'yes',
    'deviceprotection': 'no',
    'techsupport': 'no',
    'streamingtv': 'no',
    'streamingmovies': 'no',
    'contract': 'one_year',
    'paperlessbilling': 'yes',
    'paymentmethod': 'electronic_check',
    'tenure': 1,
    'monthlycharges': 29.85,
    'totalcharges': 29.85
}

# Make Request and Get Response
response = requests.post(url, json=customer).json()
response

{'churn': False, 'churn_probability': 0.47102333925429607}
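
The service behind `http://localhost:9696/predict` is not shown in this notebook. Below is a minimal sketch of the logic such an endpoint might run, assuming it holds the unpickled `(dv, model)` pair and applies a 0.5 decision threshold. The `predict_single` helper and the threshold are assumptions; only the response keys mirror the output above. A toy model trained on made-up records stands in for the real one so the block runs on its own:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def predict_single(customer, dv, model, threshold=0.5):
    """Score one customer dict and build a JSON-serializable response."""
    X = dv.transform([customer])
    y_pred = float(model.predict_proba(X)[0, 1])
    # Cast to plain Python types so the result can be serialized as JSON.
    return {'churn': bool(y_pred >= threshold), 'churn_probability': y_pred}

# Toy stand-in for the pickled (dv, model) pair: made-up records and labels.
toy_records = [
    {'contract': 'month-to-month', 'tenure': 1},
    {'contract': 'month-to-month', 'tenure': 2},
    {'contract': 'two_year', 'tenure': 60},
    {'contract': 'two_year', 'tenure': 70},
]
toy_labels = [1, 1, 0, 0]

toy_dv = DictVectorizer(sparse=False)
toy_model = LogisticRegression(max_iter=10000)
toy_model.fit(toy_dv.fit_transform(toy_records), toy_labels)

result = predict_single({'contract': 'month-to-month', 'tenure': 1}, toy_dv, toy_model)
print(result)
```

In a real service this function would be wrapped in a web framework route and called with the parsed JSON body of the POST request.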