<a href="https://colab.research.google.com/github/aminayusif/Retentify/blob/main/Retentify_Customer_Retention_And_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction

This notebook develops **Retentify**, a system that analyzes telco customer churn and recommends targeted actions for retention and value enhancement. It combines data analysis, machine learning, and customer segmentation to identify customers at risk of churning and to suggest personalized interventions.

The system is built in several stages:

1.  **Data Loading and Preprocessing:** Initial loading of the telco customer churn dataset, followed by essential cleaning and preprocessing steps to handle missing values, correct data types, and standardize categorical features.
2.  **Exploratory Data Analysis (EDA):** Conducting univariate and bivariate analysis, correlation analysis, and customer segmentation to gain insights into churn drivers and identify distinct customer groups.
3.  **Feature Engineering:** Creating new features from the raw data to improve the performance of predictive models and enhance the understanding of customer behavior.
4.  **Churn Prediction Model Development:** Building and evaluating machine learning models (Logistic Regression, Random Forest, XGBoost) to predict customer churn probability, including addressing class imbalance and hyperparameter tuning.
5.  **Model Explainability and Rule Extraction:** Analyzing the trained churn model to understand the key factors influencing churn predictions using techniques like feature importance and SHAP values, and translating these insights into actionable business rules.
6.  **Customer Segmentation:** Applying clustering techniques to group customers into distinct segments based on their characteristics and behavior.
7.  **Recommendation System Design:** Developing a hybrid recommendation system that combines rule-based recommendations (derived from EDA and model explainability) and collaborative filtering (using matrix factorization) to provide personalized suggestions.
8.  **API Development:** Creating a web API using FastAPI to serve the churn predictions and recommendations.

Retentify provides a data-driven approach to proactively manage customer churn, allowing telco companies to implement targeted strategies to retain valuable customers and optimize their service offerings.

### Data Loading and Preprocessing

In [5]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/WA_Fn-UseC_-Telco-Customer-Churn.csv')

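# Quick integrity checks
# Note: only df.info() prints here; the other bare expressions are evaluated
# but not displayed in a notebook unless wrapped in print() or display()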
df.head()
df.info()
df.describe()
df.isna().sum()

# Check for duplicates
print(f"Number of duplicate rows: {df.duplicated().sum()}")

# Check data types
print("\nData types:\n", df.dtypes)

# Check for unexpected types (specifically TotalCharges)
print("\nUnique values in TotalCharges:\n", df['TotalCharges'].unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


#### Fix data types

In [6]:
# Convert 'TotalCharges' to numeric, coercing errors to NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Fill NaN values produced by the coercion (non-numeric blank entries) with 0
df['TotalCharges'] = df['TotalCharges'].fillna(0)

# Check data types again to confirm the change
print("\nData types after converting TotalCharges:\n", df.dtypes)

# Convert the 'SeniorCitizen' column to boolean
df['SeniorCitizen'] = df['SeniorCitizen'].astype(bool)

# Check unique values for object type columns (excluding customerID)
for col in df.select_dtypes(include='object').columns:
    if col != 'customerID':
        print(f"\nUnique values in {col}:\n{df[col].unique()}")


Data types after converting TotalCharges:
 customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object

Unique values in gender:
['Female' 'Male']

Unique values in Partner:
['Yes' 'No']

Unique values in Dependents:
['No' 'Yes']

Unique values in PhoneService:
['No' 'Yes']

Unique values in MultipleLines:
['No phone service' 'No' 'Yes']

Unique values in InternetService:
['DSL' 'Fiber optic' 'No']

Unique values in OnlineSecurity:
['No' 'Yes' 'No inter
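
The `TotalCharges` conversion above can be tried on a toy frame before it is applied to the real data. The sketch below is only an illustration, not the notebook's own code: the `toy` frame and its values are invented, and it assumes the non-numeric entries are blank strings. That assumption would explain why a plain `isna()` check reports nothing until after coercion.

```python
import pandas as pd

# Hypothetical toy data: one blank-string TotalCharges entry for a tenure-0 customer
toy = pd.DataFrame({
    "tenure": [0, 12, 24],
    "TotalCharges": [" ", "350.5", "1200"],
})

# Blank strings are not NaN, so a plain missing-value check finds nothing
missing_before = int(toy["TotalCharges"].isna().sum())

# Coerce to numeric; anything that cannot be parsed becomes NaN
converted = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Inspect the affected rows before filling, so the choice of fill value is visible
affected = toy.loc[converted.isna(), ["tenure", "TotalCharges"]]
missing_after = int(converted.isna().sum())

# Filling with 0 is reasonable only if the affected customers have not been billed yet
toy["TotalCharges"] = converted.fillna(0)

print(missing_before, missing_after)
print(affected)
```

Looking at `affected` before calling `fillna(0)` is the useful part: if the rows with missing charges all have `tenure == 0`, a fill value of 0 is defensible; otherwise a different imputation may be more appropriate.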

#### Handle missing values

In [7]:
df.isnull().sum()

customerID         0
gender             0
SeniorCitizen      0
Partner            0
Dependents         0
tenure             0
PhoneService       0
MultipleLines      0
InternetService    0
OnlineSecurity     0


#### Standardize categorical features

In [8]:
# Identify object columns to standardize (excluding customerID)
cols_to_standardize = [col for col in df.select_dtypes(include='object').columns if col != 'customerID']

# Iterate through the identified columns and standardize values
for col in cols_to_standardize:
    if 'No internet service' in df[col].unique() or 'No phone service' in df[col].unique():
        print(f"Standardizing column: {col}")
        df[col] = df[col].replace(['No internet service', 'No phone service'], 'No')

# Verify the changes: print unique values for every column that no longer contains
# a placeholder value (this also lists columns that were never modified)
for col in cols_to_standardize:
    if 'No internet service' not in df[col].unique() and 'No phone service' not in df[col].unique():
        print(f"\nUnique values in {col} after standardization:\n{df[col].unique()}")

Standardizing column: MultipleLines
Standardizing column: OnlineSecurity
Standardizing column: OnlineBackup
Standardizing column: DeviceProtection
Standardizing column: TechSupport
Standardizing column: StreamingTV
Standardizing column: StreamingMovies

Unique values in gender after standardization:
['Female' 'Male']

Unique values in Partner after standardization:
['Yes' 'No']

Unique values in Dependents after standardization:
['No' 'Yes']

Unique values in PhoneService after standardization:
['No' 'Yes']

Unique values in MultipleLines after standardization:
['No' 'Yes']

Unique values in InternetService after standardization:
['DSL' 'Fiber optic' 'No']

Unique values in OnlineSecurity after standardization:
['No' 'Yes']

Unique values in OnlineBackup after standardization:
['Yes' 'No']

Unique values in DeviceProtection after standardization:
['No' 'Yes']

Unique values in TechSupport after standardization:
['No' 'Yes']

Unique values in StreamingTV after standardization:
['No' 'Ye
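
The replacement step above collapses 'No internet service' and 'No phone service' into 'No', which makes each add-on column binary. The following is a minimal sketch of that step on invented data (the `toy` frame, `placeholders`, and `affected_cols` names are hypothetical). It relies on the fact, visible in the output above, that `InternetService` keeps its own 'No' category, so the information that a customer has no internet service is still available after the collapse.

```python
import pandas as pd

# Hypothetical toy data mirroring the column layout of the telco dataset
toy = pd.DataFrame({
    "InternetService": ["DSL", "No", "Fiber optic"],
    "OnlineSecurity": ["Yes", "No internet service", "No"],
    "MultipleLines": ["No phone service", "Yes", "No"],
})

placeholders = ["No internet service", "No phone service"]

# Select only the columns that actually contain a placeholder value
affected_cols = [c for c in toy.columns if toy[c].isin(placeholders).any()]

# Collapse both placeholders into a plain 'No'
toy[affected_cols] = toy[affected_cols].replace(placeholders, "No")

# Each affected column should now hold only 'No' and 'Yes'
remaining = {c: sorted(toy[c].unique()) for c in affected_cols}
print(remaining)
```

Restricting the report to `affected_cols` keeps the verification output limited to the columns that were changed.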

#### Create target variable

Map 'Yes'/'No' in the 'Churn' column to 1/0.

In [9]:
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
print("\nValue counts after mapping 'Churn':\n", df['Churn'].value_counts())


Value counts after mapping 'Churn':
 Churn
0    5174
1    1869
Name: count, dtype: int64
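
Two quick checks are worth running after a dictionary mapping like the one above. The sketch below is illustrative only: the `labels` series is invented, while the counts 1869 and 5174 are taken from the output above. It shows that `Series.map` silently turns any value missing from the dictionary into NaN, and it computes the churn rate implied by the printed counts.

```python
import pandas as pd

# Hypothetical labels, including one unexpected spelling
labels = pd.Series(["Yes", "No", "No", "yes"])

mapped = labels.map({"Yes": 1, "No": 0})

# Series.map yields NaN for values that are missing from the dictionary
unmapped = labels[mapped.isna()].unique().tolist()

# Churn rate implied by the value counts printed above
churners, non_churners = 1869, 5174
churn_rate = churners / (churners + non_churners)

print(unmapped, round(churn_rate, 4))
```

With roughly 26.5% of customers in the positive class, the target is moderately imbalanced, which is the motivation for the class-imbalance handling listed in the modeling stage of the introduction.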
