
**Predicting Customer Churn in an E-commerce Platform**
Problem Definition
Goal: Predict which customers are likely to churn.
Business Impact: Reduce churn, improve customer retention, and increase revenue.


In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix



In [20]:
df = pd.read_csv('/content/customer_churn.csv')

Data Collection & Preprocessing
Handle missing values (imputation, removal).
Convert categorical variables into numerical format.
Normalize or standardize numerical features.
Create churn labels (e.g., a customer is considered churned if no purchase in the last 6 months).
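
The last step in the list above, deriving a churn label from purchase recency, could be sketched as follows. This is a minimal illustration under stated assumptions: the `orders` frame, its `customer_id` and `last_purchase_date` columns, and the `snapshot_date` are hypothetical and do not come from the dataset loaded in this notebook, which already ships with a `Churn` column.

```python
import pandas as pd

# Hypothetical purchase history: one row per customer with the latest purchase date.
orders = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "last_purchase_date": pd.to_datetime(["2024-01-15", "2024-06-20", "2023-11-02"]),
})

# Assumed date on which the labels are computed.
snapshot_date = pd.Timestamp("2024-07-01")
cutoff = snapshot_date - pd.DateOffset(months=6)

# A customer counts as churned (1) when the latest purchase is older than the cutoff.
orders["churn"] = (orders["last_purchase_date"] < cutoff).astype(int)
print(orders)
```

The six-month window is a business choice; a shorter window flags customers earlier but tends to mislabel infrequent buyers as churned.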





In [21]:
df['Contract'].value_counts()

Contract,count
Month-to-month,3875
Two year,1695
One year,1473


In [22]:
# Data Preprocessing
df.isnull().sum()

Column,Missing values
customerID,0
gender,0
SeniorCitizen,0
Partner,0
Dependents,0
tenure,0
PhoneService,0
MultipleLines,0
InternetService,0
OnlineSecurity,0


In [23]:
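# Define the columns to keep (the features plus the 'Churn' target)
columns_to_keep = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'Contract', 'TotalCharges', 'Churn']
# Select only those columns; .copy() makes df an independent DataFrame, so
# later column assignments do not trigger SettingWithCopyWarning
df = df[columns_to_keep].copy()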


Encode binary variables (e.g., Yes/No columns)
binary_columns = ['Partner', 'Dependents', 'PhoneService', 'Churn']
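
One way to encode Yes/No columns is an explicit mapping that fails loudly on unexpected values. The sketch below uses a small made-up frame (`demo`) rather than the notebook's `df`; the column names are borrowed from the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
demo = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Dependents": ["No", "No", "Yes"],
    "Churn": ["No", "Yes", "No"],
})

yes_no = {"No": 0, "Yes": 1}
for col in ["Partner", "Dependents", "Churn"]:
    encoded = demo[col].map(yes_no)
    # map() returns NaN for any value missing from the mapping,
    # so check for that instead of silently passing bad values on.
    if encoded.isna().any():
        raise ValueError(f"Unexpected value in column {col!r}")
    demo[col] = encoded.astype(int)

print(demo)
```

An explicit dictionary keeps the encoding visible in the code, which makes it easier to apply the same codes again when scoring new customers.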

In [24]:
# Label-encode the categorical columns
from sklearn.preprocessing import LabelEncoder
# Initialize the LabelEncoder
label_encoder = LabelEncoder()
# List of columns to label encode
categorical_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'Contract', 'Churn']
# LabelEncoder numbers the labels in sorted order (e.g. 'No' -> 0, 'Yes' -> 1);
# the same encoder object is refit for each column in turn
for col in categorical_cols:
    df[col] = label_encoder.fit_transform(df[col])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[col] = label_encoder.fit_transform(df[col])

In [25]:
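# These Yes/No columns were already label-encoded above (No -> 0, Yes -> 1),
# so this replacement finds no 'Yes'/'No' strings and leaves them unchanged
binary_columns = ['Partner', 'Dependents', 'PhoneService', 'Churn']
df[binary_columns] = df[binary_columns].replace({'Yes': 1, 'No': 0})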




In [26]:
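# gender and PhoneService already hold integer codes from the label encoder,
# so these replacements match nothing and leave the columns unchanged
df['gender'] = df['gender'].replace({'Male': 1, 'Female': 0})
df['PhoneService'] = df['PhoneService'].replace({'Yes': 1, 'No': 0})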
df

index,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,Contract,TotalCharges,Churn
0,0,0,1,0,1,0,1,0,29.85,0
1,1,0,0,0,34,1,0,1,1889.5,0
2,1,0,0,0,2,1,0,0,108.15,1
3,1,0,0,0,45,0,1,1,1840.75,0
4,0,0,0,0,2,1,0,0,151.65,1
...,...,...,...,...,...,...,...,...,...,...
7038,1,0,1,1,24,1,2,1,1990.5,0
7039,0,0,1,1,72,1,2,1,7362.9,0
7040,0,0,1,1,11,0,1,0,346.45,0
7041,1,1,1,0,4,1,2,0,306.6,1


Model Training & Evaluation
Split Data: Train (80%), Test (20%).
Train Model: Fit the chosen algorithm to the training set.
Evaluate Performance using metrics:
Accuracy (share of all predictions that are correct).
Precision (share of predicted churners who actually churn) & Recall (share of actual churners the model identifies).
F1 Score (harmonic mean of precision and recall).
ROC-AUC Score (measures the model's ability to rank churners above non-churners).
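
The metrics listed above could be computed with scikit-learn roughly as follows. This is a sketch on synthetic data from `make_classification`, not on the churn dataset; the names `X_demo`, `demo_model` and `metrics` and the class weights are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem standing in for churn data.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=9, weights=[0.73, 0.27], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = demo_model.predict(X_te)                 # hard 0/1 predictions
y_score = demo_model.predict_proba(X_te)[:, 1]   # probability of the positive class

metrics = {
    "accuracy": accuracy_score(y_te, y_hat),
    "precision": precision_score(y_te, y_hat),
    "recall": recall_score(y_te, y_hat),
    "f1": f1_score(y_te, y_hat),
    # ROC-AUC is computed from scores, not from the thresholded labels.
    "roc_auc": roc_auc_score(y_te, y_score),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When churners are the minority class, accuracy alone can look good even if few churners are caught, which is why recall and ROC-AUC are worth reporting alongside it.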

In [27]:
X = df.drop('Churn', axis=1)
y = df['Churn']

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5634 entries, 2142 to 860
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   gender         5634 non-null   int64 
 1   SeniorCitizen  5634 non-null   int64 
 2   Partner        5634 non-null   int64 
 3   Dependents     5634 non-null   int64 
 4   tenure         5634 non-null   int64 
 5   PhoneService   5634 non-null   int64 
 6   MultipleLines  5634 non-null   int64 
 7   Contract       5634 non-null   int64 
 8   TotalCharges   5634 non-null   object
dtypes: int64(8), object(1)
memory usage: 440.2+ KB


In [29]:
# Convert 'TotalCharges' to float; errors='coerce' turns non-numeric values
# into NaN so they can be imputed with the column mean in the next cell
X_train['TotalCharges'] = pd.to_numeric(X_train['TotalCharges'], errors='coerce')
X_test['TotalCharges'] = pd.to_numeric(X_test['TotalCharges'], errors='coerce')

In [30]:
# Fill missing 'TotalCharges' with the training-set mean; reusing that mean
# for the test set keeps test-set statistics out of the preprocessing.
# Assigning the result back avoids the chained inplace fillna pattern.
total_charges_mean = X_train['TotalCharges'].mean()
X_train['TotalCharges'] = X_train['TotalCharges'].fillna(total_charges_mean)
X_test['TotalCharges'] = X_test['TotalCharges'].fillna(total_charges_mean)


In [31]:
from sklearn.preprocessing import StandardScaler
#Initialize Scaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
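
The impute-then-scale steps in the cells above can also be bundled with the model into a single scikit-learn pipeline, so that every statistic is learned from the training split only. The sketch below runs on synthetic data; `X_demo`, `churn_pipeline` and the 5% missing rate are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels standing in for the churn data.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # blank out roughly 5% of the values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the imputation means and scaling parameters from X_tr only;
# score() and predict() then reuse them on unseen rows.
churn_pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)
churn_pipeline.fit(X_tr, y_tr)
test_accuracy = churn_pipeline.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.3f}")
```

A pipeline object can also be pickled as one unit, which keeps the preprocessing and the model together at prediction time.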

In [32]:
X_train_scaled

array([[-1.02516569, -0.4377492 , -0.96957859, ..., -1.00053704,
         0.37290835, -0.42210502],
       [-1.02516569, -0.4377492 , -0.96957859, ...,  1.10833901,
         1.5775905 ,  1.25536015],
       [ 0.97545208, -0.4377492 ,  1.03137591, ...,  0.05390099,
        -0.83177379, -1.00299144],
       ...,
       [ 0.97545208, -0.4377492 ,  1.03137591, ..., -1.00053704,
        -0.83177379, -0.87799925],
       [ 0.97545208,  2.28441306, -0.96957859, ...,  1.10833901,
        -0.83177379, -0.48254445],
       [ 0.97545208, -0.4377492 , -0.96957859, ..., -1.00053704,
         0.37290835, -0.81110232]])

**Logistic Regression**

In [33]:
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression()
lg.fit(X_train_scaled,y_train)
y_pred = lg.predict(X_test_scaled)


**Accuracy Score**

In [34]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.7757274662881476

In [35]:
import pickle
# Save the trained model; the with-block closes the file once writing is done
with open('churnmodel.pk', 'wb') as f:
    pickle.dump(lg, f)
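
Because predictions on new customers need the fitted scaler as well as the model, one option is to pickle both together and reload them as a pair. The sketch below uses made-up data and a hypothetical file name (`churn_artifacts_demo.pk`) in the system temp directory; it is not the file written by the cell above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny made-up training set standing in for the notebook's data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_scaler = StandardScaler().fit(X_demo)
demo_model = LogisticRegression().fit(demo_scaler.transform(X_demo), y_demo)

# Store scaler and model in one file so they cannot drift apart.
artifact_path = os.path.join(tempfile.gettempdir(), "churn_artifacts_demo.pk")
with open(artifact_path, "wb") as f:
    pickle.dump({"scaler": demo_scaler, "model": demo_model}, f)

# Reload and check that the restored pair reproduces the original predictions.
with open(artifact_path, "rb") as f:
    artifacts = pickle.load(f)

restored = artifacts["model"].predict(artifacts["scaler"].transform(X_demo))
original = demo_model.predict(demo_scaler.transform(X_demo))
print((restored == original).all())
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.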

In [36]:
#Classification
def predictive():
  pass

In [37]:
df.iloc[0].values

array([0, 0, 1, 0, 1, 0, 1, 0, '29.85', 0], dtype=object)

In [38]:
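def prediction(gender, Seniorcitizen, Partner, Dependents, tenure, Phoneservice, multiline, contract, totalcharge):
    # Integer codes matching what LabelEncoder produced during training
    # (it numbers the labels of each column in sorted order)
    yes_no = {'No': 0, 'Yes': 1}
    gender_codes = {'Female': 0, 'Male': 1}
    multiline_codes = {'No': 0, 'No phone service': 1, 'Yes': 2}
    contract_codes = {'Month-to-month': 0, 'One year': 1, 'Two year': 2}

    # Build a single-row DataFrame with the same column order as X_train
    data = {
        'gender': [gender_codes[gender]],
        # SeniorCitizen is stored as 0/1 in the dataset; 'Yes'/'No' is accepted too
        'SeniorCitizen': [yes_no.get(Seniorcitizen, Seniorcitizen)],
        'Partner': [yes_no[Partner]],
        'Dependents': [yes_no[Dependents]],
        'tenure': [tenure],
        'PhoneService': [yes_no[Phoneservice]],
        'MultipleLines': [multiline_codes[multiline]],
        'Contract': [contract_codes[contract]],
        'TotalCharges': [float(totalcharge)],
    }
    features = pd.DataFrame(data)

    # Reuse the scaler fitted on the training data; refitting it on a
    # single row would scale every feature to zero
    features_scaled = scaler.transform(features)

    # predict returns a one-element array; return its value (1 = churn, 0 = no churn)
    return lg.predict(features_scaled)[0]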

In [39]:
gender = "Female"
Seniorcitizen = "No"
Partner = "Yes"
Dependents = "No"
tenure = 1
Phoneservice="No"
multiline = "No phone service"
contract = "Month-to-month"
totalcharge = 29.85
result = prediction(gender, Seniorcitizen, Partner, Dependents, tenure, Phoneservice, multiline, contract, totalcharge)

if result==1:
    print('churn')
else:
    print('not churn')

not churn
