# <div style="text-align: center"> <h1>Project Title: E-Commerce Customer Churn Prediction</h1></div>

## OBJECTIVE:

This e-commerce customer churn prediction project aims to reduce customer loss and increase revenue by predicting which customers are at risk of churning so that they can be retained. Key steps include data collection, model building, intervention strategies, and continuous monitoring of business impact.

   1) Minimize Churn: Identify and reduce customer churn to maintain a stable customer base.

   2) Enhance Retention: Improve customer retention and loyalty to increase revenue.

   3) Data Analysis: Analyze customer data for insights into churn-related factors.

   4) Predictive Models: Develop machine learning models to forecast potential churn.

   5) Targeted Interventions: Implement personalized strategies to retain at-risk customers.

### Import Important Libraries

In [76]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
import pickle

In [77]:
df=pd.read_csv("/home/sunbeam/Machine Learning/project/code/backend/churn.csv")
df.head()

Unnamed: 0,user_id,age,gender,region_category,joining_date,joined_through_referral,preferred_offer_types,medium_of_operation,internet_option,last_visit_time,...,avg_time_spent,avg_transaction_value,avg_frequency_login_days,points_in_wallet,used_special_discount,offer_application_preference,past_complaint,complaint_status,feedback,churn_risk_score
0,9f420209e7d129f3,29,F,,2017-04-05,Yes,Without Offers,,Wi-Fi,22:29:49,...,1184.49,38604.69,10.0,627.48,Yes,Yes,Yes,No Information Available,Poor Website,1
1,ac6e97806267549e,50,M,,2017-03-31,Yes,Without Offers,Desktop,Fiber_Optic,15:44:56,...,338.15,7665.66,17.0,575.97,Yes,No,Yes,No Information Available,Poor Customer Service,1
2,a6aa19b1580eed4e,26,F,City,2017-02-11,,Credit/Debit Card Offers,,Fiber_Optic,20:31:53,...,235.14,37671.69,5.0,767.93,Yes,No,Yes,No Information Available,Too many ads,0
3,aeee343277211c2f,63,F,Village,2015-12-23,No,Credit/Debit Card Offers,Desktop,Fiber_Optic,14:28:05,...,56.67,15678.14,11.0,590.22,No,Yes,No,Not Applicable,Too many ads,1
4,82448b5c8ce6390c,64,M,Town,2015-03-20,,Gift Vouchers/Coupons,Smartphone,Wi-Fi,04:16:48,...,153.99,8422.68,0.0,722.04,Yes,No,No,Not Applicable,Poor Product Quality,0


### Exploratory Data Analysis

In [78]:
df.info()
df=df.drop(['user_id','joining_date','last_visit_time'],axis=1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37010 entries, 0 to 37009
Data columns (total 21 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   user_id                       37010 non-null  object 
 1   age                           37010 non-null  int64  
 2   gender                        36951 non-null  object 
 3   region_category               31579 non-null  object 
 4   joining_date                  37010 non-null  object 
 5   joined_through_referral       31568 non-null  object 
 6   preferred_offer_types         36722 non-null  object 
 7   medium_of_operation           31615 non-null  object 
 8   internet_option               37010 non-null  object 
 9   last_visit_time               37010 non-null  object 
 10  days_since_last_login         37010 non-null  int64  
 11  avg_time_spent                37010 non-null  float64
 12  avg_transaction_value         37010 non-null  float64
 13  a

In [79]:
df['churn_risk_score'].value_counts()

churn_risk_score
1    20018
0    16992
Name: count, dtype: int64

#### Data Sample Balancing

In [80]:
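# Split by class: 1 = at risk of churning, 0 = not at risk
df_churn=df[df['churn_risk_score']==1]
df_no_churn=df[df['churn_risk_score']==0]

# Undersample the majority class (churn) from 20,018 to 18,500 rows.
# The sample is unseeded, so the selected rows differ between runs.
df_under=df_churn.sample(18500)
df_balanced=pd.concat([df_under,df_no_churn],axis=0)
df=df_balanced
df['churn_risk_score'].value_counts()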

churn_risk_score
1    18500
0    16992
Name: count, dtype: int64
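
The undersampling above narrows the gap between the classes (18,500 vs 16,992) but does not equalize them, and it is unseeded. A minimal sketch of an exact, reproducible variant follows. The toy frame, the helper name `undersample_to_minority`, and the seed value are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the churn data (6 churners, 4 non-churners)
toy = pd.DataFrame({
    "age": [25, 31, 47, 52, 38, 29, 60, 44, 33, 41],
    "churn_risk_score": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
})

def undersample_to_minority(frame, target, seed=42):
    """Downsample every class to the size of the smallest class."""
    n_min = frame[target].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=seed)  # fixed seed keeps the sample reproducible
        for _, group in frame.groupby(target)
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=seed)

balanced = undersample_to_minority(toy, "churn_risk_score")
print(balanced["churn_risk_score"].value_counts())
```

With the real data, the same helper would be called as `undersample_to_minority(df, 'churn_risk_score')`, which would keep 16,992 rows per class.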

In [81]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 35492 entries, 3636 to 37008
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   age                           35492 non-null  int64  
 1   gender                        35434 non-null  object 
 2   region_category               30311 non-null  object 
 3   joined_through_referral       30302 non-null  object 
 4   preferred_offer_types         35212 non-null  object 
 5   medium_of_operation           30326 non-null  object 
 6   internet_option               35492 non-null  object 
 7   days_since_last_login         35492 non-null  int64  
 8   avg_time_spent                35492 non-null  float64
 9   avg_transaction_value         35492 non-null  float64
 10  avg_frequency_login_days      35492 non-null  float64
 11  points_in_wallet              35492 non-null  float64
 12  used_special_discount         35492 non-null  object 
 13  off

#### Finding Null Values

In [82]:
df.isna().sum()

age                                0
gender                            58
region_category                 5181
joined_through_referral         5190
preferred_offer_types            280
medium_of_operation             5166
internet_option                    0
days_since_last_login              0
avg_time_spent                     0
avg_transaction_value              0
avg_frequency_login_days           0
points_in_wallet                   0
used_special_discount              0
offer_application_preference       0
past_complaint                     0
complaint_status                   0
feedback                           0
churn_risk_score                   0
dtype: int64

In [83]:
# Drop every row that has at least one missing value (35,492 rows before, 21,970 after)
df=df.dropna()
df

Unnamed: 0,age,gender,region_category,joined_through_referral,preferred_offer_types,medium_of_operation,internet_option,days_since_last_login,avg_time_spent,avg_transaction_value,avg_frequency_login_days,points_in_wallet,used_special_discount,offer_application_preference,past_complaint,complaint_status,feedback,churn_risk_score
3636,27,F,Town,No,Gift Vouchers/Coupons,Smartphone,Fiber_Optic,15,42.04,43005.07,17.0,605.310000,Yes,No,Yes,No Information Available,No reason specified,1
6572,39,F,Town,Yes,Without Offers,Desktop,Wi-Fi,17,196.77,38078.82,15.0,640.010000,Yes,No,Yes,Solved,Poor Website,1
2584,19,F,Village,Yes,Credit/Debit Card Offers,Smartphone,Wi-Fi,22,113.47,30893.51,28.0,621.940000,Yes,No,Yes,No Information Available,Too many ads,1
18623,22,M,City,Yes,Credit/Debit Card Offers,Both,Fiber_Optic,10,233.34,4716.56,17.0,1035.120489,Yes,Yes,No,Not Applicable,No reason specified,1
31791,46,F,City,No,Credit/Debit Card Offers,Desktop,Mobile_Data,7,0.00,16320.03,28.0,717.140000,No,Yes,No,Not Applicable,Poor Website,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36993,28,F,City,Yes,Gift Vouchers/Coupons,Both,Mobile_Data,2,1900.54,31120.74,18.0,0.000000,Yes,No,Yes,Unsolved,Poor Website,0
36994,10,F,City,No,Credit/Debit Card Offers,Smartphone,Mobile_Data,20,72.79,34366.19,22.0,718.520000,Yes,No,Yes,Unsolved,Poor Website,0
37004,19,F,Town,Yes,Without Offers,Smartphone,Fiber_Optic,21,119.96,14218.43,7.0,781.840000,Yes,No,No,Not Applicable,Too many ads,0
37006,27,F,City,Yes,Without Offers,Desktop,Wi-Fi,15,368.50,27038.47,8.0,835.980000,No,Yes,No,Not Applicable,Reasonable Price,0
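
Dropping every incomplete row discards a large share of the data (21,970 of 35,492 rows remain). One possible alternative, sketched here under stated assumptions, is to fill missing categorical values with each column's most frequent value. The toy frame and the helper `fill_categorical_with_mode` are hypothetical; whether imputation helps on this dataset is not something the notebook tests.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two categorical columns
toy = pd.DataFrame({
    "gender": ["F", "M", np.nan, "F"],
    "region_category": ["City", np.nan, "Town", "City"],
    "age": [29, 50, 26, 63],
})

def fill_categorical_with_mode(frame):
    """Return a copy where gaps in non-numeric columns are filled with the column mode."""
    filled = frame.copy()
    for col in filled.select_dtypes(exclude="number").columns:
        mode = filled[col].mode(dropna=True)
        if not mode.empty:
            # mode() can return several values on ties; take the first
            filled[col] = filled[col].fillna(mode.iloc[0])
    return filled

imputed = fill_categorical_with_mode(toy)
print(imputed)
```

In a stricter setup the mode would be computed on the training split only and then reused for the test split.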


#### Label Encoding for Categorical Values

In [84]:
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()

# Columns that hold string categories and must be converted to integer codes
categorical_cols=['feedback','complaint_status','past_complaint','offer_application_preference','used_special_discount','gender','region_category','joined_through_referral','preferred_offer_types','medium_of_operation','internet_option']

# Work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
df=df.copy()
# apply() calls fit_transform once per column, refitting the same encoder each time
df[categorical_cols]=df[categorical_cols].apply(encoder.fit_transform)
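
Because a single `LabelEncoder` is refitted for every column, only the mapping of the last column survives after the cell runs. If the mappings are needed later (for example to encode new customers the same way), one option is to keep one encoder per column. The sketch below assumes a toy frame and a hypothetical `encoders` dictionary; it is not taken from the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two of the categorical columns
toy = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "internet_option": ["Wi-Fi", "Fiber_Optic", "Mobile_Data"],
})

encoders = {}          # column name -> fitted LabelEncoder
encoded = toy.copy()
for col in ["gender", "internet_option"]:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# LabelEncoder sorts classes, so Fiber_Optic=0, Mobile_Data=1, Wi-Fi=2
restored = encoders["internet_option"].inverse_transform(encoded["internet_option"])
print(encoded)
print(list(restored))
```

The `encoders` dictionary could be pickled alongside the models so that inference code applies the same integer codes.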


In [85]:
df.corr()

Unnamed: 0,age,gender,region_category,joined_through_referral,preferred_offer_types,medium_of_operation,internet_option,days_since_last_login,avg_time_spent,avg_transaction_value,avg_frequency_login_days,points_in_wallet,used_special_discount,offer_application_preference,past_complaint,complaint_status,feedback,churn_risk_score
age,1.0,0.002949,0.012674,0.007589,-0.006038,0.003673,0.010056,-0.005537,0.000568,-0.002148,0.001209,-0.006289,0.000733,-0.000408,-0.003498,0.006214,-0.008712,0.005047
gender,0.002949,1.0,-0.003133,-0.004466,0.002229,-0.004505,-0.013318,0.003264,-0.013774,0.003912,-0.009219,-0.009726,-0.012498,0.008188,-0.011196,-0.00023,-0.003463,-0.001687
region_category,0.012674,-0.003133,1.0,-0.007451,-0.015771,0.003665,-0.003972,-0.003772,0.002436,0.01616,-0.006921,0.015655,-0.003238,-0.007426,0.00274,-0.001319,0.01615,-0.019711
joined_through_referral,0.007589,-0.004466,-0.007451,1.0,0.005231,-0.043832,-0.004518,-0.012316,0.167543,-0.037482,0.02364,-0.006964,0.021805,0.020576,0.001005,0.001064,-0.033295,0.031202
preferred_offer_types,-0.006038,0.002229,-0.015771,0.005231,1.0,0.0113,-0.00357,0.000787,-0.010738,-0.023614,0.005517,-0.015832,0.000883,-0.000635,-0.006708,0.002778,-0.034735,0.026656
medium_of_operation,0.003673,-0.004505,0.003665,-0.043832,0.0113,1.0,-5.5e-05,0.010499,-0.209406,-0.015362,0.007312,-0.005764,-0.051308,-0.038208,-0.007815,-0.012304,-0.010783,0.023712
internet_option,0.010056,-0.013318,-0.003972,-0.004518,-0.00357,-5.5e-05,1.0,-0.001815,0.000438,-0.005594,0.008889,0.004955,-0.00094,0.004,0.001394,-0.005399,0.008562,-0.006033
days_since_last_login,-0.005537,0.003264,-0.003772,-0.012316,0.000787,0.010499,-0.001815,1.0,0.000613,0.003499,-0.002258,-0.003727,0.006808,-0.014546,-0.009995,5.1e-05,0.009644,0.000682
avg_time_spent,0.000568,-0.013774,0.002436,0.167543,-0.010738,-0.209406,0.000438,0.000613,1.0,0.027232,-0.007158,-0.01033,0.084314,0.078015,0.014253,0.011204,0.013332,-0.010641
avg_transaction_value,-0.002148,0.003912,0.01616,-0.037482,-0.023614,-0.015362,-0.005594,0.003499,0.027232,1.0,-0.121596,0.058214,0.001011,0.02694,-0.002034,0.0003,0.220866,-0.216156


In [86]:
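from sklearn.preprocessing import MinMaxScaler

# Fit a min-max scaler on every column of df (including the target and columns dropped later).
# Note: the scaler is only fitted and pickled here; transform() is never called,
# so df is printed unchanged and the models below are trained on unscaled values.
scaler = MinMaxScaler()
scaler.fit(df)
print(df)

# Persist the fitted scaler
with open('minmax_scaler.pkl', 'wb') as file:
    pickle.dump(scaler, file)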

       age  gender  region_category  joined_through_referral  \
3636    27       0                1                        0   
6572    39       0                1                        1   
2584    19       0                2                        1   
18623   22       1                0                        1   
31791   46       0                0                        0   
...    ...     ...              ...                      ...   
36993   28       0                0                        1   
36994   10       0                0                        0   
37004   19       0                1                        1   
37006   27       0                0                        1   
37008   38       0                0                        1   

       preferred_offer_types  medium_of_operation  internet_option  \
3636                       1                    2                0   
6572                       2                    1                2   
2584                 
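
The cell above fits the scaler but never applies it. A minimal sketch of the usual pattern, fitting on the training features only and then transforming both splits, is shown below. The random demo arrays and the variable names are assumptions for illustration; applying this to the notebook would change the reported scores.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical unscaled features and binary labels
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1000, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=12345
)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from the training split only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test split

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

Test values can fall slightly outside the 0 to 1 range, because the minimum and maximum come from the training split.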

In [87]:
X=df.drop(["churn_risk_score",'days_since_last_login','medium_of_operation'],axis=1)
Y=df['churn_risk_score']
X.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21970 entries, 3636 to 37008
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   age                           21970 non-null  int64  
 1   gender                        21970 non-null  int64  
 2   region_category               21970 non-null  int64  
 3   joined_through_referral       21970 non-null  int64  
 4   preferred_offer_types         21970 non-null  int64  
 5   internet_option               21970 non-null  int64  
 6   avg_time_spent                21970 non-null  float64
 7   avg_transaction_value         21970 non-null  float64
 8   avg_frequency_login_days      21970 non-null  float64
 9   points_in_wallet              21970 non-null  float64
 10  used_special_discount         21970 non-null  int64  
 11  offer_application_preference  21970 non-null  int64  
 12  past_complaint                21970 non-null  int64  
 13  com

In [88]:
X

Unnamed: 0,age,gender,region_category,joined_through_referral,preferred_offer_types,internet_option,avg_time_spent,avg_transaction_value,avg_frequency_login_days,points_in_wallet,used_special_discount,offer_application_preference,past_complaint,complaint_status,feedback
3636,27,0,1,0,1,0,42.04,43005.07,17.0,605.310000,1,0,1,0,0
6572,39,0,1,1,2,2,196.77,38078.82,15.0,640.010000,1,0,1,2,3
2584,19,0,2,1,0,2,113.47,30893.51,28.0,621.940000,1,0,1,0,7
18623,22,1,0,1,0,0,233.34,4716.56,17.0,1035.120489,1,1,0,1,0
31791,46,0,0,0,0,1,0.00,16320.03,28.0,717.140000,0,1,0,1,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36993,28,0,0,1,1,1,1900.54,31120.74,18.0,0.000000,1,0,1,4,3
36994,10,0,0,0,0,1,72.79,34366.19,22.0,718.520000,1,0,1,4,3
37004,19,0,1,1,2,0,119.96,14218.43,7.0,781.840000,1,0,0,1,7
37006,27,0,0,1,2,2,368.50,27038.47,8.0,835.980000,0,1,0,1,6


In [89]:
# from sklearn.decomposition import PCA
# pca=PCA(n_components=1)
# pca.fit_transform(X)

#### Splitting Dataset Into Training & Testing

In [90]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X,Y,random_state=12345,test_size=0.3)

### Building Model 

In [91]:
def save_model(model, name):
    with open(name, 'wb') as file:
        pickle.dump(model, file)
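
The notebook only shows the saving side. A minimal sketch of the matching load step, with a round trip through a temporary file, could look like this. `load_model` is a hypothetical counterpart that does not appear in the original, and a plain dictionary stands in for a fitted model.

```python
import os
import pickle
import tempfile

def save_model(model, name):
    # Same logic as the notebook's helper: serialize the object to a file
    with open(name, "wb") as file:
        pickle.dump(model, file)

def load_model(name):
    # Hypothetical counterpart: read a pickled object back from disk
    with open(name, "rb") as file:
        return pickle.load(file)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")
    original = {"name": "demo", "weights": [0.1, 0.2, 0.3]}  # stand-in for a model
    save_model(original, path)
    restored = load_model(path)

print(restored == original)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.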



In [92]:
def build_svm_model():
    from sklearn.svm import SVC
    model = SVC()
    model.fit(x_train, y_train)
    save_model(model, 'svm.pkl')
    return model

In [93]:
def build_lg_model():
    from sklearn.linear_model import LogisticRegressionCV
    model = LogisticRegressionCV()
    model.fit(x_train, y_train)
    save_model(model, 'lg.pkl')
    return model

In [94]:
def build_knn_model():
    from sklearn.neighbors import KNeighborsClassifier
    model = KNeighborsClassifier()
    model.fit(x_train, y_train)
    save_model(model, 'knn.pkl')
    return model

In [95]:
def build_nb_model():
    from sklearn.naive_bayes import GaussianNB
    model = GaussianNB()
    model.fit(x_train, y_train)
    save_model(model, 'nb.pkl')
    return model

In [96]:
def build_dt_model():
    from sklearn.tree import DecisionTreeClassifier
    model = DecisionTreeClassifier()
    model.fit(x_train, y_train)
    save_model(model, 'dt.pkl')
    return model

In [97]:
def build_rf_model():
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier()
    model.fit(x_train, y_train)
    save_model(model, 'rf.pkl')
    return model

In [98]:
def build_catboost_model():
    from catboost import CatBoostClassifier
    model = CatBoostClassifier(verbose=False)
    model.fit(x_train, y_train)
    save_model(model, 'cb.pkl')
    return model

In [99]:
def build_xgb_model():
    from xgboost import XGBClassifier
    model = XGBClassifier()
    model.fit(x_train, y_train)
    save_model(model, 'xgb.pkl')
    return model

#### Evaluation Metrics

In [100]:
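from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def _score(y_true, y_pred):
    # Format each metric as a percentage string; precision, recall and F1 are macro-averaged
    accuracy = f"{accuracy_score(y_true, y_pred)*100:.2f}%"
    precision = f"{precision_score(y_true, y_pred,average='macro')*100:.2f}%"
    recall = f"{recall_score(y_true, y_pred,average='macro')*100:.2f}%"
    f1 = f"{f1_score(y_true, y_pred,average='macro')*100:.2f}%"
    return accuracy, precision, recall, f1

def evaluate_model_test(model):
    # Score the model on the held-out test split
    return _score(y_test, model.predict(x_test))

def evaluate_model_train(model):
    # Called in the next cell, but its definition is missing from this export;
    # reconstructed here on the assumption that it mirrors evaluate_model_test on the training split
    return _score(y_train, model.predict(x_train))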
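
The metrics above use `average='macro'`, which computes the score for each class separately and then takes the unweighted mean. The small worked example below illustrates this with made-up labels (the `y_true` and `y_pred` lists are assumptions, not notebook data) and contrasts it with the default binary score for class 1.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: four churners (1) and two non-churners (0)
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0]

# average=None returns one score per class, ordered class 0 then class 1
per_class_precision = precision_score(y_true, y_pred, average=None)
per_class_recall = recall_score(y_true, y_pred, average=None)

# Macro average: unweighted mean of the per-class scores
macro_precision = precision_score(y_true, y_pred, average="macro")
macro_recall = recall_score(y_true, y_pred, average="macro")

# Default (binary) recall only looks at the positive class, label 1
binary_recall = recall_score(y_true, y_pred)

print(per_class_precision, macro_precision)  # class 0: 1.0, class 1: 0.8, macro: 0.9
print(per_class_recall, macro_recall)        # class 0: 0.5, class 1: 1.0, macro: 0.75
print(binary_recall)                         # 1.0
```

Here every churner is caught (binary recall 1.0), yet macro recall is 0.75 because half of the non-churners are misclassified.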

In [101]:
model_functions = [
    {"name": "SVM", "function": build_svm_model},
    {"name": "LG", "function": build_lg_model},
    {"name": "KNN", "function": build_knn_model},
    {"name": "NB", "function": build_nb_model},
    {"name": "DT", "function": build_dt_model},
    {"name": "RF", "function": build_rf_model},
    {"name": "CatBoost", "function": build_catboost_model},
    {"name": "XGBoost", "function": build_xgb_model}
]
model_evaluation_report = []

# iterate over the list, create model and evaluate the model
for model_info in model_functions:
    model = model_info["function"]()
    metrics_train = evaluate_model_train(model)
    metrics_test = evaluate_model_test(model)
    model_evaluation_report.append({
        "name": model_info["name"],
        "train_accuracy": metrics_train[0],
        "train_precision": metrics_train[1], 
        "train_recall": metrics_train[2],
        "train_f1": metrics_train[3],
        "accuracy": metrics_test[0],
        "precision": metrics_test[1], 
        "recall": metrics_test[2],
        "f1": metrics_test[3]
            
    })

# create a data frame of the result
df_result = pd.DataFrame(model_evaluation_report)
df_result

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

|   | name | train_accuracy | train_precision | train_recall | train_f1 | accuracy | precision | recall | f1 |
|---|------|----------------|-----------------|--------------|----------|----------|-----------|--------|----|
| 0 | SVM | 60.02% | 78.21% | 58.55% | 50.68% | 59.31% | 77.93% | 58.04% | 49.70% |
| 1 | LG | 60.86% | 60.81% | 60.81% | 60.81% | 60.64% | 60.61% | 60.61% | 60.61% |
| 2 | KNN | 84.13% | 84.22% | 84.03% | 84.08% | 74.92% | 75.00% | 74.80% | 74.82% |
| 3 | NB | 63.43% | 64.23% | 62.90% | 62.33% | 63.12% | 63.88% | 62.68% | 62.12% |
| 4 | DT | 100.00% | 100.00% | 100.00% | 100.00% | 82.83% | 82.81% | 82.84% | 82.82% |
| 5 | RF | 100.00% | 100.00% | 100.00% | 100.00% | 86.77% | 86.91% | 86.67% | 86.73% |
| 6 | CatBoost | 93.90% | 94.20% | 93.77% | 93.87% | 86.75% | 87.15% | 86.60% | 86.67% |
| 7 | XGBoost | 97.11% | 97.25% | 97.04% | 97.10% | 86.24% | 86.51% | 86.11% | 86.17% |
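
The convergence warnings and the weak SVM and LG rows may be related to the unscaled features, since columns such as `avg_transaction_value` are several orders of magnitude larger than the binary flags. One way to check this, sketched below on synthetic data, is to put a scaler and the classifier in a single pipeline and score it with cross-validation. The synthetic dataset, the use of `StandardScaler` and plain `LogisticRegression` (instead of the notebook's `MinMaxScaler` and `LogisticRegressionCV`), and all parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_demo[:, 0] *= 10_000  # mimic one large-magnitude column

# The scaler is refitted inside each fold, so no test-fold statistics leak into training
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Five-fold cross-validated macro F1, matching the averaging used in the report
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="f1_macro")
print(scores.mean())
```

Cross-validated scores would also give a steadier picture for the tree-based models, whose 100% training scores against roughly 83% to 87% test scores suggest overfitting.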


## DATA VISUALIZATION USING TABLEAU

### [Tableau Dashboard](https://public.tableau.com/app/profile/gulshan.gedam/viz/ProjectWorkinProgess/Dashboard3?publish=yes)