# CRM Data Analyst Case - Part 2

### Description

The objective of the team is to build a predictive model that will produce the highest profit for the
next direct marketing campaign, scheduled for the next month. 
The new campaign, sixth, aims at
selling a new gadget to the Customer Database. 
To build the model, a pilot campaign involving **2240 customers** was carried out. 
The customers were selected at random and contacted by phone regarding the acquisition of the gadget. 
During the following months, customers who bought the offer were properly labeled. 
The total cost of the sample campaign was 6.720MU and the revenue generated by the customers who accepted the offer was 3.674MU. 
Globally the campaign had a profit of -3.046MU. 
The success rate of the campaign was 15%. 

The objective is of the team is to develop a model that predicts customer behavior and to apply it to the rest of the customer base.
Hopefully the model will allow the company to cherry pick the customers that are most likely to
purchase the offer while leaving out the non-respondents, making the next campaign highly
profitable. Moreover, other than maximizing the profit of the campaign, the CMO is interested in
understanding to study the characteristic features of those customers who are willing to buy the
gadget.

### Key Objectives are:

1. Explore the data – don’t just plot means and counts. Provide insights, define cause and
effect. Provide a better understanding of the characteristic features of respondents;
2. Propose and describe a customer segmentation based on customers behaviors;
3. Create a predictive model which allows the company to maximize the profit of the next
marketing campaign.
4. Whatever else you think is necessary.

### Deliverables:

1. Data Exploration;
2. Segmentation;
3. Classification Model;
4. Feature Importance.

In [64]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import os
import warnings
warnings.filterwarnings("ignore")

import nest_asyncio
nest_asyncio.apply()

import dataframe_image as dfi # converting dataframe into Image
pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

In [34]:
# Read Dataset
#url = '/Users/adityaagarwal/Library/CloudStorage/OneDrive-NortheasternUniversity/Jupyter Notebook/Projects/CRM-Analysis-for-Marketing-data/marketing data/clean_marketing_data.csv'
url = 'D:/OneDrive - Northeastern University/Jupyter Notebook/Data Science Projects/CRM-Analysis-for-Marketing-data/marketing data/marketing_data_with_features.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Response,Complain,Country,Age,Is_Parent,Total_Kids,Marital_Status_Simplified,Total_Spent,MntWines_Share,MntFruits_Share,MntMeatProducts_Share,MntFishProducts_Share,MntSweetProducts_Share,MntGoldProds_Share,Total_Purchases,Avg_Spend_per_Purchase,Deal_Rate,Web_Purchase_Share,Web_Visit_to_Purchase_Ratio,Is_Online_Buyer,Customer_Since_Days,Is_Active,Lifetime_Spend_per_Day,Total_Accepted_Campaigns,Campaign_Response_Rate
0,8996,1957,PhD,Married,51381,2,1,11/19/2012,4,230,42,192,49,37,53,12,7,2,8,9,0,0,0,0,0,0,0,GER,68,1,3,Partnered,603,0.381426,0.069652,0.318408,0.08126,0.06136,0.087894,29,20.793103,0.413793,0.241379,1.125,0,4060,1,0.148522,0,0.0
1,1994,1983,Graduation,Married,51381,1,0,11/15/2013,11,5,5,6,0,2,1,1,1,0,2,7,0,0,0,0,0,0,0,US,42,1,1,Partnered,19,0.263158,0.263158,0.315789,0.0,0.105263,0.052632,4,4.75,0.25,0.25,3.5,0,3699,1,0.005137,0,0.0
2,3769,1972,PhD,Together,51381,1,0,3/2/2014,17,25,1,13,0,0,3,1,1,0,3,7,0,0,0,0,0,0,0,AUS,53,1,1,Partnered,42,0.595238,0.02381,0.309524,0.0,0.0,0.071429,5,8.4,0.2,0.2,3.5,0,3592,1,0.011693,0,0.0
3,5255,1986,Graduation,Single,51381,1,0,2/20/2013,19,5,1,3,3,263,362,0,27,0,0,1,0,0,0,0,0,0,0,AUS,39,1,1,Single,637,0.007849,0.00157,0.00471,0.00471,0.412873,0.568289,27,23.592593,0.0,1.0,0.035714,1,3967,1,0.160575,0,0.0
4,8268,1961,PhD,Married,51381,0,1,7/11/2013,23,352,0,27,10,0,15,3,6,1,7,6,0,0,0,0,0,0,0,CA,64,1,1,Partnered,404,0.871287,0.0,0.066832,0.024752,0.0,0.037129,17,23.764706,0.176471,0.352941,0.857143,0,3826,1,0.105593,0,0.0


In [35]:
df_new = df.copy()
df_new.drop(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Dt_Customer', 'Country'], axis=1, inplace=True)

In [36]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 44 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Income                       2240 non-null   int64  
 1   Kidhome                      2240 non-null   int64  
 2   Teenhome                     2240 non-null   int64  
 3   Recency                      2240 non-null   int64  
 4   MntWines                     2240 non-null   int64  
 5   MntFruits                    2240 non-null   int64  
 6   MntMeatProducts              2240 non-null   int64  
 7   MntFishProducts              2240 non-null   int64  
 8   MntSweetProducts             2240 non-null   int64  
 9   MntGoldProds                 2240 non-null   int64  
 10  NumDealsPurchases            2240 non-null   int64  
 11  NumWebPurchases              2240 non-null   int64  
 12  NumCatalogPurchases          2240 non-null   int64  
 13  NumStorePurchases 

In [38]:
df_new['Marital_Status_Simplified'] = df['Marital_Status_Simplified'].map({'Partnered': 0, 'Single': 1})

In [39]:
df_new['Marital_Status_Simplified'].value_counts().sum

<bound method NDFrame._add_numeric_operations.<locals>.sum of Marital_Status_Simplified
0    1444
1     796
Name: count, dtype: int64>

In [41]:
df_new.head()

Unnamed: 0,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Response,Complain,Age,Is_Parent,Total_Kids,Marital_Status_Simplified,Total_Spent,MntWines_Share,MntFruits_Share,MntMeatProducts_Share,MntFishProducts_Share,MntSweetProducts_Share,MntGoldProds_Share,Total_Purchases,Avg_Spend_per_Purchase,Deal_Rate,Web_Purchase_Share,Web_Visit_to_Purchase_Ratio,Is_Online_Buyer,Customer_Since_Days,Is_Active,Lifetime_Spend_per_Day,Total_Accepted_Campaigns,Campaign_Response_Rate
0,51381,2,1,4,230,42,192,49,37,53,12,7,2,8,9,0,0,0,0,0,0,0,68,1,3,0,603,0.381426,0.069652,0.318408,0.08126,0.06136,0.087894,29,20.793103,0.413793,0.241379,1.125,0,4060,1,0.148522,0,0.0
1,51381,1,0,11,5,5,6,0,2,1,1,1,0,2,7,0,0,0,0,0,0,0,42,1,1,0,19,0.263158,0.263158,0.315789,0.0,0.105263,0.052632,4,4.75,0.25,0.25,3.5,0,3699,1,0.005137,0,0.0
2,51381,1,0,17,25,1,13,0,0,3,1,1,0,3,7,0,0,0,0,0,0,0,53,1,1,0,42,0.595238,0.02381,0.309524,0.0,0.0,0.071429,5,8.4,0.2,0.2,3.5,0,3592,1,0.011693,0,0.0
3,51381,1,0,19,5,1,3,3,263,362,0,27,0,0,1,0,0,0,0,0,0,0,39,1,1,1,637,0.007849,0.00157,0.00471,0.00471,0.412873,0.568289,27,23.592593,0.0,1.0,0.035714,1,3967,1,0.160575,0,0.0
4,51381,0,1,23,352,0,27,10,0,15,3,6,1,7,6,0,0,0,0,0,0,0,64,1,1,0,404,0.871287,0.0,0.066832,0.024752,0.0,0.037129,17,23.764706,0.176471,0.352941,0.857143,0,3826,1,0.105593,0,0.0


In [42]:
# Split dataset into features and labels
features = df_new.drop('Response', axis =1)
labels = df.Response

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.40, random_state = 42)

In [43]:
print('Train_x:',X_train.shape)
print('Train_y:',y_train.shape)
print('Test_x:',X_test.shape)
print('Test_y:',y_test.shape)

Train_x: (1344, 43)
Train_y: (1344,)
Test_x: (896, 43)
Test_y: (896,)


In [44]:
## Standardizing the data
from sklearn.preprocessing import StandardScaler
obje_ss=StandardScaler()

x_train_ss=obje_ss.fit_transform(X_train)
x_test_ss=obje_ss.fit_transform(X_test)

## 1. Logistic regression

In [45]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

model_le=LogisticRegression(random_state=42, max_iter=10000)
model_le.fit(X_train, y_train)

parameters = {'C':[0.8,0.9,1,1.1,1.2], 'penalty':['l1','l2'],  'solver':['liblinear','saga']}

model_le_grid = GridSearchCV(model_le, parameters, cv=10, verbose=1, n_jobs=-1).fit(X_train, y_train)
print(model_le_grid.best_params_)

print('Logistic Regression Train score:', model_le.score(X_train,y_train)*100)
print('Logistic Regression Cross validation score:', model_le_grid.best_score_*100)

Fitting 10 folds for each of 20 candidates, totalling 200 fits
{'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}
Logistic Regression Train score: 88.69047619047619
Logistic Regression Cross validation score: 88.76672194582642


In [46]:
model_le=LogisticRegression(C=1.1, penalty='l1', solver='liblinear')
model_le.fit(X_train,y_train)
print('Logistic Regression Train score:', model_le.score(X_train,y_train)*100)

predictions = model_le.predict(X_test)
print("Accuracy:",metrics.accuracy_score(predictions, y_test)*100)

Logistic Regression Train score: 89.88095238095238
Accuracy: 89.62053571428571


## 2. Support vector machine

In [47]:
from sklearn import svm

model_svc=svm.SVC(random_state=42)
parameters = {'kernel':('linear', 'rbf','poly'), 'C':[0.8,0.9,1,1.1,1.2],'degree':[3,4,5,6]}
model_svc_grid = GridSearchCV(model_svc, parameters,cv=10,verbose=1,n_jobs=-1).fit(x_train_ss,y_train)
print(model_svc_grid.best_params_)

print('Support Vecktor Classification Cross validation score:', model_svc_grid.best_score_*100)

Fitting 10 folds for each of 60 candidates, totalling 600 fits
{'C': 1.2, 'degree': 3, 'kernel': 'linear'}
Support Vecktor Classification Cross validation score: 88.61746821448314


In [48]:
model_svc=svm.SVC(degree =3, C=0.8,kernel='linear')
model_svc.fit(x_train_ss,y_train)
print('Support Vecktor Classification Train score:', model_svc.score(x_train_ss,y_train)*100)

predictions = model_svc_grid.predict(x_test_ss)
print("Accuracy:",metrics.accuracy_score(predictions, y_test)*100)

Support Vecktor Classification Train score: 89.95535714285714
Accuracy: 89.17410714285714


## 3. XGBoost

In [49]:
from xgboost import XGBClassifier

model_xgb=XGBClassifier(n_jobs=-1,random_state=42,eval_metric='logloss')
parameters = {'n_estimators':[50,100,200],'max_depth':[3,4,5,6],'learning_rate':[0.1,0.01]}
model_xgb_grid = GridSearchCV(model_xgb, parameters,cv=10,verbose=1).fit(X_train,y_train)
print(model_xgb_grid.best_params_)

print('XGB Classifier Cros validation score:',model_xgb_grid.best_score_*100)


Fitting 10 folds for each of 24 candidates, totalling 240 fits
{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
XGB Classifier Cros validation score: 88.91708126036484


In [50]:
model_xgb=XGBClassifier(n_jobs=-1,random_state=42,n_estimators=50,max_depth=5,learning_rate=0.1,eval_metric='logloss')
model_xgb.fit(X_train,y_train)
print('XGB Classifier Classifier Train score:',model_xgb.score(X_train,y_train)*100)

predictions = model_xgb_grid.predict(X_test)
print("Accuracy:", metrics.accuracy_score(predictions, y_test)*100)

XGB Classifier Classifier Train score: 98.28869047619048
Accuracy: 88.28125


## 4. Generating the Model - Random Forest Regressor using GridSearchCv


In [51]:
# Using Grid Search to find the best parameters
param_grid = { 
    'n_estimators': [50, 100, 200],
    'max_features': ['sqrt', 'log2'],
    'max_depth' : [None,3,5,8],
    'criterion' :['gini'],
    'min_samples_split':[2,3,4]
}

# Training RF Models with K-Fold of 5 
rf_models = GridSearchCV(RandomForestClassifier(random_state = 5), 
                         param_grid=param_grid, cv=5, verbose=1, error_score='raise')

rf_models.fit(X_train, y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


In [52]:
# Get the predictions
predictions = rf_models.predict(X_test)

# Print the Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(predictions, y_test))

Accuracy: 0.8794642857142857


## Feature Importance

Calculating top 15 features

In [53]:
# Print Feature Importance

feature_importance = pd.DataFrame(data={"features": X_test.columns, 
                            "importance":rf_models.best_estimator_.feature_importances_*100})

feature_importance = feature_importance.sort_values('importance', 
            ascending=False).head(15)
feature_imp_styled = feature_importance.style.background_gradient(cmap='coolwarm', low=1, high=0)

feature_imp_styled

Unnamed: 0,features,importance
42,Campaign_Response_Rate,7.245125
3,Recency,6.602929
41,Total_Accepted_Campaigns,6.526032
38,Customer_Since_Days,5.810366
33,Avg_Spend_per_Purchase,4.208614
0,Income,3.73274
25,Total_Spent,3.635439
4,MntWines,3.358252
28,MntMeatProducts_Share,3.356411
40,Lifetime_Spend_per_Day,3.156106


In [65]:
# Saving dataframe
os.chdir(r'D:\OneDrive - Northeastern University\Jupyter Notebook\Data Science Projects\CRM-Analysis-for-Marketing-data\marketing data')
feature_importance.to_csv("Feature_Importance.csv", index = False)

In [66]:
feat_values = feature_importance.features.values.tolist()
feat_values

['Campaign_Response_Rate',
 'Recency',
 'Total_Accepted_Campaigns',
 'Customer_Since_Days',
 'Avg_Spend_per_Purchase',
 'Income',
 'Total_Spent',
 'MntWines',
 'MntMeatProducts_Share',
 'Lifetime_Spend_per_Day',
 'MntMeatProducts',
 'MntGoldProds_Share',
 'AcceptedCmp5',
 'Deal_Rate',
 'NumWebVisitsMonth']

In [68]:
# Split dataset into features and labels
features_feat = df_new.loc[:, feat_values]
labels_feat = df.Response

# Split dataset into training set and test set
X_train_feat, X_test_feat, y_train_feat, y_test_feat = train_test_split(features_feat, labels_feat, test_size=0.40, random_state = 42)

In [69]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

model_le=LogisticRegression(random_state=42, max_iter=10000)
model_le.fit(X_train_feat, y_train_feat)

parameters = {'C':[0.8,0.9,1,1.1,1.2], 'penalty':['l1','l2'],  'solver':['liblinear','saga']}

model_le_grid = GridSearchCV(model_le, parameters, cv=10, verbose=1, n_jobs=-1).fit(X_train_feat, y_train_feat)
print(model_le_grid.best_params_)

print('Logistic Regression Train score:', model_le.score(X_train_feat,y_train_feat)*100)
print('Logistic Regression Cross validation score:', model_le_grid.best_score_*100)

Fitting 10 folds for each of 20 candidates, totalling 200 fits
{'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}
Logistic Regression Train score: 85.41666666666666
Logistic Regression Cross validation score: 88.02432283029299


In [70]:
model_le=LogisticRegression(C=1.1, penalty='l1', solver='liblinear')
model_le.fit(X_train_feat, y_train_feat)
print('Logistic Regression Train score:', model_le.score(X_train_feat,y_train_feat)*100)

predictions = model_le.predict(X_test_feat)
print("Accuracy:", metrics.accuracy_score(predictions, y_test_feat)*100)

Logistic Regression Train score: 88.09523809523809
Accuracy: 88.16964285714286
