# CRM Data Analyst Case - Part 2

### Description

The objective of the team is to build a predictive model that will produce the highest profit for the
next direct marketing campaign, scheduled for the next month. 
The new campaign, sixth, aims at
selling a new gadget to the Customer Database. 
To build the model, a pilot campaign involving **2240 customers** was carried out. 
The customers were selected at random and contacted by phone regarding the acquisition of the gadget. 
During the following months, customers who bought the offer were properly labeled. 
The total cost of the sample campaign was 6.720MU and the revenue generated by the customers who accepted the offer was 3.674MU. 
Globally the campaign had a profit of -3.046MU. 
The success rate of the campaign was 15%. 

The objective is of the team is to develop a model that predicts customer behavior and to apply it to the rest of the customer base.
Hopefully the model will allow the company to cherry pick the customers that are most likely to
purchase the offer while leaving out the non-respondents, making the next campaign highly
profitable. Moreover, other than maximizing the profit of the campaign, the CMO is interested in
understanding to study the characteristic features of those customers who are willing to buy the
gadget.

### Key Objectives are:

1. Explore the data – don’t just plot means and counts. Provide insights, define cause and
effect. Provide a better understanding of the characteristic features of respondents;
2. Propose and describe a customer segmentation based on customers behaviors;
3. Create a predictive model which allows the company to maximize the profit of the next
marketing campaign.
4. Whatever else you think is necessary.

### Deliverables:

1. Data Exploration;
2. Segmentation;
3. Classification Model;
4. Feature Importance.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

import warnings
warnings.filterwarnings("ignore")

import dataframe_image as dfi # converting dataframe into Image
pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

In [3]:
# Read Dataset
#url = '/Users/adityaagarwal/Library/CloudStorage/OneDrive-NortheasternUniversity/Jupyter Notebook/Projects/CRM-Analysis-for-Marketing-data/marketing data/clean_marketing_data.csv'
url = 'D:/OneDrive - Northeastern University/Jupyter Notebook/Data Science Projects/CRM-Analysis-for-Marketing-data/marketing data/marketing_data.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Response,Complain,Country
0,1826,1970,Graduation,Divorced,"$84,835.00",0,0,6/16/2014,0,189,104,379,111,189,218,1,4,4,6,1,0,0,0,0,0,1,0,SP
1,1,1961,Graduation,Single,"$57,091.00",0,0,6/15/2014,0,464,5,64,7,0,37,1,7,3,7,5,0,0,0,0,1,1,0,CA
2,10476,1958,Graduation,Married,"$67,267.00",0,1,5/13/2014,0,134,11,59,15,2,30,1,3,2,5,2,0,0,0,0,0,0,0,US
3,1386,1967,Graduation,Together,"$32,474.00",1,1,5/11/2014,0,10,0,1,0,0,0,1,1,0,2,7,0,0,0,0,0,0,0,AUS
4,5371,1989,Graduation,Single,"$21,474.00",1,0,4/8/2014,0,6,16,24,11,0,34,2,3,1,2,7,1,0,0,0,0,1,0,SP


In [4]:
df_new = df.copy()
df_new.drop(['ID', 'Year_Birth', 'Dt_Customer', 'Country'], axis=1, inplace=True)

In [5]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 24 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Education            2240 non-null   object
 1   Marital_Status       2240 non-null   object
 2    Income              2216 non-null   object
 3   Kidhome              2240 non-null   int64 
 4   Teenhome             2240 non-null   int64 
 5   Recency              2240 non-null   int64 
 6   MntWines             2240 non-null   int64 
 7   MntFruits            2240 non-null   int64 
 8   MntMeatProducts      2240 non-null   int64 
 9   MntFishProducts      2240 non-null   int64 
 10  MntSweetProducts     2240 non-null   int64 
 11  MntGoldProds         2240 non-null   int64 
 12  NumDealsPurchases    2240 non-null   int64 
 13  NumWebPurchases      2240 non-null   int64 
 14  NumCatalogPurchases  2240 non-null   int64 
 15  NumStorePurchases    2240 non-null   int64 
 16  NumWeb

In [10]:
education = pd.get_dummies(df_new['Education'],drop_first=True)
marital = pd.get_dummies(df_new['Marital_Status'],drop_first=True)

In [11]:
new_data = pd.concat([df_new,education,marital],axis=1)

In [12]:
new_data

Unnamed: 0,Education,Marital_Status,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Response,Complain,Basic,Graduation,Master,PhD,Alone,Divorced,Married,Single,Together,Widow,YOLO
0,Graduation,Divorced,"$84,835.00",0,0,0,189,104,379,111,189,218,1,4,4,6,1,0,0,0,0,0,1,0,False,True,False,False,False,True,False,False,False,False,False
1,Graduation,Single,"$57,091.00",0,0,0,464,5,64,7,0,37,1,7,3,7,5,0,0,0,0,1,1,0,False,True,False,False,False,False,False,True,False,False,False
2,Graduation,Married,"$67,267.00",0,1,0,134,11,59,15,2,30,1,3,2,5,2,0,0,0,0,0,0,0,False,True,False,False,False,False,True,False,False,False,False
3,Graduation,Together,"$32,474.00",1,1,0,10,0,1,0,0,0,1,1,0,2,7,0,0,0,0,0,0,0,False,True,False,False,False,False,False,False,True,False,False
4,Graduation,Single,"$21,474.00",1,0,0,6,16,24,11,0,34,2,3,1,2,7,1,0,0,0,0,1,0,False,True,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,PhD,Divorced,"$66,476.00",0,1,99,372,18,126,47,48,78,2,5,2,11,4,0,0,0,0,0,0,0,False,False,False,True,False,True,False,False,False,False,False
2236,2n Cycle,Married,"$31,056.00",1,0,99,5,10,13,3,8,16,1,1,0,3,8,0,0,0,0,0,0,0,False,False,False,False,False,False,True,False,False,False,False
2237,Graduation,Divorced,"$46,310.00",1,0,99,185,2,88,15,5,14,2,6,1,5,8,0,0,0,0,0,0,0,False,True,False,False,False,True,False,False,False,False,False
2238,Graduation,Married,"$65,819.00",0,0,99,267,38,701,149,165,63,1,5,4,10,3,0,0,0,0,0,0,0,False,True,False,False,False,False,True,False,False,False,False


In [10]:
new_data.drop(['Education','Marital_Status'],axis=1,inplace=True)

In [11]:
# Split dataset into features and labels
features = new_data.drop('Response', axis =1)
labels = df.Response

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.40, random_state = 5)

In [12]:
print('Train_x:',X_train.shape)
print('Train_y:',y_train.shape)
print('Test_x:',X_test.shape)
print('Test_y:',y_test.shape)

Train_x: (1344, 32)
Train_y: (1344,)
Test_x: (896, 32)
Test_y: (896,)


In [13]:
## Standardizing the data
from sklearn.preprocessing import StandardScaler
obje_ss=StandardScaler()

x_train_ss=obje_ss.fit_transform(X_train)
x_test_ss=obje_ss.fit_transform(X_test)

## 1. Logistic regression

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

model_le=LogisticRegression(random_state=42, max_iter=10000)
model_le.fit(X_train, y_train)

parameters = {'C':[0.8,0.9,1,1.1,1.2], 'penalty':['l1','l2'],  'solver':['liblinear','saga']}

model_le_grid = GridSearchCV(model_le, parameters, cv=10, verbose=1, n_jobs=-1).fit(X_train, y_train)
print(model_le_grid.best_params_)

print('Logistic Regression Train score:', model_le.score(X_train,y_train)*100)
print('Logistic Regression Cross validation score:', model_le_grid.best_score_*100)

Fitting 10 folds for each of 20 candidates, totalling 200 fits
{'C': 1.1, 'penalty': 'l1', 'solver': 'liblinear'}
Logistic Regression Train score: 87.5
Logistic Regression Cross validation score: 89.05527915975678


In [39]:
model_le=LogisticRegression(C=1.1, penalty='l1', solver='liblinear')
model_le.fit(X_train,y_train)
print('Logistic Regression Train score:', model_le.score(X_train,y_train)*100)

predictions = model_le.predict(X_test)
print("Accuracy:",metrics.accuracy_score(predictions, y_test)*100)

Logistic Regression Train score: 89.0625
Accuracy: 85.71428571428571


## 2. Support vector machine

In [16]:
from sklearn import svm

model_svc=svm.SVC(random_state=42)
parameters = {'kernel':('linear', 'rbf','poly'), 'C':[0.8,0.9,1,1.1,1.2],'degree':[3,4,5,6]}
model_svc_grid = GridSearchCV(model_svc, parameters,cv=10,verbose=1,n_jobs=-1).fit(x_train_ss,y_train)
print(model_svc_grid.best_params_)

print('Support Vecktor Classification Cross validation score:', model_svc_grid.best_score_*100)

Fitting 10 folds for each of 60 candidates, totalling 600 fits
{'C': 0.8, 'degree': 3, 'kernel': 'linear'}
Support Vecktor Classification Cross validation score: 88.98231066887783


In [17]:
model_svc=svm.SVC(degree =3, C=0.8,kernel='linear')
model_svc.fit(x_train_ss,y_train)
print('Support Vecktor Classification Train score:', model_svc.score(x_train_ss,y_train)*100)

predictions = model_svc_grid.predict(x_test_ss)
print("Accuracy:",metrics.accuracy_score(predictions, y_test)*100)

Support Vecktor Classification Train score: 90.32738095238095
Accuracy: 86.49553571428571


## 3. XGBoost

In [18]:
from xgboost import XGBClassifier

model_xgb=XGBClassifier(n_jobs=-1,random_state=42,eval_metric='logloss')
parameters = {'n_estimators':[50,100,200],'max_depth':[3,4,5,6],'learning_rate':[0.1,0.01]}
model_xgb_grid = GridSearchCV(model_xgb, parameters,cv=10,verbose=1).fit(X_train,y_train)
print(model_xgb_grid.best_params_)

print('XGB Classifier Cros validation score:',model_xgb_grid.best_score_*100)


Fitting 10 folds for each of 24 candidates, totalling 240 fits
{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50}
XGB Classifier Cros validation score: 89.2072968490879


In [19]:
model_xgb=XGBClassifier(n_jobs=-1,random_state=42,n_estimators=50,max_depth=5,learning_rate=0.1,eval_metric='logloss')
model_xgb.fit(X_train,y_train)
print('XGB Classifier Classifier Train score:',model_xgb.score(X_train,y_train)*100)

predictions = model_xgb_grid.predict(X_test)
print("Accuracy:", metrics.accuracy_score(predictions, y_test)*100)

XGB Classifier Classifier Train score: 96.20535714285714
Accuracy: 86.49553571428571


## 4. Generating the Model - Random Forest Regressor using GridSearchCv


In [20]:
# Using Grid Search to find the best parameters
param_grid = { 
    'n_estimators': [50, 100, 200],
    'max_features': ['auto'],
    'max_depth' : [None,3,5,8],
    'criterion' :['gini'],
    'min_samples_split':[2,3,4]
}

# Training RF Models with K-Fold of 5 
rf_models = GridSearchCV(RandomForestClassifier(random_state = 5), 
                         param_grid=param_grid, cv=5, verbose=1)

rf_models.fit(X_train, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=5),
             param_grid={'criterion': ['gini'], 'max_depth': [None, 3, 5, 8],
                         'max_features': ['auto'],
                         'min_samples_split': [2, 3, 4],
                         'n_estimators': [50, 100, 200]},
             verbose=1)

In [21]:
# Get the predictions
predictions = rf_models.predict(X_test)

# Print the Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(predictions, y_test))

Accuracy: 0.8616071428571429


## Feature Importance

Calculating top 15 features

In [27]:
# Print Feature Importance

feature_importance = pd.DataFrame(data={"features": X_test.columns, 
                            "importance":rf_models.best_estimator_.feature_importances_*100})

feature_importance = feature_importance.sort_values('importance', 
            ascending=False).head(15)
feature_imp_styled = feature_importance.style.background_gradient(cmap='coolwarm', low=1, high=0)

feature_imp_styled

Unnamed: 0,features,importance
3,Recency,10.658893
4,MntWines,8.670528
6,MntMeatProducts,8.194544
0,Income,7.778617
9,MntGoldProds,6.087945
14,NumWebVisitsMonth,4.758755
18,AcceptedCmp1,4.727924
5,MntFruits,4.722485
17,AcceptedCmp5,4.479031
8,MntSweetProducts,4.460659


In [115]:
# Saving dataframe as an image
dfi.export(feature_imp_styled, 'Images/Feature_Importance_IFood_CRM.png')

objc[83865]: Class WebSwapCGLLayer is implemented in both /System/Library/Frameworks/WebKit.framework/Versions/A/Frameworks/WebCore.framework/Versions/A/Frameworks/libANGLE-shared.dylib (0x7ffa52cec550) and /Applications/Google Chrome.app/Contents/Frameworks/Google Chrome Framework.framework/Versions/109.0.5414.87/Libraries/libGLESv2.dylib (0x115443880). One of the two will be used. Which one is undefined.
[0124/160259.219500:INFO:headless_shell.cc(223)] 49087 bytes written to file /var/folders/4k/bqkmm_s90gbdmjr_kbd0jv7w0000gn/T/tmp9u1t167n/temp.png


In [32]:
feat_values = feature_importance.features.values.tolist()
feat_values

['Recency',
 'MntWines',
 'MntMeatProducts',
 'Income',
 'MntGoldProds',
 'NumWebVisitsMonth',
 'AcceptedCmp1',
 'MntFruits',
 'AcceptedCmp5',
 'MntSweetProducts',
 'MntFishProducts',
 'NumStorePurchases',
 'NumCatalogPurchases',
 'AcceptedCmp3',
 'NumWebPurchases']

In [35]:
# Split dataset into features and labels
features_feat = new_data.loc[:, feat_values]
labels_feat = df.Response

# Split dataset into training set and test set
X_train_feat, X_test_feat, y_train_feat, y_test_feat = train_test_split(features_feat, labels_feat, test_size=0.40, random_state = 5)

In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

model_le=LogisticRegression(random_state=42, max_iter=10000)
model_le.fit(X_train_feat, y_train_feat)

parameters = {'C':[0.8,0.9,1,1.1,1.2], 'penalty':['l1','l2'],  'solver':['liblinear','saga']}

model_le_grid = GridSearchCV(model_le, parameters, cv=10, verbose=1, n_jobs=-1).fit(X_train_feat, y_train_feat)
print(model_le_grid.best_params_)

print('Logistic Regression Train score:', model_le.score(X_train_feat,y_train_feat)*100)
print('Logistic Regression Cross validation score:', model_le_grid.best_score_*100)

Fitting 10 folds for each of 20 candidates, totalling 200 fits
{'C': 1.1, 'penalty': 'l1', 'solver': 'liblinear'}
Logistic Regression Train score: 87.42559523809523
Logistic Regression Cross validation score: 88.38529574350471


In [41]:
model_le=LogisticRegression(C=1.1, penalty='l1', solver='liblinear')
model_le.fit(X_train_feat, y_train_feat)
print('Logistic Regression Train score:', model_le.score(X_train_feat,y_train_feat)*100)

predictions = model_le.predict(X_test_feat)
print("Accuracy:", metrics.accuracy_score(predictions, y_test_feat)*100)

Logistic Regression Train score: 89.0625
Accuracy: 85.71428571428571
