## Context

ABC Supermarket is planning its year-end sale and wants to launch a new offer for existing customers only: a gold membership, which gives a 20% discount on all purchases, for \$499 instead of the usual \$999. The offer will be promoted through a phone campaign. To reduce the cost of that campaign, the best approach is to build a predictive model that identifies the customers most likely to purchase the offer, using the data gathered during last year's campaign.

We will build a model that classifies whether a customer will respond positively to the offer.

## Import Statements

In [None]:
# Metrics, Model Selection, Preprocessing and Resampling Imports
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, NearMiss

# Model Imports
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Ensemble Imports
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

# Visualization Imports
import matplotlib.pyplot as plt
import seaborn as sns

# Data Handling Imports
import pandas as pd
import numpy as np

In [None]:
df = pd.read_excel("marketing_data.xlsx")

In [None]:
df.shape

In [None]:
df

## Column Definitions
- Response (target) - 1 if customer accepted the offer in the last campaign, 0 otherwise
- Complain - 1 if a customer complained in the last 2 years, 0 otherwise
- Dt_Customer - date of customer’s enrolment with the company
- Education - customer’s level of education
- Marital_Status - customer’s marital status
- Kidhome - number of small children in customer’s household
- Teenhome - number of teenagers in customer’s household
- Income - customer’s yearly household income
- MntFishProducts - the amount spent on fish products in the last 2 years
- MntMeatProducts - the amount spent on meat products in the last 2 years
- MntFruits - the amount spent on fruit products in the last 2 years
- MntSweetProducts - the amount spent on sweet products in the last 2 years
- MntWines - the amount spent on wine products in the last 2 years
- MntGoldProds - the amount spent on gold products in the last 2 years
- NumDealsPurchases - number of purchases made with discount
- NumCatalogPurchases - number of purchases made using catalog
- NumStorePurchases - number of purchases made directly in stores
- NumWebPurchases - number of purchases made through the company’s web site
- NumWebVisitsMonth - number of visits to company’s web site in the last month
- Recency - number of days since the last purchase
- ID - unique customer-id
- Year_Birth - customer's year of birth (can be converted to age)

## Data Profiling

### Data Types and Unique Values

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df['Education'].unique()


In [None]:
df['Marital_Status'].unique()


In [None]:
df['Recency'].unique()

In [None]:
df.describe()

## Initial Exploratory Data Analysis

### Outlier Detection

In [None]:
import datetime

current_year = datetime.datetime.now().year
df['Age'] = current_year - df['Year_Birth']


In [None]:
df.loc[df.Income.isnull(), 'Income'] = np.nanmean(df.Income)

'Year_Birth' is converted to 'Age' for ease of use, and missing 'Income' values are replaced with the mean of the 'Income' column.

In [None]:
fig, ax = plt.subplots()

sns.scatterplot(data=df, x='Age', y='Income', ax=ax, color='blue')

plt.show()

The graph above shows several outliers in the 'Age' and 'Income' columns, which need to be addressed to improve the model metrics.
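One common way to put a number on such outliers is the 1.5 × IQR rule. The sketch below is only an illustration: it uses made-up ages rather than the campaign data, the helper name `iqr_outliers` is hypothetical, and the rule is a different technique from the fixed cut-offs applied later in this notebook.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical ages, including two implausible entries
ages = [34, 41, 45, 47, 52, 55, 58, 61, 66, 124, 131]
outliers = iqr_outliers(ages)
print(outliers)  # [124, 131]
```

A rule like this adapts to each column's spread, whereas fixed thresholds are easier to explain to a business audience; which one fits better is a judgment call.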

### Response Percentage

In [None]:
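# Sort by class label so the counts line up with the labels below (0 first, then 1)
response_counts = df['Response'].value_counts().sort_index()
colors = sns.color_palette('pastel')[0:len(response_counts)]

# Create pie chart
plt.pie(response_counts, labels=["Didn't Accept", 'Accepted'], colors=colors, autopct='%.0f%%')
plt.legend()
plt.show()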

Classification algorithms tend to be biased towards the majority class. This can lead to poor performance in identifying the minority class, even though it might be the class of interest.

As shown above, the classes are heavily imbalanced: far fewer customers accepted the offer than declined it. The data should be rebalanced before modelling so that precision and recall on the 'Accepted' class are not sacrificed.
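To make the bias concrete, here is a minimal sketch with made-up labels (85 declines and 15 acceptances, which is not the actual ratio in this dataset). It shows that always predicting the majority class scores high accuracy with zero recall, and then balances the classes by duplicating minority labels at random, which is the idea behind random oversampling.

```python
import random
from collections import Counter

# Hypothetical labels: 85 customers declined (0), 15 accepted (1)
labels = [0] * 85 + [1] * 15

# A "model" that always predicts the majority class
majority_class = Counter(labels).most_common(1)[0][0]
predictions = [majority_class] * len(labels)

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
true_positives = sum(p == 1 and t == 1 for p, t in zip(predictions, labels))
recall = true_positives / labels.count(1)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")

# Random oversampling: draw minority labels with replacement until classes match
rng = random.Random(42)
minority = [t for t in labels if t == 1]
extra = rng.choices(minority, k=labels.count(0) - labels.count(1))
balanced = labels + extra
print(Counter(balanced))
```

The baseline looks good on accuracy alone while never finding a single buyer, which is why F1, precision and recall are the more informative metrics for this campaign.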

In [None]:
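# Sort by class label so the counts line up with the labels below (0 first, then 1)
complain_counts = df['Complain'].value_counts().sort_index()
colors = sns.color_palette('pastel')[0:len(complain_counts)]

# Create pie chart
plt.pie(complain_counts, labels=["Didn't Complain", 'Complained'], colors=colors, autopct='%.0f%%')
plt.legend()
plt.show()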

Similar to the 'Response' column, the 'Complain' column is heavily imbalanced: far fewer customers complained than did not.

In [None]:
fig, axes = plt.subplots(ncols=2, nrows=1, figsize=(12, 4))

for ax, col in zip(axes.flat, ['Kidhome', 'Teenhome']):
    # Count customers per (col, Response) pair; naming the count column explicitly
    # avoids relying on the default name, which differs between pandas versions
    tmp = df[[col, 'Response']].value_counts().reset_index(name='count')
    tmp['Response'] = tmp['Response'].replace({0: "Didn't Accept", 1: 'Accepted'})

    sns.barplot(x=col, y='count', hue='Response', data=tmp, ax=ax)

plt.show()

### Categorical Feature Analysis

In [None]:
categorical_cols = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'Complain']

fig, axes = plt.subplots(ncols=2, nrows=5, figsize=(16, 18))

# One count plot per categorical column, split by Response
for ax, col in zip(axes.flat, categorical_cols):
    sns.countplot(data=df, x=col, hue='Response', ax=ax, palette='Set2')

    ax.set_title(col)
    ax.set_ylabel('')
    ax.set_xlabel('')

plt.show()

### Continuous Feature Analysis

In [None]:
continuous_cols = ['Age', 'Income', 'Recency', 'MntFishProducts', 'MntMeatProducts', 'MntFruits', 'MntSweetProducts', 'MntWines', 'MntGoldProds']

fig, axes = plt.subplots(ncols=3, nrows=3, figsize=(12, 10))

# One density plot per continuous column, split by Response
for ax, col in zip(axes.flat, continuous_cols):
    sns.kdeplot(data=df, x=col, hue='Response', fill=True, ax=ax)

plt.show()

According to the graphs above, MntMeatProducts and MntWines span a much wider range of values than the other spending categories, which may produce more distant outliers.

## Data Preprocessing

### Outlier Removal 

In [None]:
df.drop(df.index[df.Age >= 80], inplace=True)
df.drop(df.index[df.Income >= 200000], inplace=True)

df.drop(df.index[df.MntMeatProducts > 1000], inplace=True)
df.drop(df.index[df.MntWines > 1500], inplace=True)
df.drop(df.index[df.MntSweetProducts > 220], inplace=True)
df.drop(df.index[df.MntGoldProds > 270], inplace=True)

Outliers in the columns above were dropped because their extreme values lie much farther from the rest of the data than the outliers in the other columns.

## Baseline Modelling

### Feature Selection

In [None]:
# Select the feature columns explicitly; .copy() keeps the later encoding
# steps from modifying (or warning about) a view of df
X = df[
     [
     #'Dt_Customer',
     'Age',
     'Education',
     'Marital_Status',
     'MntFishProducts',
     'MntMeatProducts',
     'MntFruits',
     'MntSweetProducts',
     'MntWines',
     'MntGoldProds',
     'Income',
     #'Complain',
     #'Kidhome',
     'Teenhome',
     'NumDealsPurchases',
     'NumCatalogPurchases',
     'NumStorePurchases',
     'NumWebPurchases',
     #'NumWebVisitsMonth',
     'Recency'
    ]].copy()

y = df['Response']



In [None]:
X

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Replace Marital_Status values
marital_status_mapping = {
    'Widow': 'Single',
    'Alone': 'Single',
    'Absurd': 'Single',
    'YOLO': 'Single',
    'Together': 'Married'
}

X['Marital_Status'] = X['Marital_Status'].replace(marital_status_mapping)

# Replace Education values
education_mapping = {
    'Graduation': 'Bachelors',
    '2n Cycle': 'Master',
    'Basic': 'High School'
}

X['Education'] = X['Education'].replace(education_mapping)

# Ordinal encode the Education column
ordinal_encoder = OrdinalEncoder(categories=[['High School', 'Bachelors', 'Master', 'PhD']])
X['Education'] = ordinal_encoder.fit_transform(X[['Education']])

def one_hot_encode(data, column):
    encoded = pd.get_dummies(data[column], drop_first=True)
    data = data.drop(column, axis=1)
    data = data.join(encoded)
    return data

X = one_hot_encode(X, 'Marital_Status')


In [None]:
fig, ax = plt.subplots(figsize=(16, 16))

sns.heatmap(X.corr(), ax=ax, cmap='crest')

plt.show()

In [None]:
# Split before resampling so that duplicated minority rows cannot leak into the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

### Oversampling

In [None]:
# Resample the training split only; the test split keeps the original class distribution
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

X_resampled.shape, y_resampled.shape

In [None]:
# sm = SMOTE(random_state=42)
# X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

# X_resampled.shape, y_resampled.shape

### Undersampling

In [None]:
# rus = RandomUnderSampler(random_state=42)
# X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

# X_resampled.shape, y_resampled.shape

In [None]:
# nm = NearMiss()
# X_resampled, y_resampled = nm.fit_resample(X_train, y_train)

# X_resampled.shape, y_resampled.shape

In [None]:
# Sort by class label so the counts line up with the labels below (0 first, then 1)
resampled_counts = y_resampled.value_counts().sort_index()
colors = sns.color_palette('pastel')[0:len(resampled_counts)]

# Create pie chart of the resampled training labels
plt.pie(resampled_counts, labels=["Didn't Accept", 'Accepted'], colors=colors, autopct='%.0f%%')
plt.legend()
plt.show()


In [None]:
# Train on the resampled data; evaluate on the untouched test split
X_train, y_train = X_resampled, y_resampled

X_train.shape, X_test.shape, y_train.shape, y_test.shape

### Standardization

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

## Machine Learning Models

In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

In [None]:
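# 'sag' supports only the l2 penalty, so it gets its own grid to avoid failed fits
log_params = [
    {
        'C': [100, 1000],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear', 'saga'],
        'class_weight': ['balanced', None]
    },
    {
        'C': [100, 1000],
        'penalty': ['l2'],
        'solver': ['sag'],
        'class_weight': ['balanced', None]
    }
]

clf = LogisticRegression(max_iter=5000)
log_grid = GridSearchCV(
    clf, 
    log_params, 
    scoring='f1',
    return_train_score=True,
    cv=kf
)
# Logistic regression is scale-sensitive: fit on the standardized training features
log_grid.fit(X_train_scaled, y_train)
print("Best Parameters: \n", log_grid.best_params_)
# Apply the same fitted scaler to the test features before predicting
log_preds = log_grid.predict(scaler.transform(X_test))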

In [None]:
nb_params = {
    'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5],
    'priors': [None , [0.5, 0.5], [0.6, 0.4], [0.4, 0.6]]
}

gnb = GaussianNB()
nb_grid = GridSearchCV(
    gnb, 
    nb_params, 
    scoring='f1',
    return_train_score=True,
    cv=kf
)
nb_grid.fit(X_train, y_train)
print("Best Parameters: \n", nb_grid.best_params_)
naive_preds = nb_grid.predict(X_test)

In [None]:
dt_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [ 5, 10, 15],
    'splitter': ['best', 'random'],
    'min_samples_split': [10, 15, 20],
    'min_samples_leaf': [5, 10, 15],
    'max_features': ['sqrt', 'log2'],
    'max_leaf_nodes': [10, 20, 50, 100]
}

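# Fixed seed: the 'random' splitter and max_features subsampling make fits stochastic
clf = DecisionTreeClassifier(random_state=42)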
dt_grid = GridSearchCV(
    clf, 
    dt_params, 
    scoring='f1',
    return_train_score=True,
    cv=kf
)
dt_grid.fit(X_train, y_train)
print("Best Parameters: \n", dt_grid.best_params_)
tree_preds = dt_grid.predict(X_test)

In [None]:
svm_params = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001, 'auto'],
    'kernel': ['rbf']
}

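svm = SVC()  # gamma is set by the grid search
svm_grid = GridSearchCV(
    svm, 
    svm_params, 
    scoring='f1',
    return_train_score=True,
    cv=kf  # same folds as the other models
)
# SVMs are scale-sensitive: fit on the standardized training features
svm_grid.fit(X_train_scaled, y_train)
print("Best Parameters: \n", svm_grid.best_params_)
# Apply the same fitted scaler to the test features before predicting
svm_preds = svm_grid.predict(scaler.transform(X_test))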


In [None]:
knn_params = {
    'n_neighbors': [3, 5, 9, 11],
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    }

knn = KNeighborsClassifier()
knn_grid = GridSearchCV(
    knn, 
    knn_params, 
    scoring='f1',
    return_train_score=True,
    cv=kf
)
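# KNN is distance-based and scale-sensitive: fit on the standardized training features
knn_grid.fit(X_train_scaled, y_train)
print("Best Parameters: \n", knn_grid.best_params_)
# Apply the same fitted scaler to the test features before predicting
knn_preds = knn_grid.predict(scaler.transform(X_test))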

## Ensemble Methods

## Results

In [None]:
def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall, F1 and ROC AUC for hard class predictions."""
    return (
        accuracy_score(y_true, y_pred),
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
        f1_score(y_true, y_pred),
        # With hard 0/1 predictions, ROC AUC equals balanced accuracy
        roc_auc_score(y_true, y_pred),
    )

acc, prec, rec, f1, auc = evaluate(y_test, log_preds)
naive_acc, naive_prec, naive_rec, naive_f1, naive_auc = evaluate(y_test, naive_preds)
svm_acc, svm_prec, svm_rec, svm_f1, svm_auc = evaluate(y_test, svm_preds)
tree_acc, tree_prec, tree_rec, tree_f1, tree_auc = evaluate(y_test, tree_preds)
knn_acc, knn_prec, knn_rec, knn_f1, knn_auc = evaluate(y_test, knn_preds)


In [None]:
print("Logistic Regression Accuracy: %.4f" % acc)
print("Logistic Regression Precision: %.4f" % prec)
print("Logistic Regression Recall: %.4f" % rec)
print("Logistic Regression F1: %.4f" % f1)
print("Logistic Regression AUC: %.4f" % auc)

print("\nNaive Bayes Accuracy: %.4f" % naive_acc)
print("Naive Bayes Precision: %.4f" % naive_prec)
print("Naive Bayes Recall: %.4f" % naive_rec)
print("Naive Bayes F1: %.4f" % naive_f1)
print("Naive Bayes AUC: %.4f" % naive_auc)

print("\nSVM Accuracy: %.4f" % svm_acc)
print("SVM Precision: %.4f" % svm_prec)
print("SVM Recall: %.4f" % svm_rec)
print("SVM F1: %.4f" % svm_f1)
print("SVM AUC: %.4f" % svm_auc)

print("\nDecision Tree Accuracy: %.4f" % tree_acc)
print("Decision Tree Precision: %.4f" % tree_prec)
print("Decision Tree Recall: %.4f" % tree_rec)
print("Decision Tree F1: %.4f" % tree_f1)
print("Decision Tree AUC: %.4f" % tree_auc)

print("\nKNN Accuracy: %.4f" % knn_acc)
print("KNN Precision: %.4f" % knn_prec)
print("KNN Recall: %.4f" % knn_rec)
print("KNN F1: %.4f" % knn_f1)
print("KNN AUC: %.4f" % knn_auc)


### Logistic Regression Results

In [None]:
print("Logistic Regression Accuracy: %.4f" % acc)
print("Logistic Regression Precision: %.4f" % prec)
print("Logistic Regression Recall: %.4f" % rec)
print("Logistic Regression F1: %.4f" % f1)
print("Logistic Regression AUC: %.4f" % auc)

log_results = pd.DataFrame(log_grid.cv_results_)
log_results.sort_values(by='rank_test_score').head()

In [None]:
# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predictions
cm_log = confusion_matrix(y_test, log_preds)

fig, ax = plt.subplots()

sns.heatmap(cm_log, ax=ax, cmap='Blues', annot=True, fmt='g', cbar=False,
            xticklabels=['Prediction: 0', 'Prediction: 1'],
            yticklabels=['Actual: 0', 'Actual: 1'])

plt.show()

### Naive Bayes Results

In [None]:
print("\nNaive Bayes Accuracy: %.4f" % naive_acc)
print("Naive Bayes Precision: %.4f" % naive_prec)
print("Naive Bayes Recall: %.4f" % naive_rec)
print("Naive Bayes F1: %.4f" % naive_f1)
print("Naive Bayes AUC: %.4f" % naive_auc)


nb_results = pd.DataFrame(nb_grid.cv_results_)
nb_results.sort_values(by='rank_test_score').head()

In [None]:
# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predictions
cm_nb = confusion_matrix(y_test, naive_preds)

fig, ax = plt.subplots()

sns.heatmap(cm_nb, ax=ax, cmap='Blues', annot=True, fmt='g', cbar=False,
            xticklabels=['Prediction: 0', 'Prediction: 1'],
            yticklabels=['Actual: 0', 'Actual: 1'])

plt.show()

### Support Vector Machine Results

In [None]:
print("\nSVM Accuracy: %.4f" % svm_acc)
print("SVM Precision: %.4f" % svm_prec)
print("SVM Recall: %.4f" % svm_rec)
print("SVM F1: %.4f" % svm_f1)
print("SVM AUC: %.4f" % svm_auc)

svm_results = pd.DataFrame(svm_grid.cv_results_)
svm_results.sort_values(by='rank_test_score').head()

In [None]:
# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predictions
cm_svc = confusion_matrix(y_test, svm_preds)

fig, ax = plt.subplots()

sns.heatmap(cm_svc, ax=ax, cmap='Blues', annot=True, fmt='g', cbar=False,
            xticklabels=['Prediction: 0', 'Prediction: 1'],
            yticklabels=['Actual: 0', 'Actual: 1'])

plt.show()

### Decision Tree Results

In [None]:
print("\nDecision Tree Accuracy: %.4f" % tree_acc)
print("Decision Tree Precision: %.4f" % tree_prec)
print("Decision Tree Recall: %.4f" % tree_rec)
print("Decision Tree F1: %.4f" % tree_f1)
print("Decision Tree AUC: %.4f" % tree_auc)

dt_results = pd.DataFrame(dt_grid.cv_results_)
dt_results.sort_values(by='rank_test_score').head()

In [None]:
# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predictions
cm_dt = confusion_matrix(y_test, tree_preds)

fig, ax = plt.subplots()

sns.heatmap(cm_dt, ax=ax, cmap='Blues', annot=True, fmt='g', cbar=False,
            xticklabels=['Prediction: 0', 'Prediction: 1'],
            yticklabels=['Actual: 0', 'Actual: 1'])

plt.show()

### K-Nearest Neighbors Results

In [None]:
print("\nKNN Accuracy: %.4f" % knn_acc)
print("KNN Precision: %.4f" % knn_prec)
print("KNN Recall: %.4f" % knn_rec)
print("KNN F1: %.4f" % knn_f1)
print("KNN AUC: %.4f" % knn_auc)

knn_results = pd.DataFrame(knn_grid.cv_results_)
knn_results.sort_values(by='rank_test_score').head()

In [None]:
# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predictions
cm_knn = confusion_matrix(y_test, knn_preds)

fig, ax = plt.subplots()

sns.heatmap(cm_knn, ax=ax, cmap='Blues', annot=True, fmt='g', cbar=False,
            xticklabels=['Prediction: 0', 'Prediction: 1'],
            yticklabels=['Actual: 0', 'Actual: 1'])

plt.show()