## Problem Definition

MarketCo is a supermarket that tracks everything its clients buy. It plans to launch an email marketing campaign and wants to predict whether a client will visit the supermarket in the next seven days, so that it can focus the campaign on those clients and avoid spending marketing resources on clients who are unlikely to visit.

### Importing Libraries

In [1]:
from datetime import datetime,date, timedelta
import pandas as pd
%matplotlib inline
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import sklearn
import warnings
warnings.filterwarnings("ignore")

#import machine learning related libraries
from sklearn.svm import SVC
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score, train_test_split


### Loading the data

In [2]:
data = pd.read_csv('../data/data.csv')
target = pd.read_csv('../data/train_labels.csv')
test_labels = pd.read_csv('../data/test_labels.csv')

In [3]:
print("The shape of the data is:" + str(data.shape))

The shape of the data is:(10664209, 8)


In [None]:
#First rows of the data
data.head()

In [None]:
#Checking each feature's type.
data.info()

In [None]:
#Checking if there are any null values at any feature column.
data.isnull().sum()

In [None]:
#Converting date column into datetime type
data['date'] = pd.to_datetime(data['date'])

In [None]:
#Eliminating columns I will not use for modeling
data.drop(['category_code', 'item_code','qty_sold'], axis=1, inplace=True)

### Making sure data is clean

In [None]:
#Making sure all dates are inside the interval of 2020-06-01 & 2020-10-24

data = data[(data['date'] >= '2020-06-01') & (data['date'] <= '2020-10-24')]
data.shape

In [None]:
#Removing rows in which more than 75% of the values are NaN
perc = 75
min_count =  int(((100-perc)/100)*data.shape[1] + 1)
verified_data = data.dropna( axis=0, 
                    thresh=min_count)

verified_data.shape
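
The threshold arithmetic above can be checked on a small frame. This is only a sketch: the frame `toy`, its values, and the `store_id` column are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-column frame; the values and the store_id column are invented.
toy = pd.DataFrame({
    "client_id": [1, 2, np.nan],
    "date": ["2020-06-01", np.nan, np.nan],
    "price": [10.0, np.nan, np.nan],
    "cost": [7.0, np.nan, np.nan],
    "store_id": ["A", "B", "C"],
})

perc = 75
# Minimum number of non-NaN values a row needs in order to be kept.
min_count = int(((100 - perc) / 100) * toy.shape[1] + 1)

# Row 0 has 5 non-NaN values, row 1 has 2 (60% NaN), row 2 has 1 (80% NaN).
kept = toy.dropna(axis=0, thresh=min_count)
print(min_count, list(kept.index))
```

With five columns the threshold is two non-NaN values, so only the row that is 80% empty is dropped.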

### Creating relevant features

Feature 1: Days since the last purchase made by each client

In [None]:
#Creating a column to refer 2020-10-24 as if it were today's date 
verified_data['today'] = pd.to_datetime('2020-10-24')

In [None]:
#Creating a column to calculate the number of days passed since every date in the dataset
#until today(2020-10-24)

verified_data['days_since_last'] = (verified_data.today - verified_data.date).dt.days.abs()

In [None]:
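#Previewing the data indexed by client_id. set_index returns a new DataFrame,
#so verified_data itself keeps client_id as a regular column.
verified_data.set_index('client_id')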
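
As a quick illustration of how `set_index` behaves when its result is not assigned, here is a minimal sketch on a made-up frame (only the column name `client_id` comes from this notebook; the values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame for illustration.
df = pd.DataFrame({"client_id": [101, 102], "profit": [3.5, 1.2]})

preview = df.set_index("client_id")  # returns a new DataFrame

print(list(df.columns))      # the original frame still has client_id as a column
print(preview.index.name)    # only the returned frame is indexed by client_id
```

To change the frame itself, the result would have to be assigned back, for example `df = df.set_index("client_id")`.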

Creating a profit column (price - cost)

In [None]:
verified_data['profit'] = (verified_data['price']-verified_data['cost'])

In [None]:
#Creating a dataframe with the days since the last visit per client
#(the minimum of days_since_last corresponds to the most recent purchase)
feature1 = verified_data.groupby('client_id', as_index=False)['days_since_last']
feature1_df = pd.DataFrame(feature1.min())

In [None]:
feature1_df.shape
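
The recency feature can be traced by hand on a few rows. A minimal sketch, assuming made-up purchase dates (only the column names and the reference date 2020-10-24 come from this notebook):

```python
import pandas as pd

# Hypothetical purchases for two clients.
toy = pd.DataFrame({
    "client_id": [1, 1, 2],
    "date": pd.to_datetime(["2020-10-01", "2020-10-20", "2020-09-24"]),
})
today = pd.Timestamp("2020-10-24")

# Days elapsed between each purchase and the reference date.
toy["days_since_last"] = (today - toy["date"]).dt.days

# The smallest value per client belongs to that client's most recent purchase.
recency = toy.groupby("client_id", as_index=False)["days_since_last"].min()
print(recency)
```

Client 1 last bought on 2020-10-20 (4 days before the reference date) and client 2 on 2020-09-24 (30 days before).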

In [None]:
#Creating a dataframe that shows profit generated per client
money = verified_data.groupby(['client_id','date'])['profit']
money_df = pd.DataFrame(money.sum().reset_index(name = 'profit'))

Feature 2 : Average days between each purchase (per client)

In [None]:
#Days between each purchase date and the previous purchase date of the same client
money_df['days'] = (money_df.sort_values('date').groupby('client_id').date.shift() - money_df.date).dt.days.abs()

In [None]:
#A client's first purchase has no previous date, so its gap is set to 0
money_df['days'] = money_df['days'].fillna(0)

In [None]:
money_df.head()

In [None]:
feature2 = money_df.groupby(['client_id'], as_index=False)['days']
feature2_df = pd.DataFrame(feature2.mean().round())
feature2_df = feature2_df.rename(columns={'days' : 'avg_frequency'})
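
The gap calculation can be sketched on toy data to see how filling the first gap with 0 affects the average. The dates below are made up, and `groupby(...).diff()` is used here as an equivalent way to get the gap to the previous purchase; it is not the notebook's exact expression.

```python
import pandas as pd

# Hypothetical purchase dates: client 1 visits three times, client 2 once.
toy = pd.DataFrame({
    "client_id": [1, 1, 1, 2],
    "date": pd.to_datetime(["2020-06-01", "2020-06-11", "2020-07-01", "2020-06-05"]),
})
toy = toy.sort_values(["client_id", "date"])

# Gap in days to the previous purchase of the same client (NaN for the first one).
toy["days"] = toy.groupby("client_id")["date"].diff().dt.days

# Variant used in this notebook: the first visit counts as a 0-day gap.
with_zero = toy.assign(days=toy["days"].fillna(0)).groupby("client_id")["days"].mean()

# Alternative: leave the first gap undefined (mean skips NaN by default).
skip_first = toy.groupby("client_id")["days"].mean()

print(with_zero)
print(skip_first)
```

For client 1 the gaps are 10 and 20 days, so the zero-filled average is 10 while the average of the real gaps is 15. Which variant suits the model better is a judgment call; the sketch only makes the difference visible.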

Feature 3: Profit per client

In [None]:
#This feature will be used only for strategy, to select the clients that generate more profit
#for the company, but it will not be used to predict if client will come back in 7 days.

feature3 = money_df.groupby(['client_id'], as_index=False)['profit']
feature3_df = pd.DataFrame(feature3.mean().round())
feature3_df = feature3_df.rename(columns={'profit' : 'avg_profit'})

In [None]:
feature1_df = feature1_df.set_index('client_id')

### Joining Data

In [None]:
inner_merged = pd.merge(feature1_df, feature2_df, on='client_id')

In [None]:
inner_merged.head()

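Creating a new column that estimates how many days remain until each client's next purchase (average days between purchases minus days since the last purchase). Despite its name, `prob_next_purchase` is a number of days, not a probability.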

In [None]:
inner_merged['prob_next_purchase'] = (0 - inner_merged['days_since_last'] + inner_merged['avg_frequency'])

In [None]:
inner_merged.shape

In [None]:
#Setting client_id as index column in train_labels dataset
target = target.set_index('client_id')

Joining our data with the training labels to be able to train the model

In [None]:
final_data = pd.merge(inner_merged, target, on='client_id')

In [None]:
final_data.head()
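
Because `pd.merge` performs an inner join by default, clients that appear in only one of the two tables are silently dropped. One way to check how many rows match is the `indicator` option, sketched here on made-up frames (the values are hypothetical; only the column names come from this notebook):

```python
import pandas as pd

# Hypothetical feature and label tables with partially overlapping clients.
features = pd.DataFrame({"client_id": [1, 2, 3], "days_since_last": [4, 30, 2]})
labels = pd.DataFrame({"client_id": [2, 3, 4], "target_visit": [0, 1, 1]})

# An outer join with indicator=True adds a _merge column that says
# whether each row came from the left table, the right table, or both.
check = pd.merge(features, labels, on="client_id", how="outer", indicator=True)
counts = check["_merge"].value_counts()
print(counts)
```

Only the rows marked `both` would survive the default inner join.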

In [None]:
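#Previewing final_data indexed by client_id. The result is not assigned,
#so client_id remains a regular column of final_data.
final_data.set_index('client_id')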

### Modeling

In [None]:
#Now we split the final data into the target variable and the feature columns so we can hold out a validation set

target_variable= final_data['target_visit']
train_data = final_data.drop('target_visit',axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_data, target_variable, test_size=0.2, random_state=42)
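
If the two classes turn out to be imbalanced, passing `stratify` to `train_test_split` keeps the class proportions equal in both splits. A minimal sketch on synthetic labels (the arrays are made up; the 20/80 ratio is an assumption chosen for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 20% positive.
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both splits keep the original 20% share of positives.
print(y_tr.mean(), y_te.mean())
```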

**Baseline Model**

In [None]:
#Let's visualize the distribution of our target variable.

sns.countplot(x=target_variable)
plt.title('Distribution of client visits in the next 7 days')
plt.show()

In [None]:
#Percentage of each class in the training labels
target.value_counts(normalize = True) *100

In [None]:
#Creating a dummy model that always predicts the most frequent value in our target labels

from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)

In [None]:
dummy_clf.predict(X_train)
dummy_clf.score(X_train, y_train)

This baseline gives us a reference point for judging the final model: any machine learning model is adding value only if it improves on the baseline's score.
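
The comparison is most informative when the baseline is scored on held-out data with the same metric used for the candidate models. A minimal sketch on synthetic data (the arrays below are made up and stand in for the notebook's train/test split):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical data: 200 samples, 70% of them in class 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.array([0] * 140 + [1] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
pred = baseline.predict(X_te)

# Accuracy equals the share of the majority class in the held-out split.
baseline_acc = baseline.score(X_te, y_te)
# Weighted F1 is lower because the minority class is never predicted.
baseline_f1 = f1_score(y_te, pred, average="weighted", zero_division=0)
print(baseline_acc, baseline_f1)
```

The gap between the two numbers shows why weighted F1 is a stricter yardstick than accuracy for a majority-class predictor.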

**Selecting the best algorithm**

In [None]:
#create a list of (name, model) pairs
models = []
models.append(("LR",LogisticRegression()))
models.append(("Dtree",DecisionTreeClassifier()))
models.append(("RF",RandomForestClassifier()))
models.append(("XGB",xgb.XGBClassifier()))
models.append(("NB",GaussianNB()))
#models.append(("SVC",SVC()))

#measure the weighted F1 score of each model with 4-fold cross-validation
for name,model in models:
    kfold = KFold(n_splits=4, random_state=None)
    cv_result = cross_val_score(model,X_train,y_train, cv = kfold,scoring = "f1_weighted")
    print(name, cv_result) 

We can see that the best F1 score came from the XGBoost model, so we will focus on finding the best hyperparameters for modeling our data with XGBoost.
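
Comparing four per-fold scores by eye is error-prone, so it can help to summarize each model by the mean and standard deviation of its folds. A minimal sketch on synthetic data, using two scikit-learn models as stand-ins (the dataset and the shuffled `KFold` are assumptions for illustration, not the notebook's setup):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset standing in for X_train / y_train.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

models = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("Dtree", DecisionTreeClassifier(random_state=0)),
]
kfold = KFold(n_splits=4, shuffle=True, random_state=42)

summary = {}
for name, model in models:
    scores = cross_val_score(model, X, y, cv=kfold, scoring="f1_weighted")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A model whose mean is higher by less than the fold-to-fold spread is not clearly better.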

In [None]:
xgb_model = xgb.XGBClassifier( use_label_encoder =False, objective= 'binary:logistic', nthread=4, seed=42)
xgb_model.fit(X_train, y_train)

print('Accuracy of XGB classifier on training set: {:.2f}'
       .format(xgb_model.score(X_train, y_train)))
print('Accuracy of XGB classifier on test set: {:.2f}'
       .format(xgb_model.score(X_test[X_train.columns], y_test)))

In [None]:
xgb_predictions = xgb_model.predict(X_test)

In [None]:
#Model Accuracy, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, xgb_predictions))

In [None]:
#Classification report
print(classification_report(y_test, xgb_predictions))

In [None]:
# Confusion matrix
fig, ax = plt.subplots()
sns.heatmap(confusion_matrix(y_test, xgb_predictions, normalize='true'), annot=True, ax=ax)
ax.set_title('Confusion Matrix')
ax.set_ylabel('Real Value')
ax.set_xlabel('Predicted Value')

plt.show()
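
With `normalize='true'`, each row of the confusion matrix is divided by the number of real samples in that class, so the diagonal shows per-class recall. A small worked example with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: four real 0s and two real 1s.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm)

# Every row sums to 1 because it is normalized by the real class count.
print(cm.sum(axis=1))
```

Here three of the four real 0s are predicted correctly (0.75) and one of the two real 1s (0.5).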

**Hyperparameter tuning**

In [None]:
#We try some different parameters
#(verbosity=0 silences XGBoost messages; it replaces the deprecated silent=True)
xgb_model = xgb.XGBClassifier(learning_rate=0.02, n_estimators=600, objective='binary:logistic',
                    verbosity=0, nthread=1)

We will use RandomizedSearchCV, which evaluates a fixed number of randomly sampled parameter combinations instead of the full grid, to save time.

In [None]:
# A parameter grid for XGBoost
params = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5]
        }

In [None]:
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

folds = 3
param_comb = 5

skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)

random_search = RandomizedSearchCV(
    xgb_model, param_distributions=params, n_iter=param_comb, scoring='f1_weighted',
    n_jobs=4, cv=skf.split(X_train, y_train), verbose=3, random_state=1001,
)
random_search.fit(X_train, y_train)

In [None]:
print('\n Best hyperparameters:')
print(random_search.best_params_)
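
Since `RandomizedSearchCV` refits the best configuration on the whole training set by default, the tuned model is also available as `best_estimator_`, which avoids copying the parameters by hand. A minimal sketch on synthetic data, with scikit-learn's `GradientBoostingClassifier` swapped in for XGBoost so the example has no extra dependency (the dataset and the small parameter grid are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Hypothetical dataset standing in for X_train / y_train.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

params = {"max_depth": [2, 3], "subsample": [0.6, 0.8, 1.0]}
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1001)

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_distributions=params,
    n_iter=3,
    scoring="f1_weighted",
    cv=skf,
    random_state=1001,
)
search.fit(X, y)

# The refitted model already carries the best parameters.
best_model = search.best_estimator_
print(search.best_params_)
```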

Rebuilding the model with the best hyperparameters found by the search, together with the learning rate and number of estimators set earlier.

In [None]:
xgb_model = xgb.XGBClassifier(learning_rate=0.02, n_estimators=600, objective='binary:logistic', nthread=1, use_label_encoder =False, 
                              seed=42, subsample= 0.6, min_child_weight= 1, max_depth= 5, gamma=1.5, colsample_bytree= 0.8)
xgb_model.fit(X_train, y_train)

print('Accuracy of XGB classifier on training set: {:.2f}'
       .format(xgb_model.score(X_train, y_train)))
print('Accuracy of XGB classifier on test set: {:.2f}'
       .format(xgb_model.score(X_test[X_train.columns], y_test)))

In [None]:
xgb_predictions = xgb_model.predict(X_test)

In [None]:
#Classification report
print(classification_report(y_test, xgb_predictions))

### Saving my model and predictions

We need to merge our test_labels dataset (which has no labels yet) with the engineered features in inner_merged so that we have a final test dataset.

In [None]:
test_data = pd.merge(inner_merged, test_labels, on='client_id')

In [None]:
test_data.head()

In [None]:
#Let's confirm that we have matched our 10,000 rows that were not used for training, with our clients with no label yet.
test_data.shape

In [None]:
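#Previewing test_data indexed by client_id. The result is not assigned,
#so client_id remains a regular column (it is used again when building the submission).
test_data.set_index('client_id')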

In [None]:
#Selecting the same columns, in the same order, that the model was trained on
final_predictions = xgb_model.predict(test_data[X_train.columns])

In [None]:
print(len(final_predictions))

In [None]:
test_pred = pd.DataFrame({'client_id': test_data['client_id'], 'target_visit': final_predictions})

In [None]:
test_pred.to_csv('submission.csv', index=False)

In [None]:
test_pred.head()

In [None]:
#Save the model 

import pickle

filename = "final_model.pkl"

#Using a context manager so the file is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(xgb_model, f)
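
To confirm that a saved model can be restored, it can be loaded back and asked for a prediction. A minimal sketch with a `DummyClassifier` standing in for the trained model and a temporary directory instead of the working directory (both are assumptions for illustration):

```python
import os
import pickle
import tempfile

from sklearn.dummy import DummyClassifier

# Hypothetical stand-in for the trained model.
model = DummyClassifier(strategy="most_frequent").fit([[0], [1], [2]], [1, 1, 0])

path = os.path.join(tempfile.mkdtemp(), "final_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load the model back and check that it predicts like the original.
with open(path, "rb") as f:
    restored = pickle.load(f)

same = list(restored.predict([[5]])) == list(model.predict([[5]]))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.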