# Introduction

This notebook performs binary classification on the 'Response' label.

In [13]:
# Import Standard Modules
import pandas as pd

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Set Pandas Options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [14]:
# Notebook configurations
scaler_type = 'StandardScaler'

# Read Data

In [7]:
# Read data
data = pd.read_csv('./../data/marketing_campaign_prepared.csv', encoding='latin1', sep=';')

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1533 entries, 0 to 1532
Data columns (total 28 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ID                     1533 non-null   int64  
 1   Year_Birth             1533 non-null   int64  
 2   Education              1533 non-null   object 
 3   Marital_Status         1533 non-null   object 
 4   Income                 1533 non-null   float64
 5   Kidhome                1533 non-null   int64  
 6   Teenhome               1533 non-null   int64  
 7   Recency                1533 non-null   int64  
 8   MntWines               1533 non-null   int64  
 9   MntFruits              1533 non-null   int64  
 10  MntMeatProducts        1533 non-null   int64  
 11  MntFishProducts        1533 non-null   int64  
 12  MntSweetProducts       1533 non-null   int64  
 13  MntGoldProds           1533 non-null   int64  
 14  NumDealsPurchases      1533 non-null   int64  
 15  NumW

In [9]:
data.head(5)

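| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Dt_Customer_month | Dt_Customer_dayofweek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 38 | 11 | 1 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 5 |
| 1 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 26 | 426 | 49 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 2 |
| 2 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 26 | 11 | 4 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| 3 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 94 | 173 | 43 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 6 |
| 4 | 7446 | 1967 | Master | Together | 62513.0 | 0 | 1 | 16 | 520 | 42 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 0 |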


# Train & Test Split

Since the class distribution of the label is strongly imbalanced, we need to handle it carefully:
1. Ensure that the training and test sets have the same proportions of the two classes
2. Oversample the minority class (i.e., randomly duplicate examples)
3. Undersample the majority class (i.e., randomly delete examples)
4. Evaluate with several metrics (e.g., Accuracy, Precision, Recall, AUC), since accuracy alone can be misleading on imbalanced data
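
Points 2 and 3 of the list above can be sketched as follows. This is only a minimal illustration, assuming `sklearn.utils.resample` as the resampling tool and a small toy DataFrame (hypothetical values, not the campaign data); the notebook itself may use a different approach. Resampling should be applied to the training split only, so that the test set keeps the original class proportions.

```python
import pandas as pd
from sklearn.utils import resample

# Toy stand-in for a training split (hypothetical values, not the real data)
train = pd.DataFrame({
    "Income": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "Response": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

minority = train[train["Response"] == 1]
majority = train[train["Response"] == 0]

# Oversampling: draw minority rows with replacement until both classes match
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=0)
train_oversampled = pd.concat([majority, minority_upsampled])

# Undersampling: draw majority rows without replacement down to the minority size
majority_downsampled = resample(majority, replace=False,
                                n_samples=len(minority), random_state=0)
train_undersampled = pd.concat([majority_downsampled, minority])

# Both resampled sets now contain the two classes in equal numbers
print(train_oversampled["Response"].value_counts())
print(train_undersampled["Response"].value_counts())
```

Oversampling keeps every original row at the cost of duplicates, while undersampling discards information; which one works better here would have to be checked empirically.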

We use StratifiedShuffleSplit. This cross-validation object combines StratifiedKFold and ShuffleSplit and returns stratified randomized folds, i.e., folds that preserve the percentage of samples of each class.

Note: like the ShuffleSplit strategy, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

In [None]:
# Define the Splitter
stratified_kfold = StratifiedShuffleSplit(n_splits=5,
                                          test_size=.3, 
                                          random_state=0)
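
As a rough sketch of how such a splitter might be used, the example below iterates over the splits and checks the share of positive samples on each side. The arrays `X` and `y` are toy data (hypothetical, 20% positive class), not the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (hypothetical): 100 samples, 20% positive class
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 20 + [0] * 80)

splitter = StratifiedShuffleSplit(n_splits=5, test_size=.3, random_state=0)

positive_rates = []
for train_idx, test_idx in splitter.split(X, y):
    # Share of positive samples in the training and test part of this split
    train_rate = y[train_idx].mean()
    test_rate = y[test_idx].mean()
    positive_rates.append((train_rate, test_rate))
    print(f"train positives: {train_rate:.2f}, test positives: {test_rate:.2f}")
```

Each iteration yields index arrays rather than data, so the same splitter can be applied to a DataFrame with `.iloc`.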

# Data Standardization

Transform the individual features to look more or less like standard normally distributed data: Gaussian with zero mean and unit variance.

Keep in mind that tree-based methods are scale-invariant, so data standardization is not required.
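
The scale-invariance claim can be checked with a small experiment. The sketch below is only an illustration on synthetic data (hypothetical features on very different scales): it fits the same shallow decision tree on raw and on standardized features and compares the predictions, which one would expect to agree.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): two features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=[0, 1000], scale=[1, 500], size=(200, 2))
y = (X[:, 0] + X[:, 1] / 500 > 2).astype(int)

X_scaled = StandardScaler().fit_transform(X)

# Same model configuration, fitted once on raw and once on standardized features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Standardization preserves the ordering of values within each feature,
# so the tree can find equivalent split points in both cases
same_predictions = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print(same_predictions)
```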

Standardization must happen after the training-test split. Standardizing the whole dataset and then splitting it would leak information about the mean and standard deviation of the test set into the training set. Remember to standardize the test set with the same scaler fitted on the training set. This wo
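
A minimal sketch of the fit-on-train, transform-on-test pattern is shown below. It assumes toy data (hypothetical `X` and `y`) and a single stratified split; it is not the notebook's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 100 samples, 3 features, 20% positive class
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = np.array([1] * 20 + [0] * 80)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))

scaler = StandardScaler()
# Mean and standard deviation are learned from the training rows only
X_train = scaler.fit_transform(X[train_idx])
# The test rows are transformed with the training statistics (no refit)
X_test = scaler.transform(X[test_idx])

print(X_train.mean(axis=0).round(6))  # close to zero by construction
print(X_test.mean(axis=0).round(3))   # generally not exactly zero
```

Wrapping the scaler and the estimator with `make_pipeline` gives the same behaviour inside cross-validation, because the pipeline refits the scaler on the training part of every split.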

In [15]:
# Define the scaler
if scaler_type == 'StandardScaler':
    
    scaler = StandardScaler()