# Introduction

Fred Etter - March 2019  

This notebook uses data from a Kaggle competition to predict if a customer will buy a product from Banco Santander during a typical transaction.  There are a total of 200,000 rows of customer transactions along with the binary outcome of whether they made a purchase or not.  There are 200 columns, or 200 features, to use as inputs to each of the models that are presented.  

The workflow in this notebook is as follows:
1.  Import and Clean Data  
2.  Data Exploration  
3.  Modeling and Evaluation
4.  Conclusion and Discussion  


In [1]:
# Import necessary modules
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn import linear_model, preprocessing
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, chi2, f_classif
from sklearn.tree import DecisionTreeClassifier

# 1.  Import and clean the data  
In this section, the data is imported and cleaned.  

Import the data from a csv file; then print the number of rows and columns of the data; then show the first 5 lines of the dataframe.

In [2]:
df = pd.read_csv(r"C:\Users\Fred\Documents\PythonDirectory\Unit 3\train_san.csv")
print(df.shape)
df.head()

(200000, 202)


Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_0,0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,...,4.4354,3.9642,3.1364,1.691,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
1,train_1,0,11.5006,-4.1473,13.8588,5.389,12.3622,7.0433,5.6208,16.5338,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.356,1.9518
2,train_2,0,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
3,train_3,0,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.925,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
4,train_4,0,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104


There are 200,000 rows and 202 columns.  All of the feature data are float numbers.  We can drop the ID_code later because it is irrelevant.  The var_0, var_1, etc. are the column names given in the original dataset.  These features have not been specifically defined in Kaggle.  They are anonymized features that have been captured during customer transactions.  

Next, we'll print the sum of the 'target' column which is the number of all of the rows where the customer bought a product.  Then we calculate the buy ratio.  As you can see here, a customer bought something about 10% of the time.

In [3]:
df['target'].sum()


20098

In [4]:
# print the sum of the target column
df['target'].sum()

# divide the target sum by the total number of rows
print("Customers average buy ratio, {}".format(df['target'].sum()/len(df['target'])))

Customers average buy ratio, 0.10049
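As a minimal sketch of the same calculation: for a 0/1 column, the mean equals the share of ones, and `value_counts(normalize=True)` reports the share of every class at once. The `target` series below is a hypothetical toy column, not the competition data.

```python
import pandas as pd

# Hypothetical toy target column: 1 = buy, 0 = not buy.
target = pd.Series([0, 0, 0, 0, 1, 0, 0, 0, 0, 0], name="target")

# For a 0/1 column the mean equals the share of ones (the buy ratio).
buy_ratio = target.mean()

# value_counts(normalize=True) reports the share of every class.
class_shares = target.value_counts(normalize=True)

print(buy_ratio)
print(class_shares)
```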


Now we know we have a highly imbalanced dataset: 10% are 'buy' transactions and 90% are 'not buy' transactions.  To account for this imbalance, we will use the following technique:
- we will randomly remove most of the 'not buy' rows from the training data so that the numbers of 'buy' and 'not buy' rows are equal.  

Next, let's look at some of the data types in our dataset and check for NaN values.

In [5]:
df.dtypes.head()

ID_code     object
target       int64
var_0      float64
var_1      float64
var_2      float64
dtype: object

We see that the 'target' data type is an integer which is what we'll need for the machine learning algorithms.  We have float values for the columns - which is what we want - and an 'ID_code' that is an object (we'll delete this column later).

In [6]:
df.isnull().sum().sum()

0

Perfect, no cleaning necessary for NaN values.

Drop the ID_code column since it is a string and irrelevant to making a buy or no_buy prediction.  Also, drop a random sample of 195,000 rows, leaving 5,000, to account for hardware / memory limitations.

In [7]:
# drop ID_code column
df.drop(['ID_code'], axis=1, inplace=True)

# drop a random sample of 195,000 rows, leaving 5,000
df.drop(df.sample(195000).index, inplace=True)

Look at the number of rows to make sure we dropped the correct amount, and then shuffle the data randomly to prepare it for modeling.

In [8]:
print(df.shape[0])
df = df.sample(frac=1)

5000
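One possible alternative, sketched under assumptions: keeping a fixed number of rows with `DataFrame.sample(n=...)` gives the same result as dropping the complement, and passing `random_state` makes the draw repeatable between runs. The `demo` frame and the seed value are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full training frame (the real one has 200,000 rows).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "target": rng.integers(0, 2, size=1_000),
    "var_0": rng.normal(size=1_000),
})

# Keep 25 rows directly instead of dropping 975; random_state fixes the draw.
subset = demo.sample(n=25, random_state=42)

# Same seed, same rows: the sample can be reproduced exactly.
subset_again = demo.sample(n=25, random_state=42)

print(subset.shape)
print(subset.index.equals(subset_again.index))
```

Because `sample` returns rows in random order, a subset drawn this way is already shuffled.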


Take another look at the data to see the randomly shuffled rows.

In [9]:
df.head()

Unnamed: 0,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
127670,0,13.1134,0.9263,8.4017,7.0261,10.2461,-13.9023,6.0891,12.5448,-4.6748,...,6.9387,6.0567,3.5393,4.7091,17.2364,-0.168,-3.0132,10.3269,18.9674,-10.7604
91738,0,6.3999,3.568,14.7459,4.7192,10.9572,-5.4206,5.6457,23.9838,-3.2039,...,2.8688,4.3747,0.6455,-7.0703,22.0905,0.0519,5.6649,9.4528,16.8323,14.4631
138164,0,7.3829,-3.5773,11.7715,3.8721,12.6857,-9.6839,6.0539,15.8856,2.3073,...,0.5405,9.4594,1.7664,2.4748,16.4584,-1.3579,-7.5438,9.1225,19.7478,-5.1241
32634,0,18.9618,-10.0936,6.6822,6.2581,9.4195,-21.2975,6.3479,13.39,-5.1408,...,0.5705,10.795,1.8621,1.4947,17.5174,-1.965,0.8396,8.6497,14.5091,-14.1486
5445,0,6.6791,-7.988,9.2067,5.3087,12.5855,-3.8427,4.7047,14.6442,-1.9235,...,0.2276,13.9635,3.3346,-7.4988,14.7375,-0.7395,-6.7301,8.854,13.6309,-9.5104


At this point our data looks clean and ready to start exploring in more detail to look for features that might be better predictors for the target data.  We can use the pairplot and the heatmap functionality to begin this analysis.

# 2.  Data Exploration  
First, let's look at the correlation between the target and the features. 

In [10]:
# use absolute value
np.absolute(df.corr().unstack().sort_values().drop_duplicates())[:10]

target   var_169    0.082870
         var_109    0.076477
         var_33     0.074832
var_34   target     0.068878
var_149  target     0.068421
var_146  target     0.066935
target   var_139    0.065506
var_166  target     0.065490
target   var_122    0.064091
         var_13     0.062257
dtype: float64

There aren't any strongly correlated features in this dataset.  The highest absolute correlation with the target is only about 0.08.
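The expression above sorts every pair in the correlation matrix, so feature-to-feature pairs could also appear in the result. A minimal sketch that restricts the ranking to feature-target correlations uses `DataFrame.corrwith`; the `demo` frame and its columns are hypothetical toy data, with `var_2` deliberately built to depend on the target.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two noise features and one that depends on the target.
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({
    "target": rng.integers(0, 2, size=n),
    "var_0": rng.normal(size=n),
    "var_1": rng.normal(size=n),
})
demo["var_2"] = demo["target"] * 2.0 + rng.normal(size=n)

# Correlate every feature column with the target only, then rank by magnitude.
features = demo.drop(columns="target")
target_corr = (
    features.corrwith(demo["target"])
    .abs()
    .sort_values(ascending=False)
)

print(target_corr.head(10))
```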


Next, we'll look at the pairplot for the top 10 features as determined by the correlations found in the previous step.

In [11]:
# df_pp = df[['target', 'var_81', 'var_139', 'var_12', 'var_174', 
#             'var_146', 'var_80', 'var_76', 'var_165', 'var_148', 'var_166']].copy()

# sns.set(font_scale=1.7)
# sns.pairplot(df_pp)
# plt.show()

Nothing too exciting here; it looks like a lot of noise.  The features look highly uncorrelated.  We do see that all of these features are very close to normally distributed.  We can also see the previously observed ratio of 1 buy transaction for every 10 transactions.  

We can look at a heatmap for the 10 best features (determined above) for a more visual representation of the correlations between target and feature data.

In [12]:
# import seaborn as sns
# plt.figure(figsize=(10,9))
# corr = df_pp.corr()
# sns.heatmap(corr, cmap='coolwarm')
# plt.savefig('heatmat.png')

The above heatmap and correlation values show that there is very little correlation between the target variable and the top 10 best predictive features.  It looks like, at this point, we will need to incorporate many features (maybe most of the 200) to improve the predictive value of each model.

# 3.  Build the models and evaluate
The first step to building a good model is to separate the data into training and test data.  We'll train the model on the training data and test it with the test data.  This next line of code breaks the dataframe into 2 dataframes: 1 for training and 1 for test.

In [13]:
# Create training and test sets.
offset = int(df.shape[0] * 0.8)

df_train = df[:offset]
df_test = df[offset:]

Confirm the new shapes of the 2 new dataframes.

In [14]:
df_train.shape

(4000, 201)

In [15]:
df_test.shape

(1000, 201)
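The slice-based split above relies on the earlier shuffle. As a hedged alternative, scikit-learn's `train_test_split` can shuffle and split in one call, and its `stratify` argument keeps the buy ratio the same in both parts. The `demo` frame and the seed are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with a 9-to-1 class imbalance.
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "target": np.repeat([0, 1], [900, 100]),
    "var_0": rng.normal(size=1_000),
})

# 80/20 split; stratify preserves the share of buys in each part.
train_df, test_df = train_test_split(
    demo, test_size=0.2, stratify=demo["target"], random_state=0
)

print(train_df.shape, test_df.shape)
print(train_df["target"].mean(), test_df["target"].mean())
```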

Again, confirm that about 10% of the rows are 'buy' rows.  Note that the cell below computes the ratio on the full 5,000-row sample df, not on df_train alone.

In [16]:
print("Customers average buy ratio, {0:.3f}".format(df['target'].sum()/len(df['target'])))

Customers average buy ratio, 0.101


Next, we need to balance the training data to account for the current 9 to 1 ratio of 'not buy' to 'buy'.  To do this, we randomly drop around 90% of the 'not buy' rows (target = 0) so that the two classes end up with the same number of rows.

In [17]:
# set variable buy_num to the total number of buys
buy_num = df_train['target'].sum()
print(buy_num)

# calculate the difference between the total rows and the number of buys
diff = df_train.shape[0] - buy_num

# calculate the number of rows to drop and drop those rows
df_train.drop(df_train.query('target < 1').sample(frac=(1 - (buy_num/diff))).index, inplace=True)
print(df_train.shape)

416
(832, 201)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)
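The message above is pandas's SettingWithCopyWarning, which here most likely arises because df_train is a slice of df and is then modified in place. One way to undersample without modifying a slice is to build a new frame from the two classes. This is a sketch under assumptions: the `train` frame and the seed are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy training frame with a 9-to-1 class imbalance.
rng = np.random.default_rng(3)
train = pd.DataFrame({
    "target": np.repeat([0, 1], [360, 40]),
    "var_0": rng.normal(size=400),
})

# Keep every buy row and an equally sized random sample of not-buy rows.
buys = train[train["target"] == 1]
not_buys = train[train["target"] == 0].sample(n=len(buys), random_state=0)

# Concatenate into a new frame and shuffle; nothing is edited in place.
balanced = pd.concat([buys, not_buys]).sample(frac=1, random_state=0)

print(balanced["target"].value_counts())
```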


Confirm the new shape of the training data.  The new training dataframe has an equal number of 'buy' and 'not buy' rows, so its row count is twice the number of 'buys'.

In [18]:
print("Number of buys, {}".format(buy_num))
print("Number of rows and columns, {}".format(df_train.shape))

Number of buys, 416
Number of rows and columns, (832, 201)


We will use the area under the ROC curve (ROC AUC) instead of the more typical raw accuracy measure.  This is the metric Kaggle uses to score this competition.

Basic Logistic Regression is the first model used to calculate the ROC AUC.  

#### 3.1  Logistic Regression without PCA

In [19]:
# 1.  Logistic Regression without PCA

from datetime import datetime
start_time = datetime.now()

# Instantiate our model.
regr = linear_model.LogisticRegression(solver='sag')

# set features and dependent variable for training data
y_train = df_train['target'].values

# drop the 'target' column to obtain the feature inputs
df_train.drop(['target'], axis=1, inplace=True)

# normalize the training data (scales each row to unit L2 norm)
x_train = sklearn.preprocessing.normalize(df_train)

# now for test...
y_test = df_test['target'].values

# drop the 'target' column to obtain the feature inputs
df_test.drop(['target'], axis=1, inplace=True)

# normalize the test data (scales each row to unit L2 norm)
x_test = sklearn.preprocessing.normalize(df_test)

# fit model to training data
regr.fit(x_train, y_train)

end_time = datetime.now()
print('\nDuration: {}'.format(end_time - start_time))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)



Duration: 0:00:00.192011
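A note on the scaling step: `sklearn.preprocessing.normalize` rescales each row (sample) to unit length, whereas `StandardScaler` rescales each column (feature) to zero mean and unit variance. The sketch below contrasts the two on a hypothetical 3x2 array; it is only an illustration and does not say which choice suits this dataset better.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Hypothetical toy matrix: 3 samples (rows), 2 features (columns).
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [1.0, 0.0]])

# normalize works row by row: each sample is divided by its own L2 norm.
X_rows = normalize(X)
row_norms = np.linalg.norm(X_rows, axis=1)

# StandardScaler works column by column: each feature gets mean 0, variance 1.
X_cols = StandardScaler().fit_transform(X)
col_means = X_cols.mean(axis=0)

print(X_rows)
print(row_norms, col_means)
```

Row-wise normalization maps the first two samples to the same vector, because they differ only by a constant factor.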


Next, calculate the area under the ROC curve for the Logistic Regression model.

In [20]:
from sklearn import metrics
y_test_pred = regr.predict(x_test)
sklearn.metrics.roc_auc_score(y_test, y_test_pred)

0.6668661371631669
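The score above is computed from hard 0/1 predictions. `roc_auc_score` also accepts continuous scores, such as the positive-class probabilities from `predict_proba`, which lets it evaluate every threshold rather than a single one. A minimal sketch on synthetic data follows; the dataset, model settings, and seeds are assumptions and not the notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data with roughly a 9-to-1 class imbalance.
X, y = make_classification(
    n_samples=2_000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)

# AUC from hard labels uses only one operating point (threshold 0.5).
auc_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from probabilities of the positive class uses the full ranking.
auc_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(auc_labels, auc_scores)
```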

Computing the Confusion Matrix for the LR Model.

In [21]:
from sklearn.metrics import confusion_matrix
sklearn.metrics.confusion_matrix(y_test, y_test_pred, labels=None, sample_weight=None)
# print(len(y_test), np.sum(y_test))

array([[633, 276],
       [ 33,  58]], dtype=int64)

Computing the raw accuracy for the LR Model.

In [22]:
sklearn.metrics.accuracy_score(y_test, y_test_pred)

0.691
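The reported figures can be cross-checked directly from the confusion matrix. The sketch below derives accuracy, recall, specificity, and precision from the four counts. For hard 0/1 predictions, the ROC AUC equals the mean of recall and specificity, which reproduces the value shown earlier.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[633, 276],
               [33, 58]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all rows predicted correctly
recall = tp / (tp + fn)              # share of actual buys that were caught
specificity = tn / (tn + fp)         # share of actual not-buys that were caught
precision = tp / (tp + fp)           # share of predicted buys that were real

# With hard 0/1 predictions, ROC AUC is the mean of recall and specificity.
auc_from_labels = (recall + specificity) / 2

print(round(accuracy, 3), round(recall, 3), round(specificity, 3),
      round(precision, 3), round(auc_from_labels, 3))
# → 0.691 0.637 0.696 0.174 0.667
```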

#### 3.2  Logistic Regression with PCA

In [23]:
# 2.  Logistic Regression with PCA

from datetime import datetime
start_time = datetime.now()

# Instantiate our model.
regr_pca = linear_model.LogisticRegression(solver='sag')

# ----------------------------------------------------------------
from sklearn.decomposition import PCA
# Make an instance of the model
# .95 means sklearn will choose the minimum number of components such that 95% of the variance is retained.
pca = PCA(.95) 

pca.fit(x_train)

print(pca.n_components_)

x_train = pca.transform(x_train)
x_test = pca.transform(x_test)

# -----------------------------------------------------------------
# fit model to training data
regr_pca.fit(x_train, y_train)

end_time = datetime.now()
print('\nDuration: {}'.format(end_time - start_time))

106

Duration: 0:00:00.177010
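To see how a fractional `n_components` picks the number of components, the cumulative sum of `explained_variance_ratio_` can be inspected: the last retained component is the one that brings the total to the requested share. A minimal sketch on hypothetical random data, not the notebook's features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 300 samples, 10 features with decreasing spread.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10)) * np.linspace(5.0, 0.5, 10)

# A float in (0, 1) asks PCA to keep enough components for that variance share.
pca = PCA(n_components=0.95).fit(X)

# Running total of the variance explained by the retained components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(pca.n_components_)
print(cumulative)
```

As in the cell above, the PCA is fitted on the training data only and then applied to both training and test data.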


In [24]:
from sklearn import metrics
y_test_pred = regr_pca.predict(x_test)
sklearn.metrics.roc_auc_score(y_test, y_test_pred)

0.6641098175751641

In [25]:
from sklearn.metrics import confusion_matrix
sklearn.metrics.confusion_matrix(y_test, y_test_pred, labels=None, sample_weight=None)
# print(len(y_test), np.sum(y_test))

array([[618, 291],
       [ 32,  59]], dtype=int64)

In [26]:
sklearn.metrics.accuracy_score(y_test, y_test_pred)

0.677

# 4.  Conclusion and Discussion

LR without PCA:
- area under ROC:  0.667
- raw accuracy:  0.691  

LR with PCA:  
- area under ROC:  0.664  
- raw accuracy:  0.677  

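PCA retained 106 components out of the 200 original features while keeping 95% of the variance.  Both metrics are slightly lower with PCA than without it.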