
## Background Information ##

Financial institutions, like Santander, help people and businesses prosper by providing tools and services to assess their personal financial health and to identify additional ways to help customers reach their monetary goals. In the United States, it is estimated that 40% of Americans cannot cover a $400 emergency expense [1]. As a result, it is imperative that financial institutions learn consumer habits and adopt new technologies to better serve their customers' financial needs.

## Problem Statement
Santander, a financial institution, is trying to predict the next transaction a given customer is trying to complete based on historical banking information. This is a binary classification problem where the input data contains 200 unnamed, normally-distributed feature variables. The solution to this problem will be evaluated on a test data set provided by Santander.

## Solution Statement ##

The provided train.csv file contains 200,000 unique rows of customer data. Given the large dataset and the need for binary classification, there are many possible solutions to this problem. Mine will use a deep neural network, after preprocessing the inputs by normalizing and scaling the features, to predict one of the two target classes. After the model is trained and validated on subsets of the data from train.csv, I will run the trained model on the test set provided by Santander and measure the accuracy of its predictions.

## Load Data 

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split


import os
print(os.listdir("../input"))

# Pretty display for notebooks
%matplotlib inline

# Load the Santander dataset
train_data = pd.read_csv('../input/train.csv')
test_data = pd.read_csv('../input/test.csv')
submission_data = pd.read_csv('../input/sample_submission.csv')

ModuleNotFoundError: No module named 'sklearn'

## Data Exploration & Visualizations ## 


In [None]:
#Size of training data
train_data.shape

In [None]:
train_data['target'].head(5)

In [None]:
train_data.describe()


Immediate Key Takeaways: The mean, standard deviation, and maximum values of the features vary widely. If we choose to implement a black-box algorithm, such as a neural network, a key step will be data preprocessing: feature scaling, potentially outlier detection and removal, and definitely normalization of these variables.



### Separate Training and Validation Datasets


In [None]:
feature_train_data = train_data.drop(['ID_code','target'], axis=1) #remove target column & ID column

#Split the dataset into training and validation sets so that we can implement models without touching 
#the test set
X_train, X_val, y_train, y_val = train_test_split(feature_train_data, train_data['target'], test_size = 0.20, random_state = 25)

In [None]:
X_train.describe()


From the dataset statistics, we can see that the mean and standard deviation of each var feature vary significantly. If we apply supervised learning algorithms without any feature scaling or preprocessing, an algorithm that is sensitive to scale may be biased toward certain features over others when learning the relationship between the input features and the target classification of the customer.

Binary classification without any preprocessing using a simple decision tree classifier could provide insight into what minimum baseline performance we achieve.



### Develop simple model

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

treeClassifier = DecisionTreeClassifier()
treeClassifier.fit(X_train, y_train)

y_pred = treeClassifier.predict(X_val)

# Report the accuracy of the predictions on the validation set
score = accuracy_score(y_val, y_pred)

In [None]:
score

In [None]:
#Prepare input and output; use decision tree classifier to make predictions 
X_test = test_data.drop('ID_code', axis=1)
y_test_ID = test_data['ID_code']
y_pred_testSet = treeClassifier.predict(X_test)

#Convert array into a pandas DataFrame
F=pd.DataFrame(np.vstack(y_pred_testSet) , columns = ['target'])

#Build output submission dataframe
competitionSub = pd.concat([y_test_ID, F], axis=1)
competitionSub.head()

### Analyzing Model Performance - Decision Tree


On the test set, when we submit to Kaggle, a decision tree classifier gets us 55% accuracy, which is slightly better than random guessing. Alas, it's not a great solution to this problem and now we'll dive deeper into data exploration. (I placed 7,000+, YIKES!)

If the dataset contained redundant features, we would expect to see strong linear correlations between pairs of features, which the correlation matrix lets us check. Further exploration also lets us see whether or not our features are normally distributed and whether the dataset contains any outliers that could skew our results.



In [None]:
feature_train_data.corr()
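
A 200 x 200 correlation matrix is hard to read by eye, so here is a minimal sketch of how it could be scanned for redundant pairs. The frame `df`, its column names, and the 0.9 threshold are all made up for illustration; they are not taken from the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature table: var_2 is almost a copy of var_0.
df = pd.DataFrame({"var_0": rng.normal(size=1000), "var_1": rng.normal(size=1000)})
df["var_2"] = df["var_0"] + rng.normal(scale=0.05, size=1000)

corr = df.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs whose absolute correlation exceeds an (arbitrary) threshold.
redundant_pairs = upper.stack()
redundant_pairs = redundant_pairs[redundant_pairs > 0.9]
print(redundant_pairs)
```

If the result is empty for the real features, that would suggest there is little linear redundancy for a method like PCA to remove.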


### Transforming & Normalizing Numerical Features
If our data is not normally distributed, especially if the mean and median differ significantly (indicating a large skew in the data), it's appropriate to apply a non-linear transformation to reduce the skew; one example is applying the natural logarithm to the dataset.

In addition to performing transformations on features that are highly skewed, it is often good practice to perform some type of scaling on numerical features. Applying a linear scaling to the data does not change the shape of each feature's distribution; however, normalization ensures that each feature is treated equally when applying supervised learners. Note that once scaling is applied, the values no longer carry their original meaning or units, so care must be taken when interpreting the transformed data.
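
To make the difference between the two kinds of transformation concrete, the sketch below compares the skewness of a synthetic right-skewed feature before and after standardization and after a log transform. The data is generated, not taken from train.csv, and `skewness` is a small helper written for this example.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed synthetic feature

standardized = (raw - raw.mean()) / raw.std()  # linear rescaling
logged = np.log(raw)                           # non-linear transform

print(skewness(raw), skewness(standardized), skewness(logged))
```

The first two numbers agree because standardization only shifts and stretches the distribution, while the log transform actually changes its shape.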

In [None]:
scaled_features_train = {} #store scaling values for conversion back to original values later on
scaled_features_test = {} #store scaling values for conversion back to original values later on

feature_test_data = test_data.drop(['ID_code'], axis=1)

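#Continuous variables are standardized to zero mean and a standard deviation of 1 (they are not rescaled to [0, 1])
#Yielded 91% accuracy with 50 epochs
cont_columns = feature_train_data.columns #every remaining column is a continuous var feature
for each in cont_columns: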
    mean_train, std_train =feature_train_data[each].mean(), feature_train_data[each].std()
    scaled_features_train[each]= [mean_train , std_train]
    feature_train_data.loc[:,each] = (feature_train_data[each] - mean_train)/std_train
    
    mean_test, std_test = feature_test_data[each].mean(), feature_test_data[each].std()
    scaled_features_test[each]= [mean_test , std_test]
    feature_test_data.loc[:,each] = (feature_test_data[each] - mean_test)/std_test

In [None]:
feature_train_data.describe()


Key takeaway: Look closely at the standard deviation, mean, min, and max of the first few var features. Each feature is now standardized, with a mean of approximately 0 and a standard deviation of 1, so a supervised learning algorithm can weigh the features on an equal footing when separating the two targets.
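
The scaling cell stores each column's mean and standard deviation so that values can be converted back later. Below is a rough sketch of that round trip on synthetic data. It also shows a common variant that is not what this notebook does: reusing the training statistics to scale a held-out frame. The helper names `standardize` and `unstandardize` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
train = pd.DataFrame({"var_0": rng.normal(10, 3, 500), "var_1": rng.normal(-2, 0.5, 500)})
held_out = pd.DataFrame({"var_0": rng.normal(10, 3, 100), "var_1": rng.normal(-2, 0.5, 100)})

# Statistics come from the training frame only.
stats = {col: (train[col].mean(), train[col].std()) for col in train.columns}

def standardize(frame, stats):
    """Return a standardized copy of `frame` using the stored (mean, std) pairs."""
    return pd.DataFrame({col: (frame[col] - m) / s for col, (m, s) in stats.items()})

def unstandardize(frame, stats):
    """Invert `standardize`, recovering values on the original scale."""
    return pd.DataFrame({col: frame[col] * s + m for col, (m, s) in stats.items()})

train_scaled = standardize(train, stats)
held_out_scaled = standardize(held_out, stats)  # scaled with the training statistics
restored = unstandardize(train_scaled, stats)
```

Scaling held-out data with the training statistics keeps both sets on exactly the same scale, at the cost of the held-out columns not having a mean of exactly 0.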

From another posted kernel on Kaggle, we observed that this was an unbalanced classification challenge (Link: https://www.kaggle.com/allunia/santander-customer-transaction-eda)

So, I used their visualization as inspiration to provide readers of this kernel a figure to view:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

#Pass the labels by keyword; recent seaborn versions no longer accept them positionally
sns.countplot(x=train_data['target'], palette="husl")
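
Class imbalance matters when reading accuracy numbers. As a small illustration with synthetic labels (the 10% positive rate below is an assumption chosen for the example, not a figure measured from train.csv), a predictor that always outputs the majority class already looks accurate while finding no positives at all:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labels with roughly 10% positives; the ratio is illustrative only.
y = (rng.random(10_000) < 0.10).astype(int)

# A "model" that ignores its inputs and always predicts the majority class.
majority_class = np.bincount(y).argmax()
y_pred = np.full_like(y, majority_class)

accuracy = (y_pred == y).mean()
recall_positive = ((y_pred == 1) & (y == 1)).sum() / (y == 1).sum()
print(f"accuracy={accuracy:.3f}, recall on class 1={recall_positive:.3f}")
```

Comparing a model's validation accuracy against this kind of baseline is a quick sanity check before trusting the score.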

### Implementing Dimensionality Reduction (Principal Component Analysis)
It's worth understanding how much of the input data is redundant, or less important, when classifying the two classes. In our case, we'll implement Principal Component Analysis (PCA), a form of unsupervised learning, which tells us how much of the dataset's variance is explained by a chosen number of principal components. We'll first work with the data that was standardized to zero mean and a standard deviation of 1, and then repeat the analysis on the original data for comparison. Keep in mind that each principal component is a combination of the input features, which makes it harder to figure out which individual feature is most responsible for the variance in the dataset.

In [None]:
from sklearn.decomposition import PCA

#Separate preprocessed training and validation sets
X_train_prep, X_val_prep, y_train, y_val = train_test_split(feature_train_data, train_data['target'], test_size = 0.20, random_state = 25)

#Unsupervised dimensionality reduction with PCA on the standardized training features.
#Passing a fraction (.95) instead of an integer keeps as many components as are needed
#to capture 95% of the variance in the training split.
pca = PCA(.95,
          whiten=True).fit(X_train_prep)

#Projecting the input data onto the most important principal axes
X_train_pca = pca.transform(X_train_prep)
X_val_pca = pca.transform(X_val_prep)
pca.explained_variance_ratio_.cumsum()

Key takeaway: We can confirm that the cumulative variance explained by the retained components reaches 0.9531571, just above the 0.95 threshold we set when we implemented PCA above.



In [None]:
pca.n_components_


Key takeaway: It looks like we need 190 principal components, out of 200 original features, to capture 95% of the variance of the data. We reduced the dimensions, but it's worthwhile to note that this is not a significant reduction in the dimensionality of the training data.
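
One plausible reading of this result is that the standardized features are close to uncorrelated, so there is little redundancy for PCA to remove. The sketch below illustrates the effect on synthetic data with NumPy only. The helper `n_components_for` is hypothetical and simply mirrors the idea of keeping components until 95% of the variance is explained.

```python
import numpy as np

def n_components_for(X, threshold=0.95):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches `threshold`, computed from an SVD of centered data."""
    Xc = X - X.mean(axis=0)
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    return int(np.searchsorted(np.cumsum(ratio), threshold) + 1)

rng = np.random.default_rng(0)
n_samples, n_features = 2000, 20

# Case 1: independent, unit-variance features (no redundancy to remove).
independent = rng.normal(size=(n_samples, n_features))

# Case 2: 20 features that are noisy mixtures of only 3 latent factors.
latent = rng.normal(size=(n_samples, 3))
mixed = latent @ rng.normal(size=(3, n_features)) + 0.05 * rng.normal(size=(n_samples, n_features))

print(n_components_for(independent), n_components_for(mixed))
```

The independent case needs almost every component, while the mixed case collapses to a handful.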



### Performing PCA on Original Dataset For Comparison


In [None]:
#Extracting the top principal components from the raw training data (no standard scaling
#applied); keep as many components as are needed to capture 95% of the variance.
pca1 = PCA(.95,
          whiten=True).fit(X_train)

#Projecting the input data onto the most important principal axes
X_train_pca1 = pca1.transform(X_train)
X_val_pca1 = pca1.transform(X_val)


In [None]:
pca1.n_components_


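Key takeaway: PCA on the original training set, without any standard scaling (i.e., without setting the mean to 0 and standard deviation to 1 for each feature), needs fewer principal components (111 versus 190) to capture 95% of the variance in the data. This is expected: without scaling, the features with the largest variance dominate the total variance.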
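
The effect of scaling on the component count can be reproduced on synthetic data. In this sketch, independent features are given standard deviations from 1 to 20 (an arbitrary choice), and the number of components needed for 95% of the variance is compared before and after standardization. The helper `n_components_for` is hypothetical.

```python
import numpy as np

def n_components_for(X, threshold=0.95):
    """Number of leading covariance eigenvalues needed to reach `threshold`
    of the total variance."""
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending order
    ratio = eigvals / eigvals.sum()
    return int(np.searchsorted(np.cumsum(ratio), threshold) + 1)

rng = np.random.default_rng(0)

# Independent features whose standard deviations range from 1 to 20.
scales = np.linspace(1.0, 20.0, 20)
raw = rng.normal(size=(5000, 20)) * scales

standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

n_raw = n_components_for(raw)
n_std = n_components_for(standardized)
print(n_raw, n_std)
```

The low-variance features contribute so little to the raw total that PCA can discard them, even though they are no more redundant than the others.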



In [None]:
from sklearn.linear_model import LogisticRegression

#Logistic regression on the PCA components of the standardized data (mean 0, standard deviation 1)
logisticRegr0 = LogisticRegression(solver = 'lbfgs')
logisticRegr0.fit(X_train_pca, y_train)
y_pred_logisticRegr0 = logisticRegr0.predict(X_val_pca)

#Logistic regression on the PCA components of the raw, unscaled data
logisticRegr1 = LogisticRegression(solver = 'lbfgs')
logisticRegr1.fit(X_train_pca1, y_train)
y_pred_logisticRegr1 = logisticRegr1.predict(X_val_pca1)

In [2]:
#This scores our logistic regression using the scaled data and after applying PCA; here we use 190 feature variables that capture 95% variance in the data
score0 = accuracy_score(y_val, y_pred_logisticRegr0)
score0

NameError: name 'accuracy_score' is not defined

In [None]:
#This scores our logistic regression using the raw data and after applying PCA; here we use 111 feature variables that capture 95% variance in the data

score1 = accuracy_score(y_val, y_pred_logisticRegr1)
score1

#On the submission page, this yielded a test set score of .

Key Takeaway: It seems that a simple logistic regression performs better when we first standardize the raw data to zero mean and a standard deviation of 1, then apply PCA, and finally fit the logistic regression to the transformed features that capture 95% of the variance.

### Implement AdaBoost


In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME",
                         n_estimators=1)

bdt.fit(X_train_pca, y_train)
y_pred_bdt = bdt.predict(X_val_pca)
score_bdt = accuracy_score(y_val, y_pred_bdt)
score_bdt

Key Takeaway: Here, we're using AdaBoost, an ensemble method, to see if we can get better results on the test set; we notice that score_bdt > score0, the score of the logistic regression. Our submission result increased slightly to 0.637. Time to bring in the big guns... Neural Networks.
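
Note that with `n_estimators=1` the boosted model consists of a single depth-1 tree. As a hedged sketch of how the number of boosting rounds could be explored, the loop below fits AdaBoost with 1, 10, and 50 stumps on a synthetic problem. The dataset, sizes, and seeds are arbitrary, and whether more rounds help on the Santander data is not something this example can show.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem; sizes and seeds are arbitrary.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for n in (1, 10, 50):
    # Depth-1 trees ("stumps") are the weak learners being boosted.
    clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=n, random_state=0)
    clf.fit(X_tr, y_tr)
    scores[n] = accuracy_score(y_va, clf.predict(X_va))

print(scores)
```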

## Implementing Neural Network ##


In [2]:
# Import Necessary Libraries
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, BatchNormalization
from keras.optimizers import SGD
from keras import backend as K

#To get categorical output, we're going to one-hot encode the output vector for our train and validation dataset.
targets = np.array(keras.utils.to_categorical(y_train, 2))
targets_val = np.array(keras.utils.to_categorical(y_val, 2))

# Building the model
model = Sequential() #
model.add(Dense(1024, activation='relu', input_shape=(X_train_prep.shape[1],)))
#model.add(Dropout(rate=.1))
model.add(BatchNormalization())

#rate is the fraction of units to drop; rate=1 would not act as a useful regularizer, so .1 is used as in the later layers
model.add(Dense(512, activation='relu'))
model.add(Dropout(rate=.1))
model.add(BatchNormalization())

model.add(Dense(512, activation='relu'))
model.add(Dropout(rate=.1))
model.add(BatchNormalization())

model.add(Dense(254, activation='relu'))
#model.add(Dropout(rate=.1))
model.add(BatchNormalization())

model.add(Dense(254, activation='relu'))
model.add(Dropout(rate=.1))
model.add(BatchNormalization())

model.add(Dense(128, activation='relu'))
model.add(Dropout(rate=.1))
model.add(BatchNormalization())

model.add(Dense(128, activation='relu'))
model.add(Dropout(rate=.1))
model.add(BatchNormalization())

model.add(Dense(32, activation='relu'))
model.add(Dropout(rate=.1))
model.add(BatchNormalization())

model.add(Dense(32, activation='relu'))
model.add(Dropout(rate=.1))
model.add(BatchNormalization())

model.add(Dense(16, activation='relu'))
model.add(Dropout(rate=.1))
model.add(BatchNormalization())

model.add(Dense(16, activation='relu'))
model.add(Dropout(rate=.1))
model.add(BatchNormalization())

model.add(Dense(8, activation='relu'))
model.add(Dropout(rate=.1))
model.add(BatchNormalization())

model.add(Dense(8, activation='relu'))
model.add(Dropout(rate=.1))
model.add(BatchNormalization())


model.add(Dense(2, activation='softmax'))

ModuleNotFoundError: No module named 'keras'

Notes for Building a Neural Network Architecture

1.) Beware overfitting: too many neurons, too many layers, or too little data will overfit your training set and result in worse performance on test set data.

2.) Batch normalization between ReLU activations is a good way to keep the inputs to subsequent layers less skewed. It is, however, more computationally expensive.

3.) Beware training for too many epochs; it can lead to overfitting.
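
The network in this notebook relies on dropout as its main guard against overfitting. For readers who have not seen what a dropout layer does during training, here is a minimal NumPy sketch of "inverted" dropout. The function name and shapes are illustrative, and this is not Keras's internal implementation.

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    """Zero each unit with probability `rate` and rescale the survivors by
    1 / (1 - rate) so the expected activation is unchanged. Needs 0 <= rate < 1."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))

dropped = inverted_dropout(activations, rate=0.1, rng=rng)
print("fraction zeroed:", (dropped == 0).mean())
print("mean activation:", dropped.mean())
```

In this sketch a rate of 1 is rejected because every unit would be dropped and the rescaling factor would be undefined.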

### Train Neural Network ###


In [3]:
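from keras.callbacks import ModelCheckpoint  
# features and targets are NumPy arrays, just like in the scikit-learn API.

#Compile the model before training. This step is missing from the cells above, so the
#loss and optimizer below are assumptions: categorical cross-entropy matches the
#one-hot targets and softmax output, and metrics=['accuracy'] makes score[1] an accuracy.
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

#Set hyperparameters 
epochs = 15
batch_size = 300

#Save the weights from the epoch with the best validation loss
checkpointer = ModelCheckpoint(filepath='weights.best.from_scratch.hdf5', 
                               verbose=1, save_best_only=True)

model.fit(X_train_prep, targets,
          epochs=epochs, batch_size=batch_size,
          validation_data=(X_val_prep, targets_val),
          callbacks=[checkpointer], verbose=1)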

ModuleNotFoundError: No module named 'keras'

In [None]:
#y_pred_nn = model.predict(X_val_pca)
#score_nn = accuracy_score(y_val, y_pred_nn)
#score_nn
best_weights_filepath = "weights.best.from_scratch.hdf5"
model.load_weights(best_weights_filepath)

score = model.evaluate(X_val_prep, targets_val)
print("\n Validation Accuracy:", score[1])
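
The checkpoint callback used during training keeps the weights from the epoch with the best validation loss, which limits the damage from training too long. A related idea is to stop training once the validation loss has not improved for a few epochs. The plain-Python sketch below shows that logic on made-up loss values. The function name and the `patience` parameter are illustrative, not part of this notebook.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (best_epoch, stop_epoch): the index of the lowest validation loss
    seen before stopping, and the epoch at which training would halt after
    `patience` consecutive epochs without improvement."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Made-up validation losses: improvement stalls after epoch 3.
history = [0.70, 0.52, 0.41, 0.38, 0.39, 0.40, 0.42, 0.45]
print(best_epoch_with_patience(history, patience=3))  # (3, 6)
```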

### Test Set Neural Network Performance


In [None]:
feature_test_data.head()


In [None]:
#Prepare input and output; use the neural network to make predictions
#X_test = pca.transform(feature_test_data)
X_test = feature_test_data

y_test = test_data['ID_code']
#Each row of the softmax output holds the probabilities of class 0 and class 1
y_pred_testSet = pd.DataFrame(model.predict(X_test))
y_pred_testSet = pd.concat([y_test, y_pred_testSet], axis=1)
#Pick the class with the larger probability
y_pred_testSet['target'] = np.where(y_pred_testSet[0] > y_pred_testSet[1], 0, 1)
submission = y_pred_testSet.drop(columns=[0, 1])

submission

Key takeaways: The neural nets didn't do so hot! Submission score was .639. Now, we should probably try something like XGBoost with GridSearchCV optimization to see if we can identify the most important features contributing to the targets and go from there.
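
For reference, the two conversions used around the network (integer labels to one-hot targets, and softmax outputs back to class labels) can be written compactly in NumPy. The probabilities below are made up for illustration.

```python
import numpy as np

# One-hot encode integer labels by indexing into an identity matrix.
labels = np.array([0, 1, 1, 0])
one_hot = np.eye(2)[labels]

# Made-up softmax outputs: each row sums to 1 across the two classes.
probs = np.array([[0.9, 0.1], [0.3, 0.7], [0.45, 0.55], [0.6, 0.4]])

# Predicted class = column with the larger probability.
predicted = probs.argmax(axis=1)

# Equivalent comparison-based form for two classes.
same_as_where = np.where(probs[:, 0] > probs[:, 1], 0, 1)

print(one_hot)
print(predicted, same_as_where)
```

The two forms differ only on exact ties: `argmax` returns class 0, while the `np.where` comparison returns class 1.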

In [None]:
submission.to_csv('Adrian_Lievano_NN.csv', columns = ['ID_code','target'], index=False)


### Implement XGBoost Algorithm


In [None]:
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

xgb_model = xgb.XGBClassifier(objective = 'binary:logistic', random_state = 42)

params = {
    "learning_rate": [0.03, .1, 0.3], # default 0.1 
    "max_depth": [2, 6], # default 3
    "n_estimators": [10, 20, 100], # default 100
}

#The lists above define only 3 * 2 * 3 = 18 distinct combinations, so n_iter is set to 18
search = RandomizedSearchCV(xgb_model, param_distributions=params, random_state=42, n_iter=18, cv=3, verbose=1, n_jobs=1, return_train_score=True)

search.fit(X_train_prep, y_train)
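
Because every hyperparameter in the search is given as a short list, the search space is finite and small. The sketch below enumerates it with the standard library to show how many distinct settings and model fits the 3-fold search involves. It reuses the same values as the `params` dictionary in the cell.

```python
from itertools import product

params = {
    "learning_rate": [0.03, 0.1, 0.3],
    "max_depth": [2, 6],
    "n_estimators": [10, 20, 100],
}

# Every distinct combination of the listed values.
combinations = [dict(zip(params, values)) for values in product(*params.values())]
n_combinations = len(combinations)
cv_folds = 3

print(n_combinations, "distinct settings,", n_combinations * cv_folds, "model fits with 3-fold CV")
```

With a space this small, an exhaustive grid search costs the same as a randomized one. Randomized search becomes worthwhile when parameters are drawn from wider ranges or continuous distributions.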

In [None]:
y_pred_xgb = search.predict(X_val_prep)


In [None]:
score_xgb = accuracy_score(y_val, y_pred_xgb)
score_xgb

In [None]:
X_test = feature_test_data

y_test = test_data['ID_code']
y_pred_testSet = search.predict(X_test)
y_pred_testSet=pd.DataFrame(np.vstack(y_pred_testSet) , columns = ['target'])
competitionSub_xgb = pd.concat([y_test, y_pred_testSet], axis=1)
competitionSub_xgb

In [None]:
competitionSub_xgb.to_csv('AAL_xgb.csv', columns = ['ID_code','target'], index = False)


Conclusions:

Supervised machine learning algorithms, including neural networks, are not enough to get the highest accuracy on the test set; instead, it's just as important to understand and visualize the statistics in your provided dataset.

This notebook walks Kagglers through a variety of supervised machine learning algorithms, presenting the tradeoffs and performance numbers on the test set for this data competition.

I still have much to learn!