# Santander Kaggle Competition

## Summary

We're going to learn data science by participating in a machine learning competition.

### Competition Description

From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around. What's more, unhappy customers rarely voice their dissatisfaction before leaving.

Santander Bank is asking Kagglers to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer's happiness before it's too late.

In this competition, you'll work with hundreds of anonymized features to predict if a customer is satisfied or dissatisfied with their banking experience.

### Get the Data
1. Sign up for Kaggle
2. Download the data from https://www.kaggle.com/c/santander-customer-satisfaction/data
3. Open the files and place the 'train' and 'test' CSV files in the same folder as this notebook.



### Get our libraries

In [1]:
%matplotlib inline
# inline is an option to make our plots appear inline (instead of pop out)

import matplotlib.pyplot as plt  # our standard plotting library; see 'seaborn' as alternative
import numpy as np  # fast arrays made for scientific computing, needed for sklearn
import scipy  # scientific computing tools, needed for sklearn
import pandas as pd  # great for data manipulation, looking at data
import sklearn  # aka scikit-learn; Python machine learning, built on top of numpy, scipy

### Glimpse at our data

We want to load our data into a pandas DataFrame. You're probably familiar with one-dimensional data like lists or arrays, where there's a single row of values. A DataFrame is two-dimensional: it has both rows and columns.
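To make the rows-and-columns idea concrete, here is a minimal sketch with a tiny hand-made DataFrame. The values are made up for illustration and do not come from the competition files; only the general shape of the data (an ID, some features, a TARGET) mirrors what we load next.

```python
import pandas as pd

# Hypothetical toy data: made-up values, not taken from train.csv
toy = pd.DataFrame({
    "ID": [1, 3, 4],
    "var15": [23, 34, 23],
    "TARGET": [0, 0, 1],
})

n_rows, n_cols = toy.shape   # (number of rows, number of columns)
first_row = toy.iloc[0]      # one row, as a Series indexed by column name
one_column = toy["var15"]    # one column, as a Series indexed by row label

print(n_rows, n_cols)
print(first_row)
print(one_column)
```

Each row is one customer and each column is one variable, which is the same layout `pd.read_csv` gives us for the real data.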

In [2]:
# Let's load our data
DATA_LOCATION = '/Users/williamliu/Desktop/Santander/'
df_train = pd.read_csv(DATA_LOCATION + 'train.csv')

In [3]:
df_train.describe()

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
count,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,...,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0
mean,75964.050723,-1523.199277,33.212865,86.208265,72.363067,119.529632,3.55913,6.472698,0.412946,0.567352,...,7.935824,1.365146,12.21558,8.784074,31.505324,1.858575,76.026165,56.614351,117235.80943,0.039569
std,43781.947379,39033.462364,12.956486,1614.757313,339.315831,546.266294,93.155749,153.737066,30.604864,36.513513,...,455.887218,113.959637,783.207399,538.439211,2013.125393,147.786584,4040.337842,2852.579397,182664.598503,0.194945
min,1.0,-999999.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5163.75,0.0
25%,38104.75,2.0,23.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67870.6125,0.0
50%,76043.0,2.0,28.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,106409.16,0.0
75%,113748.75,2.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,118756.2525,0.0
max,151838.0,238.0,105.0,210000.0,12888.03,21024.81,8237.82,11073.57,6600.0,6600.0,...,50003.88,20385.72,138831.63,91778.73,438329.22,24650.01,681462.9,397884.3,22034738.76,1.0


In [4]:
df_train.head()

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
0,1,2,23,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,39205.17,0
1,3,2,34,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,49278.03,0
2,4,2,23,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,67333.77,0
3,8,2,37,0,195,195,0,0,0,0,...,0,0,0,0,0,0,0,0,64007.97,0
4,10,2,39,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,117310.979016,0


In [8]:
df_train.columns

Index([u'ID', u'var3', u'var15', u'imp_ent_var16_ult1',
       u'imp_op_var39_comer_ult1', u'imp_op_var39_comer_ult3',
       u'imp_op_var40_comer_ult1', u'imp_op_var40_comer_ult3',
       u'imp_op_var40_efect_ult1', u'imp_op_var40_efect_ult3', 
       ...
       u'saldo_medio_var33_hace2', u'saldo_medio_var33_hace3',
       u'saldo_medio_var33_ult1', u'saldo_medio_var33_ult3',
       u'saldo_medio_var44_hace2', u'saldo_medio_var44_hace3',
       u'saldo_medio_var44_ult1', u'saldo_medio_var44_ult3', u'var38',
       u'TARGET'],
      dtype='object', length=371)

## Split the Data

So far we've only glimpsed at our data, and that's on purpose: if we study all of it in depth (including the rows we'll later use to evaluate the model), our modeling choices end up tuned to that data and our scores look better than they should, which is a form of overfitting. So before we do any in-depth analysis, we want to separate our data into __X__, the features we might want, and __y__, the target. After that we split both into train and test sets.

In [34]:
X = df_train.drop(['TARGET'], axis=1)  # We want all the data except the Target
y = df_train.TARGET  # We only want the target

In [35]:
X.info()
X.head()  # Let's double check the column TARGET is no longer there

<class 'pandas.core.frame.DataFrame'>
Int64Index: 76020 entries, 0 to 76019
Columns: 370 entries, ID to var38
dtypes: float64(111), int64(259)
memory usage: 215.2 MB


Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var29_ult3,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38
0,1,2,23,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,39205.17
1,3,2,34,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,49278.03
2,4,2,23,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,67333.77
3,8,2,37,0,195,195,0,0,0,0,...,0,0,0,0,0,0,0,0,0,64007.97
4,10,2,39,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,117310.979016


In [36]:
y.head()  # let's double check the column TARGET is the only data there

0    0
1    0
2    0
3    0
4    0
Name: TARGET, dtype: int64

### Variations to splitting the data and validating

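There are a few ways we can split and validate our data. First, we can do a simple __train and test split__ (by default 75% train and 25% test); the advantage of this method is that it's quick and easy, but the model never trains on the held-out 25%. Another method is __K-Fold__, where you split your data into K folds, train on K-1 of them, validate on the remaining one, repeat until each fold has been used for validation once, and then average the K scores; the advantage is that every row is used for both training and validation, but it takes roughly K times the computing time. Depending on how balanced your dataset is, you might also want to consider __StratifiedKFold__, which keeps the class proportions roughly the same in every fold. That matters here: the mean of TARGET in the `describe()` output above is 0.0396, so only about 4% of rows have TARGET equal to 1.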

In [37]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train.shape, X_test.shape

((57015, 370), (19005, 370))
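The split above uses the defaults, so it changes every time the cell is re-run and it does not guarantee the same share of positive rows on both sides. Below is a minimal sketch of a reproducible, stratified split. It uses synthetic data so it runs on its own; the names `X_demo` and `y_demo`, the 4% positive rate, and `random_state=42` are assumptions made for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (X, y): 1000 rows, 5 features, 4% positives
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.zeros(1000, dtype=int)
y_demo[:40] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.25,     # same 75/25 proportion as the default
    random_state=42,    # fixed seed so the split is repeatable
    stratify=y_demo,    # keep the class ratio the same in train and test
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

With `stratify`, both parts keep the 4% positive rate, which makes scores on the test part easier to compare across runs when the positive class is rare.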

In [39]:
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20

# n_splits replaces the old n_folds; the data is passed later, via kf.split(X)
kf = KFold(n_splits=2, shuffle=False)
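The K-Fold and StratifiedKFold ideas described earlier can be sketched end to end as follows. This is an illustration under assumptions, not the approach used for the competition: the data is synthetic, and the model (`LogisticRegression`), the number of folds (5), and the `roc_auc` scoring are choices made only for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for (X, y): 500 rows, 10% positives
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 4))
y_demo = np.zeros(500, dtype=int)
y_demo[:50] = 1
X_demo[:50] += 1.0  # shift the positives so there is something to learn

plain = KFold(n_splits=5, shuffle=True, random_state=0)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count the positives that land in each validation fold
plain_counts = [int(y_demo[val].sum()) for _, val in plain.split(X_demo)]
strat_counts = [int(y_demo[val].sum()) for _, val in strat.split(X_demo, y_demo)]
print("KFold positives per fold:          ", plain_counts)
print("StratifiedKFold positives per fold:", strat_counts)

# Train and validate once per fold, then average the K scores
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_demo, y_demo, cv=strat, scoring="roc_auc")
mean_auc = scores.mean()
print("per-fold AUC:", scores)
print("mean AUC:", mean_auc)
```

Plain `KFold` can give folds with uneven numbers of positives, while `StratifiedKFold` puts the same number in each fold here. `cross_val_score` handles the loop over folds, and averaging its output gives the single cross-validated score.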