# Table of Contents
 <p><div class="lev1"><a href="#Preprocessing"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preprocessing</a></div><div class="lev2"><a href="#Imports-and-loading-the-data"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Imports and loading the data</a></div><div class="lev2"><a href="#Cleaning-the-data"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Cleaning the data</a></div><div class="lev3"><a href="#Remove-constant-a-duplicate-columns"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Remove constant and duplicate columns</a></div><div class="lev3"><a href="#Save-the-IDs-and-TARGETs-and-drop-them-from-the-dataframe"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Save the IDs and TARGETs and drop them from the dataframe</a></div><div class="lev1"><a href="#Random-Forest"><span class="toc-item-num">2&nbsp;&nbsp;</span>Random Forest</a></div><div class="lev1"><a href="#Output"><span class="toc-item-num">3&nbsp;&nbsp;</span>Output</a></div>

# Preprocessing
## Imports and loading the data

In [1]:
%matplotlib inline
from __future__ import division

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Input data files are available in the "../input/" directory.
# Load the training and test sets
df_train = pd.read_csv('../input/train.csv')
df_test = pd.read_csv('../input/test.csv')

## Cleaning the data
### Remove constant and duplicate columns
We remove any constant columns and any duplicated columns (columns with identical values), since they add no information about the dependent variable. Note that the columns are identified on the training set and then removed from both the training set **and the test set**.

In [2]:
# remove constant columns
remove = []
for col in df_train.columns:
    if df_train[col].std() == 0:
        remove.append(col)

df_train.drop(remove, axis=1, inplace=True)
df_test.drop(remove, axis=1, inplace=True)

# remove duplicated columns
remove = []
c = df_train.columns
for i in range(len(c)-1):
    v = df_train[c[i]].values
    for j in range(i+1,len(c)):
        if np.array_equal(v,df_train[c[j]].values):
            remove.append(c[j])

df_train.drop(remove, axis=1, inplace=True)
df_test.drop(remove, axis=1, inplace=True)
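
As a point of comparison, the same two clean-up steps can be written without explicit loops. The following is only a sketch on a small made-up frame (the `toy` data and its column names are hypothetical, not taken from the competition files); it uses `nunique` to find constant columns and a transpose plus `duplicated` to find repeated ones.

```python
import pandas as pd

# Toy frame (hypothetical data, for illustration only)
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [5, 5, 5],  # constant column
    "c": [1, 2, 3],  # exact duplicate of "a"
    "d": [0, 1, 0],
})

# A constant column has exactly one distinct value
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
toy = toy.drop(columns=constant_cols)

# Transposing turns columns into rows, so duplicated() flags every
# column that repeats an earlier one (the first occurrence is kept)
duplicate_cols = toy.columns[toy.T.duplicated().to_numpy()].tolist()
toy = toy.drop(columns=duplicate_cols)

print(constant_cols, duplicate_cols, list(toy.columns))
```

Transposing copies the data, so on a wide frame with many rows the explicit loop above may be the more memory-friendly choice.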

### Save the IDs and TARGETs and drop them from the dataframe

In [3]:
IDs = df_train["ID"]
IDs_test = df_test["ID"]
TARGETs = df_train["TARGET"]

df_train.drop(["ID", "TARGET"], axis=1, inplace=True)
df_test.drop(["ID"], axis=1, inplace=True)

# Random Forest
Promising results, but computationally expensive because we have so many features. We need the number of trees to be significantly larger than the number of features.

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import confusion_matrix, roc_auc_score

X, y = df_train, TARGETs

n_features = 'all' #Only use the top 'n_features' features
n_estimators = 500 #Number of trees
weights = {0: 1, 1:3.45} #Attempt to balance the classes
clf = Pipeline([
        ('remove_zero_variance', VarianceThreshold()),
        ('feature_selection', SelectKBest(f_classif, k=n_features)),
        ('classification', RandomForestClassifier(n_estimators=n_estimators,
                                                max_features=40,
                                                n_jobs=4,
                                                class_weight=weights,
                                                warm_start=False))
])
clf.fit(X, y)

# Evaluate on the training set (an optimistic estimate):
y_train_pred = clf.predict(X)
print(confusion_matrix(TARGETs, y_train_pred))

# Calculate the roc_auc score
print('Overall AUC:', roc_auc_score(y, clf.predict_proba(X)[:,1]))

[[72846   166]
 [  224  2784]]
Overall AUC: 0.998808656381


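Those numbers don't look too bad, though they are measured on the same data the model was trained on, so they overstate how well it will generalize. While we have our model, let's see what the important features are: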

In [20]:
feature_importances = clf.named_steps["classification"].feature_importances_
relative_importance = feature_importances/max(feature_importances)
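# Assumes the selection steps in the pipeline kept every column (k='all' here),
# so the importances line up one-to-one with the dataframe columns
feature_IDs = df_test.columns.values

feature_df = pd.DataFrame({"Feature": feature_IDs, "Importance": relative_importance})
feature_df = feature_df.sort_values(by="Importance", ascending=False)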
print(feature_df.iloc[0:20,:])

                    Feature  Importance
305                   var38    1.000000
1                     var15    0.655383
270   saldo_medio_var5_ult3    0.111242
268  saldo_medio_var5_hace3    0.103625
143             saldo_var30    0.094445
266          num_var45_ult3    0.068516
267  saldo_medio_var5_hace2    0.066033
150             saldo_var42    0.063755
264         num_var45_hace3    0.055194
269   saldo_medio_var5_ult1    0.054414
221          num_var22_ult3    0.048071
128              saldo_var5    0.046764
263         num_var45_hace2    0.046704
265          num_var45_ult1    0.033356
219         num_var22_hace3    0.033277
224     num_meses_var5_ult3    0.031059
218         num_var22_hace2    0.030720
152                   var36    0.029828
223      num_med_var45_ult3    0.029616
109               num_var30    0.029137
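
The table above pairs importances with column names by position, which only works while the selection steps in the pipeline keep every column. If `n_features` were set to a number instead of `'all'`, the names would have to be filtered through each selector first. A minimal sketch of that mapping, on synthetic data with made-up column names (`f0` to `f7`) and assuming the same step names as the pipeline above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with hypothetical column names
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, random_state=0)
names = np.array(["f%d" % i for i in range(X_demo.shape[1])])

pipe = Pipeline([
    ("remove_zero_variance", VarianceThreshold()),
    ("feature_selection", SelectKBest(f_classif, k=5)),
    ("classification", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipe.fit(X_demo, y_demo)

# Apply each selector's boolean mask in pipeline order, so that `kept`
# holds exactly the columns that reached the classifier
kept = names
for step in ("remove_zero_variance", "feature_selection"):
    kept = kept[pipe.named_steps[step].get_support()]

importances = pipe.named_steps["classification"].feature_importances_
for name, importance in sorted(zip(kept, importances), key=lambda p: -p[1]):
    print(name, round(importance, 4))
```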


The top features here all show up in the analysis performed by "Selfish Gene":
https://www.kaggle.com/selfishgene/santander-customer-satisfaction/advanced-feature-exploration
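
The confusion matrix and AUC in this section are computed on the rows the model was fitted on, so a held-out estimate would be a useful cross-check. The sketch below shows one way to get it with stratified cross-validation. It runs on synthetic, imbalanced data rather than the competition files, and the fold count and tree count are arbitrary choices for illustration; only the class weights are copied from the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the real training set
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     weights=[0.9, 0.1], random_state=0)

model = RandomForestClassifier(n_estimators=100,
                               class_weight={0: 1, 1: 3.45},
                               random_state=0)

# Stratified folds keep the class ratio roughly equal in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")

print("CV AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

On the real data, passing the whole pipeline `clf` in place of `model` would refit the feature-selection steps inside each fold, which keeps the held-out rows out of the selection as well.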

# Output

In [19]:
from datetime import datetime
y_probs = clf.predict_proba(df_test)[:,1]

# Stamp the output with the current time
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Output the submission
submission = pd.DataFrame({"ID":IDs_test, "TARGET":y_probs})
submission.to_csv("../results/random_forest_" + timestamp + ".csv", 
                  index=False, float_format="%10.8f")

# Output the important features
feature_df.to_csv("../misc/features_random_forest_" + timestamp + ".csv", 
                  index=False, float_format="%8.6f")