# Non-Parametric Approach (Random Forests)

Here we explore a Python implementation of random forests for this competition. The move to Python was a computational one: R was proving far too slow to build even trivially sized forests. In our experience Python was faster, and scikit-learn makes it straightforward to build trees in parallel (the n_jobs argument), which allows for larger forests. That said, computation is still an issue and will come into play as we go.

In [1]:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

## Gini Coefficient

Code to compute the Gini index, adapted from the code on collab.

In [3]:
def unnormalized_gini_index(g, predicted_probabilities):
    # fail loudly on mismatched inputs instead of silently returning None
    if len(g) != len(predicted_probabilities):
        raise ValueError("Actual and Predicted need to be equal lengths!")

    # arrange data into a table of actual and predicted values, indexed 1..n;
    # np.asarray drops any pandas index so rows are not realigned against the new one
    d = {"truth": np.asarray(g), "pred": np.asarray(predicted_probabilities)}
    gini_table = pd.DataFrame(data = d, index = range(1, len(g) + 1))

    # sort rows in decreasing order of the predicted values; a stable sort is
    # requested so that ties keep their original (index) order
    gini_table = gini_table.sort_values("pred", ascending = False, kind = "mergesort")

    # get the per-row increment for positives accumulated by the model
    num_ground_truth_positives = gini_table["truth"].sum()
    model_percentage_positives_accumulated = gini_table["truth"] / num_ground_truth_positives

    # get the per-row increment for positives accumulated by a random guess
    random_guess_percentage_positives_accumulated = 1 / len(gini_table)

    # calculate gini index
    gini_sum = np.cumsum(model_percentage_positives_accumulated - random_guess_percentage_positives_accumulated)
    gini_index = gini_sum.sum() / len(gini_table)
    return gini_index


# Calculates the normalized Gini index from ground truth and predicted probabilities.
# g: ground-truth scalar values (e.g., 0 and 1)
# predicted_probabilities: predicted probabilities for the items listed in g
# Returns the normalized Gini index, accounting for the theoretical optimum.
def normalized_gini_index(g, predicted_probabilities):
    model_gini_index = unnormalized_gini_index(g, predicted_probabilities)
    optimal_gini_index = unnormalized_gini_index(g, g)
    return model_gini_index / optimal_gini_index
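
As a sanity check on the metric, here is a minimal NumPy-only sketch of the same calculation on a made-up five-row example. The function names (gini_unnormalized, gini_normalized, auc_by_pairs) and the toy labels and scores are illustrative assumptions, not part of the notebook. It also checks the standard identity that, for binary labels with no tied predictions, the normalized Gini equals 2 * AUC - 1.

```python
import numpy as np

def gini_unnormalized(truth, pred):
    """Mean gap between the model's cumulative share of positives and the diagonal."""
    truth = np.asarray(truth, dtype=float)
    pred = np.asarray(pred, dtype=float)
    # Descending order of predictions; the stable sort keeps tied rows in input order.
    order = np.argsort(-pred, kind="stable")
    captured = np.cumsum(truth[order]) / truth.sum()
    n = len(truth)
    diagonal = np.arange(1, n + 1) / n  # what a random ordering accumulates
    return float(np.sum(captured - diagonal) / n)

def gini_normalized(truth, pred):
    """Scale by the score of a perfect ordering, so a perfect ordering scores 1."""
    return gini_unnormalized(truth, pred) / gini_unnormalized(truth, truth)

def auc_by_pairs(truth, pred):
    """Fraction of (positive, negative) pairs ranked correctly; assumes no tied predictions."""
    pos = [p for t, p in zip(truth, pred) if t == 1]
    neg = [p for t, p in zip(truth, pred) if t == 0]
    return sum(p > q for p in pos for q in neg) / (len(pos) * len(neg))

truth = [1, 0, 1, 0, 0]           # made-up labels
pred = [0.9, 0.8, 0.7, 0.2, 0.1]  # made-up scores
gini = gini_normalized(truth, pred)
auc = auc_by_pairs(truth, pred)
print(gini, 2 * auc - 1)
```

The two printed values should agree (2/3 for this toy input). A perfect ordering scores 1 and a fully reversed ordering scores -1, which is a quick way to confirm that the sort direction is right.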


In [4]:
train = pd.read_csv("train.csv")

In [36]:
y = train["target"]
x = train.drop(["id", "target"], axis = 1)

First, we simply fit a random forest to the raw training data to get a baseline for the Gini index.

In [16]:
clf = RandomForestClassifier(n_estimators=500, max_depth=2, random_state=0, n_jobs=-1, max_features="log2")

In [37]:
model = clf.fit(x, y)

In [38]:
test = pd.read_csv("test.csv")
ids = test["id"]
test = test.drop(["id"], axis = 1)

In [39]:
probs = clf.predict_proba(test)
probs_final = probs[:, 1]  # second column: predicted probability of target = 1

In [46]:
result1 = {'id':ids, 'target':probs_final}
result1_df = pd.DataFrame(data = result1)

result1_df.to_csv("predictionsRF10-24-17.csv", index=False)

This gave a baseline score of 0.232.

Now we transition to our actual approach. We will use 3-fold cross-validation at first; we have to keep the number of folds small because a larger number is computationally infeasible. We also want to identify which variables are the most important. To do this, for each forest fitted during cross-validation (one per fold), we record which variables are among its 20 most important. The cutoff of 20 was chosen somewhat arbitrarily, but it was a value at which the importance seemed to drop off a bit. We will then see which variables show up as important in all folds and proceed with those.
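
The selection rule described above can be sketched in isolation as follows. This is a minimal illustration, assuming the per-fold importances are already available as rows of an array; the function name features_in_top_k_every_fold and the toy importance values are made up for the example.

```python
import numpy as np

def features_in_top_k_every_fold(importances_per_fold, k):
    """Return indices of features ranked in the top k by importance in every fold."""
    importances_per_fold = np.asarray(importances_per_fold, dtype=float)
    n_folds, n_features = importances_per_fold.shape
    counts = np.zeros(n_features, dtype=int)
    for importances in importances_per_fold:
        top_k = np.argsort(importances)[::-1][:k]  # indices of the k largest importances
        counts[top_k] += 1
    # keep only the features that made the top k in all folds
    return np.flatnonzero(counts == n_folds).tolist()

# Made-up importances for 5 features over 3 folds (each row is one fold).
toy_importances = [
    [0.40, 0.30, 0.15, 0.10, 0.05],
    [0.35, 0.10, 0.30, 0.20, 0.05],
    [0.45, 0.25, 0.05, 0.15, 0.10],
]
selected = features_in_top_k_every_fold(toy_importances, k=2)
print(selected)
```

With k=2 only feature 0 is in the top two of every fold here, so it is the only one kept. Requiring agreement across all folds is a deliberately strict filter: a feature that ranks highly in a single fold by chance is dropped.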

In [6]:
y_kfold = np.array(train["target"])
x_kfold = np.array(train.drop(["id", "target"], axis = 1))
kf = StratifiedKFold(n_splits=3)
kf.get_n_splits(x_kfold, y_kfold)
forest = ExtraTreesClassifier(n_estimators=500, max_depth=2, random_state=0, n_jobs=-1, max_features="log2")

ginis = []

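# per-feature count of folds in which the feature ranks in the top 20 (57 features here)
ratio = [0] * x_kfold.shape[1]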

for train_index, test_index in kf.split(x_kfold, y_kfold):
    X_train, X_test = x_kfold[train_index], x_kfold[test_index]
    y_train, y_test = y_kfold[train_index], y_kfold[test_index]
    
    model = forest.fit(X_train, y_train)
    probs = forest.predict_proba(X_test)
    probs_final = [x[1] for x in probs]
    importances = forest.feature_importances_
    # rank features from most to least important
    indices = np.argsort(importances)[::-1]

    # count each feature that lands in this fold's top 20
    for f in indices[:20]:
        ratio[f] += 1

    ginis.append(normalized_gini_index(g=y_test, predicted_probabilities=probs_final))
    print(len(ginis))

1
2
3


In [8]:
# Print out the cross validation Gini
print(ginis)

# Find which variables rank among the top 20 most important in all 3 folds
interest = [i for i in range(len(ratio)) if ratio[i] > 2]
interest

[0.21582272536451991, 0.22396413793473299, 0.22061223708988106]


[3, 4, 5, 6, 15, 16, 18, 19, 20, 21, 22, 23, 24, 25, 27, 28, 34]

This gives Gini scores of roughly 0.216, 0.224, and 0.221
and important columns (by position) of [3, 4, 5, 6, 15, 16, 18, 19, 20, 21, 22, 23, 24, 25, 27, 28, 34].
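
The important columns are reported as positions in the feature table (after dropping id and target), not as names. A minimal sketch of mapping such positions back to column names with pandas follows; the toy frame, its column names, and the positions are invented for illustration and are not the competition's actual columns.

```python
import pandas as pd

# Hypothetical stand-in for the training frame; these column names are invented.
toy = pd.DataFrame({
    "id": [1, 2], "target": [0, 1],
    "feat_a": [0.1, 0.2], "feat_b": [3, 4], "feat_c": [5, 6], "feat_d": [7, 8],
})
features = toy.drop(["id", "target"], axis=1)

interest_positions = [0, 2]  # made-up positions, playing the role of `interest`
interest_names = features.columns[interest_positions].tolist()
reduced = features[interest_names]  # same columns as features.iloc[:, interest_positions]
print(interest_names)
```

Selecting by name rather than by position makes the reduced feature set easier to read and robust to later changes in column order.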

We now repeat the cross-validation, this time using only the most important predictors.

In [148]:
y_kfold = np.array(train["target"])
x_kfold = train.drop(["id", "target"], axis = 1)
x_kfold = np.array(x_kfold.iloc[:, interest])
kf = StratifiedKFold(n_splits=3)
kf.get_n_splits(x_kfold, y_kfold)
forest = RandomForestClassifier(n_estimators=500, max_depth=2, random_state=0, n_jobs=-1)

ginis = []

for train_index, test_index in kf.split(x_kfold, y_kfold):
    X_train, X_test = x_kfold[train_index], x_kfold[test_index]
    y_train, y_test = y_kfold[train_index], y_kfold[test_index]
    
    model = forest.fit(X_train, y_train)
    probs = forest.predict_proba(X_test)
    probs_final = [x[1] for x in probs]
    
    ginis.append(normalized_gini_index(g=y_test, predicted_probabilities=probs_final))
    print(len(ginis))

1
2
3


In [151]:
ginis

[0.22515354264447354, 0.23158850697725489, 0.22609265323812991]

The new Ginis are [0.22515354264447354, 0.23158850697725489, 0.22609265323812991].
The reduction in features didn't cause the Ginis to drop at all (in fact, they went up a bit), but the computation time did not drop either. The next step is to take the reduced model and try increasing the number of trees.
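
Trying several forest settings by hand means re-running the same cross-validation loop each time. A minimal sketch of wrapping that loop in a helper is below. It makes several assumptions: the helper name cv_gini is hypothetical, the data is synthetic (generated with make_classification purely so the example runs quickly), and the Gini is computed as 2 * AUC - 1 with scikit-learn's roc_auc_score rather than with the notebook's own function. The two agree when no predictions are tied and can differ slightly otherwise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cv_gini(X, y, n_splits=3, **forest_params):
    """Per-fold normalized Gini (computed as 2 * AUC - 1) for one forest setting."""
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=n_splits).split(X, y):
        forest = RandomForestClassifier(random_state=0, **forest_params)
        forest.fit(X[train_idx], y[train_idx])
        positive_probs = forest.predict_proba(X[test_idx])[:, 1]
        scores.append(2 * roc_auc_score(y[test_idx], positive_probs) - 1)
    return scores

# Synthetic, imbalanced stand-in data; sizes and values are chosen only to run quickly.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.9], random_state=0)

# Sweep the number of trees at a fixed depth; tiny forests keep the sketch fast.
results = {n: cv_gini(X, y, n_estimators=n, max_depth=2) for n in (10, 50)}
for n, scores in results.items():
    print(n, np.round(scores, 3))
```

The same helper could sweep max_depth instead by holding n_estimators fixed and varying the other keyword argument.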

In [20]:
y_kfold = np.array(train["target"])
x_kfold = train.drop(["id", "target"], axis = 1)
x_kfold = np.array(x_kfold.iloc[:, interest])
kf = StratifiedKFold(n_splits=3)
kf.get_n_splits(x_kfold, y_kfold)
forest = RandomForestClassifier(n_estimators=500, n_jobs=-1, max_depth=11)

ginis = []

for train_index, test_index in kf.split(x_kfold, y_kfold):
    X_train, X_test = x_kfold[train_index], x_kfold[test_index]
    y_train, y_test = y_kfold[train_index], y_kfold[test_index]
    
    forest.fit(X_train, y_train)
    probs = forest.predict_proba(X_test)
    probs_final = [x[1] for x in probs]
    
    ginis.append(normalized_gini_index(g=y_test, predicted_probabilities=probs_final))
    print(len(ginis))

1
2
3


Ginis for various numbers of trees (n), 3-fold, max depth 2:

n = 500  : [0.22450751789755535, 0.23204761929794665, 0.22267593701940497]

n = 1000 : [0.22280302864951337, 0.23167452434680696, 0.22335295226606269]

n = 2000 : [0.22311711050854513, 0.23229558294968261, 0.22425416160044812]

n = 3000 : [0.22302677539348212, 0.2321572048538286, 0.22492528983348847]

The number of trees seems to have little effect. Now we look at the results of increasing the depth, with n = 500 and 3 folds:

max depth = 3 : [0.2323206135305306, 0.23691938332400231, 0.23321463325695618]

max depth = 4 : [0.2386918741050057, 0.24210880425554068, 0.23738067414812231]

max depth = 5 : [0.24352711368719027, 0.24594164530106855, 0.24283873295515726]

max depth = 6 : [0.24726379801321008, 0.24851153683176316, 0.24561519123072281]

max depth = 7 : [0.24926826777510488, 0.25154541925431562, 0.24873424215860127]

max depth = 8 : [0.25186935115884101, 0.25244487181359732, 0.25104524729788208]

max depth = 9 : [0.25291066201294671, 0.25376964260539919, 0.25267636240651586]

max depth = 10 : [0.25414194176838978, 0.25408899804070506, 0.25366274897182656]

max depth = 11 : [0.25389078795774939, 0.25407960693588516, 0.25374236182468013]
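
To compare the depth settings at a glance, the per-fold scores listed above can be averaged. The sketch below copies those scores verbatim and uses only the standard library; the variable names are illustrative.

```python
from statistics import mean

# Per-fold Ginis copied from the depth results above (n = 500 trees, 3 folds).
depth_ginis = {
    3: [0.2323206135305306, 0.23691938332400231, 0.23321463325695618],
    4: [0.2386918741050057, 0.24210880425554068, 0.23738067414812231],
    5: [0.24352711368719027, 0.24594164530106855, 0.24283873295515726],
    6: [0.24726379801321008, 0.24851153683176316, 0.24561519123072281],
    7: [0.24926826777510488, 0.25154541925431562, 0.24873424215860127],
    8: [0.25186935115884101, 0.25244487181359732, 0.25104524729788208],
    9: [0.25291066201294671, 0.25376964260539919, 0.25267636240651586],
    10: [0.25414194176838978, 0.25408899804070506, 0.25366274897182656],
    11: [0.25389078795774939, 0.25407960693588516, 0.25374236182468013],
}

# Average the three folds for each depth and pick the depth with the highest mean.
mean_gini = {depth: mean(scores) for depth, scores in depth_ginis.items()}
best_depth = max(mean_gini, key=mean_gini.get)

for depth in sorted(mean_gini):
    print(depth, round(mean_gini[depth], 4))
print("best depth by mean Gini:", best_depth)
```

On these numbers the mean rises with every step up to depth 10 (about 0.2540) and then slips marginally at depth 11 (about 0.2539), which suggests the gains from extra depth have levelled off.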

In [21]:
ginis

[0.25389078795774939, 0.25407960693588516, 0.25374236182468013]