# Validation on Hierarchical Data

You are familiar with cross-validation and holdout validation, but when should each of them be used? And what happens if our data has a complex structure?

In this notebook, we will demonstrate some examples to help answer these questions.

## Holdout validation

During holdout validation, we select a holdout sample at the beginning of the modelling process, before we have done any data processing or model training. This is how we test the performance of a model on a situation it has not seen before. If possible, it is good to use data which has also been collected separately, for example a separate year of data, or data from a separate geographical sample.

This helps ensure the relations in the model are *general* to the population and not *specific* to your dataset. 

It is vital that the validation data remains untouched until testing!

Let's use the travel mode dataset to investigate scoring methods. In this dataset there isn't an easy way of selecting data which was collected separately, as we do not have information such as the trip dates.

We will therefore have to rely on some sort of random sampling.

Let's load the data and the libraries we need.

In [None]:
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics, preprocessing, model_selection

import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
df = pd.read_csv('./data/travel_mode.csv')
bool_map = {'no': 0, 'yes': 1}
education_map = {'lower': 0, 'middle': 1, 'higher': 2}
income_map = {'less20': 0, '20to40': 1, 'more40': 2}
mode_map = {'car': 0, 'walk': 1, 'bike': 2, 'pt': 3}

df.mode_main = df.mode_main.map(mode_map)
df.male = df.male.map(bool_map)
df.license = df.license.map(bool_map)
df.weekend = df.weekend.map(bool_map)
df.education = df.education.map(education_map)
df.income = df.income.map(income_map)

df = pd.get_dummies(df)

# drop the native column so that n-1 categories remain
df.drop('ethnicity_native', axis=1, inplace=True)

Let's have a look at the `household_id` column, along with some other columns.

In [None]:
df.head(20)

So we can clearly see that people from the same household make multiple trips, and those trips will be highly correlated, both in terms of the explanatory variables (e.g. `green`, the percentage of green space in the vicinity of the household) and the target variable (mode choice).

For instance, four trips are made by a 24-year-old male in household 3460, all walking, and all with identical explanatory variables except distance and density.

This data is therefore *hierarchical*.

### Implications of hierarchical data

So what is the relevance of the hierarchical nature of the data?

When we sample data randomly to form a test set, each individual row has an equal probability of being selected, which means there is a high chance of rows (trips) made by the same household appearing in both the test and train datasets. This is bad, as we can then overfit the model to noise in that household's data, i.e. unique features of the household or individual, rather than general relations which indicate someone is likely to take one mode over another.
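
To get a feel for how often this happens, here is a minimal sketch on synthetic data (not the travel mode dataset). It assumes 1000 hypothetical households with exactly 5 trips each, and uses the default row-wise `train_test_split` to count how many test households also have trips in the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the trip table: 1000 households, 5 trips each
# (both numbers are assumptions made for illustration).
household_id_demo = np.repeat(np.arange(1000), 5)
trip_rows = np.arange(len(household_id_demo))

# Row-wise random split; the default is 75% train, 25% test.
train_rows, test_rows = train_test_split(trip_rows, random_state=0)

# Households that contribute at least one trip to each side of the split.
train_hh = set(household_id_demo[train_rows])
test_hh = set(household_id_demo[test_rows])
shared_hh = train_hh & test_hh

share_of_test_hh_seen = len(shared_hh) / len(test_hh)
print(f"{share_of_test_hh_seen:.1%} of test households also appear in training")
```

With several trips per household, nearly every test household has already been seen during training, which is exactly the situation described above.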

Let's dig a little deeper to find out more.

In [None]:
print('There are {} trips made by {} households in the data'.format(len(df), df.household_id.nunique()))

So it looks like we have around 5 trips made by each household! For sure that will introduce correlations between test and train data which are specific to the household, and not general.
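
As a small sketch of how the trips-per-household figure can be checked, the toy frame below (made-up values, not the real data) counts households with `nunique()` and trips per household with `groupby(...).size()`. Counting with `nunique()` does not rely on the ids being numbered consecutively.

```python
import pandas as pd

# Toy trip table with hypothetical household ids; note the ids are not 1..N.
trips_demo = pd.DataFrame({
    "household_id": [10, 10, 10, 42, 42, 99],
    "distance": [1.2, 0.8, 3.5, 7.1, 6.9, 0.4],
})

n_households_demo = trips_demo["household_id"].nunique()
trips_per_household = trips_demo.groupby("household_id").size()
mean_trips_demo = len(trips_demo) / n_households_demo

print(trips_per_household)
print("mean trips per household:", mean_trips_demo)
```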

First, let's generate some results for a model that we know will overfit, a random forest whose trees have no maximum depth, using our original random sampling. To begin, let's generate train and test folds. We will add the suffix `_r` for random.

In [None]:
y = df.mode_main
hh = df.household_id
X = df.drop(['mode_main', 'household_id'], axis=1)

In [None]:
X_train_r, X_test_r, y_train_r, y_test_r = model_selection.train_test_split(X, y)

Then we can create our candidate model. We'll use standard parameters, except we won't restrict the tree depth!

In [None]:
clf_r = RandomForestClassifier(n_estimators=50, max_features=3, max_depth=None, n_jobs=-1)

The next cell might be slow!

In [None]:
clf_r.fit(X_train_r, y_train_r)

Next, calculate the discrete and probabilistic classifications for the classifier, and use them to calculate the log loss and accuracy on the test data.

In [None]:
# generate y_pred and y_probs
y_pred_r = clf_r.predict(X_test_r)
y_probs_r = clf_r.predict_proba(X_test_r)


In [None]:
# and the accuracy score
metrics.accuracy_score(y_test_r, y_pred_r)


In [None]:
# and the log loss
metrics.log_loss(y_test_r, y_probs_r)
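
For reference, both metrics can be reproduced by hand. The sketch below uses a made-up 4-sample, 3-class example and assumes the discrete prediction is simply the class with the highest predicted probability; it then checks the hand-computed values against `metrics.accuracy_score` and `metrics.log_loss`.

```python
import numpy as np
from sklearn import metrics

# Toy example: 4 samples, 3 classes (values are made up for illustration).
y_true_demo = np.array([0, 2, 1, 2])
y_probs_demo = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
])

# Discrete prediction: the class with the highest predicted probability.
y_pred_demo = y_probs_demo.argmax(axis=1)
accuracy_demo = np.mean(y_pred_demo == y_true_demo)

# Log loss: mean negative log of the probability assigned to the true class.
p_true = y_probs_demo[np.arange(len(y_true_demo)), y_true_demo]
log_loss_demo = -np.mean(np.log(p_true))

print(accuracy_demo, metrics.accuracy_score(y_true_demo, y_pred_demo))
print(log_loss_demo, metrics.log_loss(y_true_demo, y_probs_demo, labels=[0, 1, 2]))
```

Accuracy only looks at the chosen class, whereas log loss also penalises confident probabilities placed on the wrong class, so lower log loss is better.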


Great, we have some baseline scores to compare against! By not restricting the tree depth at all, we are almost certainly overfitting our model to the noise from the hierarchical data. We can see our metrics have improved hugely from yesterday, just by not restricting the tree depth.

We are essentially predicting the training data!

### Grouped sampling
But how do we deal with hierarchical data when sampling test data?

Scikit-learn can help! We can use `GroupShuffleSplit` from the `model_selection` module, see the documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupShuffleSplit.html).

Remember, we only need one split for holdout validation, so be sure to set the appropriate parameters. Also stick to a 70:30 split.

Also, when using split, make sure to use the household series (stored as `hh`) as the groups.

*HINT* `split` returns a generator, so you will need to get the result out of the generator by iterating over it. Remember, you can use [`next`](https://docs.python.org/2/library/functions.html#next) to simplify things!

In [None]:
gss = model_selection.GroupShuffleSplit(n_splits=1, train_size=.7, test_size=.3, 
                                        random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=hh))


We can then use the indices from the split to generate the train and test data (yes, it is a little long-winded!). We can use the suffix `_g` for grouped.

In [None]:
# the split returns positional indices, so use .iloc for y as well as X
X_train_g, X_test_g, y_train_g, y_test_g = X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]
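
A quick sanity check on a grouped split is to confirm that no group lands on both sides. The sketch below does this on synthetic data, assuming 200 hypothetical households with 5 trips each; the `_demo` names are illustrative and not part of the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

groups_demo = np.repeat(np.arange(200), 5)      # hypothetical household ids
X_split_demo = np.zeros((len(groups_demo), 1))  # features do not affect the split

gss_demo = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx_demo, test_idx_demo = next(gss_demo.split(X_split_demo, groups=groups_demo))

# No household should contribute rows to both the train and the test set.
overlap = set(groups_demo[train_idx_demo]) & set(groups_demo[test_idx_demo])
test_fraction = len(test_idx_demo) / len(groups_demo)

print("shared households:", len(overlap))
print("share of rows in test:", test_fraction)
```

Note that `test_size` is applied to the groups rather than the rows, so when households have different numbers of trips the row-level ratio can drift away from 70:30.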

Now let's fit the same classifier to our correctly sampled data...

In [None]:
clf_g = RandomForestClassifier(n_estimators=50, max_features=3, max_depth=None, n_jobs=-1)
clf_g.fit(X_train_g, y_train_g)

In [None]:
y_pred_g = clf_g.predict(X_test_g)
y_probs_g = clf_g.predict_proba(X_test_g)

In [None]:
metrics.accuracy_score(y_test_g, y_pred_g)

In [None]:
metrics.log_loss(y_test_g, y_probs_g)

Wow! That's a pretty big drop in our metrics: accuracy has dropped from 83% to 68%! It goes to show that it is **very** important to deal with hierarchical data properly.
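
The effect can be reproduced in isolation. The sketch below is a deliberately extreme synthetic case, not the travel mode data: each hypothetical household gets one unique feature value and a purely random label, so there is nothing general to learn. An unrestricted `DecisionTreeClassifier` (swapped in for the forest to keep it short) is scored on a random split and on a grouped split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_households, trips_each = 500, 5

hh_demo = np.repeat(np.arange(n_households), trips_each)
green_demo = rng.random(n_households)          # one unique value per household
label_demo = rng.integers(0, 2, n_households)  # random label per household

X_demo = green_demo[hh_demo].reshape(-1, 1)
y_demo = label_demo[hh_demo]


def holdout_accuracy(train_idx, test_idx):
    """Fit an unrestricted tree on the train rows and score it on the test rows."""
    tree = DecisionTreeClassifier(max_depth=None, random_state=0)
    tree.fit(X_demo[train_idx], y_demo[train_idx])
    return tree.score(X_demo[test_idx], y_demo[test_idx])


rows = np.arange(len(hh_demo))
acc_random = holdout_accuracy(*train_test_split(rows, test_size=0.3, random_state=0))

gss_demo = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
acc_grouped = holdout_accuracy(*next(gss_demo.split(X_demo, y_demo, groups=hh_demo)))

print("random split accuracy: ", acc_random)
print("grouped split accuracy:", acc_grouped)
```

With the random split the tree can look up each household's value and repeat its label, so the score is high even though the labels are noise; with the grouped split the test households are unseen and the score falls back towards chance.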

### Cross-validation

Cross-validation lets us estimate model performance on a dataset of fixed size, by repeatedly training on some folds of the data and scoring on the remaining fold. It is very useful, for instance, when trying to select model parameters.

It is important to note that the validation we do to select model parameters or engineer features needs to be completely separate from the holdout validation used to test the model. For instance, if we separate the data into a training set and a holdout test set, we should then only do model selection using the training set. Otherwise, we will be selecting parameters which allow us to fit to our test data. This is known as *data leakage*.

So how do we do cross-validation with our hierarchical data? This time, we can use [GroupKFold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html)!

We will use GroupKFold to do a simple parameter search on max-depth for the forest.

Firstly, we need to get hold of the *groups* (`household_id`) for the train data.

In [None]:
gss = model_selection.GroupShuffleSplit(n_splits=1, train_size=.7, test_size=.3, 
                                        random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=hh))
X_train_g, X_test_g, y_train_g, y_test_g = X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]
hh_train_g, hh_test_g = hh.iloc[train_idx], hh.iloc[test_idx]

Next we can use `GroupKFold` with `GridSearchCV`, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

*HINT* Use the `cv` parameter within `GridSearchCV` to do a grouped K-fold CV.

Let's leave the other parameters the same, and just investigate the `max_depth` parameter. We can try values of 2, 5 and 10, as a crude search.

In [None]:
clf_cv = RandomForestClassifier(n_estimators=50, max_features=3, n_jobs=-1)

In [None]:
params = {'max_depth': [2, 5, 10]}
# use GroupKFold with 3 splits and GridSearchCV to search max_depth values of 2, 5 and 10.

gcv = model_selection.GroupKFold(n_splits=3).split(X_train_g, y_train_g, groups=hh_train_g)
gs = model_selection.GridSearchCV(clf_cv, param_grid=params, scoring=['accuracy', 'neg_log_loss'],
                                  n_jobs=1, cv=gcv, refit='neg_log_loss', verbose=3)
gs.fit(X_train_g, y_train_g)
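
An alternative way to wire this up, sketched here on synthetic data from `make_classification` with hypothetical group ids, is to pass the `GroupKFold` object itself as `cv` and hand the groups to `fit`. The small forest and the `_demo` names are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic data: 300 rows, 60 hypothetical households of 5 rows each.
X_cv_demo, y_cv_demo = make_classification(n_samples=300, n_features=6, random_state=0)
groups_cv_demo = np.repeat(np.arange(60), 5)

search_demo = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid={"max_depth": [2, 5, 10]},
    scoring="neg_log_loss",
    cv=GroupKFold(n_splits=3),  # splitter object; groups are supplied at fit time
)
search_demo.fit(X_cv_demo, y_cv_demo, groups=groups_cv_demo)

best_depth_demo = search_demo.best_params_["max_depth"]
print("selected max_depth:", best_depth_demo)
print("mean CV neg log loss:", search_demo.best_score_)
```

Passing the splitter rather than a pre-built generator means the same search object can be refitted on different data without rebuilding the folds by hand.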


So it seems the deepest setting in our grid worked best out of the values we tried (see `gs.best_params_` for the selected `max_depth`). Let's see how well this classifier works on our validation data.

In [None]:
clf_cv = gs.best_estimator_

In [None]:
clf_cv_pred = gs.predict(X_test_g)
clf_cv_probs = gs.predict_proba(X_test_g)

In [None]:
metrics.accuracy_score(y_test_g, clf_cv_pred)

In [None]:
metrics.log_loss(y_test_g, clf_cv_probs)

Finally, we can investigate which features were most important in the two different classifiers. The following cells plot bar charts of the feature importances for each classifier.

In [None]:
features = X.columns
importances = clf_r.feature_importances_
indices = np.argsort(importances)

plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()

In [None]:
features = X.columns
importances = clf_cv.feature_importances_
indices = np.argsort(importances)

plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()

So it seems the first classifier, with the random sampling, was overfitting to the `float` values, `density`, `green`, and `diversity`. This makes sense, as each tree could just memorise the mode associated with each unique value, and repeat it for the test data. The classifier using grouped sampling puts much more emphasis on car ownership and trip distance.