# Course lecture
[3 - Performance, Validation and Model Interpretation](http://course18.fast.ai/lessonsml1/lesson3.html)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bogeholm/fastai-intro-to-ml/blob/master/3-performance-validation-and-model-interpretation.ipynb)

# Notes
## Notes from last time
- Defaults of Random Forests in `scikit-learn` are very reasonable
- `proc_df()` from `fast.ai` replaces nulls in *numerical* columns with the **median**, and adds a Boolean column flagging which values were null
  - There may be values in the test set that weren't in the training set
  - The median may differ between the training and test sets
  - `proc_df()` changed to return `nas`, a dictionary with names of columns with nulls as keys, and medians as values
  - `nas` can be passed as an optional argument to the new `proc_df()`
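
The idea behind `nas` can be sketched in plain pandas. The helper name `fill_numeric_nas` below is hypothetical and is not the `fast.ai` implementation; it only illustrates computing medians on the training set and reusing them on unseen data.

```python
import numpy as np
import pandas as pd


def fill_numeric_nas(df, nas=None):
    """Hypothetical helper: fill numeric nulls with medians, add `<col>_na` flags.

    If `nas` is given, its columns and medians are reused instead of being
    recomputed, so unseen data is treated exactly like the training data.
    """
    df = df.copy()
    if nas is None:
        # Training pass: record the median of every numeric column with nulls
        numeric_cols = df.select_dtypes(include='number').columns
        nas = {c: df[c].median() for c in numeric_cols if df[c].isnull().any()}
    for col, median in nas.items():
        df[f'{col}_na'] = df[col].isnull()
        df[col] = df[col].fillna(median)
    return df, nas


train = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [10.0, 20.0, 30.0]})
test = pd.DataFrame({'a': [np.nan, 5.0], 'b': [np.nan, 40.0]})

train_filled, nas = fill_numeric_nas(train)       # medians learned here
test_filled, _ = fill_numeric_nas(test, nas=nas)  # and reused here
```

Column `b` has a null only in `test`, so this sketch leaves it untouched; that is exactly the "values in the test set that weren't in the training set" case from the notes, and it needs a separate decision.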

## Today
- Understanding the results of the model - Random Forests are not black boxes
- Working with large datasets ($> 10^8$ rows), specifically the [Kaggle groceries competition](https://www.kaggle.com/c/favorita-grocery-sales-forecasting), a relational, star-schema dataset
- When working with time series, the test set will generally contain **later** dates than the training set. If you want to use a subset of the training set for experimenting, use the **latest** dates possible (*in general*)
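
A minimal sketch of holding out the latest rows for validation. The function name `split_latest` and the column names `saledate` and `price` are made up for illustration.

```python
import pandas as pd


def split_latest(df, date_col, n_valid):
    """Sort by date and hold out the latest `n_valid` rows (n_valid must be > 0)."""
    df = df.sort_values(date_col).reset_index(drop=True)
    return df.iloc[:-n_valid], df.iloc[-n_valid:]


df = pd.DataFrame({
    'saledate': pd.to_datetime(
        ['2011-03-01', '2011-01-01', '2011-04-01', '2011-02-01']),
    'price': [30, 10, 40, 20],
})

train, valid = split_latest(df, 'saledate', n_valid=1)
```

Every date in `valid` is later than every date in `train`, which mimics how the test set relates to the training set.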

### Tips on large datasets
- Create a type dict for pandas to use while loading data, so that it does not spend RAM on inferring types; e.g.:
```python
types = {'id': 'int32', 'category':'int8'}
df = pd.read_csv('data.csv', dtype=types)
```
- Use the UNIX `shuf` command to get a random sample of the lines in the dataset. Then work as much as possible on the sample before progressing to the full dataset
- Convert as many `object` columns as possible into more primitive types (`bool`s, `int32`s, ...)
- Consider `set_rf_samples()` again
- Set `min_samples_leaf=100` (31:00)
- Do not use `oob_score=True` when using `set_rf_samples` on very large datasets
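
Newer versions of `scikit-learn` (0.22 and later) have a `max_samples` argument on the forest estimators, which gives each tree a random subsample of the rows. This is similar in spirit to `set_rf_samples()`, though not necessarily identical. A sketch on synthetic data, with sizes chosen only for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=5_000)

m = RandomForestRegressor(
    n_estimators=10,
    max_samples=1_000,     # rows drawn per tree (requires bootstrap=True, the default)
    min_samples_leaf=100,  # large leaves keep the trees shallow and fast
    oob_score=False,       # the default; left off, as the notes recommend
    random_state=0,
)
m.fit(X, y)

# With at most 1_000 rows per tree and at least 100 rows per leaf,
# no tree can have more than 10 leaves
leaf_counts = [tree.get_n_leaves() for tree in m.estimators_]
```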

### Other tips
- pandas `df.describe(include='all')`
- Check out `%prun` for profiling function calls in notebooks
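
Outside a notebook, a similar profile can be taken with the standard library's `cProfile`. The function `slow_sum` is a made-up stand-in for whatever needs profiling.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately naive loop, used only as something to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Collect the five most expensive calls, sorted by cumulative time
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats('cumulative').print_stats(5)
report = buffer.getvalue()
print(report)
```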

## Background
- Random Forests are fine for structured data
- For unstructured data (images, speech, ...), try deep learning
- Quality of validation set directly influences quality of model in production
- Plotting "quality" on validation set on x-axis and quality in production on y-axis should ideally lie on a straight line

## Further notes
- Look at `train_cats(df)` and `apply_cats(df1, df2)` which are called (in that order) before `proc_df()`
- RandomForests do not work well on the groceries dataset out of the box
- Tip: take the last two weeks of data and use the average sales per store, item and on_promotion - that got into the top 30 on Kaggle!
- Coding machine learning is not technically difficult - but if you get a tiny detail wrong, your model ends up worse than it could be. If you're not doing e.g. Kaggle, there's no good way to know that. This is an open problem.
- Supplement using external data, if possible.
- Check out [Rossmann store sales](https://www.kaggle.com/c/rossmann-store-sales/overview/description) on Kaggle, and [XGBoost](https://xgboost.readthedocs.io/en/latest/index.html). See also the [Kaggle blog](https://medium.com/kaggle-blog)
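
The reason for calling `train_cats()` before `apply_cats()` can be sketched with plain pandas categoricals: the categories are learned on the training set and then reused on the other set, so the integer codes match. This is an illustration of the idea, not the `fast.ai` code, and the `size` column is made up.

```python
import pandas as pd

train = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})
valid = pd.DataFrame({'size': ['medium', 'huge', 'small']})

# Learn the categories from the training set only
train['size'] = train['size'].astype('category')
categories = train['size'].cat.categories

# Reuse exactly the same categories (and therefore codes) on the validation set
valid['size'] = pd.Categorical(valid['size'], categories=categories)

train_codes = train['size'].cat.codes.tolist()
valid_codes = valid['size'].cat.codes.tolist()
```

Here `'huge'` never occurs in the training set, so it becomes a missing value with code `-1` in the validation set.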


# Implementation plan
- Determine *test* vs. *validation* set nomenclature
- Clean up notes
- Read the data you saved last time
- Keep track of missing values in `proc_df`? Return a dict with the columns that had missing values, as well as their medians, to use on unseen data or data subsets
- Use all values for training (test set), *last* 12_000 for validation

## Implementation notes
- We don't just want a prediction, we want a confidence of the prediction. Any 'unusual' rows should get low confidence
- If the standard deviation of a row prediction between trees is high, we have low confidence
- `min_samples_leaf=3` and `max_features=0.5` turned out fine last time. Show?
- Parallelize! (1:00:00) `parallel_trees()` from `fast.ai`
- Inspect the data, looking at relative standard deviation based on categories (`std` / `prediction`)
- Compare predictions using all variables, and only the - say - top ten
- Use pandas plotting functions, for the sport of it
- Possibly learn more about the variables. Fuzzy variables (eg. `fiModelDesc`); can they be ordered? Can we split the strings?
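
The confidence idea from the notes above can be sketched with `scikit-learn` alone: predict with every tree separately and look at the spread. The data is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=500)

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          max_features=0.5, random_state=42)
m.fit(X, y)

# One row of predictions per tree: shape (n_trees, n_rows)
per_tree = np.stack([tree.predict(X) for tree in m.estimators_])

mean_pred = per_tree.mean(axis=0)  # the forest's prediction
std_pred = per_tree.std(axis=0)    # high std means the trees disagree
relative_std = std_pred / np.abs(mean_pred)
```

Rows with a high `std_pred` are those where the trees disagree, i.e. where confidence is low. Grouping `relative_std` by a categorical column then shows which categories the model is least sure about.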

### Feature importance
- `rf_feat_importance` (from `fast.ai`, based on `scikit-learn`)
  - Plot ordered, barplots, however it looks good
  - Importance can be used to find out *which features to learn more about*
  - If importance goes against expectations, it is worthwhile to investigate. Data problems? New knowledge?
  - Maybe a column is only predictive for values that are *missing* (example @ 1:11:45)? In this example, information about research grant applications (the topic of interest) was only entered into a database *manually* for grant applications *that were accepted*; i.e. `null => acceptance=False`. This is **data leakage**.
  - **Collinearity** is another possibility - the variable could be indicative of something else entirely
  - By removing less important variables, you are potentially removing sources of collinearity. This makes the 'true' relationships clearer
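
A sketch of an ordered importance table built directly on `scikit-learn`'s `feature_importances_`. The frame layout (`cols`, `imp`) is an assumption about what a helper like `rf_feat_importance` might return, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1_000, 4)), columns=['a', 'b', 'c', 'd'])
# Only `a` and `b` carry signal; `c` and `d` are pure noise
y = 5.0 * df['a'] + 1.0 * df['b'] + rng.normal(scale=0.1, size=1_000)

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, random_state=0)
m.fit(df, y)

# Importance table, most important feature first
fi = (pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_})
        .sort_values('imp', ascending=False)
        .reset_index(drop=True))
```

From here, `fi.plot.barh(x='cols', y='imp')` would draw the ordered bar plot using the pandas plotting functions.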

### Technique for inspecting feature importance
- Start with base dataset and make a predictive model $m_0$.
- Pick a column, and randomly shuffle it. Compare $R^2$ and RMSE of $m_0$ on the original and shuffled datasets.
- Why not just exclude the column? That would mean training a whole new random forest for each column, which is slow.
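
The shuffling technique above can be sketched as follows, on synthetic data where only the first two columns carry signal. The model is trained once and only the validation data is shuffled.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=2_000)
X_train, X_valid = X[:1_500], X[1_500:]
y_train, y_valid = y[:1_500], y[1_500:]

# Train the base model m0 once
m0 = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, random_state=0)
m0.fit(X_train, y_train)
baseline = r2_score(y_valid, m0.predict(X_valid))

# Shuffle one column at a time and record how much R^2 drops
drops = {}
for col in range(X_valid.shape[1]):
    X_shuffled = X_valid.copy()
    X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
    drops[col] = baseline - r2_score(y_valid, m0.predict(X_shuffled))
```

The larger the drop, the more the model relies on that column. `scikit-learn` 0.22 and later also ships `sklearn.inspection.permutation_importance`, which repeats the shuffle several times and averages the result.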

## Install [dataworks](https://github.com/bogeholm/dataworks)

In [1]:
# Uncomment to install dataworks:
#!pip install --upgrade --quiet git+https://github.com/bogeholm/dataworks.git

## Get [notebook utilities file](https://github.com/bogeholm/fastai-intro-to-ml/blob/master/notebookutils.py) - only relevant if in Colab

In [2]:
# Uncomment to fetch notebook utilities
#!curl --proto '=https' --tlsv1.2 -sSf --output notebookutils.py 'https://raw.githubusercontent.com/bogeholm/fastai-intro-to-ml/master/notebookutils.py'

## Imports

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import sys

#from dataworks.df_utils import (add_datefields, add_nan_columns, categorize_df, inspect_df, 
#    numeric_nans, summarize_df,)

from notebookutils import get_basepath, rmse, print_score

# Pandas options
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)

# Matplotlib/Jupyter extras and options
from IPython.display import display, set_matplotlib_formats
from IPython.core.pylabtools import figsize

%matplotlib inline
set_matplotlib_formats('png')
# https://matplotlib.org/3.1.1/gallery/style_sheets/style_sheets_reference.html
plt.style.use('seaborn-whitegrid')
plt.rcParams.update({'font.size': 16})

## Mount Google Drive (for [Colab](https://colab.research.google.com/) integration)

In [4]:
BASEPATH = get_basepath()

In [5]:
DATAPATH = os.path.join(*[BASEPATH, 'all', 'your', 'sub', 'directories'])
print(f'DATAPATH: {DATAPATH}')

DATAPATH: C:\Users\tbm\Documents\machinelearning\fastai-intro-to-ml\data\all\your\sub\directories


## Notebook utilities

In [6]:
def display_allrows(df):
    """Override max rows and display them all."""
    with pd.option_context('display.max_rows', len(df)):
        display(df)

# Start learning ...