## Bagging

The bagging estimate is defined by 
$$\hat{f}(x)=\frac{1}{B}\sum^{B}_{b=1}\hat{f}^{*b}(x)$$

Bagging seems to work especially well for high-variance, low-bias procedures, such as trees. The essential idea in bagging is to average many noisy but approximately unbiased models, and hence reduce the variance.

### Bag of little bootstraps

## Bootstrap Methods
Suppose we have a model fit to a set of training data. We denote the training set by $Z = (z_1, z_2, ..., z_N)$ where $z_i = (x_i, y_i)$. The basic idea is to randomly draw datasets with replacement from the training data, each sample the same size as the original training set. This is done $B$ times, producing $B$ bootstrap datasets. Then we refit the model to each of the bootstrap datasets, and examine the behavior of the fits over the $B$ replications.
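
As a rough illustration of the resampling idea above, here is a minimal sketch in which the "model" is simply the sample mean. The function name `bootstrap_means`, the toy data and the seed are all assumptions made for the example, not part of the original notes.

```python
import random
import statistics

def bootstrap_means(data, B, seed=0):
    """Return the mean of each of B bootstrap samples drawn from data."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(B):
        # Draw n points with replacement: same size as the original set.
        sample = rng.choices(data, k=n)
        # "Refit the model" on the bootstrap sample (here: recompute the mean).
        means.append(statistics.mean(sample))
    return means

z = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]  # toy training set
means = bootstrap_means(z, B=200)

# The spread of the B refits estimates how variable the fit is.
print(statistics.mean(means), statistics.stdev(means))
```

Averaging the `B` refits, as in the bagging formula, would use exactly this loop with a real model in place of the mean.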

# Lesson 2
## Random Forest for Regression or Classification

1. For b = 1 to $B$:  
    (a) Draw a bootstrap sample $\mathbf{Z}^*$ of size *N* from the training data.  
    (b) Grow a random-forest tree $T_b$ to the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size $n_{min}$ is reached.  
        i. Select m variables at random from the p variables.  
        ii. Pick the best variable/split-point among the m.  
        iii. Split the node into two daughter nodes.    
2. Output the ensemble of trees $\{T_b\}^B_1$.  

To make a prediction at a new point x:  
*Regression*: $\hat{f}^B_{rf}(x) = \frac{1}{B}\sum^{B}_{b=1}T_b(x)$.  
*Classification*: Let $\hat{C}_b(x)$ be the class prediction of the *b*th random forest tree. Then $\hat{C}_{rf}(x) = \text{majority vote}\ \{\hat{C}_b(x)\}^B_1$.
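
The regression version of the algorithm above can be sketched with scikit-learn's `DecisionTreeRegressor` doing the tree growing. This is only an approximation under stated assumptions: `max_features=m` stands in for step (b)i, `min_samples_leaf` stands in for the minimum node size $n_{min}$, and the helper names `fit_forest` and `predict_forest` and the synthetic data are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, B=25, m=2, n_min=5, seed=0):
    """Grow B trees, each on its own bootstrap sample of the rows."""
    rng = np.random.default_rng(seed)
    N = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)       # (a) bootstrap sample of size N
        tree = DecisionTreeRegressor(
            max_features=m,                    # (b)i. m candidate variables per split
            min_samples_leaf=n_min,            # stand-in for the minimum node size
            random_state=int(rng.integers(1 << 31)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Regression prediction: the average of the B tree predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)

# Synthetic data, purely for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=300)

trees = fit_forest(X, y)
pred = predict_forest(trees, X)
print(pred[:3], y[:3])
```

For classification, the averaging in `predict_forest` would be replaced by a majority vote over the trees' class predictions.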

### symlink
A symbolic link, also termed a soft link, is a special kind of file that points to another file, much like a shortcut in Windows or a Macintosh alias. Unlike a hard link, a symbolic link does not contain the data in the target file. It simply points to another entry somewhere in the file system. 
```linux
ln -s source_file myfile
```
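
The same link can be created from Python with `os.symlink`. The sketch below works in a temporary directory so it leaves nothing behind; the file names mirror the shell example and the file content is made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "source_file")
    link = os.path.join(tmp, "myfile")
    with open(source, "w") as f:
        f.write("hello")

    os.symlink(source, link)          # same idea as: ln -s source_file myfile
    is_link = os.path.islink(link)    # True: myfile is a link, not a copy
    target = os.readlink(link)        # the path the link points to
    with open(link) as f:
        content = f.read()            # reading the link reads the target's data

print(is_link, content)
```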

#### Small trick

In [None]:
PATH = "data/bulldozers"

In [None]:
! ls {PATH}

<font color=blue>It's important to note what metric is being used for a project. Generally, selecting the metric(s) is an important part of the project setup.</font>

## In-Class Notes

The following method extracts particular date fields from a complete datetime for the purpose of constructing categoricals. You should **always** consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities.

In [None]:
add_datepart(df_raw, 'saledate')

The categorical variables are currently stored as strings, which is inefficient and doesn't provide the numeric coding required for a random forest. Therefore we call `train_cats` to convert strings to pandas categories.

In [None]:
train_cats(df_raw)

In [None]:
df_raw.UsageBand.cat.categories # the .cat accessor shows a lot of information about categorical variables

Go through each column and numericalize it. If a column is not numeric, replace it in the dataframe with the column's category codes plus one (to deal with missing values: pandas codes a missing category as -1, so it becomes 0).

In [None]:
df, y = proc_df(df_raw, 'SalePrice')
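
To see what "category codes plus one" does to a single column, here is a small pandas-only sketch. It does not use `proc_df`; the column values are invented for the example.

```python
import pandas as pd

# A toy string column with one missing value.
s = pd.Series(["High", "Low", None, "Medium", "Low"]).astype("category")

print(list(s.cat.categories))  # the distinct levels pandas found

# pandas codes a missing value as -1, so adding one maps it to 0
# and shifts the real categories to 1, 2, 3, ...
codes = s.cat.codes + 1
print(list(codes))
```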

Fit a random forest

In [None]:
m = RandomForestRegressor(n_jobs = -1)
m.fit(df, y)
m.score(df, y) # for scikit-learn regressors, score() returns R^2

* What's $R^2$  
$R^2=1-\frac{SS_{res}}{SS_{tot}}$  
$SS_{tot} = \sum_{i}(y_i-\overline{y})^2$, $SS_{res} = \sum_{i}(y_i-f_i)^2$  
If the predictions $f_i$ don't come from OLS, the value of $R^2$ can be anywhere from $-\infty$ to 1
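
A small worked example of the definition, using plain Python and made-up numbers, shows how $R^2$ drops below zero when predictions are worse than always predicting the mean. The function name `r_squared` is just for this sketch.

```python
def r_squared(y, f):
    """R^2 = 1 - SS_res / SS_tot for targets y and predictions f."""
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y, y)                        # 1.0: no residual error
mean_only = r_squared(y, [2.5] * 4)              # 0.0: no better than the mean
reversed_ = r_squared(y, [4.0, 3.0, 2.0, 1.0])   # -3.0: worse than the mean
print(perfect, mean_only, reversed_)
```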

* In general, any time you're building a model that has a time element, you want your test set to be a separate time period, and therefore you really need your validation set to be a separate time period as well.

In [None]:
def split_vals(a, n): return a[:n].copy(), a[n:].copy()
n_valid = 12000 # same as Kaggle's test set size
n_trn = len(df) - n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

[PEP 8 -- Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/)

In [None]:
%time m.fit(X_train, y_train) # gives you the time needed to run for this function

In [None]:
df_trn, y_trn = proc_df(df_raw, 'SalePrice', subset=30000)
X_train, _ = split_vals(df_trn, 20000) # _ discards the second part, which we don't need
y_train, _ = split_vals(y_trn, 20000)

In [None]:
m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

* Make a small tree. So we pass in `max_depth=3`. This creates a small deterministic tree.   
The first step is to create the first binary decision. How would you pick which variable and split point?  
* For each candidate split value of each variable, compare the weighted average MSE of the two resulting groups? Or other metric(s).   
* If we don't specify `max_depth`, the tree keeps splitting until every leaf node has only one thing in it, so the training $R^2$ is 1. But the validation score is not as good as we want.
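
The split search described above can be sketched for a single variable as follows. This is an illustrative brute-force version with invented data, not scikit-learn's implementation; a real tree would repeat it for every variable and keep the best overall.

```python
def best_split(x, y):
    """Return (threshold, score) minimizing the weighted average MSE."""
    def mse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    best = (None, float("inf"))
    # Skip the largest value: splitting there would leave the right side empty.
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        # Weight each side's MSE by the share of rows that fall into it.
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
        if score < best[1]:
            best = (threshold, score)
    return best

x = [1, 2, 3, 10, 11, 12]
y = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
threshold, score = best_split(x, y)
print(threshold, score)  # splitting at x <= 3 separates the two groups exactly
```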

### bagging (n_estimators)
What if we created five different models, each of which was only somewhat predictive, but the models weren't at all correlated with each other? They gave predictions that weren't correlated with each other. That would mean the five models had found profoundly different insights into the relationships in the data, so if you took the average of those five models, you would effectively be bringing together the insights from each of them. This idea of averaging models is a technique for ensembling, which is really important.   
**The entire purpose of modeling in machine learning is to find a model which tells you which variables are important and how they interact together to drive your dependent variable.**  

The research community in recent years has generally found that creating uncorrelated trees seems to matter more than creating more accurate trees, so more recent advances tend to produce trees which are less predictive on their own but also less correlated with each other. For example, scikit-learn has another pair of classes you can try, `ExtraTreesRegressor` and `ExtraTreesClassifier`, with exactly the same API. This is the extremely randomized trees model. It does exactly the same as what we just discussed, but rather than trying every split of every variable, it randomly tries a few splits of a few variables.

<font color=red>**Question**: does a tree with max_depth = None always include the best tree with max_depth = 3? Like best-subset selection in linear regression, will max_depth = None pick the best one among all possible models? So the order of variables in the tree does not matter?</font>

In [None]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(X_train, y_train)

preds = np.stack([t.predict(X_valid) for t in m.estimators_])
preds[:, 0], np.mean(preds[:,0]), y_valid[0]

preds.shape will be (10, 12000), which means 12,000 predictions for each of the ten trees.

`m.estimators_` gives us each individual tree in the forest. Here we want to see the predictions of each tree, and we look at just the first column, that is, every tree's prediction for the first validation row.

The shape of this curve (no pic) suggests that adding more trees isn't going to help us much (e.g. n_estimators=20, n_estimators=30, n_estimators=40, ...).
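
Since the picture is missing, here is one hedged way such a curve could be computed: average the predictions of the first `k` trees and score that average, for each `k`. The data is synthetic and the sizes are arbitrary, so the numbers will not match the bulldozers notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic regression data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=600)
X_trn, X_val, y_trn, y_val = X[:400], X[400:], y[:400], y[400:]

m = RandomForestRegressor(n_estimators=40, random_state=0).fit(X_trn, y_trn)

# One row of predictions per tree: shape (n_trees, n_validation_rows).
preds = np.stack([t.predict(X_val) for t in m.estimators_])

# Validation R^2 of the average of the first k trees, for k = 1..40.
scores = [r2_score(y_val, preds[:k].mean(axis=0)) for k in range(1, len(preds) + 1)]
print(scores[0], scores[9], scores[-1])
```

Plotting `scores` against `k` gives the curve in question; it typically rises quickly and then flattens.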

### Out-of-bag (OOB) score
Is our validation set worse than our training set because we're over-fitting, or because the validation set is for a different time period, or a bit of both? With the existing information we've shown, we can't tell. However, random forests have a very clever trick called out-of-bag error which can handle this.  

The idea is to calculate error on the training set, but only include the trees in the calculation of a row's error where that row was not included in training that tree. This allows us to see whether the model is over-fitting, without needing a separate validation set.  

This also has the benefit of allowing us to see whether our model generalizes, even if we only have a small amount of data and so want to avoid setting some aside to create a validation set.

In [None]:
m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
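
A minimal, self-contained sketch of reading the OOB score after fitting, on synthetic data (the data and seed are assumptions for the example):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=500)

m = RandomForestRegressor(n_estimators=40, oob_score=True, random_state=0)
m.fit(X, y)

train_r2 = m.score(X, y)   # R^2 on rows the trees were trained on (optimistic)
oob_r2 = m.oob_score_      # R^2 where each row is scored only by trees that did not see it
print(train_r2, oob_r2)
```

A large gap between the two numbers points to over-fitting rather than to a difference between time periods.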

### Grid search
scikit-learn has a class for grid search, `GridSearchCV`. You pass in the list of all the hyperparameters you want to tune, and for each one the list of values you want to try. It runs your model on every possible combination of those hyperparameters and tells you which one is the best. And the OOB score is a great choice for ...
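
A small sketch of grid search with scikit-learn's `GridSearchCV`, on synthetic data. Note the assumption: `GridSearchCV` ranks combinations by cross-validation score, not by OOB score, and the grid values here are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Every combination is tried: 3 values x 2 values = 6 candidate models.
param_grid = {"min_samples_leaf": [1, 5, 25], "max_features": [0.5, 1.0]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid,
    cv=3,  # each candidate is scored by 3-fold cross-validation
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```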

## Reducing over-fitting
It turns out that one of the easiest ways to avoid over-fitting is also one of the best ways to speed up analysis: subsampling. The basic idea is this: rather than limit the amount of data that our model can access, let's instead limit it to a *different* random subset per tree. That way, given enough trees, the model can still see all the data, but for each individual tree it'll be just as fast as if we had cut down our dataset as before.

In [None]:
set_rf_samples(20000)

Now when we run a random forest, it's not going to bootstrap the entire training set; each tree is just going to grab a subset of 20,000 rows. But every tree can draw from the whole dataset, so if we use enough estimators (enough trees), eventually the model is going to see everything.  
The trick here is that with a random forest using this technique, no dataset is too big. I don't care if it's got a hundred billion rows. You can create a bunch of trees, each one on a different random subset.  

* Is `set_rf_samples` not compatible with `oob_score`? So do we need to turn `oob_score = False` when we use `set_rf_samples()`? But why is it True in the video?
* To set it back, call `reset_rf_samples()`.
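
For comparison, recent scikit-learn versions offer a built-in `max_samples` parameter on the forest that limits how many rows each tree draws. The sketch below uses it in place of `set_rf_samples`, which is a swapped-in technique and not the course helper; data and sizes are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=2000)

# Each tree bootstraps only 200 of the 2,000 rows, a different draw per tree.
m = RandomForestRegressor(n_estimators=40, max_samples=200,
                          oob_score=True, random_state=0)
m.fit(X, y)
print(m.oob_score_)
```

With `max_samples`, scikit-learn computes the OOB score from the rows each tree did not draw. This sketch says nothing about how `set_rf_samples` itself interacts with `oob_score`.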

_In practice, when Jeremy is doing interactive machine learning with random forests to explore models and hyperparameters (things we'll learn in a future lesson, where we analyze feature importance, partial dependence and so forth), he generally uses subsets and reasonably small forests, because all the insights he gets are exactly the same as from the big ones but he can run it in three or four seconds rather than hours._

### Tree building parameters
* `min_samples_leaf`
Another way to reduce over-fitting is to grow our trees less deeply. We do this by specifying (with `min_samples_leaf`) that we require some minimum number of rows in every leaf node. This has two benefits:  
    * There are fewer decision rules for each leaf node; simpler models should generalize better  
    * The predictions are made by averaging more rows in the leaf node, resulting in less volatility.

In [None]:
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, n_jobs=-1, oob_score=True)

`min_samples_leaf`: The minimum number of samples required to be at a leaf node:  
* If int, then consider min_samples_leaf as the minimum number.  
* If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.
* Jeremy's advice: try [1, 3, 5, 10, 25], but if you have a very large dataset, you might need a `min_samples_leaf` in the hundreds or thousands.

* `max_features`: The number of features to consider when looking for the best split.  

We can also increase the amount of variation amongst the trees by using not only a sample of rows for each tree, but also a sample of columns for each split. We do this by specifying `max_features`, which is the proportion of features to randomly select from at each split.

In [None]:
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

**Remark**: There might be interactions between variables where the interaction is more important than any individual column. If every tree always splits on the same thing first, you're not going to get much variation among the trees. So in addition to taking a subset of rows, we take a different subset of columns at every single split point. This is slightly different from row sampling: with row sampling, each new tree is based on one random set of rows, while with column sampling every individual binary split chooses from a different subset of columns. In other words, rather than looking at every possible level of every possible column, we look at every possible level of a random subset of columns. `max_features=0.5` means randomly choose half of them. You can also use `"sqrt"` or `"log2"` (older scikit-learn versions also accepted `"auto"`). 

**Without a meaningful ordering of a categorical variable's levels, how can binary splits work?**   
Because a tree is infinitely flexible: even with a categorical variable, if particular categories have different levels of price, the tree can gradually zoom in on those groups by using multiple splits. You can help it by telling it the order of your categorical variable, but even if you don't, it's okay; it's just going to take a few more decisions to get there.  

# Lesson 3
**In what situations should I not try random forests and try other things as well?**  
For unstructured data, where all the different data points represent the same kind of thing (a waveform in sound or speech, the words in a piece of text, or the pixels in an image), you are almost certainly going to want to try deep learning. Outside of those two (random forests for structured data, deep learning for unstructured data), there's a particular type of model we're going to look at called a collaborative filtering model.

In [None]:
os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/bulldozer-raw')

The feather format is almost the same format that the data has in RAM, so it's ridiculously fast to read and write.

In [None]:
df, y, nas = proc_df(df_raw, 'SalePrice', na_dict=nas) # try to run this function, see what the output looks like

A bug was fixed here. There might be some categories in the test data that do not exist in the training data. Also, when filling missing values with the median or mean, we need to keep the training and test datasets consistent.
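
To make the consistency point concrete, here is a pandas-only sketch: the fill value is learned on the training set and reused unchanged on the test set. The frame contents and the name `na_fill` are hypothetical; this illustrates the idea and is not how `proc_df` is implemented.

```python
import pandas as pd

train = pd.DataFrame({"hours": [10.0, None, 30.0, 50.0]})
test = pd.DataFrame({"hours": [None, 100.0]})

# Learn the fill value on the training set only (median ignores missing values).
na_fill = {"hours": train["hours"].median()}

# Reuse exactly the same value for both sets, so they stay consistent.
train_filled = train.fillna(na_fill)
test_filled = test.fillna(na_fill)
print(na_fill, test_filled["hours"].tolist())
```

Filling the test set with its own median (100.0 here) would give the same missing value a different meaning in the two datasets.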

[Grocery-sales-forecasting](https://www.kaggle.com/c/favorita-grocery-sales-forecasting)  
There are many tables in this dataset

In [None]:
types = {'id': 'int64',
        'item_nbr': 'int32',
        'store_nbr': 'int8',
        'unit_sales': 'float32',
        'onpromotion': 'object'}

In [None]:
df_all = pd.read_csv(f'{PATH}train.csv', parse_dates = ['date'], dtype = types, infer_datetime_format = True)

**When do we use `low_memory=False`?**  
The `low_memory` option is not properly deprecated, but it should be, since it does not actually do anything differently.  
You get the `low_memory` warning because guessing dtypes for each column is very memory demanding: pandas tries to determine what dtype to set by analyzing the data in each column.   
Specifying dtypes should always be done, *like what we did above*.

In [None]:
df_all.describe(include='all') # summary statistics for the training data
df_test.describe(include='all') # look at the test data

**Remark:** Looking into the data, we find that the dates of the test data come right after the dates of the training data. So how should we subset the training data? Randomly? Surely not. We should take the most recent rows from the bottom. That doesn't mean completely throwing away the older data; later we might want to weight recent dates more highly.

In [None]:
df_all.unit_sales = np.log1p(np.clip(df_all.unit_sales, 0, None)) # log the dependent variable

* `numpy.clip()`  
Given an interval, values outside the interval are clipped to the interval edges. For example, if an interval of [0, 1] is specified, values smaller than 0 become 0, and values larger than 1 become 1.
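
A tiny numeric check of the clip-then-log step, with made-up sales values (negative values in the real data would represent returns):

```python
import numpy as np

unit_sales = np.array([-2.0, 0.0, 1.0, 9.0])

clipped = np.clip(unit_sales, 0, None)  # no upper bound; negatives become 0
logged = np.log1p(clipped)              # log(1 + x), so a sale of 0 maps to 0
restored = np.expm1(logged)             # inverse transform: exp(x) - 1

print(clipped, logged, restored)
```

`np.expm1` is what you would apply to the model's predictions to get back to the original sales scale.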

In [None]:
# train_cats(raw_train)
# apply_cats(raw_valid, raw_train)

I think this is a good way to check whether your model has improved, instead of submitting a CSV file to Kaggle each time.

In [None]:
import math

def rmse(x, y): return math.sqrt(((x - y) ** 2).mean())

def print_score(m):
    # x, y and val, y_val are the training and validation sets defined elsewhere
    res = [rmse(m.predict(x), y), rmse(m.predict(val), y_val),
           m.score(x, y), m.score(val, y_val)] # score() returns R^2 for regressors
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

In [None]:
set_rf_samples(1_000_000)

* The running time depends on `n_estimators` times the sample size passed to `set_rf_samples`

In [None]:
x = np.array(trn, dtype=np.float32)
m = RandomForestRegressor(n_estimators=20, min_samples_leaf=100, n_jobs=8)
# He used 60 cores? If you use every core, it might be slower? So use 8 instead.
%time m.fit(x, y)
# m.fit() converts X into an array internally. Rather than repeating that conversion
# for every model we try, he converts it to an array x once up front.

In [None]:
%prun m.fit(x, y)
# %prun runs the statement under a profiler, which tells you which lines of code behind
# the scenes took the most time. In this case I noticed that there was a code...

**Remarks:** we can't use `oob_score` here once we have called `set_rf_samples`, because if we do, it's going to use the other 124 million rows to calculate the OOB score.

Then try other parameters

In [None]:
m = RandomForestRegressor(n_estimators=20, min_samples_leaf=10, n_jobs=8) 
%time m.fit(x, y)
print_score(m)

In [None]:
m = RandomForestRegressor(n_estimators=20, min_samples_leaf=3, n_jobs=8) 
%time m.fit(x, y)
%time print_score(m)

**Remarks:** the results look reasonable because the errors decrease somewhat. When we say reasonable, though, it's not reasonable in the sense that it does not give a good result on the way to one (?). Why? When does a random forest not work well?   
For example, in the grocery data, the columns we have to predict with are the date, the store number, the item number and whether the item was on promotion, plus day of month, day of year, etc. If you think about it, most of the insight into how much of something you expect to sell tomorrow is likely to be wrapped up in details such as where the store is, what kind of things it tends to sell, and what category the item belongs to. The random forest has no ability to do anything other than create a bunch of binary splits on things like store number and item number. It doesn't know which item represents gasoline, and it doesn't know which stores are in the center of the city versus outside it. Its ability to know what's really going on is somewhat limited, so we're probably going to need to use the entire four years of data to get any useful insights. But once we use the whole four years, much of the data we're using is really old. Interestingly, there's a Kaggle kernel that points out that you could just take the last two weeks, take the average sales by date, by store number and by item number, and submit that. If you do, you come about 30th.

<font color=blue> It's actually very common to use supplementary external data (weather, festivals), as long as you post on the forum that you're using it and it's publicly available. Outside of Kaggle, you should always be asking what external data you could possibly leverage.</font>

If you don't have a good validation set, it's hard, if not impossible, to create a good model. In other words, if you're trying to predict next month's sales and you build a bunch of models, but you have no way of knowing whether those models are good at predicting sales a month ahead, then you have no way of knowing whether your model will actually be good when you put it in production. So you need a validation set that you know is reliable at telling you whether or not your model is likely to work well in production.