## Imports

In [None]:
# Execute before using this notebook if using google colab
kernel = str(get_ipython())
if 'google.colab' in kernel:    
    !wget https://raw.githubusercontent.com/fredzett/rmqa/master/utils.py -P local_modules -nc 
    !npx degit fredzett/rmqa/data data
    import sys
    sys.path.append('local_modules')

In [52]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from scipy.stats import mode

from subprocess import call
from IPython.display import Image

plt.rcParams['font.size'] = 12
plt.rcParams['figure.figsize'] = [9,7]
plt.rcParams['figure.dpi'] = 80
plt.rcParams["axes.spines.right"] = False
plt.rcParams["axes.spines.top"] = False

# Bagging: Decision trees

Let's continue with our example from last session and build a __decision tree__ with
- y = car is from USA (origin = 1) or not (origin = 2 or 3) 
- X = horsepower and weight

For this we will 

- create a new y variable called `US` that takes the value 1 if the car is from the USA and 0 otherwise

In [147]:
df = pd.read_csv("./data/Auto.csv")
df["US"] = np.where(df["origin"]==1,1,0)
y, X = dmatrices("US ~ horsepower + weight -1", df, return_type="dataframe")

Using the above data, we can easily build a decision tree with the `sklearn` module:

In [96]:
tree_clf = DecisionTreeClassifier(max_depth=3) # change max_depth to make tree smaller or larger
tree_clf.fit(X,y)

DecisionTreeClassifier(max_depth=3)

In [97]:
tree_clf.score(X,y)

0.8035714285714286

The model has an __accuracy score__ of $\approx 0.80$ on the training data. Note that we could easily improve this score by loosening some of the restrictions on building the tree (here: `max_depth`). However, doing so would lead to significant overfitting, because the score is computed on the same data the tree was fitted to.
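To make the overfitting point concrete, the following minimal sketch compares a depth-3 tree with an unrestricted tree on a train/test split. It uses synthetic data from `make_classification` as a stand-in for the car data (an assumption, so that the cell runs without `Auto.csv`); the names `X_demo`, `y_demo` and `results` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data with noisy labels (assumption: NOT the Auto data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

results = {}
for depth in [3, None]:  # None lets the tree grow until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    # Store (training accuracy, held-out accuracy) for each depth
    results[depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(depth, results[depth])
```

The unrestricted tree fits the training data perfectly, so its training score can only be at least as high as that of the depth-3 tree. The held-out score is the number to watch when judging whether the extra depth actually helps.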

Now let's extend this example by building a __bagging classifier__ using $B$ decision trees.

For each $b=1, 2, \ldots, B$  we need to:

- take a bootstrap sample

- train (i.e. fit) a decision tree model to the sampled data
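For a single $b$, the two steps above might look as follows. This is a sketch on toy NumPy data (an assumption; `X_toy`, `y_toy`, `idx` and `tree_b` are hypothetical names), where the bootstrap sample is drawn by sampling row indices with replacement.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy data standing in for X and y (assumption: 2 features, binary target)
X_toy = rng.normal(size=(100, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

n = len(X_toy)

# Step 1: bootstrap sample = n row indices drawn with replacement
idx = rng.integers(0, n, size=n)
X_boot, y_boot = X_toy[idx], y_toy[idx]

# Step 2: fit one decision tree to the resampled data
tree_b = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_boot, y_boot)

# Some rows appear several times, others not at all
n_unique = len(np.unique(idx))
print(n_unique, "distinct rows out of", n)
```

With pandas DataFrames such as `X` and `y` from above, the same indexing would be written `X.iloc[idx]` and `y.iloc[idx]`. For large $n$, a bootstrap sample contains about $1 - 1/e \approx 63\%$ of the distinct rows on average.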

Once we have done this, we have $B$ trained decision trees. We can then calculate a prediction for each of the $B$ trees and aggregate these predictions into one overall prediction.
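For classification, one common way to aggregate is a majority vote across the trees. The sketch below assumes 0/1 labels and uses a small hand-made prediction matrix (`preds` is hypothetical, not output from the trees above), with one row per tree and one column per observation.

```python
import numpy as np

# Hypothetical predictions of B = 5 trees for 4 observations (rows = trees)
preds = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
])

# Share of trees voting for class 1, per observation
vote_share = preds.mean(axis=0)

# Majority vote for 0/1 labels: predict 1 when more than half of the trees vote 1
y_hat = (vote_share > 0.5).astype(int)
print(y_hat)  # [1 0 1 0]
```

Choosing an odd $B$ avoids ties in the binary case. For labels that are not coded as 0/1, `scipy.stats.mode` (imported at the top of this notebook) is an alternative for taking the most frequent prediction per column.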

Let's build a bagging function for classification step by step. 

The function should

- take $X$, $y$ and $B$ as input 

- return $B$ fitted models and their scores as output



In [None]:
def bagging(X,y,B):
    
    ## We will build the function step by step
    
    return models, scores