# Lab 5

## Ensemble Learning

For this lab, we will implement bagging and boosting. Because boosted trees are simpler to implement for regression than for classification, we will stick to regression for this lab. Classification is a straightforward extension.

There are two main exercises and three questions (problems) in the lab.

In [51]:
from pathlib import Path
home = str(Path.home()) # all other paths are relative to this path. change to something else if this is not the case on your system

In [52]:
%load_ext autoreload
%autoreload 2

# make sure you run the cell above before running this
import Lab5_helper

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


For developing this lab, we can use our three-gene dataset. We will try to predict the value of ESR1 from the other two genes.

In [53]:
import pandas as pd
import numpy as np

df = pd.read_csv(
    f"{home}/csc-466-student/data/breast_cancer_three_gene.csv",index_col=0
)
df.head()

Unnamed: 0,ESR1,AURKA,ERBB2,Subtype
0,0.804501,0.264356,6.941677,LumA
1,0.163597,0.589052,6.551394,Basal
2,0.569347,0.189531,7.05653,LumA
3,0.847584,0.264849,7.028625,LumB
4,0.442474,0.52604,8.783604,LumB


We need to do some simple preprocessing before our models can work with this data.

In [54]:
X = df.drop('Subtype',axis=1)#.dropna()
X

Unnamed: 0,ESR1,AURKA,ERBB2
0,0.804501,0.264356,6.941677
1,0.163597,0.589052,6.551394
2,0.569347,0.189531,7.056530
3,0.847584,0.264849,7.028625
4,0.442474,0.526040,8.783604
...,...,...,...
2128,0.635455,0.143977,7.159443
2129,0.632849,0.258203,7.145417
2130,0.662344,0.243027,6.936228
2131,0.086119,0.479997,7.671082


In [55]:
t = X['ESR1']
X2 = pd.get_dummies(X.drop('ESR1',axis=1))
X2

Unnamed: 0,AURKA,ERBB2
0,0.264356,6.941677
1,0.589052,6.551394
2,0.189531,7.056530
3,0.264849,7.028625
4,0.526040,8.783604
...,...,...
2128,0.143977,7.159443
2129,0.258203,7.145417
2130,0.243027,6.936228
2131,0.479997,7.671082
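
Notice that `X2` has the same columns as the input: both remaining features are already numeric, so `pd.get_dummies` has nothing to encode. The sketch below uses a small toy frame with made-up values (not rows from the lab dataset) to show the difference between numeric-only and mixed input.

```python
import pandas as pd

# Toy frame with hypothetical values (not taken from the lab dataset).
toy = pd.DataFrame({
    "AURKA": [0.26, 0.59, 0.19],
    "ERBB2": [6.94, 6.55, 7.06],
    "Subtype": ["LumA", "Basal", "LumA"],
})

# Numeric-only input: nothing to encode, so the values come back unchanged.
numeric_only = toy.drop("Subtype", axis=1)
unchanged = pd.get_dummies(numeric_only).equals(numeric_only)
print("numeric columns unchanged:", unchanged)

# Mixed input: the string column becomes one indicator column per category.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

Keeping the `get_dummies` call in the pipeline is harmless here and means the same preprocessing would still work if a categorical feature were added later.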


#### Exercise 1
Implement bagging using regression trees as the (weak) individual learner. You should be able to reach a similar accuracy to my implementation.

In [56]:
import numpy as np
learner = Lab5_helper.get_learner(X2,t) # Here is how to get a single weak learner to help build your ensembles
y = learner.predict(X2) # As usual, here is how to get predictions
RMSE = np.sqrt(((y-t)**2).sum()/len(t)) # Here is a sample calculation of the root mean squared error
print('Our prediction for ESR1 is off by',RMSE)

Our prediction for ESR1 is off by 0.16748799642382256
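
If the idea behind bagging is unclear, here is a minimal sketch on synthetic data. It is not the `Lab5_helper` implementation: the function names `fit_bagged_trees` and `predict_bagged` are hypothetical, and it assumes scikit-learn's `DecisionTreeRegressor` as the individual learner. Each tree is fit on a bootstrap resample (rows drawn with replacement), and the ensemble prediction is the average of the individual predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_bagged_trees(X, t, ntrees=25, seed=0):
    """Fit `ntrees` regression trees, each on a bootstrap resample of (X, t)."""
    rng = np.random.default_rng(seed)
    X, t = np.asarray(X), np.asarray(t)
    n = len(t)
    trees = []
    for _ in range(ntrees):
        idx = rng.integers(0, n, size=n)  # n row indices drawn with replacement
        tree = DecisionTreeRegressor(random_state=0)
        trees.append(tree.fit(X[idx], t[idx]))
    return trees


def predict_bagged(trees, X):
    """Average the predictions of all trees in the ensemble."""
    X = np.asarray(X)
    return np.mean([tree.predict(X) for tree in trees], axis=0)


def rmse(y, t):
    """Root mean squared error between predictions y and targets t."""
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(t)) ** 2)))


# Synthetic regression problem (an assumption for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(400, 2))
t_demo = np.sin(4 * X_demo[:, 0]) + X_demo[:, 1] + rng.normal(0, 0.3, size=400)
X_tr, X_te = X_demo[:300], X_demo[300:]
t_tr, t_te = t_demo[:300], t_demo[300:]

single = DecisionTreeRegressor(random_state=0).fit(X_tr, t_tr)
bagged = fit_bagged_trees(X_tr, t_tr, ntrees=50)
single_rmse = rmse(single.predict(X_te), t_te)
bagged_rmse = rmse(predict_bagged(bagged, X_te), t_te)
print("single tree RMSE:", single_rmse)
print("bagged RMSE:     ", bagged_rmse)
```

On noisy data like this, averaging many trees typically lowers the test RMSE relative to one fully grown tree, because the individual trees' errors partly cancel.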


In [57]:
from sklearn.model_selection import train_test_split

ntrials = 50
# when you are debugging you will want to lower this number
RMSEs = []
for trial in range(ntrials):
    X_train, X_test, t_train, t_test = train_test_split(X2, t, test_size=0.25,random_state=trial)
    trees = Lab5_helper.make_trees(X_train,t_train,ntrees=100)
    y = Lab5_helper.make_prediction(trees,X_test)
    RMSEs.append(np.sqrt(((y-t_test)**2).sum()/len(t_test)))
np.median(RMSEs)

0.22817351105917555

**Problem 1:** How do we know this is behaving as expected when there is so much randomness? This is very similar to asking how we can compare two algorithms. Let's examine this by changing the number of trees in bagging between 25 and 100. Then we will see if we can detect a difference.

**Your answer here: https://canvas.calpoly.edu/courses/81417/assignments/545575**

In [58]:
results = pd.DataFrame({'RMSE':RMSEs,'Method':'Bagging (ntrees=100)'})
RMSEs = []
for trial in range(ntrials):
    X_train, X_test, t_train, t_test = train_test_split(X2, t, test_size=0.25,random_state=trial)
    trees = Lab5_helper.make_trees(X_train,t_train,ntrees=25)
    y = Lab5_helper.make_prediction(trees,X_test)
    RMSEs.append(np.sqrt(((y-t_test)**2).sum()/len(t_test)))
results2 = pd.DataFrame({'RMSE':RMSEs,'Method':'Bagging (ntrees=25)'})
results = pd.concat([results, results2])  # DataFrame.append was removed in pandas 2.0


In [59]:
results.groupby('Method')['RMSE'].median()

Method
Bagging (ntrees=100)    0.228174
Bagging (ntrees=25)     0.229689
Name: RMSE, dtype: float64

In [60]:
results.groupby('Method')['RMSE'].mean()

Method
Bagging (ntrees=100)    0.228465
Bagging (ntrees=25)     0.230261
Name: RMSE, dtype: float64

This looks promising! But what do the statistics tell us?

In [61]:
import altair as alt

alt.Chart(results).mark_boxplot().encode(
    alt.Y("RMSE:Q"),
    x='Method',
).properties(width=300)

For a moment, let's not worry that the underlying distributions do not look normal. We know how to compare the means of two distributions: the t-test. Let's see what that says:

In [62]:
pivot_results = results.pivot(columns='Method')
pivot_results.head()

Unnamed: 0_level_0,RMSE,RMSE
Method,Bagging (ntrees=100),Bagging (ntrees=25)
0,0.217787,0.217539
1,0.232209,0.230607
2,0.228634,0.228965
3,0.220319,0.222392
4,0.223205,0.22558


In [63]:
pivot_results.columns

MultiIndex([('RMSE', 'Bagging (ntrees=100)'),
            ('RMSE',  'Bagging (ntrees=25)')],
           names=[None, 'Method'])

In [64]:
from scipy import stats
stats.ttest_ind(pivot_results[('RMSE', 'Bagging (ntrees=100)')],pivot_results[('RMSE', 'Bagging (ntrees=25)')],equal_var = True)


Ttest_indResult(statistic=-1.4678094799525085, pvalue=0.14535822079233937)

So according to this test (p ≈ 0.15, above the usual 0.05 threshold), we do not see any significant difference between the results for 100 and 25 trees.
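
One thing worth noticing: both configurations were evaluated on the same 50 train/test splits (the same `random_state` per trial), so the two RMSE lists are matched trial by trial. A paired t-test (`stats.ttest_rel`) uses that matching, while `stats.ttest_ind` ignores it. The sketch below uses synthetic RMSEs, not the lab results, and whether a paired test is the intended analysis here is my assumption. It only illustrates how the two tests can disagree when most of the variation comes from the split itself.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ntrials = 50

# Hypothetical per-split RMSEs: every split has its own difficulty,
# and that difficulty is shared by both methods.
split_effect = rng.normal(0.228, 0.006, size=ntrials)
rmse_a = split_effect + rng.normal(0, 0.001, size=ntrials)
rmse_b = split_effect + 0.002 + rng.normal(0, 0.001, size=ntrials)

# Independent test: treats the two lists as unrelated samples.
independent = stats.ttest_ind(rmse_a, rmse_b, equal_var=True)
# Paired test: works on the per-split differences rmse_a - rmse_b.
paired = stats.ttest_rel(rmse_a, rmse_b)

print("independent p-value:", independent.pvalue)
print("paired p-value:     ", paired.pvalue)
```

The paired test removes the split-to-split variation from the comparison, so a small but consistent difference is easier to detect.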

**Problem 2.** What if you only used 2 trees? Is there a significant difference between 2 and 25 trees?

**Your answer here: https://canvas.calpoly.edu/courses/81417/assignments/545576**

In [65]:
RMSEs = []
for trial in range(ntrials):
    X_train, X_test, t_train, t_test = train_test_split(X2, t, test_size=0.25,random_state=trial)
    trees = Lab5_helper.make_trees(X_train,t_train,ntrees=2)
    y = Lab5_helper.make_prediction(trees,X_test)
    RMSEs.append(np.sqrt(((y-t_test)**2).sum()/len(t_test)))
results2 = pd.DataFrame({'RMSE':RMSEs,'Method':'Bagging (ntrees=2)'})
results = pd.concat([results, results2])  # DataFrame.append was removed in pandas 2.0


In [66]:
results.groupby('Method')['RMSE'].mean()

Method
Bagging (ntrees=100)    0.228465
Bagging (ntrees=2)      0.252451
Bagging (ntrees=25)     0.230261
Name: RMSE, dtype: float64

In [67]:
pivot_results = results.pivot(columns='Method')
pivot_results.head()

Unnamed: 0_level_0,RMSE,RMSE,RMSE
Method,Bagging (ntrees=100),Bagging (ntrees=2),Bagging (ntrees=25)
0,0.217787,0.24776,0.217539
1,0.232209,0.255178,0.230607
2,0.228634,0.247305,0.228965
3,0.220319,0.24856,0.222392
4,0.223205,0.238804,0.22558


In [68]:
stats.ttest_ind(pivot_results[('RMSE', 'Bagging (ntrees=100)')],pivot_results[('RMSE', 'Bagging (ntrees=2)')],equal_var = True)


Ttest_indResult(statistic=-17.10360785165843, pvalue=3.5102981433418424e-31)

Finally, we can report a p-value < 0.05! Even if we were to discuss multiple test correction at this time, this would still be a significant result. So bagging is helping us!
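
To make the multiple-testing remark concrete, here is a small worked check using a Bonferroni correction, which divides the significance level by the number of tests. The two p-values are the ones reported above. The number of tests (3, for all pairwise comparisons among the three bagging settings) is an assumption for illustration.

```python
# p-values reported by the two t-tests in this lab
p_values = {
    "100 vs 25 trees": 0.14535822079233937,
    "100 vs 2 trees": 3.5102981433418424e-31,
}

alpha = 0.05
n_tests = 3  # assumption: all pairwise comparisons among three settings

# Bonferroni: each individual test must clear alpha / n_tests.
bonferroni_alpha = alpha / n_tests
significant = {name: p < bonferroni_alpha for name, p in p_values.items()}

print("corrected threshold:", bonferroni_alpha)
for name, flag in significant.items():
    print(name, "significant after correction:", flag)
```

Even with a much larger number of tests, a p-value on the order of 1e-31 would remain far below the corrected threshold.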

#### Exercise 2
Implement boosting using regression trees as the (weak) individual learner. You should be able to reach a similar accuracy to my implementation.
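
As a rough guide, here is a minimal sketch of boosting for regression on synthetic data. It is not the `Lab5_helper` implementation: the function names, the `learning_rate` and `max_depth` values, and the use of scikit-learn's `DecisionTreeRegressor` are all assumptions. Each new tree is fit to the residuals of the current ensemble, and the validation RMSE recorded after each tree is used to decide how many trees to keep.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def rmse(y, t):
    """Root mean squared error between predictions y and targets t."""
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(t)) ** 2)))


def fit_boosted_trees(X_train, t_train, X_val, t_val, max_ntrees=100,
                      learning_rate=0.1, max_depth=2):
    """Fit trees one at a time; each tree targets the current residuals."""
    X_train, t_train = np.asarray(X_train), np.asarray(t_train)
    X_val, t_val = np.asarray(X_val), np.asarray(t_val)
    base = float(t_train.mean())            # constant starting prediction
    y_train = np.full(len(t_train), base)
    y_val = np.full(len(t_val), base)
    trees, train_rmses, val_rmses = [], [], []
    for _ in range(max_ntrees):
        residual = t_train - y_train        # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X_train, residual)
        y_train += learning_rate * tree.predict(X_train)
        y_val += learning_rate * tree.predict(X_val)
        trees.append(tree)
        train_rmses.append(rmse(y_train, t_train))
        val_rmses.append(rmse(y_val, t_val))
    return base, trees, train_rmses, val_rmses


def keep_best_prefix(trees, val_rmses):
    """Keep trees up to and including the lowest validation RMSE."""
    return trees[: int(np.argmin(val_rmses)) + 1]


def predict_boosted(base, trees, X, learning_rate=0.1):
    """Start from the base prediction and add each tree's scaled correction."""
    X = np.asarray(X)
    return base + learning_rate * sum(tree.predict(X) for tree in trees)


# Synthetic regression problem (an assumption for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(500, 2))
t_demo = np.sin(4 * X_demo[:, 0]) + X_demo[:, 1] + rng.normal(0, 0.3, size=500)
X_tr, X_va, X_te = X_demo[:300], X_demo[300:400], X_demo[400:]
t_tr, t_va, t_te = t_demo[:300], t_demo[300:400], t_demo[400:]

base, trees, train_rmses, val_rmses = fit_boosted_trees(X_tr, t_tr, X_va, t_va)
kept = keep_best_prefix(trees, val_rmses)
test_rmse = rmse(predict_boosted(base, kept, X_te), t_te)
baseline_rmse = rmse(np.full(len(t_te), base), t_te)
print("trees kept:", len(kept), "of", len(trees))
print("test RMSE:", test_rmse, "constant baseline:", baseline_rmse)
```

The training RMSE keeps falling as trees are added, while the validation RMSE eventually flattens or rises. Cutting the ensemble at the validation minimum is one simple way to avoid overfitting.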

In [69]:
RMSEs = []
for trial in range(ntrials):
    X_train, X_test, t_train, t_test = train_test_split(X2, t, test_size=0.25,random_state=trial)
    X_train2, X_val, t_train2, t_val = train_test_split(X_train, t_train, test_size=0.25,random_state=trial)
    trees,train_RMSEs,val_RMSEs = Lab5_helper.make_trees_boost(X_train2, X_val, t_train2, t_val, max_ntrees=100)
    trees = Lab5_helper.cut_trees(trees,val_RMSEs)
    y = Lab5_helper.make_prediction_boost(trees,X_test)
    RMSEs.append(np.sqrt(((y-t_test)**2).sum()/len(t_test)))
results2 = pd.DataFrame({'RMSE':RMSEs,'Method':'Boosting (max_ntrees=100)'})
results = pd.concat([results, results2])  # DataFrame.append was removed in pandas 2.0


In [70]:
source = pd.DataFrame({"train_RMSE":train_RMSEs,"val_RMSE":val_RMSEs, "ntrees":np.arange(len(val_RMSEs))}).melt(id_vars=['ntrees'])

alt.Chart(source).mark_line().encode(
    y = alt.Y("value:Q", scale=alt.Scale(domain=[0.1, 0.3])),
    x = 'ntrees',
    color='variable'
).properties(width=500)

In [71]:
results.groupby('Method')['RMSE'].median().sort_values()

Method
Boosting (max_ntrees=100)    0.226319
Bagging (ntrees=100)         0.228174
Bagging (ntrees=25)          0.229689
Bagging (ntrees=2)           0.251441
Name: RMSE, dtype: float64

In [72]:
results.groupby('Method')['RMSE'].mean().sort_values()

Method
Boosting (max_ntrees=100)    0.227613
Bagging (ntrees=100)         0.228465
Bagging (ntrees=25)          0.230261
Bagging (ntrees=2)           0.252451
Name: RMSE, dtype: float64

In [73]:
np.mean(RMSEs)

0.22761342508021085

In [74]:
import altair as alt

alt.Chart(results).mark_boxplot().encode(
    alt.Y("RMSE:Q", scale=alt.Scale(domain=[0.2, 0.3])),
    x='Method',
).properties(width=300)

**Problem 3.** How would you compare (using t-test) whether the boosting algorithm is better than bagging (ntrees=100)?

In [75]:
# Your solution here: https://canvas.calpoly.edu/courses/81417/assignments/545577

  stats.stats.ttest_ind(pivot_results[('RMSE', 'Bagging (ntrees=100)')],pivot_results[('RMSE', 'Boosting (max_ntrees=100)')],equal_var = True)


Ttest_indResult(statistic=0.6510203654798702, pvalue=0.5165569376678337)

In [76]:
# Good job!
# Don't forget to push with ./submit.sh

#### Having trouble with the test cases and the autograder?

You can always load up the answers for the autograder. The autograder runs your code and compares your answer to the expected answer. I manually review your code, so there is no need to hide this from you.

```python
import joblib
answers = joblib.load(f"{home}/csc-466-student/tests/answers_Lab5.joblib")
answers.keys()
```