### Gradient Boosting & Scikit-Learn Intro

This lab is a first introduction to the scikit-learn API and to gradient boosting, one of the most widely used techniques for predictive modeling.

During this lab you'll build a model, examine its working parts, and try to improve your results!

The great thing about scikit-learn is that its API is almost identical from one algorithm to another: nearly every estimator exposes `fit()`, `predict()`, and `score()`, so once you get the hang of the pattern, switching methods is fairly seamless.
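To make the "same API everywhere" point concrete, here is a minimal sketch that runs two unrelated regressors through identical calls. The synthetic data from `make_regression` and the names `X_demo`, `y_demo`, and `demo_scores` are assumptions for illustration only; the lab itself uses `housing.csv`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption; not the housing dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

demo_scores = {}
for model in (LinearRegression(), GradientBoostingRegressor(random_state=42)):
    model.fit(X_demo, y_demo)        # same call for every estimator
    preds = model.predict(X_demo)    # same call for every estimator
    # score() returns R^2 for regressors
    demo_scores[type(model).__name__] = model.score(X_demo, y_demo)

print(demo_scores)
```

Only the constructor changes between algorithms; the fitting, predicting, and scoring lines stay the same.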

**Step 1:** Load in the `housing.csv` file

In [48]:
import pandas as pd

In [49]:
# adjust this path to wherever housing.csv lives on your machine
df = pd.read_csv('/Users/cameronlefevre/Test-Repo/ClassMaterial/Unit3/data/housing.csv')

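**Step 2:** Randomly shuffle your dataset using the `sample` method, with a `random_state` of 42. (pandas DataFrames have no `shuffle` method; sampling every row without replacement has the same effect.)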

In [51]:
df = df.sample(df.shape[0], random_state=42)
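If it is not obvious why `sample` counts as a shuffle, the small sketch below may help. The `toy` DataFrame is a made-up example, not part of the lab data. Sampling as many rows as the frame contains (or equivalently `frac=1`) without replacement returns every row exactly once, in a random order that `random_state` makes repeatable.

```python
import pandas as pd

# Hypothetical five-row frame, used only to illustrate the shuffle
toy = pd.DataFrame({"a": range(5), "b": list("vwxyz")})

by_count = toy.sample(toy.shape[0], random_state=42)  # the form used in this lab
by_frac = toy.sample(frac=1, random_state=42)         # a common equivalent spelling
repeat = toy.sample(frac=1, random_state=42)          # same seed, same order

# The original index labels travel with the rows, so you can see the new order
print(by_count.index.tolist())
```

Because the original index is kept, positional slicing such as `df[:cutoff]` still works on the shuffled frame; call `reset_index(drop=True)` if you would rather have a fresh 0..n-1 index.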

**Step 3:** Declare your `X` & `y` variables. We'll be predicting `PRICE`.

In [52]:
X = df.drop('PRICE', axis=1) #everything but the PRICE column
y = df['PRICE'] # just the PRICE column

**Step 4:** Create a training and test set.

The training set will be the first 80% of the shuffled dataset, and the test set will be the last 20%. Apply the same split to both `X` & `y`.

Subsequent questions will refer to the variables you created in this step as `X_train`, `X_test`, `y_train`, `y_test`.

In [53]:
# your answer here
cutoff = int(df.shape[0]*.8)

X_train, X_test = X[:cutoff].copy(), X[cutoff:].copy()
y_train, y_test  = y[:cutoff].copy(), y[cutoff:].copy()
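For comparison, scikit-learn ships a helper, `train_test_split`, that performs this kind of split. The sketch below uses a made-up ten-row dataset (`toy_X`, `toy_y` are assumptions) and passes `shuffle=False`, since the lab data has already been shuffled, to show that it matches the manual slicing.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row dataset standing in for the shuffled housing data
toy_X = pd.DataFrame({"rooms": [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
toy_y = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], name="PRICE")

# Manual split, as in this step
cutoff = int(toy_X.shape[0] * .8)
manual_train, manual_test = toy_X[:cutoff], toy_X[cutoff:]

# Helper-based split; shuffle=False keeps the existing row order
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, shuffle=False
)

print(X_tr.shape, X_te.shape)
```

By default `train_test_split` shuffles the rows itself, so in practice it can replace both the shuffle step and the slicing step.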

**Step 5:** Import `GradientBoostingRegressor` and initialize it.

In [55]:
from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor()

**Step 6:** Call the `fit()` method on `X_train` & `y_train`

In [56]:
gbm.fit(X_train, y_train)

GradientBoostingRegressor()

**Step 7:** Make a column that represents the predictions your model made for each sample in your original dataset.

In [57]:
df['Prediction'] = gbm.predict(X)

**Step 8:** Check the score of your model using the `score()` method, which for regressors returns the coefficient of determination (R²). Compare your score on both the training set and the test set.

In [58]:
# your answer here
print(f"Train score: {gbm.score(X_train, y_train)}, Test Score: {gbm.score(X_test, y_test)}")

Train score: 0.9790704008575339, Test Score: 0.8948584148190182
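As a sanity check on what `score()` measures, the sketch below compares it with `sklearn.metrics.r2_score` on synthetic data. The data and the names `X_tr`, `X_te`, `y_tr`, `y_te` are assumptions for illustration, not the housing split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in data with a simple 80/20 positional split
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=0
)
X_tr, X_te = X_demo[:240], X_demo[240:]
y_tr, y_te = y_demo[:240], y_demo[240:]

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

score_via_method = model.score(X_te, y_te)               # built-in scoring
score_via_metric = r2_score(y_te, model.predict(X_te))   # explicit R^2

train_score = model.score(X_tr, y_tr)
print(f"train R^2: {train_score:.3f}, test R^2: {score_via_method:.3f}")
```

A training score noticeably above the test score, as in the lab output above, suggests the model fits the training rows more closely than it generalizes to unseen ones.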


**Step 9:** Take a look at the values returned from the `feature_importances_` attribute

In [59]:
gbm.feature_importances_

array([2.32570653e-02, 5.22216082e-05, 1.76244557e-03, 5.42716264e-04,
       2.74667859e-02, 3.54431976e-01, 6.10631402e-03, 6.66084781e-02,
       1.16943296e-03, 1.72096601e-02, 2.48807624e-02, 1.50372093e-02,
       4.61474933e-01])

**Step 10:** To make a bit more sense out of these, let's put these values into a more readable format.  

Try making a 2 column dataframe using `X.columns` and the values from `feature_importances_` (they should correspond to one another).

In [63]:
# your answer here
importances = pd.DataFrame({
    'Column': X.columns,
    'Importance': gbm.feature_importances_
}).sort_values(by='Importance', ascending=False)

importances

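|    | Column  | Importance |
|---:|:--------|-----------:|
| 12 | LSTAT   | 0.461475   |
|  5 | RM      | 0.354432   |
|  7 | DIS     | 0.066608   |
|  4 | NOX     | 0.027467   |
| 10 | PTRATIO | 0.024881   |
|  0 | CRIM    | 0.023257   |
|  9 | TAX     | 0.017210   |
| 11 | B       | 0.015037   |
|  6 | AGE     | 0.006106   |
|  2 | INDUS   | 0.001762   |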


In [69]:
importances.Importance.sum()

1.0

**Step 11:** Can you improve your results? For now, experiment with a few different options and see how the scores change. These could be any of the following:

 - changing the number of boosting rounds used via `n_estimators`
 - changing the learning rate via `learning_rate`
 - changing the depth of each tree via `max_depth`
 - removing columns that have lower feature importance, or very low correlation with the target variable
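The last option, dropping low-importance columns, can be sketched as follows. Everything here is an assumption for illustration: the synthetic data, the column names `f0`..`f9`, and especially the `threshold` of 0.01, which is an arbitrary cutoff rather than a recommended value.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in: 10 features, only 3 of which carry signal
X_arr, y_demo = make_regression(
    n_samples=400, n_features=10, n_informative=3, noise=5.0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
X_tr, X_te = X_demo[:320], X_demo[320:]
y_tr, y_te = y_demo[:320], y_demo[320:]

full = GradientBoostingRegressor(random_state=42).fit(X_tr, y_tr)
importances = pd.Series(full.feature_importances_, index=X_tr.columns)

threshold = 0.01  # hypothetical cutoff; tune it for your own data
keep = importances[importances >= threshold].index.tolist()

# Refit using only the columns that cleared the cutoff
reduced = GradientBoostingRegressor(random_state=42).fit(X_tr[keep], y_tr)

print("kept columns:", keep)
print("full test R^2:   ", full.score(X_te, y_te))
print("reduced test R^2:", reduced.score(X_te[keep], y_te))
```

Whether the reduced model scores better depends on the data; tree-based models already tend to ignore uninformative columns, so the gain is often small.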

In [67]:
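n_estimators  = [100, 250, 500]
learning_rate = [.05, .1]
tree_depth    = [3, 4, 5, 6]
cv_scores     = []  # note: these are hold-out test scores, not cross-validation scores

# try every combination of the three settings
for estimators in n_estimators:
    for rate in learning_rate:
        for depth in tree_depth:
            # print(f"Fitting model for: n_estimators={estimators}, learning_rate={rate}, max_depth={depth}")
            gbm.set_params(n_estimators=estimators, learning_rate=rate, max_depth=depth)
            gbm.fit(X_train, y_train)
            # record (test R^2, n_estimators, learning_rate, max_depth)
            cv_scores.append((gbm.score(X_test, y_test), estimators, rate, depth))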

In [68]:
cv_scores

[(0.8786266782877534, 100, 0.05, 3),
 (0.8895933869375947, 100, 0.05, 4),
 (0.904354768986493, 100, 0.05, 5),
 (0.9047742644081224, 100, 0.05, 6),
 (0.8943649457006226, 100, 0.1, 3),
 (0.9011665051447878, 100, 0.1, 4),
 (0.9038117240279105, 100, 0.1, 5),
 (0.9058348999619341, 100, 0.1, 6),
 (0.8953784292938393, 250, 0.05, 3),
 (0.9028309253811327, 250, 0.05, 4),
 (0.9083021661313042, 250, 0.05, 5),
 (0.9102231546133156, 250, 0.05, 6),
 (0.8996792225628538, 250, 0.1, 3),
 (0.9091086823941428, 250, 0.1, 4),
 (0.8902087462990661, 250, 0.1, 5),
 (0.9053818869054743, 250, 0.1, 6),
 (0.9065528203580262, 500, 0.05, 3),
 (0.8944539040959828, 500, 0.05, 4),
 (0.9034971562085208, 500, 0.05, 5),
 (0.9152351612701669, 500, 0.05, 6),
 (0.9041827844272571, 500, 0.1, 3),
 (0.9063886572563367, 500, 0.1, 4),
 (0.9037174096269587, 500, 0.1, 5),
 (0.9041013929844511, 500, 0.1, 6)]
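One caveat with the loop above: choosing settings by their test-set score means the test set is no longer a clean estimate of performance on unseen data. A common alternative is to cross-validate on the training set only, for example with `GridSearchCV`. The sketch below uses synthetic data and a deliberately small grid so it runs quickly; the data, the grid values, and `cv=3` are all assumptions, not settings taken from this lab.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with a simple 80/20 positional split
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, noise=10.0, random_state=42
)
X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

# Small illustrative grid (8 combinations)
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 4],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=3,          # 3-fold cross-validation on the training set only
    scoring="r2",
)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("mean CV R^2 of best params:", search.best_score_)
# The test set is touched once, after the settings have been chosen
print("test R^2:", search.score(X_te, y_te))
```

By default `GridSearchCV` refits the best combination on the full training set, so `search` can be used for `predict()` and `score()` like any other estimator.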