### Gradient Boosted Tree Regression Lab

We'll start by setting up our data:

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

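df = pd.read_csv('data/diamonds.csv')
# Drop the first column (presumably a row index saved in the CSV)
df2 = df.drop(df.columns[0], axis=1)
# One-hot encode the categorical columns; numeric columns are left unchanged
df3 = pd.get_dummies(df2)

# The column at position 3 is the regression target; everything else is a feature
y = df3.iloc[:,3]
X = df3.drop(df3.columns[3], axis=1)

# Hold out 30% of the rows for testing (no random_state is set, so the split varies between runs)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)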

Next, we'll run our linear model as a baseline:

In [2]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Fit an ordinary least-squares model on the training split
lr = linear_model.LinearRegression()
linear = lr.fit(X_train, y_train)

# Evaluate on the held-out test split
y_pred = linear.predict(X_test)
print("RMSE %f" % np.sqrt(mean_squared_error(y_test, y_pred)))

RMSE 1127.461351
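
For reference, the number printed above is the root mean squared error on the test split. Writing $n$ for the number of test rows, $y_i$ for the true values and $\hat{y}_i$ for the predictions (symbols introduced here only for illustration), it is:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Because of the square root, the result is in the same units as the target column.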


Now let's try the gradient-boosting ensemble on this data. 

Scikit-learn has an implementation at http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html

(It's used less widely in industry than XGBoost because it's not nearly as fast, and it doesn't scale out the way Spark+XGBoost, H2O, SparkML GBT, etc. can. But it's fine for learning and exploring, as we're doing here.)

In [3]:
from sklearn import ensemble

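# Fit a gradient-boosted tree ensemble with the default hyperparameters
gbtr = ensemble.GradientBoostingRegressor()
gbt_model = gbtr.fit(X_train, y_train)

# Evaluate on the same held-out test split as the linear baseline
y_pred = gbt_model.predict(X_test)
print("RMSE %f" % np.sqrt(mean_squared_error(y_test, y_pred)))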

RMSE 744.919307
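
To get a feel for what the ensemble is doing, here is a minimal sketch of the boosting idea in plain Python. It is not scikit-learn's implementation. It assumes squared-error loss, a single input feature, and one-split trees ("stumps"), and it uses made-up toy data. All function names (`fit_stump`, `boost`, `predict`, `rmse`) are hypothetical helpers written for this illustration.

```python
# Simplified gradient boosting for squared error, in pure Python.
# Illustrative only: every name here is hypothetical, not scikit-learn's API.

def fit_stump(xs, targets):
    """Fit a one-split regression tree (a "stump") to 1-D inputs."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step in the direction the new stump suggests.
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def rmse(ys, preds):
    return (sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)) ** 0.5


# Toy data (made up for illustration): a curve that a constant prediction cannot follow.
xs = [i / 10 for i in range(60)]
ys = [x ** 2 for x in xs]

model = boost(xs, ys)
baseline_rmse = rmse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_rmse = rmse(ys, [predict(model, x) for x in xs])
print("baseline RMSE %f" % baseline_rmse)
print("boosted RMSE %f" % boosted_rmse)
```

Here `n_rounds` and `learning_rate` play roughly the role of the `n_estimators` and `learning_rate` parameters of `GradientBoostingRegressor`, which builds deeper trees over many features rather than single-feature stumps. Note that this sketch measures error on its own training data, so it only shows the fitting mechanism, not generalization.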
