<a href="https://colab.research.google.com/github/ccwilliamsut/machine_learning/blob/master/MLAB_05_Gradient_Boosting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a Gradient Boosting Model

### Basic Decision Trees
In a previous example (Build a Decision Tree Model), we learned how to build a single decision tree. One of the primary **problems** associated with decision trees, however, is that they are **prone to overfitting / high variance** (see this [link](https://towardsdatascience.com/decision-trees-and-random-forests-df0c3123f991) for further information).

The basic **reason for this** has to do with the **depth of the tree**: increasingly specific splits make the model rely on patterns in the training data that do not reflect the real world. The primary means of combating this is to reduce the depth (i.e. ```max_depth```), but this can, in turn, introduce bias, producing a model that is too generalized to perform well on unseen data.
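
To make the depth trade-off concrete, here is a minimal sketch on synthetic data (a noisy sine curve, which is an assumption for illustration and not the housing dataset). It compares the training and test R^2 scores of a very shallow tree, a moderately deep tree, and a tree with no depth limit.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed for illustration): a noisy sine curve
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(400, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

depth_scores = {}
for depth in (1, 4, None):  # None lets the tree grow until it cannot split further
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training R^2, test R^2) for this depth
    depth_scores[depth] = (tree.score(X_train, y_train),
                           tree.score(X_test, y_test))
    print('max_depth =', depth, '-> train/test R^2:', depth_scores[depth])
```

A large gap between the training and test scores suggests overfitting (high variance), while low scores on both suggest the tree is too shallow (high bias).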

### Ensemble Methods
One means of combating the issues facing decision trees is to employ methods of **building multiple trees** that can utilize information from each other to derive better models.

### Random Forests vs. Gradient Boosting
Both methods create multiple trees, but they take different approaches to building and combining them.
- **Random forests** use random samples of the data to create each new tree independent of any other tree. The results are then averaged to produce a final model.
- **Gradient boosting** uses errors from previous trees to inform how each new tree is created in order to produce a "best" model.
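
The two strategies can be sketched by hand on synthetic data. This is a simplified illustration under stated assumptions, not how scikit-learn implements either method: the first loop is plain bagging (bootstrap samples, averaged trees, without the per-split feature sampling a real random forest adds), and the second is squared-error boosting, where each new tree is fit to the residuals of the ensemble so far.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)

n_trees = 50
learning_rate = 0.1

# --- Bagging (random-forest style): independent trees, results averaged ---
forest_preds = []
for seed in range(n_trees):
    idx = rng.randint(0, len(X), size=len(X))  # bootstrap sample of the rows
    tree = DecisionTreeRegressor(max_depth=2, random_state=seed)
    tree.fit(X[idx], y[idx])
    forest_preds.append(tree.predict(X))
forest_prediction = np.mean(forest_preds, axis=0)

# --- Boosting: each tree is fit to the errors left by the previous trees ---
prediction = np.full_like(y, y.mean())  # stage 0: predict the mean
mse_per_stage = [np.mean((y - prediction) ** 2)]
for _ in range(n_trees):
    residuals = y - prediction  # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # shrink each tree's contribution
    mse_per_stage.append(np.mean((y - prediction) ** 2))

print('Bagging training MSE: ', np.mean((y - forest_prediction) ** 2))
print('Boosting training MSE:', mse_per_stage[0], '->', mse_per_stage[-1])
```

In the bagging loop no tree knows about any other, whereas in the boosting loop the order matters because every tree depends on the residuals left by its predecessors.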

>Websites Referenced:
- [Towards Data Science: Decision Trees and Random Forests](https://towardsdatascience.com/decision-trees-and-random-forests-df0c3123f991)
- [Medium: Gradient Boosting vs Random Forest](https://medium.com/@aravanshad/gradient-boosting-versus-random-forest-cfa3fa8f0d80)

## A. Setup Environment

In [0]:
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SETUP ENVIRONMENT <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# Import libraries
import pandas as pd
import seaborn as sns; sns.set()
import numpy as np
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
from IPython.display import display
from sklearn import preprocessing
from sklearn.model_selection import train_test_split 
from sklearn import ensemble
from scipy import stats
from sklearn import linear_model
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import learning_curve, GridSearchCV
from sklearn.model_selection import cross_val_predict
import joblib  # Needed to save the trained model at the end of the notebook
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

url = 'https://raw.githubusercontent.com/ccwilliamsut/machine_learning/master/absolute_beginners/data_files/modified/CaliforniaHousingDataModified.csv'

df = pd.read_csv(url)
#df = pd.read_csv('~/Downloads/CaliforniaHousingDataModified.csv')


# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SETUP DATA <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

# --------------------------------------------- RENAME FEATURES ---------------------------------------------
df.rename(columns = {'lattitude':'latitude', 't_rooms':'total_rooms'}, inplace=True)


# --------------------------------------------- ONE-HOT ENCODING ---------------------------------------------
df['ocean_proximity'] = pd.Categorical(df['ocean_proximity'])
df_dummies = pd.get_dummies(df['ocean_proximity'], drop_first = True)
df.drop(['ocean_proximity'], axis = 1, inplace = True)
df = pd.concat([df, df_dummies], axis=1)


# --------------------------------------------- DROP UNWANTED FEATURES ---------------------------------------------
df.drop(['population', 'households', 'proximity_to_store', 'ISLAND'], axis = 1, inplace = True)


# ------------------------------------- FIX MISSING DATA -------------------------------------
tb_med = df['total_bedrooms'].median(axis=0)
df['total_bedrooms'] = df['total_bedrooms'].fillna(value = tb_med)
df['total_bedrooms'] = df['total_bedrooms'].astype(int)
df.name = 'df'

# ------------------------------------- Z-SCORE -------------------------------------
z = np.abs(stats.zscore(df))
dfz = df[(z < 3).all(axis = 1)]
dfz.name = 'dfz'

# ------------------------------------- INTERQUARTILE RANGE -------------------------------------
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
lower = q1 - (1.5 * iqr)
upper = q3 + (1.5 * iqr)
dfi = df[~((df < lower) | (df > upper)).any(axis = 1)]
dfi.name = 'dfi'
#dfi = dfi.drop(['NEAR BAY', 'NEAR OCEAN'], axis = 1)  # After applying IQR, the following features are now empty and can be dropped

# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FUNCTIONS <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
def make_heatmap(df):
  corr = df.corr()
  plt.figure(figsize=(15 ,9))
  sns.heatmap(corr, annot=True, vmin = -1, vmax = 1, center = 0, fmt = '.1g', cmap = 'coolwarm')
  plt.show()
  plt.close()


def make_heading(heading = 'heading'):
  print('\n\n{0}:\n'.format(heading), '-' * 30)


def drop_features(df_in):
  df_in = df_in.drop([#'median_house_value',
                      'longitude',
                      'latitude',
                      #'housing_median_age',
                      'total_rooms',
                      'total_bedrooms',
                      #'median_income',
                      'INLAND',
                      #'NEAR BAY',
                      #'NEAR OCEAN'
                      ],
                      axis = 1
                      )
  return df_in


def plot_test_predictions(y_test, y_pred):
  make_heading('Prediction Performance')  # Make a heading to separate output
  fig, ax = plt.subplots()
  ax.scatter(y_test, y_pred, edgecolors=(0, 0, 0), alpha=0.5)
  ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'g--', lw=4)
  ax.set_xlabel('Actual')
  ax.set_ylabel('Predicted')
  ax.set_title("Ground Truth vs Predicted")
  plt.show()
  plt.close()


def plot_feature_importance(feature_importance):
  # Make importances relative to max importance
  feature_importance = 100.0 * (feature_importance / feature_importance.max())
  sorted_idx = np.argsort(feature_importance)
  pos = np.arange(sorted_idx.shape[0]) + .5
  make_heading('Feature Importance')  # Make a heading to separate output
  plt.subplot(1, 2, 2)
  plt.barh(pos, feature_importance[sorted_idx], align='center')
  plt.yticks(pos, features_dfx.columns[sorted_idx])  # Label each bar with its feature name, in sorted order
  plt.xlabel('Relative Importance')
  plt.title('Variable Importance')
  plt.show()
  plt.close()


def plot_deviance():
  # Plot the training and test set deviance (loss) at each boosting stage
  y_staged_predicted_score = model.staged_predict(X_test)  # Generator of test set predictions, one per stage
  

  # Create an empty array of zeros based on the value in 'n_estimators' key
  test_score = np.zeros(shape = (params['n_estimators'],),
                        dtype=np.float64
                        )

  # Compute test set deviance (loss) at each stage and place it into the array
  #  based on the predicted value (y_pred) against the actual value (y_test)
  for i, y_pred in enumerate(y_staged_predicted_score):
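      # NOTE: 'loss_' was deprecated and later removed in newer scikit-learn releases;
      #  there, a metric such as metrics.mean_squared_error(y_test, y_pred) can stand in
      test_score[i] = model.loss_(y_test, y_pred)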
  
  make_heading('Deviance Over Time')  # Make a heading to separate output
  plt.figure(figsize=(15, 7))
  plt.subplot(1, 2, 1)
  plt.title('Deviance')
  plt.plot(np.arange(params['n_estimators']) + 1, 
           model.train_score_,
           'b-',
           label='Training Set Deviance'
           )
  plt.plot(np.arange(params['n_estimators']) + 1,
           test_score,
           'r-',
           label='Test Set Deviance'
           )
  plt.legend(loc='upper right')
  plt.xlabel('Boosting Iterations')
  plt.ylabel('Deviance')
  plt.show()
  plt.close()
  

def show_metrics(X_train, X_test, y_train, y_test, y_pred):
  # Display the shape of each array:
  #make_heading('The shape of each of our arrays after splitting into training and testing sets for dataframe \"{0}\"'.format(dfx.name))
  #print('X_train shape: ', X_train.shape)
  #print('X_test shape:  ', X_test.shape)
  #print('y_train shape: ', y_train.shape)
  #print('y_test shape:  ', y_test.shape)
  #print('y_pred shape:  ', y_pred.shape)

  # Display training / testing set metrics
  make_heading('\n\nAccuracy and Error for training/testing on the {0} model for dataframe \"{1}\"'.format(model_name, dfx.name))
  print("Training Accuracy (score)                 (X_train, y_train):  {:.2f}".format(model_train_score))
  print("Test Accuracy (score)                     (X_test, y_test):    {:.2f}".format(model_test_score))
  print("Predictive Accuracy (R^2 score)           (y_test, y_pred):    {:.2f}".format(predictive_accuracy))
  print("Explained Variance (1 is best) (loss)     (y_test, y_pred):    {:.2f}".format(ev))
  print("Mean Absolute Error on Test Set (loss)    (y_test, y_pred):    {:.2f}".format(mean_ae))
  print("Median Absolute Error on Test Set (loss)  (y_test, y_pred):    {:.2f}".format(median_ae))
  print("RMSE on Test Set (loss)                   (y_test, y_pred):    {:.2f}".format(rmse))

## B. Choose a dataset to use
- Three datasets have been created:
  1. **df** -- the **original dataset** ('NaN' values have been replaced with median, so there are no null values)
  2. **dfz** -- a dataset that has used the **z-score method** for removing outlier data
  3. **dfi** -- a dataset that has used the **IQR method** for removing outlier data
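
As a small illustration of how the two outlier filters behave, the sketch below applies the same z-score and IQR rules to a toy dataframe. The column names and values are hypothetical and chosen only so that one row is an obvious outlier.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data (hypothetical values): ordinary rows plus one injected outlier
rng = np.random.RandomState(0)
toy = pd.DataFrame({'rooms': rng.normal(5, 1, 200),
                    'income': rng.normal(3, 0.5, 200)})
toy.loc[0, 'income'] = 50.0  # an obvious outlier in row 0

# z-score method: keep rows where every feature is within 3 standard deviations
z = np.abs(stats.zscore(toy))
toy_z = toy[(z < 3).all(axis=1)]

# IQR method: keep rows where every feature is within 1.5 * IQR of the quartiles
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
inside = ~((toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)).any(axis=1)
toy_i = toy[inside]

print('rows: original =', len(toy), ' z-score =', len(toy_z), ' IQR =', len(toy_i))
```

Both filters drop the injected row, but they will not always agree on borderline rows, so the filtered datasets can differ in size.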


**NOTE:** When choosing your dataset, you will need to change two commands to reflect the dataset that you want to use. The following example switches **from** the 'dfz' dataset **to** the 'dfi' dataset.

**Original code** --> **New code**:
```
dfx = dfz.copy()      -->   dfx = dfi.copy()
dfx.name = dfz.name   -->   dfx.name = dfi.name
```




In [0]:
# *************************** DETERMINE DATA FOR USE ***************************
# Choose which dataframe you would like to use (dfi: IQR, dfz: z-score, df: original (with replaced NaN values))
dfx = df.copy()
dfx.name = df.name

# Show a heatmap for the given dataframe you want to use
make_heatmap(dfx)

## C. Step by Step: Building a Gradient Boosting Model

### From *Machine Learning For Absolute Beginners* by Oliver Theobald
```

model = ensemble.GradientBoostingRegressor(
                                           n_estimators = 150, 
                                           learning_rate = 0.1, 
                                           max_depth = 30, 
                                           min_samples_split = 4, 
                                           min_samples_leaf = 6, 
                                           max_features = 0.6, 
                                           loss = 'huber'
                                          )
```

The first line is the algorithm itself (gradient boosting) and comprises just one line of code. The code below dictates the hyperparameters for this algorithm. 
- **n_estimators** represents how many decision trees to be used. Remember that a high number of trees generally improves accuracy (up to a certain point) but will extend the model’s processing time. Above, I have selected 150 decision trees as an initial starting point. 
- **learning_rate** controls the rate at which additional decision trees influence the overall prediction. This effectively shrinks the contribution of each tree by the set learning_rate. Inserting a low rate here, such as 0.1, should help to improve accuracy. 
- **max_depth** defines the maximum number of layers (depth) for each decision tree. If “None” is selected, then nodes expand until all leaves are pure or until all leaves contain less than min_samples_leaf. Here, I have chosen a high maximum number of layers (30), which will have a dramatic effect on the final result, as we’ll soon see. 
- [**min_samples_split**](https://stackoverflow.com/questions/46480457/difference-between-min-samples-split-and-min-samples-leaf-in-sklearn-decisiontre) defines the minimum number of samples required to execute a new binary split. For example, min_samples_split = 10 means there must be ten available samples in order to create a new branch.
- **min_samples_leaf** represents the minimum number of samples that must appear in each child node (leaf) before a new branch can be implemented. This helps to mitigate the impact of outliers and anomalies in the form of a low number of samples found in one leaf as a result of a binary split. For example, min_samples_leaf = 4 requires there to be at least four available samples within each leaf for a new branch to be created. 
- **max_features** is the total number of features presented to the model when determining the best split.
- **loss** calculates the model's error rate. For this exercise, we are using huber which protects against outliers and anomalies. Alternative error rate options include ls (least squares regression), lad (least absolute deviations), and quantile (quantile regression). Huber is actually a combination of least squares regression and least absolute deviations.


  >Theobald, Oliver. Machine Learning For Absolute Beginners: A Plain English Introduction (Second Edition) (Machine Learning For Beginners Book 1) (pp. 139-141). Scatterplot Press. Kindle Edition.
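
To see how `n_estimators` and `learning_rate` interact, the following sketch trains two models on synthetic data (an assumption for illustration, not the housing dataset) and uses `staged_predict` to record the test error after each added tree. The other hyperparameter values are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumed for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

staged_mse = {}
for rate in (0.01, 0.1):
    gbr = GradientBoostingRegressor(n_estimators=150,
                                    learning_rate=rate,
                                    max_depth=3,
                                    min_samples_split=4,
                                    min_samples_leaf=6,
                                    loss='huber',
                                    random_state=0)
    gbr.fit(X_train, y_train)
    # staged_predict yields the test predictions after 1, 2, ..., n_estimators trees
    staged_mse[rate] = [mean_squared_error(y_test, stage_pred)
                        for stage_pred in gbr.staged_predict(X_test)]
    best_stage = int(np.argmin(staged_mse[rate])) + 1
    print('learning_rate =', rate,
          '| best number of trees =', best_stage,
          '| lowest test MSE =', min(staged_mse[rate]))
```

A lower learning rate shrinks each tree's contribution, so it typically needs more trees to reach the same error. This is why the two hyperparameters are usually tuned together.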

I also referenced the following website(s) and adapted code in order to graph this model:
- [Scikit-Learn Documentation: Ensemble Gradient Boosting Visualization](https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-regression-py)
- [Sharp Sight Labs: Numpy Zeros Tutorial](https://www.sharpsightlabs.com/blog/numpy-zeros-python/)

In [0]:
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> STEP BY STEP GRADIENT BOOSTING CONSTRUCTION <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

# ---------------------------- Identify Target Variable ----------------------------
# Create a copy of the dataframe with only the desired features (dropping the target feature)
features_dfx = dfx.drop(['median_house_value'], axis = 1)  # We *MUST* drop this b/c it is the target (drop() returns a new dataframe)

# Determine additional features to keep/drop (# means that we want to keep that variable)
features_dfx = features_dfx.drop([#'longitude',
                                  #'latitude',
                                  #'housing_median_age',
                                  #'total_rooms',
                                  #'total_bedrooms',
                                  #'median_income',
                                  #'INLAND',
                                  'NEAR BAY',
                                  'NEAR OCEAN'
                                  ],
                                 axis = 1
                                 )


# ---------------------------- SPLIT THE DATASET ----------------------------
X = features_dfx.values                 # Define the independent variable values to be used
y = dfx['median_house_value'].values    # Define the dependent variable values to be used
rs = 20                                 # Define the random state variable (ensuring continuity between runs)
test_size = 0.3                         # Define the percentage of data to set aside for testing (usually b/w 0.2 - 0.3)

# Split the dataset into training and testing arrays
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = test_size,
                                                    shuffle = True,
                                                    random_state = rs
                                                    )


# ---------------------------- CREATE THE MODEL  ----------------------------
# -- Set the hyperparameters that we want to use in the model
params = {'n_estimators': 1000,
          'learning_rate': .01,
          'max_depth': 5,
          'min_samples_split': 10,
          'min_samples_leaf': 5,
          'max_features': None,       # Can be float, int, string or None; max_features < n_features = reduced variance and increased bias
          'subsample': 1.0,
          'random_state': rs,
          'verbose': 1,
          # Options for 'loss': huber, ls (least squares), lad (least absolute deviations) and quantile (quantile regression)
          #  (newer scikit-learn releases rename 'ls' to 'squared_error' and 'lad' to 'absolute_error')
          'loss': 'huber'
          }

# Define the desired algorithm and configure the hyperparameters (supplied with the '**params' argument)
model = ensemble.GradientBoostingRegressor(**params)
model_name = type(model).__name__   # Get the name of the model (for use in our display functions)


# ---------------------------- Fit / TRAIN THE MODEL ----------------------------
# Train the model using our training sets (X_train, y_train)
model.fit(X_train, y_train) 

# Here we want to input our X_test array to make predictions based upon our trained model
y_pred = model.predict(X_test)
y_staged_predicted_score = model.staged_predict(X_test)


# ---------------------------- GATHER METRICS ----------------------------
model_test_score = model.score(X_test, y_test)              # R^2 score of our model on test data
model_train_score = model.score(X_train, y_train)           # R^2 score of our model on training data
mean_ae = metrics.mean_absolute_error(y_test, y_pred)       # Mean absolute error on the test set (how well the trained model performs on new data: y_test against y_pred)
median_ae = metrics.median_absolute_error(y_test, y_pred)   # Median absolute error
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))  # Root mean squared error
ev = metrics.explained_variance_score(y_test, y_pred)       # Explained Variance Score (y_test, y_pred): Measures the proportion to which a mathematical model accounts for the variation (dispersion) of a given data set
predictive_accuracy = metrics.r2_score(y_test, y_pred)      # R^2 score on the test set (same value as model_test_score, since score() returns R^2 for regressors)

# ---------------------------- ANALYSIS / VISUALIZATION ----------------------------
show_metrics(X_train, X_test, y_train, y_test, y_pred)  # Display the scores and loss for the model
plot_test_predictions(y_test, y_pred)                   # Create a scatterplot of real values against predicted ones for the test set
feature_importance = model.feature_importances_         # Gather feature importance values
plot_feature_importance(feature_importance)             # Plot feature importance
plot_deviance()                                         # Plot training / testing accuracy
joblib.dump(model, 'ca_housing_gb_model.pkl')           # Save the model so that we can use it later
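
A saved model can be loaded again with `joblib.load`. The sketch below is self-contained and uses a small stand-in model and a temporary file (the data, `demo_model`, and the file name are hypothetical) to check that the restored model makes the same predictions as the original.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Small stand-in model (hypothetical data, not the housing dataset)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 1, size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])
demo_model = GradientBoostingRegressor(n_estimators=20, random_state=0)
demo_model.fit(X_demo, y_demo)

# Save the model to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_gb_model.pkl')
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(demo_model.predict(X_demo),
                               restored.predict(X_demo))
print('Restored model matches original:', same_predictions)
```

When loading a model in a different environment, it is safest to use the same scikit-learn version that was used to save it.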