### A Random Forest Classifier
In this section, a random forest is fitted to a subsample of the natural gas dataset. A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. Each sub-sample is the same size as the original input sample, but the samples are drawn with replacement if bootstrap=True (the default). In short, random forest is an averaging algorithm based on randomized decision trees.
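
The bootstrap-and-average idea described above can be sketched by hand. This is a minimal illustration on synthetic data from `make_classification` (an assumption, not the natural gas dataset), and the names `trees`, `avg_proba` and `train_accuracy` are illustrative only; it is not how `RandomForestClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the natural gas dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

rng = np.random.default_rng(0)
n_trees = 10
trees = []
for i in range(n_trees):
    # Bootstrap: draw row indices with replacement, same size as the input.
    idx = rng.integers(0, len(X), size=len(X))
    # Each tree also considers a random subset of features at every split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Average the per-tree class probabilities, then pick the most probable class.
avg_proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
y_pred = avg_proba.argmax(axis=1)
train_accuracy = (y_pred == y).mean()
print(f"training accuracy of the hand-built ensemble: {train_accuracy:.3f}")
```

Averaging probabilities rather than counting hard votes mirrors the "averaging" wording above; either choice would serve for this sketch.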

Using the natural gas dataset, we will fit forest classifiers with two arrays: an array X of predictors (here gas price and gas consumption) holding the training samples, and an array Y holding the target values (class labels, here derived from the weather status) for the training samples.

In [1]:
# Import required libraries
%matplotlib inline
import numpy as np
import scipy as sp

In [2]:
import pandas as pd
#load the data for analysis
dflog=pd.read_excel("data/DataSet_GasPrice_ Outlier_Removed.xlsx")
dflog.head()

Unnamed: 0,Days,Date,AveCoalPrice,OilPrice,GrossGasProd,TotGasCons,GasPrice,Weather,WSTAT,GasPriceStatus,GPSAT,color
0,245,2008-12-31,57.22,41.12,2227.028,2399.702,5.82,WINTER,1,HIGH,1,1
1,276,2009-01-31,54.37,41.71,2251.938,2729.715,5.24,WINTER,1,HIGH,1,1
2,304,2009-02-28,52.3,39.09,2074.167,2332.539,4.52,WINTER,1,HIGH,1,1
3,335,2009-03-31,44.34,47.94,2262.488,2170.709,3.96,WINTER,1,HIGH,1,1
4,365,2009-04-30,41.92,49.65,2147.856,1741.293,3.5,SPRING,0,HIGH,1,1


In [5]:
# Rename the columns to short attribute/feature names
dflog.columns = ['DAYS', 'DATE', 'COALP', 'OILP', 'GPROD', 'GCONS', 'GASP', 'WEATH', 'WSTAT', 'GPSTAT', 'GPSAT', 'COL']
dflog.head()

Unnamed: 0,DAYS,DATE,COALP,OILP,GPROD,GCONS,GASP,WEATH,WSTAT,GPSTAT,GPSAT,COL
0,245,2008-12-31,57.22,41.12,2227.028,2399.702,5.82,WINTER,1,HIGH,1,1
1,276,2009-01-31,54.37,41.71,2251.938,2729.715,5.24,WINTER,1,HIGH,1,1
2,304,2009-02-28,52.3,39.09,2074.167,2332.539,4.52,WINTER,1,HIGH,1,1
3,335,2009-03-31,44.34,47.94,2262.488,2170.709,3.96,WINTER,1,HIGH,1,1
4,365,2009-04-30,41.92,49.65,2147.856,1741.293,3.5,SPRING,0,HIGH,1,1


Let us start with the relationship that has the most explicit classification: the classification of gas price and gas consumption with respect to weather status.

In [6]:
# Import the splitting utility, the classifier, and the accuracy metric
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Split the data into a training and a test set.
# Features: gas price and gas consumption; label: True where WSTAT == 0.
Xlr, Xtestlr, ylr, ytestlr = train_test_split(
    dflog[['GASP', 'GCONS']].values,
    (dflog.WSTAT == 0).values,
    random_state=5)


We now create a random forest classifier and fit it to the training sample.

In [7]:
clf = RandomForestClassifier(n_estimators=10)

In [8]:
clf = clf.fit(Xlr, ylr)

In [9]:
clf

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
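
Once a forest is fitted, its accuracy on held-out data can be checked with `accuracy_score`, which is imported above. The sketch below is an assumption-laden stand-in: it uses a synthetic two-feature dataset in place of `dflog[['GASP', 'GCONS']]`, and the names `X_demo`, `y_demo` and `forest` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-feature stand-in for the gas price / gas consumption features
# (an assumption; the spreadsheet itself is not loaded here).
X_demo, y_demo = make_classification(n_samples=100, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=5)

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

# Training accuracy is usually optimistic; the test score is the fairer estimate.
train_acc = accuracy_score(y_train, forest.predict(X_train))
test_acc = accuracy_score(y_test, forest.predict(X_test))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

With the notebook's own arrays, the equivalent call would presumably be `accuracy_score(ytestlr, clf.predict(Xtestlr))`.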

### Extremely Randomized Trees
In this section we take the randomness a little further by using extremely randomized trees. As in a random forest, a random subset of candidate features is used, but instead of searching for the most discriminative threshold, a threshold is drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule. This usually reduces the variance of the model a bit more, at the expense of a slightly greater increase in bias.
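
The difference between the two split rules can be sketched for a single node. The code below is a simplified illustration on toy data (an assumption), with hypothetical helpers `gini`, `split_impurity`, `best_threshold` and `random_threshold`; scikit-learn's actual splitters are more involved.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of 0/1 labels."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)

def split_impurity(x, y, threshold):
    """Weighted Gini impurity after splitting on x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

def best_threshold(x, y):
    """Random-forest style: try every midpoint between sorted unique values."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2.0
    scores = [split_impurity(x, y, t) for t in candidates]
    return candidates[int(np.argmin(scores))]

def random_threshold(x, rng):
    """Extra-trees style: one uniform draw between the feature's min and max."""
    return rng.uniform(x.min(), x.max())

rng = np.random.default_rng(0)
# Toy data (an assumption): two features, only feature 0 carries signal.
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0.3).astype(int)

exhaustive = [split_impurity(X[:, j], y, best_threshold(X[:, j], y)) for j in range(2)]
randomized = [split_impurity(X[:, j], y, random_threshold(X[:, j], rng)) for j in range(2)]

# In both cases the feature with the lowest impurity supplies the split.
print("searched thresholds:", exhaustive)
print("random thresholds:  ", randomized)
```

A random threshold can never beat the searched one on the same feature, which is where the extra bias comes from; the variance reduction only shows up once many such trees are averaged.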

Let us import the required libraries.

In [12]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

Generate a decision tree classifier and estimate its cross-validated score.

In [13]:
clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, Xlr, ylr)
scores.mean()


0.65384615384615385

Now generate a random forest classifier and estimate its score.

In [14]:
clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, Xlr, ylr)
scores.mean()


0.62820512820512819

We also generate an extra-trees classifier and estimate its score.

In [15]:
clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, Xlr, ylr)
scores.mean()

0.64102564102564108

In [16]:
scores.mean() > 0.65

False

### AdaBoost
A popular boosting algorithm is AdaBoost. Its core principle is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly re-weighted versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction.
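
The re-weighting and weighted vote can be sketched with decision stumps. This is a minimal version of discrete two-class AdaBoost on synthetic data (an assumption), written for illustration; `AdaBoostClassifier` may differ in its details, and the names `stumps`, `alphas` and `weights` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption), with labels recoded to {-1, +1}.
X, y01 = make_classification(n_samples=200, n_features=4, random_state=0)
y = 2 * y01 - 1

n_rounds = 20
weights = np.full(len(X), 1.0 / len(X))  # start from uniform sample weights
stumps, alphas = [], []
for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    # Weighted error of this stump, clipped to keep the logarithm finite.
    err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)   # vote weight of this learner
    weights *= np.exp(-alpha * y * pred)    # up-weight misclassified samples
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Weighted majority vote: sign of the alpha-weighted sum of stump predictions.
votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
y_pred = np.sign(votes)
train_accuracy = (y_pred == y).mean()
print(f"training accuracy after {n_rounds} rounds: {train_accuracy:.3f}")
```

Each round concentrates weight on the samples the previous stumps got wrong, which is the sense in which the data are "repeatedly modified".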

Using the natural gas dataset, let us fit an AdaBoost classifier with 100 weak learners.

In [17]:
#import the required library
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier

Generate an AdaBoost classifier and estimate its score.

In [18]:
clf = AdaBoostClassifier(n_estimators=100)

In [19]:
scores = cross_val_score(clf, Xlr, ylr)
scores.mean()

0.62820512820512819

### Gradient Tree Boosting
Gradient Boosted Regression Trees (GBRT) is a generalization of boosting to arbitrary differentiable loss functions. GBRT is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems. 
The advantages of GBRT are:
1. Natural handling of data of mixed type (= heterogeneous features)
2. Predictive power
3. Robustness to outliers in output space (via robust loss functions)

The primary disadvantage of GBRT is scalability: because boosting is sequential, with each tree depending on the ones before it, the procedure can hardly be parallelized. Parallelization here means constructing the trees, and computing their predictions, in parallel.

The sklearn.ensemble module provides both classification and regression via gradient boosted regression trees.
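
The sequential nature of boosting can be seen in a hand-written loop for the squared-error case, where each stage fits a stump to the residuals left by the previous stages. This is a simplified sketch on synthetic data (an assumption), not scikit-learn's implementation; `prediction` and `mse_per_stage` are illustrative names.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic 1-D regression data (an assumption; unrelated to the gas dataset).
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_stages = 100
prediction = np.full(len(y), y.mean())  # stage 0: a constant model
mse_per_stage = [np.mean((y - prediction) ** 2)]
for _ in range(n_stages):
    # For the loss 0.5 * (y - F)^2 the negative gradient is the residual.
    residuals = y - prediction
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    # Each stage needs the predictions of all earlier stages, so stages
    # cannot be built independently of one another.
    prediction += learning_rate * stump.predict(X)
    mse_per_stage.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_per_stage[0]:.3f} -> {mse_per_stage[-1]:.3f}")
```

Other differentiable losses would replace the residual with the corresponding negative gradient, which is the generalization the text refers to.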

#### Classification
GradientBoostingClassifier supports both binary and multi-class classification. Here we will fit a gradient boosting classifier with 100 decision stumps as weak learners.


In [22]:
from sklearn.ensemble import GradientBoostingClassifier
# max_depth=1 makes every weak learner a decision stump
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                 max_depth=1, random_state=0).fit(Xlr, ylr)
clf.score(Xtestlr, ytestlr)

0.69230769230769229

#### Regression

GradientBoostingRegressor supports a number of different loss functions for regression, which can be specified via the argument loss; the default loss function for regression is least squares ('ls'; scikit-learn 1.0 and later name it 'squared_error').

In [23]:
#import the libraries
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor


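We now carry out the regression.

In [25]:
# Note: ylr holds the True/False labels from the split above, so the regressor fits a 0/1 target.
# loss='ls' is the least-squares loss (named 'squared_error' in scikit-learn 1.0 and later).
est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1,
                                random_state=0, loss='ls').fit(Xlr, ylr)
mean_squared_error(ytestlr, est.predict(Xtestlr))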

0.20174199668572462
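
The effect of the loss argument, and the robustness to outliers in the output space mentioned earlier, can be sketched by comparing the default loss with `loss='huber'`. The data below are synthetic (an assumption), 5% of the training targets are deliberately corrupted, and the names `est_default` and `est_huber` are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Synthetic regression data (an assumption; unrelated to the gas dataset).
X_train = rng.uniform(0, 10, size=(300, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.1, size=300)
X_test = rng.uniform(0, 10, size=(100, 1))
y_test = np.sin(X_test[:, 0])  # clean targets for evaluation

# Corrupt 5% of the training targets with large positive outliers.
outliers = rng.choice(300, size=15, replace=False)
y_train[outliers] += 20.0

common = dict(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0)
est_default = GradientBoostingRegressor(**common).fit(X_train, y_train)
est_huber = GradientBoostingRegressor(loss='huber', **common).fit(X_train, y_train)

mse_default = mean_squared_error(y_test, est_default.predict(X_test))
mse_huber = mean_squared_error(y_test, est_huber.predict(X_test))
print(f"default loss MSE: {mse_default:.3f}, huber loss MSE: {mse_huber:.3f}")
```

The default (least-squares) loss is left unnamed here because its string identifier differs between scikit-learn versions. On data like this one would typically expect the Huber fit to track the clean targets more closely.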