## Random forest regression on wind turbine data

A random forest regressor predicts a continuous target variable from one or more predictor variables. Recall that a random forest is an ensemble estimator: it trains many decision trees and averages their predictions.
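The averaging behind the ensemble can be checked directly. The sketch below is illustrative only: it uses synthetic data (not the turbine data), and the names `X_demo`, `y_demo` and `forest` are made up for this example. It averages the predictions of the individual trees by hand and compares the result with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration, not the turbine data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Predict with every fitted tree separately, then average by hand
tree_preds = np.array([tree.predict(X_demo) for tree in forest.estimators_])
manual_mean = tree_preds.mean(axis=0)

# Compare the manual average with the forest's own prediction
print(np.allclose(manual_mean, forest.predict(X_demo)))
```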

### The data
La Haute Borne is a wind farm in France. For each turbine, the dataset records many technical measurements that influence power output, together with environmental variables such as wind speed.

### The goal
Predict the power output of each turbine from the given input variables.

## Preliminaries

Load the necessary libraries.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score  # regression metric (accuracy_score is for classification)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
%matplotlib inline

A description of the column headings is given in the following file.

In [None]:
descript = pd.read_csv('data_description.csv', sep=';')

In [None]:
descript.shape

In [None]:
descript

In [None]:
# data = pd.read_csv('wind-data.csv', sep=';', parse_dates=['Date_time'])
data = pd.read_csv('wind-data-truncated.csv', parse_dates=['Date_time'])

In [None]:
data.head()

## Investigate the data

Sample from the dataset.

In [None]:
df = data.sample(2000, random_state=42)  # reproducible random sample of 2000 rows

In [None]:
df.index

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.columns

## Preprocess the data

### Clean the data

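Fill in missing values. Here every missing entry is replaced with zero as a simple placeholder. Note that zero can be a misleading stand-in for quantities such as temperature or power.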

In [None]:
df = df.fillna(0)
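The choice of fill value matters. As a small illustration, the sketch below uses a hypothetical toy column (the values are invented, not taken from the turbine data) to compare filling with zero against filling with the column median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value (not from the turbine data)
toy = pd.DataFrame({"P_avg": [100.0, 120.0, np.nan, 110.0]})

# Option 1: replace the missing value with zero
filled_zero = toy.fillna(0)

# Option 2: replace the missing value with the column median
filled_median = toy.fillna(toy.median(numeric_only=True))

# Zero-filling drags the column mean down; median-filling does not
print(filled_zero["P_avg"].mean())
print(filled_median["P_avg"].mean())
```

Zero-filling pulls the column mean well below the observed values, while median-filling keeps it in the observed range.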

### Feature engineering

Examine the strength of the relationships between features.

## Identify correlated variables

Use the correlation between features to select features for training the model. We selected positively correlated features for training and prediction.

In [None]:
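# c = df.corr().abs()
# Correlate numeric columns only; non-numeric columns such as Date_time cannot be correlated
corr = df.select_dtypes(include='number').corr()
plt.figure(figsize=(30, 30))
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)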
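A heatmap of this size can be hard to read. One alternative, sketched below, is to rank features by their correlation with the target. The column names are borrowed from the notebook, but the values are synthetic and the relationship between wind speed and power is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: power depends on wind speed, not on temperature
rng = np.random.default_rng(1)
wind = rng.uniform(3, 15, size=500)
toy = pd.DataFrame({
    "Ws_avg": wind,
    "Ot_avg": rng.normal(10, 5, size=500),
    "P_avg": 50 * wind + rng.normal(0, 20, size=500),
})

# Correlation of every feature with the target, strongest first
target_corr = toy.corr()["P_avg"].drop("P_avg").sort_values(ascending=False)
print(target_corr)
```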

## What question(s) are we trying to answer?

Can we predict the average power produced `P_avg` from

- the average vane position `Va2_avg`,
- wind speed 2 in m/s `Ws2_avg`,
- the average wind speed `Ws_avg`,
- the average outdoor temperature `Ot_avg`, and
- the average absolute wind direction `Wa_avg`?

For help with possible questions to ask, see [questions](https://github.com/jmwagstaff/La-Haute-Borne/blob/master/README.md).

## Create training and test sets

It is standard practice in machine learning to split a dataset into training and test sets. The training set, which is usually larger than the test set, is used to train the model. The test set is held back and used to evaluate the model's predictions on data it has not seen.

In [None]:
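# Fill any remaining missing values with each column's median
# (this has no effect if the missing values were already filled earlier)
df = df.fillna(df.median(numeric_only=True))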

In [None]:
X = df[['Va2_avg','Ws2_avg', 'Ws_avg', 'Ot_avg', 'Wa_avg']].values
Y = df['P_avg'].values

Store the names of the selected attributes.

In [None]:
labels = ['Va2_avg','Ws2_avg', 'Ws_avg', 'Ot_avg', 'Wa_avg']; labels

Use the sklearn `train_test_split` function to split the data into training and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

### Transform the data

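We standardise each feature to zero mean and unit variance. The scaler is fitted on the training set only and then applied to both sets, so that no information from the test set leaks into training.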

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
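To make the transformation concrete, here is a minimal sketch of the same standardisation done by hand with NumPy on a tiny invented array. The names `train` and `test` are hypothetical. The point to notice is that the test rows are scaled with the training mean and standard deviation.

```python
import numpy as np

# Tiny invented arrays: 3 training rows, 1 test row, 2 features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics come from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# Apply the same shift and scale to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```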

For more information on preprocessing data see [Preprocessing data](http://scikit-learn.org/stable/modules/preprocessing.html).

[Imputation of missing values](http://scikit-learn.org/stable/modules/preprocessing.html#imputation-of-missing-values)

## Fit the model

Create the regressor model.

In [None]:
#model = RandomForestRegressor(n_jobs=-1, min_impurity_decrease=10)
model = RandomForestRegressor(n_jobs=-1)

Try a different number of estimators.

In [None]:
# Try different numbers of n_estimators or trees - this will take a minute or so
estimators = np.arange(1, 33, 10)
scores = []
for n in estimators:
    model.set_params(n_estimators=n)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
plt.title("Effect of n_estimators")
plt.xlabel("n_estimators")
plt.ylabel("R^2 score on test set")  # model.score returns R^2 for regressors
plt.plot(estimators, scores)

## Predict

Apply the regressor to the test set.

In [None]:
y_predicted = model.predict(X_test)

In [None]:
len(y_predicted), len(y_test)

## Feature importance

Which features are important for our predictions?

In [None]:
result = pd.DataFrame()
result['feature'] = labels
result['importance'] = model.feature_importances_
result.sort_values(by=['importance'], ascending=False, inplace=True)
result

It turns out that the average wind speed `Ws_avg` is important in predicting the average power output.

## Determine accuracy

In [None]:
from sklearn.metrics import r2_score

In [None]:
score = r2_score(y_test, y_predicted)

In [None]:
score

**Interpretation**: We use the $R^2$ coefficient of determination to measure how well the model's predictions match the test data. The best possible score is 1.0, and the score can be negative when the model predicts worse than simply predicting the mean of the target. There are many other measures one can use to assess a regression model. Two of them are:

    1. mean squared error, and 
    2. mean absolute percentage error.
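As a worked example, the three measures can be computed by hand with NumPy. The arrays below are invented for illustration. The $R^2$ score is $1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$. Note that the mean absolute percentage error divides by the true values, so it is undefined when a true value is zero.

```python
import numpy as np

# Invented true and predicted values for illustration
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 380.0])

# R^2: one minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Mean squared error
mse = np.mean((y_true - y_pred) ** 2)

# Mean absolute percentage error, as a fraction (requires non-zero y_true)
mape = np.mean(np.abs((y_true - y_pred) / y_true))

print(r2, mse, mape)
```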
    

## XGBoost

We touched on boosting in the module. Boosting is an ensemble technique that combines weak predictors, typically decision trees, sequentially so that each new predictor corrects the errors of those before it. Here we consider XGBoost, a boosting library with built-in regularisation to control overfitting.
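The sequential idea can be sketched in a few lines. The example below is a plain gradient-boosting loop for squared error built from shallow scikit-learn trees on synthetic data. It is not XGBoost's actual algorithm, and the names `learning_rate`, `weak_tree` and `errors` are chosen for this illustration. Each tree is fitted to the residuals left by the ensemble so far.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration, not the turbine data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=300)

learning_rate = 0.3
prediction = np.full_like(y_demo, y_demo.mean())  # start from the mean
errors = [np.mean((y_demo - prediction) ** 2)]    # training MSE per round

for _ in range(20):
    residual = y_demo - prediction                # what is still unexplained
    weak_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    weak_tree.fit(X_demo, residual)               # fit a shallow tree to the residuals
    prediction += learning_rate * weak_tree.predict(X_demo)
    errors.append(np.mean((y_demo - prediction) ** 2))

print(errors[0], errors[-1])
```

The training error shrinks with each round, which is also why boosted models need regularisation: left unchecked, they keep fitting noise.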

You may have to install XGBoost with `conda install -c conda-forge xgboost` (on Linux) or `conda install -c anaconda py-xgboost` (on Windows). If you have trouble installing XGBoost, you can skip this section.

In [None]:
import xgboost as xgb

### Train the model

In [None]:
model = xgb.XGBRegressor()

In [None]:
model.fit(X_train, y_train)

### Make predictions

In [None]:
y_preds = model.predict(X_test)

In [None]:
predictions = np.round(y_preds)  # round to whole units (y_preds could also be scored unrounded)

### Determine accuracy

In [None]:
from sklearn.metrics import explained_variance_score

In [None]:
print(explained_variance_score(y_test, predictions))  # argument order is (y_true, y_pred)

### Cross Validate the model

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [None]:
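# shuffle=True is needed for random_state to take effect (recent scikit-learn raises an error otherwise)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print("Mean cross-validated R^2: ", results.mean())  # default scoring for a regressor is R^2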
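To see what `cross_val_score` does, the sketch below repeats the procedure by hand on synthetic data: for each fold, a fresh copy of the model is fitted on the training part and scored on the held-out part. The data and the names `base`, `manual_scores` and `library_scores` are assumptions for illustration. A small random forest stands in for the model so that the example runs without XGBoost.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration, not the turbine data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1, size=200)

base = RandomForestRegressor(n_estimators=10, random_state=0)
kfold = KFold(n_splits=5, shuffle=True, random_state=7)

# Manual loop: fit a fresh clone on each training fold, score on the held-out fold
manual_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    fold_model = clone(base)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# Library call with the same model and the same folds
library_scores = cross_val_score(base, X_demo, y_demo, cv=kfold)
print(np.allclose(manual_scores, library_scores))
```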

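The score is about 90% ($R^2$) with the random forest, 89% (explained variance) with XGBoost, and 93% (mean $R^2$ over the folds) with cross-validation applied to the XGBoost model. Because the three figures use different metrics and different data splits, they are not directly comparable.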

## Summary

Random forests are an ensemble technique that combines many decision trees to improve performance. An ensemble can be evaluated using various measures; here we used the $R^2$ score. We also looked at boosting, another ensemble machine learning technique, using XGBoost, which aims to improve performance while controlling overfitting.