# 0. SOME PRELIMINARIES 

In [None]:
# Import some libraries
# For plotting data
import matplotlib.pyplot as plt
# For numerical arrays
import numpy as np
# For Pandas dataframes. A dataframe is a matrix-like structure,
# similar to R dataframes
import pandas as pd

import os
os.getcwd()

The "wind_pickle" file contains data in a binary format called "Pickle". Pickle data loads faster than text data.

In [None]:
data = pd.read_pickle('wind_pickle')

You can inspect the attributes in the dataset. Very important: the output attribute (i.e. the value to be predicted), **energy**, is the first attribute. **Steps** represents how many hours in advance the forecast was made. We will not use this variable here.

In [None]:
# The dataset contains 5937 instances and 556 attributes (including 
# the outcome to be predicted)
print(data.shape)
data.columns.values.tolist()

Below, the data is split into training, validation, and test sets. Because this requires fairly advanced use of Pandas dataframes, it is done for you:

In [None]:
indicesTrain = (np.where(data.year<=2006))[0]
indicesVal = (np.where((data.year==2007) | (data.year==2008)))[0]
indicesTest = (np.where(data.year>=2009))[0]

Beware: **indicesTrain** does not contain the training data, but the *indices* of the training data. For instance, the following cell shows that the training data is made of instance number 0, instance number 1, ..., up to instance number 2527. This will be important later.

In [None]:
indicesTrain
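
To make the difference between data and indices concrete, here is a small sketch on a made-up `year` array (the values are illustrative and not taken from the wind dataset; the `toy_*` names are hypothetical). `np.where` applied to a boolean condition returns a tuple of arrays, which is why `[0]` is needed to extract the array of positions.

```python
import numpy as np

# Hypothetical toy column standing in for data.year
year = np.array([2005, 2006, 2007, 2008, 2009, 2010])

# np.where returns a tuple with one array per dimension; [0] takes the row positions
toy_train = np.where(year <= 2006)[0]
toy_val = np.where((year == 2007) | (year == 2008))[0]
toy_test = np.where(year >= 2009)[0]

print(toy_train)        # positions of the training rows, not the rows themselves
print(year[toy_train])  # indexing with the positions recovers the actual values
```

The same positions can later be used to pick rows out of any array that has one row per instance.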

Now, we are going to transform **data**, which is a Pandas dataframe, into **ava**, which is a NumPy array. The reason is that Scikit-learn works with NumPy arrays rather than Pandas dataframes.

In [None]:
# as_matrix() was removed in recent Pandas versions; .values returns the same NumPy array
ava = data.values

Now, **ava** is going to be decomposed into inputs **X** and outputs **y**, and then into training, validation, and test sets. For instance, **Xava** and **yava** contain the input attributes and the output attribute (**energy**) of the whole dataset. Please ask yourself why the inputs use "6:" and the output uses "0". **Xtrain** and **ytrain** are the same, but for the training dataset.

In [None]:
Xava = ava[:,6:]; yava = ava[:,0]
Xtrain = ava[indicesTrain,6:]; ytrain = ava[indicesTrain,0]
Xval = ava[indicesVal,6:]; yval = ava[indicesVal,0]
Xtest = ava[indicesTest,6:]; ytest = ava[indicesTest,0]

The following cell defines the function **mae** (Mean Absolute Error), which we will use later to measure the accuracy of models.

In [None]:
from sklearn import metrics

def mae(yval_pred, yval):
    # Mean absolute error between the predictions and the true values
    return metrics.mean_absolute_error(yval, yval_pred)
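
As a reminder of what this metric computes, the following pure-Python sketch averages the absolute differences between predictions and true values. The name `mae_by_hand` is hypothetical and only meant for checking your understanding; in the rest of the notebook, use **mae**.

```python
def mae_by_hand(y_pred, y_true):
    """Average of |prediction - true value| over all instances."""
    errors = [abs(p - t) for p, t in zip(y_pred, y_true)]
    return sum(errors) / float(len(errors))

# The absolute errors are 1, 0 and 2, so their mean is 1.0
print(mae_by_hand([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]))
```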

The following cell trains KNN with (Xtrain, ytrain) and evaluates it with (Xval, yval).

In [None]:
from sklearn import metrics
from sklearn import neighbors
n_neighbors = 5
knn = neighbors.KNeighborsRegressor(n_neighbors, weights='uniform')
np.random.seed(0)
%time _ = knn.fit(Xtrain, ytrain)
yval_pred = knn.predict(Xval)

print("MAE for KNN with K=5 is {}".format(mae(yval_pred, yval)))

In [None]:
# In case you need help for KNN
help('sklearn.neighbors.KNeighborsRegressor')

The following cell does hyper-parameter tuning for parameter K (n_neighbors), from 1 to 3 in steps of 1 (note that range(1,4,1) excludes 4). Please notice that with **partitions = [(indicesTrain, indicesVal)]** we are telling **GridSearchCV** to use the training dataset for training the different models with the different parameters, and the validation dataset for evaluating them. Notice that this is different from other notebooks, where cross-validation was used for this purpose.

In [None]:
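# GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed)
from sklearn.model_selection import GridSearchCV
np.random.seed(0)
param_grid = {'n_neighbors': list(range(1, 4, 1))}

# A single (train, validation) split instead of cross-validation folds
partitions = [(indicesTrain, indicesVal)]
clf = GridSearchCV(neighbors.KNeighborsRegressor(),
                   param_grid,
                   scoring='neg_mean_absolute_error',
                   cv=partitions, verbose=1)
%time _ = clf.fit(Xava, yava)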

Next, we show the best value of K and the validation MAE obtained with it. The score is negated because Scikit-learn maximizes scores, so the MAE is stored with a negative sign.

In [None]:
print("Best K: {} and MAE for best K: {}".format(clf.best_params_, -clf.best_score_))
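
The single train/validation split described above can be tried on a small synthetic problem first. This is a minimal sketch, assuming a Scikit-learn version that provides `sklearn.model_selection`; the data and the `toy_*` names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (illustrative only): 60 instances, 3 attributes
rng = np.random.RandomState(0)
X_toy = rng.rand(60, 3)
y_toy = X_toy[:, 0] + 0.1 * rng.randn(60)

# One (train, validation) pair of index arrays:
# the first 40 rows are used for training, the last 20 for validation
toy_partitions = [(np.arange(0, 40), np.arange(40, 60))]

toy_search = GridSearchCV(KNeighborsRegressor(),
                          {'n_neighbors': [1, 2, 3]},
                          scoring='neg_mean_absolute_error',
                          cv=toy_partitions)
toy_search.fit(X_toy, y_toy)

# One score per candidate K; scores are negated MAE, so flip the sign
print(-toy_search.cv_results_['mean_test_score'])
print(toy_search.best_params_)
```

Keep in mind that, with the default `refit=True`, the best model is refitted at the end on all the rows passed to `fit` (training plus validation in this sketch).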

# 1. HOW LONG DOES IT TAKE?

It is always a good idea to have an estimate of how long your machine learning algorithm is going to take. In the next two cells, estimate how many seconds KNN (with K=3) takes with only **100 instances**. With 6000 instances, it will take approximately 60 times that number. You can use **%time** for timing, as in previous cells.

In [None]:
#<WRITE CODE HERE FOR TIMING KNN>
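
If you prefer plain Python over the **%time** magic, the sketch below times a single call and extrapolates linearly from 100 to 6000 instances, as suggested in the text. The helper `time_call` is hypothetical, and the workload is only a placeholder that you would replace with fitting and predicting on a 100-instance subset.

```python
import time

def time_call(fn, *args):
    """Return (elapsed_seconds, result) for a single call to fn(*args)."""
    start = time.time()
    result = fn(*args)
    return time.time() - start, result

# Placeholder workload standing in for "fit and predict on 100 instances"
elapsed_100, _ = time_call(sum, range(100000))

# Linear extrapolation from 100 to 6000 instances
scale = 6000 / 100.0
print("Estimated seconds for 6000 instances: {:.4f}".format(elapsed_100 * scale))
```

A single measurement can be noisy, so repeating it a few times gives a more reliable estimate.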

Please do the same for decision trees with default parameters.

In [None]:
#<WRITE CODE HERE FOR TIMING Decision Trees>

# 2. MODEL SELECTION AND HYPER-PARAMETER TUNING

Train a KNN model with default parameters.

In [None]:
#<WRITE CODE HERE FOR KNN>

Do hyper-parameter tuning for KNN. Can you improve the results? Note: if **GridSearchCV** takes too long, you can use randomized search (**RandomizedSearchCV**) instead.

In [None]:
#<WRITE CODE HERE FOR KNN>
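
For reference, randomized search evaluates only a fixed number of randomly sampled parameter settings instead of the full grid. The sketch below uses synthetic data and hypothetical `toy_*` names, and assumes a Scikit-learn version that provides `sklearn.model_selection`; adapt the idea to the wind data rather than copying it.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (illustrative only)
rng = np.random.RandomState(0)
X_toy = rng.rand(80, 4)
y_toy = X_toy[:, 0] - X_toy[:, 1] + 0.1 * rng.randn(80)
toy_partitions = [(np.arange(0, 60), np.arange(60, 80))]

# Only n_iter=5 of the 20 candidate values of K are tried
toy_random = RandomizedSearchCV(KNeighborsRegressor(),
                                {'n_neighbors': list(range(1, 21))},
                                n_iter=5,
                                scoring='neg_mean_absolute_error',
                                cv=toy_partitions,
                                random_state=0)
toy_random.fit(X_toy, y_toy)
print(toy_random.best_params_, -toy_random.best_score_)
```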

Train a decision tree for regression with default parameters.

In [None]:
#<WRITE CODE HERE FOR DECISION TREES>

Do hyper-parameter tuning for Decision trees. Can you improve results?

In [None]:
#<WRITE CODE HERE FOR DECISION TREES>

Train a Random Forest (RF) with default parameters. A RF is an ensemble technique based on Decision Trees, but instead of training just a single decision tree, it trains many of them and then computes the average of the outputs. Please, bear in mind that a RF with default parameters involves training 100 trees. You can estimate by hand how long it is going to take, and if it is excessive, you can lower the number of decision trees in the ensemble. 

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
# help('sklearn.ensemble.RandomForestRegressor')
#<WRITE CODE HERE FOR RANDOM FORESTS>
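
The averaging described above can be checked directly on a fitted forest. This is a small sketch on synthetic data (the `toy_*` names are hypothetical): it averages the predictions of the individual trees by hand and compares the result with the prediction of the forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustrative only)
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 5)
y_toy = 2.0 * X_toy[:, 0] + 0.1 * rng.randn(100)

# A small forest so that the example runs quickly
toy_rf = RandomForestRegressor(n_estimators=10, random_state=0)
toy_rf.fit(X_toy, y_toy)

# Each fitted tree is available in estimators_; average their predictions by hand
per_tree = np.array([tree.predict(X_toy) for tree in toy_rf.estimators_])
manual_average = per_tree.mean(axis=0)

# Compare the hand-made average with the forest's own prediction
print(np.allclose(manual_average, toy_rf.predict(X_toy)))
```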

Do hyper-parameter tuning for Random Forests. Their main hyper-parameter is **n_estimators**, which is the number of decision trees in the ensemble. Check some values around the default value (such as 50, 100, 150, ...). Please bear in mind that this is going to take time. In case you want to use other hyper-parameters, please ask the teacher.

In [None]:
#<WRITE CODE HERE FOR RANDOM FORESTS HYPER-PARAMETER TUNING>

Train a Gradient Tree Boosting (GB) model with default parameters. GB is also an ensemble technique based on Decision Trees. In this case, the second decision tree tries to fix the mistakes of the first decision tree. The third decision tree tries to fix the mistakes of the first two decision trees. And so on.

Please, bear in mind that a GB with default parameters involves training 100 trees. You can estimate by hand how long it is going to take, and if it is excessive, you can lower the number of decision trees in the ensemble. 

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
gb = GradientBoostingRegressor()
# help('sklearn.ensemble.GradientBoostingRegressor')
#<WRITE CODE HERE FOR GRADIENT BOOSTING>
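
The idea of each tree fixing the mistakes of the previous ones can be sketched by hand. The loop below is a simplified illustration for squared-error loss, not the exact implementation inside **GradientBoostingRegressor**: every new tree is fitted to the residuals of the current ensemble. The data, the `learning_rate` value, and the tree depth are arbitrary choices for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only): a noisy sine curve
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 1) * 6.0
y_toy = np.sin(X_toy[:, 0]) + 0.1 * rng.randn(200)

learning_rate = 0.1
prediction = np.full(len(y_toy), y_toy.mean())  # start from a constant model
train_mae = [np.mean(np.abs(y_toy - prediction))]

for _ in range(20):
    residuals = y_toy - prediction            # mistakes of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_toy, residuals)                # the next tree learns those mistakes
    prediction += learning_rate * tree.predict(X_toy)
    train_mae.append(np.mean(np.abs(y_toy - prediction)))

print("Training MAE before boosting: {:.3f}".format(train_mae[0]))
print("Training MAE after 20 trees:  {:.3f}".format(train_mae[-1]))
```

Note that this measures the error on the training data only; for model selection you still need the validation set.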

Do hyper-parameter tuning for Gradient Boosting. Its main hyper-parameter is **n_estimators**, which is the number of decision trees in the ensemble. Check some values around the default value (such as 50, 100, 150, ...). Please bear in mind that this is going to take time. In case you want to use other hyper-parameters, please ask the teacher.

In [None]:
#<WRITE CODE HERE FOR GRADIENT BOOSTING HYPER-PARAMETER TUNING>

At this point, you should know which model performs best, and what hyper-parameters to use. Please, evaluate that best performing model on the test set.

In [None]:
#<WRITE CODE HERE FOR BEST MODEL EVALUATION ON THE TEST SET>

# 3. ATTRIBUTE SELECTION

This section is more open-ended than the previous ones, and I offer less guidance. It is definitely harder, but you can always ask the teacher. 

You have to answer the following question: 

- "Are all 550 input attributes actually necessary in order to get a good model? Is it possible to have an accurate model that uses fewer than 550 variables? How many? Is it enough to have the attributes for the actual Sotavento location? (13th in the grid)"

In order to answer this question:

1) Go through the "Attribute Selection" ipython notebook, and understand the main ideas about **SelectKBest** and **Pipeline**.

2) Use **SelectKBest** and **Pipeline** (and whatever else you need) in order to find a subset of attributes that allows you to build an accurate Decision Tree model. We are going to use Decision Trees here because they are faster (even if Random Forests or Gradient Boosting performed better in previous sections). Please note that you cannot just copy/paste from the "Attribute Selection" notebook. You will have to think about how to use the main ideas from that notebook, and change whatever needs changing.

3) Once you have decided which attributes should be used for the Decision Tree, evaluate the final model on the test dataset.
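
As a starting point for the steps above, here is a minimal sketch of how **SelectKBest** and **Pipeline** fit together, on synthetic data where only the first two of ten attributes matter. The step names (`select`, `tree`), the `f_regression` scoring function, the candidate values of `k`, and the `toy_*` names are assumptions made for this example; it is not a solution for the wind dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only): the output depends on attributes 0 and 1
rng = np.random.RandomState(0)
X_toy = rng.rand(120, 10)
y_toy = 3.0 * X_toy[:, 0] + 2.0 * X_toy[:, 1] + 0.1 * rng.randn(120)
toy_partitions = [(np.arange(0, 80), np.arange(80, 120))]

# Attribute selection and the model are chained, so selection is
# fitted on the training rows only
toy_pipe = Pipeline([('select', SelectKBest(f_regression)),
                     ('tree', DecisionTreeRegressor(random_state=0))])

# The number of selected attributes is tuned like any other hyper-parameter
toy_search = GridSearchCV(toy_pipe,
                          {'select__k': [1, 2, 5, 10]},
                          scoring='neg_mean_absolute_error',
                          cv=toy_partitions)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
# Positions of the attributes kept by the best pipeline
selected = toy_search.best_estimator_.named_steps['select'].get_support(indices=True)
print(selected)
```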


In [None]:
#<USE AS MANY CELLS AS YOU NEED>