### Submission Instructions

Just fill in the markdown and code cells below with your arguments and functions, and run the Python lines given. Make sure the notebook works fine by executing `Kernel/Restart & Run All`.
  
Once the notebook is ready,
1. Create a folder named `afi_last_name1_last_name2` with the team's last names.

2. Put in that folder:

* a file `mp_afi_last_name1_last_name2.ipynb` with the cells below completed. Make sure it works by executing Kernel/Restart & Run All.
* a file `mp_afi_last_name1_last_name2.html` with an html rendering of the previous .ipynb file (just apply File / Download as HTML after a correct run of Kernel/Restart & Run All).
* a file `mp_afi_last_name1_last_name2.pdf` with a pdf print of the html file **without any code**.

3. Compress the folder into an `afi_last_name1_last_name2.7z` 7z (or zip) file.

**Very important!!!**

Make sure you follow the file naming conventions above; the miniproject won't be graded until that is so.

## Recommendations in notebook writing

Notebooks are a great tool for data and model exploration, but the trial-and-error process tends to leave a lot of Python garbage in them.

Once these tasks are done and you have arrived at your final ideas and insights on the problem under study, the notebook should be **thoroughly cleaned** and should **concentrate on the insights and conclusions** without, of course, throwing away the good work done.

Below there are a few guidelines about this.

* Put the useful bits of your code as functions in a **Python module** (plus a script, if needed) that is imported at the beginning of the notebook.
* Of course that module should be **properly documented** and **formatted** (try to learn about PEP 8 if you are going to write a lot of Python).
* Leave in the notebook **as little code as possible**, ideally one- or two-line cells calling a function, plotting results or so on.
* **Avoid boilerplate code**. If needed, put it in a module.
* Include in the notebook some way to **hide/display the code** (as shown below).
* The displayed information **should be just that, informative**. So forget about large tables, long output cells, dataframe or array displays and so on.
* Emphasize **insights and conclusions**, using as much markdown as needed to clarify and explain them.
* Make sure that **cells are numbered consecutively starting at 1.**
* And, of course, make sure that **there are no errors left**. To avoid these last pitfalls, run `Kernel/Restart Kernel and Run All Cells`.

And notice that whoever reads your notebook is likely to toggle off your code and consider just the markdown cells. Because of this, once you feel that your notebook is finished,
* let it rest for one day, 
* then open it up, toggle off the code 
* and read it to check **whether it makes sense to you**.

If this is not the case, **the notebook is NOT finished!!!**

Following these rules you are much more likely to get good grades at school (and possibly also larger bonuses at work).

**IMPORTANT AND JUST IN CASE: before turning in your work, please REMOVE FROM IT THE PREVIOUS TWO CELLS**

In [1]:
from IPython.display import HTML

HTML('''
<script>code_show=true; 

function code_toggle() {
    if (code_show){
    $('div.input').hide();
    } else {
    $('div.input').show();
    }
    code_show = !code_show
} 

$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to show or hide your raw code."></form>
''')

# Wind Energy Prediction
We want to predict the wind energy production on a farm using wind speed and direction information.

The aim of this wind power forecasting problem is to predict the wind power generation 24 h ahead for a wind farm in Australia.

Attribute Information:
The features include forecasts of the projections of the wind vector on the west-east (U) and south-north (V) axes, at two heights, 10 and 100 m above ground level, plus the corresponding absolute wind speeds.

Data for a period of approximately **nine months** are given in a csv file with headers

`TIMESTAMP,TARGETVAR,U10,V10,U100,V100,v10,v100`

where

* TIMESTAMP contains day/hour information.
* TARGETVAR is the wind energy production normalized to a [0, 100] range.
* U10,V10,U100,V100 are the U and V wind components in m/s at heights 10 and 100.
* v10,v100 are the absolute wind speeds in m/s at heights 10 and 100.
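
As a quick sanity check on these columns, the sketch below recomputes an absolute speed from its U and V components. It assumes that the absolute speed is the Euclidean norm of the (U, V) vector; the two rows are made-up values, not taken from the csv file, and `v10_check` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for two rows of the csv file (not real data).
sample = pd.DataFrame({"U10": [3.0, -1.5], "V10": [4.0, 2.0]})

# Assumption: absolute speed = Euclidean norm of the (U, V) wind vector.
sample["v10_check"] = np.hypot(sample["U10"], sample["V10"])
print(sample)
```

On the real data, comparing such a recomputed column with `v10` (for instance with `np.allclose`) would show whether this assumption holds.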

The dataset we will use is an adaptation of those available on the Kaggle page https://www.kaggle.com/c/GEF2012-wind-forecasting.

In [2]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [3]:
import time
import pickle
import gzip

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold, GridSearchCV

from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor

from sklearn.metrics import mean_absolute_error, mean_squared_error

import joblib

## Data Loading

We load the csv file using its first column as a `datetime` index.

In [4]:
# Load the csv, parsing the first column (TIMESTAMP) as a datetime index.
df_0 = pd.read_csv('..\\w_e.csv', index_col=0, parse_dates=True)
# All columns after TARGETVAR are features.
l_vars = df_0.columns[1:]
print(l_vars)
# Copy so that adding a column does not trigger a SettingWithCopyWarning.
df = df_0[l_vars].copy()
df['target'] = df_0['TARGETVAR']

print("nRows: %d\tnColumns: %d\n" % (df.shape[0], df.shape[1]))
print("Columns:\t", np.array(df.columns))

Index(['U10', 'V10', 'U100', 'V100', 'v10', 'v100'], dtype='object')
nRows: 6576	nColumns: 7

Columns:	 ['U10' 'V10' 'U100' 'V100' 'v10' 'v100' 'target']


In [5]:
df.index

DatetimeIndex(['2012-01-01 01:00:00', '2012-01-01 02:00:00',
               '2012-01-01 03:00:00', '2012-01-01 04:00:00',
               '2012-01-01 05:00:00', '2012-01-01 06:00:00',
               '2012-01-01 07:00:00', '2012-01-01 08:00:00',
               '2012-01-01 09:00:00', '2012-01-01 10:00:00',
               ...
               '2012-09-30 15:00:00', '2012-09-30 16:00:00',
               '2012-09-30 17:00:00', '2012-09-30 18:00:00',
               '2012-09-30 19:00:00', '2012-09-30 20:00:00',
               '2012-09-30 21:00:00', '2012-09-30 22:00:00',
               '2012-09-30 23:00:00', '2012-10-01 00:00:00'],
              dtype='datetime64[ns]', name='TIMESTAMP', length=6576, freq=None)

# Data Exploration, Visualization and Correlations

* Compute descriptive statistics.
* Draw boxplots, pairplots and histograms.
* Compute and present correlations. 

Give your comments and conclusions after each step.
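
A minimal sketch of the numerical part of these steps, using a small synthetic frame rather than the project data. The frame `demo` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the project dataframe (values are made up).
rng = np.random.default_rng(0)
n = 200
v100 = rng.uniform(0, 15, n)
demo = pd.DataFrame({
    "v10": 0.7 * v100 + rng.normal(0, 0.5, n),
    "v100": v100,
    "target": np.clip(v100 / 15 + rng.normal(0, 0.05, n), 0, 1),
})

# Descriptive statistics, one row per variable.
stats = demo.describe().T
print(stats[["mean", "std", "min", "max"]].round(2))

# Pearson correlations, then each feature's correlation with the target.
corr = demo.corr()
target_corr = corr["target"].drop("target").sort_values(ascending=False)
print(target_corr.round(2))
```

For the graphical part, `demo.boxplot()`, `demo.hist()` and `sns.pairplot(demo)` are the usual one-line calls; which variables to show together is a choice left to the analysis.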

## Descriptive analysis

## Boxplots

## Histograms and scatterplots

## Correlations

## Overall conclusions

# MLPRegressor

Use three-fold cross-validation over the entire sample to estimate the hyperparameters of a pipelined MLPR.
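
One possible shape for this search is sketched below on synthetic data. The step names `scaler` and `mlp`, the network size, the solver and the `alpha` grid are all assumptions chosen to keep the example fast; they are not prescribed by the assignment.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the wind data.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=150)

# Scaling happens inside the pipeline, so it is refit on each training fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPRegressor(hidden_layer_sizes=(20,), solver="lbfgs",
                         max_iter=500, random_state=0)),
])

# Hypothetical grid over the L2 penalty only; pipeline parameters are
# addressed as <step name>__<parameter>.
param_grid = {"mlp__alpha": [1e-3, 1e-1, 1e1]}
search = GridSearchCV(pipe, param_grid, cv=KFold(n_splits=3),
                      scoring="neg_mean_absolute_error")
search.fit(X, y)

print(search.best_params_)
print("CV MAE of best model:", -search.best_score_)
```

The score is negated because scikit-learn maximizes scores, so errors are exposed as `neg_mean_absolute_error`.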

## Analyzing GridSearchCV results

Check the adequacy of the best hyperparameters.
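
A common first check, sketched here under assumed settings, is whether the best value lies on the edge of the explored grid, in which case the grid probably needs to be extended in that direction. The data, the model size and the `alphas` grid are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Synthetic data and a deliberately small model to keep the search fast.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=120)

alphas = [1e-4, 1e-2, 1e0, 1e2]  # hypothetical grid
mlp = MLPRegressor(hidden_layer_sizes=(10,), solver="lbfgs",
                   max_iter=300, random_state=0)
search = GridSearchCV(mlp, {"alpha": alphas}, cv=3,
                      scoring="neg_mean_absolute_error").fit(X, y)

# One row per grid point: mean and spread of the CV score, and its rank.
results = pd.DataFrame(search.cv_results_)[
    ["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results)

# If the best value sits on a grid boundary, the optimum may lie outside it.
best_alpha = search.best_params_["alpha"]
on_edge = best_alpha in (min(alphas), max(alphas))
print("best alpha:", best_alpha, "| on grid edge:", on_edge)
```

Plotting `mean_test_score` against `param_alpha` on a logarithmic axis gives the same information at a glance.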

## Testing the MLPR model

Do it over the entire dataset using `cross_val_predict`, get the CV MAE and draw the appropriate plots.
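
A minimal sketch of this step on synthetic data follows. The pipeline and its hyperparameters are placeholders for whatever model the grid search selected.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=150)

# Placeholder model; in the project this would carry the best hyperparameters.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPRegressor(hidden_layer_sizes=(20,), solver="lbfgs",
                         alpha=0.1, max_iter=500, random_state=0)),
])

# Each sample is predicted by a model that did not see it during training.
y_pred = cross_val_predict(model, X, y, cv=KFold(n_splits=3))
mae = mean_absolute_error(y, y_pred)
print("CV MAE:", mae)
```

With `y_pred` available, a scatter plot of `y` against `y_pred` and a time plot of both series over a short window are natural candidates for the requested plots.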

## MLP Residual histograms and relationship with targets

Show and discuss them.
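
The sketch below shows one numerical way to look at residuals and their dependence on the target. The arrays `y` and `y_pred` are simulated stand-ins (true values plus noise, clipped to the target range), not model output, and the four target bins are an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Simulated targets and "predictions"; clipping mimics a bounded target.
rng = np.random.default_rng(0)
y = rng.uniform(0, 100, 500)
y_pred = np.clip(y + rng.normal(0, 10, 500), 0, 100)

residuals = y - y_pred

# Histogram counts of the residuals (what a histogram plot would display).
counts, edges = np.histogram(residuals, bins=20)

# Mean and spread of the residuals within each target range.
target_bins = pd.cut(y, bins=[0, 25, 50, 75, 100], include_lowest=True)
by_target = (pd.Series(residuals)
             .groupby(target_bins, observed=True)
             .agg(["mean", "std", "count"]))
print(by_target.round(2))
```

In the notebook, `plt.hist(residuals, bins=20)` and `plt.scatter(y, residuals)` would give the graphical counterparts of `counts` and `by_target`.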

# SV Regressor

Repeat the previous steps with an SVR model with the same structure.

We will work with Gaussian kernels, so we have to set two hyperparameters, `C` and `gamma`, plus the `epsilon` insensitivity. That makes **three** hyperparameters to explore, so search times may increase considerably.  
To shorten fit times we downsample the original data to one value every three hours. For this you can use pandas methods such as `resample`, `asfreq` and `dropna`.
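
A hedged sketch of both ideas, on a synthetic hourly series: `asfreq` keeps one row every three hours, and a small grid then explores the three SVR hyperparameters. The column names, grid values and step names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic hourly data standing in for the project dataframe.
rng = np.random.default_rng(0)
idx = pd.date_range("2012-01-01 01:00", periods=240, freq="h")
hourly = pd.DataFrame({"v100": rng.uniform(0, 15, 240)}, index=idx)
hourly["target"] = hourly["v100"] / 15 + rng.normal(0, 0.05, 240)

# Keep every third hour; dropna removes timestamps missing from the data.
df_3h = hourly.asfreq("3h").dropna()
print(len(hourly), "->", len(df_3h), "rows")

# Gaussian (RBF) kernel SVR inside a scaling pipeline.
pipe = Pipeline([("scaler", StandardScaler()), ("svr", SVR(kernel="rbf"))])

# Hypothetical, deliberately tiny grid over the three hyperparameters.
param_grid = {
    "svr__C": [1, 10],
    "svr__gamma": [0.1, 1],
    "svr__epsilon": [0.01, 0.1],
}
search = GridSearchCV(pipe, param_grid, cv=KFold(n_splits=3),
                      scoring="neg_mean_absolute_error")
search.fit(df_3h[["v100"]], df_3h["target"])
print(search.best_params_)
```

`asfreq` simply selects existing rows, whereas `resample("3h").mean()` would average each three-hour window; which of the two is appropriate depends on how the downsampled data are meant to be interpreted.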

# MLPR and SVR comparison

Compare them and draw the appropriate conclusions.

# Trying to improve the estimator

We may try to improve the MLPR and SVR results by enlarging the feature set with the **square and cube powers of the absolute velocities**.

Redo the previous MLPR and SVR analysis and conclusions over the enlarged dataset with the same analysis structure.
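
Building the enlarged feature set could look like the sketch below. The suffixes `_sq` and `_cube` are hypothetical names, and the two-row frame is made-up data.

```python
import pandas as pd

# Made-up stand-in with only the two absolute-speed columns.
demo = pd.DataFrame({"v10": [2.0, 3.0], "v100": [4.0, 5.0]})

# Add the square and cube of each absolute speed as new columns.
enlarged = demo.copy()
for col in ["v10", "v100"]:
    enlarged[f"{col}_sq"] = demo[col] ** 2
    enlarged[f"{col}_cube"] = demo[col] ** 3

print(enlarged)
```

The cube is a natural candidate because the power carried by the wind grows with the cube of its speed.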

# MLPR model over enlarged features

## Conclusions on the enlarged MLPR model

# SVR model over enlarged features

## Conclusions on the enlarged SVR model

# Final conclusions