In [36]:
from __future__ import unicode_literals

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# 1. Preparing data - 28 weeks of life of the banana plant
---

In [37]:
wind_speed = pd.read_csv('../../../../data/raw/FincaPorvenir/Metereologico/28-weeks_January-12_August-03_2018/'
                         'Velocidad-del-viento_January-12_August-03_2018.csv')

In [38]:
print(wind_speed.shape)
wind_speed.head()

(9752, 2)


Unnamed: 0,Fecha:,Velocidad del viento (Km/h)
0,2018-01-12 00:17:28,0.0
1,2018-01-12 00:47:28,0.0
2,2018-01-12 01:17:27,0.0
3,2018-01-12 01:47:27,0.0
4,2018-01-12 02:17:28,0.0


## 1.1. We check whether this dataset has null (`NaN`) values
---

In [39]:
print(wind_speed.isnull().any())
wind_speed.isnull().values.any()

Fecha:                         False
Velocidad del viento (Km/h)    False
dtype: bool


False

The dataset has no null values.

## 1.2. Selecting the relevant feature column
---

Since the **`Fecha:`** column is not a numerical value, we leave it out so that it does not interfere **with our subsequent scaling**. We only take the values (samples) of the **`Velocidad del viento (Km/h)`** column and assign them to the `wind_speed_array` array, as follows:

In [40]:
wind_speed_array = wind_speed.iloc[:, 1].values

In [41]:
# Reshape wind_speed_array into a single-column 2-D array
wind_speed_array = wind_speed_array.reshape(-1, 1)
wind_speed_array

array([[0.        ],
       [0.        ],
       [0.        ],
       ...,
       [0.        ],
       [0.        ],
       [4.21052632]])
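
The effect of `reshape(-1, 1)` is easier to see on a small array. The sketch below uses made-up values (not taken from the wind speed file) to show how a 1-D array becomes a single-column 2-D array, which is the layout scikit-learn transformers expect.

```python
import numpy as np

# Toy 1-D array standing in for one column of measurements (illustrative values)
values_1d = np.array([0.0, 1.5, 4.2])
print(values_1d.shape)

# -1 lets NumPy infer the number of rows; 1 fixes a single column
values_2d = values_1d.reshape(-1, 1)
print(values_2d.shape)
```

The data itself is unchanged; only the shape goes from `(n,)` to `(n, 1)`.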

## 1.3 Generating descriptive statistics for the dataset
---

In [42]:
col = ['Velocidad del viento (Km/h)']
wind_speed_values_df = pd.DataFrame(wind_speed_array, columns=col)
wind_speed_values_df_describe = wind_speed_values_df.describe()

In [43]:
wind_speed_values_df_describe

Unnamed: 0,Velocidad del viento (Km/h)
count,9752.0
mean,2.216788
std,3.084594
min,0.0
25%,0.0
50%,1.511811
75%,3.245074
max,68.571429


In [44]:
# Export the descriptive statistics to CSV (comma-separated values) and JSON (JavaScript Object Notation).
# The path pieces are adjacent string literals, so no '\n' is inserted into the file name.
wind_speed_values_df_describe.to_csv('../../../../data/interim/WindSpeed/28-weeks_January-12_August-03_2018/'
                                     'Wind-Speed_Describe_January-12_August-03.csv', sep=',', header=True, index=True)
wind_speed_values_df_describe.to_json('../../../../data/interim/WindSpeed/28-weeks_January-12_August-03_2018/'
                                      'Wind-Speed_Describe_January-12_August-03.json')
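
Output paths like the ones above can also be assembled with `pathlib`, which avoids stray characters creeping in when strings are concatenated by hand. This is only a sketch: the directory and file names below are hypothetical placeholders, not the notebook's real paths.

```python
from pathlib import Path

# Hypothetical directory and file name, used only for illustration
out_dir = Path("data") / "interim" / "WindSpeed"
csv_path = out_dir / "Wind-Speed_Describe.csv"

print(csv_path.name)
print(csv_path.suffix)
```

A `Path` object can be passed directly to `DataFrame.to_csv` and `DataFrame.to_json`.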

# 2. Creating Wind Speed Training and Testing datasets

We have a **`wind_speed_values_df`** dataset with 9752 sample rows.

In [45]:
print(wind_speed_values_df.shape)
wind_speed_values_df.head()

(9752, 1)


Unnamed: 0,Velocidad del viento (Km/h)
0,0.0
1,0.0
2,0.0
3,0.0
4,0.0


We'll divide it into two different datasets:

- Training dataset
- Testing dataset

This is done with the [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html  "sklearn.model_selection.train_test_split") function.

`train_test_split` accepts array-like data (NumPy arrays among other types). Here we work with a NumPy array, so we convert the **`wind_speed_values_df`** dataframe as follows:

In [46]:
# numpy_wind_speed_values = wind_speed_values_df.reset_index().values
numpy_wind_speed_values = wind_speed_values_df.values
numpy_wind_speed_values

array([[0.        ],
       [0.        ],
       [0.        ],
       ...,
       [0.        ],
       [0.        ],
       [4.21052632]])

We build the following datasets from the **`numpy_wind_speed_values`** array:

- `wind_speed_values_train`, which is the training matrix
- `wind_speed_values_test`, the testing matrix

We use the [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html  "sklearn.model_selection.train_test_split") function to create the training and testing datasets. Its parameters are:

![alt text](https://cldup.com/hukPWvvLxt-3000x3000.png "sklearn.model_selection.train_test_split")


- The first parameter **should be an array**, so we pass **`numpy_wind_speed_values`**, which contains the wind speed feature column.

In this way we pass all the data (9752 sample rows), from which `wind_speed_values_train` and `wind_speed_values_test` will be created.

- The `test_size=0.5` parameter would mean a 50% split: half of the data goes to the test dataset and the other half goes to the training dataset.

A good choice for the test size **is usually 0.2 (20%), 0.25, or even 0.3 (30%).**

In some rare cases it goes up to 40%, but almost never 0.5 (50%).

We choose 20%, which means that 20% of the 9752 samples (observations) go to the test dataset; in this case **9752 * 0.2 = 1950.4, which is rounded up to 1951 samples for the test dataset**


- The `train_size` parameter is the training dataset size. Since **`test_size + train_size = 1 or 100%`**, 
it is not necessary to include it: if we set `test_size = 0.2`, the remaining 
data goes to `train_size`, that is, **0.8 or 80%**

This means **9752 - 1951 = 7801 sample rows in the training dataset.**


- `random_state` is the seed for the random number generator that shuffles the data before it is split.

If this parameter is not passed, the split is still random, but it draws on NumPy's global random state, so it can differ from one run to the next.
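
The split sizes and the role of `random_state` described above can be checked with a short sketch. It uses a placeholder array with the same number of rows as the wind speed data; the seed value `42` is an arbitrary assumption for illustration and is not used elsewhere in this notebook.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 9752
test_size = 0.2

# The test set size is rounded up; the training set takes the remainder
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test
print(n_test, n_train)

# Placeholder data with the same number of rows as the wind speed array
data = np.arange(n_samples, dtype=float).reshape(-1, 1)

# Passing the same (arbitrary) seed twice yields the same split
train_a, test_a = train_test_split(data, test_size=test_size, random_state=42)
train_b, test_b = train_test_split(data, test_size=test_size, random_state=42)
print(train_a.shape, test_a.shape)
print(np.array_equal(test_a, test_b))
```

Fixing a seed is optional; it mainly matters when the exported training and testing files need to be reproducible.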

### 2.1  Creating the wind speed training and testing datasets
---

- `wind_speed_values_train`, which is the training matrix
- `wind_speed_values_test`, the testing matrix

In [47]:
wind_speed_values_train, wind_speed_values_test = train_test_split(numpy_wind_speed_values, test_size = 0.2)

In [48]:
# We have 7801 rows in wind_speed_values_train
print(type(wind_speed_values_train))
print("The dimensionality of wind speed training dataset is: " +'\n' , wind_speed_values_train.shape)
print('\n')

# And we have 1951 rows in wind_speed_values_test
print(type(wind_speed_values_test))
print("The dimensionality of wind speed testing dataset is: " +'\n' , wind_speed_values_test.shape)

<class 'numpy.ndarray'>
The dimensionality of wind speed training dataset is: 
 (7801, 1)


<class 'numpy.ndarray'>
The dimensionality of wind speed testing dataset is: 
 (1951, 1)


The more the model learns the correlations in the training set, the better it should predict the results in the test set.

But if the model memorizes the correlations of the training set instead of learning the underlying pattern, much like someone who learns by heart without understanding, it will have trouble predicting what happens on the test set, because the correlations it has learned are specific to the training data and do not carry over to new data. This is called overfitting.

The really important thing is to understand that we need to have two different datasets:

- Training set with which the ML model learns
- Test set, on which we test whether the ML model correctly learned the correlations


---
##  3. Feature scaling of the wind speed training and testing datasets
---

[This article](http://benalexkeen.com/feature-scaling-with-scikit-learn/  "Feature Scaling with scikit-learn") is a great reference for exploring the feature scaling methods
in scikit-learn:

- `StandardScaler` assumes that the data is normally distributed within each feature. If the data is not normally distributed, it is not the best choice for scaling.

- `MinMaxScaler` is probably the best-known scaling algorithm. It rescales each feature to a given range, 0 to 1 by default; another range such as -1 to 1 can be requested through the `feature_range` parameter.

Min-max scaling works best in cases where standard scaling may not work properly. If the distribution is not Gaussian or the standard deviation is very small, min-max scaling is a good option.
However, it is sensitive to outliers, so if there are outliers in the data it is better to consider robust scaling.

- `RobustScaler` is similar to min-max scaling, except that it uses the interquartile range instead of the minimum and maximum, which makes it robust to outliers.

- `Normalizer` scales each sample (row) individually, dividing its values by the magnitude (norm) of that row in n-dimensional space, where n is the number of features.
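
To make the differences between these scalers concrete, the following sketch applies them to a toy column containing one deliberately large value. The numbers are made up for illustration and are not taken from the wind speed data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler

# Toy single-feature column with one outlier (illustrative values)
x = np.array([[0.0], [1.0], [2.0], [3.0], [100.0]])

# Column-wise scalers: each one is fitted on the column and then applied to it
results = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    results[type(scaler).__name__] = scaler.fit_transform(x).ravel()
    print(type(scaler).__name__, results[type(scaler).__name__].round(3))

# Normalizer works per row (sample), so it needs more than one feature to be meaningful
rows = np.array([[3.0, 4.0], [1.0, 0.0]])
unit_rows = Normalizer().fit_transform(rows)
print(unit_rows)
```

With min-max scaling the outlier compresses the other values close to 0, while `RobustScaler` keeps them spread around the median.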


---

We will use min-max scaling for the wind speed data, on the assumption that the standard deviation is very small, that there are no atypical values, and that the data does not follow a normal distribution (these assumptions still have to be checked).
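
One possible way to carry out the check on atypical values is the 1.5 × IQR rule (Tukey's fences). That rule is an assumption introduced here, not something the notebook applies. The sketch plugs in the quartiles and the maximum reported by `describe()` in section 1.3.

```python
# Quartiles and maximum copied from the describe() output in section 1.3
q1 = 0.0
q3 = 3.245074
max_value = 68.571429

# Interquartile range and the upper fence of the 1.5 * IQR rule
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

print(round(upper_fence, 6))
print(max_value > upper_fence)
```

Under this rule the reported maximum lies well above the upper fence, so the assumption about atypical values deserves a closer look before settling on a scaler.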

We apply min-max scaling with a target range that **will be between 0 and 1**, using a [MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html "sklearn.preprocessing.MinMaxScaler") object, which transforms each feature (in this case the `Velocidad del viento (Km/h)` column) individually to the given range.

As a result of fitting, it generates these attributes:

![alt text](https://cldup.com/lTIv4HXgTk-3000x3000.png "sklearn.preprocessing.MinMaxScaler")
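
For a target range of 0 to 1, min-max scaling amounts to `(x - min) / (max - min)` computed per feature. The sketch below checks that formula against `MinMaxScaler` on a toy column; the values are illustrative and not taken from the wind speed data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature column (illustrative values)
x = np.array([[0.0], [15.0], [30.0], [60.0]])

# Manual min-max scaling to the range [0, 1]
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same result with scikit-learn; fit() stores the per-feature minimum and maximum
scaler = MinMaxScaler(feature_range=(0, 1)).fit(x)
print(scaler.data_min_, scaler.data_max_)
print(np.allclose(manual, scaler.transform(x)))
```

`data_min_` and `data_max_` are two of the attributes that become available once the scaler has been fitted.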

### 3.1 We apply minimum and maximum feature scaling to the wind speed training dataset

In [49]:
# We provide a base scale range
scaler = MinMaxScaler(feature_range=(0, 1))

print("Remember our wind speed training data " + '\n', wind_speed_values_train)

Remember our wind speed training data 
 [[0.        ]
 [1.89224704]
 [4.43076923]
 ...
 [4.3768997 ]
 [0.        ]
 [1.75609756]]


With the [fit](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler.fit  "MinMaxScaler.fit") method
we compute the minimum and maximum values of the `wind_speed_values_train` dataset, to be used in the subsequent scaling.

`fit` returns the fitted scaler, which we assign to the `minmax_scale_training` variable.

In [50]:
minmax_scale_training = scaler.fit(wind_speed_values_train.astype(float))
# print(minmax_scale_training.data_max_)
# http://terrapinssky.blogspot.com/2017/10/pythonresolved-dataconversionwarning.html

Then we apply the [transform](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler.transform "MinMaxScaler.transform") method to scale the data using the minimum and maximum computed by `fit`.

With this step, the `wind_speed_values_train` data are scaled to the selected range, **0 to 1**.

In [51]:
# transform wind_speed_values_train data to the minimum and maximum scale
win_speed_minmax_training = minmax_scale_training.transform(wind_speed_values_train)

In [52]:
print ("And, these are our scaled data: " + '\n')
win_speed_minmax_training

And, these are our scaled data: 



array([[0.        ],
       [0.03153745],
       [0.07384615],
       ...,
       [0.07294833],
       [0.        ],
       [0.02926829]])

In [53]:
print('Wind Speed Training dataset. Minimum value after MinMaxScaler:\nWind Speed={:.1f}'
      .format(win_speed_minmax_training[:,0].min()))

print('Wind Speed Training dataset. Maximum value after MinMaxScaler:\nWind Speed={:.1f}'
      .format(win_speed_minmax_training[:,0].max()))

Wind Speed Training dataset. Minimum value after MinMaxScaler:
Wind Speed=0.0
Wind Speed Training dataset. Maximum value after MinMaxScaler:
Wind Speed=1.0


Our MinMaxScaler-normalized training dataset is therefore the `win_speed_minmax_training` NumPy array.

- We export this array to comma separated values 


In [54]:
win_speed_train_df = pd.DataFrame(win_speed_minmax_training, columns=col)

In [55]:
win_speed_train_df

Unnamed: 0,Velocidad del viento (Km/h)
0,0.000000
1,0.031537
2,0.073846
3,0.069364
4,0.000000
5,0.073620
6,0.114286
7,0.018168
8,0.024267
9,0.062500


In this way we have the scaled training dataset `win_speed_train_df`, and we export it to a .csv file

In [56]:
win_speed_train_df.to_csv('../../../../data/processed/WindSpeed/28-weeks_January-12_August-03_2018/'
                          'Wind-Speed_Normalized_TRAINING_January-12_August-03.csv', sep=',', header=True, index=False)

### 3.2 We apply minimum and maximum feature scaling to the wind speed testing dataset

In [57]:
# We provide a base scale range
scaler = MinMaxScaler(feature_range=(0, 1))

print("Remember our wind speed testing data " + '\n', wind_speed_values_test)

Remember our wind speed testing data 
 [[1.5078534 ]
 [0.        ]
 [2.15246637]
 ...
 [6.12765957]
 [2.53521127]
 [0.        ]]


In [58]:
minmax_scale_test = scaler.fit(wind_speed_values_test.astype(float))
# transform wind_speed_values_test data to the minimum and maximum scale
wind_speed_minmax_test = minmax_scale_test.transform(wind_speed_values_test)

In [59]:
print ("And, these are our testing scaled data: " + '\n')
wind_speed_minmax_test

And, these are our testing scaled data: 



array([[0.02198953],
       [0.        ],
       [0.03139013],
       ...,
       [0.0893617 ],
       [0.03697183],
       [0.        ]])

In [60]:
print('Wind speed Testing dataset. Minimum value after MinMaxScaler:\nWind speed={:.1f}'
      .format(wind_speed_minmax_test[:,0].min()))

print('Wind speed Testing dataset. Maximum value after MinMaxScaler:\nWind speed={:.1f}'
      .format(wind_speed_minmax_test[:,0].max()))

Wind speed Testing dataset. Minimum value after MinMaxScaler:
Wind speed=0.0
Wind speed Testing dataset. Maximum value after MinMaxScaler:
Wind speed=1.0


Our MinMaxScaler-normalized testing dataset is therefore the `wind_speed_minmax_test` NumPy array.

- We export this array to comma separated values

In [61]:
wind_speed_test_df = pd.DataFrame(wind_speed_minmax_test, columns=col)

In [62]:
print(wind_speed_test_df.shape)
wind_speed_test_df.head()

(1951, 1)


Unnamed: 0,Velocidad del viento (Km/h)
0,0.02199
1,0.0
2,0.03139
3,0.048951
4,0.0


In [63]:
wind_speed_test_df.to_csv('../../../../data/processed/WindSpeed/28-weeks_January-12_August-03_2018/'
                          'Wind-Speed_Normalized_TESTING_January-12_August-03.csv', sep=',', header=True, index=False)
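
Section 3.2 fits a second `MinMaxScaler` on the testing data, so the training and testing sets are each scaled by their own minimum and maximum. A common alternative, shown here only as a sketch on toy values and not as what this notebook does, is to fit the scaler once on the training data and reuse it for the testing data, so that both sets share the same scale.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training and testing columns (illustrative values)
train = np.array([[0.0], [2.0], [10.0]])
test = np.array([[1.0], [5.0], [12.0]])

# Fit on the training data only, then apply the same transformation to both sets
scaler = MinMaxScaler(feature_range=(0, 1)).fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

With this approach, testing values above the training maximum map to values greater than 1, which is expected rather than an error.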