In [3]:
from __future__ import unicode_literals

import pandas as pd
import numpy as np
from sklearn import preprocessing
from matplotlib import pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split

# 1. Preparing the data - Last 12 weeks of the banana plant's life
---
The data covers May 4 to August 3, 2018.

In [6]:
wind_direction_l12w = pd.read_csv('../../../../data/raw/FincaPorvenir/Metereologico/Latest_12-weeks_May-4_August-03_2018/' \
                                 'Direccion-del-viento_May-4_August-03_2018.csv', )

In [7]:
print(wind_direction_l12w.shape)
wind_direction_l12w.head()

(4398, 2)


Unnamed: 0,Fecha:,Direccion del viento (Pos)
0,2018-05-04 00:07:38,SO
1,2018-05-04 00:37:39,SO
2,2018-05-04 01:07:38,SO
3,2018-05-04 01:37:39,NO
4,2018-05-04 02:07:38,NO


## 1.1. Checking whether the dataset contains null (`NaN`) values
---

In [8]:
print(wind_direction_l12w.isnull().any())
wind_direction_l12w.isnull().values.any()

Fecha:                        False
Direccion del viento (Pos)    False
dtype: bool


False

Both checks return `False`, so the dataset has no null values.

## 1.2. Selecting the relevant column
---

The dataset has a column called **`Fecha:`** that holds timestamps rather than numerical values. We leave it out so that it does not interfere **with our subsequent scaling**, and keep only the samples of the **`Direccion del viento (Pos)`** column, which we assign to the array `wind_direction_l12w_pos` as follows:

In [11]:
wind_direction_l12w_pos = wind_direction_l12w.iloc[:, 1].values
print(type(wind_direction_l12w_pos))
# Selecting the column with .iloc and then taking .values returns a numpy array

wind_direction_l12w_pos


<class 'numpy.ndarray'>


array(['SO', 'SO', 'SO', ..., 'SE', 'S', 'SE'], dtype=object)

- We convert the `wind_direction_l12w_pos` numpy array into a pandas dataframe

In [12]:
col=['Direccion del viento (Pos)']
wind_direction_l12w_pos_df = pd.DataFrame(wind_direction_l12w_pos, columns=col)
print(type(wind_direction_l12w_pos_df))
print("The dimensionality of wind_direction dataframe is: " +'\n' , wind_direction_l12w_pos_df.shape)
wind_direction_l12w_pos_df.head()

<class 'pandas.core.frame.DataFrame'>
The dimensionality of wind_direction dataframe is: 
 (4398, 1)


Unnamed: 0,Direccion del viento (Pos)
0,SO
1,SO
2,SO
3,NO
4,NO


So far, we have isolated the **`Direccion del viento (Pos)`** values. Although their preparation is not finished yet, we export them to a comma-separated values (CSV) file as an intermediate checkpoint.

In [14]:
# This interim dataset is not the final, canonical dataset used for modeling
# (adjacent string literals are concatenated, so no '+' or '\n' is needed in the path)
wind_direction_l12w_pos_df.to_csv('../../../../data/interim/WindDirection/Latest_12-weeks_May-4_August-03_2018/'
                             'Wind-Direction_May-4_August-03_2018_without-dates.csv', sep=',', header=True, index=False)
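Export paths like this one can also be assembled with `pathlib`, which joins path segments explicitly instead of relying on string concatenation. The following is a minimal sketch under stated assumptions: the temporary directory and the toy dataframe are hypothetical stand-ins for the project's `data/interim` folder and the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical base directory: a temporary folder stands in for data/interim
base_dir = Path(tempfile.mkdtemp())
out_dir = base_dir / 'WindDirection' / 'Latest_12-weeks_May-4_August-03_2018'
out_dir.mkdir(parents=True, exist_ok=True)

out_file = out_dir / 'Wind-Direction_May-4_August-03_2018_without-dates.csv'

# Toy data (not the real dataset) written with the same to_csv options
toy = pd.DataFrame({'Direccion del viento (Pos)': ['SO', 'NO', 'S']})
toy.to_csv(out_file, sep=',', header=True, index=False)

# Read the file back to confirm the export round-trips
restored = pd.read_csv(out_file)
print(restored.shape)
```

Reading the file back right after writing it is a cheap way to confirm that the path was valid and that the data survived the export.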

## 1.3 Encoding categorical values
---
In our `wind_direction_l12w_pos_df` dataset, the **`Direccion del viento (Pos)`** column is a categorical variable: it takes 8 distinct, non-numeric values:

- `SO` - Southwest (Sur oeste)
- `SE` - Southeast (Sur este)
- `S` - South (Sur)
- `N` - North (Norte)
- `NO` - Northwest (Nor oeste)
- `NE` - Northeast (Nor este)
- `O` - West (Oeste)
- `E` - East (Este)

These values are strings. The equations of a model only work with numeric values, so we need to encode the categorical variable, that is, represent each text label with numerical values.
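Before encoding, it can help to confirm which categories a column actually contains. The sketch below uses a small hypothetical sample rather than the real dataset; on the real data the same calls would be applied to the `Direccion del viento (Pos)` column.

```python
import pandas as pd

# Toy sample of wind directions (hypothetical values, not the real dataset)
sample = pd.Series(['SO', 'SO', 'NO', 'S', 'SE', 'N', 'NO'],
                   name='Direccion del viento (Pos)')

categories = sorted(sample.unique())  # distinct labels present in the column
counts = sample.value_counts()        # number of rows per label

print(categories)
print(counts)
```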

- ### How do we encode them?

For each of these categories, we'll create a column using the [pandas.get_dummies()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) function, which generates dummy variables.

- ### What are dummy variables?

First, we should understand **the meaning of a dummy variable**, taking [this answer](https://discuss.analyticsvidhya.com/t/what-is-a-dummy-variable/18960 "Dummy variables") as a reference:

> A dummy variable is a fictitious, artificial variable and is created to 

> represent an attribute with two or more different categories or levels

That is the case for the **`Direccion del viento (Pos)`** column.

- ### Why are they used?

> Regression analysis treats all independent (X) variables in the analysis as numerical. 

> Numerical variables are interval or ratio scale variables whose values are directly comparable, 

>e.g. ‘10 is twice as much as 5’, or
> ‘3 minus 1 equals 2’.

> Often, however, you might want to include an attribute or nominal scale variable such

> as **‘Product Brand’** or **‘Type of Defect’** in your study.

> Say you have three types of defects, **numbered ‘1’, ‘2’ and ‘3’**. 

>  In this case, ‘3 minus 1’ doesn’t mean anything ...  **you can’t subtract defect 1 from defect 3**

Then:

> The numbers here are used to indicate or identify the levels of **‘Defect Type’** 

> and do not have intrinsic meaning of their own. Dummy variables are created in this 

> situation to ‘trick’ the regression algorithm into correctly analyzing attribute variables.

So, in the context of the **`Direccion del viento (Pos)`** column, we have 8 values:

- `SO` - Southwest (Sur oeste)
- `SE` - Southeast (Sur este)
- `S` - South (Sur)
- `N` - North (Norte)
- `NO` - Northwest (Nor oeste)
- `NE` - Northeast (Nor este)
- `O` - West (Oeste)
- `E` - East (Este)

That means 8 categories, so instead of a single column named **`Direccion del viento (Pos)`** we could have:

![alt text](https://cldup.com/uwafdLR9-f-3000x3000.png "Wind direction as categorical variables")

Each column represents one wind direction and holds a **`1`** or a **`0`** depending on the input value.

For example, the **`NO`** column holds a **`1`** when the input value is `NO` and a **`0`** otherwise, as marked by the red square in the figure below.

![alt text](https://cldup.com/83-rU9hxBD-3000x3000.png "Wind direction as categorical variables")
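The same idea can be sketched on a handful of rows. This is a minimal illustration with hypothetical toy values, not the real dataset; `dtype=int` is passed so that the columns hold `0`/`1` integers regardless of the pandas version (recent versions return booleans by default).

```python
import pandas as pd

# Toy column with three distinct directions (hypothetical values)
toy = pd.DataFrame({'direction': ['SO', 'NO', 'S', 'NO']})

# One column per distinct value; 1 marks the direction observed in that row
dummies = pd.get_dummies(toy['direction'], dtype=int)
print(dummies)
```

The `NO` column is `1` only in the rows whose original value was `NO`, which is the pattern highlighted in the figure above.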

### 1.3.1 Using [pandas.get_dummies()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) to generate dummy variables
---

**`pandas.get_dummies()`** accepts an array-like object, a pandas Series, or a pandas DataFrame as its `data` parameter.

In [15]:
print(type(wind_direction_l12w_pos))
print(wind_direction_l12w_pos.shape)
wind_direction_l12w_pos

<class 'numpy.ndarray'>
(4398,)


array(['SO', 'SO', 'SO', ..., 'SE', 'S', 'SE'], dtype=object)

In [17]:
print(type(wind_direction_l12w_pos_df))
print(wind_direction_l12w_pos_df.shape)
wind_direction_l12w_pos_df.head()

<class 'pandas.core.frame.DataFrame'>
(4398, 1)


Unnamed: 0,Direccion del viento (Pos)
0,SO
1,SO
2,SO
3,NO
4,NO


In [18]:
# Applying get_dummies
wind_direction_l12w_pos_df = pd.get_dummies(wind_direction_l12w_pos_df)

In [19]:
print(wind_direction_l12w_pos_df.shape)
print(type(wind_direction_l12w_pos_df))
wind_direction_l12w_pos_df.head(10)

(4398, 8)
<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Direccion del viento (Pos)_E,Direccion del viento (Pos)_N,Direccion del viento (Pos)_NE,Direccion del viento (Pos)_NO,Direccion del viento (Pos)_O,Direccion del viento (Pos)_S,Direccion del viento (Pos)_SE,Direccion del viento (Pos)_SO
0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,1
3,0,0,0,1,0,0,0,0
4,0,0,0,1,0,0,0,0
5,0,0,0,1,0,0,0,0
6,0,0,0,0,0,1,0,0
7,0,0,0,0,0,0,1,0
8,0,0,0,0,0,0,1,0
9,0,1,0,0,0,0,0,0


The result is a pandas dataframe **`wind_direction_l12w_pos_df`** in which the wind direction data is represented by one column per direction value.

Additionally, note that every value in **`wind_direction_l12w_pos_df`** is already either **`0`** or **`1`**, so no scaling or normalization is needed.
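Both properties of the encoded data can be checked programmatically. The following sketch runs on a small hypothetical sample; on the real data the same two checks would be applied to `wind_direction_l12w_pos_df`.

```python
import numpy as np
import pandas as pd

# Toy one-hot encoded data (hypothetical values, not the real dataset)
toy = pd.get_dummies(pd.Series(['SO', 'NO', 'S', 'NO', 'E']), dtype=int)

# Every cell should be 0 or 1, so the values already lie in the [0, 1] range
values_are_binary = bool(np.isin(toy.values, [0, 1]).all())

# Exactly one column should be active per row
one_hot_rows = bool((toy.sum(axis=1) == 1).all())

print(values_are_binary, one_hot_rows)
```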

In [20]:
wind_direction_l12w_pos_df = wind_direction_l12w_pos_df.rename(columns={
    "Direccion del viento (Pos)_E": "E",
    "Direccion del viento (Pos)_N": "N",
    "Direccion del viento (Pos)_NE": "NE",
    "Direccion del viento (Pos)_NO": "NO",
    "Direccion del viento (Pos)_O": "O",
    "Direccion del viento (Pos)_S": "S",
    "Direccion del viento (Pos)_SE": "SE",
    "Direccion del viento (Pos)_SO": "SO",   
})
wind_direction_l12w_pos_df.head(10)

Unnamed: 0,E,N,NE,NO,O,S,SE,SO
0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,1
3,0,0,0,1,0,0,0,0
4,0,0,0,1,0,0,0,0
5,0,0,0,1,0,0,0,0
6,0,0,0,0,0,1,0,0
7,0,0,0,0,0,0,1,0
8,0,0,0,0,0,0,1,0
9,0,1,0,0,0,0,0,0


In [23]:
# Export this dataframe to comma separated value file:
wind_direction_l12w_pos_df.to_csv('../../../../data/processed/WindDirection/Latest_12-weeks_May-4_August-03_2018/'
                             'Wind-Direction_May-12_August-03_2018_DummiesValues.csv', sep=',', header=True, index=False)

## 2. Creating wind direction Training and Testing datasets

We have a **`wind_direction_l12w_pos_df`** dataset with 4398 sample rows.

We'll divide it into two different datasets:

- Training dataset
- Testing dataset

This is done with the [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html  "sklearn.model_selection.train_test_split") function.

`train_test_split` accepts lists, numpy arrays, scipy sparse matrices, and pandas dataframes. Here we work with a numpy array, so we convert the **`wind_direction_l12w_pos_df`** dataframe as follows:

In [25]:
# numpy_wind_direction_l12w_pos = wind_direction_l12w_pos_df.reset_index().values
numpy_wind_direction_l12w_pos = wind_direction_l12w_pos_df.values

In [26]:
# My numpy_wind_direction_l12w_pos variable now is a numpy array
print(numpy_wind_direction_l12w_pos.shape)
print(type(numpy_wind_direction_l12w_pos))
numpy_wind_direction_l12w_pos

(4398, 8)
<class 'numpy.ndarray'>


array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 1, 0]], dtype=uint8)

We build the following datasets from the **`numpy_wind_direction_l12w_pos`** array:

- `wind_direction_train_l12w`, which is the training matrix
- `wind_direction_test_l12w`, the testing matrix

We use the [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html  "sklearn.model_selection.train_test_split") function to create the training and testing datasets. Its parameters are:

![alt text](https://cldup.com/hukPWvvLxt-3000x3000.png "sklearn.model_selection.train_test_split")


- The first parameter **should be an array**, so we pass **`numpy_wind_direction_l12w_pos`**, which contains all the wind direction columns.

In this way, we pass all the data (4398 sample rows), from which `wind_direction_train_l12w`
and `wind_direction_test_l12w` will be created.

- The `test_size` parameter sets the fraction of the data assigned to the test dataset. For example, `test_size=0.5` means a 50% split: 
half of the data goes to the test dataset and the other half 
goes to the training dataset.

A good choice for the test size **is usually 0.2 (20%), 0.25 (25%), or even 0.3 (30%).** 

In some rare cases 40% is used, but almost never 0.5 (50%).

We choose 20%, which means that 20% of the 4398 samples go to the test dataset: 

**4398 * 0.2 = 879.6, which scikit-learn rounds up to 880 samples for the test dataset.**


- The `train_size` parameter is the training dataset size. Since **`test_size + train_size = 1 (100%)`**, 
it is not necessary to include it: if we set `test_size = 0.2`, the remaining 
data is assigned to `train_size`, that is, **0.8 or 80%**.

This means **4398 * 0.8 = 3518.4, that is, 3518 sample rows in the training dataset.**


- `random_state` is a seed for the random number generator that shuffles the data before splitting. 

If this parameter is not passed, the split uses numpy's global random state, so the rows assigned to each dataset change from run to run.
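The size arithmetic and the effect of `random_state` can be sketched on a placeholder array with the same number of rows. The array contents and the seed value `42` are arbitrary assumptions chosen for illustration; they are not taken from the notebook.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder array with 4398 rows (hypothetical contents, not the real data)
data = np.arange(4398 * 2).reshape(4398, 2)

# Two splits with the same seed (42 is an arbitrary choice)
train_a, test_a = train_test_split(data, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(data, test_size=0.2, random_state=42)

print(train_a.shape, test_a.shape)     # resulting sizes of each dataset
print(math.ceil(4398 * 0.2))           # test size computed by rounding up
print(np.array_equal(test_a, test_b))  # whether both splits selected the same rows
```

Fixing `random_state` is mainly useful when the exported training and testing files need to be regenerated identically later.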

### 2.1  Creating Training and testing wind direction datasets
---

- `wind_direction_train_l12w`, which is the training matrix
- `wind_direction_test_l12w`, the testing matrix

In [28]:
wind_direction_train_l12w, wind_direction_test_l12w = train_test_split(numpy_wind_direction_l12w_pos, test_size = 0.2)

# We have 3518 rows in wind_direction_train_l12w
print(type(wind_direction_train_l12w))
print("The dimensionality of wind_direction training dataset is: " +'\n' , wind_direction_train_l12w.shape)
print('\n')

# And we have 880 rows in wind_direction_test_l12w
print(type(wind_direction_test_l12w))
print("The dimensionality of wind_direction testing dataset is: " +'\n' , wind_direction_test_l12w.shape)


As the model progressively learns the correlations in the training set, its predictions on the test set improve.

But if the model memorizes the correlations of the training set instead of learning the underlying logic, much like a person who learns by heart without understanding, it will have trouble predicting the test set, because what it memorized does not generalize to new data. This is called overfitting.


The really important thing is to understand that we need to have two different datasets

- Training set with which the ML model learns
- Test set, on which we test whether the ML model correctly learned the correlations

###  2.2. Export the training and testing dataset
---

#### - Wind direction Training dataset

In [31]:
a = np.asarray(wind_direction_train_l12w)

In [33]:
cols = ['E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO']
wind_direction_train_l12w_df = pd.DataFrame(a, columns=cols,)

This gives us the `wind_direction_train_l12w_df` dataset, which holds the shuffled training wind direction data.

In [35]:
wind_direction_train_l12w_df.head()

Unnamed: 0,E,N,NE,NO,O,S,SE,SO
0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,1,0,0
2,0,0,0,0,0,1,0,0
3,0,0,0,0,0,1,0,0
4,0,0,0,1,0,0,0,0


In [37]:
# We export the wind_direction_train dataset to comma separated value file:
wind_direction_train_l12w_df.to_csv('../../../../data/processed/WindDirection/Latest_12-weeks_May-4_August-03_2018/'
                             'Wind-Direction-TRAINING_May-4_August-03_2018_DummiesValues.csv', sep=',', header=True, index=False)

In [38]:
wind_direction_train_l12w_df.shape

(3518, 8)

#### - Wind direction Testing dataset

In [39]:
b = np.asarray(wind_direction_test_l12w)
wind_direction_test_l12w_df = pd.DataFrame(b, columns=cols,)

In [40]:
print(wind_direction_test_l12w_df.shape)
wind_direction_test_l12w_df.head()

(880, 8)


Unnamed: 0,E,N,NE,NO,O,S,SE,SO
0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,1
2,0,0,0,1,0,0,0,0
3,0,0,0,0,0,1,0,0
4,1,0,0,0,0,0,0,0


This gives us the `wind_direction_test_l12w_df` dataset, which holds the shuffled testing wind direction data.

In [42]:
# We export the wind_direction_test dataset to a comma-separated values file:
wind_direction_test_l12w_df.to_csv('../../../../data/processed/WindDirection/Latest_12-weeks_May-4_August-03_2018/'
                             'Wind-Direction-TESTING_May-4_August-03_2018_DummiesValues.csv', sep=',', header=True, index=False)

## FINISH