In [1]:
from __future__ import unicode_literals
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# 1. IMPORTANT FACTS IN THE PROCESS OF BANANA FRUIT PRODUCTION.
---

## 1.1. Lifetime of the banana plant (from sowing until removal):
---
The whole lifetime of the banana plant, **from the moment it is sown until it is removed**, is 28 weeks.

**28 weeks** is the time that elapses until the harvest is produced. When this happens, the banana clusters are collected and the plant is removed from the soil.


## 1.2. Period in which the plant produces the banana cluster:
---
**The last 12 weeks** of those 28 total weeks of life, **that is, from the sixteenth week of life onward**, is the period the banana plant takes to produce clusters.
The plant needs all of these last 12 weeks to produce the clusters, so the ideal time to collect the fruit is when the 28 weeks have been completed.

Although clusters are also harvested when they are 8, 9, 10 or 11 weeks old, we will take the 12-week-old cluster (**the last of the 28 weeks**) as the ideal cutting time for the bunch.

![alt text](https://cldup.com/vvGo194rz4.jpeg "Harvest time of the banana bunch")


Based on the above, the time windows for the input data of **Phreatic Level, Precipitation, Temperature, Luminosity, Wind Direction and Wind Speed** for our models are generated as follows:

A. Cluster weight data are taken **per lot and per week.**

Starting from the date or time range in which the clusters were weighed that week, we roll back 12 weeks and take the input data of the aforementioned variables.

B. Data are also taken from **28 weeks before the week in which the clusters were weighed**, to cover the whole life of the plant.

# 2. Week of cluster weight data taken as reference
---
We use as reference the banana cluster weight data collected **between July 27th and August 3rd, 2018**.

From there, we go back **12 and 28 weeks** to take the aforementioned input data.

In other words, data for **Phreatic Level, Precipitation, Temperature, Luminosity, Wind Direction and Wind Speed** will be collected from January 12th to August 3rd to cover the entire 28-week life of the plant.

Data for the same variables will also be collected from May 5th to August 3rd to cover the last 12 weeks of the plant's life (the last 12 of the 28 weeks), which is the period in which the banana bunch/cluster is formed.
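The roll-back described above can be sketched with the standard library. This is a minimal illustration that assumes the windows are counted back from the first day of the reference week (July 27th); the helper name `lookback_window` is hypothetical. Under that assumption the 28-week window starts on January 12th, as stated above, while the 12-week window starts on May 4th, one day before the May 5th quoted above.

```python
from datetime import date, timedelta

# Reference harvest week taken from the text above
WEEK_START = date(2018, 7, 27)
WEEK_END = date(2018, 8, 3)


def lookback_window(weeks, start=WEEK_START, end=WEEK_END):
    """Return (first_day, last_day) of the window that begins `weeks`
    weeks before the start of the reference week and ends with it."""
    return start - timedelta(weeks=weeks), end


full_life = lookback_window(28)       # whole life of the plant
cluster_growth = lookback_window(12)  # cluster formation period

print(full_life[0].isoformat(), "->", full_life[1].isoformat())
print(cluster_growth[0].isoformat(), "->", cluster_growth[1].isoformat())
```

The same two date pairs could later be used to filter the weather and soil datasets by timestamp.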

---

The [**SIOMA**](https://www.siomapp.com) platform also provides consolidated data: average, maximum and minimum weights, total weight, kg per hectare, and total kg for each farm.

This format is closer to the presentation of results that the outputs of the different models or techniques we apply should eventually reproduce.

Initially we will work with a one-week range, which is the productivity frequency used in the region.

<s>We will focus on the data of a single lot of the 25, to go little by little generalizing a solution of the approach of our problem </s> 

We will focus on the data of all the lots of the Porvenir farm, 25 in total.

The consolidated productivity data for the week of July 27th to August 3rd, 2018 are the following:


In [2]:
cluster_consolidated = pd.read_csv('../../../data/raw/FincaPorvenir/Cultivos/Consolidado-27July-03Aug_2018/Desde_2018-07-27 00_00_00_Hasta_2018-08-03 23_59_59 - Consolidado.csv', )

In [3]:
cluster_consolidated

Unnamed: 0,Finca,# Racimos,# 8,# 9,# 10,# 11,# 12,Peso promedio (Kg),Maximo (Kg),Minimo (Kg),Peso Total,Kg/Ha,Area (Ha),Unnamed: 13,Dia,Semana
0,Porvenir,7646,1,791,4186,2469,199,23.4,57.1,4.6,178831,844.7,211.7,Desde,2018-07-27 00:00:00,30.0
1,,,,,,,,,,,,,,Hasta,2018-08-03 23:59:59,31.0
2,Lote,# Racimos,# 8,# 9,# 10,# 11,# 12,Peso promedio (Kg),Maximo (Kg),Minimo (Kg),Peso Total,(Kg/Ha),Area (Ha),,,
3,5,418,0,64,285,69,0,24.9,39,8.5,10426,1226.6,8.5,,,
4,6,1225,0,225,769,179,52,23.1,41,4.6,28253,3762,7.5,,,
5,7,461,0,1,170,290,0,24.7,41.5,6,11387,906.9,12.6,,,
6,8,297,0,0,47,250,0,23.8,37.5,12.5,7083,769.9,9.2,,,
7,9,391,0,37,236,118,0,26.8,44,9,10474,1235.1,8.5,,,
8,10,284,0,27,192,55,10,24,42,9,6815,1013.5,6.7,,,
9,11,313,0,0,168,145,0,23.5,41,11,7360,1198.5,6.1,,,
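As a quick reading aid for the consolidated table, the sketch below recomputes two of the farm-level columns from the others. It assumes that `Peso promedio (Kg)` is the total weight divided by the number of clusters and that `Kg/Ha` is the total weight divided by the area; the platform's exact definitions are not shown here, so treat this as a plausibility check only.

```python
# Farm-level figures copied from row 0 of the consolidated table above
n_clusters = 7646
total_weight_kg = 178831
area_ha = 211.7

# Assumed definitions of the derived columns
avg_weight_kg = total_weight_kg / n_clusters
kg_per_ha = total_weight_kg / area_ha

print(round(avg_weight_kg, 1))  # compare with "Peso promedio (Kg)"
print(round(kg_per_ha, 1))      # compare with "Kg/Ha"
```

Both rounded values agree with the 23.4 and 844.7 reported in the farm row.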


# 3. Exploring banana cluster data - From July 27th to August 3rd

Below are the weight data of the individual bunches/clusters collected **during the week from July 27th to August 3rd, 2018**:

In [9]:
racimitos = pd.read_csv('../../../data/interim/Cultivos/UltimosRacimos/racimitos-27July_To_03August-SQL-JoinedData.csv')

In [11]:
print(racimitos.shape)
racimitos

(7646, 7)


Unnamed: 0,peso,fecha,nombreLote,numeroLote,lat,lng,nombreFinca
0,23.09,2018-07-27 07:08:58,5,5,7.766231,-76.762676,Porvenir
1,30.50,2018-07-27 07:09:01,5,5,7.766231,-76.762676,Porvenir
2,19.50,2018-07-27 07:09:02,5,5,7.766231,-76.762676,Porvenir
3,25.50,2018-07-27 07:09:04,5,5,7.766231,-76.762676,Porvenir
4,26.50,2018-07-27 07:09:06,5,5,7.766231,-76.762676,Porvenir
5,26.00,2018-07-27 07:09:08,5,5,7.766231,-76.762676,Porvenir
6,26.50,2018-07-27 07:09:10,5,5,7.766231,-76.762676,Porvenir
7,25.50,2018-07-27 07:09:12,5,5,7.766231,-76.762676,Porvenir
8,27.00,2018-07-27 07:09:14,5,5,7.766231,-76.762676,Porvenir
9,27.00,2018-07-27 07:09:16,5,5,7.766231,-76.762676,Porvenir


We can see that there are clusters (racimitos) from lot 5 through lot 25; that is, lots 5 to 25, sequentially, are the lots in which banana clusters were harvested **during the week from July 27th to August 3rd**.


## 3.1. Selecting the relevant columns
---

In [14]:
racimitos_df = racimitos[['peso','fecha','numeroLote']]
# Adjacent string literals are joined without inserting a newline into the path
racimitos_df.to_csv('../../../data/interim/Cultivos/UltimosRacimos/racimitos_AllLots/'
                    'racimitos-27July_To_03August_AllLots.csv')

## 3.2. Selecting clusters by lot number

We want to select only the clusters belonging to each lot, so that we have one dataset per lot.

In [39]:
# grouped2 = racimitos_df.groupby('numeroLote')[['peso','fecha']]

#for key2, item2 in grouped2:
#    grouped2.get_group(key2).to_csv('../../../data/interim/Cultivos/UltimosRacimos/test/' +'\n' 
#                     'racimitos-27July_To_03August_Lot{}.csv'.format(key2), index=False)
# print("Banana clusters by lot: \n ",grouped.count())
    

In [38]:
grouped = racimitos_df.groupby('numeroLote')[['peso','fecha']]
print("Banana clusters by lot: \n ",grouped.count())
# Write one CSV file per lot (no newline inside the file path)
for key, _ in grouped:
    grouped.get_group(key).to_csv('../../../data/interim/Cultivos/UltimosRacimos/ClustersByLotNumber/'
                                  'racimitos-27July_To_03August_Lot{}.csv'.format(key), index=False)
print('Cluster weight datasets have been generated for each lot') 

Banana clusters by lot: 
              peso  fecha
numeroLote             
5            418    418
6           1225   1225
7            461    461
8            297    297
9            391    391
10           284    284
11           313    313
12           311    311
13           523    523
14           269    269
15            78     78
16           277    277
17           172    172
18           502    502
19           286    286
20           251    251
21           408    408
22           343    343
23           179    179
24           141    141
25           517    517
Cluster weight datasets have been generated for each lot
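A small consistency check on the per-lot counts printed above: the sketch below copies the counts into a dictionary (the variable name `clusters_per_lot` is ours) and verifies how many lots were harvested and that the counts add up to the number of rows in `racimitos`.

```python
# Cluster counts per lot, copied from the groupby output above
clusters_per_lot = {
    5: 418, 6: 1225, 7: 461, 8: 297, 9: 391, 10: 284, 11: 313,
    12: 311, 13: 523, 14: 269, 15: 78, 16: 277, 17: 172, 18: 502,
    19: 286, 20: 251, 21: 408, 22: 343, 23: 179, 24: 141, 25: 517,
}

n_lots = len(clusters_per_lot)
total_clusters = sum(clusters_per_lot.values())

print(n_lots, total_clusters)
```

The total matches the 7646 rows of the dataset, and the counts cover 21 lots (5 through 25).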


# 4. Preparing banana cluster data - From July 27th to August 3rd

We have the **`racimitos_df`** dataframe:

In [40]:
# We have the racimitos_df dataframe
print(racimitos_df.shape)
racimitos_df.head(10)

(7646, 3)


Unnamed: 0,peso,fecha,numeroLote
0,23.09,2018-07-27 07:08:58,5
1,30.5,2018-07-27 07:09:01,5
2,19.5,2018-07-27 07:09:02,5
3,25.5,2018-07-27 07:09:04,5
4,26.5,2018-07-27 07:09:06,5
5,26.0,2018-07-27 07:09:08,5
6,26.5,2018-07-27 07:09:10,5
7,25.5,2018-07-27 07:09:12,5
8,27.0,2018-07-27 07:09:14,5
9,27.0,2018-07-27 07:09:16,5


## 4.1. We check whether this dataset has null (`NaN`) values
---

In [41]:
print(racimitos_df.isnull().any())
racimitos_df.isnull().values.any()

peso          False
fecha         False
numeroLote    False
dtype: bool


False

The dataset has no null values.

## 4.2. Selecting the relevant columns
---

As we already know, all the cluster data in the **`racimitos_df`** dataframe range from July 27th to August 3rd. For now, we are interested in treating all the clusters as belonging to one farm, without differentiating its lots, so we drop the **`fecha`** and **`numeroLote`** columns.

So we only keep the values (samples) of the **`peso`** column and assign them to the **`racimitosPorvenir`** dataframe, as follows:

In [42]:
racimitosPorvenir = racimitos_df[['peso']]
print(racimitosPorvenir.shape)
racimitosPorvenir.head(10)

(7646, 1)


Unnamed: 0,peso
0,23.09
1,30.5
2,19.5
3,25.5
4,26.5
5,26.0
6,26.5
7,25.5
8,27.0
9,27.0


## 4.3. Generating descriptive statistics for the dataset
---

In [43]:
racimitosPorvenir_describe = racimitosPorvenir.describe()
racimitosPorvenir_describe

Unnamed: 0,peso
count,7646.0
mean,23.388795
std,5.467227
min,4.59
25%,19.5
50%,23.295
75%,27.0
max,57.09


In [46]:
# Export the descriptive statistics to CSV (comma-separated values) and JSON (JavaScript Object Notation)
# Adjacent string literals are joined without inserting a newline into the path
racimitosPorvenir_describe.to_csv('../../../data/interim/Cultivos/UltimosRacimos/'
                     'racimitos_Describe_27July_To_03August.csv', sep=',', header=True, index=True)
racimitosPorvenir_describe.to_json('../../../data/interim/Cultivos/UltimosRacimos/'
                     'racimitos_Describe_27July_To_03August.json')

# 5. Creating racimitos weight training and testing datasets

We have a **`racimitosPorvenir`** dataset with 7646 sample rows.

In [47]:
print(racimitosPorvenir.shape)
racimitosPorvenir.head()

(7646, 1)


Unnamed: 0,peso
0,23.09
1,30.5
2,19.5
3,25.5
4,26.5


We'll divide it into two different datasets:

- Training dataset
- Testing dataset

This is done with the [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html  "sklearn.model_selection.train_test_split") function.

`train_test_split` accepts array-like inputs such as NumPy arrays (pandas dataframes are accepted too); here we convert the 

**`racimitosPorvenir`** dataframe to a NumPy array as follows:

In [48]:
# numpy_racimitosPorvenir = racimitosPorvenir.reset_index().values
numpy_racimitosPorvenir = racimitosPorvenir.values

In [49]:
numpy_racimitosPorvenir

array([[23.09],
       [30.5 ],
       [19.5 ],
       ...,
       [13.5 ],
       [20.5 ],
       [20.  ]])

We build the following datasets from the **`numpy_racimitosPorvenir`** array:

- `racimitosPorvenir_train`, which is the training matrix
- `racimitosPorvenir_test`, the testing matrix

We use the [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html  "sklearn.model_selection.train_test_split") function to create the training and testing datasets. Its parameters are:

![alt text](https://cldup.com/hukPWvvLxt-3000x3000.png "sklearn.model_selection.train_test_split")


- The first parameter **should be an array**, so we pass **`numpy_racimitosPorvenir`**, which contains the single feature column of cluster weights.

In this way we pass all the data (7646 sample rows), and `racimitosPorvenir_train`

and `racimitosPorvenir_test` are created from them.

- `test_size` is the proportion of the data assigned to the test dataset. For example, `test_size=0.5` means a 50% division: 
half of the data goes to the test dataset and the other half 
goes to the training dataset.

A good choice for the test size **is usually 0.2 (20%), 0.25 or even 0.3 (30%).** 

In some rare cases we will use 40%, but almost never 0.5 (50%).

We choose 20%, which means that 20% of the 7646 samples or observations go to the test dataset; 

in this case **7646 * 0.2 = 1529.2, which scikit-learn rounds up to 1530 samples or records for the test dataset.**


- The `train_size` parameter is the training dataset size. Since **`test_size + train_size = 1 or 100%`**, 
it isn't necessary to include it: if we set `test_size = 0.2`, the remaining 
data go to the training dataset, that is, **0.8 or 80%**.

This means that **7646 * 0.8 = 6116.8, that is, 6116 training sample rows** once the 1530 test rows are set aside.


- `random_state` is the seed of the random number generator used to shuffle the data before splitting. 

If this parameter is not passed, the data are still shuffled randomly, but using NumPy's global random state, so the resulting datasets differ from one run to the next.
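To make the roles of `test_size` and `random_state` concrete, here is a plain-Python stand-in for a shuffled split. The helper `simple_split` is hypothetical and is not scikit-learn's implementation; it only assumes that the test size is rounded up and that a fixed seed makes the shuffle repeatable.

```python
import math
import random


def simple_split(rows, test_size=0.2, seed=None):
    """Shuffle `rows` and split them into (train, test) lists.
    Hypothetical helper, not scikit-learn's implementation."""
    n_test = math.ceil(len(rows) * test_size)  # test size is rounded up
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)      # seed plays the role of random_state
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(7646))  # stand-in for the 7646 cluster weights
train_a, test_a = simple_split(rows, test_size=0.2, seed=42)
train_b, test_b = simple_split(rows, test_size=0.2, seed=42)

print(len(train_a), len(test_a))  # sizes of the two partitions
print(test_a == test_b)           # same seed, same partition
```

With the same seed both calls return identical partitions; with `seed=None` each call would shuffle differently.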

## 5.1.  Creating Training and testing racimitos weight datasets
---

- `racimitosPorvenir_train`, which is the training matrix
- `racimitosPorvenir_test`, the testing matrix

In [50]:
racimitosPorvenir_train, racimitosPorvenir_test = train_test_split(numpy_racimitosPorvenir, test_size = 0.2)

In [51]:
# We have 6116 rows in racimitosPorvenir_train
print(type(racimitosPorvenir_train))
print("The dimensionality of racimitos weight training dataset is: " +'\n' , racimitosPorvenir_train.shape)
print('\n')

# And we have 1530 rows in racimitosPorvenir_test
print(type(racimitosPorvenir_test))
print("The dimensionality of racimitos weight testing dataset is: " +'\n' , racimitosPorvenir_test.shape)

<class 'numpy.ndarray'>
The dimensionality of racimitos weight training dataset is: 
 (6116, 1)


<class 'numpy.ndarray'>
The dimensionality of racimitos weight testing dataset is: 
 (1530, 1)


In this way, as the model progressively learns the correlations in the training set, its predictions on the test set improve.

But if the model learns the correlations of the training set too much by heart, that is, it memorizes instead of understanding, it will have trouble predicting what happens in the test set: it has learned correlations that are specific to the training data, has not captured the underlying logic, and cannot make good predictions. This is called overfitting.

The really important thing is to understand that we need two different datasets:

- Training set, with which the ML model learns
- Test set, on which we test whether the ML model correctly learned the correlations


---
#  6. Feature Scaling racimitos weight training and testing dataset  
---

[This article](http://benalexkeen.com/feature-scaling-with-scikit-learn/  "Feature Scaling with scikit-learn") is a great reference for exploring the feature scaling methods
in scikit-learn:

- `StandardScaler` assumes that the data are normally distributed within each feature. If the data are not normally distributed, it is not the best scaling alternative.

- `MinMaxScaler` is probably the most famous scaling algorithm. It rescales each feature to a given range, by default 0 to 1 (another range, such as -1 to 1, can be set through `feature_range`).

Min-max scaling works best in cases where standard scaling may not work properly: if the distribution is not Gaussian or the standard deviation is very small, min-max scaling is the better choice.
However, it is sensitive to outliers, so if there are outliers in the data it is better to consider robust scaling.

- `RobustScaler` is similar to min-max scaling, except that it uses the interquartile range instead of the maximum and minimum, which makes it robust to outliers.

- `Normalizer` scales each sample by dividing its values by the sample's magnitude (norm) in n-dimensional space, where n is the number of features.
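The four methods listed above can be summarized by their formulas. The sketch below writes them in plain Python on a made-up list of weights, so that the arithmetic is visible; it is an illustration of the formulas, not the scikit-learn classes themselves, and the function names are ours. It assumes the population standard deviation for standard scaling and linearly interpolated quartiles for robust scaling.

```python
import math
import statistics

weights = [19.5, 23.0, 25.5, 27.0, 30.5]  # made-up sample of one feature


def standard_scale(xs):
    """(x - mean) / std, using the population standard deviation."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


def min_max_scale(xs):
    """(x - min) / (max - min), which maps the feature to [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def robust_scale(xs):
    """(x - median) / IQR, less affected by extreme values."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]


def normalize_row(row):
    """Divide one sample (a row of features) by its Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]


print(min_max_scale(weights))
print(robust_scale(weights))
print(normalize_row([3.0, 4.0]))
```

Note that the first three functions work column-wise (one feature at a time), while `normalize_row` works row-wise (one sample at a time).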


---

We will use min-max scaling to scale the cluster weight data, because the standard deviation is very small, 

the data have no atypical values and they do not follow a normal distribution (these assumptions still have to be checked).
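One possible way to check the "no atypical values" assumption is Tukey's 1.5 × IQR rule. This rule is our choice for illustration and is not prescribed by the notebook. The sketch uses only the quartiles and extremes reported by `describe()` in section 4.3.

```python
# Quartiles and extremes copied from the describe() output in section 4.3
q1, q3 = 19.5, 27.0
min_weight, max_weight = 4.59, 57.09

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # values below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr    # values above this are flagged as outliers

has_low_outliers = min_weight < lower_fence
has_high_outliers = max_weight > upper_fence

print(lower_fence, upper_fence)
print(has_low_outliers, has_high_outliers)
```

Under this rule both the lightest and the heaviest clusters fall outside the fences, which suggests that the assumption may not hold and that robust scaling could be worth comparing.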

We apply min-max scaling. We provide a base range, which **will be between 0 and 1**, using a

[MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html "sklearn.preprocessing.MinMaxScaler") object, which transforms each feature (in this 

case, the `peso` column) individually to the given range.

Once fitted, the scaler object exposes the following attributes: 

![alt text](https://cldup.com/lTIv4HXgTk-3000x3000.png "sklearn.preprocessing.MinMaxScaler")

## 6.1. We apply maximum and minimum feature scaling to the racimitos weight training dataset

In [52]:
# We provide a base scale range
scaler = MinMaxScaler(feature_range=(0, 1))

print("Remember our racimitos weight training data " + '\n', racimitosPorvenir_train)

Remember our racimitos weight training data 
 [[27.5]
 [26. ]
 [25. ]
 ...
 [24. ]
 [29.5]
 [18.5]]


With the [fit](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler.fit  "MinMaxScaler.fit") method
we compute the maximum and minimum values of the `racimitosPorvenir_train` dataset, to be used in the subsequent scaling. 

We assign the fitted scaler to the `minmax_scale_training` variable.

In [53]:
minmax_scale_training = scaler.fit(racimitosPorvenir_train.astype(float))
# print(minmax_scale_training.data_max_)
# http://terrapinssky.blogspot.com/2017/10/pythonresolved-dataconversionwarning.html

Then we apply the [transform](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler.transform "MinMaxScaler.transform") method to scale the data using the computed maximum and minimum values. 

With this step, the `racimitosPorvenir_train` data are scaled to the selected range of **0 to 1**.

In [54]:
# transform racimitosPorvenir_train data to the maximum and minimum scale
racimitosPorvenir_minmax_training = minmax_scale_training.transform(racimitosPorvenir_train)

In [55]:
print ("And, these are our scaled data: " + '\n')
racimitosPorvenir_minmax_training

And, these are our scaled data: 



array([[0.53390818],
       [0.49895129],
       [0.4756467 ],
       ...,
       [0.45234211],
       [0.58051736],
       [0.32416686]])

In [56]:
print('Racimitos weight Training dataset. Minimum value after MaxMinScaler:\nRacimitos weight={:.1f}'
      .format(racimitosPorvenir_minmax_training[:,0].min()))

print('Racimitos weight Training dataset. Maximum value after MaxMinScaler:\nRacimitos weight={:.1f}'
      .format(racimitosPorvenir_minmax_training[:,0].max()))

Racimitos weight Training dataset. Minimum value after MaxMinScaler:
Racimitos weight=0.0
Racimitos weight Training dataset. Maximum value after MaxMinScaler:
Racimitos weight=1.0
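The scaled values printed above can be reproduced by hand with the min-max formula `(x - min) / (max - min)`. The bounds used below (4.59 and 47.5) are an assumption: they are inferred from the printed raw and scaled values, since the notebook does not print the fitted minimum and maximum.

```python
# Bounds inferred from the printed outputs (assumption, not shown by the notebook)
train_min, train_max = 4.59, 47.5


def min_max(x, lo, hi):
    """Scale x to [0, 1] given the fitted minimum and maximum."""
    return (x - lo) / (hi - lo)


# First three raw training weights printed above
for raw in (27.5, 26.0, 25.0):
    print(raw, round(min_max(raw, train_min, train_max), 6))
```

The results agree with the first three entries of `racimitosPorvenir_minmax_training` to the printed precision.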


Then, our MinMaxScaler-normalized training dataset is the `racimitosPorvenir_minmax_training` numpy array.

- We export this array to comma-separated values

In [59]:
col = ['pesoRacimo']
racimitosPorvenir_train_df = pd.DataFrame(racimitosPorvenir_minmax_training, columns=col)

In [62]:
print(racimitosPorvenir_train_df.shape)
racimitosPorvenir_train_df.head(10)

(6116, 1)


Unnamed: 0,pesoRacimo
0,0.533908
1,0.498951
2,0.475647
3,0.161035
4,0.475647
5,0.475647
6,0.195992
7,0.370776
8,0.743649
9,0.394081


In [61]:
racimitosPorvenir_train_df[racimitosPorvenir_train_df['pesoRacimo']>1.0]

Unnamed: 0,pesoRacimo


In this way we have the scaled training dataset **`racimitosPorvenir_train_df`**, and we export it to a .csv file:

In [64]:
# Adjacent string literals are joined without inserting a newline into the path
racimitosPorvenir_train_df.to_csv('../../../data/processed/Cultivos/UltimosRacimos/'
                                 'racimitos-weight_Normalized_TRAINING_July-27_August-03.csv', sep=',', header=True, index=False)

## 6.2. We apply maximum and minimum feature scaling to the racimitos weight testing dataset


In [65]:
# We provide a base scale range
scaler = MinMaxScaler(feature_range=(0, 1))

print("Remember our racimitos weight testing data " + '\n', racimitosPorvenir_test)

Remember our racimitos weight testing data 
 [[19. ]
 [23. ]
 [32.5]
 ...
 [18.5]
 [20. ]
 [23.5]]


In [67]:
minmax_scale_test = scaler.fit(racimitosPorvenir_test.astype(float))
# transform racimitosPorvenir_test data to the maximum and minimum scale
racimitosPorvenir_minmax_test = minmax_scale_test.transform(racimitosPorvenir_test)

print ("And, these are our testing scaled data: " + '\n')
racimitosPorvenir_minmax_test

And, these are our testing scaled data: 



array([[0.25445293],
       [0.33274613],
       [0.5186925 ],
       ...,
       [0.24466628],
       [0.27402623],
       [0.34253279]])

In [68]:
print('Racimitos weight Testing dataset. Minimum value after MaxMinScaler:\nRacimitos weight={:.1f}'
      .format(racimitosPorvenir_minmax_test[:,0].min()))

print('Racimitos weight Testing dataset. Maximum value after MaxMinScaler:\nRacimitos weight={:.1f}'
      .format(racimitosPorvenir_minmax_test[:,0].max()))

Racimitos weight Testing dataset. Minimum value after MaxMinScaler:
Racimitos weight=0.0
Racimitos weight Testing dataset. Maximum value after MaxMinScaler:
Racimitos weight=1.0
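Section 6.2 fits a second `MinMaxScaler` on the test data alone. A common alternative, mentioned here as a suggestion rather than as part of the original workflow, is to reuse the scaler fitted on the training data (for example `minmax_scale_training.transform(racimitosPorvenir_test)`), so that the same weight in kilograms maps to the same scaled value in both datasets. The sketch below shows the difference by hand; all four bounds are assumptions inferred from the printed outputs.

```python
# Bounds inferred from the printed outputs (assumptions)
train_min, train_max = 4.59, 47.5  # range seen by the training scaler
test_min, test_max = 6.0, 57.09    # range seen by the test-only scaler


def min_max(x, lo, hi):
    """Scale x given a fitted minimum and maximum."""
    return (x - lo) / (hi - lo)


raw = 19.0  # first test sample printed above
with_test_bounds = min_max(raw, test_min, test_max)
with_train_bounds = min_max(raw, train_min, train_max)
print(round(with_test_bounds, 4), round(with_train_bounds, 4))

# The heaviest test cluster exceeds the training maximum,
# so with the training bounds it maps above 1
print(min_max(test_max, train_min, train_max) > 1.0)
```

The same 19.0 kg cluster gets two different scaled values depending on which bounds are used, which is why a model trained on one scale may misread data prepared on the other. Values slightly outside [0, 1] in the test set are expected when the training bounds are reused.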


Then, our MinMaxScaler-normalized testing dataset is the `racimitosPorvenir_minmax_test` numpy array.

- We export this array to comma separated values

In [69]:
racimitosPorvenir_test_df = pd.DataFrame(racimitosPorvenir_minmax_test, columns=col)

In [70]:
print(racimitosPorvenir_test_df.shape)
racimitosPorvenir_test_df

(1530, 1)


Unnamed: 0,pesoRacimo
0,0.254453
1,0.332746
2,0.518693
3,0.205520
4,0.195733
5,0.332746
6,0.293600
7,0.283813
8,0.401253
9,0.342533


In [71]:
# Adjacent string literals are joined without inserting a newline into the path
racimitosPorvenir_test_df.to_csv('../../../data/processed/Cultivos/UltimosRacimos/'
                                 'racimitos-weight_Normalized_TESTING_July-27_August-03.csv', sep=',', header=True, index=False)