# 1. Introduction
---
<h3>This notebook applies multiple machine learning techniques (using AutoML from PyCaret). Our dataset consists of weather and other variables for sugar cane crops; the goal is to predict sucrose production values.</h3>

# 2. Load Libraries
---

In [None]:
# Install and import Pycaret library
!pip install numba==0.53
!pip install pycaret
# !pip install shap

In [None]:
import pandas as pd
import numpy as np
import warnings
import time
from scipy import stats
from scipy.stats import normaltest
from sklearn.preprocessing import quantile_transform
warnings.filterwarnings('ignore')

# Import libraries for visualization and set default values.
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
plt.style.use(['seaborn'])
sns.set_theme(style="whitegrid", palette=sns.color_palette("tab10"))
sns.set_style('ticks')


np.random.seed(42)

In [None]:
# PyCaret is used to automate the machine learning workflow
from pycaret.regression import *
from pycaret.utils import version
print('Pycaret Version: ', version())

Pycaret Version:  3.3.2


In [None]:
!pip list

# 3. Load Dataset
---

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
path='/Named_data.xlsx'
weather_data_clean=pd.read_excel(path)
file = ' ml no suelo'

In [None]:
weather_data_clean

In [None]:
# Anonymize product names as 'Producto 0', 'Producto 1', ...
products = list(weather_data_clean['PRODUCTO'].unique())
anonymized = ['Producto ' + str(i) for i in range(len(products))]
weather_data_clean['PRODUCTO'] = weather_data_clean['PRODUCTO'].replace(products, anonymized)

# 4. Normality
---

Perform a normality test on our target variables (*Sac, Sac Campo* and *Sac % Caña*)

In [None]:
fig, ax = plt.subplots(1,3,figsize=(25,10))
ax[0].hist(weather_data_clean['Sac'], color='r')
ax[0].set_title('Histogram of Sac')
ax[1].hist(weather_data_clean['Sac % Caña'],color='g')
ax[1].set_title('Histogram of Sac % Caña')
ax[2].hist(weather_data_clean['Sac Campo'])
ax[2].set_title('Histogram of Sac Campo')

In [None]:
def normal_test(target):
	stat, p = normaltest(weather_data_clean[target])
	alpha =0.05
	if p > alpha:
		print(target + ' looks Gaussian')
	else:
		print(target + ' does not look Gaussian')
normal_test('Sac')
normal_test('Sac % Caña')
normal_test('Sac Campo')

The test results are not conclusive about the normality of the raw targets.
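
For reference, `scipy.stats.normaltest` is the D'Agostino-Pearson test, which combines skewness and kurtosis; a small p-value rejects the null hypothesis that the sample comes from a normal distribution. The sketch below applies the same decision rule as `normal_test` to two synthetic samples. The arrays `gaussian_sample` and `skewed_sample` are illustrative and not part of the dataset.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)
gaussian_sample = rng.normal(loc=0.0, scale=1.0, size=1000)  # drawn from a normal distribution
skewed_sample = rng.exponential(scale=1.0, size=1000)        # strongly right-skewed

alpha = 0.05
p_values = {}
for name, sample in [("gaussian_sample", gaussian_sample),
                     ("skewed_sample", skewed_sample)]:
    stat, p = normaltest(sample)
    p_values[name] = p
    verdict = "looks Gaussian" if p > alpha else "does not look Gaussian"
    print(f"{name}: statistic={stat:.2f}, p-value={p:.3g} -> {verdict}")
```

Note that `p > alpha` only means normality was not rejected at that significance level; it does not prove the data are normal.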

Using a preprocessing technique such as the *quantile transformation*, we can map our targets to an approximately normal distribution.

In [None]:
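def normalization(target):
    # Map the target to a normal distribution using its empirical quantiles
    y=weather_data_clean[target]
    y_trans = quantile_transform(y.to_frame(), output_distribution="normal", copy=True)
    return y_trans.ravel()  # flatten the (n, 1) array so it can be stored as a single column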

weather_data_clean['sac_trans'] = normalization('Sac')
weather_data_clean['sac_caña_trans']= normalization('Sac % Caña')
weather_data_clean['sac_campo_trans'] = normalization('Sac Campo')
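
Because the models are trained on the transformed targets, their predictions are also on the transformed scale. A minimal sketch, assuming one wanted to map values back to the original units: fit a `QuantileTransformer` object (the estimator class behind `quantile_transform`) and keep it, so that `inverse_transform` can be applied later. The array `y_raw` is synthetic, and `n_quantiles=500` is an illustrative choice that matches the sample size.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(42)
y_raw = rng.exponential(scale=10.0, size=500).reshape(-1, 1)  # synthetic skewed target

qt = QuantileTransformer(n_quantiles=500, output_distribution="normal", random_state=42)
y_trans = qt.fit_transform(y_raw)        # approximately standard normal
y_back = qt.inverse_transform(y_trans)   # back to the original units

print("transformed mean/std:", round(float(y_trans.mean()), 3), round(float(y_trans.std()), 3))
print("max round-trip error:", float(np.abs(y_back - y_raw).max()))
```

The transformation is monotonic, so the ranking of the samples is preserved; only the spacing between values changes.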

In [None]:
fig, ax = plt.subplots(1,3,figsize=(25,10))
ax[0].hist(weather_data_clean['sac_trans'], color='r')
ax[0].set_title('Histogram of Sac after transformation')
ax[1].hist(weather_data_clean['sac_caña_trans'],color='g')
ax[1].set_title('Histogram of Sac % Caña after transformation')
ax[2].hist(weather_data_clean['sac_campo_trans'])
ax[2].set_title('Histogram of Sac Campo after transformation')

Info about target variables after the transformation

In [None]:
weather_data_clean[['sac_trans', 'sac_campo_trans', 'sac_caña_trans']].describe()

Testing normality

In [None]:
normal_test('sac_trans')
normal_test('sac_caña_trans')
normal_test('sac_campo_trans')

sac_trans looks Gaussian
sac_caña_trans looks Gaussian
sac_campo_trans looks Gaussian


In [None]:
weather_data_clean.drop(['Sac', 'Sac % Caña', 'Sac Campo'], axis=1, inplace=True)  # remove the original targets, leaving only the transformed ones

Our targets seem more normal.

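First we hold out 10% of the rows as unseen data. PyCaret later splits the remaining 90% into train and test sets with `train_size=0.80`, so the overall split is approximately 72/18/10 rather than exactly 70/20/10.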

In [None]:
def dataset_pycaret(data_to_analyse):
  dataset = data_to_analyse.copy()
  data = dataset.sample(frac=0.90, random_state=786)
  data_unseen = dataset.drop(data.index)
  data.reset_index(inplace=True, drop=True)
  data_unseen.reset_index(inplace=True, drop=True)
  print('Data for Modeling: ' + str(data.shape))
  print('Unseen Data For Predictions: ' + str(data_unseen.shape))
  return [data,data_unseen]

data, data_unseen=dataset_pycaret(weather_data_clean)
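
The resulting proportions can be checked with a short sketch. The frame `demo` is a hypothetical stand-in for `weather_data_clean`, and the rounding of the train size is an assumption; PyCaret's internal split may round differently by a row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for weather_data_clean with 1000 rows
demo = pd.DataFrame({"x": np.arange(1000)})

modeling = demo.sample(frac=0.90, random_state=786)  # 90% passed on for modeling
unseen = demo.drop(modeling.index)                   # remaining 10% held out

train_size = 0.80                                    # value used in the setup calls
n_train = int(round(len(modeling) * train_size))
n_test = len(modeling) - n_train

shares = {
    "train": n_train / len(demo),
    "test": n_test / len(demo),
    "unseen": len(unseen) / len(demo),
}
print(shares)  # {'train': 0.72, 'test': 0.18, 'unseen': 0.1}
```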

For the next part of this notebook we are going to train different models for each of our target variables.

# Sac variable

## Setup

Using PyCaret, we set up our machine learning workflow. The numeric features (independent variables) are normalized using min-max scaling.

In [None]:
reg_trans = setup(data=data, target = 'sac_trans',train_size=0.80,
                ignore_features = ['sac_caña_trans','sac_campo_trans'],
                categorical_features =['tmprda','TIPO COS','Con Sin Mad','nm_cndcion','PRODUCTO','VAR'],
                rare_level_threshold = 0.1, combine_rare_levels = True,
                remove_multicollinearity = True, multicollinearity_threshold = 0.95,
                ignore_low_variance = True, normalize=True, normalize_method='minmax',
                remove_outliers=True)
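
With `normalize=True` and `normalize_method='minmax'`, each numeric feature is rescaled to the range [0, 1]. The sketch below shows the same rescaling with scikit-learn's `MinMaxScaler` on a hypothetical feature matrix `X`; it illustrates the formula and is not PyCaret's internal pipeline.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: two columns on very different scales
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [40.0, 1000.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Same result by hand: (x - min) / (max - min), column by column
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled)
```

Tree-based models such as the ones selected later are largely insensitive to this rescaling, but it matters for the linear and distance-based models that `compare_models` also evaluates.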


## Model Selection

Train multiple models and choose the best three (sorted by $R^2$).

In [None]:
best_trans = compare_models()

We are going to use Light Gradient Boosting Machine, Extra Trees Regressor and Random Forest Regressor to create a blended model.

Let's look at some information about our LGBM model

In [None]:
plot_model(best_trans, plot='residuals')

In [None]:
plot_model(best_trans, plot='error')

In [None]:
plot_model(best_trans, plot='feature')

## Blend Model

Create our models

In [None]:
lgbm_model_sac = create_model('lightgbm')
lgbm_model_sac = tune_model(lgbm_model_sac, n_iter=2)
et_model_sac = create_model('et')
et_model_sac = tune_model(et_model_sac, n_iter=2)
rf_model_sac = create_model('rf')
rf_model_sac = tune_model(rf_model_sac, n_iter=2)
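
By default `tune_model` runs a random search over a predefined hyperparameter grid, and `n_iter` is the number of candidates sampled, so `n_iter=2` evaluates only two configurations per model. The sketch below reproduces the idea with scikit-learn's `RandomizedSearchCV`. The synthetic data and the `param_distributions` grid are assumptions chosen for illustration, not PyCaret's actual search space.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

param_distributions = {          # illustrative grid, not PyCaret's
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=2,                    # only two candidate configurations are tried
    scoring="r2",
    cv=3,
    random_state=0,
)
search.fit(X, y)

print("candidates evaluated:", len(search.cv_results_["params"]))
print("best params:", search.best_params_)
```

With so few candidates the tuned model may not improve on the default one; a larger `n_iter` explores more of the grid at the cost of training time.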

Create blended model

In [None]:
tuned_model_sac = blend_models(estimator_list=[lgbm_model_sac, rf_model_sac, et_model_sac], choose_better=True)
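
For regression, `blend_models` combines the listed estimators into a voting regressor whose prediction is the average of the member predictions. A minimal sketch of that averaging with scikit-learn's `VotingRegressor` on synthetic data follows; `GradientBoostingRegressor` is used here only as a stand-in for LightGBM so the example has no extra dependency.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)

X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

blend = VotingRegressor(estimators=[
    ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=0)),  # stand-in for LightGBM
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("et", ExtraTreesRegressor(n_estimators=50, random_state=0)),
])
blend.fit(X, y)

# The blended prediction is the unweighted mean of the member predictions
member_preds = np.column_stack([est.predict(X) for est in blend.estimators_])
blend_pred = blend.predict(X)
print(np.allclose(blend_pred, member_preds.mean(axis=1)))
```

Averaging tends to reduce variance when the members make different errors, which is the motivation for blending three tree ensembles here.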

Hyperparameter tuning of the blended model

Some metrics about our model

In [None]:
plot_model(tuned_model_sac, plot='residuals')

In [None]:
plot_model(tuned_model_sac, plot='error')

Finalize the model. The `save_model` call, which would write it to the file sac_model.pkl, is left commented out.

In [None]:
final_sac = finalize_model(tuned_model_sac)
# save_model(final_sac, 'sac_model' )

## Predictions on test data

In [None]:
test_data = data_unseen.drop(['sac_trans', 'sac_caña_trans', 'sac_campo_trans'], axis=1)
test_pred = predict_model(final_sac, data = test_data)

In [None]:
def plot_series(time, series,i, format="-", start=0, end=None):
    #plt.figure(figsize=(20,10))
    plt.plot(time[start:end], series[start:end], format,label=i)
    plt.xlabel("Unseen Samples")
    plt.ylabel("Sucrose Field")
    plt.legend()

plt.figure(figsize=(22,5))
plot_series(data_unseen.index, data_unseen['sac_trans'],"True")
plot_series(data_unseen.index, test_pred['Label'],'Predicted')
plt.grid(False)
plt.title('Sac true vs predicted')
fig = plt.gcf()
plt.show()
fig.savefig('true vs predic sac' + file + '.png', bbox_inches='tight')

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error
r_score = r2_score(y_pred = test_pred['Label'], y_true = data_unseen['sac_trans'])
mae_score = mean_absolute_error(y_pred = test_pred['Label'], y_true = data_unseen['sac_trans'])
print('R2 on test set: {}'.format(r_score))
print('MAE on test set: {}'.format(mae_score))
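
Both metrics are simple to compute by hand, which helps when interpreting them: $R^2 = 1 - SS_{res}/SS_{tot}$ and MAE is the mean of the absolute errors. Since the targets were quantile-transformed to an approximately standard normal scale, the MAE is expressed in units of that transformed scale, not in the original sucrose units. The arrays below are hypothetical values used only to check the formulas against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true and predicted values
y_true = np.array([0.5, -1.2, 0.3, 1.8, -0.4])
y_pred = np.array([0.4, -0.9, 0.1, 1.2, -0.6])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot
mae_manual = np.mean(np.abs(y_true - y_pred))

print("R2 :", r2_manual, r2_score(y_true, y_pred))
print("MAE:", mae_manual, mean_absolute_error(y_true, y_pred))
```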

# Sac Campo variable

## Setup

Using PyCaret, we set up our machine learning workflow. The numeric features (independent variables) are normalized using min-max scaling.

In [None]:
reg_trans = setup(data=data, target = 'sac_campo_trans',train_size=0.80,
                ignore_features = ['sac_caña_trans','sac_trans'],
                categorical_features =['tmprda','TIPO COS','Con Sin Mad','nm_cndcion','PRODUCTO','VAR'],
                rare_level_threshold = 0.1, combine_rare_levels = True,
                remove_multicollinearity = True, multicollinearity_threshold = 0.95,
                ignore_low_variance = True, normalize=True, normalize_method='minmax',
                remove_outliers=True)
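
The options `remove_multicollinearity=True` and `multicollinearity_threshold=0.95` ask PyCaret to drop features that are almost perfectly correlated with another feature. The sketch below illustrates the general idea on synthetic data. The helper `drop_collinear` is hypothetical, and PyCaret's own rule for choosing which feature of a pair to drop may differ.

```python
import numpy as np
import pandas as pd

def drop_collinear(df, threshold=0.95):
    """Drop one feature from every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),  # almost a copy of "a"
    "c": rng.normal(size=200),                       # independent feature
})

reduced, dropped = drop_collinear(demo)
print("dropped:", dropped)
```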


## Model Selection

Train multiple models and choose the best three (sorted by $R^2$).

In [None]:
best_trans = compare_models()

We are going to use Light Gradient Boosting Machine, Extra Trees Regressor and Random Forest Regressor to create a blended model.

Let's look at some information about our ET model

In [None]:
plot_model(best_trans, plot='residuals')

It looks overfitted.
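
One way to back up this impression is to compare the score on the training data with the score on held-out data: a large gap suggests the model is memorizing the training set. The sketch below does this on synthetic data with scikit-learn's `ExtraTreesRegressor`; the data and settings are assumptions for illustration and do not reproduce the notebook's model.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=10, noise=50.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=0)

# Fully grown trees can fit the training targets almost perfectly
model = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
print(f"R2 train: {r2_train:.3f}  R2 test: {r2_test:.3f}  gap: {r2_train - r2_test:.3f}")
```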

In [None]:
plot_model(best_trans, plot='error')

In [None]:
plot_model(best_trans, plot='feature')

## Blend Model

Create our models

In [None]:
lgbm_model_sac_campo = create_model('lightgbm')
et_model_sac_campo = create_model('et')
rf_model_sac_campo = create_model('rf')
lgbm_model_sac_campo = tune_model(lgbm_model_sac_campo, n_iter=2)
#et_model_sac_campo = tune_model(et_model_sac_campo, n_iter=1)
#rf_model_sac_campo = tune_model(rf_model_sac_campo, n_iter=2)

Create blended model

In [None]:
tuned_model_sac_campo = blend_models(estimator_list=[lgbm_model_sac_campo, rf_model_sac_campo, et_model_sac_campo], choose_better=True)

Hyperparameter tuning of the blended model

Some metrics about our model

In [None]:
plot_model(tuned_model_sac_campo, plot='residuals')

In [None]:
plot_model(tuned_model_sac_campo, plot='error')

Finalize the model. The `save_model` call, which would write it to the file sac_campo_model.pkl, is left commented out.

In [None]:
final_sac_campo = finalize_model(tuned_model_sac_campo)
#save_model(final_sac_campo, 'sac_campo_model' )

## Predictions on test data

In [None]:
test_data = data_unseen.drop(['sac_trans', 'sac_caña_trans', 'sac_campo_trans'], axis=1)
test_pred = predict_model(final_sac_campo, data = test_data)

In [None]:
plt.figure(figsize=(22,5))
plot_series(data_unseen.index, data_unseen['sac_campo_trans'],"True")
plot_series(data_unseen.index, test_pred['Label'],'Predicted')
plt.grid(False)
plt.title('Sac_Campo true vs predicted')
fig = plt.gcf()
plt.show()
fig.savefig('true vs predic sac campo' + file + '.png', bbox_inches='tight')

In [None]:
r_score = r2_score(y_pred = test_pred['Label'], y_true = data_unseen['sac_campo_trans'])
mae_score = mean_absolute_error(y_pred = test_pred['Label'], y_true = data_unseen['sac_campo_trans'])
print('R2 on test set: {}'.format(r_score))
print('MAE on test set: {}'.format(mae_score))

R2 on test set: 0.44980347049614566
MAE on test set: 0.5543422898333494


The model follows the overall trend, but with an $R^2$ of about 0.45 on the unseen data the fit is only moderate; the target variable also has many extreme values.

# Sac % Caña variable

## Setup

Using PyCaret, we set up our machine learning workflow. The numeric features (independent variables) are normalized using min-max scaling.

In [None]:
reg_trans = setup(data=data, target = 'sac_caña_trans',train_size=0.80,
                ignore_features = ['sac_campo_trans','sac_trans'],
                categorical_features =['tmprda','TIPO COS','Con Sin Mad','nm_cndcion','PRODUCTO','VAR'],
                rare_level_threshold = 0.1, combine_rare_levels = True,
                remove_multicollinearity = True, multicollinearity_threshold = 0.95,
                ignore_low_variance = True, normalize=True, normalize_method='minmax',
                remove_outliers=True)


## Model Selection

Train multiple models and choose the best three (sorted by $R^2$).

In [None]:
best_trans = compare_models()

We are going to use Light Gradient Boosting Machine, Extra Trees Regressor and Random Forest Regressor to create a blended model.

Let's look at some information about our ET model

In [None]:
plot_model(best_trans, plot='residuals')

It looks overfitted.

In [None]:
plot_model(best_trans, plot='error')

In [None]:
plot_model(best_trans, plot='feature')

## Blend Model

Create our models

In [None]:
lgbm_model_sac_caña = create_model('lightgbm')
et_model_sac_caña = create_model('et')
rf_model_sac_caña = create_model('rf')
lgbm_model_sac_caña = tune_model(lgbm_model_sac_caña, n_iter=2)
et_model_sac_caña = tune_model(et_model_sac_caña, n_iter=2)
rf_model_sac_caña = tune_model(rf_model_sac_caña, n_iter=2)

Create blended model

In [None]:
tuned_model_sac_caña = blend_models(estimator_list=[lgbm_model_sac_caña, rf_model_sac_caña, et_model_sac_caña], choose_better=True)

Hyperparameter tuning of the blended model

Some metrics about our model

In [None]:
plot_model(tuned_model_sac_caña, plot='residuals')

In [None]:
plot_model(tuned_model_sac_caña, plot='error')

Finalize the model. The `save_model` call, which would write it to the file sac_caña_model.pkl, is left commented out.

In [None]:
final_sac_caña = finalize_model(tuned_model_sac_caña)
# save_model(final_sac_caña, 'sac_caña_model' )

## Predictions on test data

In [None]:
test_data = data_unseen.drop(['sac_trans', 'sac_caña_trans', 'sac_campo_trans'], axis=1)
test_pred = predict_model(final_sac_caña, data = test_data)

In [None]:
def plot_series(time, series,i,c, format="-", start=0, end=None):
    #plt.figure(figsize=(20,10))
    plt.plot(time[start:end], series[start:end], format,label=i, color=c)
    plt.xlabel("Unseen Samples")
    plt.ylabel("Sucrose Field")
    plt.legend()

plt.figure(figsize=(22,5))
plot_series(data_unseen.index, data_unseen['sac_caña_trans'],"True",'tab:blue')
plot_series(data_unseen.index, test_pred['Label'],'Predicted','tab:orange')
plt.grid(False)
plt.title('Sac_%_Caña true vs predicted')
fig = plt.gcf()
plt.show()
fig.savefig('true vs predic sac caña' + file + '.png', bbox_inches='tight')

In [None]:
r_score = r2_score(y_pred = test_pred['Label'], y_true = data_unseen['sac_caña_trans'])
mae_score = mean_absolute_error(y_pred = test_pred['Label'], y_true = data_unseen['sac_caña_trans'])
print('R2 on test set: {}'.format(r_score))
print('MAE on test set: {}'.format(mae_score))