# Credits

This notebook took many ideas from the following Kaggle kernels:

https://www.kaggle.com/hely333/explore-avocados-from-all-sides 
https://www.kaggle.com/zdeutsch/avocados-predictions-with-ml-models-keras-ann

# Predicting Avocado prices

In this tutorial, we will analyze avocado prices in different US cities and attempt to predict their future prices based on their type, production, and region.


For that, we will use the [Avocado Prices dataset from Kaggle](https://www.kaggle.com/neuromusic/avocado-prices), compiled from the [Hass Avocado Board website](https://www.hassavocadoboard.com/retail/volume-and-price-data).

The dataset is a table with the weekly 2015-2018 retail scan data for National retail volume (units) and price. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (**PLU**) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.

The table's columns are as follows:

- **Date** : The date of the observation.
- **AveragePrice** : the average price of a single avocado in USD.
- **type** : conventional or organic.
- **year** : the year of the observation (redundant information).
- **region** : the city or region of the observation.
- **Total Volume** : Total number of avocados sold.
- **4046** : Total number of avocados with PLU 4046 sold (small Hass).
- **4225** : Total number of avocados with PLU 4225 sold (large Hass).
- **4770** : Total number of avocados with PLU 4770 sold (extra large Hass).
- **Total Bags** : total number of bags sold including all types.
- **Small Bags** : total number of bags sold of small Hass.
- **Large Bags** : total number of bags sold of large Hass.
- **XLarge Bags** : total number of bags sold of extra large Hass.	

## Let's load the dataset first

First, download the dataset zip file [from Kaggle's website](https://www.kaggle.com/neuromusic/avocado-prices/downloads/avocado.csv/1) and unzip it in the same folder where this notebook is.

Then we can read the dataset with Pandas.

In [None]:
import pandas as pd
import numpy as np
avocato_ds = pd.read_csv('avocado.csv')

Let's see its contents using the [head method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html?highlight=head#pandas.DataFrame.head)

In [None]:
avocato_ds.head()

The first column and the second column contain an arithmetic progression starting from 0. These are equivalent to the row number.

Now let's see the column types and whether the columns contain missing (null) values using the [info method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html?highlight=info#pandas.DataFrame.info).

In [None]:
avocato_ds.info()

**Dataset info**

- The number of entries (weekly observations) is 18249.
- All the columns contain **18249 non-null values** (equal to the number of entries). Hence, we don't have missing values.
- The DataFrame uses about 1.9 MB of memory.
- The type of the "Date" column is object, which means a string in pandas.

### The date column

It is more convenient to convert the "Date" column to datetime objects, which allow arithmetic operations between different times. For this we will use pandas' [to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html?highlight=to_datetime#pandas.to_datetime) function.

In [None]:
avocato_ds['Date'] = pd.to_datetime(avocato_ds['Date'])
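As a minimal sketch of the arithmetic that datetime values allow, the toy example below subtracts dates and extracts the month. The dates and the `demo_*` names are made up for illustration and are not read from `avocado.csv`.

```python
import pandas as pd

# Toy dates (hypothetical values, not taken from the avocado dataset)
demo_dates = pd.to_datetime(pd.Series(['2015-01-04', '2015-01-11', '2015-02-01']))

# Subtracting two datetimes gives a Timedelta; here, the time since the first date
demo_elapsed = demo_dates - demo_dates.iloc[0]
print(demo_elapsed.dt.days.tolist())   # [0, 7, 28]

# The .dt accessor exposes calendar fields such as the month
print(demo_dates.dt.month.tolist())    # [1, 1, 2]
```

None of this works while the column is still stored as plain strings.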

Now let's make sure that the rows are sorted by date using the [sort_values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) DataFrame method.
This will become useful when we plot different columns.

In [None]:
avocato_ds.sort_values('Date',axis=0, ascending=True, inplace=True)
# the axis keyword indicates along which direction to sort: the index (0) or columns (1).
# ascending=True : sort in ascending order
# the inplace=True keyword modifies the DataFrame in place (it does not create a new object).

#### Data cleaning 

Let's clean the dataset a little bit. The "Unnamed: 0" and "year" columns do not provide useful information. We can remove them using the [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) method.

In [None]:
avocato_ds.drop(['Unnamed: 0', 'year'], axis=1,inplace=True)
# the axis keyword indicates where to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
# the inplace=True keyword modifies the DataFrame in place (it does not create a new object).
avocato_ds.head(2)

# Exploratory data analysis and data cleaning

Exploratory data analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics. 
The EDA is the initial step in every data science project, where you explore the characteristics of the data, find patterns or anomalies, test assumptions about the relationships between variables, etc.
In a nutshell, the main goal is to maximize your knowledge of the dataset. 

Let's explore the contents of the columns that contain strings.

In [None]:
avocato_ds['region'].unique()

In [None]:
avocato_ds['type'].unique()

What we care about the most when we buy avocados is their price. Let's start by plotting the temporal evolution of the average avocado prices for each type sold in the entire US (*TotalUS* region).
We are going to select the rows in the [DataFrame by using boolean indexes](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#different-choices-for-indexing).

In [None]:
# Let's import matplotlib first
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline  

In [None]:
select_US = avocato_ds['region']=='TotalUS'

select_organic = avocato_ds['type']=='organic'

select_conventional = avocato_ds['type']=='conventional'

# The DataFrame's plot method returns a matplotlib axes instance that can be reused in other plot() calls
ax=avocato_ds[select_US&select_organic].plot(x='Date',y='AveragePrice',
                                             label='organic', figsize=(12,5))

avocato_ds[select_US&select_conventional].plot(x='Date',y='AveragePrice',
                                               label='conventional',ax=ax)

**There is something weird in the organic average price around June to August 2015.**
The prices drop to 1 USD and stay at that value for a few weeks.
This is probably an error in the dataset.
Luckily, we have the average price and total number of avocados sold by region. We can compute our own TotalUS prices and compare them with the actual values in the dataset. 

## Let's fix the TotalUS prices 

To obtain the average price over the entire US, we need to compute the total USD sold in each region,
add up these totals over all the regions, and then divide by the total number of avocados sold in the US.
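The recipe above is a volume-weighted average. A minimal sketch with two made-up regions (the numbers and the `toy_*` names are hypothetical) shows how it differs from a plain mean of the regional prices:

```python
import pandas as pd

# Toy data: two regions observed in the same week (hypothetical numbers)
toy_week = pd.DataFrame({
    'region': ['A', 'B'],
    'AveragePrice': [1.00, 2.00],      # USD per avocado
    'Total Volume': [300.0, 100.0],    # avocados sold
})

# USD sold in each region
toy_sales = toy_week['AveragePrice'] * toy_week['Total Volume']

# Total USD divided by total avocados sold
toy_weighted_price = toy_sales.sum() / toy_week['Total Volume'].sum()

# Unweighted mean of the regional prices, for comparison
toy_plain_mean = toy_week['AveragePrice'].mean()

print(toy_weighted_price)  # 1.25
print(toy_plain_mean)      # 1.5
```

The region that sells more avocados pulls the overall price toward its own price, which the plain mean ignores.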

In [None]:
# Let's add the "Total Sale" column with the amount in USD sold each week (each row)
avocato_ds['Total Sale']=avocato_ds['Total Volume']*avocato_ds['AveragePrice']

# Let's select the main regions
# The main regions are described in https://www.hassavocadoboard.com/retail/volume-and-price-data
regions= ['Southeast', 'GreatLakes', 'Northeast', 'West',
          'California',  'Plains',  'Midsouth', 'SouthCentral']

# Reminder: isin() returns True for the rows whose region is any of the values in the list
select_major_regions = avocato_ds['region'].isin(regions)

# select_major_regions is a boolean Series, with a True value on the rows that 
# correspond to any of the main regions, and False otherwise.
organic_ds = avocato_ds[select_organic & select_major_regions]

# Let's create a copy of the dataset with only the features we are interested in.
organic_ds_short = organic_ds[['Date','Total Sale', 'Total Volume']]
   
# We take the total value for each week
organic_ds_short=organic_ds_short.groupby('Date').sum()
organic_ds_short.head()

In [None]:
# Compute the US Average price. 
us_average_price = organic_ds_short['Total Sale']/organic_ds_short['Total Volume']
# us_average_price is a pandas Series, with the date as index
print(type(us_average_price))

us_average_price.head()

In [None]:
ax=avocato_ds[select_US&select_organic].plot(x='Date',y='AveragePrice',
                                             label='organic (Original)',
                                             legend=True, figsize=(12,5))
us_average_price.plot(ax=ax,label='organic (New)',legend=True, color='r')
ax.set_title('Total US - Average price');

Solved! Well, at least partially. 

We still need to __fix the values__ in our DataFrame **avocato_ds**.

For that, in **avocato_ds**, we need to replace the "AveragePrice" values on all the rows where region="TotalUS" and type='organic' by the values in **us_average_price** that we just computed. 

Let's review the data that we need to use:
- **us_average_price** : Series index=Date , values=AveragePrice
- **avocato_ds** : DataFrame, index=row number, we need to replace the values in the AveragePrice column.

The two objects use different indexes. This makes a simple assignment between the DataFrame column and the Series impossible.

Let's create an auxiliary DataFrame, **aux_series**, that stores the row number of each observation next to its date.

In [None]:
select_US = avocato_ds['region']=='TotalUS'
select_organic = avocato_ds['type']=='organic'

# Let's create a new DataFrame with the columns we need.
aux_series = avocato_ds[select_US&select_organic][['Date','AveragePrice']]

# Add another column with the row number
aux_series['row']=aux_series.index

aux_series.head(2)

In [None]:
# Let's use Date as index
aux_series.set_index('Date',drop=False, inplace=True)
aux_series.head(3)

Now, **aux_series** and **us_average_price** both use 'Date' as their index. 

Let's check that the indexes are identical.

In [None]:
aux_series.index.equals(us_average_price.index)

In [None]:
# Let's assign the values to AveragePrice
aux_series['AveragePrice'] = us_average_price
aux_series.head()

If the indexes are not identical, not-a-number (NaN) values are assigned to the non-overlapping index entries.
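A small sketch of this alignment behaviour, using made-up dates and prices (the `toy_*` names are hypothetical): the Series below lacks one of the dates present in the DataFrame, so that row ends up as NaN after the assignment.

```python
import pandas as pd

# Toy DataFrame indexed by date (hypothetical values)
toy_frame = pd.DataFrame(
    {'AveragePrice': [1.0, 1.1, 1.2]},
    index=pd.to_datetime(['2015-01-04', '2015-01-11', '2015-01-18']))

# Toy Series that only covers the first two dates
toy_new_prices = pd.Series(
    [2.0, 2.1],
    index=pd.to_datetime(['2015-01-04', '2015-01-11']))

# The assignment is aligned on the index, not on the position
toy_frame['AveragePrice'] = toy_new_prices

print(toy_frame)
print(toy_frame['AveragePrice'].isna().sum())  # 1
```

This is why it is worth checking that the indexes are identical before assigning.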

Now, we are ready to update the old AveragePrice with the new computed ones using the DataFrame's [update](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html) method.

In [None]:
# Set the index as row
aux_series.set_index('row',drop=False, inplace=True)

# Create a copy
avocato_ds_fix=avocato_ds.copy()

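# Update the AveragePrice in the corresponding rows only.
# DataFrame.update aligns on the index and column names and modifies avocato_ds_fix in place.
# (Calling update on the selected column instead may act on a temporary copy in recent pandas versions.)
avocato_ds_fix.update(aux_series[['AveragePrice']])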
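To illustrate what the DataFrame's update method does, here is a toy sketch with made-up values (the `toy_*` names are hypothetical). Only the rows whose index labels appear in the second DataFrame are overwritten.

```python
import pandas as pd

# Toy DataFrame with the default row-number index (hypothetical values)
toy_prices = pd.DataFrame({
    'AveragePrice': [1.0, 1.1, 1.2, 1.3],
    'type': ['organic', 'conventional', 'organic', 'conventional'],
})

# Corrected prices for rows 0 and 2 only
toy_corrections = pd.DataFrame({'AveragePrice': [9.0, 9.2]}, index=[0, 2])

# update() modifies toy_prices in place, matching rows by index label
toy_prices.update(toy_corrections)

print(toy_prices['AveragePrice'].tolist())  # [9.0, 1.1, 9.2, 1.3]
```

Columns that are absent from the corrections, such as `type` here, are left untouched.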

In [None]:
select_US = avocato_ds_fix['region']=='TotalUS'

select_organic = avocato_ds_fix['type']=='organic'

select_conventional = avocato_ds_fix['type']=='conventional'

# The DataFrame's plot method returns a matplotlib axes instance that can be reused in other plot() calls
ax=avocato_ds_fix[select_US&select_organic].plot(x='Date',y='AveragePrice',
                                             label='organic', figsize=(12,5))

avocato_ds_fix[select_US&select_conventional].plot(x='Date',y='AveragePrice',
                                               label='conventional',ax=ax)

## Exercises

### 1) Plot the Average prices for different regions and types.

In [None]:
# avocato_ds_fix.head()

### 2) Plot the Average prices over the US together with the Total Volume. Are they correlated? 
Use [plot.scatter method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html)

In [None]:
# 

### 3) Plot the Average prices over the US together with the Total Bags. Are they correlated? 

# Predicting avocado prices

## Explore data correlations 

Here we will try to predict the average avocado prices based on the information that we have in our dataset. 
As a first step, let's explore the correlation between different variables (columns) and the prices.

Let's try to predict the prices for each major region only. 
For that, we make use of the [isin](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html) Series method to select all the rows that correspond to these regions.

In [None]:
# Let's select the main regions
# The main regions are described in 
# http://web.archive.org/web/20171017162957/https://www.hassavocadoboard.com/retail/volume-and-price-data

def select_major_regions(input_dataset):
    """
    Return a dataset with only the major regions.
    """
    regions= ['Southeast', 'GreatLakes', 'Northeast', 'West',
              'California',  'Plains',  'Midsouth', 'SouthCentral']

    selected_regions = input_dataset['region'].isin(regions)
    return input_dataset[selected_regions]

major_regions_ds = select_major_regions(avocato_ds_fix)
major_regions_ds.head()

Let's plot the correlations between the different columns using Seaborn, a data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

In [None]:
# To install Seaborn run one of the following commands:

#!conda install -y seaborn
#!pip install seaborn

We will compute the correlations between the different columns in the DataFrame using the [corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) method.
This method can only compute correlations between columns with numeric values. In our dataset, the "type" column contains string values. 

To compute the correlation with that column, we will encode the string values as numerical values (0 and 1 in this case).
To do that we will use the [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) class from the [scikit-learn library](https://scikit-learn.org/).

Scikit-learn is a machine learning library that provides simple and efficient tools for data mining and data analysis, accessible to everybody and reusable in various contexts. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means, etc. 

In [None]:
# To install scikit-learn run one of the following commands:

#!conda install -y scikit-learn
#!pip install scikit-learn

In [None]:
columns = ['AveragePrice','type', 'Total Volume','Total Bags', 'region']

selected_data = major_regions_ds[columns]

#################################################################
# Encode the type column
from sklearn.preprocessing import LabelEncoder 
label = LabelEncoder() # Create encoder instance

# Fit the encoder to the data.
# In this step, the LabelEncoder maps each distinct string 
# to an integer (0 and 1 in this case).
label.fit(selected_data['type'].unique()) 

# Transform the data from string to numerical values.
# Note: 
# Doing selected_data['cat_type']=values will set the values on a copy of a slice from a DataFrame.
# A safer way to do this is:
selected_data=selected_data.assign(cat_type=label.transform(selected_data['type']))

selected_data.head()
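As a quick check of how the encoding works, the sketch below fits a LabelEncoder on a toy list (the `toy_*` names and values are made up for illustration). The classes are stored in sorted order, and each string is replaced by its position in that list.

```python
from sklearn.preprocessing import LabelEncoder

# Toy list of avocado types (hypothetical sample)
toy_types = ['organic', 'conventional', 'organic']

toy_encoder = LabelEncoder().fit(toy_types)

# The learned classes, in sorted order
print(list(toy_encoder.classes_))            # ['conventional', 'organic']

# Each string becomes the position of its class
toy_type_codes = toy_encoder.transform(toy_types)
print(toy_type_codes.tolist())               # [1, 0, 1]

# The mapping can be reversed
print(list(toy_encoder.inverse_transform(toy_type_codes)))
```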

Let's calculate the correlation matrix.

In [None]:
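# numeric_only=True skips the string columns ('type' and 'region'),
# which recent pandas versions no longer drop silently.
correlations = selected_data.corr(numeric_only=True)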
correlations.head()

And then plot the correlations using Seaborn's [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) function.

In [None]:
import seaborn as sns
# plot the heatmap
sns.heatmap(correlations,
            cbar = True,  # Add colorbar
            annot = True, # If True, write the data value in each cell. 
            fmt = '.2f',  # String formatting code for annotations. 
            annot_kws = {'size':15}) # Keyword arguments for ax.text for annotations                      

The heatmap shows that:

- There is some correlation between the prices and the selected columns.
- There is an anti-correlation between the prices and the production (supply), as expected.
- The Total Volume and Total Bags have a high correlation, indicating that they provide similar information (although not the same).

Let's show in more detail the relationships between the columns using Seaborn's [pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html) function.

In [None]:
sns.pairplot(selected_data,hue='type',vars=['AveragePrice','Total Volume','Total Bags'])

In [None]:
select_organic = selected_data['type']=='organic'
my_axes=sns.pairplot(selected_data[select_organic],
                     vars=['AveragePrice','Total Volume','Total Bags'],
                     hue='region')
plt.subplots_adjust(top=0.9)
my_axes.fig.suptitle('Organic type')

In [None]:
select_conventional = selected_data['type']=='conventional'
my_axes= sns.pairplot(selected_data[select_conventional],
                      vars=['AveragePrice','Total Volume','Total Bags'],
                      hue='region')
plt.subplots_adjust(top=0.9)
my_axes.fig.suptitle('Conventional type')

## Implement machine learning models


### What is machine learning?

A general definition of machine learning is:

    Machine Learning is the field of study that gives computers the ability to learn
    without being explicitly programmed. (Arthur Samuel, 1959)

What that really means is that instead of building a mathematical model based on a fixed set of rules (equations), we will try to create a model from the data itself. 

## Data preprocessing

Let's create the training and testing datasets from the available data.
Since the prices depend on the avocado type, we will create a different model for each type.

First, let's extract the predictors (x) and the predicted variable (y) from the dataset.

In [None]:
# We use the select_major_regions function defined above.
major_regions_ds = select_major_regions(avocato_ds_fix)

# The region column contains string values. 
# To use them in a linear regression we need to encode them into numbers.
# This time, instead of using the LabelEncoder, we will use the pandas factorize method
#
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html
#
numeric_labels, unique_labels = major_regions_ds['region'].factorize()
major_regions_ds=major_regions_ds.assign(region=np.asarray(numeric_labels, dtype=float))    
    
# Create two independent datasets by type
conventional_ds =  major_regions_ds[major_regions_ds['type']=='conventional']

organic_ds =  major_regions_ds[major_regions_ds['type']=='organic']

x_conventional = conventional_ds[['Total Volume','Total Bags', 'region']]
y_conventional = conventional_ds['AveragePrice']

x_organic = organic_ds[['Total Volume','Total Bags', 'region']]
y_organic = organic_ds['AveragePrice']


x_conventional.head()
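To see what factorize returns, here is a toy sketch with a made-up list of regions (the `toy_*` names are hypothetical). Codes are assigned in order of first appearance, and the second return value holds the original labels.

```python
import pandas as pd

# Toy region column (hypothetical sample)
toy_regions = pd.Series(['West', 'Plains', 'West', 'California'])

toy_region_codes, toy_region_labels = toy_regions.factorize()

print(toy_region_codes.tolist())   # [0, 1, 0, 2]
print(list(toy_region_labels))     # ['West', 'Plains', 'California']

# The labels can be recovered from the codes
print(list(toy_region_labels[toy_region_codes]))
```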

Now we have a dataset to train the model.
But we also want to validate (test) how well the model will generalize (perform) on data that it has never seen before. 

To that end, we will use a cross-validation technique where we split our training dataset into two sub-sets:

- train data: Data that will be used to train the model
- validation data: Data that will be used only to validate the model
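A minimal sketch of such a split on toy arrays (the `toy_*` names and values are made up): with `shuffle=False` the first 75% of the samples go to the training set and the last 25% to the validation set, in their original order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 ordered samples (hypothetical values)
toy_split_x = np.arange(8).reshape(-1, 1)
toy_split_y = np.arange(8) * 10

(toy_x_train, toy_x_val,
 toy_y_train, toy_y_val) = train_test_split(toy_split_x, toy_split_y,
                                            test_size=0.25,  # keep 25% for validation
                                            shuffle=False)   # preserve the order

print(toy_x_train.ravel().tolist())  # [0, 1, 2, 3, 4, 5]
print(toy_x_val.ravel().tolist())    # [6, 7]
```

For data sorted by date, this means the model is validated on the most recent observations.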

In general, it can be problematic to feed the models data with values over widely different ranges. Although the model can adapt to those values, it can make the learning process more difficult. 

For that reason, we normalize the data so all the entries (predictor variables) have a similar dynamic range. 

A common approach is to standardize the features (predictor variables) by removing the mean and scaling to unit variance:

       z = (x - u) / s

where __u__ is the mean of the training samples and **s** is the standard deviation of the training samples.

To do that we will use the Scikit-learn [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).

For other scaling functions see:

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
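As a sanity check of the formula above, this sketch computes z by hand on toy values (the `toy_*` names and numbers are made up) and compares it with StandardScaler. Note that the mean and standard deviation come from the training samples only, and that StandardScaler uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature with a single column (hypothetical values)
toy_scale_train = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_scale_test = np.array([[5.0]])

# By hand: statistics computed on the training samples only
u = toy_scale_train.mean(axis=0)
s = toy_scale_train.std(axis=0)          # population std (ddof=0)
toy_manual_z = (toy_scale_test - u) / s

# Same operation with scikit-learn
toy_scaler = StandardScaler().fit(toy_scale_train)
toy_sklearn_z = toy_scaler.transform(toy_scale_test)

print(np.allclose(toy_manual_z, toy_sklearn_z))  # True
```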

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Create test and train data for each dataset.
# The rows are sorted by date, so without shuffling the test set holds the latest samples.

(x_train_c, x_test_c, 
 y_train_c, y_test_c) = train_test_split(x_conventional,
                                         y_conventional,
                                         test_size = 0.25,# Keep 25% of the samples for testing 
                                         shuffle=False,   # Do not shuffle the samples
                                         random_state=42) # Has no effect when shuffle=False

(x_train_o, x_test_o, 
 y_train_o, y_test_o) = train_test_split(x_organic,
                                         y_organic,
                                         test_size = 0.25,# Keep 25% of the samples for testing 
                                         shuffle=False,   # Do not shuffle the samples
                                         random_state=42) # Has no effect when shuffle=False

# Scale the data

# Fit each scaler on the training data of its own avocado type
conventional_scaler = StandardScaler()
conventional_scaler.fit(x_train_c)

organic_scaler = StandardScaler()
organic_scaler.fit(x_train_o)

# Now let's transform the training and the test data using this scaling
x_train_c = conventional_scaler.transform(x_train_c)
x_test_c = conventional_scaler.transform(x_test_c)

x_train_o = organic_scaler.transform(x_train_o)
x_test_o = organic_scaler.transform(x_test_o)

## Build models

Now, with the training and validation datasets that we have, we will build and validate several types of models. 
We will show how to build the models to predict the prices of conventional avocados.
It is left as an exercise to implement the same models for the organic avocados.

### Linear regression model

The most basic form of machine learning is a multidimensional linear regression of the data. This model assumes a linear (and unique) relationship between the predictors (independent variables, typically denoted with **x**) and the predicted variable (dependent variable, typically denoted with __y__).

```
y = a + b0 * x0 + b1 * x1 + ... + bn * xn
```

The best fit to this equation is learned from a training dataset with known (**x,y**) pairs.
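As a rough illustration of what "learning the best fit" means, the sketch below recovers the coefficients of a known linear relation by ordinary least squares with NumPy. The toy data and the `toy_*` names are made up for illustration.

```python
import numpy as np

# Toy data generated from y = 1 + 2*x0 - 3*x1, without noise (hypothetical values)
toy_lin_X = np.array([[0.0, 0.0],
                      [1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 1.0],
                      [2.0, 1.0]])
toy_lin_y = 1.0 + 2.0 * toy_lin_X[:, 0] - 3.0 * toy_lin_X[:, 1]

# Add a column of ones so that the intercept a is fitted as well
toy_design = np.column_stack([np.ones(len(toy_lin_X)), toy_lin_X])

# Least-squares solution: the coefficients [a, b0, b1] that minimize the squared error
toy_coefficients, *_ = np.linalg.lstsq(toy_design, toy_lin_y, rcond=None)

print(np.round(toy_coefficients, 6))
```

With noise-free data the fit recovers a = 1, b0 = 2 and b1 = -3 up to rounding. With real data it returns the coefficients that minimize the squared error.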

Let's implement a linear regression model using the [LinearRegression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) from the Scikit-learn library.

Our linear model will use as inputs the **region, Total Volume, and Total Bags** columns. We will create a different model for each Avocado type.

In [None]:
from sklearn.linear_model import LinearRegression 

# Create model 
conventional_lineal_model = LinearRegression()

# Train it using the training dataset
conventional_lineal_model.fit(x_train_c,y_train_c)  

Now we have trained the model. Let's see how well it performs in general. 

One way of doing that is to measure the coefficient of determination R^2 of the prediction,
using the [score method](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score).
The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.
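The three cases can be checked on toy numbers. The sketch below (the values and `toy_*` names are made up) computes R^2 by hand as 1 - SS_res / SS_tot and compares it with scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy actual and predicted values (hypothetical)
toy_r2_true = np.array([1.0, 2.0, 3.0, 4.0])
toy_r2_pred = np.array([1.1, 1.9, 3.2, 3.8])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((toy_r2_true - toy_r2_pred) ** 2)
ss_tot = np.sum((toy_r2_true - toy_r2_true.mean()) ** 2)
toy_manual_r2 = 1.0 - ss_res / ss_tot
print(np.isclose(toy_manual_r2, r2_score(toy_r2_true, toy_r2_pred)))  # True

# A constant model that always predicts the mean scores 0
toy_constant_pred = np.full_like(toy_r2_true, toy_r2_true.mean())
print(r2_score(toy_r2_true, toy_constant_pred))

# A model worse than the constant one gets a negative score
toy_reversed_pred = toy_r2_true[::-1]
print(r2_score(toy_r2_true, toy_reversed_pred))
```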

In [None]:
# Test how well it performs on the train set
conventional_lineal_model.score(x_train_c,y_train_c)

In [None]:
# Test how well it performs on the test set
conventional_lineal_model.score(x_test_c,y_test_c)

The linear model does not perform well. Let's look at it in more detail.

In [None]:
def plot_reliability_plots(model,x_train, x_test, y_train, y_test):
    """
    Scatter plot of predicted vs actual values.
    """

    train_predict = model.predict(x_train)
    test_predict = model.predict(x_test)

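    # Let's plot the predicted values against the actual ones
    fig, ax=plt.subplots(figsize=(6,6))
    ax.set_aspect('equal')
    plt.scatter(train_predict,y_train, label='train')
    plt.scatter(test_predict,y_test, label='test')
    plt.xlabel('Predicted price (USD)')
    plt.ylabel('Actual price (USD)')
    plt.legend(fontsize=15)
    # Reference line: perfect predictions fall on the diagonal
    plt.plot([0.5,2],[0.5,2])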
    
plot_reliability_plots(conventional_lineal_model, x_train_c, x_test_c, 
                       y_train_c, y_test_c )


This is a somewhat expected result. During the exploratory data analysis we saw that the data didn't follow linear relationships.

Let's also compute the Mean Absolute Error (MAE) for our predictions.

In [None]:
from sklearn.metrics import mean_absolute_error, r2_score

def compute_mae(model,x_train, x_test, y_train, y_test):
    """
    Compute Mean Absolute error between predicted and actual values.
    """

    train_predict = model.predict(x_train)
    test_predict = model.predict(x_test)

    # scikit-learn's signature is mean_absolute_error(y_true, y_pred)
    mae_train = mean_absolute_error(y_train, train_predict)
    mae_test = mean_absolute_error(y_test, test_predict)
    
    print(f"MAE_train: {mae_train:.2f}")
    print(f"MAE_test: {mae_test:.2f}")
    
    return mae_train, mae_test
    
print("Linear model: Conventional avocados")
compute_mae(conventional_lineal_model, x_train_c, x_test_c, 
            y_train_c, y_test_c );
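For reference, the MAE is simply the mean of the absolute differences between the actual and the predicted values. A toy check with made-up prices (the `toy_*` names are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy actual and predicted prices in USD (hypothetical values)
toy_actual = np.array([1.0, 1.5, 2.0])
toy_predicted = np.array([1.2, 1.4, 1.7])

# Mean of |actual - predicted| = (0.2 + 0.1 + 0.3) / 3
toy_manual_mae = np.mean(np.abs(toy_actual - toy_predicted))

print(f"{toy_manual_mae:.2f}")                                     # 0.20
print(f"{mean_absolute_error(toy_actual, toy_predicted):.2f}")     # 0.20
```

Unlike R^2, the MAE is expressed in the same units as the predicted variable (USD per avocado here).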

### Exercise: Linear Regression model for organic avocados 
Build and test the Linear Regression model for the organic avocados.

In [None]:
# Train model 

In [None]:
# Test it

In [None]:
# Plot the predicted vs actual values

### KNeighborsRegressor model

Regression based on k-nearest neighbors.
The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set.

The k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. The input consists of the k closest training examples in the feature space. 

In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. In k-NN regression, the output is the average of the values of the k nearest neighbors.

Both for classification and regression, a useful technique can be used to assign weight to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor.
<table>
  <tr>
    <td> 
        <img src="fig/KNN_1.png" alt="KNN" style="width: 250px;"/>
        <p>Training data</p>
    </td>
    <td>
        <img src="fig/KNN_2.png" alt="KNN" style="width: 250px;"/>
        <p>1-Nearest neighbors classification map</p>
    </td>
      <td>
        <img src="fig/KNN_3.png" alt="KNN" style="width: 250px;"/>
        <p>5-Nearest neighbors classification map</p>
    </td>
    </tr>
</table>

Source: [Wikipedia](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)
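The 1/d weighting described above can be sketched by hand on a one-dimensional toy problem and compared with `KNeighborsRegressor(weights='distance')`. The data and the `toy_knn_*` names are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy training data with one feature (hypothetical values)
toy_knn_x = np.array([[0.0], [1.0], [3.0], [10.0]])
toy_knn_y = np.array([1.0, 2.0, 4.0, 20.0])
toy_knn_query = np.array([[2.5]])

# By hand: find the 2 nearest neighbors and weight their targets by 1/d
toy_knn_dist = np.abs(toy_knn_x.ravel() - toy_knn_query[0, 0])
toy_knn_nearest = np.argsort(toy_knn_dist)[:2]      # x=3 (d=0.5) and x=1 (d=1.5)
toy_knn_w = 1.0 / toy_knn_dist[toy_knn_nearest]
toy_knn_manual = np.sum(toy_knn_w * toy_knn_y[toy_knn_nearest]) / np.sum(toy_knn_w)

# Same prediction with scikit-learn
toy_knn_model = KNeighborsRegressor(n_neighbors=2, weights='distance')
toy_knn_model.fit(toy_knn_x, toy_knn_y)

print(toy_knn_manual)                              # 3.5
print(toy_knn_model.predict(toy_knn_query)[0])
```

With uniform weights (the default), the prediction would be the plain average of the two targets, (4 + 2) / 2 = 3.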

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knn_model_conv=KNeighborsRegressor(n_neighbors=3)
knn_model_conv.fit(x_train_c,y_train_c) 

plot_reliability_plots(knn_model_conv, x_train_c, x_test_c, 
                       y_train_c, y_test_c )

In [None]:
print("KNN model: Conventional avocados")
compute_mae(knn_model_conv, x_train_c, x_test_c, 
            y_train_c, y_test_c );

### Exercise: KNN model for organic avocados

Build and test the KNN model for the organic avocados.

In [None]:
# Train model 


In [None]:
# Test it

In [None]:
# Plot the predicted vs actual values

### RandomForestRegressor model

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.

<img src="fig/random_forests.png" alt="Random Forests" style="width: 500px;"/>
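For regression, the "mean prediction of the individual trees" can be verified directly, since scikit-learn exposes the fitted trees in the `estimators_` attribute. The sketch below uses synthetic data (the `toy_rf_*` names and values are made up).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic toy data (hypothetical): 50 samples, 2 features
toy_rf_rng = np.random.default_rng(0)
toy_rf_X = toy_rf_rng.uniform(0, 1, size=(50, 2))
toy_rf_y = toy_rf_X[:, 0] + 2 * toy_rf_X[:, 1]

toy_forest = RandomForestRegressor(n_estimators=5, random_state=0)
toy_forest.fit(toy_rf_X, toy_rf_y)

# Prediction of each individual tree for the first 3 samples
toy_tree_predictions = np.array([tree.predict(toy_rf_X[:3])
                                 for tree in toy_forest.estimators_])

# The forest prediction is the mean over the trees
toy_rf_manual = toy_tree_predictions.mean(axis=0)
print(np.allclose(toy_rf_manual, toy_forest.predict(toy_rf_X[:3])))  # True
```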

In [None]:
from sklearn.ensemble import RandomForestRegressor

random_forest_conv = RandomForestRegressor(n_estimators=10)
# n_estimators=number of trees in the forest
                                           
random_forest_conv.fit(x_train_c,y_train_c) 

plot_reliability_plots(random_forest_conv, x_train_c, x_test_c, 
                       y_train_c, y_test_c )

In [None]:
print("Random forests model: Conventional avocados")
compute_mae(random_forest_conv, x_train_c, x_test_c, 
            y_train_c, y_test_c );

In [None]:
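from sklearn.ensemble import GradientBoostingRegressor

# Gradient boosting: trees are added one at a time, each one correcting the previous ones
gradient_boosting_conv = GradientBoostingRegressor()
# n_estimators = number of boosting stages (trees); the default value is used here
                                           
gradient_boosting_conv.fit(x_train_c,y_train_c) 

plot_reliability_plots(gradient_boosting_conv, x_train_c, x_test_c, 
                       y_train_c, y_test_c )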


## Models summary

In [None]:
print("Linear model: Conventional avocados")
compute_mae(conventional_lineal_model, x_train_c, x_test_c, 
            y_train_c, y_test_c );

print("\nKNN model: Conventional avocados")
compute_mae(knn_model_conv, x_train_c, x_test_c, 
            y_train_c, y_test_c );

print("\nRandom forests model: Conventional avocados")
compute_mae(random_forest_conv, x_train_c, x_test_c, 
            y_train_c, y_test_c );
