#### Problem Tutorial 1: Regression Model

We want to predict the gas consumption (in millions of gallons/year) in 48 of the US states
based on some key features. 

These features are:
 * petrol tax (in cents);
 * per capita income (in US dollars);
 * paved highway (in miles); and
 * population of people with driving licences.

<table>
  <tr><td>
    <img src="https://informedinfrastructure.com/wp-content/uploads/2012/06/traffic-jam.jpg"
         alt="Traffic jam" width="600">
  </td></tr>
  <tr><td></td></tr>
  <tr><td>
  <img src="https://miro.medium.com/max/593/1*pfmeGgGM5sxmLBQ5IQfQew.png"
         alt="Matrix" width="600">
  <tr><td></td></tr>
  <tr><td>And seems like a bad consumption problem to have ...</td></tr>
</table>
  
#### Solution:

Since this is a regression problem, where the target is a continuous numeric value, we can use the
common Random Forest algorithm in scikit-learn. Most regression models are evaluated with
four [standard evaluation metrics](https://medium.com/usf-msds/choosing-the-right-metric-for-machine-learning-models-part-1-a99d7d7414e4):

* Mean Absolute Error (MAE)
* Mean Squared Error (MSE)
* Root Mean Squared Error (RMSE)
* R-squared (r2)

This example is borrowed from this [source](https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn/) and has been modified and modularized for this tutorial.
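
As a minimal sketch of the four evaluation metrics listed above, the following plain-Python function computes them directly from their definitions. The function name `regression_metrics` and the sample values are illustrative assumptions, not part of the tutorial's classes or the petrol dataset. In practice, `sklearn.metrics` offers equivalents such as `mean_absolute_error`, `mean_squared_error`, and `r2_score`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R-squared for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # average absolute error
    mse = sum(e * e for e in errors) / n    # average squared error
    rmse = math.sqrt(mse)                   # same units as the target

    # R-squared: 1 minus (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Illustrative values only (hypothetical, not taken from the petrol dataset)
y_true = [500.0, 600.0, 700.0, 800.0]
y_pred = [520.0, 580.0, 730.0, 790.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Lower MAE, MSE, and RMSE indicate smaller prediction errors, while an R-squared closer to 1 indicates that the model explains more of the variance in the target.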

Aims of this tutorial:

1. Understand MLflow Tracking API
2. How to use the MLflow Tracking API
3. Use the MLflow API to experiment with several runs
4. Interpret and observe runs via the MLflow UI

Some Resources:
* https://mlflow.org/docs/latest/python_api/mlflow.html
* https://www.saedsayad.com/decision_tree_reg.htm
* https://towardsdatascience.com/understanding-random-forest-58381e0602d2
* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
* https://towardsdatascience.com/regression-an-explanation-of-regression-metrics-and-what-can-go-wrong-a39a9793d914
* https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/

Define all the classes and bring them into scope

In [2]:
%run ../../../tracking/notebooks/jupyter/setup/lab_utils_cls.ipynb
%run ../../../tracking/notebooks/jupyter/setup/rfr_regression_cls.ipynb
%run ..tracking/setup/rfc_classification_cls.ipynb
%run ..tracking/setup/rfr_regression_base_exp_cls.ipynb

ERROR:root:File `'../tracking/setup/lab_utils_cls.ipynb.py'` not found.
ERROR:root:File `'..tracking/setup/rfr_regression_cls.ipynb.py'` not found.
ERROR:root:File `'..tracking/setup/rfc_classification_cls.ipynb.py'` not found.
ERROR:root:File `'..tracking/setup/rfr_regression_base_exp_cls.ipynb.py'` not found.


### Load the Dataset

In [5]:
# Load and print dataset
dataset = Utils.load_data("https://raw.githubusercontent.com/dmatrix/mlflow-workshop-part-1/master/data/petrol_consumption.csv")
dataset.head(5)

Get descriptive statistics for the features

In [7]:
dataset.describe()

In [8]:
# Iterate over several runs with different parameters, such as the number of trees.
# As an exercise, try changing max_depth and the number of estimators, and consult the documentation
# for other tuning parameters that may improve the outcome; supply them to the class constructor.
#
max_depth = 0
for n in range(20, 250, 50):
  max_depth = max_depth + 2
  params = {"n_estimators": n, "max_depth": max_depth}
  rfr = RFRModel.new_instance(params)
  (experimentID, runID) = rfr.mlflow_run(dataset, run_name="Regression Petrol Consumption Model", verbose=True)
  print("MLflow Run completed with run_id {} and experiment_id {}".format(runID, experimentID))
  print("-" * 100)
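
To see which parameter combinations the loop above actually tries, the following sketch reproduces only its parameter schedule, without training a model or calling MLflow. The `schedule` list is a name introduced here for illustration.

```python
# Reproduce only the parameter schedule of the loop, without training any model
max_depth = 0
schedule = []
for n in range(20, 250, 50):
    max_depth = max_depth + 2
    schedule.append({"n_estimators": n, "max_depth": max_depth})

for params in schedule:
    print(params)
```

The loop produces five runs: `n_estimators` takes the values 20, 70, 120, 170, and 220, while `max_depth` grows from 2 to 10 in steps of 2. Because both parameters change together, a change in a metric between runs cannot be attributed to either one alone.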

**Note**:

With 20 trees, the root mean squared error is `64.93`, which is greater than 10 percent of the average petrol consumption (`576.77`), i.e., greater than about `57.68`.
This may suggest that we have not used enough estimators (trees).
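
The comparison in this note can be checked with a few lines of arithmetic. The sketch below uses only the two figures quoted above; the variable names are illustrative.

```python
mean_consumption = 576.77  # average petrol consumption quoted in the note
rmse = 64.93               # RMSE reported with 20 trees

threshold = 0.10 * mean_consumption      # 10 percent of the mean
relative_rmse = rmse / mean_consumption  # RMSE as a fraction of the mean
exceeds = rmse > threshold

print(f"10% threshold: {threshold:.2f}")
print(f"RMSE as a share of the mean: {relative_rmse:.1%}")
print(f"RMSE exceeds the threshold: {exceeds}")
```

The RMSE works out to roughly 11.3 percent of the mean, slightly above the 10 percent rule of thumb used in this note.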

#### Exercise Assignment. Try different runs:
1. Change or add parameters, such as the depth of the tree or `random_state: 42`.
2. Change the range of runs and the increments of `n_estimators`.
3. Check in the MLflow UI whether the metrics are affected.
4. Load the best model as a PyFuncModel (**HINT**: Check how MLflow Project Module 2 loaded the pyfunc model and scored it).
5. Score it with some [test data](https://github.com/dmatrix/olt-mlflow/blob/master/tracking/data/test_petrol_consumption.csv) used in our Tracking Module. (**HINT**: Drop the last column, as you want to predict it.)