#### Problem - Tutorial 2: Regression

<table>
  <tr><td>
    <img src="https://brokeassstuart.com/wp-content/pictsnShit/2019/07/inside-airbnb-1263x560.jpg"
         alt="Inside Airbnb" width="600">
  </td></tr>
  <tr><td></td></tr>
  <tr><td>
    <img src="https://miro.medium.com/max/593/1*pfmeGgGM5sxmLBQ5IQfQew.png"
         alt="Matrix" width="600">
  </td></tr>
  <tr><td></td></tr>
  <tr><td>Can we predict AirBnB prices in SF ...</td></tr>
</table>

Source: Databricks Learning Academy MLflow Course

The code has been refactored to make it modular.

While building and iterating on models, data scientists often start with a baseline model to see how it performs.
They then run experiments, changing parameters or hyperparameters to see whether the metrics move closer
to the level they are aiming for.

This is our baseline model: a RandomForestRegressor that predicts AirBnB house prices in SF.
Given 22 features, can we predict what the next house price will be?

We will compute standard evaluation metrics and log them.
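This notebook does not spell out which metrics get logged. For a regression model the usual candidates are MAE, MSE, RMSE, and R2. The sketch below computes them in plain Python so the formulas stay visible. `regression_metrics` is a hypothetical helper, not part of `RFHousePriceModel`, and the prices are made-up illustration values rather than AirBnB data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # same units as the target

    # R2 = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Made-up actual vs. predicted prices, for illustration only
metrics = regression_metrics(
    [100.0, 150.0, 200.0, 250.0],
    [110.0, 140.0, 210.0, 240.0],
)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In practice, `sklearn.metrics` provides `mean_absolute_error`, `mean_squared_error`, and `r2_score`, which give the same quantities.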

The aims of this module are to:

1. Introduce tracking ML experiments in MLflow
2. Log an experiment run and explore the results in the UI
3. Record parameters, metrics, and model artifacts

Some Resources:
* https://mlflow.org/docs/latest/python_api/mlflow.html
* https://www.saedsayad.com/decision_tree_reg.htm
* https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn/
* https://towardsdatascience.com/understanding-random-forest-58381e0602d2
* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
* https://towardsdatascience.com/explaining-feature-importance-by-example-of-a-random-forest-d9166011959e
* https://seaborn.pydata.org/tutorial/regression.html

In [None]:
%run ./setup/lab_utils_cls.ipynb
%run ./setup/rfr_regression_cls.ipynb
%run ./setup/rfc_classification_cls.ipynb
%run ./setup/rfr_regression_base_exp_cls.ipynb

In [None]:
# load the data
dataset = Utils.load_data("https://raw.githubusercontent.com/dmatrix/mlflow-workshop-part-1/master/data/airbnb-cleaned-mlflow.csv")
dataset.head()

In [None]:
dataset.describe()

In [None]:
# To try different experiment runs, add more parameter dictionaries to this list.
# Each dictionary gets its own model instance, so you can compare how different
# parameters affect the evaluation metrics.
# Exercises 1, 2, 3 & 4:
# 1) Add key-value parameters to this list
# 2) Iterate over the list
# 3) Compute R2 in the RFHousePriceModel class
# 4) Compute the model signature and save it as part of the model
params_list = [{"n_estimators": 75, "max_depth": 5, "random_state": 42}]
# Run these experiments, each with its own model instance built from the supplied parameters.
for params in params_list:
  rfr = RFHousePriceModel.new_instance(params)  
  experiment = "Experiment with {} trees".format(params['n_estimators'])
  (experimentID, runID) = rfr.mlflow_run(dataset, run_name="AirBnB House Pricing Regression Model", verbose=True)
  print("MLflow Run completed with run_id {} and experiment_id {}".format(runID, experimentID))
  print("-" * 100)

### Let's Explore the MLflow UI
 * Add notes and tags
 * Compare runs: pick the two best runs
 * Annotate with descriptions and tags
 * Evaluate the best run

In [None]:
!mlflow ui

#### Exercise Assignment: Try different runs
1. Change or add parameters, such as the depth of the tree (`max_depth`) or `random_state` (for example, 42)
2. Change the range of runs and the increments of `n_estimators`
3. Compute the R2 metric in the `RFHousePriceModel` class and log the metric
4. Determine the model signature and save it as part of the model
5. Check in the MLflow UI whether the metrics are affected
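One way to approach items 1 and 2 is to generate `params_list` instead of typing each dictionary by hand. The sketch below is only a suggestion: the ranges and the `max_depth` values are arbitrary choices for illustration, not recommended settings.

```python
from itertools import product

# Arbitrary illustration values: 50, 100, 150, 200 trees and two tree depths
n_estimators_values = range(50, 201, 50)
max_depth_values = [5, 10]

# One parameter dictionary per (n_estimators, max_depth) combination
params_list = [
    {"n_estimators": n, "max_depth": d, "random_state": 42}
    for n, d in product(n_estimators_values, max_depth_values)
]

for params in params_list:
    print(params)
```

The resulting list can be used in place of the single-entry `params_list` in the loop earlier in this notebook, producing one MLflow run per dictionary.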

#### HOMEWORK CHALLENGE

 1. Consult the [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) documentation to see what hyperparameters you can specify
    * Change or add parameters, such as the depth of the tree
 2. Change the range of runs and the increments of `n_estimators`
 3. Use [scikit-learn cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) to see whether the metrics differ
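As a starting point for item 3, here is a minimal sketch of `cross_validate` with a `RandomForestRegressor`. It assumes synthetic data from `make_regression` as a stand-in for the AirBnB dataset, and the fold count and scoring choices are illustrative, so swap in your own features and target.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the AirBnB data (assumption: 22 numeric features)
X, y = make_regression(n_samples=200, n_features=22, noise=10.0, random_state=42)

model = RandomForestRegressor(n_estimators=75, max_depth=5, random_state=42)

# 5-fold cross validation, scoring each fold on R2 and (negated) MAE
cv_results = cross_validate(
    model, X, y, cv=5, scoring=("r2", "neg_mean_absolute_error")
)

mean_r2 = cv_results["test_r2"].mean()
# scikit-learn negates error metrics so that higher is always better; flip the sign back
mean_mae = -cv_results["test_neg_mean_absolute_error"].mean()

print(f"mean R2 across folds:  {mean_r2:.3f}")
print(f"mean MAE across folds: {mean_mae:.3f}")
```

Comparing these fold averages with the metrics from a single train/test split shows how much the baseline's scores depend on the particular split.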