# Machine learning application: Forecasting wind power

<table>
  <tr><td>
    <img src="https://github.com/dmatrix/mlflow-workshop-part-3/raw/master/images/wind_farm.jpg"
         alt="Wind farm" width="800">
  </td></tr>
</table>

In this notebook, we will use the MLflow Model Registry to build a machine learning application that forecasts the daily power output of a [wind farm](https://en.wikipedia.org/wiki/Wind_farm). 

Wind farm power output depends on weather conditions: generally, more energy is produced at higher wind speeds. Accordingly, the machine learning models used in the notebook predict power output from weather forecasts of three quantities: `wind direction`, `wind speed`, and `air temperature`. Each quantity appears at three times of day, in the data columns suffixed `_00`, `_08`, and `_16`.

*This notebook uses altered data from the [National WIND Toolkit dataset](https://www.nrel.gov/grid/wind-toolkit.html) provided by NREL, which is publicly available and cited as follows:*

*Draxl, C., B.M. Hodge, A. Clifton, and J. McCaa. 2015. Overview and Meteorological Validation of the Wind Integration National Dataset Toolkit (Technical Report, NREL/TP-5000-61740). Golden, CO: National Renewable Energy Laboratory.*

*Draxl, C., B.M. Hodge, A. Clifton, and J. McCaa. 2015. "The Wind Integration National Dataset (WIND) Toolkit." Applied Energy 151: 355-366.*

*Lieberman-Cribbin, W., C. Draxl, and A. Clifton. 2014. Guide to Using the WIND Toolkit Validation Code (Technical Report, NREL/TP-5000-62595). Golden, CO: National Renewable Energy Laboratory.*

*King, J., A. Clifton, and B.M. Hodge. 2014. Validation of Power Output for the WIND Toolkit (Technical Report, NREL/TP-5D00-61714). Golden, CO: National Renewable Energy Laboratory.*


### Classes and Utility functions

In [1]:
!pip install mlflow
!pip install scikit-learn



In [2]:
import pandas as pd
import time
import warnings
import mlflow
from mlflow.tracking.client import MlflowClient
from mlflow.entities.model_registry.model_version_status import ModelVersionStatus
warnings.filterwarnings("ignore")
print("Using mlflow version {}".format(mlflow.__version__))

Using mlflow version 1.11.0


In [3]:
class Utils:
  @staticmethod
  def load_data(path, index_col=0):
    df = pd.read_csv(path, index_col=index_col)
    return df
  
  @staticmethod
  def get_training_data(df):
    # From 2014 through 2018 and drop the power column since that
    # is our dependent variable
    
    training_data = pd.DataFrame(df["2014-01-01":"2018-01-01"])
    X = training_data.drop(columns="power")
    
    # Get our dependent variable values
    y = training_data["power"]
    return X, y

  @staticmethod
  def get_validation_data(df):
    # From 2018 through 2019 and drop the power column since that
    # is our dependent variable
    
    validation_data = pd.DataFrame(df["2018-01-01":"2019-01-01"])
    X = validation_data.drop(columns="power")
    
    # Get our dependent variable values
    y = validation_data["power"]
    return X, y
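
The two helpers above split the data by slicing on date labels. The sketch below uses a small made-up frame (the column values and dates are illustrative, not taken from the dataset) to show one property of that approach: pandas label slices include both endpoints, so a row labeled `2018-01-01` would land in both the training and the validation split.

```python
import pandas as pd

# Toy frame with a sorted string date index, mirroring the CSV's first column.
# All values here are made up for illustration only.
toy = pd.DataFrame(
    {"wind_speed_00": [4.7, 6.1, 10.5, 7.2, 4.3],
     "power": [1959.0, 1266.0, 7545.0, 3791.0, 880.0]},
    index=["2014-01-01", "2017-12-31", "2018-01-01", "2018-12-31", "2019-01-01"],
)

# Label-based slices include both the start and the end label.
train = toy.loc["2014-01-01":"2018-01-01"]
valid = toy.loc["2018-01-01":"2019-01-01"]

# Labels that appear in both splits.
overlap = train.index.intersection(valid.index)
print(list(overlap))  # ['2018-01-01']
```

If a strict separation matters, one option is to end the training slice at `"2017-12-31"`.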

### Define our model and utility classes 

The imports and the `RFRModel` wrapper class below encapsulate training a scikit-learn random forest regressor and logging its parameters, metrics, and model to MLflow.

In [4]:
import mlflow.sklearn
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

In [5]:
class RFRModel():
   def __init__(self, params=None):
      # Use None instead of a shared mutable default argument
      self.params = params if params is not None else {}
      self.rf = RandomForestRegressor(**self.params)
      self._mse = None
      self._rsme = None

   @classmethod
   def new_instance(cls, params=None):
      return cls(params)

   @property
   def model(self):
      return self.rf

   @property
   def mse(self):
      return self._mse

   @mse.setter
   def mse(self, value):
      self._mse = value

   @property
   def rsme(self):
      return self._rsme

   @rsme.setter
   def rsme(self, value):
      self._rsme = value

   def mlflow_run(self, X_train, y_train, val_x, val_y, model_name,
                  run_name="Random Forest Regressor: Power Forecasting Model",
                  register=False, verbose=False):
      with mlflow.start_run(run_name=run_name) as run:
         # Train the model and predict on the validation set
         self.rf.fit(X_train, y_train)
         y_pred = self.rf.predict(val_x)

         # Compute metrics; scikit-learn expects (y_true, y_pred)
         self._mse = mean_squared_error(val_y, y_pred)
         self._rsme = np.sqrt(self._mse)

         if verbose:
            print("Validation MSE: %.2f" % self._mse)
            print("Validation RMSE: %.2f" % self._rsme)

         # Log params and metrics
         mlflow.log_params(self.params)
         mlflow.log_metric("mse", self._mse)
         mlflow.log_metric("rmse", self._rsme)

         # Passing `registered_model_name` registers the model with the
         # Model Registry, creating a new model version for each run
         if register:
            mlflow.sklearn.log_model(
               sk_model=self.model,
               artifact_path="sklearn-model",
               registered_model_name=model_name)
         else:
            mlflow.sklearn.log_model(
               sk_model=self.model,
               artifact_path="sklearn-model")

         run_id = run.info.run_id

      return run_id
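
The class logs two validation metrics, MSE and RMSE. As a minimal sketch of what those numbers mean, the plain-Python functions below (the names `mse` and `rmse` are ours, not part of scikit-learn or MLflow) compute them on a tiny made-up example. RMSE is the square root of MSE, so it is expressed in the same units as the power values.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))

# Made-up values: errors are 2 and 0, so MSE = (4 + 0) / 2 = 2.0
y_true = [3.0, 5.0]
y_pred = [1.0, 5.0]
print(mse(y_true, y_pred))   # 2.0
print(rmse(y_true, y_pred))  # square root of 2.0
```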

### Load our training data

In [6]:
csv_path = "https://raw.githubusercontent.com/dmatrix/ds4g-workshop/master/notebooks/data/windfarm_data.csv"
wind_farm_data = Utils.load_data(csv_path, index_col=0)
wind_farm_data.head(5)

Unnamed: 0,temperature_00,wind_direction_00,wind_speed_00,temperature_08,wind_direction_08,wind_speed_08,temperature_16,wind_direction_16,wind_speed_16,power
2014-01-01,4.702022,106.74259,4.743292,7.189482,100.41638,6.593832,8.172301,99.288,5.967206,1959.3535
2014-01-02,7.695733,98.036705,6.142715,9.977118,94.03181,4.383676,9.690135,204.25444,1.696528,1266.6239
2014-01-03,9.608235,274.0612,10.514304,10.840864,242.87563,16.869741,8.991079,250.2683,12.038399,7545.6797
2014-01-04,6.955563,257.91022,7.18917,5.317223,254.2617,9.069233,3.021174,284.06537,4.590843,3791.0408
2014-01-05,0.830547,265.3944,4.263086,2.480239,104.79496,3.042063,4.227131,263.4169,3.899182,880.6115


In [7]:
X_train, y_train = Utils.get_training_data(wind_farm_data)
val_x, val_y = Utils.get_validation_data(wind_farm_data)

### Initialize a set of hyperparameters for the training and try three runs

In [8]:
# Initialize our model hyperparameters
params_list = [{"n_estimators": 100},
               {"n_estimators": 200},
               {"n_estimators": 300}]
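
The list above varies a single hyperparameter by hand. If more than one hyperparameter were tuned, the same kind of list could be generated rather than typed. The sketch below is one way to do that with `itertools.product`; the `max_depth` values and the `expand_grid` helper are hypothetical additions for illustration, since the notebook itself only varies `n_estimators`.

```python
from itertools import product

# Hypothetical search space; only n_estimators appears in this notebook.
search_space = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10],
}

def expand_grid(space):
    """Return one params dict per combination of the listed values."""
    keys = list(space)
    return [dict(zip(keys, values))
            for values in product(*(space[k] for k in keys))]

params_grid = expand_grid(search_space)
print(len(params_grid))  # 6 combinations (3 x 2)
print(params_grid[0])    # {'n_estimators': 100, 'max_depth': None}
```

Each dict in `params_grid` has the same shape as an entry of `params_list`, so it could be passed to `RFRModel.new_instance` in the same way.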

In [9]:
# Train, fit and register our model, iterating over a few different tuning parameters
# Use sqlite:///mlruns.db as the local store for tracking and registry

mlflow.set_tracking_uri("sqlite:///mlruns.db")

model_name = "PowerForecastingModel"
for params in params_list:
  rfr = RFRModel.new_instance(params)
  print("Using parameters={}".format(params))
  runID = rfr.mlflow_run(X_train, y_train, val_x, val_y, model_name, register=True)
  print("MLflow run_id={} completed with MSE={} and RMSE={}".format(runID, rfr.mse, rfr.rsme))

Using parameters={'n_estimators': 100}


Registered model 'PowerForecastingModel' already exists. Creating a new version of this model...
Created version '4' of model 'PowerForecastingModel'.


MLflow run_id=58023990d79b49b7a850a38a9b51b724 completed with MSE=46492.33347710835 and RMSE=215.6208094714152
Using parameters={'n_estimators': 200}


Registered model 'PowerForecastingModel' already exists. Creating a new version of this model...
Created version '5' of model 'PowerForecastingModel'.


MLflow run_id=d836b6b407a54448a432b4954965c3af completed with MSE=46613.0563298187 and RMSE=215.9005704712674
Using parameters={'n_estimators': 300}
MLflow run_id=6c879fd443be4e32a04c43e89cb2210a completed with MSE=44590.924188993515 and RMSE=211.165632120839


Registered model 'PowerForecastingModel' already exists. Creating a new version of this model...
Created version '6' of model 'PowerForecastingModel'.


# Integrating Model Registry with CI/CD Forecasting Application

<table>
  <tr><td>
    <img src="https://github.com/dmatrix/mlflow-workshop-part-3/raw/master/images/forecast_app.png"
         alt="Forecasting application workflow" width="800">
  </td></tr>
</table>

1. Use the Model Registry to fetch different versions of the model
2. Score each model version
3. Select the best-scoring model
4. Promote the model to production after testing
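
Steps 3 and 4 can be sketched in a few lines. The helper `select_best_version` and the RMSE values below are hypothetical placeholders, not results from this notebook. The commented-out promotion call assumes the `MlflowClient.transition_model_version_stage` method from the MLflow 1.x client API, which may differ in other MLflow versions.

```python
def select_best_version(rmse_by_version):
    """Return the model version with the lowest validation RMSE."""
    if not rmse_by_version:
        raise ValueError("no scored versions to choose from")
    return min(rmse_by_version, key=rmse_by_version.get)

# Placeholder scores keyed by registry version number.
scores = {1: 250.0, 2: 240.0, 3: 245.0}
best = select_best_version(scores)
print(best)  # 2

# Promotion after testing (assumes the MLflow 1.x client API):
# from mlflow.tracking.client import MlflowClient
# MlflowClient().transition_model_version_stage(
#     name="PowerForecastingModel", version=best, stage="Production")
```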


### Let's Examine the MLflow UI

1. Launch the MLflow UI
2. Go to the link printed by `ngrok` below
 *  https://22345abcxyzz.ngrok.io
3. Examine some models and start comparing their metrics

In [10]:
# run tracking UI in the background
get_ipython().system_raw("mlflow ui --backend-store-uri sqlite:///mlruns.db --port 5000 &")

In [11]:
# create remote tunnel using ngrok.com to allow local port access
# borrowed from https://colab.research.google.com/github/alfozan/MLflow-GBRT-demo/blob/master/MLflow-GBRT-demo.ipynb#scrollTo=4h3bKHMYUIG6
!pip install pyngrok --quiet
from pyngrok import ngrok

# Terminate open tunnels if exist
ngrok.kill()

# Setting the authtoken (optional)
# Get your authtoken from https://dashboard.ngrok.com/auth
NGROK_AUTH_TOKEN = ""
if NGROK_AUTH_TOKEN:
    # Only register the token when one has been provided
    ngrok.set_auth_token(NGROK_AUTH_TOKEN)

# Open an HTTPS tunnel on port 5000 for http://localhost:5000
public_url = ngrok.connect(port="5000", proto="http", options={"bind_tls": True})
print("MLflow Tracking UI:", public_url)

MLflow Tracking UI: https://777300fb92ff.ngrok.io


### Define a helper function to load a PyFunc model from the registry
<table>
  <tr><td> Save a Keras Model Flavor and load as PyFunc Flavor</td></tr>
  <tr><td>
    <img src="https://raw.githubusercontent.com/dmatrix/mlflow-workshop-part-2/master/images/models_2.png"
         alt="" width="600">
  </td></tr>
</table>

In [12]:
def score_model(data, model_uri):
    # Load the registered model as a generic PyFunc model and predict
    model = mlflow.pyfunc.load_model(model_uri)
    return model.predict(data)
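
The `model_uri` passed to `score_model` follows the Model Registry scheme `models:/<name>/<version>`. As far as we know, the MLflow 1.x series also accepts a stage name such as `Production` in place of the version number. The small helper below, `registry_model_uri`, is a hypothetical convenience that only builds the string; it does not contact the registry.

```python
def registry_model_uri(name, version_or_stage):
    """Build a URI of the form models:/<name>/<version-or-stage>."""
    return "models:/{}/{}".format(name, version_or_stage)

# By explicit version number
print(registry_model_uri("PowerForecastingModel", 1))
# By stage name (assumes stage-based URIs are supported by your MLflow version)
print(registry_model_uri("PowerForecastingModel", "Production"))
```

Loading by stage lets an application pick up whichever version is currently promoted without changing its code.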

### Load scoring data

In [13]:
# Load the score data
score_path = "https://raw.githubusercontent.com/dmatrix/ds4g-workshop/master/notebooks/data/score_windfarm_data.csv"
score_df = Utils.load_data(score_path, index_col=0)
score_df.head()

Unnamed: 0,temperature_00,wind_direction_00,wind_speed_00,temperature_08,wind_direction_08,wind_speed_08,temperature_16,wind_direction_16,wind_speed_16,power
2020-12-27,7.123225,103.17663,8.133746,6.454002,107.79322,6.326991,7.219884,119.070526,3.062219,2621.476
2020-12-28,5.37627,118.08433,5.558247,8.118839,116.193535,8.565966,9.307176,120.26443,11.993913,5423.625
2020-12-29,8.593436,115.43259,12.18185,8.587968,112.93136,11.970859,8.956771,110.161095,11.301485,9132.115
2020-12-30,8.069033,103.169685,9.983466,7.930485,106.04551,6.381556,8.228901,111.60216,4.087358,3667.9927


In [14]:
# Drop the power column since we are predicting that value
actual_power = pd.DataFrame(score_df.power.values, columns=['power'])
score = score_df.drop("power", axis=1)

In [23]:
# Formulate the model URI to fetch from the model registry
model_uri = "models:/{}/{}".format(model_name, 1)

# Predict the power output
pred_1 = pd.DataFrame(score_model(score, model_uri), columns=["predicted_1"])
pred_1

Unnamed: 0,predicted_1
0,2842.050481
1,5167.698486
2,8700.441234
3,3824.168606


#### Combine with the actual power

In [24]:
actual_power["predicted_1"] = pred_1["predicted_1"]
actual_power

Unnamed: 0,power,predicted_1
0,2621.476,2842.050481
1,5423.625,5167.698486
2,9132.115,8700.441234
3,3667.9927,3824.168606


In [27]:
# Formulate the model URI to fetch from the model registry
model_uri = "models:/{}/{}".format(model_name, 2)

# Predict the Power output
pred_2 = pd.DataFrame(score_model(score, model_uri), columns=["predicted_2"])
pred_2

Unnamed: 0,predicted_2
0,2816.634472
1,5105.207946
2,8733.244066
3,3795.756599


In [28]:
actual_power["predicted_2"] = pred_2["predicted_2"]
actual_power

Unnamed: 0,power,predicted_1,predicted_2
0,2621.476,2842.050481,2816.634472
1,5423.625,5167.698486,5105.207946
2,9132.115,8700.441234,8733.244066
3,3667.9927,3824.168606,3795.756599


In [29]:
# Formulate the model URI to fetch from the model registry
model_uri = "models:/{}/{}".format(model_name, 3)

# Predict the power output
pred_3 = pd.DataFrame(score_model(score, model_uri), columns=["predicted_3"])
pred_3

Unnamed: 0,predicted_3
0,2767.752638
1,5101.11854
2,8764.204688
3,3813.358353


### Combine the values into a single pandas DataFrame 

In [30]:
actual_power["predicted_3"] = pred_3["predicted_3"]
actual_power

Unnamed: 0,power,predicted_1,predicted_2,predicted_3
0,2621.476,2842.050481,2816.634472,2767.752638
1,5423.625,5167.698486,5105.207946,5101.11854
2,9132.115,8700.441234,8733.244066,8764.204688
3,3667.9927,3824.168606,3795.756599,3813.358353
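
With actual and predicted values side by side, one way to compare the versions is to compute an error metric per prediction column. The sketch below rebuilds the combined table from the values shown above and computes the RMSE of each column against `power`; the names `results`, `rmse_by_col`, and `best_col` are ours. On these four rows `predicted_3` has the lowest RMSE, though four rows are far too few to base a promotion decision on.

```python
import numpy as np
import pandas as pd

# Values copied from the combined table in this notebook.
results = pd.DataFrame({
    "power":       [2621.476, 5423.625, 9132.115, 3667.9927],
    "predicted_1": [2842.050481, 5167.698486, 8700.441234, 3824.168606],
    "predicted_2": [2816.634472, 5105.207946, 8733.244066, 3795.756599],
    "predicted_3": [2767.752638, 5101.11854, 8764.204688, 3813.358353],
})

# RMSE of each prediction column against the actual power
pred_cols = [c for c in results.columns if c.startswith("predicted_")]
rmse_by_col = {
    c: float(np.sqrt(((results[c] - results["power"]) ** 2).mean()))
    for c in pred_cols
}

# Column with the smallest error
best_col = min(rmse_by_col, key=rmse_by_col.get)
print(rmse_by_col)
print(best_col)
```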


### Let's Examine the MLflow UI

1. Go to the link printed by `ngrok` above
 *  https://12345abcxyz.ngrok.io
2. Navigate to the Model Registry page
3. Pick a version and transition its stage: None -> Production
