In [None]:
%pip install --upgrade --quiet pandas mlflow plotly-express nbformat scikit-learn fastparquet pyarrow

In [None]:
import os

import mlflow
import pandas as pd
import plotly_express as px
import sklearn
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.linear_model
import sklearn.ensemble
import sklearn.tree

Load a dataset containing some SCADA data.

<small>Note: this is publicly available data, not Vattenfall's data</small>

In [None]:
url = "https://raw.githubusercontent.com/dunnkers/experiment-tracking-with-mlflow/main/data/scada.parquet"
data = pd.read_parquet(url)
data

## Explore the data

First, let's check out the "Active power" column against time.

In [None]:
px.line(data, y="active_power")

And also the wind speed:

In [None]:
fig = px.line(data, y="wind_speed")
fig.update_traces(line_color="#FECB52")
fig

Instead of using time as the x-axis, let's try putting the wind speed there and plot it against power. This will give us a power curve! 📈

In [None]:
fig = px.scatter(data, x="wind_speed", y=["active_power", "theoretical_power"])
fig.update_layout(
    xaxis_title="Wind speed (m/s)",
    yaxis_title="Power",
)
fig

## Training a model
Next step! Let's train a model to predict power output. By comparing the predicted power with the measured power, we can check whether something might be wrong with the turbine.

Choose features and the target, constructing `X` and `y`:

In [None]:
data_without_na = data.dropna()
X = data_without_na[[
    "wind_speed",
    # "wind_direction",
    # "is_curtailed"
]]
y = data_without_na["active_power"]
print(f"Features: {X.columns.to_list()}, target: {y.name}")

Split the data into a training set and a test set:

In [None]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X,
    y,
    shuffle=False,
    train_size=0.1 # use (only) 10% of all data for training
)
test_index_min = X_test.index.min()
test_index_max = X_test.index.max()
print(f"Training data from {X_train.index.min()} to {X_train.index.max()}")
print(f"Testing data from {test_index_min} to {test_index_max}")
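
With `shuffle=False`, the split is chronological: the first part of the rows becomes the training set and everything after it becomes the test set. The idea can be sketched in plain Python. The helper `chronological_split` and the toy `timestamps` list are hypothetical stand-ins, and scikit-learn's exact rounding of the split sizes may differ from this sketch.

```python
def chronological_split(rows, train_fraction):
    """Split an ordered sequence into a leading train part and a trailing test part."""
    n_train = int(len(rows) * train_fraction)
    return rows[:n_train], rows[n_train:]


# Hypothetical stand-in for 20 consecutive SCADA timestamps
timestamps = list(range(20))
train, test = chronological_split(timestamps, train_fraction=0.1)

print(train)              # [0, 1]
print(test[0], test[-1])  # 2 19
```

Because nothing is shuffled, every training row comes before every test row, so the model is always evaluated on data from after its training period.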

Set up the model:

In [None]:
model = sklearn.pipeline.make_pipeline(
    # Preprocessing
    # sklearn.preprocessing.StandardScaler(),
    # sklearn.preprocessing.PolynomialFeatures(degree=3),

    # Model
    sklearn.linear_model.LinearRegression(),
    # sklearn.linear_model.Ridge(),
    # sklearn.tree.DecisionTreeRegressor(),
    # sklearn.ensemble.RandomForestRegressor(),
)
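
The commented-out `PolynomialFeatures(degree=3)` step expands each input into its powers up to the given degree, which lets a linear model fit a curve instead of a straight line. For a single feature such as `wind_speed`, the idea can be sketched in plain Python. The helper `polynomial_features` is hypothetical and only mimics the single-feature case; the scikit-learn transformer also builds interaction terms when there are several features.

```python
def polynomial_features(x, degree):
    """Return [1, x, x**2, ..., x**degree] for a single input value."""
    return [x ** power for power in range(degree + 1)]


wind_speed = 2.0  # hypothetical wind speed in m/s
features = polynomial_features(wind_speed, degree=3)
print(features)  # [1.0, 2.0, 4.0, 8.0]
```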

Train model

In [None]:
model.fit(X_train, y_train)

Evaluate model

In [None]:
score = model.score(X_test, y_test)
score
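
For regressors, `model.score` returns the coefficient of determination, R². A score of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean of the target. Here is a minimal sketch of that calculation in plain Python. The function `r2_score` below is written out by hand for illustration, and the numbers are made up.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


measured = [0.0, 100.0, 200.0, 300.0]  # made-up power values
perfect = list(measured)               # predictions that match exactly
mean_only = [150.0] * 4                # always predict the mean

print(r2_score(measured, perfect))    # 1.0
print(r2_score(measured, mean_only))  # 0.0
```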

Predict on all data

In [None]:
data.loc[X.index, "predictions"] = model.predict(X)
data.loc[X.index, "residuals"] = data["active_power"] - data["predictions"]
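
The residuals are the gap between measured and predicted power. One possible way to use them, sketched here under assumptions, is to flag the points where that gap is unusually large. The helper `flag_large_residuals`, the threshold of 200, and the numbers are all hypothetical; a sensible threshold depends on the turbine and the model.

```python
def flag_large_residuals(measured, predicted, threshold):
    """Return the positions where |measured - predicted| exceeds the threshold."""
    return [
        i
        for i, (m, p) in enumerate(zip(measured, predicted))
        if abs(m - p) > threshold
    ]


measured = [500.0, 1200.0, 300.0, 2000.0]    # made-up measured power
predicted = [520.0, 1180.0, 1500.0, 1990.0]  # made-up model output

flagged = flag_large_residuals(measured, predicted, threshold=200.0)
print(flagged)  # [2]
```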

Make a nice plot of the predictions 📊

In [None]:
fig = px.line(
    data[test_index_min:],
    y=["active_power", "predictions"]
)
fig

## 📝 Assignments
Well, we just trained a model and visualized the results. That was great! But can we now track the results using **MLFlow**? Let's try it out! Don't forget to help each other 🤝!

💡 Tip: use the [MLFlow documentation](https://www.mlflow.org/docs/latest/tracking.html#logging-functions)

0. **📝 Make it yours**

    First, let's make sure that whatever will be logged will be associated with you.

    Fill out your name in the `name` variable below.

1. **🤔 What does `start_run` do?**

    What does the `with` statement do in Python? What does [`mlflow.start_run()`](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.start_run) do?
    
    Run the code and find your run in the MLFlow UI. Find the link below. Try showing the _User_ column to find your run.

    - ✓ You can find your run in the MLFlow UI.

    - ⤫ Ask for help if you cannot find your run.

2. **📦 Logging the model**

    Now, log the trained model using [`mlflow.sklearn.log_model`](https://mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.log_model). The trained model is stored in the `model` variable.

    - ✓ You should now see _Artifacts_ in your MLFlow experiment run.

    <small>🎁 Bonus: What does the `registered_model_name` parameter do in the `log_model` function?</small>

3. **📈 Logging the R2 score**

    Next, log the model score using [`mlflow.log_metric`](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_metric). Call your metric `R2_score`. The model score is stored in the `score` variable.

    - ✓ You should now see the R2 score in the _Metrics_ tab of your run.

4. **📈 Logging the plot**

    It would also be good to have that plot available in the MLFlow UI. Try to log the plot using [`mlflow.log_figure`](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_figure).

    💡 Tip: save the plot with a `.html` extension to make the plot interactive. If you don't want an interactive plot, you can use `.png`.

    - ✓ Find the plot saved in the _Artifacts_ tab.

5. **🎯 Improving the model**

    The model is okay, but it can be improved a lot 😊. Go into the code above and see which improvements you can make.

    Think of things like additional features, feature processing, or a different model. Once you have made your modifications, run the code again and log the results using MLFlow 👍🏻.
    
    <small>💡 Tip: Perhaps some of the commented lines can help you out 😏.</small>

    - ✓ You improved the model R2 score!

    <small>🎁 Bonus: What else is useful to log to mlflow? Extend your experiment logging.</small>
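
As a hint for assignment 1: a `with` statement calls an object's `__enter__` method when the block starts and its `__exit__` method when the block ends, even if an exception is raised inside it. The toy class below is a hypothetical stand-in that only records what happens; it is not MLFlow's implementation, but the object returned by `mlflow.start_run()` can be used in a `with` statement in the same way.

```python
class ToyRun:
    """Toy context manager that records when a 'run' starts and ends."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("start")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # exc_type is None when the block finished without an exception
        status = "FINISHED" if exc_type is None else "FAILED"
        self.events.append(f"end ({status})")
        return False  # do not swallow exceptions


run = ToyRun()
with run:
    run.events.append("log things")

print(run.events)  # ['start', 'log things', 'end (FINISHED)']
```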


In [None]:
# TODO [Assignment 0]: fill in your name below 📝
your_name: str = "< fill in your name here >"

# mlflow setup
os.environ["LOGNAME"] = your_name
mlflow.set_tracking_uri("http://20.31.89.132:5000")

In [None]:
# TODO [Assignment 1]: what does `mlflow.start_run()` do?
with mlflow.start_run():
    # Log the feature names as a plain list, e.g. "['wind_speed']"
    mlflow.log_param("features", str(X.columns.to_list()))
    mlflow.log_param("model_name", str(model))

    # TODO [Assignment 2]: log the model using mlflow.sklearn.log_model
    ...

    # TODO [Assignment 3]: log the model test score as `R2_score` using mlflow.log_metric
    ...

    # TODO [Assignment 4]: log the plotly figure using mlflow.log_figure
    ...


# TODO [Assignment 5]: Go back to the notebook code above ^^ and make improvements to the model!

Visit the MLFlow server:

[http://20.31.89.132:5000/](http://20.31.89.132:5000/)

Find your experiment!

🎉