# Model training 🛠

Let's create a model that can predict the generated power, given the wind speed. <br>
In this notebook we use the package `scikit-learn` to prepare the data and train a model. <br>
We say *train*, but for now, we'll actually just be *fitting* a linear regression model to the data.

In [None]:
import pandas as pd
import plotly_express as px
import sklearn
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.linear_model

In [None]:
data_path = "../data/turbine-data.csv"
data = pd.read_csv(data_path).set_index("timestamp")
data.index = pd.to_datetime(data.index)

Let's do some quick preparation of the data before we train a model:

In [None]:
# Drop data with missing values
data_without_na = data.dropna() 

X = data_without_na[["wind_speed"]]  # Our model's input
y = data_without_na["active_power"]  # Target values

In [None]:
# Hold out a test set which the model will not see during training, 
# so we can evaluate the model's performance on unseen data.
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X,
    y,
    shuffle=False,
    # use (only) 10% of all data for training
    train_size=0.1  # don't change this number
)
print(f"Training data: from {X_train.index.min()} to {X_train.index.max()}")
print(f"Testing data: from {X_test.index.min()} to {X_test.index.max()}")
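
Because `shuffle=False`, the split is chronological: the earliest 10% of the rows become the training set and everything after that is the test set. A minimal sketch on a made-up daily series illustrates this. The `toy` frame and its values are hypothetical stand-ins, not the turbine data.

```python
import numpy as np
import pandas as pd
import sklearn.model_selection

# Hypothetical toy data: 100 daily rows standing in for the turbine data
index = pd.date_range("2021-01-01", periods=100, freq="D")
toy = pd.DataFrame({"wind_speed": np.linspace(0, 20, 100)}, index=index)
toy["active_power"] = 50 * toy["wind_speed"]

toy_X_train, toy_X_test, toy_y_train, toy_y_test = sklearn.model_selection.train_test_split(
    toy[["wind_speed"]],
    toy["active_power"],
    shuffle=False,
    train_size=0.1,
)

# Without shuffling, every training timestamp comes before every test timestamp
print(len(toy_X_train), len(toy_X_test))
print(toy_X_train.index.max() < toy_X_test.index.min())
```

Keeping the time order means the model is always evaluated on data that is newer than what it was trained on, which is closer to how it would be used in practice.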

Now, we create a model. Here we start small with a linear regression model. In the next notebook exercise you may make this model as fancy as you like!

In [None]:
model = sklearn.pipeline.make_pipeline(
    sklearn.linear_model.LinearRegression(),
)

And train the model!

In [None]:
model.fit(X_train, y_train)
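
With a single input feature, fitting a linear regression comes down to finding one slope and one intercept. The sketch below shows how these can be read from a fitted pipeline. It uses made-up numbers (the hypothetical `toy_X` and `toy_y`) so that it runs on its own, but the same attributes should be available on the `model` fitted above.

```python
import pandas as pd
import sklearn.linear_model
import sklearn.pipeline

# Hypothetical toy data that follows power = 3 * wind_speed + 2 exactly
toy_X = pd.DataFrame({"wind_speed": [1.0, 2.0, 3.0, 4.0]})
toy_y = 3 * toy_X["wind_speed"] + 2

toy_model = sklearn.pipeline.make_pipeline(
    sklearn.linear_model.LinearRegression(),
)
toy_model.fit(toy_X, toy_y)

# The last pipeline step is the fitted LinearRegression estimator
regressor = toy_model.steps[-1][1]
slope = regressor.coef_[0]
intercept = regressor.intercept_
print(f"power = {slope:.2f} * wind_speed + {intercept:.2f}")
```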

And evaluate:

In [None]:
score = model.score(X_test, y_test)
# Uncomment the next line to compare with the score on the training data
# score = model.score(X_train, y_train)
score

Pretty good! <br>
Or... is it?

First of all, what does `score()` actually compute? You can use the following code cell to find out. What's better, a higher or lower score?

In [None]:
??model.steps[-1][1].score
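
Once you have read the docstring, you can check your understanding by hand. For scikit-learn regressors, `score()` returns the coefficient of determination R²: one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch, on hypothetical toy numbers rather than the turbine data:

```python
import numpy as np
import sklearn.linear_model

# Hypothetical toy data, not the turbine data
toy_X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
toy_y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

toy_model = sklearn.linear_model.LinearRegression().fit(toy_X, toy_y)
predictions = toy_model.predict(toy_X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((toy_y - predictions) ** 2)
ss_tot = np.sum((toy_y - toy_y.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot

# Both numbers should agree up to floating-point error
print(r2_by_hand, toy_model.score(toy_X, toy_y))
```

A value of 1.0 means perfect predictions, 0.0 is what a model that always predicts the mean of the target would get, and the value can be negative on unseen data.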

Second, how do we know whether this is *good* compared to other runs on this dataset, or compared to the models of your neighbours?

Let's find out how we can track these experiments better in the next exercise, and learn how experiment tracking improves the MLOps workflow.