In [1]:
from os import environ
import pandas as pd

from cognite.client import CogniteClient
from cognite.model_hosting.data_spec import DataSpec, FileSpec

# Make sure to set your API key
API_KEY = environ["COGNITE_API_KEY"]
client = CogniteClient(api_key=API_KEY)

files = client.files
mlh = client.experimental.model_hosting

# Introduction

In this tutorial we will train a very simple linear regression model that predicts one value from two features.
The example is deliberately trivial so that the focus stays on Model Hosting.

We have some training data in the two files `data.csv` and `target.csv`.
The goal is to train a linear regression model that we can use to do prediction on new observed features.

Let's first take a look at the data:

In [2]:
pd.read_csv("data/data.csv")

Unnamed: 0,f1,f2
0,1,2
1,2,4
2,4,3
3,3,1
4,0,5
5,2,2


In [3]:
pd.read_csv("data/target.csv")

Unnamed: 0,y
0,15
1,20
2,25
3,20
4,15
5,18


Actually, our training data have been generated such that `y = 10 + 3*f1 + f2`, so a linear regression model should be able to fit this perfectly.
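
As a quick local sanity check, an ordinary least-squares fit on the six rows shown above should recover these coefficients. This is only a sketch: it uses NumPy (which the notebook itself does not import), and the values are copied in by hand rather than read from the CSV files.

```python
import numpy as np

# Feature rows (f1, f2) and targets y, copied from the tables above.
features = np.array([[1, 2], [2, 4], [4, 3], [3, 1], [0, 5], [2, 2]], dtype=float)
targets = np.array([15, 20, 25, 20, 15, 18], dtype=float)

# Prepend a column of ones so the first coefficient is the intercept.
design = np.column_stack([np.ones(len(features)), features])

# Solve the least-squares problem for the coefficient vector beta_hat.
beta_hat, *_ = np.linalg.lstsq(design, targets, rcond=None)
print(np.round(beta_hat, 6))
```

The three entries of `beta_hat` are the intercept followed by the weights for `f1` and `f2`.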

# 0. Upload data to CDP

Often, the data you need will already be available in CDP (Cognite Data Platform).
But since this is just some dummy data we will have to upload it first.
We'll simply upload it through the Files API:

In [4]:
data_file_id = files.upload_file(
    file_path="data/data.csv",
    file_name="data.csv",
    directory="examples/model_hosting/simple_train_and_predict",
    content_type="text/csv",
    overwrite=True
)["fileId"]
target_file_id = files.upload_file(
    file_path="data/target.csv",
    file_name="target.csv",
    directory="examples/model_hosting/simple_train_and_predict",
    content_type="text/csv",
    overwrite=True
)["fileId"]

# 1. Write the code

The first thing we need to do is write the code for our model.
We do this by creating a Python package.
Our package is named `linreg` and can be found in a folder with the same name.
It's just a regular Python package that is pip-installable.

It's required that your model resides in a class named `Model`, and that this class is inside a module (i.e. file) named `model` (`model.py`). You can read more about this in the docs.
Notice that our requirements (packages that our model needs) are defined in `setup.py`.
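
To give a feel for what such a class might look like, here is a hypothetical sketch. It is not the actual `linreg` package: the method names `train`, `load` and `predict`, the `open_artifact` callable, and passing the arrays directly to `train` are all assumptions made so the example runs locally. Check the Model Hosting docs for the real interface.

```python
import csv
import os
import tempfile

import numpy as np


class Model:
    """Hypothetical linear regression model class (illustration only)."""

    def __init__(self, coefficients):
        self.coefficients = np.asarray(coefficients, dtype=float)

    @staticmethod
    def train(open_artifact, features, targets):
        # Assumption: the hosted training routine would receive a data spec
        # and fetch the files itself; here the data is passed in directly.
        design = np.column_stack([np.ones(len(features)), features])
        beta_hat, *_ = np.linalg.lstsq(design, np.asarray(targets, dtype=float), rcond=None)
        # Persist the coefficients so that load() can restore them later.
        with open_artifact("coefficients.csv", "w") as f:
            csv.writer(f).writerows([[float(b)] for b in beta_hat])

    @staticmethod
    def load(open_artifact):
        with open_artifact("coefficients.csv", "r") as f:
            coefficients = [float(row[0]) for row in csv.reader(f) if row]
        return Model(coefficients)

    def predict(self, instance, precision=2):
        # Intercept plus the weighted sum of the features, rounded.
        value = self.coefficients[0] + np.dot(self.coefficients[1:], instance)
        return round(float(value), precision)


# Local stand-in for the artifact opener, backed by a temporary directory.
artifact_dir = tempfile.mkdtemp()


def open_artifact(name, mode):
    return open(os.path.join(artifact_dir, name), mode, newline="")


Model.train(
    open_artifact,
    [[1, 2], [2, 4], [4, 3], [3, 1], [0, 5], [2, 2]],
    [15, 20, 25, 20, 15, 18],
)
model = Model.load(open_artifact)
print(model.predict([1, 1], precision=4))
```

Splitting the class into a training step that writes an artifact and a loading step that reads it back keeps training and prediction independent of each other.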

# 2. Create a source package

Before we use our code in Model Hosting we have to upload it.
In Model Hosting, a Python package that defines a model is called a source package.

We will use the `build_and_upload_source_package` method to build a distribution of our source package and upload it to the hosting environment.

In [5]:
source_package_id = mlh.source_packages.build_and_upload_source_package(
    name="linreg-v01", 
    runtime_version="0.1",
    package_directory="linreg",
    description="My linreg model",
    metadata={"interesting-metadata": "anything"}, # Optional
).id

# 3. Create a model

A model in Model Hosting is an abstract resource that can consist of any number of model versions.
So before we create and train a specific model version, we need to have a model that will act as a parent container. You can read more about this in the docs.

In [6]:
model_id = mlh.models.create_model(
    name="tutorial-model",
    description="Some description", # Optional
    metadata={"interesting-metadata": "anything"}, # Optional
).id

# 4. Train and deploy a model version

A model version is a specific instance that is trained and can do prediction.
It uses some source package that you have created earlier and resides under some model.

We will pass in a data spec as an argument to the training routine to specify our training data.
In this tutorial, it's simply the two files we uploaded earlier.

In [7]:
version_id = mlh.models.train_and_deploy_model_version(
    model_id=model_id,
    name="tutorial-model-version",
    source_package_id=source_package_id,
    description="Some description", # Optional
    metadata={"interesting-metadata": "anything"}, # Optional
    args={
        "data_spec": DataSpec(files={"data": FileSpec(data_file_id), "target": FileSpec(target_file_id)})
    }
).id

Now we just have to wait for our model to be trained and deployed.
We can check the status until it's 'READY' (this will take several minutes).

<font color='red'>The rest of the notebook will not work if you don't wait until the model version is ready!</font>

In [11]:
mlh.models.get_model_version(model_id, version_id).status

'READY'
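
Instead of re-running the status cell by hand, the check can be wrapped in a small polling loop. The helper below is a minimal sketch: `wait_for_status` and its parameters are hypothetical names, not part of the SDK, and it does not handle failure states because this notebook does not list which other statuses exist.

```python
import time


def wait_for_status(get_status, target="READY", poll_seconds=30.0, timeout_seconds=1800.0):
    """Call get_status() repeatedly until it returns target, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == target:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Status is still {status!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Usage in this notebook (needs the client and IDs created in earlier cells):
# wait_for_status(lambda: mlh.models.get_model_version(model_id, version_id).status)
```

Passing the status lookup as a callable keeps the helper independent of the client, so it can be tried out with a stub function first.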

# 5. (Optional) Check the training outcome

What has our linear regression model learned?
The model we created persists the coefficients in `coefficients.csv` during the training routine, so we can now take a look at them manually if we want.

Remember that we know our training data is given by `y = 10 + 3*f1 + f2`.

In [12]:
mlh.models.download_artifact(model_id=model_id, version_id=version_id, name="coefficients.csv")
pd.read_csv("coefficients.csv")

Unnamed: 0,beta_hat
0,10.0
1,3.0
2,1.0


# 6. Predict with our model version

Now that our model version is ready we can use it to predict.
Let's give it three instances to predict on, and let's specify that we want four decimal digits using our custom `precision` argument.

In [13]:
mlh.models.online_predict(
    model_id=model_id,
    version_id=version_id,
    instances=[
        [1, 1],
        [3, 10],
        [0, 1/3]
    ],
    args={"precision": 4}
)

[14.0, 29.0, 10.3333]

Does this match `y = 10 + 3*f1 + f2`?
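
One way to answer this is to evaluate the formula locally for the same three instances and compare. This sketch assumes the `precision` argument rounds each prediction to that many decimal places; the predicted values are copied from the output above.

```python
instances = [[1, 1], [3, 10], [0, 1 / 3]]
predicted = [14.0, 29.0, 10.3333]  # Copied from the prediction output above.

# Evaluate the generating formula and round to four decimals,
# mirroring the precision requested in the prediction call.
expected = [round(10 + 3 * f1 + f2, 4) for f1, f2 in instances]

print(expected == predicted)
```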

# 7. Clean up

Delete the model and the source package after finishing the tutorial. Deleting the model also removes all versions under it,
so they no longer consume resources.

In [15]:
mlh.models.delete_model(model_id)
mlh.source_packages.delete_source_package(source_package_id)