In [1]:
from os import environ, path
from random import choices
import string
from subprocess import check_call
from io import StringIO

import pandas as pd
import requests

from cognite.config import configure_session
from cognite.v05 import files
from cognite.v06.analytics import models
from cognite.data_transfer_service import DataSpec, FilesDataSpec

# Make sure to set your API key and project
API_KEY = "<YOUR-API-KEY>"
PROJECT = "<YOUR-PROJECT>"
configure_session(api_key=API_KEY, project=PROJECT)

# We'll append this random string to the names of all resources we create
# to avoid collisions with others who try this tutorial
random_postfix = "".join(choices(string.ascii_lowercase+string.digits, k=6))

# Introduction

For this tutorial we will train a very simple linear regression model that predicts one value from two features.
The example is deliberately trivial so that we can focus on Model Hosting itself.

We have some training data in the two files `data.csv` and `target.csv`.
The goal is to train a linear regression model that we can use to do prediction on new observed features.

Let's first take a look at the data:

In [2]:
pd.read_csv("data.csv")

Unnamed: 0,f1,f2
0,1,2
1,2,4
2,4,3
3,3,1
4,0,5
5,2,2


In [3]:
pd.read_csv("target.csv")

Unnamed: 0,y
0,15
1,20
2,25
3,20
4,15
5,18


In fact, our training data satisfy `y = 10 + 3*f1 + f2` exactly, so a linear regression model should be able to fit them perfectly.
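
As a quick local sanity check, the relationship can be confirmed with an ordinary least squares fit on the six rows shown above. This is only a sketch that runs outside Model Hosting; the variable names are illustrative and the values are copied by hand from the two tables.

```python
import numpy as np

# Training data copied from the data.csv and target.csv tables above
features = np.array([[1, 2], [2, 4], [4, 3], [3, 1], [0, 5], [2, 2]], dtype=float)
target = np.array([15, 20, 25, 20, 15, 18], dtype=float)

# Add a column of ones so that the first coefficient is the intercept
design = np.column_stack([np.ones(len(features)), features])

# Ordinary least squares fit
beta_hat, *_ = np.linalg.lstsq(design, target, rcond=None)

# Should be approximately intercept 10, then 3 for f1 and 1 for f2
print(np.round(beta_hat, 6))
```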

# 0. Upload data to CDP

Often, the data you need will already be available in CDP (Cognite Data Platform).
Since this tutorial uses dummy data, we have to upload it first.
We'll simply upload it through the Files API:

In [4]:
data_file_id = files.upload_file(
    file_path="data.csv",
    file_name="data.csv",
    directory=f"hosting-tutorial/{random_postfix}",
    content_type="text/csv"
)["fileId"]
target_file_id = files.upload_file(
    file_path="target.csv",
    file_name="target.csv",
    directory=f"hosting-tutorial/{random_postfix}",
    content_type="text/csv"
)["fileId"]

# 1. Write the code

The first thing we need to do is write the code for our model.
We do this by creating a Python package.
Our package is named `linreg` and can be found in a folder with the same name.
It's just a regular Python package that is pip-installable.

It's required that your model resides in a class named `Model`, and that this class is inside a module (i.e. file) named `model` (`model.py`). You can read more about this in the docs.
Notice that our requirements (packages that our model needs) are defined in `setup.py`.
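
The contents of `linreg/model.py` are not reproduced in this notebook. Purely as an illustration, a model class for this tutorial might look roughly like the sketch below. Only the class name `Model` and the artefact name `coefficients.csv` come from this tutorial; the method names `train`, `load`, and `predict`, their signatures, and the `open_artifact` and `make_local_open` helpers are assumptions made for this sketch. The real training routine receives a data spec rather than in-memory arrays, so consult the docs for the interface Model Hosting actually expects.

```python
import os

import numpy as np
import pandas as pd


class Model:
    """Hypothetical linear regression model; the interface is an assumption."""

    def __init__(self, coefficients):
        # coefficients[0] is the intercept, the rest are feature weights
        self.coefficients = np.asarray(coefficients, dtype=float)

    @staticmethod
    def train(open_artifact, data, target):
        # Fit ordinary least squares with an intercept column
        design = np.column_stack([np.ones(len(data)), data])
        beta_hat, *_ = np.linalg.lstsq(design, target, rcond=None)
        # Persist the coefficients so they can be inspected and reloaded
        with open_artifact("coefficients.csv", "w") as f:
            pd.DataFrame({"beta_hat": beta_hat}).to_csv(f)

    @staticmethod
    def load(open_artifact):
        # Rebuild the model from the persisted coefficients
        with open_artifact("coefficients.csv", "r") as f:
            return Model(pd.read_csv(f, index_col=0)["beta_hat"].to_numpy())

    def predict(self, instance, precision=None):
        value = float(self.coefficients[0] + np.dot(self.coefficients[1:], instance))
        return round(value, precision) if precision is not None else value


def make_local_open(directory):
    """Hypothetical local stand-in for the storage a hosted model writes to."""
    def open_artifact(name, mode):
        return open(os.path.join(directory, name), mode)
    return open_artifact
```

With a temporary directory passed to `make_local_open`, calling `Model.train(...)` followed by `Model.load(...)` exercises the same train, persist, and predict cycle locally.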

# 2. Create a source package

Before we use our code in Model Hosting we have to upload it.
In Model Hosting, a Python package that defines a model is called a source package.

We first package our code in a tar.gz archive:

In [5]:
# Will create a linreg-0.1.tar.gz archive
check_call(["python", "setup.py", "sdist"], cwd="linreg")
path.exists("linreg/dist/linreg-0.1.tar.gz")

True
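
Before uploading, it can be worth confirming that the archive actually contains the `model` module. The helper below is a small sketch using the standard library `tarfile` module; the function names are illustrative, and the expected path `linreg/model.py` inside the archive is an assumption based on the package layout described above.

```python
import tarfile


def archive_members(archive_path):
    """Return the names of all files in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


def contains_module(archive_path, module_path):
    """Check whether any archive member ends with the given relative path."""
    return any(name.endswith(module_path) for name in archive_members(archive_path))


# Usage in this notebook, after the sdist step above:
# contains_module("linreg/dist/linreg-0.1.tar.gz", "linreg/model.py")
```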

Then we can upload it and create a new source package:

In [6]:
source_package_id = models.create_source_package(
    name=f"linreg-v01-{random_postfix}",
    package_name="linreg",
    available_operations=["TRAIN", "PREDICT"],
    runtime_version="0.1",
    description="Some description", # Optional
    meta_data={"interesting-metadata": [1, 2, 3]}, # Optional (can be arbitrary JSON)
    file_path="linreg/dist/linreg-0.1.tar.gz"
)["id"]

# 3. Create a model

A model in Model Hosting is an abstract resource that can consist of any number of model versions.
So before we create and train a specific model version, we need to have a model that will act as a parent container. You can read more about this in the docs.

In [7]:
model_id = models.create_model(
    name=f"tutorial-model-{random_postfix}",
    description="Some description", # Optional
    metadata={"interesting-metadata": [1, 2, 3]}, # Optional (can be arbitrary JSON)
)["id"]

# 4. Create and train a model version

A model version is a specific instance that is trained and can do prediction.
It uses some source package that you have created earlier and resides under some model.

We will pass in a data spec as an argument to the training routine to specify our training data.
In this tutorial, it's simply the two files we uploaded earlier.

In [8]:
version_id = models.train_model_version(
    model_id=model_id,
    name=f"tutorial-model-version-{random_postfix}",
    source_package_id=source_package_id,
    description="Some description", # Optional
    metadata={"interesting-metadata": [1, 2, 3]}, # Optional (can be arbitrary JSON)
    args={
        "data_spec": DataSpec(files_data_spec=FilesDataSpec(file_ids={
            "data": data_file_id,
            "target": target_file_id
        })).to_JSON()
    }
)["id"]

Now we just have to wait for our model to be trained and deployed.
We can check the status until it's 'READY' (this will take several minutes).

<font color='red'>The rest of the notebook will not work unless you wait until the model version is ready!</font>

In [9]:
models.get_version(model_id, version_id)["status"]

'READY'
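
Instead of re-running the cell by hand, the status check can be wrapped in a small polling loop. The helper below is a sketch, not part of the SDK: `wait_until_ready`, its parameters, and the default interval and timeout are illustrative choices. The only status value this tutorial confirms is `'READY'`, so the loop does not try to detect failure states and simply gives up after the timeout.

```python
import time


def wait_until_ready(get_status, poll_interval=10.0, timeout=1800.0):
    """Poll get_status() until it returns 'READY' or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "READY":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Model version not ready, last status: {status!r}")
        time.sleep(poll_interval)


# Usage in this notebook (needs a live project, so it is commented out here):
# wait_until_ready(lambda: models.get_version(model_id, version_id)["status"])
```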

# 5. (Optional) Check the training outcome

What has our linear regression model learned?
Since we persisted the coefficients in `coefficients.csv` during the training routine, we can now take a look at them manually if we want.

Remember that we know our training data is given by `y = 10 + 3*f1 + f2`.

In [10]:
# Not supported through the SDK yet, so we'll call the endpoints directly
coefficients_download_url = requests.get(
    f"https://api.cognitedata.com/api/0.6" \
    f"/projects/{PROJECT}" \
    f"/analytics" \
    f"/models/{model_id}" \
    f"/versions/{version_id}" \
    f"/artefacts/coefficients.csv",
    headers={"api-key": API_KEY}
).json()["downloadURL"]
pd.read_csv(StringIO(requests.get(coefficients_download_url).text))

Unnamed: 0,beta_hat
0,10.0
1,3.0
2,1.0


# 6. Predict with our model version

Now that our model version is ready we can use it to predict.
Let's give it three instances to predict on, and let's specify that we want four decimal digits using our custom argument.

In [11]:
models.online_predict(
    model_id=model_id,
    version_id=version_id,
    instances=[
        [1, 1],
        [3, 10],
        [0, 1/3]
    ],
    args={"precision": 4}
)

{'predictions': [14.0, 29.0, 10.3333]}

Does this match `y = 10 + 3*f1 + f2`?
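
One way to answer that is to evaluate the formula locally for the same three instances and compare. This is a plain Python sketch; the function name `expected` is illustrative, and rounding to four decimal digits mirrors the `precision` argument used above.

```python
def expected(f1, f2, precision=4):
    """Evaluate y = 10 + 3*f1 + f2 and round to the given number of decimals."""
    return round(10.0 + 3 * f1 + f2, precision)


instances = [[1, 1], [3, 10], [0, 1 / 3]]
local_predictions = [expected(f1, f2) for f1, f2 in instances]
print(local_predictions)  # [14.0, 29.0, 10.3333]
```

The locally computed values agree with the predictions returned by the model version.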