In [45]:
from arthurai import ArthurAI, ModelType, InputType, Stage
import numpy as np
import joblib
import datetime
import time
import pandas as pd

In [None]:
import sys
sys.path.append("..")
from model_utils import transformations, load_datasets

In this guide, we'll use the credit dataset (and a pre-trained model) to onboard a new model to the Arthur platform. We'll walk through registering the model using a sample of the training data. 

#### Set up connection
Supply your API key below to authenticate with the platform.

In [None]:
URL = "dashboard.arthur.ai"
ACCESS_KEY = "..."

config = {"url": URL, "access_key":ACCESS_KEY}
connection = ArthurAI(config)
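
Hard-coding the key in a notebook makes it easy to leak through version control. A minimal sketch of one alternative, assuming the key is stored in an environment variable; the variable name `ARTHUR_API_KEY` and the helper `build_config` are hypothetical and not part of the Arthur SDK:

```python
import os


def build_config(url, env=None, key_var="ARTHUR_API_KEY"):
    """Build the config dict used above, reading the access key from the environment.

    `key_var` is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    access_key = env.get(key_var)
    if not access_key:
        raise RuntimeError(f"Set the {key_var} environment variable before connecting")
    return {"url": url, "access_key": access_key}
```

The result can be passed to `ArthurAI(...)` in place of the literal dict, for example `connection = ArthurAI(build_config(URL))`.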

## Create Model

We'll instantiate a model object with a small amount of metadata about the model input and output types. Then, we'll use a sample of the training data to register the full data schema for this Tabular model.

In [None]:
arthur_model = connection.model(name="CreditRiskModel_test_v1.0.07",
                               input_type=InputType.Tabular,
                               model_type=ModelType.Multiclass,
                               is_batch=True)

In [None]:
(X_train, Y_train), (X_test, Y_test) = load_datasets("../fixtures/datasets/credit_card_default.csv")

In [None]:
Y_train.head()

In [None]:
X_train.head()

In [None]:
arthur_model.from_dataframe(X_train, Stage.ModelPipelineInput)
arthur_model.from_dataframe(Y_train, Stage.GroundTruth)
arthur_model.set_positive_class(1)

Before saving, you can review a model to make sure everything is correct.

In [None]:
arthur_model.review_model()

In [None]:
arthur_model.save()

### Setting baseline data
For tracking data drift, you can upload a dataset to serve as the baseline or reference set. Often, this is a sample of your training data for the associated model. With batch models, we upload a reference set as a parquet file. If your data is in another format, you can use pandas to easily generate parquet.

In [None]:
X_train.to_parquet("./training_sample.parquet", index=False)

Point to a directory which contains one or more parquet files. All parquet files will be uploaded and joined to form the reference set.

In [None]:
arthur_model.set_reference_data(directory_path=".",  stage=Stage.ModelPipelineInput)
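
Because every parquet file in the directory becomes part of the reference set, and the call above points at the current directory, it can be worth checking what is there before uploading. A small sketch using only the standard library; `list_parquet_files` is a hypothetical helper, and matching on the `.parquet` suffix is an assumption about how files are selected:

```python
from pathlib import Path


def list_parquet_files(directory_path):
    """Return the sorted names of the .parquet files directly inside a directory."""
    return sorted(p.name for p in Path(directory_path).glob("*.parquet"))
```

Calling `list_parquet_files(".")` before `set_reference_data` would ideally show only `training_sample.parquet`; writing the sample into a dedicated directory is one way to avoid picking up unrelated files.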

## Sending Batches of Inferences

Load test data and trained model. Let's familiarize ourselves with the data and the model.


In [15]:
# print explicitly: a notebook only displays the last expression in a cell
print(X_test.shape)
sk_model = joblib.load("../fixtures/serialized_models/credit_model.pkl")

In [16]:
sk_model

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=15, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [18]:
sk_model.predict_proba(X_train.iloc[0:1, :])[0]

array([0.93286639, 0.06713361])

We'll again use parquet files to upload batches of inferences. Each batch should be designated by a directory which contains one or more parquet files. 

There are a couple of columns in the parquet files that deserve special attention. First, each inference needs a unique identifier so that it can later be joined with ground truth. Include a column named "external_id" and ensure these IDs are unique across batches. For example, if you run predictions across your customer base on a daily batch cadence, a unique identifier could be composed of your customer_id plus the date.
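
The customer-plus-date scheme can be sketched as follows. This is only an illustration: `make_external_id` is a hypothetical helper, and the underscore separator and ISO date format are arbitrary choices.

```python
import datetime


def make_external_id(customer_id, batch_date):
    """Compose an ID from a customer ID and the batch date."""
    return f"{customer_id}_{batch_date.isoformat()}"


# The same customer scored on two consecutive days gets two distinct IDs.
day_one = make_external_id(1042, datetime.date(2021, 3, 1))
day_two = make_external_id(1042, datetime.date(2021, 3, 2))
```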

Additionally, the predictions/scores from your model should match the column names in the registered schema. If we take a look above at *arthur_model.review_model()* we'll note that columns were created for us corresponding to classifier class probabilities ("1" and "0") and ground truth outcomes ("1_ground_truth" and "0_ground_truth"). Use these columns when uploading parquet. 
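
A quick pre-flight check can catch a missing prediction or ID column before a batch is written. A minimal sketch, assuming the required names are the ones described above ("0", "1" and "external_id"); `missing_columns` is a hypothetical helper, not an Arthur SDK function:

```python
def missing_columns(batch_columns, required=("0", "1", "external_id")):
    """Return the required column names that are absent from a batch, in order."""
    present = set(batch_columns)
    return [name for name in required if name not in present]
```

For a pandas DataFrame, `missing_columns(batch_df.columns)` should return an empty list before the batch is saved to parquet.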

In [19]:
import os

In [48]:
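# exist_ok lets the cell be re-run without failing on existing directories
os.makedirs("./inferences", exist_ok=True)
os.makedirs("./ground_truth", exist_ok=True)
for i in range(10):
    batch_size = 1000
    # timestamp plus loop index: random IDs could collide and make os.mkdir fail
    batch_id = "batch_{}_{}".format(int(time.time()), i)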

    rows_inds = np.random.randint(X_test.shape[0], size=batch_size)
    batch_inputs_df = X_test.iloc[rows_inds, :]
    batch_predictions_df = sk_model.predict_proba(batch_inputs_df)
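    # flatten so the labels are 1-D whether Y_test is a Series or a one-column DataFrame
    batch_ground_truths = Y_test.values[rows_inds].ravel()
    # derive IDs from the batch ID and row position; random integers could repeat
    inference_external_id = [f"{batch_id}_{j}" for j in range(batch_size)]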
    
    # need to include model prediction columns, and external_id
    batch_df = batch_inputs_df.copy()
    batch_df["0"] = batch_predictions_df[:, 0]
    batch_df["1"] = batch_predictions_df[:, 1]
    batch_df["external_id"] = inference_external_id


    # each batch needs to be in its own directory, which 
    # can contain multiple parquet files
    dir_name = f'./inferences/{batch_id}'
    os.mkdir(dir_name)
    filepath = f'{dir_name}/{batch_id}.parquet'
    batch_df.to_parquet(filepath, index=False)
    arthur_model.send_batch_inferences(dir_name, batch_label=batch_id)
    
    # note the inference-wise groundtruth for later use
    # with labels in {0, 1}, "1_ground_truth" is the label and "0_ground_truth" its complement
    ground_truth_df = pd.DataFrame({"external_id": inference_external_id,
                                    "0_ground_truth": 1 - batch_ground_truths,
                                    "1_ground_truth": batch_ground_truths})
    ground_truth_df.to_parquet(f'./ground_truth/{batch_id}.parquet', index=False)
    



## Sending Batches of Ground Truth

In [None]:
arthur_model.send_batch_ground_truth("./ground_truth/")
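
The two ground-truth columns form a one-hot encoding of the label. A minimal sketch of that mapping, assuming labels are 0 or 1 with 1 as the positive class (as set with `set_positive_class(1)`); `one_hot_ground_truth` is a hypothetical helper:

```python
def one_hot_ground_truth(labels):
    """Map binary labels (0 or 1) to the two one-hot ground-truth columns."""
    return {
        "0_ground_truth": [1 - int(y) for y in labels],
        "1_ground_truth": [int(y) for y in labels],
    }
```

Each row has exactly one of the two columns set to 1, and `1_ground_truth` is 1 precisely when the label is 1.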