In [1]:
from arthurai import ArthurAI
from arthurai.client.apiv3 import InputType, OutputType, Stage
import numpy as np
import joblib
import datetime
import time

In [2]:
import sys
sys.path.append("..")
from model_utils import transformations, load_datasets

In this guide, we'll use the credit dataset (and a pre-trained model) to onboard a new model to the Arthur platform. We'll walk through registering the model using a sample of the training data. This is an example of a streaming model.

#### Set up connection
Supply your API key below to authenticate with the platform.

In [3]:
URL = "app.arthur.ai"
ACCESS_KEY = "..."

connection = ArthurAI(url=URL, access_key=ACCESS_KEY, client_version=3)
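
If you would rather not paste the key into the notebook, one option is to read it from an environment variable. The sketch below assumes a variable named `ARTHUR_API_KEY`; that name and the `load_access_key` helper are hypothetical and not part of the SDK.

```python
import os

def load_access_key(env_var="ARTHUR_API_KEY", environ=None):
    """Return the API key stored in `env_var`, or raise if it is unset or empty.

    `env_var` is a hypothetical name chosen for this example.
    `environ` defaults to os.environ and can be overridden for testing.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before connecting.")
    return key

# Usage (assumes the variable is exported in your shell):
# ACCESS_KEY = load_access_key()
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later by the first API call.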

## Create Model

We'll instantiate a model object with a small amount of metadata about the model input and output types. Then, we'll use a sample of the training data to register the full data schema for this Tabular model.

In [14]:
arthur_model = connection.model(partner_model_id="CreditRiskModel_v0.0.1",
                               input_type=InputType.Tabular,
                               output_type=OutputType.Multiclass)

In [4]:
(X_train, Y_train), (X_test, Y_test) = load_datasets("../fixtures/datasets/credit_card_default.csv")

In [16]:
Y_train.head()

15529    0
24251    0
26478    0
29062    0
14135    0
Name: default payment next month, dtype: int64

In [9]:
X_train.head()

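| | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19234 | 150000 | 2 | 1 | 2 | 37 | 1 | -2 | -1 | -1 | -1 | ... | 689 | 2943 | -6159 | -6159 | 0 | 689 | 2943 | 0 | 0 | 16007 |
| 13544 | 20000 | 1 | 1 | 2 | 28 | 1 | -2 | -2 | -2 | -1 | ... | 0 | 0 | 12723 | 30000 | 0 | 0 | 0 | 12723 | 18000 | 32200 |
| 233 | 190000 | 1 | 2 | 2 | 34 | 2 | 0 | 0 | 0 | 2 | ... | 134379 | 142323 | 140120 | 150052 | 5000 | 5000 | 10000 | 0 | 12118 | 2769 |
| 7841 | 150000 | 2 | 2 | 2 | 34 | -1 | -1 | -1 | -1 | -1 | ... | 4102 | 713 | 810 | 5438 | 701 | 4137 | 719 | 817 | 5467 | 9587 |
| 17822 | 100000 | 1 | 1 | 2 | 24 | -1 | 0 | 0 | 0 | 0 | ... | 36030 | 37173 | 35875 | 36263 | 1563 | 1673 | 1816 | 1213 | 1248 | 1212 |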


We need to register the data schema for the model's inputs. Since your model might have hundreds or thousands of input features, you can just pass us a pandas DataFrame of your training data, and we'll handle the rest.

In [17]:
arthur_model.from_dataframe(X_train, Stage.ModelPipelineInput)

We need to register the schema for the outputs of the model: what will a typical prediction look like and what will a typical ground truth look like? What names, shapes, and datatypes should Arthur expect for these objects?

Since this is a binary classification model, we'll do this all in one step with the *.add_binary_classifier_output_attributes()* method. All we need to supply is a mapping that establishes:
  * names for the model's predictions
  * names for the model's ground truth
  * the mapping that relates these two
  
Our classifier will be making predictions about class *0* and class *1* and will return a probability score for each class. Therefore, we'll set up a name *prediction_0* and a name *prediction_1*. Additionally, our ground truth will be either a 0 or a 1, but we'll always represent ground truth in one-hot-encoded form. Therefore, we create two fields called *gt_0* and *gt_1*. We link these all up in a dictionary and pass that to the model.

In [18]:
prediction_to_ground_truth_map = {
    "prediction_0": "gt_0",
    "prediction_1": "gt_1"
}

arthur_model.add_binary_classifier_output_attributes("prediction_1", prediction_to_ground_truth_map)

{'prediction_0': <arthurai.client.apiv3.attributes.ArthurAttribute at 0x127f78b90>,
 'gt_0': <arthurai.client.apiv3.attributes.ArthurAttribute at 0x122e39f90>,
 'prediction_1': <arthurai.client.apiv3.attributes.ArthurAttribute at 0x127f78a10>,
 'gt_1': <arthurai.client.apiv3.attributes.ArthurAttribute at 0x127f97050>}

Note that the first argument to *.add_binary_classifier_output_attributes()* is the name of the "positive predicted class", for purposes of calculating accuracy metrics. 
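
As a minimal sketch of the conventions described above, the helpers below one-hot encode a 0/1 label into the *gt_0* and *gt_1* fields and run a few sanity checks on a mapping before it is registered. Both `one_hot_binary` and `check_mapping` are hypothetical helpers written for this guide, not SDK functions.

```python
def one_hot_binary(label):
    """Map a 0/1 label to the gt_0/gt_1 fields used in this guide."""
    if label not in (0, 1):
        raise ValueError(f"expected 0 or 1, got {label!r}")
    return {"gt_0": 1 - label, "gt_1": label}

def check_mapping(positive_class, pred_to_gt):
    """Sanity-check a binary prediction-to-ground-truth mapping."""
    if len(pred_to_gt) != 2:
        raise ValueError("a binary classifier needs exactly two prediction names")
    if positive_class not in pred_to_gt:
        raise ValueError(f"{positive_class!r} is not one of the prediction names")
    if len(set(pred_to_gt.values())) != 2:
        raise ValueError("each prediction must map to a distinct ground truth name")
    return True

mapping = {"prediction_0": "gt_0", "prediction_1": "gt_1"}
check_mapping("prediction_1", mapping)
```

For a label of 1, `one_hot_binary(1)` returns `{"gt_0": 0, "gt_1": 1}`, so exactly one ground truth field is set for every record.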

Before saving, you can review a model to make sure everything is correct.

In [19]:
arthur_model.review()

| | name | stage | value_type | categorical | is_unique | categories | range | monitor_for_bias |
|---|---|---|---|---|---|---|---|---|
| 0 | gt_0 | GROUND_TRUTH | INTEGER | True | False | [{value: 0}, {value: 1}] | [None, None] | False |
| 1 | gt_1 | GROUND_TRUTH | INTEGER | True | False | [{value: 0}, {value: 1}] | [None, None] | False |
| 2 | LIMIT_BAL | PIPELINE_INPUT | INTEGER | False | False | [] | [10000, 800000] | False |
| 3 | SEX | PIPELINE_INPUT | INTEGER | True | False | [{value: 1}, {value: 2}] | [None, None] | False |
| 4 | EDUCATION | PIPELINE_INPUT | INTEGER | True | False | [{value: 0}, {value: 1}, {value: 2}, {value: 3... | [None, None] | False |
| 5 | MARRIAGE | PIPELINE_INPUT | INTEGER | True | False | [{value: 0}, {value: 1}, {value: 2}, {value: 3}] | [None, None] | False |
| 6 | AGE | PIPELINE_INPUT | INTEGER | False | False | [] | [21, 79] | False |
| 7 | PAY_0 | PIPELINE_INPUT | INTEGER | True | False | [{value: 0}, {value: 1}, {value: 2}, {value: 3... | [None, None] | False |
| 8 | PAY_2 | PIPELINE_INPUT | INTEGER | True | False | [{value: 0}, {value: 1}, {value: 2}, {value: 3... | [None, None] | False |
| 9 | PAY_3 | PIPELINE_INPUT | INTEGER | True | False | [{value: 0}, {value: 1}, {value: 2}, {value: 3... | [None, None] | False |


In [None]:
arthur_model.save()

### Setting baseline data
Next, we'll use the training data to set a baseline reference for calculating data drift.

For tracking data drift, you can upload a dataset to serve as the baseline or reference set. Often, this is a sample of your training data for the associated model. Our reference dataset should ideally include examples of
  * inputs 
  * ground truth
  * model predictions
  
for a sample of the training set. This way, Arthur can monitor for drift and stability in all of these aspects. 
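
To make the idea of drift against a baseline concrete, here is a minimal sketch of the Population Stability Index (PSI), one common drift measure for a single numeric feature. It is shown for illustration only; the metrics the platform actually computes may differ, and the `psi` function, its binning scheme, and its defaults are assumptions of this example.

```python
import math

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the reference range; values outside that
    range are clamped into the first or last bin. Empty bins are floored
    at `eps` so the logarithm stays defined.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

baseline = list(range(100))             # stand-in for a training feature
shifted = [v + 50 for v in baseline]    # same shape, shifted upward
same_score = psi(baseline, baseline)
drift_score = psi(baseline, shifted)
```

Identical samples score zero, and the score grows as the current distribution moves away from the reference, which is why a representative baseline matters.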

In [6]:
# load our pre-trained classifier so we can generate predictions
sk_model = joblib.load("../fixtures/serialized_models/credit_model.pkl")

In [22]:
# get all input columns
reference_set = X_train.copy()

# get ground truth labels
reference_set["gt_1"] = Y_train
reference_set["gt_0"] = 1-Y_train

# get model predictions
preds = sk_model.predict_proba(X_train)
reference_set["prediction_1"] = preds[:, 1]
reference_set["prediction_0"] = preds[:, 0]


In [28]:
arthur_model.set_reference_data(data=reference_set)

{'counts': {'success': 21000, 'failure': 0, 'total': 21000}, 'failures': [[]]}

## Sending Inferences

The test data and the trained model are already loaded. Let's familiarize ourselves with both.


In [None]:
X_test.shape

In [None]:
sk_model

In [None]:
sk_model.predict_proba(X_train.iloc[0:1, :])

To send inferences, we'll iterate through datapoints in a test set and send telemetry to Arthur. You can send inferences one at a time or in a list. We will combine our model inputs and our model predictions into a single dictionary (*inf_data* in the code below), which is sent under the *inference_data* key.

In [None]:
for i in range(X_test.shape[0]):
    datarecord = X_test.iloc[i:i+1, :]
    predicted_probs = sk_model.predict_proba(datarecord)[0]
    ground_truth = int(Y_test.iloc[i])
    external_id = str(np.random.randint(1e9))

    inputs = datarecord.to_dict(orient='records')[0]
    prediction = {"prediction_1":predicted_probs[1], 
                  "prediction_0":predicted_probs[0]}
    ground_truth={"gt_1": ground_truth, 
                  "gt_0":1-ground_truth}
    cur_time = datetime.datetime.utcnow().isoformat()
    inf_data = inputs.copy()
    inf_data.update(prediction)

    
    arthur_model.send_inferences([{
        "inference_data": inf_data,
        "ground_truth_data": ground_truth,
        "partner_inference_id" : external_id,
        "inference_timestamp" : cur_time,
        "ground_truth_timestamp" : cur_time
    }])
    
    print("Sent inference with id {}".format(external_id))
    time.sleep(0.001 * np.random.random())
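
The body of the loop above can be factored into a small helper, sketched below. `build_inference_record` is a hypothetical function, and two choices in it are assumptions of this example rather than requirements: it uses `uuid.uuid4()` for the external ID to avoid collisions between random integers, and it emits a timezone-aware UTC timestamp. Check which timestamp format your platform version expects.

```python
import uuid
from datetime import datetime, timezone

def build_inference_record(inputs, predicted_probs, label):
    """Assemble one record in the shape used by the loop above."""
    # Model inputs and predictions share a single dictionary.
    inference_data = dict(inputs)
    inference_data["prediction_0"] = float(predicted_probs[0])
    inference_data["prediction_1"] = float(predicted_probs[1])

    # Assumption: a timezone-aware ISO 8601 timestamp in UTC.
    timestamp = datetime.now(timezone.utc).isoformat()
    return {
        "inference_data": inference_data,
        "ground_truth_data": {"gt_1": int(label), "gt_0": 1 - int(label)},
        "partner_inference_id": str(uuid.uuid4()),
        "inference_timestamp": timestamp,
        "ground_truth_timestamp": timestamp,
    }

# Made-up input values, for illustration only.
record = build_inference_record({"AGE": 37, "LIMIT_BAL": 150000}, [0.8, 0.2], 0)
```

The returned dictionary has the same keys as the one passed to *send_inferences()* in the loop.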

You can send inferences one at a time, but you can also send them in small batches using the *send_inferences()* method. In that case, you would send a list of dictionaries, each of which has the same shape as the one above.
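
One way to form such batches is to slice a list of records into fixed-size chunks. The sketch below uses a hypothetical `chunked` helper and made-up records; the batch size of 2 is arbitrary, and the actual call to the platform is left as a comment.

```python
def chunked(records, batch_size):
    """Yield successive lists of at most `batch_size` records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Made-up records with the same keys as the dictionaries in this guide.
records = [
    {"partner_inference_id": str(i), "inference_data": {"AGE": 30 + i}}
    for i in range(5)
]
batches = list(chunked(records, batch_size=2))

# Each batch is a list of dictionaries and could be sent with:
# arthur_model.send_inferences(batch)
```

With five records and a batch size of 2, this produces three batches of sizes 2, 2, and 1.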

If your model scoring system is set up as a batch processor that runs a daily, weekly, or monthly job, then we recommend setting up a batch model with Arthur and using the corresponding *send_batch_inferences()* method. An example batch model can be found [here](../../credit_risk_batch/notebooks/Quickstart.ipynb).