# Arthur Sandbox Example: OpenAI question-answering

In this guide, we'll use a question-answering dataset from Huggingface and the GPT-3 endpoint from OpenAI to onboard a new streaming model to the Arthur platform. Then we'll use Arthur to analyze the model.


# Outline

Read on for an overview of everything this notebook will cover. **[Click here to dive straight into the code.](#Imports)**

## Onboarding

Onboarding is the process of setting up your model to be monitored by Arthur. You specify the type of data your model ingests, send a reference dataset that provides a baseline distribution of your data, and configure any additional Arthur services you want to use.

**Arthur does not need your model object itself to monitor performance; only predictions are required.**

All you need to monitor your model with Arthur is to upload the predictions your model makes: Arthur computes analytics about your model based on that prediction data. You can compute this data directly in a script or notebook like this one and upload it to the platform, or fetch it from an external database and send it to Arthur.

### Getting Model Predictions
We'll prepare a sample from a question-answering dataset and generate answers from the GPT-3 endpoint.

### Registering Model with Arthur
We'll configure our model's attributes and save the model to the Arthur platform.

### Sending Inferences
We'll send model inferences (inputs and predictions) to the Arthur platform.


## Model Monitoring and Analysis

Once onboarding is complete and you have inferences uploaded to the platform, you can use Arthur to get model monitoring insights.

We will evaluate the Arthur-computed metrics to identify trends in our model's inferences.

---

# Setup & Imports

In [1]:
# ensure required packages are installed
#  don't worry, our requirements are flexible!

! pip install -r requirements.txt > /dev/null

In [2]:
from datetime import datetime, timedelta
from IPython.display import display, HTML
import numpy as np
import pandas as pd
import shortuuid

---

# Onboarding

## Loading the Dataset

We're using the [SciQ dataset from Huggingface](https://huggingface.co/datasets/sciq).

The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.

In [3]:
from datasets import load_dataset

In [4]:
sciq_dataset = load_dataset("sciq")

Found cached dataset sciq (/Users/maxcembalest/.cache/huggingface/datasets/sciq/default/0.1.0/50e5c6e3795b55463819d399ec417bfd4c3c621105e00295ddb5f3633d708493)



In [5]:
sciq_df = sciq_dataset.data['train'].table.to_pandas()
sciq_df_inf = sciq_dataset.data['test'].table.to_pandas()

In [6]:
sciq_df_sample = sciq_df.sample(5)
sciq_df_inf_sample = sciq_df_inf.sample(5)

## Create the full LLM inputs by wrapping each question in a prompt template

In [7]:
gpt3_inputs = [
    f"What is the answer to this question? question: {row.question}, answer: " 
    for _, row in sciq_df_sample.iterrows()
]

gpt3_inputs_inf = [
    f"What is the answer to this question? question: {row.question}, answer: " 
    for _, row in sciq_df_inf_sample.iterrows()
]

In [8]:
gpt3_inputs

['What is the answer to this question? question: In binary fission how many times does a cell split?, answer: ',
 'What is the answer to this question? question: What is the movement of muscle in the digestive system called?, answer: ',
 'What is the answer to this question? question: What is the basic unit of structure and function of living things?, answer: ',
 'What is the answer to this question? question: Use of oil-consuming bacteria to clean up an oil spill is an example of what?, answer: ',
 'What is the answer to this question? question: What is it called when the chance that a certain event will occur?, answer: ']
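The prompts above contain only the question. SciQ rows also carry a `support` paragraph, so one way to include supporting information is to place it ahead of the question. The sketch below is a minimal illustration: `build_prompt` and the `context:` prefix are hypothetical choices, not part of the original notebook, and the support sentence is a made-up sample.

```python
def build_prompt(question: str, support: str = "") -> str:
    """Build a GPT-3 prompt, optionally preceded by supporting text.

    Hypothetical helper: the name and the "context:" prefix are assumptions.
    """
    base = f"What is the answer to this question? question: {question}, answer: "
    support = support.strip()
    if not support:
        # No supporting text: fall back to the question-only template used above.
        return base
    return f"context: {support}\n{base}"


example = build_prompt(
    "What is the basic unit of structure and function of living things?",
    support="Cells are the basic units of life.",  # made-up sample text
)
print(example)
```

With an empty `support` value the helper returns exactly the question-only prompt used in this notebook, so rows without supporting evidence still work.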

# Get OpenAI responses

Using the endpoint for the `text-davinci-003` model, we get an answer to each question from our sample dataframe.

In [9]:
import openai

In [10]:
from typing import List

def get_gpt3_responses(inputs: List[str]):
    """
    Sends each input prompt to the GPT-3 completion endpoint
    and returns the list of raw API responses
    """

    responses = []

    for inp in inputs:
        # request logprobs so we can compute token likelihoods later
        openai_response = openai.Completion.create(
            model="text-davinci-003",
            prompt=inp,
            temperature=0.1,
            logprobs=1)
        # keep the full response; the answer text is extracted in a later cell
        responses.append(openai_response)

    return responses

In [11]:
responses = get_gpt3_responses(gpt3_inputs)

In [12]:
responses_inf = get_gpt3_responses(gpt3_inputs_inf)

In [13]:
gpt3_response_texts = [c['choices'][0]['text'] for c in responses]
gpt3_response_texts_inf = [c['choices'][0]['text'] for c in responses_inf]

#### Get token likelihoods

In [14]:
# exponentiate the log-probabilities returned by the API to get per-token probabilities
gpt3_tokens_ref = [resp['choices'][0]['logprobs']['tokens'] for resp in responses]
gpt3_probabilities_ref = [list(np.exp(resp['choices'][0]['logprobs']['token_logprobs'])) for resp in responses]
gpt3_token_likelihoods_ref = [
    [{t: p} for t, p in zip(tokens, probs)]
    for tokens, probs in zip(gpt3_tokens_ref, gpt3_probabilities_ref)
]

gpt3_tokens_inf = [resp['choices'][0]['logprobs']['tokens'] for resp in responses_inf]
gpt3_probabilities_inf = [list(np.exp(resp['choices'][0]['logprobs']['token_logprobs'])) for resp in responses_inf]
gpt3_token_likelihoods_inf = [
    [{t: p} for t, p in zip(tokens, probs)]
    for tokens, probs in zip(gpt3_tokens_inf, gpt3_probabilities_inf)
]
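To make the conversion in the cell above concrete, here is a minimal sketch on a mocked response fragment. The log-probability values are made up, and `token_likelihoods` is a hypothetical helper name; only the `tokens` and `token_logprobs` keys come from the notebook.

```python
import math

# Mock of the 'logprobs' part of one Completion response (values are made up).
mock_logprobs = {
    "tokens": ["\n", "\n", "Cell"],
    "token_logprobs": [-0.2, -0.35, 0.0],
}


def token_likelihoods(logprobs: dict) -> list:
    """Pair each token with exp(logprob): one {token: probability} dict per token."""
    return [
        {token: math.exp(lp)}
        for token, lp in zip(logprobs["tokens"], logprobs["token_logprobs"])
    ]


likelihoods = token_likelihoods(mock_logprobs)
print(likelihoods)
```

Each token gets its own single-entry dict because the same token can appear more than once in a sequence (the two newlines here); a single dict per response would silently collapse the repeats.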

#### Get finish reasons

In [15]:
gpt3_response_finish_reasons_ref = [c['choices'][0]['finish_reason'] for c in responses]
gpt3_response_finish_reasons_inf = [c['choices'][0]['finish_reason'] for c in responses_inf]

#### Get readability scores

In [16]:
import textstat
gpt3_readability_ref = [textstat.flesch_reading_ease(o) for o in gpt3_response_texts]
gpt3_readability_inf = [textstat.flesch_reading_ease(o) for o in gpt3_response_texts_inf]
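Very short answers can produce readability scores far outside the 0 to 100 range. The sketch below applies the standard Flesch reading-ease formula to raw counts to show why. `flesch_from_counts` is a hypothetical helper written for this illustration; it is not `textstat`'s implementation, and `textstat`'s own word and syllable counting and rounding may give slightly different values.

```python
def flesch_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch reading-ease formula computed from raw counts."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# A one-word, one-syllable answer scores above 100.
one_syllable = flesch_from_counts(words=1, sentences=1, syllables=1)

# A one-word, five-syllable answer scores far below 0.
five_syllables = flesch_from_counts(words=1, sentences=1, syllables=5)

print(round(one_syllable, 2), round(five_syllables, 2))
```

The one-syllable case works out to 121.22, the value the reference dataframe later in this notebook shows for the answer "Cell". Long single-word answers land near -217 for the same reason: the syllables-per-word term dominates when there is only one word.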

#### Get correctness of answers

In [17]:
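def get_correctness(outputs, answers):
    """
    Marks an output as correct when every word of the reference answer
    appears (case-insensitively, as a substring) in the model output
    """
    correctnesses = []
    for a, o in zip(answers, outputs):
        in_ = True
        # strip commas so multi-word answers split cleanly into words
        for word in a.replace(',','').split():
            if word.lower() not in o.lower():
                in_=False
        correctnesses.append(in_)
    return correctnesses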

In [18]:
gpt3_correctness_ref = get_correctness(
    gpt3_response_texts,
    list(sciq_df_sample.correct_answer.values)
)

gpt3_correctness_inf = get_correctness(
    gpt3_response_texts_inf,
    list(sciq_df_inf_sample.correct_answer.values)
)
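The correctness check above relies on substring matching, which is lenient. The sketch below reproduces that logic for a single pair and compares it with a stricter whole-word variant. Both function names are hypothetical, and the whole-word variant is an assumption offered for comparison, not the notebook's method.

```python
import re


def contains_all_words(output: str, answer: str) -> bool:
    """Substring check in the style of get_correctness."""
    return all(
        word.lower() in output.lower()
        for word in answer.replace(",", "").split()
    )


def contains_all_whole_words(output: str, answer: str) -> bool:
    """Stricter variant (an assumption): every answer word must match a whole output word."""
    output_words = set(re.findall(r"\w+", output.lower()))
    answer_words = re.findall(r"\w+", answer.lower())
    return all(word in output_words for word in answer_words)


# "ion" is a substring of "Bioremediation", so the lenient check accepts it.
print(contains_all_words("\n\nBioremediation.", "ion"))
print(contains_all_whole_words("\n\nBioremediation.", "ion"))
print(contains_all_whole_words("\n\nCell", "cell"))
```

Whole-word matching avoids this kind of false positive, but it will reject outputs that use a different word form than the reference answer (for example a plural), so which variant fits depends on the dataset.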

#### Use ChatGPT to get LLM-generated feedback

In [19]:
def get_pseudofeedback(inputs: List[str]):
    responses = []
    for inp in inputs:
        print(inp)
        messages = [
            {"role": "system", "content": "You give short written feedback saying if an answer made sense for a given question. You also rate the answer out of 10 (0=worst,10=best)"},
            {"role": "user", "content": "What is the answer to this question? question: What makes breathing difficult due to respiratory system disease?, answer: The answer to this question is the answer to your question.\n\nIt is important to know the answer to this question because"},
            {"role": "assistant", "content": "You didn't actually answer the question, thats bad and confusing. 1"},
            {"role": "user", "content": "What is the answer to this question? question: What makes breathing difficult due to respiratory system disease?, answer: Respiratory system diseases can make breathing difficult by causing inflammation, damage, or blockages in the airways or lung tissues. These can include conditions such as asthma, chronic obstructive pulmonary disease (COPD), pneumonia, lung cancer, and others. The inflammation and swelling can narrow the airways, making it harder for air to flow in and out of the lungs. Damage to the lung tissues can reduce their ability to expand and contract, which can also make breathing difficult. Additionally, excess mucus or fluid in the lungs can further narrow the airways and make it hard to breathe. These factors can cause symptoms such as shortness of breath, coughing, wheezing, and fatigue."},
            {"role": "assistant", "content": "Thank you very much, that makes a lot of sense! 10"},
            {"role": "user", "content": inp}
        ]

        openai_response = openai.ChatCompletion.create(
          model="gpt-3.5-turbo",
          messages=messages
        )
        chat_gpt_answer = openai_response["choices"][0]["message"]["content"]
        print(chat_gpt_answer, "\n=====\n")
        responses.append(openai_response)
    
    return responses

In [20]:
gpt3_pseudofeedback_ref = get_pseudofeedback([inp + rt for inp, rt in zip(gpt3_inputs, gpt3_response_texts)])
gpt3_pseudofeedback_inf = get_pseudofeedback([inp + rt for inp, rt in zip(gpt3_inputs_inf, gpt3_response_texts_inf)])



What is the answer to this question? question: In binary fission how many times does a cell split?, answer: 

In binary fission, a cell splits into two identical daughter cells.
Yes, that is correct and makes sense. 10 
=====

What is the answer to this question? question: What is the movement of muscle in the digestive system called?, answer: 

Peristalsis.
Great, that is a correct and concise answer! 10 
=====

What is the answer to this question? question: What is the basic unit of structure and function of living things?, answer: 

Cell
Your answer is correct and makes sense. 10 
=====

What is the answer to this question? question: Use of oil-consuming bacteria to clean up an oil spill is an example of what?, answer: 

Bioremediation.
Great answer, that makes sense! 10 
=====

What is the answer to this question? question: What is it called when the chance that a certain event will occur?, answer:  Probability.
Your answer is correct and makes sense. 10. 
=====

What is the answer

In [21]:
def get_feedback_scores(feedback_strings):
    feedback_ints = []
    for fs in feedback_strings:
        score = -1
        fs = fs.replace('out of 10', '')
        for num_string in np.arange(11).astype(str):
            if num_string in fs:
                score = int(num_string)
        feedback_ints.append(score)
    return feedback_ints
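The parser above checks whether each number from 0 to 10 appears anywhere in the feedback string, so any other digit in the text can change the result. As a hedged alternative, the sketch below uses a regular expression to take the last standalone integer between 0 and 10. `parse_feedback_score` is a hypothetical helper, and the example strings are made up for illustration.

```python
import re


def parse_feedback_score(feedback: str, default: int = -1) -> int:
    """Return the last standalone integer from 0 to 10 in the feedback, or `default`."""
    # Drop rating denominators such as "out of 10" or "/10" before searching.
    text = re.sub(r"\s*(?:out of|/)\s*10\b", "", feedback)
    # Match 10 or a single digit that is not part of a longer number.
    matches = re.findall(r"(?<!\d)(10|[0-9])(?!\d)", text)
    return int(matches[-1]) if matches else default


print(parse_feedback_score("Yes, that is correct and makes sense. 10"))
print(parse_feedback_score("Only 1 of 3 points were right. 2"))
print(parse_feedback_score("no score here"))
```

Taking the last match assumes the rating comes at the end of the feedback, as it does in the few-shot examples given to ChatGPT above. Feedback that never states a score still maps to -1.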

In [22]:
gpt3_pseudofeedback_messages_ref = [c['choices'][0]['message']['content'] for c in gpt3_pseudofeedback_ref]
gpt3_pseudofeedback_messages_inf = [c['choices'][0]['message']['content'] for c in gpt3_pseudofeedback_inf]

gpt3_feedback_scores_ref = get_feedback_scores(gpt3_pseudofeedback_messages_ref)
gpt3_feedback_scores_inf = get_feedback_scores(gpt3_pseudofeedback_messages_inf)

### Create inference dataframes

In [23]:
gpt3_ref_data = pd.DataFrame({
    'user_input': gpt3_inputs,
    'output_text': gpt3_response_texts,
    'token_likelihoods': gpt3_token_likelihoods_ref,
    'finish_reason': gpt3_response_finish_reasons_ref,
    'readability': gpt3_readability_ref,
    'correct': gpt3_correctness_ref,
    'user_feedback': gpt3_pseudofeedback_messages_ref,
    'user_feedback_score': gpt3_feedback_scores_ref
})
gpt3_ref_data

| | user_input | output_text | token_likelihoods | finish_reason | readability | correct | user_feedback | user_feedback_score |
|---|---|---|---|---|---|---|---|---|
| 0 | What is the answer to this question? question:... | \n\nIn binary fission, a cell splits into two ... | [{' ': 0.9973323729575256}, {' ': 0.9903408105... | length | 60.31 | True | Yes, that is correct and makes sense. 10 | 10 |
| 1 | What is the answer to this question? question:... | \n\nPeristalsis. | [{' ': 0.8272828525535677}, {' ': 0.8514157387... | stop | -47.99 | True | Great, that is a correct and concise answer! 10 | 10 |
| 2 | What is the answer to this question? question:... | \n\nCell | [{' ': 0.8079144942571181}, {' ': 0.6975522204... | stop | 121.22 | True | Your answer is correct and makes sense. 10 | 10 |
| 3 | What is the answer to this question? question:... | \n\nBioremediation. | [{' ': 0.458393909546269}, {' ': 0.99104760150... | stop | -217.19 | True | Great answer, that makes sense! 10 | 10 |
| 4 | What is the answer to this question? question:... | Probability. | [{' Prob': 0.7149669937078649}, {'ability': 0.... | stop | -217.19 | True | Your answer is correct and makes sense. 10. | 10 |


In [24]:
gpt3_inf_data = pd.DataFrame({
    'user_input': gpt3_inputs_inf,
    'output_text': gpt3_response_texts_inf,
    'token_likelihoods': gpt3_token_likelihoods_inf,
    'finish_reason': gpt3_response_finish_reasons_inf,
    'readability': gpt3_readability_inf,
    'correct': gpt3_correctness_inf,
    'user_feedback': gpt3_pseudofeedback_messages_inf,
    'user_feedback_score': gpt3_feedback_scores_inf
})
gpt3_inf_data

| | user_input | output_text | token_likelihoods | finish_reason | readability | correct | user_feedback | user_feedback_score |
|---|---|---|---|---|---|---|---|---|
| 0 | What is the answer to this question? question:... | \nAlveoli. | [{' ': 0.6886342436057409}, {'Al': 0.814359430... | stop | 36.62 | True | Your answer is correct and concise. Well done! 10 | 10 |
| 1 | What is the answer to this question? question:... | \n\nThe Sun. | [{' ': 0.9694359882575952}, {' ': 0.8036643034... | stop | 120.21 | True | That answer is accurate and makes sense. 9 | 9 |
| 2 | What is the answer to this question? question:... | \n\nSchrödinger's approach uses three quantum ... | [{' ': 0.6784315222521419}, {' ': 0.7127359027... | length | 64.37 | False | l, and ml) to specify the wave function or orb... | -1 |
| 3 | What is the answer to this question? question:... | Smell. | [{' Sm': 0.5371301762378593}, {'ell': 0.999999... | stop | 121.22 | True | Great job! Your answer makes complete sense. 10. | 10 |
| 4 | What is the answer to this question? question:... | \n\nNeurotransmitters. | [{' ': 0.6816356509754551}, {' ': 0.9149446750... | stop | -217.19 | True | Great, that answer is accurate and concise. 10 | 10 |


In [25]:
# gpt3_ref_data.to_csv('gpt3_reference_data.csv')
# gpt3_inf_data.to_csv('gpt3_inference_data.csv')

## Registering Model With Arthur

### Setting Up Connection
Supply your login to authenticate with the platform.

In [35]:
from arthurai import ArthurAI
# connect to Arthur
# enter your details below
arthur = ArthurAI(
    url="https://app.arthur.ai",  # you can also pass this through the ARTHUR_ENDPOINT_URL environment variable
    login="<YOUR LOGIN HERE>",  # you can also pass this through the ARTHUR_LOGIN environment variable
)

### Registering Model Type

In [27]:
from arthurai.common.constants import InputType, OutputType, ValueType, Stage

We'll instantiate an [`ArthurModel`](https://docs.arthur.ai/sdk/sdk_v3/apiref/arthurai.core.models.ArthurModel.html) with the `ArthurAI.model()` method, which constructs a new local `ArthurModel` object. Later we'll use `ArthurModel.save()` to register this model with the Arthur platform.

We give the model a user-friendly `display_name` and allow the unique `partner_model_id` field to be automatically generated, but you can supply a unique identifier if it helps you map your models in Arthur to your other MLOps systems.

The `InputType` of a model specifies the general type of data your model ingests. The `OutputType` of a model specifies the modeling task at hand.

### Building the model by specifying attributes

We use a helper function to register the model attributes for the input and output text the model will process.

In [28]:
# register arthur model type
arthur_model_gpt3 = arthur.model(
    display_name="OpenAI_GPT3",
    input_type=InputType.NLP, 
    output_type=OutputType.TokenSequence)

# register attributes for token sequence model
arthur_model_gpt3.build_token_sequence_model(
    input_column="user_input", 
    output_text_column="output_text",
    output_likelihood_column="token_likelihoods"
)

# register additional non-input attributes
arthur_model_gpt3.add_attribute(
    name='finish_reason', 
    stage=Stage.NonInputData,
    value_type=ValueType.String,
    categorical=True,
    categories=['length', 'stop']
)

arthur_model_gpt3.add_attribute(
    name="readability",
    stage=Stage.NonInputData,
    value_type=ValueType.Float,
    min_range=0,
    max_range=100
)

arthur_model_gpt3.add_attribute(
    name="correct",
    stage=Stage.NonInputData,
    value_type=ValueType.Boolean,
    categorical=True,
    categories=[True,False]
)

arthur_model_gpt3.add_attribute(
    name="user_feedback",
    stage=Stage.NonInputData,
    value_type=ValueType.String,
    categorical=True,
    is_unique=True
)

arthur_model_gpt3.add_attribute(
    name="user_feedback_score",
    stage=Stage.NonInputData,
    value_type=ValueType.Integer,
    categorical=True,
    categories=[-1,0,1,2,3,4,5,6,7,8,9,10]
)

arthur_model_gpt3.review()

| | name | stage | value_type | categorical | is_unique | categories | bins | range | monitor_for_bias |
|---|---|---|---|---|---|---|---|---|---|
| 0 | user_input | PIPELINE_INPUT | UNSTRUCTURED_TEXT | True | True | [] | | [None, None] | False |
| 1 | token_likelihoods | PREDICTED_VALUE | TOKEN_LIKELIHOODS | False | False | [] | | [None, None] | False |
| 2 | output_text | PREDICTED_VALUE | UNSTRUCTURED_TEXT | True | False | [] | | [None, None] | False |
| 3 | finish_reason | NON_INPUT_DATA | STRING | True | False | [{value: length}, {value: stop}] | | [None, None] | False |
| 4 | readability | NON_INPUT_DATA | FLOAT | False | False | [] | | [0, 100] | False |
| 5 | correct | NON_INPUT_DATA | BOOLEAN | True | False | [{value: True}, {value: False}] | | [None, None] | False |
| 6 | user_feedback | NON_INPUT_DATA | STRING | True | True | [] | | [None, None] | False |
| 7 | user_feedback_score | NON_INPUT_DATA | INTEGER | True | False | [{value: -1}, {value: 0}, {value: 1}, {value: ... | | [None, None] | False |


### Saving the Model

Before saving, review your model to make sure everything is correct. We built the initial schema with `ArthurModel.build_token_sequence_model()` and have since added more attributes, so we call `ArthurModel.review()` to confirm that the full set of attributes looks correct before saving to the platform. See the [onboarding walkthrough on the Arthur docs](https://docs.arthur.ai/user-guide/walkthroughs/model-onboarding/index.html#review-model) for tips on reviewing your model.

Note that while we capture the ranges of the attributes in this schema, they don’t need to be exact and won’t affect any performance calculations. They’re used as metadata to configure plots in the online Arthur dashboard, but never affect data drift or any other computations.

In [29]:
# review the model attribute properties in the model schema
arthur_model_gpt3.review()

| | name | stage | value_type | categorical | is_unique | categories | bins | range | monitor_for_bias |
|---|---|---|---|---|---|---|---|---|---|
| 0 | user_input | PIPELINE_INPUT | UNSTRUCTURED_TEXT | True | True | [] | | [None, None] | False |
| 1 | token_likelihoods | PREDICTED_VALUE | TOKEN_LIKELIHOODS | False | False | [] | | [None, None] | False |
| 2 | output_text | PREDICTED_VALUE | UNSTRUCTURED_TEXT | True | False | [] | | [None, None] | False |
| 3 | finish_reason | NON_INPUT_DATA | STRING | True | False | [{value: length}, {value: stop}] | | [None, None] | False |
| 4 | readability | NON_INPUT_DATA | FLOAT | False | False | [] | | [0, 100] | False |
| 5 | correct | NON_INPUT_DATA | BOOLEAN | True | False | [{value: True}, {value: False}] | | [None, None] | False |
| 6 | user_feedback | NON_INPUT_DATA | STRING | True | True | [] | | [None, None] | False |
| 7 | user_feedback_score | NON_INPUT_DATA | INTEGER | True | False | [{value: -1}, {value: 0}, {value: 1}, {value: ... | | [None, None] | False |


Now, we save the model. 

Note that this will be the first call to send data to the Arthur platform so far in this example - no information has been sent yet to the platform.

The method `ArthurModel.save()` sends an API request to Arthur to validate your model - if there are any problems with your model schema, this method will result in an error informing you how to correct your model's configuration. If no errors are found, the model will be saved to the platform.
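The validation described above covers the model schema. As a complement, a local check can confirm that the data you plan to upload only uses the categories declared for each attribute. The sketch below is a minimal illustration with plain dictionaries: `find_unexpected_values` is a hypothetical helper, not part of the Arthur SDK, and the sample rows are made up.

```python
def find_unexpected_values(rows, allowed_categories):
    """Return {column: unexpected values} for the declared categorical columns.

    `rows` is a list of dicts (one per inference); `allowed_categories` maps a
    column name to the values declared for that attribute.
    """
    problems = {}
    for column, allowed in allowed_categories.items():
        seen = {row[column] for row in rows if column in row}
        unexpected = seen - set(allowed)
        if unexpected:
            problems[column] = sorted(unexpected, key=str)
    return problems


# Categories mirror the add_attribute calls above; the rows are made-up samples.
allowed = {
    "finish_reason": ["length", "stop"],
    "user_feedback_score": list(range(-1, 11)),
}
rows = [
    {"finish_reason": "stop", "user_feedback_score": 10},
    {"finish_reason": "content_filter", "user_feedback_score": 9},
]
problems = find_unexpected_values(rows, allowed)
print(problems)
```

Here the check flags `content_filter`, a finish reason the OpenAI API can return in addition to `length` and `stop`. If such values show up in your data, one option is to add them to the attribute's `categories` before saving.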

In [30]:
# validate the model and save it onto the Arthur platform
arthur_model_gpt3.save()

10:46:22 - arthurai - We have registered the  model with Arthur and are getting it ready to accept inferences...
We are still working on getting your model ready to accept inferences...
10:47:38 - arthurai - Model Creation Completed successfully, you can now send Data to Arthur.


'bc18efc2-4ead-4cb8-9e9f-df2b3b2ca73f'

In [31]:
arthur_model_gpt3.set_reference_data(data=gpt3_ref_data.drop(['token_likelihoods'], axis=1))

10:47:40 - arthurai - Starting upload (0.006 MB in 1 files), depending on data size this may take a few minutes
10:47:40 - arthurai - Upload completed: /var/folders/8v/8v36mrp907z7lp5d4cd7yf4h0000gn/T/tmpzndw1ied/bc18efc2-4ead-4cb8-9e9f-df2b3b2ca73f-0.parquet


({'counts': {'success': 5, 'failure': 0, 'total': 5}, 'failures': [[]]},
 {'dataset_close_result': {'message': 'success'}})

<a id='sending_inferences_content'></a>

## Sending Inferences

#### Using `ArthurModel.send_inferences()`

We send our inference data to the platform along with unique IDs and timestamps. In this example we generate the timestamps ourselves with `generate_timestamps` and pass them in explicitly; if you omit them, the `send_inferences` function from the Arthur SDK auto-generates timestamps for these inferences. See [our API docs for sending inferences](https://docs.arthur.ai/api-documentation/v3-api-docs.html#tag/inferences/paths/~1models~1%7Bmodel_id%7D~1inferences/post) for the full specification of inference ingestion in Arthur.

In [32]:
from arthurai.util import generate_timestamps
# generate fake timestamps for each inference over the last three days
inference_timestamps = list(generate_timestamps(len(gpt3_inf_data), '3d', 'now', 'D'))

print(f"First and last timestamps: {inference_timestamps[0].strftime('%Y-%m-%d')} to {inference_timestamps[-1].strftime('%Y-%m-%d')}")

First and last timestamps: 2023-04-18 to 2023-04-21
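If you want to see what timestamp generation involves without the SDK helper, the sketch below spreads a given number of timestamps evenly over a time span using only the standard library. It is a stand-in for illustration: `evenly_spaced_timestamps` is a hypothetical function, and the SDK's `generate_timestamps` may space or round its values differently.

```python
from datetime import datetime, timedelta, timezone


def evenly_spaced_timestamps(count: int, span: timedelta, end: datetime) -> list:
    """Return `count` timestamps evenly spaced from `end - span` to `end`, inclusive."""
    if count < 1:
        return []
    if count == 1:
        return [end]
    step = span / (count - 1)
    start = end - span
    return [start + i * step for i in range(count)]


# Five timezone-aware timestamps covering the last three days.
now = datetime.now(timezone.utc)
timestamps = evenly_spaced_timestamps(5, timedelta(days=3), now)
print(f"{timestamps[0]:%Y-%m-%d} to {timestamps[-1]:%Y-%m-%d}")
```

Using timezone-aware datetimes keeps the values unambiguous when they are serialized and sent to a remote service.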


In [34]:
inference_result = arthur_model_gpt3.send_inferences(
    gpt3_inf_data, 
    inference_timestamps=inference_timestamps,
    partner_inference_ids=[shortuuid.uuid() for _ in range(len(gpt3_inputs_inf))])
print('# of successful and # of failing inference uploads:', inference_result['counts'])

# of successful and # of failing inference uploads: {'failure': 0, 'success': 5, 'total': 5}
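Printing the counts is enough for a notebook, but in a script you may prefer to fail loudly when any upload fails. The sketch below is one way to do that, assuming the counts dictionary has the shape printed above; `assert_all_uploaded` is a hypothetical helper, not part of the Arthur SDK.

```python
def assert_all_uploaded(counts: dict) -> None:
    """Raise if the counts dict reports any failed inference uploads."""
    failures = counts.get("failure", 0)
    if failures:
        total = counts.get("total", "?")
        raise RuntimeError(f"{failures} of {total} inferences failed to upload")


# Shape copied from the printed output above; passes silently because failure is 0.
assert_all_uploaded({"failure": 0, "success": 5, "total": 5})
```

In the notebook, the equivalent call would be `assert_all_uploaded(inference_result['counts'])`.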


## See Model in Dashboard

In [49]:
# the code below will render a link for you to view your model in the Arthur Dashboard

def render_arthur_model_dashboard_link(arthur, arthur_model):
    url = 'https://' + ''.join(arthur.client.api_base_url.split('/')[1:-2])
    link_text = f"See your model ({arthur_model.display_name}) in the Arthur Dashboard"
    href_string = f"{url}/model/{arthur_model.id}/overview"
    html_string = f'<br> <a style="font-size:200%" href={href_string}>{link_text}</a> <br>'
    display(HTML(html_string))

render_arthur_model_dashboard_link(arthur, arthur_model_gpt3) 

Once your inference data has been uploaded to the platform, follow the link above to the model dashboard page, where you can see an overview of the model and browse its inference data.
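The link helper above derives the dashboard host by splitting the API base URL on `/`. A slightly more explicit alternative is to parse the URL with the standard library. The sketch below assumes `arthur.client.api_base_url` looks like `https://app.arthur.ai/api/v3`; that shape is inferred from the splitting logic, not confirmed by the SDK docs, and `dashboard_url` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def dashboard_url(api_base_url: str, model_id: str) -> str:
    """Build the model overview URL from the API base URL's scheme and host."""
    parts = urlsplit(api_base_url)
    return f"{parts.scheme}://{parts.netloc}/model/{model_id}/overview"


# Assumed API base URL shape; the model ID is the one returned by save() above.
url = dashboard_url("https://app.arthur.ai/api/v3", "bc18efc2-4ead-4cb8-9e9f-df2b3b2ca73f")
print(url)
```

Parsing keeps the original scheme and host regardless of how many path segments the API base URL contains.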

---