# Arthur Sandbox Example: OpenAI question-answering

In this guide, we'll use a question-answering dataset from Huggingface and the ChatGPT endpoint from OpenAI to onboard a new streaming model to the Arthur platform. Then we will use Arthur to analyze our model.


# Outline

Read on for an overview of everything this notebook will cover. **[Click here to dive straight into the code.](#Imports)**

## Onboarding

Onboarding is the process of setting up your model to be monitored by Arthur. You specify the type of data your model ingests, send a reference dataset that provides a baseline distribution of your data, and configure any additional Arthur services you want to use.

**Arthur does not need your model object itself to monitor performance; only predictions are required.**

All you need to monitor your model with Arthur is to upload the predictions your model makes: Arthur computes analytics about your model based on that prediction data. You can compute the predictions directly in a script or notebook like this one and upload them to the platform, or fetch them from an external database and send them to Arthur.

### Getting Model Predictions
We'll prepare a sample from a question-answering dataset and generate answers from the ChatGPT (`gpt-3.5-turbo`) endpoint.

### Registering Model with Arthur
We'll configure our model's attributes and save the model to the Arthur platform.

### Sending Inferences
We'll send model inferences (inputs and predictions) to the Arthur platform.

---

# Setup & Imports

In [1]:
# ensure required packages are installed
#  don't worry, our requirements are flexible!

! pip install -r requirements.txt > /dev/null

In [2]:
from datetime import datetime, timedelta
from IPython.display import display, HTML
import numpy as np
import pandas as pd
import shortuuid

---

# Onboarding

## Loading the Dataset

We're using the [SciQ dataset from Huggingface](https://huggingface.co/datasets/sciq).

The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.

In [3]:
from datasets import load_dataset

In [4]:
sciq_dataset = load_dataset("sciq")

Found cached dataset sciq (/Users/maxcembalest/.cache/huggingface/datasets/sciq/default/0.1.0/50e5c6e3795b55463819d399ec417bfd4c3c621105e00295ddb5f3633d708493)


In [5]:
sciq_df = sciq_dataset.data['train'].table.to_pandas()
sciq_df_inf = sciq_dataset.data['test'].table.to_pandas()

In [8]:
sciq_df_sample = sciq_df.sample(50)
sciq_df_inf_sample = sciq_df_inf.sample(50)

## Create the full LLM inputs by wrapping each question in a prompt template

In [9]:
chatgpt_inputs = [
    f"What is the answer to this question? question: {row.question}, answer: " 
    for _, row in sciq_df_sample.iterrows()
]

chatgpt_inputs_inf = [
    f"What is the answer to this question? question: {row.question}, answer: " 
    for _, row in sciq_df_inf_sample.iterrows()
]
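The prompts built above contain only the question. SciQ also provides a supporting paragraph for most questions (the `support` column), so if you want the model to see that evidence you could place it before the question. The following is a minimal sketch, not part of the original notebook: `build_prompt` is a hypothetical helper, and it falls back to the question-only prompt when the supporting paragraph is empty.

```python
def build_prompt(question: str, support: str = "") -> str:
    """Build one LLM input, optionally preceded by a supporting paragraph.

    Hypothetical helper: it reuses the prompt wording from this notebook and
    only adds the optional `support` text in front of it.
    """
    base = f"What is the answer to this question? question: {question}, answer: "
    support = support.strip()
    if not support:
        # Some SciQ rows have no supporting paragraph; keep the plain prompt.
        return base
    return f"{support} {base}"


# Example with plain dicts standing in for dataframe rows.
rows = [
    {"question": "What shape is the E. coli chromosome?", "support": "Bacterial chromosomes are typically circular."},
    {"question": "What is the solvent in an aqueous solution?", "support": ""},
]
prompts = [build_prompt(r["question"], r["support"]) for r in rows]
```

With a dataframe, the same helper could be applied as `build_prompt(row.question, row.support)` inside the list comprehensions above.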

# Get OpenAI responses

Using the endpoint for the `gpt-3.5-turbo` model, we get an answer to each question from our sample dataframe.

In [10]:
import openai

In [11]:
from typing import List, Tuple

def get_chatgpt_responses(
    inputs: List[str],
    system_config: str = "You answer scientific questions about physics, biology, and chemistry.",
    example_behavior: Tuple[str, str] = ("What is the difference between xylem and phloem?", "Xylem transports and stores water and water-soluble nutrients in vascular plants. Phloem is responsible for transporting sugars, proteins, and other organic molecules in plants. Vascular plants are able to grow higher than other plants due to the rigidity of xylem cells, which support the plant.")
):
    """
    Applies ChatGPT to the inputs
    We manually write a config setting ("system" message) and an example back-and-forth ("user" and "assistant" messages)
    """
    responses = []
    for inp in inputs:
        messages = [
            {"role": "system", "content": system_config},
            {"role": "user", "content": example_behavior[0]},
            {"role": "assistant", "content": example_behavior[1]},
            {"role": "user", "content": inp}
        ]

        openai_response = openai.ChatCompletion.create(
          model="gpt-3.5-turbo",
          messages=messages
        )
        chat_gpt_answer = openai_response["choices"][0]["message"]["content"]
        responses.append(openai_response)
    
    return responses
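Calling a remote endpoint once per input, as the loop above does, can fail partway through on a transient error such as a timeout or a rate limit. The sketch below shows one generic way to retry a call with exponential backoff. `call_with_retries` is a hypothetical helper, not part of the OpenAI or Arthur libraries, and catching every `Exception` is a simplifying assumption; in practice you would likely narrow it to the errors you expect.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on any exception with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2*base_delay, ...
    The last failure is re-raised once `max_attempts` is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

Inside the loop, the request could then be wrapped as `call_with_retries(lambda: openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages))`.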

In [12]:
responses = get_chatgpt_responses(chatgpt_inputs)

In [13]:
responses_inf = get_chatgpt_responses(chatgpt_inputs_inf)

In [14]:
response_texts = [c['choices'][0]['message']['content'] for c in responses]
response_texts_inf = [c['choices'][0]['message']['content'] for c in responses_inf]

#### Get finish reasons

In [15]:
chatgpt_response_finish_reasons_ref = [c['choices'][0]['finish_reason'] for c in responses]
chatgpt_response_finish_reasons_inf = [c['choices'][0]['finish_reason'] for c in responses_inf]

#### Get readability scores

In [16]:
import textstat
chatgpt_readability_ref = [textstat.flesch_reading_ease(o) for o in response_texts]
chatgpt_readability_inf = [textstat.flesch_reading_ease(o) for o in response_texts_inf]
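For context on what these scores mean, the Flesch reading ease formula combines average sentence length and average syllables per word; higher scores indicate easier text. The sketch below computes the formula from raw counts. `flesch_from_counts` is a hypothetical helper written for illustration; `textstat` does its own sentence and syllable counting, so its results for a given text may differ from a hand count.

```python
def flesch_from_counts(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Flesch reading ease from raw counts.

    score = 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    """
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Short sentences with short words score high (easy to read).
easy = flesch_from_counts(total_words=100, total_sentences=10, total_syllables=120)

# One long sentence full of polysyllabic words can score below zero.
dense = flesch_from_counts(total_words=40, total_sentences=1, total_syllables=80)
```

The score is not bounded to the 0 to 100 interval, which is why a long, technical answer can receive a negative readability value.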

#### Get correctness of answers

In [17]:
def get_correctness(outputs, answers):
    correctnesses = []
    for a, o in zip(answers, outputs):
        in_ = True
        for word in a.replace(',','').split():
            if word.lower() not in o.lower():
                in_=False
        correctnesses.append(in_)
    return correctnesses
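Note that `get_correctness` tests each answer word with a substring check, so a short answer can match inside a longer word of the output. The sketch below reproduces the function and compares it with a stricter whole-word variant. `get_correctness_whole_words` is a hypothetical alternative, not the notebook's method, and both remain rough heuristics rather than a real grading scheme.

```python
import re


def get_correctness(outputs, answers):
    """Copy of the notebook's check: every answer word must be a substring of the output."""
    correctnesses = []
    for a, o in zip(answers, outputs):
        in_ = True
        for word in a.replace(',', '').split():
            if word.lower() not in o.lower():
                in_ = False
        correctnesses.append(in_)
    return correctnesses


def get_correctness_whole_words(outputs, answers):
    """Stricter variant: every answer word must appear as a whole word in the output."""
    results = []
    for answer, output in zip(answers, outputs):
        output_words = set(re.findall(r"[a-z0-9']+", output.lower()))
        answer_words = re.findall(r"[a-z0-9']+", answer.lower())
        results.append(all(word in output_words for word in answer_words))
    return results


outputs = [
    "The E. coli chromosome is circular in shape.",
    "Condensation occurs frequently in cells.",
]
answers = ["circular", "ion"]

substring_result = get_correctness(outputs, answers)
whole_word_result = get_correctness_whole_words(outputs, answers)
```

Here the second answer, "ion", is counted as correct by the substring check only because it appears inside "Condensation"; the whole-word variant rejects it.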

In [18]:
chatgpt_correctness_ref = get_correctness(
    response_texts,
    list(sciq_df_sample.correct_answer.values)
)

chatgpt_correctness_inf = get_correctness(
    response_texts_inf,
    list(sciq_df_inf_sample.correct_answer.values)
)

#### Use ChatGPT to get LLM-generated feedback

In [19]:
def get_pseudofeedback(inputs: List[str]):
    responses = []
    for inp in inputs:
        print(inp)
        messages = [
            {"role": "system", "content": "You give short written feedback saying if an answer made sense for a given question. You also rate the answer out of 10 (0=worst,10=best)"},
            {"role": "user", "content": "What is the answer to this question? question: What makes breathing difficult due to respiratory system disease?, answer: The answer to this question is the answer to your question.\n\nIt is important to know the answer to this question because"},
            {"role": "assistant", "content": "You didn't actually answer the question, thats bad and confusing. 1"},
            {"role": "user", "content": "What is the answer to this question? question: What makes breathing difficult due to respiratory system disease?, answer: Respiratory system diseases can make breathing difficult by causing inflammation, damage, or blockages in the airways or lung tissues. These can include conditions such as asthma, chronic obstructive pulmonary disease (COPD), pneumonia, lung cancer, and others. The inflammation and swelling can narrow the airways, making it harder for air to flow in and out of the lungs. Damage to the lung tissues can reduce their ability to expand and contract, which can also make breathing difficult. Additionally, excess mucus or fluid in the lungs can further narrow the airways and make it hard to breathe. These factors can cause symptoms such as shortness of breath, coughing, wheezing, and fatigue."},
            {"role": "assistant", "content": "Thank you very much, that makes a lot of sense! 10"},
            {"role": "user", "content": inp}
        ]

        openai_response = openai.ChatCompletion.create(
          model="gpt-3.5-turbo",
          messages=messages
        )
        chat_gpt_answer = openai_response["choices"][0]["message"]["content"]
        print(chat_gpt_answer, "\n=====\n")
        responses.append(openai_response)
    
    return responses

In [20]:
chatgpt_pseudofeedback_ref = get_pseudofeedback([inp + rt for inp, rt in zip(chatgpt_inputs, response_texts)])
chatgpt_pseudofeedback_inf = get_pseudofeedback([inp + rt for inp, rt in zip(chatgpt_inputs_inf, response_texts_inf)])



What is the answer to this question? question: How often does condensation occur in your cells?, answer: Condensation frequently occurs within cells during various cellular activities such as respiration, metabolism, energy production, protein synthesis, and DNA replication. However, the exact frequency of condensation occurring within cells would depend on the specific cellular activity taking place and the environmental conditions of the cell.
The answer is clear and provides relevant information to the question. Well done! 9 
=====

What is the answer to this question? question: The e. coli chromosome, like many other bacterial chromosomes, is what shape?, answer: The E. coli chromosome, like many other bacterial chromosomes, is circular in shape.
Your answer is very clear and accurate. It makes sense for the given question. 10. 
=====

What is the answer to this question? question: Which members of the food chain break down remains of plants and other organisms when they die?, answ

In [21]:
def get_feedback_scores(feedback_strings):
    feedback_ints = []
    for fs in feedback_strings:
        score = -1
        fs = fs.replace('out of 10', '')
        for num_string in np.arange(11).astype(str):
            if num_string in fs:
                score = int(num_string)
        feedback_ints.append(score)
    return feedback_ints
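The function above looks for each number from 0 to 10 as a substring, so any stray digit in the feedback (for example inside "100" or "COVID-19") can be read as a score. A stricter parse is sketched below under the assumption that the rating is the last standalone integer between 0 and 10 in the message. `get_feedback_scores_regex` is a hypothetical variant, not the notebook's method.

```python
import re
from typing import List


def get_feedback_scores_regex(feedback_strings: List[str]) -> List[int]:
    """Extract the last standalone 0-10 integer from each message, or -1 if none."""
    scores = []
    for fs in feedback_strings:
        # Drop the denominator of ratings written as "7 out of 10" or "7/10".
        cleaned = re.sub(r"(/\s*10|out of 10)", "", fs)
        # Match 0-10 only when not glued to other digits (so "100" and "19" are ignored).
        matches = re.findall(r"(?<!\d)(10|[0-9])(?!\d)", cleaned)
        scores.append(int(matches[-1]) if matches else -1)
    return scores


examples = [
    "The answer is clear and provides relevant information. Well done! 9",
    "Your answer is very clear and accurate. 10.",
    "I would rate this 7 out of 10",
    "No rating given.",
]
parsed = get_feedback_scores_regex(examples)
```

As in the original, a message with no recognizable rating maps to `-1`, which matches the `-1` category registered for `user_feedback_score` later in this notebook.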

In [22]:
pseudofeedback_messages_ref = [c['choices'][0]['message']['content'] for c in chatgpt_pseudofeedback_ref]
pseudofeedback_messages_inf = [c['choices'][0]['message']['content'] for c in chatgpt_pseudofeedback_inf]

feedback_scores_ref = get_feedback_scores(pseudofeedback_messages_ref)
feedback_scores_inf = get_feedback_scores(pseudofeedback_messages_inf)

### Create inference dataframes

In [23]:
chatgpt_ref_data = pd.DataFrame({
    'user_input':chatgpt_inputs,
    'output_text':response_texts,
    'finish_reason':chatgpt_response_finish_reasons_ref,
    'readability':chatgpt_readability_ref,
    'correct':chatgpt_correctness_ref,
    'user_feedback':pseudofeedback_messages_ref,
    'user_feedback_score':feedback_scores_ref,
})
chatgpt_ref_data

| | user_input | output_text | finish_reason | readability | correct | user_feedback | user_feedback_score |
|---|---|---|---|---|---|---|---|
| 0 | What is the answer to this question? question:... | Condensation frequently occurs within cells du... | stop | -2.13 | False | The answer is clear and provides relevant info... | 9 |
| 1 | What is the answer to this question? question:... | The E. coli chromosome, like many other bacter... | stop | 41.36 | True | Your answer is very clear and accurate. It mak... | 10 |
| 2 | What is the answer to this question? question:... | The members of the food chain that break down ... | stop | 51.48 | True | Great answer, it is clear and directly address... | 10 |
| 3 | What is the answer to this question? question:... | The answer to this question is "lahar". A laha... | stop | 62.17 | False | Excellent answer, that is absolutely correct a... | 10 |
| 4 | What is the answer to this question? question:... | The element that organic substances, including... | stop | 31.89 | True | Your answer is correct and makes sense. 10 | 10 |


In [24]:
chatgpt_inf_data = pd.DataFrame({
    'user_input':chatgpt_inputs_inf,
    'output_text':response_texts_inf,
    'finish_reason':chatgpt_response_finish_reasons_inf,
    'readability':chatgpt_readability_inf,
    'correct':chatgpt_correctness_inf,
    'user_feedback':pseudofeedback_messages_inf,
    'user_feedback_score':feedback_scores_inf,
})
chatgpt_inf_data

| | user_input | output_text | finish_reason | readability | correct | user_feedback | user_feedback_score |
|---|---|---|---|---|---|---|---|
| 0 | What is the answer to this question? question:... | The term associated with mining for a rock tha... | stop | 55.24 | True | Yes, that answer makes sense in the context of... | 10 |
| 1 | What is the answer to this question? question:... | When mammals raise their hair with tiny muscle... | stop | 57.3 | True | Great answer, it is clear and informative. 10 | 10 |
| 2 | What is the answer to this question? question:... | The pressure inside a container is dependent o... | stop | 50.16 | True | Great answer, it is clear and accurate. 10 | 10 |
| 3 | What is the answer to this question? question:... | In an aqueous solution, the solvent is water. | stop | 63.36 | True | Yes, that's correct and concise. Good job! 9 | 9 |
| 4 | What is the answer to this question? question:... | The answer is "energy." Energy is required by ... | stop | 70.29 | True | Yes, that answer makes perfect sense! 10 | 10 |


In [25]:
# chatgpt_ref_data.to_csv('chatgpt_reference_data.csv')
# chatgpt_inf_data.to_csv('chatgpt_inference_data.csv')

## Registering Model With Arthur

### Setting Up Connection
Supply your login to authenticate with the platform.

In [29]:
from arthurai import ArthurAI
# connect to Arthur

arthur = ArthurAI(
    url="https://app.arthur.ai",  # you can also pass this through the ARTHUR_ENDPOINT_URL environment variable
    login="<your login here>",  # you can also pass this through the ARTHUR_LOGIN environment variable
)

Please enter password for admin: ········


### Registering Model Type

In [30]:
from arthurai.common.constants import InputType, OutputType, ValueType, Stage

We'll instantiate an [`ArthurModel`](https://docs.arthur.ai/sdk/sdk_v3/apiref/arthurai.core.models.ArthurModel.html) with the `ArthurAI.model()` method, which constructs a new local `ArthurModel` object. Later we'll use `ArthurModel.save()` to register this model with the Arthur platform.

We give the model a user-friendly `display_name` and allow the unique `partner_model_id` field to be automatically generated, but you can supply a unique identifier if it helps you map your models in Arthur to your other MLOps systems.

The `InputType` of a model specifies the general type of data your model ingests. The `OutputType` of a model specifies the modeling task at hand.

### Building the model by specifying attributes

We use a helper function to register the model attributes for the input and output text the model will process.

In [31]:
# register arthur model type
arthur_model_chatgpt = arthur.model(
    display_name="OpenAI_ChatGPT",
    input_type=InputType.NLP, 
    output_type=OutputType.TokenSequence)

# register attributes for token sequence model
arthur_model_chatgpt.build_token_sequence_model(
    input_column="user_input", 
    output_text_column="output_text"
)

# register additional non-input attributes
arthur_model_chatgpt.add_attribute(
    name='finish_reason', 
    stage=Stage.NonInputData,
    value_type=ValueType.String,
    categorical=True,
    categories=['length', 'stop']
)

arthur_model_chatgpt.add_attribute(
    name="readability",
    stage=Stage.NonInputData,
    value_type=ValueType.Float,
    min_range=0,
    max_range=100
)

arthur_model_chatgpt.add_attribute(
    name="correct",
    stage=Stage.NonInputData,
    value_type=ValueType.Boolean,
    categorical=True,
    categories=[True,False]
)

arthur_model_chatgpt.add_attribute(
    name="user_feedback",
    stage=Stage.NonInputData,
    value_type=ValueType.String,
    categorical=True,
    is_unique=True
)

arthur_model_chatgpt.add_attribute(
    name="user_feedback_score",
    stage=Stage.NonInputData,
    value_type=ValueType.Integer,
    categorical=True,
    categories=[-1,0,1,2,3,4,5,6,7,8,9,10]
)

arthur_model_chatgpt.review()

| | name | stage | value_type | categorical | is_unique | categories | bins | range | monitor_for_bias |
|---|---|---|---|---|---|---|---|---|---|
| 0 | user_input | PIPELINE_INPUT | UNSTRUCTURED_TEXT | True | True | [] | | [None, None] | False |
| 1 | output_text | PREDICTED_VALUE | UNSTRUCTURED_TEXT | True | False | [] | | [None, None] | False |
| 2 | finish_reason | NON_INPUT_DATA | STRING | True | False | [{value: length}, {value: stop}] | | [None, None] | False |
| 3 | readability | NON_INPUT_DATA | FLOAT | False | False | [] | | [0, 100] | False |
| 4 | correct | NON_INPUT_DATA | BOOLEAN | True | False | [{value: True}, {value: False}] | | [None, None] | False |
| 5 | user_feedback | NON_INPUT_DATA | STRING | True | True | [] | | [None, None] | False |
| 6 | user_feedback_score | NON_INPUT_DATA | INTEGER | True | False | [{value: -1}, {value: 0}, {value: 1}, {value: ... | | [None, None] | False |


### Saving the Model

Before saving, be sure to review your model to make sure everything is correct. We already saw the model schema above, but since we added and configured several attributes, we call `ArthurModel.review()` once more to confirm that the attribute configurations look correct before saving to the platform. See the [onboarding walkthrough on the Arthur docs](https://docs.arthur.ai/user-guide/walkthroughs/model-onboarding/index.html#review-model) for tips on reviewing your model.

Note that while we capture the ranges of the attributes in this schema, they don’t need to be exact and won’t affect any performance calculations. They’re used as metadata to configure plots in the online Arthur dashboard, but never affect data drift or any other computations.

In [32]:
# review the model attribute properties in the model schema
arthur_model_chatgpt.review()

| | name | stage | value_type | categorical | is_unique | categories | bins | range | monitor_for_bias |
|---|---|---|---|---|---|---|---|---|---|
| 0 | user_input | PIPELINE_INPUT | UNSTRUCTURED_TEXT | True | True | [] | | [None, None] | False |
| 1 | output_text | PREDICTED_VALUE | UNSTRUCTURED_TEXT | True | False | [] | | [None, None] | False |
| 2 | finish_reason | NON_INPUT_DATA | STRING | True | False | [{value: length}, {value: stop}] | | [None, None] | False |
| 3 | readability | NON_INPUT_DATA | FLOAT | False | False | [] | | [0, 100] | False |
| 4 | correct | NON_INPUT_DATA | BOOLEAN | True | False | [{value: True}, {value: False}] | | [None, None] | False |
| 5 | user_feedback | NON_INPUT_DATA | STRING | True | True | [] | | [None, None] | False |
| 6 | user_feedback_score | NON_INPUT_DATA | INTEGER | True | False | [{value: -1}, {value: 0}, {value: 1}, {value: ... | | [None, None] | False |


Now, we save the model. 

Note that this is the first call in this example that sends data to the Arthur platform; nothing has been sent up to this point.

The method `ArthurModel.save()` sends an API request to Arthur to validate your model - if there are any problems with your model schema, this method will result in an error informing you how to correct your model's configuration. If no errors are found, the model will be saved to the platform.

In [33]:
# validate the model and save it onto the Arthur platform
arthur_model_chatgpt.save()

10:46:15 - arthurai - We have registered the  model with Arthur and are getting it ready to accept inferences...
10:47:01 - arthurai - Model Creation Completed successfully, you can now send Data to Arthur.


'113fbaf0-5749-47a2-94da-63d93077d520'

In [34]:
arthur_model_chatgpt.set_reference_data(data=chatgpt_ref_data['user_input'])

18:25:21 - arthurai - Starting upload (0.005 MB in 1 files), depending on data size this may take a few minutes
18:25:21 - arthurai - Upload completed: /var/folders/8v/8v36mrp907z7lp5d4cd7yf4h0000gn/T/tmpmd5plux2/aea149ea-aef1-43e3-9acc-de15e81be501-0.parquet


({'counts': {'success': 50, 'failure': 0, 'total': 50}, 'failures': [[]]},
 {'dataset_close_result': {'message': 'success'}})

<a id='sending_inferences_content'></a>

## Sending Inferences

#### Using `ArthurModel.send_inferences()`

We send our inference data to the platform along with some unique IDs. Timestamps will be auto-generated for these inferences by the `send_inferences` function from the Arthur SDK. See [our API docs for sending inferences](https://docs.arthur.ai/api-documentation/v3-api-docs.html#tag/inferences/paths/~1models~1%7Bmodel_id%7D~1inferences/post) for the full specification of inference ingestion in Arthur.

In [35]:
inference_result = arthur_model_chatgpt.send_inferences(
    chatgpt_inf_data, 
    partner_inference_ids=[shortuuid.uuid() for _ in range(len(chatgpt_inputs_inf))])
print('# of successful and # of failing inference uploads:', inference_result['counts'])

10:47:29 - arthurai - 5 rows were missing inference_timestamp fields, so the current time was populated
# of successful and # of failing inference uploads: {'failure': 0, 'success': 5, 'total': 5}
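The log line above shows the SDK filling in the current time because the rows had no `inference_timestamp` field. If you would rather control IDs and timestamps yourself, for example to spread historical inferences over a time window, the values can be generated up front. The sketch below uses only the standard library (`uuid` stands in for `shortuuid`), and `make_ids_and_timestamps` is a hypothetical helper; how the timestamps are attached to the upload (such as an `inference_timestamp` column) is an assumption to confirm against the Arthur API docs linked above.

```python
import uuid
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


def make_ids_and_timestamps(
    n: int,
    end: Optional[datetime] = None,
    step: timedelta = timedelta(hours=1),
) -> Tuple[List[str], List[datetime]]:
    """Return n unique IDs and n timezone-aware timestamps ending at `end`.

    Timestamps are evenly spaced by `step`, oldest first, with the last one
    equal to `end` (defaults to the current UTC time).
    """
    end = end or datetime.now(timezone.utc)
    ids = [uuid.uuid4().hex for _ in range(n)]
    timestamps = [end - step * (n - 1 - i) for i in range(n)]
    return ids, timestamps


# Example: five inferences spaced one hour apart, ending at a fixed time.
fixed_end = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
ids, timestamps = make_ids_and_timestamps(5, end=fixed_end)
```

The generated IDs could be passed as `partner_inference_ids` in the call shown above.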


## See Model in Dashboard

In [36]:
# the code below will render a link for you to view your model in the Arthur Dashboard

def render_arthur_model_dashboard_link(arthur, arthur_model):
    url = 'https://' + ''.join(arthur.client.api_base_url.split('/')[1:-2])
    link_text = f"See your model ({arthur_model.display_name}) in the Arthur Dashboard"
    href_string = f"{url}/model/{arthur_model.id}/overview"
    html_string = f'<br> <a style="font-size:200%" href={href_string}>{link_text}</a> <br>'
    display(HTML(html_string))

render_arthur_model_dashboard_link(arthur, arthur_model_chatgpt) 

Once your inference data has been uploaded to the platform, follow the link above to the model dashboard page, where you can see an overview of the model and browse its inference data.
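The link-rendering function above derives the dashboard host by splitting the API base URL on `/` and dropping the last two path segments. The sketch below shows what that slicing yields and an equivalent using `urllib.parse`, which does not depend on how many path segments follow the host. The example value `https://app.arthur.ai/api/v3` is an assumed shape for `api_base_url`, and `dashboard_base_url` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def dashboard_base_url(api_base_url: str) -> str:
    """Return scheme and host of the API URL, without any path."""
    parts = urlsplit(api_base_url)
    return f"{parts.scheme}://{parts.netloc}"


# Assumed example value; the real client attribute may look different.
api_base_url = "https://app.arthur.ai/api/v3"

# What the notebook's slicing produces for this value.
sliced = "https://" + "".join(api_base_url.split("/")[1:-2])

# The urllib-based equivalent.
parsed = dashboard_base_url(api_base_url)
```

For this assumed input both approaches give the same host, but the slicing version would break if the path had a different number of segments.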

---