# Quickstart - Medical Transcript Classifier

In this guide, we'll use the medical transcript dataset (and a pre-trained model) to onboard a new model to the Arthur platform. We will go through:
* Onboarding a model to Arthur
* Uploading a reference set
* Enabling explainability
* Sending inferences

In [121]:
from arthurai import ArthurAI
from arthurai.client.apiv3 import InputType, OutputType, Stage, TextDelimiter

import pandas as pd
import joblib
import os
import pickle  # used below to load the saved class list
import uuid
from datetime import datetime

### Set up connection

Supply your API key below to authenticate with the platform.

In [5]:
URL = "app.arthur.ai"
ACCESS_KEY = "..."

connection = ArthurAI(url=URL, access_key=ACCESS_KEY)
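
Hardcoding a key in a notebook makes it easy to leak. One alternative, shown here only as a minimal sketch, is to read the settings from the environment. The variable names `ARTHUR_URL` and `ARTHUR_API_KEY` and the helper `load_arthur_settings` are names chosen for this illustration, not something this guide defines.

```python
import os


def load_arthur_settings(env=os.environ):
    """Read connection settings from environment variables.

    ARTHUR_URL and ARTHUR_API_KEY are names assumed for this sketch.
    """
    url = env.get("ARTHUR_URL", "app.arthur.ai")
    access_key = env.get("ARTHUR_API_KEY")
    if not access_key:
        raise RuntimeError("Set ARTHUR_API_KEY before connecting")
    return url, access_key
```

The returned pair can then be passed to `ArthurAI(url=..., access_key=...)` in the same way as the hardcoded values above.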

### Load Data

First, we will load our training data, which we will use to help define the model schema.

In [14]:
df = pd.read_csv('../datasets/processed_mtsamples.csv')
df.head(5)

Unnamed: 0.1,Unnamed: 0,transcription,medical_specialty
0,0,"2-d m-mode: , ,1. left atrial enlargement wit...",cardiovascular / pulmonary
1,1,1. the left ventricular cavity size and wall ...,cardiovascular / pulmonary
2,2,"2-d echocardiogram,multiple views of the heart...",cardiovascular / pulmonary
3,3,"description:,1. normal cardiac chambers size....",cardiovascular / pulmonary
4,4,"2-d study,1. mild aortic stenosis, widely calc...",cardiovascular / pulmonary


In [46]:
# we have pre-saved the class list from the label encoder used to train the model
# classes are stored as a list whose indices match the classifier's output columns
with open('../medical_transcript_model/classes.pkl', 'rb') as f:
    classes = pickle.load(f).tolist()
classes

### Create Model

We will instantiate a model object with a small amount of metadata about the model's input and output types. Then we will use a sample of the training data to help define the full schema for this NLP model.

NLP models require a `text_delimiter`, which specifies how a raw document is split into tokens.

In [64]:
model = connection.model(
    partner_model_id="Medical Transcript Classifier",
    input_type=InputType.NLP,
    output_type=OutputType.Multiclass,
    text_delimiter=TextDelimiter.NOT_WORD
)
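
To get a feel for what a "not a word character" delimiter does to a transcript, here is a small standalone sketch. It assumes that `TextDelimiter.NOT_WORD` corresponds to the regular expression `\W+` (runs of non-word characters); that mapping and the helper name `split_tokens` are assumptions of this example and are not confirmed by the guide.

```python
import re

# Assumption for this sketch: the delimiter is the regex \W+ (runs of non-word characters)
NOT_WORD_PATTERN = r"\W+"


def split_tokens(document, pattern=NOT_WORD_PATTERN):
    """Split a raw document on the delimiter pattern, dropping empty strings."""
    return [tok for tok in re.split(pattern, document) if tok]


tokens = split_tokens("2-d m-mode: , ,1. left atrial enlargement")
print(tokens)
```

Under this assumption, punctuation and whitespace never appear as tokens, and hyphenated terms such as `m-mode` are split into `m` and `mode`.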

We need to register the data schema for the inputs to the model. Since your model might have hundreds or thousands of input features, you can just pass us a pandas DataFrame of your training data, and we'll handle the rest.

In [65]:
model.from_dataframe(df['transcription'], Stage.ModelPipelineInput)

We also need to register the schema for the outputs of the model: what will a typical prediction look like, and what will a typical ground truth look like? What names, shapes, and datatypes should Arthur expect for these objects?

Since this is a classification model, we'll do this all in one step with the `.add_multiclass_classifier_output_attributes()` method. All we need to supply is a mapping that establishes:

* names for the model's predictions
* names for the model's ground truth
* the mapping that relates the two

We will name each prediction attribute after its class. Each ground truth attribute will use the same name with a `ground_truth_` prefix.
Raw data can be messy. Our class names contain invalid characters (slashes, hyphens, spaces, and periods), so we also need to format them.

In [66]:
# create function to format class names
def format_class_name(name):
    return name.replace('/', '').replace('-', '').replace(' ', '_').replace('.', '')

# example
print("Raw:", classes[0])
print("Formatted:", format_class_name(classes[0]))

Raw: cardiovascular / pulmonary
Formatted: cardiovascular__pulmonary
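
Stripping characters can make two different raw class names collapse into the same formatted name, so it may be worth checking the result before building the mapping. The sketch below is one way to do that. The validity rule it uses (letters, digits, and underscores, not starting with a digit) and the helper name `check_attribute_names` are assumptions of this example, not rules stated by the guide.

```python
import re

# Assumed rule for this sketch: letters, digits, underscores; must not start with a digit
VALID_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def format_class_name(name):
    # same formatting as the cell above, repeated so the sketch is self-contained
    return name.replace('/', '').replace('-', '').replace(' ', '_').replace('.', '')


def check_attribute_names(raw_names, formatter=format_class_name):
    """Return formatted names, raising if any is invalid or two raw names collide."""
    formatted = [formatter(name) for name in raw_names]
    invalid = [name for name in formatted if not VALID_NAME.match(name)]
    if invalid:
        raise ValueError(f"invalid attribute names: {invalid}")
    if len(set(formatted)) != len(formatted):
        raise ValueError("two class names map to the same attribute name")
    return formatted


print(check_attribute_names(["cardiovascular / pulmonary", "urology"]))
```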


In [67]:
formatted_class_names = [format_class_name(raw_class) for raw_class in classes]
output_mapping = {
    name: f"ground_truth_{name}"
    for name in formatted_class_names
}
output_mapping

{'cardiovascular__pulmonary': 'ground_truth_cardiovascular__pulmonary',
 'consult__history_and_phy': 'ground_truth_consult__history_and_phy',
 'gastroenterology': 'ground_truth_gastroenterology',
 'general_medicine': 'ground_truth_general_medicine',
 'neurology': 'ground_truth_neurology',
 'obstetrics__gynecology': 'ground_truth_obstetrics__gynecology',
 'orthopedic': 'ground_truth_orthopedic',
 'radiology': 'ground_truth_radiology',
 'soap__chart__progress_notes': 'ground_truth_soap__chart__progress_notes',
 'urology': 'ground_truth_urology'}

In [68]:
model.add_multiclass_classifier_output_attributes(output_mapping)

{'cardiovascular__pulmonary': <arthurai.client.apiv3.attributes.ArthurAttribute at 0x1292513c8>,
 'ground_truth_cardiovascular__pulmonary': <arthurai.client.apiv3.attributes.ArthurAttribute at 0x129251da0>,
 'consult__history_and_phy': <arthurai.client.apiv3.attributes.ArthurAttribute at 0x129251eb8>,
 'ground_truth_consult__history_and_phy': <arthurai.client.apiv3.attributes.ArthurAttribute at 0x1294654a8>,
 'gastroenterology': <arthurai.client.apiv3.attributes.ArthurAttribute at 0x129251e10>,
 'ground_truth_gastroenterology': <arthurai.client.apiv3.attributes.ArthurAttribute at 0x129382978>,
 'general_medicine': <arthurai.client.apiv3.attributes.ArthurAttribute at 0x129382e80>,
 'ground_truth_general_medicine': <arthurai.client.apiv3.attributes.ArthurAttribute at 0x129442048>,
 'neurology': <arthurai.client.apiv3.attributes.ArthurAttribute at 0x1293820b8>,
 'ground_truth_neurology': <arthurai.client.apiv3.attributes.ArthurAttribute at 0x1294655c0>,
 'obstetrics__gynecology': <arthura

Before saving, you can review the model to make sure everything is correct.

In [69]:
model.review()

Unnamed: 0,name,stage,value_type,categorical,is_unique,categories,range,monitor_for_bias
0,ground_truth_cardiovascular__pulmonary,GROUND_TRUTH,INTEGER,True,False,"[{value: 0}, {value: 1}]","[None, None]",False
1,ground_truth_consult__history_and_phy,GROUND_TRUTH,INTEGER,True,False,"[{value: 0}, {value: 1}]","[None, None]",False
2,ground_truth_gastroenterology,GROUND_TRUTH,INTEGER,True,False,"[{value: 0}, {value: 1}]","[None, None]",False
3,ground_truth_general_medicine,GROUND_TRUTH,INTEGER,True,False,"[{value: 0}, {value: 1}]","[None, None]",False
4,ground_truth_neurology,GROUND_TRUTH,INTEGER,True,False,"[{value: 0}, {value: 1}]","[None, None]",False
5,ground_truth_obstetrics__gynecology,GROUND_TRUTH,INTEGER,True,False,"[{value: 0}, {value: 1}]","[None, None]",False
6,ground_truth_orthopedic,GROUND_TRUTH,INTEGER,True,False,"[{value: 0}, {value: 1}]","[None, None]",False
7,ground_truth_radiology,GROUND_TRUTH,INTEGER,True,False,"[{value: 0}, {value: 1}]","[None, None]",False
8,ground_truth_soap__chart__progress_notes,GROUND_TRUTH,INTEGER,True,False,"[{value: 0}, {value: 1}]","[None, None]",False
9,ground_truth_urology,GROUND_TRUTH,INTEGER,True,False,"[{value: 0}, {value: 1}]","[None, None]",False


In [70]:
model.save()

'11c971c0-6814-4778-8add-8ab98328ff0a'

### Setting baseline data
Next, we'll use the training data to set a baseline reference for calculating data drift.

For tracking data drift, you can upload a dataset to serve as the baseline or reference set. Often, this is a sample of your training data for the associated model. Our reference dataset should ideally include examples of the following for a sample of the training set:

* inputs
* ground truth
* model predictions

This way, Arthur can monitor for drift and stability in all of these aspects.

In [72]:
# load our pre-trained classifier to generate predictions
classifier = joblib.load('../medical_transcript_model/model.pkl')

In [91]:
# create reference set
reference_set = pd.DataFrame()
reference_set['transcription'] = df['transcription']

# create ground truth columns
gt_rows = []
for val in df['medical_specialty']:
    actual = classes.index(val)
    gt_rows.append({
        f"ground_truth_{class_name}": 1 if i == actual else 0
        for i, class_name in enumerate(formatted_class_names)
    })
gt_df = pd.DataFrame(gt_rows)

# create predictions
preds = classifier.predict_proba(df['transcription'])
pred_rows = []
for pred in preds:
    pred_rows.append({
        class_name: pred[i]
        for i, class_name in enumerate(formatted_class_names)
    })
pred_df = pd.DataFrame(pred_rows)

# combine 
reference_set = pd.concat([reference_set, gt_df, pred_df], axis=1)
reference_set.head(5)

Unnamed: 0,transcription,ground_truth_cardiovascular__pulmonary,ground_truth_consult__history_and_phy,ground_truth_gastroenterology,ground_truth_general_medicine,ground_truth_neurology,ground_truth_obstetrics__gynecology,ground_truth_orthopedic,ground_truth_radiology,ground_truth_soap__chart__progress_notes,...,cardiovascular__pulmonary,consult__history_and_phy,gastroenterology,general_medicine,neurology,obstetrics__gynecology,orthopedic,radiology,soap__chart__progress_notes,urology
0,"2-d m-mode: , ,1. left atrial enlargement wit...",1,0,0,0,0,0,0,0,0,...,0.316751,0.043254,0.044706,0.046426,0.090356,0.031201,0.049627,0.288422,0.053869,0.035387
1,1. the left ventricular cavity size and wall ...,1,0,0,0,0,0,0,0,0,...,0.334772,0.043876,0.043201,0.045382,0.085806,0.031134,0.045819,0.284302,0.053869,0.031841
2,"2-d echocardiogram,multiple views of the heart...",1,0,0,0,0,0,0,0,0,...,0.265649,0.064104,0.062531,0.067106,0.09871,0.041992,0.062808,0.216498,0.074221,0.046379
3,"description:,1. normal cardiac chambers size....",1,0,0,0,0,0,0,0,0,...,0.282942,0.045061,0.051904,0.043727,0.099272,0.032984,0.055726,0.3006,0.049215,0.03857
4,"2-d study,1. mild aortic stenosis, widely calc...",1,0,0,0,0,0,0,0,0,...,0.28301,0.055204,0.047477,0.058051,0.10438,0.036132,0.059297,0.254366,0.06082,0.041261
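
Before uploading, a quick sanity check on the reference rows can catch column mix-ups. The following sketch checks that each row has exactly one ground truth column set to 1 and that the predicted probabilities sum to roughly 1. It assumes the classifier returns normalized probabilities; the helper name `check_reference_row` and the tolerance are choices made for this example.

```python
import math


def check_reference_row(row, class_names, tol=1e-6):
    """Check one reference-set row given as a dict of column name -> value."""
    gt = [row[f"ground_truth_{name}"] for name in class_names]
    probs = [row[name] for name in class_names]
    # exactly one ground truth column is 1 and the rest are 0
    one_hot = sorted(gt) == [0] * (len(gt) - 1) + [1]
    # predicted probabilities add up to 1 within the tolerance
    sums_to_one = math.isclose(sum(probs), 1.0, abs_tol=tol)
    return one_hot and sums_to_one
```

With the objects from the cell above, this could be applied as `all(check_reference_row(r, formatted_class_names) for r in reference_set.to_dict("records"))`.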


In [93]:
model.set_reference_data(data=reference_set)

{'counts': {'success': 1989, 'failure': 0, 'total': 1989}, 'failures': [[]]}

### Enable Explainability

We want to be able to visualize why our model made the predictions it did. For that, we need to enable explainability.
For more details on enabling explainability, [see the docs](https://docs.arthur.ai/guides/explainability.html).
For this example, we have pre-created the Python file with a `predict()` function.
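
The pre-created file itself is not shown in this guide. As a rough sketch only, a module named `entrypoint.py` (matching `user_predict_function_import_path="entrypoint"` in the cell below) might look something like the following. The lazy loading, the optional `model` argument, and the assumption that `model.pkl` sits in the same directory as the module are choices made for this illustration, not details the guide confirms.

```python
# entrypoint.py (hypothetical sketch; the real file is not shown in this guide)
import os

_MODEL = None  # cached classifier, loaded on first use


def _load_model():
    """Load the pickled classifier assumed to sit next to this file."""
    global _MODEL
    if _MODEL is None:
        import joblib  # imported lazily so the module can be imported without joblib
        here = os.path.dirname(os.path.abspath(__file__))
        _MODEL = joblib.load(os.path.join(here, "model.pkl"))
    return _MODEL


def predict(x, model=None):
    """Return one row of class probabilities per input document."""
    clf = model if model is not None else _load_model()
    return clf.predict_proba(x)
```

The optional `model` argument is there only so the function can be exercised with a stand-in classifier; called with a single argument, it falls back to the pickled model.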

In [98]:
os.getcwd().replace('nlp_medical_transcript_classifier/notebooks', 'nlp_medical_transcript_classifier/')

'/Users/RJ/arthur-ai/projects/arthur-sandbox/example_projects/nlp_medical_transcript_classifier/'

In [100]:
# define path to project directory
project_dir = os.getcwd()
project_dir = project_dir.replace('nlp_medical_transcript_classifier/notebooks', 'nlp_medical_transcript_classifier/')
project_dir += 'medical_transcript_model'

# get our training data without labels into a dataframe
sample_data = pd.DataFrame(df['transcription']).head(50)

# enable explainability
model.enable_explainability(
    df=sample_data,
    project_directory=project_dir,
    streaming_explainability_enabled=True,
    requirements_file="requirements.txt",
    user_predict_function_import_path="entrypoint",
    explanation_nsamples=1000,
    explanation_algo='lime'
)

`enable_shap` was set to True, but SHAP is currently not supported for NLP models. Automatically disabling SHAP


### Send Inference

Now we will walk through sending a single inference together with its ground truth.

In [124]:
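# grab a row to make a prediction for
# NOTE: `inf_df` is not defined earlier in this guide; it is assumed to be a
# DataFrame of transcripts with the same columns as `df`
idx = 0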

record = inf_df.iloc[idx:idx+1]['transcription']
pred = classifier.predict_proba(record)[0]
actual = classes.index(inf_df.iloc[idx]['medical_specialty'])

gt_data = {
    f"ground_truth_{class_name}": 1 if i == actual else 0
    for i, class_name in enumerate(formatted_class_names)
}
inference_data = {
    class_name: pred[i]
    for i, class_name in enumerate(formatted_class_names)
}
inference_data['transcription'] = record.iloc[0]
external_id = str(uuid.uuid4())
cur_time = datetime.utcnow().isoformat()
    
inference = {
    "partner_inference_id": external_id,
    "inference_timestamp": cur_time,
    "inference_data": inference_data,
    "ground_truth_timestamp": cur_time,
    "ground_truth_data": gt_data
}

# inspect the inference to see final format
inference

{'partner_inference_id': '38688fda-bbc5-4143-93f1-e74f71569bca',
 'inference_timestamp': '2020-11-30T23:04:30.649141',
 'inference_data': {'cardiovascular__pulmonary': 0.3357206129242593,
  'consult__history_and_phy': 0.020132120435966218,
  'gastroenterology': 0.14195127669875848,
  'general_medicine': 0.02400584436194211,
  'neurology': 0.060343351789777976,
  'obstetrics__gynecology': 0.0669489613621052,
  'orthopedic': 0.10846941497590533,
  'radiology': 0.0665582478555355,
  'soap__chart__progress_notes': 0.026751419018192963,
  'urology': 0.14911875057755708,
  'transcription': 'preoperative diagnoses,1.  empyema thoracis.,2.  need for intravenous antibiotics.,postoperative diagnoses,1.  empyema thoracis.,2.  need for intravenous antibiotics.,procedure:,  central line insertion.,description of procedure: , with the patient in his room, after obtaining the informed consent, his left deltopectoral area was prepped and draped in the usual fashion.  xylocaine 1% was infiltrated and w

In [125]:
model.send_inferences([inference])

{'counts': {'failure': 0, 'success': 1, 'total': 1},
 'results': [{'message': 'success', 'row_number': 0, 'status': 200}]}
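
When sending more than one record, the dictionary-building steps above can be wrapped in a helper. The sketch below mirrors the payload format shown above; the function name `build_inference` is hypothetical, and it uses a timezone-aware UTC timestamp rather than `datetime.utcnow()`, which is a choice of this example. Whether the platform treats the two timestamp forms identically is not something this guide confirms.

```python
import uuid
from datetime import datetime, timezone


def build_inference(text, probabilities, actual_index, class_names):
    """Assemble one inference dict in the format shown above."""
    inference_data = {name: float(p) for name, p in zip(class_names, probabilities)}
    inference_data["transcription"] = text
    ground_truth = {
        f"ground_truth_{name}": 1 if i == actual_index else 0
        for i, name in enumerate(class_names)
    }
    # timezone-aware UTC timestamp; the cell above uses a naive datetime.utcnow()
    now = datetime.now(timezone.utc).isoformat()
    return {
        "partner_inference_id": str(uuid.uuid4()),
        "inference_timestamp": now,
        "inference_data": inference_data,
        "ground_truth_timestamp": now,
        "ground_truth_data": ground_truth,
    }
```

A list of such dictionaries can be passed to `model.send_inferences(...)` in one call, in the same way as the single-element list above.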