# Quickstart - Medical Transcript Classifier

In this guide, we'll use the medical transcript dataset (and a pre-trained model) to onboard a new model to the Arthur platform. We will go through:
* Onboarding a model to Arthur
* Enabling explainability
* Sending Inferences

In [1]:
from datetime import datetime, timedelta
import os
import pickle
import uuid

from arthurai import ArthurAI
from arthurai.common.constants import InputType, OutputType, Stage, TextDelimiter
from random import randint
import joblib
import pandas as pd
import numpy as np
import pytz

### Set up connection

Supply your connection details below to authenticate with the platform.

In [2]:
# connect to Arthur
# UNCOMMENT the two lines below and enter your details
arthur = ArthurAI(
    # url="https://app.arthur.ai",  # you can also pass this through the ARTHUR_ENDPOINT_URL environment variable
    # login="<YOUR_USERNAME_OR_EMAIL>",  # you can also pass this through the ARTHUR_LOGIN environment variable
)
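
The comments above mention that the URL and login can also come from the `ARTHUR_ENDPOINT_URL` and `ARTHUR_LOGIN` environment variables. As a minimal sketch, the settings could be collected explicitly before building the client. The helper name `connection_kwargs` is hypothetical and not part of the `arthurai` package.

```python
import os


def connection_kwargs(env=None):
    """Collect connection settings from environment variables.

    `connection_kwargs` is a hypothetical helper for illustration only;
    the variable names are the ones mentioned in the comments above.
    """
    env = os.environ if env is None else env
    kwargs = {}
    if env.get("ARTHUR_ENDPOINT_URL"):
        kwargs["url"] = env["ARTHUR_ENDPOINT_URL"]
    if env.get("ARTHUR_LOGIN"):
        kwargs["login"] = env["ARTHUR_LOGIN"]
    return kwargs


# Example with an explicit mapping instead of the real environment
example = connection_kwargs({"ARTHUR_ENDPOINT_URL": "https://app.arthur.ai"})
print(example)
```

The resulting dictionary could then be unpacked into the client, for example `ArthurAI(**connection_kwargs())`, so that unset variables are simply omitted.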

### Load Data

First we will load our training data, which we will use to help define the model schema.

In [3]:
df_raw = pd.read_csv('../datasets/processed_mtsamples.csv')
df_raw.head(5)

| | transcription | medical_specialty |
|---|---|---|
| 0 | 2-d m-mode : , ,1. left atrial enlargement lef... | cardiovascular / pulmonary |
| 1 | 1. left ventricular cavity size wall thickness... | cardiovascular / pulmonary |
| 2 | 2-d echocardiogram , multiple views heart grea... | cardiovascular / pulmonary |
| 3 | description : ,1. normal cardiac chambers size... | cardiovascular / pulmonary |
| 4 | 2-d study,1 . mild aortic stenosis , widely ca... | cardiovascular / pulmonary |


### Data Prep

After loading the data, we will format it for Arthur.

In [4]:
# load the pre-saved class list from the label encoder used to train the model
# classes are stored as a list whose index matches the classifier's output
with open('../classes.pkl', 'rb') as f:
    class_names = pickle.load(f).tolist()
class_names

['cardiovascular / pulmonary',
 'consult - history and phy.',
 'gastroenterology',
 'general medicine',
 'neurology',
 'obstetrics / gynecology',
 'orthopedic',
 'radiology',
 'soap / chart / progress notes',
 'urology']

Raw data can be messy. Our class names contain invalid characters, so we also need to format those. Arthur only accepts alphanumeric (and underscore) characters as valid attribute names.

In [5]:
# create function to format class names
def format_class_name(name):
    return name.replace('/', '').replace('-', '').replace(' ', '_').replace('.', '')

# create a dict mapping each raw class name to its formatted name
formatted_class_names = {raw_class: format_class_name(raw_class) for raw_class in class_names}

# example
print("Raw:", class_names[0])
print("Formatted:", format_class_name(class_names[0]))

Raw: cardiovascular / pulmonary
Formatted: cardiovascular__pulmonary
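
Because Arthur only accepts alphanumeric and underscore characters, it can help to verify the formatted names before using them as attribute names. The following is a minimal sketch of such a check using a regular expression; the `VALID_NAME` pattern is our own reading of the rule stated above, not something taken from the Arthur SDK.

```python
import re

# Assumption: "alphanumeric (and underscore)" means ASCII letters, digits, and "_"
VALID_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def format_class_name(name):
    # same replacements as the function defined above
    return name.replace('/', '').replace('-', '').replace(' ', '_').replace('.', '')


raw_names = [
    'cardiovascular / pulmonary',
    'consult - history and phy.',
    'soap / chart / progress notes',
    'urology',
]
formatted = [format_class_name(n) for n in raw_names]

# names that would still be rejected after formatting
invalid = [n for n in formatted if not VALID_NAME.match(n)]
# formatting must not collapse two different classes into one name
has_collisions = len(set(formatted)) != len(formatted)

print(formatted)
print("invalid:", invalid, "collisions:", has_collisions)
```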


Since this is a multi-class classification problem, we have to provide the predicted probability for each category as well as the ground truth label. Each prediction column is named after its formatted class name. The ground truth stays in the `medical_specialty` column, formatted the same way so that its values match the prediction names.

In [6]:
# load our pre-trained classifier to generate predictions
classifier = joblib.load('../model.pkl')

In [7]:
# create predictions
preds = classifier.predict_proba(df_raw['transcription'])
pred_rows = []
for pred in preds:
    pred_rows.append({
        class_name: pred[i]
        for i, class_name in enumerate(formatted_class_names.values())
    })
pred_df = pd.DataFrame(pred_rows)

# combine the raw data with the predictions
df = pd.concat([df_raw, pred_df], axis=1)

# format category classes
df['medical_specialty'] = df['medical_specialty'].apply(format_class_name)

df.head(5)

| | transcription | medical_specialty | cardiovascular__pulmonary | consult__history_and_phy | gastroenterology | general_medicine | neurology | obstetrics__gynecology | orthopedic | radiology | soap__chart__progress_notes | urology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2-d m-mode : , ,1. left atrial enlargement lef... | cardiovascular__pulmonary | 0.31457 | 0.047311 | 0.044943 | 0.050944 | 0.088034 | 0.032376 | 0.050487 | 0.274421 | 0.061108 | 0.035806 |
| 1 | 1. left ventricular cavity size wall thickness... | cardiovascular__pulmonary | 0.328527 | 0.046741 | 0.044155 | 0.048519 | 0.086364 | 0.030619 | 0.046095 | 0.279938 | 0.056335 | 0.032707 |
| 2 | 2-d echocardiogram , multiple views heart grea... | cardiovascular__pulmonary | 0.260582 | 0.064429 | 0.06383 | 0.069294 | 0.09844 | 0.042074 | 0.064015 | 0.215111 | 0.075557 | 0.046668 |
| 3 | description : ,1. normal cardiac chambers size... | cardiovascular__pulmonary | 0.280616 | 0.045469 | 0.051983 | 0.040549 | 0.101007 | 0.035458 | 0.056501 | 0.300329 | 0.05122 | 0.036869 |
| 4 | 2-d study,1 . mild aortic stenosis , widely ca... | cardiovascular__pulmonary | 0.276198 | 0.057141 | 0.047565 | 0.061456 | 0.103771 | 0.036767 | 0.060006 | 0.247587 | 0.067203 | 0.042306 |
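
Before registering the data, a quick sanity check on the prediction columns can catch formatting mistakes early. The sketch below uses two of the rows shown above (rounded as displayed) to confirm that the probabilities of each row sum to roughly 1 and to recover the predicted class. The helper names are hypothetical.

```python
import math

class_columns = [
    "cardiovascular__pulmonary", "consult__history_and_phy", "gastroenterology",
    "general_medicine", "neurology", "obstetrics__gynecology", "orthopedic",
    "radiology", "soap__chart__progress_notes", "urology",
]

# rows 0 and 3 of the table above, values as displayed
rows = [
    [0.31457, 0.047311, 0.044943, 0.050944, 0.088034,
     0.032376, 0.050487, 0.274421, 0.061108, 0.035806],
    [0.280616, 0.045469, 0.051983, 0.040549, 0.101007,
     0.035458, 0.056501, 0.300329, 0.05122, 0.036869],
]


def sums_to_one(probs, tol=1e-3):
    # displayed values are rounded, so allow a small tolerance
    return math.isclose(sum(probs), 1.0, abs_tol=tol)


def predicted_class(probs):
    # name of the column holding the largest probability
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_columns[best]


for probs in rows:
    print(sums_to_one(probs), predicted_class(probs))
```

Note that the second row's highest probability falls on a class other than its ground truth label, which is exactly the kind of disagreement the platform is meant to surface.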


### Create Model

We will instantiate a model object with a small amount of metadata about the model's input and output types. Then we will use a sample of the training data to help define the full schema for this NLP model.

NLP models require specifying a `text_delimiter` which specifies how a raw document is split into tokens.

In [8]:
model = arthur.model(
    partner_model_id=f"MedicalTranscriptClassifier_QS-{datetime.now().strftime('%Y%m%d%H%M%S')}",
    display_name="Medical Transcript Classifier",
    input_type=InputType.NLP,
    output_type=OutputType.Multiclass,
    text_delimiter=TextDelimiter.NOT_WORD
)
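
To build intuition for what a text delimiter does, the sketch below splits a document on runs of non-word characters. It assumes that `TextDelimiter.NOT_WORD` behaves like the regular expression `\W+`; that equivalence is an assumption here, so check the SDK reference before relying on it. The `tokenize` helper is hypothetical.

```python
import re

# Assumption: NOT_WORD corresponds to the regex \W+ (one or more non-word characters)
NOT_WORD_PATTERN = r"\W+"


def tokenize(document, pattern=NOT_WORD_PATTERN):
    # split on the delimiter and drop empty strings left at the edges
    return [token for token in re.split(pattern, document) if token]


sample = "left atrial enlargement, 4.7 cm."
tokens = tokenize(sample)
print(tokens)  # ['left', 'atrial', 'enlargement', '4', '7', 'cm']
```

Under this assumption, punctuation and whitespace both act as separators, so a value such as `4.7` is split into two tokens.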

Since this is a classification model, we need to supply a mapping that establishes:

* names for the model's predictions
* names for the model's ground truth
* the mapping that relates these two

In [9]:
# creating mapping from predictions to ground truth
output_mapping = {
    formatted_name: formatted_name
    for raw_name, formatted_name in formatted_class_names.items()
}
output_mapping

{'cardiovascular__pulmonary': 'cardiovascular__pulmonary',
 'consult__history_and_phy': 'consult__history_and_phy',
 'gastroenterology': 'gastroenterology',
 'general_medicine': 'general_medicine',
 'neurology': 'neurology',
 'obstetrics__gynecology': 'obstetrics__gynecology',
 'orthopedic': 'orthopedic',
 'radiology': 'radiology',
 'soap__chart__progress_notes': 'soap__chart__progress_notes',
 'urology': 'urology'}
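
A mapping like the one above is easy to get subtly wrong, for example when a class name is formatted in one place but not another. As a minimal sketch, a hypothetical helper `check_output_mapping` can compare the mapping against the prediction column names and the ground truth values actually present in the data.

```python
def check_output_mapping(mapping, prediction_columns, ground_truth_values):
    """Return a list of human-readable problems; an empty list means consistent.

    Hypothetical helper for illustration; not part of the arthurai package.
    """
    problems = []
    for pred_name, gt_name in mapping.items():
        if pred_name not in prediction_columns:
            problems.append(f"prediction column missing: {pred_name}")
        if gt_name not in ground_truth_values:
            problems.append(f"ground truth value never seen: {gt_name}")
    # ground truth values that no prediction maps to
    for gt_name in sorted(set(ground_truth_values) - set(mapping.values())):
        problems.append(f"ground truth value not mapped: {gt_name}")
    return problems


toy_mapping = {"neurology": "neurology", "urology": "urology"}

ok = check_output_mapping(
    toy_mapping, {"neurology", "urology"}, {"neurology", "urology"}
)
bad = check_output_mapping(
    toy_mapping, {"neurology"}, {"neurology", "urology", "radiology"}
)
print(ok)
print(bad)
```

With the notebook's variables, the equivalent call would be roughly `check_output_mapping(output_mapping, set(df.columns), set(df['medical_specialty']))`.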

We need to register the schema for the inputs and outputs of the model: what are the input variables? What will a typical prediction look like and what will a typical ground truth look like? What names, shapes, and datatypes should Arthur expect for these objects? Since your model might have hundreds or thousands of input features, you can just pass us a Pandas DataFrame of your training data, and we'll handle the rest.

For tracking data drift, you can upload a dataset to serve as the baseline or reference set. Often, this is a sample of your training data for the associated model. The `build()` method sets the given dataframe as the reference set. This way, Arthur can monitor for drift and stability of all given variables.

In [10]:
# Register model attributes with Arthur
model.build(df, ground_truth_column="medical_specialty", pred_to_ground_truth_map=output_mapping)

2022-07-21 17:34:22,034 - arthurai.core.models - INFO - Please review the inferred schema. If everything looks correct, lock in your model by calling arthur_model.save()


| | name | stage | value_type | categorical | is_unique | categories | bins | range | monitor_for_bias |
|---|---|---|---|---|---|---|---|---|---|
| 0 | transcription | PIPELINE_INPUT | UNSTRUCTURED_TEXT | True | False | [] | | [None, None] | False |
| 1 | medical_specialty | GROUND_TRUTH_CLASS | STRING | True | False | [{value: urology}, {value: gastroenterology}, ... | | [None, None] | False |
| 2 | cardiovascular__pulmonary | PREDICTED_VALUE | FLOAT | False | False | [] | | [0, 1] | False |
| 3 | consult__history_and_phy | PREDICTED_VALUE | FLOAT | False | False | [] | | [0, 1] | False |
| 4 | gastroenterology | PREDICTED_VALUE | FLOAT | False | False | [] | | [0, 1] | False |
| 5 | general_medicine | PREDICTED_VALUE | FLOAT | False | False | [] | | [0, 1] | False |
| 6 | neurology | PREDICTED_VALUE | FLOAT | False | False | [] | | [0, 1] | False |
| 7 | obstetrics__gynecology | PREDICTED_VALUE | FLOAT | False | False | [] | | [0, 1] | False |
| 8 | orthopedic | PREDICTED_VALUE | FLOAT | False | False | [] | | [0, 1] | False |
| 9 | radiology | PREDICTED_VALUE | FLOAT | False | False | [] | | [0, 1] | False |


In [11]:
model_id = model.save()
with open("quickstart_model_id.txt", "w") as f:
    f.write(model_id)

2022-07-21 17:34:29,626 - arthurai.core.data_service - INFO - Starting upload (1.950 MB in 1 files), depending on data size this may take a few minutes
2022-07-21 17:34:30,622 - arthurai.core.data_service - INFO - Upload completed: /var/folders/vg/dzh9m54s0vz7ws4f6gqmqn5m0000gn/T/tmpezhhzrbe/1c751e46-b10a-4103-9e3f-2f4b6c444f01-0.parquet


You can fetch a model by ID. For example, to pull the model we just created:

In [12]:
with open("quickstart_model_id.txt", "r") as f:
    model_id = f.read()
model = arthur.get_model(model_id)
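
When the ID is persisted to a file as above, stray whitespace or a missing file can lead to confusing lookup errors. The following sketch shows a slightly more defensive round trip using only the standard library. The helper names and the example ID are hypothetical.

```python
import tempfile
from pathlib import Path


def save_model_id(model_id, path):
    # store the ID as plain text
    Path(path).write_text(model_id)


def load_model_id(path):
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"no saved model ID at {path}")
    # strip trailing newlines or spaces that editors may add
    model_id = path.read_text().strip()
    if not model_id:
        raise ValueError(f"{path} is empty")
    return model_id


with tempfile.TemporaryDirectory() as tmp:
    id_file = Path(tmp) / "quickstart_model_id.txt"
    save_model_id("example-model-id\n", id_file)  # placeholder, not a real ID
    loaded = load_model_id(id_file)

print(loaded)
```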

### Enable Explainability

We want to be able to visualize why our model made the predictions it did. For that, we need to enable explainability.
For more details on enabling explainability, [see the docs](https://docs.arthur.ai/user-guide/explainability.html).
For this example, we have pre-created the Python file with a `predict()` function.

In [None]:
# define path to project directory
project_dir = os.getcwd()
project_dir = project_dir.replace('nlp_medical_transcript_classifier/notebooks', 'nlp_medical_transcript_classifier/')

# enable explainability
model.enable_explainability(
    df=df,
    project_directory=project_dir,
    streaming_explainability_enabled=True,
    requirements_file="requirements.txt",
    user_predict_function_import_path="entrypoint",
    explanation_nsamples=1000,
    explanation_algo='lime'
)
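
The `user_predict_function_import_path="entrypoint"` argument points at the pre-created file mentioned above, which is not reproduced in this guide. The sketch below only illustrates the general shape such a file might take: a module-level model and a `predict()` function that returns one row of class probabilities per input text. The stand-in classifier and the input/output contract shown here are assumptions; the linked explainability docs describe what Arthur actually expects.

```python
# entrypoint.py (illustrative sketch, not the file shipped with this example)


class _StandInClassifier:
    """Placeholder for the real pre-trained pipeline.

    Returns uniform probabilities so the sketch runs without model.pkl.
    """

    n_classes = 10

    def predict_proba(self, texts):
        return [[1.0 / self.n_classes] * self.n_classes for _ in texts]


# Assumption: the real file would load the saved pipeline here instead,
# for example with joblib as done earlier in this guide.
model = _StandInClassifier()


def predict(texts):
    # one list of class probabilities per input document
    return model.predict_proba(texts)


probabilities = predict(["normal cardiac chambers", "mild aortic stenosis"])
print(len(probabilities), len(probabilities[0]))
```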

### Send Inference

Now we will walk through sending inferences to Arthur.

In [16]:
df_inferences = df_raw.copy()

# build one inference record per row, predicting on a single transcript at a time
inferences = []
n_inferences = 200
for i in range(n_inferences):
    
    if i >= 50 and i % 50 == 0:
        print(f"processing record {i}/{n_inferences}")

    record = df_inferences.iloc[i:i+1]['transcription']
    pred = classifier.predict_proba(record)[0]
    actual = format_class_name(df_inferences.iloc[i]['medical_specialty'])

    inference_data = {
        class_name: pred[j]  # j indexes classes; avoids reusing the row index i
        for j, class_name in enumerate(formatted_class_names.values())
    }
    
    inference_data['transcription'] = record.iloc[0]
    external_id = f"{uuid.uuid4()}-i"
    cur_time = datetime.now(pytz.utc) - timedelta(days = i%10)
        
    inference = {
        "partner_inference_id": external_id,
        "inference_timestamp": cur_time,
        "inference_data": inference_data,
        "ground_truth_timestamp": cur_time,
        "medical_specialty": actual
    }

    inferences.append(inference)

processing record 50/200
processing record 100/200
processing record 150/200
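
The expression `timedelta(days = i%10)` in the loop above backdates each record by between 0 and 9 days, so the inferences are spread over the last ten days rather than all landing at the same moment. The sketch below reproduces just that part with the standard library (using `datetime.timezone.utc` in place of `pytz.utc`) and counts how many records fall on each day offset.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

n_inferences = 200
now = datetime.now(timezone.utc)

# same backdating rule as in the loop above
timestamps = [now - timedelta(days=i % 10) for i in range(n_inferences)]

# number of records for each "days ago" offset
per_day_offset = Counter((now - ts).days for ts in timestamps)

for offset in sorted(per_day_offset):
    print(f"{offset} days ago: {per_day_offset[offset]} records")
```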


In [17]:
# inspect an inference to see the final format
inferences[0]

{'partner_inference_id': '444dc7ff-ad47-4109-939b-b5d6df36217f-i',
 'inference_timestamp': datetime.datetime(2022, 7, 21, 21, 36, 37, 47146, tzinfo=<UTC>),
 'inference_data': {'cardiovascular__pulmonary': 0.3145696685561315,
  'consult__history_and_phy': 0.047311432165361364,
  'gastroenterology': 0.04494273151561295,
  'general_medicine': 0.05094425837515593,
  'neurology': 0.08803377527442867,
  'obstetrics__gynecology': 0.03237626252754842,
  'orthopedic': 0.05048688882466476,
  'radiology': 0.2744210720668056,
  'soap__chart__progress_notes': 0.06110806764999628,
  'urology': 0.03580584304429395,
  'transcription': '2-d m-mode : , ,1. left atrial enlargement left atrial diameter 4.7 cm.,2 . normal size right left ventricle.,3 . normal lv systolic function left ventricular ejection fraction 51 % .,4 . normal lv diastolic function.,5 . pericardial effusion.,6 . normal morphology aortic valve , mitral valve , tricuspid valve , pulmonary valve.,7 . pa systolic pressure 36 mmhg. , doppl

In [None]:
model.send_inferences(inferences)
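
Before sending a batch, it can be useful to check that every record has the structure built above. The following is a minimal sketch of such a check. The key list mirrors the records constructed in this guide, and the timezone check mirrors the UTC timestamps used there; which fields the platform strictly requires is an assumption, and `find_problems` is a hypothetical helper.

```python
from datetime import datetime, timezone

# keys used by the records built in this guide
REQUIRED_KEYS = ("partner_inference_id", "inference_timestamp", "inference_data")


def find_problems(record):
    """Return a list of problems for one inference record (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    timestamp = record.get("inference_timestamp")
    if isinstance(timestamp, datetime) and timestamp.tzinfo is None:
        problems.append("inference_timestamp is not timezone-aware")
    if "inference_data" in record and not record["inference_data"]:
        problems.append("inference_data is empty")
    return problems


good_record = {
    "partner_inference_id": "example-id",
    "inference_timestamp": datetime(2022, 7, 21, tzinfo=timezone.utc),
    "inference_data": {"urology": 1.0, "transcription": "example text"},
}
naive_record = dict(good_record, inference_timestamp=datetime(2022, 7, 21))

print(find_problems(good_record))
print(find_problems(naive_record))
print(find_problems({}))
```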