# Using Mistral-7B-Instruct-v0.2 to process medical transcripts
Based on https://www.kaggle.com/code/malenaprezsevilla/introduction-to-prompt-engineering-using-mistral

## Model setup
4-bit quantization and Accelerate (`device_map="auto"`) reduce the memory footprint enough for inference to run even on a laptop.
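To see why quantization matters here, a back-of-the-envelope estimate of the memory needed for the weights alone can help. The sketch below assumes roughly 7.24 billion parameters (an approximation, not a figure from this notebook) and ignores activations, the KV cache, and the small overhead of the quantization constants; `weight_memory_gib` is a hypothetical helper name.

```python
# ASSUMPTION: ~7.24e9 parameters; only the weights are counted.
N_PARAMS = 7.24e9

def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

for label, bits in [("float32", 32), ("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the weights shrink from about 13.5 GiB in float16 to about 3.4 GiB at 4 bits, which is what brings the model within reach of a single consumer GPU.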

In [1]:
# General packages
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from textwrap import fill
from IPython.display import Markdown, display # for formatting notebook output as Markdown
import warnings
warnings.filterwarnings('ignore') # avoid warning messages importing packages

# Mistral and LangChain packages (prompt engineering)
import torch
from langchain import PromptTemplate, HuggingFacePipeline
from langchain.output_parsers import ResponseSchema, StructuredOutputParser
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline



In [2]:
# Model version of Mistral
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"

# Quantization reduces the memory and computation requirements
# of deep learning models by storing weights with fewer bits (here, 4 bits)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Initialization of a tokenizer for the Mistral-7b model, 
# necessary to preprocess text data for input
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Load the pre-trained Mistral-7B language model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=quantization_config
)

# Generation settings
generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
generation_config.max_new_tokens = 1024 # maximum number of new tokens the model may generate
generation_config.temperature = 0.7 # sampling temperature: lower values make the output less random
generation_config.top_p = 0.95 # nucleus sampling: sample only from the most probable tokens whose cumulative probability reaches 0.95
generation_config.do_sample = True # sample from the distribution instead of decoding greedily
generation_config.repetition_penalty = 1.15 # values above 1 discourage repeating tokens that have already appeared

# A pipeline is an object that works as an API for calling the model.
# It bundles (1) the tokenizer instance, (2) the model instance, and
# (3) some post-processing settings. Here, it's configured to return full-text outputs
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    generation_config=generation_config,
)

Loading checkpoint shards: 100%|██████████| 3/3 [01:04<00:00, 21.63s/it]
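The `temperature` and `top_p` settings above interact, so a small standalone illustration may be useful. This is a conceptual sketch in pure Python, not the `transformers` implementation (which differs in details and also applies the repetition penalty); `sample_top_p` is a hypothetical helper.

```python
import math
import random

def sample_top_p(logits, temperature=0.7, top_p=0.95, rng=random):
    """Sample one token index using temperature scaling and nucleus (top-p) filtering."""
    # Temperature: divide logits before the softmax; values below 1 sharpen the distribution.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus: keep the smallest set of most-probable tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw from the kept tokens; random.choices renormalises the weights itself.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

print(sample_top_p([2.0, 1.0, 0.1]))
```

Lowering `temperature` or `top_p` concentrates sampling on the most likely tokens, which is usually preferable for extraction tasks where consistency matters more than variety.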


In [3]:
# HuggingFace pipeline
llm = HuggingFacePipeline(pipeline=pipe)

### Test the model
Instruction format: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2#instruction-format
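As a rough illustration of that instruction format, the sketch below assembles a multi-turn prompt from (user, assistant) pairs. `build_mistral_prompt` is a hypothetical helper, and the exact whitespace around the tags is an assumption based on the example in the model card; the tokenizer's chat template remains the authoritative source. The beginning-of-sequence token is left out by default on the assumption that the tokenizer adds it.

```python
def build_mistral_prompt(turns, add_bos=False):
    """Build an instruct prompt from (user, assistant) pairs.

    Use None as the assistant reply for the final turn that the model should answer.
    """
    parts = ["<s>"] if add_bos else []
    for user, assistant in turns:
        # Each user message is wrapped in [INST] ... [/INST]
        parts.append(f"[INST] {user} [/INST]")
        if assistant is not None:
            # Completed assistant replies are closed with the end-of-sequence token
            parts.append(f" {assistant}</s>")
    return "".join(parts)

prompt = build_mistral_prompt([
    ("What is a PSA test?", "A blood test that measures prostate-specific antigen."),
    ("What does an elevated value suggest?", None),
])
print(prompt)
```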

In [28]:
def generate(model, text, template=None, format_instructions=None):
    """Fill the prompt template with the text and return the model's stripped response."""
    if template is None:
        template = "[INST]{text}[/INST]"

    prompt = PromptTemplate.from_template(template)

    response = model(prompt.format(text=text, format_instructions=format_instructions))
    return response.strip()

def generate_and_display(model, text, template=None, format_instructions=None):
    """Generate a response and render it as Markdown in the notebook."""
    result = generate(model, text, template, format_instructions)

    # No point displaying a templated prompt, this is just a convenience for simple prompts
    if template is None:
        display(Markdown(f"<b>{text}</b>"))

    display(Markdown(f"{result}"))

generate_and_display(llm, "Explain the fundamentals of ChatGPT in a couple of lines.")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<b>Explain the fundamentals of ChatGPT in a couple of lines.</b>

ChatGPT is a type of large language model developed by OpenAI and currently operated by Microsoft. It uses deep learning techniques to understand input text, generate human-like responses, and maintain a conversation with users. The model's ability to comprehend context and provide relevant information makes it suitable for various applications like customer service, education, and entertainment. However, keep in mind that while advanced, it doesn't have access to personal data or real-world knowledge unless explicitly provided during interaction.

### Medical transcripts dataset
Source: https://www.kaggle.com/datasets/tboyle10/medicaltranscriptions

In [15]:
df = pd.read_csv('../data/medical-transcripts/mtsamples.csv', index_col=0)
df.head(5)

| | description | medical_specialty | sample_name | transcription | keywords |
|---|---|---|---|---|---|
| 0 | A 23-year-old white female presents with comp... | Allergy / Immunology | Allergic Rhinitis | SUBJECTIVE:, This 23-year-old white female pr... | allergy / immunology, allergic rhinitis, aller... |
| 1 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 2 | PAST MEDICAL HISTORY:, He has difficulty climb... | bariatrics, laparoscopic gastric bypass, weigh... |
| 2 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 1 | HISTORY OF PRESENT ILLNESS: , I have seen ABC ... | bariatrics, laparoscopic gastric bypass, heart... |
| 3 | 2-D M-Mode. Doppler. | Cardiovascular / Pulmonary | 2-D Echocardiogram - 1 | 2-D M-MODE: , ,1. Left atrial enlargement wit... | cardiovascular / pulmonary, 2-d m-mode, dopple... |
| 4 | 2-D Echocardiogram | Cardiovascular / Pulmonary | 2-D Echocardiogram - 2 | 1. The left ventricular cavity size and wall ... | cardiovascular / pulmonary, 2-d, doppler, echo... |


In [16]:
df.describe()

| | description | medical_specialty | sample_name | transcription | keywords |
|---|---|---|---|---|---|
| count | 4999 | 4999 | 4999 | 4966 | 3931.0 |
| unique | 2348 | 40 | 2377 | 2357 | 3849.0 |
| top | An example/template for a routine normal male... | Surgery | Lumbar Discogram | PREOPERATIVE DIAGNOSIS: , Low back pain.,POSTO... | |
| freq | 12 | 1103 | 5 | 5 | 81.0 |


In [17]:
# Read and show the transcription of one record
def show_transcription(df, i, width=50):
    """Print the record number and return its transcription as a Markdown object.

    `width` is currently unused; it is kept so existing calls keep working.
    """
    print(f'Record {i}\n')
    return Markdown(f"<p>{df.loc[i, 'transcription']}</p>")

In [18]:
texto=df.loc[60,'transcription']
print(fill(texto))

PHYSICAL EXAMINATION: , The patient is a 63-year-old executive who was
seen by his physician for a company physical.  He stated that he was
in excellent health and led an active life.  His physical examination
was normal for a man of his age.  Chest x-ray and chemical screening
blood work were within normal limits.  His PSA was
elevated.,IMAGING:,Chest x-ray:  Normal.,CT scan of abdomen and
pelvis:  No abnormalities.,LABORATORY:,  PSA 14.6.,PROCEDURES: ,
Ultrasound guided sextant biopsy of prostate:  Digital rectal exam
performed at the time of the biopsy showed a 1+ enlarged prostate with
normal seminal vesicles.,PATHOLOGY:  ,Prostate biopsy:  Left apex:
adenocarcinoma, moderately differentiated, Gleason's score 3 + 4 =
7/10.  Maximum linear extent in apex of tumor was 6 mm.  Left mid
region prostate:  moderately differentiated adenocarcinoma, Gleason's
3 + 2 = 5/10.  Left base, right apex, and right mid-region and right
base:  negative for carcinoma.,TREATMENT:,  The patient opted fo

### Structured data extraction

In [41]:
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import List, Optional
from enum import Enum

class GenderEnum(Enum):
    Male = 'Male'
    Female = 'Female'

class ImagingSchema(BaseModel):
    type: str = Field(..., description="The type of medical imaging performed")
    findings: str = Field(..., description="The findings of the medical imaging performed")

class PatientSchema(BaseModel):
    age: Optional[int] = Field(..., description="Patient's age")
    gender: Optional[GenderEnum] = Field(..., description="Patient's gender")
    imaging: List[ImagingSchema] = Field(..., description="The list of medical imaging examinations performed")

# Instance of the parser
output_parser = PydanticOutputParser(pydantic_object=PatientSchema)

# This is what will be added to the prompt in order to get the model to respond in JSON
print(output_parser.get_format_instructions())

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"$defs": {"GenderEnum": {"enum": ["Male", "Female"], "title": "GenderEnum", "type": "string"}, "ImagingSchema": {"properties": {"type": {"description": "The type of medical imaging performed", "title": "Type", "type": "string"}, "findings": {"description": "The findings of the medical imaging performed", "title": "Findings", "type": "string"}}, "required": ["type", "findings"], "title": "ImagingSchema", "type": "object"}}, "properties": {"age": {"anyOf": [{"type": "integer"}, {"type": "null"}], "description": "Patient's age", "title": "Age"},

In [44]:
template = """[INST]
You are a medical expert reading through transcripts. Your expertise spans the whole medical domain.

Your job is to extract structured information from the transcript.

This is the transcript: ```{text}```

{format_instructions}

Your response JSON should only include the fields defined in the schema above. Ignore all other information from the transcript.

Respond only with valid JSON, add no further comments to the response.
[/INST]
"""

generate_and_display(llm, texto, template, output_parser.get_format_instructions())

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


```json
{
  "age": 63,
  "gender": "Male",
  "imaging": [
    {
      "type": "chest x-ray",
      "findings": "Normal"
    },
    {
      "type": "ct scan of abdomen and pelvis",
      "findings": "No abnormalities"
    }
  ]
}
```
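The response above arrives wrapped in a Markdown code fence, so it has to be unwrapped before it can be used as data. In this notebook the natural next step would probably be `output_parser.parse(...)`, which validates the result against `PatientSchema`. The sketch below shows the same idea with only the standard library; `extract_json` is a hypothetical helper and the sample string is a shortened copy of the output above.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example can sit inside a fenced block
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def extract_json(response: str) -> dict:
    """Parse the JSON object in a model response, with or without a Markdown fence."""
    match = FENCED_JSON.search(response)
    payload = match.group(1) if match else response
    return json.loads(payload.strip())

sample = (
    FENCE + "json\n"
    '{"age": 63, "gender": "Male", '
    '"imaging": [{"type": "chest x-ray", "findings": "Normal"}]}\n'
    + FENCE
)

record = extract_json(sample)
print(record["age"], record["gender"], len(record["imaging"]))
```

`json.loads` raises `json.JSONDecodeError` on malformed output, which is a useful signal to retry the generation or to log the record for manual review.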