# Dataset: mtsamples.csv (data scraped from MTSamples)
https://www.kaggle.com/datasets/tboyle10/medicaltranscriptions

We will use the text in the “transcription” column of mtsamples.csv.

# NLP Libraries: spaCy & scispaCy
Install the spaCy and scispaCy packages.

spaCy models are designed to perform specific NLP tasks, such as tokenization, part-of-speech tagging, and named entity recognition.

scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text.

# spaCy for NER:
spaCy is widely used for Named Entity Recognition (NER), a task in Natural Language Processing (NLP) that identifies and classifies entities in a text into predefined categories, such as names of people, organizations, locations, dates, and more.

In [None]:
!pip install -U spacy
!pip install scispacy



# Install scispaCy base models and NER models
Available pre-trained models in scispaCy:

# en_core_sci_sm:
Size: Small
Description: This model is lightweight and suitable for basic tasks. It includes functionality for recognizing biomedical entities but may not be as comprehensive as larger models.

# en_core_sci_md:
Size: Medium
Description: This model has a larger vocabulary than the small model and adds 50k word vectors, which generally gives better accuracy. It is a good choice for general scientific text processing.

# en_core_sci_lg:
Size: Large
Description: This is the largest of the three core models, with the largest vocabulary and 600k word vectors. It is suitable for more complex applications where high accuracy is essential.

# en_ner_bc5cdr_md-0.5.1:
The scispaCy pre-trained NER model en_ner_bc5cdr_md-0.5.1 extracts diseases and drugs. Drugs are labeled as chemicals (CHEMICAL).

“bc5cdr” refers to the BC5CDR corpus, the biomedical text corpus used to train the model.
The “md” in the name is the model size (medium), as in en_core_sci_md. The “0.5.1” is the version of the model package; the install cell below uses 0.5.4 to match the pinned scispaCy version.
The model recognizes two entity types in biomedical text: DISEASE and CHEMICAL. It does not tag genes.

This model can be useful for NLP tasks in the biomedical domain, such as information extraction, text classification, and question answering.
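To make the naming convention concrete, here is a minimal sketch that splits a model package name into its parts. The regular expression and the `describe_model` helper are illustrative assumptions, not part of spaCy or scispaCy, and they only cover names of the form used in this notebook.

```python
import re

# Assumed naming scheme: <lang>_<kind>_<source>_<size>[-<version>]
# (hypothetical helper; covers only the model names used in this notebook)
MODEL_NAME = re.compile(
    r"^(?P<lang>[a-z]{2})_(?P<kind>core|ner)_(?P<source>[a-z0-9]+)_(?P<size>sm|md|lg)"
    r"(?:-(?P<version>\d+\.\d+\.\d+))?$"
)
SIZES = {"sm": "small", "md": "medium", "lg": "large"}


def describe_model(name: str) -> dict:
    """Split a model package name into language, kind, source, size and version."""
    match = MODEL_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name}")
    parts = match.groupdict()
    parts["size"] = SIZES[parts["size"]]  # expand the size abbreviation
    return parts


for name in ("en_core_sci_sm", "en_core_sci_md", "en_ner_bc5cdr_md-0.5.4"):
    print(name, "->", describe_model(name))
```

The version is optional in the pattern because the importable package name (for example `en_ner_bc5cdr_md`) omits it, while the downloaded archive name includes it.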

In [None]:
#  Downgrade to the matching spaCy + scispaCy versions
!pip install -U spacy==3.7.4 scispacy==0.5.4

# install your downloaded model
!pip install /content/en_core_sci_sm-0.5.4.tar.gz
!pip install /content/en_core_sci_md-0.5.4.tar.gz
!pip install /content/en_ner_bc5cdr_md-0.5.4.tar.gz


Collecting spacy==3.7.4
  Using cached spacy-3.7.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting scispacy==0.5.4
  Using cached scispacy-0.5.4-py3-none-any.whl.metadata (16 kB)
Collecting thinc<8.3.0,>=8.2.2 (from spacy==3.7.4)
  Using cached thinc-8.2.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting weasel<0.4.0,>=0.1.0 (from spacy==3.7.4)
  Using cached weasel-0.3.4-py3-none-any.whl.metadata (4.7 kB)
Collecting typer<0.10.0,>=0.3.0 (from spacy==3.7.4)
  Using cached typer-0.9.4-py3-none-any.whl.metadata (14 kB)
Collecting smart-open<7.0.0,>=5.2.1 (from spacy==3.7.4)
  Using cached smart_open-6.4.0-py3-none-any.whl.metadata (21 kB)
Collecting scipy<1.11 (from scispacy==0.5.4)
  Using cached scipy-1.9.3.tar.gz (42.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  error: [

Render-python is a client-side library that can assist you in managing and setting those transformations, and in performing
some calculations using the renderapi.transform module. We have focused our initial efforts on supporting the
most commonly used types of transformations.
Some transformation types presently support tform and inverse_tform methods for calculating where numpy array
sets of points map to and from these transformations. Some presently support estimate methods which, given a set of
source and destination points, allow the estimation of a best-fit transformation.

In [None]:
pip install render



In [None]:
import scispacy
import spacy
#Core models
import en_core_sci_sm
import en_core_sci_md
#NER specific models
import en_ner_bc5cdr_md # extracting disease and drugs
#Tools for extracting & displaying data
from spacy import displacy
import pandas as pd

Test the models with sample data

In [None]:
mtsample_df=pd.read_csv('/content/mtsamples.csv')
mtsample_df.head()

Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."



In [None]:
# Pick a specific transcription to use (row 10, column "transcription") and test the scispaCy NER models
text = mtsample_df.loc[10, "transcription"]

Load a specific model, en_core_sci_sm, and pass the text through it.
en_core_sci_sm is a full spaCy pipeline for biomedical data.

The function displacy.render(), called with style="ent", highlights the entities predicted by the pipeline in the text and shows each entity's label.

spacy.load will return a Language object containing all components and data needed to process text. This object is usually called nlp in the documentation and tutorials. Calling the nlp object on a string of text will return a processed Doc object with the text split into words and annotated.

In [None]:

nlp_sm = spacy.load("en_core_sci_sm")
doc = nlp_sm(text)
# Display the resulting entity extraction
displacy_image = displacy.render(doc, jupyter=True, style='ent')



Note that entities are tagged here, and most of them are medical terms. However, they all carry a generic label rather than a specific entity type.



Now load the en_core_sci_md model and pass the text through it.

A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors

In [None]:
nlp_md = en_core_sci_md.load()
doc = nlp_md(text)
#Display resulting entity extraction
displacy_image = displacy.render(doc, jupyter=True,style='ent')

This time the numbers are also tagged as entities by en_core_sci_md.



Now load the en_ner_bc5cdr_md model and pass the text through it.

A spaCy NER model trained on the BC5CDR corpus (a corpus here is a collection of annotated texts).

The BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions.


The model is used to tag two entity types: DISEASE and CHEMICAL.

In [None]:
nlp_bcc = en_ner_bc5cdr_md.load()
doc = nlp_bcc(text)
#Display resulting entity extraction
displacy_image = displacy.render(doc, jupyter=True,style='ent')

Now two medical entity types are tagged: disease and chemical (drugs).

Display the entities

In [None]:
doc = nlp_bcc(text)
print(doc.ents)
print("TEXT", "START", "END", "ENTITY TYPE")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

(obesity, obesity, Laparoscopic Roux-en-Y gastric bypass, overweight, weight loss, obesity, Marcaine)
TEXT START END ENTITY TYPE
obesity 33 40 DISEASE
obesity 77 84 DISEASE
Laparoscopic Roux-en-Y gastric bypass 100 137 DISEASE
overweight 341 351 DISEASE
weight loss 400 411 DISEASE
obesity 496 503 DISEASE
Marcaine 1256 1264 CHEMICAL
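As a small follow-up, the entity rows printed above can be tallied by label. The sketch below copies those rows into a plain list so it runs without the model; in the notebook the same tuples could be built from `doc.ents`, for example with `[(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]`. The variable names are illustrative.

```python
from collections import Counter

# Rows copied from the printed output above: (text, start_char, end_char, label)
entities = [
    ("obesity", 33, 40, "DISEASE"),
    ("obesity", 77, 84, "DISEASE"),
    ("Laparoscopic Roux-en-Y gastric bypass", 100, 137, "DISEASE"),
    ("overweight", 341, 351, "DISEASE"),
    ("weight loss", 400, 411, "DISEASE"),
    ("obesity", 496, 503, "DISEASE"),
    ("Marcaine", 1256, 1264, "CHEMICAL"),
]

# Number of mentions per entity label
label_counts = Counter(label for *_, label in entities)

# Distinct (case-insensitive) surface forms per label
unique_by_label = {}
for text, _, _, label in entities:
    unique_by_label.setdefault(label, set()).add(text.lower())

print(label_counts)
for label, texts in unique_by_label.items():
    print(label, sorted(texts))
```

Listing the distinct surface forms also makes questionable labels easier to spot, such as the surgical procedure "Laparoscopic Roux-en-Y gastric bypass" tagged as DISEASE.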


Process the clinical text by dropping rows with NaN transcriptions and drawing a smaller random sample for the custom entity model.

In [None]:
mtsample_df.dropna(subset=['transcription'], inplace=True)
# replace=False (the default) disallows sampling the same row more than once
mtsample_df_subset = mtsample_df.sample(n=100, replace=False, random_state=42)
mtsample_df_subset.info()
mtsample_df_subset.head()

<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 3162 to 3581
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         100 non-null    int64 
 1   description        100 non-null    object
 2   medical_specialty  100 non-null    object
 3   sample_name        100 non-null    object
 4   transcription      100 non-null    object
 5   keywords           78 non-null     object
dtypes: int64(1), object(5)
memory usage: 5.5+ KB


Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
3162,3162,Markedly elevated PT INR despite stopping Cou...,Hematology - Oncology,Hematology Consult - 1,"HISTORY OF PRESENT ILLNESS:, The patient is w...",
1981,1981,Intercostal block from fourth to tenth interc...,Pain Management,Intercostal block - 1,"PREPROCEDURE DIAGNOSIS:, Chest pain secondary...","pain management, xylocaine, marcaine, intercos..."
1361,1361,The patient is a 65-year-old female who under...,SOAP / Chart / Progress Notes,Lobectomy - Followup,"HISTORY OF PRESENT ILLNESS: , The patient is a...","soap / chart / progress notes, non-small cell ..."
3008,3008,Construction of right upper arm hemodialysis ...,Nephrology,Hemodialysis Fistula Construction,"PREOPERATIVE DIAGNOSIS: , End-stage renal dise...","nephrology, end-stage renal disease, av dialys..."
4943,4943,Bronchoscopy with brush biopsies. Persistent...,Cardiovascular / Pulmonary,Bronchoscopy - 8,"PREOPERATIVE DIAGNOSIS: , Persistent pneumonia...","cardiovascular / pulmonary, persistent pneumon..."


spaCy Matcher: rule-based matching resembles the use of regular expressions, but spaCy provides additional capabilities. Patterns are written over tokens and their attributes, including the entity types assigned by an NER model, so a pattern can combine predicted entities with surface features of neighboring tokens.

The goal is to locate drug names and their dosages in the text, which could help detect medication errors by comparing them with standards and guidelines.

In [None]:
from spacy.matcher import Matcher
pattern = [{'ENT_TYPE':'CHEMICAL'}, {'LIKE_NUM': True}, {'IS_ASCII': True}]
matcher = Matcher(nlp_bcc.vocab)
matcher.add("DRUG_DOSE", [pattern])

The code above creates a pattern to identify a sequence of three tokens:

A token whose entity type is CHEMICAL (drug name)

A token that resembles a number (dosage)

A token that consists of ASCII characters (units, like mg or mL)

Then we initialize the Matcher with a vocabulary. The matcher must always share the same vocab with the documents it will operate on, so we use the nlp_bcc object vocab. We then add this pattern to the matcher and give it an ID.
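The three-token pattern can be tried in isolation with the sketch below. It assumes spaCy is installed but uses a blank English pipeline (tokenizer only) and labels the drug token by hand, standing in for what en_ner_bc5cdr_md would predict. The sentence and the hand-set entity are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no trained NER component
nlp = spacy.blank("en")
doc = nlp("She takes aspirin 81 mg daily")

# Hand-label token 2 ("aspirin") as CHEMICAL, standing in for the NER model
doc.ents = [Span(doc, 2, 3, label="CHEMICAL")]

# Same pattern as in the notebook: CHEMICAL token, number-like token, ASCII token
pattern = [{"ENT_TYPE": "CHEMICAL"}, {"LIKE_NUM": True}, {"IS_ASCII": True}]
matcher = Matcher(nlp.vocab)  # the matcher shares the vocab of the docs it runs on
matcher.add("DRUG_DOSE", [pattern])

# start and end are token indices into the doc, not character offsets
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

Note that IS_ASCII is a loose constraint: it accepts any ASCII token, including punctuation, so the third slot is not limited to dose units.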

In [None]:
for transcription in mtsample_df_subset['transcription']:
    doc = nlp_bcc(transcription)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp_bcc.vocab.strings[match_id]  # get string representation
        span = doc[start:end]  # the matched span: drug name, number, following token (start/end are token indices)
        print(span.text, start, end, string_id)

Xylocaine 20 mL 129 132 DRUG_DOSE
Marcaine 0.25% 133 136 DRUG_DOSE
Aspirin 81 mg 204 207 DRUG_DOSE
Spiriva 10 mcg 212 215 DRUG_DOSE
nifedipine 10 mg 376 379 DRUG_DOSE
aspirin one tablet 220 223 DRUG_DOSE
q. three months 227 230 DRUG_DOSE
Warfarin 2.5 mg 239 242 DRUG_DOSE
Topamax 100 mg 57 60 DRUG_DOSE
Zoloft 100 mg 63 66 DRUG_DOSE
Abilify 5 mg 69 72 DRUG_DOSE
Motrin 800 mg 74 77 DRUG_DOSE
Xanax 1 mg 76 79 DRUG_DOSE
Paxil 10 mg 120 123 DRUG_DOSE
Prednisone 20 mg 125 128 DRUG_DOSE
Nexium 40 mg 149 152 DRUG_DOSE
Naprosyn one p.o 1109 1112 DRUG_DOSE
Lidocaine 1% 260 263 DRUG_DOSE
lidocaine 2% 221 224 DRUG_DOSE
Creatinine 1.3, 91 94 DRUG_DOSE
sodium 141, 94 97 DRUG_DOSE
potassium 4.0. 98 101 DRUG_DOSE
Calcium 8.6. 102 105 DRUG_DOSE
code 21470, 63 66 DRUG_DOSE
7-hole 2.3 titanium 607 610 DRUG_DOSE
Norvasc 10 mg 427 430 DRUG_DOSE
aspirin 81 mg 434 437 DRUG_DOSE
Klonopin 0.5 mg 448 451 DRUG_DOSE
digoxin 0.125 mg 456 459 DRUG_DOSE
Lexapro 10 mg 463 466 DRUG_DOSE
TriCor 145 mg 470 473 DRUG_DOSE


In [None]:
# Repeat the loop and, for each match, also print the document's DISEASE and CHEMICAL entities.
# Note: the entity loop sits inside the match loop, so a document's entities are printed once per match.
for transcription in mtsample_df_subset['transcription']:
    doc = nlp_bcc(transcription)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp_bcc.vocab.strings[match_id]  # get string representation
        span = doc[start:end]  # the matched span (start/end are token indices)
        print(span.text, start, end, string_id)
        # Print every disease and chemical entity found in this document
        for ent in doc.ents:
            print(ent.text, ent.start_char, ent.end_char, ent.label_)

Xylocaine 20 mL 129 132 DRUG_DOSE
PREPROCEDURE 0 12 DISEASE
Chest pain 26 36 DISEASE
Chest pain 122 132 DISEASE
chest pain 388 398 DISEASE
Xylocaine 730 739 CHEMICAL
Marcaine 750 758 CHEMICAL
contusion 987 996 DISEASE
respiratory distress 1076 1096 DISEASE
pain 1150 1154 DISEASE
Marcaine 0.25% 133 136 DRUG_DOSE
PREPROCEDURE 0 12 DISEASE
Chest pain 26 36 DISEASE
Chest pain 122 132 DISEASE
chest pain 388 398 DISEASE
Xylocaine 730 739 CHEMICAL
Marcaine 750 758 CHEMICAL
contusion 987 996 DISEASE
respiratory distress 1076 1096 DISEASE
pain 1150 1154 DISEASE
Aspirin 81 mg 204 207 DRUG_DOSE
non-small cell lung cancer 114 140 DISEASE
barium 322 328 CHEMICAL
hernia 380 386 DISEASE
odynophagia 647 658 DISEASE
thoracic dysphagia 667 685 DISEASE
tenderness 829 839 DISEASE
DVT 918 921 DISEASE
weight loss 952 963 DISEASE
anorexia 965 973 DISEASE
chills 983 989 DISEASE
headaches 991 1000 DISEASE
aches 1006 1011 DISEASE
pains 1015 1020 DISEASE
cough 1022 1027 DISEASE
hemoptysis 1029 1039 DISEASE
short

Now we can see that the pipeline extracted disease, drug (chemical), and drug-dose information from the clinical text.

There is some misclassification: for example, lab values such as "Creatinine 1.3," are matched as drug doses, and the heading word "PREPROCEDURE" is tagged as a disease. Training on more annotated data may improve the model's performance.

We can now use these medical entities in various tasks such as disease detection, predictive analysis, clinical decision support systems, medical text classification, summarization, question answering, and many more.