## **Medical Q&A with sentence transformers**


In this notebook, we build a question-answering pipeline that answers user queries by matching them against medical documents with sentence transformers.

### Install necessary libraries

In [1]:
pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp37-cp37m-win_amd64.whl (977 kB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting fsspec
  Downloading fsspec-2023.1.0-py3-none-any.whl (143 kB)
Collecting safetensors>=0.3.1
  Downloading safetensors-0.3.3-cp37-cp37m-win_amd64.whl (266 kB)
Collecting regex!=2019.12.17
  Downloading regex-2023.8.8-cp37-cp37m-win_amd64.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp37-cp37m-win_amd64.whl (3.5 MB)
Collecting click
  Downloading click-8.1.7-py3-none-any.whl (97 kB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.p

### Import the libraries

In [145]:
from sentence_transformers import SentenceTransformer, util
from PIL import Image
import pandas as pd
import numpy as np

### Load the sentence transformer model

In [8]:
# Download the pretrained sentence transformer model.
# It is named `model` because every later cell calls model.encode(...)
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

### Load the data

We load a dataset of 2,967 rows of drug side-effect data from drugs.com. The data is available here: https://www.kaggle.com/datasets/jithinanievarghese/drugs-related-to-common-treatments

In [208]:
#Load medical drug data
data = pd.read_csv('C:/Users/Arun sivakumar/Desktop/drugs_side_effects_drugs_com.csv')

In [209]:
data.head(2)

Unnamed: 0,drug_name,medical_condition,side_effects,generic_name,drug_classes,brand_names,activity,rx_otc,pregnancy_category,csa,alcohol,related_drugs,medical_condition_description,rating,no_of_reviews,drug_link,medical_condition_url
0,doxycycline,Acne,"(hives, difficult breathing, swelling in your ...",doxycycline,"Miscellaneous antimalarials, Tetracyclines","Acticlate, Adoxa CK, Adoxa Pak, Adoxa TT, Alod...",87%,Rx,D,N,X,amoxicillin: https://www.drugs.com/amoxicillin...,Acne Other names: Acne Vulgaris; Blackheads; B...,6.8,760.0,https://www.drugs.com/doxycycline.html,https://www.drugs.com/condition/acne.html
1,spironolactone,Acne,hives ; difficulty breathing; swelling of your...,spironolactone,"Aldosterone receptor antagonists, Potassium-sp...","Aldactone, CaroSpir",82%,Rx,C,N,X,amlodipine: https://www.drugs.com/amlodipine.h...,Acne Other names: Acne Vulgaris; Blackheads; B...,7.2,449.0,https://www.drugs.com/spironolactone.html,https://www.drugs.com/condition/acne.html


In [210]:
#Prefix each row with the drug name and the phrase "has the side effect of" to improve the readability of our answers
data['drug_side_effects'] = data['drug_name'] +" has the side effect of "+ data['side_effects']

#Drop rows where the side effects are empty since we can't return an answer for those drugs due to missing data
data = data[pd.notnull(data['side_effects'])]

#isnull().sum().sum() counts the null cells left across all columns, not the number of rows
print("Remaining null cells:",data.isnull().sum().sum())
data.reset_index(inplace=True,drop=True)

Remaining null cells: 6777


### Encoding medical documents

In [211]:
#Encode side effects data using our initialized sentence transformer model
medical_corpus_embeddings = model.encode(data['drug_side_effects'].values.tolist())

In [212]:
#Define end user queries (they are encoded in the next cell)
user_query = ['what are the side effects of Aldactone', 'what are the side effects of benzphetamine']

In [213]:
#Save original corpus for later use
original_corpus = data['drug_side_effects'].values.tolist()

#Encode questions given by end user to be answered by our model
queries_embeddings = model.encode(user_query)

# Find the top match for each user query based on cosine similarity
hits = util.semantic_search(queries_embeddings, medical_corpus_embeddings, top_k=1)

# Print results of first user query
print(f"Query: {user_query[0]}")
for hit in hits[0]:
    print(original_corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
    print('\n')

# Print results of second user query
print(f"Query: {user_query[1]}")
for hit in hits[1]:
    print(original_corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))


Query: what are the side effects of Aldactone
Aldactone has the side effect of hives ; difficulty breathing; swelling of your face, lips, tongue, or throat. Call your doctor at once if you have: a light-headed feeling, like you might pass out; little or no urination; high potassium level - nausea , weakness, tingly feeling, chest pain, irregular heartbeats, loss of movement; o signs of other electrolyte imbalances - increased thirst or urination, confusion, vomiting , muscle pain, slurred speech, severe weakness, numbness , loss of coordination, feeling unsteady. Common Aldactone side effects may include: breast swelling or tenderness. (Score: 0.7186)


Query: what are the side effects of benzphetamine
benzphetamine has the side effect of hives ; difficulty breathing; swelling of your face, lips, tongue, or throat. Benzphetamine may cause serious side effects. Stop using benzphetamine and call your doctor at once if you have: shortness of breath (even with mild exertion), swelling, rap

Here, the cosine similarity between the query embedding and each corpus embedding serves as the "score": the corpus sentence with the highest score is returned as the answer to the user query.
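
To make the scoring step concrete, here is a minimal NumPy sketch of a cosine-similarity top-k search. It is an illustration only, not the internals of `util.semantic_search`; the helper name `cosine_top_k` and the 3-dimensional toy vectors are assumptions made up for this example, not real model output.

```python
import numpy as np

def cosine_top_k(query_vec, corpus_vecs, k=1):
    """Return (index, score) pairs for the k corpus rows most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    corpus_vecs = np.asarray(corpus_vecs, dtype=float)
    # Cosine similarity is the dot product of L2-normalised vectors
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    # Indices of the k highest scores, best first
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

# Toy 3-dimensional "embeddings" (hypothetical numbers)
toy_corpus = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [1.0, 1.0, 0.0]])
toy_query = np.array([2.0, 0.0, 0.0])

top_hits = cosine_top_k(toy_query, toy_corpus, k=2)
print(top_hits)
```

Because the vectors are normalised first, the length of the query does not matter: only the direction of the embedding determines the score.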

In [214]:
#Let's get the top two searches to show the next most similar sentence in the medical documents
hits = util.semantic_search(queries_embeddings, medical_corpus_embeddings, top_k=2)

# Print the top two results for the second query
print(f"Query: {user_query[1]}")
for hit in hits[1]:
    print(original_corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))

Query: what are the side effects of benzphetamine
benzphetamine has the side effect of hives ; difficulty breathing; swelling of your face, lips, tongue, or throat. Benzphetamine may cause serious side effects. Stop using benzphetamine and call your doctor at once if you have: shortness of breath (even with mild exertion), swelling, rapid weight gain; chest pain, feeling like you might pass out; pounding heartbeats or fluttering in your chest; confusion or irritability, unusual thoughts or behavior; or dangerously high blood pressure (severe headache , blurred vision, buzzing in your ears, anxiety , confusion, chest pain, shortness of breath, uneven heartbeats, seizure). Common side effects of benzphetamine may include: feeling restless or hyperactive; headache, dizziness , tremors; sleep problems ( insomnia ); increased sweating; dry mouth or an unpleasant taste in your mouth; nausea, diarrhea upset stomach ; or skin rash . (Score: 0.6885)
Benzaclin has the side effect of hives; diffi

Here, the cosine similarity measure returns the side effects of Benzaclin as the next most similar answer for "benzphetamine" because the two names are similar. So if a drug is not in the corpus, the model still returns the closest match, which may describe a different drug.

It is therefore important to make sure the corpus covers the drugs that users ask about.
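
One possible safeguard, sketched below under stated assumptions, is to accept a hit only when its score clears a threshold and the drug name from the matched sentence actually appears in the query. The helper `answer_or_fallback`, the `min_score` value of 0.6, the fallback message, and the toy query and scores are all hypothetical; only the hit format (`corpus_id`, `score`) and the sentence template come from the cells above.

```python
def answer_or_fallback(query, hits_for_query, corpus, min_score=0.6):
    """Return the best corpus sentence only if it plausibly answers the query."""
    for hit in hits_for_query:
        sentence = corpus[hit['corpus_id']]
        # Corpus sentences were built as "<drug_name> has the side effect of <...>"
        drug_name = sentence.split(" has the side effect of ")[0]
        # Require both a high enough score and a literal drug-name match
        if hit['score'] >= min_score and drug_name.lower() in query.lower():
            return sentence
    return "No matching drug found in the corpus."

# Toy corpus and hits (scores are made up for illustration)
toy_sentences = [
    "benzphetamine has the side effect of hives",
    "Benzaclin has the side effect of hives",
]
known = answer_or_fallback(
    "what are the side effects of benzphetamine",
    [{'corpus_id': 0, 'score': 0.69}],
    toy_sentences,
)
unknown = answer_or_fallback(
    "what are the side effects of somedrug",  # hypothetical drug absent from the corpus
    [{'corpus_id': 1, 'score': 0.65}],
    toy_sentences,
)
print(known)
print(unknown)
```

The name check is deliberately strict: it trades some recall (for example, misspelled drug names) for not answering with another drug's side effects.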

### Added feature: disease treated

Let's add another use case: answering which disease a particular drug treats.
To do this, we need to encode the column that holds the relevant medical condition.

In [215]:
#Let's check for null values in the medical condition column, as we would need to drop those rows
data['medical_condition'].isna().sum()

0

In [217]:
#Build a sentence stating the medical condition each drug treats
data['medical_condition_present'] = data['drug_name'] +" is used to treat "+ data['medical_condition']

In [223]:
#Let's combine drug side effects and treated medical conditions in a single column for our encoding convenience
data_converted = pd.melt(data, id_vars='drug_name', value_vars=['drug_side_effects', 'medical_condition_present'])

In [224]:
data_converted

Unnamed: 0,drug_name,variable,value
0,doxycycline,drug_side_effects,"doxycycline has the side effect of (hives, dif..."
1,spironolactone,drug_side_effects,spironolactone has the side effect of hives ; ...
2,minocycline,drug_side_effects,"minocycline has the side effect of skin rash, ..."
3,Accutane,drug_side_effects,Accutane has the side effect of problems with ...
4,clindamycin,drug_side_effects,clindamycin has the side effect of hives ; dif...
...,...,...,...
5609,Atti-Plex P,medical_condition_present,Atti-Plex P is used to treat Weight Loss
5610,benzphetamine,medical_condition_present,benzphetamine is used to treat Weight Loss
5611,fenfluramine,medical_condition_present,fenfluramine is used to treat Weight Loss
5612,Imcivree,medical_condition_present,Imcivree is used to treat Weight Loss
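
To see what `pd.melt` does here, the sketch below applies the same call to a two-row toy frame. The drug names `drug_a` and `drug_b` and their sentences are hypothetical placeholders, not rows from the dataset.

```python
import pandas as pd

# Toy frame with the same three columns used in the melt above
toy = pd.DataFrame({
    'drug_name': ['drug_a', 'drug_b'],
    'drug_side_effects': ['drug_a has the side effect of nausea',
                          'drug_b has the side effect of rash'],
    'medical_condition_present': ['drug_a is used to treat Acne',
                                  'drug_b is used to treat Acne'],
})

# Stack the two sentence columns into one 'value' column;
# 'variable' records which column each sentence came from
long_form = pd.melt(toy, id_vars='drug_name',
                    value_vars=['drug_side_effects', 'medical_condition_present'])

# One flat list of sentences, ready to pass to model.encode(...)
toy_sentences_long = long_form['value'].tolist()
print(long_form)
```

Each input row yields one output row per melted column, so the side-effect sentences come first, followed by the "is used to treat" sentences.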


In [226]:
#Encode the combined side effect and disease treated sentences
medical_condition_embeddings = model.encode(data_converted['value'].values.tolist())

In [227]:
#Create new corpus with both side effects and disease treated
added_corpus = data_converted['value'].values.tolist()

In [231]:
#Encode end user queries
user_query = ['what are the side effects of Arthricream', 'what disease is treated by setmelanotide']


queries_embeddings = model.encode(user_query)

# Find the top-1 corpus documents matching each query
hits = util.semantic_search(queries_embeddings, medical_condition_embeddings, top_k=1)

# Print results of first user query
print(f"Query: {user_query[0]}")
for hit in hits[0]:
    print(added_corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
    print('\n')

# Print results of second user query which asks about the disease treated by a particular drug
print(f"Query: {user_query[1]}")
for hit in hits[1]:
    print(added_corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))

Query: what are the side effects of Arthricream


Query: what disease is treated by setmelanotide
setmelanotide is used to treat Weight Loss (Score: 0.7788)


In this way, we can add answers for the questions users ask most frequently. This broadens the range of answers our Q&A model provides as it searches large medical documents to answer user queries within seconds.

### Conclusion

The model can be improved with more customized answers to user queries based on popular questions. The accurate answers produced for the queries tested here suggest that the medical question-answering model can be extended to more detailed questions and newer drugs. Additionally, multimodal models such as CLIP could extend this work further by returning image examples of diseases (osteoarthritis, psoriasis) in response to user queries.