<a href="https://colab.research.google.com/github/ankesh86/LLMProjects/blob/main/ExtractingQues_LangChain_LLaMa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Schemas with LLaMa
## Goal: get a Python object back from LLaMa
1. Define a Pydantic schema
2. Define a LangChain prompt that embeds the schema's format instructions
3. Get a Python object in return
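
The three steps can be illustrated without LangChain or a live model. What follows is a minimal sketch using only the standard library. `QuizItem`, `format_instructions`, `parse_reply`, and the hard-coded `fake_reply` are hypothetical stand-ins and not part of the notebook, which uses Pydantic and `PydanticOutputParser` instead.

```python
import json
from dataclasses import dataclass, fields
from typing import List


@dataclass
class QuizItem:
    """Hypothetical schema standing in for the Pydantic model (step 1)."""
    question: str
    options: List[str]
    correct_option: int


def format_instructions(schema) -> str:
    """Describe the expected JSON keys so they can be pasted into a prompt (step 2)."""
    keys = ", ".join(f.name for f in fields(schema))
    return f"Reply with a single JSON object containing the keys: {keys}."


def parse_reply(reply: str) -> QuizItem:
    """Turn the model's JSON text into a Python object (step 3)."""
    return QuizItem(**json.loads(reply))


# Stand-in for what a model might return; no LLM is called here.
fake_reply = '{"question": "2 + 2 = ?", "options": ["3", "4"], "correct_option": 1}'
item = parse_reply(fake_reply)
print(format_instructions(QuizItem))
print(item.options[item.correct_option])
```

The point of the schema is that downstream code can use attribute access (`item.options`) instead of digging through free text.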

In [1]:
!pip install langchain-groq langchain

Collecting langchain-groq
  Downloading langchain_groq-0.1.3-py3-none-any.whl (11 kB)
Collecting langchain
  Downloading langchain-0.1.17-py3-none-any.whl (867 kB)
Collecting groq<1,>=0.4.1 (from langchain-groq)
  Downloading groq-0.5.0-py3-none-any.whl (75 kB)
Collecting langchain-core<0.2.0,>=0.1.45 (from langchain-groq)
  Downloading langchain_core-0.1.50-py3-none-any.whl (302 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.5-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting

In [32]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
from google.colab import userdata

# Optional model parameters
model_parameters = {
    #"max_tokens": 1024,  # Uncomment to cap the response length; keep it large enough for verbose medical questions
    "temperature": 0.1,  # Low temperature keeps the output close to deterministic
}


model = ChatGroq(**model_parameters, groq_api_key=userdata.get('GROQ_API_KEY'), model_name="llama3-70b-8192")
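
`userdata.get` is only available inside Colab. The sketch below shows one possible fallback for running the same cell elsewhere. `get_groq_api_key` is a hypothetical helper, not part of the notebook, and it assumes the key has been exported as an environment variable named `GROQ_API_KEY`.

```python
import os


def get_groq_api_key(name: str = "GROQ_API_KEY") -> str:
    """Read the key from Colab secrets when available, else from the environment."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
        return userdata.get(name)
    except ImportError:
        pass
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the model")
    return key


# Possible usage with the cell above:
# model = ChatGroq(**model_parameters, groq_api_key=get_groq_api_key(), model_name="llama3-70b-8192")
```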

## **MCQ Extraction**

In [5]:
from typing import List

from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel, Field

class MedicalMCQ(BaseModel):
    question: str = Field(description="The medical multiple-choice question.")
    options: List[str] = Field(description="List of answer options for the MCQ.")
    correct_option: int = Field(description="Index of the correct answer option, zero-indexed.")
    explanation: str = Field(description="Explanation or additional context related to the question.")

# Initialize the parser for the MedicalMCQ model
mcq_parser = PydanticOutputParser(pydantic_object=MedicalMCQ)


# Create a prompt template for generating a structured output from unformatted text
mcq_prompt = PromptTemplate(
    template="Extract the medical MCQ from the following text: \n{format_instructions}\n{query}\n",
    input_variables=["query"],  # This will contain the unformatted MCQ text
    partial_variables={"format_instructions": mcq_parser.get_format_instructions()},
)

# Compose prompt, model, and parser into one chain; `model` is the ChatGroq instance created above
chain = mcq_prompt | model | mcq_parser
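
The last stage of the chain turns the model's text into an object. Models often wrap JSON in prose or a Markdown fence, so the JSON has to be located before it can be validated. The sketch below illustrates that idea with the standard library only. `extract_json` is a hypothetical helper and is not LangChain's implementation.

```python
import json
import re

# Three backticks, built programmatically so this snippet can sit inside a fence itself.
FENCE = "`" * 3


def extract_json(reply: str) -> dict:
    """Pull a JSON object out of a reply that may contain prose or a Markdown fence."""
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else reply
    # Take everything from the first "{" to the last "}" as the JSON payload.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    return json.loads(candidate[start:end + 1])


reply = "Here is the result:\n" + FENCE + 'json\n{"correct_option": 1}\n' + FENCE
print(extract_json(reply))
```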


In [13]:
# Example text input
mcq_text = """
1. Revised |f an adult's BP recordings
repeatedly show the SBP between 120
and 128mm Hg and DBP 88mm Hg, it is
classified as.
A. Normal [25%]
@B. High normal [31%]
C. Grade 1 Hypertension [43%]
D. Grade 2 Hypertension [1%]
@ 31% of the people got this right
© Marrow QBank v3.0 - Dec2018 - MCQID MB3882
aksuggie@gmail.com
As the DBP of the person is 88 mm Hg, it is
classified as high normal BP.
Note: Whenever hypertension can be classified
into two different categories based on values of
SBP and DBP then the higher category is
considered as classification of hypertension.
MARROW
Classification of Blood Pressure
for Adults - European Society of
Cardiology guidelines - 2018
Pearl #1218 + Medicine
3ank v3.0 - Dec2018 - aksuggie@gmail.com
<120
Optimal ad <80
Normal 120-129 g0-84
and/or
i 130 - 139
High normal and/or 85-89
Grade 1 140-159
hypertension —_ and/or 90-99
Grade 2 160-179
hypertension and/or 100-109
Grade 3 2180 st10
hypertension and/or =
Isolated systolic
4 2140 and <90
hypertension
#RecentUpdates
A Report error Share MCQ
MCQ ID: MB3882
Reference
as Online Resource European Society of Cardiology
2018 Guidelines
"""

# Invoke the chain with the unformatted text as input
extracted_mcq = chain.invoke({"query": mcq_text})

# Access the extracted data
print(f"Question: {extracted_mcq.question}")
print(f"Options: {extracted_mcq.options}")
print(f"Correct Option: {extracted_mcq.correct_option}")
print(f"Explanation: {extracted_mcq.explanation}")


Question: If an adult's BP recordings repeatedly show the SBP between 120 and 128mm Hg and DBP 88mm Hg, it is classified as.
Options: ['Normal', 'High normal', 'Grade 1 Hypertension', 'Grade 2 Hypertension']
Correct Option: 1
Explanation: As the DBP of the person is 88 mm Hg, it is classified as high normal BP. Note: Whenever hypertension can be classified into two different categories based on values of SBP and DBP then the higher category is considered as classification of hypertension.
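
Because `correct_option` is a zero-based index, it is worth checking that it points at a real option before using it. A small sketch follows; `correct_answer_text` is a hypothetical helper, and the sample values are copied from the output above.

```python
from typing import List


def correct_answer_text(options: List[str], correct_option: int) -> str:
    """Return the option text for a zero-based index, rejecting out-of-range values."""
    if not 0 <= correct_option < len(options):
        raise IndexError(
            f"correct_option={correct_option} is outside 0..{len(options) - 1}"
        )
    return options[correct_option]


options = ["Normal", "High normal", "Grade 1 Hypertension", "Grade 2 Hypertension"]
print(correct_answer_text(options, 1))
```

The explicit range check matters because a negative index such as `-1` would otherwise silently select the last option.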


## **Multiple MCQ Extraction**

In [42]:
from typing import List
from pydantic import BaseModel, Field

class MedicalMCQ(BaseModel):
    question: str = Field(description="The medical multiple-choice question.")
    options: List[str] = Field(description="List of answer options for the MCQ")
    correct_option: int = Field(description="Index of the correct answer option, zero-indexed.")
    explanation: str = Field(description="Explanation or additional context for the question, up to where the next question begins.")

class MedicalMCQs(BaseModel):
    mcqs: List[MedicalMCQ] = Field(description="A list of medical multiple-choice questions.")


In [47]:
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate

# Initialize the parser for the MedicalMCQs model
mcqs_parser = PydanticOutputParser(pydantic_object=MedicalMCQs)

mcqs_prompt = PromptTemplate(
    template="""
    Please extract all medical multiple-choice questions (MCQs) from the following text. Each MCQ should include:
    - The question text,
    - A list of answer options labeled as 'a)', 'b)', 'c)', and 'd)',
    - The correct option index, and
    - A complete explanation that continues until just before the next question starts or until the text explicitly signals the conclusion of the explanation.
    Ensure the entire explanation is captured even if it spans multiple paragraphs.
    \n{format_instructions}\n{query}\n
    """,
    input_variables=["query"],
    partial_variables={"format_instructions": mcqs_parser.get_format_instructions()},
)

In [48]:
# Compose prompt, model, and parser into one chain; `model` is the ChatGroq instance created above
chain = mcqs_prompt | model | mcqs_parser


## **Enter text**

In [49]:
# Example text input containing multiple MCQs
mcqs_text = """
1)	A lady who is 38 weeks pregnant comes to the OPD for a routine checkup. She has a history of normal twin delivery at term 4 years ago. What is her gravida and para score?
a)	G2P2
b)	G2P1
c)	G3P2
d)	G3P1
Answer: G2P1
The woman has had two pregnancies in total – the previous twin pregnancy and the current pregnancy, hence the gravida score is 2. The first pregnancy resulted in delivery at term (crossed viable period), hence her para score is 1 (parity doesn’t include current pregnancy). Hence, her obstetric score would be G2P1 (Gravida-2, Para-1).

2)	A young boy was brought to the casualty with a history of fever and abdominal pain. On examination, he was febrile and his pulse rate was 104 beats/min. The CT scan image is shown below. Which of the following signs is not seen in this disease?

a)	Mcburney’s tenderness
b)	Rovsing’s sign
c)	Psoas sign
d)	Balance sign
Answer: Balance sign
The patient has acute appendicitis and Balance sign is not seen in acute appendicitis.
Balance sign is seen in ruptured spleen. Balance sign is positive if there is a fixed dullness to percussion in the left flank and shifting dullness to percussion in the right flank. The fixed dullness is due to the presence of coagulated blood near the lacerated spleen while the shifting dullness is due to the presence of blood in the peritoneal cavity.

3)	A patient came with complaints of hair loss. His wife mentions that she has noticed some behavioral changes. The doctor notices that there is a loss of eyebrows on the lateral side. He then comes to a conclusion by examining the nails. What is the type of poisoning in this case?
a)	Thallium
b)	Arsenic
c)	Mercury
d)	Lead
Answer: Thallium
The given scenario of hair loss, loss of eyebrows on the lateral side along with behavioral and nail changes point to thallium poisoning.
Thallium poisoning occurs due to an overdose or idiosyncrasy. Thallium sulfate and thallium acetate are the two salts of thallium used in dyes, in the glass industry, and as rodenticides and pesticides.


4)	A 50-year-old man with liver metastasis and altered LFT is posted for a surgery. What would be the opioid of choice in him?
a)	Fentanyl
b)	Alfentanil
c)	Sufentanil
d)	Remifentanil
Answer: Remifentanil
Opioid of choice in patients with compromised hepatic function is remifentanil.
Remifentanil is hydrolyzed by plasma and tissue esterases and hence its metabolism is independent of hepatic function.
Other opioids like fentanyl, alfentanil, and sufentanil are metabolized by the liver.

5)	A primigravida in her first trimester came for a routine ANC. USG findings revealed an empty uterine cavity. Serum beta hCG was measured and was found to be 1700 IU/mL. All of the following drugs can be used in managing this patient except:
a)	Potassium chloride
b)	Methotrexate
c)	 Actinomycin
d)	Oxytocin
Answer: Oxytocin
Oxytocin is not used in the management of unruptured ectopic pregnancy.

6)	Which of the following methods is not used for delivery of the aftercoming head in breech?
a)	Burns-Marshall method
b)	Mauriceau-Smellie-Veit method
c)	Piper’s forceps
d)	Keilland’s forceps
Answer: Keilland’s forceps
Delivery of the aftercoming head in breech is not done with the help of Keilland’s forceps. These forceps are used for the rotation of the fetal head. It is not used anymore.

7)	A young girl with McCune-Albright syndrome is noted to have precocious menstruation. This condition is defined as the occurrence of menstruation at
a)	<8 years
b)	 <7 years
c)	<10 years
d)	<11 years
Answer: <10 years
Precocious menstruation refers to the occurrence of menstruation before the age of 10 years.
Precocious puberty in a girl is the appearance of any of the secondary sexual characteristics before the age of 8 years or the occurrence of menarche before the age of 10 years.

8)	While evaluating a child with suspected refractive error, you notice that the ophthalmology resident performs a retinoscopy. Which of the following is this test based on?
A)	Method of neutralization
B)	Imbert-Fick law
C)	Principle of bending of light
D)	Poiseuille law
Answer: Method of neutralization
Retinoscopy is an objective method used to determine the error of refraction by applying the method of neutralization. It is also called skiascopy or shadow test.
Principle: When light is reflected from a mirror into the eye, the direction in which light will travel across the pupil will depend upon the eye’s refractive state.

9)	A woman comes to the casualty at 18 weeks of gestation with complaints of vaginal bleeding. She has a history of two early pregnancy losses. USG reveals the absence of fetal cardiac activity. Which of the following investigations need not be done in this patient?
a)	Extensive infection work-up
b)	Serum prolactin level
c)	Antiphospholipid antibodies
d)	Parental karyotyping
Answer: Extensive infection work-up
The given clinical scenario is suggestive of recurrent abortions, and extensive infection workup is not done. Infections do not cause this condition with the exception of syphilis.
Syphilis can cause recurrent abortion only if it remains untreated in the subsequent pregnancy.
Even though recurrent abortions are defined as ≥3 pregnancy losses, investigations must be done after 2 consecutive losses.

10)	In an immunocompromised patient, which of the following is the drug of choice for prophylaxis of pneumocystis jirovecii?
a)	Cephalosporin
b)	Dexamethasone
c)	Cotrimoxazole
d)	Amoxycillin
Answer: Cotrimoxazole
Trimethoprim-Sulfamethoxazole Combination (commonly known as Co-trimoxazole) is the first line drug for both prophylaxis and treatment of Pneumocystis jirovecii infections in immunocompromised patients.
Cotrimoxazole acts by causing a sequential block of folate metabolism.
"""

## **Final Processing**

In [51]:

# Invoke the chain with the unformatted text as input
extracted_mcqs = chain.invoke({"query": mcqs_text})


# Process and print each MCQ from the list
for mcq in extracted_mcqs.mcqs:
    print(f"Question: {mcq.question}")
    print(f"Options: {mcq.options}")
    print(f"Correct Option: {mcq.correct_option}")
    print(f"Explanation: {mcq.explanation}")
    print("\n")


Question: A lady who is 38 weeks pregnant comes to the OPD for a routine checkup. She has a history of normal twin delivery at term 4 years ago. What is her gravida and para score?
Options: ['G2P2', 'G2P1', 'G3P2', 'G3P1']
Correct Option: 1
Explanation: The woman has had two pregnancies in total – the previous twin pregnancy and the current pregnancy, hence the gravida score is 2. The first pregnancy resulted in delivery at term (crossed viable period), hence her para score is 1 (parity doesn’t include current pregnancy). Hence, her obstetric score would be G2P1 (Gravida-2, Para-1).


Question: A young boy was brought to the casualty with a history of fever and abdominal pain. On examination, he was febrile and his pulse rate was 104 beats/min. The CT scan image is shown below. Which of the following signs is not seen in this disease?
Options: ['McBurney’s tenderness', 'Rovsing’s sign', 'Psoas sign', 'Balance sign']
Correct Option: 3
Explanation: The patient has acute appendicitis and 
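
The input text states each answer on an `Answer: ...` line, so the index returned by the model can be cross-checked without another LLM call. The sketch below shows one way to do that; `index_from_answer_line` is a hypothetical helper and assumes the answer line repeats the option text exactly, ignoring case and surrounding whitespace.

```python
from typing import List, Optional


def index_from_answer_line(options: List[str], answer_line: str) -> Optional[int]:
    """Map an 'Answer: ...' line onto the zero-based index of the matching option."""
    answer = answer_line.split(":", 1)[-1].strip().casefold()
    for i, option in enumerate(options):
        if option.strip().casefold() == answer:
            return i
    return None  # no option matches the stated answer


options = ["G2P2", "G2P1", "G3P2", "G3P1"]
print(index_from_answer_line(options, "Answer: G2P1"))
```

A mismatch between this index and `mcq.correct_option` would flag the item for manual review.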

## **Optimising data**

In [9]:
import re
from typing import List, Optional
from pydantic import BaseModel, Field, validator, ValidationError

class MedicalMCQ(BaseModel):
    question: str = Field(description="The medical multiple-choice question.")
    options: List[str] = Field(description="List of answer options for the MCQ.")
    # Stored as an int so that index_to_alpha() can turn it into 'A', 'B', ...
    correct_option: Optional[int] = Field(default=None, description="Index of the correct answer option, zero-indexed.")
    explanation: str = Field(description="Explanation or additional context related to the question.")

    @validator('options', each_item=True)
    def clean_options(cls, v):
        # Strip a leading label such as "A. " or "b) ". str.lstrip("ABCD. ") removes a set of
        # characters, so it would also eat option text ("B. Balance sign" -> "alance sign").
        return re.sub(r"^\s*[A-Da-d][.)]\s+", "", v).strip()

    @validator('correct_option', pre=True, always=True)
    def set_default_correct_option(cls, v):
        # Coerce the model's answer to an int index; fall back to None if it is not numeric.
        try:
            return int(v)
        except (TypeError, ValueError):
            return None

class MedicalMCQs(BaseModel):
    mcqs: List[MedicalMCQ] = Field(description="A list of medical multiple-choice questions.")


<ipython-input-9-09bed55ec6d4>:10: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
  @validator('options', each_item=True)
<ipython-input-9-09bed55ec6d4>:14: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/
  @validator('correct_option', pre=True, always=True)
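
The warnings above recommend moving to Pydantic V2 style `@field_validator` validators. One possible migration is sketched below, assuming Pydantic 2.x is installed. `MedicalMCQV2` is a hypothetical name chosen so the sketch does not overwrite the notebook's class. Label stripping uses a regular expression rather than `str.lstrip`, which removes a set of characters and can eat the start of the option text.

```python
import re
from typing import List, Optional

from pydantic import BaseModel, Field, field_validator


class MedicalMCQV2(BaseModel):
    question: str
    options: List[str]
    # validate_default=True plays the role of always=True in the V1 validator.
    correct_option: Optional[int] = Field(default=None, validate_default=True)
    explanation: str

    @field_validator("options")
    @classmethod
    def clean_options(cls, values: List[str]) -> List[str]:
        # V2 has no each_item flag, so the list is processed explicitly.
        return [re.sub(r"^\s*[A-Da-d][.)]\s+", "", v).strip() for v in values]

    @field_validator("correct_option", mode="before")
    @classmethod
    def coerce_index(cls, v):
        # mode="before" replaces pre=True: runs on the raw input value.
        try:
            return int(v)
        except (TypeError, ValueError):
            return None


mcq = MedicalMCQV2(
    question="Which sign is not seen in acute appendicitis?",
    options=["A. Psoas sign", "B. Balance sign"],
    correct_option="1",
    explanation="Balance sign is seen in ruptured spleen.",
)
print(mcq.options, mcq.correct_option)
```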


In [10]:
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate

# Initialize the parser for the MedicalMCQs model
mcqs_parser = PydanticOutputParser(pydantic_object=MedicalMCQs)

mcqs_prompt = PromptTemplate(
    template="""Extract all medical MCQs from the following text, ensuring each part is complete and explanations are not truncated:
                \n{format_instructions}\n{query}\n""",
    input_variables=["query"],
    partial_variables={"format_instructions": mcqs_parser.get_format_instructions()},
)

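# Batch processing function
# Note: `chain` is a global and still refers to the pipeline built in the earlier cell.
# Re-run `chain = mcqs_prompt | model | mcqs_parser` first if the schema and prompt
# defined above should be used.
def process_text_in_batches(text, batch_size=1000):
    # Fixed-width slices can cut a question in half, so the model may receive fragments.
    parts = [text[i:i+batch_size] for i in range(0, len(text), batch_size)]
    results = []
    for part in parts:
        try:
            extracted_mcqs = chain.invoke({"query": part})
            results.extend(extracted_mcqs.mcqs)
        except Exception as e:
            # Keep going so that one malformed batch does not discard the others.
            print(f"Error processing batch: {e}")
    return results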

def index_to_alpha(index):
    # This function converts a numeric index to an alphabetic label.
    if index is not None and 0 <= index < 26:
        return chr(ord('A') + index)
    return None
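
Slicing the text every `batch_size` characters can split a question across two batches. An alternative is to cut only where a new question begins. The sketch below is one way to do that; `split_on_questions` is a hypothetical helper and assumes every question starts on a line beginning with a number followed by `.` or `)`, as in the sample inputs.

```python
import re
from typing import List


def split_on_questions(text: str, max_chars: int = 1000) -> List[str]:
    """Group whole questions into batches of at most max_chars where possible.

    A single question longer than max_chars is kept whole rather than cut.
    """
    starts = [m.start() for m in re.finditer(r"^[ \t]*\d+[.)]\s", text, flags=re.MULTILINE)]
    if not starts:
        return [text] if text.strip() else []
    starts[0] = 0  # keep any preamble together with the first question
    bounds = starts + [len(text)]
    questions = [text[a:b] for a, b in zip(bounds, bounds[1:])]

    batches, current = [], ""
    for question in questions:
        # Start a new batch when adding this question would exceed the limit.
        if current and len(current) + len(question) > max_chars:
            batches.append(current)
            current = ""
        current += question
    if current:
        batches.append(current)
    return batches


sample = "7. Q seven\nA. x\nB. y\n8. Q eight\nA. x\n"
for batch in split_on_questions(sample, max_chars=10):
    print(repr(batch))
```

Each batch could then be passed to `chain.invoke({"query": batch})` in place of the fixed-width slices.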


In [14]:
# Example usage with batch processing
mcqs_text = '''
7. Wellen's syndrome suggests ___
© A. Stable angina [12%]
@ B. Unstable angina [40%]
C. Prinzmetal angina [35%]
D. Ludwig angina [13%]
) 40% of the people got this right
Wellen's syndrome suggests unstable angina.
Wellen's syndrome is caused by tight stenosis in
the LAD coronary system that causes ischemic
chest pain with deep T wave inversions in
multiple precordial leads, with or without
cardiac enzyme elevations and with minimal or
no ST elevations. This is also called the left
anterior descending coronary artery- T wave
inversion pattern. Patients have recurrent
episodes and a high incidence of MI.
#Eponyms #recentNEET
A Report error & Share MCQ
Reference
Bg Harrison's Principles of Internal Medicine 20th
Edition Page number. 1680
8. Which of the following is true about
stable angina?
A. CK-MB is elevated [6%]
© B. Troponin | is elevated [11%]
C. Myoglobin is elevated [4%]
D. The levels of cardiac markers remain
unchanged [79%]
wd 79% of the people got this right
The levels of cardiac markers remain unchanged
in stable angina.
Stable angina occurs when the myocardial
oxygen demand exceeds the myocardial oxygen
supply. If adequate blood flow is restored , there
will be no permanent damage to the
myocardium. This does not raise the cardiac
biomarkers as there is no necrosis of cardiac
myocytes.
In myocardial infarction, myocyte necrosis
occurs due to ischemia. The cardiac biomarkers
in the plasma are elevated in this condition.
Ref: https://www.uptodate.com/contents/whats-new?search=
"57407 1ecbcec4345feb4c3144"
#recentNEET
A Report error > Share MCQ
Reference
Harrison's Principles Of Internal Medicine - 19th
Mal Edition Page no: 1581
'''
all_mcqs = process_text_in_batches(mcqs_text)


# Use this function in your loop where you print each MCQ's details.
for mcq in all_mcqs:
    alpha_option = index_to_alpha(mcq.correct_option)
    print(f"Question: {mcq.question}")
    print(f"Options: {mcq.options}")
    print(f"Correct Option: {alpha_option}")  # Display as 'A', 'B', 'C', etc.
    print(f"Explanation: {mcq.explanation}")
    print("\n")


Question: Wellen's syndrome suggests ___
Options: ['A. Stable angina', 'B. Unstable angina', 'C. Prinzmetal angina', 'D. Ludwig angina']
Correct Option: B
Explanation: Wellen's syndrome suggests unstable angina. Wellen's syndrome is caused by tight stenosis in the LAD coronary system that causes ischemic chest pain with deep T wave inversions in multiple precordial leads, with or without cardiac enzyme elevations and with minimal or no ST elevations.


Question: Which of the following is true about stable angina?
Options: ['A. CK-MB is elevated', 'B. Troponin I is elevated', 'C. Myoglobin is elevated', 'D. The levels of cardiac markers remain unchanged']
Correct Option: D
Explanation: The levels of cardiac markers remain unchanged in stable angina.


Question: Why do cardiac biomarkers remain unchanged in stable angina?
Options: ['There is no necrosis of cardiac myocytes', 'There is no ischemia', 'There is no myocardial oxygen demand', 'There is no myocardial oxygen supply']
Correct Op
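
The third question in the output above does not appear in the input text, which suggests the model invented it from a fragment. A simple guard is to keep only the questions whose text can be found in the source. The sketch below is a hedged illustration: `grounded_questions` is a hypothetical helper, and the exact-substring check assumes the model copies the question verbatim apart from whitespace and case, so it would also drop questions where the model corrected OCR errors.

```python
import re
from typing import Iterable, List


def _normalise(text: str) -> str:
    """Collapse whitespace and ignore case so line breaks do not prevent a match."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def grounded_questions(questions: Iterable[str], source: str) -> List[str]:
    """Keep only the questions whose text occurs in the source document."""
    haystack = _normalise(source)
    return [q for q in questions if _normalise(q) in haystack]


source = """8. Which of the following is true about
stable angina?
A. CK-MB is elevated [6%]
"""
candidates = [
    "Which of the following is true about stable angina?",
    "Why do cardiac biomarkers remain unchanged in stable angina?",
]
print(grounded_questions(candidates, source))
```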

In [86]:
all_mcqs

[MedicalMCQ(question='Which is the first branch of the right coronary artery?', options=['Sinoatrial node artery', 'Atrioventricular nodal artery', 'Right conus artery', 'Right marginal artery'], correct_option=2, explanation='The first branch of the right coronary artery (RCA) is the right conus artery. The second branch of the RCA is usually the sinoatrial node artery. It arises from the RCA in about 60% of patients and from the LCX artery in about 40%.')]

In [89]:
!pip freeze


absl-py==1.4.0
aiohttp==3.9.5
aiosignal==1.3.1
alabaster==0.7.16
albumentations==1.3.1
altair==4.2.2
annotated-types==0.6.0
anyio==3.7.1
appdirs==1.4.4
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
array_record==0.5.1
arviz==0.15.1
astropy==5.3.4
astunparse==1.6.3
async-timeout==4.0.3
atpublic==4.1.0
attrs==23.2.0
audioread==3.0.1
autograd==1.6.2
Babel==2.14.0
backcall==0.2.0
beautifulsoup4==4.12.3
bidict==0.23.1
bigframes==1.4.0
bleach==6.1.0
blinker==1.4
blis==0.7.11
blosc2==2.0.0
bokeh==3.3.4
bqplot==0.12.43
branca==0.7.2
build==1.2.1
CacheControl==0.14.0
cachetools==5.3.3
catalogue==2.0.10
certifi==2024.2.2
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
chex==0.1.86
click==8.1.7
click-plugins==1.1.1
cligj==0.7.2
cloudpathlib==0.16.0
cloudpickle==2.2.1
cmake==3.27.9
cmdstanpy==1.2.2
colorcet==3.1.0
colorlover==0.3.0
colour==0.1.5
community==1.0.0b1
confection==0.1.4
cons==0.4.6
contextlib2==21.6.0
contourpy==1.2.1
cryptography==42.0.5
cufflinks==0.17.3
cupy-cuda12x==12.2.0