# JSONLoader Ingestion RAG with LangChain and Mistral 7B

## Library Setup

In [3]:
!pip install langchain chromadb sentence-transformers
!pip install transformers accelerate
!pip install openai tiktoken
!pip install jq

In [20]:
from langchain.chains import RetrievalQA
from langchain.document_loaders import JSONLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter
from langchain.vectorstores import Chroma
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import time

In [4]:
# Set up OpenAI
import os
import getpass
import openai

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
openai.api_key = os.environ["OPENAI_API_KEY"]

OpenAI API Key:········


In [5]:
os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass.getpass("HuggingFace API Key:")

HuggingFace API Key:········


In [8]:
embeddings = SentenceTransformerEmbeddings(model_name="all-mpnet-base-v2")

## RAG Data Ingestion and Loading

In [None]:
def metadata_func(record: dict, metadata: dict) -> dict:
    # Copy the publication date and title of each JSON record into the
    # Document metadata. Fall back to an empty dict when "pub_date" is
    # missing so that the chained .get() calls do not fail on None.
    pub_date = record.get("pub_date") or {}
    metadata["year"] = pub_date.get("year")
    metadata["month"] = pub_date.get("month")
    metadata["day"] = pub_date.get("day")
    metadata["title"] = record.get("article_title")

    return metadata
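
To illustrate what the metadata function receives, the sketch below applies the same field mapping to hand-written records without invoking `JSONLoader`. The record values and the helper name `extract_metadata` are hypothetical; the field names (`pub_date`, `article_title`, `article_abstract`) and the pre-populated `source` and `seq_num` keys follow the cells and outputs in this notebook.

```python
def extract_metadata(record: dict, metadata: dict) -> dict:
    """Same field mapping as metadata_func, tolerant of a missing pub_date."""
    pub_date = record.get("pub_date") or {}
    for key in ("year", "month", "day"):
        metadata[key] = pub_date.get(key)
    metadata["title"] = record.get("article_title")
    return metadata


# Hypothetical record, shaped like one element selected by jq_schema='.[]'.
record = {
    "article_title": "Example title",
    "article_abstract": "Example abstract text.",
    "pub_date": {"year": "2023", "month": "04", "day": "27"},
}

# The loader passes in a metadata dict that already holds source and seq_num.
base = {"source": "data/pubmed_background.json", "seq_num": 1}

full = extract_metadata(record, dict(base))
partial = extract_metadata({"article_title": "No date"}, dict(base))

print(full)
print(partial)
```

A record without `pub_date` yields `None` for the date fields instead of raising an `AttributeError`.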

In [11]:
# Retrieve background documents from json file
loader = JSONLoader(
    file_path='data/pubmed_background.json',
    jq_schema='.[]',
    content_key='article_abstract',
    metadata_func=metadata_func)

data = loader.load()
print(f"{len(data)} pubmed background documents loaded!")
data[1]

754 pubmed background documents loaded!


Document(page_content='Osteoporotic fractures lead to increased disability and mortality in the elderly population. With the rapid increase in the aging population around the globe, more effective treatments for osteoporosis and osteoporotic fractures are urgently required. The underlying molecular mechanisms of osteoporosis are believed to be due to the increased activity of osteoclasts, decreased activity of osteoblasts, or both, which leads to an imbalance in the bone remodeling process with accelerated bone resorption and attenuated bone formation. Currently, the available clinical treatments for osteoporosis have mostly focused on factors influencing bone remodeling; however, they have their own limitations and side effects. Recently, cytokine immunotherapy, gene therapy, and stem cell therapy have become new approaches for the treatment of various diseases. This article reviews the latest research on bone remodeling mechanisms, as well as how this underpins current and potential 

In [7]:
# Split documents into overlapping token-based chunks
text_splitter = TokenTextSplitter(chunk_size=128, chunk_overlap=64)
chunks = text_splitter.split_documents(data)

print(f"{len(data)} pubmed articles are converted to {len(chunks)} text fragments!")
chunks[0]

754 pubmed articles are converted to 1802 text fragments!


Document(page_content='Altered metabolism is a hallmark of cancer and presents a vulnerability that can be exploited in cancer treatment. Regulated cell death (RCD) plays a crucial role in cancer metabolic therapy. A recent study has identified a new metabolic-related RCD known as disulfidptosis. Preclinical findings suggest that metabolic therapy using glucose transporter (GLUT) inhibitors can trigger disulfidptosis and inhibit cancer growth. In this review, we summarize the specific mechanisms underlying disulfidptosis and outline potential future research directions. We also discuss the challenges that may arise in the clinical translation of disulfidptosis research.', metadata={'source': '/home/vying/LLM/Medical_Research_RAG/data/pubmed_background.json', 'seq_num': 1, 'year': '2023', 'month': '04', 'day': '27', 'title': 'Disulfidptosis: a new target for metabolic cancer therapy.'})
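
With `chunk_size=128` and `chunk_overlap=64`, each window advances by 64 tokens, so consecutive fragments share half of their tokens. The 754 abstracts above become 1802 fragments, or roughly 2.4 fragments per abstract. The sketch below is a simplified model of such a sliding window, not the library's implementation; the helper name `window_starts` is hypothetical, and the splitter's exact handling of the final window may differ between LangChain versions.

```python
def window_starts(n_tokens: int, chunk_size: int = 128, chunk_overlap: int = 64) -> list[int]:
    """Start offsets of fixed-size windows advanced by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    starts = []
    start = 0
    while start < n_tokens:
        starts.append(start)
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= n_tokens:
            break
        start += stride
    return starts


# Hypothetical token counts for three abstracts of different lengths.
for n_tokens in (100, 200, 300):
    starts = window_starts(n_tokens)
    print(f"{n_tokens} tokens -> {len(starts)} fragment(s), starting at {starts}")
```

A larger overlap preserves more context across fragment boundaries at the cost of more fragments to embed and store.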

In [12]:
backgroundDB = Chroma.from_documents(chunks, embeddings)
print("Created background vector DB!")

Created background vector DB!


In [13]:
loader = JSONLoader(
    file_path='data/pubmed_reference.json',
    jq_schema='.[]',
    content_key='article_abstract',
    metadata_func=metadata_func)

data = loader.load()
print(f"{len(data)} pubmed reference documents loaded!")
data[1]

8127 pubmed reference documents loaded!


Document(page_content='Antibody drug conjugates (ADCs) combine the potent cytotoxicity of chemotherapy with the antigen -specific targeted approach of antibodies into one single molecule. Trophoblast cell surface antigen 2 (TROP-2) is a transmembrane glycoprotein involved in calcium signal transduction and is expressed in multiple tumor types. TROP-2 expression is higher in HER2-negative breast tumors (HR+/HR-) and is associated with worse survival. Sacituzumab govitecan (SG) is a first-in-class TROP-2-directed ADC with an anti-TROP-2 antibody conjugated to SN-38, a topoisomerase inhibitor via a hydrolysable linker. This hydrolysable linker permits intracellular and extracellular release of the membrane permeable payload enabling the "bystander effect" contributing to the efficacy of this agent. There was significant improvement in progression free survival (PFS) and overall survival (OS) with SG versus chemotherapy in pretreated metastatic triple negative breast cancer (TNBC), resulti

In [14]:
# Split documents into overlapping token-based chunks
text_splitter = TokenTextSplitter(chunk_size=128, chunk_overlap=64)
chunks = text_splitter.split_documents(data)

print(f"{len(data)} pubmed articles are converted to {len(chunks)} text fragments!")
chunks[0]

8127 pubmed articles are converted to 17496 text fragments!


Document(page_content='Evaluate the efficacy, safety, and tolerability of zavegepant nasal spray in the acute treatment of migraine.', metadata={'source': '/home/vying/LLM/Medical_Research_RAG/data/pubmed_reference.json', 'seq_num': 1, 'year': '2022', 'month': '10', 'day': '14', 'title': 'Zavegepant nasal spray for the acute treatment of migraine: A Phase 2/3 double-blind, randomized, placebo-controlled, dose-ranging trial.'})

In [15]:
referenceDB = Chroma.from_documents(chunks, embeddings)
print("Created reference vector DB!")

Created reference vector DB!


## LLM and RAG Components

In [18]:
# Model specification
model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=False, device_map='auto')



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



In [19]:
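# Pipeline/LLM specification
# Generation settings are set on the transformers pipeline itself.
# do_sample=False selects greedy decoding, the deterministic counterpart of temperature 0.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=128,
    do_sample=False,
)
llm = HuggingFacePipeline(pipeline=pipe)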

## Topic Sentences

In [32]:
# PROMPT specification
TOPIC_PROMPT_TEMPLATE = \
    """
    You are a medical research assistant helping to write an academic research paper.
    Answer the Question using only the provided Context.
    Your answer should be in an academic tone and no longer than 128 words.
    Context: {context}
    Question: {question}
    Answer:
    """
TOPIC_PROMPT = PromptTemplate.from_template(TOPIC_PROMPT_TEMPLATE)

In [33]:
# RAG pipeline
topic_chain = RetrievalQA.from_chain_type(
    llm,
    retriever = backgroundDB.as_retriever(search_kwargs={"k": 2}),
    chain_type_kwargs = {"prompt": TOPIC_PROMPT},
    return_source_documents = True
)

In [34]:
query = "What is a popular topic in current medical research?"
result = topic_chain({"query": query})

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 
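
One plausible contributor to the out-of-memory error is the precision of the model weights: when `from_pretrained` is called without a `torch_dtype`, weights are typically loaded in float32. The back-of-the-envelope sketch below estimates weight memory alone at different precisions. The parameter count of about 7.2 billion is an approximation, the helper name `weight_memory_gib` is hypothetical, and activations, the KV cache, and the embedding model sharing the GPU are not included.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


# Assumption: approximate parameter count for a 7B-class model.
N_PARAMS = 7.2e9

for name, nbytes in (("float32", 4), ("float16/bfloat16", 2), ("int8", 1)):
    print(f"{name:>16}: {weight_memory_gib(N_PARAMS, nbytes):.1f} GiB")
```

If the float32 estimate exceeds the available GPU memory, loading the model in half precision or with 8-bit quantization is a common way to reduce the footprint, subject to what the installed `transformers` version supports.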