In [1]:
!pip install requests transformers tensorflow sentence-transformers langchain "pinecone[grpc]" langchain-pinecone langchain_community langchain_experimental pinecone-client tf-keras gensim textblob tqdm python-dotenv langchain-google-genai

Collecting pinecone-client
  Using cached pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Using cached pinecone_client-5.0.1-py3-none-any.whl (244 kB)
Installing collected packages: pinecone-client
  Attempting uninstall: pinecone-client
    Found existing installation: pinecone-client 6.0.0
    Uninstalling pinecone-client-6.0.0:
      Successfully uninstalled pinecone-client-6.0.0
Successfully installed pinecone-client-5.0.1


In [2]:
import os
os.chdir("../")

In [3]:
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
import re
from pathlib import Path
from typing import List, Tuple
from gensim.utils import simple_preprocess
from textblob import TextBlob
from tqdm import tqdm
from langchain.schema import Document

# === CONFIGURATION ===
DATA_DIR = Path("data")   # Update with your actual directory path
CHUNK_SIZE = 200
CHUNK_OVERLAP = 50
ENABLE_SPELL_CORRECTION = False   # Set to True if needed (very slow)


# === Load text files ===
def load_text_files(directory: Path) -> List[Tuple[str, str]]:
    texts = []
    for file_path in tqdm(directory.glob("*.txt"), desc="Loading files"):
        try:
            with open(file_path, "r", encoding="utf-8") as file:
                content = file.read().strip()
                if content:
                    texts.append((content, file_path.name))
        except Exception as e:
            print(f"Error reading {file_path}: {e}")
    return texts


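# === Clean and tokenize text ===
def clean_and_tokenize(text: str) -> str:
    text = re.sub(r'\s+', ' ', text)              # Normalize whitespace
    text = text.lower()                           # Lowercase
    text = re.sub(r'[^\w\s]', '', text)           # Remove punctuation
    # With its defaults, simple_preprocess drops digits and any token
    # shorter than 2 or longer than 15 characters, so numbers are lost here.
    tokens = simple_preprocess(text)
    return ' '.join(tokens)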


# === Optional spell correction (slow) ===
def correct_spelling(text: str) -> str:
    try:
        return str(TextBlob(text).correct())
    except Exception as e:
        print(f"Spell correction error: {e}")
        return text


# === Chunk text into overlapping windows ===
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    words = text.split()
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be greater than overlap.")
    return [
        ' '.join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size - overlap)
    ]


# === Main pipeline: return Document objects with metadata ===
def process_documents(input_dir: Path, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP, correct: bool = False) -> List[Document]:
    raw_docs = load_text_files(input_dir)  # (content, filename) tuples
    documents = []

    for content, source in tqdm(raw_docs, desc="Processing documents"):
        cleaned = clean_and_tokenize(content)

        if correct:
            cleaned = correct_spelling(cleaned)

        chunks = chunk_text(cleaned, chunk_size=chunk_size, overlap=overlap)

        for i, chunk in enumerate(chunks):
            documents.append(Document(
                page_content=chunk,
                metadata={
                    "source": source,
                    "title": Path(source).stem,
                    "chunk_id": i
                }
            ))

    return documents


# === Run the pipeline ===
if __name__ == "__main__":
    input_path = Path(DATA_DIR)
    documents = process_documents(
        input_dir=input_path,
        chunk_size=CHUNK_SIZE,
        overlap=CHUNK_OVERLAP,
        correct=ENABLE_SPELL_CORRECTION
    )
    print(f"\n✅ Total document chunks created: {len(documents)}")


Loading files: 17it [00:00, 45.36it/s]
Processing documents: 100%|██████████| 17/17 [00:39<00:00,  2.30s/it]


✅ Total document chunks created: 75136
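
The chunker above advances by `chunk_size - overlap` words per step. The sketch below repeats that logic on a tiny word list to show how neighbouring chunks share words. It also adds `chunk_text_no_tail`, a hypothetical variant (not part of the original pipeline) that skips a final window made up only of words the previous chunk already covers. Switching to it would change the chunk count reported above.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    # Same sliding-window logic as the notebook cell.
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be greater than overlap.")
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def chunk_text_no_tail(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    # Hypothetical variant: stop as soon as a window reaches the end of the text,
    # so no trailing chunk repeats only already-covered words.
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be greater than overlap.")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break
    return chunks


sample = "w0 w1 w2 w3 w4 w5 w6 w7"
original_chunks = chunk_text(sample, chunk_size=4, overlap=2)
trimmed_chunks = chunk_text_no_tail(sample, chunk_size=4, overlap=2)
print(original_chunks)  # the last chunk, "w6 w7", is fully contained in the one before it
print(trimmed_chunks)
```

With 8 words, a window of 4 and an overlap of 2, the original function produces four chunks and the variant produces three.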





In [48]:
documents

[Document(metadata={'source': 'Anatomy_Gray.txt', 'title': 'Anatomy_Gray', 'chunk_id': 0, 'text': 'what is anatomy anatomy includes those structures that can be seen grossly without the aid of magnification and microscopically with the aid of magnification typically when used by itself the term anatomy tends to mean gross or macroscopic anatomythat is the study of structures that can be seen without using microscopic microscopic anatomy also called histology is the study of cells and tissues using microscope anatomy forms the basis for the practice of medicine anatomy leads the physician toward an understanding of patients disease whether he or she is carrying out physical examination or using the most advanced imaging techniques anatomy is also important for dentists chiropractors physical therapists and all others involved in any aspect of patient treatment that begins with an analysis of clinical signs the ability to interpret clinical observation correctly is therefore the endpoint
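
The first chunk contains the fused token `anatomythat`. The likely cause is that the cleaning step deletes punctuation outright, so two words joined only by a dash collapse into one. The sketch below reproduces the effect with the same regular expression and compares it with a hypothetical alternative, `strip_punctuation_spaced`, that replaces punctuation with a space instead. The sample sentence is an assumption about what the source text looked like.

```python
import re


def strip_punctuation_fusing(text: str) -> str:
    # Same pattern as the notebook: punctuation characters are deleted.
    return re.sub(r"[^\w\s]", "", text)


def strip_punctuation_spaced(text: str) -> str:
    # Hypothetical variant: turn punctuation into a space,
    # then collapse the resulting runs of whitespace.
    spaced = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", spaced).strip()


# Assumed source text: two words joined by a dash (U+2014) with no spaces.
sample = "macroscopic anatomy\u2014that is, the study"
fused = strip_punctuation_fusing(sample)
spaced = strip_punctuation_spaced(sample)
print(fused)
print(spaced)
```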

In [5]:
from langchain_community.embeddings import HuggingFaceEmbeddings

In [6]:
# Download the embedding model from Hugging Face
def download_hugging_face_embeddings():
    embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
    return embeddings


In [8]:
embeddings = download_hugging_face_embeddings()






In [9]:
query_result = embeddings.embed_query("Hello world")
print("Length", len(query_result))

Length 384
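
The Pinecone index created later in this notebook compares these 384-dimensional vectors with the cosine metric. As a minimal sketch of what that metric computes, the function below is plain Python written for illustration. It is not the code Pinecone runs, and `cosine_similarity` is a name introduced here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Dot product of the two vectors divided by the product of their lengths.
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for a zero vector.")
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1, orthogonal vectors score 0.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

The same function can be applied to two results of `embeddings.embed_query(...)` to compare a pair of sentences by hand.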


In [10]:
query_result

[-0.03447727486491203,
 0.03102317824959755,
 0.006734970025718212,
 0.026108985766768456,
 -0.03936202451586723,
 -0.16030244529247284,
 0.06692401319742203,
 -0.006441489793360233,
 -0.0474504791200161,
 0.014758856035768986,
 0.07087527960538864,
 0.05552763119339943,
 0.019193334504961967,
 -0.026251312345266342,
 -0.01010954286903143,
 -0.02694045566022396,
 0.022307461127638817,
 -0.022226648405194283,
 -0.14969263970851898,
 -0.017493007704615593,
 0.00767625542357564,
 0.05435224249958992,
 0.0032543970737606287,
 0.031725890934467316,
 -0.0846213847398758,
 -0.02940601296722889,
 0.05159561336040497,
 0.04812406003475189,
 -0.0033148222137242556,
 -0.058279167860746384,
 0.04196927323937416,
 0.022210685536265373,
 0.1281888335943222,
 -0.022338971495628357,
 -0.011656315997242928,
 0.06292839348316193,
 -0.032876335084438324,
 -0.09122604131698608,
 -0.031175347045063972,
 0.0526994913816452,
 0.04703482985496521,
 -0.08420311659574509,
 -0.030056199058890343,
 -0.02074483036

In [18]:
from dotenv import load_dotenv
load_dotenv()

True

In [17]:
PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')
GOOGLE_API_KEY = os.environ.get('GOOGLE_API_KEY')

In [23]:
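from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec

pc = Pinecone(api_key=PINECONE_API_KEY)

index_name = "cdssrag"

# Create the index only if it does not exist yet, so the cell can be re-run safely.
# The dimension must match the embedding size (384 for all-MiniLM-L6-v2).
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )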

In [20]:
os.environ["PINECONE_API_KEY"] = PINECONE_API_KEY
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
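
`os.environ.get` returns `None` when a variable is missing, and assigning `None` back into `os.environ` raises a `TypeError` that does not say which key was absent. A small guard makes that failure easier to read. `require_env` and `DEMO_KEY` are hypothetical names used only for this sketch.

```python
import os


def require_env(name: str) -> str:
    # Hypothetical helper: fail early with a clear message when a key is missing or empty.
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set. Add it to your .env file.")
    return value


# Demo with a throwaway variable so the sketch runs without real credentials.
os.environ["DEMO_KEY"] = "abc"
demo_value = require_env("DEMO_KEY")
print(demo_value)
```

In this notebook the calls would be `require_env("PINECONE_API_KEY")` and `require_env("GOOGLE_API_KEY")`, placed after `load_dotenv()`.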

In [None]:
# Embed each chunk and upsert the embeddings into your Pinecone index.
from langchain_pinecone import PineconeVectorStore

docsearch = PineconeVectorStore.from_documents(
    documents=documents,
    index_name=index_name,
    embedding=embeddings,
)

In [None]:
from langchain_pinecone import PineconeVectorStore
from tqdm import tqdm

# Batched alternative to the single from_documents call above.
# Run only one of the two cells, otherwise every chunk is uploaded twice.
def upload_documents_in_batches(documents, index_name, embeddings, batch_size=50):
    # Connect to the index once and reuse the store for every batch.
    vector_store = PineconeVectorStore.from_existing_index(
        index_name=index_name,
        embedding=embeddings
    )
    for i in tqdm(range(0, len(documents), batch_size), desc="Uploading to Pinecone"):
        batch = documents[i:i + batch_size]
        try:
            vector_store.add_documents(batch)
        except Exception as e:
            print(f"❌ Error uploading batch {i // batch_size}: {e}")


upload_documents_in_batches(documents, index_name=index_name, embeddings=embeddings)
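
When no IDs are supplied, the vector store appears to assign a random UUID to each chunk, so re-running an upload adds duplicates rather than overwriting. One way around that is to derive a stable ID from the metadata each chunk already carries. The `"<source>#<chunk_id>"` scheme and the `make_chunk_id` helper below are assumptions made for illustration. The sketch only builds the IDs and does not contact Pinecone.

```python
from typing import Any, Dict, List


def make_chunk_id(metadata: Dict[str, Any]) -> str:
    # Hypothetical ID scheme: file name plus chunk position, e.g. "Anatomy_Gray.txt#0".
    return f"{metadata['source']}#{int(metadata['chunk_id'])}"


# Metadata shaped like the entries produced by process_documents.
example_metadata: List[Dict[str, Any]] = [
    {"source": "Anatomy_Gray.txt", "title": "Anatomy_Gray", "chunk_id": 0},
    {"source": "Anatomy_Gray.txt", "title": "Anatomy_Gray", "chunk_id": 1},
]
ids = [make_chunk_id(m) for m in example_metadata]
print(ids)

# Assumed usage, not verified against the installed langchain-pinecone version:
# vector_store.add_documents(batch, ids=[make_chunk_id(d.metadata) for d in batch])
```

Because an upsert with an existing ID replaces the stored vector, a re-run with stable IDs should leave the index size unchanged.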


In [24]:
# Load the existing index

from langchain_pinecone import PineconeVectorStore
# Connect to the already-populated Pinecone index; nothing is embedded or upserted here.
docsearch = PineconeVectorStore.from_existing_index(
    index_name=index_name,
    embedding=embeddings
)

In [25]:
docsearch


<langchain_pinecone.vectorstores.PineconeVectorStore at 0x143347bcfd0>

In [26]:
retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k":3})

In [27]:
retrieved_docs = retriever.invoke("What is Acne?")

In [28]:
retrieved_docs


[Document(id='4b64faeb-bffe-4ecb-836a-dc7728cd9a3b', metadata={'chunk_id': 2517.0, 'source': 'Pediatrics_Nelson.txt', 'title': 'Pediatrics_Nelson'}, page_content='and overgrowth of normal skin flora leading to pilosebaceous occlusion and enlargement androgens are potent stimulus of the sebaceous gland the subsequent inflammatory component and pustule formation results from proliferation of acnes commensal organism of the skin the pathogenesis of acne thus involves three components increased sebum production hyperkeratosis and bacterial proliferation effective treatment focuses on minimizing these factors acne is the most common skin disorder in adolescents occurring in of teenagers the incidence is similar in both sexes although boys often are more severely affected acne may begin as early as years of age and may continue into adulthood acne primarily affects areas with increased sebaceous glanddensity such as the face upper chest and back of the pilosebaceous unit results in to mm ope
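
`chunk_id` was written as an integer but comes back as `2517.0`. This appears to be because Pinecone stores numeric metadata as floating point. If integer IDs are needed downstream, a small normalization step can restore them. `normalize_chunk_metadata` is a hypothetical helper, shown here on the metadata from the result above.

```python
from typing import Any, Dict


def normalize_chunk_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    # Return a copy in which a whole-number float chunk_id is cast back to int.
    normalized = dict(metadata)
    chunk_id = normalized.get("chunk_id")
    if isinstance(chunk_id, float) and chunk_id.is_integer():
        normalized["chunk_id"] = int(chunk_id)
    return normalized


retrieved_metadata = {
    "chunk_id": 2517.0,
    "source": "Pediatrics_Nelson.txt",
    "title": "Pediatrics_Nelson",
}
clean_metadata = normalize_chunk_metadata(retrieved_metadata)
print(clean_metadata)
```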

In [29]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

In [37]:
# Initialize Gemini
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0.4,
    max_output_tokens=500
)

system_prompt = (
    "You are a medical assistant specializing in providing accurate and concise information "
    "based solely on the provided medical context. "
    "Use only the information from the retrieved context below to answer the user's question. "
    "If the answer is not explicitly stated in the context, respond with 'I am sorry, The answer is not available in the provided information.' "
    "Keep your answer factual, clear, and limited to a maximum of seven sentences. "
    "Do not provide personal opinions or make guesses. "
    "This response is not a substitute for professional medical advice."
    "\n\n"
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [38]:
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
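
Conceptually, the "stuff" chain pastes the text of every retrieved chunk into the `{context}` slot of the system prompt before calling the model. The sketch below imitates that step in plain Python, assuming a blank-line separator between chunks. `Chunk`, `build_context` and `fill_system_prompt` are hypothetical stand-ins, not LangChain classes, and the chunk texts are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    # Hypothetical stand-in for a retrieved document: only the text is needed here.
    page_content: str


def build_context(chunks: List[Chunk], separator: str = "\n\n") -> str:
    # Join the chunk texts into one block, in retrieval order.
    return separator.join(chunk.page_content for chunk in chunks)


def fill_system_prompt(template: str, chunks: List[Chunk]) -> str:
    # Substitute the joined text into the {context} placeholder.
    return template.format(context=build_context(chunks))


demo_template = "Answer from the context only.\n\n{context}"
demo_chunks = [Chunk("first retrieved chunk"), Chunk("second retrieved chunk")]
filled_prompt = fill_system_prompt(demo_template, demo_chunks)
print(filled_prompt)
```

With `k=3` and chunks of up to 200 words, the model sees roughly 600 words of context per question, which is all it is allowed to answer from.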

In [39]:
response = rag_chain.invoke({"input": "what is Acromegaly"})
print(response["answer"])

Acromegaly can cause nerve entrapment, particularly of the median nerve. Carpal tunnel syndrome has been identified in a percentage of acromegalic patients. Polyneuropathy is also recognized as a complication of acromegaly, characterized by paresthesia, loss of tendon reflexes in the legs, and atrophy of the distal leg muscles. In some cases, enlarged nerves may occur due to hypertrophic changes in the endoneurial and perineurial tissues. Treatment with bromocriptine and octreotide must be continuous to prevent relapse. If the patient is intolerant of medication or in the case of acromegaly to octreotide and newer drugs the treatment is surgical using transsphenoidal microsurgical approach.
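
The answer above lists complications and treatments rather than a definition, which suggests that none of the three retrieved chunks contained one. The retrieval chain returns those chunks under `response["context"]` next to `response["answer"]`, so they can be inspected directly. The sketch below formats such a list. `RetrievedChunk` and `summarize_sources` are hypothetical, and the sample entries are invented to keep the example runnable offline.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RetrievedChunk:
    # Hypothetical stand-in exposing the two attributes used below.
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def summarize_sources(context: List[RetrievedChunk]) -> List[str]:
    # One line per chunk: the source file and the chunk position within it.
    lines = []
    for doc in context:
        source = doc.metadata.get("source", "unknown")
        chunk_id = int(doc.metadata.get("chunk_id", -1))
        lines.append(f"{source} (chunk {chunk_id})")
    return lines


# Invented response in the same shape as the chain output.
fake_response = {
    "answer": "placeholder answer",
    "context": [
        RetrievedChunk("placeholder text", {"source": "Pediatrics_Nelson.txt", "chunk_id": 2517.0}),
        RetrievedChunk("placeholder text", {"source": "Anatomy_Gray.txt", "chunk_id": 0.0}),
    ],
}
source_lines = summarize_sources(fake_response["context"])
print(source_lines)
```

Calling `summarize_sources(response["context"])` on a real response should work the same way, since the retrieved documents expose `metadata` in the same form.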


In [33]:
response = rag_chain.invoke({"input": "What is generative ai?"})
print(response["answer"])

I am sorry, The answer is not available in the provided information.


In [34]:
response = rag_chain.invoke({"input": "What is gigantism?"})
print(response["answer"])

Gigantism is characterized by a generalized increase in body size with long arms and legs. This condition develops before the epiphyses close, as is the case in prepubertal children, due to excessive levels of growth hormone and IGF. In most instances, gigantism is accompanied by evidence of acromegaly.


In [35]:
response = rag_chain.invoke({"input": "What is Acne?"})
print(response["answer"])

Acne vulgaris, or acne, is a chronic inflammatory disorder affecting areas with the greatest concentration of sebaceous glands, such as the face, chest, and back. It is caused by chronic inflammation of the pilosebaceous unit (hair follicle with an associated sebaceous gland). The primary event in all acne lesions is the development of the microcomedo, which results from the obstruction of the hair follicle with keratin. Increased sebum production from sebaceous glands and overgrowth of normal skin flora lead to pilosebaceous occlusion and enlargement.


In [41]:
response = rag_chain.invoke({"input": "what is the sign and symptoms and the lab results of a person with pneumonia"})
print(response["answer"])

Classic symptoms of pneumonia include sudden onset fever, productive cough, purulent yellow-green sputum or hemoptysis, dyspnea, night sweats, and pleuritic chest pain. Atypical symptoms include gradual onset, dry cough, headaches, myalgias, and sore throat. The lung exam may show bronchial breath sounds, rales, wheezing, dullness to percussion, egophony, and tactile fremitus.

Lab results may include sputum gram stain and culture, blood culture, and ABG. For specific pathogens, tests include Legionella urine antigen test, sputum staining with direct fluorescent antibody (DFA) culture for Legionella, Chlamydia pneumoniae serologic testing/culture/PCR. The white blood cell count with bacterial pneumonias is elevated with a predominance of neutrophils, whereas with viral pneumonias, it is often normal or mildly elevated with a predominance of lymphocytes.


In [42]:
response = rag_chain.invoke({"input": "i have a patient  with 3 years history of difficulty of swallowing after eating solids what could be the differencial diagnosis"})
print(response["answer"])

Based on the information provided, intermittent dysphagia that occurs only with solid food implies structural dysphagia. Episodic dysphagia to solids that is unchanged over years indicates a benign disease process such as Schatzki's ring or eosinophilic esophagitis. Food impaction with prolonged inability to pass an ingested bolus even with ingestion of liquid is typical of structural dysphagia.


In [47]:
response = rag_chain.invoke({"input": "i have a patient  with 3 years history of difficulty of swallowing liquids what could be the specific  differencial diagnosis "})
print(response["answer"])

Based on the provided information, here are potential differential diagnoses for a patient with a 3-year history of difficulty swallowing liquids:

*   **Oropharyngeal dysphagia:** This condition typically involves more difficulty swallowing liquids than solids and can be caused by neurological or muscular issues such as stroke, Parkinson's disease, myasthenia gravis, prolonged intubation, or Zenker's diverticula.
*   **Motility disorders:** Motility disorders such as achalasia, scleroderma, or esophageal spasm can present with difficulty swallowing both liquids and solids.
