<a href="https://colab.research.google.com/github/coolcat218/Medical-Diagnosis/blob/main/Medical_Diagnosis_Academic_Research.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Medical Diagnosis: Academic Research Reports
This notebook uploads multiple PDFs of academic research reports (sourced from Google Scholar and JSTOR), splits their text into "chunks" (shorter text strings), summarizes each chunk using an LLM, and then enables semantic search over the embedding vectors of the chunk summaries, based on the user's description of symptoms.

## Install necessary libraries

Install libraries for PDF processing, text splitting, and semantic search.


In [None]:
%pip install pypdf langchain sentence-transformers faiss-cpu transformers

Collecting pypdf
  Downloading pypdf-6.0.0-py3-none-any.whl.metadata (7.1 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading pypdf-6.0.0-py3-none-any.whl (310 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: pypdf, faiss-cpu
Successfully installed faiss-cpu-1.12.0 pypdf-6.0.0


## 1) Upload and load pdf documents

Lets the user upload multiple PDF files (I've uploaded 28 files that cover a variety of medical conditions) and then loads their content.


In [None]:
# Import necessary libraries and define the function to load and combine text from multiple PDF files

import gradio as gr
from pypdf import PdfReader

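def load_pdfs(file_paths):
    """Loads text from multiple PDF files and combines it into a single string."""
    page_texts = []
    for file_path in file_paths:
        with open(file_path, "rb") as f:
            reader = PdfReader(f)
            for page in reader.pages:
                # Guard against pages with no extractable text (e.g. scanned images)
                page_texts.append(page.extract_text() or "")
    # Separate pages with a newline so words at page boundaries do not run together
    return "\n".join(page_texts)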

In [None]:
# Use a Gradio interface to let users upload multiple PDF files and consolidate their text into a single string.

iface = gr.Interface(
    fn=load_pdfs,
    inputs=gr.File(label="Upload PDF Files", file_count="multiple", file_types=[".pdf"]),
    outputs="text",
    title="PDF Text Extractor",
    description="Upload multiple PDF files to extract and combine their text content."
)

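# launch() starts the web UI and does not return the extracted text.
# Save the text shown in the output box as combined_text.txt for the next step.
iface.launch()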

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://cd680b3232085b8862.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
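
The text-splitting step that follows reads `combined_text.txt` from disk, so the extracted text has to be written out between the two steps. A minimal sketch of that hand-off, assuming hypothetical helper names `save_combined_text` and `read_combined_text` and UTF-8 output (none of which appear in the notebook):

```python
from pathlib import Path


def save_combined_text(text: str, path: str = "combined_text.txt") -> int:
    """Write the extracted text to disk as UTF-8 and return the character count."""
    # Writing explicit UTF-8 means a later UTF-8 read succeeds without a fallback encoding.
    Path(path).write_text(text, encoding="utf-8")
    return len(text)


def read_combined_text(path: str = "combined_text.txt") -> str:
    """Read the text back with the same encoding it was written in."""
    return Path(path).read_text(encoding="utf-8")
```

In this notebook it could be called as `save_combined_text(load_pdfs(file_paths))` once the list of PDF paths is available.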


## 2) Split text into chunks

Break down the text content of the PDF documents into smaller text strings, i.e. "chunks."


In [None]:
import nltk
# RecursiveCharacterTextSplitter does not use NLTK; the punkt data is only needed by sentence-based splitters
nltk.download('punkt')
nltk.download('punkt_tab')

from langchain.text_splitter import RecursiveCharacterTextSplitter
import os

# Define the path to the combined text file you will upload
COMBINED_TEXT_FILE_PATH = "combined_text.txt" # Assumes the file is in the default Colab directory (/content/)

combined_text = ""
# Read the combined text from the file
if os.path.exists(COMBINED_TEXT_FILE_PATH):
    try:
        # Try reading with utf-8 first
        with open(COMBINED_TEXT_FILE_PATH, "r", encoding="utf-8") as f:
            combined_text = f.read()
        print(f"Successfully loaded text from '{COMBINED_TEXT_FILE_PATH}' with utf-8 encoding.")
    except UnicodeDecodeError:
        # If utf-8 fails, try latin-1
        try:
            with open(COMBINED_TEXT_FILE_PATH, "r", encoding="latin-1") as f:
                combined_text = f.read()
            print(f"Successfully loaded text from '{COMBINED_TEXT_FILE_PATH}' with latin-1 encoding.")
        except Exception as e:
            print(f"Error reading from file '{COMBINED_TEXT_FILE_PATH}' with latin-1 encoding: {e}")
            combined_text = "" # Ensure combined_text is empty on error
    except Exception as e:
        print(f"Error reading from file '{COMBINED_TEXT_FILE_PATH}': {e}")
        combined_text = "" # Ensure combined_text is empty on error
else:
    print(f"Error: File '{COMBINED_TEXT_FILE_PATH}' not found.")
    print("Please upload 'combined_text.txt' using the File Explorer on the left sidebar.")

# Proceed with splitting only if combined_text has content
if combined_text:
    # Define chunk size and overlap using variables
    chunk_size = 10000  # Example chunk size
    chunk_overlap = 400 # Example overlap

    # Initialize RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )

    # create_documents expects a list of strings or Document objects
    # We pass a list containing our single combined_text string
    document_chunks = text_splitter.create_documents([combined_text])

    # Print chunk size and overlap from the variables
    print(f"Created {len(document_chunks)} document chunks with chunk_size={chunk_size} and chunk_overlap={chunk_overlap}.")
    # Display the first few chunks to verify
    if len(document_chunks) > 0:
        print("First chunk:")
        print(document_chunks[0].page_content[:500] + "...") # Print first 500 chars of the first chunk
    if len(document_chunks) > 1:
        print("\nSecond chunk:")
        print(document_chunks[1].page_content[:500] + "...") # Print first 500 chars of the second chunk
else:
    print("No text content available to split into chunks.")
    document_chunks = [] # Ensure document_chunks is defined even if no text is loaded

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Successfully loaded text from 'combined_text.txt' with latin-1 encoding.
Created 212 document chunks with chunk_size=10000 and chunk_overlap=400.
First chunk:
Biochemistry and Biophysics Reports 36 (2023) 101578
Available online 23 November 2023
2405-5808/© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Acne vulgaris: A review of the pathophysiology, treatment, and recent 
nanotechnology based advances 
Mallikarjun Vasam
a , *
, Satyanarayana Korutla
a
, Raghvendra Ashok Bohara
b , c , ** 
a
Chaitanya (Deemed to Be University)-Pharmacy, Hanamkonda, Wa...

Second chunk:
tory lesions such as ulcers (pustules, nodules, cysts and papules). 
Non-inflammatory lesions are smaller and less pus-filled than in -
flammatory lesions [ 22 , 23 ]. In addition, it was discovered that 
neutrophils produce reactive oxygen species (ROS), which damage 
the follicular epithelium and contribute to acne inf
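
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, which this sketch (and its hypothetical name `chunk_with_overlap`) does not attempt.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence cut by a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.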

## 3) Summarize chunks using an LLM

Load a chat model and a summarization chain, then iterate through the document chunks to generate a summary for each one.



In [None]:
%pip install langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB)
Downloading langchain_community-0.3.27-py3-none-any.whl (2.5 MB)
Downloading dataclasses_json-0.6.7-py3-none-any.whl (

In [None]:
from google.colab import userdata
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
import os
try:
    openai_api_key = userdata.get('OPENAI_API_KEY')
except Exception as e:
    print(f"Error retrieving OpenAI API key from Colab secrets: {e}")
    openai_api_key = None # Ensure api_key is None if retrieval fails

# Instantiate an LLM using the API key from colab secrets
if openai_api_key:
    llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo", openai_api_key=openai_api_key)

    summarize_chain = load_summarize_chain(llm, chain_type="map_reduce")

    # Iterate through chunks and generate summaries of the chunks
    summaries = []
    if 'document_chunks' in globals() and document_chunks:
        for i, chunk in enumerate(document_chunks):
            print(f"Summarizing chunk {i+1}/{len(document_chunks)}...")
            try:
                summary = summarize_chain.run([chunk])
                summaries.append({"chunk": chunk, "summary": summary})
            except Exception as e:
                print(f"Error summarizing chunk {i+1}: {e}")
                summaries.append({"chunk": chunk, "summary": f"Error summarizing chunk: {e}"})

        print("\nGenerated Summaries:")
        for i, item in enumerate(summaries[:3]): # Print first 3 summaries
            print(f"--- Summary {i+1} ---")
            print(item['summary'])
            print("-" * 20)
    else:
        print("Error: 'document_chunks' not found or is empty. Please ensure the text splitting step was successful.")

else:
    print("OpenAI API key not loaded. Please store your key in Colab secrets with the name 'OPENAI_API_KEY'.")

Summarizing chunk 1/212...


Summarizing chunk 2/212...
Summarizing chunk 3/212...
Summarizing chunk 4/212...
Summarizing chunk 5/212...
Summarizing chunk 6/212...
Summarizing chunk 7/212...
Summarizing chunk 8/212...
Summarizing chunk 9/212...
Summarizing chunk 10/212...
Summarizing chunk 11/212...
Summarizing chunk 12/212...
Summarizing chunk 13/212...
Summarizing chunk 14/212...
Summarizing chunk 15/212...
Summarizing chunk 16/212...
Summarizing chunk 17/212...
Summarizing chunk 18/212...
Summarizing chunk 19/212...
Summarizing chunk 20/212...
Summarizing chunk 21/212...
Summarizing chunk 22/212...
Summarizing chunk 23/212...
Summarizing chunk 24/212...
Summarizing chunk 25/212...
Summarizing chunk 26/212...
Summarizing chunk 27/212...
Summarizing chunk 28/212...
Summarizing chunk 29/212...
Summarizing chunk 30/212...
Summarizing chunk 31/212...
Summarizing chunk 32/212...
Summarizing chunk 33/212...
Summarizing chunk 34/212...
Summarizing chunk 35/212...
Summarizing chunk 36/212...
Summarizing chunk 37/212...
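
With 212 sequential API calls, a transient failure (a rate limit or timeout, for example) would leave an error string in place of a summary. One possible mitigation is a small retry wrapper with exponential backoff. This is a sketch under assumptions: `with_retries` and its parameters are hypothetical names, and it retries on any exception rather than only on errors known to be transient.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(), retrying on any exception with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Inside the summarization loop this could wrap the call as `with_retries(lambda: summarize_chain.run([chunk]))`, leaving the existing `try`/`except` to record chunks that still fail after the last attempt.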


In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

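# Initialize the embedding model (the library's default sentence-transformers model)
embedding_model = HuggingFaceEmbeddings()

# Collect the summary texts, skipping entries where summarization failed and an error message was stored instead
summary_texts = [
    item['summary'] for item in summaries
    if item.get('summary') and not item['summary'].startswith("Error summarizing chunk")
]

# Create a FAISS index from the summaries and the embedding model, if any valid summaries exist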
if summary_texts:
    vector_store = FAISS.from_texts(summary_texts, embedding_model)
    print("FAISS index created successfully.")
else:
    vector_store = None
    print("No valid summaries found to create FAISS index.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


FAISS index created successfully.


In [None]:
# save the FAISS index to disk for later use.
if vector_store is not None:
    vector_store.save_local("faiss_index")
    print("FAISS index saved to disk as 'faiss_index'.")
else:
    print("No FAISS index to save.")

FAISS index saved to disk as 'faiss_index'.


## 4) Query the data available from research reports

Allow the user to input a query (e.g. symptoms they have experienced and medications they are currently taking) and retrieve the relevant summarized chunks.

Load the saved FAISS index, take the user's query, and perform a similarity search across the embedded summary vectors.
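
As a rough picture of what a similarity search does, the sketch below ranks toy vectors by distance to a query vector. It assumes Euclidean (L2) distance, which to my understanding is the default for LangChain's FAISS wrapper; the function name `top_k_nearest` and the 2-D vectors are made up for illustration, whereas real embeddings have hundreds of dimensions and FAISS uses optimized index structures.

```python
import math
from typing import List, Sequence, Tuple


def top_k_nearest(query: Sequence[float], vectors: List[Sequence[float]],
                  k: int) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors closest to query."""
    # Brute force: measure the distance from the query to every stored vector.
    distances = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
    # Smaller distance means more similar, so sort ascending and keep the first k.
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]


toy_vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
nearest = top_k_nearest((0.9, 1.2), toy_vectors, k=2)
print([index for index, _ in nearest])
# → [1, 0]
```

In the notebook, the query string is first embedded with the same model used for the summaries, and `similarity_search(..., k=10)` returns the ten closest summaries.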


In [None]:
import nltk
# Download the required NLTK data directly as suggested by the LookupError
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
%pip install faiss-cpu



Use Gradio interface to take the relevant chunks from the semantic search, pass them to an LLM with a prompt to act as a medical AI and suggest common medical conditions, and display the LLM's output as an AI-generated diagnosis.
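
A "stuff"-style chain concatenates the retrieved snippets and substitutes them into a single prompt. The sketch below imitates that step with plain string formatting so the final prompt can be inspected before any LLM call is made. The template wording, the field names, and `build_prompt` are illustrative assumptions, not LangChain's implementation.

```python
from typing import List

# Illustrative template; the wording and field names are assumptions.
TEMPLATE = (
    "Reported symptoms:\n{symptoms}\n\n"
    "Relevant text snippets:\n{text}\n\n"
    "Suggested medical conditions:\n"
)


def build_prompt(symptoms: str, snippets: List[str]) -> str:
    """Join the non-empty snippets with blank lines and fill them into the template."""
    joined = "\n\n".join(s.strip() for s in snippets if s.strip())
    return TEMPLATE.format(symptoms=symptoms.strip(), text=joined)


print(build_prompt("itchy rash, mild fever", ["Snippet A.", "Snippet B."]))
```

Printing the assembled prompt is a cheap way to confirm that both the user's symptoms and the retrieved text actually reach the model.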

In [None]:
import gradio as gr
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate
from google.colab import userdata
import os
import pandas as pd

FAISS_INDEX_PATH = "faiss_index"
DRUG_LIST_FILE_PATH = "FAERS_Q125_Symptom_Sum.csv"

# Load the saved FAISS index
try:
    embeddings = HuggingFaceEmbeddings()
    vector_store = FAISS.load_local(FAISS_INDEX_PATH, embeddings, allow_dangerous_deserialization=True)
    print("FAISS index loaded successfully.")

    # Load the OpenAI API key from Colab secrets
    try:
        openai_api_key = userdata.get('OPENAI_API_KEY')
    except Exception as e:
        print(f"Error retrieving OpenAI API key from Colab secrets: {e}")
        openai_api_key = None

    drug_list = []
    if os.path.exists(DRUG_LIST_FILE_PATH):
        try:
            drug_df = pd.read_csv(DRUG_LIST_FILE_PATH)
            # Convert all entries in the first column to strings and also filter
            drug_list = drug_df.iloc[:, 0].astype(str).tolist()
            drug_list = [str(drug).strip() for drug in drug_list if pd.notna(drug) and str(drug).strip()]

            print(f"Successfully loaded {len(drug_list)} drug names from '{DRUG_LIST_FILE_PATH}'.")
        except Exception as e:
            print(f"Error loading drug list from '{DRUG_LIST_FILE_PATH}': {e}")
            drug_list = []
    else:
        print(f"Warning: Drug list file '{DRUG_LIST_FILE_PATH}' not found. Drug occurrences will not be checked.")


    # Define the medical AI prompt; {symptoms} carries the user's query and {text} the retrieved summaries
    medical_ai_prompt = """You are a medical AI analyzing research text snippets.
Based on the following text snippets, suggest common medical conditions that could explain the symptoms reported by the user.
This is an AI-generated suggestion and not a substitute for professional medical advice.

Reported Symptoms:
{symptoms}

Relevant Text Snippets:
{text}

Suggested Medical Conditions/Insights:
"""
    medical_ai_prompt_template = PromptTemplate(
        template=medical_ai_prompt,
        input_variables=["symptoms", "text"]
    )

    def perform_search(user_query):
        """Performs semantic search based on the user query, checks for drug occurrences, generates AI diagnosis, and returns it"""
        user_query_str = str(user_query)

        if not user_query_str.strip():
            return "Please enter your experienced symptoms or query."

        if vector_store is None:
            return "FAISS index not loaded. Cannot perform search."

        ai_diagnosis = ""

        try:
            # Log any known drug names that appear in the query (substring match)
            found_drugs = []
            user_query_lower = user_query_str.lower()
            for drug in drug_list:
                if isinstance(drug, str) and drug.strip():
                    try:
                        if drug.lower() in user_query_lower:
                            found_drugs.append(drug)
                    except Exception as e:
                        print(f"Error checking drug '{drug}': {e}")
            if found_drugs:
                print(f"Found the following drugs in the query: {', '.join(found_drugs)}")

            # Perform a similarity search based on the user query
            relevant_chunks = vector_store.similarity_search(user_query_str, k=10)

            if relevant_chunks:
                if openai_api_key:
                    llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo", openai_api_key=openai_api_key)

                    summarize_chain = load_summarize_chain(llm, chain_type="stuff", prompt=medical_ai_prompt_template)

                    # Generate AI diagnosis from the relevant chunks and the user's own symptoms
                    try:
                        ai_diagnosis = summarize_chain.run(input_documents=relevant_chunks, symptoms=user_query_str)
                    except Exception as e:
                        ai_diagnosis = f"Error generating AI diagnosis: {e}"
                else:
                    ai_diagnosis = "OpenAI API key not loaded. Cannot generate AI diagnosis."
            else:
                ai_diagnosis = "No relevant information found to generate a diagnosis."

            return ai_diagnosis

        except Exception as e:
            return f"An error occurred during search: {e}"


    # Create a Gradio interface for the query
    iface = gr.Interface(
        fn=perform_search,
        inputs=gr.Textbox(label="Enter your symptoms or query"),
        outputs="text",
        title="Medical Research Semantic Search",
        description="Enter symptoms or a medical query to find relevant information from the research reports. The system will check your query for known drug names."
    )

    print("\nLaunching Gradio interface for query input...")
    iface.launch()


except Exception as e:
    print(f"An error occurred while loading the FAISS index: {e}")
    print("Could not load the FAISS index. Ensure 'faiss_index' directory exists and contains the index.")

FAISS index loaded successfully.
Successfully loaded 2740 drug names from 'FAERS_Q125_Symptom_Sum.csv'.

Launching Gradio interface for query input...
It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://f9e93e1b5ec609f0ef.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
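
The drug check in `perform_search` uses a substring test (`drug.lower() in user_query_lower`), so a short name can match inside a longer, unrelated word. A possible refinement is whole-word matching with a regular expression. The helper name `find_drug_mentions` is hypothetical, and the sketch assumes drug names are plain strings like those loaded from the CSV.

```python
import re
from typing import Iterable, List


def find_drug_mentions(query: str, drug_names: Iterable[str]) -> List[str]:
    """Return the drug names that occur in query as whole words, case-insensitively."""
    found = []
    for name in drug_names:
        name = name.strip()
        if not name:
            continue
        # (?<!\w) and (?!\w) require a non-word character (or the string edge)
        # on both sides, so "iron" does not match inside "environment".
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, query, flags=re.IGNORECASE):
            found.append(name)
    return found


print(find_drug_mentions("Headaches in a noisy environment; taking Iron daily", ["iron", "aspirin"]))
# → ['iron']
```

Looping over a few thousand names per query is fast enough for interactive use; for much larger lists, a single compiled alternation pattern would likely be quicker.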
