
In [1]:
%pip install llama-index-core llama-parse llama-index-readers-file llama-index-vector-stores-lancedb llama-index-llms-openai llama-index-embeddings-openai

Collecting llama-index-core
  Downloading llama_index_core-0.10.62-py3-none-any.whl.metadata (2.4 kB)
Collecting llama-parse
  Downloading llama_parse-0.4.9-py3-none-any.whl.metadata (4.4 kB)
Collecting llama-index-readers-file
  Downloading llama_index_readers_file-0.1.32-py3-none-any.whl.metadata (5.4 kB)
Collecting llama-index-vector-stores-lancedb
  Downloading llama_index_vector_stores_lancedb-0.1.7-py3-none-any.whl.metadata (720 bytes)
Collecting llama-index-llms-openai
  Downloading llama_index_llms_openai-0.1.28-py3-none-any.whl.metadata (650 bytes)
Collecting llama-index-embeddings-openai
  Downloading llama_index_embeddings_openai-0.1.11-py3-none-any.whl.metadata (655 bytes)

In [2]:
from llama_parse import LlamaParse
from llama_index.vector_stores.lancedb import LanceDBVectorStore
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex, StorageContext, SimpleDirectoryReader

In [3]:
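import os

import nest_asyncio
nest_asyncio.apply()  # lets LlamaParse's async calls run inside the notebook's event loop

from google.colab import userdata

# API access to llama-cloud
LLAMA_CLOUD_API_KEY = userdata.get('LLAMACLOUD_KEY')

# Using OpenAI API for embeddings/llms
OPENAI_API_KEY = userdata.get('KHCC_OPENAI')
# LlamaIndex's OpenAI integrations read the key from this environment variable by default
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY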



---



Testing the pipeline's ability to parse a scanned PDF and retrieve information from it

In [7]:
parser = LlamaParse(
    result_type="markdown",
    api_key=LLAMA_CLOUD_API_KEY
)

file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(input_files=['/content/original1.pdf'], file_extractor=file_extractor).load_data()
print(documents)

Started parsing the file under job_id b815fbcd-9798-40dc-94a1-ffb88094649a
[Document(id_='658bf2da-c994-4457-aab3-55c6cf649d54', embedding=None, metadata={'file_path': '/content/original1.pdf', 'file_name': 'original1.pdf', 'file_type': 'application/pdf', 'file_size': 2205273, 'creation_date': '2024-08-07', 'last_modified_date': '2024-08-07'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='# King Hussein Cancer Center\n\n# Department of Cell Therapy and Applied Genomics\n\nQueen Rania Al-Abdullah Str.\n\nTel:(962-6)5300460\n\nFax:(962-6)5342567\n\nP.O.Box 1269 Amman 11941 Jordan\n\n# NEXT-GENERATION SEQUENCING (NGS) - MYELOID NEOPLASMS PANEL\n\n|PATIENT NAME:|DATE OF BIRTH: (DD MM YYYY)|GENDER:|\n|---|---|---|\n|AMEER ATEF ADEL ABBAS|17/07/2016|MALE|\n|
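Printing the whole `Document` list yields one very long repr that is hard to scan. The following is a minimal sketch of a preview helper. It assumes only that each parsed document exposes `.text` and a dict-like `.metadata`, as the output above shows; `preview_documents` is a hypothetical name, not part of LlamaIndex.

```python
def preview_documents(docs, max_chars=300):
    """Return one short, readable preview string per parsed document."""
    previews = []
    for doc in docs:
        # metadata is a plain dict in the output above; fall back if the key is missing
        name = doc.metadata.get("file_name", "<unknown>")
        text = doc.text
        snippet = text[:max_chars]
        if len(text) > max_chars:
            snippet += " [...]"  # mark that the preview was cut short
        previews.append(f"{name} ({len(text)} chars)\n{snippet}")
    return previews

# Usage in the notebook (assumes `documents` from the cell above):
# for preview in preview_documents(documents):
#     print(preview)
```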

In [9]:

index = VectorStoreIndex.from_documents(documents)

# create a query engine for the index
query_engine = index.as_query_engine()

# query the engine
query = "What pathogenic mutations are present in this patient?"
response = query_engine.query(query)
print(response)

The pathogenic mutation present in this patient is FLT3 internal tandem duplication (ITD).
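A one-sentence answer does not show which part of the report supported it. LlamaIndex response objects expose the retrieved chunks as `response.source_nodes`, each carrying a `score` and a `node`. The sketch below formats them for inspection; `summarize_sources` is a hypothetical helper written against those attributes only.

```python
def summarize_sources(response, max_chars=200):
    """List the retrieved chunks behind an answer as (score, snippet) pairs."""
    summary = []
    for item in response.source_nodes:
        # Collapse whitespace so Markdown table rows fit on one line
        snippet = " ".join(item.node.get_content().split())[:max_chars]
        summary.append((item.score, snippet))
    return summary

# Usage in the notebook (assumes `response` from the cell above):
# for score, snippet in summarize_sources(response):
#     print(score, snippet)
```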




---



Testing the model's ability to parse and retrieve information from the same PDF after running it through Tesseract OCR

In [16]:
document_OCR = SimpleDirectoryReader(input_files=['/content/1.pdf'], file_extractor=file_extractor).load_data()

Started parsing the file under job_id 3ff32bfa-86b8-408b-9edb-e36171c7efb5


In [17]:
index = VectorStoreIndex.from_documents(document_OCR)

# create a query engine for the index
query_engine = index.as_query_engine()

# query the engine
query = "What pathogenic mutations are present in this patient?"
response = query_engine.query(query)
print(response)

No pathogenic mutations were identified in this patient based on the provided context information.
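The scanned and OCR-processed versions of the same report give contradictory answers. Before attributing that to retrieval or to the LLM, it may help to check whether the relevant terms survived parsing at all. The sketch below counts occurrences of a few terms in the parsed text; the term list is an assumption chosen for this report, and `find_terms` is a hypothetical helper.

```python
import re

def find_terms(text, terms):
    """Map each term to its number of case-insensitive occurrences in text."""
    return {
        term: len(re.findall(re.escape(term), text, flags=re.IGNORECASE))
        for term in terms
    }

# Hypothetical terms of interest; adjust to the report being checked
TERMS = ["FLT3", "ITD", "pathogenic"]

# Usage in the notebook (assumes `documents` and `document_OCR` from the cells above):
# print(find_terms("\n".join(d.text for d in documents), TERMS))
# print(find_terms("\n".join(d.text for d in document_OCR), TERMS))
```

If a term is present in the first parse but missing from the second, the information was probably lost before indexing rather than during retrieval.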


Another document that has been run through OCR

In [18]:
document_OCR = SimpleDirectoryReader(input_files=['/content/3.pdf'], file_extractor=file_extractor).load_data()

Started parsing the file under job_id d3d87ba4-75e6-4d20-bf68-573f8053e992


In [19]:
index = VectorStoreIndex.from_documents(document_OCR)

# create a query engine for the index
query_engine = index.as_query_engine()

# query the engine
query = "What pathogenic mutations are present in this patient?"
response = query_engine.query(query)
print(response)

The presence of pathogenic mutations in this patient cannot be definitively determined based on the information provided.
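The load, index, and query steps are repeated verbatim for each file. One way to compare files side by side is to wrap them in a loop. In this sketch the loader and index builder are passed in as callables, so the helper itself does not depend on any particular library; `query_each_pdf` and its parameter names are hypothetical.

```python
def query_each_pdf(paths, question, load_documents, build_index):
    """Ask the same question of each PDF, building a fresh index per file."""
    answers = {}
    for path in paths:
        docs = load_documents(path)
        engine = build_index(docs).as_query_engine()
        answers[path] = str(engine.query(question))
    return answers

# Usage in the notebook (assumes `file_extractor` from the parsing cell):
# answers = query_each_pdf(
#     ["/content/original1.pdf", "/content/1.pdf", "/content/3.pdf"],
#     "What pathogenic mutations are present in this patient?",
#     load_documents=lambda p: SimpleDirectoryReader(
#         input_files=[p], file_extractor=file_extractor
#     ).load_data(),
#     build_index=VectorStoreIndex.from_documents,
# )
# for path, answer in answers.items():
#     print(path, "->", answer)
```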
