# Introduction

This is a simple set of code to get you started. It covers a few core concepts, using LangChain for simplicity (the same steps can also be done without it):

0. To start, you can run this either on your own PC or on Google Colab (for free).

If you are running it on your own PC, create a new directory and a new virtual environment.

```
cd my_project

python -m venv .env

.env\Scripts\activate

pip install torch transformers pdfplumber langchain tiktoken faiss-cpu openai

(.env) C:\my_project
```

1. Loading data
- We use LangChain's PDFPlumberLoader to load the data. PDFPlumber handles some specific cases well and can also extract text.
- LangChain splits the PDF document into pages, and each page has metadata associated with it. This is useful for referencing.

2. Split data into chunks
- We use the text splitter to split the document into chunks of 1500 characters, with an overlap of 150 characters.
- There are other splitters as well; we just need to find the one that suits the data. The 150-character overlap handles edge cases where relevant text falls on a chunk boundary.
- After the split, you can see that the metadata is preserved for each chunk.

3. Load the split text into a vector store; in this case we use the FAISS vector store
- We use OpenAIEmbeddings. In theory, any embedding model should work, as long as the same model is used for both the documents and the queries.
- Note that we currently keep the DB in memory. We can also specify a location to persist the processed data on the machine, for example with the FAISS wrapper's `save_local` and `load_local` methods.

4. Once the vector store is built, we can test it with some queries.
- The example provided shows similarity search and maximal marginal relevance (MMR) search. The latter may give better results because it also favors diversity among the returned chunks.

5. Select a large language model. I have provided an example that uses Hugging Face's FLAN-T5. It does not perform as well as OpenAI's models or Llama 2.


## 1. Load the PDF 

In [4]:
from langchain.document_loaders import PDFPlumberLoader

In [6]:
loader = PDFPlumberLoader("./docs/1 Clinical paper_AI on AMD.pdf")

In [12]:
docs = loader.load()
print(f"{len(docs)} pages loaded...")

10 pages loaded...


In [47]:
print(docs[0].metadata)

{'source': './docs/1 Clinical paper_AI on AMD.pdf', 'file_path': './docs/1 Clinical paper_AI on AMD.pdf', 'page': 1, 'total_pages': 10, 'Creator': 'Arbortext Advanced Print Publisher 9.1.520/W Unicode', 'ModDate': "D:20190726201857-07'00'", 'CreationDate': "D:20190724091338+05'30'", 'Producer': 'Acrobat Distiller 9.0.0 (Windows); modified using iText 4.2.0 by 1T3XT', 'Subject': 'Clinical & Experimental Ophthalmology 0.0:null-null', 'WPS-PROCLEVEL': '3', 'WPS-JOURNALDOI': '10.1111/(ISSN)1442-9071', 'Title': 'Development and validation of a deep‐learning algorithm for the detection of neovascular age‐related macular degeneration from colour fundus photographs', 'WPS-ARTICLEDOI': '10.1111/ceo.13575'}
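
Because every page carries its own metadata, a small helper can turn it into a readable reference when showing answers. The following is a minimal sketch, assuming only the `source`, `page`, and `total_pages` keys seen in the output above; `format_reference` is a hypothetical helper name, not part of LangChain.

```python
import os


def format_reference(metadata: dict) -> str:
    """Build a short citation string from a page's metadata dict.

    Hypothetical helper: assumes the 'source', 'page' and 'total_pages'
    keys seen in the loader output above; missing keys fall back to '?'.
    """
    filename = os.path.basename(metadata.get("source", "?"))
    page = metadata.get("page", "?")
    total = metadata.get("total_pages", "?")
    return f"{filename}, page {page} of {total}"


# Example with a metadata dict shaped like docs[0].metadata
example = {
    "source": "./docs/1 Clinical paper_AI on AMD.pdf",
    "page": 1,
    "total_pages": 10,
}
print(format_reference(example))
```

The same function could be applied to the metadata of each retrieved chunk, since the splitter keeps the page metadata on every chunk.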


## 2. Split Text

In [30]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
chunk_size = 1500
chunk_overlap = 150
separators = ["\n\n", " ", "(?<=\.) ", "", "\n", "(?<=, )"]

splitter = RecursiveCharacterTextSplitter(chunk_size = chunk_size, 
                                          chunk_overlap = chunk_overlap,
                                          separators = separators)

In [31]:
texts = splitter.split_documents(docs)  # use the `splitter` object created in the previous cell
print(f"{len(texts)} chunks created")

50 chunks created
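
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries the separators in order); `chunk_with_overlap` is a hypothetical function used only for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(c)
```

In this toy run each chunk begins with the last three characters of the previous one, which is the same role the 150-character overlap plays for the 1500-character chunks above.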


## 3. Load into vector store

In [35]:
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

db = FAISS.from_documents(texts, OpenAIEmbeddings())
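
Conceptually, a vector store keeps one vector per chunk and returns the chunks whose vectors lie closest to the query vector. The sketch below illustrates that idea with a toy bag-of-words embedding and brute-force cosine similarity. It is a stand-in for illustration only, not how FAISS or `OpenAIEmbeddings` work; `embed`, `cosine`, `search`, and the toy chunks are all assumptions.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (brute force)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "participants were aged 47 to 86 years",
    "the network was trained on fundus photographs",
    "references and acknowledgements",
]
print(search("age of participants", toy_chunks, k=1))
```

A real embedding model would also match "age" with "aged"; the toy version only matches exact words, which is one reason learned embeddings are used in practice.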

## 4. Query the vector store directly

In [37]:
question = "What are the age of participants?"
sim_search_result = db.similarity_search(question) # plain similarity search: returns the chunks whose embeddings are closest to the query
mmr_search_result = db.max_marginal_relevance_search(question) # MMR optimizes for similarity to query and diversity among selected documents.

In [74]:
for r in sim_search_result:
    print(f"Page: {r.metadata['page']}")
    print("-" * 9)
    print(r.page_content[:200])

Page: 4
---------
4 KEELETAL.
FIGURE 2 Thedeepconvolutionalneuralnetworkusedinthisstudy.Datastreamisfromlefttoright.Afundusphotographisfirstlypre-
processedbyscaling,subtractionoflocalspaceaveragecolour,downsizingtheim
Page: 4
---------
86years (mean age 65.3 years, 60.1% female), 14% were DLA grading on three computers operating concurrently,
born in Southern Europe (Greece, Italy or Malta), with the using a custom DLA software that
Page: 6
---------
21777 participants from the MCCS. Of these, there were 0.967, 100% and 92.6%, respectively. This consisted of
FIGURE 4 AMDtruepositive.ImagesA1,B1&C1showoriginalimageswithoutheat-map.A2showsheatmappre
Page: 10
---------
10 KEELETAL.
REFERENCES 16. AgurtoC,BarrigaES,MurrayV,etal.Automaticdetectionofdia-
betic retinopathy and age-related macular degeneration in digital
1. Wong WL, Su X, Li X, et al. Global prevalence o
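
Maximal marginal relevance can be sketched in a few lines. The version below is a simplified illustration in plain Python, not LangChain's implementation; the `mmr` function, its `lambda_mult` weighting parameter, and the toy vectors are assumptions made for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr(query, docs, k=2, lambda_mult=0.5):
    """Greedy MMR: return the indices of up to k selected docs.

    score = lambda_mult * sim(query, doc)
            - (1 - lambda_mult) * max sim(doc, already selected docs)
    """
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            redundancy = max(
                (cosine(docs[i], docs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.3]
docs = [
    [1.0, 0.05],  # 0: closest to the query
    [1.0, 0.0],   # 1: near-duplicate of doc 0
    [0.5, 1.0],   # 2: less similar to the query, but different from doc 0
]
print(mmr(query, docs, k=2, lambda_mult=0.5))  # balances relevance and diversity
print(mmr(query, docs, k=2, lambda_mult=1.0))  # relevance only
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain similarity ranking; lowering it makes the second pick skip the near-duplicate in favor of the more distinct vector.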


## 5. Using an LLM for the Retrieval

In [75]:
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM
from langchain.llms import HuggingFacePipeline

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

pipe = pipeline(
    "text2text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=100
)

local_llm = HuggingFacePipeline(pipeline=pipe)

In [76]:
from langchain.chains import RetrievalQA
qa_chain_t5 = RetrievalQA.from_chain_type(
    local_llm,
    retriever=db.as_retriever()
)

In [77]:
import os

from langchain.chat_models import ChatOpenAI
OPENAI_KEY = os.environ.get("OPENAI_API_KEY") # OpenAI key
chat = ChatOpenAI(temperature=0, openai_api_key=OPENAI_KEY)


In [78]:
from langchain.chains import RetrievalQA
qa_chain_openai = RetrievalQA.from_chain_type(
    chat,
    retriever=db.as_retriever()
)

In [79]:
qa_chain_t5(question)

Token indices sequence length is longer than the specified maximum sequence length for this model (1210 > 512). Running this sequence through the model will result in indexing errors


{'query': 'What are the age of participants?', 'result': '47'}
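
The warning above indicates that the prompt built from the retrieved chunks (1210 tokens) exceeded the model's 512-token limit, which is a likely reason for the terse answer. One way to think about a fix is to keep only as many chunks as fit a token budget. The sketch below is a rough illustration: `fit_to_budget` is a hypothetical helper, and counting whitespace-separated words is only an assumed approximation of real token counts.

```python
def fit_to_budget(chunks: list[str], max_tokens: int) -> list[str]:
    """Keep leading chunks while their combined approximate size fits.

    Assumption: tokens are approximated by whitespace-separated words.
    A real tokenizer (such as the model's own) will usually count more
    tokens, especially for text with missing spaces.
    """
    kept = []
    used = 0
    for chunk in chunks:
        cost = len(chunk.split())
        if used + cost > max_tokens:
            break  # stop at the first chunk that would exceed the budget
        kept.append(chunk)
        used += cost
    return kept


retrieved = ["one two three", "four five", "six seven eight nine"]
print(fit_to_budget(retrieved, max_tokens=5))
```

In the LangChain setup itself, a simpler option may be to retrieve fewer chunks, for example by passing `search_kwargs={"k": 1}` to `db.as_retriever()`, or to use smaller chunks; whether that fits within 512 tokens would need to be checked with the model's tokenizer.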

In [80]:
qa_chain_openai(question)

{'query': 'What are the age of participants?',
 'result': 'The participants in the study were aged 47 to 86 years, with a mean age of 65.3 years.'}