#**Step 01: Install All the Required Packages**

In [1]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken

Collecting openai
  Downloading openai-1.8.0-py3-none-any.whl.metadata (18 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Using cached distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.26.0-py3-none-any.whl.metadata (7.6 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.2-py3-none-any.whl.metadata (20 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Downloading openai-1.8.0-py3-none-any.whl (222 kB)

In [17]:
!pip install langchain-openai

Collecting langchain-openai
  Downloading langchain_openai-0.0.3-py3-none-any.whl.metadata (2.5 kB)
Collecting langchain-core<0.2,>=0.1.13 (from langchain-openai)
  Downloading langchain_core-0.1.13-py3-none-any.whl.metadata (5.9 kB)
Collecting langsmith<0.0.84,>=0.0.83 (from langchain-core<0.2,>=0.1.13->langchain-openai)
  Downloading langsmith-0.0.83-py3-none-any.whl.metadata (10 kB)
Downloading langchain_openai-0.0.3-py3-none-any.whl (28 kB)
Downloading langchain_core-0.1.13-py3-none-any.whl (228 kB)
Downloading langsmith-0.0.83-py3-none-any.whl (49 kB)
Installing collected packages: langsmith, langchain-core, 

#**Step 02: Import All the Required Libraries**

In [2]:
# PdfReader (from PyPDF2) opens a PDF document and lets us extract the text of each page
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS  # FAISS is the only vector store used in this notebook

#**Step 03: Set Up the Environment**

In [24]:
import os
os.environ['OPENAI_API_KEY'] = 'your key here'
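
A key pasted into a cell is easy to leak when the notebook is shared. As a minimal sketch, assuming the key may already be exported in the shell that launched the notebook, it can be read from the environment and only prompted for when missing. `ensure_api_key` is a hypothetical helper name, not part of any library.

```python
import os
from getpass import getpass


def ensure_api_key(env=os.environ, prompt=getpass, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, prompting only if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        # getpass hides the typed value, so it never shows up in the cell output.
        key = prompt(f"Enter {name}: ").strip()
        if not key:
            raise ValueError(f"{name} was not provided")
        env[name] = key
    return key
```

Calling `ensure_api_key()` once at the top of the notebook leaves `os.environ['OPENAI_API_KEY']` set for the later cells.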

#**Step 04: Extract Text from the PDF Document Using PdfReader**

In [4]:
reader = PdfReader('patientsystem.pdf')

#**Step 05: Read Data From the PDF File and put it into a variable raw_text**

In [5]:
# Visit every page and append its text; raw_text ends up holding the text of the whole document
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text
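
Concatenating pages with no separator fuses the last line of one page with the first line of the next (the `UK184 P .` fragment in the third chunk further down is one such seam). A small sketch of a variant that joins pages with a newline; `join_page_text` is a hypothetical helper and only assumes each page object exposes `extract_text()`.

```python
def join_page_text(pages, separator="\n"):
    """Concatenate the text of every page, skipping pages that yield no text."""
    parts = []
    for page in pages:
        text = page.extract_text()
        if text:  # skip empty results, e.g. pages that contain only images
            parts.append(text)
    return separator.join(parts)
```

With the reader above this would be called as `raw_text = join_page_text(reader.pages)`.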

In [6]:
raw_text

'Vol.:(0123456789)The Patient - Patient-Centered Outcomes Research (2023) 16:183–199 \nhttps://doi.org/10.1007/s40271-023-00619-w\nPRACTICAL APPLICATION\nSo You Want to\xa0Build Your Disease’s First Online Patient Registry: \nAn\xa0Educational Guide for\xa0Patient Organizations Based on\xa0US \nand\xa0European Experience\nPaul\xa0Wicks1 \xa0· Lindsey\xa0Wahlstrom‑Edwards2 \xa0· Sam\xa0Fillingham3 \xa0· Andrea\xa0Downing4\xa0· Elin\xa0Haf\xa0Davies5 \nAccepted: 22 February 2023 / Published online: 22 March 2023 \n© The Author(s) 2023\nAbstract\nPatient registries fulfill a number of key roles for clinicians, researchers, non-profit organizations, payers, and policy makers. \nThey can help the field understand the natural history of a condition, determine the effectiveness of interventions, measure \nsafety, and audit the quality of care provided. Successful registries in cystic fibrosis, Duchenne’s muscular dystrophy, and \nother rare diseases have become a model for accelerating progre

In [7]:
raw_text[:1]

'V'

In [8]:
raw_text[:100]

'Vol.:(0123456789)The Patient - Patient-Centered Outcomes Research (2023) 16:183–199 \nhttps://doi.org'

#**Step 06: Split Text into Smaller Chunks**

In [9]:
# Split the text into smaller chunks so that the context retrieved later stays within the model's token limit.
# OpenAI models such as GPT-3.5 and GPT-4 have a maximum context length, which restricts the input size.
# At the time of writing, the limit is 4,096 tokens for gpt-3.5-turbo, 8,192 for gpt-4 and 32,768 for gpt-4-32k.

textsplitter = CharacterTextSplitter(
    separator="\n",
    # length_function=len measures size in characters, not tokens:
    # each chunk holds at most 1000 characters, and consecutive chunks share
    # up to 200 characters (the trailing lines of the previous chunk).
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
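
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified pure-Python sketch of line-based chunking with overlap. It is an illustration written for this notebook, not LangChain's implementation, and `split_lines_with_overlap` is a hypothetical name. Sizes are counted in characters, as with `length_function=len`.

```python
def split_lines_with_overlap(text, chunk_size=1000, chunk_overlap=200, separator="\n"):
    """Greedily pack whole lines into chunks of at most chunk_size characters.

    Each new chunk starts with the trailing lines of the previous chunk, as long
    as those lines total no more than chunk_overlap characters. A single line
    longer than chunk_size is kept whole and becomes an oversized chunk.
    """
    chunks, current = [], []

    def joined_len(lines):
        return len(separator.join(lines))

    for line in text.split(separator):
        if current and joined_len(current + [line]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop lines from the front until the carried-over tail fits the
            # overlap budget and still leaves room for the incoming line.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [line]) > chunk_size
            ):
                current = current[1:]
        current.append(line)
    if current:
        chunks.append(separator.join(current))
    return chunks
```

Because only whole lines are carried over, the overlap is usually shorter than 200 characters.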

In [10]:
# Use the splitter to break raw_text into chunks
texts = textsplitter.split_text(raw_text)

In [11]:
len(texts)

103
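
The chunk size is set in characters while the model limit is in tokens, so a rough conversion helps check the budget. The sketch below assumes about 4 characters per token, a common rule of thumb for English text and not an exact figure; the `tiktoken` package installed in Step 01 can give exact counts. The function names and the `reserved_for_answer` default are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate (assumes ~4 characters per token)."""
    return -(-len(text) // chars_per_token)  # ceiling division


def fits_context(chunks, question, context_limit=4096, reserved_for_answer=256):
    """Check whether the chunks plus the question leave room for the answer."""
    prompt_tokens = sum(estimate_tokens(c) for c in chunks) + estimate_tokens(question)
    return prompt_tokens + reserved_for_answer <= context_limit
```

Under this estimate, four retrieved chunks of 1000 characters come to roughly 1000 tokens, well inside a 4,096-token limit.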

In [12]:
texts[0]

'Vol.:(0123456789)The Patient - Patient-Centered Outcomes Research (2023) 16:183–199 \nhttps://doi.org/10.1007/s40271-023-00619-w\nPRACTICAL APPLICATION\nSo You Want to\xa0Build Your Disease’s First Online Patient Registry: \nAn\xa0Educational Guide for\xa0Patient Organizations Based on\xa0US \nand\xa0European Experience\nPaul\xa0Wicks1 \xa0· Lindsey\xa0Wahlstrom‑Edwards2 \xa0· Sam\xa0Fillingham3 \xa0· Andrea\xa0Downing4\xa0· Elin\xa0Haf\xa0Davies5 \nAccepted: 22 February 2023 / Published online: 22 March 2023 \n© The Author(s) 2023\nAbstract\nPatient registries fulfill a number of key roles for clinicians, researchers, non-profit organizations, payers, and policy makers. \nThey can help the field understand the natural history of a condition, determine the effectiveness of interventions, measure \nsafety, and audit the quality of care provided. Successful registries in cystic fibrosis, Duchenne’s muscular dystrophy, and'

In [13]:
texts[1]

'safety, and audit the quality of care provided. Successful registries in cystic fibrosis, Duchenne’s muscular dystrophy, and \nother rare diseases have become a model for accelerating progress. However, the complex tasks required to develop a modern \nregistry can seem overwhelming, particularly for those who are not from a technical background. In this Education article, \na team of co-authors from across patient advocacy, technology, privacy, and commercial perspectives who have worked on \na number of such projects offer a “Registry 101” primer to help get started. We will outline the promise and potential of \npatient registries with worked case examples, identify some of the key technical considerations you will need to consider, \ndescribe the type of data you might want to collect, consider privacy risks to protect your users, sketch out some of the paths \ntowards long-term financial sustainability we have observed, and conclude with plans to mitigate some of the challenges th

In [14]:
texts[2]

'towards long-term financial sustainability we have observed, and conclude with plans to mitigate some of the challenges that \ncan occur and signpost interested readers to further resources. While rapid growth in the digital health market has presented \nnumerous opportunities to those at the beginning of their journey, it is important to start with the long-term goals in mind \nand to benefit from the learnings of those who have walked this path before.\n * Paul Wicks \n paul@wicksdigitalhealth.com\n1 Wicks Digital Health Ltd, Lichfield, UK\n2 Sano Genetics, Cambridge, UK\n3 PIP-UK, Stockport, UK\n4 Light Collective, Eugene, OR, USA\n5 Aparito Ltd, Wrexham, UK184 P .\xa0Wicks et al.\nKey Points for Decision Makers  \nPatient organizations are frequently encouraged by third \nparties to “build a registry” but are offered little guidance \non whether that is the right decision, and if so, whether \nto build, buy, or borrow one from a platform provider.'

#**Step 07: Download Embeddings from OpenAI**

In [25]:
# Each chunk needs an embedding vector; we compute these with OpenAI's
# text embedding model through the langchain_openai package
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [26]:
# LangChain supports several vector stores; we use FAISS here.
# FAISS.from_texts embeds every text chunk and stores the vectors in a searchable index (docsearch)
docsearch = FAISS.from_texts(texts, embeddings)
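
Conceptually, a similarity search embeds the query and returns the chunks whose vectors lie closest to it. The sketch below does this by brute force over made-up 2-dimensional vectors, assuming Euclidean distance. It is only an illustration of the idea, not how FAISS is implemented, and all names and values in it are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_texts(query_vector, vectors, texts, k=2):
    """Return the k texts whose vectors are closest to query_vector."""
    ranked = sorted(
        zip(vectors, texts),
        key=lambda pair: l2_distance(query_vector, pair[0]),
    )
    return [text for _, text in ranked[:k]]


# Toy 2-dimensional "embeddings", invented for illustration only.
toy_texts = ["registry privacy", "funding models", "data standards"]
toy_vectors = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]]
print(nearest_texts([1.0, 0.0], toy_vectors, toy_texts))
```

Real embeddings have far more dimensions, which is why a dedicated index such as FAISS is used instead of a sorted list.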

In [29]:
from langchain.chains.question_answering import load_qa_chain
# from langchain.llms import OpenAI
from langchain_openai import OpenAI

In [30]:
chain = load_qa_chain(OpenAI(), chain_type = 'stuff')
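
The `'stuff'` chain type places all retrieved chunks into a single prompt alongside the question. A minimal sketch of that idea follows; the wording of the template is an assumption for illustration and is not LangChain's actual prompt, and `build_stuff_prompt` is a hypothetical name.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all retrieved chunks into one prompt (illustrative template)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Since every chunk goes into one request, this approach only works while the combined chunks fit within the model's token limit.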

In [33]:
query = """Who are the authors of this paper - So You Want to Build Your Disease’s First Online Patient Registry:
An Educational Guide for Patient Organizations Based on US
and European Experience"""
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' Paul Wicks, Lindsey Wahlstrom-Edwards, Sam Fillingham, Andrea Downing, and Elin Haf Davies.'

In [34]:
query = "can you summarize the paper"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' This paper discusses the potential benefits and challenges of creating a patient registry for individuals affected by illness. The authors emphasize the importance of considering ethical and privacy issues, as well as ensuring the sustainability of the registry. They also highlight the potential for a registry to support individuals, communities, organizations, and scientific fields. '

In [35]:
query = "what techniques are best to create a patient registry "
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

'\n\nThere is no one "best" technique to create a patient registry, as it will depend on the specific needs and goals of the registry. However, some important considerations to keep in mind when creating a patient registry include privacy, interoperability, and the ability to move to another platform in the future. It is also important to consider the different requirements and permissions of various users, such as researchers, patients, and administrators. Ultimately, the best technique for creating a patient registry will vary depending on the specific context and goals of the registry.'

In [36]:
query = "Do you know about Google Bard"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' I am an AI and do not have personal knowledge or opinions about specific companies or products. I suggest conducting an online search or consulting with a relevant source for accurate information about Google Bard.'

In [38]:
query = "What is the date today?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

" I don't know, as this context does not provide any information about the current date. "

In [39]:
query = "What is 2+2"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' I am not programmed to solve math problems. I suggest using a calculator or asking a math teacher for help with this question.'
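
Every query cell above repeats the same two steps: retrieve the matching chunks, then run the chain on them. A small sketch of a wrapper for this pattern; `ask` and its `fallback` text are hypothetical, and the two callables are passed in so the helper does not depend on any particular library.

```python
def ask(question, search, answer, fallback="I don't know."):
    """Retrieve context for `question` with `search`, then pass both to `answer`."""
    docs = search(question)
    if not docs:
        # Nothing was retrieved, so there is no context to answer from.
        return fallback
    return answer(docs, question)
```

With the objects defined in this notebook it could be called as `ask("What is 2+2", docsearch.similarity_search, lambda docs, q: chain.run(input_documents=docs, question=q))`.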