## Retrieval
- Many LLM applications require user-specific data that is not part of the model's training set. The primary way of supplying that data is Retrieval Augmented Generation (RAG): external data is retrieved and then passed to the LLM during the generation step.

- LangChain provides all the building blocks for RAG applications, from simple to complex. This section of the documentation covers everything related to the retrieval step, i.e. the fetching of the data. Although this sounds simple, it can be subtly complex, and it encompasses several key modules.
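
The retrieve-then-generate flow described above can be sketched in plain Python. This is a toy illustration, not LangChain's API: `tokenize`, `retrieve`, `build_prompt`, and the word-overlap scoring are assumptions made for the example. A real application would typically retrieve with embeddings and a vector store, then send the prompt to an LLM.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents that share the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved context and the question into one prompt string."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Hypothetical sample data standing in for loaded documents.
documents = [
    "Blood pressure records include systolic and diastolic readings.",
    "Matplotlib is used for data visualization.",
    "Each entry is identified by a Patient ID.",
]
question = "Which readings does blood pressure include?"
top_chunks = retrieve(question, documents, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

The prompt string is what would be handed to the model; everything before that point is the retrieval step this section covers.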

## Document loaders
- Document loaders load documents from many different sources. LangChain provides over 100 different document loaders as well as integrations with other major providers in the space, like Airbyte and Unstructured. These integrations can load all types of documents (HTML, PDF, code) from all types of locations (private S3 buckets, public websites).
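
To make the idea concrete, the sketch below shows the general shape of a loader: read a source and return objects that pair text with metadata. `SimpleDocument` and `load_text_file` are hypothetical names used for illustration, not LangChain classes. The `Document` objects returned by LangChain loaders expose the same two attributes, `page_content` and `metadata`.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str) -> list[SimpleDocument]:
    """Read one text file and wrap it in a single-document list."""
    text = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Write a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "notes.txt")
    Path(sample_path).write_text("Patient data summary", encoding="utf-8")
    docs = load_text_file(sample_path)

print(docs[0].page_content)
print(docs[0].metadata["source"])
```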

### Install langchain

In [3]:
# !pip install langchain
# !pip install langchain_community

## PDF Loader

### pypdf
- Load a PDF with pypdf into a list of documents, where each document contains the page content and metadata that includes the page number.

In [4]:
# !pip install pypdf



In [5]:
from langchain_community.document_loaders import PyPDFLoader

In [6]:
# Load file into loader
loader = PyPDFLoader('dataset/patient health data analysis.pdf')

#### Load file into pages using load() function

In [7]:
pages = loader.load()

In [8]:
# See total number of pages
len(pages)

33

In [9]:
# check first page
page = pages[0]

In [10]:
page

Document(metadata={'source': 'dataset/patient health data analysis.pdf', 'page': 0}, page_content='Team Alpha\n2024HEALTH\nDATA ANALYSIS\nARTIFICIAL\nINTELLIGENCE\nDATE :\n24 August 2024PRESENTED TO :\nSir Muhammad\nRizwan')

In [16]:
# See metadata of pdf
meta_data = page.metadata

In [17]:
meta_data

{'source': 'dataset/patient health data analysis.pdf', 'page': 0}

In [20]:
# check the source of metadata
meta_data['source']

'dataset/patient health data analysis.pdf'

In [21]:
# Let's see the content of the page
content = page.page_content

In [22]:
content

'Team Alpha\n2024HEALTH\nDATA ANALYSIS\nARTIFICIAL\nINTELLIGENCE\nDATE :\n24 August 2024PRESENTED TO :\nSir Muhammad\nRizwan'

In [23]:
print(content)

Team Alpha
2024HEALTH
DATA ANALYSIS
ARTIFICIAL
INTELLIGENCE
DATE :
24 August 2024PRESENTED TO :
Sir Muhammad
Rizwan
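
Because every loaded page carries both `page_content` and `metadata`, the whole PDF can be scanned with a short loop. The sketch below is a minimal example under stated assumptions: `Page` is a hypothetical stand-in with the same two attributes, the sample data is made up, and `pages_mentioning` is an illustrative helper rather than a LangChain function. With the real file, the list returned by `loader.load()` could be passed instead.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in with the same attributes as a loaded document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_mentioning(pages, keyword):
    """Return the 'page' metadata value of every page whose text contains keyword."""
    needle = keyword.lower()
    return [p.metadata["page"] for p in pages if needle in p.page_content.lower()]


# Made-up sample pages; page numbers are zero-based, as in the output above.
sample_pages = [
    Page("Team Alpha\nHEALTH\nDATA ANALYSIS", {"page": 0}),
    Page("The dataset comprises health records.", {"page": 1}),
    Page("Matplotlib for data visualization.", {"page": 2}),
]
hits = pages_mentioning(sample_pages, "health")
print(hits)
```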


#### Load file into pages using load_and_split()

In [24]:
pages = loader.load_and_split()

In [25]:
# See the number of documents returned (load_and_split() returns chunks, which may not match the page count)
len(pages)

33

In [26]:
# See page
page = pages[0]

In [27]:
page

Document(metadata={'source': 'dataset/patient health data analysis.pdf', 'page': 0}, page_content='Team Alpha\n2024HEALTH\nDATA ANALYSIS\nARTIFICIAL\nINTELLIGENCE\nDATE :\n24 August 2024PRESENTED TO :\nSir Muhammad\nRizwan')
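
`load_and_split()` passes the loaded pages through a text splitter (a `RecursiveCharacterTextSplitter` by default), so it returns chunks rather than pages. The count is still 33 here, presumably because each page is short enough to fit in a single chunk. The sketch below illustrates the general idea with a much simpler fixed-size splitter; `split_text` is a hypothetical helper and does not reproduce the recursive algorithm LangChain uses.

```python
def split_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Cut text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


short_page = "Team Alpha"                  # fits in one chunk
long_page = "abcdefghijklmnopqrstuvwxyz"   # needs several chunks

print(split_text(short_page, chunk_size=20))
print(split_text(long_page, chunk_size=10, overlap=2))
```

A short page yields exactly one chunk, which is why the number of documents can equal the number of pages; a longer page would yield several.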

### An advantage of using PyPDFLoader is that it loads the PDF along with page numbers

### Load PDF along with images

**To extract text from images in the PDF, install the `rapidocr-onnxruntime` package (`pip install rapidocr-onnxruntime`).**

In [32]:
# !pip install rapidocr-onnxruntime

In [33]:
loader = PyPDFLoader('dataset/patient health data analysis.pdf', extract_images=True)

In [35]:
pages = loader.load()

In [36]:
page = pages[0]

In [37]:
page

Document(metadata={'source': 'dataset/patient health data analysis.pdf', 'page': 0}, page_content='Team Alpha\n2024HEALTH\nDATA ANALYSIS\nARTIFICIAL\nINTELLIGENCE\nDATE :\n24 August 2024PRESENTED TO :\nSir Muhammad\nRizwan+\nHospital')
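
With `extract_images=True`, text recognized in embedded images ends up in the page content. Comparing the two first-page outputs shown above isolates what the OCR step contributed. The strings are copied from those outputs; `added_text` is a hypothetical helper written for this comparison.

```python
# First-page content without image extraction (copied from the earlier output).
text_only = (
    "Team Alpha\n2024HEALTH\nDATA ANALYSIS\nARTIFICIAL\nINTELLIGENCE\n"
    "DATE :\n24 August 2024PRESENTED TO :\nSir Muhammad\nRizwan"
)
# First-page content with extract_images=True (copied from the output above).
with_images = (
    "Team Alpha\n2024HEALTH\nDATA ANALYSIS\nARTIFICIAL\nINTELLIGENCE\n"
    "DATE :\n24 August 2024PRESENTED TO :\nSir Muhammad\nRizwan+\nHospital"
)


def added_text(before: str, after: str) -> str:
    """Return the suffix that `after` gained over `before`, or '' if it is not an extension."""
    return after[len(before):] if after.startswith(before) else ""


extra = added_text(text_only, with_images)
print(repr(extra))
```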

## Using PyMuPDF

- This is the fastest of the PDF parsing options. It returns one document per page and includes detailed metadata about the PDF and its pages.

In [38]:
from langchain_community.document_loaders import PyMuPDFLoader

In [39]:
loader = PyMuPDFLoader('dataset/patient health data analysis.pdf')

In [41]:
pages = loader.load()

In [42]:
page = pages[1]

In [43]:
page

Document(metadata={'source': 'dataset/patient health data analysis.pdf', 'file_path': 'dataset/patient health data analysis.pdf', 'page': 1, 'total_pages': 33, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'PyPDF2', 'creationDate': '', 'modDate': '', 'trapped': ''}, page_content="The dataset comprises health records collected from a medical center,\nfeaturing several key columns. Each entry is uniquely identified by a Patient ID.\nThe dataset includes the Age and Gender of the patient, as well as their Blood\nPressure (BP), which records both systolic and diastolic readings. Cholesterol\nlevels are also noted, along with the patient's Heart Rate measured at rest. The\nBody Mass Index (BMI) of each patient is calculated and recorded.\nAdditionally, there is an indicator for Diabetes, specifying whether the patient\nhas diabetes (Yes/No).\nIn this project, we aim to analyze and visualize patient health data using\nPython. The to

In [44]:
page.metadata

{'source': 'dataset/patient health data analysis.pdf',
 'file_path': 'dataset/patient health data analysis.pdf',
 'page': 1,
 'total_pages': 33,
 'format': 'PDF 1.7',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': 'PyPDF2',
 'creationDate': '',
 'modDate': '',
 'trapped': ''}
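
Many of these metadata fields are empty strings for this file. A small filter, sketched below on the dictionary shown above, keeps only the populated ones; `drop_empty` is a hypothetical helper name.

```python
# Metadata copied from the page.metadata output above.
metadata = {
    "source": "dataset/patient health data analysis.pdf",
    "file_path": "dataset/patient health data analysis.pdf",
    "page": 1,
    "total_pages": 33,
    "format": "PDF 1.7",
    "title": "",
    "author": "",
    "subject": "",
    "keywords": "",
    "creator": "",
    "producer": "PyPDF2",
    "creationDate": "",
    "modDate": "",
    "trapped": "",
}


def drop_empty(meta: dict) -> dict:
    """Return a copy of meta without the keys whose value is an empty string."""
    return {key: value for key, value in meta.items() if value != ""}


non_empty = drop_empty(metadata)
print(non_empty)
```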

In [46]:
print(page.page_content)

The dataset comprises health records collected from a medical center,
featuring several key columns. Each entry is uniquely identified by a Patient ID.
The dataset includes the Age and Gender of the patient, as well as their Blood
Pressure (BP), which records both systolic and diastolic readings. Cholesterol
levels are also noted, along with the patient's Heart Rate measured at rest. The
Body Mass Index (BMI) of each patient is calculated and recorded.
Additionally, there is an indicator for Diabetes, specifying whether the patient
has diabetes (Yes/No).
In this project, we aim to analyze and visualize patient health data using
Python. The tools used include NumPy and pandas for data manipulation, and
Matplotlib for data visualization. Our goal is to uncover insights into health
trends such as blood pressure levels, cholesterol, and other vital statistics,
which can aid in medical decision-making. 
Utilize Libraries: Learn to effectively use Matplotlib, NumPy, and pandas in
the context

In [47]:
page.lc_secrets

{}

In [48]:
page.lc_attributes

{}

In [50]:
page.to_json()

{'lc': 1,
 'type': 'constructor',
 'id': ['langchain', 'schema', 'document', 'Document'],
 'kwargs': {'metadata': {'source': 'dataset/patient health data analysis.pdf',
   'file_path': 'dataset/patient health data analysis.pdf',
   'page': 1,
   'total_pages': 33,
   'format': 'PDF 1.7',
   'title': '',
   'author': '',
   'subject': '',
   'keywords': '',
   'creator': '',
   'producer': 'PyPDF2',
   'creationDate': '',
   'modDate': '',
   'trapped': ''},
  'page_content': "The dataset comprises health records collected from a medical center,\nfeaturing several key columns. Each entry is uniquely identified by a Patient ID.\nThe dataset includes the Age and Gender of the patient, as well as their Blood\nPressure (BP), which records both systolic and diastolic readings. Cholesterol\nlevels are also noted, along with the patient's Heart Rate measured at rest. The\nBody Mass Index (BMI) of each patient is calculated and recorded.\nAdditionally, there is an indicator for Diabetes, specif