## Retrieval
- Many LLM applications require user-specific data that is not part of the model's training set. The primary way of accomplishing this is through Retrieval Augmented Generation (RAG). In this process, external data is retrieved and then passed to the LLM when doing the generation step.

- LangChain provides all the building blocks for RAG applications, from simple to complex. This section of the documentation covers the retrieval step, i.e. fetching the data. Although this sounds simple, it can be subtly complex, and it involves several key modules.
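
To make the retrieve-then-generate flow concrete, here is a minimal sketch in plain Python. It is not LangChain code: the word-overlap scoring, the sample documents, and the helper names (`tokenize`, `retrieve`, `build_prompt`) are all assumptions made for illustration, and no LLM is actually called.

```python
# Toy illustration of retrieve-then-generate; no real LLM or vector store is used.
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents that share the most word tokens with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved context and the question into one prompt string."""
    context_block = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"


# Hypothetical sample data, invented for this sketch.
documents = [
    "The patient report covers blood pressure readings for 2024.",
    "LangChain document loaders read PDF files page by page.",
    "The cafeteria menu changes every Monday.",
]
query = "How are PDF files loaded?"
context = retrieve(query, documents, k=1)
prompt = build_prompt(query, context)
print(prompt)
```

In a real application the prompt would be sent to an LLM, and the word-overlap ranking would typically be replaced by a proper retriever.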

## Document loaders
- Document loaders load documents from many different sources. LangChain provides over 100 document loaders, as well as integrations with other major providers in the space, such as Airbyte and Unstructured. These integrations cover many document types (HTML, PDF, code) and many locations (private S3 buckets, public websites).
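
Because each file type has its own loader, a project often needs a small rule for choosing one. The sketch below picks a loader class name from the file extension. The class names exist in `langchain_community.document_loaders`, but the mapping and the helper `pick_loader_name` are hypothetical choices for this example, not part of LangChain.

```python
from pathlib import Path

# Maps a file extension to the name of a loader class in
# langchain_community.document_loaders. The mapping itself is an
# illustrative assumption; adjust it to the loaders you actually use.
LOADER_NAME_BY_EXTENSION = {
    ".pdf": "PyPDFLoader",
    ".csv": "CSVLoader",
    ".txt": "TextLoader",
    ".html": "BSHTMLLoader",
}


def pick_loader_name(path: str) -> str:
    """Return the loader class name configured for the file's extension."""
    suffix = Path(path).suffix.lower()
    try:
        return LOADER_NAME_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"No loader configured for {suffix!r} files") from None


print(pick_loader_name("dataset/patient health data analysis.pdf"))
```

The returned name can then be imported from `langchain_community.document_loaders` once that package is installed.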

### Install langchain

In [3]:
# !pip install langchain
# !pip install langchain_community

## PDF Loader

### pypdf
- Load a PDF with pypdf into a list of documents, where each document contains the page content and metadata that includes the page number.

In [4]:
# !pip install pypdf



In [5]:
from langchain_community.document_loaders import PyPDFLoader

In [6]:
# Create a loader for the PDF file
loader = PyPDFLoader('dataset/patient health data analysis.pdf')

#### Load file into pages using load() function

In [7]:
pages = loader.load()

In [8]:
# See total number of pages
len(pages)

33

In [9]:
# check first page
page = pages[0]

In [10]:
page

Document(metadata={'source': 'dataset/patient health data analysis.pdf', 'page': 0}, page_content='Team Alpha\n2024HEALTH\nDATA ANALYSIS\nARTIFICIAL\nINTELLIGENCE\nDATE :\n24 August 2024PRESENTED TO :\nSir Muhammad\nRizwan')

In [16]:
# See metadata of pdf
meta_data = page.metadata

In [17]:
meta_data

{'source': 'dataset/patient health data analysis.pdf', 'page': 0}

In [20]:
# check the source of metadata
meta_data['source']

'dataset/patient health data analysis.pdf'

In [21]:
# Let's see the content of the page
content = page.page_content

In [22]:
content

'Team Alpha\n2024HEALTH\nDATA ANALYSIS\nARTIFICIAL\nINTELLIGENCE\nDATE :\n24 August 2024PRESENTED TO :\nSir Muhammad\nRizwan'

In [23]:
print(content)

Team Alpha
2024HEALTH
DATA ANALYSIS
ARTIFICIAL
INTELLIGENCE
DATE :
24 August 2024PRESENTED TO :
Sir Muhammad
Rizwan
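
The printed text contains fused tokens such as `2024HEALTH` and `2024PRESENTED`, where text from neighbouring layout boxes ran together during extraction. One possible cleanup, sketched below, inserts a space between a digit and an uppercase letter that directly follows it. The helper `separate_fused_tokens` is hypothetical and the rule is only a heuristic: it would also split legitimate strings such as `3D`.

```python
import re


def separate_fused_tokens(text: str) -> str:
    """Insert a space wherever a digit is directly followed by an uppercase letter."""
    return re.sub(r"(?<=\d)(?=[A-Z])", " ", text)


# Page content copied from the notebook output.
raw = (
    "Team Alpha\n2024HEALTH\nDATA ANALYSIS\nARTIFICIAL\nINTELLIGENCE\n"
    "DATE :\n24 August 2024PRESENTED TO :\nSir Muhammad\nRizwan"
)
cleaned = separate_fused_tokens(raw)
print(cleaned)
```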


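#### Load file into chunks using load_and_split()
- `load_and_split()` loads the pages and then runs them through a text splitter, so it returns chunks rather than pages.

In [24]:
pages = loader.load_and_split()

In [25]:
# See number of chunks
len(pages)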

33
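
The count is the same as with `load()`, which would be consistent with every page being short enough to fit in a single chunk. To show what splitting does to longer text, here is a simplified sketch of fixed-size chunking with overlap. It is a stand-in for illustration only, not the splitter LangChain uses, and the function name and parameters are assumptions.

```python
def split_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


short_page = "Team Alpha"
long_page = "abcdefghij"
print(split_text(short_page))                          # a short page stays one chunk
print(split_text(long_page, chunk_size=4, overlap=1))  # a longer page is split
```

The overlap repeats a few characters at each boundary so that a sentence cut by a chunk edge still appears intact in at least one chunk.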

In [26]:
# See first chunk
page = pages[0]

In [27]:
page

Document(metadata={'source': 'dataset/patient health data analysis.pdf', 'page': 0}, page_content='Team Alpha\n2024HEALTH\nDATA ANALYSIS\nARTIFICIAL\nINTELLIGENCE\nDATE :\n24 August 2024PRESENTED TO :\nSir Muhammad\nRizwan')

### An advantage of using PyPDFLoader is that it loads the PDF along with page numbers
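
Page numbers in the metadata make it possible to cite where a passage came from. The sketch below builds a citation string from a metadata dictionary like the one shown earlier. The helper `format_citation` is hypothetical, and it assumes `page` is zero-based, as the value `0` for the first page suggests.

```python
def format_citation(metadata: dict) -> str:
    """Build a readable citation, converting the zero-based page index to 1-based."""
    return f"{metadata['source']} (page {metadata['page'] + 1})"


# Metadata copied from the notebook output for the first page.
meta_data = {'source': 'dataset/patient health data analysis.pdf', 'page': 0}
print(format_citation(meta_data))
```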

### Load PDF along with images

**To extract text from images in the PDF, install the `rapidocr-onnxruntime` package (`pip install rapidocr-onnxruntime`).**

In [32]:
# !pip install rapidocr-onnxruntime

In [33]:
loader = PyPDFLoader('dataset/patient health data analysis.pdf', extract_images=True)

In [None]:
pages = loader.load()

In [None]:
page = pages[0]

In [None]:
page
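
Since image extraction depends on an optional package, it can help to check for it before enabling `extract_images`. The sketch below uses only the standard library. The helper `ocr_backend_available` is hypothetical, and it assumes the package is imported under the name `rapidocr_onnxruntime`.

```python
import importlib.util


def ocr_backend_available(module_name: str = "rapidocr_onnxruntime") -> bool:
    """Return True if the module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# Enable image extraction only when the OCR package is installed.
extract_images = ocr_backend_available()
print(f"extract_images={extract_images}")
```

The resulting flag can be passed as `PyPDFLoader(path, extract_images=extract_images)`, so the notebook still runs without the OCR dependency.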