## Retrieval
- Many LLM applications require user-specific data that is not part of the model's training set. The primary way to supply it is Retrieval-Augmented Generation (RAG): external data is retrieved and then passed to the LLM during the generation step.

- LangChain provides the building blocks for RAG applications, from simple to complex. This section of the documentation covers the retrieval step, i.e. fetching the data. Although this sounds simple, it can be subtly complex, and it involves several key modules.
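
As a rough illustration of the retrieve-then-generate flow described above, the sketch below ranks a few in-memory snippets by word overlap with the question and pastes the best match into a prompt. The snippets, the `retrieve` and `build_prompt` helpers, and the overlap score are all illustrative assumptions; real applications typically use embeddings and a vector store, and then send the prompt to an LLM.

```python
def tokenize(text):
    """Lowercase and split on whitespace, stripping basic punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_docs):
    """Place the retrieved text ahead of the question."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up snippets standing in for loaded documents.
documents = [
    "The clinic records blood pressure for every patient.",
    "Matplotlib is a plotting library for Python.",
    "CSV files store one record per row.",
]

question = "Which library is used for plotting?"
top_docs = retrieve(question, documents)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The loaders in the rest of this section cover the first half of that flow: getting external data into a form that can be retrieved.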

## Document loaders
- Document loaders load documents from many different sources. LangChain provides over 100 different document loaders as well as integrations with other major providers in the space, like Airbyte and Unstructured. These integrations load many types of documents (HTML, PDF, code) from many types of locations (private S3 buckets, public websites).

### Install langchain

In [1]:
# !pip install langchain
# !pip install langchain_community

## PDF Loader

### pypdf
- Load a PDF using pypdf into a list of documents, where each document contains the page content and metadata with the page number.

In [2]:
# !pip install pypdf

In [3]:
from langchain_community.document_loaders import PyPDFLoader

In [4]:
# Load file into loader
loader = PyPDFLoader('dataset/patient health data analysis.pdf')

#### Load file into pages using load() function

In [5]:
pages = loader.load()

In [6]:
# See total number of pages
len(pages)

33

In [7]:
# check first page
page = pages[0]

In [8]:
page

Document(metadata={'source': 'dataset/patient health data analysis.pdf', 'page': 0}, page_content='Team Alpha\n2024HEALTH\nDATA ANALYSIS\nARTIFICIAL\nINTELLIGENCE\nDATE :\n24 August 2024PRESENTED TO :\nSir Muhammad\nRizwan')

In [9]:
# See metadata of pdf
meta_data = page.metadata

In [10]:
meta_data

{'source': 'dataset/patient health data analysis.pdf', 'page': 0}

In [11]:
# check the source of metadata
meta_data['source']

'dataset/patient health data analysis.pdf'

In [12]:
# Let's look at the content of the page
content = page.page_content

In [13]:
content

'Team Alpha\n2024HEALTH\nDATA ANALYSIS\nARTIFICIAL\nINTELLIGENCE\nDATE :\n24 August 2024PRESENTED TO :\nSir Muhammad\nRizwan'

In [14]:
print(content)

Team Alpha
2024HEALTH
DATA ANALYSIS
ARTIFICIAL
INTELLIGENCE
DATE :
24 August 2024PRESENTED TO :
Sir Muhammad
Rizwan


#### Load file into pages using load_and_split()

In [15]:
pages = loader.load_and_split()

In [16]:
# See the number of documents returned (chunks, not necessarily pages)
len(pages)

33

In [17]:
# See page
page = pages[0]

In [18]:
page

Document(metadata={'source': 'dataset/patient health data analysis.pdf', 'page': 0}, page_content='Team Alpha\n2024HEALTH\nDATA ANALYSIS\nARTIFICIAL\nINTELLIGENCE\nDATE :\n24 August 2024PRESENTED TO :\nSir Muhammad\nRizwan')
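
In the LangChain releases we are aware of, `load_and_split()` accepts an optional text splitter and falls back to a `RecursiveCharacterTextSplitter` when none is passed, so it returns chunks rather than pages. The count matches `load()` here (33), presumably because every page is short enough to fit in one chunk. The sketch below is a deliberately simplified fixed-size splitter with overlap, not LangChain's recursive algorithm; `split_text`, `chunk_size`, and `overlap` are hypothetical names.

```python
def split_text(text, chunk_size=40, overlap=10):
    """Cut text into fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


short_page = "Team Alpha 2024 HEALTH DATA ANALYSIS"
long_page = "The dataset comprises health records collected from a medical center. " * 3

print(len(split_text(short_page)))  # a short page stays in one chunk
print(len(split_text(long_page)))   # a long page is cut into several chunks
```

The overlap means the end of one chunk is repeated at the start of the next, which helps keep sentences that straddle a boundary retrievable.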

### An advantage of using PyPDFLoader is that it loads the PDF along with page numbers
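
Because each `Document` keeps its `source` and zero-based `page` in `metadata`, a retrieved passage can be traced back to the page it came from. A small sketch, using a plain dict shaped like the metadata shown earlier; `format_citation` is a hypothetical helper, not part of LangChain.

```python
from pathlib import PurePosixPath


def format_citation(metadata):
    """Build a human-readable reference from PyPDFLoader-style metadata."""
    file_name = PurePosixPath(metadata["source"]).name
    # The loader counts pages from 0, so add 1 for display.
    return f"{file_name}, page {metadata['page'] + 1}"


meta_data = {"source": "dataset/patient health data analysis.pdf", "page": 0}
print(format_citation(meta_data))
```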

### Load PDF along with images

**To extract text from images in the PDF, install the `rapidocr-onnxruntime` package (`pip install rapidocr-onnxruntime`).**

In [19]:
# !pip install rapidocr-onnxruntime

In [20]:
loader = PyPDFLoader('dataset/patient health data analysis.pdf', extract_images=True)

In [21]:
pages = loader.load()

In [22]:
page = pages[0]

In [23]:
page

Document(metadata={'source': 'dataset/patient health data analysis.pdf', 'page': 0}, page_content='Team Alpha\n2024HEALTH\nDATA ANALYSIS\nARTIFICIAL\nINTELLIGENCE\nDATE :\n24 August 2024PRESENTED TO :\nSir Muhammad\nRizwan+\nHospital')

## Using PyMuPDF

- This is the fastest of the PDF parsing options. It returns one document per page and includes detailed metadata about the PDF and its pages.

In [24]:
from langchain_community.document_loaders import PyMuPDFLoader

In [25]:
loader = PyMuPDFLoader('dataset/patient health data analysis.pdf')

In [26]:
pages = loader.load()

In [27]:
page = pages[1]

In [28]:
page

Document(metadata={'source': 'dataset/patient health data analysis.pdf', 'file_path': 'dataset/patient health data analysis.pdf', 'page': 1, 'total_pages': 33, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'PyPDF2', 'creationDate': '', 'modDate': '', 'trapped': ''}, page_content="The dataset comprises health records collected from a medical center,\nfeaturing several key columns. Each entry is uniquely identified by a Patient ID.\nThe dataset includes the Age and Gender of the patient, as well as their Blood\nPressure (BP), which records both systolic and diastolic readings. Cholesterol\nlevels are also noted, along with the patient's Heart Rate measured at rest. The\nBody Mass Index (BMI) of each patient is calculated and recorded.\nAdditionally, there is an indicator for Diabetes, specifying whether the patient\nhas diabetes (Yes/No).\nIn this project, we aim to analyze and visualize patient health data using\nPython. The to

In [29]:
page.metadata

{'source': 'dataset/patient health data analysis.pdf',
 'file_path': 'dataset/patient health data analysis.pdf',
 'page': 1,
 'total_pages': 33,
 'format': 'PDF 1.7',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': 'PyPDF2',
 'creationDate': '',
 'modDate': '',
 'trapped': ''}
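
Many of the metadata fields above are empty strings for this file. If only the populated fields matter, they can be filtered with a dict comprehension. The sketch uses a literal, shortened copy of the output above rather than a live `page.metadata`.

```python
metadata = {
    "source": "dataset/patient health data analysis.pdf",
    "file_path": "dataset/patient health data analysis.pdf",
    "page": 1,
    "total_pages": 33,
    "format": "PDF 1.7",
    "title": "",
    "author": "",
    "producer": "PyPDF2",
    "creationDate": "",
}

# Drop empty strings only; comparing to "" keeps falsy values such as page 0.
populated = {key: value for key, value in metadata.items() if value != ""}
print(populated)
```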

In [30]:
print(page.page_content)

The dataset comprises health records collected from a medical center,
featuring several key columns. Each entry is uniquely identified by a Patient ID.
The dataset includes the Age and Gender of the patient, as well as their Blood
Pressure (BP), which records both systolic and diastolic readings. Cholesterol
levels are also noted, along with the patient's Heart Rate measured at rest. The
Body Mass Index (BMI) of each patient is calculated and recorded.
Additionally, there is an indicator for Diabetes, specifying whether the patient
has diabetes (Yes/No).
In this project, we aim to analyze and visualize patient health data using
Python. The tools used include NumPy and pandas for data manipulation, and
Matplotlib for data visualization. Our goal is to uncover insights into health
trends such as blood pressure levels, cholesterol, and other vital statistics,
which can aid in medical decision-making. 
Utilize Libraries: Learn to effectively use Matplotlib, NumPy, and pandas in
the context

In [31]:
page.lc_secrets

{}

In [32]:
page.lc_attributes

{}

In [33]:
page.to_json()

{'lc': 1,
 'type': 'constructor',
 'id': ['langchain', 'schema', 'document', 'Document'],
 'kwargs': {'metadata': {'source': 'dataset/patient health data analysis.pdf',
   'file_path': 'dataset/patient health data analysis.pdf',
   'page': 1,
   'total_pages': 33,
   'format': 'PDF 1.7',
   'title': '',
   'author': '',
   'subject': '',
   'keywords': '',
   'creator': '',
   'producer': 'PyPDF2',
   'creationDate': '',
   'modDate': '',
   'trapped': ''},
  'page_content': "The dataset comprises health records collected from a medical center,\nfeaturing several key columns. Each entry is uniquely identified by a Patient ID.\nThe dataset includes the Age and Gender of the patient, as well as their Blood\nPressure (BP), which records both systolic and diastolic readings. Cholesterol\nlevels are also noted, along with the patient's Heart Rate measured at rest. The\nBody Mass Index (BMI) of each patient is calculated and recorded.\nAdditionally, there is an indicator for Diabetes, specif

---

## CSV Loader

In [34]:
from langchain_community.document_loaders import CSVLoader

In [35]:
loader = CSVLoader('dataset/amazon_books.csv')

In [36]:
data = loader.load()
# For a CSV file, the loader creates one document per row

In [37]:
data

[Document(metadata={'source': 'dataset/amazon_books.csv', 'row': 0}, page_content='Title: N/A\nAuthor: N/A\nPrice: N/A\nRating: N/A\nLink: '),
 Document(metadata={'source': 'dataset/amazon_books.csv', 'row': 1}, page_content='Title: If He Had Been with Me\nAuthor: Laura Nowlin\nPrice: $7.27\nRating: 4.2 out of 5 stars\nLink: '),
 Document(metadata={'source': 'dataset/amazon_books.csv', 'row': 2}, page_content='Title: Lessons from the Greatest Stock Traders of All Time: Proven Strategies Active Traders Can Use Today to Beat the Markets\nAuthor: John Boik\nPrice: $14.73\nRating: 4.7 out of 5 stars\nLink: '),
 Document(metadata={'source': 'dataset/amazon_books.csv', 'row': 3}, page_content='Title: How Legendary Traders Made Millions: Profiting From the Investment Strategies of the Gretest Traders of All time\nAuthor: John Boik\nPrice: $15.55\nRating: 4.6 out of 5 stars\nLink: '),
 Document(metadata={'source': 'dataset/amazon_books.csv', 'row': 4}, page_content='Title: N/A\nAuthor: N/A\nPr

In [38]:
# Get data of only first row
row_1 = data[0]
row_1

Document(metadata={'source': 'dataset/amazon_books.csv', 'row': 0}, page_content='Title: N/A\nAuthor: N/A\nPrice: N/A\nRating: N/A\nLink: ')

In [39]:
row_1.page_content

'Title: N/A\nAuthor: N/A\nPrice: N/A\nRating: N/A\nLink: '
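
The `page_content` of each row is the row's columns written as `column: value` lines, and the row index goes into `metadata`. The sketch below reproduces that layout with the standard `csv` module on an in-memory sample. The sample rows are copied from the output in this section, with the `Link` column left empty as it appears there; `row_to_text` is a hypothetical helper, not CSVLoader's own code.

```python
import csv
import io

sample_csv = (
    "Title,Author,Price,Rating,Link\n"
    "If He Had Been with Me,Laura Nowlin,$7.27,4.2 out of 5 stars,\n"
    "The Outsiders,S. E. Hinton,$9.49,4.7 out of 5 stars,\n"
)


def row_to_text(row):
    """Join one CSV row into 'column: value' lines."""
    return "\n".join(f"{column}: {value}" for column, value in row.items())


reader = csv.DictReader(io.StringIO(sample_csv))
records = [
    {"page_content": row_to_text(row), "metadata": {"row": index}}
    for index, row in enumerate(reader)
]

print(records[0]["page_content"])
```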

In [40]:
# Print the content of every row
for row in data:
    print(row.page_content.strip())

Title: N/A
Author: N/A
Price: N/A
Rating: N/A
Link:
Title: If He Had Been with Me
Author: Laura Nowlin
Price: $7.27
Rating: 4.2 out of 5 stars
Link:
Title: Lessons from the Greatest Stock Traders of All Time: Proven Strategies Active Traders Can Use Today to Beat the Markets
Author: John Boik
Price: $14.73
Rating: 4.7 out of 5 stars
Link:
Title: How Legendary Traders Made Millions: Profiting From the Investment Strategies of the Gretest Traders of All time
Author: John Boik
Price: $15.55
Rating: 4.6 out of 5 stars
Link:
Title: N/A
Author: N/A
Price: N/A
Rating: N/A
Link:
Title: The Outsiders
Author: S. E. Hinton
Price: $9.49
Rating: 4.7 out of 5 stars
Link:
Title: Life Skills for Kids: How to Cook, Clean, Make Friends, Handle Emergencies, Set Goals, Make Good Decisions, and Everything in Between
Author: Karen Harris
Price: $11.29
Rating: 4.5 out of 5 stars
Link:
Title: Monster Stocks: How They Set Up, Run Up, Top and Make You Money
Author: John Boik
Price: $31.36
Rating: 4.3 out of 5 s

### Customizing the CSV parsing and loading

In [41]:
loader = CSVLoader(file_path='dataset/amazon_books.csv', csv_args={
    'delimiter': ',',
    'quotechar': '"'
})
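
`csv_args` is forwarded to Python's `csv.DictReader`, so any dialect option that reader accepts can go in the dict. The values used here (`,` and `"`) are the `csv` module's defaults, so the result matches the earlier load. The sketch below shows a case where the options matter: a hypothetical semicolon-delimited sample, parsed with the standard library directly.

```python
import csv
import io

# Hypothetical sample: semicolon-separated, with a quoted field containing a semicolon.
sample = 'Title;Author\n"Monster Stocks; How They Set Up";John Boik\n'

csv_args = {"delimiter": ";", "quotechar": '"'}
rows = list(csv.DictReader(io.StringIO(sample), **csv_args))

print(rows[0]["Title"])
print(rows[0]["Author"])
```

Without the quote character, the semicolon inside the title would be treated as a column separator.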

In [42]:
data = loader.load()

In [43]:
data

[Document(metadata={'source': 'dataset/amazon_books.csv', 'row': 0}, page_content='Title: N/A\nAuthor: N/A\nPrice: N/A\nRating: N/A\nLink: '),
 Document(metadata={'source': 'dataset/amazon_books.csv', 'row': 1}, page_content='Title: If He Had Been with Me\nAuthor: Laura Nowlin\nPrice: $7.27\nRating: 4.2 out of 5 stars\nLink: '),
 Document(metadata={'source': 'dataset/amazon_books.csv', 'row': 2}, page_content='Title: Lessons from the Greatest Stock Traders of All Time: Proven Strategies Active Traders Can Use Today to Beat the Markets\nAuthor: John Boik\nPrice: $14.73\nRating: 4.7 out of 5 stars\nLink: '),
 Document(metadata={'source': 'dataset/amazon_books.csv', 'row': 3}, page_content='Title: How Legendary Traders Made Millions: Profiting From the Investment Strategies of the Gretest Traders of All time\nAuthor: John Boik\nPrice: $15.55\nRating: 4.6 out of 5 stars\nLink: '),
 Document(metadata={'source': 'dataset/amazon_books.csv', 'row': 4}, page_content='Title: N/A\nAuthor: N/A\nPr

## TextLoader

In [44]:
from langchain_community.document_loaders import TextLoader

In [48]:
loader = TextLoader('dataset/Matplotlib.txt')

In [49]:
text = loader.load()
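
Unlike the PDF and CSV loaders, `TextLoader` returns the whole file as a single document, so `text` is a one-element list. What that amounts to can be sketched with plain file I/O; the temporary file and the `load_text` helper are stand-ins for illustration. `TextLoader` also takes an `encoding` argument, which is worth setting explicitly (for example `encoding='utf-8'`) when the platform default does not match the file.

```python
import tempfile
from pathlib import Path


def load_text(path, encoding="utf-8"):
    """Read a whole text file into one document-like dict."""
    content = Path(path).read_text(encoding=encoding)
    return [{"page_content": content, "metadata": {"source": str(path)}}]


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("What is MATPLOTLIB?\nA plotting library.\n", encoding="utf-8")
    docs = load_text(sample_path)

print(len(docs))
print(docs[0]["page_content"])
```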

In [50]:
text

[Document(metadata={'source': 'dataset/Matplotlib.txt'}, page_content="What is MATPLOTLIB?\nMatplotlib is a low level graph \nplotting library in python that \nserves as a visualization utility.\nMatplotlib is open source and we \ncan use it freely.\n\nMatplotlib is mostly written in python, \na few segments are written in C, \nObjective-C and Javascript for \nPlatform compatibility.\n_______________________________________\nInstallation of Matplotlib\nImporting\n_______________________________________\nPyplot\nMost of the Matplotlib utilities \nlies under the pyplot submodule, \nand are usually imported under \nthe plt alias:\n\nimport matplotlib.pyplot as plt\n_______________________________________\nDraw a line in a diagram from position \n(0,0) to position (6,250):\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nxpoints = np.array([0, 6])\nypoints = np.array([0, 250])\nplt.plot(xpoints, ypoints)\nplt.show()\n_______________________________________\nPlotting x and y points\n

In [53]:
print(text[0].page_content)

What is MATPLOTLIB?
Matplotlib is a low level graph 
plotting library in python that 
serves as a visualization utility.
Matplotlib is open source and we 
can use it freely.

Matplotlib is mostly written in python, 
a few segments are written in C, 
Objective-C and Javascript for 
Platform compatibility.
_______________________________________
Installation of Matplotlib
Importing
_______________________________________
Pyplot
Most of the Matplotlib utilities 
lies under the pyplot submodule, 
and are usually imported under 
the plt alias:

import matplotlib.pyplot as plt
_______________________________________
Draw a line in a diagram from position 
(0,0) to position (6,250):

import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()
_______________________________________
Plotting x and y points
The plot() function is used to draw 
points (markers) in a diagram.

By default, the plot() function 
dra

## Docx Loader

**To use `Docx2txtLoader`, install the `docx2txt` package (`pip install docx2txt`).**

In [2]:
# !pip install docx2txt


In [3]:
from langchain_community.document_loaders import Docx2txtLoader

In [4]:
loader = Docx2txtLoader('../Notes/loader.docx')

In [5]:
data = loader.load()

In [6]:
data

[Document(metadata={'source': '../Notes/loader.docx'}, page_content='Loader in LangChain\n\nWhat is a Loader in LangChain?\n\nDefinition: A loader in LangChain is a component that facilitates the ingestion of data from various sources into the LangChain framework.\n\nPurpose: It ensures that data is properly formatted and prepared for further processing, such as embedding and indexing.\n\nKey Functions of a Loader\n\nData Ingestion: \n\n\tCollects data from different sources like files, databases, APIs, and web pages.\n\n\tSupports various data formats including text, CSV, JSON, and more.\n\nData Chunking: \n\n\tSplits large documents into smaller, manageable chunks.\n\n\tEnsures that each chunk is of optimal size for embedding and retrieval processes.\n\nData Cleaning: \n\n\tRemoves unnecessary or irrelevant information.\n\n\tStandardizes data formats to ensure consistency.\n\nData Transformation: \n\n\tConverts raw data into a format suitable for embedding.\n\n\tApplies necessary tra

In [8]:
data[0].metadata

{'source': '../Notes/loader.docx'}
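
A `.docx` file is a ZIP archive whose main text lives in `word/document.xml`, which is why a helper package is needed to read it. The sketch below builds a tiny archive in memory and pulls the paragraph text out with the standard library. It is a simplified illustration of the idea, not how `docx2txt` is actually implemented, and it ignores tables, headers, and footers.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

document_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>Loader in LangChain</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>What is a Loader in LangChain?</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

# Build a minimal .docx-like archive in memory (not a complete Word file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", document_xml)


def extract_paragraphs(docx_bytes):
    """Return the text of each <w:p> paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{W_NS}p"):
        text = "".join(node.text or "" for node in paragraph.iter(f"{W_NS}t"))
        paragraphs.append(text)
    return paragraphs


paragraphs = extract_paragraphs(buffer.getvalue())
print("\n\n".join(paragraphs))
```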

In [10]:
data[0].page_content

'Loader in LangChain\n\nWhat is a Loader in LangChain?\n\nDefinition: A loader in LangChain is a component that facilitates the ingestion of data from various sources into the LangChain framework.\n\nPurpose: It ensures that data is properly formatted and prepared for further processing, such as embedding and indexing.\n\nKey Functions of a Loader\n\nData Ingestion: \n\n\tCollects data from different sources like files, databases, APIs, and web pages.\n\n\tSupports various data formats including text, CSV, JSON, and more.\n\nData Chunking: \n\n\tSplits large documents into smaller, manageable chunks.\n\n\tEnsures that each chunk is of optimal size for embedding and retrieval processes.\n\nData Cleaning: \n\n\tRemoves unnecessary or irrelevant information.\n\n\tStandardizes data formats to ensure consistency.\n\nData Transformation: \n\n\tConverts raw data into a format suitable for embedding.\n\n\tApplies necessary transformations like tokenization and normalization.\n\nTypes of Loader

## Web-Based Loader

In [11]:
from langchain_community.document_loaders import WebBaseLoader

USER_AGENT environment variable not set, consider setting it to identify your requests.
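
The warning above is raised at import time, so setting the `USER_AGENT` environment variable before importing `WebBaseLoader` should silence it. The agent string below is a placeholder; use one that identifies your own application.

```python
import os

# Placeholder value: replace with a string that identifies your application.
# setdefault leaves an existing USER_AGENT untouched.
os.environ.setdefault("USER_AGENT", "my-rag-notebook/0.1")

print(os.environ["USER_AGENT"])
```

This cell needs to run before the `WebBaseLoader` import, for example at the top of the notebook.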


In [12]:
loader = WebBaseLoader('https://www.xevensolutions.com')

In [13]:
docs = loader.load()

In [15]:
page = docs[0]

In [16]:
page.metadata

{'source': 'https://www.xevensolutions.com',
 'title': 'Xeven Solutions - AI Development & Solutions Company',
 'description': 'Xeven Solutions is a leading AI Development & Solutions Company providing custom AI-based software services to automate workflow and boost innovation.',
 'language': 'en-US'}
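
Text scraped from HTML tends to carry long runs of blank lines and stray whitespace, and the `page_content` of this page is no exception. Before splitting or embedding, it can help to normalize it. A minimal sketch with the standard `re` module; `collapse_whitespace` is a hypothetical helper and the sample string is made up to mimic the page.

```python
import re


def collapse_whitespace(text):
    """Trim each line and squeeze runs of blank lines into one."""
    lines = [line.strip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Three or more newlines in a row become a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()


raw = "\n\n\n\nXeven Solutions\n\n\n \n\n\nServices\n\t\tAI Development Services\t\n\n\n"
print(collapse_whitespace(raw))
```

With a loaded document, the same function could be applied as `collapse_whitespace(page.page_content)`.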

In [18]:
print(page.page_content)








Xeven Solutions - AI Development & Solutions Company











































































 
 





 








 


Services

AI Development Services AI Chatbot Development Predictive Modelling​ Mobile App Development Chat GPT Integrations Custom Software Natural Language Processing Digital Marketing Machine Learning DevOps Computer Vision​ Custom Web Development Staff Augmentation UI UX Design

Salesforce
Industries

HealthTech EdTech FinTech GreenTech Internet of Things Retail AI Diagnostics E-Commerce Smart Healthcare HIPAA Compliance

Portfolio
Company

About Us Life at Xeven

Resource

Blogs Gallery Careers

Contact Us




X
 


















 







 



 
1-267-800-0191














Free AI Consultation





























Your Trusted AI Development Company 



				We build meaningful AI Healthcare Solutions to shape the future of your business						






Get Free AI Consultation



















Drive Unstoppable Busine