<a href="https://colab.research.google.com/github/Zeeshan138063/rag/blob/main/RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# RAG
RAG (Retrieval-Augmented Generation) pipelines tackle AI hallucinations by combining information retrieval at query time with text generation. Grounding the answer in retrieved context helps models produce more accurate responses and reduces the chance of misleading or incorrect output. A powerful step forward in building reliable AI!

[More on RAG](https://www.linkedin.com/pulse/rag-retrieval-augmented-generation-pipelines-muhammad-zeeshan-oodvf/?trackingId=pXByMvHLQDW%2Fw11KVFg8LQ%3D%3D)



1.   Ingestion
2.   Retrieval
3.   Synthesis




# LangChain
***LangChain*** is a framework for developing applications powered by large language models (LLMs).
It provides utilities for working with text, embeddings, memory, and more.

In [None]:
!pip install langchain langchain-community

# Document loaders [#](https://python.langchain.com/v0.2/docs/integrations/document_loaders/):

---

Loaders are components that load or ingest data from various sources, such as files, databases, APIs, or web pages, and convert that data into a format that LangChain can process.

DocumentLoaders load data into the standard LangChain Document format.

*Each DocumentLoader has its own specific parameters, but they can all be invoked in the same way with the ***.load*** method.*

## Types of Loaders:


*   **File Loaders:** These load data from files like text files, PDFs, CSVs, JSON, etc.

*   **Database Loaders:** These load data from databases, converting rows or documents into a format that can be used by the language model.

* **Web Loaders:** These scrape or fetch content from web pages, transforming the retrieved text into a structured format.

* **API Loaders:** These fetch data from APIs and convert the responses into a usable format for further processing.

* **Customization:**
 LangChain allows you to create custom loaders if your data source doesn’t fit the pre-existing loaders. This flexibility ensures you can integrate almost any data source into your language model application.
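
To make the idea of a custom loader concrete, here is a minimal sketch using only the standard library. The `Document` dataclass and `LineLoader` class are hypothetical stand-ins written for illustration, not LangChain's own classes; they only mirror the pattern described above, where a loader exposes `.load` and returns documents made of `page_content` and `metadata`.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical custom loader: one Document per non-empty line of text."""

    def __init__(self, text: str, source: str = "in-memory"):
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[Document]:
        # Yield documents one at a time so large inputs need not fit in memory.
        for number, line in enumerate(self.text.splitlines()):
            if line.strip():
                yield Document(
                    page_content=line.strip(),
                    metadata={"source": self.source, "line": number},
                )

    def load(self) -> List[Document]:
        # Eager variant: collect everything lazy_load produces.
        return list(self.lazy_load())


docs = LineLoader("first line\n\nsecond line\n", source="notes.txt").load()
print(len(docs))
print(docs[1])
```

In LangChain itself, a custom loader would typically subclass the library's `BaseLoader` and implement `lazy_load`; the sketch above only shows the shape of that contract.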


### Preprocessing:
A loader may perform preprocessing steps such as text extraction, normalization, and format conversion so that the data is in a suitable state for model consumption. Tokenization is generally handled later in the pipeline rather than by the loader.

### Integration:
Once the data is loaded, it can be passed through various components of LangChain, such as text splitting, embedding generation, memory integration, or directly into a language model for processing.



---

#### Example Use Case

---


If you have a large set of PDF documents and you want to extract the text content for use in a language model application, you could use a PDF loader in LangChain to automate this process. The loader would read each PDF, extract the text, and format it in a way that the language model can use it for tasks like summarization, question answering, or information retrieval.

Loaders are a crucial part of building data pipelines in LangChain, ensuring that data is efficiently and correctly ingested into the system for further processing.

### [PDF Loader](https://python.langchain.com/v0.2/docs/integrations/document_loaders/pypdfloader/)
Let's load a PDF.

In [None]:
!pip install pypdf


In [None]:
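from langchain_community.document_loaders import PyPDFLoader

# Assumes Google Drive is already mounted at /content/drive
file_path = "/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf"
loader = PyPDFLoader(file_path)  # initialization
pages = loader.load()  # returns a list with one Document per page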

#### Each page is a Document
A Document contains text (`page_content`) and `metadata`.

In [None]:
len(pages)

In [None]:
page = pages[224]

In [None]:
print(page)

In [None]:
dir(page)

In [None]:
page.metadata

In [None]:
page.page_content[0:10]

In [None]:
page.json()

We can compare 2 pages.

In [None]:
pages[1]==pages[-1]

When loading with

```
extract_images=True
```
I also had to install the following dependencies.


In [None]:
!pip install onnxruntime-gpu
!pip install rapidocr-onnxruntime

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf", extract_images=True)
# docs_lazy = loader.lazy_load()
docs = await loader.aload()  # top-level await works inside a notebook cell

# Print each page; do not append to docs while iterating over it,
# otherwise the loop never terminates.
for doc in docs:
    print(doc.page_content)
    print(doc.metadata)

[Other Available Document Loaders](https://python.langchain.com/v0.2/docs/integrations/document_loaders/#pdfs)


1.  [PyPDF](https://python.langchain.com/v0.2/docs/integrations/document_loaders/pypdfloader)
 document loader  to load and parse PDFs
 > Supports PDFs

2.   [Unstructured](https://python.langchain.com/v0.2/docs/integrations/document_loaders/unstructured_file)
 document loader to load files of many types.
 > Unstructured supports loading of text files, powerpoints, html, pdfs, images, and more
3.   [Amazon Textract](https://python.langchain.com/v0.2/docs/integrations/document_loaders/amazon_textract/)
Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents.
It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Today, many companies manually extract data from scanned documents such as PDFs, images, tables, and forms, or through simple OCR software that requires manual configuration (which often must be updated when the form changes). To overcome these manual and expensive processes, Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort.
> Textract supports PDF, TIFF, PNG and JPEG format.
4.   [MathPix](https://python.langchain.com/v0.2/docs/integrations/document_loaders/mathpix/)
Uses MathPix to load PDFs.
>   Supports PDFs

5.     [PDFPlumber](https://python.langchain.com/v0.2/docs/integrations/document_loaders/pdfplumber/) Like PyMuPDF, the output Documents contain detailed ***metadata about the PDF and its pages***, and the loader returns one document per page.
>   Supports PDFs
6.   [PyPDFDirectory](https://python.langchain.com/v0.2/docs/integrations/document_loaders/pypdfdirectory)  Loads all PDF files from a specific directory.
>   Supports PDFs
7.   [PyPDFium2](https://python.langchain.com/v0.2/docs/integrations/document_loaders/pypdfium2/)  Load PDF files using PyPDFium2.
>   Supports PDFs
8.    [UnstructuredPDFLoader](https://python.langchain.com/v0.2/docs/integrations/document_loaders/unstructured_pdfloader/)  Under the hood, Unstructured creates different "elements" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying mode="elements"
>   Supports PDFs
9.   [PyMuPDF](https://python.langchain.com/v0.2/docs/integrations/document_loaders/pymupdf/) PyMuPDF is optimized for ***speed, and contains detailed metadata about the PDF and its pages.***   It returns one document per page.

  >   Supports PDFs

10.  [PDFMiner](https://python.langchain.com/v0.2/docs/integrations/document_loaders/pdfminer/)   Loads PDF files using PDFMiner; PDFMiner can also be used to generate HTML text.
>   Supports PDFs


All of the other PDF loaders can also be used to fetch remote PDFs.
We can update the built-in metadata and add more information as needed.
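
Updating metadata amounts to editing a dictionary on each document. The sketch below uses `SimpleNamespace` objects as stand-ins for loaded pages so that it runs without any PDF; the file name and the `category` and `char_count` keys are made up for illustration.

```python
from types import SimpleNamespace

# Stand-ins for loaded pages: each has page_content and a metadata dict,
# like the Document objects returned by a loader.
docs = [
    SimpleNamespace(page_content="page one text", metadata={"source": "book.pdf", "page": 0}),
    SimpleNamespace(page_content="page two text", metadata={"source": "book.pdf", "page": 1}),
]

for doc in docs:
    doc.metadata["page"] += 1                           # update a built-in field (1-based pages)
    doc.metadata["category"] = "finance"                # hypothetical extra field
    doc.metadata["char_count"] = len(doc.page_content)  # derived field

print(docs[0].metadata)
```

The same loop should work on real loader output, since each Document exposes `metadata` as a plain dictionary.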



Below are some examples using the loaders mentioned above.

In [None]:
!pip install pdfplumber

In [None]:
from langchain_community.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader("/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf")
docs = loader.load()
docs[0]

In [None]:
docs[4]

Document(metadata={'source': '/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf', 'file_path': '/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf', 'page': 4, 'total_pages': 241, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe InDesign CS6 (Macintosh)', 'producer': 'Adobe PDF Library 10.0.1', 'creationDate': "D:20140328154719-07'00'", 'modDate': "D:20140328154734-07'00'", 'trapped': ''}, page_content='If you purchase this book without a cover, or purchase a PDF, jpg, or tiff copy of this book, \nit is likely stolen property or a counterfeit. In that case, neither the authors, the publisher, \nnor any of their employees or agents has received any payment for the copy. Furthermore, \ncounterfeiting is a known avenue of financial support for organized crime and terrorist \ngroups. We urge you to please not purchase any such copy and to report any instance of \nsomeone selling such copies to Plata Publishing LLC.\nThis 

In [None]:
!pip install unstructured
!pip install pillow-heif
!pip install pi-heif # install the correct package
!pip install unstructured[local-inference]


In [None]:
from langchain_community.document_loaders import UnstructuredPDFLoader
file_path = "/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf"
loader = UnstructuredPDFLoader(file_path, mode="elements")
data = loader.load()
data[0]


In [None]:
!pip install -qU langchain-community pymupdf

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(file_path=file_path)
docs = loader.load()
docs[0]

Document(metadata={'source': '/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf', 'file_path': '/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf', 'page': 0, 'total_pages': 241, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe InDesign CS6 (Macintosh)', 'producer': 'Adobe PDF Library 10.0.1', 'creationDate': "D:20140328154719-07'00'", 'modDate': "D:20140328154734-07'00'", 'trapped': ''}, page_content='Robert T. Kiyosaki\nWhat The Rich Teach Their Kids About Money – \nThat The Poor And Middle Class Do Not!\n')

In [None]:
docs[4]

Document(metadata={'source': '/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf', 'file_path': '/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf', 'page': 4, 'total_pages': 241, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe InDesign CS6 (Macintosh)', 'producer': 'Adobe PDF Library 10.0.1', 'creationDate': "D:20140328154719-07'00'", 'modDate': "D:20140328154734-07'00'", 'trapped': ''}, page_content='If you purchase this book without a cover, or purchase a PDF, jpg, or tiff copy of this book, \nit is likely stolen property or a counterfeit. In that case, neither the authors, the publisher, \nnor any of their employees or agents has received any payment for the copy. Furthermore, \ncounterfeiting is a known avenue of financial support for organized crime and terrorist \ngroups. We urge you to please not purchase any such copy and to report any instance of \nsomeone selling such copies to Plata Publishing LLC.\nThis 

In [None]:
!pip install unstructured[pdf]



### Testing with Urdu Language Documents

In [None]:
from langchain_community.document_loaders import PyPDFium2Loader


file_path = "/content/drive/MyDrive/Colab Notebooks/Ahtrame mahobt by farah tahir.pdf"
loader = PyPDFium2Loader(file_path)
data = loader.load()
for doc in data:
  print(doc.page_content)
  print(doc.metadata)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### [CSV](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/csv/)
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.



In [16]:
from langchain_community.document_loaders.csv_loader import CSVLoader

file_path='/content/drive/MyDrive/Colab Notebooks/business-financial-data-march-2024-csv.csv'
loader = CSVLoader(file_path)
data = loader.load()
for doc in data[:2]:  # print the first 2 records
  print(doc.page_content)
  print(doc.metadata)

Series_reference: BDCQ.SF1AA2CA
Period: 2016.06
Data_value: 1116.386
Suppressed: 
STATUS: F
UNITS: Dollars
Magnitude: 6
Subject: Business Data Collection - BDC
Group: Industry by financial variable (NZSIOC Level 2)
Series_title_1: Sales (operating income)
Series_title_2: Forestry and Logging
Series_title_3: Current prices
Series_title_4: Unadjusted
Series_title_5: 
{'source': '/content/drive/MyDrive/Colab Notebooks/business-financial-data-march-2024-csv.csv', 'row': 0}
Series_reference: BDCQ.SF1AA2CA
Period: 2016.09
Data_value: 1070.874
Suppressed: 
STATUS: F
UNITS: Dollars
Magnitude: 6
Subject: Business Data Collection - BDC
Group: Industry by financial variable (NZSIOC Level 2)
Series_title_1: Sales (operating income)
Series_title_2: Forestry and Logging
Series_title_3: Current prices
Series_title_4: Unadjusted
Series_title_5: 
{'source': '/content/drive/MyDrive/Colab Notebooks/business-financial-data-march-2024-csv.csv', 'row': 1}


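We can specify the delimiter, field names, etc. Note that `fieldnames` replaces the header: names are assigned by position, the header row is then read as data (row 0 in the output below), and any columns beyond the listed names are collected under the `None` key.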

In [18]:

loader = CSVLoader(file_path,
                   csv_args={
                       'delimiter': ',',
                       'quotechar': '"',
                       # Names are matched to columns by position, not by header text
                       'fieldnames': ['Series_reference', 'Period', 'STATUS', 'Data_value']
                   })
data = loader.load()
for doc in data[:2]:  # print the first 2 records
  print(doc.page_content)
  print(doc.metadata)

Series_reference: Series_reference
Period: Period
STATUS: Data_value
Data_value: Suppressed
None: STATUS,UNITS,Magnitude,Subject,Group,Series_title_1,Series_title_2,Series_title_3,Series_title_4,Series_title_5
{'source': '/content/drive/MyDrive/Colab Notebooks/business-financial-data-march-2024-csv.csv', 'row': 0}
Series_reference: BDCQ.SF1AA2CA
Period: 2016.06
STATUS: 1116.386
Data_value: 
None: F,Dollars,6,Business Data Collection - BDC,Industry by financial variable (NZSIOC Level 2),Sales (operating income),Forestry and Logging,Current prices,Unadjusted,
{'source': '/content/drive/MyDrive/Colab Notebooks/business-financial-data-march-2024-csv.csv', 'row': 1}
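
The output above follows from how Python's `csv.DictReader` treats `fieldnames`, which is where `csv_args` ends up. The sketch below reproduces the effect with the standard library alone on a tiny made-up CSV string, then shows one way to avoid reading the header as data: skip the first line before handing the stream to the reader.

```python
import csv
import io

raw = "id,name,city,country\n1,Ada,London,UK\n"  # made-up sample data

# Passing fieldnames tells DictReader there is no header row to consume,
# so the header line comes back as the first record. Extra columns are
# gathered into a list under the key None (the default restkey).
rows = list(csv.DictReader(io.StringIO(raw), fieldnames=["id", "name"]))
print(rows[0])
print(rows[1])

# Skipping the header line first keeps real data in the first record,
# and naming every column avoids the None bucket.
stream = io.StringIO(raw)
next(stream)  # discard the header line
renamed = list(csv.DictReader(stream, fieldnames=["id", "name", "city", "country"]))
print(renamed[0])
```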



**[Parameters](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.csv_loader.CSVLoader.html)**


 * **file_path** (Union[str, Path]) – The path to the CSV file.

* **source_column** (Optional[str]) – The name of the column in the CSV file to use as the source. Optional. Defaults to None.

* **metadata_columns** (Sequence[str]) – A sequence of column names to use as metadata. Optional.

* **csv_args** (Optional[Dict]) – A dictionary of arguments to pass to the csv.DictReader. Optional. Defaults to None.

* **encoding** (Optional[str]) – The encoding of the CSV file. Optional. Defaults to None.

* **autodetect_encoding** (bool) – Whether to try to autodetect the file encoding.

* **content_columns** (Sequence[str]) – A sequence of column names to use for the document content. If not present, use all columns that are not part of the metadata.
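
To see how these parameters interact, the sketch below splits one CSV row into content and metadata. `row_to_document` is a hypothetical helper that approximates the behaviour the parameter descriptions suggest; it is not the library's code, and the sample row reuses values from the output shown earlier.

```python
import csv
import io


def row_to_document(row, index, file_path, source_column=None,
                    metadata_columns=(), content_columns=()):
    """Hypothetical helper: split one CSV row into (page_content, metadata)."""
    if content_columns:
        # Only the explicitly requested columns go into the content.
        keys = [k for k in row if k in content_columns]
    else:
        # Otherwise use every column that is not reserved for metadata.
        keys = [k for k in row if k not in metadata_columns]
    page_content = "\n".join(f"{k}: {row[k]}" for k in keys)

    # The source defaults to the file path unless a source column is named.
    source = row[source_column] if source_column else file_path
    metadata = {"source": source, "row": index}
    for column in metadata_columns:
        metadata[column] = row[column]
    return page_content, metadata


raw = ("Series_reference,Period,Data_value,UNITS\n"
       "BDCQ.SF1AA2CA,2016.06,1116.386,Dollars\n")
row = next(csv.DictReader(io.StringIO(raw)))

content, metadata = row_to_document(
    row, 0, "data.csv",
    source_column="Series_reference",
    metadata_columns=("UNITS",),
)
print(content)
print(metadata)
```

Moving columns such as units or identifiers into metadata keeps the text that gets embedded focused on the values a query is likely to match.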


### [Webpages](https://python.langchain.com/v0.2/docs/integrations/document_loaders/#webpages)
Let's load some webpages.

This covers how to use WebBaseLoader to load all text from HTML webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as IMSDbLoader, AZLyricsLoader, and CollegeConfidentialLoader.

If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using FireCrawlLoader or the faster option SpiderLoader.

[Available Webpage Loaders](https://python.langchain.com/v0.2/docs/integrations/document_loaders/#webpages)


1.  [Web](https://python.langchain.com/v0.2/docs/integrations/document_loaders/web_base)
Uses urllib and BeautifulSoup to load and parse HTML web pages. It loads only the page at the given URL and does not follow links.

2.   [RecursiveURL](https://python.langchain.com/v0.2/docs/integrations/document_loaders/recursive_url)
Recursively scrapes all child links from a root URL

3.   [Sitemap](https://python.langchain.com/v0.2/docs/integrations/document_loaders/sitemap)
Scrapes all pages on a given sitemap

4.   [Firecrawl](https://python.langchain.com/v0.2/docs/integrations/document_loaders/firecrawl)
API service that can be deployed locally; the hosted version has free credits.
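
Conceptually, a web loader turns raw HTML into plain text plus metadata. The sketch below illustrates that step with the standard library's `html.parser` on an inline HTML string. This is a swapped-in technique chosen so the example runs without extra installs; `WebBaseLoader` itself relies on BeautifulSoup. The `TextExtractor` class, the sample HTML, and the example URL are all made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and the <title>, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.title = ""
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return  # ignore whitespace and script/style bodies
        if self._in_title:
            self.title = text
        else:
            self.parts.append(text)


html = """<html><head><title>Demo page</title><style>p {color: red}</style></head>
<body><h1>Hello</h1><p>Loaders turn pages into text.</p><script>var x = 1;</script></body></html>"""

parser = TextExtractor()
parser.feed(html)
parser.close()

page_content = "\n".join(parser.parts)
metadata = {"source": "https://example.com/demo", "title": parser.title}
print(page_content)
print(metadata)
```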

In [19]:
!pip install -qU langchain_community beautifulsoup4

In [None]:
from langchain_community.document_loaders import WebBaseLoader
from pprint import pprint # import the pprint function from the standard library

url="https://www.xevensolutions.com/"
# url="https://www.amazon.com/dp/B091GCJ4RT/"
# url="https://www.costco.co.uk/Garden-Sheds-Patio/Barbecues-and-Firepits/Gas-Barbecues/Kirkland-Signature-7-Burner-Mini-Island-Gas-Barbecue-Grill-Cover/p/2127649"
loader = WebBaseLoader(url)
docs = loader.load()
for doc in docs:
  pprint(doc.page_content)
  pprint(doc.metadata)

* We can also pass in a list of pages to load from.

```
loader_multiple_pages = WebBaseLoader(["https://www.espn.com/", "https://google.com"])
```

* We can pass proxies, bypass SSL verification, and much more.
* **Load multiple URLs concurrently**
  * We can speed up scraping by fetching and parsing multiple URLs concurrently.
  * There are reasonable limits to concurrent requests, defaulting to 2 per second. If you aren't concerned about being a good citizen, or you control the server you are scraping and don't care about load, you can change the `requests_per_second` parameter to increase the maximum number of concurrent requests. Note that while this speeds up scraping, it may cause the server to block you. Be careful!

```
loader.requests_per_second = 1
```

* **Loading an XML file, or using a different BeautifulSoup parser**

```
loader.default_parser = "xml"
```

* **Using proxies**
  * Sometimes you might need to use proxies to get around IP blocks. You can pass in a dictionary of proxies to the loader (and `requests` underneath) to use them.
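
The proxies dictionary mentioned above follows the format used by `requests`: a mapping from URL scheme to proxy URL. Below is a minimal sketch of building one; `build_proxies`, the host, the port, and the credentials are all hypothetical placeholders.

```python
from urllib.parse import quote


def build_proxies(host, port, username=None, password=None):
    """Hypothetical helper: return a requests-style proxies mapping."""
    auth = ""
    if username and password:
        # Percent-encode credentials so characters like '@' do not break the URL.
        auth = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    proxy_url = f"http://{auth}{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("proxy.example.com", 8080, "user", "p@ss")
print(proxies)

# With LangChain installed, the mapping could then be handed to the loader,
# for example: loader = WebBaseLoader("https://example.com", proxies=proxies)
```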



