<a href="https://colab.research.google.com/github/Zeeshan138063/rag/blob/main/RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# RAG
RAG (Retrieval-Augmented Generation) pipelines reduce AI hallucinations by retrieving relevant information at query time and passing it to the model as context for text generation. Grounding the output in retrieved sources makes responses more accurate and lowers the chance of misleading or incorrect answers.

[More on RAG](https://www.linkedin.com/pulse/rag-retrieval-augmented-generation-pipelines-muhammad-zeeshan-oodvf/?trackingId=pXByMvHLQDW%2Fw11KVFg8LQ%3D%3D)



A RAG pipeline has three stages:

1.   Ingestion
2.   Retrieval
3.   Synthesis
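
The three stages can be sketched in plain Python. This is a toy illustration only: word overlap stands in for real embedding-based retrieval, the "synthesis" step just builds the prompt a language model would receive, and every function name here is hypothetical rather than part of LangChain.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ingest(texts):
    """Ingestion: store each text alongside its token set."""
    return [(text, tokenize(text)) for text in texts]


def retrieve(index, question, k=1):
    """Retrieval: rank stored texts by word overlap with the question."""
    query_tokens = tokenize(question)
    ranked = sorted(index, key=lambda item: len(query_tokens & item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def synthesize(question, context):
    """Synthesis: build the prompt that would be sent to a language model."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


index = ingest([
    "Loaders read files into Document objects.",
    "Retrievers find the chunks most relevant to a query.",
])
question = "What do retrievers find?"
context = retrieve(index, question)
prompt = synthesize(question, context)
print(prompt)
```

In a real pipeline, each stage would be replaced by the corresponding LangChain components (loaders and splitters, a vector store retriever, and an LLM call).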




# LangChain
***LangChain*** is a framework for developing applications powered by large language models (LLMs).
It provides utilities for working with text, embeddings, memory, and more.

In [None]:
!pip install langchain langchain-community

# Document loaders [#](https://python.langchain.com/v0.2/docs/integrations/document_loaders/):

---

Loaders are components that load or ingest data from various sources, such as files, databases, APIs, or web pages, and convert that data into a format that LangChain can process.

DocumentLoaders load data into the standard LangChain Document format.

*Each DocumentLoader has its own specific parameters, but they can all be invoked in the same way with the ***.load*** method.*

## Types of Loaders:


*   **File Loaders:** These load data from files like text files, PDFs, CSVs, JSON, etc.

*   **Database Loaders:** These load data from databases, converting rows or documents into a format that can be used by the language model.

* **Web Loaders:** These scrape or fetch content from web pages, transforming the retrieved text into a structured format.

* **API Loaders:** These fetch data from APIs and convert the responses into a usable format for further processing.

* **Customization:**
 LangChain allows you to create custom loaders if your data source doesn’t fit the pre-existing loaders. This flexibility ensures you can integrate almost any data source into your language model application.
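
As a minimal sketch of a custom loader, the example below subclasses `BaseLoader` and implements `lazy_load`, relying on the base class's `load()` to collect the results. `LineLoader` is a hypothetical name invented for this illustration, and the `try`/`except` fallback defines small stand-ins only so the snippet also runs where LangChain is not installed. It assumes a recent `langchain-core` in which `load()` is built on top of `lazy_load()`.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Iterator

try:
    from langchain_core.document_loaders import BaseLoader
    from langchain_core.documents import Document
except ImportError:
    # Minimal stand-ins so the sketch also runs without LangChain installed.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

    class BaseLoader:
        def load(self):
            return list(self.lazy_load())


class LineLoader(BaseLoader):
    """Hypothetical loader: one Document per non-empty line of a text file."""

    def __init__(self, file_path: str):
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:
        with open(self.file_path, encoding="utf-8") as handle:
            for number, line in enumerate(handle, start=1):
                if line.strip():
                    yield Document(
                        page_content=line.strip(),
                        metadata={"source": self.file_path, "line": number},
                    )


# Usage: write a small temporary file and load it.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\n\nthird line\n")

line_docs = LineLoader(tmp.name).load()
os.remove(tmp.name)
print([doc.metadata["line"] for doc in line_docs])
```

Because the custom class exposes the same `.load` method as the built-in loaders, downstream code does not need to know which kind of loader produced the Documents.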


###  Preprocessing:
Depending on the loader, preprocessing steps such as text extraction, normalization, and format conversion are applied so the data is ready for model consumption. Tokenization is usually left to later stages (text splitters or the model itself).

### Integration:
Once the data is loaded, it can be passed through various components of LangChain, such as text splitting, embedding generation, memory integration, or directly into a language model for processing.
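
To make the text-splitting step concrete, here is a rough plain-Python sketch of fixed-size chunking with overlap. It is a simplified stand-in, not LangChain's own splitter, and `split_text` with its parameters is a hypothetical helper written for this illustration.

```python
def split_text(text, chunk_size=40, overlap=10):
    """Cut text into chunks of at most chunk_size characters.

    Each chunk shares `overlap` characters with the one before it, so a
    sentence cut at a boundary still appears whole in one of the chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


page_text = "Loaders produce Documents; splitters cut them into chunks for embedding."
chunks = split_text(page_text, chunk_size=40, overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

In practice the input would be `page_content` from a loaded Document, and each chunk would then go on to embedding generation.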



---

#### Example Use Case

---


If you have a large set of PDF documents and you want to extract the text content for use in a language model application, you could use a PDF loader in LangChain to automate this process. The loader would read each PDF, extract the text, and format it in a way that the language model can use it for tasks like summarization, question answering, or information retrieval.

Loaders are a crucial part of building data pipelines in LangChain, ensuring that data is efficiently and correctly ingested into the system for further processing.
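
The use case above could look roughly like the sketch below. `find_pdfs` and `load_pdf_folder` are hypothetical helpers written for this illustration. `load_pdf_folder` assumes `langchain-community` and `pypdf` are installed and is defined but not called here, so the demo at the bottom only exercises the file discovery step.

```python
import tempfile
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")


def load_pdf_folder(folder):
    """Load every PDF in `folder` with PyPDFLoader (one Document per page)."""
    # Imported here so find_pdfs still works where LangChain is not installed.
    from langchain_community.document_loaders import PyPDFLoader

    docs = []
    for path in find_pdfs(folder):
        docs.extend(PyPDFLoader(str(path)).load())
    return docs


# Demo of the discovery step only, using empty placeholder files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.pdf", "b.pdf", "notes.txt"):
        (Path(folder) / name).touch()
    found = [p.name for p in find_pdfs(folder)]

print(found)
```

The resulting list of Documents could then be handed to a summarization, question-answering, or retrieval step.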

### [PDF Loader](https://python.langchain.com/v0.2/docs/integrations/document_loaders/pypdfloader/)
Let's load the PDF.

In [None]:
!pip install pypdf

In [None]:
from langchain_community.document_loaders import PyPDFLoader

# Initialization: the path assumes Google Drive is mounted at /content/drive
file_path = "/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf"
loader = PyPDFLoader(file_path)
pages = loader.load()  # one Document per page

#### Each page is a Document
A Document contains text (`page_content`) and `metadata`.

In [None]:
len(pages)

In [None]:
page = pages[224]

In [None]:
print(page)

In [None]:
dir(page)

In [None]:
page.metadata

In [None]:
page.page_content[0:10]

In [None]:
page.json()

We can compare 2 pages.

In [None]:
pages[1]==pages[-1]
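
As a small illustration of what that comparison checks: two Documents compare equal when their fields match, so pages with the same text but different metadata are not equal. The `try`/`except` fallback defines a stand-in class with the same two fields, only so the snippet also runs where LangChain is not installed.

```python
from dataclasses import dataclass, field

try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two fields, used only if LangChain is missing.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


first = Document(page_content="Same text", metadata={"page": 1})
same = Document(page_content="Same text", metadata={"page": 1})
other_page = Document(page_content="Same text", metadata={"page": 2})

print(first == same)        # identical content and metadata
print(first == other_page)  # same content, different page number
```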

When loading with

```
 extract_images=True
```
I also had to install the following dependencies.


In [None]:
!pip install onnxruntime-gpu
!pip install rapidocr-onnxruntime

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf", extract_images=True)
# docs_lazy = loader.lazy_load()
docs = await loader.aload()  # top-level await works in a notebook cell

# Do not append to docs inside this loop: growing a list while iterating over it never terminates
for doc in docs:
    print(doc.page_content)
    print(doc.metadata)
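
The commented-out `lazy_load()` call hints at a lighter alternative: it returns an iterator, so Documents can be consumed one at a time instead of all at once. The sketch below shows the idea with a hypothetical `first_n` helper and a stand-in `CountingLoader` (neither is part of LangChain). Any object with a `lazy_load()` method, including a real loader, could be passed in its place.

```python
from itertools import islice


def first_n(loader, n):
    """Return the first n items from a loader without producing the rest."""
    return list(islice(loader.lazy_load(), n))


class CountingLoader:
    """Hypothetical stand-in that records how many items it has produced."""

    def __init__(self, total):
        self.total = total
        self.produced = 0

    def lazy_load(self):
        for i in range(self.total):
            self.produced += 1
            yield f"page {i}"


demo = CountingLoader(total=241)
preview = first_n(demo, 3)
print(preview, demo.produced)
```

Only three items are produced even though the stand-in could yield 241, which is the point of loading lazily when a quick preview is all that is needed.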

[Other Available Document Loaders](https://python.langchain.com/v0.2/docs/integrations/document_loaders/#pdfs)


1.  [PyPDF](https://python.langchain.com/v0.2/docs/integrations/document_loaders/pypdfloader)
 document loader  to load and parse PDFs
 > Supports PDFs

2.   [Unstructured](https://python.langchain.com/v0.2/docs/integrations/document_loaders/unstructured_file)
 document loader to load files of many types.
 > Unstructured supports loading of text files, powerpoints, html, pdfs, images, and more
3.   [Amazon Textract](https://python.langchain.com/v0.2/docs/integrations/document_loaders/amazon_textract/)
Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents.
It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Today, many companies manually extract data from scanned documents such as PDFs, images, tables, and forms, or through simple OCR software that requires manual configuration (which often must be updated when the form changes). To overcome these manual and expensive processes, Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort.
> Textract supports PDF, TIFF, PNG and JPEG format.
4.   [MathPix](https://python.langchain.com/v0.2/docs/integrations/document_loaders/mathpix/)
Uses MathPix to load PDFs.
>   Supports PDFs

5.     [PDFPlumber](https://python.langchain.com/v0.2/docs/integrations/document_loaders/pdfplumber/) Like PyMuPDF, this loader returns one document per page, and the output Documents contain detailed ***metadata about the PDF and its pages***.
>   Supports PDFs
6.   [PyPDFDirectory](https://python.langchain.com/v0.2/docs/integrations/document_loaders/pypdfdirectory)  Loads all PDF files from a specific directory.
>   Supports PDFs
7.   [PyPDFium2](https://python.langchain.com/v0.2/docs/integrations/document_loaders/pypdfium2/)  Load PDF files using PyPDFium2.
>   Supports PDFs
8.    [UnstructuredPDFLoader](https://python.langchain.com/v0.2/docs/integrations/document_loaders/unstructured_pdfloader/)  Under the hood, Unstructured creates different "elements" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying mode="elements"
>   Supports PDFs
9.   [PyMuPDF](https://python.langchain.com/v0.2/docs/integrations/document_loaders/pymupdf/PyMuPDF) is optimized for ***speed, and contains detailed metadata about the PDF and its pages.***   It returns one document per page.

  >   Supports PDFs

10.  [PDFMiner](https://python.langchain.com/v0.2/docs/integrations/document_loaders/pdfminer/)   Loads PDF files using PDFMiner. PDFMiner can also be used to generate HTML text from a PDF.
>   Supports PDFs


Several of these PDF loaders can also fetch remote PDFs when given a URL in place of a local path.


Below are some examples that use the loaders mentioned above.

In [None]:
!pip install pdfplumber

In [None]:
from langchain_community.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader("/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf")
docs = loader.load()
docs[0]

In [10]:
docs[4]

Document(metadata={'source': '/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf', 'file_path': '/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf', 'page': 4, 'total_pages': 241, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe InDesign CS6 (Macintosh)', 'producer': 'Adobe PDF Library 10.0.1', 'creationDate': "D:20140328154719-07'00'", 'modDate': "D:20140328154734-07'00'", 'trapped': ''}, page_content='If you purchase this book without a cover, or purchase a PDF, jpg, or tiff copy of this book, \nit is likely stolen property or a counterfeit. In that case, neither the authors, the publisher, \nnor any of their employees or agents has received any payment for the copy. Furthermore, \ncounterfeiting is a known avenue of financial support for organized crime and terrorist \ngroups. We urge you to please not purchase any such copy and to report any instance of \nsomeone selling such copies to Plata Publishing LLC.\nThis 

In [None]:
!pip install unstructured
!pip install pillow-heif
!pip install pi-heif # install the correct package
!pip install unstructured[local-inference]


In [None]:
from langchain_community.document_loaders import UnstructuredPDFLoader
file_path = "/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf"
loader = UnstructuredPDFLoader(file_path, mode="elements")
data = loader.load()
data[0]


In [None]:
!pip install -qU langchain-community pymupdf

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(file_path=file_path)
docs = loader.load()
docs[0]

Document(metadata={'source': '/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf', 'file_path': '/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf', 'page': 0, 'total_pages': 241, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe InDesign CS6 (Macintosh)', 'producer': 'Adobe PDF Library 10.0.1', 'creationDate': "D:20140328154719-07'00'", 'modDate': "D:20140328154734-07'00'", 'trapped': ''}, page_content='Robert T. Kiyosaki\nWhat The Rich Teach Their Kids About Money – \nThat The Poor And Middle Class Do Not!\n')

In [9]:
docs[4]

Document(metadata={'source': '/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf', 'file_path': '/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf', 'page': 4, 'total_pages': 241, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe InDesign CS6 (Macintosh)', 'producer': 'Adobe PDF Library 10.0.1', 'creationDate': "D:20140328154719-07'00'", 'modDate': "D:20140328154734-07'00'", 'trapped': ''}, page_content='If you purchase this book without a cover, or purchase a PDF, jpg, or tiff copy of this book, \nit is likely stolen property or a counterfeit. In that case, neither the authors, the publisher, \nnor any of their employees or agents has received any payment for the copy. Furthermore, \ncounterfeiting is a known avenue of financial support for organized crime and terrorist \ngroups. We urge you to please not purchase any such copy and to report any instance of \nsomeone selling such copies to Plata Publishing LLC.\nThis 

### Webpages
Let's load web pages.

[Available Webpage Loaders](https://python.langchain.com/v0.2/docs/integrations/document_loaders/#webpages)


