<a href="https://colab.research.google.com/github/Zeeshan138063/rag/blob/main/RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# RAG
RAG (Retrieval-Augmented Generation) pipelines reduce AI hallucinations by retrieving relevant information at query time and passing it to the model as context for text generation. Grounding the response in retrieved text makes the output more accurate and less likely to be misleading or incorrect.

[More on RAG](https://www.linkedin.com/pulse/rag-retrieval-augmented-generation-pipelines-muhammad-zeeshan-oodvf/?trackingId=pXByMvHLQDW%2Fw11KVFg8LQ%3D%3D)



A RAG pipeline has three stages:

1.   Ingestion
2.   Retrieval
3.   Synthesis
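
The three stages can be sketched in plain Python. This is a toy illustration only: word overlap stands in for embedding similarity, the last step builds the prompt instead of calling an LLM, and the names `ingest`, `retrieve`, and `build_prompt` are made up for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def ingest(texts):
    """Ingestion: store each text with its word set (a stand-in for an embedding)."""
    return [(text, tokenize(text)) for text in texts]

def retrieve(index, question, k=1):
    """Retrieval: return the k texts sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(index, key=lambda item: len(q & item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, contexts):
    """Synthesis: build the grounded prompt that would be sent to an LLM."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"

index = ingest([
    "Loaders read files and return Document objects.",
    "Embeddings map text to vectors.",
])
question = "What do loaders return?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real pipeline the index would be a vector store and the prompt would go to a language model. The rest of this notebook covers the first part of ingestion: loading documents.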




# LangChain
***LangChain*** is a framework for developing applications powered by large language models (LLMs). It provides utilities for working with text, embeddings, memory, and more.

In [None]:
!pip install langchain langchain-community

# [Loaders](https://python.langchain.com/v0.2/docs/integrations/document_loaders/):

---


Loaders are components that load (ingest) data from sources such as files, databases, APIs, or web pages, and convert it into a format that LangChain can process.

DocumentLoaders load data into the standard LangChain Document format.

Each DocumentLoader has its own specific parameters, but they can all be invoked in the same way with the .load method.

## Types of Loaders:


*   **File Loaders:** These load data from files like text files, PDFs, CSVs, JSON, etc.

*   **Database Loaders:** These load data from databases, converting rows or documents into a format that can be used by the language model.

* **Web Loaders:** These scrape or fetch content from web pages, transforming the retrieved text into a structured format.

* **API Loaders:** These fetch data from APIs and convert the responses into a usable format for further processing.

* **Customization:**
 LangChain allows you to create custom loaders if your data source doesn’t fit the pre-existing loaders. This flexibility ensures you can integrate almost any data source into your language model application.
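
As a rough sketch of what a custom loader looks like, the example below turns each CSV row into a document. To stay dependency-free it defines its own simplified `Document` stand-in, and `CSVRowLoader` is a hypothetical name. In LangChain itself you would typically subclass `BaseLoader` and implement `lazy_load()`.

```python
import csv
import io
from dataclasses import dataclass, field

@dataclass
class Document:
    """Simplified stand-in for LangChain's Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class CSVRowLoader:
    """Hypothetical custom loader: one Document per CSV row."""

    def __init__(self, csv_text, source="inline.csv"):
        self.csv_text = csv_text
        self.source = source

    def lazy_load(self):
        """Yield Documents one at a time so large inputs need not fit in memory."""
        reader = csv.DictReader(io.StringIO(self.csv_text))
        for row_number, row in enumerate(reader):
            content = "\n".join(f"{key}: {value}" for key, value in row.items())
            yield Document(content, {"source": self.source, "row": row_number})

    def load(self):
        """Eagerly collect everything lazy_load yields into a list."""
        return list(self.lazy_load())

rows = CSVRowLoader("name,role\nAda,engineer\n").load()
print(rows[0].page_content)
print(rows[0].metadata)
```

Keeping the parsing logic in `lazy_load()` and deriving `load()` from it means the same loader can be consumed as a stream or as a list.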


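### Preprocessing:
Loaders handle the format conversion needed to turn a raw source into text, for example extracting the text of each page from a PDF. Steps such as tokenization and normalization are typically handled by later components (text splitters, embedding models) rather than by the loader itself.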

### Integration:
Once the data is loaded, it can be passed through various components of LangChain, such as text splitting, embedding generation, memory integration, or directly into a language model for processing.
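
To make the text-splitting step concrete, here is a minimal character-based splitter with overlapping chunks. It is a simplified stand-in for LangChain's text splitters, not their implementation; the function name `split_text` and the sample text are assumptions made for this example.

```python
def split_text(text, chunk_size=40, overlap=10):
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

page_text = "Loaders ingest raw data. Splitters cut it into chunks. Embeddings index the chunks."
chunks = split_text(page_text, chunk_size=40, overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a chunk boundary still appears whole in at least one of the two chunks when it is shorter than the overlap.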



---

#### Example Use Case

---


If you have a large set of PDF documents and you want to extract the text content for use in a language model application, you could use a PDF loader in LangChain to automate this process. The loader would read each PDF, extract the text, and format it in a way that the language model can use it for tasks like summarization, question answering, or information retrieval.

Loaders are a crucial part of building data pipelines in LangChain, ensuring that data is efficiently and correctly ingested into the system for further processing.

### [PDF Loader](https://python.langchain.com/v0.2/docs/integrations/document_loaders/pypdfloader/)
Let's load the PDF.

In [None]:
!pip install pypdf

In [None]:
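from langchain_community.document_loaders import PyPDFLoader

# Initialize the loader with the path to the PDF (here, a file on a mounted Google Drive)
loader = PyPDFLoader("/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf")
# load() returns a list with one Document per page
pages = loader.load()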

#### Each page is a Document
A Document contains text (`page_content`) and `metadata`.

In [None]:
len(pages)

241

In [None]:
page = pages[224]

In [None]:
print(page)

In [None]:
dir(page)

In [None]:
page.metadata

In [None]:
page.page_content[0:10]

In [None]:
page.json()

We can also compare two pages. Two Documents are equal when their content and metadata match.

In [None]:
pages[1]==pages[-1]

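### PDF Loader with image extraction
Let's load the PDF again, this time with `extract_images=True` so that text inside images is extracted with OCR as well.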

In [None]:
!pip install pypdf
!pip install onnxruntime-gpu
!pip install rapidocr-onnxruntime

In [None]:
from langchain_community.document_loaders import PyPDFLoader

# extract_images=True also runs OCR on images embedded in the PDF (uses rapidocr-onnxruntime)
loader = PyPDFLoader("/content/drive/MyDrive/Colab Notebooks/Rich-Dad-Poor-Dad.pdf", extract_images=True)

# Alternative: loader.lazy_load() yields pages one at a time instead of building a list
# aload() is the async variant of load(); top-level await works in notebook cells
docs = await loader.aload()

# Print the text and metadata of every page
for doc in docs:
    print(doc.page_content)
    print(doc.metadata)
