### Data Ingestion

In [1]:
### Document Structure

from langchain_core.documents import Document # import Document class

In [3]:
# define object of Document class

doc = Document(
    page_content = "This is the content of the document.",
    # Metadata can make retrieval more efficient and lets you filter documents by specific criteria.
    metadata = {
        "source": "rag_document.txt",
        "pages" : 10,
        "author": "Navneeth"
    }  
)
doc

Document(metadata={'source': 'rag_document.txt', 'pages': 10, 'author': 'Navneeth'}, page_content='This is the content of the document.')
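
Metadata starts to pay off once there is more than one document. The following is a minimal sketch of filtering on a metadata field. `SimpleDoc` is a hypothetical stand-in that mirrors only the `page_content` and `metadata` fields of `Document`, so the snippet runs without LangChain installed, and `filter_by_metadata` is an illustrative helper, not a LangChain API. The second document's values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs, **criteria):
    """Return the docs whose metadata matches every key=value pair given."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


sample_docs = [
    SimpleDoc(
        "This is the content of the document.",
        {"source": "rag_document.txt", "pages": 10, "author": "Navneeth"},
    ),
    # Made-up second document, for illustration only
    SimpleDoc(
        "Notes on vector stores.",
        {"source": "vector_notes.txt", "pages": 3, "author": "Someone Else"},
    ),
]

matches = filter_by_metadata(sample_docs, author="Navneeth")
print([d.metadata["source"] for d in matches])
```

Passing several keyword arguments narrows the result further, since every pair has to match.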

In [4]:
## txt file content
import os
os.makedirs("../data/text_files", exist_ok=True)

In [9]:
sample_texts = {
  "../data/text_files/rag_intro.txt": """ 
Introduction to RAG
RAG (Retrieval-Augmented Generation) is a technique that makes a Large Language Model (LLM) "smarter" by connecting it to an external knowledge base.

Think of a standard LLM as a student taking a closed-book exam. It can only answer questions using the information it memorized during its training.

RAG turns this into an open-book exam. Before answering a question, the AI first retrieves relevant information (like notes from a textbook or your private documents) and then augments its prompt with these facts. It then generates an answer based on this new, "grounded" information.

The Core Problem RAG Solves
RAG is designed to fix three major weaknesses of LLMs:

Outdated Knowledge: An LLM's knowledge is frozen at the end of its training. RAG connects it to a knowledge base that can be constantly updated.

Lack of Private Context: A model trained on the public internet knows nothing about your specific company files or personal notes. RAG gives the model access to this private data.

Hallucinations: LLMs often "make up" plausible-sounding but incorrect facts. RAG forces the model to base its answers on specific, retrieved data, making it far more factual and verifiable.

How RAG Works: The Two-Phase Process
The entire RAG system operates in two distinct phases.

Phase 1: Indexing (Data Preparation)
This is the one-time, offline process of building your "knowledge library."

Load: You gather all your documents (e.g., PDFs, .txt files, web pages). This is what the ../data/text_files folder is for: it holds these source documents.

Chunk: You break these large documents into smaller, manageable chunks (e.g., paragraphs or sentences).

Embed: Each chunk of text is converted into a numerical representation called a vector using an embedding model. This vector captures the semantic meaning of the text.

Store: All these vectors are loaded into a special, highly efficient database called a Vector Database (or vector store), where they are indexed for fast searching.

Phase 2: Retrieval and Generation (Answering a Question)
This phase happens every time you ask a question.

Query: You ask a question (e.g., "What is the author's name in example.txt?").

Retrieve: Your question is also converted into a vector (using the same embedding model). The system then searches the vector database to find the text chunks with vectors that are most similar in meaning to your question's vector.

Augment: The original text of these matching chunks (the "context") is retrieved. This context and your original question are combined into a new, detailed prompt for the LLM.

Generate: This combined prompt is sent to the LLM. The LLM then generates an answer based specifically on the context it was given, not just its internal memory.
  """,

"../data/text_files/vectordb_intro.txt":
"""
A Short Introduction to Vector Databases
A vector database is a special type of database designed to store, manage, and search vectors (long lists of numbers) efficiently.

Instead of searching for exact matches like a traditional database (e.g., finding a row where user_id = 123), a vector database finds the "closest" or most similar items.

Why Are They Needed?
In AI, embedding models turn complex data—like text, images, or audio—into vectors. These vectors represent the data's semantic meaning.

The vector for "king" will be mathematically close to the vector for "queen."

The vector for "apple" (the fruit) will be far from the vector for "Apple" (the company).

A vector database is built to search this meaning. When you provide a query (like the vector for your question), the database can quickly find the vectors that are "nearest" to it. Most vector databases do this with Approximate Nearest Neighbor (ANN) search, which trades a small amount of accuracy for speed.

How They Work (A Simple Analogy)
Imagine all your data chunks are stars plotted in a 3D space.

Store: The database stores the exact (x, y, z) coordinates for every star.

Search: You give the database the coordinates of a new point (your query).

Find: The database's special algorithms don't check every single star. Instead, they rapidly navigate this 3D space to find the closest cluster of stars to your new point.

These "closest stars" are the most semantically related pieces of data (e.g., the most relevant text chunks) for your query.
"""
}  # each key is a file path and each value is the content written to that file

for filepath, content in sample_texts.items():
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(content)

print("Sample text files created.") 

Sample text files created.
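
The sample text written above describes retrieval as finding the vectors "nearest" to a query vector. A minimal sketch of that idea follows, using cosine similarity over hand-written 3-D vectors. The vectors and the `top_k` helper are assumptions made for illustration; real embeddings come from an embedding model and have many more dimensions. This is an exact brute-force search that scores every vector, not the approximate (ANN) search a vector database would typically use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, for illustration only
chunk_vectors = {
    "RAG retrieves context before generating an answer.": [0.9, 0.1, 0.0],
    "A vector database stores and searches embeddings.": [0.1, 0.9, 0.1],
    "Stars can be plotted in a 3D space.": [0.0, 0.2, 0.9],
}


def top_k(query_vector, vectors, k=1):
    """Score every stored vector against the query and keep the k best."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


query_vector = [0.2, 0.8, 0.0]  # made-up query embedding
print(top_k(query_vector, chunk_vectors, k=1))
```

Scoring every vector is fine for a handful of chunks; the cost grows linearly with the collection, which is the gap ANN indexes are meant to close.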


In [3]:
# Option 1: TextLoader loads a single text file
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/rag_intro.txt", encoding="utf-8")  # create the loader object
document = loader.load()  # read the file; returns a list of Document objects
print(document)


[Document(metadata={'source': '../data/text_files/rag_intro.txt'}, page_content=' \nIntroduction to RAG\nRAG (Retrieval-Augmented Generation) is a technique that makes a Large Language Model (LLM) "smarter" by connecting it to an external knowledge base.\n\nThink of a standard LLM as a student taking a closed-book exam. It can only answer questions using the information it memorized during its training.\n\nRAG turns this into an open-book exam. Before answering a question, the AI first retrieves relevant information (like notes from a textbook or your private documents) and then augments its prompt with these facts. It then generates an answer based on this new, "grounded" information.\n\nThe Core Problem RAG Solves\nRAG is designed to fix three major weaknesses of LLMs:\n\nOutdated Knowledge: An LLM\'s knowledge is frozen at the end of its training. RAG connects it to a knowledge base that can be constantly updated.\n\nLack of Private Context: A model trained on the public internet kn
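
Judging only from the output above, the loaded document carries the file's full text as `page_content` and the file path under the `source` metadata key. The sketch below reproduces that shape with plain Python so the result is easy to inspect. `load_text_file` is a hypothetical helper, not TextLoader's actual implementation, and it returns a plain dict rather than a `Document`.

```python
import os
import tempfile


def load_text_file(path, encoding="utf-8"):
    """Read one file and return a dict shaped like the Document printed above."""
    with open(path, "r", encoding=encoding) as f:
        text = f.read()
    return {"page_content": text, "metadata": {"source": path}}


# Demo on a throwaway file so the snippet does not depend on ../data existing
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Introduction to RAG\n")
    loaded = load_text_file(demo_path)

print(repr(loaded["page_content"]))
print(loaded["metadata"]["source"] == demo_path)
```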

In [5]:
# Option 2: DirectoryLoader loads every matching file in a directory
from langchain_community.document_loaders import DirectoryLoader

# Instantiate DirectoryLoader to load all text files under a directory
dir_loader = DirectoryLoader(
    "../data/text_files",
    glob = "**/*.txt", # pattern to match all text files recursively
    loader_cls = TextLoader, # specify the loader class to use for each file
    loader_kwargs = {"encoding": "utf-8"}, # additional arguments for the loader class
    show_progress = False
)

documents = dir_loader.load() # call load method to read all files and create document objects
documents


[Document(metadata={'source': '..\\data\\text_files\\rag_intro.txt'}, page_content=' \nIntroduction to RAG\nRAG (Retrieval-Augmented Generation) is a technique that makes a Large Language Model (LLM) "smarter" by connecting it to an external knowledge base.\n\nThink of a standard LLM as a student taking a closed-book exam. It can only answer questions using the information it memorized during its training.\n\nRAG turns this into an open-book exam. Before answering a question, the AI first retrieves relevant information (like notes from a textbook or your private documents) and then augments its prompt with these facts. It then generates an answer based on this new, "grounded" information.\n\nThe Core Problem RAG Solves\nRAG is designed to fix three major weaknesses of LLMs:\n\nOutdated Knowledge: An LLM\'s knowledge is frozen at the end of its training. RAG connects it to a knowledge base that can be constantly updated.\n\nLack of Private Context: A model trained on the public internet
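
The `glob = "**/*.txt"` pattern is what makes the directory load recursive. As a rough sketch of the same idea with only the standard library, `pathlib.Path.glob` accepts the same pattern. `load_text_dir` is a hypothetical helper that returns plain dicts, not what DirectoryLoader does internally. It stores `str(path)` as the source, which uses the operating system's path separator; that would be consistent with the backslashes in the output above if the notebook ran on Windows.

```python
import tempfile
from pathlib import Path


def load_text_dir(root, pattern="**/*.txt", encoding="utf-8"):
    """Load every file under root that matches pattern, one dict per file."""
    loaded = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            loaded.append({
                "page_content": path.read_text(encoding=encoding),
                "metadata": {"source": str(path)},
            })
    return loaded


# Demo on a throwaway directory tree; the file names are made up
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.txt").write_text("alpha", encoding="utf-8")
    (root / "nested" / "b.txt").write_text("beta", encoding="utf-8")
    (root / "skip.md").write_text("ignored", encoding="utf-8")
    text_docs = load_text_dir(root)

print([d["page_content"] for d in text_docs])
```

The `.md` file is skipped because it does not match the pattern, while the file in the nested folder is picked up because `**` also matches subdirectories.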

In [7]:
# loading pdf files
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader

dir_loader = DirectoryLoader(
    "../data/pdf_files",
    glob = "**/*.pdf", # pattern to match all PDF files recursively
    loader_cls = PyMuPDFLoader, # specify the loader class to use for each file
    show_progress = False
)

pdf_documents = dir_loader.load() # call load method to read all files and create document objects
pdf_documents

[Document(metadata={'producer': 'www.ilovepdf.com', 'creator': 'Microsoft® Word 2016', 'creationdate': '2024-01-10T07:07:09+00:00', 'source': '..\\data\\pdf_files\\1028-ArticleText-7147-1-10-202401291.pdf', 'file_path': '..\\data\\pdf_files\\1028-ArticleText-7147-1-10-202401291.pdf', 'total_pages': 11, 'format': 'PDF 1.5', 'title': '', 'author': 'USER', 'subject': '', 'keywords': '', 'moddate': '2024-01-10T07:07:10+00:00', 'trapped': '', 'modDate': 'D:20240110070710Z', 'creationDate': "D:20240110070709+00'00'", 'page': 0}, page_content='See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/377844972\nAnomaly Detection In IoT Sensor Data Using Machine Learning Techniques\nFor Predictive Maintenance In Smart Grids\nArticle\xa0\xa0in\xa0\xa0International Journal Of Science Technology & Management · January 2024\nDOI: 10.46729/ijstm.v5i1.1028\nCITATIONS\n24\nREADS\n1,497\n4 authors:\nEdwin Omol\nKCA University\n24 PUBLICATIONS\xa0\xa0
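
The metadata above includes `'page': 0` alongside `'total_pages': 11`, which suggests the PDF is returned as one document per page rather than one per file. Assuming that is the case, the sketch below shows one way to stitch pages back together per source file. It uses hand-written dicts with made-up values in place of real `Document` objects; only the `source` and `page` keys mirror the metadata shown above, and `merge_pages_by_source` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up page-level records, deliberately out of order
pages = [
    {"page_content": "second page", "metadata": {"source": "paper_a.pdf", "page": 1}},
    {"page_content": "first page", "metadata": {"source": "paper_a.pdf", "page": 0}},
    {"page_content": "only page", "metadata": {"source": "paper_b.pdf", "page": 0}},
]


def merge_pages_by_source(page_docs):
    """Group page-level records by source file and join them in page order."""
    grouped = defaultdict(list)
    for doc in page_docs:
        grouped[doc["metadata"]["source"]].append(doc)

    merged = {}
    for source, docs_for_source in grouped.items():
        ordered = sorted(docs_for_source, key=lambda d: d["metadata"]["page"])
        merged[source] = "\n".join(d["page_content"] for d in ordered)
    return merged


full_texts = merge_pages_by_source(pages)
print(full_texts["paper_a.pdf"])
```

Keeping pages separate is often preferable for retrieval, because the page number can then be cited in an answer; merging is mainly useful when a later step needs the whole file as one string.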