# Retrieval Augmented Generation (RAG) - The Data Ingestion Pipeline
## Data Ingestion Steps
- `Load`
- `Split`
- `Embed`
- `Store`

In [203]:
# ignore warnings
import warnings
warnings.filterwarnings('ignore')

### 1. Data Loading

#### Load a **TEXT** document (.txt)

In [204]:
from langchain_community.document_loaders import TextLoader

file_path = "data/Artificial Intelligence.txt"  # Path of the document to be loaded
loader = TextLoader(file_path)                  # Initialize the text loader
documents = loader.load()                       # Load the text document

print(documents)

[Document(metadata={'source': 'data/Artificial Intelligence.txt'}, page_content='Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals.[1] Such machines may be called AIs.\n\nSome high-profile applications of AI include advanced web search engines (e.g., Google Search); recommendation systems (used by YouTube, Amazon, and Netflix); interacting via human speech (e.g., Google Assistant, Siri, and Alexa); autonomous vehicles (e.g., Waymo); generative and creative tools (e.g., ChatGPT, Apple Intelligence, and AI art); and superhuman play and analysis in strategy games (e.g., chess and Go). However, many AI applications are not perceived as AI: "A lot of cutting edge AI has filt

In [205]:
print(len(documents))                   # Number of loaded documents
print(documents[0].metadata)            # Metadata of the first document
print(documents[0].page_content[:100])  # First 100 characters of the first document's content

1
{'source': 'data/Artificial Intelligence.txt'}
Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particul


#### Load a **PDF** document (.pdf)

In [206]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "data/Dale_Carnegie_Golden_Book-Se.pdf"  # Path of the document to be loaded
loader = PyPDFLoader(file_path)                      # Initialize the pdf loader
documents = loader.load()                            # Load the pdf document 

print(documents)

[Document(metadata={'source': 'data/Dale_Carnegie_Golden_Book-Se.pdf', 'page': 0}, page_content='DALE CARNEGIE’S \nGOLDEN BOOK \nwww.dalecarnegie.com '), Document(metadata={'source': 'data/Dale_Carnegie_Golden_Book-Se.pdf', 'page': 1}, page_content='GOLDEN BOOK \nPrinciples from How to Win Friends and Influence People \nBecome a Friendlier Person \n1.Don’t criticize, condemn or complain.\n2.Give honest, sincere appreciation.\n3.Arouse in the other person an eager want.\n4.Become genuinely interested in other people.\n5.Smile.\n6.Remember that a person’s name is to that person the sweetest \nand most important sound in any language.\n7. Be a good listener. Encourage others to talk about themselves.\n8. Talk in terms of the other person’s interests.\n9. Make the other person feel important - and do it sincerely.\n– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – \nWin People to Your Way of Thinking \n10.The only way to get the best of an argument is to avoid i

In [207]:
print(len(documents))                   # Number of loaded documents (one per PDF page)
print(documents[0].metadata)            # Metadata of the first document
print(documents[0].page_content[:100])  # First 100 characters of the first document's content

7
{'source': 'data/Dale_Carnegie_Golden_Book-Se.pdf', 'page': 0}
DALE CARNEGIE’S 
GOLDEN BOOK 
www.dalecarnegie.com 


#### Load a **WORD** document (.docx)

In [208]:
from langchain_community.document_loaders import Docx2txtLoader

file_path = "data/Data_Literacy_Quiz.docx"  # Path of the document to be loaded
loader = Docx2txtLoader(file_path)          # Initialize the word document loader
documents = loader.load()                   # Load the word document 

print(documents)

[Document(metadata={'source': 'data/Data_Literacy_Quiz.docx'}, page_content="QUIZ:\n\nIntroduction to Data Literacy\n\n1. What is Data Literacy?\n\na) The ability to read novels\n\nb) The ability to read, understand, create, and communicate data as information\n\nc) The ability to speak multiple languages\n\nd) The ability to perform mathematical calculations\n\n\n\n2. Why is data literacy important?\n\na) It helps in making better decisions based on data\n\nb) It's only important for data scientists\n\nc) It's not important in today's world\n\nd) It's only about using complicated software\n\n\n\nData Basics\n\n3. What is data?\n\na) Information stored in computers\nb) Raw facts and figures\nc) The result of a calculation\nd) All of the above\n\n4. Which of the following statements correctly describes quantitative and qualitative data?\n\na) Quantitative data is numerical and measurable, while qualitative data is descriptive and non-numeric.\nb) Qualitative data is numerical and measur

In [209]:
print(len(documents))                   # Number of loaded documents
print(documents[0].metadata)            # Metadata of the first document
print(documents[0].page_content[:100])  # First 100 characters of the first document's content

1
{'source': 'data/Data_Literacy_Quiz.docx'}
QUIZ:

Introduction to Data Literacy

1. What is Data Literacy?

a) The ability to read novels

b) T


#### Load a **WEB** page (.html)

In [210]:
from langchain_community.document_loaders import WebBaseLoader

web_path = "https://python.langchain.com/v0.2/docs/introduction/"   # Path of the web page to be loaded
loader = WebBaseLoader(web_path)                                    # Initialize the web page loader
documents = loader.load()                                           # Load the web page 

print(documents)

[Document(metadata={'source': 'https://python.langchain.com/v0.2/docs/introduction/', 'title': 'Introduction | 🦜️🔗 LangChain', 'description': 'LangChain is a framework for developing applications powered by large language models (LLMs).', 'language': 'en'}, page_content='\n\n\n\n\nIntroduction | 🦜️🔗 LangChain\n\n\n\n\n\n\n\nSkip to main contentShare your thoughts on AI agents. Take the 3-min survey.IntegrationsAPI referenceLatestLegacyMorePeopleContributingCookbooks3rd party tutorialsYouTubearXivv0.2v0.2v0.1🦜️🔗LangSmithLangSmith DocsLangChain HubJS/TS Docs💬SearchIntroductionTutorialsBuild a Question Answering application over a Graph DatabaseTutorialsBuild a Simple LLM Application with LCELBuild a Query Analysis SystemBuild a ChatbotConversational RAGBuild an Extraction ChainBuild an AgentTaggingdata_generationBuild a Local RAG ApplicationBuild a PDF ingestion and Question/Answering systemBuild a Retrieval Augmented Generation (RAG) AppVector stores and retrieversBuild a Question/Answe

In [93]:
print(len(documents))                   # Number of loaded documents
print(documents[0].metadata)            # Metadata of the first document
print(documents[0].page_content[:100])  # First 100 characters of the first document's content

1
{'source': 'https://python.langchain.com/v0.2/docs/introduction/', 'title': 'Introduction | 🦜️🔗 LangChain', 'description': 'LangChain is a framework for developing applications powered by large language models (LLMs).', 'language': 'en'}





Introduction | 🦜️🔗 LangChain







Skip to main contentShare your thoughts on AI agents. Take 


### 2. Data Splitting/Chunking

In [211]:
# Load a PDF document
from langchain_community.document_loaders import PyPDFLoader

file_path = "data/Dale_Carnegie_Golden_Book-Se.pdf"  # Path of the document to be loaded
loader = PyPDFLoader(file_path)                      # Initialize the pdf loader
documents = loader.load()                            # Load the pdf document 

print(documents)

[Document(metadata={'source': 'data/Dale_Carnegie_Golden_Book-Se.pdf', 'page': 0}, page_content='DALE CARNEGIE’S \nGOLDEN BOOK \nwww.dalecarnegie.com '), Document(metadata={'source': 'data/Dale_Carnegie_Golden_Book-Se.pdf', 'page': 1}, page_content='GOLDEN BOOK \nPrinciples from How to Win Friends and Influence People \nBecome a Friendlier Person \n1.Don’t criticize, condemn or complain.\n2.Give honest, sincere appreciation.\n3.Arouse in the other person an eager want.\n4.Become genuinely interested in other people.\n5.Smile.\n6.Remember that a person’s name is to that person the sweetest \nand most important sound in any language.\n7. Be a good listener. Encourage others to talk about themselves.\n8. Talk in terms of the other person’s interests.\n9. Make the other person feel important - and do it sincerely.\n– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – \nWin People to Your Way of Thinking \n10.The only way to get the best of an argument is to avoid i

#### Character Text Splitter

In [212]:
from langchain_text_splitters import CharacterTextSplitter

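# Initialize the character text splitter
text_splitter = CharacterTextSplitter(
    separator="",      # Empty separator: split purely by character count
    chunk_size=100,    # Maximum number of characters per chunk
    chunk_overlap=20   # Characters shared between consecutive chunks
)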

# Split the documents into chunks
chunks = []
for doc in documents:
    texts = text_splitter.split_text(doc.page_content)
    chunks.extend(texts)

for chunk in enumerate(chunks):
    print(chunk)

(0, 'DALE CARNEGIE’S \nGOLDEN BOOK \nwww.dalecarnegie.com')
(1, 'GOLDEN BOOK \nPrinciples from How to Win Friends and Influence People \nBecome a Friendlier Person \n1.')
(2, 'riendlier Person \n1.Don’t criticize, condemn or complain.\n2.Give honest, sincere appreciation.\n3.Aro')
(3, 'appreciation.\n3.Arouse in the other person an eager want.\n4.Become genuinely interested in other pe')
(4, 'terested in other people.\n5.Smile.\n6.Remember that a person’s name is to that person the sweetest \na')
(5, 'rson the sweetest \nand most important sound in any language.\n7. Be a good listener. Encourage others')
(6, 'er. Encourage others to talk about themselves.\n8. Talk in terms of the other person’s interests.\n9.')
(7, 'son’s interests.\n9. Make the other person feel important - and do it sincerely.\n– – – – – – – – – –')
(8, '– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – \nWin People to Your Wa')
(9, 'in People to Your Way of Thinking \n10.The only way to g
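
With an empty separator the splitter works purely on character counts, which is why the chunks above break mid-word and, within a page, each chunk begins with the last 20 characters of the previous one. A minimal pure-Python sketch of that sliding-window idea follows. `chunk_by_characters` is a hypothetical helper written for illustration, not LangChain's implementation, and it leaves out details such as whitespace stripping.

```python
def chunk_by_characters(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters of toy input
for index, chunk in enumerate(chunk_by_characters(sample, chunk_size=10, chunk_overlap=2)):
    print(index, chunk)
```

A larger overlap lowers the chance that a sentence is cut off from its context, at the cost of storing more duplicated text.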

#### Recursive Character Text Splitter

In [213]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the recursive character text splitter
text_splitter = RecursiveCharacterTextSplitter(              
    separators=["\n\n", "\n", " ", ""],  # Tried in order: paragraphs, lines, words, characters
    chunk_size=100,
    chunk_overlap=20
)   

# Split the documents into chunks
chunks = []
for doc in documents:
    texts = text_splitter.split_text(doc.page_content)
    chunks.extend(texts)

for chunk in enumerate(chunks):
    print(chunk)

(0, 'DALE CARNEGIE’S \nGOLDEN BOOK \nwww.dalecarnegie.com')
(1, 'GOLDEN BOOK \nPrinciples from How to Win Friends and Influence People \nBecome a Friendlier Person')
(2, '1.Don’t criticize, condemn or complain.\n2.Give honest, sincere appreciation.')
(3, '3.Arouse in the other person an eager want.\n4.Become genuinely interested in other people.\n5.Smile.')
(4, '5.Smile.\n6.Remember that a person’s name is to that person the sweetest')
(5, 'and most important sound in any language.')
(6, '7. Be a good listener. Encourage others to talk about themselves.')
(7, '8. Talk in terms of the other person’s interests.')
(8, '9. Make the other person feel important - and do it sincerely.')
(9, '– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –')
(10, 'Win People to Your Way of Thinking \n10.The only way to get the best of an argument is to avoid it.')
(11, '11.Show respect for the other person’s opinion. Never say, “You’re wrong.”')
(12, '12.If you are wrong, admit it
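
The chunks above end at line breaks rather than mid-word because the recursive splitter tries coarse separators first and only falls back to finer ones when a piece is still too long. The sketch below illustrates that idea under simplifying assumptions: `recursive_split` is a hypothetical function, it ignores `chunk_overlap`, and it is not the library's actual algorithm.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split text into pieces of at most chunk_size, preferring coarse separators."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left (or the empty one): cut at fixed character positions
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # Keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece is still too long: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


toy_text = "alpha beta\ngamma delta epsilon\n\nzeta"
for index, chunk in enumerate(recursive_split(toy_text, chunk_size=12)):
    print(index, repr(chunk))
```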

#### Semantic Chunker

In [214]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

# Use a smaller, faster model for embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Initialize the semantic chunker
text_splitter = SemanticChunker(embeddings, breakpoint_threshold_type='percentile')

# Split the documents into chunks
chunks = []
for doc in documents:
    texts = text_splitter.split_text(doc.page_content)
    chunks.extend(texts)

for chunk in enumerate(chunks):
    print(chunk)

(0, 'DALE CARNEGIE’S \nGOLDEN BOOK \nwww.dalecarnegie.com ')
(1, 'GOLDEN BOOK \nPrinciples from How to Win Friends and Influence People \nBecome a Friendlier Person \n1.Don’t criticize, condemn or complain. 2.Give honest, sincere appreciation.')
(2, '3.Arouse in the other person an eager want. 4.Become genuinely interested in other people. 5.Smile. 6.Remember that a person’s name is to that person the sweetest \nand most important sound in any language. 7. Be a good listener. Encourage others to talk about themselves. 8. Talk in terms of the other person’s interests. 9.')
(3, 'Make the other person feel important - and do it sincerely. – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – \nWin People to Your Way of Thinking \n10.The only way to get the best of an argument is to avoid it. 11.Show respect for the other person’s opinion. Never say, “You’re wrong.”\n12.If you are wrong, admit it quickly and emphatically. 13.Begin in a friendly way. 14.Get the other
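
The `'percentile'` threshold type places a chunk boundary wherever the embedding distance between neighbouring sentences is unusually large compared with the other distances in the text. The sketch below shows only that thresholding step. It assumes the distances have already been computed, the helper names are hypothetical, and the numbers are made up. The library's implementation may differ in details such as how sentences are grouped before embedding.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Return the q-th percentile of values using linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def group_by_breakpoints(sentences: list[str], distances: list[float], q: float = 95.0) -> list[str]:
    """Start a new chunk after every sentence whose distance to the next one exceeds the percentile."""
    if len(sentences) < 2:
        return list(sentences)
    if len(distances) != len(sentences) - 1:
        raise ValueError("expected one distance per pair of neighbouring sentences")

    threshold = percentile(distances, q)
    groups, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # Large jump in meaning: close the current chunk
            groups.append(" ".join(current))
            current = []
        current.append(sentence)
    groups.append(" ".join(current))
    return groups


# Made-up distances: the second gap is much larger than the others
toy_sentences = ["Smile.", "Be a good listener.", "Begin in a friendly way.", "Let the other person talk."]
toy_distances = [0.1, 0.9, 0.1]
print(group_by_breakpoints(toy_sentences, toy_distances))
```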

### 3. Embeddings 

In [215]:
# Load a PDF document and split it into chunks
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

file_path = "data/Dale_Carnegie_Golden_Book-Se.pdf"  # Path of the document to be loaded
loader = PyPDFLoader(file_path)                      # Initialize the pdf loader
documents = loader.load()                            # Load the pdf document 

print(documents)

# Initialize the recursive character text splitter
text_splitter = RecursiveCharacterTextSplitter(              
    separators=["\n\n", "\n", " ", ""],  # Tried in order: paragraphs, lines, words, characters
    chunk_size=100,
    chunk_overlap=20
)   

# Split the documents into chunks
chunks = []
for doc in documents:
    texts = text_splitter.split_text(doc.page_content)
    chunks.extend(texts)

for chunk in enumerate(chunks):
    print(chunk)

[Document(metadata={'source': 'data/Dale_Carnegie_Golden_Book-Se.pdf', 'page': 0}, page_content='DALE CARNEGIE’S \nGOLDEN BOOK \nwww.dalecarnegie.com '), Document(metadata={'source': 'data/Dale_Carnegie_Golden_Book-Se.pdf', 'page': 1}, page_content='GOLDEN BOOK \nPrinciples from How to Win Friends and Influence People \nBecome a Friendlier Person \n1.Don’t criticize, condemn or complain.\n2.Give honest, sincere appreciation.\n3.Arouse in the other person an eager want.\n4.Become genuinely interested in other people.\n5.Smile.\n6.Remember that a person’s name is to that person the sweetest \nand most important sound in any language.\n7. Be a good listener. Encourage others to talk about themselves.\n8. Talk in terms of the other person’s interests.\n9. Make the other person feel important - and do it sincerely.\n– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – \nWin People to Your Way of Thinking \n10.The only way to get the best of an argument is to avoid i

#### **Hugging Face** Embeddings

In [216]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

# Initialize the Hugging Face embedding model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

results = []
# Create embeddings of text chunks
for i, chunk in enumerate(chunks):
    print("Text chunk ", i)
    print("--------------")
    print(chunk, "\n")
    query_result = embeddings.embed_query(chunk)
    print("Embeddings", "\n", query_result, "\n\n")
    results.append(query_result)

Text chunk  0
--------------
DALE CARNEGIE’S 
GOLDEN BOOK 
www.dalecarnegie.com 

Embeddings 
 [-0.017324471846222878, 0.048662059009075165, -0.010236980393528938, 0.04299083352088928, -0.03296734020113945, 0.010704209096729755, -0.009957400150597095, 0.005690128076821566, -0.0536043681204319, 0.026528891175985336, 0.036545202136039734, 0.03634728491306305, 0.02241397462785244, 0.008098303340375423, 0.08435066044330597, -0.08276300877332687, 0.04672907292842865, 0.038672737777233124, -0.042616020888090134, -0.01789054274559021, -0.056012317538261414, -0.0027350361924618483, -0.03825553134083748, 0.025791630148887634, 0.006420682650059462, -0.043224968016147614, 0.034694600850343704, 0.025465991348028183, -0.015515495091676712, -0.03580369055271149, -0.031441424041986465, 0.019822221249341965, 0.030112048611044884, 0.0829143151640892, 1.6053357967393822e-06, -0.02732936292886734, -0.012340398505330086, -0.03176262602210045, -0.03168348968029022, 0.013180854730308056, 0.0568135604262352,
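
Each chunk is now a list of floats, and retrieval works by comparing such vectors: texts with similar meaning should produce vectors that point in similar directions. Cosine similarity is one common way to measure this. The sketch below uses made-up three-dimensional vectors rather than real model output, and `cosine_similarity` is a hand-written helper, not part of `langchain_huggingface`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for illustration; real embedding models return far more dimensions
toy_vectors = {
    "Give honest, sincere appreciation.": [0.9, 0.1, 0.2],
    "Be a good listener.": [0.2, 0.8, 0.1],
}
query_vector = [0.8, 0.2, 0.1]

# Rank the stored texts from most to least similar to the query vector
ranked = sorted(
    toy_vectors,
    key=lambda text: cosine_similarity(query_vector, toy_vectors[text]),
    reverse=True,
)
print(ranked[0])
```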

### 4. Data Storage

In [217]:
# Load a PDF document and split it into chunks
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

file_path = "data/Dale_Carnegie_Golden_Book-Se.pdf"  # Path of the document to be loaded
loader = PyPDFLoader(file_path)                      # Initialize the pdf loader
documents = loader.load()                            # Load the pdf document 

# Initialize the recursive character text splitter
text_splitter = RecursiveCharacterTextSplitter(              
    separators=["\n\n", "\n", " ", ""],  # Tried in order: paragraphs, lines, words, characters
    chunk_size=100,
    chunk_overlap=20
)   

# Split the documents into chunks
chunks = text_splitter.split_documents(documents)

#### Data Storage using **FAISS**

In [218]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

# Initialize the Hugging Face embedding model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Store embeddings into the vector store
vector_store = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings
)

print(vector_store)

<langchain_community.vectorstores.faiss.FAISS object at 0x000001BE9CBC2A50>
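
Conceptually, a vector store keeps each chunk next to its embedding and answers a query by returning the chunks whose vectors lie closest to the query vector. The sketch below is a brute-force, in-memory illustration of that idea using Euclidean (L2) distance. `TinyVectorStore` is a hypothetical class, the vectors are made up, and none of this is FAISS's or LangChain's API.

```python
import math


class TinyVectorStore:
    """Keeps (text, vector) pairs in a list and searches them exhaustively."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, text: str, vector: list[float]) -> None:
        self._items.append((text, list(vector)))

    def search(self, query_vector: list[float], k: int = 2) -> list[tuple[str, float]]:
        """Return the k stored texts with the smallest Euclidean distance to the query."""
        scored = [(math.dist(query_vector, vector), text) for text, vector in self._items]
        scored.sort(key=lambda pair: pair[0])
        return [(text, distance) for distance, text in scored[:k]]


store = TinyVectorStore()
store.add("Smile.", [0.9, 0.1])                # Toy 2-dimensional vectors
store.add("Be a good listener.", [0.1, 0.9])
store.add("Give honest, sincere appreciation.", [0.8, 0.3])

for text, distance in store.search([1.0, 0.0], k=2):
    print(f"{distance:.3f}  {text}")
```

An exhaustive scan like this compares the query with every stored vector, so it only scales to small collections. Dedicated libraries such as FAISS exist to make the same lookup fast on large ones.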
