# **Demo: LangChain Loader, Splitter, Embeddings, and VectorStore**

# __Description:__
In this activity, you will use LangChain’s loaders, splitters, embeddings, and vector stores to build a small document-retrieval pipeline.
The two files used in this demo are practical examples of the kind of data you might encounter in natural language processing tasks:

•	The **state_of_union.txt** file, which contains transcripts of the United States’ State of the Union Addresses, represents a large text document that can be loaded and processed.

•	The **michael_resume.pdf** file, an open source resume, represents a common type of document that one might analyze for tasks such as resume screening or information extraction.




# **Steps to Perform:**


1.   Import the Necessary Modules
2.   Load Text Data from a File Using TextLoader
3.   Load PDFs from the Internet Using PyPDFLoader
4.   Split the Documents Using RecursiveCharacterTextSplitter
5.   Embed the Documents Using HuggingFaceEmbeddings and Print the Length of the Embedding
6.   Embed the Documents Using OpenAIEmbeddings and Print the Length of the Embedding
7.   Create a FAISS Instance
8.   Perform a Similarity Search on the FAISS Instance
9.   Persist the FAISS Instance
10.  Load the Persisted FAISS Instance




# **Step 1: Import the Necessary Modules**







In [1]:
!pip install pysqlite3
!pip install pysqlite3-binary
!pip install pypdf

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [1]:
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings
from langchain.vectorstores import FAISS
import faiss  # Not used directly; confirms that the FAISS library is installed
import pysqlite3
import sys

# Make later imports of sqlite3 resolve to the pysqlite3 package installed above
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")

# **Step 2: Load Text Data from a File Using TextLoader**



*   Print the first 100 characters from the loaded text.



In [2]:
text_loader = TextLoader("state_of_union.txt")
text_document = text_loader.load()
print(text_document[0].page_content[:100])  # Prints the first 100 characters of the text document

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th
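
As a rough mental model of what the loader returns, the sketch below mimics it with the standard library only: the whole file becomes one document object holding the text in `page_content` and the file path in `metadata`. `SimpleDocument` and `load_text_file` are hypothetical names used for illustration; they are not part of LangChain, and the real `TextLoader` may handle encodings and errors differently.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str) -> list[SimpleDocument]:
    """Read a whole text file into a one-element list, recording its source path."""
    with open(path, encoding="utf-8") as handle:
        content = handle.read()
    return [SimpleDocument(page_content=content, metadata={"source": path})]


# Demonstrate on a throwaway file instead of state_of_union.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("Hello from a sample file. " * 10)
    docs = load_text_file(sample_path)

print(len(docs))                   # Number of documents produced for the file
print(docs[0].page_content[:25])   # Start of the loaded text
print(sorted(docs[0].metadata))    # Metadata keys attached to the document
```

This is why the cell above indexes `text_document[0]` before reading `page_content`: the loader returns a list even when there is only one document.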


# **Step 3: Load PDFs from the Internet Using PyPDFLoader**






In [3]:
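from langchain.document_loaders import PyPDFLoader

# This cell reads a local copy of the PDF; PyPDFLoader can also take an http(s) URL in place of the path
pdf_loader = PyPDFLoader("michael_resume.pdf")
# load_and_split() loads the PDF and applies a default text splitter, returning a list of Document chunks
pdf_pages = pdf_loader.load_and_split()
print(pdf_pages[0].page_content[:100])  # Prints the first 100 characters of the first chunk of the PDF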


CURRICULUM VITAE :  
M ichael M . Scott OBE, B.Sc., Dip.Ed  
 
Home address :  Strome House     Date


# **Step 4: Split the Documents Using RecursiveCharacterTextSplitter**


*   Split the PDF pages into smaller chunks and print the number of chunks.



In [4]:
doc_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
split_texts = doc_splitter.split_documents(pdf_pages)
print(len(split_texts))  # Prints the number of chunks the PDF has been split into


15
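
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that slices a string into fixed-size windows, each sharing its first characters with the end of the previous window. This is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks); `split_with_overlap` is a hypothetical helper that only illustrates the two parameters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The current window already reaches the end of the text
    return chunks


sample_text = "abcdefghij" * 10  # 100 characters of toy data
chunks = split_with_overlap(sample_text, chunk_size=40, chunk_overlap=10)

print(len(chunks))                          # Number of windows
print(chunks[0][-10:] == chunks[1][:10])    # Neighbouring windows share 10 characters
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.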


# **Step 5: Embed the Documents Using HuggingFaceEmbeddings and Print the Length of the Embedding**






In [5]:
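# The HuggingFace lines are commented out, so as written this cell only selects the text to embed.
# Uncomment them (the sentence-transformers package is required) to embed with a local model.
# MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
# hf_embed = HuggingFaceEmbeddings(model_name=MODEL_NAME)
text = split_texts[0].page_content  # First chunk; also reused in Step 6
# hf_embed_result = hf_embed.embed_documents([text])
# print(len(hf_embed_result[0]))  # Prints the length of the first embedded document (768 for this model)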

# **Step 6: Embed the Documents Using OpenAIEmbeddings and Print the Length of the Embedding**




In [6]:
openai_embed = OpenAIEmbeddings()
openai_embed_result = openai_embed.embed_documents([text])
print(len(openai_embed_result[0]))  # Prints the length of the first embedded document



1536
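
The number printed above is the embedding's dimensionality: each chunk is represented as a list of that many floats. What makes such vectors useful is that texts with similar meaning tend to land close together. The sketch below shows one common way to compare two vectors, cosine similarity, on made-up three-dimensional vectors; the vectors and the `cosine_similarity` helper are illustrative assumptions, not output from either embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings"; real ones have hundreds or thousands of dimensions.
vec_query = [1.0, 0.0, 1.0]
vec_close = [0.9, 0.1, 1.1]
vec_far = [-1.0, 1.0, 0.0]

sim_close = cosine_similarity(vec_query, vec_close)
sim_far = cosine_similarity(vec_query, vec_far)

print(len(vec_query))        # Dimensionality of the toy vectors
print(sim_close > sim_far)   # The nearby vector scores higher
```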


# **Step 7: Create a FAISS Instance**

*   Create a FAISS instance using the split texts and the OpenAIEmbeddings.

In [7]:
# Create FAISS instance from documents and embeddings
faiss_index = FAISS.from_documents(split_texts, openai_embed)


# **Step 8: Perform a Similarity Search on the FAISS Instance**


*   Print the top two most similar documents.

In [8]:
# Perform a similarity search; each result is a (Document, score) pair
# With the default FAISS index the score is a distance, so a lower score means a closer match
search_result = faiss_index.similarity_search_with_score("What is the candidate's skill sets?", k=2)
print(search_result)  # Prints the top 2 most similar documents to the query


[(Document(id='37997f0a-b92d-4a9a-bced-3b3c8599ec2f', metadata={'producer': 'BCL easyPDF 2.00.030', 'creator': 'PyPDF', 'creationdate': '', 'source': 'michael_resume.pdf', 'total_pages': 4, 'page': 3, 'page_label': '4'}, page_content='spring 2005, I went fully digital, and all photographs can be supplied in electronic format. \n \nComputer knowledge  \nI am reasonably fluent in basic PC computer skills, using Windows XP, Word, WordPro, Excel, PowerPoint, \nAdobe Photoshop Elements, e-mail, internet etc.  I have full computer and broadband facilities at home. \n \nOther interests  \nBotanising (especially mountain flowers), travel, walking, Scottish islands, gardening, photography, \ncomputers, rugby supporter, cinema, good wine, Runrig concerts (!). \n \n[updated, 26.03.08]'), 0.45482162), (Document(id='5ec9b061-b695-4189-9c1a-3a6eb7ef0400', metadata={'producer': 'BCL easyPDF 2.00.030', 'creator': 'PyPDF', 'creationdate': '', 'source': 'michael_resume.pdf', 'total_pages': 4, 'page': 0,
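
The output pairs each document with a score. Assuming the index uses L2 (Euclidean) distance, which is the default for LangChain's FAISS wrapper as far as I know, the search can be pictured as the brute-force sketch below: compute the distance from the query vector to every stored vector, sort ascending, and keep the first `k`. The toy vectors, labels, and the `search_with_score` helper are hypothetical; a real FAISS index does this far more efficiently.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_with_score(query_vec, indexed, k):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = [(text, squared_l2(query_vec, vec)) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[1])  # Smaller distance = better match
    return scored[:k]


# Toy index: (chunk label, made-up 2-D embedding)
toy_index = [
    ("computer skills", [0.9, 0.1]),
    ("home address", [0.1, 0.9]),
    ("other interests", [0.6, 0.4]),
]

results = search_with_score([1.0, 0.0], toy_index, k=2)
for text, distance in results:
    print(text, round(distance, 2))
```

Read this way, the `0.45482162` next to the first document in the output above is a distance, not a percentage match.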

# **Step 9: Persist the FAISS Instance**


*   Create a folder in the current working directory that persists the FAISS instance.

In [9]:
# Save the FAISS index and its document store to the "faiss_index" folder
faiss_index.save_local("faiss_index")


# **Step 10: Load the Persisted FAISS Instance**




In [10]:
from langchain_community.vectorstores import FAISS

# Load the persisted FAISS index from the folder
# Loading unpickles the stored documents, so only allow deserialization for files you created yourself
faiss_index_loaded = FAISS.load_local(
    "faiss_index", 
    openai_embed, 
    allow_dangerous_deserialization=True
)

# Perform a similarity search with the loaded FAISS index
vector_search_result = faiss_index_loaded.similarity_search_with_score(
    "Whats the address", k=1)
print(vector_search_result)


[(Document(id='5ec9b061-b695-4189-9c1a-3a6eb7ef0400', metadata={'producer': 'BCL easyPDF 2.00.030', 'creator': 'PyPDF', 'creationdate': '', 'source': 'michael_resume.pdf', 'total_pages': 4, 'page': 0, 'page_label': '1'}, page_content='CURRICULUM VITAE :  \nM ichael M . Scott OBE, B.Sc., Dip.Ed  \n \nHome address :  Strome House     Date of Birth : 10.5.51 \n   North Strome     Place of Birth : Edinburgh \n   Lochcarron     Married to Sue Scott; 2 stepchildren \n   Ross-shire, IV54 8YJ \nTelephone (work) : 01520 722901     Website : www.mmscott.co.uk  \nTelephone (home) :  01520 722588    E-mail :  MSStrome@aol.com  \n \nAwarded OBE in Queen’s Birthday Honours, June 2005, “for services to biodiversity conservation in \nScotland”. \nAwarded Planta Europa ‘Silver Lead’ Award in September 2007, “for excellent work in European wild plant \nconservation”. \n \nEducation  \nPrimary education: George Heriots School, Edinburgh (1956-1962) \nSecondary education:  Madras College, St Andrews (1962
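
The save and load steps amount to a round trip: write the vectors and their texts to a folder, then read them back into an equivalent structure. The sketch below shows that idea with JSON and the standard library only. It is not how `save_local` stores data (LangChain writes a FAISS index file plus a pickled document store, which is why `load_local` asks for `allow_dangerous_deserialization=True`); the file name `index.json` and both helpers are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_index(folder: str, texts: list[str], vectors: list[list[float]]) -> None:
    """Write texts and their vectors to <folder>/index.json."""
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, "index.json"), "w", encoding="utf-8") as handle:
        json.dump({"texts": texts, "vectors": vectors}, handle)


def load_index(folder: str) -> tuple[list[str], list[list[float]]]:
    """Read texts and vectors back from <folder>/index.json."""
    with open(os.path.join(folder, "index.json"), encoding="utf-8") as handle:
        payload = json.load(handle)
    return payload["texts"], payload["vectors"]


texts = ["computer skills", "home address"]
vectors = [[0.9, 0.1], [0.1, 0.9]]

# Use a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    folder = os.path.join(tmp_dir, "toy_index")
    save_index(folder, texts, vectors)
    loaded_texts, loaded_vectors = load_index(folder)

print(loaded_texts == texts and loaded_vectors == vectors)  # Round trip preserved the data
```

Note that only the stored vectors are persisted, not the embedding model. That is why `load_local` takes `openai_embed` again: new queries must be embedded with the same model that produced the index.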

# **Create a Chain**

In [11]:
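from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Chat model that writes the final answer; a low temperature keeps answers focused
llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0.2)

# Build a question-answering chain that retrieves its context from the FAISS index
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=faiss_index.as_retriever(), return_source_documents=False)

query = "Whats the skill set?"
response = qa_chain({"query": query})
print(response['result'])  # Prints only the answer text

# Run the chain again; as the last expression in the cell, the full result dict is displayed
qa_chain({"query": query})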



Based on the provided information, Michael M. Scott OBE has a skill set that includes expertise in botany, photography, writing, biodiversity conservation, and education. He is also proficient in computer skills, including using Windows XP, Word, WordPro, Excel, PowerPoint, Adobe Photoshop Elements, email, and internet. Additionally, he has experience in public speaking and has a PADI diving qualification.


{'query': 'Whats the skill set?',
 'result': 'Based on the provided context, Michael M. Scott OBE has skills in botany, photography, writing, and education. He is also fluent in basic PC computer skills and uses various software like Adobe Photoshop Elements. Additionally, he has experience in biodiversity conservation, European wild plant conservation, and has a strong interest in the marine environment. His writing experience includes authoring books on ecology and Scottish wildflowers.'}
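
Conceptually, a retrieval QA chain does two things before the LLM is called: it fetches the chunks most relevant to the question, then pastes them into a prompt alongside the question (the "stuff" strategy, which I believe is the default for `RetrievalQA.from_chain_type`). The sketch below imitates those two steps without calling any model. The keyword-overlap `retrieve` function, the prompt wording, and the toy chunks are all assumptions for illustration; the real chain retrieves by embedding similarity and uses its own prompt template.

```python
def retrieve(query: str, indexed_chunks: list[str], k: int = 2) -> list[str]:
    """Hypothetical retriever: rank chunks by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())

    def overlap(chunk: str) -> int:
        return len(query_words & set(chunk.lower().split()))

    return sorted(indexed_chunks, key=overlap, reverse=True)[:k]


def build_stuff_prompt(query: str, context_chunks: list[str]) -> str:
    """Place all retrieved chunks into one prompt, followed by the question."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


# Toy chunks standing in for the split resume text
toy_chunks = [
    "Computer knowledge: Windows XP, Word, Excel, PowerPoint.",
    "Home address: Strome House, North Strome, Lochcarron.",
    "Other interests: travel, walking, gardening, photography.",
]

question = "What computer knowledge is listed?"
top_chunks = retrieve(question, toy_chunks, k=2)
prompt = build_stuff_prompt(question, top_chunks)
print(prompt)
```

Because the LLM sees only the retrieved chunks, and sampling is not fully deterministic at a non-zero temperature, two runs of the same query can produce differently worded answers, as the two results above show.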

In [12]:
# Required imports
import ipywidgets as widgets
from IPython.display import display, clear_output
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Set up the LLM
llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0.2)

# Assume `faiss_index` is already created and available in the notebook
retriever = faiss_index.as_retriever()

# Create the RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, return_source_documents=False)

# UI elements
query_input = widgets.Text(
    value='',
    placeholder='Enter your question here...',
    description='Query:',
    layout=widgets.Layout(width='80%')
)

submit_button = widgets.Button(
    description='Get Answer',
    button_style='success'
)

output_area = widgets.Output()

# Callback function
def on_submit(b):
    with output_area:
        clear_output()
        query = query_input.value
        if query.strip() == "":
            print("Please enter a valid question.")
            return
        response = qa_chain({"query": query})
        print("Answer:", response['result'])

# Attach callback
submit_button.on_click(on_submit)

# Display the app
display(widgets.VBox([query_input, submit_button, output_area]))



# **Conclusion**

This activity provided a step-by-step guide on how to use LangChain’s loaders, splitters, embeddings, and vector stores. You now know how to load documents, split them into manageable chunks, embed them as numerical vectors, and store those embeddings for efficient similarity searches. You also saw how to connect the vector store to a RetrievalQA chain so that an LLM can answer questions about the documents.