
# **Demo: LangChain Loader, Splitter, Embeddings, and VectorStore**

# **Description:**
In this activity, you will use LangChain’s loaders, splitters, embeddings, and vector stores to load, chunk, embed, and search documents.
The two files used in the tutorial are practical examples of the kind of data encountered in natural language processing tasks:

*   The **state_of_union.txt** file, which contains a transcript of a United States State of the Union address, represents a large text document that can be loaded and processed.

*   The **michael_resume.pdf** file, an open source resume, represents a common type of document that one might analyze for tasks such as resume screening or information extraction.




# **Steps to Perform:**


1.   Import the Necessary Modules
2.   Load Text Data from a File Using TextLoader
3.   Load PDFs from the Internet Using PyPDFLoader
4.   Split the Documents Using RecursiveCharacterTextSplitter
5.   Embed the Documents Using HuggingFaceEmbeddings and Print the Length of the Embedding
6.   Embed the Documents Using OpenAIEmbeddings and Print the Length of the Embedding
7.   Create a Chroma Instance
8.   Perform a Similarity Search on the Chroma Instance
9.   Persist the Chroma Instance
10.  Load the Persisted Chroma Instance




# **Step 1: Import the Necessary Modules**







In [None]:
# Install the following libraries if they are not already installed
!pip install chromadb==0.3.29
!pip install pysqlite3
!pip install pysqlite3-binary
!pip install pypdf
!pip install langchain


In [None]:
# Swap pysqlite3 in for the standard-library sqlite3 before chromadb is imported;
# chromadb may require a newer SQLite than the environment provides.
import sys
import pysqlite3
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")

import chromadb
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings
from langchain.vectorstores import Chroma

# **Step 2: Load Text Data from a File Using TextLoader**



*   Print the first 100 characters from the loaded text.



In [None]:
text_loader = TextLoader("state_of_union.txt")
text_document = text_loader.load()  # Returns a list of Document objects
print(text_document[0].page_content[:100])  # Prints the first 100 characters of the loaded text

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th

# **Step 3: Load PDFs from the Internet Using PyPDFLoader**






In [None]:
pdf_loader = PyPDFLoader("michael_resume.pdf")
pdf_pages = pdf_loader.load_and_split()  # Loads the PDF and splits it into Document chunks
print(pdf_pages[0])  # Prints the first Document (its page content and metadata)


page_content='CURRICULUM VITAE :  \nM ichael M . Scott OBE, B.Sc., Dip.Ed  \n \nHome address :  Strome House     Date of Birth : 10.5.51 \n   North Strome     Place of Birth : Edinburgh \n   Lochcarron     Married to Sue Scott; 2 stepchildren \n   Ross-shire, IV54 8YJ \nTelephone (work) : 01520 722901     Website : www.mmscott.co.uk  \nTelephone (home) :  01520 722588    E-mail :  MSStrome@aol.com  \n \nAwarded OBE in Queen’s Birthday Honours, June 2005, “for services to biodiversity conservation in \nScotland”. \nAwarded Planta Europa ‘Silver Lead’ Award in September 2007, “for excellent work in European wild plant \nconservation”. \n \nEducation  \nPrimary education: George Heriots School, Edinburgh (1956-1962) \nSecondary education:  Madras College, St Andrews (1962-69). \nFurther education: University of Aberdeen (1969-1974): \n    Bachelor of Science (Honours; upper second) in Botany, 1973 \n    Diploma of Education, 1974 \n   Aberdeen College of Education (1973 - 1974): \n    Cer
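Both loaders return a list of `Document` objects, each carrying `page_content` and `metadata`, so the list must be indexed before the text can be sliced. The sketch below mimics that shape with a plain dataclass; `SimpleDocument` and `load_text` are hypothetical stand-ins written for illustration, not LangChain classes.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loader's Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str) -> list[SimpleDocument]:
    """Read a whole text file into a single-element list of documents."""
    with open(path, encoding="utf-8") as f:
        return [SimpleDocument(f.read(), {"source": path})]


# Write a small sample file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("Hello, world. " * 20)
    docs = load_text(sample_path)

print(len(docs))                   # 1
print(docs[0].page_content[:12])   # Hello, world
print(docs[:100] == docs)          # True: slicing the list returns documents, not characters
```

Slicing the list itself (`docs[:100]`) returns up to 100 documents, while `docs[0].page_content[:100]` returns up to 100 characters of the first document's text.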

# **Step 4: Split the Documents Using RecursiveCharacterTextSplitter**


*   Split the PDF pages into smaller chunks and print the number of chunks.



In [None]:
doc_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
split_texts = doc_splitter.split_documents(pdf_pages)
print(len(split_texts))  # Prints the number of chunks the PDF has been split into


15
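
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window character splitter. It is a simplified stand-in and not how `RecursiveCharacterTextSplitter` works internally: the real splitter also tries to break on separators such as paragraph and line breaks, so its chunks tend to end at more natural boundaries. The function name `sliding_window_split` is hypothetical.

```python
def sliding_window_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size chunks; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_split(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk begins with the last three characters of the previous one, which is the purpose of the overlap: text near a boundary appears in both neighbouring chunks, so context is not lost at the cut.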


# **Step 5: Embed the Documents Using HuggingFaceEmbeddings and Print the Length of the Embedding**






In [None]:
!pip install sentence-transformers
MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
hf_embed = HuggingFaceEmbeddings(model_name=MODEL_NAME)
text = split_texts[0].page_content
hf_embed_result = hf_embed.embed_documents([text])
print(len(hf_embed_result[0]))  # Prints the length of the first embedded document




768


# **Step 6: Embed the Documents Using OpenAIEmbeddings and Print the Length of the Embedding**




In [None]:
!pip install openai
!pip install tiktoken
import os
# Read the API key from the OPENAI_API_KEY environment variable; never hard-code a key in the notebook
openai_embed = OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"])
openai_embed_result = openai_embed.embed_documents([text])
print(len(openai_embed_result[0]))  # Prints the length of the first embedded document


1536
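
The two models produce vectors of different lengths (768 and 1536), so embeddings from one model cannot be compared with embeddings from the other. The sketch below shows a cosine-similarity helper that guards against such a mismatch. It uses toy vectors rather than real embeddings, and `cosine_similarity` is a hypothetical helper, not a LangChain function.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings
v1 = [1.0, 0.0, 0.0]
v2 = [1.0, 1.0, 0.0]
print(round(cosine_similarity(v1, v2), 4))  # 0.7071 (the vectors are 45 degrees apart)
```

In practice this means a vector store should be queried with the same embedding model that was used to build it.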


# **Step 7: Create a Chroma Instance**

*   Create a Chroma instance using the split texts and the OpenAIEmbeddings.

In [None]:
chroma_db = Chroma.from_documents(split_texts, openai_embed)


# **Step 8: Perform a Similarity Search on the Chroma Instance**


*   Print the top two most similar documents.

In [None]:
search_result = chroma_db.similarity_search_with_score("What is the candidate's skill sets?", k=2)
print(search_result)  # Prints the top 2 most similar documents to the query

[(Document(page_content='spring 2005, I went fully digital, and all photographs can be supplied in electronic format. \n \nComputer knowledge  \nI am reasonably fluent in basic PC computer skills, using Windows XP, Word, WordPro, Excel, PowerPoint, \nAdobe Photoshop Elements, e-mail, internet etc.  I have full computer and broadband facilities at home. \n \nOther interests  \nBotanising (especially mountain flowers), travel, walking, Scottish islands, gardening, photography, \ncomputers, rugby supporter, cinema, good wine, Runrig concerts (!). \n \n[updated, 26.03.08]', metadata={'source': 'michael_resume.pdf', 'page': 3}), 0.4553832411766052), (Document(page_content='CURRICULUM VITAE :  \nM ichael M . Scott OBE, B.Sc., Dip.Ed  \n \nHome address :  Strome House     Date of Birth : 10.5.51 \n   North Strome     Place of Birth : Edinburgh \n   Lochcarron     Married to Sue Scott; 2 stepchildren \n   Ross-shire, IV54 8YJ \nTelephone (work) : 01520 722901     Website : www.mmscott.co.uk  \
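
The number paired with each document is a distance, so a lower score means a closer match. As a rough sketch of what a similarity search does conceptually, the code below ranks a toy store by squared Euclidean distance to a query vector. This is a brute-force illustration under assumed data: the metric a real collection uses depends on its configuration, Chroma uses an index rather than a full scan, and `squared_l2`, `top_k`, and `toy_store` are hypothetical names.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = [(text, squared_l2(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical toy store: (chunk text, 2-dimensional embedding)
toy_store = [
    ("computer skills", [0.9, 0.1]),
    ("education", [0.1, 0.9]),
    ("other interests", [0.5, 0.5]),
]

results = top_k([1.0, 0.0], toy_store, k=2)
for text, distance in results:
    print(f"{distance:.2f}  {text}")
```

The chunk whose vector lies nearest the query comes first, mirroring the ordering of the `(Document, score)` pairs in the output above.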

# **Step 9: Persist the Chroma Instance**


*   Create a db folder in the current working directory that persists the Chroma instance.

In [None]:
PERSIST_DIRECTORY = "db"
chroma_db = Chroma.from_documents(documents=split_texts, embedding=openai_embed, persist_directory=PERSIST_DIRECTORY)
chroma_db.persist()



# **Step 10: Load the Persisted Chroma Instance**




In [None]:
vector_db = Chroma(persist_directory=PERSIST_DIRECTORY, embedding_function=openai_embed)
vector_search_result = vector_db.similarity_search_with_score("What is the candidate's skill sets?", k=2)
print(vector_search_result)  # Prints the top 2 most similar documents to the query from the loaded Chroma instance


[(Document(page_content='spring 2005, I went fully digital, and all photographs can be supplied in electronic format. \n \nComputer knowledge  \nI am reasonably fluent in basic PC computer skills, using Windows XP, Word, WordPro, Excel, PowerPoint, \nAdobe Photoshop Elements, e-mail, internet etc.  I have full computer and broadband facilities at home. \n \nOther interests  \nBotanising (especially mountain flowers), travel, walking, Scottish islands, gardening, photography, \ncomputers, rugby supporter, cinema, good wine, Runrig concerts (!). \n \n[updated, 26.03.08]', metadata={'source': 'michael_resume.pdf', 'page': 3}), 0.4553832411766052), (Document(page_content='CURRICULUM VITAE :  \nM ichael M . Scott OBE, B.Sc., Dip.Ed  \n \nHome address :  Strome House     Date of Birth : 10.5.51 \n   North Strome     Place of Birth : Edinburgh \n   Lochcarron     Married to Sue Scott; 2 stepchildren \n   Ross-shire, IV54 8YJ \nTelephone (work) : 01520 722901     Website : www.mmscott.co.uk  \
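
The reloaded store returns the same documents and scores as the in-memory one, which is the point of persisting: a save followed by a load should be a lossless round trip. The sketch below illustrates that idea with a plain JSON file. It does not reflect Chroma's on-disk format, and `save_store`, `load_store`, and the `store.json` file name are hypothetical.

```python
import json
import os
import tempfile


def save_store(store: dict, directory: str) -> str:
    """Write the store to <directory>/store.json, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "store.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(store, f)
    return path


def load_store(directory: str) -> dict:
    """Read the store back from <directory>/store.json."""
    with open(os.path.join(directory, "store.json"), encoding="utf-8") as f:
        return json.load(f)


# Hypothetical toy store mapping chunk ids to embeddings
toy_vectors = {"chunk-0": [0.9, 0.1], "chunk-1": [0.1, 0.9]}

with tempfile.TemporaryDirectory() as tmp:
    persist_dir = os.path.join(tmp, "db")
    save_store(toy_vectors, persist_dir)
    reloaded = load_store(persist_dir)

print(reloaded == toy_vectors)  # True
```

Comparing a query's results before and after reloading, as Steps 8 and 10 do, is a simple way to confirm that persistence worked.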

# **Conclusion**

This activity provided a step-by-step guide on how to use LangChain’s loaders, splitters, embeddings, and vector stores. You now know how to load documents, split them into manageable chunks, embed them into a numerical space, and store these embeddings for efficient similarity searches.