<a href="https://colab.research.google.com/github/ahsanrazi/LangChain/blob/main/07_Semantic_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import userdata
gemini_api_key = userdata.get('GEMINI_API_KEY').strip()

In [2]:
!pip install -qU langchain-community pypdf


# Semantic Search Engine

In [3]:
# This tutorial introduces LangChain's document loader, embedding, and vector store abstractions.
# They matter for applications that fetch data to be reasoned over as part of model inference, as in retrieval-augmented generation (RAG).

# We will build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query.

# Documents and Document Loaders

In [4]:
# LangChain implements a Document abstraction, which is intended to represent a unit of text and associated metadata.

# It has three attributes:
# page_content: a string representing the content
# metadata: a dict containing arbitrary metadata
# id: (optional) a string identifier for the document.

# The metadata attribute can capture information about the source of the document, its relationship to other documents, and other information.
# An individual Document object often represents a chunk of a larger document.

In [5]:
# We can generate sample documents when desired.

from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

# Loading documents

In [6]:
# The LangChain ecosystem implements document loaders that integrate with hundreds of common sources.

In [52]:
# Load a PDF into a sequence of Document objects.
# We use PyPDFLoader, which is fairly lightweight.

from langchain_community.document_loaders import PyPDFLoader

file_path = "/content/WBT.pdf"

loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

15


In [53]:
# PyPDFLoader loads one Document object per PDF page. For each, we can easily access:

# The string content of the page;
# Metadata containing the file name and page number.

print(f"{docs[0].page_content[:200]}\n")

Knowledge Base: We Build Trades 
About 
We Build Trades is a UK-based digital marketing and software agency founded in 2017 by Daniel Brown. 
It specializes in comprehensive marketing solutions for tr



In [54]:
print(docs[0].metadata)

{'source': '/content/WBT.pdf', 'page': 0, 'page_label': '1'}


# Splitting

In [55]:
# For both information retrieval and downstream question-answering purposes, a page may be too coarse a representation.
# Our goal in the end will be to retrieve Document objects that answer an input query, and further splitting our PDF will help ensure that
# the meanings of relevant portions of the document are not "washed out" by surrounding text.

# We can use text splitters for this purpose.
# We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines
# until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

# We set add_start_index=True so that the character index at which each split Document starts
# within the original Document is preserved as the metadata attribute "start_index".

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=750, chunk_overlap=300, add_start_index=True)

all_splits = text_splitter.split_documents(docs)

len(all_splits)

46
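To see how chunk size, overlap, and the recorded start index relate, here is a minimal sketch of a fixed-window splitter. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as new lines); the helper name `split_with_overlap` and the sample string are assumptions made up for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Returns a list of (start_index, chunk) pairs. Simplified illustration only.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for start_index, chunk in chunks:
    print(start_index, chunk)
```

Each chunk repeats the last 4 characters of the previous one, which is the same idea as the chunk_overlap=300 setting above: text near a chunk boundary appears in both neighbouring chunks, so it is less likely to lose its context.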

# Embeddings

In [56]:
!pip install -qU langchain-google-genai

In [57]:
# Vector search is a common way to store and search over unstructured data (such as unstructured text).
# The idea is to store numeric vectors that are associated with the text.
# Given a query, we can embed it as a vector of the same dimension and use vector similarity metrics (such as cosine similarity) to identify related text.
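As a rough illustration of the similarity metric mentioned above, the following sketch computes cosine similarity in plain Python. The three-dimensional vectors are toy values chosen by hand, not real embeddings, and the function name is an assumption for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]       # toy "query" vector
doc_close = [2.0, 0.0, 2.0]   # same direction as the query, different length
doc_far = [0.0, 3.0, 0.0]     # orthogonal to the query

print(cosine_similarity(query, doc_close))  # close to 1: same direction
print(cosine_similarity(query, doc_far))    # 0: no shared direction
```

Because the measure depends only on direction and not on vector length, `doc_close` scores near 1 even though its components are twice as large as the query's.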

In [58]:
# import os

# os.environ['GOOGLE_API_KEY'] = gemini_api_key

In [68]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=gemini_api_key)

In [69]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

In [70]:
print(vector_1[:10])

[0.042216330766677856, -0.06837313622236252, -0.02941972203552723, -0.01358675118535757, 0.06945652514696121, 0.0012776809744536877, 0.052310261875391006, 0.003017732407897711, 0.03554920852184296, 0.029503092169761658]


In [71]:
print(vector_2[:10])

[0.06741836667060852, -0.053920891135931015, -0.03039383701980114, -0.01967283897101879, 0.033709362149238586, 0.005014019086956978, 0.024784499779343605, -0.024772273376584053, 0.03712800145149231, 0.03918985277414322]


# Vector stores

In [72]:
!pip install -qU langchain-pinecone

In [73]:
# LangChain VectorStore objects contain methods for adding text and Document objects to the store, and querying them using various similarity metrics.
# They are often initialized with embedding models, which determine how text data is translated to numeric vectors.
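The idea of a store that is initialized with an embedding function, accepts text, and answers similarity queries can be sketched in a few lines of plain Python. Everything here is hypothetical: `ToyVectorStore` is not a LangChain class, and `toy_embed` uses simple word counts, so it matches shared words rather than meaning the way a real embedding model does.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: word counts as a sparse vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Minimal in-memory store: keeps (vector, text) pairs and ranks them per query."""

    def __init__(self, embed):
        self.embed = embed
        self.entries = []

    def add_texts(self, texts):
        for text in texts:
            self.entries.append((self.embed(text), text))

    def similarity_search(self, query, k=1):
        query_vector = self.embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine(query_vector, entry[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add_texts([
    "dogs are loyal companions",
    "cats enjoy their own space",
])
top = store.similarity_search("which pets are loyal", k=1)
print(top)
```

A hosted store such as Pinecone follows the same add-then-query pattern, but persists the vectors remotely and uses an index instead of scanning every entry.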

In [74]:
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone

index_name = "langchain"
namespace = "Example"

pc = Pinecone(api_key=userdata.get('PINECONE_API'))
index = pc.Index(index_name)

vector_store = PineconeVectorStore(embedding=embeddings, index=index, namespace=namespace)

In [75]:
# Having instantiated our vector store, we can now index the documents.

ids = vector_store.add_documents(documents=all_splits)

# Usage

In [76]:
# Embeddings typically represent text as a "dense" vector such that texts with similar meanings are geometrically close.
# This lets us retrieve relevant information just by passing in a question, without knowing any of the specific key terms used in the document.

# Return documents based on similarity to a string query
results = vector_store.similarity_search("Who is the founder of We Build Trade")

print(results[0])

page_content='Core Values 
We Build Trades operates with the following principles: 
1. Client Obsession 
o Always prioritizing what is best for clients. 
2. Full Transparency 
o Maintaining honesty and openness in all dealings. 
3. Long-Term Thinking 
o Emphasizing sustainable growth over short-term gains. 
4. Relentless Ambition 
o Continuously striving for excellence and improvement. 
 
Founder: Daniel Brown 
• Role: Founder and CEO of We Build Trades. 
• Career Start: Former senior marketing consultant. 
• Journey: Founded We Build Trades in 2017 with limited resources but strong mentorship and a 
passion for learning. 
• Achievements: Under his leadership, the agency has served over 60 clients and expanded 
internationally. 
 
Key Differentiators' metadata={'page': 1.0, 'page_label': '2', 'source': '/content/WBT.pdf', 'start_index': 0.0}


In [77]:
# Async query:

results = await vector_store.asimilarity_search("Who is the founder of We Build Trade")

print(results[0])

page_content='Core Values 
We Build Trades operates with the following principles: 
1. Client Obsession 
o Always prioritizing what is best for clients. 
2. Full Transparency 
o Maintaining honesty and openness in all dealings. 
3. Long-Term Thinking 
o Emphasizing sustainable growth over short-term gains. 
4. Relentless Ambition 
o Continuously striving for excellence and improvement. 
 
Founder: Daniel Brown 
• Role: Founder and CEO of We Build Trades. 
• Career Start: Former senior marketing consultant. 
• Journey: Founded We Build Trades in 2017 with limited resources but strong mentorship and a 
passion for learning. 
• Achievements: Under his leadership, the agency has served over 60 clients and expanded 
internationally. 
 
Key Differentiators' metadata={'page': 1.0, 'page_label': '2', 'source': '/content/WBT.pdf', 'start_index': 0.0}
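The bare `await` above works because notebooks allow top-level await. In a plain Python script the coroutine has to be driven by an event loop, for example with `asyncio.run`. The sketch below shows that pattern with a hypothetical `StubStore` standing in for the real vector store, so it runs without any API keys; the class, its return values, and the second query are assumptions for illustration only.

```python
import asyncio


class StubStore:
    """Hypothetical stand-in for a vector store that exposes an async search method."""

    async def asimilarity_search(self, query):
        await asyncio.sleep(0)  # placeholder for awaiting a network call
        return [f"result for: {query}"]


async def main():
    store = StubStore()
    queries = ["Who founded the company?", "What are the core values?"]
    # Start both searches and wait for them together; results keep the query order.
    return await asyncio.gather(*(store.asimilarity_search(q) for q in queries))


# In a script (not a notebook cell), asyncio.run starts the event loop.
results = asyncio.run(main())
print(results)
```

Issuing several searches through `asyncio.gather` lets the waits overlap, which is the main practical reason to prefer the async method when many queries are sent at once.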
