# Splitting and Embedding Text Using LangChain (Similarity Search)

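This notebook uses the latest versions (at the time of writing) of the OpenAI, LangChain, and Pinecone libraries. It splits a speech into chunks, embeds the chunks, stores the embeddings in a Pinecone index, and queries the index by similarity.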

In [None]:
pip install -q -r ./requirements.txt

Download [requirements.txt](https://drive.google.com/file/d/1UpURYL9kqjXfe9J8o-_Dq5KJTbQpzMef/view?usp=sharing)

In [19]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True
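
`load_dotenv` returning `True` only means a `.env` file was found and loaded, not that it holds the keys the later cells need. A minimal sketch of a sanity check follows. The helper `missing_keys` is hypothetical, and the variable names `OPENAI_API_KEY` and `PINECONE_API_KEY` are assumed to be the ones your `.env` file defines.

```python
import os

# Assumed variable names; adjust them to match your own .env file.
REQUIRED_KEYS = ('OPENAI_API_KEY', 'PINECONE_API_KEY')

def missing_keys(required=REQUIRED_KEYS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = missing_keys()
if missing:
    print(f'Missing environment variables: {", ".join(missing)}')
else:
    print('All required API keys are set.')
```

Running this right after loading the `.env` file surfaces a missing key early instead of as an authentication error several cells later.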

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open('files/churchill_speech.txt') as f:
    churchill_speech = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len
)

In [3]:
chunks = text_splitter.create_documents([churchill_speech])
# print(chunks[2])
# print(chunks[10].page_content)
print(f'Now you have {len(chunks)} chunks')

Now you have 300 chunks
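
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sketch that cuts fixed-width character windows. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraphs, lines, and spaces, so its chunks will differ. The function `naive_chunks` and the sample text are hypothetical.

```python
def naive_chunks(text, chunk_size=100, chunk_overlap=20):
    """Cut fixed-width windows; each starts `chunk_size - chunk_overlap` characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be >= 0 and smaller than chunk_size')
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return windows

sample = 'we shall fight on the beaches, ' * 10
pieces = naive_chunks(sample, chunk_size=100, chunk_overlap=20)
print(len(pieces), [len(p) for p in pieces])
```

In this simplified scheme the last 20 characters of one window are repeated as the first 20 of the next, which is the purpose of the overlap: a sentence cut at a boundary still appears intact in at least one chunk.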


### Embedding Cost

In [4]:
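def print_embedding_cost(texts):
    import tiktoken
    # tokenizer that matches the embedding model used below
    enc = tiktoken.encoding_for_model('text-embedding-3-small')
    total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
    print(f'Total Tokens: {total_tokens}')
    # 0.0004 USD per 1K tokens is a hard-coded rate; check it against OpenAI's current pricing for this model
    print(f'Embedding Cost in USD: {total_tokens / 1000 * 0.0004:.6f}')

print_embedding_cost(chunks)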

Total Tokens: 4820
Embedding Cost in USD: 0.001928
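
The cost is plain arithmetic: tokens divided by 1000, times the price per 1K tokens. Since prices change, it may help to pass the rate in as a parameter instead of hard-coding it. The sketch below does that; `embedding_cost_usd` is a hypothetical helper, and `0.0004` is simply the rate used in the cell above, not a verified current price.

```python
def embedding_cost_usd(total_tokens, usd_per_1k_tokens):
    """Cost of embedding `total_tokens` tokens at a given per-1K-token rate."""
    return total_tokens / 1000 * usd_per_1k_tokens

# 4820 tokens is the total reported above; the rate is a placeholder to replace.
cost = embedding_cost_usd(4820, usd_per_1k_tokens=0.0004)
print(f'Embedding Cost in USD: {cost:.6f}')
```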


### Creating embeddings

In [11]:
# import warnings
# warnings.filterwarnings('ignore', module='langchain')

In [12]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model='text-embedding-3-small', dimensions=1536)  # 512 also works, but the Pinecone index must then be created with dimension=512

In [13]:
vector = embeddings.embed_query(chunks[0].page_content)
vector

[0.021032487132780428,
 0.0420881381197107,
 0.0793582616314768,
 0.019179404877824057,
 0.00022132030269142003,
 -0.0307611699026239,
 -0.0038943687294045004,
 -0.0015099726712204623,
 -0.009757637587530777,
 0.01807913670686713,
 0.03261425215758027,
 -0.0025827336489591367,
 0.016955704681760355,
 -0.035417041256809736,
 -0.004612438196332347,
 0.016700906011402235,
 -0.013608574323234956,
 0.003934904775674822,
 0.009827128218657766,
 0.020905087797601366,
 -0.026082136556933293,
 -0.004537156601667896,
 -0.042713551005886,
 -0.01885511464501641,
 -0.034119880325579145,
 0.007696083439193435,
 -0.003433993414603504,
 -0.04709145983556388,
 0.023036132577065698,
 -0.012913668943287587,
 0.030575860932070233,
 -0.018113880625446818,
 -0.0016315812756926959,
 0.03685317786235909,
 0.02791205495773982,
 -0.05295183297909079,
 0.044450817715383305,
 -0.013237958244772698,
 -0.04259773546042694,
 0.0020499724540930537,
 -0.006636351547337462,
 -0.04855076029526307,
 0.029950446183249855,
 ...]

### Inserting the Embeddings into a Pinecone Index

In [20]:
# importing the necessary libraries and initializing the Pinecone client
import pinecone
from langchain_community.vectorstores import Pinecone

# the client reads PINECONE_API_KEY from the environment (loaded from .env above)
pc = pinecone.Pinecone()

In [22]:
# deleting all indexes in the project (destructive: skip this cell if you have indexes you want to keep)
indexes = pc.list_indexes().names()
for i in indexes:
    print('Deleting all indexes ... ', end='')
    pc.delete_index(i)
    print('Done')

Deleting all indexes ... Done


In [26]:
# creating an index
from pinecone import PodSpec
index_name = 'churchill-speech'
if index_name not in pc.list_indexes().names():
    print(f'Creating index {index_name}')
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric='cosine',
        spec=PodSpec(
            environment='gcp-starter'
        )
    )
    print('Index created! 😊')
else:
    print(f'Index {index_name} already exists!')

Creating index churchill-speech
Index created! 😊


In [28]:
# processing the input documents, generating embeddings using the provided `OpenAIEmbeddings` instance,
# inserting the embeddings into the index and returning a new Pinecone vector store object. 
vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)

In [29]:
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.003,
 'namespaces': {'': {'vector_count': 300}},
 'total_vector_count': 300}

### Asking Questions (Similarity Search)

In [30]:
query = 'Where should we fight?'
result = vector_store.similarity_search(query)
print(result)

[Document(page_content='shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and'), Document(page_content='end, we shall fight in France, we shall fight on the seas and oceans, we shall fight with growing'), Document(page_content='streets, we shall fight in the hills; we shall never surrender, and even if, which I do not for a'), Document(page_content='number of the enemy, and fought fiercely on some of the old grounds that so many of us knew so')]


In [31]:
for r in result:
    print(r.page_content)
    print('-' * 50)

shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and
--------------------------------------------------
end, we shall fight in France, we shall fight on the seas and oceans, we shall fight with growing
--------------------------------------------------
streets, we shall fight in the hills; we shall never surrender, and even if, which I do not for a
--------------------------------------------------
number of the enemy, and fought fiercely on some of the old grounds that so many of us knew so
--------------------------------------------------
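
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it under the index metric (`cosine` here). The sketch below shows that idea as a brute-force scan in pure Python. It is an illustration only: Pinecone does not scan every vector like this, and the helpers `cosine_similarity` and `top_k` and the toy 3-dimensional vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy data standing in for the 1536-dimensional embeddings.
toy_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.05, 0.0]
print(top_k(toy_query, toy_docs, k=2))
```

Cosine similarity depends only on direction, not on vector length, so two chunks with similar meaning score high even if their embeddings differ in magnitude.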


### Answering in Natural Language using an LLM

In [47]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# Initialize the LLM with the specified model and temperature
llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0.2)

# Use the provided vector store with similarity search and retrieve top 3 results
retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 3})

# Create a RetrievalQA chain using the defined LLM, chain type 'stuff', and retriever
chain = RetrievalQA.from_chain_type(llm=llm, chain_type='stuff', retriever=retriever)
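
The `'stuff'` chain type places all retrieved chunks into a single prompt and sends it to the LLM in one call. A rough sketch of that idea follows. The template wording and the helper `build_stuff_prompt` are hypothetical and are not LangChain's actual prompt; the three chunks are the top results from the similarity search above.

```python
def build_stuff_prompt(question, retrieved_texts):
    """Concatenate all retrieved chunks into one context block, then append the question."""
    context = '\n\n'.join(retrieved_texts)
    return (
        'Use the following context to answer the question.\n\n'
        f'{context}\n\n'
        f'Question: {question}\n'
        'Answer:'
    )

retrieved = [
    'shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and',
    'end, we shall fight in France, we shall fight on the seas and oceans, we shall fight with growing',
    'streets, we shall fight in the hills; we shall never surrender, and even if, which I do not for a',
]
prompt = build_stuff_prompt('Where should we fight?', retrieved)
print(prompt)
```

Stuffing is simple and needs only one LLM call, but the combined chunks must fit in the model's context window, which is one reason to keep `k` small.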


In [48]:
query = 'Answer only from the provided input. Where should we fight?'
answer = chain.invoke(query)
print(answer)

{'query': 'Answer only from the provided input. Where should we fight?', 'result': 'We should fight on the beaches, landing grounds, fields, in France, on the seas and oceans, in the streets, and in the hills.'}


In [49]:
query = 'Who was the king of Belgium at that time?'
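# chain.run() is deprecated; invoke() returns a dict with the answer text under 'result'
answer = chain.invoke(query)
print(answer['result'])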

The king of Belgium at that time was King Leopold.


In [51]:
query = 'What about the French Armies??'
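# chain.run() is deprecated; invoke() returns a dict with the answer text under 'result'
answer = chain.invoke(query)
print(answer['result'])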

The French Armies were involved in the fighting alongside the British Armies. They were supposed to advance across the Somme in great strength to capture the territory. However, without further context, it is unclear what specific role or actions the French Armies had during the battle.
