## Splitting and Embedding Text Using LangChain

Last Update: Jan 10, 2024

In [1]:
pip install -r ./requirements.txt -q

Note: you may need to restart the kernel to use updated packages.


Download [requirements.txt](https://drive.google.com/file/d/1IeTp3JOjhkHYr21tEEh_7X8ozx39v2xc/view?usp=sharing)

In [1]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True
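
The later cells read API keys from the environment, so it can help to confirm they were actually loaded before going further. The following is a minimal sketch, assuming the `.env` file defines `OPENAI_API_KEY` (the variable the OpenAI client reads by default) alongside the `PINECONE_API_KEY` and `PINECONE_ENV` variables used further down. `missing_keys` is a hypothetical helper, not part of `python-dotenv`.

```python
import os

# Variable names assumed for this notebook: OPENAI_API_KEY is read by the
# OpenAI client by default; the two Pinecone names are used in a later cell.
REQUIRED_KEYS = ('OPENAI_API_KEY', 'PINECONE_API_KEY', 'PINECONE_ENV')

def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = missing_keys()
if missing:
    print(f'Missing environment variables: {", ".join(missing)}')
else:
    print('All required environment variables are set.')
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.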

In [48]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open('data/churchill_speech.txt') as f:
    churchill_speech = f.read()


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len
)

In [51]:
chunks = text_splitter.create_documents([churchill_speech])
# print(chunks[2])
# print(chunks[10].page_content)
print(f'Now you have {len(chunks)} chunks')

Now you have 24 chunks
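
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width sliding window in plain Python. It is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph breaks and spaces, so its real chunks will differ. `sliding_window_chunks` is a hypothetical helper, not a LangChain function.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-width chunks whose neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be in [0, chunk_size)')
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = 'abcdefghijklmnopqrstuvwxyz'
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

With these toy settings each chunk starts with the last three characters of the previous one, which is the purpose of the overlap: text cut at a chunk boundary still appears intact in one of the two neighbours.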


In [52]:
chunks

[Document(page_content='Winston Churchill Speech - We Shall Fight on the Beaches\nWe Shall Fight on the Beaches\nJune 4, 1940\nHouse of Commons\nFrom the moment that the French defenses at Sedan and on the Meuse were broken at the end of the\nsecond week of May, only a rapid retreat to Amiens and the south could have saved the British and\nFrench Armies who had entered Belgium at the appeal of the Belgian King; but this strategic fact was\nnot immediately realized. The French High Command hoped they would be able to close the gap, and\nthe Armies of the north were under their orders. Moreover, a retirement of this kind would have\ninvolved almost certainly the destruction of the fine Belgian Army of over 20 divisions and the\nabandonment of the whole of Belgium. Therefore, when the force and scope of the German\npenetration were realized and when a new French Generalissimo, General Weygand, assumed\ncommand in place of General Gamelin, an effort was made by the French and British Armie

### Embedding Cost

In [53]:
def print_embedding_cost(texts):
    """Print the token count and estimated embedding cost for a list of documents."""
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
    print(f'Total Tokens: {total_tokens}')
    # 0.0004 is the USD price per 1K tokens assumed here; check current pricing.
    print(f'Embedding Cost in USD: {total_tokens / 1000 * 0.0004:.6f}')

print_embedding_cost(chunks)

Total Tokens: 5015
Embedding Cost in USD: 0.002006
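
Because the price per 1K tokens changes over time, it may be convenient to pass the rate in rather than hardcode it. The sketch below does the same arithmetic as the cell above; `estimate_embedding_cost` is a hypothetical helper, and its default rate of 0.0004 is simply the value used in this notebook, not a confirmed current price.

```python
def estimate_embedding_cost(total_tokens, usd_per_1k_tokens=0.0004):
    """Return the estimated embedding cost in USD for a given token count."""
    if total_tokens < 0:
        raise ValueError('total_tokens must be non-negative')
    return total_tokens / 1000 * usd_per_1k_tokens

# 5015 is the token count reported for the 24 chunks above.
cost = estimate_embedding_cost(5015)
print(f'Embedding Cost in USD: {cost:.6f}')
```

Returning the number instead of printing it lets the same function be reused to compare rates, for example `estimate_embedding_cost(5015, usd_per_1k_tokens=0.0001)`.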


### Creating embeddings

In [54]:
import warnings
warnings.filterwarnings('ignore', module='langchain')

In [55]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [56]:
vector = embeddings.embed_query(chunks[0].page_content)
vector

[-0.03597686139290937,
 -0.019671557865208782,
 0.010164515318932128,
 1.675782766059516e-05,
 0.01771229168490454,
 0.03158494893401218,
 -0.014424932478240147,
 -0.019355971753898037,
 0.021012800235263338,
 -0.004539843273951114,
 0.013609667581251894,
 0.028560577532558345,
 -0.001090581434273212,
 -0.010992929559614777,
 -0.002071036300198569,
 -0.011591228888661567,
 0.0228274227036066,
 -0.004365613305847096,
 0.003839635599950144,
 -0.017817488297104916,
 -0.0004865291800486292,
 -0.029296946367380207,
 0.03679212628989762,
 0.012656333178488572,
 -0.014424932478240147,
 -0.020210685613292076,
 0.025181172920033162,
 -0.019185029354548265,
 -0.02222254916837391,
 -0.004829130697873062,
 -0.00024100453202492437,
 -0.0012746735265633533,
 -0.023918826612144995,
 -0.0032281869272089555,
 -0.016528843301827953,
 -0.022169951793596317,
 -0.016752383075955318,
 0.0035667848556461936,
 0.013491322556679718,
 -0.014753668864567884,
 0.0162921532526836,
 0.012189528217708358,
 0.0112361
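
The embedding is a plain list of floats. The index created later in this notebook uses `dimension=1536` and `metric='cosine'`, so two texts are compared by the cosine of the angle between their vectors. A minimal sketch of that metric in plain Python follows, using toy 3-dimensional vectors in place of real embeddings; `cosine_similarity` is a hypothetical helper written for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError('cosine similarity is undefined for a zero vector')
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors stand in for real 1536-dimensional embeddings.
print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

Values near 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions, and the magnitude of the vectors does not matter.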

### Inserting the Embeddings into a Pinecone Index

In [57]:
import os
import pinecone
from langchain.vectorstores import Pinecone

pinecone.init(api_key=os.environ.get('PINECONE_API_KEY'), environment=os.environ.get('PINECONE_ENV'))

In [31]:
# deleting all indexes (this permanently removes every index in the project)
indexes = pinecone.list_indexes()
for i in indexes:
    print('Deleting all indexes ... ', end='')
    pinecone.delete_index(i)
    print('Done')

Deleting all indexes ... Done


In [32]:
# creating an index
index_name = 'churchill-speech'
if index_name not in pinecone.list_indexes():
    print(f'Creating index {index_name} ...')
    pinecone.create_index(index_name, dimension=1536, metric='cosine')
    print('Done!')

Creating index churchill-speech ...
Done!


In [58]:
vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)

### Asking Questions (Similarity Search)

In [59]:
query = 'Where should we fight?'
result = vector_store.similarity_search(query)


In [60]:
for r in result:
    print(r.page_content)
    print('-' * 50)

shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and
--------------------------------------------------
front, now on that, fighting
--------------------------------------------------
end, we shall fight in France, we shall fight on the seas and oceans, we shall fight with growing
--------------------------------------------------
Winston Churchill Speech - We Shall Fight on the Beaches
We Shall Fight on the Beaches
June 4, 1940
--------------------------------------------------
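
Four results come back because `similarity_search` defaults to `k=4`. Conceptually, the store scores stored chunks against the query and keeps the best-scoring ones. The sketch below shows only that final ranking step, assuming the scores are already computed. `top_k` and the sample scores are made up for illustration, and a production vector database does not necessarily scan every vector the way this does.

```python
import heapq

def top_k(scored_chunks, k=4):
    """Return the k (text, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[1])

# Made-up similarity scores, for illustration only.
scored = [
    ('chunk A', 0.71),
    ('chunk B', 0.88),
    ('chunk C', 0.64),
    ('chunk D', 0.80),
    ('chunk E', 0.59),
]
for text, score in top_k(scored, k=3):
    print(f'{score:.2f}  {text}')
```

In the notebook itself, the number of results can be changed by passing `k` to `similarity_search`, for example `vector_store.similarity_search(query, k=2)`.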


### Answering in Natural Language using an LLM

In [67]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=1)

retriever = vector_store.as_retriever(search_type='similarity')

chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
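
`chain_type="stuff"` means the retrieved chunks are inserted ("stuffed") into a single prompt together with the question, and that one prompt is sent to the LLM. Here is a rough sketch of the idea with a made-up prompt template. LangChain's actual template wording differs, and `build_stuff_prompt` is a hypothetical helper.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all retrieved chunks into one context block, then append the question."""
    context = '\n\n'.join(chunks)
    return (
        'Use the following context to answer the question.\n\n'
        f'{context}\n\n'
        f'Question: {question}\n'
        'Answer:'
    )

# Sample chunk texts for illustration; in the notebook they come from the retriever.
retrieved = [
    'we shall fight on the beaches, we shall fight on the landing grounds',
    'we shall fight in France, we shall fight on the seas and oceans',
]
print(build_stuff_prompt(retrieved, 'Where should we fight?'))
```

Because everything goes into one prompt, this approach only works while the combined chunks fit in the model's context window, which is one reason the chunk size chosen at the splitting step matters.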


In [71]:
query = 'Where should we fight?'
answer = chain.run(query)
print(answer)

According to Winston Churchill's speech, "We Shall Fight on the Beaches," we should fight on the beaches, the landing grounds, in the fields, in France, on the seas and oceans, and wherever necessary to defend our cause.


In [69]:
query = 'Who was the king of Belgium at that time?'
# query = 'What about the French Armies??'
answer = chain.run(query)
print(answer)

The king of Belgium at that time was King Leopold.
