# Splitting and Embedding Text Using LangChain (Similarity Search)

This notebook uses the OpenAI, LangChain, and Pinecone libraries; install them from the requirements.txt file linked below.

In [1]:
pip install -q -r ./requirements.txt

Note: you may need to restart the kernel to use updated packages.





Download [requirements.txt](https://drive.google.com/file/d/1UpURYL9kqjXfe9J8o-_Dq5KJTbQpzMef/view?usp=sharing)

In [2]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True
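
The later cells need API keys to be present in the environment once `load_dotenv` has run. Here is a minimal sketch of an up-front check. It assumes the keys are named `OPENAI_API_KEY` and `PINECONE_API_KEY`, the names the OpenAI and Pinecone clients read by default; the notebook's `.env` file is not shown, so treat the names as an assumption. `missing_keys` is a hypothetical helper, not part of any library.

```python
import os

# Assumed variable names; the notebook's .env file is not shown.
REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY")

def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = missing_keys()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required API keys are set.")
```

Running a check like this before the embedding and Pinecone cells tends to give a clearer message than an authentication error raised deep inside a client call.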

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

input_file_path = "files/Steve_Jobs_speech.txt"

# Read the speech. Note: errors='ignore' silently drops any bytes that are not
# valid cp950, so characters such as curly apostrophes can disappear.
# If the file is actually UTF-8, use encoding='utf-8' instead.
with open(input_file_path, 'r', encoding='cp950', errors='ignore') as file:
    speech_content = file.read()



text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=20,
    length_function=len
)



In [4]:
chunks = text_splitter.create_documents([speech_content])

print(chunks[2])

print("***" * 20)

print(chunks[10].page_content)

print("***" * 20)

print(f'Now you have {len(chunks)} chunks of text.')

page_content='deal. Just three stories.'
************************************************************
It wasnt all romantic. I didnt have a dorm room, so I slept on the floor in friends rooms, I returned Coke bottles for the 5瞽 deposits to buy food with, and I would walk the 7 miles across town every Sunday night to get one good meal a week at the Hare Krishna temple. I loved it. And much of what I
************************************************************
Now you have 54 chunks of text.
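
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses: that splitter prefers to break on separators such as paragraph and line breaks, which is why the chunks above vary in length. `sliding_window_chunks` is a hypothetical helper written only for illustration.

```python
def sliding_window_chunks(text, chunk_size=300, chunk_overlap=20):
    """Split text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for piece in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(piece)
```

Each printed window starts with the last three characters of the previous one. The overlap is there so that a sentence cut at a chunk boundary still appears whole in at least one chunk.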


### Embedding Cost

Language models don't see text the way you and I do; they see a sequence of numbers (known as tokens). Byte pair encoding (BPE) is a way of converting text into tokens. It has several desirable properties:

It's reversible and lossless, so you can convert tokens back into the original text.

It works on arbitrary text, even text that is not in the tokeniser's training data.

It compresses the text: the token sequence is shorter than the bytes corresponding to the original text. On average, in practice, each token corresponds to about 4 bytes.

It attempts to let the model see common subwords. For instance, "ing" is a common subword in English, so BPE encodings will often split "encoding" into tokens like "encod" and "ing" (instead of e.g. "enc" and "oding"). Because the model will then see the "ing" token again and again in different contexts, it helps models generalise and better understand grammar.
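
The merge idea behind BPE can be sketched in a few lines. This is a toy version under stated assumptions: it starts from characters rather than bytes and learns its merges from the input string itself, whereas a real tokeniser such as tiktoken applies a fixed, pre-trained merge table. `most_frequent_pair` and `merge_pair` are hypothetical helper names.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair of tokens, or None."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of the adjacent pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

text = "singing ringing"
tokens = list(text)          # start from single characters
for _ in range(2):           # apply two merge steps
    pair = most_frequent_pair(tokens)
    if pair is None:
        break
    tokens = merge_pair(tokens, pair)

print(tokens)
print("".join(tokens) == text)   # lossless: tokens join back to the text
```

After two merges the frequent subword "ing" has become a single token, the token list is shorter than the character list, and joining the tokens reproduces the input exactly. Those are the compression, common-subword, and reversibility properties described above.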

In [5]:
len(speech_content)

12006

In [6]:
import tiktoken

def print_embedding_cost(texts, usd_per_1k_tokens=0.0004):
    # tiktoken maps 'text-embedding-ada-002' to the cl100k_base encoding, the same
    # encoding used by 'text-embedding-3-small', so the token count applies to both.
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
    print(f'Total Tokens: {total_tokens}')
    # The default rate is the one used in this notebook; check current pricing for your model.
    print(f'Embedding Cost in USD: {total_tokens / 1000 * usd_per_1k_tokens:.6f}')

print_embedding_cost(chunks)

Total Tokens: 2789
Embedding Cost in USD: 0.001116
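
The arithmetic behind these two numbers can be checked by hand. The sketch below reuses the figures printed in this notebook (12006 characters, 2789 tokens) and the notebook's rate of 0.0004 USD per 1,000 tokens; that rate is taken from the code above and may not match current pricing. `embedding_cost_usd` is a hypothetical helper.

```python
def embedding_cost_usd(total_tokens, usd_per_1k_tokens):
    """Cost of embedding total_tokens at a given price per 1,000 tokens."""
    return total_tokens / 1000 * usd_per_1k_tokens

total_chars = 12006    # len(speech_content) from the cell above
total_tokens = 2789    # token count printed above

print(f"Cost: ${embedding_cost_usd(total_tokens, 0.0004):.6f}")
print(f"Characters per token: {total_chars / total_tokens:.2f}")
```

The ratio comes out a little above 4 characters per token, in line with the "about 4 bytes" rule of thumb. It is only approximate, because the token count is taken over the chunks, which include a small overlap, while the character count is for the whole speech.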


### Creating embeddings

In [7]:
# import warnings
# warnings.filterwarnings('ignore', module='langchain')

In [10]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model='text-embedding-3-small', 
                              dimensions=1536)  # 512 works as well

In [11]:
vector = embeddings.embed_query(chunks[0].page_content)
vector

[0.060803676403939445,
 -0.0159058231706826,
 0.00828262491442416,
 0.042203302397315956,
 0.03133413119197265,
 -0.0548006208288112,
 -0.013063467720464364,
 0.04920686661408387,
 -0.01799779713967108,
 -0.017747669979261162,
 0.02159053593376417,
 0.006082641743801183,
 -0.003996352567669523,
 -0.03272120202258594,
 -0.005792721499054794,
 -0.025717635943172924,
 -0.034699480134438,
 0.00563923395456573,
 0.018407096637426883,
 -0.01995333794274096,
 0.009169440492895068,
 0.011949264153010822,
 -0.05134431892592692,
 0.06757985635701493,
 0.02138588525356372,
 -0.01396165195332636,
 -0.007969965963911762,
 0.025262859965207655,
 0.019839643948249643,
 -0.0340173161674901,
 0.02288665007866833,
 -0.017099612906809083,
 0.010908961961373403,
 0.01655388024313468,
 -0.004414178789049282,
 -0.010516716842188063,
 0.006076956950944362,
 -0.0038997122532567555,
 0.026604452452966384,
 -0.04140744257323163,
 0.0034762012390201756,
 -0.012483627230971584,
 0.06830749494152719,
 0.0485701772

### Inserting the Embeddings into a Pinecone Index

In [19]:
# Import the Pinecone client and the LangChain vector store wrapper.
# pinecone.Pinecone() reads the API key from the PINECONE_API_KEY environment variable.
import pinecone
from langchain_community.vectorstores import Pinecone

pc = pinecone.Pinecone()

pc.list_indexes()

{'indexes': [{'dimension': 1536,
              'host': 'langchain-37o83s8.svc.gcp-starter.pinecone.io',
              'metric': 'cosine',
              'name': 'langchain',
              'spec': {'pod': {'environment': 'gcp-starter',
                               'pod_type': 'starter',
                               'pods': 1,
                               'replicas': 1,
                               'shards': 1}},
              'status': {'ready': True, 'state': 'Ready'}}]}

In [24]:
# Deleting all indexes (caution: this removes every index in the Pinecone project)
indexes = pc.list_indexes().names()
for i in indexes:
    print(f'Deleting index {i} ... ', end='')
    pc.delete_index(i)
    print('Done')

In [26]:
# creating an index
from pinecone import PodSpec
index_name = 'stevejobs-speech'
if index_name not in pc.list_indexes().names():
    print(f'Creating index {index_name}')
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric='cosine',
        spec=PodSpec(
            environment='gcp-starter'
        )
    )
    print('Index created! 😊')
else:
    print(f'Index {index_name} already exists!')

Creating index stevejobs-speech
Index created! 😊
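
A newly created index may take a moment before it accepts writes. Here is a minimal polling sketch using only the standard library. `wait_until_ready` is a hypothetical helper, and the Pinecone call in the trailing comment is an assumption about how readiness could be checked with the client used in this notebook.

```python
import time

def wait_until_ready(is_ready, timeout_s=60.0, poll_s=1.0):
    """Poll is_ready() until it returns True or timeout_s elapses.
    Returns True if ready, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)

# Assumed usage with the Pinecone client from this notebook:
# wait_until_ready(lambda: pc.describe_index(index_name).status['ready'])
```

Passing the readiness check in as a callable keeps the helper independent of any client library, so it can be exercised without network access.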


In [27]:
# Embed each chunk with the provided `OpenAIEmbeddings` instance, insert the vectors
# into the index, and return a new Pinecone vector store object.
vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)

In [33]:
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.00054,
 'namespaces': {'': {'vector_count': 54}},
 'total_vector_count': 54}

### Asking Questions (Similarity Search)

In [37]:
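# Note: this question is unrelated to the speech; similarity search still returns the k nearest chunks.
query = 'Where should we fight?'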
result = vector_store.similarity_search(query, k=3)
print(result)

[Document(page_content='No one wants to die. Even people who want to go to heaven dont want to die to get there. And yet death is the destination we all share. No one has ever escaped it. And that is as it should be, because Death is very likely the single best invention of Life. It is Lifes change agent. It clears out'), Document(page_content='Stay Hungry. Stay Foolish.\n\nThank you all very much.'), Document(page_content='kind you might find yourself hitchhiking on if you were so adventurous. Beneath it were the words: Stay Hungry. Stay Foolish. It was their farewell message as they signed off. Stay Hungry. Stay Foolish. And I have always wished that for myself. And now, as you graduate to begin anew, I wish that')]


In [38]:
for r in result:
    print(r.page_content)
    print('-' * 50)

No one wants to die. Even people who want to go to heaven dont want to die to get there. And yet death is the destination we all share. No one has ever escaped it. And that is as it should be, because Death is very likely the single best invention of Life. It is Lifes change agent. It clears out
--------------------------------------------------
Stay Hungry. Stay Foolish.

Thank you all very much.
--------------------------------------------------
kind you might find yourself hitchhiking on if you were so adventurous. Beneath it were the words: Stay Hungry. Stay Foolish. It was their farewell message as they signed off. Stay Hungry. Stay Foolish. And I have always wished that for myself. And now, as you graduate to begin anew, I wish that
--------------------------------------------------
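
Conceptually, a similarity search ranks stored vectors by their similarity to the query vector and returns the top `k`. The sketch below does this by brute force with cosine similarity, the metric configured on the index. It uses toy 3-dimensional vectors in place of real embeddings; `cosine_similarity` and `top_k` are hypothetical helpers, and Pinecone's own search implementation is not shown in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) for the k documents most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Toy 3-dimensional vectors standing in for 1536-dimensional embeddings
toy_docs = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.1, 0.0]

for idx, score in top_k(toy_query, toy_docs, k=3):
    print(idx, round(score, 3))
```

The ranking is relative: the function returns `k` results whatever their absolute scores, so inspecting the scores is one way to judge whether the matches are actually relevant.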


### Answering in Natural Language using an LLM

In [42]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# Initialize the LLM with the specified model and temperature
llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0.2)

# Use the provided vector store with similarity search and retrieve top 3 results
retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 3})

# Create a RetrievalQA chain using the defined LLM, chain type 'stuff', and retriever
chain = RetrievalQA.from_chain_type(llm=llm, chain_type='stuff', retriever=retriever)
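
The `'stuff'` chain type places all retrieved chunks into a single prompt alongside the question. Here is a rough sketch of that idea. The prompt wording is a hypothetical template written for illustration, not the template LangChain actually uses, and `build_stuff_prompt` is a made-up helper.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved texts into one context block and
    append the question, producing a single prompt string."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

retrieved = [
    "Stay Hungry. Stay Foolish.",
    "Thank you all very much.",
]
print(build_stuff_prompt("What was the farewell message?", retrieved))
```

Because everything goes into one prompt, this approach only works while the `k` retrieved chunks fit within the model's context window; with 300-character chunks and `k=3` that is not a concern here.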


In [45]:
query = 'What impact did dropping out of school have on Steve Jobs life and career?'
answer = chain.invoke(query)
print(answer)

{'query': 'What impact did dropping out of school have on Steve Jobs life and career?', 'result': "Dropping out of school had a significant impact on Steve Jobs' life and career. It allowed him to focus on his passion for technology and entrepreneurship, leading him to co-found Apple with Steve Wozniak. This decision ultimately shaped his future success and allowed him to pursue his creative and innovative ideas."}


In [46]:
query = 'What were the impacts of being fired on Steve Jobs?'
# chain.run is deprecated; invoke returns a dict with the answer under 'result'
answer = chain.invoke(query)
print(answer['result'])


The impact of being fired from Apple had a profound effect on Steve Jobs. Initially, he felt a sense of heaviness being successful, but being fired allowed him to experience the lightness of being a beginner again. It made him less sure about everything and freed him to enter one of the most creative periods of his life. This led to the creation of the Macintosh and marked a turning point for Jobs.


In [47]:
query = 'What about the French Armies??'
# chain.run is deprecated; invoke returns a dict with the answer under 'result'
answer = chain.invoke(query)
print(answer['result'])

I'm sorry, but I don't have enough information to answer your question about the French Armies.
