In [1]:
!pip3 install -r ./requirements.txt -q

In [2]:
!pip3 show langchain

Name: langchain
Version: 0.0.300
Summary: Building applications with LLMs through composability
Home-page: https://github.com/langchain-ai/langchain
Author: 
Author-email: 
License: MIT
Location: /opt/homebrew/lib/python3.11/site-packages
Requires: aiohttp, anyio, dataclasses-json, jsonpatch, langsmith, numexpr, numpy, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: 


In [3]:
!pip3 install langchain --upgrade -q

### Python-dotenv

In [8]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

# Test the .env was loaded correctly.
# os.environ.get('PINECONE_API_KEY')

True
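
A quick way to confirm the keys were loaded, without printing their values, is to check only for their presence. This is a minimal sketch: `missing_keys` is a hypothetical helper, and `OPENAI_API_KEY` is an assumed variable name (only `PINECONE_API_KEY` and `PINECONE_ENV` appear in this notebook).

```python
import os

def missing_keys(required, env=None):
    """Return the names in `required` that are absent or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Demonstrated on a stand-in dict so no real secrets are needed.
fake_env = {'PINECONE_API_KEY': 'abc123', 'PINECONE_ENV': ''}
required = ['OPENAI_API_KEY', 'PINECONE_API_KEY', 'PINECONE_ENV']
print(missing_keys(required, fake_env))
```

Calling `missing_keys(required)` with no second argument checks the real environment instead of the stand-in dict.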

### LLM Models (Wrappers): GPT-3

In [9]:
from langchain.llms import OpenAI
llm = OpenAI(model_name='text-davinci-003', temperature=0.7, max_tokens=512)
print(llm)

OpenAI
Params: {'model_name': 'text-davinci-003', 'temperature': 0.7, 'max_tokens': 512, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'request_timeout': None, 'logit_bias': {}}


In [10]:
output = llm("Explain quantim mechanics in one sentence")
print(output)



Quantum mechanics is a physical theory describing the behaviour of matter and energy at the atomic and subatomic levels.


In [12]:
print(llm.get_num_tokens("Explain quantim mechanics in one sentence"))

8


In [13]:
output = llm.generate(['... is the capital of France.', 'What is the formula for the area of a circle?'])

In [14]:
print(output.generations)

[[Generation(text='\n\nParis', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nA = πr² (A is the area of the circle, π is the constant 3.14, and r is the radius of the circle)', generation_info={'finish_reason': 'stop', 'logprobs': None})]]


In [17]:
print(output.generations[0][0].text)



Paris


In [18]:
len(output.generations)

2

In [19]:
output = llm.generate(['Write an original tagline for a burger restaurant'] * 3)

In [20]:
print(output)

generations=[[Generation(text='\n\n"Tastes So Good, You\'ll Flip Out!"', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\n"The burgers that will make your mouth water!"', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\n"Big, Juicy, Delicious - Our Burgers Can\'t Be Beat!"', generation_info={'finish_reason': 'stop', 'logprobs': None})]] llm_output={'token_usage': {'total_tokens': 71, 'completion_tokens': 44, 'prompt_tokens': 27}, 'model_name': 'text-davinci-003'} run=[RunInfo(run_id=UUID('5fb808fc-3a69-40fe-8349-abf081f0f2c1')), RunInfo(run_id=UUID('9231bd96-a4de-4e33-9595-da77ebf68474')), RunInfo(run_id=UUID('c21f8850-70fc-49e4-bb66-078fa990488f'))]


In [21]:
for o in output.generations:
    print(o[0].text)



"Tastes So Good, You'll Flip Out!"


"The burgers that will make your mouth water!"


"Big, Juicy, Delicious - Our Burgers Can't Be Beat!"


### ChatModels: GPT-3.5-Turbo and GPT-4

In [23]:
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)
from langchain.chat_models import ChatOpenAI

In [25]:
chat = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.5, max_tokens=1024)
messages = [
    SystemMessage(content='You are a physicist and respond only in German.'),
    HumanMessage(content='Explain quantum mechanics in one sentence')
]
output = chat(messages)

In [30]:
print(output.content)

Quantenmechanik beschreibt das Verhalten von Teilchen auf subatomarer Ebene und ermöglicht es uns, ihre Positionen und Geschwindigkeiten nicht genau vorherzusagen, sondern nur Wahrscheinlichkeiten anzugeben.


### Prompt templates

In [31]:
from langchain import PromptTemplate

In [32]:
template = """
    You are an experienced virologist. Write a few sentences about the following
    {virus} in {language}.
"""

prompt = PromptTemplate(
    input_variables = ['virus','language'],
    template=template
)
print(prompt)

input_variables=['virus', 'language'] output_parser=None partial_variables={} template='\n    You are an experienced virologist. Write a few sentences about the following\n    {virus} in {language}.\n' template_format='f-string' validate_template=True
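
The `template_format='f-string'` field in the output above indicates that the `{virus}` and `{language}` placeholders follow Python's format-string syntax. As a rough illustration only (plain `str.format`, not LangChain's own code path), the same template can be filled like this:

```python
template = """
    You are an experienced virologist. Write a few sentences about the following
    {virus} in {language}.
"""

# Plain str.format fills the same {placeholders}; here it stands in for prompt.format(...).
filled = template.format(virus='ebola', language='gaelic')
print(filled)
```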


In [34]:
from langchain.llms import OpenAI
llm = OpenAI(model_name='text-davinci-003', temperature=0.7)
output = llm(prompt.format(virus='ebola', language='gaelic'))
print(output)


Is é ebola an t-ainm atá ar an níos mó de na víreas a bhaineann le buaiclíneacht súl. Tá sé ina chúis do chliseadh éigeandála ar fud an domhain agus is féidir leis an víreas a bhaineann le buaiclíneacht súl críonna a bheith ina chúis le buaiteanna móra sábhála. Tá ábhartha éagsúla ann chun tocsainí a chosc ó ebola, lena n-áirítear úsáid a bhaint as cógaisíochtaí, ábhartha sláinte a choinneáil saor in aisce agus ábhartha eile a úsáid a bhaint as chun an t-ebola a chosc.


### Simple chains

In [36]:
from langchain.chat_models import ChatOpenAI
from langchain import PromptTemplate
from langchain.chains import LLMChain

llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.5)

template = """
    You are an experienced virologist. Write a few sentences about the following
    {virus} in {language}.
"""

prompt = PromptTemplate(
    input_variables = ['virus','language'],
    template=template
)

chain = LLMChain(llm=llm, prompt=prompt)
output = chain.run({'virus': 'COVID-19', 'language':'French'})


In [37]:
print(output)

COVID-19, également connu sous le nom de coronavirus, est une maladie respiratoire causée par le virus SARS-CoV-2. Il a été identifié pour la première fois à Wuhan, en Chine, en décembre 2019 et s'est depuis propagé dans le monde entier, devenant une pandémie. Les symptômes courants de la COVID-19 comprennent la fièvre, la toux, la fatigue et les difficultés respiratoires. Il est essentiel de suivre les mesures de prévention telles que le lavage régulier des mains, le port du masque et la distanciation sociale pour limiter la propagation du virus.


### Sequential chains
* Make a series of calls to one or more LLMs. Take the output from one chain and use it as the input to another chain.

Two types:
1. **SimpleSequentialChain** - a series of chains where each individual chain has a single input and a single output, and the output of one step is used as the input to the next.
2. General form of sequential chains
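
The single-input, single-output idea can be sketched in plain Python without any LLM. Everything here is hypothetical: `run_sequence`, `write_function`, and `describe_function` are stand-ins for the chains, not LangChain APIs.

```python
def run_sequence(steps, first_input):
    """Feed first_input to the first step, then pass each output to the next step."""
    value = first_input
    for step in steps:
        value = step(value)
    return value

# Stand-ins for LLM calls: each takes one string and returns one string.
def write_function(concept):
    return f'def {concept}(x): ...'

def describe_function(function_source):
    return f'Description of: {function_source}'

result = run_sequence([write_function, describe_function], 'softmax')
print(result)
```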

In [40]:
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain import PromptTemplate
from langchain.chains import LLMChain, SimpleSequentialChain

llm1 = OpenAI(model_name='text-davinci-003', temperature=0.7, max_tokens=1024)
prompt1 = PromptTemplate(
    input_variables=['concept'],
    template="""You are an experienced scientist and Python programmer.
    Write a function that implements the concept of {concept}."""
)
chain1 = LLMChain(llm=llm1, prompt=prompt1)


llm2 = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=1.2)
prompt2 = PromptTemplate(
    input_variables=['function'],
    template='Given the Python {function}, describe it as detailed as possible.'
)
chain2 = LLMChain(llm=llm2, prompt=prompt2)

overall_chain = SimpleSequentialChain(chains=[chain1, chain2], verbose=True)
output = overall_chain.run('softmax')



> Entering new SimpleSequentialChain chain...

def softmax(x):
    """Computes the softmax values for each value in x.

    Parameters
    ----------
    x : array-like
        Input values

    Returns
    -------
    array-like
        Softmax values
    """
    # Calculate the exponentials of each value in x
    exp_x = np.exp(x)
    
    # Calculate the sum of the exponentials
    sum_exp_x = np.sum(exp_x)
    
    # Calculate the softmax values by dividing each exponential value by the sum of the exponentials
    softmax_values = exp_x/sum_exp_x
    
    return softmax_values

The given Python code defines a function named "softmax" which computes the softmax values for each value in the input array "x". 

The function takes a single parameter "x". It is expected that "x" should be an array-like object containing the input values.

The first step of the function is to calculate the exponential of each value in the input array using the "np.e

### LangChain Agents
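* LLMs like ChatGPT are unreliable at exact arithmetic: they generate the most likely next token rather than computing a result, so their answers are approximations.
* On their own they also can't browse the internet, run database queries, etc.
* LangChain Agents can, by giving the LLM tools that it can call.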
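
The agent traces in this section show the LLM emitting `Action:` and `Action Input:` lines, which are then routed to a tool. The sketch below imitates that routing step only. It is not LangChain's implementation: the `TOOLS` registry, the `sqrt_factorial` tool name, and `run_action` are all hypothetical.

```python
import math

# Hypothetical tool registry: tool name -> function taking the action input string.
TOOLS = {
    'sqrt_factorial': lambda n: f'{math.sqrt(math.factorial(int(n))):.4f}',
}

def run_action(llm_text):
    """Parse 'Action:' / 'Action Input:' lines and call the named tool."""
    fields = {}
    for line in llm_text.splitlines():
        key, sep, value = line.partition(':')
        if sep:
            fields[key.strip()] = value.strip()
    return TOOLS[fields['Action']](fields['Action Input'])

observation = run_action('Action: sqrt_factorial\nAction Input: 20')
print(observation)
```

In a real agent the observation is fed back to the LLM, which then decides whether it knows the final answer or needs another action.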

In [41]:
from langchain.agents.agent_toolkits import create_python_agent
from langchain.tools.python.tool import PythonREPLTool
from langchain.llms import OpenAI

In [43]:
llm = OpenAI(temperature=0)
agent_executor = create_python_agent(
    llm=llm,
    tool=PythonREPLTool(),  # functions that agents can use to interact with the outside world
    verbose=True  # so we can see the intermediate text
)
agent_executor.run(
    'Calculate the square root of the factorial of 20 '
    'and display it with four decimal places.'
)




> Entering new AgentExecutor chain...

Python REPL can execute arbitrary code. Use with caution.

I need to calculate the factorial of 20 and then take the square root of that number.
Action: Python_REPL
Action Input: from math import factorial; print(round(factorial(20)**0.5, 4))
Observation: 1559776268.6285
Thought: I now know the final answer
Final Answer: 1559776268.6285

> Finished chain.


'1559776268.6285'

In [46]:
agent_executor.run('What is the answer to 5.1 ** 7.3? Format the output for readability with commas and two decimals.')



> Entering new AgentExecutor chain...
I need to use the Python REPL to calculate the answer
Action: Python_REPL
Action Input: print("{:,.2f}".format(5.1 ** 7.3))
Observation: 146,306.05
Thought: I now know the final answer
Final Answer: 146,306.05

> Finished chain.


'146,306.05'

### Embeddings
* Embeddings are at the core of many LLM applications.
* Text embeddings are numeric representations of text and are used in NLP and ML tasks.
* The *distance* between two embeddings or two vectors *measures their relatedness* which translates to the relatedness between the text concepts they represent.
* Similar embeddings or vectors represent similar concepts.
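
One common way to measure the relatedness of two vectors is cosine similarity (the same `cosine` metric used for the Pinecone index later in this notebook). A minimal sketch with made-up 3-dimensional vectors; real embeddings such as OpenAI's have 1,536 dimensions, and the values below are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy "embeddings": cat and kitten point in similar directions, car does not.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

print(cosine_similarity(cat, kitten))  # close to 1: related concepts
print(cosine_similarity(cat, car))     # close to 0: unrelated concepts
```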

**Embeddings Applications**
* *Text Classification*: assigning a label to a piece of text.
* *Text Clustering*: grouping together pieces of text that are similar in meaning.
* *Question Answering*: answering a question posed in natural language.

### Vector Databases

**AI Challenges**
* Efficient data processing (AI uses huge amounts of data)
* Many of the latest AI applications, such as chatbots, question-answering systems, and machine translation, rely on vector embeddings.

**Vector Databases**

A new type of database designed to store and query unstructured data.
* *Pinecone*: the one we are using in this project.
* *Chroma*
* *Milvus*
* *Qdrant*

**SQL vs. Vector Database**

* A SQL database stores structured rows and returns exact matches for a query; a vector database instead returns the items most similar to the query vector.
* Vector databases use a combination of different optimized algorithms that all participate in **Approximate Nearest Neighbor (ANN)** search.

Steps:
1. *Embedding*: create vector embeddings for the content we want to index.
2. *Indexing*: insert the vector embeddings into the vector database.
3. *Querying*: embed the query and search the index for the most similar vectors.
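
The three steps can be sketched with a tiny in-memory index. Note the assumptions: `TinyVectorIndex` is a hypothetical class, the 2-dimensional vectors are hand-made rather than produced by an embedding model, and the search is exact brute force, whereas a real vector database would use ANN algorithms to avoid comparing against every item.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class TinyVectorIndex:
    """Hypothetical in-memory index: stores (text, vector) pairs."""

    def __init__(self):
        self.items = []

    def upsert(self, text, vector):
        # Step 2 (indexing): store the vector alongside its text.
        self.items.append((text, vector))

    def query(self, vector, k=1):
        # Step 3 (querying): rank every stored item by similarity to the query.
        ranked = sorted(self.items,
                        key=lambda item: cosine_similarity(vector, item[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]

# Step 1 (embedding) is faked with hand-made vectors.
index = TinyVectorIndex()
index.upsert('fight on the beaches', [1.0, 0.0])
index.upsert('defending the air', [0.0, 1.0])
index.upsert('fight in the hills', [0.9, 0.1])

top_two = index.query([1.0, 0.05], k=2)
print(top_two)
```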

### Splitting and Embedding Text Using LangChain

In [97]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True

In [98]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
with open('./churchill_speech.txt') as f:
    churchill_speech = f.read()


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,  # overlap helps maintain continuity between adjacent chunks
    length_function=len
)

In [99]:
chunks = text_splitter.create_documents([churchill_speech])
print(chunks[1].page_content)
print(f'Now you have {len(chunks)} chunks.')

The position of the B. E.F had now become critical As a result of a most skillfully conducted
Now you have 281 chunks.
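
To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one tries to split on separators such as paragraphs and spaces first); `sliding_chunks` is a hypothetical helper that only illustrates the overlap.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

demo_chunks = sliding_chunks('abcdefghijklmnopqrstuvwxyz', chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the continuity that the overlap provides.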


### Embedding cost

In [100]:
def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    total_tokens = sum([len(enc.encode(page.page_content)) for page in texts])
    print(f'Total Tokens: {total_tokens}')
    print(f'Embedding cost in USD: ${total_tokens / 1000 * 0.0004:.6f}')

print_embedding_cost(chunks)

Total Tokens: 5629
Embedding cost in USD: $0.002252
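
The cost figure is plain arithmetic: tokens divided by 1,000, times the price per 1,000 tokens. The sketch below reproduces it using the $0.0004 rate hard-coded in the cell above; that rate is taken from this notebook and may not match current pricing, and `embedding_cost_usd` is a hypothetical helper.

```python
def embedding_cost_usd(total_tokens, usd_per_1k_tokens=0.0004):
    """Cost of embedding total_tokens at a given price per 1,000 tokens."""
    return total_tokens / 1000 * usd_per_1k_tokens

# 5629 tokens -> 5.629 thousand tokens -> 5.629 * 0.0004 USD
cost = embedding_cost_usd(5629)
print(f'${cost:.6f}')
```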


In [101]:
from langchain.embeddings import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

In [102]:
vector = embedding.embed_query(chunks[0].page_content)
vector[0:10]

[-0.02832016386419352,
 -0.0036371849028618967,
 -0.03352817192461715,
 -0.03601557809223016,
 -0.012896945871190453,
 -0.0003635565274675487,
 0.006833891474988558,
 -0.004553768378574473,
 -0.013305035596318493,
 -0.0009651656856314292]

### Inserting the Embeddings into a Pinecone Index

In [110]:
import os
import pinecone
from langchain.vectorstores import Pinecone

pinecone.init(api_key=os.environ.get('PINECONE_API_KEY'), environment=os.environ.get('PINECONE_ENV'))

In [111]:
# Create a pinecone index (or use an existing)
# Free version is limited to 1 index, so we need to delete any before making a new one.
indexes = pinecone.list_indexes()
for i in indexes:
    print("Deleting all Pinecone indices.", end="")
    pinecone.delete_index(i)
    print("Done")

Deleting all Pinecone indices.Done


In [112]:
index_name = 'churchill-speech'
if index_name not in pinecone.list_indexes():
    print(f"Creating index {index_name} ...")
    pinecone.create_index(index_name, dimension=1536, metric='cosine')  # must match the embedding size: OpenAI's text-embedding-ada-002 returns 1,536-dimensional vectors
    print('Done.')

Creating index churchill-speech ...
Done.


In [113]:
vector_store = Pinecone.from_documents(chunks, embedding, index_name=index_name)

### Asking Questions (Similarity Search)

In [114]:
query = 'Where should we fight?'
result = vector_store.similarity_search(query)
print(result)

[Document(page_content='on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the', metadata={}), Document(page_content='fields and in the streets, we shall fight in the hills; we shall never surrender, and even if,', metadata={}), Document(page_content='When we consider how much greater would be our advantage in defending the air above this Island', metadata={}), Document(page_content='front, now on that, fighting on three fronts at once, battles fought by two or three divisions', metadata={})]


In [115]:
for r in result:
    print(r.page_content)
    print('-' * 50)

on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the
--------------------------------------------------
fields and in the streets, we shall fight in the hills; we shall never surrender, and even if,
--------------------------------------------------
When we consider how much greater would be our advantage in defending the air above this Island
--------------------------------------------------
front, now on that, fighting on three fronts at once, battles fought by two or three divisions
--------------------------------------------------


In [116]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=1)

retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 3})  # the 3 most similar chunks

chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)  # "stuff" is the default: all retrieved chunks are placed into a single prompt
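
Conceptually, the "stuff" approach concatenates the retrieved chunks and sends them to the LLM together with the question. The sketch below shows that idea only: `build_stuff_prompt` and its prompt wording are assumptions, not the template LangChain actually uses.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Join all retrieved chunks into one context block and append the question."""
    context = '\n\n'.join(retrieved_chunks)
    return (
        'Use the following context to answer the question.\n\n'
        f'{context}\n\n'
        f'Question: {question}\n'
        'Answer:'
    )

# Chunk texts copied from the similarity search output earlier in this section.
retrieved = [
    'on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the',
    'fields and in the streets, we shall fight in the hills; we shall never surrender, and even if,',
    'When we consider how much greater would be our advantage in defending the air above this Island',
]
stuffed_prompt = build_stuff_prompt('Where should we fight?', retrieved)
print(stuffed_prompt)
```

Because everything goes into one prompt, this approach only works while the combined chunks fit in the model's context window.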

In [117]:
# query = "Where should we fight?"
query = 'Who was the king of Belgium at the time?'
answer = chain.run(query)
print(answer)  # I guess I didn't get the correct speech

The king of Belgium at the time was not mentioned in the provided context.


In [119]:
query = "Where should we fight?"
chain.run(query)

'We should fight on the beaches, the landing grounds, in the fields, streets, and hills.'

In [120]:
query = "What about the French Armies?"
answer = chain.run(query)
print(answer)

The French Armies played a significant role in the context mentioned. They were part of the conflict with the British Armies and were involved in the efforts to reopen communications to Amiens. Additionally, a French Army was planned to advance across the Somme in order to secure strategic positions.


In [121]:
query = "What about the French Armies?"
answer = chain.run(query)
print(answer)  # The answers vary as well.

The French Armies were a significant force that played a crucial role in the military operations. They were involved in the fighting against the British Armies and their main objective was to reopen communications to Amiens. Additionally, a French Army was created specifically to advance across the Somme and seize control of important territories.


In [122]:
# Let's see if we can read the player's handbook for D&D 5e.
!pip3 install pypdf

Collecting pypdf
  Obtaining dependency information for pypdf from https://files.pythonhosted.org/packages/bf/53/8840f93c5dcd108c02cac7343e194f9dc5d15ade6200ccc661ab4e1352b5/pypdf-3.16.2-py3-none-any.whl.metadata
  Downloading pypdf-3.16.2-py3-none-any.whl.metadata (7.3 kB)
Downloading pypdf-3.16.2-py3-none-any.whl (276 kB)
Installing collected packages: pypdf
Successfully installed pypdf-3.16.2


In [124]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("./fifth_player_handbook.pdf")
pages = loader.load_and_split()

In [127]:
pages[120]

Document(page_content='Rogues devote as much effort to mastering the use of a variety of skills as they do to perfecting their com bat abilities, giving them a broad expertise that few other characters can match. Many rogues focus on stealth and deception, while others refine the skills that help them in a dungeon environment, such as climbing, finding and disarming traps, and opening locks.When i t  comes to combat, rogues prioritize cunning over brute strength. A  rogue w ould rather make one precise strike, placing it exactly where the attack will hurt the target most, than w ear an opponent down with a barrage of attacks. R ogues have an almost supernatural knack for avoiding danger, and a few learn magical tricks to supplement their other abilities.A Shady Liv in gEvery town and city has its share of rogues. M ost of them live up to the worst stereotypes of the class, making a living as burglars, assassins, cutpurses, and con artists. Often, these scoundrels are organized into thi