In [1]:
import os 
import openai
import warnings 

warnings.filterwarnings('ignore')

In [2]:
%reload_ext watermark
%watermark -a "Dhaval Antala" -vmp langchain,openai

Author: Dhaval Antala

Python implementation: CPython
Python version       : 3.10.0
IPython version      : 8.25.0

langchain: 0.2.5
openai   : 1.32.0

Compiler    : Clang 12.0.0 
OS          : Darwin
Release     : 23.5.0
Machine     : arm64
Processor   : arm
CPU cores   : 8
Architecture: 64bit



In [4]:
from getpass import getpass
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = getpass()
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")

In [5]:
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI

## 1. ❓ Question Answering Over Docs


In [6]:
llm = OpenAI()



In [7]:
llm("who is from Gujarat?")



'\n\nThere are many people from Gujarat, a state in western India. Some famous people from Gujarat include Mahatma Gandhi, Narendra Modi, and Amitabh Bachchan.'

In [8]:
context = """
Dhaval is from Gujarat.
Sudarshan is from UP.
Mikko is from AP.
Khoa is from Delhi.
"""

question = "who is from Gujarat?"

In [9]:
output = llm(context + question)
print(output)



Dhaval is from Gujarat.
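
Concatenating `context + question` works, but the model has to guess where the context ends and the question begins. Here is a minimal sketch of a more explicit prompt. `build_qa_prompt` is a hypothetical helper written for this note, not a LangChain function, and the wording of the instruction line is an assumption.

```python
def build_qa_prompt(context: str, question: str) -> str:
    """Return a prompt that separates the context from the question.

    Hypothetical helper for illustration; not a LangChain API.
    """
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context.strip()}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


context = """
Dhaval is from Gujarat.
Sudarshan is from UP.
"""
qa_prompt = build_qa_prompt(context, "who is from Gujarat?")
print(qa_prompt)
```

The resulting string can be passed to the LLM in the same way as `context + question`.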


## 2. 💬 Chatbots

In [10]:
from langchain import OpenAI, ConversationChain, LLMChain, PromptTemplate
from langchain.memory import ConversationBufferWindowMemory

In [11]:
template = """
You are an assistant trained by OpenAI.
Your goal is to provide help just with foods.
Don't provide answer other than food related topics. 
Just output "I don't know" if other topics are asked.

{history}
Human: {human_input}
Assistant:"""

In [12]:
prompt = PromptTemplate(
    input_variables=["history", "human_input"],  # the keyword is plural: input_variables
    template=template)
    

In [13]:
chatgpt_chain = LLMChain(
    llm=OpenAI(temperature=0),
    prompt=prompt,
    verbose = True,
    memory = ConversationBufferWindowMemory(memory_key="history")
)



In [14]:
output = chatgpt_chain.predict(
    human_input="What is Python?"
)
print(output)



> Entering new LLMChain chain...
Prompt after formatting:

You are an assistant trained by OpenAI.
Your goal is to provide help just with foods.
Don't provide answer other than food related topics. 
Just output "I don't know" if other topics are asked.


H: What is Python?
Assistant:

> Finished chain.
 I don't know.


In [15]:
output = chatgpt_chain.predict(human_input="Which fruit is better, apple or orange ?")
print(output)



> Entering new LLMChain chain...
Prompt after formatting:

You are an assistant trained by OpenAI.
Your goal is to provide help just with foods.
Don't provide answer other than food related topics. 
Just output "I don't know" if other topics are asked.

H: What is Python?
AI:  I don't know.
Human: Which fruit is better, apple or orange ?
Assistant:

> Finished chain.
 I don't know.


In [16]:
output = chatgpt_chain.predict(human_input="What about apple and samsung ?")
print(output)



> Entering new LLMChain chain...
Prompt after formatting:

You are an assistant trained by OpenAI.
Your goal is to provide help just with foods.
Don't provide answer other than food related topics. 
Just output "I don't know" if other topics are asked.

H: What is Python?
AI:  I don't know.
Human: Which fruit is better, apple or orange ?
AI:  I don't know.
Human: What about apple and samsung ?
Assistant:

> Finished chain.
 I don't know.


In [17]:
output = chatgpt_chain.predict(human_input="What is the first question I asked you ?")
print(output)



> Entering new LLMChain chain...
Prompt after formatting:

You are an assistant trained by OpenAI.
Your goal is to provide help just with foods.
Don't provide answer other than food related topics. 
Just output "I don't know" if other topics are asked.

H: What is Python?
AI:  I don't know.
Human: Which fruit is better, apple or orange ?
AI:  I don't know.
Human: What about apple and samsung ?
AI:  I don't know.
Human: What is the first question I asked you ?
Assistant:

> Finished chain.
 I don't know.
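
The transcripts above show `{history}` growing with every turn. `ConversationBufferWindowMemory` keeps only the last `k` exchanges, so the prompt cannot grow without bound. Here is a rough sketch of that idea in plain Python. `WindowMemory` is a toy stand-in written for this note, not LangChain's implementation, and `k=2` is chosen only to make the effect visible.

```python
from collections import deque


class WindowMemory:
    """Toy stand-in that remembers only the last k exchanges."""

    def __init__(self, k: int = 2) -> None:
        self.turns = deque(maxlen=k)  # older exchanges are dropped automatically

    def save(self, human: str, ai: str) -> None:
        self.turns.append((human, ai))

    def history(self) -> str:
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


memory = WindowMemory(k=2)
memory.save("What is Python?", "I don't know.")
memory.save("Which fruit is better, apple or orange ?", "I don't know.")
memory.save("What about apple and samsung ?", "I don't know.")
print(memory.history())
```

With a window of two, the first question has already been forgotten by the time the third exchange is saved.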


## 3. 📚 Querying Tabular Data

In [85]:
from langchain import SQLDatabase
from langchain_experimental.sql import SQLDatabaseChain, SQLDatabaseSequentialChain

In [86]:
mysql_uri = 'mysql+mysqlconnector://root:pass1234@localhost:3306/chinook'
db = SQLDatabase.from_uri(mysql_uri)
llm = OpenAI(temperature=0)

In [89]:
db_chain = SQLDatabaseChain.from_llm(llm=llm, db=db, verbose=True)

In [90]:
db_chain.run("How many employees are there?")



> Entering new SQLDatabaseChain chain...
How many employees are there?
SQLQuery:SELECT COUNT(*) FROM Employee
SQLResult: [(8,)]
Answer:There are 8 employees.
> Finished chain.


'There are 8 employees.'
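
The chain turned the question into `SELECT COUNT(*) FROM Employee`, ran it, and phrased the result as a sentence. Here is a small sketch of the middle step without an LLM or a MySQL server. It uses an in-memory SQLite database as a stand-in for Chinook; the table layout and the eight rows are made up for illustration.

```python
import sqlite3

# In-memory stand-in for the Chinook database; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, LastName TEXT)")
conn.executemany(
    "INSERT INTO Employee (LastName) VALUES (?)",
    [(f"Employee {i}",) for i in range(1, 9)],
)

# The query generated by the chain in the run above.
sql_query = "SELECT COUNT(*) FROM Employee"
sql_result = conn.execute(sql_query).fetchall()
conn.close()

# The chain asks the LLM to phrase the result; here it is formatted by hand.
answer = f"There are {sql_result[0][0]} employees."
print(sql_result)
print(answer)
```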

## 4. 🔌 Interacting with APIs

- APIs are powerful: when the data or the action you need sits behind an API, the LLM has to be able to call that API.
- Let's go through an example using [Open-Meteo](https://open-meteo.com/), a free weather API.
- Open-Meteo is an open-source weather API with free access for non-commercial use. No API key is required.

In [21]:
from langchain.chains.api.base import APIChain
from langchain.prompts.prompt import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains.api import open_meteo_docs

In [22]:
llm = OpenAI(temperature=0)

In [23]:
# chain_new = APIChain.from_llm_and_api_docs(llm, open_meteo_docs.OPEN_METEO_DOCS, verbose=True)
chain_new = APIChain.from_llm_and_api_docs(
    llm,
    open_meteo_docs.OPEN_METEO_DOCS,
    verbose=True,
    limit_to_domains=["https://api.open-meteo.com/"],
)

In [24]:
chain_new.run('What is the weather like right now in Mahhiem, Germany in degrees Celsius?')





> Entering new APIChain chain...
 https://api.open-meteo.com/v1/forecast?latitude=49.489591&longitude=8.461330&hourly=temperature_2m&current_weather=true&temperature_unit=celsius&timezone=auto
{"latitude":49.48,"longitude":8.459999,"generationtime_ms":0.10204315185546875,"utc_offset_seconds":7200,"timezone":"Europe/Berlin","timezone_abbreviation":"CEST","elevation":102.0,"current_weather_units":{"time":"iso8601","interval":"seconds","temperature":"°C","windspeed":"km/h","winddirection":"°","is_day":"","weathercode":"wmo code"},"current_weather":{"time":"2024-06-26T11:15","interval":900,"temperature":25.6,"windspeed":3.1,"winddirection":234,"is_day":1,"weathercode":0},"hourly_units":{"time":"iso8601","temperature_2m":"°C"},"hourly":{"time":["2024-06-26T00:00","2024-06-26T01:00","2024-06-26T02:00","2024-06-26T03:00","2024-06-26T04:00","2024-06-26T05:00","2024-06-26T06:00","2024-06-26T07:00","2024-06-26T08:00","2024-06-26T09:00","2024-06-26T10:00","

' The current temperature in Mahhiem, Germany is 25.6 degrees Celsius. This information was obtained from the API url: https://api.open-meteo.com/v1/forecast?latitude=49.489591&longitude=8.461330&hourly=temperature_2m&current_weather=true&temperature_unit=celsius&timezone=auto.'
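
The run above shows the two things the chain automated: building a request URL from the API docs, and reading the answer out of the JSON response. Here is a sketch of both steps done by hand, without any network call. `build_forecast_url` is a hypothetical helper, and `response_text` is a trimmed copy of the JSON shown above.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.open-meteo.com/v1/forecast"


def build_forecast_url(latitude: float, longitude: float) -> str:
    """Hypothetical helper that assembles the query string by hand."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "current_weather": "true",
        "temperature_unit": "celsius",
        "timezone": "auto",
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_forecast_url(49.489591, 8.46133)

# Trimmed copy of the JSON returned in the run above.
response_text = '{"current_weather": {"time": "2024-06-26T11:15", "temperature": 25.6}}'
temperature = json.loads(response_text)["current_weather"]["temperature"]
print(url)
print(f"{temperature} degrees Celsius")
```

The value of the chain is that the LLM picks the coordinates and parameters from a plain-language question; the hand-written version needs them spelled out.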

## 5. 📝 Summarization

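- Summarization condenses a long document into a shorter summary.
- LangChain offers several summarization chain types, such as `stuff`, `map_reduce`, and `refine`; this section uses `map_reduce`.
- There are many ways to work with PDFs; here a PDF is loaded with `PyPDFLoader`.
- The input can range from a couple of sentences to an entire book.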

In [25]:
%%capture 
!pip install tiktoken

In [26]:
# paragraph summarization
from langchain import OpenAI
llm = OpenAI(temperature=0)
     

In [27]:
prompt = """
Please provide a summary of the following text.
Provide answer in simple terms and max lenght of 30 words.

TEXT:
A common use case is wanting to summarize long documents. This naturally runs into \
the context window limitations. Unlike in question-answering, you can't just do some \
semantic search hacks to only select the chunks of text most relevant to the question \
(because, in this case, there is no particular question - you want to summarize everything). So what do you do then?

The most common way around this is to split the documents into chunks and then do \
summarization in a recursive manner. By this we mean you first summarize each chunk \
by itself, then you group the summaries into chunks and summarize each chunk of summaries, and continue doing that until only one is left.
"""

In [28]:
num_tokens = llm.get_num_tokens(prompt)
print(f"Our prompt has {num_tokens} tokens")

Our prompt has 159 tokens


In [29]:
summary = llm(prompt)

In [30]:
print(summary)


Summarizing long documents can be challenging due to context window limitations. To overcome this, the document is split into chunks and recursively summarized until only one remains.


In [31]:
%%capture 
!pip install pypdf

In [32]:
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader

In [33]:
loader = PyPDFLoader('../langchain/data/human-nutrition-text.pdf')
doc=loader.load_and_split()

In [34]:
chain = load_summarize_chain(llm, chain_type="map_reduce")
chain.run(doc)

'\n\nThis chapter introduces the Food Science and Human Nutrition Program at the University of Hawaii at Manoa and emphasizes the importance of a strong foundation in both the program and in life. It covers basic concepts in nutrition, the six classes of nutrients, and the role of macronutrients in providing energy and regulating bodily functions. Water is also considered a macronutrient, while micronutrients such as minerals and vitamins play important roles in the body.'
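
The recursive strategy described in the prompt text earlier (summarize each chunk, then summarize groups of summaries until one is left) can be sketched without an LLM. In this sketch `first_sentence` is a placeholder for the LLM call, `group_size=2` is arbitrary, and the sample chunks are made up. It illustrates the idea only and is not a copy of what `load_summarize_chain` does internally.

```python
from typing import Callable, List


def first_sentence(text: str) -> str:
    """Placeholder summarizer: keep the first sentence only."""
    return text.split(". ")[0].rstrip(".") + "."


def recursive_summary(
    chunks: List[str],
    summarize: Callable[[str], str] = first_sentence,
    group_size: int = 2,
) -> str:
    """Summarize each chunk, then summarize groups of summaries until one is left."""
    if not chunks:
        return ""
    summaries = [summarize(chunk) for chunk in chunks]
    while len(summaries) > 1:
        groups = [
            " ".join(summaries[i:i + group_size])
            for i in range(0, len(summaries), group_size)
        ]
        summaries = [summarize(group) for group in groups]
    return summaries[0]


# Made-up sample chunks standing in for the pages of a PDF.
chunks = [
    "Nutrients are grouped into six classes. Each class has a role.",
    "Macronutrients provide energy. They include carbohydrates.",
    "Micronutrients are vitamins and minerals. They are needed in small amounts.",
]
final_summary = recursive_summary(chunks)
print(final_summary)
```

Swapping `first_sentence` for a function that calls the LLM would turn the sketch into a working, if naive, recursive summarizer.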

## 6. 📤 Extraction
- Extraction pulls structured information (for example names or entities) out of free text.
- Extraction is related to [output parsers](https://python.langchain.com/v0.1/docs/modules/model_io/output_parsers/), which are responsible for instructing LLMs to respond in a specific format. 
- For a deep dive, LangChain recommends the [Kor](https://eyurtsev.github.io/kor/index.html) library, which builds on LangChain's chain and OutputParser abstractions and supports extraction of more complicated schemas. 

In [35]:
# To help construct our Chat Messages
from langchain.schema import HumanMessage
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate

# We will be using a chat model, defaults to gpt-3.5-turbo
from langchain.chat_models import ChatOpenAI

# To parse outputs and get structured data back
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

chat_model = ChatOpenAI(temperature=0)



In [36]:
instructions = """
Given a random sentence which contains animals name, extract animal names and assign an emoji to that and return just the animal name with emoji.
"""

animal_names = """
Dog, cat and rabbit are in the garden.
"""

In [37]:
prompt = (instructions + animal_names)
output = chat_model([HumanMessage(content=prompt)])
print(output.content)



🐶 Dog
🐱 Cat
🐰 Rabbit


Let's go through one example with Kor too.

- Kor is a thin wrapper on top of LLMs that helps to extract structured data using LLMs.

In [38]:
%%capture
!pip install kor

In [39]:
from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Number
from langchain.chat_models import ChatOpenAI

In [40]:
schema = Object(
    id="person",
    description="Personal information",
    examples=[
        ("Alice and Bob are friends", [{"first_name": "Alice"}, {"first_name": "Bob"}])
    ],
    attributes=[
        Text(
            id="first_name",
            description="The first name of a person.",
        )
    ],
    many=True,
)

In [41]:
# instantiate a langchain llm and create a chain
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    max_tokens=2000,
)
chain = create_extraction_chain(llm, schema)

In [42]:
# extract
chain.run(("My name is Bobby. My brother's name Joe."))["data"]

{'person': [{'first_name': 'Bobby'}, {'first_name': 'Joe'}]}

In [43]:
chain.run(("My name is Bobby. My brother's name Joe."))

{'data': {'person': [{'first_name': 'Bobby'}, {'first_name': 'Joe'}]},
 'raw': 'first_name\nBobby\nJoe',
 'errors': [],
 'validated_data': {}}

In [44]:
chain.run(("My name is Bobby. My brother's name Joe. My another brother's name is Stephen"))['data']

{'person': [{'first_name': 'Bobby'},
  {'first_name': 'Joe'},
  {'first_name': 'Stephen'}]}
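
The `raw` field in the full result above shows that the model answered in a small tabular format (`first_name\nBobby\nJoe`), which ends up as the `data` dictionary. Here is a minimal sketch of such a conversion with the standard `csv` module. It only illustrates the idea; it is not Kor's actual parser, and `parse_raw` is a hypothetical name.

```python
import csv
import io
from typing import Dict, List


def parse_raw(raw: str, object_id: str) -> Dict[str, List[Dict[str, str]]]:
    """Turn header-plus-rows text into {object_id: [row dicts]}."""
    rows = [dict(row) for row in csv.DictReader(io.StringIO(raw))]
    return {object_id: rows}


data = parse_raw("first_name\nBobby\nJoe", "person")
print(data)
```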

## 7. 🧐 Evaluation

- Chains and agents, as well as the applications built on top of LangChain, need to be evaluated. 
- The key difficulties are a lack of data and a lack of metrics. 
- How LangChain is tackling this:
  - For the lack of data, there is the [LangChainDatasets](https://huggingface.co/LangChainDatasets) collection on Hugging Face. 
  - For the lack of metrics, the simplest option is to use no metric at all and inspect the outputs by hand. The next step is [tracing](https://js.langchain.com/v0.1/docs/modules/agents/how_to/logging_and_tracing/), a UI-based visualizer of your chain and agent runs. 
  - Since the Tabular Data section already covered SQL querying, let's use the [SQL Question Answering Benchmarking: Chinook](https://python.langchain.com/v0.1/docs/use_cases/sql/large_db/) example.

**Loading the Data**

#### data with sql connection 

In [71]:
import mysql.connector

try:
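    conn = mysql.connector.connect(
        user='root',
        # Read the password from the environment instead of hard-coding it.
        # MYSQL_PASSWORD is an assumed variable name; set it before running.
        password=os.environ['MYSQL_PASSWORD'],
        host='localhost',
        database='Chinook'  # Optional, specify if connecting to a specific database
    )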
    print("Connected successfully!")
    conn.close()
except mysql.connector.Error as e:
    print(f"Error connecting to MySQL: {e}")

Connected successfully!


In [72]:
from langchain_community.utilities import SQLDatabase

In [91]:
# if you are using MySQL
mysql_uri = 'mysql+mysqlconnector://root:pass1234@localhost:3306/chinook'

db = SQLDatabase.from_uri(mysql_uri)

In [92]:
llm = OpenAI(temperature=0)

In [93]:
# sql database chain (built with from_llm, as in the Tabular Data section)
chain = SQLDatabaseChain.from_llm(llm=llm, db=db, input_key="question")

In [127]:
# %%capture
!pip install datasets



In [119]:
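# If load_dataset fails with "DatasetGenerationError: An error occurred while
# generating the dataset", delete the cache directory ~/.cache/huggingface/datasets
# and try again.

from datasets import Dataset
# Note: without parentheses this only displays the method. cleanup_cache_files()
# has to be called on a loaded Dataset instance to remove its cache files.
Dataset.cleanup_cache_files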

<function datasets.arrow_dataset.Dataset.cleanup_cache_files(self) -> int>

In [None]:
## Loading the data
from langchain.evaluation.loading import load_dataset
dataset = load_dataset("sql-qa-chinook")

In [None]:
dataset[0]

#### **Setting up a Chain**

In [123]:
from langchain import OpenAI, SQLDatabase
from langchain_experimental.sql.base import SQLDatabaseChain

In [124]:
# sql database chain (built with from_llm, as in the Tabular Data section)
chain = SQLDatabaseChain.from_llm(llm=llm, db=db, input_key="question")

In [125]:
# doing just one prediction to check
chain(dataset[0])

NameError: name 'dataset' is not defined

In [None]:
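# bulk predictions
predictions = []
predicted_dataset = []
error_dataset = []
for data in dataset:
    try:
        predictions.append(chain(data))
        predicted_dataset.append(data)
    except Exception:
        # Keep the failing examples for inspection instead of aborting the run;
        # a bare `except:` would also swallow KeyboardInterrupt.
        error_dataset.append(data)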

#### **Evaluate the performance**

In [None]:
from langchain.evaluation.qa import QAEvalChain
llm = OpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(predicted_dataset, predictions, question_key="question", prediction_key="result")

In [None]:
# adding graded output to predictions dict
for i, prediction in enumerate(predictions):
    prediction['grade'] = graded_outputs[i]['text']

In [None]:
# now getting a count of the grades
from collections import Counter
Counter([pred['grade'] for pred in predictions])

In [None]:
# filter datapoints to the incorrect examples
incorrect = [pred for pred in predictions if pred['grade'].strip() == "INCORRECT"]
incorrect[0]
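
The evaluation cells above have no recorded output, so here is a small self-contained sketch of the tallying and filtering steps. The three graded predictions are made up for illustration; real ones come from `QAEvalChain`. The leading space in the grade strings is an assumption about how the LLM formats its answer, which is why the sketch strips whitespace before comparing.

```python
from collections import Counter

# Hypothetical graded predictions; real ones come from QAEvalChain.
predictions = [
    {"question": "How many employees are there?", "result": "8", "grade": " CORRECT"},
    {"question": "How many tracks are there?", "result": "10", "grade": " INCORRECT"},
    {"question": "How many genres are there?", "result": "25", "grade": " CORRECT"},
]

# Normalize whitespace so " CORRECT" and "CORRECT" count as the same grade.
grade_counts = Counter(pred["grade"].strip() for pred in predictions)
incorrect = [pred for pred in predictions if pred["grade"].strip() == "INCORRECT"]
accuracy = grade_counts["CORRECT"] / len(predictions)

print(grade_counts)
print(f"accuracy: {accuracy:.2f}")
```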

## 8. 🤔💻 Code Understanding

- LLMs are good at understanding code. You may already use them to generate code from a natural-language request, for example in ChatGPT and similar chatbots, or through [GitHub Copilot](https://github.com/features/copilot).
- LangChain can load the files of a GitHub repository and index them so that an LLM can answer questions about the code.
- Let's use [pandas-ai](https://github.com/gventuri/pandas-ai) as the example repository.

In [157]:
!git clone https://github.com/gventuri/pandas-ai.git

Cloning into 'pandas-ai'...
remote: Enumerating objects: 13049, done.
remote: Counting objects: 100% (2934/2934), done.
remote: Compressing objects: 100% (1340/1340), done.
remote: Total 13049 (delta 1574), reused 2665 (delta 1458), pack-reused 10115
Receiving objects: 100% (13049/13049), 37.63 MiB | 2.87 MiB/s, done.
Resolving deltas: 100% (8790/8790), done.


In [142]:
import os
from langchain.vectorstores import Chroma, FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader

llm = ChatOpenAI()
     

In [143]:
embeddings = OpenAIEmbeddings(disallowed_special=())

In [158]:
import os
from langchain.document_loaders import TextLoader

root_dir = '/Users/dhavalantala/Desktop/langchain/langchain/pandas-ai'
docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception:
            # Skip files that cannot be read as UTF-8 text (images, binaries, ...)
            pass
     

In [159]:
root_dir

'/Users/dhavalantala/Desktop/langchain/langchain/pandas-ai'

In [161]:
print(f"You have {len(docs)} documents\n")
print("------ Start Document ------")
print(docs[0])

You have 1611 documents

------ Start Document ------
page_content='# ignore-words.txt\nselectin' metadata={'source': '/Users/dhavalantala/Desktop/langchain/langchain/pandas-ai/ignore-words.txt'}


#### **Let's use Chroma for storing documents.**

In [162]:
%%capture
!pip install -U chromadb tiktoken

In [164]:
docsearch = Chroma.from_documents(docs[:100], embeddings)

In [165]:
from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever())

In [166]:
query = "What class should I import from pandasai to instantiate OpenAI llm?"
output = qa.run(query)
print(output)

To instantiate the OpenAI LLM from the PandasAI library, you should import the `OpenAIAgent` class.
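
The loading loop above tries every file in the repository and silently skips the ones that fail. An alternative is to filter by file extension before loading, so that only source and documentation files are indexed. Here is a sketch under assumptions: `iter_source_files` is a hypothetical helper, the extension set is a guess that should be adjusted per repository, and the demo runs on a throwaway directory rather than on the cloned repo.

```python
import os
import tempfile
from typing import Iterator

SOURCE_EXTENSIONS = {".py", ".md"}  # assumed set; adjust for the repository


def iter_source_files(root_dir: str) -> Iterator[str]:
    """Yield paths under root_dir whose extension looks like source or docs."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in SOURCE_EXTENSIONS:
                yield os.path.join(dirpath, name)


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("agent.py", "README.md", "logo.png"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write("placeholder")
    found = [os.path.basename(path) for path in iter_source_files(tmp)]

print(found)
```

Each yielded path could then be handed to `TextLoader` as in the loop above.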


## 9. 🤖 Agents
- Agents can be used for a variety of tasks, and the use cases keep evolving as LLMs advance.
- You can also create your own agent by following the LangChain documentation.
- Check out my [Auto-GPT with LangChain video](https://youtu.be/imDfPmMKEjM).
- Let's use the `ArXiv API Tool`.
- arXiv is an open-access archive of scholarly preprints: [https://arxiv.org/](https://arxiv.org/).

In [167]:
%%capture
!pip install arxiv

In [168]:
from langchain.chat_models import ChatOpenAI
from langchain.agents import load_tools, initialize_agent, AgentType

llm = ChatOpenAI(temperature=0.0)
tools = load_tools(
    ["arxiv"], 
)

agent_chain = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)



In [169]:
# https://arxiv.org/abs/1706.03762
agent_chain.run(
    "What's the paper 1706.03762 about?",
)



> Entering new AgentExecutor chain...
I should use arxiv to search for the paper with the identifier 1706.03762.
Action: arxiv
Action Input: 1706.03762
Observation: Published: 2023-08-02
Title: Attention Is All You Need
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Summary: The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks in an encoder-decoder configuration. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer, based
solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to be
superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WM

'The paper 1706.03762 is about the Transformer model, a network architecture based solely on attention mechanisms for sequence transduction tasks.'

In [170]:
from langchain.utilities import ArxivAPIWrapper

In [171]:
arxiv = ArxivAPIWrapper()
docs = arxiv.run("1706.03762")
docs

'Published: 2023-08-02\nTitle: Attention Is All You Need\nAuthors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin\nSummary: The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntranslation task, our model establis

In [172]:
docs = arxiv.run("Ashish Vaswani")
docs

'Published: 2016-09-28\nTitle: Unsupervised Neural Hidden Markov Models\nAuthors: Ke Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, Kevin Knight\nSummary: In this work, we present the first results for neuralizing an Unsupervised\nHidden Markov Model. We evaluate our approach on tag in- duction. Our approach\noutperforms existing generative models and is competitive with the\nstate-of-the-art though with a simpler model easily extended to include\nadditional context.\n\nPublished: 2018-04-12\nTitle: Self-Attention with Relative Position Representations\nAuthors: Peter Shaw, Jakob Uszkoreit, Ashish Vaswani\nSummary: Relying entirely on an attention mechanism, the Transformer introduced by\nVaswani et al. (2017) achieves state-of-the-art results for machine\ntranslation. In contrast to recurrent and convolutional neural networks, it\ndoes not explicitly model relative or absolute position information in its\nstructure. Instead, it requires adding representations of absolute positions 

In [173]:
# an invalid identifier does not raise an error; the wrapper returns a "not found" message
docs = arxiv.run("1605.08386WWW")
docs

'No good Arxiv Result was found'
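
Since the wrapper accepts both identifiers and free-text queries, it can help to check which one you have before calling it. Here is a sketch of a shape check for arXiv identifiers. The pattern covers the current `YYMM.NNNNN` scheme with an optional version suffix; older identifiers such as `hep-th/9901001` are deliberately not handled, and `looks_like_arxiv_id` is a hypothetical helper, not part of LangChain or the `arxiv` package.

```python
import re

# Matches identifiers such as 1706.03762 or 1706.03762v5.
ARXIV_ID_PATTERN = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")


def looks_like_arxiv_id(text: str) -> bool:
    """Return True if text has the shape of a post-2007 arXiv identifier."""
    return ARXIV_ID_PATTERN.fullmatch(text.strip()) is not None


for candidate in ("1706.03762", "1706.03762v5", "1605.08386WWW", "Ashish Vaswani"):
    print(candidate, looks_like_arxiv_id(candidate))
```

A well-formed identifier can still point to a paper that does not exist, so the "not found" message from the wrapper still needs to be handled.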

#### **Happy Learning 😎**