## Data Loaders

In [66]:
from langchain.document_loaders import CSVLoader

In [72]:
loader = CSVLoader("langchain_with_python/some_data/penguins.csv")

In [73]:
data = loader.load()

In [79]:
data[0]

Document(page_content='species: Adelie\nisland: Torgersen\nbill_length_mm: 39.1\nbill_depth_mm: 18.7\nflipper_length_mm: 181\nbody_mass_g: 3750\nsex: MALE', metadata={'source': 'langchain_with_python/some_data/penguins.csv', 'row': 0})

In [77]:
type(data[0])

langchain_core.documents.base.Document

In [80]:
print(data[0].page_content)

species: Adelie
island: Torgersen
bill_length_mm: 39.1
bill_depth_mm: 18.7
flipper_length_mm: 181
body_mass_g: 3750
sex: MALE


In [82]:
print(data[2].metadata)

{'source': 'langchain_with_python/some_data/penguins.csv', 'row': 2}
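
The outputs above suggest a simple pattern: one document per CSV row, each column rendered as a `column: value` line, and the file path plus row index kept as metadata. The sketch below reproduces that shape with the standard library only. It illustrates the output format and is not `CSVLoader`'s actual implementation; `rows_to_documents` and the in-memory sample are hypothetical.

```python
import csv
import io


def rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with 'page_content' and 'metadata' keys."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        # One "column: value" line per field, in header order.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


# Hypothetical two-row sample that borrows three of the penguins columns.
sample_csv = (
    "species,island,bill_length_mm\n"
    "Adelie,Torgersen,39.1\n"
    "Adelie,Torgersen,39.5\n"
)
sample_docs = rows_to_documents(sample_csv, source="penguins.csv")
print(sample_docs[0]["page_content"])
print(sample_docs[1]["metadata"])
```

Keeping the row index in the metadata makes it possible to trace a retrieved document back to the exact CSV row it came from.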


In [83]:
!pip install beautifulsoup4



In [1]:
from langchain.document_loaders import BSHTMLLoader

In [2]:
loader = BSHTMLLoader("langchain_with_python/some_data/some_website.html")

In [3]:
hdata = loader.load()
print(hdata)

[Document(page_content='Heading 1', metadata={'source': 'langchain_with_python/some_data/some_website.html', 'title': ''})]


In [4]:
print(hdata[0].page_content)

Heading 1


### PDF Parsers

In [2]:
from langchain.document_loaders import PyPDFLoader

In [3]:
pdf_loader = PyPDFLoader("langchain_with_python/some_data/SomeReport.pdf")

In [7]:
pages = pdf_loader.load()

In [8]:
print(pages)

[Document(page_content='This\nis\nthe\nfirst\nline\nPDF.\nThis\nis\nthe\nsecond\nline\nin\nthe\nPDF.\nThis\nis\nthe\nthird\nline\nin\nthe\nPDF.', metadata={'source': 'langchain_with_python/some_data/SomeReport.pdf', 'page': 0})]


In [9]:
print(pages[0].page_content)

This
is
the
first
line
PDF.
This
is
the
second
line
in
the
PDF.
This
is
the
third
line
in
the
PDF.
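
In the output above, the extracted PDF text has every word on its own line. If the original line breaks do not need to be preserved (an assumption), one possible cleanup before further processing is to collapse all whitespace runs into single spaces. `collapse_whitespace` is a hypothetical helper, and the sample string is a shortened stand-in for `pages[0].page_content`.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace (including newlines) with one space."""
    return " ".join(text.split())


# Shortened stand-in for pages[0].page_content from the cell above.
raw_page_text = "This\nis\nthe\nsecond\nline\nin\nthe\nPDF."
clean_page_text = collapse_whitespace(raw_page_text)
print(clean_page_text)
```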


In [10]:
from langchain.document_loaders import HNLoader

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [11]:
hn_loader = HNLoader("https://news.ycombinator.com/item?id=36697119")

In [12]:
html_pages = hn_loader.load()

In [13]:
html_pages

[Document(page_content='endisneigh 11 months ago  \n             | next [–] \n\nI wish the folks who clearly do not like Google would just not use their products instead of spamming every thread about how they will kill the product, true or not.——Anyway,It’s not clear which model they’re using for this. I assume whatever Bard is using, but who knows. This is relevant because depending on the intended experience the latency will matter.Overall it’s not a bad idea, but I do wonder what the monetization path will be for Google. I imagine this will be part of workspace. Perhaps they will add more tiers to include these offerings.I wish they shared a bit about how this will be differentiated from Bard. Is this simply a new front end to Bard? It’s really an open question. I haven’t seen many products that use LLMs that are better than the prompt response UX.The most interesting thing about this blog post is the “source grounding.” I’m curious if there’s actual engineering behind it, or is it

In [14]:
html_pages[0]

Document(page_content='endisneigh 11 months ago  \n             | next [–] \n\nI wish the folks who clearly do not like Google would just not use their products instead of spamming every thread about how they will kill the product, true or not.——Anyway,It’s not clear which model they’re using for this. I assume whatever Bard is using, but who knows. This is relevant because depending on the intended experience the latency will matter.Overall it’s not a bad idea, but I do wonder what the monetization path will be for Google. I imagine this will be part of workspace. Perhaps they will add more tiers to include these offerings.I wish they shared a bit about how this will be differentiated from Bard. Is this simply a new front end to Bard? It’s really an open question. I haven’t seen many products that use LLMs that are better than the prompt response UX.The most interesting thing about this blog post is the “source grounding.” I’m curious if there’s actual engineering behind it, or is it 

In [15]:
html_pages[0].page_content

'endisneigh 11 months ago  \n             | next [–] \n\nI wish the folks who clearly do not like Google would just not use their products instead of spamming every thread about how they will kill the product, true or not.——Anyway,It’s not clear which model they’re using for this. I assume whatever Bard is using, but who knows. This is relevant because depending on the intended experience the latency will matter.Overall it’s not a bad idea, but I do wonder what the monetization path will be for Google. I imagine this will be part of workspace. Perhaps they will add more tiers to include these offerings.I wish they shared a bit about how this will be differentiated from Bard. Is this simply a new front end to Bard? It’s really an open question. I haven’t seen many products that use LLMs that are better than the prompt response UX.The most interesting thing about this blog post is the “source grounding.” I’m curious if there’s actual engineering behind it, or is it prompt tweaking contex

In [16]:
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate
from langchain_openai import AzureChatOpenAI
from dotenv import load_dotenv

In [18]:
import os
load_dotenv()
deployment_name = os.environ["DEPLOYMENT_NAME"]

In [19]:
chat = AzureChatOpenAI(deployment_name=deployment_name)

In [20]:
human_prompt = HumanMessagePromptTemplate.from_template("Please give me a short summary of the following Hacker News comment:\n{comment}")

In [21]:
chat_prompt = ChatPromptTemplate.from_messages([human_prompt])

In [22]:
result = chat.invoke(chat_prompt.format_prompt(comment=html_pages[0].page_content).to_messages())

In [23]:
result.content

'The commenter, endisneigh, expresses frustration with people who criticize Google and its tendency to discontinue products, suggesting that those who dislike Google should simply not use its products. They mention uncertainty about the technical model Google is using for a new product, speculating that it might be based on Bard, but noting the importance of latency in user experience. They find the idea of the product good but are curious about how Google plans to monetize it, suggesting it might be integrated into Google Workspace with additional paid tiers.\n\nThe commenter also wonders how the new product will differ from Bard, questioning whether it\'s just a new interface for the same underlying technology. They find the concept of "source grounding" mentioned in a blog post intriguing and question whether it involves significant engineering or is just a sophisticated way of tweaking prompts based on document context.'

In [24]:
result2 = chat.invoke(chat_prompt.format_prompt(comment=html_pages[2].page_content).to_messages())
print("comment->", html_pages[2].page_content)
print(result2.content)

The user lolinder acknowledges that there are many simplistic comments criticizing Google's self-inflicted harm to its reputation. While they concur with the sentiment and are willing to support a well-articulated critique, they find that the numerous brief and snarky remarks contribute no substantial value to the discussion.


In [36]:
result3 = chat.invoke(chat_prompt.format_prompt(comment=html_pages[66].page_content).to_messages())
print("comment->", html_pages[66].page_content)
print(result3.content)

comment-> wodenokoto 11 months ago  
             | prev | next [–] 

Doesn’t this take away the important part of doing notes? The writing?A good AI should foster a discussion with the student, not write notes.
The commenter, wodenokoto, suggests that the act of writing notes is an essential part of the note-taking process. They believe that a good AI tool for students should encourage and facilitate a discussion rather than simply taking notes on the student's behalf.


In [35]:
print("comment->", html_pages[66].page_content)

comment-> wodenokoto 11 months ago  
             | prev | next [–] 

Doesn’t this take away the important part of doing notes? The writing?A good AI should foster a discussion with the student, not write notes.


## Data Transformers

In [37]:
from langchain.text_splitter import CharacterTextSplitter

In [38]:
with open('langchain_with_python/some_data/FDR_State_of_Union_1944.txt') as file:
    speech_text = file.read()

In [39]:
len(speech_text)

21927

In [40]:
len(speech_text.split())

3750

In [41]:
text_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=1000)

In [42]:
splitted_data = text_splitter.create_documents([speech_text])

In [44]:
print(splitted_data[0].page_content)

This Nation in the past two years has become an active partner in the world's greatest war against human slavery.

We have joined with like-minded people in order to defend ourselves in a world that has been gravely threatened with gangster rule.

But I do not think that any of us Americans can be content with mere survival. Sacrifices that we and our allies are making impose upon us all a sacred obligation to see to it that out of this war we and our children will gain something better than mere survival.

We are united in determination that this war shall not be followed by another interim which leads to new disaster- that we shall not repeat the tragic errors of ostrich isolationism—that we shall not repeat the excesses of the wild twenties when this Nation went for a joy ride on a roller coaster which ended in a tragic crash.
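
The first chunk above contains several whole paragraphs, which is what a separator-based splitter produces: it cuts only at the separator and then merges neighbouring pieces until the size limit would be exceeded. The sketch below shows that idea in plain Python. It is a simplified illustration that ignores chunk overlap, not LangChain's implementation, and `split_on_separator` is a hypothetical name.

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole, because a purely
    separator-based splitter has nowhere else to cut.
    """
    pieces = [piece for piece in text.split(separator) if piece.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it is an oversized piece starting a new chunk.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample_text = "aaaa\n\nbbbb\n\ncccccccccccc\n\ndd"
sample_chunks = split_on_separator(sample_text, chunk_size=10)
print(sample_chunks)
```

In this toy run the third piece is 12 characters long, so it ends up as a chunk larger than the requested size of 10. The same mechanism explains why a separator-based splitter can emit chunks above the configured limit.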


In [53]:
token_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)

In [54]:
splitted_tokens = token_splitter.split_text(speech_text)

In [55]:
print(splitted_tokens[0])

This Nation in the past two years has become an active partner in the world's greatest war against human slavery.

We have joined with like-minded people in order to defend ourselves in a world that has been gravely threatened with gangster rule.

But I do not think that any of us Americans can be content with mere survival. Sacrifices that we and our allies are making impose upon us all a sacred obligation to see to it that out of this war we and our children will gain something better than mere survival.

We are united in determination that this war shall not be followed by another interim which leads to new disaster- that we shall not repeat the tragic errors of ostrich isolationism—that we shall not repeat the excesses of the wild twenties when this Nation went for a joy ride on a roller coaster which ended in a tragic crash.

When Mr. Hull went to Moscow in October, and when I went to Cairo and Teheran in November, we knew that we were in agreement with our allies in our common de

In [56]:
len(splitted_tokens)

15

## Text Embeddings

In [1]:
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
from langchain.embeddings import OpenAIEmbeddings

In [3]:
embeddings = OpenAIEmbeddings()

In [4]:
text = "this is some normal text string that I want to embed as a vector."

In [5]:
embedded_text = embeddings.embed_query(text)

In [13]:
len(embedded_text)

1536

In [7]:
from langchain.document_loaders import CSVLoader
csv_loader = CSVLoader("langchain_with_python/some_data/penguins.csv")
data = csv_loader.load()

In [8]:
page_contents = [item.page_content for item in data]

In [9]:
embedded_docs = embeddings.embed_documents(page_contents)

In [11]:
len(embedded_docs), len(page_contents)

(344, 344)
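
Each embedding is a list of floats, so two texts can be compared by comparing their vectors. Cosine similarity is one common measure; whether a particular vector store uses it or another distance depends on its configuration. The vectors below are tiny made-up stand-ins for real 1536-dimensional embeddings.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors: close_vector points the same way as query_vector,
# unrelated_vector is orthogonal to it.
query_vector = [1.0, 0.0, 1.0]
close_vector = [2.0, 0.0, 2.0]
unrelated_vector = [0.0, 3.0, 0.0]
print(cosine_similarity(query_vector, close_vector))
print(cosine_similarity(query_vector, unrelated_vector))
```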

In [12]:
!pip install chromadb
import chromadb



In [14]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

In [15]:
# Pipeline overview:
# 1. Load the document and split it into chunks.
# 2. Embed each chunk to get one vector per chunk.
# 3. Save the vectors in ChromaDB.
# 4. Embed the query and run a similarity search against ChromaDB.
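
The load, split, embed, store, and query steps can be sketched end to end without any external service. Everything here is a toy stand-in: `toy_embed` uses word counts instead of a real embedding model, `ToyVectorStore` is a plain Python list instead of ChromaDB, and the three-paragraph document is made up.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def toy_cosine(counts_a, counts_b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory list of (chunk, vector) pairs with brute-force search."""

    def __init__(self):
        self.entries = []

    def add_texts(self, chunks):
        for chunk in chunks:
            self.entries.append((chunk, toy_embed(chunk)))

    def similarity_search(self, query, k=1):
        query_vector = toy_embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: toy_cosine(query_vector, entry[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:k]]


# 1. Load and split (a made-up three-paragraph document).
document = "Taxes should be fair.\n\nFood prices need a floor.\n\nThe war must be won."
chunks = document.split("\n\n")
# 2-3. Embed each chunk and store the vectors.
store = ToyVectorStore()
store.add_texts(chunks)
# 4. Embed the query and search.
top_chunks = store.similarity_search("cost of food", k=1)
print(top_chunks)
```

A real embedding model captures meaning rather than exact word overlap, so it can match a query to a chunk even when they share no words. The word-count stand-in cannot do that.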

In [17]:
txt_loader = TextLoader('langchain_with_python/some_data/FDR_State_of_Union_1944.txt')
documents = txt_loader.load()
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)
docs = text_splitter.split_documents(documents)

In [18]:
len(docs)

15

In [19]:
docs[0]

Document(page_content='This Nation in the past two years has become an active partner in the world\'s greatest war against human slavery.\n\nWe have joined with like-minded people in order to defend ourselves in a world that has been gravely threatened with gangster rule.\n\nBut I do not think that any of us Americans can be content with mere survival. Sacrifices that we and our allies are making impose upon us all a sacred obligation to see to it that out of this war we and our children will gain something better than mere survival.\n\nWe are united in determination that this war shall not be followed by another interim which leads to new disaster- that we shall not repeat the tragic errors of ostrich isolationism—that we shall not repeat the excesses of the wild twenties when this Nation went for a joy ride on a roller coaster which ended in a tragic crash.\n\nWhen Mr. Hull went to Moscow in October, and when I went to Cairo and Teheran in November, we knew that we were in agreement 

In [20]:
page_contents = [item.page_content for item in docs]

In [21]:
embedding_function = OpenAIEmbeddings()

In [22]:
db = Chroma.from_documents(docs, embedding_function, persist_directory="./speech_new.db")

In [23]:
db.persist()

In [24]:
db_new_connection = Chroma(persist_directory='./speech_new.db', embedding_function=embedding_function)

In [26]:
new_doc = "What did FDR say about the cost of food law?"
new_doc2 = "cost of food law, FDR"

In [27]:
similar_docs = db_new_connection.similarity_search(new_doc)

In [29]:
similar_docs[0]

Document(page_content='That is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.\n\nTherefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:\n\n(1) A realistic tax law—which will tax all unreasonable profits, both individual and corporate, and reduce the ultimate cost of the war to our sons and daughters. The tax bill now under consideration by the Congress does not begin to meet this test.\n\n(2) A continuation of the law for the renegotiation of war contracts—which will prevent exorbitant profits and assure fair prices to the Government. For two long years I have pleaded with the Congress to take undue profits out of war.\n\n(3) A cost of food law—which will enable the Government (a) to place a reasonable floor under the prices the farmer may

In [30]:
similar_docs[0].page_content

'That is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.\n\nTherefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:\n\n(1) A realistic tax law—which will tax all unreasonable profits, both individual and corporate, and reduce the ultimate cost of the war to our sons and daughters. The tax bill now under consideration by the Congress does not begin to meet this test.\n\n(2) A continuation of the law for the renegotiation of war contracts—which will prevent exorbitant profits and assure fair prices to the Government. For two long years I have pleaded with the Congress to take undue profits out of war.\n\n(3) A cost of food law—which will enable the Government (a) to place a reasonable floor under the prices the farmer may expect for his produc

In [38]:
text_loader2 = TextLoader('langchain_with_python/some_data/Lincoln_State_of_Union_1862.txt')
documents2 = text_loader2.load()
docs2 = text_splitter.split_documents(documents2)

Created a chunk of size 608, which is longer than the specified 500
Created a chunk of size 539, which is longer than the specified 500
Created a chunk of size 686, which is longer than the specified 500


In [39]:
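# Note: this reuses the persist_directory from the FDR speech, so the Lincoln
# chunks are added to the same store rather than to a fresh one.
db_new = Chroma.from_documents(docs2, embedding_function, persist_directory="./speech_new.db")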

In [40]:
searched_data = db_new.similarity_search("slavery")

In [41]:
searched_data[0]

Document(page_content='As to the second article, I think it would be impracticable to return to bondage the class of persons therein contemplated. Some of them, doubtless, in the property sense belong to loyal owners, and hence provision is made in this article for compensating such. The third article relates to the future of the freed people. It does not oblige, but merely authorizes Congress to aid in colonizing such as may consent. This ought not to be regarded as objectionable on the one hand or on the other, insomuch as it comes to nothing unless by the mutual consent of the people to be deported and the American voters, through their representatives in Congress.\n\nI can not make it better known than it already is that I strongly favor colonization; and yet I wish to say there is an objection urged against free colored persons remaining in the country which is largely imaginary, if not sometimes malicious.\n\nIt is insisted that their presence would injure and displace white labo

In [35]:
searched_data2 = db.similarity_search("slavery")

In [36]:
searched_data2[0]

Document(page_content='It is our duty now to begin to lay the plans and determine the strategy for the winning of a lasting peace and the establishment of an American standard of living higher than ever before known. We cannot be content, no matter how high that general standard of living may be, if some fraction of our people—whether it be one-third or one-fifth or one-tenth- is ill-fed, ill-clothed, ill housed, and insecure.\n\nThis Republic had its beginning, and grew to its present strength, under the protection of certain inalienable political rights—among them the right of free speech, free press, free worship, trial by jury, freedom from unreasonable searches and seizures. They were our rights to life and liberty.\n\nAs our Nation has grown in size and stature, however—as our industrial economy expanded—these political rights proved inadequate to assure us equality in the pursuit of happiness.\n\nWe have come to a clear realization of the fact that true individual freedom cannot

## Vector store - Retrievers

In [43]:
type(db_new)

langchain_community.vectorstores.chroma.Chroma

In [44]:
retriever = db_new.as_retriever()

In [45]:
type(retriever)

langchain_core.vectorstores.VectorStoreRetriever

In [48]:
results = retriever.invoke('cost of food law')
print(results)

[Document(page_content='That is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.\n\nTherefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:\n\n(1) A realistic tax law—which will tax all unreasonable profits, both individual and corporate, and reduce the ultimate cost of the war to our sons and daughters. The tax bill now under consideration by the Congress does not begin to meet this test.\n\n(2) A continuation of the law for the renegotiation of war contracts—which will prevent exorbitant profits and assure fair prices to the Government. For two long years I have pleaded with the Congress to take undue profits out of war.\n\n(3) A cost of food law—which will enable the Government (a) to place a reasonable floor under the prices the farmer ma

## MultiQuery Retriever

In [1]:
from langchain.document_loaders import WikipediaLoader
wiki_loader = WikipediaLoader(query="MKUltra")
documents = wiki_loader.load()





In [2]:
len(documents)

24

In [3]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)
docs = text_splitter.split_documents(documents)

Created a chunk of size 525, which is longer than the specified 500
Created a chunk of size 501, which is longer than the specified 500


In [4]:
len(docs)

50

In [6]:
from dotenv import load_dotenv
load_dotenv()

True

In [7]:
from langchain.embeddings import OpenAIEmbeddings
embedding_function = OpenAIEmbeddings()

In [8]:
from langchain.vectorstores import Chroma

In [11]:
db = Chroma.from_documents(docs, embedding_function, persist_directory="./langchain_with_python/mkultra.db")



In [12]:
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

In [13]:
question = "When was this declassified?"

In [14]:
chat = ChatOpenAI()

In [15]:
retriever_from_chat = MultiQueryRetriever.from_llm(retriever=db.as_retriever(), llm=chat)

In [23]:
import logging 
logging.basicConfig()
logging.getLogger('langchain.retrievers.multi_query').setLevel(logging.DEBUG)

In [24]:
unique_docs = retriever_from_chat.invoke(question)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What is the date of the declassification of this information?', '2. Can you provide the timeline for when this was declassified?', '3. At what point in time was this information made public?']


In [25]:
len(unique_docs)

3
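
The log line above shows the question being rephrased into three queries. The general idea is to run each variant through the base retriever and keep the unique union of the results. The sketch below shows that merging step with hypothetical stand-ins: a fixed list of rephrasings replaces the LLM, and a dictionary lookup replaces the vector store retriever.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Run every generated query through `retrieve` and return the unique union.

    Order of first appearance is preserved; duplicates are dropped.
    """
    seen = set()
    unique_results = []
    for query in generate_queries(question):
        for document in retrieve(query):
            if document not in seen:
                seen.add(document)
                unique_results.append(document)
    return unique_results


# Hypothetical stand-in for the LLM that rephrases the question.
def fake_generate_queries(question):
    return [question, "declassification date", "made public when"]


# Hypothetical stand-in for the vector store: query -> document ids.
fake_index = {
    "When was this declassified?": ["doc_a", "doc_b"],
    "declassification date": ["doc_b", "doc_c"],
    "made public when": ["doc_a"],
}


def fake_retrieve(query):
    return fake_index.get(query, [])


merged = multi_query_retrieve(
    "When was this declassified?", fake_generate_queries, fake_retrieve
)
print(merged)
```

Because different phrasings land near different chunks, the merged list can cover documents that a single query would miss.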

In [26]:
print(unique_docs[0].page_content)

== Background ==
In 1974, a New York Times article was published that accused the CIA of illegal operations committed against US citizens. Authored by Seymour M. Hersh, it documented an intelligence operation against the anti-war movement, as well as "break-ins, wiretapping and the surreptitious inspection of mail" conducted since the 1950s. According to former CIA Official Cord Meyer, these disclosures "Convinced large sections of the American public that the CIA had become a domestic Gestapo and stimulated an overwhelming demand for the wide-ranging congressional investigations that were to follow."
Hersh had been tipped off to the possibility of an "in house operation" by an unidentified member of the CIA in spring of 1974. He embarked on an investigation, speaking to sources that included CIA Chief of Counterintelligence James Angleton. Although he was not aware of its existence, Hersh uncovered much information that had been documented in the "Family Jewels", a report ordered by D

In [28]:
chat0 = ChatOpenAI(temperature=0)
retriever_from_chat0 = MultiQueryRetriever.from_llm(retriever=db.as_retriever(), llm=chat0)

In [29]:
unique_docs0 = retriever_from_chat0.invoke(question)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What is the date of the declassification of this information?', '2. Can you provide the timeline for when this was declassified?', '3. At what point in time was this information made public?']


In [30]:
print(unique_docs0[0].page_content)

== Background ==
In 1974, a New York Times article was published that accused the CIA of illegal operations committed against US citizens. Authored by Seymour M. Hersh, it documented an intelligence operation against the anti-war movement, as well as "break-ins, wiretapping and the surreptitious inspection of mail" conducted since the 1950s. According to former CIA Official Cord Meyer, these disclosures "Convinced large sections of the American public that the CIA had become a domestic Gestapo and stimulated an overwhelming demand for the wide-ranging congressional investigations that were to follow."
Hersh had been tipped off to the possibility of an "in house operation" by an unidentified member of the CIA in spring of 1974. He embarked on an investigation, speaking to sources that included CIA Chief of Counterintelligence James Angleton. Although he was not aware of its existence, Hersh uncovered much information that had been documented in the "Family Jewels", a report ordered by D

## Context Compression

In [31]:
from langchain.vectorstores import Chroma
from langchain.document_loaders import WikipediaLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter

In [32]:
from dotenv import load_dotenv
load_dotenv()

True

In [58]:
embedding_function = OpenAIEmbeddings()
db_conn = Chroma(persist_directory='./langchain_with_python/mkultra.db', embedding_function=embedding_function)

In [37]:
from langchain_openai import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [38]:
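# Contextual compression: an LLM trims each retrieved document down to the
# parts that are relevant to the query.
# LLMChainExtractor wraps the LLM as the compressor;
# ContextualCompressionRetriever applies it to a base retriever's results.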

In [39]:
llm = ChatOpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)


In [59]:
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, 
                                                       base_retriever=db_conn.as_retriever())

In [60]:
docs = db_conn.similarity_search('When was this declassified?')

In [63]:
print(docs[0].page_content)

== Background ==
In 1974, a New York Times article was published that accused the CIA of illegal operations committed against US citizens. Authored by Seymour M. Hersh, it documented an intelligence operation against the anti-war movement, as well as "break-ins, wiretapping and the surreptitious inspection of mail" conducted since the 1950s. According to former CIA Official Cord Meyer, these disclosures "Convinced large sections of the American public that the CIA had become a domestic Gestapo and stimulated an overwhelming demand for the wide-ranging congressional investigations that were to follow."
Hersh had been tipped off to the possibility of an "in house operation" by an unidentified member of the CIA in spring of 1974. He embarked on an investigation, speaking to sources that included CIA Chief of Counterintelligence James Angleton. Although he was not aware of its existence, Hersh uncovered much information that had been documented in the "Family Jewels", a report ordered by D

In [64]:
compressed_docs = compression_retriever.invoke("when was this declassified?")

In [65]:
compressed_docs[0]

Document(page_content='1974', metadata={'source': 'https://en.wikipedia.org/wiki/United_States_President%27s_Commission_on_CIA_Activities_within_the_United_States', 'summary': 'The United States President\'s Commission on CIA Activities within the United States was ordained by President Gerald Ford in 1975 to investigate the activities of the Central Intelligence Agency and other intelligence agencies within the United States. The Presidential Commission was led by Vice President Nelson Rockefeller, from whom it gained the nickname the Rockefeller Commission.\nThe commission was created in response to a December 1974 report in The New York Times that the CIA had conducted illegal domestic activities, including experiments on US citizens, during the 1960s. The commission issued a single report in 1975, touching upon certain CIA abuses including mail opening and surveillance of domestic dissident groups. It also publicized Project MKUltra, a CIA mind control research program.\nSeveral we

In [68]:
print(compressed_docs[0].metadata['summary'])

The United States President's Commission on CIA Activities within the United States was ordained by President Gerald Ford in 1975 to investigate the activities of the Central Intelligence Agency and other intelligence agencies within the United States. The Presidential Commission was led by Vice President Nelson Rockefeller, from whom it gained the nickname the Rockefeller Commission.
The commission was created in response to a December 1974 report in The New York Times that the CIA had conducted illegal domestic activities, including experiments on US citizens, during the 1960s. The commission issued a single report in 1975, touching upon certain CIA abuses including mail opening and surveillance of domestic dissident groups. It also publicized Project MKUltra, a CIA mind control research program.
Several weeks later, committees were established in the House and Senate for a similar purpose. White House Personnel, including future Vice President Dick Cheney, edited the results, exclud

In [70]:
len(compressed_docs)

3
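
The compressed result above is far shorter than the raw chunk returned by `similarity_search`. The underlying pattern is retrieve first, then extract only the relevant part of each document and drop anything that comes back empty. The sketch below shows that pattern with hypothetical stand-ins: `keyword_extract` uses simple word matching where the notebook uses an LLM, and `fake_base_retriever` returns two made-up documents.

```python
def keyword_extract(document, query):
    """Stand-in for an LLM extractor: keep sentences sharing a word with the query."""
    query_words = {word.strip("?.,").lower() for word in query.split()}
    kept = [
        sentence.strip()
        for sentence in document.split(".")
        if query_words & {word.strip("?.,").lower() for word in sentence.split()}
    ]
    return ". ".join(kept) + "." if kept else ""


def compress_results(query, retrieve, extract):
    """Retrieve documents, shrink each with `extract`, and drop empty results."""
    compressed = []
    for document in retrieve(query):
        extracted = extract(document, query)
        if extracted:
            compressed.append(extracted)
    return compressed


# Hypothetical base retriever returning two made-up documents.
def fake_base_retriever(query):
    return [
        "The widget report was declassified last spring. It covered many topics. Staff were interviewed.",
        "Budgets grew quickly. Offices were moved.",
    ]


short_docs = compress_results("declassified", fake_base_retriever, keyword_extract)
print(short_docs)
```

Only the extracted text is shortened in this sketch. In the notebook output above, the compressed document still carries its full metadata, including the `summary` field.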