# Analyzing documents

In [1]:
with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

## Summarise a long document

In [3]:
# count words in document
word_count = len(state_of_the_union.split(" "))
word_count

6571

About 6.5K words. The token count will be higher still, so the document exceeds GPT-3.5's context length.
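
To make the words-versus-tokens gap concrete, here is a rough sketch that estimates tokens from a word count. The ratio of about 4/3 tokens per English word is a common rule of thumb, not a real tokenizer, and `ASSUMED_CONTEXT_TOKENS` is an assumed limit chosen for illustration; check the documentation of the model actually in use. The helper name `estimate_tokens` is hypothetical.

```python
import math


def estimate_tokens(text: str, tokens_per_word: float = 4 / 3) -> int:
    """Rough token estimate from a whitespace word count (heuristic, not a tokenizer)."""
    word_count = len(text.split())
    return math.ceil(word_count * tokens_per_word)


# Assumed limit for illustration only; the real limit depends on the model.
ASSUMED_CONTEXT_TOKENS = 4097

# Stand-in text with the same word count as the speech (6571 words).
sample = "word " * 6571
estimated = estimate_tokens(sample)
print(estimated, estimated > ASSUMED_CONTEXT_TOKENS)
```

Note that `split()` with no argument splits on any run of whitespace, including newlines, whereas `split(" ")` splits only on single spaces, so the two can give different counts on the same text.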

In [6]:
from langchain import OpenAI
from langchain.chains.summarize import load_summarize_chain
import os
import yaml
# Load the config file
with open('../../config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Set the OPENAI_API_KEY environment variable from the config
os.environ['OPENAI_API_KEY'] = config['OPENAI_API_KEY']

llm = OpenAI(temperature=0)
summary_chain = load_summarize_chain(llm, chain_type="map_reduce")
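
The idea behind `chain_type="map_reduce"` can be sketched in plain Python: split the text into chunks that fit the context, summarise each chunk, then summarise the joined summaries. This is a simplified illustration and not LangChain's implementation; `split_into_chunks`, `map_reduce_summarize`, and the toy summariser `first_three_words` are hypothetical names, and the toy summariser merely stands in for an LLM call.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def map_reduce_summarize(text: str, summarize: Callable[[str], str], max_words: int = 500) -> str:
    """Summarise each chunk (map), then summarise the joined summaries (reduce)."""
    chunks = split_into_chunks(text, max_words)
    partial_summaries = [summarize(chunk) for chunk in chunks]
    return summarize(" ".join(partial_summaries))


def first_three_words(text: str) -> str:
    """Toy stand-in for an LLM call: keep only the first three words."""
    return " ".join(text.split()[:3])


toy_text = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9"
toy_summary = map_reduce_summarize(toy_text, first_three_words, max_words=4)
print(toy_summary)
```

Because each chunk is summarised on its own, no single model call has to see the whole document; the trade-off is that details spanning several chunks can be lost in the reduce step.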

In [7]:
from langchain.chains import AnalyzeDocumentChain
summarize_document_chain = AnalyzeDocumentChain(combine_docs_chain=summary_chain)
summarize_document_chain.run(state_of_the_union)

" In this speech, President Biden addresses the American people and the world, discussing the recent aggression of Russia's Vladimir Putin in Ukraine and the US response. He outlines economic sanctions and other measures taken to hold Putin accountable, and announces the US Department of Justice's task force to go after the crimes of Russian oligarchs. He also proposes a plan to fight inflation, invest in America, and create jobs, and nominates Circuit Court of Appeals Judge Ketanji Brown Jackson to the Supreme Court. He calls for the passage of laws to reduce gun violence, protect the right to vote, and secure the border, and outlines a Unity Agenda for the Nation. He is also committed to helping lower-income veterans get VA care debt-free, and to finding out the cause of the diseases of many of our troops. He is optimistic about America's future and believes that the nation can turn every crisis into an opportunity."

In [15]:
# read webpage into a string variable - http://gutenberg.net.au/ebooks01/0100011h.html
import requests

# Send an HTTP GET request to the URL and retrieve the HTML content
response = requests.get('http://gutenberg.net.au/ebooks01/0100011h.html')
animal_farm = response.text
len(animal_farm.split(" "))

27482

Far too long for the context. The raw page comes to about 27K space-separated words, and the token count will be higher still.
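
Since `response.text` holds raw HTML, the count above includes markup as well as the story itself. One possible way to count only the visible text is to strip the tags first with the standard library's `html.parser`. This is a minimal sketch; `TextExtractor`, `html_to_text`, and the sample HTML are illustrative and not part of the original notebook.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the text of an HTML string with whitespace normalised."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Small sample page for illustration; the real input would be response.text.
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Animal Farm</h1><p>Mr. Jones, of the Manor Farm</p></body></html>"
)
sample_text = html_to_text(sample_html)
print(sample_text, len(sample_text.split()))
```

Applied to the downloaded page, `len(html_to_text(animal_farm).split())` would give a word count for the text alone, which is a fairer basis for estimating tokens.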

In [17]:
summarize_document_chain.run(animal_farm)

' In George Orwell\'s Animal Farm, the animals of Manor Farm rebel against their human oppressors and establish a society of their own. Led by the pigs Snowball and Napoleon, the animals work together to build a windmill and increase production. However, Snowball is chased away and Napoleon takes charge. Despite hardships, the animals rebuild the windmill and increase production. Years later, the animals are still working hard and the pigs have taken on human characteristics. Napoleon makes changes to the farm, including the abolishment of the custom of addressing each other as "Comrade" and the burial of the boar\'s skull, and changes the name of the farm from "Animal Farm" to "The Manor Farm".'

## QA Chain

In [20]:
from langchain.chains.question_answering import load_qa_chain
qa_chain = load_qa_chain(llm, chain_type="map_reduce")
qa_document_chain = AnalyzeDocumentChain(combine_docs_chain=qa_chain)
qa_document_chain.run(input_document=animal_farm, question="What does Napolean represent in the story?")

' None. This text does not provide any information about what Napolean represents in the story.'

This question cannot be answered by the AnalyzeDocumentChain. Perhaps the question is too open-ended.

In [21]:
qa_document_chain.run(input_document=animal_farm, question="What kind of animal is Napolean?")

' Napoleon is a pig.'

A very literal question, and the answer is correct.

In [22]:
qa_document_chain.run(input_document=animal_farm, question="What do Napolean and Snowball have in common?")

' Napoleon and Snowball both advocated for the building of the windmill and both had the common goal of keeping the pigs in good health and reserving the milk and apples for the pigs alone, and both had their reputations tarnished by false rumors and stories.'

In [23]:
qa_document_chain.run(input_document=animal_farm, question="How do Napolean and Snowball differ?")

' Napoleon and Snowball differ in that Snowball wanted to keep the resolution of never having any dealings with human beings, never engaging in trade, while Napolean was willing to make deals with human beings and use money. Napoleon also wanted to take full control of the farm and was willing to use force to do so, while Snowball wanted to keep the Meetings and allow the animals to have a say in the running of the farm. Napoleon had abolished the custom of addressing one another as "Comrade" and the custom of marching past a boar\'s skull every Sunday morning. He had also changed the flag from one with a white hoof and horn to a plain green flag. Snowball had not done any of these things.'

In [24]:
qa_document_chain.run(input_document=animal_farm, question="Which animal was the happiest in the end?")

' None of the text is relevant to answer the question.'

It appears that this chain is good at answering questions about concrete entities and characters in the story, but it fails to answer open-ended questions.
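
One possible reading of these results, offered as an assumption rather than something the runs above establish: with `chain_type="map_reduce"`, the question is put to each chunk separately and the per-chunk answers are combined afterwards, so a question does best when some single chunk states the answer outright. The toy sketch below mimics that flow with keyword matching in place of an LLM; `answer_from_chunk`, `map_reduce_answer`, and the two sample chunks are hypothetical.

```python
from typing import List

NO_ANSWER = "no relevant text"


def answer_from_chunk(chunk: str, keyword: str) -> str:
    """Toy stand-in for the per-chunk LLM call: return the chunk if it mentions the keyword."""
    return chunk if keyword.lower() in chunk.lower() else NO_ANSWER


def map_reduce_answer(chunks: List[str], keyword: str) -> str:
    """Query every chunk on its own (map), then join the relevant answers (reduce)."""
    partial_answers = [answer_from_chunk(chunk, keyword) for chunk in chunks]
    relevant = [answer for answer in partial_answers if answer != NO_ANSWER]
    return " ".join(relevant) if relevant else NO_ANSWER


# Made-up chunks for illustration, not excerpts from the book.
toy_chunks = ["Napoleon is a pig.", "The animals built a windmill."]
literal = map_reduce_answer(toy_chunks, "Napoleon")
open_ended = map_reduce_answer(toy_chunks, "happiest")
print(literal)
print(open_ended)
```

In this toy setup a question tied to an explicit mention finds its chunk, while one whose answer has to be inferred across the whole story finds nothing to combine.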