# Project: Summarization

In [49]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(),override=True)

True

## A - Basic Prompt

In [50]:
from langchain_openai import ChatOpenAI
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

In [51]:
text="""
Mojo combines the usability of Python with the performance of C, unlocking unparalleled programmability of AI hardware and extensibility of AI models.
Mojo is a new programming language that bridges the gap between research and production by combining the best of Python syntax with systems programming and metaprogramming. With Mojo, you can write portable code that's faster than C and seamlessly inter-op with the Python ecosystem.
"""

messages = [
    SystemMessage(content='You are an expert copywriter with expertise in summarizing documents'),
    HumanMessage(content=f'Please provide a short and concise summary of the following text:\n TEXT: {text}')
]

llm = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')


In [52]:
llm.get_num_tokens(text)

82

In [53]:
summary_output = llm.invoke(messages)

In [54]:
print(summary_output.content)

Summary:
Mojo is a new programming language that merges Python's user-friendliness with C's performance, enabling advanced programmability of AI hardware and extensibility of AI models. It facilitates writing fast, portable code that integrates smoothly with Python, bridging the gap between research and production.


### Summarizing using prompt templates

In [55]:
from langchain import PromptTemplate
from langchain.chains import LLMChain

In [56]:
template= '''
Write a concise and short summary of the following text:
TEXT: `{text}`
Translate the summary to {language}.
'''

prompt = PromptTemplate(
    input_variables=['text','language'],
    template=template
)

In [57]:
llm.get_num_tokens(prompt.format(text=text,language='English'))

103

In [58]:
chain = LLMChain(llm=llm, prompt=prompt)
summary = chain.invoke({'text': text, 'language': 'Portuguese'})

In [59]:
print(summary)

{'text': "Summary: Mojo is a new programming language that combines Python's usability with C's performance, allowing for enhanced programmability of AI hardware and extensibility of AI models.\n\nResumo: Mojo é uma nova linguagem de programação que combina a usabilidade do Python com o desempenho do C, permitindo uma maior programabilidade do hardware de IA e extensibilidade dos modelos de IA.", 'language': 'Portuguese'}


## Summarizing using StuffDocumentsChain

In [60]:
from langchain_openai import ChatOpenAI
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document

In [61]:
with open('files/sj.txt', encoding='utf-8') as f:
    text = f.read()

#text
    
docs = [Document(page_content=text)]
llm = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')

In [62]:
template = '''Write a concise and short summary of the following text.
TEXT: `{text}`
'''
prompt = PromptTemplate(
    input_variables=['text'],
    template=template
)


In [63]:
chain = load_summarize_chain(
    llm,
    chain_type='stuff',
    prompt=prompt,
    verbose=True
)
output_summary = chain.invoke(docs)
print(output_summary['output_text'])



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Write a concise and short summary of the following text.
TEXT: `I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories.

The first story is about connecting the dots.

I dropped out of Reed College after the first 6 months, but then stayed around as a drop-in for another 18 months or so before I really quit. So why did I drop out?

It started before I was born. My biological mother was a young, unwed college graduate student, and she decided to put me up for adoption. She felt very strongly that I should be adopted by college graduates, so everything was all set for me to be adopted at birth by a lawyer


> Finished chain.

> Finished chain.
The speaker shares three stories from his life during a commencement speech. The first story is about connecting the dots and dropping out of college, leading to unexpected opportunities. The second story is about love and loss, including being fired from the company he co-founded. The third story is about facing death and the importance of following one's heart. The speaker encourages the audience to stay hungry and stay foolish as they begin anew.
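
The `stuff` chain type places the text of every document into a single prompt and sends it in one LLM call, so it only suits inputs that fit in the model's context window. The following is a minimal sketch of that idea in plain Python, with no LangChain or API calls. `build_stuff_prompt`, the sample documents, and the blank-line separator are assumptions made for illustration.

```python
# Sketch of the "stuff" strategy: every document ends up in one prompt.
TEMPLATE = "Write a concise and short summary of the following text.\nTEXT: `{text}`\n"

def build_stuff_prompt(docs, template=TEMPLATE, separator="\n\n"):
    """Join all document texts and substitute them into a single prompt."""
    combined = separator.join(docs)
    return template.format(text=combined)

# Hypothetical sample documents standing in for the loaded file.
docs = ["First part of the speech.", "Second part of the speech."]
prompt = build_stuff_prompt(docs)
print(prompt)
```

Because everything travels in one request, the prompt grows with the input; longer documents call for one of the chunked strategies.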


## Summarizing large documents using map_reduce

The document is split into `X` chunks, each chunk is summarized separately, and the `X` chunk summaries are then summarized together to produce the final summary.
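
The flow described above can be sketched in plain Python. This is an illustration under stated assumptions, not LangChain's implementation: `fake_summarize` is a hypothetical stand-in for an LLM call that simply keeps the first few words, and the chunk texts are made up.

```python
def fake_summarize(text, max_words=8):
    """Stand-in for an LLM call: keep only the first few words."""
    return " ".join(text.split()[:max_words])

def map_reduce_summary(chunks, summarize=fake_summarize):
    # Map step: summarize each chunk independently (one call per chunk).
    partial = [summarize(chunk) for chunk in chunks]
    # Reduce step: summarize the joined partial summaries (one more call).
    final = summarize("\n".join(partial))
    return final, partial

# Hypothetical chunks for illustration.
chunks = [
    "The first story is about connecting the dots and dropping out of college.",
    "The second story is about love and loss after being fired.",
]
final, partial = map_reduce_summary(chunks)
print(partial)
print(final)
```

The map calls do not depend on each other, which is why this strategy can run them in parallel.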

In [64]:
from langchain import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [65]:
with open('files/sj.txt', encoding='utf-8') as f:
    text = f.read()

llm = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')

In [66]:
llm.get_num_tokens(text)

2653

In [67]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=50)
chunks = text_splitter.create_documents([text])

In [70]:
len(chunks)  # 2 chunks -> 2 map calls, plus 1 combine call

2
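
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping fixed-size windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); `split_with_overlap` is a hypothetical helper for illustration only.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

pieces = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)

# With a map_reduce chain, each piece costs one map call,
# and combining the partial summaries costs at least one more.
min_llm_calls = len(pieces) + 1
print(min_llm_calls)
```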

In [73]:
chain = load_summarize_chain(
    llm,
    chain_type='map_reduce',
    verbose=False
)
output_summary = chain.invoke(chunks)

In [75]:
print(output_summary['output_text'])

Steve Jobs shares three stories from his life in his commencement speech, highlighting the importance of following one's curiosity, not settling for less, and living each day as if it were the last. He encourages the audience to stay hungry and foolish in pursuing their dreams, reflecting on the inevitability of death and the importance of living authentically. Jobs references The Whole Earth Catalog's message of staying curious and open-minded as the audience graduates and begins a new chapter in their lives.


In [76]:
chain.llm_chain.prompt.template

'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'

In [77]:
chain.combine_document_chain.llm_chain.prompt.template

'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'

### map_reduce with Custom Prompts

In [78]:
map_prompt = '''
Write a short and concise summary of the following:
Text: `{text}`
CONCISE SUMMARY:
'''
map_prompt_template = PromptTemplate(
    input_variables=['text'],
    template=map_prompt
)

In [79]:
combine_prompt = '''
Write a concise summary of the following text that covers the key points.
Add a title to the summary.
Start your summary with an INTRODUCTION PARAGRAPH that gives an overview of the topic FOLLOWED
by BULLET POINTS if possible AND end the summary with a CONCLUSION PHRASE.
Text: `{text}`
'''

combine_prompt_template=PromptTemplate(template=combine_prompt, input_variables=['text'])

In [80]:
summary_chain = load_summarize_chain(
    llm,
    chain_type='map_reduce',
    map_prompt=map_prompt_template,
    combine_prompt=combine_prompt_template,
    verbose=False
)
output = summary_chain.invoke(chunks)

In [81]:
print(output['output_text'])

Title: Lessons from Steve Jobs' Commencement Speech

Introduction:
In a commencement speech, Steve Jobs shares three impactful stories from his life, emphasizing the importance of following one's passion and living authentically.

Key Points:
- Story 1: Dropping out of college, following curiosity, and designing the Macintosh computer
- Story 2: Being fired from Apple, starting over with NeXT and Pixar, and finding success
- Story 3: Facing death after a cancer diagnosis and changing perspective on life
- Emphasis on following passion, not settling, and living each day as if it were the last
- Reference to The Whole Earth Catalog and the message to "Stay Hungry. Stay Foolish."

Conclusion:
Steve Jobs' stories highlight the importance of living authentically, following intuition, and embracing new beginnings with passion and determination.


### Summarization using refine chain

1. Step 1: summarize(chunk #1) => summary #1
2. Step 2: summarize(summary #1 + chunk #2) => summary #2
3. Step n: summarize(summary #n-1 + chunk #n) => final summary

Pros:
- uses a more relevant context (better summarization)
- less lossy than map_reduce

Cons:
- it requires many more calls to the LLM (one per chunk)
- the calls are not independent and cannot be parallelized
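
The steps above can be sketched as a simple loop. This is a minimal illustration, assuming a hypothetical `fake_llm` stand-in that records each prompt instead of calling a model; the prompt wording is invented for the example and is not LangChain's default.

```python
def refine_summary(chunks, llm_call):
    """Run the refine loop: each step sees the previous summary plus the next chunk."""
    summary = llm_call(f"Summarize: {chunks[0]}")
    for chunk in chunks[1:]:
        summary = llm_call(
            f"Existing summary: {summary}\nNew context: {chunk}\nRefine the summary."
        )
    return summary

calls = []

def fake_llm(prompt):
    """Stand-in for an LLM call: record the prompt and return a numbered placeholder."""
    calls.append(prompt)
    return f"summary #{len(calls)}"

chunks = ["chunk #1", "chunk #2", "chunk #3"]
final = refine_summary(chunks, fake_llm)
print(final)       # summary #3
print(len(calls))  # 3
```

Each prompt after the first contains the previous summary, so a call cannot start until the one before it has returned. That dependency is what rules out parallel execution.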

In [1]:
from langchain import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredPDFLoader

In [2]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True

In [3]:
pip install unstructured -q

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install pdf2image

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install pdfminer-six -q

Note: you may need to restart the kernel to use updated packages.


In [9]:
pip install opencv-python -q

Note: you may need to restart the kernel to use updated packages.


In [6]:
# from langchain.document_loaders import PyPDFLoader
# loader = PyPDFLoader('files/attention_is_all_you_need.pdf')
# data = loader.load()

In [10]:
loader = UnstructuredPDFLoader('files/attention_is_all_you_need.pdf')
data = loader.load()

[nltk_data] Downloading package punkt to /home/nishihara/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/nishihara/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [11]:
print(data[0].page_content)

Attention Is All You Need

7 1 0 2 c e D 6

Ashish Vaswani∗ Google Brain avaswani@google.com

Llion Jones∗ Google Research llion@google.com

Noam Shazeer∗ Google Brain noam@google.com

Niki Parmar∗ Google Research nikip@google.com

Jakob Uszkoreit∗ Google Research usz@google.com

Aidan N. Gomez∗ † University of Toronto aidan@cs.toronto.edu

Łukasz Kaiser∗ Google Brain lukaszkaiser@google.com

Illia Polosukhin∗ ‡ illia.polosukhin@gmail.com

] L C . s c [

5 v 2 6 7 3 0 . 6 0 7 1 : v i X r a

1

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while 

In [12]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap = 100)
chunks = text_splitter.split_documents(data)


In [13]:
len(chunks)

5

In [14]:
llm = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')

In [15]:
def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')  # recommended 'text-embedding-3-small' or 'text-embedding-3-large'
    total_tokens = sum([len(enc.encode(page.page_content)) for page in texts])
    print(f'Total Tokens: {total_tokens}')
    print(f'Embedding Cost in USD: {total_tokens/ 1000 * 0.0004:.6f}')

print_embedding_cost(chunks)

Total Tokens: 10049
Embedding Cost in USD: 0.004020
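
The cost figure is plain arithmetic: tokens divided by 1,000, multiplied by a per-1,000-token rate. The sketch below reproduces it with a hypothetical helper, `embedding_cost_usd`, that returns the value instead of printing it. The rate of 0.0004 is the one hard-coded in the notebook and is an assumption here; actual pricing depends on the model and changes over time.

```python
def embedding_cost_usd(total_tokens, usd_per_1k_tokens=0.0004):
    """Return the cost of total_tokens at a flat per-1,000-token rate."""
    return total_tokens / 1000 * usd_per_1k_tokens

cost = embedding_cost_usd(10049)
formatted = f"{cost:.6f}"
print(f"Embedding Cost in USD: {formatted}")  # Embedding Cost in USD: 0.004020
```

Returning the number keeps the function reusable, and the caller decides how to format or print it.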


In [16]:
chain = load_summarize_chain(
    llm=llm,
    chain_type='refine',
    verbose=False
)

output_summary = chain.invoke(chunks)

In [17]:
print(output_summary['output_text'])

The paper "Attention Is All You Need" introduces the Transformer model, which relies solely on attention mechanisms and eliminates the need for recurrent or convolutional layers. The model outperforms existing models in machine translation tasks, achieving state-of-the-art results with faster training times. It uses self-attention to compute representations of input and output sequences, allowing for more parallelization. The architecture of the Transformer includes encoder and decoder stacks, as well as multi-head attention mechanisms. The model also incorporates position-wise feed-forward networks, embeddings, softmax functions, and positional encodings to capture the relative or absolute position of tokens in the sequence. The comparison of self-attention layers with recurrent and convolutional layers highlights the computational complexity, parallelizability, and path length between long-range dependencies in the network. The paper provides insights into why self-attention is a fav

### Refine with custom prompts

In [19]:
prompt_template = """Write a concise summary of the following extracting the key information:
Text: `{text}`
CONCISE SUMMARY:"""
initial_prompt = PromptTemplate(template=prompt_template, input_variables=['text'])

refine_template = '''
Your job is to produce a final summary.
I have provided an existing summary up to a certain point: {existing_answer}.
Please refine the existing summary with some more context below.
----------
{text}
----------
Start the final summary with an INTRODUCTION PARAGRAPH that gives an overview of the topic FOLLOWED by BULLET POINTS if possible AND end the summary with a CONCLUSION PHRASE.
'''

refine_prompt = PromptTemplate(
    template=refine_template, 
    input_variables=['existing_answer', 'text']
)
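
The names listed in `input_variables` are meant to match the `{...}` placeholders in the template string. A quick way to check this is to extract the placeholder names with the standard library. The sketch below uses only `string.Formatter`; `placeholder_names` is a hypothetical helper, and the shortened template is an excerpt for illustration.

```python
from string import Formatter

def placeholder_names(template):
    """Return the set of named {placeholders} found in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}

# Shortened excerpt of the refine template, for illustration.
refine_excerpt = (
    "I have provided an existing summary up to a certain point: {existing_answer}.\n"
    "Please refine the existing summary with some more context below.\n"
    "{text}\n"
)

found = placeholder_names(refine_excerpt)
# A declared name with a space instead of an underscore would not match.
undeclared = found - {"existing answer", "text"}
print(sorted(found))
print(sorted(undeclared))
```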

In [20]:
chain = load_summarize_chain(
    llm=llm,
    chain_type='refine',
    question_prompt=initial_prompt,
    refine_prompt=refine_prompt,
    return_intermediate_steps=False
)
output_summary = chain.invoke(chunks)

In [21]:
print(output_summary['output_text'])

The Transformer model has revolutionized the field of natural language processing with its innovative use of attention mechanisms, outperforming traditional models in tasks such as machine translation. By eliminating the need for recurrence and convolutions, the Transformer offers superior performance, handling long-range dependencies and providing interpretable models through self-attention.

Key points discussed in the text include:
- The use of Scaled Dot-Product Attention and Multi-Head Attention in the Transformer model
- Applications of attention in encoder-decoder architecture, self-attention layers, and positional encodings
- Inclusion of Position-wise Feed-Forward Networks in each layer
- Utilization of embeddings and softmax functions for input and output tokens
- Importance of Positional Encoding in determining token positions in a sequence

The Transformer model has achieved state-of-the-art results in translation tasks, surpassing previous ensembles and showcasing its pote

## Summarizing using LangChain Agents

In [22]:
from langchain_openai import ChatOpenAI
from langchain.agents import initialize_agent, Tool
from langchain.utilities import WikipediaAPIWrapper

In [23]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True

In [24]:
llm = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')
wikipedia = WikipediaAPIWrapper()

In [25]:
tools = [
    Tool(
        name='Wikipedia',
        func=wikipedia.run,
        description='Useful for when you need to get information from wikipedia about a topic'
    )
]

In [26]:
agent_executor = initialize_agent(tools, llm, agent='zero-shot-react-description', verbose=True)



In [27]:
output = agent_executor.invoke('Can you please provide a short summary of George Washington?')



> Entering new AgentExecutor chain...
I should look up George Washington on Wikipedia to get a short summary of his life and accomplishments.
Action: Wikipedia
Action Input: George Washington
Observation: Page: George Washington
Summary: George Washington (February 22, 1732 – December 14, 1799) was an American Founding Father, military officer, and politician who served as the first president of the United States from 1789 to 1797. Appointed by the Second Continental Congress as commander of the Continental Army in 1775, Washington led Patriot forces to victory in the American Revolutionary War and then served as president of the Constitutional Convention in 1787, which drafted and ratified the Constitution of the United States and established the U.S. federal government. Washington has thus become commonly known as the "Father of his Country".
Washington's first public office, from 1749 to 1750, was as surveyor of Culpeper County in the Colony o

In [28]:
print(output)

{'input': 'Can you please provide a short summary of George Washington?', 'output': 'George Washington was the first president of the United States and a key figure in the American Revolutionary War. George Washington Carver was an agricultural scientist and inventor who promoted alternative crops to cotton.'}


In [30]:
print(output['output'])

George Washington was the first president of the United States and a key figure in the American Revolutionary War. George Washington Carver was an agricultural scientist and inventor who promoted alternative crops to cotton.
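
The verbose trace shows the model emitting `Action:` and `Action Input:` lines, which the agent turns into a tool call whose result comes back as an `Observation`. The sketch below illustrates that single step in plain Python. It is not LangChain's parser: `parse_action` and `fake_wikipedia` are hypothetical, and the stub tool performs no network access.

```python
import re

def parse_action(llm_text):
    """Extract the tool name and its input from ReAct-style text."""
    match = re.search(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", llm_text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()

def fake_wikipedia(query):
    """Hypothetical stand-in for a Wikipedia lookup; returns a placeholder string."""
    return f"Page: {query}"

# Map tool names, as they appear in the model output, to callables.
tools = {"Wikipedia": fake_wikipedia}

llm_text = (
    "I should look up George Washington on Wikipedia.\n"
    "Action: Wikipedia\n"
    "Action Input: George Washington"
)
tool_name, tool_input = parse_action(llm_text)
observation = tools[tool_name](tool_input)
print(f"Observation: {observation}")
```

In a full agent loop the observation would be appended to the prompt and the model asked again, until it produces a final answer instead of another action.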
