# Basic Summarization of Text with Amazon Bedrock
In this notebook, we look at two basic kinds of summarization.  They are a good starting point for many tasks.  When higher-quality, longer, or otherwise more advanced summaries are required, we recommend the other notebook in this repo, Advanced Summarize.ipynb.  The two techniques shown here are provided as a reference.  Their strengths and weaknesses are as follows:
  1) Stuff it.  Stuff the whole content into the prompt, and ask for a summary.
    * Strengths - the simplest approach.
    * Weaknesses - Content may not fit as a single prompt.  Less control than with a multi-step process.
  2) Map reduce.  For longer documents or sets of documents, break them into parts, summarize each part, and then iteratively combine the summaries until you have a single result.
    * Strengths - Can handle any length of document or group of documents.
    * Weaknesses - May lose context when chunking, which can lead to hallucinations.
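
To make the difference between the two approaches concrete, here is a minimal sketch of their control flow in plain Python. It is not the LangChain implementation used later in this notebook. The names `stuff_sketch`, `map_reduce_sketch`, and `max_chars` are hypothetical, and `summarize` stands in for whatever function calls the model.

```python
from typing import Callable, List


def stuff_sketch(text: str, summarize: Callable[[str], str]) -> str:
    """'Stuff it': send the whole text in a single call."""
    return summarize(text)


def map_reduce_sketch(
    chunks: List[str],
    summarize: Callable[[str], str],
    max_chars: int = 4000,  # hypothetical size budget for one combine call
) -> str:
    """Summarize each chunk, then merge summaries in batches until one is left."""
    if not chunks:
        return ""
    summaries = [summarize(chunk) for chunk in chunks]  # map step
    while len(summaries) > 1:  # reduce step, repeated as needed
        batches, current = [], ""
        for summary in summaries:
            if current and len(current) + len(summary) + 1 > max_chars:
                batches.append(current)
                current = summary
            else:
                current = f"{current}\n{summary}" if current else summary
        batches.append(current)
        if len(batches) == len(summaries):
            # Nothing fit together within the budget; merge everything at once
            # so that the loop is guaranteed to finish.
            batches = ["\n".join(summaries)]
        summaries = [summarize(batch) for batch in batches]
    return summaries[0]


def first_ten_chars(text: str) -> str:
    """Stand-in for a model call: 'summarizes' by truncating."""
    return text[:10]


print(stuff_sketch("hello world, this is a test", first_ten_chars))
print(map_reduce_sketch(["a" * 100, "b" * 100, "c" * 100], first_ten_chars, max_chars=25))
```

With a real model in place of `first_ten_chars`, the map step costs one call per chunk and the reduce step adds further calls, which is the price paid for handling texts that do not fit in one prompt.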

This notebook uses LangChain, which has both kinds of summarization described above built in.  We then test them with the sample data created in Data Collection and Cleaning.ipynb from this repo.
  
This notebook follows this layout:

  1) Set up the environment.
  2) Set up the two types of summarizations.
  3) Explore using the two summarizing functions.
  
For convenient use in other scripts, both types of summarization are wrapped in a Python function.

## 1) Set up the environment
First, let's install some dependencies:


In [36]:
#!pip install langchain tiktoken

Collecting tiktoken
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.5.1

In [4]:
#for connecting with Bedrock, use Boto3
import boto3, pickle
from botocore.config import Config

#AmazonSageMaker-ExecutionRole-20200702T102022

#increase the standard time out limits in boto3, because Bedrock may take a while to respond to large requests.
my_config = Config(
    connect_timeout=60*3,
    read_timeout=60*3,
)

bedrock_client = boto3.client(service_name='bedrock-runtime',config=my_config)

In [5]:
#now import langchain, and connect it to Bedrock
from langchain.llms.bedrock import Bedrock

model_parameter = {"temperature": 0.0, "top_p": .5, "max_tokens_to_sample": 2000} #define the model parameters
llm = Bedrock(model_id="anthropic.claude-v2", model_kwargs=model_parameter,client=bedrock_client) #define the model

## 2) Set up the two kinds of summarizations.
### First "stuff it" where everything is a single prompt.

In [18]:
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.schema.document import Document

# Define prompt
prompt_template = """\n\nHuman:  Consider this text:
<text>
{text}
</text>
Please create a concise summary in narrative format.

A:  Here is the concise summary:"""
prompt = PromptTemplate.from_template(prompt_template)

# Define LLM chain
llm_chain = LLMChain(llm=llm, prompt=prompt)

# Define StuffDocumentsChain
stuff_chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="text")

#Note that although LangChain often stores documents in small chunks for the 
#convenience of models with smaller context windows, this "stuff it" method will
#combine all those chunks into a single prompt call.

#wrapping in a Python function to make it easy to use in other scripts.
def stuff_it_summary(doc):
    #accept either a single string or a list of LangChain Documents
    if isinstance(doc, str):
        docs = [Document(page_content=doc)]
    else:
        docs = doc
    return stuff_chain.run(docs)
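
The prompt templates in this notebook rely on the `\n\nHuman:` and `\n\nAssistant:` turn markers of Claude's text-completion prompt format, so a misspelled marker is easy to miss. The following is a small, hypothetical helper (`check_turn_markers` is not part of LangChain or Bedrock) that sketches one way to sanity-check a template string before using it.

```python
HUMAN = "\n\nHuman:"
ASSISTANT = "\n\nAssistant:"


def check_turn_markers(template: str) -> list:
    """Return a list of problems found in a Human/Assistant style template."""
    problems = []
    if not template.startswith(HUMAN):
        problems.append("template does not start with the Human turn marker")
    if ASSISTANT not in template:
        problems.append("template has no Assistant turn marker")
    elif template.rfind(ASSISTANT) < template.rfind(HUMAN):
        problems.append("the last turn is not an Assistant turn")
    return problems


good = "\n\nHuman: Summarize this text:\n<text>\n{text}\n</text>\n\nAssistant:"
misspelled = "\n\nHuman: Summarize this text:\n{text}\n\nAssistiant:"

print(check_turn_markers(good))
print(check_turn_markers(misspelled))
```

This only checks the markers as plain strings. It does not validate the XML-style tags or the template variables.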

### And now "map reduce" where a long text is reduced to chunks, summarized, and iteratively combined.

In [22]:
from langchain.chains.mapreduce import MapReduceChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import ReduceDocumentsChain, MapReduceDocumentsChain

# Map
map_template = """\n\nHuman: The following is a set of documents
<documents>
{docs}
</documents>
Based on this list of docs, please identify the main themes.

Assistant:  Here are the main themes:"""
map_prompt = PromptTemplate.from_template(map_template)
map_chain = LLMChain(llm=llm, prompt=map_prompt)

# Reduce
reduce_template = """\n\nHuman: The following is a set of summaries:
<summaries>
{doc_summaries}
</summaries>
Please take these and distill them into a final, consolidated summary of the main themes in narrative format. 

Assistant:  Here are the main themes:"""
reduce_prompt = PromptTemplate.from_template(reduce_template)
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

# Takes a list of documents, combines them into a single string, and passes this to an LLMChain
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="doc_summaries"
)

# Combines and iteratively reduces the mapped documents
reduce_documents_chain = ReduceDocumentsChain(
    # This is the final chain that is called.
    combine_documents_chain=combine_documents_chain,
    # If documents exceed context for `StuffDocumentsChain`
    collapse_documents_chain=combine_documents_chain,
    # The maximum number of tokens to group documents into.
    token_max=4000,
)

# Combining documents by mapping a chain over them, then combining results
map_reduce_chain = MapReduceDocumentsChain(
    # Map chain
    llm_chain=map_chain,
    # Reduce chain
    reduce_documents_chain=reduce_documents_chain,
    # The variable name in the llm_chain to put the documents in
    document_variable_name="docs",
    # Return the results of the map steps in the output
    return_intermediate_steps=False,
)

#wrapping in a Python function to make it easy to use in other scripts.
def map_reduce_summary(doc, DEBUG=False):
    if isinstance(doc, str):
        #use the LangChain built-in text splitter to split our text
        from langchain.text_splitter import RecursiveCharacterTextSplitter
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size = 5000,
            chunk_overlap  = 200,
            length_function = len,
            add_start_index = True,
        )
        split_docs = text_splitter.create_documents([doc])
        if DEBUG: print("Text was split into %s docs"%len(split_docs))
    else:
        #assume the caller passed a list of Documents that are already split
        split_docs = doc
    return map_reduce_chain.run(split_docs)
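
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter in plain Python. It is only a sketch: LangChain's `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunk boundaries will generally differ. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 5000, chunk_overlap: int = 200) -> list:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# A synthetic 7,607-character text, the same length as the short sample used below.
sample = "".join(str(i % 10) for i in range(7607))
chunks = split_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
```

The overlap repeats the tail of one chunk at the head of the next, which gives the model a little shared context across the cut at the cost of processing those characters twice.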

## 3) Explore using the two basic types of summarization
First, load the sample data to test with, as prepared in Data Collection and Cleaning.ipynb.
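
Each sample file is a pickle that holds one string containing the whole text. As a rough illustration of that format, the sketch below writes a string to a pickle file and reads it back. The file name `sample.pkl` and the text are made up for the example, and the temporary directory is only there so that nothing is left on disk.

```python
import os
import pickle
import tempfile

text = "A short sample story."  # made-up content for illustration

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pkl")  # hypothetical file name
    # Write the whole text as a single pickled string.
    with open(path, "wb") as file:
        pickle.dump(text, file)
    # Read it back the same way the notebook loads its samples.
    with open(path, "rb") as file:
        loaded = pickle.load(file)

print(type(loaded).__name__, len(loaded))
```

Unpickling can execute arbitrary code, so only load pickle files from sources you trust.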

In [12]:
#Set this to true to run the examples, by default it is off so that this script can be loaded elsewhere.
RUN_EXAMPLES = False
if __name__ == '__main__':
    RUN_EXAMPLES = True

In [15]:
if RUN_EXAMPLES:
    #file locations for pickles of text.  Each is a single string containing the whole text.
    #short, medium, and long texts are provided as examples.
    text_to_open_short = 'sample texts/hills.pkl'  #2-3 page story, Hills Like White Elephants
    text_to_open_mid = 'sample texts/algernon.pkl'  #short story, Flowers for Algernon
    text_to_open_long = 'sample texts/frankenstien.pkl' #short novel, Frankenstein
    text_to_open_short_factual = 'sample texts/elvis.pkl'  #longest Wikipedia article, Elvis.

    from langchain.schema.document import Document

    with open(text_to_open_short, 'rb') as file:
        #note that here we're loading a single text as one string; the wrapper functions above turn it into a list of Documents for the chains.
        doc = pickle.load(file)

    print (len(doc))

7607
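
The number printed above is a character count, not a token count. A rough way to judge whether a text is likely to fit in a single "stuff it" prompt is to convert characters to an estimated number of tokens. The sketch below assumes about 4 characters per token and a 100,000-token context window. Both values are assumptions made for illustration, not figures taken from this notebook, and a real tokenizer will give different counts. The 2,000 tokens reserved for output mirror the `max_tokens_to_sample` setting used earlier.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate based on character count (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_one_prompt(
    text: str,
    context_tokens: int = 100_000,    # assumed context window size
    reserved_for_output: int = 2000,  # room left for the generated summary
    chars_per_token: float = 4.0,     # assumed average for English text
) -> bool:
    """Return True if the estimated prompt plus the reserved output fits the window."""
    return estimate_tokens(text, chars_per_token) + reserved_for_output <= context_tokens


short_text = "x" * 7607      # same length as the short sample
long_text = "x" * 500_000    # stand-in for a much longer document

print(estimate_tokens(short_text), fits_in_one_prompt(short_text))
print(estimate_tokens(long_text), fits_in_one_prompt(long_text))
```

A check like this could be used to choose between `stuff_it_summary` and `map_reduce_summary` for a given text, keeping a safety margin because the estimate is coarse.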


### Test the "stuff it" method, where everything goes into a single prompt:

In [16]:
#%%time
if RUN_EXAMPLES:
    print(stuff_it_summary(doc))

 Here is a concise narrative summary of the passage:

An American man and a girl named Jig are sitting outside a train station bar in Spain, waiting for a train to Madrid. It is a hot day and they order beers. Jig looks at the hills across the valley and remarks that they look like white elephants. The man says he's never seen a white elephant. 

They order a new drink called Anis del Toro. As they drink, the man tries to convince Jig to have an unnamed operation, implying an abortion. He says it's simple and will make their relationship happy again, but Jig is reluctant. 

Jig walks to the end of the station and looks across at the hills, trees, and mountains in the distance. She laments that they could have had "everything" but now it's been taken away. The man urges Jig not to feel that way. 

When he persists in pressing her to have the operation, Jig asks him to stop talking. They sit in silence until their train arrives. The man carries their bags across the station while Jig wai

### Test the "map reduce" method, where we first split the text into chunks, then summarize them, then combine the summaries:

In [23]:
#%%time
if RUN_EXAMPLES:
    print(map_reduce_summary(doc, DEBUG=True))

Text was split into 2 docs


The story depicts a complex relationship between a man and a woman traveling together. There is tension between them as the woman grapples with a major life decision that could significantly impact their relationship. She is pregnant and contemplating getting an abortion, which the man seems to support but the woman has doubts about. 

The setting, described vividly, takes place in train stations and bars in the Spanish countryside near the Ebro river valley. The heat and hilly landscape with trees are mentioned frequently, almost becoming characters themselves. 

Throughout their conversations, the woman expresses a desire for more out of life and a disappointment with their current transient, unsettled circumstances. She wants things to be different but feels resigned that they won't change. The metaphor of "white elephants" seems to symbolize an unwanted burden.

While the man tries to reassure the woman, their differing views on the pregnancy and aborti