# Summarization on Custom Dataset with SageMaker JumpStart and [LangChain](https://python.langchain.com/en/latest/index.html) Library

Reference: https://github.com/gkamradt/langchain-tutorials/tree/main/data_generation


There are two main types of methods for summarizing text: abstractive and extractive.

Abstractive summarization generates a new shorter summary in its own words based on understanding the meaning and concepts of the original text. It analyzes the text using advanced natural language techniques to grasp the key ideas and then expresses those ideas in a summarized form using different words and phrases. This is similar to how humans summarize by reading something and then explaining the main points in their own words.

Extractive summarization works by selecting the most important sentences, phrases or words from the original text to construct a summary. It calculates the weight or importance of each part of the text using algorithms and then chooses the parts with the highest weights to put into the summary. This approach summarizes by extracting key elements from the text itself rather than interpreting its meaning.

So in short, abstractive summarization rewrites the key ideas in new words while extractive summarization selects the most salient parts of the existing text. Both aim to distill the essence and most significant information from the original document into a condensed summary.
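
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top-scoring ones. The function name `extractive_summary`, the frequency-based scoring, and the sample text are illustrative assumptions; this is not how the AI21 model used later in this notebook works.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Pick the highest-scoring sentences, scored by average word frequency."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Count how often each lowercase word appears in the whole text.
    word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(word_counts[w] for w in words) / max(len(words), 1)

    # Keep the top-scoring sentences, then restore their original order.
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)


# Hypothetical sample text, used only for illustration.
sample = (
    "Summaries save time. "
    "Good summaries keep the key ideas, and short summaries save readers time. "
    "The weather was mild."
)
print(extractive_summary(sample, num_sentences=1))
```

An abstractive summarizer, by contrast, would generate new sentences rather than return one of the originals.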

We're going to run through four levels of summarization, starting with basic prompting and working up to summarizing large documents with the `map_reduce` method. These aren't the only options; feel free to modify them based on your use case.

**4 Levels Of Summarization:**
1. **Summarize a couple sentences** - Basic Prompt
2. **Summarize a couple paragraphs** - Prompt Templates
3. **Summarize a large document with multiple pages** - Map Reduce
4. **Summarize a book**

In this notebook we will demonstrate how to use **AI21 Summary API** for text summarization using a library of documents as a reference.

**This notebook serves as a template so that you can easily replace the example dataset with your own to build a custom text summarization application.**

## Deploy large language model (LLM) and embedding model in SageMaker JumpStart

Before you begin, deploy the AI21 Summarize model from SageMaker JumpStart and provide its endpoint name here. To do this, subscribe to the AI21 Summarize model, then click the `Open Notebook` option, which opens a notebook in Amazon SageMaker Studio. Run through that notebook to deploy the model, note the endpoint name, and return to this notebook.

In [None]:
!pip install --upgrade sagemaker
!pip install ipywidgets==7.0.0
!pip install langchain
!pip install faiss-cpu 
!pip install pytesseract
!pip install unstructured
!pip install transformers
!pip install pypdf

In [None]:
import time
import sagemaker, boto3, json
from sagemaker.session import Session
from sagemaker.model import Model
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base
from typing import Any, Dict, List, Optional
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.llms.sagemaker_endpoint import ContentHandlerBase
from langchain import PromptTemplate

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()
model_version = "*"
endpoint_name = 'summarize' # replace this with your endpoint name.

## Summarize a couple of sentences

In [None]:
prompt = """
Philosophy (from Greek: φιλοσοφία, philosophia, 'love of wisdom') \
is the systematized study of general and fundamental questions, \
such as those about existence, reason, knowledge, values, mind, and language. \
Some sources claim the term was coined by Pythagoras (c. 570 – c. 495 BCE), \
although this theory is disputed by some. Philosophical methods include questioning, \
critical discussion, rational argument, and systematic presentation.
"""

Next, we wrap our SageMaker LLM endpoint in `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`.

In [None]:
from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint

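class ContentHandler(LLMContentHandler):
    # Both the request and the response are JSON
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Optional[Dict] = None) -> bytes:
        # Wrap the text in the payload format sent to the summary endpoint
        input_str = json.dumps({
            "source": prompt,
            "sourceType": "TEXT"})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # `output` is a file-like response body: read it and return the "summary" field
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["summary"]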


content_handler = ContentHandler()

sm_llm = SagemakerEndpoint(
    endpoint_name=endpoint_name, ## add endpoint name for ai21 summary model
    region_name=aws_region,
    # model_kwargs=parameters,
    content_handler=content_handler,
)
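
To see what the content handler does without calling the endpoint, here is a small standalone sketch of the same request and response handling. The helper names `build_request` and `parse_response` and the fake response body are assumptions for illustration; only the `source`, `sourceType`, and `summary` fields come from the handler above.

```python
import io
import json


def build_request(text: str) -> bytes:
    """Encode text in the JSON shape that transform_input sends."""
    return json.dumps({"source": text, "sourceType": "TEXT"}).encode("utf-8")


def parse_response(body) -> str:
    """Read a file-like response body and return its "summary" field."""
    return json.loads(body.read().decode("utf-8"))["summary"]


request = build_request("Philosophy is the study of fundamental questions.")
print(json.loads(request))

# Hypothetical response body, standing in for what the endpoint would return.
fake_body = io.BytesIO(
    json.dumps({"summary": "Philosophy studies big questions."}).encode("utf-8")
)
print(parse_response(fake_body))
```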

In [None]:
num_tokens = sm_llm.get_num_tokens(prompt)
print (f"Our prompt has {num_tokens} tokens")

In [None]:
output = sm_llm(prompt)
print (output)

In [None]:
prompt = """
Write a ~ 1 sentence summary of the following text:

TEXT:
Philosophy (from Greek: φιλοσοφία, philosophia, 'love of wisdom') \
is the systematized study of general and fundamental questions, \
such as those about existence, reason, knowledge, values, mind, and language. \
Some sources claim the term was coined by Pythagoras (c. 570 – c. 495 BCE), \
although this theory is disputed by some. Philosophical methods include questioning, \
critical discussion, rational argument, and systematic presentation.
"""

In [None]:
output = sm_llm(prompt)
print (output)

##  Summarize a couple paragraphs -  Prompt Templates

Prompt templates are a great way to dynamically place text within your prompts. They are like [python f-strings](https://realpython.com/python-f-strings/) but specialized for working with language models.
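
As a rough plain-Python analogy, a prompt template behaves like a string with a named placeholder that is filled in at call time. The sketch below uses `str.format` rather than LangChain, and the helper `build_prompt` and the sample essay text are hypothetical.

```python
# A string with a named placeholder, similar in spirit to a prompt template.
template = "Write a ~ 50 words summary of the following text:\n{essay}\n"


def build_prompt(essay: str) -> str:
    """Fill the {essay} placeholder with the text to summarize."""
    return template.format(essay=essay)


# Hypothetical essay text, used only for illustration.
example_prompt = build_prompt("Ideas come from noticing problems.")
print(example_prompt)
```

The same instructions can then be reused for any number of essays by changing only the value passed in.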

We're going to look at 2 short Paul Graham essays.

In [None]:
paul_graham_essays = ['data/PaulGrahamEssaySmall/getideas.txt', 'data/PaulGrahamEssaySmall/noob.txt']

essays = []

for file_name in paul_graham_essays:
    with open(file_name, 'r') as file:
        essays.append(file.read())

In [None]:
for i, essay in enumerate(essays):
    print (f"Essay #{i+1}: {essay[:300]}\n")

Next, let's create a prompt template that holds our instructions and a placeholder for the essay. In this example we ask for a summary of roughly 50 words.

In [None]:
template = """
Write a ~ 50 words summary of the following text:
{essay}
"""

prompt = PromptTemplate(
    input_variables=["essay"],
    template=template
)

In [None]:
for essay in essays:
    summary_prompt = prompt.format(essay=essay)
    
    num_tokens = sm_llm.get_num_tokens(summary_prompt)
    print (f"This prompt + essay has {num_tokens} tokens")
    
    summary = sm_llm(summary_prompt)
    
    print (f"Summary: {summary.strip()}")
    print ("\n")

## Summarize a large document with multiple pages - MapReduce

If you have multiple pages you'd like to summarize, you'll likely run into a token limit. Token limits won't always be a problem, but it is good to know how to handle them if you run into the issue.

The chain type "Map Reduce" is a method that helps with this. You first generate a summary of smaller chunks (that fit within the token limit) and then you get a summary of the summaries.

Check out [this video](https://www.youtube.com/watch?v=f9_BWhCI4Zo) for more information on how chain types work.
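
The two steps can be sketched in plain Python as follows. This is a simplified illustration, not LangChain's implementation: `map_reduce_summarize` is a hypothetical helper, and `first_sentence` is a stand-in for a real model call.

```python
import re
from typing import Callable, List


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    combined = "\n".join(partial_summaries)
    return summarize(combined)  # reduce step


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for a model call: keep only the first sentence."""
    return re.split(r"(?<=[.!?])\s+", text.strip())[0]


# Hypothetical chunks, used only for illustration.
chunks = [
    "Startups grow fast. They need users.",
    "Ideas come from problems. Founders notice them.",
]
print(map_reduce_summarize(chunks, first_sentence))
```

Note that the summarizer is called once per chunk and once more for the combined summaries, so the cost grows with the number of chunks.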

In [None]:
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
paul_graham_essay = 'data/PaulGrahamEssays/startupideas.txt'

with open(paul_graham_essay, 'r') as file:
    essay = file.read()

In [None]:
sm_llm.get_num_tokens(essay)

That's too many, so let's split our text into chunks that fit within the prompt limit. I'm going to use a chunk size of 10,000 characters.

> You can think of tokens as pieces of words used for natural language processing. For English text, **1 token is approximately 4 characters** or 0.75 words. As a point of reference, the collected works of Shakespeare are about 900,000 words or 1.2M tokens.

This means we should expect chunks of roughly 10,000 / 4 = ~2,500 tokens. This will vary, though, since each body of text or code is different.
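
The same back-of-the-envelope estimate can be written as a small helper. The name `estimate_tokens` is hypothetical, and the 4-characters-per-token figure is only the rule of thumb quoted above, not a property of any particular tokenizer.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text; real counts vary


def estimate_tokens(num_chars: int) -> int:
    """Rough token estimate from a character count."""
    return math.ceil(num_chars / CHARS_PER_TOKEN)


print(estimate_tokens(10_000))  # 2500
```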

In [None]:
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=10000, chunk_overlap=500)

docs = text_splitter.create_documents([essay])

In [None]:
num_docs = len(docs)

num_tokens_first_doc = sm_llm.get_num_tokens(docs[0].page_content)

print (f"Now we have {num_docs} documents and the first one has {num_tokens_first_doc} tokens")

Great, assuming the number of tokens is consistent across the other docs, we should be good to go. Let's use LangChain's [load_summarize_chain](https://python.langchain.com/en/latest/use_cases/summarization.html) function with the `map_reduce` chain type for summarization. We first need to initialize our chain.

In [None]:
summary_chain = load_summarize_chain(llm=sm_llm, chain_type='map_reduce',
                                     verbose=True # Set verbose=True if you want to see the prompts being used
                                    )

In [None]:
output = summary_chain.run(docs)

In [None]:
summaries = output.split('\n')
for summary in summaries: 
    print('- '+summary)

This summary is a great start, but let's modify the chain to get only the key points in the summary.

To do this, we will use custom prompts (like we did above) to instruct the model on what we need. Please note that the prompt format used in this notebook is based on Flan-T5, taken from this [source](https://huggingface.co/jordiclive/flan-t5-11b-summarizer-filtered?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct).

The map_prompt is going to stay the same (just showing it for clarity), but I'll edit the combine_prompt.

In [None]:
map_prompt = """
Write a ~ 500 word summary of the following text:
"{text}"
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

In [None]:
combine_prompt = """
Cover only  the key points of the text.
{text}
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

In [None]:
summary_chain_key_points = load_summarize_chain(llm=sm_llm,
                                     chain_type='map_reduce',
                                     map_prompt=map_prompt_template,
                                     combine_prompt=combine_prompt_template,
                                     # verbose=True
                                    )

The next cell summarizes all of the split documents (chunks), which can take a few minutes. If you need to save time or memory on the notebook instance, pass only a subset of them, for example `docs[:15]`.

In [None]:
output_key_points = summary_chain_key_points.run(docs)

In [None]:
summaries = output_key_points.split('\n')
for summary in summaries: 
    print('- '+summary)

## Summarize a book

In [None]:
# Loaders
from langchain.document_loaders import PyPDFLoader
from langchain.schema import Document

# Load the book
loader = PyPDFLoader("data/book/IntoThinAirBook.pdf")
pages = loader.load()

# Print the number of pages
print('number of pages: ', len(pages))
# Cut out the opening pages by skipping the first 28
pages = pages[28:]

# Combine the pages, and replace the tabs with spaces
text = ""

for page in pages:
    text += page.page_content
    
text = text.replace('\t', ' ')

In [None]:
num_tokens = sm_llm.get_num_tokens(text)

print (f"This book has {num_tokens} tokens in it")

Note that the AI21 Summarize model can take a chunk size of up to 40k, so we divide the book into chunks of 20,000 characters with an overlap of 3,000 characters.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", "\t"], chunk_size=20000, chunk_overlap=3000)

docs = text_splitter.create_documents([text])
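
To illustrate how chunk size and overlap interact, here is a simplified sketch that cuts text into fixed-size windows. It is an assumption-laden stand-in: `RecursiveCharacterTextSplitter` prefers to break on the given separators, so its actual chunk boundaries and counts will differ. The helper name `chunk_with_overlap` is hypothetical.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


demo = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)  # ['abcd', 'defg', 'ghij']
```

With a chunk size of 20,000 and an overlap of 3,000, each new window in this simplified scheme would advance by 17,000 characters.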

In [None]:
num_docs = len(docs)

num_tokens_first_doc = sm_llm.get_num_tokens(docs[0].page_content)

print (f"Now we have {num_docs} documents and the first one has {num_tokens_first_doc} tokens")

In [None]:
map_prompt = """
"{text}"
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

In [None]:
map_chain = load_summarize_chain(llm=sm_llm,
                             chain_type="stuff",
                             prompt=map_prompt_template)

In [None]:
# Make an empty list to hold your summaries
summary_list = []

# Loop through your selected docs
for i, doc in enumerate(docs):
    
    # Go get a summary of the chunk
    chunk_summary = map_chain.run([doc])
    
    # Append that summary to your list
    summary_list.append(chunk_summary)
    
    # print (f"Summary #{i+1} - Preview: {chunk_summary[:250]} \n")

In [None]:
summaries = "\n".join(summary_list)

# Convert it back to a document
summaries = Document(page_content=summaries)

print (f"Your total summary has {sm_llm.get_num_tokens(summaries.page_content)} tokens")

In [None]:
combine_prompt = """
"{text}"
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

In [None]:
reduce_chain = load_summarize_chain(llm=sm_llm,
                             chain_type="stuff",
                             prompt=combine_prompt_template,
#                              verbose=True # Set this to true if you want to see the inner workings
                                   )

In [None]:
output = reduce_chain.run([summaries])

In [None]:
key_points = output.split('\n')
for key_point in key_points: 
    print('- '+key_point)

## Clean Up
*NOTE:* Please make sure to delete the endpoint when you are no longer using it; otherwise it will continue to incur charges.

In [None]:
# # Specify the name of your endpoint
# endpoint_name_llm = "summarize"

# # Create a low-level SageMaker service client
# sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# # Delete endpoint configuration (assumes it has the same name as the endpoint)
# sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name_llm)

# # Delete endpoint
# sagemaker_client.delete_endpoint(EndpointName=endpoint_name_llm)
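
If the endpoint configuration does not share the endpoint's name, one option is to look the name up before deleting. The sketch below is a hedged example: `delete_endpoint_and_config` is a hypothetical helper that takes a boto3 SageMaker client and uses its `describe_endpoint`, `delete_endpoint`, and `delete_endpoint_config` calls.

```python
def delete_endpoint_and_config(sagemaker_client, endpoint_name: str) -> str:
    """Delete an endpoint and the endpoint configuration it points to."""
    # Look up the configuration name first; it can differ from the endpoint name.
    description = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = description["EndpointConfigName"]

    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    sagemaker_client.delete_endpoint_config(EndpointConfigName=config_name)
    return config_name


# Usage (assumes AWS credentials and an existing endpoint named "summarize"):
# import boto3
# client = boto3.client("sagemaker")
# delete_endpoint_and_config(client, "summarize")
```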