### Document Summarization Application with the Open-Source Llama2 LLM and LangChain on SageMaker JumpStart

* Author : Dipjyoti Das
* Last Edited : Jan 26, 2024
* This notebook shows how to use SageMaker JumpStart for a text summarization use case on a public Fannie Mae PDF document. It uses the fine-tuned, open-source Llama2-7b-chat model from the JumpStart model hub together with LangChain.

#### Prerequisites
* The AWS Innovation Sandbox should be installed and a Domain created in SageMaker.

* Before deploying a model endpoint on an instance, you have to request a Service Quota limit increase for that particular instance type. Check the form link on the Confluence page. It might take up to a day for the request to be approved for the associated SageMaker account.

* Upload the example PDF document ('6_extracted_FM-esg-report-2022.pdf') from Confluence to your SageMaker JupyterLab folder.

#### Minimum Instance sizes for the following Llama2 Foundational models in Jumpstart:
* Llama2-7b-chat : ml.g5.2xlarge
* Llama2-13b-chat : ml.g5.12xlarge
* Llama2-70b-chat : ml.g5.48xlarge or ml.p4d.24xlarge
* Please request a service quota increase for the instance type associated with the SageMaker account.
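
The instance sizes listed above can be kept in a small lookup table so that deployment code fails early when an undersized instance is chosen. This is only a minimal sketch: `MIN_INSTANCE_TYPES` and `minimum_instances` are hypothetical names for illustration and are not part of the SageMaker SDK.

```python
from typing import Dict, List

# Minimum instance types per model, copied from the list above.
MIN_INSTANCE_TYPES: Dict[str, List[str]] = {
    "llama2-7b-chat": ["ml.g5.2xlarge"],
    "llama2-13b-chat": ["ml.g5.12xlarge"],
    "llama2-70b-chat": ["ml.g5.48xlarge", "ml.p4d.24xlarge"],
}


def minimum_instances(model_name: str) -> List[str]:
    """Return the minimum instance type(s) for a model name (hypothetical helper)."""
    key = model_name.strip().lower()
    if key not in MIN_INSTANCE_TYPES:
        raise KeyError(f"No minimum instance size recorded for '{model_name}'")
    return MIN_INSTANCE_TYPES[key]


print(minimum_instances("Llama2-7b-chat"))
```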


#### Endpoint Names of the deployed Model:

In [61]:
# Get the Endpoint name and InferenceComponentName after you deploy the llama2_7b_chat model from Jumpstart model hub on an instance (ml.g5.2xlarge). 
# Navigate to Endpoint summary -> Test Inference -> Testing Options ->  Use Python SDK example code -> Example inference request

llama2_7b_chat_endpoint_name = 'jumpstart-dft-meta-textgeneration-l-20240126-180653'
llama2_7b_chat_InferenceComponentName = 'meta-textgeneration-llama-2-7b-f-20240126-180653'

# Get the region name from the Sagemaker account
region_name = "us-east-1"

#### Install and import the required Libraries with Langchain

In [62]:
# Import the Boto3 and JSON modules
import json
import boto3
print(boto3.__version__)

import warnings
warnings.filterwarnings('ignore')

1.34.27


In [3]:
#!pip install -q transformers pypdfium2 accelerate langchain

In [63]:
# Import the LangChain modules used to load the long document, break it into chunks, and summarize it:
import langchain
from langchain import SagemakerEndpoint, PromptTemplate
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain import LLMChain

from langchain.document_loaders import TextLoader
from langchain.docstore.document import Document
from langchain.document_loaders import PyPDFium2Loader
import transformers
import torch

In [64]:
# Get the filepath of the Fannie Mae public document uploaded to the SageMaker directory:
pdf_filepath = '/home/sagemaker-user/6_extracted_FM-esg-report-2022.pdf'

In [65]:
# Function to load a PDF document and split it into chunks using the CharacterTextSplitter class from LangChain
def pdf_output(pdf_filepath):
    loader = PyPDFium2Loader(pdf_filepath)
    data = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts_FM1 = text_splitter.split_documents(data)
    return texts_FM1
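
To build intuition for what `chunk_size=1000` and `chunk_overlap=0` mean, the sketch below splits text on a separator and greedily merges the pieces into chunks of at most `chunk_size` characters. It is a simplified illustration and not LangChain's actual implementation; `split_by_separator` is a hypothetical helper, and the `"\n\n"` separator is an assumption for the example.

```python
from typing import List


def split_by_separator(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> List[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut.
    There is no overlap between consecutive chunks.
    """
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
print(split_by_separator(sample, chunk_size=10))
```

One consequence worth noting: when the text contains no separator at all, this kind of splitter cannot cut it, so a chunk may end up longer than `chunk_size`.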

In [66]:
# To see the contents of the pdf document:
pdf_output(pdf_filepath)

[Document(page_content='About Fannie Mae\r\nWho we are\r\nThe Federal National Mortgage Association, better known as \r\nFannie Mae, is a purpose-driven company by charter and by \r\nchoice. Our business supports mortgage lenders by providing \r\nmortgage financing to help people buy or rent a home. We help \r\nmake the popular 30-year fixed-rate mortgage possible, enabling \r\npredictable mortgage payments over the life of the loan and \r\ngiving homeowners stability and peace of mind. \r\nOur charter, an act of Congress, establishes our purposes: \r\nto provide liquidity and stability to the residential mortgage \r\nmarket and to promote access to mortgage credit. This mandate \r\nincludes facilitating mortgages on housing for low- and \r\nmoderate-income families involving a reasonable economic \r\nreturn that may be less than the return earned on other \r\nactivities. Congress declared that our operations should be \r\nfinanced by private capital to the maximum extent feasible. Wit

#### Using Langchain for Text Summarization:

* Check :  https://python.langchain.com/docs/integrations/llms/sagemaker

In [67]:
# To make LangChain work effectively with Llama models, we need to define the default Content Handler classes for valid input and output:

class ContentHandlerTextSummarization(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        input_str = json.dumps({"inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        generated_text = response_json[0]['generated_text']
        return generated_text.split("summary:")[-1]
    
content_handler = ContentHandlerTextSummarization()
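
The content handler can be checked offline, without calling the endpoint. The sketch below mirrors the two methods as plain functions and feeds them a fake response body. The response shape (`[{"generated_text": ...}]`) is an assumption taken from what the handler above expects, and `io.BytesIO` merely stands in for the streaming body returned by the endpoint.

```python
import io
import json


def transform_input(prompt: str, model_kwargs: dict) -> bytes:
    # Same payload layout as the handler above: prompt plus generation parameters.
    return json.dumps({"inputs": prompt, **model_kwargs}).encode("utf-8")


def transform_output(output) -> str:
    # `output` only needs a .read() method that returns bytes.
    response_json = json.loads(output.read().decode("utf-8"))
    generated_text = response_json[0]["generated_text"]
    return generated_text.split("summary:")[-1]


payload = transform_input("Summarize this.", {"max_new_tokens": 64})
print(json.loads(payload))

# Fake response body (assumed shape), used only for this offline check.
fake_body = io.BytesIO(json.dumps([{"generated_text": "summary: one bullet"}]).encode("utf-8"))
print(repr(transform_output(fake_body)))

# str.split is case-sensitive, so an upper-case "SUMMARY:" marker is left in place.
upper_body = io.BytesIO(json.dumps([{"generated_text": "BULLET POINT SUMMARY: x"}]).encode("utf-8"))
print(repr(transform_output(upper_body)))
```

If the model echoes the prompt, the lower-case `"summary:"` marker will not strip the upper-case `BULLET POINT SUMMARY:` label, so the marker and the prompt wording may need to be aligned.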

* Define the model and its parameters using the SagemakerEndpoint class from LangChain

In [68]:
summary_model_llm = SagemakerEndpoint(
    endpoint_name=llama2_7b_chat_endpoint_name,
    region_name=region_name,
    model_kwargs={"max_new_tokens": 2000, "top_p": 0.9, "temperature": 0.6,
                  "top_k": 10, "do_sample": True, "max_length": 1000},
    endpoint_kwargs={"CustomAttributes": "accept_eula=true",
                     "InferenceComponentName": llama2_7b_chat_InferenceComponentName},
    content_handler=content_handler,
)

In [69]:
# Write the custom Prompt Template:
template = """
              Write a summary of the following text delimited by triple backquotes.
              Return your response in bullet points which covers the key points of the text.
              ```{text}```
              BULLET POINT SUMMARY:
           """

prompt = PromptTemplate(template=template, input_variables=["text"])
print(prompt)

input_variables=['text'] template='\n              Write a summary of the following text delimited by triple backquotes.\n              Return your response in bullet points which covers the key points of the text.\n              ```{text}```\n              BULLET POINT SUMMARY:\n           '
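
To see the text that is actually sent to the model, the template above can be filled in with plain Python. This sketch uses `str.format` in place of `PromptTemplate` so that it runs without LangChain; the sample sentence is made up, and the backquote delimiter is built from a variable only to keep this example readable.

```python
# Three backquotes, built programmatically for readability of this example.
FENCE = "`" * 3

template = "\n".join([
    "Write a summary of the following text delimited by triple backquotes.",
    "Return your response in bullet points which covers the key points of the text.",
    FENCE + "{text}" + FENCE,
    "BULLET POINT SUMMARY:",
])

# Substitute the {text} placeholder with a made-up sample sentence.
filled = template.format(text="Fannie Mae supports mortgage lenders.")
print(filled)
```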


In [70]:
# Build the LLMChain from the prompt and the SageMaker endpoint LLM:
llm_chain = LLMChain(prompt=prompt, llm=summary_model_llm)

In [73]:
# Result of the Llama2-7b-chat model
print(llm_chain.run(pdf_output(pdf_filepath)))

 • Fannie Mae is a purpose-driven company by charter and by choice,


In [None]:
# The summary may not be exactly what you are looking for. Try tuning the Llama2-7b-chat model parameters, or test with
# the larger Llama2-13b-chat or Llama2-70b-chat models.

#### Putting it all together in a function:

In [75]:
def summarize_pdf(pdf_filepath):
    loader = PyPDFium2Loader(pdf_filepath)
    data = loader.load()

    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    text_FM = text_splitter.split_documents(data)

    template = """
              Write a concise summary of the following text delimited by triple backquotes.
              Return your response in bullet points which covers the key points of the text.
              ```{text}```
              BULLET POINT SUMMARY:
           """

    prompt = PromptTemplate(template=template, input_variables=["text"])

    llm_chain = LLMChain(prompt=prompt, llm=summary_model_llm)

    return llm_chain.run(text_FM)

In [76]:
# result of Llama2-7b-chat model
summarize_pdf(pdf_filepath = pdf_filepath)

' • Fannie Mae is a purpose-driven company by charter and by choice.'
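
`llm_chain.run(text_FM)` passes the whole list of chunks into a single prompt. A common alternative, which this notebook does not use, is a map-reduce pattern: summarize each chunk on its own, then summarize the combined partial summaries. The sketch below shows only the control flow; `map_reduce_summary` and `summarize_fn` are hypothetical names, and `summarize_fn` stands in for any call that turns a text into a summary.

```python
from typing import Callable, List


def map_reduce_summary(chunks: List[str], summarize_fn: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the joined partial summaries (reduce)."""
    partial = [summarize_fn(chunk) for chunk in chunks if chunk.strip()]
    if not partial:
        return ""
    if len(partial) == 1:
        return partial[0]  # nothing to combine
    return summarize_fn("\n".join(partial))


# Stub summarizer for demonstration only: keeps the first sentence of the text.
def first_sentence(text: str) -> str:
    return text.split(".")[0].strip() + "."


print(map_reduce_summary(
    ["Fannie Mae supports lenders. It was chartered by Congress.", "It promotes access to credit. More text."],
    first_sentence,
))
```

In this notebook, `summarize_fn` could wrap `llm_chain.run` and the chunks could be `[d.page_content for d in pdf_output(pdf_filepath)]`. LangChain also provides this pattern through `load_summarize_chain` with `chain_type="map_reduce"`, which is imported above but not used.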

#### Using the Gradio UI:

In [59]:
#!pip install gradio

In [77]:
import gradio as gr
print(gr.__version__)

4.15.0


In [78]:
# Get the filepath of the uploaded PDF document in the SageMaker JupyterLab workspace:
print(pdf_filepath)

/home/sagemaker-user/6_extracted_FM-esg-report-2022.pdf


In [None]:
# Enter the pdf_filepath printed above into the Gradio textbox labeled 'Provide the PDF file path' to see the summarized output.

In [79]:
input_pdf_path = gr.components.Textbox(label="Provide the PDF file path")
output_summary = gr.components.Textbox(label="Summary")

interface = gr.Interface(
    fn=summarize_pdf,
    inputs=input_pdf_path,
    outputs=output_summary,
    title="PDF Summarizer",
    description="Provide PDF file path to get the summary."
).queue().launch(share=True, debug = False)

Running on local URL:  http://127.0.0.1:7867
Running on public URL: https://fff323c07ca7ee6033.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
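
Because the textbox accepts any string, it may help to validate the path before it reaches the summarizer, so the UI shows a readable message instead of a stack trace. A minimal sketch, assuming a hypothetical wrapper named `safe_summarize`; in this notebook `summarize_fn` would be `summarize_pdf`.

```python
import os
from typing import Callable


def safe_summarize(pdf_path: str, summarize_fn: Callable[[str], str]) -> str:
    """Validate the user-supplied path, then delegate to the summarizer (hypothetical helper)."""
    path = (pdf_path or "").strip()
    if not path.lower().endswith(".pdf"):
        return "Please provide a path ending in .pdf."
    if not os.path.isfile(path):
        return f"File not found: {path}"
    return summarize_fn(path)


# Demonstration with a stub summarizer; no model call is made here.
print(safe_summarize("notes.txt", lambda p: "summary"))
```

Such a wrapper could be passed to `gr.Interface` with `fn=lambda p: safe_summarize(p, summarize_pdf)`.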


### Don't forget: delete the endpoints from the Llama2-7b-chat model notebook, and confirm the deletion in the JumpStart deployment UI
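
The cleanup can also be scripted with the Boto3 SageMaker client. The sketch below is an assumption-laden outline rather than the procedure used for this notebook: `delete_deployment` is a hypothetical helper, and the client is passed in so the function can be exercised with a stand-in object. Deleting an inference component may take some time to complete, so the endpoint deletion may need to be retried.

```python
def delete_deployment(sm_client, endpoint_name: str, inference_component_name: str) -> None:
    """Delete the inference component first, then the endpoint that hosts it."""
    sm_client.delete_inference_component(InferenceComponentName=inference_component_name)
    sm_client.delete_endpoint(EndpointName=endpoint_name)


# Usage with the names defined at the top of this notebook (not executed here):
#   import boto3
#   sm_client = boto3.client("sagemaker", region_name=region_name)
#   delete_deployment(sm_client, llama2_7b_chat_endpoint_name, llama2_7b_chat_InferenceComponentName)
```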