# Document Understanding Solution - Document Summarization


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)

---


This solution consists of comprehensive notebooks for five natural language processing tasks: Document Summarization, Text Classification, Question Answering, Named Entity Recognition, and Semantic Relation Extraction. To view the other notebooks, click [Text Classification](./2_text_classification.ipynb), [Question Answering](./3_question_answering.ipynb), [Named Entity Recognition](./4_entity_recognition.ipynb), and [Semantic Relation Extraction](./5_relationship_extraction.ipynb).

Let's first look at Document Summarization. Summarization is useful when you want to distill the information found in
a large amount of text down to a few sentences. We use an
'extractive' summarization method in this notebook, that extracts the
most important sentences from the document verbatim. We don't cover
'abstractive' summarization here, because it's a lot more challenging and
error prone to generate new sentences that summarize the document.

In this notebook, we deploy and use a document summarization model
[T5-base](https://huggingface.co/t5-base) from the [transformers](https://huggingface.co/transformers/) library. Next, we send an example article to the deployed endpoint to get a response (the summarization result) and compute the [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) evaluation metric to compare the input article and the summarization result.


**Note**: When running this notebook on SageMaker Studio, you should make
sure the `SageMaker JumpStart PyTorch 1.0` image/kernel is used. When
running this notebook on a SageMaker Notebook Instance, you should make
sure the `sagemaker-soln` kernel is used.

Install the packages required to run this notebook.

In [None]:
!pip install -U sagemaker ipywidgets datasets nltk rouge_score

We start by importing a variety of packages that are used throughout
the notebook. One of the most important packages is the Amazon SageMaker
Python SDK (i.e. `import sagemaker`). We also import modules from our own
custom (and editable) package that can be found at `../package`.

In [None]:
import boto3
import sagemaker
from sagemaker.pytorch import PyTorchModel
import sys
import config
import uuid

Up next, we create a SageMaker client (from `boto3`) and a SageMaker
session, and look up the default S3 bucket and the execution role. We can
use the SageMaker client to call SageMaker APIs directly, as an
alternative to using the Amazon SageMaker SDK, for example to delete
resources that are created in this notebook.

In [None]:
sagemaker_client = boto3.client("sagemaker")
sagemaker_session = sagemaker.Session()
DEFAULT_BUCKET = sagemaker_session.default_bucket()
IAM_ROLE = sagemaker.get_execution_role()

Our text summarization system needs a machine learning model. In this
section, we deploy a model to an Amazon SageMaker Endpoint and then
invoke the endpoint from the notebook. We use a pre-trained model from
the [transformers](https://huggingface.co/transformers/) library instead
of training a model from scratch, specifically the T5 Base model.

We use the unique solution prefix to name the model and endpoint.

In [None]:
unique_hash = str(uuid.uuid4())[:6]
endpoint_name = f"{config.SOLUTION_PREFIX}-{unique_hash}-summarization-endpoint"
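The naming scheme above can be wrapped in a small helper that also checks the result before it is sent to SageMaker. This is a minimal sketch: the 63-character limit and the allowed-character pattern are assumptions based on the documented constraints for endpoint names, and `sagemaker-soln-du` is a hypothetical stand-in for `config.SOLUTION_PREFIX`.

```python
import re
import uuid

# Assumed constraints for SageMaker endpoint names: at most 63 characters,
# alphanumerics and hyphens only, no leading or trailing hyphen.
MAX_NAME_LENGTH = 63
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")


def build_endpoint_name(prefix, suffix="summarization-endpoint"):
    """Return '<prefix>-<6 hex chars>-<suffix>', or raise if the name looks invalid."""
    unique_hash = uuid.uuid4().hex[:6]
    name = f"{prefix}-{unique_hash}-{suffix}"
    if len(name) > MAX_NAME_LENGTH or not NAME_PATTERN.match(name):
        raise ValueError(f"Invalid endpoint name: {name!r}")
    return name


# "sagemaker-soln-du" is a hypothetical stand-in for config.SOLUTION_PREFIX.
example_name = build_endpoint_name("sagemaker-soln-du")
print(example_name)
```

Failing early on an over-long or malformed name is usually easier to debug than a validation error returned during deployment.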

Up next, we need to define the Amazon SageMaker Model, which references
the source code and specifies which container to use. Our pre-trained
model is from the transformers library, which uses PyTorch. As a result,
we should use the `PyTorchModel` from the Amazon SageMaker Python SDK.
Using `PyTorchModel` and setting the `framework_version` argument means that
our deployed model runs inside a container that has PyTorch
pre-installed. Other requirements can be installed by defining a
`requirements.txt` file at the specified `source_dir` location. We use the
`entry_point` argument to reference the code (within `source_dir`) that
should be run for model inference: functions called `model_fn`, `input_fn`,
`predict_fn` and `output_fn` are expected to be defined. Lastly, you can
pass `model_data` from a training job, but we are going to load the
pre-trained model in the source code running on the endpoint. We still
need to provide `model_data`, so we pass an empty archive.
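To make the role of the four handler functions concrete, here is a minimal, hypothetical entry point. It is not the solution's actual `entry_point.py`: the `FirstSentenceSummarizer` class is a toy stand-in for the T5 model so that the sketch runs without PyTorch, and the last few lines only simulate the order in which the serving stack is assumed to call the handlers.

```python
import json


class FirstSentenceSummarizer:
    """Toy stand-in for a real model: returns the first sentence of the text."""

    def summarize(self, text):
        first, _, _ = text.strip().partition(". ")
        return first if first.endswith(".") else first + "."


def model_fn(model_dir):
    # Called once when the container starts. A real implementation would
    # load model weights here (model_dir holds the unpacked model_data).
    return FirstSentenceSummarizer()


def input_fn(request_body, content_type):
    # Turn the raw request payload into a Python object.
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(request_body)


def predict_fn(data, model):
    # Run inference on the deserialized request.
    return {"summary": model.summarize(data["text"])}


def output_fn(prediction, accept):
    # Serialize the prediction for the HTTP response.
    if accept != "application/json":
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps(prediction)


# Simulate one request passing through the handlers in order.
toy_model = model_fn("/opt/ml/model")
request = json.dumps({"text": "Documents are everywhere. They are hard to parse."})
payload = input_fn(request, "application/json")
response_body = output_fn(predict_fn(payload, toy_model), "application/json")
print(response_body)
```

Keeping loading, parsing, inference and serialization in separate functions means each step can be tested locally before the code is deployed to an endpoint.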

In [None]:
!aws s3 cp --recursive s3://sagemaker-solutions-prod-us-east-2/0.2.0/Document-understanding/3.0.1/artifacts/models/summarization data

In [None]:
!aws s3 cp --recursive data s3://$DEFAULT_BUCKET/document-understanding/summarization

In [None]:
!aws s3 cp s3://sagemaker-solutions-prod-us-east-2/0.2.0/Document-understanding/3.0.1/artifacts/models/empty.tar.gz .

In [None]:
!aws s3 cp empty.tar.gz s3://$DEFAULT_BUCKET/document-understanding/models/
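The `empty.tar.gz` placeholder is assumed to be nothing more than a gzip-compressed tar archive without any members. If the solution bucket is not reachable, an equivalent file could be produced locally with the standard library, as in this sketch (the temporary directory is only used so that the example does not overwrite anything in the working folder):

```python
import os
import tarfile
import tempfile


def make_empty_archive(path):
    """Write a gzip-compressed tar archive that contains no members."""
    with tarfile.open(path, "w:gz"):
        pass  # opening and closing in write mode yields a valid, empty archive
    return path


archive_path = make_empty_archive(os.path.join(tempfile.mkdtemp(), "empty.tar.gz"))

# Read the archive back to confirm that it is valid and has no members.
with tarfile.open(archive_path, "r:gz") as archive:
    print(archive.getnames())
```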

In [None]:
model = PyTorchModel(
    model_data=f"s3://{DEFAULT_BUCKET}/document-understanding/models/empty.tar.gz",
    entry_point="entry_point.py",
    source_dir="containers/summarization",
    role=IAM_ROLE,
    framework_version="1.5.0",
    py_version="py3",
    code_location="s3://" + DEFAULT_BUCKET + "/code",
    env={
        "MODEL_ASSETS_S3_BUCKET": DEFAULT_BUCKET,
        "MODEL_ASSETS_S3_PREFIX": "document-understanding/summarization/",
        "MMS_DEFAULT_RESPONSE_TIMEOUT": "3000",
    },
)

Using this Amazon SageMaker Model, we can deploy an HTTPS endpoint on a
dedicated instance. We choose to deploy the endpoint on a single
ml.p3.2xlarge instance (or ml.g4dn.2xlarge if unavailable in this
region). Our summarization model is a transformer that benefits from GPU
acceleration, and an ml.p3.2xlarge has a high-performance NVIDIA V100 GPU
that can reduce inference latency on each request. You can expect this
deployment step to take around 5 minutes. After approximately 15 dashes,
you can expect to see an exclamation mark, which indicates a successful
deployment.

In [None]:
import time
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = model.deploy(
    endpoint_name=endpoint_name,
    instance_type=config.HOSTING_INSTANCE_TYPE,
    initial_instance_count=1,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

time.sleep(10)
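By default `model.deploy` blocks until the endpoint is ready, but it can be useful to check the status explicitly, for example after reconnecting to a notebook. The sketch below polls `describe_endpoint` and reads the `EndpointStatus` field. The `FakeSageMakerClient` class is a hypothetical stand-in used only so that the example runs without AWS credentials; in the notebook you would pass `sagemaker_client` instead.

```python
import time


def wait_until_in_service(client, endpoint_name, poll_seconds=30, max_polls=40):
    """Poll describe_endpoint until the endpoint reports 'InService'."""
    for _ in range(max_polls):
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not ready after {max_polls} polls")


class FakeSageMakerClient:
    """Hypothetical stand-in for boto3.client('sagemaker'), for offline use."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {"EndpointName": EndpointName, "EndpointStatus": next(self._statuses)}


fake_client = FakeSageMakerClient(["Creating", "Creating", "InService"])
final_status = wait_until_in_service(fake_client, "demo-endpoint", poll_seconds=0)
print(final_status)
```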

When you're trying to update the model for development purposes, but
experiencing issues because the model/endpoint-config/endpoint already
exists, you can delete the existing model/endpoint-config/endpoint by
uncommenting and running the following commands:

In [None]:
# sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
# sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name)

When calling our new endpoint from the notebook, we use an Amazon
SageMaker SDK
[`Predictor`](https://sagemaker.readthedocs.io/en/stable/predictors.html).
A `Predictor` is used to send data to an endpoint (as part of a request),
and interpret the response. Our `model.deploy` command returned a
`Predictor` but, by default, it sends and receives numpy arrays. Our
endpoint expects to receive (and also sends) JSON formatted objects, so
we configured the `Predictor` (through the `serializer` and `deserializer`
arguments above) to use JSON instead of the PyTorch endpoint default of
numpy arrays. JSON is used here because it is a standard endpoint format
and the endpoint response can contain nested data structures.
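For callers that do not use the SageMaker SDK, roughly the same JSON exchange can be done with the low-level `sagemaker-runtime` client and its `invoke_endpoint` call. The following is a sketch under assumptions: the `{"text": ...}` request and `{"summary": ...}` response shapes follow this notebook, and `FakeRuntimeClient` is a hypothetical stand-in that lets the example run offline. With a deployed endpoint you would pass `boto3.client("sagemaker-runtime")` and the real endpoint name.

```python
import io
import json


def summarize_via_runtime(runtime_client, endpoint_name, text):
    """Send a JSON request to the endpoint and decode the JSON response."""
    result = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps({"text": text}).encode("utf-8"),
    )
    # The response body is a stream of bytes that holds the JSON document.
    return json.loads(result["Body"].read().decode("utf-8"))


class FakeRuntimeClient:
    """Hypothetical stand-in for boto3.client('sagemaker-runtime')."""

    def invoke_endpoint(self, EndpointName, ContentType, Accept, Body):
        text = json.loads(Body)["text"]
        reply = json.dumps({"summary": text[:20]}).encode("utf-8")
        return {"Body": io.BytesIO(reply), "ContentType": Accept}


fake_response = summarize_via_runtime(
    FakeRuntimeClient(), "demo-endpoint", "Amazon Textract extracts text from documents."
)
print(fake_response["summary"])
```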

With our model successfully deployed and our predictor configured, we can
try out the summarizer on example inputs. All we need to do is
construct a dictionary object with a single key called `text` and provide
the input string. We call `predict` on our predictor and we should
get a response from the endpoint that contains the summary.

In [None]:
ARTICLE = """ Documents are a primary tool for communication,
collaboration, record keeping, and transactions across industries,
including financial, medical, legal, and real estate. The format of data
can pose an extra challenge in data extraction, especially if the content
is typed, handwritten, or embedded in a form or table. Furthermore,
extracting data from your documents is manual, error-prone, time-consuming,
expensive, and does not scale. Amazon Textract is a machine learning (ML)
service that extracts printed text and other data from documents as well as
tables and forms. We’re pleased to announce two new features for Amazon
Textract: support for handwriting in English documents, and expanding
language support for extracting printed text from documents typed in
Spanish, Portuguese, French, German, and Italian. Many documents, such as
medical intake forms or employment applications, contain both handwritten
and printed text. The ability to extract text and handwriting has been a
need our customers have asked us for. Amazon Textract can now extract
printed text and handwriting from documents written in English with high
confidence scores, whether it’s free-form text or text embedded in tables
and forms. Documents can also contain a mix of typed text or handwritten
text. The following image shows an example input document containing a mix
of typed and handwritten text, and its converted output document. You can
log in to the Amazon Textract console to test out the handwriting feature,
or check out the new demo by Amazon Machine Learning Hero Mike Chambers.
Not only can you upload documents with both printed text and handwriting,
you can also use Amazon Augmented AI (Amazon A2I), which makes it easy to
build workflows for a human review of the ML predictions. Adding in Amazon
A2I can help you get to market faster by having your employees or AWS
Marketplace contractors review the Amazon Textract output for sensitive
workloads. For more information about implementing a human review, see
Using Amazon Textract with Amazon Augmented AI for processing critical
documents. If you want to use one of our AWS Partners, take a look at how
Quantiphi is using handwriting recognition for their customers.
Additionally, we’re pleased to announce our language expansion. Customers
can now extract and process documents in more languages. Amazon Textract
now supports processing printed documents in Spanish, German, Italian,
French, and Portuguese. You can send documents in these languages,
including forms and tables, for data and text extraction, and Amazon
Textract automatically detects and extracts the information for you. You
can simply upload the documents on the Amazon Textract console or send them
using either the AWS Command Line Interface (AWS CLI) or AWS SDKs."""

In [None]:
data = {"text": ARTICLE}
response = predictor.predict(data=data)

We have the response, and we can print out the summary that has been
extracted from the text above.

In [None]:
print(response["summary"])

## Evaluate and compare the input article and summarization result

In [None]:
from datasets import load_metric

In [None]:
rouge = load_metric("rouge")

In [None]:
results = rouge.compute(predictions=[response["summary"]], references=[ARTICLE])

In [None]:
for each_rouge in results:
    print(f"{each_rouge}: {results[each_rouge]}\n")

`rougeN` measures the number of matching `n-grams` between our model-generated text (summarization result) and a `reference` (input text).

An n-gram is simply a grouping of tokens/words. A unigram (1-gram) would consist of a single word. A bigram (2-gram) consists of two consecutive words.
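The n-gram overlap behind `rougeN` can be sketched in a few lines. This is a simplified illustration rather than the implementation used by the `rouge_score` package: it tokenizes on whitespace and lowercases only, whereas the library applies its own tokenization, so the numbers will not match the library exactly on real text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Compute ROUGE-N precision, recall and F1 on lowercase whitespace tokens."""
    pred_counts = ngrams(prediction.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    # Counter intersection clips each n-gram to its minimum count in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


toy_reference = "the cat sat on the mat"
toy_prediction = "the cat sat"
print(rouge_n(toy_prediction, toy_reference, n=1))
print(rouge_n(toy_prediction, toy_reference, n=2))
```

In this toy example every n-gram of the prediction appears in the reference, so precision is 1.0, while recall is lower because the prediction covers only part of the reference. The same pattern is expected when a short summary is scored against the full input article.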

The metrics `rougeL` and `rougeLsum` measure the longest matching sequences of words by looking for the longest common subsequence (words that appear in the same order, though not necessarily next to each other) in the generated and reference summaries. The “sum” in rougeLsum refers to the fact that this metric is computed over a whole summary, while rougeL is computed as the average over individual sentences.
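ROUGE-L is built on the longest common subsequence (LCS): tokens that occur in both texts in the same order, not necessarily adjacent. The sketch below computes a sentence-level ROUGE-L score with a standard dynamic-programming LCS. It assumes lowercase whitespace tokenization and does not reproduce the sentence splitting that `rougeLsum` performs, so treat it as an illustration of the idea rather than a replacement for the library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(prediction, reference):
    """Compute LCS-based precision, recall and F1 on lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    precision = lcs / max(len(pred_tokens), 1)
    recall = lcs / max(len(ref_tokens), 1)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# All five predicted tokens match in order, even though "sat" is skipped.
print(rouge_l("the cat on the mat", "the cat sat on the mat"))
```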



For each metric described above, the library computes confidence intervals for precision, recall, and F1-score; these are the `low`, `mid`, and `high` attributes you can see here. For precision, recall, and F1-score alike, a higher value indicates a better result.

You can try more examples above, but note that this model has been
pretrained on a news dataset. You may need to fine-tune this model with
your own summarizations to obtain better results.

## Clean Up

When you've finished with the summarization endpoint (and associated
endpoint-config), make sure that you delete it to avoid accidental
charges.

In [None]:
# Delete the SageMaker endpoint and the attached resources
predictor.delete_model()
predictor.delete_endpoint()

## Next Stage

We've just looked at how to analyse documents from a high level. Up next
we look at a technique that can be used to classify each sentence in the document, called Text Classification.

[Click here to continue with Text Classification.](./2_text_classification.ipynb)

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/introduction_to_applying_machine_learning|identify_key_insights_from_textual_document|document_summarization.ipynb)
