# Summarize Scientific Documents with Amazon Comprehend and HuggingFace

Researchers must stay up to date in their fields of interest. However, it is difficult to keep track of the large number of journal articles, whitepapers, and research preprints published in many areas. In response, many research groups have turned to AI/ML tools to summarize and classify new documents.

In this workshop, we'll use several AWS AI/ML services to process scientific documents from the [NIH NCBI PMC Article Dataset](https://registry.opendata.aws/ncbi-pmc/) on the Registry of Open Data. This is a free full-text archive of biomedical and life sciences journal articles at the U.S. National Institutes of Health's National Library of Medicine.

This notebook requires the SageMaker execution role to have additional permissions to call Amazon Comprehend. If you are running it outside of an AWS-hosted workshop, ask your system administrator to grant them.
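To make the request to an administrator concrete, here is a rough sketch of an identity policy covering the two Comprehend calls this notebook makes in section 3, plus `iam:PassRole` for handing the role to Comprehend. The role ARN is a hypothetical placeholder, and this is an assumption about a typical setup rather than the workshop's actual policy. The role's trust policy and S3 permissions are not shown and may also need changes.

```python
import json

# Hypothetical placeholder: substitute the ARN of your own execution role.
EXECUTION_ROLE_ARN = "arn:aws:iam::111122223333:role/ExampleSageMakerExecutionRole"

comprehend_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # The two Comprehend API calls used later in this notebook
            "Sid": "RunTopicModelingJobs",
            "Effect": "Allow",
            "Action": [
                "comprehend:StartTopicsDetectionJob",
                "comprehend:DescribeTopicsDetectionJob",
            ],
            "Resource": "*",
        },
        {
            # Needed because the notebook passes its role as DataAccessRoleArn
            "Sid": "PassRoleToComprehend",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": EXECUTION_ROLE_ARN,
            "Condition": {
                "StringEquals": {"iam:PassedToService": "comprehend.amazonaws.com"}
            },
        },
    ],
}

print(json.dumps(comprehend_policy, indent=2))
```

Your account may grant these permissions differently, for example through a managed policy, so treat the document above as a starting point for discussion.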

This notebook was created and tested on an `ml.t3.medium (2 vCPU + 4 GiB)` notebook instance running the `Python 3 (Data Science 3.0)` kernel in SageMaker Studio.

# 1. Import Libraries and Create Clients

In [None]:
%pip install --disable-pip-version-check -U -q -r requirements.txt

In [None]:
import boto3
import sagemaker
import os
import json
import pprint
import pandas as pd
from random import sample
import re

boto_session = boto3.Session()
s3 = boto_session.client("s3")
sm_session = sagemaker.Session(boto_session=boto_session)
s3_bucket = sm_session.default_bucket()
s3_prefix = "sci-docs/data"
print(f"S3 path is {s3_bucket}/{s3_prefix}")

# 2. Download Documents from the NIH NCBI PMC Article Dataset

Copy 25 articles from the PubMed Central (PMC) open dataset (https://registry.opendata.aws/ncbi-pmc/) into the SageMaker default bucket for this account.

In [None]:
pmc_bucket = "pmc-oa-opendata"
pmc_prefix = "oa_comm/txt/all/"
local_raw_data_dir = "data/raw"
# The 25 PMC articles to download
article_names = [
    "PMC1043862.txt",
    "PMC1054881.txt",
    "PMC1054888.txt",
    "PMC1064076.txt",
    "PMC1064081.txt",
    "PMC1064104.txt",
    "PMC1064852.txt",
    "PMC1064855.txt",
    "PMC1064860.txt",
    "PMC1064883.txt",
    "PMC1064892.txt",
    "PMC1064893.txt",
    "PMC1065049.txt",
    "PMC1065056.txt",
    "PMC1065057.txt",
    "PMC1065073.txt",
    "PMC1065100.txt",
    "PMC1065320.txt",
    "PMC1065326.txt",
    "PMC1069647.txt",
    "PMC1069665.txt",
    "PMC1073698.txt",
    "PMC1074343.txt",
    "PMC1074358.txt",
    "PMC1074751.txt",
]
for article in article_names:
    print(article)
    sm_session.download_data(
        local_raw_data_dir, bucket=pmc_bucket, key_prefix=pmc_prefix + article
    )

# Once all files have been downloaded, upload them all to the S3 bucket for your project
sm_session.upload_data(
    local_raw_data_dir, bucket=s3_bucket, key_prefix=s3_prefix + "/raw"
)

Look at the first lines of a randomly chosen article. Rerun the cell to see other examples.

In [13]:
art = sample(article_names, 1)[0]
print(art)
!head data/raw/{art}

PMC1043862.txt

==== Front
PLoS BiolPLoS BiolpbioplosbiolPLoS Biology1544-91731545-7885Public Library of Science San Francisco, USA 10.1371/journal.pbio.0030122SynopsisAnimal BehaviorNeuroscienceZoologyInsectsCracking the Olfactory Code Synopsis4 2005 22 2 2005 22 2 2005 3 4 e122Copyright: © 2005 Public Library of Science.2005This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
Perceptual and Neural Olfactory Similarity in Honeybees 

Perception Space-The Final Frontier
==== Body
For Proust, a taste of cookie was enough to trigger vivid recollections of his childhood, the first of a long string of reveries that he fashioned into his famous memoir Remembrance of Things Past. For many animals, too, tastes and smells are evocative and play a crucial role in finding food, allowing them to build on pas
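The sample output above suggests that each file marks its sections with lines such as `==== Front` and `==== Body`. As a minimal sketch, assuming every marker sits on its own line in that form, a helper like the following could separate the metadata from the article text before any further processing. The name `split_pmc_sections` is hypothetical, and the example string is an abbreviated version of the output shown above.

```python
import re


def split_pmc_sections(text):
    """Split a PMC plain-text article into sections keyed by marker name."""
    sections = {}
    current = None
    for line in text.splitlines():
        # Assumed marker format: four equals signs, a space, then one word
        match = re.match(r"^==== (\w+)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


example = (
    "==== Front\n"
    "Perceptual and Neural Olfactory Similarity in Honeybees\n"
    "==== Body\n"
    "For Proust, a taste of cookie was enough.\n"
)
parts = split_pmc_sections(example)
print(parts["Body"])
```

Text that appears before the first marker is ignored in this sketch, which may or may not suit your files.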

# 3. Summarize the Documents Using Amazon Comprehend Topic Modelling

Submit an Amazon Comprehend topic modelling job

In [24]:
comprehend = boto_session.client(service_name="comprehend")

input_s3_url = sagemaker.s3.s3_path_join("s3://", s3_bucket, s3_prefix, "raw")
input_doc_format = "ONE_DOC_PER_FILE"
output_s3_url = sagemaker.s3.s3_path_join("s3://", s3_bucket, s3_prefix, "output")
data_access_role_arn = sagemaker.session.get_execution_role()
number_of_topics = 25

input_data_config = {"S3Uri": input_s3_url, "InputFormat": input_doc_format}
output_data_config = {"S3Uri": output_s3_url}

start_topics_detection_job_result = comprehend.start_topics_detection_job(
    NumberOfTopics=number_of_topics,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    DataAccessRoleArn=data_access_role_arn,
)

job_id = start_topics_detection_job_result["JobId"]
print(f"Job {job_id} submitted")

Job 5c6f6d4cd8c8bad35b69608b3d574e8d submitted


Check the job status by running this cell. Note that the job may take 10 minutes or more to finish.

In [30]:
describe_topics_detection_job_result = comprehend.describe_topics_detection_job(
    JobId=job_id
)["TopicsDetectionJobProperties"]
print(f"Job {job_id} status is {describe_topics_detection_job_result['JobStatus']}")

Job 5c6f6d4cd8c8bad35b69608b3d574e8d status is COMPLETED
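Rather than rerunning the status cell by hand, you could poll until the job reaches a terminal state. The sketch below is one way to do that, not part of the original workshop. The helper `wait_for_topics_job` and the `FakeComprehend` stand-in are hypothetical names, and the stand-in exists only so the example runs without AWS access.

```python
import time

# Job states after which polling can stop
TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_topics_job(
    client, job_id, poll_seconds=30, timeout_seconds=1800, sleep=time.sleep
):
    """Poll describe_topics_detection_job until the job finishes or times out."""
    waited = 0
    while True:
        props = client.describe_topics_detection_job(JobId=job_id)[
            "TopicsDetectionJobProperties"
        ]
        if props["JobStatus"] in TERMINAL_STATES:
            return props
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job {job_id} still {props['JobStatus']}")
        sleep(poll_seconds)
        waited += poll_seconds


class FakeComprehend:
    """Stand-in client for a dry run: SUBMITTED, IN_PROGRESS, then COMPLETED."""

    def __init__(self):
        self.statuses = ["SUBMITTED", "IN_PROGRESS", "COMPLETED"]
        self.calls = 0

    def describe_topics_detection_job(self, JobId):
        status = self.statuses[min(self.calls, len(self.statuses) - 1)]
        self.calls += 1
        return {"TopicsDetectionJobProperties": {"JobId": JobId, "JobStatus": status}}


fake = FakeComprehend()
props = wait_for_topics_job(fake, "example-job", poll_seconds=1, sleep=lambda s: None)
print(props["JobStatus"])
```

In the notebook, you would pass the real client instead, for example `describe_topics_detection_job_result = wait_for_topics_job(comprehend, job_id)`.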


Once the job is complete, download and unpack the results.

In [31]:
if describe_topics_detection_job_result["JobStatus"] == "COMPLETED":
    output_url = sagemaker.s3.parse_s3_url(
        describe_topics_detection_job_result["OutputDataConfig"]["S3Uri"]
    )
    sm_session.download_data(
        "data",
        bucket=output_url[0],
        key_prefix=output_url[1],
    )
    os.system("tar xvfz data/output.tar.gz -C data/")

    topics = (
        pd.read_csv("data/topic-terms.csv")
        .sort_values(["topic", "weight"], ascending=[True, False])
        .groupby(["topic"])["term"]
        .agg(lambda x: ", ".join(x))
    )
    docs = pd.read_csv("data/doc-topics.csv").sort_values(
        ["docname", "proportion"], ascending=[True, False]
    )
    results = pd.merge(docs, topics, how="left", on="topic")
    display(results)

topic-terms.csv
doc-topics.csv


| | docname | topic | proportion | term |
|---|---|---|---|---|
| 0 | PMC1043862.txt | 19 | 1.000000 | bee, scent, brain, chemical, smell, distance, ... |
| 1 | PMC1054881.txt | 9 | 0.824953 | cohesin, sa2, scc1, cell, phosphorylation, chr... |
| 2 | PMC1054881.txt | 0 | 0.069081 | abbreviation, declare, sr, mt, excess, wm, car... |
| 3 | PMC1054881.txt | 1 | 0.034858 | discussion, abbreviation, ltd.this, declare, w... |
| 4 | PMC1054881.txt | 21 | 0.017785 | cell, camp, e2, receptor, estrogen, level, mem... |
| ... | ... | ... | ... | ... |
| 177 | PMC1074358.txt | 12 | 0.009949 | inflammation, inflammatory, journal, access, r... |
| 178 | PMC1074751.txt | 22 | 0.852933 | activity, stimulus, spontaneous, neuron, oscil... |
| 179 | PMC1074751.txt | 0 | 0.071058 | abbreviation, declare, sr, mt, excess, wm, car... |
| 180 | PMC1074751.txt | 1 | 0.059401 | discussion, abbreviation, ltd.this, declare, w... |
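Each document appears several times in this output, once per topic, and some low-proportion topics (such as topics 0 and 1) look like publishing boilerplate rather than subject matter. One possible way to get a single label per document is to keep only the topic with the highest proportion. The sketch below assumes that `doc-topics.csv` has the columns `docname`, `topic`, and `proportion`, and uses a few rows copied from the output above. The helper name `dominant_topics` is hypothetical.

```python
import csv
import io

# A few rows copied from the results shown above
DOC_TOPICS_CSV = """docname,topic,proportion
PMC1054881.txt,9,0.824953
PMC1054881.txt,0,0.069081
PMC1054881.txt,1,0.034858
PMC1074751.txt,22,0.852933
PMC1074751.txt,0,0.071058
"""


def dominant_topics(rows):
    """Return {docname: (topic, proportion)} for each document's top topic."""
    best = {}
    for row in rows:
        proportion = float(row["proportion"])
        name = row["docname"]
        if name not in best or proportion > best[name][1]:
            best[name] = (int(row["topic"]), proportion)
    return best


rows = csv.DictReader(io.StringIO(DOC_TOPICS_CSV))
top = dominant_topics(rows)
for docname, (topic, proportion) in sorted(top.items()):
    print(docname, topic, proportion)
```

To run it on the real results, replace the `io.StringIO` source with the opened `data/doc-topics.csv` file.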


Let's look at some specific examples:

In [32]:
if describe_topics_detection_job_result["JobStatus"] == "COMPLETED":

    input_url = sagemaker.s3.parse_s3_url(
        describe_topics_detection_job_result["InputDataConfig"]["S3Uri"]
    )
    # Use a distinct name so the `sample` function imported from `random` is not shadowed
    sample_row = results.sample()
    docname, idx, score, terms = sample_row.iloc[0, :]

    print(f"Document name is {docname}")
    print(f"Identified terms are {terms}")

    sm_session.download_data(
        "data", bucket=input_url[0], key_prefix=os.path.join(input_url[1], docname)
    )
    os.system(f"head -n 25 data/{docname}")

Document name is PMC1064852.txt
Identified terms are node, cancer, breast, nx, model, numb, involve, nodal, positive, ratio

==== Front
PLoS BiolPLoS BiolpbioplosbiolPLoS Biology1544-91731545-7885Public Library of Science San Francisco, USA 1576027810.1371/journal.pbio.0030095Research ArticleGenetics/Genomics/Gene TherapyMicrobiologyYeast and FungiLight Controls Growth and Development via a Conserved Pathway in the Fungal Kingdom Light Regulates C. neoformans DevelopmentIdnurm Alexander 
1
Heitman Joseph heitm001@duke.edu
1
1Department of Molecular Genetics and Microbiology, Howard Hughes Medical InstituteDuke University Medical Center, Durham, North CarolinaUnited States of AmericaChory Joanne Academic EditorThe Salk Institute for Biological StudiesUnited States of America4 2005 15 3 2005 15 3 2005 3 4 e9520 6 2004 18 1 2005 Copyright: © 2005 Idnurm and Heitman.2005This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unr

# 4. Generate TLDR Summaries Using a Pre-Trained NLP Model from HuggingFace

In [17]:
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig
from sagemaker.async_inference.waiter_config import WaiterConfig

role = sagemaker.get_execution_role()

# Hub Model configuration. https://huggingface.co/models
hub = {"HF_MODEL_ID": "alk/pegasus-scitldr", "HF_TASK": "text2text-generation"}


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version="4.17.0",
    pytorch_version="1.10.2",
    py_version="py38",
    env=hub,
    role=role,
)

In [18]:
# deploy model to SageMaker Inference
async_config = AsyncInferenceConfig(
    output_path=f"s3://{s3_bucket}/{s3_prefix}/tldr_output",
    max_concurrent_invocations_per_instance=4,
)

predictor = huggingface_model.deploy(
    async_inference_config=async_config,
    initial_instance_count=1,  # number of instances
    instance_type="ml.m5.4xlarge",  # ec2 instance type
    wait=True,
)

------!

Convert the document text to JSON format and upload it to S3.

In [20]:
# Find an article with well-defined background information
result = None
while result is None:
    art = sample(article_names, 1)[0]
    print(art)

    with open(f"data/raw/{art}", "r", encoding="utf-8", errors="replace") as f:
        text = f.read().replace("\n", " ").replace("\t", " ")
        result = re.search("Background (.{,1000})", text)

dict = {"inputs": result.group(1)}  # Up to 1,000 characters of background information
print(dict)

PMC1065049.txt
PMC1065320.txt
PMC1064081.txt
PMC1064855.txt
PMC1065056.txt
PMC1074751.txt
PMC1064852.txt
PMC1073698.txt
PMC1064104.txt
PMC1065057.txt
PMC1064852.txt
PMC1069665.txt
{'inputs': 'The integrin αvβ3 plays an important role in angiogenesis and tumor cell metastasis, and is currently being evaluated as a target for new therapeutic approaches. Several techniques are being studied to enable noninvasive determination of αvβ3 expression. We developed [18F]Galacto-RGD, a 18F-labeled glycosylated αvβ3 antagonist, allowing monitoring of αvβ3 expression with positron emission tomography (PET).  Methods and Findings Here we show by quantitative analysis of images resulting from a small-animal PET scanner that uptake of [18F]Galacto-RGD in the tumor correlates with αvβ3 expression subsequently determined by Western blot analyses. Moreover, using the A431 human squamous cell carcinoma model we demonstrate that this approach is sensitive enough to visualize αvβ3 expression resulting exclu
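The search loop above keeps sampling until it finds a match, so it would never finish if none of the articles contained the word "Background". As an alternative sketch, the helpers below check each candidate once and return `None` when nothing matches. The names `extract_background` and `first_with_background` and the toy file names are hypothetical. The pattern mirrors the one used above, written as `{0,1000}`.

```python
import re

# Same idea as the notebook's pattern: up to 1,000 characters after "Background "
BACKGROUND_PATTERN = re.compile(r"Background (.{0,1000})")


def extract_background(text, pattern=BACKGROUND_PATTERN):
    """Return the text following 'Background ', or None if it is absent."""
    # Flatten newlines and tabs first, because "." does not match a newline
    flattened = text.replace("\n", " ").replace("\t", " ")
    match = pattern.search(flattened)
    return match.group(1) if match else None


def first_with_background(texts_by_name):
    """Return (name, snippet) for the first document with a background section."""
    for name, text in texts_by_name.items():
        snippet = extract_background(text)
        if snippet is not None:
            return name, snippet
    return None, None


# Toy documents with hypothetical names, used only to exercise the helpers
toy_docs = {
    "a.txt": "==== Body\nNo marker here.",
    "b.txt": "Background\tThe integrin plays a role.\nMethods follow.",
}
name, snippet = first_with_background(toy_docs)
print(name, "->", snippet)
```

To keep the random choice from the original cell, you could shuffle the article names before reading the files into the dictionary.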

In [21]:
async_response = predictor.predict_async(data=dict)

waiter = WaiterConfig(max_attempts=24, delay=15)
result = async_response.get_result(waiter)
pprint.pprint(result)

[{'generated_text': 'We developed [18F]Galacto-RGD, a 18F-labeled glycosylated '
                    'v3 antagonist, allowing monitoring of v3 expression with '
                    'positron emission tomography (PET).'}]


# 5. Clean Up

In [33]:
predictor.delete_endpoint()

# Delete all S3 objects under the workshop prefix
bucket = boto_session.resource("s3").Bucket(s3_bucket)
bucket.objects.filter(Prefix="sci-docs").delete()
# Remove the local copies of the data
os.system("rm -rf data")

0