# Pipeline to process medical reports

In this notebook, we will walk through how to build a pipeline that processes medical reports in PDF format and extracts relevant medical information using the following AWS services:

- [Textract](https://aws.amazon.com/textract/): To extract text from the PDF medical report
- [Comprehend Medical](https://aws.amazon.com/comprehend/medical/): To extract relevant medical information from the output of Textract

In [54]:
import boto3
import time
import sagemaker
import os 
from trp import Document

bucket = sagemaker.Session().default_bucket()
prefix = 'sagemaker/medical_notes'

## Process PDF with Amazon Textract

In this section, we will be using Amazon Textract to extract text from a sample PDF medical record. If you're interested, you can preview the PDF report in the reports directory.

Amazon Textract can detect and analyze text in both single-page and multi-page documents. Single-page documents can be analyzed with a synchronous Textract operation, whereas multi-page processing is asynchronous. In this lab, we will process our medical report using the multi-page method.

For more information on single page and multi page document processing, please refer to the following link:

- Single Page Processing: https://docs.aws.amazon.com/textract/latest/dg/sync.html
- Multi Page Processing: https://docs.aws.amazon.com/textract/latest/dg/async.html
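
For contrast with the asynchronous flow used in the rest of this notebook, here is a minimal sketch of the synchronous path. It assumes a boto3 Textract client is passed in and that the input is a single-page document supplied as raw bytes; the helper name `extract_lines_sync` is made up for illustration and is not part of this lab.

```python
def extract_lines_sync(textract_client, document_bytes):
    """Return the text of every LINE block found by a synchronous Textract call."""
    # detect_document_text is synchronous: the result comes back in the same
    # response, so there is no job id to track and nothing to poll.
    response = textract_client.detect_document_text(
        Document={"Bytes": document_bytes}
    )
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
```

A call would look something like `extract_lines_sync(boto3.client('textract'), open('page.png', 'rb').read())`, where `page.png` is a hypothetical single-page image.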

As we are using the multi-page Textract operation, we need to upload our sample medical record to an S3 bucket. Run the next cell to upload the sample medical report.

In [56]:
fileName =  'sample_report_1.pdf'
fileUploadPath = os.path.join('./data', fileName)
textractObjectName = os.path.join(prefix, 'data', fileName)

# Upload medical report file
boto3.Session().resource('s3').Bucket(bucket).Object(textractObjectName).upload_file(fileUploadPath)

In the next step, we will start the asynchronous Textract operation by calling the `start_document_analysis()` function. The function kicks off an asynchronous job that processes our medical report file in the specified S3 bucket.

In [50]:
client = boto3.client('textract')
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': bucket,
            'Name': textractObjectName
        }},
    FeatureTypes=[
        'TABLES',
    ]
    )

textractJobId = response["JobId"]

Because the job runs in the background, we can monitor its progress by calling the `get_document_analysis()` function and passing the ID of the job that we created.

Run the next cell and wait until the Textract job status is SUCCEEDED.

In [51]:
time.sleep(5)
response = client.get_document_analysis(JobId=textractJobId)
status = response["JobStatus"]
# print("Job status: {}".format(status))

while(status == "IN_PROGRESS"):
    time.sleep(5)
    response = client.get_document_analysis(JobId=textractJobId)
    status = response["JobStatus"]
    print("Textract Job status: {}".format(status))

Textract Job status: IN_PROGRESS
Textract Job status: IN_PROGRESS
Textract Job status: IN_PROGRESS
Textract Job status: IN_PROGRESS
Textract Job status: IN_PROGRESS
Textract Job status: IN_PROGRESS
Textract Job status: IN_PROGRESS
Textract Job status: IN_PROGRESS
Textract Job status: IN_PROGRESS
Textract Job status: IN_PROGRESS
Textract Job status: SUCCEEDED
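
The polling loop above stops as soon as the status is no longer IN_PROGRESS, which also covers a job that ends in FAILED or PARTIAL_SUCCESS. A slightly more defensive variant might look like the sketch below. The function name, the timeout value, and the injectable `sleep` parameter are assumptions added for illustration; only `get_document_analysis` and its `JobStatus` and `StatusMessage` fields come from the Textract API.

```python
import time


def wait_for_textract_job(textract_client, job_id, poll_seconds=5,
                          timeout_seconds=600, sleep=time.sleep):
    """Poll get_document_analysis until the job succeeds, fails, or times out."""
    waited = 0
    while True:
        response = textract_client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status == "SUCCEEDED":
            return response
        if status != "IN_PROGRESS":
            # FAILED or PARTIAL_SUCCESS: surface it instead of continuing silently
            raise RuntimeError(
                "Textract job {} ended with status {}: {}".format(
                    job_id, status, response.get("StatusMessage", "no message"))
            )
        if waited >= timeout_seconds:
            raise TimeoutError(
                "Textract job {} still running after {}s".format(job_id, waited))
        sleep(poll_seconds)
        waited += poll_seconds
```

With the names used in this notebook, the call would be `wait_for_textract_job(client, textractJobId)`.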


Now that we've successfully extracted the text from the medical report, let us retrieve the Textract results and consolidate the text so that we can pass it to Comprehend Medical and start extracting medical information from the report.

In [59]:
pages = []

time.sleep(5)

# Fetch the first page of results for the completed job
response = client.get_document_analysis(JobId=textractJobId)

pages.append(response)
print("Resultset page received: {}".format(len(pages)))

# Large results are paginated; follow NextToken until it is absent
nextToken = response.get('NextToken')

while nextToken:
    time.sleep(5)

    response = client.get_document_analysis(JobId=textractJobId, NextToken=nextToken)

    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = response.get('NextToken')

# Format the response from Textract using the trp Document library
doc = Document(pages)

Resultset page received: 1
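
To make the consolidation step more concrete, the sketch below groups the raw LINE blocks by page without relying on the trp library. It is an illustrative alternative, not what this notebook uses: the function name `text_by_page` is hypothetical, and it assumes each block carries the `BlockType`, `Text`, and (for multi-page input) `Page` fields of the Textract response.

```python
from collections import defaultdict


def text_by_page(result_pages):
    """Join LINE blocks into one string per document page.

    result_pages is the list of get_document_analysis responses
    collected while following NextToken.
    """
    lines = defaultdict(list)
    for response in result_pages:
        for block in response.get("Blocks", []):
            if block.get("BlockType") == "LINE":
                # Page may be absent for single-page input; default to page 1
                lines[block.get("Page", 1)].append(block["Text"])
    return {page: "\n".join(lines[page]) for page in sorted(lines)}
```

Calling `text_by_page(pages)` on the list built in the previous cell would return a dictionary keyed by page number.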


## Extract medical information with Amazon Comprehend Medical 

In the following section, we will be using Amazon Comprehend Medical to extract medical information from the text we got from the Textract operation on the PDF medical report.

With Amazon Comprehend Medical, you can perform the following on your documents:

- [Detect Entities (Version 2)](https://docs.aws.amazon.com/comprehend/latest/dg/extracted-med-info-V2.html) - Examine unstructured clinical text to detect textual references to medical information such as medical condition, treatment, tests and results, and medications. This version uses a new model and changes the way some entities are returned in the output. For more information, see [DetectEntitiesV2](https://docs.aws.amazon.com/comprehend/latest/dg/API_medical_DetectEntitiesV2.html).

- [Detect PHI](https://docs.aws.amazon.com/comprehend/latest/dg/how-medical-phi.html) - Examine unstructured clinical text to detect textual references to protected health information (PHI) such as names and addresses.
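
As an illustration of how detected PHI could be used, the sketch below masks each PHI span with a placeholder built from its entity type. It assumes the entities have the shape returned by Comprehend Medical (`BeginOffset`, `EndOffset`, `Type`, `Category`, `Score`) and that spans do not overlap. The function name `redact_phi` and the `min_score` threshold are hypothetical choices, not part of the service.

```python
def redact_phi(text, entities, min_score=0.5):
    """Replace each PHI span in text with a [TYPE] placeholder."""
    phi = [
        entity for entity in entities
        if entity.get("Category") == "PROTECTED_HEALTH_INFORMATION"
        and entity.get("Score", 0.0) >= min_score
    ]
    # Work from the end of the string so earlier offsets stay valid
    for entity in sorted(phi, key=lambda e: e["BeginOffset"], reverse=True):
        placeholder = "[{}]".format(entity["Type"])
        text = text[:entity["BeginOffset"]] + placeholder + text[entity["EndOffset"]:]
    return text
```

The offsets refer to the exact string that was sent to the service, so the same string has to be passed to `redact_phi`.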


In this lab, we will use the detect entities function to extract medical conditions. In the following cell, we process the text of each page in chunks of at most 20,000 characters, because Comprehend Medical has a maximum document size of 20,000 bytes (reference: https://docs.aws.amazon.com/comprehend/latest/dg/guidelines-and-limits-med.html). Note that characters and bytes coincide only when every character encodes to a single byte in UTF-8, as plain ASCII does; text containing multi-byte characters would need smaller chunks. Once we've processed the text, we stitch the responses together into a single variable that we can either save to a CSV file or use for our analysis.
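
Since the limit is expressed in bytes while Python slices strings by character, a byte-aware splitter is one way to stay under the limit for text that contains multi-byte characters. The helper below is a minimal sketch; the name `chunk_by_utf8_bytes` is made up for this example, and it walks the text one character at a time, which favors clarity over speed.

```python
def chunk_by_utf8_bytes(text, max_bytes=20000):
    """Split text into pieces whose UTF-8 encoding is at most max_bytes each."""
    chunks, current, current_size = [], [], 0
    for char in text:
        char_size = len(char.encode("utf-8"))
        # Start a new chunk when adding this character would exceed the limit
        if current and current_size + char_size > max_bytes:
            chunks.append("".join(current))
            current, current_size = [], 0
        current.append(char)
        current_size += char_size
    if current:
        chunks.append("".join(current))
    return chunks
```

Each returned chunk could then be passed to `detect_entities_v2` in place of the fixed-width character slices. Characters are never split, so joining the chunks reproduces the original text.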

In [60]:
maxLength = 20000

comprehendResponse = []

# Create the client once and reuse it for every request
comprehend_medical_client = boto3.client(service_name='comprehendmedical', region_name='ap-southeast-2')

for page in doc.pages:
    pageText = page.text
    # Send the page text in slices of at most maxLength characters
    for i in range(0, len(pageText), maxLength):
        response = comprehend_medical_client.detect_entities_v2(Text=pageText[i:i + maxLength])
        comprehendResponse.append(response)

In [45]:
# Output
comprehendResponse

[{'Entities': [{'Id': 2,
    'BeginOffset': 23,
    'EndOffset': 35,
    'Score': 0.9898292422294617,
    'Text': 'Terri Hodosy',
    'Category': 'PROTECTED_HEALTH_INFORMATION',
    'Type': 'NAME',
    'Traits': []},
   {'Id': 3,
    'BeginOffset': 47,
    'EndOffset': 57,
    'Score': 0.9998424053192139,
    'Text': '10/22/1962',
    'Category': 'PROTECTED_HEALTH_INFORMATION',
    'Type': 'DATE',
    'Traits': []},
   {'Id': 4,
    'BeginOffset': 102,
    'EndOffset': 112,
    'Score': 0.9999436140060425,
    'Text': '01/01/2020',
    'Category': 'PROTECTED_HEALTH_INFORMATION',
    'Type': 'DATE',
    'Traits': []},
   {'Id': 5,
    'BeginOffset': 128,
    'EndOffset': 138,
    'Score': 0.9998565912246704,
    'Text': '01/20/2020',
    'Category': 'PROTECTED_HEALTH_INFORMATION',
    'Type': 'DATE',
    'Traits': []},
   {'Id': 6,
    'BeginOffset': 165,
    'EndOffset': 167,
    'Score': 0.9463181495666504,
    'Text': '15',
    'Category': 'PROTECTED_HEALTH_INFORMATION',
    'Type': 
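
To turn a list of responses like the one above into something that can be saved as CSV, one option is to flatten the entities into rows. The sketch below assumes each response has an `Entities` list with the `Text`, `Category`, `Type`, `Score`, and `Traits` fields shown in the output. The function names and the column layout are illustrative choices only.

```python
import csv
import io


def entities_to_rows(responses, category=None):
    """Flatten Comprehend Medical responses into one dict per entity."""
    rows = []
    for chunk_index, response in enumerate(responses):
        for entity in response.get("Entities", []):
            # Optionally keep a single category, e.g. 'MEDICAL_CONDITION'
            if category and entity.get("Category") != category:
                continue
            rows.append({
                "chunk": chunk_index,
                "text": entity["Text"],
                "category": entity["Category"],
                "type": entity["Type"],
                "score": round(entity["Score"], 4),
                "traits": ";".join(t["Name"] for t in entity.get("Traits", [])),
            })
    return rows


def rows_to_csv_text(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    fieldnames = ["chunk", "text", "category", "type", "score", "traits"]
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For this lab, `entities_to_rows(comprehendResponse, category='MEDICAL_CONDITION')` would keep only the medical conditions, and the string from `rows_to_csv_text` can be written to a file of your choice.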

In [None]:
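# Request syntax template for the SageMaker runtime invoke_endpoint call.
# Every value below is a placeholder; replace it before running this cell.
client = boto3.client('sagemaker-runtime')

client.invoke_endpoint(
    EndpointName='string',
    Body=b'bytes',  # raw bytes or a file-like object
    ContentType='string',
    Accept='string',
    CustomAttributes='string',
    TargetModel='string',
    TargetVariant='string'
)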