# 1. Data Processing - Build a data processing pipeline to process electronic medical reports (EMR) using Amazon Textract and Comprehend Medical

In this notebook, we will walk through how to build a data processing pipeline that processes electronic medical reports (EMR) in PDF format and extracts relevant medical information by using the following AWS services:

- [Textract](https://aws.amazon.com/textract/): To extract text from the PDF medical report
- [Comprehend Medical](https://aws.amazon.com/comprehend/medical/): To extract relevant medical information from the output of Textract

## Contents

1. [Objective](#Objective)
1. [Setup Environment](#Setup-Environment)
1. [Extract text using Amazon Textract](#Step-1:-Process-PDF-with-Amazon-Textract)
1. [Extract medical information using Amazon Comprehend Medical](#Step-2:-Extract-medical-information-with-Amazon-Comprehend-Medical)
1. [Clean up resources](#Clean-up-resources)

---

# Objective

The objective of this section of the workshop is to learn how to use Amazon Textract and Comprehend Medical to extract the medical information from an electronic medical report in PDF format.

---

# Setup environment

Before we begin, let us set up our environment. We will need the following:

* Amazon Textract Results Parser `textract-trp` to process our Textract results.
* Python libraries 
* Pre-processing functions that will help with processing and visualization of our results. For the purpose of this workshop, we have provided a pre-processing function library that can be found in [util/preprocess.py](./util/preprocess.py)

Note: `textract-trp` will require Python 3.6 or newer.

In [1]:
!pip install textract-trp



In [2]:
import boto3
import time
import sagemaker
import os 
import trp
from util.preprocess import *
import pandas as pd
bucket = sagemaker.Session().default_bucket()
prefix = 'sagemaker/medical_notes'

---

# Step 1: Process PDF with Amazon Textract

In this section, we will extract the text from a medical report in PDF format using Textract. To facilitate this workshop, we have generated a [sample PDF medical report](./data/sample_report_1.pdf) using the [MTSample dataset](https://www.kaggle.com/tboyle10/medicaltranscriptions) from Kaggle.

## About Textract
Amazon Textract can detect lines of text and the words that make up each line. Textract can process documents either synchronously or asynchronously:
+ [synchronous API](https://docs.aws.amazon.com/textract/latest/dg/sync.html): *The input document must be an image in `JPEG` or `PNG` format.* Single-page document analysis can be performed using a Textract synchronous operation.
    1. *`detect_document_text`*: detects text in the input document.
    2. *`analyze_document`*: analyzes an input document for relationships between detected items.
+ [asynchronous API](https://docs.aws.amazon.com/textract/latest/dg/async.html): *can analyze text in documents that are in `JPEG`, `PNG`, and `PDF` format. Multi-page processing is an asynchronous operation. The documents are stored in an Amazon S3 bucket. Use `DocumentLocation` to specify the bucket name and file name of the document.*
    1. For text detection:
        1. *`start_document_text_detection`*: starts the asynchronous detection of text in a document.
        2. *`get_document_text_detection`*: gets the results for an Amazon Textract asynchronous operation that detects text in a document.
    2. For relationships between detected items:
        1. *`start_document_analysis`*: starts the asynchronous analysis of relationships in a document.
        2. *`get_document_analysis`*: gets the results for an Amazon Textract asynchronous operation that analyzes text in a document.

For the detailed API, refer to the [boto3 Textract documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.analyze_document).
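
The choice between the synchronous and asynchronous operations listed above can be sketched as a small lookup. This is only an illustration of the rules stated in this section: the helper name `choose_textract_operations` and its parameters are hypothetical, and the returned strings are the boto3 method names from the list.

```python
import os

# File types each API family accepts, as described in this section.
SYNC_TYPES = {".jpg", ".jpeg", ".png"}
ASYNC_TYPES = SYNC_TYPES | {".pdf"}


def choose_textract_operations(file_name, multi_page=False, need_relationships=False):
    """Return the boto3 Textract method names suited to a document (hypothetical helper)."""
    ext = os.path.splitext(file_name)[1].lower()
    if ext not in ASYNC_TYPES:
        raise ValueError(f"Unsupported file type: {ext!r}")

    # PDFs and multi-page documents go through the asynchronous API.
    use_async = multi_page or ext not in SYNC_TYPES
    if use_async:
        if need_relationships:
            return ("start_document_analysis", "get_document_analysis")
        return ("start_document_text_detection", "get_document_text_detection")
    if need_relationships:
        return ("analyze_document",)
    return ("detect_document_text",)


print(choose_textract_operations("sample_report_1.pdf", multi_page=True, need_relationships=True))
```

For the multi-page PDF used in this workshop, the sketch selects the `start_document_analysis` / `get_document_analysis` pair.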

In this demo, because the input is a multi-page PDF, we will use the asynchronous Textract operations, which read the document from Amazon S3. We therefore need to upload our sample medical report to an S3 bucket. Run the next cell to upload it.

In [3]:
fileName =  'sample_report_1.pdf'
fileUploadPath = os.path.join('./data', fileName)
textractObjectName = os.path.join(prefix, 'data', fileName)

# Upload medical report file
boto3.Session().resource('s3').Bucket(bucket).Object(textractObjectName).upload_file(fileUploadPath)
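
A side note on the object key: S3 keys use `/` as the delimiter, while `os.path.join` uses the separator of the local operating system. On a SageMaker notebook instance (Linux) the two coincide, so the cell above works as written. If the notebook might run elsewhere, one option is to build the key with `posixpath.join`, as in this small sketch (the variable `s3_key` is illustrative):

```python
import posixpath

prefix = "sagemaker/medical_notes"  # same value as in the setup cell
file_name = "sample_report_1.pdf"

# posixpath.join always joins with "/", regardless of the local OS
s3_key = posixpath.join(prefix, "data", file_name)
print(s3_key)
```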

## Start text detection asynchronously in the PDF
In the next step, we will start the asynchronous Textract operation by calling the `start_document_analysis()` function. The function will kick off an asynchronous job that processes our medical report file in the specified S3 bucket.

In [4]:
textract = boto3.client('textract')
response = textract.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': bucket,
            'Name': textractObjectName
        }},
    FeatureTypes=[
        'TABLES',
    ]
    )

textractJobId = response["JobId"]
print('job id is: ',textractJobId)

job id is:  20e30ffda31b82200800f798f5b7c1f037e3db16acc770d20bac3b4bbdb98777
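
The request passed to `start_document_analysis()` is a plain dictionary, so it can be assembled and checked before any call is made. The following is a minimal sketch: `build_analysis_request` is a hypothetical helper, the bucket name is a placeholder, and the set of accepted feature types is deliberately limited to `TABLES` and `FORMS` as an assumption of this example rather than a statement of everything Textract accepts.

```python
# Feature types this sketch accepts; an assumption, not the full list Textract supports.
ALLOWED_FEATURES = {"TABLES", "FORMS"}


def build_analysis_request(bucket_name, object_key, feature_types=("TABLES",)):
    """Build keyword arguments for textract.start_document_analysis (hypothetical helper)."""
    unknown = set(feature_types) - ALLOWED_FEATURES
    if unknown:
        raise ValueError(f"Unexpected feature types: {sorted(unknown)}")
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket_name, "Name": object_key}},
        "FeatureTypes": list(feature_types),
    }


request = build_analysis_request(
    "my-example-bucket",  # placeholder bucket name
    "sagemaker/medical_notes/data/sample_report_1.pdf",
)
print(request)
# With AWS access, the job could then be started with:
# response = textract.start_document_analysis(**request)
```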


## Wait until the job finishes

As the job runs in the background, we can monitor its progress by calling the `get_document_analysis()` function and passing the job id of the job that we created.

Run the next cell and wait for the Textract job to return a SUCCEEDED status. The result is returned in JSON format.

In [5]:
%%time
time.sleep(5)
response = textract.get_document_analysis(JobId=textractJobId)
status = response["JobStatus"]

while(status == "IN_PROGRESS"):
    time.sleep(5)
    response = textract.get_document_analysis(JobId=textractJobId)
    status = response["JobStatus"]
    print("Textract Job status: {}".format(status))

Textract Job status: IN_PROGRESS
Textract Job status: SUCCEEDED
CPU times: user 64.2 ms, sys: 0 ns, total: 64.2 ms
Wall time: 15.5 s
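
The polling loop above exits as soon as the status is no longer `IN_PROGRESS`, whatever the final status is, and it has no upper bound on waiting time. A slightly more defensive variant could look like the sketch below. The function `wait_for_job` and its parameters are hypothetical, and a stand-in replaces `textract.get_document_analysis` so the example runs without AWS access. Note that this sketch treats every final status other than `SUCCEEDED` as an error, which is a design choice of the example.

```python
import time


def wait_for_job(get_response, poll_seconds=5, timeout_seconds=600, sleep=time.sleep):
    """Poll until the job leaves IN_PROGRESS, then return the last response."""
    waited = 0
    response = get_response()
    while response["JobStatus"] == "IN_PROGRESS":
        if waited >= timeout_seconds:
            raise TimeoutError("Textract job still IN_PROGRESS after timeout")
        sleep(poll_seconds)
        waited += poll_seconds
        response = get_response()
    if response["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job ended with status {response['JobStatus']}")
    return response


# Stand-in for textract.get_document_analysis, so the sketch runs without AWS access.
_fake_statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"])
result = wait_for_job(lambda: {"JobStatus": next(_fake_statuses)}, sleep=lambda seconds: None)
print(result["JobStatus"])
```

In the notebook, the stand-in would be replaced with `lambda: textract.get_document_analysis(JobId=textractJobId)`.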


## Extract Textract results
Now that Textract has successfully extracted the text from the medical report, let us retrieve the Textract results and consolidate the text so that we can pass it to Comprehend Medical to start extracting medical information from the report.

In [6]:
%%time
pages = []

time.sleep(5)

response = textract.get_document_analysis(JobId=textractJobId)

pages.append(response)

nextToken = None
if('NextToken' in response):
    nextToken = response['NextToken']

while(nextToken):
    time.sleep(5)

    response = textract.get_document_analysis(JobId=textractJobId, NextToken=nextToken)

    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = None
    if('NextToken' in response):
        nextToken = response['NextToken']

CPU times: user 133 ms, sys: 7.55 ms, total: 140 ms
Wall time: 5.43 s
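
The `NextToken` handling above follows a common pagination pattern: keep requesting while the response carries a token. As a minimal sketch, the same logic can be written as one reusable loop. The function `collect_result_pages` is hypothetical, and a dictionary of fake responses stands in for Textract so the example runs on its own.

```python
def collect_result_pages(get_page):
    """Call get_page(next_token) until a response has no NextToken."""
    pages = []
    next_token = None
    while True:
        response = get_page(next_token)
        pages.append(response)
        next_token = response.get("NextToken")
        if not next_token:
            return pages


# Stand-in responses keyed by the token that requests them (made-up data).
_fake_responses = {
    None: {"Blocks": [1], "NextToken": "a"},
    "a": {"Blocks": [2], "NextToken": "b"},
    "b": {"Blocks": [3]},
}
pages = collect_result_pages(lambda token: _fake_responses[token])
print(len(pages))

# In the notebook, get_page could wrap the real call, for example:
# def get_page(token):
#     kwargs = {"JobId": textractJobId}
#     if token:
#         kwargs["NextToken"] = token
#     return textract.get_document_analysis(**kwargs)
```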


### Output from Textract

Let's take a look at the output from Textract by using the `trp` library to extract and format the Textract results.

In [7]:
doc = trp.Document(pages)
print("Total length of document is",len(doc.pages))
idx=1
for page in doc.pages:
    print(color.BOLD + f"Results from page {idx}: \n" + color.END, page.text)
    idx=idx+1


Total length of document is 2
Results from page 1: 
 Discharge Summary
Name
Terri Hodosy
Birth Date
10/22/1962
Gender
female
Post Code
1826
Admission Date
01/01/2020
Discharge Date
01/20/2020
Medications
HISTORY: This 15-day-old female presents to Children's Hospital and transferred from Hospital
Emergency Department for further evaluation. Information is obtained in discussion with the mother
and the grandmother in review of previous medical records. This patient had the onset on the day of
presentation of a jelly-like red-brown stool started on Tuesday morning. Then, the patient was noted
to vomit after feeds. The patient was evaluated at Hospital with further evaluation with laboratory
data showing a white blood cell count elevated at 22.2; hemoglobin 14.1; sodium 138; potassium 7.2,
possibly hemolyzed; chloride 107; CO2 23; BUN 17; creatinine 1.2; and glucose of 50, which was
repeated and found to be stable in that range. The patient underwent a barium enema, which was
read
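
For readers curious about what sits underneath the parser, the raw Textract responses contain a list of `Blocks`, and blocks of type `LINE` carry the text of each detected line together with a page number. The sketch below groups those lines by page without using `trp`. It is an illustration only, not a description of how `trp` works internally; `lines_by_page` is a hypothetical helper and the sample response is hand-written.

```python
from collections import defaultdict


def lines_by_page(result_pages):
    """Group the text of LINE blocks by page number (hypothetical helper)."""
    grouped = defaultdict(list)
    for response in result_pages:
        for block in response.get("Blocks", []):
            if block.get("BlockType") == "LINE":
                grouped[block.get("Page", 1)].append(block["Text"])
    return {page: "\n".join(lines) for page, lines in sorted(grouped.items())}


# Hand-written stand-in for a list of Textract responses.
sample = [{"Blocks": [
    {"BlockType": "PAGE", "Page": 1},
    {"BlockType": "LINE", "Page": 1, "Text": "Discharge Summary"},
    {"BlockType": "WORD", "Page": 1, "Text": "Discharge"},
    {"BlockType": "LINE", "Page": 2, "Text": "Medications"},
]}]
text = lines_by_page(sample)
print(text)
```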

---

# Step 2: Extract medical information with Amazon Comprehend Medical

## About Amazon Comprehend Medical

Comprehend Medical detects useful information in unstructured clinical text. As much as 75% of all health record data is found in unstructured text such as physician's notes, discharge summaries, test results, and case notes. Amazon Comprehend Medical uses Natural Language Processing (NLP) models to sort through text for valuable information. Amazon Comprehend Medical only detects medical entities in English language texts.

With Amazon Comprehend Medical, you can perform the following on your documents:

- [Detect Entities (Version 2)](https://docs.aws.amazon.com/comprehend/latest/dg/extracted-med-info-V2.html) - Examine unstructured clinical text to detect textual references to medical information such as medical condition, treatment, tests and results, and medications. This version uses a new model and changes the way some entities are returned in the output. For more information, see [DetectEntitiesV2](https://docs.aws.amazon.com/comprehend/latest/dg/API_medical_DetectEntitiesV2.html).

- [Detect PHI (Version 2)](https://docs.aws.amazon.com/comprehend/latest/dg/how-medical-phi.html) - Examine unstructured clinical text to detect textual references to protected health information (PHI) such as names and addresses.


In this workshop, we will be using the detect entities function to extract the medical conditions. In the following cell, we process the text on each page in chunks of at most 10,000 characters (`maxLength`). Chunking is needed because Comprehend Medical has a maximum document size of 20,000 bytes of UTF-8 text (reference: https://docs.aws.amazon.com/comprehend/latest/dg/guidelines-and-limits-med.html); for mostly ASCII text, where each character takes one byte, 10,000 characters stays well within that limit. Once we've processed the text, we collect the responses into a single variable that we can either save to a CSV file or use for our analysis.
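
Because the documented limit is expressed in bytes while Python slices strings by characters, text with many non-ASCII characters could exceed the limit even when the character count looks safe. One way to guard against that is to split by encoded size, as in the sketch below. The helper `chunk_by_bytes` is hypothetical and is not part of the workshop utilities; it may also split in the middle of a word, which a production pipeline would probably want to avoid.

```python
def chunk_by_bytes(text, max_bytes=20000):
    """Split text into pieces whose UTF-8 encoding is at most max_bytes each."""
    chunks, current, current_size = [], [], 0
    for char in text:
        size = len(char.encode("utf-8"))
        # Start a new chunk when adding this character would exceed the limit
        if current and current_size + size > max_bytes:
            chunks.append("".join(current))
            current, current_size = [], 0
        current.append(char)
        current_size += size
    if current:
        chunks.append("".join(current))
    return chunks


# Tiny limit so the behavior is visible: "é" takes 2 bytes in UTF-8.
chunks = chunk_by_bytes("é" * 5 + "abc", max_bytes=4)
print(chunks)
```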

In [8]:
maxLength=10000  # maximum number of characters sent per request

comprehendResponse = []
comprehend_medical_client = boto3.client(service_name='comprehendmedical', region_name='us-east-1')

for page in doc.pages:
    pageText = page.text

    # Send each page to Comprehend Medical in chunks of at most maxLength characters
    for i in range(0, len(pageText), maxLength):
        response = comprehend_medical_client.detect_entities_v2(Text=pageText[i:i+maxLength])
        comprehendResponse.append(response)
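
One detail to keep in mind when combining the responses: the `BeginOffset` and `EndOffset` of each entity are relative to the chunk of text that was sent, not to the whole page. If positions in the original text matter, the offsets can be shifted while merging. The sketch below assumes a hypothetical list `chunk_offsets` holding the start position of each chunk, which the loop above does not record, and uses hand-written responses.

```python
def stitch_entities(responses, chunk_offsets):
    """Merge per-chunk entity lists, shifting offsets to positions in the full text."""
    merged = []
    for response, offset in zip(responses, chunk_offsets):
        for entity in response.get("Entities", []):
            shifted = dict(entity)  # copy so the original response is untouched
            shifted["BeginOffset"] = entity["BeginOffset"] + offset
            shifted["EndOffset"] = entity["EndOffset"] + offset
            merged.append(shifted)
    return merged


# Hand-written stand-ins for two chunk responses.
fake_responses = [
    {"Entities": [{"Text": "vomiting", "BeginOffset": 12, "EndOffset": 20}]},
    {"Entities": [{"Text": "diarrhea", "BeginOffset": 3, "EndOffset": 11}]},
]
entities = stitch_entities(fake_responses, chunk_offsets=[0, 10000])
print(entities)
```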

## Review comprehend results
`detect_entities_v2` can detect the following entity categories:

- ANATOMY: Detects references to the parts of the body or body systems and the locations of those parts or systems.
- MEDICAL_CONDITION: Detects the signs, symptoms, and diagnosis of medical conditions.
- MEDICATION: Detects medication and dosage information for the patient.
- PROTECTED_HEALTH_INFORMATION: Detects the patient's personal information.
- TEST_TREATMENT_PROCEDURE: Detects the procedures that are used to determine a medical condition.
- TIME_EXPRESSION: Detects entities related to time when they are associated with a detected entity.

For this workshop, we will be using the MEDICAL_CONDITION entity to train our machine learning model. Let us take a look at some of the data.
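
To make the structure of the response more concrete, here is a minimal sketch of filtering a `detect_entities_v2` response down to `MEDICAL_CONDITION` entities. It is not the workshop's `extractMC_v2`, whose implementation lives in `util/preprocess.py` and is not shown here. The helper name, the choice to keep only the highest-scoring trait, and the sample response are all assumptions of this example.

```python
def extract_medical_conditions(response):
    """Return one row per MEDICAL_CONDITION entity (hypothetical helper)."""
    rows = []
    for entity in response.get("Entities", []):
        if entity.get("Category") != "MEDICAL_CONDITION":
            continue
        traits = entity.get("Traits", [])
        # Keep the highest-scoring trait name, or None when no trait was detected
        top_trait = max(traits, key=lambda t: t["Score"])["Name"] if traits else None
        rows.append({
            "MEDICAL_CONDITION": entity["Text"],
            "Score": entity["Score"],
            "Trait": top_trait,
        })
    return rows


# Hand-written stand-in for a detect_entities_v2 response.
fake_response = {"Entities": [
    {"Text": "hypertension", "Category": "MEDICAL_CONDITION", "Score": 0.99,
     "Traits": [{"Name": "DIAGNOSIS", "Score": 0.95}]},
    {"Text": "barium enema", "Category": "TEST_TREATMENT_PROCEDURE", "Score": 0.97, "Traits": []},
    {"Text": "vomit", "Category": "MEDICAL_CONDITION", "Score": 0.69, "Traits": []},
]}
rows = extract_medical_conditions(fake_response)
print(rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` to get a table for inspection.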

In [9]:
# Use the util function to extract the medical conditions, confidence scores, and traits from the JSON response
df_cm=extractMC_v2(comprehendResponse[0])
df_cm['ID']=1
df_cm.head(10)

| | MEDICAL_CONDITION | Score | Trait | ID |
|---|---|---|---|---|
| 0 | jelly-like red-brown stool | 0.370149 | SYMPTOM | 1 |
| 1 | vomit | 0.688994 | | 1 |
| 2 | hypertension | 0.992857 | DIAGNOSIS | 1 |
| 3 | B strep | 0.455369 | DIAGNOSIS | 1 |
| 4 | herpes | 0.960876 | DIAGNOSIS | 1 |
| 5 | Thrush | 0.835416 | DIAGNOSIS | 1 |
| 6 | decreased feeding | 0.90902 | SYMPTOM | 1 |
| 7 | vomiting | 0.996553 | SYMPTOM | 1 |
| 8 | diarrhea | 0.998125 | SYMPTOM | 1 |
| 9 | well-developed | 0.714349 | SIGN | 1 |


---

# Clean up resources

Finally, run the following cell to clean up the S3 resource we created.

In [10]:
boto3.Session().resource('s3').Bucket(bucket).Object(textractObjectName).delete()

{'ResponseMetadata': {'RequestId': 'FT7R1J3R8TBY8P0G',
  'HostId': 'pHFLcjTkU3huthy5o+mzEwg+rMJo8e3zQat278jaiQ6RrmqIEw1lhcDgNHcH39Zi8IXR+gDRzY8=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': 'pHFLcjTkU3huthy5o+mzEwg+rMJo8e3zQat278jaiQ6RrmqIEw1lhcDgNHcH39Zi8IXR+gDRzY8=',
   'x-amz-request-id': 'FT7R1J3R8TBY8P0G',
   'date': 'Sun, 20 Sep 2020 13:50:02 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}