## *DISCLAIMER*
<p style="font-size:16px; color:#117d30;">
 By accessing this code, you acknowledge the code is made available for presentation and demonstration purposes only and that the code: (1) is not subject to SOC 1 and SOC 2 compliance audits; (2) is not designed or intended to be a substitute for the professional advice, diagnosis, treatment, or judgment of a certified financial services professional; (3) is not designed, intended or made available as a medical device; and (4) is not designed or intended to be a substitute for professional medical advice, diagnosis, treatment or judgement. Do not use this code to replace, substitute, or provide professional financial advice or judgment, or to replace, substitute or provide medical advice, diagnosis, treatment or judgement. You are solely responsible for ensuring the regulatory, legal, and/or contractual compliance of any use of the code, including obtaining any authorizations or consents, and any solution you choose to build that incorporates this code in whole or in part.
</p>

# Information Extraction from Documents in Healthcare - Automatic Form Recognition 
<h3><span style="color: #117d30;"> Using Azure Form Recognizer</span></h3>

$*****$ For demonstration purposes only. Please customize according to your enterprise security and compliance needs. License agreement: https://github.com/microsoft/Azure-Analytics-and-AI-Engagement/blob/main/HealthCare/License.md $*****$ 

## Legal Notices 

This presentation, demonstration, and demonstration model are for informational purposes only. Microsoft makes no warranties, express or implied, in this presentation, demonstration, and demonstration model. Nothing in this presentation, demonstration, or demonstration model modifies any of the terms and conditions of Microsoft’s written and signed agreements. This is not an offer and applicable terms and the information provided is subject to revision and may be changed at any time by Microsoft.

This presentation, demonstration, and/or demonstration model do not give you or your organization any license to any patents, trademarks, copyrights, or other intellectual property covering the subject matter in this presentation, demonstration, and demonstration model.

The information contained in this presentation, demonstration, and demonstration model represents the current view of Microsoft on the issues discussed as of the date of presentation and/or demonstration, and the duration of your access to the demonstration model. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of presentation and/or demonstration and for the duration of your access to the demonstration model.

No Microsoft technology, nor any of its component technologies, including the demonstration model, is intended or made available: (1) as a medical device; (2) for the diagnosis of disease or other conditions, or in the cure, mitigation, treatment or prevention of a disease or other conditions; or (3) as a substitute for the professional clinical advice, opinion, or judgment of a treating healthcare professional. Partners or customers are responsible for ensuring the regulatory compliance of any solution they build using Microsoft technologies.

© 2020 Microsoft Corporation. All rights reserved

![](https://pocaccelerator.blob.core.windows.net/webappassets/patient_form.jpg)

## Azure Form Recognizer

Azure Form Recognizer is a cognitive service that uses machine learning technology to identify and extract key-value pairs and table data from form documents. It then outputs structured data that includes the relationships in the original file.

## Scenario Overview


Patient Intake Form Dataset: Raw, unstructured data enters the pipeline as electronically generated PDFs. These forms record injuries that occurred at five hospital locations and capture patient intake details, including allergies, conditions, symptoms, and other patient information.

### Notebook Organization

- Fetch the patient intake PDF files from a container in an Azure storage account.
- Convert the PDF files to JSON by querying the trained Azure Form Recognizer model through the REST API.
- Preprocess the JSON files to extract only the relevant information.
- Push the JSON files to a container in an Azure storage account.

## Importing required libraries

In [37]:
import json
import os
import pprint
import shutil
import time
from os import listdir
from os.path import isfile, join

import requests
from azure.storage.blob import ContainerClient

In [38]:
import os
os.getcwd()

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/vmhealthcare001/code/Users/demo-healthcare-user/notebooks'

## Creating local directories

In [39]:
# Create local directories if they don't exist

# *input_forms* contains all the PDF files
input_path = os.path.join(os.getcwd(), "input_forms")
# *output_json* will contain all the converted JSON files
output_path = os.path.join(os.getcwd(), "output_json")

# exist_ok=True makes the call a no-op when the directory is already there
os.makedirs(input_path, exist_ok=True)
os.makedirs(output_path, exist_ok=True)

## Establishing connection to Azure blob storage

In [1]:
import GlobalVariables

In [40]:
CONNECTION_STRING = GlobalVariables.STORAGE_ACCOUNT_CONNECTION_STRING
CONTAINER_NAME = "formuploadv2"

# Create a container client and list the blobs in the input container
container_client = ContainerClient.from_connection_string(conn_str=CONNECTION_STRING, container_name=CONTAINER_NAME)
blobs_list = container_client.list_blobs()

# Lists of blob names and their matching blob clients
blob_client_list = []
blob_file_list = []

# Get a blob client for each blob and record it alongside the blob name
for c in blobs_list:
    blob_client = container_client.get_blob_client(c)
    blob_file_list.append(c.name)
    blob_client_list.append(blob_client)

# Download each blob into the local *input_forms* folder
for filename, blob_client in zip(blob_file_list, blob_client_list):
    fname = os.path.join(input_path, filename)
    with open(fname, "wb") as blob_file:
        download_stream = blob_client.download_blob()
        download_stream.readinto(blob_file)
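
Blob names may contain `/` separators (virtual folders). In that case opening `os.path.join(input_path, filename)` for writing fails, because the parent folder does not exist locally. A minimal sketch of a path helper for that situation follows. `local_path_for_blob` is a hypothetical name, not part of the Azure SDK, and the check against names that resolve outside the download folder is an added precaution that the notebook itself does not perform.

```python
import os


def local_path_for_blob(base_dir, blob_name):
    """Map a blob name to a path under base_dir, creating parent folders.

    Hypothetical helper for illustration; not an Azure SDK function.
    """
    base = os.path.abspath(base_dir)
    # Blob names always use "/" regardless of the local OS separator.
    target = os.path.abspath(os.path.join(base, *blob_name.split("/")))
    # Refuse names such as "../x" that would resolve outside base_dir.
    if os.path.commonpath([base, target]) != base:
        raise ValueError("blob name escapes the download folder: %r" % blob_name)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    return target
```

In the download loop, `fname = local_path_for_blob(input_path, filename)` could stand in for the plain `os.path.join` call.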


## Running Azure Form Recognizer service on forms

In [53]:
%%time
files = [f for f in listdir(input_path) if isfile(join(input_path, f))]

# Endpoint parameters for querying the custom-trained Form Recognizer model.
# PDF files are processed one by one; each produces one JSON file.
endpoint = GlobalVariables.FORM_RECOGNIZER_ENDPOINT
apim_key = GlobalVariables.FORM_RECOGNIZER_API_KEY
model_id = GlobalVariables.FORM_RECOGNIZER_MODEL_ID
post_url = endpoint + "/formrecognizer/v2.1-preview.2/custom/models/%s/analyze" % model_id
params = {"includeTextDetails": True}
headers = {'Content-Type': 'application/pdf', 'Ocp-Apim-Subscription-Key': apim_key}

for file in files:
    with open(os.path.join(input_path, file), "rb") as f:
        data_bytes = f.read()

    # Submit the PDF. On success the service answers 202 and returns the
    # polling URL in the Operation-Location header.
    # Failures raise instead of calling quit(), so the cell stops at the error.
    resp = requests.post(url=post_url, data=data_bytes, headers=headers, params=params)
    print('resp', resp)
    if resp.status_code != 202:
        raise RuntimeError("POST analyze failed:\n%s" % resp.text)
    print("POST analyze succeeded:\n%s" % resp.headers)
    get_url = resp.headers["operation-location"]

    n_tries = 50
    wait_sec = 5
    max_wait_sec = 60
    for n_try in range(n_tries):
        resp = requests.get(url=get_url, headers={"Ocp-Apim-Subscription-Key": apim_key})
        if resp.status_code != 200:
            raise RuntimeError("GET analyze results failed:\n%s" % resp.text)
        resp_json = resp.json()
        status = resp_json["status"]

        if status == "succeeded":
            form_name = os.path.splitext(file)[0]
            print("Analysis succeeded:\n%s \n" % form_name)

            form_inputs = resp_json['analyzeResult']['documentResults'][0]['fields']
            temp = {}
            for tag, value in form_inputs.items():
                if value is None:
                    continue
                if value['type'] == 'selectionMark':
                    # Checkbox tags look like <field>_<option>;
                    # collect the selected options under the field name.
                    if value['text'] == 'selected':
                        field = tag.split('_')
                        temp.setdefault(field[0], []).append(field[-1])
                else:
                    temp[tag] = value['text']

            # Write the file once, after every field has been collected
            with open(os.path.join(output_path, form_name + ".json"), 'w') as outfile:
                json.dump(temp, outfile)
            break
        if status == "failed":
            raise RuntimeError("Analysis failed:\n%s" % json.dumps(resp_json))
        # Analysis still running. Wait and retry with a doubled, capped delay.
        time.sleep(wait_sec)
        wait_sec = min(2 * wait_sec, max_wait_sec)
    else:
        print("Analysis did not finish after %d polls:\n%s" % (n_tries, file))

resp <Response [202]>
POST analyze succeeded:
{'Content-Length': '0', 'Operation-Location': 'https://cog-formrecognitionv2.cognitiveservices.azure.com/formrecognizer/v2.1-preview.2/custom/models/e46eb4c0-0e9a-4759-b980-28d5cf477236/analyzeresults/f3710d50-61b8-4f75-805f-fc6387d45d21', 'x-envoy-upstream-service-time': '159', 'apim-request-id': 'b34dae70-1b24-4b0b-8191-3cc953015492', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'x-content-type-options': 'nosniff', 'Date': 'Fri, 04 Dec 2020 18:03:18 GMT'}
Analysis succeeded:
Patient_Intake_Form10 

resp <Response [202]>
POST analyze succeeded:
{'Content-Length': '0', 'Operation-Location': 'https://cog-formrecognitionv2.cognitiveservices.azure.com/formrecognizer/v2.1-preview.2/custom/models/e46eb4c0-0e9a-4759-b980-28d5cf477236/analyzeresults/1cead4e6-fc1c-41ff-9b15-d05d4dbff026', 'x-envoy-upstream-service-time': '53', 'apim-request-id': '48450a07-f6b3-4226-9ca0-ce49110f2ad1', 'Strict-Transport-Security': 'ma
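
The polling and field-flattening logic in the cell above can also be factored into two small functions that run without calling the service. The sketch below is one way to do that, not the notebook's own code. `flatten_fields` and `poll_until_done` are hypothetical names, `fetch` stands in for the `requests.get(...).json()` call, and the sample response is made-up data that only mimics the keys the notebook reads (`status`, `analyzeResult`, `documentResults`, `fields`, `type`, `text`).

```python
import time


def flatten_fields(fields):
    """Reduce a `fields` dict to {name: text} or {name: [selected options]}."""
    flat = {}
    for tag, value in fields.items():
        if value is None:
            continue  # nothing was extracted for this tag
        if value["type"] == "selectionMark":
            # Tags such as "Allergies_Pollen" are assumed to mean <field>_<option>.
            if value["text"] == "selected":
                parts = tag.split("_")
                flat.setdefault(parts[0], []).append(parts[-1])
        else:
            flat[tag] = value["text"]
    return flat


def poll_until_done(fetch, n_tries=50, wait_sec=5, max_wait_sec=60, sleep=time.sleep):
    """Call fetch() until its "status" is terminal, doubling the wait up to a cap.

    fetch is any zero-argument callable that returns the parsed JSON body.
    """
    for _ in range(n_tries):
        body = fetch()
        if body["status"] == "succeeded":
            return body
        if body["status"] == "failed":
            raise RuntimeError("Analysis failed: %r" % body)
        sleep(wait_sec)
        wait_sec = min(2 * wait_sec, max_wait_sec)
    raise TimeoutError("Analysis still running after %d polls" % n_tries)


# Made-up sample data in the shape the notebook reads; not real service output.
sample = {
    "PatientName": {"type": "string", "text": "Jane Doe"},
    "Allergies_Pollen": {"type": "selectionMark", "text": "selected"},
    "Allergies_Dust": {"type": "selectionMark", "text": "unselected"},
    "Allergies_Latex": {"type": "selectionMark", "text": "selected"},
    "Notes": None,
}
done = {"status": "succeeded", "analyzeResult": {"documentResults": [{"fields": sample}]}}
responses = iter([{"status": "running"}, {"status": "running"}, done])

waits = []  # record the waits instead of sleeping
result = poll_until_done(lambda: next(responses), sleep=waits.append)
fields = result["analyzeResult"]["documentResults"][0]["fields"]
print(flatten_fields(fields))  # {'PatientName': 'Jane Doe', 'Allergies': ['Pollen', 'Latex']}
print(waits)                   # [5, 10]
```

As in the notebook, only the first and last `_`-separated pieces of a checkbox tag are used, so a field name that itself contains an underscore is shortened to its first piece.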




## Connection parameters for uploading to Azure blob storage

In [None]:
# Connection parameters for uploading JSON files to blob storage
CONNECTION_STRING = GlobalVariables.STORAGE_ACCOUNT_CONNECTION_STRING
CONTAINER_NAME = "formjson"

container_client_upload = ContainerClient.from_connection_string(conn_str=CONNECTION_STRING, container_name=CONTAINER_NAME)


## Uploading JSON files to Azure blob storage

In [None]:
# Upload JSON files from the local folder *output_json* to the container *formjson*

for pth, dirs, files in os.walk(output_path):
    for filename in files:
        # Join with the folder being walked so files in subfolders open correctly
        with open(os.path.join(pth, filename), 'rb') as json_file:
            container_client_upload.upload_blob(name=filename, data=json_file, overwrite=True)
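
The upload step can also be wrapped in a function that keeps any subfolder structure in the blob name and skips files that are not JSON. This is a sketch under stated assumptions rather than the notebook's method. `upload_json_files` is a hypothetical name, and `container_client` can be any object exposing `upload_blob(name=..., data=..., overwrite=...)`, which is how the notebook calls `ContainerClient`.

```python
import os


def upload_json_files(container_client, output_dir):
    """Upload every .json file under output_dir and return the blob names used.

    Hypothetical helper; container_client only needs an upload_blob method.
    """
    uploaded = []
    for folder, _dirs, files in os.walk(output_dir):
        for filename in sorted(files):
            if not filename.endswith(".json"):
                continue  # skip stray files such as notebook checkpoints
            path = os.path.join(folder, filename)
            # Blob names use "/" even when the local separator differs.
            blob_name = os.path.relpath(path, output_dir).replace(os.sep, "/")
            with open(path, "rb") as json_file:
                container_client.upload_blob(name=blob_name, data=json_file, overwrite=True)
            uploaded.append(blob_name)
    return uploaded
```

With the notebook's variables, the call would be `upload_json_files(container_client_upload, output_path)`. Returning the blob names makes it easy to log or verify what was pushed.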
    