<a href="https://colab.research.google.com/github/svetakvsundhar/beam/blob/healthcarenlp/examples/notebooks/healthcare/beam_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License

# **Natural Language Processing Pipeline**

**Note**: This example is adapted from [this notebook](https://github.com/rasalt/healthcarenlp/blob/main/nlp_public.ipynb).



This example demonstrates how to set up an Apache Beam pipeline that reads a file from [Google Cloud Storage](https://cloud.google.com/storage) and calls the [Google Cloud Healthcare NLP API](https://cloud.google.com/healthcare-api/docs/how-tos/nlp) to extract information from unstructured data. This application can be used in contexts such as reading scanned clinical documents and extracting structure from them.

An Apache Beam pipeline reads input data, transforms that data, and writes output data. It consists of PTransforms and PCollections. A PCollection represents a distributed data set that your Beam pipeline operates on. A PTransform represents a data processing operation, or a step, in your pipeline. It takes one or more PCollections as input, applies a processing function that you provide to the elements of those PCollections, and produces zero or more output PCollections.

For details about Apache Beam pipelines, including PTransforms and PCollections, visit the [Beam Programming Guide](https://beam.apache.org/documentation/programming-guide/).

You'll be able to use this notebook to explore the data in each PCollection.

**Instructions**
1. Set the variables in the configuration cell below based on your project and preferences.
2. The files referred to in this notebook (`nlpsample*.csv`) contain one clinical note excerpt per line.

In [None]:
pip install apache-beam[gcp]



Note that **us-central1** is hardcoded as the location below, because the API currently supports only a limited number of [locations](https://cloud.google.com/healthcare-api/docs/how-tos/nlp).

In [None]:
# Set to True to debug the pipeline with the Interactive Runner; otherwise the pipeline runs on Dataflow.
debug = False
DATASET="<YOUR_DATASET_NAME>"
TEMP_LOCATION="<YOUR_TEMP_LOCATION>"
PROJECT='<YOUR_PROJECT_ID>'
LOCATION='us-central1'
URL=f'https://healthcare.googleapis.com/v1/projects/{PROJECT}/locations/{LOCATION}/services/nlp:analyzeEntities'
NLP_SERVICE=f'projects/{PROJECT}/locations/{LOCATION}/services/nlp'
GCS_BUCKET=PROJECT
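As a rough sketch of how the two NLP endpoint strings above are derived, the helper below builds them from a project ID and location and rejects an unfilled placeholder. The function name `build_nlp_endpoints` and the project ID `my-project` are made up for illustration and are not part of the original notebook.

```python
def build_nlp_endpoints(project, location="us-central1"):
    """Return the NLP service name and analyzeEntities URL for a project.

    Hypothetical helper: mirrors the f-strings used for NLP_SERVICE and URL.
    """
    # Catch the common mistake of leaving the "<YOUR_PROJECT_ID>" placeholder in place.
    if not project or project.startswith("<"):
        raise ValueError("Set PROJECT to a real project ID before running.")
    service = f"projects/{project}/locations/{location}/services/nlp"
    url = f"https://healthcare.googleapis.com/v1/{service}:analyzeEntities"
    return {"nlp_service": service, "url": url}


# "my-project" is a made-up project ID used only for this example.
endpoints = build_nlp_endpoints("my-project")
print(endpoints["nlp_service"])
print(endpoints["url"])
```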

**BigQuery Setup**

We use BigQuery to warehouse the structured data returned by the Healthcare NLP API. For this purpose, we create three tables to organize the data: `entity`, `relations`, and `entitymentions`.

In [None]:
from google.cloud import bigquery

# Names of the BigQuery tables that will hold the NLP output.
TABLE_ENTITY="entity"
TABLE_REL="relations"
TABLE_ENTITYMENTIONS="entitymentions"

schemaEntity = [
    bigquery.SchemaField("entityId", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("preferredTerm", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("vocabularyCodes", "STRING", mode="REPEATED"),
]

schemaRelations = [
    bigquery.SchemaField("subjectId", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("objectId", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("confidence", "FLOAT64", mode="NULLABLE"),
    bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
]

schemaEntityMentions = [
    bigquery.SchemaField("mentionId", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("type", "STRING", mode="NULLABLE"),
    bigquery.SchemaField(
        "text",
        "RECORD",
        mode="NULLABLE",
        fields=[
            bigquery.SchemaField("content", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("beginOffset", "INTEGER", mode="NULLABLE"),
        ],
    ),
    bigquery.SchemaField(
        "linkedEntities",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("entityId", "STRING", mode="NULLABLE"),
        ],
    ),
    bigquery.SchemaField(
        "temporalAssessment",
        "RECORD",
        mode="NULLABLE",
        fields=[
            bigquery.SchemaField("value", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("confidence", "FLOAT64", mode="NULLABLE"),
        ],
    ),
    bigquery.SchemaField(
        "certaintyAssessment",
        "RECORD",
        mode="NULLABLE",
        fields=[
            bigquery.SchemaField("value", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("confidence", "FLOAT64", mode="NULLABLE"),
        ],
    ),
    bigquery.SchemaField(
        "subject",
        "RECORD",
        mode="NULLABLE",
        fields=[
            bigquery.SchemaField("value", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("confidence", "FLOAT64", mode="NULLABLE"),
        ],
    ),
    bigquery.SchemaField("confidence", "FLOAT64", mode="NULLABLE"),
    bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
]

# Create the client used to manage the tables below.
client = bigquery.Client()

# Create Table IDs
table_ent = PROJECT+"."+DATASET+"."+TABLE_ENTITY
table_rel = PROJECT+"."+DATASET+"."+TABLE_REL
table_mentions = PROJECT+"."+DATASET+"."+TABLE_ENTITYMENTIONS

# Delete the tables if they already exist.
client.delete_table(table_ent, not_found_ok=True)
client.delete_table(table_rel, not_found_ok=True)
client.delete_table(table_mentions, not_found_ok=True)

# Create tables

table = bigquery.Table(table_ent, schema=schemaEntity)
table = client.create_table(table)  # Make an API request.

print(
    "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)

table = bigquery.Table(table_rel, schema=schemaRelations)
table = client.create_table(table)  # Make an API request.
print(
    "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)
table = bigquery.Table(table_mentions, schema=schemaEntityMentions)
table = client.create_table(table)  # Make an API request.
print(
    "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)

Created table svetak-sandbox.sampledataset.entity
Created table svetak-sandbox.sampledataset.relations
Created table svetak-sandbox.sampledataset.entitymentions


**Pipeline Setup**

For experimentation, set `debug = True` to run the pipeline with the InteractiveRunner. Once you are comfortable with the pipeline, set `debug = False` to run it with the DataflowRunner.

In [None]:
# Python's regular expression library
import re
from sys import argv
# Beam and interactive Beam imports
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

#Reference https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#python_1
from apache_beam.options.pipeline_options import PipelineOptions
if debug:
    runnertype = "InteractiveRunner"
else:
    runnertype = "DataflowRunner"

options = PipelineOptions(
    flags=argv,
    runner=runnertype,
    project=PROJECT,
    job_name="my-healthcare-nlp-job",
    temp_location=TEMP_LOCATION,
    region=LOCATION)

The following defines a `PTransform` named `ReadLinesFromText` that reads lines from a file.

In [None]:
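class ReadLinesFromText(beam.PTransform):
    """Reads the lines of the text files matching `file_pattern`."""

    def __init__(self, file_pattern):
        super().__init__()
        self._file_pattern = file_pattern

    def expand(self, pcoll):
        # `pcoll` is the pipeline's starting point, so read from its pipeline.
        return (pcoll.pipeline
                | beam.io.ReadFromText(self._file_pattern))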

The following cells load the default Google Cloud credentials and create an Apache Beam pipeline with the runner selected above. A runner is an execution engine for Apache Beam pipelines; the *Interactive Runner* is the runner suited to notebooks.

In [None]:
from google.auth import default

# google.auth.default() returns a (credentials, project_id) tuple.
credentials, _ = default()

In [None]:
p = beam.Pipeline(options = options)

The following applies a PTransform that reads lines from a file in Google Cloud Storage. In our example, each line is a medical notes excerpt that will be passed through the Healthcare NLP API.

`|` is an overloaded operator that applies a PTransform to a PCollection to produce a new PCollection. Together with `|`, `>>` allows you to optionally name a PTransform.

Usage: `[PCollection] | [PTransform]` or `[PCollection] | [name] >> [PTransform]`
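To make the operator syntax less mysterious, here is a toy sketch in plain Python of how `|` and `>>` can be overloaded. `ToyCollection` and `ToyTransform` are made-up stand-ins, not Beam classes, and Beam's real implementation is considerably more involved. The sketch only shows the two hooks (`__or__` and `__rrshift__`) and why `"name" >> transform` is evaluated before `|`.

```python
class ToyTransform:
    """Stand-in for a PTransform: wraps a function applied to every element."""

    def __init__(self, fn, label=None):
        self.fn = fn
        self.label = label or fn.__name__

    def __rrshift__(self, label):
        # Called for `"name" >> transform`, because str does not define `>>`.
        return ToyTransform(self.fn, label)


class ToyCollection:
    """Stand-in for a PCollection: a list of elements plus the applied step names."""

    def __init__(self, elements, steps=()):
        self.elements = list(elements)
        self.steps = tuple(steps)

    def __or__(self, transform):
        # Called for `collection | transform`; returns a new collection.
        return ToyCollection(
            [transform.fn(e) for e in self.elements],
            self.steps + (transform.label,))


lines = ToyCollection(["  note one ", "note two  "])
# `>>` binds more tightly than `|`, so each transform is named before it is applied.
cleaned = lines | "strip" >> ToyTransform(str.strip) | "upper" >> ToyTransform(str.upper)
print(cleaned.elements)  # ['NOTE ONE', 'NOTE TWO']
print(cleaned.steps)     # ('strip', 'upper')
```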

In [None]:
# GCS_BUCKET holds only the bucket name, so build a full gs:// path to the sample file.
lines = p | 'read' >> ReadLinesFromText(f"gs://{GCS_BUCKET}/nlpsample500.csv")

In [None]:
if debug:
    ib.show(lines)

In [None]:
if debug:

    from google.auth import compute_engine
    credentials = compute_engine.Credentials()
    response = {}
    from google.auth.transport.requests import AuthorizedSession
    authed_session = AuthorizedSession(credentials)

    url = URL
    # Adjacent string literals are concatenated, which keeps the long sample note valid Python.
    value = (
        " operative suite and placed supine on the operating room table.  General endotracheal anesthesia was induced without incident.  The patient was then placed in a modified lithotomy position taking great care to pad all extremities.  TEDs and Venodynes were placed as prophylaxis against deep venous thrombosis.  Antibiotics were given for prophylaxis against surgical infection.,A 52-French bougie was placed in the proximal esophagus by Anesthesia, above the cardioesophageal junction.  A 2 cm midline incision was made at the junction of the upper two-thirds and lower one-third between the umbilicus and the xiphoid process.  The fascia was then cleared of subcutaneous tissue using a tonsil clamp.  A 1-2 cm incision was then made in the fascia gaining entry into the abdominal cavity without incident.  Two sutures of 0 Vicryl were then placed superiorly and inferiorly in the fascia, and then tied to the special 12 mm Hasson trocar fitted with a funnel-shaped adaptor in order to occlude the fascial opening.  Pneumoperitoneum was then established using carbon dioxide insufflation to a steady state of pressure of 16 mmHg.  A 30-degree laparoscope was inserted through this port and used to guide the remaining trocars.,The remaining trocars were then placed into the abdomen taking care to make the incisions along Langer's line, spreading the subcutaneous tissue with a tonsil clamp, and confirming the entry site by depressing the abdominal wall prior to insertion of the trocar.  A total of 4 other 10/11 mm trocars were placed.  Under direct vision 1 was inserted in the right upper quadrant at the midclavicular line, at a right supraumbilical position; another at the left upper quadrant at the midclavicular line, at a left supraumbilical position; 1 under the right costal margin in the anterior axillary line; and another laterally under the left costal margin on the anterior axillary line.  All of the trocars were placed without difficulty.  "
        "The patient was then placed in reverse Trendelenburg position.,The triangular ligament was taken down sharply, and the left lobe of the liver was retracted superolaterally using a fan retractor placed through the right lateral cannula.  The gastrohepatic ligament was then identified and incised in an avascular plane.  The dissection was carried anteromedially onto the phrenoesophageal membrane.  The phrenoesophageal membrane was divided on the anterior aspect of the hiatal orifice.  This incision was extended to the right to allow identification of the right crus.  Then along the inner side of the crus, the right esophageal wall was freed by dissecting the cleavage plane.,The liberation of the posterior aspect of the esophagus was started by extending the dissection the length of the right diaphragmatic crus.  The pars flaccida of the lesser omentum was opened, preserving the hepatic branches of the vagus nerve.  This allowed free access to the crura, left and right, and the right posterior aspect of the esophagus, and the posterior vagus nerve.,Attention was next turned to the left anterolateral aspect of the esophagus.  At its left border, the left crus was identified.  The dissection plane between it and the left aspect of the esophagus was freed.  The gastrophrenic ligament was incised, beginning the mobilization of the gastric pouch.  By dissecting the intramediastinal portion of the esophagus, we elongated the intra-abdominal segment of the esophagus and reduced the hiatal hernia.,The next step consisted of mobilization of the gastric pouch.  This required ligation and division of the gastrosplenic ligament and several short gastric vessels using the harmonic scalpel.  This dissection started on the stomach at the point where the vessels of the greater curvature turned towards the spleen, away from the gastroepiploic arcade.  The esophagus was lifted by a Babcock inserted through the left upper quadrant port.  "
        "Careful dissection of the mesoesophagus and the left crus revealed a cleavage plane between the crus and the posterior gastric wall.  Confirmation of having opened the correct plane was obtained by visualizing the spleen behind the esophagus.  A one-half inch Penrose drain was inserted around the esophagus and sewn to itself in order to facilitate retraction of the distal esophagus.  The retroesophageal channel was enlarged to allow easy passage of the antireflux valve.,The 52-French bougie was then carefully lowered into the proximal stomach, and the hiatal orifice was repaired.  Two interrupted 0 silk sutures were placed in the diaphragmatic crura to close the orifice.,The last part of the operation consisted of the passage and fixation of the antireflux valve.  With anterior retraction on the esophagus using the Penrose drain, a Babcock was passed behind the esophagus, from right to left.  It was used to grab the gastric pouch to the left of the esophagus and to pull it behind, forming the wrap.  The,52-French bougie was used to calibrate the external ring.  Marcaine 0.5% was injected 1 fingerbreadth anterior to the anterior superior iliac spine and around the wound for postanesthetic pain control.  The skin incision was approximated with skin staples.  A dressing was then applied.  All surgical counts were reported as correct.,Having tolerated the procedure well, the patient was subsequently taken to the recovery room in good and stable condition."
    )

    payload = {
           'nlp_service': NLP_SERVICE,
            'document_content': value
    }
    res = authed_session.post(url, data=payload)
    response = res.json()
    response['id'] = 2389

    print(response)

In [None]:
class InvokeNLP(beam.DoFn):
    """Sends each input line to the Healthcare NLP API and yields the JSON response."""

    def setup(self):
        # Create the authorized session once per DoFn instance rather than once per element.
        from google.auth import compute_engine
        from google.auth.transport.requests import AuthorizedSession
        self._session = AuthorizedSession(compute_engine.Credentials())

    def process(self, element):
        import uuid
        payload = {
            'nlp_service': NLP_SERVICE,
            'document_content': element
        }
        resp = self._session.post(URL, data=payload)
        response = resp.json()
        # Tag the response with a short random ID so rows from the same note can be joined later.
        response['id'] = uuid.uuid4().hex[:8]
        yield response

class AnalyzeLines(beam.PTransform):
    def expand(self, pcoll):
        return (
            pcoll
            | "Invoke NLP API" >> beam.ParDo(InvokeNLP())
        )
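The `InvokeNLP` DoFn above yields whatever JSON the API returns, including error bodies. Google APIs generally report failures as a JSON object with a top-level `error` field, so one possible guard is sketched below. The helper name `classify_response` and both sample bodies are made up for illustration and are not real API output.

```python
def classify_response(response):
    """Return ("error", message) for an error body, else ("ok", response).

    Hypothetical helper, assuming errors arrive as {"error": {"code", "message", ...}}.
    """
    error = response.get("error")
    if error is not None:
        return "error", f"{error.get('code')}: {error.get('message')}"
    return "ok", response


# Both bodies below are made-up examples.
ok_body = {"entityMentions": [], "entities": [], "relationships": []}
error_body = {
    "error": {"code": 403, "message": "Permission denied", "status": "PERMISSION_DENIED"}
}

print(classify_response(ok_body)[0])   # ok
print(classify_response(error_body))   # ('error', '403: Permission denied')
```

A check like this could run inside the DoFn before yielding, so that failed calls are logged or routed elsewhere instead of reaching the BigQuery writes.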

In [None]:
class getEntityMentions(beam.DoFn):
    """Yields each entity mention, tagged with the ID of its source response."""
    def process(self, element):
        for e in element.get('entityMentions', []):
            # Yield a copy so the input element is not mutated.
            yield {**e, 'id': element['id']}

class getRelationships(beam.DoFn):
    """Yields each relationship, tagged with the ID of its source response."""
    def process(self, element):
        for e in element.get('relationships', []):
            yield {**e, 'id': element['id']}

class breakUpEntities(beam.DoFn):
    """Yields each linked entity found in the response."""
    def process(self, element):
        for e in element.get('entities', []):
            yield e
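To illustrate what the three DoFns above emit, the sketch below applies the same flattening logic to a single hand-written response using plain generator functions. The sample response and all of its values are made up; only the field names follow the BigQuery schemas defined earlier. The function names are hypothetical and exist only for this example.

```python
# Made-up response; field names follow the table schemas, values are placeholders.
sample_response = {
    "id": "a1b2c3d4",
    "entityMentions": [
        {"mentionId": "1", "type": "PROCEDURE",
         "text": {"content": "incision", "beginOffset": 10}},
        {"mentionId": "2", "type": "PROCEDURE",
         "text": {"content": "dressing", "beginOffset": 42}},
    ],
    "entities": [
        {"entityId": "EXAMPLE/0001", "preferredTerm": "example term",
         "vocabularyCodes": ["EXAMPLE/0001"]},
    ],
    "relationships": [
        {"subjectId": "1", "objectId": "2", "confidence": 0.9},
    ],
}


def entity_mention_rows(response):
    """One row per entity mention, tagged with the response ID."""
    for mention in response.get("entityMentions", []):
        yield {**mention, "id": response["id"]}


def relationship_rows(response):
    """One row per relationship, tagged with the response ID."""
    for relationship in response.get("relationships", []):
        yield {**relationship, "id": response["id"]}


def entity_rows(response):
    """One row per linked entity; these carry no response ID."""
    yield from response.get("entities", [])


mentions = list(entity_mention_rows(sample_response))
relations = list(relationship_rows(sample_response))
entities = list(entity_rows(sample_response))
print(len(mentions), len(relations), len(entities))  # 2 1 1
print(relations[0])
```

Each generator corresponds to one of the three BigQuery tables, which is why the pipeline branches three ways from the same annotated collection.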

In [None]:
from apache_beam.io.gcp.internal.clients import bigquery


table_spec = bigquery.TableReference(
    projectId=PROJECT,
    datasetId=DATASET,
    tableId=TABLE_ENTITY)

nlp_annotations = (lines
                | "Analyze" >> AnalyzeLines()
                  )


In [None]:
if debug:
    ib.show(nlp_annotations) #, visualize_data=True)

In [None]:
# Each transform label should be unique within the pipeline.
resultsEntities = (
    nlp_annotations
    | "Break" >> beam.ParDo(breakUpEntities())
    | "WriteEntitiesToBigQuery" >> beam.io.WriteToBigQuery(
        table_spec,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
)

In [None]:
if debug:
    ib.show(resultsEntities) #, visualize_data=True)

In [None]:
table_spec = bigquery.TableReference(
    projectId=PROJECT,
    datasetId=DATASET,
    tableId=TABLE_REL)

# Each transform label should be unique within the pipeline.
resultsRelationships = (
    nlp_annotations
    | "GetRelationships" >> beam.ParDo(getRelationships())
    | "WriteRelationshipsToBigQuery" >> beam.io.WriteToBigQuery(
        table_spec,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
)

In [None]:
if debug:
    ib.show(resultsRelationships) #, visualize_data=True)

In [None]:
table_spec = bigquery.TableReference(
    projectId=PROJECT,
    datasetId=DATASET,
    tableId=TABLE_ENTITYMENTIONS)

# Each transform label should be unique within the pipeline.
resultsEntityMentions = (
    nlp_annotations
    | "GetEntityMentions" >> beam.ParDo(getEntityMentions())
    | "WriteEntityMentionsToBigQuery" >> beam.io.WriteToBigQuery(
        table_spec,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
)

In [None]:
if debug:
    ib.show(resultsEntityMentions) #, visualize_data=True)

To see the job graph for the pipeline, run:

In [None]:
ib.show_graph(p)

In [None]:
result = p.run()
result.wait_until_finish()