# Semantic Role Labeling

Partner: Canada Digital Analytics Team

Mentor: Jungyeul Park

Version: 2021.05.24

Author: Alex Chen

## Overview

### Why should I care about this?

- **TL;DR:**  

This model can parse a sentence and help answer **the question of who did what to whom** quickly and automatically.

- The question:

> Given a comment, can we fill in the relevant information in a table like this?

| Agent | Action | Item | Extra |
|-------|:------:|:----:|-------|
| Fed Govt|needs|provide info|re travel|

- The status quo:  

    If the user wants to fill out the table above, the user currently needs to read a comment presented in plain text:
    > _Fed Govt needs to make it mandatory that Ontario provide detailedinfo 1) Cumulative percent of adult population who have received at least one dose of a COVID-19 vaccine 2) Distribution by PH Unit geographic catchment 3) Naming of # distributed to each health care institution_

    The user then has to manually identify the key parts of the message.

- The problem:

    It takes time to read the whole sentence and then identify key information such as "who" and "did what".  
    It may get even slower by the user's 100th comment of the day.

- The solution:  

    Semantic role labeling (SRL) can extract linguistic features and highlight parts of a sentence according to their meaning. This makes faster processing possible.  
    For example, given a sentence:  
    > John met him privately in Ottawa, May 3.

    the features are:

    > ![](img/srl-frames-demo-highlight.png)


### What are these?

They are labels generated by SRL. 

### How does SRL come up with them? 

1. SRL looks at a sentence from different angles. Each angle is a set of **frames**.  Each set of frames is about a verb. 

2. A frame is either a verb or one of its "associates".  
   With the example above, the model predicts the frames (and their labels) as follows:
   - who: John
   - did what: met
   - whom: him
   - manner: privately
   - location: in Ottawa
   - time: May 3
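
The roles listed above correspond to PropBank-style labels, the same label set that appears in the color map later in this notebook (`V`, `ARG0`, `ARG1`, `ARGM-LOC`, `ARGM-TMP`). Below is a minimal sketch of one set of frames as a plain Python structure. The label-to-phrase pairs are written by hand as an assumption (in particular `ARGM-MNR` for manner follows PropBank conventions and is not read off the figure), so treat it as an illustration rather than captured model output.

```python
# One set of frames for the verb "met", written out by hand for illustration.
# The label-to-phrase pairs are an assumption, not captured model output.
frame_set = [
    ("ARG0", "John"),           # who
    ("V", "met"),               # did what
    ("ARG1", "him"),            # whom
    ("ARGM-MNR", "privately"),  # manner
    ("ARGM-LOC", "in Ottawa"),  # location
    ("ARGM-TMP", "May 3"),      # time
]


def describe(frames):
    """Render a frame set in the bracketed '[LABEL: text]' style."""
    return " ".join(f"[{label}: {text}]" for label, text in frames)


print(describe(frame_set))
```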

### What's the upside of using SRL?

SRL can:  

- automatically process comments in bulk,
- present the parsed results, with highlights, to the user,
- save the results into a table, so the user can look for patterns in it.

### Show me a real-world example.

Given a comment:
> Critical missing info: Fed Govt needs to make it mandatory that Ontario provide detailedinfo 1) Cumulative percent of adult population who have received at least one dose of a COVID-19 vaccine 2) Distribution by PH Unit geographic catchment 3) Naming of # distributed to each health care institution.

SRL can process it in at least 2 ways:
![srl-two-sets-of-frames-demo-highlight](img/srl-two-sets-of-frames-demo-highlight.png)

The program can then add 2 rows to a table:

| Agent    |  Action |                                                                                                                                 Item                                                                                                                                | Extra                                                                                                                                                                                                                                                                              |
|----------|:-------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| N/A      | missing |                                                                                                                                 info                                                                                                                                | : Fed Govt needs to make it mandatory that Ontario provide detailedinfo 1) Cumulative percent of adult population who have received at least one dose of a COVID-19 vaccine 2) Distribution by PH Unit geographic catchment 3) Naming of # distributed to each health care institution |
| Fed Govt |  needs  | make it mandatory that Ontario provide detailedinfo 1) Cumulative percent of adult population who have received at least one dose of a COVID-19 vaccine 2) Distribution by PH Unit geographic catchment 3) Naming of # distributed to each health care institution. | N/A                                                                                                                                                                                                                                                                                    |
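
One way to go from a set of frames to a table row is a fixed mapping from role labels to columns. The sketch below assumes that `ARG0` fills Agent, `V` fills Action, `ARG1` fills Item, and every other role is joined into Extra. This mapping and the helper name `frames_to_row` are illustrative choices, not necessarily how the table above was produced.

```python
def frames_to_row(frames):
    """Map (label, text) pairs for one verb onto the four table columns."""
    # Assumed mapping from role labels to columns (illustrative only).
    column_for = {"ARG0": "Agent", "V": "Action", "ARG1": "Item"}
    row = {"Agent": "N/A", "Action": "N/A", "Item": "N/A", "Extra": "N/A"}
    extras = []
    for label, text in frames:
        column = column_for.get(label)
        if column is None:
            extras.append(text)   # any other role goes to Extra
        elif row[column] == "N/A":
            row[column] = text    # keep the first phrase for each column
    if extras:
        row["Extra"] = "; ".join(extras)
    return row


example = [("ARG0", "Fed Govt"), ("V", "needs"), ("ARG1", "to make it mandatory")]
print(frames_to_row(example))
```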

Now apply SRL to tens of thousands of comments, and the user will have a table with substantial data. They can use it to answer questions such as:

- What information is said to be **missing**?
- What do the comments say the Federal Government **needs** to do?
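
Once the rows are in a table, questions like these become simple filters. Here is a minimal sketch over a list of row dictionaries. The rows are shortened by hand from the table above, and the helper `items_for_action` is a hypothetical name.

```python
# Rows shortened by hand from the table above.
rows = [
    {"Agent": "N/A", "Action": "missing", "Item": "info"},
    {"Agent": "Fed Govt", "Action": "needs",
     "Item": "make it mandatory that Ontario provide detailedinfo"},
]


def items_for_action(rows, action, agent=None):
    """Return the Item of every row with the given Action (and Agent, if set)."""
    return [
        row["Item"]
        for row in rows
        if row["Action"] == action and (agent is None or row["Agent"] == agent)
    ]


print(items_for_action(rows, "missing"))
print(items_for_action(rows, "needs", agent="Fed Govt"))
```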

## Theory

- Definition:  
Semantic Role Labeling (SRL) is the task of determining the latent predicate argument structure of a sentence and providing representations that can answer basic questions about sentence meaning, including who did what to whom, etc.

## Implementation

### Tech specs:

- back-end: SRL BERT via AllenNLP

- front-end: HTML with custom CSS

### Set-up

In [3]:
import pandas as pd
import spacy
import re
from IPython.display import clear_output
from IPython.display import display, HTML

In [4]:
!pip install allennlp==2.1.0 allennlp-models==2.1.0
clear_output()
print('model installed.')

model installed.


In [5]:
ALLENNLP_MODEL_PATH = 'https://storage.googleapis.com/allennlp-public-models/structured-prediction-srl-bert.2020.12.15.tar.gz'


In [6]:
# this can take a while for the first call
from allennlp.predictors.predictor import Predictor
import allennlp_models.tagging
predictor = Predictor.from_path(ALLENNLP_MODEL_PATH)
clear_output()
print('model loaded.')

model loaded.


In [7]:
css_name = 'highlight.css'
css_link = 'https://github.com/sattree/gpr_pub/blob/master/visualization/highlight.css?raw=true'
!wget $css_link -O $css_name
with open(css_name, 'r') as file:
    css_str = file.read()

clear_output()

from IPython.display import HTML, display #,Math

def set_css_in_cell_output():
  display(HTML(css_str))

get_ipython().events.register('pre_run_cell', set_css_in_cell_output)

### A single-comment pipeline

Input: a comment

Output: a dataframe with one row per frame, each holding an HTML object

#### Input to model

In [8]:
test_comment = """Critical missing info:
 
 Fed Govt needs to make it mandatory that Ontario provide detailedinfo
 
 1) Cumulative percent of adult population who have received at least one dose of a COVID-19 vaccine
 
 2) Distribution by PH Unit geographic catchment
 
 3) Naming of # distributed to each health care institution"""

test_al_output = predictor.predict(sentence=test_comment)
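
The cells that follow read `verbs`, `verb` and `description` from the dictionary returned by the predictor. Below is a hand-written stand-in with that shape, to show how those fields are accessed. It is not real model output, and the `words` and `tags` fields are included on the assumption that the AllenNLP SRL predictor also returns them.

```python
# Hand-written stand-in for the predictor output; values are illustrative only.
sample_output = {
    "words": ["John", "met", "him"],
    "verbs": [
        {
            "verb": "met",
            "description": "[ARG0: John] [V: met] [ARG1: him]",
            "tags": ["B-ARG0", "B-V", "B-ARG1"],  # assumed field
        }
    ],
}

# One entry per verb; each entry is one set of frames.
for frame in sample_output["verbs"]:
    print(frame["verb"], "->", frame["description"])
```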


#### Parse raw output

In [9]:
test_annos = test_al_output['verbs']
i = 0
test_anno = test_annos[i]['description']

In [10]:
def anno2lst(anno):

    # helper functions begin

    def splitkeepsep(s, sep,Append=True,RemoveTrailing=True):
        if Append:
            result = [x+sep for x in s.split(sep)]
            if RemoveTrailing:
                result[-1] = result[-1][:-len(sep)]
        else:
            result = [sep+x for x in s.split(sep)]
            if RemoveTrailing:
                result[0] = result[0][len(sep):]
        
        return result

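    def str2tpl(s):
        # '[ARG0: John]' -> ('ARG0', 'John'); split on the first ': ' only,
        # so that text which itself contains ': ' is kept whole
        l = s[1:-1].split(': ', 1)
        return (l[0],l[1])

    def convert(e):
        # only bracketed '[LABEL: text]' chunks become (label, text) tuples
        if e.startswith('[') and e.endswith(']') and ': ' in e:
            return str2tpl(e)
        else:
            return e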

    def getColor(tp):
        if tp:
            colorDict = {
                'V': 'blue',
                'ARG0': 'green',
                'ARG1': 'green',
                'ARG2': 'green',
                'R-ARG0': 'purple',
                'ARGM-TMP':'teal',
                'ARGM-LOC':'teal',
            }
            return colorDict.get(tp, '')
        else:
            return tp
    def format(ipt):
        if isinstance(ipt, str):
            out_str = ipt
        if isinstance(ipt, tuple):
            tpl = ipt
            tp, st = tpl[0].strip(), tpl[1]
            color = getColor(tp)
            parts = [
                    '<span class=\"highlight ',
                     color,
                     '\">',
                    '<span class=\"highlight__label\"><strong>',
                    tp,
                    '</strong></span>',
                    st,
                    '</span>'
                    ]
            out_str = ''.join(parts)
        return out_str

    # main function begin
    anno = anno.replace("\n", "")
    l = splitkeepsep(anno, '] ',Append=True,RemoveTrailing=True)
    l = [splitkeepsep(e, ' [',Append=False,RemoveTrailing=True) for e in l]
    nl = []
    for e in l:
        if isinstance(e, list):
            nl.extend(e)
        if isinstance(e,str):
            nl.append(e)
    l = nl
    l = [e.strip() for e in l]
    l = [convert(e) for e in l]
    l = [format(e) for e in l]
    return l
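
`anno2lst` recovers the labelled spans by splitting the `description` string on brackets. If the predictor output also includes per-token BIO `tags` next to `words` (an assumption about the output format, worth checking against `test_al_output`), the same spans could be built without string splitting. A minimal sketch with a hypothetical helper `bio_to_spans`, run on hand-written tokens and tags:

```python
def bio_to_spans(words, tags):
    """Group tokens into (label, text) spans; unlabelled ('O') runs get label None."""
    spans = []  # each entry is [label, list_of_tokens]
    for word, tag in zip(words, tags):
        if tag == "O":
            label = None
            starts_new = not (spans and spans[-1][0] is None)
        else:
            prefix, label = tag.split("-", 1)  # 'B-ARGM-LOC' -> 'B', 'ARGM-LOC'
            starts_new = prefix == "B" or not (spans and spans[-1][0] == label)
        if starts_new:
            spans.append([label, [word]])
        else:
            spans[-1][1].append(word)
    return [(label, " ".join(tokens)) for label, tokens in spans]


# Hand-written example, not model output.
words = ["John", "met", "him", "in", "Ottawa", "."]
tags = ["B-ARG0", "B-V", "B-ARG1", "B-ARGM-LOC", "I-ARGM-LOC", "O"]
print(bio_to_spans(words, tags))
```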

#### Render parsed output

In [11]:
def anno2html(anno):
    lst = anno2lst(anno)
    single_str = ''.join(lst)
    html =  HTML(single_str)
    return html

In [12]:
anno2html(test_anno)

<hr>
Note:
GitHub does not render the output above properly. The actual rendering looks like this:
<br>

![srl-anno2html-output](img/srl-anno2html-output.png)
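
The spans in `anno2lst` are assembled by plain string concatenation, so a comment containing `<`, `>` or `&` would end up in the HTML unescaped. Below is a minimal sketch of building one highlight span with escaping from the standard library. The class names mirror the ones used in `anno2lst`, and `make_span` is a hypothetical helper.

```python
from html import escape


def make_span(label, text, color=""):
    """Build one highlighted span, escaping the label and the comment text."""
    return (
        f'<span class="highlight {escape(color)}">'
        f'<span class="highlight__label"><strong>{escape(label)}</strong></span>'
        f"{escape(text)}</span>"
    )


print(make_span("ARG1", "doses < 100 & rising", "green"))
```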

#### Construct a dataframe

In [13]:
def alout2df(id,alout):
    names = []
    htmls = []
    frames = alout['verbs']
    for frame in frames:
        anno = frame['description']
        name = frame['verb']
        html = anno2html(anno)
        names.append(name)
        htmls.append(html)
    
    df = pd.DataFrame(
    {'frame': names,
     'viz': htmls
    })
    
    df.insert(0, 'comment_id', id)
    return df


In [14]:
test_comment_id = '-'
alout2df(test_comment_id, test_al_output).head(3)

|   | comment_id | frame | viz |
|---|:----------:|:-----:|-----|
| 0 | - | missing | `<IPython.core.display.HTML object>` |
| 1 | - | needs | `<IPython.core.display.HTML object>` |
| 2 | - | make | `<IPython.core.display.HTML object>` |


In [15]:
# This is the final single-comment pipeline
def idcmt2df(id, cmt, predictor):
    """
    input:
        id(str): Unique ID for every comment
        cmt(str): Content of comment 
        predictor: pre-trained model
    output:
        a pandas dataframe with comment_id and SRL viz.
    """
    alout = predictor.predict(sentence=cmt)
    return alout2df(id, alout)
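
Because `idcmt2df` takes the predictor as an argument, the pipeline can be exercised without downloading the model by passing in a stand-in object. This is a minimal sketch, assuming only that the predictor exposes `predict(sentence=...)` and returns a dictionary with a `verbs` list. `StubPredictor` and `frames_for_comment` are hypothetical names, and the latter returns plain tuples rather than a dataframe to keep the example dependency-free.

```python
class StubPredictor:
    """Stand-in that returns a canned answer instead of running the model."""

    def predict(self, sentence):
        # Canned, hand-written output; not produced by the real model.
        return {
            "verbs": [
                {
                    "verb": "needs",
                    "description": "[ARG0: Fed Govt] [V: needs] [ARG1: more data]",
                }
            ]
        }


def frames_for_comment(comment_id, comment, predictor):
    """Return one (comment_id, verb, description) tuple per predicted frame."""
    output = predictor.predict(sentence=comment)
    return [(comment_id, f["verb"], f["description"]) for f in output["verbs"]]


print(frames_for_comment("c-1", "Fed Govt needs more data", StubPredictor()))
```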

## Scale up

Now that we have a pipeline working for a single comment, we can use it to process more.

### Load data

In [16]:
def gs2tsv(FILE_ID,outname='input.csv'):
    # export the Google Sheet as tab-separated values; note that the file is
    # TSV even when it is saved under the default '.csv' name
    tsv_link = 'https://docs.google.com/spreadsheets/d/'+FILE_ID+'/export?format=tsv'
    !wget $tsv_link -O $outname
    clear_output()
    print(outname, 'downloaded.')
    return


In [17]:

FILE_ID_FROM_YUNDONG = '1FL1pA7hjayBd8Ffptzj1RkpqtL5-Q7vnK1rDD3ox2Xc'
outname = 'input.tsv'
gs2tsv(FILE_ID_FROM_YUNDONG,outname)

input.tsv downloaded.
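
The `!wget` line only works inside a notebook. Outside one, the same export URL could be fetched with the standard library. This is a minimal sketch: `export_url` and `download_tsv` are hypothetical names, and the download assumes the sheet is shared so that it can be read without signing in.

```python
from urllib.request import urlretrieve


def export_url(file_id):
    """Build the TSV export URL for a Google Sheets file ID."""
    return "https://docs.google.com/spreadsheets/d/" + file_id + "/export?format=tsv"


def download_tsv(file_id, outname="input.tsv"):
    """Download the sheet as TSV and return the local file name."""
    urlretrieve(export_url(file_id), outname)
    return outname


# Only the URL is built here; call download_tsv(...) to actually fetch the file.
print(export_url("FILE_ID"))
```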


In [18]:
def tsv2df(inname='input.csv'):
    indf = pd.read_csv(inname, sep='\t', header=0)
    return indf
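
`tsv2df` passes `sep='\t'` because the export is tab-separated. For reference, the same parsing step with the standard library might look like the sketch below. The column names follow the input sheet, while the two rows are made up for illustration.

```python
import csv
import io

# Two made-up rows in the same column layout as the input sheet.
sample_tsv = (
    "Unique ID\tDomain\tComment\tTags\n"
    "id-001\tVaccine\tFirst example comment\tExample tag\n"
    "id-002\tVaccine\tSecond example comment\tExample tag\n"
)

# DictReader uses the header row as keys; delimiter="\t" selects TSV parsing.
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
records = list(reader)
print(records[0]["Unique ID"], records[0]["Comment"])
```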


In [19]:

inname= 'input.tsv'
test_indf = tsv2df(inname)
test_indf.head(3)

|   | Unique ID | Domain | Comment | Tags |
|---|-----------|:------:|---------|------|
| 0 | 6070f5c0800d871e0c75d919 | Vaccine | I got my jab on March 29. Your literature says... | Vaccine effectiveness / delayed dosage |
| 1 | 606ac6aa8d190c273ca7ebe3 | Vaccine | How reliable the shipment is ?? Spending o... | Data and tracking vaccines |
| 2 | 601c05426c4b8d189822fcec | Vaccine | Critical missing info: Fed Govt needs to ma... | Data and tracking vaccines |


In [20]:
def df2df(indf):
    dfs = []
    for idx, row in indf.iterrows():
        id = row['Unique ID']
        cmt = row['Comment']
        # one small dataframe per comment, one row per frame
        dfs.append(idcmt2df(id, cmt, predictor=predictor))
    # concatenate once, after the loop, instead of on every iteration
    out_df = pd.concat(dfs)
    return out_df.reset_index()


In [22]:
def fileId2df(file_id):
    gs2tsv(file_id)
    indf = tsv2df()
    outdf = df2df(indf)
    return outdf

test_file_id = FILE_ID_FROM_YUNDONG
outdf = fileId2df(test_file_id)
outdf.head(3)

input.csv downloaded.


|   | index | comment_id | frame | viz |
|---|:-----:|------------|:-----:|-----|
| 0 | 0 | 6070f5c0800d871e0c75d919 | got | `<IPython.core.display.HTML object>` |
| 1 | 1 | 6070f5c0800d871e0c75d919 | says | `<IPython.core.display.HTML object>` |
| 2 | 2 | 6070f5c0800d871e0c75d919 | need | `<IPython.core.display.HTML object>` |


### Display

In [24]:
ids = list(outdf['comment_id'].unique())

In [33]:
id2cmtDict  = pd.Series(test_indf.Comment.values,index=test_indf['Unique ID']).to_dict()

In [None]:
def dfid2viz(df, id):
    sub_df = df.loc[df['comment_id'] == id]
    sub_df = sub_df.reset_index()
    n_frame = len(sub_df)
    cmt = id2cmtDict[id]
    parts = [
             "{} total frames for the following comment: <br><br><i>{}</i><br><br>".format(n_frame, cmt),
    ]
    display(HTML(''.join(parts)))
    
    display(HTML('<hr>'))
    for i in range(n_frame):
        name = sub_df['frame'][i]
        display(HTML('<b>Frames for <span class=\"highlight blue\">{}</span></b>:\n'.format(name)))
        display(sub_df['viz'][i])
        display(HTML('<hr><br>'))
    print('\n\n\n')
    return

test_id = '601c05426c4b8d189822fcec'
dfid2viz(outdf, test_id)
# excerpt of output (last 2 sets of frames) shown in markdown cell below:

![srl-frames-highlight-output-excerpt](img/srl-frames-highlight-output-excerpt.png)

In [27]:
## Uncomment and run if you want output for all comments loaded from input.tsv

# for id in ids:
#     dfid2viz(outdf, id)

## Next steps

We would like to hear from our partner and our mentor before taking the next step. But here are some ideas we thought of:

|           Done           |                To-do                |                            So that the user can answer...                            |
|:------------------------:|:-----------------------------------:|:------------------------------------------------------------------------------------:|
|       Visualization      |            Query function           | Which word shows up most often with the verb "missing"? |
| Comment-level drill-down |       Higher-level aggregation      | What are the top 3 locations (e.g. cities) mentioned in all comments on the vaccine page? |
| Input data as is         | Pre-process text (e.g. spell-check) | Can we clean the data first to make better use of it? |
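
The aggregation idea in the table could be prototyped with a counter over labelled spans. This is a minimal sketch, assuming locations are tagged `ARGM-LOC` and that spans are available as `(label, text)` pairs. The spans below are made up for illustration, and `top_phrases` is a hypothetical helper.

```python
from collections import Counter


def top_phrases(spans, label, n=3):
    """Return the n most frequent phrases (lower-cased) carrying the given label."""
    counts = Counter(
        text.lower() for span_label, text in spans if span_label == label
    )
    return counts.most_common(n)


# Made-up spans, standing in for the output of many comments.
spans = [
    ("ARGM-LOC", "in Ottawa"),
    ("ARG0", "Fed Govt"),
    ("ARGM-LOC", "in Toronto"),
    ("ARGM-LOC", "in Ottawa"),
]
print(top_phrases(spans, "ARGM-LOC"))
```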