# Using Semantic Role Labeling for Text Classification

Partner: Canada Digital Analytics Team

Mentor: Jungyeul Park

Version: 2021.05.28

Author: Alex Chen

## Overview

### Objectives

- Set up a Semantic Role Labeling (SRL) pipeline
- Use it to extract linguistic features from a sentence
- (Optional) Transform output to be used as input for other model(s)
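
As one possible reading of the optional third objective, the sketch below turns a comma-delimited tag string into per-role span counts that a downstream classifier could consume. The function name `role_counts` and the choice of counts as features are assumptions made for illustration, not part of the project plan.

```python
from collections import Counter

def role_counts(tag_string):
    """Count how many spans of each role appear in a comma-delimited BIO tag string."""
    counts = Counter()
    for tag in tag_string.split(','):
        # A span starts at its B- tag; I- tags continue a span and O marks no role.
        if tag.startswith('B-'):
            counts[tag[2:]] += 1
    return dict(counts)

features = role_counts('B-ARG0,I-ARG0,I-ARG0,B-V,B-ARG4,I-ARG4,O')
print(features)
```

An empty tag string, which the pipeline returns when no frameset is detected, yields an empty dictionary.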

### Deliverable

A function:
- input (str): `U.N. official Ekeus goes to Baghdad.`
- output (str): `B-ARG0,I-ARG0,I-ARG0,B-V,B-ARG4,I-ARG4,O`
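
The output carries one tag per token, so the seven tags above line up with seven tokens. The sketch below pairs them; the token list is written by hand for illustration and is an assumption, since the real tokens come from the model's own tokenizer (the predictor output is expected to expose them under a `words` key, which is worth verifying). `align_tokens` is a hypothetical helper name.

```python
def align_tokens(tokens, tag_string):
    """Pair each token with its BIO tag, failing loudly on a length mismatch."""
    tags = tag_string.split(',')
    if len(tokens) != len(tags):
        raise ValueError(
            'expected one tag per token, got %d tokens and %d tags'
            % (len(tokens), len(tags)))
    return list(zip(tokens, tags))

# Assumed tokenization of the example sentence, for illustration only.
tokens = ['U.N.', 'official', 'Ekeus', 'goes', 'to', 'Baghdad', '.']
pairs = align_tokens(tokens, 'B-ARG0,I-ARG0,I-ARG0,B-V,B-ARG4,I-ARG4,O')
for token, tag in pairs:
    print(token, tag)
```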



## Implementation



### Tech specs

- model: SRL BERT via AllenNLP

### Set-up

In [3]:
import pandas as pd
import spacy
import re
from IPython.display import clear_output
# from IPython.display import display, HTML

In [4]:
!pip install allennlp==2.1.0 allennlp-models==2.1.0
clear_output()
print('model installed.')

model installed.


In [5]:
ALLENNLP_MODEL_PATH = 'https://storage.googleapis.com/allennlp-public-models/structured-prediction-srl-bert.2020.12.15.tar.gz'


In [6]:
# this can take a while for the first call
from allennlp.predictors.predictor import Predictor
import allennlp_models.tagging
pretrained_precitor = Predictor.from_path(ALLENNLP_MODEL_PATH)
clear_output()
print('model loaded.')

model loaded.


### Build a single-comment pipeline


#### Write a function

In [7]:
def sentence2roles(predictor, sentence):
    """
    Convert a sentence to the semantic role tags of its words.
    :param predictor: pre-trained SRL model
    :param sentence (str): a sentence
    :return (str): comma-delimited BIO role tags, one per token;
        an empty string if the model detects no frameset
    """
    model_output = predictor.predict(sentence=sentence)
    verbs = model_output['verbs']

    i = 0  # by default, only use the first frameset
    try:
        return ','.join(verbs[i]['tags'])
    except IndexError:  # the SRL model detected no frameset
        return ''
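
The function above reads only the first frameset. A sentence with several verbs yields one frameset per verb, so a variant that keeps all of them may be useful later. The sketch below works on an already computed model output; `output2roles_all` is a hypothetical name, and `fake_output` is a hand-written stand-in that mimics the `verbs` and `tags` keys used above, not real model output.

```python
def output2roles_all(model_output):
    """Return one comma-delimited tag string per detected frameset."""
    return [','.join(frame['tags']) for frame in model_output.get('verbs', [])]

# Hand-written stand-in for a predictor output, for illustration only.
fake_output = {
    'words': ['She', 'said', 'he', 'left'],
    'verbs': [
        {'verb': 'said', 'tags': ['B-ARG0', 'B-V', 'B-ARG1', 'I-ARG1']},
        {'verb': 'left', 'tags': ['O', 'O', 'B-ARG0', 'B-V']},
    ],
}

for roles in output2roles_all(fake_output):
    print(roles)
```

With no frameset detected, the function returns an empty list rather than an empty string.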



#### Run a test

In [8]:
test_comment = """U.N. official Ekeus goes to Baghdad."""

sentence2roles(predictor=pretrained_precitor, sentence=test_comment)

'B-ARG0,I-ARG0,I-ARG0,B-V,B-ARG4,I-ARG4,O'
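
To read the tag string more easily, the BIO tags can be grouped back into labeled spans. This is a minimal sketch: `bio_to_spans` is a hypothetical helper, and the token list is an assumed tokenization of the test comment written by hand.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (role, text) spans, in sentence order."""
    spans = []
    role, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A B- tag closes any open span and opens a new one.
            if role is not None:
                spans.append((role, ' '.join(words)))
            role, words = tag[2:], [token]
        elif tag.startswith('I-') and role == tag[2:]:
            # An I- tag with the same role continues the open span.
            words.append(token)
        else:
            # An O tag (or a stray I- tag) closes the open span; the token is dropped.
            if role is not None:
                spans.append((role, ' '.join(words)))
            role, words = None, []
    if role is not None:
        spans.append((role, ' '.join(words)))
    return spans

# Assumed tokenization of the test comment, for illustration only.
tokens = ['U.N.', 'official', 'Ekeus', 'goes', 'to', 'Baghdad', '.']
tags = 'B-ARG0,I-ARG0,I-ARG0,B-V,B-ARG4,I-ARG4,O'.split(',')
spans = bio_to_spans(tokens, tags)
print(spans)
```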

At this point, the deliverable is complete.

Once integrated, the pipeline can also process text in batches. The next section gives a demo.


### Scale up

Now that the pipeline works for a single comment, we use it to process a batch of comments.

#### Load data

In [9]:

FILE_ID_FROM_YUNDONG = '1FL1pA7hjayBd8Ffptzj1RkpqtL5-Q7vnK1rDD3ox2Xc'
local_tsv_path = 'input.tsv'

def gs2tsv(FILE_ID=FILE_ID_FROM_YUNDONG,outname=local_tsv_path):
    tsv_link = 'https://docs.google.com/spreadsheets/d/'+FILE_ID+'/export?format=tsv'
    !wget $tsv_link -O $outname
    clear_output()
    print(outname, 'downloaded.')
    return
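
The `!wget` call only works inside a notebook. If the download needs to run as a plain Python script, a standard-library version along these lines may serve. It is a sketch that assumes the sheet is shared publicly; `build_tsv_url` and `download_tsv` are hypothetical names.

```python
from urllib.request import urlretrieve

def build_tsv_url(file_id):
    """Build the TSV export link for a Google Sheets file ID."""
    return 'https://docs.google.com/spreadsheets/d/' + file_id + '/export?format=tsv'

def download_tsv(file_id, outname='input.tsv'):
    """Download the sheet as TSV to outname and return the local path."""
    # Assumes the sheet is readable without authentication.
    urlretrieve(build_tsv_url(file_id), outname)
    return outname
```

Calling `download_tsv(FILE_ID_FROM_YUNDONG)` would then stand in for `gs2tsv()`.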


In [10]:
gs2tsv()

input.tsv downloaded.


In [11]:
def tsv2df(inname=local_tsv_path):
    indf = pd.read_csv(inname, sep='\t', header=0)
    return indf


In [13]:

test_indf = tsv2df()
test_indf.head(3)

| | Unique ID | Domain | Comment | Tags |
|---|---|---|---|---|
| 0 | 6070f5c0800d871e0c75d919 | Vaccine | I got my jab on March 29. Your literature says... | Vaccine effectiveness / delayed dosage |
| 1 | 606ac6aa8d190c273ca7ebe3 | Vaccine | How reliable the shipment is ?? Spending o... | Data and tracking vaccines |
| 2 | 601c05426c4b8d189822fcec | Vaccine | Critical missing info: Fed Govt needs to ma... | Data and tracking vaccines |


In [14]:
def df2df(indf):
    """Label every comment in indf; return a DataFrame of id, comment, and SRL tags."""
    rows = []
    for _, row in indf.iterrows():
        cmt = row['Comment']
        rows.append(
            {'id': row['Unique ID'],
             'cmt': cmt,
             'srl': sentence2roles(predictor=pretrained_precitor, sentence=cmt)
             })
    # build the frame once after the loop instead of concatenating on every iteration
    return pd.DataFrame(rows, columns=['id', 'cmt', 'srl'])
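
Empty spreadsheet cells are read by pandas as NaN rather than as strings, and repeated comments would be sent to the model more than once. The sketch below shows one way to guard against both, independent of the DataFrame. `label_comments` and `fake_label` are hypothetical names; `fake_label` is a stub standing in for the SRL call so the example runs without the model.

```python
def label_comments(comments, label_fn):
    """Apply label_fn to each comment, skipping blanks and caching repeats."""
    cache = {}
    results = []
    for comment in comments:
        # Non-string values (such as NaN) and blank strings get an empty label.
        if not isinstance(comment, str) or not comment.strip():
            results.append('')
            continue
        if comment not in cache:
            cache[comment] = label_fn(comment)
        results.append(cache[comment])
    return results

calls = []

def fake_label(text):
    """Stub labeler: records the call and tags every whitespace token as O."""
    calls.append(text)
    return ','.join('O' for _ in text.split())

labels = label_comments(['a b', float('nan'), 'a b', ''], fake_label)
print(labels)
```

In the notebook, the stub could be replaced by `lambda c: sentence2roles(pretrained_precitor, c)` and the result assigned to a new column.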


In [15]:
def fileId2df(file_id=FILE_ID_FROM_YUNDONG):
    gs2tsv(file_id)
    indf = tsv2df()
    outdf = df2df(indf)
    return outdf.reset_index(drop=True)

In [16]:
fileId2df()

input.tsv downloaded.


| | id | cmt | srl |
|---|---|---|---|
| 0 | 6070f5c0800d871e0c75d919 | I got my jab on March 29. Your literature says... | B-ARG0,B-V,B-ARG1,I-ARG1,B-ARGM-TMP,I-ARGM-TMP... |
| 1 | 606ac6aa8d190c273ca7ebe3 | How reliable the shipment is ?? Spending o... | B-ARG2,I-ARG2,B-ARG1,I-ARG1,B-V,O,O,O,O,O,O,O,... |
| 2 | 601c05426c4b8d189822fcec | Critical missing info: Fed Govt needs to ma... | O,B-V,B-ARG1,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O... |
| 3 | 604e366623caed19c087f936 | When coming from Portugal and the itinerary is... | B-ARGM-TMP,B-V,B-ARG3,I-ARG3,O,B-ARGM-DIR,I-AR... |
| 4 | 604498689a91901f24b82c39 | Pre-entry test requirements: You must show pr... | O,O,O,O,O,O,O,B-V,O,O,O,O,O,O,O,O,O,O,O,O,O,O,... |
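
Several comments contain more than one sentence, yet each comment is passed to the model as a single sentence, and only the first frameset is kept. Splitting a comment into sentences before labeling may give better coverage. The sketch below uses a naive regular expression as a placeholder; `split_sentences` is a hypothetical helper, and it will wrongly split after abbreviations such as `U.N.`, so a proper segmenter (for example the one in spaCy, which the set-up already imports) would likely be more robust.

```python
import re

def split_sentences(comment):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', comment.strip())
    return [part for part in parts if part]

sentences = split_sentences('I got my jab on March 29. Is it safe? Yes!')
for sentence in sentences:
    print(sentence)
```

Each resulting sentence could then be passed to `sentence2roles` separately.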


## Next steps

We would like to get more feedback before taking the next step. 

For now, we are reading Rago (2018) and Yi (2007) for reference.