## Feature Extraction for the SRL Task

Instructions:

Toolkit requirement: make sure spaCy is installed locally and that the 'en_core_web_sm' model has also been downloaded locally.
To get the feature extraction output, go to the menu bar, select "Run", then "Run All Cells". The output is listed at the end of the notebook.
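
Before running all cells, it can help to confirm that both requirements are importable. The following is a minimal sketch using only the standard library; the helper name `check_requirements` is hypothetical and not part of the notebook. It relies on the fact that a downloaded spaCy model is installed as a regular Python package.

```python
import importlib.util

def check_requirements(names=("spacy", "en_core_web_sm")):
    """Return {package_name: bool}, True when the package can be imported."""
    # find_spec returns None for a top-level package that is not installed.
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_requirements()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If the model is reported as missing, it can usually be installed with `python -m spacy download en_core_web_sm`.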

Output:

The named-entity tag of each target token, plus the path length (number of steps) from the token to the root of the dependency tree and the token's dependency relation to its head.
A pandas DataFrame of the tokens with the extracted features: the NER tag, the path length and the dependency tag, alongside the head and the POS tag.
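
Because the NER feature is reported per token, an entity that spans several tokens has to be expanded so that every token inside it carries the label. The sketch below illustrates this in plain Python. The helper `token_entity_tags`, the example tokens and the spans are all hypothetical; the spans are assumed to be `(start, end, label)` tuples with an exclusive end index, which is how spaCy exposes `ent.start` and `ent.end`.

```python
def token_entity_tags(n_tokens, spans, outside="NAN"):
    """Expand (start, end, label) entity spans into one tag per token."""
    tags = [outside] * n_tokens
    for start, end, label in spans:
        # end is exclusive, so a two-token entity covers start and start + 1
        for i in range(start, end):
            tags[i] = label
    return tags

# Hypothetical example data, not taken from the shared-task corpus.
tokens = ["Sherlock", "Holmes", "left", "London"]
spans = [(0, 2, "PERSON"), (3, 4, "GPE")]
tags = token_entity_tags(len(tokens), spans)
print(list(zip(tokens, tags)))
```

Tokens outside every span keep the placeholder 'NAN', matching the value used in the NER column of the DataFrame.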

Data:

Sample sentences were selected from SEM-2012-SharedTask-CD-SCO-dev-simple.v2.txt (ID=40 & 120, chapter='baskervilles08').

In [1]:
# baskervilles01, 40th sentence
sent_1 = '''When i said that you stimulated me I meant, to be frank, 
that in noting your fallacies I was occasionally guided towards the truth.'''
# baskervilles08, 120th sentence
sent_2 = '''When I came round the balcony he had reached the end of the farther corridor, 
and I could see from the glimmer of light through an open door that he had entered one of the rooms.'''

In [47]:
import spacy
import pandas as pd
from spacy import displacy
import networkx as nx
import io
import stanza
import benepar

In [3]:
nlp = spacy.load('en_core_web_sm')
# Parse the two sample sentences defined above.
doc_1, doc_2 = nlp(sent_1), nlp(sent_2)
# One row per token; Surface_form holds the token text.
df_1 = pd.DataFrame({'Surface_form': [token.text for token in doc_1]})
df_2 = pd.DataFrame({'Surface_form': [token.text for token in doc_2]})

In [6]:
def get_graph(doc):
    '''Return, for each token, the number of dependency arcs between it and the root.'''
    edges = []
    paths = []
    for token in doc:
        if token.dep_ == 'ROOT':
            root = token.lower_
        for child in token.children:
            # Nodes are keyed by lowercased text, so repeated words
            # (e.g. "I", "that", "the") collapse into a single node.
            edges.append((token.lower_, child.lower_))
    # Undirected graph of head-child arcs
    graph = nx.Graph(edges)

    for token in doc:
        path_len = nx.shortest_path_length(graph, source=root, target=token.lower_)
        paths.append(path_len)
    return paths

def create_dataframe(df, doc):
    '''Add each token's head, dependency relation, path length, POS tag and NER tag to df.'''
    Dependency = []
    Head = []
    Token_spcy = []
    POS = []
    NER = []

    for token in doc:
        dependency = token.dep_
        head = token.head.text
        token_spcy = token.text
        pos_tag = token.pos_
        # token.ent_type_ is '' outside an entity and is also set for
        # every token inside a multi-token entity.
        ne = token.ent_type_ or 'NAN'
        Dependency.append(dependency)
        Head.append(head)
        Token_spcy.append(token_spcy)
        POS.append(pos_tag)
        NER.append(ne)
   
    df['Token_spcy'] = Token_spcy
    df['Head'] = Head
    df['Relation2_head'] = Dependency
    df['Path'] = get_graph(doc)
    df['POS'] = POS
    df['NER'] = NER
    return df
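
Note that `get_graph` keys graph nodes by lowercased token text, so a word that occurs more than once in the sentence shares a single node. In the `sent_1` output, for example, "I" (row 7) is listed with Path 1 even though its head "meant" has Path 2. As an alternative, the sketch below counts steps to the root from head indices, which keeps repeated words apart. It is a plain-Python illustration under stated assumptions: `steps_to_root` is a hypothetical helper, and the head indices are read off the Head column of the `sent_1` output for the first ten tokens.

```python
def steps_to_root(heads):
    """Count dependency arcs from each token up to the root.

    heads[i] is the index of token i's head. The root points to itself,
    mirroring spaCy, where token.head is the token itself for the root.
    """
    depths = []
    for i in range(len(heads)):
        steps, current = 0, i
        while heads[current] != current:
            current = heads[current]
            steps += 1
            if steps > len(heads):
                raise ValueError("cycle in head indices")
        depths.append(steps)
    return depths

# Assumed head indices for the first ten tokens of sent_1:
# When->said, i->said, said (root), that->stimulated, you->stimulated,
# stimulated->said, me->stimulated, I->meant, meant->stimulated, ","->meant
heads = [2, 2, 2, 5, 5, 2, 5, 8, 5, 8]
depths = steps_to_root(heads)
print(depths)
```

With a parsed spaCy `doc`, the same list could be built as `[token.head.i for token in doc]`.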

In [7]:
create_dataframe(df_1, doc_1)

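| | Surface_form | Token_spcy | Head | Relation2_head | Path | POS | NER |
|---|---|---|---|---|---|---|---|
| 0 | When | When | said | advmod | 1 | SCONJ | NAN |
| 1 | i | i | said | nsubj | 1 | PRON | NAN |
| 2 | said | said | said | ROOT | 0 | VERB | NAN |
| 3 | that | that | stimulated | mark | 1 | SCONJ | NAN |
| 4 | you | you | stimulated | nsubj | 2 | PRON | NAN |
| 5 | stimulated | stimulated | said | ccomp | 1 | VERB | NAN |
| 6 | me | me | stimulated | dobj | 2 | PRON | NAN |
| 7 | I | I | meant | nsubj | 1 | PRON | NAN |
| 8 | meant | meant | stimulated | parataxis | 2 | VERB | NAN |
| 9 | , | , | meant | punct | 3 | PUNCT | NAN |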


In [8]:
create_dataframe(df_2, doc_2)

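| | Surface_form | Token_spcy | Head | Relation2_head | Path | POS | NER |
|---|---|---|---|---|---|---|---|
| 0 | When | When | came | advmod | 1 | SCONJ | NAN |
| 1 | I | I | came | nsubj | 1 | PRON | NAN |
| 2 | came | came | came | ROOT | 0 | VERB | NAN |
| 3 | round | round | came | prep | 1 | ADV | NAN |
| 4 | the | the | balcony | det | 3 | DET | NAN |
| 5 | balcony | balcony | round | pobj | 2 | NOUN | NAN |
| 6 | he | he | reached | nsubj | 4 | PRON | NAN |
| 7 | had | had | reached | aux | 4 | AUX | NAN |
| 8 | reached | reached | balcony | relcl | 3 | VERB | NAN |
| 9 | the | the | end | det | 3 | DET | NAN |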
