### Import Data

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
citations = pd.read_csv("/nfs/turbo/hrg/data_detection/outputs_pipeline/4_pubs_sents_preds_ids.csv")

In [3]:
citations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 402637 entries, 0 to 402636
Data columns (total 5 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   paper_id            402637 non-null  int64 
 1   paper_title         364875 non-null  object
 2   paper_section       379748 non-null  object
 3   sentence_text       402637 non-null  object
 4   dataset_prediction  7486 non-null    object
dtypes: int64(1), object(4)
memory usage: 15.4+ MB


In [4]:
citations_true = citations[~citations.dataset_prediction.isna()].reset_index(drop=True)

In [5]:
citations_true.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7486 entries, 0 to 7485
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   paper_id            7486 non-null   int64 
 1   paper_title         6754 non-null   object
 2   paper_section       6948 non-null   object
 3   sentence_text       7486 non-null   object
 4   dataset_prediction  7486 non-null   object
dtypes: int64(1), object(4)
memory usage: 292.5+ KB


### Extract

In [6]:
from openie import StanfordOpenIE

In [7]:
# ref: https://github.com/philipperemy/stanford-openie-python
def extract_triple(text):
    with StanfordOpenIE() as client:
        return client.annotate(text)

In [8]:
# Example
print(citations_true.sentence_text[0])
extract_triple(citations_true.sentence_text[0])

ADR data are collected from the major depositary bank websites: Bank of New York, Citibank, the Deutsche Bank, and JPMorgan. 
Starting server with command: java -Xmx8G -cp /home/lizhouf/stanfordnlp_resources/stanford-corenlp-full-2018-10-05/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-78ba13d7deab48de.props -preload openie


[{'subject': 'ADR data', 'relation': 'are', 'object': 'collected'},
 {'subject': 'ADR data',
  'relation': 'are collected from',
  'object': 'major bank websites'},
 {'subject': 'ADR data',
  'relation': 'are collected from',
  'object': 'major depositary bank websites'},
 {'subject': 'New York', 'relation': 'of Bank is', 'object': 'Deutsche Bank'},
 {'subject': 'ADR data',
  'relation': 'are collected from',
  'object': 'bank websites'},
 {'subject': 'ADR data',
  'relation': 'are collected from',
  'object': 'depositary bank websites'}]

In [10]:
print(citations_true.sentence_text[1])
eg1 = extract_triple(citations_true.sentence_text[1])
eg1

For the non-ADR cross-listed shares (direct listing and New York Registered shares), we obtain the name of the firms, type of listing from the NYSE, Nasdaq, and AMEX websites. 
Starting server with command: java -Xmx8G -cp /home/lizhouf/stanfordnlp_resources/stanford-corenlp-full-2018-10-05/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-68493c7d9c0549c2.props -preload openie


[{'subject': 'we',
  'relation': 'obtain name For',
  'object': 'non-ADR cross-listed shares'},
 {'subject': 'firms', 'relation': 'type from', 'object': 'NYSE websites'},
 {'subject': 'we', 'relation': 'obtain', 'object': 'name'},
 {'subject': 'we', 'relation': 'obtain name For', 'object': 'non-ADR shares'},
 {'subject': 'we',
  'relation': 'obtain name For',
  'object': 'cross-listed shares'},
 {'subject': 'firms', 'relation': 'type of', 'object': 'listing'},
 {'subject': 'we', 'relation': 'obtain name For', 'object': 'shares'},
 {'subject': 'we', 'relation': 'obtain', 'object': 'name of firms'}]

In [11]:
print(citations_true.sentence_text[4])
eg2 = extract_triple(citations_true.sentence_text[4])
eg2

In addition to linking our results to existing empirical studies, we provide some new evidence on our central implications by studying the autonomy of workers in a sample of firms in the National Organizations Survey, 1996-97 and . 
Starting server with command: java -Xmx8G -cp /home/lizhouf/stanfordnlp_resources/stanford-corenlp-full-2018-10-05/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-918c3f1425864090.props -preload openie


[{'subject': 'autonomy', 'relation': 'is in', 'object': 'sample of firms'},
 {'subject': 'firms',
  'relation': 'is in',
  'object': 'National Organizations Survey'},
 {'subject': 'we', 'relation': 'linking', 'object': 'our results'}]

In [12]:
print(citations_true.sentence_text[5])
eg3 = extract_triple(citations_true.sentence_text[5])
eg3

The data are drawn from the National Organizations Survey, 1996 -97 and 2002 (Kalleberg et al., 2001 Smith et al., 2005) . 
Starting server with command: java -Xmx8G -cp /home/lizhouf/stanfordnlp_resources/stanford-corenlp-full-2018-10-05/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-ded3fe6cfb3a4918.props -preload openie


[{'subject': 'data',
  'relation': 'are drawn from',
  'object': 'National Organizations Survey'},
 {'subject': 'data', 'relation': 'are', 'object': 'drawn'}]
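
Each call to `extract_triple` above opens its own `StanfordOpenIE` context, which is why the "Starting server with command" line recurs in every output. A minimal sketch of one way to avoid that is to open the client once and pass it to a helper. The names `extract_triples_batch` and `StubClient` are hypothetical and not part of the notebook or the `openie` package; the stub only stands in for the real client so the snippet runs without Java.

```python
def extract_triples_batch(client, texts):
    """Annotate every text with one already-started client."""
    return [client.annotate(text) for text in texts]


class StubClient:
    """Hypothetical stand-in so the sketch runs without Java or CoreNLP."""

    def __init__(self):
        self.calls = 0

    def annotate(self, text):
        self.calls += 1
        return []  # a real client returns a list of triple dicts


stub = StubClient()
batch = extract_triples_batch(stub, ["first sentence.", "second sentence."])
print(len(batch), stub.calls)
```

With the real wrapper, the intended use would be `with StanfordOpenIE() as client: extract_triples_batch(client, citations_true.sentence_text[:5])`, which should start the server once for the whole batch rather than once per sentence.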

Masking the dataset name before extraction vs. masking it afterwards (neither gave ideal results)

In [13]:
text = "The number of enrolled students is taken from the China Statistical Yearbook (NBS 2003) , based on three geographic classifications: 8 urban areas (chengshi), counties and towns (xianzhen) and rural areas (nongcun)."
eg4 = extract_triple(text)
eg4

Starting server with command: java -Xmx8G -cp /home/lizhouf/stanfordnlp_resources/stanford-corenlp-full-2018-10-05/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-b3226e2b528e410b.props -preload openie


Exception ignored in: <function StanfordOpenIE.__del__ at 0x2ba4f5322940>
Traceback (most recent call last):
  File "/home/lizhouf/.local/lib/python3.8/site-packages/openie/openie.py", line 90, in __del__
    del os.environ['CORENLP_HOME']
  File "/sw/arcts/centos7/python3.8-anaconda/2021.05/lib/python3.8/os.py", line 691, in __delitem__
    raise KeyError(key) from None
KeyError: 'CORENLP_HOME'


[{'subject': 'number', 'relation': 'is taken', 'object': '8 urban areas'},
 {'subject': 'number', 'relation': 'is taken', 'object': 'chengshi'},
 {'subject': 'number', 'relation': 'is taken', 'object': 'counties'},
 {'subject': 'three geographic classifications',
  'relation': 'based on Yearbook is',
  'object': 'NBS 2003'},
 {'subject': 'number', 'relation': 'is taken', 'object': '8 areas'}]

In [14]:
text = "The number of enrolled students is taken from the DATASET , based on three geographic classifications: 8 urban areas (chengshi), counties and towns (xianzhen) and rural areas (nongcun)."
eg5 = extract_triple(text)
eg5

Starting server with command: java -Xmx8G -cp /home/lizhouf/stanfordnlp_resources/stanford-corenlp-full-2018-10-05/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-bff17cbf88134a99.props -preload openie


[{'subject': 'number', 'relation': 'is taken', 'object': '8 urban areas'},
 {'subject': 'number', 'relation': 'is taken', 'object': 'chengshi'},
 {'subject': 'number', 'relation': 'is taken', 'object': 'counties'},
 {'subject': 'number', 'relation': 'is taken', 'object': '8 areas'}]
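
The masked sentence in In [14] was typed by hand. As a rough sketch, the same substitution could be done with the `re` module imported in In [1]. The helper name `mask_mention` is hypothetical, and the sketch assumes the dataset mention is available as a literal string (whether `dataset_prediction` holds such a string is not shown here). Case-insensitive matching is a design choice of this sketch, not something the notebook specifies.

```python
import re


def mask_mention(sentence, mention, placeholder="DATASET"):
    """Replace each literal, case-insensitive occurrence of `mention`."""
    pattern = re.compile(re.escape(mention), flags=re.IGNORECASE)
    # A function replacement keeps backslashes in `placeholder` literal.
    return pattern.sub(lambda _match: placeholder, sentence)


# Shortened version of the sentence used in In [13].
sentence = ("The number of enrolled students is taken from the "
            "China Statistical Yearbook (NBS 2003) , based on three "
            "geographic classifications.")
masked = mask_mention(sentence, "China Statistical Yearbook (NBS 2003)")
print(masked)
```

`re.escape` matters here because the mention contains parentheses, which would otherwise be read as a regex group.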

In [15]:
text = "The analysis takes advantage of rich data from the the Mexican Family Life Survey (MxFLS), which includes modules on health, anthropometry, cognitive skill, parental characteristics, and labor market outcomes. "
eg6 = extract_triple(text)
eg6

Starting server with command: java -Xmx8G -cp /home/lizhouf/stanfordnlp_resources/stanford-corenlp-full-2018-10-05/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-8b81ea37b34948a6.props -preload openie


[{'subject': 'analysis', 'relation': 'takes', 'object': 'advantage'},
 {'subject': 'analysis', 'relation': 'takes', 'object': 'advantage of data'},
 {'subject': 'analysis',
  'relation': 'takes',
  'object': 'advantage of rich data'}]

In [16]:
text = "The analysis takes advantage of rich data from the the Mexican Family Life Survey (MxFLS)"
extract_triple(text)

Starting server with command: java -Xmx8G -cp /home/lizhouf/stanfordnlp_resources/stanford-corenlp-full-2018-10-05/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-6aad9a48dce947a9.props -preload openie


[{'subject': 'analysis', 'relation': 'takes', 'object': 'advantage of data'},
 {'subject': 'analysis', 'relation': 'takes', 'object': 'advantage'},
 {'subject': 'analysis',
  'relation': 'takes',
  'object': 'advantage of rich data'}]

### Select

In [17]:
# for similar triples, keep only the longest one
# include triples with and without the dataset_prediction (for possible coref)

In [18]:
# Helper: return the "longest" item, i.e. the one with the most spaces
def long_stuff(compare_stuff):  # a list of strings; returns the single longest one
    # ties go to the first item; later, we can extract more info from the long object
    return max(compare_stuff, key=lambda x: x.count(" "))


# Helper: keep the strings that are not a substring of any other string
def no_subset(A):  # a list of strings
    return list(set([x for x in A if not any(x in y and x != y for y in A)]))


# Helper: Using rules to select tri:
def select_triple(triplets):
    """
    Select a sequence of subject-verb-object (SVO) triples
    from the triples produced by the OpenIE extraction function (``opie_tri``),
    including both active and passive entities and actions.

    Args:
        triplets: list of dictionaries with keys "subject", "relation",
            and "object"; assumed to hold at least two triplets,
            i.e. len(triplets) >= 2.

    Returns:
        List of dictionaries: the main/longest triplets from ``triplets``,
        each representing a (subject, verb, object) triple.
    """
    # initiate
    selected = []
    # only extract the longest subject
    compare_sub = list(map(lambda x: x["subject"], triplets))
    subjects = no_subset(compare_sub)
    for sub in subjects:
        # extract different unique relations
        tri_for_this_sub = [d for d in triplets if d["subject"] == sub]
        compare_rel = list(map(lambda x: x["relation"], tri_for_this_sub))
        relations = no_subset(compare_rel)
        # for each relation, extract the longest object
        for rel in relations:
            tri_for_this_rel = [d for d in tri_for_this_sub if d["relation"] == rel]
            compare_obj = list(map(lambda x: x["object"], tri_for_this_rel))
            this_object = long_stuff(compare_obj)
            selected.append({"subject": sub, "relation": rel, "object": this_object})

    # for the selected ones, if both subject and object are the same
    # we keep the one with the longest relation
    if len(selected) > 1:
        # initiate
        re_select = []

        # give group number
        group_list = [0] * len(selected)
        group_list[0] = 1  # initiate the first group
        group_num = 1
        pos = 0
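        # triples that share both subject and object should get the same
        # group number; entries that already have a group are not reassigned.
        # NOTE: the first pass (i == 0) assigns every remaining entry, giving
        # group 1 to matches of selected[0] and group 2 to everything else,
        # so unrelated triples can end up sharing group 2.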
        for i in range(len(selected)):
            pos = i
            for j in range(i + 1, len(selected)):
                pos += 1
                if group_list[pos] == 0:
                    if selected[i]["subject"] == selected[j]["subject"] and \
                            selected[i]["object"] == selected[j]["object"]:
                        group_list[pos] = group_num
                    else:
                        group_list[pos] = group_num + 1
            group_num += 1

        # for each group, find the longest relation
        numbers = list(set(group_list))
        selected_df = pd.DataFrame()
        selected_df["tri"] = selected
        selected_df["grp"] = group_list
        for num in numbers:
            # find all the triplets for this group
            tri_for_this_grp = selected_df[selected_df.grp == num].tri
            # acquire a list of relations in this group
            compare_rel = list(map(lambda x: x["relation"], tri_for_this_grp))
            # find the longest relation
            this_rel = long_stuff(compare_rel)
            # get the subject and object for this group; they are expected to be
            # identical across the group, so take them from the first triplet
            this_subject = list(tri_for_this_grp)[0]["subject"]  # Series -> list
            this_object = list(tri_for_this_grp)[0]["object"]  # Series -> list
            # add the merged triplet for this group to the result
            re_select.append({"subject": this_subject, "relation": this_rel, "object": this_object})

        return re_select

    return selected
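
To make the two helper rules concrete, here is a small worked example on the relations and objects that OpenIE returned for subject 'we' in eg1. `no_subset` is restated so the snippet runs on its own, and the `max(...)` call picks the entry with the most spaces, which is the same criterion `long_stuff` applies. The last two lines illustrate a side effect worth keeping in mind: the containment test works on characters, not on whole words.

```python
def no_subset(strings):
    """Keep the strings that are not contained in any other string."""
    return list(set(x for x in strings
                    if not any(x in y and x != y for y in strings)))


# Relations for subject 'we' in eg1: 'obtain' is contained in the longer one.
relations = ["obtain name For", "obtain"]
kept_relations = no_subset(relations)

# Objects for ('we', 'obtain name For') in eg1: keep the one with most spaces.
objects = ["non-ADR cross-listed shares", "non-ADR shares",
           "cross-listed shares", "shares"]
longest_object = max(objects, key=lambda s: s.count(" "))

# Character-level containment: 'data' is dropped because it occurs
# inside 'metadata', although the two are different words.
kept_words = no_subset(["data", "metadata"])

print(kept_relations, longest_object, kept_words)
```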

In [19]:
print(eg1)
select_triple(eg1)

[{'subject': 'we', 'relation': 'obtain name For', 'object': 'non-ADR cross-listed shares'}, {'subject': 'firms', 'relation': 'type from', 'object': 'NYSE websites'}, {'subject': 'we', 'relation': 'obtain', 'object': 'name'}, {'subject': 'we', 'relation': 'obtain name For', 'object': 'non-ADR shares'}, {'subject': 'we', 'relation': 'obtain name For', 'object': 'cross-listed shares'}, {'subject': 'firms', 'relation': 'type of', 'object': 'listing'}, {'subject': 'we', 'relation': 'obtain name For', 'object': 'shares'}, {'subject': 'we', 'relation': 'obtain', 'object': 'name of firms'}]


[{'subject': 'firms', 'relation': 'type from', 'object': 'NYSE websites'},
 {'subject': 'firms', 'relation': 'obtain name For', 'object': 'listing'}]
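
The second triple in the output above pairs subject 'firms' with relation 'obtain name For', a combination that does not occur in eg1, and no triple with subject 'we' survives. This is consistent with the grouping loop in `select_triple` placing every triple that differs from the first one into a single shared group. A minimal sketch of an alternative merge step keys the groups on the (subject, object) pair instead. The name `merge_by_subject_object` is hypothetical, and this is one possible fix rather than the notebook's own method.

```python
def merge_by_subject_object(selected):
    """For each (subject, object) pair keep the triple whose relation
    has the most spaces; the first such triple wins ties."""
    best = {}
    for tri in selected:
        key = (tri["subject"], tri["object"])
        current = best.get(key)
        if (current is None
                or tri["relation"].count(" ") > current["relation"].count(" ")):
            best[key] = tri
    return list(best.values())  # dicts keep first-seen key order


# The three triples that the first stage of select_triple keeps for eg1.
stage_one = [
    {"subject": "firms", "relation": "type from", "object": "NYSE websites"},
    {"subject": "firms", "relation": "type of", "object": "listing"},
    {"subject": "we", "relation": "obtain name For",
     "object": "non-ADR cross-listed shares"},
]
merged = merge_by_subject_object(stage_one)
print(merged)
```

Because no two of these triples share both subject and object, all three pass through unchanged, including the one with subject 'we'.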

In [20]:
print(eg2)
select_triple(eg2)

[{'subject': 'autonomy', 'relation': 'is in', 'object': 'sample of firms'}, {'subject': 'firms', 'relation': 'is in', 'object': 'National Organizations Survey'}, {'subject': 'we', 'relation': 'linking', 'object': 'our results'}]


[{'subject': 'we', 'relation': 'linking', 'object': 'our results'},
 {'subject': 'autonomy', 'relation': 'is in', 'object': 'sample of firms'}]

In [21]:
print(eg3)
select_triple(eg3)

[{'subject': 'data', 'relation': 'are drawn from', 'object': 'National Organizations Survey'}, {'subject': 'data', 'relation': 'are', 'object': 'drawn'}]


[{'subject': 'data',
  'relation': 'are drawn from',
  'object': 'National Organizations Survey'}]

In [22]:
print(eg4)
select_triple(eg4)

[{'subject': 'number', 'relation': 'is taken', 'object': '8 urban areas'}, {'subject': 'number', 'relation': 'is taken', 'object': 'chengshi'}, {'subject': 'number', 'relation': 'is taken', 'object': 'counties'}, {'subject': 'three geographic classifications', 'relation': 'based on Yearbook is', 'object': 'NBS 2003'}, {'subject': 'number', 'relation': 'is taken', 'object': '8 areas'}]


[{'subject': 'three geographic classifications',
  'relation': 'based on Yearbook is',
  'object': 'NBS 2003'},
 {'subject': 'number', 'relation': 'is taken', 'object': '8 urban areas'}]

In [23]:
print(eg5)
select_triple(eg5)

[{'subject': 'number', 'relation': 'is taken', 'object': '8 urban areas'}, {'subject': 'number', 'relation': 'is taken', 'object': 'chengshi'}, {'subject': 'number', 'relation': 'is taken', 'object': 'counties'}, {'subject': 'number', 'relation': 'is taken', 'object': '8 areas'}]


[{'subject': 'number', 'relation': 'is taken', 'object': '8 urban areas'}]

In [24]:
print(eg6)
select_triple(eg6)

[{'subject': 'analysis', 'relation': 'takes', 'object': 'advantage'}, {'subject': 'analysis', 'relation': 'takes', 'object': 'advantage of data'}, {'subject': 'analysis', 'relation': 'takes', 'object': 'advantage of rich data'}]


[{'subject': 'analysis',
  'relation': 'takes',
  'object': 'advantage of rich data'}]