<a href="https://colab.research.google.com/github/darisoy/EE517_Sp21/blob/master/Project/pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set up the environment

In [1]:
%%capture
# install packages
!pip install allennlp
!pip install allennlp-models
!pip install spacy-dbpedia-spotlight

import spacy
from allennlp.predictors.predictor import Predictor
import spacy_dbpedia_spotlight

# Functions

In [81]:
def get_coref(text, dic, predictor):
    # Map each coreference span onto its index in dic; spans that are not noun chunks are dropped.
    prediction = predictor.predict(document=text)
    return [[dic.index(span) for span in cluster if span in dic] for cluster in prediction['clusters']]
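
The span-to-index mapping in `get_coref` can be checked without downloading the coreference model. The sketch below is only an illustration: `StubPredictor` is a hypothetical stand-in that returns fixed clusters, and it assumes the predictor reports each mention as an inclusive `[start, end]` token span, which is the same convention used to build `dic`.

```python
def get_coref(text, dic, predictor):
    prediction = predictor.predict(document=text)
    return [[dic.index(span) for span in cluster if span in dic]
            for cluster in prediction['clusters']]


class StubPredictor:
    """Hypothetical stand-in for the coreference predictor (not part of AllenNLP)."""

    def predict(self, document):
        # Three made-up clusters of inclusive [start, end] token spans.
        return {'clusters': [[[0, 1], [6, 6]], [[3, 4], [9, 9]], [[10, 11]]]}


# Inclusive token spans of the noun chunks; list positions act as mention ids.
dic = [[0, 1], [3, 4], [6, 6]]

clusters = get_coref('placeholder text', dic, StubPredictor())
print(clusters)  # [[0, 2], [1], []]
```

Spans that do not coincide exactly with a noun chunk are silently dropped, so a cluster can shrink to a single mention or to an empty list.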

In [87]:
def get_ner(text, dic, nlp):
    # Indices (into dic) of noun chunks that spaCy tags as PERSON entities.
    doc = nlp(text)
    return [dic.index([ent.start, ent.end-1]) for ent in doc.ents if ent.label_ == 'PERSON' and [ent.start, ent.end-1] in dic]

In [84]:
def get_nel(text, dic, nel):
    # Indices (into dic) of noun chunks that DBpedia Spotlight links with a similarity score >= threshold.
    threshold = 0.95
    doc = nel(text)
    return [dic.index([ent.start, ent.end-1]) for ent in doc.ents if float(ent._.dbpedia_raw_result['@similarityScore']) >= threshold and [ent.start, ent.end-1] in dic]
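
The `float(...)` conversion in `get_nel` matters because the raw Spotlight result carries the score as a string. The following sketch applies the same threshold filter to plain dictionaries. The records and their scores are made up for illustration and only mimic the `@similarityScore` key used above; `@surfaceForm` is assumed here as the key holding the matched text.

```python
threshold = 0.95

# Made-up records shaped like Spotlight annotations; the scores are illustrative only.
raw_results = [
    {'@surfaceForm': 'Isaac Newton', '@similarityScore': '0.9999'},
    {'@surfaceForm': 'wheel', '@similarityScore': '0.8712'},
    {'@surfaceForm': 'Newton', '@similarityScore': '0.95'},
]

# Convert the string score to a float before comparing it with the threshold.
kept = [r['@surfaceForm'] for r in raw_results
        if float(r['@similarityScore']) >= threshold]
print(kept)  # ['Isaac Newton', 'Newton']
```

Comparing the string directly with `0.95` would raise a `TypeError` in Python 3, so the conversion cannot be skipped.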

In [5]:
def id_to_string(idx, dic, doc):
    # dic stores inclusive [start, end] token spans, hence the +1 when slicing.
    a, b = dic[idx]
    return doc[a:b+1].text

In [49]:
def get_cluster(person, clusters):
    # Return the coreference cluster containing this mention, or a singleton if there is none.
    for cluster in clusters:
        if person in cluster:
            return cluster
    return [person]

In [54]:
def split_hist_hypo(clusters, person_mentions, famous_people):
    historical = []
    hypothetical = []
    for person in person_mentions:
        in_historical = any(person in sublist for sublist in historical)
        in_hypothetical = any(person in sublist for sublist in hypothetical)
        if in_historical or in_hypothetical:
            continue
        person_set = get_cluster(person, clusters)
        if person in famous_people:
            historical.append(person_set)
        else:
            hypothetical.append(person_set)
    return historical, hypothetical
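
A toy run makes the splitting logic easier to follow. In the sketch below, mentions are plain integers and the two functions are restated so the block runs on its own. The ids are made up, loosely mirroring the Newton/Jessica sample used in the Test section.

```python
def get_cluster(person, clusters):
    for cluster in clusters:
        if person in cluster:
            return cluster
    return [person]


def split_hist_hypo(clusters, person_mentions, famous_people):
    historical = []
    hypothetical = []
    for person in person_mentions:
        in_historical = any(person in sublist for sublist in historical)
        in_hypothetical = any(person in sublist for sublist in hypothetical)
        if in_historical or in_hypothetical:
            continue  # this mention's cluster has already been classified
        person_set = get_cluster(person, clusters)
        if person in famous_people:
            historical.append(person_set)
        else:
            hypothetical.append(person_set)
    return historical, hypothetical


clusters = [[0, 1, 2, 5], [3, 4]]   # two coreference clusters
person_mentions = [0, 3, 5, 7]      # mention 7 belongs to no cluster
famous_people = [0, 5]              # mentions that entity linking matched

hist, hypo = split_hist_hypo(clusters, person_mentions, famous_people)
print(hist)  # [[0, 1, 2, 5]]
print(hypo)  # [[3, 4], [7]]
```

Note that the first mention of a cluster to appear in `person_mentions` decides the label for the whole cluster. If that mention is not in `famous_people`, the cluster lands in `hypothetical` even when a later mention in the same cluster is linked.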

In [51]:
def array_to_text(array2D, dic, doc):
    return [[id_to_string(e, dic, doc) for e in arr] for arr in array2D]

In [89]:
def get_hist_hypo_references(text, nlp, nel, debug=False):
    doc = nlp(text)
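    # Inclusive [start, end] token spans of every noun chunk; list positions serve as mention ids.
    dic = [[chunk.start, chunk.end-1] for chunk in doc.noun_chunks]
    # NOTE: allen_predictor is a global created in the "Get models" section.
    clusters = get_coref(text, dic, allen_predictor)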
    person_mentions = get_ner(text, dic, nlp)
    famous_people = get_nel(text, dic, nel)
    # TODO: get non-name person mentions using simple rules (for he, she, we, you...)
    # person_mentions.append(...)
    hist, hypo = split_hist_hypo(clusters, person_mentions, famous_people)
    hist = array_to_text(hist, dic, doc)
    hypo = array_to_text(hypo, dic, doc)
    if debug and len(person_mentions) > 0:
        print('Sentence:')
        print(text)
        print()
        print('Coreferences:')
        print(array_to_text(clusters, dic, doc))
        print('People:')
        print(array_to_text([person_mentions], dic, doc))
        print('Famous:')
        print(array_to_text([famous_people], dic, doc))
        print()
        print('Historical references:')
        print(hist)
        print('Hypothetical references:')
        print(hypo)
        print()
        print()
        print()
    return hist, hypo
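
The TODO in `get_hist_hypo_references` mentions simple rules for pronouns. One possible shape for such a rule is sketched below. This is an assumption, not the notebook's implementation: `PERSON_PRONOUNS` and `get_pronoun_mentions` are hypothetical names, and the pronoun list is deliberately small.

```python
# Hypothetical pronoun list; extend or trim it to suit the data.
PERSON_PRONOUNS = {'i', 'me', 'you', 'he', 'him', 'she', 'her', 'we', 'us'}


def get_pronoun_mentions(chunk_texts):
    """Return the indices of noun chunks that are a single personal pronoun."""
    return [i for i, chunk in enumerate(chunk_texts)
            if chunk.strip().lower() in PERSON_PRONOUNS]


# Made-up noun chunk texts, in the same order as dic would list them.
chunk_texts = ['Isaac Newton', 'the wheel', 'He', 'he', 'circles', 'Jessica', 'She']
pronoun_ids = get_pronoun_mentions(chunk_texts)
print(pronoun_ids)  # [2, 3, 6]
```

In the pipeline, `chunk_texts` could be built with `id_to_string` over every index of `dic`. The resulting ids would be appended to `person_mentions` after the named mentions, so that pronouns in an already classified cluster are skipped by `split_hist_hypo` and only unresolved pronouns form new hypothetical entries.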

# Get models

In [10]:
%%capture
!python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')
allen_model_url = 'https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2020.02.27.tar.gz'
allen_predictor = Predictor.from_path(allen_model_url)  # load the model
nel = spacy_dbpedia_spotlight.create('en')

# Test

In [93]:
sample = 'Isaac Newton invented the wheel. He didn\'t go to kindergarden but he was familiar with circles. When told this story, Jessica didn\'t believe it. She thought Newton was a lie.'
hist, hypo = get_hist_hypo_references(sample, nlp, nel, debug=True)

Sentence:
Isaac Newton invented the wheel. He didn't go to kindergarden but he was familiar with circles. When told this story, Jessica didn't believe it. She thought Newton was a lie.

Coreferences:
[['Isaac Newton', 'He', 'he', 'Newton'], ['this story', 'it'], ['Jessica', 'She']]
People:
[['Isaac Newton', 'Jessica', 'Newton']]
Famous:
[['Isaac Newton', 'Newton']]

Historical references:
[['Isaac Newton', 'He', 'he', 'Newton']]
Hypothetical references:
[['Jessica', 'She']]





In [68]:
import pandas as pd
d_all1 = pd.read_csv('/content/train.csv', delimiter=',', low_memory=False, index_col=0)
d_all2 = pd.read_csv('/content/test.csv', delimiter=',', low_memory=False, index_col=0)
d_all1 = d_all1.drop(['bool'], axis=1)
assert all(d_all1.columns == d_all2.columns)
d = pd.concat([d_all1, d_all2], axis=0)
d = d.fillna('[]')

In [99]:
d[d.grade=='12'].head()

| Unnamed: 0 | book | grade | level | science | text | text_org |
|---|---|---|---|---|---|---|
| 5 | Gr12_PhysicalSciences_Learner_Eng.txt3 | 12 | 3 | 1 | sponsor this textbook was developed with corp... | SPONSOR This textbook was developed with corp... |
| 6 | Gr12_PhysicalSciences_Learner_Eng.txt3 | 12 | 3 | 1 | well structured , impactful corporate social ... | Well structured, impactful Corporate Social I... |
| 7 | Gr12_PhysicalSciences_Learner_Eng.txt3 | 12 | 3 | 1 | the merger between metropolitan and momentum ... | The merger between Metropolitan and Momentum ... |
| 8 | Gr12_PhysicalSciences_Learner_Eng.txt3 | 12 | 3 | 1 | hiv/aids is becoming a manageable disease in ... | HIV/AIDS is becoming a manageable disease in ... |
| 9 | Gr12_PhysicalSciences_Learner_Eng.txt3 | 12 | 3 | 1 | momentum 's focus on persons with disabilitie... | Momentum's focus on persons with disabilities... |


In [100]:
for text in d[d.grade=='12'].text_org:
    hist, hypo = get_hist_hypo_references(text, nlp, nel, debug=True)

Sentence:
 The theory of gravity has been slowly developing since the beginning of the 16th century. Galileo Galilei is credited with some of the earliest work. At the time it was widely believed that heavier objects accelerated faster toward the earth than light objects did. Galileo had a hypothesis that this was not true, and performed experiments to prove this. Galileo's work allowed Sir Isaac Newton to hypothesise not only a theory of gravity on earth, but that gravity is what held the planets in their orbits. Newton's theory was used by John Couch Adams and Urbain Le Verrier to predict the planet Neptune in the solar system and this prediction was proved experimentally when Neptune was discovered by Johann Gottfried Galle.

Coreferences:
[['Galileo Galilei', 'Galileo'], [], ['the earth', 'earth'], ['the planets'], ['Sir Isaac Newton'], ['this prediction'], ['Neptune']]
People:
[['John Couch Adams', 'Urbain Le Verrier', 'Johann Gottfried Galle']]
Famous:
[['gravity', 'Galileo Galil

KeyboardInterrupt: ignored