# Combining Huggingface and AllenNLP coreference

In [None]:
!pip install allennlp==1.4.1 --quiet
!pip install --pre allennlp-models==1.4.0 --quiet
!pip install spacy==2.1.0 --quiet
!python -m spacy download en_core_web_sm
!pip install neuralcoref --no-binary neuralcoref
!pip install nltk==3.6.5


Collecting en_core_web_sm==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz (11.1 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [None]:
!python --version

Python 3.7.12


In [None]:
import neuralcoref
neuralcoref.__version__

'4.0.0'

In [None]:
import spacy
import neuralcoref

In [None]:
nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)

<spacy.lang.en.English at 0x7efd5617f410>

In [None]:
# text = "Eva and Martha didn't want their friend Jenny to feel lonely so they invited her to the party in Las Vegas."
# text = "Shivaji Bhonsale I (Marathi pronunciation c.19 February 1630 – 3 April 1680[5]), also referred to as Chhatrapati Shivaji, was an Indian ruler and a member of the Bhonsle Maratha clan. Shivaji carved out an enclave from the declining Adilshahi sultanate of Bijapur that formed the genesis of the Maratha Empire. In 1674, he was formally crowned the Chhatrapati of his realm at Raigad"
text = '''Every Tuesday and Friday, Recode’s Kara Swisher and NYU Professor Scott Galloway offer sharp, unfiltered insights into the biggest stories in tech, business, and politics. They make bold predictions, pick winners and losers, and bicker and banter like no one else. Kara is out welcoming the newest member of the Pivot family! Scott is joined by co-host Stephanie Ruhle to talk about The Great Resignation, inflation, J&J’s split, and Steve Bannon’s indictment. Also, Elon is still bullying senators on Twitter, and Beto is officially running for Governor of Texas. Plus, Scott chats with Friend of Pivot, Founder and CEO of Boom Supersonic, Blake Scholl about supersonic air travel.'''
doc = nlp(text)

In [None]:
# it has two clusters 
clusters = doc._.coref_clusters
clusters
# [a machine learning algorithm: [a machine learning algorithm, it],
#  the training set: [the training set, the training set]]

[Every Tuesday and Friday, Recode’s Kara Swisher and NYU Professor Scott Galloway: [Every Tuesday and Friday, Recode’s Kara Swisher and NYU Professor Scott Galloway, They],
 Scott: [Scott, Scott]]

In [None]:
from allennlp.predictors.predictor import Predictor


In [None]:
# model_url = 'https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2020.02.27.tar.gz' # Old model
model_url = "https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2021.03.10.tar.gz"  # new model
predictor = Predictor.from_path(model_url)

In [None]:
prediction = predictor.predict(document=text)
coref_res = predictor.coref_resolved(text)

In [None]:
print(' '.join(prediction['document']))
print(coref_res)


Every Tuesday and Friday , Recode ’s Kara Swisher and NYU Professor Scott Galloway offer sharp , unfiltered insights into the biggest stories in tech , business , and politics . They make bold predictions , pick winners and losers , and bicker and banter like no one else . Kara is out welcoming the newest member of the Pivot family ! Scott is joined by co - host Stephanie Ruhle to talk about The Great Resignation , inflation , J&J ’s split , and Steve Bannon ’s indictment . Also , Elon is still bullying senators on Twitter , and Beto is officially running for Governor of Texas . Plus , Scott chats with Friend of Pivot , Founder and CEO of Boom Supersonic , Blake Scholl about supersonic air travel .
Every Tuesday and Friday, Recode’s Kara Swisher and NYU Professor Scott Galloway offer sharp, unfiltered insights into the biggest stories in tech, business, and politics. Recode’s Kara Swisher and NYU Professor Scott Galloway make bold predictions, pick winners and losers, and bicker and ba

In [None]:
def get_span_words(span, document):
    return ' '.join(document[span[0]:span[1]+1])

def print_clusters(prediction):
    document, clusters = prediction['document'], prediction['clusters']
    for cluster in clusters:
        print(get_span_words(cluster[0], document) + ': ', end='')
        print(f"[{'; '.join([get_span_words(span, document) for span in cluster])}]")
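
The helpers above rely on each span being a `[start, end]` pair of token indices with an inclusive end, which is why the slice uses `span[1]+1`. A minimal sketch of that convention, using a hand-written stand-in for the prediction dictionary (the `toy_prediction` data and the `span_text` name are illustrative assumptions, not model output):

```python
# Hand-written stand-in for a coreference prediction (not real model output).
toy_prediction = {
    "document": ["Kara", "said", "she", "was", "out", "."],
    "clusters": [[[0, 0], [2, 2]]],  # spans are [start, end], end inclusive
}


def span_text(span, tokens):
    """Join the tokens covered by an inclusive [start, end] span."""
    start, end = span
    return " ".join(tokens[start:end + 1])  # +1 because end is inclusive


for cluster in toy_prediction["clusters"]:
    mentions = [span_text(s, toy_prediction["document"]) for s in cluster]
    print(f"{mentions[0]}: [{'; '.join(mentions)}]")
# Kara: [Kara; she]
```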

In [None]:
# print_clusters(prediction)
# old model
# deciding: [deciding; it]
# your training set: [your training set; the training set; the training set]
# using: [using; it]

In [None]:
print_clusters(prediction)
# new model
# deciding: [deciding; it]
# your training set: [your training set; the training set; the training set]
# using: [using; it]

Recode ’s Kara Swisher and NYU Professor Scott Galloway: [Recode ’s Kara Swisher and NYU Professor Scott Galloway; They]
Recode ’s Kara Swisher: [Recode ’s Kara Swisher; Kara]
NYU Professor Scott Galloway: [NYU Professor Scott Galloway; Scott; Scott]
the Pivot family: [the Pivot family; Pivot]


# Model intersection strategies (ensemble)
We use two models: **AllenNLP** and **Huggingface**. In multiple tests AllenNLP turned out to be much better: it has higher precision and recall (on the Google GAP dataset) and finds many more clusters overall. That's why we decided to take the AllenNLP answers as our ground truth.
However, AllenNLP also makes mistakes: it has about 8% false positives, which we would like to minimize. That's why we propose **several intersections of the AllenNLP and Huggingface outputs** (an ensemble) to filter the results and gain more confidence in the final clusters.

We propose **3** intersection strategies:


*   strict - we take only those clusters that are identical in both AllenNLP and Huggingface (intersection of clusters)
*   partial - we take all of the spans that are identical in both AllenNLP and Huggingface (intersection of spans/mentions)
*   fuzzy - we take all of the spans that are identical, and we also find spans that overlap (they refer to the same entity but are composed of a different number of tokens) and choose the shorter one
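
The difference between the strict and partial strategies can be sketched on toy data. This is a minimal illustration rather than the classes defined later in this notebook: the `(start, end)` spans are made-up token indices, the function names `strict` and `partial` are hypothetical, and clusters are compared as sets, so the order of spans is ignored.

```python
# Toy clusters: each span is a (start, end) token-index pair (made-up values).
allen = [[(0, 1), (5, 5), (9, 9)], [(3, 3), (12, 12)]]
hugging = [[(0, 1), (5, 5)], [(3, 3), (12, 12)]]


def strict(a_clusters, b_clusters):
    """Keep only clusters that appear identically in both outputs."""
    b_sets = [set(c) for c in b_clusters]
    return [c for c in a_clusters if set(c) in b_sets]


def partial(a_clusters, b_clusters):
    """Keep spans shared by a pair of clusters if at least two are shared."""
    result = []
    for a in a_clusters:
        shared = []
        for b in b_clusters:
            common = sorted(set(a) & set(b))
            if len(common) > 1:  # a single shared span is not a cluster
                shared += common
        if shared:
            result.append(shared)
    return result


print(strict(allen, hugging))   # [[(3, 3), (12, 12)]]
print(partial(allen, hugging))  # [[(0, 1), (5, 5)], [(3, 3), (12, 12)]]
```

In this toy case the first AllenNLP cluster is dropped by the strict strategy because Huggingface misses one of its mentions, while the partial strategy keeps the two mentions that both models agree on.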


## Example

**Text**

In 1311 it was settled on Peter and Lucy for life with remainder to William Everard and his wife Beatrice. Peter had died by 1329 but Lucy lived until 1337 and she was succeeded by William Everard who died in 1343. William's son, Sir Edmund Everard inherited and maintained ownership jointly with his wife Felice until he died in 1370.

**AllenNLP clusters**

William Everard --> William Everard; his; William Everard who died in 1343; William's

Peter --> Peter; Peter

Lucy --> Lucy; Lucy; she

William's son, Sir Edmund Everard --> William's son, Sir Edmund Everard; his; he

**Huggingface clusters**

William Everard --> William Everard; his; William Everard; Sir Edmund Everard; his

Peter --> Peter; Peter; he

Lucy --> Lucy; Lucy; she

his wife Beatrice --> his wife Beatrice; his wife Felice


**Strategies**


1.   **Strict**

        Lucy --> Lucy; Lucy; she

2.   **Partial**

        William Everard --> William Everard; his

        Peter --> Peter; Peter

        Lucy --> Lucy; Lucy; she

3.   **Fuzzy**

        William Everard --> William Everard; his; William Everard

        Peter --> Peter; Peter

        Lucy --> Lucy; Lucy; she

        Sir Edmund Everard --> Sir Edmund Everard; his










In [None]:
from abc import ABC, abstractmethod
from copy import deepcopy
from typing import List

from spacy.tokens import Doc


class IntersectionStrategy(ABC):

    def __init__(self, allen_model, hugging_model):
        self.allen_clusters = []
        self.hugging_clusters = []
        self.allen_model = allen_model
        self.hugging_model = hugging_model
        self.document = []
        self.doc = None

    @abstractmethod
    def get_intersected_clusters(self):
        raise NotImplementedError

    @staticmethod
    def get_span_noun_indices(doc: Doc, cluster: List[List[int]]):
        spans = [doc[span[0]:span[1]+1] for span in cluster]
        spans_pos = [[token.pos_ for token in span] for span in spans]
        span_noun_indices = [i for i, span_pos in enumerate(spans_pos)
            if any(pos in span_pos for pos in ['NOUN', 'PROPN'])]
        return span_noun_indices

    @staticmethod
    def get_cluster_head(doc: Doc, cluster: List[List[int]], noun_indices: List[int]):
        head_idx = noun_indices[0]
        head_start, head_end = cluster[head_idx]
        head_span = doc[head_start:head_end+1]
        return head_span, [head_start, head_end]

    @staticmethod
    def is_containing_other_spans(span: List[int], all_spans: List[List[int]]):
        return any([s[0] >= span[0] and s[1] <= span[1] and s != span for s in all_spans])

    def coref_resolved_improved(self, doc: Doc, clusters: List[List[List[int]]]):
        resolved = [tok.text_with_ws for tok in doc]
        all_spans = [span for cluster in clusters for span in cluster]  # flattened list of all spans

        for cluster in clusters:
            noun_indices = self.get_span_noun_indices(doc, cluster)
            if noun_indices:
                mention_span, mention = self.get_cluster_head(doc, cluster, noun_indices)

                for coref in cluster:
                    if coref != mention and not self.is_containing_other_spans(coref, all_spans):
                        final_token = doc[coref[1]]
                        if final_token.tag_ in ["PRP$", "POS"]:
                            resolved[coref[0]] = mention_span.text + "'s" + final_token.whitespace_
                        else:
                            resolved[coref[0]] = mention_span.text + final_token.whitespace_

                        for i in range(coref[0] + 1, coref[1] + 1):
                            resolved[i] = ""

        return "".join(resolved)

    def clusters(self, text):
        self.acquire_models_clusters(text)
        return self.get_intersected_clusters()

    def resolve_coreferences(self, text: str):
        clusters = self.clusters(text)
        resolved_text = self.coref_resolved_improved(self.doc, clusters)
        return resolved_text

    def acquire_models_clusters(self, text: str):
        allen_prediction = self.allen_model.predict(text)
        self.allen_clusters = allen_prediction['clusters']
        self.document = allen_prediction['document']
        self.doc = self.hugging_model(text)
        hugging_clusters = self._transform_huggingface_answer_to_allen_list_of_clusters()
        self.hugging_clusters = hugging_clusters

    def _transform_huggingface_answer_to_allen_list_of_clusters(self):
        list_of_clusters = []
        for cluster in self.doc._.coref_clusters:
            list_of_clusters.append([])
            for span in cluster:
                list_of_clusters[-1].append([span[0].i, span[-1].i])
        return list_of_clusters


class PartialIntersectionStrategy(IntersectionStrategy):

    def get_intersected_clusters(self):
        intersected_clusters = []
        for allen_cluster in self.allen_clusters:
            intersected_cluster = []
            for hugging_cluster in self.hugging_clusters:
                allen_set = set(tuple([tuple(span) for span in allen_cluster]))
                hugging_set = set(tuple([tuple(span) for span in hugging_cluster]))
                intersect = sorted([list(el) for el in allen_set.intersection(hugging_set)])
                if len(intersect) > 1:
                    intersected_cluster += intersect
            if intersected_cluster:
                intersected_clusters.append(intersected_cluster)
        return intersected_clusters

class FuzzyIntersectionStrategy(PartialIntersectionStrategy):
    """ Is treated as a PartialIntersectionStrategy, yet first must map AllenNLP spans and Huggingface spans. """

    @staticmethod
    def flatten_cluster(list_of_clusters):
        return [span for cluster in list_of_clusters for span in cluster]

    def _check_whether_spans_are_within_range(self, allen_span, hugging_span):
        allen_range = range(allen_span[0], allen_span[1]+1)
        hugging_range = range(hugging_span[0], hugging_span[1]+1)
        allen_within = allen_span[0] in hugging_range and allen_span[1] in hugging_range
        hugging_within = hugging_span[0] in allen_range and hugging_span[1] in allen_range
        return allen_within or hugging_within

    def _add_span_to_list_dict(self, allen_span, hugging_span):
        if (allen_span[1]-allen_span[0] > hugging_span[1]-hugging_span[0]):
            self._add_element(allen_span, hugging_span)
        else:
            self._add_element(hugging_span, allen_span)

    def _add_element(self, key_span, val_span):
        if tuple(key_span) in self.swap_dict_list.keys():
            self.swap_dict_list[tuple(key_span)].append(tuple(val_span))
        else:
            self.swap_dict_list[tuple(key_span)] = [tuple(val_span)]

    def _filter_out_swap_dict(self):
        # Each key is a longer span; its values are the shorter (or identical)
        # spans it overlaps with. Keep one replacement per key: the longest value.
        swap_dict = {}
        for key, vals in self.swap_dict_list.items():
            swap_dict[key] = max(vals, key=lambda x: x[1] - x[0])
        return swap_dict

    def _swap_mapped_spans(self, list_of_clusters, model_dict):
        for cluster_idx, cluster in enumerate(list_of_clusters):
            for span_idx, span in enumerate(cluster):
                if tuple(span) in model_dict.keys():
                    list_of_clusters[cluster_idx][span_idx] = list(model_dict[tuple(span)])
        return list_of_clusters

    def get_mapped_spans_in_lists_of_clusters(self):
        self.swap_dict_list = {}
        for allen_span in self.flatten_cluster(self.allen_clusters):
            for hugging_span in self.flatten_cluster(self.hugging_clusters):
                if self._check_whether_spans_are_within_range(allen_span, hugging_span):
                    self._add_span_to_list_dict(allen_span, hugging_span)
        swap_dict = self._filter_out_swap_dict()

        allen_clusters_mapped = self._swap_mapped_spans(deepcopy(self.allen_clusters), swap_dict)
        hugging_clusters_mapped = self._swap_mapped_spans(deepcopy(self.hugging_clusters), swap_dict)
        return allen_clusters_mapped, hugging_clusters_mapped

    def get_intersected_clusters(self):
        allen_clusters_mapped, hugging_clusters_mapped = self.get_mapped_spans_in_lists_of_clusters()
        self.allen_clusters = allen_clusters_mapped
        self.hugging_clusters = hugging_clusters_mapped
        return super().get_intersected_clusters()



class StrictIntersectionStrategy(IntersectionStrategy):

    def get_intersected_clusters(self):
        intersected_clusters = []
        for allen_cluster in self.allen_clusters:
            for hugging_cluster in self.hugging_clusters:
                if allen_cluster == hugging_cluster:
                    intersected_clusters.append(allen_cluster)
        return intersected_clusters

In [None]:
print("~~~ AllenNLP clusters ~~~")
print_clusters(prediction)
print("\n~~~ Huggingface clusters ~~~")
for cluster in doc._.coref_clusters:
    print(cluster)

# ~~~ AllenNLP clusters ~~~
# deciding: [deciding; it]
# your training set: [your training set; the training set; the training set]
# using: [using; it]

# ~~~ Huggingface clusters ~~~
# a machine learning algorithm: [a machine learning algorithm, it]
# the training set: [the training set, the training set]

~~~ AllenNLP clusters ~~~
Recode ’s Kara Swisher and NYU Professor Scott Galloway: [Recode ’s Kara Swisher and NYU Professor Scott Galloway; They]
Recode ’s Kara Swisher: [Recode ’s Kara Swisher; Kara]
NYU Professor Scott Galloway: [NYU Professor Scott Galloway; Scott; Scott]
the Pivot family: [the Pivot family; Pivot]

~~~ Huggingface clusters ~~~
Every Tuesday and Friday, Recode’s Kara Swisher and NYU Professor Scott Galloway: [Every Tuesday and Friday, Recode’s Kara Swisher and NYU Professor Scott Galloway, They]
Scott: [Scott, Scott]


In [None]:
strict = StrictIntersectionStrategy(predictor, nlp)
partial = PartialIntersectionStrategy(predictor, nlp)
fuzzy = FuzzyIntersectionStrategy(predictor, nlp)

In [None]:

def get_cluster_head_idx(doc, cluster):
    noun_indices = IntersectionStrategy.get_span_noun_indices(doc, cluster)
    return noun_indices[0] if noun_indices else 0


def print_clusters(doc, clusters):
    def get_span_words(span, allen_document):
        return ' '.join(allen_document[span[0]:span[1]+1])

    allen_document = [t.text for t in doc]
    new_clusters = []
    for cluster in clusters:
        # The head is the first span with a noun, or the first span if none has one.
        cluster_head = cluster[get_cluster_head_idx(doc, cluster)]
        key = get_span_words(cluster_head, allen_document)
        value = [get_span_words(span, allen_document) for span in cluster]
        print(f"{key} - [{'; '.join(value)}]")
        new_clusters.append((key, value))
    return new_clusters

In [None]:
for intersection_strategy in [strict, partial, fuzzy]:
    print(f'\n~~~ {intersection_strategy.__class__.__name__} clusters ~~~')
    print(print_clusters(doc, intersection_strategy.clusters(text)))

# ~~~ StrictIntersectionStrategy clusters ~~~

# ~~~ PartialIntersectionStrategy clusters ~~~
# the training set - [the training set; the training set]

# ~~~ FuzzyIntersectionStrategy clusters ~~~
# the training set - [the training set; the training set]


~~~ StrictIntersectionStrategy clusters ~~~
[]

~~~ PartialIntersectionStrategy clusters ~~~
Scott - [Scott; Scott]
[('Scott', ['Scott', 'Scott'])]

~~~ FuzzyIntersectionStrategy clusters ~~~
Recode ’s Kara Swisher and NYU Professor Scott Galloway - [Recode ’s Kara Swisher and NYU Professor Scott Galloway; They]
Scott - [Scott; Scott]
[('Recode ’s Kara Swisher and NYU Professor Scott Galloway', ['Recode ’s Kara Swisher and NYU Professor Scott Galloway', 'They']), ('Scott', ['Scott', 'Scott'])]


## The Problem
The AllenNLP coreference resolution model seems to find better clusters: they are more numerous and usually more accurate than the ones found by the Huggingface NeuralCoref model. However, the biggest problem lies in the next step, which replaces the found mentions with the most meaningful span from each cluster (which we call the "head"). We've found a couple of easy-to-fix problems that seem to cause errors most often. Our ideas can be summed up as:


*   not resolving coreferences in clusters that don't contain any noun phrase (usually these are clusters composed only of pronouns),
*   choosing a noun phrase (not a pronoun) as the head of the cluster,
*   resolving only the inner span in the case of nested coreferent mentions.
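
These three ideas can be sketched without any model. The snippet below is only an illustration under stated assumptions: the part-of-speech tags are hand-written rather than produced by spaCy, the clusters and indices are invented, and the helper names `pick_head` and `contains_other_span` are hypothetical.

```python
NOUN_TAGS = {"NOUN", "PROPN"}


def pick_head(cluster, pos_tags):
    """Return the first span that contains a noun, or None if there is none."""
    for start, end in cluster:
        if NOUN_TAGS & set(pos_tags[start:end + 1]):
            return (start, end)
    return None  # pronoun-only cluster: leave it unresolved


def contains_other_span(span, all_spans):
    """True if some different span lies entirely inside `span`."""
    return any(s != span and span[0] <= s[0] and s[1] <= span[1]
               for s in all_spans)


# Hand-written tags for: "She left . Kara Swisher returned . She smiled ."
pos_tags = ["PRON", "VERB", "PUNCT", "PROPN", "PROPN",
            "VERB", "PUNCT", "PRON", "VERB", "PUNCT"]

mixed_cluster = [(0, 0), (3, 4), (7, 7)]  # the head is not the first mention
pronoun_cluster = [(0, 0), (7, 7)]        # no noun phrase at all

print(pick_head(mixed_cluster, pos_tags))    # (3, 4)
print(pick_head(pronoun_cluster, pos_tags))  # None

# Invented nested mentions: the outer span is skipped, the inner one resolved.
nested = [(10, 14), (10, 10)]
print(contains_other_span((10, 14), nested))  # True
print(contains_other_span((10, 10), nested))  # False
```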


## Original AllenNLP implementation of the `replace_corefs` method

We extract the main "logic" into a separate function that each of our methods will use, so the core of AllenNLP's logic stays untouched. For now, we compare our solutions to `original_replace_corefs`, a copy of the `replace_corefs` method implemented in AllenNLP's `coref.py` (we've copied it here explicitly so it can be compared with the improved method we propose).

In [None]:
from spacy.tokens import Doc, Span

def core_logic_part(document: Doc, coref: List[int], resolved: List[str], mention_span: Span):
    final_token = document[coref[1]]
    if final_token.tag_ in ["PRP$", "POS"]:
        resolved[coref[0]] = mention_span.text + "'s" + final_token.whitespace_
    else:
        resolved[coref[0]] = mention_span.text + final_token.whitespace_
    for i in range(coref[0] + 1, coref[1] + 1):
        resolved[i] = ""
    return resolved


def get_span_noun_indices(doc: Doc, cluster: List[List[int]]) -> List[int]:
    spans = [doc[span[0]:span[1]+1] for span in cluster]
    spans_pos = [[token.pos_ for token in span] for span in spans]
    span_noun_indices = [i for i, span_pos in enumerate(spans_pos)
        if any(pos in span_pos for pos in ['NOUN', 'PROPN'])]
    return span_noun_indices

def is_containing_other_spans(span: List[int], all_spans: List[List[int]]):
    return any([s[0] >= span[0] and s[1] <= span[1] and s != span for s in all_spans])

def get_cluster_head(doc: Doc, cluster: List[List[int]], noun_indices: List[int]):
    head_idx = noun_indices[0]
    head_start, head_end = cluster[head_idx]
    head_span = doc[head_start:head_end+1]
    return head_span, [head_start, head_end]

def improved_replace_corefs(document, clusters):
    """
    Nested coreferent mentions
    """
    resolved = list(tok.text_with_ws for tok in document)
    all_spans = [span for cluster in clusters for span in cluster]  # flattened list of all spans

    for cluster in clusters:
        noun_indices = get_span_noun_indices(document, cluster)

        if noun_indices:
            mention_span, mention = get_cluster_head(document, cluster, noun_indices)

            for coref in cluster:
                if coref != mention and not is_containing_other_spans(coref, all_spans):
                    core_logic_part(document, coref, resolved, mention_span)

    return "".join(resolved)


# def improved_replace_corefs(document, clusters):
#     """ Corefecnt not head  """
#     resolved = list(tok.text_with_ws for tok in document)

#     for cluster in clusters:
#         noun_indices = get_span_noun_indices(document, cluster)

#         if noun_indices:
#             mention_span, mention = get_cluster_head(document, cluster, noun_indices)

#             for coref in cluster:
#                 if coref != mention:  # we don't replace the head itself
#                     core_logic_part(document, coref, resolved, mention_span)

#     return "".join(resolved)

"""
Improvements
Redundant clusters - lack of a meaningful mention that could become the head
We completely ignore (we don't resolve them at all) the clusters that don't contain any noun phrase.
"""


def original_replace_corefs(document: Doc, clusters: List[List[List[int]]]) -> str:
    resolved = list(tok.text_with_ws for tok in document)

    for cluster in clusters:
        mention_start, mention_end = cluster[0][0], cluster[0][1] + 1
        mention_span = document[mention_start:mention_end]

        for coref in cluster[1:]:
            core_logic_part(document, coref, resolved, mention_span)

    return "".join(resolved)



def print_comparison(resolved_original_text, resolved_improved_text):
    print(f"~~~ AllenNLP original replace_corefs ~~~\n{resolved_original_text}")
    print(f"\n~~~ Our improved replace_corefs ~~~\n{resolved_improved_text}")

In [None]:
t = "We want to take our code and create a game. Let's remind ourselves how to do that." #'"He is a great actor!", he said about John Travolta.'
clusters = predictor.predict(t)['clusters']
doc = nlp(t)
print(clusters)

[[[0, 0], [4, 4], [12, 12], [14, 14]]]


In [None]:
print_comparison(original_replace_corefs(doc, clusters), improved_replace_corefs(doc, clusters))
# ~~~ AllenNLP original replace_corefs ~~~
# We want to take We's code and create a game. LetWe remind We how to do that.

# ~~~ Our improved replace_corefs ~~~
# We want to take our code and create a game. Let's remind ourselves how to do that.


~~~ AllenNLP original replace_corefs ~~~
We want to take We's code and create a game. LetWe remind We how to do that.

~~~ Our improved replace_corefs ~~~
We want to take our code and create a game. Let's remind ourselves how to do that.
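
The `We's` in the original output follows directly from the replacement rule: every later mention is overwritten with the first mention, and `'s` is appended whenever the replaced token is tagged `PRP$` or `POS`. A minimal sketch of that rule on plain lists, assuming hand-written tokens and tags in place of a spaCy `Doc` (the function name `replace_with_first_mention` is hypothetical):

```python
# Hand-written tokens (with trailing whitespace) and Penn Treebank tags for
# "We want to take our code." as stand-ins for what a spaCy Doc would provide.
tokens = ["We ", "want ", "to ", "take ", "our ", "code", "."]
tags = ["PRP", "VBP", "TO", "VB", "PRP$", "NN", "."]
cluster = [(0, 0), (4, 4)]  # "We" and "our"


def replace_with_first_mention(tokens, tags, cluster):
    """Overwrite every later mention with the text of the first mention."""
    resolved = list(tokens)
    head_start, head_end = cluster[0]
    head_text = "".join(tokens[head_start:head_end + 1]).strip()
    for start, end in cluster[1:]:
        last = tokens[end]
        trailing_ws = last[len(last.rstrip()):]
        # Possessive mentions get "'s" appended to the head text.
        suffix = "'s" if tags[end] in ("PRP$", "POS") else ""
        resolved[start] = head_text + suffix + trailing_ws
        for i in range(start + 1, end + 1):
            resolved[i] = ""
    return "".join(resolved)


print(replace_with_first_mention(tokens, tags, cluster))
# We want to take We's code.
```

Because the cluster contains only pronouns, the improved method skips it entirely, which is why its output leaves the sentence unchanged.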
