# Overview of Tagging System
The ESSC tagging system is designed to automatically detect entities present in a text to support text search and, primarily, discovery of related texts. 

The system extracts both tags (provided keywords) and entities (mentions of organizations, places, etc.) using Natural Language Processing (NLP). The `spacy` package was selected as the primary text processing software for this project for being well suited to this task, well documented, and feature rich out of the box.

The system currently supports English texts, including those with some Spanish phrases and place names.

This notebook illustrates the pipeline for reading, summarizing, and tagging an individual text with example outputs.

For alternatives considered, including alternative packages explored, see the supporting notebooks referenced in the text.

#### Contents
* Problem understanding
* Solution description
* Environment set up
* Reading text
* Setting up `spacy`
    * Abbreviation Detector
    * Pattern Matcher (for tags)
    * Entity Recognition
* Processing text with `spacy`
* Entity Linking
    * De-fragmenting entities
    * Linking entities
* Summarizing

## Problem Understanding
The primary technical problem in this project is known as 'entity linking' (text summarization is also a technical challenge, but a comparatively mild one, so its discussion is left to the section implementing the code).

For the ESSC database to be more useful than simply a search over all reports, we must be able to identify known entities and link entities even if represented in a different way. These problems are known as Named Entity Recognition (NER) and Entity Linking, respectively. 

`spacy` provides a statistical model for NER out of the box. As a statistical model, it is not perfect but it performs fairly well. It is better than a simple keyword extraction algorithm (see `yake-rake kw extract.ipynb` for examples of keyword extraction) since it identifies entities we are interested in rather than just the most common words or phrases.

However, `spacy` does not yet provide a pre-trained model for entity linking (although they are working on it). Entities may be mentioned in text with slightly different representations (e.g., Chemonics, Chemonics, Inc.). Resolving these mentions back to a known entity (where known entities are captured in a Knowledge Base) allows us to better identify when the same entity is mentioned in a single text or across texts. Using existing Knowledge Bases like wikidata will allow us to further enrich our understanding by incorporating the knowledge in that database. 

There are many 'authorities' (i.e., Knowledge Bases) that can be consulted to find 'official' representations of known entities and their aliases. Many of these are tailored to specific use cases (e.g., the proper name of a gene or of a municipality). Because wikidata is the broadest, and incorporates many of these authorities by reference, we chose wikidata (or wikipedia, which is a separate resource) as the primary authority. 

Note that we cannot simply specify all aliases of known entities because the tagging system is meant to be used on new text. New text will surely have as-yet-unseen synonyms of known entities. Thus, we must use a process that will be more robust to new entities.

## Solution Description
To resolve an entity mention back to a wikidata entity, we use a layered approach to processing the text and resolving ambiguity.

First, we translate all acronyms/abbreviations to their expanded names when they are introduced in the text. This reduces the ambiguity for acronyms (which are quite common in our corpus). 

Next, we use `spacy` to identify the entities through NER.

We attempt to resolve fragmented entities based on their relationship to other entities in the text. Due to formatting or sentence structure, an entity can be split in two or only a portion is captured by the NER model. When this happens, we look for other entities in the text that include the entity fragment and resolve the entity back.
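The fragment-resolution idea can be sketched on plain strings. This is a minimal illustration, not the notebook's implementation: the helper names `contains_words` and `defragment` are hypothetical, and the rule of only resolving a fragment when exactly one longer entity contains it is an assumption made here to avoid guessing between competing candidates.

```python
def contains_words(container, fragment):
    """True if `fragment` occurs in `container` as a run of whole words."""
    c, f = container.lower().split(), fragment.lower().split()
    return len(f) < len(c) and any(
        c[i:i + len(f)] == f for i in range(len(c) - len(f) + 1)
    )


def defragment(entities):
    """Map each entity to the single longer entity that contains it.

    Assumption: a fragment is only resolved when exactly one other entity
    contains it; ambiguous or standalone entities map to themselves.
    """
    resolved = {}
    for fragment in entities:
        containers = [e for e in entities if contains_words(e, fragment)]
        resolved[fragment] = containers[0] if len(containers) == 1 else fragment
    return resolved


mentions = ["Better Gold", "Better Gold Initiative", "Mining Census",
            "The Mining Census", "Colombia"]
resolved = defragment(mentions)
print(resolved["Better Gold"])    # Better Gold Initiative
print(resolved["Mining Census"])  # The Mining Census
print(resolved["Colombia"])       # Colombia
```

Matching on whole words rather than raw substrings keeps a short fragment from attaching to an unrelated entity that merely shares some characters.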

If an entity is already represented in the database, we simply assign it the identifier of the existing tag.

For novel entities, we link entities using APIs for wikidata, wikipedia, google, and a disambiguation service called `tagme`. 
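The layering of "existing tag first, external services second" can be sketched as a simple fall-through lookup. Everything named here is hypothetical: `resolve_entity`, the toy `known_tags` table, and the stand-in linker callables, which take the place of real API clients and return made-up identifiers.

```python
def resolve_entity(mention, known_tags, linkers):
    """Return (identifier, source) for a mention using a layered lookup.

    known_tags: dict mapping normalized names to existing identifiers.
    linkers: ordered list of (name, callable) pairs; each callable takes the
        mention and returns an identifier or None. These stand in for
        external services and are hypothetical.
    """
    key = mention.strip().lower()
    if key in known_tags:
        return known_tags[key], "database"
    for name, link in linkers:
        identifier = link(mention)
        if identifier is not None:
            return identifier, name
    return None, None


# Toy stand-ins: a real linker would query an external API.
known_tags = {"chemonics": "tag-001"}
fake_wikidata = {"World Bank": "wd-example-1"}.get  # made-up identifier
linkers = [("wikidata", fake_wikidata), ("fallback", lambda mention: None)]

print(resolve_entity("Chemonics", known_tags, linkers))
print(resolve_entity("World Bank", known_tags, linkers))
print(resolve_entity("Unknown Org", known_tags, linkers))
```

Keeping the linkers in an ordered list makes it easy to change which service is consulted first without touching the lookup logic.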

Tags (provided keywords) are pre-associated with known entities (if available) or assigned a unique identifier before extraction from the text.

## Environment set up
This project uses Python 3.9.1 and the following packages:
* **tika**: for reading pdfs
* **spacy**: for natural language processing
* **re**: for basic pre-nlp pattern matching
* **os**: file i/o
* **wikipedia**: for entity linking to wikidata

Tika requires Java to be installed on your computer.

To duplicate this notebook, you will also need to download spacy's English language model:
`python -m spacy download en_core_web_lg`
and load the python kernel into your environment:
`python -m ipykernel install --user`

We use the `en_core_web_lg` model because it includes word vectors (rather than simply tensors) for each word, which will become important for the entity linking.

In [1]:
!python --version

Python 3.9.1


In [2]:
from tika import parser
import re
import os
import spacy
from spacy.pipeline import EntityRuler
from collections import OrderedDict
from spacy.lang.en.stop_words import STOP_WORDS as STOP_WORDS_EN
from spacy.lang.es.stop_words import STOP_WORDS as STOP_WORDS_ES
from spacy import displacy

# Combine english and spanish stop words
STOP_WORDS = STOP_WORDS_EN | STOP_WORDS_ES

# Load the english language model
nlp = spacy.load("en_core_web_lg")

## Reading Text
PDF and plain text are currently supported. The text is processed to remove some punctuation (especially bullets and periods used in tables of contents) and to remove excess spaces. Text size is limited to 1,000,000 characters (the default value of `spacy`'s `nlp.max_length`).
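The dot-handling rules are order-sensitive: runs of four or more dots must be removed before three-dot ellipses are replaced. A minimal standalone sketch of just those substitutions follows; the helper name `clean_dots` is hypothetical and simply mirrors the regular expressions used in `read_text`.

```python
import re


def clean_dots(text):
    text = re.sub(r'\.{4,}', '', text)           # TOC leaders: four or more dots
    text = re.sub(r'\.\.\.', '---', text)        # three-dot ellipsis
    text = re.sub(r'\.\s\.\s\.\s', '---', text)  # spaced ellipsis ". . . "
    text = text.replace('\u2026', '---')         # single ellipsis character
    text = re.sub(r'\.\.', '', text)             # leftover double dots
    return ' '.join(text.split())                # collapse whitespace


sample = "Introduction ........ 3 He paused... then left. Wait\u2026 what"
cleaned = clean_dots(sample)
print(cleaned)
```

If the three-dot rule ran first, a table-of-contents leader would be turned into a string of dashes instead of being dropped.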

In [3]:
def read_text(f):
    """converts pdf or txt file to string"""
    _, file_extension = os.path.splitext(f)
    file_extension = file_extension.lower()
    if file_extension == '.txt':
        with open(f, encoding="utf8") as reader:
            text = reader.read()
    elif file_extension == '.pdf':
        raw = parser.from_file(f)
        text = raw['content']
        if not text:
            # tika returns None when no text can be extracted
            print('No text found')
            return ''
    else:
        # fail early rather than continuing with an undefined `text`
        raise ValueError(f"Incorrect file extension: {file_extension!r}")
    
    # drop punctuation except periods, apostrophes, and hyphens
    # (note some files use different characters than ASCII apostrophes)
#     text = re.sub(r'[/:"]', " ", text)  # r'[()/:"]'
    # drop bullets (requires utf8 encoding)
    text = text.replace('\u2022', '')
    text = text.replace('\u2219', '')
    # drop byte order marks
    text = text.replace('\ufeff', '')
    # drop periods when used in TOC, etc (i.e., multiple periods except ellipsis)
    text = re.sub(r'\.\.\.\.+', '', text)  # four or more dots
    # replace ellipses so they do not confuse spacy's sentence parser
    text = re.sub(r'\.\.\.', '---', text)  # three dots
    text = re.sub(r'\.\s\.\s\.\s', '---', text)  # three dots with spaces
    text = text.replace('\u2026', '---')  # ellipsis character
    # replace two dots
    text = re.sub(r'\.\.', '', text)  # two dots
    
    # remove excess spaces
    text = ' '.join(text.split())
    
    return text[:1000000]  # spacy nlp character limit is 1000000

**Specify a folder and filename below to read in text.** Note that on initialization tika will emit a WARN message, which can be disregarded.

In [4]:
report_folder = r'D:\data\essc-knowledge-base\reports_eng'
report = '04-14-Minnick-723_paper.pdf'
text = read_text(os.path.join(report_folder, report))
text

2021-02-12 14:11:49,077 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


'Formalization of Artisanal and Small-Scale Gold Mining in Colombia FORMALIZATION OF ARTISANAL AND SMALL-SCALE GOLD MINING IN COLOMBIA: A PROPOSAL FOR IMPROVING ENVIRONMENTAL, SOCIAL, AND ECONOMIC PERFORMANCE IN A POST-CONFLICT SCENARIO GREGORY MINNICK, PETER DOYLE, THOMAS HENTSCHEL Chemonics Inc., Projekt-Consult GmbH gminnick@chemonics.com Paper prepared for presentation at the “2020 WORLD BANK CONFERENCE ON LAND AND POVERTY” The World Bank - Washington DC, March 16-20, 2020 Copyright 2020 by author(s). All rights reserved. Readers may make verbatim copies of this document for non-commercial purposes by any means, provided that this copyright notice appears on all such copies. Abstract Artisanal and small-scale gold mining (ASGM) is an important economic activity in many rural areas, with a long tradition in Colombia. High rates of informality undermine governance, security, and legal economies, and cause significant environmental and social impacts. As the government of Colombia con

## Processing text with `spacy` (Part 1)
Before doing anything else, we must process the text using `spacy`'s NLP pipeline. The pipeline allows us to do multiple things with one pass over the text for efficiency. First, we'll need to load in any custom pipeline components now so that they are available. We'll continue by processing the text using the pipeline components in Part 2.

Let's see which pipeline components are already loaded:

In [5]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

For efficiency, we can disable any pipeline components that we don't need. [There are multiple ways to do this](https://spacy.io/usage/processing-pipelines#disabling), but we'll simply use `nlp.select_pipes(disable=[])`.

In [6]:
nlp.select_pipes(disable=['attribute_ruler', 'lemmatizer'])
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'ner']

We'll add a custom pipeline for abbreviations (more below), but first we need to initialize an entity ruler in our pipeline for naming abbreviations as entities. The entity ruler pipeline allows for custom entities to be added to the NER model. We'll be adding abbreviations as named entities to ensure that abbreviations are captured correctly as entities.

Because we'll also use an entity_ruler for tags (provided keywords), we'll need to distinguish the two. We can do that by wrapping the entity ruler in a named Language.factory decorator.

In [7]:
from spacy.language import Language

@Language.factory("ruler_abrv")
def named_entity_ruler(nlp, name):
    return EntityRuler(nlp)

ruler_abrv = nlp.add_pipe('ruler_abrv', before='ner')
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'ruler_abrv', 'ner']

## Parsing Abbreviations
Abbreviations are a key source of ambiguity for NER. We use the Abbreviation module from [`scispacy`](https://pypi.org/project/scispacy/) with a small modification to ignore stop words in acronyms and add the abbreviations to the ruler_abrv Entity Ruler so they can be identified as entities during NER.
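To see why skipping stop words matters, here is a simplified, string-only variant of the matching idea. It is a sketch under stated assumptions, not the `scispacy` implementation: the function name `match_abbreviation` is hypothetical, and it requires every abbreviation character to match the first letter of a word, which is stricter than the character-level matching used in the full algorithm.

```python
def match_abbreviation(short_form, long_words, stop_words=frozenset()):
    """Return the long-form words matched by `short_form`, or None.

    Walks the abbreviation right to left, matching each character against
    the first letter of the candidate words (also right to left) and
    skipping any word found in `stop_words`.
    """
    word_index = len(long_words) - 1
    for char in reversed(short_form.lower()):
        if not char.isalnum():
            continue
        # skip stop words and words that do not start with this character
        while word_index >= 0 and (
            long_words[word_index].lower() in stop_words
            or not long_words[word_index].lower().startswith(char)
        ):
            word_index -= 1
        if word_index < 0:
            return None
        word_index -= 1
    return long_words[word_index + 1:]


candidate = "Artisanal and small-scale gold mining".split()
without_stops = match_abbreviation("ASGM", candidate)
with_stops = match_abbreviation("ASGM", candidate, stop_words={"and", "of", "the"})
print(without_stops)
print(with_stops)
```

Without stop words, the "A" in "ASGM" is matched to "and", truncating the long form; with stop words, the match continues back to "Artisanal".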

In [8]:
# edited to add abbreviations as entity labeled "ABRV" to entity ruler
# patterns. Ensure entity ruler is added before NER and after
# abbreviation_detector

from typing import Tuple, List, Optional, Set, Dict
from collections import defaultdict
from spacy.tokens import Span, Doc
from spacy.matcher import Matcher
from spacy.language import Language


def find_abbreviation(
    long_form_candidate: Span, short_form_candidate: Span, 
    STOP_WORDS: List = None
) -> Tuple[Span, Optional[Span]]:
    """
    Implements the abbreviation detection algorithm in "A simple algorithm
    for identifying abbreviation definitions in biomedical text.", (Schwartz & Hearst, 2003).

    The algorithm works by enumerating the characters in the short form of the abbreviation,
    checking that they can be matched against characters in a candidate text for the long form
    in order, as well as requiring that the first letter of the abbreviated form matches the
    _beginning_ letter of a word.

    Parameters
    ----------
    long_form_candidate: Span, required.
        The spaCy span for the long form candidate of the definition.
    short_form_candidate: Span, required.
        The spaCy span for the abbreviation candidate.

    Returns
    -------
    A Tuple[Span, Optional[Span]], representing the short form abbreviation and the
    span corresponding to the long form expansion, or None if a match is not found.
    """
    long_form = "".join([x.text_with_ws for x in long_form_candidate])
    short_form = "".join([x.text_with_ws for x in short_form_candidate])
    
    # create helper list for skipping stop words
    if STOP_WORDS:
        long_form_stop = [[True]*len(token.text_with_ws) 
                          if 
                          token.text in STOP_WORDS
                          else 
                          [False]*len(token.text_with_ws)
                         for token in long_form_candidate]
        # flatten nested list
        long_form_stop = [item for sublist in long_form_stop for item in sublist]
    
    else:
        # no stop words supplied: nothing in the long form is skipped
        long_form_stop = [False]*len(long_form)
    
    long_index = len(long_form) - 1
    short_index = len(short_form) - 1

    while short_index >= 0:
        current_char = short_form[short_index].lower()
        # We don't check non alpha-numeric characters.
        if not current_char.isalnum():
            short_index -= 1
            continue

            # Does the character match at this position? ...
        while (
            (long_index >= 0 and long_form[long_index].lower() != current_char)
            or
            # .... or if we are checking the first character of the abbreviation, we enforce
            # to be the _starting_ character of a span.
            (
                short_index == 0
                and long_index > 0
                and long_form[long_index - 1].isalnum()
            )
            or
            # ... or word is one of STOP_WORDS (guarding against a negative index)
            (
                long_index >= 0
                and long_form_stop[long_index]
            )
        ):
            long_index -= 1

        if long_index < 0:
            return short_form_candidate, None

        long_index -= 1
        short_index -= 1

    # The last subtraction will either take us on to a whitespace character, or
    # off the front of the string (i.e. long_index == -1). Either way, we want to add
    # one to get back to the start character of the long form
    long_index += 1

    # Now we know the character index of the start of the character span,
    # here we just translate that to the first token beginning after that
    # value, so we can return a spaCy span instead.
    word_lengths = 0
    starting_index = None
    for i, word in enumerate(long_form_candidate):
        # need to add 1 for the space characters
        word_lengths += len(word.text_with_ws)
        if word_lengths > long_index:
            starting_index = i
            break

    return short_form_candidate, long_form_candidate[starting_index:]


def filter_matches(
    matcher_output: List[Tuple[int, int, int]], doc: Doc
) -> List[Tuple[Span, Span]]:
    # Filter into two cases:
    # 1. <Short Form> ( <Long Form> )
    # 2. <Long Form> (<Short Form>) [this case is most common].
    candidates = []
    for match in matcher_output:
        start = match[1]
        end = match[2]
        # Ignore spans with more than 8 words in them, and spans at the start of the doc
        if end - start > 8 or start == 1:
            continue
        if end - start > 3:
            # Long form is inside the parens.
            # Take one word before.
            short_form_candidate = doc[start - 2 : start - 1]
            long_form_candidate = doc[start:end]
        else:
            # Normal case.
            # Short form is inside the parens.
            short_form_candidate = doc[start:end]

            # Sum character lengths of contents of parens.
            abbreviation_length = sum([len(x) for x in short_form_candidate])
            max_words = min(abbreviation_length + 5, abbreviation_length * 2)
            # Look up to max_words backwards
            long_form_candidate = doc[max(start - max_words - 1, 0) : start - 1]

        # add candidate to candidates if candidates pass filters
        if short_form_filter(short_form_candidate):
            candidates.append((long_form_candidate, short_form_candidate))

    return candidates


def short_form_filter(span: Span) -> bool:
    # All words are between length 2 and 10
    if not all([2 <= len(x) < 10 for x in span]):
        return False

    # At least 50% of the short form should be alpha
    if (sum([c.isalpha() for c in span.text]) / len(span.text)) < 0.5:
        return False

    # The first character of the short form should be alpha
    if not span.text[0].isalpha():
        return False
    return True


# add function to add entity from match (not used by the pipeline below)
def add_event_ent(matcher, doc, i, matches, long_text):
    # Get the current match and create a span labeled with the long form.
    # Append entity to the doc's entities. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label=long_text)
    doc.ents += (entity,)
    print(entity.text)


@Language.factory("abbreviation_detector")
class AbbreviationDetector:
    """
    Detects abbreviations using the algorithm in "A simple algorithm for identifying
    abbreviation definitions in biomedical text.", (Schwartz & Hearst, 2003).
    This class sets the `._.abbreviations` attribute on spaCy Doc.
    The abbreviations attribute is a `List[Span]` where each Span has the `Span._.long_form`
    attribute set to the long form definition of the abbreviation.
    Note that this class does not replace the spans, or merge them.
    Parameters
    ----------
    nlp: `Language`, a required argument for spacy to use this as a factory
    name: `str`, a required argument for spacy to use this as a factory
    """

    def __init__(self, nlp: Language, name: str = "abbreviation_detector") -> None:
        Doc.set_extension("abbreviations", default=[], force=True)
        Span.set_extension("long_form", default=None, force=True)

        self.matcher = Matcher(nlp.vocab)
        self.matcher.add("parenthesis", [[{"ORTH": "("}, {"OP": "+"}, {"ORTH": ")"}]])
        self.global_matcher = Matcher(nlp.vocab)

    def find(self, span: Span, doc: Doc) -> Tuple[Span, Set[Span]]:
        """
        Functional version of calling the matcher for a single span.
        This method is helpful if you already have an abbreviation which
        you want to find a definition for.
        """
        dummy_matches = [(-1, int(span.start), int(span.end))]
        filtered = filter_matches(dummy_matches, doc)
        abbreviations = self.find_matches_for(filtered, doc)

        if not abbreviations:
            return span, set()
        else:
            return abbreviations[0]

    def __call__(self, doc: Doc) -> Doc:
        matches = self.matcher(doc)
        matches_no_brackets = [(x[0], x[1] + 1, x[2] - 1) for x in matches]
        filtered = filter_matches(matches_no_brackets, doc)
        occurences = self.find_matches_for(filtered, doc)
        patterns = []

        for (long_form, short_forms) in occurences:
            for short in short_forms: 
                short._.long_form = long_form 
                doc._.abbreviations.append(short)
                # add short form as pattern to entity ruler
                pattern = {"label": "ABRV", "pattern": short.text, "id": long_form.text}
                patterns.append(pattern)
            
        ruler_abrv.add_patterns(patterns)
        
        return doc

    def find_matches_for(self, filtered: List[Tuple[Span, Span]], doc: Doc) -> List[Tuple[Span, Set[Span]]]:
        rules = {}
        all_occurences: Dict[Span, Set[Span]] = defaultdict(set)
        already_seen_long: Set[str] = set()
        already_seen_short: Set[str] = set()
        for (long_candidate, short_candidate) in filtered:
            short, long = find_abbreviation(long_candidate, short_candidate, STOP_WORDS)
            # We need the long and short form definitions to be unique, because we need
            # to store them so we can look them up later. This is a bit of a
            # pathological case also, as it would mean an abbreviation had been
            # defined twice in a document. There's not much we can do about this,
            # but at least the case which is discarded will be picked up below by
            # the global matcher. So it's likely that things will work out ok most of the time.
            new_long = long.text not in already_seen_long if long else False
            new_short = short.text not in already_seen_short
            if long is not None and new_long and new_short:
                already_seen_long.add(long.text)
                already_seen_short.add(short.text)
                all_occurences[long].add(short)
                rules[long.text] = long
                # Add a rule to a matcher to find exactly this substring.
                self.global_matcher.add(long.text, [[{"ORTH": x.text} for x in short]])
        to_remove = set()
        global_matches = self.global_matcher(doc)
        for match, start, end in global_matches:
            string_key = self.global_matcher.vocab.strings[match]
            to_remove.add(string_key)
            all_occurences[rules[string_key]].add(doc[start:end])
        for key in to_remove:
            # Clean up the global matcher.
            self.global_matcher.remove(key)

        return list((k, v) for k, v in all_occurences.items())

*NOTE: Adding abbreviations as patterns to the entity ruler is one strategy for handling acronyms. Two alternatives were also considered. The first was replacing acronyms with their full text; this was surprisingly difficult, would likely confuse the NER model, and would alter the text, which is not ideal for summarization. The second was adding the abbreviation span as an entity within AbbreviationDetector's `__call__` method; however, a Span label (such as 'PERSON') can only be set during initialization, so another pass over all entities would be required to apply labels and ent_ids with a dictionary or similar approach. Ultimately, the entity ruler seems like a sufficient and simple approach.*

Now we load the Abbreviation Detector into the pipeline. We'll add it before the `ruler_abrv` Entity Ruler and NER components so that abbreviations are identified first.

In [9]:
# abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe("abbreviation_detector", before='ruler_abrv')
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'abbreviation_detector', 'ruler_abrv', 'ner']

For illustration, we can run the current pipeline on our document to see the abbreviations it found using our custom `abbreviation_detector` component.

In [10]:
doc = nlp(text)

In [11]:
set([(str(abrv), abrv._.long_form) for abrv in doc._.abbreviations])

{('ANC', Agencia Nacional Minera),
 ('ASGM', Artisanal and small-scale gold mining),
 ('ASM', Artisanal and Small Miners),
 ('EIAs', Environmental Impact Studies),
 ('GOC', government of Colombia),
 ('IDEAM', Instituto de Hidrología, Meteorología y Estudios Ambientales),
 ('JAA', Jaime Arteaga & Asociados),
 ('MME', Ministry of Mines and Energy),
 ('MPUs', mining production units),
 ('NGOs', non- governmental organizations),
 ('NUFP', National Unified Formalization Plan),
 ('OECD', Organization for Economic Cooperation and Development),
 ('SBGA', Swiss Better Gold Association),
 ('SECO', Swiss Economic Cooperation and Development),
 ('UNEP', United Nations Environmental Programme),
 ('UNODC', United Nations Office for Drugs and Crime),
 ('USAID', United States Agency for International Development)}

## Entity Recognition
Entity recognition is handled by `spacy`'s NER model. Initially, we tried disambiguating entity mentions by providing aliases encountered in the initial corpus using a pattern file (see [here](https://spacy.io/usage/rule-based-matching#entityruler-files) for instructions). However, this approach will not work as new texts are added and will not allow for generalizing of this approach to alternative corpora. We'll simply use the out-of-the-box functionality combined with our layered approach to disambiguation for entity recognition instead.

Because we've already run the pipeline to check the abbreviations, we can also check to see the entities identified. 

In [12]:
set([(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents])

{('/month', 'DATE', ''),
 ('1', 'CARDINAL', ''),
 ('1,300', 'CARDINAL', ''),
 ('1.5', 'CARDINAL', ''),
 ('104 tons', 'QUANTITY', ''),
 ('11', 'CARDINAL', ''),
 ('12', 'CARDINAL', ''),
 ('12 percent', 'PERCENT', ''),
 ('13', 'CARDINAL', ''),
 ('135', 'CARDINAL', ''),
 ('14', 'CARDINAL', ''),
 ('14,357', 'CARDINAL', ''),
 ('15', 'CARDINAL', ''),
 ('15,000 tons', 'QUANTITY', ''),
 ('150 tons', 'QUANTITY', ''),
 ('16', 'CARDINAL', ''),
 ('17', 'CARDINAL', ''),
 ('18 years', 'DATE', ''),
 ('1990s-era', 'DATE', ''),
 ('1998', 'DATE', ''),
 ('1–5', 'ORG', ''),
 ('2', 'CARDINAL', ''),
 ('2002', 'DATE', ''),
 ('2009', 'DATE', ''),
 ('2010 – 2011', 'DATE', ''),
 ('2011', 'DATE', ''),
 ('2011/2012,3', 'CARDINAL', ''),
 ('2012', 'DATE', ''),
 ('2013', 'DATE', ''),
 ('2014', 'DATE', ''),
 ('2015', 'DATE', ''),
 ('2016', 'DATE', ''),
 ('2016 – 2018', 'DATE', ''),
 ('2017', 'DATE', ''),
 ('2018', 'DATE', ''),
 ('2019', 'DATE', ''),
 ('2020', 'DATE', ''),
 ('2022 5', 'DATE', ''),
 ('2108', 'DATE', '')

From these entities, we can see that some entity types are not helpful ('CARDINAL', 'DATE') and that there is still some ambiguity (e.g., Better Gold and Better Gold Initiative and Better Gold Initiative for Artisanal and Small-Scale Mining).

Let's exclude specific entity types and also select the `ent_id_` rather than the entity text where it exists (since `ent_id_` stores our long-form version of abbreviations).

In [13]:
not_entity_types=['DATE', 'TIME', 'PERCENT',
                  'MONEY', 'QUANTITY', 'ORDINAL',
                  'CARDINAL']
entities = set()
for entity in doc.ents:  # entities instead of words
    # Keep only entities of useful types that are not stop words
    if entity.label_ not in not_entity_types \
        and entity.text not in STOP_WORDS:
        # use entity id if present
        if entity.ent_id_:  # isinstance(entity.ent_id, str):
            tag = entity.ent_id_
        else:
            tag = entity.text
        entities.add(tag)
entities

{'1–5',
 'ANM',
 'ARTISANAL',
 'AUC',
 'Abstract Artisanal',
 'Actividad Minera de Oro',
 'Afro',
 'Afro-Colombian',
 'Agencia Nacional Minera',
 'Airborne Mercury Monitoring',
 'Annual Report',
 'Antioquia',
 'Article 10 of',
 'Article 11 of',
 'Article 248 of',
 'Article 25 of',
 'Artisanal and Small Miners',
 'Artisanal and small-scale gold mining',
 'Asociación de Agromineros del Cauca',
 'BGI',
 'Bajo Cauca',
 'Better Gold',
 'Better Gold Initiative',
 'Better Gold Initiative for Artisanal and Small-Scale Mining',
 'Bogota',
 'Bolivia',
 'COLOMBIA',
 'Cauca',
 'Chocó',
 'Chocó Department',
 'Colombia',
 'Colombian',
 'Colombians',
 'Cordy',
 'Cordy et al',
 'Corte Constitucional',
 'Development and Cooperation',
 'Devolution of Claims for Formalization',
 'DuPée',
 'EIA',
 'ELN',
 'Echavarria',
 'Economic Performance',
 'El Bagre',
 'El Clan del Golfo',
 'Energía',
 'Environmental Impact Studies',
 'Environmental Management',
 'Estudio Nacional del Agua',
 'Estudio de la Cadena de

Ambiguity in text arises from two sources: synonyms (e.g., Colombia, Colombian, Colombians) and names that can refer to more than one entity (e.g., which Columbia is intended?). Synonyms can be resolved by first examining the text to find similar entities. 

We'll start by using the `.similarity` measure of each entity span with other spans to see if we can get some help finding very similar matches. Let's start with 'Better Gold'.
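By default, `spacy`'s `.similarity` is a cosine similarity between vectors, and the vector for a multi-word text is the average of its word vectors. The arithmetic can be sketched in plain Python; the three-dimensional "word vectors" below are made-up toy values, and the helper names are hypothetical.

```python
import math


def average_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy word vectors (made-up values, not taken from any model)
toy = {
    "better": [1.0, 0.0, 1.0],
    "gold": [0.0, 1.0, 1.0],
    "initiative": [1.0, 1.0, 0.0],
}
short = average_vector([toy["better"], toy["gold"]])
longer = average_vector([toy["better"], toy["gold"], toy["initiative"]])
similarity = cosine_similarity(short, longer)
print(round(similarity, 4))
```

The zero-norm guard plays the same role as the `vector_norm` check in the cells that follow: a text with no usable vector has no meaningful cosine similarity.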

In [14]:
better_gold = nlp("Better Gold")  # process the string into a Doc so it has a vector
similarities = {}
for ent in entities:
    ent = nlp(ent)
    if(ent and ent.vector_norm):  # exclude non-conforming vectors
        similarities[ent.text] = (better_gold.similarity(ent))
sorted(similarities.items(), key=lambda x:x[1], reverse=True)

[('Better Gold', 1.0),
 ('Better Gold Initiative', 0.8792310922513605),
 ('the Swiss Better Gold Association', 0.8233968514068277),
 ('Swiss Better Gold Association', 0.8202524676616129),
 ('Better Gold Initiative for Artisanal and Small-Scale Mining',
  0.7461792027860876),
 ('the Better Gold Initiative for Artisanal and Small-Scale Mining',
  0.744242548249837),
 ('Swiss Better Gold Association Continuous Improvement Escalator for Artisanal and Small Miners',
  0.7396867341654257),
 ('Artisanal and small-scale gold mining', 0.7064036530161446),
 ('the “2020 WORLD BANK CONFERENCE ON LAND AND POVERTY', 0.5667028325236986),
 ('The Ministry of Mines and Energy', 0.5574274285174624),
 ('the Ministry of Mines and Energy', 0.5574274285174624),
 ('Global Report on Artisanal & Small-Scale Mining', 0.5451009399591602),
 ('Ministry of Mines and Energy', 0.5411436300357771),
 ('The World Bank - Washington DC', 0.5175154411011574),
 ('United Nations Office for Drugs and Crime', 0.515126090537126)

This worked fairly well on our toy example to bring forward the most similar mentions of this entity. Note that we have two entangled entities, notably the Initiative and the Association. This may or may not be a problem. As with many machine learning tasks, the challenge will be identifying the hyperparameter (here, the similarity-score threshold) used to include or exclude candidate entities.
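The effect of the threshold can be illustrated with a handful of the 'Better Gold' scores shown above (rounded to three decimals). The helper `matches_above` is a hypothetical name used only for this sketch.

```python
# Scores copied (rounded) from the 'Better Gold' similarity output
scores = {
    "Better Gold Initiative": 0.879,
    "the Swiss Better Gold Association": 0.823,
    "Swiss Better Gold Association": 0.820,
    "Better Gold Initiative for Artisanal and Small-Scale Mining": 0.746,
    "Artisanal and small-scale gold mining": 0.706,
    "The World Bank - Washington DC": 0.518,
}


def matches_above(scores, threshold):
    """Names whose similarity score exceeds the threshold."""
    return [name for name, score in scores.items() if score > threshold]


for threshold in (0.9, 0.85, 0.8, 0.7):
    print(threshold, matches_above(scores, threshold))
```

A threshold of 0.9 keeps none of these candidates, 0.85 keeps only the Initiative, and 0.8 already pulls in the Association, which is the entanglement noted above.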

In [20]:
similarities = {}
for ent_key in entities:
    ent_key = nlp(ent_key)
    if(ent_key and ent_key.vector_norm):  # exclude non-conforming vectors
        candidates = []
        for ent_candidate in entities:
            ent_candidate = nlp(ent_candidate)
            if(ent_candidate and ent_candidate.vector_norm):  # exclude non-conforming vectors
                similarity = ent_key.similarity(ent_candidate)
                if similarity > 0.9 and ent_key.text != ent_candidate.text:
                    candidates.append((ent_candidate.text, similarity))
        if len(candidates) > 0:
            candidates.sort(key=lambda x:x[1], reverse=True)
            similarities[ent_key.text] = candidates
similarities

{'COLOMBIA': [('Colombia', 1.0000000159379296)],
 'Artisanal and small-scale gold mining': [('Better Gold Initiative for Artisanal and Small-Scale Mining',
   0.9483232288080448),
  ('the Better Gold Initiative for Artisanal and Small-Scale Mining',
   0.9320467970486239)],
 'Actividad Minera de Oro': [('Política de Formalización Minera de Oro',
   0.9425949943857939)],
 'The Mining Census': [('Mining Census', 0.9393546534433931)],
 'National Unified Formalization Plan': [('a National Unified Formalization Plan',
   0.9677123367101881)],
 'Colombia': [('COLOMBIA', 1.0000000159379296)],
 'Swiss Economic Cooperation and Development': [('Organization for Economic Cooperation and Development',
   0.9068504445207778)],
 'Swiss Better Gold Association': [('the Swiss Better Gold Association',
   0.9740001133226993)],
 'The Ministry of Mines and Energy': [('the Ministry of Mines and Energy',
   0.9999999267586015),
  ('Ministry of Mines and Energy', 0.9863628439570623)],
 'United States': [('t

In [19]:
candidates = [('Yes', 1.0), ('No', .5), ('Maybe', .75)]
candidates.sort(key=lambda x:x[1], reverse=True)
candidates

## Tagging
We'll use `spacy`'s Entity Ruler to identify tags. The Entity Ruler is slightly better than the Phrase Matcher in this case because it adds the tags as entities (simplifying output) and handles overlaps, ensuring that an entity that includes part of a tag is captured only as an entity and not also as the overlapping tag (although there may be cases where keeping the overlap is desired). 

Tags are user-supplied; in this case, we'll read them from a CSV file where each row is a tag. 
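One way the tags could be turned into entity ruler patterns is sketched below, using the same `label`/`pattern`/`id` dictionary shape as the abbreviation patterns. The label `"TAG"`, the `tag-N` identifier scheme, the sample tag values, and the helper name `load_tag_patterns` are all assumptions made for illustration; an in-memory string stands in for the CSV file.

```python
import csv
import io


def load_tag_patterns(csv_file, label="TAG"):
    """Build one pattern dict per non-empty row of a one-column CSV."""
    patterns = []
    for row in csv.reader(csv_file):
        if not row or not row[0].strip():
            continue  # skip blank rows
        tag = row[0].strip()
        # hypothetical identifier scheme: sequential ids in file order
        patterns.append({"label": label, "pattern": tag,
                         "id": f"tag-{len(patterns) + 1}"})
    return patterns


# In-memory stand-in for a CSV file with one tag per row (sample values)
sample_csv = io.StringIO("land tenure\nartisanal mining\n\nformalization\n")
patterns = load_tag_patterns(sample_csv)
for pattern in patterns:
    print(pattern)
```

The resulting list could then be handed to an entity ruler's `add_patterns` method, in the same way the abbreviation patterns are added earlier in the notebook.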

In [12]:
# read in custom tags


## Summarizing
Text summarization can be either extractive (combine existing text) or abstractive (generate new text). We take an extractive approach, selecting the top sentences based on keyword prevalence. A number of algorithms are available on the web; the one below was borrowed from https://github.com/Jcharis/Natural-Language-Processing-Tutorials/blob/master/NLP_with_SpaCy/Text%20Summarization%20In%20SpaCy.ipynb , which provides a helpful explainer on how this works.

In [15]:
from heapq import nlargest

from spacy.lang.en.stop_words import STOP_WORDS

nlp.max_length = 1000000

def summarize(raw_docx):
    """Return an extractive summary built from the highest-scoring sentences."""
    docx = nlp(raw_docx)

    # Build word frequencies on lowercased tokens so that the lookup in the
    # scoring step below uses the same keys; skip stop words, punctuation,
    # and whitespace tokens
    word_frequencies = {}
    for token in docx:
        word = token.text.lower()
        if word in STOP_WORDS or token.is_punct or token.is_space:
            continue
        word_frequencies[word] = word_frequencies.get(word, 0) + 1

    # Nothing to score (empty text or only stop words)
    if not word_frequencies:
        return ''

    # Normalize by the most frequent word so that weights fall in (0, 1]
    maximum_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] = word_frequencies[word] / maximum_frequency

    # Score each sentence as the sum of the weights of its words
    sentence_scores = {}
    for sent in docx.sents:
        if len(sent.text.split(' ')) >= 200:  # exclude very large sentences
            continue
        for token in sent:
            word = token.text.lower()
            if word in word_frequencies:
                sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]

    # Keep the 7 highest-scoring sentences, stopping once the summary
    # reaches 1000 characters
    summary_sentences = nlargest(7, sentence_scores, key=sentence_scores.get)
    summary = ''
    for sent in summary_sentences:
        if len(summary) < 1000:
            summary += sent.text + ' '

    return summary

The summarizer can be fooled by text that includes many of the text's most common words but does not form a useful sentence (reference sections and keyword sections are common culprits). For that reason, we first strip away any malformed sentences (in this case, those that don't have both a noun and a verb). To improve on this, we might explore using `spacy`'s dependency parser to ensure that the sentences in the summary are well-formed.
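
The noun-and-verb check can be sketched as follows. This is an illustration under assumptions rather than the project's filter: the name `is_well_formed` is hypothetical, and counting `PROPN`/`PRON` as noun-like and `AUX` as verb-like is a choice made here. The function only reads each token's `pos_` attribute, so a small stand-in token type is used to keep the sketch runnable without a loaded model.

```python
from collections import namedtuple

# Stand-in for a spaCy token: only the coarse part-of-speech tag is needed
Token = namedtuple("Token", ["text", "pos_"])

NOUN_TAGS = {"NOUN", "PROPN", "PRON"}  # assumption: treat these as noun-like
VERB_TAGS = {"VERB", "AUX"}            # assumption: treat these as verb-like


def is_well_formed(sentence):
    """Return True if the sentence has at least one noun-like and one verb-like token."""
    tags = {token.pos_ for token in sentence}
    return bool(tags & NOUN_TAGS) and bool(tags & VERB_TAGS)


# A keyword list (no verb) and an ordinary sentence, tagged by hand for illustration
keywords_line = [Token("mining", "NOUN"), Token(",", "PUNCT"), Token("gold", "NOUN")]
real_sentence = [Token("Miners", "NOUN"), Token("extract", "VERB"), Token("gold", "NOUN")]

kept = [s for s in (keywords_line, real_sentence) if is_well_formed(s)]
print([" ".join(t.text for t in s) for s in kept])  # ['Miners extract gold']
```

With a parsed `spacy` document, the same function could be applied as `[sent for sent in doc.sents if is_well_formed(sent)]` before the text is passed to the summarizer.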

# Next Steps
See here for a script to train a `KnowledgeBase` from Wikidata and Wikipedia:
https://github.com/explosion/projects/tree/master/nel-wikipedia
and here for a script to train an NER model from Wikidata (which might outperform the default NER model, which is trained on the OntoNotes 5 corpus, since we're linking back to a Wikidata/Wikipedia KB):
https://github.com/explosion/projects/tree/v3/pipelines/ner_wikiner

Instructions for running both are here:
https://github.com/explosion/projects
and here:
https://spacy.io/usage/projects

## Entity Linker
An `EntityLinker` component disambiguates textual mentions (tagged as named entities) to unique identifiers, grounding the named entities into the “real world”. It requires a `KnowledgeBase`, as well as a function to generate plausible candidates from that `KnowledgeBase` given a certain textual mention, and a machine learning model to pick the right candidate, given the local context of the mention ([spacy documentation](https://spacy.io/api/entitylinker)).

The code below sets up an Entity Linker in the nlp pipeline using the default configuration. The `get_candidates` parameter generates plausible candidates for a given `Span` object, and the default provided here simply matches an exact, case-dependent alias in the KB.
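
To make "exact, case-dependent alias" concrete, here is a plain-Python toy. It is not `spacy`'s implementation; the function name `exact_alias_candidates` and the dictionary-based alias table are hypothetical, and the alias and entity IDs are the ones used in the KnowledgeBase cell further down.

```python
def exact_alias_candidates(alias_table, mention):
    """Return the entity IDs whose alias equals the mention exactly (case-sensitive)."""
    return alias_table.get(mention, [])


# Toy alias table: alias text -> candidate entity IDs
alias_table = {"Douglas": ["Q42", "Q463035"]}

print(exact_alias_candidates(alias_table, "Douglas"))        # ['Q42', 'Q463035']
print(exact_alias_candidates(alias_table, "douglas"))        # []
print(exact_alias_candidates(alias_table, "Douglas Adams"))  # []
```

The practical consequence is that variants such as `COLOMBIA` and `Colombia` would not retrieve the same candidates unless each variant is registered as an alias or mentions are normalized first.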

In [19]:
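from spacy.pipeline.entity_linker import DEFAULT_NEL_MODEL

config = {
   "labels_discard": [],
   "incl_prior": True,
   "incl_context": True,
   "model": DEFAULT_NEL_MODEL,
   "entity_vector_length": 64,
   "get_candidates": {'@misc': 'spacy.CandidateGenerator.v1'},
}
# Add the component only once so that re-running this cell does not raise E007
if "entity_linker" not in nlp.pipe_names:
    nlp.add_pipe("entity_linker", config=config)

Calling `nlp.add_pipe` again for a component that is already in the pipeline (for example, when the cell is re-run without the check) raises:

ValueError: [E007] 'entity_linker' already exists in pipeline. Existing names: ['tok2vec', 'tagger', 'parser', 'senter', 'abbreviation_detector', 'ruler_abrv', 'ner', 'attribute_ruler', 'lemmatizer', 'entity_linker']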

## KnowledgeBase
To use the Entity Linker we just set up, we'll need to establish a `KnowledgeBase` and populate it with entities and their aliases ([spacy documentation](https://spacy.io/api/kb)).

In [20]:
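from spacy.kb import KnowledgeBase

vocab = nlp.vocab
kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)

# Placeholder vectors (all zeros) so that the cell runs; real entity vectors
# should encode each entity and must have length entity_vector_length
vector1 = [0.0] * 64
vector2 = [0.0] * 64

# Every entity referenced by an alias must be added to the KB first
kb.add_entity(entity="Q42", freq=32, entity_vector=vector1)
kb.add_entity(entity="Q463035", freq=1, entity_vector=vector2)  # freq is a placeholder value
kb.add_alias(alias="Douglas", entities=["Q42", "Q463035"], probabilities=[0.6, 0.3])

Note that `vector1` must be defined before the call to `kb.add_entity`; otherwise the cell raises:

NameError: name 'vector1' is not defined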

See here for a discussion on defining vectors for each entity:
https://stackoverflow.com/questions/59882071/spacy-entity-linking-word-vectors