# Named Entity Recognition without labelled data: A weak supervision approach

In this walkthrough, we will develop a neural NER model without access to labelled data. 

**Important note**: some of the data/model files used in this walkthrough are too big to be put on the GitHub repository, but are accessible for download [here](https://github.com/NorskRegnesentral/skweak/releases/tag/0.2.8).

Let's look at a particular example of text (from Reuters):


In [185]:
pip install skweak

[0m

In [186]:
pip install -U pip setuptools wheel

[0m

In [187]:
pip install -U spacy

[0m

In [188]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.9/13.9 MB[0m [31m39.1 MB/s[0m eta [36m0:00:00[0m
[0m[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [189]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl (45.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.7/45.7 MB[0m [31m9.7 MB/s[0m eta [36m0:00:00[0m
[0m[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_md')


In [190]:
import sys
sys.path.insert(0, '../..')


In [191]:
import re
news_text  = """
ATLANTA  (Reuters) - Retailer Best Buy Co, seeking new ways to appeal to cost-conscious shoppers, said on Tuesday it is selling refurbished 
 versions of Apple Inc's iPhone 3G at its stores that are priced about $50 less than new iPhones. 
 The electronics chain said the used iPhones, which were returned within 30 days of purchase, are priced at $149 for the model with 8 gigabytes of storage, 
 while the 16-gigabyte version is $249. A two-year service contract with AT&T Inc is required. New iPhone 3Gs currently sell for $199 and $299 at 
 Best Buy Mobile stores. "This is focusing on customers' needs, trying to provide as wide a range of products and networks for our consumers," said 
 Scott Moore, vice president of marketing for Best Buy Mobile. Buyers of first-generation iPhones can also upgrade to the faster refurbished 3G models at 
 Best Buy, he said. Moore said AT&T, the exclusive wireless provider for the iPhone, offers refurbished iPhones online. The sale of used iPhones comes as 
 Best Buy, the top consumer electronics chain, seeks ways to fend off increased competition from discounters such as Wal-Mart Stores Inc, which began 
 selling the popular phone late last month. Wal-Mart sells a new 8-gigabyte iPhone 3G for $197 and $297 for the 16-gigabyte model. The iPhone is also 
 sold at Apple stores and AT&T stores. Moore said Best Buy's move was not in response to other retailers' actions. (Reporting by  Karen Jacobs ; Editing 
 by  Andre Grenon )"""

news_text = re.sub('\\s+', ' ', news_text)

Let's view what the standard Spacy model produces (note it takes a few seconds to reload the vocabulary):

In [192]:
import spacy, skweak

# We load the spacy model
nlp = spacy.load("en_core_web_sm")
doc = nlp(news_text)

# Visualising the entities
skweak.utils.display_entities(doc)

The medium-size model works better, but still contains quite a few errors and omissions:

In [193]:

# We load the spacy model (takes a few seconds)
nlp = spacy.load("en_core_web_md")
doc = nlp(news_text)

# Visualising the entities
skweak.utils.display_entities(doc)

Ideally, one would wish to train a better named entity recognition model, which is better tailored to the specific needs and linguistic patterns found in those articles. However, although we have large amounts of raw text data, we often do not have text data labelled with named entities for our domain. We therefore worked on an alternative approach based on __weak supervision__, combining several (noisy) supervision sources instead of a single "gold standard". 

Indeed, we do have access to several possible supervision sources, such as alternative NER models trained on other corpora, large lists of entity (companies, person names, geographical locations), shallow linguistic patterns, and document-level constraints. 

The key idea behind the proposed approach is thus to (1) use these supervision sources to automatically annotate news corpora, (2) estimate a label model (more precisely an HMM model) that unifies all these sources into a single one, and (3) learn a new NER model based on these unified labels. <br>

__Outline of this notebook__: I describe below the various annotation schemes that I developed.  I then explain how these various sources can be merged into a single source using the `skweak` framework. Finally, I detail the architecture behind the NER model.

## __Step 1:__ Annotations

### 1) Annotators from other Spacy models

A first source of automatic annotation comes from NER models trained on multiple, distinct corpora. I went through [available NE-labelled corpora](https://github.com/juand-r/entity-recognition-datasets) to search for datasets that could be used to train alternative models. I then trained Spacy models for all of them, and then conducted some experiments to assess their performance. At the end of the process, I ended up with four models:
- The standard Spacy model for English (`en_core_web_md`), trained on Ontonotes v5
- A model trained on [ConLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/)
- A model trained on the [Broad Twitter Corpus](https://github.com/GateNLP/broad_twitter_corpus)
- A model trained on a corpus of [SEC filings](https://www.aclweb.org/anthology/U15-1010/).
    
Note there are differences between the entity labels of these models: while Ontonotes contains no less than [18 classes](https://spacy.io/api/annotation#named-entities), the other corpora only contain `PER(SON)`, `ORG`, `LOC` and `MISC`. Furthermore, the labels also do not match each other perfectly: while Ontonotes distinguishes between geopolitical locations (`GPE`) and "natural" locations (such as continents, seas etc., labelled as `LOC`), the three other models regroup all geographical entities as `LOC`. 

We can apply annotations from a Spacy model using the `ModelAnnotator` class.

In [194]:
pip install datasets

[0m

As we can see, the results are not perfect on this model either, but the errors are distinct from the ones made by the Ontonotes model. 

The annotations are written in the `spans` of the Spacy document:

Each `ModelAnnotator` adds two annotation sources: one that is directly based on the Spacy Model (here `conll2003`), and one that also includes the corrections specified in the method `_correct_entities` (in `spacy_wrapper.py`) that we implemented earlier this year.  The corrected version are indicated with a `+c` suffix.

Here are the results from the three other models:

In [195]:
core_web_annotator = skweak.spacy.ModelAnnotator("core_web_md", "en_core_web_md")

doc = core_web_annotator(doc)
skweak.utils.display_entities(doc, "core_web_md")


__Note__: When annotating large collections of news documents, the method `annotator.pipe(news_docs)` is much more efficient than calling `annotate(...)` every single time, as it batches the documents on which to run the NER model.

### 2) Annotators from gazetteers

Another useful source of annotation comes from large lists of entities such as persons, places and organisations. The gazetteers are using a _trie_ to efficiently search for occurrences in the text. Gazetteers can be run in two modes: case-sensitive or case-insentitive.


#### 2.1) Wikipedia
The database from Wikipedia is extracted from the [NECKar](https://event.ifi.uni-heidelberg.de/?page_id=532) dataset.  The postprocessing (which, among others, filters out entities that are also relatively common English words) is implemented in `compile_wikidata` as [WIKIDATA](https://github.com/NorskRegnesentral/skweak/releases/download/v0.2/wikidata_tokenised.json.gz). In addition, I also extracted from Wikidata a list of commercial products and added them to the gazetteer. 

In [196]:


tries = skweak.gazetteers.extract_json_data("wikidata_small_tokenised.json.gz")
annotator = skweak.gazetteers.GazetteerAnnotator("wiki", tries)

annotator(doc)
skweak.utils.display_entities(doc, "wiki")

Extracting data from wikidata_small_tokenised.json.gz
Populating trie for class PERSON (number: 1863434)
Populating trie for class LOC (number: 14241)
Populating trie for class GPE (number: 273373)
Populating trie for class ORG (number: 91341)
Populating trie for class PRODUCT (number: 12457)


Again, the annotation model does make some errors: `Moore` is thought to be a [geopolitical entity](https://en.wikipedia.org/wiki/Moore) instead of a person. Note that `AT&T` has two alternative labels: `ORG` or `GPE` (see [AT&T station](https://en.wikipedia.org/wiki/AT%26T_(SEPTA_station))). The data is available as [WIKI_SMALL](https://github.com/NorskRegnesentral/skweak/raw/main/data/wikidata_small_tokenised.json.gz).

In addition to the full wiki data, I also added a specific gazetteer that only employs wikidata objects containing a text description:

In [197]:
tries = skweak.gazetteers.extract_json_data("wikidata_small_tokenised.json.gz")
annotator = skweak.gazetteers.GazetteerAnnotator("wikismall_cased", tries)
annotator2 = skweak.gazetteers.GazetteerAnnotator("wikismall_uncased", tries, case_sensitive=False)

annotator2(annotator(doc))
skweak.utils.display_entities(doc, "wikismall_cased")
print()
skweak.utils.display_entities(doc, "wikismall_uncased")


Extracting data from wikidata_small_tokenised.json.gz
Populating trie for class PERSON (number: 1863434)
Populating trie for class LOC (number: 14241)
Populating trie for class GPE (number: 273373)
Populating trie for class ORG (number: 91341)
Populating trie for class PRODUCT (number: 12457)





As we can see, the "cased" gazetteers have a higher precision than the uncased gazetteers (at a cost of lower coverage).

#### 2.2 Crunchbase

The second gazetteer [Crunchbase](https://github.com/NorskRegnesentral/skweak/raw/main/data/crunchbase_companies.json.gz) is extracted from the [Open Data Map from Crunchbase](https://data.crunchbase.com/docs/open-data-map), which contains lists of both organisations and (business) persons.

In [198]:
tries = skweak.gazetteers.extract_json_data("crunchbase_companies.json.gz",  spacy_model="en_core_web_sm")
annotator = skweak.gazetteers.GazetteerAnnotator("crunchbase_cased", tries)
annotator2 = skweak.gazetteers.GazetteerAnnotator("crunchbase_uncased", tries)

annotator2(annotator(doc))
skweak.utils.display_entities(doc, ["crunchbase_cased", "crunchbase_uncased"])

Extracting data from crunchbase_companies.json.gz
Populating trie for class COMPANY (number: 539174)


#### 2.3 Geonames

The [geonames](http:www.geonames.org) database [GEO_NAMES](https://github.com/NorskRegnesentral/skweak/blob/main/data/geonames.json) contains a large list of locations, including both geopolitical entities and "natural" locations:

In [199]:
tries = skweak.gazetteers.extract_json_data("geonames.json",  spacy_model="en_core_web_sm")
annotator = skweak.gazetteers.GazetteerAnnotator("geo_cased", tries)
annotator2 = skweak.gazetteers.GazetteerAnnotator("geo_uncased", tries, case_sensitive=False)

annotator2(annotator(doc))
skweak.utils.display_entities(doc, ["geo_cased", "geo_uncased"])

Extracting data from geonames.json
Populating trie for class GPE (number: 15205)


#### 2.4 Product names

Finally, I used [DBPedia](http://www.dbpedia.org) to extract a list of products and brands as [Products](https://github.com/NorskRegnesentral/skweak/blob/main/data/products.json), since the recognition of products is particularly poor in Spacy NER models:

In [200]:

tries = skweak.gazetteers.extract_json_data("products.json",  spacy_model="en_core_web_sm")
annotator = skweak.gazetteers.GazetteerAnnotator("products_cased", tries)
annotator2 = skweak.gazetteers.GazetteerAnnotator("products_uncased", tries)

annotator2(annotator(doc))
skweak.utils.display_entities(doc, ["products_cased", "products_uncased"])

Extracting data from products.json
Populating trie for class PRODUCT (number: 45362)


### 3. Shallow patterns

Some named entities can also be captured through relatively simple, handcrafted patterns defined on the Spacy document. The class `FunctionAnnotator` makes it easy to define an annotator based on a function that takes a Spacy document as input and generate text spans with a label. Relations of mutual exclusivity between annotation sources can also be specified in the annotator. For instance, we can specify that numbers that are part of a date, time or money span should be ignored from the "number_detector" (to avoid having e.g. the `21` in `October 21` labelled as a `CARDINAL`): 

In [201]:

"""Class containing some generic entity names (in English)"""

# List of currency symbols and three-letter codes
CURRENCY_SYMBOLS = {"$", "¥", "£", "€", "kr", "₽", "R$", "₹", "Rp", "₪", "zł", "Rs", "₺", "RS"}

CURRENCY_CODES = {"USD", "EUR", "CNY", "JPY", "GBP", "NOK", "DKK", "CAD", "RUB", "MXN", "ARS", "BGN",
                  "BRL", "CHF", "CLP", "CZK", "INR", "IDR", "ILS", "IRR", "IQD", "KRW", "KZT", "NGN",
                  "QAR", "SEK", "SYP", "TRY", "UAH", "AED", "AUD", "COP", "MYR", "SGD", "NZD", "THB",
                  "HUF", "HKD", "ZAR", "PHP", "KES", "EGP", "PKR", "PLN", "XAU", "VND", "GBX"}

# sets of tokens used for the shallow patterns
MONTHS = {"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November",
          "December"}
MONTHS_ABBRV = {"Jan.", "Feb.", "Mar.", "Apr.", "May.", "Jun.", "Jul.", "Aug.", "Sep.", "Sept.", "Oct.", "Nov.", "Dec."}
DAYS = {"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"}
DAYS_ABBRV = {"Mon.", "Tu.", "Tue.", "Tues.", "Wed.", "Th.", "Thu.", "Thur.", "Thurs.", "Fri.", "Sat.", "Sun."}
MAGNITUDES = {"million", "billion", "mln", "bln", "bn", "thousand", "m", "k", "b", "m.", "k.", "b.", "mln.", "bln.",
              "bn."}
UNITS = {"tons", "tonnes", "barrels", "m", "km", "miles", "kph", "mph", "kg", "°C", "dB", "ft", "gal", "gallons", "g",
         "kW", "s", "oz",
         "m2", "km2", "yards", "W", "kW", "kWh", "kWh/yr", "Gb", "MW", "kilometers", "meters", "liters", "litres", "g",
         "grams", "tons/yr",
         'pounds', 'cubits', 'degrees', 'ton', 'kilograms', 'inches', 'inch', 'megawatts', 'metres', 'feet', 'ounces',
         'watts', 'megabytes',
         'gigabytes', 'terabytes', 'hectares', 'centimeters', 'millimeters', "F", "Celsius"}
ORDINALS = ({"first, second, third", "fourth", "fifth", "sixth", "seventh"} |
            {"%i1st" % i for i in range(100)} | {"%i2nd" % i for i in range(100)} | {"%ith" % i for i in range(1000)})
ROMAN_NUMERALS = {'I', 'II', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI',
                  'XVII',
                  'XVIII', 'XIX', 'XX', 'XXI', 'XXII', 'XXIII', 'XXIV', 'XXV', 'XXVI', 'XXVII', 'XXVIII', 'XXIX', 'XXX'}

# Full list of country names
COUNTRIES = {'Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola', 'Antigua', 'Argentina', 'Armenia', 'Australia',
             'Austria',
             'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin',
             'Bhutan',
             'Bolivia', 'Bosnia Herzegovina', 'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Burkina', 'Burundi',
             'Cambodia', 'Cameroon',
             'Canada', 'Cape Verde', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros',
             'Congo', 'Costa Rica',
             'Croatia', 'Cuba', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic',
             'East Timor',
             'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Fiji',
             'Finland', 'France',
             'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada', 'Guatemala', 'Guinea',
             'Guinea-Bissau', 'Guyana',
             'Haiti', 'Honduras', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel',
             'Italy', 'Ivory Coast',
             'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea North', 'Korea South', 'Kosovo',
             'Kuwait', 'Kyrgyzstan',
             'Laos', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg',
             'Macedonia', 'Madagascar',
             'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Mauritania', 'Mauritius', 'Mexico',
             'Micronesia',
             'Moldova', 'Monaco', 'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru',
             'Nepal', 'Netherlands',
             'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Panama',
             'Papua New Guinea',
             'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Qatar', 'Romania', 'Russian Federation',
             'Rwanda', 'St Kitts & Nevis',
             'St Lucia', 'Saint Vincent & the Grenadines', 'Samoa', 'San Marino', 'Sao Tome & Principe', 'Saudi Arabia',
             'Senegal', 'Serbia',
             'Seychelles', 'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia',
             'South Africa', 'South Sudan',
             'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan',
             'Tajikistan', 'Tanzania',
             'Thailand', 'Togo', 'Tonga', 'Trinidad & Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Tuvalu', 'Uganda',
             'Ukraine',
             'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay', 'Uzbekistan', 'Vanuatu',
             'Vatican City', 'Venezuela',
             'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe', "USA", "UK", "Russia", "South Korea"}

# Natialities, religious and political groups
NORPS = {'Afghan', 'African', 'Albanian', 'Algerian', 'American', 'Andorran', 'Anglican', 'Angolan', 'Arab', 'Aramean',
         'Argentine', 'Armenian',
         'Asian', 'Australian', 'Austrian', 'Azerbaijani', 'Bahamian', 'Bahraini', 'Baklan', 'Bangladeshi', 'Batswana',
         'Belarusian', 'Belgian',
         'Belizean', 'Beninese', 'Bermudian', 'Bhutanese', 'Bolivian', 'Bosnian', 'Brazilian', 'British', 'Bruneian',
         'Buddhist',
         'Bulgarian', 'Burkinabe', 'Burmese', 'Burundian', 'Californian', 'Cambodian', 'Cameroonian', 'Canadian',
         'Cape Verdian', 'Catholic', 'Caymanian',
         'Central African', 'Central American', 'Chadian', 'Chilean', 'Chinese', 'Christian', 'Christian-Democrat',
         'Christian-Democratic',
         'Colombian', 'Communist', 'Comoran', 'Congolese', 'Conservative', 'Costa Rican', 'Croat', 'Cuban', 'Cypriot',
         'Czech', 'Dane', 'Danish',
         'Democrat', 'Democratic', 'Djibouti', 'Dominican', 'Dutch', 'East European', 'Ecuadorean', 'Egyptian',
         'Emirati', 'English', 'Equatoguinean',
         'Equatorial Guinean', 'Eritrean', 'Estonian', 'Ethiopian', 'Eurasian', 'European', 'Fijian', 'Filipino',
         'Finn', 'Finnish', 'French',
         'Gabonese', 'Gambian', 'Georgian', 'German', 'Germanic', 'Ghanaian', 'Greek', 'Greenlander', 'Grenadan',
         'Grenadian', 'Guadeloupean', 'Guatemalan',
         'Guinea-Bissauan', 'Guinean', 'Guyanese', 'Haitian', 'Hawaiian', 'Hindu', 'Hinduist', 'Hispanic', 'Honduran',
         'Hungarian', 'Icelander', 'Indian',
         'Indonesian', 'Iranian', 'Iraqi', 'Irish', 'Islamic', 'Islamist', 'Israeli', 'Israelite', 'Italian', 'Ivorian',
         'Jain', 'Jamaican', 'Japanese',
         'Jew', 'Jewish', 'Jordanian', 'Kazakhstani', 'Kenyan', 'Kirghiz', 'Korean', 'Kurd', 'Kurdish', 'Kuwaiti',
         'Kyrgyz', 'Labour', 'Latin',
         'Latin American', 'Latvian', 'Lebanese', 'Liberal', 'Liberian', 'Libyan', 'Liechtensteiner', 'Lithuanian',
         'Londoner', 'Luxembourger',
         'Macedonian', 'Malagasy', 'Malawian', 'Malaysian', 'Maldivan', 'Malian', 'Maltese', 'Manxman', 'Marshallese',
         'Martinican', 'Martiniquais',
         'Marxist', 'Mauritanian', 'Mauritian', 'Mexican', 'Micronesian', 'Moldovan', 'Mongolian', 'Montenegrin',
         'Montserratian', 'Moroccan',
         'Motswana', 'Mozambican', 'Muslim', 'Myanmarese', 'Namibian', 'Nationalist', 'Nazi', 'Nauruan', 'Nepalese',
         'Netherlander', 'New Yorker',
         'New Zealander', 'Nicaraguan', 'Nigerian', 'Nordic', 'North American', 'North Korean', 'Norwegian', 'Orthodox',
         'Pakistani', 'Palauan',
         'Palestinian', 'Panamanian', 'Papua New Guinean', 'Paraguayan', 'Parisian', 'Peruvian', 'Philistine', 'Pole',
         'Polish', 'Portuguese',
         'Protestant', 'Puerto Rican', 'Qatari', 'Republican', 'Roman', 'Romanian', 'Russian', 'Rwandan',
         'Saint Helenian', 'Saint Lucian',
         'Saint Vincentian', 'Salvadoran', 'Sammarinese', 'Samoan', 'San Marinese', 'Sao Tomean', 'Saudi',
         'Saudi Arabian', 'Scandinavian', 'Scottish',
         'Senegalese', 'Serb', 'Serbian', 'Shia', 'Shiite', 'Sierra Leonean', 'Sikh', 'Singaporean', 'Slovak',
         'Slovene', 'Social-Democrat', 'Socialist',
         'Somali', 'South African', 'South American', 'South Korean', 'Soviet', 'Spaniard', 'Spanish', 'Sri Lankan',
         'Sudanese', 'Sunni',
         'Surinamer', 'Swazi', 'Swede', 'Swedish', 'Swiss', 'Syrian', 'Taiwanese', 'Tajik', 'Tanzanian', 'Taoist',
         'Texan', 'Thai', 'Tibetan',
         'Tobagonian', 'Togolese', 'Tongan', 'Tunisian', 'Turk', 'Turkish', 'Turkmen(s)', 'Tuvaluan', 'Ugandan',
         'Ukrainian', 'Uruguayan', 'Uzbek',
         'Uzbekistani', 'Venezuelan', 'Vietnamese', 'Vincentian', 'Virgin Islander', 'Welsh', 'West European',
         'Western', 'Yemeni', 'Yemenite',
         'Yugoslav', 'Zambian', 'Zimbabwean', 'Zionist'}

# Facilities
FACILITIES = {"Palace", "Temple", "Gate", "Museum", "Bridge", "Road", "Airport", "Hospital", "School", "Tower",
              "Station", "Avenue",
              "Prison", "Building", "Plant", "Shopping Center", "Shopping Centre", "Mall", "Church", "Synagogue",
              "Mosque", "Harbor", "Harbour",
              "Rail", "Railway", "Metro", "Tram", "Highway", "Tunnel", 'House', 'Field', 'Hall', 'Place', 'Freeway',
              'Wall', 'Square', 'Park',
              'Hotel'}

# Legal documents
LEGAL = {"Law", "Agreement", "Act", 'Bill', "Constitution", "Directive", "Treaty", "Code", "Reform", "Convention",
         "Resolution", "Regulation",
         "Amendment", "Customs", "Protocol", "Charter"}

# event names
EVENTS = {"War", "Festival", "Show", "Massacre", "Battle", "Revolution", "Olympics", "Games", "Cup", "Week", "Day",
          "Year", "Series"}

# Names of languages
LANGUAGES = {'Afar', 'Abkhazian', 'Avestan', 'Afrikaans', 'Akan', 'Amharic', 'Aragonese', 'Arabic', 'Aramaic',
             'Assamese', 'Avaric', 'Aymara',
             'Azerbaijani', 'Bashkir', 'Belarusian', 'Bulgarian', 'Bambara', 'Bislama', 'Bengali', 'Tibetan', 'Breton',
             'Bosnian', 'Cantonese',
             'Catalan', 'Chechen', 'Chamorro', 'Corsican', 'Cree', 'Czech', 'Chuvash', 'Welsh', 'Danish', 'German',
             'Divehi', 'Dzongkha', 'Ewe',
             'Greek', 'English', 'Esperanto', 'Spanish', 'Castilian', 'Estonian', 'Basque', 'Persian', 'Fulah',
             'Filipino', 'Finnish', 'Fijian', 'Faroese',
             'French', 'Western Frisian', 'Irish', 'Gaelic', 'Galician', 'Guarani', 'Gujarati', 'Manx', 'Hausa',
             'Hebrew', 'Hindi', 'Hiri Motu',
             'Croatian', 'Haitian', 'Hungarian', 'Armenian', 'Herero', 'Indonesian', 'Igbo', 'Inupiaq', 'Ido',
             'Icelandic', 'Italian', 'Inuktitut',
             'Japanese', 'Javanese', 'Georgian', 'Kongo', 'Kikuyu', 'Kuanyama', 'Kazakh', 'Kalaallisut', 'Greenlandic',
             'Central Khmer', 'Kannada',
             'Korean', 'Kanuri', 'Kashmiri', 'Kurdish', 'Komi', 'Cornish', 'Kirghiz', 'Latin', 'Luxembourgish', 'Ganda',
             'Limburgish', 'Lingala', 'Lao',
             'Lithuanian', 'Luba-Katanga', 'Latvian', 'Malagasy', 'Marshallese', 'Maori', 'Macedonian', 'Malayalam',
             'Mongolian', 'Marathi', 'Malay',
             'Maltese', 'Burmese', 'Nauru', 'Bokmål', 'Norwegian', 'Ndebele', 'Nepali', 'Ndonga', 'Dutch', 'Flemish',
             'Nynorsk', 'Navajo', 'Chichewa',
             'Occitan', 'Ojibwa', 'Oromo', 'Oriya', 'Ossetian', 'Punjabi', 'Pali', 'Polish', 'Pashto', 'Portuguese',
             'Quechua', 'Romansh', 'Rundi',
             'Romanian', 'Russian', 'Kinyarwanda', 'Sanskrit', 'Sardinian', 'Sindhi', 'Sami', 'Sango', 'Sinhalese',
             'Slovak', 'Slovenian', 'Samoan',
             'Shona', 'Somali', 'Albanian', 'Serbian', 'Swati', 'Sotho', 'Sundanese', 'Swedish', 'Swahili', 'Tamil',
             'Telugu', 'Tajik', 'Thai',
             'Tigrinya', 'Turkmen', 'Taiwanese', 'Tagalog', 'Tswana', 'Tonga', 'Turkish', 'Tsonga', 'Tatar', 'Twi',
             'Tahitian', 'Uighur', 'Ukrainian',
             'Urdu', 'Uzbek', 'Venda', 'Vietnamese', 'Volapük', 'Walloon', 'Wolof', 'Xhosa', 'Yiddish', 'Yoruba',
             'Zhuang', 'Mandarin',
             'Mandarin Chinese', 'Chinese', 'Zulu'}

LEGAL_SUFFIXES = {
    'ltd',  # Limited ~13.000
    'llc',  # limited liability company (UK)
    'ltda',  # limitada (Brazil, Portugal)
    'inc',  # Incorporated ~9700
    'co ltd',  # Company Limited ~9200
    'corp',  # Corporation ~5200
    'sa',  # Spółka Akcyjna (Poland), Société Anonyme (France)  ~3200
    'plc',  # Public Limited Company (Great Britain) ~2100
    'ag',  # Aktiengesellschaft (Germany) ~1000
    'gmbh',  # Gesellschaft mit beschränkter Haftung  (Germany)
    'bhd',  # Berhad (Malaysia) ~900
    'jsc',  # Joint Stock Company (Russia) ~900
    'co',  # Corporation/Company ~900
    'ab',  # Aktiebolag (Sweden) ~800
    'ad',  # Akcionarsko Društvo (Serbia), Aktsionerno Drujestvo (Bulgaria) ~600
    'tbk',  # Terbuka (Indonesia) ~500
    'as',  # Anonim Şirket (Turkey), Aksjeselskap (Norway) ~500
    'pjsc',  # Public Joint Stock Company (Russia, Ukraine) ~400
    'spa',  # Società Per Azioni (Italy) ~300
    'nv',  # Naamloze vennootschap (Netherlands, Belgium) ~230
    'dd',  # Dioničko Društvo (Croatia) ~220
    'a s',  # a/s (Denmark), a.s (Slovakia) ~210
    'oao',  # Открытое акционерное общество (Russia) ~190
    'asa',  # Allmennaksjeselskap (Norway) ~160
    'ojsc',  # Open Joint Stock Company (Russia) ~160
    'lp',  # Limited Partnership (US) ~140
    'llp',  # limited liability partnership
    'oyj',  # julkinen osakeyhtiö (Finland) ~120
    'de cv',  # Capital Variable (Mexico) ~120
    'se',  # Societas Europaea (Germany) ~100
    'kk',  # kabushiki gaisha (Japan)
    'aps',  # Anpartsselskab (Denmark)
    'cv',  # commanditaire vennootschap (Netherlands)
    'sas',  # société par actions simplifiée (France)
    'sro',  # Spoločnosť s ručením obmedzeným (Slovakia)
    'oy',  # Osakeyhtiö (Finland)
    'kg',  # Kommanditgesellschaft (Germany)
    'bv',  # Besloten Vennootschap (Netherlands)
    'sarl',  # société à responsabilité limitée (France)
    'srl',  # Società a responsabilità limitata (Italy)
    'sl'  # Sociedad Limitada (Spain)
}
# Generic words that may appear in official company names but are sometimes skipped when mentioned in news articles (e.g. Nordea Bank -> Nordea)
GENERIC_TOKENS = {"International", "Group", "Solutions", "Technologies", "Management", "Association", "Associates",
                  "Partners",
                  "Systems", "Holdings", "Services", "Bank", "Fund", "Stiftung", "Company"}

# List of tokens that are typically lowercase even when they occur in capitalised segments (e.g. International Council of Shopping Centers)
LOWERCASED_TOKENS = {"'s", "-", "a", "an", "the", "at", "by", "for", "in", "of", "on", "to", "up", "and"}

# Prefixes to family names that are often in lowercase
NAME_PREFIXES = {"-", "von", "van", "de", "di", "le", "la", "het", "'t'", "dem", "der", "den", "d'", "ter"}

In [202]:
pip install snips_nlu_parsers

[0m

In [203]:
from skweak import utils
def date_generator(doc):
    """Searches for occurrences of date patterns in text"""

    spans = []

    i = 0
    while i < len(doc):
        tok = doc[i]
        if tok.lemma_ in DAYS | DAYS_ABBRV:
            spans.append((i, i + 1, "DATE"))
        elif tok.is_digit and re.match("\\d+$", tok.text) and int(tok.text) > 1920 and int(tok.text) < 2040:
            spans.append((i, i + 1, "DATE"))
        elif tok.lemma_ in MONTHS | MONTHS_ABBRV:
            if tok.tag_ == "MD":  # Skipping "May" used as auxiliary
                pass
            elif i > 0 and re.match("\\d+$", doc[i - 1].text) and int(doc[i - 1].text) < 32:
                spans.append((i - 1, i + 1, "DATE"))
            elif i > 1 and re.match("\\d+(?:st|nd|rd|th)$", doc[i - 2].text) and doc[i - 1].lower_ == "of":
                spans.append((i - 2, i + 1, "DATE"))
            elif i < len(doc) - 1 and re.match("\\d+$", doc[i + 1].text) and int(doc[i + 1].text) < 32:
                spans.append((i, i + 2, "DATE"))
                i += 1
            else:
                spans.append((i, i + 1, "DATE"))
        i += 1

    for start, end, content in utils.merge_contiguous_spans(spans, doc):
        yield start, end, content


def time_generator(doc):
    """Searches for occurrences of time patterns in text"""

    i = 0
    while i < len(doc):
        tok = doc[i]

        if (i < len(doc) - 1 and tok.text[0].isdigit() and
                doc[i + 1].lower_ in {"am", "pm", "a.m.", "p.m.", "am.", "pm."}):
            yield i, i + 2, "TIME"
            i += 1
        elif tok.text[0].isdigit() and re.match("\\d{1,2}\\:\\d{1,2}", tok.text):
            yield i, i + 1, "TIME"
            i += 1
        i += 1


def money_generator(doc):
    """Searches for occurrences of money patterns in text"""

    i = 0
    while i < len(doc):
        tok = doc[i]
        if tok.text[0].isdigit():
            j = i + 1
            while (j < len(doc) and (doc[j].text[0].isdigit() or doc[j].norm_ in MAGNITUDES)):
                j += 1

            found_symbol = False
            if i > 0 and doc[i - 1].text in (CURRENCY_CODES | CURRENCY_SYMBOLS):
                i = i - 1
                found_symbol = True
            if (j < len(doc) and doc[j].text in
                    (CURRENCY_CODES | CURRENCY_SYMBOLS | {"euros", "cents", "rubles"})):
                j += 1
                found_symbol = True

            if found_symbol:
                yield i, j, "MONEY"
            i = j
        else:
            i += 1


def number_generator(doc):
    """Searches for occurrences of number patterns (cardinal, ordinal, quantity or percent) in text"""

    i = 0
    while i < len(doc):
        tok = doc[i]

        if tok.lower_ in ORDINALS:
            yield i, i + 1, "ORDINAL"

        elif re.search("\\d", tok.text):
            j = i + 1
            while (j < len(doc) and (doc[j].norm_ in MAGNITUDES)):
                j += 1
            if j < len(doc) and doc[j].lower_.rstrip(".") in UNITS:
                j += 1
                yield i, j, "QUANTITY"
            elif j < len(doc) and doc[j].lower_ in ["%", "percent", "pc.", "pc", "pct", "pct.", "percents",
                                                    "percentage"]:
                j += 1
                yield i, j, "PERCENT"
            else:
                yield i, j, "CARDINAL"
            i = j - 1
        i += 1
def legal_generator(doc):
    legal_spans = []
    for span in utils.get_spans(doc, ["proper2_detector", "nnp_detector"]):
        if not utils.is_likely_proper(doc[span.end-1]):
            continue         
        last_token = doc[span.end-1].text.title().rstrip("s")
                  
        if last_token in LEGAL:     
            legal_spans.append((span.start,span.end, "LAW"))
                     
    
    # Handling legal references such as Article 5
    for i in range(len(doc) - 1):
        if doc[i].text.rstrip("s") in {"Article", "Paragraph", "Section", "Chapter", "§"}:
            if doc[i + 1].text[0].isdigit() or doc[i + 1].text in data_utils.ROMAN_NUMERALS:
                start, end = i, i + 2
                if (i < len(doc) - 3 and doc[i + 2].text in {"-", "to", "and"}
                        and (doc[i + 3].text[0].isdigit() or doc[i + 3].text in data_utils.ROMAN_NUMERALS)):
                    end = i + 4
                legal_spans.append((start, end, "LAW"))

    # Merge contiguous spans of legal references ("Article 5, Paragraph 3")
    legal_spans = utils.merge_contiguous_spans(legal_spans, doc)
    for start, end, label in legal_spans:
        yield start, end, label


def misc_generator(doc):
    """Detects occurrences of countries and various less-common entities (NORP, FAC, EVENT, LANG)"""
    
    spans = set(doc.spans["proper2_detector"])
    spans |= {doc[i:i+1] for i in range(len(doc))}
    
    for span in sorted(spans):

        span_text = span.text
        if span_text.isupper():
            span_text = span_text.title()
        last_token = doc[span.end-1].text

        if span_text in COUNTRIES:
            yield span.start, span.end, "GPE"

        if len(span) <= 3 and (span in NORPS or last_token in NORPS 
                               or last_token.rstrip("s") in NORPS):
            yield span.start, span.end, "NORP"
    
        if span in LANGUAGES and doc[span.start].tag_=="NNP":
            yield span.start, span.end, "LANGUAGE"
            
        if last_token in FACILITIES and len(span) > 1:
            yield span.start, span.end, "FAC"     

        if last_token in EVENTS  and len(span) > 1:
            yield span.start, span.end, "EVENT"     

from spacy.tokens import Span
import json
class FullNameDetector():
    """Search for occurrences of full person names (first name followed by at least one title token)"""

    def __init__(self):
        fd = open("firstnames.json")
        self.first_names = set(json.load(fd))
        fd.close()

    def __call__(self, span: Span) -> bool:
        # We assume full names are between 2 and 5 tokens
        if len(span) < 2 or len(span) > 5:
            return False

        return (span[0].text in self.first_names and
                span[-1].is_alpha and span[-1].is_title)
        
from skweak.base import CombinedAnnotator, SpanAnnotator
from spacy.tokens import Doc, Span
from typing import Iterable, Tuple
import snips_nlu_parsers
class SnipsAnnotator(SpanAnnotator):
    """Annotation using the Snips NLU entity parser. 
       You must install  "snips-nlu-parsers" (pip install snips-nlu-parsers) to make it work.
    """
    
    def __init__(self, name: str):
        """Initialise the annotation tool."""

        super(SnipsAnnotator, self).__init__(name)
        self.parser = snips_nlu_parsers.BuiltinEntityParser.build(language="en")

    def find_spans(self, doc: Doc) -> Iterable[Tuple[int, int, str]]:
        """Runs the parser on the spacy document, and convert the result to labels."""

        text = doc.text

        # The current version of Snips has a bug that makes it crash with some rare
        # Turkish characters, or mentions of "billion years"
        text = text.replace("’", "'").replace("”", "\"").replace("“", "\"").replace("—", "-")
        text = text.encode("iso-8859-15", "ignore").decode("iso-8859-15")
        text = re.sub("(\\d+) ([bm]illion(?: (?:\\d+|one|two|three|four|five|six|seven" +
                      "|eight|nine|ten))? years?)", "\\g<1>.0 \\g<2>", text)

        results = self.parser.parse(text)
        for result in results:
            span = doc.char_span(result["range"]["start"], result["range"]["end"])
            if span is None or span.text.lower() in {"now"} or span.text in {"may"}:
                continue
            label = None
            if (result["entity_kind"] == "snips/number" and span.text.lower() not in
                    {"one", "some", "few", "many", "several"}):
                label = "CARDINAL"
            elif (result["entity_kind"] == "snips/ordinal" and span.text.lower() not in
                  {"first", "second", "the first", "the second"}):
                label = "ORDINAL"
            elif result["entity_kind"] == "snips/temperature":
                label = "QUANTITY"
            elif result["entity_kind"] == "snips/amountOfMoney":
                label = "MONEY"
            elif result["entity_kind"] == "snips/percentage":
                label = "PERCENT"
            elif result["entity_kind"] in {"snips/date", "snips/datePeriod", "snips/datetime"}:
                label = "DATE"
            elif result["entity_kind"] in {"snips/time", "snips/timePeriod"}:
                label = "TIME"

            if label:
                yield span.start, span.end, label
class ConLL2003Standardiser(SpanAnnotator):
    """Annotator taking existing annotations and standardising them
    to fit the ConLL 2003 tag scheme"""

    def __init__(self):
        super(ConLL2003Standardiser, self).__init__("")

    def __call__(self, doc):
        """Annotates one single document"""     
               
        for source in doc.spans:
               
            new_spans = []  
            for span in doc.spans[source]:
                if "\n" in span.text:
                    continue
                elif span.label_=="PERSON":
                    new_spans.append(Span(doc, span.start, span.end, label="PER"))
                elif span.label_ in {"ORGANIZATION", "ORGANISATION", "COMPANY"}:
                    new_spans.append(Span(doc, span.start, span.end, label="ORG"))
                elif span.label_ in {"GPE"}:
                    new_spans.append(Span(doc, span.start, span.end, label="LOC"))
                elif span.label_ in {"EVENT", "FAC", "LANGUAGE", "LAW", "NORP", "PRODUCT", "WORK_OF_ART"}:
                    new_spans.append(Span(doc, span.start, span.end, label="MISC"))
                else:
                    new_spans.append(span)         
            doc.spans[source] = new_spans      
        return doc    
    

In [204]:

date_annotator = skweak.heuristics.FunctionAnnotator("date_detector", date_generator)
time_annotator = skweak.heuristics.FunctionAnnotator("time_detector", time_generator)
money_annotator = skweak.heuristics.FunctionAnnotator("money_detector", money_generator)
exclusives = ["date_detector", "time_detector", "money_detector"]
number_annotator = skweak.heuristics.FunctionAnnotator("number_detector", number_generator)
number_annotator.add_incompatible_sources(exclusives)

date_annotator(doc)
time_annotator(doc)
money_annotator(doc)
number_annotator(doc)
skweak.utils.display_entities(doc, "date_detector")
skweak.utils.display_entities(doc, "time_detector")
skweak.utils.display_entities(doc, "money_detector")
skweak.utils.display_entities(doc, "number_detector")

I have also created a range of patterns aiming to improve the _detection_ of named entities, even though they leave the actual label underspecified (as a generic `ENT` label). Four such detectors are constructed:
- two detectors of proper names based on casing (marking sequence of tokens whose lemma are "titled" as potential named entities)
- one detector of NNP sequences (based on the Spacy POS tagger)
- and one detector of sequences with proper names linked with "compound" dependency relations

In [205]:
# Detection based on casing
proper_detector = skweak.heuristics.TokenConstraintAnnotator("proper_detector", skweak.utils.is_likely_proper, "ENT")
    
# Detection based on casing, but allowing some lowercased tokens
proper2_detector = skweak.heuristics.TokenConstraintAnnotator("proper2_detector", skweak.utils.is_likely_proper, "ENT")
proper2_detector.add_gap_tokens(LOWERCASED_TOKENS | NAME_PREFIXES)
#add  .ner.        
# Detection based on part-of-speech tags
nnp_detector = skweak.heuristics.TokenConstraintAnnotator("nnp_detector", lambda tok: tok.tag_=="NNP", "ENT")
        
# Detection based on dependency relations (compound phrases)
compound = lambda tok: skweak.utils.is_likely_proper(tok) and skweak.utils.in_compound(tok)
compound_detector = skweak.heuristics.TokenConstraintAnnotator("compound_detector", compound, "ENT")
 
combined = skweak.base.CombinedAnnotator()
exclusives = ["date_detector", "time_detector", "money_detector"]
for annotator in [proper_detector, proper2_detector, nnp_detector, compound_detector]:
    annotator.add_incompatible_sources(exclusives)
    annotator.add_gap_tokens(["'s", "-"])
    combined.add_annotator(annotator)

    # We add one variants for each NE detector, looking at infrequent tokens
    infrequent_name = "infrequent_%s"%annotator.name
    combined.add_annotator(skweak.heuristics.SpanConstraintAnnotator(infrequent_name, annotator.name, skweak.utils.is_infrequent))

doc = combined(doc)
skweak.utils.display_entities(doc, "proper_detector")
skweak.utils.display_entities(doc, "proper2_detector")
skweak.utils.display_entities(doc, "nnp_detector")
skweak.utils.display_entities(doc, "compound_detector")

Finally, I created three specific annotators:
- to recognise company names with a legal type
- full person names (with a first name along a list of common first names)
- slightly less common entities such as `NORP`, `FAC`, `LANGUAGE`, `EVENT` and `LAW`.

In [206]:

# Other types (legal references etc.)      
misc_detector = skweak.heuristics.FunctionAnnotator("misc_detector", misc_generator)
legal_detector = skweak.heuristics.FunctionAnnotator("legal_detector", legal_generator)
        
# Detection of companies with a legal type
ends_with_legal_suffix = lambda x: x[-1].lower_.rstrip(".") in LEGAL_SUFFIXES
company_type_detector = skweak.heuristics.SpanConstraintAnnotator("company_type_detector", "proper2_detector", 
                                                    ends_with_legal_suffix, "COMPANY")
# Detection of full person names
FIRSR_NAMES = "firstnames.json"
full_name_detector = skweak.heuristics.SpanConstraintAnnotator("full_name_detector", "proper2_detector", 
                                                     FullNameDetector(), "PERSON")


legal_detector(doc)
company_type_detector(doc)
full_name_detector(doc)
misc_detector(doc)
skweak.utils.display_entities(doc, "company_type_detector")
skweak.utils.display_entities(doc, "full_name_detector")
skweak.utils.display_entities(doc, "misc_detector")
skweak.utils.display_entities(doc, "legal_detector")

Finally, we also rely on an external probabilistic [parser of named entities](https://github.com/snipsco/snips-nlu-parsers) from [Snips](https://snips.ai/). The parser recognises `DATE`, `TIME`, `ORDINAL`, `CARDINAL`, `MONEY` and `PERCENT`. The parser is implemented in _Rust_, so it runs quite fast.

In [207]:
# Detection based on a probabilistic parser
# NB: requires to install "snips-nlu-parsers" (pip install snips-nlu-parsers)
snips = SnipsAnnotator("snips")
snips(doc)
skweak.utils.display_entities(doc, "snips")

### 4. Document-level annotators

All annotators presented so far rely on _local_ decisions on tokens or phrases.  However, news articles are not mere collections of words, but exhibit a high degree of internal coherence. This can be exploited to furhter improve the annotation. Two document-level annotators are implemented:

Before we can run the document-level annotators, we need to normalise some of the entities. The `ConLL2003Standardiser` is responsible for this normalisation:
- entities `PER` (from conll2003, BTC and SEC) are set to `PERSON`
- entities `LOC` from conll2003, BTC and SEC for spans that are also annotated by other layers as `GPE` are set to `GPE` 
- entities `ORG` that are annotated by other layers as `COMPANY` are set to `COMPANY`
    

In [208]:
annotator = ConLL2003Standardiser()
doc = annotator(doc)

In [209]:
mv = skweak.aggregation.MajorityVoter("mv", ["LOC", "MISC", "ORG", "PER"])
mv.add_underspecified_label("ENT", {"LOC", "MISC", "ORG", "PER"})
doc = mv(doc)
skweak.utils.display_entities(doc, "mv")

#### 4.1 Document history

When a journalist first mentions an entity such as a company or person in an article, they typically write it in a "long form", and then use shorter mentions once the entity is properly introduced. For instance, in the text above, "Scott Moore" is first mentioned with a full name, and then simply referred to as "Moore". Similarly, companies are often first introduced to with their legal type.  The `DocumentHistoryAnnotator` takes advantage of this property, by propagating the label from the first mention onto subsequent mentions:

In [210]:
annotator = skweak.doclevel.DocumentHistoryAnnotator("doc_history", "mv", ["PER", "ORG"])
annotator(doc)
skweak.utils.display_entities(doc, "doc_history")

#### 4.2 Label consistency

Another property of news documents is the fact that two (or more) named entities sharing the same string in a text typically refer to the same entity, and should therefore have the same label. "Komatsu" can be both a company name and a city in Japan, but within a given document, it will typically be one or the other for the whole document. We can capture this fact with an annotator that looks at the majority label for a given string, and annotate all occurrences with this label:

In [211]:
annotator = skweak.doclevel.DocumentMajorityAnnotator("doc_majority", "mv")
annotator(doc)
skweak.utils.display_entities(doc, "doc_majority")

In [212]:
mv = skweak.aggregation.MajorityVoter("mv", ["LOC", "MISC", "ORG", "PER"])
mv.add_underspecified_label("ENT", {"LOC", "MISC", "ORG", "PER"})
doc = mv(doc)
skweak.utils.display_entities(doc, "mv")

## __Step 2__: Estimation of label model

We can construct a full annotator with all annotators described above, and then run it on a dataset such as Reuters, Bloomberg, or Acquire:

In [213]:
from typing import Iterable, Tuple
import re, json, os
import snips_nlu_parsers
from skweak.base import CombinedAnnotator, SpanAnnotator
from skweak.spacy import ModelAnnotator, TruecaseAnnotator
from skweak.heuristics import FunctionAnnotator, TokenConstraintAnnotator, SpanConstraintAnnotator, SpanEditorAnnotator
from skweak.gazetteers import GazetteerAnnotator, extract_json_data
from skweak.doclevel import DocumentHistoryAnnotator, DocumentMajorityAnnotator
from skweak.aggregation import MajorityVoter
from skweak import utils
from spacy.tokens import Doc, Span  # type: ignore


class NERAnnotator(CombinedAnnotator):
    """Annotator of entities in documents, combining several sub-annotators (such as gazetteers,
    spacy models etc.). To add all annotators currently implemented, call add_all(). """

    def add_all(self):
        """Adds all implemented annotation functions, models and filters"""

        print("Loading shallow functions")
        self.add_shallow()
        print("Loading Spacy NER models")
        self.add_models()
        print("Loading gazetteer supervision modules")
        self.add_gazetteers()
        print("Loading document-level supervision sources")
        self.add_doc_level()

        return self

    def add_shallow(self):
        """Adds shallow annotation functions"""

        # Detection of dates, time, money, and numbers
        self.add_annotator(FunctionAnnotator("date_detector", date_generator))
        self.add_annotator(FunctionAnnotator("time_detector", time_generator))
        self.add_annotator(FunctionAnnotator("money_detector", money_generator))

        # Detection based on casing
        proper_detector = TokenConstraintAnnotator("proper_detector", utils.is_likely_proper, "ENT")

        # Detection based on casing, but allowing some lowercased tokens
        proper2_detector = TokenConstraintAnnotator("proper2_detector", utils.is_likely_proper, "ENT")
        proper2_detector.add_gap_tokens(LOWERCASED_TOKENS | NAME_PREFIXES)

        # Detection based on part-of-speech tags
        nnp_detector = TokenConstraintAnnotator("nnp_detector", lambda tok: tok.tag_ in {"NNP", "NNPS"}, "ENT")

        # Detection based on dependency relations (compound phrases)
        compound = lambda tok: utils.is_likely_proper(tok) and utils.in_compound(tok)
        compound_detector = TokenConstraintAnnotator("compound_detector", compound, "ENT")

        exclusives = ["date_detector", "time_detector", "money_detector"]
        for annotator in [proper_detector, proper2_detector, nnp_detector, compound_detector]:
            annotator.add_incompatible_sources(exclusives)
            annotator.add_gap_tokens(["'s", "-"])
            self.add_annotator(annotator)

            # We add one variants for each NE detector, looking at infrequent tokens
            infrequent_name = "infrequent_%s" % annotator.name
            self.add_annotator(SpanConstraintAnnotator(infrequent_name, annotator.name, utils.is_infrequent))

        # Other types (legal references etc.)
        misc_detector = FunctionAnnotator("misc_detector", misc_generator)
        legal_detector = FunctionAnnotator("legal_detector", legal_generator)

        # Detection of companies with a legal type
        ends_with_legal_suffix = lambda x: x[-1].lower_.rstrip(".") in LEGAL_SUFFIXES
        company_type_detector = SpanConstraintAnnotator("company_type_detector", "proper2_detector",
                                                        ends_with_legal_suffix, "COMPANY")

        # Detection of full person names
        full_name_detector = SpanConstraintAnnotator("full_name_detector", "proper2_detector",
                                                     FullNameDetector(), "PERSON")

        for annotator in [misc_detector, legal_detector, company_type_detector, full_name_detector]:
            annotator.add_incompatible_sources(exclusives)
            self.add_annotator(annotator)

        # General number detector
        number_detector = FunctionAnnotator("number_detector", number_generator)
        number_detector.add_incompatible_sources(exclusives + ["legal_detector", "company_type_detector"])
        self.add_annotator(number_detector)

        self.add_annotator(SnipsAnnotator("snips"))
        return self

    def add_models(self):
        """Adds Spacy NER models to the annotator"""

        self.add_annotator(ModelAnnotator("core_web_md", "en_core_web_md"))
        #self.add_annotator(TruecaseAnnotator("core_web_md_truecase", "en_core_web_md", "form_frequencies.json"))
        #self.add_annotator(ModelAnnotator("BTC", os.path.dirname(__file__) + "/../../data/btc"))
        #self.add_annotator( TruecaseAnnotator("BTC_truecase", os.path.dirname(__file__) + "/../../data/btc", FORM_FREQUENCIES))

        # Avoid spans that start with an article
        editor = lambda span: span[1:] if span[0].lemma_ in {"the", "a", "an"} else span
        self.add_annotator(SpanEditorAnnotator("edited_BTC", "BTC", editor))
        self.add_annotator(SpanEditorAnnotator("edited_BTC_truecase", "BTC_truecase", editor))
        self.add_annotator(SpanEditorAnnotator("edited_core_web_md", "core_web_md", editor))
        self.add_annotator(SpanEditorAnnotator("edited_core_web_md_truecase", "core_web_md_truecase", editor))

        return self

    def add_gazetteers(self, full_load=True):
        """Adds gazetteer supervision models (company names and wikidata)."""

        # Annotation of company names based on a large list of companies
        # company_tries = extract_json_data(COMPANY_NAMES) if full_load else {}

        # Annotation of company, person and location names based on wikidata
        wiki_tries = extract_json_data("wikidata_small_tokenised.json.gz") if full_load else {}

        # Annotation of company, person and location names based on wikidata (only entries with descriptions)
        wiki_small_tries = extract_json_data("wikidata_small_tokenised.json.gz")

        # Annotation of location names based on geonames
        geo_tries = extract_json_data("geonames.json")

        # Annotation of organisation and person names based on crunchbase open data
        crunchbase_tries = extract_json_data("crunchbase_companies.json.gz")

        # Annotation of product names
        products_tries = extract_json_data("products.json")

        exclusives = ["date_detector", "time_detector", "money_detector", "number_detector"]
        for name, tries in {"wiki":wiki_tries, "wiki_small":wiki_small_tries,
                            "geo":geo_tries, "crunchbase":crunchbase_tries, "products":products_tries}.items():
            
            # For each KB, we create two gazetters (case-sensitive or not)
            cased_gazetteer = GazetteerAnnotator("%s_cased"%name, tries, case_sensitive=True)
            uncased_gazetteer = GazetteerAnnotator("%s_uncased"%name, tries, case_sensitive=False)
            cased_gazetteer.add_incompatible_sources(exclusives)
            uncased_gazetteer.add_incompatible_sources(exclusives)
            self.add_annotators(cased_gazetteer, uncased_gazetteer)
                
            # We also add new sources for multitoken entities (which have higher confidence)
            multitoken_cased = SpanConstraintAnnotator("multitoken_%s"%(cased_gazetteer.name), 
                                                       cased_gazetteer.name, lambda s: len(s) > 1)
            multitoken_uncased = SpanConstraintAnnotator("multitoken_%s"%(uncased_gazetteer.name), 
                                                         uncased_gazetteer.name, lambda s: len(s) > 1)
            self.add_annotators(multitoken_cased, multitoken_uncased)
                
        return self

    def add_doc_level(self):
        """Adds document-level supervision sources"""

        self.add_annotator(ConLL2003Standardiser())

        maj_voter = MajorityVoter("doclevel_voter", ["LOC", "MISC", "ORG", "PER"], 
                                  initial_weights={"doc_history":0, "doc_majority":0})
        maj_voter.add_underspecified_label("ENT", {"LOC", "MISC", "ORG", "PER"})     
        self.add_annotator(maj_voter)   
           
        self.add_annotator(DocumentHistoryAnnotator("doc_history_cased", "doclevel_voter", ["PER", "ORG"]))
        self.add_annotator(DocumentHistoryAnnotator("doc_history_uncased", "doclevel_voter", ["PER", "ORG"]))
        
        maj_voter = MajorityVoter("doclevel_voter", ["LOC", "MISC", "ORG", "PER"],
                                  initial_weights={"doc_majority":0})
        maj_voter.add_underspecified_label("ENT", {"LOC", "MISC", "ORG", "PER"})
        self.add_annotator(maj_voter)

        self.add_annotator(DocumentMajorityAnnotator("doc_majority_cased", "doclevel_voter"))
        self.add_annotator(DocumentMajorityAnnotator("doc_majority_uncased", "doclevel_voter", 
                                                     case_sensitive=False))
        return self


In [214]:
full_annotator = NERAnnotator().add_all()
print("Total number of annotators:", len(full_annotator.annotators))

Loading shallow functions
Loading Spacy NER models
Loading gazetteer supervision modules
Extracting data from wikidata_small_tokenised.json.gz
Populating trie for class PERSON (number: 1863434)
Populating trie for class LOC (number: 14241)
Populating trie for class GPE (number: 273373)
Populating trie for class ORG (number: 91341)
Populating trie for class PRODUCT (number: 12457)
Extracting data from wikidata_small_tokenised.json.gz
Populating trie for class PERSON (number: 1863434)
Populating trie for class LOC (number: 14241)
Populating trie for class GPE (number: 273373)
Populating trie for class ORG (number: 91341)
Populating trie for class PRODUCT (number: 12457)
Extracting data from geonames.json
Populating trie for class GPE (number: 15205)
Extracting data from crunchbase_companies.json.gz
Populating trie for class COMPANY (number: 539174)
Extracting data from products.json
Populating trie for class PRODUCT (number: 45362)
Loading document-level supervision sources
Total number 

We can then take the raw data from [Reuters](https://github.com/NorskRegnesentral/skweak/raw/main/data/reuters_small.tar.gz), run Spacy on the textual content, and finally apply the annotator to get annotations from the each source:

In [215]:
!python -m spacy convert reuters_small.tar.gz "/content/"

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/usr/local/lib/python3.7/dist-packages/spacy/cli/_util.py", line 71, in setup_cli
    command(prog_name=COMMAND)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/dist-packages/

In [216]:
!ungz /content/retuers_small.tar.gz

/bin/bash: ungz: command not found


In [217]:
import tarfile
# open file
file = tarfile.open('reuters_small.tar.gz')
  
# extracting file
file.extractall('/content/reuters_small')
  
file.close()

In [218]:
pip install reuters

[0m

In [219]:

# We annotate 200 documents, and store them in a Spacy DocBin file
docs = list(skweak.utils.docbin_reader("reuters_small.spacy"))
docs = list(full_annotator.pipe(docs))

Once this is done, we can finally estimate a unified annotator model through weak supervision. The basic idea is to describe the named entity recognition problem as a _Hidden markov Model_ where the observations are the annotations from each source, and the states correspond to the "true" (hidden) labels for each token, as illustrated in Figure 2 in the paper.

Since we don't have access to the true labels for each token, we will rely on _Baum-Welch_ (a variant of EM) to estimate the HMM model through unsupervised training. More specifically, we will need to estimate 3 models:
- the initial probabilities $P(Y_0)$ of the labels for the first token of a document
- the transition matrix $P(Y_i | Y_{i-1})$ for the labels 
- the emission models $P(\lambda_{i,j} | Y_i)$ of observing a particular value $\lambda_{i,j}$ (say, `B-PER`) from the source $j$ given the true label $Y_i$. In the current model, we assume the emissions to be independent of one another given the true label, to reduce the complexity of the model.

Given an annotated dataset, the HMM model can be easily estimated:

In [220]:
unified_model = skweak.aggregation.HMM("hmm", ["LOC", "MISC", "ORG", "PER"])
unified_model.add_underspecified_label("ENT", ["LOC", "MISC", "ORG", "PER"])
# We then run Baum-Welch on the model (can take some time)
unified_model.fit(docs)

# Saving the model to a file
unified_model.save("/content/hmm_reuters_small.pkl")

Starting iteration 1
Finished E-step with 195 documents
Starting iteration 2


         1     -334337.3747             +nan


Finished E-step with 195 documents
Starting iteration 3


         2     -284632.9615      +49704.4131


Finished E-step with 195 documents
Starting iteration 4


         3     -276919.0075       +7713.9540


Finished E-step with 195 documents


         4     -276821.0967         +97.9108


Note that the HMM model relies on some informative priors to facilitate the parameter estimation:
- the prior for the initial probabilities is a Dirichlet based on counts for the most reliable model (chosen right now to be the Spacy NER model trained on Ontonotes)
- the prior for the transition matrix is a list of Dirichlet also based on counts from the standard Spacy NER model.
- finally, the initial emission models are calculated based on subjective estimates of the relative precision and recall of each source. For instance, we know that a source like `company_type_detector` (which looks at legal suffixes such as "Inc." at the end of the noun phrase) has a very high precision, but a low recall , since many mentions of companies do not include a suffix. In contrast, gazeteers will tend to have a better recall, but a lower precision (some company names also happen to be names of geopolitical entities or persons).  The initial precisions and recalls provided to the model are specified in `SOURCE_PRIORS` in the file `labelling.py`. When a precision and recall is not provided for a given source, they are assumed to be zeros (for instance, `company_type_detector` only detects `COMPANY` entities and nothing else).  

In [221]:
unified_model.pretty_print() 

HMM model with following parameters:
Output labels: ['O', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER']
Label groups: {'B-ENT': {'B-ORG', 'B-PER', 'B-LOC', 'B-MISC'}, 'I-ENT': {'I-MISC', 'I-PER', 'I-LOC', 'I-ORG'}}
--------
Start distribution:
O         0.19
B-LOC     0.11
I-LOC     0.00
B-MISC    0.03
I-MISC    0.00
B-ORG     0.64
I-ORG     0.00
B-PER     0.03
I-PER     0.00
dtype: float64
--------
Transition model:
           O  B-LOC  I-LOC  B-MISC  I-MISC  B-ORG  I-ORG  B-PER  I-PER
O       0.92   0.02   0.00    0.00    0.00   0.04   0.00   0.01   0.00
B-LOC   0.76   0.03   0.19    0.00    0.00   0.01   0.00   0.01   0.00
I-LOC   0.88   0.01   0.07    0.01    0.00   0.01   0.00   0.02   0.00
B-MISC  0.69   0.00   0.00    0.01    0.27   0.01   0.00   0.01   0.00
I-MISC  0.61   0.02   0.00    0.06    0.30   0.01   0.00   0.01   0.00
B-ORG   0.57   0.01   0.00    0.00    0.00   0.05   0.37   0.00   0.00
I-ORG   0.47   0.01   0.00    0.00    0.00   0.04   0.

Once the model is learned, we can apply it as any other "annotator" object:

In [222]:
docs = list(unified_model.pipe(docs))
skweak.utils.display_entities(docs[0], "hmm")
skweak.utils.display_entities(docs[1], "hmm")

<br>

## __Step 3__: Development of neural NER model


We can now learn a neural NER model based on these unified annotations. We have two options: a straighforward (but slightly underperforming) approach using Spacy, and a more sophisticated approach using our own NER model

### __Alternative 1__: Using Spacy

In [225]:
test_text = """Sunday marks the 40th anniversary of the signing of the Proclamation of the Constitution Act, 1982. Queen Elizabeth II, then-prime minister Pierre Trudeau, Jean Chrétien, the justice minister at the time, and André Ouellet, the registrar general, put their signatures on the document, as raindrops dripped on the page.
After that, the Constitution Act, 1982 became the law of the land, comprised of the charter, a section expounding upon the rights of Indigenous people, the somewhat vestigial Constitution Act, 1867, and others, not to mention the various common laws that help form this country’s constitutional foundation.
In the intervening decades, the charter has affected Canadians in myriad ways; it has helped guarantee rights to a fair trial, opened the door to medically assisted dying, allowed for prisoners to vote in elections, and limited the ways in which Canadians can express themselves.
Yet, Canadians cannot — as Americans can — go see their foundational documents. There can be no pilgrimage to Ottawa or Winnipeg to see the papers that protect our rights to worship as we please or associate with whom we please.
Not only is the charter, or Section 35 — concerning the rights of Indigenous people — not on display in Canada, but neither are other venerable documents, such as the British North America (BNA) Act.
There are a number of reasons for this. The first is that the Constitution Act, 1982 isn’t an original Canadian document per se. Rather, it — and therefore the charter — are parts of the Canada Act, 1982, a law passed in the United Kingdom. The originals, then, are a British possession."""
skweak.utils.display_entities(nlp(test_text))
skweak.utils.display_entities(nlp(news_text))

NB: The file `eval_utils.py` contains code to easily extract evaluation metrics by comparing the annotations from a particular annotation layer (for instance the HMM predictions, or the predictions from a single source) to the gold standard: