In [None]:
(NOT EVEN A FIRST VERSION YET)

## What does Named Entity Recognition do for you?

But what do we even mean by entities here? [What even are named entities](https://en.wikipedia.org/wiki/Named-entity_recognition)?




Classical examples of NER take a sentence like "Jim bought 300 shares of Acme in 2006" and point out that 
- Jim is a PERSON
- Acme is an ORGANISATION, and
- 2006 is a DATE

This is clearly useful for some things - this particular example probably leads towards information extraction.
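
To make that concrete, here is a minimal sketch of how such annotations are commonly stored: as character offsets plus a label. The `Entity` type and `find_span` helper are made up for illustration, and the labels are written by hand rather than produced by a model.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int  # character offset where the entity starts
    end: int    # character offset just past the entity's last character
    label: str


def find_span(text, phrase, label):
    """Locate the first occurrence of phrase and wrap it as an Entity."""
    start = text.index(phrase)
    return Entity(start, start + len(phrase), label)


sentence = "Jim bought 300 shares of Acme in 2006"

# Hand-written annotations for the classic example (not model output)
entities = [
    find_span(sentence, "Jim", "PERSON"),
    find_span(sentence, "Acme", "ORGANISATION"),
    find_span(sentence, "2006", "DATE"),
]

for ent in entities:
    print(sentence[ent.start:ent.end], ent.label)
```

Storing offsets rather than copies of the text keeps the annotation tied to a position, which matters once the same string appears more than once.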

NER is typically smarter than just matching known substrings or lemmas - it can often estimate from context
that, in "X bought 300 shares of Y in Z", X is some sort of actor and Y is something you can buy.
And in that context, it can figure that if Z is a number, it is probably a DATE and not MONEY or an unspecified CARDINAL number.

However, in that sentence Z could be a place (in Greece), a time (in 2006), or something else (in good confidence, in a panic),
and getting that amount of detail right quickly becomes a much wider NLP question,
so even with contextual awareness at work, NER tends to focus on categories that are relatively simple to learn well.
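
As a toy illustration of that contextual guess, the sketch below hard-codes the "X bought N shares of Y in Z" frame as a regular expression and then guesses a label for Z. A real NER model learns this kind of thing statistically rather than from a hand-written rule; the `ACTOR`, `ASSET` and `OTHER` labels and the function names are invented for this example.

```python
import re

# The sentence frame from the text, with named groups for the slots
FRAME = re.compile(
    r"(?P<actor>\w+) bought (?P<amount>\d+) shares of (?P<asset>\w+) in (?P<z>.+)"
)


def guess_z_label(z):
    """Very rough guess at what kind of thing follows 'in'."""
    if re.fullmatch(r"1[0-9]{3}|20[0-9]{2}", z):
        return "DATE"      # a four-digit number that looks like a year
    if z[:1].isupper():
        return "LOCATION"  # capitalised, so perhaps a place name
    return "OTHER"         # e.g. "a panic"


def label_sentence(sentence):
    m = FRAME.fullmatch(sentence)
    if m is None:
        return []
    return [
        (m["actor"], "ACTOR"),
        (m["amount"], "CARDINAL"),
        (m["asset"], "ASSET"),
        (m["z"], guess_z_label(m["z"])),
    ]


for s in ("Jim bought 300 shares of Acme in 2006",
          "Jim bought 300 shares of Acme in Greece",
          "Jim bought 300 shares of Acme in a panic"):
    print(label_sentence(s))
```

Even this tiny rule shows where things get hard: everything that is not a year or a capitalised word lands in a catch-all bucket.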


In fact, the simplest NER models may not do much beyond PERSON, LOCATION, and ORGANISATION,
perhaps PRODUCT if they are made for a corporation,
presumably because these are the classes that are used with enough consistency to learn well.


You could define named entities as 
* anything that fits into pre-defined categories -- such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. 
* _any concept for which we use fairly consistent names_





Such cherry-picked examples tend to have entity types that are something like [rigid designators](https://en.wikipedia.org/wiki/Rigid_designator),
or other fancy terms that have meanings like 
"this points to the same thing no matter what" or 
"there is just one of these" or 
"there may be multiple but we can tell from context what _type_ this one is".

It also helps that, such categories, while not closed classes ()




At the same time, none of this really explains the types that are typically included or excluded,
or more importantly, _why_.
This is further confused by the fact that NER doesn't even quite stick to its own rules.


There's immediately a bit of a philosophical question: what _should_ even be included in this?

But also a practical one. 

Why not more? Why not, say, detect "strafbaar feit" and "vergrijp" as LEGALACT?

You can. But how far do you go?


...and it's not even easy to answer which of those is more sensible. 


But also, what about any repeated concept that is useful?
In the legal world there are some phrases that are _more_ than just phrases.

Say, "strafbaar feit" is a very specific term, as is "vergrijp".


At the same time, how far do you go?
Would you want "vergrijp tegen de voorschriften betreffende de orde"?
Maybe that is

Part of what makes NER in its usual form useful is that 
on unseen text, a COMPANY is probably be 




What kind of terms do we want?

Say, 
* strafbaar feit
* onherroepelijke beslissing
* bestuurlijke autoriteit

Maybe
* feit dat wordt bestraft als vergrijp 
* vergrijp tegen de voorschriften betreffende de orde
voor zover tegen de beslissing beroep op een met name in strafzaken bevoegde rechter is opengesteld
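
The bluntest way to pick up terms like these is a fixed term list (a gazetteer) matched against the text. The sketch below does that with a regular expression, trying longer terms first so that "vergrijp tegen de voorschriften betreffende de orde" wins over plain "vergrijp". The `TERMS` list and `find_terms` function are only illustrative, and this is plain string matching without any of the contextual smarts discussed earlier.

```python
import re

# Illustrative term list, taken from the examples above
TERMS = [
    "strafbaar feit",
    "onherroepelijke beslissing",
    "bestuurlijke autoriteit",
    "vergrijp",
    "vergrijp tegen de voorschriften betreffende de orde",
]


def find_terms(text, terms):
    """Return (start, end, matched_text) for each term occurrence, longest term first."""
    by_length = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:%s)\b" % "|".join(re.escape(t) for t in by_length),
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


text = (
    "een onherroepelijke beslissing van een bestuurlijke autoriteit wegens een strafbaar feit "
    "of een feit dat wordt bestraft als vergrijp tegen de voorschriften betreffende de orde"
)

for start, end, term in find_terms(text, TERMS):
    print(start, end, term)
```

This finds exactly what is listed and nothing else, which is both its appeal and its limitation: inflected or reworded variants need their own entries.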


Fuzzy looker
* NP that 

If we take "
Okay, so that's a start, but clearly not tuned to 

* strafbaar feit


In [None]:
from transformers import pipeline

pipe = pipeline("feature-extraction", model="Gerwin/legal-bert-dutch-english")

pipe("This restaurant is awesome")


In [3]:
from transformers import AutoTokenizer, AutoModel, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained("Gerwin/legal-bert-dutch-english")
model = AutoModel.from_pretrained("Gerwin/legal-bert-dutch-english")  # PyTorch

Some weights of the model checkpoint at Gerwin/legal-bert-dutch-english were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
help(model)

In [None]:
from transformers import AutoTokenizer, AutoModel, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained("Gerwin/legal-bert-dutch-english")

model = TFAutoModel.from_pretrained("Gerwin/legal-bert-dutch-english")  # TensorFlow

inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")  # TensorFlow tensors, to match TFAutoModel

outputs = model(**inputs)

outputs

In [6]:
from transformers import pipeline

pipe = pipeline("feature-extraction", model="Gerwin/legal-bert-dutch-english")

#unmasker = pipeline('fill-mask', model='bert-base-multilingual-uncased')
#unmasker("Hello I'm a [MASK] model.")

pytorch_model.bin:   0%|          | 0.00/672M [00:00<?, ?B/s]

Some weights of the model checkpoint at Gerwin/legal-bert-dutch-english were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
text = "In deze wet wordt verstaan onder: rechterlijke uitspraak: een onherroepelijke beslissing van een rechter wegens een strafbaar feit; beschikking: een onherroepelijke beslissing van een bestuurlijke autoriteit wegens een strafbaar feit of een feit dat wordt bestraft als vergrijp tegen de voorschriften betreffende de orde, voor zover tegen de beslissing beroep op een met name in strafzaken bevoegde rechter is opengesteld"
pipe(text)

In [15]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/distilbert-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/distilbert-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
import pprint
pprint.pprint(ner_results)

[{'end': 2,
  'entity': 'LABEL_0',
  'index': 1,
  'score': 0.9992238,
  'start': 0,
  'word': 'My'},
 {'end': 7,
  'entity': 'LABEL_0',
  'index': 2,
  'score': 0.9994271,
  'start': 3,
  'word': 'name'},
 {'end': 10,
  'entity': 'LABEL_0',
  'index': 3,
  'score': 0.99953353,
  'start': 8,
  'word': 'is'},
 {'end': 19,
  'entity': 'LABEL_1',
  'index': 4,
  'score': 0.99110633,
  'start': 11,
  'word': 'Wolfgang'},
 {'end': 23,
  'entity': 'LABEL_0',
  'index': 5,
  'score': 0.9994848,
  'start': 20,
  'word': 'and'},
 {'end': 25,
  'entity': 'LABEL_0',
  'index': 6,
  'score': 0.987379,
  'start': 24,
  'word': 'I'},
 {'end': 30,
  'entity': 'LABEL_0',
  'index': 7,
  'score': 0.9990946,
  'start': 26,
  'word': 'live'},
 {'end': 33,
  'entity': 'LABEL_0',
  'index': 8,
  'score': 0.9990859,
  'start': 31,
  'word': 'in'},
 {'end': 40,
  'entity': 'LABEL_5',
  'index': 9,
  'score': 0.9967968,
  'start': 34,
  'word': 'Berlin'}]
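
The output above only carries generic `LABEL_n` names, so it does not say what `LABEL_1` or `LABEL_5` stand for. Readable names, when a model ships them, would normally be found in `model.config.id2label`. As a small sketch, assuming `LABEL_0` is the "not an entity" class (which the output suggests, since every ordinary word gets it), the hypothetical helper below keeps only the tokens labelled as something else.

```python
def entity_tokens(ner_results, outside_label="LABEL_0"):
    """Keep only tokens whose predicted label is not the assumed 'outside' class."""
    return [
        (r["word"], r["entity"], r["start"], r["end"])
        for r in ner_results
        if r["entity"] != outside_label
    ]


# A shortened copy of the output above
sample = [
    {"entity": "LABEL_0", "word": "is", "start": 8, "end": 10},
    {"entity": "LABEL_1", "word": "Wolfgang", "start": 11, "end": 19},
    {"entity": "LABEL_0", "word": "in", "start": 31, "end": 33},
    {"entity": "LABEL_5", "word": "Berlin", "start": 34, "end": 40},
]

print(entity_tokens(sample))
```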


In [27]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline


tokenizer = AutoTokenizer.from_pretrained("romjansen/mbert-base-cased-NER-NL-legislation-refs")
model = AutoModelForTokenClassification.from_pretrained("romjansen/mbert-base-cased-NER-NL-legislation-refs")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
#example = "In deze wet wordt verstaan onder: rechterlijke uitspraak: een onherroepelijke beslissing van een rechter wegens een strafbaar feit"
example = "Het eerste lid is van toepassing in het geval de veroordeelde een natuurlijke persoon is, voor zover deze inkomsten of vermogen of zijn vaste woon- of verblijfplaats in Nederland heeft dan wel in het geval de veroordeelde een rechtspersoon is, voor zover deze inkomsten of vermogen of zijn statutaire zetel in Nederland heeft, dan wel indien het specifieke voorwerp waarop de beslissing tot confiscatie of het confiscatiebevel betrekking heeft zich op Nederlands grondgebied bevindt."


ner_results = nlp(example)
#import pprint
#pprint.pprint(ner_results)
for d in ner_results:
    print( d['entity'], d['word'])

OSError: romjansen/mbert-base-cased-NER-NL-legislation-refs is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.

In [24]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("Gerwin/legal-bert-dutch-english")
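# Note: this checkpoint is a pretrained language model without a trained token-classification head,
# so the head added here starts out randomly initialized (see the warning below) and its labels carry no meaning yet.
model = AutoModelForTokenClassification.from_pretrained("Gerwin/legal-bert-dutch-english")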

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
#example = "In deze wet wordt verstaan onder: rechterlijke uitspraak: een onherroepelijke beslissing van een rechter wegens een strafbaar feit"
example = "Het eerste lid is van toepassing in het geval de veroordeelde een natuurlijke persoon is, voor zover deze inkomsten of vermogen of zijn vaste woon- of verblijfplaats in Nederland heeft dan wel in het geval de veroordeelde een rechtspersoon is, voor zover deze inkomsten of vermogen of zijn statutaire zetel in Nederland heeft, dan wel indien het specifieke voorwerp waarop de beslissing tot confiscatie of het confiscatiebevel betrekking heeft zich op Nederlands grondgebied bevindt."


ner_results = nlp(example)
#import pprint
#pprint.pprint(ner_results)
for d in ner_results:
    print( d['entity'], d['word'])

Some weights of the model checkpoint at Gerwin/legal-bert-dutch-english were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initia

LABEL_0 het
LABEL_1 eerste
LABEL_1 lid
LABEL_1 is
LABEL_0 van
LABEL_0 toe
LABEL_1 ##passing
LABEL_0 in
LABEL_1 het
LABEL_1 geval
LABEL_1 de
LABEL_0 vero
LABEL_0 ##orde
LABEL_0 ##elde
LABEL_1 een
LABEL_0 natuur
LABEL_0 ##lijke
LABEL_1 persoon
LABEL_1 is
LABEL_0 ,
LABEL_0 voor
LABEL_1 zo
LABEL_0 ##ver
LABEL_0 deze
LABEL_1 ink
LABEL_1 ##oms
LABEL_1 ##ten
LABEL_1 of
LABEL_0 vermogen
LABEL_1 of
LABEL_0 zijn
LABEL_0 vaste
LABEL_0 woo
LABEL_0 ##n
LABEL_1 -
LABEL_0 of
LABEL_0 verb
LABEL_1 ##lij
LABEL_0 ##f
LABEL_1 ##plaats
LABEL_0 in
LABEL_0 nederland
LABEL_0 heeft
LABEL_0 dan
LABEL_0 wel
LABEL_0 in
LABEL_1 het
LABEL_1 geval
LABEL_1 de
LABEL_0 vero
LABEL_0 ##orde
LABEL_0 ##elde
LABEL_0 een
LABEL_0 rechts
LABEL_0 ##pers
LABEL_1 ##oon
LABEL_0 is
LABEL_0 ,
LABEL_0 voor
LABEL_1 zo
LABEL_0 ##ver
LABEL_1 deze
LABEL_0 ink
LABEL_1 ##oms
LABEL_1 ##ten
LABEL_1 of
LABEL_0 vermogen
LABEL_1 of
LABEL_0 zijn
LABEL_0 statut
LABEL_1 ##aire
LABEL_0 ze
LABEL_1 ##tel
LABEL_0 in
LABEL_0 nederland
LABEL_0 heeft
LAB
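
The `##` pieces in this output are WordPiece continuations: the tokenizer split "veroordeelde" into "vero", "##orde", "##elde", and each piece got its own label. A minimal sketch for gluing pieces back into words is below. It keeps the label of the first piece, which is one possible choice rather than the only sensible one, and `merge_wordpieces` is a made-up helper name. The labels in this particular output look close to random anyway, which fits the loading warning about the classification head.

```python
def merge_wordpieces(labelled_pieces):
    """Merge '##' continuation pieces into the preceding word, keeping the first piece's label."""
    words = []
    for label, piece in labelled_pieces:
        if piece.startswith("##") and words:
            prev_label, prev_text = words[-1]
            words[-1] = (prev_label, prev_text + piece[2:])
        else:
            words.append((label, piece))
    return words


# A few (label, piece) pairs copied from the output above
pieces = [
    ("LABEL_0", "vero"),
    ("LABEL_0", "##orde"),
    ("LABEL_0", "##elde"),
    ("LABEL_1", "een"),
    ("LABEL_0", "natuur"),
    ("LABEL_0", "##lijke"),
    ("LABEL_1", "persoon"),
]

for label, word in merge_wordpieces(pieces):
    print(label, word)
```

As far as I know, the transformers `ner` pipeline can also do this grouping itself via its `aggregation_strategy` argument, which would avoid the manual step.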

In [16]:
from transformers import pipeline
#unmasker = pipeline('fill-mask', model='Gerwin/legal-bert-dutch-english')
#unmasker("Hello I'm a [MASK] model.")

tokenizer = AutoTokenizer.from_pretrained("Gerwin/legal-bert-dutch-english")





Some weights of the model checkpoint at Gerwin/legal-bert-dutch-english were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initia

[{'entity': 'LABEL_1',
  'score': 0.56502813,
  'index': 1,
  'word': 'in',
  'start': 0,
  'end': 2},
 {'entity': 'LABEL_1',
  'score': 0.54192364,
  'index': 2,
  'word': 'deze',
  'start': 3,
  'end': 7},
 {'entity': 'LABEL_0',
  'score': 0.5572802,
  'index': 3,
  'word': 'wet',
  'start': 8,
  'end': 11},
 {'entity': 'LABEL_0',
  'score': 0.63282734,
  'index': 4,
  'word': 'wordt',
  'start': 12,
  'end': 17},
 {'entity': 'LABEL_1',
  'score': 0.5182938,
  'index': 5,
  'word': 'vers',
  'start': 18,
  'end': 22},
 {'entity': 'LABEL_0',
  'score': 0.69207937,
  'index': 6,
  'word': '##taan',
  'start': 22,
  'end': 26},
 {'entity': 'LABEL_0',
  'score': 0.63642013,
  'index': 7,
  'word': 'onder',
  'start': 27,
  'end': 32},
 {'entity': 'LABEL_0',
  'score': 0.64541346,
  'index': 8,
  'word': ':',
  'start': 32,
  'end': 33},
 {'entity': 'LABEL_0',
  'score': 0.5703261,
  'index': 9,
  'word': 'rechter',
  'start': 34,
  'end': 41},
 {'entity': 'LABEL_0',
  'score': 0.6196393,

## What named entities does spaCy know?

In [None]:
import random, collections, re

import spacy
import spacy.displacy
from spacy.tokens.span import Span

import wetsuite.datasets
import wetsuite.helpers.net    # used for the download in a later cell
import wetsuite.helpers.spacy

In [None]:
dutch = spacy.load('nl_core_news_lg')

bwb_text = wetsuite.datasets.load('bwb-mostrecent-text')

In [18]:
cherry_picked_xml = wetsuite.helpers.net.download( 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0022604/2023-04-19_0/xml/BWBR0022604_2023-04-19_0.xml' )

spacy.displacy.render( dutch(bwb_text.data.get('BWBR0022604')) , style='ent', jupyter=True)

In [None]:
# you could look at some more examples like:
for url, a_law_txt in bwb_text.data.random_sample( 1 ):
    display('-------------------------- %s -----------------------------'%url)
    doc = dutch( a_law_txt )
    
    # there is a nicer-looking entity visualizer...
    spacy.displacy.render(doc, style='ent', jupyter=True)    # Note: Colab requires an explicit jupyter=True, local notebooks do not.

    # ...but if we want to show multiple labelings at once, we need the 'span' visualizer, so we need to transplant entities to spans it picks up
    #myspans = []
    #for ent in doc.ents: 
    #    myspans.append( Span(doc, ent.start,ent.end, ent.label_) )
    #for nc in doc.noun_chunks:
    #    myspans.append( Span(doc,  nc.start,nc.end,  'nc') )
    #doc.spans["custom"] = myspans
    #spacy.displacy.render(doc, style="span", options={"spans_key": "custom"}, jupyter=True)

# which isn't bad, but also not as nice as we got in the cherry picked examples.



In [19]:

#spacy.require_gpu()
#english = spacy.load("en_core_web_trf")  

nlp = spacy.blank("en")  # "Build upon the spaCy Small Model" (seems wrong?)
ruler = nlp.add_pipe("entity_ruler") # adds EntityRuler to pipelines and returns reference

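patterns = [
    {"label": "GPE", "pattern": "Treblinka"},
    {"label": "ORG", "pattern": "Wikipedia"},

    {"label": "TERM", "pattern": "depot"},
    # a string pattern is matched as an exact phrase, not as a regular expression,
    # so this one never matches the example sentence below
    {"label": "F", "pattern": "vehicle maintenance facilit.*"},
]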

ruler.add_patterns(patterns)

# test on example sentence
text = "The term depot is not used in reference to vehicle maintenance facilities in the U.S"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

depot TERM
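
Only "depot" is found here, because a plain string pattern is matched literally and the `.*` is not treated as a regular expression. A sketch of one way to get the intended behaviour is a token pattern, where a regular expression can be applied per token through the `REGEX` operator. The `nlp_rules` name is just there to avoid clobbering `nlp`.

```python
import spacy

nlp_rules = spacy.blank("en")             # tokenizer only, no trained components
ruler = nlp_rules.add_pipe("entity_ruler")

ruler.add_patterns([
    {"label": "TERM", "pattern": "depot"},
    # Token pattern: one dict per token; the last token may be facility, facilities, ...
    {"label": "F", "pattern": [
        {"LOWER": "vehicle"},
        {"LOWER": "maintenance"},
        {"LOWER": {"REGEX": "^facilit"}},
    ]},
])

doc = nlp_rules("The term depot is not used in reference to vehicle maintenance facilities in the U.S")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

Token patterns are more verbose than strings, but they match on token attributes, so case and inflection can be handled without listing every variant.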


In [9]:
from spacy.tokens import Doc
from spacy.training import Example

predicted = Doc(nlp.vocab, words=["Apply", "some", "sunscreen"])
token_ref = ["Apply", "some", "sun", "screen"]
tags_ref = ["VERB", "DET", "NOUN", "NOUN"]
example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})
print( 'predicted', example.predicted )
print( 'reference', example.reference )

predicted Apply some sunscreen 
reference Apply some sun screen 


In [20]:

count = collections.defaultdict(int)

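# bwb_text and dutch come from the cells above; random_sample yields (url, text) pairs
for _url, plaintext in bwb_text.data.random_sample( 10 ):

    for paragraph in re.split( '\n{2,}', plaintext ):
        doc = dutch( paragraph )   # the texts are Dutch, so use the Dutch model loaded earlier
        #print( paragraph )
        for chunk in doc.noun_chunks:
            #print( 'NC[%r]'%chunk )
            count[chunk.text] += 1

# sort by count, so the most frequent noun chunks are printed last
it = list(count.items())
it.sort(key = lambda x:x[1])
for s, cnt in it:
    print('%s\t%s'%(cnt,s))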