In [None]:
(BARELY EVEN A FIRST VERSION YET)

## What does Named Entity Recognition do for you?

### What

[What even are these "entities"](https://en.wikipedia.org/wiki/Named-entity_recognition)?

What we usually call Named Entity Recognition (and sometimes terms like 'entity extraction') wants to recognize things that fall within some well-defined,
pre-defined categories, which ends up primarily focusing on fairly consistently named things - hence _named_ entities.



#### Basic example of NER

Classical examples of NER take a text-only sentence like "Jim bought 300 shares of Acme in 2006 in Paris for 300 dollars" (a _very_ cherry-picked example),
and point out that:

In [3]:
import spacy, spacy.displacy
english = spacy.load('en_core_web_lg')
spacy.displacy.render( english("Jim bought 300 shares of Acme in 2006 in Paris for 300 dollars") , style='ent', jupyter=True)

(GPE is 'Geopolitical entity', other models might call this LOCATION)

The "fairly consistent naming" wording tried to avoid some philosophical questions. 
We like to think a name points at the same thing no matter what, 
and/or there is just one, and/or there may be multiple but we can tell from context what _type_ this one is, 
yet this often isn't true. See concepts like [rigid designators](https://en.wikipedia.org/wiki/Rigid_designator). 

The numbers help point this out: the underlying methods don't really care whether our text happens to contain, or consist entirely of, numbers.
You can't say they are names, but they are still entities you can sort into categories - say, the 300 is first identified as just counting something (CARDINAL),
and then subsequently identified as a MONEY when it had 'dollars' after it.

It might also find them as time expressions, percentages, medical codes - if actually trained that way.

Which also helps point out that NER methods, in general, tend to mix a "does this look like a thing we want to report"
with a "what does it look like in context?", and that what it leans on more may vary per category.
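To make that mix a little more concrete, here is a deliberately naive sketch. It is not how spaCy (or any trained model) works; `toy_number_labels` and its rules are made up for illustration. It first decides that something looks like a number, and only then looks at the neighbouring words to pick a category.

```python
import re

def toy_number_labels(text):
    """Label each run of digits by looking at the words directly around it (toy rules only)."""
    results = []
    for m in re.finditer(r'\d+', text):
        before = text[:m.start()].split()[-1:]   # the word just before the number, if any
        after  = text[m.end():].split()[:1]      # the word just after the number, if any
        if after and after[0].strip('.,').lower() in ('dollars', 'euros'):
            label = 'MONEY'      # a currency word follows
        elif before == ['in'] and len(m.group()) == 4:
            label = 'DATE'       # "in 2006" looks like a year
        else:
            label = 'CARDINAL'   # it is a number, but we can't say more
        results.append((m.group(), label))
    return results

labels = toy_number_labels("Jim bought 300 shares of Acme in 2006 in Paris for 300 dollars")
print(labels)
```

A trained model learns such cues from examples rather than from hand-written rules, which is also why it only finds categories it was actually trained on.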

### Why / expectation management

You can imagine that that cherry-picked example was not representative.

So before we get deeper into examples, let's not skip a question:

**This is a means. What is the goal?** What do we hope to extract?

**Also, how good is it at any particular goal?**
And if it has limitations, can we work with them, or should we stop wasting our time
and look to other methods instead?


Generally, NER is useful when 
- these categories are actually useful to you
- these categories are somewhat flexible and somewhat large (if small and fixed, you could do it with some basic pattern matching)
- we actually get consistent detection (TODO: for one of the varied reasons that NER works)

NER is not _necessary_ when
- basic pattern matching is enough and/or
- You have a complete and exhaustive list of things you want to extract

Also, if you can't think of why this is useful, it probably isn't.

#### Some Relevant projects

More in our area, consider "[Named Entity Recognition in Indian court judgments](https://github.com/Legal-NLP-EkStep/legal_NER)".

There is an [online version of that and other analysis](https://summarizer-fer6v2lowq-uc.a.run.app/searchdetails#ANALYSIS)
that you can inspect (it should default to an example case; note that the URL suggests this is preliminary and might move).

If you do, you may note that not every person or organisation is marked.
And yet what it does is still very useful, and should 
describe the topics of the case quite well.

Say, 
- You might expect it to try its best for every sentence, figuring out what clearly-named things it talks about.
  NLPers dream of more and more semantic detailing.

- Yet NER often ends up getting used for document-wide information extraction,
  because for that you don't need to catch every mention to still be doing great.

That NER model says that, beyond the basic set of entity types, it extracts e.g. 
LAWYER, PETITIONER, RESPONDENT, JUDGE, COURT, WITNESS, STATUTE, PROVISION, PRECEDENT.

That's quite a boast, as several of those are not very rigid designators.
They will likely be matched in an introduction that is regular to the point it is almost a template
(which might e.g. add ` (appellant)` to a name).

Even if it never matches the same name again in the same document, 
it may still be great for information extraction _about the document as a whole_,
while being less useful for generic NLP of its parts.
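A minimal sketch of what "about the document as a whole" can mean in practice: collapse the individual mentions into per-label counts. The `mentions` list is invented for illustration (it stands in for the text/label pairs an NER pass would give you), and `summarize` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Made-up (text, label) pairs, standing in for NER output on one judgment.
mentions = [
    ('Supreme Court of India', 'COURT'),
    ('A. Kumar', 'PETITIONER'),
    ('State of Kerala', 'RESPONDENT'),
    ('Supreme Court of India', 'COURT'),
    ('Indian Penal Code', 'STATUTE'),
    ('Indian Penal Code', 'STATUTE'),
]

def summarize(mentions):
    """Collapse individual mentions into, per label, a count of each distinct text."""
    by_label = {}
    for text, label in mentions:
        by_label.setdefault(label, Counter())[text] += 1
    return by_label

summary = summarize(mentions)
for label, counts in sorted(summary.items()):
    print(label, counts.most_common(3))
```

Missing a few mentions changes the counts, but rarely changes which names show up at all, which is why this use is forgiving of imperfect recall.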

Documents may even specifically repeat the name and literally _say_ 'appellant' and 'judge' exactly for clarity, 
which is great for more than basic information extraction - but only if you know it's happening.

And, as research we will mention later points out, another entity that pops up is _roles_.
Sure, you have marked some named people, but it may be just as interesting, say, what they do within the organisation.
This may however be a more complex task.

<!-- -->

Some points to consider:
- The difference between what it ends up doing well and what it does only somewhat well can be very real for your given purpose or project,
  or at the very least introduce a lot of work training it to do what you want.
- In the end, it does what you make it do, or train it for. And that may easily be incomplete, or a moving target.

Other examples that may be interesting, in telling us what others have done and how they have fared:

"[Evaluation of Named Entity Recognition in Dutch online criminal complaints](https://clinjournal.org/clinj/article/view/65)"
sticks with the classics like Person, Location, Product, Organisation, Event,
and seems to conclude it's useful for basic information retrieval of topic.

"[Deep Learning for Legal Tech: exploring NER on Dutch court rulings]()" defines
- locations (LOC),
- courts (LEG),
- dates (DATE), 
- section headers (SECTION), 
- non-standardised number/letter combinations that are intended as identifier (ID), 
- ECLI (ECLI) and the Dutch Case Law identifier (NJ) that preceded it
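
Some of these categories are regular enough that plain pattern matching may already get you far, which ties back to the earlier point about when NER is not necessary. As a rough sketch, assuming the usual ECLI layout (country code, court code, year, ordinal number, separated by colons), a regular expression like the one below finds them. The pattern is a simplification and the two identifiers in the example text are made up.

```python
import re

# Simplified ECLI pattern (an assumption, not a validated implementation of the standard):
#   ECLI : two-letter country : court code : four-digit year : ordinal number
ECLI_RE = re.compile(r'\bECLI:[A-Z]{2}:[A-Z0-9]{1,7}:\d{4}:[A-Z0-9.]{1,25}\b')

text = 'Zie ECLI:NL:HR:2019:1284 en ECLI:NL:RBAMS:2020:4557.'   # made-up identifiers
found = ECLI_RE.findall(text)
print(found)
```

The non-standardised ID category above is exactly where this stops working, and where something that looks at context starts to earn its keep.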

"[Named Entity Recognition of Legislation References](http://arno.uvt.nl/show.cgi?fid=160982)"
focuses mostly on identifiers.

"[Named entity recognition and resolution in legal text](https://www.researchgate.net/publication/220745968_Named_Entity_Recognition_and_Resolution_in_Legal_Text)"
seems to go for Jurisdiction, Court, Title, Doctype, Judge, though it seems to be more pattern matching than NER.

TODO: find more


<!--
[or](https://arxiv.org/pdf/2103.06268) [other](http://nlp.cs.aueb.gr/pubs/icail2017.pdf) papers rarely give


Unfortunately, in general there isn't a lot of data to train legal tasks like this, less so for NER, 
less so for Dutch. We will have to do some of this ourselves.
-->

## How well does it work?

NER is now typically smarter than just matching known substrings or lemmas - we can often estimate from context
that, in "X bought 300 shares of Y in Z", X is some sort of actor, Y is something you can buy.
And in that context, it can figure that if Z is a number, it is probably a DATE and not a MONEY or unknown CARDINAL number.

However, in that sentence Z could be a place (in Greece) or time (in 2006) or something else (in good confidence, in a panic),
and getting that amount of detail right quickly becomes a much wider NLP question, 
so even with contextual awareness at work, NER tends to focus on categories that are relatively simple to learn well.

### What entities do spaCy's basic Dutch models already know?

In [12]:
import spacy, spacy.displacy
import wetsuite.datasets, wetsuite.helpers.net

# load model to use
dutch = spacy.load('nl_core_news_lg')
# text to apply it to
cherry_picked_text = wetsuite.datasets.load('bwb-mostrecent-text').data.get('BWBR0022604')
# parse
partial_doc = dutch(cherry_picked_text[:6000]) # cut off at roughly a screen's worth of text, we get the idea.
# visualize to show us the entities
spacy.displacy.render(partial_doc, style='ent', jupyter=True)

In [None]:
# similarly, in court cases:
dutch = spacy.load('nl_core_news_lg')

for case_id, casedict in wetsuite.datasets.load('rechtspraaknl-struc').data.random_sample(5):
    # fetch the text before testing it; not every case has one
    uitspraak = casedict.get('uitspraak')

    if uitspraak is not None:
        print( '='*80 )
        print( case_id )

        uitspraak = uitspraak[:4000] # only the start of the text, we get the idea
        doc = dutch( uitspraak )
        #spacy.displacy.render(doc, style='ent')
        for ent in doc.ents:
            if ent.label_ in ('CARDINAL',): # plain numbers are rarely interesting here
                continue
            print( f'{repr(ent.text):20s} {ent.label_}' )
            #print( ent )
        #And, to see how complete it might be, the actual text:
        print( '-'*80 )
        print(uitspraak)

That... leaves something to be improved.

## Can we make it do more?

Yes, but that takes training.

Which takes a whole bunch of examples.

Which takes knowing what you want to detect.


In [43]:
wetnamen = set()
for id, (namen1, namen2) in wetsuite.datasets.load('wetnamen').data.items():
    for nm in namen1+namen2:
        nm = nm.strip(' -')#.lower()
        # turns out a 4MByte regexp isn't very fast, so we take out the very long stuff. TODO: do all this in the generation of that list.
        if len(nm)>2  and len(nm) < 50:
            wetnamen.add(nm)

#random.sample( wetnamen, 30 )

In [45]:
import re

count = 0
frag_with_wetnaam = []

print('PRE')
# escape each name for use in a regexp, and put longer names first so they are tried before shorter ones
nn = sorted( ( re.escape(naam)  for naam in wetnamen ), key=len, reverse=True )

_re_str = r'\b(%s)\b'%( '|'.join( nn ) ) # this is a monster of a regexp
print( len(_re_str) )
_re_namen = re.compile( _re_str)#, flags=re.I )

PRE
780299
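
Why sort the names longest-first? Python's `re` tries alternatives left to right and takes the first one that lets the whole pattern match, not the longest. A small sketch with a two-name list (where the bare `Wet` entry is made up to force the issue) shows the difference:

```python
import re

names = ['Wet', 'Wet op de rechterlijke organisatie']   # mini list for illustration
text  = 'Zie de Wet op de rechterlijke organisatie.'

def build(names_in_order):
    """Build one alternation regexp from the names, in exactly the given order."""
    return re.compile(r'\b(%s)\b' % '|'.join(re.escape(n) for n in names_in_order))

short_first = build(sorted(names, key=len))                # shortest alternative tried first
long_first  = build(sorted(names, key=len, reverse=True))  # longest alternative tried first

short_match = short_first.search(text).group(1)
long_match  = long_first.search(text).group(1)
print(repr(short_match))
print(repr(long_match))
```

With the short name first, the match stops at `Wet` because the word boundary right after it is satisfied, and the full name is never considered.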


In [None]:
import wetsuite.helpers.spacy, wetsuite.helpers.notebook
# This is a rather inefficient way to do it.
print('PROC')
#for id, cvdr_xmlbytes in wetsuite.helpers.notebook.ProgressBar( wetsuite.datasets.load('cvdr-mostrecent-xml').data.random_sample(1000) ):
    #for fragment in wetsuite.helpers.split.feeling_lucky( cvdr_xmlbytes ): # we use split to place the quote within a paragraph-or-so.
for id, cvdr_txt in wetsuite.helpers.notebook.ProgressBar( wetsuite.datasets.load('cvdr-mostrecent-text').data.random_sample(100) ):
    for fragment in wetsuite.helpers.spacy.sentence_split(cvdr_txt, as_plain_sents=True):
    #for fragment in re.split('\n{2,}', cvdr_txt):
        for m in re.finditer(_re_namen, fragment):
            print(id, m.groups()[0])
            frag_with_wetnaam.append( (fragment, m.start(), m.end()) )
            # TODO: move all matches in a fragment into one list; spacy want it that way

print( len(frag_with_wetnaam) ) # how many matches we collected (count itself is never updated)
#frag_with_wetnaam

for fragment, start, end in frag_with_wetnaam:#random.sample( frag_with_wetnaam, 1 ):
    print( fragment[:start], '[' , fragment[start:end], ']', fragment[end:] )
    print('-----')
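
The TODO above, collecting all matches in a fragment into one list, could look roughly like the sketch below. It produces `(text, {'entities': [(start, end, label), ...]})` tuples, which is the shape spaCy's older training examples used and which, as far as I know, `Example.from_dict` in spaCy 3 still accepts as its annotation dict. The label `'WET'` and the helper name `group_matches` are assumptions made for illustration, and the example fragments are written by hand.

```python
from collections import defaultdict

def group_matches(frag_matches, label='WET'):
    """Turn [(fragment, start, end), ...] into [(fragment, {'entities': [(start, end, label), ...]}), ...]."""
    spans = defaultdict(list)
    for fragment, start, end in frag_matches:
        spans[fragment].append((start, end, label))
    # de-duplicate and sort the spans within each fragment
    return [(fragment, {'entities': sorted(set(ents))}) for fragment, ents in spans.items()]

# Hand-written stand-in for frag_with_wetnaam
example_matches = [
    ('Zie de Algemene wet bestuursrecht en de Gemeentewet.', 7, 33),
    ('Zie de Algemene wet bestuursrecht en de Gemeentewet.', 40, 51),
    ('Dit is geregeld in de Gemeentewet.', 22, 33),
]

training = group_matches(example_matches)
for text, annotation in training:
    print(text, annotation)
```

Note that identical fragments from different documents are merged here, which is probably what you want for training data but does lose where each one came from.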

In [None]:
https://towardsdatascience.com/train-ner-with-custom-training-data-using-spacy-525ce748fab7

https://spacy.io/usage/training

It's easy to be self-congratulatory, but
we can't strongly verify, or quantify, the performance
with the same things we fed in to train it. 
Of course it's going to find most of those.

Maybe it's better at all the things you didn't tell it about. 
Maybe it's worse. 
You wouldn't know.

This is a classical issue in machine learning of any type.

The classical solution is to split the data you have into a training part and a testing part.
You use a good portion to train, and you use the rest to check that what it learned holds up on things it hasn't seen.
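
A minimal sketch of such a split, using only the standard library. The helper name `train_test_split`, the 20% test fraction, and the placeholder `examples` are all arbitrary choices made for illustration.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and cut it into (train, test) parts."""
    items = list(items)
    random.Random(seed).shuffle(items)            # fixed seed so the split is repeatable
    n_test = max(1, int(len(items) * test_fraction))
    return items[n_test:], items[:n_test]

examples = [f'fragment {i}' for i in range(10)]   # placeholder for annotated fragments
train, test = train_test_split(examples)
print(len(train), len(test))
```

For text like ours you may want to split per document rather than per fragment, so that near-identical sentences from one document don't end up on both sides.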



## Unsorted

What kind of terms do we want?

Say, 
* strafbaar feit
* onherroepelijke beslissing
* bestuurlijke autoriteit

Maybe
* feit dat wordt bestraft als vergrijp 
* vergrijp tegen de voorschriften betreffende de orde
voor zover tegen de beslissing beroep op een met name in strafzaken bevoegde rechter is opengesteld


Fuzzy looker
* NP that 

If we take "
Okay, so that's a start, but clearly not tuned to 

* strafbaar feit
