(BARELY EVEN A FIRST VERSION YET)

## What does Named Entity Recognition do for you?

What we call 'entity extraction' takes text and recognizes things that fall within some well-defined categories.

This ends up primarily focusing on fairly consistently named things - hence _named_ entities and Named Entity Recognition (NER).

A classical example of NER takes a text-only sentence and points out things like:

In [2]:
import spacy, spacy.displacy
english = spacy.load('en_core_web_lg')
spacy.displacy.render( english("Jim bought 300 shares of Acme in 2006 in Paris for 300 dollars") , style='ent', jupyter=True)

# (side note: GPE is 'Geopolitical entity', other models might call this LOCATION)

## Waxing philosophical

### What 

Okay, that's pretty neat, but [what even are these "entities"](https://en.wikipedia.org/wiki/Named-entity_recognition)?


The "fairly consistent naming" wording we used before was trying to sidestep some basic questions,
because they trip us into some philosophical messiness.

We like to think a name points at the same thing no matter what,
and/or there is just one, 
and/or there may be multiple but we can tell from context what _type_ this one is, 
yet this often isn't true.
See also discussions on topics such as [rigid designators](https://en.wikipedia.org/wiki/Rigid_designator).

And note that even in that example, we had some things that aren't names at all,
yet are still things identifiable within a useful category. 

Say, the date.

The numbers more so.
Numbers rarely ever name things nor point to unique things,
but they are still a useful category of _thing_ to extract,
with even some distinction of different kinds/uses of numbers.
- The first 300 is identified as just counting _something_ (CARDINAL),
- the second is identified as a MONEY when it had 'dollars' after it,
- and if you train it enough, you might find numbers as/within time expressions, percentages, medical codes.
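
To make the idea of "different kinds of numbers" concrete, here is a minimal rule-based sketch. It is not how spaCy's model decides (that is learned from context); the rule list, the patterns and the name `label_numbers` are assumptions made up for this illustration. It only shows that the words around a number are what turn it into a MONEY, a DATE or a plain CARDINAL.

```python
import re

# Ordered rules: earlier rules claim their spans first.
# Labels mimic the ones in the example above; the patterns are illustrative assumptions.
NUMBER_RULES = [
    ("MONEY",    re.compile(r"\b\d+(?:[.,]\d+)?\s*(?:dollars|euros?)\b")),
    ("PERCENT",  re.compile(r"\b\d+(?:[.,]\d+)?\s*(?:%|percent\b)")),
    ("DATE",     re.compile(r"\b(?:1[89]|20)\d{2}\b")),   # a bare four-digit year
    ("CARDINAL", re.compile(r"\b\d+\b")),                 # any number left over
]

def label_numbers(text):
    """Return non-overlapping (start, end, label, matched_text) tuples in text order."""
    spans = []
    for label, pattern in NUMBER_RULES:
        for m in pattern.finditer(text):
            # Skip a match that overlaps a span claimed by an earlier rule.
            if any(m.start() < end and start < m.end() for start, end, _, _ in spans):
                continue
            spans.append((m.start(), m.end(), label, m.group()))
    return sorted(spans)

sentence = "Jim bought 300 shares of Acme in 2006 in Paris for 300 dollars"
result = label_numbers(sentence)
for span in result:
    print(span)
```

Rules like these break quickly on real text (a four-digit number is not always a year), which is part of why learned models are attractive here.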

Modern NER is typically smarter than just matching known substrings or lemmas. Many will learn from the context, and look for similar context when matching.
The machine learning that backs various NER models might softly learn that in "X bought 300 shares of Y in Z", X is some sort of actor, Y is something you can buy.

Even then, even with contextual awareness at work, NER tends to focus on categories that are relatively simple to learn well.

### Why / expectation management

You can imagine that that first buying-shares example was cherry-picked and not representative.

So before we get deeper into examples, let's not skip a more practical question:

**This is a means. What is the goal?** 

What do we hope to extract, and for what?

**Also, how good is it at any particular goal?**

And if it has limitations, can we work with them,
or do we not waste our time,
and instead look to better methods toward our goal?


Generally, NER is useful when 
- these categories are actually useful to you
- these categories are somewhat flexible and somewhat large (if small and fixed, you could do it with some basic pattern matching)
- we actually get consistent detection (TODO: for one of the varied reasons that NER works)

NER is not _necessary_ when
- basic pattern matching is enough and/or
- You have a complete and exhaustive list of things you want to extract
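
As a sketch of that "complete list" case: if the names you care about form a closed list, one compiled regular expression already does the job. The list below is a small hand-picked stand-in and `compile_term_matcher` is a hypothetical helper, not part of any library.

```python
import re

def compile_term_matcher(terms):
    """Build one regex that matches any of the given literal terms as whole words."""
    # Longest first, so a longer term wins over a shorter term it starts with.
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile(r"\b(?:%s)\b" % "|".join(re.escape(t) for t in ordered))

# Stand-in for a complete list of names; an assumption for illustration only.
known_laws = ["Wegenverkeerswet 1994", "Wegenverkeerswet", "Algemene wet bestuursrecht"]
matcher = compile_term_matcher(known_laws)

text = "Zie de Wegenverkeerswet 1994 en de Algemene wet bestuursrecht."
found = [m.group() for m in matcher.finditer(text)]
print(found)
```

This finds exactly what is on the list and nothing else, which is the point: it is cheap and predictable, but it will never pick up a name you did not give it.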


Some points to consider:
- If you can't think of why this is useful, then it probably isn't.
- The difference between what it ends up doing well and what it does only _somewhat_ can be very real for any given purpose or project, or at the very least introduce a lot of work training it to do what you want well enough.
- In the end, it does what you make it do, or train it for. And that may be easily incomplete, or a moving target.

### What techniques does NER include?

Perhaps anything that finds useful things in useful categories.

In the question of "is it NER", people also make a large point of how it's done.

NLP people seem to consider it NER when it picks up things beyond the original fixed list we've given it.
Which makes sense, because otherwise you would just match for that list and call it a day.

That usually points to machine learning, picking similar things in similar contexts.

Okay, so what if we do:

In [9]:
import wetsuite.helpers.patterns,  spacy, spacy.displacy
dutch = spacy.load('nl_core_news_lg')
doc = dutch( " text CVDR101405/1 text Stb. 2011, 35 text 33684R2020 text ECLI:NL:HR:2005:AT4537 text art. 2.12, eerste lid, aanhef en onder a text" )
wetsuite.helpers.patterns.mark_references_spacy( doc, wetsuite.helpers.patterns.find_references( doc.text, ljn=True)  )
spacy.displacy.render( doc, style='ent', jupyter=True, options={"colors":{'ECLI':'#ffaaaa', 'CVDR':'#7755ff', 'CELEX':'#ffaaff', 'ARTIKEL':'#ffaa77'}} )

(Putting aside for a moment that we are leading an answer by making it _look_ the same as the earlier entity example...)

Was that not NER, because it's not done by machine learning but a bunch of specific pattern matching?
Or because these aren't names really?

Was that NER, because it picks useful entities in specific categories?

Does it really matter what you call it?

...well yes, for a common reason we use names around technical topics:
we put a name to a technique so that we can talk about how it works, and what it does for us.

As just mentioned, NER often means machine learning,
often on a word-sequence basis, and that is actually an ill fit for this particular pattern matching:
these aren't patterns of words, they are patterns of characters,
and a model that works on words, which is how NER is typically set up,
may not learn which characters and which context to pay attention to.
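
To illustrate what "patterns of characters" means, here is a rough sketch for one identifier type, the ECLI. The pattern is a simplified approximation of the ECLI shape (`ECLI:country:court:year:ordinal`) written for this example; it is an assumption, and not the pattern that `wetsuite.helpers.patterns` uses.

```python
import re

# Simplified ECLI shape: ECLI:<2-letter country>:<court code>:<4-digit year>:<ordinal>
# Assumption: a loose approximation for illustration, not a validated implementation.
ECLI_RE = re.compile(
    r"\bECLI:[A-Z]{2}:[A-Z0-9]{1,7}:[0-9]{4}:[A-Z0-9.]{1,25}\b",
    re.IGNORECASE,
)

def find_eclis(text):
    """Return (start, end, matched_text) for each ECLI-shaped substring."""
    return [(m.start(), m.end(), m.group()) for m in ECLI_RE.finditer(text)]

sample = " text ECLI:NL:HR:2005:AT4537 text, and again ECLI:NL:HR:2005:AT4537."
eclis = find_eclis(sample)
print(eclis)
```

Nothing here looks at neighbouring words at all: the colons, the fixed prefix and the digit counts carry all the information.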

At the same time, a natural reference like "artikel 3, lid 5, van wet Bla" 
turns out to have so many variations that a rigid pattern matcher may miss many of them.

If you look at our code that does that, it's an experiment that specifically avoids being a fixed pattern matcher,
but maybe NER is brilliant at that particular one.
That's the kind of thing you want to talk about.

## Getting more applied

### What does it do out of the box?

While a linguist might dream of semantically marking _all_ the things a text talks about,
with good results in any sentence and context, pretrained NER is often incomplete enough that it won't catch everything you want,
not because such models couldn't, but because out of the box, they are not specialized to anything much.

To wit:

In [43]:
import spacy, spacy.displacy,  wetsuite.datasets, wetsuite.helpers.net
dutch = spacy.load('nl_core_news_lg')                                                      # load model to use (which includes pretrained NER)
cherry_picked_text = wetsuite.datasets.load('bwb-mostrecent-text').data.get('BWBR0022604') # load text to apply it to
partial_doc = dutch(cherry_picked_text)[500:600]                                         # parse a (specifically picked) fragment of text - we get the idea.
spacy.displacy.render(partial_doc, style='ent', jupyter=True)                              # visualize to show us the entities

In [48]:
# similarly, in court cases:

for casedict in wetsuite.datasets.load('rechtspraaknl-struc').data.random_values(1):
    doc = dutch( casedict.get('bodytext') )
    spacy.displacy.render(doc[50:150], style='ent', jupyter=True) # that token slice tries to cut off some introduction, and leave a smallish block of text
    print('----------')

----------


Please re-run that a bunch of times, to see what's there, and what's missing, and the amount of variation.
You'll probably notice that this leaves something to be desired.


There are also some things it picks out that we can be pretty sure it wasn't trained on, which is part of the point of NER.

There are also a bunch of things it gets wrong, like in some tests it would call a confiscatiebevel a location - that sort of thing.

Then again, if your goal is primarily information extraction _at document level_ - what it is about,
what organisations and people are involved, you are often still doing pretty well
even if half of the mentions aren't detected.
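
A minimal sketch of that document-level view: collapse the individual detections into the set of distinct entities per category. With spaCy, the input could be built as `[(ent.text, ent.label_) for ent in doc.ents]`; the detections below are invented for illustration, and `document_level_entities` and its crude normalisation are assumptions.

```python
from collections import defaultdict

def document_level_entities(mentions):
    """Collapse individual (text, label) detections into the distinct entities per label."""
    summary = defaultdict(set)
    for text, label in mentions:
        # Crude normalisation (whitespace and case), an assumption of this sketch.
        summary[label].add(" ".join(text.split()).lower())
    return dict(summary)

# Invented detections: imagine each entity occurs more often in the document
# than the model managed to detect.
detected = [("Gemeente Utrecht", "ORG"), ("gemeente  Utrecht", "ORG"), ("Jansen", "PERSON")]
summary = document_level_entities(detected)
print(summary)
```

As long as each entity is detected at least once, missed mentions do not change this summary, which is why a mediocre per-mention hit rate can still be good enough for this goal.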

But in a few paragraphs, we snuck in a new goal to justify our method,
quietly adjusting our use to the limitations. That's not necessarily how you want to work.

### Some relevant projects

Let's look at someone who has already done the work.
Consider "[Named Entity Recognition in Indian court judgments](https://arxiv.org/pdf/2211.03442)"
(and [its code](https://github.com/Legal-NLP-EkStep/legal_NER)).

There is an [online version of that and other analysis](https://summarizer-fer6v2lowq-uc.a.run.app/searchdetails#ANALYSIS)
that you can inspect (it should default to an example case. Note that the URL suggests this might be a preliminary project and might move).

If you do, you can note both that it's a bunch better than our last example, 
but also that not every person or organisation is marked.

The things that it extracts still seem very useful, say, 
for describing the parties involved and probably the basic topics of the case.

Consider that this project, beyond the basic set of entity types, says it extracts e.g. 
LAWYER, PETITIONER, RESPONDENT, JUDGE, COURT, WITNESS, STATUTE, PROVISION, PRECEDENT.

That's quite a boast, as various of those are not very rigid designators.
Say, if a personal name appears, it will not be clear which role they play - unless context is very clear.

Now, legal text invests time into being unambiguous, 
and court cases everywhere tend to have an introduction that is regular 
to the point it is almost a template (which might e.g. add ` (appellant)` to a name).
That name will likely be matched as an appellant there, and only there.

So this is NER used to do something more like pattern recognition. 
Which is fine (it overlaps with what it does anyway), and useful, 
yet you want to know why the method might tell you it's APPELLANT once,
and perhaps PERSON everywhere else.
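
If you know that is what is happening, you can work with it. The sketch below assumes an introduction that marks roles as `Name (role)`, reads the roles from there, and then labels every later mention of the same name. The pattern, the example text and both function names are hypothetical; real introductions vary a lot more than this.

```python
import re

# Assumption: roles appear as "Name (role)", with capitalised name parts.
ROLE_RE = re.compile(
    r"(?P<name>[A-Z][\w.]*(?: [A-Z][\w.]*)*) \((?P<role>appellant|respondent)\)"
)

def roles_from_introduction(text):
    """Map each role-marked name to an upper-cased role label."""
    return {m.group("name"): m.group("role").upper() for m in ROLE_RE.finditer(text)}

def label_all_mentions(text, roles):
    """Return (start, end, role) for every literal occurrence of a known name."""
    spans = []
    for name, role in roles.items():
        for m in re.finditer(r"\b%s\b" % re.escape(name), text):
            spans.append((m.start(), m.end(), role))
    return sorted(spans)

# Invented example text.
text = ("In the matter between Smith (appellant) and Acme Corp (respondent). "
        "The court heard that Smith bought shares of Acme Corp.")
roles = roles_from_introduction(text)
spans = label_all_mentions(text, roles)
for start, end, role in spans:
    print(role, text[start:end])
```

This only propagates exact repeats of the name; variants such as initials or "the appellant" would need their own handling.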


And if it never matches the same name again in the same document,
it may be less useful for generic NLP but still absolutely great for information extraction _about the document as a whole_. 


Documents may even specifically avoid repeating the name and use the literal text `appellant`, `judge`, etc. 
for clarity, which is great for even more than basic information extraction -- but only if you know it's happening.

And, as research we will mention later shows, another entity type that pops up is _roles_.
Sure, you have marked some named people, but it may be just as interesting to know, say, what they do within the organisation.
This may, however, be a more complex task.

Other examples that may be interesting, in telling us what others have done and how they have fared:

"[Evaluation of Named Entity Recognition in Dutch online criminal complaints](https://clinjournal.org/clinj/article/view/65)"
sticks with the classics like Person, Location, Product, Organisation, Event,
and seems to conclude it's useful for basic information retrieval of topic.

"[Deep Learning for Legal Tech: exploring NER on Dutch court rulings]()" has dates, courts, section headers, and case identifiers.

"[Named Entity Recognition of Legislation References](http://arno.uvt.nl/show.cgi?fid=160982)"
is mostly just about identifiers.

"[Named entity recognition and resolution in legal text](https://www.researchgate.net/publication/220745968_Named_Entity_Recognition_and_Resolution_in_Legal_Text)"
seems to go for Jurisdiction, Court, Title, Doctype, Judge, though seems more pattern matching than NER.

TODO: find more


<!--
[or](https://arxiv.org/pdf/2103.06268) [other](http://nlp.cs.aueb.gr/pubs/icail2017.pdf) papers rarely give


Unfortunately, in general there isn't a lot of data to train legal tasks like this, less so for NER, 
less so for Dutch. We will have to do some of this ourselves.
-->

## Can we make it do more?

Yes, but again, that takes training, and that takes annotation.

And some decisions before that, to know what would be interesting, what to spend time on.

So what _do_ we want?

Well, roles are interesting. Above, things like staatssecretaris aren't marked.

But if we start adding wishes, some of them might shift out of the scope of NER.
Say, phrases that indicate resulting actions might be interesting.

Sure, there are terms like "strafbaar feit", "ongegrond verklaard", "onherroepelijke beslissing", "bestuurlijke autoriteit",
"onherroepelijke beslissing van een bestuurlijke autoriteit", but unless you can think of a clear, exclusive category name for them,
maybe they're not NER in the classical sense. 
You might be able to abuse NER models for this, but it might take more wrangling.

And then there are a lot of variations - do you also want "feit dat wordt bestraft als vergrijp"?
If so, then you need to train for that too.
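
One cheap way to get a first batch of annotation is to mark known phrases in text and use those spans as training examples. The sketch below produces character-offset spans in the `{"entities": [(start, end, label)]}` shape that spaCy's `Example.from_dict` accepts; the label `LEGAL_TERM`, the helper name and the example sentence are assumptions made up for illustration.

```python
import re

def annotate_fragment(fragment, phrases, label):
    """Return (fragment, {"entities": [(start, end, label), ...]}) with non-overlapping spans."""
    # Longest phrase first: where phrases nest, keep only the longest one,
    # because entity spans in one annotation should not overlap.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(p) for p in ordered))
    entities = [(m.start(), m.end(), label) for m in pattern.finditer(fragment)]
    return fragment, {"entities": entities}

phrases = [
    "onherroepelijke beslissing",
    "bestuurlijke autoriteit",
    "onherroepelijke beslissing van een bestuurlijke autoriteit",
]
fragment = "Het betreft een onherroepelijke beslissing van een bestuurlijke autoriteit."
text, annotation = annotate_fragment(fragment, phrases, "LEGAL_TERM")
print(annotation)
```

Note that this only teaches a model the phrases you already listed, plus whatever it generalises from their context. Variations such as "feit dat wordt bestraft als vergrijp" still have to be found and annotated some other way.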

## TEST

<!--

wetnamen = set()
for id, (namen1, namen2) in wetsuite.datasets.load('wetnamen').data.items():
    for name in namen1+namen2:
        name = name.strip(' -')#.lower()
        # turns out a 4MByte regexp isn't very fast, so we take out the very long stuff. TODO: do all this in the generation of that list.
        if len(name)>2  and len(name) < 50:
            wetnamen.add(name)

#random.sample( wetnamen, 30 )


import re
count = 0
frag_with_wetnaam = []

print('PRE')
nn = sorted( list( re.escape(naam)  for naam in wetnamen ), key=lambda s:len(s), reverse=True )

_re_str = r'\b(%s)\b'%( '|'.join( nn ) ) # this is a monster of a regexp
print( len(_re_str) )
_re_namen = re.compile( _re_str)#, flags=re.I )


import wetsuite.helpers.spacy
# This is a rather inefficient ways to do it.
print('PROC')
#for id, cvdr_xmlbytes in wetsuite.helpers.notebook.ProgressBar( wetsuite.datasets.load('cvdr-mostrecent-xml').data.random_sample(1000) ):
    #for fragment in wetsuite.helpers.split.feeling_lucky( cvdr_xmlbytes ): # we use split to place the quote within a paragraph-or-so.
for id, cvdr_txt in wetsuite.helpers.notebook.ProgressBar( wetsuite.datasets.load('cvdr-mostrecent-text').data.random_sample(100) ):
    for fragment in wetsuite.helpers.spacy.sentence_split(cvdr_txt, as_plain_sents=True):
    #for fragment in re.split('\n{2,}', cvdr_txt):
        for m in re.finditer(_re_namen, fragment):
            #print(id, m.groups()[0])
            frag_with_wetnaam.append( (fragment, m.start(), m.end()) )
            # TODO: move all matches in a fragment into one list; spacy want it that way

count
#frag_with_wetnaam



for fragment, start, end in frag_with_wetnaam:#random.sample( frag_with_wetnaam, 1 ):
    print( fragment[:start], '[' , fragment[start:end], ']', fragment[end:] )
    print('-----')
    
-->

<!--

It's easy to be self congratulatory, 
we can't strongly verify, or quantify, the performance
with the same things we fed in to train it. 
Of course it's going to find most of those.

When we give it a list we don't care about that list - it better be good at that.

We care about all the things you didn't tell it about, but we figure are relevant.
Maybe it's better at those. Maybe it's worse. You wouldn't easily know.

This is a classical issue in machine learning of any type.

The classical solution is to split the data you fed in into a training part, and testing part.
You only use a good portion to try to , and you use the rest to prove that's true.

https://towardsdatascience.com/train-ner-with-custom-training-data-using-spacy-525ce748fab7

https://spacy.io/usage/training

-->