(BARELY A FIRST VERSION)

## What does Named Entity Recognition do for you?

What we call 'entity extraction' takes text and recognizes things that fall within some well-defined categories.

We already have part of speech tagging and such,
so this ends up primarily focusing on fairly consistently named things - hence _named_ entities and Named Entity Recognition (NER).

A classical example of NER is to take a text-only sentence and point out the entities in it:

In [2]:
import spacy, spacy.displacy
english = spacy.load('en_core_web_lg')
spacy.displacy.render( english("Jim bought 300 shares of Acme in 2006 in Paris for 300 dollars") , style='ent', jupyter=True)

# (side note: GPE is 'Geopolitical entity', other models might call this LOC(ATION))

## Waxing philosophical

### What 

Okay, that's pretty neat, but [what even are these "entities"](https://en.wikipedia.org/wiki/Named-entity_recognition)?

The wording of "fairly consistent naming" tried to sidestep this question,
because we easily trip ourselves into some philosophical messiness.

We like to think a name points at the same thing no matter what,
preferably there's just one, maybe there may be multiple but we can tell from context which one,
yet this often isn't true. Even telling what _kind_ of thing is sort of iffy.
See also discussions on topics such as [rigid designators](https://en.wikipedia.org/wiki/Rigid_designator).

---

And even in such a cherry-picked, textbook example, we had some things that aren't names at all,
yet are still things identifiable within a useful category.

You can argue that a date isn't a name per se - or maybe that it is, in that it is the best reference we have to point at a specific day, or year, or such.


Numbers, on the other hand, rarely name things, nor point to unique things,
but if we do NER with the goal of information extraction, 
they are often still considered a useful _category of thing_ to extract,
to the point that you might even make a distinction between different kinds/uses of numbers.
- The first 300 is identified as just counting _something_ (CARDINAL),
- the second is identified as MONEY because it has 'dollars' after it,
- and if you train it enough, you might find numbers as/within time expressions, percentages, medical codes, and whatnot.
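To make the "same number, different category" point concrete, here is a toy sketch that labels numbers purely from the words next to them. The rules and the small word lists are hand-written assumptions for illustration only; a trained model learns much softer versions of such cues, and this is not how spaCy does it.

```python
import re

# Hypothetical, hand-written context rules (illustration only).
NUMBER = re.compile(r"\b\d+(?:[.,]\d+)?\b")
MONEY_WORDS = {"dollars", "euros", "euro"}

def classify_numbers(text):
    """Return (number, category) pairs, decided by the neighbouring words."""
    results = []
    for m in NUMBER.finditer(text):
        before = text[:m.start()].split()
        after = text[m.end():].split()
        prev_word = before[-1].lower() if before else ""
        next_word = after[0].lower().strip(".,") if after else ""
        if next_word in MONEY_WORDS:
            category = "MONEY"
        elif next_word == "percent":
            category = "PERCENT"
        elif len(m.group(0)) == 4 and prev_word == "in":
            category = "DATE"      # crude: "in 2006" looks like a year
        else:
            category = "CARDINAL"  # just counting something
        results.append((m.group(0), category))
    return results

print(classify_numbers("Jim bought 300 shares of Acme in 2006 in Paris for 300 dollars"))
# → [('300', 'CARDINAL'), ('2006', 'DATE'), ('300', 'MONEY')]
```

Rules like these break as soon as the wording changes, which is part of why learned context tends to do better.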


Modern NER is typically smarter than just matching known substrings or lemmas.
Many will learn from the context, and look for similar context when matching.

The machine learning that backs various NER might also softly learn that in 
"X bought 300 shares of Y in Z", X is some sort of actor, Y is something you can buy.

Even then, even with contextual awareness at work, NER tends to focus on categories that are relatively simple to learn well.

### Why / expectation management

It is worth mentioning that a lot of NER uses are seen as 'find things that look like names', followed by 'now classify it'.

It turns out it is easier to identify that something is usually a name (e.g. `Paris`), 
or is used as a name (e.g. `the neighbour's garden`),
and much harder to find out what it is pointing at,
because that tends to be highly contextual (Is Paris a person or place? Do you mean the one in France, Canada, New Zealand, or one of the other forty-odd ones?).

Say, if the direct context suggests we are naming a place, and we see Paris, which is a thing that can be a place,
that's a reasonable conclusion.

So finding that something is probably a name, and finding the likely _category_ of that thing, is somewhere in the middle in terms of difficulty.

Before we get deeper into examples, let's not skip a more practical question:

**This is a means. What is the goal?** - what do we hope to extract, and for what?

**How good is it at any particular goal?** - if it has limitations, are those prohibitive or can we make them work for us?
<!-- Sometimes they don't go far enough, because they're not specialized in what you want, but they absolutely could be trained that way. -->

<!--
People come at NER with different goals, and as a result may be happy or disappointed with the same output.
-->
A linguist might dream of semantic marking of all the things that all sentences talk about,
of good results in any sentence and context. ...and be disappointed when, half the time, it missed marking up a name as an entity,
or calls someone a LOC rather than a PERSON.

Someone else, given exactly the same output but wanting to do some document-level information extraction,
might be very happy to see repeated detections of the same name, and some organisations mentioned.
That might be an absolutely great way for them to gather who is involved.


Generally, NER is useful when 
- these categories are actually useful to you
- these categories are somewhat flexible and somewhat large (if small and fixed, you could do it with some basic pattern matching)
- we actually get consistent detection (TODO: for one of the varied reasons that NER works)

NER is not _necessary_ when
- basic pattern matching is enough and/or
- You have a complete and exhaustive list of things you want to extract


Some points to consider:
- If you can't think of why this is useful, then it probably isn't.
- the difference in what it ends up doing well, or doing _somewhat_, can be very real for any given purpose or project,
- or at the very least introduce a lot of work training it to do what you want well enough.
- In the end, it does what you make it do, or train it for. And that may be easily incomplete, or a moving target.

### What's in the NER name?

Is NER anything that finds useful things in useful categories?

It turns out that people also make a large point of how it's done.


In particular, NLP people seem to consider it NER only when it picks up things we didn't tell it about.

Which makes sense, because if you just tried to make a complete list of something, 
it would still consistently miss anything not on that list,
which isn't great for that 'find things that look like names' step we mentioned. 
(it might also get things wrong whenever a name is on several lists, or otherwise could be several things)
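A minimal sketch of that list-based approach, to show where it stops working. The company list is a made-up, deliberately incomplete assumption:

```python
import re

# Hypothetical, deliberately incomplete list of known company names.
KNOWN_COMPANIES = {"Acme", "Globex"}

def find_listed_companies(text, names=KNOWN_COMPANIES):
    """Return listed names that occur in text as whole words, in order of appearance."""
    # Longest names first, so a longer name wins over a shorter one it contains.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")
    return [m.group(1) for m in pattern.finditer(text)]

print(find_listed_companies("Jim bought 300 shares of Acme in 2006"))     # → ['Acme']
print(find_listed_companies("Jim bought 300 shares of Initech in 2006"))  # → []
```

The second sentence has exactly the same context, yet the unlisted name is missed, because a list lookup never looks at the context at all.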

For example, listing all companies will be quickly out of date, but if we use machine learning, 
"Jim bought 300 shares in [a_noun]" (and many sentences like it) 
would nudge its internal statistics to say that 'bought' and 'shares' are something you do with a _company_.


Fine, but if we do:

In [9]:
import wetsuite.helpers.patterns,  spacy, spacy.displacy
dutch = spacy.load('nl_core_news_lg')
doc = dutch( " text CVDR101405/1 text Stb. 2011, 35 text 33684R2020 text ECLI:NL:HR:2005:AT4537 text art. 2.12, eerste lid, aanhef en onder a text" )
wetsuite.helpers.patterns.mark_references_spacy( doc, wetsuite.helpers.patterns.find_references( doc.text, ljn=True)  )
spacy.displacy.render( doc, style='ent', jupyter=True, options={"colors":{'ECLI':'#ffaaaa', 'CVDR':'#7755ff', 'CELEX':'#ffaaff', 'ARTIKEL':'#ffaa77'}} )

Was that NER, because it picks useful entities, and in specific categories?

Or was that _not_ NER, because 
it's not really looking at context or using machine learning,
these are arguably not names,
and it is just some specific pattern matching that doesn't care about context,
and will never learn a variant that we didn't code specifically?
(e.g. `81 R.O.` might in general be way too fuzzy as a pattern to look for, but in specific context _very_ clearly be a named reference, and _even_ the actual designation may very clearly be `Artikel 81 Wet op de rechterlijke organisatie`)

Yet does it really matter what you call it? 
If it works and is useful, not really.
...though if we want to talk about the limitations of this method towards this goal, probably.
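To show what "just some specific pattern matching" amounts to, here is a stand-alone sketch for two of the identifier types in the example above. These regular expressions are simplified assumptions written for this illustration; they are not the patterns that `wetsuite.helpers.patterns` actually uses.

```python
import re

# Simplified, hypothetical patterns (not wetsuite's own implementation).
PATTERNS = {
    "ECLI": re.compile(r"\bECLI:[A-Z]{2}:[A-Z0-9.]{1,7}:\d{4}:[A-Z0-9.]{1,25}\b"),
    "CVDR": re.compile(r"\bCVDR\d+(?:/\d+)?\b"),
}

def find_identifiers(text):
    """Return (start, end, label, matched_text) tuples, sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), label, m.group(0)))
    return sorted(found)

sample = "text CVDR101405/1 text ECLI:NL:HR:2005:AT4537 text 81 R.O. text"
for start, end, label, matched in find_identifiers(sample):
    print(label, repr(matched))
```

Note that `81 R.O.` is not picked up: nothing here can learn a variant that was not coded in, which is exactly the limitation discussed above.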

<!--
As just mentioned, NER often means machine learning,
often on a word-sequence basis, and that is actually an ill fit to do this particular pattern matching,
as it may not understand what characters and what context to pay attention to,
because these aren't patterns of words, they are patterns of characters,
which is not how they are typically set up.

At the same time, a natural reference like "artikel 3, lid 5, van wet Bla" 
turns out to have so many variations that a rigid pattern matcher may miss many of them.

If you look at our code that does that, it's an experiment that specifically avoids being a fixed pattern matcher,
but maybe NER is brilliant at that particular one.
That's the kind of thing you want to talk about.
-->

Much more on topic, and applied, consider "[Named Entity Recognition in Indian court judgments](https://arxiv.org/pdf/2211.03442)"
(and [its code](https://github.com/Legal-NLP-EkStep/legal_NER)).

There is an [online version of that and other analysis](https://summarizer-fer6v2lowq-uc.a.run.app/searchdetails#ANALYSIS)
that you can inspect (it should default to an example case. Note that the URL suggests this might be a preliminary project and might move).

If you poke around those results, you can note both that it's a good deal better than our last example,
but also that not every person or organisation is marked.

The things that it extracts still seem very useful, say, 
to describing the parties involved and probably the basic topics of the case.

This project says that (beyond the basic set of entity types) it extracts e.g. 
LAWYER, PETITIONER, RESPONDENT, JUDGE, COURT, WITNESS, STATUTE, PROVISION, PRECEDENT.

That's quite a boast, as various of those are not very rigid designators.
Say, if a personal name appears, it will not be clear which role they play - except in contexts that make that clear.

In fact, it seems that the reason this works is that court cases start with something template-like that
will contain `Petitioner: A. Name`. In which case, if `A. Name` appears repeatedly, all other mentions would probably be PERSON.

We might call this **excellent abuse** of NER.

Abuse, because it breaks with some classical take on NER, and breaks some expectations you may have.
The fact that it introduces a model hierarchy is somewhat awkward, 
and it uses it for information extraction, something NER isn't necessarily any good at. 

Excellent, because it recognizes what the underlying model is, and does,
that it uses word context to decide what the category of a thing is,
and then uses that well _for its specific goal_.

<!-- -->

Sure, the same name of the judge or appellant might be detected as PERSON in all other cases,
so if you wanted to do document _content_ markup you would still have a bunch of work to do.
Maybe _functionally_ it would be clearer and more tweakable if a first pass detected lots of PERSONs
and a second pass then said 'this PERSON is specifically a JUDGE, that PERSON is a PETITIONER'.

But theory aside, this is _excellent_ at document-level information extraction
when you want to answer 'who was the judge, petitioner, and respondent?'.
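The two-pass alternative mentioned above could look roughly like this. It is a sketch under assumptions: the `Role: Name` header format and the role names are made up for illustration (real judgments vary), and the list of PERSON offsets stands in for whatever a first-pass NER model would return.

```python
import re

# Hypothetical header format; real documents would need more patterns.
ROLE_LINE = re.compile(r"^(Petitioner|Respondent|Judge)\s*:\s*(.+?)\s*$", re.MULTILINE)

def roles_from_header(text):
    """Map each name declared in a 'Role: Name' line to its upper-cased role."""
    return {m.group(2): m.group(1).upper() for m in ROLE_LINE.finditer(text)}

def refine_person_labels(text, person_mentions):
    """Second pass: relabel PERSON spans whose text equals a declared name.

    person_mentions is a list of (start, end) character offsets, standing in
    for the PERSON spans found by a first-pass NER model.
    """
    roles = roles_from_header(text)
    return [(start, end, roles.get(text[start:end], "PERSON"))
            for start, end in person_mentions]

text = ("Petitioner: A. Name\n"
        "Respondent: B. Other\n"
        "\n"
        "A. Name argued that C. Third was present.")
# Stand-in for first-pass NER output: anything shaped like 'X. Surname'.
mentions = [(m.start(), m.end()) for m in re.finditer(r"[A-Z]\. [A-Z][a-z]+", text)]
for start, end, label in refine_person_labels(text, mentions):
    print(label, repr(text[start:end]))
```

Unlike the single-model approach, every mention of a declared name gets the role here, not only the one in the header, and the role logic can be adjusted without retraining anything.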

<!-- -->

And note that, in a wider context, documents may even specifically avoid repeating the personal name,
and use the literal text `appellant`, `judge`, etc. to avoid confusion,
which is great for even more than basic information extraction -- but only if you know it's happening.

And, as research we will mention later shows, another entity that pops up is _roles_.
Sure you have marked some named people, but it may be just as interesting, say, what they do within the organisation.
This may however be a more complex task.

<!--
Now, legal text invests time into being unambiguous, 
and court cases everywhere tend to have an introduction that is regular 
to the point it is almost a template (which might e.g. add ` (appellant)` to a name).
That name will likely be matched as an appellant there, and only there.

So this is NER used to do something more like pattern recognition. 
Which is fine (it overlaps with what it does anyway), and useful, 
yet you want to know why the method might tell you it's APPELLANT once,
and perhaps PERSON everywhere else.


And if it never matches the same name again in the same document,
it may be less useful for generic NLP but still absolutely great for information extraction _about the document as a whole_. 
-->

Other examples that may be interesting, in telling us what others have done and how they have fared:

"[Evaluation of Named Entity Recognition in Dutch online criminal complaints](https://clinjournal.org/clinj/article/view/65)"
sticks with the classics like Person, Location, Product, Organisation, Event,
and seems to conclude it's useful for basic information retrieval of topic.

"[Deep Learning for Legal Tech: exploring NER on Dutch court rulings]()" has dates, courts, section headers, and case identifiers.

"[Named Entity Recognition of Legislation References](http://arno.uvt.nl/show.cgi?fid=160982)"
mostly just for identifiers.

"[Named entity recognition and resolution in legal text](https://www.researchgate.net/publication/220745968_Named_Entity_Recognition_and_Resolution_in_Legal_Text)"
seems to go for Jurisdiction, Court, Title, Doctype, Judge, though seems more pattern matching than NER.

TODO: find more


<!--
[or](https://arxiv.org/pdf/2103.06268) [other](http://nlp.cs.aueb.gr/pubs/icail2017.pdf) papers rarely give


Unfortunately, in general there isn't a lot of data to train legal tasks like this, less so for NER, 
less so for Dutch. We will have to do some of this ourselves.
-->

## Getting more applied

### What does generic NER do out of the box?


In [15]:
import spacy, spacy.displacy,  wetsuite.datasets, wetsuite.helpers.net

In [16]:
# load two models: a generic Dutch model (which includes some pretrained NER), and a multilingual model that only does NER
dutch    = spacy.load('nl_core_news_lg') 
some_ner = spacy.load('xx_ent_wiki_sm')

In [34]:
# court case:
casetext = wetsuite.datasets.load('rechtspraaknl-struc').data.random_values(1)[0].get('bodytext')
casetext = casetext[1000:2000] # cut off some introduction, and leave a smallish block of text
spacy.displacy.render(dutch(casetext), style='ent', jupyter=True) 
display('=#'*80)
spacy.displacy.render(some_ner(casetext), style='ent', jupyter=True)

'=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#'

Please re-run that a bunch of times, to note
- what it detects and what it's missing (both seem to be missing some legal terms),
- the difference between these two models (the first likes dates, which can be useful),
- difference in focus, cleanliness of what it calls entities (the second seems to be a little better at getting boundaries right)
- difference in the categories it assigns (no, a confiscatiebevel is not a location - that sort of thing).

You'll probably notice that both leave something to be desired. 
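Eyeballing two renderings gets tiring after a few re-runs. A small helper like the following could summarise the differences between two models' output instead. It is a sketch that works on plain `(start, end, label)` tuples; with spaCy you could build those from `doc.ents` using each entity's `start_char`, `end_char` and `label_`. The function name and the returned keys are made up for this example.

```python
from collections import Counter

def compare_entities(ents_a, ents_b):
    """Compare two collections of (start, end, label) entity tuples."""
    spans_a = {(s, e): label for s, e, label in ents_a}
    spans_b = {(s, e): label for s, e, label in ents_b}
    shared = spans_a.keys() & spans_b.keys()
    return {
        "only_a": sorted(spans_a.keys() - spans_b.keys()),
        "only_b": sorted(spans_b.keys() - spans_a.keys()),
        "same_span_same_label": sorted(k for k in shared if spans_a[k] == spans_b[k]),
        "same_span_other_label": sorted(k for k in shared if spans_a[k] != spans_b[k]),
        "labels_a": Counter(spans_a.values()),
        "labels_b": Counter(spans_b.values()),
    }

# With spaCy, the inputs could be built like this (not run here):
#   ents_a = [(e.start_char, e.end_char, e.label_) for e in dutch(casetext).ents]
#   ents_b = [(e.start_char, e.end_char, e.label_) for e in some_ner(casetext).ents]
```

Only exact span matches count as "the same" here, so boundary differences show up as entries in both `only_a` and `only_b`; that is crude, but it makes disagreement about boundaries visible.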

<!--
Then again, if your goal is primarily information extraction _at document level_ - what it is about,
what organisations and people are involved, you are often still doing pretty well
even if half of the mentions aren't detected.

But in a few paragraphs, we snuck in in a new goal to justify our method,
quietly adjusting our use to the limitations. That's not necessarily how you want to work.
-->

In [18]:
# court case:
casetext = wetsuite.datasets.load('rechtspraaknl-struc').data.random_values(1)[0].get('bodytext')
spacy.displacy.render(dutch(casetext), style='ent', jupyter=True) 
display('=#'*80)
spacy.displacy.render(some_ner(casetext), style='ent', jupyter=True)

'=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#'

## Can we make it do more?

Yes, the Indian work above proved that. 

It does take training, 
and that takes annotation,
and that takes decisions about what you want to spend time on.

Which probably depends on a specific purpose.

So what _do_ we want?
- What might be useful in general?
- What might be useful in specific contexts (court cases are a decent context to think within)
- What might be an easy improvement, and example to have here?

Well
* It knows some law names, but we may wish to verify it knows as many as we can find ourselves.
  - it should not be hard to find a bunch of mentions of laws - there are lists of sorts

* ensure it knows more institutions that commonly come up
  - this one is harder, in the sense that... which ones do we care about?

* Roles -- Sure you have marked some named people, but it may be just as interesting, say, what they do within the organisation.
  - This may however be a more complex task -- to even define.
  - say, we may want `staatssecretaris` to be marked consistently, but what else?
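For the law-names wish in particular, a list of names can be turned into character-offset annotations that later serve as training or checking material. This is a minimal sketch: the `LAW` label and the tiny name list are assumptions for illustration, and the offsets would still need converting to whatever format your training tool expects.

```python
import re

def annotate_names(text, names, label="LAW"):
    """Return (start, end, label) character offsets for each listed name found in text."""
    # Longest first: in a regex alternation the first matching alternative wins,
    # so 'Invorderingswet 1990' must be tried before 'Invorderingswet'.
    ordered = sorted(set(names), key=len, reverse=True)
    if not ordered:
        return []
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b")
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]

# Hypothetical, tiny list of law names; a real list would be much longer.
names = ["Gemeentewet", "Invorderingswet", "Invorderingswet 1990"]
text = "op grond van artikel 255 van de Gemeentewet, de bepalingen in de Invorderingswet 1990"
for start, end, label in annotate_names(text, names):
    print(label, repr(text[start:end]))
```

Keep in mind that annotations produced from a list only contain what was on that list, so they say little about how well a model trained on them finds names that were not.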


But if we start adding wishes, some of them might shift out of the scope of NER.
Say, phrases that indicate resulting actions might be interesting.

Sure, terms like "strafbaar feit", "ongegrond verklaard", "onherroepelijke beslissing", 
"bestuurlijke autoriteit", 
"onherroepelijke beslissing van een bestuurlijke autoriteit", may all be interesting in some way or other, 
but remember also that each thing should be in a clear category. 

If we cannot think of a clear, exclusive category name for them,
we make life difficult for ourselves.

You might be able to abuse NER models for this, but it might take more wrangling.



## TEST

In [2]:
dutch  = spacy.load('nl_core_news_lg')


In [12]:
import spacy
from spacy.matcher import Matcher
import wetsuite.datasets

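def LOWERs(strlist):
    " One token pattern per word, each matching on the lowercased token text "
    return list( {'LOWER':s}   for s in strlist )

def ADJ_N():
    " Zero or more adjectives, then a token whose POS starts with N (NOUN, but note also NUM) "
    return list( [{'POS':'ADJ', 'OP':'*'}, {"POS": {"REGEX": "^N"}}] )

def DETPRON_N():
    " Zero or more determiners, zero or more pronouns, then a token whose POS starts with N "
    return list( [{'POS':'DET', 'OP':'*'}, {'POS':'PRON', 'OP':'*'}, {"POS": {"REGEX": "^N"}}] )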


matcher = Matcher(dutch.vocab)

for phraselower in (
    'aan de hand van het',
    'aan de hand van de'
    ):
    matcher.add( phraselower+'1', [   LOWERs(phraselower.split()) + ADJ_N()      ])
    matcher.add( phraselower+'2', [   LOWERs(phraselower.split()) + DETPRON_N()     ])

for phraselower in (
    'heeft geconcludeerd dat',
    'heeft besloten'
    ):
    matcher.add( phraselower+'1', [   ADJ_N()     + LOWERs(phraselower.split())     ])
    matcher.add( phraselower+'2', [   DETPRON_N() + LOWERs(phraselower.split())     ])



for case in wetsuite.datasets.load('cvdr-mostrecent-text').data.random_values(100):
#for case in wetsuite.datasets.load('rechtspraaknl-struc').data.random_values(100):
    doc = dutch( case )
#    doc = dutch( case.get('bodytext') )

    for match_id, start, end in matcher( doc ):
        match_str = dutch.vocab.strings[match_id]  # Get string representation, seems to point out the pattern name is added to the vocab too, presumably to have an integer-only representation?
        span = doc[start:end]  # The matched span
        print(f"Pattern {repr(match_str):15s} matches token {start:3d}..{end:3d} matches text {repr(span.text)}")    
        
    





('131640',
 '(Verordening Kwijtschelding 2012)\n\nVoor de kwijtschelding van gemeentelijke belastingen vinden op grond van artikel 255 van de Gemeentewet, de bepalingen in de Invorderingswet 1990 en de Uitvoeringsregeling 1990, met uitzondering van het volgende, overeenkomstig van toepassing;\n\nKwijtschelding van gemeentelijke belastingen wordt op verzoek verleend voor:\nafvalstoffenheffing;\nrioolheffing.\nKwijtschelding afvalstoffenheffing \nEen verzoek om kwijtschelding wordt verleend indien men voldoet aan de voorwaarden die zijn opgenomen in de uitvoeringsregeling. Indien men hiervoor in aanmerking komt geldt dit voor een éénpersoonshuishouden voor maximaal één kleine gft-container en één kleine restcontainer. Voor een meerpersoonshuishouden voor maximaal één kleine gft-container en één grote restcontainer.\nKwijtschelding rioolheffing \nEen verzoek om kwijtschelding wordt verleend indien men voldoet aan de voorwaarden die zijn opgenomen in de uitvoeringsregeling. Indien men hier

<!--

wetnamen = set()
for id, (namen1, namen2) in wetsuite.datasets.load('wetnamen').data.items():
    for name in namen1+namen2:
        name = name.strip(' -')#.lower()
        # turns out a 4MByte regexp isn't very fast, so we take out the very long stuff. TODO: do all this in the generation of that list.
        if len(name)>2  and len(name) < 50:
            wetnamen.add(name)

#random.sample( wetnamen, 30 )


import re
count = 0
frag_with_wetnaam = []

print('PRE')
nn = sorted( list( re.escape(naam)  for naam in wetnamen ), key=lambda s:len(s), reverse=True )

_re_str = r'\b(%s)\b'%( '|'.join( nn ) ) # this is a monster of a regexp
print( len(_re_str) )
_re_namen = re.compile( _re_str)#, flags=re.I )


import wetsuite.helpers.spacy
# This is a rather inefficient way to do it.
print('PROC')
#for id, cvdr_xmlbytes in wetsuite.helpers.notebook.ProgressBar( wetsuite.datasets.load('cvdr-mostrecent-xml').data.random_sample(1000) ):
    #for fragment in wetsuite.helpers.split.feeling_lucky( cvdr_xmlbytes ): # we use split to place the quote within a paragraph-or-so.
for id, cvdr_txt in wetsuite.helpers.notebook.ProgressBar( wetsuite.datasets.load('cvdr-mostrecent-text').data.random_sample(100) ):
    for fragment in wetsuite.helpers.spacy.sentence_split(cvdr_txt, as_plain_sents=True):
    #for fragment in re.split('\n{2,}', cvdr_txt):
        for m in re.finditer(_re_namen, fragment):
            #print(id, m.groups()[0])
            frag_with_wetnaam.append( (fragment, m.start(), m.end()) )
            # TODO: move all matches in a fragment into one list; spacy want it that way

count
#frag_with_wetnaam



for fragment, start, end in frag_with_wetnaam:#random.sample( frag_with_wetnaam, 1 ):
    print( fragment[:start], '[' , fragment[start:end], ']', fragment[end:] )
    print('-----')
    
-->

<!--

It's easy to be self-congratulatory,
but we can't strongly verify, or quantify, the performance
with the same things we fed in to train it. 
Of course it's going to find most of those.

When we give it a list we don't care about that list - it better be good at that.

We care about all the things you didn't tell it about, but we figure are relevant.
Maybe it's better at those. Maybe it's worse. You wouldn't easily know.

This is a classical issue in machine learning of any type.

The classical solution is to split the data you fed in into a training part, and testing part.
You only use a good portion to train, and you use the rest to check that what it learned holds up on data it has not seen.

https://towardsdatascience.com/train-ner-with-custom-training-data-using-spacy-525ce748fab7

https://spacy.io/usage/training

-->