## Purpose of this notebook

Finding the varied names used to refer to laws can help,
among other things, resolve the varied references to them, and e.g. see how unambiguous those are.

Even when specific laws and regulations have official names,
there are practical reasons for additional name variants,
and people actually writing documents will naturally use shortened names, acronyms, and other variations.

We would like to know all of these variants, also to support finding the many references that use them.
So we try to collect not just the idealized references, but also the names people use in practice.

## Some more notes (feel free to skip)

We assume that we do not need to catch every case.
In fact, we can afford to ignore odd cases and focus on what appears more consistently. This will probably give cleaner results,
and the more text you have mentioning a thing, the better this simpler approach works.

That said, by volume a lot of the code is there specifically to do cleanup, so that we can use more of the cases.
It is also messy code, because it is playing whack-a-mole with the many alternate ways of formatting names and/or identifiers.

<!-- -->

If you are happy with the results (which we made a small dataset of),
then this notebook may not be _directly_ useful to you,
yet it might contain some code you want to reuse (some of which we might move into our library).

<!-- -->

The code below spends time stripping things like "artikel x lid y" from the text because we care mostly about the main names,
but the non-stripped text can be interesting in itself for similar tasks, e.g. training a classifier to find these references in text.

### Names of laws

There are often two names:
* the __intitule__, which is a more descriptive name and can be quite long.
* the __citeertitel__, which ought to be succinct (one of the guidelines in [Aanwijzing 4.25 Aanwijzingen voor de regelgeving](https://wetten.overheid.nl/BWBR0005730/2022-04-01#Hoofdstuk4_Paragraaf4.5_Artikel4.25)). It is typically settled in the last article of a law, though note that not everything _has_ such a citeertitel.

<!-- -->

Documents referring to laws are free to use whatever reference they wish. 
There is often a preference for brevity, particularly around repeated reference, and there is also just human variation, e.g. due to
- [onjuist spatiegebruik](https://spatiegebruik.nl)
- spelling variants, variations in function words, e.g. _Wet op het voortgezet onderwijs_ versus _Wet voortgezet onderwijs_
- dates in the name, which are useful to disambiguate both significant updates and outdated-and-replaced variants, yet are (usually) not technically part of the actual title
...and more.
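
To illustrate how such surface variation could be collapsed before comparing names, here is a minimal sketch. The function `loose_key` and its short list of function words are hypothetical, made up for this illustration; they are not part of this notebook's cleanup code or of the wetsuite library. A trailing year is split off rather than discarded, because it may distinguish one law from its replacement.

```python
import re

# Hypothetical, deliberately short list of Dutch function words to ignore
FUNCTION_WORDS = {'de', 'het', 'op', 'van', 'der'}

def loose_key(name):
    """Return (comparison key, trailing year or None) for a law name.
    The key is lowercased, whitespace-normalized, and has function words removed."""
    name = re.sub(r'\s+', ' ', name.replace('\u00a0', ' ')).strip().lower()
    year = None
    match = re.search(r' ([0-9]{4})$', name)
    if match is not None:           # split off a trailing year such as ' 2020'
        year = match.group(1)
        name = name[:match.start()]
    words = [word for word in name.split(' ') if word not in FUNCTION_WORDS]
    return ' '.join(words), year

print(loose_key('Wet op het voortgezet onderwijs'))
print(loose_key('Wet  voortgezet onderwijs 2020'))
```

Both calls give the same key, while the second also reports the year, so a caller can decide whether the year matters for the comparison at hand.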

There are also entirely __unofficial names__.

Say, while "Mammoetwet van 1968" seems a specific reference, 
there is nothing officially called that, nor do documents refer to it like that.
It refers to _Wet op het voortgezet onderwijs_ and is a [_nickname_ indicating sheer size](https://historiek.net/mammoetwet-1968-betekenis-definitie/84129/).

...but _actually_, [BWBR0002399](https://wetten.overheid.nl/BWBR0002399) has been called ['Wet voortgezet onderwijs' 1968 through 2022](https://wetten.overheid.nl/BWBR0002399/informatie#tab-wijzigingenoverzicht), and [since 2022 there is a different law, a Wet voortgezet onderwijs 2020](https://wetten.overheid.nl/BWBR0044212).

We won't catch details like how one thing continues another, but we have a good chance to at least catch all these names.

### Abbreviations of laws


Settling a citeertitel does not include an abbreviation - abbreviations of laws do _not_ seem to be in any way official.

People still use abbreviations for practical reasons, mostly for more-often-cited things. 
Even regelingen occasionally use abbreviations in their citeertitels.


Because of the unofficial status, there are a lot of details that seem down to convention.

For example capitalisation:
- it seems that before the 90s (e.g. Ar in 1992? This still needs verification), abbreviations tended to be all capitalized,
- since the 90s it's often just the first letter, and you can expect either style, regardless of the age of the specific law.
- separately, longer ones and non-initialism abbreviations also tend to have only the first letter as a capital (e.g. [Wajong](https://wetten.overheid.nl/BWBR0008657), "Wet arbeidsongeschiktheidsvoorziening jonggehandicapten"), which seems to follow the logic that the full citeertitel should be written that way (see [Aanwijzing 4.25](https://wetten.overheid.nl/BWBR0005730/2022-04-01#Hoofdstuk4_Paragraaf4.5_Artikel4.25)). That capital is frequently W for wet, but not necessarily.
- lawyers and public officials who refer to laws a lot also seem to use initial-capital-only, seemingly with a little less regard for origin/age, yet certain well-established abbreviations may still be all-capitalized.
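
If you wanted to check these capitalisation conventions against collected abbreviations, a small classifier could look like the following. This is a sketch only: the function name `capitalisation_style` and its category labels are made up here, and it only looks at letters, ignoring digits and punctuation.

```python
def capitalisation_style(abbrev):
    """Classify an abbreviation as 'all-caps', 'initial-cap', 'mixed', or 'none' (no letters)."""
    letters = [ch for ch in abbrev if ch.isalpha()]
    if not letters:
        return 'none'
    if all(ch.isupper() for ch in letters):
        return 'all-caps'
    if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return 'initial-cap'
    return 'mixed'

for abbrev in ('WVO', 'Woo', 'Wajong', 'Wkb'):
    print(abbrev, capitalisation_style(abbrev))
```

Counting these labels per decade of the law referred to would be one way to test the "before the 90s" impression, assuming such dates are available.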



Note that, being unofficial, and defined at the time of reference rather than by the thing that is being referred to,
**abbreviations need not be unambiguous** in a global sense.

Say, WVO has been used in official documents to refer to
- [Wet op het voortgezet onderwijs](https://wetten.overheid.nl/BWBR0002399), 
- [Wet verontreiniging oppervlaktewateren](https://wetten.overheid.nl/BWBR0002682), and
- [Wet veiligheidsonderzoeken](https://wetten.overheid.nl/BWBR0008277)

...which makes e.g. "[Formatiebesluit WVO](https://wetten.overheid.nl/BWBR0005446)" unclear at a glance.
While it becomes clear enough in [its first article, begripsbepalingen](https://wetten.overheid.nl/BWBR0005446/2019-01-01#HoofdstukI_Artikel1),
it still seems to technically violate [Aanwijzing 4.25, lid 1 (formulering citeertitel) of the Aanwijzingen voor de regelgeving](https://wetten.overheid.nl/BWBR0005730#Hoofdstuk4_Paragraaf4.5_Artikel4.25), in that using an acronym is usually perfectly avoidable.
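
Once names are collected per BWB-id, this kind of ambiguity can be detected mechanically by inverting the mapping. The sketch below uses only the identifiers mentioned in the text above; the function name `ambiguous_names` and the choice to compare case-insensitively are assumptions made for illustration.

```python
import collections

# (abbreviation, BWB-id) pairs taken from the examples in the text above
observed = [
    ('WVO',    'BWBR0002399'),
    ('WVO',    'BWBR0002682'),
    ('WVO',    'BWBR0008277'),
    ('Wajong', 'BWBR0008657'),
]

def ambiguous_names(pairs):
    """Return {lowercased name: sorted BWB-ids} for names that point at more than one BWB-id."""
    targets = collections.defaultdict(set)
    for name, bwbid in pairs:
        targets[name.lower()].add(bwbid)
    return {name: sorted(ids) for name, ids in targets.items() if len(ids) > 1}

print(ambiguous_names(observed))
# {'wvo': ['BWBR0002399', 'BWBR0002682', 'BWBR0008277']}
```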

## Doing something about it (the code)

In [1]:
import re, collections, pprint, random, urllib.parse

import bs4, requests

import wetsuite.datasets
import wetsuite.helpers.etree
import wetsuite.helpers.localdata
import wetsuite.helpers.notebook
import wetsuite.helpers.meta
import wetsuite.helpers.strings
import wetsuite.helpers.patterns
import wetsuite.helpers.koop_parse
import wetsuite.helpers.spacy

## Some helper functions:

In [109]:
# These are two rather crude string cleanup functions.
# Their distinction has also become vague;  CONSIDER: merge
def cleanup_basics(name: str):
    """ Takes a string like "artikel 3 van de Woo"
        and turns it into a less-detailed reference like "Woo"  
        mostly by taking off the "artikel X", "lid Y", and "van de/het"
        and nothing more creative than that.
    """
    for remove_before_s in ('van de ', 'van die ', 'van het '):
        rbi = name.find(remove_before_s)
        if rbi != -1:
            name = name[rbi+len(remove_before_s):].strip()
    
    if ', art' in name: # "Woo, artikel" -> "Woo"
        name = name[:name.index(', art')].strip()

    for re_remove in (r'^[Aa]rt(?:[.]|ikel|\b)? [0-9:.]+[a-z]*',   # at start
                        r'[Aa]rt(?:[.]|ikel|\b)? [0-9:.]+[a-z]*$',  # at end
                        ):
        if re.search( re_remove, name ) is not None:
            name = re.sub(re_remove, ' ', name).strip(', ')

    for re_remove in (r'^lid [0-9:.]+[\u00baa-z]*',   # at start
                        r'lid [0-9:.]+[\u00baa-z]*$',  # at end
                        ):
        if re.search( re_remove, name ) is not None:
            name = re.sub(re_remove, ' ', name).strip(', ')

    name = name.rstrip('(')
    return name

assert cleanup_basics('artikel 1 van de Woo') == 'Woo'
assert cleanup_basics('Woo, artikel 1') == 'Woo'
assert cleanup_basics('art. 1 Woo') == 'Woo'
assert cleanup_basics('art. 1 lid 2 Woo') == 'Woo'
assert cleanup_basics('lid 2 Woo') == 'Woo'


def cleanup_wet_title(name: str):
    ''' Could be seen as an extension of cleanup_basics, 
        but is designed to take the entire aanhaal alinea and more aggressively removes anything not the title
        ...such as quotes, spaces, the rest of a sentence.
    '''
    name = name.replace('\u00A0',' ') # replace non-breaking space with a regular space
    name = name.replace('\n',' ')

    # take out matching quotations -- and assume they are being used to demark the entire name  
    #  the rest and the below still executes but is unlikely to match much
    for qleft, qright in (
        ('\u2018', '\u2019'),
        ('\u00ab', '\u00bb'),
        ('\u201c', '\u201d'),
        ('\u201e', '\u201d'),
        ('\u201d', '\u201d'),
    ):
        left_index  = name.find(qleft)
        right_index = name.rfind(qright)
        if left_index != -1 and right_index != -1   and left_index < 10  and  right_index > left_index+5:
            name = name[ left_index+1 : right_index ]

    if name.startswith('"') and name.count('"')>=2: #assume these are outside quotations
        try:
            name = name[ 1: name.index('"', 4) ].strip()
        except ValueError: # raised by index() when there is no closing quote from position 4 on
            print( repr(name) )
            raise

    # quite hackish
    name = re.sub( r'(?:artikel [0-9.]+,?\s*)?(?:[a-z]+e lid|lid [0-9]+,?)?( van het| van di?e| van dat| der)\s*',' ', name).strip()

    # aimed at the aanheffing sentence specifically
    for re_cutafter in(
        r'en zal',                              # probably more often than not the note where it will be publicised
        r'en treedt',                           # probably more often than not the note when it will be publicised
        r'[;,.] ([zZ]ij |Het |Dit besluit )?(treedt|treden) in werking',
        r'[.] Deze ',
        r'zal worden gepubliceerd', 
        r'en wordt gepubliceerd',
        r'wordt .{0,20}in de Staatscourant',
        r' (en|zij|zei|zal) (wordt|worden) bekend\s?gemaakt',
        r'[.] De beleidsregels', 
        r' en werkt terug ',
        r'Zij werkt terug ',
        r'De regeling zal met ',
        r' met vermelding van ',
        r' met bijvoeging van ',

        # TODO: extract abbreviations too
        r'(, )?(of )? afgekort (als|tot)',
        r'[\(]afgekort\b',
    ):
        match = re.search( re_cutafter.replace(' ',r'[\s\n]+'), name, re.M )
        if match is not None:
            name = name[:match.start()]

    if name.startswith('de '):  # wet
        name = name[3:].strip()
    if name.startswith('het '): # besluit
        name = name[4:].strip()
    # note: taking off 'nieuwe' at the start is dangerous because that is also frequently part of the actual name

    name = name.strip('\u2018\u2019') #quote
    name = re.sub(r'\s+',' ', name).strip()
    name = name.strip('"\'.;:,\u00ab\u00bb \u2018\u2019\u201c\u201d\u201e')
    return name

In [None]:
def title_is_too_generic_or_specific( title ):
    """ Various documents will refer to shortened names,
        from 'de minister' to 'deze brief' or just 'artikel 2', without further context.
        
        Until we build something that can resolve such contextual information, 
        we would rather ignore these than use them.
        So this returns whether we think it's such a not-easily-resolvable reference.

        You may wish to use it after the abovementioned cleanup, because that's when you can tell.
    
        Keep in mind that some are very short
    """
    title_lower = title.strip('- ').lower()
    title_lower = title_lower.replace('genoemde ','').strip()

    if title_lower=='':
        #print('T empty')
        return True
    
    if title_lower in (
        'overheid.nl',
        'artikel',
        'beschikking',
        'verordening',
        'rijkswet',
        'raad',
        'circulaire',
        'onbekend',
        'regeling',
        'nieuwe wet',
        'wetboek', 'uitvoeringswet', 'subsidieregeling',  'fonds', 'het', 'model', 'mens', 
        'wet',

        'beleidsregels', 'deze regeling', 'Deze regeling', 'die regeling',
        'besluit','Besluit', 'dat besluit', 'Dit besluit',

        'geen', 
        'gemeente',
        'de wet', 'eerste', 'mandaatbesluit', 'dit besluit', 
        'regelement', 'reglement',
        'bijlage',
        'was', 'gelet op', 
        'bodem','politie','minister','inrichting',
        'stichting',
        'onderwijs',
        'wetgeving',
        'jaarverslag',
        'statuten',
        'controles', 'eed',
        'wet op', 'wet op de',
        'wetboek van',
        'openbaar vervoer',
        'werkgever',
        'burgerlijk',
        'rijk',
        'rijksoverheid',               
        'onder',
        'eerste lid', 'tweede lid', 'derde lid', 'vierde lid', 'vijfde lid', 
        ):
        #print('T fixed')
        return True

    if re.match(r'((in|van)\s+)?(de|het|die|dat)\s+(wet|regeling|besluit)$', title_lower) is not None:
        #print('T wetbesluitregeling')
        return True

    if re.match('lid [a-z]+$',title_lower) is not None:
        #print('T lid1')
        return True
    if re.match('[0-9]+, [a-z]+ lid$',title_lower) is not None:
        #print('T lid2')
        return True
    if re.match('[a-z]+ lid onder [a-z]+$',title_lower) is not None:
        #print('T lid3')
        return True
    if re.match('[a-z]+ volzin$',title_lower) is not None: # 'tweede volzin',
        #print('T volzin')
        return True
    if re.match('(onder|sub) [a-z]+$',title_lower) is not None: # 'onder a', 'sub a', 'sub b', 'sub c',
        #print('T ondersub')
        return True
    if re.match('en [0-9]+$',title_lower) is not None:
        #print('T en')
        return True
    if re.match('[0-9]+$',title_lower) is not None: # '140','147','172','213','242','231','416', '424', '426',
        #print('T num')
        return True

    if re.match('[0-9]+[:.][0-9]+$',title_lower) is not None:
        #print('T numnum')
        return True
    if re.match('hoofdstuk [0-9]+$',title_lower) is not None:
        #print('T hoofdstuk')
        return True

    # some of these are singular edge cases, shouldn't really be here

    # too specific, e.g. still contains things like "lid", "sub", "aanhef en onderdeel"
    title_lower = title.strip().lower()
    for sub in (', lid ',
                ', sub ',
                ', onder ',
                'aanhef en onderdeel ',
                ' op grond van ',
                'artikelen',
                'hoofdstuk ',
                'bijlage bij',
                'behorende bij',
                'van genoemd',
                'bedoelde',
                'laatstgenoemd',
                ):
        if sub in title_lower:
            return True

    if re.match(r'^([0-9]+[a-z]?[\s+,]+|[a-z]+ lid[\s, ]+|wet)+$', title) is not None:
        # try to reject the garbled cases consisting _entirely_ of parts, like '20, wet', '18, de lid, wet', '40b wet', '20, tweede lid, wet'
        return True

    if re.match(r'^(eerste|tweede|derde|vierde|vijfde|zesde)(,\s+| lid,| lid| tot en met\s+)+(besluit)$', title) is not None:
        # try to reject cases like 'eerste tot en met besluit', 'eerste lid, besluit', 'eerste tot en met besluit', 'tweede lid, besluit',
        return True

    return False

In [235]:
for name in ('artikel 17 der Consulaire wet',
             'artikel 17 van de Consulaire wet',
             'artikel 17 lid 1 der Consulaire wet',
             'lid 1 der Consulaire wet',
             'artikel 17 lid 1, der Consulaire wet',
             'artikel 17, lid 1, der Consulaire wet',
             'artikel 17 van de Woo',
             'van die wet',
             'in die wet',
             'die wet',
             'de wet',
             'regeling',
             'van die besluit',
             'artikel 3 van de wet',
             'artikel 2 van die wet',
             'artikel 3 van dat besluit',
             'vijfde lid, van genoemd besluit',
             'artikel 2.38 wet',
             'eerste tot en met besluit',
             'eerste lid, besluit',
             'eerste tot en met besluit',
             'tweede lid, besluit',
             'bijlage bij dat besluit',
             'eerste lid, onderdeel b, ze regeling',
             'eerste lid, onder c, besluit',
             'eerste lid, onder a of b, besluit',
             'hoofdstuk 4 besluit',
             'in de eerste lid bedoelde wet',
             'eerste lid van laatstgenoemd besluit',
             # stripped too much? Or just weird?
             '20, wet',
             '18, de lid, wet',
             '40b wet',
             '20, tweede lid, wet',
             ):
    # we _really_ should merge those, or at least clarify the difference
    cleaned = cleanup_wet_title(name )
    cleaned = cleanup_basics(cleaned)
    checked = cleaned
    if title_is_too_generic_or_specific(cleaned):
        checked = None

    print( f'{repr(name):47s} --> cleaned:{repr(cleaned):47s} --> checked:{repr(checked)} ' )


'artikel 17 der Consulaire wet'                 --> cleaned:'Consulaire wet'                                --> checked:'Consulaire wet' 
'artikel 17 van de Consulaire wet'              --> cleaned:'Consulaire wet'                                --> checked:'Consulaire wet' 
'artikel 17 lid 1 der Consulaire wet'           --> cleaned:'Consulaire wet'                                --> checked:'Consulaire wet' 
'lid 1 der Consulaire wet'                      --> cleaned:'Consulaire wet'                                --> checked:'Consulaire wet' 
'artikel 17 lid 1, der Consulaire wet'          --> cleaned:'Consulaire wet'                                --> checked:'Consulaire wet' 
'artikel 17, lid 1, der Consulaire wet'         --> cleaned:'Consulaire wet'                                --> checked:'Consulaire wet' 
'artikel 17 van de Woo'                         --> cleaned:'Woo'                                           --> checked:'Woo' 
'van die wet'                                

In [None]:
def name_from_extref_tag(extref_tag, allow_fuzzier=False):
    ''' Takes an etree object, specifically an extref node from a document.
        @param allow_fuzzier: if False, we're pretty strict about only matching some known patterns; if True, we also accept other text and just hope for the best
        @return: the part of the extref's text that is probably the name of a thing being referenced, stripped of some obvious stuff
        ...or returns None, if we decide it's probably not a useful name, or we're not really sure what's in there.
    '''
    match1 = re.search(' (?:der|van de|van het) (.*)', extref_tag.text)   # I've seen a case where 'van de' is in the reference twice, like 'van de wet, in artikel 7 van de Elektriciteitswet 1998', but meh.
    match2 = re.search('lid, (Wet .*)', extref_tag.text)

    if match1 is not None:
        name = match1.groups()[0].strip()
        if name == 'wet': #  'van de wet' without a name
            return None
    elif match2 is not None:
        name = match2.groups()[0].strip()
    elif extref_tag.text.startswith( ('Wet ', 'Successiewet ', 'Besluit ', 'Verordening ', 'Warenwetbesluit ') ):
        name = extref_tag.text.strip()
    elif extref_tag.text.rstrip('0123456789 ').endswith('wet'): # assume it's a short name
        name = extref_tag.text.strip()
    elif extref_tag.text.rstrip('0123456789 ').endswith('besluit'): # assume it's a short name
        name = extref_tag.text.strip()
    elif extref_tag.text.rstrip('0123456789 ').endswith('regeling'): # assume it's a short name
        name = extref_tag.text.strip()
    elif re.match( 'W[A-Za-z]+$', extref_tag.text)   and len(extref_tag.text) in (2,3,4)  and  extref_tag.text.lower()!='wet': # assume it's a short initialism, for a law
        name = extref_tag.text.strip()
    else: # maybe look at what these are
        if allow_fuzzier:
            name = extref_tag.text.strip() # will still be cleaned below
            name = cleanup_basics( name ) # this is a little awkward of a combination, really
        else:
            #print( 'DUNNO %r'%(wetsuite.helpers.etree.tostring(extref_tag).decode('u8')) )
            return None

    # right now we're interested in the main name, not reference to specific part
    #  ...but we could later try to deal with those as well - there are ~15K cases that the below skipped
    name = cleanup_wet_title(name)
    name = cleanup_basics(name)

    # ignore less-formal references
    if title_is_too_generic_or_specific(name):
        #print( "SKIP non-useful name (generic/specific): %r"%name)
        return None
    
    #if wetsuite.helpers.strings.contains_any_of( name.lower(), (
    #    'artikel ','artikelen ', 'art.', 'lid,', 'paragraaf ', 'van die ', ' die wet', 'van de wet', 'van deze wet', 'met vermelding ',
    #    'eerste lid', 'tweede lid', 'boek van ', 'weede lid', 'met de bijbehorende', 'voornoemde wet', 'Bijlage ', 'bijlage bij', 'bijlage ') ):
    #    #print( "SKIP FOR NOW (specific reference): %r"%name) # there
    #    return None
    return name


# Collect references from BWB

## Names from BWB's intitule, citeertitel, and "aangehaald als" text

__Citeertitel__ - is often a fairly succinct name
: this basically comes from the last paragraph when it says something like "Deze wet wordt aangehaald als: Wet op de rechterlijke organisatie.", though the code below does not assume those are identical. This is in part for ease of coding: it is mostly laws that have that paragraph, but _everything_ has a citeertitel.

__Intitule__ - is often a more detailed description, and will rarely be the text people use to cite (though sometimes it is the _same_ as citeertitel)
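
As a small standalone illustration of the "aangehaald als" extraction, the sketch below applies the same kind of pattern that the cell below uses to the example sentence above. The function name `self_name` is made up for this illustration, and real alineas need considerably more cleanup, which is what `cleanup_wet_title` is for.

```python
import re

# Same idea as the pattern used in the collection loop: capture what follows 'aangehaald als'
AANHALING_RE = re.compile(r' (?:aangehaald als|aangehaald onder de titel van):?\s*(.*)')

def self_name(alinea_text):
    """Return the name a text says it should be cited as, or None if there is no such phrase."""
    match = AANHALING_RE.search(alinea_text)
    if match is None:
        return None
    return match.group(1).rstrip('.').strip()

print(self_name('Deze wet wordt aangehaald als: Wet op de rechterlijke organisatie.'))
# Wet op de rechterlijke organisatie
```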

In [160]:
# should take a minute or two to go through all.

bwb_latestonly_xml = wetsuite.datasets.load('bwb-mostrecent-xml').data
# side note: there are a few laws that changed name, but due to that choice of data, we focus on the recent version

                                                                     # lists mostly for code below to be simpler, they'll contain 0 or 1 items
bwb_names_citeertitel = collections.defaultdict(list)   # BWB-id -> list of name strings     from something's own metadata, what it calls itself
bwb_names_aanhaling   = collections.defaultdict(list)   # BWB-id -> list of name strings     from something's own data, what it calls itself
bwb_soort             = {}                              # BWB-id -> soort    

for _, xmlbytes in wetsuite.helpers.notebook.ProgressBar( bwb_latestonly_xml.items() ): # note: going through ~36K documents will take a minute or two
    tree = wetsuite.helpers.etree.fromstring( xmlbytes )
    tree = wetsuite.helpers.etree.strip_namespace( tree )

    meta = wetsuite.helpers.koop_parse.bwb_toestand_usefuls(tree)
    bwbid = meta['bwb-id']
    bwb_soort[bwbid] = meta['soort']

    ## record the citeertitel 
    # which is just in the metadata
    bwb_names_citeertitel[ bwbid ].append( cleanup_wet_title(meta['citeertitel']) )  


    ## find the self-reference
    # find all alineas that mention 'aangehaald' (if any), find the thing it then mentions
    aanhaling = []
    for al_tag in tree.iter(tag='al'): # iter() rather than the deprecated getiterator(), consistent with the extref loop further down
        tagtext = wetsuite.helpers.etree.all_text_fragments( al_tag, strip='\n', join='' )
        #tagtext = all_text_fragments( tag, strip='\n', join='' )
        #if 'aangehaald' in tagtext:
            #print( tagtext )
        match = re.search(r' (?:aangehaald als|aangehaald onder de titel van):?\s*(.*)', tagtext) # this probably filters out too much
        if match is not None:
            aanhaling.append( match.groups()[0].rstrip('.') )

    if len(aanhaling)==0: 
        #TODO: consider subelements, e.g. the CO<sub>2</sub> example

        # While debugging, it is useful to check that indeed they're without such a reference
        if wetsuite.helpers.strings.contains_any_of( meta['intitule'].lower(), ['regeling ', 'toepassing ', 'aanwijzing ', 'besluit ', 'beschikking ']): # maybe look at soort instead?
            pass
            #print( "SELFAANHAALFAIL (OKAY; seeming non-law) in %r"%xml_url)
        else:
            pass
            #print( "SELFAANHAALFAIL in %r"%xml_url)
        # there was previously also an 'is this french' check
    else:
        text = aanhaling[-1] # if we matched more than once (should be rare), then it's likely to be the last

        cleaner_title = cleanup_wet_title( text )
        if cleaner_title not in bwb_names_aanhaling[ bwbid ]:
            bwb_names_aanhaling[ bwbid ].append( cleaner_title )
        if '(' in cleaner_title: # some titles have parentheses at the end. Probably add them both with and without, because that may or may not be part of the name (TODO: clean more)
            cleaner_title = cleaner_title[:cleaner_title.index('(')].strip()
            if cleaner_title not in bwb_names_aanhaling[ bwbid ]:
                bwb_names_aanhaling[ bwbid ].append( cleaner_title )


### Quick summary

In [161]:
# how much did we get?
print("From  %d BWB documents  we got  %d citeertitels  and  %d aanhalingen."%(
    len(bwb_latestonly_xml),
    len(bwb_names_citeertitel),
    len(bwb_names_aanhaling),
))

From  37806 BWB documents  we got  37806 citeertitels  and  21871 aanhalingen.


In [162]:
# Quick inspection of
#    what kind of things are we apparently not cleaning from the aanhaling? 
# and/or 
#   what kind of differences are there between aanhaling and citeertitel?

for bwbid in list(bwb_names_aanhaling)[::100]: # every hundredth, to get just a small portion (avoiding a bulk of output)
#for bwbid in bwb_names_aanhaling: # all
    aanhaling   = bwb_names_aanhaling[bwbid]
    citeertitel = bwb_names_citeertitel[bwbid]
    if len( set( aanhaling ).symmetric_difference( citeertitel ) ) > 0: # when there is a difference 
        print()
        print( bwbid )
        print( '  ',aanhaling )
        print( '  ',citeertitel )


BWBR0005279
   ['Uniform Aanbestedingsreglement EG 1991’, bij afkorting ‘UAR-EG 1991']
   ['Uniform Aanbestedingsreglement EG 1991']

BWBR0010562
   ['Regeling stofomschrijving Nederland en Indonesië (examen geschiedenis)', 'Regeling stofomschrijving Nederland en Indonesië']
   ['Regeling stofomschrijving Nederland en Indonesië (examen geschiedenis)']

BWBR0011793
   ['Regeling beschikbare middelen ver-strekkingen en vergoedingen Zfw 2001']
   ['Regeling beschikbare middelen verstrekkingen en vergoedingen Zfw 2001']

BWBR0014363
   ['Openstellingsbesluit Kaderregeling kennis en advies (jonge agrariërs 2002)', 'Openstellingsbesluit Kaderregeling kennis en advies']
   ['Openstellingsbesluit Kaderregeling kennis en advies (jonge agrariërs 2002)']

BWBR0020202
   ['Warenwetregeling invoer bepaalde levensmiddelen uit Brazilië, China, Egypte, Iran en Turkije (beschikking 2006/504/EG)', 'Warenwetregeling invoer bepaalde levensmiddelen uit Brazilië, China, Egypte, Iran en Turkije']
   ['Waren

### Extrefs in the BWB

In the BWB in XML form, `<extref>` tags are references to other laws, regulations, and more.

In a wider context, extref tags are fairly free-form in what link _text_ they contain,
yet within the BWB documents they are used fairly consistently, so provide fairly clean data.

We are currently interested in those that point to laws. Those will contain a BWB-id and look something like:
`<extref verwijzing-id="2189982" doc="jci1.3:c:BWBR0015163&amp;artikel=5" bwb-id="BWBR0015163" label-id="4716344">artikel 5 van het Instellingsbesluit Productschap Vis</extref>`
We should perhaps filter by it pointing to BWB documents with soort=wet, but can do that later.

As we specifically look for repeated, clearer cases, we do not need to extract every reference. It suffices that the majority of what we extract is consistent and useful.
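
To make the structure of such a tag concrete, here is a sketch that parses the example above with Python's standard library (the notebook itself uses `wetsuite.helpers.etree`). The function name `describe_extref` is hypothetical, and reading the part of `doc` after the `&` as key=value details is an assumption based only on this one example.

```python
import urllib.parse
import xml.etree.ElementTree as ET

snippet = ('<extref verwijzing-id="2189982" doc="jci1.3:c:BWBR0015163&amp;artikel=5" '
           'bwb-id="BWBR0015163" label-id="4716344">'
           'artikel 5 van het Instellingsbesluit Productschap Vis</extref>')

def describe_extref(xml_text):
    """Return the BWB-id, link text, and (assumed) key=value details of a single extref element."""
    node = ET.fromstring(xml_text)
    doc = node.get('doc', '')
    _, _, query = doc.partition('&')   # assumption: details follow the first '&'
    return {
        'bwb-id':  node.get('bwb-id'),
        'text':    node.text,
        'details': urllib.parse.parse_qs(query),
    }

print(describe_extref(snippet))
```

The `text` value is what `name_from_extref_tag` then reduces to a name, here presumably 'Instellingsbesluit Productschap Vis'.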

In [196]:
# This may take a minute or two

bwb_names_extref = collections.defaultdict(list)   # BWB-id -> list of name strings    (those names being what references from elsewhere call this BWB)

for _, xmlbytes in wetsuite.helpers.notebook.ProgressBar( bwb_latestonly_xml.items() ):
    tree = wetsuite.helpers.etree.fromstring( xmlbytes )
    tree = wetsuite.helpers.etree.strip_namespace( tree )

    for extref_tag in tree.iter( tag='extref' ): # find all external references, regardless of context in the document

        # is the thing we point to marked with a BWB identifier   (this may miss some things, but the else: below should tell us)
        ref_bwb = extref_tag.get('bwb-id') # CONSIDER: normalize?
        if ref_bwb is not None:
            # We might not care about things the above didn't already have names for
            #if ref_bwb not in bwb_name_self:
            #    continue

            # (parsing an extref tag might become a function eventually, but right now it's specific content in a specific dataset)

            if extref_tag.text is not None:
                name = name_from_extref_tag( extref_tag )
                if name is None:
                    #print( "SKIP, apparently boring name: %r"%wetsuite.helpers.etree.tostring(extref_tag).decode('utf8') )
                    continue

                if 'voornoemde wet' in name:
                    continue

                if name.endswith(')') and '(' in name: 
                    i = name.rfind('(') # find last open bracket, if there are multiple
                    
                    one = cleanup_wet_title( name[:i].strip() )
                    bwb_names_extref[ref_bwb].append( one )
                    
                    # IF this seems to be in the form 'bla bla (wet bla)' or 'bla bla (blawet)', then add the bracket contents
                    two = cleanup_wet_title( name[i+1:-1].strip() ) # -1 is valid only as long as that endswith up there stays
                    if ' ' in two and 'wet' in two.lower():
                        bwb_names_extref[ref_bwb].append( two )
                else: # enter as-is
                    bwb_names_extref[ref_bwb].append( name )

                # Note that we're specifically recording every reference (there will be many duplicates), 
                #  so that we can count them later

        else: # not references to bwb entries 
            # ignore what we know, to see what else there is
            # also, we might actually use CELEX references
            if extref_tag.get('reeks') == 'Celex': # EU identifiers
                pass
            elif extref_tag.get('doc') == 'onbekend': # ?
                pass
            else: # currently a few dozen cases left, that's fine to ignore
                pass # print out everything else to figure out what we don't handle yet?:
                #print( 'UNKNOWN extref %r'%(wetsuite.helpers.etree.tostring(extref_tag).decode('u8')) )


In [200]:
# review what we just collected

amt_names = sum( len(names) for names in bwb_names_extref.values() )
print( "We gathered %d name references for %d laws\n\nSome examples:"%(amt_names, len(bwb_names_extref)) )

# Printing _everything_ (e.g. pprint.pprint( bwb_names_extref )) would be around 300K lines of output, so we print a small sample instead:
for example in random.sample( list( bwb_names_extref.items()), 5 ):
    pprint.pprint( example )

We gathered 286995 name references for 9624 laws

Some examples:
('BWBR0042732',
 ['Wet kwaliteitsborging voor het bouwen',
  'Wet kwaliteitsborging voor het bouwen',
  'Wet kwaliteitsborging voor het bouwen',
  'Wet kwaliteitsborging voor het bouwen',
  'Wet kwaliteitsborging voor het bouwen',
  'Wet kwaliteitsborging voor het bouwen',
  'Wet kwaliteitsborging voor het bouwen',
  'Wet Kwaliteitsborging voor het bouwen',
  'Wkb',
  'Wkb',
  'Wet kwaliteitsborging voor het bouwen',
  'Wet kwaliteitsborging voor het bouwen',
  'Wet kwaliteitsborging voor het bouwen',
  'Wet kwaliteitsborging voor het bouwen',
  'Wet kwaliteitsborging voor het bouwen',
  'Wet kwaliteitsborging voor het bouwen'])
('BWBR0031387',
 ['Regeling regionale aanpak voortijdig schoolverlaters en prestatiesubsidie '
  'vo',
  'Regeling regionale aanpak voortijdig schoolverlaten en prestatiesubsidie '
  'voor het voortgezet onderwijs',
  'Regeling regionale aanpak voortijdig schoolverlaten en prestatiesubsidie '
  'v

### Check: does that look reasonable?

The above extracted
* `bwb_names_citeertitel` from metadata
* `bwb_names_aanhaling`   from text
* `bwb_names_extref`      from links
   * _every_ occurrence was added to a list, so that we can count which one is common. We can use `wetsuite.helpers.strings.count_normalized()` to do that for us: it lets us unify capitalisation-only variation and reports the most common capitalisation

We could do various things with this, e.g.
* count extref names (or _all_) -- to look at unusual cases
* look at ambiguity, e.g. duplicates
* names used for self, names used by others -- to search in
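
To make the counting idea concrete, here is a minimal sketch of case-insensitive counting that reports the most common capitalisation of each name. Note that `count_case_insensitive` and its `min_fraction` parameter are hypothetical stand-ins written for this illustration; this is not how `wetsuite.helpers.strings.count_normalized()` is implemented, and that function's parameters may behave differently.

```python
import collections

def count_case_insensitive(names, min_fraction=0.0):
    """Hypothetical stand-in for illustration, NOT wetsuite's count_normalized().

    Groups names that differ only in capitalisation or surrounding whitespace,
    and reports each group under its most frequently seen spelling.
    Groups that make up less than min_fraction of all names are dropped.
    """
    by_key = collections.defaultdict(collections.Counter)
    for name in names:
        by_key[name.lower().strip()][name.strip()] += 1

    total = len(names)
    result = {}
    for variants in by_key.values():
        count = sum(variants.values())
        if total > 0 and count / total >= min_fraction:
            most_common_form, _ = variants.most_common(1)[0]
            result[most_common_form] = count
    return result

example = ['Wet kwaliteitsborging voor het bouwen'] * 3 + ['Wet Kwaliteitsborging voor het bouwen', 'Wkb', 'Wkb']
print(count_case_insensitive(example))
```

Raising `min_fraction` is a simple way to drop rare variants, which are more likely to be typos or leftovers of part references.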

In [201]:
# Look at some hand picked examples

bwbrs = sorted(  set(bwb_names_citeertitel) | set( bwb_names_aanhaling) | set(bwb_names_extref)  )  # their keys,  so joins all bwb-ids that any part found
print( 'Distinct BWB-ids we found some names for: ',len(bwbrs) )

for example_bwbid in ('BWBR0031986', # here to point out it's only really laws that have citeertitels and aanhaling paragraphs - references to this may be more descriptive
                      'BWBR0005537', # here to point out we can find acronyms too  (though we may need some cleverer non-linear behaviour in count_normalized)
                      'BWBR0015703', # here to point out this one changed names
):
    extref_names = bwb_names_extref.get(example_bwbid, [])  # empty list if nothing refers to this id, so the len() below cannot fail on None
    print()
    print( ' --> ', example_bwbid )
    print( 'Citeer:              ', bwb_names_citeertitel.get(example_bwbid) )
    print( 'Aanhaal:             ', bwb_names_aanhaling.get(example_bwbid)   )
    print( 'Extref refs to this: ', len(extref_names)   )
    print( '  distinct & count:  ', wetsuite.helpers.strings.count_normalized( extref_names, min_count=1,   min_word_length=1, normalize_func=lambda x:x.lower().strip() ) )
    print( '  filtered:          ', wetsuite.helpers.strings.count_normalized( extref_names, min_count=0.05, min_word_length=1, normalize_func=lambda x:x.lower().strip() ) )

Distinct BWB-ids we found some names for:  38231

 -->  BWBR0031986
Citeer:               None
Aanhaal:              None
Extref refs to this:  3
  distinct & count:   {'Verordening fonds sociale aangelegenheden vleeswarenindustrie (PVV) 2012': 2, 'Verordening fonds voor onderzoek en ontwikkeling vleeswarenindustrie (PVV) 2012': 1}
  filtered:           {'Verordening fonds sociale aangelegenheden vleeswarenindustrie (PVV) 2012': 2, 'Verordening fonds voor onderzoek en ontwikkeling vleeswarenindustrie (PVV) 2012': 1}

 -->  BWBR0005537
Citeer:               ['Algemene wet bestuursrecht']
Aanhaal:              ['Algemene wet bestuursrecht']
Extref refs to this:  8726
  distinct & count:   {'Algemene wet bestuursrecht': 8038, 'zevende lid, onderdelen a en b, wet': 1, '3:12 wet': 4, 'Algemeen wet bestuursrecht': 4, 'Algemene wet': 29, 'Algemene': 4, 'Algemene wet bestuursrecht van toepassing': 1, 'Algemene wet bestuursrech': 7, 'eerste en tweede lid, wet': 1, 'Awb': 570, '4:8 wet': 1, 'eer

## Collect references from CVDR

The CVDR XML has `<dcterms:source>` elements in its header, which are references made in the text.

Some of these entries signify something like "we know this is a reference but aren't certain what to"; we ignore those.
For an example that contains both kinds, see e.g. [CVDR100088/1](https://repository.officiele-overheidspublicaties.nl/CVDR/100088/1/xml/100088_1.xml).


<!-- -->

The XML documents in the CVDR repository have extref tags,
but they are a little more varied than those in BWB due to what these documents practically do in a legal sense:
these extrefs are more often references to _specific parts_ of laws.

This requires more cleanup of more human wording,
but the amount of data should give cleaner preferences for the main name,
and the type of its use should also give more idea of which laws are more commonly referenced.
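
As a rough illustration of that cleanup, the sketch below strips leading part references such as "artikel 5 van de" from a reference text. The function name `strip_part_reference` and the pattern are assumptions made for this example; it is not the `cleanup_basics` used in this notebook, and it only covers a few simple numeric forms (not wording like "zevende lid").

```python
import re

# Hypothetical pattern: a part keyword, a number-like token, then a separator
# (a comma, "van de" / "van het", or just whitespace).
_PART_PREFIX = re.compile(
    r'^\s*(?:artikel|art\.|lid|hoofdstuk|paragraaf|bijlage)\s+[0-9][0-9a-z:.]*'
    r'(?:\s*,\s*|\s+van\s+(?:de|het)\s+|\s+)',
    re.IGNORECASE,
)

def strip_part_reference(text):
    """Repeatedly remove leading part references until nothing changes."""
    previous = None
    while previous != text:
        previous = text
        text = _PART_PREFIX.sub('', text, count=1)
    return text.strip()

for example in ('artikel 5 van de Gemeentewet',
                'Artikel 5, lid 2, Algemene wet bestuursrecht',
                'artikel 3:12 Awb',
                'Wegenverkeerswet 1994'):
    print('%r -> %r' % (example, strip_part_reference(example)))
```

Only prefixes are handled, so a trailing year such as in "Wegenverkeerswet 1994" is left alone.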

In [202]:
# NOTE: going through 160K documents will probably take ten minutes
#       (...also why it's split into a few cells)
cvdr_sref_data   = {} # xml_url it came from   ->   [(type, orig, specref, None, source_text),  ...]    (see cvdr_sourcerefs's documentation)
cvdr_extref_data = {} # xml_url it came from   ->   [extrefnode,  ...]    (see cvdr_sourcerefs()'s documentation)

## Go through and parse CVDR XMLs, and collect references into the parsed XML (don't process it yet)... 
for xml_url, xmlbytes in wetsuite.helpers.notebook.ProgressBar( wetsuite.datasets.load('cvdr-mostrecent-xml').data.items() ):
    tree = wetsuite.helpers.etree.fromstring( xmlbytes )
    tree = wetsuite.helpers.etree.strip_namespace( tree )

    try:
        meta = wetsuite.helpers.koop_parse.cvdr_meta(tree, flatten=True)

        ## ...collect the <dcterms:source> references,
        # using an existing function, which returns a list of tuples; we'll deal with their content a little later
        srefs = wetsuite.helpers.koop_parse.cvdr_sourcerefs(tree, ignore_without_id=True)   
        if len(srefs) > 0:
            cvdr_sref_data[ xml_url ] = srefs

        ## ...collect extref tags from the document overall, since these are mostly references to laws, and to other CVDR articles.
        extrefs = list( tree.iter(tag='extref') )
        if len(extrefs)>0:
            cvdr_extref_data[ xml_url ] = extrefs
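    except Exception: # skip documents we fail to process, rather than stop the whole run  (a bare except would also swallow KeyboardInterrupt)
        pass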

  0%|          | 0/236735 [00:00<?, ?it/s]

In [203]:
## process the sourceref part of what we just collected
cvdr_sourceref_names = collections.defaultdict( list )  # BWB-id -> list of names used
count_ignoring       = collections.defaultdict( int  )  #   type -> amount we ignored

for xml_url, sourceref_list in cvdr_sref_data.items():
    for type, orig, bwbr, parts, spectext in sourceref_list:
        if type=='BWB':
            # The text may often be something like "artikel 5 van de wet Bla" or "Artikel 5, wet Bla" and we want just "Bla"
            #   try to find the name by taking off specific bits of reference
            #   the references themselves are fairly clean to start with, so this works moderately well
            name = cleanup_basics( spectext )

            cvdr_sourceref_names[bwbr].append( name )

            if 0:
                # this is unrelated -- was trying to see which keys are used in the jci details,  because jci doesn't actually seem to define that
                # this next line is most of them
                for known in ('boek', 'hoofdstuk', 'titeldeel', 'afdeling','afd', 'artikel', 'lid', 'paragraaf', 'bijlage'):
                    if known in parts:
                        parts.pop( known )
                if 'g' in parts:
                    g = parts.pop('g')
                if len(parts)>0:
                    print('PARTSLEFT in %s: %s'%(xml_url, parts), [spectext, bwbr, parts])
        else:
            count_ignoring[type] += 1

print( f'ignored: {dict(count_ignoring)}; names:{len(cvdr_sourceref_names)}' )

ignored: {'CVDR': 6577}; names:913


In [227]:
# You could output all of that, but WARNING: ~50k lines
#cvdr_sourceref_names
# so, instead, a sample:
random.sample( list(cvdr_sourceref_names.items()), 10 )

[('BWBR0033723',
  ['Wet verplichte meldcode huiselijk geweld en kindermishandeling',
   'Besluit verplichte meldcode huiselijk geweld en kindermishandeling',
   'Besluit verplichte meldcode huiselijk geweld en kindermishandeling',
   'Besluit verplichte meldcode huiselijk geweld en kindermishandeling',
   'Besluit verplichte meldcode huiselijk geweld en kindermishandeling']),
 ('BWBR0044970',
  ['Wijzigingswet Wet natuurbescherming en Omgevingswet (stikstofreductie en natuurverbetering)',
   'Wijzigingswet Wet natuurbescherming en Omgevingswet (stikstofreductie en natuurverbetering)',
   'Wijzigingswet Wet natuurbescherming en Omgevingswet (stikstofreductie en natuurverbetering)']),
 ('BWBR0012002', ['Vreemdelingenwet 2000']),
 ('BWBR00005181', ['Woningwet']),
 ('BWBR0008074',
  ['Reglement rijbewijzen',
   'RR',
   'Reglement Rijbewijzen',
   'Reglement rijbewijzen',
   'Reglement rijbewijzen',
   'Reglement rijbewijzen',
   'amvb Reglement rijbewijzen',
   'Reglement rijbewijzen',
 

In [206]:
## and process the extref part
# The contents are more varied than in the above BWB case.
#  Out of interest, we try to know about most of them even though we ignore them for this use

_jcilike1 = re.compile(r'1\.[0-9]+:[cv]:(BW[BRWVB0-9]+)') # something like '1.0:v:BWBR0005537' which seems like an abuse of jci
_jcilike2 = re.compile(r'1\.[0-9]+:(CVDR[0-9_]+)')        # something like '1.1:CVDR215805_1' which seems like a non-standard imitation of jci

cvdr_extrefs = []

# try to keep track of how many cases we're using, or ignoring for cleanliness's sake
count_ignored, count_matched, count_division  = 0, 0, collections.defaultdict(int)

for xml_url, extrefs in cvdr_extref_data.items():
#for xml_url, extrefs in list(cvdr_extref_data.items())[:5000]:
    for extref in extrefs:
        if extref.text is None:
            continue

        matching = False
        ignoring = 0  # 0 meaning not, 1 meaning tentatively, 2 meaning thoroughly

        value = extref.attrib.get('doc', None)
        if value is not None: # the attribute may be absent; calling .strip() on None would raise
            value = value.strip()
        # that value is often an ID or URL
        #   in theory the 'struct' attribute tells us how to interpret the value, but it's not there much of the time, so we try not to rely on it
        
        #print( 'VALUE   %r '%value)
        if re.match(r'[\[\]0-9\s]', extref.text): # numbered but nameless references like '[1]'   (note: this tests only the first character)
            ignoring = 2
            count_division['justnumber'] += 1

        if value is None:
            ignoring = 2
            count_division['none'] += 1

        else: # value is not None:
            name         = name_from_extref_tag( extref )                        # should give a cleaner answer, but answers None more easily
            name_fuzzier = name_from_extref_tag( extref, allow_fuzzier=True )    # used as a fallback if the previous is None
                                                                                 #   the code for that is in specific sections, because it can vary per extref type
                                                                                 #   though note that means each section can forget, so... don't.
            _jcilike1m = _jcilike1.match( value )
            _jcilike2m = _jcilike2.match( value )

            if value == '':
                ignoring = 2
                count_division['empty'] += 1
            elif value.startswith('about:blank'):   # riiiight.
                ignoring = 2
                count_division['empty'] += 1
            elif value.startswith('bookmark://'):   
                ignoring = 2  
                count_division['pointless'] += 1
            elif value.startswith('mk:@MSITStore'): # that's a local file   (IE, CHM stuff)
                ignoring = 2
                count_division['local'] += 1
            elif value.startswith('file://'):       # more local files
                ignoring = 2
                count_division['local'] += 1
            elif value.startswith('Xopus.asp'):     # that's an internal reference, apparently https://www.koopoverheid.nl/documenten/instructies/2017/10/24/gebruikershandleiding-xopus-xml-editor
                ignoring = 2
                count_division['local'] += 1
            elif value.startswith('mailto:'):
                ignoring = 2  
                count_division['pointless'] += 1

            elif value.startswith('/cvdr'): # probably to a served image or PDF or such
                ignoring = 2
                count_division['relative'] += 1

            elif value.startswith('#'): # probably a page anchor - maybe interesting, actually?
                ignoring = 2  
                count_division['samepage'] += 1

            ## cases we can probably use:
            elif value.startswith('http://wetten.overheid.nl/cgi-bin/deeplink/'):
                # will look like
                #   http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Burgerlijk%20Wetboek%20Boek%201
                #   http://wetten.overheid.nl/cgi-bin/deeplink/law1/bwbid=BWBR0005537/article=1:2
                # Those are not query parameters, and I've not found the standard this might be following, so some estimation is involved here
                rest = value[value.index('law1/')+5:]

                if rest.startswith('title%3D'): # probably inconsistent escaping?
                    ignoring = 1
                    count_division['badesc'] += 1
                elif rest.startswith('bwbid%3D'):
                    ignoring = 1
                    count_division['badesc'] += 1

                elif rest.startswith('title='): # quite a few of these
                    from wetsuite.extras.lawref import resolve_deeplink_bwbid
                    bwbid = resolve_deeplink_bwbid( value ) # does fetches, and caches them. First run will be slower
                    if bwbid is None:
                        ignoring = 1 # 2?
                        count_division['could not resolve deeplink'] += 1
                    else:

                        if name is None and name_fuzzier is None:  # this is copied deeper into the logic (a few times) because it's a less common reason to give up / fall back
                            name = name_fuzzier

                            count_division['uninteresting extref text?'] += 1
                            ignoring = 2 # maybe 1 to report, unless we're printing:
                            #print('SKIP uninteresting extref text (None or %r from %r)'%( name_fuzzier, extref.text) )
                        else:
                            name = name or name_fuzzier
                            
                            tt = cleanup_basics( name )

                            if '/' in rest:
                                count_division['detailed title deeplink'] += 1
                                matching = True
                                cvdr_extrefs.append( ['BWB-deeplink-title-detailed', bwbid, tt])
                                #print(['TEST1.1', bwbid, tt, name])
                                #raise ValueError( 'SKIP FOR NOW, specific reference   in  %r'%rest )
                            else:
                                count_division['basic title deeplink'] += 1
                                matching = True
                                #title = urllib.parse.unquote(rest[6:])
                                #print(['TEST1.2', bwbid, tt, name])
                                cvdr_extrefs.append( ['BWB-deeplink-title', bwbid, tt]) # fall back to the below style
                                # TODO: actually fetch these to see what identifier they end up on to

                elif rest.startswith('bwbid='):
                    if name is None and name_fuzzier is None:  # this is copied deeper into the logic (a few times) because it's a less common reason to give up / fall back
                        name = name_fuzzier

                        count_division['uninteresting extref text?'] += 1
                        ignoring = 2 # maybe 1 to report, unless we're printing:
                        #print('SKIP uninteresting extref text (None or %r from %r)'%( name_fuzzier, extref.text) )
                    else:
                        name = name or name_fuzzier

                        if '/' in rest:
                            tt = cleanup_basics( name )
                            bwbid = urllib.parse.unquote( rest[rest.index('=')+1:rest.index('/')] )
                            #print(['TEST2.1', bwbid, name])
                            
                            ignoring = 1 # TEMPORARILY 2
                            count_division['deeplink too detailed; TODO'] += 1
                            #raise ValueError( 'SKIP FOR NOW, specific reference   in  %r'%rest )
                        else:
                            matching = True
                            bwbid = urllib.parse.unquote(rest[6:])
                            #print(['TEST2.2', bwbid, name])
                            cvdr_extrefs.append( ['BWB-deeplink-bwbid', bwbid, name]) # fall back to the below style

                else:
                    raise ValueError( 'TODO: deal with  %r'%rest )


            elif value.startswith('http://wetten.overheid.nl/') or value.startswith('https://wetten.overheid.nl/'):
                # non-deeplinks; assume these will look something like
                #   http://wetten.overheid.nl/BWBR0012059/geldigheidsdatum_24-09-2008#Hoofdstuk4_Artikel37
                # or
                #   http://wetten.overheid.nl/jci1.3:c:BWBR0001941&amp;artikel=2&amp;z=2017-05-25&amp;g=2017-05-25
                count_division['wetten.overheid.nl'] += 1

                won_match1 = re.search(r'wetten\.overheid\.nl/(BW[^/]+)(?:[/]|$)', value)
                won_match2 = re.search(r'wetten\.overheid\.nl/jci[0-9.]+:[cv]:(BW[0-9A-Z]+)(?:[&]|$)', value) # seems we could use
                
                if won_match1 is not None:
                    if name is None and name_fuzzier is None:
                        ignoring = 1
                        name = name_fuzzier
                    else:
                        matching = True 
                        name = name or name_fuzzier
                        #print(['WON1', won_match1.group(1), name, value] )
                        cvdr_extrefs.append( ['won1', won_match1.group(1), name])

                elif won_match2 is not None:
                    if name is None and name_fuzzier is None:
                        ignoring = 1
                        name = name_fuzzier
                    else:
                        matching = True 
                        name = name or name_fuzzier
                        #print(['WON2', won_match2.group(1), name, value] )
                        cvdr_extrefs.append( ['won2', won_match2.group(1), name])

                else:
                    ignoring = 1
                    #print(['NOWON', value ])


            # other http[s]:// is probably less meaningful.   A few may still be useful for other reasons, but we are currently not looking for them
            elif value.lower().startswith('http://') or value.lower().startswith('https://'):
                ignoring = 1  # was TEMPORARILY 2, while figuring out the deeplink stuff
                count_division['otherlink; CONSIDER'] += 1
                # eg. http://decentrale.regelgeving.overheid.nl/ might still be interesting, but most of these are less interesting
                #print( value)

            elif value.lower().startswith('www.'):
                ignoring = 1
                count_division['otherlink; MAYBE'] += 1
                # these all seem uninteresting
                #print( value)

            else: # This maybe should just be part of the same if-elif chain above, because there is no longer a clean split of cases

                name = name_from_extref_tag( extref )
                if name is None:
                    ignoring = 2  # (don't print identifier part if we skip it for text reference reasons)
                    count_division['boringname?'] += 1
                    #print('SKIP, apparently boring name: %r'%extref.text)

                elif value.startswith('CVDR://'): # 'CVDR://97153_2'
                    matching = True
                    ref = wetsuite.helpers.koop_parse.cvdr_parse_identifier( value[7:] )
                    cvdr_extrefs.append( ['CVDRurl', ref, name])

                elif value.startswith('BWB://'): # e.g. 'BWB://1.0:v:BWBR0011468&artikel=76'
                    try:
                        ref = wetsuite.helpers.meta.parse_jci( value[6:] )
                        cvdr_extrefs.append( ['BWBurl', ref, name])
                        matching = True 
                    except Exception as e:
                        print([value, e])
                        ignoring = 1 

                elif _jcilike1m is not None:
                    matching = True 
                    cvdr_extrefs.append( ['jcilike-bwb', _jcilike1m.group(1), name])

                elif _jcilike2m is not None:
                    matching = True 
                    cvdr_extrefs.append( ['jcilike-cvdr', _jcilike2m.group(1), name])

        if matching:
            count_matched += 1
        elif ignoring:
            count_ignored += 1

        if 0:  # report things we didn't handle,  and/or  couldn't decide on
            say = ''
            if not matching  and  ignoring==0:
                say = 'DID NOT RECOGNIZE:  '
            elif ignoring == 1:
                say = 'IGNORING, MAYBE LATER: '

            if say:
                extref.tail = None            
                print( say, wetsuite.helpers.etree.tostring( extref ).decode('utf8') )
                #print( 'ATTRIB ', extref.attrib )
                #print( 'DOC    ', doc)

print("Ignored %d and took information from %d items"%( count_ignored, count_matched) )
pprint.pprint( count_division )

['BWB://1.0:v:271829_1', ValueError("'1.0:v:271829_1' does not look like a valid jci")]
Ignored 127622 and took information from 84126 items
defaultdict(<class 'int'>,
            {'badesc': 50,
             'basic title deeplink': 14154,
             'boringname?': 16183,
             'could not resolve deeplink': 14147,
             'deeplink too detailed; TODO': 262,
             'detailed title deeplink': 21484,
             'empty': 717,
             'justnumber': 29386,
             'local': 3222,
             'otherlink; CONSIDER': 35782,
             'otherlink; MAYBE': 775,
             'pointless': 1107,
             'relative': 29529,
             'samepage': 7939,
             'uninteresting extref text?': 10209,
             'wetten.overheid.nl': 29659})
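
The deeplink handling above relies on string slicing per case. As a hedged alternative sketch, the path after `law1/` can be split into `key=value` segments in one go. The function name `parse_deeplink` is hypothetical, and since we have not found a specification for these URLs, the assumed structure is only an estimate based on the two example forms mentioned in the code comments.

```python
import urllib.parse

def parse_deeplink(url):
    """Split a wetten.overheid.nl deeplink into its key=value path segments.

    Assumes (this is a guess, not a documented format) that everything after
    'law1/' is a '/'-separated series of key=value pairs.
    Returns a dict, or None if the URL does not contain 'law1/'.
    """
    marker = 'law1/'
    if marker not in url:
        return None
    rest = url[url.index(marker) + len(marker):]

    parts = {}
    for segment in rest.split('/'):
        # unquote after splitting, so an escaped '=' (%3D) is also recognized
        segment = urllib.parse.unquote(segment)
        if '=' in segment:
            key, value = segment.split('=', 1)
            parts[key] = value
    return parts

print(parse_deeplink('http://wetten.overheid.nl/cgi-bin/deeplink/law1/bwbid=BWBR0005537/article=1:2'))
print(parse_deeplink('http://wetten.overheid.nl/cgi-bin/deeplink/law1/title=Burgerlijk%20Wetboek%20Boek%201'))
```

Because each segment is unquoted before looking for `=`, this would also read the `title%3D` and `bwbid%3D` variants that the cell above counts as 'badesc'.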


### Summarize what we got from CVDR

In [207]:
# source refs
print( '%d source references, to %d BWB-ids'%( 
    sum(  list( len(names)   for _, names in cvdr_sourceref_names.items() )  ),  
    len(  cvdr_sourceref_names  ) 
) )

print('\nThe most common:')
#   we take from a dict and put into a list so we can sort by count
sortable = []
for bwbid, names in cvdr_sourceref_names.items():
    #print()
    name_count = wetsuite.helpers.strings.count_normalized( names, min_word_length=1, stopwords=[], min_count=0.3 )
    sortable.append( (sum(name_count.values()), bwbid, name_count) )
sortable.sort(key=lambda x:x[0], reverse=True) # first column (count) descending
pprint.pprint(sortable[:30])

246302 source references, to 913 BWB-ids

The most common:
[(100300, 'BWBR0005416', {'Gemeentewet': 100300}),
 (26124, 'BWBR0005537', {'Algemene wet bestuursrecht': 26124}),
 (9795, 'BWBR0015703', {'Participatiewet': 9795}),
 (8299, 'BWBR0035362', {'Wet maatschappelijke ondersteuning 2015': 8299}),
 (4696, 'BWBR0003245', {'Wet milieubeheer': 4696}),
 (4375, 'BWBR0005212', {'Paspoortwet': 4375}),
 (3794, 'BWBR0034925', {'Jeugdwet': 3794}),
 (3568, 'BWBR0033715', {'Wet basisregistratie personen': 3568}),
 (2707, 'BWBR0005108', {'Waterschapswet': 2707}),
 (2689, 'BWBR0024779', {'Wet algemene bepalingen omgevingsrecht': 2689}),
 (2505, 'BWBR0035303', {'Huisvestingswet 2014': 2505}),
 (2348,
  'BWBR0041522',
  {'Rechtspositiebesluit decentrale politieke ambtsdragers': 2348}),
 (2057, 'BWBR0007376', {'Archiefwet 1995': 2057}),
 (2053, 'BWBR0005645', {'Provinciewet': 2053}),
 (2010, 'BWBR0002458', {'Alcoholwet': 595, 'Drank- en Horecawet': 1415}),
 (1891,
  'BWBR0004044',
  {'Wet inkomensvoor

In [208]:
print(  'Extracted %d interesting extrefs  (~%d different names)'%(
    len(cvdr_extrefs),
    len( set(nm   for _,_,nm  in cvdr_extrefs))
)  )

Extracted 84126 interesting extrefs  (~7792 different names)


In [209]:
# Go through what the above collected, put it into a dict like the previous sections did
cvdr_extrefs_filtered = collections.defaultdict(list)   # BWB-id -> list of name strings


print( '%d extrefs'%( len( cvdr_extrefs ) ) )

for typ, ref, text in cvdr_extrefs[:2000]: # NOTE: only the first 2000 collected extrefs are processed here
    if typ in ('CVDRurl','jcilike-cvdr'): # not currently interested in these references
        continue
    
    if text is None:
        print( 'SKIP ', typ, ref, text )
        continue

    elif typ == 'BWB-deeplink-title':           # ref is a bwbid
        cvdr_extrefs_filtered[ref].append( text )
#        print( ref, text )
        #print( typ, ref, text)
        #if bwbid is not None:
        #    #print( ref)
        #    #print( typ, ref, text )
        #    cvdr_extrefs_filtered[bwbid].append( text )
        #else:
        #    print("DID NOT RESOLVE %r"%ref)

    elif typ == 'BWB-deeplink-title-detailed':  # ref is a bwbid (hopefully)
        # this may need a little more inspection
        cvdr_extrefs_filtered[ref].append( text )

    elif typ == 'BWBurl':               # ref is a parsed dict
        bwbid = ref.get('bwb')
        cvdr_extrefs_filtered[bwbid].append( text )

    elif typ == 'jcilike-bwb':               # ref is a bwb id
        cvdr_extrefs_filtered[ref].append( text )

    elif typ in ('won1','won2'):
        cvdr_extrefs_filtered[ref].append( text )

    else:
        print( 'UNHANDLED TYPE %r (%r, %r)' %( typ, ref, text ) )

84126 extrefs


In [None]:
# ~2000 lines, only see this when you want to?
# cvdr_extrefs_filtered

defaultdict(list,
            {'BWBR0006622': ['Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet 1994',
              'Wegenverkeerswet',
  

In [211]:
sortable = []
for bwbid, names in cvdr_extrefs_filtered.items():
    #print(bwbid, names)
    name_count = wetsuite.helpers.strings.count_normalized( names, min_word_length=1, stopwords=[], min_count=0.3 )
    sortable.append( (sum(name_count.values()), bwbid, name_count) )
sortable.sort(key=lambda x:x[0], reverse=True)
pprint.pprint(sortable)

[(247, 'BWBR0005416', {'Gemeentewet': 247}),
 (177, 'BWBR0005537', {'Algemene wet bestuursrecht': 118, 'Awb': 59}),
 (139, 'BWBR0015703', {'WWB': 139}),
 (75, 'BWBR0003245', {'Wet milieubeheer': 75}),
 (64, 'BWBR0004044', {'IOAW': 23, 'Ioaw': 41}),
 (53, 'BWBR0004163', {'IOAZ': 18, 'Ioaz': 35}),
 (52, 'BWBR0005181', {'Woningwet': 52}),
 (48, 'BWBR0001854', {'Wetboek van Strafrecht': 48}),
 (37, 'BWBR0013798', {'Wet BIBOB': 14, 'Wet Bibob': 23}),
 (36, 'BWBR0006622', {'Wegenverkeerswet 1994': 36}),
 (32, 'BWBR0002458', {'Drank- en Horecawet': 32}),
 (30,
  'BWBR0017017',
  {'Wet kinderopvang': 10,
   'Wet kinderopvang en kwaliteitseisen peuterspeelzalen': 20}),
 (30, 'BWBR0001941', {'Opiumwet': 30}),
 (25,
  'BWBR0004825',
  {'RVV 1990': 18, 'Reglement Verkeersregels en Verkeerstekens 1990': 7}),
 (24, 'BWBR0002469', {'Wet op de Kansspelen': 6, 'Wet op de kansspelen': 18}),
 (23, 'BWBR0003549', {'Wet op de expertisecentra': 23}),
 (20, 'BWBR0003420', {'Wet op het primair onderwijs': 20}

## Finally do something with all that data

To review, we made
* `bwb_names_citeertitel`,
* `bwb_names_aanhaling`, 
* `bwb_names_extref`, 
* `cvdr_sourceref_names`,
* `cvdr_extrefs_filtered`

...each a dict from BWB-id to a list of names used to refer to it
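
As a small illustration of working with that shared shape, the sketch below inverts several such dicts into a name-to-BWB-id index, while also remembering which source each name came from. The function `names_to_bwbids`, the source labels, and the toy BWB-ids and names are all made up for this example.

```python
import collections

def names_to_bwbids(sources):
    """Invert {label: {bwbid: [name, ...]}} into {name: {bwbid: {label, ...}}}.

    Hypothetical helper for illustration; keeping the label makes it easier
    to see later which source an odd name came from.
    """
    index = collections.defaultdict(lambda: collections.defaultdict(set))
    for label, names_by_bwbid in sources.items():
        for bwbid, names in names_by_bwbid.items():
            for name in names:
                index[name][bwbid].add(label)
    return index

# Toy data (made-up ids and names), in the same shape as the dicts listed above
toy_sources = {
    'citeertitel': {'BWBR0000001': ['Voorbeeldwet']},
    'extref':      {'BWBR0000001': ['Voorbeeldwet', 'Vw'], 'BWBR0000002': ['Vw']},
}

index = names_to_bwbids(toy_sources)
ambiguous = {name: sorted(bwbids) for name, bwbids in index.items() if len(bwbids) > 1}
print(ambiguous)
```

A name is ambiguous here when it maps to more than one BWB-id; the stored labels then show whether that came from a law's own metadata or from someone else's reference.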

### Report ambiguous names

In [None]:
conflict_data = collections.defaultdict(set) # name -> set of bwbids

# the same collection step for each of the sources we consider here
for names_by_bwbid in (bwb_names_citeertitel, bwb_names_aanhaling, bwb_names_extref, cvdr_extrefs_filtered):
    for bwbid in names_by_bwbid:
        for name in set( names_by_bwbid[bwbid] ):
            #if len(name) > 5: # quick way to focus on abbreviations
            #    continue
            conflict_data[name].add(bwbid)


# ~2000 cases (TODO: look into that) so let's see just a handful of them
cases = 0
for name, bwbids in conflict_data.items():
    if len(bwbids)>1:
        print(" %r can refer to any of %r"%(name, sorted(bwbids)))
        cases += 1
        if cases > 20:
            break

# A whole bunch of those seem to be mistakes - either this code's, or the actual citation's.
#   TODO: figure out where they are from  (it's probably a good idea to have the data collection also carry the place they were taken from)

 'Wetboek van Burgerlijke Rechtsvordering' can refer to any of ['BWBR0001827', 'BWBR0039872']
 'Wet op de rechterlijke organisatie' can refer to any of ['BWBR0001830', 'BWBR0002170']
 'Wet algemene bepalingen' can refer to any of ['BWBR0001833', 'BWBR0024779']
 'Grondwet' can refer to any of ['BWBR0001840', 'BWBR0002656']
 'Wet op de Parlementaire Enquête' can refer to any of ['BWBR0001841', 'BWBR0023825']
 'Wetboek van Strafrecht' can refer to any of ['BWBR0001854', 'BWBR0001903']
 'Waterstaatswet 1900' can refer to any of ['BWBR0001867', 'BWBR0002505']
 'Rijksoctrooiwet' can refer to any of ['BWBR0001879', 'BWBR0007118']
 'Ziektewet' can refer to any of ['BWBR0001888', 'BWBR0001987', 'BWBR0002460', 'BWBR0002524']
 'Veewet' can refer to any of ['BWBR0001900', 'BWBR0006727']
 'Wetboek van Strafvordering' can refer to any of ['BWBR0001854', 'BWBR0001903', 'BWBR0001926', 'BWBR0019359', 'BWBR0028681']
 'Opiumwet' can refer to any of ['BWBR0001941', 'BWBR0003060', 'BWBR0003063']
 'Warenwet

### Merge all useful bits

In [213]:
merged_data = [] # data for a name-searching webpage: 
# list of   [BWBR, [self names], [other names] ]

merged_ids = set()
merged_ids.update( bwb_names_citeertitel )
merged_ids.update( bwb_names_aanhaling )
merged_ids.update( bwb_names_extref )

for bwbid in sorted(merged_ids, reverse=True):
    self_names  = []
    other_names = []
    
    if bwbid in bwb_names_citeertitel:
        for name in bwb_names_citeertitel[bwbid]:
            if name not in self_names:
                self_names.append( name )

    if bwbid in bwb_names_aanhaling:
        for name in bwb_names_aanhaling[bwbid]:
            if name not in self_names:
                self_names.append( name )

    if bwbid in cvdr_sourceref_names:
        for name in cvdr_sourceref_names[bwbid]:
            if name not in other_names:
                other_names.append( name )


    # the next two are messier sources, so we try to be stricter about what we take from it

    if bwbid in bwb_names_extref:
        name_count = wetsuite.helpers.strings.count_normalized( bwb_names_extref[bwbid], normalize_func=lambda s:s.lower(), min_word_length=1, stopwords=[], min_count=0.001 )
        for name, count in name_count.items():
            #if count >= 3:
            #print('ADD   %-12s  %4s   %r' %( bwbid, count, name ))
            if name not in self_names and name not in other_names:
                other_names.append(name)

    if bwbid in cvdr_extrefs_filtered:
        name_count = wetsuite.helpers.strings.count_normalized( cvdr_extrefs_filtered[bwbid], normalize_func=lambda s:s.lower(), min_word_length=1, stopwords=[], min_count=0.001 )
        for name, count in name_count.items():
            #if count >= 3:
            #print('ADD   %-12s  %4s   %r' %( bwbid, count, name ))
            if name not in self_names and name not in other_names:
                other_names.append(name)


    merged_data.append([ bwbid, self_names, other_names ] )
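
The `#if count >= 3` lines above are commented out, so the loops themselves no longer apply a count threshold (any filtering is left to the `min_count` argument). As a minimal sketch of what a stricter filter could look like, the following counts names case-insensitively and keeps only those seen often enough. The helper `count_names_casefolded`, its threshold and the example input are hypothetical and written only for this illustration; this is not how `wetsuite.helpers.strings.count_normalized` is implemented.

```python
from collections import Counter

def count_names_casefolded(names, min_count=1):
    """ Count names while ignoring case; report each under its most common spelling.
        Hypothetical helper for illustration, not part of wetsuite.
    """
    per_key = {}  # lowercased name -> Counter of the spellings seen
    for name in names:
        per_key.setdefault(name.lower(), Counter())[name] += 1

    result = {}
    for spellings in per_key.values():
        total = sum(spellings.values())
        if total >= min_count:  # drop names that appear too rarely
            most_common_spelling = spellings.most_common(1)[0][0]
            result[most_common_spelling] = total
    return result

# toy input, not taken from the real reference data
example_refs = ['Opiumwet', 'opiumwet', 'Opiumwet', 'Opium wet']
frequent = count_names_casefolded(example_refs, min_count=3)
print(frequent)
```

With a threshold like this, a one-off variant such as `'Opium wet'` would be dropped, while the case variants of `'Opiumwet'` are counted together.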

In [226]:
# some examples as sanity check
pprint.pprint( random.sample( merged_data, 30 ) )
#pprint.pprint( merged_data )

[['BWBR0017991', ['Vaststellingsbesluit Nationaal Frequentieplan 2005'], []],
 ['BWBR0030787',
  ['Verordening Bestemmingsheffing Fonds voor Wetenschappelijk Onderzoek en '
   'Voorlichting 2009'],
  []],
 ['BWBR0016561',
  ['Vaststellingsbesluit selectielijst neerslag handelingen op het '
   'beleidsterrein Waterstaat over de periode (1911–) 1945–2001 (Minister van '
   'Sociale Zaken en Werkgelegenheid)'],
  ['minister van Onderwijs, Cultuur en Wetenschap en de minister van Sociale '
   'Zaken en Werkgelegenheid']],
 ['BWBR0018632',
  ['Vaststellingsbesluit selectielijst neerslag handelingen Koninklijk '
   'Onderwijsfonds voor de Scheepvaart, beleidsterrein Volwasseneneducatie en '
   'beroepsonderwijs over de periode vanaf 1996'],
  []],
 ['BWBR0004799', ['Legesbesluit visa'], []],
 ['BWBR0003516',
  ['Landbouwkwaliteitsbeschikking vrijstelling bepaalde keuringsvoorschriften '
   'groenten en fruit'],
  []],
 ['BWBR0003321',
  ['Wet verlening van vorderingsbevoegdheid in verband me

In [None]:
wet_store = wetsuite.helpers.localdata.MsgpackKV('wetnamen.db', key_type=str, value_type=None)
wet_store.truncate() # remove current contents
wet_store._put_meta('description_short','''A name for each BWB-ID''')

wet_store._put_meta('description','''
This dataset tries to give the varied names for each law.

This is a map from BWB-ID to two lists:
- the names the law has for itself
- the varied names other things use for it

The collection can usually use some refining - there is some unnecessary pollution in the strings.
''' + wetsuite.datasets.generated_today_text())

for bwbid, self_names, other_names in merged_data:
    wet_store.put( key=bwbid, value=[self_names, other_names], commit=False )
wet_store.commit()

In [217]:
# This is targeted at an HTML page that searches them live
import json

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True), so encoding as ASCII is safe
jd = json.dumps( merged_data ).encode('ascii')

# the .js variant is the same JSON wrapped in an assignment to namedata
with open('wetnamen.js','wb') as js_file:
    js_file.write( b'namedata = ' )
    js_file.write( jd )
    js_file.write( b';' )

with open('wetnamen.json','wb') as json_file:
    json_file.write( jd )
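
As a rough sketch of how the written data might be used, the following inverts the `[bwbid, self_names, other_names]` rows into a lookup from lowercased name to BWB-IDs, similar in spirit to the "can refer to any of" listing earlier. The helper `names_to_ids` and the toy rows are assumptions made for this example (in particular, which list each name sits in is made up); real use would read `wetnamen.json` with `json.load` instead.

```python
from collections import defaultdict

def names_to_ids(rows):
    """ Invert [bwbid, self_names, other_names] rows into
        lowercased name -> sorted list of BWB-IDs.
        Hypothetical helper for illustration.
    """
    lookup = defaultdict(set)
    for bwbid, self_names, other_names in rows:
        for name in self_names + other_names:
            lookup[name.lower()].add(bwbid)
    return {name: sorted(ids) for name, ids in lookup.items()}

# toy rows in the same shape as merged_data
toy_rows = [
    ['BWBR0004799', ['Legesbesluit visa'], []],
    ['BWBR0001900', ['Veewet'],            []],
    ['BWBR0006727', [],                    ['veewet']],
]
lookup = names_to_ids(toy_rows)

# names that point at more than one BWB-ID are the ambiguous ones
ambiguous = {name: ids for name, ids in lookup.items() if len(ids) > 1}
print(ambiguous)
```

Lowercasing the keys is a design choice for the sketch: it makes the lookup forgiving about capitalization, at the cost of losing the original spelling.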