Add new patterns field #38

jcushman · 2021-03-15T20:27:55Z

Following up on the discussion over here, this adds a proposed new field under each edition called "patterns" that specifies regexes mapping to that edition. I'm filing this now for comment before getting too far into implementation.

To simplify the regexes, this follows the example of courts-db by expanding placeholders from variables.json, but with the addition of recursive resolution and some additional special-case expansion for "$edition". Here's the intended expansion of the first entry, for example:

"Bankr. L. Rep.": [
        {
            "editions": {
                "Bankr. L. Rep.": {
                    "patterns": [
                        "$paragraph_cite"
                    ],
                }
            },
            "variations": {
                "Bankr. L. Rep. (CCH)": "Bankr. L. Rep."
            }
        }

Expansion steps:

Recursive expansion from variables.json:
- "$paragraph_cite" ->
- "$reporter $paragraph_marker_optional$page_with_commas" ->
- "(?P<reporter>$edition) (?:[P¶] ?)?(?P<page>\d(?:[\d,]*\d)?)" ->
Then fanout by replacing $edition (if any) with the edition key and variations:
- "(?P<reporter>Bankr\. L\. Rep\.) (?:[P¶] ?)?(?P<page>\d(?:[\d,]*\d)?)"
- "(?P<reporter>Bankr\. L\. Rep\. $CCH$) (?:[P¶] ?)?(?P<page>\d(?:[\d,]*\d)?)"

This seems a little complicated when it's all spelled out, but I'm hoping that it makes things both DRY and easy to express -- the underlying reality is that Bankr. L. Rep. is cited as a reporter string or variation, sometimes a paragraph marker, and then a paragraph number, as are a number of other reporters, and this hopefully expresses that pretty clearly and with minimal redundancy.

Other comments:

Naming the things in variables.json is hard and I'm open to improvements!
I kept both "cite_format" and "regexes" keys in for now in case those are in use elsewhere, though this could potentially replace both.
I called this "patterns" instead of "regexes" to make clear to anyone transitioning that it's a different thing from those other fields. If that's unhelpful I could switch it back to "regexes" though.
I didn't yet include any code to do the expansion. Probably some helpers want to live here in reporters-db, but I thought maybe it would be best to add those while integrating with eyecite. There'll need to be flexibility to do stuff like patching in a more complicated regex for "$page" so I'm not quite sure what the API wants to be yet.
It would probably be a good idea for the tests to require that each edition with "patterns" also have a string under "examples" that is matched by each pattern, which would go a long way to detecting typos. I didn't do that yet though.

mlissner · 2021-03-16T22:42:52Z

This checks all the boxes for me, @jcushman. It's a bit tricky to understand, but your example is probably a worst case, and it makes sense, so I think we're there.

A couple replies:

Naming the things in variables.json is hard and I'm open to improvements!

I stared at this for a bit. The only idea I had was to try to do namespacing. I guess that could take a couple forms, but I could imagine something like:

variables = {
    "page": {"$with_commas": xyz, "$with_paragraph_optional": abc},
    "reporter": {"$single_volume": foo, ...},
    "cite": {"$with_paragraph_and_suffix": bar, ...},
}

That'd just be for organizational purposes, and somewhere we'd have a little function that flattened it into:

page_with_commas
page_with_paragraph_optional
reporter_single_volume
cite_with_paragraph_and_suffix

The other thing that might help is just putting comments into the variables, maybe? But of course it's JSON, so we've been there before. Maybe it could be a python file though. Why not? Is anybody going to actually import the JSON version for something? Probably not.

I kept both "cite_format" and "regexes" keys in for now in case those are in use elsewhere, though this could potentially replace both.

I vote to nuke both and do a major version bump. I don't think they're in use anywhere (they're relatively new and unfinished).

I called this "patterns" instead of "regexes" to make clear to anyone transitioning that it's a different thing from those other fields. If that's unhelpful I could switch it back to "regexes" though.

If we decide to nuke the old ones, regexes seems better to me, but either way.

I didn't yet include any code to do the expansion. Probably some helpers want to live here in reporters-db, but I thought maybe it would be best to add those while integrating with eyecite. There'll need to be flexibility to do stuff like patching in a more complicated regex for "$page" so I'm not quite sure what the API wants to be yet.

You mean you want to do a data-only PR for this and then do a second one later that does the python work for this? That seems totally fine by me. It's what we did for the current regexes ones.

It would probably be a good idea for the tests to require that each edition with "patterns" also have a string under "examples" that is matched by each pattern, which would go a long way to detecting typos. I didn't do that yet though.

I'd love that idea, and this was something I was hoping for and planning to comment on before I got to the end of your message. I think this would go a long ways and it shouldn't be hard particularly.

Thanks Jack, this is awesome.

jcushman · 2021-03-17T17:57:24Z

Updated variables.json to use nested dictionaries, which flatten back down as you suggested, and also to ignore fields ending in '#' as comments. The file does seem more readable. I like keeping it as a neutral file instead of python in case someone wants to adapt this to JS or something later. I could see it wanting to be yaml, but not sure the extra dependency is worth it.
Renamed "patterns" to "regexes", and dropped the existing regexes and cite_format fields.
Added a test that ensures that all custom regexes match at least one example, and if there are custom regexes, all examples are matched by at least one custom regex.
In adding examples for each custom regex I noticed that real life citations to "Unemployment Ins. Rep." are pretty thorny, so I expanded the regex to deal with them. I think it's a nice demonstration -- that would be a challenging one to handle before, but it comes out pretty manageable.

mlissner

Fantastic. A couple little suggestions, if you don't mind.

mlissner · 2021-03-17T21:12:38Z

tests.py

+                                    'Possible example: "%s"'
+                                    % exrex.getone(regexes[0])


Would it be wise to have it generate 10 or 20 examples instead?

reporters_db/data/reporters.json

reporters_db/__init__.py

jcushman · 2021-03-18T13:41:03Z

I renamed VARIABLES and also variables.json, changed the error message to provide 10 examples, and updated the readme.

mlissner · 2021-03-18T19:58:31Z

Love it. Thank you @jcushman !

mlissner · 2021-03-18T19:58:39Z

I'll do a release in a sec.

mlissner mentioned this pull request Mar 16, 2021

Update reporters by comparing to CAP reporter list #39

Merged

jcushman added 2 commits March 17, 2021 13:43

Add test helpers to iterate reporters

8283ff6

Refactor regexes field using variables

85a5b75

jcushman force-pushed the patterns branch from 15e81bc to 85a5b75 Compare March 17, 2021 17:47

Separately expose raw variables

13a24d1

mlissner requested changes Mar 17, 2021

View reviewed changes

Regexes: update readme, rename file and variables, show more examples

1c70d2e

mlissner merged commit 89696a5 into freelawproject:master Mar 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add new patterns field #38

Add new patterns field #38

jcushman commented Mar 15, 2021

mlissner commented Mar 16, 2021

jcushman commented Mar 17, 2021 •

edited

mlissner left a comment

mlissner Mar 17, 2021

jcushman commented Mar 18, 2021

mlissner commented Mar 18, 2021

mlissner commented Mar 18, 2021

Add new patterns field #38

Add new patterns field #38

Conversation

jcushman commented Mar 15, 2021

mlissner commented Mar 16, 2021

jcushman commented Mar 17, 2021 • edited

mlissner left a comment

Choose a reason for hiding this comment

mlissner Mar 17, 2021

Choose a reason for hiding this comment

jcushman commented Mar 18, 2021

mlissner commented Mar 18, 2021

mlissner commented Mar 18, 2021

jcushman commented Mar 17, 2021 •

edited