Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new patterns field #38

Merged
merged 4 commits into from Mar 18, 2021
Merged

Conversation

jcushman
Copy link
Contributor

Following up on the discussion over here, this adds a proposed new field under each edition called "patterns" that specifies regexes mapping to that edition. I'm filing this now for comment before getting too far into implementation.

To simplify the regexes, this follows the example of courts-db by expanding placeholders from variables.json, but with the addition of recursive resolution and some additional special-case expansion for "$edition". Here's the intended expansion of the first entry, for example:

"Bankr. L. Rep.": [
        {
            "editions": {
                "Bankr. L. Rep.": {
                    "patterns": [
                        "$paragraph_cite"
                    ],
                }
            },
            "variations": {
                "Bankr. L. Rep. (CCH)": "Bankr. L. Rep."
            }
        }

Expansion steps:

  • Recursive expansion from variables.json:
    • "$paragraph_cite" ->
    • "$reporter $paragraph_marker_optional$page_with_commas" ->
    • "(?P<reporter>$edition) (?:[P¶] ?)?(?P<page>\d(?:[\d,]*\d)?)" ->
  • Then fanout by replacing $edition (if any) with the edition key and variations:
    • "(?P<reporter>Bankr\. L\. Rep\.) (?:[P¶] ?)?(?P<page>\d(?:[\d,]*\d)?)"
    • "(?P<reporter>Bankr\. L\. Rep\. \(CCH\)) (?:[P¶] ?)?(?P<page>\d(?:[\d,]*\d)?)"

This seems a little complicated when it's all spelled out, but I'm hoping that it makes things both DRY and easy to express -- the underlying reality is that Bankr. L. Rep. is cited as a reporter string or variation, sometimes a paragraph marker, and then a paragraph number, as are a number of other reporters, and this hopefully expresses that pretty clearly and with minimal redundancy.

Other comments:

  • Naming the things in variables.json is hard and I'm open to improvements!
  • I kept both "cite_format" and "regexes" keys in for now in case those are in use elsewhere, though this could potentially replace both.
  • I called this "patterns" instead of "regexes" to make clear to anyone transitioning that it's a different thing from those other fields. If that's unhelpful I could switch it back to "regexes" though.
  • I didn't yet include any code to do the expansion. Probably some helpers want to live here in reporters-db, but I thought maybe it would be best to add those while integrating with eyecite. There'll need to be flexibility to do stuff like patching in a more complicated regex for "$page" so I'm not quite sure what the API wants to be yet.
  • It would probably be a good idea for the tests to require that each edition with "patterns" also have a string under "examples" that is matched by each pattern, which would go a long way to detecting typos. I didn't do that yet though.

@mlissner
Copy link
Member

This checks all the boxes for me, @jcushman. It's a bit tricky to understand, but your example is probably a worst case, and it makes sense, so I think we're there.

A couple replies:

  • Naming the things in variables.json is hard and I'm open to improvements!

I stared at this for a bit. The only idea I had was to try to do namespacing. I guess that could take a couple forms, but I could imagine something like:

variables = {
    "page": {"$with_commas": xyz, "$with_paragraph_optional": abc},
    "reporter": {"$single_volume": foo, ...},
    "cite": {"$with_paragraph_and_suffix": bar, ...},
}

That'd just be for organizational purposes, and somewhere we'd have a little function that flattened it into:

page_with_commas
page_with_paragraph_optional
reporter_single_volume
cite_with_paragraph_and_suffix

The other thing that might help is just putting comments into the variables, maybe? But of course it's JSON, so we've been there before. Maybe it could be a python file though. Why not? Is anybody going to actually import the JSON version for something? Probably not.

  • I kept both "cite_format" and "regexes" keys in for now in case those are in use elsewhere, though this could potentially replace both.

I vote to nuke both and do a major version bump. I don't think they're in use anywhere (they're relatively new and unfinished).

  • I called this "patterns" instead of "regexes" to make clear to anyone transitioning that it's a different thing from those other fields. If that's unhelpful I could switch it back to "regexes" though.

If we decide to nuke the old ones, regexes seems better to me, but either way.

  • I didn't yet include any code to do the expansion. Probably some helpers want to live here in reporters-db, but I thought maybe it would be best to add those while integrating with eyecite. There'll need to be flexibility to do stuff like patching in a more complicated regex for "$page" so I'm not quite sure what the API wants to be yet.

You mean you want to do a data-only PR for this and then do a second one later that does the python work for this? That seems totally fine by me. It's what we did for the current regexes ones.

  • It would probably be a good idea for the tests to require that each edition with "patterns" also have a string under "examples" that is matched by each pattern, which would go a long way to detecting typos. I didn't do that yet though.

I'd love that idea, and this was something I was hoping for and planning to comment on before I got to the end of your message. I think this would go a long ways and it shouldn't be hard particularly.

Thanks Jack, this is awesome.

@jcushman
Copy link
Contributor Author

jcushman commented Mar 17, 2021

  • Updated variables.json to use nested dictionaries, which flatten back down as you suggested, and also to ignore fields ending in '#' as comments. The file does seem more readable. I like keeping it as a neutral file instead of python in case someone wants to adapt this to JS or something later. I could see it wanting to be yaml, but not sure the extra dependency is worth it.
  • Renamed "patterns" to "regexes", and dropped the existing regexes and cite_format fields.
  • Added a test that ensures that all custom regexes match at least one example, and if there are custom regexes, all examples are matched by at least one custom regex.
  • In adding examples for each custom regex I noticed that real life citations to "Unemployment Ins. Rep." are pretty thorny, so I expanded the regex to deal with them. I think it's a nice demonstration -- that would be a challenging one to handle before, but it comes out pretty manageable.

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic. A couple little suggestions, if you don't mind.

tests.py Outdated
Comment on lines 311 to 312
'Possible example: "%s"'
% exrex.getone(regexes[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be wise to have it generate 10 or 20 examples instead?

reporters_db/data/reporters.json Show resolved Hide resolved
reporters_db/__init__.py Outdated Show resolved Hide resolved
@jcushman
Copy link
Contributor Author

I renamed VARIABLES and also variables.json, changed the error message to provide 10 examples, and updated the readme.

@mlissner mlissner merged commit 89696a5 into freelawproject:master Mar 18, 2021
@mlissner
Copy link
Member

Love it. Thank you @jcushman !

@mlissner
Copy link
Member

I'll do a release in a sec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants