Add new patterns field #38
This checks all the boxes for me, @jcushman. It's a bit tricky to understand, but your example is probably a worst case, and it makes sense, so I think we're there. A couple replies:
I stared at this for a bit. The only idea I had was to try to do namespacing. I guess that could take a couple forms, but I could imagine something like:
That'd just be for organizational purposes, and somewhere we'd have a little function that flattened it into:
The other thing that might help is just putting comments into the variables, maybe? But of course it's JSON, so we've been there before. Maybe it could be a python file though. Why not? Is anybody going to actually import the JSON version for something? Probably not.
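As a rough sketch of the namespacing idea combined with the Python-file idea: the variables could live in a nested dict with ordinary comments, and a small helper could flatten them into the names the patterns refer to. The group names, the `_` joiner, and the `flatten` helper are all assumptions for illustration, not anything that exists in the repo.

```python
# Hypothetical namespaced layout; group and key names are illustrative only.
NAMESPACED = {
    # Page-number building blocks
    "page": {
        "with_commas": r"(?P<page>\d(?:[\d,]*\d)?)",
    },
    # Paragraph-marker building blocks
    "paragraph": {
        "marker_optional": r"(?:[P¶] ?)?",
    },
}


def flatten(tree, prefix=""):
    """Collapse nested groups into flat names joined with underscores."""
    flat = {}
    for key, value in tree.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into the namespace, carrying the prefix along.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


FLAT = flatten(NAMESPACED)
```

Here `FLAT` ends up with the keys `page_with_commas` and `paragraph_marker_optional`, so patterns could keep using flat `$names` while the source file stays organized.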
I vote to nuke both and do a major version bump. I don't think they're in use anywhere (they're relatively new and unfinished).
If we decide to nuke the old ones, regexes seems better to me, but either way.
You mean you want to do a data-only PR for this and then do a second one later that does the python work for this? That seems totally fine by me. It's what we did for the current regexes ones.
I'd love that idea, and this was something I was hoping for and planning to comment on before I got to the end of your message. I think this would go a long way and it shouldn't be particularly hard. Thanks Jack, this is awesome.
Fantastic. A couple little suggestions, if you don't mind.
tests.py
Outdated
'Possible example: "%s"'
% exrex.getone(regexes[0])
Would it be wise to have it generate 10 or 20 examples instead?
I renamed
Love it. Thank you @jcushman !
I'll do a release in a sec.
Following up on the discussion over here, this adds a proposed new field under each edition called "patterns" that specifies regexes mapping to that edition. I'm filing this now for comment before getting too far into implementation.
To simplify the regexes, this follows the example of courts-db by expanding placeholders from variables.json, but with the addition of recursive resolution and some additional special-case expansion for "$edition". Here's the intended expansion of the first entry, for example:
Expansion steps:
"$paragraph_cite"
-> "$reporter $paragraph_marker_optional$page_with_commas"
-> "(?P<reporter>$edition) (?:[P¶] ?)?(?P<page>\d(?:[\d,]*\d)?)"
-> "$edition" (if any) replaced with the edition key and variations:
"(?P<reporter>Bankr\. L\. Rep\.) (?:[P¶] ?)?(?P<page>\d(?:[\d,]*\d)?)"
"(?P<reporter>Bankr\. L\. Rep\. \(CCH\)) (?:[P¶] ?)?(?P<page>\d(?:[\d,]*\d)?)"
This seems a little complicated when it's all spelled out, but I'm hoping that it makes things both DRY and easy to express -- the underlying reality is that Bankr. L. Rep. is cited as a reporter string or variation, sometimes a paragraph marker, and then a paragraph number, as are a number of other reporters, and this hopefully expresses that pretty clearly and with minimal redundancy.
Other comments:
... eyecite. There'll need to be flexibility to do stuff like patching in a more complicated regex for "$page" so I'm not quite sure what the API wants to be yet.
... "patterns" also have a string under "examples" that is matched by each pattern, which would go a long way to detecting typos. I didn't do that yet though.
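The "examples" check could be sketched roughly as follows, assuming each edition carries a list of already-expanded "patterns" and a list of "examples" strings. The dict layout and the `unmatched_patterns` helper are assumptions for illustration; the real field shape may differ.

```python
import re

# Hypothetical edition entry with its patterns already expanded.
EDITION = {
    "patterns": [
        r"(?P<reporter>Bankr\. L\. Rep\.) (?:[P¶] ?)?(?P<page>\d(?:[\d,]*\d)?)",
    ],
    "examples": ["Bankr. L. Rep. ¶ 12,345", "Bankr. L. Rep. 678"],
}


def unmatched_patterns(edition):
    """Return the patterns that fail to fully match any listed example."""
    return [
        pattern
        for pattern in edition["patterns"]
        if not any(re.fullmatch(pattern, ex) for ex in edition["examples"])
    ]


problems = unmatched_patterns(EDITION)
```

A test could then assert that `problems` is empty for every edition, so a typo in either the pattern or the example shows up as a failure.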