Write scraper for PACER NEF emails that can pull out the links and anything else #381

Closed
mlissner opened this issue Apr 19, 2021 · 2 comments

@mlissner
Member

One of the first things we're going to need to do is start parsing the NEF emails for links and other metadata that we can grab. I'll attach a few example emails in a sec.
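As a rough starting point for the link-extraction part, here is a minimal sketch that collects every `<a href>` and its text from an HTML body. It uses only the standard library's `html.parser`; the names `LinkCollector` and `extract_links` are hypothetical, and nothing here reflects what the real NEF markup looks like.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags that carry an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href (e.g. <a name="...">) leave _href as None.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) tuples for the links in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Calling `extract_links(html)` on an email body would give the raw list of links; deciding which of them point at documents, dockets, or something else would still need the real examples.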

@mlissner
Member Author

OK, getting these emails mostly anonymized took a while, sorry for the delay. Here are two HTML emails (in a single file). I'm told there's also an option for plaintext emails, but I suspect few people actually want that, so we can put it off until later if nobody turns it on (we'll get failing examples of those eventually if people are using it).
nef-examples.mbox.txt
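For anyone wanting to poke at the attachment, a sketch of pulling the HTML part out of each message with the standard library's `mailbox` and `email` modules might look like the following. It assumes the attached file is a regular mbox file; the function names `html_body` and `iter_html_bodies` are hypothetical.

```python
import email
import mailbox
from email import policy


def html_body(raw_bytes):
    """Return the HTML part of a raw email message, or None if it has none."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    part = msg.get_body(preferencelist=("html",))
    return part.get_content() if part is not None else None


def iter_html_bodies(mbox_path):
    """Yield (subject, html) for each message in an mbox file with an HTML part."""
    box = mailbox.mbox(mbox_path)
    try:
        for message in box:
            # Re-parse with the modern policy so get_body() is available.
            html = html_body(message.as_bytes())
            if html is not None:
                yield message["Subject"], html
    finally:
        box.close()
```

Messages that only have a plaintext body are skipped here, which matches putting plaintext support off until later.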

@mlissner
Member Author

As far as fields for these go, I'd start by looking at the field names in the test assets directories, where we have dockets as HTML parsed to JSON. For example, these test fixtures probably have most if not all of the field names you need:

https://github.com/freelawproject/juriscraper/tree/master/tests/examples/pacer/dockets/district

And you can see how those are usually used, here:

https://github.com/freelawproject/juriscraper/blob/master/tests/local/test_DocketParseTest.py#L115

The parser itself might live in juriscraper/pacer/nef_email.py.
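To illustrate the general fixture pattern (an input example with the expected JSON stored next to it), here is a minimal sketch of a comparison helper. It is not a copy of the linked test; `check_fixtures` and its parameters are hypothetical, and `parse` stands in for whatever the new parser ends up exposing.

```python
import json
from pathlib import Path


def check_fixtures(example_dir, parse, input_suffix=".html"):
    """Compare parse(input text) to the .json file stored next to each input.

    Returns the names of input files whose parsed output differs from,
    or has no, expected JSON.
    """
    failures = []
    for input_path in sorted(Path(example_dir).glob(f"*{input_suffix}")):
        expected_path = input_path.with_suffix(".json")
        parsed = parse(input_path.read_text(encoding="utf-8"))
        if not expected_path.exists():
            failures.append(input_path.name)
            continue
        expected = json.loads(expected_path.read_text(encoding="utf-8"))
        # Round-trip through JSON so dates and tuples compare like stored data.
        if json.loads(json.dumps(parsed, default=str)) != expected:
            failures.append(input_path.name)
    return failures
```

An empty return value means every example matched its stored JSON; anything else lists the files to look at.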
