All bulk data should be available by court #285

Closed · brianwc opened this issue Aug 29, 2014 · 10 comments

brianwc commented Aug 29, 2014

We've had another request for this today. While one can currently write a script to retrieve all the bulk data for a given court by feeding it a series of years, 2013/ca1 2012/ca1 2011/ca1 ..., a power user recently pointed out that not only is this a hassle, but also one does not know what range of years to feed the script without checking the coverage page and squinting at the graphs (one by one). Bah!

This issue could also be resolved by simply providing a script that does this year-by-year retrieval, whichever is easier/faster, but it seems slightly more elegant to allow someone to just request the entire Supreme Court corpus with a single URI.
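For concreteness, the kind of throwaway script I'm describing looks something like the sketch below (the URL pattern and year range are just placeholders for illustration, not our documented endpoints):

```python
import urllib.error
import urllib.request

# Hypothetical per-year bulk URL layout; the real paths may differ.
BASE = "https://www.courtlistener.com/api/bulk-data/{year}/{court}.tar.gz"

def download_court_years(court, start_year, end_year):
    """Fetch each yearly bulk archive for a single court."""
    for year in range(start_year, end_year + 1):
        url = BASE.format(year=year, court=court)
        dest = f"{court}-{year}.tar.gz"
        try:
            urllib.request.urlretrieve(url, dest)
            print(f"Saved {dest}")
        except urllib.error.HTTPError as e:
            # Guessed a year with no data -- exactly the problem described above.
            print(f"Skipped {year}: HTTP {e.code}")

download_court_years("ca1", 2011, 2013)
```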

Edit: And why don't we just do the year-by-year collection and then zip all the years up together and provide people links to that? We'd only have to update it once per year (or when the back catalog is filled in by a data donation/backscraper). This would not include the current year, but if people want to add on what we have of the current year, that's already easy enough.

mlissner commented Aug 29, 2014

Yeah, the entire bulk system needs to be re-designed when it comes down to it. The problems are these:

  • Each dimension that we use to cut the data multiplies the number of files we need. Right now it's day, month, year and court. With 300+ courts and thousands of dates, that means millions of potential bulk files, plus the one file with EVERYTHING.
  • Some bulk files are huge and need to be created in advance for fear of nuking the server via client actions.
  • It's possible to use data in more than one bulk file at a time (e.g., when creating the all.tar.gz file, add each bit of data to the all-ca1.tar.gz file or whatever), but that's harder and more complex to code.
  • Our current design involves making a single massive XML file, which no computer can read without streaming, something over most people's heads (and even beyond the capacity of 32-bit computers!)

That last point isn't really related, but I can't help but think it's the biggest problem and yet no consumer has considered solving it. A part of me wants to just say: "you want it, give us a pull request."

All that said, probably the best solution for this is threefold:

  1. Drop support for daily dumps. If you want that, you can get it via the REST API, or you can wait until the end of the month.
  2. This leaves only monthly and complete bulk files. Monthly files can be generated as requested, but at the end of each month a cron job triggers the generation of the all.tar.gz file and also generates ca1.tar.gz and whatever else. This could be a mess of code, but it shouldn't actually be too terribly hard or compute intensive.
  3. Refactor the bulk files so they are zips containing thousands of XML files.

I think that trio would fix most of the problems.
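Roughly, points 2 and 3 could look like the sketch below. Everything here is illustrative, assuming Django-style objects with a pk and some serializer that hands back XML bytes; it's not actual project code.

```python
import io
import tarfile

def add_member(archive, name, data):
    """Write an in-memory XML blob into an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    archive.addfile(info, io.BytesIO(data))

def make_monthly_archives(courts, items_for_court, serialize):
    """End-of-month cron target: one pass over the data per court,
    feeding both the per-court archive and the combined all.tar.gz."""
    with tarfile.open("all.tar.gz", "w:gz") as everything:
        for court in courts:
            with tarfile.open(f"{court}.tar.gz", "w:gz") as per_court:
                for item in items_for_court(court):
                    xml = serialize(item)        # XML bytes for one item
                    name = f"{item.pk}.xml"      # naming scheme TBD
                    add_member(per_court, name, xml)
                    add_member(everything, f"{court}/{name}", xml)
```

The point is that each record gets serialized once and written to every archive it belongs to, so the end-of-month job stays a single pass over the data.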

mlissner commented Aug 29, 2014

Thinking about this more, maybe we just drop support for daily, monthly, and annual dumps, and going forward just have complete bulk files for each court that we regenerate at the beginning of each month.

If that solved everybody's use cases, it'd be easy enough to code and maintain.

brianwc commented Aug 29, 2014

I somewhat haphazardly looked through the log files, so I'm not in a position to make this decision based on the data, but I saw a little of both: 1) people downloading a given day, month, or year and nothing else, and 2) people systematically downloading every year/month of a given court (who would presumably be better served by a single zip download). At a minimum we would want to preserve a small download of some sort for those who just want to quickly look at the files and learn what format to expect, but perhaps we can just figure out which court has the smallest bulk file and suggest people check out that court's download for such exploratory purposes.

mlissner commented Aug 29, 2014

Sounds good. I'll do some outreach and see how much people care about a change like this.

mlissner commented Sep 2, 2014

Outreach complete. Looks like no objections.

mlissner commented Sep 4, 2014

An additional quandary to think about is how to handle deletions and modifications. In the past we had a routine that deleted the bulk files whenever something changed. I'm inclined to remove this feature and ask people using the bulk files to check the REST API for this kind of data if they want it.

The downside of doing this is that the bulk files won't have our latest cleanups, and users will have to wait until the next month to get those fixes.

The upside is that we don't have to invalidate and recreate the bulk files whenever we change something. I think it's reasonable for the bulk files to be simple snapshots of the system rather than fully accurate at all times.
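For anyone who does need interim corrections between snapshots, the check would be a REST query along these lines. The endpoint path and the date_modified__gte filter are guesses for illustration, so treat this as a sketch and check the API docs:

```python
import json
import urllib.request

# Hypothetical URL and filter name -- verify against the actual API reference.
url = ("https://www.courtlistener.com/api/rest/opinions/"
       "?court=ca1&date_modified__gte=2014-09-01")

with urllib.request.urlopen(url) as response:
    changed = json.load(response)

# Patch or re-fetch just these records locally rather than waiting for
# the next monthly bulk snapshot.
for item in changed.get("results", []):
    print(item.get("id"), item.get("date_modified"))
```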

mlissner self-assigned this Sep 19, 2014

mlissner commented Sep 19, 2014

OK, I'm proceeding with these changes today:

  1. Bulk files for each jurisdiction and for all jurisdictions (combined) will be generated at the end of the month.
  2. Date-based bulk files are no more.
  3. Each bulk file will be a compressed archive (algo TBD) that contains XML files generated by our API (if it's performant enough, this will mean that the files are the same in both places). Each file will be named after the SHA1 of the item to ensure it's unique in the zip.
  4. I will create separate bulk files containing similar data for oral arguments. These will be much smaller.

mlissner commented Sep 22, 2014

OK, making progress here, but some small changes:

  • Because we put IDs in our URLs, the SHA1s are less useful than our IDs. Thus, instead of using the SHA1 to identify items, I'm now using the item ID.
  • The bulk files are being created as tar files compressed with the gzip algorithm (same as before, and I did some tests for this last time).

About to create the snapshots for oral args, but first I need to create the API. This is how it's supposed to work -- dogfood!

The only downside of the new system at the moment is that serialization is going to take way more time. Our old archives could be generated in about 4 hours IIRC. These will take 20 hours, so we'll just make them on the last day of each month and put a note in the API reference to that effect.

Finally, the other thing to note is that because of this simplification, I can take Django out of the serving role for these files: Django still creates them, but they'll get served by Apache directly.

brianwc commented Sep 22, 2014

Sounds ok, but does this mean that the SHA wouldn't even make it into the bulk file? I don't think we want that, because as soon as we start collecting FDSys docs those are going to (finally!) come with court-generated SHAs and so there will finally be a means of authenticating against a publisher. We wouldn't want to drop this piece of metadata just as it is about to become useful. (Note: they don't use SHA-1, so we'll actually need to generate new SHAs with the same algorithm they use, which I think is SHA-256. But the general idea is that there should be a place for this in the bulk file, whether it's by retaining the current SHAs or by adding them back in once we collect from FDSys. Let's not make doing that any harder than it has to be.)
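
For what it's worth, the eventual verification step is only a few lines once a publisher hash is in the bulk data. This sketch assumes the court's SHA-256 arrives as a hex string; the names are made up:

```python
import hashlib

def matches_publisher_hash(path, published_sha256_hex):
    """True if the local file hashes to the value the publisher distributed."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == published_sha256_hex.lower()
```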

Brian

mlissner commented Sep 22, 2014

SHA1s are preserved and available in the bulk files, just not used for the filenames.
