Discussion: Capping data downloads #1378

Closed
jmcarp opened this issue Nov 30, 2015 · 9 comments

@jmcarp (Contributor) commented Nov 30, 2015

For performance reasons, we probably don't want to allow users to download CSVs of arbitrary size: some collections include 80+ million records and grow by 10 million records per year. Assuming that we want to impose some kind of cap on downloads, how should the cap behave?

  1. If a user requests a collection with >1m records, reject the request and show a warning explaining the cap.
  2. If a user requests a collection with >1m records, return only the first 1m records.

Note: if we go with the first option, we'll have to use approximate counts for user requests, so we might sometimes wind up rejecting queries that we should accept, and accepting queries that we should reject. This would only happen in cases where the size of the query result is close to the cap that we set.
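
For concreteness, the approximate count could come straight from the query planner rather than an exact `COUNT(*)`. A rough sketch, assuming our PostgreSQL backend and SQLAlchemy (the connection string and helper names here are illustrative, not actual code):

```python
from sqlalchemy import create_engine, text

# Hypothetical connection string; the real database URL would come from config.
engine = create_engine("postgresql:///fec")

CAP = 1_000_000

def approximate_count(query_sql):
    """Ask the PostgreSQL planner for its row estimate via EXPLAIN instead
    of running an exact (and slow) COUNT(*). The estimate can be off, which
    is why queries near the cap might be wrongly accepted or rejected."""
    with engine.connect() as conn:
        # EXPLAIN (FORMAT JSON) returns the plan as a single JSON value.
        plan = conn.execute(text("EXPLAIN (FORMAT JSON) " + query_sql)).scalar()
        return plan[0]["Plan"]["Plan Rows"]

def check_download(query_sql):
    """Reject a download request whose estimated size exceeds the cap."""
    if approximate_count(query_sql) > CAP:
        raise ValueError("Download would exceed the {:,}-row cap".format(CAP))
```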

Interested in opinions from @noahmanger @LindsayYoung @jenniferthibault

@LindsayYoung (Contributor)

I like option 2, but I think we should still throw a warning to be explicit that we will only return a million records.

@jenniferthibault

This bulk download seems to be the surest way to access all the data, so I'd be hesitant to limit it. If we go with option 2, is there any other way that someone could get the remaining records? From this, it sounds like no.

Is there an option 3 that's something like:
3. If a user requests a collection with >1m records, break the download up into parts of 1m each (so a 5m-record download would have parts 1 of 5, 2 of 5, 3 of 5, 4 of 5, and 5 of 5)?
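
(For illustration only, the server-side chunking could look something like the sketch below; the helper name and file layout are hypothetical, not anything in the current codebase.)

```python
import csv
import itertools

PART_SIZE = 1_000_000  # rows per CSV part

def export_in_parts(rows, fieldnames, filename_prefix):
    """Slice a streamed result set into numbered CSV files of at most
    PART_SIZE rows each, so a 5m-row download becomes parts 1 through 5."""
    rows = iter(rows)  # accept any iterable and consume it lazily
    for part in itertools.count(1):
        chunk = list(itertools.islice(rows, PART_SIZE))
        if not chunk:
            break
        name = "{}-part-{}.csv".format(filename_prefix, part)
        with open(name, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(fieldnames)
            writer.writerows(chunk)
```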

@LindsayYoung (Contributor)

@jenniferthibault FEC already has bulk downloads and we are not taking those away.

This feature is for custom downloads, for when someone just wants some subset of the information, like when they want to dive into a particular candidate or committee.

@jenniferthibault

GOTCHA. It was never clear to me that bulk downloads and custom downloads were separate things, and I was probably using the terms interchangeably.

@noahmanger

Lindsay raised a good point on the other issue that Excel maxes out at 65,000 rows. Why not cap it at that?

@jmcarp (Contributor, Author) commented Nov 30, 2015

Not opposed to that number, but it looks like the current row limit in Excel is more like 1m rows: https://support.office.com/en-us/article/Excel-specifications-and-limits-ca36e2dc-1f09-4620-b726-67c00b05040f

@LindsayYoung (Contributor)

@jenniferthibault I love how you are thinking about this and I would like this to be a better resource for reporters too. I do think the main sticking point for reporters is going to be the timeliness of the data rather than the number of rows in a custom download.

Realistically, this is not a good resource for reporting on time-sensitive stuff. I would love it to be, but the FEC would need to push data to us and we would need to update the API about every hour. (At least for transaction data and high-level totals; I think the maps and other breakdowns are still fine to calculate once a day, though that does introduce some inconsistency.)

Moreover, Josh mentioned the row limits of Excel documents. That means that if we are trying to improve the experience of most reporters, working with more than a million records requires database skills. My assumption is that most people who have database skills should be able to use the API and don't need this. That assumption won't fit everyone, but that subset of people can break their queries down into pieces, for example by using date ranges to subdivide them.
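
For example, a subdivided pull might look something like this sketch (the endpoint path and the min_date/max_date parameter names are assumptions here, not confirmed API parameters):

```python
from datetime import date, timedelta

import requests

BASE = "https://api.open.fec.gov/v1"

def fetch_by_month(endpoint, year, api_key, **params):
    """Pull a year of results one month at a time so that no single
    query comes near the download cap."""
    results = []
    start = date(year, 1, 1)
    while start.year == year:
        # Jump to the first day of the next month, then back up one day.
        next_month = (start.replace(day=28) + timedelta(days=4)).replace(day=1)
        end = next_month - timedelta(days=1)
        resp = requests.get(
            "{}/{}/".format(BASE, endpoint),
            params=dict(params, api_key=api_key,
                        min_date=start.isoformat(), max_date=end.isoformat()),
        )
        resp.raise_for_status()
        # Response shape assumed: {"results": [...], "pagination": {...}}.
        results.extend(resp.json()["results"])
        start = next_month
    return results
```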

I don't mean to harp on our current shortcomings. This is really good information for deep dives, which are less time-sensitive. For that kind of work, you are probably (though not always) looking for particular donors or committees. These stories are harder to find, take more time, and can continue between the onslaughts of reporting deadlines.

The API is good for people who want to move their infrastructure from weekly to daily updates; they can use the API to request just the new information as it comes in. This is not something you would want to do manually.

Currently, the reporting that takes place closest to the deadline comes from the e-filings feed, which updates on the hour or half hour (I don't remember precisely). We currently don't have access to that.

As for what people look for first, most often it is how much money was raised, and you don't need a million records for that; you need the summary numbers. That is usually followed by interesting transactions and donors, which you do need to browse the transactions for.

Happy to reach out to some reporting people and verify these assumptions if that would be helpful.

@LindsayYoung (Contributor)

Thanks for finding that. I must have been looking at an older Excel version.

I think a million is a generous cap.

@noahmanger

Closing this, as the cap has been implemented at 100k records.
