Discussion: Capping data downloads #1378

Closed
jmcarp opened this issue Nov 30, 2015 · 9 comments

@jmcarp (Contributor) commented Nov 30, 2015

For performance reasons, we probably don't want to allow users to download CSVs of arbitrary size: some collections include 80+ million records and grow by 10 million records per year. Assuming that we want to impose some kind of cap on downloads, how should the cap behave?

  1. If a user requests a collection with >1m records, reject the request and show a warning explaining the cap.
  2. If a user requests a collection with >1m records, return only the first 1m records.

Note: if we go with the first option, we'll have to use approximate counts for user requests, so we might sometimes wind up rejecting queries that we should accept, and accepting queries that we should reject. This would only happen in cases where the size of the query result is close to the cap that we set.
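
For concreteness, the approximate count could come straight from the query planner rather than an exact `COUNT(*)`. A rough sketch, assuming our PostgreSQL backend and SQLAlchemy (the connection string and helper names here are illustrative, not actual code):

```python
from sqlalchemy import create_engine, text

# Hypothetical connection string; the real database URL would come from config.
engine = create_engine("postgresql:///fec")

CAP = 1_000_000

def approximate_count(query_sql):
    """Ask the PostgreSQL planner for its row estimate via EXPLAIN instead
    of running an exact (and slow) COUNT(*). The estimate can be off, which
    is why queries near the cap might be wrongly accepted or rejected."""
    with engine.connect() as conn:
        # EXPLAIN (FORMAT JSON) returns the plan as a single JSON value.
        plan = conn.execute(text("EXPLAIN (FORMAT JSON) " + query_sql)).scalar()
        return plan[0]["Plan"]["Plan Rows"]

def check_download(query_sql):
    """Reject a download request whose estimated size exceeds the cap."""
    if approximate_count(query_sql) > CAP:
        raise ValueError("Download would exceed the {:,}-row cap".format(CAP))
```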

Interested in opinions from @noahmanger @LindsayYoung @jenniferthibault

@LindsayYoung (Contributor)

I like option 2, but I think we should still throw a warning to be explicit that we will only return a million records.

@jenniferthibault

This bulk download seems to be the surest way to access all the data, so I'd be hesitant to limit it. If we go with option 2, is there any other way that someone could get the remaining records? From this, it sounds like no.

Is there an option 3 that's something like:
3. If a user requests a collection with >1m records, break the download up into parts of 1m each (so a 5m-record download would have parts 1 of 5, 2 of 5, 3 of 5, 4 of 5, and 5 of 5)?
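
(For illustration only, the server-side chunking could look something like the sketch below; the helper name and file layout are hypothetical, not anything in the current codebase.)

```python
import csv
import itertools

PART_SIZE = 1_000_000  # rows per CSV part

def export_in_parts(rows, fieldnames, filename_prefix):
    """Slice a streamed result set into numbered CSV files of at most
    PART_SIZE rows each, so a 5m-row download becomes parts 1 through 5."""
    rows = iter(rows)  # accept any iterable and consume it lazily
    for part in itertools.count(1):
        chunk = list(itertools.islice(rows, PART_SIZE))
        if not chunk:
            break
        name = "{}-part-{}.csv".format(filename_prefix, part)
        with open(name, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(fieldnames)
            writer.writerows(chunk)
```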

@LindsayYoung (Contributor)

@jenniferthibault FEC already has bulk downloads and we are not taking those away.

This feature is for custom downloads, for when someone just wants some subset of the information, like when they want to dive into a particular candidate or committee.

@jenniferthibault

GOTCHA. It was never clear to me that bulk downloads and custom downloads were separate things, and I was probably using the terms interchangeably.

@noahmanger

Lindsay raised a good point on the other issue that Excel maxes out at 65,000 rows. Why not cap it at that?

@jmcarp (Contributor, Author) commented Nov 30, 2015

Not opposed to that number, but it looks like the current row limit in Excel is more like 1m rows: https://support.office.com/en-us/article/Excel-specifications-and-limits-ca36e2dc-1f09-4620-b726-67c00b05040f

@LindsayYoung (Contributor)

@jenniferthibault I love how you are thinking about this and I would like this to be a better resource for reporters too. I do think the main sticking point for reporters is going to be the timeliness of the data rather than the number of rows in a custom download.

Realistically, this is not a good resource for reporting on time-sensitive stuff. I would love it to be, but the FEC would need to push data to us and we would need to update the API about every hour. (At least for transaction data and high-level totals; I think the maps and other breakdowns are still fine to calculate once a day, though that does introduce some inconsistency.)

Moreover, Josh mentioned the row limits of Excel documents. That means that if we are trying to improve the experience of most reporters, working with more than a million records requires database skills. My assumption is that most people who have database skills should be able to use the API and don't need this. That assumption won't fit everyone, but that subset of people can break their queries down into pieces, for example by using date ranges to subdivide them.
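
For example, a subdivided pull might look something like this sketch (the endpoint path and the min_date/max_date parameter names are assumptions here, not confirmed API parameters):

```python
from datetime import date, timedelta

import requests

BASE = "https://api.open.fec.gov/v1"

def fetch_by_month(endpoint, year, api_key, **params):
    """Pull a year of results one month at a time so that no single
    query comes near the download cap."""
    results = []
    start = date(year, 1, 1)
    while start.year == year:
        # Jump to the first day of the next month, then back up one day.
        next_month = (start.replace(day=28) + timedelta(days=4)).replace(day=1)
        end = next_month - timedelta(days=1)
        resp = requests.get(
            "{}/{}/".format(BASE, endpoint),
            params=dict(params, api_key=api_key,
                        min_date=start.isoformat(), max_date=end.isoformat()),
        )
        resp.raise_for_status()
        # Response shape assumed: {"results": [...], "pagination": {...}}.
        results.extend(resp.json()["results"])
        start = next_month
    return results
```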

I don't mean to harp on our current shortcomings. This is really good information for deep dives, which are less time-sensitive. For that kind of work, you are probably (though not always) looking for particular donors or committees. These stories are harder to find, take more time, and can continue between the onslaughts of reporting deadlines.

The API is good for people who want to move their infrastructure from weekly to daily updates; they can use the API to request just the new information as it comes in. This is not something you would want to do manually.

Currently, the reporting that takes place closest to the deadline comes from the e-filings feed, which updates on the hour or half hour (I don't remember precisely). We currently don't have access to that.

As for what people look for first, most often it is how much money was raised, and you don't need a million records for that; you need the summary numbers. That is usually followed by interesting transactions and donors, which you do need to browse the transactions for.

Happy to reach out to some reporting people and verify these assumptions if that would be helpful.

@LindsayYoung (Contributor)

Thanks for finding that. I must have been looking at an older Excel version.

I think a million is a generous cap.

@noahmanger

Closing this, as the cap has been implemented at 100k records.
