Fetch CSV data for previews from public url #3826
When we switch to storing all attachments exclusively in Asset Manager, the CsvPreview class will no longer be able to read CSV data from the local NFS mount to generate the previews presented to the
Instead we can generate the CSV previews by fetching the CSV file served at the public host. Currently that will, in effect, read the file from the NFS mount, but once we change over to serving from Asset Manager it will fetch the CSV file from there.
Some CSV file attachments are almost 200MB, so it is not practical to fetch the entire file as part of the preview request. Fortunately the preview functionality is configured to present at most 1,000 rows
I've configured the Range request to ask for the first 30,000 bytes of the file. This is somewhat arbitrary at the moment, and I could do some work to set it to a limit that ensures the first 1,000 lines of every currently uploaded CSV attachment are fetched.
For example: https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/675936/Q4_2017_csv.csv/preview
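To make the Range request described above concrete, here is a minimal sketch using Ruby's standard `Net::HTTP`. It is not the code from this PR: the helper names `build_range_request`, `complete_rows` and `fetch_csv_head` are hypothetical, and the 30,000 byte default is simply the provisional figure quoted in the description. The sketch also assumes that a response which fills the whole range was cut mid-row, so it drops everything after the last newline.

```ruby
require "net/http"
require "uri"

# Provisional limit quoted in the PR description (an assumption here,
# not the final value chosen in the PR).
MAXIMUM_RANGE_BYTES = 30_000

# Hypothetical helper: build a GET request for the first `limit` bytes.
# HTTP byte ranges are inclusive, so the last byte index is limit - 1.
def build_range_request(url, limit = MAXIMUM_RANGE_BYTES)
  Net::HTTP::Get.new(URI.parse(url), "Range" => "bytes=0-#{limit - 1}")
end

# Hypothetical helper: keep only complete lines from a possibly truncated body.
# If the body is shorter than the limit we assume the whole file was received.
# Otherwise we cap it at `limit` bytes (in case the server ignored the Range
# header) and discard the trailing partial line.
def complete_rows(body, limit = MAXIMUM_RANGE_BYTES)
  bytes = body.b # binary copy, so slicing cannot split a multi-byte character
  return bytes if bytes.bytesize < limit

  head = bytes.byteslice(0, limit)
  last_newline = head.rindex("\n")
  last_newline ? head[0..last_newline] : ""
end

# Hypothetical helper: fetch the head of the CSV file from its public URL.
def fetch_csv_head(url, limit = MAXIMUM_RANGE_BYTES)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(build_range_request(url, limit))
  end
  complete_rows(response.body.to_s, limit)
end
```

One consequence of this simplification: a file that is exactly `limit` bytes long and has no trailing newline would lose its last row. A stricter version could inspect the `Content-Range` response header to tell whether the file really continues past the requested range.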
chrisroos left a comment
This approach seems pretty good to me.
Is the CSV preview only available in whitehall-admin? And if so, do you think it's OK to make a request to Asset Manager each time someone previews a CSV file?
@chrisroos the CSV preview isn't available in admin, only to the public. For example, https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/675936/Q4_2017_csv.csv/preview. I'm hopeful that caching will mean we don't make too many requests.
Ah, OK. So to make sure I understand: a request for a CSV preview page will hit Whitehall, which will make a request to Asset Manager for a subset of the CSV file before sending it back to the client. The page sent back to the client will be cached by Fastly which will reduce the number of requests we're making to Asset Manager. Is that correct?
Mar 2, 2018
I've verified that this works on integration (I had to add a1ac03a to use basic auth in that environment). I've also set a value for MAXIMUM_RANGE_BYTES based on the CSV files historically added to Whitehall. There's a small edge case remaining, explained in the final commit, but I hope we can live with it so that we can move on with migrating the assets over, and come back to it shortly if we think it's necessary.
@chrisroos - could you review when you get a chance?
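For readers wondering how a value such as MAXIMUM_RANGE_BYTES might be derived from existing files, the following is one possible sketch, not the analysis actually performed for this PR. The names `PREVIEW_ROWS`, `bytes_for_first_rows` and `maximum_range_bytes` are hypothetical. It assumes one CSV row per physical line, which undercounts rows containing quoted newlines, and it does not decide whether the header row counts towards the 1,000 rows.

```ruby
# Number of rows the preview presents at most, per the PR description.
PREVIEW_ROWS = 1_000

# Hypothetical helper: bytes needed to cover the first `rows` lines of an IO.
# Enumerable#first stops reading once `rows` lines have been seen, so large
# files are not read in full.
def bytes_for_first_rows(io, rows = PREVIEW_ROWS)
  io.each_line.first(rows).sum(&:bytesize)
end

# Hypothetical helper: the largest such byte count across a set of local files.
# Returns 0 when no paths are given.
def maximum_range_bytes(paths, rows = PREVIEW_ROWS)
  paths.map { |path|
    File.open(path, "rb") { |file| bytes_for_first_rows(file, rows) }
  }.max || 0
end
```

Calling `maximum_range_bytes(Dir.glob("attachments/**/*.csv"))` over a local copy of the attachments (the glob path is illustrative) would give the smallest range that covers the first 1,000 lines of every file scanned.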