Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetch CSV data for previews from public url #3826

Merged
merged 5 commits into from Mar 6, 2018

Conversation

Projects
None yet
2 participants
@chrislo
Copy link
Contributor

commented Mar 1, 2018

When we switch to storing all attachments exclusively in Asset Manager, the CsvPreview class will no longer be able to read CSV data from the local NFS mount to generate the previews presented to the
user[1].

Instead we can generate the CSV previews by fetching the CSV file served at the public host. Currently that will in effect by reading the file from the NFS mount, but when we do change over to serving from Asset Manager it will fetch the CSV file from there.

Some CSV file attachments are almost 200Mb[2] so it is not practical to fetch the entire file as part of the preview request. Fortunately the preview functionality is configured to present at most 1,000 rows
and 50 columns to the user.

I've configured the Range request to request the first 30,000 bytes of the file. This is somewhat arbitrary at the moment, and I could do some work to set this to a limit that ensures 1000 lines of every currently uploaded csv attachment is fetched.

[1] For example: https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/675936/Q4_2017_csv.csv/preview
[2]
https://www.gov.uk//government/uploads/system/uploads/attachment_data/file/399850/mappings-2015-01-27T10_26_13_00_00.csv/preview

@chrislo chrislo force-pushed the fetch-csv-previews-from-public-url branch from 6850faa to 2718b93 Mar 1, 2018

@chrisroos
Copy link
Contributor

left a comment

This approach seems pretty good to me.

Is the CSV preview only available in whitehall-admin? And if so, do you think it's OK to make a request to Asset Manager each time someone previews a CSV file?

end
end

test '#new raises an exception if the request status is anything other than 206' do

This comment has been minimized.

Copy link
@chrisroos

chrisroos Mar 1, 2018

Contributor

Out of interest, do you still get a 206 if the file being requested is smaller than 30,000 bytes?

This comment has been minimized.

Copy link
@chrislo

chrislo Mar 1, 2018

Author Contributor

Yes, it appears so. For example

curl https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/675936/Q4_2017_csv.csv -H "Range: bytes=0-1000000" -v > /dev/null 

This file is 51798 bytes, but requesting the first 1000000 still returns a 206.

@chrislo

This comment has been minimized.

Copy link
Contributor Author

commented Mar 1, 2018

@chrisroos the CSV preview isn't available in admin, only to the public. For example, https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/675936/Q4_2017_csv.csv/preview. I'm hopeful that caching will mean we don't make too many requests.

@chrislo chrislo force-pushed the fetch-csv-previews-from-public-url branch from 2718b93 to c2c7225 Mar 2, 2018

@chrisroos

This comment has been minimized.

Copy link
Contributor

commented Mar 2, 2018

@chrisroos the CSV preview isn't available in admin, only to the public. For example, https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/675936/Q4_2017_csv.csv/preview. I'm hopeful that caching will mean we don't make too many requests.

Ah, OK. So to make sure I understand: a request for a CSV preview page will hit Whitehall, which will make a request to Asset Manager for a subset of the CSV file before sending it back to the client. The page sent back to the client will be cached by Fastly which will reduce the number of requests we're making to Asset Manager. Is that correct?

@chrislo

This comment has been minimized.

Copy link
Contributor Author

commented Mar 2, 2018

@chrisroos - it makes a request to the public URL for the asset which at the moment will hit Whitehall and fetch the file from NFS, but when we change over will hit Asset Manager. Does that make sense?

@chrisroos

This comment has been minimized.

Copy link
Contributor

commented Mar 2, 2018

@chrisroos - it makes a request to the public URL for the asset which at the moment will hit Whitehall and fetch the file from NFS, but when we change over will hit Asset Manager. Does that make sense?

Yup, completely.

@chrislo chrislo force-pushed the fetch-csv-previews-from-public-url branch 2 times, most recently from 8e0552d to af980cf Mar 2, 2018

@chrislo chrislo changed the title [WIP] Fetch CSV data for previews from public url Fetch CSV data for previews from public url Mar 2, 2018

@chrislo

This comment has been minimized.

Copy link
Contributor Author

commented Mar 2, 2018

I've verified that this works on integration (I had to add a1ac03a to use basic auth in that environment). I've also set a value for MAXIMUM_RANGE_BYTES based on CSV files historically added to Whitehall. There's a small edge case remaining explained in the final commit, but I hope we can live with it to allow us to move on with migrating the assets over and come back to it shortly if we think it's necessary.

@chrisroos - could you review when you get chance?

@chrisroos
Copy link
Contributor

left a comment

This looks good to me, @chrislo.

I've added a couple of comments but they're about style and so are entirely subjective. I'll let you decided what, if anything, to do about them.

@@ -16,7 +16,9 @@ def initialize(path)
private

def connection
Faraday.new(url: Whitehall.public_root)
conn = Faraday.new(url: Whitehall.public_root)
conn.basic_auth(basic_auth_user, basic_auth_password) if ENV.has_key?("BASIC_AUTH_CREDENTIALS")

This comment has been minimized.

Copy link
@chrisroos

chrisroos Mar 5, 2018

Contributor

This is obviously subjective but I think I'd prefer to inline the methods you've introduced, given that they're only used in the #connection method. Something like:

if ENV.has_key?('BASIC_AUTH_CREDENTIALS')
  basic_auth_credentials = ENV["BASIC_AUTH_CREDENTIALS"].split(":")
  basic_auth_user = basic_auth_credentials[0]
  basic_auth_password = basic_auth_credentials[1]
  # OR
  basic_auth_user, basic_auth_password = ENV["BASIC_AUTH_CREDENTIALS"].split(":")

  conn.basic_auth(basic_auth_user, basic_auth_password)
end
@@ -7,6 +7,12 @@
end

When(/^I preview the contents of the attachment$/) do
fn = File.join(Whitehall.clean_uploads_root, @attachment.file.store_path)

This comment has been minimized.

Copy link
@chrisroos

chrisroos Mar 5, 2018

Contributor

This is subjective but I'm not a fan of these short variable names. Can this be filename or attachment_path?

@path = path

Tempfile.create(temp_fn, temp_dir) do |tmp_file|
tmp_file.write(csv_file)

This comment has been minimized.

Copy link
@chrisroos

chrisroos Mar 5, 2018

Contributor

It wasn't immediately obvious to me what this method was doing. It took me a while to realise that the #csv_file method results in an HTTP connection to retrieve the content of the file. Is there any advantage of extracting the various methods in this class? I think I might find it easier to follow if it was all inline within the initialize method, or at least that it was a bit more explicit, e.g.:

def initialize(path)
  csv_file = read_csv_file(path)

  Tempfile.create(temp_fn, temp_dir) do |tmp_file|
    tmp_file.write(csv_file)
    ...
@chrislo

This comment has been minimized.

Copy link
Contributor Author

commented Mar 6, 2018

Thanks @chrisroos - as discussed, I'm going to rebase this on master and merge.

chrislo added some commits Mar 1, 2018

Introduce CsvFileFromPublicHost
When we switch to storing all attachments exclusively in Asset
Manager, the CsvPreview class will no longer be able to read CSV data
from the local NFS mount to generate the previews presented to the
user[1].

Instead we can generate the CSV previews by fetching the CSV file
served at the public host. Currently that will in effect by reading
the file from the NFS mount, but when we do change over to serving
from Asset Manager it will fetch the CSV file from there.

Some CSV file attachments are almost 200Mb[2] so it is not practical
to fetch the entire file as part of the preview request. Fortunately
the preview functionality is configured to present at most 1,000 rows
and 50 columns to the user.

The CsvFileFromPublicHost class introduced in this commit fetches the
file from the public host by making a Range request for the first
MAXIMUM_RANGE_BYTES bytes. This is currently arbitrarily set at 30,000
bytes. It saves the file to a temporary file and yields the file to
the caller.

A range request always returns 206 even if the range requested is
larger than the file size. So we use this status code to check that
the data has been successfully fetched and raise an exception if not.

In a subsequent commit we will use this class in the
CsvPreviewController to build the preview.

[1] For example: https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/675936/Q4_2017_csv.csv/preview
[2]
https://www.gov.uk//government/uploads/system/uploads/attachment_data/file/399850/mappings-2015-01-27T10_26_13_00_00.csv/preview
CsvPreviewController uses csv data fetched from public host
This commit modifies the CsvPreviewController to use data fetched from
the public host by calling the CsvFileFromPublicHost class introduced
in the previous commit.

I've decided to stub the values returned from CsvFileFromPublicHost in
the CsvPreviewControllerTest but stub at the request level in the
feature spec to increase the coverage.

I'm not particularly happy with how much indirection there is in the
argument passed to CsvFileFromPublicHost. When we've switched to
AssetManagerStorage we will be able to use the `url` method instead
which should be clearer.
Cope with CsvFileFromPublicHost::ConnectionError
If CsvFileFromPublicHost cannot fetch the CSV file it raises this
exception. We display a `This file could not be previewed` error in
the same way as if we'd encountered a CsvPreview:FileEncodingError or
CSV::MalformedCSVError.
Use basic auth to request CSV if on integration
Assets served from
`https://www-origin.integration.publishing.service.gov.uk` are
protected behind basic authentication. For the CSV preview
functionality to work in integration Faraday needs to pass the
authentication header along with the request. The
BASIC_AUTH_CREDENTIALS environment variable is only set in integration
so we can use the presence of this variable to conditionally add the
request header.
Set a better value for MAXIMUM_RANGE_BYTES
When building the CsvPreview from the temporary file created by
CsvFileFromPublicHost we don't want to download the entire file (as
some files are almost 200Mb). Instead we make a range request for the
first MAXIMUM_RANGE_BYTES.

Setting MAXIMUM_RANGE_BYTES isn't obvious. CsvPreview will truncate
the file to the first 1000 rows and 50 columns by default, but as the
file is arranged in column then row order, we don't know how many
bytes to request to ensure that we have around 1000 rows.

I selected around 1,500 CSV files at random from those uploaded to
GOV.UK, downloaded them, truncated them to the first 1,000 rows and
counted how many bytes each file was.

Except for some very wide, very wordy CSV files[1] 99.5% of the files
sampled contained 1,000 rows within the first 300,000 bytes.

A consequence of the way that CsvPreview works is that it has no way
of knowing for the remaining 0.5% of files whether the file has been
truncated. So the truncated advice in
`app/views/csv_preview/_truncated_message` will not be shown in these
cases.

I think we can live with this in order to move us towards switching
these files to be served from Asset Manager, but we may want to
revisit this later.

[1] An example being
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/530705/Cyber_Security_Breaches_Survey_2016_raw_data_value_labels.csv/preview. The
first 1,000 lines of this file total 8094845 bytes.

@chrislo chrislo force-pushed the fetch-csv-previews-from-public-url branch from af980cf to 67464c2 Mar 6, 2018

@chrislo chrislo merged commit c7e1547 into master Mar 6, 2018

2 checks passed

continuous-integration/jenkins/branch This commit looks good
Details
continuous-integration/jenkins/publishing-e2e-tests Publishing end-to-end tests succeeded on Jenkins
Details

@chrislo chrislo deleted the fetch-csv-previews-from-public-url branch Mar 6, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.