Skip to content
This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

Make use of content-type to detect format #252

Closed
wants to merge 11 commits into from
Closed

Make use of content-type to detect format #252

wants to merge 11 commits into from

Conversation

pierredittgen
Copy link
Contributor

Currently, when using tabulator with URL sources, format is infered from:

One step beyond: when an URL has no format suffix, neither format url parameter, tabulator can request HEAD information and infer format from Content-type information. E.g content type 'text/csv' means csv format.

Content-type to format is defined as a dict (CONTENT_TYPE_FORMAT) that can be extended as needed to support more mime-types.

@@ -17,6 +18,19 @@

# Module API

# Maps mime-type to format
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

req = requests.head(source, allow_redirects=True)
if req.status_code == requests.codes.ok:
content_type = req.headers.get('Content-type')
if not content_type is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if content_type is not None
is more readable I think

if req.status_code == requests.codes.ok:
content_type = req.headers.get('Content-type')
if not content_type is None:
format = CONTENT_TYPE_FORMAT.get(content_type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here you can use mimetypes.guess_extension(...)

@@ -58,6 +72,14 @@ def detect_scheme_and_format(source):
if query_string_format is not None and len(query_string_format) == 1:
format = query_string_format[0]

# Test if format info can be extracted from Content-type header
elif source.startswith('http'):
req = requests.head(source, allow_redirects=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some exception handling is needed here.

@akariv
Copy link
Member

akariv commented Oct 15, 2018

@pierredittgen looks good, I added some comments.
Can you please make sure that tests pass after this change?

@roll
Copy link
Member

roll commented Oct 15, 2018

@akariv
@pierredittgen
I have some input on this one.

I think we had this implemented already at some point in time (should be possible to find the implementation in the history). But I disabled it because it was really inaccurate (the main problem was with Github Mime-Types as I can remember).

Probably we need a flag to make this feature optional (disabled by default). Or something like this. If my remembering is correct.

@akariv
Copy link
Member

akariv commented Oct 15, 2018 via email

@akariv
Copy link
Member

akariv commented Oct 15, 2018 via email

@cbenz
Copy link
Contributor

cbenz commented Nov 15, 2018

Has this pull-request been reviewed and accepted? What could we do @pierredittgen and I to move on?

The parameter use_http_content_type=False was added in latest commits.

@roll
Copy link
Member

roll commented Nov 26, 2018

@cbenz
@pierredittgen
Sorry I've got out of the loop a little bit.

Happy to review but not sure what's status of the PR (the flag, tests).

@roll roll added this to Software in Frictionless General Mar 19, 2019
@roll roll moved this from Software to In progress in Frictionless General Mar 19, 2019
@johanricher
Copy link
Member

What is missing to merge this PR?

@roll roll moved this from Pull Requests to Software in Frictionless General Mar 25, 2019
@roll
Copy link
Member

roll commented Mar 25, 2019

Sorry for the really late reply. I'm probably slightly out of sync and can be wrong.

But I have a few questions:

  • so we add an additional http call to the helper function which TBH wasn't intended to be not pure. I was wondering we need to check the http header inside the Remote Loader during the same (as data collection) call. But I guess it requires some design changes (I have to check)
  • not sure how as a user I can set the use_http_content_type flag. Will not it be always False?
  • we need tests, readme's API update

@pierredittgen
Copy link
Contributor Author

not sure how as a user I can set the use_http_content_type flag. Will not it be always False?

In fact, in our code, we directly call the detect_scheme_and_format function to try to get the format of the given resource. Then, we can pass this info to the "inspect" method of goodtables.inspector.

@roll roll added the wip label Apr 15, 2019
@stale
Copy link

stale bot commented Jul 15, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 15, 2019
@stale stale bot removed the wontfix label Jul 29, 2019
@roll roll added review and removed wip labels Sep 3, 2019
@stale
Copy link

stale bot commented Oct 27, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Oct 27, 2019
@johanricher
Copy link
Member

@roll It seems a lot of work & progress was made here but not totally finished. What's left to do in order to merge this PR? On our side, @pierredittgen can still be available to help close this. :)

@stale stale bot removed the wontfix label Oct 28, 2019
@roll
Copy link
Member

roll commented Oct 28, 2019

@johanricher

The main problem was that it'd been intended to update the helpers.detect_scheme_and_format function which is:

  • not a part of public API
  • wasn't created to contain any IO calls

We need to add a publically available option to the remote loader:

And probably copy this code to something like helpers.get_format_by_content_type and use it inside the Stream.open (we can't use it inside the remote loader because a loader can't update a file format...). TBH not sure exactly but something like this.

@stale
Copy link

stale bot commented Jan 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jan 26, 2020
@stale stale bot closed this Feb 25, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
No open projects
Frictionless General
  
Software (core)
Development

Successfully merging this pull request may close these issues.

None yet

5 participants