
untar_data does not untar .tgz, .gz, or .tar.tgz files #1130

Closed
mlsmall opened this issue Nov 11, 2018 · 5 comments

Comments


mlsmall commented Nov 11, 2018

Describe the bug

The untar_data function raises a "not a gzip file" error when trying to download and untar a file.

To Reproduce

untar_data('https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz')
untar_data('http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz')
untar_data('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz')
Expected behavior

I expect the file to be downloaded and untarred.

Additional context

Each call fails with an error that says "not a gzip file".


odysseus0 commented Nov 13, 2018

What is going on here

Here I will explain the current behavior of untar_data. Whether that behavior is desirable is a separate question; anyone with an opinion can weigh in here.

Here is an example of the use of untar_data in the lesson 1 notebook:

path = untar_data(URLs.PETS)

Given the name of the variable, it is natural to assume that URLs.PETS is an actual URL. The documentation of untar_data shows:

untar_data(url:str, fname:PathOrStr=None, dest:PathOrStr=None, data=True)

"Download url if it doesn't exist to fname and un-tgz to folder dest"

So the docstring also suggests that the first argument of untar_data is a URL.

If you look into the exact content of URLs.PETS, which is simply a constant field of the class URLs, you will find that it is defined as

'https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet'

However, if you try to visit that URL or download it with a command like wget, you will find that it is invalid. The actual address is

'https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet.tgz'

with .tgz appended at the end.

What I find confusing here is that, although both the name URLs and the docs of untar_data suggest that its first argument is a URL, URLs.PETS is not a valid URL. Digging a bit deeper, the source code of the download_data function, which untar_data calls, contains the following lines:

if not fname.exists():
    print(f'Downloading {url}')
    download_url(f'{url}.tgz', fname)

So it is clear that the url variable here is not a true URL, but the true URL with .tgz stripped from the end. We can also infer that the url parameter untar_data takes, and every field of the URLs class, is likewise the real URL with .tgz removed from the end.

Now, going back to your post: to download any dataset whose URL ends in .tgz, simply remove the .tgz, pass the rest to untar_data, and it should work. Datasets in other formats are currently not supported by untar_data, because regardless of your dataset's URL and file format, it always appends .tgz when downloading the data.
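The workaround can be sketched as a small helper (hypothetical, not part of fastai) that strips the trailing .tgz so the result matches the truncated form untar_data expects:

```python
def to_untar_data_arg(url: str) -> str:
    """Hypothetical helper (not part of fastai): strip a trailing '.tgz'
    so the result matches the truncated URL that untar_data expects,
    since untar_data re-appends '.tgz' itself when downloading."""
    suffix = '.tgz'
    return url[:-len(suffix)] if url.endswith(suffix) else url

# Example: untar_data(to_untar_data_arg('https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz'))
# is equivalent to untar_data('https://s3.amazonaws.com/fast-ai-imageclas/mnist_png')
```

This only helps for .tgz URLs; the .gz and .tar.gz URLs from the original post remain unsupported either way.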


sgugger commented Nov 16, 2018

Closing this. @odysseus0 you should put that explanation in a PR for the docs of untar_data!

@sgugger sgugger closed this as completed Nov 16, 2018
@odysseus0

@sgugger Do you think there is any need to change the fact that the URLs class does not actually contain full URLs, and that untar_data only supports .tgz URLs in a non-intuitive way?

Should I create a PR for such a feature recommendation?

Sorry, I am still not very familiar with the general guidelines for contributing to open source.


sgugger commented Nov 16, 2018

untar_data is supposed to be an internal function for the fastai datasets.
Maybe you could create a new function that is more general? download_url will already work for everything; I guess what's mostly missing is an uncompress function that can deal with any format, but that's not an easy feature.
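A minimal sketch of such a general function, assuming shutil.unpack_archive covers the formats needed (.tar.gz/.tgz, .zip, .tar, etc. — note that a bare .gz file such as the MNIST idx files is not an archive and would still need separate gzip handling). The name download_and_extract is hypothetical, not an existing fastai API:

```python
import pathlib
import shutil
import urllib.request

def download_and_extract(url: str, dest: str) -> pathlib.Path:
    """Download `url` into `dest` and extract it there, inferring the
    archive format from the filename. A sketch only, not fastai's
    implementation."""
    dest_path = pathlib.Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    fname = dest_path / url.split('/')[-1]
    if not fname.exists():                      # skip re-download if cached
        urllib.request.urlretrieve(url, fname)
    # unpack_archive dispatches on extension: .tar.gz/.tgz, .zip, .tar, ...
    shutil.unpack_archive(str(fname), extract_dir=str(dest_path))
    return dest_path
```

The extension-based dispatch is the easy part; the hard part sgugger alludes to is handling formats that carry no archive structure at all (plain .gz) or need format sniffing when the URL has no meaningful extension.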


odysseus0 commented Nov 17, 2018

That makes sense to me. I will start experimenting with it.

However, another issue here is that the combination of untar_data and the URLs class gives people the misconception that it can take any URL pointing to a compressed file as an argument. At the very minimum, we could add some more notes to the untar_data docs; more ideally, we could also rename the URLs class to something like FastaiDatasets, so that it is clear it is internal for now and should only be used with the fastai datasets.

I know this is really minor and you probably have way more important things to look after. In that case, should I simply start a PR for it?
