New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate format detection and extension checks for loading data #587

Closed
laserson opened this Issue Feb 25, 2015 · 9 comments

Comments

Projects
None yet
4 participants
@laserson
Contributor

laserson commented Feb 25, 2015

I find this annoying, paternalistic, and awkward with Hadoop's view of directories as data sets. What is the purpose of this? Is there actually a use case where the user won't know whether they're trying to load VCF or ADAM data? For example, in the vcf2adam CLI tool, it calls loadGenotypes which attempts to load VCF and then, upon failure, ADAM. But why would it ever need to attempt to load ADAM data if I'm calling "VCF-to-ADAM"?

I'd like to eliminate the paternalistic file extension stuff and also the "magical" detection of format, because I find it more difficult. cc @ryan-williams because I think he proposed the .adam check to begin with.

I'd also like to expose the actual functions that load specific data types. So there should be a separate load function for each file format we support.

@ryan-williams

This comment has been minimized.

Show comment
Hide comment
@ryan-williams

ryan-williams Feb 25, 2015

Member

I am all for:

  1. Making callers specify what kind of data they're intending to read
  2. Exposing functions that read a file and assume a specific input format, e.g. will read a VCF file no matter what its extension is
  3. Exposing functions that, for a given "kind" of data (e.g. "variants" xor "reads"), try to infer from file extension which of the functions from 2. to use
  4. Debatable: encouraging use of an .adam extension for files containing ADAM records.

4 came about when I inadvertently was trying to load a directory as an ADAM file and felt like an automated check could have saved me some trouble. Maybe just having these functions check that the argument is a file and not a directory would be a good compromise?

1, 2, and 3, seem like clearer wins to me and I can't tell if you disagree. I've made changes towards 1.; reading without declaring the type of data is Bad Magic. OTOH, the convenience of e.g. transform doing the right thing when passed a variety of file extensions seems worthwhile too, per 3.

Sry, typing from phone

Member

ryan-williams commented Feb 25, 2015

I am all for:

  1. Making callers specify what kind of data they're intending to read
  2. Exposing functions that read a file and assume a specific input format, e.g. will read a VCF file no matter what its extension is
  3. Exposing functions that, for a given "kind" of data (e.g. "variants" xor "reads"), try to infer from file extension which of the functions from 2. to use
  4. Debatable: encouraging use of an .adam extension for files containing ADAM records.

4 came about when I inadvertently was trying to load a directory as an ADAM file and felt like an automated check could have saved me some trouble. Maybe just having these functions check that the argument is a file and not a directory would be a good compromise?

1, 2, and 3, seem like clearer wins to me and I can't tell if you disagree. I've made changes towards 1.; reading without declaring the type of data is Bad Magic. OTOH, the convenience of e.g. transform doing the right thing when passed a variety of file extensions seems worthwhile too, per 3.

Sry, typing from phone

@laserson

This comment has been minimized.

Show comment
Hide comment
@laserson

laserson Feb 25, 2015

Contributor

Yes, I think exposing both 2 and 3 would be a great way to go.

I'm not sure I understand 4.

Contributor

laserson commented Feb 25, 2015

Yes, I think exposing both 2 and 3 would be a great way to go.

I'm not sure I understand 4.

@ryan-williams

This comment has been minimized.

Show comment
Hide comment
@ryan-williams

ryan-williams Feb 25, 2015

Member

Basically there is "variants vs. reads" and there is "crappy text formats vs. ADAM" for each of those.

We should never let users be ambiguous about the former, but for the latter there are cases where it makes sense to be specific, and others where having slightly broad/magical APIs seems beneficial.

Member

ryan-williams commented Feb 25, 2015

Basically there is "variants vs. reads" and there is "crappy text formats vs. ADAM" for each of those.

We should never let users be ambiguous about the former, but for the latter there are cases where it makes sense to be specific, and others where having slightly broad/magical APIs seems beneficial.

@ryan-williams

This comment has been minimized.

Show comment
Hide comment
@ryan-williams

ryan-williams Feb 25, 2015

Member

4 is what I thoughy you were objecting to: we warn or throw or something, i forget, if one tries to open a path that doesn't end in .adam as an ADAM file

Member

ryan-williams commented Feb 25, 2015

4 is what I thoughy you were objecting to: we warn or throw or something, i forget, if one tries to open a path that doesn't end in .adam as an ADAM file

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Feb 25, 2015

Member

I'm in general agreement with @ryan-williams here. We do throw right now if you open a path that doesn't end in .adam. The one annoyance I've found there is that we throw if the path ends in .adam/, which is silly.

I suppose the question is, @laserson what do you find unpleasant about the automatic format detection code? Is it the automatic format detection code that you don't like, or is it just that you'd like to be able to specify what exact file format you're loading from?

Member

fnothaft commented Feb 25, 2015

I'm in general agreement with @ryan-williams here. We do throw right now if you open a path that doesn't end in .adam. The one annoyance I've found there is that we throw if the path ends in .adam/, which is silly.

I suppose the question is, @laserson what do you find unpleasant about the automatic format detection code? Is it the automatic format detection code that you don't like, or is it just that you'd like to be able to specify what exact file format you're loading from?

@laserson

This comment has been minimized.

Show comment
Hide comment
@laserson

laserson Feb 25, 2015

Contributor

I am a strong advocate for 1 and 2, and am happy with 3, which is probably implemented by using file extensions. My preference is that file extensions be hints and good manners, rather than strong requirements. Especially since individual data sets in Hadoop are often directories rather than files. I find it confusing/strange to name directories with extensions.

Another issue is that if I write out parquet data, I'd have to name the directory .adam, as the actual files will be something like .gz.parquet, no?

@fnothaft My initial response is, what is the advantage of requiring file format extensions? Using them can be helpful, and should be encouraged. But in almost every use case I can imagine, the user will know the format of their data. I agree that it'd be nice to have some functions that "just work" without having to specify the file formats, but there should be a clear API to fall back on that is explicit.

Contributor

laserson commented Feb 25, 2015

I am a strong advocate for 1 and 2, and am happy with 3, which is probably implemented by using file extensions. My preference is that file extensions be hints and good manners, rather than strong requirements. Especially since individual data sets in Hadoop are often directories rather than files. I find it confusing/strange to name directories with extensions.

Another issue is that if I write out parquet data, I'd have to name the directory .adam, as the actual files will be something like .gz.parquet, no?

@fnothaft My initial response is, what is the advantage of requiring file format extensions? Using them can be helpful, and should be encouraged. But in almost every use case I can imagine, the user will know the format of their data. I agree that it'd be nice to have some functions that "just work" without having to specify the file formats, but there should be a clear API to fall back on that is explicit.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Feb 25, 2015

Member

@laserson

Another issue is that if I write out parquet data, I'd have to name the directory .adam, as the actual files will be something like .gz.parquet, no?

Ja!

ls reads21.adam/
_SUCCESS        _metadata       part-r-00000.gz.parquet

@fnothaft My initial response is, what is the advantage of requiring file format extensions? Using them can be helpful, and should be encouraged. But in almost every use case I can imagine, the user will know the format of their data. I agree that it'd be nice to have some functions that "just work" without having to specify the file formats, but there should be a clear API to fall back on that is explicit.

Yeah, that sounds reasonable to me. We've got all the functions for loading using explicit extensions in ADAMContext, but as package private. I think it makes a lot of sense for cases like vcf2adam.

When I read your comment at first, I thought you were suggesting totally getting rid of the "detect by extension" functionality. But, if you just want to have an explicit way to specify that you will only be getting data from a single format, I think that's definitely reasonable.

Member

fnothaft commented Feb 25, 2015

@laserson

Another issue is that if I write out parquet data, I'd have to name the directory .adam, as the actual files will be something like .gz.parquet, no?

Ja!

ls reads21.adam/
_SUCCESS        _metadata       part-r-00000.gz.parquet

@fnothaft My initial response is, what is the advantage of requiring file format extensions? Using them can be helpful, and should be encouraged. But in almost every use case I can imagine, the user will know the format of their data. I agree that it'd be nice to have some functions that "just work" without having to specify the file formats, but there should be a clear API to fall back on that is explicit.

Yeah, that sounds reasonable to me. We've got all the functions for loading using explicit extensions in ADAMContext, but as package private. I think it makes a lot of sense for cases like vcf2adam.

When I read your comment at first, I thought you were suggesting totally getting rid of the "detect by extension" functionality. But, if you just want to have an explicit way to specify that you will only be getting data from a single format, I think that's definitely reasonable.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 6, 2016

Member

I would like to close this. Does anyone have strong opposition?

Member

fnothaft commented Jul 6, 2016

I would like to close this. Does anyone have strong opposition?

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jul 6, 2016

Member

Closing as we have met consensus and I don't see anything further to do.

Member

heuermh commented Jul 6, 2016

Closing as we have met consensus and I don't see anything further to do.

@heuermh heuermh closed this Jul 6, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment