.option('columns', 'col1,col2,col3,col4') does not preserve order #76
Comments
Hi Jacob, Thanks again for your precious feedback :-) I think there are three separate things here:
1), this is clearly a bug, that I found (the spark-fits/src/main/scala/com/astrolabsoftware/sparkfits/FitsHduBintable.scala Lines 49 to 50 in 7aa656b
I will open a PR and push the fix.
2), since #55, it is not obvious that not providing the schema will add overhead. As long as you know all the FITS schemas are the same, the code will perform minimal checks.
3), this is tricky. In practice, the option Having said that, the proper way to get a speed-up would be to upgrade the structure of spark-fits to the recently released Apache Spark Data Source V2 (we are currently using V1). But that's a lot of work, and it probably needs to be done as a whole project in the context of an internship.
Also on 3), FITS is row-based, so when we read data it is natural to do:
while for an efficient reading of columns, you would rather prefer to have a column-based format:
I suspect this will make the implementation of column filter pushdown not so straightforward :-(
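To make the row-based versus column-based distinction above concrete, here is a minimal sketch in plain Python. It is not spark-fits code: the table, the column names (`id`, `flux`, `flag`) and the helper functions are hypothetical, and the only FITS facts it relies on are that binary table rows are stored one after another, big-endian and without padding.

```python
import struct

# Hypothetical table with 3 rows and columns (id: int32, flux: float64, flag: int16).
rows = [(1, 0.5, 7), (2, 1.5, 8), (3, 2.5, 9)]

ROW_FMT = ">idh"                      # big-endian, no padding between fields
ROW_SIZE = struct.calcsize(ROW_FMT)   # 4 + 8 + 2 bytes per row

# Row-based layout: all fields of row 0, then all fields of row 1, and so on.
row_major = b"".join(struct.pack(ROW_FMT, *r) for r in rows)

def read_flux_row_major(buf, nrows):
    """Read only 'flux': one small read per row, at a fixed stride."""
    offset_in_row = struct.calcsize(">i")  # skip the int32 'id' field
    return [
        struct.unpack_from(">d", buf, i * ROW_SIZE + offset_in_row)[0]
        for i in range(nrows)
    ]

# Column-based layout: all 'id' values, then all 'flux' values, then all 'flag' values.
ids, fluxes, flags = zip(*rows)
col_major = (
    struct.pack(f">{len(rows)}i", *ids)
    + struct.pack(f">{len(rows)}d", *fluxes)
    + struct.pack(f">{len(rows)}h", *flags)
)

def read_flux_col_major(buf, nrows):
    """Read only 'flux': a single contiguous read."""
    start = nrows * struct.calcsize(">i")  # skip the whole 'id' column
    return list(struct.unpack_from(f">{nrows}d", buf, start))

print(read_flux_row_major(row_major, len(rows)))
print(read_flux_col_major(col_major, len(rows)))
```

In the row-based case, selecting one column still means visiting every row, which is one way to see why pushing a column filter down to the reader does not automatically save much I/O.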
Issue 76: preserve column order from the columns option
The ordering has been fixed in #77
Hi Julien,
Great work on the latest release! Here is some more minor feedback.
Since I am now loading a very large number of small files I thought it would be best for me to only load the relevant columns that I need and specify the schema explicitly in order to minimise the overhead (is this what you would recommend?).
I now use the .option('columns', 'comma_seperated_column_names') method with spark-fits, but noticed that the order of the specified columns is not preserved (see the inline variable values and the resulting df.show() after loading in the screenshot below).
The columns seem to be grouped by type rather than ordered alphabetically or by the order of the list given. This would usually not be a problem, but since I also use the columns option with .schema(UserSchema...), it can cause unexpected behaviour: I set the order of the fields in the UserSchema to match the order in 'comma_seperated_column_names'.
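To illustrate the problem described above, here is a small sketch in plain Python. It does not use spark-fits or Spark, and it is not the library's actual selection logic: the header, the column names and both helper functions are hypothetical, and it assumes the user schema is matched to the loaded columns by position.

```python
# Hypothetical FITS header (column name -> type), listed in file order.
HEADER = {"ra": "double", "index": "int", "dec": "double", "flag": "int"}

def select_grouped_by_type(header, requested):
    """Mimic the reported symptom: selected columns come back grouped by type."""
    wanted = [name for name in header if name in requested]
    return sorted(wanted, key=lambda name: header[name])  # stable sort on type

def select_in_requested_order(header, requested):
    """Order-preserving alternative: iterate over the request, not the header."""
    return [name for name in requested if name in header]

requested = "index,ra,flag,dec".split(",")
grouped = select_grouped_by_type(HEADER, requested)
ordered = select_in_requested_order(HEADER, requested)

# If the schema is applied by position, any reordering mislabels columns.
mislabeled = [(want, got) for want, got in zip(requested, grouped) if want != got]

print(grouped)
print(ordered)
print(mislabeled)
```

With the grouped ordering, every field of a schema written in the requested order ends up paired with a different column, which is the kind of unexpected behaviour I am seeing.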
The main reason why I want to use these options is to optimise the speed of reading in many files. Please let me know if this methodology is beneficial or if I am heading in the wrong direction. For a single file, the header must be read to filter the columns anyway, and I assume the schema is inferred at the same time, so there might not be any speed benefit to specifying it. However, I am not sure whether this is still the case if you read in many files and specify the schema manually.
Cheers,
Jacob