.option('columns', 'col1,col2,col3,col4') does not preserve order #76

jacobic · 2019-05-13T22:17:23Z

Hi Julien,

Great work on the latest release! here is some more minor feedback.

Since I am now loading a very large number of small files I thought it would be best for me to only load the relevant columns that I need and specify the schema explicitly in order to minimise the overhead (is this what you would recommend?).

I now use the .option('column', 'comma_seperated_column_names') method with spark-fits but noticed that the order of the specified columns is not preserved (see inline variable values and resulting df.show() after loading in the screenshot below).

The columns seem to be grouped by type rather than alphabetically or by the order of the list given. This would usually not be a problem, but since I also use the column option with .schema(UserSchema...) this can cause unexpected behaviour because I set the order of the fields in the UserSchema to be identical to that of the order in the 'comma_seperated_column_names' in the column option.

The main reason why I want to use these options is to optmise the speed of reading in many files. Please let me know if this methodology is beneficial or if I am heading in the wrong direction. For a single file the header file must be read to filter the columns anyway and I assume the schema is simultaneously inferred so there might not be any speed benefit to specifying it? although I am not sure if this is still the case if you read in many files and specify the schema manually?

Cheers,
Jacob

JulienPeloton · 2019-05-14T08:46:51Z

Hi Jacob,

Thanks again for your precious feedback :-)

I think there are three separate things here:

Order is not preserved when using the columns option (clearly a bug)
Is there a gain in specifing manually the schema in the multifile case?
Speeding up the I/O by selecting only some columns.

1), this is clearly a bug, that I found (the sorted should not be called):

spark-fits/src/main/scala/com/astrolabsoftware/sparkfits/FitsHduBintable.scala

Lines 49 to 50 in 7aa656b

    
           val colPositions = selectedColNames.map( 
        
             x => getColumnPos(keyValues, x)).toList.sorted

I will open a PR and push the fix.

2), since #55, it is not obvious that not providing the schema will add overhead. As long as you know all the FITS schemas are the same, the code will perform minimal checks.

3), this is tricky. In practice, the option columns allows to skip the decoding part, but not the loading part, hence there is only a little speed-up. To do it properly, we would need to implement filters at the level of the record reader directly. This work is planned, I am just lacking of time :-(

Having said that, the proper way to get speed-up would be to upgrade the structure of spark-fits to Apache Spark Data Source V2 (we are currently using V1), recently released. But that's a lot of work, and probably needs to be done in the context of an internship as a whole project.

JulienPeloton · 2019-05-14T12:36:08Z

Also on 3), FITS is row-based, so when we read data it is natural to do:

(row, col)
element (0, 0) / element (0, 1) / element (0, 2) / ... 
element (1, 0) / element (1, 1) / element (1, 2) / ...
...

while for an efficient reading of columns, you would rather prefer to have a column-based format:

(row, col)
element (0, 0) / element (1, 0) / element (2, 0) / ...
element (0, 1) / element (1, 1) / element (2, 1) / ...
...

I suspect this will probably make the column filter pushdown implementation no so straightforward :-(

Issue 76: preserve column order from the columns option

JulienPeloton · 2019-05-15T07:16:01Z

The ordering has been fixed in #77

JulienPeloton added bug Something isn't working header labels May 14, 2019

JulienPeloton added a commit that referenced this issue May 15, 2019

Merge pull request #77 from /issues/76/colorder

24dd75d

Issue 76: preserve column order from the columns option

JulienPeloton closed this as completed May 16, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.option('columns', 'col1,col2,col3,col4') does not preserve order #76

.option('columns', 'col1,col2,col3,col4') does not preserve order #76

jacobic commented May 13, 2019

JulienPeloton commented May 14, 2019 •

edited

Loading

JulienPeloton commented May 14, 2019

JulienPeloton commented May 15, 2019

.option('columns', 'col1,col2,col3,col4') does not preserve order #76

.option('columns', 'col1,col2,col3,col4') does not preserve order #76

Comments

jacobic commented May 13, 2019

JulienPeloton commented May 14, 2019 • edited Loading

JulienPeloton commented May 14, 2019

JulienPeloton commented May 15, 2019

JulienPeloton commented May 14, 2019 •

edited

Loading