Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.option('columns', 'col1,col2,col3,col4') does not preserve order #76

Closed
jacobic opened this issue May 13, 2019 · 3 comments
Closed

.option('columns', 'col1,col2,col3,col4') does not preserve order #76

jacobic opened this issue May 13, 2019 · 3 comments
Labels
bug Something isn't working header

Comments

@jacobic
Copy link

jacobic commented May 13, 2019

Hi Julien,

Great work on the latest release! here is some more minor feedback.

Since I am now loading a very large number of small files I thought it would be best for me to only load the relevant columns that I need and specify the schema explicitly in order to minimise the overhead (is this what you would recommend?).

I now use the .option('column', 'comma_seperated_column_names') method with spark-fits but noticed that the order of the specified columns is not preserved (see inline variable values and resulting df.show() after loading in the screenshot below).

Screenshot 2019-05-13 at 23 59 19

The columns seem to be grouped by type rather than alphabetically or by the order of the list given. This would usually not be a problem, but since I also use the column option with .schema(UserSchema...) this can cause unexpected behaviour because I set the order of the fields in the UserSchema to be identical to that of the order in the 'comma_seperated_column_names' in the column option.

The main reason why I want to use these options is to optmise the speed of reading in many files. Please let me know if this methodology is beneficial or if I am heading in the wrong direction. For a single file the header file must be read to filter the columns anyway and I assume the schema is simultaneously inferred so there might not be any speed benefit to specifying it? although I am not sure if this is still the case if you read in many files and specify the schema manually?

Cheers,
Jacob

@JulienPeloton
Copy link
Member

JulienPeloton commented May 14, 2019

Hi Jacob,

Thanks again for your precious feedback :-)

I think there are three separate things here:

  1. Order is not preserved when using the columns option (clearly a bug)
  2. Is there a gain in specifing manually the schema in the multifile case?
  3. Speeding up the I/O by selecting only some columns.

1), this is clearly a bug, that I found (the sorted should not be called):

val colPositions = selectedColNames.map(
x => getColumnPos(keyValues, x)).toList.sorted

I will open a PR and push the fix.

2), since #55, it is not obvious that not providing the schema will add overhead. As long as you know all the FITS schemas are the same, the code will perform minimal checks.

3), this is tricky. In practice, the option columns allows to skip the decoding part, but not the loading part, hence there is only a little speed-up. To do it properly, we would need to implement filters at the level of the record reader directly. This work is planned, I am just lacking of time :-(

Having said that, the proper way to get speed-up would be to upgrade the structure of spark-fits to Apache Spark Data Source V2 (we are currently using V1), recently released. But that's a lot of work, and probably needs to be done in the context of an internship as a whole project.

@JulienPeloton JulienPeloton added bug Something isn't working header labels May 14, 2019
@JulienPeloton
Copy link
Member

Also on 3), FITS is row-based, so when we read data it is natural to do:

(row, col)
element (0, 0) / element (0, 1) / element (0, 2) / ... 
element (1, 0) / element (1, 1) / element (1, 2) / ...
...

while for an efficient reading of columns, you would rather prefer to have a column-based format:

(row, col)
element (0, 0) / element (1, 0) / element (2, 0) / ...
element (0, 1) / element (1, 1) / element (2, 1) / ...
...

I suspect this will probably make the column filter pushdown implementation no so straightforward :-(

JulienPeloton added a commit that referenced this issue May 15, 2019
Issue 76: preserve column order from the columns option
@JulienPeloton
Copy link
Member

The ordering has been fixed in #77

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working header
Projects
None yet
Development

No branches or pull requests

2 participants