Skip to content

[SPARK-42359][SQL] Support row skipping when reading CSV files#39907

Closed
ted-jenks wants to merge 36 commits intoapache:masterfrom
ted-jenks:SPARK-42359
Closed

[SPARK-42359][SQL] Support row skipping when reading CSV files#39907
ted-jenks wants to merge 36 commits intoapache:masterfrom
ted-jenks:SPARK-42359

Conversation

@ted-jenks
Copy link
Contributor

@ted-jenks ted-jenks commented Feb 6, 2023

What changes were proposed in this pull request?

Added support row skipping when reading CSV files with a skipLines option.

Why are the changes needed?

In SPARK-42359 we highlight the need for row skipping functionality in CSV reads. In summary, there is no way of users reading CSV files that do not have the header/data in the first row with the DataFrame API. Advanced users can use RDDs and zipWithIndex to remove the first n rows, though this is not compatible with RDDs being an unordered concept.

Does this PR introduce any user-facing change?

Now the user's have access to a new a option when reading CSVs. This option defaults to 0 so legacy code will be unaffected. The skipLines option that has been added will cause the parser to skip a specified number of lines before parsing the data. It will respect multline values. If the header option is set to true, the first line after the skipLines will be taken as the header.

The option is used: spark.read.option("skipLines", 2).csv("/path/to/file.csv").

This change has been reflected in the user docs.

How was this patch tested?

Tests added in CSVSuite for:

  • Reads from datasets of Strings with multiLine enabled and disabled.
  • Reads from CSV files with multiLine enabled and disabled (these take different code-paths).
  • Reads with header set as true and as as false (ensure schema is correctly inferred).
  • Invalid skipLines option throws exception.

@ted-jenks ted-jenks changed the title [WIP][SPARK-42359][SQL] Support row skipping when reading CSV files [SPARK-42359][SQL] Support row skipping when reading CSV files Feb 14, 2023
@ted-jenks
Copy link
Contributor Author

@HyukjinKwon I can see you have touched a lot of this code. What do you think about these changes?

nonEmptyLines.as[String]
}
}
commentFilteredLines.rdd.zipWithIndex().toDF("value", "order")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zipWithIndex is actually expensive .. it requires another job to execute.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon I reworked it to try to avoid this, how does it look now?

@ted-jenks ted-jenks marked this pull request as draft February 16, 2023 16:44
@ted-jenks ted-jenks marked this pull request as ready for review February 17, 2023 13:50
@ted-jenks
Copy link
Contributor Author

@HyukjinKwon This is ready for a re-review 🤞

@ted-jenks
Copy link
Contributor Author

@HyukjinKwon would be great to get an update on your thoughts on this issue?

@ted-jenks ted-jenks requested review from HyukjinKwon and jaceklaskowski and removed request for HyukjinKwon and jaceklaskowski March 29, 2023 09:53
@ted-jenks ted-jenks requested a review from HyukjinKwon April 12, 2023 12:55
@ted-jenks
Copy link
Contributor Author

@HyukjinKwon I have done more work on this, please let me know what you think!

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions (mostly just things I think we should document better for the new behaviour).

Comment on lines +105 to +110
<tr>
<td><code>skipLines</code></td>
<td>0</td>
<td>Sets the number of non-empty, uncommented lines to skip before parsing CSV files. If the <code>header</code> option is set to <code>true</code>, the first line after the number of <code>skipLines</code> will be taken as the header.</td>
<td>read</td>
</tr>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does skipLines apply before or after the filtering? (e.g. if we have 10 empty lines at the top of partition 1, what is the behaviour)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does a CSV file have multiple header rows?

Comment on lines +384 to +385
val handleSkipLines: () => Unit =
() => 1.to(skipLines).foreach(_ => tokenizer.parseNext())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whats the behaviour when skipLines is greater than the length of the input file?

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 18, 2023
Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ted-jenks The feature is useful from my point of view. May I ask you to rebase on the recent master.

@github-actions github-actions bot closed this Oct 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants