[SPARK-42359][SQL] Support row skipping when reading CSV files by ted-jenks · Pull Request #39907 · apache/spark

ted-jenks · 2023-02-06T14:18:52Z

What changes were proposed in this pull request?

Adds support for row skipping when reading CSV files, via a new skipLines option.

Why are the changes needed?

SPARK-42359 highlights the need for row skipping in CSV reads. In summary, users currently have no way to read CSV files with the DataFrame API when the header or data does not start on the first row. Advanced users can use RDDs and zipWithIndex to remove the first n rows, but this sits poorly with RDDs being an unordered concept.

Does this PR introduce any user-facing change?

Users now have access to a new skipLines option when reading CSVs. The option defaults to 0, so existing code is unaffected. It causes the parser to skip the specified number of lines before parsing the data, and it respects multiline values. If the header option is set to true, the first line after the skipped lines is taken as the header.

The option is used as follows: spark.read.option("skipLines", 2).csv("/path/to/file.csv").

This change has been reflected in the user docs.
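
To make the described semantics concrete, here is a minimal sketch in plain Scala with no Spark dependency. `SkipLinesSketch`, `Parsed`, and `read` are hypothetical names used only for illustration, the naive comma split ignores quoting, and rejecting negative values is an assumption about what counts as an invalid option. It is a model of the behaviour described above, not the code in this PR.

```scala
// Hypothetical model of the described skipLines semantics (not Spark code).
object SkipLinesSketch {
  final case class Parsed(header: Option[Seq[String]], rows: Seq[Seq[String]])

  def read(lines: Seq[String], skipLines: Int, header: Boolean): Parsed = {
    // Assumption: a negative value is treated as invalid.
    require(skipLines >= 0, "skipLines must be non-negative")
    // Drop the first skipLines lines before any parsing happens.
    val remaining = lines.drop(skipLines)
    // Naive split on commas; real CSV parsing also handles quotes and escapes.
    val split = remaining.map(_.split(",", -1).toSeq)
    // With header = true, the first line after the skipped ones is the header.
    if (header && split.nonEmpty) Parsed(Some(split.head), split.tail)
    else Parsed(None, split)
  }
}
```

With the default of 0 nothing is dropped, which is why existing reads would be unaffected under this model.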

How was this patch tested?

Tests added in CSVSuite for:

Reads from datasets of Strings with multiLine enabled and disabled.
Reads from CSV files with multiLine enabled and disabled (these take different code-paths).
Reads with header set to true and to false (to ensure the schema is correctly inferred).
An invalid skipLines option throws an exception.
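
The multiLine cases matter because a quoted value can span several physical lines, so skipping records and skipping raw lines can give different results. The sketch below is a simplified model in plain Scala under that assumption: `MultiLineSkipSketch` and its methods are hypothetical names, and counting double quotes is a stand-in for a real CSV tokenizer, not what the PR does.

```scala
// Hypothetical model: skip logical records rather than physical lines.
object MultiLineSkipSketch {
  // Group physical lines into logical records. A record is treated as complete
  // once the quotes seen so far are balanced (an escaped quote "" counts twice).
  def records(lines: Seq[String]): Seq[String] = {
    val out = Seq.newBuilder[String]
    val pending = new StringBuilder
    var open = false // true while we are inside an unterminated quoted value
    for (line <- lines) {
      if (open) pending.append('\n') // re-join the lines of a multiline value
      pending.append(line)
      if (line.count(_ == '"') % 2 == 1) open = !open
      if (!open) { out += pending.toString; pending.clear() }
    }
    // An unterminated quote at end of input: keep the remainder as one record.
    if (pending.nonEmpty) out += pending.toString
    out.result()
  }

  // Skip the first n logical records.
  def skipRecords(lines: Seq[String], n: Int): Seq[String] =
    records(lines).drop(n)
}
```

In this model, skipping one record from an input whose first value spans two lines removes both physical lines, whereas a plain `lines.drop(1)` would leave the second half of the quoted value behind.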

ted-jenks · 2023-02-16T09:02:06Z

@HyukjinKwon I can see you have touched a lot of this code. What do you think about these changes?

HyukjinKwon · 2023-02-16T10:45:42Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala

+        nonEmptyLines.as[String]
+      }
    }
+    commentFilteredLines.rdd.zipWithIndex().toDF("value", "order")


zipWithIndex is actually expensive: it requires another job to execute.

@HyukjinKwon I reworked it to try to avoid this, how does it look now?
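
For readers wondering why a global index costs an extra job: to number elements across partitions, each partition needs to know how many elements precede it, which means counting the earlier partitions first. The sketch below models that two-pass shape with in-memory sequences standing in for partitions. `ZipWithIndexSketch` is a hypothetical name and this is an illustration of the idea, not Spark's implementation.

```scala
// Hypothetical model of global indexing over partitions (not Spark code).
object ZipWithIndexSketch {
  def zipWithIndex[A](partitions: Seq[Seq[A]]): Seq[Seq[(A, Long)]] = {
    // Pass 1: count every partition. On a distributed dataset this is the
    // additional job that runs before any index can be assigned.
    val sizes = partitions.map(_.size.toLong)
    // The start offset of a partition is the total size of those before it.
    val offsets = sizes.scanLeft(0L)(_ + _)
    // Pass 2: assign global indices using the per-partition offsets.
    partitions.zip(offsets).map { case (part, start) =>
      part.zipWithIndex.map { case (a, i) => (a, start + i) }
    }
  }
}
```

Skipping only the first few lines does not need global indices at all, which is presumably why an iterator-based approach is cheaper here.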

ted-jenks · 2023-02-27T15:11:19Z

@HyukjinKwon This is ready for a re-review 🤞

ted-jenks · 2023-03-27T09:48:31Z

@HyukjinKwon It would be great to get an update on your thoughts on this issue.

docs/sql-data-sources-csv.md

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala

ted-jenks · 2023-04-21T13:03:14Z

@HyukjinKwon I have done more work on this, please let me know what you think!

holdenk

Some questions (mostly just things I think we should document better for the new behaviour).

holdenk · 2023-07-08T23:39:18Z

docs/sql-data-sources-csv.md

+  <tr>
+    <td><code>skipLines</code></td>
+    <td>0</td>
+    <td>Sets the number of non-empty, uncommented lines to skip before parsing CSV files. If the <code>header</code> option is set to <code>true</code>, the first line after the number of <code>skipLines</code> will be taken as the header.</td>
+    <td>read</td>
+  </tr>


Does skipLines apply before or after the filtering? For example, if we have 10 empty lines at the top of partition 1, what is the behaviour?

When does a CSV file have multiple header rows?
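
The ordering question changes the result, as the following plain-Scala sketch shows. `SkipOrderSketch`, its method names, and the `#` comment character are assumptions for illustration. The doc text above ("non-empty, uncommented lines") reads like the filter-then-skip order, but the sketch does not claim that is what the PR implements.

```scala
// Hypothetical comparison of the two possible orderings (not Spark code).
object SkipOrderSketch {
  // Keep a line if it is non-blank and does not start with the comment char.
  private def keep(line: String, comment: Char): Boolean = {
    val t = line.trim
    t.nonEmpty && t.head != comment
  }

  // Blank and comment lines do not count towards the n skipped lines.
  def filterThenSkip(lines: Seq[String], n: Int, comment: Char = '#'): Seq[String] =
    lines.filter(keep(_, comment)).drop(n)

  // Blank and comment lines do count towards the n skipped lines.
  def skipThenFilter(lines: Seq[String], n: Int, comment: Char = '#'): Seq[String] =
    lines.drop(n).filter(keep(_, comment))
}
```

For an input that starts with an empty line, skipping one line under the second ordering only consumes that empty line and leaves the unwanted title row in place.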

holdenk · 2023-07-08T23:40:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala

+    val handleSkipLines: () => Unit =
+      () => 1.to(skipLines).foreach(_ => tokenizer.parseNext())


What's the behaviour when skipLines is greater than the length of the input file?
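
One possible contract for that case, sketched with a plain Scala Iterator standing in for the tokenizer: skipping past the end simply yields no records instead of failing. `SkipBeyondEndSketch` is a hypothetical name, and whether the PR behaves this way is exactly the open question above, so treat this as a proposal to document rather than a description of the patch.

```scala
// Hypothetical model: skipping more records than exist yields an empty result.
object SkipBeyondEndSketch {
  def skip[A](records: Iterator[A], n: Int): Iterator[A] = {
    // Assumption: negative values are rejected up front.
    require(n >= 0, "number of records to skip must be non-negative")
    // Iterator.drop stops quietly at the end of input if fewer than n remain.
    records.drop(n)
  }
}
```

Under this contract a header lookup after the skip would also need to handle an empty iterator, which is worth stating in the docs.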

github-actions · 2023-10-18T00:17:56Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

MaxGekk

@ted-jenks The feature is useful from my point of view. May I ask you to rebase on the recent master?

github-actions bot added DOCS SQL labels Feb 6, 2023

ted-jenks force-pushed the SPARK-42359 branch from cab25be to eda35c2 Compare February 6, 2023 14:32

add skiplines

284dde6

ted-jenks force-pushed the SPARK-42359 branch from eda35c2 to 284dde6 Compare February 6, 2023 14:34

ted jenks and others added 25 commits February 6, 2023 15:39

reformat and parameterize tests

8c24bcf

move asserts

a3895cc

remove accidental words

8cc21d7

add tests with comments

b56d548

explicitely order blank line removal and comments

bbb7376

renme func

2fa479e

revert respect skipline in schema inference

9df96aa

Revert "revert respect skipline in schema inference"

a85b916

This reverts commit 9df96aa.

Revert "renme func"

8ba4d1c

This reverts commit 2fa479e.

Revert "explicitely order blank line removal and comments"

bc216a7

This reverts commit bbb7376.

return to comments and blanks first

47a03ab

add test for comments and blanks in dataset

220cea9

make expectation for new test right

a0b621c

change test name

e3ffa83

fix header assert

3f87508

rework CSVExprUtils

53f52ef

update CSV Utils

a4da66d

Revert "change test name"

be466a9

This reverts commit e3ffa83.

Revert "Revert "change test name""

0256fb2

This reverts commit be466a9.

Revert "update CSV Utils"

f55bb9a

This reverts commit a4da66d.

Revert "rework CSVExprUtils"

1429471

This reverts commit 53f52ef.

set order to blank, skip, comment

eb51aa2

revert order to original

b5d11c2

update docs

fc7c495

remove filter comment and empty

1bde0fd

ted-jenks mentioned this pull request Feb 14, 2023

[SPARK-42373][SQL] Remove unused blank line removal from CSVExprUtils #39927

Closed

ted-jenks changed the title ~~[WIP][SPARK-42359][SQL] Support row skipping when reading CSV files~~ [SPARK-42359][SQL] Support row skipping when reading CSV files Feb 14, 2023

HyukjinKwon reviewed Feb 16, 2023


remove sip with index

2b11eeb

ted-jenks requested a review from HyukjinKwon February 16, 2023 14:16

ted-jenks marked this pull request as draft February 16, 2023 16:44

use an iter

3b97b9c

ted-jenks marked this pull request as ready for review February 17, 2023 13:50

jaceklaskowski reviewed Mar 27, 2023


comments

b636a75

ted-jenks requested review from HyukjinKwon and jaceklaskowski and removed request for HyukjinKwon and jaceklaskowski March 29, 2023 09:53

ted-jenks requested a review from HyukjinKwon April 12, 2023 12:55

holdenk reviewed Jul 8, 2023


github-actions bot added the Stale label Oct 18, 2023

MaxGekk reviewed Oct 18, 2023


github-actions bot closed this Oct 19, 2023

srowen Jul 9, 2023

6 participants
