Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959 #23667

Closed
wants to merge 1 commit into from

Conversation

Projects
None yet
4 participants
@HyukjinKwon
Copy link
Member

commented Jan 28, 2019

What changes were proposed in this pull request?

This PR reverts JSON count optimization part of #21909.

We cannot distinguish the cases below without parsing:

[{...}, {...}]
[]
{...}
# empty string

when we count(). One line (input: IN) can be, 0 record, 1 record and multiple records and this is dependent on each input.

See also #23665 (comment).

How was this patch tested?

Manually tested.

@SparkQA

This comment has been minimized.

Copy link

commented Jan 28, 2019

Test build #101745 has finished for PR 23667 at commit dd5b177.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.
@HyukjinKwon

This comment has been minimized.

Copy link
Member Author

commented Jan 28, 2019

retest this please

@SparkQA

This comment has been minimized.

Copy link

commented Jan 28, 2019

Test build #101753 has finished for PR 23667 at commit dd5b177.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@srowen

srowen approved these changes Jan 29, 2019

Copy link
Member

left a comment

It's a straight revert right? Looks OK. I agree we need to undo it for now because of the correctness issue. And then merge #23602

@HyukjinKwon

This comment has been minimized.

Copy link
Member Author

commented Jan 30, 2019

Will get this in in few days if there's no objection.

@HyukjinKwon HyukjinKwon force-pushed the HyukjinKwon:revert-SPARK-24959 branch to 466e3dd Jan 31, 2019

@SparkQA

This comment has been minimized.

Copy link

commented Jan 31, 2019

Test build #101926 has finished for PR 23667 at commit 466e3dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@HyukjinKwon

This comment has been minimized.

Copy link
Member Author

commented Jan 31, 2019

Merged to master.

I am going to open a backport soon.

@asfgit asfgit closed this in d4d6df2 Jan 31, 2019

} else {
// If `columnPruning` enabled and partition attributes scanned only,
// `schema` gets empty.
(_: String) => InternalRow.empty

This comment has been minimized.

Copy link
@cloud-fan

cloud-fan Jan 31, 2019

Contributor

It's too long ago and I can't remember the details. Does it mean we still have this count optimization for CSV? does it work in multiline mode?

This comment has been minimized.

Copy link
@HyukjinKwon

HyukjinKwon Jan 31, 2019

Author Member

Yes, it does for CSV when multiline is off and, for miltiline mode it executes a different code path.

UnivocityParser.parseStream -> UnivocityParser.convert

asfgit pushed a commit that referenced this pull request Feb 6, 2019

[SPARK-26745][SQL][TESTS] JsonSuite test case: empty line -> 0 record…
… count

## What changes were proposed in this pull request?

This PR consists of the `test` components of #23665 only, minus the associated patch from that PR.

It adds a new unit test to `JsonSuite` which verifies that the `count()` returned from a `DataFrame` loaded from JSON containing empty lines does not include those empty lines in the record count. The test runs `count` prior to otherwise reading data from the `DataFrame`, so as to catch future cases where a pre-parsing optimization might result in `count` results inconsistent with existing behavior.

This PR is intended to be deployed alongside #23667; `master` currently causes the test to fail, as described in [SPARK-26745](https://issues.apache.org/jira/browse/SPARK-26745).

## How was this patch tested?

Manual testing, existing `JsonSuite` unit tests.

Closes #23674 from sumitsu/json_emptyline_count_test.

Authored-by: Branden Smith <branden.smith@publicismedia.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

stczwd added a commit to stczwd/spark that referenced this pull request Feb 18, 2019

[SPARK-26745][SQL] Revert count optimization in JSON datasource by SP…
…ARK-24959

## What changes were proposed in this pull request?

This PR reverts JSON count optimization part of apache#21909.

We cannot distinguish the cases below without parsing:

```
[{...}, {...}]
```

```
[]
```

```
{...}
```

```bash
# empty string
```

when we `count()`. One line (input: IN) can be, 0 record, 1 record and multiple records and this is dependent on each input.

See also apache#23665 (comment).

## How was this patch tested?

Manually tested.

Closes apache#23667 from HyukjinKwon/revert-SPARK-24959.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

stczwd added a commit to stczwd/spark that referenced this pull request Feb 18, 2019

[SPARK-26745][SQL][TESTS] JsonSuite test case: empty line -> 0 record…
… count

## What changes were proposed in this pull request?

This PR consists of the `test` components of apache#23665 only, minus the associated patch from that PR.

It adds a new unit test to `JsonSuite` which verifies that the `count()` returned from a `DataFrame` loaded from JSON containing empty lines does not include those empty lines in the record count. The test runs `count` prior to otherwise reading data from the `DataFrame`, so as to catch future cases where a pre-parsing optimization might result in `count` results inconsistent with existing behavior.

This PR is intended to be deployed alongside apache#23667; `master` currently causes the test to fail, as described in [SPARK-26745](https://issues.apache.org/jira/browse/SPARK-26745).

## How was this patch tested?

Manual testing, existing `JsonSuite` unit tests.

Closes apache#23674 from sumitsu/json_emptyline_count_test.

Authored-by: Branden Smith <branden.smith@publicismedia.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

wangyum added a commit to wangyum/spark that referenced this pull request Mar 14, 2019

[SPARK-26745][SQL] Revert count optimization in JSON datasource by SP…
…ARK-24959

## What changes were proposed in this pull request?

This PR reverts JSON count optimization part of apache#21909.

We cannot distinguish the cases below without parsing:

```
[{...}, {...}]
```

```
[]
```

```
{...}
```

```bash
# empty string
```

when we `count()`. One line (input: IN) can be, 0 record, 1 record and multiple records and this is dependent on each input.

See also apache#23665 (comment).

## How was this patch tested?

Manually tested.

Closes apache#23667 from HyukjinKwon/revert-SPARK-24959.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

wangyum added a commit to wangyum/spark that referenced this pull request Mar 14, 2019

[SPARK-26745][SQL][TESTS] JsonSuite test case: empty line -> 0 record…
… count

## What changes were proposed in this pull request?

This PR consists of the `test` components of apache#23665 only, minus the associated patch from that PR.

It adds a new unit test to `JsonSuite` which verifies that the `count()` returned from a `DataFrame` loaded from JSON containing empty lines does not include those empty lines in the record count. The test runs `count` prior to otherwise reading data from the `DataFrame`, so as to catch future cases where a pre-parsing optimization might result in `count` results inconsistent with existing behavior.

This PR is intended to be deployed alongside apache#23667; `master` currently causes the test to fail, as described in [SPARK-26745](https://issues.apache.org/jira/browse/SPARK-26745).

## How was this patch tested?

Manual testing, existing `JsonSuite` unit tests.

Closes apache#23674 from sumitsu/json_emptyline_count_test.

Authored-by: Branden Smith <branden.smith@publicismedia.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

dongjoon-hyun added a commit that referenced this pull request Apr 27, 2019

[SPARK-26745][SQL][TESTS] JsonSuite test case: empty line -> 0 record…
… count

This PR consists of the `test` components of #23665 only, minus the associated patch from that PR.

It adds a new unit test to `JsonSuite` which verifies that the `count()` returned from a `DataFrame` loaded from JSON containing empty lines does not include those empty lines in the record count. The test runs `count` prior to otherwise reading data from the `DataFrame`, so as to catch future cases where a pre-parsing optimization might result in `count` results inconsistent with existing behavior.

This PR is intended to be deployed alongside #23667; `master` currently causes the test to fail, as described in [SPARK-26745](https://issues.apache.org/jira/browse/SPARK-26745).

Manual testing, existing `JsonSuite` unit tests.

Closes #23674 from sumitsu/json_emptyline_count_test.

Authored-by: Branden Smith <branden.smith@publicismedia.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 63bced9)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.