[SPARK-32025][SQL] Csv schema inference problems with different types in the same column #28896

planga82 · 2020-06-22T20:33:36Z

What changes were proposed in this pull request?

This pull request fixes a bug in CSV type inference: the schema is inferred incorrectly when a single column contains values of different types.

Previously:

$ cat /example/f1.csv
col1
43200000
true

spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True).show()
+----+
|col1|
+----+
|null|
|true|
+----+

root
 |-- col1: boolean (nullable = true)

Now:

spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True).show()
+--------+
|    col1|
+--------+
|43200000|
|    true|
+--------+

root
 |-- col1: string (nullable = true)

Previously, the hierarchy of type inference was the following:

IntegerType

LongType

DecimalType

DoubleType

TimestampType

BooleanType

StringType

So when, for example, a column contains integers and its last element is a boolean, the whole column is incorrectly inferred as a boolean column, and all the numbers are shown as null when you look at the data.
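The order dependence described above can be reproduced with a minimal pure-Python sketch. This is not Spark's code: the parsers are simplified stand-ins, DecimalType is left out for brevity, and `infer_field_linear` and `infer_column_linear` are hypothetical names used only for illustration. The assumption is that inference only ever moves forward along the single chain, starting from the type inferred so far.

```python
from datetime import datetime


def _parses(parser, text):
    """Return True if parser(text) succeeds without raising ValueError."""
    try:
        parser(text)
        return True
    except ValueError:
        return False


def _fits(text, bits):
    """Return True if text is an integer within a signed `bits`-bit range."""
    return _parses(int, text) and -(2 ** (bits - 1)) <= int(text) < 2 ** (bits - 1)


# The single linear chain from the description (DecimalType omitted for brevity).
CHAIN = [
    ("IntegerType", lambda s: _fits(s, 32)),
    ("LongType", lambda s: _fits(s, 64)),
    ("DoubleType", lambda s: _parses(float, s)),
    ("TimestampType", lambda s: _parses(datetime.fromisoformat, s)),
    ("BooleanType", lambda s: s.lower() in ("true", "false")),
    ("StringType", lambda s: True),
]
NAMES = [name for name, _ in CHAIN]


def infer_field_linear(type_so_far, field):
    """Hypothetical helper: try only the types at or after type_so_far."""
    for name, accepts in CHAIN[NAMES.index(type_so_far):]:
        if accepts(field):
            return name
    return "StringType"


def infer_column_linear(values):
    """Fold a column's values through the chain, starting at its head."""
    inferred = NAMES[0]
    for value in values:
        inferred = infer_field_linear(inferred, value)
    return inferred


print(infer_column_linear(["43200000", "true"]))  # BooleanType
print(infer_column_linear(["true", "43200000"]))  # StringType
```

In this sketch the same two values give a different result depending on their order, which matches the symptom in the example: integers followed by a boolean end up as a boolean column.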

We need the following hierarchy instead. When a column contains different numeric types, it will be resolved correctly along the numeric chain. When it contains any other mix of types, it will be resolved as a StringType column.

IntegerType -> LongType -> DecimalType -> DoubleType -> StringType

TimestampType -> StringType

BooleanType -> StringType

StringType
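The separate chains can be read as a merge rule between two already-inferred types. The following is a minimal sketch under that reading, not the actual Spark implementation: `compatible_type` and `merge_column` here are illustrative names operating on type names as strings, and the sketch ignores nulls and decimal precision and scale.

```python
from functools import reduce

# Numeric types in widening order, taken from the first chain above.
NUMERIC_ORDER = ["IntegerType", "LongType", "DecimalType", "DoubleType"]


def compatible_type(left, right):
    """Illustrative merge rule: widen within the numeric chain, else StringType."""
    if left == right:
        return left
    if left in NUMERIC_ORDER and right in NUMERIC_ORDER:
        # Pick whichever type sits later (is wider) in the numeric chain.
        return max(left, right, key=NUMERIC_ORDER.index)
    # Timestamp, boolean and any cross-chain mix fall back to StringType.
    return "StringType"


def merge_column(types):
    """Merge the per-value types of one column into a single column type."""
    return reduce(compatible_type, types)


print(merge_column(["IntegerType", "BooleanType"]))                # StringType
print(merge_column(["BooleanType", "IntegerType"]))                # StringType
print(merge_column(["IntegerType", "LongType", "DoubleType"]))     # DoubleType
```

Because the rule is symmetric, the result no longer depends on the order in which values appear in the column.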

Why are the changes needed?

To fix the bug described above.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test and manual tests

planga82 · 2020-06-22T20:43:22Z

CC @HyukjinKwon @cloud-fan @MaxGekk

MaxGekk

If StringType is not the last type in every type inference chain, this could break existing users' apps, I guess. Couldn't it?

planga82 · 2020-06-22T21:21:36Z

@MaxGekk Yes, StringType is the last type in all inference chains. I have changed the description to show it better.
The main difference is that we now have several chains instead of one.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala

cloud-fan · 2020-06-23T04:48:51Z

Does JSON source have the same problem?

planga82 · 2020-06-23T15:56:50Z

@HyukjinKwon It's exactly as you say: it only happens when the incompatible types are inside one partition. I will change the PR to use compatibleType, and I will run some performance tests. Thanks for your help!

@cloud-fan I tested the same situation with JSON and it works fine; we don't have this problem there.

planga82 · 2020-06-23T20:46:09Z

I have run some performance tests on my local machine.

File: 999999999 lines of integers / 10 GB
Resources: 4 cores

Results without changes:

21m10,114s
20m18,626s
19m46,158s

Results after changes:

19m46,388s
22m10,784s
21m15,937s

The changes don't seem to have a very significant impact, but tests on a local machine are not the best way to be sure.
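To put the timings above side by side, a small sketch can convert them to seconds and compare the means. The helper name `to_seconds` is an assumption made for this example; the input strings are copied from the results above, including the comma used as the decimal separator.

```python
import re
from statistics import mean


def to_seconds(timing):
    """Convert a string like '21m10,114s' (comma decimal) to seconds."""
    minutes, secs, frac = re.fullmatch(r"(\d+)m(\d+),(\d+)s", timing).groups()
    return int(minutes) * 60 + float(f"{secs}.{frac}")


without_changes = ["21m10,114s", "20m18,626s", "19m46,158s"]
after_changes = ["19m46,388s", "22m10,784s", "21m15,937s"]

mean_before = mean(to_seconds(t) for t in without_changes)
mean_after = mean(to_seconds(t) for t in after_changes)

print(f"mean without changes: {mean_before:.1f} s")
print(f"mean after changes:   {mean_after:.1f} s")
print(f"relative difference:  {(mean_after / mean_before - 1) * 100:.1f} %")
```

The means come out at roughly 1225 s and 1264 s, about 3% apart, which is smaller than the spread between individual runs in either group. With only three runs per configuration this is an indication, not a measurement.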

HyukjinKwon · 2020-06-25T04:47:54Z

ok to test

SparkQA · 2020-06-25T07:05:01Z

Test build #124510 has finished for PR 28896 at commit 4629bb5.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

planga82 · 2020-06-25T07:23:19Z

retest this please

HyukjinKwon · 2020-06-25T10:03:31Z

retest this please

SparkQA · 2020-06-25T14:40:03Z

Test build #124512 has finished for PR 28896 at commit 4629bb5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-06-26T01:41:07Z

Merged to master.

gatorsmile · 2020-07-08T06:15:10Z

Is it possible to document the type inference rules? Something like what traditional databases do, e.g. https://www.ibm.com/support/knowledgecenter/SSEPGG_10.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0008477.html

cc @huaxingao @planga82

huaxingao · 2020-07-08T15:11:35Z

@gatorsmile I will take a look.

planga82 · 2020-07-08T15:36:18Z

Thanks @huaxingao. If you can't do it for any reason, tell me.

Fix csv type inference

55c013e

probot-autolabeler bot added the SQL label Jun 22, 2020

MaxGekk reviewed Jun 22, 2020

HyukjinKwon reviewed Jun 23, 2020

planga82 added 3 commits June 23, 2020 18:00

Discard infer schema changes

7fbd442

Fix text

ae64bb9

Check compatibility types

c6bbc27

cloud-fan reviewed Jun 24, 2020

cloud-fan reviewed Jun 24, 2020

Move compatible type inside inferSchema

a3a4bfd

cloud-fan approved these changes Jun 24, 2020

HyukjinKwon reviewed Jun 24, 2020

HyukjinKwon reviewed Jun 24, 2020

planga82 added 2 commits June 24, 2020 16:52

Pull request comments

d167493

Delete not used variable

4629bb5

HyukjinKwon approved these changes Jun 25, 2020

HyukjinKwon closed this in bbb2cba Jun 26, 2020
