Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32025][SQL] Csv schema inference problems with different types in the same column #28896

Closed

Conversation

planga82
Copy link
Contributor

@planga82 planga82 commented Jun 22, 2020

What changes were proposed in this pull request?

This pull request fixes a bug present in the csv type inference.
We have problems when we have different types in the same column.

Previously:

$ cat /example/f1.csv
col1
43200000
true

spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True).show()
+----+
|col1|
+----+
|null|
|true|
+----+

root
 |-- col1: boolean (nullable = true)

Now

spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True).show()
+-------------+
|col1          |
+-------------+
|43200000 |
|true           |
+-------------+

root                                                                            
 |-- col1: string (nullable = true)

Previously the hierarchy of type inference is the following:

IntegerType

LongType

DecimalType

DoubleType

TimestampType

BooleanType

StringType

So, when, for example, we have integers in one column, and the last element is a boolean, all the column is inferred as a boolean column incorrectly and all the number are shown as null when you see the data

We need the following hierarchy. When we have different numeric types in the column it will be resolved correctly. And when we have other different types it will be resolved as a String type column

IntegerType

LongType

DecimalType

DoubleType

StringType

TimestampType

StringType

BooleanType

StringType

StringType

Why are the changes needed?

Fix the bug explained

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test and manual tests

@planga82
Copy link
Contributor Author

CC @HyukjinKwon @cloud-fan @MaxGekk

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If StringType is not the last type in any type inference chains, this can break existing users apps, I guess. Can't it?

@planga82
Copy link
Contributor Author

planga82 commented Jun 22, 2020

@MaxGekk Yes, the last type in inference is StringType in all inference chains. I have changed the description to show it better.
The principal difference is that now we have different chains instead of one

@cloud-fan
Copy link
Contributor

Does JSON source have the same problem?

@planga82
Copy link
Contributor Author

@HyukjinKwon It's exactly what you say, it only happens when the incompatibility is inside one partition. I will change the PR to use compatibleType, and I will make some performance test. Thanks for your help!

@cloud-fan I tested the same situation with json and it works fine, we don't have problems there

@planga82
Copy link
Contributor Author

I have done some performance tests in my local machine

  • File: 999999999 lines integers / 10 GB
  • Resources: 4 Cores

Result without changes:

  • 21m10,114s
  • 20m18,626s
  • 19m46,158s

Results after changes

  • 19m46,388s
  • 22m10,784s
  • 21m15,937s

It seems that we don't have a very significant impact but the tests in local are not the best way to be sure

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Jun 25, 2020

Test build #124510 has finished for PR 28896 at commit 4629bb5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@planga82
Copy link
Contributor Author

retest this please

1 similar comment
@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Jun 25, 2020

Test build #124512 has finished for PR 28896 at commit 4629bb5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

@gatorsmile
Copy link
Member

Is that possible we can document the type inference rule? something like the traditional database like https://www.ibm.com/support/knowledgecenter/SSEPGG_10.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0008477.html

cc @huaxingao @planga82

@huaxingao
Copy link
Contributor

@gatorsmile I will take a look.

@planga82
Copy link
Contributor Author

planga82 commented Jul 8, 2020

Thanks @huaxingao If you can't do it for any reason tell me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
7 participants