-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-25517][SQL] Detect/Infer date type in CSV file #22539
Conversation
This fix is with reference to the below JIRA Issue which I've created just hours before: https://issues.apache.org/jira/browse/SPARK-25517 This is about spark.read.format("csv").option("inferSchema", "true").option("dateFormat", "MM/dd/yyyy").load(/path/to/csvfile). Assume /path/to/csvfile has date type column such as employee joining date, for example:- 02/22/2018 which is 22nd of feb 2018 is a date but the spark always read this joining_date column as string, whereas this works perfectly fine with timestampFormat.
Hi, Please review the changes for the bug which is described and documented here at the below JIRA location in detail: https://issues.apache.org/jira/browse/SPARK-25517 Thanks, |
Can one of the admins verify this patch? |
Could you edit your title to include the jira number and component? e.g. [SPARK-25517][CORE] Detect ... Helps with bookkeeping, plus it'll add a link to the jira so people can see your PR from there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @softmanu .
- Since this PR changes
tryParseTimestamp
, it seems to be not a good place for this logic. Do you want to add a new function or to rename this function? - Could you add your test case (from SPARK-25517) at
CSVInferSchemaSuite.scala
?
BTW, #11550 originally aimed to DateType
, too. So, cc @HyukjinKwon .
For TimestampType, this uses the given format to infer schema and also to convert the values
For DateType, this uses the given format to convert the values.
I think this is a duplicate of #21363 |
Thank you for review, @HyukjinKwon . @softmanu . Could you take a look at SPARK-19228 and close this PR and Apache Spark JIRA? |
Ping, @softmanu . |
@dongjoon-hyun @HyukjinKwon And sure, I will add a test case, and work upon it. P.S. I've found other different issues in spark same around date/timestamp which is not working at all because the implementation itself is missing totally. On this I will get back later, first let me resolve this current issue. Thanks, |
Looks #21363 getting inactive. Can you take this over instead? You can pick up the commits there and open another PR. |
This fix is with reference to the below JIRA Issue which I've created just hours before:
SPARK-25517
This is about spark.read.format("csv").option("inferSchema", "true").option("dateFormat", "MM/dd/yyyy").load(/path/to/csvfile). Assume /path/to/csvfile has a column which contains just date information such as employee joining date, for example:- 02/22/2018 which is 22nd of feb 2018, is a date but the spark always incorrectly reads this joining_date column as string, whereas the same analogy works perfectly fine with timestampFormat or the timestamp column values in csv.
What changes were proposed in this pull request?
to support for detecting date type from the csv files,
How was this patch tested?
manual test
Please review http://spark.apache.org/contributing.html before opening a pull request.