Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-25517][SQL] Detect/Infer date type in CSV file #22539

Closed
wants to merge 1 commit into from

Conversation

TheCodeCache
Copy link

This fix is with reference to the below JIRA Issue which I've created just hours before:

SPARK-25517

This is about spark.read.format("csv").option("inferSchema", "true").option("dateFormat", "MM/dd/yyyy").load(/path/to/csvfile). Assume /path/to/csvfile has a column which contains just date information such as employee joining date, for example:- 02/22/2018 which is 22nd of feb 2018, is a date but the spark always incorrectly reads this joining_date column as string, whereas the same analogy works perfectly fine with timestampFormat or the timestamp column values in csv.

What changes were proposed in this pull request?

to support for detecting date type from the csv files,

How was this patch tested?

manual test

Please review http://spark.apache.org/contributing.html before opening a pull request.

This fix is with reference to the below JIRA Issue which I've created just hours before:

https://issues.apache.org/jira/browse/SPARK-25517

This is about spark.read.format("csv").option("inferSchema", "true").option("dateFormat", "MM/dd/yyyy").load(/path/to/csvfile). Assume /path/to/csvfile has date type column such as employee joining date, for example:- 02/22/2018 which is 22nd of feb 2018 is a date but the spark always read this joining_date column as string, whereas this works perfectly fine with timestampFormat.
@TheCodeCache
Copy link
Author

Hi,

Please review the changes for the bug which is described and documented here at the below JIRA location in detail:

https://issues.apache.org/jira/browse/SPARK-25517

Thanks,
Manoranjan

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@NiharS
Copy link
Contributor

NiharS commented Sep 24, 2018

Could you edit your title to include the jira number and component?

e.g. [SPARK-25517][CORE] Detect ...

Helps with bookkeeping, plus it'll add a link to the jira so people can see your PR from there.

@TheCodeCache TheCodeCache changed the title detect date type in csv file [SPARK-25517][SQL] Detect/Infer date type in CSV file Sep 25, 2018
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @softmanu .

  • Since this PR changes tryParseTimestamp, it seems to be not a good place for this logic. Do you want to add a new function or to rename this function?
  • Could you add your test case (from SPARK-25517) at CSVInferSchemaSuite.scala?

BTW, #11550 originally aimed to DateType, too. So, cc @HyukjinKwon .

For TimestampType, this uses the given format to infer schema and also to convert the values
For DateType, this uses the given format to convert the values.

@HyukjinKwon
Copy link
Member

I think this is a duplicate of #21363

@dongjoon-hyun
Copy link
Member

Thank you for review, @HyukjinKwon .

@softmanu . Could you take a look at SPARK-19228 and close this PR and Apache Spark JIRA?

@dongjoon-hyun
Copy link
Member

Ping, @softmanu .

@TheCodeCache
Copy link
Author

@dongjoon-hyun @HyukjinKwon
Hi,
i was not well whole last week, now I am back, so yes, thanks for reviewing and all the comments. whether my PR is a duplicate or not we can see out later, all I worry here is the fact that it's not working as expected, and the whole steps of execution I have explained/captured at granular level in a very well structured and detailed manner so that it could be easy to understand, under this JIRA SPARK-25517

And sure, I will add a test case, and work upon it.

P.S. I've found other different issues in spark same around date/timestamp which is not working at all because the implementation itself is missing totally. On this I will get back later, first let me resolve this current issue.

Thanks,
Manoranjan : )

@HyukjinKwon
Copy link
Member

Looks #21363 getting inactive. Can you take this over instead? You can pick up the commits there and open another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants