[SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV#13267
jurriaan wants to merge 7 commits into apache:master
…riting CSV See https://github.com/uniVocity/univocity-parsers/blob/f3eb2af26374940e60d91d1703bde54619f50c51/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java#L231-L247 This kind of functionality is needed to be able to write Amazon Redshift compatible CSV files (https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-csv) https://issues.apache.org/jira/browse/SPARK-15493
val df = spark.createDataFrame(Seq(("test \"quote\"", 123,
  "it \"works\"!", "\"very\" well")))
  .toDF("a", "b", "c", "d")
The indentation here could be fixed, maybe. Other code follows indentation styles such as:
spark.createDataFrame(Seq(
  ("test \"quote\"", 123, "it \"works\"!", "\"very\" well")
)).toDF("a", "b", "c", "d")
or
val data = Seq(("test \"quote\"", 123, "it \"works\"!", "\"very\" well"))
spark.createDataFrame(data).toDF("a", "b", "c", "d")
and so on.
@jurriaan Just to double check.. It does not escape
@HyukjinKwon If you don't supply those options, they are set to their defaults. For how setQuoteEscapingEnabled works, see uniVocity/univocity-parsers#38. In the test I supplied them to show a possible use case.
@HyukjinKwon Addressed your comments and improved the documentation a bit.
Test build #3011 has finished for PR 13267 at commit
Can we explain, using an example, what this does when it is off?
@rxin The default CSV behaviour will save the data like this when you specify
With quoteEscapingEnabled set to true the output looks like this:
As you can see, by default a value is wrapped in quotes only if it starts with a quote character. When quoteEscapingEnabled is turned on, every value containing a quote character is wrapped in quotes. This is needed in some CSV dialects.
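The two behaviours described in this comment can be sketched in plain Python. This is a simplified model built only from the description above, not univocity's implementation: `format_field` and `format_row` are hypothetical helpers, the quote and escape characters are both assumed to be `"`, and quoting triggered by delimiters or line breaks is ignored.

```python
def format_field(value, escape_quotes, quote='"'):
    """Render one CSV field under the two modes described in the comment.

    Hypothetical helper: a simplified model, not the univocity writer.
    """
    text = str(value)
    if escape_quotes:
        # Flag on: quote every value that contains a quote character.
        needs_quoting = quote in text
    else:
        # Flag off: quote a value only if it starts with a quote character.
        needs_quoting = text.startswith(quote)
    if not needs_quoting:
        return text
    # Inside a quoted field, each quote is escaped by doubling it.
    return quote + text.replace(quote, quote + quote) + quote


def format_row(values, escape_quotes):
    """Join the rendered fields with commas (delimiter handling omitted)."""
    return ",".join(format_field(v, escape_quotes) for v in values)


# Same data as the test in this PR.
row = ('test "quote"', 123, 'it "works"!', '"very" well')

print(format_row(row, escape_quotes=False))
# test "quote",123,it "works"!,"""very"" well"
print(format_row(row, escape_quotes=True))
# "test ""quote""",123,"it ""works""!","""very"" well"
```

With the flag off, only the last field is quoted because it is the only one that starts with a quote; with the flag on, every field containing a quote is wrapped and its inner quotes are doubled.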
Thanks - a follow up question: should this flag ever be false? i.e. should we not have this flag and just have it on always? Alternatively, have it on by default?
@rxin Good question, I'm not sure what's the best approach here. It looks like setting the flag to true by default could be a good choice. The comment at src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java mentions the default behaviour (quoteEscapingEnabled set to false) is not valid CSV according to RFC 4180. To quote RFC 4180:
I'm not sure why they (the univocity csv parser developers) chose to turn it off by default. Maybe performance reasons?
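For comparison, section 2 of RFC 4180 says that fields containing double quotes should be enclosed in double quotes, and that a double quote inside an enclosed field is escaped by preceding it with another double quote. A minimal sketch using Python's standard `csv` module (not Spark or univocity, so it is only an analogy for the flag discussed here) produces output of that shape and reads it back unchanged:

```python
import csv
import io

# Same values as the test row in this PR, with the integer written as text.
row = ['test "quote"', "123", 'it "works"!', '"very" well']

# The default dialect uses QUOTE_MINIMAL with doublequote=True, so any field
# containing the quote character is enclosed and its quotes are doubled.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(row)
line = buffer.getvalue()
print(line, end="")
# "test ""quote""",123,"it ""works""!","""very"" well"

# Reading the line back recovers the original values.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed == row)
```

The written line matches what the comments in this thread describe for quoteEscapingEnabled set to true.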
@jbax can we get a 2nd opinion here about quoteEscapingEnabled?
It's disabled by default because earlier versions were slower when writing
With version 2.1.0 the new algorithm made the writing performance improve a
Versions 2.2.x and up will have this enabled by default.
Thanks, @jbax. Given this I think we should just have it on by default. Some follow-up questions:
@rxin In your case I think it's better to have this turned on by default. Regarding your other questions:
1 - There's no timeline. 2.2.x will come out when new features are requested by our users and implemented. Currently there's nothing in the pipeline, so we'll be on 2.1.x, adding fixes and minor internal improvements over time. We have no open bugs either.
2 - Yes. It fixes a couple of bugs you guys probably won't come across, but it also improves the performance of the parser with whitespace trimming enabled (it's enabled by default, by the way).
3 - It's OK and I don't see why it would be a problem, other than having some client with a very uncommon use case (they are out there, that's why the library has a lot of configuration options).
Yea I agree with escapeQuotes.
@jurriaan want to do the change?
@rxin Done :)
python/pyspark/sql/readwriter.py
Outdated
               quoted value. If None is set, it uses the default value, ``\``
:param escapeQuotes: A flag indicating whether values containing quotes should always
                     be enclosed in quotes. Default is to escape all values containing
                     a quote character. ``true``
what's the true at the end here?
BTW don't forget to update the title too.
Test build #3018 has finished for PR 13267 at commit
Thanks - merging in master/2.0.
…ting CSV

## What changes were proposed in this pull request?

Default QuoteEscapingEnabled flag to true when writing CSV and add an escapeQuotes option to be able to change this.

See https://github.com/uniVocity/univocity-parsers/blob/f3eb2af26374940e60d91d1703bde54619f50c51/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java#L231-L247

This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2)

https://issues.apache.org/jira/browse/SPARK-15493

## How was this patch tested?

Added a test that verifies the output is quoted correctly.

Author: Jurriaan Pruis <email@jurriaanpruis.nl>

Closes #13267 from jurriaan/quote-escaping.

(cherry picked from commit c875d81)
Signed-off-by: Reynold Xin <rxin@databricks.com>
What changes were proposed in this pull request?
Default QuoteEscapingEnabled flag to true when writing CSV and add an escapeQuotes option to be able to change this.
See https://github.com/uniVocity/univocity-parsers/blob/f3eb2af26374940e60d91d1703bde54619f50c51/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java#L231-L247
This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2)
https://issues.apache.org/jira/browse/SPARK-15493
How was this patch tested?
Added a test that verifies the output is quoted correctly.