
[SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV #13267

Closed
jurriaan wants to merge 7 commits into apache:master from jurriaan:quote-escaping

Conversation

@jurriaan
Contributor

@jurriaan jurriaan commented May 23, 2016

What changes were proposed in this pull request?

Default QuoteEscapingEnabled flag to true when writing CSV and add an escapeQuotes option to be able to change this.

See https://github.com/uniVocity/univocity-parsers/blob/f3eb2af26374940e60d91d1703bde54619f50c51/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java#L231-L247

This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2)

https://issues.apache.org/jira/browse/SPARK-15493

How was this patch tested?

Added a test that verifies the output is quoted correctly.

@jurriaan
Contributor Author

cc @rxin @HyukjinKwon


val df = spark.createDataFrame(Seq(("test \"quote\"", 123,
"it \"works\"!", "\"very\" well")))
.toDF("a", "b", "c", "d")
Member


The indentation here looks off, maybe. Other code in the codebase follows a style such as:

spark.createDataFrame(Seq(
  ("test \"quote\"", 123, "it \"works\"!", "\"very\" well")
)).toDF("a", "b", "c", "d")

or:

val data = Seq(("test \"quote\"", 123, "it \"works\"!", "\"very\" well"))
spark.createDataFrame(data).toDF("a", "b", "c", "d")

and so on.

@HyukjinKwon
Member

HyukjinKwon commented May 24, 2016

@jurriaan Just to double check: does it not escape quotes if quote and/or escape is not set?
I think this would be better documented.

@jurriaan
Contributor Author

jurriaan commented May 24, 2016

@HyukjinKwon If you don't supply those options, they are set to the defaults. For how setQuoteEscapingEnabled works, see uniVocity/univocity-parsers#38. In the test I supplied them to show a possible use case.

@jurriaan
Contributor Author

@HyukjinKwon Addressed your comments and improved the documentation a bit.

@SparkQA

SparkQA commented May 24, 2016

Test build #3011 has finished for PR 13267 at commit caf8808.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented May 24, 2016

Can we explain using an example what this does when it is off?

@jurriaan
Contributor Author

@rxin
An example using the following dataframe:

spark.createDataFrame([['test "quote"', 123, 'it "works"!', '"very" well']])

With the default CSV behaviour, the data is saved like this when you specify " as both the quote and the escape character:

test "quote",123,it "works"!,"""very"" well"

With quoteEscapingEnabled set to true the output looks like this:

"test ""quote""",123,"it ""works""!","""very"" well"

As you can see, by default a value is wrapped in quotes only if it starts with a quote character. When quoteEscapingEnabled is turned on, every value containing a quote character is wrapped in quotes. Some CSV dialects require this.
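
The two modes described above can be sketched with a toy model. This is a minimal illustration in plain Python, not univocity's or Spark's implementation: `format_field` and `format_row` are hypothetical helper names, and the "off" rule simply follows the description in this comment (quote only when the value starts with a quote or contains a delimiter or line break). The last lines compare the "on" output with what Python's standard `csv` writer produces by default.

```python
import csv
import io


def format_field(value, escape_quotes):
    """Toy model of the two writer modes described in this thread.

    Assumption: with escape_quotes off, a value is wrapped in quotes only
    when it starts with a quote (or contains a delimiter or line break).
    With escape_quotes on, any value containing a quote is wrapped.
    """
    text = str(value)
    needs_quotes = (
        text.startswith('"')
        or any(ch in text for ch in ",\r\n")
        or (escape_quotes and '"' in text)
    )
    if needs_quotes:
        # Inside a quoted field, each quote is doubled.
        return '"' + text.replace('"', '""') + '"'
    return text


def format_row(values, escape_quotes):
    """Join the formatted fields with commas."""
    return ",".join(format_field(v, escape_quotes) for v in values)


row = ['test "quote"', 123, 'it "works"!', '"very" well']

print(format_row(row, escape_quotes=False))
print(format_row(row, escape_quotes=True))

# For comparison: Python's csv writer with its default dialect.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
stdlib_line = buffer.getvalue().rstrip("\r\n")
print(stdlib_line == format_row(row, escape_quotes=True))
```

The first printed line corresponds to the "off" output quoted above and the second to the "on" output; the standard library writer agrees with the "on" form for this row.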

@rxin
Contributor

rxin commented May 24, 2016

Thanks - a follow-up question: should this flag ever be false? I.e., should we drop the flag and just have it always on? Alternatively, have it on by default?

@jurriaan
Contributor Author

jurriaan commented May 24, 2016

@rxin Good question, I'm not sure what's the best approach here. It looks like setting the flag to true by default could be a good choice.

The comment at src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java mentions the default behaviour (quoteEscapingEnabled set to false) is not valid CSV according to RFC 4180.

To quote RFC 4180:

If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.

I'm not sure why they (the univocity csv parser developers) chose to turn it off by default. Maybe performance reasons?
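
The RFC 4180 sentence quoted above can be turned into a small check. The sketch below is only an illustration under stated assumptions: `is_valid_rfc4180_field` is a hypothetical helper that inspects one raw, still-encoded field, and the field lists are the two output lines from earlier in this thread, split by hand.

```python
def is_valid_rfc4180_field(raw):
    """Check one raw CSV field against the RFC 4180 quoting rule.

    A field enclosed in double quotes may contain quotes only as doubled
    pairs. A field that is not enclosed may not contain a double quote,
    a comma, or a line break.
    """
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        inner = raw[1:-1]
        # After removing the doubled pairs, no lone quote may remain.
        return '"' not in inner.replace('""', "")
    return not any(ch in raw for ch in '",\r\n')


# Raw fields of the line written with quote escaping off.
unescaped_fields = ['test "quote"', "123", 'it "works"!', '"""very"" well"']
# Raw fields of the line written with quote escaping on.
escaped_fields = ['"test ""quote"""', "123", '"it ""works""!"', '"""very"" well"']

print([is_valid_rfc4180_field(f) for f in unescaped_fields])
print([is_valid_rfc4180_field(f) for f in escaped_fields])
```

Under this check, the first and third fields of the unescaped line fail because they contain quotes without being enclosed, while every field of the escaped line passes.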

@rxin
Contributor

rxin commented May 24, 2016

@jbax can we get a 2nd opinion here about quoteEscapingEnabled?

@jbax

jbax commented May 24, 2016

It's disabled by default because earlier versions were slower when writing
CSV and it helped a little bit. Also because parsing unquoted values is
faster.

With version 2.1.0 the new algorithm made the writing performance improve a
lot, and having quoteEscaping enabled now makes writing faster. I found
this out after testing version 2.1.1 (a maintenance release) so I didn't
change the default behavior.

Versions 2.2.x and up will have this enabled by default.

@rxin
Contributor

rxin commented May 25, 2016

Thanks, @jbax. Given this I think we should just have it on by default.

Some follow-up questions:

  1. When will 2.2.x come out?
  2. We should probably upgrade to 2.1.1 right?
  3. Would it be OK if we always have this on (i.e. not having this option exposed to users at all)?

@jbax

jbax commented May 25, 2016

@rxin In your case I think it's better to have this turned on by default. Regarding your other questions:

1 - There's no timeline. 2.2.x will come out when new features are requested by our users and implemented. Currently there's nothing in the pipeline so we'll be on 2.1.x adding fixes and minor internal improvements over time. We have no open bugs either.

2 - Yes. It fixes a couple of bugs you guys probably won't come across, but it also improves the performance of the parser with whitespace trimming enabled (it's enabled by default, by the way).

3 - It's OK and I don't see why it would be a problem, other than having some client with a very uncommon use case (they are out there, that's why the library has a lot of configuration options).

@falaki
Contributor

falaki commented May 25, 2016

@rxin and @jurriaan I agree with keeping it on by default. However, I think it is better to leave it configurable. Twice before, I assumed a reasonable default value was good enough, but ended up exposing the settings as options.

Also, I suggest a simpler name like escapeQuotes or enableQuoteEscaping.

@rxin
Contributor

rxin commented May 25, 2016

Yea I agree with escapeQuotes.

@rxin
Contributor

rxin commented May 25, 2016

@jurriaan want to do the change?

@jurriaan
Contributor Author

@rxin Done :)

quoted value. If None is set, it uses the default value, ``\``
:param escapeQuotes: A flag indicating whether values containing quotes should always
be enclosed in quotes. Default is to escape all values containing
a quote character. ``true``
Contributor


what's the true at the end here?

Contributor Author


Oops :)

@rxin
Contributor

rxin commented May 25, 2016

BTW don't forget to update the title too.

@SparkQA

SparkQA commented May 25, 2016

Test build #3018 has finished for PR 13267 at commit 8c4bef1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jurriaan jurriaan changed the title from "[SPARK-15493][SQL] Allow setting the quoteEscapingEnabled flag when writing CSV" to "[SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV" May 25, 2016
@rxin
Contributor

rxin commented May 25, 2016

Thanks - merging in master/2.0.

@asfgit asfgit closed this in c875d81 May 25, 2016
asfgit pushed a commit that referenced this pull request May 25, 2016
…ting CSV

## What changes were proposed in this pull request?

Default QuoteEscapingEnabled flag to true when writing CSV and add an escapeQuotes option to be able to change this.

See https://github.com/uniVocity/univocity-parsers/blob/f3eb2af26374940e60d91d1703bde54619f50c51/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java#L231-L247

This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2)

https://issues.apache.org/jira/browse/SPARK-15493

## How was this patch tested?

Added a test that verifies the output is quoted correctly.

Author: Jurriaan Pruis <email@jurriaanpruis.nl>

Closes #13267 from jurriaan/quote-escaping.

(cherry picked from commit c875d81)
Signed-off-by: Reynold Xin <rxin@databricks.com>