[SPARK-25513][SQL] Read zipped CSV and JSON #22528

MaxGekk · 2018-09-22T20:18:30Z

What changes were proposed in this pull request?

In the PR, I propose to support reading of zip archives containing one CSV or JSON file in the multi-line mode.

How was this patch tested?

Added tests for CSV and JSON where zip archives are created by Java library.

MaxGekk · 2018-09-22T20:25:59Z

jenkins, retest this, please

SparkQA · 2018-09-23T00:17:41Z

Test build #96477 has finished for PR 22528 at commit ec8ba0d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-09-23T01:59:03Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala

+          val zip = new ZipArchiveInputStream(inputStream)
+          if (zip.getNextEntry != null) Some(zip) else None
+        } else None
+      }.getOrElse(inputStream)


@MaxGekk, I got that we can support zipped one but isn't this difficult to extend this support to non multiline modes (and multiple files) as well? Basically deflate is the same codec and I wonder if we really should allow this zip one specifically in multiline mode for CSV / JSON specifically with a clear restriction (single file). Please correct me if I misunderstood.

... but isn't this difficult to extend this support to non multiline modes ...

In per-line mode, need to teach Hadoop's Line Reader how to read zip archives. It seems the feasible way is via Decompressor interface. The problem is the interface is designed to read compression data in stream fashion. That contradicts to structure of ZIP format where the directory structure is located at the end of file.

and multiple files) as well

In general, it is possible to support per-line mode and multiple files, if LineRecordReader will be extended in the same way as I do in this PR (not via Decompressor API) but LineRecordReader, SplitLineReader and maybe something else should to be move out of Hadoop to Spark. From my point of view it could be done but not in this simple PR.

Basically deflate is the same codec and I wonder if we really should allow this zip one specifically in multiline mode for CSV / JSON specifically with a clear restriction (single file).

Significant different between zip and deflate is the first one is some kind of container of independently compressed entries where directory info is located at the end of files. In Spark's datasources, there is a restriction for now - one file maps to one InputStream:

in per-line mode: in = new SplitLineReader(codec.createInputStream(fileIn, decompressor), job, this.recordDelimiterBytes); filePosition = fileIn; . If you return one InputStream for all files in a zip archive, it can be difficult to differentiate last line in one file and the first line in another file.

in multi-line mode, the readFile methods should be modified to handle multiple InputStream per file: PartitionedFile (to eliminate restriction of one file per zip archive).

For sure, all of those issues can be solved but

Multiple files per zip archive is significantly rare case that one zipped JSON/CSV in the wild. Maybe you have another statistics.

Support per-line mode demands principal changes - extension and movement to Spark Hadoop's per-line reader.

It might be feasible but still difficult. This is a simple PR but I don't think we should just allow this single case alone.

Multiple files per zip archive is significantly rare case that one zipped JSON/CSV in the wild.

Yea, I doubt if that's rare because it, at least, wasn't in my (humble and maybe biased) experience in the previous company.

How about we just propose the changes you are thinking currently? looks you have a clear idea on this.

MaxGekk · 2018-09-25T18:54:54Z

@gatorsmile Please, take a look at the PR.

gatorsmile · 2018-09-26T23:26:57Z

Generally, this is useful for reading the zip archives containing a single CSV. The extra support JSON file is not very useful.

The major limitation is we have to read the zipped CSV files using the multi line mode. I am not sure whether we can automatically switch to multi line mode. If so, is the result expected? Is the results different if we read it using the regular mode? If different, can we issue an exception with a better error message and let the users manually enable the multi line mode?

HyukjinKwon · 2018-09-27T02:58:01Z

The limitation is quite clear - multi line mode only and single file. We don't have to rush about this - looks @MaxGekk has a clear idea on this. Another concern here is, we have another place to control the compression codec (where we usually delegate to HDFS libraries).

I don't think we can extend the current way to implement #22528 (comment) idea. It just sounds like a bandaid fix to allow one zipped file case in multi line mode.

MaxGekk · 2018-09-27T13:05:54Z

Another concern here is, we have another place to control the compression codec (where we usually delegate to HDFS libraries).

I was considering using Compressor API but its streaming nature controverses to structure of zip archive where meta-info is located at the end of files, and you cannot read/uncompress it sequentially block-by-block.

It just sounds like a bandaid fix to allow one zipped file case in multi line mode.

I believe it is better to return correct result in a case when wrong result is returned for now (try to read zipped CSV), or to force users to use this workaround only to read zip archives via RDD API: https://docs.databricks.com/spark/latest/data-sources/zip-files.html#zip-files . Especially in the case of compressed not splittable CSV, there is not big difference how to read it in multiLine enabled or disabled.

MaxGekk · 2018-09-28T10:56:45Z

Since there is no consensus, I am going to close the PR.

HyukjinKwon · 2018-09-29T09:46:17Z

@MaxGekk, please feel free to proceed based upon your idea. From a cursory look, the idea makes sense to me.

gatorsmile · 2018-09-30T01:07:29Z

We should stop returning a wrong result. Please fix it. Thanks!

MaxGekk added 7 commits September 22, 2018 21:49

Support zip archives

a926d27

Add test for zipped CSV files

2971624

Fix imports

149e452

Added a test for zipped JSON

1dff39e

Refactoring of the CSV test

09dff81

Make extension checking case agnostic

5fda51a

Providing the UTF-8 charset

ec8ba0d

HyukjinKwon reviewed Sep 23, 2018

View reviewed changes

MaxGekk closed this Sep 29, 2018

MaxGekk deleted the read-zipped-csv-json branch August 17, 2019 13:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-25513][SQL] Read zipped CSV and JSON #22528

[SPARK-25513][SQL] Read zipped CSV and JSON #22528

MaxGekk commented Sep 22, 2018

MaxGekk commented Sep 22, 2018

SparkQA commented Sep 23, 2018

HyukjinKwon Sep 23, 2018 •

edited

MaxGekk Sep 23, 2018 •

edited

HyukjinKwon Sep 26, 2018

MaxGekk commented Sep 25, 2018

gatorsmile commented Sep 26, 2018

HyukjinKwon commented Sep 27, 2018

MaxGekk commented Sep 27, 2018

MaxGekk commented Sep 28, 2018

HyukjinKwon commented Sep 29, 2018

gatorsmile commented Sep 30, 2018

[SPARK-25513][SQL] Read zipped CSV and JSON #22528

[SPARK-25513][SQL] Read zipped CSV and JSON #22528

Conversation

MaxGekk commented Sep 22, 2018

What changes were proposed in this pull request?

How was this patch tested?

MaxGekk commented Sep 22, 2018

SparkQA commented Sep 23, 2018

HyukjinKwon Sep 23, 2018 • edited

Choose a reason for hiding this comment

MaxGekk Sep 23, 2018 • edited

Choose a reason for hiding this comment

HyukjinKwon Sep 26, 2018

Choose a reason for hiding this comment

MaxGekk commented Sep 25, 2018

gatorsmile commented Sep 26, 2018

HyukjinKwon commented Sep 27, 2018

MaxGekk commented Sep 27, 2018

MaxGekk commented Sep 28, 2018

HyukjinKwon commented Sep 29, 2018

gatorsmile commented Sep 30, 2018

HyukjinKwon Sep 23, 2018 •

edited

MaxGekk Sep 23, 2018 •

edited