Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-25513][SQL] Read zipped CSV and JSON #22528

Closed
wants to merge 7 commits into from

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Sep 22, 2018

What changes were proposed in this pull request?

In the PR, I propose to support reading of zip archives containing one CSV or JSON file in the multi-line mode.

How was this patch tested?

Added tests for CSV and JSON where zip archives are created by Java library.

@MaxGekk
Copy link
Member Author

MaxGekk commented Sep 22, 2018

jenkins, retest this, please

@SparkQA
Copy link

SparkQA commented Sep 23, 2018

Test build #96477 has finished for PR 22528 at commit ec8ba0d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val zip = new ZipArchiveInputStream(inputStream)
if (zip.getNextEntry != null) Some(zip) else None
} else None
}.getOrElse(inputStream)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk, I got that we can support zipped one but isn't this difficult to extend this support to non multiline modes (and multiple files) as well? Basically deflate is the same codec and I wonder if we really should allow this zip one specifically in multiline mode for CSV / JSON specifically with a clear restriction (single file). Please correct me if I misunderstood.

Copy link
Member Author

@MaxGekk MaxGekk Sep 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... but isn't this difficult to extend this support to non multiline modes ...

In per-line mode, need to teach Hadoop's Line Reader how to read zip archives. It seems the feasible way is via Decompressor interface. The problem is the interface is designed to read compression data in stream fashion. That contradicts to structure of ZIP format where the directory structure is located at the end of file.

and multiple files) as well

In general, it is possible to support per-line mode and multiple files, if LineRecordReader will be extended in the same way as I do in this PR (not via Decompressor API) but LineRecordReader, SplitLineReader and maybe something else should to be move out of Hadoop to Spark. From my point of view it could be done but not in this simple PR.

Basically deflate is the same codec and I wonder if we really should allow this zip one specifically in multiline mode for CSV / JSON specifically with a clear restriction (single file).

Significant different between zip and deflate is the first one is some kind of container of independently compressed entries where directory info is located at the end of files. In Spark's datasources, there is a restriction for now - one file maps to one InputStream:

  • in per-line mode: in = new SplitLineReader(codec.createInputStream(fileIn, decompressor), job, this.recordDelimiterBytes); filePosition = fileIn; . If you return one InputStream for all files in a zip archive, it can be difficult to differentiate last line in one file and the first line in another file.
  • in multi-line mode, the readFile methods should be modified to handle multiple InputStream per file: PartitionedFile (to eliminate restriction of one file per zip archive).

For sure, all of those issues can be solved but

  • Multiple files per zip archive is significantly rare case that one zipped JSON/CSV in the wild. Maybe you have another statistics.
  • Support per-line mode demands principal changes - extension and movement to Spark Hadoop's per-line reader.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be feasible but still difficult. This is a simple PR but I don't think we should just allow this single case alone.

Multiple files per zip archive is significantly rare case that one zipped JSON/CSV in the wild.

Yea, I doubt if that's rare because it, at least, wasn't in my (humble and maybe biased) experience in the previous company.

How about we just propose the changes you are thinking currently? looks you have a clear idea on this.

@MaxGekk
Copy link
Member Author

MaxGekk commented Sep 25, 2018

@gatorsmile Please, take a look at the PR.

@gatorsmile
Copy link
Member

Generally, this is useful for reading the zip archives containing a single CSV. The extra support JSON file is not very useful.

The major limitation is we have to read the zipped CSV files using the multi line mode. I am not sure whether we can automatically switch to multi line mode. If so, is the result expected? Is the results different if we read it using the regular mode? If different, can we issue an exception with a better error message and let the users manually enable the multi line mode?

@HyukjinKwon
Copy link
Member

The limitation is quite clear - multi line mode only and single file. We don't have to rush about this - looks @MaxGekk has a clear idea on this. Another concern here is, we have another place to control the compression codec (where we usually delegate to HDFS libraries).

I don't think we can extend the current way to implement #22528 (comment) idea. It just sounds like a bandaid fix to allow one zipped file case in multi line mode.

@MaxGekk
Copy link
Member Author

MaxGekk commented Sep 27, 2018

Another concern here is, we have another place to control the compression codec (where we usually delegate to HDFS libraries).

I was considering using Compressor API but its streaming nature controverses to structure of zip archive where meta-info is located at the end of files, and you cannot read/uncompress it sequentially block-by-block.

It just sounds like a bandaid fix to allow one zipped file case in multi line mode.

I believe it is better to return correct result in a case when wrong result is returned for now (try to read zipped CSV), or to force users to use this workaround only to read zip archives via RDD API: https://docs.databricks.com/spark/latest/data-sources/zip-files.html#zip-files . Especially in the case of compressed not splittable CSV, there is not big difference how to read it in multiLine enabled or disabled.

@MaxGekk
Copy link
Member Author

MaxGekk commented Sep 28, 2018

Since there is no consensus, I am going to close the PR.

@MaxGekk MaxGekk closed this Sep 29, 2018
@HyukjinKwon
Copy link
Member

@MaxGekk, please feel free to proceed based upon your idea. From a cursory look, the idea makes sense to me.

@gatorsmile
Copy link
Member

We should stop returning a wrong result. Please fix it. Thanks!

@MaxGekk MaxGekk deleted the read-zipped-csv-json branch August 17, 2019 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants