New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-25513][SQL] Read zipped CSV and JSON #22528
Conversation
jenkins, retest this, please |
Test build #96477 has finished for PR 22528 at commit
|
val zip = new ZipArchiveInputStream(inputStream) | ||
if (zip.getNextEntry != null) Some(zip) else None | ||
} else None | ||
}.getOrElse(inputStream) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaxGekk, I got that we can support zipped one but isn't this difficult to extend this support to non multiline modes (and multiple files) as well? Basically deflate is the same codec and I wonder if we really should allow this zip one specifically in multiline mode for CSV / JSON specifically with a clear restriction (single file). Please correct me if I misunderstood.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... but isn't this difficult to extend this support to non multiline modes ...
In per-line mode, need to teach Hadoop's Line Reader how to read zip archives. It seems the feasible way is via Decompressor
interface. The problem is the interface is designed to read compression data in stream fashion. That contradicts to structure of ZIP format where the directory structure is located at the end of file.
and multiple files) as well
In general, it is possible to support per-line mode and multiple files, if LineRecordReader
will be extended in the same way as I do in this PR (not via Decompressor
API) but LineRecordReader
, SplitLineReader
and maybe something else should to be move out of Hadoop to Spark. From my point of view it could be done but not in this simple PR.
Basically deflate is the same codec and I wonder if we really should allow this zip one specifically in multiline mode for CSV / JSON specifically with a clear restriction (single file).
Significant different between zip
and deflate
is the first one is some kind of container of independently compressed entries where directory info is located at the end of files. In Spark's datasources, there is a restriction for now - one file maps to one InputStream
:
- in per-line mode:
in = new SplitLineReader(codec.createInputStream(fileIn, decompressor), job, this.recordDelimiterBytes); filePosition = fileIn;
. If you return oneInputStream
for all files in a zip archive, it can be difficult to differentiate last line in one file and the first line in another file. - in multi-line mode, the
readFile
methods should be modified to handle multipleInputStream
perfile: PartitionedFile
(to eliminate restriction of one file per zip archive).
For sure, all of those issues can be solved but
- Multiple files per zip archive is significantly rare case that one zipped JSON/CSV in the wild. Maybe you have another statistics.
- Support per-line mode demands principal changes - extension and movement to Spark Hadoop's per-line reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be feasible but still difficult. This is a simple PR but I don't think we should just allow this single case alone.
Multiple files per zip archive is significantly rare case that one zipped JSON/CSV in the wild.
Yea, I doubt if that's rare because it, at least, wasn't in my (humble and maybe biased) experience in the previous company.
How about we just propose the changes you are thinking currently? looks you have a clear idea on this.
@gatorsmile Please, take a look at the PR. |
Generally, this is useful for reading the zip archives containing a single CSV. The extra support JSON file is not very useful. The major limitation is we have to read the zipped CSV files using the multi line mode. I am not sure whether we can automatically switch to multi line mode. If so, is the result expected? Is the results different if we read it using the regular mode? If different, can we issue an exception with a better error message and let the users manually enable the multi line mode? |
The limitation is quite clear - multi line mode only and single file. We don't have to rush about this - looks @MaxGekk has a clear idea on this. Another concern here is, we have another place to control the compression codec (where we usually delegate to HDFS libraries). I don't think we can extend the current way to implement #22528 (comment) idea. It just sounds like a bandaid fix to allow one zipped file case in multi line mode. |
I was considering using Compressor API but its streaming nature controverses to structure of zip archive where meta-info is located at the end of files, and you cannot read/uncompress it sequentially block-by-block.
I believe it is better to return correct result in a case when wrong result is returned for now (try to read zipped CSV), or to force users to use this workaround only to read zip archives via RDD API: https://docs.databricks.com/spark/latest/data-sources/zip-files.html#zip-files . Especially in the case of compressed not splittable CSV, there is not big difference how to read it in multiLine enabled or disabled. |
Since there is no consensus, I am going to close the PR. |
@MaxGekk, please feel free to proceed based upon your idea. From a cursory look, the idea makes sense to me. |
We should stop returning a wrong result. Please fix it. Thanks! |
What changes were proposed in this pull request?
In the PR, I propose to support reading of zip archives containing one CSV or JSON file in the multi-line mode.
How was this patch tested?
Added tests for CSV and JSON where zip archives are created by Java library.