[SPARK-13766][SQL] Consistent file extensions for files written by internal data sources #11604
…SV, TEXT and JSON data sources
Test build #52744 has finished for PR 11604 at commit
+1 on adding tests.
Sorry, it does have tests :)
This seems reasonable to me.
BTW can you also explain in code why, for Parquet, we are putting the compression codec before .parquet? I think the main reason is that tools such as gunzip wouldn't be able to decompress ".parquet.gz", because gz is applied not to the file as a whole but to the Parquet blocks inside it.
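To illustrate the point about gunzip, here is a minimal sketch. It uses a made-up toy container (the `TOY1` magic and the length-prefixed layout are assumptions for illustration, not the real Parquet format) to show why a file whose blocks are compressed individually is not itself a gzip stream, while a file compressed as a whole is.

```python
import gzip
import zlib
from typing import Iterable


def whole_file_gzip(payload: bytes) -> bytes:
    """Compress the entire payload, as done for text-like outputs."""
    return gzip.compress(payload)


def toy_block_container(blocks: Iterable[bytes]) -> bytes:
    """Build a hypothetical container: a 4-byte magic, then each block
    gzip-compressed on its own and prefixed with its 4-byte length.
    This only mimics the idea of block-level compression."""
    out = bytearray(b"TOY1")
    for block in blocks:
        compressed = gzip.compress(block)
        out += len(compressed).to_bytes(4, "big")
        out += compressed
    return bytes(out)


def is_gunzippable(data: bytes) -> bool:
    """Return True if the bytes form a gzip stream that decompresses cleanly."""
    try:
        gzip.decompress(data)
        return True
    except (OSError, EOFError, zlib.error):
        return False
```

Under these assumptions, `is_gunzippable(whole_file_gzip(b"a,b\n1,2\n"))` holds, while the toy container fails the same check even though every block inside it is gzip-compressed. A `.gz` suffix at the end of such a container's name would therefore be misleading.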
@rxin Sure. Thanks!
Test build #52793 has finished for PR 11604 at commit
Thanks - merging this in master.
…ternal data sources

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13766

This PR makes the file extensions (written by internal datasource) consistent.

**Before**

- TEXT, CSV and JSON

  ```
  [.COMPRESSION_CODEC_NAME]
  ```

- Parquet

  ```
  [.COMPRESSION_CODEC_NAME].parquet
  ```

- ORC

  ```
  .orc
  ```

**After**

- TEXT, CSV and JSON

  ```
  .txt[.COMPRESSION_CODEC_NAME]
  .csv[.COMPRESSION_CODEC_NAME]
  .json[.COMPRESSION_CODEC_NAME]
  ```

- Parquet

  ```
  [.COMPRESSION_CODEC_NAME].parquet
  ```

- ORC

  ```
  [.COMPRESSION_CODEC_NAME].orc
  ```

When the compression codec is set,

- For Parquet and ORC, the file still stays in Parquet or ORC format and just has compressed data internally. So, I think it is okay to put `.parquet` and `.orc` at the end.
- For Text, CSV and JSON, the file does not stay in its own format; its data format differs according to the compression codec. So, each has the name `.json`, `.csv` or `.txt` before the compression extension.

## How was this patch tested?

Unit tests were used, along with `./dev/run_tests` for coding style tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#11604 from HyukjinKwon/SPARK-13766.
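The naming rule described above can be sketched as a small helper. This is only an illustration of the "After" scheme, not Spark's implementation: the function name `output_extension`, the two lookup tables, and the `.gz` codec suffix used in the usage note are all assumptions made for the example.

```python
from typing import Optional

# Formats that stay valid in their own format when compressed,
# because compression is applied internally: codec goes first.
INTERNALLY_COMPRESSED = {"parquet": ".parquet", "orc": ".orc"}

# Formats where the whole file is compressed, so the result is no
# longer plain text/CSV/JSON: the format name goes first.
WHOLE_FILE_COMPRESSED = {"text": ".txt", "csv": ".csv", "json": ".json"}


def output_extension(fmt: str, codec_ext: Optional[str] = None) -> str:
    """Return the file extension for a format and an optional codec
    suffix (hypothetical helper; codec_ext is e.g. ".gz")."""
    codec = codec_ext or ""
    if fmt in INTERNALLY_COMPRESSED:
        return codec + INTERNALLY_COMPRESSED[fmt]
    if fmt in WHOLE_FILE_COMPRESSED:
        return WHOLE_FILE_COMPRESSED[fmt] + codec
    raise ValueError(f"unknown format: {fmt}")
```

For example, with an assumed `.gz` codec suffix, `output_extension("json", ".gz")` gives `.json.gz` and `output_extension("parquet", ".gz")` gives `.gz.parquet`; with no codec set, the result is just the format's own extension.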