[SPARK-13766][SQL] Consistent file extensions for files written by internal data sources #11604

HyukjinKwon · 2016-03-09T09:41:28Z

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13766
This PR makes the file extensions of files written by the internal data sources consistent.

Before

TEXT, CSV and JSON

[.COMPRESSION_CODEC_NAME]

Parquet

[.COMPRESSION_CODEC_NAME].parquet

ORC

.orc

After

TEXT, CSV and JSON

.txt[.COMPRESSION_CODEC_NAME]
.csv[.COMPRESSION_CODEC_NAME]
.json[.COMPRESSION_CODEC_NAME]

Parquet

[.COMPRESSION_CODEC_NAME].parquet

ORC

[.COMPRESSION_CODEC_NAME].orc

When the compression codec is set,

For Parquet and ORC, the file stays in Parquet or ORC format and only the data inside it is compressed. So, I think it is okay to keep .parquet and .orc at the end of the name.
For Text, CSV and JSON, the file does not stay in its original format; the whole file takes a different data format according to the compression codec. So, each name has .txt, .csv or .json before the compression extension.
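The naming rules under "After" can be sketched as a small function. This is only an illustration, not the code in this patch: `output_extension`, `CONTAINER_FORMATS`, and `TEXT_EXTENSIONS` are hypothetical names, and the codec string is used verbatim as a stand-in for `COMPRESSION_CODEC_NAME`.

```python
from typing import Optional

# Hypothetical illustration of the "After" naming rules; not Spark's actual code.

# Formats that stay in their own format and compress data internally.
CONTAINER_FORMATS = {"parquet", "orc"}

# Formats whose whole file is encoded by the compression codec.
TEXT_EXTENSIONS = {"text": ".txt", "csv": ".csv", "json": ".json"}


def output_extension(fmt: str, codec: Optional[str] = None) -> str:
    """Return the file extension for a format and an optional codec name."""
    codec_part = f".{codec}" if codec else ""
    if fmt in CONTAINER_FORMATS:
        # Codec goes first, format extension stays at the end.
        return f"{codec_part}.{fmt}"
    if fmt in TEXT_EXTENSIONS:
        # Format extension goes first, codec extension at the end.
        return f"{TEXT_EXTENSIONS[fmt]}{codec_part}"
    raise ValueError(f"unknown format: {fmt}")


if __name__ == "__main__":
    for fmt in ("text", "csv", "json", "parquet", "orc"):
        print(fmt, output_extension(fmt), output_extension(fmt, "gz"))
```

With the codec last for the text-like sources, the final extension tells a decompression tool what to do with the file; with the format last for Parquet and ORC, the final extension tells a reader which format the file is in.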

How was this patch tested?

Tested with unit tests, plus ./dev/run_tests for coding style checks.


HyukjinKwon · 2016-03-09T09:42:30Z

cc @rxin @srowen

SparkQA · 2016-03-09T11:30:48Z

Test build #52744 has finished for PR 11604 at commit 0e9b003.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-03-09T19:08:26Z

+1 on adding tests.

marmbrus · 2016-03-09T19:08:59Z

Sorry, it does have tests :)

marmbrus · 2016-03-09T19:09:22Z

This seems reasonable to me.

rxin · 2016-03-09T19:09:49Z

BTW, can you also explain in the code why, for Parquet, we put the compression codec before .parquet? I think the main reason is that tools such as gunzip wouldn't be able to decompress a ".parquet.gz" file, because gzip is applied to the Parquet blocks inside the file rather than to the file as a whole.
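The point about gunzip can be illustrated with a toy example. This is a sketch under stated assumptions: `toy_container` is a made-up layout (plain header, one gzipped block, plain footer) and is not the real Parquet format, and `is_gzip_stream` is a hypothetical helper.

```python
import gzip

record = b'{"a": 1}\n'

# Text-like sources: the codec wraps the entire file, so a gunzip-style
# tool can decompress it directly.
whole_file = gzip.compress(record)

# Toy container (NOT the real Parquet layout): only an inner block is
# gzipped, while the surrounding header and footer are plain bytes.
toy_container = b"HEAD" + gzip.compress(record) + b"TAIL"


def is_gzip_stream(data: bytes) -> bool:
    """Return True if the whole byte string decompresses as gzip."""
    try:
        gzip.decompress(data)
        return True
    except OSError:  # gzip.BadGzipFile is a subclass of OSError
        return False


if __name__ == "__main__":
    print("whole file:", is_gzip_stream(whole_file))
    print("toy container:", is_gzip_stream(toy_container))
```

A ".gz" suffix at the very end would suggest the second kind of file can be handled like the first, which is why the codec name is placed before the format extension.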

HyukjinKwon · 2016-03-09T22:26:39Z

@rxin Sure. Thanks!

SparkQA · 2016-03-10T02:42:14Z

Test build #52793 has finished for PR 11604 at commit 7568744.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-03-10T03:12:36Z

Thanks - merging this in master.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#11604 from HyukjinKwon/SPARK-13766.

Inconsistent file extensions and omitted file extensions written by CSV, TEXT and JSON data sources

0e9b003

Update comments for ORC and Parquet extensions

7568744

asfgit closed this in aa0eba2 Mar 10, 2016

HyukjinKwon deleted the SPARK-13766 branch October 1, 2016 06:42
