[SPARK-13766][SQL] Consistent file extensions for files written by internal data sources #11604

Closed
HyukjinKwon wants to merge 2 commits


HyukjinKwon
Member

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13766
This PR makes the extensions of files written by the internal data sources consistent.

Before

  • TEXT, CSV and JSON
[.COMPRESSION_CODEC_NAME]
  • Parquet
[.COMPRESSION_CODEC_NAME].parquet
  • ORC
.orc

After

  • TEXT, CSV and JSON
.txt[.COMPRESSION_CODEC_NAME]
.csv[.COMPRESSION_CODEC_NAME]
.json[.COMPRESSION_CODEC_NAME]
  • Parquet
[.COMPRESSION_CODEC_NAME].parquet
  • ORC
[.COMPRESSION_CODEC_NAME].orc

When the compression codec is set,

  • For Parquet and ORC, the output is still a valid Parquet or ORC file; only the data inside it is compressed. So I think it is fine to keep .parquet and .orc at the end of the name.
  • For Text, CSV and JSON, the output is no longer in its original format; the whole file takes on the format of the compression codec. So the names .txt, .csv and .json come before the compression extension.
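
The naming rule above can be sketched as a small function. This is only an illustration of the "After" scheme, not Spark's actual code: `output_extension`, the two lookup tables, and the codec suffixes such as `.gz` are hypothetical names chosen for the example.

```python
# Illustrative sketch of the "After" naming scheme described in this PR.
# All names here are hypothetical; this is not Spark's implementation.

# Formats whose files are compressed as a whole: the codec suffix goes last.
WHOLE_FILE_FORMATS = {"text": ".txt", "csv": ".csv", "json": ".json"}

# Formats that compress data internally: the format suffix stays last.
CONTAINER_FORMATS = {"parquet": ".parquet", "orc": ".orc"}


def output_extension(fmt: str, codec_ext: str = "") -> str:
    """Return the file extension for a format and an optional codec suffix.

    codec_ext is assumed to include its leading dot (for example ".gz"),
    or to be empty when no compression codec is set.
    """
    key = fmt.lower()
    if key in WHOLE_FILE_FORMATS:
        return WHOLE_FILE_FORMATS[key] + codec_ext
    if key in CONTAINER_FORMATS:
        return codec_ext + CONTAINER_FORMATS[key]
    raise ValueError(f"unknown format: {fmt!r}")


for fmt in ("text", "csv", "json", "parquet", "orc"):
    print(fmt, output_extension(fmt, ".gz"))
```

With a `.gz` suffix this yields `.txt.gz`, `.csv.gz` and `.json.gz` for the text-based formats, and `.gz.parquet` and `.gz.orc` for the container formats; with no codec it yields the bare format extension.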

How was this patch tested?

Tested with unit tests, and with ./dev/run_tests for coding style checks.

@HyukjinKwon
Member Author

cc @rxin @srowen

@SparkQA

SparkQA commented Mar 9, 2016

Test build #52744 has finished for PR 11604 at commit 0e9b003.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Mar 9, 2016

+1 on adding tests.

@marmbrus
Contributor

marmbrus commented Mar 9, 2016

Sorry, it does have tests :)

@marmbrus
Contributor

marmbrus commented Mar 9, 2016

This seems reasonable to me.

@rxin
Contributor

rxin commented Mar 9, 2016

BTW, can you also explain in a code comment why, for Parquet, we put the compression codec before .parquet? I think the main reason is that tools such as gunzip wouldn't be able to decompress ".parquet.gz", because gzip is applied not to the whole file but to the Parquet blocks inside it.
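
The point about gunzip can be illustrated with a toy model. The sketch below assumes a made-up container layout (a `PAR1` marker, one gzip-compressed block, then `PAR1` again); it is not the real Parquet layout, and `can_gunzip` is a hypothetical helper. It only shows that a file compressed as a whole can be decompressed directly, while a container holding compressed blocks cannot.

```python
import gzip
import zlib

rows = b'{"id": 1}\n{"id": 2}\n'

# Case 1: the whole file is gzip-compressed, as with .json.gz output.
whole_file = gzip.compress(rows)

# Case 2: a toy container whose inner block is gzip-compressed.
# The layout is invented for this example and is NOT the real Parquet format.
MAGIC = b"PAR1"
toy_container = MAGIC + gzip.compress(rows) + MAGIC


def can_gunzip(data: bytes) -> bool:
    """Return True if the bytes decompress as a single gzip stream."""
    try:
        gzip.decompress(data)
        return True
    except (OSError, EOFError, zlib.error):
        return False


print("whole file:", can_gunzip(whole_file))
print("toy container:", can_gunzip(toy_container))

# The data is still recoverable, but only by a reader that understands
# the container layout and decompresses the inner block itself.
inner_block = toy_container[len(MAGIC):-len(MAGIC)]
print(gzip.decompress(inner_block) == rows)
```

A name ending in `.gz` would therefore be misleading for the container case, which is one way to read the argument for keeping `.parquet` last.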

@HyukjinKwon
Member Author

@rxin Sure. Thanks!

@SparkQA

SparkQA commented Mar 10, 2016

Test build #52793 has finished for PR 11604 at commit 7568744.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Mar 10, 2016

Thanks - merging this in master.

@asfgit asfgit closed this in aa0eba2 Mar 10, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
…ternal data sources

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#11604 from HyukjinKwon/SPARK-13766.
@HyukjinKwon HyukjinKwon deleted the SPARK-13766 branch October 1, 2016 06:42