ARROW-2575: [Python] Exclude hidden files when reading Parquet dataset#2027

Closed
ukaratay wants to merge 1 commit into apache:master from ukaratay:patch-1

@ukaratay

On Unix systems, hidden files are included in the dataset listing because os.walk does not filter them out. This is especially a problem on macOS, where .DS_Store files are created automatically.
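To illustrate the idea, here is a minimal sketch of filtering dotfiles while walking a directory. It is not the patch in this PR: the helper name `list_visible_files` is hypothetical, and the sketch assumes that "hidden" simply means a name starting with a dot.

```python
import os


def list_visible_files(root):
    """Return paths of non-hidden files under root (hypothetical helper).

    Assumption: a file or directory is "hidden" if its name starts with ".".
    """
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in sorted(filenames):
            # Skip dotfiles such as .DS_Store.
            if not name.startswith("."):
                paths.append(os.path.join(dirpath, name))
    return paths
```

Assigning to `dirnames[:]` rather than rebinding `dirnames` matters here: os.walk only honours in-place changes to that list when deciding which subdirectories to visit in its default top-down mode.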

@ukaratay ukaratay changed the title Silently exclude hidden files [Python] Silently exclude hidden files May 10, 2018
@codecov-io

Codecov Report

Merging #2027 into master will increase coverage by 0.03%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2027      +/-   ##
==========================================
+ Coverage   87.42%   87.45%   +0.03%     
==========================================
  Files         189      178      -11     
  Lines       29289    28516     -773     
==========================================
- Hits        25607    24940     -667     
+ Misses       3682     3576     -106
Impacted Files Coverage Δ
rust/src/array.rs
rust/src/memory_pool.rs
rust/src/list_builder.rs
rust/src/bitmap.rs
rust/src/buffer.rs
rust/src/datatypes.rs
rust/src/memory.rs
rust/src/builder.rs
rust/src/list.rs
rust/src/lib.rs
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bb47c36...866c688.

@pitrou
Member

pitrou commented May 11, 2018

Thanks @ukaratay. Can you:

  • create a JIRA ticket for this and change the PR title accordingly?
  • add a test for this?

@ukaratay ukaratay changed the title [Python] Silently exclude hidden files ARROW-2575: [Python] Silently exclude hidden files May 11, 2018
@ukaratay
Author

@pitrou I have created a JIRA ticket for it. However, there seems to be no test for the ParquetDataset class in the Python code, so adding a test is going to take some time, unless the tests exist and I just wasn't able to find them.

@xhochy
Member

xhochy commented May 13, 2018

@ukaratay A simple unit test for this could be:

  1. Write a Parquet file to a directory, read the directory as a Parquet dataset.
  2. Add a hidden, non-Parquet file to this directory.
  3. Read the directory again as a dataset. Without this PR, this should lead to an error; with the PR, it should produce the same output as 1.

@wesm
Member

wesm commented Jul 19, 2018

Added this to 0.10.0 as it's a nuisance and not too difficult to test. @ukaratay can you write a test? Otherwise someone else may be able to get to it before 0.10 goes out.

@pitrou pitrou changed the title ARROW-2575: [Python] Silently exclude hidden files ARROW-2575: [Python] Exclude hidden files when reading Parquet dataset Jul 23, 2018
@kr-hansen

I don't have the source downloaded as I've been working from Conda, but I figured I'd do what I could to help this along. Here's what @xhochy described, put into code using some of the examples in the pyarrow docs. I ran it and it seems to work. If someone (perhaps @ukaratay) added this as a test function, it should work.

# Imports
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Make a small table from five copies of one record batch
data = [
    pa.array([1, 2, 3, 4]),
    pa.array(['foo', 'bar', 'baz', None]),
    pa.array([True, None, False, True]),
]
batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
batches = [batch] * 5
table = pa.Table.from_batches(batches)

# Write the table as a dataset directory
pq.write_to_dataset(table, root_path='test.parquet')

# Read the directory
read1 = pq.read_table('test.parquet')

# Add an empty hidden file to the dataset directory
open(os.path.join('test.parquet', '.test'), 'a').close()

# Try reading again
read2 = pq.read_table('test.parquet')

# Both reads should return the same table
success = read1.equals(read2)

@pitrou
Member

pitrou commented Jul 24, 2018

Superseded by PR #2312.

@pitrou pitrou closed this Jul 24, 2018