-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-4011] Unify Python IO glob implementation. #5024
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Do not merge before #4979 |
bd9b398 to
a39b2b8
Compare
|
run python postcommit |
|
This code is working. I've opened BEAM-4027 for the precommit failure with cython. |
|
R: @chamikaramj |
|
retest this please |
chamikaramj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
| """Find all matching paths to the pattern provided.""" | ||
| if pattern.endswith('/'): | ||
| pattern += '*' | ||
| prefix_or_dir = re.match('^[^[*?]*', pattern).group(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment explaining this regex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| if self.exists(pattern): | ||
| file_metadatas = [FileMetadata(pattern, self.size(pattern))] | ||
| else: | ||
| if self.has_dirs(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't sure from code, but do we try to list "<path>/*" for "<path>/<prefix>*" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I explained it in the new comment a little above this line.
| raise NotImplementedError | ||
|
|
||
| @abc.abstractmethod | ||
| def size(self, path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a separate method for size() ? I think we can already stat files using the match() method (it returns FileMetada objects).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_list() returns sizes, but it only works on prefixes (may return more than one result) or directories (fails on files).
I added this method for the case where the pattern given doesn't end in / and has no globbing characters. In this case we return the size of the file or directory pointed to by the pattern (if it exists).
| raise NotImplementedError | ||
|
|
||
| @abc.abstractmethod | ||
| def list(self, dir_or_prefix): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having both list() and match() as public can be confusing to users. Let's keep list() as private (and move it to filesystem implementations) if there's no compelling use-case to keep it in the interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| def match(self, patterns, limits=None): | ||
| """Find all matching paths to the patterns provided. | ||
| Pattern matching is done using fnmatch.fnmatch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should clarify if match for <path>/* is recursive or not. I think most users will use a pattern in the form <path>/<prefix>* which to avoid matching all sub-directories anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarified.
- Introduces FileSystem._list() abstract method. Lists a directory or prefix. - Implement FileSystem.match() - no longer abstract, unifies glob behavior using fnmatch.fnmatch.
chamikaramj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
|
@udim @chamikaramj first of all, thanks for your work in Apache Beam! I have a problem: I just updated from for queries against GCS. Now this previously public interface does not work anymore in |
|
Hi @knub ! Calling |
|
Hi udim, thanks for your quick response! For future reference, here's the full replacement with the same interface as before: from apache_beam.io.filesystems import FileSystems
def glob(path):
match_result = FileSystems.match([path])[0]
file_metadata_objects = match_result.metadata_list
return [fm.path for fm in file_metadata_objects]Side-note: I think there's a bug in the In [1]: from apache_beam.io.filesystems import FileSystems
In [2]: FileSystems.match(["gs://<bucket name>/*"])
/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py:176: DeprecationWarning: object() takes no parameters
super(GcsIO, cls).__new__(cls, storage_client))
WARNING:root:Retry with exponential backoff: waiting for 4.3505648468 seconds before retrying list_prefix because we caught exception: ValueError: GCS path must be in the form gs://<bucket>/<object>.
Traceback for above exception (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py", line 180, in wrapper
return fun(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py", line 423, in list_prefix
bucket, prefix = parse_gcs_path(path)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py", line 147, in parse_gcs_path
raise ValueError('GCS path must be in the form gs://<bucket>/<object>.') |
|
@knub , I filed a bug for your side-note. Thanks for reporting! |
Adds glob support for HDFS.
Follow this checklist to help us incorporate your contribution quickly and easily:
[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue.mvn clean verifyto make sure basic checks pass. A more thorough check will be performed on your pull request automatically.