[BEAM-6768] Sketching FileIO reading transforms for Python SDK #7791
Run Python PostCommit |
Ah yes, the file objects are not pickleable. Gotta work around that. |
@pabloem Could you explain the end goal here? Is this replacing PTransforms such as ReadFromText? |
I'm working on a design doc for it, but I was trying to get away with a PR
before the design doc :) I'll share the design doc soon.
|
Soon meaning tomorrow*
|
@udim A couple of examples: consuming CSV files where the first line of every file must be used as the header (this is not well supported by textio), and returning PCollections of (file_name, line) pairs, which has also been asked about before. |
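To make the CSV use case concrete, here is a minimal sketch of the per-file logic, written against plain file-like objects with only the standard library so that it runs without Beam. The function name `rows_with_header` and the `(file_name, handle)` input shape are hypothetical and are not taken from this PR; in a pipeline, something like this would presumably be applied to each file produced by the reading transform.

```python
import csv
import io


def rows_with_header(file_name, handle):
  """Yield (file_name, row_dict) pairs, using the file's first line as header.

  `handle` is any text-mode file-like object. The name and argument shape are
  illustrative assumptions, not the API proposed in this PR.
  """
  # DictReader consumes the first row of *this* file as the field names,
  # so every file can carry its own header.
  reader = csv.DictReader(handle)
  for row in reader:
    yield file_name, dict(row)


# Two in-memory "files" with different headers stand in for matched files.
files = {
    'a.csv': 'id,name\n1,ann\n2,bob\n',
    'b.csv': 'city,zip\nparis,75001\n',
}

results = []
for name in sorted(files):
  results.extend(rows_with_header(name, io.StringIO(files[name])))
```

Because the header is parsed per file handle rather than per line, this is the kind of logic that is awkward to express on top of a line-oriented source, and each output element keeps the file name alongside the record.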
Run Python PostCommit |
PTAL? : ) |
    elif setting == EmptyMatchTreatment.ALLOW_IF_WILDCARD and '*' in pattern:
      return True
    else:
      return False
Do we also want to validate that setting has a valid value? E.g.:
    elif setting == EmptyMatchTreatment.DISALLOW:
      return False
    else:
      raise ValueError(setting)
I'm thinking about future changes adding/removing enum values.
That makes sense. Thanks Udi.
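A standalone sketch of the validation discussed in this thread might look like the following. It is an illustration under stated assumptions, not the code merged in the PR: `enum.Enum` is used for brevity, the leading `ALLOW` branch is inferred from the `elif` in the diff context, and `ALLOW_IF_WILDCARD` gets its own branch so that a valid setting with a non-wildcard pattern returns `False` instead of reaching the error branch.

```python
import enum


class EmptyMatchTreatment(enum.Enum):
  # Value names follow the ones mentioned in the review; using Enum is an
  # assumption made for this sketch.
  ALLOW = 'ALLOW'
  DISALLOW = 'DISALLOW'
  ALLOW_IF_WILDCARD = 'ALLOW_IF_WILDCARD'


def allow_empty_match(pattern, setting):
  """Return True if an empty match for `pattern` is acceptable."""
  if setting == EmptyMatchTreatment.ALLOW:
    return True
  elif setting == EmptyMatchTreatment.ALLOW_IF_WILDCARD:
    # Empty matches are tolerated only when the pattern contains a wildcard.
    return '*' in pattern
  elif setting == EmptyMatchTreatment.DISALLOW:
    return False
  else:
    # Unknown settings fail loudly, which guards against enum values being
    # added or removed later without updating this function.
    raise ValueError(setting)
```

With this shape, adding a new enum value without handling it here surfaces as a `ValueError` at the call site rather than silently behaving like `DISALLOW`.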
and its metadata; and ``ReadAll``, which takes in a ``PCollection`` of file
metadata records, and produces a ``PCollection`` of (file metadata, file handle)
tuples.
These transforms currently do not support splitting by themselves.
Do we want to mark this whole module as experimental for now?
Done. Thanks Udi!
sdks/python/apache_beam/io/fileio.py
Outdated
  This ``PTransform`` returns a ``PCollection`` of matching files in the form
  of ``FileMetadata`` objects."""

  def __init__(self, file_pattern, empty_match_treatment=None):
Document the default value and simplify the code setting self._empty_match_treatment:
-  def __init__(self, file_pattern, empty_match_treatment=None):
+  def __init__(self, file_pattern, empty_match_treatment=EmptyMatchTreatment.ALLOW_IF_WILDCARD):
Added.
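The difference between the two signatures in this suggestion can be sketched as follows. The class names `MatchFilesBefore` and `MatchFilesAfter` are hypothetical, and the body of the "before" constructor is an assumption about how a `None` default would typically be resolved; the PR's actual code is not shown in this thread.

```python
import enum


class EmptyMatchTreatment(enum.Enum):
  ALLOW = 'ALLOW'
  DISALLOW = 'DISALLOW'
  ALLOW_IF_WILDCARD = 'ALLOW_IF_WILDCARD'


class MatchFilesBefore(object):
  """Hypothetical 'before' shape: the real default is hidden in the body."""

  def __init__(self, file_pattern, empty_match_treatment=None):
    self._file_pattern = file_pattern
    if empty_match_treatment is None:
      empty_match_treatment = EmptyMatchTreatment.ALLOW_IF_WILDCARD
    self._empty_match_treatment = empty_match_treatment


class MatchFilesAfter(object):
  """Hypothetical 'after' shape: the default is visible in the signature."""

  def __init__(self, file_pattern,
               empty_match_treatment=EmptyMatchTreatment.ALLOW_IF_WILDCARD):
    self._file_pattern = file_pattern
    self._empty_match_treatment = empty_match_treatment
```

Both variants behave the same when the argument is omitted, but the second one documents the default in the signature itself and drops the `None` check.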
sdks/python/apache_beam/io/fileio.py
Outdated
  This ``PTransform`` returns a ``PCollection`` of matching files in the form
  of ``FileMetadata`` objects."""

  def __init__(self, empty_match_treatment=None):
-  def __init__(self, empty_match_treatment=None):
+  def __init__(self, empty_match_treatment=EmptyMatchTreatment.ALLOW):
Done.
sdks/python/apache_beam/io/fileio.py
Outdated
Provides thre reading ``PTransform``\\s, ``MatchFiles``,
``MatchAll``, that produces a ``PCollection`` of records representing a file
and its metadata; and ``ReadAll``, which takes in a ``PCollection`` of file
metadata records, and produces a ``PCollection`` of (file metadata, file handle)
produces a PCollection of ReadableFiles?
Ah yes. Done. Thanks!
sdks/python/apache_beam/io/fileio.py
Outdated
Provides thre reading ``PTransform``\\s, ``MatchFiles``,
``MatchAll``, that produces a ``PCollection`` of records representing a file
and its metadata; and ``ReadAll``, which takes in a ``PCollection`` of file
s/ReadAll/ReadMatches/
Done. Thanks.
Thanks @udim for taking a good look - I've addressed your comments.
Run Python PostCommit |
@udim PTAL : ) |
LGTM, thanks |
Run Portable_Python PreCommit |
r: @chamikaramj
These transforms sketch the reading transforms from FileIO. I'd love for you to take a look, Cham : ) They need better documentation, which I'm planning to add, but I wanted to see what you think of what I have currently.
Design doc: https://s.apache.org/fileio-beam-python