Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-6768] Sketching FileIO reading transforms for Python SDK #7791

Merged
merged 1 commit into from Mar 9, 2019

Conversation

pabloem
Copy link
Member

@pabloem pabloem commented Feb 8, 2019

r: @chamikaramj

These transforms sketch the reading transforms from FileIO. I'd love for you tot ake a look Cham : ) - They need better documentation, which I'm planning to add, but I wanted to see what you think of what I have currently.

Design doc: https://s.apache.org/fileio-beam-python

@pabloem
Copy link
Member Author

pabloem commented Feb 8, 2019

Run Python PostCommit

@pabloem
Copy link
Member Author

pabloem commented Feb 9, 2019

Ah yes, the file objects are not pickleable. Gotta work around that.

@chamikaramj
Copy link
Contributor

R: @udim @iemejia

Udi, Ismaël, will you be interested in reviewing this ?

@udim
Copy link
Member

udim commented Feb 15, 2019

@pabloem Could you explain the end goal here? Is this replacing PTransforms such as ReadFromText?

@pabloem
Copy link
Member Author

pabloem commented Feb 15, 2019 via email

@pabloem
Copy link
Member Author

pabloem commented Feb 15, 2019 via email

@pabloem
Copy link
Member Author

pabloem commented Feb 15, 2019

@udim
Design doc: https://s.apache.org/fileio-beam-python
The main argument to add these transforms is that these transforms give more flexibility, and more fine-grained control into how files are consumed.

A couple examples are for consuming CSV files, where we need to use the first line of every file as header (this is not well supported by textio), and also returning PCollections with (file_name, line) pairs, which has also been asked about before

@pabloem
Copy link
Member Author

pabloem commented Feb 22, 2019

Run Python PostCommit

@pabloem
Copy link
Member Author

pabloem commented Feb 27, 2019

PTAL? : )

sdks/python/apache_beam/io/fileio.py Outdated Show resolved Hide resolved
elif setting == EmptyMatchTreatment.ALLOW_IF_WILDCARD and '*' in pattern:
return True
else:
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also want to validate that setting has a valid value? (e.g.:

elif setting == EmptyMatchTreatment.DISALLOW:
  return False
else:
  raise ValueError(setting)

). I'm thinking about future changes adding/removing enum values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Thanks Udi.

and its metadata; and ``ReadAll``, which takes in a ``PCollection`` of file
metadata records, and produces a ``PCollection`` of (file metadata, file handle)
tuples.
These transforms currently do not support splitting by themselves.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want mark this whole module as experimental for now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks Udi!

This ``PTransform`` returns a ``PCollection`` of matching files in the form
of ``FileMetadata`` objects."""

def __init__(self, file_pattern, empty_match_treatment=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document the default value and simplify the code setting self._empty_match_treatment:

Suggested change
def __init__(self, file_pattern, empty_match_treatment=None):
def __init__(self, file_pattern, empty_match_treatment=EmptyMatchTreatment.ALLOW_IF_WILDCARD):

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

This ``PTransform`` returns a ``PCollection`` of matching files in the form
of ``FileMetadata`` objects."""

def __init__(self, empty_match_treatment=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def __init__(self, empty_match_treatment=None):
def __init__(self, empty_match_treatment=EmptyMatchTreatment.ALLOW):

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Provides thre reading ``PTransform``\\s, ``MatchFiles``,
``MatchAll``, that produces a ``PCollection`` of records representing a file
and its metadata; and ``ReadAll``, which takes in a ``PCollection`` of file
metadata records, and produces a ``PCollection`` of (file metadata, file handle)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

produces a PCollection of ReadableFiles?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes. Done. Thanks!


Provides thre reading ``PTransform``\\s, ``MatchFiles``,
``MatchAll``, that produces a ``PCollection`` of records representing a file
and its metadata; and ``ReadAll``, which takes in a ``PCollection`` of file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/ReadAll/ReadMatches/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks.

Copy link
Member Author

@pabloem pabloem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @udim for taking a good look - I've addressed your comments.

sdks/python/apache_beam/io/fileio.py Outdated Show resolved Hide resolved

Provides thre reading ``PTransform``\\s, ``MatchFiles``,
``MatchAll``, that produces a ``PCollection`` of records representing a file
and its metadata; and ``ReadAll``, which takes in a ``PCollection`` of file
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks.

Provides thre reading ``PTransform``\\s, ``MatchFiles``,
``MatchAll``, that produces a ``PCollection`` of records representing a file
and its metadata; and ``ReadAll``, which takes in a ``PCollection`` of file
metadata records, and produces a ``PCollection`` of (file metadata, file handle)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes. Done. Thanks!

elif setting == EmptyMatchTreatment.ALLOW_IF_WILDCARD and '*' in pattern:
return True
else:
return False
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Thanks Udi.

This ``PTransform`` returns a ``PCollection`` of matching files in the form
of ``FileMetadata`` objects."""

def __init__(self, file_pattern, empty_match_treatment=None):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

This ``PTransform`` returns a ``PCollection`` of matching files in the form
of ``FileMetadata`` objects."""

def __init__(self, empty_match_treatment=None):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

and its metadata; and ``ReadAll``, which takes in a ``PCollection`` of file
metadata records, and produces a ``PCollection`` of (file metadata, file handle)
tuples.
These transforms currently do not support splitting by themselves.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks Udi!

@pabloem pabloem changed the title Sketching FileIO reading transforms for Python SDK [BEAM-6768] Sketching FileIO reading transforms for Python SDK Mar 5, 2019
@pabloem
Copy link
Member Author

pabloem commented Mar 5, 2019

Run Python PostCommit

@pabloem
Copy link
Member Author

pabloem commented Mar 7, 2019

@udim PTAL : )

@udim
Copy link
Member

udim commented Mar 8, 2019

LGTM, thanks

@pabloem
Copy link
Member Author

pabloem commented Mar 8, 2019

Run Portable_Python PreCommit

@pabloem pabloem merged commit dd83432 into apache:master Mar 9, 2019
@pabloem pabloem deleted the fileio-read branch March 9, 2019 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants