Skip to content
This repository has been archived by the owner on Feb 10, 2021. It is now read-only.

Add __next__ (and next) to HDFile #121

Merged
merged 1 commit into from
Jun 2, 2017
Merged

Conversation

gglanzani
Copy link
Contributor

This allows for libraries as pandas to read a file as a buffer.

@martindurant
Copy link
Member

Whilst I don't see any reason not to implement these, I am surprised that you find __iter__ is not enough. HDFiles are routinely used in conjunction with pandas - what kind of problem were you having?
Furthermore, it seems to me like next could more simply call readline(); could you add some tests, please, to ensure that the functionality works as required?

@gglanzani
Copy link
Contributor Author

pandas check for next explicitly, so it fails when you do

with hdfs.open(path_to_csv) as f:
    df = pd.read_csv(f)

I will look into readline and tests!

hdfs3/core.py Outdated
""" Enables reading a file as a buffer in pandas """
return next(self._genline())

def next(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

next = __next__ is simpler.

hdfs3/core.py Outdated
@@ -672,6 +672,14 @@ def __iter__(self):
""" Enables `for line in file:` usage """
return self._genline()

def __next__(self):
""" Enables reading a file as a buffer in pandas """
return next(self._genline())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you care about performance, you could call readline() directly:

def __next__(self):
    out = self.readline()
    if out:
        return out
    else:
        raise StopIteration

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that readline() is known to perform poorly (and seems somewhat odd for binary files), so I hope pandas does not really call this repeatedly. If it does, wrapping the file in io.TextIOWrapper is probably far better.

@@ -672,6 +672,14 @@ def __iter__(self):
""" Enables `for line in file:` usage """
return self._genline()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this can now be replaced with return self?

@gglanzani
Copy link
Contributor Author

Thanks for the comment.

I've updated next and __next__ for now (@pitrou: thanks for the tip, I was sure there was an easier way).

@martindurant As for wrapping in TextIOWrapper: do you have some pointers on the path to take?

On the side: can this be merged in the meantime? It would be very helpful to have this feature.

@martindurant
Copy link
Member

The wrapper should work like

import io
with hdfs.open(path_to_csv) as f:
    df = pd.read_csv(io.TextIOWrapper(f))

where Pandas would now see a text-mode file with buffering and correct line-end handling.

I would merge, but there ought to be some test of the new method(s). I notice, also, that _genline is now essentially repeated, so could refactor - but this it not important.

This allows for libraries as pandas to read a file as a buffer.
@gglanzani
Copy link
Contributor Author

@martindurant I've wrote an additional test.

BTW, pandas is using readline() to read f :)

@martindurant
Copy link
Member

Cool, thank you.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants