
Replace list with generator when iterating headers #4388

Merged: 3 commits into galaxyproject:dev from mvdbeek:iter_headers on Aug 8, 2017

Conversation

@mvdbeek (Member) commented Aug 8, 2017

This should be more memory-friendly. @blankenberg suggested doing this
in #4319.
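The change the PR describes can be sketched as follows. This is an illustrative sketch only: the names `get_headers`/`iter_headers` come from the PR, but Galaxy's real functions take more parameters, and these bodies are assumptions, not the project's actual implementation.

```python
import io

def iter_headers(fileobj, sep, count=60, comment_designator=None):
    """New behavior: yield up to `count` header rows lazily, one at a time,
    so only the current line is held in memory."""
    seen = 0
    for line in fileobj:
        if comment_designator and line.startswith(comment_designator):
            continue  # skip comment lines such as '#...'
        yield line.rstrip('\n').split(sep)
        seen += 1
        if seen >= count:
            break

def get_headers(fileobj, sep, count=60, comment_designator=None):
    """Old behavior: materialize every header row in a list up front."""
    return list(iter_headers(fileobj, sep, count, comment_designator))

sample = "#comment line\nchr1\t10\t20\nchr1\t30\t40\n"
rows = get_headers(io.StringIO(sample), '\t', comment_designator='#')
```

Call sites that only scan rows one at a time (as sniffers typically do) can iterate the generator directly and never build the intermediate list.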


@mvdbeek mvdbeek force-pushed the mvdbeek:iter_headers branch from 96d89a8 to fe2d607 Aug 8, 2017

@@ -313,7 +316,7 @@ def sniff( self, filename ):
     >>> Interval().sniff( fname )
     True
     """
-    headers = get_headers( filename, '\t', comment_designator='#' )
+    headers = iter_headers( filename, '\t', comment_designator='#' )

@nsoranzo (Member) commented Aug 8, 2017
I'd move the iter_headers() call inside the try (i.e. for hdr in iter_headers( filename, '\t', comment_designator='#' ) ) and remove this line. Similarly below.

-    if len(headers[0]) > 2:
-        otulabel_names = headers[0][2:]
+    if len(first_line) > 2:
+        otulabel_names = first_line[2:]
     # set label names and number of lines
     for line in headers:

@nsoranzo (Member) commented Aug 8, 2017

Here you are not processing the first line any more because of the next() call.

@mvdbeek (Author, Member) commented Aug 8, 2017

I switched to fetching the first line separately in 43702cc
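The fix described above, fetching the first row separately and then resuming iteration, can be sketched like this. The `iter_headers` body here is a minimal stand-in for illustration, not Galaxy's actual code, and the sample OTU-style data is invented:

```python
import io

def iter_headers(fileobj, sep, comment_designator=None):
    # Minimal stand-in for the PR's generator, for illustration only.
    for line in fileobj:
        if comment_designator and line.startswith(comment_designator):
            continue
        yield line.rstrip('\n').split(sep)

data = io.StringIO("#otu table\nlabel\tnumOtus\tOtu01\tOtu02\n0.03\t2\t5\t7\n")
headers = iter_headers(data, '\t', comment_designator='#')

# Consume only the first row; next(gen, default) avoids StopIteration
# on an empty file.
first_line = next(headers, None)
otulabel_names = first_line[2:] if first_line and len(first_line) > 2 else []

# The generator resumes after the first row, so no row is processed twice
# and none is skipped.
remaining = list(headers)
```

This is exactly why the earlier `for line in headers:` loop no longer saw the first line: a generator, unlike a list, is consumed by `next()` and cannot be re-read from the start.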

@jmchilton (Member) commented Aug 8, 2017

Beautiful - thanks!

For what it is worth, I believe the amount of memory can still grow unbounded because of this line:

for line in in_file:

It should be replaced by something that reads lines capped at roughly 10 KB each, or something like that.

In fact, I'd take caching that first 10 KB without the iterator over re-reading from disk over and over. Caching the headers and using an iterator is probably the best combination, though, so this is still a very solid step in the right direction.
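The bounded-read idea suggested above could look roughly like the following. This is a hypothetical sketch, not code from the PR; the name `iter_headers_capped` and the cap value are assumptions. It relies on the standard `file.readline(size)` argument, which limits how much of a single line is buffered (characters for text streams):

```python
import io

def iter_headers_capped(fileobj, sep, max_line_len=10 * 1024, count=60):
    """Yield header rows while never buffering more than `max_line_len`
    of any single line, so one pathological line cannot exhaust memory.

    Caveat: a line longer than the cap comes back in chunks, and the next
    call resumes mid-line; callers would have to tolerate or skip that."""
    for _ in range(count):
        chunk = fileobj.readline(max_line_len)
        if not chunk:
            break  # end of file
        yield chunk.rstrip('\n').split(sep)

# A 50-character "line" read with a 10-character cap arrives in chunks.
long_line = "a" * 50 + "\tb\n"
rows = list(iter_headers_capped(io.StringIO(long_line), '\t', max_line_len=10))
```

The mid-line-resume caveat is one concrete reason the thread treats this as a harder, separate change: truncating or chunking lines could alter what downstream metadata code sees.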

@jmchilton jmchilton merged commit 0da4768 into galaxyproject:dev Aug 8, 2017

5 checks passed:
- api test: Build finished. 280 tests run, 0 skipped, 0 failed.
- continuous-integration/travis-ci/pr: The Travis CI build passed.
- framework test: Build finished. 150 tests run, 0 skipped, 0 failed.
- integration test: Build finished. 37 tests run, 0 skipped, 0 failed.
- toolshed test: Build finished. 579 tests run, 0 skipped, 0 failed.
@blankenberg (Member) commented Aug 10, 2017

+1 Thanks @mvdbeek

As @jmchilton mentioned, very long lines can still consume unbounded memory, but that is a more difficult issue to deal with in a backwards compatible manner, as it could impact dataset metadata for certain datatypes and end up changing tool behavior.

@jmchilton (Member) commented Aug 10, 2017

> but that is a more difficult issue to deal with in a backwards compatible manner, as it could impact dataset metadata for certain datatypes and end up changing tool behavior

Isn't that a separate problem from constraining sniffing's memory usage?

@blankenberg (Member) commented Aug 10, 2017

> Isn't that a separate problem from constraining sniffing's memory usage?

In general, yes. But I was referring to any nonparameterized changes to the x_headers() methods.

@jmchilton (Member) commented Aug 10, 2017

> In general, yes. But I was referring to any nonparameterized changes to the x_headers() methods.

Gotcha - thanks!

@mvdbeek mvdbeek deleted the mvdbeek:iter_headers branch Sep 3, 2017

mvdbeek added a commit to mvdbeek/galaxy that referenced this pull request Nov 24, 2017

Fix metadata setting for Otu datatypes
I broke this in galaxyproject#4388.
I went through the other changes in that PR, and I believe this should be the
only error.
Figured out through the great detective work of @abernard.
