Skip to content

[BEAM-778] Make filesystem._CompressedFile seekable.#2392

Closed
tibkiss wants to merge 4 commits into
apache:masterfrom
tibkiss:BEAM-778
Closed

[BEAM-778] Make filesystem._CompressedFile seekable.#2392
tibkiss wants to merge 4 commits into
apache:masterfrom
tibkiss:BEAM-778

Conversation

@tibkiss
Copy link
Copy Markdown
Contributor

@tibkiss tibkiss commented Mar 31, 2017

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify. (Even better, enable
    Travis-CI on your fork and ensure the whole test matrix passes).
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

@tibkiss
Copy link
Copy Markdown
Contributor Author

tibkiss commented Mar 31, 2017

R: @chamikaramj

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-0.003%) to 70.333% when pulling a11859d on tibkiss:BEAM-778 into 68522aa on apache:master.

@asfbot
Copy link
Copy Markdown

asfbot commented Mar 31, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/8999/
--none--

@chamikaramj
Copy link
Copy Markdown
Contributor

Thanks Tibor. I'm looking into this.

Copy link
Copy Markdown
Contributor

@chamikaramj chamikaramj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

elif whence == os.SEEK_CUR:
absolute_offset = self._uncompressed_position + offset
elif whence == os.SEEK_END:
# Determine and cache the uncompressed size of the file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be very expensive for large files. How about producing a warning along with the time it took to re-read the file ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created a warning message using logger. Note that I've used the per module logger what i've suggested in 22th of March in the Dev list and was accepted by Ahmet (BEAM-1825).

We might consider reducing the log level to debug if the users find that it floods the logs.

return False
return self._file.mode == 'r'

def _rewind_read_buffer(self):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add comments describing what each of the _rewind* methods does ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I found a better name for _rewind_read_buffer: _clear_read_buffer. I think it better represents the purpose.

* if the new offset is out of bound, it is adjusted to either 0 or EOF.

Args:
offset: seek offset as number.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the offset of the compressed file or is it the offset after file is uncompressed ? Please clarify in the comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


Raises:
IOError: When this buffer is closed.
ValueError: When whence is invalid or the file is not seekable
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention what happens if offset is out of bound.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of bound seek is explained in 'seeking behavior' section in this docstring. I'd prefer to keep it there as the behavior does not yield to exceptions.


# Determine how many bytes needs to be read before we reach
# the requested offset. Rewind if we already passed the position.
if offset < self._uncompressed_position:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be "absolute_offset < self._uncompressed_position" ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good catch! Fixed.

# the requested offset. Rewind if we already passed the position.
if offset < self._uncompressed_position:
self._rewind()
bytes_to_skip = absolute_offset - self._uncompressed_position
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we skip up to absolute_offset if we ran self._rewind()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do skip up to absolute_offset as _rewind sets _uncompressed_position to 0 through _rewind_file.

read_size=self.read_block_size)
reference_fd = StringIO(self.content)

random_position = randint(0, len(self.content) - 1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's run this for several fixed positions within the range (0, size - 1) instead of a random position to make the test consistent.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: added several positional seeks around the boundaries and inside.

self.assertEqual(uncompressed_position, expected_position)
self.assertEqual(reference_position, expected_position)

seek_position = randint(-1 * mid_point, mid_point)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. Let's run this for several fixed positions instead of a random position.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. This exercise revealed an interesting behavior of cStringIO:
If we seek out of bound of a buffer through cStringIO then call tell() it will report
the out of bound position. If we do a read() or readline() prior to tell() then tell()
will report the correct position which is the size of the buffer.

CompressedFile's tell() never returns out of bound numbers. I've documented this difference in the testcase.

@tibkiss
Copy link
Copy Markdown
Contributor Author

tibkiss commented Apr 1, 2017

@chamikaramj : PTAL

@asfbot
Copy link
Copy Markdown

asfbot commented Apr 1, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/9048/

Build result: FAILURE

[...truncated 2.18 MB...] at java.lang.Thread.run(Thread.java:745)Caused by: org.apache.maven.plugin.MojoExecutionException: Command execution failed. at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:302) at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208) ... 31 moreCaused by: org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404) at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166) at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:764) at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:711) at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:289) ... 33 more2017-04-01T15:14:49.415 [ERROR] 2017-04-01T15:14:49.415 [ERROR] Re-run Maven using the -X switch to enable full debug logging.2017-04-01T15:14:49.415 [ERROR] 2017-04-01T15:14:49.415 [ERROR] For more information about the errors and possible solutions, please read the following articles:2017-04-01T15:14:49.415 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException2017-04-01T15:14:49.415 [ERROR] 2017-04-01T15:14:49.415 [ERROR] After correcting the problems, you can resume the build with the command2017-04-01T15:14:49.415 [ERROR] mvn -rf :beam-sdks-pythonchannel stoppedSetting status of 9870ab0 to FAILURE with url https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/9048/ and message: 'Build finished. 'Using context: Jenkins: Maven clean install
--none--

@chamikaramj
Copy link
Copy Markdown
Contributor

Could you fix the Jenkins (lint) failure ?

LGTM other than that.

@tibkiss
Copy link
Copy Markdown
Contributor Author

tibkiss commented Apr 3, 2017

@chamikaramj : Fixed the lint warnings. Thanks for the review!

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-0.2%) to 70.139% when pulling 2eea9ee on tibkiss:BEAM-778 into 68522aa on apache:master.

@asfbot
Copy link
Copy Markdown

asfbot commented Apr 3, 2017

@chamikaramj
Copy link
Copy Markdown
Contributor

retest this please

(rerunning Jenkins)

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-0.2%) to 70.138% when pulling 2eea9ee on tibkiss:BEAM-778 into 68522aa on apache:master.

@asfbot
Copy link
Copy Markdown

asfbot commented Apr 3, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/9085/
--none--

@chamikaramj
Copy link
Copy Markdown
Contributor

LGTM

@chamikaramj
Copy link
Copy Markdown
Contributor

R: @aaltay regarding extra dependency

@aaltay
Copy link
Copy Markdown
Member

aaltay commented Apr 3, 2017

I would argue against taking the additional dependency in this case. Historically we had issues with not-stable dependencies. And in this case:

  • They recently renamed the package and it is <1.0
  • We could easily rewrite the tests to run for a list of compression types without using this additional dependency.
  • It is easier for all developers when full test names exist in the code.

For all of that, @tibkiss could you remove the extra dependency please.

@tibkiss
Copy link
Copy Markdown
Contributor Author

tibkiss commented Apr 5, 2017

@aaltay : Thanks for the review. I've removed 'parameterized' dependency as you suggested. Please take a look. Thanks!

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-0.1%) to 70.021% when pulling ebab092 on tibkiss:BEAM-778 into de36e83 on apache:master.

@asfbot
Copy link
Copy Markdown

asfbot commented Apr 5, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/9172/
--none--

@chamikaramj
Copy link
Copy Markdown
Contributor

LGTM.

Thanks for the updates. I'll merge.

@asfgit asfgit closed this in e2a2836 Apr 5, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants