Skip to content

[BEAM-625] A few memory and IO optimizations in Avro and FileIO#1694

Closed
katsiapis wants to merge 3 commits intoapache:python-sdkfrom
katsiapis:memory
Closed

[BEAM-625] A few memory and IO optimizations in Avro and FileIO#1694
katsiapis wants to merge 3 commits intoapache:python-sdkfrom
katsiapis:memory

Conversation

@katsiapis
Copy link
Copy Markdown

@katsiapis katsiapis commented Dec 26, 2016

  • Using a buffer when decompressing with Snappy in Avro in order to avoid unnecessary copies during slicing of large strings.
  • Removing somewhat confusing cStringIO aliasing.
  • Aligning _CompressedFile.read_size with that of GCSIO to avoid reading deficiencies for compressed files. This effectively increases the read_size for _CompressedFile from 16KB to 16MB.

unnecessary copies during slicing of large strings.
@katsiapis
Copy link
Copy Markdown
Author

R: @chamikaramj, can you please take a look?

@asfbot
Copy link
Copy Markdown

asfbot commented Dec 26, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6251/

Build result: ABORTED

[...truncated 10096 lines...]test_query_iterator_with_exact_batch_multiple (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_large_query_limit (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_multiple_batches (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_query_limit (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_single_batch (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_transient_failures (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_create_scatter_query (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_limit (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_multiple_kinds (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_offset (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_order (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_unsupported_filter (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... okTest get_splits when num scatter keys is a multiple of batch size. ... okTest get_splits when all scatter keys are retured in a single req. ... oktest_get_splits_with_large_num_splits (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_with_multiple_splits (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_with_small_num_entities (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_with_two_splits (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_corrupted_file (apache_beam.io.avroio_test.TestAvro) ... okBuild timed out (after 100 minutes). Marking the build as aborted.Build was abortedtest_dynamic_work_rebalancing_exhaustive (apache_beam.io.avroio_test.TestAvro) ... channel stoppedSetting status of e91662b to FAILURE with url https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6251/ and message: 'Build finished. 'Using context: Jenkins: Maven clean install
--none--

@katsiapis katsiapis changed the title [BEAM-625] A few memory optimizations in Avro. [BEAM-625] A few memory and IO optimizations in Avro Dec 26, 2016
@asfbot
Copy link
Copy Markdown

asfbot commented Dec 26, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6253/

Build result: FAILURE

[...truncated 11267 lines...] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)Caused by: org.apache.maven.plugin.MojoExecutionException: Command execution failed. at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:276) at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208) ... 31 moreCaused by: org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404) at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166) at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:660) at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:265) ... 33 more2016-12-26T22:42:47.004 [ERROR] 2016-12-26T22:42:47.004 [ERROR] Re-run Maven using the -X switch to enable full debug logging.2016-12-26T22:42:47.004 [ERROR] 2016-12-26T22:42:47.004 [ERROR] For more information about the errors and possible solutions, please read the following articles:2016-12-26T22:42:47.004 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException2016-12-26T22:42:47.004 [ERROR] 2016-12-26T22:42:47.005 [ERROR] After correcting the problems, you can resume the build with the command2016-12-26T22:42:47.005 [ERROR] mvn -rf :beam-sdks-pythonchannel stoppedSetting status of c41130c to FAILURE with url https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6253/ and message: 'Build finished. 'Using context: Jenkins: Maven clean install
--none--

@asfbot
Copy link
Copy Markdown

asfbot commented Dec 26, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6252/

Build result: ABORTED

[...truncated 10097 lines...]test_query_iterator_with_exact_batch_multiple (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_large_query_limit (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_multiple_batches (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_query_limit (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_single_batch (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_transient_failures (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_create_scatter_query (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_limit (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_multiple_kinds (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_offset (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_order (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_unsupported_filter (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... okTest get_splits when num scatter keys is a multiple of batch size. ... okTest get_splits when all scatter keys are retured in a single req. ... oktest_get_splits_with_large_num_splits (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_with_multiple_splits (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_with_small_num_entities (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_with_two_splits (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_corrupted_file (apache_beam.io.avroio_test.TestAvro) ... okBuild timed out (after 100 minutes). Marking the build as aborted.Build was abortedtest_dynamic_work_rebalancing_exhaustive (apache_beam.io.avroio_test.TestAvro) ... channel stoppedSetting status of a5f5cf8 to FAILURE with url https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6252/ and message: 'Build finished. 'Using context: Jenkins: Maven clean install
--none--

@asfbot
Copy link
Copy Markdown

asfbot commented Dec 26, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6254/
--none--

@katsiapis katsiapis changed the title [BEAM-625] A few memory and IO optimizations in Avro [BEAM-625] A few memory and IO optimizations in Avro and FileIO Dec 27, 2016
@asfbot
Copy link
Copy Markdown

asfbot commented Dec 27, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6263/
--none--

fileobj,
compression_type=CompressionTypes.GZIP,
read_size=16384):
read_size=gcsio.DEFAULT_READ_BUFFER_SIZE):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this is 16 KB while gcsio.DEFAULT_READ_BUFFER_SIZE is 16 MB. Could you mention that change in the PR description.

Also I don't think we should refer to a constant defined in one file system (GCS) from fileio. Redefine a constant here ?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is already mentioned in the PR description (and the individual commit that introduces it). Would you like me to rephrase it in some way?

As for the constant, there's already several parts of fileio that depend on GcsIO (including constants like gcsio.MAX_BATCH_OPERATION_SIZE) etc. And this might make sense from a performance perspective (indeed this PR is moving towards that), so if we think we should refactor constants it should probably be done as a separate PR that does the refactoring?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clearly mention the increase of buffer size in PR description (16 KB to 16 MB).

Most other references to gcsio seems to be in ChannelFactory which is fine. Only other problematic reference I see is gcsio.MAX_BATCH_OPERATION_SIZE which probably should be removed at some point as well. I would like not to add more such references if possible :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created https://issues.apache.org/jira/browse/BEAM-1222 to remove GCS specific constants from fileio.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll add more info there. Leaving DEFAULT_READ_BUFFER_SIZE here for now since its comments are GCS specific (and probably not worth going into fileio), and otherwise we would need 2 constants to be kept in sync. Ie it makes sense to stay as is until the JIRA mentioned above is resolved.

@chamikaramj
Copy link
Copy Markdown
Contributor

LGTM

@katsiapis
Copy link
Copy Markdown
Author

Hey @dhalperi, could you merge in this PR?

@asfbot
Copy link
Copy Markdown

asfbot commented Dec 28, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6277/
--none--

@dhalperi
Copy link
Copy Markdown
Contributor

LGTM, merged, thanks.

asfgit pushed a commit that referenced this pull request Dec 28, 2016
@katsiapis
Copy link
Copy Markdown
Author

Thanks Dan, closing.

@katsiapis katsiapis closed this Dec 28, 2016
@katsiapis katsiapis deleted the memory branch January 14, 2017 02:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants