[BEAM-625] A few memory and IO optimizations in Avro and FileIO#1694
[BEAM-625] A few memory and IO optimizations in Avro and FileIO#1694katsiapis wants to merge 3 commits intoapache:python-sdkfrom
Conversation
unnecessary copies during slicing of large strings.
|
R: @chamikaramj, can you please take a look? |
|
Refer to this link for build results (access rights to CI server needed): Build result: ABORTED[...truncated 10096 lines...]test_query_iterator_with_exact_batch_multiple (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_large_query_limit (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_multiple_batches (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_query_limit (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_single_batch (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_transient_failures (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_create_scatter_query (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_limit (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_multiple_kinds (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_offset (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_order (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_unsupported_filter (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... okTest get_splits when num scatter keys is a multiple of batch size. ... okTest get_splits when all scatter keys are retured in a single req. ... oktest_get_splits_with_large_num_splits (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_with_multiple_splits (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_with_small_num_entities (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_with_two_splits (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_corrupted_file (apache_beam.io.avroio_test.TestAvro) ... okBuild timed out (after 100 minutes). Marking the build as aborted.Build was abortedtest_dynamic_work_rebalancing_exhaustive (apache_beam.io.avroio_test.TestAvro) ... channel stoppedSetting status of e91662b to FAILURE with url https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6251/ and message: 'Build finished. 'Using context: Jenkins: Maven clean install--none-- |
|
Refer to this link for build results (access rights to CI server needed): Build result: FAILURE[...truncated 11267 lines...] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)Caused by: org.apache.maven.plugin.MojoExecutionException: Command execution failed. at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:276) at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208) ... 31 moreCaused by: org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404) at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166) at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:660) at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:265) ... 33 more2016-12-26T22:42:47.004 [ERROR] 2016-12-26T22:42:47.004 [ERROR] Re-run Maven using the -X switch to enable full debug logging.2016-12-26T22:42:47.004 [ERROR] 2016-12-26T22:42:47.004 [ERROR] For more information about the errors and possible solutions, please read the following articles:2016-12-26T22:42:47.004 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException2016-12-26T22:42:47.004 [ERROR] 2016-12-26T22:42:47.005 [ERROR] After correcting the problems, you can resume the build with the command2016-12-26T22:42:47.005 [ERROR] mvn -rf :beam-sdks-pythonchannel stoppedSetting status of c41130c to FAILURE with url https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6253/ and message: 'Build finished. 'Using context: Jenkins: Maven clean install--none-- |
|
Refer to this link for build results (access rights to CI server needed): Build result: ABORTED[...truncated 10097 lines...]test_query_iterator_with_exact_batch_multiple (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_large_query_limit (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_multiple_batches (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_query_limit (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_single_batch (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_query_iterator_with_transient_failures (apache_beam.io.datastore.v1.helper_test.HelperTest) ... oktest_create_scatter_query (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_limit (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_multiple_kinds (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_offset (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_order (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_query_with_unsupported_filter (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... okTest get_splits when num scatter keys is a multiple of batch size. ... okTest get_splits when all scatter keys are retured in a single req. ... oktest_get_splits_with_large_num_splits (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_with_multiple_splits (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_with_small_num_entities (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_get_splits_with_two_splits (apache_beam.io.datastore.v1.query_splitter_test.QuerySplitterTest) ... oktest_corrupted_file (apache_beam.io.avroio_test.TestAvro) ... okBuild timed out (after 100 minutes). Marking the build as aborted.Build was abortedtest_dynamic_work_rebalancing_exhaustive (apache_beam.io.avroio_test.TestAvro) ... channel stoppedSetting status of a5f5cf8 to FAILURE with url https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6252/ and message: 'Build finished. 'Using context: Jenkins: Maven clean install--none-- |
|
Refer to this link for build results (access rights to CI server needed): |
deficiencies for compressed files.
|
Refer to this link for build results (access rights to CI server needed): |
| fileobj, | ||
| compression_type=CompressionTypes.GZIP, | ||
| read_size=16384): | ||
| read_size=gcsio.DEFAULT_READ_BUFFER_SIZE): |
There was a problem hiding this comment.
Currently this is 16 KB while gcsio.DEFAULT_READ_BUFFER_SIZE is 16 MB. Could you mention that change in the PR description.
Also I don't think we should refer to a constant defined in one file system (GCS) from fileio. Redefine a constant here ?
There was a problem hiding this comment.
The change is already mentioned in the PR description (and the individual commit that introduces it). Would you like me to rephrase it in some way?
As for the constant, there's already several parts of fileio that depend on GcsIO (including constants like gcsio.MAX_BATCH_OPERATION_SIZE) etc. And this might make sense from a performance perspective (indeed this PR is moving towards that), so if we think we should refactor constants it should probably be done as a separate PR that does the refactoring?
There was a problem hiding this comment.
Please clearly mention the increase of buffer size in PR description (16 KB to 16 MB).
Most other references to gcsio seems to be in ChannelFactory which is fine. Only other problematic reference I see is gcsio.MAX_BATCH_OPERATION_SIZE which probably should be removed at some point as well. I would like not to add more such references if possible :)
There was a problem hiding this comment.
Created https://issues.apache.org/jira/browse/BEAM-1222 to remove GCS specific constants from fileio.
There was a problem hiding this comment.
Thanks, I'll add more info there. Leaving DEFAULT_READ_BUFFER_SIZE here for now since its comments are GCS specific (and probably not worth going into fileio), and otherwise we would need 2 constants to be kept in sync. Ie it makes sense to stay as is until the JIRA mentioned above is resolved.
|
LGTM |
|
Hey @dhalperi, could you merge in this PR? |
|
Refer to this link for build results (access rights to CI server needed): |
|
LGTM, merged, thanks. |
|
Thanks Dan, closing. |
Uh oh!
There was an error while loading. Please reload this page.