
MAPREDUCE-7435. Manifest Committer OOM on abfs #5519

Conversation

@steveloughran (Contributor) commented Mar 29, 2023

  • Add heap information to gauges in _SUCCESS.
  • This includes pulling up part of impl.IOStatisticsStore into a public IOStatisticsSetters interface with the put{Counter, Gauge, etc.} methods only (sketched below).
  • TestLoadManifests scaled up in the number of tasks and the number of files per task, to generate more load.

Summary: abfs uses a lot more heap during the load phase than the local "file" filesystem; possibly due to buffering.
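For context, here is a minimal sketch of what such a setter-only interface could look like. The method names are inferred from snippets quoted later in this review (setMeanStatistic, setCounter and the gauges in _SUCCESS); treat it as an assumption rather than the committed API.

import org.apache.hadoop.fs.statistics.MeanStatistic;

/**
 * Sketch of a setter-only statistics interface pulled up from the store
 * implementation, so callers can write values without seeing the whole
 * IOStatisticsStore API.
 */
public interface IOStatisticsSetters {
  void setCounter(String key, long value);
  void setGauge(String key, long value);
  void setMinimum(String key, long value);
  void setMaximum(String key, long value);
  void setMeanStatistic(String key, MeanStatistic value);
}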

How was this patch tested?

  • modified existing tests
  • azure tests in progress

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@steveloughran force-pushed the mapreduce/MAPREDUCE-7435-committer-oom branch from c0fc290 to 720f120 on March 29, 2023 16:23
@cnauroth (Contributor) left a comment

Thanks, @steveloughran . This mostly looks good. I entered a few comments.


@Override
public void setMeanStatistic(final String key, final MeanStatistic value) {

Contributor

meanStatistics().put(key, value);?

public static void addHeapInformation(IOStatisticsSetters ioStatisticsSetters,
String stage) {
// force a gc. bit of bad form but it makes for better numbers
System.gc();
Contributor

This triggered a Spotbugs warning. Do you think the forced GC should go behind a config flag, default off, and turned on in the tests?

Contributor Author

yes, I will do that

Contributor Author

GC pulled out of production code & only invoked in test code
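For illustration, here is a minimal sketch of recording per-stage heap gauges from Runtime with no System.gc() in the production path; the gauge names follow the stage.<stage>.*.memory pattern seen in the statistics dumps later in this conversation, and the class and method names here are assumptions, not the patch code.

import org.apache.hadoop.fs.statistics.IOStatisticsSetters;

public final class HeapGauges {

  private HeapGauges() {
  }

  /**
   * Record free, total and used heap for a stage as gauges.
   * Any forced GC for "better numbers" is left to test code.
   */
  public static void addHeapInformation(IOStatisticsSetters setters, String stage) {
    final Runtime rt = Runtime.getRuntime();
    final long total = rt.totalMemory();
    final long free = rt.freeMemory();
    setters.setGauge("stage." + stage + ".free.memory", free);
    setters.setGauge("stage." + stage + ".total.memory", total);
    setters.setGauge("stage." + stage + ".heap.memory", total - free);
  }
}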

// needed to avoid this test contaminating others in the same JVM
FileSystem.closeAll();
conf.set(fileImpl, fileImplClassname);
conf.set(fileImpl, fileImplClassname);
Contributor

Duplicated line?

I wasn't sure why we need to set the conf here in the finally block. Did something mutate it after line 761, and now we need to restore it?

Contributor Author

Duplicate; cut.
The reason it is in is not so much any change in this PR as a condition it surfaced which was already there: this test changes the default "file" fs, and in some test runs that wasn't being reset, so other tests were failing later for no obvious reason.

@@ -63,6 +81,10 @@ public void setup() throws Exception {
.isGreaterThan(0);
}

public long heapSize() {
return Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
Contributor

Nitpick: some indentation issues here.

success.save(summaryFS, path, true);
LOG.info("Saved summary to {}", path);
ManifestPrinter showManifest = new ManifestPrinter();
ManifestSuccessData manifestSuccessData =
Contributor

Nitpick: some indentation issues here.

@steveloughran marked this pull request as draft April 4, 2023 10:37
@steveloughran (Contributor Author)

@cnauroth -thanks for the comments; will update

I've converted this to a draft as I am working on the next step: streaming the list of files to rename from each manifest into a SequenceFile saved to the local FS; the rename stage then reads that back and spreads the renames across the worker pool, maybe in batches.

This will eliminate the need to store the list of files to rename in memory at all, so neither the number of files nor path lengths matter. The file will be on the local FS, so on an SSD machine it is fairly quick to write and read back, especially if the OS buffers well/is optimised for transient files.
It does complicate propagation of data, hence the extra work and the need for some more tests, including of the save/restore process itself.
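As an illustration of that approach, here is a minimal, self-contained sketch of spilling rename entries to a local SequenceFile and streaming them back; the path and the plain-text entry format are hypothetical, not the committer's actual FileEntry/EntryFileIO code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RenameListSpillSketch {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // hypothetical local-FS location for the spilled entry file
    Path entryFile = new Path("file:///tmp/rename-entries.seq");

    // manifest load phase: append one record per file to rename
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(entryFile),
        SequenceFile.Writer.keyClass(NullWritable.class),
        SequenceFile.Writer.valueClass(Text.class))) {
      writer.append(NullWritable.get(), new Text("src/part-0000 -> dest/part-0000"));
    }

    // rename phase: stream the entries back rather than holding them all in memory
    try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(entryFile))) {
      NullWritable key = NullWritable.get();
      Text entry = new Text();
      while (reader.next(key, entry)) {
        // real code would submit each rename to the worker pool, maybe in batches
        System.out.println("rename " + entry);
      }
    }
  }
}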

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 52s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 10 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 16m 6s Maven dependency ordering for branch
+1 💚 mvninstall 28m 31s trunk passed
+1 💚 compile 25m 13s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 21m 43s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 checkstyle 4m 5s trunk passed
+1 💚 mvnsite 3m 20s trunk passed
+1 💚 javadoc 2m 19s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 1m 50s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
-1 ❌ spotbugs 1m 30s /branch-spotbugs-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core-warnings.html hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core in trunk has 1 extant spotbugs warnings.
+1 💚 shadedclient 23m 46s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 23s Maven dependency ordering for patch
+1 💚 mvninstall 2m 5s the patch passed
+1 💚 compile 24m 23s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 24m 23s the patch passed
+1 💚 compile 21m 42s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 javac 21m 42s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 3m 48s /results-checkstyle-root.txt root: The patch generated 8 new + 33 unchanged - 0 fixed = 41 total (was 33)
+1 💚 mvnsite 3m 17s the patch passed
+1 💚 javadoc 2m 11s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 1m 50s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
-1 ❌ spotbugs 1m 44s /new-spotbugs-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core.html hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core generated 3 new + 1 unchanged - 0 fixed = 4 total (was 1)
+1 💚 shadedclient 24m 6s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 21s hadoop-common in the patch passed.
+1 💚 unit 7m 15s hadoop-mapreduce-client-core in the patch passed.
+1 💚 unit 2m 17s hadoop-azure in the patch passed.
+1 💚 asflicense 0m 52s The patch does not generate ASF License warnings.
255m 39s
Reason Tests
SpotBugs module:hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core
Should org.apache.hadoop.mapreduce.lib.output.committer.manifest.impl.EntryFileIO$EntryIterator be a static inner class? At EntryFileIO.java:[lines 394-463]
Should org.apache.hadoop.mapreduce.lib.output.committer.manifest.impl.EntryFileIO$EntryWriter be a static inner class? At EntryFileIO.java:[lines 178-383]
org.apache.hadoop.mapreduce.lib.output.committer.manifest.impl.ManifestCommitterSupport.addHeapInformation(IOStatisticsSetters, String) forces garbage collection; extremely dubious except in benchmarking code. At ManifestCommitterSupport.java:[line 239]
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5519/6/artifact/out/Dockerfile
GITHUB PR #5519
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 9599f9060afa 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 2d0dcce
Default Java Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5519/6/testReport/
Max. process+thread count 2613 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5519/6/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran force-pushed the mapreduce/MAPREDUCE-7435-committer-oom branch from 2d0dcce to de8c6e5 on April 11, 2023 14:09
@cnauroth (Contributor) left a comment

Sounds good. I'll wait to hear from you when you want code review again. Thanks, Steve.

@steveloughran force-pushed the mapreduce/MAPREDUCE-7435-committer-oom branch from de8c6e5 to 1358391 on April 17, 2023 18:49
@steveloughran (Contributor Author)

Updated PR has been run through azure, with the stats of a terasort being:


2023-04-19 16:15:23,305 INFO  [JUnit-test_140_teracomplete]: statistics.IOStatisticsLogging (IOStatisticsLogging.java:logIOStatisticsAtLevel(269)) - IOStatistics: counters=((commit_file_rename=4)
(committer_bytes_committed=200021)
(committer_commit_job=3)
(committer_files_committed=4)
(committer_task_directory_depth=7)
(committer_task_file_count=8)
(committer_task_file_size=200021)
(committer_task_manifest_file_size=127578)
(job_stage_create_target_dirs=3)
(job_stage_load_manifests=3)
(job_stage_rename_files=3)
(job_stage_setup=3)
(op_create_directories=3)
(op_delete=3)
(op_get_file_status=13)
(op_get_file_status.failures=13)
(op_list_status=10)
(op_load_all_manifests=3)
(op_load_manifest=7)
(op_mkdirs=13)
(op_msync=3)
(task_stage_commit=7)
(task_stage_scan_directory=7)
(task_stage_setup=7));

gauges=((stage.job_stage_create_target_dirs.free.memory=1542076752)
(stage.job_stage_create_target_dirs.heap.memory=533055152)
(stage.job_stage_create_target_dirs.total.memory=2075131904)
(stage.job_stage_load_manifests.free.memory=1544775808)
(stage.job_stage_load_manifests.heap.memory=530356096)
(stage.job_stage_load_manifests.total.memory=2075131904)
(stage.job_stage_rename_files.free.memory=1505757416)
(stage.job_stage_rename_files.heap.memory=569374488)
(stage.job_stage_rename_files.total.memory=2075131904)
(stage.setup.free.memory=1688139168)
(stage.setup.heap.memory=386992736)
(stage.setup.total.memory=2075131904));

minimums=((commit_file_rename.min=53)
(committer_task_directory_count=0)
(committer_task_directory_depth=1)
(committer_task_file_count=0)
(committer_task_file_size=0)
(committer_task_manifest_file_size=18018)
(job_stage_create_target_dirs.min=3)
(job_stage_load_manifests.min=184)
(job_stage_rename_files.min=73)
(job_stage_setup.min=267)
(op_create_directories.min=0)
(op_delete.min=30)
(op_get_file_status.failures.min=24)
(op_list_status.min=170)
(op_load_all_manifests.min=73)
(op_load_manifest.min=54)
(op_mkdirs.min=26)
(op_msync.min=0)
(task_stage_commit.min=176)
(task_stage_scan_directory.min=176)
(task_stage_setup.min=52));

maximums=((commit_file_rename.max=62)
(committer_task_directory_count=0)
(committer_task_directory_depth=1)
(committer_task_file_count=1)
(committer_task_file_size=100000)
(committer_task_manifest_file_size=18389)
(job_stage_create_target_dirs.max=4)
(job_stage_load_manifests.max=250)
(job_stage_rename_files.max=74)
(job_stage_setup.max=295)
(op_create_directories.max=1)
(op_delete.max=42)
(op_get_file_status.failures.max=113)
(op_list_status.max=189)
(op_load_all_manifests.max=139)
(op_load_manifest.max=125)
(op_mkdirs.max=74)
(op_msync.max=0)
(task_stage_commit.max=194)
(task_stage_scan_directory.max=194)
(task_stage_setup.max=93));

means=((commit_file_rename.mean=(samples=4, sum=227, mean=56.7500))
(committer_task_directory_count=(samples=14, sum=0, mean=0.0000))
(committer_task_directory_depth=(samples=7, sum=7, mean=1.0000))
(committer_task_file_count=(samples=14, sum=8, mean=0.5714))
(committer_task_file_size=(samples=7, sum=200021, mean=28574.4286))
(committer_task_manifest_file_size=(samples=7, sum=127578, mean=18225.4286))
(job_stage_create_target_dirs.mean=(samples=3, sum=11, mean=3.6667))
(job_stage_load_manifests.mean=(samples=3, sum=638, mean=212.6667))
(job_stage_rename_files.mean=(samples=3, sum=220, mean=73.3333))
(job_stage_setup.mean=(samples=3, sum=845, mean=281.6667))
(op_create_directories.mean=(samples=3, sum=2, mean=0.6667))
(op_delete.mean=(samples=3, sum=107, mean=35.6667))
(op_get_file_status.failures.mean=(samples=13, sum=638, mean=49.0769))
(op_list_status.mean=(samples=10, sum=1431, mean=143.1000))
(op_load_all_manifests.mean=(samples=3, sum=287, mean=95.6667))
(op_load_manifest.mean=(samples=7, sum=546, mean=78.0000))
(op_mkdirs.mean=(samples=13, sum=580, mean=44.6154))
(op_msync.mean=(samples=3, sum=0, mean=0.0000))
(task_stage_commit.mean=(samples=7, sum=1279, mean=182.7143))
(task_stage_scan_directory.mean=(samples=7, sum=1279, mean=182.7143))
(task_stage_setup.mean=(samples=7, sum=505, mean=72.1429)));

Once abfs adds IOStatisticsContext updates for input stream reads, we could collect and add that into the stats too; not worrying about it until then.

@steveloughran (Contributor Author) commented Apr 19, 2023

Tested against azure cardiff, which is a slow test run today (no network). It would be nice to move the LoadManifests test into the parallel part of the test run, but trying to do so seems to blow up too much stuff (the abfs parallel test phase runs individual test cases in parallel).

----

[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.937 s - in org.apache.hadoop.fs.azurebfs.services.ITestReadBufferManager
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 67.344 s - in org.apache.hadoop.fs.azurebfs.commit.ITestAbfsLoadManifestsStage
[INFO] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 137.539 s - in org.apache.hadoop.fs.azurebfs.commit.ITestAbfsTerasort
[INFO] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 90.628 s - in org.apache.hadoop.fs.azurebfs.ITestAzureBlobFileSystemListStatus
[INFO] Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 140.388 s - in org.apache.hadoop.fs.azurebfs.contract.ITestAbfsFileSystemContractDistCp

----

@steveloughran marked this pull request as ready for review April 19, 2023 18:26
@steveloughran (Contributor Author)

parallel test running failed everywhere, but I have improved ITestAbfsLoadManifestsStage performance

  • back to the original 200 manifest files
  • increase worker pool and buffer queue size (more significant before
    reducing the manifest count)

brings test time down to 10s locally. IOStats does imply many MB of data is
being PUT/GET so it is good to keep small so people running with less
bandwidth don't suffer. Maybe, maybe, the size could switch
with a -Dscale?

IOStats:
There seem to be a lot of delete requests, but that is because when we write the manifest it is done as a write to a temp file then a rename, and the destination is deleted first, without any check.
In production that cost is absorbed in task commit, and at ~60ms for a DELETE vs ~40ms for a HEAD we should decide what to do here. I think for renames in job commit we could do the HEAD before the DELETE, simply because that is the bottleneck, so maybe do it here too...

2023-04-19 19:43:05,489 INFO  [JUnit]: manifest.AbstractManifestCommitterTest (AbstractManifestCommitterTest.java:dumpFileSystemIOStatistics(450)) - Aggregate FileSystem Statistics counters=((action_http_delete_request=402)
(action_http_delete_request.failures=200)
(action_http_get_request=202)
(action_http_head_request=404)
(action_http_head_request.failures=202)
(action_http_put_request=1103)
(bytes_received=10160814)
(bytes_sent=10160814)
(committer_task_directory_count=20000)
(committer_task_file_count=20000)
(committer_task_manifest_file_size=10160814)
(connections_made=2111)
(directories_created=303)
(files_created=200)
(get_responses=2111)
(job_stage_create_target_dirs=1)
(job_stage_load_manifests=1)
(job_stage_setup=1)
(op_create=200)
(op_create_directories=1)
(op_delete=803)
(op_get_file_status=407)
(op_get_file_status.failures=202)
(op_list_status=2)
(op_load_all_manifests=1)
(op_load_manifest=200)
(op_mkdirs=605)
(op_msync=1)
(op_open=200)
(op_rename=400)
(rename_path_attempts=200)
(send_requests=1103)
(task_stage_save_manifest=200)
(task_stage_save_task_manifest=200)
(task_stage_setup=200));

gauges=();

minimums=((action_http_delete_request.failures.min=25)
(action_http_delete_request.min=36)
(action_http_get_request.min=40)
(action_http_head_request.failures.min=22)
(action_http_head_request.min=20)
(action_http_put_request.min=24)
(committer_task_directory_count=100)
(committer_task_file_count=100)
(committer_task_manifest_file_size=49990)
(job_stage_create_target_dirs.min=259)
(job_stage_load_manifests.min=2804)
(job_stage_setup.min=183)
(op_create_directories.min=256)
(op_delete.min=25)
(op_get_file_status.failures.min=22)
(op_get_file_status.min=22)
(op_list_status.min=87)
(op_load_all_manifests.min=2627)
(op_load_manifest.min=49)
(op_mkdirs.min=24)
(op_msync.min=0)
(op_rename.min=70)
(task_stage_save_manifest.min=273)
(task_stage_save_task_manifest.min=144)
(task_stage_setup.min=51));

maximums=((action_http_delete_request.failures.max=413)
(action_http_delete_request.max=291)
(action_http_get_request.max=2031)
(action_http_head_request.failures.max=439)
(action_http_head_request.max=430)
(action_http_put_request.max=2662)
(committer_task_directory_count=100)
(committer_task_file_count=100)
(committer_task_manifest_file_size=50876)
(job_stage_create_target_dirs.max=259)
(job_stage_load_manifests.max=2804)
(job_stage_setup.max=183)
(op_create_directories.max=256)
(op_delete.max=413)
(op_get_file_status.failures.max=439)
(op_get_file_status.max=22)
(op_list_status.max=127)
(op_load_all_manifests.max=2627)
(op_load_manifest.max=2031)
(op_mkdirs.max=245)
(op_msync.max=0)
(op_rename.max=932)
(task_stage_save_manifest.max=2863)
(task_stage_save_task_manifest.max=2757)
(task_stage_setup.max=471));

means=((action_http_delete_request.failures.mean=(samples=200, sum=9850, mean=49.2500))
(action_http_delete_request.mean=(samples=202, sum=12448, mean=61.6238))
(action_http_get_request.mean=(samples=202, sum=78955, mean=390.8663))
(action_http_head_request.failures.mean=(samples=202, sum=12782, mean=63.2772))
(action_http_head_request.mean=(samples=202, sum=8096, mean=40.0792))
(action_http_put_request.mean=(samples=1103, sum=108966, mean=98.7906))
(committer_task_directory_count=(samples=200, sum=20000, mean=100.0000))
(committer_task_file_count=(samples=200, sum=20000, mean=100.0000))
(committer_task_manifest_file_size=(samples=200, sum=10160814, mean=50804.0700))
(job_stage_create_target_dirs.mean=(samples=1, sum=259, mean=259.0000))
(job_stage_load_manifests.mean=(samples=1, sum=2804, mean=2804.0000))
(job_stage_setup.mean=(samples=1, sum=183, mean=183.0000))
(op_create_directories.mean=(samples=1, sum=256, mean=256.0000))
(op_delete.mean=(samples=401, sum=22278, mean=55.5561))
(op_get_file_status.failures.mean=(samples=202, sum=12806, mean=63.3960))
(op_get_file_status.mean=(samples=1, sum=22, mean=22.0000))
(op_list_status.mean=(samples=2, sum=214, mean=107.0000))
(op_load_all_manifests.mean=(samples=1, sum=2627, mean=2627.0000))
(op_load_manifest.mean=(samples=200, sum=79570, mean=397.8500))
(op_mkdirs.mean=(samples=302, sum=15536, mean=51.4437))
(op_msync.mean=(samples=1, sum=0, mean=0.0000))
(op_rename.mean=(samples=200, sum=22999, mean=114.9950))
(task_stage_save_manifest.mean=(samples=200, sum=115797, mean=578.9850))
(task_stage_save_task_manifest.mean=(samples=200, sum=82893, mean=414.4650))
(task_stage_setup.mean=(samples=200, sum=21878, mean=109.3900)));

@steveloughran (Contributor Author)

Test-run failure is the usual intermittent failure of the slow tests, showing some transient failure happened and was recovered from. Wifi was playing up all evening.


[ERROR] Tests run: 48, Failures: 2, Errors: 0, Skipped: 24, Time elapsed: 2,057.747 s <<< FAILURE! - in org.apache.hadoop.fs.azurebfs.ITestSmallWriteOptimization
[ERROR] testSmallWriteOptimization[OptmOFF_CloseTest_EmptyFile_MultiSmallWritesStillLessThanBufferSize](org.apache.hadoop.fs.azurebfs.ITestSmallWriteOptimization)  Time elapsed: 512.92 s  <<< FAILURE!
java.lang.AssertionError: Mismatch in connections_made expected:<4> but was:<5>
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.apache.hadoop.fs.azurebfs.AbstractAbfsIntegrationTest.assertAbfsStatistics(AbstractAbfsIntegrationTest.java:526)
        at org.apache.hadoop.fs.azurebfs.ITestSmallWriteOptimization.assertOpStats(ITestSmallWriteOptimization.java:499)
        at org.apache.hadoop.fs.azurebfs.ITestSmallWriteOptimization.formulateSmallWriteTestAppendPattern(ITestSmallWriteOptimization.java:437)
        at org.apache.hadoop.fs.azurebfs.ITestSmallWriteOptimization.testSmallWriteOptimization(ITestSmallWriteOptimization.java:324)
        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
        at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
        at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
        at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
        at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.lang.Thread.run(Thread.java:750)

[ERROR] testSmallWriteOptimization[OptmOFF_FlushCloseTest_EmptyFile_MultiBufferSizeWrite](org.apache.hadoop.fs.azurebfs.ITestSmallWriteOptimization)  Time elapsed: 781.323 s  <<< FAILURE!
java.lang.AssertionError: Mismatch in connections_made expected:<10> but was:<11>
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.apache.hadoop.fs.azurebfs.AbstractAbfsIntegrationTest.assertAbfsStatistics(AbstractAbfsIntegrationTest.java:526)
        at org.apache.hadoop.fs.azurebfs.ITestSmallWriteOptimization.assertOpStats(ITestSmallWriteOptimization.java:499)
        at org.apache.hadoop.fs.azurebfs.ITestSmallWriteOptimization.formulateSmallWriteTestAppendPattern(ITestSmallWriteOptimization.java:437)
        at org.apache.hadoop.fs.azurebfs.ITestSmallWriteOptimization.testSmallWriteOptimization(ITestSmallWriteOptimization.java:324)
        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
        at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
        at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
        at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
        at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.lang.Thread.run(Thread.java:750)

[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2,079.726 s - in org.apache.hadoop.fs.azurebfs.ITestAzureBlobFileSystemE2EScale
[INFO] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2,277.925 s - in org.apache.hadoop.fs.azurebfs.ITestAbfsReadWriteAndSeek
[INFO] 
[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   ITestSmallWriteOptimization.testSmallWriteOptimization:324->formulateSmallWriteTestAppendPattern:437->assertOpStats:499->AbstractAbfsIntegrationTest.assertAbfsStatistics:526->Assert.assertEquals:647->Assert.failNotEquals:835->Assert.fail:89 Mismatch in connections_made expected:<4> but was:<5>
[ERROR]   ITestSmallWriteOptimization.testSmallWriteOptimization:324->formulateSmallWriteTestAppendPattern:437->assertOpStats:499->AbstractAbfsIntegrationTest.assertAbfsStatistics:526->Assert.assertEquals:647->Assert.failNotEquals:835->Assert.fail:89 Mismatch in connections_made expected:<10> but was:<11>
[INFO] 
[ERROR] Tests run: 336, Failures: 2, Errors: 0, Skipped: 42

@steveloughran (Contributor Author)

The spotbugs complaint is valid, but has nothing to do with this PR:




Code	Warning
ST	Write to static field org.apache.hadoop.mapreduce.task.reduce.Fetcher.nextId from instance method new org.apache.hadoop.mapreduce.task.reduce.Fetcher(JobConf, TaskAttemptID, ShuffleSchedulerImpl, MergeManager, Reporter, ShuffleClientMetrics, ExceptionReporter, SecretKey)
[Bug type ST_WRITE_TO_STATIC_FROM_INSTANCE_METHOD (click for details)](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5519/11/artifact/out/branch-spotbugs-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core-warnings.html#ST_WRITE_TO_STATIC_FROM_INSTANCE_METHOD)
In class org.apache.hadoop.mapreduce.task.reduce.Fetcher
In method new org.apache.hadoop.mapreduce.task.reduce.Fetcher(JobConf, TaskAttemptID, ShuffleSchedulerImpl, MergeManager, Reporter, ShuffleClientMetrics, ExceptionReporter, SecretKey)
Field org.apache.hadoop.mapreduce.task.reduce.Fetcher.nextId
At Fetcher.java:[line 120]

@steveloughran force-pushed the mapreduce/MAPREDUCE-7435-committer-oom branch from 039648b to b5166b6 on April 21, 2023 14:06
throw new UncheckedIOException(e);
} catch (InterruptedException e) {
// being stopped implicitly
LOG.debug("interrupted", e);
Contributor Author

set stop

Contributor Author

done in finally

@mehakmeet (Contributor) left a comment

Looks really interesting and well-written.
Have started to look at some bits of core functionality, moving towards tests afterwards.

addHeapInformation(heapInfo, "setup");
// load the manifests
final StageConfig stageConfig = getStageConfig();
LoadManifestsStage.Result result = new LoadManifestsStage(stageConfig).apply(
Contributor

suggestion: We can include a duration tracker to know the time taken to load manifests in the final stats.

Contributor Author

already done in AbstractJobOrTaskStage

}
if (active.get()) {
try {
queue.put(new QueueEntry(Actions.write, entries));
Contributor

Suggestion: We could use queue.offer(E e, long timeout, TimeUnit unit), such that we are waiting for the queue to have the capacity to add the Entry while also having a timeout in case something goes wrong. We can catch the interrupt and throw/swallow accordingly if it exceeds the timeout?

Contributor Author

hmmm. I would rather have the writer work. If we have a timeout here it should be really big, as we want to cope with all threads blocking for a while.

if (active.get()) {
try {
queue.put(new QueueEntry(Actions.write, entries));
LOG.debug("Queued {}", entries.size());
Contributor

Some info about the entry that was queued in the LOG?

Contributor Author

it's a list...what do we want to add?

* @param path path to create
* @return true if dir created/found
* @param dirEntry dir to create
* @return Outcome
Contributor

nit: Better javadocs for return, "State of the directory in the dir map" or something?

* This is primarily for tests or when submitting work into a TaskPool.
* equivalent to
* <pre>
* for(long l = start, l &lt; finis; l++) yield l;
Contributor

typo: "l < excludedFinish"

Thread.currentThread().setName("EntryIOWriter");
try {
while (!stop.get()) {
final QueueEntry queueEntry = queue.take();
Contributor

seems like we could wait indefinitely on this.
How about poll(long timeout, TimeUnit unit)?

Contributor Author

I don't know how long we should wait here. It assumes that yes, the caller will eventually stop the write.

Contributor Author

Fixed to 10 minutes, returning false. The caller gets to react (which it will do by raising an IOE, but giving anything raised by the writer thread priority).
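To make the offer/poll discussion concrete, here is a minimal sketch of a bounded hand-off that fails rather than hangs; the class, method and constant names are made up for the example (the 10-minute value just mirrors the emergency timeout discussed above), so don't read it as the EntryFileIO implementation.

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class BoundedEntryQueueSketch {

  // assumed emergency-only timeout, mirroring the 10 minutes discussed above
  private static final long QUEUE_TIMEOUT_MINUTES = 10;

  private final BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(4);

  /** Producer side: block up to the timeout; false means "raise an error, don't hang". */
  boolean enqueue(List<String> entries) throws InterruptedException {
    return queue.offer(entries, QUEUE_TIMEOUT_MINUTES, TimeUnit.MINUTES);
  }

  /** Consumer side: a null return after the timeout means the producer is wedged. */
  List<String> nextBlock() throws InterruptedException {
    return queue.poll(QUEUE_TIMEOUT_MINUTES, TimeUnit.MINUTES);
  }
}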

// signal queue closure by queuing a stop option.
// this is added at the end of the list of queued blocks,
// of which are written.
try {
Contributor

LOG something like "Tasks left in queue = capacity - queue.remainingCapacity()" for better logging. We could do something like this while offering as well but seems apt for close().


It can help limit the amount of memory consumed during manifest load during
job commit.
The maximumum number of loaded manifests will be
Contributor

typo: "maximum"

* Add heap information to gauges in _SUCCESS
* which includes pulling up part of impl.IOStatisticsStore into a public
  IOStatisticsSetters interface with the put{Counter, Gauge, etc} methods only.
* TestLoadManifests scaled up with #of tasks and #of files in each task
  to generate more load.
* code to build up dir map during load phase; not wired up

Summary: abfs uses a lot more heap during the load phase than file; possibly
due to buffering, but doing a pipeline for processing the results
isn't sustainable.

Either

two phase:
  * phase 1, build up the dir list, discard manifests after each load
  * phase 2, load manifests and rename incrementally

Or:
 unified with some complicated directory creation process to ensure
 that each task's dirs exist before its rename begins.

Change-Id: I8595c083435e3d4df27343599687677abfc1c013
* DirEntry and FileEntry are writeable
* LoadManifests to take a path to where to cache the rename list.

not yet wired up.

Change-Id: Ibd992b179bd0bcf26a39ae4ce5407257ecbfcb10
This is a big change, with tests as far as verifying core read/write happy.

Current state: simple read/write good, async queue not yet tested

Change-Id: I7cb1443024780b355a8f3bb96fbfe08d8608d968
interim commit

Change-Id: I80bb4e72c1029baad8fb87d8c9287b08c0b000f4
...but not tested the job commit yet

Change-Id: I0f54ede94e41592558468df1c87f4a39d2461223
...but not tested the job commit yet

Change-Id: I4d50636542673a3f25a7ab363df1b1bd221216ae
* TestEntryFileIO extended for ths
* ABFS terasort test happy!

Change-Id: I068861973114d9947f3d22eaf32a6ee3b7ca8fa2
TODO: fault injection on the writes
* validation also uses manifest entries (and so works!)
* testing expects this
* tests of IOStats
* tests of new RemoteIterators

Change-Id: I4cfb308d4b08f1f775cfdbe2df6f8ff07ac6bc54
Change-Id: I2008d31bff3af59396a04dddc1b9357b1a812294
* moved RangeExcludingLongIterator into RemoteIterators, added test.
* address checkstyle
* address spotbugs
* address deprecation
* ValidateRenameFilesStage doesn't validate etags on wasb; helps address a
  JIRA about hadoop-azure testing.

Change-Id: Id6507d79f8d3cfa434afb65bfe9fc7539a7c1cf5
* back to the original 200 manifest files
* increase worker pool and buffer queue size (more significant before
  reducing the manifest count)

brings test time down to 10s locally. IOStats does imply many MB of data is
being PUT/GET so it is good to keep small so people running with less
bandwidth don't suffer. Maybe, maybe, the size could switch
with a -Dscale?

Change-Id: I49d201d7af7434797ab6fff5831a0f899c5c4185
…leanup

Change-Id: If043263676c4d5694065e7ec35954a7f66c04d90
@steveloughran (Contributor Author)

ok, rebasing and pushing up with a commit to address most of the changes.

Not addressed: having timeouts on the offer/take of the EntryWriter in the queue. I agree it is safest if we do add a timeout here, just as an emergency.
What do you think? Something big like 10 minutes? As it is only a fallback in case the code is broken, it should be reported as an error.

Change-Id: Ib93ba8ba632135a05da126a75f34e78bd381cf2a
@steveloughran force-pushed the mapreduce/MAPREDUCE-7435-committer-oom branch from b5166b6 to 8e83fdc on April 26, 2023 11:35
@steveloughran (Contributor Author)

latest patch tested against azure cardiff, timeout unrelated

[ERROR] Errors: 
[ERROR]   ITestAzureBlobFileSystemLease.testTwoWritersCreateAppendWithInfiniteLeaseEnabled:186->twoWriters:154 » TestTimedOut
[INFO] 

I wonder if we should just increase the timeout there?

@steveloughran (Contributor Author)

@mehakmeet when you get time, can you review this. I would like to get this in before I forget about it

@mehakmeet (Contributor) left a comment

I think this is very close to getting merged. I had one doubt with respect to the testing: was the large loading of manifests run against the old way of handling manifest files, to see the OOM errors? Tests look good overall; I just wanted to know if that limit was being hit and is now rectified with the new approach.

// do an explicit close to help isolate any failure.
SequenceFile.Writer writer = createWriter();
writer.append(NullWritable.get(), source);
writer.flush();
Contributor

just a doubt here. Do we need to explicitly flush the writer before closing? Won't that be done in close too? If yes, we can test both mechanisms by just saying writer.close()

Contributor Author

Just being rigorous. On HDFS, close() doesn't actually sync the data, just flushes it, FWIW.

// now use the iterator to access it.
List<FileEntry> files = new ArrayList<>();
Assertions.assertThat(foreach(iterateOverEntryFile(), files::add))
.isEqualTo(0);
Contributor

Can you clarify which value equates to "0" in this test by some comments?

Contributor Author

added a description

.isEqualTo(2);

// unknown value
ioStatistics.setCounter("c2", 3);
Contributor

how about we assert that this unknown counter shouldn't exist in the counters map? Just to test the no-op, I guess.

Contributor Author

lets see...

Contributor Author

had to parameterize the test so we assert that on snapshots they do accrue, but on the other impls they don't

@@ -141,6 +145,9 @@
*/
private StageConfig ta11Config;

private LoadedManifestData
Contributor

javadocs for consistency.

Contributor Author

done. also needed to be static to work properly, so doc that after fixing it

@@ -82,6 +85,8 @@ public class TestRenameStageFailure extends AbstractManifestCommitterTest {
/** resilient commit expected? */
private boolean resilientCommit;

private EntryFileIO entryFileIO;
Contributor

javadocs for consistency.

@apache deleted 11 comments from hadoop-yetus Jun 1, 2023
that is: success file contains entries which aren't present in the FS

Fixes
* find bit in earlier test where file was being deleted, and restore it
  (and re-order it too!)
* LoadManifestsStage doesn't optionally return manifests for testing;
  tests modified to match.
* EntryFileIO will report timeout after 10 minutes if queue blocks somehow.
* LoadManifestsStage handles this timeout and will raise it as a failure,
  but only secondary to any exception raised by the writer thread
* SUCCESS file can be configured with #of files to list, allows for tests
  to assert on many thousands of files, although in production it is still
  fixed to a small number for performance reasons.

Change-Id: I642c1178928de427bf6e09f0fe0d345876311fb5
Change-Id: Ica813c6068eca18d83bf2f5f94fac4a1e1996c36
@apache deleted a comment from hadoop-yetus Jun 1, 2023
@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 51s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 12 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 42m 36s Maven dependency ordering for branch
+1 💚 mvninstall 22m 40s trunk passed
+1 💚 compile 18m 38s trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 17m 21s trunk passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 4m 3s trunk passed
+1 💚 mvnsite 4m 2s trunk passed
+1 💚 javadoc 2m 48s trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 2m 24s trunk passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 6m 47s trunk passed
+1 💚 shadedclient 24m 17s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 26s Maven dependency ordering for patch
+1 💚 mvninstall 2m 20s the patch passed
+1 💚 compile 18m 13s the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 18m 13s the patch passed
+1 💚 compile 16m 42s the patch passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 javac 16m 42s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 4m 11s /results-checkstyle-root.txt root: The patch generated 5 new + 32 unchanged - 1 fixed = 37 total (was 33)
+1 💚 mvnsite 3m 52s the patch passed
+1 💚 javadoc 2m 38s the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 2m 24s the patch passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 7m 23s the patch passed
+1 💚 shadedclient 24m 10s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 49s hadoop-common in the patch passed.
+1 💚 unit 7m 31s hadoop-mapreduce-client-core in the patch passed.
+1 💚 unit 2m 40s hadoop-aws in the patch passed.
+1 💚 unit 2m 18s hadoop-azure in the patch passed.
+1 💚 asflicense 0m 52s The patch does not generate ASF License warnings.
266m 23s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5519/16/artifact/out/Dockerfile
GITHUB PR #5519
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux cc769a5d2de9 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / b289707
Default Java Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5519/16/testReport/
Max. process+thread count 1236 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-aws hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5519/16/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 50s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 12 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 19m 39s Maven dependency ordering for branch
+1 💚 mvninstall 23m 36s trunk passed
+1 💚 compile 19m 7s trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 16m 49s trunk passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 4m 13s trunk passed
+1 💚 mvnsite 4m 2s trunk passed
+1 💚 javadoc 2m 55s trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 2m 28s trunk passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 6m 22s trunk passed
+1 💚 shadedclient 25m 9s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 25s Maven dependency ordering for patch
+1 💚 mvninstall 2m 24s the patch passed
+1 💚 compile 18m 25s the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 18m 25s the patch passed
+1 💚 compile 16m 59s the patch passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 javac 16m 59s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 4m 2s /results-checkstyle-root.txt root: The patch generated 5 new + 32 unchanged - 1 fixed = 37 total (was 33)
+1 💚 mvnsite 3m 58s the patch passed
+1 💚 javadoc 2m 49s the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 2m 27s the patch passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 6m 56s the patch passed
+1 💚 shadedclient 24m 53s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 19m 10s hadoop-common in the patch passed.
+1 💚 unit 7m 16s hadoop-mapreduce-client-core in the patch passed.
+1 💚 unit 2m 35s hadoop-aws in the patch passed.
+1 💚 unit 2m 13s hadoop-azure in the patch passed.
+1 💚 asflicense 0m 54s The patch does not generate ASF License warnings.
246m 5s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5519/17/artifact/out/Dockerfile
GITHUB PR #5519
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux d83fcb100c00 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 355fa35
Default Java Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5519/17/testReport/
Max. process+thread count 1277 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-aws hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5519/17/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

The other new ones are related to test methods whose numbering breaks
the style checker's requirements
* test_0440_validateSuccessFiles
* test_0450_validationDetectsFailures

Change-Id: I36267e4d9912873e457126341385f866acd6d148
@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 49s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 12 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 20m 36s Maven dependency ordering for branch
+1 💚 mvninstall 22m 44s trunk passed
+1 💚 compile 17m 27s trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 15m 45s trunk passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 4m 0s trunk passed
+1 💚 mvnsite 3m 48s trunk passed
+1 💚 javadoc 2m 50s trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 2m 29s trunk passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 6m 13s trunk passed
+1 💚 shadedclient 23m 58s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 23s Maven dependency ordering for patch
+1 💚 mvninstall 2m 15s the patch passed
+1 💚 compile 16m 34s the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 16m 34s the patch passed
+1 💚 compile 15m 41s the patch passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 javac 15m 41s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 3m 53s /results-checkstyle-root.txt root: The patch generated 2 new + 32 unchanged - 1 fixed = 34 total (was 33)
+1 💚 mvnsite 3m 44s the patch passed
+1 💚 javadoc 2m 41s the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 2m 30s the patch passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 6m 49s the patch passed
+1 💚 shadedclient 24m 4s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 27s hadoop-common in the patch passed.
+1 💚 unit 7m 16s hadoop-mapreduce-client-core in the patch passed.
+1 💚 unit 2m 32s hadoop-aws in the patch passed.
+1 💚 unit 2m 14s hadoop-azure in the patch passed.
+1 💚 asflicense 0m 53s The patch does not generate ASF License warnings.
236m 22s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5519/18/artifact/out/Dockerfile
GITHUB PR #5519
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux 6f59fa66c594 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 070c788
Default Java Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5519/18/testReport/
Max. process+thread count 3134 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-aws hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5519/18/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran (Contributor Author)

@mehakmeet @cnauroth need a final review here.

we also need a google gcs test suite somewhere, don't we?

@mehakmeet (Contributor) left a comment

LGTM, one doubt about new constants, else we're good to go in with this. Really nice implementation btw.

Comment on lines +69 to +77
public static final int WRITER_SHUTDOWN_TIMEOUT_SECONDS = 60;

/**
* How long should trying to queue a write block before giving up
* with an error?
* This is a safety feature to ensure that if something has gone wrong
* in the queue code the job fails with an error rather than just hangs
*/
public static final int WRITER_QUEUE_PUT_TIMEOUT_MINUTES = 10;
Contributor

Sorry, I think I missed these constants being added. Don't you think these should be configurable, just as a fallback, so that these values never cause any issues and are easily changeable? I guess if it waits for this long then we can assume it's just hanging as well. Your call on whether to make it configurable.

Contributor Author

My view: if things are this bad it is a disaster and the job is failing anyway, as either the thread concurrency is broken or the local FS has failed.

@steveloughran merged commit 7a45ef4 into apache:trunk Jun 9, 2023
3 checks passed
@steveloughran (Contributor Author)

merged to trunk, now backporting to 3.3

steveloughran added a commit to steveloughran/hadoop that referenced this pull request Jun 9, 2023
This modifies the manifest committer so that the list of files
to rename is passed between stages as a file of
writeable entries on the local filesystem.

The map of directories to create is still passed in memory;
this map is built across all tasks, so even if many tasks
created files, if they all write into the same set of directories
the memory needed is O(directories) with the
task count not a factor.

The _SUCCESS file reports on heap size through gauges.
This should give a warning if there are problems.

Contributed by Steve Loughran

Change-Id: Ic7707d2dde9daa28cd3a927e49972c15313336ad
steveloughran added a commit that referenced this pull request Jun 12, 2023
This modifies the manifest committer so that the list of files
to rename is passed between stages as a file of
writeable entries on the local filesystem.

The map of directories to create is still passed in memory;
this map is built across all tasks, so even if many tasks
created files, if they all write into the same set of directories
the memory needed is O(directories) with the
task count not a factor.

The _SUCCESS file reports on heap size through gauges.
This should give a warning if there are problems.

Contributed by Steve Loughran
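The commit text above notes that the directories-to-create map is still built in memory across all tasks, so its size is O(directories) rather than O(files). A minimal sketch of that deduplication idea follows; it is illustrative only, and the committer's real DirEntry map carries more state than this.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class DirectoryMapSketch {

  // keyed by destination directory path: many files from many tasks collapse
  // into one entry per directory
  private final Map<String, Boolean> dirsToCreate = new ConcurrentHashMap<>();

  /** Record the parent directory of a file that will be renamed into place. */
  void recordParentDir(String dirPath) {
    dirsToCreate.putIfAbsent(dirPath, Boolean.TRUE);
  }

  /** Memory footprint tracks this count, not the file count. */
  int directoryCount() {
    return dirsToCreate.size();
  }
}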