MAPREDUCE-7435. Manifest Committer OOM on abfs #5519
Conversation
Force-pushed from c0fc290 to 720f120
Thanks, @steveloughran. This mostly looks good. I entered a few comments.
@Override
public void setMeanStatistic(final String key, final MeanStatistic value) {
meanStatistics().put(key, value);
?
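That is, a minimal sketch of the setter body being suggested, assuming the store exposes its mean-statistics map through the existing `meanStatistics()` accessor:

```java
@Override
public void setMeanStatistic(final String key, final MeanStatistic value) {
  // replace or insert the mean statistic for this key
  meanStatistics().put(key, value);
}
```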
public static void addHeapInformation(IOStatisticsSetters ioStatisticsSetters,
    String stage) {
  // force a gc. bit of bad form but it makes for better numbers
  System.gc();
This triggered a Spotbugs warning. Do you think the forced GC should go behind a config flag, default off, and turned on in the tests?
yes, I will do that
GC pulled out of production code & only invoked in test code
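A minimal sketch of that resolution (hypothetical test helper, not the patch's actual code): the forced GC stays in test code, right before sampling the heap.

```java
// test-side helper: force a GC so heap numbers are more stable,
// keeping System.gc() out of the production addHeapInformation() path
static long usedHeapAfterGc() {
  System.gc();  // bad form in production; tolerable in a test
  final Runtime rt = Runtime.getRuntime();
  return rt.totalMemory() - rt.freeMemory();
}
```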
// needed to avoid this test contaminating others in the same JVM
FileSystem.closeAll();
conf.set(fileImpl, fileImplClassname);
conf.set(fileImpl, fileImplClassname);
Duplicated line?
I wasn't sure why we need to set the conf here in the finally block. Did something mutate it after line 761, and now we need to restore it?
duplicate. cut
The reason it is in is not so much any change in the PR as a condition it surfaced which was already there: this test changes the default "file" fs, and in some test runs that wasn't being reset, so other tests were failing later for no obvious reason.
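For illustration, the cleanup pattern being described might look like this sketch, using the names from the quoted diff (test body elided):

```java
try {
  // ... test body which rebinds the default "file" filesystem ...
} finally {
  // restore the original binding and purge cached instances so later
  // tests in the same JVM see a clean slate
  conf.set(fileImpl, fileImplClassname);
  FileSystem.closeAll();
}
```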
@@ -63,6 +81,10 @@ public void setup() throws Exception {
      .isGreaterThan(0);
  }

  public long heapSize() {
    return Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
Nitpick: some indentation issues here.
success.save(summaryFS, path, true);
LOG.info("Saved summary to {}", path);
ManifestPrinter showManifest = new ManifestPrinter();
ManifestSuccessData manifestSuccessData =
Nitpick: some indentation issues here.
@cnauroth - thanks for the comments; will update. I've converted this to a draft as I am working on the next step of this: streaming the list of files to rename from each manifest into a SequenceFile saved to the local FS; the rename stage reads that in and spreads the renames across the worker pool, maybe in batches. This will eliminate the need to store the list of files to rename in memory at all, so there is no need to worry about the number of files or path lengths. The file will be on the local FS, so on an SSD machine fairly quick to write and read back, especially if the OS buffers well/is optimised for transient files.
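As a rough sketch of that design, assuming `FileEntry` is the committer's Writable record type and with a hypothetical local path and entry source, the spill-and-stream cycle with Hadoop's `SequenceFile` API could look like:

```java
Configuration conf = new Configuration();
Path entryFile = new Path("file:///tmp/job-0001-renames.seq"); // hypothetical

// manifest-load phase: append each entry instead of keeping it on the heap
try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
    SequenceFile.Writer.file(entryFile),
    SequenceFile.Writer.keyClass(NullWritable.class),
    SequenceFile.Writer.valueClass(FileEntry.class))) {
  for (FileEntry entry : entriesFromOneManifest) { // hypothetical source
    writer.append(NullWritable.get(), entry);
  }
}

// rename stage: stream the entries back and hand them to the worker pool
try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
    SequenceFile.Reader.file(entryFile))) {
  FileEntry entry = new FileEntry();
  while (reader.next(NullWritable.get(), entry)) {
    // submit the rename described by entry to the worker pool, maybe batched
  }
}
```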
💔 -1 overall
This message was automatically generated.
Force-pushed from 2d0dcce to de8c6e5
Sounds good. I'll wait to hear from you when you want code review again. Thanks, Steve.
Force-pushed from de8c6e5 to 1358391
updated pr has been run through azure, with stats of a terasort being
once abfs adds iostats context update of input stream reads, we could collect and add that into the stats too; not worrying about it until then.
tested azure cardiff, which is a slow test run today (no network). It would be nice to move the LoadManifests test into the parallel bit of the test run, but trying to do it seems to blow up too much stuff (the abfs parallel test phase runs individual test cases in parallel, see)
parallel test running failed everywhere, but I have improved ITestAbfsLoadManifestsStage performance
brings test time down to 10s locally. IOStats does imply many MB of data is being PUT/GET.
testrun failure is the usual intermittent failure of the slow tests, showing some transient failure happened and was recovered from. wifi was playing up all evening
spotbugs complaint is valid, but nothing to do with this PR
Force-pushed from 039648b to b5166b6
  throw new UncheckedIOException(e);
} catch (InterruptedException e) {
  // being stopped implicitly
  LOG.debug("interrupted", e);
set stop
done in finally
Looks really interesting and well-written.
Have started to look at some bits of core functionality, moving towards tests afterwards.
addHeapInformation(heapInfo, "setup");
// load the manifests
final StageConfig stageConfig = getStageConfig();
LoadManifestsStage.Result result = new LoadManifestsStage(stageConfig).apply(
suggestion: We can include a duration tracker to know the time taken to load manifests in the final stats.
already done in AbstractJobOrTaskStage
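For reference, that pattern — wrapping stage execution in a duration tracker so the elapsed time lands in the statistics — could look roughly like this sketch (`OP_LOAD_ALL_MANIFESTS` and `arguments` are assumed names; the stage's statistics store is assumed to act as the `DurationTrackerFactory`):

```java
// IOStatisticsBinding.trackDuration runs the callable and records its
// elapsed time against the named statistic
LoadManifestsStage.Result result = trackDuration(getIOStatistics(),
    OP_LOAD_ALL_MANIFESTS, () ->
        new LoadManifestsStage(stageConfig).apply(arguments));
```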
}
if (active.get()) {
  try {
    queue.put(new QueueEntry(Actions.write, entries));
Suggestion: We could use `queue.offer(E e, long timeout, TimeUnit unit)`, so that we wait for the queue to have capacity to add the entry while also having a timeout in case something goes wrong. We can catch the interrupt and throw/swallow accordingly if it exceeds the timeout?
hmmm. I would rather have the writer work. If we have a timeout here it should be really big, as we want to cope with all threads blocking for a while.
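A sketch of the bounded-put compromise that was eventually adopted (constant name taken from the final patch; interrupt handling simplified):

```java
// block, but not forever: a wedged queue should fail the job, not hang it
boolean queued = queue.offer(new QueueEntry(Actions.write, entries),
    WRITER_QUEUE_PUT_TIMEOUT_MINUTES, TimeUnit.MINUTES);
if (!queued) {
  throw new IOException("Timed out queueing " + entries.size() + " entries");
}
```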
if (active.get()) {
  try {
    queue.put(new QueueEntry(Actions.write, entries));
    LOG.debug("Queued {}", entries.size());
Some info about the entry that was queued in the LOG?
it's a list...what do we want to add?
* @param path path to create
* @return true if dir created/found
* @param dirEntry dir to create
* @return Outcome
nit: Better javadocs for return, "State of the directory in the dir map" or something?
* This is primarily for tests or when submitting work into a TaskPool.
* equivalent to
* <pre>
* for(long l = start, l < finis; l++) yield l;
typo: "l < excludedFinish"
Thread.currentThread().setName("EntryIOWriter");
try {
  while (!stop.get()) {
    final QueueEntry queueEntry = queue.take();
Seems like we could wait indefinitely on this. How about `poll(long timeout, TimeUnit unit)`?
I don't know how long we should wait here? It assumes that yes, the caller will eventually stop the write.
fixed to 10 minutes, returning false. caller gets to react (which it will do by raising an IOE but giving anything raised by the writer thread priority)
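Sketched on the consuming side (simplified; loop body elided; constant name assumed):

```java
while (!stop.get()) {
  // bounded take: a null return means the poll timed out
  final QueueEntry queueEntry =
      queue.poll(WRITER_QUEUE_PUT_TIMEOUT_MINUTES, TimeUnit.MINUTES);
  if (queueEntry == null) {
    return false;  // caller reacts, normally by raising an IOException
  }
  // ... write the queued entries ...
}
```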
// signal queue closure by queuing a stop option.
// this is added at the end of the list of queued blocks,
// of which are written.
try {
LOG something like "Tasks left in queue = capacity - queue.remainingCapacity()" for better logging. We could do something like this while offering as well, but it seems apt for close().
It can help limit the amount of memory consumed during manifest load during
job commit.
The maximumum number of loaded manifests will be
typo: "maximum"
* Add heap information to gauges in _SUCCESS, which includes pulling up part of impl.IOStatisticsStore into a public IOStatisticsSetters interface with the put{Counter, Gauge, etc} methods only.
* TestLoadManifests scaled up with #of tasks and #of files in each task to generate more load.
* Code to build up dir map during load phase; not wired up.

Summary: abfs uses a lot more heap during the load phase than file, possibly due to buffering, but doing a pipeline for processing the results isn't sustainable. Either two-phase:
* phase 1: build up the dir list, discard manifests after each load
* phase 2: load manifests and rename incrementally
Or: unified, with some complicated directory creation process to ensure that each task's dirs exist before its rename begins.

Change-Id: I8595c083435e3d4df27343599687677abfc1c013
* DirEntry and FileEntry are writeable.
* LoadManifests to take a path to where to cache the rename list; not yet wired up.

Change-Id: Ibd992b179bd0bcf26a39ae4ce5407257ecbfcb10
This is a big change, with tests so far verifying that core read/write is happy. Current state: simple read/write good, async queue not yet tested.

Change-Id: I7cb1443024780b355a8f3bb96fbfe08d8608d968
interim commit

Change-Id: I80bb4e72c1029baad8fb87d8c9287b08c0b000f4
...but not tested the job commit yet.

Change-Id: I0f54ede94e41592558468df1c87f4a39d2461223
...but not tested the job commit yet.

Change-Id: I4d50636542673a3f25a7ab363df1b1bd221216ae
* TestEntryFileIO extended for this.
* ABFS terasort test happy!

Change-Id: I068861973114d9947f3d22eaf32a6ee3b7ca8fa2

TODO: fault injection on the writes
* Validation also uses manifest entries (and so works!).
* Testing expects this.
* Tests of IOStats.
* Tests of new RemoteIterators.

Change-Id: I4cfb308d4b08f1f775cfdbe2df6f8ff07ac6bc54
Change-Id: I2008d31bff3af59396a04dddc1b9357b1a812294
* Moved RangeExcludingLongIterator into RemoteIterators, added test.
* Address checkstyle.
* Address spotbugs.
* Address deprecation.
* ValidateRenameFilesStage doesn't validate etags on wasb; helps address a JIRA about hadoop-azure testing.

Change-Id: Id6507d79f8d3cfa434afb65bfe9fc7539a7c1cf5
* Back to the original 200 manifest files.
* Increase worker pool and buffer queue size (more significant before reducing the manifest count).

Brings test time down to 10s locally. IOStats does imply many MB of data is being PUT/GET, so it is good to keep small so people running with less bandwidth don't suffer. Maybe, maybe, the size could switch with a -Dscale?

Change-Id: I49d201d7af7434797ab6fff5831a0f899c5c4185
…leanup

Change-Id: If043263676c4d5694065e7ec35954a7f66c04d90
ok, rebasing and pushing up with a commit to address most of the changes. Not addressed: having timeouts on the offer/take of the EntryWriter in the queue. I agree it is safest if we do add a timeout here, just as an emergency.
Change-Id: Ib93ba8ba632135a05da126a75f34e78bd381cf2a
Force-pushed from b5166b6 to 8e83fdc
latest patch tested against azure cardiff, timeout unrelated
I wonder if we should just increase the timeout there?
@mehakmeet when you get time, can you review this? I would like to get this in before I forget about it.
I think this is very close to getting merged. Had one doubt with respect to the testing: was the large load of manifests run against the old way of handling manifest files, to see the OOM errors? Tests look good overall; just wanted to know whether that limit was being hit and is now rectified with the new approach.
// do an explicit close to help isolate any failure.
SequenceFile.Writer writer = createWriter();
writer.append(NullWritable.get(), source);
writer.flush();
just a doubt here. Do we need to explicitly flush the writer before closing? Won't that be done in close too? If yes, we can test both mechanisms by just saying writer.close()
just being rigorous. on hdfs close() doesn't actually sync the data, just flushes it, FWIW.
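On the HDFS point: `close()` flushes client-side buffers but does not guarantee durability; the `Syncable` calls do. A quick illustration against `FSDataOutputStream` (not the test's code):

```java
out.flush();   // flush local buffers into the stream
out.hflush();  // push data to the datanodes; visible to new readers
out.hsync();   // like hflush(), plus a flush to disk on each datanode
out.close();   // flushes, but does not hsync
```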
// now use the iterator to access it.
List<FileEntry> files = new ArrayList<>();
Assertions.assertThat(foreach(iterateOverEntryFile(), files::add))
    .isEqualTo(0);
Can you clarify which value equates to "0" in this test by some comments?
added a description
    .isEqualTo(2);

// unknown value
ioStatistics.setCounter("c2", 3);
how about we assert that this unknown counter shouldn't exist in the counters map? Just to test the no-op, I guess.
let's see...
had to parameterize the test so we assert that on snapshots they do accrue, but on the other impls they don't
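Roughly what that parameterization asserts, sketched against `IOStatisticsSnapshot`, which holds its statistics in dynamic maps (assertion style assumed):

```java
IOStatisticsSnapshot snapshot = new IOStatisticsSnapshot();
snapshot.setCounter("c2", 3);  // snapshots are dynamic: the key is created
Assertions.assertThat(snapshot.counters()).containsEntry("c2", 3L);
// on a store created with a fixed set of keys, the same call is a no-op
```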
@@ -141,6 +145,9 @@
   */
  private StageConfig ta11Config;

  private LoadedManifestData
javadocs for consistency.
done. also needed to be static to work properly, so doc that after fixing it
@@ -82,6 +85,8 @@ public class TestRenameStageFailure extends AbstractManifestCommitterTest {
  /** resilient commit expected? */
  private boolean resilientCommit;

  private EntryFileIO entryFileIO;
javadocs for consistency.
That is: the success file contains entries which aren't present in the FS.

Fixes:
* Find the bit in an earlier test where the file was being deleted, and restore it (and re-order it too!).
* LoadManifestsStage doesn't optionally return manifests for testing; tests modified to match.
* EntryFileIO will report a timeout after 10 minutes if the queue blocks somehow.
* LoadManifestsStage handles this timeout and will raise it as a failure, but only secondary to any exception raised by the writer thread.
* SUCCESS file can be configured with #of files to list; allows tests to assert on many thousands of files, although in production it is still fixed to a small number for performance reasons.

Change-Id: I642c1178928de427bf6e09f0fe0d345876311fb5
Change-Id: Ica813c6068eca18d83bf2f5f94fac4a1e1996c36
🎊 +1 overall
This message was automatically generated.
🎊 +1 overall
This message was automatically generated.
The other new ones are related to test methods whose numbering breaks the style checker's requirements:
* test_0440_validateSuccessFiles
* test_0450_validationDetectsFailures

Change-Id: I36267e4d9912873e457126341385f866acd6d148
🎊 +1 overall
This message was automatically generated.
@mehakmeet @cnauroth need a final review here. we also need a google gcs test suite somewhere, don't we?
LGTM, one doubt about new constants, else we're good to go in with this. Really nice implementation btw.
public static final int WRITER_SHUTDOWN_TIMEOUT_SECONDS = 60;

/**
 * How long should trying to queue a write block before giving up
 * with an error?
 * This is a safety feature to ensure that if something has gone wrong
 * in the queue code the job fails with an error rather than just hangs.
 */
public static final int WRITER_QUEUE_PUT_TIMEOUT_MINUTES = 10;
Sorry, I think I missed these constants being added. Don't you think these should be configurable, just as a fallback, so that these values never cause any issues and are easily changeable? I guess if it waits this long then we can assume it's just hanging as well. Your call on whether it should be configurable.
my view: if things are this bad it is a disaster and the job is failing anyway, as either the thread concurrency is broken or the local fs has failed.
merged to trunk, now backporting to 3.3
This modifies the manifest committer so that the list of files to rename is passed between stages as a file of writeable entries on the local filesystem.

The map of directories to create is still passed in memory; this map is built across all tasks, so even if many tasks created files, if they all write into the same set of directories the memory needed is O(directories), with the task count not a factor.

The _SUCCESS file reports on heap size through gauges. This should give a warning if there are problems.

Contributed by Steve Loughran

Change-Id: Ic7707d2dde9daa28cd3a927e49972c15313336ad
Summary: abfs uses a lot more heap during the load phase than file; possibly due to buffering.