HADOOP-13560 S3A to support huge file writes and operations -with tests #125

Closed

Conversation

steveloughran
Contributor

Adds

Scale tests for S3A huge file support:

* always running at the MB size (maybe best to make this optional)
  * configurable to bigger sizes in the auth-keys XML or on the build command line with -Dfs.s3a.scale.test.huge.filesize=1000 (see the sketch after this list)
* limited to upload, seek, read, rename, delete. The JUnit test cases are explicitly set up to run in order here.
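For illustration, a hedged sketch of how the size could also be pinned in the auth-keys XML. The property name is the one quoted above; treating the value as a plain size figure (1000, as in the -D example) matches the PR text, but the exact unit and whether auth-keys.xml is the right place for a per-developer override are assumptions.

```xml
<!-- Sketch only: fs.s3a.scale.test.huge.filesize is the property named above;
     the value 1000 mirrors the -D example, but its unit is an assumption here. -->
<property>
  <name>fs.s3a.scale.test.huge.filesize</name>
  <value>1000</value>
</property>
```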

New scalable output stream for writing, S3ABlockOutputStream:

* always saves in incremental blocks as writes proceed, with block size == partition size.
* supports the fast output stream's in-memory buffer code (for regression testing).
* supports a back end which buffers blocks in files, using round-robin disk allocation. As such, write/read bandwidth is limited to aggregate HDD bandwidth.
* adds extra failure resilience as testing throws up failure conditions (network timeouts, no response from the server on multipart commit, etc.).
* adds instrumentation, including callbacks from the AWS SDK to update gauges and counters (in progress).

What we have here is essentially something that can replace both the classic "save to a file, upload at the end" stream and the fast "store it all in RAM and hope there's space" stream. It should offer incremental upload for faster output of larger files compared to the classic file stream, with the scalability the fast one lacks, and the instrumentation to show what's happening.
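As a rough illustration (not the patch's documentation), enabling the new stream might look like the core-site.xml fragment below. fs.s3a.block.output is the option named later in this PR, and treating it as a simple boolean switch is an assumption; fs.s3a.buffer.dir is the existing S3A property for local buffer space, and using it for the buffer-by-HDD backend is likewise an assumption.

```xml
<!-- Sketch only: assumes fs.s3a.block.output is a boolean switch for the new
     S3ABlockOutputStream, and that fs.s3a.buffer.dir (an existing S3A property)
     supplies the local directories used when buffering blocks on disk. -->
<property>
  <name>fs.s3a.block.output</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.buffer.dir</name>
  <value>/tmp/s3a</value>
</property>
```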

@@ -183,6 +199,8 @@
<include>**/ITestS3AFileSystemContract.java</include>
<include>**/ITestS3AMiniYarnCluster.java</include>
<include>**/ITest*Root*.java</include>
<include>**/ITestS3AFileContextStatistics.java</include>

Moved this line down as it was failing sometimes.

…ing on inside S3A, including a gauge of active request counts. +more troubleshooting docs. The fast output stream will retry on errors
Block streaming is in; testing at moderate scale (<100 MB).

You can choose between buffer-by-RAM (the current fast uploader) and buffer-by-HDD. In a test using an SSD and remote S3 I got ~1.38 MB/s of bandwidth, and something similar (1.44 MB/s) with RAM. But we shouldn't run out of heap on the HDD option. RAM buffering uses the existing byte-array buffers, to ease source-code migration off FastUpload (which is still there, for now).
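To show what this stream serves, here is a hedged Java sketch of an application writing a large object through the standard FileSystem API. The bucket name, write sizes, and the progress callback are illustrative assumptions; nothing here is specific to the patch beyond the fact that the write path goes through an S3A output stream.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HugeFileWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical destination; any s3a:// URI with valid credentials would do.
    Path dest = new Path("s3a://example-bucket/scale/hugefile.bin");
    FileSystem fs = FileSystem.get(dest.toUri(), conf);

    byte[] block = new byte[1 << 20];                       // 1 MB per write call
    try (FSDataOutputStream out =
             fs.create(dest, () -> System.out.println("progress callback"))) {
      for (int i = 0; i < 256; i++) {                       // ~256 MB in total
        out.write(block);                                   // data is buffered block by block as it arrives
      }
    }                                                       // close() finishes the upload
  }
}
```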

* I do plan to add pooled ByteBuffers.
* Add metrics of total and ongoing upload, including tracking what quantity of the outstanding block data has actually been uploaded.
* Supersede the fast output stream.
* Run tests, tune outcomes (especially race conditions in multipart operations).
* More debug statements.
* Fixed the name of the fs.s3a.block.output option in core-default and the docs. Thanks Rajesh!
* More attempts at managing the close() operation rigorously. No evidence this is the cause of the problem Rajesh saw, though.
* Rearranged the layout of code in S3ADataBlocks so associated classes are adjacent.
* Retry on multipart commit, adding sleep statements between retries (see the sketch after this list).
* New Progress log for logging progress at debug level in S3A. Why? Because logging events every 8 KB gets too chatty when debugging many-MB uploads.
* Gauges of active block uploads wired up.
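The multipart-commit retry mentioned above could be structured along these lines. This is a minimal sketch: the class name, attempt count, and sleep interval are assumptions, not the values or code in the patch.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class MultipartCommitRetrySketch {

  /**
   * Invoke the operation, retrying with a fixed sleep between attempts.
   * (Illustrative only; not the retry logic in the patch.)
   * @param attempts number of tries; must be at least 1
   */
  public static <T> T retry(Callable<T> operation, int attempts, long sleepMillis)
      throws IOException {
    IOException lastFailure = null;
    for (int i = 0; i < attempts; i++) {
      try {
        return operation.call();
      } catch (Exception e) {
        lastFailure = (e instanceof IOException) ? (IOException) e : new IOException(e);
        if (i + 1 < attempts) {
          try {
            Thread.sleep(sleepMillis);            // back off before the next attempt
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();   // preserve the interrupt status
            break;
          }
        }
      }
    }
    if (lastFailure == null) {
      throw new IOException("no attempts were made");
    }
    throw lastFailure;
  }
}
```

A caller would wrap the SDK's multipart-completion call in such a helper, so a transient no-response from the server results in a delayed re-attempt rather than an immediate failure of the whole upload.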
@steveloughran steveloughran deleted the s3/HADOOP-13560-5GB-blobs branch October 7, 2016 17:45