This repository has been archived by the owner on May 12, 2021. It is now read-only.

APEXMALHAR-2022 Developed S3 Output Module #483

Merged
merged 1 commit into from
Dec 1, 2016

Conversation

chaithu14
Contributor

No description provided.

if (currentWindowId <= windowDataManager.getLargestCompletedWindow()) {
return;
}
S3BlockMetaData metaData = blockInfo.get(tuple.getBlockId());
Contributor

Using just blockId as the key here might cause problems: given the way block ids are computed, blocks in different files may have the same block id. I'd suggest including something like the filename as part of the key.

Contributor Author

Yes. You are right. I made the changes as you suggested.
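
For readers following the thread, here is a minimal sketch of the composite-key idea discussed above. The class name `BlockKeySketch`, the method `uniqueBlockKey`, and the `|` separator are assumptions made for illustration; the PR's actual key format is not shown in this conversation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the PR's actual implementation.
final class BlockKeySketch
{
  private BlockKeySketch()
  {
  }

  // Combines the file path with the block id so equal ids from different files do not collide.
  static String uniqueBlockKey(String filePath, long blockId)
  {
    return filePath + "|" + blockId;
  }

  public static void main(String[] args)
  {
    Map<String, Integer> partNumbers = new HashMap<>();
    // Both files produce block id 0, yet the composite keys differ.
    partNumbers.put(uniqueBlockKey("/data/a.txt", 0L), 1);
    partNumbers.put(uniqueBlockKey("/data/b.txt", 0L), 1);
    System.out.println(partNumbers.size()); // prints 2
  }
}
```

Keyed by block id alone, the second `put` would have overwritten the first entry.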

}
processedBlocks.add(tuple.getBlockId());
long partSize = tuple.getRecord().length;
PartETag pt = null;
Contributor

rename variable => partETag?

Contributor Author

Done.

* Send the CompleteMultipartUploadRequest to S3 if all the blocks of a file are uploaded into S3.
* @param keyName file to upload into S3
*/
private void emitFileMerge(String keyName)
Contributor

Name seems misleading. Can we rename it to something like: verifyAndEmitFileMerge ?

Contributor Author

Done.

return;
}
if (currentWindowId <= windowDataManager.getLargestCompletedWindow()) {
uploadedFiles.add(keyName);
Contributor

Files can get uploaded across windows. We cannot assume this keyName is uploaded.

Contributor Author

Yes. I fixed this.

uploadFileMetadata.getFileMetadata().getNumberOfBlocks() != partETags.size()) {
return;
}
if (currentWindowId <= windowDataManager.getLargestCompletedWindow()) {
Contributor

During recovery, no need to do anything as the window is already processed.
We can move this if block to the beginning of the method.

Contributor Author

Done.
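
The reordering requested above can be sketched as follows. This is a simplified stand-in, not the operator from the PR: `ReplayGuardSketch` is a hypothetical class, `largestCompletedWindow` is a fixed field standing in for `windowDataManager.getLargestCompletedWindow()`, and a plain list stands in for the merge request sent to S3.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the merger operator; largestCompletedWindow mimics
// windowDataManager.getLargestCompletedWindow() and is fixed here for simplicity.
final class ReplayGuardSketch
{
  private final long largestCompletedWindow;
  private long currentWindowId;
  final List<String> mergeRequests = new ArrayList<>();

  ReplayGuardSketch(long largestCompletedWindow)
  {
    this.largestCompletedWindow = largestCompletedWindow;
  }

  void beginWindow(long windowId)
  {
    currentWindowId = windowId;
  }

  void verifyAndEmitFileMerge(String keyName, int numberOfBlocks, int uploadedParts)
  {
    // Recovery check first: a replayed window was already processed, so do nothing.
    if (currentWindowId <= largestCompletedWindow) {
      return;
    }
    // Not all blocks of the file have been uploaded yet.
    if (numberOfBlocks != uploadedParts) {
      return;
    }
    mergeRequests.add(keyName);
  }
}
```

Putting the window check first keeps a replayed window from issuing a second merge request for the same file.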

}

@Override
public void committed(long windowId)
Contributor

Can we save the uploadedFiles in the WAL every window? This will help us to clear up uploadParts and fileMetaDatas in the end window.

Contributor Author

Done.

Contributor

I think you missed this..

Contributor Author

No. I made the changes.

for (int i = 0; i < blocks.length; i++) {
blockInfo.put(blocks[i], new S3BlockMetaData(tuple.getKeyName(), tuple.getUploadId(), i + 1));
}
if (blocks.length < 1) {
Contributor

Refactor?
if (blocks.length > 0) { blockInfo.get(blocks[blocks.length - 1]).setLastBlock(true); }

Contributor Author

Done.
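
A self-contained sketch of the suggested refactor, for reference. `LastBlockSketch` and `BlockMeta` are hypothetical stand-ins (the real code uses `S3BlockMetaData` and its `setLastBlock` setter), and the map is keyed by block id only to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

final class LastBlockSketch
{
  // Simplified stand-in for S3BlockMetaData.
  static final class BlockMeta
  {
    final String keyName;
    final int partNumber;
    boolean lastBlock;

    BlockMeta(String keyName, int partNumber)
    {
      this.keyName = keyName;
      this.partNumber = partNumber;
    }
  }

  static Map<Long, BlockMeta> indexBlocks(String keyName, long[] blocks)
  {
    Map<Long, BlockMeta> blockInfo = new HashMap<>();
    for (int i = 0; i < blocks.length; i++) {
      // Part numbers are 1-based, as in the loop under review.
      blockInfo.put(blocks[i], new BlockMeta(keyName, i + 1));
    }
    // Positive guard instead of an early return on blocks.length < 1.
    if (blocks.length > 0) {
      blockInfo.get(blocks[blocks.length - 1]).lastBlock = true;
    }
    return blockInfo;
  }
}
```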

protected transient AmazonS3 s3Client;
private transient Set<Long> processedBlocks;
private transient long currentWindowId;
private transient List<AbstractBlockReader.ReaderRecord<Slice>> recordData;
Contributor

Rename to waitingTuples ?

Contributor Author

Done.

/**
* Process the blocks which are in wait state.
*/
private void processWaitBlocks()
Contributor

This method seems idempotent. Can we call this during handleIdleTime() as well? This may shorten the time spent in endWindow().

Contributor Author

Blocks have to wait until the file metadata is received. I added this method call after the file metadata of the waiting block is received.

Contributor

I still think we should add it to handleIdleTime() and remove it from the processUploadFileMetadata() call.
Most of the time, it will have the metadata before it receives the actual data, right? So only in some cases will blocks be waiting for metadata. It is better to do this in the handleIdleTime and endWindow calls than to block the operator thread to process waiting blocks.

Contributor Author

Done.
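
The arrangement agreed on in this thread might look roughly like the sketch below. It uses stand-in types only: `WaitingBlocksSketch` is hypothetical, strings replace the real tuples, and a list append replaces the actual S3 upload. The metadata handler deliberately does not drain the waiting list; only `handleIdleTime()` and `endWindow()` do.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

final class WaitingBlocksSketch
{
  private final Set<String> filesWithMetadata = new HashSet<>();
  // Each entry is {filePath, blockData}; a stand-in for the real block tuples.
  private final List<String[]> waitingTuples = new ArrayList<>();
  final List<String> uploaded = new ArrayList<>();

  void processFileMetadata(String filePath)
  {
    filesWithMetadata.add(filePath);
  }

  void processBlock(String filePath, String blockData)
  {
    if (filesWithMetadata.contains(filePath)) {
      uploaded.add(filePath + ":" + blockData);
    } else {
      // Metadata has not arrived yet, so park the block.
      waitingTuples.add(new String[] {filePath, blockData});
    }
  }

  // Safe to call repeatedly: a block is removed from the list once it is uploaded.
  private void processWaitBlocks()
  {
    Iterator<String[]> iterator = waitingTuples.iterator();
    while (iterator.hasNext()) {
      String[] tuple = iterator.next();
      if (filesWithMetadata.contains(tuple[0])) {
        uploaded.add(tuple[0] + ":" + tuple[1]);
        iterator.remove();
      }
    }
  }

  void handleIdleTime()
  {
    processWaitBlocks();
  }

  void endWindow()
  {
    processWaitBlocks();
  }
}
```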

if (processedBlocks.contains(tuple.getBlockId())) {
return;
}
processedBlocks.add(tuple.getBlockId());
Contributor

Why do we need processedBlocks? Can't we directly make changes to blockInfo?

Contributor Author

Yes, we can. I added this because there was an issue in FSInputModule where the same block id was being emitted. The PR for that fix is merged, so I removed "processedBlocks" as it is not needed.

@chaithu14 force-pushed the APEXMALHAR-2022-S3Output-multiPart branch 2 times, most recently from a77a280 to dc945a4 on November 21, 2016 10:37
Contributor

@bhupeshchawda left a comment

I have added a few more comments. Please check.

* Creates the empty object metadata for initiate multipart upload request.
* @return the ObjectMetadata
*/
public ObjectMetadata createObjectMetadata()
Contributor

This can be private?

Contributor Author

No. By default it creates an empty ObjectMetadata. Users can override this and set properties such as the encryption algorithm. For more details, please refer to the link below:
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/ObjectMetadata.html

* specific language governing permissions and limitations
* under the License.
*/
package org.apache.apex.malhar.lib.fs.s3output;
Contributor

Can we move this to package: org.apache.apex.malhar.lib.io.fs.s3

Contributor

@chaithu14 should we have a separate maven module for this similar to kafka?
@tweise Suggestions?

private String secretAccessKey;
private String endPoint;
private Map<String, S3BlockMetaData> blockInfo = new HashMap<>();
private Map<Long, String> blockId2FilePath = new HashMap<>();
Contributor

Rename to blockIdToFilePath?

Contributor

The purpose of this seems to be for tracking whether the block has been received from the `BlockMetaData` input port. If so, can we change it to a `Set` instead of a `Map`?

Contributor Author

No. We need filepath of the received block to construct unique block id.

public final transient DefaultOutputPort<UploadBlockMetadata> output = new DefaultOutputPort<>();

/**
* This input port receives incoming tuples.
Contributor

incoming tuples. => incoming tuples (Block Data)

Contributor Author

Done

* @param tuple UploadFileMetadata
*/
protected void processUploadFileMetadata(S3InitiateFileUpload.UploadFileMetadata tuple)
{
Contributor

Can we skip processing this if currentWindowId <= windowDataManager.getLargestCompletedWindow() ? Similar to other process calls..

Contributor Author

This constructs the blockInfo map, which is cleared in the replay call. So I think it's better to process it even if currentWindowId <= windowDataManager.getLargestCompletedWindow().

if (partETags.size() <= 1) {
uploadedFiles.add(keyName);
currentWindowRecoveryState.add(keyName);
return;
Contributor

LOG.debug("Uploaded file {} successfully", keyName);

Contributor Author

Done

*/
protected void processUploadBlock(S3BlockUpload.UploadBlockMetadata tuple)
{
List<PartETag> listOfUploads = uploadParts.get(tuple.getKeyName());
Contributor

Skip processing if windowId <= windowDataManager.getLargestCompletedWindow()

Contributor Author

No. This needs to be processed because I am not saving uploadParts in the WAL.

*/
protected void processFileMetadata(S3InitiateFileUpload.UploadFileMetadata tuple)
{
String keyName = tuple.getKeyName();
Contributor

Skip processing if windowId <= windowDataManager.getLargestCompletedWindow()

Contributor Author

Same here.

@@ -0,0 +1,115 @@
/**
Contributor

The tests seem to be excluded from the build.
Can you try looking for an embedded S3 server for testing, or write mock tests that simulate the behavior of S3 so that we have unit tests?

Contributor Author

Added mock test

@chaithu14 force-pushed the APEXMALHAR-2022-S3Output-multiPart branch from dc945a4 to 3fb799f on November 23, 2016 13:06
@NotNull
private String secretAccessKey;
private String endPoint;
protected List<String> uploadedFiles = new ArrayList<>();
Contributor

Can this be transient?

@@ -193,6 +193,7 @@
<configuration>
<excludes>
<exclude>**/S3InputModuleAppTest.java</exclude>
<exclude>**/S3OutputModuleAppTest.java</exclude>
Contributor

Since we have the mock tests now, we can remove this test as it will not run anyway.

Contributor Author

Done.

private static final String APPLICATION_PATH_PREFIX = "target/s3outputtest/";
private String applicationPath;
private Attribute.AttributeMap.DefaultAttributeMap attributes;
Context.OperatorContext context;
Contributor

private?

Contributor Author

Done

PutObjectResult objResult = new PutObjectResult();
objResult.setETag("SuccessFullyUploaded");

UploadPartResult partResult = new UploadPartResult();
Contributor

can you try it out using multiple parts?

Contributor Author

This might be difficult to do, because it depends on the UploadPartRequest that is passed as an argument to client.uploadPart().

LocalMode.Controller lc = lma.getController();
lc.setHeartbeatMonitoringEnabled(true);
lc.runAsync();

Contributor

Contributor Author

Done.

@@ -0,0 +1,54 @@
package org.apache.apex.malhar.lib.fs.s3output;
Contributor

Headers

Contributor Author

Done.

import static org.mockito.Matchers.any;
import static org.mockito.Mockito.when;

public class S3OutputModuleMockTest
Contributor

some docs on what is being validated

Contributor Author

Done

import static org.mockito.Mockito.any;
import static org.mockito.Mockito.when;

public class S3InitiateUploadTest
Contributor

some docs on what is being validated

Contributor Author

Done.

*/
private void verifyAndEmitFileMerge(String keyName)
{
if (currentWindowId <= windowDataManager.getLargestCompletedWindow()) {
Contributor

Can this be moved to the processFileMetadata and processBlockUpload methods?

Contributor Author

No. This can't be moved to the processFileMetadata and processBlockUpload methods, because I am not saving uploadParts and fileMetadatas in the WAL.

/**
* This operator can be used to upload the block into S3 bucket using multi-part feature or putObject API.
* Upload the block into S3 using multi-part feature only if the number of blocks of a file is > 1.
*/
Contributor

Better to mention that this is useful in context of the S3 Output Module. Same for all operators in this module.

Contributor Author

Done
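
The rule stated in the Javadoc under review (multi-part upload only when a file has more than one block, putObject otherwise) can be sketched as below. `UploadStrategySketch`, `chooseStrategy`, and `needsCompleteRequest` are hypothetical names; the second method mirrors the `numberOfBlocks != partETags.size()` and `partETags.size() <= 1` checks quoted earlier in this conversation and is an interpretation, not the PR's code.

```java
final class UploadStrategySketch
{
  enum Strategy
  {
    PUT_OBJECT, MULTI_PART
  }

  private UploadStrategySketch()
  {
  }

  // A file with a single block is uploaded in one putObject call.
  static Strategy chooseStrategy(int numberOfBlocks)
  {
    return numberOfBlocks > 1 ? Strategy.MULTI_PART : Strategy.PUT_OBJECT;
  }

  // A CompleteMultipartUploadRequest is only needed once every part of a
  // multi-block file has been uploaded.
  static boolean needsCompleteRequest(int numberOfBlocks, int uploadedParts)
  {
    return uploadedParts == numberOfBlocks && uploadedParts > 1;
  }
}
```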

@bhupeshchawda
Contributor

Also check the build - some rat check failure..

@chaithu14 force-pushed the APEXMALHAR-2022-S3Output-multiPart branch 3 times, most recently from b57099e to 91bb9f6 on November 28, 2016 05:17
Contributor

@bhupeshchawda left a comment

Added a couple of minor comments...


import static org.apache.apex.malhar.lib.fs.s3.S3OutputModuleMockTest.client;

public class S3OutputModuleTest extends S3OutputModule
Contributor

Nit: Can you rename this class to S3OutputTestModule?
Otherwise this seems to be a test class without any tests :-)

Contributor Author

Done.

});
lc.run(10000);

lc.shutdown();
Contributor

No need for shutdown(); lc.run() will shut down automatically.

Contributor Author

Ok. Updated

@chaithu14 force-pushed the APEXMALHAR-2022-S3Output-multiPart branch from 91bb9f6 to 6ab63bd on November 28, 2016 08:55
@chaithu14 chaithu14 closed this Nov 28, 2016
@chaithu14 chaithu14 reopened this Nov 28, 2016
Contributor

@bhupeshchawda left a comment

Changes look good to me...Will merge soon if there are no other comments.

*/

@InterfaceStability.Evolving
public class S3BlockUpload implements Operator, Operator.CheckpointNotificationListener, Operator.IdleTimeHandler
Contributor

Could you please change file name to S3BlockUploadOperator or S3BlockUploader?

Contributor Author

Changed to S3BlockUploadOperator.

* This operator is useful in context of S3 Output Module.
*/
@InterfaceStability.Evolving
public class S3InitiateFileUpload implements Operator, Operator.CheckpointNotificationListener
Contributor

Could you please change filename to S3FileUploadInitiator or S3InitiateFileUploadOperator?

Contributor Author

@chaithu14 Nov 29, 2016

Changed to S3InitiateFileUploadOperator.

* - S3BlockUpload
* - S3FileMerger
*
* Initial BenchMark Results
Contributor

This will be very useful. Excellent.

* AWS access key
*/
@NotNull
private String accessKey;
Contributor

Could we make the properties for S3OutputModule consistent with S3InputModule?
Either we should take accessKey and secretAccessKey separately for both modules,
or take a URL input for both modules.

There could be discussion around the pros and cons of both approaches, so you might tackle it as a separate JIRA.

Contributor Author

Ok. I will create the JIRA for the same.

Contributor

Please put a link to the new JIRA in a reply to this comment.

Contributor Author

* @tags S3, Output
*/
@InterfaceStability.Evolving
public class S3OutputModule implements Module
Contributor

I would recommend moving this into a separate pom project under malhar, just like we have malhar-kafka and malhar-hive.

Contributor Author

I added this under malhar library because the existing S3InputModule is under malhar library.

* Creates the number of instances of S3FileMerger operator.
*/
@Min(1)
private int noOfFileMergers = 1;
Contributor

In the input module we have a property called readerCount.
To be consistent, we should rename this property to mergerCount.

Contributor Author

Done.

return;
}
for (Map.Entry<String, UploadBlockMetadata> ubm: recoveredData.entrySet()) {
output.emit(ubm.getValue());
Contributor

Avoid abbreviations in the variable names.

Contributor Author

Ok.

*/
private void processWaitBlocks()
{
Iterator<AbstractBlockReader.ReaderRecord<Slice>> ite = waitingTuples.iterator();
Contributor

Avoid abbreviations in the variable names.

Contributor Author

Ok.

@chaithu14 force-pushed the APEXMALHAR-2022-S3Output-multiPart branch 3 times, most recently from 376d1eb to cb28681 on November 30, 2016 05:17
@yogidevendra
Contributor

Changes look good to me.

@chaithu14 force-pushed the APEXMALHAR-2022-S3Output-multiPart branch from cb28681 to a5e8fa3 on November 30, 2016 11:03
@chaithu14 chaithu14 closed this Nov 30, 2016
@chaithu14 chaithu14 reopened this Nov 30, 2016
@chaithu14 chaithu14 closed this Nov 30, 2016
@chaithu14 chaithu14 reopened this Nov 30, 2016
@asfgit asfgit merged commit a5e8fa3 into apache:master Dec 1, 2016
4 participants