This repository has been archived by the owner on May 12, 2021. It is now read-only.

TAJO-2069: Implement finding the total size of all objects in a bucket with AWS SDK. #953

Closed

Conversation

blrunner
Contributor

@blrunner blrunner commented Feb 1, 2016

Unit test cases are not yet implemented, and this depends on TAJO-2063 (#952).

@blrunner blrunner changed the title TAJO-2069: Remove getContentsSummary in TableSpace and Query. TAJO-2069: Implement finding the total size of all objects in a bucket with AWS SDK. Feb 3, 2016
@blrunner
Contributor Author

blrunner commented Feb 3, 2016

Here are my benchmark results.

Configuration

  • EC2 instance type : c3.xlarge
  • Tajo version : 0.12.0-SNAPSHOT
  • Cluster: 1 master, 1 worker

Contents summary time

| # of directories | S3AFileSystem::getContentSummary | S3FileTableSpace::getTotalSize | Improvement |
|------------------|----------------------------------|--------------------------------|-------------|
| 5                | 1372 ms                          | 17 ms                          | 80.7x       |
| 365              | 55447 ms                         | 120 ms                         | 462.0x      |
| 730              | 110245 ms                        | 101 ms                         | 1091.5x     |
| 1095             | 164812 ms                        | 222 ms                         | 742.4x      |
| 1460             | 221492 ms                        | 217 ms                         | 1020.7x     |
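For intuition on the gap (a hypothetical sketch, not the Java code in this patch): `S3AFileSystem::getContentSummary` recurses into every "directory", issuing a separate LIST request each time, while a TableSpace-level `getTotalSize` can issue a single paginated prefix listing and sum the object sizes, so its cost tracks object count rather than directory count. In Python, with the SDK's paginated responses stubbed out as plain lists:

```python
# Hypothetical sketch of the flat-listing idea behind getTotalSize().
# Each "page" stands in for one ListObjects response from the AWS SDK,
# and each number for one object's size in bytes.

def total_size(pages):
    """Sum object sizes across all pages of a single prefix listing."""
    return sum(size for page in pages for size in page)

# One LIST request per page, no matter how many "directories" the keys
# simulate -- which is why the timings above stay roughly flat as the
# directory count grows.
pages = [[100, 200], [50]]
print(total_size(pages))  # prints 350
```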

@blrunner
Contributor Author

blrunner commented Feb 3, 2016

Removed the TAJO-2063 (#952) dependency.

@jihoonson
Contributor

I wonder why the time taken by getTotalSize() is not proportional to the number of directories. Sometimes it is even faster with more directories.
Do you know the reason?

@blrunner
Contributor Author

blrunner commented Feb 4, 2016

@jihoonson

There may be various reasons: the local network connection, the health of Amazon's servers, and the AWS SDK's retry mechanism.

@jihoonson
Contributor

If those are the reasons, you can mitigate those overheads by testing several times and averaging the results.
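A minimal sketch of that suggestion (illustrative only, not part of the patch):

```python
import statistics
import time

def time_avg_ms(fn, runs=5):
    """Call fn `runs` times and return the mean wall-clock duration in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)

# Averaging several runs smooths out transient network latency and
# SDK-retry noise, which single-shot timings are sensitive to.
print(time_avg_ms(lambda: sum(range(100000))))
```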


<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
Member


hadoop-aws is included in Hadoop 2.6.0 and higher.
If you add hadoop-aws, we should discuss Hadoop compatibility.

@blrunner
Contributor Author

blrunner commented Feb 5, 2016

@jihoonson
Thank you for your feedback. I'll test it again with reference to your comments.

@jinossy
That's a good point. I'll write an e-mail about Hadoop compatibility.

@blrunner
Contributor Author

@jinossy

I removed the hadoop-aws dependency and added the AWS SDK dependency.
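For reference, the swap presumably looks something like the fragment below; the exact groupId/artifactId are my assumption about which SDK module is used, not taken from the patch:

```xml
<!-- Assumed shape of the replacement dependency; the coordinates here
     are illustrative, not copied from the patch -->
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk-s3</artifactId>
</dependency>
```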

@blrunner
Contributor Author

@jihoonson

Here are my second benchmark results.

| # of directories | S3AFileSystem | S3FileTableSpace | Improvement |
|------------------|---------------|------------------|-------------|
| 5                | 1056.5 ms     | 136.2 ms         | 7.8x        |
| 365              | 56549 ms      | 153.8 ms         | 367.7x      |
| 730              | 113007.5 ms   | 193.2 ms         | 585x        |
| 1095             | 168567 ms     | 215.7 ms         | 781.5x      |
| 1460             | 228129.5 ms   | 234.2 ms         | 974.1x      |

@blrunner
Contributor Author

Finished tests successfully in the following environments:

  • EC2 instances which were deployed manually.
  • An EMR cluster created with the script below:

aws emr create-cluster \
    --name="<CLUSTER_NAME>" \
    --release-label=emr-4.4.0 \
    --no-auto-terminate \
    --use-default-roles \
    --ec2-attributes KeyName=<KEY_NAME> \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=c3.xlarge \
    --bootstrap-action Name="Install tajo",Path=s3://jhjung-us/tajo-emr/install-tajo-java8.py,Args=["-t","s3://jhjung-us/tajo-emr/tajo-0.12.0-SNAPSHOT.tar.gz","-c","s3://tajo-emr/tajo-0.11.0/c3.xlarge/conf"]

  • Direct access to S3 on OS X.

@blrunner
Contributor Author

This PR has been moved to #1024.
