
sync and cp hangs with a large amount of files #657

Closed
moatazelmasry2 opened this issue Feb 18, 2014 · 13 comments · Fixed by #673 or #676

Comments

@moatazelmasry2

Hello

Whenever I'm using aws s3 cp/sync, the process hangs after some time with no errors or warnings; it just hangs forever. Here are some remarks about the use case and observations (a representative invocation is sketched after this list):
1- s3 cp/sync is executed to process a large number of files, 2K-80K files, with a total size of maybe 10GB-100GB
2- I'm using aws-cli/1.2.13 Python/2.6.6 Linux/2.6.32-279.1.1.el6.x86_64
3- The command is executed on a m2.4xlarge instance with centos 6.5
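
For reference, the failing invocations are of this general shape (the bucket name and local path here are placeholders, not the real ones):

# S3 -> local sync of a prefix containing thousands of objects
aws s3 sync s3://my-bucket/some-prefix /mnt/data
# the recursive cp form hangs in the same way
aws s3 cp --recursive s3://my-bucket/some-prefix /mnt/data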

When the process starts, I check the I/O: wait time is over 99%, and the load average after some time reaches 44.
Four hours later, the resources go back to normal, with a load average of almost 0 and almost 0% wait time.

The other interesting remark is that it always stops just short of the end. For example, with one set of files, let us call it x, it always hangs at:
"Completed 2712 of 2714 part(s) with 1 file(s) remaining"
With a different set of files y, the last line is always:
"Completed 39411 of 39417 part(s) with 3 file(s) remaining"

No matter how many times I repeat the test, it always stops at one of the given lines (depending on which file dataset). This is a consistent failure with a large number of files.

If I'm downloading just one huge file, let's say 100GB, then everything is fine.

Any ideas?
Thanks

Update 1:
The bug does not appear when I use an instance of type c3.2xlarge fired up with the exact same configuration. The problem is still reproducible on m2.4xlarge.
The main difference between the two instances is that the m2.4xlarge uses a paravirtual (PV) AMI, while the c3.2xlarge uses an HVM AMI.

@daveadams

I get the exact same problem using Python 3.3.3 on Ubuntu 10.04 LTS on a bare-metal server, syncing files from S3 to local disk. A double Ctrl-C cancels out, after which I can re-run the sync and get another batch of files, sometimes completing the sync, sometimes not.

aws-cli 1.2.13
Ubuntu 10.04.4 LTS
Linux 2.6.32-47-server x86_64
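
A scripted version of that cancel-and-retry approach might look roughly like this (just a sketch; the bucket, paths, and the 30-minute limit are placeholders, not from my actual setup):

# keep re-running the sync; if a run hangs for more than 30 minutes,
# GNU timeout kills it (-k follows up with SIGKILL) and we try again
until timeout -k 10s 30m aws s3 sync s3://my-bucket/some-prefix /local/dir; do
    echo "sync hung or failed, retrying..." >&2
done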

@moatazelmasry2
Author

Hello daveadams,
Unfortunately this is not a solution for me, as this CLI call is part of a large software build and deployment pipeline and there can be no interaction. I can't work around this problem; I need to really solve it.
The problem is still reproducible. I'll attach a debug log today and also try to invest some time in this issue.
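
For the record, a debug log can be captured along these lines (bucket/prefix and local path are placeholders):

# --debug turns on verbose logging to stderr; redirect it into a file
aws s3 sync s3://my-bucket/some-prefix /mnt/data --debug 2> sync-debug.log
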
Cheers

@jamesls
Member

jamesls commented Feb 20, 2014

I'm going to try to repro this issue. I want to make sure I'm using as similar an environment as possible. If I'm doing the math right, 100GB total size across 80k files works out to about 1.3MB for the average file size?

Is the file transfer local -> s3, s3->local or s3->s3?

@moatazelmasry2
Author

Hi jamesls, and thanks for the help.
No, the math is not right: file sizes vary greatly; they can be a couple of KBytes or even 1 GB. I attached a script that will generate a store of random binary data with random file sizes, approximately 100GB in total, distributed (hopefully) among thousands of files.
This happens on S3->local onto ephemeral disk, on an m2.4xlarge machine that uses a paravirtual AMI; on compute-optimized instances with HVM I could not reproduce the error.

I'm currently generating the data using this script and will try it on different vanilla instance types to see if I can reproduce it on other m2.4xlarge instances. Please see the attached script.

#!/bin/bash
#
# This script will create a store of size ~100GB in 10 directories,
# each of size ~10GB, distributed among thousands of randomly
# generated binary files.

rm -rf storage 2>/dev/null
mkdir -p storage && pushd storage

# print an 8-character random alphanumeric string
function random_text {
    cat /dev/urandom | tr -dc '0-9a-zA-Z' | head -c 8
}

for i in {1..10}
do
    parent_dir=$(random_text)
    mkdir -p "${parent_dir}" && pushd "${parent_dir}"
    # fill this directory in the background until it reaches ~10GB
    while [[ $(du -s . | awk '{print $1}') -lt 10000000 ]]
    do
        # produce a random binary file with size < ~100MB
        dd if=/dev/urandom of=$(random_text) bs=1K count=$(( (RANDOM % 100) * (RANDOM % 1000) ))
    done &
    popd
done
popd

# wait for all ten background fill loops to finish
wait
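
One possible way to exercise the failing path with the generated data (the script filename and bucket name below are placeholders):

bash generate_storage.sh                            # hypothetical name for the script above
aws s3 sync storage s3://my-test-bucket/storage     # push the generated files to S3
aws s3 sync s3://my-test-bucket/storage ./restored  # pull them back to reproduce the S3 -> local hang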

@jamesls
Member

jamesls commented Feb 21, 2014

Thanks, I'll give your script a shot and report back what I find.

@jamesls
Member

jamesls commented Feb 22, 2014

I can confirm the issue. Investigating why it's hanging.

@moatazelmasry2
Author

What instance type did you use? Which virtualisation?
On compute-optimized instances, for example, I cannot reproduce the bug.

@jamesls
Member

jamesls commented Feb 25, 2014

On an m2.4xlarge. I believe I have a fix for this here, but I want to run more tests before I'm certain.

@moatazelmasry2
Author

Everything works great after installing the aws cli from GitHub. Great work! Thanks

@vishwasg1974

I still see this issue. I downloaded the aws build from the installer just a couple of days ago.
aws --version
aws-cli/1.3.21 Python/2.7.2 Darwin/12.5.0

Thoughts on what could be wrong?

@moatazelmasry2
Author

Hi @vishwasg1974,
This bug is no longer reproducible on our instances; we have upgraded to version 1.3.22.
Can you describe the files you are trying to upload/download: binary or text, large or small, depth of directory structure, etc.?
In any case, try using the script attached to this issue to produce custom random binary files.

@cvniru

cvniru commented Jul 24, 2014

I am using awscli version 1.3.23. My transfer is hanging after it says finished 945 of 946. Any suggestions?

@moatazelmasry2
Author

Hi, the aws cli has since been upgraded to 1.4.x. We transfer a couple of TB daily from/to S3 with the CLI.
Is the problem reproducible with the newer version?
Can you come up with a script that reproduces this problem?
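
If it helps, checking the installed version and upgrading (assuming a pip-based install) is roughly:

aws --version                 # prints something like aws-cli/1.4.x Python/2.7.x ...
pip install --upgrade awscli  # upgrade to the latest release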
