
[Question] Slow performance of sync - Local File to Blob #81

Closed
Kapanther opened this issue Oct 11, 2018 · 23 comments
@Kapanther

V10.0.2 Preview - Win 7

.\azcopy sync "C:\GCDS_dev" "https://azgcdsdevst1.blob.core.windows.net/gcdstest2?--Key Retracted--" --recursive

When syncing larger amounts of data (>1 GB, local file to Blob), sync seems to take a long time to even prep the job (i.e. syncing 1.4 GB of data seems to take greater than 30 minutes to even start the job).

The copy function, by contrast, seems to start almost straight away.

I know the sync command obviously has some file comparison work to do before it can do anything, but it still seems extraordinarily slow to begin.

Any idea what could be causing the delay?

Is it possible to report file conflict check progress to the command line with a flag?

@Kapanther
Author

From the copy log. As you can see, very slow response times per file checked:

RESPONSE Status: 201 Created
Content-Md5: [Y08Lyv2EjoV4Z3tXjHkPSA==]
Date: [Thu, 11 Oct 2018 05:07:05 GMT]
Etag: ["0x8D62F3763A44874"]
Last-Modified: [Thu, 11 Oct 2018 05:07:06 GMT]
Server: [Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0]
X-Ms-Request-Id: [39aa8fcc-c01e-00a7-3120-61b85c000000]
X-Ms-Request-Server-Encrypted: [true]
X-Ms-Version: [2018-03-28]
2018/10/11 05:07:06 ==> REQUEST/RESPONSE (Try=1/3.02s[SLOW >3s], OpTime=3.02s) -- RESPONSE SUCCESSFULLY RECEIVED
PUT https://azgcdsdevst1.blob.core.windows.net/gcdstest2/Customisation/PAT_CUSTOM/S15.PAT?si=gcdstest2-16657ecd551&sig=REDACTED&sr=c&sv=2018-03-28&timeout=901
Content-Length: [1271]
User-Agent: [AzCopy/v10.0.2-Preview Azure-Storage/0.1 (go1.10.3; Windows_NT)]
X-Ms-Blob-Cache-Control: []
X-Ms-Blob-Content-Disposition: []
X-Ms-Blob-Content-Encoding: []
X-Ms-Blob-Content-Language: []
X-Ms-Blob-Content-Type: [text/plain; charset=utf-8]
X-Ms-Blob-Type: [BlockBlob]
X-Ms-Client-Request-Id: [c26faa5d-043f-4759-589f-a2d4f2aeca9e]
X-Ms-Version: [2018-03-28]

@zezha-msft
Contributor

Hi @Kapanther, thanks for reaching out!

To clarify, you said that it took a long time for the job to start. How did you observe this? Was it low throughput?

And also, how many files did you have?

@Kapanther
Author

~10000 files, 1.4 GB. An azcopy copy took less than 2 minutes. Basically azcopy accepts the command and then prints nothing for about 30 minutes.

How can I check whether low throughput is the problem? Is throughput measured in files/sec? I posted the log above.

@zezha-msft
Contributor

Hi @Kapanther, thanks for the additional info! Just to confirm, this is a reproducible problem, right?

@prjain-msft could you help to confirm this behavior?

@prjain-msft
Contributor

Hey Kapanther,
How many files do you have in the destination?
The way sync works is that it compares the source against the destination and the destination against the source. All the files present at the destination are also listed and compared with the expected files at the source; if a file is not present at the source, it is marked for deletion. If you have a very large number of files at the destination, then probably the files from the destination are being listed and then compared against the source. Until 10000 transfers are queued, you won't see any output. So in sync scenarios where there are hundreds of thousands of files at the source and destination and only very few of them are out of sync, all the files will be compared first before you see any output.
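The two-pass comparison described above can be sketched with standard shell tools. This is only an illustration of the idea, not azcopy's actual implementation, and the file names are made up:

```shell
# Build sorted listings of both sides (hypothetical file sets).
printf 'a.txt\nb.txt\nc.txt\n' | sort > source_list.txt   # files at the source
printf 'b.txt\nd.txt\n' | sort > dest_list.txt            # files at the destination

# Pass 1: files present only at the source are queued for upload.
comm -23 source_list.txt dest_list.txt > to_upload.txt    # a.txt, c.txt

# Pass 2: files present only at the destination are marked for deletion.
comm -13 source_list.txt dest_list.txt > to_delete.txt    # d.txt
```

With large listings on both sides, both passes must complete before any transfer output appears, which is consistent with the long silent start-up.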

@Kapanther
Author

Well, I tested two scenarios: the local directory empty, and the blob container empty. (Definitely reproducible.)

When syncing from the local directory to the blob (with the blob empty), it took 31 minutes before the job even started. But syncing back down from the blob to the local directory was lightning fast. See below:

10000 files in local directory (source), empty blob (destination): 31 minutes to start the job
10000 files in blob (source), empty local directory (destination): less than 1 minute to start

Seems like it's taking a long time to queue the transfers to the blob when syncing.

@prjain-msft
Contributor

Hi Kapanther,
Can you please provide the commands you tried, and also mark the resources (as source / destination)?

@Kapanther
Author

Localfile to Azure Command
C:\software\azcopy\azcopy.exe sync "C:\GCDS_dev" "https://azgcdsdevst1.blob.core.windows.net/gcdstest2?sv=2018-03-28&si=gcdstest2-16657ECD551&sr=c&sig=!retracted!" --recursive

Source = c:\GCDS_dev
Dest = https://azgcdsdevst1.blob.core.windows.net/gcdstest2?sv=2018-03-28&si=gcdstest2-16657ECD551&sr=c&sig=!retracted!

Azure to Localfile Command
C:\software\azcopy\azcopy.exe sync "https://azgcdsdevst1.blob.core.windows.net/gcdstest2?sv=2018-03-28&si=gcdstest2-16657ECD551&sr=c&sig=!retracted!" "C:\GCDS_dev" --recursive

Source = https://azgcdsdevst1.blob.core.windows.net/gcdstest2?sv=2018-03-28&si=gcdstest2-16657ECD551&sr=c&sig=!retracted!
Dest = C:\GCDS_dev

@prjain-msft
Contributor

Hi Kapanther,
Can you please confirm a couple of things?
c:\GCDS_dev (is this a directory or a file?)
https://azgcdsdevst1.blob.core.windows.net/gcdstest2 (does this point to a blob or a virtual folder?)

@Kapanther
Author

C:\GCDS_dev is a directory.

It's a blob container, no virtual folder. Files go straight into the root.

@VelizarVESSELINOV

VelizarVESSELINOV commented Oct 18, 2018

Compared to gsutil sync, the azcopy sync performance is really very bad (using macOS Mojave). Since azcopy is written largely in Go, I expected higher performance than gsutil/boto, which are written in Python.

Related to slow performance extra observations:

  • missing clear intermediate output to follow what the program is doing, especially during the diff analysis phase (try gsutil if you want to see what I'm talking about)
  • missing compression option when copying readable files like CSV
  • too many file transfer failures
  • missing chunking of the large files
  • missing multi-threaded option

@zezha-msft
Contributor

Hi @VelizarVESSELINOV, thanks for the feedback! We are actively working on this tool to improve the performance.

To clarify, we do perform concurrent operations and chunk up large files. What were the failures that you saw?

@VelizarVESSELINOV

VelizarVESSELINOV commented Oct 18, 2018

Hi @zezha-msft, thanks for the quick answer. In my process explorer, I saw a lot of threads running, but the CPU usage was limited. Is there an option to control parallel execution? Or maybe the user interface just isn't showing enough of what is currently being done in parallel and/or chunked. As for failures, I often get this error:

   ERROR:
-> github.com/Azure/azure-storage-azcopy/ste.newAzcopyHTTPClientFactory.func1.1, /go/src/github.com/Azure/azure-storage-azcopy/ste/mgr-JobPartMgr.go:95
HTTP request failed

The CPU usage is often low (3%), but azcopy is obviously using a lot of some other resource, because a few minutes after it starts, VSCode and other applications switch to not-responding mode, which is annoying.

@Kapanther
Author

@VelizarVESSELINOV take into account that sync is a new feature that's only "in preview" right now. The guys are still testing it and optimizing its performance.

This thread is focused on an issue with the sync command's initial comparison between source and destination being slow, not the file transfer operation itself. Can I suggest you post the multi-threading performance issues as a separate issue?

@zezha-msft
Contributor

Hi @VelizarVESSELINOV, which command were you running exactly? Was it sync or copy?

If you don't mind, please open up a new issue and fill out the issue template so that we can have a bit more info. Thanks!

The concurrency is indeed configurable; please refer to this guide. Our ultimate goal is to adjust the concurrency based on the environment and network; we are still working on this.

@prjain-msft
Contributor

Hi @Kapanther
In your command, the source is a directory and the destination is a container, so sync first lists all the files inside the source in the background and compares them against the expected files at the destination. Then it lists all the files inside the destination and compares them against the expected files locally. That is why sync doesn't start immediately. We are working on improving the user experience for sync.

@Kapanther
Author

@prjain-msft
Two questions here:

  1. If sync is only one-way (source -> destination), why does it have to check back from the destination against the source?

  2. What I find particularly strange is that checking 10000 local files against an empty blob container using sync takes 31 minutes, but going the other way is almost instant, even though both checks require the blob to be inspected. Is there a timeout parameter here for files that are not found, or something?

I guess I will leave it for you guys to investigate when you get a chance. In the meantime I'll use Rclone and see if I get similar results.

@VelizarVESSELINOV

@zezha-msft the answer to your question is sync

Eventually I will open a new ticket; for now, I've stopped using azcopy.

The default AZCOPY_CONCURRENCY_VALUE of 300 is probably too aggressive for macOS and makes the whole OS difficult to use. No time right now to test a better default AZCOPY_CONCURRENCY_VALUE for macOS. I managed to copy the files with az with acceptable performance/user interface; hopefully the team will manage to improve the performance and usability with @Kapanther's help.
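For anyone who wants to experiment, the concurrency can be dialed down via the environment variable mentioned above before starting a job. The path, account, container, and SAS token below are placeholders, not real values:

```shell
# Lower AzCopy's request concurrency for this shell session,
# then run the job. <account>, <container>, and <SAS> are placeholders.
export AZCOPY_CONCURRENCY_VALUE=32
azcopy copy "/data/src" "https://<account>.blob.core.windows.net/<container>?<SAS>" --recursive
```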

@zezha-msft
Contributor

Hi @VelizarVESSELINOV, thanks for your feedback.

To clarify though, if you only wanted to copy files, you should use the copy command, not sync, which has severe overhead because we have to compare the contents of the source and destination to figure out exactly what to transfer or delete. On the other hand, copy simply transfers the source to the destination. With the help of the --overwrite=false flag, copy can also avoid overwriting existing files at the destination.
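For example, a copy that skips blobs already present at the destination might look like this (the local path, account, container, and SAS token are placeholders):

```shell
# Copy recursively, but do not overwrite blobs that already exist
# at the destination. <account>, <container>, and <SAS> are placeholders.
azcopy copy "C:\GCDS_dev" "https://<account>.blob.core.windows.net/<container>?<SAS>" --recursive --overwrite=false
```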

@zezha-msft
Contributor

Hi @Kapanther, @prjain-msft has improved the sync command's performance significantly. Could you please give it another try? Thank you!!

@Kapanther
Author

@zezha-msft and @prjain-msft .. just got back from holidays, will check now... I'm excited!

@Kapanther
Author

Kapanther commented Jan 12, 2019

@zezha-msft and @prjain-msft .. holy crap guys, azcopy sync must now be using weapons-grade plutonium, because that is FAST! What was taking about 2 minutes before is taking less than a second.

Using that 10.0.0,5 preview...

@Kapanther
Author

Consider this issue closed!!
