Skip to content
This repository has been archived by the owner on May 10, 2024. It is now read-only.

Incremental MD5 for uploads #711

Merged
merged 1 commit into from
Apr 21, 2012
Merged

Incremental MD5 for uploads #711

merged 1 commit into from
Apr 21, 2012

Conversation

evanworley
Copy link
Contributor

I've added support for computing incremental md5s for GS, when possible. There are a few situations that I identified where it wasn't possible, mentioned in the commit notes of the larger commit.

Please let me know if I missed anything. I ran a few manual tests, and everything seems to work as expected. It wasn't clear to me how to run the boto test suite, nor how to run gsutil with a tracker url.

Also, I noticed that the Name of the Key is set to the md5 when the name is missing, so I had to compute the MD5 ahead of time if the name is None.

@ghost ghost assigned mfschwartz Apr 20, 2012
@mfschwartz
Copy link
Member

Hi Evan,

I reviewed your code - please take a look on Github.

Thanks,

Mike

On Thu, Apr 19, 2012 at 5:31 PM, Evan Worley <
reply@reply.github.com

wrote:

I've added support for computing incremental md5s for GS, when possible.
There are a few situations that I identified where it wasn't possible,
mentioned in the commit notes of the larger commit.

Please let me know if I missed anything. I ran a few manual tests, and
everything seems to work as expected. It wasn't clear to me how to run the
boto test suite, nor how to run gsutil with a tracker url.

  • Evan

You can merge this Pull Request by running:

git pull https://github.com/evanworley/boto develop

Or you can view, comment on it, or merge it online at:

#711

-- Commit Summary --

  • Adding incremental MD5 calculation for GS when possible.
  • Removing print statements

-- File Changes --

M boto/gs/key.py (16)
M boto/gs/resumable_upload_handler.py (28)
M boto/s3/key.py (15)

-- Patch Links --

https://github.com/boto/boto/pull/711.patch
https://github.com/boto/boto/pull/711.diff


Reply to this email directly or view it on GitHub:
#711

@evanworley
Copy link
Contributor Author

Thanks Mike, I'll resolve that bug and run the tests. Thanks for the instructions on how to run the tests. Also can you elaborate on how you ran the resumable upload test in #2?

@evanworley
Copy link
Contributor Author

I'm getting a strange error when running the ssl tests, "gaierror: [Errno -5] No address associated with hostname". Any idea what this is about?

Also a seemly unrelated S3 test error

FAIL: test_1_versions (s3.test_versioning.S3VersionTest)

Traceback (most recent call last):
File "/test_versioning.py", line 116, in test_1_versions
self.assertEqual('Suspended', d['Versioning'])
AssertionError: 'Suspended' != 'Enabled'

@mfschwartz
Copy link
Member

On the tests: One thing I suggest is to get a baseline: sync back to just before you made your changes and run the tests from that code snapshot. See if you get either/both of these failures.

If you do, a couple thoughts about what might be causing them:
a) are you running behind a proxy? Possibly that's a red herring, but I tend to suspect proxies when network problems arise.
b) do you have an s3 account (and, are you able to use versioning directly, not through gsutil)?

@evanworley
Copy link
Contributor Author

Thanks for the tip. I have all the s3 and gs tests passing. The ssl tests
that fail, fail both in my fork and using the baseline, so I'm pretty
confident that everything is working as expected. Pushing momentarily.

On Fri, Apr 20, 2012 at 3:59 PM, Mike Schwartz <
reply@reply.github.com

wrote:

On the tests: One thing I suggest is to get a baseline: sync back to just
before you made your changes and run the tests from that code snapshot. See
if you get either/both of these failures.

If you do, a couple thoughts about what might be causing them:
a) are you running behind a proxy? Possibly that's a red herring, but I
tend to suspect proxies when network problems arise.
b) do you have an s3 account (and, are you able to use versioning
directly, not through gsutil)?


Reply to this email directly or view it on GitHub:
#711 (comment)

@evanworley
Copy link
Contributor Author

All tests are now passing, would you please have another look?

Thanks,
Evan

@mfschwartz
Copy link
Member

One last request: can you please do a git rebase so that your comment shows up as the top entry in the log? Otherwise this commit will look like "Merge branch 'develop' of github.com:evanworley/boto into develop".

@mfschwartz
Copy link
Member

Hi Evan,

I added a couple more comments. All looks good now; main thing left at this
point is if you could git rebase so the change description is your
description instead of the description from the merge, then I'll submit it
on the develop branch.

Thanks again,

Mike

On Fri, Apr 20, 2012 at 4:05 PM, Evan Worley <
reply@reply.github.com

wrote:

All tests are now passing, would you please have another look?

Thanks,
Evan


Reply to this email directly or view it on GitHub:
#711 (comment)

This greatly reduces the wall clock time for an upload when uploading large files.

Situations in which the full MD5 is still calculated in its entirety
  - When resuming a partial upload
  - When a name is not specified for the object, because GCS uses the MD5 as the name

Updating one test in ResumableUploadTests to change a byte that has already been uploaded, so a failure will occur.
Otherwise the on-the-fly evaluation of the md5 will not cause an error, as it will be computed after mutation
has occurred.
@evanworley
Copy link
Contributor Author

Hi Mike,

I just finished the rebase. Thanks for all the help and feedback.

Evan

@mfschwartz
Copy link
Member

I renamed this change to "Incremental MD5 for uploads" since it actually works for both GS and S3.

mfschwartz added a commit that referenced this pull request Apr 21, 2012
@mfschwartz mfschwartz merged commit a498da9 into boto:develop Apr 21, 2012
@mfschwartz
Copy link
Member

Committed.

Thanks for the work on this!

Mike

On Fri, Apr 20, 2012 at 5:18 PM, Evan Worley <
reply@reply.github.com

wrote:

Hi Mike,

I just finished the rebase. Thanks for all the help and feedback.

  • Evan

Reply to this email directly or view it on GitHub:
#711 (comment)

@evanworley
Copy link
Contributor Author

Excellent, thanks for all the help/feedback, Mike.

@mfschwartz
Copy link
Member

Hi Evan,

One additional thought about the MD5 change you made: Your mod only causes
incremental MD5 computation for uploads. For downloads we still wait for
the full file read/verification at the end. Might I be able to interest you
in also modifying that path?

Thanks,

Mike

On Sun, Apr 22, 2012 at 10:48 PM, Evan Worley <
reply@reply.github.com

wrote:

Excellent, thanks for all the help/feedback, Mike.


Reply to this email directly or view it on GitHub:
#711 (comment)

@evanworley
Copy link
Contributor Author

Hey Mike,

This hadn't even occurred to me, as we use aria2c to download our files
over http (as they are public-read). This would be an useful addition, and
I'd be happy to contribute it. I'll submit a pull request once it's ready.

Cheers,
Evan

On Mon, Apr 23, 2012 at 9:47 AM, Mike Schwartz <
reply@reply.github.com

wrote:

Hi Evan,

One additional thought about the MD5 change you made: Your mod only causes
incremental MD5 computation for uploads. For downloads we still wait for
the full file read/verification at the end. Might I be able to interest you
in also modifying that path?

Thanks,

Mike

On Sun, Apr 22, 2012 at 10:48 PM, Evan Worley <
reply@reply.github.com

wrote:

Excellent, thanks for all the help/feedback, Mike.


Reply to this email directly or view it on GitHub:
#711 (comment)


Reply to this email directly or view it on GitHub:
#711 (comment)

@evanworley
Copy link
Contributor Author

Thanks guys.

I took a look at incremental MD5 computation for downloads, and the only thing needed was a simple check in gsutil. Please have a look at http://code.google.com/p/gsutil/issues/detail?id=55. Just like resumed uploads, resumed downloads still compute the full MD5 at the end.

Both resumed uploads and resumed downloads could do incremental MD5 computation if they stored the partial checksum in the tracker, but this is a larger change and I wouldn't want to delay these simple fixes to support incremental MD5 computation for resumed uploads/downloads.

Please let me know if you have any questions, or if my patch is missing anything.

Evan

@mfschwartz
Copy link
Member

Hi Evan,

Something's not right with that patch - a number of gsutil tests fail with
it in place. Can you please try this:

PYTHONPATH=./boto:$PYTHONPATH ./gslib/test_commands.py

Thanks,

Mike

On Wed, Apr 25, 2012 at 6:01 PM, Evan Worley <
reply@reply.github.com

wrote:

Thanks guys.

I took a look at incremental MD5 computation for downloads, and the only
thing needed was a simple check in gsutil. Please have a look at
http://code.google.com/p/gsutil/issues/detail?id=55. Just like the
uploads, resumed downloads still compute the full MD5 at the end.

Both resumed uploads and resumed downloads could still do incremental MD5
computation if they stored the partial checksum in the tracker, but this is
a larger change and I wouldn't want to delay these simple fixes to support
resumed uploads/downloads.

Please let me know if you have any questions, or if my patch is missing
anything.

  • Evan

Reply to this email directly or view it on GitHub:
#711 (comment)

@evanworley
Copy link
Contributor Author

Thanks for the quick feedback, Mike. I wasn't aware of these tests, and the mock key object doesn't have the md5 attribute. I updated the check to also check for the md5 attribute's existence. I've attached an updated diff to the gsutil issue, and it passes all the tests on my machine.

Evan

@mfschwartz
Copy link
Member

Hmm, I don't see the patch, either in email or on github. Can you please
try mailing it again? Maybe just mail it directly to me instead of going
through github relayed email: mfschwartz@google.com.

Thanks for the continued work on this.

Mike

On Wed, Apr 25, 2012 at 7:30 PM, Evan Worley <
reply@reply.github.com

wrote:

Thanks for the quick feedback, Mike. I wasn't aware of these tests, and
the mock key object doesn't have the md5 attribute. I updated the check to
also check for the md5 attribute's existence. I've attached an updated diff
to the gsutil issue, and it passes all the tests on my machine.

Evan


Reply to this email directly or view it on GitHub:
#711 (comment)

msabramo pushed a commit to msabramo/boto that referenced this pull request Nov 28, 2012
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants