upload file with boto, download it with boto3: file gets corrupted (wrong md5 sum) #816
Comments
A clarification: one reason my project still has both boto and boto3 code in it is that I ran into a previous issue (#703, which though closed still affects me, or something like it does). If I change the sample script above to use boto3 for uploading, changing the upload section to:

```python
client = boto3.client('s3')
client.create_bucket(Bucket="mybucket")
client.upload_file(
    Filename="K158154-Mi001716_S1_L001_R1_001.fastq.gz",
    Bucket="mybucket",
    Key="foo/bar.fastq.gz")
```

then I get the following error, which is what happens with #703 as well:
So I tweaked my code to use boto for uploads (when running unit tests; otherwise I use boto3). If I change the uploading section yet again to use a boto3 resource instead of a client:

```python
resource = boto3.resource("s3")
resource.create_bucket(Bucket="mybucket")
bucket = resource.Bucket("mybucket")
with open("K158154-Mi001716_S1_L001_R1_001.fastq.gz", "rb") as f:
    bucket.put_object(Key="foo/bar.fastq.gz", Body=f)
```

...I get the same problem as I had originally (an incorrect md5 sum). So I guess I could get rid of the boto code and replace it with boto3 resource code; then I'd only be dealing with one bug instead of two, but I'd still be stuck with this one... Anyway, for completeness' sake, here is the full version of the script, modified so it only uses boto3 and not boto as well:

```python
import sys
import os
import hashlib

import moto
import boto3


def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


@moto.mock_s3
def doit():
    print("Uploading...")
    resource = boto3.resource("s3")
    resource.create_bucket(Bucket="mybucket")
    bucket = resource.Bucket("mybucket")
    # You can get this file from:
    # https://s3-us-west-2.amazonaws.com/demonstrate-moto-problem/K158154-Mi001716_S1_L001_R1_001.fastq.gz
    with open("K158154-Mi001716_S1_L001_R1_001.fastq.gz", "rb") as f:
        bucket.put_object(Key="foo/bar.fastq.gz", Body=f)

    # download it again
    dlfile = "bar.fastq.gz"
    if os.path.exists(dlfile):
        os.remove(dlfile)
    print("Downloading...")
    client = boto3.client('s3')
    client.download_file(Bucket="mybucket",
                         Key="foo/bar.fastq.gz", Filename=dlfile)

    md5sum = md5(dlfile)
    if md5sum != "6083801a29ef4ebf78fbbed806e6ab2c":
        print("Incorrect md5sum! {}".format(md5sum))
        sys.exit(1)


while True:
    doit()
```
Hi! First, you should update your boto/moto code. You have
The current versions are
I'm running your first example to reproduce the issue. Using moto 0.4.23 (forgot to update that one) and boto/boto3 at the current version, it took four attempts before an error occurred. After updating to moto 0.4.31, I now have 15 successful runs and no errors. All running on Ubuntu 14.04 LTS and Python 2.7.10, so quite similar to your setup.
After a total of 27 runs with your first example code, I got this exception:
Not sure, though, whether that is a bug in boto (it wouldn't be the first...) or in moto. I'll retry using your boto3-only example.
Your second example with only boto3 also fails every 10th or 20th run. The error message was `Incorrect md5sum! 53c03ecd1e61dc7d1cd01f58a1435e8c`, so it downloaded different data each time. The length of the downloaded data is correct, though! A much faster way to reproduce the bug is to upload only once, then download many times. This also shows that the bug occurs during the download.
The documentation says
I recall reading that the http library that moto uses underneath is not thread-safe. After modifying your code to
I now have 6,000 downloads without a single error.
Thanks so much for all the great info. Did you ever get errors after updating moto/boto/boto3? Otherwise, I will make sure I don't use threads when using moto. Thanks again.
Except for the very first test (where I forgot to update moto), all tests were run with the most recent versions. So a version upgrade will not fix this issue. However, it would be interesting to add the following feature to moto: when @mock_s3 is in effect, enforce that
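The rest of the proposed enforcement rule is cut off above, but from context the idea is that the mock should detect access from threads other than the one that activated it. A hypothetical sketch of such a check (not moto's actual code; `SingleThreadGuard` is an illustrative name):

```python
# Remember which thread activated the mock; reject use from any other thread.
import threading

class SingleThreadGuard:
    def __init__(self):
        self._owner = threading.get_ident()

    def check(self):
        if threading.get_ident() != self._owner:
            raise RuntimeError("mocked S3 accessed from a different thread")

guard = SingleThreadGuard()
guard.check()  # same thread: passes silently

results = []

def worker():
    try:
        guard.check()
        results.append("ok")
    except RuntimeError:
        results.append("rejected")

t = threading.Thread(target=worker)
t.start()
t.join()
# results now holds "rejected": the worker thread was turned away
```

A check like this would turn the silent data corruption described in this thread into an immediate, debuggable error.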
Turns out that disabling multi-threading in managed transfer methods is quite easy. With this change in moto, your original code works fine:
@spulec: Should I create a pull request for that, or do you see a more elegant way of doing this?
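The patch itself is not included above. Purely as an illustration of the general approach (disabling multi-threading in managed transfers by patching the transfer-config class), here is a stand-in sketch; `FakeTransferConfig` is a hypothetical placeholder, not boto3's or moto's real code:

```python
# Stand-in for the kind of patch discussed: wrap a transfer-config class's
# __init__ so that every instance is forced to single-threaded mode.
class FakeTransferConfig:  # hypothetical stand-in for boto3's TransferConfig
    def __init__(self, use_threads=True):
        self.use_threads = use_threads

_original_init = FakeTransferConfig.__init__

def _forced_single_thread_init(self, *args, **kwargs):
    kwargs["use_threads"] = False  # override whatever the caller requested
    _original_init(self, *args, **kwargs)

FakeTransferConfig.__init__ = _forced_single_thread_init
```

After the patch, even a caller that explicitly asks for threads gets a single-threaded config, which is what makes the original reproduction script pass.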
@snordhausen does that patch presume that the user is using boto3 and not boto? |
Yes, it only fixes the issue for boto3, which will be good enough for most people. But now that you mention it: the code also assumes that you have boto3 installed. It will fail with an ImportError if you do not have boto3 installed (e.g. because you are only using boto and are working in a virtualenv). It's simple to fix, though. I'll create a new version of the file tomorrow and then you can give it a try.
@dtenenba I created a new branch in my fork which has an improved version of the above patch. For testing, could you
Then, run your boto3-only example code with the hand-patched moto. This should fix your problem.
@dtenenba Did you have a chance to test the patch?
@snordhausen A PR would be very much welcome. I'm also trying to think about how, long-term, we can get away from HTTPretty and move toward a system that works with multiple threads.
I've been experiencing an issue with file uploads to S3, and after reading through this issue I think it's also related. However, I am using multiple processes for parallel uploads. If I upload 10 files in parallel and then check that the bucket has 10 files, it fails and says that there are no files in the bucket. On the other hand, if I run the uploads sequentially, it works as expected. Is there any fix for this while still keeping my application code multi-process?
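A likely explanation (an assumption on my part, not stated in the thread): moto's mock S3 state lives in the memory of the process that activated `@mock_s3`, so writes made by worker *processes* land in each worker's copy of that state and are invisible to the parent. A minimal stdlib-only demonstration of the mechanism, with `STORE` standing in for moto's in-process bucket state:

```python
# Demonstrates why per-process in-memory state (like moto's mock S3)
# is not shared with multiprocessing workers.
import multiprocessing as mp

STORE = {}  # stands in for moto's in-memory bucket contents

def upload(key):
    # Runs in the child process: mutates the CHILD's copy of STORE only.
    STORE[key] = b"data"

if __name__ == "__main__":
    p = mp.Process(target=upload, args=("file1",))
    p.start()
    p.join()
    # The parent's STORE never saw the child's write.
    print("file1" in STORE)  # False
```

This is why sequential uploads (same process as the mock) work while multi-process uploads appear to vanish; threads share memory with the mock, separate processes do not.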
Hi, I'm getting similar issues writing Parquet from Spark to a moto server: a checksum mismatch. There are workarounds (not great from the Python perspective, better for Scala), as follows, but it would be great if this could be fixed. FYI, it looks like there was a similar issue in localstack which is now fixed. And this is how I create my Spark session in Python. I am testing using gluepyspark, which has all the AWS dependencies.

```
E py4j.protocol.Py4JJavaError: An error occurred while calling o117.parquet.
```
There have been a few improvements in how we handle md5sums/etags since 2020 - is anyone still running into issues using the latest version of Moto? |
Hi,
The following code uploads a file to a mock S3 bucket using boto, and downloads the same file to the local disk using boto3. I apologize for bringing both libraries into this, but the code I am testing in real life still uses both (I'm definitely trying to get rid of all the boto code and fully migrate to boto3, but that isn't going to happen right away).
What happens is that the resulting file does not have the same md5 sum as the original, so it has been corrupted at some point (not sure whether during the boto upload or the boto3 download).
This seems to be an issue with moto, because if I comment out the line
`@moto.mock_s3`
(using 'real' S3), the script works fine (I also need to change the bucket name to a unique one to avoid collisions). The script keeps looping (doing the upload/download/md5sum comparison) until it fails (because in my real project this would not happen every time), but this test script seems to fail (for me, anyway) on the first attempt every time.
The test file that it uploads/downloads is available here.
You can download it with:
At this point, if you run md5sum on it, you should get:
`6083801a29ef4ebf78fbbed806e6ab2c`
Here is the test script (`motoprob.py`):
Version info:
Other ways to see that the resulting file is not the same as the original: