s3 cp "Cannot allocate memory" error #5876

Closed
2 of 4 tasks
tthyer opened this issue Jan 19, 2021 · 37 comments
Assignees
Labels
bug This issue is a bug. s3

Comments

@tthyer

tthyer commented Jan 19, 2021

Confirm by changing [ ] to [x] below:

Issue is about usage on:

  • Service API : I want to do X using Y service, what should I do?
  • CLI : passing arguments or cli configurations.
  • Other/Not sure.

Platform/OS/Hardware/Device
What are you running the cli on?
ECS via Batch. awscliv2 is installed via a launch template.

Describe the question
Intermittently, as part of a workflow of batch jobs, I get the following error when trying to download a large file (~45-50GB): download failed...[Errno 12] Cannot allocate memory. This occurs for batch jobs that each have >=3GB of memory specified; the last time it happened, the batch job had 7GB of memory allocated.

The command being executed looks something like /usr/local/aws-cli/v2/current/bin/aws s3 cp --no-progress s3://my-s3-bucket/etc/etc/1000.unmapped.unmerged.bam /tmp/scratch/my-s3-bucket/etc/etc/1000.unmapped.unmerged.bam

Is the python subprocess causing this? What do you recommend to avoid this while running on AWS Batch/ECS?

Logs/output
There are no more informative logs at the moment -- I will add debugging so that the --debug flag is passed the next time this happens.
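For reference, one way to capture that output is to add --debug and redirect stderr to a file; a sketch using the same command as above (the log path is illustrative):

/usr/local/aws-cli/v2/current/bin/aws s3 cp --no-progress --debug \
  s3://my-s3-bucket/etc/etc/1000.unmapped.unmerged.bam \
  /tmp/scratch/my-s3-bucket/etc/etc/1000.unmapped.unmerged.bam \
  2> /tmp/aws-s3-cp-debug.log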

@tthyer tthyer added guidance Question that needs advice or information. needs-triage This issue or PR still needs to be triaged. labels Jan 19, 2021
@kdaily kdaily added investigating This issue is being investigated and/or work is in progress to resolve the issue. s3 and removed needs-triage This issue or PR still needs to be triaged. labels Jan 19, 2021
@kdaily kdaily self-assigned this Jan 19, 2021
@kdaily
Member

kdaily commented Jan 19, 2021

Hi @tthyer 👋🏻 I'm setting up an environment to replicate. I have a couple of clarifying questions.

  1. I think I know the answer to this, but what provisioning model are you using for Batch - on-demand or Fargate?
  2. Can you confirm the method of installation in your launch template - are you following the V2 installation guide from here?

I will try to reproduce, but if you know the answer to either of these it would help:

  1. Does this occur if you use the AWS CLI v1?
  2. Does this occur if you use the AWS CLI v2 Docker container image (link)?

@kdaily kdaily added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jan 19, 2021
@tthyer
Author

tthyer commented Jan 19, 2021

Hi @kdaily! I am running Cromwell on Batch, using the infrastructure provided by AWS Genomics. I've been in communication with them through the Cromwell Slack channel. They suggested that I provide additional RAM to my batch job, but I don't think that's going to help, as I'm already supplying a good deal of memory to this job, and sometimes the job completes without throwing the allocation error.

To answer your questions, first set:

  1. Using on-demand. The compute environment is defined in gwfcore-batch.template.yaml, logical id OnDemandComputeEnv. The only change I've made to this is to add bigger instances to the list.
  2. Line 107 of the LaunchTemplate is where awscliv2 is installed -- looks like it is following the instructions.

Second set:

  1. Have not tried switching to v1 yet. Is that recommended? If so, why?
  2. Not using the awscliv2 docker image: this is already inside a container.

What I've currently got in flight is just to run the cp commands with --debug. It may take some time to get anything from this because, as previously mentioned, this error only appears intermittently and on different jobs in the same workflow.

I was wondering in particular whether there are some config options I should try?

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jan 19, 2021
@tthyer
Author

tthyer commented Jan 21, 2021

Another data point: I had assumed that the batch compute environments were defaulting to Amazon Linux 2, but that wasn't the case; I was still getting Amazon Linux 1. I've updated the compute environment to use AL2.

@kdaily
Member

kdaily commented Jan 21, 2021

I'm testing this with the defaults as well (AL1), on both my own stack and the AWS Genomics stack. I just uploaded a 50GB file to S3 and triggered some jobs to try to reproduce.

I suggested v1 because v2 is built with PyInstaller and uses its own Python executable, which has caused some other hard-to-debug issues, so I was curious whether I could rule that out. Some of the configuration options (namely max_queue_size) can impact memory, but your memory limit is high enough that I can't imagine it would be a problem. This will be a multipart transfer, and therefore multithreaded, so it's possible that could be a factor.

I know there are a few places where memory can be configured in compute environments and job definitions - is there enough memory allocated at all levels (the instance type selected as well as the memory made available to the container)?

@tthyer
Author

tthyer commented Jan 22, 2021

Thanks for the background on the awscli versions.
For this workflow, Batch/ECS is regularly provisioning c5.9xlarge instances. Those have 72GiB memory.

@aws aws deleted a comment from tthyer Jan 22, 2021
@kdaily
Member

kdaily commented Jan 22, 2021

I used the amazon/aws-cli container image, requested 4 vCPUs and 8 GB of memory for the container, and ran on the default ECS-optimized AL1 instance. All was OK; the 50GB transfer finished in 8 minutes. I also tried with 4 GB of memory, and that was successful too.

The number of concurrent requests could be an issue. By default, the S3 client uses 10 concurrent requests (and thus 10 threads). Altering this parameter inside a Batch job (inside a container) is a bit involved, but I'm interested to see if I can reproduce it by artificially increasing the number of requests significantly.

Given the sporadic nature of this issue, I'm afraid that getting debug logs of a failure may be the only recourse. Failing that, checking ulimits or swap space restrictions on the instance and/or container being used could be valuable.
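For reference, the transfer settings mentioned above can be lowered from inside the container with aws configure set; a minimal sketch (the values are illustrative, not a recommendation):

# reduce S3 transfer concurrency and queue size for the default profile
aws configure set default.s3.max_concurrent_requests 2
aws configure set default.s3.max_queue_size 100
aws configure set default.s3.multipart_chunksize 16MB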

@kdaily
Member

kdaily commented Jan 22, 2021

I'm going to reach out to the AWS Genomics team as well to see if anything else is known on their side.

@kdaily kdaily removed the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Jan 22, 2021
@henriqueribeiro

I'm facing exactly the same problem as @tthyer running some workflows on Cromwell. The error appears intermittently and on different jobs in the same workflow.
@tthyer did you have any luck updating to AL2?

@tthyer
Author

tthyer commented Jan 25, 2021

@henriqueribeiro, I've only conducted a few workflow runs since upgrading, and the bug has not recurred yet, but it was very intermittent for us. I'll keep you posted. I opened an issue in the aws-genomics-workflows repository (see just above) for us to discuss different environmental changes we're making in that infrastructure.

@kdaily
Member

kdaily commented Jan 26, 2021

@tthyer what version of Cromwell are you using?

@henriqueribeiro

@kdaily I can confirm that it's happening with Cromwell 55.

@tthyer
Author

tthyer commented Jan 26, 2021

I'm also using 55

@kdaily
Member

kdaily commented Jan 27, 2021

Thanks. I'm in contact with AWS Genomics, and will update once I hear more.

@kdaily
Member

kdaily commented Feb 1, 2021

It's possible that if job scheduling ends up placing many jobs on the same host instance, there could be memory issues related to how containers use memory and how Python does or does not respect those limits:

From https://docs.docker.com/config/containers/resource_constraints/#limit-a-containers-access-to-memory:

Docker can enforce hard memory limits, which allow the container to use no more than a given amount of user or system memory, or soft limits, which allow the container to use as much memory as it needs unless certain conditions are met, such as when the kernel detects low memory or contention on the host machine.

My AWS Genomics contact also noted that container 'swappiness' can play a role.
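For context, these are the Docker-level knobs being discussed; a sketch with hypothetical values rather than a recommended configuration:

# --memory sets a hard limit and --memory-reservation a soft limit;
# --memory-swap equal to --memory disables swap for the container,
# and --memory-swappiness tunes how aggressively its anonymous pages are swapped.
docker run --rm \
  --memory=7g --memory-reservation=3g \
  --memory-swap=7g --memory-swappiness=60 \
  amazon/aws-cli s3 cp s3://my-s3-bucket/key /tmp/key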

I came across this (external) post regarding large disk writes and memory issues in Docker containers that might shed some light: https://codefresh.io/docker-tutorial/docker-memory-usage/.

I also came across this memory issue in containers for the V2 client:

#5047

From that, we can gather that PyInstaller (the tool used to build the AWS CLI V2 bundle) makes some choices about where it writes, possibly including /dev/shm, which could also be a problem here.

Long story short, there might be an underlying issue with Python and large disk writes in containers. I don't have a good short-term solution at this time. I'm going to add a new issue to investigate memory usage of the V2 client in containers.

I'll mark this as response-requested in case you get more details about the running environment when a failure occurs; that will close this in 7 days. As always, feel free to reopen if that window passes and you get more information!

@kdaily kdaily added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 1, 2021
@henriqueribeiro

hi @kdaily,

Following @tthyer's suggestion, I updated my compute environment to use AL2 and re-ran the jobs. The same error appeared. I've also added the --debug flag and have the csv file with the logs from when the error happened. Do you think it's worth sharing with you?

@kdaily
Member

kdaily commented Feb 1, 2021

Sure! Can you sanitize/redact anything like account numbers and upload it here?

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 1, 2021
@henriqueribeiro

Here it is: log-events-viewer-result.log

Not sure if there is anything interesting in there. If you have any suggestions for debugging this problem, please let me know and I can run some more workflows.

Also, I set the max_concurrent_requests parameter to 2 just to test.

@kdaily
Member

kdaily commented Feb 1, 2021

@henriqueribeiro,

Thanks for the logs, and for lowering max_concurrent_requests. You're right that there doesn't seem to be anything of interest - the lack of a stack trace for the memory error is unfortunate.

Does this reproduce for you regularly? If it does, this may be a big ask depending on how your environment is configured, but can you run this with AWS CLI v1 instead of v2? If I can rule out anything related to the PyInstaller bundle (or confirm it), that would be great!

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 2, 2021
@henriqueribeiro

I just tried running the workflow with AWS CLI 1 and I got exactly the same error. Below is the log file:

*** LOCALIZING INPUTS ***
aws-cli/1.15.15 Python/2.7.18 Linux/4.14.203-156.332.amzn2.x86_64 exec-env/AWS_ECS_EC2 botocore/1.10.15
download: s3://clt-sv-resources/broad-references/v0/Homo_sapiens_assembly38.dict to clt-sv-resources/broad-references/v0/Homo_sapiens_assembly38.dict
download: s3://clt-cromwell/cromwell-execution/GATKSVPipelineBatch/a5bb3e54-8741-4930-873c-b2af500c109b/call-Module00aBatch/Module00aBatch/6177be93-3c4a-408b-a185-231b2d1384e6/call-Module00a/shard-110/Module00a/4c5271c0-0d5b-43c7-a1d1-be1b5e249150/call-PESRCollection/PESRCollection/59ca0b51-7971-4f65-8a6b-a660ed2f8e65/call-RunPESRCollection/script to clt-cromwell/cromwell-execution/GATKSVPipelineBatch/a5bb3e54-8741-4930-873c-b2af500c109b/call-Module00aBatch/Module00aBatch/6177be93-3c4a-408b-a185-231b2d1384e6/call-Module00a/shard-110/Module00a/4c5271c0-0d5b-43c7-a1d1-be1b5e249150/call-PESRCollection/PESRCollection/59ca0b51-7971-4f65-8a6b-a660ed2f8e65/call-RunPESRCollection/script
download: s3://clt-cromwell/cromwell-execution/GATKSVPipelineBatch/f7b52154-26a0-4e7b-a0da-3ab9a70d38dd/call-Module00aBatch/Module00aBatch/f90a1ff1-610b-448d-b198-c0ab0f9fba56/call-Module00a/shard-110/Module00a/252137e1-80a0-4751-b2ed-7e5c56c61d31/call-CramToBam/CramToBam/5cb4991d-ad34-4540-8e8c-3b6cea61183c/call-RunCramToBamRequesterPays/cacheCopy/NA18530.final.bam.bai to clt-cromwell/cromwell-execution/GATKSVPipelineBatch/f7b52154-26a0-4e7b-a0da-3ab9a70d38dd/call-Module00aBatch/Module00aBatch/f90a1ff1-610b-448d-b198-c0ab0f9fba56/call-Module00a/shard-110/Module00a/252137e1-80a0-4751-b2ed-7e5c56c61d31/call-CramToBam/CramToBam/5cb4991d-ad34-4540-8e8c-3b6cea61183c/call-RunCramToBamRequesterPays/cacheCopy/NA18530.final.bam.bai
download: s3://clt-sv-resources/broad-references/v0/Homo_sapiens_assembly38.fasta to clt-sv-resources/broad-references/v0/Homo_sapiens_assembly38.fasta
download: s3://clt-sv-resources/broad-references/v0/Homo_sapiens_assembly38.fasta.fai to clt-sv-resources/broad-references/v0/Homo_sapiens_assembly38.fasta.fai
download failed: s3://clt-cromwell/cromwell-execution/GATKSVPipelineBatch/f7b52154-26a0-4e7b-a0da-3ab9a70d38dd/call-Module00aBatch/Module00aBatch/f90a1ff1-610b-448d-b198-c0ab0f9fba56/call-Module00a/shard-110/Module00a/252137e1-80a0-4751-b2ed-7e5c56c61d31/call-CramToBam/CramToBam/5cb4991d-ad34-4540-8e8c-3b6cea61183c/call-RunCramToBamRequesterPays/cacheCopy/NA18530.final.bam to clt-cromwell/cromwell-execution/GATKSVPipelineBatch/f7b52154-26a0-4e7b-a0da-3ab9a70d38dd/call-Module00aBatch/Module00aBatch/f90a1ff1-610b-448d-b198-c0ab0f9fba56/call-Module00a/shard-110/Module00a/252137e1-80a0-4751-b2ed-7e5c56c61d31/call-CramToBam/CramToBam/5cb4991d-ad34-4540-8e8c-3b6cea61183c/call-RunCramToBamRequesterPays/cacheCopy/NA18530.final.bam [Errno 12] Cannot allocate memory

I also added a very rudimentary memory logger during the localization of the inputs. Attached is a plot of the memory over time. The vertical blue line is the start of the transfer of the last file, where the memory blows up.
[attached image: bokeh_plot of memory usage over time, with a vertical blue line marking the start of the last file transfer]

As you can see, the buff/cache memory is increasing but doesn't seem to be "huge".

Do you have any suggestions on what to do next to debug the problem?
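(A minimal sketch of the kind of rudimentary memory logger described above - assumed, not the actual script used:)

# record used and buff/cache memory in MB every 5 seconds while the transfer runs
while true; do
  echo "$(date +%s) $(free -m | awk '/^Mem:/ {print $3, $6}')" >> /tmp/mem.log
  sleep 5
done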

@kdaily
Member

kdaily commented Feb 8, 2021

Hi @henriqueribeiro, ok, good to know that it occurs in AWS CLI version 1. I would note that this version is almost three years old and uses Python 2, which will no longer be supported soon (July 2021). However, given that the same issue occurs, I don't think that's the underlying cause.

What are the units on the y-axis?

@kdaily kdaily added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 8, 2021
@henriqueribeiro

Ah, sorry, I missed that.
The y-axis is MB.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 8, 2021
@henriqueribeiro

After reverting to AWS CLI v2, I added some swap space to the EC2 instances, and after running the workflow twice I haven't gotten the memory allocation error anymore. I also noticed that the swap space was being used, so it seems to be working.
Is the AWS CLI making any assumptions about swap space?

@kdaily
Member

kdaily commented Feb 11, 2021

Thanks for the update @henriqueribeiro - I'll check on that. I think that still points to a container issue, reviewing this:

https://docs.docker.com/config/containers/resource_constraints/

Can you verify what the memory settings are for the container, and can you change them? I'm wondering if the memory swappiness of the container is causing this.

@kdaily kdaily added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 11, 2021
@henriqueribeiro

The memory value depends on the task that will run; it can be changed in the task definition.
Regarding maxSwap (--memory-swap in Docker) and swappiness (--memory-swappiness in Docker), the values are the defaults, so the container uses the swap configuration from the host.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 12, 2021
@kdaily kdaily added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Feb 12, 2021
@kdaily
Member

kdaily commented Feb 16, 2021

Thanks @henriqueribeiro - still looking into this. I don't think this is simply an interplay between how much memory the CLI is using and the memory configuration on the containers. I'm going to mark this as a bug so that we can investigate it further.

In my test of downloading a 50GB file, the CLI peaked at about 380MB of memory usage.

@kdaily kdaily added bug This issue is a bug. and removed guidance Question that needs advice or information. investigating This issue is being investigated and/or work is in progress to resolve the issue. labels Feb 16, 2021
@pjongeneel

@henriqueribeiro I'm having this exact problem as well. How did you add swap to your docker container ("I added some swap space to the EC2 instances")? Did this fix your issue?

@kdaily
Member

kdaily commented Jul 27, 2021

@pjongeneel - I've been looking into this a bit more. @henriqueribeiro was using AWS CLI v1 with Python 2 - can you try with Python 3? (We've since dropped Python 2 support in new versions of the CLI and Python SDK.) Thanks!

@microbioticajon

microbioticajon commented Jul 29, 2021

I have just built a brand new Batch cluster and have started getting "cat: write error: Cannot allocate memory" on cat of all things (all this task does is download two 2GB files from S3, cat them together, then copy them back up). If the workflows are triggered one by one, these tasks complete OK. If more than one workflow is triggered at once, we seem to get a mix of BatchJobException with cat throwing "Cannot allocate memory" or the container dying with "OutOfMemoryError: Container killed due to memory usage". Our production cluster is seemingly unaffected.

All containers:

  • vCPU: 1
  • Memory: 1GB
  • awscliv1
  • default AMI
  • the only Docker config change is pointing the Docker root at a different EBS volume
  • not using Cromwell as the scheduler

I appreciate this is not much to go on but we have just started experiencing this issue and are currently trying to work out what is happening. I will add more information as I come across it.

EDIT: I also appreciate that our problem does not seem to be specifically aws-cli related, so I will not add any further clutter to this thread.

EDIT: I think I got to the bottom of our problem. Our Batch cluster was configured with an additional EBS volume formatted with xfs. This was our Docker root. The Docker containers were configured to create an anonymous volume on this EBS volume. After a bit of reading we came across this issue: docker/for-linux#651

Apparently cgroup writeback does not support the xfs filesystem. Thus, write operations incorrectly calculate the available dirty_background_bytes based on the total available system memory, not the memory assigned to the container. This causes OOM errors and memory allocation errors on seemingly innocuous tasks. Since changing the filesystem on this EBS volume, the above errors have gone away. Our problem still might not be related to this issue, but if people are experiencing memory issues on Batch tasks with heavy write operations, it might be worth checking your underlying filesystem.
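For anyone wanting to check the same thing, a quick sketch (the path is illustrative; use your actual Docker data root):

# show where Docker keeps its data, then check which filesystem backs it
docker info --format '{{.DockerRootDir}}'
df -T /var/lib/docker    # the Type column shows xfs, ext4, btrfs, etc.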

@kdaily
Member

kdaily commented Aug 4, 2021

Thanks for your research, @microbioticajon! I'm investigating how this relates to the EKS/Kubernetes issues mentioned there as well.

@microbioticajon

Looks like I might have spoken too soon. I'm no longer getting OOM errors and cat+redirect seems happy, but I'm still getting the odd memory allocation error thrown by aws s3 cp under load:
download failed: s3://my_bucket/blah_blah_blah/5235_3_2.fastq.gz [Errno 12] Cannot allocate memory
We implemented https://github.com/awslabs/amazon-ebs-autoscale and so are using btrfs on our Docker root volume.

AWS CLI v2.2.26, installed with:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install

Docker info on the worker node yields:
Server Version: 19.03.13-ce
Cgroup Driver: cgroupfs
Kernel Version: 4.14.232-176.381.amzn2.x86_64
Operating System: Amazon Linux 2
Docker Root Dir: /autoscale-scratch

@microbioticajon

I think I may have resolved our awscli cp memory allocation error on our Batch cluster.

By default, the Batch/ECS-optimised AMI does not provision any swap. The Batch docs suggest that the default job definition values for the linuxParameters section are swappiness: 60 and maxSwap: 2x the allocated memory, which seems contradictory. I tried setting swappiness: 0 and maxSwap: 0 in the job definition, but the awscli cp memory allocation errors still occurred.

I added a small swap partition to the node launch_template and initialised the volume in the node user_data, and we are no longer getting memory allocation errors from the awscli. I'm also limiting maxSwap to 500MB in the job definitions in case it tries to use more swap than is available on large-memory jobs.
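A rough sketch of what such user_data swap provisioning might look like, using a swap file rather than a partition for brevity (sizes and paths are illustrative, not the actual template):

# create and enable a small swap file on the Batch node at boot
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile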

Sorry I cannot be more specific but perhaps this information is useful to someone more knowledgeable than myself.

@kdaily
Member

kdaily commented Sep 29, 2021

Thanks for that comment @microbioticajon!

We experienced a similar issue with Kubernetes, and it had to do with how memory usage was being reported: cached memory was being counted toward total memory usage. The operating system was keeping the full size of a file downloaded from S3 in the page cache, so some memory reporting added this to the total even though that memory could be freed for use at any moment by the operating system.

I think we have enough evidence to close this out based on your Batch/ECS experience.

@kdaily kdaily closed this as completed Sep 29, 2021
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

@microbioticajon

microbioticajon commented Sep 30, 2021

Hi @kdaily

Many thanks for the insight.

Since adding a swap partition to the Batch node, the number of aws s3 cp-related memory allocation errors has dropped off considerably. However, in the last week we experienced a few failures when running on large cluster nodes - possibly too many jobs exceeding the maximum swap available? The problem with adding a simple swap volume is that it is configured as part of the launch template and does not scale with either the number of jobs or the size of the node.

We have just changed the maxSwap configuration to match the requested job memory. According to the Docker run docs this should turn off swap usage altogether: https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details

If --memory-swap is set to the same value as --memory, and --memory is set to a positive integer, the container does not have access to swap. See Prevent a container from using swap.

So far we have had no failures under load, so it is no worse than adding a fixed swap volume at least (I will update if that changes).

So perhaps to summarise solutions from this thread:

  • don't use xfs for underlying volumes
  • add a swap partition to your Batch/EKS nodes and limit Batch job swap usage via maxSwap (might run into problems on large nodes), or
  • explicitly switch off swap altogether by setting linuxParameters.maxSwap = memory (see the sketch below)
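A hedged sketch of that last option via the CLI, with hypothetical names and sizes (in practice the job definition usually comes from your scheduler or templates):

# register a Batch job definition whose maxSwap equals its memory, disabling container swap
aws batch register-job-definition \
  --job-definition-name s3-copy-job \
  --type container \
  --container-properties '{
    "image": "amazon/aws-cli",
    "vcpus": 1,
    "memory": 4096,
    "command": ["s3", "cp", "s3://my-bucket/key", "/tmp/key"],
    "linuxParameters": {"maxSwap": 4096, "swappiness": 0}
  }'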

@martinber

martinber commented Apr 13, 2023

EDIT: I was able to solve this; it was a kernel issue (see my last edit).

I'm currently debugging out-of-memory problems and I think they're related to your issue, @microbioticajon; I would be interested to know whether you found anything in the last year.

The host has around 400GB of RAM and no swap. Several Docker containers run with memory limited to around 5GB (not sure if they have swap, since the --memory and --memory-swap options of Docker and maxSwap in AWS are confusing me). The containers are simply writing a lot of data to a gp3 ext4 EBS volume.

  • Sometimes I see out-of-memory errors when using the boto3 library in Python to download files, indicating [Errno 12] Cannot allocate memory
  • Also sometimes in simple C code that does a lot of fread() and fwrite() to convert the data type. fwrite() returns a number of written elements smaller than requested while errno is set to 12. When using perror("ERROR") I see ERROR: Cannot allocate memory

Interestingly, it runs out of RAM when writing to disk (e.g. with fwrite()) but never during a nearby malloc(). On the host there is enough disk space. Just after a failure I look at /proc/meminfo, where I see 400GB of MemTotal, 300GB of MemFree, 380GB of MemAvailable, 60GB of Dirty, and 500GB of Committed_AS. I also observed the process inside the container with the ps command and only see around 200MB of RAM usage in terms of RSS or anonymous memory (but usage of cache pages should be very high).

I'm thinking there is something strange involving Docker, cgroups, cache pages, writeback, and EBS (Elastic Block Store), where the Linux kernel or Docker gets confused and runs out of memory instead of using free memory or freeing cache pages. I'm still learning all of these things. I will update if I find something.
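One way to see the per-container accounting (rather than the host-wide /proc/meminfo) is to read the container's cgroup files; a sketch assuming a cgroup v1 host (paths differ under cgroup v2):

# run inside the container, or under /sys/fs/cgroup/memory/docker/<container-id>/ on the host
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
grep -E '^(cache|dirty|writeback) ' /sys/fs/cgroup/memory/memory.stat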

PS: This looks related and also this


Edit 2023-04-14:

I was able to reproduce my problem more consistently in the case where fwrite() fails. It happens on EC2 instances but not on my laptop.

Details

This time I used an EC2 instance with 60GB of RAM and ran 3 Docker containers on it with --memory=50m --memory-swap=50m that simply did fwrite() in a loop. Using docker stats, I see that the RAM usage of each increases, since it includes page cache usage too, and after a few seconds the containers fail.
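A rough shell approximation of that repro, using dd's buffered writes in place of the C fwrite() loop (image, paths, and sizes are illustrative):

# three tightly memory-limited containers doing sustained buffered writes to the same volume
for i in 1 2 3; do
  docker run -d --rm --memory=50m --memory-swap=50m \
    -v /data:/data amazonlinux \
    dd if=/dev/zero of=/data/fill-$i.bin bs=1M count=20000
done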

It happened with all the kernels I tried (4.14.309-231.529.amzn2.x86_64, 4.14.248-189.473.amzn2.x86_64 and 5.15.104-63.140.amzn2.x86_64). Sometimes it failed while doing fwrite() and sometimes the processes were killed by the OOM killer. Example of dmesg log:

[  344.947933] myapp invoked oom-killer: gfp_mask=0x1101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE|__GFP_SKIP_KASAN_POISON), order=0, oom_score_adj=0
[  344.960418] CPU: 9 PID: 22804 Comm: myapp Not tainted 5.15.104-63.140.amzn2.x86_64 #1
[  344.967635] Hardware name: Amazon EC2 m5.4xlarge/, BIOS 1.0 10/16/2017
[  344.972590] Call Trace:
[  344.975633]  <TASK>
[  344.978519]  dump_stack_lvl+0x34/0x48
[  344.982134]  dump_header+0x4a/0x1f4
[  344.985661]  oom_kill_process.cold+0xb/0x10
[  344.989495]  out_of_memory+0xed/0x2d0
[  344.993076]  mem_cgroup_out_of_memory+0x135/0x150
[  344.997159]  try_charge_memcg+0x62a/0x6f0
[  345.000977]  charge_memcg+0x40/0x90
[  345.004506]  __mem_cgroup_charge+0x29/0x80
[  345.008321]  __add_to_page_cache_locked+0x2d2/0x330
[  345.012491]  ? scan_shadow_nodes+0x30/0x30
[  345.016292]  add_to_page_cache_lru+0x48/0xd0
[  345.020169]  pagecache_get_page+0xdb/0x340
[  345.023954]  grab_cache_page_write_begin+0x1d/0x40
[  345.028082]  iomap_write_begin+0x164/0x280
[  345.031920]  iomap_write_iter+0xb7/0x1b0
[  345.035643]  iomap_file_buffered_write+0x75/0xd0
[  345.039618]  xfs_file_buffered_write+0xba/0x2c0
[  345.043640]  ? __handle_mm_fault+0x4a6/0x650
[  345.047511]  new_sync_write+0x11c/0x1b0
[  345.051190]  vfs_write+0x1d9/0x270
[  345.054659]  ksys_write+0x5f/0xe0
[  345.058093]  do_syscall_64+0x3b/0x90
[  345.061664]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[  345.065917] RIP: 0033:0x7fbd92db5833
[  345.069484] Code: 8b 15 61 26 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
[  345.083512] RSP: 002b:00007fff49eaff08 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  345.090633] RAX: ffffffffffffffda RBX: 0000000000017000 RCX: 00007fbd92db5833
[  345.095842] RDX: 0000000000017000 RSI: 000000000041f990 RDI: 0000000000000003
[  345.101049] RBP: 000000000041f990 R08: 0000000000000000 R09: 00000000004384a0
[  345.106250] R10: 000000000000006f R11: 0000000000000246 R12: 0000000000017000
[  345.111450] R13: 0000000000437960 R14: 0000000000017000 R15: 00007fbd92e94880
[  345.116630]  </TASK>
[  345.119557] memory: usage 51200kB, limit 51200kB, failcnt 0
[  345.124054] memory+swap: usage 51200kB, limit 51200kB, failcnt 99657
[  345.128815] kmem: usage 1552kB, limit 9007199254740988kB, failcnt 0
[  345.133515] Memory cgroup stats for /docker/e74aa1b18c5dd7d6e8f4323f76b6c60b0a53d82a56fdcf08c9518c1b9032d202:
[  345.133592] anon 3854336
               file 46985216
               kernel_stack 32768
               pagetables 81920
               percpu 576
               sock 0
               shmem 0
               file_mapped 0
               file_dirty 0
               file_writeback 33103872
               swapcached 0
               anon_thp 0
               file_thp 0
               shmem_thp 0
               inactive_anon 3846144
               active_anon 8192
               inactive_file 28712960
               active_file 18268160
               unevictable 0
               slab_reclaimable 1258856
               slab_unreclaimable 158296
               slab 1417152
               workingset_refault_anon 0
               workingset_refault_file 1897
               workingset_activate_anon 0
               workingset_activate_file 16
               workingset_restore_anon 0
               workingset_restore_file 9
               workingset_nodereclaim 4376
               pgfault 2811
               pgmajfault 157
               pgrefill 11261102
               pgscan 11846841
               pgsteal 381574
               pgactivate 11449817
               pgdeactivate 11261020
               pglazyfree 0
               pglazyfreed 0
               thp_fault_alloc 0
               thp_collapse_alloc 0
[  345.231865] Tasks state (memory values in pages):
[  345.235940] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  345.243319] [  22744]     0 22744     3288      991    65536        0             0 python3
[  345.250650] [  22804]     0 22804      592       70    36864        0             0 myapp
[  345.257922] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=e74aa1b18c5dd7d6e8f4323f76b6c60b0a53d82a56fdcf08c9518c1b9032d202,mems_allowed=0,oom_memcg=/docker/e74aa1b18c5dd7d6e8f4323f76b6c60b0a53d82a56fdcf08c9518c1b9032d202,task_memcg=/docker/e74aa1b18c5dd7d6e8f4323f76b6c60b0a53d82a56fdcf08c9518c1b9032d202,task=python3,pid=22744,uid=0
[  345.280537] Memory cgroup out of memory: Killed process 22744 (python3) total-vm:13152kB, anon-rss:3580kB, file-rss:384kB, shmem-rss:0kB, UID:0 pgtables:64kB oom_score_adj:0

Apparently this is not a problem with the AWS CLI or with Boto3. It is a problem where Docker containers limited in memory can fail even when "standard" RAM usage is low. Apparently, when heavy writing is done, large page caches are created in RAM, and that goes over the limit imposed on the Docker containers.

Normally, the Linux kernel reduces these page caches and no problems happen, but I think there are issues with the Linux kernel used in EC2 or with AWS Elastic Block Store.

Edit 2023-10-19:

In the end, there is an issue with some Amazon Linux kernels: when a Docker container/cgroup writes to disk, the write buffer/page cache grows and can fill the RAM allocated to the container/cgroup, and the process is killed or the fwrite() fails. This is a bug and the behavior doesn't make sense: the write buffer/page cache should force a write to disk and shrink, or Linux should use a bit of extra RAM outside the amount allocated to the container/cgroup.

The solution I used was to add swap to the host (not necessarily to the Docker container/cgroup). With a few MB of swap, the Linux kernel will use it and the error won't happen. I also upgraded the kernel.

It was also confirmed by Amazon support that at least these kernels have the problem: 4.14.322-246.539.amzn2, 5.10.192-183.736.amzn2, 5.15.128-81.144.amzn2. In any case, I observed that the issue only happens for containers with allocated RAM below a certain threshold (e.g. only for containers/cgroups with less than 900MB of RAM), and that when upgrading the kernel this threshold gets lower (e.g. the bug happens only for containers with less than 100MB of RAM). Therefore I recommend upgrading the kernel as much as possible.
