High lora vram usage after update #4343
Comments
I also noticed an increase in inference time, with VRAM going very close to 23.6GB and maybe offloading to RAM while running a lora, even on fp8. |
Same thing I have noticed. |
Same thing here. I reinstalled the standalone from the README, reinstalled pytorch, and it still eats all my VRAM and causes comfyui to crash after a couple of generations every time. |
Same; one generation is slow, the next is faster, then slower, and even slower. |
Same, I got an out-of-VRAM error using a 100MB lora. |
I noticed that the …
Before the changes I could stay under 12GB total VRAM usage when loading a …
After the changes, I run into the 16GB memory limit when the FLUX transformer unet is loaded. |
Can you try running it with: --disable-cuda-malloc to see if it improves things? |
it works fine with --disable-cuda-malloc |
Error without --disable-cuda-malloc:
Error occurred when executing KSampler: Allocation on device
File "D:\sd-ComfyUI\ComfyUI\execution.py", line 152, in recursive_execute |
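For context, a minimal sketch of what the flag toggles, assuming main.py is still the ComfyUI entry point and that ComfyUI honours an allocator setting that is already present rather than overriding it: by default the asynchronous CUDA allocator is used, and the flag keeps PyTorch's stock caching allocator, which can also be selected through PyTorch's documented PYTORCH_CUDA_ALLOC_CONF variable.
# Launch with the flag discussed in this thread (entry point assumed unchanged):
python main.py --disable-cuda-malloc
# Roughly equivalent, selecting the allocator backend via PyTorch directly;
# "native" is the stock caching allocator, "cudaMallocAsync" the async one.
# (Assumes ComfyUI does not override a value that is already set.)
PYTORCH_CUDA_ALLOC_CONF=backend:native python main.py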
Update and let me know if it's fixed. |
Updated; it runs fine without adding '--cuda-malloc'. |
After updating, and without specifying any cuda-malloc related args, vram usage and inference speed are back to normal. Restarted and ran several times without any issues. It was consistently failing before the update. Thanks, comfy. |
On my system with a 3070Ti 8GB VRAM and 32GB RAM, I have the inverse problem. The default CUDA malloc was providing relatively good performance with Flux, without a noticeable downgrade when loading a LoRA. The new default degrades the performance roughly 5 times. The |
Seconding Danamir. I'm running on a laptop featuring a 4060 8GB VRAM and 16GB RAM. Before the update, flux nf4 was using around 7.4GB of VRAM and 30GB of RAM when generating a high-res image. But after the update, the VRAM usage went up to 8.3GB, exceeding the available dedicated VRAM; as a result, the render time went from 3.20 mins to a whopping 24 mins! Luckily, --cuda-malloc fixed the issue and now it's behaving as it did prior to the update. |
After this went through, some images generated with Lora come out blurry for some reason. |
Reverted the change because it was causing too many issues. If you encounter the lora issue and need to use --disable-cuda-malloc to fix it let me know what your system specs are. |
Yeah, I think that was the safest thing to do. Thanks for your quick response! |
I have an RTX 2060 6GB. Original flux dev took 5 minutes per generation and flux nf4 took one hour. I updated everything and tried with and without --disable-cuda-malloc, and the nf4 version still takes forever. Is it possible that the RTX 2000 series is not supported? |
In my case, --disable-cuda-malloc combined with --lowvram solves the issue at fp8, but not at fp16. The loras made with ai-toolkit are still loaded very very slowly and with extra vram usage. |
Not specifically Lora related, but since updating to use Flux I've noticed some of my old SDXL workflows that were right on the edge of my machine's memory limits now OOM. I pulled the latest revert and tried the cuda malloc flags and it didn't help. I reverted to an older commit from before Flux just to be safe and things work again as normal. Peeking at the commit history I found this: b8ffb29. Wondering if that could be the cause, as it appears to increase the amount of memory required by adding a larger buffer (100MB to 300MB)? |
Can you modify only that specific value in the latest version and test it for us? |
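A hedged sketch of the test being requested: revert only the enlarged buffer constant (the 300 vs 100 values quoted later in this thread) on an otherwise current checkout, then restart and re-run the failing workflow. The comfy/model_management.py path is an assumption; grep first in case the constant lives elsewhere.
cd ComfyUI
# Locate the enlarged reserve buffer (value taken from the commit discussed above).
grep -rn "300 \* 1024 \* 1024" comfy/
# Hypothetical one-line revert back to the old 100MB value, then restart ComfyUI and retest.
sed -i 's/300 \* 1024 \* 1024/100 * 1024 * 1024/' comfy/model_management.py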
It solved the issue for me too. Using Flux with fp8 before, I couldn't load some anime/fae Loras; now with these 2 arguments I can load multiple loras I couldn't before. I'm on a 4080, 64GB. |
4080. 32GB system RAM. Win 10. Loading Flux Dev in fp8 with an fp8 text encoder. Driver version didn't make a difference. Had problems on both 1 year old drivers and latest game ready drivers. |
I have a 3090 with 16GB of system RAM. I run Flux Dev with fp16, with one Lora at a time using the normal vram mode. I still have an issue after recent commits, however the --disable-cuda-malloc command seems to fix it. When I run that command, the VRAM usage goes back down once a creation finishes. For me it idles around 14GB. |
I'm running a RTX 3080 10GB, 64GB DDR5, Zen4 7950X, comfyui portable and noticed this behaviour after updating through the manager add-on in the last couple of days or so. I went from ~2s/it up to 20+ s/it for an identical workflow. I reinstalled (I'd kept the archive - commit hash b334605) and everything went back to normal with that version. |
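Pinning to a known-good revision like the one mentioned above is easy to do and undo if you installed via git; a sketch, assuming a stock git checkout and that master is the branch you normally track:
cd ComfyUI
# Pin to the older revision quoted in the comment above...
git checkout b334605
# ...and later return to the tracked branch once a fix lands.
git checkout master && git pull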
I wonder if mimalloc would have any place here. We use it on other tools/use-cases with memory-intensive workloads to override the default allocator. Another option would be jemalloc, which seems to offer some benefits for different operations with dense compute calls, e.g. using SwiGLU. Here is an example:
enable_mimalloc() {
! [ -z "${MIMALLOC_DISABLE}" ] && echo "mimalloc disabled." && return
LIBMIMALLOC_PATH='/usr/lib64/libmimalloc.so'
if ! [ -f "${LIBMIMALLOC_PATH}" ]; then
echo "mimalloc doesn't exist. You might really want to install this."
else
echo "Enabled mimalloc."
export MIMALLOC_ALLOW_LARGE_OS_PAGES=1
export MIMALLOC_RESERVE_HUGE_OS_PAGES=0 # Reserve N 1GiB huge pages at startup (0 = reserve none)
export MALLOC_ARENA_MAX=1 # Tell glibc to only allocate memory in a single "arena".
export MIMALLOC_PAGE_RESET=0 # Don't reset (madvise away) pages as soon as they become empty
export MIMALLOC_EAGER_COMMIT_DELAY=4 # Delay eager commit so the first few segments of a thread aren't backed by huge pages
export MIMALLOC_SHOW_STATS=0 # Don't print allocation stats on exit
export LD_PRELOAD="${LD_PRELOAD} ${LIBMIMALLOC_PATH}"
return
fi
LIBHUGETLBFS_PATH="/usr/lib64/libhugetlbfs.so"
if [ -f "${LIBHUGETLBFS_PATH}" ]; then
export LD_PRELOAD="${LD_PRELOAD} ${LIBHUGETLBFS_PATH}"
export HUGETLB_MORECORE=thp
export HUGETLB_RESTRICT_EXE=python3.11
echo "Enabled libhugetlbfs parameters for easy huge page support."
else
echo "You do not even have libhugetlbfs installed. There is very little we can do for your performance here."
fi
}
configure_mempool() {
export HUGEADM_PATH
export HUGEADM_CURRENTSIZE
HUGEADM_PATH=$(which hugeadm)
if [ -z "${HUGEADM_PATH}" ]; then
echo 'hugeadm is not installed. Was unable to configure the system hugepages pool size.'
return
fi
# HUGEADM_POOLSIZE (the pool row to match in `hugeadm --pool-list`) and
# HUGEADM_PAGESZ (desired number of free pages) are expected to be set by the caller.
# Current pool size (allocated hugepages)
HUGEADM_CURRENTSIZE=$(hugeadm --pool-list | grep "${HUGEADM_POOLSIZE}" | awk '{ print $3; }')
# Maximum pool size (how many hugepages)
HUGEADM_MAXIMUMSIZE=$(hugeadm --pool-list | grep "${HUGEADM_POOLSIZE}" | awk '{ print $4; }')
export HUGEADM_FREE
export TARGET_HUGEPAGESZ=0 # By default, we'll assume we need to allocate zero pages.
HUGEADM_FREE=$(expr "${HUGEADM_MAXIMUMSIZE}" - "${HUGEADM_CURRENTSIZE}")
if [ "${HUGEADM_FREE}" -lt "${HUGEADM_PAGESZ}" ]; then
# We don't have enough free hugepages. Let's go for gold and increase it by the current desired amount.
TARGET_HUGEPAGESZ=$(expr "${HUGEADM_PAGESZ}" - "${HUGEADM_FREE}")
sudo "${HUGEADM_PATH}" --hard --pool-pages-max "2MB:${TARGET_HUGEPAGESZ}" || echo "Could not configure hugepages pool size via hugeadm."
echo "Added ${TARGET_HUGEPAGESZ} to system hugepages memory pool."
else
echo "We have enough free pages (${HUGEADM_FREE} / ${HUGEADM_MAXIMUMSIZE}). Continuing."
fi
}
restore_mempool() {
if [ "${TARGET_HUGEPAGESZ}" -gt 0 ]; then
echo "Being a good citizen and restoring memory pool size back to ${HUGEADM_MAXIMUMSIZE}."
sudo "${HUGEADM_PATH}" --hard --pool-pages-max "2MB:${HUGEADM_MAXIMUMSIZE}" || echo "Could not configure hugepages pool size via hugeadm."
else
TOTAL_MEM_WASTED=$(expr "${HUGEADM_MAXIMUMSIZE}" \* 2)
echo "There were no extra hugepages allocated at startup, so there is nothing to clean up now. You could free ${TOTAL_MEM_WASTED}M for other applications by reducing the maximum pool size to zero by default."
fi
}
### How to load / use it
configure_mempool
enable_mimalloc
## call comfyui here
. ./start_ui.sh # or the correct start command
# Unconfigure hugepages if we've altered the system environment.
restore_mempool
You'll need libhugetlbfs and mimalloc installed from https://github.com/microsoft/mimalloc. It gives me a 6-40% speedup on various operations, but nothing consistent across the board; the total speedup for a generation was 13%. |
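If you try the script above, it is worth confirming the preload actually took effect before benchmarking. A quick check, reusing the MIMALLOC_SHOW_STATS switch the script already references (any short command that allocates will do):
# mimalloc prints an allocation summary at exit when stats are enabled,
# so seeing that output confirms LD_PRELOAD picked up the library.
MIMALLOC_SHOW_STATS=1 LD_PRELOAD=/usr/lib64/libmimalloc.so python3 -c "x = list(range(10**6))"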
I tried both; both were OOM on my 4090 (I can run it within memory once or twice on a fresh restart of Windows). The former commit was, if anything, slower. Both took up about 23.4GB of VRAM and 1.5GB of shared memory. Tested once (20 steps, default settings, lora loaded).
Current: 300 * 1024 * 1024
Old: 100 * 1024 * 1024 |
I would like to mention that sampling with both kohya's and ostrisai's lora trainers uses quite a bit less vram and never OOMs, so this model should fit eventually :) Sampling at 20 steps takes about 25-35 seconds using ostrisai's |
What does that do? Will I need it on my RTX 3070 8GB to avoid potential problems? |
Running Forge with FLUX 1d NF without --cuda-malloc (the console suggests enabling it, meaning it's not enabled) still causes an OOM (16GB VRAM). I assume Forge uses the same components as Comfy for FLUX, making this relevant: Begin to load 1 model |
Can you check if this is fixed on the latest? |
Yes! It works as it should now. Swapping loras while using fp16 and it's going quickly. Very happy about this, thank you very much for figuring it out. |
This is still broken for me on the latest commit (currently …). Depending on the sampler, I get one of these errors:
Euler:
The size of the allocation varies, but the block number seems to be consistent:
UniPC:
Euler runs to completion, UniPC consistently fails on the second step. The LoRA style does apply and the output image matches one produced on a higher memory card. I get 100% VRAM usage even with a 48GB card, but that does not log a |
Expected Behavior
Lora should load with minimal vram overhead (considering it is a small lora; the 4/4 rank one is 40mb).
Actual Behavior
Large vram usage increase when loading certain loras, in my case trained with ai-toolkit by Ostris. When using fp8_e4m3fn flux.dev, vram usage is 14.3gb, regardless of lora size (4/4, 16/16, 32/32). However, downloaded loras made with SimpleTuner only go up to 12.2gb usage. This is a problem when loading the fp16 model because it no longer fits on a 24gb vram GPU.
Steps to Reproduce
Load a lora made with ai-toolkit by Ostris
Debug Logs
Other
It seems that yesterday's update which makes loading ai-toolkit loras possible (avoiding all the missing key errors) has introduced an issue where vram usage goes up significantly, no matter how small the lora is.
I am using 'LoadLoraModelOnly' and 'Load Diffusion Model' nodes. I have tried other nodes, --lowvram, --normalvram, disabling nvidia cuda fallback, updating comfyui, turning off Manager, disabling rgthree optimization, nothing makes a difference.
Thanks in advance
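For anyone reproducing this, a simple way to watch the VRAM numbers quoted above while swapping loras (nvidia-smi ships with the NVIDIA driver; the 1-second polling interval is an arbitrary choice):
# Poll used vs. total GPU memory once per second while loading different loras.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1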