
What GPU is needed to finetune the Large version? #27

Closed
Rai220 opened this issue Nov 10, 2020 · 4 comments


Rai220 commented Nov 10, 2020

I have a 16 GB GPU and I get a CUDA out-of-memory error (with batch size = 1!):

RuntimeError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 14.76 GiB total capacity; 13.25 GiB already allocated; 21.44 MiB free; 13.84 GiB reserved in total by PyTorch)

Is this really not enough memory to train the Large version? Are there any tips for reducing memory usage during training? I am using the following parameters:

    --per_gpu_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --overwrite_cache \
    --num_train_epochs 2 \
    --save_steps 1000 \
    --block_size 256 \
    --fp16
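
For a rough sense of why 16 GB fills up even at batch size 1, here is a back-of-the-envelope estimate (my own numbers, not from the repo; the ~760M parameter count is an assumption about the Large model, and activations and framework overhead are ignored):

    # Rough memory estimate for finetuning a ~760M-parameter model with Adam
    # in mixed precision. All figures are approximations.
    params = 760_000_000                 # assumed parameter count of the Large model

    fp16_weights = params * 2            # half-precision working weights
    fp32_master  = params * 4            # fp32 master copy kept by mixed precision
    adam_moments = params * 4 * 2        # two fp32 moment buffers (m and v)
    fp16_grads   = params * 2            # gradients

    total = fp16_weights + fp32_master + adam_moments + fp16_grads
    print(f"~{total / 2**30:.1f} GiB before any activations")   # roughly 11 GiB

Even before activations and PyTorch's cached allocations, that is already close to the figure reported in the error above, which is why the suggestions below focus on moving optimizer state off the GPU and on gradient checkpointing.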

OzoneReloaded commented Nov 11, 2020

Hello! I've managed to run finetuning on an 11 GB GPU with:

gpt_options="
--hidden-size 1024
--seq-length 1024
--cpu-optimizer
--cpu_torch_adam
"

Hope it helps. @Rai220
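
For context, here is a conceptual sketch of what offloading the optimizer to the CPU buys you. This is my own illustration of the general technique, not the code behind --cpu-optimizer / --cpu_torch_adam in this repo; a tiny linear layer stands in for the model:

    import torch

    # Keep Adam and its moment buffers in CPU RAM; only the weights, activations
    # and gradients stay on the GPU.
    model = torch.nn.Linear(1024, 1024).cuda()
    cpu_params = [p.detach().cpu().clone().requires_grad_(True) for p in model.parameters()]
    optimizer = torch.optim.Adam(cpu_params, lr=1e-4)   # Adam state lives on the CPU

    def cpu_offloaded_step():
        # copy gradients to the CPU shadow parameters, step there, push weights back
        for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
            cpu_p.grad = gpu_p.grad.detach().cpu()
        optimizer.step()
        optimizer.zero_grad()
        for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
            gpu_p.data.copy_(cpu_p.data)

    x = torch.randn(8, 1024, device="cuda")
    model(x).pow(2).mean().backward()
    cpu_offloaded_step()

The trade-off is extra host-to-device copies per step in exchange for freeing the several gigabytes that Adam's fp32 state would otherwise occupy on the GPU.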


fen0s commented Nov 12, 2020


Apparently, optimization level O3 helps, but I haven't quite figured out how to make it generate samples; it just outputs negative probabilities for some reason. Also, the answer above is for GPT-3 Large, not GPT-2 Large, so...
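
For reference, the O3 level is normally selected through NVIDIA apex's amp API (this is apex's documented interface, not necessarily the exact call used by this repo's scripts; the small layer below stands in for the model):

    import torch
    from apex import amp   # requires NVIDIA apex to be installed

    model = torch.nn.Linear(1024, 1024).cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # O3 casts the whole model to fp16 ("pure" half precision): the largest
    # memory saving, but with no fp32 master weights or dynamic loss scaling,
    # so it is numerically fragile, which is a plausible cause of the broken
    # sample probabilities mentioned above.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O3")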


fen0s commented Nov 12, 2020

Basically, what's needed is gradient checkpointing, which was added in one of the later transformers library versions. I'm not sure I can implement it, especially considering that an old version of the transformers library is used here...
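
For anyone landing here later, a minimal sketch of enabling gradient checkpointing, assuming a transformers version new enough to expose the flag on GPT2Config (the checkpoint name below is the published ruGPT-3 Large model and is my assumption, not taken from this thread):

    from transformers import GPT2Config, GPT2LMHeadModel

    model_name = "sberbank-ai/rugpt3large_based_on_gpt2"   # assumed checkpoint name

    config = GPT2Config.from_pretrained(model_name)
    config.gradient_checkpointing = True   # recompute activations in the backward pass
    config.use_cache = False               # the generation cache conflicts with checkpointing

    model = GPT2LMHeadModel.from_pretrained(model_name, config=config)

Activations are then recomputed during the backward pass instead of being stored for every layer, trading extra compute for a large cut in GPU memory during finetuning.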

TatianaShavrina (Collaborator) commented

Hey @Rai220 @fen0s, the organizers have given participants the opportunity to get access to Cristofari. To request access, please send a brief description of your project to AIJ_ruGPT-3@sberbank.ru. We will review your request and get back to you. Please note that the number of such accesses is limited, so if you need one, please submit your request as early as possible.
