CUDA out of memory when running Reverse-Time Migration of Marmousi example #52

Closed
XuVV opened this issue Feb 27, 2023 · 2 comments

@XuVV

XuVV commented Feb 27, 2023

Hi, Doctor. Sorry to bother you again.

When I tried to run the Reverse-Time Migration of Marmousi example using the following code, it reported that CUDA was out of memory.

# Run optimisation/inversion

n_epochs = 1
n_batch = 60
n_shots_per_batch = (n_shots + n_batch - 1) // n_batch
for epoch in range(n_epochs):
    epoch_loss = 0
    # optimiser.zero_grad()
    for batch in range(n_batch):
        print(batch)
        optimiser.zero_grad()
        batch_start = batch * n_shots_per_batch
        batch_end = min(batch_start + n_shots_per_batch, n_shots)
        if batch_end <= batch_start:
            continue
        s = slice(batch_start, batch_end)

        simulated_data = scalar_born(v_mig.detach(), scatter, dx, dt,
                                     source_amplitudes=source_amplitudes[s].detach(),
                                     source_locations=source_locations[s].detach(),
                                     receiver_locations=receiver_locations[s].detach(),
                                     pml_freq=freq)
        loss = (1e9 * loss_fn(simulated_data[-1] * mask[s], observed_scatter_masked[s]))
        epoch_loss += loss.item()
        loss.backward()
        optimiser.step()
        # del simulated_data
        # torch.cuda.empty_cache()
    print(epoch_loss)

I found that the 1st batch runs fine: the call to "scalar_born" takes about 11002 MB of GPU memory. At the 2nd batch, however, instead of releasing (or reusing) that 11002 MB, "scalar_born" takes another roughly 11002 MB of GPU memory. From the 3rd batch onwards, the memory usage does not grow any further.

I am confused about why the 2nd batch takes additional GPU memory. Is this normal?
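
(One way to check whether the extra ~11002 MB is held by live tensors or only cached by PyTorch's allocator is to print the allocator counters after each batch; this is just a diagnostic sketch, and report_gpu_memory is a helper name introduced here, not part of the example:)

import torch

def report_gpu_memory(tag):
    # Memory currently occupied by live tensors
    allocated_mb = torch.cuda.memory_allocated() / 1024**2
    # Memory held by the caching allocator, including freed-but-cached blocks
    reserved_mb = torch.cuda.memory_reserved() / 1024**2
    print(f"{tag}: allocated={allocated_mb:.0f} MB, reserved={reserved_mb:.0f} MB")

# e.g. call report_gpu_memory(f"after batch {batch}") at the end of each batch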

@ar4
Owner

ar4 commented Feb 27, 2023

Hello again,

PyTorch's caching allocator, the timing of when it decides to free unused memory, and optimizers that accumulate information over iterations all make memory usage hard to predict. I see that you have already tried to force the cache to be emptied. You might try expanding that a bit to something like this to see if it helps (you will need to add import gc):

del loss, simulated_data
gc.collect()
torch.cuda.empty_cache()

If that is not sufficient, then you will probably have to use smaller batch sizes. You can still perform the optimizer step with gradients from the same number of shots, if you wish, by accumulating the gradients over multiple batches before performing a step. You could do that with something like this (where I accumulate the gradients over two batches, each half the size of yours, before performing an optimizer step):

n_epochs = 1
n_batch = 120
n_shots_per_batch = (n_shots + n_batch - 1) // n_batch
for epoch in range(n_epochs):
    epoch_loss = 0
    for outer_batch in range(n_batch//2):
        optimiser.zero_grad()
        for inner_batch in range(2):
            batch = outer_batch * 2 + inner_batch
            print(batch)
            batch_start = batch * n_shots_per_batch
            batch_end = min(batch_start + n_shots_per_batch, n_shots)
            if batch_end <= batch_start:
                continue
            s = slice(batch_start, batch_end)

            simulated_data = scalar_born(v_mig.detach(), scatter, dx, dt,
                                         source_amplitudes=source_amplitudes[s].detach(),
                                         source_locations=source_locations[s].detach(),
                                         receiver_locations=receiver_locations[s].detach(),
                                         pml_freq=freq)
            loss = (1e9 * loss_fn(simulated_data[-1] * mask[s], observed_scatter_masked[s]))
            epoch_loss += loss.item()
            loss.backward()
        optimiser.step()
    print(epoch_loss)

If you are going to perform an optimizer step after each batch (whether bigger batches, as you were doing, or when accumulating over multiple smaller batches as in my example above), then I suggest that you might want to randomise which shots are in each batch between epochs.
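
For example, applied to the original single-level batch loop for brevity (just a sketch, not tested; shot_order is an index tensor introduced here), you could draw a new random permutation of the shot indices at the start of each epoch and index with it instead of a contiguous slice:

import torch

for epoch in range(n_epochs):
    epoch_loss = 0
    # New random ordering of the shots for this epoch
    shot_order = torch.randperm(n_shots, device=source_amplitudes.device)
    for batch in range(n_batch):
        optimiser.zero_grad()
        batch_start = batch * n_shots_per_batch
        batch_end = min(batch_start + n_shots_per_batch, n_shots)
        if batch_end <= batch_start:
            continue
        # Random subset of shots instead of a contiguous slice
        idx = shot_order[batch_start:batch_end]

        simulated_data = scalar_born(v_mig.detach(), scatter, dx, dt,
                                     source_amplitudes=source_amplitudes[idx].detach(),
                                     source_locations=source_locations[idx].detach(),
                                     receiver_locations=receiver_locations[idx].detach(),
                                     pml_freq=freq)
        loss = 1e9 * loss_fn(simulated_data[-1] * mask[idx],
                             observed_scatter_masked[idx])
        epoch_loss += loss.item()
        loss.backward()
        optimiser.step()
    print(epoch_loss)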

@XuVV
Author

XuVV commented Feb 27, 2023

Thank you very much, Doctor. Using del loss, simulated_data followed by gc.collect() and torch.cuda.empty_cache() solved the problem.
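
(For reference, the clean-up presumably goes at the end of each batch iteration, after optimiser.step(); this placement is an assumption, not shown in the thread:)

import gc
import torch

# ... at the end of each batch iteration, after loss.backward() and optimiser.step() ...
del loss, simulated_data   # drop the last references to the large tensors
gc.collect()               # collect the now-unreferenced Python objects
torch.cuda.empty_cache()   # release cached CUDA memory held by the allocator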

XuVV closed this as completed Feb 27, 2023
ar4 added a commit that referenced this issue Sep 16, 2023
@tmasthay reported that the DistributedDataParallel (DDP)
example was not producing correct results, and, after investigating,
discovered that this could be rectified by adding
`torch.cuda.set_device(rank)` to the code.

Closes #52
[ci skip]
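
(For context, torch.cuda.set_device(rank) makes each spawned worker use its own GPU as the default CUDA device. A minimal sketch of that kind of DDP setup, using a placeholder model rather than the Deepwave example and assumed MASTER_ADDR/MASTER_PORT values, might look like this:)

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Assumed rendezvous settings for a single-machine run
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '12355')
    # Bind this process to its own GPU before doing any CUDA work,
    # otherwise every rank may end up allocating on GPU 0
    torch.cuda.set_device(rank)
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    model = torch.nn.Linear(10, 10).to(rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[rank])
    # ... training loop would go here ...
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)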