
Enhance PSI Constructor: Lower Peak Device Memory Usage #4154

Merged
1 commit merged into deepmodeling:develop on May 13, 2024

Conversation

denghuilu (Member) commented May 11, 2024

Reminder

  • Have you linked an issue with this pull request?
  • Have you added adequate unit tests and/or case tests for your pull request?
  • Have you noticed possible changes of behavior below or in the linked issue?
  • Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

Close #4153 (Double peak memory cost in cast_memory_op)

Unit Tests and/or Case Tests for my changes

  • A unit test is added for each new feature or bug fix.

What's changed?

  • Example: My changes might affect the performance of the application under certain conditions, and I have tested the impact on various scenarios...

Any changes of core modules? (ignore if not applicable)

  • Example: I have added a new virtual function in the esolver base class in order to ...

@denghuilu denghuilu requested a review from dyzheng May 11, 2024 14:16
@mohanchen mohanchen added the GPU & DCU & HPC label May 12, 2024
caic99 (Member) commented May 12, 2024

Hi @denghuilu,
I wonder if it is possible to offer an option to perform in-place conversion, so that we can avoid all extra memory allocations.
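
For illustration, a minimal sketch of what such an in-place path could look like on the CPU side, assuming a narrowing cast (sizeof(FPTYPE_out) <= sizeof(FPTYPE_in)); the helper name cast_in_place_cpu is hypothetical and not part of the codebase:

#include <cstddef>

// Hypothetical sketch: narrow-cast a buffer in place, front to back. Writing
// out[i] ends at byte (i+1)*sizeof(FPTYPE_out), which never reaches the start
// of the still-unread in[i+1], so a forward pass is safe when narrowing.
template <typename FPTYPE_out, typename FPTYPE_in>
void cast_in_place_cpu(void* buffer, const size_t size)
{
    static_assert(sizeof(FPTYPE_out) <= sizeof(FPTYPE_in),
                  "in-place cast is only safe when the output type is no larger");
    const FPTYPE_in* in = static_cast<const FPTYPE_in*>(buffer);
    FPTYPE_out* out = static_cast<FPTYPE_out*>(buffer);
    for (size_t i = 0; i < size; ++i) {
        const FPTYPE_in v = in[i]; // read before the aliasing write below
        out[i] = static_cast<FPTYPE_out>(v);
    }
}

For example, cast_in_place_cpu<float, double>(buf, n) would shrink a buffer of doubles into floats with no extra allocation; a production version would need to resolve the type aliasing formally (e.g. per-element memcpy), and the widening direction would still require a second buffer.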

dyzheng (Collaborator) left a comment

LGTM

caic99 (Member) left a comment

@dyzheng This PR patches the case of transferring data between CPU and GPU without type conversion. With this change, an extra buffer is only required for transfers that also perform a type conversion, which is not our case.
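
To make the two cases concrete, here is a rough sketch of that dispatch, assuming C++17; the wrapper name transfer_cpu_to_gpu is hypothetical, cudaErrcheck is the project's error-checking macro, and the else branch reuses the cast_memory_op operator quoted in the next comment (whose device-descriptor arguments are unused in the shown implementation):

#include <type_traits> // std::is_same

// Hypothetical sketch of the dispatch described above: same-type transfers
// copy directly; only converting transfers still stage a temporary buffer.
template <typename T_out, typename T_in>
void transfer_cpu_to_gpu(T_out* dst, const T_in* src, const size_t size)
{
    if (size == 0) { return; }
    if constexpr (std::is_same<T_in, T_out>::value) {
        // No conversion: one cudaMemcpy, peak device memory = size of dst.
        cudaErrcheck(cudaMemcpy(dst, src, sizeof(T_in) * size,
                                cudaMemcpyHostToDevice));
    } else {
        // Conversion: stage-then-cast via cast_memory_op (extra device buffer).
        cast_memory_op<T_out, T_in, psi::DEVICE_GPU, psi::DEVICE_CPU>()(
            nullptr, nullptr, dst, src, size);
    }
}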

@dyzheng dyzheng merged commit 9abeccc into deepmodeling:develop May 13, 2024
13 checks passed
caic99 (Member) commented May 13, 2024

@denghuilu However, I think it would be better to put this optimization for the T_in == T_out case in the implementation of cast_memory_op:

template <typename FPTYPE_out, typename FPTYPE_in>
struct cast_memory_op<FPTYPE_out, FPTYPE_in, psi::DEVICE_GPU, psi::DEVICE_CPU> {
    void operator()(const psi::DEVICE_GPU* dev_out,
                    const psi::DEVICE_CPU* dev_in,
                    FPTYPE_out* arr_out,
                    const FPTYPE_in* arr_in,
                    const size_t size)
    {
        if (size == 0) { return; }
        // Stage the host data in a temporary device buffer of the input type,
        // then cast it into arr_out on the device. This temporary allocation
        // is what doubles the peak device memory when FPTYPE_in == FPTYPE_out.
        FPTYPE_in* arr = nullptr;
        cudaErrcheck(cudaMalloc((void**)&arr, sizeof(FPTYPE_in) * size));
        cudaErrcheck(cudaMemcpy(arr, arr_in, sizeof(FPTYPE_in) * size, cudaMemcpyHostToDevice));
        const int block = (size + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
        cast_memory<<<block, THREADS_PER_BLOCK>>>(arr_out, arr, size);
        cudaErrcheck(cudaGetLastError());
        cudaErrcheck(cudaDeviceSynchronize());
        cudaErrcheck(cudaFree(arr));
    }
};

template <typename FPTYPE_out, typename FPTYPE_in>
struct cast_memory_op<FPTYPE_out, FPTYPE_in, psi::DEVICE_CPU, psi::DEVICE_GPU> {
    void operator()(const psi::DEVICE_CPU* dev_out,
                    const psi::DEVICE_GPU* dev_in,
                    FPTYPE_out* arr_out,
                    const FPTYPE_in* arr_in,
                    const size_t size)
    {
        // Copy the device data into a temporary host buffer, then cast
        // element by element on the CPU.
        auto* arr = (FPTYPE_in*)malloc(sizeof(FPTYPE_in) * size);
        cudaErrcheck(cudaMemcpy(arr, arr_in, sizeof(FPTYPE_in) * size, cudaMemcpyDeviceToHost));
        for (int ii = 0; ii < size; ii++) {
            arr_out[ii] = static_cast<FPTYPE_out>(arr[ii]);
        }
        free(arr);
    }
};
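
A minimal sketch of that suggestion for the GPU-from-CPU specialization, assuming C++17 (if constexpr) on top of the code quoted above: branch on type equality inside the operator itself, so same-type transfers skip the staging buffer entirely:

#include <type_traits> // std::is_same

template <typename FPTYPE_out, typename FPTYPE_in>
struct cast_memory_op<FPTYPE_out, FPTYPE_in, psi::DEVICE_GPU, psi::DEVICE_CPU> {
    void operator()(const psi::DEVICE_GPU* dev_out,
                    const psi::DEVICE_CPU* dev_in,
                    FPTYPE_out* arr_out,
                    const FPTYPE_in* arr_in,
                    const size_t size)
    {
        if (size == 0) { return; }
        if constexpr (std::is_same<FPTYPE_in, FPTYPE_out>::value) {
            // Same type: copy straight into arr_out, no temporary allocation.
            cudaErrcheck(cudaMemcpy(arr_out, arr_in, sizeof(FPTYPE_in) * size,
                                    cudaMemcpyHostToDevice));
        } else {
            // Different types: keep the existing stage-then-cast path.
            FPTYPE_in* arr = nullptr;
            cudaErrcheck(cudaMalloc((void**)&arr, sizeof(FPTYPE_in) * size));
            cudaErrcheck(cudaMemcpy(arr, arr_in, sizeof(FPTYPE_in) * size,
                                    cudaMemcpyHostToDevice));
            const int block = (size + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
            cast_memory<<<block, THREADS_PER_BLOCK>>>(arr_out, arr, size);
            cudaErrcheck(cudaGetLastError());
            cudaErrcheck(cudaDeviceSynchronize());
            cudaErrcheck(cudaFree(arr));
        }
    }
};

The CPU-from-GPU specialization could take the same same-type shortcut with cudaMemcpyDeviceToHost.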

denghuilu (Member, Author) replied, quoting caic99's suggestion above in full:

LGTM, I'll put up a PR later.

@caic99 caic99 mentioned this pull request May 15, 2024