
Enhance PSI Constructor: Lower Peak Device Memory Usage #4154

Merged
1 commit merged into deepmodeling:develop on May 13, 2024

Conversation

denghuilu (Member) commented May 11, 2024

Reminder

  • Have you linked an issue with this pull request?
  • Have you added adequate unit tests and/or case tests for your pull request?
  • Have you noticed possible changes of behavior below or in the linked issue?
  • Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

Close #4153 (Double peak memory cost in cast_memory_op)

Unit Tests and/or Case Tests for my changes

  • A unit test is added for each new feature or bug fix.

What's changed?

  • Example: My changes might affect the performance of the application under certain conditions, and I have tested the impact on various scenarios...

Any changes of core modules? (ignore if not applicable)

  • Example: I have added a new virtual function in the esolver base class in order to ...

@denghuilu denghuilu requested a review from dyzheng May 11, 2024 14:16
@mohanchen mohanchen added the GPU & DCU & HPC label May 12, 2024
caic99 (Member) commented May 12, 2024

Hi @denghuilu,
I wonder if it is possible to offer an option to perform in-place conversion, so that we can avoid all extra memory allocations.
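
For illustration, a minimal sketch of what such an in-place path could look like on the CPU side, assuming a narrowing cast (sizeof(FPTYPE_out) <= sizeof(FPTYPE_in)); the helper name cast_in_place_cpu is hypothetical and not part of the codebase:

#include <cstddef>

// Hypothetical sketch: narrow-cast a buffer in place, front to back. Writing
// out[i] ends at byte (i+1)*sizeof(FPTYPE_out), which never reaches the start
// of the still-unread in[i+1], so a forward pass is safe when narrowing.
template <typename FPTYPE_out, typename FPTYPE_in>
void cast_in_place_cpu(void* buffer, const size_t size)
{
    static_assert(sizeof(FPTYPE_out) <= sizeof(FPTYPE_in),
                  "in-place cast is only safe when the output type is no larger");
    const FPTYPE_in* in = static_cast<const FPTYPE_in*>(buffer);
    FPTYPE_out* out = static_cast<FPTYPE_out*>(buffer);
    for (size_t i = 0; i < size; ++i) {
        const FPTYPE_in v = in[i]; // read before the aliasing write below
        out[i] = static_cast<FPTYPE_out>(v);
    }
}

For example, cast_in_place_cpu<float, double>(buf, n) would shrink a buffer of doubles into floats with no extra allocation; a production version would need to resolve the type aliasing formally (e.g. per-element memcpy), and the widening direction would still require a second buffer.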

dyzheng (Collaborator) left a comment

LGTM

caic99 (Member) left a comment

@dyzheng This PR patches the case of transferring data between CPU and GPU without type conversion. With this change, an extra buffer is only required for transfers that also perform a type conversion, which is not our case.
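
To make the two cases concrete, here is a rough sketch of that dispatch, assuming C++17; the wrapper name transfer_cpu_to_gpu is hypothetical, cudaErrcheck is the project's error-checking macro, and the else branch reuses the cast_memory_op operator quoted in the next comment (whose device-descriptor arguments are unused in the shown implementation):

#include <type_traits> // std::is_same

// Hypothetical sketch of the dispatch described above: same-type transfers
// copy directly; only converting transfers still stage a temporary buffer.
template <typename T_out, typename T_in>
void transfer_cpu_to_gpu(T_out* dst, const T_in* src, const size_t size)
{
    if (size == 0) { return; }
    if constexpr (std::is_same<T_in, T_out>::value) {
        // No conversion: one cudaMemcpy, peak device memory = size of dst.
        cudaErrcheck(cudaMemcpy(dst, src, sizeof(T_in) * size,
                                cudaMemcpyHostToDevice));
    } else {
        // Conversion: stage-then-cast via cast_memory_op (extra device buffer).
        cast_memory_op<T_out, T_in, psi::DEVICE_GPU, psi::DEVICE_CPU>()(
            nullptr, nullptr, dst, src, size);
    }
}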

@dyzheng dyzheng merged commit 9abeccc into deepmodeling:develop May 13, 2024
13 checks passed
caic99 (Member) commented May 13, 2024

@denghuilu However, I think it would be better to put this optimization for the T_in == T_out case in the implementation of cast_memory_op:

template <typename FPTYPE_out, typename FPTYPE_in>
struct cast_memory_op<FPTYPE_out, FPTYPE_in, psi::DEVICE_GPU, psi::DEVICE_CPU> {
    void operator()(const psi::DEVICE_GPU* dev_out,
                    const psi::DEVICE_CPU* dev_in,
                    FPTYPE_out* arr_out,
                    const FPTYPE_in* arr_in,
                    const size_t size)
    {
        if (size == 0) { return; }
        // Stage the host data in a temporary device buffer of the input type,
        // then cast it into arr_out on the device. This temporary allocation
        // is what doubles the peak device memory when FPTYPE_in == FPTYPE_out.
        FPTYPE_in* arr = nullptr;
        cudaErrcheck(cudaMalloc((void**)&arr, sizeof(FPTYPE_in) * size));
        cudaErrcheck(cudaMemcpy(arr, arr_in, sizeof(FPTYPE_in) * size, cudaMemcpyHostToDevice));
        const int block = (size + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
        cast_memory<<<block, THREADS_PER_BLOCK>>>(arr_out, arr, size);
        cudaErrcheck(cudaGetLastError());
        cudaErrcheck(cudaDeviceSynchronize());
        cudaErrcheck(cudaFree(arr));
    }
};

template <typename FPTYPE_out, typename FPTYPE_in>
struct cast_memory_op<FPTYPE_out, FPTYPE_in, psi::DEVICE_CPU, psi::DEVICE_GPU> {
    void operator()(const psi::DEVICE_CPU* dev_out,
                    const psi::DEVICE_GPU* dev_in,
                    FPTYPE_out* arr_out,
                    const FPTYPE_in* arr_in,
                    const size_t size)
    {
        // Copy the device data into a temporary host buffer, then cast
        // element by element on the CPU.
        auto* arr = (FPTYPE_in*)malloc(sizeof(FPTYPE_in) * size);
        cudaErrcheck(cudaMemcpy(arr, arr_in, sizeof(FPTYPE_in) * size, cudaMemcpyDeviceToHost));
        for (int ii = 0; ii < size; ii++) {
            arr_out[ii] = static_cast<FPTYPE_out>(arr[ii]);
        }
        free(arr);
    }
};
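
A minimal sketch of that suggestion for the GPU-from-CPU specialization, assuming C++17 (if constexpr) on top of the code quoted above: branch on type equality inside the operator itself, so same-type transfers skip the staging buffer entirely:

#include <type_traits> // std::is_same

template <typename FPTYPE_out, typename FPTYPE_in>
struct cast_memory_op<FPTYPE_out, FPTYPE_in, psi::DEVICE_GPU, psi::DEVICE_CPU> {
    void operator()(const psi::DEVICE_GPU* dev_out,
                    const psi::DEVICE_CPU* dev_in,
                    FPTYPE_out* arr_out,
                    const FPTYPE_in* arr_in,
                    const size_t size)
    {
        if (size == 0) { return; }
        if constexpr (std::is_same<FPTYPE_in, FPTYPE_out>::value) {
            // Same type: copy straight into arr_out, no temporary allocation.
            cudaErrcheck(cudaMemcpy(arr_out, arr_in, sizeof(FPTYPE_in) * size,
                                    cudaMemcpyHostToDevice));
        } else {
            // Different types: keep the existing stage-then-cast path.
            FPTYPE_in* arr = nullptr;
            cudaErrcheck(cudaMalloc((void**)&arr, sizeof(FPTYPE_in) * size));
            cudaErrcheck(cudaMemcpy(arr, arr_in, sizeof(FPTYPE_in) * size,
                                    cudaMemcpyHostToDevice));
            const int block = (size + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
            cast_memory<<<block, THREADS_PER_BLOCK>>>(arr_out, arr, size);
            cudaErrcheck(cudaGetLastError());
            cudaErrcheck(cudaDeviceSynchronize());
            cudaErrcheck(cudaFree(arr));
        }
    }
};

The CPU-from-GPU specialization could take the same same-type shortcut with cudaMemcpyDeviceToHost.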

denghuilu (Member, Author) replied, quoting caic99's suggestion above in full:

LGTM, I'll put up a PR later.

@caic99 caic99 mentioned this pull request May 15, 2024