
About YOLOv4 inference speed measurement #34

Closed

qinxianglinya opened this issue Jul 23, 2020 · 6 comments
@qinxianglinya

Hi, I have successfully built yolov4 on Windows.
Environment: Win10, TensorRT 6.0.1.5, CUDA 10.0, cuDNN 7.6.5, 1080Ti.
I benchmarked a model I trained myself.
Config: input size 800x800x3, TensorRT precision FP16, batch size 1.
enqueue() + cudaMemcpyAsync() together take 1 ms, but cudaStreamSynchronize() takes 29 ms. Is there any way to improve this? Thanks a lot.

@enazoe
Owner

enazoe commented Jul 23, 2020

@qinxianglinya Can you describe the timing in more detail? cudaMemcpyAsync copies data to device memory asynchronously; because it is asynchronous, the call returns immediately without blocking. The subsequent cudaStreamSynchronize waits for all queued work on the stream to finish. So the 29 ms you measured is essentially the real execution time. If this is still unclear, look up what those two CUDA functions do.

@qinxianglinya
Author

Below is the timing code I used; I'm not sure whether the way I measure it is correct.

// Note: clock_t differences are in CLOCKS_PER_SEC units; this prints
// milliseconds only where CLOCKS_PER_SEC == 1000 (true on Windows/MSVC).
clock_t start = clock();
m_Context->enqueue(batchSize, m_DeviceBuffers.data(), m_CudaStream, nullptr); // asynchronous launch
clock_t end = clock();
std::cout << "infer time:" << float(end - start) << "ms" << std::endl;
std::cout << m_OutputTensors.size() << std::endl;

clock_t start1 = clock();
for (auto& tensor : m_OutputTensors)
{
    NV_CUDA_CHECK(cudaMemcpyAsync(tensor.hostBuffer, m_DeviceBuffers.at(tensor.bindingIndex),
                                  batchSize * tensor.volume * sizeof(float),
                                  cudaMemcpyDeviceToHost, m_CudaStream)); // asynchronous copy
}
clock_t end1 = clock();
std::cout << "gpu to cpu:" << float(end1 - start1) << "ms" << std::endl;

clock_t start2 = clock();
cudaStreamSynchronize(m_CudaStream); // blocks until all queued work on the stream finishes
clock_t end2 = clock();
std::cout << "cudaStreamSynchronize time:" << float(end2 - start2) << "ms" << std::endl;

@enazoe
Owner

enazoe commented Jul 23, 2020

Uncomment line 536 and line 550:

Timer timer;
    assert(batchSize <= m_BatchSize && "Image batch size exceeds TRT engines batch size");
    NV_CUDA_CHECK(cudaMemcpyAsync(m_DeviceBuffers.at(m_InputBindingIndex), input,
                                  batchSize * m_InputSize * sizeof(float), cudaMemcpyHostToDevice,
                                  m_CudaStream));
    m_Context->enqueue(batchSize, m_DeviceBuffers.data(), m_CudaStream, nullptr);
    for (auto& tensor : m_OutputTensors)
    {
        NV_CUDA_CHECK(cudaMemcpyAsync(tensor.hostBuffer, m_DeviceBuffers.at(tensor.bindingIndex),
                                      batchSize * tensor.volume * sizeof(float),
                                      cudaMemcpyDeviceToHost, m_CudaStream));
    }
    cudaStreamSynchronize(m_CudaStream);
    timer.out("inference");

@qinxianglinya
Author

OK, thank you.

@enazoe
Owner

enazoe commented Jul 24, 2020

@qinxianglinya If you found this useful, please give it a star.

@enazoe enazoe closed this as completed Jul 24, 2020
@qinxianglinya
Author

Sure.
