
About YOLOv4 inference speed measurement #34

Closed

qinxianglinya opened this issue Jul 23, 2020 · 6 comments
@qinxianglinya

Hi, I have successfully built yolov4 on Windows.
Environment: Win10, TensorRT 6.0.1.5, CUDA 10.0, cuDNN 7.6.5, 1080Ti.
I benchmarked a model I trained myself.
Config: input size 800x800x3, TensorRT precision FP16, batch size 1.
enqueue() + cudaMemcpyAsync() together take 1 ms, but cudaStreamSynchronize() takes 29 ms. Is there any way to improve this? Thanks a lot.

@enazoe
Owner

enazoe commented Jul 23, 2020

@qinxianglinya Can you describe the timing in more detail? cudaMemcpyAsync copies data to device memory asynchronously; because it is asynchronous, the call returns immediately without blocking. The subsequent cudaStreamSynchronize waits for all queued work on the stream to finish. So the 29 ms you measured is essentially the real execution time. If this is still unclear, look up what those two CUDA functions do.

@qinxianglinya
Author

Below is the timing code I used; I'm not sure whether the way I measure it is correct.

// Note: clock_t differences are in CLOCKS_PER_SEC units; this prints
// milliseconds only where CLOCKS_PER_SEC == 1000 (true on Windows/MSVC).
clock_t start = clock();
m_Context->enqueue(batchSize, m_DeviceBuffers.data(), m_CudaStream, nullptr); // asynchronous launch
clock_t end = clock();
std::cout << "infer time:" << float(end - start) << "ms" << std::endl;
std::cout << m_OutputTensors.size() << std::endl;

clock_t start1 = clock();
for (auto& tensor : m_OutputTensors)
{
    NV_CUDA_CHECK(cudaMemcpyAsync(tensor.hostBuffer, m_DeviceBuffers.at(tensor.bindingIndex),
                                  batchSize * tensor.volume * sizeof(float),
                                  cudaMemcpyDeviceToHost, m_CudaStream)); // asynchronous copy
}
clock_t end1 = clock();
std::cout << "gpu to cpu:" << float(end1 - start1) << "ms" << std::endl;

clock_t start2 = clock();
cudaStreamSynchronize(m_CudaStream); // blocks until all queued work on the stream finishes
clock_t end2 = clock();
std::cout << "cudaStreamSynchronize time:" << float(end2 - start2) << "ms" << std::endl;

@enazoe
Owner

enazoe commented Jul 23, 2020

Uncomment line 536 and line 550:

Timer timer;
    assert(batchSize <= m_BatchSize && "Image batch size exceeds TRT engines batch size");
    NV_CUDA_CHECK(cudaMemcpyAsync(m_DeviceBuffers.at(m_InputBindingIndex), input,
                                  batchSize * m_InputSize * sizeof(float), cudaMemcpyHostToDevice,
                                  m_CudaStream));
    m_Context->enqueue(batchSize, m_DeviceBuffers.data(), m_CudaStream, nullptr);
    for (auto& tensor : m_OutputTensors)
    {
        NV_CUDA_CHECK(cudaMemcpyAsync(tensor.hostBuffer, m_DeviceBuffers.at(tensor.bindingIndex),
                                      batchSize * tensor.volume * sizeof(float),
                                      cudaMemcpyDeviceToHost, m_CudaStream));
    }
    cudaStreamSynchronize(m_CudaStream);
    timer.out("inference");

@qinxianglinya
Author

OK, thank you.

@enazoe
Owner

enazoe commented Jul 24, 2020

@qinxianglinya If you found this useful, please give it a star.

@enazoe enazoe closed this as completed Jul 24, 2020
@qinxianglinya
Author

Sure.
