
Use the video input from the camera for action recognition #45

Closed
shijubushiju opened this issue Dec 23, 2020 · 7 comments
Comments

@shijubushiju

Hello, author. I have reproduced your code, but now I want to use it to classify actions such as shaking hands and hugging while reading in the video. How can I achieve this?

@bryanyzhu
Owner

You can use cv2.VideoCapture to read the video (whether from a file offline or from a camera), grab the frames, and then run the prediction. Something like this:

import cv2

cap = cv2.VideoCapture(VIDEO_NAME)  # VIDEO_NAME can be a file path, or 0 for the default camera
net = get_model(MODEL_NAME)
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:                     # stop when the stream ends or a frame cannot be read
        break
    input = preprocess(frame)       # resize/normalize the frame as the model expects
    pred = net(input)

cap.release()

@shijubushiju
Author

@bryanyzhu Thank you very much. Let me have a try

@shijubushiju
Author

@bryanyzhu Hello, I have another question:
Lines 85 and 86 of the flow_vgg16.py file look like this:
rgb_weight_mean = torch.mean(rgb_weight, dim=1)
flow_weight = rgb_weight_mean.repeat(1,in_channels,1,1)
However, lines 179 and 182 of the file flow_resnet.py look like this:
rgb_weight_mean = torch.mean(rgb_weight, dim=1)
flow_weight = rgb_weight_mean.unsqueeze(1).repeat(1,in_channels,1,1)
How should I make sense of the difference?

@bryanyzhu
Owner

For the current PyTorch version, I think the second one is more rigorous. But both of them should work, because many operators support automatic broadcasting, so users don't need to worry about dimension mismatches.
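
For concreteness, here is a quick shape check of the two variants. The 64 output channels, 3x3 kernels, and 20 flow input channels below are just example values (e.g. 10 stacked flow frames x 2 channels), not numbers taken from the repository:

import torch

out_channels, in_channels, k = 64, 20, 3

# Pretrained RGB first-layer weight: [out_channels, 3, k, k]
rgb_weight = torch.randn(out_channels, 3, k, k)
rgb_weight_mean = torch.mean(rgb_weight, dim=1)               # shape [64, 3, 3]

# flow_resnet.py style: restore the channel dim before repeating
flow_weight = rgb_weight_mean.unsqueeze(1).repeat(1, in_channels, 1, 1)
print(flow_weight.shape)      # torch.Size([64, 20, 3, 3])

# flow_vgg16.py style: repeat() on the 3-D tensor adds the extra dim at the front instead
flow_weight_v2 = rgb_weight_mean.repeat(1, in_channels, 1, 1)
print(flow_weight_v2.shape)   # torch.Size([1, 1280, 3, 3])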

@shijubushiju
Author

@bryanyzhu OK, I got an error when I ran the first one, and then it ran perfectly with the second modification. Thank you for your reply. I will consult you if I have any more questions.

@shijubushiju
Author

@bryanyzhu
In VideoSpatialPrediction.py:

def VideoSpatialPrediction(
    vid_name,
    net,
    num_categories,
    start_frame=0,
    num_frames=0,
    num_samples=25
):

Is num_samples the number of test videos here?

The other problem is:
The model is tested using recorded video. What should I do if I want to test it online?

@bryanyzhu
Owner

num_samples means the number of frames sampled from one video. This is the standard evaluation setting used in earlier work: we take 25 frames per video and do a 10-crop on each frame. So for each video we actually perform 250 forward passes and average the predictions to get the final result.
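
Roughly, the averaging at the end boils down to something like this. The 101 categories are just a placeholder (e.g. for UCF101), and scores stands for whatever the network outputs for the 250 crops:

import torch

num_samples, num_crops, num_categories = 25, 10, 101

# Hypothetical per-crop scores: one row per forward pass (25 frames x 10 crops = 250)
scores = torch.randn(num_samples * num_crops, num_categories)

# Average all 250 predictions, then take the top class as the video-level result
video_pred = scores.mean(dim=0)
predicted_class = video_pred.argmax().item()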

For online video, what people usually do (or the simplest way) is to wait for a few frames, run the prediction on them, and average the results. Then repeat the same thing in a sliding-window fashion.
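
A minimal sketch of that sliding-window idea, reusing the preprocess/net placeholders from the earlier snippet; the window size of 25 and stride of 5 are arbitrary choices here:

import collections
import cv2
import torch

WINDOW = 25   # how many recent frames to keep for one prediction
STRIDE = 5    # predict again after this many new frames arrive

cap = cv2.VideoCapture(0)              # 0 = default camera
buffer = collections.deque(maxlen=WINDOW)
new_frames = 0

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    buffer.append(preprocess(frame))
    new_frames += 1
    if len(buffer) == WINDOW and new_frames >= STRIDE:
        with torch.no_grad():
            preds = [net(x) for x in buffer]       # one forward pass per buffered frame
        avg_pred = torch.stack(preds).mean(dim=0)  # average the window's predictions
        new_frames = 0

cap.release()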
