
Use the video input from the camera for action recognition #45

Closed
shijubushiju opened this issue Dec 23, 2020 · 7 comments
Comments

@shijubushiju

Hello, author. I have reproduced your code, but now I want to use it to classify actions such as shaking hands and hugging while reading in the video. How can I achieve this?

@bryanyzhu
Owner

You can use cv2.VideoCapture to read the video (whether from a file offline or from a camera), grab the frames, and then run the prediction. Something like this:

import cv2

cap = cv2.VideoCapture(VIDEO_NAME)  # VIDEO_NAME can be a file path, or 0 for the default camera
net = get_model(MODEL_NAME)
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:                     # stop when the stream ends or a frame cannot be read
        break
    input = preprocess(frame)       # resize/normalize the frame as the model expects
    pred = net(input)

cap.release()

@shijubushiju
Author

@bryanyzhu Thank you very much. Let me have a try

@shijubushiju
Author

@bryanyzhu Hello, I have another question:
Lines 85 and 86 of the flow_vgg16.py file look like this:
rgb_weight_mean = torch.mean(rgb_weight, dim=1)
flow_weight = rgb_weight_mean.repeat(1,in_channels,1,1)
However, lines 179 and 182 of the file flow_resnet.py look like this:
rgb_weight_mean = torch.mean(rgb_weight, dim=1)
flow_weight = rgb_weight_mean.unsqueeze(1).repeat(1,in_channels,1,1)
How should I make sense of the difference?

@bryanyzhu
Owner

For the current PyTorch version, I think the second one is more rigorous. But both of them should work, because many operators support automatic broadcasting, so users don't need to worry about dimension mismatches.
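
For concreteness, here is a quick shape check of the two variants. The 64 output channels, 3x3 kernels, and 20 flow input channels below are just example values (e.g. 10 stacked flow frames x 2 channels), not numbers taken from the repository:

import torch

out_channels, in_channels, k = 64, 20, 3

# Pretrained RGB first-layer weight: [out_channels, 3, k, k]
rgb_weight = torch.randn(out_channels, 3, k, k)
rgb_weight_mean = torch.mean(rgb_weight, dim=1)               # shape [64, 3, 3]

# flow_resnet.py style: restore the channel dim before repeating
flow_weight = rgb_weight_mean.unsqueeze(1).repeat(1, in_channels, 1, 1)
print(flow_weight.shape)      # torch.Size([64, 20, 3, 3])

# flow_vgg16.py style: repeat() on the 3-D tensor adds the extra dim at the front instead
flow_weight_v2 = rgb_weight_mean.repeat(1, in_channels, 1, 1)
print(flow_weight_v2.shape)   # torch.Size([1, 1280, 3, 3])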

@shijubushiju
Author

@bryanyzhu OK, I got an error when I ran the first one, and then it ran perfectly with the second modification. Thank you for your reply. I will consult you if I have any more questions.

@shijubushiju
Author

@bryanyzhu
In VideoSpatialPrediction.py:

def VideoSpatialPrediction(
    vid_name,
    net,
    num_categories,
    start_frame=0,
    num_frames=0,
    num_samples=25
):

Is num_samples the number of test videos here?

The other problem is:
The model is tested using recorded video. What should I do if I want to test it online?

@bryanyzhu
Owner

num_samples means the number of frames sampled from one video. This is the standard evaluation setting used in earlier work: we take 25 frames per video and do a 10-crop on each frame. So for each video we actually perform 250 forward passes and average the predictions to get the final result.
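
Roughly, the averaging at the end boils down to something like this. The 101 categories are just a placeholder (e.g. for UCF101), and scores stands for whatever the network outputs for the 250 crops:

import torch

num_samples, num_crops, num_categories = 25, 10, 101

# Hypothetical per-crop scores: one row per forward pass (25 frames x 10 crops = 250)
scores = torch.randn(num_samples * num_crops, num_categories)

# Average all 250 predictions, then take the top class as the video-level result
video_pred = scores.mean(dim=0)
predicted_class = video_pred.argmax().item()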

For online video, what people usually do (or the simplest way) is to wait for a few frames, run the prediction on them, and average the results. Then repeat the same thing in a sliding-window fashion.
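
A minimal sketch of that sliding-window idea, reusing the preprocess/net placeholders from the earlier snippet; the window size of 25 and stride of 5 are arbitrary choices here:

import collections
import cv2
import torch

WINDOW = 25   # how many recent frames to keep for one prediction
STRIDE = 5    # predict again after this many new frames arrive

cap = cv2.VideoCapture(0)              # 0 = default camera
buffer = collections.deque(maxlen=WINDOW)
new_frames = 0

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    buffer.append(preprocess(frame))
    new_frames += 1
    if len(buffer) == WINDOW and new_frames >= STRIDE:
        with torch.no_grad():
            preds = [net(x) for x in buffer]       # one forward pass per buffered frame
        avg_pred = torch.stack(preds).mean(dim=0)  # average the window's predictions
        new_frames = 0

cap.release()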
