Configs for HowTo100M #9

Hi,

Thanks for this great work and repo! I'd like to know whether you used different training parameters / processing for the HowTo100M task. I did a straightforward adaptation of the code and config used for Kinetics (just changing the number of classes to 1059), but it doesn't seem to work (the loss doesn't decrease), both when fine-tuning from ImageNet directly and when fine-tuning from Kinetics.

Best,
Antoine Yang

Comments
Hi Antoine, I used the same training hyperparameters for HowTo100M as for the other experiments. However, I used a different loader, because the original loader would often fail to load HowTo100M videos properly. So, as a first step, I would suggest looking into data loading and making sure everything works properly. Furthermore, I used a 32- or 64-GPU setup for these experiments and haven't tested it with a smaller number of GPUs. I don't suspect that this would cause issues, since the model is forced to accumulate the gradient when fewer GPUs are used. However, as I said, I haven't tested it with a different hardware setup. What infrastructure are you using? Lastly, could you share your training logs with me so that I can take a look?
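For anyone unsure what that accumulation amounts to, here is a minimal PyTorch sketch of gradient accumulation in general (toy model and data; `accum_steps` is an illustrative name, and this is not necessarily how the repo implements it):

```python
import torch
from torch import nn

# Toy stand-ins; in practice these would be the video model, loss, and loader.
model = nn.Linear(16, 4)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(2, 16), torch.randint(0, 4, (2,))) for _ in range(16)]

accum_steps = 8  # assumed: effective batch = per-step batch * accum_steps

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accum_steps  # scale so accumulated grads match one large batch
    loss.backward()                              # gradients sum across iterations until step()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

With fewer GPUs, raising `accum_steps` keeps the effective batch size (and thus the learning-rate schedule's assumptions) roughly constant.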
Indeed, it seems to work better using decoding based on ffmpeg-python instead of pyav / torchvision. I am using one node with 4 32GB GPUs, allowing a train batch size of 64 for divided space-time 8x32. Here are the training logs (TensorBoard) after 5-6 epochs (approx. 32% train accuracy): https://drive.google.com/drive/folders/1qz3Nk4aroLCfNgiTo42foWRG91AmrCJW?usp=sharing Also, I did not quite understand the "single clip coverage" in Table 8: if I am not mistaken, videos are 25 FPS, and you sample one frame every 32 frames, so one 8-frame clip covers 32*8/25 = 10.24 s, not the 8.5 s mentioned in Table 8?
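For reference, a sketch of what such a decoder could look like with the ffmpeg-python package (the helper name, defaults, and the seek-then-decode structure are my assumptions, not the authors' loader):

```python
import ffmpeg  # pip install ffmpeg-python
import numpy as np

def load_clip(path, start_sec=0.0, num_frames=8, clip_sec=8.5):
    """Decode `num_frames` evenly spaced RGB frames covering `clip_sec` seconds.

    Hypothetical helper: clip_sec=8.5 mimics 8 frames at sampling rate 32
    from a 30 FPS decode (32*8/30 ~= 8.5 s).
    """
    stream = next(s for s in ffmpeg.probe(path)['streams']
                  if s['codec_type'] == 'video')
    w, h = int(stream['width']), int(stream['height'])
    out, _ = (
        ffmpeg
        .input(path, ss=start_sec)  # seek first: avoids decoding the whole long video
        .filter('fps', fps=num_frames / clip_sec)  # one frame every clip_sec/num_frames seconds
        .output('pipe:', format='rawvideo', pix_fmt='rgb24', vframes=num_frames)
        .run(capture_stdout=True, capture_stderr=True)
    )
    return np.frombuffer(out, np.uint8).reshape(num_frames, h, w, 3)
```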
Clip-level accuracy of 32% with random temporal/spatial sampling after 5 epochs sounds roughly right to me. I believe that the full inference procedure with 48 temporal clips should yield roughly the same results as in our paper once training is done. Regarding the "single clip coverage": the target FPS of the decoder I used was set to 30, so 32*8/30 ≈ 8.5 s.
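In other words, single-clip coverage is frames per clip × sampling rate / decode FPS:

```python
frames, sampling_rate = 8, 32
print(frames * sampling_rate / 30)  # 8.53 s: the ~8.5 s in Table 8 (30 FPS target decode)
print(frames * sampling_rate / 25)  # 10.24 s: what you get assuming the native 25 FPS
```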
@gberta, @antoyang - I guess that if you use the Kinetics data loader, you also have to change the number of sampled clips, since by default the Kinetics loader samples a single clip in train / val and allows sampling multiple clips only in test mode.
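Something like the following, paraphrasing the pattern in SlowFast-style Kinetics loaders (exact attribute and config names may differ in this repo; the defaults here are illustrative):

```python
def clips_per_video(mode, num_ensemble_views=10, num_spatial_crops=3):
    """Paraphrase of SlowFast-style Kinetics loader logic: train/val draw a
    single random clip; only test mode samples multiple views per video."""
    if mode in ("train", "val"):
        return 1
    return num_ensemble_views * num_spatial_crops  # e.g. 10 temporal x 3 spatial = 30

print(clips_per_video("train"), clips_per_video("test"))  # 1 30
```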
Hi, I am trying to reproduce the results for HowTo100M video classification, but there seems to be a problem with how pyav handles HowTo100M's long videos. However small a batch size I choose, there's a memory error in the dataloader. I am not able to train the network on HowTo100M videos, or even evaluate the given checkpoint. Can someone provide an ffmpeg snippet that would help me train and test on this task? Thanks
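Not the authors' pipeline, but a minimal PyTorch `Dataset` sketch built on a seek-then-decode helper like the `load_clip` sketched above: each `__getitem__` decodes only a few seconds, so full-length HowTo100M videos never sit in memory:

```python
import random
import torch
from torch.utils.data import Dataset

class HowTo100MClips(Dataset):
    """Hypothetical sketch: one short random clip per (video, label) sample."""

    def __init__(self, samples, clip_sec=8.5, num_frames=8):
        self.samples = samples        # list of (path, duration_sec, label) tuples
        self.clip_sec = clip_sec
        self.num_frames = num_frames

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, duration, label = self.samples[idx]
        start = random.uniform(0.0, max(duration - self.clip_sec, 0.0))
        frames = load_clip(path, start, self.num_frames, self.clip_sec)  # (T, H, W, 3) uint8
        # .copy() because frames decoded from a raw pipe buffer may be read-only
        clip = torch.from_numpy(frames.copy()).permute(3, 0, 1, 2).float() / 255.0
        return clip, label  # clip: (3, T, H, W) in [0, 1]
```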