Training/Validation Data Split #27

Closed
Boese0601 opened this issue Jul 30, 2023 · 6 comments

Comments

@Boese0601

Hi, thanks for your great work. I checked the TikTok TSV dataset and found that you've already split it into a training set and a validation set. Since it's not easy to match each image to its original sequence ID in the dataset, could you please clarify which sequences from the original TikTok dataset (000 to 340) are used for training and which are used for validation? Thanks!

@Wangt-CN
Owner

Wangt-CN commented Aug 4, 2023

Hi @Boese0601, thanks for your nice comment.

For training, we use sequences 000-334 of the TikTok dataset. For testing, we found that the TikTok dataset carries a potential risk of person ID leakage. Therefore, we chose to collect 10 short videos from both the 335-340 sequences and the Internet, making sure that no person IDs coincide, for a fair comparison.
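
As a rough illustration of the split described above (not the authors' code), the sequence ranges can be written out as follows. The function name `split_tiktok_sequences` and the assumption that sequence IDs are zero-padded three-digit strings with no gaps are hypothetical. The sketch covers only the original TikTok sequences; the additional Internet videos used for testing are not part of it.

```python
# Minimal sketch, assuming sequence IDs are zero-padded three-digit strings
# ("000" ... "340") with no gaps. The function name is hypothetical.

def split_tiktok_sequences(first=0, last=340, train_last=334):
    """Return (train_ids, held_out_ids) for the ranges stated in the thread."""
    ids = [f"{i:03d}" for i in range(first, last + 1)]
    # 000-334 are used for training.
    train_ids = [s for s in ids if int(s) <= train_last]
    # 335-340 are kept out of training; the test videos are drawn from
    # these sequences plus additional Internet videos (not modeled here).
    held_out_ids = [s for s in ids if int(s) > train_last]
    return train_ids, held_out_ids


train_ids, held_out_ids = split_tiktok_sequences()
print(len(train_ids), "training sequences")
print("held-out sequences:", held_out_ids)
```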

@Boese0601
Author

Thanks for your kind reply! That makes things clear. Btw could you please also upload those additional video sequences collected from the internet to Google Drive?

@Wangt-CN
Owner

Hi @Boese0601, I have submitted a request to the corporation to open-source the additional TikTok-style data. Since the data was collected by the corporation, we need to get its permission first.
For now, if you want to make a fair comparison, you can follow the penultimate line of Table 1, which does not use the additional data for training.

@notorious-eric

Hi, I downloaded the TSV file and found that it contains additional data. So, for the penultimate line of Table 1, you did not use the TSV file you provided and used only the 335-340 sequences for evaluation, is that correct?

@Wangt-CN
Owner

@notorious-eric Hi, do you mean the evaluation data? All the models are evaluated on the same data, i.e., 10 videos that combine the original TikTok test sequences with the additional data.

@Kelu007

Kelu007 commented May 8, 2024

> Hi @Boese0601, thanks for your nice comment.
>
> For training, we use sequences 000-334 of the TikTok dataset. For testing, we found that the TikTok dataset carries a potential risk of person ID leakage. Therefore, we chose to collect 10 short videos from both the 335-340 sequences and the Internet, making sure that no person IDs coincide, for a fair comparison.

Which videos were collected from the Internet for evaluation?
