Problems on result reproduction on MEGC2021 Benchmark #6

Closed
xjtupanda opened this issue Nov 19, 2022 · 5 comments

@xjtupanda

xjtupanda commented Nov 19, 2022

I've had a hard time trying to reproduce the results. Listed below is what I've tried.

  1. I re-organized the code in the way I'm used to and ran experiments on CASME_sq using features extracted by myself as instructed. The overall F1-score is somewhere around 0.23, so I suspected something was wrong with my feature extraction procedure and turned to the preprocessed features offered in the repo.
  2. I ran experiments on CASME_sq using the features you provided in the repo.
    Results:
    Final result: TP:101, FP:290, FN:256
    Precision = 0.2583
    Recall = 0.185
    F1-Score = 0.2156
    The results are still not good, so I finally tried to run the code in the Jupyter notebook provided in Code for evaluation on MEGC2021 benchmark? #3
  3. I ran experiments on CASME_sq & SAMMLV using the notebook & features you provided in the repo. Here are the results.

Reproduction ipynb:

CASME:

Micro result: TP:3 FP:137 FN:54 F1_score:0.0305
Macro result: TP:100 FP:206 FN:200 F1_score:0.3300
Overall result: TP:103 FP:343 FN:254 F1_score:0.2565

SAMMLV:

Cumulative result until subject 30:
Micro result: TP:10 FP:169 FN:149 F1_score:0.0592
Macro result: TP:97 FP:277 FN:246 F1_score:0.2706
Overall result: TP:107 FP:446 FN:395 F1_score:0.2028

Orig ipynb:

CASME:
Micro result: TP:5 FP:77 FN:52 F1_score:0.0719
Macro result: TP:108 FP:166 FN:192 F1_score:0.3763
Overall result: TP:113 FP:243 FN:244 F1_score:0.3170

SAMM:

Micro result: TP:12 FP:104 FN:147 F1_score:0.0873
Macro result: TP:88 FP:198 FN:255 F1_score:0.2798
Overall result: TP:100 FP:302 FN:402 F1_score:0.2212

As reported above, there's a huge gap between the reproduction result and the original performance on CASME_sq, while the gap for the SAMMLV dataset is much smaller.
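
For reference, the F1 scores above follow the standard precision/recall formulation over the TP/FP/FN counts; a minimal sketch (plain Python, not the repo's evaluation code) that reproduces, e.g., the CASME micro and macro lines from the reproduction notebook:

```python
# Minimal sketch (not the repo's evaluation code): precision, recall and F1
# computed from TP/FP/FN counts, checked against the CASME lines above.
def f1_from_counts(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_from_counts(3, 137, 54), 4))     # 0.0305 (CASME micro, reproduction ipynb)
print(round(f1_from_counts(100, 206, 200), 4))  # 0.33   (CASME macro, reproduction ipynb)
```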

I've also tried fixing the random seed to 1 (see the seeding sketch below), which does not improve the result, while replacing the mix of hard & soft label loss with a pure hard label loss does improve it. Moreover, I noticed there are many subtle differences between the original code and the Jupyter notebook; using the spotting method from the original code produces very bad results:
Final result: TP:53, FP:320, FN:304
Precision = 0.1421
Recall = 0.0849
F1-Score = 0.1063
Replacing it with the spotting method from the Jupyter notebook works out better, with results:
Final result: TP:102, FP:299, FN:255
Precision = 0.2544
Recall = 0.1841
F1-Score = 0.2136
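
For reference, a minimal seed-fixing sketch (assuming a PyTorch-based setup; the framework calls would need adjusting if the repo uses TensorFlow/Keras):

```python
# Minimal sketch (assumed PyTorch setup): fix the common sources of randomness.
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 1) -> None:
    random.seed(seed)                      # Python's built-in RNG
    np.random.seed(seed)                   # NumPy RNG
    torch.manual_seed(seed)                # CPU RNG
    torch.cuda.manual_seed_all(seed)       # all GPU RNGs
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.backends.cudnn.deterministic = True   # trade speed for reproducibility
    torch.backends.cudnn.benchmark = False

set_seed(1)
```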

I also found a typo in the original code:

if end-start > macro_min and end-start < macro_max and ( score_plot_micro[peak] > 0.95 or (score_plot_macro[peak] > score_plot_macro[start] and score_plot_macro[peak] > score_plot_macro[end])):

I believe score_plot_micro[peak] > 0.95 should be score_plot_macro[peak] > 0.95
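
For clarity, a sketch of the intended check after that one-symbol fix (keeping the repo's variable names; only the score plot used in the 0.95 threshold test changes):

```python
# Hedged sketch of the corrected macro interval check, not the repo's exact code:
# both the 0.95 threshold and the peak-prominence test use score_plot_macro.
def is_macro_interval(start, peak, end, score_plot_macro, macro_min, macro_max):
    length_ok = macro_min < end - start < macro_max
    peak_ok = (score_plot_macro[peak] > 0.95
               or (score_plot_macro[peak] > score_plot_macro[start]
                   and score_plot_macro[peak] > score_plot_macro[end]))
    return length_ok and peak_ok
```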

I'm trying to build some improvements on top of your work and use it as a baseline model, but I'm very frustrated by the reproduction results. Any insight/help would be greatly appreciated.

@xjtupanda
Author

I've also uploaded my reproduction notebook: https://drive.google.com/file/d/1L-8EHmPC3HIj5Swd_aArrcCjTGZ99Wc6/view?usp=share_link

The only modification is an evaluation detail:
if len(preds_micro) == 0: preds_micro.append([0, 0, 0, 0, 0, -1])#, 0]) # -1 to bypass the count of additional fp
Because the evaluation function requires each prediction item to have length 6 and each GT item length 7, I removed the last element in each item. The original notebook uses lengths 7 and 8, respectively.

@xjtupanda
Author

xjtupanda commented Nov 19, 2022

After closer inspection, I think the main reason for the poor result is the low F1-score of micro-expression spotting (I still don't understand why I can't reproduce the results, though). Many of my runs produce few TPs and many FPs for micro-expressions, which then drags down the macro-expression result as well. There are also many spotting hyper-parameters to tune, which I believe are crucial to the final result. Is there any search strategy, or do you just tune by viewing the score plot?

@genbing99
Owner

Thanks for your investigation, especially the typo in the original code.

Please refer to the Jupyter notebook, as I might have wrongly copied some parts when converting to the .py files.

For the modification that you made:
preds_micro.append([0, 0, 0, 0, 0, -1])#, 0]) # -1 to bypass the count of additional fp
You should use the line from the original code:
preds_micro.append([0, 0, 0, 0, 0, -1, 0]) # -1 to bypass the count of additional fp
Otherwise, an extra FP is added whenever no prediction is made on a particular video. Note that your evaluation of CASME_sq Subject 1 reports FP:7; in fact, it should be FP:0.
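
A minimal sketch of that guard (the field layout beyond the -1 sentinel is an assumption based on this discussion, not the repo's documented format):

```python
# Hedged sketch: pad an empty prediction list with a sentinel entry so that the
# evaluation does not count an extra FP for a video with no spotted intervals.
def pad_if_empty(preds_micro):
    if len(preds_micro) == 0:
        # 7-element item as in the original notebook; the -1 marks "no prediction"
        # so the evaluator bypasses the additional-FP count for this video.
        preds_micro.append([0, 0, 0, 0, 0, -1, 0])
    return preds_micro
```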

@genbing99
Owner

The hyperparameters were set using a loop, which takes some time. I think this is normal for the spotting task; other methods also use many hyperparameters while processing the "signal" graph.
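
A rough sketch of what such a loop could look like (the hyperparameter names, search ranges, and the evaluate_f1 callback are illustrative assumptions, not the repo's actual values or API):

```python
import itertools

# Illustrative grid search over spotting hyperparameters; evaluate_f1 stands in
# for the spotting + evaluation pipeline and is an assumed callback, not a real API.
def grid_search(evaluate_f1):
    best_f1, best_params = -1.0, None
    peak_thresholds = [0.80, 0.85, 0.90, 0.95]   # threshold on the score plot
    min_lengths = [10, 15, 20]                   # minimum interval length (frames)
    max_lengths = [60, 80, 100]                  # maximum interval length (frames)
    for thr, lo, hi in itertools.product(peak_thresholds, min_lengths, max_lengths):
        f1 = evaluate_f1(peak_threshold=thr, min_length=lo, max_length=hi)
        if f1 > best_f1:
            best_f1, best_params = f1, (thr, lo, hi)
    return best_params, best_f1
```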

@xjtupanda
Author

> The hyperparameters were set using a loop, which takes some time. I think this is normal for the spotting task; other methods also use many hyperparameters while processing the "signal" graph.

I see. Thank you again for your patience.
