
Can't reproduce the BLEU, METEOR and ROUGE_L numbers in Page 6, Table 5, Evaluation on Point Cloud-Text Tasks #23

Closed
zhurob opened this issue Jun 25, 2024 · 2 comments


@zhurob

zhurob commented Jun 25, 2024

I followed https://github.com/csuhan/OneLLM/blob/main/docs/Evaluation.md:

Point-Text Evaluation / PointLLM Caption

1. Download PointLLM data from this link.
2. Fill `pretrained_path` in `eval/point_cap_pointllm.py` and run: `python eval/point_cap_pointllm.py`.
3. Evaluate with `eval/caption_eval.py` (a minimal sketch of this step is below). The annotation file is at `datasets/Eval/point/pointllm_test_cococap.json`.
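
For reference, this is roughly how we score the captions. It is only a minimal COCO-caption-style sketch assuming `pycocoevalcap` is installed; the results file name `point_cap_predictions.json` is a placeholder, not necessarily what `eval/point_cap_pointllm.py` actually writes.

```python
# Minimal COCO-caption-style evaluation sketch (assumes pycocoevalcap is installed).
# "point_cap_predictions.json" is a placeholder COCO-format results file,
# not necessarily the exact output file of eval/point_cap_pointllm.py.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

annotation_file = "datasets/Eval/point/pointllm_test_cococap.json"
results_file = "point_cap_predictions.json"  # placeholder name

coco = COCO(annotation_file)            # ground-truth captions
coco_res = coco.loadRes(results_file)   # model predictions

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()  # score only predicted ids
coco_eval.evaluate()

for metric, score in coco_eval.eval.items():
    print(f"{metric}: {score:.3f}")     # Bleu_1..4, METEOR, ROUGE_L, CIDEr, SPICE
```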

Several of my team members and I tried to reproduce your Table 5 numbers for OneLLM, and we all got similarly low BLEU, METEOR and ROUGE_L scores, like those below; CIDEr is even zero. Can you please double check? We believe we are using the same point cloud files, scripts, and model. Thank you. Rob
Bleu_1: 0.104
Bleu_2: 0.065
Bleu_3: 0.045
Bleu_4: 0.034
METEOR: 0.131
ROUGE_L: 0.175
CIDEr: 0.000
SPICE: 0.094

From https://arxiv.org/pdf/2312.03700, Page 6, Table 5, Evaluation on Point Cloud-Text Tasks: "The evaluation dataset is from Objaverse [16], following the data split in PointLLM [92]. InstructBLIP takes a single-view image as input, while PointLLM and OneLLM take a point cloud as input. GPT4-Acc.: GPT-4 as the accuracy evaluator [92]."

| Model | BLEU-1 (Captioning) | ROUGE-L (Captioning) | METEOR (Captioning) | GPT4-Acc. (Classification) |
|---|---|---|---|---|
| InstructBLIP-7B [15] | 11.2 | 13.9 | 14.9 | 38.5 |
| InstructBLIP-13B [15] | 12.6 | 15.0 | 16.0 | 35.5 |
| PointLLM-7B [92] | 8.0 | 11.1 | 15.2 | 47.5 |
| PointLLM-13B [92] | 9.7 | 12.8 | 15.3 | 45.0 |
| OneLLM-7B (Ours) | 42.2 | 45.3 | 20.3 | 44.5 |

@csuhan
Owner

csuhan commented Jul 8, 2024

Our point cloud caption results are evaluated with the Phase II model (Multimodal Alignment). The final model after instruction tuning tends to output long and detailed responses, while the caption benchmark requires a short sentence, which makes it perform poorly on this benchmark.

A simple way to improve it is to change the task prompt from "What is this?" to "Provide a one-sentence caption" (a sketch of the change is below):
https://github.com/csuhan/OneLLM/blob/73393b17a14fa58a179b450a2fe2d2d640dd61fc/eval/point_cap_pointllm.py#L38C21-L38C34
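
For illustration, the intended edit looks roughly like this; the variable name `prompt` is an assumption here, not necessarily the exact identifier on the linked line:

```python
# In eval/point_cap_pointllm.py, around the linked line.
# NOTE: the variable name "prompt" is assumed for illustration;
# check the actual identifier at the linked line.

# before: the model tends to answer with a long, detailed description
prompt = "What is this?"

# after: explicitly ask for a short caption, matching the benchmark's reference style
prompt = "Provide a one-sentence caption"
```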

@zhurob
Author

zhurob commented Jul 8, 2024

Good fix. Thank you very much; I verified it and it works.

@zhurob zhurob closed this as completed Jul 8, 2024