
Can't reproduce the BLEU, METEOR and ROUGE_L numbers in Page 6, Table 5, Evaluation on Point Cloud-Text Tasks #23

Closed
zhurob opened this issue Jun 25, 2024 · 2 comments


@zhurob

zhurob commented Jun 25, 2024

I followed https://github.com/csuhan/OneLLM/blob/main/docs/Evaluation.md:

Point-Text Evaluation / PointLLM Caption

1. Download PointLLM data from this link.
2. Fill `pretrained_path` in `eval/point_cap_pointllm.py` and run: `python eval/point_cap_pointllm.py`.
3. Evaluate with `eval/caption_eval.py` (a minimal sketch of this step is below). The annotation file is at `datasets/Eval/point/pointllm_test_cococap.json`.
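
For reference, this is roughly how we score the captions. It is only a minimal COCO-caption-style sketch assuming `pycocoevalcap` is installed; the results file name `point_cap_predictions.json` is a placeholder, not necessarily what `eval/point_cap_pointllm.py` actually writes.

```python
# Minimal COCO-caption-style evaluation sketch (assumes pycocoevalcap is installed).
# "point_cap_predictions.json" is a placeholder COCO-format results file,
# not necessarily the exact output file of eval/point_cap_pointllm.py.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

annotation_file = "datasets/Eval/point/pointllm_test_cococap.json"
results_file = "point_cap_predictions.json"  # placeholder name

coco = COCO(annotation_file)            # ground-truth captions
coco_res = coco.loadRes(results_file)   # model predictions

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()  # score only predicted ids
coco_eval.evaluate()

for metric, score in coco_eval.eval.items():
    print(f"{metric}: {score:.3f}")     # Bleu_1..4, METEOR, ROUGE_L, CIDEr, SPICE
```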

Several of my team members and I tried to reproduce your Table 5 numbers for OneLLM, and we all got similarly low BLEU, METEOR and ROUGE_L scores, like those below; CIDEr is even zero. Can you please double check? We believe we are using the same point cloud files, scripts, and model. Thank you. Rob
Bleu_1: 0.104
Bleu_2: 0.065
Bleu_3: 0.045
Bleu_4: 0.034
METEOR: 0.131
ROUGE_L: 0.175
CIDEr: 0.000
SPICE: 0.094

From https://arxiv.org/pdf/2312.03700, Page 6, Table 5, Evaluation on Point Cloud-Text Tasks: "The evaluation dataset is from Objaverse [16], following the data split in PointLLM [92]. InstructBLIP takes a single-view image as input, while PointLLM and OneLLM take a point cloud as input. GPT4-Acc.: GPT-4 as the accuracy evaluator [92]."

| Model | BLEU-1 (Captioning) | ROUGE-L (Captioning) | METEOR (Captioning) | GPT4-Acc. (Classification) |
|---|---|---|---|---|
| InstructBLIP-7B [15] | 11.2 | 13.9 | 14.9 | 38.5 |
| InstructBLIP-13B [15] | 12.6 | 15.0 | 16.0 | 35.5 |
| PointLLM-7B [92] | 8.0 | 11.1 | 15.2 | 47.5 |
| PointLLM-13B [92] | 9.7 | 12.8 | 15.3 | 45.0 |
| OneLLM-7B (Ours) | 42.2 | 45.3 | 20.3 | 44.5 |

@csuhan
Owner

csuhan commented Jul 8, 2024

Our point cloud caption results are evaluated with the Phase II model (Multimodal Alignment). The final model after instruction tuning tends to output long and detailed responses, while the caption benchmark requires a short sentence, which makes it perform poorly on this benchmark.

A simple way to improve it is to change the task prompt from "What is this?" to "Provide a one-sentence caption" (a sketch of the change is below):
https://github.com/csuhan/OneLLM/blob/73393b17a14fa58a179b450a2fe2d2d640dd61fc/eval/point_cap_pointllm.py#L38C21-L38C34
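
For illustration, the intended edit looks roughly like this; the variable name `prompt` is an assumption here, not necessarily the exact identifier on the linked line:

```python
# In eval/point_cap_pointllm.py, around the linked line.
# NOTE: the variable name "prompt" is assumed for illustration;
# check the actual identifier at the linked line.

# before: the model tends to answer with a long, detailed description
prompt = "What is this?"

# after: explicitly ask for a short caption, matching the benchmark's reference style
prompt = "Provide a one-sentence caption"
```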

@zhurob
Author

zhurob commented Jul 8, 2024

Good fix. Thank you very much; I verified it and it works.

@zhurob zhurob closed this as completed Jul 8, 2024