
About test code #3

Closed · aspenstarss opened this issue Jan 11, 2021 · 11 comments

@aspenstarss commented Jan 11, 2021

Hello Zhihong,
Thanks for open-sourcing your code. It's very nice work.
I'd like to ask you a few questions about reproducing the paper's results.

I evaluate the results by saving the generated sentences to a JSON file.
When I resume the model from the checkpoint you provided, using the following command:

CUDA_VISIBLE_DEVICES=4 python main.py \
--image_dir data/iu/images/ \
--ann_path data/iu/annotation.json \
--dataset_name iu_xray \
--max_seq_length 60 \
--threshold 3 \
--batch_size 16 \
--epochs 100 \
--save_dir results/reproduce_iu_xray \
--step_size 50 \
--gamma 0.1 \
--seed 9223 \
--resume data/model_iu_xray.pth

I see "Checkpoint loaded. Resume training from epoch 15." and the model generates the output JSON files.
I then use pycocoevalcap to evaluate the results. The results are as follows:

Bleu_1 Bleu_2 Bleu_3 Bleu_4 CIDEr ROUGE_L METEOR
0.4334 0.2863 0.2069 0.1554 0.5432 0.3245 0.1945
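
For reference, this is roughly how I compute these numbers over the saved JSON (a minimal sketch of my own evaluation script, not code from this repo; the file path and the 'filename'/'prediction'/'ground_truth' fields follow my saving function, which is quoted later in this thread):

    # Minimal evaluation sketch: score saved generations with pycocoevalcap.
    import json

    from pycocoevalcap.bleu.bleu import Bleu
    from pycocoevalcap.cider.cider import Cider
    from pycocoevalcap.meteor.meteor import Meteor
    from pycocoevalcap.rouge.rouge import Rouge

    with open('results/reproduce_iu_xray/Enc2Dec-15_test_generated.json') as f:
        records = json.load(f)

    # pycocoevalcap scorers expect dicts mapping an id to a list of sentences.
    gts = {r['filename']: [r['ground_truth']] for r in records}
    res = {r['filename']: [r['prediction']] for r in records}

    scorers = [
        (Bleu(4), ['Bleu_1', 'Bleu_2', 'Bleu_3', 'Bleu_4']),
        (Meteor(), 'METEOR'),
        (Rouge(), 'ROUGE_L'),
        (Cider(), 'CIDEr'),
    ]
    for scorer, names in scorers:
        score, _ = scorer.compute_score(gts, res)
        if isinstance(names, list):
            for name, value in zip(names, score):
                print(name, round(value, 4))
        else:
            print(names, round(score, 4))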

The scores seem to differ from the paper somewhere.
Could you share your test code or your generated-results JSON file?

@aspenstarss (Author) commented Jan 11, 2021

My generated sentences can be found in this gist.
Thanks for your attention.

@aspenstarss changed the title from "About reproducing paper results" to "About test code" on Jan 11, 2021
@ksz-creat commented, quoting @aspenstarss:

Hello Zhihong,
Thanks for open-sourcing your code. It's very nice work.
I'd like to ask you a few questions about reproducing the paper's results.

I evaluate the results by saving the generated sentences to a JSON file. To do this, I comment out the training code (Trainer.py, lines 190-202) and add the following function at Trainer.py line 177:

    def _output_generation(self, predictions, gts, idxs, epoch, subset):
        # Save each generated report next to its reference, tagged with a
        # per-sentence BLEU-4 score so the best/worst cases are easy to inspect.
        import json
        import os

        from nltk.translate.bleu_score import sentence_bleu

        output = list()
        for idx, pre, gt in zip(idxs, predictions, gts):
            # sentence_bleu expects a list of tokenized references.
            score = sentence_bleu([gt.split()], pre.split())
            output.append({'filename': idx, 'prediction': pre, 'ground_truth': gt, 'bleu4': score})

        # Sort by BLEU-4, best first, then write one JSON file per epoch/subset.
        output = sorted(output, key=lambda x: x['bleu4'], reverse=True)
        output_filename = os.path.join(self.checkpoint_dir, 'Enc2Dec-' + str(epoch) + '_' + subset + '_generated.json')
        with open(output_filename, 'w') as f:
            json.dump(output, f, ensure_ascii=False)

and call it at Trainer.py line 232:

self._output_generation(test_res, test_gts, test_idxs, epoch, 'test')

When I resume the model from the checkpoint you provided, using the following command:

CUDA_VISIBLE_DEVICES=4 python main.py \
--image_dir data/iu_2image/images/ \
--ann_path data/iu_2image/annotation.json \
--dataset_name iu_xray \
--max_seq_length 60 \
--threshold 3 \
--batch_size 16 \
--epochs 100 \
--save_dir results/reproduce_iu_xray \
--step_size 50 \
--gamma 0.1 \
--seed 9223 \
--resume data/model_iu_xray.pth

I see "Checkpoint loaded. Resume training from epoch 15." and the model generates the output JSON files.
I use pycocoevalcap to evaluate the results:

Bleu_1 Bleu_2 Bleu_3 Bleu_4 CIDEr ROUGE_L METEOR
0.4334 0.2863 0.2069 0.1554 0.5432 0.3245 0.1945

However, when I evaluate the same outputs with nltk.translate.bleu_score.sentence_bleu, I get the following BLEU scores:

Bleu_1 Bleu_2 Bleu_3 Bleu_4
0.4879 0.3194 0.2324 0.1772
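
For reference, this is roughly how I average the per-sentence NLTK scores (a sketch of my own script; note that averaging sentence-level BLEU with NLTK's default unsmoothed n-gram precisions is not the same computation as pycocoevalcap's corpus-level BLEU, which may explain part of the gap):

    # Sketch: average NLTK sentence-level BLEU over the saved generations.
    import json

    from nltk.translate.bleu_score import sentence_bleu

    with open('results/reproduce_iu_xray/Enc2Dec-15_test_generated.json') as f:
        records = json.load(f)

    for n in range(1, 5):
        # Uniform weights over 1..n-grams, i.e. BLEU-n.
        weights = (1.0 / n,) * n
        scores = [
            sentence_bleu([r['ground_truth'].split()], r['prediction'].split(), weights=weights)
            for r in records
        ]
        print('Bleu_%d' % n, round(sum(scores) / len(scores), 4))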

It seems I made a mistake somewhere.
Could you share your test code or your generated-results JSON file?

Best,
Shuxin

Hello, could you tell me where the mistake is? I would appreciate it.
Thanks

@nooralahzadeh commented Jan 27, 2021

Hi, thanks for sharing the code; it helps to reproduce the results.
When I run the code on the IU X-Ray dataset, I get the results below. It seems that you report the model's performance based on the best result on the test set rather than on the validation set,
and even that is lower than the number reported for this dataset in the paper. I wonder if your hyperparameter setting differs from the one in this GitHub repository.

Best results (w.r.t. BLEU_4) on the validation set: val_BLEU_4 : 0.13399466273794292

    epoch          : 22
    train_loss     : 0.5585661835395372
    val_BLEU_1     : 0.3783063834532518
    val_BLEU_2     : 0.24550515630336806
    val_BLEU_3     : 0.17719979948687553
    val_ROUGE_L    : 0.3390687946347631
    test_BLEU_1    : 0.38823129074169
    test_BLEU_2    : 0.24574379214373973
    test_BLEU_3    : 0.17582539135315156
    test_BLEU_4    : 0.13336767408445888
    test_ROUGE_L   : 0.3419288795162052

Best results (w.r.t. BLEU_4) on the test set: test_BLEU_4 : 0.15495773913794939

    epoch          : 15
    train_loss     : 0.9282983885361598
    val_BLEU_1     : 0.4048809446059833
    val_BLEU_2     : 0.24409295208500478
    val_BLEU_3     : 0.16905404129725854
    val_BLEU_4     : 0.1268627057419589
    val_ROUGE_L    : 0.3322505123128848
    test_BLEU_1    : 0.446254324796507
    test_BLEU_2    : 0.27826410927242545
    test_BLEU_3    : 0.20113763688850164
    test_ROUGE_L   : 0.35131416076389516
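
To make the selection issue concrete, this is the kind of post-hoc check I mean (a hypothetical sketch; 'log' is an assumed list of per-epoch metric dicts, filled in with rounded numbers from the two epochs above):

    # Hypothetical sketch: choose the epoch to report by validation BLEU-4,
    # then read off that same epoch's test metrics. Values are rounded from
    # the logs above; 'log' is an assumed structure, not the repo's code.
    log = [
        {'epoch': 15, 'val_BLEU_4': 0.1269, 'test_BLEU_4': 0.1550},
        {'epoch': 22, 'val_BLEU_4': 0.1340, 'test_BLEU_4': 0.1334},
    ]

    best_by_val = max(log, key=lambda e: e['val_BLEU_4'])
    best_by_test = max(log, key=lambda e: e['test_BLEU_4'])

    # Reporting best_by_test overstates performance; selecting on the
    # validation set is the standard protocol.
    print('selected by val:  epoch %(epoch)d, test_BLEU_4=%(test_BLEU_4).4f' % best_by_val)
    print('selected by test: epoch %(epoch)d, test_BLEU_4=%(test_BLEU_4).4f' % best_by_test)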

@ksz-creat commented (quoting @aspenstarss's first comment above):

Hello, I tried your code to produce the generated sentences, but it only generates 14 sentences. Could you tell me where the problem is? Thanks very much.

@zhjohnchan (Contributor) commented

Thanks for your attention to our paper!

I will add the features you mentioned in the future, after I finish some higher-priority work.

@mlii0117 commented Mar 4, 2021

Hi guys,
Firstly, thanks for sharing your code; it is really nice work.
When I used your code for training, I found the best results at epoch 1,
even though the training loss is at its highest there.
Do you have any idea about this?
Does it mean the current metric is not suitable for this task?
[Screenshot from 2021-03-04 11-57-38]
Did you find the same problem, @nooralahzadeh?
Thanks

@nooralahzadeh commented

@mlii0117 Can you give more info on how you ran the code and what parameter values you used? If you look at the generated reports, the model probably produces the same sentence for all cases, which is how it gets its best result at epoch 1.
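
A toy illustration of that failure mode (the strings here are invented, not from the dataset): one generic sentence emitted for every study can still collect substantial word overlap with short, formulaic references:

    # Toy demo with made-up strings: a degenerate model that repeats one
    # generic sentence can still score non-trivially on BLEU-1, because
    # common words overlap with short, formulaic reference reports.
    from nltk.translate.bleu_score import sentence_bleu

    generic = 'the heart is normal in size . the lungs are clear .'
    references = [
        'the heart is normal in size . the lungs are clear .',
        'heart size is normal . no focal consolidation .',
        'the lungs are clear . no pleural effusion .',
    ]

    scores = [
        sentence_bleu([ref.split()], generic.split(), weights=(1.0,))  # BLEU-1
        for ref in references
    ]
    print('mean BLEU-1 of the repeated sentence:', sum(scores) / len(scores))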

@mlii0117 commented Mar 4, 2021

@nooralahzadeh thanks for your reply. I have found the reason.

@luantunez commented

Thank you for your explanations. I was wondering about the content of annotation.json. What does it contain?
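
My current guess, judging only from how the data loader indexes the file by split (this structure is an assumption on my part, not confirmed by the authors):

    # Assumed structure of annotation.json (my guess, not confirmed):
    # one list of examples per split, each pairing image paths with the
    # reference report. The ids and paths below are hypothetical.
    assumed_annotation = {
        'train': [
            {
                'id': 'CXR1000_IM-0003',
                'image_path': ['CXR1000_IM-0003/0.png', 'CXR1000_IM-0003/1.png'],
                'report': 'the heart is normal in size . the lungs are clear .',
                'split': 'train',
            },
        ],
        'val': [],   # same record format
        'test': [],  # same record format
    }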

@Markin-Wang commented

(Quoting @mlii0117: "@nooralahzadeh thanks for your reply. I have found the reason.")

Hi, I also have this problem. Could you share what causes it and how to solve it?
Thanks.

@wlufy commented Apr 11, 2022

(Quoting @nooralahzadeh's comment above in full.)

Sorry to bother you.
I get a similar result, and like @mlii0117 I find that these metrics are very high in the first few epochs. Can you give some advice on this problem and how to solve it?
Thanks!
