
Add BLEU metric #2001

Merged
merged 4 commits into from Nov 15, 2018

Conversation

@epwalsh
Contributor

epwalsh commented Nov 1, 2018

I thought I should share my implementation of BLEU. This is the first time I've implemented it, so there may be a more efficient way, but I'm pretty sure the math checks out at least. If anyone has thoughts on how to improve it, please let me know!
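
For reference, here's a rough sketch of the corpus-level BLEU arithmetic involved: clipped n-gram precisions and sequence lengths accumulated over the whole corpus, combined into a geometric mean with a brevity penalty. Everything below is illustrative only, not the PR's actual code (which lives in allennlp/training/metrics/bleu.py and accumulates these statistics from batches of tensors rather than Python lists).

import math
from collections import Counter
from typing import Iterable, List, Set, Tuple


def _ngrams(tokens: List[int], ngram_size: int, exclude: Set[int]) -> Counter:
    # Count n-grams in a sequence of token ids, skipping any n-gram that
    # contains an excluded id (e.g. the padding index 0).
    counts: Counter = Counter()
    for i in range(len(tokens) - ngram_size + 1):
        ngram: Tuple[int, ...] = tuple(tokens[i:i + ngram_size])
        if not any(token in exclude for token in ngram):
            counts[ngram] += 1
    return counts


def corpus_bleu(predictions: Iterable[List[int]],
                references: Iterable[List[int]],
                max_ngram_size: int = 4,
                exclude: Set[int] = None) -> float:
    exclude = exclude if exclude is not None else {0}
    # Clipped n-gram matches, n-gram totals, and sequence lengths are all
    # accumulated over the whole corpus before the final score is computed.
    matches: Counter = Counter()
    totals: Counter = Counter()
    prediction_length = reference_length = 0
    for prediction, reference in zip(predictions, references):
        prediction_length += len(prediction)
        reference_length += len(reference)
        for n in range(1, max_ngram_size + 1):
            prediction_counts = _ngrams(prediction, n, exclude)
            reference_counts = _ngrams(reference, n, exclude)
            matches[n] += sum(min(count, reference_counts[ngram])
                              for ngram, count in prediction_counts.items())
            totals[n] += sum(prediction_counts.values())
    precisions = [matches[n] / totals[n] if totals[n] > 0 else 0.0
                  for n in range(1, max_ngram_size + 1)]
    if prediction_length == 0 or min(precisions) == 0.0:
        return 0.0
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / max_ngram_size)
    brevity_penalty = (1.0 if prediction_length > reference_length
                       else math.exp(1 - reference_length / prediction_length))
    return brevity_penalty * geometric_mean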

# Unigrams.
counts = Counter(self.metric._ngrams(tensor, 1, set((0,))))
unigram_check = {(1,): 2, (2,): 2, (3,): 1}
assert counts == unigram_check

@42B

42B Nov 1, 2018

I'm curious about the use of bare assert statements inside what is essentially a unittest.TestCase class?

import unittest


class Test(unittest.TestCase):
    def test_bare(self):
        a = [1, 2, 3]
        b = [3, 4, 5]
        assert a == b

    def test_self_assert(self):
        a = [1, 2, 3]
        b = [3, 4, 5]
        self.assertEqual(a, b)


if __name__ == '__main__':
    unittest.main()

Isn't the output much more descriptive when using (in this particular case) self.assertEqual?

FF
======================================================================
FAIL: test_bare (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "", line 8, in test_bare
    assert a == b
AssertionError

======================================================================
FAIL: test_self_assert (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "", line 13, in test_self_assert
    self.assertEqual(a, b)
AssertionError: Lists differ: [1, 2, 3] != [3, 4, 5]

First differing element 0:
1
3

- [1, 2, 3]
+ [3, 4, 5]

----------------------------------------------------------------------
Ran 2 tests in 0.001s

FAILED (failures=2)

@epwalsh

epwalsh Nov 1, 2018

Author Contributor

Given that we're using pytest, I think the output from a raw assert failure is actually just as, if not more, descriptive:

Here's an assertDictEqual failure:
[screenshot: assertDictEqual failure output]
Here's a bare assert failure:
[screenshot: bare assert failure output]
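
For anyone comparing the two styles, a minimal pytest-style version of the same kind of check might look like the following (a hypothetical test, not from the PR); pytest's assertion rewriting turns the bare assert into a failure report that includes a per-key diff of the two dicts:

from collections import Counter


def test_unigram_counts():
    counts = Counter({(1,): 2, (2,): 2, (3,): 1})
    expected = {(1,): 2, (2,): 1, (3,): 1}
    # This assert fails on purpose: run under pytest, the report shows exactly
    # which keys and values differ, much like unittest's assertDictEqual.
    assert counts == expected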

@42B

42B Nov 1, 2018

@epwalsh

I'm curious why you would still write tests using the old unittest style then? What's the advantage?

@brendan-ai2

brendan-ai2 Nov 12, 2018

Member

I agree with you, @epwalsh. Bare assert seems more descriptive. It's also great to just compose raw Python instead of using custom functions.

@matt-gardner

Member

matt-gardner commented Nov 2, 2018

@42B, we use pytest. See https://docs.pytest.org/en/latest/.

@epwalsh, thanks, it'd be awesome to have BLEU in here! We're mostly all at EMNLP this week, though, so it'll be a little bit before we can pick this up.

@epwalsh

Contributor Author

epwalsh commented Nov 2, 2018

@matt-gardner no worries! I enjoyed the slides for the "writing code for NLP" talk, by the way 👏

@brendan-ai2 brendan-ai2 self-requested a review Nov 6, 2018
@brendan-ai2 brendan-ai2 self-assigned this Nov 6, 2018
@maksym-del

Contributor

maksym-del commented Nov 10, 2018

Hi @epwalsh!

So far I have been wrapping BLEU from nltk, since the package is already in allennlp's requirements.

Is there any reason for not using that one?

@epwalsh

Contributor Author

epwalsh commented Nov 11, 2018

Hey @MaxDel! That's actually what I was doing initially as well. I ended up implementing it this way so that I could pass the tensors (gold targets and predicted targets) directly to the metric and avoid the extra step of converting the tensors into lists of strings / ints.
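
As a rough illustration of the difference (the tensor-based call below is only a sketch of how this PR's metric might be invoked; the second call uses nltk's real corpus_bleu API):

import torch
from nltk.translate.bleu_score import corpus_bleu

predicted_ids = torch.tensor([[1, 2, 3, 2, 0, 0], [2, 3, 1, 1, 3, 0]])  # model output
gold_ids = torch.tensor([[1, 2, 2, 3, 0, 0], [2, 3, 3, 1, 3, 0]])       # gold targets

# Tensor-based metric: update directly from the tensors (hypothetical usage).
# bleu = BLEU(exclude_indices={0})
# bleu(predicted_ids, gold_ids)

# nltk-based alternative: strip padding and convert each row to a token list first.
hypotheses = [[t for t in row.tolist() if t != 0] for row in predicted_ids]
references = [[[t for t in row.tolist() if t != 0]] for row in gold_ids]
score = corpus_bleu(references, hypotheses, weights=(0.5, 0.5))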

Member

brendan-ai2 left a comment

Thanks for this PR, @epwalsh! I made a few minor comments. Overall looks good.

@MaxDel's point is a good one, however. Could you please add a note about why you didn't take that approach in bleu.py? If you have numbers showing it was less efficient in time or space, that'd be ideal. Also, I'm not sure the sentence brevity penalty would be applied correctly across batches if we dispatched to sentence_bleu or corpus_bleu. Correct me if I'm wrong!
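
To make that concern concrete, here is a toy illustration (numbers invented): the brevity penalty depends on the total candidate and reference lengths, so averaging per-batch scores can yield a different penalty than accumulating lengths over the whole corpus.

import math


def brevity_penalty(prediction_length: int, reference_length: int) -> float:
    # Standard BLEU brevity penalty: 1 when predictions are longer overall,
    # otherwise an exponential penalty based on the length ratio.
    if prediction_length > reference_length:
        return 1.0
    return math.exp(1 - reference_length / prediction_length)


# (total prediction length, total reference length) for two hypothetical batches
batches = [(8, 12), (14, 10)]

averaged_per_batch = sum(brevity_penalty(p, r) for p, r in batches) / len(batches)
corpus_level = brevity_penalty(sum(p for p, _ in batches), sum(r for _, r in batches))
# averaged_per_batch is roughly 0.80, while corpus_level is 1.0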

allennlp/training/metrics/bleu.py: 4 inline review comments (outdated, resolved)

@epwalsh

Contributor Author

epwalsh commented Nov 13, 2018

@brendan-ai2 thanks for the review! I'll try and address your comments soon. And yes, I think you're right that the sentence brevity penalty would be wrong if we applied the NLTK implementation by batch.

@epwalsh

Contributor Author

epwalsh commented Nov 14, 2018

I think I addressed all your comments, @brendan-ai2, but let me know if you see anything else! Also, do you think it would be reasonable to make this the default metric for simple_seq2seq? That way we could get rid of this "TODO" comment: https://github.com/allenai/allennlp/blob/master/allennlp/models/encoder_decoders/simple_seq2seq.py#L213

@brendan-ai2

Member

brendan-ai2 commented Nov 14, 2018

Thanks for all the updates! Looks good overall, but mypy needs some help: http://build.allennlp.org/viewLog.html?buildId=7541&buildTypeId=AllenNLP_AllenNLPPullRequests&tab=buildLog&_focus=3050

LGTM after that!

I'd be supportive of adding BLEU as a metric for simple_seq2seq too. We'd probably still want to calculate the cross entropy during validation, however, both for comparison against training and to use as a stopping condition. Preferably make it configurable so that a user can disable it if they don't want to pay the overhead.

@epwalsh

Contributor Author

epwalsh commented Nov 14, 2018

Oops, just fixed! And I agree with you on simple_seq2seq. I could submit a PR for that after this gets merged. Thanks!

@brendan-ai2 brendan-ai2 merged commit 19c784f into allenai:master Nov 15, 2018
3 checks passed
Pull Requests (AllenNLP Library): TeamCity build finished
codecov/patch: 96% of diff hit (target 90%)
codecov/project: 92% (+<1%) compared to 6ecd193
@brendan-ai2

Member

brendan-ai2 commented Nov 15, 2018

And merged. Thanks again for the contribution! If you do modify simple_seq2seq, feel free to assign me as reviewer.

@epwalsh epwalsh deleted the epwalsh:bleu branch Jan 2, 2019