Skip to content
This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Add IOB1 as allowed CRF constraint #1615

Merged
merged 10 commits into from
Aug 16, 2018

Conversation

nelson-liu
Copy link
Contributor

This adds IOB1 coding as an allowed constraint for the CRF tagger.

Here's a table of allowed transitions for reference:

@@ -84,6 +85,31 @@ def allowed_transitions(constraint_type: str, tokens: Dict[int, str]) -> List[Tu
if from_bio in ('O', 'B', 'I'):
allowed.append((i, end_tag))

elif constraint_type == "IOB1":
for i, (from_bio, *from_entity) in tokens.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like this is getting big enough these should all be refactored out into separate methods like allowed_iob1_transitions().

@@ -84,6 +85,31 @@ def allowed_transitions(constraint_type: str, tokens: Dict[int, str]) -> List[Tu
if from_bio in ('O', 'B', 'I'):
allowed.append((i, end_tag))

elif constraint_type == "IOB1":
for i, (from_bio, *from_entity) in tokens.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This syntax is confusing enough given the type annotation on tokens that it could probably use a comment (or just be simplified to putting the (from_bio, *from_entity) piece on its own line inside the for loop). It took me a minute to realize that the type annotation was actually correct, and this was separating out the first character from the rest of the characters. It's a slick piece of logic, but perhaps a little too slick.

@@ -84,6 +85,31 @@ def allowed_transitions(constraint_type: str, tokens: Dict[int, str]) -> List[Tu
if from_bio in ('O', 'B', 'I'):
allowed.append((i, end_tag))

elif constraint_type == "IOB1":
for i, (from_bio, *from_entity) in tokens.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it also took me quite a while to realize that tokens here was actually tags. This is your whole tag vocabulary. I was trying to figure out how i and j corresponded to positions in a sequence of tokens... I realize that you inherited this code, but can you fix the names here?

if is_allowed:
allowed.append((i, j))

# start transitions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These blocks are repeated in all three coding schemes. Maybe the right thing to do here is to define an is_transition_allowed(constraint_type, to_bio, to_entity, from_bio, from_entity), and then you can just remove the whole top-level if/else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, actually, not quite. I didn't realize that the blocks are slightly different. Ok, here's another proposal:

num_tags = len(tokens)
tags = [(index, tag) for index, tag in tokens.items()] + [(num_tags, "START"), (num_tags + 1, "END")]
for i, from_tag in tags:
    for j, to_tag in tags:
        if is_transition_allowed(constraint_type, from_tag, to_tag):
            allowed.append(i, j)
return allowed

And then you take all of the is_allowed = any([...]) logic and move it to that other method, and add in the start and end logic there. That way it's all in one place, and the higher-level logic is consistent for all constraint types. This is just getting very complicated as it is with every additional constraint type.

@nelson-liu
Copy link
Contributor Author

ok @matt-gardner , this is ready to be looked at again. let me know if i misconstrued your vision...

Copy link
Contributor

@matt-gardner matt-gardner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks better to me, where we're not repeating a bunch of logic across the different constraint types.

label is ``I-PER``, the ``from_tag`` is ``I``.
from_entity: ``str``, required
The entity corresponding to the ``from_tag``. For example, if the
label is ``I-PER``, the ``from_entity`` is ``PER``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, because you're using from_label[1:], you'll actually get -PER, not PER. But, that probably doesn't matter...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, i opted to do this since the original behavior with (tag, *entity) meant entity = ['-', 'P', 'E', 'R']

if to_tag == "START" or from_tag == "END":
# Cannot transition into START or from END
return False
if from_tag == "START":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you've grouped "START" and "END" together here, and I now have three places to look in the code to see all of the "BIOUL" constraints. Seems more coherent to instead have a if from_tag == "START" block inside of each if constraint_type == ... block below.

@nelson-liu nelson-liu merged commit 82686c1 into allenai:master Aug 16, 2018
@nelson-liu nelson-liu deleted the iob1_crf_constraint branch August 16, 2018 20:05
gabrielStanovsky pushed a commit to gabrielStanovsky/allennlp that referenced this pull request Sep 7, 2018
* Add IOB1 as allowed CRF constraint

* Fix spacing

* Add is_transition_allowed function

* Replace token with label

* Replace n_tags with num_tags

* Consolidate everything except START and END tags

* Handle boundary tags in is_transition_allowed

* Fix lint, clean up conditional branching
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants