[#4525] fix gold.align #4526

tamuhey · 2019-10-26T18:41:39Z

Fixes the bug described in "gold.align seems to be broken" (#4525).
Makes the gold.align code simpler and more efficient.

Types of change

bug fix and refactor.

Checklist

I have submitted the spaCy Contributor Agreement.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

tamuhey · 2019-10-26T18:46:22Z

Below is a notebook on profiling the old and new gold.align functions.
https://gist.github.com/tamuhey/1f16a107030659ff75781889fbc37fd1
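For readers who want to reproduce a comparison like this locally, here is a minimal timing sketch using only the standard library. It is not taken from the notebook: `time_align` and `toy_align` are hypothetical names used for illustration, and `toy_align` is a trivial stand-in rather than either version of gold.align. Assuming a spaCy 2.x install, where `align` is importable from `spacy.gold`, the real function could be passed in place of the stand-in.

```python
import timeit


def time_align(align_fn, tokens_a, tokens_b, number=1000):
    """Return the average seconds per call of align_fn(tokens_a, tokens_b)."""
    total = timeit.timeit(lambda: align_fn(tokens_a, tokens_b), number=number)
    return total / number


def toy_align(tokens_a, tokens_b):
    # Hypothetical stand-in so the sketch runs without spaCy installed:
    # maps position i to i when both sequences have a token there, else -1.
    return [i if i < len(tokens_b) else -1 for i in range(len(tokens_a))]


tokens_a = ["New", "York", "is", "big"] * 50
tokens_b = ["New", "York", "is", "big"] * 50
seconds_per_call = time_align(toy_align, tokens_a, tokens_b, number=200)
print(f"{seconds_per_call * 1e6:.1f} microseconds per call")
```

Timing the old and new functions on the same token lists, with the same `number`, keeps the comparison fair.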


honnibal · 2019-10-27T12:37:31Z

Really appreciate your help, @tamuhey! Thanks so much. I've had a fair bit of trouble with the alignment code over time.

honnibal · 2019-10-27T12:42:28Z

I merged a bit hastily here, and missed something. Will make a commit to master to address it.

PR #4526 missed extra lower-casing and spacing normalization.


adrianeboyd · 2019-10-27T19:44:50Z

This has fixed some of the subtok problems I was having (this is great!) but now I have cases where texts aren't being aligned like they were before. At least one problem is when the raw text starts with a whitespace character. A minimal example:

[
  {
    "id":0,
    "paragraphs":[
      {
        "raw":" a",
        "sentences":[
          {
            "tokens":[
              {
                "head":0,
                "dep":"ROOT",
                "tag":"A",
                "orth":"a",
                "ner":"U-DATE",
                "id":0
              }
            ],
            "brackets":[]
          }
        ]
      }
    ]
  }
]
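To check whether a training file contains other paragraphs of this kind, a small scan along the following lines may help. This is a sketch that assumes the same JSON layout as the example above (a list of documents, each with `paragraphs` that carry a `raw` string); `paragraphs_with_leading_whitespace` is a hypothetical helper, not part of spaCy.

```python
import json


def paragraphs_with_leading_whitespace(docs):
    """Yield (doc_id, raw) for paragraphs whose raw text starts with whitespace."""
    for doc in docs:
        for para in doc.get("paragraphs", []):
            raw = para.get("raw")
            if raw and raw[0].isspace():
                yield doc.get("id"), raw


# The minimal example from this comment, compacted onto a few lines.
example = json.loads("""
[{"id": 0, "paragraphs": [{"raw": " a", "sentences": [{"tokens": [
    {"head": 0, "dep": "ROOT", "tag": "A", "orth": "a", "ner": "U-DATE", "id": 0}
], "brackets": []}]}]}]
""")

hits = list(paragraphs_with_leading_whitespace(example))
print(hits)  # [(0, ' a')]
```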

adrianeboyd · 2019-10-27T22:08:57Z

Ah, wait, I think I managed to test this at just the wrong point and the failing overall alignment was what Matt fixed above.

Now it doesn't throw an error but I get incorrect results with the GoldParse for the same example:

# loaded through GoldCorpus:
gold.words  # ['a', 'a']
gold.ner    # ['B-DATE', 'U-DATE']
gold.labels # ['ROOT', 'ROOT']

The difference is i2j_multi:

from spacy.gold import align

cost, i2j, j2i, i2j_multi, j2i_multi = align([" ", "a"], ["a"])
i2j_multi   # {0: 0}
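To make the concern concrete: the entry `{0: 0}` says the whitespace token `" "` is treated as a piece of the token `"a"`, which would explain the duplicated word and labels above. The following pure-Python sketch shows one way an aligner could leave whitespace-only tokens unaligned instead. It is an illustration under stated assumptions, not spaCy's implementation and not the fix in any PR: `normalized_spans` and `align_sketch` are hypothetical names, and the approach (comparing character spans over whitespace-stripped, lowercased text) is chosen for clarity rather than speed.

```python
def normalized_spans(tokens):
    """Return (text, spans): the tokens joined with whitespace removed and
    lowercased, plus each token's (start, end) character offsets in that text."""
    spans, pieces, pos = [], [], 0
    for token in tokens:
        norm = "".join(token.split()).lower()
        spans.append((pos, pos + len(norm)))
        pieces.append(norm)
        pos += len(norm)
    return "".join(pieces), spans


def align_sketch(tokens_a, tokens_b):
    """Return (a2b, b2a, a2b_multi, b2a_multi) for two tokenizations."""
    text_a, spans_a = normalized_spans(tokens_a)
    text_b, spans_b = normalized_spans(tokens_b)
    if text_a != text_b:
        raise ValueError("token sequences do not cover the same text")
    a2b = [-1] * len(tokens_a)
    b2a = [-1] * len(tokens_b)
    a2b_multi, b2a_multi = {}, {}
    for i, (a_start, a_end) in enumerate(spans_a):
        if a_start == a_end:
            continue  # whitespace-only token: leave it unaligned
        for j, (b_start, b_end) in enumerate(spans_b):
            if b_start == b_end:
                continue  # whitespace-only token on the other side
            if (a_start, a_end) == (b_start, b_end):
                a2b[i], b2a[j] = j, i      # one-to-one match
            elif b_start <= a_start and a_end <= b_end:
                a2b_multi[i] = j           # token i is part of token j
            elif a_start <= b_start and b_end <= a_end:
                b2a_multi[j] = i           # token j is part of token i
    return a2b, b2a, a2b_multi, b2a_multi


leading_space = align_sketch([" ", "a"], ["a"])
print(leading_space)  # ([-1, 0], [1], {}, {})
split_token = align_sketch(["a", "b"], ["ab"])
print(split_token)    # ([-1, -1], [-1], {0: 0, 1: 0}, {})
```

In this sketch the space token gets no entry in either mapping, while real many-to-one cases such as `["a", "b"]` against `["ab"]` still appear in the multi mapping.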

tamuhey · 2019-10-28T02:28:07Z

@adrianeboyd Thanks for catching this! I'm working on a fix in #4537.


tamuhey added 3 commits October 27, 2019 00:44

fix: gold.align

d84cf70

fix align

9d95a42

remove old align

c92ebd3

tamuhey mentioned this pull request Oct 27, 2019

gold.align seems to be broken #4525

Closed

tamuhey changed the title from "fix gold.align" to "[#4525] fix gold.align" Oct 27, 2019

ines reviewed Oct 27, 2019

spacy/gold.pyx (resolved)

ines added the "bug" label (Bugs and behaviour differing from documentation) Oct 27, 2019

honnibal merged commit 5548502 into explosion:master Oct 27, 2019

honnibal added a commit that referenced this pull request Oct 27, 2019

Restore missing normalization from gold align

bddfbc7

PR #4526 missed extra lower-casing and spacing normalization.

tamuhey commented Oct 27, 2019

spacy/gold.pyx (resolved)

tamuhey mentioned this pull request Oct 28, 2019

modified gold.align to handle space tokens #4537

Merged

adrianeboyd mentioned this pull request Oct 30, 2019

Issues with gold alignment #4554

Closed

tamuhey mentioned this pull request Jan 5, 2020

Update gold.align with pytokenizations #4878

Closed

3 tasks

adrianeboyd mentioned this pull request Jan 23, 2020

Possible bug in spacy.gold.align #4936

Closed

adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request Apr 21, 2020

Switch to new gold.align method

3b29243

* Switch from original `_align` to new simpler alignment algorithm from explosion#4526
* Remove alignment normalizations beyond whitespace and lowercasing

adrianeboyd mentioned this pull request Apr 21, 2020

Switch to new gold.align method #5334

Merged

3 tasks

honnibal pushed a commit that referenced this pull request Apr 21, 2020

Switch to new gold.align method (#5334)

521f361

* Switch from original `_align` to new simpler alignment algorithm from #4526
* Remove alignment normalizations beyond whitespace and lowercasing

tamuhey deleted the patch/gold-align branch May 12, 2020 16:15
