Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#4525] fix gold.align #4526

Merged
merged 3 commits into from
Oct 27, 2019
Merged

[#4525] fix gold.align #4526

merged 3 commits into from
Oct 27, 2019

Conversation

tamuhey
Copy link
Contributor

@tamuhey tamuhey commented Oct 26, 2019

from #4525

Types of change

bug fix and refactor.

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@tamuhey
Copy link
Contributor Author

tamuhey commented Oct 26, 2019

Below is a notebook on profiling the old and new gold.align functions.
https://gist.github.com/tamuhey/1f16a107030659ff75781889fbc37fd1

@tamuhey tamuhey changed the title fix gold.align [#4525] fix gold.align Oct 27, 2019
spacy/gold.pyx Show resolved Hide resolved
@ines ines added the bug Bugs and behaviour differing from documentation label Oct 27, 2019
@honnibal
Copy link
Member

Really appreciate your help @tamuhey ! Thanks so much. I've had a fair bit of trouble with the alignment code over time.

@honnibal honnibal merged commit 5548502 into explosion:master Oct 27, 2019
@honnibal
Copy link
Member

I merged a bit hastily here, and missed something. Will make a commit to master to address it.

honnibal added a commit that referenced this pull request Oct 27, 2019
PR #4526 missed extra lower-casing and spacing normalization.
spacy/gold.pyx Show resolved Hide resolved
@adrianeboyd
Copy link
Contributor

This has fixed some of the subtok problems I was having (this is great!) but now I have cases where texts aren't being aligned like they were before. At least one problem is when the raw text starts with a whitespace character. A minimal example:

[
  {
    "id":0,
    "paragraphs":[
      {
        "raw":" a",
        "sentences":[
          {
            "tokens":[
              {
                "head":0,
                "dep":"ROOT",
                "tag":"A",
                "orth":"a",
                "ner":"U-DATE",
                "id":0
              }
            ],
            "brackets":[

            ]
          }
        ]
      }
    ]
  }
]

@adrianeboyd
Copy link
Contributor

Ah, wait, I think I managed to test this at just the wrong point and the failing overall alignment was what Matt fixed above.

Now it doesn't throw an error but I get incorrect results with the GoldParse for the same example:

# loaded through GoldCorpus:
gold.words  # ['a', 'a']
gold.ner    # ['B-DATE', 'U-DATE']
gold.labels # ['ROOT', 'ROOT']

The difference is i2j_multi:

cost, i2j, j2i, i2j_multi, j2i_multi = align([" ", "a"], ["a"])
i2j_multi   # {0: 0}

@tamuhey
Copy link
Contributor Author

tamuhey commented Oct 28, 2019

@adrianeboyd Thanks for catching! I'm trying to fix this bug in (#4537)

adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request Apr 21, 2020
* Switch from original `_align` to new simpler alignment algorithm from
  explosion#4526

* Remove alignment normalizations beyond whitespace and lowercasing
@adrianeboyd adrianeboyd mentioned this pull request Apr 21, 2020
3 tasks
honnibal pushed a commit that referenced this pull request Apr 21, 2020
* Switch from original `_align` to new simpler alignment algorithm from
  #4526

* Remove alignment normalizations beyond whitespace and lowercasing
@tamuhey tamuhey deleted the patch/gold-align branch May 12, 2020 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugs and behaviour differing from documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants