[ML] Track token positions and use source string to tag NER entities #81275

Merged 4 commits on Dec 7, 2021

@davidkyle (Member) commented Dec 2, 2021

By recording the position of each token in the original source string, the entity labels can be correctly constructed from that string rather than from the normalised tokens.

  • Case is preserved
  • Accents and other characters stripped during normalisation are preserved
  • Fixes bugs with extra spaces or punctuation
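
As a rough illustration of the idea (not the Elasticsearch implementation), the sketch below records character offsets while tokenizing. The tokenizer, the normalisation steps, and the names `tokenize_with_offsets` and `entity_text` are all assumptions made for this example. The tokens are lower-cased and stripped of accents, but the entity text is sliced from the source string, so case, accents and spacing survive.

```python
import re
import unicodedata


def tokenize_with_offsets(text):
    """Return (normalised_token, start, end) triples.

    The offsets index into the original, un-normalised text.
    Hypothetical helper, not an Elasticsearch API.
    """
    tokens = []
    # Words, or single punctuation characters; whitespace is skipped.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        raw = match.group()
        # Normalise: lower-case, then drop combining accent marks.
        decomposed = unicodedata.normalize("NFD", raw.lower())
        norm = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
        tokens.append((norm, match.start(), match.end()))
    return tokens


def entity_text(text, tokens, first, last):
    """Slice the source string from the first to the last token of an entity."""
    return text[tokens[first][1]:tokens[last][2]]


source = "I met Jos\u00e9  Garc\u00eda."
tokens = tokenize_with_offsets(source)
# The model would see "jose" and "garcia"; the label is taken from the source.
name = entity_text(source, tokens, 2, 3)
```

Here `name` keeps the capital letters, the accents and the double space from the input, none of which could be recovered by joining the normalised tokens.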

For the input "My name is Benjamin Trent, I work at Acme Inc..", "Acme Inc." is now correctly grouped as a single entity:

{
  "predicted_value" : "My name is [Benjamin Trent](PER&Benjamin+Trent), I work at [Acme Inc.](ORG&Acme+Inc.).",
  "entities" : [
    {
      "entity" : "Benjamin Trent",
      "class_name" : "PER",
      "class_probability" : 0.9994054708878661,
      "start_pos" : 11,
      "end_pos" : 25
    },
    {
      "entity" : "Acme Inc.",
      "class_name" : "ORG",
      "class_probability" : 0.818787531972483,
      "start_pos" : 37,
      "end_pos" : 46
    }
  ]
}
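
To show how `start_pos` and `end_pos` relate to the annotated `predicted_value`, here is a minimal sketch that rebuilds the annotated string from the source text and the entity list. The function name `annotate` is hypothetical, and the use of `quote_plus` to produce the `Benjamin+Trent` form is an assumption based only on the output shown above, not a statement about how Elasticsearch encodes it.

```python
from urllib.parse import quote_plus


def annotate(text, entities):
    """Wrap each entity span as [surface](CLASS&encoded-surface).

    Hypothetical helper; assumes spans do not overlap and that
    end_pos is exclusive, as in the response above.
    """
    parts = []
    cursor = 0
    for entity in sorted(entities, key=lambda e: e["start_pos"]):
        start, end = entity["start_pos"], entity["end_pos"]
        surface = text[start:end]  # taken verbatim from the source string
        parts.append(text[cursor:start])
        parts.append(f"[{surface}]({entity['class_name']}&{quote_plus(surface)})")
        cursor = end
    parts.append(text[cursor:])  # trailing text after the last entity
    return "".join(parts)


source = "My name is Benjamin Trent, I work at Acme Inc.."
entities = [
    {"class_name": "PER", "start_pos": 11, "end_pos": 25},
    {"class_name": "ORG", "start_pos": 37, "end_pos": 46},
]
annotated = annotate(source, entities)
```

With the positions from the response, `annotated` matches the `predicted_value` shown above.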

Monsieur Gillenormand is correctly grouped: "M. Gillenormand, who had risen betimes like all old men in good health, had heard his entrance"

{
  "predicted_value" : "[M. Gillenormand](PER&M.+Gillenormand), who had risen betimes like all old men in good health, had heard his entrance",
  "entities" : [
    {
      "entity" : "M. Gillenormand",
      "class_name" : "PER",
      "class_probability" : 0.9861165998812788,
      "start_pos" : 0,
      "end_pos" : 15
    }
  ]
}
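
Grouping "M", "." and "Gillenormand" into one entity can be pictured as merging consecutive tagged tokens and keeping only the first start and last end offset. The sketch below assumes IOB-style tags written as `B-PER` / `I-PER` / `O` and a hypothetical `group_entities` helper. The tag spelling and the merge rule are assumptions for illustration, not the logic of this PR.

```python
def group_entities(text, tagged_tokens):
    """Merge consecutive tagged tokens into entity spans.

    tagged_tokens: list of (start, end, tag) with offsets into `text`
    and tags such as "B-PER", "I-PER" or "O" (assumed format).
    """
    entities = []
    current = None
    for start, end, tag in tagged_tokens:
        prefix, _, cls = tag.partition("-")
        if prefix == "I" and current is not None and current["class_name"] == cls:
            current["end_pos"] = end  # extend the open entity
            continue
        if current is not None:
            entities.append(current)  # close the open entity
            current = None
        if prefix in ("B", "I"):
            current = {"class_name": cls, "start_pos": start, "end_pos": end}
    if current is not None:
        entities.append(current)
    for entity in entities:
        # The label is the source substring, so "M. " keeps its period and space.
        entity["entity"] = text[entity["start_pos"]:entity["end_pos"]]
    return entities


source = "M. Gillenormand, who had risen"
tagged = [(0, 1, "B-PER"), (1, 2, "I-PER"), (3, 15, "I-PER"), (15, 16, "O"), (17, 20, "O")]
grouped = group_entities(source, tagged)
```

Because only offsets are merged, the space between "M." and "Gillenormand" comes from the source text and does not have to be guessed when joining tokens.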

@davidkyle marked this pull request as ready for review December 6, 2021 12:13
@elasticmachine added the Team:ML (Meta label for the ML team) label Dec 6, 2021
@elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@davidkyle (Member, Author)

run elasticsearch-ci/part-2

@davidkyle merged commit d7117f2 into elastic:master Dec 7, 2021
@davidkyle deleted the ner-tokens branch December 7, 2021 11:54
Labels: >enhancement, :ml (Machine learning), Team:ML (Meta label for the ML team), v8.1.0

3 participants