Compute character index mapping for before preprocess.normalize_whitespace #121

Open
betatim opened this issue Jul 27, 2017 · 2 comments

@betatim

betatim commented Jul 27, 2017

Currently I have the following process:

  1. user provides some text possibly containing double white space, newlines, etc
  2. apply preprocess.normalize_whitespace
  3. use the NER from spacy
  4. highlight found entities in the unnormalized text

However, doing (4) is kind of hard because the character offsets (doc = nlp(text); doc.ents[0].start_char) refer to the normalized text, not the text the user typed. Any bright ideas for mapping the offsets back to the original string? It would be nice not to have to reformat the text the user typed in ("Hey user, we reformatted your text for you, you better like it!").
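One way to make step (4) tractable is to record, while normalizing, which original index each normalized character came from. The sketch below is an illustration only: `normalize_with_offsets` and `to_original_span` are hypothetical helpers, not part of textacy, and the normalization rule (collapse every whitespace run to a single space and strip the ends) is an assumption that may differ from what preprocess.normalize_whitespace actually does.

```python
def normalize_with_offsets(text):
    """Collapse whitespace runs to one space and strip the ends.

    Returns the normalized string plus a list `offsets` where offsets[i] is
    the index in `text` of the character that produced normalized[i].
    NOTE: this is a stand-in rule, assumed for illustration; it is not
    textacy's implementation.
    """
    chars = []
    offsets = []
    prev_space = True  # pretend the text starts after whitespace, so leading runs are dropped
    for i, ch in enumerate(text):
        if ch.isspace():
            if not prev_space:
                chars.append(" ")
                offsets.append(i)  # the first character of the run stands in for the whole run
            prev_space = True
        else:
            chars.append(ch)
            offsets.append(i)
            prev_space = False
    if chars and chars[-1] == " ":  # drop the single space left by a trailing run
        chars.pop()
        offsets.pop()
    return "".join(chars), offsets


def to_original_span(offsets, start, end):
    """Map a non-empty half-open span [start, end) of the normalized text
    back to a half-open span of the original text."""
    return offsets[start], offsets[end - 1] + 1


original = "Meet me on 07\n Feb 2017,  ok?"
normalized, offsets = normalize_with_offsets(original)
start = normalized.index("07 Feb 2017")
end = start + len("07 Feb 2017")
orig_start, orig_end = to_original_span(offsets, start, end)
```

With spaCy, `start` and `end` would come from an entity's `start_char` and `end_char`, and `original[orig_start:orig_end]` is then the stretch to highlight in the text the user typed, whitespace quirks included.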

@bdewilde
Collaborator

Hi @betatim, I understand the problem, although I don't know of a "good" way to solve it. The preprocessing functions are destructive and one-way, so not a lot of thought has been given to recovering the changes. Basic question: Do you need to normalize the white space before using spacy's NER? It seems like weird spacing shouldn't affect the model's performance, in which case, I'd just skip the normalization.

The only solution that comes to mind is iterating over the resulting entities and re-locating them in the original text, a process which can be made more efficient than the simplest implementation but not, like, great.
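A rough sketch of that re-locating idea, under the assumption that normalization only changed whitespace: turn each entity's text into a regular expression in which every gap between words matches any run of whitespace, then search the original text with it. `relocate` is a hypothetical helper written for this illustration, not an existing textacy or spaCy function.

```python
import re


def relocate(original, entity_text, search_from=0):
    """Find entity_text in original, letting each gap between words match
    any run of whitespace. Returns a half-open (start, end) span or None.

    Assumes the entity differs from the original only in its whitespace.
    """
    parts = entity_text.split()
    if not parts:
        return None
    pattern = r"\s+".join(re.escape(part) for part in parts)
    match = re.compile(pattern).search(original, search_from)
    return (match.start(), match.end()) if match else None


original = "Due 07\n Feb 2017, paid 07 Feb 2017."
first = relocate(original, "07 Feb 2017")
# Continue from the end of the previous hit so a repeated mention
# resolves to the next occurrence rather than the first one again.
second = relocate(original, "07 Feb 2017", search_from=first[1])
```

Walking the entities in document order and passing the previous match's end as `search_from` keeps the work close to a single pass over the text, though it still relies on the entities appearing in the original in the same order as in the normalized text.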

This reminds me of annotating, say, keyterms visually in a PDF document while using the extracted/processed text in the analysis. It's definitely a thing I've seen done. (Unfortunately, my google-fu failed me — I couldn't find a concrete example.) Might be worth trying to track down...

@betatim
Author

betatim commented Jul 27, 2017

> Do you need to normalize the white space before using spacy's NER? It seems like weird spacing shouldn't affect the model's performance, in which case, I'd just skip the normalization.

It seems to help with things like "07\n Feb 2017" being found as a single DATE rather than as a CARDINAL and a DATE.

Was hoping you had found a nice way to map the offsets back. Will think about whether we can solve it by tweaking the UI a bit.

Will see if I can find something on the PDF annotation approach.
