Compute character index mapping for before `preprocess.normalize_whitespace` #121

betatim · 2017-07-27T09:47:52Z

Currently I have the following process:

1. user provides some text possibly containing double white space, newlines, etc.
2. apply `preprocess.normalize_whitespace`
3. use the NER from spacy
4. highlight found entities in the unnormalized text

However, doing (4) is kind of hard, as the character coordinates (`doc = nlp(text); doc.ents[0].start_char`) match up with the normalised text. Any bright ideas on how to transform the coordinates back to the original string? It would be nice not to have to reformat the text the user typed in ("Hey user, we reformatted your text for you, you better like it!").

bdewilde · 2017-07-27T13:47:05Z

Hi @betatim, I understand the problem, although I don't know of a "good" way to solve it. The preprocessing functions are destructive and one-way, so not a lot of thought has been given to recovering the changes. Basic question: Do you need to normalize the white space before using spacy's NER? It seems like weird spacing shouldn't affect the model's performance, in which case I'd just skip the normalization.

The only solution that comes to mind is iterating over the resulting entities and re-locating them in the original text, a process which can be made more efficient than the simplest implementation but not, like, great.
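A minimal sketch of that re-locating idea, assuming the normalization changed nothing but whitespace and that the entities arrive in document order. The helper name `relocate_spans` is hypothetical, not part of textacy or spaCy. Each entity's text is turned into a regex that tolerates any run of whitespace between its words, and the search resumes from the end of the previous match so the original text is scanned only once:

```python
import re


def relocate_spans(original, span_texts):
    """Locate whitespace-normalized span texts in the original string.

    `span_texts` are strings taken from the normalized text, in document
    order. Returns one (start, end) character pair per span, or None for
    any span that could not be found. Hypothetical helper for illustration.
    """
    cursor = 0  # never search before the end of the previous match
    located = []
    for span_text in span_texts:
        words = span_text.split()
        if not words:
            located.append(None)
            continue
        # Allow any run of whitespace wherever the normalized text has a gap.
        pattern = re.compile(r"\s+".join(re.escape(w) for w in words))
        match = pattern.search(original, cursor)
        if match is None:
            located.append(None)
            continue
        located.append((match.start(), match.end()))
        cursor = match.end()
    return located
```

With spaCy, the input could be built as `[ent.text for ent in doc.ents]`, and the returned offsets index into the text the user actually typed. If the preprocessing also altered non-whitespace characters, spans will come back as `None` and need separate handling.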

This reminds me of annotating, say, keyterms visually in a PDF document while using the extracted/processed text in the analysis. It's definitely a thing I've seen done. (Unfortunately, my google-fu failed me — I couldn't find a concrete example.) Might be worth trying to track down...

betatim · 2017-07-27T15:16:24Z

> Do you need to normalize the white space before using spacy's NER? It seems like weird spacing shouldn't affect the model's performance, in which case, I'd just skip the normalization.

It seems to help with things like "07\n Feb 2017" being found as a single DATE and not as a CARDINAL and a DATE.

Was hoping you had found a nice way to transport things back. Will think about whether we can solve it by tweaking the UI a bit.

Will see if I can find something on the PDFs.
