
File "/scratch/sjn/anaconda/lib/python3.6/site-packages/ccg_nlpy/core/text_annotation.py", line 78, in _extract_char_offset assert sentence[characterId] == tokens[tokenId][tokenLength], sentence[characterId] + " expected, found " + tokens[tokenId][tokenLength] + " instead in sentence: " + sentence; AssertionError: � expected, found s instead in sentence: #90

Closed
monajalal opened this issue May 16, 2018 · 4 comments

Comments

@monajalal

Hello,
How can I fix the following?
Thanks for the help.

@realDonaldTrump @FoxNews @seanhannity @CNN @andersoncooper HE IS TRUMP!!!!!!!!!!!!!!!!! https://t.co/E0JGWvSKFB
NER_CONLL view: this view does not have constituents in your input text. 
Hillary Clinton�s Candidacy Reveals Generational Schism Among Women https://t.co/6u3lmN7nIL
Traceback (most recent call last):
  File "ccg_test_remote.py", line 14, in <module>
    doc = pipeline.doc(df.iloc[i]['Tweet'])
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/ccg_nlpy/pipeline_base.py", line 38, in doc
    return TextAnnotation(response, self)
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/ccg_nlpy/core/text_annotation.py", line 34, in __init__
    self.char_offsets = self._extract_char_offset(self.text, self.tokens)
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/ccg_nlpy/core/text_annotation.py", line 78, in _extract_char_offset
    assert sentence[characterId] == tokens[tokenId][tokenLength], sentence[characterId] + " expected, found " + tokens[tokenId][tokenLength] + " instead in sentence: " + sentence;
AssertionError: � expected, found s instead in sentence: Hillary Clinton�s Candidacy Reveals Generational Schism Among Women https://t.co/6u3lmN7nIL

@danyaljj
Member

I think you have to clean up stray characters like � before passing the text to the pipeline.

@monajalal
Author

When I worked with spaCy NER, it handled this issue internally. Anyway, it is fixed after removing those characters.
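
For readers hitting the same assertion, here is a minimal sketch of that cleanup step. It assumes the offending character is the Unicode replacement character U+FFFD, as the assertion message suggests; the helper name `strip_replacement_chars` and its `substitute` parameter are hypothetical and not part of ccg_nlpy.

```python
# Hypothetical helper, not part of ccg_nlpy: drop (or substitute) the
# Unicode replacement character before sending text to the pipeline.
REPLACEMENT_CHAR = "\ufffd"  # assumption: this is the character shown as "�"


def strip_replacement_chars(text, substitute=""):
    """Return text with every U+FFFD replaced by `substitute`."""
    return text.replace(REPLACEMENT_CHAR, substitute)


tweet = "Hillary Clinton\ufffds Candidacy Reveals Generational Schism Among Women"
cleaned = strip_replacement_chars(tweet)
# `cleaned` can then be passed on, e.g. pipeline.doc(cleaned) as in the traceback.
```

Removing the character outright turns "Clinton�s" into "Clintons"; passing `substitute="'"` is an option if the lost character is known to have been an apostrophe.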

@danyaljj
Member

You're absolutely right. We have partial solutions, but will take care of it in future releases.

@flackbash

The same problem occurs for certain whitespace characters, such as the no-break space U+00A0.

This is particularly nasty because these characters are hard for the user to spot.
Maybe you could convert special whitespace characters to a standard space?
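
A user-side version of that suggestion might look like the sketch below. It is not ccg_nlpy code; `normalize_whitespace` is a hypothetical name, and the choice to leave newlines, tabs, and carriage returns untouched is an assumption.

```python
# Hypothetical pre-processing step, not part of ccg_nlpy: map every
# Unicode whitespace character (e.g. U+00A0 no-break space) to a plain
# ASCII space, leaving newline, tab and carriage return as they are.
KEEP_AS_IS = "\n\t\r"  # assumption: these should not be rewritten


def normalize_whitespace(text):
    """Replace special whitespace characters with a regular space."""
    return "".join(
        " " if ch.isspace() and ch not in KEEP_AS_IS else ch
        for ch in text
    )


sample = "no-break\u00a0space and thin\u2009space"
normalized = normalize_whitespace(sample)
```

Because each character is replaced one for one, the string length does not change, so character offsets computed on the normalized text still line up with the original. Characters that Python's `str.isspace()` does not treat as whitespace, such as the zero-width space U+200B, are not covered by this sketch.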
