long text #53
Comments
We still see long words in the body field; the code meant to prevent this is at https://github.com/adsabs/ADSfulltext/blob/master/adsft/utils.py#L246. The computed body field for bibcode 2018Tectp.722...69L has a very long token that begins with |
reopen until deployed |
Should not have been re-opened. Recent "token too large" errors were fixed by re-extracting the fulltext. The code works. |
Tokens over 32766 bytes cause Solr to reject the document (the limit is Lucene's maximum term length, counted in UTF-8 bytes rather than characters).
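For illustration, a minimal sketch of filtering oversized tokens before indexing. It assumes Python 3 and assumes the limit is counted in UTF-8 bytes (Lucene's maximum term length). The name `drop_oversized_tokens` is hypothetical; this is not the code at the `utils.py` link above, only one way such a check could look.

```python
MAX_TERM_BYTES = 32766  # Lucene's maximum indexed term length, in UTF-8 bytes


def drop_oversized_tokens(text, max_bytes=MAX_TERM_BYTES):
    """Remove whitespace-delimited tokens whose UTF-8 encoding exceeds max_bytes.

    Hypothetical helper. Note that str.split() with no arguments also
    collapses runs of whitespace (including newlines) into single spaces.
    """
    kept = [tok for tok in text.split()
            if len(tok.encode("utf-8")) <= max_bytes]
    return " ".join(kept)
```

Measuring the encoded length matters for non-ASCII text: a token of 20000 two-byte characters is under the limit by character count but over it by byte count.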
Incidentally, I'm seeing that the extraction is full of \u00a0 (non-breaking space) in place of spaces; perhaps those could be normalized to ordinary spaces.
example: 2009arXiv0906.5028K
this\\u00a0research\\u00a0project\\u00a0was\\u00a0conducted\\u00a0at\\u00a0St.\\u00a0Louis\\u00a0University\\u00a0School\\u00a0of\\u00a0Medicine.\\u00a0\\nTK\\u00a0received\\u00a0salary\\u00a0support\\u00a0from\\u00a0Sankyo\\u00a0Co,\\u00a0Ltd
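One possible way to treat these, sketched under assumptions: the function name `normalize_nbsp` is hypothetical, Python 3 is assumed, and it is not clear from the example whether the extracted text holds real U+00A0 characters or the literal escape sequence, so the sketch handles both.

```python
import re


def normalize_nbsp(text):
    """Replace non-breaking spaces with ordinary spaces (hypothetical helper)."""
    # Literal escape sequences left in the text: one or more backslashes
    # followed by "u00a0", as in the example above.
    text = re.sub(r"\\+u00a0", " ", text)
    # Actual U+00A0 characters.
    return text.replace("\u00a0", " ")
```

Other escaped sequences, such as the `\\n` in the example, are left untouched by this sketch.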