long text #53

Closed
romanchyla opened this issue Oct 30, 2017 · 3 comments
@romanchyla
Contributor

tokens longer than 32766 chars cause Solr to reject the document

incidentally, I'm seeing that the extraction is full of \u00a0 (non-breaking space) in place of regular spaces; perhaps that could be handled as well

example: 2009arXiv0906.5028K

this\\u00a0research\\u00a0project\\u00a0was\\u00a0conducted\\u00a0at\\u00a0St.\\u00a0Louis\\u00a0University\\u00a0School\\u00a0of\\u00a0Medicine.\\u00a0\\nTK\\u00a0received\\u00a0salary\\u00a0support\\u00a0from\\u00a0Sankyo\\u00a0Co,\\u00a0Ltd
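A minimal sketch of how the non-breaking spaces could be normalized. The helper name normalize_nbsp is hypothetical and not part of ADSfulltext. It assumes the text may contain either real U+00A0 characters or, as in the sample above, the literal escaped sequence with one or more leading backslashes.

```python
import re

# Hypothetical helper for illustration; not taken from ADSfulltext.
# Matches the literal text "\u00a0" preceded by one or more backslashes,
# as seen in the escaped sample from 2009arXiv0906.5028K.
_ESCAPED_NBSP = re.compile(r'\\+u00a0', re.IGNORECASE)


def normalize_nbsp(text):
    """Replace real and escaped non-breaking spaces with plain spaces."""
    # Real U+00A0 characters.
    text = text.replace('\u00a0', ' ')
    # Escaped sequences left behind by an earlier serialization step.
    return _ESCAPED_NBSP.sub(' ', text)


if __name__ == '__main__':
    print(normalize_nbsp('St.\u00a0Louis\u00a0University'))
    print(normalize_nbsp('this\\\\u00a0research\\\\u00a0project'))
```

Whether the escaped form should be fixed here or at the step that produces the escaping is a separate question; this only shows the replacement itself.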

@marblestation marblestation self-assigned this Nov 1, 2017
@spacemansteve
Contributor

spacemansteve commented Dec 20, 2017

We still see long words in the body field. The code meant to prevent this is at https://github.com/adsabs/ADSfulltext/blob/master/adsft/utils.py#L246. The computed body field for bibcode 2018Tectp.722...69L has a very long token that begins with
#[U] ppm[Th] ppm[Pb] ppmTh/U meas207Pb/206Pbs %207Pb/235Us %206Pb/238Us %r207Pb/206Pbs207Pb/235Us206Pb/238Usf206%54,550@114642786510.1900.128
Long tokens are identified with the regular expression r'\b\w{'+str(maxlength)+r',}. It doesn't work well because the tables that appear as a single token are full of special characters like /, %, and @, none of which \w matches. A very long string that repeated %123%123%123 would not be changed, since its longest run of word characters is only three characters long. Instead, I think \b\w should be replaced with \S, which matches all non-whitespace characters, including the special characters these tokens are full of.
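The difference between the two patterns can be sketched as follows. This is an illustration under stated assumptions, not the code in adsft/utils.py: MAX_LENGTH is set to a small value so the example stays readable, the function drop_long_tokens is hypothetical, and removing the matched token is an assumption about what the real code does with it.

```python
import re

# Small illustrative limit; the issue is about a limit of 32766.
MAX_LENGTH = 10

# Pattern described in the comment above: runs of word characters only.
word_only = re.compile(r'\b\w{' + str(MAX_LENGTH) + r',}')
# Proposed alternative: runs of any non-whitespace characters.
non_space = re.compile(r'\S{' + str(MAX_LENGTH) + r',}')


def drop_long_tokens(text, pattern):
    """Remove every match of pattern (hypothetical handling, for illustration)."""
    return pattern.sub('', text)


# 20 characters with no whitespace, but no run of word chars longer than 3.
table_token = '%123' * 5
# 20 plain word characters.
plain_token = 'a' * 20

if __name__ == '__main__':
    for name, pattern in (('word_only', word_only), ('non_space', non_space)):
        print(name, 'matches table token:', bool(pattern.search(table_token)))
        print(name, 'matches plain token:', bool(pattern.search(plain_token)))
```

Both patterns catch the plain token, but only the \S version catches the table-like token, because % splits it into short runs of word characters.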

@spacemansteve spacemansteve reopened this Dec 20, 2017
spacemansteve pushed a commit to spacemansteve/ADSfulltext that referenced this issue Dec 20, 2017
@spacemansteve
Contributor

reopen until deployed

@spacemansteve
Contributor

Should not have been re-opened. The recent "token too large" errors were fixed by re-extracting the fulltext. The code works.
