long text #53

romanchyla · 2017-10-30T22:46:17Z

sizes over 32766 chars cause solr to reject the document

incidentally, I'm seeing that the extraction is full of \u00a0 (no-break space) in place of regular spaces; perhaps that could be handled as well

example: 2009arXiv0906.5028K

this\\u00a0research\\u00a0project\\u00a0was\\u00a0conducted\\u00a0at\\u00a0St.\\u00a0Louis\\u00a0University\\u00a0School\\u00a0of\\u00a0Medicine.\\u00a0\\nTK\\u00a0received\\u00a0salary\\u00a0support\\u00a0from\\u00a0Sankyo\\u00a0Co,\\u00a0Ltd
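A minimal sketch of how the no-break spaces could be normalized, assuming the text may contain either the real U+00A0 character or the escape sequence stored as literal text (as the example above suggests). The function name `normalize_spaces` is hypothetical and is not part of ADSfulltext.

```python
import re

# Matches the escape sequence stored as literal text: one or more
# backslashes followed by "u00a0" (assumption based on the example above).
_ESCAPED_NBSP = re.compile(r"\\+u00a0")


def normalize_spaces(text):
    """Replace no-break spaces with plain spaces (hypothetical helper)."""
    # Handle the actual U+00A0 character first.
    text = text.replace("\u00a0", " ")
    # Then handle the literal, escaped form.
    return _ESCAPED_NBSP.sub(" ", text)


if __name__ == "__main__":
    print(normalize_spaces("this\u00a0research\u00a0project"))
    print(normalize_spaces("St.\\u00a0Louis\\u00a0University"))
```

Whether the escaped form should be handled at all depends on where the escaping happens in the pipeline; if it only appears in the serialized output, replacing the real character is probably enough.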


spacemansteve · 2017-12-20T19:57:45Z

We still see long words in the body field. The code meant to prevent this is at https://github.com/adsabs/ADSfulltext/blob/master/adsft/utils.py#L246. The computed body field for bibcode 2018Tectp.722...69L has a very long token that begins with
#[U] ppm[Th] ppm[Pb] ppmTh/U meas207Pb/206Pbs %207Pb/235Us %206Pb/238Us %r207Pb/206Pbs207Pb/235Us206Pb/238Usf206%54,550@114642786510.1900.128
Long tokens are identified with the regular expression r'\b\w{'+str(maxlength)+r',}'. It doesn't work well because the tables that appear as a single token are full of special characters like /, %, @, etc., and \w does not match those. A very long string that repeated %123%123%123 would not be changed. Instead, I think \b\w should be replaced with \S, which matches all non-whitespace characters, including the special characters these tokens are full of.
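To illustrate the difference between the two patterns, here is a small sketch. It assumes long tokens are simply removed, which may not be what the code in utils.py does; `strip_long_tokens` and the small `maxlength` value are hypothetical and chosen only for the demonstration.

```python
import re


def long_token_patterns(maxlength):
    """Build the current (\\w-based) and proposed (\\S-based) patterns."""
    word_only = re.compile(r'\b\w{' + str(maxlength) + r',}')
    non_space = re.compile(r'\S{' + str(maxlength) + r',}')
    return word_only, non_space


def strip_long_tokens(text, pattern, replacement=""):
    """Replace every match of pattern (hypothetical helper)."""
    return pattern.sub(replacement, text)


if __name__ == "__main__":
    maxlength = 20  # small value for the demo only
    word_only, non_space = long_token_patterns(maxlength)

    # 40 non-whitespace characters, but no run of \w longer than 3.
    table_token = "%123" * 10
    text = "before " + table_token + " after"

    # The \w-based pattern finds nothing, so the text is unchanged.
    print(strip_long_tokens(text, word_only) == text)
    # The \S-based pattern matches the whole token.
    print(repr(strip_long_tokens(text, non_space)))
```

The \w-based pattern only sees runs of letters, digits, and underscores, so every special character resets the count; the \S-based pattern counts everything up to the next whitespace.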

spacemansteve · 2017-12-21T18:36:23Z

reopen until deployed

spacemansteve · 2018-01-09T19:34:07Z

This should not have been reopened. The recent "token too large" errors were fixed by re-extracting the fulltext. The code works.

marblestation self-assigned this Nov 1, 2017

romanchyla closed this as completed in 8104170 Nov 8, 2017

spacemansteve reopened this Dec 20, 2017

spacemansteve pushed a commit to spacemansteve/ADSfulltext that referenced this issue Dec 20, 2017

adsabs#53 improved support for long text

67cf1f2

romanchyla pushed a commit that referenced this issue Dec 20, 2017

#53 improved support for long text (#62)

7571be1

spacemansteve closed this as completed Dec 21, 2017

spacemansteve reopened this Dec 21, 2017

spacemansteve assigned spacemansteve and unassigned marblestation Dec 21, 2017

spacemansteve closed this as completed Jan 2, 2018

spacemansteve reopened this Jan 9, 2018

spacemansteve closed this as completed Jan 9, 2018
