Improve text extraction of large documents #28

Closed
aaccomazzi opened this issue Nov 28, 2016 · 2 comments

Comments

@aaccomazzi
Member

Here is a list of records with large fulltext content. Due to limitations in SOLR, we currently throw away anything beyond 32k bytes. Given that, it would be nice to be more selective when generating fulltext, so that we discard content which is not interesting (e.g. numeric tables) and keep the text that we want.

['1998astro.ph..7308B', '/proj/ads/articles/fulltext/extracted/19/98/as/tr/o,/ph/,,/73/08/B/fulltext.txt', 105938]
['2003astro.ph..4480S', '/proj/ads/articles/fulltext/extracted/20/03/as/tr/o,/ph/,,/44/80/S/fulltext.txt', 62708]
['2004ADNDT..88...83L', '/proj/ads/articles/fulltext/extracted/20/04/AD/ND/T,/,8/8,/,,/83/L/fulltext.txt', 319850]
['2004JMoSp.228..593B', '/proj/ads/articles/fulltext/extracted/20/04/JM/oS/p,/22/8,/,5/93/B/fulltext.txt', 113945]
['2005ADNDT..89....1G', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,8/9,/,,/,1/G/fulltext.txt', 204757]
['2005ADNDT..89..101Z', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,8/9,/,1/01/Z/fulltext.txt', 126118]
['2005ADNDT..89..139E', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,8/9,/,1/39/E/fulltext.txt', 238924]
['2005ADNDT..89..195L', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,8/9,/,1/95/L/fulltext.txt', 280702]
['2005ADNDT..89..267G', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,8/9,/,2/67/G/fulltext.txt', 90409]
['2005ADNDT..90..177L', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,9/0,/,1/77/L/fulltext.txt', 313130]
['2005ADNDT..90..259Z', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,9/0,/,2/59/Z/fulltext.txt', 248951]
['2005NewA...10..325Z', '/proj/ads/articles/fulltext/extracted/20/05/Ne/wA/,,/,1/0,/,3/25/Z/fulltext.txt', 127388]
['2006ADNDT..92..105B', '/proj/ads/articles/fulltext/extracted/20/06/AD/ND/T,/,9/2,/,1/05/B/fulltext.txt', 269501]
['2006ADNDT..92..305L', '/proj/ads/articles/fulltext/extracted/20/06/AD/ND/T,/,9/2,/,3/05/L/fulltext.txt', 272532]
['2006ADNDT..92..481Z', '/proj/ads/articles/fulltext/extracted/20/06/AD/ND/T,/,9/2,/,4/81/Z/fulltext.txt', 550232]
['2006JMoSt.780..182L', '/proj/ads/articles/fulltext/extracted/20/06/JM/oS/t,/78/0,/,1/82/L/fulltext.txt', 68805]
['2006math......9485D', '/proj/ads/articles/fulltext/extracted/20/06/ma/th/,,/,,/,,/94/85/D/fulltext.txt', 127583]
['2007ADNDT..93....1L', '/proj/ads/articles/fulltext/extracted/20/07/AD/ND/T,/,9/3,/,,/,1/L/fulltext.txt', 207798]
['2007ADNDT..93..275B', '/proj/ads/articles/fulltext/extracted/20/07/AD/ND/T,/,9/3,/,2/75/B/fulltext.txt', 303136]
['2007ADNDT..93..615A', '/proj/ads/articles/fulltext/extracted/20/07/AD/ND/T,/,9/3,/,6/15/A/fulltext.txt', 468786]
['2007ADNDT..93..742B', '/proj/ads/articles/fulltext/extracted/20/07/AD/ND/T,/,9/3,/,7/42/B/fulltext.txt', 143947]
['2007ADNDT..93..864H', '/proj/ads/articles/fulltext/extracted/20/07/AD/ND/T,/,9/3,/,8/64/H/fulltext.txt', 169796]
['2007cmd..book....1C', '/proj/ads/articles/fulltext/extracted/20/07/cm/d,/,b/oo/k,/,,/,1/C/fulltext.txt', 52302]
['2008ADNDT..94....1L', '/proj/ads/articles/fulltext/extracted/20/08/AD/ND/T,/,9/4,/,,/,1/L/fulltext.txt', 128320]
['2008ADNDT..94..561D', '/proj/ads/articles/fulltext/extracted/20/08/AD/ND/T,/,9/4,/,5/61/D/fulltext.txt', 189230]
['2008ADNDT..94..807L', '/proj/ads/articles/fulltext/extracted/20/08/AD/ND/T,/,9/4,/,8/07/L/fulltext.txt', 410355]
['2009ADNDT..95....1S', '/proj/ads/articles/fulltext/extracted/20/09/AD/ND/T,/,9/5,/,,/,1/S/fulltext.txt', 285454]
['2009ADNDT..95..547L', '/proj/ads/articles/fulltext/extracted/20/09/AD/ND/T,/,9/5,/,5/47/L/fulltext.txt', 121969]
['2009ADNDT..95..607A', '/proj/ads/articles/fulltext/extracted/20/09/AD/ND/T,/,9/5,/,6/07/A/fulltext.txt', 920690]
['2009arXiv0904.2782S', '/proj/ads/articles/fulltext/extracted/20/09/ar/Xi/v0/90/4,/27/82/S/fulltext.txt', 352583]
['2009arXiv0910.1690A', '/proj/ads/articles/fulltext/extracted/20/09/ar/Xi/v0/91/0,/16/90/A/fulltext.txt', 79066]
['2009arXiv0910.5784S', '/proj/ads/articles/fulltext/extracted/20/09/ar/Xi/v0/91/0,/57/84/S/fulltext.txt', 918828]
['2010ADNDT..96....1T', '/proj/ads/articles/fulltext/extracted/20/10/AD/ND/T,/,9/6,/,,/,1/T/fulltext.txt', 175280]
['2010ADNDT..96..123A', '/proj/ads/articles/fulltext/extracted/20/10/AD/ND/T,/,9/6,/,1/23/A/fulltext.txt', 678572]
['2010ADNDT..96..481H', '/proj/ads/articles/fulltext/extracted/20/10/AD/ND/T,/,9/6,/,4/81/H/fulltext.txt', 219803]
['2010ADNDT..96..759S', '/proj/ads/articles/fulltext/extracted/20/10/AD/ND/T,/,9/6,/,7/59/S/fulltext.txt', 315417]
['2011ADNDT..97...50B', '/proj/ads/articles/fulltext/extracted/20/11/AD/ND/T,/,9/7,/,,/50/B/fulltext.txt', 308131]
['2011ADNDT..97..225A', '/proj/ads/articles/fulltext/extracted/20/11/AD/ND/T,/,9/7,/,2/25/A/fulltext.txt', 702513]
['2011ADNDT..97..587L', '/proj/ads/articles/fulltext/extracted/20/11/AD/ND/T,/,9/7,/,5/87/L/fulltext.txt', 309957]
['2011ESASP.690E..24A', '/proj/ads/articles/fulltext/extracted/20/11/ES/AS/P,/69/0E/,,/24/A/fulltext.txt', 80391]
['2011PhDT........92J', '/proj/ads/articles/fulltext/extracted/20/11/Ph/DT/,,/,,/,,/,,/92/J/fulltext.txt', 918724]
['2012ADNDT..98..149M', '/proj/ads/articles/fulltext/extracted/20/12/AD/ND/T,/,9/8,/,1/49/M/fulltext.txt', 185016]
['2012ADNDT..98..437D', '/proj/ads/articles/fulltext/extracted/20/12/AD/ND/T,/,9/8,/,4/37/D/fulltext.txt', 163161]
['2012ADNDT..98..779W', '/proj/ads/articles/fulltext/extracted/20/12/AD/ND/T,/,9/8,/,7/79/W/fulltext.txt', 104213]
['2013ADNDT..99..249T', '/proj/ads/articles/fulltext/extracted/20/13/AD/ND/T,/,9/9,/,2/49/T/fulltext.txt', 362192]
['2013ADNDT..99..459O', '/proj/ads/articles/fulltext/extracted/20/13/AD/ND/T,/,9/9,/,4/59/O/fulltext.txt', 141759]
['2013arXiv1308.5199C', '/proj/ads/articles/fulltext/extracted/20/13/ar/Xi/v1/30/8,/51/99/C/fulltext.txt', 579391]
['2013arXiv1312.4478L', '/proj/ads/articles/fulltext/extracted/20/13/ar/Xi/v1/31/2,/44/78/L/fulltext.txt', 443883]
['2014ADNDT.100..651M', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/,6/51/M/fulltext.txt', 437029]
['2014ADNDT.100..802F', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/,8/02/F/fulltext.txt', 132232]
['2014ADNDT.100..986T', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/,9/86/T/fulltext.txt', 335214]
['2014ADNDT.100.1156T', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/11/56/T/fulltext.txt', 135543]
['2014ADNDT.100.1292F', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/12/92/F/fulltext.txt', 132797]
['2014ADNDT.100.1357X', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/13/57/X/fulltext.txt', 142772]
['2014ADNDT.100.1399A', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/13/99/A/fulltext.txt', 515880]
['2014ADNDT.100.1519L', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/15/19/L/fulltext.txt', 339602]
['2014ADNDT.100.1603A', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/16/03/A/fulltext.txt', 716084]
['2015ADNDT.101...41Z', '/proj/ads/articles/fulltext/extracted/20/15/AD/ND/T,/10/1,/,,/41/Z/fulltext.txt', 631607]
['2015arXiv150309147G', '/proj/ads/articles/fulltext/extracted/20/15/ar/Xi/v1/50/30/91/47/G/fulltext.txt', 61666]
['2016ADNDT.107..140A', '/proj/ads/articles/fulltext/extracted/20/16/AD/ND/T,/10/7,/,1/40/A/fulltext.txt', 295055]
['2016ADNDT.107..221A', '/proj/ads/articles/fulltext/extracted/20/16/AD/ND/T,/10/7,/,2/21/A/fulltext.txt', 554068]
['2016ADNDT.108...15W', '/proj/ads/articles/fulltext/extracted/20/16/AD/ND/T,/10/8,/,,/15/W/fulltext.txt', 145479]
['2016ADNDT.111..280A', '/proj/ads/articles/fulltext/extracted/20/16/AD/ND/T,/11/1,/,2/80/A/fulltext.txt', 204495]
['2016JPhCS.758a2002D', '/proj/ads/articles/fulltext/extracted/20/16/JP/hC/S,/75/8a/20/02/D/fulltext.txt', 19531]
['2016JPhCS.761a2034C', '/proj/ads/articles/fulltext/extracted/20/16/JP/hC/S,/76/1a/20/34/C/fulltext.txt', 30861]
['2016JQSRT.168..102S', '/proj/ads/articles/fulltext/extracted/20/16/JQ/SR/T,/16/8,/,1/02/S/fulltext.txt', 121291]
['2016JVGR..311...79G', '/proj/ads/articles/fulltext/extracted/20/16/JV/GR/,,/31/1,/,,/79/G/fulltext.txt', 99296]
['2016NJPh...18j3050C', '/proj/ads/articles/fulltext/extracted/20/16/NJ/Ph/,,/,1/8j/30/50/C/fulltext.txt', 22097]
['2016Tectp.677....1L', '/proj/ads/articles/fulltext/extracted/20/16/Te/ct/p,/67/7,/,,/,1/L/fulltext.txt', 137039]
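
One way to discard numeric tables, as suggested above, is to drop lines whose tokens are mostly numbers. The following is only a minimal sketch under stated assumptions: the function name `drop_numeric_lines` and the 0.6 threshold are hypothetical and are not taken from the extraction pipeline.

```python
import re

# Matches plain integers, decimals, and exponent notation such as 1.5e-3.
NUMERIC_TOKEN = re.compile(r"^[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?$")


def drop_numeric_lines(text, max_numeric_ratio=0.6):
    """Return text without lines whose tokens are mostly numbers.

    Hypothetical helper: the threshold is an assumption and would need
    tuning against real fulltext files.
    """
    kept = []
    for line in text.splitlines():
        tokens = line.split()
        if tokens:
            numeric = sum(1 for tok in tokens if NUMERIC_TOKEN.match(tok))
            if numeric / len(tokens) > max_numeric_ratio:
                continue  # looks like a table row, so skip it
        kept.append(line)
    return "\n".join(kept)
```

A line of prose that merely mentions a number or two stays, because its numeric ratio remains below the threshold; only rows dominated by numbers are removed.
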
@aaccomazzi
Member Author

For extraction from XML sources, this issue is fixed by commit c18f29b.

We should still investigate whether this is also an issue with PDF extraction and then decide how to tackle it.

@aaccomazzi
Member Author

To clarify: the SOLR ingestion problem was caused by a limit on word length (blobs of characters longer than 32k caused a failure), not on total text length.
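
Since the limit applies to individual words rather than to the whole text, a file can be checked by measuring its longest whitespace-delimited token. This is a rough sketch: the helper names are hypothetical, and the default limit of 32 * 1024 bytes is an assumption standing in for "32k"; the exact threshold should be checked against the index configuration.

```python
def longest_token_bytes(text):
    """Return the UTF-8 byte length of the longest whitespace-delimited token."""
    return max((len(tok.encode("utf-8")) for tok in text.split()), default=0)


def has_overlong_token(text, limit=32 * 1024):
    """True if any single token exceeds the (assumed) byte limit."""
    return longest_token_bytes(text) > limit
```

Under this check a very large file made of ordinary words passes, while a much smaller file containing one unbroken run of characters does not.
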

Commit ee36740 modifies the PDF extraction code so that newlines are kept, greatly lowering the possibility that huge strings of nonsense will be generated.
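
To illustrate why keeping newlines matters, the sketch below joins the same extracted lines with and without line breaks and compares the longest resulting token. It is not the code from commit ee36740; the function names and the sample data are made up for the example.

```python
def join_lines(lines, keep_newlines=True):
    """Join extracted text lines, optionally preserving line breaks."""
    separator = "\n" if keep_newlines else ""
    return separator.join(lines)


def longest_token(text):
    """Length in characters of the longest whitespace-delimited token."""
    return max((len(tok) for tok in text.split()), default=0)


# 10,000 short table cells, one per extracted line (made-up data).
cells = ["12.5"] * 10000

without_newlines = longest_token(join_lines(cells, keep_newlines=False))
with_newlines = longest_token(join_lines(cells))
```

Without separators the cells fuse into a single 40,000-character token, whereas with newlines kept no token is longer than one cell.
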
