Multi-file parse #65

Closed
spacemansteve opened this issue Jan 2, 2018 · 5 comments

spacemansteve commented Jan 2, 2018

It appears fulltext generates an error when a line in /proj/ads/abstracts/config/links/fulltext/all.links lists multiple files. For example, a single line from all.links:

2003A&A...402..531C /proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/aah3724.right.html,/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/tableE.1.html,/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/table2.html,/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/table1.html EDP

@spacemansteve

From the error message, it appears the code tries to open a file whose name is the entire comma-separated list:
"Bibcode '2003A&A...402..531C' is linked to a non-existent file '/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/aah3724.right.html,/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/tableE.1.html,/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/table2.html,/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/table1.html'"

@spacemansteve

The non-existent file error is generated at https://github.com/adsabs/ADSfulltext/blob/master/adsft/checker.py#L235

@spacemansteve

Currently, there are 11259 bibcodes in fulltext/all.links that list multiple files. This is out of 4823177 bibcodes that have fulltext. When multiple files are listed, it looks like the first file (right.html) contains the text of the paper while the other files hold tables.

@spacemansteve

One complicating factor: commas can be part of a filename, for example:
2008bhgs.confE...1V /proj/ads/fulltext/sources/downloads/cache/POS/pos.sissa.it//archive/conferences/075/001/BHs,%20GR%20and%20Strings_001.pdf POS

Perhaps a comma followed by a '/' is the start of a new filename.
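
A minimal sketch of that heuristic, assuming every file in the list is given as an absolute path, so each new filename starts with '/'. The function name `split_file_list` is hypothetical and is not taken from ADSfulltext:

```python
import re


def split_file_list(field):
    """Split the file column of an all.links line into individual paths.

    Assumption: every path is absolute, so a comma only separates two
    files when it is immediately followed by '/'.
    """
    # The lookahead (?=/) leaves the '/' attached to the path that follows.
    return re.split(r",(?=/)", field)


# Shortened to two of the four files from the 2003A&A...402..531C record.
multi = (
    "/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/aah3724.right.html,"
    "/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/tableE.1.html"
)
# The POS record, where the comma is part of the filename.
single = (
    "/proj/ads/fulltext/sources/downloads/cache/POS/pos.sissa.it//archive/"
    "conferences/075/001/BHs,%20GR%20and%20Strings_001.pdf"
)

print(len(split_file_list(multi)))   # 2
print(len(split_file_list(single)))  # 1
```

This would still split incorrectly on a directory whose name ends in a comma, so it is a heuristic rather than a guarantee.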

spacemansteve pushed a commit to spacemansteve/ADSfulltext that referenced this issue Jan 10, 2018
the fulltext is spread over multiple files for over 11k bibcodes.  currently, fulltext fails on all of them.  with this change only the first file of the list of files will be processed.  this is very much a partial solution, fortunately the first file holds the bulk of the text.  other files in the list typically hold text from tables.
spacemansteve added a commit that referenced this issue Jan 12, 2018
#65 partial support for multi-file fulltext records
spacemansteve pushed a commit to spacemansteve/ADSfulltext that referenced this issue Jan 17, 2018
concatenate values read in from multiple files for at least the keys dataset, fulltext and acknowledgements
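
One way to picture this concatenation, as a sketch only. It assumes each file is extracted into a dict, that `fulltext` and `acknowledgements` hold strings, and that `dataset` holds a list. The helper name `merge_extracted` and these value types are assumptions for illustration, not the code in the commit:

```python
def merge_extracted(per_file, keys=("fulltext", "acknowledgements", "dataset")):
    """Combine the values extracted from each file of one bibcode.

    Assumption: a given key has the same type (str or list) in every file.
    """
    merged = {}
    for content in per_file:
        for key in keys:
            value = content.get(key)
            if not value:
                continue  # key missing or empty in this file
            if key not in merged:
                # Copy lists so the caller's data is not modified later.
                merged[key] = list(value) if isinstance(value, list) else value
            elif isinstance(value, list):
                merged[key].extend(value)
            else:
                # Strings are joined in file order, separated by a space.
                merged[key] = merged[key] + " " + value
    return merged


result = merge_extracted([
    {"fulltext": "main text", "dataset": [{"a": 1}]},
    {"fulltext": "table text", "acknowledgements": "thanks"},
])
print(result["fulltext"])  # main text table text
```

Because values are joined in list order, the main article file would come first when it is listed first in all.links.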
spacemansteve pushed a commit to spacemansteve/ADSfulltext that referenced this issue Jan 17, 2018
@spacemansteve

I verified the Solr body field for 2003A&A...402..531C has text from each of the 4 files listed in fulltext/all.links.
