Multi-file parse #65
From the error message, it appears the code tries to open a file whose name is the entire comma-separated list.
The non-existent file error is generated at https://github.com/adsabs/ADSfulltext/blob/master/adsft/checker.py#L235
Currently, 11259 bibcodes in fulltext/all.links list multiple files, out of 4823177 bibcodes that have fulltext. When multiple files are listed, it looks like the first file (right.html) contains the text of the paper while the other files hold tables.
One complicating factor: commas can be part of a filename. Perhaps a comma followed by a '/' marks the start of a new filename.
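A minimal sketch of that splitting rule, assuming every listed path is absolute (starts with '/'). The function name `split_file_list` and the example paths are made up for illustration and are not taken from ADSfulltext:

```python
import re


def split_file_list(field):
    """Split a comma-separated list of absolute paths.

    Assumption: each path starts with '/', so only a comma that is
    directly followed by '/' separates two paths. Any other comma is
    treated as part of a filename and left in place.
    """
    # The lookahead (?=/) checks for the '/' without consuming it,
    # so each resulting path keeps its leading slash.
    return re.split(r",(?=/)", field)


# Hypothetical paths: the first filename contains a comma.
paths = split_file_list("/data/a,b.html,/data/table1.html")
print(paths)
```

This heuristic would still mis-split a filename that itself contains the sequence ",/", which seems unlikely but is not ruled out by anything in this thread.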
The fulltext is spread over multiple files for over 11k bibcodes. Currently, fulltext fails on all of them. With this change, only the first file in the list is processed. This is very much a partial solution; fortunately, the first file holds the bulk of the text, and the other files in the list typically hold text from tables.
#65 partial support for multi-file fulltext records
Concatenate values read in from multiple files for at least the keys dataset, fulltext and acknowledgements.
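The concatenation could look roughly like the sketch below. It assumes each file's extraction result is a dict of strings keyed by field name; that shape, the name `merge_extracted`, and the single-space separator are illustrative assumptions, not the actual ADSfulltext implementation:

```python
# Keys named in the commit message above.
MERGE_KEYS = ("dataset", "fulltext", "acknowledgements")


def merge_extracted(parts):
    """Combine per-file extraction results into one record.

    Assumption: `parts` is a list of dicts, one per source file, in the
    order the files are listed, and each value is a plain string.
    """
    merged = {}
    for key in MERGE_KEYS:
        # Skip files that have no value (or an empty value) for this key.
        values = [part[key] for part in parts if part.get(key)]
        if values:
            merged[key] = " ".join(values)
    return merged


# Hypothetical results for a main file and one table file.
record = merge_extracted([
    {"fulltext": "Body text.", "acknowledgements": "Thanks."},
    {"fulltext": "Table 1."},
])
print(record)
```

Keeping the files in their listed order means the main text (the first file) comes first in the merged fulltext, followed by the table text.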
I verified the Solr body field for 2003A&A...402..531C has text from each of the 4 files listed in fulltext/all.links.
It appears fulltext generates an error when a line in /proj/ads/abstracts/config/links/fulltext/all.links lists multiple files. For example, a single line from all.links:
2003A&A...402..531C /proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/aah3724.right.html,/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/tableE.1.html,/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/table2.html,/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/table1.html EDP
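For reference, a line of this shape can be broken into its fields as sketched below. This assumes exactly three whitespace-separated fields (bibcode, file field, provider) and no whitespace inside paths; the function name `parse_links_line` and the sample line are hypothetical:

```python
def parse_links_line(line):
    """Parse one all.links-style line into its three fields.

    Assumption: the line is 'bibcode file_field provider', separated by
    whitespace, with no whitespace inside any path. A line with a
    different number of fields raises ValueError.
    """
    bibcode, file_field, provider = line.split()
    return {
        "bibcode": bibcode,
        "file_field": file_field,
        "provider": provider,
        # A ',/' inside the file field suggests more than one path.
        "multi_file": ",/" in file_field,
    }


# Made-up line in the same layout as the example above.
rec = parse_links_line("2000Test..001....1A /data/main.html,/data/table1.html XYZ")
print(rec["bibcode"], rec["provider"], rec["multi_file"])
```

Flagging multi-file records at parse time would let the checker handle them explicitly instead of passing the whole field to the file-existence test.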