python run.py -s -f /proj/ads/abstracts/config/links/fulltext/all.links sends 94885 messages #51

Closed
romanchyla opened this issue Oct 25, 2017 · 3 comments · Fixed by #52

@romanchyla
Contributor

All of them make it into the master pipeline.

However, there is a lot more fulltext than that:

 ads@adsvm05:/proj.adsvm05/backoffice$ wc -l /proj/ads/abstracts/config/links/fulltext/all.links
4895019 /proj/ads/abstracts/config/links/fulltext/all.links
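For scale, the two numbers can be compared directly. This is only a rough sketch: it assumes each message in the title and each line counted by wc -l corresponds to one record, and the helper name `coverage` is made up for illustration, not part of the pipeline.

```python
def coverage(sent: int, total: int) -> float:
    """Return the fraction of link-file entries that were submitted."""
    if total <= 0:
        raise ValueError("total must be positive")
    return sent / total


sent = 94885      # messages reported in the issue title
total = 4895019   # lines in all.links according to wc -l (assumed: one record per line)

print(f"{coverage(sent, total):.2%} submitted, {total - sent} not submitted")
```

If that assumption holds, only about 2% of the entries in all.links were sent.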

@marblestation can you please check? shall we re-run extraction?

@romanchyla
Contributor Author

btw: the old solr seems to have 5,281,144 docs with fulltext

@romanchyla
Contributor Author

when run as run.py -f /proj/ads/abstracts/config/links/fulltext/all.links --max_queue_size 0 -s

it submits everything. This should be the default.

@romanchyla
Contributor Author

The new SOLR has 2.4M docs (the old one has 5.2M).

There are some errors in the log queue of adsft, e.g.:

./logs/adsft.checker.log.2:2017-10-30 17:24:13,527 ERROR [469:MainThread:checker.py:229] Bibcode '1999BAAS...31.1185L' is linked to a zero byte size file '/proj/ads/fulltext/sources/downloads/cache/ADS/articles.adsabs.harvard.edu/full/BAAS/0031/1999BAAS...31.1185L.ocr'
./logs/adsft.checker.log.2:2017-10-30 17:24:13,901 ERROR [478:MainThread:checker.py:229] Bibcode '1999BAAS...31Q1185C' is linked to a zero byte size file '/proj/ads/fulltext/sources/downloads/cache/ADS/articles.adsabs.harvard.edu/full/BAAS/0031/1999BAAS...31Q1185C.ocr'

and

./logs/adsft.extraction.log:2017-10-30 17:13:28,428 ERROR [470:MainThread:extraction.py:897] Fulltext extraction failed for bibcode '1997EL.....38..675R': '/proj/ads/articles/fulltext/sources/EL/0038/epl_38_9_675.pdf'
./logs/adsft.extraction.log:2017-10-30 17:13:31,423 ERROR [480:MainThread:extraction.py:897] Fulltext extraction failed for bibcode '1997EL.....38..423S': '/proj/ads/articles/fulltext/sources/EL/0038/epl_38_6_423.pdf'
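
To get a feel for how many bibcodes are affected by each kind of error, the log lines could be tallied with something like the sketch below. The two regular expressions are modelled only on the messages quoted above, so other wordings in the logs would be missed. The names `collect_failures`, `ZERO_BYTE` and `EXTRACTION_FAILED` are invented for this example and are not part of adsft.

```python
import re
from collections import defaultdict

# Patterns modelled only on the log messages quoted in this issue (assumption:
# the real logs use exactly this wording for these two error kinds).
ZERO_BYTE = re.compile(
    r"Bibcode '([^']+)' is linked to a zero byte size file '([^']+)'"
)
EXTRACTION_FAILED = re.compile(
    r"Fulltext extraction failed for bibcode '([^']+)': '([^']+)'"
)


def collect_failures(lines):
    """Group (bibcode, path) pairs by error kind.

    findall is used so that several messages fused onto one line are all found.
    """
    failures = defaultdict(list)
    for line in lines:
        for kind, pattern in (
            ("zero_byte", ZERO_BYTE),
            ("extraction_failed", EXTRACTION_FAILED),
        ):
            for bibcode, path in pattern.findall(line):
                failures[kind].append((bibcode, path))
    return dict(failures)


# Two of the lines quoted above, used as sample input.
sample = [
    "./logs/adsft.checker.log.2:2017-10-30 17:24:13,527 ERROR "
    "[469:MainThread:checker.py:229] Bibcode '1999BAAS...31.1185L' is linked "
    "to a zero byte size file '/proj/ads/fulltext/sources/downloads/cache/ADS/"
    "articles.adsabs.harvard.edu/full/BAAS/0031/1999BAAS...31.1185L.ocr'",
    "./logs/adsft.extraction.log:2017-10-30 17:13:28,428 ERROR "
    "[470:MainThread:extraction.py:897] Fulltext extraction failed for bibcode "
    "'1997EL.....38..675R': '/proj/ads/articles/fulltext/sources/EL/0038/"
    "epl_38_9_675.pdf'",
]

result = collect_failures(sample)
for kind, items in result.items():
    print(kind, len(items), [bibcode for bibcode, _ in items])
```

Run over the full adsft logs instead of `sample`, this would give per-kind counts and a list of bibcodes to re-check, which may help decide whether a re-run of the extraction is worthwhile.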
