Feb 15, 2024
alejandropaz edited this page Feb 15, 2024
- continue combining results and document (reading week) - Gy
- monitor NYT archive and send email to Alejandro to update - Gy
- follow up with IA regarding connection refused error - Ra
- ask Nat for meeting about connection refused error - Ra
- try another kind of crawl to see if there's a refused error - Ra
- try to update version of node to see if that helps - Ra
- take a sample of Wa/Po and see if the right result can be reproduced - Fr
- follow up by email about Wa/Po output number - Fr
- unit-test each function of the postprocessor for the IA dataset - Ar
- if unit-testing shows accuracy, then request IA Electronic Intifada dataset from Raazia and proceed with postprocessing - Ar
- need at least 2000 a day on NYT Mid E archive crawl
- Nat helped with workarounds:
  - separate the downloading of urls through cdx from the crawling
  - store failed responses and retry them later (while moving on through the successful responses)
  - randomize the pagination attempts
  - filter after downloading the urls: check for duplicate urls before assigning them to the queue for crawling
  - some of the problem may be the responsiveness of the IA servers, so slow down requests
- test the NYT Mid E Archive crawl at a rate of 2000 results a day; if that is achievable, continue; if not, abandon it - Gy
- continue combining results and document (reading week) - Gy
- integrating Nat's suggestions and testing again the NYT Mid E - Ra
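
A rough sketch of how Nat's workarounds might fit together, for discussion only. It is written in Python for illustration and is not the project's crawler code. Everything named here (`dedupe_urls`, `shuffled_pages`, `crawl`, and the `fetch` callable) is hypothetical. The sketch assumes the url list has already been downloaded through cdx in a separate step, so the crawl only works from that list.

```python
import random
import time
from collections import deque


def dedupe_urls(urls):
    """Return urls in first-seen order with exact duplicates removed."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


def shuffled_pages(page_count, rng=random):
    """Return page indices 0..page_count-1 in random order."""
    pages = list(range(page_count))
    rng.shuffle(pages)
    return pages


def crawl(urls, fetch, delay_seconds=1.0, max_attempts=3, sleep=time.sleep):
    """Crawl a pre-downloaded url list, deferring failures for a later retry.

    fetch(url) is a hypothetical callable that returns the page body or
    raises an exception (for example on "connection refused").
    """
    # Duplicates are filtered before anything is assigned to the queue.
    queue = deque((url, 1) for url in dedupe_urls(urls))
    results, failed = {}, []
    while queue:
        url, attempt = queue.popleft()
        try:
            results[url] = fetch(url)
        except Exception:
            if attempt < max_attempts:
                # Put the failure at the back so other urls go first.
                queue.append((url, attempt + 1))
            else:
                failed.append(url)
        sleep(delay_seconds)  # slow down requests to the server
    return results, failed
```

For the 2000-a-day target, the time budget is 86400 / 2000 = 43.2 seconds per result, so `delay_seconds` would need to stay below that, with some margin for retried urls.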