May 26, 2022
alejandropaz edited this page May 26, 2022
- Alejandro: finish updating server documentation
- Alejandro: send sites to crawl
- design crawl brake (pause and email)
- test the new method above (2 threads, etc.) on a small list of domains
- the main item is postprocessor refactoring, along the lines stated above
- NYT archive politics crawl
- Apify is weak at error handling: it has an error callback, but errors are only caught under certain circumstances
- if a URL fails, it is re-added to the queue, and the error only registers if the retry also fails
- Puppeteer error handling: set a crawl-round size (e.g., 1000 URLs) and register errors per round; if errors exceed 50, pause the crawl
- the user has to set both the crawl-round size and the error threshold that triggers the pause (will be added to documentation)
- we'll whitelist 404 errors
- one list of crawling errors: https://www.bing.com/webmasters/help/crawl-error-alerts-e29a3f3e
- Apify keeps a dataset of failed URLs; it appears in Apify storage, in the same place as the request queue
- `Domain_crawler/guardian_2022_05_12/mediacat-domain-crawler/newCrawler/apify_storage/datasets`
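The round-based error brake described above could be sketched roughly as follows. This is an illustrative Python sketch only, not the crawler's actual (Apify/Puppeteer) code; the names `should_pause`, `round_size`, and `max_errors` are placeholders, and the 404 whitelist reflects the decision above.

```python
# Illustrative sketch of the round-based error brake (placeholder names;
# the real crawler is built on Apify/Puppeteer, not this code).

WHITELISTED_STATUSES = {404}  # whitelisted errors don't count toward the threshold

def should_pause(results, round_size=1000, max_errors=50):
    """Check one crawl round of `round_size` results and decide whether to pause.

    `results` is a list of (url, status_code) pairs; any status >= 400 that is
    not whitelisted counts as an error.
    """
    round_results = results[:round_size]
    errors = sum(
        1 for _url, status in round_results
        if status >= 400 and status not in WHITELISTED_STATUSES
    )
    return errors > max_errors
```

Both `round_size` and `max_errors` would be user-settable, per the documentation note above.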
- added a brake: when the queue drops to 0, pause and send an email
- Shengsong tried it, but it doesn't seem to be working; he gets an "undefined" error
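The intended queue-empty brake might look like the sketch below. All names here are hypothetical (the real implementation is in the Apify-based crawler, and is the one currently failing with "undefined"); the `send` parameter stands in for something like `smtplib.SMTP(...).send_message`.

```python
# Sketch of the queue-empty brake: when the request queue drains to 0,
# stop crawling and notify by email. Hypothetical names throughout.
from email.message import EmailMessage

def build_notification(crawl_name, recipient):
    """Compose the 'queue is empty' notification email."""
    msg = EmailMessage()
    msg["Subject"] = f"Crawl '{crawl_name}' finished: request queue is empty"
    msg["To"] = recipient
    msg.set_content(f"The request queue for crawl '{crawl_name}' reached 0.")
    return msg

def check_queue_and_brake(queue, crawl_name, recipient, send=None):
    """Return True (and send a notification) once the queue is empty."""
    if len(queue) > 0:
        return False
    msg = build_notification(crawl_name, recipient)
    if send is not None:  # e.g. smtplib.SMTP(host).send_message
        send(msg)
    return True
```

Separating message construction from sending makes the brake testable without a mail server.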
- pre-navigation: probably need a blacklist for each domain, but we could look into it in the future
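A per-domain pre-navigation blacklist could be as simple as the sketch below: before navigating to a URL, skip it if its path matches a blacklisted pattern for that domain. The domains and patterns shown are invented for illustration; in Apify this check would live in a pre-navigation hook.

```python
# Sketch of a per-domain pre-navigation blacklist (illustrative data only).
from urllib.parse import urlparse

BLACKLIST = {
    "example.com": ("/tag/", "/search"),  # made-up domain and patterns
}

def allowed(url):
    """Return False if the URL's path matches a blacklisted pattern for its domain."""
    parts = urlparse(url)
    patterns = BLACKLIST.get(parts.netloc, ())
    return not any(p in parts.path for p in patterns)
```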
- tested on 5 domains with stealth mode, 2 threads, and a 4-5 sec delay; no block errors
- middleeasteye returned 2 million URLs:
- test pre-navigation on middleeasteye at some point
- check in next week to see if it has finished the 10
- Apify makes it possible to crawl in rounds
- go in rounds in order to reduce the pause time between calls to a given domain
- we can set the number of urls from each domain, e.g., 500
- theoretically, we could have enough domains that no pause would be needed at all, but that won't hold for most crawls
- document crawling in rounds
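Crawling in rounds, as described above, amounts to interleaving a capped number of URLs per domain so that requests to any single domain are naturally spaced out. A rough Python sketch (hypothetical function, not the Apify implementation; the 500 default mirrors the example above):

```python
# Sketch of crawling in rounds: each round takes at most `per_domain`
# URLs from every domain and interleaves them, which shortens the
# effective pause between requests to any single domain.
from itertools import zip_longest

def next_round(urls_by_domain, per_domain=500):
    """Build one crawl round: up to `per_domain` URLs per domain, interleaved."""
    slices = [urls[:per_domain] for urls in urls_by_domain.values()]
    round_urls = []
    for group in zip_longest(*slices):
        round_urls.extend(u for u in group if u is not None)
    return round_urls
```

With enough domains in rotation, the gap between hits to one domain comes from visiting all the others, so little or no explicit pause is needed.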
- input processing and output
- further divide twitter & domain
- probably subdivided further after that
- still running: about 200,000 of 800,000 finished
- add documentation about:
- crawler numbers for error registering and pausing the crawler
- brake when queue goes to 0
- apify crawl in rounds
- Twitter: embedded tweet issue
- Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
- using crawler proxies
- adding to regular postprocessor output:
- any non-scope domain hyperlink that ends in .co.il
- any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the backburner for now.
- what to do with htz.li
- finding language function
- image_reference function
- dealing with embedded versus cited tweets