August 4, 2021
alejandropaz edited this page Aug 4, 2021 · 2 revisions
- SWPP for Raiyan
- timesheets
- apify update
- results of crawls now available on MVP?
- politics subdomain?
- cutting a beta
- ideas for running parallel instances for next person?
- updated to apify 1.3.1, lots of major changes, esp. to the queue
- no longer file based JSON but rather DB
- it works like it used to
- bad news: need to re-start politics and middle east from the beginning
- results JSON is the same, so the input to the postprocessor should still work
- probably any other updates won't be such a hassle
- could be that the multiple instances will be easier now
- if John were to look at the possibility of multiple instances:
- first read apify documentation
- then try running the crawler multiple times in different terminals, keeping an eye on the queue to ensure no errors
- a couple of functions were deprecated, which means there'll be new functions, but not sure when:
- e.g., the goto function (handles blacklisted domains, videos, etc., to help crawl faster) will probably be removed; Raiyan will try to remove it and use a workaround that does essentially the same thing
- NYT twitter crawl is accessible and will be updated to MVP
- NYT/middle east & NYT/politics will be restarted, and old data expunged
- we're not sure how to do a release
- twitter and domain crawler documentation should be good; we're not sure about the postprocessor documentation
- with DB: a new python script, the "master crawler", runs the crawler itself (not the bash script) multiple times; this might work, since Raiyan tested it and it did work some of the time
- suggestion: next dev should be in touch with the Apify development team to see if they have suggestions for how to run multiple instances; devs are good about responding
- in person at UTSC for week of August 30th?
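
For whoever picks up the multiple-instances question, the "master crawler" idea noted above could look roughly like the sketch below. This is not Raiyan's script: `run_instances` and the placeholder command are hypothetical names used only for illustration, and the real crawler command would need to be substituted. It only shows how to start several copies of one command at the same time and collect their exit codes; it does nothing about queue conflicts between instances.

```python
import subprocess
import sys
from typing import List, Sequence


def run_instances(command: Sequence[str], count: int) -> List[int]:
    """Start `count` copies of `command` at once and return their exit codes."""
    # Launch every process before waiting on any, so they run in parallel.
    procs = [subprocess.Popen(list(command)) for _ in range(count)]
    # wait() blocks until each process exits and returns its exit code.
    return [proc.wait() for proc in procs]


if __name__ == "__main__":
    # Placeholder command (assumption): stands in for the real crawler entry point.
    placeholder = [sys.executable, "-c", "print('crawler instance finished')"]
    exit_codes = run_instances(placeholder, 3)
    print("exit codes:", exit_codes)
```

A non-zero exit code from any instance would be the first thing to check when watching for the queue errors mentioned above.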