
August 4, 2021

alejandropaz edited this page Aug 4, 2021 · 2 revisions

Agenda

  • SWPP for Raiyan
  • timesheets
  • apify update
  • results of crawls now available on MVP?
  • politics subdomain?
  • cutting a beta
  • ideas for running parallel instances for next person?

Apify

  • updated to Apify 1.3.1; lots of major changes, especially to the queue
  • the queue is no longer file-based JSON but a DB
  • the crawler works as it did before
  • bad news: need to restart the politics and middle east crawls from the beginning
  • the results JSON is unchanged, so it should still work as input to the postprocessor
  • future updates probably won't be such a hassle
  • running multiple instances may be easier now
  • if John were to look into the possibility of multiple instances:
  1. first read the Apify documentation
  2. try running the crawler several times in different terminals, keeping an eye on the queue to make sure there are no errors
  • a couple of functions were deprecated, which means there will be new functions, but it is not clear when:
  • e.g., the goto function (used to skip blacklisted domains, videos, etc., to help crawl faster) will probably be removed; Raiyan will try to remove it and use a workaround that does essentially the same thing
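
A minimal sketch of the kind of check such a workaround would need: deciding, before a page is fetched, whether a URL belongs to a blacklisted domain or points at a video file. It is written in Python only for illustration; the crawler itself and the Apify hook the check would be wired into are not shown, and the domain and extension lists are hypothetical placeholders, not the project's real blacklist.

```python
from urllib.parse import urlparse

# Hypothetical examples only; the project's real blacklist is not given in these notes.
BLOCKED_DOMAINS = {"youtube.com", "vimeo.com"}
BLOCKED_EXTENSIONS = (".mp4", ".webm", ".mov")


def should_skip(url, blocked_domains=BLOCKED_DOMAINS, blocked_extensions=BLOCKED_EXTENSIONS):
    """Return True if the URL is on a blocked domain or looks like a video file."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Match the domain itself and any of its subdomains, but not look-alikes.
    if any(host == d or host.endswith("." + d) for d in blocked_domains):
        return True
    # Compare the path only, so query strings do not hide the extension.
    return parsed.path.lower().endswith(blocked_extensions)
```

Skipping such URLs before the request is made is what saves crawl time; filtering after download would keep the results clean but not make the crawl faster.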

Current Crawls

  • the NYT Twitter crawl is accessible and will be updated on the MVP
  • NYT/middle east & NYT/politics will be restarted, and old data expunged

Cutting a release

  • we're not sure how to do a release
  • the Twitter and domain crawler documentation should be good; we're not sure about the postprocessor documentation

Other ideas

  • with the DB: a new Python script, the "master crawler", runs the crawler itself (not the bash script) multiple times; this might work, since it did work some of the times Raiyan tested it
  • suggestion: the next dev should get in touch with the Apify development team to ask how to run multiple instances; the devs are good about responding
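
A rough sketch of what a "master crawler" launcher could look like, assuming the crawler can be started from a single command. This is not Raiyan's script: the function name `run_instances` and the placeholder command are hypothetical, and the real crawler command would need to be substituted. Whether several instances can safely share one queue is exactly the open question above, so the sketch only starts the processes and reports their exit codes.

```python
import subprocess
import sys


def run_instances(command, count):
    """Start `count` copies of `command` in parallel and return their exit codes."""
    procs = [subprocess.Popen(command) for _ in range(count)]
    # Wait for every instance; a non-zero code means that instance failed.
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # Placeholder command (assumption): replace with the real crawler invocation.
    placeholder = [sys.executable, "-c", "print('crawler instance would run here')"]
    codes = run_instances(placeholder, 3)
    failed = [i for i, code in enumerate(codes) if code != 0]
    print(f"{len(codes) - len(failed)} of {len(codes)} instances exited cleanly")
```

Collecting the exit codes gives a first signal of the intermittent failures seen in testing, before looking at the queue itself.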

Next meeting

  • in person at UTSC for week of August 30th?