Aug 11, 2023
alejandropaz edited this page Aug 11, 2023
- develop script and documentation to remove extra header lines from twitter crawl output prior to postprocessing - Fr
- check whether the URL extender is the most up-to-date version - Fr
- run URL extender on test twitter crawl output (~23,000) and run postprocessor on the resulting output - Fr
- check results of postprocessor on test data - Al
- if results work, run URL extender on all twitter crawl output (Fox News and Washington Post, kept separate) and postprocess - Fr
- check if new IP address created with new instance - Gy
- pause Israeli domain crawl while testing other crawl technique - Gy
- set up individual crawls for Israeli domains to test crawl technique, and check regularly to see if multiple errors have caused the crawl to break - Gy
- if new IP address is created with new instance, try NYT archive crawl - Gy
- problem with the postprocessed results: haaretz in crawler output and strange results in postprocessor output
- developed script and documentation for removal of extra header lines from twitter crawl output
- fixed the warning messages from running postprocessor
- cleaned up the postprocessor documentation
- created new instance on Graham, then ran into a problem attaching storage because the storage was still connected to the deleted instance
- wrote script for running small domain crawl to ease pressure on domains
- suspect that on some domains the server blocks any crawler
- check crawl every 2 days - Gy
- update the MVP esp wrt format of data going into postprocessor and coming out, and then as input to the visualization environment - Gy/Fr
- push corrected postprocessor code to master - Gy/Fr
- postprocessor: document the order of utilities and the steps for using the postprocessor - Gy
- backburner: figure out corruption in small domain crawl
- troubleshoot the postprocessor results to see why they aren't accurate - Fr
- add comments to postprocessor - Fr
- if time allows, run postprocessor on Fox News and Washington Post twitter crawls - Fr
- document issue of deleting instance without detaching storage - Gy
- try new crawl techniques as best as possible and experiment with new IP address - Gy
- with new instance and new code, try NYT archive crawl - Gy
- attempt to create instance and then attach storage - Gy
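
For reference, the header-removal step mentioned above could look roughly like the sketch below. This is not the project's actual script. It assumes the twitter crawl output is a line-oriented text file (for example CSV) whose first line is the real header, and that the "extra header lines" are exact repeats of that line further down the file. The names `drop_repeated_headers` and `clean_file` are hypothetical.

```python
from typing import Iterable, Iterator


def drop_repeated_headers(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines, keeping the first line as the header and skipping later exact copies of it."""
    header = None
    for line in lines:
        if header is None:
            # Assumption: the first line of the file is the real header.
            header = line.rstrip("\r\n")
            yield line
        elif line.rstrip("\r\n") == header:
            # A repeated header line further down the file: skip it.
            continue
        else:
            yield line


def clean_file(src_path: str, dst_path: str) -> None:
    """Write a copy of src_path to dst_path with repeated header lines removed."""
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        dst.writelines(drop_repeated_headers(src))
```

Writing to a separate output file, rather than editing in place, keeps the raw crawl output untouched so the postprocessor results can be compared against it if they look wrong again.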