Aug 18, 2023
alejandropaz edited this page Aug 18, 2023
- troubleshoot the postprocessor results to see why they aren't accurate - Fr
- add comments to postprocessor - Fr
- if time allows, run postprocessor on Fox News and Washington Post twitter crawls - Fr
- document issue of deleting instance without detaching storage - Gy
- try new crawl techniques as far as possible and experiment with a new IP address - Gy
- with new instance and new code, try NYT archive crawl - Gy
- attempt to create instance and then attach storage - Gy
- URL expander wasn't working; it is now fixed
- URL expander: updated the JavaScript and then ran into bugs
- added documentation to header cleaner: give it a directory of files and it outputs a new directory with all the cleaned files
- managed to delete the volume
- crawls: getting blocked very quickly, so now using new code that stops the crawl after 3 days for a 1-day break
- will put all the crawls on new code
- will consult with Nat
- error message about corruption: disk image malformed -- suspect that it's due to Apify itself
- the error message mentions a problem with Apify
- will consult with Nat about the error and also about the possibility of copying and pasting the Apify folder
- if the Apify folder is present, the crawler does not re-crawl the URLs; if the Apify folder isn't present, it re-does the whole crawl
- check crawl every 2 days - Gy
- update the MVP, especially with respect to the format of data going into and coming out of the postprocessor, and then as input to the visualization environment - Gy/Fr
- push corrected postprocessor code to master - Gy/Fr
- postprocessor: write instructions documenting the order of utilities and the steps to use the postprocessor - Gy
- backburner: figure out corruption in small domain crawl
- troubleshoot the postprocessor results to see why they aren't accurate
- run postprocessor on Fox News and Washington Post twitter crawls
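
The crawl schedule mentioned in the notes (stop the crawl at 3 days for a 1-day break) could be sketched roughly as below. This is only an illustration and not the project's crawler code: the function name `crawl_is_active`, the two constants, and the assumption that the 3-days-on, 1-day-off cycle repeats indefinitely are all hypothetical.

```python
from datetime import datetime, timedelta

# Values taken from the notes: crawl for 3 days, then take a 1-day break.
RUN_PERIOD = timedelta(days=3)
BREAK_PERIOD = timedelta(days=1)


def crawl_is_active(started_at: datetime, now: datetime) -> bool:
    """Return True inside a 3-day run window, False during the 1-day break.

    Assumption (not stated in the notes): the run/break cycle repeats
    for as long as the crawl is scheduled.
    """
    if now < started_at:
        raise ValueError("now must not be earlier than started_at")
    cycle = RUN_PERIOD + BREAK_PERIOD
    # Position within the current 4-day cycle.
    position = (now - started_at) % cycle
    return position < RUN_PERIOD


if __name__ == "__main__":
    start = datetime(2023, 8, 18)
    for offset_days in range(6):
        moment = start + timedelta(days=offset_days)
        state = "crawl" if crawl_is_active(start, moment) else "break"
        print(moment.date(), state)
```

A scheduler or the crawler's own loop could call a check like this before each batch of requests and sleep while it returns False.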