February 24, 2022
alejandropaz edited this page Mar 3, 2022
- 2 tickets:
- readability & domain crawler errors
- question: last spreadsheet only 800 rows
- heap memory error
- readability & domain crawler errors
- documenting two scopes on padlet & update the list of keys provided on JSON export from the post-processor
- Puppeteer update?
- need to do full re-crawl of al-monitor?
- Twitter API crawl
- errors came from readability: plain-text retrieval from the link URL
- replaced with puppeteer plain text retrieval, which seems to be working nearly perfectly
- Shengsong added code for filtering required out
- checked NYT as well as al-monitor.com -- filtering should not be domain specific -- will test
- simple filter -- update domain crawler & commit to main branch
- this should speed up our crawl without the extra call to readability
- create a second version since the original one works
- Shengsong's version will become the master branch, and Amy's will be v1
- main change is that Shengsong's formally separates the crawl scope from the citation scope, and changes the names of the keys that the postprocessor outputs
- changes are summarized here
- citation scope can be split into smaller crawl_scope: the crawler creates the outputs, and the postprocessor brings together the various outputs and their cross-referencing
- adds flexibility: enables multiple recursions of the postprocessing and the possibility of updating a crawl
- original heap memory was 4 GB; yesterday Shengsong increased it to 7 GB, and it can be increased to 16 GB
- this is a sysadmin issue: the user should allocate heap memory when setting up the server
- Shengsong will adjust the script to add a variable for heap memory
- this should resolve domain crawler issue
- could take 2-3 days
- one deprecated function, but it shouldn't affect the structure
- 2-3 days to get going
- re-do al-monitor.com crawl & benchmarking speed
- Twitter API
- testing new puppeteer filter code on 50 domains before documenting and committing
- create a second version of the postprocessor and make it master, to preserve Amy's version
- adjust domain crawler setup script to add a variable for heap memory
- document in mediacat domain crawler
- Alejandro will update 2 tickets
- Benchmarking
- Puppeteer update: to 2.2 (we have 1.5); should take 2-3 days
- re-do small domain crawl
- finish documenting where different data are on our server
- finding language function
- image_reference function
- documenting two scopes on padlet & update the list of keys provided on JSON export from the post-processor
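
For the heap-memory action item above, here is a minimal sketch of how a setup script might turn a configurable value into a heap setting. It assumes the domain crawler runs on Node.js, where `--max-old-space-size` sets the V8 old-generation heap limit in MB. The variable name `CRAWLER_HEAP_MB` and both helper functions are hypothetical and are not taken from the MediaCAT code.

```typescript
// Hypothetical variable name; the notes do not specify one.
const HEAP_ENV_VAR = "CRAWLER_HEAP_MB";

// 4 GB, the original heap setting mentioned in the notes.
const DEFAULT_HEAP_MB = 4096;

/**
 * Parse a heap size in MB from a raw string (for example the value of the
 * CRAWLER_HEAP_MB environment variable). Falls back to the default when the
 * value is missing, empty, non-numeric, fractional, or not positive.
 */
function resolveHeapMb(
  raw: string | undefined,
  fallback: number = DEFAULT_HEAP_MB
): number {
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const parsed = Number(raw);
  const isPositiveInteger =
    isFinite(parsed) && Math.floor(parsed) === parsed && parsed > 0;
  return isPositiveInteger ? parsed : fallback;
}

/** Build the Node.js flag that limits the V8 old-generation heap (in MB). */
function heapFlag(heapMb: number): string {
  return "--max-old-space-size=" + heapMb;
}

// Example: a sysadmin sets the variable to 7168 (7 GB) when setting up the server.
const exampleFlag = heapFlag(resolveHeapMb("7168"));
```

The resulting flag could then be passed on the `node` command line or through `NODE_OPTIONS`. Falling back to the default on bad input is one possible design choice; a setup script could equally stop with an error so that a misconfigured server is noticed early.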