February 17, 2022
alejandropaz edited this page Feb 17, 2022
- Shengsong gave an update on the postprocessor CSV creation process
- updating postprocessor category names and adding the "title" of each URL-article
- Shengsong will develop reader for scope to start Twitter crawl
- Colin will commit the CSV processing code to the mediacat-backend repo
- Here is the committed repo: https://github.com/UTMediaCAT/mediacat-backend/tree/master/csv_processing
- how to crawl large amounts of data without getting plain-text errors
- could it be the readability function? https://github.com/mozilla/readability
- https://github.com/UTMediaCAT/mediacat-domain-crawler/blob/4837c5b018e2e8f1543374f64fad579e301f0ff7/newCrawler/crawlCheerio.js#L77
- https://github.com/UTMediaCAT/mediacat-domain-crawler/blob/8235e672dde8c87b45a526491c6dcbcaf9983908/newCrawler/crawl.js#L201
- track down the extraction function and echo (log) the plain text before it gets saved
- also check whether the problem is in the queue: open the SQLite database with a tool to see how keys are being stored
- also check the Puppeteer GitHub repo for relevant updates to the Puppeteer code
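One way to do the echo step above is a small guard that logs the extracted plain text just before the crawler saves it. This is a hypothetical sketch, not code from the crawler: the function name is made up, and the `article` shape (`{ textContent }`) is assumed to mirror what Mozilla Readability's `parse()` returns.

```javascript
// Hypothetical debugging helper: call this right before the crawler saves an
// article, so empty or garbled plain-text extractions surface in the logs.
// The `article` shape ({ textContent }) mirrors Mozilla Readability's parse().
function checkPlainText(url, article) {
  const text = article && article.textContent ? article.textContent.trim() : '';
  if (text.length === 0) {
    console.error(`[plain-text-debug] EMPTY extraction for ${url}`);
    return false;
  }
  console.log(
    `[plain-text-debug] ${url}: ${text.length} chars, starts: "${text.slice(0, 60)}"`
  );
  return true;
}
```

If the log shows empty text here, the problem is in extraction (Readability or the HTML Puppeteer returned); if the text is fine here but broken on disk, the problem is downstream in the save path or the queue.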
- probably a separate issue: Error: "FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory"
- al-monitor/original/ - heap memory: likely a memory leak - our code allocated memory and did not release what it had stored temporarily once it finished
- this error only came up once
- a JS (Node) error, not a Puppeteer error
- there is a workaround of increasing the heap limit, but the crawl can still fail
- when doing a shorter crawl (e.g., URLs with /FA/ for Farsi), the error rate was much lower
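For the leak hunt, a minimal sketch (the helper name is made up, not from the crawler code) is to log `process.memoryUsage()` at checkpoints, e.g., after each crawl batch; a `heapUsed` that keeps climbing after batches finish points at memory that is never released. The workaround mentioned above is Node's `--max-old-space-size` flag, but that only delays the crash if there is a real leak.

```javascript
// Hypothetical leak-hunting helper: log Node's heap usage at checkpoints
// (e.g., after each crawl batch). Steadily growing heapUsed across batches
// suggests memory that is not being released when a batch finishes.
function logHeap(label) {
  const { heapUsed, heapTotal } = process.memoryUsage();
  const mb = (bytes) => (bytes / 1024 / 1024).toFixed(1);
  console.log(`[heap] ${label}: ${mb(heapUsed)} MB used / ${mb(heapTotal)} MB total`);
  return heapUsed; // returned so trends can also be tracked programmatically
}
```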
- Shengsong streamlined and modified the find citation alias
- Postprocessor currently
- finds URL-articles & Tweets with relevant citations (either text alias or hyperlink) and creates a row for them, but does not store and list the relevant citations -- Shengsong will correct this second part so it does
- Shengsong modified the find alias function
- includes columns for language and image reference, which we will remove
- the image-reference info is there, but we will put it on the back burner
- does not include a column for article title, which we will include
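A minimal sketch of the fix discussed above: record *which* citations matched, not only that the document matched something. The field names and matching logic here are assumptions for illustration, not the actual postprocessor code.

```javascript
// Hypothetical sketch: build a postprocessor row that stores the list of
// matched citations (text aliases or hyperlinks), rather than only flagging
// that the document matched. Field names are illustrative.
function buildRow(doc, scopeAliases) {
  const found = [];
  for (const alias of scopeAliases) {
    const inText = doc.text.includes(alias);
    const inLinks = doc.links.some((link) => link.includes(alias));
    if (inText || inLinks) found.push(alias);
  }
  // Only emit a row when at least one relevant citation was found.
  return found.length > 0 ? { url: doc.url, 'citation name': found } : null;
}
```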
- Changes to postprocessor column names:
- change the name of "url or alias text" to "url"
- change the name of "name/title" to "name"
- change the name of "citation name/title" to "citation name"
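The renames above amount to a simple header mapping; a sketch of applying it to a CSV header row (illustrative, not the actual postprocessor code):

```javascript
// Hypothetical sketch of the agreed column renames applied to a CSV header row.
const RENAMES = {
  'url or alias text': 'url',
  'name/title': 'name',
  'citation name/title': 'citation name',
};

// Unmapped headers pass through unchanged.
const renameHeader = (headers) => headers.map((h) => RENAMES[h] || h);

console.log(renameHeader(['url or alias text', 'name/title', 'citation name/title', 'language']));
// → ['url', 'name', 'citation name', 'language']
```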
- New terminology:
- MediaCAT takes 2 kinds of scope:
- crawl_scope = a set of domains (for the domain crawler) and/or twitter accounts (for the twitter API crawler) to be crawled
- citation_scope = the scope of news sites and twitter accounts which the user wants the postprocessor to find in the crawled data; this is the input to the postprocessor
- Crawl scope and citation scope can be different or the same depending on the needs of the user.
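As an illustration of the two scope kinds, the shapes below are assumptions for clarity, not MediaCAT's actual input format:

```javascript
// Illustrative only - the real scopes are user-supplied input files.
const crawlScope = {
  domains: ['aljazeera.com'],        // fed to the domain crawler
  twitterAccounts: ['@AJEnglish'],   // fed to the Twitter API crawler
};
const citationScope = {
  domains: ['nytimes.com'],          // sites to find cited in the crawled data
  twitterAccounts: ['@nytimes'],     // accounts to find cited
};
// crawlScope and citationScope may be identical or differ, per the user's needs.
```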
- Shengsong read through the crawl code but has not yet had a chance to begin coding
- Shengsong will create 2 tickets:
- readability and plain-text debugging - priority
- JS heap memory error
- Alejandro will add language about the two kinds of scope (crawl and citation) to the MVP
- Benchmarking
- re-do small domain crawl
- finish documenting where different data are on our server
- finding language function
- image_reference function