February 17, 2022
alejandropaz edited this page Feb 17, 2022
- Shengsong gave an update on the postprocessor CSV creation process
- updating postprocessor category names and adding the "title" of each URL-article
- Shengsong will develop reader for scope to start Twitter crawl
- Colin will commit the CSV processing code to the mediacat-backend repo
- Here is the committed repo: https://github.com/UTMediaCAT/mediacat-backend/tree/master/csv_processing
- how to crawl large amounts of data without getting plain-text errors
- could it be the readability function? https://github.com/mozilla/readability
- https://github.com/UTMediaCAT/mediacat-domain-crawler/blob/4837c5b018e2e8f1543374f64fad579e301f0ff7/newCrawler/crawlCheerio.js#L77
- https://github.com/UTMediaCAT/mediacat-domain-crawler/blob/8235e672dde8c87b45a526491c6dcbcaf9983908/newCrawler/crawl.js#L201
- track down the extraction function and echo (log) the plain text before it gets saved
- also check whether the problem is in the queue: open the SQLite database with a tool to see how keys are being stored
- also check the Puppeteer GitHub repo for relevant updates to the Puppeteer code
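One way to do the echo step above is a small guard that logs the extracted plain text just before the crawler saves it. This is a hypothetical sketch, not code from the crawler: the function name is made up, and the `article` shape (`{ textContent }`) is assumed to mirror what Mozilla Readability's `parse()` returns.

```javascript
// Hypothetical debugging helper: call this right before the crawler saves an
// article, so empty or garbled plain-text extractions surface in the logs.
// The `article` shape ({ textContent }) mirrors Mozilla Readability's parse().
function checkPlainText(url, article) {
  const text = article && article.textContent ? article.textContent.trim() : '';
  if (text.length === 0) {
    console.error(`[plain-text-debug] EMPTY extraction for ${url}`);
    return false;
  }
  console.log(
    `[plain-text-debug] ${url}: ${text.length} chars, starts: "${text.slice(0, 60)}"`
  );
  return true;
}
```

If the log shows empty text here, the problem is in extraction (Readability or the HTML Puppeteer returned); if the text is fine here but broken on disk, the problem is downstream in the save path or the queue.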
- probably a separate issue: Error: "FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory"
- al-monitor/original/ - heap memory: likely a memory leak - our code allocated memory and did not release what it had stored temporarily once it finished
- this error only came up once
- a JS (Node) error, not a Puppeteer error
- there is a workaround of increasing the heap limit, but the crawl can still fail
- when doing a shorter crawl (e.g., URLs with /FA/ for Farsi), the error rate was much lower
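For the leak hunt, a minimal sketch (the helper name is made up, not from the crawler code) is to log `process.memoryUsage()` at checkpoints, e.g., after each crawl batch; a `heapUsed` that keeps climbing after batches finish points at memory that is never released. The workaround mentioned above is Node's `--max-old-space-size` flag, but that only delays the crash if there is a real leak.

```javascript
// Hypothetical leak-hunting helper: log Node's heap usage at checkpoints
// (e.g., after each crawl batch). Steadily growing heapUsed across batches
// suggests memory that is not being released when a batch finishes.
function logHeap(label) {
  const { heapUsed, heapTotal } = process.memoryUsage();
  const mb = (bytes) => (bytes / 1024 / 1024).toFixed(1);
  console.log(`[heap] ${label}: ${mb(heapUsed)} MB used / ${mb(heapTotal)} MB total`);
  return heapUsed; // returned so trends can also be tracked programmatically
}
```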
- Shengsong streamlined and modified the find citation alias
- Postprocessor currently
- finds URL-articles & Tweets with relevant citations (either text alias or hyperlink) and creates a row for them, but does not store and list the relevant citations -- Shengsong will correct this second part so it does
- Shengsong modified the find alias function
- includes columns for language and image reference, which we will remove
- the image-reference info is there, but we will put it on the back burner
- does not include a column for article title, which we will include
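A minimal sketch of the fix discussed above: record *which* citations matched, not only that the document matched something. The field names and matching logic here are assumptions for illustration, not the actual postprocessor code.

```javascript
// Hypothetical sketch: build a postprocessor row that stores the list of
// matched citations (text aliases or hyperlinks), rather than only flagging
// that the document matched. Field names are illustrative.
function buildRow(doc, scopeAliases) {
  const found = [];
  for (const alias of scopeAliases) {
    const inText = doc.text.includes(alias);
    const inLinks = doc.links.some((link) => link.includes(alias));
    if (inText || inLinks) found.push(alias);
  }
  // Only emit a row when at least one relevant citation was found.
  return found.length > 0 ? { url: doc.url, 'citation name': found } : null;
}
```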
- Changes to postprocessor column names:
- change the name of "url or alias text" to "url"
- change the name of "name/title" to "name"
- change the name of "citation name/title" to "citation name"
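The renames above amount to a simple header mapping; a sketch of applying it to a CSV header row (illustrative, not the actual postprocessor code):

```javascript
// Hypothetical sketch of the agreed column renames applied to a CSV header row.
const RENAMES = {
  'url or alias text': 'url',
  'name/title': 'name',
  'citation name/title': 'citation name',
};

// Unmapped headers pass through unchanged.
const renameHeader = (headers) => headers.map((h) => RENAMES[h] || h);

console.log(renameHeader(['url or alias text', 'name/title', 'citation name/title', 'language']));
// → ['url', 'name', 'citation name', 'language']
```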
- New terminology:
- MediaCAT takes 2 kinds of scope:
- crawl_scope = a set of domains (for the domain crawler) and/or twitter accounts (for the twitter API crawler) to be crawled
- citation_scope = the scope of news sites and twitter accounts which the user wants the postprocessor to find in the crawled data; this is the input to the postprocessor
- Crawl scope and citation scope can be different or the same depending on the needs of the user.
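As an illustration of the two scope kinds, the shapes below are assumptions for clarity, not MediaCAT's actual input format:

```javascript
// Illustrative only - the real scopes are user-supplied input files.
const crawlScope = {
  domains: ['aljazeera.com'],        // fed to the domain crawler
  twitterAccounts: ['@AJEnglish'],   // fed to the Twitter API crawler
};
const citationScope = {
  domains: ['nytimes.com'],          // sites to find cited in the crawled data
  twitterAccounts: ['@nytimes'],     // accounts to find cited
};
// crawlScope and citationScope may be identical or differ, per the user's needs.
```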
- Shengsong read through the crawl code but has not yet had a chance to begin coding
- Shengsong will create 2 tickets:
- readability and plain-text debugging - priority
- JS heap memory error
- Alejandro will add language about the two kinds of scope (crawl and citation) to the MVP
- Benchmarking
- re-do small domain crawl
- finish documenting where different data are on our server
- finding language function
- image_reference function