February 17, 2022

alejandropaz edited this page Feb 17, 2022 · 6 revisions

Agenda

  • Shengsong update on postprocessor csv creation process
    • updating postprocessor category names and adding "title" of URL-article
  • Shengsong will develop reader for scope to start Twitter crawl
  • Colin will commit csv processing to media back-end repo

Committing CSV processing

Error-checking: problem extracting plain text

Postprocessor and CSV creation

  • Shengsong streamlined and modified the find-citation alias function
  • Postprocessor currently
    • finds URL-articles & Tweets with relevant citations (either a text alias or a hyperlink) and creates a row for each, but does not yet store and list the relevant citations -- Shengsong will correct this so it does
      • Shengsong modified the find-alias function
    • includes columns for language and image reference, which we will remove
      • the image-reference info is there, but we will put it on the back burner
    • does not include a column for the article title, which we will add
  • Changes to postprocessor column names:
    • change the name of "url or alias text" to "url"
    • change the name of "name/title" to "name"
    • change the name of "citation name/title" to "citation name"
  • New terminology:
    • MediaCAT takes 2 kinds of scope:
    1. crawl_scope = the set of domains (for the domain crawler) and/or Twitter accounts (for the Twitter API crawler) to be crawled
    2. citation_scope = the set of news sites and Twitter accounts the user wants the postprocessor to find in the crawled data; this is input to the postprocessor
    • The crawl scope and citation scope can be the same or different, depending on the user's needs.
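The column changes agreed above (renames, dropping the language and image-reference columns, and adding an article-title column) could be sketched with pandas; the column values and the `title` placeholder below are hypothetical, and the actual postprocessor code may differ:

```python
import pandas as pd
from io import StringIO

# Hypothetical postprocessor output using the old column names
old_csv = StringIO(
    "url or alias text,name/title,citation name/title,language,image reference\n"
    "https://example.com/a,Article A,Cited Site,en,img1.png\n"
)
df = pd.read_csv(old_csv)

# Rename columns as agreed in the meeting
df = df.rename(columns={
    "url or alias text": "url",
    "name/title": "name",
    "citation name/title": "citation name",
})

# Drop the language and image-reference columns (back-burnered)
df = df.drop(columns=["language", "image reference"])

# Add the new article-title column (placeholder value for illustration)
df["title"] = ""

print(list(df.columns))
# ['url', 'name', 'citation name', 'title']
```

This is only a sketch of the schema change, not the postprocessor itself; the real code would fill `title` from the crawled URL-article.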

Twitter API crawl

  • Shengsong has read through the Twitter API crawl but has not yet had a chance to begin coding

Action Items

  • Shengsong will create 2 tickets:
    • readability and plain-text extraction debugging - priority
    • JS heap memory error
  • Alejandro will add language about the two crawl scopes to the MVP

Backburner

  • Benchmarking
  • re-do small domain crawl
  • finish documenting where different data are on our server
  • finding language function
  • image_reference function