
July 28, 2022

alejandropaz edited this page Jul 28, 2022 · 2 revisions

Agenda

  • flagging issue -- any insights?
  • documentation on the new order of postprocessor input: conversion to pandas first, then conversion to Dask
    • next week consult with Nat about this
  • on Jul 19: resume the WaPo/Foxnews twitter crawl
  • visualization
  • Alejandro will send scope for Israeli and Palestinian news domains
  • Twitter embedding issue - this week
  • to discuss next meeting:
    • how to cut a release
    • writing a paper about MediaCAT and architecture

text alias issue

  • problem with text alias
  • re-run postprocessing on KPP/MediaCAT data

Flagging

  • the URL expander seems to have set off alarm bells; the question is whether there is something we could do differently
  • information leakage flag: attempt to mimic one of their client's websites
  • very difficult to say what is getting us flagged
  • Python requests library: behaves like a crawler, trying to get the URL as best as possible, not using a headless browser
    • not a real person
    • how much slower with headless browser?
      • slightly faster than domain crawler
      • headless browser in python - yes, but easier to re-use the headless browser we have
    • how long to develop headless browser URL expander?
      • probably a week
  • flag went up due to automated function
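
For reference, the requests-based approach discussed above can be sketched roughly as follows. This is a minimal illustration and not the MediaCAT URL expander itself: the names `http_location` and `expand_url`, the 10-second timeout, and the hop limit are all assumptions made for the example. It follows redirects one hop at a time with plain HTTP requests, which is the kind of non-browser traffic that may be contributing to the flagging.

```python
from typing import Callable, Optional
from urllib.parse import urljoin


def http_location(url: str) -> Optional[str]:
    """Return the redirect target for a single hop, or None if there is none.

    Hypothetical helper: uses the third-party `requests` library, imported
    lazily so the loop in expand_url can be exercised without network access.
    """
    import requests

    resp = requests.head(url, allow_redirects=False, timeout=10)
    if resp.is_redirect:
        # Location may be relative, so resolve it against the current URL.
        return urljoin(url, resp.headers["Location"])
    return None


def expand_url(
    url: str,
    next_hop: Callable[[str], Optional[str]] = http_location,
    max_hops: int = 10,
) -> str:
    """Follow redirects hop by hop and return the last URL reached.

    Stops when there is no further redirect, when a redirect loop is
    detected, or after max_hops hops (an assumed safety limit).
    """
    seen = {url}
    for _ in range(max_hops):
        target = next_hop(url)
        if target is None or target in seen:
            return url
        seen.add(target)
        url = target
    return url
```

Passing a different `next_hop` (for example, one backed by the existing headless browser) would leave the loop unchanged, which is one way the two approaches could be compared.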

Twitter embedding issue

  • this is complicated: need to download a lot of tweets in order to look at the problem
  • need to get to this this week

documentation

  • file system documentation - to show where every file is
  • explaining Compute Canada - Shengsong updated: restart instance, transfer files, back up files, Compute Canada map, temp issue, how to completely rebuild an instance if something goes wrong, and setting up SSH

cutting a release

  • update documentation
  • in every repository, need same version number
  • need release policy or strategy
  • need automated environment to download everything
  • the versioning guidelines determine how long it will take
  • could release as different parts
    • document what is needed to run the entire thing
  • need to find the stale branches and remove them
  • can get a DOI - for each repo
  • we will evaluate next week, probably do early September

Visualization

  • error for stacked area graph: was a simple label problem; it now produces correct graphs
  • D3 vector diagram: produces an HTML file, which is then interactive
  • when A is back in Toronto, Shengsong and A will record session about how to set up the environment

Crawls

  • WaPo/Foxnews: will be restarted today
  • postprocessing NYT archive politics: stopped, will be re-started, didn't lose what was done
  • small domain crawler: still running, 1.6 million
  • the Guardian, still running

Action Items

  • work on headless browser URL expander
  • Twitter embed issue
  • code cleanup on D3 vector diagram
  • re-do the KPP postprocessing
  • restart WaPo/Foxnews twitter crawl
  • restart the postprocessing of NYT politics archive
  • send new graphs for KPP data

Backburner

  • Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
  • using crawler proxies
  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • what to do with htz.li
  • finding language function
  • image_reference function
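
For when the two proposed postprocessor additions come off the backburner (non-scope .co.il hyperlinks, and links to tweets or twitter handles), a rough sketch of the link classification could look like the following. Everything here is an assumption for illustration: the function name `classify_link`, the label strings, the list of Twitter hostnames, and the reading of "ends in .co.il" as applying to the link's hostname. It is not part of the current postprocessor.

```python
from typing import Iterable, Optional
from urllib.parse import urlparse

# Assumed set of hostnames that count as Twitter links.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def classify_link(url: str, scope_domains: Iterable[str]) -> Optional[str]:
    """Label a hyperlink as 'tweet', 'twitter_handle', 'out_of_scope_co_il', or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host in TWITTER_HOSTS:
        parts = [p for p in parsed.path.split("/") if p]
        # e.g. /someuser/status/123 is treated as a tweet
        if len(parts) >= 3 and parts[1] == "status":
            return "tweet"
        # A single path segment is treated as a handle. This is a rough
        # heuristic: pages such as /home would also match.
        if len(parts) == 1:
            return "twitter_handle"
        return None

    # A host is in scope if it equals a scope domain or is a subdomain of one.
    in_scope = any(host == d or host.endswith("." + d) for d in scope_domains)
    if host.endswith(".co.il") and not in_scope:
        return "out_of_scope_co_il"
    return None
```

Keeping this as a separate pass over extracted hyperlinks would leave the regular postprocessor output untouched until the team decides whether the extra columns are wanted.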