July 28, 2022
alejandropaz edited this page Jul 28, 2022
- flagging issue -- any insights?
- documentation on the new order of postprocessor input: conversion to pandas first, then conversion to Dask
- next week consult with Nat about this
- on Jul 19: resume the WaPo/Foxnews Twitter crawl
- visualization
- Alejandro will send scope for Israeli and Palestinian news domains
- Twitter embedding issue - this week
- to discuss next meeting:
- how to cut a release
- writing a paper about MediaCAT and architecture
- problem with text alias
- re-run postprocessing on KPP/MediaCAT data
- the URL expander seems to have set off alarm bells, question is if there's something we could do differently
- information leakage flag: attempt to mimic one of their client's websites
- very difficult to say what is getting us flagged
- Python requests library: like a crawler, trying to get the URL as best as possible, not using a headless browser
- not a real person
- how much slower with headless browser?
- slightly faster than domain crawler
- headless browser in Python - yes, but easier to re-use the headless browser we have
- how long to develop headless browser URL expander?
- probably a week
- flag went up due to automated function
- this is complicated: need to download a lot of tweets in order to look at the problem
- need to get to this week
- file system documentation - to show where every file is
- explaining Compute Canada - Shengsong updated: restart instance, transfer files, back up files, Compute Canada map, temp issue and how to completely rebuild an instance if something goes wrong, setting up SSH
- update documentation
- in every repository, need same version number
- need release policy or strategy
- need automated environment to download everything
- the versioning guidelines determine how long it will take
- could release as different parts
- document what is needed to run the entire thing
- need to find the stale branches and remove
- can get a DOI - for each repo
- we will evaluate next week, probably do early September
- error for stacked area graph: simple label problem, and produced correct graphs
- D3 vector diagram: produces an HTML file, which is then interactive
- when A is back in Toronto, Shengsong and A will record session about how to set up the environment
- WaPo/Foxnews: will re-start today
- postprocessing NYT archive politics: stopped, will be re-started, didn't lose what was done
- small domain crawler: still running, 1.6 million
- the Guardian, still running
- work on headless browser URL expander
- Twitter embed issue
- code cleanup on D3 vector diagram
- re-do the KPP postprocessing
- restart WaPo/Foxnews twitter crawl
- restart the postprocessing of NYT politics archive
- send new graphs for KPP data
- Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
- using crawler proxies
- adding to regular postprocessor output:
- any non-scope domain hyperlink that ends in .co.il
- any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the backburner for now.
- what to do with htz.li
- finding language function
- image_reference function
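
The two proposed additions to the regular postprocessor output noted above (non-scope hyperlinks ending in .co.il, and links to tweets or Twitter handles) could be sketched roughly as follows. This is only an illustration and not the MediaCAT postprocessor's actual code: the function names, the `TWITTER_HOSTS` and `RESERVED_PATHS` sets, and the example domains are all hypothetical, and the rule for what counts as "in scope" (exact host or a subdomain of a scope domain) is an assumption.

```python
from urllib.parse import urlparse

# Assumption: hosts treated as Twitter; the real list may differ.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
# Assumption: first path segments that are site pages, not user handles.
RESERVED_PATHS = {"home", "search", "hashtag", "i", "intent", "share"}


def is_out_of_scope_co_il(url, scope_domains):
    """True if the URL's host ends in .co.il and is not covered by the scope."""
    host = (urlparse(url).hostname or "").lower()
    if not host.endswith(".co.il"):
        return False
    for domain in scope_domains:
        domain = domain.lower()
        # Assumed scope rule: the scope domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return False
    return True


def twitter_reference(url):
    """Return ("tweet", id), ("handle", name), or None for other links."""
    parsed = urlparse(url)
    if (parsed.hostname or "").lower() not in TWITTER_HOSTS:
        return None
    parts = [p for p in parsed.path.split("/") if p]
    # Tweet URLs look like /<handle>/status/<numeric id>.
    if len(parts) >= 3 and parts[1] == "status" and parts[2].isdigit():
        return ("tweet", parts[2])
    # A single path segment that is not a reserved page is taken as a handle.
    if len(parts) == 1 and parts[0].lower() not in RESERVED_PATHS:
        return ("handle", parts[0])
    return None


def extra_output(links, scope_domains):
    """Collect the two extra categories from a list of hyperlinks."""
    result = {"co_il_out_of_scope": [], "twitter": []}
    for link in links:
        if is_out_of_scope_co_il(link, scope_domains):
            result["co_il_out_of_scope"].append(link)
        ref = twitter_reference(link)
        if ref is not None:
            result["twitter"].append((link,) + ref)
    return result
```

Matching on the parsed hostname rather than on the raw string avoids counting links that merely contain ".co.il" somewhere in the path. Shortened links such as htz.li would not be caught by either check until they are expanded.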