July 12, 2022
alejandropaz edited this page Jul 13, 2022 · 2 revisions
- finalize testing of the new postprocessor and merge to master if it works
- start NYT Politics Archive postprocessing if postprocessor is done
- continue learning D3 for edge-node
- start new crawl with the Twitter accounts Alejandro will send
- meet with Alejandro to finalize looking at datasets
- Twitter: embedded tweet issue
- to discuss next meeting:
- how to cut a release
- writing a paper about MediaCAT and architecture
- testing of the new postprocessor is complete; it is now in the repo "postprocessor"
- in GitHub, the old postprocessor was in the repo "mediacat_backend"; it will be moved to a new repo with a note that it is no longer in use
- the old backend has a few utils that are now available directly in the new crawler, e.g. get all URLs
- started NYT Politics Archive postprocessing
- low input - domain crawls have very lengthy plain text, so the input has to be converted to pandas first and then to Dask in order to be processed (Dask gives errors otherwise)
- Shengsong will consult with Nat about this issue
- it shouldn't add a lot of processing time
- Shengsong will complete documentation
- WaPo/Fox News Twitter crawl: URL expander is running
- can add the other tweets when the crawl is finalized after July 19
- Twitter: the WaPo/Fox News crawl reached the maximum of 10 million per month, because each embedded or replied-to tweet is also counted
- restart the Twitter crawl on July 19, when the quota refreshes
- small domain: still going, 1.1 million
- The Guardian: 1.2 million
- added blacklist for comments section
- not yet
- D3 visualizations: in testing
- documentation on the new order of postprocessor input, with conversion to pandas before conversion to Dask
- next week consult with Nat about this
- on Jul 19: resume the WaPo/Fox News Twitter crawl
- Alejandro will send scope for Israeli and Palestinian news domains
- Twitter embedding issue - this week
- Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
- using crawler proxies
- adding to regular postprocessor output:
- any non-scope domain hyperlink that ends in .co.il
- any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the back burner for now.
- what to do with htz.li
- finding language function
- image_reference function
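The two link rules listed above under "adding to regular postprocessor output" could be sketched as follows. This is an illustration only: `classify_link`, the label strings, and the URL patterns are assumptions, not existing postprocessor code, and the scope check here is a plain exact match on the host with a leading `www.` removed.

```python
import re
from urllib.parse import urlparse

# Hosts treated as Twitter (assumed list).
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
# /<handle>/status/<id> is taken to be a link to a single tweet.
TWEET_RE = re.compile(r"^/[A-Za-z0-9_]{1,15}/status(es)?/\d+")
# /<handle> alone is taken to be a link to an account.
# Note: reserved paths such as /home or /search would also match.
HANDLE_RE = re.compile(r"^/@?[A-Za-z0-9_]{1,15}/?$")


def classify_link(url, scope_domains):
    """Return a label for links that should be added to the output, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in TWITTER_HOSTS:
        if TWEET_RE.match(parsed.path):
            return "tweet"
        if HANDLE_RE.match(parsed.path):
            return "twitter_handle"
        return None
    # Drop a leading "www." before comparing against the scope.
    bare = host[4:] if host.startswith("www.") else host
    if bare.endswith(".co.il") and bare not in scope_domains:
        return "co_il_out_of_scope"
    return None


scope = {"haaretz.co.il"}
print(classify_link("https://www.ynet.co.il/news/article/abc", scope))
print(classify_link("https://twitter.com/haaretzcom", scope))
```

Short links such as htz.li are not handled here; they would need to be expanded before classification, which is still an open question in the notes.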