March 17, 2022

Agenda

new policy for URL extender
- URL extender issues: double shortening and retweets
finalize puppeteer update
finalize crawl of timelines for KPP/MediaCAT: 60,001+ tweets
make postprocessor able to read Twitter API output
Alejandro/RA checking output from postprocessor extraction of plain text

retweets: URL not included
- used the request library to fetch the original URL from the original tweet
extender errors:
- added code that follows every working shortened URL to find the original URL
need to send a request for every shortened URL, and another to get the original URL from a retweeted tweet, which slows the crawl down
- perhaps slow down the requests to ensure we aren't blocked
- each request from puppeteer actually takes a few seconds
send support request to Twitter
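
A minimal sketch of the extender logic discussed above, in Python for illustration only (the actual extender uses request/puppeteer). It follows a shortened URL one redirect at a time so that double-shortened links resolve fully, and pauses between requests to reduce the chance of being blocked. The names expand_url, http_lookup, delay_seconds and max_hops are hypothetical, and the example URLs are made up.

```python
import http.client
import time
from typing import Callable, Optional
from urllib.parse import urljoin, urlsplit

# Hypothetical type: given a URL, return its redirect target,
# or None when the URL does not redirect.
RedirectLookup = Callable[[str], Optional[str]]


def http_lookup(url: str, timeout: float = 10.0) -> Optional[str]:
    """Send one HEAD request and return the Location header of a redirect.

    Assumption: the shortener answers HEAD requests; some services only
    redirect on GET, in which case this lookup would need to change.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        if response.status in (301, 302, 303, 307, 308):
            return response.getheader("Location")
        return None
    finally:
        conn.close()


def expand_url(
    url: str,
    lookup: RedirectLookup = http_lookup,
    delay_seconds: float = 1.0,
    max_hops: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Follow redirects hop by hop until a URL no longer redirects."""
    seen = {url}
    current = url
    for hop in range(max_hops):
        if hop > 0:
            sleep(delay_seconds)  # throttle every request after the first
        target = lookup(current)
        if target is None:
            return current  # no further redirect: this is the original URL
        target = urljoin(current, target)  # Location may be relative
        if target in seen:
            return current  # redirect loop: stop at the last new URL
        seen.add(target)
        current = target
    return current  # gave up after max_hops lookups


if __name__ == "__main__":
    # Offline demo with a made-up redirect table (a double-shortened link).
    table = {
        "https://t.co/a": "https://bit.ly/b",
        "https://bit.ly/b": "https://example.com/article",
    }
    print(expand_url("https://t.co/a", lookup=table.get, delay_seconds=0))
```

Passing the lookup in as a parameter keeps the throttling and loop handling testable without network access; delay_seconds is the knob to raise if requests start getting blocked.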

apify v1 patch - apify uses puppeteer; documentation exists for v1 but not for v2
- v2: tried it and it produced bugs
apify v1 is functioning fine, no benchmark yet
Shengsong will check for documentation on a weekly basis

Shengsong to Alejandro: send support request to Twitter API team about retweets where the URL is not included, and also about shortened URLs returned as expanded URLs
postprocessor for Twitter API output
nytimes.com crawl benchmarking