March 17, 2022
alejandropaz edited this page Mar 17, 2022
- new policy for url extender
- URL extender issues: double shortening and retweets
- finalize puppeteer update
- finalize crawl of timelines for KPP/MediaCAT: 60,001+ tweets
- make postprocessor able to read twitter API output
- Alejandro/RA checking output from postprocessor extraction of plain text
- retweets: URL not included
- used the `request` library to fetch the original URL from the original tweet
- extender errors:
- added code that resolves every working shortened URL to its original URL
- need to send a request for every shortened URL, and another to get the original URL from a retweeted tweet, which slows things down
- perhaps slow down the requests to ensure we aren't blocked
- each request from puppeteer actually takes a few seconds
- send support request to Twitter
- apify v1 patch: apify uses puppeteer; there is documentation for v1 but not for v2
- v2: tried it and it produced bugs
- apify v 1 is functioning fine, no benchmark yet
- Shengsong will check for documentation on a weekly basis
- we have 3 files of less than 1 million rows with the entire corpus
- not yet
- Shengsong to Alejandro: send support request to Twitter API team about cases without url included from retweets, and also about shortened urls as expanded
- postprocessor for Twitter API output
- nytimes.com crawl benchmarking
- what to do with htz.li
- small domain crawl
- Benchmarking
- finish documenting where different data are on our server
- finding language function
- image_reference function
- dealing with embedded versus cited tweets
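
The URL extender items above (double shortening, one request per shortened URL, and slowing requests down to avoid being blocked) can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's extender code. The names `expand_url`, `http_lookup`, `max_hops`, and `delay_seconds` are hypothetical, and the one-second delay is an assumed value, not a measured safe rate.

```python
import http.client
import time
from typing import Callable, Optional
from urllib.parse import urljoin, urlsplit

# Hypothetical lookup type: given a URL, return its redirect target,
# or None when the URL does not redirect any further.
RedirectLookup = Callable[[str], Optional[str]]


def http_lookup(url: str) -> Optional[str]:
    """Send one HEAD request and return the Location target, if any."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        if response.status in (301, 302, 303, 307, 308):
            location = response.getheader("Location")
            # Location may be relative, so resolve it against the current URL.
            return urljoin(url, location) if location else None
        return None
    finally:
        conn.close()


def expand_url(
    url: str,
    lookup: RedirectLookup = http_lookup,
    max_hops: int = 5,            # assumed cap on chained shorteners
    delay_seconds: float = 1.0,   # assumed pause between requests
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Follow redirects one hop at a time so a doubly shortened URL
    (a short link pointing at another short link) is fully expanded."""
    seen = {url}
    current = url
    for hop in range(max_hops):
        if hop > 0:
            sleep(delay_seconds)  # throttle every request after the first
        target = lookup(current)
        if target is None:
            return current        # no further redirect: fully expanded
        if target in seen:
            break                 # redirect loop: stop at the last new URL
        seen.add(target)
        current = target
    return current
```

Passing `lookup` and `sleep` as parameters keeps the hop logic testable without network access; for example, a dictionary's `.get` method can stand in for `http_lookup` when checking how a chain such as a `t.co` link pointing at an `htz.li` link is resolved.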