April 21, 2022
- finalize clean-up, updating, and documentation of NYT crawl methods
- look at retweet/tweet issue
- re-run KPP/MediaCAT twitter crawl
- run small domain crawl with information from Alejandro
- Alejandro: think through proposals for CDHI conference
- Alejandro: find new time for weekly meeting
- workshop
- paper from Alejandro based on NYT crawl
- fixed the issue and now running on KPP/MediaCAT
- plain text: the text that follows `RT @user` is now captured and stored under its own key (see the sketch below)
- need to look at an example with a comment before the RT
- the comment before the RT should not be duplicated
- data should be available for checking in the next day or so.
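A minimal sketch of the retweet-splitting logic above, assuming the tweet text arrives as a plain string; `splitRetweet` and its key names are illustrative, not the crawler's actual API.

```typescript
interface RetweetParts {
  comment: string | null;     // text the user added before "RT @user:"
  retweetText: string | null; // text that follows "RT @user:"
}

function splitRetweet(text: string): RetweetParts {
  // Capture an optional leading comment, then the body after "RT @handle:".
  const match = text.match(/^(.*?)\s*RT @\w{1,15}:?\s*(.*)$/s);
  if (!match) return { comment: null, retweetText: null };
  const [, comment, body] = match;
  // Store the comment once only, so it is not duplicated alongside the RT body.
  return { comment: comment || null, retweetText: body };
}

// Example: a comment before the RT is kept in its own key.
console.log(splitRetweet("Worth reading RT @user: original tweet text"));
// -> { comment: "Worth reading", retweetText: "original tweet text" }
```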
- from email:
- The general domain crawler ended up crawling 703,641 articles from NYTimes, and the NYTimes search crawler crawled 251,716 articles (without duplicates) from the three search URLs provided.
- the NYT site crawl stopped at 700,000+ articles because the Puppeteer queue ran down to 0
- what is the earliest date?
- With the increase in the number and variety of crawled articles, some new problems have appeared.
- Fixed Issues:
- The NYTimes search crawler sometimes stopped scrolling after crawling about 10,000 articles, without reporting any error. This is a Puppeteer bug; the workaround was to restart the crawler after every 5,000 articles (see the sketch below).
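A hedged sketch of that restart workaround, assuming a Puppeteer-based search crawler; `crawlBatch` is a hypothetical stand-in for the project's actual scroll-and-collect routine.

```typescript
import puppeteer, { Page } from "puppeteer";

// Hypothetical: one scroll-and-collect pass that stops after `limit` articles.
type CrawlBatch = (page: Page, limit: number, alreadySeen: number) => Promise<string[]>;

const BATCH_SIZE = 5000; // restart before the ~10,000-article scroll hang appears

async function crawlWithRestarts(searchUrl: string, crawlBatch: CrawlBatch): Promise<string[]> {
  const urls: string[] = [];
  while (true) {
    // Launch a fresh browser per batch; the hang only shows up in long sessions.
    const browser = await puppeteer.launch();
    try {
      const page = await browser.newPage();
      await page.goto(searchUrl, { waitUntil: "networkidle2" });
      const batch = await crawlBatch(page, BATCH_SIZE, urls.length);
      urls.push(...batch);
      if (batch.length < BATCH_SIZE) break; // search results exhausted
    } finally {
      await browser.close(); // always release the browser, even on error
    }
  }
  return urls;
}
```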
- Unresolved Issues:
- The post-processor hit an odd memory error when trying to use multithreading to process the 251,716 URLs from the three NYTimes search URLs. The good news is that the single-threaded post-processor works fine. The output is in the attachment.
- There are many different types of URLs from the NYTimes general crawler, and some cause Readability to get stuck when extracting plain text (the meta-scraper got plain text for about 120,000 of 703,641 URLs before getting stuck). I therefore added a 5-second timeout per URL: if plain-text extraction has not finished after 5 s, it skips that URL and continues with the others (this brought the meta-scraper to roughly 300,000 of 703,641 URLs). However, the meta-scraper was still unable to get plain text for all 703,641 articles due to a memory error (see the timeout sketch below).
- now resolved: metascraper was checking for duplicates, and the file of URLs grew too big, causing the memory issue; it no longer checks for duplicates, and if there is a duplicate it simply overwrites the earlier entry.
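A sketch of that 5-second cap and the overwrite-on-duplicate behaviour, assuming an async extractor; `getPlainText` here is a hypothetical stand-in for the metascraper/Readability call, not the project's actual function.

```typescript
// Resolve to null if the extractor has not finished within `ms` milliseconds,
// so one pathological URL cannot stall the whole run.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T | null> {
  const timeout = new Promise<null>((resolve) => setTimeout(() => resolve(null), ms));
  return Promise.race([promise, timeout]);
}

async function extractAll(
  urls: string[],
  getPlainText: (url: string) => Promise<string> // hypothetical extractor
): Promise<Record<string, string>> {
  const texts: Record<string, string> = {};
  for (const url of urls) {
    const text = await withTimeout(getPlainText(url), 5000);
    if (text === null) continue; // skip URLs that time out and move on
    // No duplicate check: a repeated URL simply overwrites its earlier entry,
    // which avoids keeping an ever-growing list of seen URLs in memory.
    texts[url] = text;
  }
  return texts;
}
```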
- now run postprocessor on NYT site crawl
- seems like author extraction is better in the new archive crawl
- update by email
- assessment of any updates needed for libraries
- postprocessor issue with text alias
- finalize clean-up, updating, and documentation of NYT crawl methods
- assessment of any updates needed for libraries
- KPP/MediaCAT postprocessed results
- postprocess the NYT site crawl; think about why the NYT crawl cut off
- adding to regular postprocessor output (see the sketch below):
- any out-of-scope domain hyperlink that ends in .co.il
- any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the back burner for now.
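If this does get picked up, something along these lines could work, assuming the post-processor already has each article's hyperlinks as strings; the function and field names are illustrative only.

```typescript
function classifyExtraLinks(hyperlinks: string[]) {
  const coIlLinks: string[] = [];    // out-of-scope domains ending in .co.il
  const twitterLinks: string[] = []; // links to tweets or twitter handles
  for (const link of hyperlinks) {
    try {
      const { hostname } = new URL(link);
      if (hostname.endsWith(".co.il")) coIlLinks.push(link);
      if (hostname === "twitter.com" || hostname.endsWith(".twitter.com")) {
        twitterLinks.push(link); // both /user and /user/status/<id> live here
      }
    } catch {
      // ignore malformed URLs rather than failing the whole pass
    }
  }
  return { coIlLinks, twitterLinks };
}
```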
- how to get multithreading working in the postprocessor (see the sketch at the end of this page)
- what to do with htz.li
- small domain crawl
- Benchmarking
- finish documenting where different data are on our server
- language-detection function
- image_reference function
- dealing with embedded versus cited tweets
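On the open multithreading question above, one option in Node is `worker_threads` with the URL list split across a small fixed pool of workers, which adds parallelism without unbounded memory. This is only a sketch under that assumption (and assuming the file is compiled before being handed to the workers); `processUrl` is a hypothetical placeholder for the real per-URL work.

```typescript
import { Worker, isMainThread, parentPort, workerData } from "node:worker_threads";

// Hypothetical per-URL work; the real post-processing logic would go here.
function processUrl(url: string): { url: string; length: number } {
  return { url, length: url.length };
}

// Run one worker over one slice of the URL list and collect its results.
function runChunk(chunk: string[]): Promise<unknown[]> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(__filename, { workerData: chunk });
    worker.on("message", resolve);
    worker.on("error", reject);
  });
}

if (isMainThread) {
  const urls = process.argv.slice(2); // e.g. node postprocess.js <url...>
  const nWorkers = 4; // small fixed pool: parallelism without unbounded memory
  const perWorker = Math.ceil(urls.length / nWorkers);
  const jobs: Promise<unknown[]>[] = [];
  for (let i = 0; i < urls.length; i += perWorker) {
    jobs.push(runChunk(urls.slice(i, i + perWorker)));
  }
  Promise.all(jobs).then((parts) => {
    console.log("processed", parts.flat().length, "URLs");
  });
} else {
  // Worker side: process this thread's slice and report back to the main thread.
  parentPort?.postMessage((workerData as string[]).map(processUrl));
}
```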