NYT Crawler

Media Analytics is a web app that lets anyone query a large corpus of journalistic data using natural language processing tools. My role in the project was data collection, specifically articles from the New York Times archive. The NLP model needed to report word-usage frequency over more than 100 years, which required collecting millions of articles. To do this, I learned the Scrapy web crawling framework in depth and wrote a Spider that crawls the NYT archive, follows the relevant links, and scrapes the required items from each article.

Built With

  • Scrapy - Web crawling framework

Authors

  • Fawaz Dinnunhan