Who Wrote This News Crawler
Web crawler for news articles from a subset of sources to power an open source news search engine.
Used in "Machine Learning Techniques for Detecting Identifying Linguistic Patterns in the News Media" by A Samuel Pottinger, this web crawler parses RSS feeds from a list of news agencies and saves the articles it finds to a SQLite database.
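The parse-and-save flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in news_crawler.py: it uses only the Python standard library, and the table schema, the column names, the parse_rss and save_articles helpers, and the sample feed are all assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema; the project's real table definition may differ.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS articles (
    link TEXT PRIMARY KEY,
    source TEXT,
    title TEXT,
    description TEXT,
    published TEXT
)
"""

# Made-up RSS 2.0 feed so the sketch runs without network access.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example News</title>
  <item>
    <title>First story</title>
    <link>https://example.com/first</link>
    <description>Summary of the first story.</description>
    <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>
  </item>
  <item>
    <title>Second story</title>
    <link>https://example.com/second</link>
    <description>Summary of the second story.</description>
    <pubDate>Tue, 02 Jan 2024 00:00:00 GMT</pubDate>
  </item>
</channel></rss>"""


def parse_rss(source, rss_text):
    """Return one dict per <item> element in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "source": source,
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return articles


def save_articles(connection, articles):
    """Insert articles, skipping any link that is already stored."""
    connection.execute(CREATE_SQL)
    connection.executemany(
        "INSERT OR IGNORE INTO articles "
        "(link, source, title, description, published) "
        "VALUES (:link, :source, :title, :description, :published)",
        articles,
    )
    connection.commit()
```

Calling save_articles(sqlite3.connect("articles.db"), parse_rss("example", SAMPLE_FEED)) would store both sample items. Keying the hypothetical table on the article link is one simple way to keep repeated crawls from storing the same article twice.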
This requires Python 3 and pip to be installed for your platform. Once both are available, install the dependencies with
$ pip install -r requirements.txt
This set of scripts can be run from the command line with
$ python news_crawler.py
It writes to articles.db, a SQLite database in the same directory, and expects the table to have been created using
Some automated tests are available and can be run with
Please unit test and follow the Google Python Style Guide where possible.
Note that this is one in a series of related projects:
- who-wrote-this-training: logic for machine learning.
- who-wrote-this-server: web application to demo the model.
- who-wrote-this-news-crawler: crawler to record RSS feeds.
This application's source is released under the MIT License. The following open source libraries are used internally: