Web crawler to collect snapshots of articles for a web archive.
See the main project for the project board and issues.
- `chromedriver` - needed by Selenium. See installation instructions.
- `node.js` - needed both for some `ReadabiliPy` tests and to avoid Cloudflare protections. See installation instructions.
- (optional) Microsoft SQL drivers - needed only if recording the crawl in the Azure database, not if writing output to a local file. See the sections below on "Using pyodbc on macOS" and "How to install Microsoft SQL Server drivers".
## Using pyodbc on macOS

If you are not using the latest version of macOS, you may get an `sql.h not found` error when installing the `pyodbc` dependency via pip. This is because there is no compiled wheel for your version of macOS. The options are to either (i) upgrade to the latest version of macOS or (ii) install the `unixodbc` driver libraries using `brew install unixodbc`.
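A minimal sketch of option (ii), assuming Homebrew is already installed (the `--no-binary` flag forces pip to compile `pyodbc` from source against the newly installed unixodbc headers):

```sh
brew install unixodbc
pip install --no-binary=pyodbc pyodbc
```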
## How to install Microsoft SQL Server drivers

- Install Homebrew if you have not already: `/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"`
- Add the Microsoft tap: `brew tap microsoft/mssql-release https://github.com/Microsoft/homebrew-mssql-release`
- Update Homebrew: `brew update`
- Install the Microsoft SQL ODBC driver: `brew install openssl msodbcsql17 mssql-tools`
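To confirm that the driver registered with unixODBC, you can query the driver list with the `odbcinst` tool that ships with unixodbc (the driver name shown should match the one used in the database configuration below):

```sh
odbcinst -j     # print the locations of the unixODBC configuration files
odbcinst -q -d  # list registered drivers; "ODBC Driver 17 for SQL Server" should appear
```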
- Check out the code from GitHub.
- Install the Python dependencies for this package by running:
  - as a user of this project: `pip install -r requirements.txt`
  - as a developer of this project: `pip install -r requirements-dev.txt`
- Ensure that `ReadabiliPy` is installed by running `git submodule update --init --recursive`
- Install the Python dependencies for `ReadabiliPy` by typing `pip install -r ReadabiliPy/requirements.txt`
- (Optional) Install the `node.js` dependencies for `ReadabiliPy` by entering the `ReadabiliPy` directory and typing `npm install` (a combined sequence is sketched after this list)
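Putting these steps together, a typical installation might look like the following (the repository URL and checkout directory name are placeholders, not taken from this README; note that `git clone --recursive` also initialises the `ReadabiliPy` submodule):

```sh
git clone --recursive <repository-url>       # placeholder: use the project's GitHub URL
cd misinformation-crawler                    # assumed checkout directory name
pip install -r requirements.txt              # or requirements-dev.txt for developers
pip install -r ReadabiliPy/requirements.txt  # ReadabiliPy Python dependencies
(cd ReadabiliPy && npm install)              # optional: node.js dependencies
```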
Site configurations for 107 sites are included in `misinformation/site_configs.yml`. Crawled articles are saved one file per site in `articles/`. The actual number of articles returned may be slightly higher than the requested limit, due to the number of parallel requests that scrapy has open at any time.
To crawl all sites: `python crawl.py --all -n <max articles per site>` (the limit is optional; all articles are crawled if it is omitted).
To crawl a single site: `python crawl.py --site <site name> -n <max articles per site>` (the limit is optional; all articles are crawled if it is omitted).
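For example, to crawl up to 10 articles from every configured site, or all articles from one site (the site name here is a hypothetical entry from `site_configs.yml`):

```sh
python crawl.py --all -n 10
python crawl.py --site example-news-site
```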
To crawl a list of article URLs: `python crawl.py --list <path to file>` (the file must be in CSV format with an `article_url` column and a `site_name` column).
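A minimal example of such a file (the URLs and site name are placeholders, and `site_name` is assumed to be the exact column header):

```csv
article_url,site_name
https://example.com/news/story-1,example-news-site
https://example.com/news/story-2,example-news-site
```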
To run tests, run `python -m pytest` from the repository root.
In order to record the crawl in the Azure database, you will need to create a file at `secrets/db_config.yaml` inside the top-level `misinformation-crawler` directory. This should look like the following:

```yaml
driver: ODBC Driver 17 for SQL Server
server: misinformation.database.windows.net
database: misinformation
user: database-crawler-user
password: <password>
```

where the password is obtained from the Azure key vault for the database, using `az keyvault secret show --vault-name misinformation-user --name database-crawler-user`. The crawler can then be run using `python crawl.py --all -e blob`.
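As a quick sanity check of this configuration, you could open a connection yourself with `pyodbc` (a hedged sketch, not part of the crawler itself; the crawler's own database code may differ):

```python
import yaml
import pyodbc

# Load the same secrets file the crawler uses
with open("secrets/db_config.yaml") as f:
    cfg = yaml.safe_load(f)

# Build a standard ODBC connection string from the config values
conn = pyodbc.connect(
    f"DRIVER={{{cfg['driver']}}};SERVER={cfg['server']};"
    f"DATABASE={cfg['database']};UID={cfg['user']};PWD={cfg['password']}"
)
print(conn.getinfo(pyodbc.SQL_DBMS_NAME))  # should report Microsoft SQL Server
conn.close()
```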
To update to the latest version of `ReadabiliPy`:

- Navigate to the `ReadabiliPy` folder with `cd ReadabiliPy`
- Ensure you are on the `master` branch with `git checkout master`
- Pull the latest version with `git pull`
- Install the dependencies with `pip install -r requirements.txt`
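Equivalently, as a single sequence from the repository root:

```sh
cd ReadabiliPy
git checkout master
git pull
pip install -r requirements.txt
cd ..
```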