Web crawler to collect snapshots of articles for a web archive.
See the main project for the project board and issues.
- `chromedriver` - needed by Selenium. See installation instructions.
- `node.js` - needed both for some `ReadabiliPy` tests and to avoid Cloudflare protections. See installation instructions.
- (optional) Microsoft SQL drivers - needed only if recording the crawl in the Azure database, not if writing output to a local file. See the sections below on "Using pyodbc on macOS" and "How to install Microsoft SQL Server drivers".
## Using pyodbc on macOS

If you are not using the latest version of macOS, you may get an `sql.h not found` error when installing the `pyodbc` dependency via pip. This is because there is no compiled wheel for your version of macOS. The options are to either (i) upgrade to the latest version of macOS or (ii) install the `unixodbc` driver libraries using `brew install unixodbc`.
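A minimal sketch of option (ii), assuming Homebrew is already installed (the `--no-binary` flag forces pip to compile `pyodbc` from source against the newly installed unixodbc headers):

```sh
brew install unixodbc
pip install --no-binary=pyodbc pyodbc
```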
## How to install Microsoft SQL Server drivers

- Install Homebrew if you have not already: `/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"`
- Add the Microsoft tap: `brew tap microsoft/mssql-release https://github.com/Microsoft/homebrew-mssql-release`
- Update Homebrew: `brew update`
- Install the Microsoft SQL ODBC driver: `brew install openssl msodbcsql17 mssql-tools`
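To confirm that the driver registered with unixODBC, you can query the driver list with the `odbcinst` tool that ships with unixodbc (the driver name shown should match the one used in the database configuration below):

```sh
odbcinst -j     # print the locations of the unixODBC configuration files
odbcinst -q -d  # list registered drivers; "ODBC Driver 17 for SQL Server" should appear
```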
- Check out the code from GitHub.
- Install the Python dependencies for this package by running:
  - as a user of this project: `pip install -r requirements.txt`
  - as a developer of this project: `pip install -r requirements-dev.txt`
- Ensure that `ReadabiliPy` is installed by running `git submodule update --init --recursive`
- Install the Python dependencies for `ReadabiliPy` by typing `pip install -r ReadabiliPy/requirements.txt`
- (Optional) Install the `node.js` dependencies for `ReadabiliPy` by entering the `ReadabiliPy` directory and typing `npm install` (a combined sequence is sketched after this list)
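Putting these steps together, a typical installation might look like the following (the repository URL and checkout directory name are placeholders, not taken from this README; note that `git clone --recursive` also initialises the `ReadabiliPy` submodule):

```sh
git clone --recursive <repository-url>       # placeholder: use the project's GitHub URL
cd misinformation-crawler                    # assumed checkout directory name
pip install -r requirements.txt              # or requirements-dev.txt for developers
pip install -r ReadabiliPy/requirements.txt  # ReadabiliPy Python dependencies
(cd ReadabiliPy && npm install)              # optional: node.js dependencies
```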
Site configurations for 107 sites are included in `misinformation/site_configs.yml`. Crawled articles are saved one file per site in `articles/`. The actual number of articles returned may be slightly higher than the requested limit, due to the number of parallel requests that scrapy has open at any time.
To crawl all sites: `python crawl.py --all -n <max articles per site>` (the limit is optional; all articles are crawled if it is omitted).
To crawl a single site: `python crawl.py --site <site name> -n <max articles per site>` (the limit is optional; all articles are crawled if it is omitted).
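For example, to crawl up to 10 articles from every configured site, or all articles from one site (the site name here is a hypothetical entry from `site_configs.yml`):

```sh
python crawl.py --all -n 10
python crawl.py --site example-news-site
```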
To crawl a list of article URLs: `python crawl.py --list <path to file>` (the file must be in CSV format with an `article_url` column and a `site_name` column).
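A minimal example of such a file (the URLs and site name are placeholders, and `site_name` is assumed to be the exact column header):

```csv
article_url,site_name
https://example.com/news/story-1,example-news-site
https://example.com/news/story-2,example-news-site
```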
To run tests, run `python -m pytest` from the repository root.
In order to record the crawl in the Azure database, you will need to create a file at `secrets/db_config.yaml` inside the top-level `misinformation-crawler` directory. This should look like the following:

```yaml
driver: ODBC Driver 17 for SQL Server
server: misinformation.database.windows.net
database: misinformation
user: database-crawler-user
password: <password>
```

where the password is obtained from the Azure key vault for the database, using `az keyvault secret show --vault-name misinformation-user --name database-crawler-user`. The crawler can then be run using `python crawl.py --all -e blob`.
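As a quick sanity check of this configuration, you could open a connection yourself with `pyodbc` (a hedged sketch, not part of the crawler itself; the crawler's own database code may differ):

```python
import yaml
import pyodbc

# Load the same secrets file the crawler uses
with open("secrets/db_config.yaml") as f:
    cfg = yaml.safe_load(f)

# Build a standard ODBC connection string from the config values
conn = pyodbc.connect(
    f"DRIVER={{{cfg['driver']}}};SERVER={cfg['server']};"
    f"DATABASE={cfg['database']};UID={cfg['user']};PWD={cfg['password']}"
)
print(conn.getinfo(pyodbc.SQL_DBMS_NAME))  # should report Microsoft SQL Server
conn.close()
```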
To update to the latest version of `ReadabiliPy`:

- Navigate to the `ReadabiliPy` folder with `cd ReadabiliPy`
- Ensure you are on the `master` branch with `git checkout master`
- Pull the latest version with `git pull`
- Install the dependencies with `pip install -r requirements.txt`
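Equivalently, as a single sequence from the repository root:

```sh
cd ReadabiliPy
git checkout master
git pull
pip install -r requirements.txt
cd ..
```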