pa11ycrawler is a Scrapy spider that runs a Pa11y check on every page of an Open edX installation, to audit it for accessibility purposes. It will store the result of each page audit in a data directory as a set of JSON files, which can be transformed into a beautiful HTML report.
To start a crawl, run:
scrapy crawl edx
There are several options for this spider that you can configure using
Scrapy's -a flag. These options can be combined by specifying the -a flag
multiple times. For example:
scrapy crawl edx -a domain=courses.edx.org -a port=80
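If you launch the crawler from a script, the repeated -a flags can be generated from a dictionary of options. The following is a minimal sketch; `build_crawl_command` is a hypothetical helper and not part of pa11ycrawler. It only builds the argument list shown above and does not run anything.

```python
import shlex


def build_crawl_command(spider="edx", **options):
    """Return the argv list for `scrapy crawl`, with one -a flag per option.

    Hypothetical helper for illustration; option names are passed through
    unchanged, so they must match the names the spider accepts.
    """
    argv = ["scrapy", "crawl", spider]
    for name, value in options.items():
        # Each spider argument gets its own "-a name=value" pair.
        argv += ["-a", f"{name}={value}"]
    return argv


command = build_crawl_command(domain="courses.edx.org", port=80)
print(shlex.join(command))
# scrapy crawl edx -a domain=courses.edx.org -a port=80
```

The resulting list can be passed to `subprocess.run(command, check=True)` to start the crawl.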
If an email and password are not specified, then pa11ycrawler will use the "auto auth" feature in Open edX to create a staff user, and crawl as that user. This assumes that the "auto auth" feature is enabled; if it is not, the crawler cannot crawl unless an email and password are set.
http_pass arguments are used for HTTP Basic Auth. If
either of these is unset, pa11ycrawler will not attempt to use HTTP Basic Auth.
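The "both or nothing" rule for HTTP Basic Auth can be sketched as follows. This is an illustration under stated assumptions, not pa11ycrawler's own code: `basic_auth_header` and its parameter names are hypothetical, and "unset" is taken to mean `None`.

```python
import base64


def basic_auth_header(user=None, password=None):
    """Return an Authorization header dict, or None if either value is unset.

    Hypothetical helper: mirrors the rule that Basic Auth is only attempted
    when both credentials are provided.
    """
    if user is None or password is None:
        return None
    # Basic Auth encodes "user:password" as base64.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

For example, `basic_auth_header("user", "pass")` yields a header, while `basic_auth_header(password="pass")` yields `None`, so no Basic Auth is attempted.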
allow you to specify a YAML file, or the URL to a YAML file, containing
pa11y ignore rules. These rules are used to indicate that certain
output from pa11y has been manually checked, and can be safely ignored.
The data_dir option determines where this crawler saves its
output. pa11ycrawler runs each page of the site through pa11y,
encodes the result as JSON, and saves it as a file in this directory.
The data directory is "data" by default, which means the crawler creates a
directory named "data" in whatever directory you run it from.
Whatever directory you specify is created automatically if it does
not yet exist. The crawler will never delete data from the data directory,
so if you want to clear it out between runs, that's your responsibility.
There is a make clean-data task available in the Makefile, which just runs
rm -rf data.
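The data directory behavior described above (create on demand, write JSON, never delete) can be sketched like this. It is a rough illustration only: `save_result`, the file naming scheme, and the shape of `result` are assumptions, not the crawler's actual pipeline.

```python
import json
from pathlib import Path


def save_result(result, name, data_dir="data"):
    """Write one page's audit result as JSON inside data_dir.

    Hypothetical helper: the directory is created if missing, and existing
    files from earlier runs are left in place.
    """
    directory = Path(data_dir)
    # parents/exist_ok make this safe whether or not the directory exists.
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.json"
    path.write_text(json.dumps(result), encoding="utf-8")
    return path
```

Because nothing here removes old files, results accumulate across runs until the directory is cleared by hand or with the Makefile task.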
The single_url option makes the spider crawl only one web page.
The result is evaluated through the pipeline, but the spider will not continue crawling.
Transform to HTML
This project comes with a script that can transform the data in this
data directory into a pretty HTML table. The script is installed as
pa11ycrawler-html and it accepts two optional arguments:
--output-dir. These arguments default to "data"
and "html", respectively.
You can also run the script with the
--help argument to get more information.
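To give a feel for what such a transformation involves, here is a simplified sketch that reads every JSON file in a data directory and writes a small HTML table. It is not the pa11ycrawler-html script: `render_table`, the `index.html` file name, and the table columns are assumptions, and the real report is built from the contents of each audit rather than a simple entry count.

```python
import html
import json
from pathlib import Path


def render_table(data_dir="data", output_dir="html"):
    """Summarize each JSON file in data_dir as one row of an HTML table.

    Hypothetical sketch: the schema of the JSON files is not assumed, so the
    second column only counts top-level entries.
    """
    rows = []
    for path in sorted(Path(data_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        size = len(record) if isinstance(record, (list, dict)) else 1
        rows.append(f"<tr><td>{html.escape(path.name)}</td><td>{size}</td></tr>")

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    page = (
        "<table>\n<tr><th>File</th><th>Top-level entries</th></tr>\n"
        + "\n".join(rows)
        + "\n</table>\n"
    )
    target = out / "index.html"
    target.write_text(page, encoding="utf-8")
    return target
```

The default arguments mirror the "data" and "html" defaults mentioned above.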
Cleaning Data & HTML
This project comes with a
Makefile with a
clean-data task and a
task. The former will delete the
data directory in the current working
directory, and the latter will delete the
html directory in the current
working directory. These are the default locations for pa11ycrawler's data and
HTML. However, if you configure pa11ycrawler to output data and/or HTML
to a different location, these tasks have no way of knowing where
the data and HTML are located on your computer,
and will not be able to remove them for you automatically.
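If you do write output to custom locations, a small script can take over the cleanup. The sketch below is an assumption-laden illustration: `remove_output` is a hypothetical helper, and you must pass it the same directories you configured for the crawler and the HTML script.

```python
import shutil
from pathlib import Path


def remove_output(*directories):
    """Delete each given directory tree and return the paths that were removed.

    Hypothetical helper: directories that do not exist are skipped, similar
    to how `rm -rf` ignores missing paths.
    """
    removed = []
    for directory in directories:
        path = Path(directory)
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(path)
    return removed
```

Calling `remove_output("data", "html")` covers the default locations; substitute your own paths if you changed them. The deletion is permanent, so double-check the arguments.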
To remove data from the default location, run:
make clean-data
To remove HTML from the default location, run:
This project has tests for the pipeline functions, where the main
functionality of this crawler lives. To run those tests, run
make test. You can also run
scrapy check edx to test that the
scraper is scraping data correctly.