[TuringBench](https://alan-turing-institute.github.io/data-science-benchmarking/) Benchmarking workflow for [ReadabiliPy](https://github.com/alan-turing-institute/ReadabiliPy)
====

**Software:** [ReadabiliPy](https://github.com/alan-turing-institute/ReadabiliPy) - A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-python mode.

**Benchmarks:** Benchmark the speed of core package functions at extracting information from an input HTML with [pytest](https://pypi.org/project/pytest-benchmark/)

This benchmarking notebook follows the TuringBench workflow outlined at [https://alan-turing-institute.github.io/data-science-benchmarking/](https://alan-turing-institute.github.io/data-science-benchmarking/)

Creating benchmarks
----

To get started with the TuringBench workflow, I created a branch of the ReadabiliPy repository (called ```benchmarking```) and added benchmarks to the pre-existing pytest tests, which was easy with the [pytest-benchmark](https://pypi.org/project/pytest-benchmark/) package, making sure to also add this package in the ```requirements-dev.txt``` used by ReadabiliPy.

I followed the pytest-benchmark instructions, setting up benchmarks for some of the ReadabiliPy package functions, which get run alongside existing tests with the command ```pytest```. We can also run the benchmarks only (ignoring other tests) with ```pytest --benchmark-only```.

Once happy with the benchmarks I wanted, I committed these changes and pushed to them to the remote ```benchmarking``` branch on GitHub.

Building a Docker image for Benchmarking ReadabiliPy
----

The Dockerfile below installs the requirements for ReadabiliPy and pulls the latest commit of the ```benchmarking``` branch from GitHub, then runs the benchmarks with pytest.

In [11]:
%%writefile Dockerfile
FROM python:3

# Install requirements
RUN apt-get update
RUN apt-get -y install curl
RUN curl -sL https://deb.nodesource.com/setup_11.x | bash -
RUN apt install nodejs
RUN npm install
RUN pip install --upgrade pip
RUN apt-get install -y git

# Clone ReadabiliPy and install python packages
RUN git clone -b benchmarking --single-branch https://github.com/alan-turing-institute/ReadabiliPy
WORKDIR "/ReadabiliPy"
RUN pip install -r requirements-dev.txt

# Run the benchmarks with Pytest
CMD pytest --benchmark-only


Overwriting Dockerfile


### Build

In [12]:
%%bash
docker build -t edwardchalstrey/readabilipy_benchmark:latest .

Sending build context to Docker daemon  41.98kB
Step 1/12 : FROM python:3
 ---> ac069ebfe1e1
Step 2/12 : RUN apt-get update
 ---> Using cache
 ---> 5a84d23aa7b5
Step 3/12 : RUN apt-get -y install curl
 ---> Using cache
 ---> fa727cce5ef4
Step 4/12 : RUN curl -sL https://deb.nodesource.com/setup_11.x | bash -
 ---> Using cache
 ---> 6072028d5b8f
Step 5/12 : RUN apt install nodejs
 ---> Using cache
 ---> c493d1b01b96
Step 6/12 : RUN npm install
 ---> Using cache
 ---> c952a91935a8
Step 7/12 : RUN pip install --upgrade pip
 ---> Using cache
 ---> 92e402750d57
Step 8/12 : RUN apt-get install -y git
 ---> Using cache
 ---> 1b10afe3306c
Step 9/12 : RUN git clone -b benchmarking --single-branch https://github.com/alan-turing-institute/ReadabiliPy
 ---> Using cache
 ---> cf99a8e9e4c9
Step 10/12 : WORKDIR "/ReadabiliPy"
 ---> Using cache
 ---> 6560e34a1b59
Step 11/12 : RUN pip install -r requirements-dev.txt
 ---> Using cache
 ---> dbdfb0d07fbf
Step 12/12 : CMD pytest --benchmark-only
 ---> U

In [15]:
%%bash
docker run edwardchalstrey/readabilipy_benchmark:latest

platform linux -- Python 3.7.2, pytest-4.4.1, py-1.8.0, pluggy-0.9.0
benchmark: 3.2.2 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /ReadabiliPy
plugins: cov-2.6.1, benchmark-3.2.2
collected 173 items

tests/test_benchmarking.py ..                                            [  1%]
tests/test_extractors.py s                                               [  1%]
tests/test_html_elements.py ssssssssssssssssssssssssssssssssssssssssssss [ 27%]
sssssssssssssssssssssssssssssssssssssssssssssssssssssssssss              [ 61%]
tests/test_javascript.py ss                                              [ 62%]
tests/test_json_parser.py sssss                                          [ 65%]
tests/test_normal_html.py ssssssssssssssssss                             [ 75%]
tests/test_plain_html_functions.py sss                                   [ 77%]
tests/test_readability.py sssssssss

In [14]:
%%bash
docker push edwardchalstrey/readabilipy_benchmark:latest

The push refers to repository [docker.io/edwardchalstrey/readabilipy_benchmark]
2e7dd3b4feb6: Preparing
3e1c1d117536: Preparing
6ff62ff8b98f: Preparing
dec915b951fe: Preparing
519fad9ce2ed: Preparing
1543960dda15: Preparing
e0bdf915dbf4: Preparing
6dc4e4c587a1: Preparing
65ef2276d16f: Preparing
4b381ae03f9a: Preparing
08a5b66845ac: Preparing
88a85bcf8170: Preparing
65860ac81ef4: Preparing
a22a5ac18042: Preparing
6257fa9f9597: Preparing
578414b395b9: Preparing
abc3250a6c7f: Preparing
13d5529fd232: Preparing
6257fa9f9597: Waiting
4b381ae03f9a: Waiting
08a5b66845ac: Waiting
abc3250a6c7f: Waiting
578414b395b9: Waiting
13d5529fd232: Waiting
88a85bcf8170: Waiting
1543960dda15: Waiting
65860ac81ef4: Waiting
a22a5ac18042: Waiting
6dc4e4c587a1: Waiting
65ef2276d16f: Waiting
2e7dd3b4feb6: Layer already exists
3e1c1d117536: Layer already exists
519fad9ce2ed: Layer already exists
6ff62ff8b98f: Layer already exists
dec915b951fe: Layer already exists
6dc4e4c587a1: Layer already exists
65ef2276d16f: 

Automated build setup
----

Options:
1. Use Docker Hub (I have requested permission to create automated builds for Alan Turing repos)
2. Use an Azure Container Registry

Edited Dockerfile to install master branch

(should be a copy of the above with ```RUN git clone https://github.com/alan-turing-institute/ReadabiliPy```)

In [None]:
%%writefile Dockerfile
FROM python:3

# Install requirements
RUN apt-get update
RUN apt-get -y install curl
RUN curl -sL https://deb.nodesource.com/setup_11.x | bash -
RUN apt install nodejs
RUN npm install
RUN pip install --upgrade pip
RUN apt-get install -y git

# Clone ReadabiliPy and install python packages
RUN git clone https://github.com/alan-turing-institute/ReadabiliPy
WORKDIR "/ReadabiliPy"
RUN pip install -r requirements-dev.txt

# Run the benchmarks with Pytest
CMD pytest --benchmark-only