[TuringBench](https://alan-turing-institute.github.io/data-science-benchmarking/) Benchmarking workflow for [ReadabiliPy](https://github.com/alan-turing-institute/ReadabiliPy)
====

**Software:** [ReadabiliPy](https://github.com/alan-turing-institute/ReadabiliPy) - A simple HTML content extractor in Python. It can be run as a wrapper for Mozilla's Readability.js package or in pure-Python mode.

**Benchmarks:** Benchmark how quickly core package functions extract information from an input HTML file, using [pytest-benchmark](https://pypi.org/project/pytest-benchmark/)

This benchmarking notebook follows the TuringBench workflow outlined at [https://alan-turing-institute.github.io/data-science-benchmarking/](https://alan-turing-institute.github.io/data-science-benchmarking/)

Creating benchmarks
----

To get started with the TuringBench workflow, I created a branch of the ReadabiliPy repository (called ```benchmarking```) and added benchmarks to the pre-existing pytest tests. This was easy with the [pytest-benchmark](https://pypi.org/project/pytest-benchmark/) package, which I also added to the ```requirements-dev.txt``` file used by ReadabiliPy.

I followed the pytest-benchmark instructions, setting up benchmarks for some of the ReadabiliPy package functions, which get run alongside existing tests with the command ```pytest```. We can also run the benchmarks only (ignoring other tests) with ```pytest --benchmark-only```.

Once I was happy with the benchmarks, I committed these changes and pushed them to the remote ```benchmarking``` branch on GitHub.

Building a Docker image for Benchmarking ReadabiliPy
----

The Dockerfile below installs the requirements for ReadabiliPy and pulls the latest commit of the ```benchmarking``` branch from GitHub, then runs the benchmarks with pytest.

*Note: After the benchmarking branch of the project is merged, the ```git clone``` command will need to be edited to clone the master branch (see the post-development version of the Dockerfile at the end of this notebook)*

In [3]:
%%writefile Dockerfile
FROM python:3

# Install requirements
RUN apt-get update
RUN apt-get -y install curl
RUN curl -sL https://deb.nodesource.com/setup_11.x | bash -
RUN apt install nodejs
RUN npm install
RUN pip install --upgrade pip
RUN apt-get install -y git

# Clone ReadabiliPy and install python packages
RUN git clone -b benchmarking --single-branch https://github.com/alan-turing-institute/ReadabiliPy
WORKDIR "/ReadabiliPy"
RUN git pull
RUN pip install -r requirements-dev.txt

# Run the benchmarks with Pytest
CMD pytest --benchmark-only


Overwriting Dockerfile


### Build

In [8]:
%%bash
docker build -t edwardchalstrey/readabilipy_benchmark:latest .

Sending build context to Docker daemon  52.74kB
Step 1/13 : FROM python:3
 ---> ac069ebfe1e1
Step 2/13 : RUN apt-get update
 ---> Using cache
 ---> 5a84d23aa7b5
Step 3/13 : RUN apt-get -y install curl
 ---> Using cache
 ---> fa727cce5ef4
Step 4/13 : RUN curl -sL https://deb.nodesource.com/setup_11.x | bash -
 ---> Using cache
 ---> 6072028d5b8f
Step 5/13 : RUN apt install nodejs
 ---> Using cache
 ---> c493d1b01b96
Step 6/13 : RUN npm install
 ---> Using cache
 ---> c952a91935a8
Step 7/13 : RUN pip install --upgrade pip
 ---> Using cache
 ---> 92e402750d57
Step 8/13 : RUN apt-get install -y git
 ---> Using cache
 ---> 1b10afe3306c
Step 9/13 : RUN git clone -b benchmarking --single-branch https://github.com/alan-turing-institute/ReadabiliPy
 ---> Using cache
 ---> cf99a8e9e4c9
Step 10/13 : WORKDIR "/ReadabiliPy"
 ---> Using cache
 ---> 6560e34a1b59
Step 11/13 : RUN git pull
 ---> Using cache
 ---> 2ec7fdcee516
Step 12/13 : RUN pip install -r requirements-dev.txt
 ---> Using cache
 ---

Run the containerised benchmarks
----

After pushing the image to a remote registry (e.g. to Docker Hub with ```docker push edwardchalstrey/readabilipy_benchmark:latest```), we can ```docker run``` the ReadabiliPy benchmarks on computing platforms of our choosing, comparing results across platforms and as new features are added.
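
One way to make the cross-platform comparison less manual is to save each run with pytest-benchmark's ```--benchmark-json``` option and compare the mean times per benchmark. The sketch below is only an outline: the function names are hypothetical, and the JSON layout it reads (a top-level ```benchmarks``` list whose entries carry a ```name``` and a ```stats.mean``` value in seconds) is an assumption that should be checked against the installed version of the plugin.

```python
import json


def load_means(path):
    """Map benchmark name -> mean time (seconds) from a pytest-benchmark JSON file.

    Assumes the layout {"benchmarks": [{"name": ..., "stats": {"mean": ...}}, ...]}.
    """
    with open(path) as f:
        report = json.load(f)
    return {b["name"]: b["stats"]["mean"] for b in report["benchmarks"]}


def compare_means(baseline, candidate):
    """Return {name: candidate_mean / baseline_mean} for benchmarks present in both."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {name: candidate[name] / baseline[name] for name in shared}
```

For example, with two hypothetical result files, ```compare_means(load_means("macbook.json"), load_means("container.json"))``` gives a ratio per benchmark, where a value below 1 means the second run had the lower mean time.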

### Results

I have benchmarked three of the HTML parsing features of ReadabiliPy on an example HTML file; see the tests in ```tests/test_benchmarking.py``` in the [ReadabiliPy](https://github.com/alan-turing-institute/ReadabiliPy/tree/master) repo.

The benchmarks run on the following dates correspond to these [ReadabiliPy](https://github.com/alan-turing-institute/ReadabiliPy/tree/master) commits:
1. 2019-05-02 => [9ba2fdb7...](https://github.com/alan-turing-institute/ReadabiliPy/commit/9ba2fdb71b3b014f3252a29672ff41159203e45c)
2. 2019-05-14 => [d3b3c365...](https://github.com/alan-turing-institute/ReadabiliPy/commit/d3b3c365984aa26ce0a8f0fda6b3fd75b9e837a2)

In [4]:
from IPython.display import HTML, display
import tabulate

# Mean benchmark times in milliseconds; the first row holds the column headers
table = [["Benchmark (mean time, ms)", "Date", "Container on MacBook", "MacBook"],
         ["Title parse", "2019-05-02", 40.2649, 55.5296],
         ["Title parse", "2019-05-14", 39.7405, 54.8936],
         ["Date parse", "2019-05-02", 46.4389, 69.5056],
         ["Date parse", "2019-05-14", 32.8276, 44.4991],
         ["Full parse", "2019-05-02", 3065.2467, 2140.0745],
         ["Full parse", "2019-05-14", 2642.1735, 1942.1609],
        ]
# headers="firstrow" renders the first row as the table header
display(HTML(tabulate.tabulate(table, headers="firstrow", tablefmt='html')))

| Benchmark (mean time, ms) | Date | Container on MacBook | MacBook |
|---|---|---|---|
| Title parse | 2019-05-02 | 40.2649 | 55.5296 |
| Title parse | 2019-05-14 | 39.7405 | 54.8936 |
| Date parse | 2019-05-02 | 46.4389 | 69.5056 |
| Date parse | 2019-05-14 | 32.8276 | 44.4991 |
| Full parse | 2019-05-02 | 3065.2467 | 2140.0745 |
| Full parse | 2019-05-14 | 2642.1735 | 1942.1609 |
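
To read the table more easily, the numbers can be turned into ratios and percentage changes. The sketch below copies the mean times from the results above; the helper names are my own and nothing here comes from ReadabiliPy or pytest-benchmark.

```python
# Mean times in ms, copied from the results table:
# (benchmark, date) -> (container on MacBook, MacBook)
RESULTS = {
    ("Title parse", "2019-05-02"): (40.2649, 55.5296),
    ("Title parse", "2019-05-14"): (39.7405, 54.8936),
    ("Date parse", "2019-05-02"): (46.4389, 69.5056),
    ("Date parse", "2019-05-14"): (32.8276, 44.4991),
    ("Full parse", "2019-05-02"): (3065.2467, 2140.0745),
    ("Full parse", "2019-05-14"): (2642.1735, 1942.1609),
}


def container_ratio(benchmark, date):
    """Container mean divided by MacBook mean (< 1 means the container was faster)."""
    container, native = RESULTS[(benchmark, date)]
    return container / native


def percent_change(benchmark, column):
    """Percentage change in mean time from 2019-05-02 to 2019-05-14.

    column 0 is the container run, column 1 is the MacBook run.
    """
    before = RESULTS[(benchmark, "2019-05-02")][column]
    after = RESULTS[(benchmark, "2019-05-14")][column]
    return 100.0 * (after - before) / before


for benchmark, date in RESULTS:
    ratio = container_ratio(benchmark, date)
    print(f"{benchmark} {date}: container/MacBook = {ratio:.2f}")

for benchmark in ("Title parse", "Date parse", "Full parse"):
    print(f"{benchmark}: container {percent_change(benchmark, 0):+.1f}%, "
          f"MacBook {percent_change(benchmark, 1):+.1f}%")
```

On these figures the container has the lower mean time for the title and date benchmarks but the higher one for the full parse, and every mean time drops between the two commits. These are single mean values, so the comparison says nothing about run-to-run variance.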


Automated build
====

The ```latest``` tag for ```edwardchalstrey/readabilipy_benchmark``` [on Docker Hub](https://cloud.docker.com/repository/docker/edwardchalstrey/readabilipy_benchmark) has been set to build whenever the master branch of the ReadabiliPy GitHub repo has a new commit.

Post-development Dockerfile
----

In [1]:
%%writefile Dockerfile
FROM python:3

# Install requirements
RUN apt-get update
RUN apt-get -y install curl
RUN curl -sL https://deb.nodesource.com/setup_11.x | bash -
RUN apt install nodejs
RUN npm install
RUN pip install --upgrade pip
RUN apt-get install -y git

# Clone ReadabiliPy and install python packages
RUN git clone https://github.com/alan-turing-institute/ReadabiliPy
WORKDIR "/ReadabiliPy"
RUN git pull
RUN pip install -r requirements-dev.txt

# Run the benchmarks with Pytest
CMD pytest --benchmark-only

Overwriting Dockerfile
