This is the web interface for viewing papers. The actual LaTeX to HTML conversion (the interesting bit) is done by Engrafo.
Arxiv Vanity downloads LaTeX source from Arxiv and renders it as HTML using the Engrafo LaTeX to HTML converter.
The web app runs render jobs on Hyper.sh as Docker containers, and they report their status directly back to the web app with a webhook. This approach has two neat properties:
- It scales to as many containers as Hyper.sh will run, because each render job gets its own container
- There is no worker process or message queue
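To make the webhook approach concrete, here is a minimal sketch in Python of how jobs could be tracked without a worker or queue. It is not the project's actual code: `RenderRegistry`, `launch_container`, and the `exit_code` field are hypothetical names, and an in-memory dict stands in for the database.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict


class State(Enum):
    RUNNING = "running"
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class Render:
    paper_id: str
    container_id: str
    state: State = State.RUNNING


@dataclass
class RenderRegistry:
    """In-memory stand-in for the database table of render jobs."""

    # Hypothetical callable standing in for the container service's API:
    # it takes a paper ID, starts a container, and returns the container ID.
    launch_container: Callable[[str], str]
    renders: Dict[str, Render] = field(default_factory=dict)

    def start(self, paper_id: str) -> Render:
        # One container per job. Its ID is recorded so the job can be
        # looked up later; nothing in the web app polls or waits on it.
        container_id = self.launch_container(paper_id)
        render = Render(paper_id=paper_id, container_id=container_id)
        self.renders[container_id] = render
        return render

    def handle_webhook(self, container_id: str, exit_code: int) -> Render:
        # Called from the web app's HTTP handler when a container reports
        # back. The exit_code payload field is an assumption.
        render = self.renders[container_id]
        render.state = State.SUCCESS if exit_code == 0 else State.FAILURE
        return render


if __name__ == "__main__":
    registry = RenderRegistry(launch_container=lambda paper_id: f"container-{paper_id}")
    job = registry.start("0000.00000")
    registry.handle_webhook(job.container_id, exit_code=0)
    print(job.state)
```

Because the container itself reports completion, the web app only needs an ordinary HTTP endpoint that calls something like `handle_webhook`.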
The process looks a bit like this:
- Details about the paper are fetched from the Arxiv API. Metadata is stored in a Postgres database using Django's ORM, and the paper's LaTeX source is stored on S3.
- Engrafo is run on Hyper.sh to convert the LaTeX source to HTML. It fetches the source and stores the result on S3. The container ID is stored in the Postgres database so the status of the rendering job can be queried.
- When the rendering job is finished, the Hyper.sh container makes an HTTP request to the web app to mark it as rendered.
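As an illustration of the first step, the Arxiv API returns results as an Atom feed, so fetching paper details largely comes down to parsing Atom entries. The sketch below uses only the standard library and a placeholder feed. The `parse_entries` function and the dictionary keys are hypothetical, and the project's real fetching code may look quite different.

```python
import xml.etree.ElementTree as ET

# Namespace prefix for Atom elements, in ElementTree's "{uri}tag" form.
ATOM = "{http://www.w3.org/2005/Atom}"

# Placeholder feed for illustration only; not a real paper.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>An Example
      Title</title>
    <summary>Placeholder abstract.</summary>
    <author><name>A. Author</name></author>
    <author><name>B. Author</name></author>
  </entry>
</feed>
"""


def normalize(text):
    # Collapse line breaks and repeated spaces into single spaces.
    return " ".join((text or "").split())


def parse_entries(feed_xml):
    """Return one metadata dict per <entry> in an Atom feed."""
    root = ET.fromstring(feed_xml)
    papers = []
    for entry in root.findall(ATOM + "entry"):
        papers.append({
            "arxiv_url": normalize(entry.findtext(ATOM + "id")),
            "title": normalize(entry.findtext(ATOM + "title")),
            "abstract": normalize(entry.findtext(ATOM + "summary")),
            "authors": [
                normalize(name.text)
                for name in entry.iterfind(ATOM + "author/" + ATOM + "name")
            ],
        })
    return papers


if __name__ == "__main__":
    for paper in parse_entries(SAMPLE_FEED):
        print(paper["title"], paper["authors"])
```

Each resulting dict corresponds roughly to the metadata that would be saved to the database before the render job is started.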
Running in development
Install Docker for Mac or Windows.
Do the initial database migration and set up a user:
$ script/manage migrate
$ script/manage createsuperuser
Then to run the app:
$ docker-compose up --build
You can scrape the latest papers from Arxiv by running:
$ script/manage scrape_papers
It'll probably fetch quite a lot, so hit Ctrl-C when you've got enough.
Thanks to our generous sponsors for supporting the development of Arxiv Vanity! Sponsor us to get your logo here.