A search engine for Project Gutenberg books built with PostgreSQL and Dash. Find it running on GutenSearch.com.
Project Gutenberg offers mirroring via rsync. However, in June 2016 Allison Parrish released Gutenberg Dammit, a corpus of all Project Gutenberg text files and metadata up to that point, and that corpus is used here instead of the raw data.
- set up the instance, firewall, etc.
- create a new Postgres database
- stream the JSON metadata into a table
- stream the raw text data
- transform the data
- start the app
The steps below worked on a dedicated server with an Intel Atom 2.40GHz CPU, 16GB of RAM and a 250GB SSD. The queries are mostly CPU-bound, particularly for common phrases. The deployed app uses 128GB of its 217GB partition.
I've only tested this on a clean install of Ubuntu 20.04.1 LTS.
You'll need the following to get started:
sudo apt update
sudo apt install screen
sudo apt install unzip
sudo apt install vim # not strictly necessary
sudo apt install postgresql postgresql-contrib
You can follow this guide. You may want to increase the resources available to Postgres as follows:
sudo vim /etc/postgresql/12/main/postgresql.conf
Change the following (values shown here for a server with 16GB RAM):
shared_buffers = 4GB # (25% of server RAM)
work_mem = 40MB # (RAM * 0.25 / 100)
maintenance_work_mem = 800MB # (RAM * 0.05)
effective_cache_size = 8GB # (RAM * 0.5)
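shared_buffers only takes effect after a restart, so restart Postgres once the file is saved:
sudo systemctl restart postgresql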
As usual, the app relies on an alphabet soup of libraries:
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.8
sudo apt install python3-venv
sudo apt install python3-pip
sudo apt install libpq-dev
mkdir gutensearch
You will want to run this one in a screen, as it might take a while: a few minutes in a decent data centre at 30MB/s, or a night and a morning over a home connection.
screen -S download_data
wget -c http://static.decontextualize.com/gutenberg-dammit-files-v002.zip
mv gutenberg-dammit-files-v002.zip gutensearch/gutenberg-dammit-files-v002.zip
cd gutensearch
unzip gutenberg-dammit-files-v002.zip -d gutenberg-dammit-files-v002
exit
I recommend doing this one by hand, line by line, instead of passing the file to psql. Open a screen, then work through server-process-part1.sql line by line.
screen -S process_data
sudo -u postgres psql # run through server-process-part1.sql
exit
SQL part 1 streams the metadata JSON into a table.
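For orientation only, here is a minimal sketch of the general technique, not the actual contents of server-process-part1.sql. It assumes the metadata has been flattened to one JSON object per line; the file name, table names and field names are all assumptions.

-- Stage one jsonb document per row; the CSV quote/delimiter trick uses bytes
-- that never occur in the data, so each JSON line arrives intact in one column.
CREATE TABLE metadata_raw (doc jsonb);
\copy metadata_raw from 'gutenberg-metadata.jsonl' with (format csv, quote e'\x01', delimiter e'\x02')

-- Pull the fields the app needs out into a typed table.
CREATE TABLE metadata AS
SELECT (doc ->> 'Num')::int AS book_id,  -- field names are guesses
       doc -> 'Title'       AS title,
       doc -> 'Author'      AS author,
       doc -> 'Language'    AS language
FROM metadata_raw;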
This part streams the text files into an SQL file that can be run later. It may take a while, so it's best to run it in a screen.
screen -S app_venv
cd ~
python3 -m venv .venvs/dash
source .venvs/dash/bin/activate # activate the venv so the packages land inside it
python3 -m pip install --upgrade pip
python3 -m pip install psycopg2
python3 server-import.py # Change the path to yours first!
exit
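The real logic lives in server-import.py in the repo. Purely as a sketch of the idea, with assumed paths, file naming and a hypothetical books_raw table, a stripped-down version might look like this:

# Sketch only: walk the unzipped corpus and emit a single SQL file of INSERTs
# that psql can run later. Adjust CORPUS/OUT and the naming convention to match
# the real layout.
import pathlib

CORPUS = pathlib.Path.home() / "gutensearch" / "gutenberg-dammit-files-v002"
OUT = pathlib.Path.home() / "gutensearch" / "import-books.sql"

def sql_quote(text):
    # With standard_conforming_strings (on by default), doubling single quotes
    # is the only escaping a '...' literal needs.
    return "'" + text.replace("'", "''") + "'"

with OUT.open("w", encoding="utf-8") as out:
    out.write("CREATE TABLE IF NOT EXISTS books_raw (book_id int, body text);\n")
    out.write("BEGIN;\n")
    for path in sorted(CORPUS.rglob("*.txt")):
        book_id = int(path.stem)  # assumes files are named by Gutenberg number
        body = path.read_text(encoding="utf-8", errors="replace")
        out.write(f"INSERT INTO books_raw VALUES ({book_id}, {sql_quote(body)});\n")
    out.write("COMMIT;\n")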
This part will take the longest, as the 6GB of zipped text is expanded into more than 60GB of tables and indices. \timing output for each step is included as comments in the code; on the instance described earlier, you're looking at the better part of a day.
screen -r process_data
sudo -u postgres psql # now run through server-process-2.sql
exit
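To give a sense of where the disk space goes: a standard way to support phrase searches in Postgres is a tsvector column plus a GIN index, and it is derived data of this kind that accounts for the expansion. Whether or not part 2 does exactly this, the sketch below shows the shape of it; the table and column names are assumptions, and since a single tsvector value is capped at roughly 1MB, whole books generally have to be split into chunks first.

-- Assuming a populated table book_chunks(book_id, chunk_no, body):
ALTER TABLE book_chunks ADD COLUMN body_tsv tsvector;
UPDATE book_chunks SET body_tsv = to_tsvector('english', body);
CREATE INDEX book_chunks_tsv_idx ON book_chunks USING gin (body_tsv);

-- An indexed phrase query then looks like:
SELECT DISTINCT book_id
FROM book_chunks
WHERE body_tsv @@ phraseto_tsquery('english', 'in the beginning');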
You'll need the following, installed inside the virtual environment created earlier:
source ~/.venvs/dash/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install dash
python3 -m pip install dash_auth
python3 -m pip install pandas
python3 -m pip install sqlalchemy
python3 -m pip install networkx
python3 -m pip install gunicorn
Follow the instructions here.
Don't forget to back up the certs:
scp -r user@host:/etc/letsencrypt /path/to/backup/location
Find the relevant instructions for your provider. Mine are here.
You'll need to set up the firewall; instructions are here.
Relevant files can be found here:
cd /etc/nginx/sites-available
sudo vim reverse-proxy.conf # add server_name and change the port
sudo ln -s /etc/nginx/sites-available/reverse-proxy.conf /etc/nginx/sites-enabled/reverse-proxy.conf
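In case it helps, a minimal reverse-proxy.conf looks roughly like this. The server_name comes from the site mentioned above; the upstream port 8050 is only an assumption and must match whatever gunicorn binds to, and Certbot will later add the TLS parts:

server {
    listen 80;
    server_name gutensearch.com;

    location / {
        # forward requests to the gunicorn process started below
        proxy_pass http://127.0.0.1:8050;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Then test the configuration and reload nginx:
sudo nginx -t
sudo systemctl reload nginx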
screen -S app_server
gunicorn app:server -b :port --workers=17 --log-level=debug --timeout=700 # use the same port you set in reverse-proxy.conf
In accordance with GutenTag's and Gutenberg Dammit's license:
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.