Citation Hunt scripts and tools

The scripts in this directory are used to generate the Citation Hunt database.

Generating the database on Tools Labs

See the top-level README for how to set up Kubernetes cron jobs to generate the database, and refer to the Kubernetes cronjob documentation for more general information.

Below are some more handy commands for manually operating and troubleshooting those jobs.

Manually launch a job

kubectl create job --from=cronjob/citationhunt-update-it citationhunt-update-it-manual

Clean up the manual job

kubectl delete job citationhunt-update-it-manual

Delete the compute_fixed_snippets deployment

kubectl delete deployment citationhunt.compute-fixed-snippets

Get pods for running jobs

kubectl get pods --field-selector=status.phase=Running
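
The commands below reference a POD variable. One way to capture a pod name into it, assuming the manually launched job from above (pods created by a Kubernetes Job carry a job-name label), is:

POD=$(kubectl get pods --selector=job-name=citationhunt-update-it-manual -o jsonpath='{.items[0].metadata.name}')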

Get logs from a pod

kubectl logs ${POD?}

Open a shell in a running pod

kubectl exec -it ${POD?} -- bash

Generating the database locally

Prerequisites:

  • A local installation of MySQL;
  • A working Internet connection;
  • A few hours, or potentially a rainy Sunday;

The commands we'll be typing depend on the language you're generating a database for. They expect an environment variable CH_LANG to be set to a language code taken from ../config.py. Since we're dealing with English in this document, let's set the variable accordingly:

export CH_LANG=en

First, we need to get access to (a copy of) the page, categorylinks and templatelinks database tables for the Wikipedia we're using.

There are two alternatives for that:

Option 1: Using a SQL dump imported locally


Download the page.sql, categorylinks.sql and templatelinks.sql dumps. You can find the latest versions of these for the English Wikipedia on the Wikimedia dumps site.
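
For example, to download and decompress them, assuming the standard dumps.wikimedia.org mirror (the exact file names vary with the dump date):

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-categorylinks.sql.gz
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-templatelinks.sql.gz
gunzip enwiki-latest-*.sql.gz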

From the MySQL console connected to your local database, import them:

mysql -u root
mysql> create database enwiki_p;
mysql> use enwiki_p;
mysql> source path/to/categorylinks.sql
mysql> source path/to/page.sql
mysql> source path/to/templatelinks.sql

This will create a new database named 'enwiki_p' and populate it with tables named 'categorylinks', 'page' and 'templatelinks'; expect the import to take a few hours. Using 'enwiki_p' keeps things simple, but the database name is configurable in ../config.py.
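
As a quick sanity check once the import finishes, you can confirm the tables exist and contain rows:

mysql> use enwiki_p;
mysql> show tables;
mysql> select count(*) from page;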

Then, to ensure these scripts can find the database, create a local config file at ~/replica.my.cnf:

$ cat ~/replica.my.cnf
[client]
user='root'
host='localhost'

Option 2: Connecting to live replicas using SSH local forwarding


Alternatively, you can connect from your local computer to the real database replicas. The Toolforge documentation has more details on this option.

You'll need an existing Toolforge account for this method.

First, copy your Toolforge replica.my.cnf locally, to ~/replica.my.cnf, and create another MySQL config that points to your local database. For example:

$ cat ~/ch.my.cnf
[client]
user='root'
host='localhost'
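
To copy replica.my.cnf from Toolforge in the first place, a command along these lines should work, assuming login.tools.wmflabs.org as the SSH host (as in the port forward below) and that the file sits in your Toolforge home directory:

scp login.tools.wmflabs.org:replica.my.cnf ~/replica.my.cnf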

Then, establish a port forward to the database you're trying to access:

ssh -L 4711:enwiki.analytics.db.svc.wikimedia.cloud:3306 login.tools.wmflabs.org

Note the hostname in the command above: replace enwiki with whatever wiki you are working with.

Finally, set two environment variables (see the example after this list):

  • CH_LOCAL_SSH_PORT to the forwarded port (4711, in the example above).
  • CH_MY_CNF to the local MySQL config (~/ch.my.cnf, in the example above).
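
Using the values from the examples above:

export CH_LOCAL_SSH_PORT=4711
export CH_MY_CNF=~/ch.my.cnf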

Now, let's create all necessary databases and tables:

(cd ..; python -c 'import chdb; chdb.initialize_all_databases()')

Next, let's use print_unsourced_pageids_from_wikipedia.py to generate the list of IDs of pages with unsourced statements:

./print_unsourced_pageids_from_wikipedia.py > unsourced

This list should be passed to the parse_live.py script, which will query the Wikipedia API for the actual content of the pages and identify snippets lacking citations:

./parse_live.py unsourced

This should take a couple of hours on a multi-core machine. If you're impatient, you can also pass it a maximum running time in seconds using the --timeout command line option.
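
For example, to cap the run at one hour:

./parse_live.py --timeout 3600 unsourced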

The next step is to pick which categories will be displayed in Citation Hunt, filling the articles_categories table in the database. This is done with the assign_categories.py script:

./assign_categories.py

At the end of this step, your MySQL installation should contain a database named root__scratch_en with all the tables Citation Hunt needs. The install_new_database.py script will atomically move these tables to a new database named root__citationhunt_en, which is where the app actually expects to find them:

./install_new_database.py

And that's it! If everything went well, you can refer to the instructions in ../README.md to run Citation Hunt using your new database.