-
Notifications
You must be signed in to change notification settings - Fork 8
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
97 additions
and
43 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,43 +1,97 @@ | ||
Run the pipeline | ||
================ | ||
|
||
It's useful when you need to run all the *soweego* steps. It does not | ||
require you to have the project in your machine: it is an already built | ||
docker image. Editing the credentials is all you need to do to chose the | ||
database. | ||
|
||
This environment is a docker container that executes | ||
``python -m soweego pipeline`` command. Obviously, it can be run with | ||
custom options. Note: the code is put into the image during the build | ||
process. | ||
|
||
What do you need to run it? | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
- `Docker <https://www.docker.com/get-started>`__ | ||
- Database credentials file, like | ||
``${PROJECT_ROOT}/soweego/importer/resources/db_credentials.json``. | ||
|
||
How do you run it? | ||
~~~~~~~~~~~~~~~~~~ | ||
|
||
1. In your terminal, move to the project root; | ||
2. Launch ``./scripts/docker/launch_pipeline.sh``; | ||
3. The script will ask you some parameters. | ||
4. The pipeline is now running! | ||
|
||
.. _launch_pipelinesh-options: | ||
|
||
launch_pipeline.sh options | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
================================ ========================================================================= ======================= ============================================================================================== | ||
**Option** **Expected Value** **Default Value** **Description** | ||
================================ ========================================================================= ======================= ============================================================================================== | ||
-s directory path ``/tmp/soweego_shared`` Tells docker which folder in your machine will be shared with *soweego* container. | ||
target (argument, not an option) one of the available targets name (atm. musicbrainz, discogs. imdb) None Tells *soweego* which database will go through the whole process. | ||
--validator / --no-validator No value needed. The option you choose among the two is the value itself. --no-validator Tells *soweego* to run or not the validator step. | ||
--importer / --no-importer No value needed. The option you choose among the two is the value itself. --importer Tells *soweego* to run or not the importer step. | ||
--linker / --no-linker No value needed. The option you choose among the two is the value itself. --linker Tells *soweego* to run or not the linker step. | ||
--upload / --no-upload No value needed. The option you choose among the two is the value itself. --upload Tells *soweego* to upload or not the computed links among wikidata and the target to wikidata. | ||
================================ ========================================================================= ======================= ============================================================================================== | ||
Run soweego | ||
=========== | ||
|
||
*soweego* can be seen as a **pipeline of submodules**. Maybe, is best to | ||
say that we designed it explicitly in this way. They can be combined at | ||
will, but in the following lines, you will read how we do it. | ||
|
||
Our flow starts by running the **importing process**, which translates | ||
the target dumps in structured database tables. After that, we run the | ||
**linking process**. In this step, the linker itself gathers the right | ||
up to date dataset from Wikidata and tries to match it with the | ||
preiously imported data. The last step we execute is **validation**. | ||
Basically, it scans the linked entities available in Wikidata to perform | ||
some naive quality checks. Each entity approved by the validator is | ||
consequently enriched with all the assertions minable from our imported | ||
data. | ||
|
||
What do I need to run your setup? | ||
--------------------------------- | ||
|
||
First of all, you need Docker up and running on the system. Then, since | ||
our “production” environment has benefited from Wikimedia Foundations’s | ||
database, you need to provide soweego a working database to run our | ||
setup (MariaDB 10.36 is the only database tested). To tell soweego where | ||
to find your database, you need to create a JSON file with the following | ||
structure: | ||
|
||
.. code:: json | ||
{ | ||
"DB_ENGINE": "mysql+pymysql", | ||
"HOST": "*ip address or equivalent*", | ||
"USER": "*database user*", | ||
"PASSWORD": "*database user password*", | ||
"TEST_DB": "soweego", | ||
"PROD_DB": "*database name*", | ||
"WIKIDATA_API_PASSWORD": "", # Optional wikidata api login | ||
"WIKIDATA_API_USER": "" # Optional wikidata api login | ||
} | ||
Finally, ensure to have a folder in which soweego will write | ||
results/helper files. soweego favourite food is disk space, but usually | ||
with 20GB it should be sated. | ||
|
||
How do I actually run it? | ||
------------------------- | ||
|
||
1. Clone soweego from `GitHub <https://github.com/wikidata/soweego/>`__; | ||
2. Open a bash and move in the project root; | ||
3. Choose a target to run eg Musicbrainz, Discogs, ImDB. | ||
4. Launch the following command replacing the variables with your | ||
absolute paths and append your target: | ||
``./docker/launch_pipeline.sh -c ${ABSOLUTE_PATH_TO_THE_JSON} -s ${ABSOLUTE_ PATH_TO_THE_OUTPUT_FOLDER} imdb`` | ||
|
||
Additional parameters tailable to your command: | ||
|
||
+-----------------------------------+-----------------------------------+ | ||
| Argument | Description | | ||
+===================================+===================================+ | ||
| ``--validator`` / | Enables/disables the validation | | ||
| ``--no-validator`` | step | | ||
+-----------------------------------+-----------------------------------+ | ||
| ``--importer`` / | Enables/disables the importing | | ||
| ``--no-importer`` | step | | ||
+-----------------------------------+-----------------------------------+ | ||
| ``--linker`` / ``--no-linker`` | Enables/disables the linking step | | ||
+-----------------------------------+-----------------------------------+ | ||
| ``--upload`` / ``--no-upload`` | Enables/disable the upload to | | ||
| | Wikidata of the results | | ||
+-----------------------------------+-----------------------------------+ | ||
|
||
Important | ||
~~~~~~~~~ | ||
|
||
The command does not only run soweego, but it takes care of some side | ||
tasks. Initially, backups the folder you give as the parameter. The | ||
backups will be three at most. When creating the 4th backup, the oldest | ||
is deleted. After the archiving step, the given folder is emptied. | ||
Subsequently, it checks out the master branch and pulls the latest | ||
changes (deleting all the pending edits in the local repository). | ||
Finally, our soweego setup is launched. | ||
|
||
Under the hood | ||
-------------- | ||
|
||
The pipeline structure is actually defined in ``pipeline.py`` and is | ||
launched by ``python -m soweego run``. Our setup script launches | ||
``python -m soweego run`` as latest command and appends all the | ||
arguments from the target (eg. musicbrainz, discogs, imdb) onwards. | ||
|
||
Cron jobs | ||
--------- | ||
|
||
Our setup is launched periodically for each target. We obviously rely on | ||
cron jobs to achieve this. You can find in ``scripts/cron`` the helpers | ||
we call though the cron tab. They can be reused easily just by changing | ||
all the paths to adhere to your environment setup. |