feat: fast XML upload (DEV-1626) (#352)
irinaschubert committed May 15, 2023
1 parent 69d65e7 commit c2f46ea
Showing 21 changed files with 1,928 additions and 33 deletions.
9 changes: 8 additions & 1 deletion .github/workflows/tests-on-push.yml
@@ -22,12 +22,19 @@ jobs:
with:
python-version: 3.9

- name: Install dependencies
- name: Install Python dependencies
run: |
curl -sSL https://install.python-poetry.org | python3 -
poetry self add poetry-exec-plugin
poetry install
- name: Install ffmpeg for local processing (fast xmlupload)
uses: FedericoCarboni/setup-ffmpeg@v2
id: setup-ffmpeg

- name: Install ImageMagick for local processing (fast xmlupload)
uses: mfinelli/setup-imagemagick@v2

- name: build docs
run: poetry run mkdocs build --strict

1 change: 1 addition & 0 deletions .gitignore
@@ -74,3 +74,4 @@ metrics/

# for testing in development
tmp/
/out/
4 changes: 2 additions & 2 deletions docs/cli-commands.md
@@ -76,7 +76,7 @@ dsp-tools get [options] project_definition.json
The following options are available:

- `-s` | `--server` (optional, default: `0.0.0.0:3333`): URL of the DSP server
- `-u` | `--user` (optional, default: `root@example.com`): username used for authentication with the DSP-API
- `-u` | `--user` (optional, default: `root@example.com`): username (e-mail) used for authentication with the DSP-API
- `-p` | `--password` (optional, default: `test`): password used for authentication with the DSP-API
- `-P` | `--project` (mandatory): shortcode, shortname or IRI of the project
- `-v` | `--verbose` (optional): print more information about the progress to the console
@@ -103,7 +103,7 @@ dsp-tools xmlupload [options] xml_data_file.xml
The following options are available:

- `-s` | `--server` (optional, default: `0.0.0.0:3333`): URL of the DSP server where DSP-TOOLS sends the data to
- `-u` | `--user` (optional, default: `root@example.com`): username used for authentication with the DSP-API
- `-u` | `--user` (optional, default: `root@example.com`): username (e-mail) used for authentication with the DSP-API
- `-p` | `--password` (optional, default: `test`): password used for authentication with the DSP-API
- `-S` | `--sipi` (optional, default: `http://0.0.0.0:1024`): URL of the SIPI server where DSP-TOOLS sends the multimedia files to
- `-i` | `--imgdir` (optional, default: `.`): folder from where the paths in the `<bitstream>` tags are evaluated
1 change: 1 addition & 0 deletions docs/developers/user-data.md
@@ -12,6 +12,7 @@ Here is an overview of its structure:
| docker | `start-stack` | files necessary to startup Docker containers |
| rosetta | `rosetta` | a clone of [the rosetta test project](https://github.com/dasch-swiss/082e-rosetta-scripts) |
| logging.log, logging.log.1 | several ones | These two grow up to 3MB, then the oldest entries are deleted |
| fast-xmlupload | fast xmlupload | shell script for local processing |


Remark: Docker is normally not able to access files
111 changes: 111 additions & 0 deletions docs/internal/fast-xmlupload.md
@@ -0,0 +1,111 @@
[![PyPI version](https://badge.fury.io/py/dsp-tools.svg)](https://badge.fury.io/py/dsp-tools)

# Fast XML upload

For projects with a lot of files,
the [`xmlupload`](../cli-commands.md#xmlupload) command is too slow.
That's why we developed, for internal use, a dedicated workflow for fast mass uploads.
The fast mass upload workflow first processes the files locally and then uploads them to the DSP server.
Finally, it creates the resources of the XML file on the DSP server.

In order for the fast mass upload to work, you need the following dependencies:

- Your machine must be able to run the DSP software stack.
The (internal) document "Installation of your Mac" explains what software needs to be installed.
- Install ffmpeg, e.g. with `brew install ffmpeg`
- Install ImageMagick, e.g. with `brew install imagemagick`

The fast mass upload consists of the following steps:

1. Prepare your data as explained below
2. Process the files locally with `dsp-tools process-files`
3. Upload the files to DSP with `dsp-tools upload-files`
4. Create the resources on DSP with `dsp-tools fast-xmlupload`
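
The three command-line steps can also be assembled from a script.
The following is a minimal sketch, not part of DSP-TOOLS:
the helper `build_commands` is hypothetical,
and it only builds the command lines documented on this page.
The pickle file name contains a timestamp that is only known once step 2 has finished,
so it is passed in explicitly here.

```python
import shlex
from typing import List, Optional


def build_commands(
    xml_file: str,
    input_dir: str,
    output_dir: str,
    pkl_file: str,
    nthreads: Optional[int] = None,
) -> List[List[str]]:
    """Build the argument lists for steps 2-4 (hypothetical helper, for illustration only)."""
    process = ["dsp-tools", "process-files", f"--input-dir={input_dir}", f"--output-dir={output_dir}"]
    if nthreads is not None:
        process.append(f"--nthreads={nthreads}")
    process.append(xml_file)
    # The directory written by step 2 is the one that step 3 reads from
    upload = ["dsp-tools", "upload-files", f"--pkl-file={pkl_file}", f"--processed-dir={output_dir}"]
    create = ["dsp-tools", "fast-xmlupload", f"--pkl-file={pkl_file}", xml_file]
    return [process, upload, create]


if __name__ == "__main__":
    commands = build_commands(
        xml_file="data.xml",
        input_dir="multimedia",
        output_dir="tmp",
        pkl_file="processing_result_20230414_152810.pkl",  # placeholder name
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` one after the other,
so that a failing step stops the workflow before the next one starts.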


## 1. Prepare your data

The following data structure is expected:

```text
my_project
├── data_model.json
├── data.xml   (<bitstream>multimedia/dog.jpg</bitstream>)
└── multimedia
    ├── dog.jpg
    ├── cat.mp3
    └── subfolder
        ├── snake.pdf
        └── bird.mp4
```

Note:

- Your project must contain one XML data file. It can be located anywhere in the project folder.
- Your project must contain one sub-folder that contains all multimedia files (here: `multimedia`).
- The multimedia files in `multimedia` may be arbitrarily nested.
- Every path referenced in a `<bitstream>` in the XML file must point to a file in `multimedia`.
- The paths in the `<bitstream>` are relative to the project root.
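
The rules above can be checked before starting the processing step.
The following is a minimal sketch under stated assumptions, not part of DSP-TOOLS:
the function `find_bad_bitstreams` is hypothetical,
and it assumes that the multimedia folder is called `multimedia`, as in the example.
It compares only the local tag name, so it works with or without an XML namespace.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List


def find_bad_bitstreams(xml_file: Path, project_root: Path, multimedia_dir: str = "multimedia") -> List[str]:
    """Return all <bitstream> paths that don't point to an existing file inside the multimedia folder."""
    media_root = (project_root / multimedia_dir).resolve()
    problems: List[str] = []
    for elem in ET.parse(xml_file).iter():
        # Strip a possible "{namespace}" prefix and keep only the local tag name
        if not isinstance(elem.tag, str) or elem.tag.rsplit("}", 1)[-1] != "bitstream":
            continue
        rel_path = (elem.text or "").strip()
        # Paths in <bitstream> are relative to the project root
        target = (project_root / rel_path).resolve()
        if media_root not in target.parents or not target.is_file():
            problems.append(rel_path)
    return problems


if __name__ == "__main__":
    root = Path("my_project")
    xml_path = root / "data.xml"
    if xml_path.is_file():
        for bad in find_bad_bitstreams(xml_path, root):
            print(f"Invalid bitstream path: {bad}")
```

An empty list means that every referenced file exists and lies inside the multimedia folder.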


## 2. `dsp-tools process-files`

Process the files locally, using a SIPI container.

```bash
dsp-tools process-files --input-dir=multimedia --output-dir=tmp data.xml
```

The following options are available:

- `--input-dir` (mandatory): path to the input directory where the files should be read from
- `--output-dir` (mandatory): path to the output directory where the processed/transformed files should be written to
- `--nthreads` (optional, default: computed by the concurrent library, depending on the machine): number of threads to use for processing
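
How an optional thread count can be passed through to a thread pool is sketched below.
This is an illustration under assumptions, not the actual implementation:
it assumes that "the concurrent library" refers to Python's `concurrent.futures`
and that a `ThreadPoolExecutor` is used.
The function `process_all` and its `worker` argument are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional


def process_all(paths: Iterable[str], worker: Callable[[str], str], nthreads: Optional[int] = None) -> List[str]:
    """Apply `worker` to every path in parallel and return the results in input order."""
    # max_workers=None lets concurrent.futures choose a machine-dependent default
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        return list(pool.map(worker, paths))


if __name__ == "__main__":
    # str.upper stands in for the real per-file processing
    print(process_all(["multimedia/dog.jpg", "multimedia/cat.mp3"], str.upper))
```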

All files referenced in the `<bitstream>` tags of the XML
are expected to be in the input directory
which is provided with the `--input-dir` option.
The processed files
(derivative, .orig file, sidecar file, as well as the preview file for movies)
will be stored in the given `--output-dir` directory.
If the output directory doesn't exist, it will be created automatically.
In addition to the output directory,
a pickle file named `processing_result_[timestamp].pkl` is written.
It contains a mapping from the original files to the processed files,
e.g. "multimedia/dog.jpg" -> "tmp/0b/22/0b22570d-515f-4c3d-a6af-e42b458e7b2b.jp2".
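
To inspect the result of the processing step, the pickle file can be loaded with Python's `pickle` module.
The following is a minimal sketch:
the exact layout of the pickled object is not specified here,
so the hypothetical helper `load_mapping` assumes
that it is either a dict or an iterable of (original, processed) pairs.

```python
import pickle
from pathlib import Path
from typing import Dict


def load_mapping(pkl_file: Path) -> Dict[str, str]:
    """Load the original -> processed mapping (assumed layout: dict or iterable of pairs)."""
    with open(pkl_file, "rb") as f:
        data = pickle.load(f)  # only unpickle files that you created yourself
    # dict(...) accepts both a dict and an iterable of 2-tuples
    return {str(orig): str(processed) for orig, processed in dict(data).items()}


if __name__ == "__main__":
    pkl_path = Path("processing_result_20230414_152810.pkl")  # placeholder name
    if pkl_path.is_file():
        for orig, processed in load_mapping(pkl_path).items():
            print(f"{orig} -> {processed}")
```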


## 3. `dsp-tools upload-files`

After all files are processed, the upload step can be started.


```bash
dsp-tools upload-files --pkl-file=processing_result_20230414_152810.pkl --processed-dir=tmp
```

The following options are available:

- `-f` | `--pkl-file` (mandatory): path to the pickle file that was written by the processing step
- `-d` | `--processed-dir` (mandatory): path to the directory where the processed files are located
(same as `--output-dir` in the processing step)
- `-n` | `--nthreads` (optional, default 4): number of threads to use for uploading (optimum depends on the number of CPUs on the server)
- `-s` | `--server` (optional, default: `http://0.0.0.0:3333`): URL of the DSP server
- `-S` | `--sipi-url` (optional, default: `http://0.0.0.0:1024`): URL of the SIPI server
- `-u` | `--user` (optional, default: `root@example.com`): username (e-mail) used for authentication with the DSP-API
- `-p` | `--password` (optional, default: `test`): password used for authentication with the DSP-API


## 4. `dsp-tools fast-xmlupload`

```bash
dsp-tools fast-xmlupload --pkl-file=processing_result_20230414_152810.pkl data.xml
```

The following options are available:

- `-f` | `--pkl-file` (mandatory): path to the pickle file that was written by the processing step
- `-s` | `--server` (optional, default: `http://0.0.0.0:3333`): URL of the DSP server
- `-S` | `--sipi-url` (optional, default: `http://0.0.0.0:1024`): URL of the SIPI server
- `-u` | `--user` (optional, default: `root@example.com`): username (e-mail) used for authentication with the DSP-API
- `-p` | `--password` (optional, default: `test`): password used for authentication with the DSP-API
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -20,6 +20,8 @@ nav:
- Incremental xmlupload: incremental-xmlupload.md
- excel2xml module: excel2xml-module.md
- Running DSP locally: start-stack.md
- DaSCH-internal commands:
- Fast XML Upload: internal/fast-xmlupload.md
- Information for developers:
- Developers documentation: developers/index.md
- Dependencies, packaging & distribution: developers/packaging.md
1 change: 0 additions & 1 deletion pyproject.toml
@@ -140,6 +140,5 @@ disable = [
"too-many-nested-blocks", # TODO: activate this
"too-many-return-statements", # TODO: activate this
"too-many-statements", # TODO: activate this
"bare-except", # TODO: activate this
"consider-using-f-string", # in excel2xml, strings with {} in it must be formatted
]
92 changes: 81 additions & 11 deletions src/dsp_tools/dsp_tools.py
@@ -10,6 +10,9 @@
from pathlib import Path

from dsp_tools.excel2xml import excel2xml
from dsp_tools.fast_xmlupload.process_files import process_files
from dsp_tools.fast_xmlupload.upload_files import upload_files
from dsp_tools.fast_xmlupload.upload_xml import fast_xmlupload
from dsp_tools.models.exceptions import UserError
from dsp_tools.utils.excel_to_json_lists import (
excel2lists,
@@ -40,13 +43,15 @@ def make_parser() -> argparse.ArgumentParser:
# help texts
username_text = "username (e-mail) used for authentication with the DSP-API "
password_text = "password used for authentication with the DSP-API "
url_text = "URL of the DSP server"
dsp_server_text = "URL of the DSP server"
verbose_text = "print more information about the progress to the console"
sipi_text = "URL of the Sipi server"

# default values
default_localhost = "http://0.0.0.0:3333"
default_dsp_api_url = "http://0.0.0.0:3333"
default_user = "root@example.com"
default_pw = "test"
default_sipi = "http://0.0.0.0:1024"

# make a parser
parser = argparse.ArgumentParser(description=f"DSP-TOOLS (version {version('dsp-tools')}, © {datetime.datetime.now().year} by DaSCH)")
@@ -59,7 +64,7 @@ def make_parser() -> argparse.ArgumentParser:
"A project can consist of lists, groups, users, and ontologies (data models)."
)
parser_create.set_defaults(action="create")
parser_create.add_argument("-s", "--server", default=default_localhost, help=url_text)
parser_create.add_argument("-s", "--server", default=default_dsp_api_url, help=dsp_server_text)
parser_create.add_argument("-u", "--user", default=default_user, help=username_text)
parser_create.add_argument("-p", "--password", default=default_pw, help=password_text)
parser_create.add_argument("-V", "--validate-only", action="store_true",
@@ -73,7 +78,7 @@ def make_parser() -> argparse.ArgumentParser:
# get
parser_get = subparsers.add_parser(name="get", help="Retrieve a project with its data model(s) from a DSP server and write it into a JSON file")
parser_get.set_defaults(action="get")
parser_get.add_argument("-s", "--server", default=default_localhost, help=url_text)
parser_get.add_argument("-s", "--server", default=default_dsp_api_url, help=dsp_server_text)
parser_get.add_argument("-u", "--user", default=default_user, help=username_text)
parser_get.add_argument("-p", "--password", default=default_pw, help=password_text)
parser_get.add_argument("-P", "--project", help="shortcode, shortname or IRI of the project", required=True)
@@ -83,10 +88,10 @@ def make_parser() -> argparse.ArgumentParser:
# xmlupload
parser_upload = subparsers.add_parser(name="xmlupload", help="Upload data defined in an XML file to a DSP server")
parser_upload.set_defaults(action="xmlupload")
parser_upload.add_argument("-s", "--server", default=default_localhost, help="URL of the DSP server where DSP-TOOLS sends the data to")
parser_upload.add_argument("-s", "--server", default=default_dsp_api_url, help="URL of the DSP server where DSP-TOOLS sends the data to")
parser_upload.add_argument("-u", "--user", default=default_user, help=username_text)
parser_upload.add_argument("-p", "--password", default=default_pw, help=password_text)
parser_upload.add_argument("-S", "--sipi", default="http://0.0.0.0:1024",
parser_upload.add_argument("-S", "--sipi", default=default_sipi,
help="URL of the SIPI server where DSP-TOOLS sends the multimedia files to")
parser_upload.add_argument("-i", "--imgdir", default=".", help="folder from where the paths in the <bitstream> tags are evaluated")
parser_upload.add_argument("-I", "--incremental", action="store_true",
@@ -96,6 +101,44 @@ def make_parser() -> argparse.ArgumentParser:
parser_upload.add_argument("-m", "--metrics", action="store_true", help="write metrics into a 'metrics' folder")
parser_upload.add_argument("xmlfile", help="path to the XML file containing the data")

# process-files
parser_process_files = subparsers.add_parser(
name="process-files",
help="For internal use only: process all files referenced in an XML file"
)
parser_process_files.set_defaults(action="process-files")
parser_process_files.add_argument("--input-dir", help="path to the input directory where the files should be read from")
parser_process_files.add_argument("--output-dir", help="path to the output directory where the processed/transformed files should be written to")
parser_process_files.add_argument("--nthreads", type=int, default=None, help="number of threads to use")
parser_process_files.add_argument("xml_file", help="path to XML file containing the data")

# upload-files
parser_upload_files = subparsers.add_parser(
name="upload-files",
help="For internal use only: upload already processed files"
)
parser_upload_files.set_defaults(action="upload-files")
parser_upload_files.add_argument("-f", "--pkl-file", help="path to pickle file written by 'process-files'")
parser_upload_files.add_argument("-d", "--processed-dir", help="path to the directory with the processed files")
parser_upload_files.add_argument("-n", "--nthreads", type=int, default=4, help="number of threads to use")
parser_upload_files.add_argument("-s", "--server", default=default_dsp_api_url, help=dsp_server_text)
parser_upload_files.add_argument("-S", "--sipi-url", default=default_sipi, help=sipi_text)
parser_upload_files.add_argument("-u", "--user", default=default_user, help=username_text)
parser_upload_files.add_argument("-p", "--password", default=default_pw, help=password_text)

# fast-xmlupload
parser_fast_xmlupload_files = subparsers.add_parser(
name="fast-xmlupload",
help="For internal use only: create resources with already uploaded files"
)
parser_fast_xmlupload_files.set_defaults(action="fast-xmlupload")
parser_fast_xmlupload_files.add_argument("-f", "--pkl-file", help="path to pickle file written by 'process-files'")
parser_fast_xmlupload_files.add_argument("-s", "--server", default=default_dsp_api_url, help=dsp_server_text)
parser_fast_xmlupload_files.add_argument("-S", "--sipi-url", default=default_sipi, help=sipi_text)
parser_fast_xmlupload_files.add_argument("-u", "--user", default=default_user, help=username_text)
parser_fast_xmlupload_files.add_argument("-p", "--password", default=default_pw, help=password_text)
parser_fast_xmlupload_files.add_argument("xml_file", help="path to XML file containing the data")

# excel2json
parser_excel2json = subparsers.add_parser(
name="excel2json",
@@ -157,8 +200,7 @@ def make_parser() -> argparse.ArgumentParser:
# startup DSP stack
parser_stackup = subparsers.add_parser(name="start-stack", help="Run a local instance of DSP-API and DSP-APP")
parser_stackup.set_defaults(action="start-stack")
parser_stackup.add_argument("--max_file_size", type=int,
help="max. multimedia file size allowed by SIPI, in MB (default: 250, max: 100'000)")
parser_stackup.add_argument("--max_file_size", type=int, help="max. multimedia file size allowed by SIPI, in MB (default: 250, max: 100'000)")
parser_stackup.add_argument("--prune", action="store_true", help="execute 'docker system prune' without asking")
parser_stackup.add_argument("--no-prune", action="store_true", help="don't execute 'docker system prune' (and don't ask)")

@@ -175,7 +217,7 @@ def make_parser() -> argparse.ArgumentParser:
help="Create a template repository with a minimal JSON and XML file"
)
parser_template.set_defaults(action="template")

# clone rosetta
parser_rosetta = subparsers.add_parser(
name="rosetta",
@@ -264,8 +306,36 @@ def call_requested_action(
sipi=args.sipi,
verbose=args.verbose,
incremental=args.incremental,
save_metrics=args.metrics
save_metrics=args.metrics,
preprocessing_done=False
)

elif args.action == "process-files":
success = process_files(
input_dir=args.input_dir,
output_dir=args.output_dir,
xml_file=args.xml_file,
nthreads=args.nthreads
)
elif args.action == "upload-files":
success = upload_files(
pkl_file=args.pkl_file,
dir_with_processed_files=args.processed_dir,
nthreads=args.nthreads,
user=args.user,
password=args.password,
dsp_url=args.server,
sipi_url=args.sipi_url
)
elif args.action == "fast-xmlupload":
success = fast_xmlupload(
xml_file=args.xml_file,
pkl_file=args.pkl_file,
user=args.user,
password=args.password,
dsp_url=args.server,
sipi_url=args.sipi_url
)
elif args.action == "excel2json":
success = excel2json(
data_model_files=args.excelfolder,
@@ -318,7 +388,6 @@ def call_requested_action(
logger.error(f"Unknown action '{args.action}'")

return success



def main() -> None:
@@ -352,5 +421,6 @@ def main() -> None:
if not success:
sys.exit(1)


if __name__ == "__main__":
main()
Empty file.
