fix(process-files): quit Python interpreter after every batch to prevent "too many open files" error (DEV-2268) (#402)
jnussbaum committed Jul 14, 2023
1 parent 679654d commit 1cbf927
Showing 7 changed files with 566 additions and 331 deletions.
23 changes: 23 additions & 0 deletions docs/cli-commands.md
@@ -312,3 +312,26 @@ dsp-tools rosetta
```

A DSP stack must be running before executing this command.



## `process-files`

DaSCH-internal command to process multimedia files locally,
before uploading them to a DSP server.
See [here](./internal/fast-xmlupload.md) for more information.



## `upload-files`

DaSCH-internal command to upload processed multimedia files to a DSP server.
See [here](./internal/fast-xmlupload.md) for more information.



## `fast-xmlupload`

DaSCH-internal command to create the resources of an XML file
after the processed multimedia files have already been uploaded.
See [here](./internal/fast-xmlupload.md) for more information.
28 changes: 27 additions & 1 deletion docs/internal/fast-xmlupload.md
@@ -62,19 +62,45 @@ The following options are available:
- `--output-dir` (mandatory): path to the output directory where the processed/transformed files should be written to
- `--nthreads` (optional, default computed by the concurrent library, dependent on the machine):
number of threads to use for processing
- `--batchsize` (optional, default 5000): number of files to process in one batch
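
What `--batchsize` means can be illustrated with a small sketch (this is illustrative only, not the actual dsp-tools internals): the referenced files are split into chunks of at most `batchsize` files, and each run of the command processes one chunk.

```python
# Illustrative sketch, not dsp-tools internals:
# split a file list into batches of at most `batchsize` files.
def make_batches(files: list[str], batchsize: int = 5000) -> list[list[str]]:
    return [files[i:i + batchsize] for i in range(0, len(files), batchsize)]

# 12 files with a batch size of 5 yield batches of 5, 5, and 2 files:
batches = make_batches([f"multimedia/file_{i}.jpg" for i in range(12)], batchsize=5)
print([len(b) for b in batches])  # → [5, 5, 2]
```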

All files referenced in the `<bitstream>` tags of the XML
are expected to be in the input directory
which is provided with the `--input-dir` option.

The processed files
(derivative, .orig file, sidecar file, as well as the preview file for movies)
will be stored in the given `--output-dir` directory.
If the output directory doesn't exist, it will be created automatically.

In addition to the output directory,
a pickle file is written with the name `processing_result_[timestamp].pkl`.
It contains a mapping from the original files to the processed files,
e.g. `multimedia/dog.jpg` -> `tmp/0b/22/0b22570d-515f-4c3d-a6af-e42b458e7b2b.jp2`.
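
The pickle file can be inspected with the standard library. The following sketch round-trips a mapping of the same shape (the file name and paths here are examples, not output of an actual run):

```python
import pickle
import tempfile
from pathlib import Path

# Example mapping with the same shape as processing_result_[timestamp].pkl
# (original file -> processed derivative); paths taken from the docs example.
mapping = {"multimedia/dog.jpg": "tmp/0b/22/0b22570d-515f-4c3d-a6af-e42b458e7b2b.jp2"}

# Write the mapping to a pickle file, as `dsp-tools process-files` does ...
pkl = Path(tempfile.mkdtemp()) / "processing_result_example.pkl"
with open(pkl, "wb") as f:
    pickle.dump(mapping, f)

# ... and read it back to inspect it.
with open(pkl, "rb") as f:
    loaded = pickle.load(f)

print(loaded["multimedia/dog.jpg"])  # → tmp/0b/22/0b22570d-515f-4c3d-a6af-e42b458e7b2b.jp2
```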


### Important note

**Due to a resource leak, the Python interpreter must be quit after a certain time.**
**For big datasets, only one batch of files is processed per run; Python then exits with exit code 2.**
**In this case, restart the command repeatedly until the exit code is 0.**
**Only then have all files been processed.**
**Unexpected errors result in exit code 1.**

You can orchestrate this with a shell script, e.g.:

```bash
exit_code=2
while [ $exit_code -eq 2 ]; do
    dsp-tools process-files --input-dir=multimedia --output-dir=tmp data.xml
    exit_code=$?
done

if [ $exit_code -ne 0 ]; then
    echo "Error: exit code $exit_code"
    exit $exit_code
fi
```
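
The same restart logic can be written in Python with `subprocess`, if a shell is not available. This is a hypothetical equivalent, not part of dsp-tools; the stand-in command in the example merely simulates an exit code:

```python
import subprocess
import sys

# Hypothetical Python equivalent of the shell retry loop:
# rerun the command as long as it exits with code 2
# (i.e. a batch finished but more batches remain).
def run_until_done(cmd: list[str]) -> int:
    exit_code = 2
    while exit_code == 2:
        exit_code = subprocess.run(cmd).returncode
    return exit_code

# Stand-in command instead of `dsp-tools process-files ...`:
code = run_until_done([sys.executable, "-c", "import sys; sys.exit(0)"])
print(code)  # → 0
```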

## 3. `dsp-tools upload-files`

5 changes: 4 additions & 1 deletion src/dsp_tools/dsp_tools.py
@@ -140,6 +140,9 @@ def make_parser() -> argparse.ArgumentParser:
"--output-dir", help="path to the output directory where the processed/transformed files should be written to"
)
parser_process_files.add_argument("--nthreads", type=int, default=None, help="number of threads to use")
parser_process_files.add_argument(
"--batchsize", type=int, default=5000, help="number of files to process before Python exits"
)
parser_process_files.add_argument("xml_file", help="path to XML file containing the data")

# upload-files
@@ -376,13 +379,13 @@ def call_requested_action(
save_metrics=args.metrics,
preprocessing_done=False,
)

elif args.action == "process-files":
success = process_files(
input_dir=args.input_dir,
output_dir=args.output_dir,
xml_file=args.xml_file,
nthreads=args.nthreads,
batch_size=args.batchsize,
)
elif args.action == "upload-files":
success = upload_files(
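
The parser change in the diff above can be sketched in isolation. This is a minimal reconstruction around the new `--batchsize` option, not the full dsp-tools parser:

```python
import argparse

# Minimal reconstruction of the `process-files` subparser with the new
# --batchsize option (mirrors the diff; other subcommands omitted).
parser = argparse.ArgumentParser(prog="dsp-tools")
subparsers = parser.add_subparsers(dest="action")
p = subparsers.add_parser("process-files")
p.add_argument("--input-dir", help="path to the input directory")
p.add_argument("--output-dir", help="path to the output directory")
p.add_argument("--nthreads", type=int, default=None, help="number of threads to use")
p.add_argument("--batchsize", type=int, default=5000, help="number of files to process before Python exits")
p.add_argument("xml_file", help="path to XML file containing the data")

# When --batchsize is not given, the default of 5000 applies:
args = parser.parse_args(["process-files", "--input-dir=multimedia", "--output-dir=tmp", "data.xml"])
print(args.batchsize)  # → 5000
```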
