fix(process-files): quit Python interpreter after every batch to prevent "too many open files" error (DEV-2268) #402

Merged · 35 commits · Jul 14, 2023
66a3097
write pkl file in case of non-success
jnussbaum Jun 9, 2023
7554a4b
reformat docs
jnussbaum Jun 12, 2023
f16fa26
Merge branch 'main' into wip/dev-2268-too-many-open-files
jnussbaum Jun 12, 2023
6427a73
Merge branch 'main' into wip/dev-2268-too-many-open-files
jnussbaum Jun 16, 2023
37bc086
Merge branch 'main' into wip/dev-2268-too-many-open-files
jnussbaum Jun 22, 2023
2637f33
edit
jnussbaum Jun 26, 2023
2978ee9
refactor double_check_unprocessed_files()
jnussbaum Jun 26, 2023
952480a
refactor
jnussbaum Jun 26, 2023
83f596f
debug
jnussbaum Jun 26, 2023
b384f33
blacken
jnussbaum Jun 26, 2023
9e6eff2
Merge branch 'main' into wip/dev-2268-too-many-open-files
jnussbaum Jun 26, 2023
3ec5ffe
fix
jnussbaum Jun 26, 2023
81bf234
docs
jnussbaum Jun 26, 2023
e9ce80f
Merge branch 'main' into wip/dev-2268-too-many-open-files
jnussbaum Jun 26, 2023
77d2abb
fix test
jnussbaum Jun 26, 2023
98f314d
edit
jnussbaum Jun 26, 2023
84520ee
finish
jnussbaum Jun 27, 2023
8ff4ad1
write pickle file to working directory instead of input directory
jnussbaum Jun 27, 2023
9519081
add _write_processed_and_unprocessed_files_to_txt_files()
jnussbaum Jun 27, 2023
17a8f8a
edit
jnussbaum Jun 27, 2023
519bdcb
working version
jnussbaum Jun 27, 2023
8f571ef
edit
jnussbaum Jun 27, 2023
70d1b02
fix test
jnussbaum Jun 27, 2023
9dbd30b
fix test: locally it works
jnussbaum Jun 27, 2023
67a69b5
blacken
jnussbaum Jun 27, 2023
63a6974
Merge branch 'main' into wip/dev-2268-too-many-open-files
jnussbaum Jun 27, 2023
1e5212e
keep more log records
jnussbaum Jun 27, 2023
cce17a4
Merge branch 'main' into wip/dev-2268-too-many-open-files
jnussbaum Jun 28, 2023
58919c3
Merge branch 'main' into wip/dev-2268-too-many-open-files
jnussbaum Jun 28, 2023
681ebc3
Merge branch 'main' into wip/dev-2268-too-many-open-files
jnussbaum Jun 30, 2023
f8dfbb2
Merge branch 'main' into wip/dev-2268-too-many-open-files
jnussbaum Jul 10, 2023
deab8e2
make batch size configurable
jnussbaum Jul 14, 2023
9f1f846
add internal commands to CLI commands overview
jnussbaum Jul 14, 2023
95fd8b5
Merge branch 'main' into wip/dev-2268-too-many-open-files
jnussbaum Jul 14, 2023
7a7d172
fix
jnussbaum Jul 14, 2023
23 changes: 23 additions & 0 deletions docs/cli-commands.md
@@ -312,3 +312,26 @@ dsp-tools rosetta
```

A DSP stack must be running before executing this command.



## `process-files`

DaSCH-internal command to process multimedia files locally,
before uploading them to a DSP server.
See [here](./internal/fast-xmlupload.md) for more information.



## `upload-files`

DaSCH-internal command to upload processed multimedia files to a DSP server.
See [here](./internal/fast-xmlupload.md) for more information.



## `fast-xmlupload`

DaSCH-internal command to create the resources of an XML file
after the processed multimedia files have already been uploaded.
See [here](./internal/fast-xmlupload.md) for more information.
28 changes: 27 additions & 1 deletion docs/internal/fast-xmlupload.md
@@ -62,19 +62,45 @@ The following options are available:
- `--output-dir` (mandatory): path to the output directory where the processed/transformed files should be written to
- `--nthreads` (optional, default computed by the concurrent library, dependent on the machine):
number of threads to use for processing
- `--batchsize` (optional, default 5000): number of files to process in one batch
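
Put together, a full invocation using these options could look like this (the directory names, thread count, and XML file name are illustrative):

```shell
dsp-tools process-files \
    --input-dir=multimedia \
    --output-dir=tmp \
    --nthreads=8 \
    --batchsize=5000 \
    data.xml
```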

All files referenced in the `<bitstream>` tags of the XML
are expected to be in the input directory
which is provided with the `--input-dir` option.

The processed files
(derivative, .orig file, sidecar file, as well as the preview file for movies)
will be stored in the given `--output-dir` directory.
If the output directory doesn't exist, it will be created automatically.

In addition to the output directory,
a pickle file is written with the name `processing_result_[timestamp].pkl`.
It contains a mapping from the original files to the processed files,
e.g. `multimedia/dog.jpg` -> `tmp/0b/22/0b22570d-515f-4c3d-a6af-e42b458e7b2b.jp2`.
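
For illustration, here is a minimal sketch of how such a pickle file can be read back in Python. The file name and the exact internal structure (assumed here to be a plain `dict`) are assumptions for this example, not guaranteed by dsp-tools:

```python
import pickle
from pathlib import Path

# Hypothetical mapping in the shape described above;
# the real file is produced by `dsp-tools process-files`.
sample = {
    "multimedia/dog.jpg": "tmp/0b/22/0b22570d-515f-4c3d-a6af-e42b458e7b2b.jp2",
}

# Stand-in for the real processing_result_[timestamp].pkl
pkl_file = Path("processing_result_example.pkl")
with pkl_file.open("wb") as f:
    pickle.dump(sample, f)

# Read the file back and inspect the mapping
with pkl_file.open("rb") as f:
    mapping = pickle.load(f)

for original, processed in mapping.items():
    print(f"{original} -> {processed}")
```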


### Important note

**Due to a resource leak, the Python interpreter must be restarted after a certain time.**
**For big datasets, only one batch of files is processed per run; then Python exits with exit code 2.**
**In this case, you need to run the command again, repeatedly, until the exit code is 0.**
**Only then have all files been processed.**
**Unexpected errors result in exit code 1.**

You can orchestrate this with a shell script, e.g.:

```bash
exit_code=2
while [ $exit_code -eq 2 ]; do
    dsp-tools process-files --input-dir=multimedia --output-dir=tmp data.xml
    exit_code=$?
done

if [ $exit_code -ne 0 ]; then
    echo "Error: exit code $exit_code"
    exit $exit_code
fi
```

## 3. `dsp-tools upload-files`

5 changes: 4 additions & 1 deletion src/dsp_tools/dsp_tools.py
@@ -140,6 +140,9 @@ def make_parser() -> argparse.ArgumentParser:
        "--output-dir", help="path to the output directory where the processed/transformed files should be written to"
    )
    parser_process_files.add_argument("--nthreads", type=int, default=None, help="number of threads to use")
    parser_process_files.add_argument(
        "--batchsize", type=int, default=5000, help="number of files to process before Python exits"
    )
    parser_process_files.add_argument("xml_file", help="path to XML file containing the data")

# upload-files
@@ -376,13 +379,13 @@ def call_requested_action(
            save_metrics=args.metrics,
            preprocessing_done=False,
        )

    elif args.action == "process-files":
        success = process_files(
            input_dir=args.input_dir,
            output_dir=args.output_dir,
            xml_file=args.xml_file,
            nthreads=args.nthreads,
            batch_size=args.batchsize,
        )
    elif args.action == "upload-files":
        success = upload_files(