
fix(upload-files, fast-xmlupload): handle multiple pickle files (DEV-2500) #451

Merged
merged 8 commits into main on Aug 7, 2023

Conversation

jnussbaum
Collaborator

No description provided.

@jnussbaum jnussbaum self-assigned this Jul 28, 2023

linear bot commented Jul 28, 2023

DEV-2500 fast xmlupload: make steps 2-3 ready to process several pickle files

The first step (processing) can produce several pickle files if the data set is very big.

The uploading step and the fast-xmlupload step must be adapted so that they can handle more than one pickle file.

@BalduinLandolt (Collaborator) left a comment

Looks good; I just noted some details, feel free to ignore them.

Comment on lines 86 to +89
**In this case, you need to restart the command several times, until the exit code is 0.**
**Only then are all files processed.**
**Unexpected errors result in exit code 1.**
**If this batch splitting happens, every run produces a new pickle file.**
Collaborator

Do we know what kind of resource leak this is? Is it on the Python side or on our side? Could it be fixed?

Collaborator Author

Unfortunately, we have no idea. Christian, Vij, and I spent a lot of time figuring out what could be wrong, but no one knows. It's really annoying.

Collaborator

too bad

Comment on lines 148 to 152
filename = Path(f"processing_result_{datetime.now().strftime('%Y%m%d_%H%M%S')}.pkl")
while filename.is_file():
logger.warning(f"The file {filename} already exists. Trying again in 1 second...")
sleep(1)
filename = Path(f"processing_result_{datetime.now().strftime('%Y%m%d_%H%M%S')}.pkl")
Collaborator

Is there a particular reason for not adding more date precision, or for not using a unique identifier (like a UUID) in the first place?
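The reviewer's UUID suggestion would make the sleep-and-retry loop unnecessary, since a collision becomes practically impossible. A minimal sketch of that idea (not the code that was merged); the function name is illustrative:

```python
import uuid
from datetime import datetime
from pathlib import Path

def unique_pickle_name() -> Path:
    """Build a collision-free pickle filename.

    The timestamp keeps the human-readable ordering of the original
    naming scheme, while the UUID suffix guarantees uniqueness, so no
    while/sleep retry loop is needed.
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(f"processing_result_{timestamp}_{uuid.uuid4().hex}.pkl")
```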

pkl_files: pickle file(s) returned by the processing step

Returns:
list of uuid file paths
Collaborator

what is a "uuid file path"?

Comment on lines 64 to 77
orig_paths_2_processed_paths: list[tuple[Path, Optional[Path]]] = []
for pkl_file in pkl_files:
with open(pkl_file, "rb") as file:
orig_paths_2_processed_paths.extend(pickle.load(file))

processed_paths: list[Path] = []
for orig_path, processed_path in orig_paths_2_processed_paths:
if processed_path:
processed_paths.append(processed_path)
else:
print(f"{datetime.now()}: WARNING: There is no processed file for {orig_path}")
logger.warning(f"There is no processed file for {orig_path}")

return processed_paths
Collaborator

This method is overly complicated, if you ask me: the first loop only creates an iterable to be iterated in the second loop, so they could be combined into one loop. Also, reading a pickle from a Path is a one-liner: pickle.loads(path.read_bytes())

Comment on lines 25 to 33
with open(pkl_file, "rb") as file:
orig_path_2_processed_path: list[tuple[Path, Optional[Path]]] = pickle.load(file)
orig_path_2_processed_path: list[tuple[Path, Optional[Path]]] = []
for pkl_file in pkl_files:
with open(pkl_file, "rb") as file:
orig_path_2_processed_path.extend(pickle.load(file))
Collaborator

as above. Also, there seems to be a lot of logic duplicated here

Comment on lines -33 to +40
raise BaseError(
raise UserError(
Collaborator

Isn't the idea of the UserError that it's thrown when something is the user's fault? Is that really the case here?

"""

def action() -> bool:
print("test_fast_xmlupload_batching: call process_files() with batch size 15")
Collaborator

Printing in tests is never a good idea; tests should only produce the test report.

@jnussbaum jnussbaum merged commit 98f0b97 into main Aug 7, 2023
4 checks passed
@jnussbaum jnussbaum deleted the wip/dev-2500-handle-multiple-pkl-files branch August 7, 2023 12:53
@daschbot daschbot mentioned this pull request Aug 7, 2023