
fix(upload-files, fast-xmlupload): handle multiple pickle files (DEV-2500) #451

Merged
merged 8 commits into main on Aug 7, 2023

Conversation

jnussbaum
Collaborator

No description provided.

@jnussbaum jnussbaum self-assigned this Jul 28, 2023

linear bot commented Jul 28, 2023

DEV-2500 fast xmlupload: make steps 2-3 ready to process several pickle files

The first step (processing) can produce several pickle files if the data set is very big.

The uploading step and the fast-xmlupload step must be adapted so that they can handle more than one pickle file.

@BalduinLandolt (Collaborator) left a comment

Looks good; I just noted some details, feel free to ignore them.

Comment on lines 86 to +89
**In this case, you need to restart the command several times, until the exit code is 0.**
**Only then are all files processed.**
**Unexpected errors result in exit code 1.**
**If this batch splitting happens, every run produces a new pickle file.**
Collaborator

Do we know what kind of resource leak this is? Is it on the Python side or on our side? Could it be fixed?

Collaborator Author

Unfortunately, we have no idea. Christian, Vij, and I spent a lot of time figuring out what could be wrong, but no one knows. It's really annoying.

Collaborator

too bad

Comment on lines 148 to 152
filename = Path(f"processing_result_{datetime.now().strftime('%Y%m%d_%H%M%S')}.pkl")
while filename.is_file():
logger.warning(f"The file {filename} already exists. Trying again in 1 second...")
sleep(1)
filename = Path(f"processing_result_{datetime.now().strftime('%Y%m%d_%H%M%S')}.pkl")
Collaborator

Is there a particular reason for not adding more date precision, or for not using a unique identifier (like a UUID) in the first place?
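The reviewer's UUID suggestion would make the sleep-and-retry loop unnecessary, since a collision becomes practically impossible. A minimal sketch of that idea (not the code that was merged); the function name is illustrative:

```python
import uuid
from datetime import datetime
from pathlib import Path

def unique_pickle_name() -> Path:
    """Build a collision-free pickle filename.

    The timestamp keeps the human-readable ordering of the original
    naming scheme, while the UUID suffix guarantees uniqueness, so no
    while/sleep retry loop is needed.
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(f"processing_result_{timestamp}_{uuid.uuid4().hex}.pkl")
```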

pkl_files: pickle file(s) returned by the processing step

Returns:
list of uuid file paths
Collaborator

what is a "uuid file path"?

Comment on lines 64 to 77
orig_paths_2_processed_paths: list[tuple[Path, Optional[Path]]] = []
for pkl_file in pkl_files:
with open(pkl_file, "rb") as file:
orig_paths_2_processed_paths.extend(pickle.load(file))

processed_paths: list[Path] = []
for orig_path, processed_path in orig_paths_2_processed_paths:
if processed_path:
processed_paths.append(processed_path)
else:
print(f"{datetime.now()}: WARNING: There is no processed file for {orig_path}")
logger.warning(f"There is no processed file for {orig_path}")

return processed_paths
Collaborator

This method is overly complicated, if you ask me: the first loop only creates an iterable to be iterated in the second loop, so they could be combined into one loop. Also, reading a pickle from a Path is a one-liner: pickle.loads(path.read_bytes())

Comment on lines 25 to 33
with open(pkl_file, "rb") as file:
orig_path_2_processed_path: list[tuple[Path, Optional[Path]]] = pickle.load(file)
orig_path_2_processed_path: list[tuple[Path, Optional[Path]]] = []
for pkl_file in pkl_files:
with open(pkl_file, "rb") as file:
orig_path_2_processed_path.extend(pickle.load(file))
Collaborator

as above. Also, there seems to be a lot of logic duplicated here

Comment on lines -33 to +40
raise BaseError(
raise UserError(
Collaborator

Isn't the idea of the UserError that it's thrown when something is the user's fault? Is that really the case here?

"""

def action() -> bool:
print("test_fast_xmlupload_batching: call process_files() with batch size 15")
Collaborator

Printing in tests is never a good idea; tests should only produce the test report.

@jnussbaum jnussbaum merged commit 98f0b97 into main Aug 7, 2023
4 checks passed
@jnussbaum jnussbaum deleted the wip/dev-2500-handle-multiple-pkl-files branch August 7, 2023 12:53
@daschbot daschbot mentioned this pull request Aug 7, 2023