fix(upload-files, fast-xmlupload): handle multiple pickle files (DEV-2500) #451
Conversation
DEV-2500 fast xmlupload: make steps 2-3 ready to process several pickle files
The first step (processing) can produce several pickle files if the data set is very big. The uploading step and the fast-xmlupload step must be adapted so that they can handle more than one pickle file.
Looks good, I just noted some details, feel free to ignore them
**In this case, you need to restart the command several times, until the exit code is 0.**
**Only then are all files processed.**
**Unexpected errors result in exit code 1.**
**If this batch splitting happens, every run produces a new pickle file.**
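The restart behavior described above can be sketched as a small shell loop. Here, `process_files()` is a hypothetical stand-in for the real `dsp-tools` invocation; for illustration it succeeds on its 3rd run, mimicking a data set split into three batches. The exit-code convention (0 = done, 1 = unexpected error, other nonzero = restart needed) is an assumption based on the quoted documentation, not a confirmed detail of the tool.

```shell
#!/bin/sh
# process_files() is a hypothetical stand-in for the real
# `dsp-tools process-files ...` command; it returns 2 ("more batches
# remain") twice, then 0 ("all files processed").
process_files() {
    runs=$((runs + 1))
    [ "$runs" -ge 3 ] && return 0 || return 2
}

runs=0
until process_files; do
    status=$?
    if [ "$status" -eq 1 ]; then
        # Assumption: exit code 1 means an unexpected error; abort.
        echo "unexpected error" >&2
        exit 1
    fi
    echo "batch done (exit code $status), restarting..."
done
echo "all files processed after $runs runs"
```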
Do we know what kind of a resource leak this is? Is it on the python side or on our side? Could it be fixed?
Unfortunately, we have no idea. Christian, Vij, and I spent a lot of time figuring out what could be wrong, but no one knows. It's really annoying.
too bad
```python
filename = Path(f"processing_result_{datetime.now().strftime('%Y%m%d_%H%M%S')}.pkl")
while filename.is_file():
    logger.warning(f"The file {filename} already exists. Trying again in 1 second...")
    sleep(1)
    filename = Path(f"processing_result_{datetime.now().strftime('%Y%m%d_%H%M%S')}.pkl")
```
Is there a particular reason for not adding more date precision, or not using a unique identifier (like a UUID) in the first place?
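The reviewer's suggestion could look something like the sketch below. `make_result_filename` is an invented helper name; the point is that a UUID suffix makes collisions practically impossible, so the `is_file()`/`sleep()` retry loop above becomes unnecessary.

```python
import uuid
from datetime import datetime
from pathlib import Path

def make_result_filename() -> Path:
    # Hypothetical alternative to the timestamp-plus-retry approach:
    # keep the human-readable timestamp, but append a random UUID
    # fragment so two runs in the same second cannot collide.
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(f"processing_result_{timestamp}_{uuid.uuid4().hex[:8]}.pkl")
```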
```python
    pkl_files: pickle file(s) returned by the processing step

Returns:
    list of uuid file paths
```
what is a "uuid file path"?
```python
orig_paths_2_processed_paths: list[tuple[Path, Optional[Path]]] = []
for pkl_file in pkl_files:
    with open(pkl_file, "rb") as file:
        orig_paths_2_processed_paths.extend(pickle.load(file))

processed_paths: list[Path] = []
for orig_path, processed_path in orig_paths_2_processed_paths:
    if processed_path:
        processed_paths.append(processed_path)
    else:
        print(f"{datetime.now()}: WARNING: There is no processed file for {orig_path}")
        logger.warning(f"There is no processed file for {orig_path}")

return processed_paths
```
This method is overly complicated, if you ask me: the first loop only creates an iterable to be iterated in the second loop, so they could be combined into one loop. Also, reading a pickle from a `Path` is a one-liner: `pickle.loads(path.read_bytes())`
```diff
-with open(pkl_file, "rb") as file:
-    orig_path_2_processed_path: list[tuple[Path, Optional[Path]]] = pickle.load(file)
+orig_path_2_processed_path: list[tuple[Path, Optional[Path]]] = []
+for pkl_file in pkl_files:
+    with open(pkl_file, "rb") as file:
+        orig_path_2_processed_path.extend(pickle.load(file))
```
as above. Also, there seems to be a lot of logic duplicated here
```diff
-raise BaseError(
+raise UserError(
```
Isn't the idea of the `UserError` that it's thrown when something is the user's fault? Is that really the case here?
```python
    """

def action() -> bool:
    print("test_fast_xmlupload_batching: call process_files() with batch size 15")
```
printing in tests is never a good idea. Tests should only produce the test report.
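One way to follow this advice is to log instead of print, so the message appears in the test report only when needed. This is a minimal sketch, not the project's actual test: `process_files` here is a hypothetical stand-in for the function under test.

```python
import logging

logger = logging.getLogger("test_fast_xmlupload_batching")

def process_files(batch_size: int) -> bool:
    # Hypothetical stand-in for the real process_files() under test.
    return batch_size > 0

def test_fast_xmlupload_batching() -> None:
    # Log instead of print: pytest captures log records and shows them
    # only for failing tests, so passing runs produce a clean report.
    logger.debug("call process_files() with batch size 15")
    assert process_files(batch_size=15)
```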