Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR config parameteres #3014

Merged
merged 9 commits into from
May 17, 2024

Conversation

amadeusz-ds
Copy link
Contributor

@amadeusz-ds amadeusz-ds commented May 14, 2024

This PR introduces GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR controlling where temporary files are stored during partition flow, via tempfile.tempdir.

Edit:

Renamed prefixes from STORAGE_ to UNSTRUCTURED_CACHE_

Edit 2:

Renamed prefixes from UNSTRUCTURED_CACHE to GLOBAL_WORKING_DIR_


@property
def STORAGE_DIR(self) -> str:
"""Path to Unstructured storage directory."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

STORAGE_DIR is a misleading name, which has permanent or at least caching connotations. could this instead be TMP_STORAGE_DIR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are parameters STORAGE_DIR and STORAGE_TMPDIR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe UNSTRUCTURED_DIR and UNSTRUCTURED_TMPDIR as those by default point to ~/.cache/unstructured and ~/.cache/unstructured/tmp/{gid} respectively

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or UNSTRUCTURED_CACHE_DIR and UNSTRUCTURED_TMP_DIR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made the changes, waiting for greenlight @cragwolfe

Copy link
Contributor

@pawel-kmiecik pawel-kmiecik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just env names to be corrected.

@amadeusz-ds amadeusz-ds changed the title feat: add STORAGE_DIR and STORAGE_TMPDIR config parameteres feat: add UNSTRUCTURED_CACHE_DIR and UNSTRUCTURED_CACHE_TMPDIR config parameteres May 15, 2024
@@ -160,7 +160,6 @@ def _try_process_document(self, doc: Path) -> Optional[list]:
@abstractmethod
def _process_document(self, doc: Path) -> list:
"""Should return all metadata and metrics for a single document."""
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was removed by the linter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a docstring is added, the pass keyword is optional for functions.

@amadeusz-ds amadeusz-ds changed the title feat: add UNSTRUCTURED_CACHE_DIR and UNSTRUCTURED_CACHE_TMPDIR config parameteres feat: add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR config parameteres May 17, 2024
@@ -161,7 +163,12 @@ def test_save_elements_with_output_dir_path_none():
)

# Verify that the images are saved in the expected directory
expected_output_dir = os.path.join(tmpdir, "figures")
if storage_enabled:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally I'd prefer to see more usages of pathlib, but I believe it's not the mail goal of this PR, just a side note

@amadeusz-ds amadeusz-ds added this pull request to the merge queue May 17, 2024
Merged via the queue into main with commit 1c8b2b2 May 17, 2024
42 checks passed
@amadeusz-ds amadeusz-ds deleted the feat/properly-support-usage-of-temporary-storage branch May 17, 2024 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants