Avoid hidden file in read only data stores for .zip #1628

Open
YapengLang opened this issue Nov 17, 2023 · 7 comments
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed

Comments

@YapengLang
Collaborator

Describe the bug
When opening a data store on a .zip file, hidden files whose names start with "." show up in the members list.

To Reproduce
demo: loci.zip

from cogent3 import open_data_store
ds = open_data_store("loci.zip", suffix="phy", mode="r")
ds.members[:5]

you will get:

[DataMember(data_store=/Users/yapenglang/Documents/lab/phylomania2023/c3workshop/data/loci.zip, unique_id=locus213.phy),
 DataMember(data_store=/Users/yapenglang/Documents/lab/phylomania2023/c3workshop/data/loci.zip, unique_id=._locus213.phy),
 DataMember(data_store=/Users/yapenglang/Documents/lab/phylomania2023/c3workshop/data/loci.zip, unique_id=locus207.phy),
 DataMember(data_store=/Users/yapenglang/Documents/lab/phylomania2023/c3workshop/data/loci.zip, unique_id=._locus207.phy),
 DataMember(data_store=/Users/yapenglang/Documents/lab/phylomania2023/c3workshop/data/loci.zip, unique_id=locus43.phy)]
@GavinHuttley GavinHuttley added bug Something isn't working help wanted Extra attention is needed good first issue Good for newcomers labels Nov 17, 2023
@GaoPeizhong

I would like to work on this issue.

@GaoPeizhong

The simplest approach is to iterate over all the members and use a condition to drop the hidden files, keeping only the ones we want.
You just need to replace line 3 of the reproduction snippet with this:
ds.filtered_members = [member for member in ds.members if not member.unique_id.startswith('._')]
ds.filtered_members[:5]
Let me know if anything further is needed; I'm not sure what else you want implemented here.

@GavinHuttley
Collaborator

GavinHuttley commented Mar 12, 2024

That "patches" the system for a single use case but is not a general solution to the problem.

A DataStore is a collection of data records which will be operated on by different processes. If you look at how the members attribute is defined, it's actually a property which itself just returns the sum of the completed and not_completed properties. So you need to look at the implementation of both those for the ReadOnlyDataStoreZipped class.
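To illustrate why the filter belongs in completed and not_completed rather than at the call site, here is a minimal sketch. `ToyStore` and its `_visible` helper are hypothetical names made up for this example; this is not the cogent3 implementation, only the shape of the relationship described above.

```python
class ToyStore:
    """Hypothetical stand-in for a data store; not the cogent3 implementation."""

    def __init__(self, completed_names, not_completed_names):
        self._completed = list(completed_names)
        self._not_completed = list(not_completed_names)

    @staticmethod
    def _visible(names):
        # Drop anything whose name starts with ".", which also covers "._".
        return [n for n in names if not n.startswith(".")]

    @property
    def completed(self):
        return self._visible(self._completed)

    @property
    def not_completed(self):
        return self._visible(self._not_completed)

    @property
    def members(self):
        # Plain concatenation: no filtering is needed here.
        return self.completed + self.not_completed


store = ToyStore(
    completed_names=["locus213.phy", "._locus213.phy", "locus207.phy"],
    not_completed_names=["locus43.log", ".locus43.log"],
)
print(store.members)
```

Because members only concatenates the two properties, every consumer of any of the three sees the same filtered view, which a one-off comprehension at the call site cannot guarantee.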

Bear in mind that the file filter should be based on whether the name starts with "." rather than "._".

Finally, to really fix this issue you need to add a new test in tests/test_app/test_data_store.py. That test should add the zip attached to this issue into tests/data. The test should fail before you add your proposed fix, and pass after you add your fix.
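The fail-before, pass-after expectation can be sketched with the standard library alone. Everything named below (`write_fixture`, `member_ids`, `has_hidden`, the `skip_hidden` flag) is a hypothetical stand-in; in the real test the ids would presumably come from the members of a store opened as in the reproduction above, and the archive would be the one attached to this issue rather than a generated one.

```python
import pathlib
import tempfile
import zipfile


def write_fixture(path):
    """Write a tiny archive in which every record has a '._' twin."""
    with zipfile.ZipFile(path, "w") as archive:
        for stem in ("locus1", "locus2"):
            archive.writestr(f"loci/{stem}.phy", "placeholder")
            archive.writestr(f"loci/._{stem}.phy", "placeholder")
    return path


def member_ids(path, skip_hidden):
    """Stand-in for reading unique_id values from a store's members."""
    with zipfile.ZipFile(path) as archive:
        ids = [pathlib.PurePosixPath(n).name for n in archive.namelist()]
    return [i for i in ids if not (skip_hidden and i.startswith("."))]


def has_hidden(ids):
    """The condition the regression test would assert to be False."""
    return any(i.startswith(".") for i in ids)


with tempfile.TemporaryDirectory() as tmp:
    fixture = write_fixture(pathlib.Path(tmp) / "loci.zip")
    before = member_ids(fixture, skip_hidden=False)  # behaviour without a fix
    after = member_ids(fixture, skip_hidden=True)  # behaviour with a fix

print(has_hidden(before), has_hidden(after))
```

The unfiltered listing trips the check and the filtered one does not, which is the property the new test should have against the code base before and after the change.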

@GaoPeizhong

I understand what you're asking for now, but I ran into a problem while working on it: you said to add loci.zip to tests/data, but the link given above seems to be broken (Error 404 Not Found).

@YapengLang
Collaborator Author

@GaoPeizhong

I've solved part of the problem for now. I've added a new method to ReadOnlyDataStoreZipped so that the original reproduction code returns only the non-hidden members.
In /workspaces/Python/cogent3/src/cogent3/app/data_store.py, find class ReadOnlyDataStoreZipped(DataStoreABC): and add the following code below it:
    def _iter_matches(self, subdir: str, pattern: str) -> Iterator[PathLike]:
        with zipfile.ZipFile(self._source) as archive:
            names = archive.namelist()
            for name in names:
                name = pathlib.Path(name)
                if name.name.startswith('.'):
                    continue
                if subdir and name.parent.name != subdir:
                    continue
                if name.match(pattern):
                    yield name
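For trying the same logic outside the class, here is a standalone sketch. The function name `iter_visible`, its `source` parameter and the in-memory archive are assumptions made for this example, not part of cogent3. It uses `pathlib.PurePosixPath` instead of `pathlib.Path` because zip entry names always use forward slashes.

```python
import io
import pathlib
import zipfile
from typing import Iterator


def iter_visible(source, subdir: str, pattern: str) -> Iterator[pathlib.PurePosixPath]:
    """Yield archive entries matching pattern, skipping names that start with '.'."""
    with zipfile.ZipFile(source) as archive:
        names = archive.namelist()
    for name in names:
        path = pathlib.PurePosixPath(name)
        if path.name.startswith("."):
            continue  # hidden file, e.g. "._locus1.phy"
        if subdir and path.parent.name != subdir:
            continue  # entry lives in a different directory
        if path.match(pattern):
            yield path


# Build a small archive in memory to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for entry in ("loci/locus1.phy", "loci/._locus1.phy", "loci/notes.txt", "other/locus2.phy"):
        archive.writestr(entry, "placeholder")

found = [str(p) for p in iter_visible(buffer, subdir="loci", pattern="*.phy")]
print(found)
```

Only the final path component is checked, so an ordinary file inside a directory whose name starts with "." would still be yielded; whether that case matters for this issue is not settled in the thread.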
