Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An include pattern should infer the inclusion of parent directories #176

Closed
tbain98 opened this issue Sep 10, 2017 · 14 comments
Closed

An include pattern should infer the inclusion of parent directories #176

tbain98 opened this issue Sep 10, 2017 · 14 comments

Comments

@tbain98
Copy link

tbain98 commented Sep 10, 2017

The guide has an example of how a user can go wrong with the Duplicacy inclusion pattern paradigm if they don't explicitly include all parent directories.

This seems backwards/broken. The better behavior would be to infer that all parent directories along the path of any included content are deemed to be included, but that those parent directories' other content is not deemed to be included nor excluded.

Changing the behavior to what I've proposed here would ensure that the expected behavior occurs in the first example: foo, foo/bar, and foo/bar/* are all included, but no other content is.

EDIT:
I've realized that what I wrote originally was ambiguous, and it's possible that people read what I wrote and inferred the other interpretation. I've left the original request untouched, but the content below clarifies what I meant.

When I said that I want the parent folder(s) to be "included", I do not mean that I want them included in the content that is stored as part of the backup unless a regex matches at least one file or directory somewhere below them. My sole intent was to say that parent folders should be visited (the word "included" was a poor word choice); that is, they shouldn't be deemed to be excluded from consideration because they were not explicitly specified.

If I include a regex of foo/bar.*, I want my backup set to have a root-level folder named bar, along with all content contained within the source foo/bar folder (including all recursive sub-folders), but not to contain a folder named foo as a parent to bar. In this example, I'm specifying that I want bar included, but I have not said that I want foo included. (If I wanted foo included, I'd need to specify an additional include for foo.) But according to the documentation I quoted, visiting bar at all requires that foo be included in the backup (which forces the full path to bar to be included in the backed-up content, limiting flexibility on restores); making that no longer be the case is the only change I'm requesting here.

@cristihcd
Copy link

In completely agree with that. It would also make the filters file cleaner.

@leftytennis
Copy link
Contributor

I just submitted a patch for supporting include/exclude regular expression filters (PR #187). If/when this is included in duplicacy, it fundamentally solves the issue that the current filter matching requires that all parent directories are included, etc. For example, using the new regex filters, the following would behave as expected as well:

i:foo/bar/.*
e:.*

This would match anything beneath foo/bar/ and exclude everything else. Coversely, reversing the include/exclude would also behave as expected:

e:foo/bar/.*
i:*

Everything under foo/bar/ would be excluded, and everything else included.

@leftytennis
Copy link
Contributor

leftytennis commented Sep 14, 2017

I believe this can be closed after inclusion of PR #187. It doesn't change the behavior of the original wildcard filters, but offers a new regular expression filter that can be used to achieve the desired behavior.

@tbain98
Copy link
Author

tbain98 commented Sep 15, 2017

@jt70471 Can you please explain why the behavior you implemented (thanks for doing that, BTW) would behave differently in terms of including parent directories? I would expect your regex sample from your first example to exclude foo, resulting in foo/bar not being considered at all, same as for the wildcard-based implementation. What am I not seeing?

@leftytennis
Copy link
Contributor

leftytennis commented Sep 15, 2017

Edit: actual testing shows I was wrong! :) You do indeed need to include a parent directory regex that matches so that duplicacy will descend into sub-directories. Updated example below.

I didn't dig into how/why the wildcard patterns require that parent directories need to be specified before any sub-directories to be considered. In the example, I quoted, assume the following directory listing:

foo/file1
foo/file2
foo/bar/
foo/bar/file3
foo/bar/file4
foo/baz/
foo/baz/file5

The regular expressions are evaluated in the order they are encountered from filters, so assuming the filters file contains:

# the $ tells regex that it matches the end of the string, so that's the directory itself
# it needs to match or else foo/bar/ is never seen. You can anchor the start of the string match to the first position by starting the regex with the "^" character as seen below.
i:^foo/$
i:^foo/bar/.*
e:.*

The resulting regex evaluation would be:

foo/file1 - match e:.* (excluded)
foo/file2 - match e:* (excluded)
foo/bar/ - match i:^foo/$ (included)
foo/bar/file3 - match i:^foo/bar/.* (included)
foo/bar/file4 - match i:^foo/bar/.* (included)
foo/baz/ - match e:.* (excluded)
foo/baz/file5 - match e:.* (excluded)

Sorry to have misled you with incorrect information. I'll review the guide update I made to see what changes are needed to clarify how parent/sub-directory matching needs to occur.

@tbain98
Copy link
Author

tbain98 commented Sep 15, 2017 via email

@tbain98
Copy link
Author

tbain98 commented Sep 16, 2017

@gilbertchen Please re-open this issue; it was not resolved by PR #187. @jt70471's comment two above this one indicates that although he originally thought that it would, further testing indicates that explicit inclusion of the parent directories is still required with regexes as currently implemented.

@leftytennis
Copy link
Contributor

leftytennis commented Sep 16, 2017

It is completely dependant on how you have you include/exclude filters coded. The same applies to the pattern matching filters or the new regex filters.

You have to ensure that a parent folder is not excluded during filter matching, otherwise the sub-directory folders will never be traversed by duplicacy and therefore will never be matched.

For example, consider the following directory/file structure:

imac02:~/testme$ ls -lR foo
total 0
drwxr-xr-x+ 4 jefft  staff  136 Sep 16 16:46 bar
drwxr-xr-x+ 4 jefft  staff  136 Sep 16 16:46 baz
-rw-r--r--+ 1 jefft  staff    0 Sep 16 16:46 file1
-rw-r--r--+ 1 jefft  staff    0 Sep 16 16:46 file2

foo/bar:
total 0
-rw-r--r--+ 1 jefft  staff  0 Sep 15 03:42 file3
-rw-r--r--+ 1 jefft  staff  0 Sep 15 03:42 file4

foo/baz:
total 0
-rw-r--r--+ 1 jefft  staff  0 Sep 15 03:42 file5
-rw-r--r--+ 1 jefft  staff  0 Sep 15 03:42 file6

and the following filters shown in GUIDE.md:

+foo/bar/*
-*

As explained in GUIDE.md, foo/bar/* will not be included, because its parent directory foo/ is excluded and no files/sub-directories beneath it are traversed and subsequently, no attempt to match those files/sub-directories against any filter. In order to have foo/bar/* included, you need to ensure its parent directory(s) are included, like so:

+foo/bar/*
+foo/
-*

If your filter logic were reversed, i.e. include everything except specific exclusions, then you need not specify parent directories. In fact, this is why I originally commented that the regex filters worked different. It's because my filter logic had an include everything as the last filter, which was preceded by a number of exclusion filters. The example to exclude only foo/bar/* and include everything else shows that you need not include a filter that matches the parent, because the last filter matches the parent, etc:

-foo/bar/*
+*

@tbain98
Copy link
Author

tbain98 commented Sep 17, 2017

@jt70471 I agree that that's how the code currently works, and that's why I'm requesting that it be changed.

When I first wrote the bug, I was thinking that it was enough to just include the parent directories when you have a child that matches, but as I thought more about it I realized that there's something more fundamental wrong, that I'm requesting be changed in addition. I fundamentally disagree with the premise that regexes are applied in a hierarchical fashion, where matching an exclusion filter at the bottom of the list would result in failing to evaluate any files or sub-directories within the directory even though they might match inclusion filters that are higher in the list. I understand that @gilbertchen did that for speed reasons, but I think it results in fundamentally unintuitive behavior (why is why he's spent time fielding questions about it from new users who expect different behavior).

For example, if you have a repo with only:

/foo
/foo/test.txt

And you have the following regex-based filters:

i:.*\.txt
e:.*

In that scenario, text.txt should be included (and therefore so should /foo, since it's necessary to store test.txt properly). Choosing to skip foo entirely means that you don't test /foo/test.txt for inclusion, which means that you don't include it even though you should (because it passes an inclusion filter before it passes an exclusion filter).

So what I'm requesting here is that the algorithm be changed to enumerate all content under the repository's root directory and apply the regex list to every single file and directory. I'm aware that this will take some time due to CPU and disk I/O (though I'm very curious to see how much), so I'd be OK with making the choice of which algorithm to use be configurable via the preferences file, so that anyone who's willing to put extra stuff into their includes and excludes in order to get better speed has that option.

@leftytennis
Copy link
Contributor

Agreed there would be overhead associated with doing what you suggest. It would be more intuitive I believe and certainly more flexible.

@tbain98
Copy link
Author

tbain98 commented Sep 19, 2017

@gilbertchen Please re-open this issue; it was not resolved by PR #187. @jt70471's comment two above this one indicates that although he originally thought that it would, further testing indicates that explicit inclusion of the parent directories is still required with regexes as currently implemented.

@tbain98
Copy link
Author

tbain98 commented Sep 28, 2017

Since this issue is closed, I've submitted #216 to replace it, with a much clearer explanation of what I'm requesting/proposing. So please submit any further comments about the proposal on #216, not this issue.

@gilbertchen
Copy link
Owner

Sorry I don't know how I could miss your request to re-open this issue -- it shouldn't have been closed in the first place and I must be thinking something else when I closed it.

@tbain98
Copy link
Author

tbain98 commented Sep 28, 2017

At the time, @jt70471 was convinced that his fix addressed this though I wasn't seeing how. We later figured out that his test case was skewing the analysis, but never were able to get it re-opened.

The new issue captures the idea better anyway, so it's a net win in the end even though it was a little frustrating along the way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants