An include pattern should infer the inclusion of parent directories #176
Comments
I completely agree with that. It would also make the filters file cleaner. |
I just submitted a patch for supporting include/exclude regular expression filters (PR #187). If/when this is included in duplicacy, it fundamentally solves the issue that the current filter matching requires that all parent directories are included, etc. For example, using the new regex filters, the following would behave as expected as well:
This would match anything beneath foo/bar/ and exclude everything else. Conversely, reversing the include/exclude would also behave as expected:
Everything under foo/bar/ would be excluded, and everything else included. |
I believe this can be closed after inclusion of PR #187. It doesn't change the behavior of the original wildcard filters, but offers a new regular expression filter that can be used to achieve the desired behavior. |
@jt70471 Can you please explain why the behavior you implemented (thanks for doing that, BTW) would behave differently in terms of including parent directories? I would expect your regex sample from your first example to exclude foo, resulting in foo/bar not being considered at all, same as for the wildcard-based implementation. What am I not seeing? |
Edit: actual testing shows I was wrong! :) You do indeed need to include a parent directory regex that matches so that duplicacy will descend into sub-directories. Updated example below. I didn't dig into how/why the wildcard patterns require that parent directories be specified before any sub-directories can be considered. In the example I quoted, assume the following directory listing:
The regular expressions are evaluated in the order they are encountered from filters, so assuming the filters file contains:
The resulting regex evaluation would be:
Sorry to have misled you with incorrect information. I'll review the guide update I made to see what changes are needed to clarify how parent/sub-directory matching needs to occur. |
I haven't run your change yet; I don't need regexes badly enough at the moment to be worth the effort of building from source, so I'm just waiting for 2.0.10 to be released. And I haven't yet constructed a wildcard-based scenario that will hit the problem described by the guide, so I'm relying rather heavily on the documentation plus what I saw in your code that I reviewed.
With that large caveat, I'm not seeing an obvious reason why your regex-based solution would decide that foo (which you left out of your last comment) was included but the wildcard-based solution would decide that it was excluded (and therefore so was everything below it).
I hope to get some time this weekend to actually set up that scenario and see what happens.
|
@gilbertchen Please re-open this issue; it was not resolved by PR #187. @jt70471's comment two above this one indicates that although he originally thought that it would, further testing indicates that explicit inclusion of the parent directories is still required with regexes as currently implemented. |
It is completely dependent on how you have your include/exclude filters coded. The same applies to the pattern matching filters or the new regex filters. You have to ensure that a parent folder is not excluded during filter matching; otherwise its sub-directories will never be traversed by duplicacy and therefore will never be matched. For example, consider the following directory/file structure:
and the following filters shown in GUIDE.md:
As explained in GUIDE.md, foo/bar/* will not be included, because its parent directory foo/ is excluded: no files or sub-directories beneath it are traversed, so no attempt is made to match them against any filter. In order to have foo/bar/* included, you need to ensure its parent directories are included, like so:
If your filter logic were reversed, i.e. include everything except specific exclusions, then you need not specify parent directories. In fact, this is why I originally commented that the regex filters worked differently. It's because my filter logic had an include-everything pattern as the last filter, which was preceded by a number of exclusion filters. The example to exclude only foo/bar/* and include everything else shows that you need not include a filter that matches the parent, because the last filter matches the parent, etc:
|
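The pruning behavior described in the comment above can be sketched as a small model. This is not Duplicacy's code: the directory tree, the regex patterns, the `included` and `pruned_walk` helpers, and the rule that an unmatched path is excluded are all assumptions made for illustration, and the patterns are not the ones from the lost examples in this thread.

```python
import re

# Hypothetical in-memory tree: directory -> children (directories end with "/").
TREE = {
    "": ["foo/", "other.txt"],
    "foo/": ["foo/bar/", "foo/notes.txt"],
    "foo/bar/": ["foo/bar/a.txt"],
}

def included(path, filters):
    """First matching filter wins; unmatched paths are excluded (an assumption)."""
    for sign, pattern in filters:
        if re.search(pattern, path):
            return sign == "+"
    return False

def pruned_walk(filters, directory=""):
    """Walk top-down, never descending into an excluded directory."""
    kept = []
    for child in TREE[directory]:
        if not included(child, filters):
            continue  # an excluded directory hides everything beneath it
        kept.append(child)
        if child.endswith("/"):
            kept.extend(pruned_walk(filters, child))
    return kept

# Illustrative filter lists, evaluated in order.
without_parent = [("+", r"^foo/bar/"), ("-", r".*")]
with_parent = [("+", r"^foo/bar/"), ("+", r"^foo/$"), ("-", r".*")]
exclude_only = [("-", r"^foo/bar/"), ("+", r".*")]

print(pruned_walk(without_parent))
print(pruned_walk(with_parent))
print(pruned_walk(exclude_only))
```

In this model the first list keeps nothing, because foo/ falls through to the catch-all exclusion and is never entered. The third list needs no parent pattern, since its final include-everything filter already matches foo/.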
@jt70471 I agree that that's how the code currently works, and that's why I'm requesting that it be changed. When I first wrote the bug, I was thinking that it was enough to just include the parent directories when you have a child that matches, but as I thought more about it I realized that there's something more fundamentally wrong, which I'm requesting be changed in addition. I fundamentally disagree with the premise that regexes are applied in a hierarchical fashion, where matching an exclusion filter at the bottom of the list would result in failing to evaluate any files or sub-directories within the directory even though they might match inclusion filters that are higher in the list. I understand that @gilbertchen did that for speed reasons, but I think it results in fundamentally unintuitive behavior (which is why he's spent time fielding questions about it from new users who expect different behavior). For example, if you have a repo with only:
And you have the following regex-based filters:
In that scenario, test.txt should be included (and therefore so should /foo, since it's necessary to store test.txt properly). Choosing to skip foo entirely means that you don't test /foo/test.txt for inclusion, which means that you don't include it even though you should (because it passes an inclusion filter before it passes an exclusion filter). So what I'm requesting here is that the algorithm be changed to enumerate all content under the repository's root directory and apply the regex list to every single file and directory. I'm aware that this will take some time due to CPU and disk I/O (though I'm very curious to see how much), so I'd be OK with making the choice of which algorithm to use be configurable via the preferences file, so that anyone who's willing to put extra stuff into their includes and excludes in order to get better speed has that option. |
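A minimal sketch of the requested behavior, under the same assumptions as any toy model: the path list, the patterns, and the `included` and `full_scan` helpers are hypothetical, and an unmatched path is treated as excluded. Every path is tested against the ordered filter list, and the ancestors of anything kept are added afterwards so the entry can be stored at its real location.

```python
import re
from pathlib import PurePosixPath

def included(path, filters):
    """First matching filter wins; unmatched paths are excluded (an assumption)."""
    for sign, pattern in filters:
        if re.search(pattern, path):
            return sign == "+"
    return False

def full_scan(all_paths, filters):
    """Test every path against the filters, then add ancestors of anything kept."""
    kept = {p for p in all_paths if included(p, filters)}
    for path in list(kept):
        # Parents are added only so the kept entry can live at its real path.
        for parent in PurePosixPath(path).parents:
            if str(parent) != ".":
                kept.add(str(parent) + "/")
    return sorted(kept)

# Hypothetical repository contents and filters (directories end with "/").
all_paths = ["foo/", "foo/test.txt"]
filters = [("+", r"test\.txt$"), ("-", r"^foo/")]

print(full_scan(all_paths, filters))
```

Here foo/ on its own matches only the exclusion, but foo/test.txt matches the inclusion first, so both end up in the result. The cost is that every path is enumerated and matched, which is the overhead the comment acknowledges.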
Agreed there would be overhead associated with doing what you suggest. It would be more intuitive I believe and certainly more flexible. |
Sorry, I don't know how I could have missed your request to re-open this issue; it shouldn't have been closed in the first place, and I must have been thinking of something else when I closed it. |
At the time, @jt70471 was convinced that his fix addressed this though I wasn't seeing how. We later figured out that his test case was skewing the analysis, but we were never able to get it re-opened. The new issue captures the idea better anyway, so it's a net win in the end even though it was a little frustrating along the way. |
The guide has an example of how a user can go wrong with the Duplicacy inclusion pattern paradigm if they don't explicitly include all parent directories.
This seems backwards/broken. The better behavior would be to infer that all parent directories along the path of any included content are deemed to be included, but that those parent directories' other content is not deemed to be included nor excluded.
Changing the behavior to what I've proposed here would ensure that the expected behavior occurs in the first example: foo, foo/bar, and foo/bar/* are all included, but no other content is.
EDIT:
I've realized that what I wrote originally was ambiguous, and it's possible that people read what I wrote and inferred the other interpretation. I've left the original request untouched, but the content below clarifies what I meant.
When I said that I want the parent folder(s) to be "included", I do not mean that I want them included in the content that is stored as part of the backup unless a regex matches at least one file or directory somewhere below them. My sole intent was to say that parent folders should be visited (the word "included" was a poor word choice); that is, they shouldn't be deemed to be excluded from consideration because they were not explicitly specified.
If I include a regex of foo/bar.*, I want my backup set to have a root-level folder named bar, along with all content contained within the source foo/bar folder (including all recursive sub-folders), but not to contain a folder named foo as a parent to bar. In this example, I'm specifying that I want bar included, but I have not said that I want foo included. (If I wanted foo included, I'd need to specify an additional include for foo.) But according to the documentation I quoted, visiting bar at all requires that foo be included in the backup (which forces the full path to bar to be included in the backed-up content, limiting flexibility on restores); making that no longer be the case is the only change I'm requesting here.
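One possible reading of the clarified request, sketched as a toy model rather than a proposal for Duplicacy's actual implementation: every path is visited, only matching paths are stored, and a matching entry whose parent did not match becomes a root-level entry in the backup. The `stored_paths` helper, the path list, and the re-rooting rule are all assumptions made for illustration.

```python
import re
from pathlib import PurePosixPath

def stored_paths(all_paths, include_pattern):
    """Map each matching source path to its (hypothetical) path in the backup.

    A matching entry whose parent did not match becomes a root-level entry;
    its descendants keep their position relative to that entry.
    """
    regex = re.compile(include_pattern)
    mapping = {}
    # Sorting places every directory before its own contents.
    for source in sorted(all_paths):
        if not regex.match(source):
            continue  # visited, but not stored
        path = PurePosixPath(source)
        parent = str(path.parent)
        if parent in mapping:
            mapping[str(path)] = mapping[parent] + "/" + path.name
        else:
            mapping[str(path)] = path.name
    return mapping

# Hypothetical source tree, written without trailing slashes.
all_paths = [
    "foo",
    "foo/bar",
    "foo/bar/a.txt",
    "foo/bar/sub",
    "foo/bar/sub/b.txt",
    "foo/other.txt",
]

for source, stored in stored_paths(all_paths, r"foo/bar.*").items():
    print(source, "->", stored)
```

With the regex foo/bar.*, foo is visited but never stored, and bar appears at the root of the stored set with its contents beneath it.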