Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Apache] Traverse the include path to the root where relevant #7642

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

joohoi
Copy link
Member

@joohoi joohoi commented Dec 13, 2019

This PR allows Certbot to utilize full include tree for searches that need to know about the ancestors. Search for mod_macro block is included in this PR. The functionality also takes the possibility of multiple includes into account.

Because of the limitations of the current parsing engine, we have to search includes from the root up. This also introduces quite big performance penalty, especially if done dynamically. This is why a dictionary of Include, IncludeOptions and their values are saved to the parser object, and refreshed when new includes are added.

@ohemorange
Copy link
Contributor

Before getting into the code, I'd love some high-level clarifications about this PR, probably mostly since it's been a while since before the holidays.

Basically, I'm wondering how much this is worth it. As I understand it, this code is a) pretty specific to the augeas parser implementation, which makes it relatively temporary and b) it involves caching results, which is a prime breeding ground for bugs. This combination might be worth it, but it seems like we're doing this to get a more correct implementation than we previously had (searching down include paths, because the augeas path won't have the full logical path in it). And on top of that, are there searches other than the macro search that's changed in this PR?

Were there other factors I'm missing here? Because based on just these, I'm surprised to see this PR.

@joohoi
Copy link
Member Author

joohoi commented Jan 6, 2020

The main need for this PR is to allow runtime assertions to be utilized without the need of re-implementing a minor bug to the apacheparser implementation that has existed in the code for a long time.

The bug exists because we're simply reading the Augeas path of VirtualHost block to determine if it's wrapped in a <Macro> block. While this catches the majority of use cases, there may be configurations that it's not able to parse correctly because of how Augeas paths work.

This would not trigger the bug:

<Macro VHost $domain>
  <VirtualHost *:80>
    ServerName $domain
    ...
  </VirtualHost>
</Macro>
Use VHost example.com
Use VHost example.org

But this would:

# vhost.conf
<VirtualHost *:80>
  ServerName $domain
  ...
</VirtualHost>
# end of vhost.conf

# main.conf
<Macro VHost $domain>
  IncludeOptional vhost.conf
</Macro>
Use VHost example.com
Use VHost example.org
# end of main.conf

In the bugged example, we would be checking the /path/to/vhost.conf/VirtualHost to determine if it's in a mod_macro block.

Copy link
Contributor

@ohemorange ohemorange left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the overall logic seems to be implemented correctly, where possible we should attempt to simplify this, because as it is now it's a bit tricky to follow, which makes it more error-prone than I'm comfortable with.

# Removes ../ etc.
path = os.path.normpath(path)
if not os.path.isabs(path):
path = os.path.join(root, path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's probably not a problem, but I'd feel more comfortable if the normpath call came after the join, just in case the root isn't already norm.

@@ -560,6 +570,124 @@ def find_comments(self, arg, start=None):
results.append(comment)
return results

def find_blocks_from_include_tree(self, block, path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I see where the name is coming from, it's a bit opaque when reading the calling code. Especially if it's only being called in one place, leaving it as returning a boolean could make it more clear. Something like block_type_present_in_ancestors or something to that effect?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context here is that we will end up using this for reverse searching ancestor blocks in the Augeas implementation of ParseNode intefraces later, so the current functionality is needed for that purpose.


block_paths = [] # type: List[str]
# Queue tracks the paths that we need to search for
queue = list()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
queue = list()
queue = []

to match block_paths?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, we seem to use both types of initialization in the code. I'd prefer that new code picks one form though.

found = re.search(block, path[startidx:], flags=re.IGNORECASE)
# We want the list to be from the leaf to the root
block_paths.reverse()
return block_paths
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a use case for actually returning the path itself, rather than just an inclusion test? If not, this could could be much smaller.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we do. Not within this PR but upcoming ones that will depend on this functionality. See previous comment

if path.startswith(get_aug_path(self.loc["root"])):
return []

if not self.includes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're going to have this here anyway, we can probably take it out of the augeas init, and simply none-out instead of calling _find_all_includes in add_dir_to_ifmodssl, saving some compute time with a lazy pattern.

filepath = apache_util.get_file_path(path)
include_paths = []
for inc in self.includes:
incpath = apache_util.normalize_filepath(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like the sort of preprocessing step that most correctly belongs in _find_all_includes, unless I'm missing something that relies on the specific check here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency and reusability, I'd prefer keeping the exact include directive contents intact (which is handled by _find_all_includes) and do the path normalization when comparing against that data. This also lets us avoid the re-searching through the whole tree from root to leaves when subsequently looking if a file might be included by the existing Include and IncludeOptional directives. One of such cases would be when adding a new HTTPS virtualhost file, where we need to figure out if the created files filepath is already included in a wildcard include.

block_paths.reverse()
return block_paths

def find_includes_for_path(self, path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the code is correct here, the semantics of this are a bit obfuscated -- paths aren't included, files are, and while in practice we're technically taking the file part of the path and checking for that, it still seems weird to have a function that finds includes for a path rather than a file. Basically having a too-smart helper function feels weird. How would you feel about taking the filename as an argument to this function? And maybe separating out the processing of path to file into a separate function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of filepaths, and while the actual files are included in the configuration tree, they are included by their paths. These paths might also be wildcard ones, so a subsequently created file that wasn't included in the configuration tree might be automatically included by its path. This function is trying to look up Include and IncludeOptional directives that would include the filepath that we provide as an argument.

This was just explaining the naming, and I'm open to discuss the semantics further, but would prefer keeping it as it is based on the above.

@joohoi
Copy link
Member Author

joohoi commented Jan 29, 2020

We agreed that it'd be good to write an additional test, testing and demonstrating the use of paths returned from the function find_blocks_from_include_tree() and use apache-parser-v2 tests to do so. Unfortunately the apache-parser-v2 tree didn't have specific tests doing reverse lookups for specific conditional directives so I wrote a new one.

The new test here demonstrates the functionality needed for doing reverse searches for ancestors with specific parameters, IfModule used as an example.

@ohemorange
Copy link
Contributor

To summarize plans discussed elsewhere:

Step 1. Merge apache-parser-v2 into master (done)
Step 2. Rewrite find_ancestors() to call find_blocks_from_include_tree()

for ancestor in self.parser.find_blocks_from_include_tree(name)
    anc = self._create_blocknode(ancestor)
    ancestors.append(anc)
return ancestors

Step 3. Review code all working together properly to be more confident that find_blocks_from_include_tree() is implemented correctly

@bmw bmw added the priority: unplanned Work that we believe should be done, but does not have a higher priority. label Mar 25, 2020
@bmw
Copy link
Member

bmw commented Apr 1, 2020

What's the status of this PR? Should we try to get it merged in the short term or should we save the work for later?

If the latter, I'd prefer this PR was closed to clean out the PR queue a bit. If we do this, the way I usually do it is to link the branch containing this work from the issue it addresses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: unplanned Work that we believe should be done, but does not have a higher priority.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants