fix(ingest/s3): Path spec aware folder traversal #8095
…s not a partition column
for i in range(slash_to_remove_from_glob):
    glob_include = glob_include.rsplit("/", 1)[0]
Why are we removing one more slash than slash_to_remove_from_glob?
I added a comment and restructured the code to make it more clear.
    bucket_name=bucket_name, folder=sorted_dirs[0] + "/"
)
for dir in sorted_dirs:
    if path_spec.dir_allowed(f"{protocol}" + bucket_name + "/" + dir + "/"):
- if path_spec.dir_allowed(f"{protocol}" + bucket_name + "/" + dir + "/"):
+ if path_spec.dir_allowed(f"{protocol}{bucket_name}/{dir}/"):
  dir_to_process = self.get_dir_to_process(
-     bucket_name=bucket_name, folder=f + "/"
+     bucket_name=bucket_name,
+     folder=f + "/",
Not related to this PR, but appending the "/" suffix could be moved inside get_dir_to_process, before the call to list_folders.
Is this file added intentionally?
Yes, without the new logic, this file would have been picked up.
okay, nice.
A couple minor requests but don't think this needs a rereview.
    return protocol
else:
    raise ValueError(
        f"Unable to get protocol or invalid protocol form path: {path}"
Typo: form -> from
for i in range(slash_to_remove_from_glob):
    glob_include = glob_include.rsplit("/", 1)[0]
Could we get this via glob_include.rsplit("/", i)[0], so I guess glob_include.rsplit("/", slash_to_remove_from_glob + 1)[0] in total? If that's the first time it's defined.
Thanks, I made it more clear in the code.
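For reference, a minimal sketch of the equivalence discussed in this thread: calling rsplit("/", 1)[0] in a loop n times yields the same string as a single rsplit("/", n)[0]. The function names and the sample glob are hypothetical and not taken from the PR; whether an extra "+ 1" is needed depends on surrounding code that is not shown here.

```python
def strip_segments_loop(path: str, n: int) -> str:
    """Drop the last n "/"-separated segments, one rsplit per iteration."""
    for _ in range(n):
        path = path.rsplit("/", 1)[0]
    return path


def strip_segments_once(path: str, n: int) -> str:
    """Drop the last n "/"-separated segments with a single rsplit call."""
    if n <= 0:
        # A negative maxsplit would split on every "/", so guard explicitly.
        return path
    return path.rsplit("/", n)[0]


# Hypothetical glob, used only for illustration.
sample_glob = "s3://bucket/data/*/year=*/*.parquet"
print(strip_segments_loop(sample_glob, 2))
print(strip_segments_once(sample_glob, 2))
```

Both calls print the same prefix, so the loop can be collapsed into one call without changing behavior.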
Folder traversal needs to be aware of the path_spec and must not descend into folders that cannot match it.
The logic relies heavily on the fact that a path spec must specify each folder level explicitly.
This PR depends on #8089.
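To illustrate the idea, here is a minimal sketch of a path-spec-aware directory check. It is not the implementation from this PR: the function name dir_allowed_sketch, the per-segment fnmatch matching, and the sample glob are all assumptions made for the example. It relies only on the property stated above, that every folder level appears in the path spec, so a directory can be compared level by level against the glob.

```python
from fnmatch import fnmatchcase


def dir_allowed_sketch(glob_include: str, dir_path: str) -> bool:
    """Hypothetical check: can dir_path still lead to a file matching glob_include?"""
    # The last glob segment describes the file name, so only the preceding
    # segments describe folder levels.
    folder_patterns = glob_include.split("/")[:-1]
    dir_parts = dir_path.rstrip("/").split("/")
    # A directory deeper than the glob's folder levels can never match.
    if len(dir_parts) > len(folder_patterns):
        return False
    # Every directory level must match the glob segment at the same depth.
    return all(
        fnmatchcase(part, pattern)
        for part, pattern in zip(dir_parts, folder_patterns)
    )


# Hypothetical path spec, used only for illustration.
include = "s3://bucket/data/*/year=*/*.parquet"
for candidate in (
    "s3://bucket/data/sales/",
    "s3://bucket/logs/",
    "s3://bucket/data/sales/month=01/",
):
    print(candidate, dir_allowed_sketch(include, candidate))
```

With a check like this, traversal can skip a folder such as logs/ at the top level instead of listing everything beneath it.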