fix(ingest/s3): Path spec aware folder traversal #8095

Merged


treff7es
Contributor

Folder traversal needs to be aware of the path_spec and must not descend into folders that cannot match it.
The logic relies heavily on the fact that a path spec must specify every folder level explicitly.

This PR depends on #8089
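
As an illustration of the idea described above, here is a minimal sketch of a path-spec-aware directory check. It is not the code from this PR: the standalone dir_allowed function, its include parameter, and the example bucket paths are assumptions made for illustration, and it only handles shell-style wildcards such as *, not the full path_spec syntax. It relies on the property mentioned above, that the include pattern names every folder level, so a directory can be compared with the pattern one segment at a time.

```python
from fnmatch import fnmatchcase


def dir_allowed(include: str, dir_path: str) -> bool:
    """Return True if dir_path can still lead to a file matching include.

    Hypothetical helper for illustration only, not DataHub's implementation.
    """
    # Folder levels of the include pattern; the last segment is the file pattern.
    include_dirs = include.split("/")[:-1]
    dir_parts = dir_path.rstrip("/").split("/")

    # A directory deeper than the pattern's folder levels can never match.
    if len(dir_parts) > len(include_dirs):
        return False

    # Every level seen so far must match the pattern at the same depth.
    return all(
        part == pattern or fnmatchcase(part, pattern)
        for part, pattern in zip(dir_parts, include_dirs)
    )


include = "s3://my-bucket/data/*/2023/*.csv"  # assumed example pattern
print(dir_allowed(include, "s3://my-bucket/data/sales/"))       # True
print(dir_allowed(include, "s3://my-bucket/data/sales/2022/"))  # False
```

With a check like this, a traversal can skip listing any folder for which the check returns False, instead of listing everything and filtering files afterwards.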

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions bot added the "ingestion" label (PR or Issue related to the ingestion of metadata) on May 22, 2023

for i in range(slash_to_remove_from_glob):
    glob_include = glob_include.rsplit("/", 1)[0]

Collaborator

Why are we removing one more slash than slash_to_remove_from_glob?

Contributor Author

I added a comment and restructured the code to make it clearer.

    bucket_name=bucket_name, folder=sorted_dirs[0] + "/"
)
for dir in sorted_dirs:
    if path_spec.dir_allowed(f"{protocol}" + bucket_name + "/" + dir + "/"):
Collaborator

Suggested change
- if path_spec.dir_allowed(f"{protocol}" + bucket_name + "/" + dir + "/"):
+ if path_spec.dir_allowed(f"{protocol}{bucket_name}/{dir}/"):

  dir_to_process = self.get_dir_to_process(
-     bucket_name=bucket_name, folder=f + "/"
+     bucket_name=bucket_name,
+     folder=f + "/",
Collaborator

Not related to this PR, but appending the "/" suffix could be moved inside get_dir_to_process, before the call to list_folders.

Collaborator

Was this file added intentionally?

Contributor Author

Yes, without the new logic, this file would have been picked up.

Collaborator

okay, nice.

Collaborator

@asikowitz left a comment

A couple of minor requests, but I don't think this needs a re-review.

    return protocol
else:
    raise ValueError(
        f"Unable to get protocol or invalid protocol form path: {path}"
Collaborator

Typo: "form" -> "from"

Comment on lines +93 to +94
for i in range(slash_to_remove_from_glob):
    glob_include = glob_include.rsplit("/", 1)[0]
Collaborator

Could we get this via glob_include.rsplit("/", i)[0]? So I guess glob_include.rsplit("/", slash_to_remove_from_glob + 1)[0] in total, if that's the first time it's defined.

Contributor Author

Thanks, I made it more clear in the code.
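
For reference, the two forms discussed in this thread can be compared directly. The sketch below uses a made-up glob_include value and hypothetical helper names that are not from the PR; it only shows that calling rsplit("/", 1)[0] n times yields the same prefix as a single rsplit("/", n)[0].

```python
def strip_segments_loop(path: str, n: int) -> str:
    """Drop the last n '/'-separated segments, one rsplit per iteration."""
    for _ in range(n):
        path = path.rsplit("/", 1)[0]
    return path


def strip_segments_once(path: str, n: int) -> str:
    """Drop the last n '/'-separated segments with a single rsplit call."""
    if n <= 0:
        # A negative maxsplit would mean "no limit", so guard against it.
        return path
    return path.rsplit("/", n)[0]


glob_include = "s3://my-bucket/data/*/2023/*.csv"  # assumed example value

print(strip_segments_loop(glob_include, 2))  # s3://my-bucket/data/*
print(strip_segments_once(glob_include, 2))  # s3://my-bucket/data/*
```

The single-call form avoids the loop and makes the number of removed segments visible in one place, which is what the review comment asks for.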

@treff7es merged commit d50a999 into datahub-project:master on May 30, 2023
44 checks passed
@treff7es deleted the path_spec_aware_folder_traversal branch on May 30, 2023 14:21
Labels
ingestion PR or Issue related to the ingestion of metadata

3 participants