
GoogleDriveHook: Fixing log message + adding more verbose documentation #29694

Merged
28 commits merged into apache:main on Mar 4, 2023

Conversation

aru-trackunit
Contributor

@aru-trackunit aru-trackunit commented Feb 22, 2023

related: #29552

Previously the log message didn't exactly tell the user where the file had been uploaded to, so I decided to fix that behavior.

If reviewers decide that this solution is too complex to keep, then I am happy to remove the log as well, or to set the message to "File has been uploaded."

Unfortunately the Google API doesn't return the full path of a file in one API call, so there is a need to walk up the folder tree. The documentation also says that a drive can have up to 20 levels of nested folders, which I used in the solution to break the loop if more than 20 iterations have happened.
https://support.google.com/a/users/answer/7338880?hl=en
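The folder-walking approach described above can be sketched roughly as follows. This is a simplified, hypothetical stand-in, not the actual GoogleDriveHook code: `drive` and `resolve_file_path` are illustrative names, with a plain dict standing in for responses from the Drive v3 `files().get` call so the sketch is runnable.

```python
# Hypothetical sketch: resolve a file's full path by walking up the
# parent-folder chain, capped at 20 levels per Google's documented limit.
MAX_NESTED_FOLDERS_LEVEL = 20  # https://support.google.com/a/users/answer/7338880?hl=en

# Stand-in for Drive API lookups (fileId -> name and parent folder ids).
drive = {
    "f3": {"name": "report.csv", "parents": ["f2"]},
    "f2": {"name": "2023", "parents": ["f1"]},
    "f1": {"name": "reports", "parents": []},  # top-level folder
}

def resolve_file_path(file_id: str) -> str:
    """Build a '/reports/2023/report.csv'-style path by following parents."""
    segments = []
    current = file_id
    for _ in range(MAX_NESTED_FOLDERS_LEVEL):
        info = drive[current]
        segments.append(info["name"])
        if not info["parents"]:
            break  # reached a top-level folder
        current = info["parents"][0]  # Drive files normally have one parent
    return "/" + "/".join(reversed(segments))

print(resolve_file_path("f3"))  # -> /reports/2023/report.csv
```

The fixed iteration cap is what breaks the loop after 20 levels instead of looping forever if the parent chain were ever malformed.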

@boring-cyborg boring-cyborg bot added the area:providers and provider:google (Google, including GCP, related issues) labels Feb 22, 2023
@aru-trackunit
Contributor Author

I am not sure who should review this and no one has been assigned. @eladkal, if I made a mistake tagging you, please assign the review to someone who could do it.

@potiuk
Member

potiuk commented Feb 24, 2023

> I am not sure who should review this and no one has been assigned. @eladkal, if I made a mistake tagging you, please assign the review to someone who could do it.

FYI. Just to dispel some misconceptions and misunderstandings and to set expectations.

We have no "assignments" for review, and asking others to assign reviewers is not how things work here. This is not a corporate environment where people have "jobs". This is a community-driven organisation where people spend their time on tasks they feel they can help with, not because someone "assigns" them to do stuff. As an author you should patiently wait until someone takes a look (and possibly ping in general if it does not happen in a reasonable time, which is not well defined BTW; you need to use your own judgment, but in the ASF 72 hours is generally considered a reasonable reaction time, given that people are in different time zones and may have families, day jobs, weekends and holidays).

You need to exercise patience. And tagging people to assign others is more of a "corporate delegation expectation" than "community trust", so it's quite a bit out of place.

@potiuk potiuk left a comment


Looks good, but I have two nits.

3 resolved review comments on airflow/providers/google/suite/hooks/drive.py (outdated diff)
@eladkal eladkal changed the title Fixing log message + adding more verbose documetation GoogleDriveHook: Fixing log message + adding more verbose documentation Feb 24, 2023
@dimberman
Contributor

LGTM! Thanks for the explanations

if folder_id != "root":
    try:
        upload_location = self._resolve_file_path(folder_id)
    except (GoogleApiClientError, AirflowException) as e:
Contributor Author


@eladkal Here is the catch for all exceptions: we don't want to fail the job due to path resolution.

Contributor


Actually, I think we shouldn't do that. The change you made to _get_file_info is enough.
AirflowException is not relevant here, and as for GoogleApiClientError, if it is raised after the 2 configured retries then I guess we want the task to fail, not to catch the exception and just log it.

So let's revert this part... and I think we're good to go.

Contributor Author


I disagree, because of the order of operations. First the file is uploaded, and then we try to resolve its path. So if we fail on path resolution, the file has already been uploaded: we would fail the task due to unexpected behavior, and we can't revert the upload. So in my head there are 2 options:

  1. We keep the solution as is.
  2. I move the path resolution code before uploading the file, so the transaction doesn't need to be reverted.
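The difference between the two options can be sketched generically. This is illustrative only (the callables and names below are hypothetical, not the hook's API): with option 2, a path-resolution failure happens before any upload, so there is nothing to roll back.

```python
# Hypothetical sketch of option 2: resolve the destination path *before*
# uploading, so a resolution failure cannot leave behind an already-uploaded
# file in a task that then reports failure.
def upload_with_logged_location(resolve_path, upload, folder_id, local_path):
    location = resolve_path(folder_id)  # may raise; nothing uploaded yet
    upload(local_path, folder_id)       # only runs if resolution succeeded
    print(f"File {local_path} uploaded to {location}")

uploads = []

def fake_upload(local_path, folder_id):
    uploads.append((local_path, folder_id))

def failing_resolve(folder_id):
    raise RuntimeError("could not resolve path")

try:
    upload_with_logged_location(failing_resolve, fake_upload, "f1", "report.csv")
except RuntimeError:
    pass

print(uploads)  # -> [] : the upload never ran
```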

Contributor


I agree with @aru-trackunit. I don't think the task should fail simply because the file path can't be resolved for logging purposes. It's unfortunate the Google API doesn't surface this information easily. As long as the task does what it's supposed to do, Airflow shouldn't care what is in its user-facing task logs. Having this logic configurable would also help.

I do agree with @eladkal though that AirflowException is irrelevant. The operator is not failing for some orchestration or execution-related issue (IMO we use AirflowException with too broad a brush sometimes). Maybe a custom exception could be raised (e.g. NestedFolderLimitReachedException or something similar), just in case a real AirflowException would otherwise be swallowed unknowingly?
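The suggestion above could look roughly like this. NestedFolderLimitReachedException and the surrounding helper names are hypothetical, merely illustrating how a dedicated exception keeps the except clause narrow, so a genuine AirflowException still propagates and fails the task:

```python
class NestedFolderLimitReachedException(Exception):
    """Hypothetical exception: raised when path resolution exceeds the
    documented 20-level folder-nesting limit."""

def log_upload_location(resolve_path, folder_id, log):
    # Catch only the narrow, expected failure; anything else propagates
    # and fails the task as it should.
    try:
        log.append(f"uploaded to {resolve_path(folder_id)}")
    except NestedFolderLimitReachedException:
        log.append("uploaded (path could not be resolved)")

messages = []
log_upload_location(lambda _: "/reports/2023", "f1", messages)

def too_deep(_):
    raise NestedFolderLimitReachedException()

log_upload_location(too_deep, "f1", messages)
print(messages)
```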

:param file_id: The id of a file in Google Drive
:return: Google Drive full path for a file
"""
MAX_NESTED_FOLDERS_LEVEL = 20 # Link to docs https://support.google.com/a/users/answer/7338880?hl=en
Contributor


This still feels magical. But, out of curiosity, if the documentation states "A folder in a shared drive can have up to 20 levels of nested folders", does that mean users can't physically create more than that number of nested folders or is it really a shouldn't create situation?

Contributor Author


I have tested it, and it means that users can't create a folder beyond that depth.
This is the error body (status code 403):

{
 "error": {
  "errors": [
   {
    "domain": "global",
    "reason": "teamDriveHierarchyTooDeep",
    "message": "The shared drive hierarchy depth will exceed the limit."
   }
  ],
  "code": 403,
  "message": "The shared drive hierarchy depth will exceed the limit."
 }
}
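For reference, a 403 body like the one above can be inspected programmatically. This is a quick illustrative helper, not part of the PR, which checks for the teamDriveHierarchyTooDeep reason:

```python
import json

# The 403 error body quoted above, verbatim.
ERROR_BODY = """
{
 "error": {
  "errors": [
   {
    "domain": "global",
    "reason": "teamDriveHierarchyTooDeep",
    "message": "The shared drive hierarchy depth will exceed the limit."
   }
  ],
  "code": 403,
  "message": "The shared drive hierarchy depth will exceed the limit."
 }
}
"""

def is_hierarchy_too_deep(body: str) -> bool:
    """Return True if the error body contains the hierarchy-depth reason."""
    payload = json.loads(body)
    return any(
        err.get("reason") == "teamDriveHierarchyTooDeep"
        for err in payload.get("error", {}).get("errors", [])
    )

print(is_hierarchy_too_deep(ERROR_BODY))  # -> True
```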

@aru-trackunit
Contributor Author

@josh-fell I decided to remove this nested limit, since we can rely on the Google API instead; I had also made a wrong assumption.
If Google doesn't let users upload a file nested deeper than 20 levels, then it will never generate a longer path. The limit was also protecting against the worst-case scenario: making 20 iterations instead of looping infinitely if Google introduces a bug.

@potiuk potiuk merged commit f55b957 into apache:main Mar 4, 2023
@aru-trackunit
Contributor Author

Wohooo!
