GoogleDriveHook: Fixing log message + adding more verbose documentation #29694

Conversation
I am not sure who should review this and no one has been assigned. @eladkal, if I made a mistake by tagging you, please assign someone else who could review it.
FYI, just to dispel some misconceptions and misunderstandings and set expectations: we have no "assignments" for review, and asking others to assign reviewers is not how things work here. This is not a corporate environment where people have "jobs". This is a community-driven organisation where people spend their time on tasks they feel they can help with, not because someone "assigns" them to do stuff. As an author you should patiently wait until someone takes a look (and possibly ping in general if it does not happen in a reasonable time, which is not well defined BTW; you need to use your own judgment, but in the ASF 72 hours is generally considered a reasonable reaction time for people who are in different time zones and may have families, day jobs, weekends and holidays). You need to exercise patience. And tagging people to assign others is more of a "corporate delegation expectation" than "community trust", so it's quite a bit out of place.
Looks good, but I have two nits.
GoogleDriveHook: Fixing log message + adding more verbose documentation
LGTM! Thanks for the explanations
if folder_id != "root":
    try:
        upload_location = self._resolve_file_path(folder_id)
    except (GoogleApiClientError, AirflowException) as e:
@eladkal Here is the catch-all for exceptions - we don't want to fail the job due to path resolution.
Actually, I think we shouldn't do that. The change you made to _get_file_info is enough.
AirflowException is not relevant here, and as for GoogleApiClientError, if it is raised after the 2 configured retries then I guess we want the task to fail, not to catch the exception and just log it.
So let's revert this part... and I think we're good to go.
I disagree, because of the order of operations. First the file is uploaded, and only then do we try to resolve its path. So if path resolution fails, the file has already been uploaded, the task fails due to unexpected behavior, and we can't revert the upload. So in my head there are 2 options:
- We keep the solution as is
- I move the path resolution code before the file upload, so the transaction doesn't need to be reverted.
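For illustration, the second option could look roughly like the sketch below. This is not the actual hook code; the helper callables (`resolve_file_path`, `do_upload`) and the function name are stand-ins for the hook's internals:

```python
# Sketch of option 2: resolve the destination path *before* uploading, so a
# path-resolution failure never leaves an orphaned file behind. The helper
# names here are illustrative, not the hook's real API.
def upload_with_preresolved_path(local_location, remote_location, folder_id,
                                 resolve_file_path, do_upload):
    upload_location = remote_location
    if folder_id != "root":
        # If this raises, nothing has been uploaded yet, so the failed
        # task leaves Google Drive unchanged.
        upload_location = resolve_file_path(folder_id)
    file_id = do_upload(local_location, remote_location, folder_id)
    return file_id, upload_location
```

With this ordering, a path-resolution error fails the task cleanly, because there is no uploaded file to roll back.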
I agree with @aru-trackunit. I don't think the task should fail simply because the file path can't be resolved for logging purposes. It's unfortunate the Google API doesn't surface this information easily. As long as the task does what it's supposed to do, Airflow shouldn't care what is in its user-facing task logs. Having this logic configurable also helps too.
I do agree with @eladkal though that AirflowException is irrelevant. The operator is not failing for some orchestration or execution-related issue (IMO I think we use AirflowException with too broad of a brush sometimes). Maybe a custom exception is raised (e.g. NestedFolderLimitReachedException or something similar) just in case some real AirflowException is swallowed unknowingly?
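A minimal sketch of that suggestion, with hypothetical names (the exception class and `resolve_or_warn` helper do not exist in the hook; they only illustrate narrowing the `except` clause):

```python
# Sketch: a narrow, purpose-built exception instead of catching the broad
# AirflowException. Only the expected path-resolution failure is swallowed;
# anything else (including a real AirflowException) still fails the task.
class FilePathResolutionError(Exception):
    """Raised when the Google Drive path for a file cannot be resolved."""


def resolve_or_warn(resolve, file_id, log_warning):
    try:
        return resolve(file_id)
    except FilePathResolutionError as e:
        log_warning("Could not resolve path for %s: %s" % (file_id, e))
        return None
```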
:param file_id: The id of a file in Google Drive
:return: Google Drive full path for a file
"""
MAX_NESTED_FOLDERS_LEVEL = 20  # Link to docs https://support.google.com/a/users/answer/7338880?hl=en
This still feels magical. But, out of curiosity, if the documentation states "A folder in a shared drive can have up to 20 levels of nested folders", does that mean users physically can't create more than that number of nested folders, or is it really a "shouldn't create" situation?
I have tested it, and it means that users can't create the folder.
This is the error body (status code 403):
{
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "teamDriveHierarchyTooDeep",
        "message": "The shared drive hierarchy depth will exceed the limit."
      }
    ],
    "code": 403,
    "message": "The shared drive hierarchy depth will exceed the limit."
  }
}
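As a side note, an error body with this shape can be parsed for its machine-readable `reason`, which would let a caller tell `teamDriveHierarchyTooDeep` apart from other 403s without matching on the human-readable message. A small sketch (the function name is made up for illustration):

```python
import json
from typing import Optional


# Sketch: extract the machine-readable "reason" from a Google Drive error
# body like the one above. Returns None if the body carries no error entries.
def drive_error_reason(body: str) -> Optional[str]:
    payload = json.loads(body)
    errors = payload.get("error", {}).get("errors", [])
    return errors[0].get("reason") if errors else None
```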
@josh-fell I decided to remove this nested limit, since we can rely on the Google API instead; also, I had made a wrong assumption.
Wohooo!
related: #29552
Previously the log message didn't tell the user exactly where the file had been uploaded, so I decided to fix that behavior.
If reviewers decide that the solution is too complex to keep, then I am happy to remove the log as well, or set the message to "File has been uploaded."
Unfortunately, the Google API doesn't return the full path of a file in one API call, so there is a need to go up the folder tree. The documentation also says that a drive can have up to 20 levels of nested folders, which I used in the solution to break the loop if more than 20 iterations have happened.
https://support.google.com/a/users/answer/7338880?hl=en
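The walk up the folder tree with a depth cap, as described above, can be sketched roughly like this. The `fetch_parent` callable stands in for a Drive `files().get(fileId=..., fields="id,name,parents")` lookup; the function and constant names are illustrative, not the actual hook code:

```python
# Sketch of resolving a full Drive path by walking parent folders, with the
# documented 20-level nesting limit used to bound the loop.
MAX_NESTED_FOLDERS_LEVEL = 20  # https://support.google.com/a/users/answer/7338880


def resolve_full_path(file_id, fetch_parent):
    """fetch_parent(item_id) -> (name, parent_id or None at the drive root)."""
    segments = []
    current = file_id
    for _ in range(MAX_NESTED_FOLDERS_LEVEL + 1):
        name, parent = fetch_parent(current)
        segments.append(name)
        if parent is None:  # reached the drive root
            break
        current = parent
    # Segments were collected leaf-first, so reverse before joining.
    return "/".join(reversed(segments))
```

Each iteration costs one API call, which is why there is no single-call way to get the full path.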