Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hook for managing directories and files in Azure Data Lake Storage Gen2 #28262

Merged
merged 7 commits into from
Jan 5, 2023

Conversation

bharanidharan14
Copy link
Contributor

@bharanidharan14 bharanidharan14 commented Dec 9, 2022

Created hook for supporting ADLS gen2, which uses the WASB connection and connects to ADLS gen2 storage

Relates to #28223

@tatiana
Copy link
Contributor

tatiana commented Dec 12, 2022

Relates to #28223

@bharanidharan14 bharanidharan14 changed the title [WIP]: Hook for managing directories and files in Azure Data Lake Storage Gen2 Hook for managing directories and files in Azure Data Lake Storage Gen2 Dec 13, 2022
@bharanidharan14 bharanidharan14 marked this pull request as ready for review December 13, 2022 15:21
@bharanidharan14
Copy link
Contributor Author

@tatiana @luanmorenomaciel Need your review on this PR.

from airflow.hooks.base import BaseHook


class AzureDataLakeStorageV2(BaseHook):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class AzureDataLakeStorageV2(BaseHook):
class AzureDataLakeStorageV2Hook(BaseHook):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I changed the Hook name now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this.
eventually v2 will be the standard and on that day it would be redundant to mention v2.

I see that all current v1 services are listed as AzueDataLakeStorage*
WDYT about having the new ones as Adls*?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the current v1 services are listed as AzueDataLakeStorage. Now it is AzureDataLakeStorageClient can it be AzureDataLakeStorageClientHook. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we don't have hook for adls it can be anything we want.
i suggested the pattern: Adls* so:

Current V1 Suggested V2
- AdlsClientHook
ADLSDeleteOperator AdlsDeleteOperator
ADLSListOperator AdlsListOperator

Copy link
Contributor

@eladkal eladkal Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we can also do:

microsoft/azure/hooks/adsl_v2.py
microsoft/azure/operators/adsl_v2.py

then class names can have the same names as v1 because they are not in the same file
and once v1 is removed we can change the files names which is easier change then changing classes names.

return DataLakeServiceClient(
account_url=f"https://{conn.login}.dfs.core.windows.net", credential=token_credential, **extra
)
credential = conn.password
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to do something similar to:
https://github.com/apache/airflow/blob/main/airflow/providers/microsoft/azure/hooks/wasb.py#L202?

I wonder how much in common the Wasb & the AzureDataLakeStorageV2 implementation have - and if it would make sense to have a BaseAzureHook

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tatiana I am using WASB connection details to connect to ADLS gen2 storage account, apart the connection details nothing is similar I think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a shame, @bharanidharan14 - I didn't realise things were so disconnected in Azure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tatiana ADLS gen2 is using different protocols and APIs and different client-type principles to connect. Currently, I have created separate connection types and connection details.

from airflow.providers.microsoft.azure.hooks.wasb import WasbHook


class AzureDataLakeStorageClient(WasbHook):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#28223 description states that wasb is legacy if so what is the motivation to inhert from wsab for adsl gen2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only for connection details, I thought of using these WASB connection details. should I create a new connection ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If wasb is legacy and we expect it to be removed some day then I don't think we utlize it for other services
but I'm not Azure expert - I leave this desicion to people who knows the service.

What i do want to have here is docs and specifically connection docs that explains how to make this work.
(I want to avoid questions raised down the road by users who will ask to refactor because we utlize legacy service)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok let me add new connection, with docs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bharanidharan14 @eladkal did we consider having an AzureBaseHook similar to GoogleBaseHook
https://github.com/apache/airflow/blob/main/airflow/providers/google/common/hooks/base_google.py
This way, both AzureDataLakeStorageHook and WasbHook could inherit from it

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tatiana and @mateusoliveiraowshq as per our conversation, implementing the AzureBaseHook would be challenging since Azure unfortunately does not share the same connection client principles (BlobServiceClient & DataLakeServiceClient) between the WASB and ABFS protocol.

Azure Blob Storage - WASB
Azure Data Lake Storage Gen2 - ABFS

What we could do instead of is to have the following hook structure:

airflow.providers.microsoft.azure.hooks.wasb (Already Exists)

airflow.providers.microsoft.azure.hooks.azure_data_lake (Suggested)

Meaning that as per @bharanidharan14 @eladkal conversation we would recommend to inherid the hook from the airflow.providers.microsoft.azure.hooks.azure_data_lake instead of airflow.providers.microsoft.azure.hooks.wasb, that would make more sense since WASB protocol has been marked as legacy hence the, implementation process are different from each other.

wasb-abfs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luanmorenomaciel I don't think inheriting from the airflow.providers.microsoft.azure.hooks.azure_data_lake is also not recommended because this ADLS existing hook is gen1 and it's being getting retired from feb. They may stop supporting for that too as well. so its better to inherit from BaseHook. WDYT ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be fine to inherit from BaseHook

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luanmorenomaciel Currently I am inheriting from BaseHook. Can you also review this PR

from hooks.base import BaseHook


class AdlsClientHook(BaseHook):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense for this to be defined inside airflow/providers/microsoft/azure/hooks/azure_data_lake?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed it

@bharanidharan14
Copy link
Contributor Author

@kaxil Need your review on this PR

@kaxil
Copy link
Member

kaxil commented Dec 22, 2022

Tests are failing

Copy link
Contributor

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bharanidharan14 we're very close to merging this work.

There is one comment pending: https://github.com/apache/airflow/pull/28262/files#r1056265305
Rebasing and passing tests also seem to be pending.

Fix mypy issues

Fix mypy issues

Fix mypy issues

Fix mypy issues

Add adls v2 connection

Fix doc

Fix doc

Moved hook to existing data lake hook file

Moved hook test case

Moved hook test case

Fix doc
ADLS gen2 hook
Fix mypy issues

Fix mypy issues

Fix mypy issues

Fix mypy issues

Add adls v2 connection

Fix doc

Fix doc

Moved hook to existing data lake hook file

Moved hook test case

Moved hook test case

Fix doc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants