Skip to content
/ adfpl Public

How to eliminiate Azure Data Factory's public internet exposure using private link

License

Notifications You must be signed in to change notification settings

caryeun/adfpl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

How to eliminate Azure Data Factory's public internet exposure using private link

Azure Private Link was made generally available on Feb 2020. Since then, it made numerous Azure PaaS services more secured. By eliminating data transfer via the public internet, Azure Private Link help reduce the exposure to cyber security attacks significantly.

Of most PaaS Services, Azure Data Factory is considered to be 1 of the most important services to be secured. An ADF instance can connect to numerous data sources that might contain sensitive customer information - and the impact of data exposure can be far more serious than any other PaaS service. Making ADF more secured, therefore, is critical in building a secure data solution on Azure.

In the following sections, we are going to walk through

  1. How private link work to make Azure Data Factory more secured; and
  2. Provide sample ARM template for you to provision an ADF environment that makes use of private link; and
  3. Provide reference links to Azure certification, in case you want to learn more about Azure Data Factory / Azure Private Link

What is Azure Private Link?

Azure Private Link enables you to access Azure PaaS Services (for example, Azure Storage and SQL Database) and Azure hosted customer-owned/partner services over a private endpoint in your virtual network. Traffic between your virtual network and the service travels the Microsoft backbone network. Exposing your service to the public internet is no longer necessary.

Refer to below link for more details about Private Link:

What is Azure Data Factory?

Azure Data Factory is the cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.

Azure Data Factory consists of 2 planes: the "control plane" and "data plane".

  • The "control plane" stores metadata such as pipeline definition and schedule, and provides Data Factory pipelines authoring and monitoring capabilities.
  • The "data plane" is a compute infrastructure called Integration Runtime (IR) to provide data integration capabilities. It connects to "linked service", which are data stores or compute services, to perform "activities", which can be copying data between data stores, running Data Flows, or dispatching transform activities to other Azure services such as HDInsight, Databricks and Azure Machine Learning. There are 3 types of integration runtimes: Azure, Self-hosted and Azure-SSIS. As of the time of the writing, only private link for self-hosted integration runtime is generally available. This blog will use self-hosted integration runtime for illustration.

The below logical diagram illustrates the various components for an Azure Data Factory pipeline.
Azure Data Factory - logical diagram

Refer to below link for more details Azure Data Factory:

How does Private Link makes Data Factory more secure?

To illustrate the idea, we will look at a simplified data factory infrastructure setup below, where there is:

  1. 1 instance of data factory, that stores metadata of the pipeline; and
  2. 1 storage account that represents the "source" linked service; and
  3. 1 storage account that represents the "destination linked service", and
  4. 1 integration runtime that performs the actual data movement from "source" to "destination".

Network diagram - before Private Link for ADF is implemented

Azure Data Factory - network diagram - without private link

Before private link is available,

  • the communication between ADF IR and ADF control plane will have to traverse the public internet; and

  • the following communication channels between the Azure Data Factory and the virtual network will have to be opened up:

    • adf.azure.com, port 443; and
    • *.{region}.datafactory.azure.net, port 443; and
    • *.servicebus.windows.net, port 443; and
    • download.microsoft.com

All of these together added unnecessary risk exposure to Azure Data Factory.

I have exported the ARM templates for the above setup to the GitHub link here for your reference. Templates

Network diagram - after Private Link for ADF is implemented

After private link is introduced, you can secure communication between ADF IR and ADF control plane using private link. The below diagram illustrates how it works.

Azure Data Factory - network diagram - with private link

What's more - you don’t need to configure the preceding domain and port in a virtual network - which further reduced your risk exposure.

I have exported the ARM templates for the above setup to the GitHub link here. Templates

If you compare the list of resources before and after implementing private link - you should notice that there are 3 additional resources created for supporting the use of private link. Resource list - after adding private link

How may I set up Private Link for Azure Data Factory, and ensure no public internet is allowed?

You can configure Private Link for Azure Data Factory via the portal UI. Below screen capture for quick reference. Azure Data Factory - Private Endpoint

You can disable public network access to the data factory. Below screen capture for quick reference. Azure Data Factory - Disable public network access

How to tell if my self-hosted integration runtime is connecting to the private endpoint?

You can verify by checking the DNS resolution of the service endpoint hostname on the integration runtime. You can get the hostname from the authentication key string in the Azure Data Factory portal, as shown below. Azure Data Factory - Integration runtime Authentication Key

If private endpoint is setup, then the DNS resolution on your self hosted integration runtime should show an internal IP address. In the example below, it resolves to 192.168.168.5 which is the intranet address assigned by internal DNS. Azure Data Factory - DNS resolution

I am interested to know more. Where may I get more info?

Azure Data Factory is just 1 of the data services offered on Azure. To learn more about the management, monitoring, security and prviacy of data on Azure, the below learning path for "Azure Data Engineer Associate" shall help.

Private link is among 1 of the security features offered by Azure. To learn more about the other security features, the below learning path for "Azure Security Engineer Associate" shall give you a more comprehensive overview on Azure Security.

If you are looking for information for security consideration of implementing just Data Factory, refer to the link below:

And finally, the examples given in this blog can be found in the GitHub repo here. https://github.com/caryeun/adfpl

Hope you find this blog post useful!

About

How to eliminiate Azure Data Factory's public internet exposure using private link

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published