Azure Private Link became generally available in February 2020. Since then, it has made numerous Azure PaaS services more secure. By keeping data transfer off the public internet, Azure Private Link helps significantly reduce exposure to cyberattacks.
Among PaaS services, Azure Data Factory (ADF) is one of the most important to secure. An ADF instance can connect to numerous data sources that might contain sensitive customer information, so the impact of a data exposure can be more serious than for many other PaaS services. Securing ADF is therefore critical to building a secure data solution on Azure.
In the following sections, we will:
- explain how Private Link works to make Azure Data Factory more secure; and
- provide sample ARM templates for provisioning an ADF environment that uses Private Link; and
- provide reference links to Azure certifications, in case you want to learn more about Azure Data Factory or Azure Private Link.
Azure Private Link enables you to access Azure PaaS Services (for example, Azure Storage and SQL Database) and Azure hosted customer-owned/partner services over a private endpoint in your virtual network. Traffic between your virtual network and the service travels the Microsoft backbone network. Exposing your service to the public internet is no longer necessary.
Refer to the link below for more details about Private Link:
- What is Azure Private Link? https://docs.microsoft.com/en-us/azure/private-link/private-link-overview
Azure Data Factory is the cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.
Azure Data Factory consists of two planes: the "control plane" and the "data plane".
- The "control plane" stores metadata such as pipeline definitions and schedules, and provides the authoring and monitoring capabilities for Data Factory pipelines.
- The "data plane" is a compute infrastructure called the Integration Runtime (IR), which provides the data integration capabilities. It connects to "linked services" (data stores or compute services) to perform "activities" such as copying data between data stores, running Data Flows, or dispatching transform activities to other Azure services such as HDInsight, Databricks and Azure Machine Learning. There are three types of integration runtime: Azure, Self-hosted and Azure-SSIS. At the time of writing, Private Link is generally available only for the self-hosted integration runtime, so this blog uses the self-hosted integration runtime for illustration.
The logical diagram below illustrates the various components of an Azure Data Factory pipeline.
Refer to the link below for more details about Azure Data Factory:
- What is Azure Data Factory https://docs.microsoft.com/en-us/azure/data-factory/introduction
To illustrate the idea, we will look at a simplified data factory infrastructure setup below, where there is:
- 1 instance of data factory, that stores metadata of the pipeline; and
- 1 storage account that represents the "source" linked service; and
- 1 storage account that represents the "destination" linked service; and
- 1 integration runtime that performs the actual data movement from "source" to "destination".
Before Private Link was available:
- the communication between the ADF IR and the ADF control plane had to traverse the public internet; and
- the following communication channels between Azure Data Factory and the virtual network had to be opened up:
  - adf.azure.com, port 443; and
  - *.{region}.datafactory.azure.net, port 443; and
  - *.servicebus.windows.net, port 443; and
  - download.microsoft.com
All of these together added unnecessary risk exposure to Azure Data Factory.
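To make the size of that allow list concrete, here is a minimal Python sketch that checks whether a given hostname falls under the patterns listed above. The function names and the example region `eastus` are my own illustrative choices and not part of any Azure SDK. Treat it as a rough aid for reasoning about firewall rules, not as a firewall implementation.

```python
from fnmatch import fnmatchcase


def required_outbound_patterns(region: str) -> list[str]:
    """Hostname patterns from the list above, with {region} filled in."""
    return [
        "adf.azure.com",
        f"*.{region}.datafactory.azure.net",
        "*.servicebus.windows.net",
        "download.microsoft.com",
    ]


def is_allowed(hostname: str, region: str) -> bool:
    """Return True if hostname matches one of the required patterns.

    Hostnames are compared case-insensitively and a trailing dot is ignored.
    """
    host = hostname.lower().rstrip(".")
    patterns = required_outbound_patterns(region.lower())
    return any(fnmatchcase(host, pattern) for pattern in patterns)
```

For example, `is_allowed("myfactory.eastus.datafactory.azure.net", "eastus")` matches the regional wildcard, while the same hostname checked against a different region does not.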
I have exported the ARM templates for the above setup to GitHub for your reference (see the repository link at the end of this post).
With Private Link, you can secure the communication between the ADF IR and the ADF control plane. The diagram below illustrates how it works.
What's more, you no longer need to configure the preceding domains and ports in the virtual network, which further reduces your risk exposure.
I have exported the ARM templates for this setup to GitHub as well (see the repository link at the end of this post).
If you compare the list of resources before and after implementing Private Link, you should notice that 3 additional resources are created to support the use of Private Link.
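If you would rather compare the two exported templates programmatically than by eye, a minimal sketch along these lines may help. It assumes both templates are standard ARM JSON with a top-level `resources` array; the function names are hypothetical, and resources nested inside other resources are not inspected.

```python
import json
from collections import Counter


def load_template(path: str) -> dict:
    """Load an exported ARM template from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def resource_types(template: dict) -> Counter:
    """Count the top-level resources in a template by their ARM type."""
    return Counter(res["type"] for res in template.get("resources", []))


def added_resources(before: dict, after: dict) -> Counter:
    """Resource types that appear more often in `after` than in `before`."""
    # Counter subtraction keeps only the positive differences.
    return resource_types(after) - resource_types(before)
```

Calling `added_resources(load_template(before_path), load_template(after_path))` on the two exports should then list the resource types introduced by the Private Link setup, where `before_path` and `after_path` are whatever file names you saved the templates under.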
You can configure Private Link for Azure Data Factory via the portal UI. The screen capture below is provided for quick reference.
You can also disable public network access to the data factory. The screen capture below is provided for quick reference.
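The same setting can be checked in an exported ARM template. The sketch below is a hedged example: it assumes the setting appears as a `publicNetworkAccess` property under `properties` of the `Microsoft.DataFactory/factories` resource, and it treats a missing property as "Enabled". Both are assumptions on my part, so verify them against your own export. The function name is hypothetical.

```python
def factories_with_public_access(template: dict) -> list[str]:
    """Return names of data factories whose public network access is not disabled.

    Assumption: the setting is stored at properties.publicNetworkAccess and
    a missing value means public access is still enabled.
    """
    exposed = []
    for res in template.get("resources", []):
        if res.get("type") != "Microsoft.DataFactory/factories":
            continue
        access = res.get("properties", {}).get("publicNetworkAccess", "Enabled")
        if str(access).lower() != "disabled":
            exposed.append(res.get("name", "<unnamed>"))
    return exposed
```

An empty list means every data factory in the template has public network access turned off, under the assumptions stated above.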
You can verify the setup by checking the DNS resolution of the service endpoint hostname on the integration runtime machine. You can get the hostname from the authentication key string in the Azure Data Factory portal, as shown below.
If the private endpoint is set up correctly, DNS resolution on your self-hosted integration runtime should return an internal IP address. In the example below, the hostname resolves to 192.168.168.5, which is a private address returned by the internal DNS.
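If you want to script this check instead of reading the output manually, the following Python sketch resolves a hostname and reports whether every returned address is private. The function names are hypothetical, and you need to supply the hostname taken from your own authentication key. Run it on the self-hosted integration runtime machine so that it uses the same DNS as the runtime.

```python
import ipaddress
import socket


def is_private_ip(address: str) -> bool:
    """Return True if the address is in a private (non-internet) range."""
    return ipaddress.ip_address(address).is_private


def resolve(hostname: str) -> list[str]:
    """Resolve a hostname to its distinct IP addresses using the system DNS."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def resolves_privately(hostname: str, resolver=resolve) -> bool:
    """Return True if the hostname resolves only to private addresses.

    `resolver` can be replaced in tests to avoid real DNS lookups.
    """
    addresses = resolver(hostname)
    return bool(addresses) and all(is_private_ip(a) for a in addresses)
```

With the example above, `is_private_ip("192.168.168.5")` is true because 192.168.0.0/16 is a private range. A public address in the result would suggest that the hostname is still resolving to the public endpoint.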
Azure Data Factory is just one of the data services offered on Azure. To learn more about the management, monitoring, security and privacy of data on Azure, the learning path below for "Azure Data Engineer Associate" should help.
- Microsoft Certified: Azure Data Engineer Associate https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer
Private Link is one of the security features offered by Azure. To learn more about the others, the learning path below for "Azure Security Engineer Associate" should give you a more comprehensive overview of Azure security.
- Microsoft Certified: Azure Security Engineer Associate https://docs.microsoft.com/en-us/learn/certifications/azure-security-engineer
If you are looking specifically for the security considerations of implementing Data Factory, refer to the link below:
- Security considerations for data movement in Azure Data Factory https://docs.microsoft.com/en-us/azure/data-factory/data-movement-security-considerations
And finally, the examples given in this blog can be found in this GitHub repo: https://github.com/caryeun/adfpl
Hope you find this blog post useful!