# Deploy and access Azure HDInsight in a private subnet

To better isolate Azure HDInsight clusters from the public Internet and strengthen security at the network layer, enterprises can deploy the clusters in a private subnet of a virtual network and access them through a private endpoint.

Each HDInsight cluster deployed in a virtual network has a private endpoint of the form **https://CLUSTERNAME-int.azurehdinsight.net** in addition to its public endpoint. Note the **“-int”** in this URL: the endpoint resolves to a private IP address in that virtual network and is not accessible from the public Internet.


This lab walks you through the steps to:
* Deploy an HDInsight cluster in a private subnet;
* Configure the network security groups (NSG) that are necessary to deny access from the public Internet through the public endpoint;
* Use an SSH tunnel with dynamic port forwarding to access the administrative web interfaces provided by the HDInsight cluster (e.g., Ambari or the Spark UI).

The following diagram illustrates the architecture that we will be setting up.

![Architecture Diagram](images/architecture.jpg)

## Create a resource group and a virtual network

For convenience, we will be using the Azure CLI to set up our virtual network.

[Install and configure the Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) on your machine, then log in to your Azure subscription by entering the following command:

In [None]:
%%sh
az login

Before you can create a virtual network, you need a resource group to host it (if you already have one that you want to reuse, you can skip the step below):

In [None]:
%%sh
az group create --name labs-rg --location southeastasia

Create a virtual network with a private subnet in which to deploy the HDInsight cluster:

In [None]:
%%sh
az network vnet create \
    --name dev-test-vnet \
    --resource-group labs-rg \
    --location southeastasia \
    --address-prefix 10.0.0.0/16 \
    --subnet-name hdinsight-subnet \
    --subnet-prefix 10.0.0.0/24

Create a public subnet within the virtual network:

In [None]:
%%sh
az network vnet subnet create \
    --name public-subnet \
    --vnet-name dev-test-vnet \
    --resource-group labs-rg \
    --address-prefix 10.0.128.0/20
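
The two subnet prefixes must both sit inside the virtual network's address space and must not overlap each other. The following is a minimal sketch, in plain POSIX shell, of how that could be checked before running the commands above. The helper names `ip_to_int`, `cidr_bounds`, and `cidrs_overlap` are illustrative and are not part of the Azure CLI.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Print the first and last address (as integers) covered by a CIDR prefix.
cidr_bounds() {
    base=$(ip_to_int "${1%/*}")
    bits=${1#*/}
    size=$(( 1 << (32 - bits) ))
    first=$(( base / size * size ))
    echo "$first $(( first + size - 1 ))"
}

# Succeed (exit status 0) when the two prefixes share at least one address.
cidrs_overlap() {
    set -- $(cidr_bounds "$1") $(cidr_bounds "$2")
    [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

# The two subnets used in this lab.
if cidrs_overlap 10.0.0.0/24 10.0.128.0/20; then
    echo "subnets overlap"
else
    echo "subnets do not overlap"
fi
```

Comparing either subnet against `10.0.0.0/16` with the same function reports an overlap, which is expected because both subnets are carved out of the virtual network's range.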

## Create an SSH key pair 

We will be using public-key cryptography to authenticate SSH sessions, in order to avoid passwords and to secure access to both the HDInsight cluster and the bastion host (or jumpbox) that we deploy in the public subnet.

Use the `ssh-keygen` command to create public and private key files. The following command generates a 2048-bit RSA key pair:

In [None]:
%%sh
# Create the target directory first; ssh-keygen does not create it.
mkdir -p keys
ssh-keygen -t rsa -b 2048 -f keys/azure-ssh-keypair

Use the following command to set the permissions of your private key file so that only you can read it:

In [None]:
%%sh
chmod 400 keys/azure-ssh-keypair*
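
OpenSSH ignores a private key file that other users can read, so it can be worth confirming that the permissions took effect. The following is a minimal sketch assuming a POSIX `ls`; `is_private` is a hypothetical helper name, not a standard command.

```shell
# Succeed when the permission string grants nothing to group or others.
is_private() {
    # ls -ld prints e.g. "-r--------"; characters 5-10 are the group and other bits.
    mode=$(ls -ld "$1" | cut -c5-10)
    [ "$mode" = "------" ]
}

# Check the key created above, if it exists.
if [ -f keys/azure-ssh-keypair ] && is_private keys/azure-ssh-keypair; then
    echo "private key permissions look fine"
fi
```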

## Create the jumpbox server

Follow [these instructions](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/quick-create-portal) to create a Linux VM that will serve as a jumpbox.

Just ensure that you:
* Enable SSH authentication by providing the public key you have previously created;
* Allow only inbound SSH connections from the Internet, selecting the SSH port (22) in the **Inbound port rules** section;
* Deploy the jumpbox in the public subnet.

Alternatively, you can execute the script below to create the server:

In [None]:
%%sh
az vm create \
    --resource-group labs-rg \
    --name jumpbox \
    --image CentOS \
    --location southeastasia \
    --public-ip-address-dns-name <JUMPBOX_DNS_NAME> \
    --vnet-name dev-test-vnet \
    --subnet public-subnet \
    --admin-username <ADMIN_USERNAME> \
    --ssh-key-value "$(cat keys/azure-ssh-keypair.pub)"
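
Once the VM exists, you will need its public address to open the SSH tunnel later on. The following is a sketch of one way to look it up with the Azure CLI, assuming the resource names used in this lab; `jumpbox_ip` is only an illustrative wrapper.

```shell
# Print the public IP address of a VM.
# Arguments (both optional): resource group, VM name; defaults match this lab.
jumpbox_ip() {
    az vm show \
        --show-details \
        --resource-group "${1:-labs-rg}" \
        --name "${2:-jumpbox}" \
        --query publicIps \
        --output tsv
}
```

After `az login`, calling `jumpbox_ip` with no arguments queries the `jumpbox` VM in `labs-rg`.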

## Create the HDInsight cluster

Follow [these instructions](https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql-use-portal) to create an Apache Spark cluster in HDInsight using the Azure portal.

While configuring the cluster, make sure that you enable SSH authentication by providing the public key you created in a previous step:

![Enable SSH authentication with public key](images/screenshot-01.png)

Also ensure that you deploy the cluster in the private subnet you have created, as shown in the following screenshot:

![Deploy the cluster in the private subnet](images/screenshot-02.png)

## Create and configure Network Security Groups (NSG)

Network security groups (NSGs) let you filter inbound and outbound traffic for the resources in a virtual network.

We need to create an NSG to secure access to the HDInsight cluster, and configure it with a set of rules that:
* Allow traffic from the Azure health and management services to reach HDInsight clusters on port 443;
* Allow SSH traffic from the public subnet;
* Allow traffic between VMs inside the subnet;
* Deny everything else, including access from the Azure Load Balancer with a public IP address.

The IP addresses of the management services to allow in the Southeast Asia region are the following (for other regions, you can consult this [documentation link](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-extend-hadoop-virtual-network#hdinsight-ip-1)):

| Source IP address | Destination port | Direction |
|--------------|---|-------|
|168.61.49.99  |443|Inbound|
|23.99.5.239   |443|Inbound|
|168.61.48.131 |443|Inbound|
|138.91.141.162|443|Inbound|
|13.76.245.160 |443|Inbound|
|13.76.136.249 |443|Inbound|
|168.63.129.16 |443|Inbound|

The following Azure CLI commands help you set up the NSG (alternatively, you can configure the same NSG through the [Azure Portal](https://portal.azure.com)):

In [None]:
%%sh
az network nsg create --resource-group labs-rg --name hdinsight-nsg --location southeastasia 

In [None]:
%%sh
az network nsg rule create \
    --name AllowAzureMgmtServicesInBound \
    --priority 100 \
    --nsg-name hdinsight-nsg \
    --resource-group labs-rg \
    --access Allow \
    --source-address-prefixes 168.61.49.99 23.99.5.239 168.61.48.131 138.91.141.162 13.76.245.160 13.76.136.249 168.63.129.16 \
    --source-port-ranges '*' \
    --destination-address-prefixes '*' \
    --destination-port-ranges 443 \
    --protocol Tcp

In [None]:
%%sh
az network nsg rule create \
    --name DenyAzureLoadBalancerInBound \
    --priority 200 \
    --nsg-name hdinsight-nsg \
    --resource-group labs-rg \
    --access Deny \
    --source-address-prefixes AzureLoadBalancer \
    --source-port-ranges '*' \
    --destination-address-prefixes '*' \
    --destination-port-ranges '*'
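
NSG rules are enforced only once the NSG is associated with a subnet (or a network interface), and the commands in this section do not show that association. The sketch below shows how it might look with the Azure CLI, together with an explicit rule for SSH from the public subnet. The function names, the rule name `AllowSSHFromPublicSubnet`, and the priority 110 are assumptions chosen for illustration. The NSG's default rules already allow traffic within the virtual network, so the explicit SSH rule is optional.

```shell
# Explicitly allow SSH (TCP 22) from the public subnet to the cluster nodes.
allow_ssh_from_public_subnet() {
    az network nsg rule create \
        --name AllowSSHFromPublicSubnet \
        --priority 110 \
        --nsg-name hdinsight-nsg \
        --resource-group labs-rg \
        --access Allow \
        --direction Inbound \
        --protocol Tcp \
        --source-address-prefixes 10.0.128.0/20 \
        --source-port-ranges '*' \
        --destination-address-prefixes '*' \
        --destination-port-ranges 22
}

# Associate the NSG with the private subnet so that its rules are enforced.
attach_nsg_to_subnet() {
    az network vnet subnet update \
        --resource-group labs-rg \
        --vnet-name dev-test-vnet \
        --name hdinsight-subnet \
        --network-security-group hdinsight-nsg
}
```

Run `allow_ssh_from_public_subnet` and then `attach_nsg_to_subnet` after the rules above have been created.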

## Set up an SSH tunnel with dynamic port forwarding

You can set up and use an SSH tunnel with dynamic port forwarding to access the administrative web interfaces provided by the HDInsight cluster (e.g., Ambari or the Spark UI).

If you use an SSH tunnel with dynamic port forwarding, you need a SOCKS proxy management add-on to control the proxy settings in your browser. Such an add-on filters URLs based on text patterns, so you can limit the proxy to the host names and addresses that belong to the cluster's private endpoint. It also turns the proxy on and off automatically as you switch between the web interfaces hosted on the cluster and websites on the Internet. To manage your proxy settings, configure your browser to use an add-on such as FoxyProxy (with Google Chrome) or SwitchyOmega (with Firefox).

The following example demonstrates a [SwitchyOmega](https://addons.mozilla.org/en-US/firefox/addon/switchyomega/) configuration using Firefox.

Set the proxy server to localhost with the port set to 8157 (you should set this value to the local port number that you will use to establish the SSH tunnel).

![Setup the SOCKS proxy](images/screenshot-03.png)

Whitelist the following URL patterns (as illustrated by the screenshot below):

* The **\*-int.azurehdinsight.net\*** pattern matches the DNS name of the private endpoint of the HDInsight cluster;

* The **10.0.\*** pattern provides access to the private IP addresses of the compute resources deployed within the virtual network you have set up, including the HDInsight cluster. Alter this filter if it conflicts with your network access plan.

![Whitelist URL patterns](images/screenshot-04.png)
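
To make the effect of the two patterns concrete, the sketch below mimics the add-on's decision in plain shell: a host that matches either pattern goes through the SOCKS proxy, and anything else is reached directly. The function `uses_proxy` and the cluster name `mycluster` are hypothetical. The function also works on bare host names, whereas the add-on matches full URLs.

```shell
# Succeed when the given host name or IP address matches a whitelisted pattern.
uses_proxy() {
    case "$1" in
        *-int.azurehdinsight.net|10.0.*) return 0 ;;
        *) return 1 ;;
    esac
}

for host in mycluster-int.azurehdinsight.net 10.0.0.5 www.example.com; do
    if uses_proxy "$host"; then
        echo "$host -> proxy"
    else
        echo "$host -> direct"
    fi
done
```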

Add the private SSH key you have created to the authentication agent, with the following command:

In [None]:
%%sh
ssh-add -k <PRIVATE_SSH_KEY>

Connect to the jumpbox server, creating the SSH tunnel:

In [None]:
%%sh
ssh -A -D 8157 <ADMIN_USERNAME>@<IP_ADDRESS_OR_DNS_NAME_OF_JUMPBOX_SERVER>
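
With the tunnel open, you can also test the proxy from a second terminal before configuring the browser. The following is a rough sketch using `curl`, assuming the tunnel listens on local port 8157; `check_endpoint` and its arguments are hypothetical. The `--socks5-hostname` option makes the proxy side resolve the host name, which is useful if the private endpoint name cannot be resolved from your own machine.

```shell
# Request the headers of the cluster's private endpoint through the SOCKS proxy.
# Arguments: cluster name, local SOCKS port (optional, defaults to 8157).
check_endpoint() {
    curl --silent --head --max-time 10 \
        --socks5-hostname "localhost:${2:-8157}" \
        "https://$1-int.azurehdinsight.net"
}
```

For example, `check_endpoint CLUSTERNAME` should print an HTTP status line if the tunnel and the NSG rules are working; the exact status depends on the cluster's authentication settings.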

Open your browser and connect to the HDInsight private endpoint (**https://CLUSTERNAME-int.azurehdinsight.net**).

If you have configured everything properly, the browser routes the requests through the proxy server and you should be able to access the HDInsight cluster's administrative interfaces, as illustrated by the following screenshot.

![HDInsight cluster web UI](images/screenshot-05.png)