Microsoft R with Spark on HDInsight
- Azure subscription or free trial account
- Bash terminal emulator (Cygwin) or PuTTY
- Cloud Explorer
Deploying to Azure
Using the ARM Template
Using the Portal
Documentation Page: Getting started with R Server on HDInsight
Sign in to the Azure portal.
Select NEW, Data + Analytics, and then HDInsight.
Enter a name for the cluster in the Cluster Name field. If you have multiple Azure subscriptions, use the Subscription entry to select the one you want to use.
Select Select Cluster Type. On the Cluster Type blade, select the following options:
Cluster Type: R Server on Spark
Cluster Tier: Premium
Leave the other options at the default values, then use the Select button to save the cluster type.
[AZURE.NOTE] You can also add R Server to other HDInsight cluster types (such as Hadoop or HBase) by selecting the cluster type, and then selecting Premium.
Select Resource Group to see a list of existing resource groups and then select the one to create the cluster in. Or, you can select Create New and then enter the name of the new resource group. A green check will appear to indicate that the new group name is available.
[AZURE.NOTE] This entry will default to one of your existing resource groups, if any are available.
Use the Select button to save the resource group.
Select Credentials, then enter a Cluster Login Username and Cluster Login Password.
Enter an SSH Username and select Password, then enter the SSH Password to configure the SSH account. SSH is used to remotely connect to the cluster using a Secure Shell (SSH) client.
Use the Select button to save the credentials.
Select Data Source to select a data source for the cluster. Either select an existing storage account by selecting Select storage account and then selecting the account, or create a new account using the New link in the Select storage account section.
If you select New, you must enter a name for the new storage account. A green check will appear if the name is accepted.
The Default Container will default to the name of the cluster. Leave this as the value.
Select Location to select the region to create the storage account in.
[AZURE.IMPORTANT] Selecting the location for the default data source will also set the location of the HDInsight cluster. The cluster and default data source must be located in the same region.
Use the Select button to save the data source configuration.
Select Node Pricing Tiers to display information about the nodes that will be created for this cluster. Unless you know that you'll need a larger cluster, leave the number of worker nodes at the default of 4. The estimated cost of the cluster will be shown within the blade.
Use the Select button to save the node pricing configuration.
On the New HDInsight Cluster blade, make sure that Pin to Startboard is selected, and then select Create. This will create the cluster and add a tile for it to the Startboard of your Azure Portal. The icon will indicate that the cluster is being created, and will change to display the HDInsight icon once creation has completed.
[AZURE.NOTE] It will take some time for the cluster to be created, usually around 15 to 40 minutes. Use the tile on the Startboard, or the Notifications entry on the left of the page, to check on the creation process.
SSH Into Edge Node
Note that you access R Server on the edge node, not on the head (master/name) node.
Find the edge node SSH address by selecting your cluster, then All Settings, Apps, and RServer. Copy the SSH endpoint.
Connect to the edge node using an SSH client. You can ignore SSH keys for the purposes of this lab. In production, it is highly recommended that you use SSH keys rather than username/password authentication.
Enter your SSH username and password.
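As a rough sketch, the connection target can be assembled from your SSH username and cluster name. The host pattern below is taken from the tunnel command later in this guide; the endpoint you copy from the portal is authoritative if it differs. The function name `edge_node_target` and the values `sshuser` and `mycluster` are hypothetical placeholders.

```shell
# Build the SSH target for the R Server edge node.
# Assumed host pattern: r-server.CLUSTERNAME-ssh.azurehdinsight.net
# $1 = SSH username, $2 = cluster name
edge_node_target() {
  printf '%s@r-server.%s-ssh.azurehdinsight.net\n' "$1" "$2"
}

# Hypothetical example values; replace with your own.
target="$(edge_node_target sshuser mycluster)"

# Print the command instead of running it, so you can inspect it first.
echo "ssh $target"
```

Running the printed command should prompt for the SSH password you chose when creating the cluster.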
Once you are connected, become a root user on the cluster. In the SSH session, use the following command.
sudo su -
Download the custom script to install RStudio. Use the following command.
Change the permissions on the custom script file and run the script. Use the following commands.
chmod 755 InstallRStudio.sh
./InstallRStudio.sh
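To illustrate what the permission change does, here is a small self-contained sketch that uses a throwaway script in place of InstallRStudio.sh. The file name `demo.sh` and its contents are made up for the demonstration; only the `chmod 755` and run-by-path pattern mirror the step above.

```shell
# Create a throwaway script in a temp directory (a stand-in for InstallRStudio.sh).
workdir="$(mktemp -d)"
script="$workdir/demo.sh"
printf '#!/bin/sh\necho "script ran"\n' > "$script"

# 755 = owner may read/write/execute; group and others may read/execute.
chmod 755 "$script"

# With the execute bit set, the file can be run directly by its path.
result="$("$script")"
echo "$result"

# Clean up the temp directory.
rm -rf "$workdir"
```

Without the `chmod` step, running the file by path would typically fail with a "Permission denied" error.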
If you used an SSH password while creating an HDInsight cluster with R Server, you can skip this step and proceed to the next. If you used an SSH key instead to create the cluster, you must set a password for your SSH user. You will need this password when connecting to RStudio. Run the following commands. When prompted for Current Kerberos password, just press ENTER.
passwd remoteuser
Current Kerberos password:
New password:
Retype new password:
Current Kerberos password:
If your password is successfully set, you should see a message like this.
passwd: password updated successfully
Exit the SSH session.
Create an SSH tunnel to the cluster by mapping localhost:8787 on the HDInsight cluster to the client machine. You must create an SSH tunnel before opening a new browser session.
On a Linux client or a Windows client (using Cygwin), open a terminal session and use the following command.
ssh -L 8787:localhost:8787 USERNAME@r-server.CLUSTERNAME-ssh.azurehdinsight.net
Replace USERNAME with an SSH user for your HDInsight cluster, and replace CLUSTERNAME with the name of your HDInsight cluster.
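If port 8787 is already in use on your client, the local side of the tunnel can be moved to another port. The sketch below only prints the command for a given user, cluster, and local port. The function name `tunnel_command` and the example values are hypothetical, and the `-N` flag (a standard OpenSSH option that opens the tunnel without starting a remote shell) is an addition not used in the command above.

```shell
# Print an ssh command that forwards a local port on the client to
# port 8787 (RStudio Server) on the edge node.
# $1 = SSH username, $2 = cluster name, $3 = local port (defaults to 8787)
tunnel_command() {
  printf 'ssh -N -L %s:localhost:8787 %s@r-server.%s-ssh.azurehdinsight.net\n' \
    "${3:-8787}" "$1" "$2"
}

# Hypothetical example: use local port 8788 instead of 8787.
tunnel_command sshuser mycluster 8788
```

In the forwarding spec, the first number is the port on your client and `localhost:8787` is resolved on the edge node, so only the first number needs to change.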
On a Windows client, create an SSH tunnel using PuTTY.
- Open PuTTY, and enter your connection information. If you are not familiar with PuTTY, see Use SSH with Linux-based Hadoop on HDInsight from Windows for information on how to use it with HDInsight.
- In the Category section to the left of the dialog, expand Connection, expand SSH, and then select Tunnels.
Provide the following information on the Options controlling SSH port forwarding form:
- Source port - The port on the client that you wish to forward. For example, 8787.
- Destination - The destination that must be mapped to the local client machine. For example, localhost:8787.
Click Add to add the settings, and then click Open to open an SSH connection.
- When prompted, log in to the server. This will establish an SSH session and enable the tunnel.
Open a web browser and enter the following URL based on the port you entered for the tunnel.
You will be prompted to enter the SSH username and password to connect to the cluster. If you used an SSH key while creating the cluster, you must enter the password you set with the passwd command earlier.
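Before opening the browser, it can help to confirm that something is answering on the tunnel port. The following is a minimal sketch, assuming RStudio Server is served over plain HTTP at the root path of the forwarded port and that `curl` is installed on the client. The function names `rstudio_url` and `check_rstudio` are hypothetical.

```shell
# Build the local RStudio URL for a given tunnel source port.
# $1 = local port (defaults to 8787)
rstudio_url() {
  printf 'http://localhost:%s/\n' "${1:-8787}"
}

# Report whether anything answers on that URL.
# Requires curl and an open SSH tunnel; gives up after 5 seconds.
check_rstudio() {
  if curl --silent --output /dev/null --max-time 5 "$(rstudio_url "$1")"; then
    echo "reachable"
  else
    echo "not reachable"
  fi
}

# Print the URL to open in the browser.
rstudio_url 8787
```

Calling `check_rstudio 8787` while the tunnel is closed should print "not reachable", which usually means the SSH session from the previous step is not running.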
To test whether the RStudio installation was successful, you can run a test script that executes R-based MapReduce and Spark jobs on the cluster. Go back to the SSH console and enter the following commands to download the test script to run in RStudio.
If you created a Hadoop cluster with R, use this command.
If you created a Spark cluster with R, use this command.
In RStudio, you will see the test script you downloaded. Double click the file to open it, select the contents of the file, and then click Run. You should see the output in the Console pane.