# Lab 7 - Extending the cluster with HDInsight Applications

AdventureWorks is interested in using HDInsight applications to extend the capabilities of their cluster. They are interested in two applications, H2O Sparkling Water and Apache Solr. H2O will provide machine learning and predictive analytics, while Solr will provide enterprise search capabilities.

They have provided you with the tables for users, products, and weblogs that contain all the data you need. You will build and train a deep learning model using H2O Sparkling Water, combining the capabilities of Spark with H2O. Then, you will use Solr to add search capabilities to the AdventureWorks cluster.

In this lab you will learn how to extend an existing HDInsight cluster by installing both third-party and custom applications.

## Prerequisites

Before attempting this lab, make sure you:
* Have provisioned an HDInsight 3.6 cluster with Spark 2.1.
* Have copied the retaildata to the default storage for your Spark cluster.
* Are running these notebooks from your HDInsight cluster.

These steps are described in the lab-preqs guide, included with these notebooks.

### Install H2O Sparkling Water
![H2O](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/h2o.png)

H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment.

Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark. With Sparkling Water, users can drive computation from Scala/R/Python and utilize the H2O Flow UI, providing an ideal machine learning platform for application developers.

H2O can be installed on an existing HDInsight cluster, or can be included as part of a new cluster creation. For our purposes, we are going to install on our existing cluster.

From your cluster blade in the Azure portal:
1. Select the `Applications` option under Configuration, or on the overview blade.
    ![Applications](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/applications-access.png)
2. On the Applications blade, select **H2O Artificial Intelligence for HDInsight**, under Available applications. (See below if the H2O application is listed under Unavailable applications.)
    ![Applications - H2O](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/applications-blade.png)
3. Select **Review Legal Terms** on the H2O Artificial Intelligence blade.
    ![H2O Blade](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/h2o-blade.png)
4. Review the Terms of use, and select **Purchase**.
    ![H2O Terms of Use](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/h2o-legal-terms.png)
    ![Purchase](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/purchsae.png)
5. Click **OK** on the H2O blade, with the Legal terms accepted.
    ![H2O Legal Terms Accepted](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/h2o-legal-terms-accepted.png)
6. Click **Next** on the Applications blade to install H2O.
    a. The installation will take approximately 10 minutes to complete.
    b. Once the installation is complete, you can see the application by clicking again on **Applications** on the cluster blade.
        ![Installed Apps](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/h2o-installed.png)

> If, during the installation process above, the **H2O** application is listed under Unavailable applications, you will need to delete your cluster and create a new one, including H2O as part of the cluster creation process.
![H2O Unavailable](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/h2o-unavailable-apps.png)

To delete and create a new cluster, follow the steps below:

From your cluster blade in the Azure portal:
1. Select **Delete**, and select **Yes** to confirm you want to delete the cluster.
    ![Delete Cluster](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/delete-cluster.png)
2. You will receive a message that the cluster is being deleted.
    ![Deleting Cluster Message](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/deleting_cluster.png)
3. Now, you will create a new cluster in the same resource group.
4. Navigate to the resource group, and click **+ Add**.
    ![Resource Group App](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/resource-group-add.png)
5. Enter HDI in the search box, and select HDInsight.
    ![HDInsight Search](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/hdinsight-search.png)
6. Select HDInsight from the list.
    ![HDInsight](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/hdinsight-select.png)
7. On the HDInsight blade, select **Create**.
    ![HDInsight create blade](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/hdinsight-create-blade.png)
8. On the HDInsight blade, select **Custom (size, settings, apps)** at the top of the blade, so the H2O application can be installed as part of the creation process.
    ![HDInsight Custom Install](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/hdinsight-custom-install.png)

After installing H2O on HDInsight, you can use the built-in Jupyter notebooks to write your first H2O on HDInsight applications.

## Starting the H2O Cluster
Now that H2O is installed, the next step is to configure the environment. Most of the configuration is already taken care of by the system, such as the Flow UI address, the Spark jar location, and the Sparkling Water egg file. However, there are still a few settings that you need to configure.

### Find Edge node hostname
To ensure we are able to connect to the H2O Flow UI, we need to assign the proper value to the `spark.ext.h2o.announce.rest.url` setting in the configuration below. To find the correct host name, we look it up in the H2O Sparkling Water configuration file.

Let's start by importing the Python types we'll need to query the H2O config file.

In [1]:
# Import necessary types for querying the H2O config file.
from pyspark.sql.types import StructType, StructField, StringType

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
1,application_1507613647085_0005,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


Now, read the H2O config file from its Azure storage location. Be sure to replace the clustername with the appropriate value.

In [3]:
clustername = 'hdilabskyle9' # Replace with your HDInsight cluster name
config_schema = StructType([
        StructField('Name',StringType()), 
        StructField('Value', StringType())])

swconfig = spark.read.csv("/HdiApplications/ScriptActionCfgs/%s-h2o-sparklingwater.cfg" % clustername,
                    schema=config_schema,
                    sep="=",
                    header=False)

swconfig.filter(swconfig['Name'] == "EDGENODE_HOSTS").select(swconfig['Value']).show(1, False)

+-------------------------------------------------------------------+
|Value                                                              |
+-------------------------------------------------------------------+
|("ed11-hdilab.yzcjmk5uxsoepi32miarvyzmib.cx.internal.cloudapp.net")|
+-------------------------------------------------------------------+

Copy the host name value (e.g., ed11-hdilab.fu31bippkliejecocwb1m5yjga.bx.internal.cloudapp.net) from the output above, and paste it into the code below, replacing the host name in the `spark.ext.h2o.announce.rest.url` value. This ensures you can connect to the H2O Flow UI once H2O Sparkling Water has started.
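
The copy-and-paste step can also be done programmatically. The following is a minimal sketch in plain Python, assuming the config value has the form shown in the output above (a single host name wrapped in parentheses and double quotes) and that Flow is announced on port 5000 under `/flows`, as in the configuration cell that follows. The helper names `parse_edge_host` and `announce_url` are hypothetical.

```python
def parse_edge_host(raw_value):
    """Strip the surrounding parentheses and quotes from an EDGENODE_HOSTS value."""
    return raw_value.strip().strip('()').strip('"')


def announce_url(host, port=5000, path="/flows"):
    """Build a value for spark.ext.h2o.announce.rest.url (port and path are assumptions)."""
    return "http://%s:%d%s" % (host, port, path)


# Value copied from the sample output above; yours will differ.
raw = '("ed11-hdilab.yzcjmk5uxsoepi32miarvyzmib.cx.internal.cloudapp.net")'
edge_host = parse_edge_host(raw)
print(announce_url(edge_host))
```

If the value ever lists more than one host, this sketch would need to be extended to split them first.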

### Set the H2O configuration
There are four important parameters which must still be configured: 
1. H2O Flow UI URL (retrieved in the previous step)
2. Driver memory
3. Executor memory
4. The number of executors

The driver memory, executor memory, and number of executors are determined by the number of worker nodes in the cluster. In our case, we have 2 worker nodes, so we will use 1 executor to ensure there are enough resources for H2O to work properly. Aim to keep resource utilization for the H2O cluster below 75%.

> Note that all Spark applications deployed using a Jupyter notebook use the "yarn-cluster" deploy mode. This means that the Spark driver will be allocated on one of the worker nodes of the cluster, not on the head nodes.
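
As a rough way to reason about the 75% guideline, the sketch below adds up the memory requested by the driver and executors and compares it with the total memory of the worker nodes. The 28 GB per worker node is a hypothetical figure chosen for illustration (check your own node size), YARN and operating system overhead are ignored, and `h2o_memory_utilization` is a made-up helper name.

```python
def h2o_memory_utilization(driver_gb, executor_gb, num_executors,
                           worker_nodes, node_memory_gb):
    """Return requested memory as a fraction of total worker-node memory."""
    requested = driver_gb + executor_gb * num_executors
    available = worker_nodes * node_memory_gb
    return requested / float(available)


# Memory and executor values from this lab's configuration;
# node_memory_gb=28 is an assumption, not a value from the lab.
utilization = h2o_memory_utilization(driver_gb=10, executor_gb=20, num_executors=1,
                                     worker_nodes=2, node_memory_gb=28)
print("Requested %.0f%% of worker memory" % (utilization * 100))
print("Within 75%% guideline: %s" % (utilization < 0.75))
```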

In [4]:
%%configure -f 
{
    "conf":{
        "spark.ext.h2o.announce.rest.url": "http://ed11-hdilab.yzcjmk5uxsoepi32miarvyzmib.cx.internal.cloudapp.net:5000/flows",
        "spark.jars":"/H2O-Sparkling-Water-files/sparkling-water-assembly-all.jar",
        "spark.submit.pyFiles":"/H2O-Sparkling-Water-files/pySparkling.egg",
        "spark.locality.wait":"3000",
        "spark.scheduler.minRegisteredResourcesRatio":"1",
        "spark.task.maxFailures":"1",
        "spark.yarn.am.extraJavaOptions":"-XX:MaxPermSize=384m",
        "spark.yarn.max.executor.failures":"1",
        "maximizeResourceAllocation": "true"
    },
    "driverMemory":"10G",
    "executorMemory":"20G",
    "numExecutors":1
}

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
2,application_1507613647085_0006,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


### Initiate the H2O Context

To start the H2O cluster, we call `h2o_context = pysparkling.H2OContext.getOrCreate(sc)`. This initiates an H2O context on top of Spark, using the default Spark context (`sc`).

In [5]:
import pyspark
import pysparkling, h2o
import os
os.environ["PYTHON_EGG_CACHE"] = "~/"

h2o_context = pysparkling.H2OContext.getOrCreate(sc)

Connecting to H2O server at http://10.0.0.11:54321... successful.
--------------------------  -------------------------------
H2O cluster uptime:         11 secs
H2O cluster version:        3.10.4.3
H2O cluster version age:    6 months and 9 days !!!
H2O cluster name:           sparkling-water-yarn_1937546603
H2O cluster total nodes:    1
H2O cluster free memory:    17.78 Gb
H2O cluster total cores:    4
H2O cluster allowed cores:  4
H2O cluster status:         accepting new members, healthy
H2O connection url:         http://10.0.0.11:54321
H2O connection proxy:
H2O internal security:      False
Python version:             2.7.12 final
--------------------------  -------------------------------

### H2O Flow
Once the H2O Cluster is up and running, you can open H2O Flow by going to `https://<ClusterName>-h2o.apps.azurehdinsight.net:443`. This can also be accessed by selecting `Applications` on your HDInsight blade in the Azure portal, then selecting the `Portal` link next to the `h2o-sparklingwater` in the applications list.

![H2O Flow](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/h2o-flow-link.png)

> Note: If the H2O Flow link redirects you to a help page, try clearing your browser cache. If you are still unable to reach it, you likely don't have enough resources on your cluster. Try decreasing the amount of memory assigned in your configuration above, or increasing the number of **Worker nodes** under the **Scale cluster** option on your cluster blade in the Azure portal.

## Prepare the data for modeling
Our goal is to create a simple deep learning model for providing predictions about recommended products for users, based on actions, and the user's age and gender. For this, we need to combine data from the weblogs table with data from our users table.

Steps:
1. Define UDF to transform Action to an integer score (30, 70, 100)
2. Define UDF to transform gender to an integer (Male = 1, Female = 2)
3. Create a final, scored DataFrame that can be passed to H2O for our model

### Import Python modules
We are going to use Spark SQL to retrieve our data from Hive tables, so first, let's import the Python modules needed to run this notebook.

In [6]:
from pyspark.sql.types import *
from pyspark.sql.functions import UserDefinedFunction

Now, let's quickly look at the two datasets we will be using to build our model, weblogs and users. Both of these datasets have already been stored in Hive tables in our storage account. The tables are named `weblogs` and `users`.

In [7]:
# Weblogs data
weblogs_df = spark.sql("SELECT * FROM weblogs")
weblogs_df.show(5)

+------+--------------------+---------+--------+-----+----------+--------------------+----------------+-------+--------------------+----------------------+
|UserId|           SessionId|ProductId|Quantity|Price|TotalPrice|         ReferralURL|PageStopDuration| Action|     TransactionDate|CleanedTransactionDate|
+------+--------------------+---------+--------+-----+----------+--------------------+----------------+-------+--------------------+----------------------+
|  7516|9576e72c-356f-402...|      509|       0| 55.0|       0.0|         contoso.com|             159|Browsed|2/20/2016 10:26:0...|  2016-02-20 22:26:...|
|  7516|9576e72c-356f-402...|      482|       0|  3.8|       0.0|http://contoso.co...|              44|Browsed|2/20/2016 10:28:3...|  2016-02-20 22:28:...|
|  7516|9576e72c-356f-402...|      494|       0| 27.5|       0.0|http://contoso.co...|              93|Browsed|2/20/2016 10:29:2...|  2016-02-20 22:29:...|
|  7516|9576e72c-356f-402...|      513|       0|  8.0|       0.0

In [8]:
# Users data
users_df = spark.sql("SELECT * FROM users")
users_df.show(5)

+----+--------------------+-----+---------+--------------------+--------+--------------------+---------------+-----+------+---------+--------------+--------+--------------------+--------------------+---+--------------+--------------------+--------------------+--------------------+
|  id|            LoginMd5|Email|FirstName|        PictureLarge|LastName|           LoginSha1|       Username|Title|Gender|LoginSalt|         Phone|Password|         LoginSha256|    PictureThumbnail|Age|          Cell|           BirthDate|          Registered|       PictureMedium|
+----+--------------------+-----+---------+--------------------+--------+--------------------+---------------+-----+------+---------+--------------+--------+--------------------+--------------------+---+--------------+--------------------+--------------------+--------------------+
|9858|6952c3958a740dc51...|  NaN|    tiago|https://randomuse...|ekelmans|73600c23029c824bb...|    blueswan862|   mr|  Male| gjzhB97J|(107)-025-5278|   zho

As can be seen in the output from the queries above, there are a lot of fields in both datasets that are not needed for our model. Now, we want to create a combined DataFrame containing only the fields we want for the model.

In [9]:
# Combine only needed columns for the two tables for our model
weblogsWithUsers_df = spark.sql("SELECT w.UserId, u.Gender, u.Age, w.ProductId, w.Action FROM weblogs w JOIN users u ON w.UserId = u.Id")
weblogsWithUsers_df.show(5)

+------+------+---+---------+-------+
|UserId|Gender|Age|ProductId| Action|
+------+------+---+---------+-------+
|  7516|  Male| 51|      509|Browsed|
|  7516|  Male| 51|      482|Browsed|
|  7516|  Male| 51|      494|Browsed|
|  7516|  Male| 51|      513|Browsed|
|  7516|  Male| 51|      474|Browsed|
+------+------+---+---------+-------+
only showing top 5 rows

### Data Munging with the Spark API
Now that we have the data we need, there are a few modifications we need to make in order to get the table ready to be used for our model.

#### Create UDFs to transform data
Let's create a couple of User Defined Functions to handle transforming our Action and Gender fields into the proper structure for our model.

> Note: We are also converting the Score and Gender columns to IntegerType() in the process.

##### Assign numeric Score to Action
Next, we are going to assign scores to the Action field in the weblogs frame. We begin by defining how we want to weigh the implicit rating described by the Action field in the weblogs table. The rating is implicit because a user is not explicitly providing a rating (e.g., they never say "I rate this product 4 out of 5 stars"). Instead, we will infer their rating from their action.

A product that is browsed gets 30 points, a product that is added to the cart gets 70 points and a product that is purchased gets 100 points.

In [10]:
# UDF to map action values to a score
def action_to_score(col):
    if col == "Browsed":
        return 30
    elif col == "Add To Cart":
        return 70
    elif col == "Purchased":
        return 100
    else:
        return 0

# Declare an integer return type, matching the values returned by action_to_score
map_action_to_score = UserDefinedFunction(action_to_score, IntegerType())

Now, apply the UDF to our table, to get scores for each action.

In [11]:
result_scored = weblogsWithUsers_df.withColumn("Score", map_action_to_score("Action").cast(IntegerType()))
result_scored.show(5)

+------+------+---+---------+-------+-----+
|UserId|Gender|Age|ProductId| Action|Score|
+------+------+---+---------+-------+-----+
|  7516|  Male| 51|      509|Browsed|   30|
|  7516|  Male| 51|      482|Browsed|   30|
|  7516|  Male| 51|      494|Browsed|   30|
|  7516|  Male| 51|      513|Browsed|   30|
|  7516|  Male| 51|      474|Browsed|   30|
+------+------+---+---------+-------+-----+
only showing top 5 rows
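
The same mapping can also be written as a dictionary lookup, which keeps the weights in one place and makes them easy to check in plain Python before wrapping the function in a UDF. This is an alternative sketch, not the notebook's own code; `ACTION_SCORES` and `action_to_score_lookup` are hypothetical names.

```python
# Implicit-rating weights described above; unknown or missing actions score 0.
ACTION_SCORES = {"Browsed": 30, "Add To Cart": 70, "Purchased": 100}


def action_to_score_lookup(action):
    """Return the score for an action string, or 0 if it is not recognized."""
    return ACTION_SCORES.get(action, 0)


for action in ["Browsed", "Add To Cart", "Purchased", "Returned", None]:
    print("%s -> %d" % (action, action_to_score_lookup(action)))
```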

##### Convert Gender to numeric value
We also need to convert our gender column to a numeric value for our model.

In [12]:
# UDF to map gender to a numeric value
def gender_to_int(col):
    if col == "Male":
        return 1
    else:
        return 2
    
# Declare an integer return type, matching the values returned by gender_to_int
map_gender_to_int = UserDefinedFunction(gender_to_int, IntegerType())

By sending Gender data into the UDF, we can convert our Male and Female values into numeric representations.

In [13]:
result_final = result_scored.withColumn("Gender", map_gender_to_int("Gender").cast(IntegerType()))
result_final.show(5)

+------+------+---+---------+-------+-----+
|UserId|Gender|Age|ProductId| Action|Score|
+------+------+---+---------+-------+-----+
|  7516|     1| 51|      509|Browsed|   30|
|  7516|     1| 51|      482|Browsed|   30|
|  7516|     1| 51|      494|Browsed|   30|
|  7516|     1| 51|      513|Browsed|   30|
|  7516|     1| 51|      474|Browsed|   30|
+------+------+---+---------+-------+-----+
only showing top 5 rows
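
Note that `gender_to_int` maps every value other than "Male" to 2, including nulls and unexpected spellings. If that matters for your data, a stricter variant could return `None` for anything it does not recognize, which Spark stores as a null. The sketch below is an optional alternative in plain Python; `GENDER_CODES` and `gender_to_int_strict` are hypothetical names.

```python
GENDER_CODES = {"Male": 1, "Female": 2}


def gender_to_int_strict(gender):
    """Return 1 for Male, 2 for Female, and None for anything else."""
    return GENDER_CODES.get(gender)


for value in ["Male", "Female", "male", None]:
    print("%r -> %r" % (value, gender_to_int_strict(value)))
```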

We can verify that our `Gender` and `Score` columns are the proper type (`IntegerType`) by looking at the schema of the `result_final` frame.

In [15]:
result_final.printSchema()

root
 |-- UserId: long (nullable = true)
 |-- Gender: integer (nullable = true)
 |-- Age: integer (nullable = true)
 |-- ProductId: integer (nullable = true)
 |-- Action: string (nullable = true)
 |-- Score: integer (nullable = true)

## Publish result as H2OFrame
With the table now in the required shape, we can publish the Spark DataFrame to an H2OFrame, assigning it a friendly name. We will define the columns to use for our H2OFrame as part of the process, so our H2OFrame will include `UserId`, `Age`, `Gender`, `ProductId`, and `Score`. Note, we've also dropped the Action column.

Note: This will take several minutes to complete.

In [16]:
# Publish Spark DataFrame as H2OFrame with given name
final_columns = ["UserId", "Age", "Gender", "ProductId", "Score"]
weblogsWithUsers_hf = h2o_context.as_h2o_frame(result_final.select(final_columns), "weblogsWithUsers")
weblogsWithUsers_hf.show()

  UserId    Age    Gender    ProductId    Score
--------  -----  --------  -----------  -------
    7516     51         1          509       30
    7516     51         1          482       30
    7516     51         1          494       30
    7516     51         1          513       30
    7516     51         1          474       30
    7516     51         1          494       30
    7516     51         1          505       30
    7516     51         1          514       30
    7516     51         1          480       30
    7516     51         1          497       30

[84916780 rows x 5 columns]

## View H2OFrame in H2O Flow UI
Once the creation of the H2OFrame above is complete, open the H2O Flow UI page (`https://<ClusterName>-h2o.apps.azurehdinsight.net:443`).
![H2O Flow Dashboard](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/h2o-flow-dashboard.png)

From the H2O Flow dashboard, select `getFrames` to view the frames you loaded above. 
![H2O Flow getFrames](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/h2o-flow-get-frames.png)

You should see a frame named `weblogsWithUsers` listed.
![H2O Flow Frames](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/h2o-flow-frames.png)

Select the `weblogsWithUsers` frame, and take a closer look at the data and options available in H2O Flow.
![H2O Flow Frame Summary](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/h2o-flow-frame-summary.png)

The H2O Flow UI provides the capability to build models and predictions directly from the web UI. However, for this lab we are going to return to our Jupyter notebook and create the model there, using Python.

## Prepare the training and test datasets
Our next task is to create the datasets to use for training and testing our model, and to assign the categoricals needed to perform our modeling.

#### Assign factors
The Score, Age, and Gender columns need to be factors (categorical columns).

In [None]:
# Transform select columns to categoricals
weblogsWithUsers_hf["Score"] = weblogsWithUsers_hf["Score"].asfactor()
weblogsWithUsers_hf["Age"] = weblogsWithUsers_hf["Age"].asfactor()
weblogsWithUsers_hf["Gender"] = weblogsWithUsers_hf["Gender"].asfactor()
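
Converting a column to a factor tells H2O to treat its values as category labels rather than as numbers, so a score of 100 is not interpreted as a quantity that is larger than a score of 30. Conceptually, each distinct value becomes a level. The plain-Python sketch below illustrates that idea on a handful of made-up scores; it is not how H2O implements `asfactor()` internally, and `to_factor` is a hypothetical helper.

```python
def to_factor(values):
    """Return (levels, codes): the sorted distinct values and each row's level index."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    codes = [index[v] for v in values]
    return levels, codes


scores = [30, 30, 70, 100, 30, 70]  # made-up sample values
levels, codes = to_factor(scores)
print("levels: %s" % levels)
print("codes:  %s" % codes)
```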

#### Set the predictor names and response column name

In [None]:
# Set the predictor names and the response column name
predictors = ["Age", "Gender"]
response = "Score"

#### Prepare training and validation datasets for modeling

In [None]:
# Split frame into two - we use one as the training frame and the second as the validation frame
splits = weblogsWithUsers_hf.split_frame(ratios=[0.75], destination_frames=["train", "valid"], seed=42)
train_hf = splits[0]
valid_hf = splits[1]
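
With `ratios=[0.75]`, roughly three quarters of the rows go to the training frame and the remainder to the validation frame. H2O assigns rows to splits randomly, so the resulting sizes are approximate rather than exact. The sketch below imitates that behaviour in plain Python on 10,000 made-up row indices; it illustrates the idea only and does not reproduce H2O's actual row assignment for the same seed. `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, ratio, seed):
    """Randomly assign each row to the first split with probability `ratio`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < ratio else second).append(row)
    return first, second


rows = list(range(10000))  # stand-in for row indices
train, valid = random_split(rows, ratio=0.75, seed=42)
print("train: %d rows, valid: %d rows" % (len(train), len(valid)))
print("train fraction: %.3f" % (len(train) / float(len(rows))))
```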

## Train a deep neural network to predict recommended products for a given user
We are going to create an H2O deep learning model to perform our predictions with the following settings:

* model_id: "ratingsModel"
* epochs: 0.25
* activation: Tanh
* hidden: [10, 10, 10]

In [None]:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

dl_model = H2ODeepLearningEstimator(model_id = "ratingsModel", activation = "Tanh", hidden = [10, 10, 10], epochs = 0.25)
dl_model.train(x = predictors,
               y = response,
               training_frame = train_hf,
               validation_frame = valid_hf)

dl_model.show()
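
To make the settings concrete: `hidden = [10, 10, 10]` describes three hidden layers of 10 neurons each, `activation = "Tanh"` is the function applied in those layers, and `epochs = 0.25` asks for about a quarter of a pass over the training data. The sketch below runs a forward pass through a network of that shape in plain Python with random weights. It is an illustration only: the input size of 4 and the 3 output classes are made-up values, and nothing here reflects H2O's internal implementation or trained weights.

```python
import math
import random


def dense(inputs, weights, biases):
    """Compute one fully connected layer; weights holds one row per neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(values):
    """Turn raw output values into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(inputs, layer_sizes, rng):
    """Forward pass with Tanh hidden layers and a softmax output layer."""
    activations = inputs
    for i, size in enumerate(layer_sizes):
        weights = [[rng.uniform(-1, 1) for _ in activations] for _ in range(size)]
        biases = [0.0] * size
        outputs = dense(activations, weights, biases)
        is_last = i == len(layer_sizes) - 1
        activations = softmax(outputs) if is_last else [math.tanh(o) for o in outputs]
    return activations


rng = random.Random(0)
# Three hidden layers of 10 neurons, then 3 output classes (an assumed count).
probabilities = forward([0.5, -1.0, 0.0, 1.0], [10, 10, 10, 3], rng)
print(["%.3f" % p for p in probabilities])
```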

### Test the model
Now, let's put our model to use, using our validation frame as test data.

In [None]:
predictions = dl_model.predict(valid_hf)
predictions.show(5)

In [None]:
performance = dl_model.model_performance(valid_hf)
performance.show()
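
Because `Score` is a factor, this is a classification model, and the performance report typically includes measures such as a confusion matrix. As a reminder of what such a table summarizes, the sketch below computes accuracy and confusion counts in plain Python from a few made-up actual and predicted labels. The data is invented for illustration and is unrelated to the model's real results.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of rows where the predicted label equals the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual))


# Made-up labels for illustration only.
actual = [30, 30, 70, 100, 30, 70]
predicted = [30, 30, 30, 100, 30, 70]

print("accuracy: %.2f" % accuracy(actual, predicted))
for (a, p), count in sorted(confusion_counts(actual, predicted).items()):
    print("actual=%d predicted=%d count=%d" % (a, p, count))
```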

We have now created a predictive model using H2O's Deep Learning algorithm.

## Install Solr
![Apache Solr](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/apache-solr.png)

Apache Solr is a third-party custom application, which has not been published to the Azure portal. Apache Solr is an enterprise search platform that enables powerful full-text search on data. While HDInsight enables storing and managing vast amounts of data, Apache Solr provides the capabilities to quickly retrieve the data.

Solr will be installed using a Script Action on our HDInsight cluster.

1. From your cluster blade in the Azure Portal, select **Script Actions**.
    ![Configuration Script Action Link](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/config-script-action.png)
2. Select **+ Submit New**.
    ![Submit New Script Action](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/submit-new-script-action.png)
3. From the Script type drop down, select **Install Solr**.
    ![Submit New Script Action](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/solr-install-script-action.png)
4. Leave the default settings, and select **Create**.
   
After a few seconds, Solr will be installed. If you select Notifications in the Azure portal, you should see a message like the following:
![Submit New Script Action](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/solr-install-success.png)

That completes our installation of Solr! Now let's add some data to it.

### Adding data to Solr
Now that Solr is installed, let's add some sample data into it, and create an index for the data. To start this exercise, we will use sample data provided in the Solr installation, as Solr is configured by default with a specific schema that it expects data to conform to.

To get at the sample data, we need to open an SSH connection to the head node of our cluster, where Solr is installed.

1. Create an SSH connection to the primary head node of the cluster.
    a. Use SSH to connect to the cluster head node, using the following command: 
        > `ssh USERNAME@CLUSTERNAME-ssh.azurehdinsight.net`
    b. Once connected, run the following commands to copy the sample data into Solr.
        * `cd /usr/hdp/current/solr/example/exampledocs`
        * `java -jar post.jar solr.xml monitor.xml`
            ![Java jar output](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/java-jar-output.png)

### Using the Solr dashboard
The Solr dashboard is a web UI that allows you to work with Solr through your web browser. The Solr dashboard is not exposed directly on the Internet, so you will need to use an SSH tunnel to access it.

#### Determine the host name for the primary node.
Using your SSH shell created above, type the following command to get the host name:
```
hostname -f
```
![hostname command](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/ssh-hostname-command.png)

Copy the output from that command, as we will be using it below.

#### Configure system to connect to Solr dashboard

1. Open a new bash terminal window.
2. Use SSH to create an SSH tunnel to the cluster head node, using the following command, replacing PORT with an unused local port number, and CLUSTERNAME with your cluster name.
    ```
    ssh -C2qTnNf -D PORT sshuser@CLUSTERNAME-ssh.azurehdinsight.net
    ```
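
If you are unsure which port to use, one option is to ask the operating system for a port that is currently free. The sketch below does that with Python's standard `socket` module on your local machine; `find_free_port` is a hypothetical helper, and there is a small chance another program takes the port between this check and the `ssh` command.

```python
import socket


def find_free_port():
    """Bind to port 0 so the OS picks an unused port, then report and release it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]
    finally:
        sock.close()


port = find_free_port()
print("ssh -C2qTnNf -D %d sshuser@CLUSTERNAME-ssh.azurehdinsight.net" % port)
```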

##### Configure proxy on local machine
1. On your local machine, open your network settings, and create a SOCKS proxy for localhost and the PORT you specified in the connection above.
    a. For Windows machines, go to Control Panel, Network and Internet, and then Internet Options.
        * On the Internet Options dialog, select the Connections tab, then select LAN Settings.
        * On the LAN Settings dialog, check Use a proxy server..., and select Advanced.
        * On the Advanced dialog:
            * Enter "localhost" in the Socks textbox, and your PORT value in the Port textbox.
            * Select OK.
        * Select Apply.
    b. For Mac OS, open System Preferences, and select Network.
        * Select Advanced on the Network dialog, then select the Proxies tab.
        * Check SOCKS Proxy, enter localhost for the SOCKS Proxy Server.
        * Enter your PORT in the port textbox, next to the SOCKS Proxy Server box.
        * Select OK.
        * Select Apply on the Network dialog.

### Open the Solr dashboard
In your browser, connect to `http://HOSTNAME:8983/solr/#/`, where HOSTNAME is the name you copied after executing the `hostname -f` command in a previous step.

![Solr Dashboard](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/solr-dashboard.png)

### View indexed data
From the left menu, click the **Core Selector** drop-down.
![Solr Core Selector](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/solr-core-selector.png)

Then, select **collection1**.
![Solr collection1](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/solr-collection1.png)

Under **collection1**, select **Query**.
![Solr query](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/solr-query.png)

Leave the default query settings, and select **Execute Query**.
![Solr Query Results](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/solr-query-results.png)


### Add documents
Now, let's add some of our own data. The JSON below is output from our products table, formatted in the shape needed by Solr.

```javascript
[{"id":628,"name":"High Heels","price":31.0,"cat":"31","category":"Womens Casual Shoes","sku":"Clothing"},
{"id":627,"name":"Wedge Heel Shoes","price":27.0,"cat":"31","category":"Womens Casual Shoes","sku":"Clothing"},
{"id":626,"name":"Ankle Boots","price":26.0,"cat":"31","category":"Womens Casual Shoes","sku":"Clothing"},
{"id":625,"name":"Summer Sandal","price":25.0,"cat":"31","category":"Womens Casual Shoes","sku":"Clothing"},
{"id":624,"name":"Canvas Boat Shoe","price":16.0,"cat":"31","category":"Womens Casual Shoes","sku":"Clothing"},
{"id":623,"name":"Indoor Slipper","price":12.0,"cat":"31","category":"Womens Casual Shoes","sku":"Clothing"},
{"id":622,"name":"High Heel Zip Snow Boots","price":35.0,"cat":"31","category":"Womens Casual Shoes","sku":"Clothing"},
{"id":621,"name":"Knee High Boots","price":22.0,"cat":"31","category":"Womens Casual Shoes","sku":"Clothing"},
{"id":620,"name":"Welly Boots","price":32.0,"cat":"31","category":"Womens Casual Shoes","sku":"Clothing"},
{"id":619,"name":"Fully Fleece Lining Snow Boots","price":25.0,"cat":"31","category":"Womens Casual Shoes","sku":"Clothing"},
{"id":617,"name":"Faux Fur Slipper","price":19.0,"cat":"31","category":"Womens Casual Shoes","sku":"Clothing"},
{"id":616,"name":"Faux Fur Snow Boot","price":25.0,"cat":"31","category":"Womens Casual Shoes","sku":"Clothing"}]
```
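
Before pasting, it can help to confirm that the text parses as JSON and that every document has an `id`. The sketch below checks a two-document excerpt of the data above using Python's standard `json` module; `check_solr_docs` is a hypothetical helper, and the assumption that `id` must be present reflects its usual role as the unique key in Solr's example schema.

```python
import json

# Two-document excerpt of the product data shown above.
docs_text = """
[{"id":628,"name":"High Heels","price":31.0,"cat":"31","category":"Womens Casual Shoes","sku":"Clothing"},
{"id":627,"name":"Wedge Heel Shoes","price":27.0,"cat":"31","category":"Womens Casual Shoes","sku":"Clothing"}]
"""


def check_solr_docs(text):
    """Parse the JSON text and return the list of document ids, or raise ValueError."""
    docs = json.loads(text)
    missing = [d for d in docs if "id" not in d]
    if missing:
        raise ValueError("%d document(s) have no id" % len(missing))
    return [d["id"] for d in docs]


print(check_solr_docs(docs_text))
```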

Select **Documents** in the **collection1** menu, and select Solr Command (raw XML or JSON) for the Document Type:
![Solr Document Add](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/solr-document-add.png)

Then, paste the JSON text above into the Document(s) box, replacing any existing text.

![Solr JSON Document Upload](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/solr-json-document-insert.png)

Click **Submit Document** to add the JSON documents to the search index.
![Solr JSON Response](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab07/images/solr-json-insert-response.png)

### Query the data
Now, you can return to the query screen, and view the newly added documents in the search results.
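
The query screen is a front end for Solr's `select` request handler, so the same search can be expressed as a URL. The sketch below builds such URLs with Python's standard library, assuming the `collection1` core used above and the `HOSTNAME` placeholder from the earlier step. `solr_select_url` is a hypothetical helper, and the URLs are only reachable through the SSH tunnel.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def solr_select_url(host, query="*:*", core="collection1", rows=10):
    """Build a Solr select URL that asks for JSON results."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    return "http://%s:8983/solr/%s/select?%s" % (host, core, urlencode(params))


# Match all documents (the default query in the dashboard).
print(solr_select_url("HOSTNAME"))
# Match documents whose name field matches "Boots".
print(solr_select_url("HOSTNAME", query="name:Boots"))
```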

## Cleanup

### Proxy settings
* Remove the proxy settings you created for the SSH tunnel to the Solr dashboard.

### Stop Solr
* Execute the following command to stop Solr on the cluster:
    ```
    sudo stop solr
    ```

## Conclusion
In this lab you have learned how to extend an HDInsight cluster using third-party applications, adding valuable functionality to the AdventureWorks cluster. There are multiple applications available to extend HDInsight clusters in the Azure portal, and more can be developed by Microsoft, independent software vendors (ISVs), or you.

During this lab you:
* Installed a custom application, Apache Solr, using a Script Action, since it has not been published to the Azure portal.
* Used Solr to index and search a small sample of product data.
* Installed H2O Sparkling Water.
* Created a deep learning model using Spark and H2O.
* Used H2O Flow UI to view your model and predictions.