# Lab 5 - Monitoring & Configuration

AdventureWorks would like to monitor the health and performance of their HDInsight cluster, which is essential for maintaining maximum performance and resource utilization. They would like to use monitoring to find and address possible coding or cluster configuration errors. They also want to address potential issues proactively by being alerted when certain events occur, such as a failing app.

To meet their requirements, you will enable Azure Log Analytics in Operations Management Suite (OMS) to monitor their cluster's operations. You will demonstrate how to query the cluster's logs for errors or other events, and set up alerts based on those queries. You will also use the YARN and Spark UIs to track applications, and use the Spark History Server to view the history of the applications. Finally, you will use Ambari to configure Spark settings.

## Submit Spark jobs

Let's submit a few Spark jobs to show activity in our charts as well as logs.

First, let's import the libraries required by our queries.

In [None]:
import pprint, datetime
from pyspark import SparkContext
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import date_format,unix_timestamp

In [None]:
%%sql
select Action, Count(weblogs.*) as Ct, Max(CleanedTransactionDate), Min(CleanedTransactionDate)
from weblogs
group by Action
order by Ct desc

This next query is a bit more complex, so it should show a spike in resource usage and generate a more interesting DAG.

In [None]:
%%sql
select weblogs.UserId, users.Username, products.ProductName, weblogs.Quantity,
weblogs.Price, weblogs.TotalPrice, weblogs.ReferralURL, weblogs.PageStopDuration,
weblogs.Action from weblogs
inner join products
on weblogs.productid = products.productid
inner join users on weblogs.userid = users.id
where CleanedTransactionDate > '1/15/2016'
and PageStopDuration > 10 and ReferralURL like '%contoso%'
and weblogs.ProductId between 87 and 709 and weblogs.Price >= 15
and Gender = 'Female'

## Enable Azure Log Analytics

Log Analytics is a service in OMS that monitors your cloud and on-premises environments to maintain their availability and performance. It collects data generated by resources in your cloud and on-premises environments and from other monitoring tools to provide analysis across multiple sources.

Before we begin, we must create a Log Analytics workspace. You can think of this workspace as a unique Log Analytics environment with its own data repository, data sources, and solutions. The workspace must exist before you can associate it with the HDInsight cluster. Refer to [this resource](https://docs.microsoft.com/azure/log-analytics/log-analytics-quick-collect-azurevm#create-a-workspace) for instructions on creating a workspace.

### Create the Log Analytics workspace

Create a new Log Analytics workspace called 'HDI-Lab05', if available. If unavailable, add your name to the end (e.g. HDI-Lab05Joel).

### Enable Monitoring on the cluster

Go to the cluster settings on the Azure Portal and enable Monitoring, selecting the HDI-Lab05 (or the name you provided) OMS workspace you created.

### Add Spark cluster management solution to Log Analytics

Add the HDInsight Spark Monitoring solution from the OMS Solution Gallery on the workspace you created. Instructions can be found [here](https://docs.microsoft.com/azure/hdinsight/hdinsight-hadoop-oms-log-analytics-management-solutions).

Click on the HDInsightSpark solution tile on the OMS dashboard. You will see charts displaying resource utilization, links to display log queries, and information on cluster health. When you add OMS cluster management solutions for several clusters, you can view all of the information using OMS, instead of needing to log in to each cluster dashboard in Ambari to view this information.

![HDInsightSpark charts](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab05/images/hdinsight-spark-charts.png)

**Leave the OMS Dashboard open** for now.

## Track applications in YARN UI

We'll now look at the traditional way to view cluster activity and application/job status, using the Ambari Web UI. More information about Ambari can be [found here](https://docs.microsoft.com/azure/hdinsight/hdinsight-hadoop-manage-ambari). In short, Ambari simplifies the management and monitoring of Hadoop clusters by providing an easy to use web UI and REST API. It is used to monitor the cluster and make configuration changes.

Hadoop has various services running across its distributed platform. YARN (Yet Another Resource Negotiator) coordinates these services, allocates cluster resources, and manages access to a common data set. We can use the YARN UI to track applications, or jobs, and to monitor their progress.

### Open Ambari dashboard to view cluster status and charts

When you open the Ambari Web UI, you will see charts that are somewhat similar to the ones we saw in the OMS dashboard. A list of all of the running cluster services appears on the left-hand side of the page. Clicking on a service allows you to view detailed information about it, including the nodes on which it is running and its configuration options.

Your Ambari dashboard should look similar to the following:

![Ambari Web UI dashboard](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab05/images/ambari-dashboard.png)

### Open the YARN UI

The YARN UI will allow us to view our applications that are currently running, have finished running, have been submitted or killed, etc. You can follow [these instructions](https://docs.microsoft.com/azure/hdinsight/hdinsight-apache-spark-resource-manager#how-do-i-launch-the-yarn-ui) to open YARN UI. It should look like the following:

![YARN UI](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab05/images/yarn-ui.png)

### View running applications

Within the YARN UI, look at the various applications using the menu on the left. When you look at the RUNNING applications, you should see one for the Jupyter notebook. View the details of this application, then view it on the Spark History Server, using the link within the details page. Instructions for doing this can be found [here](https://docs.microsoft.com/azure/hdinsight/hdinsight-apache-spark-job-debugging#track-an-application-in-the-yarn-ui).
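
The list of running applications can also be read programmatically. The sketch below assumes the YARN ResourceManager REST API (`/ws/v1/cluster/apps`) is reachable through the cluster gateway; the base URL, credentials, and sample application record are placeholders and not values from this lab. Only the parsing step runs as written; `fetch_apps` is a helper you would call yourself against a live cluster.

```python
import base64
import json
import urllib.request

# Hypothetical placeholder; replace with your own cluster's gateway URL.
BASE_URL = "https://CLUSTERNAME.azurehdinsight.net"


def fetch_apps(base_url, user, password, state="RUNNING"):
    """Call the YARN ResourceManager REST API and return the decoded JSON."""
    url = f"{base_url}/ws/v1/cluster/apps?states={state}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize_apps(payload):
    """Return (id, name, state) tuples from a cluster/apps response."""
    # The "apps" value may be null when no application matches the filter.
    apps = (payload.get("apps") or {}).get("app", [])
    return [(app["id"], app["name"], app["state"]) for app in apps]


# Illustrative response shaped like the API's output (made-up values).
sample = {
    "apps": {
        "app": [
            {
                "id": "application_0000000000000_0001",
                "name": "example-notebook-session",
                "state": "RUNNING",
            },
        ]
    }
}

for app_id, name, state in summarize_apps(sample):
    print(app_id, name, state)
```

Against a real cluster you would pass the result of `fetch_apps(BASE_URL, user, password)` to `summarize_apps` instead of the sample dictionary.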

## Submit a long-running job to cause CPU spike

We'll create a Spark DataFrame from the weblog files, even though we have created a Hive table from the data already. This will allow us to execute a long-running query.

In [None]:
df = spark.read.csv("/retaildata/rawdata/weblognew/*/*/weblog.txt",sep="|",header=True)

Now let's submit a long-running query (around 4 minutes) so we can view CPU usage details for this job, setting a benchmark for our CPU spike alert later on.

In [None]:
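# Parse TransactionDate and print up to five rows whose value could not be
# parsed with the expected format (the derived "date" column is NULL).
pprint.pprint(
    df.select(
        "TransactionDate",
        date_format(
            unix_timestamp("TransactionDate", "M/d/yyyy h:mm:ss a").cast("timestamp"),
            "yyyy-MM-dd HH:mm:ss",
        ).alias("date"),
    ).where("date IS NULL").take(5)
)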

## View Spark Job details in the Spark History Server

The Spark History Server, otherwise known as Spark UI, provides a view into completed Spark jobs. Every Spark application launches a web application UI that displays useful information about the application, such as:

* Information about Spark SQL jobs.
* A list of stages and tasks.
* An event timeline that displays the relative ordering and interleaving of application events. The timeline view is available on three levels: across all jobs, within one job, and within one stage. The timeline also shows executor allocation and deallocation.
* The execution directed acyclic graph (DAG) for each job.
* Environment details: runtime information, property settings, and library paths.
* A summary of RDD sizes and memory usage.

We're going to use the Spark UI to view details about one of the Spark SQL jobs we ran earlier in this notebook. The key is to look for a description that includes `PythonRDD.scala`. Pick one that shows a large number of tasks across its stages. When we open it, we want to look at the directed acyclic graph (DAG), which shows a diagram of how those stages were executed.
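
If you prefer to locate such a job programmatically, the Spark History Server also exposes a monitoring REST API under `/api/v1` (for example, `/api/v1/applications/<app-id>/jobs`). The following is a minimal sketch that ranks job records by task count; the records are made up for illustration, and the helper name `jobs_matching` is hypothetical. It assumes each record carries the `jobId`, `name`, `numTasks`, and `stageIds` fields.

```python
def jobs_matching(jobs, text):
    """Keep jobs whose name mentions the given text, largest task count first."""
    matches = [job for job in jobs if text in job.get("name", "")]
    return sorted(matches, key=lambda job: job["numTasks"], reverse=True)


# Made-up job records; real ones would come from the History Server REST API.
sample_jobs = [
    {"jobId": 3, "name": "example action at <stdin>:1", "numTasks": 4, "stageIds": [5]},
    {"jobId": 7, "name": "collect at PythonRDD.scala", "numTasks": 212, "stageIds": [9, 10, 11]},
    {"jobId": 8, "name": "collect at PythonRDD.scala", "numTasks": 12, "stageIds": [12]},
]

ranked = jobs_matching(sample_jobs, "PythonRDD.scala")
for job in ranked:
    # Print the job id, its task count, and how many stages it has.
    print(job["jobId"], job["numTasks"], len(job["stageIds"]))
```

The first entry in `ranked` is the job with the most tasks, which is usually the most interesting one to open in the DAG view.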

## View CPU usage details in Log Analytics

Switch back to the OMS Dashboard. Click on the magnifying glass in the left-hand menu; it is labeled "Log Search" when you hover over it.

Paste the following query into the filter box:

`Perf | where CounterName == "% Processor Time" and InstanceName == "_Total" and (Computer startswith_cs "wn" and Computer contains_cs "-") and TimeGenerated >= ago(30minute) | summarize AggregatedValue = avg(CounterValue) by TimeGenerated, bin(TimeGenerated, 5min) | sort by TimeGenerated1 desc`

This query is searching the cluster logs for % Processor Time in all of the cluster's worker nodes, which are completing the bulk of the work, within the last 30 minutes. It then displays the average % Processor Time in 5 minute increments. Finally, it sorts by the time bin in descending order.
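
To make the binning and averaging step concrete, here is a small Python sketch of the same idea: round each timestamp down to a 5-minute boundary, average the values in each bin, and list the newest bin first. It illustrates the aggregation only and is not how Log Analytics is implemented; the sample values are made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bin_time(timestamp, width):
    """Round a timestamp down to the start of its bin."""
    epoch = datetime(1970, 1, 1)
    return epoch + ((timestamp - epoch) // width) * width


def average_by_bin(samples, width=timedelta(minutes=5)):
    """Average (timestamp, value) samples per bin, newest bin first."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[bin_time(timestamp, width)].append(value)
    averages = {start: sum(values) / len(values) for start, values in buckets.items()}
    return sorted(averages.items(), reverse=True)


# Made-up % Processor Time samples.
samples = [
    (datetime(2018, 1, 1, 10, 1), 4.0),
    (datetime(2018, 1, 1, 10, 4), 6.0),
    (datetime(2018, 1, 1, 10, 6), 20.0),
    (datetime(2018, 1, 1, 10, 9), 22.0),
]

for start, average in average_by_bin(samples):
    print(start, average)
```

The two samples before 10:05 fall into the 10:00 bin, and the two after it fall into the 10:05 bin, so the result contains one average per bin.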

What you are looking for is a CPU spike from when you ran the long-running job above. If you don't see a spike, it's possible that it's been longer than 30 minutes since you ran your job. If this is the case, either modify the `TimeGenerated >= ago(30minute)` portion of the query to the appropriate number of minutes, or re-run the long-running job and wait for a few minutes before re-executing the log query.

In the example below, it appears as though the long-running job caused a CPU spike of around 20%:

![Log Analytics query on % Processor Time](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab05/images/log-analytics-query.png)

If you switch over to the Ambari dashboard and click on the CPU Usage chart, you can see similar values reported there:

![Ambari CPU Usage chart](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab05/images/ambari-cpu-usage.png)

Let's set up an alert that fires when the % Processor Time exceeds about 75% of this peak. In our case, that's around 15% Processor Time.

## Set up an alert in Log Analytics

With the % Processor Time query results open in Log Analytics, create a new alert that sends you an email if the % Processor Time exceeds 75% of the highest average value you saw in the above query. Configure the alert to run every 5 minutes over a 5-minute time window, and to fire when the number of results is greater than 0.
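
The threshold arithmetic is simple enough to check by hand, but a short sketch makes the rule explicit. The averages below are made up, with a peak of 20% to match the example earlier in this lab; the function names are hypothetical.

```python
def alert_threshold(bin_averages, fraction=0.75):
    """Return the given fraction of the highest observed average."""
    return max(bin_averages) * fraction


def should_alert(current_average, threshold):
    """True when the latest average exceeds the threshold."""
    return current_average > threshold


# Made-up 5-minute averages of % Processor Time, peaking at 20%.
observed = [3.5, 4.0, 20.0, 6.5]

threshold = alert_threshold(observed)
print(threshold)
print(should_alert(18.0, threshold), should_alert(9.0, threshold))
```

Use the value your own query returns in place of the sample peak when you fill in the alert's threshold.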

### Execute long-running job once again to spike the CPU usage

Now let's submit the same long-running query (around 4 minutes) once again so we can cause a spike in the CPU usage, firing our new alert.

In [None]:
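# Same query as before: print up to five rows whose TransactionDate could not
# be parsed with the expected format (the derived "date" column is NULL).
pprint.pprint(
    df.select(
        "TransactionDate",
        date_format(
            unix_timestamp("TransactionDate", "M/d/yyyy h:mm:ss a").cast("timestamp"),
            "yyyy-MM-dd HH:mm:ss",
        ).alias("date"),
    ).where("date IS NULL").take(5)
)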

After a few minutes, you should receive an email alert similar to the following:

![CPU Spike Alert email](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab05/images/cpu-spike-alert-email.png)

If you don't receive an email alert in a few minutes, re-run the following Log Analytics query to see what your highest % Processor Time usage levels are. You can then go to Settings > Alerts in OMS and edit the alert accordingly:

`Perf | where CounterName == "% Processor Time" and InstanceName == "_Total" and (Computer startswith_cs "wn" and Computer contains_cs "-") and TimeGenerated >= ago(30minute) | summarize AggregatedValue = avg(CounterValue) by TimeGenerated, bin(TimeGenerated, 5min) | sort by TimeGenerated1 desc`

## Use Ambari to configure Spark settings

Spark can be configured via configuration files located on its head nodes, or through the Ambari UI. Making configuration changes through Ambari allows you to change settings for Spark and other services in one place, at the cluster level or individual application level. Application-level settings can be done in a Jupyter notebook. In this case, we are going to make optimal Spark configuration changes at the cluster level.

The three key parameters that can be used for Spark configuration, depending on application requirements, are `spark.executor.instances`, `spark.executor.cores`, and `spark.executor.memory`. An executor is a process launched for a Spark application. It runs on a worker node and is responsible for carrying out the tasks for the application. The default number of executors and the executor sizes for each cluster are calculated based on the number of worker nodes and the worker node size. This information is stored in spark-defaults.conf on the cluster head nodes.

Use Ambari to change Spark settings to the following values:

* **spark.executor.instances**: 5
* **spark.executor.cores**: 3
* **spark.executor.memory**: 4608m
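
As a quick sanity check on these values, the sketch below computes the total executor cores and executor memory they request. It ignores the per-executor memory overhead that YARN adds and the driver's own resources, so treat the totals as a lower bound. It also builds the JSON body for an application-level override, assuming the notebook uses the sparkmagic `%%configure` magic with the Livy field names `numExecutors`, `executorCores`, and `executorMemory`; the helper `parse_megabytes` is hypothetical.

```python
import json


def parse_megabytes(size):
    """Convert a Spark size string such as '4608m' or '4g' to megabytes."""
    units = {"m": 1, "g": 1024}
    return int(size[:-1]) * units[size[-1].lower()]


settings = {
    "spark.executor.instances": 5,
    "spark.executor.cores": 3,
    "spark.executor.memory": "4608m",
}

instances = settings["spark.executor.instances"]
total_cores = instances * settings["spark.executor.cores"]
total_memory_mb = instances * parse_megabytes(settings["spark.executor.memory"])
print(total_cores, total_memory_mb)

# Assumed application-level equivalent: paste this JSON under "%%configure -f".
configure_body = json.dumps(
    {
        "numExecutors": instances,
        "executorCores": settings["spark.executor.cores"],
        "executorMemory": settings["spark.executor.memory"],
    }
)
print(configure_body)
```

Five executors with three cores each request 15 cores in total, and 5 x 4608 MB comes to 23040 MB of executor memory before overhead.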

## Conclusion

In this lab, you learned how to configure and use Azure Log Analytics in OMS with your HDInsight Spark cluster and monitor its charts, run a log search and add an alert, track applications in the YARN UI and Spark UI, use the Spark History Server, and use Ambari to configure Spark settings.