# Use Case - Manage YARN Applications

Let us go through the design and implementation of simple Python based application to manage YARN applications.

## Overview of the session

Let us get an overview of what we are going to cover as part of this session.

* Define Problem Statement
* Explore Options
* YARN REST APIs
* Overview of JSON
* Setup Project
* Externalize Properties
* Extracting Application Details
* Manipulating Dates
* Planning for Deployment


## Define Problem Statement

Let us define problem statement around manaing YARN applications.
* In our labs we are consistenly running out of resources.
  * People run Spark Jobs using `spark-shell` or `pyspark`. They are supposed to run the code interactively.
  * However people just launch `spark-shell` or `pyspark` and forget to close the session.
  * Subsequent sessions are not getting launched and we would like to get notified about long running sessions so that our admin team can take corrective action.
* Once the framework is developed, we can additional features as well.
  * Number of sessions launched by user.
  * Resources used by users.
  * Resources used by long running sessions.
  * We can think of so many other use cases.

## Explore Options

There are several ways using which we can get the details of long running sessions.

* We can use command line utility called as `yarn` to get the details of the jobs.
* It have options such as `yarn application -list` and `yarn application -status` etc to get details about YARN applications.
* By default, `yarn application -list` will give us running jobs.
* We can pass criteria using `-appStates` to get applications belonging to different states.
* We can get additional information of an application using `yarn application -status` with application id.
* YARN also provides REST APIs to get the information about YARN Applications. You can get the information from their [official page](https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html).
* We will choose REST API approach as we can more features implemented as part of the application

## YARN REST APIs

Let us understand how to get the details about YARN Cluster, Applications etc.

* Getting Cluster Details - `curl http://rm01.itversity.com:19088/ws/v1/cluster`
* Getting Cluster Metrics - `curl http://rm01.itversity.com:19088/ws/v1/cluster/metrics`
* Getting App Statistics - `curl http://rm01.itversity.com:19088/ws/v1/cluster/appstatistics`
* Getting App Details - `curl http://rm01.itversity.com:19088/ws/v1/cluster/apps\?states\=running`

## Overview of JSON

Let us get an overview of JSON.

* JSON stands for Java Script Object Notation.
* Keys are unique.
* Values can be simple objects or array or nested JSON object etc.
* JSON is similar to Python Dict.
* JSON is extensively used for the communication between front end and back end of a web or mobile application.

```
{
   "appStatInfo":{
      "statItem":[
         {
            "state":"FINISHED",
            "type":"*",
            "count":8
         },
         {
            "state":"FAILED",
            "type":"*",
            "count":0
         },
         {
            "state":"NEW_SAVING",
            "type":"*",
            "count":0
         },
         {
            "state":"NEW",
            "type":"*",
            "count":0
         },
         {
            "state":"KILLED",
            "type":"*",
            "count":0
         },
         {
            "state":"ACCEPTED",
            "type":"*",
            "count":0
         },
         {
            "state":"RUNNING",
            "type":"*",
            "count":12
         },
         {
            "state":"SUBMITTED",
            "type":"*",
            "count":0
         }
      ]
   }
}
```

## Setup Project

Let us setup the project to develop the application.

* Typically we use PyCharm for the development of Python based applications.
* It is better to have virtual environment for a Python based applications.
* We need to install required libraries using pip - `pip install requests`.
* We can also install using PyCharm.

## Externalize Properties

Let us see how we can externalize the properties.

* Create a file by name `config.py`.
* Add this to the file.

In [None]:
BASE_URI = 'http://rm01.itversity.com:19088'

## Extracting Application Details

Let us develop a function to get extracting application details.

In [None]:
def get_running_jobs(uri):
    import requests
    running_apps = requests.get(uri).json()
    return running_apps

In [None]:
uri = '{BASE_URI}/ws/v1/cluster/apps?states={STATE}'
running_jobs = get_running_jobs(uri.format(BASE_URI=BASE_URI, STATE='running'))
running_jobs

In [None]:
running_jobs['apps']['app'][0]['state']

In [None]:
running_jobs['apps']['app'][0]['startedTime']

## Manipulating Dates

Let us understand how to manipulate date and time using Python.

* There are libraries such as `datetime` and `time`.
* We can get current time using `datetime.datetime.now()`

In [None]:
import datetime
datetime.datetime.now()

* We can format date using `strftime`

In [None]:
import datetime
datetime.datetime.now().strftime('%Y-%m-%d')

* Converting a string to a date by passing specified format.

In [None]:
datetime.datetime.strptime('2020-05-09', '%Y-%m-%d')

* Converting Unix Time to a date

In [None]:
datetime.datetime.fromtimestamp(1589065394)

* We can use `time` to get Unix timestamp of current time.

In [None]:
import time
time.time()

In [None]:
int(time.time())

## Computing YARN Application age

Let us understand how to get the age of YARN Applications.

* Let us first read the details of YARN Applications which are in running state.

In [50]:
BASE_URI = 'http://rm01.itversity.com:19088'

In [51]:
uri = '{BASE_URI}/ws/v1/cluster/apps?state={STATE}'

In [71]:
def get_yarn_app_list(uri):
    import requests
    apps = requests.get(uri).json()
    return apps['apps']['app']

In [72]:
app_list = get_yarn_app_list(uri.format(BASE_URI=BASE_URI, STATE='running'))

In [73]:
def get_age_in_seconds(started_time):
    import time
    return int(time.time()) - started_time

In [82]:
def get_app_details(apps):
    app_details = map(lambda app:
            (app['id'], get_age_in_seconds(int(app['startedTime']/1000))), apps)
    return app_details

In [88]:
app_details = get_app_details(app_list)

In [89]:
beyond_age = 1800

In [90]:
long_running_apps = filter(lambda app: app[1] > beyond_age, app_details)
list(long_running_apps)

[('application_1589064448439_0083', 2161),
 ('application_1589064448439_0084', 2022),
 ('application_1589064448439_0085', 1908),
 ('application_1589064448439_0068', 4765),
 ('application_1589064448439_0071', 4685),
 ('application_1589064448439_0072', 4594),
 ('application_1589064448439_0074', 3820),
 ('application_1589064448439_0075', 3694),
 ('application_1589064448439_0076', 3654),
 ('application_1589064448439_0080', 3042)]

In [41]:
def notify_long_running_apps(apps):
    if len(apps) > 0:
        for app in apps:
            print(app)

In [49]:
notify_long_running_apps(long_running_apps)

('application_1589064448439_0031', 3356)
('application_1589064448439_0034', 2999)
('application_1589064448439_0036', 2684)
('application_1589064448439_0042', 1986)
('application_1589064448439_0046', 1847)


## Planning for Deployment