---

## Implementation of Livy API

In [1]:
import sys

In [2]:
sys.path.insert(0, '../')

In [3]:
from livy_submit import livy_api

---
## Manual QA of Livy API

Configure the API object

In [4]:
server_url = 'ip-172-31-20-241.ec2.internal'

In [5]:
api = livy_api.LivyAPI(server_url=server_url, port=8998)
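
For orientation, the error output later in this notebook shows that batch requests go to `http://<server_url>:<port>/batches/<id>`. The sketch below shows how such a URL could be assembled. `batch_url` is a hypothetical helper for illustration only and is not part of `livy_submit`.

```python
def batch_url(server_url, port=8998, batch_id=None):
    """Build a Livy batch endpoint URL (hypothetical helper, not from livy_submit)."""
    base = f"http://{server_url}:{port}/batches"
    # Without a batch id, return the collection endpoint
    return base if batch_id is None else f"{base}/{batch_id}"


print(batch_url("ip-172-31-20-241.ec2.internal", batch_id=26))
```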

Upload files to HDFS

In [29]:
import os
from hdfs.ext.kerberos import KerberosClient

In [30]:
namenode_connection_url = f'http://{server_url}:50070'
run_file = os.path.abspath('pi.py')
hdfs_dir = '/user/edill/pi.py'
job_name = 'edill-pi'
spark_config = {}

In [31]:
run_file

'/home/ec2-user/notebooks/livy-submit-old/QA/pi.py'

In [34]:
# Create a Kerberos-authenticated WebHDFS client; it is used for the upload below
client = KerberosClient(namenode_connection_url)

List your HDFS home directory. Once the upload cell below has run, `pi.py` should appear in this listing.

In [35]:
client.list('/user/edill')

['.sparkStaging', 'banking-leads.parquet', 'banking_leads.csv', 'pi.py']

In [11]:
# Upload to the destination defined in hdfs_dir rather than an empty path
file_on_hdfs = client.upload(hdfs_path=hdfs_dir, local_path=run_file, overwrite=True)
file_on_hdfs

'/user/edill/pi.py'
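
The returned path is the user's HDFS home directory plus the local file name. As a rough sketch, the expected destination could be computed ahead of time like this. `hdfs_destination` is a hypothetical helper, and the `/user/<name>` home layout is an assumption taken from the paths in this notebook.

```python
import os
import posixpath


def hdfs_destination(local_path, user):
    """Guess where a file lands in the user's HDFS home (assumes /user/<name>)."""
    # HDFS paths always use forward slashes, hence posixpath for the join
    return posixpath.join("/user", user, os.path.basename(local_path))


print(hdfs_destination("/home/ec2-user/notebooks/livy-submit-old/QA/pi.py", "edill"))
```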

Start using the Livy Python API

List the batches the Livy server currently knows about. On a server with no batch jobs, the output looks like this:
```
(0, 0, {})
```

In [12]:
api.all_info()

(0, 0, {})
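
The three-element tuple appears to mirror the `from`, `total`, and `sessions` fields that Livy returns from `GET /batches`. The sketch below shows one way such a response body could be reduced to that shape. `summarize_batches` is a hypothetical name, and this is not necessarily how `livy_api` does it.

```python
def summarize_batches(payload):
    """Reduce a `GET /batches` JSON body to (from, total, {id: batch})."""
    # Index the batches by id so a single job can be looked up directly
    batches = {batch["id"]: batch for batch in payload.get("sessions", [])}
    return payload.get("from", 0), payload.get("total", 0), batches


empty_server = {"from": 0, "total": 0, "sessions": []}
print(summarize_batches(empty_server))  # (0, 0, {})
```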

Requesting a batch ID that does not exist should raise an `HTTPError` like this:
```
HTTPError: 404 Client Error: Not Found for url: http://ip-172-31-20-241.ec2.internal:8998/batches/26
```

In [15]:
api.info(26)

HTTPError: 404 Client Error: Not Found for url: http://ip-172-31-20-241.ec2.internal:8998/batches/26
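
In scripts, a missing batch is often easier to handle as a `None` result than as a stack trace. The sketch below wraps any lookup function and converts a 404 into `None`. It relies on the fact that `requests.HTTPError` carries the failed response on its `response` attribute; `info_or_none` itself is a hypothetical helper.

```python
def info_or_none(get_info, batch_id):
    """Return get_info(batch_id), or None if the server answers 404."""
    try:
        return get_info(batch_id)
    except Exception as exc:
        # requests.HTTPError exposes the failed response as `.response`
        response = getattr(exc, "response", None)
        if getattr(response, "status_code", None) == 404:
            return None
        raise  # anything else is a real problem, so re-raise it
```

With the objects from this notebook, `info_or_none(api.info, 26)` would be expected to return `None` instead of raising.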

---
## Submit

Submit your Livy job.

In [21]:
job1 = api.submit(file=file_on_hdfs, name=job_name)
job1

{'name': 'edill-pi', 'file': '/user/edill/pi.py'}


Batch(id=30, appId=None, appInfo={'driverLogUrl': None, 'sparkUiUrl': None}, log=[see self.log], state=starting)
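
The dictionary printed before the `Batch` object looks like the request body sent to Livy's `POST /batches` endpoint. The sketch below shows how such a body might be built so that empty options are left out. `build_batch_payload` is a hypothetical helper; `conf` is a field Livy accepts for Spark settings, but whether `api.submit` forwards it is an assumption.

```python
def build_batch_payload(file, name=None, conf=None):
    """Assemble a `POST /batches` body, dropping unset or empty fields."""
    payload = {"name": name, "file": file, "conf": conf}
    # Empty values (None, {}, "") are omitted rather than sent to the server
    return {key: value for key, value in payload.items() if value}


print(build_batch_payload("/user/edill/pi.py", name="edill-pi", conf={}))
# {'name': 'edill-pi', 'file': '/user/edill/pi.py'}
```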

Check on its info

In [22]:
api.info(job1.id)

Batch(id=30, appId=None, appInfo={'driverLogUrl': None, 'sparkUiUrl': None}, log=[see self.log], state=starting)
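
A freshly submitted batch reports `state=starting`, so the info call usually has to be repeated until the job finishes. The sketch below polls a state lookup until a terminal state is reached. `wait_for_batch` is a hypothetical helper, and the set of terminal states is an assumption based on the states Livy commonly reports.

```python
import time

# Assumed terminal states; adjust if your Livy version reports others
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def wait_for_batch(get_state, batch_id, poll_seconds=5.0, timeout=600.0,
                   sleep=time.sleep):
    """Poll get_state(batch_id) until it returns a terminal state."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state(batch_id)
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"batch {batch_id} still in state {state!r}")
        sleep(poll_seconds)
```

Assuming the `Batch` object exposes its state as a `state` attribute, as its repr suggests, a call could look like `wait_for_batch(lambda i: api.info(i).state, job1.id)`.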

Submit a second job so that `all_info()` has more than one batch to report.

In [23]:
job2 = api.submit(file=file_on_hdfs, name=job_name)
job2

{'name': 'edill-pi', 'file': '/user/edill/pi.py'}


Batch(id=31, appId=None, appInfo={'driverLogUrl': None, 'sparkUiUrl': None}, log=[see self.log], state=starting)

Check all available batch jobs. The two jobs submitted above (30 and 31) should be listed. The output below shows four batches because two earlier ones (28 and 29) were still registered on the server.

In [24]:
api.all_info()

(0,
 4,
 {28: Batch(id=28, appId=application_1544723249474_0018, appInfo={'driverLogUrl': None, 'sparkUiUrl': 'http://ip-172-31-20-241.ec2.internal:20888/proxy/application_1544723249474_0018/'}, log=[see self.log], state=success),
  29: Batch(id=29, appId=None, appInfo={'driverLogUrl': None, 'sparkUiUrl': None}, log=[see self.log], state=starting),
  30: Batch(id=30, appId=None, appInfo={'driverLogUrl': None, 'sparkUiUrl': None}, log=[see self.log], state=starting),
  31: Batch(id=31, appId=None, appInfo={'driverLogUrl': None, 'sparkUiUrl': None}, log=[see self.log], state=starting)})

---
Look at the logs for the first job. Make sure you see a log line like this one (the estimate may differ slightly between runs):
```
  'Pi is roughly 3.139560',
```

In [25]:
api.logs(job1.id)

(30,
 42,
 142,
 ['18/12/17 14:17:41 INFO Client: Submitting application application_1544723249474_0020 to ResourceManager',
  '18/12/17 14:17:41 INFO YarnClientImpl: Submitted application application_1544723249474_0020',
  '18/12/17 14:17:41 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1544723249474_0020 and attemptId None',
  '18/12/17 14:17:42 INFO Client: Application report for application_1544723249474_0020 (state: ACCEPTED)',
  '18/12/17 14:17:42 INFO Client: ',
  '\t client token: Token { kind: YARN_CLIENT_TOKEN, service:  }',
  '\t diagnostics: AM container is launched, waiting for AM container to Register with RM',
  '\t ApplicationMaster host: N/A',
  '\t ApplicationMaster RPC port: -1',
  '\t queue: default',
  '\t start time: 1545056261402',
  '\t final status: UNDEFINED',
  '\t tracking URL: http://ip-172-31-20-241.ec2.internal:20888/proxy/application_1544723249474_0020/',
  '\t user: edill',
  '18/12/17 14:17:43 INFO Client: Appli

---
Look at the logs for the second job. Make sure you see a log line like this one (the estimate may differ slightly between runs):
```
  'Pi is roughly 3.139560',
```

In [26]:
api.logs(job2.id)

(31,
 42,
 142,
 ['18/12/17 14:17:48 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(livy, edill); groups with view permissions: Set(); users  with modify permissions: Set(livy, edill); groups with modify permissions: Set()',
  '18/12/17 14:17:48 INFO Client: Submitting application application_1544723249474_0021 to ResourceManager',
  '18/12/17 14:17:48 INFO YarnClientImpl: Submitted application application_1544723249474_0021',
  '18/12/17 14:17:48 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1544723249474_0021 and attemptId None',
  '18/12/17 14:17:49 INFO Client: Application report for application_1544723249474_0021 (state: ACCEPTED)',
  '18/12/17 14:17:49 INFO Client: ',
  '\t client token: Token { kind: YARN_CLIENT_TOKEN, service:  }',
  '\t diagnostics: AM container is launched, waiting for AM container to Register with RM',
  '\t ApplicationMaster host: N/A',
  '\t Applic
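
Scanning a long log by eye for the result line is error-prone. The sketch below searches a list of log lines for the `Pi is roughly` message and returns the estimate. `find_pi_estimate` is a hypothetical helper; it assumes the log lines are the last element of the four-element tuple shown above.

```python
def find_pi_estimate(log_lines):
    """Return the value from the first 'Pi is roughly ...' line, or None."""
    marker = "Pi is roughly "
    for line in log_lines:
        if marker in line:
            # Keep only the text after the marker and parse it as a number
            return float(line.split(marker, 1)[1].strip())
    return None


sample = ["18/12/17 14:17:49 INFO Client: ", "Pi is roughly 3.139560"]
print(find_pi_estimate(sample))  # 3.13956
```

With the notebook's objects this might be called as `find_pi_estimate(api.logs(job2.id)[-1])`.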

---
Now, let's start a third job so we can kill it

In [27]:
job3 = api.submit(file=file_on_hdfs, name=job_name)
job3

{'name': 'edill-pi', 'file': '/user/edill/pi.py'}


Batch(id=32, appId=None, appInfo={'driverLogUrl': None, 'sparkUiUrl': None}, log=[see self.log], state=starting)

And let's kill it

In [28]:
api.kill(job3.id)

{'msg': 'deleted'}

If the output above is

```
{'msg': 'deleted'}
```

then the kill request succeeded and the API is working as expected.
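
For an automated version of this last check, the response can be compared against the expected message. This is a minimal sketch; `kill_succeeded` is a hypothetical helper, and the expected body is taken from the output shown in this notebook.

```python
def kill_succeeded(response):
    """True if a kill response matches the body shown in this notebook."""
    return isinstance(response, dict) and response.get("msg") == "deleted"


print(kill_succeeded({"msg": "deleted"}))  # True
```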