In [4]:
%run "./Includes/Classroom-Setup"

### Automating ETL Workloads

Since recurring production jobs are the goal of ETL workloads, Spark needs a way to integrate with other automation and scheduling tools.  We also need to be able to run Python files and Scala/Java jars.

Several automation and scheduling tools are available, including the following:

* Command line tools integrated with the UNIX scheduler Cron
* The workflow scheduler Apache Airflow
* Amazon's trigger-based scheduler AWS Lambda

The gateway into job scheduling is programmatic access to Databricks, which can be achieved either through the REST API or the Databricks Command Line Interface (CLI).

### Access Tokens

Access tokens provide programmatic access to the Databricks CLI and REST API.  

To get started, generate an access token as follows:

1. Click the person icon in the upper-right corner of the screen.
2. Click **User Settings**.
3. Click **Access Tokens**.
4. Click **Generate New Token**.
5. Name your token.
6. Designate a lifespan (a shorter lifespan is generally better to minimize risk exposure).
7. Click **Generate**.
8. Copy your token.  You'll only be able to see it once.

Be sure to keep this token secure.  It grants the holder full programmatic access to Databricks, including the resources and data available to your Databricks environment.
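One way to reduce the risk of leaking a token is to keep it out of the notebook source.  The following is a minimal sketch, not part of the original lesson, that reads the deployment URL and token from environment-style settings.  The variable names `DATABRICKS_HOST` and `DATABRICKS_TOKEN` and the helper `load_credentials` are assumptions chosen for illustration.

```python
import os


def load_credentials(env=None):
    """Return (api_base_url, auth_header) built from environment-style settings.

    The variable names used here are assumptions for this sketch.
    """
    env = os.environ if env is None else env
    host = env.get("DATABRICKS_HOST", "").rstrip("/")
    token = env.get("DATABRICKS_TOKEN", "")
    if not host or not token:
        raise ValueError("Set DATABRICKS_HOST and DATABRICKS_TOKEN first")
    return host + "/api/2.0/", {"Authorization": "Bearer " + token}


# Example with a hypothetical deployment and a placeholder token
example_env = {
    "DATABRICKS_HOST": "https://my-deployment.cloud.databricks.com/",
    "DATABRICKS_TOKEN": "not-a-real-token",
}
example_domain, example_header = load_credentials(example_env)
print(example_domain)
```

Passing a plain dictionary, as in the example, keeps the helper easy to try without touching the real environment.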

Paste your token into the following cell along with the domain of your Databricks deployment (you can see this in the notebook's URL).  The deployment URL should look something like `https://my-deployment.cloud.databricks.com`.

In [10]:
# TODO

token = "FILL IN"
domain = "https://<DOMAIN NAME>.cloud.databricks.com" + "/api/2.0/"

header = {'Authorization': "Bearer "+ token}

Test that the connection works by listing all files in the root directory of DBFS.

In [12]:
import json
import requests

try:
  endPoint = domain + "dbfs/list?path=/"
  r = requests.get(endPoint, headers=header)

  # Print the paths: a bare expression inside a try block is not displayed
  print([i.get("path") for i in json.loads(r.text).get("files")])

except Exception as e:
  print(e)
  print("\n** Double check your previous settings **\n")
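When the token or domain is wrong, the response body will not contain a `files` list, so the list comprehension fails with an unhelpful message.  The sketch below shows one way to separate the two cases.  The helper name `extract_paths` and the sample bodies are hypothetical, and the `error_code`/`message` shape of an error response is an assumption to verify against your deployment.

```python
def extract_paths(payload):
    """Return the paths from a parsed `dbfs/list` response body."""
    # Assumed error shape: {"error_code": ..., "message": ...}
    if "error_code" in payload:
        raise RuntimeError(
            "{}: {}".format(payload["error_code"], payload.get("message", ""))
        )
    # An empty directory may omit the "files" key, so default to an empty list
    return [entry["path"] for entry in payload.get("files", [])]


# Hypothetical response bodies, for illustration only
ok_body = {
    "files": [
        {"path": "/FileStore", "is_dir": True, "file_size": 0},
        {"path": "/tmp", "is_dir": True, "file_size": 0},
    ]
}
error_body = {"error_code": "EXAMPLE_ERROR", "message": "placeholder message"}

print(extract_paths(ok_body))
```

With a live response, this could be called as `extract_paths(json.loads(r.text))`.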

### Scheduling with the REST API and CLI

Jobs can either be scheduled for running on a consistent basis or they can be run every time the API call is made.  Since there are many parameters in scheduling jobs, it's often best to schedule a job through the user interface, parse the configuration settings, and then run later jobs using the API.
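As a rough illustration of reusing a configuration created in the UI, the sketch below copies the settings of an existing job into a payload for a new one.  It assumes the job description returned by the API keeps its configuration under a `settings` key; the helper `payload_from_job`, the job ID, and the sample description are hypothetical.

```python
import copy
import json


def payload_from_job(job_description, new_name):
    """Build a job-creation payload from the settings of an existing job."""
    # Deep copy so the original description is left untouched
    settings = copy.deepcopy(job_description["settings"])
    settings["name"] = new_name
    return settings


# Hypothetical, trimmed job description for illustration only
existing_job = {
    "job_id": 1,
    "settings": {
        "name": "ui-job",
        "new_cluster": {
            "spark_version": "4.2.x-scala2.11",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
        "notebook_task": {
            "notebook_path": "/Shared/ETL-Part-3/Python/Runnable/Runnable-4"
        },
    },
}

new_payload = payload_from_job(existing_job, "api-copy")
print(json.dumps(new_payload, indent=2))
```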

Run the following cell to get a sense of what a basic job accomplishes.

In [15]:
path = dbutils.notebook.run("./Runnable/Runnable-4", 120, {"username": getUsername(), "ranBy": "NOTEBOOK"})
display(spark.read.parquet(path))

The notebook `Runnable-4` logs a timestamp and how the notebook was run.  We will use this log to track our jobs.

Schedule this notebook as a job with parameters by first navigating to the jobs panel on the left-hand side of the screen and creating a new job.  Customize the job as follows:

1. Give the job a name
2. Choose the notebook `Runnable-4` in the `Runnable` directory of this course
3. Add parameters for `username`, which is your Databricks login email (this gives you a unique path to save your data), and set `ranBy` as `JOB`
4. Choose a cluster of 2 workers and 1 driver (the default is too large for our needs).  **You can also choose to run a job against an already active cluster, reducing the time to spin up new resources.**
5. Click **Run now** to execute the job.


When the job completes, paste the `Run ID` that appears under completed runs into the cell below.

In [18]:
try:
  runId = "FILL_IN"
  endPoint = domain + "jobs/runs/get?run_id={}".format(runId)

  # Print the parsed response: a bare expression inside a try block is not displayed
  print(json.loads(requests.get(endPoint, headers=header).text))

except Exception as e:
  print(e)
  print("\n** Double check your runId and domain **\n")
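The full response is verbose, so it can help to pull out just the run status.  The following sketch assumes the response carries a `state` object with `life_cycle_state` and, once the run has ended, `result_state`.  The helper `summarize_run` and the sample response are hypothetical.

```python
def summarize_run(run):
    """Return a short status string for a parsed run description."""
    state = run.get("state", {})
    life_cycle = state.get("life_cycle_state", "UNKNOWN")
    result = state.get("result_state")
    # result_state is only expected once the run has ended
    if result is None:
        return "run {} is {}".format(run.get("run_id"), life_cycle)
    return "run {} finished with {}".format(run.get("run_id"), result)


# Hypothetical, trimmed response for illustration only
sample_run = {
    "run_id": 42,
    "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"},
}
print(summarize_run(sample_run))
```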

Now take a look at the table to see the update.

In [20]:
display(spark.read.parquet(path))

## Exercise 1: Create and Submit a Job using the REST API

Now that a job has been submitted through the UI, we can easily capture and re-run that job.  Re-run the job using the REST API and different parameters.

### Step 1: Create the `POST` Request Payload

To create a new job, communicate the specifications about the job using a `POST` request.  First, define the following variables:

* `name`: The name of your job
* `notebook_path`: The path to the notebook `Runnable-4`.  This is the `notebook_path` value listed in the response to the API call above.

In [25]:
# TODO
import json

name = "Lesson-04-Lab"
notebook_path = "/Shared/ETL-Part-3/Python/Runnable/Runnable-4"

data = {
  "name": name,
  "new_cluster": {
    "spark_version": "4.2.x-scala2.11",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    "spark_conf": {"spark.databricks.delta.preview.enabled": "true"}
  },
  "notebook_task": {
    "notebook_path": notebook_path,
    "base_parameters": {
      # getUsername() is the same helper used when running the notebook above
      "username": getUsername(), "ranBy": "REST-API"
    }
  }
}

data_str = json.dumps(data)
print(data_str)
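The payload above creates a job that runs only when triggered.  For a job that runs on a consistent basis, a schedule can be attached to the payload.  The sketch below assumes the Jobs API accepts a `schedule` object with `quartz_cron_expression` and `timezone_id` fields; check the field names against the API documentation for your deployment.  The helper `with_schedule` and the trimmed payload are hypothetical.

```python
import copy
import json


def with_schedule(job_payload, cron_expression, timezone_id="UTC"):
    """Return a copy of a job payload with a recurring schedule attached."""
    scheduled = copy.deepcopy(job_payload)
    scheduled["schedule"] = {
        "quartz_cron_expression": cron_expression,
        "timezone_id": timezone_id,
    }
    return scheduled


# Trimmed, hypothetical payload for illustration only
base_payload = {"name": "Lesson-04-Lab"}

# Quartz syntax: seconds, minutes, hours, day-of-month, month, day-of-week
nightly = with_schedule(base_payload, "0 0 2 * * ?")  # every day at 02:00
print(json.dumps(nightly, indent=2))
```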

### Step 2: Create the Job

Use the base `domain` defined above to create a URL for the REST endpoint `jobs/create`.  Then, submit a `POST` request using `data_str` as the payload.

In [27]:
# TODO
createEndPoint = domain + "jobs/create"
r = requests.post(createEndPoint, headers=header, data=data_str)

job_id = json.loads(r.text).get("job_id")
print(job_id)

### Step 3: Run the Job

Run the job using the `job_id` from above.  You'll need to submit a `POST` request to the `RunEndPoint` URL, which points to `jobs/run-now`.

In [29]:
# TODO
RunEndPoint = domain + "jobs/run-now"

data2 = {"job_id": job_id}
data2_str = json.dumps(data2)

r = requests.post(RunEndPoint, headers=header, data=data2_str)

r.text
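To re-run the same job with different parameters, the request body can carry overrides alongside the `job_id`.  The sketch below assumes the `jobs/run-now` endpoint accepts a `notebook_params` map of string keys to string values; the helper `run_now_payload` and the job ID are hypothetical.

```python
import json


def run_now_payload(job_id, **notebook_params):
    """Serialize a run request, optionally overriding notebook parameters."""
    payload = {"job_id": job_id}
    if notebook_params:
        # Parameter values are sent as strings
        payload["notebook_params"] = {
            key: str(value) for key, value in notebook_params.items()
        }
    return json.dumps(payload)


# Hypothetical job ID and parameter value, for illustration only
override_str = run_now_payload(7, ranBy="REST-API-OVERRIDE")
print(override_str)
```

The resulting string could be passed as `data=` in the same `requests.post` call used above.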

### Step 4: Confirm that the Job Ran

Confirm that the job ran by checking the Parquet file.  It can take a few minutes for the job to run and update this file.

In [31]:
display(spark.read.parquet(path))
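Rather than re-running the cell by hand until the new record appears, the run can be polled until it reaches a terminal state.  The sketch below is one possible approach, not the lesson's method.  It takes any callable that returns the run's state as a dictionary, so the API call itself is left out; the helper `wait_for_run`, the set of terminal states, and the stand-in states are assumptions for illustration.

```python
import time


def wait_for_run(fetch_state, poll_seconds=10, max_polls=30, sleep=time.sleep):
    """Poll until the run reaches a terminal state and return that state."""
    # Assumed terminal values of life_cycle_state
    terminal = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}
    for _ in range(max_polls):
        state = fetch_state()
        if state.get("life_cycle_state") in terminal:
            return state
        sleep(poll_seconds)
    raise TimeoutError("run did not finish within the polling budget")


# Stand-in for a real API call: yields a fixed sequence of states
fake_states = iter([
    {"life_cycle_state": "PENDING"},
    {"life_cycle_state": "RUNNING"},
    {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"},
])

# A no-op sleep keeps the example instant
final_state = wait_for_run(lambda: next(fake_states), sleep=lambda seconds: None)
print(final_state["result_state"])
```

In practice `fetch_state` would wrap a request to `jobs/runs/get` and return the `state` portion of the parsed response.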

In [32]:
# TEST - Run this cell to test your solution
from pyspark.sql.functions import col

APICounts = (spark.read.parquet(path)
  .filter(col("ranBy") == "REST-API")
  .count()
)

if APICounts > 0:
  print("Tests passed!")
else:
  print("Test failed, no records found")

## Review
**Question:** In what ways can you schedule jobs on Databricks?  
**Answer:** Jobs can be scheduled using the UI, REST API, or Databricks CLI.

**Question:** How can you gain programmatic access to Databricks?  
**Answer:** Generating a token will give programmatic access to most Databricks services.

In [35]:
%run "./Includes/Classroom-Cleanup"