**Warning:** To interact with the EDI Big Data Stack you must first authenticate with the `kinit` command. For more information, read the documentation at [Authenticating with Kerberos](https://docs.edincubator.eu/big-data-stack/basic-concepts.html#authenticating-with-kerberos).

In [None]:
kinit -kt ~/work/$JUPYTERHUB_USER.service.keytab $JUPYTERHUB_USER@EDINCUBATOR.EU
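
As a quick sanity check after authenticating, the sketch below tests whether a valid ticket is present. It assumes the standard `klist` client is installed alongside `kinit`; the `ticket_status` variable is a hypothetical name used only for illustration.

```shell
# klist -s prints nothing and only sets the exit status:
# 0 when a valid (non-expired) ticket is in the credentials cache, non-zero otherwise.
if klist -s 2>/dev/null; then
  ticket_status="valid Kerberos ticket found"
else
  ticket_status="no valid Kerberos ticket; run kinit again"
fi
echo "$ticket_status"
```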

# Oozie

[Apache Oozie](http://oozie.apache.org/) is a workflow scheduler for Hadoop. Oozie allows you to define workflows, coordinators and bundles:

* **Workflow:** A sequence of actions, written in XML. Actions can be MapReduce, Hive or Pig jobs, among others.

* **Coordinator:** A program that triggers actions (commonly workflow jobs) when a set of conditions is met. Conditions can be a time frequency or other external events.

* **Bundle:** A higher-level Oozie abstraction that batches a set of coordinator jobs. You can also specify the time at which a bundle job starts.

In this tutorial we explain how to create and execute an Oozie workflow. The workflow launches the Pig job presented in [Pig](https://docs.edincubator.eu/big-data-stack/tools/pig.html#pig) and then reads the generated results with a Spark job.

**Note**: You can design and run Oozie jobs easily using the [Workflow Manager View](https://docs.edincubator.eu/big-data-stack/tools/views.html#workflow).

## Oozie Workflow

Oozie workflows, coordinators and bundles are defined in XML files. You can find the following example in the `stack-examples/oozieexample/workflow.xml` file of the stack-examples repository:

In [None]:
cd ~/work/examples/oozieexample/

This workflow defines three actions:

* The first action is a file system action (fs) that clears the output paths to avoid errors later on.

* The second one defines the Pig script for aggregating data (*stack-examples/pigexample/yelp_business.pig*).

* The third one defines the Spark job for filtering data (*stack-examples/oozieexample/spark.py*).

### File System action

This action clears paths used by other tasks as output to avoid errors.

```xml
<action name="fs_1">
  <fs>
    <name-node>${nameNode}</name-node>
    <delete path="/user/${user}/${examplesRoot}/pig-output"></delete>
    <delete path="/user/${user}/${examplesRoot}/spark-oozie-output"></delete>
  </fs>
  <ok to="pig_1"/>
  <error to="pig_1"/>
</action>
```

As can be seen, every action has certain XML nodes and attributes:

* **action:** represents the action to be defined. It must be named using the `name` attribute.
* **type:** the type of action, in this case `fs`.
* **ok and error:** the transitions followed after a successful or a failed result. Their `to` attribute names the next node. Here both point to `pig_1`, so the workflow continues even if the delete fails.

In addition to these properties and those specific to each action type, if the action needs to interact with other components, such as the HDFS NameNode or the YARN ResourceManager (declared through the `job-tracker` element), they must be declared too.

### Pig action

This action executes a Pig script.

```xml
<action name="pig_1">
  <pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <script>/user/${user}/${examplesRoot}/yelp_business.pig</script>
    <argument>-param</argument>
    <argument>output_dir=/user/${user}/${examplesRoot}/pig-output</argument>
  </pig>
  <ok to="spark_1"/>
  <error to="kill"/>
</action>
```

### Spark action

This action executes a Spark script.

```xml
<action name="spark_1">
  <spark xmlns="uri:oozie:spark-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <master>yarn-cluster</master>
    <name>${user}SparkOozieTest</name>
    <jar>${nameNode}/user/${user}/${examplesRoot}/spark.py</jar>
    <arg>--app_name=${user}SparkOozieExample</arg>
    <arg>--username=${user}</arg>
    <arg>--example_dir=${examplesRoot}</arg>
  </spark>
  <ok to="end"/>
  <error to="kill"/>
</action>
```

In addition to the action, you must declare the following global configuration properties:

```xml
<global>
  <configuration>
    <property>
      <name>oozie.use.system.libpath</name>
      <value>true</value>
    </property>
    <property>
      <name>oozie.action.sharelib.for.spark</name>
      <value>spark2</value>
    </property>
  </configuration>
</global>
```

## Oozie Job Properties

In addition to the *workflow.xml* file, the *job.properties* file declares the parameters and variables used by the Oozie job. Replace `<username>` with your own user name:

```data
nameNode=hdfs://master.edincubator.eu:8020
jobTracker=master.edincubator.eu:8050
master=yarn-cluster
examplesRoot=oozie-example
user=<username>
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/
```
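
Instead of editing the placeholder by hand, the file could also be generated from the shell. The following is a minimal sketch, assuming that the `USERNAME` variable used by the upload commands in this tutorial holds your user name (it falls back to `whoami` otherwise). Note that it overwrites any `job.properties` in the current directory.

```shell
# Assumption: USERNAME holds your EDI user name; fall back to the local login name.
USERNAME="${USERNAME:-$(whoami)}"

# Unquoted heredoc: ${USERNAME} is expanded by the shell now, while the escaped
# \${...} references are written literally so that Oozie resolves them at submission.
cat > job.properties <<EOF
nameNode=hdfs://master.edincubator.eu:8020
jobTracker=master.edincubator.eu:8050
master=yarn-cluster
examplesRoot=oozie-example
user=${USERNAME}
oozie.use.system.libpath=true
oozie.wf.application.path=\${nameNode}/user/\${user.name}/\${examplesRoot}/
EOF

# Show the generated file for review before submitting the job.
cat job.properties
```

Only the `user` line is filled in by the shell; every other value is copied unchanged from the listing above.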

## Executing the workflow

To execute the workflow, follow these steps:

In [None]:
cd ~/work/examples/oozieexample
hdfs dfs -mkdir /user/$USERNAME/oozie-example
hdfs dfs -put workflow.xml /user/$USERNAME/oozie-example
hdfs dfs -put ../pigexample/yelp_business.pig /user/$USERNAME/oozie-example
hdfs dfs -put spark.py /user/$USERNAME/oozie-example
oozie job -oozie http://master.edincubator.eu:11000/oozie -config job.properties -run

You can check the status of the job using the *oozie jobs* command:

In [None]:
oozie jobs -oozie http://master.edincubator.eu:11000/oozie

You can check the logs of a job using the *oozie job -log* command:

In [None]:
oozie job -oozie http://master.edincubator.eu:11000/oozie -log <oozie_job_id>
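
The `<oozie_job_id>` placeholder is the identifier that the Oozie CLI prints as a `job: <id>` line when the job is submitted. The sketch below shows one way to capture it in a variable. `extract_job_id` is a hypothetical helper written for this example, not part of Oozie, and the sample identifier is made up.

```shell
# Hypothetical helper: reads the output of `oozie job ... -run` on stdin and
# prints only the identifier found on the "job: <id>" line.
extract_job_id() {
  sed -n 's/^job: *//p'
}

# Demonstration on a sample line in the format printed by the Oozie CLI
# (the identifier itself is invented for illustration).
sample_output='job: 0000001-190101000000000-oozie-oozi-W'
JOB_ID=$(printf '%s\n' "$sample_output" | extract_job_id)
echo "$JOB_ID"
```

On the cluster, the same helper could be fed from the real submission, for example `JOB_ID=$(oozie job -oozie http://master.edincubator.eu:11000/oozie -config job.properties -run | extract_job_id)`, and the variable then passed to `oozie job -oozie http://master.edincubator.eu:11000/oozie -log "$JOB_ID"` or to `-info "$JOB_ID"` for a status summary.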

When the Oozie job finishes, you can check its results at `/user/<username>/oozie-example/spark-oozie-output`.