# Spark Development and Execution life cycle

This section explains the Spark execution life cycle and the terms that come with it, such as executor, executor tasks, and driver program. It covers the following topics:

* Develop Spark Application – Get Monthly Product Revenue
* Build and Deploy
* Local mode vs. YARN mode
* Quick walk through of Spark UI
* YARN deployment modes
* Spark Execution Cycle

### Develop Spark Application – Get Monthly Product Revenue

Let us start with the problem statement, then the design, and finally the implementation.

**Problem Statement**

Using the retail_db dataset, we need to compute monthly product revenue for a given month.

* We need to consider only completed and closed orders to compute revenue.
* We need to consider only those transactions that fall in the month passed as an argument.
* We need to sort the data in descending order by revenue while saving the output to an HDFS path.

**Design** 

Let us see the design for the given Problem Statement.

* Filter for orders which fall in the month passed as the argument
* Join filtered orders and order_items to get order_item details for a given month
* Get revenue for each product_id
* We need to read products from the local file system
* Convert into RDD and extract product_id and product_name
* Join it with aggregated order_items (product_id, revenue)
* Get product_name and revenue for each product
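
As a rough sketch of the last four design steps, the following uses plain Scala collections in place of RDDs so that it runs without Spark. The object name `ProductJoinSketch`, the sample records, and the assumption that product_name is the third comma-separated field of a products record are illustrative and not taken from the original.

```scala
object ProductJoinSketch {
  // Hypothetical product records, assumed layout: product_id,category_id,product_name,...
  val productLines: List[String] = List(
    "1,2,Football,,59.98,img1",
    "2,2,Helmet,,129.99,img2",
    "3,3,Cleats,,89.99,img3"
  )

  // Hypothetical aggregated revenue per product_id (the output of the revenue step).
  val revenueByProductId: Map[Int, Float] = Map(1 -> 300.0f, 3 -> 450.5f)

  // Extract (product_id, product_name) from each record.
  val productNames: Map[Int, String] = productLines.map { line =>
    val p = line.split(",")
    (p(0).toInt, p(2))
  }.toMap

  // Inner join on product_id, then sort by revenue in descending order.
  val productRevenue: List[(String, Float)] =
    revenueByProductId.toList
      .flatMap { case (id, revenue) => productNames.get(id).map(name => (name, revenue)) }
      .sortBy { case (_, revenue) => -revenue }
}
```

In the Spark job, the design calls for converting the products into an RDD and joining it with the aggregated revenue; this sketch only illustrates the record shapes involved.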

**Development**

Let us create a new project and develop the logic.

* Setup Project
    * Scala Version: 2.11.8 (on windows, latest of 2.11 in other environments)
    * sbt Version: 0.13.x
    * JDK: 1.8.x
    * Project Name: SparkDemo
* Update build.sbt
    * typesafe config
    * Spark Core
* Update application.properties
* Develop logic to compute revenue per product for a given month.
* Once the project is set up, we can launch the Scala REPL with Spark and typesafe config dependencies using **sbt console**
* Once the logic works, we can move it into a program called **GetMonthlyProductRevenue**
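
A minimal sketch of what the two configuration files might look like is given below. The project name and the Scala version come from the list above; the library versions (Spark Core 2.3.0, typesafe config 1.3.2) and the project version are assumptions, so substitute whatever matches your environment.

```scala
// build.sbt (sbt 0.13.x syntax); library versions are assumed, not from the original
name := "SparkDemo"
version := "0.1"
scalaVersion := "2.11.8"

libraryDependencies += "com.typesafe" % "config" % "1.3.2"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
```

The property keys below are the ones read by the code in this section (`execution.mode`, `input.base.dir`, `output.base.dir` under an environment prefix such as `devu`). The values are placeholders. The code appends names such as `orders` directly to the base directories, so they should end with a slash.

```
# src/main/resources/application.properties (placeholder values)
devu.execution.mode = local
devu.input.base.dir = /path/to/retail_db/
devu.output.base.dir = /path/to/output/
```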

**USING SBT CONSOLE**

As part of this topic, we will see how to access sbt console and use it for exploring Spark based APIs.

* Go to the working directory of the project.
* Run sbt console
* We should be able to use typesafe config APIs as well as Spark APIs.
* Create Spark Conf and Spark Context objects

```scala
import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkConf, SparkContext}

// Load application.properties from the classpath
val props = ConfigFactory.load()

// The video passes "dev". The code was later changed to stay
// compatible with Windows, so make sure to pass "devu" on Ubuntu.
val envProps = props.getConfig("devu")

val inputPath = envProps.getString("input.base.dir")
val outputPath = envProps.getString("output.base.dir") + "monthly_product_revenue"
val month = "2014-01"

val conf = new SparkConf().
  setAppName("Revenue Per Product for " + month).
  setMaster(envProps.getString("execution.mode"))
val sc = new SparkContext(conf)
```

**HADOOP CONFIGURATION**

Let us see how we can access Hadoop Configuration.

* Spark uses HDFS APIs to read files from supported file systems.
* As part of Spark dependencies, we get HDFS APIs as well.
* We can get Hadoop Configuration using sc.hadoopConfiguration
* Using it, we can create a FileSystem object, which exposes APIs such as exists and delete.
* We can use those to validate as well as manage input and/or output directories.

```scala
// Make sure to run the earlier code to create the Spark Context
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)

if (!fs.exists(new Path(inputPath))) {
  println("Input path does not exist")
} else {
  // Remove any previous output (recursively) so the job can write to the path again
  if (fs.exists(new Path(outputPath)))
    fs.delete(new Path(outputPath), true)
}
```

**READ AND FILTER ORDERS**

Now that the Spark Context is available, let us read and manipulate data from orders.

* Read data from orders
* Use filter to keep only orders whose status is COMPLETE or CLOSED and whose order date falls in the given month
* Use map to extract order_id along with a hard-coded value 1, so that the result can be joined later.

```scala
// Filter for orders which fall in the month passed as argument
val orders = inputPath + "orders"

val ordersFiltered = sc.textFile(orders).
  filter(order => {
    order.split(",")(1).contains(month) &&
      List("COMPLETE", "CLOSED").contains(order.split(",")(3))
  }).
  map(order => (order.split(",")(0).toInt, 1))
```
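
To see what the filter keeps and drops, here is a small sketch that applies the same predicate to in-memory records using plain Scala collections instead of an RDD. The object name `OrderFilterSketch`, the helper `isRevenueOrder`, and the sample records are hypothetical; the field positions (order date at index 1, status at index 3) follow the code above.

```scala
object OrderFilterSketch {
  // Statuses that count toward revenue, per the problem statement.
  val revenueStatuses: Set[String] = Set("COMPLETE", "CLOSED")

  // True when the order date contains the given month (yyyy-MM)
  // and the order status is COMPLETE or CLOSED.
  def isRevenueOrder(order: String, month: String): Boolean = {
    val fields = order.split(",")
    fields(1).contains(month) && revenueStatuses.contains(fields(3))
  }

  // Hypothetical sample records: order_id,order_date,customer_id,order_status
  val sampleOrders: List[String] = List(
    "1,2014-01-05 00:00:00.0,11599,CLOSED",
    "2,2014-01-07 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-12-30 00:00:00.0,12111,COMPLETE",
    "4,2014-01-15 00:00:00.0,8827,COMPLETE"
  )

  // Same shape as ordersFiltered: (order_id, 1)
  val filtered: List[(Int, Int)] =
    sampleOrders
      .filter(order => isRevenueOrder(order, "2014-01"))
      .map(order => (order.split(",")(0).toInt, 1))
}
```

Order 2 is dropped because of its status and order 3 because of its date. The helper splits each record once rather than twice, which is a small design choice and not a change in behavior.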

**JOIN ORDERS AND ORDER ITEMS**

Now let us join order_items with orders and get product_id and order_item_subtotal.

* Read data from order_items
* Extract order_id, product_id and order_item_subtotal as a tuple.
* The first element is order_id and the second element is a nested tuple that contains product_id and order_item_subtotal.
* Join this data set with the filtered orders using order_id as the key.
* It will generate RDD of tuples – **(order_id, ((product_id, order_item_subtotal), 1))**

```scala
// Join filtered orders and order_items to get order_item details for a given month.
// At this stage the RDD holds (order_id, ((product_id, order_item_subtotal), 1));
// the revenue per product_id is aggregated in the next step.
val orderItems = inputPath + "order_items"

val revenueByProductId = sc.textFile(orderItems).
  map(orderItem => {
    val oi = orderItem.split(",")
    (oi(1).toInt, (oi(2).toInt, oi(4).toFloat))
  }).
  join(ordersFiltered)
```
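
The nested tuple produced by the join can be hard to picture. The sketch below reproduces the same shape with plain Scala collections and a for-comprehension standing in for the RDD join. The object name `JoinShapeSketch` and all sample values are hypothetical.

```scala
object JoinShapeSketch {
  // Hypothetical (order_id, (product_id, order_item_subtotal)) pairs.
  val orderItems: List[(Int, (Int, Float))] = List(
    (1, (957, 299.98f)),
    (2, (1073, 199.99f)),
    (4, (957, 299.98f)),
    (4, (365, 59.99f))
  )

  // Hypothetical filtered orders: (order_id, 1).
  val ordersFiltered: List[(Int, Int)] = List((1, 1), (4, 1))

  // Inner join on order_id, producing (order_id, ((product_id, subtotal), 1)).
  val joined: List[(Int, ((Int, Float), Int))] =
    for {
      (orderId, item) <- orderItems
      (filteredId, flag) <- ordersFiltered
      if orderId == filteredId
    } yield (orderId, (item, flag))
}
```

The item for order 2 disappears because order 2 is not among the filtered orders, while both items of order 4 survive.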

**COMPUTE REVENUE PER PRODUCT ID**

Now we can extract product_id and order_item_subtotal and compute revenue for each product_id.

* We can discard order_id and 1 from the join output.
* We can use map and get the required information – product_id and order_item_subtotal.
* Using reduceByKey, we should be able to compute revenue for each product_id.

```scala
// Join filtered orders and order_items to get order_item details for a given month
// Get revenue for each product_id
val orderItems = inputPath + "order_items"

val revenueByProductId = sc.textFile(orderItems).
  map(orderItem => {
    val oi = orderItem.split(",")
    (oi(1).toInt, (oi(2).toInt, oi(4).toFloat))
  }).
  join(ordersFiltered).
  map(rec => rec._2._1).
  reduceByKey(_ + _)
```
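
To make the last two transformations concrete, the following sketch applies the same map step to a few hypothetical joined records and then sums the subtotals per product_id. It uses plain Scala collections, with `groupBy` plus `sum` as a stand-in for `reduceByKey(_ + _)`; the object name `RevenueSketch` and the sample values are illustrative only.

```scala
object RevenueSketch {
  // Hypothetical join output: (order_id, ((product_id, order_item_subtotal), 1)).
  val joined: List[(Int, ((Int, Float), Int))] = List(
    (1, ((957, 300.0f), 1)),
    (4, ((957, 300.0f), 1)),
    (4, ((365, 60.0f), 1))
  )

  // Drop order_id and the placeholder 1, keeping (product_id, order_item_subtotal).
  val productSubtotals: List[(Int, Float)] = joined.map(rec => rec._2._1)

  // Sum the subtotals per product_id (collection analogue of reduceByKey(_ + _)).
  val revenueByProductId: Map[Int, Float] =
    productSubtotals
      .groupBy { case (productId, _) => productId }
      .map { case (productId, recs) => (productId, recs.map(_._2).sum) }
}
```

Product 957 appears in two orders, so its two subtotals are added together, while product 365 keeps its single subtotal.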