### Section 10: Apache Spark 2.x - Processing Data using Data Frames - Basic Transformations

#### Define Problem Statement - Get Daily Product Revenue

<p>Here is the problem statement. We will explore the Data Frame APIs to arrive at the final solution.</p>
<ul>
<li>Get daily product revenue</li>
<li>orders – order_id, order_date, order_customer_id, order_status</li>
<li>order_items – order_item_id, order_item_order_id, order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price</li>
<li>Data is comma separated</li>
<li>We will fetch data using spark.read.csv</li>
<li>Apply type cast functions to convert fields to their original types wherever applicable.</li>
</ul>
<img src="Retail_DB_ER_Diagram.png" />
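
To make the target concrete before touching Spark, here is a minimal plain-Python sketch of what "daily product revenue" could mean: match each order item to its order, then sum order_item_subtotal per order date and product id. The rows are copied from the sample output shown later in this section. The function name daily_product_revenue is hypothetical, and no filtering on order status is applied because the problem statement above does not specify one.

```python
from collections import defaultdict

# A few rows copied from the sample output shown later in this section.
# orders: (order_id, order_date, order_customer_id, order_status)
orders = [
    (1, "2013-07-25 00:00:00.0", 11599, "CLOSED"),
    (2, "2013-07-25 00:00:00.0", 256, "PENDING_PAYMENT"),
    (4, "2013-07-25 00:00:00.0", 8827, "CLOSED"),
]
# order_items: (item_id, order_id, product_id, quantity, subtotal, product_price)
order_items = [
    (1, 1, 957, 1, 299.98, 299.98),
    (2, 2, 1073, 1, 199.99, 199.99),
    (3, 2, 502, 5, 250.0, 50.0),
    (4, 2, 403, 1, 129.99, 129.99),
    (5, 4, 897, 2, 49.98, 24.99),
]

def daily_product_revenue(orders, order_items):
    """Sum order_item_subtotal per (order_date, product_id). Hypothetical helper."""
    date_by_order = {o[0]: o[1] for o in orders}
    revenue = defaultdict(float)
    for item in order_items:
        order_date = date_by_order.get(item[1])
        if order_date is None:  # item without a matching order: skip it
            continue
        revenue[(order_date, item[2])] += item[4]
    return {key: round(total, 2) for key, total in revenue.items()}

result = daily_product_revenue(orders, order_items)
for (order_date, product_id), total in sorted(result.items()):
    print(order_date, product_id, total)
```

The rest of this section builds up the Data Frame operations (reading, projecting, filtering, joining) needed to express the same computation in Spark.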

#### Read Data

##### Method1 - spark.read.format()

In [1]:
orders=spark.read.format('csv').schema('order_id int,order_date string,customer_id int,order_status string'). \
                               load('/user/pi/retail_db/orders')

In [2]:
orders.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



In [3]:
orders.show(10,False)

+--------+---------------------+-----------+---------------+
|order_id|order_date           |customer_id|order_status   |
+--------+---------------------+-----------+---------------+
|1       |2013-07-25 00:00:00.0|11599      |CLOSED         |
|2       |2013-07-25 00:00:00.0|256        |PENDING_PAYMENT|
|3       |2013-07-25 00:00:00.0|12111      |COMPLETE       |
|4       |2013-07-25 00:00:00.0|8827       |CLOSED         |
|5       |2013-07-25 00:00:00.0|11318      |COMPLETE       |
|6       |2013-07-25 00:00:00.0|7130       |COMPLETE       |
|7       |2013-07-25 00:00:00.0|4530       |COMPLETE       |
|8       |2013-07-25 00:00:00.0|2911       |PROCESSING     |
|9       |2013-07-25 00:00:00.0|5657       |PENDING_PAYMENT|
|10      |2013-07-25 00:00:00.0|5648       |PENDING_PAYMENT|
+--------+---------------------+-----------+---------------+
only showing top 10 rows



In [4]:
orderItems = spark.read.format('csv'). \
schema('order_item_id int,order_item_order_id int,order_item_product_id int,order_item_quantity_id int,order_item_subtotal float,order_item_product_price float'). \
load('/user/pi/retail_db/order_items')

In [5]:
orderItems.printSchema()

root
 |-- order_item_id: integer (nullable = true)
 |-- order_item_order_id: integer (nullable = true)
 |-- order_item_product_id: integer (nullable = true)
 |-- order_item_quantity_id: integer (nullable = true)
 |-- order_item_subtotal: float (nullable = true)
 |-- order_item_product_price: float (nullable = true)



In [6]:
orderItems.show(10,False)

+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity_id|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|1            |1                  |957                  |1                     |299.98             |299.98                  |
|2            |2                  |1073                 |1                     |199.99             |199.99                  |
|3            |2                  |502                  |5                     |250.0              |50.0                    |
|4            |2                  |403                  |1                     |129.99             |129.99                  |
|5            |4                  |897                  |2                     |49.98              |24.99             

##### Method2 - spark.read.csv()

In [7]:
orders=spark.read.csv('/user/pi/retail_db/orders',schema='order_id int,order_date string,customer_id int,order_status string')

In [8]:
orders.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



In [9]:
orders.show(10,False) 

+--------+---------------------+-----------+---------------+
|order_id|order_date           |customer_id|order_status   |
+--------+---------------------+-----------+---------------+
|1       |2013-07-25 00:00:00.0|11599      |CLOSED         |
|2       |2013-07-25 00:00:00.0|256        |PENDING_PAYMENT|
|3       |2013-07-25 00:00:00.0|12111      |COMPLETE       |
|4       |2013-07-25 00:00:00.0|8827       |CLOSED         |
|5       |2013-07-25 00:00:00.0|11318      |COMPLETE       |
|6       |2013-07-25 00:00:00.0|7130       |COMPLETE       |
|7       |2013-07-25 00:00:00.0|4530       |COMPLETE       |
|8       |2013-07-25 00:00:00.0|2911       |PROCESSING     |
|9       |2013-07-25 00:00:00.0|5657       |PENDING_PAYMENT|
|10      |2013-07-25 00:00:00.0|5648       |PENDING_PAYMENT|
+--------+---------------------+-----------+---------------+
only showing top 10 rows



In [10]:
orderItems = spark.read.csv('/user/pi/retail_db/order_items','order_item_id int,order_item_order_id int,order_item_product_id int,order_item_quantity_id int,order_item_subtotal float,order_item_product_price float')

In [11]:
orderItems.printSchema()

root
 |-- order_item_id: integer (nullable = true)
 |-- order_item_order_id: integer (nullable = true)
 |-- order_item_product_id: integer (nullable = true)
 |-- order_item_quantity_id: integer (nullable = true)
 |-- order_item_subtotal: float (nullable = true)
 |-- order_item_product_price: float (nullable = true)



In [12]:
orderItems.show(10,False)

+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity_id|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|1            |1                  |957                  |1                     |299.98             |299.98                  |
|2            |2                  |1073                 |1                     |199.99             |199.99                  |
|3            |2                  |502                  |5                     |250.0              |50.0                    |
|4            |2                  |403                  |1                     |129.99             |129.99                  |
|5            |4                  |897                  |2                     |49.98              |24.99             

##### Method3 - spark.read.csv() - toDF/withColumn

In [13]:
ordersCsv=spark.read.csv('/user/pi/retail_db/orders').toDF('order_id','order_date','customer_id','order_status')

In [14]:
ordersCsv.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- order_date: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- order_status: string (nullable = true)



In [15]:
from pyspark.sql.types import IntegerType
orders=ordersCsv.withColumn('order_id',ordersCsv.order_id.cast(IntegerType())) \
                .withColumn('customer_id',ordersCsv.customer_id.cast(IntegerType()))

In [16]:
orders.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



In [17]:
orders.show(10,False) 

+--------+---------------------+-----------+---------------+
|order_id|order_date           |customer_id|order_status   |
+--------+---------------------+-----------+---------------+
|1       |2013-07-25 00:00:00.0|11599      |CLOSED         |
|2       |2013-07-25 00:00:00.0|256        |PENDING_PAYMENT|
|3       |2013-07-25 00:00:00.0|12111      |COMPLETE       |
|4       |2013-07-25 00:00:00.0|8827       |CLOSED         |
|5       |2013-07-25 00:00:00.0|11318      |COMPLETE       |
|6       |2013-07-25 00:00:00.0|7130       |COMPLETE       |
|7       |2013-07-25 00:00:00.0|4530       |COMPLETE       |
|8       |2013-07-25 00:00:00.0|2911       |PROCESSING     |
|9       |2013-07-25 00:00:00.0|5657       |PENDING_PAYMENT|
|10      |2013-07-25 00:00:00.0|5648       |PENDING_PAYMENT|
+--------+---------------------+-----------+---------------+
only showing top 10 rows



In [18]:
orderItemsCsv = spark.read.csv('/user/pi/retail_db/order_items').toDF('order_item_id','order_item_order_id','order_item_product_id','order_item_quantity_id','order_item_subtotal','order_item_product_price')

In [19]:
orderItemsCsv.printSchema()

root
 |-- order_item_id: string (nullable = true)
 |-- order_item_order_id: string (nullable = true)
 |-- order_item_product_id: string (nullable = true)
 |-- order_item_quantity_id: string (nullable = true)
 |-- order_item_subtotal: string (nullable = true)
 |-- order_item_product_price: string (nullable = true)



In [20]:
from pyspark.sql.types import IntegerType,FloatType
orderItems=orderItemsCsv.withColumn('order_item_id',orderItemsCsv.order_item_id.cast(IntegerType())) \
                        .withColumn('order_item_order_id',orderItemsCsv.order_item_order_id.cast(IntegerType())) \
                        .withColumn('order_item_product_id',orderItemsCsv.order_item_product_id.cast(IntegerType())) \
                        .withColumn('order_item_quantity_id',orderItemsCsv.order_item_quantity_id.cast(IntegerType())) \
                        .withColumn('order_item_subtotal',orderItemsCsv.order_item_subtotal.cast(FloatType())) \
                        .withColumn('order_item_product_price',orderItemsCsv.order_item_product_price.cast(FloatType()))

In [21]:
orderItems.printSchema()

root
 |-- order_item_id: integer (nullable = true)
 |-- order_item_order_id: integer (nullable = true)
 |-- order_item_product_id: integer (nullable = true)
 |-- order_item_quantity_id: integer (nullable = true)
 |-- order_item_subtotal: float (nullable = true)
 |-- order_item_product_price: float (nullable = true)



In [22]:
orderItems.show(10,False)

+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity_id|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|1            |1                  |957                  |1                     |299.98             |299.98                  |
|2            |2                  |1073                 |1                     |199.99             |199.99                  |
|3            |2                  |502                  |5                     |250.0              |50.0                    |
|4            |2                  |403                  |1                     |129.99             |129.99                  |
|5            |4                  |897                  |2                     |49.98              |24.99             
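
Method3 is needed because a CSV file carries no type information: without a schema, every field arrives as a string and has to be cast. The same idea in plain Python (no Spark), using the standard csv module on two lines laid out like the orders data above. The helper name cast_order is hypothetical.

```python
import csv
import io

# Two lines in the same comma-separated layout as the orders data shown above.
raw = io.StringIO(
    "1,2013-07-25 00:00:00.0,11599,CLOSED\n"
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n"
)

def cast_order(fields):
    """Cast the id fields to int; leave date and status as strings."""
    order_id, order_date, customer_id, order_status = fields
    return (int(order_id), order_date, int(customer_id), order_status)

rows = list(csv.reader(raw))                 # every field is a str at this point
typed_rows = [cast_order(r) for r in rows]   # ids are now ints

print(rows[0])
print(typed_rows[0])
```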

### Selection or Projection of Data in Data Frames

*We can use either select, withColumn, or selectExpr to project data.*

 * We can use select to fetch only the fields we are looking for.
 * Both orders and orderItems are of type DataFrame. We can access attributes by prefixing them with the data frame name (e.g. orders.order_id and orderItems.order_item_id). We can also pass attribute names as strings.

***

orders.select(orders.order_id,orders.order_date) or
orders.select('order_id','order_date')

***

 * We can apply necessary functions to manipulate data while it is being projected.

***

orders.select(substring('order_date',1,7)).show()

***

 * We can give an alias to derived fields using the alias function.

***

orders.select(substring('order_date',1,7).alias('order_month')).show()
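
Note that substring('order_date',1,7) counts positions from 1, as SQL does, whereas Python slices count from 0. A plain-Python sketch of the correspondence, assuming a positive start position. The helper sql_substring is hypothetical and is not a Spark function.

```python
def sql_substring(text, pos, length):
    """Mimic SQL-style substring for pos >= 1: pos is 1-based, length is a count."""
    if pos < 1:
        raise ValueError("this sketch only handles pos >= 1")
    return text[pos - 1 : pos - 1 + length]

order_date = "2013-07-25 00:00:00.0"
order_month = sql_substring(order_date, 1, 7)   # same as order_date[0:7]
order_day = sql_substring(order_date, 9, 2)     # same as order_date[8:10]
print(order_month, order_day)
```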

In [23]:
orders

DataFrame[order_id: int, order_date: string, customer_id: int, order_status: string]

In [24]:
orderItems

DataFrame[order_item_id: int, order_item_order_id: int, order_item_product_id: int, order_item_quantity_id: int, order_item_subtotal: float, order_item_product_price: float]

#### select()

In [25]:
# select columns using dataframe.column_name
orders.select(orders.order_id,orders.order_date).show(10,False)

+--------+---------------------+
|order_id|order_date           |
+--------+---------------------+
|1       |2013-07-25 00:00:00.0|
|2       |2013-07-25 00:00:00.0|
|3       |2013-07-25 00:00:00.0|
|4       |2013-07-25 00:00:00.0|
|5       |2013-07-25 00:00:00.0|
|6       |2013-07-25 00:00:00.0|
|7       |2013-07-25 00:00:00.0|
|8       |2013-07-25 00:00:00.0|
|9       |2013-07-25 00:00:00.0|
|10      |2013-07-25 00:00:00.0|
+--------+---------------------+
only showing top 10 rows



In [26]:
# select columns by passing the column names as strings
orders.select('order_id','order_date').show(10,False)

+--------+---------------------+
|order_id|order_date           |
+--------+---------------------+
|1       |2013-07-25 00:00:00.0|
|2       |2013-07-25 00:00:00.0|
|3       |2013-07-25 00:00:00.0|
|4       |2013-07-25 00:00:00.0|
|5       |2013-07-25 00:00:00.0|
|6       |2013-07-25 00:00:00.0|
|7       |2013-07-25 00:00:00.0|
|8       |2013-07-25 00:00:00.0|
|9       |2013-07-25 00:00:00.0|
|10      |2013-07-25 00:00:00.0|
+--------+---------------------+
only showing top 10 rows



In [27]:
#apply functions
from pyspark.sql.functions import substring
orders.select(substring('order_date',1,7)).show(10,False)

+---------------------------+
|substring(order_date, 1, 7)|
+---------------------------+
|2013-07                    |
|2013-07                    |
|2013-07                    |
|2013-07                    |
|2013-07                    |
|2013-07                    |
|2013-07                    |
|2013-07                    |
|2013-07                    |
|2013-07                    |
+---------------------------+
only showing top 10 rows



In [28]:
#with alias
orders.select(substring('order_date',1,7).alias('order_month')).show(10,False)

+-----------+
|order_month|
+-----------+
|2013-07    |
|2013-07    |
|2013-07    |
|2013-07    |
|2013-07    |
|2013-07    |
|2013-07    |
|2013-07    |
|2013-07    |
|2013-07    |
+-----------+
only showing top 10 rows



In [29]:
# with select, keeping every existing column means listing them all,
# along with the derived column (named using alias)
orders.select('order_id','order_date','customer_id','order_status',substring('order_date',1,7).alias('order_month')).show(10,False)

+--------+---------------------+-----------+---------------+-----------+
|order_id|order_date           |customer_id|order_status   |order_month|
+--------+---------------------+-----------+---------------+-----------+
|1       |2013-07-25 00:00:00.0|11599      |CLOSED         |2013-07    |
|2       |2013-07-25 00:00:00.0|256        |PENDING_PAYMENT|2013-07    |
|3       |2013-07-25 00:00:00.0|12111      |COMPLETE       |2013-07    |
|4       |2013-07-25 00:00:00.0|8827       |CLOSED         |2013-07    |
|5       |2013-07-25 00:00:00.0|11318      |COMPLETE       |2013-07    |
|6       |2013-07-25 00:00:00.0|7130       |COMPLETE       |2013-07    |
|7       |2013-07-25 00:00:00.0|4530       |COMPLETE       |2013-07    |
|8       |2013-07-25 00:00:00.0|2911       |PROCESSING     |2013-07    |
|9       |2013-07-25 00:00:00.0|5657       |PENDING_PAYMENT|2013-07    |
|10      |2013-07-25 00:00:00.0|5648       |PENDING_PAYMENT|2013-07    |
+--------+---------------------+-----------+---------------+-----------+

#### withColumn()

- Using withColumn, we don't need to list every column of the DataFrame explicitly.

In [30]:

orders.withColumn('order_month',substring('order_date',1,7)).show(10,False)

+--------+---------------------+-----------+---------------+-----------+
|order_id|order_date           |customer_id|order_status   |order_month|
+--------+---------------------+-----------+---------------+-----------+
|1       |2013-07-25 00:00:00.0|11599      |CLOSED         |2013-07    |
|2       |2013-07-25 00:00:00.0|256        |PENDING_PAYMENT|2013-07    |
|3       |2013-07-25 00:00:00.0|12111      |COMPLETE       |2013-07    |
|4       |2013-07-25 00:00:00.0|8827       |CLOSED         |2013-07    |
|5       |2013-07-25 00:00:00.0|11318      |COMPLETE       |2013-07    |
|6       |2013-07-25 00:00:00.0|7130       |COMPLETE       |2013-07    |
|7       |2013-07-25 00:00:00.0|4530       |COMPLETE       |2013-07    |
|8       |2013-07-25 00:00:00.0|2911       |PROCESSING     |2013-07    |
|9       |2013-07-25 00:00:00.0|5657       |PENDING_PAYMENT|2013-07    |
|10      |2013-07-25 00:00:00.0|5648       |PENDING_PAYMENT|2013-07    |
+--------+---------------------+-----------+---------------+-----------+

 - All existing columns, along with the additional column, will be part of the new DataFrame.
 - If the column name passed to withColumn is the same as an existing column, the existing column is replaced.

In [31]:
orders.withColumn('order_date',substring('order_date',1,7)).show(10,False)

+--------+----------+-----------+---------------+
|order_id|order_date|customer_id|order_status   |
+--------+----------+-----------+---------------+
|1       |2013-07   |11599      |CLOSED         |
|2       |2013-07   |256        |PENDING_PAYMENT|
|3       |2013-07   |12111      |COMPLETE       |
|4       |2013-07   |8827       |CLOSED         |
|5       |2013-07   |11318      |COMPLETE       |
|6       |2013-07   |7130       |COMPLETE       |
|7       |2013-07   |4530       |COMPLETE       |
|8       |2013-07   |2911       |PROCESSING     |
|9       |2013-07   |5657       |PENDING_PAYMENT|
|10      |2013-07   |5648       |PENDING_PAYMENT|
+--------+----------+-----------+---------------+
only showing top 10 rows



#### drop()

 - Using drop, a new DataFrame is created without the column mentioned in drop.
 - The existing DataFrame is not changed.

In [32]:
orders_new = orders.drop('customer_id')

In [33]:
orders_new.show(10,False)

+--------+---------------------+---------------+
|order_id|order_date           |order_status   |
+--------+---------------------+---------------+
|1       |2013-07-25 00:00:00.0|CLOSED         |
|2       |2013-07-25 00:00:00.0|PENDING_PAYMENT|
|3       |2013-07-25 00:00:00.0|COMPLETE       |
|4       |2013-07-25 00:00:00.0|CLOSED         |
|5       |2013-07-25 00:00:00.0|COMPLETE       |
|6       |2013-07-25 00:00:00.0|COMPLETE       |
|7       |2013-07-25 00:00:00.0|COMPLETE       |
|8       |2013-07-25 00:00:00.0|PROCESSING     |
|9       |2013-07-25 00:00:00.0|PENDING_PAYMENT|
|10      |2013-07-25 00:00:00.0|PENDING_PAYMENT|
+--------+---------------------+---------------+
only showing top 10 rows



In [34]:
orders.show(10,False)

+--------+---------------------+-----------+---------------+
|order_id|order_date           |customer_id|order_status   |
+--------+---------------------+-----------+---------------+
|1       |2013-07-25 00:00:00.0|11599      |CLOSED         |
|2       |2013-07-25 00:00:00.0|256        |PENDING_PAYMENT|
|3       |2013-07-25 00:00:00.0|12111      |COMPLETE       |
|4       |2013-07-25 00:00:00.0|8827       |CLOSED         |
|5       |2013-07-25 00:00:00.0|11318      |COMPLETE       |
|6       |2013-07-25 00:00:00.0|7130       |COMPLETE       |
|7       |2013-07-25 00:00:00.0|4530       |COMPLETE       |
|8       |2013-07-25 00:00:00.0|2911       |PROCESSING     |
|9       |2013-07-25 00:00:00.0|5657       |PENDING_PAYMENT|
|10      |2013-07-25 00:00:00.0|5648       |PENDING_PAYMENT|
+--------+---------------------+-----------+---------------+
only showing top 10 rows
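
Both withColumn and drop return a new DataFrame and leave the original untouched. The plain-Python sketch below mimics that behaviour with rows held as dictionaries. The helpers with_column and drop_column are hypothetical stand-ins written for illustration, not Spark APIs.

```python
def with_column(rows, name, fn):
    """Return new rows with `name` set to fn(row); replaces `name` if it exists."""
    return [{**row, name: fn(row)} for row in rows]

def drop_column(rows, name):
    """Return new rows without `name`; the input rows are not modified."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]

orders_rows = [
    {"order_id": 1, "order_date": "2013-07-25 00:00:00.0",
     "customer_id": 11599, "order_status": "CLOSED"},
]

with_month = with_column(orders_rows, "order_month", lambda r: r["order_date"][:7])
replaced = with_column(orders_rows, "order_date", lambda r: r["order_date"][:7])
without_customer = drop_column(orders_rows, "customer_id")

print(list(with_month[0]))        # original columns plus order_month
print(replaced[0]["order_date"])  # existing column replaced
print(list(orders_rows[0]))       # original rows still have all four columns
```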



#### selectExpr()
 - Using selectExpr, we can write SQL (Hive-style) expressions as strings instead of calling DataFrame functions.

In [35]:
orders.selectExpr('order_id','substring(order_date,1,7) as order_month').show()

+--------+-----------+
|order_id|order_month|
+--------+-----------+
|       1|    2013-07|
|       2|    2013-07|
|       3|    2013-07|
|       4|    2013-07|
|       5|    2013-07|
|       6|    2013-07|
|       7|    2013-07|
|       8|    2013-07|
|       9|    2013-07|
|      10|    2013-07|
|      11|    2013-07|
|      12|    2013-07|
|      13|    2013-07|
|      14|    2013-07|
|      15|    2013-07|
|      16|    2013-07|
|      17|    2013-07|
|      18|    2013-07|
|      19|    2013-07|
|      20|    2013-07|
+--------+-----------+
only showing top 20 rows



In [36]:
orders.select('order_id',substring('order_date',1,7).alias('order_month')).show()

+--------+-----------+
|order_id|order_month|
+--------+-----------+
|       1|    2013-07|
|       2|    2013-07|
|       3|    2013-07|
|       4|    2013-07|
|       5|    2013-07|
|       6|    2013-07|
|       7|    2013-07|
|       8|    2013-07|
|       9|    2013-07|
|      10|    2013-07|
|      11|    2013-07|
|      12|    2013-07|
|      13|    2013-07|
|      14|    2013-07|
|      15|    2013-07|
|      16|    2013-07|
|      17|    2013-07|
|      18|    2013-07|
|      19|    2013-07|
|      20|    2013-07|
+--------+-----------+
only showing top 20 rows



### Filtering Data from Data Frames

*Data Frames have two APIs to filter data: where and filter. They are synonyms, so you can use either one.*
 - Both filter and where accept a condition in two forms.
 - One takes SQL-style syntax and the other takes Data Frame native-style syntax.
 - In the native style, refer to a column as dataFrame.attributeName and compare it with values.
***

 e.g.: *orders.where(orders.order_status == 'COMPLETE').show()*

***
 - In the SQL style, pass the condition as a string literal.
***

 e.g.: *orders.where('order_status = "COMPLETE"').show()*

***
 - Make sure both the orders and orderItems data frames are created.
 - Let us see a few more examples 
	* Get orders which are either COMPLETE or CLOSED 
	* Get orders which are either COMPLETE or CLOSED and placed in the month of 2013 August 
	* Get order items where order item subtotal is not equal to the product of order_item_quantity and order_item_product_price 
	* Get all the orders which are placed on first of every month 


#### Get orders which are either COMPLETE or CLOSED

In [37]:
# Start with a single status: orders that are CLOSED
orders.filter(orders.order_status == 'CLOSED').show(10,False)

+--------+---------------------+-----------+------------+
|order_id|order_date           |customer_id|order_status|
+--------+---------------------+-----------+------------+
|1       |2013-07-25 00:00:00.0|11599      |CLOSED      |
|4       |2013-07-25 00:00:00.0|8827       |CLOSED      |
|12      |2013-07-25 00:00:00.0|1837       |CLOSED      |
|18      |2013-07-25 00:00:00.0|1205       |CLOSED      |
|24      |2013-07-25 00:00:00.0|11441      |CLOSED      |
|25      |2013-07-25 00:00:00.0|9503       |CLOSED      |
|37      |2013-07-25 00:00:00.0|5863       |CLOSED      |
|51      |2013-07-25 00:00:00.0|12271      |CLOSED      |
|57      |2013-07-25 00:00:00.0|7073       |CLOSED      |
|61      |2013-07-25 00:00:00.0|4791       |CLOSED      |
+--------+---------------------+-----------+------------+
only showing top 10 rows



In [38]:
orders.filter((orders.order_status == 'CLOSED') | (orders.order_status == 'COMPLETE')).show(10,False)

+--------+---------------------+-----------+------------+
|order_id|order_date           |customer_id|order_status|
+--------+---------------------+-----------+------------+
|1       |2013-07-25 00:00:00.0|11599      |CLOSED      |
|3       |2013-07-25 00:00:00.0|12111      |COMPLETE    |
|4       |2013-07-25 00:00:00.0|8827       |CLOSED      |
|5       |2013-07-25 00:00:00.0|11318      |COMPLETE    |
|6       |2013-07-25 00:00:00.0|7130       |COMPLETE    |
|7       |2013-07-25 00:00:00.0|4530       |COMPLETE    |
|12      |2013-07-25 00:00:00.0|1837       |CLOSED      |
|15      |2013-07-25 00:00:00.0|2568       |COMPLETE    |
|17      |2013-07-25 00:00:00.0|2667       |COMPLETE    |
|18      |2013-07-25 00:00:00.0|1205       |CLOSED      |
+--------+---------------------+-----------+------------+
only showing top 10 rows
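
The parentheses around each comparison in the filter above are required, not stylistic. In Python, | and & bind more tightly than ==, so a condition without parentheses is parsed differently. The effect can be seen with plain integers (no Spark needed):

```python
# With parentheses: each comparison is evaluated first, then OR-ed.
with_parens = (3 == 3) | (4 == 5)        # True | False

# Without parentheses Python parses this as the chained comparison
# 3 == (3 | 4) == 5, because | has higher precedence than ==.
without_parens = 3 == 3 | 4 == 5

print(with_parens, without_parens)
```

With Column objects the same parsing rule applies, so the form without parentheses does not express the intended condition.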



In [39]:
orders.filter(orders.order_status.isin('CLOSED','COMPLETE')).show(10,False)

+--------+---------------------+-----------+------------+
|order_id|order_date           |customer_id|order_status|
+--------+---------------------+-----------+------------+
|1       |2013-07-25 00:00:00.0|11599      |CLOSED      |
|3       |2013-07-25 00:00:00.0|12111      |COMPLETE    |
|4       |2013-07-25 00:00:00.0|8827       |CLOSED      |
|5       |2013-07-25 00:00:00.0|11318      |COMPLETE    |
|6       |2013-07-25 00:00:00.0|7130       |COMPLETE    |
|7       |2013-07-25 00:00:00.0|4530       |COMPLETE    |
|12      |2013-07-25 00:00:00.0|1837       |CLOSED      |
|15      |2013-07-25 00:00:00.0|2568       |COMPLETE    |
|17      |2013-07-25 00:00:00.0|2667       |COMPLETE    |
|18      |2013-07-25 00:00:00.0|1205       |CLOSED      |
+--------+---------------------+-----------+------------+
only showing top 10 rows



In [40]:
#sql based syntax

In [41]:
orders.filter("order_status in ('CLOSED','COMPLETE')").show(10,False)

+--------+---------------------+-----------+------------+
|order_id|order_date           |customer_id|order_status|
+--------+---------------------+-----------+------------+
|1       |2013-07-25 00:00:00.0|11599      |CLOSED      |
|3       |2013-07-25 00:00:00.0|12111      |COMPLETE    |
|4       |2013-07-25 00:00:00.0|8827       |CLOSED      |
|5       |2013-07-25 00:00:00.0|11318      |COMPLETE    |
|6       |2013-07-25 00:00:00.0|7130       |COMPLETE    |
|7       |2013-07-25 00:00:00.0|4530       |COMPLETE    |
|12      |2013-07-25 00:00:00.0|1837       |CLOSED      |
|15      |2013-07-25 00:00:00.0|2568       |COMPLETE    |
|17      |2013-07-25 00:00:00.0|2667       |COMPLETE    |
|18      |2013-07-25 00:00:00.0|1205       |CLOSED      |
+--------+---------------------+-----------+------------+
only showing top 10 rows



#### Get orders which are either COMPLETE or CLOSED and placed in the month of 2013 August

In [42]:
#Get orders which are either COMPLETE or CLOSED and placed in the month of 2013 August
orders.filter((orders.order_status.isin('CLOSED','COMPLETE')) & (orders.order_date.like('2013-08%'))).show(10,False)

+--------+---------------------+-----------+------------+
|order_id|order_date           |customer_id|order_status|
+--------+---------------------+-----------+------------+
|1297    |2013-08-01 00:00:00.0|11607      |COMPLETE    |
|1298    |2013-08-01 00:00:00.0|5105       |CLOSED      |
|1299    |2013-08-01 00:00:00.0|7802       |COMPLETE    |
|1302    |2013-08-01 00:00:00.0|1695       |COMPLETE    |
|1304    |2013-08-01 00:00:00.0|2059       |COMPLETE    |
|1305    |2013-08-01 00:00:00.0|3844       |COMPLETE    |
|1307    |2013-08-01 00:00:00.0|4474       |COMPLETE    |
|1309    |2013-08-01 00:00:00.0|2367       |CLOSED      |
|1312    |2013-08-01 00:00:00.0|12291      |COMPLETE    |
|1314    |2013-08-01 00:00:00.0|10993      |COMPLETE    |
+--------+---------------------+-----------+------------+
only showing top 10 rows



In [43]:
#sql format
orders.filter("order_status in ('CLOSED','COMPLETE') and order_date like '2013-08%'").show(10,False)

+--------+---------------------+-----------+------------+
|order_id|order_date           |customer_id|order_status|
+--------+---------------------+-----------+------------+
|1297    |2013-08-01 00:00:00.0|11607      |COMPLETE    |
|1298    |2013-08-01 00:00:00.0|5105       |CLOSED      |
|1299    |2013-08-01 00:00:00.0|7802       |COMPLETE    |
|1302    |2013-08-01 00:00:00.0|1695       |COMPLETE    |
|1304    |2013-08-01 00:00:00.0|2059       |COMPLETE    |
|1305    |2013-08-01 00:00:00.0|3844       |COMPLETE    |
|1307    |2013-08-01 00:00:00.0|4474       |COMPLETE    |
|1309    |2013-08-01 00:00:00.0|2367       |CLOSED      |
|1312    |2013-08-01 00:00:00.0|12291      |COMPLETE    |
|1314    |2013-08-01 00:00:00.0|10993      |COMPLETE    |
+--------+---------------------+-----------+------------+
only showing top 10 rows
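
For intuition, the combined condition reads as follows in plain Python (no Spark): isin corresponds to a membership test, and like '2013-08%' to a prefix match. The rows are a handful taken from the outputs above, reduced to (order_id, order_date, order_status).

```python
rows = [
    (1, "2013-07-25 00:00:00.0", "CLOSED"),
    (2, "2013-07-25 00:00:00.0", "PENDING_PAYMENT"),
    (1297, "2013-08-01 00:00:00.0", "COMPLETE"),
    (1298, "2013-08-01 00:00:00.0", "CLOSED"),
    (1300, "2013-08-01 00:00:00.0", "PENDING_PAYMENT"),
]

wanted = {"CLOSED", "COMPLETE"}

# LIKE '2013-08%' matches any string that starts with '2013-08'.
selected = [r for r in rows if r[2] in wanted and r[1].startswith("2013-08")]
selected_ids = [r[0] for r in selected]
print(selected_ids)
```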



#### Get order items where order item subtotal is not equal to the product of order_item_quantity and order_item_product_price

In [44]:
orderItems.show(10,False)

+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity_id|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|1            |1                  |957                  |1                     |299.98             |299.98                  |
|2            |2                  |1073                 |1                     |199.99             |199.99                  |
|3            |2                  |502                  |5                     |250.0              |50.0                    |
|4            |2                  |403                  |1                     |129.99             |129.99                  |
|5            |4                  |897                  |2                     |49.98              |24.99             

In [45]:
orderItems.select('order_item_subtotal','order_item_quantity_id','order_item_product_price').show(10,False)

+-------------------+----------------------+------------------------+
|order_item_subtotal|order_item_quantity_id|order_item_product_price|
+-------------------+----------------------+------------------------+
|299.98             |1                     |299.98                  |
|199.99             |1                     |199.99                  |
|250.0              |5                     |50.0                    |
|129.99             |1                     |129.99                  |
|49.98              |2                     |24.99                   |
|299.95             |5                     |59.99                   |
|150.0              |3                     |50.0                    |
|199.92             |4                     |49.98                   |
|299.98             |1                     |299.98                  |
|299.95             |5                     |59.99                   |
+-------------------+----------------------+------------------------+
only showing top 10 rows

In [46]:
# Note: this import shadows Python's built-in round for the rest of the session
from pyspark.sql.functions import round
orderItems.select('order_item_subtotal','order_item_quantity_id','order_item_product_price'). \
            filter(orderItems.order_item_subtotal != round(orderItems.order_item_quantity_id * orderItems.order_item_product_price,2)). \
            show(10,False)

+-------------------+----------------------+------------------------+
|order_item_subtotal|order_item_quantity_id|order_item_product_price|
+-------------------+----------------------+------------------------+
+-------------------+----------------------+------------------------+
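
The round(..., 2) in the filter matters because floating-point multiplication is not exact: quantity times price can differ from the stored subtotal in the last binary digits even when the two agree to the cent. A plain-Python illustration of the same effect follows. Python floats are 64-bit while the columns here are 32-bit floats, so the exact digits differ, but the principle is the same. The Decimal comparison is shown only as one possible alternative, not as something this section uses.

```python
from decimal import Decimal

# Classic example: the binary float product is not exactly the decimal value.
naive_equal = (0.1 * 3 == 0.3)
rounded_equal = (round(0.1 * 3, 2) == 0.3)

# Values taken from the order_items sample above: 5 * 59.99 vs 299.95.
subtotal, quantity, price = 299.95, 5, 59.99
matches_after_rounding = (round(quantity * price, 2) == subtotal)

# Decimal arithmetic avoids the issue by working in base 10.
exact_equal = (Decimal("59.99") * 5 == Decimal("299.95"))

print(naive_equal, rounded_equal, matches_after_rounding, exact_equal)
```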



#### Get all the orders which are placed on first of every month

In [47]:
from pyspark.sql.functions import date_format
orders.filter(date_format(orders.order_date,'dd')=='01').show(10,False)

+--------+---------------------+-----------+---------------+
|order_id|order_date           |customer_id|order_status   |
+--------+---------------------+-----------+---------------+
|1297    |2013-08-01 00:00:00.0|11607      |COMPLETE       |
|1298    |2013-08-01 00:00:00.0|5105       |CLOSED         |
|1299    |2013-08-01 00:00:00.0|7802       |COMPLETE       |
|1300    |2013-08-01 00:00:00.0|553        |PENDING_PAYMENT|
|1301    |2013-08-01 00:00:00.0|1604       |PENDING_PAYMENT|
|1302    |2013-08-01 00:00:00.0|1695       |COMPLETE       |
|1303    |2013-08-01 00:00:00.0|7018       |PROCESSING     |
|1304    |2013-08-01 00:00:00.0|2059       |COMPLETE       |
|1305    |2013-08-01 00:00:00.0|3844       |COMPLETE       |
|1306    |2013-08-01 00:00:00.0|11672      |PENDING_PAYMENT|
+--------+---------------------+-----------+---------------+
only showing top 10 rows



In [48]:
orders.select('order_date').filter(date_format(orders.order_date,'dd')=='01').distinct().count()

12

In [49]:
from pyspark.sql.functions import max,min
orders.select(min(orders.order_date).alias('min_order_date'),max(orders.order_date).alias('max_order_date')). \
        show(truncate=False)

+---------------------+---------------------+
|min_order_date       |max_order_date       |
+---------------------+---------------------+
|2013-07-25 00:00:00.0|2014-07-24 00:00:00.0|
+---------------------+---------------------+



In [50]:
orders.selectExpr("min(order_date) as min_order_date","max(order_date) as max_order_date").show(truncate=False)

+---------------------+---------------------+
|min_order_date       |max_order_date       |
+---------------------+---------------------+
|2013-07-25 00:00:00.0|2014-07-24 00:00:00.0|
+---------------------+---------------------+
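
`order_date` is read as a string, so `min` and `max` compare it lexicographically. That still gives the chronological extremes because the format is year-first and zero-padded. A plain-Python sketch (no Spark; the three sample values are picked for illustration) checks this, and also counts how many calendar days fall on the 1st between the two extremes:

```python
from datetime import datetime, timedelta

# Sample values in the same format as the order_date column.
order_dates = [
    "2014-07-24 00:00:00.0",
    "2013-07-25 00:00:00.0",
    "2013-11-01 00:00:00.0",
]

def parse(value):
    """Parse a 'yyyy-MM-dd HH:mm:ss.S' string into a datetime."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f")

# String comparison versus chronological comparison.
string_min, string_max = min(order_dates), max(order_dates)
chrono_min = min(order_dates, key=parse)
chrono_max = max(order_dates, key=parse)

# Calendar days between the extremes that fall on the 1st of a month.
start, end = parse(string_min).date(), parse(string_max).date()
all_days = [start + timedelta(days=n) for n in range((end - start).days + 1)]
first_days = [day for day in all_days if day.day == 1]

print(string_min, string_max, len(first_days))
```

The range 2013-07-25 to 2014-07-24 contains 12 first-of-month days, which agrees with the distinct count of 12 obtained earlier.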



### Joining multiple Data Frames

*A Data Frame has an API called join to perform join operations*

In [51]:
help(orders.join)

Help on method join in module pyspark.sql.dataframe:

join(other, on=None, how=None) method of pyspark.sql.dataframe.DataFrame instance
    Joins with another :class:`DataFrame`, using the given join expression.
    
    :param other: Right side of the join
    :param on: a string for the join column name, a list of column names,
        a join expression (Column), or a list of Columns.
        If `on` is a string or a list of strings indicating the name of the join column(s),
        the column(s) must exist on both sides, and this performs an equi-join.
    :param how: str, default ``inner``. Must be one of: ``inner``, ``cross``, ``outer``,
        ``full``, ``full_outer``, ``left``, ``left_outer``, ``right``, ``right_outer``,
        ``left_semi``, and ``left_anti``.
    
    The following performs a full outer join between ``df1`` and ``df2``.
    
    >>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
    [Row(name=None, height=80), Row(name='Bob'

#### Get all the order items corresponding to COMPLETE or CLOSED orders

In [52]:
orders

DataFrame[order_id: int, order_date: string, customer_id: int, order_status: string]

In [53]:
orderItems

DataFrame[order_item_id: int, order_item_order_id: int, order_item_product_id: int, order_item_quantity_id: int, order_item_subtotal: float, order_item_product_price: float]

In [54]:
orders.where(orders.order_status.isin('COMPLETE','CLOSED')). \
    join(orderItems,orders.order_id==orderItems.order_item_order_id,'inner'). \
    show()

+--------+--------------------+-----------+------------+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|order_id|          order_date|customer_id|order_status|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity_id|order_item_subtotal|order_item_product_price|
+--------+--------------------+-----------+------------+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|       1|2013-07-25 00:00:...|      11599|      CLOSED|            1|                  1|                  957|                     1|             299.98|                  299.98|
|       4|2013-07-25 00:00:...|       8827|      CLOSED|            8|                  4|                 1014|                     4|             199.92|                   49.98|
|       4|2013-07-25 00:00:...|       8827|      CLOSED|            7|                  4|     

In [55]:
print('orders:',orders.count())
print('orderItems:',orderItems.count())
print('InnerJoin',orders.where(orders.order_status.isin('COMPLETE','CLOSED')). \
    join(orderItems,orders.order_id==orderItems.order_item_order_id,'inner'). \
    count())

orders: 68883
orderItems: 172198
InnerJoin 75408


In [56]:
#### Get all the orders where there are no corresponding order_items
orders.join(orderItems,orders.order_id==orderItems.order_item_order_id,'left'). \
       filter('order_item_order_id is null'). \
    count()

11452

In [57]:
#### Check if there are any order_items where there is no corresponding order in the orders data set
# in a right join the unmatched order_items rows have nulls in the orders columns, so test order_id
orders.join(orderItems,orders.order_id==orderItems.order_item_order_id,'right'). \
       filter('order_id is null'). \
    count()

0
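
The outer-join-plus-null-filter pattern used in the last two cells can be sketched in plain Python. This is not Spark code and the rows are made up. It only illustrates which side carries the nulls after the join.

```python
# Tiny stand-ins for the two data sets (hypothetical rows).
orders = [
    {"order_id": 1, "order_status": "CLOSED"},
    {"order_id": 2, "order_status": "PENDING_PAYMENT"},
    {"order_id": 3, "order_status": "COMPLETE"},
]
order_items = [
    {"order_item_id": 1, "order_item_order_id": 1},
    {"order_item_id": 2, "order_item_order_id": 2},
    {"order_item_id": 3, "order_item_order_id": 2},
]

def left_join(left, right, left_key, right_key):
    """Keep every left row; pair it with each matching right row, or with None."""
    joined = []
    for left_row in left:
        matches = [r for r in right if r[right_key] == left_row[left_key]]
        if matches:
            joined.extend((left_row, right_row) for right_row in matches)
        else:
            joined.append((left_row, None))
    return joined

# Left join: orders without items are the rows whose order_items side is empty.
left_rows = left_join(orders, order_items, "order_id", "order_item_order_id")
orders_without_items = [o["order_id"] for o, item in left_rows if item is None]

# A right join is a left join with the sides swapped: here the orders side is empty.
right_rows = left_join(order_items, orders, "order_item_order_id", "order_id")
items_without_orders = [i["order_item_id"] for i, order in right_rows if order is None]

print(orders_without_items, items_without_orders)
```

The `help(orders.join)` output above also lists `left_anti`, which keeps only the left rows that have no match and so expresses the same question without a separate null filter.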

### Perform Aggregations using Data Frames

In [58]:
orders

DataFrame[order_id: int, order_date: string, customer_id: int, order_status: string]

In [59]:
orders.select(orders.order_status).distinct().count()

9

In [60]:
#a better way is to use countDistinct, as below
from pyspark.sql.functions import countDistinct
orders.select(countDistinct('order_status')).show()

+----------------------------+
|count(DISTINCT order_status)|
+----------------------------+
|                           9|
+----------------------------+



In [61]:
#get revenue for order id 2
from pyspark.sql.functions import sum
orderItems.filter(orderItems.order_item_order_id==2).select(round(sum(orderItems.order_item_subtotal),2)).show()

+----------------------------------+
|round(sum(order_item_subtotal), 2)|
+----------------------------------+
|                            579.98|
+----------------------------------+



In [62]:
#Get count by status from order
orders.groupby('order_status').count().show()

+---------------+-----+
|   order_status|count|
+---------------+-----+
|PENDING_PAYMENT|15030|
|       COMPLETE|22899|
|        ON_HOLD| 3798|
| PAYMENT_REVIEW|  729|
|     PROCESSING| 8275|
|         CLOSED| 7556|
|SUSPECTED_FRAUD| 1558|
|        PENDING| 7610|
|       CANCELED| 1428|
+---------------+-----+



In [63]:
orderItems

DataFrame[order_item_id: int, order_item_order_id: int, order_item_product_id: int, order_item_quantity_id: int, order_item_subtotal: float, order_item_product_price: float]

In [64]:
#get revenue for each order id from order items
orderItems.groupby('order_item_order_id').agg(round(sum('order_item_subtotal'),2).alias('order_revenue')).show()

+-------------------+-------------+
|order_item_order_id|order_revenue|
+-------------------+-------------+
|                148|       479.99|
|                463|       829.92|
|                471|       169.98|
|                496|       441.95|
|               1088|       249.97|
|               1580|       299.95|
|               1591|       439.86|
|               1645|      1509.79|
|               2366|       299.97|
|               2659|       724.91|
|               2866|       569.96|
|               3175|       209.97|
|               3749|       143.97|
|               3794|       299.95|
|               3918|       829.93|
|               3997|       579.95|
|               4101|       129.99|
|               4519|        79.98|
|               4818|       399.98|
|               4900|       179.97|
+-------------------+-------------+
only showing top 20 rows
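
What `groupby(...).agg(round(sum(...), 2))` computes can be sketched in plain Python as below. This is an illustration of the semantics, not of how Spark executes it. The subtotals for orders 1 and 2 are the ones that appear in this notebook's outputs.

```python
from collections import defaultdict

# (order_item_order_id, order_item_subtotal) pairs for orders 1 and 2.
order_items = [(1, 299.98), (2, 199.99), (2, 250.0), (2, 129.99)]

# Group by the key and accumulate the sum for each group.
revenue_by_order = defaultdict(float)
for order_id, subtotal in order_items:
    revenue_by_order[order_id] += subtotal

# Round each group's total to 2 decimals, like round(sum(...), 2).
order_revenue = {
    order_id: round(total, 2) for order_id, total in revenue_by_order.items()
}

print(order_revenue)
```

The total for order 2 agrees with the 579.98 obtained earlier by filtering on `order_item_order_id==2`.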



In [65]:
#get daily product revenue (order_date,order_item_product_id)
from pyspark.sql.functions import substring
orders.join(orderItems,orders.order_id==orderItems.order_item_order_id). \
    groupby(substring('order_date',1,10).alias('order_date'),'order_item_product_id'). \
    agg(round(sum('order_item_subtotal'),2). \
    alias('DailyProductRev')). \
    show()

+----------+---------------------+---------------+
|order_date|order_item_product_id|DailyProductRev|
+----------+---------------------+---------------+
|2013-07-26|                   93|         124.95|
|2013-07-30|                  810|          59.97|
|2013-08-04|                  804|          59.97|
|2013-08-06|                  823|         155.97|
|2013-09-07|                  627|        4158.96|
|2013-09-17|                  893|          24.99|
|2013-09-30|                  565|           70.0|
|2013-09-30|                  134|          200.0|
|2013-10-04|                  565|          140.0|
|2013-10-16|                  642|           60.0|
|2013-10-20|                  793|          44.97|
|2013-10-24|                  564|          120.0|
|2013-10-28|                  235|         139.96|
|2013-10-31|                  116|          89.98|
|2013-11-01|                 1014|        8446.62|
|2013-11-13|                  835|          95.97|
|2013-11-14|                  7
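
Two details of the cell above are worth spelling out: `substring` uses 1-based positions, and the grouping key is a pair of columns. A plain-Python sketch with made-up joined rows (not Spark code) shows both:

```python
from collections import defaultdict

# (order_date, order_item_product_id, order_item_subtotal); rows are hypothetical.
joined_rows = [
    ("2013-07-25 00:00:00.0", 957, 299.98),
    ("2013-07-25 00:00:00.0", 957, 299.98),
    ("2013-07-25 00:00:00.0", 403, 129.99),
    ("2013-07-26 00:00:00.0", 403, 129.99),
]

daily_product_revenue = defaultdict(float)
for order_date, product_id, subtotal in joined_rows:
    # Python slice [0:10] selects the same characters as substring(order_date, 1, 10).
    day = order_date[0:10]
    # The composite key (day, product_id) plays the role of the two groupby columns.
    daily_product_revenue[(day, product_id)] += subtotal

result = {key: round(total, 2) for key, total in daily_product_revenue.items()}
print(result)
```

Dropping the time portion before grouping keeps the key at day granularity even if the time part were not always midnight.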

### Sorting Data in Data Frames

* We can use either sort() or orderBy(); orderBy() is an alias of sort()

In [66]:
#sort orders by status
orders.sort('order_status').show()

+--------+--------------------+-----------+------------+
|order_id|          order_date|customer_id|order_status|
+--------+--------------------+-----------+------------+
|     527|2013-07-28 00:00:...|       5426|    CANCELED|
|    1435|2013-08-01 00:00:...|       1879|    CANCELED|
|     552|2013-07-28 00:00:...|       1445|    CANCELED|
|     112|2013-07-26 00:00:...|       5375|    CANCELED|
|     564|2013-07-28 00:00:...|       2216|    CANCELED|
|     955|2013-07-30 00:00:...|       8117|    CANCELED|
|    1383|2013-08-01 00:00:...|       1753|    CANCELED|
|     962|2013-07-30 00:00:...|       9492|    CANCELED|
|     607|2013-07-28 00:00:...|       6376|    CANCELED|
|    1013|2013-07-30 00:00:...|       1903|    CANCELED|
|     667|2013-07-28 00:00:...|       4726|    CANCELED|
|    1169|2013-07-31 00:00:...|       3971|    CANCELED|
|     717|2013-07-29 00:00:...|       8208|    CANCELED|
|    1186|2013-07-31 00:00:...|      11947|    CANCELED|
|     753|2013-07-29 00:00:...|

In [67]:
#sort orders by order date and then status
orders.sort('order_date','order_status').show()

+--------+--------------------+-----------+------------+
|order_id|          order_date|customer_id|order_status|
+--------+--------------------+-----------+------------+
|      50|2013-07-25 00:00:...|       5225|    CANCELED|
|       1|2013-07-25 00:00:...|      11599|      CLOSED|
|      12|2013-07-25 00:00:...|       1837|      CLOSED|
|       4|2013-07-25 00:00:...|       8827|      CLOSED|
|      37|2013-07-25 00:00:...|       5863|      CLOSED|
|      18|2013-07-25 00:00:...|       1205|      CLOSED|
|      24|2013-07-25 00:00:...|      11441|      CLOSED|
|      25|2013-07-25 00:00:...|       9503|      CLOSED|
|   57754|2013-07-25 00:00:...|       4648|      CLOSED|
|      90|2013-07-25 00:00:...|       9131|      CLOSED|
|      51|2013-07-25 00:00:...|      12271|      CLOSED|
|      57|2013-07-25 00:00:...|       7073|      CLOSED|
|      61|2013-07-25 00:00:...|       4791|      CLOSED|
|      62|2013-07-25 00:00:...|       9111|      CLOSED|
|      87|2013-07-25 00:00:...|

In [68]:
#Sort order items by order item order id descending and order item sub total ascending
orderItems.sort(['order_item_order_id','order_item_subtotal'],ascending=[0,1]).show()

+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity_id|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|       172198|              68883|                  502|                     3|              150.0|                    50.0|
|       172197|              68883|                  208|                     1|            1999.99|                 1999.99|
|       172196|              68882|                  502|                     1|               50.0|                    50.0|
|       172195|              68882|                  365|                     1|              59.99|                   59.99|
|       172194|              68881|                  403|                     1|             129.99|                  

In [69]:
orderItems.sort('order_item_order_id',orderItems.order_item_subtotal.desc()).show()

+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity_id|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|            1|                  1|                  957|                     1|             299.98|                  299.98|
|            3|                  2|                  502|                     5|              250.0|                    50.0|
|            2|                  2|                 1073|                     1|             199.99|                  199.99|
|            4|                  2|                  403|                     1|             129.99|                  129.99|
|            6|                  4|                  365|                     5|             299.95|                  
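
The two ways of mixing sort directions shown above (`ascending=[0,1]` and `.desc()` on a column) can be mimicked in plain Python with a tuple sort key. The sketch below is not Spark code. The four rows are taken from the output for orders 68882 and 68883.

```python
# (order_item_order_id, order_item_subtotal) pairs from the output above.
order_items = [(68882, 59.99), (68883, 1999.99), (68882, 50.0), (68883, 150.0)]

# Like sort([...], ascending=[0, 1]): order id descending, then subtotal ascending.
id_desc_subtotal_asc = sorted(order_items, key=lambda row: (-row[0], row[1]))

# Like sort('order_item_order_id', subtotal.desc()): id ascending, subtotal descending.
id_asc_subtotal_desc = sorted(order_items, key=lambda row: (row[0], -row[1]))

print(id_desc_subtotal_asc)
print(id_asc_subtotal_desc)
```

In the `ascending` list each entry applies to the column at the same position, with 0 meaning descending and 1 meaning ascending.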

In [71]:
#Daily product revenue, ordered by order_date descending and order_item_product_id ascending
orders.join(orderItems,orders.order_id==orderItems.order_item_order_id).groupBy('order_date','order_item_product_id'). \
        agg(round(sum('order_item_subtotal'),2)).sort(['order_date','order_item_product_id'],ascending=[0,1]).show()

+--------------------+---------------------+----------------------------------+
|          order_date|order_item_product_id|round(sum(order_item_subtotal), 2)|
+--------------------+---------------------+----------------------------------+
|2014-07-24 00:00:...|                   19|                            124.99|
|2014-07-24 00:00:...|                   35|                            159.99|
|2014-07-24 00:00:...|                   37|                            209.94|
|2014-07-24 00:00:...|                   44|                            179.97|
|2014-07-24 00:00:...|                  116|                            179.96|
|2014-07-24 00:00:...|                  134|                              50.0|
|2014-07-24 00:00:...|                  172|                             180.0|
|2014-07-24 00:00:...|                  191|                          10898.91|
|2014-07-24 00:00:...|                  203|                            399.99|
|2014-07-24 00:00:...|                  

In [78]:
#to sort within each partition
orders.sortWithinPartitions('order_date',orders.customer_id.desc()).show()

+--------+--------------------+-----------+---------------+
|order_id|          order_date|customer_id|   order_status|
+--------+--------------------+-----------+---------------+
|   57785|2013-07-25 00:00:...|      12347|     PROCESSING|
|   57787|2013-07-25 00:00:...|      12294|PENDING_PAYMENT|
|      51|2013-07-25 00:00:...|      12271|         CLOSED|
|     103|2013-07-25 00:00:...|      12256|     PROCESSING|
|      48|2013-07-25 00:00:...|      12186|     PROCESSING|
|     100|2013-07-25 00:00:...|      12131|     PROCESSING|
|       3|2013-07-25 00:00:...|      12111|       COMPLETE|
|      40|2013-07-25 00:00:...|      12092|PENDING_PAYMENT|
|   57779|2013-07-25 00:00:...|      11941|       COMPLETE|
|      70|2013-07-25 00:00:...|      11809|PENDING_PAYMENT|
|      59|2013-07-25 00:00:...|      11644|PENDING_PAYMENT|
|       1|2013-07-25 00:00:...|      11599|         CLOSED|
|      94|2013-07-25 00:00:...|      11589|     PROCESSING|
|      38|2013-07-25 00:00:...|      115
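
`sortWithinPartitions` orders the rows inside each partition but does not move rows between partitions, so the overall result is not guaranteed to be globally sorted. A plain-Python sketch with two hypothetical partitions (lists standing in for Spark partitions) illustrates the difference:

```python
# Two hypothetical partitions of (order_date, customer_id) rows.
partitions = [
    [("2013-07-26", 10), ("2013-07-25", 5), ("2013-07-25", 9)],
    [("2013-07-25", 7), ("2013-07-27", 1)],
]

def sort_key(row):
    """order_date ascending, customer_id descending."""
    return (row[0], -row[1])

def sort_within_partitions(parts):
    """Sort each partition on its own; rows never cross partition boundaries."""
    return [sorted(part, key=sort_key) for part in parts]

sorted_parts = sort_within_partitions(partitions)

# Reading the partitions one after another is not the same as a global sort.
flattened = [row for part in sorted_parts for row in part]
globally_sorted = sorted(flattened, key=sort_key)

print(sorted_parts)
print(flattened == globally_sorted)
```

When the data sits in a single partition the two results coincide, which may be why the output above looks fully sorted.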

### Development Life Cycle using Data Frames

* Open PyCharm
* New Project - set the name [spark2Demo] - set the interpreter to python3
* Add pyspark - references
* Settings - Project Structure - Add Content Root - add the python directory in spark and the py4j* file
* Create the directory src - main - python
* Write the code for DailyProductRevenue.py

#### Parameterize script
* In PyCharm, create a directory called resource at the same level as the python folder created above, and create a file called application.properties in it

To execute, move to the base directory and submit the script:
***

cd /home/pi/shared/PySparkProjects/Spark2Demo

export SPARK_MAJOR_VERSION=2

spark-submit src/main/python/DailyProductRevenue.py prod

***
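
One way the script could use the `prod` argument is to treat it as a section name in application.properties. The sketch below rests on assumptions: the section layout, the key names (`executionMode`, `input.base.dir`, `output.base.dir`) and the output directory are hypothetical and do not come from the original project. It uses only the standard library and writes a temporary file so that it runs anywhere.

```python
import configparser
import os
import tempfile

# Hypothetical contents of application.properties: one section per environment.
SAMPLE_PROPERTIES = """\
[dev]
executionMode = local
input.base.dir = /tmp/retail_db
output.base.dir = /tmp/daily_product_revenue

[prod]
executionMode = yarn
input.base.dir = /user/pi/retail_db
output.base.dir = /user/pi/daily_product_revenue
"""

def load_settings(properties_path, env):
    """Return the key/value settings of one environment section as a dict."""
    parser = configparser.RawConfigParser()
    parser.optionxform = str  # keep key case as written instead of lowercasing
    parser.read(properties_path)
    return dict(parser.items(env))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "application.properties")
    with open(path, "w") as handle:
        handle.write(SAMPLE_PROPERTIES)
    env = "prod"  # in a real script this would come from sys.argv[1]
    settings = load_settings(path, env)

print(settings["executionMode"], settings["input.base.dir"])
```

In the real script, values like these would typically feed the SparkSession master setting and the read and write paths, so the same code can run locally or on the cluster.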