### Section 11: Apache Spark 2.x - Processing Data using Data Frames - Window Functions

- Window Function - APIs
- Problem Statement
- Creating Window Spec
- Performing Aggregation
- Using Windowing Functions
- Ranking Functions
- Development Life Cycle

### Data Frames - Window Functions APIs - Overview

* Main package: pyspark.sql.window
* It has classes such as Window and WindowSpec
* Window has APIs such as partitionBy, orderBy etc.
* These APIs (such as partitionBy) return a WindowSpec object.
* We can pass a WindowSpec object to over() on functions such as rank(), dense_rank(), sum() etc.
* Syntax: rank().over(spec) where spec = Window.partitionBy('ColumnName1').orderBy('ColumnName2') (ranking functions need an ordered window)
* Functions that can be used with over() fall into three groups:
    * Aggregations - sum, avg, min, max etc.
    * Ranking - rank, dense_rank, row_number etc.
    * Windowing - lead, lag etc.


In [1]:
#load order_items
orderItems = spark.read.format("csv"). \
            schema("order_item_id int,order_item_order_id int,order_item_product_id int,order_item_quantity_id int,order_item_subtotal float,order_item_product_price float"). \
            load("/user/pi/retail_db/order_items")
orderItems.show(10,False)

+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity_id|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+----------------------+-------------------+------------------------+
|1            |1                  |957                  |1                     |299.98             |299.98                  |
|2            |2                  |1073                 |1                     |199.99             |199.99                  |
|3            |2                  |502                  |5                     |250.0              |50.0                    |
|4            |2                  |403                  |1                     |129.99             |129.99                  |
|5            |4                  |897                  |2                     |49.98              |24.99             

In [28]:
#get the total order revenue that corresponds to each order item record
from pyspark.sql.window import Window
#step 1 - create window spec object
spec = Window.partitionBy("order_item_order_id")
type(spec)

pyspark.sql.window.WindowSpec

In [31]:
from pyspark.sql.functions import sum,round
#step 2 - apply the aggregate function over the window spec
orderItems.withColumn('order_revenue',round(sum('order_item_subtotal').over(spec),2)). \
            select('order_item_id','order_item_order_id','order_item_subtotal','order_revenue'). \
            show(10,False)

+-------------+-------------------+-------------------+-------------+
|order_item_id|order_item_order_id|order_item_subtotal|order_revenue|
+-------------+-------------------+-------------------+-------------+
|348          |148                |100.0              |479.99       |
|349          |148                |250.0              |479.99       |
|350          |148                |129.99             |479.99       |
|1129         |463                |239.96             |829.92       |
|1130         |463                |250.0              |829.92       |
|1131         |463                |39.99              |829.92       |
|1132         |463                |299.97             |829.92       |
|1153         |471                |39.99              |169.98       |
|1154         |471                |129.99             |169.98       |
|1223         |496                |59.99              |441.95       |
+-------------+-------------------+-------------------+-------------+
only showing top 10 rows
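
To make the behaviour concrete, here is a plain-Python sketch (no Spark session needed) of what `sum('order_item_subtotal').over(spec)` computes: every row is kept, and each row receives the total of its own partition. The rows are copied from the output above; the helper name `window_sum` is hypothetical and not part of PySpark.

```python
from collections import defaultdict

# (order_item_id, order_item_order_id, order_item_subtotal) copied from the output above
rows = [
    (348, 148, 100.0),
    (349, 148, 250.0),
    (350, 148, 129.99),
    (1153, 471, 39.99),
    (1154, 471, 129.99),
]

def window_sum(rows):
    """Hypothetical helper: attach the per-order total to every row."""
    totals = defaultdict(float)
    for _, order_id, subtotal in rows:
        totals[order_id] += subtotal
    # Unlike groupBy, no rows are collapsed: one output row per input row.
    return [(item_id, order_id, subtotal, round(totals[order_id], 2))
            for item_id, order_id, subtotal in rows]

for row in window_sum(rows):
    print(row)
```

This is the main difference from groupBy().agg(): the aggregate is repeated on each detail row instead of replacing the detail rows.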

### Define Problem Statement - Get Top N Daily Products
* Problem Statement - Get top N products per day
* Reuse the daily product revenue code from the previous topic
* Use ranking functions to get the rank of each product based on revenue for each day
* Once we have the rank, filter for the top N products

In [44]:
orders=spark.read.format('csv').schema('order_id int,order_date string,customer_id int,order_status string'). \
                               load('/user/pi/retail_db/orders')

In [45]:
orderItems = spark.read.format('csv'). \
schema('order_item_id int,order_item_order_id int,order_item_product_id int,order_item_quantity_id int,order_item_subtotal float,order_item_product_price float'). \
load('/user/pi/retail_db/order_items')

In [47]:
dailyProductRevenue = orders.select('order_id','order_date'). \
    join(orderItems,orders.order_id==orderItems.order_item_order_id,'inner'). \
    groupby('order_date','order_item_product_id'). \
    agg(round(sum('order_item_subtotal'),2).alias('revenue')). \
    sort(['order_date','revenue'],ascending=[1,0])
dailyProductRevenue.show(10,False)

+---------------------+---------------------+--------+
|order_date           |order_item_product_id|revenue |
+---------------------+---------------------+--------+
|2013-07-25 00:00:00.0|1004                 |10799.46|
|2013-07-25 00:00:00.0|957                  |9599.36 |
|2013-07-25 00:00:00.0|191                  |8499.15 |
|2013-07-25 00:00:00.0|365                  |7558.74 |
|2013-07-25 00:00:00.0|1073                 |6999.65 |
|2013-07-25 00:00:00.0|1014                 |6397.44 |
|2013-07-25 00:00:00.0|403                  |5589.57 |
|2013-07-25 00:00:00.0|502                  |5100.0  |
|2013-07-25 00:00:00.0|627                  |2879.28 |
|2013-07-25 00:00:00.0|226                  |599.99  |
+---------------------+---------------------+--------+
only showing top 10 rows
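
Given daily revenue rows like those above, the remaining steps of the problem statement (rank within each day, then keep ranks up to N) can be sketched in plain Python. This is only an illustration of the logic, not the Spark implementation: the helper `top_n_per_day` is hypothetical, the position it assigns behaves like row_number(), and the 2013-07-26 rows are made-up values added so the sketch has a second partition.

```python
from itertools import groupby

# (order_date, order_item_product_id, revenue)
# The 2013-07-25 rows come from the output above; the 2013-07-26 rows are
# hypothetical values added only so there is a second day to rank.
daily_revenue = [
    ("2013-07-25", 1004, 10799.46),
    ("2013-07-25", 957, 9599.36),
    ("2013-07-25", 191, 8499.15),
    ("2013-07-25", 365, 7558.74),
    ("2013-07-26", 365, 300.0),   # hypothetical
    ("2013-07-26", 1004, 900.0),  # hypothetical
    ("2013-07-26", 957, 600.0),   # hypothetical
]

def top_n_per_day(rows, n):
    """Hypothetical helper: rank by revenue within each day, keep rank <= n."""
    result = []
    # Sort by date, then by revenue descending, so groupby sees whole days.
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    for _, day_rows in groupby(ordered, key=lambda r: r[0]):
        for position, row in enumerate(day_rows, start=1):
            if position <= n:
                result.append(row + (position,))
    return result

for row in top_n_per_day(daily_revenue, 2):
    print(row)
```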



### Data Frame Operations - Creating Window Spec
* Get the daily order count as part of each order record
       
***

from pyspark.sql.window import Window

spec = Window.partitionBy('ColumnName1').orderBy('ColumnName2')

***
*For aggregate functions there is no need to specify orderBy*

*A spec can only be used over a DataFrame that has the columns mentioned in the Window Spec*

In [38]:
#get the daily order count as part of each order record
orders = spark.read.\
         csv("/user/pi/retail_db/orders", \
         schema="order_id int,order_date string,customer_id int,order_status string")
orders.show(10,False)         

+--------+---------------------+-----------+---------------+
|order_id|order_date           |customer_id|order_status   |
+--------+---------------------+-----------+---------------+
|1       |2013-07-25 00:00:00.0|11599      |CLOSED         |
|2       |2013-07-25 00:00:00.0|256        |PENDING_PAYMENT|
|3       |2013-07-25 00:00:00.0|12111      |COMPLETE       |
|4       |2013-07-25 00:00:00.0|8827       |CLOSED         |
|5       |2013-07-25 00:00:00.0|11318      |COMPLETE       |
|6       |2013-07-25 00:00:00.0|7130       |COMPLETE       |
|7       |2013-07-25 00:00:00.0|4530       |COMPLETE       |
|8       |2013-07-25 00:00:00.0|2911       |PROCESSING     |
|9       |2013-07-25 00:00:00.0|5657       |PENDING_PAYMENT|
|10      |2013-07-25 00:00:00.0|5648       |PENDING_PAYMENT|
+--------+---------------------+-----------+---------------+
only showing top 10 rows



In [43]:
from pyspark.sql.window import Window
from pyspark.sql.functions import count
spec = Window.partitionBy('order_date')
orders. \
    withColumn('daily_orders',count('order_id').over(spec)). \
    show(100,False)

+--------+---------------------+-----------+---------------+------------+
|order_id|order_date           |customer_id|order_status   |daily_orders|
+--------+---------------------+-----------+---------------+------------+
|3378    |2013-08-13 00:00:00.0|3155       |PROCESSING     |73          |
|3379    |2013-08-13 00:00:00.0|5437       |COMPLETE       |73          |
|3380    |2013-08-13 00:00:00.0|3519       |CLOSED         |73          |
|3381    |2013-08-13 00:00:00.0|10023      |ON_HOLD        |73          |
|3382    |2013-08-13 00:00:00.0|7856       |PENDING        |73          |
|3383    |2013-08-13 00:00:00.0|1523       |PENDING_PAYMENT|73          |
|3384    |2013-08-13 00:00:00.0|12398      |PENDING_PAYMENT|73          |
|3385    |2013-08-13 00:00:00.0|132        |COMPLETE       |73          |
|3386    |2013-08-13 00:00:00.0|2128       |PENDING        |73          |
|3387    |2013-08-13 00:00:00.0|2735       |COMPLETE       |73          |
|3388    |2013-08-13 00:00:00.0|12319 

### Data Frame Operations - Performing Aggregations using sum, avg etc
*For aggregate functions there is no need to specify orderBy*

In [32]:
#from daily product revenue, get the total, average, minimum and maximum revenue per day
#uses the col function

In [48]:
dailyProductRevenue.show(10,False)

+---------------------+---------------------+--------+
|order_date           |order_item_product_id|revenue |
+---------------------+---------------------+--------+
|2013-07-25 00:00:00.0|1004                 |10799.46|
|2013-07-25 00:00:00.0|957                  |9599.36 |
|2013-07-25 00:00:00.0|191                  |8499.15 |
|2013-07-25 00:00:00.0|365                  |7558.74 |
|2013-07-25 00:00:00.0|1073                 |6999.65 |
|2013-07-25 00:00:00.0|1014                 |6397.44 |
|2013-07-25 00:00:00.0|403                  |5589.57 |
|2013-07-25 00:00:00.0|502                  |5100.0  |
|2013-07-25 00:00:00.0|627                  |2879.28 |
|2013-07-25 00:00:00.0|226                  |599.99  |
+---------------------+---------------------+--------+
only showing top 10 rows



In [63]:
from pyspark.sql.functions import count,avg,min,max,sum,round
from pyspark.sql.functions import col
from pyspark.sql.window import Window
spec = Window.partitionBy('order_date')
dailyProductRevenue. \
    select('order_date','revenue'). \
    withColumn('DailyRevenue',round(sum('revenue').over(spec),2)). \
    withColumn('PercentageRevenue',round((dailyProductRevenue.revenue/col('DailyRevenue'))*100,2)). \
    withColumn('DailyAvgRevenue',round(avg('revenue').over(spec),2)). \
    withColumn('DailyMinRevenue',min('revenue').over(spec)). \
    withColumn('DailyMaxRevenue',max('revenue').over(spec)). \
    sort('order_date','DailyRevenue'). \
    show(10,False)

+---------------------+--------+------------+-----------------+---------------+---------------+---------------+
|order_date           |revenue |DailyRevenue|PercentageRevenue|DailyAvgRevenue|DailyMinRevenue|DailyMaxRevenue|
+---------------------+--------+------------+-----------------+---------------+---------------+---------------+
|2013-07-25 00:00:00.0|599.99  |68153.83    |0.88             |1842.0         |19.98          |10799.46       |
|2013-07-25 00:00:00.0|6397.44 |68153.83    |9.39             |1842.0         |19.98          |10799.46       |
|2013-07-25 00:00:00.0|2879.28 |68153.83    |4.22             |1842.0         |19.98          |10799.46       |
|2013-07-25 00:00:00.0|8499.15 |68153.83    |12.47            |1842.0         |19.98          |10799.46       |
|2013-07-25 00:00:00.0|6999.65 |68153.83    |10.27            |1842.0         |19.98          |10799.46       |
|2013-07-25 00:00:00.0|5589.57 |68153.83    |8.2              |1842.0         |19.98          |10799.46 
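
As a quick check of the PercentageRevenue column, the same arithmetic can be reproduced in plain Python with the values shown above. This is only a sketch of the formula `round((revenue / DailyRevenue) * 100, 2)`; the variable names are illustrative.

```python
# DailyRevenue and revenue values copied from the output above (2013-07-25)
daily_revenue = 68153.83
revenues = [599.99, 6397.44, 2879.28, 8499.15, 6999.65, 5589.57]

# Same arithmetic as round((revenue / DailyRevenue) * 100, 2)
percentages = [round(r / daily_revenue * 100, 2) for r in revenues]
print(percentages)
```

Each percentage is the share of the day's total revenue that the row contributes, so the values for one day add up to roughly 100 once all rows of that day are included.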

In [None]:
#from pyspark.sql.functions import col
#col() is used to refer to a column added earlier in the same chain (such as DailyRevenue),
#because that column is not an attribute of the original DataFrame

### Data Frame Operations - Time Series Functions such as Lead, Lag etc

In [81]:
#read employee data
from pyspark.sql import Row
employeeRaw = open('/home/pi/shared/datasets/employee.csv').read().splitlines()
employeeRdd = sc.parallelize(employeeRaw)
employee = employeeRdd.filter(lambda rw:rw.split(',')[0]!='EMPID'). \
        map(lambda r:Row(EMPID=r.split(',')[0],NAME=r.split(',')[1],JOBTITLE=r.split(',')[2],DEPTID=r.split(',')[3], \
                         DESCR=r.split(',')[4],HIRE_DT=r.split(',')[5],ANNUAL_RT=r.split(',')[6])).toDF()
employee.show()

+---------+------+--------------------+-----+---------------+--------------------+--------------------+
|ANNUAL_RT|DEPTID|               DESCR|EMPID|        HIRE_DT|            JOBTITLE|                NAME|
+---------+------+--------------------+-----+---------------+--------------------+--------------------+
|    32470|A50550|DPW-Water & Waste...| 1001| 8/27/2018 0:00|Utilities Inst Re...|      Aaron Kareem D|
|    60200|A03031|OED-Employment De...| 1002|10/24/1979 0:00|Facilities/Office...|    Aaron Patricia G|
|    64823|A02002|  City Council (002)| 1003|12/12/2016 0:00|  Council Technician|       Abadir Adam O|
|    53640|A99094|Police Department...| 1004| 4/17/2018 0:00|      Police Officer|Abaku Aigbolosimu...|
|    68562|A29011|States Attorneys ...| 1005| 5/22/2017 0:00|Assistant State's...|       Abbeduto Mack|
|    33280|A68002|     R&P-Parks (002)| 1006| 4/11/2018 0:00|Recreation Arts I...|      Abbott Ethan N|
|    75110|A90005| TRANS-Traffic (005)| 1007|11/28/2014 0:00|Ope

In [82]:
employee.write.csv('/user/pi/employee_data/employee')

In [84]:
! hdfs dfs -ls /user/pi/employee_data/employee

2020-06-05 20:03:25,556 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 17 items
-rw-r--r--   2 pi supergroup          0 2020-06-05 20:03 /user/pi/employee_data/employee/_SUCCESS
-rw-r--r--   2 pi supergroup      83031 2020-06-05 20:03 /user/pi/employee_data/employee/part-00000-d4884824-842e-4853-af5d-08abe7308656-c000.csv
-rw-r--r--   2 pi supergroup      82511 2020-06-05 20:03 /user/pi/employee_data/employee/part-00001-d4884824-842e-4853-af5d-08abe7308656-c000.csv
-rw-r--r--   2 pi supergroup      83589 2020-06-05 20:03 /user/pi/employee_data/employee/part-00002-d4884824-842e-4853-af5d-08abe7308656-c000.csv
-rw-r--r--   2 pi supergroup      82781 2020-06-05 20:03 /user/pi/employee_data/employee/part-00003-d4884824-842e-4853-af5d-08abe7308656-c000.csv
-rw-r--r--   2 pi supergroup      83280 2020-06-05 20:03 /user/pi/employee_data/employee/part-00004-d4884824-842e-4853-af5d-08abe7308656-c000.csv
-rw

In [88]:
employee = spark.read.csv("/user/pi/employee_data/employee", \
                          schema='SALARY float,DEPTID string,DESCR string,EMPID int,HIRE_DT string,JOBTITLE string,NAME string')
employee.show()

+--------+------+--------------------+-----+---------------+--------------------+-----------------+
|  SALARY|DEPTID|               DESCR|EMPID|        HIRE_DT|            JOBTITLE|             NAME|
+--------+------+--------------------+-----+---------------+--------------------+-----------------+
| 38926.0|A29009|States Attorneys ...|13945| 5/20/2019 0:00|       Law Clerk SAO|    Webb Edward J|
| 39102.0|A02002|  City Council (002)|13946|  1/4/2017 0:00|   Council Assistant|   Webb Frances M|
|131438.0|A99190|Police Department...|13947|  5/8/1992 0:00|        Police Major|      Webb John O|
| 33680.0|A68009|     R&P-Parks (009)|13948|  4/8/2004 0:00|        Utility Aide|      Webb Kern A|
| 36202.0|A49507|TRANS-Highways (507)|13949|  7/1/1991 0:00|      Laborer Hourly|   Webb Michael E|
| 38002.0|A65116|HLTH-Health Dept....|13950| 6/25/2008 0:00|Medical Office As...|  Webb Rochelle M|
| 33192.0|A65016|HLTH-Health Depar...|13951| 11/7/1994 0:00|Medical Office As...|  Webb Rochelle M|


In [94]:
from pyspark.sql.window import Window
spec = Window.partitionBy('DEPTID'). \
            orderBy(employee.SALARY.desc())

In [101]:
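from pyspark.sql.functions import lead,lag,first,last
#spec is ordered by SALARY descending, so lead() gives the next lower salary and lag() the next higher one
#NOTE: first() and last() are both written to 'HighestSal', so the last() column replaces the first() one
#and the HighestSal column in the output below holds the last() values;
#use a separate name such as 'LowestSal' for last() to keep both columns
employee.select('EMPID','DEPTID','SALARY'). \
    withColumn('NextLowSal',lead('SALARY',1).over(spec)). \
    withColumn('NextHighSal',lag('SALARY',1).over(spec)). \
    withColumn('HighestSal',first('SALARY').over(spec)). \
    withColumn('HighestSal',last('SALARY').over(spec)). \
    show(10,False)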

+-----+------+-------+----------+-----------+----------+
|EMPID|DEPTID|SALARY |NextLowSal|NextHighSal|HighestSal|
+-----+------+-------+----------+-----------+----------+
|1067 |A30003|83856.0|67830.0   |null       |83856.0   |
|3976 |A30003|67830.0|65000.0   |83856.0    |67830.0   |
|5098 |A30003|65000.0|50927.0   |67830.0    |65000.0   |
|3030 |A30003|50927.0|44061.0   |65000.0    |50927.0   |
|3720 |A30003|44061.0|null      |50927.0    |44061.0   |
|2505 |A64466|94303.0|84784.0   |null       |94303.0   |
|6661 |A64466|84784.0|84177.0   |94303.0    |84784.0   |
|14785|A64466|84177.0|83338.0   |84784.0    |84177.0   |
|7280 |A64466|83338.0|78721.0   |84177.0    |83338.0   |
|8125 |A64466|78721.0|78721.0   |83338.0    |78721.0   |
+-----+------+-------+----------+-----------+----------+
only showing top 10 rows
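
The NextLowSal and NextHighSal columns can be reproduced with a small plain-Python sketch of lead and lag, using the A30003 salaries from the output above. The two helpers are hypothetical stand-ins for the Spark functions, they assume a positive offset, and they return None where Spark shows null.

```python
# SALARY values for DEPTID A30003, already sorted descending (from the output above)
salaries = [83856.0, 67830.0, 65000.0, 50927.0, 44061.0]

def lead(values, offset=1):
    """Hypothetical helper: value `offset` rows after the current row, else None."""
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]

def lag(values, offset=1):
    """Hypothetical helper: value `offset` rows before the current row, else None."""
    return [values[i - offset] if i - offset >= 0 else None
            for i in range(len(values))]

for salary, next_low, next_high in zip(salaries, lead(salaries), lag(salaries)):
    print(salary, next_low, next_high)
```

Because the partition is ordered by salary descending, the row after the current one holds the next lower salary and the row before it holds the next higher salary.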



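 - As per the above result, last() is not giving the expected result: it returns the current row's salary instead of the lowest salary in the department
 - When orderBy is specified, the default frame runs from unboundedPreceding to currentRow, so the last value in the frame is the current row
 - So we need to modify the Window spec to add rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)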
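
The effect of the frame on last() can be illustrated without Spark. The sketch below uses the A30003 salaries and two hypothetical helpers, one per frame. It ignores ties, which a range-based frame would treat as peers of the current row; the sample has none.

```python
# SALARY values for DEPTID A30003, sorted descending (from the output above)
salaries = [83856.0, 67830.0, 65000.0, 50927.0, 44061.0]

def last_growing_frame(values):
    """last() when the frame ends at the current row (the default once orderBy is set)."""
    # The frame for row i is values[0..i], so its last element is values[i] itself.
    return [values[: i + 1][-1] for i in range(len(values))]

def last_unbounded_frame(values):
    """last() when the frame spans the whole partition."""
    # Every row sees the full partition, so the last element is always values[-1].
    return [values[-1] for _ in values]

print(last_growing_frame(salaries))    # same as the input: each row returns itself
print(last_unbounded_frame(salaries))  # the lowest salary repeated for every row
```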

In [104]:
help(Window)

Help on class Window in module pyspark.sql.window:

class Window(builtins.object)
 |  Utility functions for defining window in DataFrames.
 |  
 |  For example:
 |  
 |  >>> # ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
 |  >>> window = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
 |  
 |  >>> # PARTITION BY country ORDER BY date RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING
 |  >>> window = Window.orderBy("date").partitionBy("country").rangeBetween(-3, 3)
 |  
 |  .. note:: When ordering is not defined, an unbounded window frame (rowFrame,
 |       unboundedPreceding, unboundedFollowing) is used by default. When ordering is defined,
 |       a growing window frame (rangeFrame, unboundedPreceding, currentRow) is used by default.
 |  
 |  .. note:: Experimental
 |  
 |  .. versionadded:: 1.4
 |  
 |  Static methods defined here:
 |  
 |  orderBy(*cols)
 |      Creates a :class:`WindowSpec` with the ordering defined.
 |      
 |      .. 

In [111]:
from pyspark.sql.window import Window
spec = Window.partitionBy('DEPTID'). \
            orderBy(employee.SALARY.desc()). \
            rangeBetween(Window.unboundedPreceding,Window.unboundedFollowing)

In [116]:
from pyspark.sql.functions import lead,lag,first,last
employee.select('EMPID','DEPTID','SALARY'). \
    withColumn('LowestSal',last('SALARY',False).over(spec)). \
    show(10,False)

+-----+------+-------+---------+
|EMPID|DEPTID|SALARY |LowestSal|
+-----+------+-------+---------+
|1067 |A30003|83856.0|44061.0  |
|3976 |A30003|67830.0|44061.0  |
|5098 |A30003|65000.0|44061.0  |
|3030 |A30003|50927.0|44061.0  |
|3720 |A30003|44061.0|44061.0  |
|2505 |A64466|94303.0|34298.0  |
|6661 |A64466|84784.0|34298.0  |
|14785|A64466|84177.0|34298.0  |
|7280 |A64466|83338.0|34298.0  |
|8125 |A64466|78721.0|34298.0  |
+-----+------+-------+---------+
only showing top 10 rows



### Data Frame Operations - Ranking Functions - rank, dense_rank, row_number etc

In [2]:
#Assign rank to employees based on salary within each department
employee = spark.read.csv("/user/pi/employee_data/employee", \
                          schema='SALARY float,DEPTID string,DESCR string,EMPID int,HIRE_DT string,JOBTITLE string,NAME string')
employee.show()

+--------+------+--------------------+-----+---------------+--------------------+-----------------+
|  SALARY|DEPTID|               DESCR|EMPID|        HIRE_DT|            JOBTITLE|             NAME|
+--------+------+--------------------+-----+---------------+--------------------+-----------------+
| 38926.0|A29009|States Attorneys ...|13945| 5/20/2019 0:00|       Law Clerk SAO|    Webb Edward J|
| 39102.0|A02002|  City Council (002)|13946|  1/4/2017 0:00|   Council Assistant|   Webb Frances M|
|131438.0|A99190|Police Department...|13947|  5/8/1992 0:00|        Police Major|      Webb John O|
| 33680.0|A68009|     R&P-Parks (009)|13948|  4/8/2004 0:00|        Utility Aide|      Webb Kern A|
| 36202.0|A49507|TRANS-Highways (507)|13949|  7/1/1991 0:00|      Laborer Hourly|   Webb Michael E|
| 38002.0|A65116|HLTH-Health Dept....|13950| 6/25/2008 0:00|Medical Office As...|  Webb Rochelle M|
| 33192.0|A65016|HLTH-Health Depar...|13951| 11/7/1994 0:00|Medical Office As...|  Webb Rochelle M|


In [13]:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank,dense_rank,row_number
spec = Window.partitionBy('DEPTID').orderBy(employee.SALARY.desc())

employee.select('EMPID','DEPTID','SALARY'). \
        withColumn('Rank',rank().over(spec)). \
        withColumn('DenseRank',dense_rank().over(spec)). \
        withColumn('RowNumber',row_number().over(spec)). \
        show()

+-----+------+-------+----+---------+---------+
|EMPID|DEPTID| SALARY|Rank|DenseRank|RowNumber|
+-----+------+-------+----+---------+---------+
| 1067|A30003|83856.0|   1|        1|        1|
| 3976|A30003|67830.0|   2|        2|        2|
| 5098|A30003|65000.0|   3|        3|        3|
| 3030|A30003|50927.0|   4|        4|        4|
| 3720|A30003|44061.0|   5|        5|        5|
| 2505|A64466|94303.0|   1|        1|        1|
| 6661|A64466|84784.0|   2|        2|        2|
|14785|A64466|84177.0|   3|        3|        3|
| 7280|A64466|83338.0|   4|        4|        4|
| 7122|A64466|78721.0|   5|        5|        5|
| 8125|A64466|78721.0|   5|        5|        6|
|14567|A64466|78265.0|   7|        6|        7|
| 3539|A64466|78265.0|   7|        6|        8|
|10842|A64466|75054.0|   9|        7|        9|
| 1573|A64466|75054.0|   9|        7|       10|
| 7685|A64466|75054.0|   9|        7|       11|
| 8542|A64466|75054.0|   9|        7|       12|
|13504|A64466|75054.0|   9|        7|   
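
The three ranking functions differ only in how they treat ties. A plain-Python sketch over the first nine A64466 salaries from the output above makes this visible; the helper `ranking` is hypothetical and assumes the values are already sorted in the window's order.

```python
# SALARY values for DEPTID A64466, sorted descending (from the output above)
salaries = [94303.0, 84784.0, 84177.0, 83338.0, 78721.0, 78721.0,
            78265.0, 78265.0, 75054.0]

def ranking(values):
    """Hypothetical helper: (rank, dense_rank, row_number) for pre-sorted values."""
    result = []
    rank = dense = 0
    previous = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(values, start=1):
        if value != previous:
            rank = row_number   # rank jumps to the row position, leaving gaps after ties
            dense += 1          # dense_rank advances by exactly one per distinct value
            previous = value
        result.append((rank, dense, row_number))
    return result

for salary, ranks in zip(salaries, ranking(salaries)):
    print(salary, ranks)
```

row_number is the only one of the three that never repeats within a partition, which is why it returns exactly N rows per partition when filtering for the top N, while rank and dense_rank can return more when there are ties.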

In [19]:
#Get top N=5 purchased products Per day
orders=spark.read.format('csv').schema('order_id int,order_date string,customer_id int,order_status string'). \
                               load('/user/pi/retail_db/orders')
#load order_items
orderItems = spark.read.format("csv"). \
            schema("order_item_id int,order_item_order_id int,order_item_product_id int,order_item_quantity_id int,order_item_subtotal float,order_item_product_price float"). \
            load("/user/pi/retail_db/order_items")

from pyspark.sql.functions import count
dailyProducts = orders.select('order_id','order_date').join(orderItems,orders.order_id==orderItems.order_item_order_id). \
                        groupBy('order_date','order_item_product_id'). \
                        agg(count('order_item_product_id').alias("cnt"))

In [43]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number,col
spec = Window.partitionBy('order_date').orderBy(dailyProducts.cnt.desc())
topNProducts = dailyProducts. \
    withColumn('ProductFreqRank',row_number().over(spec)). \
    where('ProductFreqRank<=5')
topNProducts.show()

+--------------------+---------------------+---+---------------+
|          order_date|order_item_product_id|cnt|ProductFreqRank|
+--------------------+---------------------+---+---------------+
|2013-08-13 00:00:...|                  365| 33|              1|
|2013-08-13 00:00:...|                  403| 32|              2|
|2013-08-13 00:00:...|                  502| 29|              3|
|2013-08-13 00:00:...|                 1073| 20|              4|
|2013-08-13 00:00:...|                  957| 18|              5|
|2013-10-12 00:00:...|                  502| 58|              1|
|2013-10-12 00:00:...|                  403| 54|              2|
|2013-10-12 00:00:...|                  365| 50|              3|
|2013-10-12 00:00:...|                 1014| 43|              4|
|2013-10-12 00:00:...|                 1073| 41|              5|
|2013-11-15 00:00:...|                  403| 51|              1|
|2013-11-15 00:00:...|                  502| 48|              2|
|2013-11-15 00:00:...|   

In [40]:
productsCSV = spark.read.format('csv').load('/user/pi/retail_db/products').toDF('product_id','col1','products_desc','col2','col3','col4')
products = productsCSV. \
            select('product_id','products_desc'). \
            withColumn('product_id',productsCSV.product_id.cast('int'))
products.show()

+----------+--------------------+
|product_id|       products_desc|
+----------+--------------------+
|         1|Quest Q64 10 FT. ...|
|         2|Under Armour Men'...|
|         3|Under Armour Men'...|
|         4|Under Armour Men'...|
|         5|Riddell Youth Rev...|
|         6|Jordan Men's VI R...|
|         7|Schutt Youth Recr...|
|         8|Nike Men's Vapor ...|
|         9|Nike Adult Vapor ...|
|        10|Under Armour Men'...|
|        11|Fitness Gear 300 ...|
|        12|Under Armour Men'...|
|        13|Under Armour Men'...|
|        14|Quik Shade Summit...|
|        15|Under Armour Kids...|
|        16|Riddell Youth 360...|
|        17|Under Armour Men'...|
|        18|Reebok Men's Full...|
|        19|Nike Men's Finger...|
|        20|Under Armour Men'...|
+----------+--------------------+
only showing top 20 rows



In [50]:
topNProducts.join(products,topNProducts.order_item_product_id==products.product_id). \
            select('order_date','products_desc','cnt'). \
            orderBy(['order_date','cnt'],ascending=[0,0]). \
            show(truncate=False)

+---------------------+---------------------------------------------+---+
|order_date           |products_desc                                |cnt|
+---------------------+---------------------------------------------+---+
|2014-07-24 00:00:00.0|Perfect Fitness Perfect Rip Deck             |72 |
|2014-07-24 00:00:00.0|Nike Men's CJ Elite 2 TD Football Cleat      |61 |
|2014-07-24 00:00:00.0|Nike Men's Dri-FIT Victory Golf Polo         |56 |
|2014-07-24 00:00:00.0|O'Brien Men's Neoprene Life Vest             |51 |
|2014-07-24 00:00:00.0|Field & Stream Sportsman 16 Gun Fire Safe    |48 |
|2014-07-23 00:00:00.0|Perfect Fitness Perfect Rip Deck             |62 |
|2014-07-23 00:00:00.0|Nike Men's Dri-FIT Victory Golf Polo         |49 |
|2014-07-23 00:00:00.0|Field & Stream Sportsman 16 Gun Fire Safe    |48 |
|2014-07-23 00:00:00.0|O'Brien Men's Neoprene Life Vest             |47 |
|2014-07-23 00:00:00.0|Nike Men's CJ Elite 2 TD Football Cleat      |47 |
|2014-07-22 00:00:00.0|Nike Men's CJ E