<ul>
<li>Tables should be in the Hive database &lt;YOUR_USER_ID&gt;_retail_db_txt
<ul>
<li>orders</li>
<li>order_items</li>
<li>customers</li>
</ul>
</li>
<li>The time taken to create the database and tables need not be counted. Go back to the Spark SQL module to create the tables and load the data</li>
<li>Get the details of the top 5 customers by revenue for each month</li>
<li>The output needs all the details of the customer, along with the month and the revenue per month</li>
<li>Data needs to be sorted by month in ascending order and by revenue per month in descending order</li>
<li>Create the table top5_customers_per_month in &lt;YOUR_USER_ID&gt;_retail_db_txt</li>
<li>Insert the output into the newly created table</li>
</ul>
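The requirements above can be sketched on plain Python data before turning to Spark. This is only an illustration under assumptions: the function name `top_n_customers_per_month` and the sample rows are made up, and the sketch keeps the first `n` customers per month rather than applying a dense rank. It shows that order revenue is summed per customer and month before the top customers are picked.

```python
from collections import defaultdict

def top_n_customers_per_month(rows, n=5):
    """rows: iterable of (order_month, customer_id, order_revenue) tuples."""
    # Sum order revenue into one total per (month, customer).
    totals = defaultdict(float)
    for month, customer_id, revenue in rows:
        totals[(month, customer_id)] += revenue

    # Group the per-customer totals by month.
    by_month = defaultdict(list)
    for (month, customer_id), revenue in totals.items():
        by_month[month].append((customer_id, round(revenue, 2)))

    # Month ascending, revenue descending, keep the first n per month.
    result = []
    for month in sorted(by_month):
        ranked = sorted(by_month[month], key=lambda c: c[1], reverse=True)
        result.extend((month, cid, rev) for cid, rev in ranked[:n])
    return result

# Hypothetical sample: customer 1 has two orders in 2013-07.
sample = [
    ("2013-07", 1, 100.0),
    ("2013-07", 1, 50.0),
    ("2013-07", 2, 120.0),
    ("2013-08", 3, 10.0),
    ("2013-08", 2, 30.0),
]
top2 = top_n_customers_per_month(sample, n=2)
```

In this sample, customer 1 leads 2013-07 only because the two orders are added together; ranked order by order, customer 2 would come first.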

In [52]:
from pyspark.sql.functions import date_format, col

# Read orders, keep the id, date and customer id columns, and derive the order month.
# 'yyyy' is the calendar year; 'YYYY' is the week-based year, which can differ around New Year.
orders = spark.read.csv("/user/pi/retail_db/orders"). \
    select('_c0', '_c1', '_c2'). \
    toDF('order_id', 'order_date', 'orders_customer_id'). \
    withColumn('order_id', col('order_id').cast('int')). \
    withColumn('order_month', date_format('order_date', 'yyyy-MM')). \
    withColumn('orders_customer_id', col('orders_customer_id').cast('int')). \
    select('order_id', 'orders_customer_id', 'order_month')
orders.show(10)

+--------+------------------+-----------+
|order_id|orders_customer_id|order_month|
+--------+------------------+-----------+
|       1|             11599|    2013-07|
|       2|               256|    2013-07|
|       3|             12111|    2013-07|
|       4|              8827|    2013-07|
|       5|             11318|    2013-07|
|       6|              7130|    2013-07|
|       7|              4530|    2013-07|
|       8|              2911|    2013-07|
|       9|              5657|    2013-07|
|      10|              5648|    2013-07|
+--------+------------------+-----------+
only showing top 10 rows



In [53]:
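# Note: importing sum and round from pyspark.sql.functions shadows Python's built-ins in this notebook
from pyspark.sql.functions import col, sum, round

# Total revenue per order: _c1 is the order id and _c4 the item subtotal
order_items = spark.read.csv("/user/pi/retail_db/order_items"). \
    select(col('_c1').cast('int').alias('order_id'), col('_c4').cast('float').alias('revenue')). \
    groupBy('order_id'). \
    agg(round(sum('revenue'), 2).alias('revenue'))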
order_items.show(10)

+--------+-------+
|order_id|revenue|
+--------+-------+
|     148| 479.99|
|     463| 829.92|
|     471| 169.98|
|     496| 441.95|
|    1088| 249.97|
|    1580| 299.95|
|    1591| 439.86|
|    1645|1509.79|
|    2366| 299.97|
|    2659| 724.91|
+--------+-------+
only showing top 10 rows



In [54]:
customers = spark.read.csv('/user/pi/retail_db/customers',schema='customer_id int,customer_fname string, \
                          customer_lname string,col1 string,col2 string,address string,location string,code string, \
                          postalcode string')
customers.show()

+-----------+--------------+--------------+---------+---------+--------------------+-------------+----+----------+
|customer_id|customer_fname|customer_lname|     col1|     col2|             address|     location|code|postalcode|
+-----------+--------------+--------------+---------+---------+--------------------+-------------+----+----------+
|          1|       Richard|     Hernandez|XXXXXXXXX|XXXXXXXXX|  6303 Heather Plaza|  Brownsville|  TX|     78521|
|          2|          Mary|       Barrett|XXXXXXXXX|XXXXXXXXX|9526 Noble Embers...|    Littleton|  CO|     80126|
|          3|           Ann|         Smith|XXXXXXXXX|XXXXXXXXX|3422 Blue Pioneer...|       Caguas|  PR|     00725|
|          4|          Mary|         Jones|XXXXXXXXX|XXXXXXXXX|  8324 Little Common|   San Marcos|  CA|     92069|
|          5|        Robert|        Hudson|XXXXXXXXX|XXXXXXXXX|10 Crystal River ...|       Caguas|  PR|     00725|
|          6|          Mary|         Smith|XXXXXXXXX|XXXXXXXXX|3151 Sleepy Quail

In [55]:
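# Join orders with per-order revenue. Revenue is still per order here, so ranking customers
# by monthly revenue would first need a sum per (order_month, orders_customer_id).
monthlyCustomerRev = orders.join(order_items, orders.order_id == order_items.order_id). \
    select('order_month', 'revenue', 'orders_customer_id')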
monthlyCustomerRev.show()

+-----------+-------+------------------+
|order_month|revenue|orders_customer_id|
+-----------+-------+------------------+
|    2013-07| 299.98|             11599|
|    2013-07| 579.98|               256|
|    2013-07| 699.85|              8827|
|    2013-07|1129.86|             11318|
|    2013-07| 579.92|              4530|
|    2013-07| 729.84|              2911|
|    2013-07| 599.96|              5657|
|    2013-07| 651.92|              5648|
|    2013-07| 919.79|               918|
|    2013-07|1299.87|              1837|
|    2013-07| 127.96|              9149|
|    2013-07| 549.94|              9842|
|    2013-07| 925.91|              2568|
|    2013-07| 419.93|              7276|
|    2013-07| 694.84|              2667|
|    2013-07| 449.96|              1205|
|    2013-07| 699.96|              9488|
|    2013-07| 879.86|              9198|
|    2013-07| 372.91|              2711|
|    2013-07| 299.98|              4367|
+-----------+-------+------------------+
only showing top

In [56]:
from pyspark.sql.window import Window
spec = Window.partitionBy('order_month'). \
        orderBy(monthlyCustomerRev.revenue.desc())

In [57]:
from pyspark.sql.functions import dense_rank
monthlytop5customer = monthlyCustomerRev. \
                withColumn('rank',dense_rank().over(spec)). \
                where('rank <=5')
monthlytop5customer.show()

+-----------+-------+------------------+----+
|order_month|revenue|orders_customer_id|rank|
+-----------+-------+------------------+----+
|    2013-09|2859.89|              1148|   1|
|    2013-09|2199.99|              4258|   2|
|    2013-09| 1849.9|              1607|   3|
|    2013-09|1829.86|              1420|   4|
|    2013-09|1819.63|              9750|   5|
|    2013-12| 2699.9|               382|   1|
|    2013-12| 2039.8|              1578|   2|
|    2013-12|1829.86|              3366|   3|
|    2013-12|1799.89|              1915|   4|
|    2013-12| 1759.9|              9967|   5|
|    2014-01| 2629.9|               986|   1|
|    2014-01| 1899.9|              2537|   2|
|    2014-01|1819.87|              4436|   3|
|    2014-01| 1799.9|              3051|   4|
|    2014-01|1729.87|              4089|   5|
|    2014-03|2779.86|              5946|   1|
|    2014-03|2629.92|             10351|   2|
|    2014-03|2329.94|              8769|   3|
|    2014-03|1979.83|             
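
The `rank <=5` filter in cell 57 keeps every row whose dense rank is at most 5, so tied revenue values can yield more than five rows for a month. The following Spark-free sketch illustrates this with made-up revenue values; `dense_rank_desc` is a hypothetical helper written for this example, not a PySpark function.

```python
def dense_rank_desc(values):
    """Return the dense rank of each value, ranking from largest to smallest."""
    # Distinct values in descending order; position + 1 is the dense rank.
    distinct = sorted(set(values), reverse=True)
    rank_of = {v: i + 1 for i, v in enumerate(distinct)}
    return [rank_of[v] for v in values]

# Hypothetical revenues for one month, with a tie at 400.0.
revenues = [500.0, 400.0, 400.0, 300.0, 200.0, 100.0, 50.0]
ranks = dense_rank_desc(revenues)

# Same filter as 'rank <=5': the tie lets six rows through.
kept = [rev for rev, rk in zip(revenues, ranks) if rk <= 5]
```

If exactly five rows per month are required, the `row_number` window function is one alternative, at the cost of breaking ties arbitrarily.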

In [48]:
customer_cols = customers.columns
customer_cols

['customer_id',
 'customer_fname',
 'customer_lname',
 'col1',
 'col2',
 'address',
 'location',
 'code',
 'postalcode']

In [99]:
# Attach the customer details, then sort by month ascending and revenue descending
monthlytop5customerDet = customers. \
    join(monthlytop5customer, customers.customer_id == monthlytop5customer.orders_customer_id). \
    select(customer_cols + ['order_month', 'revenue']). \
    sort(['order_month', 'revenue'], ascending=[True, False])
monthlytop5customerDet.createOrReplaceTempView('monthlytop5customer')

In [100]:
spark.sql('select * from monthlytop5customer limit 10').show()

+-----------+--------------+--------------+---------+---------+--------------------+--------------+----+----------+-----------+-------+
|customer_id|customer_fname|customer_lname|     col1|     col2|             address|      location|code|postalcode|order_month|revenue|
+-----------+--------------+--------------+---------+---------+--------------------+--------------+----+----------+-----------+-------+
|       1175|          Mary|          Gray|XXXXXXXXX|XXXXXXXXX|5079 Velvet Hicko...|        Caguas|  PR|     00725|    2013-07|1699.91|
|       9807|          Mary|         Lopez|XXXXXXXXX|XXXXXXXXX|6229 Clear Oak Lo...|         Vista|  CA|     92084|    2013-07| 1664.9|
|      11941|       Jeffrey|          Pugh|XXXXXXXXX|XXXXXXXXX|3233 Sleepy View ...|         Cayey|  PR|     00736|    2013-07| 1649.8|
|       2255|        Cheryl|         Kline|XXXXXXXXX|XXXXXXXXX| 599 Sleepy Townline|  Philadelphia|  PA|     19124|    2013-07|1629.79|
|      10235|        Joseph|         Singh|XXXXX

In [123]:
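# Note: the task asks for <YOUR_USER_ID>_retail_db_txt (ameen_retail_db_txt in this notebook);
# a table created after this statement lands in retail_db_txt instead
spark.sql('use retail_db_txt')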

DataFrame[]

In [125]:
spark.sql("drop table if exists top5_customers_per_month")
# Create an empty table with the view's schema, then insert the full result once
# (the earlier 'limit 10' on both statements stored only 10 rows, twice)
spark.sql("create table top5_customers_per_month as select * from monthlytop5customer where 1 = 0")
spark.sql("insert into top5_customers_per_month select * from monthlytop5customer")
spark.sql('select * from top5_customers_per_month limit 10').show()

+-----------+--------------+--------------+---------+---------+--------------------+--------------+----+----------+-----------+-------+
|customer_id|customer_fname|customer_lname|     col1|     col2|             address|      location|code|postalcode|order_month|revenue|
+-----------+--------------+--------------+---------+---------+--------------------+--------------+----+----------+-----------+-------+
|       1175|          Mary|          Gray|XXXXXXXXX|XXXXXXXXX|5079 Velvet Hicko...|        Caguas|  PR|     00725|    2013-07|1699.91|
|       9807|          Mary|         Lopez|XXXXXXXXX|XXXXXXXXX|6229 Clear Oak Lo...|         Vista|  CA|     92084|    2013-07| 1664.9|
|      11941|       Jeffrey|          Pugh|XXXXXXXXX|XXXXXXXXX|3233 Sleepy View ...|         Cayey|  PR|     00736|    2013-07| 1649.8|
|       2255|        Cheryl|         Kline|XXXXXXXXX|XXXXXXXXX| 599 Sleepy Townline|  Philadelphia|  PA|     19124|    2013-07|1629.79|
|      10235|        Joseph|         Singh|XXXXX

In [126]:
spark.sql('select * from ameen_retail_db_txt.top5_customers_per_month limit 10').show()

+-----------+--------------+--------------+----+----+-------+--------+----+----------+-----------+-------+
|customer_id|customer_fname|customer_lname|col1|col2|address|location|code|postalcode|order_month|revenue|
+-----------+--------------+--------------+----+----+-------+--------+----+----------+-----------+-------+
+-----------+--------------+--------------+----+----+-------+--------+----+----------+-----------+-------+



In [121]:
spark.sql('show tables').show(truncate=False)

+-------------------+------------------------+-----------+
|database           |tableName               |isTemporary|
+-------------------+------------------------+-----------+
|ameen_retail_db_txt|top5_customers_per_month|false      |
|                   |monthlytop5customer     |true       |
|                   |monthlytop5customer1    |true       |
+-------------------+------------------------+-----------+

