<H1 align = 'Center'> Spark Higher Level APIs </H1>
<H3> => Assignment 2 : Questions regarding Product Catalog</H3>
<BR>

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

In [2]:
spark = SparkSession.builder. \
appName("productCatalog"). \
config("spark.ui.port","0"). \
config("spark.sql.warehouse.dir","/user/itv012857/warehouse"). \
enableHiveSupport(). \
master("yarn"). \
getOrCreate()

In [3]:
spark

### => Setting up the Dataframe

In [4]:
# StructType(List or Tuple of class:`StructField`)
# StructField(name: string, dataType: class `DataType`, nullable: bool, metadata: dict)

In [5]:
productSchema = StructType(
                            (StructField('product_id',IntegerType(),False),
                             StructField('category',IntegerType(),False),
                             StructField('product_name',StringType(),False),
                             StructField('description',StringType(),True),
                             StructField('price',FloatType(),False),
                             StructField('image_url',StringType(),True)
                            )
                          )

In [6]:
products_df = spark.read.csv("/public/trendytech/retail_db/products/part-00000", \
                             schema = productSchema)

In [7]:
products_df.printSchema()

root
 |-- product_id: integer (nullable = true)
 |-- category: integer (nullable = true)
 |-- product_name: string (nullable = true)
 |-- description: string (nullable = true)
 |-- price: float (nullable = true)
 |-- image_url: string (nullable = true)



In [8]:
products_df.show()

+----------+--------+--------------------+-----------+------+--------------------+
|product_id|category|        product_name|description| price|           image_url|
+----------+--------+--------------------+-----------+------+--------------------+
|         1|       2|Quest Q64 10 FT. ...|       null| 59.98|http://images.acm...|
|         2|       2|Under Armour Men'...|       null|129.99|http://images.acm...|
|         3|       2|Under Armour Men'...|       null| 89.99|http://images.acm...|
|         4|       2|Under Armour Men'...|       null| 89.99|http://images.acm...|
|         5|       2|Riddell Youth Rev...|       null|199.99|http://images.acm...|
|         6|       2|Jordan Men's VI R...|       null|134.99|http://images.acm...|
|         7|       2|Schutt Youth Recr...|       null| 99.99|http://images.acm...|
|         8|       2|Nike Men's Vapor ...|       null|129.99|http://images.acm...|
|         9|       2|Nike Adult Vapor ...|       null|  50.0|http://images.acm...|
|   

In [9]:
products_df.createOrReplaceTempView('product_catalog')

In [10]:
spark.sql("describe extended product_catalog").show()

+------------+---------+-------+
|    col_name|data_type|comment|
+------------+---------+-------+
|  product_id|      int|   null|
|    category|      int|   null|
|product_name|   string|   null|
| description|   string|   null|
|       price|    float|   null|
|   image_url|   string|   null|
+------------+---------+-------+



<h3> 1. Find the total number of products in the dataset</h3>

In [24]:
products_df.select("product_name").distinct().count()

750

In [25]:
spark.sql("SELECT COUNT(DISTINCT(product_name)) FROM product_catalog")

count(DISTINCT product_name)
750


In [26]:
spark.sql("SELECT product_name, count(1) \
            FROM product_catalog \
            GROUP BY 1 \
            HAVING COUNT(1) > 1").show(truncate = False)

+---------------------------------------------+--------+
|product_name                                 |count(1)|
+---------------------------------------------+--------+
|Fitness Gear Heavy Bag Stand                 |2       |
|Callaway X Hot Laser Rangefinder             |3       |
|adidas Men's 2014 MLS All-Star Game Rose City|3       |
|Top Flite Kids' 2014 XLj Complete Set (Height|5       |
|YETI Roadie 20 Chest Cooler                  |2       |
|Nike Hyper Elite Crew Basketball Sock        |3       |
|Merrell Women's Azura Hiking Shoe            |2       |
|Merrell Men's Moab Rover Waterproof Wide Hiki|4       |
|Diamondback Adult Response XE Mountain Bike 2|3       |
|Nike Men's Vapor Carbon Elite TD Football Cle|4       |
|Schutt Youth Recruit Hybrid Custom Football H|3       |
|Fitbit Flex Wireless Activity & Sleep Wristba|13      |
|Majestic Youth Replica New York Yankees Derek|2       |
|LIJA Women's Golf Beanie                     |2       |
|Nike Adult Pro Combat Hyperstr

In [27]:
spark.sql("select * from product_catalog \
            where product_name = 'Nike Hyper Elite Crew Basketball Sock'").show()

+----------+--------+--------------------+-----------+-----+--------------------+
|product_id|category|        product_name|description|price|           image_url|
+----------+--------+--------------------+-----------+-----+--------------------+
|       120|       6|Nike Hyper Elite ...|       null| 18.0|http://images.acm...|
|       474|      21|Nike Hyper Elite ...|       null| 18.0|http://images.acm...|
|       617|      41|Nike Hyper Elite ...|       null| 18.0|http://images.acm...|
+----------+--------+--------------------+-----------+-----+--------------------+



<H4> So, there are 750 unique product names. Many of them are repeated across categories under different product IDs.</H4> <BR>

<H3> 2. Find the number of unique categories of products in the dataset</H3>

In [28]:
products_df.select("category").distinct().count()

55

In [29]:
spark.sql("SELECT COUNT(DISTINCT category) FROM product_catalog")

count(DISTINCT category)
55


<H4> There are 55 unique categories.</H4>

<H3> 3. Find the top 5 most expensive products based on their price, along with their product name, category, and image URL.</H3>

In [30]:
products_df.select("product_name","category","price","image_url") \
            .sort("price", ascending = False) \
            .limit(5)

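+--------------------+--------+-------+--------------------+
|        product_name|category|  price|           image_url|
+--------------------+--------+-------+--------------------+
| SOLE E35 Elliptical|      10|1999.99|http://images.acm...|
|  SOLE F85 Treadmill|       4|1799.99|http://images.acm...|
|  SOLE F85 Treadmill|      10|1799.99|http://images.acm...|
|  SOLE F85 Treadmill|      22|1799.99|http://images.acm...|
|"Spalding Beast 6...|      47|1099.99|http://images.acm...|
+--------------------+--------+-------+--------------------+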


In [31]:
spark.sql("SELECT product_name, category, price, image_url \
            FROM product_catalog \
            ORDER BY price DESC \
            LIMIT 5 \
          ")

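+--------------------+--------+-------+--------------------+
|        product_name|category|  price|           image_url|
+--------------------+--------+-------+--------------------+
| SOLE E35 Elliptical|      10|1999.99|http://images.acm...|
|  SOLE F85 Treadmill|       4|1799.99|http://images.acm...|
|  SOLE F85 Treadmill|      10|1799.99|http://images.acm...|
|  SOLE F85 Treadmill|      22|1799.99|http://images.acm...|
|"Spalding Beast 6...|      47|1099.99|http://images.acm...|
+--------------------+--------+-------+--------------------+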


### 4. Find the number of products in each category that have a price greater than $100. Display the results in a tabular format that shows the category and the number of products that satisfy the condition.

In [32]:
products_df.filter("price > 100") \
            .groupBy("category") \
            .count() \
            .show()

+--------+-----+
|category|count|
+--------+-----+
|      31|   17|
|      53|   16|
|      34|   15|
|      44|    9|
|      12|    3|
|      22|    4|
|      47|   10|
|      52|    5|
|      13|    1|
|       6|    5|
|      16|   11|
|       3|    5|
|      20|    7|
|      57|    6|
|      54|    6|
|      48|   17|
|       5|   11|
|      19|   13|
|      41|   11|
|      43|   23|
+--------+-----+
only showing top 20 rows



In [33]:
spark.sql("SELECT category, count(1) \
            FROM product_catalog \
            WHERE price > 100 \
            GROUP BY category")

category,count(1)
31,17
53,16
34,15
44,9
12,3
22,4
47,10
52,5
13,1
6,5


### 5. What are the product names and prices of products that have a price greater than $200 and belong to category 5?

In [34]:
products_df.filter("price > 200 and category = 5") \
            .select("product_name","price")

+--------------------+------+
|        product_name| price|
+--------------------+------+
|"Goaliath 54"" In...|499.99|
|Fitness Gear 300 ...|209.99|
|Teeter Hang Ups N...|299.99|
+--------------------+------+


In [35]:
spark.sql("SELECT product_name, price \
           FROM product_catalog \
           WHERE price > 200 AND category = 5")

+--------------------+------+
|        product_name| price|
+--------------------+------+
|"Goaliath 54"" In...|499.99|
|Fitness Gear 300 ...|209.99|
|Teeter Hang Ups N...|299.99|
+--------------------+------+


In [36]:
spark.stop()