# Ex2 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import FloatType

In [2]:
spark = SparkSession.builder.appName("ex1-chipotle").getOrCreate()
spark


### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 
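
The cells below read the file from a local path, so the download itself is not shown. A minimal sketch of that step, using only the standard library, might look like this. The helper name `download_file` and the destination `chipotle.tsv` are assumptions made for illustration, not part of the original notebook.

```python
import shutil
import urllib.request
from pathlib import Path

# URL taken from Step 2; the local destination used below is a hypothetical path.
CHIPOTLE_URL = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"


def download_file(url: str, destination: str) -> Path:
    """Download `url` to `destination`, creating parent directories as needed."""
    target = Path(destination)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response body straight into the target file.
    with urllib.request.urlopen(url) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Example call (needs network access), left commented out so the sketch has no side effects:
# local_path = download_file(CHIPOTLE_URL, "chipotle.tsv")
# chipo = spark.read.options(header=True, inferSchema=True, delimiter="\t").csv(str(local_path))
```

Saving the file first keeps the Spark read identical to the local-path version used in Step 3.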

### Step 3. Assign it to a variable called chipo.

In [3]:
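# First attempt: the default separator is a comma, so the tab-separated columns are not split correctly.
# The next cell sets the delimiter explicitly.
chipo = spark.read.option("header", True).option("inferSchema", True).csv("./../../datasets/chipotle/data.csv")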

In [4]:
chipo = spark.read.options(
        header=True,
        inferSchema=True,
        delimiter='\t'
    ).csv("./../../datasets/chipotle/data.csv")

In [5]:
chipo.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: string (nullable = true)



### Step 4. See the first 10 entries

In [6]:
chipo.show(n=10)

+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                NULL|    $2.39 |
|       1|       1|                Izze|        [Clementine]|    $3.39 |
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|       1|       1|Chips and Tomatil...|                NULL|    $2.39 |
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|   $10.98 |
|       3|       1|       Side of Chips|                NULL|    $1.69 |
|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|   $11.75 |
|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|    $9.25 |
|       5|       1|       Steak Burrito|[Fresh Tomato Sal...|    $9.25 |
+--------+--------+--------------------+--------------------+----------+

### Step 5. What is the number of observations in the dataset?

In [7]:
# Solution 1

chipo.count()

4622

### Step 6. What is the number of columns in the dataset?

In [8]:
len(chipo.columns)

5

### Step 7. Print the name of all the columns.

In [9]:
chipo.columns

['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']

### Step 8. How is the dataset indexed?

A Spark DataFrame, unlike a pandas DataFrame, has no row index, so this question has no direct equivalent here.

### Step 9. Which was the most-ordered item? 

In [10]:
# Option 1: This returns a DataFrame
item_quantity = chipo.groupBy("item_name").sum("quantity")


In [11]:
# Option 2: Use `agg` and the function `sum`
item_quantity_2 = chipo.groupBy("item_name").agg(func.sum("quantity").alias("total"))


In [12]:
item_quantity.show(n=5)

+-------------------+-------------+
|          item_name|sum(quantity)|
+-------------------+-------------+
|Carnitas Soft Tacos|           40|
| Chicken Soft Tacos|          120|
|              Salad|            2|
|        Steak Salad|            4|
|               Bowl|            4|
+-------------------+-------------+
only showing top 5 rows



In [13]:
item_quantity_2.show(n=5)

+-------------------+-----+
|          item_name|total|
+-------------------+-----+
|Carnitas Soft Tacos|   40|
| Chicken Soft Tacos|  120|
|              Salad|    2|
|        Steak Salad|    4|
|               Bowl|    4|
+-------------------+-----+
only showing top 5 rows



In [14]:
sorted_item_quantity_2 = item_quantity_2.sort("total", ascending=False)
sorted_item_quantity_2.show(n=1)

+------------+-----+
|   item_name|total|
+------------+-----+
|Chicken Bowl|  761|
+------------+-----+
only showing top 1 row



### Step 10. For the most-ordered item, how many items were ordered?

In [15]:
sorted_item_quantity_2.show(n=1)

+------------+-----+
|   item_name|total|
+------------+-----+
|Chicken Bowl|  761|
+------------+-----+
only showing top 1 row



### Step 11. What was the most ordered item in the choice_description column?

In [16]:
total_quantity_choice_descr = chipo.groupBy("choice_description").agg(func.sum("quantity").alias("total")).sort("total", ascending=False)

In [17]:
total_quantity_choice_descr.show(n=1)

+------------------+-----+
|choice_description|total|
+------------------+-----+
|              NULL| 1382|
+------------------+-----+
only showing top 1 row



In [18]:
# There are multiple ways to do this:
total_quantity_choice_descr_notnull = total_quantity_choice_descr.withColumn(
    "clean_total",
    # The file stores the literal string "NULL", so isNotNull() would not exclude these rows
    func.when(func.col("choice_description") != "NULL", func.col("total"))
    .otherwise(0)
)
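
The comparison against `"NULL"` is needed because the file stores missing descriptions as ordinary text rather than as empty fields. The plain-Python sketch below illustrates the idea on two sample rows modeled on the Step 4 preview. The constant `NULL_MARKER` and the sample string are assumptions made for illustration.

```python
import csv
import io

# Two sample rows modeled on the Step 4 preview; the text NULL is an ordinary string in the file.
sample_tsv = (
    "order_id\tquantity\titem_name\tchoice_description\titem_price\n"
    "1\t1\tIzze\t[Clementine]\t$3.39 \n"
    "3\t1\tSide of Chips\tNULL\t$1.69 \n"
)

NULL_MARKER = "NULL"  # hypothetical constant naming the placeholder used in the file

rows = []
for record in csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"):
    # Replace the placeholder text with a real missing value.
    cleaned = {key: (None if value == NULL_MARKER else value) for key, value in record.items()}
    rows.append(cleaned)

print([row["choice_description"] for row in rows])
```

In Spark, the CSV reader's `nullValue` option may achieve the same effect at load time, after which `isNotNull()` should behave as expected. That variant was not run as part of this notebook.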

In [19]:
total_quantity_choice_descr_notnull.show(n=5)

+--------------------+-----+-----------+
|  choice_description|total|clean_total|
+--------------------+-----+-----------+
|                NULL| 1382|          0|
|         [Diet Coke]|  159|        159|
|              [Coke]|  143|        143|
|            [Sprite]|   89|         89|
|[Fresh Tomato Sal...|   49|         49|
+--------------------+-----+-----------+
only showing top 5 rows



In [20]:
total_quantity_choice_descr_notnull_sorted = total_quantity_choice_descr_notnull.sort("clean_total", ascending=False).select("choice_description", "clean_total")

In [21]:
total_quantity_choice_descr_final = total_quantity_choice_descr_notnull_sorted.withColumnRenamed("clean_total", "total")

In [22]:
total_quantity_choice_descr_final.show(n=1)

+------------------+-----+
|choice_description|total|
+------------------+-----+
|       [Diet Coke]|  159|
+------------------+-----+
only showing top 1 row



### Step 12. How many items were ordered in total?

In [23]:
total_quantity = chipo.select("quantity").agg(func.sum("quantity").alias("total_quantity"))
total_quantity.show()

+--------------+
|total_quantity|
+--------------+
|          4972|
+--------------+



### Step 13. Turn the item price into a float

#### Step 13.a. Check the item price type

In [24]:
chipo.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: string (nullable = true)



In [25]:
chipo.show(n=5)

+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                NULL|    $2.39 |
|       1|       1|                Izze|        [Clementine]|    $3.39 |
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|       1|       1|Chips and Tomatil...|                NULL|    $2.39 |
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
+--------+--------+--------------------+--------------------+----------+
only showing top 5 rows



#### Step 13.b. Create a lambda function and change the type of item price

In [26]:
change_price_type = lambda price: float(price.lstrip("$"))

In [27]:
change_price_type("$2.99")

2.99
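
The lambda above assumes every value is a string that parses cleanly. A Python UDF receives `None` for null cells, and `None.lstrip` would raise an error. A more defensive variant might look like the sketch below; the name `parse_price` is hypothetical and not part of the original notebook.

```python
from typing import Optional


def parse_price(price: Optional[str]) -> Optional[float]:
    """Convert a string such as "$2.39 " to 2.39; return None if it cannot be parsed."""
    if price is None:
        return None
    # Drop surrounding whitespace first, then the leading dollar sign.
    cleaned = price.strip().lstrip("$")
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_price("$2.39 "))
print(parse_price(None))
print(parse_price("NULL"))
```

This function can be passed to `func.udf` in place of the lambda. Built-in column functions such as `func.regexp_replace` followed by a cast are another possible route that avoids a Python UDF altogether.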

In [28]:
change_price_type_udf = func.udf(change_price_type, returnType=FloatType())
change_price_type_udf

<function __main__.<lambda>(price)>

In [29]:
price_chipo = chipo.withColumn(
    "price",
    change_price_type_udf(func.col("item_price"))
)

In [30]:
price_chipo.show(n=5)

+--------+--------+--------------------+--------------------+----------+-----+
|order_id|quantity|           item_name|  choice_description|item_price|price|
+--------+--------+--------------------+--------------------+----------+-----+
|       1|       1|Chips and Fresh T...|                NULL|    $2.39 | 2.39|
|       1|       1|                Izze|        [Clementine]|    $3.39 | 3.39|
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 | 3.39|
|       1|       1|Chips and Tomatil...|                NULL|    $2.39 | 2.39|
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |16.98|
+--------+--------+--------------------+--------------------+----------+-----+
only showing top 5 rows

#### Step 13.c. Check the item price type

In [31]:
price_chipo.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: string (nullable = true)
 |-- price: float (nullable = true)



### Step 14. How much was the revenue for the period in the dataset?

In [32]:
revenue_chipo = price_chipo.withColumn(
    "revenue",
    func.col("quantity") * func.col("price")
)

In [33]:
revenue_chipo.show(n=5)

+--------+--------+--------------------+--------------------+----------+-----+-------+
|order_id|quantity|           item_name|  choice_description|item_price|price|revenue|
+--------+--------+--------------------+--------------------+----------+-----+-------+
|       1|       1|Chips and Fresh T...|                NULL|    $2.39 | 2.39|   2.39|
|       1|       1|                Izze|        [Clementine]|    $3.39 | 3.39|   3.39|
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 | 3.39|   3.39|
|       1|       1|Chips and Tomatil...|                NULL|    $2.39 | 2.39|   2.39|
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |16.98|  33.96|
+--------+--------+--------------------+--------------------+----------+-----+-------+
only showing top 5 rows



In [34]:
total_revenue = revenue_chipo.agg(
    func.sum("revenue").alias("total_revenue")
)
total_revenue.show()

+----------------+
|   total_revenue|
+----------------+
|39237.0197327137|
+----------------+
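
The stray digits in the total are a likely consequence of `FloatType`, which stores 32-bit floats: prices such as 2.39 cannot be represented exactly, and the error accumulates in the sum. The plain-Python sketch below illustrates the effect. The helper `to_float32` is a hypothetical name, and the sample prices are taken from the preview above.

```python
import struct
from decimal import Decimal


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack("f", struct.pack("f", value))[0]


price = 2.39
as_float32 = to_float32(price)
print(as_float32 == price)      # False: 32 bits hold 2.39 less precisely than 64 bits
print(abs(as_float32 - price))  # size of the extra rounding error

# Decimal keeps cents exact, so sums of prices have no stray digits.
exact_total = sum(Decimal(p) for p in ["2.39", "3.39", "3.39", "2.39"])
print(exact_total)              # 11.56
```

In Spark, declaring the UDF with `DoubleType` or casting the column to a decimal type are possible ways to reduce or avoid this error. Neither was tried in this notebook.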



### Step 15. How many orders were made in the period?

In [35]:
chipo.agg(func.count_distinct("order_id")).show()

+---------------+
|count(order_id)|
+---------------+
|           1834|
+---------------+



### Step 16. What is the average revenue amount per order?

In [39]:
revenue_chipo.columns

['order_id',
 'quantity',
 'item_name',
 'choice_description',
 'item_price',
 'price',
 'revenue']

In [41]:
# Solution 1: mean line-item revenue within each order (not the mean of the order totals)

revenue_chipo.groupBy("order_id").agg(func.avg("revenue").alias("avg_revenue")).show()

+--------+------------------+
|order_id|       avg_revenue|
+--------+------------------+
|     148| 7.734999895095825|
|     463|5.3399999141693115|
|     471| 4.829999923706055|
|     496| 3.509999990463257|
|     833|             6.375|
|    1088|11.570000171661377|
|    1238|6.5849997997283936|
|    1342| 4.050000031789144|
|    1580|  7.15666651725769|
|    1591|              37.0|
|    1645|               4.0|
|    1829| 8.083333333333334|
|     243|              45.0|
|     392| 3.756666620572408|
|     540| 6.849999904632568|
|     623| 8.614999771118164|
|     737|12.512499809265137|
|     858| 6.849999904632568|
|     897| 8.739999771118164|
|    1025| 5.353333195050557|
+--------+------------------+
only showing top 20 rows
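
The result above averages the line items inside each order. If the question is read as "what is the mean order total", the aggregation needs two stages: sum per order, then average those sums. The plain-Python sketch below shows the arithmetic on the five rows from the Step 14 preview. It is an illustration on sample rows only, not a result for the full dataset.

```python
from collections import defaultdict
from decimal import Decimal

# (order_id, quantity, price) rows copied from the Step 14 preview.
rows = [
    (1, 1, Decimal("2.39")),
    (1, 1, Decimal("3.39")),
    (1, 1, Decimal("3.39")),
    (1, 1, Decimal("2.39")),
    (2, 2, Decimal("16.98")),
]

# Stage 1: total revenue per order.
order_totals = defaultdict(Decimal)
for order_id, quantity, price in rows:
    order_totals[order_id] += quantity * price

# Stage 2: mean of the per-order totals.
average_per_order = sum(order_totals.values()) / len(order_totals)
print(dict(order_totals))
print(average_per_order)  # 22.76
```

A PySpark version of the same idea might be `revenue_chipo.groupBy("order_id").agg(func.sum("revenue").alias("order_revenue")).agg(func.avg("order_revenue"))`, though that was not run here.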



### Step 17. How many different items are sold?

In [38]:
chipo.agg(func.count_distinct("item_name")).show()

+----------------+
|count(item_name)|
+----------------+
|              50|
+----------------+

