# Ex1 - Filtering and Sorting Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark import SparkFiles
spark = SparkSession.builder.appName('MyApp').getOrCreate()

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 

### Step 3. Assign it to a variable called chipo.

In [2]:
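url = r'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
# Download the file so that Spark can read it from the local filesystem.
spark.sparkContext.addFile(url)
# The file is tab-separated with a header row; let Spark infer the column types.
chipo = spark.read.option('delimiter', '\t') \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .csv('file:///' + SparkFiles.get('chipotle.tsv'))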
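
To see what the `delimiter` and `header` options are doing, here is a minimal plain-Python sketch that parses a few tab-separated rows with the standard `csv` module instead of Spark. The three data rows are copied from the Step 6 output further down; `SAMPLE_TSV` and `read_tsv` are illustrative names, not part of the exercise.

```python
import csv
import io

# Three rows taken from the Step 6 output, laid out as tab-separated text.
SAMPLE_TSV = (
    "order_id\tquantity\titem_name\tchoice_description\titem_price\n"
    "511\t1\t6 Pack Soft Drink\t[Coke]\t$6.49 \n"
    "1253\t1\t6 Pack Soft Drink\t[Lemonade]\t$6.49 \n"
    "520\t1\t6 Pack Soft Drink\t[Sprite]\t$6.49 \n"
)


def read_tsv(text):
    """Parse tab-separated text, using the first line as column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


rows = read_tsv(SAMPLE_TSV)
print(list(rows[0].keys()))         # column names come from the header line
print(repr(rows[0]["item_price"]))  # the price is still a string at this point
```

Without type inference every value is a string, and `item_price` keeps both its dollar sign and a trailing space. That is why Step 4 strips and trims the column before casting it to a number.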

### Step 4. How many products cost more than $10.00?

In [7]:
from pyspark.sql.functions import trim, regexp_replace, col
from pyspark.sql.types import FloatType

In [9]:
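# Strip the dollar sign and surrounding whitespace, then cast the price to a float.
chipo = chipo.withColumn('item_price_clean', trim(regexp_replace('item_price', r'\$', '')).cast(FloatType()))
# Count the distinct rows priced above $10.00 (distinct rows, not distinct item names).
chipo.where(col('item_price_clean') > 10.00).distinct().count()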

1123
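
The count above is a count of distinct rows, which is not quite the same thing as a count of distinct products. As a rough cross-check of the logic, the sketch below repeats the cleaning step in plain Python on five (item_name, item_price) pairs taken from the Step 5 output. `clean_price`, `sample`, `expensive_rows`, and `expensive_products` are illustrative names, not part of the exercise.

```python
import re


def clean_price(raw):
    """Drop the dollar sign and surrounding whitespace, then convert to float."""
    return float(re.sub(r"\$", "", raw).strip())


# (item_name, item_price) pairs that appear in the Step 5 output.
sample = [
    ("Chicken Bowl", "$8.49 "),
    ("Chicken Bowl", "$21.96 "),
    ("Chicken Burrito", "$10.58 "),
    ("Chicken Burrito", "$16.38 "),
    ("Bottled Water", "$3.00 "),
]

# Rows whose cleaned price is above 10.00.
expensive_rows = [
    (name, clean_price(price))
    for name, price in sample
    if clean_price(price) > 10.00
]

# Distinct item names among those rows.
expensive_products = {name for name, _ in expensive_rows}

print(len(expensive_rows))      # 3 rows are priced above 10.00
print(len(expensive_products))  # but they cover only 2 distinct products
```

If the question is read as "how many distinct products", one possible Spark version would select `item_name` before calling `distinct()`, for example `chipo.where(col('item_price_clean') > 10.00).select('item_name').distinct().count()`. That would give a different number from the one shown above.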

### Step 5. What is the price of each item? 
###### Print a DataFrame with only two columns: item_name and item_price

In [11]:
chipo[['item_name','item_price']].distinct().show()

+--------------------+----------+
|           item_name|item_price|
+--------------------+----------+
|    Steak Soft Tacos|    $9.25 |
|    Barbacoa Burrito|    $9.25 |
|Chips and Mild Fr...|    $3.00 |
|       Carnitas Bowl|   $23.50 |
|     Chicken Burrito|   $10.58 |
|        Chicken Bowl|    $8.49 |
|  Chicken Salad Bowl|   $17.50 |
|       Bottled Water|    $3.00 |
|        Chicken Bowl|    $8.75 |
|        Chicken Bowl|   $21.96 |
|    Nantucket Nectar|    $6.78 |
|Chicken Crispy Tacos|   $10.98 |
|Chips and Tomatil...|    $2.39 |
|Chips and Fresh T...|   $44.25 |
|       Bottled Water|    $4.50 |
|      Veggie Burrito|   $33.75 |
|     Chicken Burrito|   $16.38 |
|         Veggie Bowl|    $8.75 |
|Chicken Crispy Tacos|   $11.25 |
|Chips and Tomatil...|    $5.90 |
+--------------------+----------+
only showing top 20 rows
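
The listing shows the same item at several prices (Chicken Bowl appears at $8.49, $8.75 and $21.96), presumably because the price depends on the quantity and the choices in each order line. One way to make that easier to read is to group the distinct prices by item. The sketch below does so in plain Python on pairs taken from the output above; `pairs` and `prices_by_item` are illustrative names.

```python
from collections import defaultdict

# (item_name, cleaned price) pairs taken from the output above.
pairs = [
    ("Chicken Bowl", 8.49),
    ("Chicken Bowl", 8.75),
    ("Chicken Bowl", 21.96),
    ("Bottled Water", 3.00),
    ("Bottled Water", 4.50),
]


def prices_by_item(pairs):
    """Map each item name to the sorted list of distinct prices seen for it."""
    grouped = defaultdict(set)
    for name, price in pairs:
        grouped[name].add(price)
    return {name: sorted(prices) for name, prices in grouped.items()}


for name, prices in prices_by_item(pairs).items():
    print(name, prices)
```

In Spark, a comparable grouping could be built with `groupBy('item_name')` and an aggregate such as `collect_set`, though the exercise itself only asks for the two-column listing.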



### Step 6. Sort by the name of the item

In [12]:
chipo.orderBy(col('item_name')).show()

+--------+--------+-----------------+------------------+----------+----------------+
|order_id|quantity|        item_name|choice_description|item_price|item_price_clean|
+--------+--------+-----------------+------------------+----------+----------------+
|     511|       1|6 Pack Soft Drink|            [Coke]|    $6.49 |            6.49|
|    1253|       1|6 Pack Soft Drink|        [Lemonade]|    $6.49 |            6.49|
|     520|       1|6 Pack Soft Drink|          [Sprite]|    $6.49 |            6.49|
|     148|       1|6 Pack Soft Drink|       [Diet Coke]|    $6.49 |            6.49|
|     566|       1|6 Pack Soft Drink|       [Diet Coke]|    $6.49 |            6.49|
|     168|       1|6 Pack Soft Drink|       [Diet Coke]|    $6.49 |            6.49|
|     708|       1|6 Pack Soft Drink|            [Coke]|    $6.49 |            6.49|
|     230|       1|6 Pack Soft Drink|       [Diet Coke]|    $6.49 |            6.49|
|     709|       1|6 Pack Soft Drink|       [Diet Coke]|    $6.49 |            6.49|

### Step 7. What was the quantity of the most expensive item ordered?

In [21]:
chipo.orderBy(col('item_price_clean').desc()).show(1)

+--------+--------+--------------------+------------------+----------+----------------+
|order_id|quantity|           item_name|choice_description|item_price|item_price_clean|
+--------+--------+--------------------+------------------+----------+----------------+
|    1443|      15|Chips and Fresh T...|              NULL|   $44.25 |           44.25|
+--------+--------+--------------------+------------------+----------+----------------+
only showing top 1 row
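
The answer is the `quantity` value of that top row: 15. The same "sort descending, take the first row, read one field" logic can be sketched in plain Python. The rows below reuse values from the outputs in this notebook, and `orders` and `quantity_of_most_expensive` are illustrative names.

```python
# Rows built from values shown in the Step 6 and Step 7 outputs.
orders = [
    {"order_id": 511, "quantity": 1, "item_price_clean": 6.49},
    {"order_id": 1443, "quantity": 15, "item_price_clean": 44.25},
    {"order_id": 1253, "quantity": 1, "item_price_clean": 6.49},
]


def quantity_of_most_expensive(rows):
    """Return the quantity recorded on the row with the highest price."""
    top = max(rows, key=lambda row: row["item_price_clean"])
    return top["quantity"]


print(quantity_of_most_expensive(orders))  # → 15
```

Note that $44.25 for 15 units works out to $2.95 per unit, which suggests that `item_price` is the total for the order line rather than a unit price.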



### Step 8. How many times was a Veggie Salad Bowl ordered?

In [17]:
chipo.where(col('item_name')=='Veggie Salad Bowl').count()

18

### Step 9. How many times did someone order more than one Canned Soda?

In [20]:
chipo.where(col('item_name')=='Canned Soda').where(col('quantity')>1).count()

20
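
Two details of Steps 8 and 9 are worth spelling out. Chaining two `where` calls keeps only the rows that satisfy both conditions, and `count()` counts order lines, not units. The plain-Python sketch below illustrates both points on a few made-up rows. The data is hypothetical and does not come from the dataset, and `count_rows` and `total_quantity` are illustrative names.

```python
# Hypothetical rows, for illustration only (not taken from chipotle.tsv).
rows = [
    {"item_name": "Canned Soda", "quantity": 1},
    {"item_name": "Canned Soda", "quantity": 2},
    {"item_name": "Canned Soda", "quantity": 4},
    {"item_name": "Veggie Salad Bowl", "quantity": 1},
    {"item_name": "Veggie Salad Bowl", "quantity": 2},
]


def count_rows(rows, item, min_quantity=0):
    """Count order lines for `item` whose quantity is above `min_quantity`."""
    return sum(
        1
        for row in rows
        if row["item_name"] == item and row["quantity"] > min_quantity
    )


def total_quantity(rows, item):
    """Sum the units ordered for `item` across all of its order lines."""
    return sum(row["quantity"] for row in rows if row["item_name"] == item)


print(count_rows(rows, "Canned Soda", min_quantity=1))  # → 2
print(count_rows(rows, "Veggie Salad Bowl"))            # → 2
print(total_quantity(rows, "Veggie Salad Bowl"))        # → 3
```

On this sample, the Veggie Salad Bowl appears on 2 order lines but accounts for 3 units, so "how many times was it ordered" has two defensible readings. The notebook's answers use the row-count reading.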