## SQL Questions

In [1]:
from pyspark.sql import SparkSession

In [2]:
# Create a local spark session
spark = SparkSession.builder \
        .appName('nyc-taxi-sql') \
        .getOrCreate()

In [3]:
# Read the parquet output of the ETL step (parquet is Spark's default load format)
df = spark.read.load("./output")

In [4]:
df.createOrReplaceTempView("nyc_taxi_data_2015_16")

## Business Questions

## Q1. For each year and month

### i. What was the total number of trips?

In [5]:
spark.sql("""
SELECT year
,month
,COUNT(*) AS number_of_trips
FROM nyc_taxi_data_2015_16
GROUP BY year
,month
ORDER BY year, month
""").show(24)

+----+-----+---------------+
|year|month|number_of_trips|
+----+-----+---------------+
|2015|    1|       14098035|
|2015|    2|       14646205|
|2015|    3|       14887955|
|2015|    4|       14559584|
|2015|    5|       14763113|
|2015|    6|       13792085|
|2015|    7|       12944024|
|2015|    8|       12514354|
|2015|    9|       12543804|
|2015|   10|       13754308|
|2015|   11|       12655840|
|2015|   12|       12874955|
|2016|    1|       12197925|
|2016|    2|       12737514|
|2016|    3|       13618756|
|2016|    4|       13299754|
|2016|    5|       13193482|
|2016|    6|       12363673|
|2016|    7|       11469481|
|2016|    8|       11040565|
|2016|    9|       11114687|
|2016|   10|       11945718|
|2016|   11|       11087608|
|2016|   12|       11508320|
+----+-----+---------------+



### ii. Which day of week had the most trips?

In [6]:
spark.sql("""
WITH WEEKDAY_TRIP_COUNT AS (SELECT year
        ,month
        ,DATE_FORMAT(pickup_datetime, "EEEE") AS pickup_weekday
        ,COUNT(*) AS total_trips
        ,ROW_NUMBER() OVER (PARTITION BY year,month ORDER BY COUNT(*) DESC) AS row_num
    FROM nyc_taxi_data_2015_16
    GROUP BY year
        ,month
        ,pickup_weekday
    )
    SELECT year
        ,month
        ,pickup_weekday
        ,total_trips
    FROM WEEKDAY_TRIP_COUNT
    WHERE row_num = 1
    ORDER BY year, month
""").show(24)

+----+-----+--------------+-----------+
|year|month|pickup_weekday|total_trips|
+----+-----+--------------+-----------+
|2015|    1|      Saturday|    2663157|
|2015|    2|      Saturday|    2338396|
|2015|    3|        Sunday|    2334587|
|2015|    4|      Thursday|    2505889|
|2015|    5|      Saturday|    2627990|
|2015|    6|       Tuesday|    2215136|
|2015|    7|     Wednesday|    2188274|
|2015|    8|      Saturday|    2137535|
|2015|    9|     Wednesday|    2101406|
|2015|   10|      Saturday|    2481416|
|2015|   11|        Sunday|    2115682|
|2015|   12|      Thursday|    2127194|
|2016|    1|        Friday|    2227847|
|2016|    2|      Saturday|    1987101|
|2016|    3|      Thursday|    2258660|
|2016|    4|      Saturday|    2461977|
|2016|    5|        Sunday|    2074789|
|2016|    6|      Thursday|    2127773|
|2016|    7|        Friday|    1992464|
|2016|    8|     Wednesday|    1820406|
|2016|    9|        Friday|    1981416|
|2016|   10|      Saturday|    2149453|


### iii. Which hour of the day had the most trips?

In [7]:
spark.sql("""
WITH HOUR_TRIP_COUNT AS (SELECT year
        ,month
        ,pickup_hour
        ,COUNT(*) AS total_trips
        ,ROW_NUMBER() OVER (PARTITION BY year,month ORDER BY COUNT(*) DESC) AS row_num
    FROM nyc_taxi_data_2015_16
    GROUP BY year
        ,month
        ,pickup_hour
    )
    SELECT year
        ,month
        ,pickup_hour
        ,total_trips
    FROM HOUR_TRIP_COUNT
    WHERE row_num = 1
    ORDER BY year, month
""").show(24)

+----+-----+-----------+-----------+
|year|month|pickup_hour|total_trips|
+----+-----+-----------+-----------+
|2015|    1|         19|     897821|
|2015|    2|         19|     932658|
|2015|    3|         19|     937173|
|2015|    4|         19|     905186|
|2015|    5|         19|     911404|
|2015|    6|         19|     850788|
|2015|    7|         19|     802134|
|2015|    8|         19|     771125|
|2015|    9|         19|     783066|
|2015|   10|         19|     867123|
|2015|   11|         19|     778753|
|2015|   12|         19|     791892|
|2016|    1|         18|     780650|
|2016|    2|         18|     821625|
|2016|    3|         19|     868920|
|2016|    4|         19|     829231|
|2016|    5|         18|     808252|
|2016|    6|         19|     755675|
|2016|    7|         18|     686629|
|2016|    8|         19|     685331|
|2016|    9|         19|     693120|
|2016|   10|         19|     738889|
|2016|   11|         19|     680430|
|2016|   12|         19|     700668|
+----+-----+-----------+-----------+
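
The "top row per group" pattern used in the two queries above (count per group, rank within each year and month, keep `row_num = 1`) can be sketched in plain Python. This is only an illustration on a hypothetical handful of trips, not the Spark execution itself; the `trips` list and the function name are made up for the example.

```python
from collections import Counter

# Hypothetical sample: one (year, month, pickup_hour) tuple per trip.
trips = [
    (2015, 1, 19), (2015, 1, 19), (2015, 1, 18),
    (2015, 2, 8), (2015, 2, 19), (2015, 2, 19),
    (2016, 1, 18), (2016, 1, 18), (2016, 1, 19),
]

def busiest_hour_per_month(trips):
    """Return {(year, month): (pickup_hour, total_trips)} for the busiest hour."""
    # Equivalent of GROUP BY year, month, pickup_hour with COUNT(*).
    counts = Counter(trips)
    best = {}
    # Visit groups by descending count; the first row seen for each
    # (year, month) plays the role of row_num = 1.
    for (year, month, hour), total in sorted(counts.items(), key=lambda kv: -kv[1]):
        best.setdefault((year, month), (hour, total))
    return dict(sorted(best.items()))

result = busiest_hour_per_month(trips)
for (year, month), (hour, total) in result.items():
    print(year, month, hour, total)
```

Like `ROW_NUMBER()`, this keeps exactly one row per month even when two hours tie on trip count. If ties matter, `RANK()` in the SQL version would keep every tied row instead.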

### iv. What was the average number of passengers?

In [8]:
spark.sql("""
SELECT year
    ,month
    ,ROUND(AVG(passenger_count), 6) AS avg_passengers_per_trip
    FROM nyc_taxi_data_2015_16
    GROUP BY year
    ,month
ORDER BY year, month
""").show(24)

+----+-----+-----------------------+
|year|month|avg_passengers_per_trip|
+----+-----+-----------------------+
|2015|    1|               1.653391|
|2015|    2|               1.626309|
|2015|    3|               1.640603|
|2015|    4|                1.64593|
|2015|    5|               1.652701|
|2015|    6|               1.648807|
|2015|    7|               1.660572|
|2015|    8|                1.66225|
|2015|    9|               1.647119|
|2015|   10|               1.640405|
|2015|   11|                1.63751|
|2015|   12|               1.645442|
|2016|    1|               1.636588|
|2016|    2|               1.621866|
|2016|    3|               1.626152|
|2016|    4|               1.628314|
|2016|    5|               1.629545|
|2016|    6|               1.626042|
|2016|    7|               1.636588|
|2016|    8|               1.633314|
|2016|    9|               1.617384|
|2016|   10|               1.618455|
|2016|   11|               1.612241|
|2016|   12|               1.623203|
+----+-----+-----------------------+

### v. What was the average amount paid per trip (total_amount)?

In [11]:
spark.sql("""
SELECT year
    ,month
    ,ROUND(AVG(total_amount), 6) AS avg_total_amount
    FROM nyc_taxi_data_2015_16
    GROUP BY year
    ,month
ORDER BY year, month
""").show(24)

+----+-----+----------------+
|year|month|avg_total_amount|
+----+-----+----------------+
|2015|    1|       14.664886|
|2015|    2|       15.148309|
|2015|    3|       15.551944|
|2015|    4|       15.777206|
|2015|    5|       16.214256|
|2015|    6|       16.101898|
|2015|    7|        15.89492|
|2015|    8|       15.919458|
|2015|    9|       16.207254|
|2015|   10|       16.259301|
|2015|   11|       16.090388|
|2015|   12|       16.016362|
|2016|    1|       15.410171|
|2016|    2|        15.34471|
|2016|    3|       15.756457|
|2016|    4|       15.969409|
|2016|    5|       16.397597|
|2016|    6|       16.430469|
|2016|    7|       16.171933|
|2016|    8|       16.116113|
|2016|    9|       16.630837|
|2016|   10|        16.30687|
|2016|   11|       16.264509|
|2016|   12|       15.908741|
+----+-----+----------------+



### vi. What was the average amount paid per passenger (total_amount)?

In [13]:
spark.sql("""
SELECT year
    ,month
    ,ROUND(AVG(total_amount/passenger_count), 6) AS avg_total_amount
    FROM nyc_taxi_data_2015_16
    GROUP BY year
    ,month
ORDER BY year, month
""").show(24)

+----+-----+----------------+
|year|month|avg_total_amount|
+----+-----+----------------+
|2015|    1|        11.98855|
|2015|    2|       12.500788|
|2015|    3|       12.757072|
|2015|    4|       12.898395|
|2015|    5|       13.208794|
|2015|    6|       13.140276|
|2015|    7|       12.918228|
|2015|    8|       12.914272|
|2015|    9|       13.218438|
|2015|   10|       13.284638|
|2015|   11|       13.161323|
|2015|   12|       13.034274|
|2016|    1|       12.638365|
|2016|    2|       12.643378|
|2016|    3|       12.942539|
|2016|    4|       13.089892|
|2016|    5|       13.414256|
|2016|    6|       13.463351|
|2016|    7|       13.195319|
|2016|    8|       13.167938|
|2016|    9|       13.658277|
|2016|   10|        13.37838|
|2016|   11|       13.386819|
|2016|   12|       13.013608|
+----+-----+----------------+
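
Note that `AVG(total_amount/passenger_count)` averages the per-trip amount per passenger, which is not the same as dividing the total amount by the total number of passengers. A small sketch on hypothetical numbers shows the difference; the sample data and function names are illustrative, and skipping rows with zero passengers is an assumption about how such rows should be handled.

```python
# Hypothetical trips as (total_amount, passenger_count) pairs.
trips = [(10.0, 1), (12.0, 2), (30.0, 5)]

def mean_of_ratios(trips):
    """Average of per-trip amount per passenger, as in the query above."""
    # Assumption: rows with zero passengers are skipped to avoid dividing by zero.
    ratios = [amount / passengers for amount, passengers in trips if passengers > 0]
    return sum(ratios) / len(ratios)

def ratio_of_sums(trips):
    """Total amount divided by total passengers across all trips."""
    valid = [(amount, passengers) for amount, passengers in trips if passengers > 0]
    return sum(a for a, _ in valid) / sum(p for _, p in valid)

print(round(mean_of_ratios(trips), 4))
print(round(ratio_of_sums(trips), 4))
```

The first figure weights every trip equally, while the second weights every passenger equally, so trips with many passengers pull the second figure down more.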



## Q2. For each taxi colour (yellow and green)

### What was the average, median, minimum and maximum trip duration in seconds?

In [15]:
spark.sql('''
select taxi_colour
    ,ROUND(AVG(duration_mins), 6) as avg_duration_mins
    ,ROUND(percentile_approx(duration_mins, 0.5), 6) as median_duration_mins
    ,ROUND(min(duration_mins), 6) as min_duration_mins
    ,ROUND(max(duration_mins), 6) as max_duration_mins
    FROM nyc_taxi_data_2015_16
    group by taxi_colour
''').show()

+-----------+-----------------+--------------------+-----------------+-----------------+
|taxi_colour|avg_duration_mins|median_duration_mins|min_duration_mins|max_duration_mins|
+-----------+-----------------+--------------------+-----------------+-----------------+
|      green|         13.12797|           10.283333|         0.016667|       179.983333|
|     yellow|        14.191113|                11.2|         0.016667|       179.966667|
+-----------+-----------------+--------------------+-----------------+-----------------+



In the above result, durations are reported in minutes; multiply by 60 to get seconds. Trip duration was capped at 180 minutes in ETL, which explains the maximum, and only durations greater than 0 were kept, which explains the minimum.
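
Since the question asks for seconds while the query reports minutes, the figures can be converted after the fact. The sketch below copies the values from the result table above and multiplies by 60; the dictionary layout and helper name are illustrative only. The same conversion could be done inside the query by aggregating `duration_mins * 60`.

```python
# Figures copied from the result table above, in minutes.
duration_mins = {
    "green": {"avg": 13.12797, "median": 10.283333, "min": 0.016667, "max": 179.983333},
    "yellow": {"avg": 14.191113, "median": 11.2, "min": 0.016667, "max": 179.966667},
}

def to_seconds(stats_mins):
    """Convert a dict of duration statistics from minutes to seconds."""
    return {name: round(value * 60, 2) for name, value in stats_mins.items()}

duration_secs = {colour: to_seconds(stats) for colour, stats in duration_mins.items()}
for colour, stats in duration_secs.items():
    print(colour, stats)
```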

### What was the average, median, minimum and maximum trip distance in km?

In [16]:
spark.sql('''
select taxi_colour
    ,ROUND(AVG(trip_distance * 1.609), 6) as avg_distance_kms
    ,ROUND(percentile_approx(trip_distance * 1.609, 0.5), 6) as median_distance_kms
    ,ROUND(min(trip_distance  * 1.609), 6) as min_distance_kms
    ,ROUND(max(trip_distance * 1.609), 6) as max_distance_kms
    FROM nyc_taxi_data_2015_16
    group by taxi_colour
''').show()

+-----------+----------------+-------------------+----------------+----------------+
|taxi_colour|avg_distance_kms|median_distance_kms|min_distance_kms|max_distance_kms|
+-----------+----------------+-------------------+----------------+----------------+
|      green|        4.640294|            3.07319|         0.01609|           80.45|
|     yellow|        4.814259|            2.75139|         0.01609|           80.45|
+-----------+----------------+-------------------+----------------+----------------+



In the above result, the trip distance was capped in ETL, and that cap is the maximum distance shown (80.45 km, which is 50 miles at 1.609 km per mile). The same goes for the minimum, since only distances greater than 0 miles were kept.

### What was the average, median, minimum and maximum speed in km per hour?

In [17]:
spark.sql('''
select taxi_colour
    ,ROUND(AVG(trip_distance * 1.609 /(duration_mins/60)), 6) as avg_speed_kmhr
    ,ROUND(percentile_approx(trip_distance * 1.609 /(duration_mins/60), 0.5), 6) as median_speed_kmhr
    ,ROUND(min(trip_distance * 1.609 /(duration_mins/60)), 6) as min_speed_kmhr
    ,ROUND(max(trip_distance * 1.609 /(duration_mins/60)), 6) as max_speed_kmhr
    FROM nyc_taxi_data_2015_16
    group by taxi_colour
''').show()

+-----------+--------------+-----------------+--------------+--------------+
|taxi_colour|avg_speed_kmhr|median_speed_kmhr|min_speed_kmhr|max_speed_kmhr|
+-----------+--------------+-----------------+--------------+--------------+
|      green|     20.589008|        18.756343|         3.218|        112.63|
|     yellow|     19.026017|        16.824398|         3.218|    112.629998|
+-----------+--------------+-----------------+--------------+--------------+



In the above result, speed was restricted to between 2 mph and 70 mph in ETL, which gives the minimum (3.218 km/h) and maximum (112.63 km/h) shown.
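
The speed expression in the query, `trip_distance * 1.609 / (duration_mins / 60)`, can be checked with a small Python sketch. The function names are illustrative; the 1.609 factor is the one used in the queries above, and the 2 mph and 70 mph bounds are the ETL limits mentioned in the text.

```python
KM_PER_MILE = 1.609  # conversion factor used in the queries above

def mph_to_kmh(mph):
    """Convert a speed in miles per hour to kilometres per hour."""
    return mph * KM_PER_MILE

def speed_kmh(trip_distance_miles, duration_mins):
    """Average speed in km/h from a distance in miles and a duration in minutes."""
    return trip_distance_miles * KM_PER_MILE / (duration_mins / 60)

# The ETL bounds of 2 mph and 70 mph expressed in km/h.
print(round(mph_to_kmh(2), 3), round(mph_to_kmh(70), 3))

# A hypothetical 5-mile trip that took 30 minutes.
print(round(speed_kmh(5, 30), 2))
```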

## Q3. What was the percentage of trips where the driver received tips?

In [18]:
spark.sql('''
    SELECT taxi_colour
        ,SUM(CASE WHEN tip_amount > 0 THEN 1 ELSE 0 END) / COUNT(1) * 100 AS tipped_trips_percentage
    FROM nyc_taxi_data_2015_16
    GROUP BY taxi_colour
''').show()

+-----------+-----------------------+
|taxi_colour|tipped_trips_percentage|
+-----------+-----------------------+
|      green|      41.14766720704491|
|     yellow|      62.00446030413704|
+-----------+-----------------------+



## Q4. For trips where the driver received tips, what was the percentage where the driver received tips of at least $10?

In [24]:
spark.sql('''
    SELECT taxi_colour
        ,SUM(CASE WHEN tip_amount >= 10 THEN 1 ELSE 0 END)
            / SUM(CASE WHEN tip_amount > 0 THEN 1 ELSE 0 END) * 100 AS tipped_trips_percentage_gt_10
    FROM nyc_taxi_data_2015_16
    GROUP BY taxi_colour
''').show()

+-----------+-----------------------------+
|taxi_colour|tipped_trips_percentage_gt_10|
+-----------+-----------------------------+
|      green|           1.7097524699694122|
|     yellow|           3.0469699535287367|
+-----------+-----------------------------+
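
The two tip percentages use different denominators: Q3 divides by all trips, while Q4 divides only by trips that received a tip. A minimal sketch on hypothetical tip amounts makes the distinction explicit; the sample list and function name are made up for illustration.

```python
# Hypothetical tip amounts in dollars, one per trip.
tips = [0.0, 2.5, 0.0, 12.0, 1.0, 10.0, 0.0, 3.0]

def tip_shares(tips):
    """Return (percent of trips tipped, percent of tipped trips with tip >= $10)."""
    total = len(tips)
    tipped = sum(1 for t in tips if t > 0)
    large = sum(1 for t in tips if t >= 10)
    # Guard against empty denominators instead of raising ZeroDivisionError.
    pct_tipped = tipped / total * 100 if total else None
    pct_large = large / tipped * 100 if tipped else None
    return pct_tipped, pct_large

pct_tipped, pct_large = tip_shares(tips)
print(pct_tipped, pct_large)
```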



## Q5. For each duration bin, calculate:

### Average speed (km per hour)

In [20]:
spark.sql('''
    select cat_duration, ROUND(AVG(trip_distance * 1.609 /(duration_mins/60)), 6) as avg_speed_km_hr
    from nyc_taxi_data_2015_16
    group by cat_duration
''').show()

+-------------+---------------+
| cat_duration|avg_speed_km_hr|
+-------------+---------------+
|Above 30 mins|       24.76117|
|   10-20 mins|      18.279589|
|    5-10 mins|      17.473161|
|   20-30 mins|      21.468919|
| Under 5 mins|      20.018777|
+-------------+---------------+
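
The `cat_duration` column is produced in ETL and its exact rules are not shown here. As a rough sketch, the labels in the table above could be assigned as follows; the function name is hypothetical, and the choice of which bin a boundary value such as exactly 10 minutes falls into is an assumption.

```python
def duration_bin(duration_mins):
    """Map a trip duration in minutes to one of the bin labels shown above.

    Assumption: each bin includes its lower bound and excludes its upper bound.
    """
    if duration_mins < 5:
        return "Under 5 mins"
    if duration_mins < 10:
        return "5-10 mins"
    if duration_mins < 20:
        return "10-20 mins"
    if duration_mins < 30:
        return "20-30 mins"
    return "Above 30 mins"

for minutes in (3.5, 7, 15, 25, 45):
    print(minutes, duration_bin(minutes))
```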



### Average distance per dollar (km per $)

In [21]:
spark.sql('''
    select cat_duration, ROUND(AVG(trip_distance * 1.609 /total_amount), 6) as avg_km_dollar
    from nyc_taxi_data_2015_16
    group by cat_duration
''').show()

+-------------+-------------+
| cat_duration|avg_km_dollar|
+-------------+-------------+
|Above 30 mins|     0.407061|
|   10-20 mins|     0.274964|
|    5-10 mins|     0.226161|
|   20-30 mins|      0.32054|
| Under 5 mins|     0.172607|
+-------------+-------------+



## Q6. Which duration bin will you advise a taxi driver to target to maximise his income?

I would advise a taxi driver to target long trips of over 30 minutes to maximise income. I would also advise not missing Saturdays and the early-evening peak (pickup hours 18 and 19, roughly 6 pm to 8 pm), since those are the times with the highest number of trips in Q1.