# HW3 - Q3 [35 pts]

## Important Notices

<div class="alert alert-block alert-danger">
    WARNING: <strong>REMOVE</strong> any print statements that you added for debugging to cells marked "#export" before submitting, because they will crash the autograder in Gradescope. You may use additional cells at the bottom of the notebook for testing.
</div>

<div class="alert alert-block alert-danger">
    WARNING: Do <strong>NOT</strong> remove any comment that says "#export" because that will crash the autograder in Gradescope. We use this comment to export your code in these cells for grading.
</div>

<div class="alert alert-block alert-danger">
    WARNING: Do <strong>NOT</strong> import any additional libraries into this workbook.
</div>

All instructions, code comments, etc. in this notebook **are part of the assignment instructions**. That is, if there are instructions about completing a task in this notebook, that task is not optional.

<div class="alert alert-block alert-info">
    You <strong>must</strong> implement the following functions in this notebook to receive credit.
</div>

`user()` - 1 point

`trip_statistics()` - 3 points

`busiest_hour()` - 5 points

`most_freq_pickup_locations()` - 5 points

`avg_trip_distance_and_duration()` - 6 points

`most_freq_peak_hour_fares()` - 10 points

Each function will be auto-graded using different sets of parameters or data, to ensure that values are not hard-coded.  You may assume we will only use your code to work with data from the NYC-TLC dataset during auto-grading.

In addition, you will submit the CSV output of `most_freq_peak_hour_fares()` on the large dataset as `q3_output_large.csv`.

`q3_output_large.csv` - 5 points

<div class="alert alert-block alert-danger">
    WARNING: Do <strong>NOT</strong> remove or modify the following utility functions:
</div>

`load_data()`

`main()`

<div class="alert alert-block alert-danger">
    WARNING: Do <strong>NOT</strong> modify the cells below. They contain all imports, the function for loading data, and the function for running your code.
</div>

In [1]:
#export
from pyspark.sql.functions import *
from pyspark.sql import *

In [2]:
#### DO NOT CHANGE ANYTHING IN THIS CELL ####

def load_data(size='small'):
    # Loads the data for this question. Do not change this function.
    # This function should only be called with the parameter 'small' or 'large'
    
    if size != 'small' and size != 'large':
        print("Invalid size parameter provided. Use only 'small' or 'large'.")
        return
    
    input_bucket = "s3://cse6242-hw3-q3"
    
    # Load Trip Data
    trips_path = '/'+size+'/yellow_tripdata*'
    trips = spark.read.csv(input_bucket + trips_path, header=True, inferSchema=True)
    
    # Load Zone Data
    zones_path = '/'+size+'/taxi*'
    zones = spark.read.csv(input_bucket + zones_path, header=True, inferSchema=True)
    
    return trips, zones
    
def main(size, bucket):
    # Runs your functions
    trips, zones = load_data(size=size)
    
    print("User:", user())
    print()
    
    print("Trip Statistics:")
    ts = trip_statistics(trips)
    ts.show()
    print()
    
    print("Busiest Hour:")
    bh = busiest_hour(trips)
    bh.show(24)
    print()
    
    print("Most Frequent Pickup Locations:")
    mfpl = most_freq_pickup_locations(trips)
    mfpl.show()
    print()
    
    print("Average Trip Distance and Duration:")
    atdd = avg_trip_distance_and_duration(trips)
    atdd.show(n=24)
    print()
    
    print("Most Frequent Peak Hour Fares:")
    mfphf = most_freq_peak_hour_fares(trips, zones)
    mfphf.show()
    mfphf.coalesce(1).write.option("header","true").mode("overwrite").csv('{}/output_{}'.format(bucket, size))

# Implement the functions below for this assignment

<div class="alert alert-block alert-danger">
    WARNING: Do <strong>NOT</strong> change any function inputs or outputs, and ensure that the DataFrames your code returns match the schema definitions given for each function. Do <strong>NOT</strong> remove the #export comment from any of the code cells either. Doing so can prevent your code from being converted to a Python file.
</div>

## 3.1 [1 pt] Update the `user()` function
This function should return your GT username, e.g., `gburdell3`.

In [3]:
#export
def user():
    return 'jholman6'

## 3.2 [3 pts] Update the `trip_statistics()` function
This function performs exploratory data analysis on the `trip_distance` column. Compute basic statistics (count, mean, stddev, min, max) for `trip_distance`.
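To make the five statistics concrete, here is a minimal plain-Python sketch on a handful of made-up distances. It is meant to be run outside the notebook (it imports `statistics`, which the `#export` cells must not do), and the `distances` list is hypothetical. The sample standard deviation (dividing by n - 1) is used here on the assumption that it corresponds to the `stddev` row of the expected output.

```python
import statistics

# Hypothetical trip distances in miles, made up for illustration only.
distances = [1.0, 2.0, 3.0, 4.0]

summary = {
    "count": len(distances),
    "mean": statistics.mean(distances),
    # Sample standard deviation (divides by n - 1); assumed to match "stddev".
    "stddev": statistics.stdev(distances),
    "min": min(distances),
    "max": max(distances),
}

for name, value in summary.items():
    print(name, value)
```

Comparing a small hand-checked case like this against the output of your function on a tiny DataFrame can help confirm that the right column is being summarized.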

Example output formatting:

```
+-------+------------------+
|summary|     trip_distance|
+-------+------------------+
|  count|           xxxxxxx|
|   mean|           xxxxxxx|
| stddev|           xxxxxxx|
|    min|           xxxxxxx|
|    max|           xxxxxxx|
+-------+------------------+
```
Tip: Is there a PySpark Dataframe function you can use to solve this in a single line?

In [4]:
#export
def trip_statistics(trips):
    return trips.select("trip_distance").describe()

## 3.3 [5 pts] Update the `busiest_hour()` function

Determine the hour of the day (0-23), based on `tpep_pickup_datetime`, with the highest number of trips.

Return a PySpark DataFrame with a single row containing the hour with the highest trip count and the corresponding number of trips. Use column names: `hour`, `trip_count`.
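The counting logic can be sketched in plain Python as follows. This is an illustration only, intended to run outside the notebook; the `pickups` list and the helper name `busiest_hour_plain` are hypothetical, and the timestamp format is an assumption about how the pickup time is written.

```python
from collections import Counter
from datetime import datetime

# Hypothetical pickup timestamps, made up for illustration only.
pickups = [
    "2019-01-01 08:15:00",
    "2019-01-01 08:45:00",
    "2019-01-01 17:05:00",
    "2019-01-02 08:30:00",
    "2019-01-02 23:59:59",
]

def busiest_hour_plain(timestamps):
    """Return (hour, trip_count) for the hour with the most pickups."""
    hours = Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour for ts in timestamps
    )
    # most_common(1) returns a one-element list holding the top (hour, count) pair.
    return hours.most_common(1)[0]

print(busiest_hour_plain(pickups))
```

Note that trips are grouped by hour of day across all dates, so the 08:xx pickups on both days count toward the same hour.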

Example output formatting:

```
+----+----------+
|hour|trip_count|
+----+----------+
|  xx|    xxxxxx|
+----+----------+
```

In [None]:
#export
def busiest_hour(trips):
    return (
        trips
        .withColumn("pickup_ts", to_timestamp(col("tpep_pickup_datetime")))
        .withColumn("hour", hour(col("pickup_ts")))
        .groupBy("hour")
        .agg(count("*").alias("trip_count"))
        .orderBy(col("trip_count").desc())
        .limit(1)
    )

## 3.4 [5 pts] Update the `most_freq_pickup_locations()` function
Top 10 Most Frequent Pickup Locations

Identify the top 10 pickup locations (by `PULocationID`) with the highest number of trips. Display the location IDs along with their corresponding trip counts.

Return a PySpark DataFrame with the top 10 rows ordered by `trip_count` in descending order. Use column names: `PULocationID`, `trip_count`.

Expected Output:

A table with 10 rows listing the `PULocationID` values and the number of trips observed for each, ordered from most to least frequent.

Example output formatting:
```
+------------+----------+
|PULocationID|trip_count|
+------------+----------+
|         xxx|    xxxxxx|
|         xxx|    xxxxxx|
|         xxx|    xxxxxx|
|         xxx|    xxxxxx|
|         ...|    ......|
+------------+----------+
```

In [None]:
#export
def most_freq_pickup_locations(trips): 
    return (
        trips
        .groupBy("PULocationID")
        .agg(count("*").alias("trip_count"))
        .orderBy(col("trip_count").desc())
        .limit(10)
    )

## 3.5 [6 pts] Update the `avg_trip_distance_and_duration()` function
Average Trip Distance and Duration by Hour

Calculate the average trip distance and average trip duration by pickup hour (0-23), using `tpep_pickup_datetime` for the hour. Display the hour along with the corresponding averages.

Compute trip duration in minutes (difference between drop-off time and pickup time).

Exclude rows where `tpep_pickup_datetime` or `tpep_dropoff_datetime` is null, as well as rows where `trip_distance` <= 0. No additional outlier filtering or rounding is required.

Note: You can use `unix_timestamp` to help with calculating the duration.
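`unix_timestamp` yields seconds since the epoch, so the duration in minutes is the difference of two such values divided by 60. A minimal plain-Python sketch of the same arithmetic, meant to run outside the notebook (the helper name `duration_minutes` and the sample timestamps are hypothetical):

```python
from datetime import datetime, timezone

def duration_minutes(pickup, dropoff, fmt="%Y-%m-%d %H:%M:%S"):
    """Trip duration in minutes, computed from epoch seconds."""
    # Treat both strings as UTC so local daylight-saving rules cannot
    # affect the subtraction in this illustration.
    start = datetime.strptime(pickup, fmt).replace(tzinfo=timezone.utc).timestamp()
    end = datetime.strptime(dropoff, fmt).replace(tzinfo=timezone.utc).timestamp()
    return (end - start) / 60.0

# A trip that crosses midnight: 15 minutes 30 seconds long.
print(duration_minutes("2019-01-01 23:50:00", "2019-01-02 00:05:30"))
```

Because the subtraction is done on absolute seconds, trips that cross midnight need no special handling; the trip is still assigned to its pickup hour (23 in this example).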

Expected Output:

A table with 24 rows (hours 0-23) showing each hour ordered ascending along with the average trip distance and average trip duration for that hour. Use column names: `hour`, `avg_trip_distance`, `avg_trip_duration`.

Example output formatting:
```
+----+------------------+------------------+
|hour| avg_trip_distance| avg_trip_duration|
+----+------------------+------------------+
|   0|           xxxxxxx|           xxxxxxx|
|   1|           xxxxxxx|           xxxxxxx|
|   2|           xxxxxxx|           xxxxxxx|
|   3|           xxxxxxx|           xxxxxxx|
| ...|               ...|               ...|
|  23|           xxxxxxx|           xxxxxxx|
+----+------------------+------------------+
```

In [7]:
#export
def avg_trip_distance_and_duration(trips):
    valid = (
        trips
        .filter(col("tpep_pickup_datetime").isNotNull() & col("tpep_dropoff_datetime").isNotNull() & (col("trip_distance") > 0))
        .withColumn("pickup_ts", to_timestamp(col("tpep_pickup_datetime")))
        .withColumn("dropoff_ts", to_timestamp(col("tpep_dropoff_datetime")))
        .withColumn("hour", hour(col("pickup_ts")))
        .withColumn("duration_minutes", (unix_timestamp(col("dropoff_ts")) - unix_timestamp(col("pickup_ts"))) / 60.0)
    )

    result = (
        valid
        .groupBy("hour")
        .agg(
            avg(col("trip_distance")).alias("avg_trip_distance"),
            avg(col("duration_minutes")).alias("avg_trip_duration")
        )
        .orderBy(col("hour").asc())
    )

    return result

## 3.6 [10 pts] Update the `most_freq_peak_hour_fares()` function
Top 10 Most Frequent Routes During Peak Hours

Identify the top 10 most frequent routes (combinations of `PULocationID` and `DOLocationID`) during peak hours and return route-level statistics. For this question:

- Peak-hour windows are 7:00 - 8:59 (morning) and 16:00 - 18:59 (evening). Consider trips whose pickup hour falls within these windows.
- Exclude rows where `PULocationID` or `DOLocationID` is null.
- Exclude routes where `PULocationID` equals `DOLocationID` (a valid route must have different pickup and drop-off locations).
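Since the windows end at 8:59 and 18:59, they translate to inclusive bounds on the integer pickup hour: 7 to 8 and 16 to 18. A small plain-Python sketch of that predicate (the helper name `is_peak_hour` is hypothetical):

```python
def is_peak_hour(hour):
    """True when the pickup hour falls in 7-8 or 16-18, bounds inclusive."""
    return 7 <= hour <= 8 or 16 <= hour <= 18

# List every hour of the day that counts as a peak hour.
peak_hours = [h for h in range(24) if is_peak_hour(h)]
print(peak_hours)
```

A pickup at 08:59 has hour 8 and is included, while a pickup at 09:00 has hour 9 and is not.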

For each route, compute:

- `trip_count`: number of trips for that route during peak hours, and
- `avg_total_fare`: the average of the trip total_amount for that route, rounded to two decimal places.
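The per-route aggregation can be sketched in plain Python as below. This is an illustration meant to run outside the notebook; `sample_rows` and `top_routes` are hypothetical, and the rows are assumed to be already filtered to peak hours with distinct pickup and drop-off IDs. One caveat: Python's built-in `round` rounds exact halves to the nearest even digit, whereas Spark's `round` is documented as using HALF_UP, so the two can differ on exact ties.

```python
from collections import defaultdict

# Hypothetical (PULocationID, DOLocationID, total_amount) rows, assumed to be
# pre-filtered to peak hours; values are made up for illustration only.
sample_rows = [
    (237, 236, 10.00),
    (237, 236, 12.50),
    (236, 237, 9.30),
    (237, 236, 11.00),
    (161, 237, 8.00),
    (236, 237, 10.30),
]

def top_routes(rows, n=10):
    """Return up to n (PU, DO, trip_count, avg_total_fare) tuples, busiest first."""
    fares = defaultdict(list)
    for pu, do, total in rows:
        fares[(pu, do)].append(total)
    stats = [
        (pu, do, len(totals), round(sum(totals) / len(totals), 2))
        for (pu, do), totals in fares.items()
    ]
    # Sort by trip_count, highest first; list.sort is stable for ties.
    stats.sort(key=lambda route: route[2], reverse=True)
    return stats[:n]

for route in top_routes(sample_rows, n=2):
    print(route)
```

Routes are directional here: (237, 236) and (236, 237) are counted separately, which matches grouping on the ordered pair of `PULocationID` and `DOLocationID`.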

Join the route results with the provided zones dataset to include human-readable zone names for pickup and dropoff. Return a PySpark DataFrame with the following columns (in this order):

`PULocationID`, `PUZone`, `DOLocationID`, `DOZone`, `trip_count`, `avg_total_fare`.

Expected Output:

A table with 10 rows showing the top 10 routes during peak hours ordered by `trip_count` descending. If multiple routes share the same `trip_count`, any stable ordering among them is acceptable as long as results are ordered by `trip_count` descending.

Example output formatting:
```
+------------+------+------------+------+----------+--------------+
|PULocationID|PUZone|DOLocationID|DOZone|trip_count|avg_total_fare|
+------------+------+------------+------+----------+--------------+
|xxx         |xxx   |xxx         |xxx   |xxx       |xx.xx         |
|xxx         |xxx   |xxx         |xxx   |xxx       |xx.xx         |
|xxx         |xxx   |xxx         |xxx   |xxx       |xx.xx         |
|...         |...   |...         |...   |...       |...           |
+------------+------+------------+------+----------+--------------+
```

In [8]:
#export
def most_freq_peak_hour_fares(trips, zones):
    # Peak hours: 7-8 and 16-18 by pickup hour
    peak_trips = (
        trips
        .filter(col("PULocationID").isNotNull() & col("DOLocationID").isNotNull())
        .withColumn("pickup_ts", to_timestamp(col("tpep_pickup_datetime")))
        .withColumn("pickup_hour", hour(col("pickup_ts")))
        .filter((col("pickup_hour").between(7, 8)) | (col("pickup_hour").between(16, 18)))
        .filter(col("PULocationID") != col("DOLocationID"))
    )

    routes = (
        peak_trips
        .groupBy("PULocationID", "DOLocationID")
        .agg(
            count("*").alias("trip_count"),
            round(avg(col("total_amount")), 2).alias("avg_total_fare")
        )
        .orderBy(col("trip_count").desc())
        .limit(10)
    )

    pu = zones.select(col("LocationID").alias("PU_LocationID"), col("Zone").alias("PUZone"))
    do = zones.select(col("LocationID").alias("DO_LocationID"), col("Zone").alias("DOZone"))

    result = (
        routes
        .join(pu, routes.PULocationID == pu.PU_LocationID, "left")
        .join(do, routes.DOLocationID == do.DO_LocationID, "left")
        .select(
            routes.PULocationID,
            col("PUZone"),
            routes.DOLocationID,
            col("DOZone"),
            col("trip_count"),
            col("avg_total_fare")
        )
        # Joins do not guarantee row order, so re-sort after joining.
        .orderBy(col("trip_count").desc())
    )

    return result


## 3.7 [5 pts] q3_output_large.csv
The CSV output from running `most_freq_peak_hour_fares(trips, zones)` on the large dataset.

- Run the `main` function with `size='large'` to generate the file in the S3 bucket you created in the AWS setup.
- Download the generated file from S3 and rename it to `q3_output_large.csv`.
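To locate the output, it may help to see how `main` builds the destination path. The sketch below mirrors the format string used in `main`; the helper name `output_path` is hypothetical, and the bucket name is the placeholder from the testing cell. Spark's CSV writer produces a directory at that path rather than a single file, so with `coalesce(1)` you should expect to find one `part-*.csv` file inside it, which is the file to download and rename.

```python
def output_path(bucket, size):
    """Build the S3 destination the same way main() does."""
    # Mirrors the format string in main(): '{}/output_{}'
    return '{}/output_{}'.format(bucket, size)

# Placeholder bucket name; substitute the bucket you created in the AWS setup.
print(output_path('s3://cse6242-gburdell3', 'large'))
```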

<div class="alert alert-block alert-info">
<h2>Submission</h2>

<p>Once you have finished coding, you can export the notebook from <code>Notebook Explorer</code> by selecting your notebook and clicking <code>Export File</code> from the Actions dropdown.</p>

<p>Submit this notebook (q3.ipynb) along with your CSV file (q3_output_large.csv) to Gradescope.</p>
</div>

<div class="alert alert-block alert-info">
  <h2>Testing</h2>

  <p>You may use the cell below for any additional testing; however, any code written here will not be run or used during grading.</p>

  <ul>
    <li>You can run the <code>main</code> function on different dataset sizes to test your work, or run the functions individually as shown in the examples.</li>
  </ul>
</div>

In [9]:
trips, zones = load_data('small')

NameError: name 'spark' is not defined

In [10]:
ts = trip_statistics(trips)
ts.show()

NameError: name 'trips' is not defined

In [11]:
main('large', 's3://cse6242-gburdell3')

NameError: name 'spark' is not defined