# 01 - Grouping and Aggregating Data

This demonstration shows how to group and aggregate NYC Taxi trip data with Spark. We'll cover basic grouping, multiple aggregations, and window functions.

### Objectives
- Understand basic grouping operations in Spark
- Perform time-based analysis using aggregations
- Implement complex aggregations with multiple metrics
- Use window functions for advanced analytics
- Optimize aggregation performance


## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

    - In the drop-down, select **More**.

    - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.


## A. Data Setup and Loading

First, let's load our taxi trip data and examine its structure.


In [0]:
from pyspark.sql.functions import *

# Read and display the taxi data
trips_df = spark.read.table("samples.nyctaxi.trips")

display(trips_df.limit(10))

## B. Basic Grouping Operations

Let's start with simple grouping operations to understand trip patterns by location.

In [0]:
# Count trips by pickup location, to show top 5 most popular pickup locations
location_counts = trips_df \
    .groupBy("pickup_zip") \
    .count() \
    .orderBy(desc("count"))

display(location_counts.limit(5))

## C. Combining Multiple Aggregations

Let's compute several aggregations per pickup location in one step using the `agg()` method.

In [0]:
# Perform multiple aggregations by location, order by most popular pickup locations
location_stats = trips_df \
    .groupBy("pickup_zip") \
    .agg(
        count("*").alias("total_trips"),
        round(avg("trip_distance"), 2).alias("avg_distance"),
        round(avg("fare_amount"), 2).alias("avg_fare"),
        round(sum("fare_amount"), 2).alias("total_fare_amt")
    ) \
    .orderBy(desc("total_trips"))

display(location_stats.limit(5))

## D. Window Functions

Now let's use window functions for more advanced analytics.

In [0]:
from pyspark.sql.window import Window

# Create window specs for different ranking methods
window_by_trips = Window.orderBy(desc("total_trips"))
window_by_fare = Window.orderBy(desc("avg_fare"))

# Add different types of rankings
ranked_locations = location_stats \
    .withColumn("trips_rank", rank().over(window_by_trips)) \
    .withColumn("fare_rank", rank().over(window_by_fare)) \
    .withColumn("fare_quintile", ntile(5).over(window_by_fare))  # Divide into 5 groups by fare
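
To make the semantics of `rank()` and `ntile()` concrete, here is a plain-Python model of how the two functions number rows. It is a sketch of the documented behavior, not Spark's implementation, and the helper names `rank_desc` and `ntile_buckets` are made up for this example.

```python
def rank_desc(values):
    """Mimic rank() over a descending order: ties share a rank, then a gap follows."""
    ordered = sorted(values, reverse=True)
    # A value's rank is 1 + the number of strictly larger values,
    # which equals the position of its first occurrence in the sorted list
    return [1 + ordered.index(v) for v in ordered]


def ntile_buckets(n_rows, n_buckets):
    """Mimic ntile(): bucket sizes differ by at most one, larger buckets first."""
    base, extra = divmod(n_rows, n_buckets)
    buckets = []
    for bucket in range(1, n_buckets + 1):
        # The first `extra` buckets each receive one additional row
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


print(rank_desc([30, 20, 20, 10]))  # -> [1, 2, 2, 4]
print(ntile_buckets(7, 5))          # -> [1, 1, 2, 2, 3, 4, 5]
```

Two consequences follow: tied averages share a `fare_rank` and the next rank is skipped, and the quintile groups are only equal in size when the row count is a multiple of 5.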

In [0]:
ranked_locations.createOrReplaceTempView("ranked_locations")

In [0]:
%sql
SELECT
  fare_quintile,
  MIN(avg_fare) AS min_avg_fare,
  MAX(avg_fare) AS max_avg_fare,
  COUNT(*) AS cnt_per_group
FROM ranked_locations
GROUP BY fare_quintile
ORDER BY fare_quintile

In [0]:
# Display the results
display(ranked_locations.select(
    "pickup_zip", 
    "total_trips", 
    "avg_fare", 
    "avg_distance",
    "trips_rank",
    "fare_rank",
    "fare_quintile"
))

## Key Takeaways

1. **Basic Grouping**
   - Use `groupBy()` followed by an aggregation method
   - Can group by multiple columns
   - Always check data distribution

2. **Window Functions**
   - Well suited to comparative analytics such as rankings and quantiles
   - Consider the performance impact, especially for windows without `partitionBy()`
   - Use an appropriate window frame

3. **Best Practices**
   - Always alias aggregated columns
   - Handle null values appropriately
   - Consider data skew in grouping keys
