In [1]:
%load_ext google.cloud.bigquery

## Aggregate analytic functions

General example of applying aggregates (here the maximum and the row count) across the whole table.

In [2]:
%%bigquery
SELECT
  MAX(duration) AS longest_duration,
  COUNT(*) AS num_trips
FROM
  dataflow-templates-327714.bigquery_examples.cycle_hire



Unnamed: 0,longest_duration,num_trips
0,2674020,24369201


### Aggregate statistic relative to each row (last 100) using WINDOW

This is a moving average of duration over the 100 rows that precede the current row when ordered by start_date; the current row itself is excluded from the frame.

In [3]:
%%bigquery
SELECT
  AVG(duration) OVER(
    ORDER BY start_date ASC
    ROWS BETWEEN 100 PRECEDING AND 1 PRECEDING) AS avg_duration
FROM
  dataflow-templates-327714.bigquery_examples.cycle_hire



Unnamed: 0,avg_duration
0,380.000000
1,720.000000
2,722.068966
3,739.148936
4,731.785714
...,...
24369196,1142.400000
24369197,1138.800000
24369198,1185.600000
24369199,1152.000000
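
To see what the frame ROWS BETWEEN 100 PRECEDING AND 1 PRECEDING does, the sketch below runs the same kind of query on a four-row toy table with the frame shrunk to 2 rows. It uses Python's built-in sqlite3 module as a local stand-in for BigQuery, which is an assumption (window functions need SQLite 3.25 or newer); the table name trips and its values are made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (start_date TEXT, duration INTEGER)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("2015-01-01", 100), ("2015-01-02", 200),
     ("2015-01-03", 300), ("2015-01-04", 400)],
)

# Same frame shape as the notebook query, shrunk from 100 rows to 2.
moving_avg = conn.execute("""
    SELECT
      start_date,
      AVG(duration) OVER (
        ORDER BY start_date ASC
        ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) AS avg_duration
    FROM trips
    ORDER BY start_date
""").fetchall()

for row in moving_avg:
    print(row)
```

The first row has an empty frame, so its average is NULL (None in Python); every later row averages only the rows before it, never itself.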


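### Aggregate statistic relative to each row (last 100) using WINDOW with partitioning

Average over the 100 rows preceding the current row, computed separately for each start_station_id. PARTITION BY does not filter rows out: every row is still returned, but its window only contains rows from the same station.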

In [4]:
%%bigquery
SELECT
  AVG(duration) OVER(
    PARTITION BY start_station_id 
    ORDER BY start_date ASC
    ROWS BETWEEN 100 PRECEDING AND 1 PRECEDING) AS avg_duration
FROM
  dataflow-templates-327714.bigquery_examples.cycle_hire



Unnamed: 0,avg_duration
0,2861.4
1,2859.0
2,2895.0
3,2874.0
4,2868.6
...,...
24369196,1108.2
24369197,969.0
24369198,895.2
24369199,885.0
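
The effect of PARTITION BY can be checked on a small scale. The sketch below is an illustration only: it uses sqlite3 as a local stand-in for BigQuery (an assumption, needing SQLite 3.25 or newer), a made-up trips table with two stations whose rows are interleaved by date, and a frame of 2 rows instead of 100.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (start_station_id INTEGER, start_date TEXT, duration INTEGER)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(1, "2015-01-01", 100), (2, "2015-01-02", 1000),
     (1, "2015-01-03", 300), (2, "2015-01-04", 2000),
     (1, "2015-01-05", 500)],
)

# Each row's frame only looks at earlier rows of the same station.
per_station = conn.execute("""
    SELECT
      start_station_id,
      start_date,
      AVG(duration) OVER (
        PARTITION BY start_station_id
        ORDER BY start_date ASC
        ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) AS avg_duration
    FROM trips
    ORDER BY start_station_id, start_date
""").fetchall()

for row in per_station:
    print(row)
```

All five rows come back, and the averages for station 1 never include the much longer durations recorded at station 2.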


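## Navigation functions, i.e. data from specific rows relative to the current row

Fetch a single value from a row identified by its position relative to the current row,
e.g. finding the "next" rental of a bike relative to the current row. With the frame ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING, LAST_VALUE returns the start_date of the following rental of the same bike; for a bike's final rental there is no following row, so it returns that rental's own start_date.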

In [5]:
%%bigquery
SELECT
  start_date,
  end_date,
  LAST_VALUE(start_date) OVER(
    PARTITION BY bike_id ORDER BY start_date ASC 
    ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS next_rental_start
FROM
  dataflow-templates-327714.bigquery_examples.cycle_hire
LIMIT
  5



Unnamed: 0,start_date,end_date,next_rental_start
0,2015-01-05 09:04:00+00:00,2015-01-05 09:18:00+00:00,2015-01-05 12:16:00+00:00
1,2015-01-05 12:16:00+00:00,2015-01-05 12:36:00+00:00,2015-01-05 17:35:00+00:00
2,2015-01-05 17:35:00+00:00,2015-01-05 17:42:00+00:00,2015-01-06 08:55:00+00:00
3,2015-01-06 08:55:00+00:00,2015-01-06 09:07:00+00:00,2015-01-06 14:42:00+00:00
4,2015-01-06 14:42:00+00:00,2015-01-06 14:57:00+00:00,2015-01-09 08:24:00+00:00
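
One detail worth checking is what happens on a bike's last rental. The sketch below compares the LAST_VALUE frame used above with LEAD, another navigation function that BigQuery also provides. It runs on sqlite3 as a local stand-in (an assumption, needing SQLite 3.25 or newer) with a made-up three-row trips table for a single hypothetical bike.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (bike_id INTEGER, start_date TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(7, "2015-01-05 09:04"), (7, "2015-01-05 12:16"), (7, "2015-01-05 17:35")],
)

# Two ways of looking one row ahead within the same bike's rentals.
next_rentals = conn.execute("""
    SELECT
      start_date,
      LAST_VALUE(start_date) OVER (
        PARTITION BY bike_id ORDER BY start_date ASC
        ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS via_last_value,
      LEAD(start_date) OVER (
        PARTITION BY bike_id ORDER BY start_date ASC) AS via_lead
    FROM trips
    ORDER BY bike_id, start_date
""").fetchall()

for row in next_rentals:
    print(row)
```

Both columns agree except on the final rental: LAST_VALUE falls back to the current row's own start_date, while LEAD yields NULL, which may be easier to tell apart from a real next rental.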


## Numbering functions (Ranking)
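e.g. ranking the trips started at each station from longest to shortest. Note that the LIMIT 5 below only truncates the output to five rows overall; it does not select the five longest trips per station.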

In [6]:
%%bigquery
SELECT
  start_station_id,
  duration,
  RANK() OVER(PARTITION BY start_station_id ORDER BY duration DESC) AS nth_longest
FROM 
  dataflow-templates-327714.bigquery_examples.cycle_hire
LIMIT
  5



Unnamed: 0,start_station_id,duration,nth_longest
0,26,267120,1
1,26,192600,2
2,26,191880,3
3,26,148500,4
4,26,142020,5
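
RANK is only one of the numbering functions, and the choice matters when durations tie. The sketch below puts RANK, DENSE_RANK and ROW_NUMBER side by side on a made-up table containing one tie. It uses sqlite3 as a local stand-in for BigQuery (an assumption, needing SQLite 3.25 or newer); the station id and durations are illustrative only.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (start_station_id INTEGER, duration INTEGER)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(26, 500), (26, 400), (26, 400), (26, 300)],  # two trips tie at 400
)

ranked = conn.execute("""
    SELECT
      duration,
      RANK()       OVER (PARTITION BY start_station_id ORDER BY duration DESC) AS rnk,
      DENSE_RANK() OVER (PARTITION BY start_station_id ORDER BY duration DESC) AS dense_rnk,
      ROW_NUMBER() OVER (PARTITION BY start_station_id ORDER BY duration DESC) AS row_num
    FROM trips
    ORDER BY duration DESC, row_num
""").fetchall()

for row in ranked:
    print(row)
```

RANK gives tied rows the same number and then skips ahead (1, 2, 2, 4), DENSE_RANK does not skip (1, 2, 2, 3), and ROW_NUMBER always counts 1, 2, 3, 4, breaking ties in an arbitrary order.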


### Finding the top 3 per station

In [7]:
%%bigquery

WITH longest_trips AS (
  SELECT
    start_station_id,
    duration,
    RANK() OVER(PARTITION BY start_station_id ORDER BY duration DESC) AS nth_longest
  FROM
    dataflow-templates-327714.bigquery_examples.cycle_hire
)
SELECT
  start_station_id,
  ARRAY_AGG(duration ORDER BY nth_longest LIMIT 3) AS durations
FROM
  longest_trips
GROUP BY
  start_station_id
LIMIT
  5



Unnamed: 0,start_station_id,durations
0,819,"[88080, 69240, 63420]"
1,56,"[246900, 229500, 201000]"
2,40,"[1129200, 435180, 306600]"
3,754,"[1041960, 262560, 214260]"
4,784,"[250320, 250200, 194880]"
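
A common variation on the query above is to filter on the rank instead of aggregating into an array. The sketch below is that swapped-in technique, not the notebook's ARRAY_AGG approach: SQLite, used here as a local stand-in for BigQuery (an assumption, needing SQLite 3.25 or newer), has no ARRAY_AGG, so the rows are grouped in Python instead. ROW_NUMBER replaces RANK so that ties cannot yield more than three rows per station. The table contents are made up.

```python
import sqlite3
from collections import defaultdict

# In-memory SQLite database as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (start_station_id INTEGER, duration INTEGER)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(1, 90), (1, 10), (1, 70), (1, 60), (2, 40), (2, 50)],
)

# Number the trips per station, then keep only the first three of each.
rows = conn.execute("""
    WITH longest_trips AS (
      SELECT
        start_station_id,
        duration,
        ROW_NUMBER() OVER (
          PARTITION BY start_station_id ORDER BY duration DESC) AS nth_longest
      FROM trips
    )
    SELECT start_station_id, duration
    FROM longest_trips
    WHERE nth_longest <= 3
    ORDER BY start_station_id, nth_longest
""").fetchall()

# Collect the durations per station, mirroring the array column above.
top3 = defaultdict(list)
for station_id, duration in rows:
    top3[station_id].append(duration)

print(dict(top3))
```

Station 2 has only two trips, so its list is shorter than three; the filter caps the count without padding it.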
