This notebook illustrates how to write regex-like patterns to match sequences of events. While I write up more about the API, see the original inspiration for this library [here](https://observablehq.com/@mikpanko/sequence-pattern-matching).

In [1]:
import awkward as ak
import pyarrow as pa
import pyarrow.parquet

In [2]:
from seqmatcher.codegen import clear_cache
from seqmatcher.matching_jitted import match_pattern, extract_pattern
from seqmatcher.parsing import parse_match_pattern
from seqmatcher.utilities import get_types_from_dataset

In [3]:
# We serialize the generated code to disk for debugging, and because it is easier to compile 
# Numba code that way. 
clear_cache()

Read the shot-by-shot tennis dataset. We view it as an `awkward array`, since that lets us run jitted 
Numba code against it.

In [4]:
dataset = pa.parquet.read_table("../datasets/tennis_shot_by_shot.parquet")
awk_data = ak.from_arrow(dataset)

# Collect the Python types of the columns in the dataset, since we need those 
# to parse the pattern.
seq_type_map, events_type_map = get_types_from_dataset(dataset)

Here we go. Let's look at how many rallies start with two serve faults in a row.

In [5]:
pat_str = "|-(:serve_fault)-(:serve_fault)"
pat = parse_match_pattern(pat_str)
pat.type_properties(seq_type_map, events_type_map)

In [6]:
res = match_pattern(pat, awk_data)
res

<Array [{seq_id: 59, evt_indices: [0, ... 1]}] type='14972 * {"seq_id": int64, "...'>

The result is an array of records, each holding a `seq_id` and the `evt_indices` for one match. Using those, we can sub-select from the input list of sequences to get the sequences that match the pattern.

The `extract_pattern` routine can be used to do this in one go. 

In [7]:
res = extract_pattern(pat, awk_data)
res

<Array [{matchId: 0, pointNumber: 60, ... ] type='14972 * {"matchId": ?int64, "p...'>
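To make the sub-selection step concrete, here is a minimal plain-Python sketch of how match records can be used to pick out sequences. It is not how seqmatcher does it and it uses neither seqmatcher nor awkward. The toy data and the helper names `select_sequences` and `select_events` are made up for illustration, and it assumes that `evt_indices` holds the positions of the matched events inside the sequence.

```python
# Toy stand-in for the dataset: each sequence is a list of event names.
sequences = [
    ["winner"],
    ["serve_fault", "serve_fault"],
    ["serve_fault", "ace"],
]

# Hypothetical match records, shaped like the seq_id / evt_indices output above.
matches = [{"seq_id": 1, "evt_indices": [0, 1]}]


def select_sequences(sequences, matches):
    """Return the full sequences referenced by the match records."""
    return [sequences[m["seq_id"]] for m in matches]


def select_events(sequences, matches):
    """Return only the matched events of each referenced sequence."""
    return [
        [sequences[m["seq_id"]][i] for i in m["evt_indices"]]
        for m in matches
    ]


matched_sequences = select_sequences(sequences, matches)
matched_events = select_events(sequences, matches)
print(matched_sequences)
print(matched_events)
```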

Another example: if a rally starts with a serve fault, is the player more likely to lose the point? Let's find out.

In [8]:
# player1 ends up as a winner
pat_str = "|-(:serve_fault)-[..]()-(:winner|:ace {isPlayer1=True})-|"
pat = parse_match_pattern(pat_str)
pat.type_properties(seq_type_map, events_type_map)

In [9]:
res = match_pattern(pat, awk_data)
res

<Array [{seq_id: 4, evt_indices: [0, ... 1]}] type='20136 * {"seq_id": int64, "e...'>

Now, the number of times they lose. 

In [10]:
# player1 ends up losing the point
pat_str = "|-(:serve_fault)-[..]()-(!:winner|:ace {isPlayer1=True})-|"
pat = parse_match_pattern(pat_str)
pat.type_properties(seq_type_map, events_type_map)

In [11]:
res = match_pattern(pat, awk_data)
res

<Array [{seq_id: 2, evt_indices: [0, ... 1]}] type='82896 * {"seq_id": int64, "e...'>

That doesn't seem right: roughly four times as many points lost as won when you fault on the first serve!

I wasn't identifying the rallies that player 1 wins correctly, since the first pattern leaves out rallies that ended with the other player making the last shot. One way to correct this would be to also match against the rallies that end with player 2, and sum up the two results.

Or we could also write some custom code to do it in one go!

In [12]:
# player 1 winning

pat_str = "|-(:serve_fault)-[..]()-(@end)-|"
code_str = "(@end.isPlayer1==True and @end._eventName in ('winner', 'ace')) or (@end.isPlayer1==False and @end._eventName not in ('winner', 'ace'))"
pat = parse_match_pattern(pat_str, code_str)
pat.type_properties(seq_type_map, events_type_map)

In [13]:
res = match_pattern(pat, awk_data)
res

<Array [{seq_id: 0, evt_indices: [0, ... 1]}] type='82393 * {"seq_id": int64, "e...'>
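The condition in `code_str` is easier to check when written as an ordinary function. The sketch below restates it in plain Python. The names `player1_wins` and `WINNING_SHOTS` are hypothetical and not part of seqmatcher; the arguments stand in for `@end._eventName` and `@end.isPlayer1`.

```python
# Event names that end the rally in favour of the player who hit the shot.
WINNING_SHOTS = ("winner", "ace")


def player1_wins(event_name, is_player1):
    """Plain-Python restatement of the 'player 1 winning' condition.

    Player 1 wins if they hit the last shot and it was a winning shot,
    or if player 2 hit the last shot and it was not a winning shot.
    """
    if is_player1:
        return event_name in WINNING_SHOTS
    return event_name not in WINNING_SHOTS


# Print the truth table for a few event names that appear in this notebook.
for name in ("winner", "ace", "serve_fault"):
    for is_p1 in (True, False):
        print(name, is_p1, player1_wins(name, is_p1))
```

Written this way, the losing condition is simply the negation of the winning one, so every rally matched by the pattern falls into exactly one of the two groups.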

Now, calculating the number of times they lost the point. 

In [14]:
pat_str = "|-(:serve_fault)-[..]()-(@end)-|"
code_str = "(@end.isPlayer1==True and @end._eventName not in ('winner', 'ace')) or (@end.isPlayer1==False and @end._eventName in ('winner', 'ace'))"
pat = parse_match_pattern(pat_str, code_str)
pat.type_properties(seq_type_map, events_type_map)

In [15]:
res = match_pattern(pat, awk_data)
res

<Array [{seq_id: 2, evt_indices: [0, ... 1]}] type='82209 * {"seq_id": int64, "e...'>

Almost the same number. So starting with a serve fault doesn't really matter much for the outcome of the point.
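For reference, the arithmetic behind both comparisons can be sketched as below. The counts are the array lengths copied from the outputs above; the variable names are just for illustration.

```python
# Array lengths copied from the outputs in this notebook.
naive_wins, naive_losses = 20136, 82896      # first attempt (cells 9 and 11)
fixed_wins, fixed_losses = 82393, 82209      # corrected version (cells 13 and 15)

# Ratio of losses to wins under the first, incorrect pair of patterns.
naive_ratio = naive_losses / naive_wins

# Share of matched rallies won by player 1 after the correction.
fixed_total = fixed_wins + fixed_losses
fixed_win_share = fixed_wins / fixed_total

print(f"naive losses per win: {naive_ratio:.2f}")
print(f"corrected win share:  {fixed_win_share:.4f}")
```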