Apply regular expressions to lists of arbitrary objects

boppreh/listregex

listregex

listregex implements the same functions as Python's stdlib re module (and a few more), but instead of operating only on strings, it operates on lists of arbitrary objects. If you've found yourself writing awkward code to extract subsequences from a list, and thought to yourself "this would be a tiny regex if my list was a string", then this is the library.

This is not a high-speed regex engine, as it currently uses naive backtracking in pure Python. On the other hand, there's greater flexibility in the patterns allowed, and even a mechanism for arbitrary tests.
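To make "naive backtracking" concrete, here is a minimal sketch of how such a matcher over lists could work. It is not listregex's actual implementation: the names `Opt`, `Rep`, `ends` and `sketch_search` are hypothetical, and only literals, sequences, optional and one-or-more repetition are covered.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Sequence


@dataclass(frozen=True)
class Opt:
    """Hypothetical marker: match `pattern` zero or one time."""
    pattern: Any


@dataclass(frozen=True)
class Rep:
    """Hypothetical marker: match `pattern` one or more times, greedily."""
    pattern: Any


def ends(pattern: Any, items: Sequence, pos: int) -> Iterator[int]:
    """Yield every index where `pattern` can stop when started at `pos`.

    Longer alternatives are yielded first, so asking the generator for its
    next value is exactly one backtracking step.
    """
    if isinstance(pattern, (list, tuple)):
        if not pattern:
            yield pos
            return
        head, rest = pattern[0], pattern[1:]
        for mid in ends(head, items, pos):
            yield from ends(rest, items, mid)
    elif isinstance(pattern, Opt):
        yield from ends(pattern.pattern, items, pos)  # try to consume first
        yield pos                                     # then try skipping
    elif isinstance(pattern, Rep):
        for mid in ends(pattern.pattern, items, pos):
            if mid > pos:  # guard against looping on zero-width matches
                yield from ends(pattern, items, mid)  # greedy: more first
            yield mid
    elif pos < len(items) and items[pos] == pattern:
        yield pos + 1  # literal value matched one item


def sketch_search(pattern: Any, items: Sequence):
    """Return the first matching sub-list, scanning start positions in order."""
    for start in range(len(items) + 1):
        for end in ends(pattern, items, start):
            return list(items[start:end])
    return None
```

Each call to `ends` explores alternatives lazily, which keeps the code short but gives the exponential worst case typical of backtracking engines.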

Patterns can be:

  • A single literal value. Example: search(pattern=1, items=[1, 2]) matches [1].
  • A list/tuple of patterns, where the sub-patterns are matched sequentially. Example: search([1, 2], [0, 1, 2]) matches [1, 2].
  • A value from a helper function, such as optional(pattern), zero_or_more(pattern), end(), etc. Example: findall(repeat(1), my_list) finds all sequences of 1's.
  • Any combination of the above. Example: search(pattern=[1, repeat(negate(3)), 1], items=[0, 2, 1, 3, 1, 2, 1, 0]) matches [1, 2, 1].
  • A function that takes one parameter, the current match, and returns the number of following items that should be added to the match (note that True == 1 and False == 0, so a boolean test works too). Returning 0 means no match and the engine backtracks. Examples:
    • lambda m: 2 blindly accepts the next two items, such that findall(lambda m: 2, items) returns the items divided in pairs.
    • lambda m: m.next % 2 == 0 checks if the next item is even, and if so, extends the match to include it.
    • lambda m: m.items.count(m.next) > 1 matches all items that occur more than once.
    • lambda m: m[0] > m.next compares the first item of the current match with the next.
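The "return how many items to consume" convention for function patterns can be illustrated without the library. The sketch below is an assumption-laden stand-in, not listregex code: `View` and `sketch_findall` are hypothetical names, only the `.items` and `.next` attributes mentioned above are modelled, and a count that runs past the end of the list is treated as no match.

```python
class View:
    """Hypothetical stand-in for the match object handed to the function."""

    def __init__(self, items, position):
        self.items = items        # the full list being searched
        self.position = position  # index of the next unconsumed item

    @property
    def next(self):
        return self.items[self.position]


def sketch_findall(fn, items):
    """Apply `fn` at each position and collect non-overlapping matches."""
    results = []
    pos = 0
    while pos < len(items):
        count = int(fn(View(items, pos)))  # True -> 1, False -> 0
        if count > 0 and pos + count <= len(items):
            results.append(list(items[pos:pos + count]))
            pos += count  # continue after the consumed items
        else:
            pos += 1      # no match here, try the next start position
    return results


pairs = sketch_findall(lambda m: 2, [1, 2, 3, 4])
evens = sketch_findall(lambda m: m.next % 2 == 0, [1, 2, 3, 4, 6])
repeated = sketch_findall(lambda m: m.items.count(m.next) > 1, [1, 2, 1, 3, 2])
```

Under these assumptions `pairs` is `[[1, 2], [3, 4]]`, `evens` is `[[2], [4], [6]]`, and `repeated` is `[[1], [2], [1], [2]]`.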
from listregex import *

# Matches 1 and 3, optionally with a 2 between them:
fullmatch([1, optional(2), 3], [1, 3])
# Match(1, 3)

# A sequence of one or more items greater than 0 and at most 3:
search(repeat(lambda m: 0 < m.next <= 3), [0, 1, 2, 3, 4])
# Match(1, 2, 3)

from datetime import date, timedelta
from collections import namedtuple
Login = namedtuple('Login', 'country date')
logins = [
    Login('Germany', date(2020, 1, 1)),
    Login('Belgium', date(2020, 1, 2)),
    Login('Germany', date(2020, 3, 1)),
    Login('Germany', date(2020, 3, 2)),
    Login('Russia', date(2020, 3, 2)),
    Login('Russia', date(2020, 3, 2)),
    Login('Germany', date(2020, 3, 3)),
]
# Find suspicious logins by looking at quick country switches:
pattern = [
    # Start from any login...
    any(),
    
    # Followed by one or more logins from a different country...
    repeat(lambda m: m[0].country != m.next.country),

    # Followed by a login from the original country within 2 days.
    lambda m: (m.next.country == m[0].country
               and m.next.date - m[0].date < timedelta(days=2)),
]
search(pattern, logins)[1].country
# 'Russia'

# Collapses repeated elements.
sub([any(), zero_or_more(lambda m: m.next == m[0])], lambda m: [m[0]], [1, 2, 3, 3, 4, 5, 5])
# [1, 2, 3, 4, 5]

# Parses a binary array where elements are encoded as [length, *values].
findall(lambda m: int(m[0])+1, b'\x00\x01\x55\x02\x66\x66\x00')
# [b'\x00', b'\x01\x55', b'\x02\x66\x66', b'\x00']

# Finds all items that are bigger than the next, or at the `end` of the list.
# Uses `lookahead` to allow the next item to also be matched.
findall([any(), either(end(), lookahead(lambda m: m[0] > m.next))], [1, 2, 1, 3, 2, 4, 3, 1])
# [[2], [3], [4], [3], [1]]
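
For comparison, the last two examples can be written by hand in plain Python, which also serves as a cross-check of the outputs shown. This sketch does not use listregex, and the function names `parse_length_prefixed` and `bigger_than_next` are made up for illustration.

```python
def parse_length_prefixed(data):
    """Split bytes encoded as [length, *values] into one chunk per element."""
    chunks = []
    pos = 0
    while pos < len(data):
        length = data[pos]  # indexing bytes yields an int
        chunks.append(data[pos:pos + 1 + length])
        pos += 1 + length
    return chunks


def bigger_than_next(items):
    """Return single-item lists for items bigger than their successor,
    plus the final item, which has no successor."""
    found = [[a] for a, b in zip(items, items[1:]) if a > b]
    if items:
        found.append([items[-1]])
    return found


chunks = parse_length_prefixed(b'\x00\x01\x55\x02\x66\x66\x00')
# [b'\x00', b'\x01U', b'\x02ff', b'\x00']

peaks = bigger_than_next([1, 2, 1, 3, 2, 4, 3, 1])
# [[2], [3], [4], [3], [1]]
```

Both loops are short here, but every new rule needs its own index bookkeeping, which is the kind of code the pattern syntax is meant to replace.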
