# Itertools groupby



## groupby

Points to remember about groupby:

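- its signature is roughly: `groupby(xs: Iterable[T], key: Callable[[T], U] | None = None) -> Iterator[tuple[U, Iterator[T]]]`
- groupby yields each group lazily, as an iterator which shares the underlying iterable with the groupby object, so each group must be consumed before the groupby object is advanced to the next group
- `groupby(xs)` returns an iterator over `(key, xs_subgroup_iterator)` tuples, where groupby itself computes the key value: it calls `key` on each item, or uses the item itself when no `key` is given
- a new group starts every time the key value changes, so only consecutive items end up in the same group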



In [1]:
import itertools

In [42]:
#lst = [random.choice(["one","two","three","four","five"]) for _ in range(20)]
lst = ['three', 'two', 'five', 'four', 'four', 'five', 'one', 'one', 'two', 'one', 'five', 'five', 'one', 'three', 'five', 'four', 'one', 'five', 'three','two']
lst

['three',
 'two',
 'five',
 'four',
 'four',
 'five',
 'one',
 'one',
 'two',
 'one',
 'five',
 'five',
 'one',
 'three',
 'five',
 'four',
 'one',
 'five',
 'three',
 'two']

### How NOT to use groupby: consume the returned iterator, then inspect subgroups

In [45]:
[x for x in itertools.groupby(lst,key=lambda v:v[0])]

[('t', <itertools._grouper at 0x7fe87a3bb3a0>),
 ('f', <itertools._grouper at 0x7fe87a3bb1f0>),
 ('o', <itertools._grouper at 0x7fe87a3baad0>),
 ('t', <itertools._grouper at 0x7fe87a3bbcd0>),
 ('o', <itertools._grouper at 0x7fe87a3bbb80>),
 ('f', <itertools._grouper at 0x7fe87a3b9270>),
 ('o', <itertools._grouper at 0x7fe87a3b82e0>),
 ('t', <itertools._grouper at 0x7fe87a3b9540>),
 ('f', <itertools._grouper at 0x7fe87a4c07f0>),
 ('o', <itertools._grouper at 0x7fe87a4c06d0>),
 ('f', <itertools._grouper at 0x7fe87a4c1f00>),
 ('t', <itertools._grouper at 0x7fe87a4c3490>)]

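The subgroups cannot be examined after the groupby iterator has been fully consumed: by then every subgroup iterator is exhausted, so even the first one yields nothing.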

In [46]:
list([x for x in itertools.groupby(lst,key=lambda v:v[0])][0][1])

[]
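
The empty result follows from the subgroup iterators and the groupby object reading from the same underlying iterator. The sketch below steps through this by hand with `next`; `words` is a short illustrative list (not the notebook's `lst`), and the variable names are made up for this example.

```python
import itertools

# `words` is a small illustrative list, not the notebook's `lst`.
words = ["three", "two", "five", "four", "one"]

grouped = itertools.groupby(words, key=lambda v: v[0])

# Take the first group and read it right away: this works.
first_key, first_group = next(grouped)
first_items = list(first_group)

# Take the second group, then advance groupby again BEFORE reading it.
second_key, second_group = next(grouped)
third_key, third_group = next(grouped)

# The second group was skipped over when groupby moved on, so it is empty now.
stale_items = list(second_group)
third_items = list(third_group)

print(first_key, first_items)
print(second_key, stale_items)
print(third_key, third_items)
```

The key of the skipped group is still available, but its items are gone, which is why each subgroup should be materialized in the same loop iteration that produced it.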

### groupby always yields a key value, computed by the key func if one is given

Materializing each subgroup inside the same comprehension shows both the key which groupby computed and the items of the subgroup:

In [48]:
[(key,list(subgroup_iter)) for (key,subgroup_iter) in itertools.groupby(lst,key=lambda v:v[0])]

[('t', ['three', 'two']),
 ('f', ['five', 'four', 'four', 'five']),
 ('o', ['one', 'one']),
 ('t', ['two']),
 ('o', ['one']),
 ('f', ['five', 'five']),
 ('o', ['one']),
 ('t', ['three']),
 ('f', ['five', 'four']),
 ('o', ['one']),
 ('f', ['five']),
 ('t', ['three', 'two'])]

To see only the subgroups, ignore the key and materialize each subgroup:

In [49]:
[list(subgroup_iter) for (key,subgroup_iter) in itertools.groupby(lst,key=lambda v:v[0])]

[['three', 'two'],
 ['five', 'four', 'four', 'five'],
 ['one', 'one'],
 ['two'],
 ['one'],
 ['five', 'five'],
 ['one'],
 ['three'],
 ['five', 'four'],
 ['one'],
 ['five'],
 ['three', 'two']]

Without a key function, groupby treats the values themselves as keys. This produces subgroups of consecutive values which compare equal under `==`:

In [50]:
[(key,list(subgroup_iter)) for (key,subgroup_iter) in itertools.groupby(lst)]

[('three', ['three']),
 ('two', ['two']),
 ('five', ['five']),
 ('four', ['four', 'four']),
 ('five', ['five']),
 ('one', ['one', 'one']),
 ('two', ['two']),
 ('one', ['one']),
 ('five', ['five', 'five']),
 ('one', ['one']),
 ('three', ['three']),
 ('five', ['five']),
 ('four', ['four']),
 ('one', ['one']),
 ('five', ['five']),
 ('three', ['three']),
 ('two', ['two'])]
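
Because groupby only merges consecutive items, the same key can appear several times in the output above. If one group per distinct key is wanted, a common approach is to sort by the same key first. A minimal sketch, where `words` and `first_letter` are illustrative names that do not appear in the notebook:

```python
import itertools

# `words` is a short illustrative list standing in for `lst`.
words = ["three", "two", "five", "four", "one", "two"]

def first_letter(word):
    """Key function: group words by their first character."""
    return word[0]

# Sort by the same key first so that items with equal keys become adjacent.
merged = [
    (key, list(group))
    for key, group in itertools.groupby(sorted(words, key=first_letter), key=first_letter)
]
print(merged)
```

Sorting costs O(n log n) and requires the whole input in memory, so this suits finite lists rather than streams.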

### How to process groupby results with loops

In [51]:
sublists = []
for (key,subgroup_iter) in itertools.groupby(lst):
    sublist = list(subgroup_iter)
    sublists.append(sublist)
sublists

[['three'],
 ['two'],
 ['five'],
 ['four', 'four'],
 ['five'],
 ['one', 'one'],
 ['two'],
 ['one'],
 ['five', 'five'],
 ['one'],
 ['three'],
 ['five'],
 ['four'],
 ['one'],
 ['five'],
 ['three'],
 ['two']]
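
The loop body does not have to build a list. As one possible variation, the sketch below counts the length of each run instead; `words` and `run_lengths` are illustrative names, not part of the notebook.

```python
import itertools

# `words` is a short illustrative list standing in for `lst`.
words = ["four", "four", "five", "one", "one", "one"]

run_lengths = []
for key, subgroup_iter in itertools.groupby(words):
    # Consume the subgroup inside the loop body, counting items
    # without building an intermediate list.
    count = sum(1 for _ in subgroup_iter)
    run_lengths.append((key, count))
print(run_lengths)
```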

## How to avoid groupby and use loops

In [55]:
sublists = []
current_sublist = None
for item in lst:
    # if a current_sublist exists and its last item == the current item
    if current_sublist is not None and current_sublist[-1] == item:
        # add the current item
        current_sublist.append(item)
    else:
        # otherwise (first iteration, or the item changed), start a new current_sublist
        current_sublist = [item]
        sublists.append(current_sublist)
sublists

[['three'],
 ['two'],
 ['five'],
 ['four', 'four'],
 ['five'],
 ['one', 'one'],
 ['two'],
 ['one'],
 ['five', 'five'],
 ['one'],
 ['three'],
 ['five'],
 ['four'],
 ['one'],
 ['five'],
 ['three'],
 ['two']]
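
The loop above compares the items themselves. One way to extend it to a key function, as groupby supports, is sketched below. `group_runs` is a hypothetical helper written for this example, not a standard library function.

```python
def group_runs(items, key=None):
    """Hypothetical helper: split `items` into lists of consecutive items with equal keys."""
    if key is None:
        # With no key function, each item acts as its own key.
        key = lambda item: item
    runs = []
    last_key = None
    for item in items:
        item_key = key(item)
        if runs and item_key == last_key:
            # Same key as the previous item: extend the current run.
            runs[-1].append(item)
        else:
            # First item, or the key changed: start a new run.
            runs.append([item])
            last_key = item_key
    return runs

print(group_runs(["three", "two", "five", "four", "one"], key=lambda v: v[0]))
```

Unlike groupby, this sketch is eager: it builds every sublist up front, so the results can be inspected in any order.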

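## Splitting a list

The same loop technique can split a list into sublists at separator items (here, empty strings), dropping the separators.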

In [59]:
import random
import string
lst2 = [random.choice(string.ascii_lowercase) for _ in range(20)]
for split_index in [random.randint(0,20-1) for _ in range(5)]:
    lst2[split_index] = ''
lst2

['c',
 'e',
 'p',
 'f',
 'c',
 '',
 '',
 'v',
 '',
 'v',
 'c',
 '',
 'd',
 'y',
 'n',
 'b',
 'o',
 'b',
 'v',
 'n']

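In [65]:
# note: the match statement requires Python 3.10 or later
sublists = []
cur_list = None  # the sublist being built, or None when between sublists
sep = ''
for ch in lst2:
    match (cur_list is None, ch != sep):
        case (True, True):
            # a non-separator item starts a new sublist
            cur_list = [ch]
            sublists.append(cur_list)
        case (True, False):
            # a separator between sublists: nothing to do
            pass
        case (False, True):
            # a non-separator item extends the current sublist
            cur_list.append(ch)
        case (False, False):
            # a separator closes the current sublist
            cur_list = None
sublists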

[['c', 'e', 'p', 'f', 'c'],
 ['v'],
 ['v', 'c'],
 ['d', 'y', 'n', 'b', 'o', 'b', 'v', 'n']]
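
The same split can also be expressed with groupby, by grouping on whether each item is the separator and keeping only the non-separator groups. A minimal sketch, where `chars` is a short illustrative list standing in for `lst2` and `pieces` is a made-up name:

```python
import itertools

sep = ""
# `chars` is a short illustrative list standing in for `lst2`.
chars = ["c", "e", "", "", "v", "", "d", "y"]

pieces = [
    list(group)  # materialize each group while it is still current
    for is_sep, group in itertools.groupby(chars, key=lambda ch: ch == sep)
    if not is_sep  # drop the runs made up of separators
]
print(pieces)
```

Consecutive separators fall into a single group, so they do not produce empty sublists, which matches the behavior of the loop above.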