generic and extensible batch operations for iterables/sliceables, and a few extra iterator utilities

bourbaki-py/iterutils

Table of Contents

  1. How to use this Module
  2. Contributing

How to use this Module

Let's say you have some data like this:


>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(10*3).reshape(10,3), columns=["a", "b", "c"])
>>> print(df)
    a   b   c
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17
6  18  19  20
7  21  22  23
8  24  25  26
9  27  28  29

If you've ever written code like this (most likely on several occasions):


>>> chunksize = 3
>>> nchunks = df.shape[0] // chunksize
>>> if df.shape[0] % chunksize != 0:
...     nchunks += 1
...
>>> for i in range(nchunks):
...     chunk = df.iloc[chunksize*i:chunksize*(i+1)]
...     # do something with `chunk`
...     print(chunk, end='\n')
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8
    a   b   c
3   9  10  11
4  12  13  14
5  15  16  17
    a   b   c
6  18  19  20
7  21  22  23
8  24  25  26
    a   b   c
9  27  28  29

Then this module is for you! After you install this package, you can accomplish the same task by writing instead:


>>> from bourbaki.iterutils import batched
>>>
>>> for chunk in batched(df, 3):
...     # do something with chunk
...     print(chunk, end='\n')

This produces the same output as the hand-written loop.

In addition to batched, there is even_batched, which splits the collection into a given number of chunks whose sizes differ by at most one (every chunk has size n or n+1 for some n):


>>> from bourbaki.iterutils import even_batched
>>>
>>> for chunk in even_batched(df, nchunks=3):
...     # do something with chunk
...     print(chunk, end='\n')
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
    a   b   c
4  12  13  14
5  15  16  17
6  18  19  20
    a   b   c
7  21  22  23
8  24  25  26
9  27  28  29
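
The "n or n+1" sizing comes down to a single divmod. The following is a minimal stand-alone sketch of that arithmetic, not the library's implementation; the helper name `even_chunk_sizes` is hypothetical, and giving the extra items to the first chunks is an assumption based on the output above.

```python
def even_chunk_sizes(total, nchunks):
    """Split `total` items into `nchunks` sizes that differ by at most one."""
    if nchunks <= 0:
        raise ValueError("nchunks must be positive")
    base, extra = divmod(total, nchunks)
    # Assumed ordering: the first `extra` chunks each get one additional item.
    return [base + 1 if i < extra else base for i in range(nchunks)]


sizes = even_chunk_sizes(10, 3)
print(sizes)  # [4, 3, 3]
```

For the 10-row frame and nchunks=3 this gives sizes of 4, 3, and 3, matching the chunks printed above.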

You can batch just about any iterable, iterator, or sliceable collection. I demonstrate pandas.DataFrame here because it shows off the extensibility of this module: the naive slice operation, df[a:b], doesn't behave as you would expect; you have to access integer-row slices with the df.iloc accessor to achieve the desired result.

If you have your own collection type that iterutils doesn't yet know how to slice into chunks, you can register a conversion for it (with pandas.DataFrame as an example case):


from bourbaki.iterutils import to_sliceable
to_sliceable.register(pd.DataFrame, lambda df: df.iloc)
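
The register(type, function) call has the same shape as the one offered by functools.singledispatch in the standard library. The sketch below illustrates that general pattern only; whether iterutils is built this way is an assumption, and `Table` and `as_sliceable` are hypothetical names that are not part of the package.

```python
from functools import singledispatch


class Table:
    """Hypothetical collection whose rows are only reachable via `.rows`."""

    def __init__(self, rows):
        self.rows = list(rows)


@singledispatch
def as_sliceable(obj):
    # Default case: assume the object already supports obj[a:b].
    return obj


# Register a converter for Table, similar in spirit to
# to_sliceable.register(pd.DataFrame, lambda df: df.iloc).
as_sliceable.register(Table, lambda table: table.rows)

view = as_sliceable(Table(range(10)))
print(view[3:6])  # [3, 4, 5]
```

The converter only has to return something that can be sliced with integers; the batching code never needs to know about the original type.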

Or, if you have a collection that is already integer-sliceable and iterutils just doesn't know about it yet:


from bourbaki.iterutils import Sliceable
Sliceable.register(np.ndarray)

(NumPy arrays are already registered, but this is how you would do it with another collection type.)
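
A register call that takes only a type looks like the virtual-subclass registration provided by Python's abstract base classes. The following stand-alone sketch shows that mechanism with the standard abc module; it assumes, without confirmation from the source, that Sliceable works along these lines, and `IntSliceable` is a hypothetical name.

```python
from abc import ABC


class IntSliceable(ABC):
    """Hypothetical marker class for types that support obj[a:b]."""


# Declare existing types as virtual subclasses without modifying them.
IntSliceable.register(list)
IntSliceable.register(tuple)

print(isinstance([1, 2, 3], IntSliceable))        # True
print(isinstance(iter([1, 2, 3]), IntSliceable))  # False
```

Registration changes only what isinstance and issubclass report; it adds no methods, so the registered type must really support integer slicing.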

Finally, if what you have is a generator or iterator, such as a map or filter, which is only iterable and not sliceable, and you just want to break it into pieces without boilerplate, you can use batched without registering any types. You can also use even_batched, but only if you know the total length of the iterator and pass it as the len_ keyword argument, which is needed to compute the chunk sizes.

Note that in the case of types which are not known to be Sliceable or coercible to a sliceable view via to_sliceable, you will always get an iterator of simple lists back from batched or even_batched; the default behavior in that case is just to iterate over the collection and collect the chunks dynamically into lists. So the advantage of registering a type as sliceable is not only computational (less memory overhead and lower runtime due to less copying), but also semantic - you can ensure that the type of your chunks is the same as the type of the collection being chunked by specifying how the slicing will be performed.
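
The list-collecting fallback described above can be approximated in a few lines with itertools.islice. This is a minimal sketch of the idea, not the package's own code; the function name `batched_lists` is hypothetical.

```python
from itertools import islice


def batched_lists(iterable, n):
    """Yield successive lists of up to `n` items from any iterable."""
    if n < 1:
        raise ValueError("n must be at least 1")
    it = iter(iterable)
    while True:
        # Copy the next `n` items into a fresh list; stop when exhausted.
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk


squares = map(lambda x: x * x, range(7))
print(list(batched_lists(squares, 3)))  # [[0, 1, 4], [9, 16, 25], [36]]
```

Every chunk here is a plain list regardless of the input type, which is the semantic cost that registering a sliceable type avoids.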

Contributing

See something missing here that you think should be added? Put in a pull request or contact the maintainer!

Matthew Hawthorn
