In [None]:
from fastai.vision import *
path = untar_data(URLs.MNIST_TINY)
tfms = get_transforms(do_flip=False)
#path.ls()

# Extending DataBlock API

## The problem

I like the current DataBlock API. It lets you do a lot right out of the box and saves time on writing data preprocessing pipelines, which tend to be very similar from task to task. On top of that, it has a nice, readable API.

I've used it several times in Kaggle competitions and applied it to some toy problems, and I found a few things that could be improved or extended to make the DataBlock API more flexible.

But first, let's review a few problematic points:

- Right now, turning your data files, DataFrame, or CSV into a `DataBunch` is a monolithic process. You can't split it into several stages with the current API. You can if you really want to, but it's neither comfortable nor particularly useful. You can't efficiently change a few parameters and derive a new `DataBunch` from a previous one without recomputing everything from scratch.

In [None]:
data = (ImageList.from_folder(path) #Where to find the data? -> in path and its subfolders
    .split_by_folder()              #How to split in train/valid? -> use the folders
    .label_from_folder()            #How to label? -> depending on the folder of the filenames
    .add_test_folder()              #Optionally add a test set (here default name is test)
    .transform(tfms, size=64)       #Data augmentation? -> use tfms with a size of 64
    .databunch())                   #Finally? -> use the defaults for conversion to ImageDataBunch

- Almost every step of the DataBlock API returns an intermediate entity or mutates the previous state. These entities expose different sets of methods for continuing the transformation into a `DataBunch`. So you can't change the order of the steps, and you have to keep that order, the upcoming steps, and the public API of each specific entity in your head. Also, some steps seem to be in the wrong place. For example, why can we add a test folder only after labeling the data? It would make more sense to do that at the `split_by_folder` step.

In [None]:
def info(obj):
    attrs = len(dir(obj))  # number of attributes and methods exposed by obj
    print('attrs:', attrs, type(obj))

step = ImageList.from_folder(path); info(step)   #Where to find the data? -> in path and its subfolders
step = step.split_by_folder(); info(step)        #How to split in train/valid? -> use the folders
step = step.label_from_folder(); info(step)      #How to label? -> depending on the folder of the filenames
step = step.add_test_folder(); info(step)        #Optionally add a test set (here default name is test)
step = step.transform(tfms, size=64); info(step) #Data augmentation? -> use tfms with a size of 64
step = step.databunch(); info(step)              #Finally? -> use the defaults for conversion to ImageDataBunch

attrs: 92 <class 'fastai.vision.data.ImageList'>
attrs: 44 <class 'fastai.data_block.ItemLists'>
attrs: 53 <class 'fastai.data_block.LabelLists'>
attrs: 53 <class 'fastai.data_block.LabelLists'>
attrs: 53 <class 'fastai.data_block.LabelLists'>
attrs: 73 <class 'fastai.vision.data.ImageDataBunch'>


- Every step is evaluated eagerly, as soon as it is called. So if, for example, you changed a config variable by mistake between steps, or used the wrong method, you'll need to start from scratch to get the result you want.

In [None]:
step = (ImageList.from_folder(path)
    .split_by_rand_pct(valid_pct=0.2))
# ... some experiments later, you decide you need a different split percentage
# Uncomment the line below to see the error
# step.split_by_rand_pct(valid_pct=0.4)

- There is no easy way to inspect the current state of a `DataBunch`: where its data came from, what percentage was used in the split, whether the split was done by percentage or by folder, and so on. Each setting matters only at its own step, and only a few of them can be recovered from the final `DataBunch` object. So to reproduce a particular `DataBunch` we need the whole original code that created it, plus the state surrounding that code.

In [None]:
VAL_PCT = 0.2
data = (ImageList.from_folder(path)
        .split_by_rand_pct(valid_pct=VAL_PCT)
        .label_from_folder()
        .add_test_folder()
        .transform(tfms, size=64)
        .databunch())
# ... a lot of experimenting, changing the VAL_PCT value, and so on
# At this point we can't tell from `data` which settings were used, for example what `valid_pct` was set to

- Extending the current API is not trivial. You need to know the exact class you want to add a new method to, and you need to think about how it affects the next class/state, which methods might be incompatible with the new step, and so on.

## DataBlock API extension prototype as solution

I propose improving the current API by making it more declarative. There are several ways to do this, from rewriting the whole API on a new set of abstractions to wrapping the current API so that end users get a better experience while all the proven logic stays almost untouched. Below (or, if this is an article, in the git repo) I try to create a prototype of this declarative DataBlock API. To reduce possible conflicts, I named the main class that works with data blocks `DataChain`.
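
To make the declarative idea concrete, here is a minimal sketch that does not depend on fastai. `ToyItems` and `ToyChain` are hypothetical names invented for this illustration only; they are not part of the prototype. The chain records each step as a method name plus keyword arguments and runs the steps only when `build` is called.

```python
from collections import OrderedDict

class ToyItems:
    """Hypothetical stand-in for an item list (not a fastai class)."""
    def __init__(self, items):
        self.items = list(items)

    def keep_every(self, n):
        # Keep every n-th item
        return ToyItems(self.items[::n])

    def split_at(self, valid_pct):
        # Put the last `valid_pct` share of the items into the validation part
        n_valid = round(len(self.items) * valid_pct)
        cut = len(self.items) - n_valid
        return self.items[:cut], self.items[cut:]

class ToyChain:
    """Records steps as (method name, kwargs); nothing runs until `build`."""
    def __init__(self, source):
        self.source = source
        self.steps = OrderedDict()

    def set_step(self, name, fn, **kwargs):
        # Re-using an existing name replaces that step but keeps its position
        self.steps[name] = (fn, kwargs)
        return self

    def build(self):
        result = self.source
        for fn, kwargs in self.steps.values():
            result = getattr(result, fn)(**kwargs)
        return result

chain = (ToyChain(ToyItems(range(10)))
         .set_step('filter', 'keep_every', n=1)
         .set_step('split', 'split_at', valid_pct=0.2))
train, valid = chain.build()

# Change one recorded setting and rebuild without rewriting the chain
chain.set_step('split', 'split_at', valid_pct=0.4)
train2, valid2 = chain.build()

# The settings that produced the result stay inspectable
print(chain.steps['split'])
```

Because the settings live on the chain rather than in surrounding variables, a single step can be replaced and the result rebuilt, and it is always possible to look up which settings were used.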

In [None]:
# Imports
import inspect
from collections import OrderedDict
from fastai.data_block import PreProcessors # I don't know why it doesn't import within fastai.vision.*

# Util functions
# Pretty print for dictionaries, specific for blocks I'm implementing below
def pp(d, indent=4, ljust=12, skip_none=True):
    res = []
    for key, value in d.items():
        if skip_none and value is None: continue
        val = value.__name__ if inspect.isclass(value) else value
        val = f'DataFrame {val.shape}' if isinstance(val, DataFrame) else val
        res.append(f'{" " * indent}{(str(key) + ":").ljust(ljust)}\t{val}')
    return "\n".join(res)

An abstract data block class. `DataChain` will assemble the final `DataBunch` from subclasses of this block. Each block stores its own state. When needed, the `assemble` function is called and the block applies its state to the previous block's result (its `assembly`) to produce the current step's `assembly`. It's far from perfect and can be improved later: for example, we could add hooks that detect changes in the block's state or in the previous block and flush the cache. For now we'll do that by hand, as it's not really important for the prototype and it's part of the private API.

In [None]:
class Block():
    """
    An abstract data block class that all block classes should extend.

    As I see it, all small blocks should be part of the private API and users should not change them directly.
    There will be a meta object which creates those blocks, sets them up, sorts them in the right order
    and, when needed, calls `assemble` on each of them and uses the result for the next blocks.

    Every block should have a previous block that it is based on.
    """
    def __init__(self, prev_block=None, assemble_fn=None, **kwargs):
        self.prev_block = prev_block
        self.assemble_fn = assemble_fn
        self.settings = kwargs
        self.assembly = None

    def _short_repr(self):
        assembled = self.assembly is not None
        return f'{self.__class__.__name__}{" (Assembled ✔)" if assembled else " (Assembled ✘)"}'

    def __repr__(self):
        "Standard method which we'll be using for status representation in metaobject"
        res = []
        res.append(f'{self._short_repr()}')
        if isinstance(self.prev_block, Block):
            res.append(f'{pp({"prev_block": self.prev_block._short_repr()})}')
        res.append(f'{pp({"assemble_fn": self.assemble_fn})}')
        res.append(f'{pp(self.settings)}')
        res = list(filter(None, res))
        return "\n".join(res)
    
    def assemble(self):
        "Will be called when we need actual result of block logic. It will cache its results"
        if self.assembly is not None: return self.assembly
        self.validate()
        return self._assemble()
    
    def _assemble(self):
        "Real implementation of assemble method which will be called from `assemble` if it's not already assembled"
        self.assembly = getattr(self.prev_block.assembly, self.assemble_fn)(**self.settings)
        return self.assembly

    def reassemble(self):
        "Will be called when something above in blocks chain was changed and we need to regenerate result"
        self.assembly = None
        return self.assemble()
    
    def validate(self):
        "Checks that every setting needed for assembly is present"
        assert self.prev_block is not None, 'Every block in chain should have prev block. If it is first block - provide specific InputBlock'
        assembly = self.prev_block.assembly
        assert assembly is not None, 'Prev block should be assembled before we assemble this one'
        afn = self.assemble_fn
        assert afn is not None, 'You need to provide `assemble_fn` so the block knows how to assemble from prev_block'
        assert hasattr(assembly, afn), f'Class {assembly} doesn\'t have method `{afn}`'
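
The caching behaviour of `assemble` and `reassemble` can be seen in isolation with a toy stand-in. This is only a sketch: `ToyBlock` and its `calls` counter are hypothetical and exist just for this illustration, and a plain string replaces the previous block's assembly.

```python
class ToyBlock:
    """Hypothetical, simplified stand-in for `Block`: caches the result of one method call."""
    def __init__(self, target, fn_name, **settings):
        self.target, self.fn_name, self.settings = target, fn_name, settings
        self.assembly = None
        self.calls = 0  # counts how often the real work was done

    def assemble(self):
        # Serve the cached result if there is one
        if self.assembly is not None: return self.assembly
        self.calls += 1
        self.assembly = getattr(self.target, self.fn_name)(**self.settings)
        return self.assembly

    def reassemble(self):
        # Drop the cache and compute again with the current settings
        self.assembly = None
        return self.assemble()

block = ToyBlock("a,b,c", "split", sep=",")
first = block.assemble()
second = block.assemble()       # comes from the cache, no extra call
block.settings["sep"] = "b"     # changing a setting does not flush the cache by itself
stale = block.assemble()
fresh = block.reassemble()      # the manual flush mentioned above
print(first, stale, fresh, block.calls)
```

One consequence of checking `assembly is not None` is that a step whose method returns `None` would never be treated as cached; for the fastai methods used here that should not matter, since each step returns an object.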

### Base blocks

The first step is to understand what kind of task you want to solve and what kind of data you plan to work with. In the original DataBlock API you use a specific `ItemList` class itself as the starting point, but here, since we're doing things the declarative way, we wrap this step in a block.

In [None]:
class InputBlock(Block):
    """
    Special kind of block which cannot actually be assembled: it only holds the base item list
    class to start working with.
    """
    def __init__(self, item_list):
        self.prev_block, self.assemble_fn, self.settings = None, None, dict()
        self.assembly = item_list
    def _short_repr(self):
        return f'{self.__class__.__name__}({self.assembly.__name__}) (Assembled ✔)'
    def _assemble(self): return self.assembly
    def reassemble(self): return self.assembly

    def validate(self):
        assert issubclass(self.assembly, ItemList), 'For now item list class for input block can be only subclass of ItemList'

In [None]:
class IdentityBlock(Block):
    """
    Special kind of block which will do nothing. It will propagate prev_block.assembly to self.assembly.
    Not used for now.
    """
    def _assemble(self):
        self.assembly = self.prev_block.assembly
        return self.assembly

    def validate(self):
        "Checks that every setting needed for assembly is present"
        assert self.prev_block is not None, 'Every block in chain should have prev block. If it is first block - provide specific InputBlock'
        assembly = self.prev_block.assembly
        assert assembly is not None, 'Prev block should be assembled before we assemble this one'

In [None]:
input_block = InputBlock(ItemList)
input_block.validate()
input_block

InputBlock(ItemList) (Assembled ✔)

In [None]:
identity_block = IdentityBlock(prev_block=input_block)
identity_block.assemble()
identity_block

IdentityBlock (Assembled ✔)
    prev_block: 	InputBlock(ItemList) (Assembled ✔)

### Source blocks

The second step in creating a `DataBunch` is setting the data source. Right now fastai gives you three options: `from_folder`, `from_df` and `from_csv`. Below is an implementation of a `SourceBlock` class that ensures we have provided every required argument for getting data from a specific source.

In [None]:
class SourceBlock(Block):
    """
    Source block
    """
    
    ASSEMBLE_FNS = ['from_folder', 'from_df', 'from_csv']
    def validate(self):
        super(SourceBlock, self).validate()
        assert self.assemble_fn in self.ASSEMBLE_FNS, f'You need to provide `assemble_fn` one of {self.ASSEMBLE_FNS}'
        if self.assemble_fn == 'from_folder':
            assert isinstance(self.settings.get('path', None), PathOrStr.__args__), f'To create item list from folder you should provide `path` arg'
        elif self.assemble_fn == 'from_df':
            assert isinstance(self.settings.get('df', None), DataFrame), f'To create item list from df you should provide `df` arg'
        else:
            assert isinstance(self.settings.get('path', None), PathOrStr.__args__), f'To create item list from csv you should provide `path` arg'
            assert isinstance(self.settings.get('csv_name', None), str), f'To create item list from csv you should provide `csv_name` arg'
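
As a possible alternative to the `if`/`elif` chain in `validate`, the required settings could be kept in a lookup table. The sketch below is standalone and hypothetical: `REQUIRED_SETTINGS` and `check_settings` are names made up for this example, and `from_df` is left out so that the snippet needs nothing beyond the standard library.

```python
from pathlib import Path

# Hypothetical table: assemble_fn -> {setting name: accepted types}
REQUIRED_SETTINGS = {
    'from_folder': {'path': (str, Path)},
    'from_csv':    {'path': (str, Path), 'csv_name': (str,)},
}

def check_settings(assemble_fn, settings):
    "Return a list of problems; an empty list means the settings look valid."
    if assemble_fn not in REQUIRED_SETTINGS:
        return [f'unknown assemble_fn `{assemble_fn}`']
    problems = []
    for name, types in REQUIRED_SETTINGS[assemble_fn].items():
        if not isinstance(settings.get(name), types):
            expected = ' or '.join(t.__name__ for t in types)
            problems.append(f'`{assemble_fn}` needs `{name}` ({expected})')
    return problems

print(check_settings('from_folder', {'path': '/data'}))
print(check_settings('from_csv', {'path': Path('.')}))
print(check_settings('from_url', {}))
```

Collecting all problems instead of failing on the first assertion would let a chain report everything that is missing in one go, at the cost of a little more code per block.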

In [None]:
block = SourceBlock(prev_block=input_block, assemble_fn='from_df', df=pd.DataFrame([1]), path='/path')
#block.prev_block = ItemList
block.validate()
print(block)
block.assemble()
print(block)
block.assemble()
print(block)
block.settings['df'] = pd.DataFrame([[1,2,3],[4,5,6]])
block.reassemble()
print(block)

SourceBlock (Assembled ✘)
    prev_block: 	InputBlock(ItemList) (Assembled ✔)
    assemble_fn:	from_df
    df:         	DataFrame (1, 1)
    path:       	/path
SourceBlock (Assembled ✔)
    prev_block: 	InputBlock(ItemList) (Assembled ✔)
    assemble_fn:	from_df
    df:         	DataFrame (1, 1)
    path:       	/path
SourceBlock (Assembled ✔)
    prev_block: 	InputBlock(ItemList) (Assembled ✔)
    assemble_fn:	from_df
    df:         	DataFrame (1, 1)
    path:       	/path
SourceBlock (Assembled ✔)
    prev_block: 	InputBlock(ItemList) (Assembled ✔)
    assemble_fn:	from_df
    df:         	DataFrame (2, 3)
    path:       	/path


### Filter blocks

The third step, which is optional, is filtering.

In [None]:
class FilterBlock(Block):
    """
    Filter block is an optional kind of block. Will be assembled after source block and before
    splitting block.
    """
    
    ASSEMBLE_FNS = ['filter_by_func', 'filter_by_folder', 'filter_by_rand']
    def validate(self):
        super(FilterBlock, self).validate()
        assert self.assemble_fn in self.ASSEMBLE_FNS, f'You need to provide `assemble_fn` one of {self.ASSEMBLE_FNS}'
        if self.assemble_fn == 'filter_by_func':
            # `Callable.__args__` is not a usable type tuple for isinstance, so use the builtin `callable`
            assert callable(self.settings.get('func', None)), f'To filter item list by func you should provide `func` arg'
        elif self.assemble_fn == 'filter_by_rand':
            assert isinstance(self.settings.get('p', None), (int, float)), f'To filter item list randomly you should provide `p` arg'

### Split blocks

The next step is splitting the data. fastai offers a number of methods for this.

In [None]:
class SplitBlock(Block):
    """
    Split block is a required block. Will be assembled after filter block and before
    label block.
    """

    ASSEMBLE_FNS = ["split_none", "split_by_list", "split_by_idxs", "split_by_idx", "split_by_folder",
                    "split_by_rand_pct", "split_subsets", "split_by_valid_func", "split_by_files",
                    "split_by_fname_file", "split_from_df"]
    def validate(self):
        super(SplitBlock, self).validate()
        assert self.assemble_fn in self.ASSEMBLE_FNS, f'You need to provide `assemble_fn` one of `SplitBlock.ASSEMBLE_FNS`'
        if self.assemble_fn == 'split_by_list':
            assert self.settings.get('train', None) is not None, f'To split item list by list you should provide `train` list arg'
            assert self.settings.get('valid', None) is not None, f'To split item list by list you should provide `valid` list arg'
        elif self.assemble_fn == 'split_by_idxs':
            assert self.settings.get('train_idx', None) is not None, f'To split item list by idxs you should provide `train_idx` list arg'
            assert self.settings.get('valid_idx', None) is not None, f'To split item list by idxs you should provide `valid_idx` list arg'
        elif self.assemble_fn == 'split_by_idx':
            # `Collection[int].__args__` is `(int,)`, which would only accept a single int, so check for a collection instead
            assert isinstance(self.settings.get('valid_idx', None), Collection), f'To split item list by idx you should provide `valid_idx` list of ints arg'
        elif self.assemble_fn == 'split_subsets':
            assert isinstance(self.settings.get('train_size', None), float), f'To split item list by subsets you should provide `train_size` float arg'
            assert isinstance(self.settings.get('valid_size', None), float), f'To split item list by subsets you should provide `valid_size` float arg'
        elif self.assemble_fn == 'split_by_valid_func':
            assert callable(self.settings.get('func', None)), f'To split item list by valid func you should provide `func` arg'
        elif self.assemble_fn == 'split_by_files':
            assert isinstance(self.settings.get('valid_names', None), ItemList), f'To split item list by files you should provide `valid_names` item list arg'
        elif self.assemble_fn == 'split_by_fname_file':
            assert isinstance(self.settings.get('fname', None), PathOrStr.__args__), f'To split item list by fname file you should provide `fname` arg'

### Label blocks

In [None]:
class LabelBlock(Block):
    """
    Label block is a required block. Will be assembled after split block and before
    process/transform block.
    """

    ASSEMBLE_FNS = ["label_from_df", "label_const", "label_empty", "label_from_func", "label_from_folder",
                    "label_from_re", "label_from_lists"]
    def validate(self):
        super(LabelBlock, self).validate()
        assert self.assemble_fn in self.ASSEMBLE_FNS, f'You need to provide `assemble_fn` one of {self.ASSEMBLE_FNS}'
        if self.assemble_fn == 'label_from_func':
            assert callable(self.settings.get('func', None)), f'To label item list by func you should provide `func` arg'
        elif self.assemble_fn == 'label_from_re':
            assert isinstance(self.settings.get('pat', None), str), f'To label item list by re you should provide `pat` arg'
        elif self.assemble_fn == 'label_from_lists':
            assert self.settings.get('train_labels', None) is not None, f'To label item list by lists you should provide `train_labels` arg'
            assert self.settings.get('valid_labels', None) is not None, f'To label item list by lists you should provide `valid_labels` arg'

### Preprocess blocks

I don't yet fully understand when and under what conditions a user can call `process` in the pipeline; as far as I can see, it's done automatically. So I'm skipping this step for now.

### Transform blocks

In [None]:
class TransformBlock(Block):
    """
    Transform block is an optional block. Will be assembled after label block and before
    test set/databunch block.
    """

    ASSEMBLE_FNS = ["transform", "transform_y"]
    def validate(self):
        super(TransformBlock, self).validate()
        assert self.assemble_fn in self.ASSEMBLE_FNS, f'You need to provide `assemble_fn` one of {self.ASSEMBLE_FNS}'

### Test setting block (?)

I've singled it out into a separate block because it's currently a separate step in the data pipeline.

In [None]:
class TestSetBlock(Block):
    """
    TestSet block is an optional block. Will be assembled after transform block and before
    databunch block.
    """

    ASSEMBLE_FNS = ["add_test", "add_test_folder"]
    def validate(self):
        super(TestSetBlock, self).validate()
        assert self.assemble_fn in self.ASSEMBLE_FNS, f'You need to provide `assemble_fn` one of {self.ASSEMBLE_FNS}'
        if self.assemble_fn == 'add_test':
            # `Iterator.__args__` is not a usable type tuple for isinstance, so only check that `items` was given
            assert self.settings.get('items', None) is not None, f'To add test set you should provide `items` arg'

### DataChain and the DataBunch step

In [None]:
class DataChain():
    """
    Meta object for data blocks.
    
    Usage:
    bunch = DataChain(ImageList)
    bunch.from_folder(path)
    bunch.split_by_rand_pct(valid_pct=0.3)
    ...
    data = bunch.databunch()
    # => Will sort all existing blocks, validate that the required ones are present and assemble all of them
    # => to produce the final databunch.
    """
    
    DEFAULTS = [
        ('input',      None),
        ('source',     None),
        ('filter',     None),
        ('split',      None),
        ('label',      None),
        #('preprocess', None),
        ('transforms', None),
        ('test_set',   None),
        ('databunch',  None)
    ]
    REQUIRED = ['input', 'source', 'split', 'label']
    PREV_BLOCKS = {
        'input':      None,
        'source':     ['input'],
        'filter':     ['source'],
        'split':      ['source', 'filter'],
        'label':      ['split'],
        'transforms': ['label'],
        'test_set':   ['label', 'transforms'],
        'databunch':  ['label', 'transforms', 'test_set']
    }
    def __init__(self, item_list=None):
        self.blocks = OrderedDict(self.DEFAULTS)
        if item_list is not None: self._input(item_list)
        
    def __repr__(self):
        "Gather all the blocks including default and show their representations"
        return f'{self.__class__.__name__}\n{pp(self.blocks, indent=2, skip_none=False)}\n'

    def __find_prev_block__(self, block_name):
        if self.PREV_BLOCKS[block_name] is None: return
        for pb in reversed(self.PREV_BLOCKS[block_name]):
            if self.blocks[pb] is not None: return self.blocks[pb]
        return
        
    def __validate__(self):
        for name, blk in self.blocks.items():
            if name in self.REQUIRED:
                assert blk is not None, f'You don\'t have required `{name}` block'
    
    def __assemble_all__(self):
        for name, blk in self.blocks.items():
            if blk is None: continue
            pb = self.__find_prev_block__(name)
            if pb is not None: blk.prev_block = pb
            blk.reassemble()

    ## InputBlock methods
    def _input(self, item_list_class:ItemList):
        "Create input block from item list subclass"
        assert issubclass(item_list_class, ItemList), 'DataChain._input supports only ItemList subclasses'
        self.blocks['input'] = InputBlock(item_list_class)
        return self
    
    ## SourceBlock methods
    def _source(self, assemble_fn:str, **kwargs):
        "Base method for source block. Useful when you want to use config instead of writing every parameter by hand"
        self.blocks['source'] = SourceBlock(prev_block=self.blocks['input'], assemble_fn=assemble_fn, **kwargs)
        return self

    # Methods below are just wrappers around the self._source method. They have the same argument lists
    # as the original fastai methods. I've added them to make this API fully compatible with the current fastai API
    def from_folder(self, path:PathOrStr, extensions:Collection[str]=None, recurse:bool=True,
                    include:Optional[Collection[str]]=None, processor:PreProcessors=None, **kwargs)->'ItemList':
        """Create an `ItemList` in `path` from the filenames that have a suffix in `extensions`.
        `recurse` determines if we search subfolders."""
        return self._source(assemble_fn='from_folder', path=path, extensions=extensions, recurse=recurse,
                           include=include, processor=processor, **kwargs)
    def from_df(self, df:DataFrame, path:PathOrStr='.', cols:IntsOrStrs=0, processor:PreProcessors=None, **kwargs)->'ItemList':
        "Create an `ItemList` in `path` from the inputs in the `cols` of `df`."
        return self._source(assemble_fn='from_df', df=df, path=path, cols=cols, processor=processor, **kwargs)
    def from_csv(self, path:PathOrStr, csv_name:str, cols:IntsOrStrs=0, delimiter:str=None, header:str='infer', 
                 processor:PreProcessors=None, **kwargs)->'ItemList':
        "Create an `ItemList` in `path` from the inputs in the `cols` of `path/csv_name`"
        return self._source(assemble_fn='from_csv', path=path, csv_name=csv_name, cols=cols, delimiter=delimiter,
                           header=header, processor=processor, **kwargs)
    
    ## FilterBlock methods
    def _filter(self, assemble_fn:str, **kwargs):
        "Base method for filter block"
        self.blocks['filter'] = FilterBlock(prev_block=self.blocks['source'], assemble_fn=assemble_fn, **kwargs)
        return self
    
    # Methods below are just wrappers around the self._filter method. They have the same argument lists
    # as the original fastai methods. I've added them to make this API fully compatible with the current fastai API
    def filter_by_func(self, func:Callable)->'ItemList':
        "Only keep elements for which `func` returns `True`."
        return self._filter(assemble_fn='filter_by_func', func=func)
    def filter_by_folder(self, include=None, exclude=None):
        "Only keep filenames in `include` folder or reject the ones in `exclude`."
        return self._filter(assemble_fn='filter_by_folder', include=include, exclude=exclude)
    def filter_by_rand(self, p:float, seed:int=None):
        "Keep random sample of `items` with probability `p` and an optional `seed`."
        return self._filter(assemble_fn='filter_by_rand', p=p, seed=seed)
    
    ## SplitBlock methods
    def _split(self, assemble_fn:str, **kwargs):
        "Base method for split block"
        self.blocks['split'] = SplitBlock(prev_block=self.blocks['filter'], assemble_fn=assemble_fn, **kwargs)
        return self

    # Methods below are just wrappers around the self._split method. They have the same argument lists
    # as the original fastai methods. I've added them to make this API fully compatible with the current fastai API
    def split_none(self):
        "Don't split the data and create an empty validation set."
        return self._split(assemble_fn='split_none')
    def split_by_list(self, train, valid):
        "Split the data between `train` and `valid`."
        return self._split(assemble_fn='split_by_list', train=train, valid=valid)
    def split_by_idxs(self, train_idx, valid_idx):
        "Split the data between `train_idx` and `valid_idx`."
        return self._split(assemble_fn='split_by_idxs', train_idx=train_idx, valid_idx=valid_idx)
    def split_by_idx(self, valid_idx:Collection[int])->'ItemLists':
        "Split the data according to the indexes in `valid_idx`."
        return self._split(assemble_fn='split_by_idx', valid_idx=valid_idx)
    def split_by_folder(self, train:str='train', valid:str='valid')->'ItemLists':
        "Split the data depending on the folder (`train` or `valid`) in which the filenames are."
        return self._split(assemble_fn='split_by_folder', train=train, valid=valid)
    def split_by_rand_pct(self, valid_pct:float=0.2, seed:int=None)->'ItemLists':
        "Split the items randomly by putting `valid_pct` in the validation set, optional `seed` can be passed."
        return self._split(assemble_fn='split_by_rand_pct', valid_pct=valid_pct, seed=seed)
    def split_subsets(self, train_size:float, valid_size:float, seed=None) -> 'ItemLists':
        "Split the items into train set with size `train_size * n` and valid set with size `valid_size * n`."
        return self._split(assemble_fn='split_subsets', train_size=train_size, valid_size=valid_size, seed=seed)
    def split_by_valid_func(self, func:Callable)->'ItemLists':
        "Split the data by result of `func` (which returns `True` for validation set)."
        return self._split(assemble_fn='split_by_valid_func', func=func)
    def split_by_files(self, valid_names:'ItemList')->'ItemLists':
        "Split the data by using the names in `valid_names` for validation."
        return self._split(assemble_fn='split_by_files', valid_names=valid_names)
    def split_by_fname_file(self, fname:PathOrStr, path:PathOrStr=None)->'ItemLists':
        "Split the data by using the names in `fname` for the validation set. `path` will override `self.path`."
        return self._split(assemble_fn='split_by_fname_file', fname=fname, path=path)
    def split_from_df(self, col:IntsOrStrs=2):
        "Split the data from the `col` in the dataframe in `self.inner_df`."
        return self._split(assemble_fn='split_from_df', col=col)
    
    ## LabelBlock methods
    def _label(self, assemble_fn:str, **kwargs):
        "Base method for label block"
        self.blocks['label'] = LabelBlock(prev_block=self.blocks['split'], assemble_fn=assemble_fn, **kwargs)
        return self

    def label_from_df(self, cols:IntsOrStrs=1, label_cls:Callable=None, **kwargs):
        "Label `self.items` from the values in `cols` in `self.inner_df`."
        return self._label(assemble_fn='label_from_df', cols=cols, label_cls=label_cls, **kwargs)
    def label_const(self, const:Any=0, label_cls:Callable=None, **kwargs)->'LabelList':
        "Label every item with `const`."
        return self._label(assemble_fn='label_const', const=const, label_cls=label_cls, **kwargs)
    def label_empty(self, **kwargs):
        "Label every item with an `EmptyLabel`."
        return self._label(assemble_fn='label_empty', **kwargs)
    def label_from_func(self, func:Callable, label_cls:Callable=None, **kwargs)->'LabelList':
        "Apply `func` to every input to get its label."
        return self._label(assemble_fn='label_from_func', func=func, label_cls=label_cls, **kwargs)
    def label_from_folder(self, label_cls:Callable=None, **kwargs)->'LabelList':
        "Give a label to each filename depending on its folder."
        return self._label(assemble_fn='label_from_folder', label_cls=label_cls, **kwargs)
    def label_from_re(self, pat:str, full_path:bool=False, label_cls:Callable=None, **kwargs)->'LabelList':
        "Apply the re in `pat` to determine the label of every filename.  If `full_path`, search in the full name."
        return self._label(assemble_fn='label_from_re', pat=pat, full_path=full_path, label_cls=label_cls, **kwargs)
    def label_from_lists(self, train_labels:Iterator, valid_labels:Iterator, label_cls:Callable=None, **kwargs)->'LabelList':
        "Use the labels in `train_labels` and `valid_labels` to label the data. `label_cls` will overwrite the default."
        return self._label(assemble_fn='label_from_lists', train_labels=train_labels,
                          valid_labels=valid_labels, label_cls=label_cls, **kwargs)

    ## TransformBlock methods
    def _transforms(self, assemble_fn:str, **kwargs):
        "Base method for transform block"
        self.blocks['transforms'] = TransformBlock(prev_block=self.blocks['label'], assemble_fn=assemble_fn, **kwargs)
        return self
    
    def transform(self, tfms:Optional[Tuple[TfmList,TfmList]]=(None,None), **kwargs):
        "Set `tfms` to be applied to the xs of the train and validation set."
        return self._transforms(assemble_fn='transform', tfms=tfms, **kwargs)
    def transform_y(self, tfms:Optional[Tuple[TfmList,TfmList]]=(None,None), **kwargs):
        "Set `tfms` to be applied to the ys of the train and validation set."
        return self._transforms(assemble_fn='transform_y', tfms=tfms, **kwargs)

    ## TestSetBlock methods
    def _test_set(self, assemble_fn:str, **kwargs):
        "Base method for test set block"
        self.blocks['test_set'] = TestSetBlock(prev_block=self.blocks['transforms'], assemble_fn=assemble_fn, **kwargs)
        return self

    def add_test(self, items:Iterator, label:Any=None):
        "Add test set containing `items` with an arbitrary `label`."
        return self._test_set(assemble_fn='add_test', items=items, label=label)
    def add_test_folder(self, test_folder:str='test', label:Any=None):
        "Add test set containing items from `test_folder` and an arbitrary `label`."
        return self._test_set(assemble_fn='add_test_folder', test_folder=test_folder, label=label)
    
    
    ## DataBunchBlock methods - there is no actual DataBunchBlock for now,
    ## as I don't see why you would need to store the result anywhere other than your experiment code.
    ## But it's really easy to add one
    def databunch(self, path:PathOrStr=None, bs:int=64, val_bs:int=None, num_workers:int=defaults.cpus,
                  dl_tfms:Optional[Collection[Callable]]=None, device:torch.device=None, collate_fn:Callable=data_collate,
                  no_check:bool=False, **kwargs)->'DataBunch':
        "Create a `DataBunch` from self, `path` will override `self.path`, `kwargs` are passed to `DataBunch.create`."
        self.__validate__()
        self.__assemble_all__()
        label_lists = self.__find_prev_block__('databunch').assembly
        return label_lists.databunch(path=path, bs=bs, val_bs=val_bs, num_workers=num_workers,dl_tfms=dl_tfms,
                                     device=device, collate_fn=collate_fn, no_check=no_check, **kwargs)

In [None]:
# What we start from:
path = untar_data(URLs.MNIST_TINY)
tfms = get_transforms(do_flip=False)
#path.ls()

data_orig = (ImageList.from_folder(path) #Where to find the data? -> in path and its subfolders
    .split_by_folder()                   #How to split in train/valid? -> use the folders
    .label_from_folder()                 #How to label? -> depending on the folder of the filenames
    .add_test_folder()                   #Optionally add a test set (here default name is test)
    .transform(tfms, size=64)            #Data augmentation? -> use tfms with a size of 64
    .databunch())                        #Finally? -> use the defaults for conversion to ImageDataBunch

data_chain = (DataChain(ImageList)
    .from_folder(path)                   #Where to find the data? -> in path and its subfolders
    .split_by_folder()                   #How to split in train/valid? -> use the folders
    .label_from_folder()                 #How to label? -> depending on the folder of the filenames
    .add_test_folder()                   #Optionally add a test set (here default name is test)
    .transform(tfms, size=64))           #Data augmentation? -> use tfms with a size of 64
data_decl = data_chain.databunch()       #Finally? -> use the defaults for conversion to ImageDataBunch 

In [None]:
data_chain.filter_by_rand(0.3).databunch()

ImageDataBunch;

Train: LabelList (206 items)
x: ImageList
Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64)
y: CategoryList
3,3,3,3,3
Path: /home/gazay/.fastai/data/mnist_tiny;

Valid: LabelList (198 items)
x: ImageList
Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64)
y: CategoryList
3,3,3,3,3
Path: /home/gazay/.fastai/data/mnist_tiny;

Test: LabelList (20 items)
x: ImageList
Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64)
y: EmptyLabelList
,,,,
Path: /home/gazay/.fastai/data/mnist_tiny

In [None]:
data_orig

ImageDataBunch;

Train: LabelList (709 items)
x: ImageList
Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64)
y: CategoryList
3,3,3,3,3
Path: /home/gazay/.fastai/data/mnist_tiny;

Valid: LabelList (699 items)
x: ImageList
Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64)
y: CategoryList
3,3,3,3,3
Path: /home/gazay/.fastai/data/mnist_tiny;

Test: LabelList (20 items)
x: ImageList
Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64)
y: EmptyLabelList
,,,,
Path: /home/gazay/.fastai/data/mnist_tiny

In [None]:
data_decl

ImageDataBunch;

Train: LabelList (709 items)
x: ImageList
Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64)
y: CategoryList
3,3,3,3,3
Path: /home/gazay/.fastai/data/mnist_tiny;

Valid: LabelList (699 items)
x: ImageList
Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64)
y: CategoryList
3,3,3,3,3
Path: /home/gazay/.fastai/data/mnist_tiny;

Test: LabelList (20 items)
x: ImageList
Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64)
y: EmptyLabelList
,,,,
Path: /home/gazay/.fastai/data/mnist_tiny

## In conclusion

So it looks the same (because I tried my best to make it fully compatible with the current API). There is just one difference: you wrap your `ItemList` class in `DataChain`. After that the whole pipeline becomes _lazy_. You can do anything you want, and nothing is computed until you call the `databunch` method. This is the place where all the magic happens. All the blocks and settings you provided are validated (if you forgot something, it tells you exactly what) and assembled. By assembling I mean:

1. Validate that you provided all required arguments
2. Check that the previous block's output has the method you want to call (an internal check that should not affect the end user)
3. Call the function with the provided arguments on the previous block's output
4. Store the result in the current block's `assembly` attribute
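The four steps above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions, not the prototype's real code: `LazyBlock`, `InputBlock`, the `required` argument and the toy `Numbers` class with its `keep_below` method are all hypothetical names made up for the example.

```python
class LazyBlock:
    "Hypothetical stand-in for one pipeline block (not the prototype's class)."
    def __init__(self, prev_block, assemble_fn, required=(), **settings):
        self.prev_block = prev_block      # block whose output we will call into
        self.assemble_fn = assemble_fn    # name of the method to call later
        self.required = tuple(required)   # argument names that must be provided
        self.settings = settings          # arguments stored for the delayed call
        self.assembly = None              # nothing is computed at construction time

    def assemble(self):
        # 1. Validate that all required arguments were provided
        missing = [name for name in self.required if name not in self.settings]
        if missing:
            raise ValueError(f"{self.assemble_fn}: missing arguments {missing}")
        # 2. Check that the previous block's output has the method we want to call
        target = self.prev_block.assemble()
        fn = getattr(target, self.assemble_fn, None)
        if not callable(fn):
            raise AttributeError(f"{type(target).__name__} has no method {self.assemble_fn}")
        # 3. Call the function with the provided arguments
        # 4. Store the result in the `assembly` attribute
        self.assembly = fn(**self.settings)
        return self.assembly


class InputBlock(LazyBlock):
    "Hypothetical first block: it just holds the wrapped object."
    def __init__(self, obj):
        super().__init__(prev_block=None, assemble_fn="input")
        self.assembly = obj

    def assemble(self):
        return self.assembly


class Numbers:
    "Toy replacement for an `ItemList`, used only to keep the sketch self-contained."
    def __init__(self, items):
        self.items = list(items)

    def keep_below(self, limit):
        return Numbers(x for x in self.items if x < limit)


source = InputBlock(Numbers(range(10)))
filt = LazyBlock(source, "keep_below", required=("limit",), limit=4)
still_lazy = filt.assembly is None   # True: nothing has been computed yet
result = filt.assemble()
print(result.items)                  # [0, 1, 2, 3]
```

Creating `filt` only records the method name and its arguments. The actual call happens in `assemble`, which is why the settings can still be changed freely before that point.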

What it gives you right now:

- #### You can check what you have already provided to the data pipeline (just print your `DataChain` object at any time)

In [None]:
data_chain

DataChain
  input:      	InputBlock(ImageList) (Assembled ✔)
  source:     	SourceBlock (Assembled ✔)
    prev_block: 	InputBlock(ImageList) (Assembled ✔)
    assemble_fn:	from_folder
    path:       	/home/gazay/.fastai/data/mnist_tiny
    recurse:    	True
  filter:     	FilterBlock (Assembled ✔)
    prev_block: 	SourceBlock (Assembled ✔)
    assemble_fn:	filter_by_rand
    p:          	0.3
  split:      	SplitBlock (Assembled ✔)
    prev_block: 	FilterBlock (Assembled ✔)
    assemble_fn:	split_by_folder
    train:      	train
    valid:      	valid
  label:      	LabelBlock (Assembled ✔)
    prev_block: 	SplitBlock (Assembled ✔)
    assemble_fn:	label_from_folder
  transforms: 	TransformBlock (Assembled ✔)
    prev_block: 	LabelBlock (Assembled ✔)
    assemble_fn:	transform
    tfms:       	([RandTransform(tfm=TfmCrop (crop_pad), kwargs={'row_pct': (0, 1), 'col_pct': (0, 1), 'padding_mode': 'reflection'}, p=1.0, resolved={'row_pct': 0.7956460851463952, 'col_pct': 0.6948518107580647,

- #### You can change any block you want at any time

In [None]:
data_chain = data_chain.transform((None,None))
data_chain

DataChain
  input:      	InputBlock(ImageList) (Assembled ✔)
  source:     	SourceBlock (Assembled ✔)
    prev_block: 	InputBlock(ImageList) (Assembled ✔)
    assemble_fn:	from_folder
    path:       	/home/gazay/.fastai/data/mnist_tiny
    recurse:    	True
  filter:     	FilterBlock (Assembled ✔)
    prev_block: 	SourceBlock (Assembled ✔)
    assemble_fn:	filter_by_rand
    p:          	0.3
  split:      	SplitBlock (Assembled ✔)
    prev_block: 	FilterBlock (Assembled ✔)
    assemble_fn:	split_by_folder
    train:      	train
    valid:      	valid
  label:      	LabelBlock (Assembled ✔)
    prev_block: 	SplitBlock (Assembled ✔)
    assemble_fn:	label_from_folder
  transforms: 	TransformBlock (Assembled ✘)
    prev_block: 	LabelBlock (Assembled ✔)
    assemble_fn:	transform
    tfms:       	(None, None)
  test_set:   	TestSetBlock (Assembled ✔)
    prev_block: 	TransformBlock (Assembled ✔)
    assemble_fn:	add_test_folder
    test_folder:	test
  databunch:  	None

- #### Or just access its settings and change them directly

In [None]:
data_chain.blocks['source'].settings['recurse'] = False
print(data_chain)
data_chain.blocks['source'].settings['recurse'] = True

DataChain
  input:      	InputBlock(ImageList) (Assembled ✔)
  source:     	SourceBlock (Assembled ✔)
    prev_block: 	InputBlock(ImageList) (Assembled ✔)
    assemble_fn:	from_folder
    path:       	/home/gazay/.fastai/data/mnist_tiny
    recurse:    	False
  filter:     	FilterBlock (Assembled ✔)
    prev_block: 	SourceBlock (Assembled ✔)
    assemble_fn:	filter_by_rand
    p:          	0.3
  split:      	SplitBlock (Assembled ✔)
    prev_block: 	FilterBlock (Assembled ✔)
    assemble_fn:	split_by_folder
    train:      	train
    valid:      	valid
  label:      	LabelBlock (Assembled ✔)
    prev_block: 	SplitBlock (Assembled ✔)
    assemble_fn:	label_from_folder
  transforms: 	TransformBlock (Assembled ✘)
    prev_block: 	LabelBlock (Assembled ✔)
    assemble_fn:	transform
    tfms:       	(None, None)
  test_set:   	TestSetBlock (Assembled ✔)
    prev_block: 	TransformBlock (Assembled ✔)
    assemble_fn:	add_test_folder
    test_folder:	test
  databunch:  	None



- #### You can create several different `DataChain` objects with different settings very easily

In [None]:
import copy
chains = []
for seed in range(3):
    chains.append(copy.deepcopy(data_chain.filter_by_rand(0.2, seed=seed)))
print(chains[0])
print(chains[2])

DataChain
  input:      	InputBlock(ImageList) (Assembled ✔)
  source:     	SourceBlock (Assembled ✔)
    prev_block: 	InputBlock(ImageList) (Assembled ✔)
    assemble_fn:	from_folder
    path:       	/home/gazay/.fastai/data/mnist_tiny
    recurse:    	True
  filter:     	FilterBlock (Assembled ✘)
    prev_block: 	SourceBlock (Assembled ✔)
    assemble_fn:	filter_by_rand
    p:          	0.2
    seed:       	0
  split:      	SplitBlock (Assembled ✔)
    prev_block: 	FilterBlock (Assembled ✔)
    assemble_fn:	split_by_folder
    train:      	train
    valid:      	valid
  label:      	LabelBlock (Assembled ✔)
    prev_block: 	SplitBlock (Assembled ✔)
    assemble_fn:	label_from_folder
  transforms: 	TransformBlock (Assembled ✘)
    prev_block: 	LabelBlock (Assembled ✔)
    assemble_fn:	transform
    tfms:       	(None, None)
  test_set:   	TestSetBlock (Assembled ✔)
    prev_block: 	TransformBlock (Assembled ✔)
    assemble_fn:	add_test_folder
    test_folder:	test
  databunch:  	None


- #### You can create several different `DataBunch` objects with different settings even without calling `copy.deepcopy`. Just call the `databunch` method on the changed `DataChain` instead.

In [None]:
data_bunches = []
for seed in range(2):
    data_bunches.append(data_chain.filter_by_rand(p=0.5, seed=seed).databunch())

In [None]:
data_bunches[0].valid_ds == data_bunches[1].valid_ds

False

- #### As before, you can inspect a method's interface

In [None]:
data_chain.split_by_folder?

## And what cool things we can do from here:

1. We can improve the caching mechanism to avoid redundant computation. Right now the prototype has no smart caching, so it reassembles every block from scratch each time the `databunch` method is called. This can easily be improved by flushing a block's cache only when its settings have changed since the last assembly or when the previous block's assembly has been updated.


2. Most configurations (though not every possible one) could be serialized to a static format such as `yaml` or `json`. That would allow a `save/load` flow driven by a config file, so you wouldn't even have to write code to reproduce a `DataBunch`. I think this would be very useful for reproducibility in general.

3. We can add a block for cross-validation (creating folds, a holdout set, etc.)
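To make the first two ideas a bit more concrete, here is a rough sketch of how they might look. It assumes each block keeps a snapshot of the settings used for its last assembly plus a version counter, and that the settings are plain JSON-friendly values. `CachedBlock`, `version` and `chain_to_json` are hypothetical names for this illustration, not part of the prototype or of fastai.

```python
import copy
import json


class CachedBlock:
    "Hypothetical block that rebuilds only when its settings or its input changed."
    def __init__(self, name, fn, prev_block=None, **settings):
        self.name, self.fn, self.prev_block = name, fn, prev_block
        self.settings = settings
        self.assembly = None
        self.version = 0              # bumped every time the block is really rebuilt
        self._last_settings = None    # settings used for the cached assembly
        self._prev_version = None     # version of the input the cache was built from

    def assemble(self):
        prev_out = self.prev_block.assemble() if self.prev_block is not None else None
        settings_changed = self._last_settings != self.settings
        input_changed = (self.prev_block is not None
                         and self._prev_version != self.prev_block.version)
        if settings_changed or input_changed:
            # Cache is stale: recompute and remember what it was computed from
            self.assembly = self.fn(prev_out, **self.settings)
            self._last_settings = copy.deepcopy(self.settings)
            self._prev_version = self.prev_block.version if self.prev_block else None
            self.version += 1
        return self.assembly


def chain_to_json(blocks):
    "Dump block settings to JSON (works only for JSON-serializable settings)."
    return json.dumps({b.name: b.settings for b in blocks}, sort_keys=True)


source = CachedBlock("source", lambda _prev, n: list(range(n)), n=10)
filt = CachedBlock("filter", lambda items, limit: [x for x in items if x < limit],
                   prev_block=source, limit=4)

filt.assemble()
filt.assemble()                                   # nothing changed, nothing rebuilt
versions_after_repeat = (source.version, filt.version)

filt.settings["limit"] = 6                        # only the filter block is stale
filt.assemble()
versions_after_filter_change = (source.version, filt.version)

source.settings["n"] = 5                          # upstream change invalidates both
final = filt.assemble()
versions_after_source_change = (source.version, filt.version)

config = chain_to_json([source, filt])
print(final)    # [0, 1, 2, 3, 4]
print(config)   # {"filter": {"limit": 6}, "source": {"n": 5}}
```

A real implementation would also have to decide what to do with settings that are not serializable (transform objects, callables), which is why I say most configurations rather than all of them.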