Generic types for datasets and preprocessors #20

alexmirrington · 2020-08-19T14:34:57Z

Overview:

At the moment, datasets return objects of type Any, and preprocessing and transform functions take an argument of type Any and return a value of type Any. Generic types should be used so that the type of data a dataset returns can be specified when the dataset is created.

Known challenges:

If a dataset has no preprocessor or no transform function, its return type is potentially different. It might be worth looking into function overloads, class methods as alternative constructors for the sake of typing, or defining multiple generic typed classes, as seen in the example below:

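from abc import abstractmethod
from pathlib import Path
from typing import Callable, Generic, TypeVar

import torch.utils.data

_T = TypeVar("_T")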

class TypedDataset(torch.utils.data.Dataset, Generic[_T]):

    def __init__(self, root: Path):
        self._root = root

    @abstractmethod
    def __getitem__(self, index: int) -> _T:
        raise NotImplementedError()

_IT = TypeVar("_IT")
_OT = TypeVar("_OT")

class TransformedDataset(TypedDataset[_OT], Generic[_IT, _OT]):

    def __init__(self, root: Path, source: TypedDataset[_IT], transform: Callable[[_IT], _OT]):
        super().__init__(root)
        self._source = source
        self._transform = transform

    def __getitem__(self, index: int) -> _OT:
        result = self._source[index]
        return self._transform(result)

# Do a similar thing for a `PreprocessedDataset`, except the preprocessor function is applied to all samples on load, data is saved to file for later access and references to the original dataset are not kept.
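A rough sketch of what such a `PreprocessedDataset` could look like, under a few assumptions that are not part of the proposal above: the source dataset implements `__len__`, the preprocessed samples can be pickled, and the cache lives in a file named `preprocessed.pkl` under `root` (a made-up name). `TypedDataset` is redefined here without torch, and `ListDataset` is a hypothetical in-memory dataset, purely so the snippet runs on its own.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable, Generic, List, TypeVar

_T = TypeVar("_T")
_IT = TypeVar("_IT")
_OT = TypeVar("_OT")


class TypedDataset(Generic[_T]):
    """Torch-free stand-in for the TypedDataset above (assumption)."""

    def __init__(self, root: Path):
        self._root = root

    def __getitem__(self, index: int) -> _T:
        raise NotImplementedError()

    def __len__(self) -> int:
        raise NotImplementedError()


class ListDataset(TypedDataset[_T]):
    """Hypothetical in-memory dataset, only used to exercise the sketch."""

    def __init__(self, root: Path, items: List[_T]):
        super().__init__(root)
        self._items = list(items)

    def __getitem__(self, index: int) -> _T:
        return self._items[index]

    def __len__(self) -> int:
        return len(self._items)


class PreprocessedDataset(TypedDataset[_OT], Generic[_IT, _OT]):
    def __init__(
        self,
        root: Path,
        source: TypedDataset[_IT],
        preprocessor: Callable[[_IT], _OT],
    ):
        super().__init__(root)
        cache = root / "preprocessed.pkl"  # hypothetical cache file name
        if not cache.exists():
            # Apply the preprocessor to every sample once and save the result.
            root.mkdir(parents=True, exist_ok=True)
            samples = [preprocessor(source[i]) for i in range(len(source))]
            with cache.open("wb") as file:
                pickle.dump(samples, file)
        # Load from file; no reference to `source` is kept on the instance.
        with cache.open("rb") as file:
            self._samples: List[_OT] = pickle.load(file)

    def __getitem__(self, index: int) -> _OT:
        return self._samples[index]

    def __len__(self) -> int:
        return len(self._samples)


# Usage: a dataset of strings preprocessed into a dataset of ints.
with tempfile.TemporaryDirectory() as tmp:
    cache_root = Path(tmp) / "cache"
    words = ListDataset(cache_root, ["a", "bb", "ccc"])
    lengths = PreprocessedDataset(cache_root, words, len)
    last = lengths[2]
```

With these annotations a type checker should infer `PreprocessedDataset[str, int]` for `lengths`, so `lengths[2]` is typed as `int` rather than `Any`.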

We want to be able to reuse a GQADataset (for example) for both general data analysis (not preprocessed, optionally transformed) and model training (preprocessed and transformed). This suggests that the dataset factory should be responsible for creating the original GQADataset and wrapping it with a PreprocessedDataset and then a TransformedDataset if needed for training.
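One way the factory's return type could follow its arguments is with `typing.overload`, one of the options listed under known challenges. The sketch below is only an illustration: `create_dataset` and `ListDataset` are hypothetical names, the classes are simplified torch-free stand-ins without the `root` argument, and only the transform step is shown. A preprocessor parameter could presumably be handled with further overloads in the same way.

```python
from typing import Callable, Generic, List, TypeVar, overload

_T = TypeVar("_T")
_IT = TypeVar("_IT")
_OT = TypeVar("_OT")


class TypedDataset(Generic[_T]):
    """Simplified stand-in for the TypedDataset above (assumption)."""

    def __getitem__(self, index: int) -> _T:
        raise NotImplementedError()


class ListDataset(TypedDataset[_T]):
    """Hypothetical in-memory dataset standing in for e.g. a GQADataset."""

    def __init__(self, items: List[_T]):
        self._items = list(items)

    def __getitem__(self, index: int) -> _T:
        return self._items[index]


class TransformedDataset(TypedDataset[_OT], Generic[_IT, _OT]):
    def __init__(self, source: TypedDataset[_IT], transform: Callable[[_IT], _OT]):
        self._source = source
        self._transform = transform

    def __getitem__(self, index: int) -> _OT:
        return self._transform(self._source[index])


@overload
def create_dataset(
    source: TypedDataset[_IT], transform: None = None
) -> TypedDataset[_IT]: ...


@overload
def create_dataset(
    source: TypedDataset[_IT], transform: Callable[[_IT], _OT]
) -> TypedDataset[_OT]: ...


def create_dataset(source, transform=None):
    # Without a transform the original dataset is returned unchanged,
    # so its sample type is preserved for data analysis.
    if transform is None:
        return source
    # With a transform the sample type becomes the transform's return type.
    return TransformedDataset(source, transform)


# Usage: the same source dataset serves both analysis and training.
raw = ListDataset(["a", "bb", "ccc"])
for_analysis = create_dataset(raw)       # typed as TypedDataset[str]
for_training = create_dataset(raw, len)  # typed as TypedDataset[int]
```

The overloads only affect static typing; at runtime a single untyped implementation decides whether to wrap the source.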

alexmirrington added the code Tasks related to coding and data analysis label Aug 19, 2020

alexmirrington self-assigned this Aug 19, 2020
