Generic types for datasets and preprocessors #20

Open
alexmirrington opened this issue Aug 19, 2020 · 0 comments
Labels
code Tasks related to coding and data analysis

Comments

@alexmirrington
Owner

alexmirrington commented Aug 19, 2020

Overview:

At the moment, datasets return objects of type Any, and preprocessing and transform functions take an argument of type Any and return a value of type Any. Generic types should be used so that the type of data a dataset returns can be specified when the dataset is created.

Known challenges:

  • If a dataset has no preprocessor or no transform function, its return type may differ. Options worth investigating include function overloads, class methods as alternative constructors (for the sake of typing), or defining multiple generically typed classes, as in the example below:
from abc import abstractmethod
from pathlib import Path
from typing import Callable, Generic, TypeVar

import torch.utils.data

# Type of the samples a dataset returns.
_T = TypeVar("_T")

class TypedDataset(torch.utils.data.Dataset, Generic[_T]):

    def __init__(self, root: Path):
        self._root = root

    @abstractmethod
    def __getitem__(self, index: int) -> _T:
        raise NotImplementedError()

# Input and output sample types of a transform function.
_IT = TypeVar("_IT")
_OT = TypeVar("_OT")

class TransformedDataset(TypedDataset[_OT], Generic[_IT, _OT]):

    def __init__(self, root: Path, source: TypedDataset[_IT], transform: Callable[[_IT], _OT]):
        super().__init__(root)
        self._source = source
        self._transform = transform

    def __getitem__(self, index: int) -> _OT:
        # The transform is applied lazily, each time a sample is accessed.
        result = self._source[index]
        return self._transform(result)

# Do a similar thing for a `PreprocessedDataset`, except that the preprocessor
# function is applied to all samples on load, the data is saved to file for
# later access, and references to the original dataset are not kept.
  • We want to be able to reuse a GQADataset (for example) for both general data analysis (not preprocessed, optionally transformed) and model training (preprocessed and transformed). This suggests that the dataset factory should be responsible for creating the original GQADataset and wrapping it in a PreprocessedDataset, and then a TransformedDataset, if needed for training.
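One way the factory idea could look is sketched below, combining the wrapper classes with function overloads so the return type follows from the arguments passed. This is only a sketch under several assumptions: `TypedDataset` is a torch-free stand-in that drops the `root` argument, `ListDataset` is a hypothetical in-memory substitute for `GQADataset`, `create_dataset` is an invented factory name, and the preprocessed samples are kept in memory rather than saved to file.

```python
from typing import Callable, Generic, List, Sequence, TypeVar, overload

_T = TypeVar("_T")
_IT = TypeVar("_IT")  # raw sample type
_MT = TypeVar("_MT")  # preprocessed sample type
_OT = TypeVar("_OT")  # transformed sample type


class TypedDataset(Generic[_T]):
    """Torch-free stand-in for the TypedDataset in the example above."""

    def __getitem__(self, index: int) -> _T:
        raise NotImplementedError()

    def __len__(self) -> int:
        raise NotImplementedError()


class ListDataset(TypedDataset[_T]):
    """Hypothetical in-memory dataset, used here in place of GQADataset."""

    def __init__(self, samples: Sequence[_T]) -> None:
        self._samples: List[_T] = list(samples)

    def __getitem__(self, index: int) -> _T:
        return self._samples[index]

    def __len__(self) -> int:
        return len(self._samples)


class PreprocessedDataset(ListDataset[_OT]):
    """Applies the preprocessor to every sample on load; keeps no source reference."""

    def __init__(self, source: TypedDataset[_IT], preprocessor: Callable[[_IT], _OT]) -> None:
        # Assumption: results stay in memory instead of being written to file.
        super().__init__([preprocessor(source[i]) for i in range(len(source))])


class TransformedDataset(TypedDataset[_OT], Generic[_IT, _OT]):
    """Applies the transform lazily, each time a sample is accessed."""

    def __init__(self, source: TypedDataset[_IT], transform: Callable[[_IT], _OT]) -> None:
        self._source = source
        self._transform = transform

    def __getitem__(self, index: int) -> _OT:
        return self._transform(self._source[index])

    def __len__(self) -> int:
        return len(self._source)


@overload
def create_dataset(
    raw: TypedDataset[_IT], preprocessor: Callable[[_IT], _MT]
) -> TypedDataset[_MT]: ...
@overload
def create_dataset(
    raw: TypedDataset[_IT],
    preprocessor: Callable[[_IT], _MT],
    transform: Callable[[_MT], _OT],
) -> TypedDataset[_OT]: ...
def create_dataset(raw, preprocessor, transform=None):
    """Hypothetical factory: wrap the raw dataset, then optionally transform it."""
    dataset = PreprocessedDataset(raw, preprocessor)
    if transform is None:
        return dataset
    return TransformedDataset(dataset, transform)


raw = ListDataset(["what colour is the cat", "is it raining"])
tokens = create_dataset(raw, str.split)        # inferred as TypedDataset[List[str]]
lengths = create_dataset(raw, str.split, len)  # inferred as TypedDataset[int]
```

The raw dataset stays usable for analysis on its own, while a type checker should be able to infer a distinct sample type for each wrapped variant without falling back to Any.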
@alexmirrington alexmirrington added the code Tasks related to coding and data analysis label Aug 19, 2020
@alexmirrington alexmirrington self-assigned this Aug 19, 2020