Transform to numpy arrays #3

Merged
merged 14 commits into main from transform-to-numpy-arrays on Jun 30, 2021

Conversation

@simeneide simeneide (Collaborator) commented Jun 28, 2021

PyTorch files are too platform-specific. Instead, let the data be available as numpy arrays, which compress just as well. In addition, we make a couple of name changes to more intuitive names.

Todo

  • Convert the files from PyTorch to numpy (see the sketch after this list)
  • Name changes: call the displayed slates "slates" instead of "action", and call the type of interaction (recs or search) "interaction_type" instead of "displayType". Remove ind2user and ind2item, since they are scrambled anyway. Remove the popularity count from the itemattr object.
  • Update the dataloader
  • Update quickstart (wait until we also build the pip package?)
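A minimal sketch of what the conversion and renaming steps could look like. The file names ("data.pt", "data.npz") and the assumption that the PyTorch file holds a flat dict of tensors are hypothetical; only the renamed and removed keys come from the list above.

```python
# Hedged sketch of the torch -> numpy conversion; file names and the
# dict-of-tensors layout are assumptions, not taken from this PR.
import numpy as np
import torch

RENAME = {"action": "slates", "displayType": "interaction_type"}
DROP = {"ind2user", "ind2item"}  # scrambled anyway, so not worth keeping

def convert_to_numpy(torch_path="data.pt", out_path="data.npz"):
    data = torch.load(torch_path)  # assumed: dict mapping names to tensors
    arrays = {
        RENAME.get(key, key): value.numpy()
        for key, value in data.items()
        if key not in DROP
    }
    # itemattr would be handled the same way, minus its popularity count.
    np.savez_compressed(out_path, **arrays)  # compresses about as well as .pt

if __name__ == "__main__":
    convert_to_numpy()
```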

@simeneide simeneide requested a review from NegatioN June 28, 2021 13:57
@simeneide simeneide (Collaborator, Author) commented:

Pushed some of your changes, @NegatioN. I now have two files, "datahelper.py" and "dataset_torch.py", which contain the relevant code. The idea is that we can also add a "dataset_tensorflow.py" etc. to reduce the required dependencies.

In addition to these, I think it makes sense to add some evaluation functions, which I will write in pytorch-lightning as a first implementation. So maybe a dataset_lightning.py that inherits the dataset from dataset_torch? Is this reasonable or too complicated?
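Purely to make the proposed module split concrete, here is a hypothetical sketch of the three layers being discussed; none of the class or function names below are taken from the actual PR code.

```python
# datahelper.py-style loading: framework-agnostic, returns plain numpy arrays.
import numpy as np

def load_arrays(path="data.npz"):
    with np.load(path) as npz:
        return {name: npz[name] for name in npz.files}

# dataset_torch.py-style wrapper: only this part needs torch installed.
import torch
from torch.utils.data import DataLoader, Dataset

class SlateDataset(Dataset):
    def __init__(self, arrays):
        self.arrays = arrays
        self.length = len(arrays["slates"])

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        return {name: torch.as_tensor(arr[idx]) for name, arr in self.arrays.items()}

# dataset_lightning.py-style wrapper: reuses the torch Dataset above.
import pytorch_lightning as pl

class SlateDataModule(pl.LightningDataModule):
    def __init__(self, path="data.npz", batch_size=1024):
        super().__init__()
        self.path, self.batch_size = path, batch_size

    def setup(self, stage=None):
        self.train_data = SlateDataset(load_arrays(self.path))

    def train_dataloader(self):
        return DataLoader(self.train_data, batch_size=self.batch_size, shuffle=True)
```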

@simeneide simeneide marked this pull request as ready for review June 29, 2021 13:09
@NegatioN NegatioN (Collaborator) commented Jun 29, 2021

> Pushed some of your changes, @NegatioN. I now have two files, "datahelper.py" and "dataset_torch.py", which contain the relevant code. The idea is that we can also add a "dataset_tensorflow.py" etc. to reduce the required dependencies.

(I already wrote this in a comment, but:) I would aim for a numpy generator that people who don't use PyTorch could use for their own dataloaders (e.g. TensorFlow).
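For illustration, such a generator could be as simple as the following sketch (hypothetical, not code from this PR); anything that consumes dicts of numpy arrays, e.g. a TensorFlow input pipeline, could build on it.

```python
# Hypothetical numpy-only batch generator; no torch dependency.
import numpy as np

def batch_generator(arrays, batch_size=1024, shuffle=True, seed=None):
    n_rows = len(next(iter(arrays.values())))
    order = np.arange(n_rows)
    if shuffle:
        np.random.default_rng(seed).shuffle(order)
    for start in range(0, n_rows, batch_size):
        batch = order[start:start + batch_size]
        yield {name: arr[batch] for name, arr in arrays.items()}
```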

> In addition to these, I think it makes sense to add some evaluation functions, which I will write in pytorch-lightning as a first implementation.

Maybe writing the evaluation functions for the framework you're using first is the right call. But I think it's wise to put some effort into keeping the eval functions in a self-contained format, so that they can be used from Lightning, plain PyTorch, or regular Python. I can provide more feedback, or try to help out, if you sketch out how the function itself is supposed to look 🙂

> So maybe a dataset_lightning.py that inherits the dataset from dataset_torch? Is this reasonable or too complicated?

The datasets look identical for Lightning and PyTorch, right? It's only the eval functions you're worried about?
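As an example of the "self-contained eval function" idea above, a hypothetical hit-rate@k on plain numpy arrays; both the metric choice and the array layout are assumptions, not something defined in this PR.

```python
# Hypothetical self-contained metric; callable from Lightning, torch, or plain Python.
import numpy as np

def hitrate_at_k(scores, clicked_items, k=10):
    """scores: (n_users, n_items) predicted relevance; clicked_items: (n_users,) item indices."""
    topk = np.argpartition(-scores, k, axis=1)[:, :k]      # indices of the k highest scores
    hits = (topk == clicked_items[:, None]).any(axis=1)    # was the click in the top k?
    return float(hits.mean())
```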

@simeneide simeneide (Collaborator, Author) commented:

I agree on all these points! But I think it's out of scope for this PR. It also requires us to be a bit careful so that all functions and dataset splits stay equivalent, so it needs to be worked on somewhat separately. We have taken one step by moving the data from torch to numpy; the next step would be to make the code depend only on numpy as well. :) Numpy generators could be a good way to "fix" this.

@NegatioN NegatioN (Collaborator) left a comment:

LGTM :)

@simeneide simeneide merged commit 13dee27 into main Jun 30, 2021
@simeneide simeneide deleted the transform-to-numpy-arrays branch June 30, 2021 11:21