Transform to numpy arrays #3

Merged
merged 14 commits into main from transform-to-numpy-arrays on Jun 30, 2021

Conversation

@simeneide simeneide (Collaborator) commented Jun 28, 2021

PyTorch files are too platform-specific. Instead, let the data be available as numpy arrays, which compress just as well. In addition, we make a couple of name changes to more intuitive names.

Todo

  • Convert the files from PyTorch to numpy (see the sketch after this list)
  • Name changes: call the displayed slates "slates" instead of "action", and call the type of interaction (recs or search) "interaction_type" instead of "displayType". Remove ind2user and ind2item, since they are scrambled anyway. Remove the popularity count from the itemattr object.
  • Update the dataloader
  • Update quickstart (wait until we also build the pip package?)
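A minimal sketch of what the conversion and renaming steps could look like. The file names ("data.pt", "data.npz") and the assumption that the PyTorch file holds a flat dict of tensors are hypothetical; only the renamed and removed keys come from the list above.

```python
# Hedged sketch of the torch -> numpy conversion; file names and the
# dict-of-tensors layout are assumptions, not taken from this PR.
import numpy as np
import torch

RENAME = {"action": "slates", "displayType": "interaction_type"}
DROP = {"ind2user", "ind2item"}  # scrambled anyway, so not worth keeping

def convert_to_numpy(torch_path="data.pt", out_path="data.npz"):
    data = torch.load(torch_path)  # assumed: dict mapping names to tensors
    arrays = {
        RENAME.get(key, key): value.numpy()
        for key, value in data.items()
        if key not in DROP
    }
    # itemattr would be handled the same way, minus its popularity count.
    np.savez_compressed(out_path, **arrays)  # compresses about as well as .pt

if __name__ == "__main__":
    convert_to_numpy()
```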

@simeneide simeneide requested a review from NegatioN June 28, 2021 13:57
@simeneide simeneide (Collaborator, Author) commented:

Pushed some of your changes, @NegatioN. I now have two files, "datahelper.py" and "dataset_torch.py", which contain the relevant code. The idea is that we can also add a "dataset_tensorflow.py" etc. to reduce the required dependencies.

In addition to these, I think it makes sense to add some evaluation functions, which I will write in pytorch-lightning as a first implementation. So maybe a dataset_lightning.py that inherits the dataset from dataset_torch? Is this reasonable or too complicated?
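Purely to make the proposed module split concrete, here is a hypothetical sketch of the three layers being discussed; none of the class or function names below are taken from the actual PR code.

```python
# datahelper.py-style loading: framework-agnostic, returns plain numpy arrays.
import numpy as np

def load_arrays(path="data.npz"):
    with np.load(path) as npz:
        return {name: npz[name] for name in npz.files}

# dataset_torch.py-style wrapper: only this part needs torch installed.
import torch
from torch.utils.data import DataLoader, Dataset

class SlateDataset(Dataset):
    def __init__(self, arrays):
        self.arrays = arrays
        self.length = len(arrays["slates"])

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        return {name: torch.as_tensor(arr[idx]) for name, arr in self.arrays.items()}

# dataset_lightning.py-style wrapper: reuses the torch Dataset above.
import pytorch_lightning as pl

class SlateDataModule(pl.LightningDataModule):
    def __init__(self, path="data.npz", batch_size=1024):
        super().__init__()
        self.path, self.batch_size = path, batch_size

    def setup(self, stage=None):
        self.train_data = SlateDataset(load_arrays(self.path))

    def train_dataloader(self):
        return DataLoader(self.train_data, batch_size=self.batch_size, shuffle=True)
```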

@simeneide simeneide marked this pull request as ready for review June 29, 2021 13:09
@NegatioN NegatioN (Collaborator) commented Jun 29, 2021

> Pushed some of your changes, @NegatioN. I now have two files, "datahelper.py" and "dataset_torch.py", which contain the relevant code. The idea is that we can also add a "dataset_tensorflow.py" etc. to reduce the required dependencies.

(I already wrote this in a comment, but:) I would aim for a numpy generator that people who don't use PyTorch could use for their own dataloaders (e.g. TensorFlow).
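For illustration, such a generator could be as simple as the following sketch (hypothetical, not code from this PR); anything that consumes dicts of numpy arrays, e.g. a TensorFlow input pipeline, could build on it.

```python
# Hypothetical numpy-only batch generator; no torch dependency.
import numpy as np

def batch_generator(arrays, batch_size=1024, shuffle=True, seed=None):
    n_rows = len(next(iter(arrays.values())))
    order = np.arange(n_rows)
    if shuffle:
        np.random.default_rng(seed).shuffle(order)
    for start in range(0, n_rows, batch_size):
        batch = order[start:start + batch_size]
        yield {name: arr[batch] for name, arr in arrays.items()}
```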

> In addition to these, I think it makes sense to add some evaluation functions, which I will write in pytorch-lightning as a first implementation.

Maybe writing the evaluation functions for the framework you're using first is the right call. But I think it's wise to put some effort into keeping the eval functions in a self-contained format, so that they can be used from Lightning, plain PyTorch, or regular Python. I can provide more feedback, or try to help out, if you sketch out how the function itself is supposed to look 🙂

> So maybe a dataset_lightning.py that inherits the dataset from dataset_torch? Is this reasonable or too complicated?

The datasets look identical for Lightning and PyTorch, right? It's only the eval functions you're worried about?
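As an example of the "self-contained eval function" idea above, a hypothetical hit-rate@k on plain numpy arrays; both the metric choice and the array layout are assumptions, not something defined in this PR.

```python
# Hypothetical self-contained metric; callable from Lightning, torch, or plain Python.
import numpy as np

def hitrate_at_k(scores, clicked_items, k=10):
    """scores: (n_users, n_items) predicted relevance; clicked_items: (n_users,) item indices."""
    topk = np.argpartition(-scores, k, axis=1)[:, :k]      # indices of the k highest scores
    hits = (topk == clicked_items[:, None]).any(axis=1)    # was the click in the top k?
    return float(hits.mean())
```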

@simeneide simeneide (Collaborator, Author) commented:

I agree on all these points! But I think it's out of scope for this PR. It also requires us to be a bit careful so that all functions and dataset splits stay equivalent, so it needs to be worked on somewhat separately. We have taken one step by moving the data from torch to numpy; the next step would be to make the code depend only on numpy as well. :) Numpy generators could be a good way to "fix" this.

@NegatioN NegatioN (Collaborator) left a comment:

LGTM :)

@simeneide simeneide merged commit 13dee27 into main Jun 30, 2021
@simeneide simeneide deleted the transform-to-numpy-arrays branch June 30, 2021 11:21