
Opinions about dlrm_data_pytorch.py #219

@future-xy

Description


Hi, I have several interesting observations while using DLRM.


  1. The following code (lines 223–225) that randomizes the data doesn't seem to work:

dlrm/dlrm_data_pytorch.py

Lines 214 to 225 in fbc37eb

# create reordering
indices = np.arange(len(y))

if split == "none":
    # randomize all data
    if randomize == "total":
        indices = np.random.permutation(indices)
        print("Randomized indices...")

    X_int[indices] = X_int
    X_cat[indices] = X_cat
    y[indices] = y

Perhaps the code here should instead be:

self.X_int = X_int[indices]
self.X_cat = X_cat[indices]
self.y = y[indices]

Fortunately, this code path never seems to be triggered in the current version.
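As a quick sanity check of the suggested fix, here is a minimal sketch with made-up toy data (the variable names mirror the DLRM code, but the values are hypothetical) showing that gathering all arrays with the same permutation reorders rows while keeping features and labels aligned:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-ins for X_int and y: each label is a deterministic
# function of its feature, so misalignment is detectable
X_int = np.arange(10)
y = X_int * 10

indices = rng.permutation(len(y))

# gather-style shuffle, as in the suggested fix
X_int_shuffled = X_int[indices]
y_shuffled = y[indices]

# feature/label pairs stay aligned after the shuffle
assert (y_shuffled == X_int_shuffled * 10).all()
# and no rows are lost or duplicated
assert sorted(X_int_shuffled.tolist()) == sorted(X_int.tolist())
```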


  2. I got the following warning when running the code below (torch=1.10.1, numpy=1.21.5):

UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:201.)

dlrm/dlrm_data_pytorch.py

Lines 328 to 333 in 9c2fda7

def collate_wrapper_criteo_offset(list_of_tuples):
    # where each tuple is (X_int, X_cat, y)
    transposed_data = list(zip(*list_of_tuples))
    X_int = torch.log(torch.tensor(transposed_data[0], dtype=torch.float) + 1)
    X_cat = torch.tensor(transposed_data[1], dtype=torch.long)
    T = torch.tensor(transposed_data[2], dtype=torch.float32).view(-1, 1)

dlrm/dlrm_data_pytorch.py

Lines 399 to 404 in 9c2fda7

def collate_wrapper_criteo_length(list_of_tuples):
    # where each tuple is (X_int, X_cat, y)
    transposed_data = list(zip(*list_of_tuples))
    X_int = torch.log(torch.tensor(transposed_data[0], dtype=torch.float) + 1)
    X_cat = torch.tensor(transposed_data[1], dtype=torch.long)
    T = torch.tensor(transposed_data[2], dtype=torch.float32).view(-1, 1)

This appears to be a known PyTorch issue (see pytorch/pytorch#13918), and I followed the suggestion there to modify the code as follows:

X_int = torch.log(torch.tensor(np.array(transposed_data[0]), dtype=torch.float) + 1)
X_cat = torch.tensor(np.array(transposed_data[1]), dtype=torch.long)
T = torch.tensor(np.array(transposed_data[2]), dtype=torch.float32).view(-1, 1)

This modification roughly doubles training speed, from ~30 ms/it to ~15 ms/it on my machine (12 CPU cores, 1x GTX 1060). I suspect this is because the collate function is called very frequently during training. I hope this is useful for others training DLRM.
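The root cause is that `torch.tensor` on a list of separate ndarrays has to walk the Python objects one by one, while `np.array` first stacks them into a single contiguous block that can be copied in one shot. A numpy-only sketch (toy shapes, no torch dependency) of what the added `np.array(...)` call does to one field of the transposed batch:

```python
import numpy as np

# a batch field as the collate function sees it: a list of per-sample
# arrays, i.e. one element of zip(*list_of_tuples) (toy values here)
batch_X_int = [np.array([1.0, 2.0, 3.0]) * i for i in range(4)]

# single bulk conversion into one contiguous 2-D ndarray;
# torch.tensor() can then ingest it with one memcpy-style copy
# instead of iterating over 4 separate Python objects
stacked = np.array(batch_X_int)

assert stacked.shape == (4, 3)
assert stacked.flags["C_CONTIGUOUS"]
```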


  3. This is just a small optimization. X_int, X_cat, y, and indices all appear to be numpy.ndarrays here:

dlrm/dlrm_data_pytorch.py

Lines 247 to 259 in 9c2fda7

# create training, validation, and test sets
if split == 'train':
    self.X_int = [X_int[i] for i in train_indices]
    self.X_cat = [X_cat[i] for i in train_indices]
    self.y = [y[i] for i in train_indices]
elif split == 'val':
    self.X_int = [X_int[i] for i in val_indices]
    self.X_cat = [X_cat[i] for i in val_indices]
    self.y = [y[i] for i in val_indices]
elif split == 'test':
    self.X_int = [X_int[i] for i in test_indices]
    self.X_cat = [X_cat[i] for i in test_indices]
    self.y = [y[i] for i in test_indices]

So, I rewrite the above code as:

# create training, validation, and test sets
if split == 'train':
    self.X_int = X_int[train_indices]
    self.X_cat = X_cat[train_indices]
    self.y = y[train_indices]
elif split == 'val':
    self.X_int = X_int[val_indices]
    self.X_cat = X_cat[val_indices]
    self.y = y[val_indices]
elif split == 'test':
    self.X_int = X_int[test_indices]
    self.X_cat = X_cat[test_indices]
    self.y = y[test_indices]

This saves about 15 s when creating the Kaggle dataset on my machine.
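The difference is a Python-level loop with per-row copies versus one vectorized gather done in C. A toy sketch (hypothetical shapes, not the real Criteo data) confirming the two forms produce identical splits:

```python
import numpy as np

X_int = np.arange(20).reshape(10, 2)     # toy feature matrix
train_indices = np.array([0, 3, 5, 7])   # toy split indices

# original form: Python loop building a list of row arrays
slow = np.array([X_int[i] for i in train_indices])

# proposed form: one fancy-indexing gather
fast = X_int[train_indices]

assert np.array_equal(slow, fast)
assert fast.shape == (4, 2)
```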
