
Opinions about dlrm_data_pytorch.py #219

@future-xy

Description


Hi, I have several interesting observations while using DLRM.


  1. The following code (lines 223–225) that randomizes the data doesn't seem to work:

dlrm/dlrm_data_pytorch.py

Lines 214 to 225 in fbc37eb

# create reordering
indices = np.arange(len(y))

if split == "none":
    # randomize all data
    if randomize == "total":
        indices = np.random.permutation(indices)
        print("Randomized indices...")

    X_int[indices] = X_int
    X_cat[indices] = X_cat
    y[indices] = y

Perhaps the code here should instead be:

self.X_int = X_int[indices]
self.X_cat = X_cat[indices]
self.y = y[indices]

Fortunately, this code path never seems to be triggered in the current version.
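As a quick sanity check of the suggested fix, here is a minimal sketch with made-up toy data (the variable names mirror the DLRM code, but the values are hypothetical) showing that gathering all arrays with the same permutation reorders rows while keeping features and labels aligned:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-ins for X_int and y: each label is a deterministic
# function of its feature, so misalignment is detectable
X_int = np.arange(10)
y = X_int * 10

indices = rng.permutation(len(y))

# gather-style shuffle, as in the suggested fix
X_int_shuffled = X_int[indices]
y_shuffled = y[indices]

# feature/label pairs stay aligned after the shuffle
assert (y_shuffled == X_int_shuffled * 10).all()
# and no rows are lost or duplicated
assert sorted(X_int_shuffled.tolist()) == sorted(X_int.tolist())
```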


  2. I got the following warning when running the code below (torch=1.10.1, numpy=1.21.5):

UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:201.)

dlrm/dlrm_data_pytorch.py

Lines 328 to 333 in 9c2fda7

def collate_wrapper_criteo_offset(list_of_tuples):
    # where each tuple is (X_int, X_cat, y)
    transposed_data = list(zip(*list_of_tuples))
    X_int = torch.log(torch.tensor(transposed_data[0], dtype=torch.float) + 1)
    X_cat = torch.tensor(transposed_data[1], dtype=torch.long)
    T = torch.tensor(transposed_data[2], dtype=torch.float32).view(-1, 1)

dlrm/dlrm_data_pytorch.py

Lines 399 to 404 in 9c2fda7

def collate_wrapper_criteo_length(list_of_tuples):
    # where each tuple is (X_int, X_cat, y)
    transposed_data = list(zip(*list_of_tuples))
    X_int = torch.log(torch.tensor(transposed_data[0], dtype=torch.float) + 1)
    X_cat = torch.tensor(transposed_data[1], dtype=torch.long)
    T = torch.tensor(transposed_data[2], dtype=torch.float32).view(-1, 1)

This appears to be a known PyTorch issue (see pytorch/pytorch#13918), and I followed the suggestion there to modify the code as follows:

X_int = torch.log(torch.tensor(np.array(transposed_data[0]), dtype=torch.float) + 1)
X_cat = torch.tensor(np.array(transposed_data[1]), dtype=torch.long)
T = torch.tensor(np.array(transposed_data[2]), dtype=torch.float32).view(-1, 1)

This modification roughly doubles training speed, from ~30 ms/it to ~15 ms/it on my machine (12 CPU cores, 1x GTX 1060). I suspect this is because the collate function is called very frequently during training. I hope this is useful for others training DLRM.
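The root cause is that `torch.tensor` on a list of separate ndarrays has to walk the Python objects one by one, while `np.array` first stacks them into a single contiguous block that can be copied in one shot. A numpy-only sketch (toy shapes, no torch dependency) of what the added `np.array(...)` call does to one field of the transposed batch:

```python
import numpy as np

# a batch field as the collate function sees it: a list of per-sample
# arrays, i.e. one element of zip(*list_of_tuples) (toy values here)
batch_X_int = [np.array([1.0, 2.0, 3.0]) * i for i in range(4)]

# single bulk conversion into one contiguous 2-D ndarray;
# torch.tensor() can then ingest it with one memcpy-style copy
# instead of iterating over 4 separate Python objects
stacked = np.array(batch_X_int)

assert stacked.shape == (4, 3)
assert stacked.flags["C_CONTIGUOUS"]
```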


  3. This is just a small optimization. X_int, X_cat, y, and indices all appear to be numpy.ndarrays here:

dlrm/dlrm_data_pytorch.py

Lines 247 to 259 in 9c2fda7

# create training, validation, and test sets
if split == 'train':
    self.X_int = [X_int[i] for i in train_indices]
    self.X_cat = [X_cat[i] for i in train_indices]
    self.y = [y[i] for i in train_indices]
elif split == 'val':
    self.X_int = [X_int[i] for i in val_indices]
    self.X_cat = [X_cat[i] for i in val_indices]
    self.y = [y[i] for i in val_indices]
elif split == 'test':
    self.X_int = [X_int[i] for i in test_indices]
    self.X_cat = [X_cat[i] for i in test_indices]
    self.y = [y[i] for i in test_indices]

So, I rewrite the above code as:

# create training, validation, and test sets
if split == 'train':
    self.X_int = X_int[train_indices]
    self.X_cat = X_cat[train_indices]
    self.y = y[train_indices]
elif split == 'val':
    self.X_int = X_int[val_indices]
    self.X_cat = X_cat[val_indices]
    self.y = y[val_indices]
elif split == 'test':
    self.X_int = X_int[test_indices]
    self.X_cat = X_cat[test_indices]
    self.y = y[test_indices]

This saves about 15 s when creating the Kaggle dataset on my machine.
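The difference is a Python-level loop with per-row copies versus one vectorized gather done in C. A toy sketch (hypothetical shapes, not the real Criteo data) confirming the two forms produce identical splits:

```python
import numpy as np

X_int = np.arange(20).reshape(10, 2)     # toy feature matrix
train_indices = np.array([0, 3, 5, 7])   # toy split indices

# original form: Python loop building a list of row arrays
slow = np.array([X_int[i] for i in train_indices])

# proposed form: one fancy-indexing gather
fast = X_int[train_indices]

assert np.array_equal(slow, fast)
assert fast.shape == (4, 2)
```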
