# 10-714 Final Project

In this final project I will implement the DLRM model (https://arxiv.org/abs/1906.00091) and train it on the Criteo 1TB Click Logs dataset (https://labs.criteo.com/2013/12/download-terabyte-click-logs/).

As a contribution to the needle framework, I will implement the following building blocks, ideally in a reusable form:

- Binary Cross-Entropy loss
- the hashing trick, which reduces the memory footprint of the `Embedding` layer
- data-parallel distributed training using Ray (ray.io)
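
To make the first two building blocks concrete, here is a minimal NumPy sketch. It is an illustration only and not the needle implementation from this project: the names `bce_with_logits` and `HashedEmbedding` are hypothetical, and the sketch assumes the simplest form of the hashing trick, where a raw category id is folded into a fixed number of rows by taking it modulo the table size.

```python
import numpy as np


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw logits.

    Uses the identity max(z, 0) - z * y + log(1 + exp(-|z|)), which is
    equivalent to the textbook formula but does not overflow for large |z|.
    """
    z = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    per_sample = np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return float(np.mean(per_sample))


class HashedEmbedding:
    """Embedding table with a fixed row count (hypothetical helper).

    Raw ids are mapped to rows with `id % num_buckets`, so memory is
    bounded by num_buckets * dim regardless of how many distinct ids exist.
    """

    def __init__(self, num_buckets, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.num_buckets = num_buckets
        self.weight = rng.normal(0.0, 0.01, size=(num_buckets, dim))

    def __call__(self, ids):
        ids = np.asarray(ids, dtype=np.int64)
        # Distinct ids can land in the same bucket and then share a vector;
        # that collision is the price paid for the smaller table.
        return self.weight[ids % self.num_buckets]
```

The trade-off is between table size and collisions: a smaller `num_buckets` saves memory, but more unrelated categories end up sharing one embedding vector.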

N.B.: The code most likely won't work in Colab right away, as it currently relies on AWS infrastructure.

In [1]:
# # Code to set up the assignment
# from google.colab import drive
# drive.mount('/content/drive')
# %cd /content/drive/MyDrive/
# !mkdir -p 10714
# %cd /content/drive/MyDrive/10714
# !git clone http://github.com/chaos-ad/dlsyscourse-final.git final
# %cd /content/drive/MyDrive/10714/final/notebooks

# !pip3 install --upgrade --no-deps git+https://github.com/dlsys10714/mugrade.git
# !pip3 install pybind11

In [2]:
%pwd

'/home/ec2-user/SageMaker/code/dlsyscourse-homework/final/notebooks'

In [3]:
!pip install --quiet -r ../requirements.txt

In [4]:
!cd .. && make

-- Found pybind11: /home/ec2-user/SageMaker/.cs/conda/envs/codeserver_py39/lib/python3.9/site-packages/pybind11/include (found version "2.10.1")
-- Found cuda, building cuda backend
Sun Jan 15 22:28:00 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   44C    P8    16W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------

In [5]:
import os
import sys
sys.path.append(os.path.abspath(".."))
sys.path.append(os.path.abspath("../python"))

In [6]:
import dotenv
assert dotenv.load_dotenv(dotenv_path="../conf/dev.env")

In [7]:
import apps.etl
import apps.utils.aws
import apps.utils.common

## Hot code reloading, useful during dev:
%load_ext autoreload
%autoreload 1
%aimport apps.etl
%aimport apps.utils.common
%aimport apps.utils.aws.s3
%aimport apps.utils.aws.athena

In [8]:
import logging
logger = logging.getLogger("notebooks.final")
apps.utils.common.setup_logging(config_file="../conf/logging.yml")

### Download Criteo 1TB dataset into S3

Run this cell once to download the data to S3. The bucket and prefix are set in conf/dev.env; the postfix is a function argument that defaults to "criteo/raw".

In [9]:
# apps.etl.import_criteo_dataset()

Convert the data to Parquet format so that column-wise operations are much faster:

In [10]:
# apps.etl.init_parquet_athena_table()

In [11]:
# apps.etl.parse_criteo_dataset()

Some parts of the ETL are omitted here for lack of time. In essence, for each feature we use an Athena query to build a lookup dictionary sorted by frequency. We then join it back to the sparse features and replace each raw value with its index in that dictionary (similar to what we did for PTB in HW4).
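
The idea can be sketched in plain Python for a single feature column. This is an in-memory illustration only: the project itself does this step with Athena queries over the full dataset, and the helper names `build_vocab` and `encode`, as well as the sample values, are made up for the example.

```python
from collections import Counter


def build_vocab(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending count, then by value so that ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {value: index for index, (value, _) in enumerate(ordered)}


def encode(values, vocab, unknown_index=None):
    """Replace raw values with their vocabulary indices (the join-back step)."""
    return [vocab.get(v, unknown_index) for v in values]


# Made-up raw values for one sparse feature column.
raw = ["a1f", "9bc", "a1f", "77e", "a1f", "9bc"]
vocab = build_vocab(raw)
encoded = encode(raw, vocab)
```

Because indices are assigned in order of frequency, the most common values get the smallest indices, which makes it straightforward to cap the vocabulary size or to combine it with a hashed embedding table.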

In [12]:
import tests.debug



2023-01-15 22:28:05,055 - torch.distributed.nn.jit.instantiator - INFO - Created a temporary directory at /tmp/tmpzwlef4v1
2023-01-15 22:28:05,056 - torch.distributed.nn.jit.instantiator - INFO - Writing /tmp/tmpzwlef4v1/_remote_module_non_scriptable.py


In [13]:
import torch
device = torch.device("cuda")

In [15]:
tests.debug.run_torch(
    device=device,
    with_pbar=True,
)

TRAIN[epoch_id=1]: 100%|█████████▉| 195858865/195871983 [09:29<00:00, 453477.12it/s]

2023-01-15 22:37:52,896 - tests.debug - INFO - TRAIN[epoch_id=1] done: avg_loss=0.1314, auroc_val=0.7224, accuracy_val=0.9679, num_samples=195871983, num_batches=11988


TRAIN[epoch_id=1]: 100%|██████████| 195871983/195871983 [09:30<00:00, 343378.81it/s]
EVAL[epoch_id='test']: 100%|█████████▉| 178224470/178274637 [05:26<00:00, 823569.75it/s]

2023-01-15 22:43:20,882 - tests.debug - INFO - EVAL[epoch_id='test'] done: avg_loss=0.1359, auroc_val=0.7254, accuracy_val=0.9664, num_samples=178274637, num_batches=10920


EVAL[epoch_id='test']: 100%|██████████| 178274637/178274637 [05:27<00:00, 543565.54it/s]


In [None]:
# tests.debug.run_needle()