# **Team 2: CTR Prediction**
Lei Guo (lg3175@nyu.edu), Xiangjun Kong (xk321@nyu.edu)

- Please run this notebook in Google Colab (https://colab.research.google.com/).
- Sign in to Google with your .edu account so you can use the unlimited storage on Google Drive.
- In the menu bar, select **Runtime**, then **Change runtime type**, and choose **GPU** as the hardware accelerator.
- Running DLRM on the Criteo Kaggle Dataset needs an estimated 40 GB of storage.
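
Before starting, it may help to confirm that the environment meets the requirements listed above. The sketch below is one possible check, not part of the DLRM code: `free_space_gb` and `has_enough_space` are hypothetical helper names, and the default path `"."` is an assumption (pass your own Drive folder after mounting it).

```python
import shutil

def free_space_gb(path="."):
    """Return the free disk space at `path` in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9

def has_enough_space(path=".", required_gb=40):
    """Return True if `path` has at least `required_gb` GB free."""
    return free_space_gb(path) >= required_gb

print(f"Free space: {free_space_gb():.1f} GB")
print("Enough for the estimated 40 GB:", has_enough_space())

# GPU check: guard the import in case PyTorch is not installed.
try:
    import torch
    print("GPU available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch is not installed in this environment")
```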

### **Replicating Results on the Criteo Kaggle Dataset**

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')
# Your Google Drive then appears under Files in the sidebar

In [None]:
# Clone the DLRM code from Facebook Research's GitHub repository and install the required package
# Change directory to your Google Drive
%cd /content/gdrive/My\ Drive/
!git clone https://github.com/facebookresearch/dlrm.git
!pip install onnx

In [None]:
# Change directory
%cd /content/gdrive/My\ Drive/dlrm/input/
# Download the Criteo Kaggle Dataset
!wget https://s3-eu-west-1.amazonaws.com/kaggle-display-advertising-challenge-dataset/dac.tar.gz
# Extract the archive, which contains train.txt
!tar -xvf dac.tar.gz

In [None]:
# Change directory
%cd /content/gdrive/My\ Drive/dlrm
# Create a new folder to save DLRM's output
%mkdir -p ./output/

In [None]:
# A sample run of the code, with a tiny model is shown below
!python dlrm_s_pytorch.py --mini-batch-size=2 --data-size=6

In [None]:
# A sample run of the code, with a tiny model in debug mode
!python dlrm_s_pytorch.py --mini-batch-size=2 --data-size=6 --debug-mode

In [None]:
# Replicate Facebook's results.
# Save the model to model.pt in the output folder and log the results to run_kaggle_pt.log.
! python dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=./input/train.txt --processed-data-file=./input/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=128 --print-freq=1024 --print-time --test-mini-batch-size=16384 --test-num-workers=16 --use-gpu --save-model=./output/model.pt --test-freq=1024 --memory-map 2>&1 | tee run_kaggle_pt.log

# Copy the log file into the output folder.
!cp -avr ./run_kaggle_pt.log ./output/

In [None]:
# Facebook's article suggests running the code through a shell script, but we find that running dlrm_s_pytorch.py directly is enough.
# If you want to run dlrm_s_criteo_kaggle.sh instead, you may hit a "Permission denied" error. The commands below fix it.
# !chmod u+x ./dlrm_s_criteo_kaggle.sh
# ! ./dlrm_s_criteo_kaggle.sh --test-freq=1024 --use-gpu

In [None]:
# If you need to download datasets from Kaggle, you can use the code below. We take Avazu as an example.

# !mkdir -p ~/.kaggle 
# !cp kaggle.json ~/.kaggle/ 
# !chmod 600 ~/.kaggle/kaggle.json 

# !kaggle competitions download -c avazu-ctr-prediction -p /content/gdrive/My\ Drive/



# If you want to upload datasets from your computer, you can use the code below.
# from google.colab import files
# files.upload()

### **DLRM VS AutoML**

**Prepared data for SparkBeyond**  
The following code converts the 7 .npz files into .txt files. You can then download all the .txt files and upload them to SparkBeyond to run AutoML.

In [None]:
# Transform the pre-processed data (e.g. train_day_0_processed.npz) into new .txt files.
# This produces 7 .txt files, one per day.
import numpy as np

def transform(num):
  # Load one day's pre-processed arrays (change the path here if needed)
  data = np.load('/content/gdrive/My Drive/dlrm/input/train_day_{0}_processed.npz'.format(num))
  X_cat = data['X_cat']
  X_int = data['X_int']
  y = data['y']
  # Write one ';'-separated line per sample: label, X_int values, X_cat values
  with open('train_day_{0}_sb.txt'.format(num), 'w') as f:
    for i in range(len(y)):
      line = ';'.join([str(y[i])]+[str(x) for x in X_int[i]]+[str(int(x)) for x in X_cat[i]])
      f.write(line+'\n')

for num_ in range(7):
  print('working on: {0}'.format(num_))
  transform(num_)
  print('finished: {0}'.format(num_))
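
To sanity-check a converted file before uploading it, one line can be split back into its parts. The sketch below is illustrative only: `parse_sb_line` is a hypothetical helper, and it assumes the Criteo layout of 1 label, 13 dense values, and 26 sparse values per line. It runs on a synthetic line rather than real data.

```python
def parse_sb_line(line, n_dense=13):
    """Split one ';'-separated line into (label, dense, sparse) string fields."""
    fields = line.rstrip("\n").split(";")
    label = fields[0]
    dense = fields[1:1 + n_dense]
    sparse = fields[1 + n_dense:]
    return label, dense, sparse

# Synthetic example line: 1 label + 13 dense + 26 sparse values (assumed layout).
example = ";".join(["0"] + ["1"] * 13 + ["7"] * 26) + "\n"
label, dense, sparse = parse_sb_line(example)
print(label, len(dense), len(sparse))  # prints: 0 13 26
```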

**Prepared data for H2O**  
The following code merges the per-day .txt files for days 0 to 5 into one .txt file. You can then download the .txt file and upload it to H2O to run AutoML.

In [None]:
# The training set for H2O covers day 0 to day 5, so we combine train_day_0_sb.txt through train_day_5_sb.txt.
# Put these 6 files into a new folder named "processed" first.
import os
filedir = os.path.join(os.getcwd(), 'processed')
# Sort the names so the days are merged in order (os.listdir returns them in arbitrary order)
filenames = sorted(os.listdir(filedir))
# Make sure the output folder exists
os.makedirs('processed_h2o', exist_ok=True)
with open('processed_h2o/train_all.txt', 'w') as f:
    # Go through all files in the folder
    for filename in filenames:
        filepath = os.path.join(filedir, filename)
        # Copy the file line by line
        with open(filepath) as src:
            for line in src:
                f.write(line)
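
After merging, a quick check is that the merged file has as many lines as the per-day files combined. The following is a minimal sketch with hypothetical helper names (`count_lines`, `merged_line_count_ok`); the demo runs on throwaway files in a temporary folder, so point the two arguments at your own `processed` folder and merged file for the real check.

```python
import os
import tempfile

def count_lines(path):
    """Count the lines in a text file without loading it into memory."""
    with open(path) as f:
        return sum(1 for _ in f)

def merged_line_count_ok(parts_dir, merged_path):
    """True if the merged file has as many lines as all files in parts_dir together."""
    total = sum(count_lines(os.path.join(parts_dir, name))
                for name in os.listdir(parts_dir))
    return total == count_lines(merged_path)

# Self-contained demo on synthetic files (3 lines + 2 lines).
with tempfile.TemporaryDirectory() as tmp:
    parts = os.path.join(tmp, "processed")
    os.makedirs(parts)
    for day, n in enumerate([3, 2]):
        with open(os.path.join(parts, f"train_day_{day}_sb.txt"), "w") as f:
            f.write("0;1;2\n" * n)
    merged = os.path.join(tmp, "train_all.txt")
    with open(merged, "w") as out:
        for name in sorted(os.listdir(parts)):
            with open(os.path.join(parts, name)) as f:
                out.write(f.read())
    demo_ok = merged_line_count_ok(parts, merged)

print(demo_ok)
```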

### **More on DLRM**

**Imbalanced Data**

- We modify the corresponding code in data_util.py and save the result as *data_util_modify.py*.
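
For readers unfamiliar with the idea, one common way to reduce class imbalance is to keep every positive sample and only a fraction of the negatives. The sketch below illustrates that general technique on synthetic lines; it is not taken from *data_util_modify.py* and may differ from what that file does. The function name `downsample_negatives`, the `keep_rate` parameter, and the fixed seed are all assumptions made for illustration.

```python
import random

def downsample_negatives(lines, keep_rate, seed=0):
    """Keep every label=1 line and each label=0 line with probability keep_rate."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    kept = []
    for line in lines:
        # The label is the first character of each tab-separated line.
        if line.startswith("1") or rng.random() < keep_rate:
            kept.append(line)
    return kept

# Synthetic tab-separated lines: 2 positives and 8 negatives.
lines = ["1\t5\ta\n"] * 2 + ["0\t3\tb\n"] * 8
balanced = downsample_negatives(lines, keep_rate=0.5)
positives = sum(line.startswith("1") for line in balanced)
print(positives, len(balanced))
```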

**Run DLRM on Different Datasets**

- Add 3 arguments: --dense-count, --sparse-count, and --day-count. Edit data_util.py, dlrm_data_pytorch.py, and dlrm_s_pytorch.py, and save them as *data_util_modify.py*, *dlrm_data_pytorch_modify.py*, and *dlrm_s_pytorch_modify.py*.

- Put them all in the dlrm folder.
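
The three new arguments could be declared along the following lines. This is a minimal `argparse` sketch, not the actual contents of *dlrm_s_pytorch_modify.py*; the defaults (13 dense features, 26 sparse features, 7 days) are assumptions chosen to match the Criteo Kaggle layout.

```python
import argparse

parser = argparse.ArgumentParser(description="Dataset shape options (sketch)")
# Defaults are assumptions matching the Criteo Kaggle layout.
parser.add_argument("--dense-count", type=int, default=13)
parser.add_argument("--sparse-count", type=int, default=26)
parser.add_argument("--day-count", type=int, default=7)

# Parse an explicit list here so the sketch also runs inside a notebook.
args = parser.parse_args(["--dense-count=12", "--sparse-count=24", "--day-count=6"])
print(args.dense_count, args.sparse_count, args.day_count)  # prints: 12 24 6
```

argparse converts the dashes in each flag name to underscores, so `--dense-count` is read back as `args.dense_count`.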

In [None]:
# Test the new code.
# Take a smaller sample of the dataset to reduce the running time,
# because we only want to test whether DLRM can run on other datasets.

# 1,000,000 lines ~ 240 MB
threshold = 1000000
i, j = 0, 0

# Change both paths to match your own Google Drive layout
with open('/content/gdrive/My Drive/dlrm/input/train.txt') as source_file, \
     open('/content/gdrive/My Drive/dlrm/train2.txt', 'w') as target_file:
  for line in source_file:
    if i > threshold: # Stop once threshold + 1 lines have been written
      break
    target_file.write(line)
    i += 1 # Number of lines written so far
    if line[0] == '1':
      j += 1 # Number of lines with label=1 in the new dataset

# print(j/i) # 0.2549487450

In [None]:
import random

sub_sample_rate = 0.5
i, j, k = 0, 0, 0

# Change both paths to match your own setup
with open('/content/gdrive/My Drive/dlrm/train2.txt') as source_file, \
     open('/content/gdrive/My Drive/dlrm/train2_random.txt', 'w') as target_file:
  for line in source_file:
    i += 1 # Number of lines in the original dataset
    if random.random() > sub_sample_rate: # Keep each line with probability sub_sample_rate
      continue
    k += 1 # Number of sampled lines
    target_file.write(line)
    if line[0] == '1':
      j += 1 # Number of lines with label=1 in the new dataset

# print(j/k) # 0.2557276234
# print(k/i) # 0.5005124994

In [None]:
# Try reducing the columns, leaving 12 dense features and 24 sparse features
# Change both paths to match your own setup
with open('./modify/input/train2_random.txt') as source_file, \
     open('./modify/input/train3.txt', 'w') as target_file:
  for line in source_file:
    # Split the line into a list of fields
    line_new = line.rstrip("\n").split("\t")
    # Drop the last two columns (sparse features)
    line_new = line_new[:-2]
    # Drop the column at index 2 (the second dense feature; index 0 is the label)
    del line_new[2]
    # Join the fields back into a tab-separated line
    target_file.write('\t'.join(line_new) + "\n")

In [None]:
# Run the modified DLRM code on the reduced dataset
! python dlrm_s_pytorch_modify.py --data-sub-sample-rate=0.8 --dense-count=12 --sparse-count=24 --day-count=6 --arch-sparse-feature-size=16 --arch-mlp-bot="12-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=./modify/input/train3.txt --processed-data-file=./input/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=128 --print-freq=1024 --print-time --test-mini-batch-size=16384 --test-num-workers=16 --use-gpu --save-model=./modify/output/model_modify.pt --test-freq=1024 --memory-map 2>&1 | tee run_kaggle_pt.log