# Criteo 1 TiB benchmark

In this experiment we will evalutate a number of machine learning tools on a varying size of train data to determine how fast they learn, how much memory they consume etc.
在这个实验中，我们将在不同大小的训练数据上评估许多机器学习工具，以确定它们学习的速度、消耗的内存等。

We will assess Vowpal Wabbit and XGBoost in local mode, and Spark.ML models in cluster mode.

我们将在本地模式下评估Vowpal Wabbit和XGBoost，在集群模式下评估Spark.ML模型。

We will use terabyte click logs released by Criteo and sample needed amount of data from them.

我们将使用Criteo发布的TB点击日志，并从中获取所需的数据量。


This instance of experiment notebook focuses on data preparation and training VW & XGBoost locally.
本实验笔记本集中于数据准备和本地 training VW & XGBoost。

Let's go!



In [1]:
!sh /data/dataset/fengwen/script/criteo-1tb-benchmark/resume.sh

In [2]:
!sh /data/dataset/fengwen/script/make.sh

# Table of contents

* [Configuration](#Configuration)
* [Data preparation](#Data-preparation)
  * [Criteo → LibSVM](#Criteo-→-LibSVM)
  * [LibSVM → Train and test (sampling)](#LibSVM-→-Train-and-test-(sampling%29)
  * [LibSVM train and test → VW train and test](#LibSVM-train-and-test-→-VW-train-and-test)
  * [Local data](#Local-data)
* [Local training](#Local-training)

In [3]:
# encoding：utf-8
%load_ext autotime
%matplotlib inline

from __future__ import print_function

time: 323 ms (started: 2022-11-16 12:54:39 +00:00)


## Configuration
[_(back to toc)_](#Table-of-contents)

Paths:

In [4]:
criteo_data_remote_path = 'criteo/plain'
libsvm_data_remote_path = 'criteo/libsvm'
vw_data_remote_path = 'criteo/vw'

local_data_path = 'criteo/data'
local_results_path = 'criteo/results'
local_runtime_path = 'criteo/runtime'

time: 1.14 ms (started: 2022-11-16 12:54:40 +00:00)


In [5]:
import os


criteo_day_template = os.path.join(criteo_data_remote_path, 'day{}')
libsvm_day_template = os.path.join(libsvm_data_remote_path, 'day{}')
vw_day_template = os.path.join(vw_data_remote_path, 'day{}')

libsvm_train_template = os.path.join(libsvm_data_remote_path, 'train', '{}')
libsvm_test_template = os.path.join(libsvm_data_remote_path, 'test', '{}')
vw_train_template = os.path.join(vw_data_remote_path, 'train', '{}')
vw_test_template = os.path.join(vw_data_remote_path, 'test', '{}')

local_libsvm_test_template = os.path.join(local_data_path, 'data.test.{}.libsvm')
local_libsvm_train_template = os.path.join(local_data_path, 'data.train.{}.libsvm')
local_vw_test_template = os.path.join(local_data_path, 'data.test.{}.vw')
local_vw_train_template = os.path.join(local_data_path, 'data.train.{}.vw')

time: 2.29 ms (started: 2022-11-16 12:54:40 +00:00)


In [6]:
def ensure_directory_exists(path):
    if not os.path.exists(path):
        os.makedirs(path)

time: 295 µs (started: 2022-11-16 12:54:40 +00:00)


In [7]:
file_lists = [libsvm_data_remote_path, criteo_data_remote_path, vw_data_remote_path, local_data_path, local_results_path, local_runtime_path]

for file in file_lists:
    ensure_directory_exists(file)

time: 1.61 ms (started: 2022-11-16 12:54:40 +00:00)


Days to work on:

In [8]:
days = list(range(0, 23 + 1))

time: 752 µs (started: 2022-11-16 12:54:40 +00:00)


Samples to take:

In [9]:
train_samples = [
    10000, 30000,  # tens of thousands
    100000, 300000,  # hundreds of thousands
    1000000, 3000000,  # millions
    10000000, 30000000,  # tens of millions
    100000000, 300000000,  # hundreds of millions
    1000000000, 3000000000,  # billions
]
test_samples = [1000000]

time: 502 µs (started: 2022-11-16 12:54:40 +00:00)


Spark configuration and initialization:

In [10]:
total_cores = 256

time: 586 µs (started: 2022-11-16 12:54:40 +00:00)


In [11]:
executor_cores = 4
executor_instances = total_cores / executor_cores
memory_per_core = 4

time: 775 µs (started: 2022-11-16 12:54:41 +00:00)


In [12]:
app_name = 'Criteo experiment'

master = 'yarn'

settings = {
    'spark.network.timeout': '600',
    
    'spark.driver.cores': '16',
    'spark.driver.maxResultSize': '16G',
    'spark.driver.memory': '32G',
    
    'spark.executor.cores': str(executor_cores),
    'spark.executor.instances': str(executor_instances),
    'spark.executor.memory': str(memory_per_core * executor_cores) + 'G',
    
    'spark.speculation': 'true',
    'spark.yarn.queue': 'root.HungerGames',
}

time: 906 µs (started: 2022-11-16 12:54:41 +00:00)


In [13]:
import findspark
findspark.init('/data/dataset/fengwen/script/data/spark-3.3.1-bin-hadoop3-scala2.13')
from pyspark import SparkContext
sc = SparkContext('local','pyspark')
os.environ['PYSPARK_DRIVER_PYTHON'] = '/home/fengwen/miniconda3/envs/spark/bin/jupyter'
os.environ['PYSPARK_DRIVER_PYTHON_OPTS'] = " --ip=0.0.0.0 --port=7777"
# jupyter: /home/fengwen/miniconda3/envs/spark/bin/jupyter

22/11/16 12:54:42 WARN Utils: Your hostname, oneflow-27 resolves to a loopback address: 127.0.1.1; using 192.168.1.27 instead (on interface ens121f0)
22/11/16 12:54:42 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/11/16 12:54:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/11/16 12:54:43 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
time: 2.84 s (started: 2022-11-16 12:54:41 +00:00)


In [14]:
from pyspark.sql import SparkSession


builder = SparkSession.builder

builder.appName(app_name)
builder.master(master)
for k, v in settings.items():
    builder.config(k, v)

spark = builder.getOrCreate()
sc = spark.sparkContext

sc.setLogLevel('ERROR')


time: 108 ms (started: 2022-11-16 12:54:44 +00:00)


Logging:

In [15]:
import sys
import logging

from importlib import reload # 添加
logging.shutdown()            # 添加
reload(logging)              # 在 reload(logging) 前添加两行代码


handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('[%(asctime)s] %(message)s')
handler.setFormatter(formatter)

ensure_directory_exists(local_runtime_path)
file_handler = logging.FileHandler(filename=os.path.join(local_runtime_path, 'mylog.log'), mode='a')
file_handler.setFormatter(formatter)

logger = logging.getLogger()
logger.addHandler(handler)
logger.addHandler(file_handler)
logger.setLevel(logging.INFO)

time: 6.18 ms (started: 2022-11-16 12:54:44 +00:00)


In [16]:
logger.info('Spark version: %s.', spark.version)

[2022-11-16 12:54:44,474] Spark version: 3.3.1.
time: 1.76 ms (started: 2022-11-16 12:54:44 +00:00)


## Data preparation 数据准备
[_(back to toc)_](#Table-of-contents)

Poor man's HDFS API:

In [17]:
def hdfs_exists(path):
    l = !hadoop fs -ls $path 2>/dev/null
    return len(l) != 0

def hdfs_success(path):
    return hdfs_exists(os.path.join(path, '_SUCCESS'))

def hdfs_delete(path, recurse=False):
    if recurse:
        _ = !hadoop fs -rm -r $path
    else:
        _ = !hadoop fs -rm $path

def hdfs_get(remote_path, local_path):
    remote_path_glob = os.path.join(remote_path, 'part-*')
    _ = !hadoop fs -cat $remote_path_glob >$local_path

time: 555 µs (started: 2022-11-16 12:54:44 +00:00)


Load RDDs from one place and save them to another converted:

In [18]:
def convert_chunked_data(input_path_template, output_path_template, chunks, load_rdd, convert_row, transform_rdd=None):
    for chunk in chunks:
        input_path = input_path_template.format(chunk)
        output_path = output_path_template.format(chunk)

        if hdfs_success(output_path):
            logger.info('Chunk "%s" is already converted and saved to "%s", skipping.', chunk, output_path)
            continue

        logger.info('Reading chunk "%s" data from "%s".', chunk, input_path)
        rdd = load_rdd(input_path)

        if hdfs_exists(output_path):
            logger.info('Cleaning "%s".', output_path)
            hdfs_delete(output_path, recurse=True)

        logger.info('Processing and saving to "%s".', output_path)
        rdd = rdd.map(convert_row)
        
        if transform_rdd is not None:
            rdd = transform_rdd(rdd)
        
        rdd.saveAsTextFile(output_path)

        logger.info('Done with chunk "%s".', chunk)

time: 638 µs (started: 2022-11-16 12:54:44 +00:00)


### Criteo → LibSVM
[_(back to toc)_](#Table-of-contents)

Criteo RDD is actually a DataFrame:

In [19]:
def load_criteo_rdd(path):
    return (
        spark
        .read
        .option('header', 'false')
        .option('inferSchema', 'true')
        .option('delimiter', '\t')
        .csv(path)
        .rdd
    )

time: 305 µs (started: 2022-11-16 12:54:44 +00:00)


Simply add an index to each existing column except the first one which is a target:

In [20]:
def criteo_to_libsvm(row):
    return (
        str(row[0])
        + ' '
        + ' '.join(
            [
                # integer features
                str(i) + ':' + str(row[i])
                for i in range(1, 13 + 1)
                if row[i] is not None
            ] + [
                # string features converted from hex to int
                str(i) + ':' + str(int(row[i], 16))
                for i in range(14, 39 + 1)
                if row[i] is not None
            ]
        )
    )

time: 1.69 ms (started: 2022-11-16 12:54:44 +00:00)


Do it for all days:

In [21]:
convert_chunked_data(criteo_day_template, libsvm_day_template, days, load_criteo_rdd, criteo_to_libsvm)

[2022-11-16 12:54:44,998] Reading chunk "0" data from "criteo/plain/day0".
[2022-11-16 12:54:49,220] Processing and saving to "criteo/libsvm/day0".
[2022-11-16 12:54:49,961] Done with chunk "0".


                                                                                

[2022-11-16 12:54:49,975] Reading chunk "1" data from "criteo/plain/day1".
[2022-11-16 12:54:50,243] Processing and saving to "criteo/libsvm/day1".
[2022-11-16 12:54:50,379] Done with chunk "1".
[2022-11-16 12:54:50,391] Reading chunk "2" data from "criteo/plain/day2".
[2022-11-16 12:54:50,584] Processing and saving to "criteo/libsvm/day2".
[2022-11-16 12:54:50,696] Done with chunk "2".
[2022-11-16 12:54:50,704] Reading chunk "3" data from "criteo/plain/day3".
[2022-11-16 12:54:50,898] Processing and saving to "criteo/libsvm/day3".
[2022-11-16 12:54:51,035] Done with chunk "3".
[2022-11-16 12:54:51,045] Reading chunk "4" data from "criteo/plain/day4".
[2022-11-16 12:54:51,244] Processing and saving to "criteo/libsvm/day4".
[2022-11-16 12:54:51,354] Done with chunk "4".
[2022-11-16 12:54:51,365] Reading chunk "5" data from "criteo/plain/day5".
[2022-11-16 12:54:51,555] Processing and saving to "criteo/libsvm/day5".
[2022-11-16 12:54:51,661] Done with chunk "5".
[2022-11-16 12:54:51,671]

### LibSVM → Train and test (sampling)
[_(back to toc)_](#Table-of-contents)

Let's name samples as their shortened "engineering" notation - e.g. 1e5 is 100k etc.:

In [22]:
def sample_name(sample):
    return str(sample)[::-1].replace('000', 'k')[::-1]

time: 803 µs (started: 2022-11-16 12:54:56 +00:00)


Load data, sample a bit more than needed and cut at exact desired number of lines by zipping with index and filtering upto required index:

In [23]:
from functools import reduce
oversample = 1.03
sampled_partitions = 256


def sample_and_save(input_path_template, output_path_template, days, samples):
    union = None
    union_count = None
    
    for sample in samples:
        name = sample_name(sample)
        output_path = output_path_template.format(name)
        
        if hdfs_success(output_path):
            logger.info('Sample "%s" is already written to "%s", skipping.', sample, output_path)
            continue
            
        logger.info('Preparing to write sample to "%s".', output_path)
        
        if union is None:
            rdds = map(lambda day: sc.textFile(input_path_template.format(day)), days)
            union = reduce(lambda left, right: left.union(right), rdds)

            union_count = union.count()
            logger.info('Total number of lines for days "%s" is "%s".', days, union_count)
            
        ratio = float(sample) / union_count
        
        sampled_union = (
            union
            .sample(False, min(1.0, oversample * ratio))
            .zipWithIndex()
            .filter(lambda z: z[1] < sample)
            .map(lambda z: z[0])
        )
        
        if hdfs_exists(output_path):
            logger.info('Cleaning "%s".', output_path)
            hdfs_delete(output_path, recurse=True)
            
        logger.info('Writing sample "%s" to "%s".', sample, output_path)
        sampled_union.coalesce(sampled_partitions).saveAsTextFile(output_path)
        
        logger.info('Saved "%s" lines to "%s".', sc.textFile(output_path).count(), output_path)

time: 1.46 ms (started: 2022-11-16 12:54:56 +00:00)


Sample all LibSVM data:

In [24]:
sample_and_save(libsvm_day_template, libsvm_test_template, days[-1:], test_samples)

[2022-11-16 12:54:56,792] Preparing to write sample to "criteo/libsvm/test/1kk".
[2022-11-16 12:54:57,009] Total number of lines for days "[23]" is "20".
[2022-11-16 12:54:57,017] Writing sample "1000000" to "criteo/libsvm/test/1kk".
[2022-11-16 12:54:57,179] Saved "20" lines to "criteo/libsvm/test/1kk".
time: 401 ms (started: 2022-11-16 12:54:56 +00:00)


In [25]:
sample_and_save(libsvm_day_template, libsvm_train_template, days[:-1], train_samples)

[2022-11-16 12:54:57,290] Preparing to write sample to "criteo/libsvm/train/10k".
[2022-11-16 12:54:58,451] Total number of lines for days "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]" is "460".
[2022-11-16 12:54:58,969] Writing sample "10000" to "criteo/libsvm/train/10k".


                                                                                

[2022-11-16 12:55:00,344] Saved "460" lines to "criteo/libsvm/train/10k".
[2022-11-16 12:55:00,356] Preparing to write sample to "criteo/libsvm/train/30k".
[2022-11-16 12:55:00,891] Writing sample "30000" to "criteo/libsvm/train/30k".


                                                                                

[2022-11-16 12:55:02,237] Saved "460" lines to "criteo/libsvm/train/30k".
[2022-11-16 12:55:02,250] Preparing to write sample to "criteo/libsvm/train/100k".
[2022-11-16 12:55:02,693] Writing sample "100000" to "criteo/libsvm/train/100k".


                                                                                

[2022-11-16 12:55:04,061] Saved "460" lines to "criteo/libsvm/train/100k".
[2022-11-16 12:55:04,071] Preparing to write sample to "criteo/libsvm/train/300k".
[2022-11-16 12:55:04,578] Writing sample "300000" to "criteo/libsvm/train/300k".


                                                                                

[2022-11-16 12:55:05,881] Saved "460" lines to "criteo/libsvm/train/300k".
[2022-11-16 12:55:05,894] Preparing to write sample to "criteo/libsvm/train/1kk".
[2022-11-16 12:55:06,386] Writing sample "1000000" to "criteo/libsvm/train/1kk".


                                                                                

[2022-11-16 12:55:07,682] Saved "460" lines to "criteo/libsvm/train/1kk".
[2022-11-16 12:55:07,693] Preparing to write sample to "criteo/libsvm/train/3kk".
[2022-11-16 12:55:08,171] Writing sample "3000000" to "criteo/libsvm/train/3kk".


                                                                                

[2022-11-16 12:55:09,513] Saved "460" lines to "criteo/libsvm/train/3kk".
[2022-11-16 12:55:09,526] Preparing to write sample to "criteo/libsvm/train/10kk".
[2022-11-16 12:55:09,991] Writing sample "10000000" to "criteo/libsvm/train/10kk".


                                                                                

[2022-11-16 12:55:11,305] Saved "460" lines to "criteo/libsvm/train/10kk".
[2022-11-16 12:55:11,318] Preparing to write sample to "criteo/libsvm/train/30kk".
[2022-11-16 12:55:11,805] Writing sample "30000000" to "criteo/libsvm/train/30kk".


                                                                                

[2022-11-16 12:55:13,060] Saved "460" lines to "criteo/libsvm/train/30kk".
[2022-11-16 12:55:13,069] Preparing to write sample to "criteo/libsvm/train/100kk".
[2022-11-16 12:55:13,520] Writing sample "100000000" to "criteo/libsvm/train/100kk".


                                                                                

[2022-11-16 12:55:14,878] Saved "460" lines to "criteo/libsvm/train/100kk".
[2022-11-16 12:55:14,892] Preparing to write sample to "criteo/libsvm/train/300kk".
[2022-11-16 12:55:15,388] Writing sample "300000000" to "criteo/libsvm/train/300kk".


                                                                                

[2022-11-16 12:55:16,771] Saved "460" lines to "criteo/libsvm/train/300kk".
[2022-11-16 12:55:16,783] Preparing to write sample to "criteo/libsvm/train/1kkk".
[2022-11-16 12:55:17,296] Writing sample "1000000000" to "criteo/libsvm/train/1kkk".


                                                                                

[2022-11-16 12:55:18,576] Saved "460" lines to "criteo/libsvm/train/1kkk".
[2022-11-16 12:55:18,589] Preparing to write sample to "criteo/libsvm/train/3kkk".
[2022-11-16 12:55:19,010] Writing sample "3000000000" to "criteo/libsvm/train/3kkk".


                                                                                

[2022-11-16 12:55:20,341] Saved "460" lines to "criteo/libsvm/train/3kkk".
time: 23.1 s (started: 2022-11-16 12:54:57 +00:00)


### LibSVM train and test → VW train and test
[_(back to toc)_](#Table-of-contents)

LibSVM RDD is a text file:

In [26]:
def load_libsvm_rdd(path):
    return sc.textFile(path)

time: 1.16 ms (started: 2022-11-16 12:55:20 +00:00)


Conversion is trivial - we only have to map target to {-1, 1} and convert categorical features to VW feature names as a whole:

In [27]:
def libsvm_to_vw(line):
    parts = line.split(' ')
    parts[0] = '1 |' if parts[0] == '1' else '-1 |'
    for i in range(1, len(parts)):
        index, _, value = parts[i].partition(':')
        if int(index) >= 14:
            parts[i] = index + '_' + value
    return ' '.join(parts)

time: 1.71 ms (started: 2022-11-16 12:55:20 +00:00)


Also, data for VW should be well shuffled:

In [28]:
import hashlib


def calculate_hash(something):
    m = hashlib.md5()
    m.update(str(something).encode("utf-8"))
    return m.hexdigest()

def random_sort(rdd):
    return (
        rdd
        .zipWithIndex()
        .sortBy(lambda z: calculate_hash(z[1]))
        .map(lambda z: z[0])
    )

time: 1.1 ms (started: 2022-11-16 12:55:20 +00:00)


Convert all LibSVM samples:

In [29]:
convert_chunked_data(libsvm_test_template, vw_test_template, [sample_name(sample) for sample in test_samples], load_libsvm_rdd, libsvm_to_vw, transform_rdd=random_sort)

[2022-11-16 12:55:20,818] Reading chunk "1kk" data from "criteo/libsvm/test/1kk".
[2022-11-16 12:55:20,848] Processing and saving to "criteo/vw/test/1kk".
[2022-11-16 12:55:20,953] Done with chunk "1kk".
time: 150 ms (started: 2022-11-16 12:55:20 +00:00)


In [30]:
convert_chunked_data(libsvm_train_template, vw_train_template, [sample_name(sample) for sample in train_samples], load_libsvm_rdd, libsvm_to_vw, transform_rdd=random_sort)

[2022-11-16 12:55:21,119] Reading chunk "10k" data from "criteo/libsvm/train/10k".
[2022-11-16 12:55:21,152] Processing and saving to "criteo/vw/train/10k".




[2022-11-16 12:55:24,461] Done with chunk "10k".


                                                                                

[2022-11-16 12:55:24,473] Reading chunk "30k" data from "criteo/libsvm/train/30k".
[2022-11-16 12:55:24,506] Processing and saving to "criteo/vw/train/30k".




[2022-11-16 12:55:27,514] Done with chunk "30k".


                                                                                

[2022-11-16 12:55:27,526] Reading chunk "100k" data from "criteo/libsvm/train/100k".
[2022-11-16 12:55:27,562] Processing and saving to "criteo/vw/train/100k".




[2022-11-16 12:55:30,574] Done with chunk "100k".


                                                                                

[2022-11-16 12:55:30,584] Reading chunk "300k" data from "criteo/libsvm/train/300k".
[2022-11-16 12:55:30,603] Processing and saving to "criteo/vw/train/300k".




[2022-11-16 12:55:33,421] Done with chunk "300k".


                                                                                

[2022-11-16 12:55:33,435] Reading chunk "1kk" data from "criteo/libsvm/train/1kk".
[2022-11-16 12:55:33,469] Processing and saving to "criteo/vw/train/1kk".




[2022-11-16 12:55:36,347] Done with chunk "1kk".


                                                                                

[2022-11-16 12:55:36,361] Reading chunk "3kk" data from "criteo/libsvm/train/3kk".
[2022-11-16 12:55:36,392] Processing and saving to "criteo/vw/train/3kk".




[2022-11-16 12:55:39,302] Done with chunk "3kk".


                                                                                

[2022-11-16 12:55:39,314] Reading chunk "10kk" data from "criteo/libsvm/train/10kk".
[2022-11-16 12:55:39,347] Processing and saving to "criteo/vw/train/10kk".




[2022-11-16 12:55:42,238] Done with chunk "10kk".


                                                                                

[2022-11-16 12:55:42,250] Reading chunk "30kk" data from "criteo/libsvm/train/30kk".
[2022-11-16 12:55:42,285] Processing and saving to "criteo/vw/train/30kk".




[2022-11-16 12:55:45,170] Done with chunk "30kk".


                                                                                

[2022-11-16 12:55:45,182] Reading chunk "100kk" data from "criteo/libsvm/train/100kk".
[2022-11-16 12:55:45,213] Processing and saving to "criteo/vw/train/100kk".




[2022-11-16 12:55:48,049] Done with chunk "100kk".


                                                                                

[2022-11-16 12:55:48,062] Reading chunk "300kk" data from "criteo/libsvm/train/300kk".
[2022-11-16 12:55:48,099] Processing and saving to "criteo/vw/train/300kk".




[2022-11-16 12:55:50,962] Done with chunk "300kk".


                                                                                

[2022-11-16 12:55:50,976] Reading chunk "1kkk" data from "criteo/libsvm/train/1kkk".
[2022-11-16 12:55:51,008] Processing and saving to "criteo/vw/train/1kkk".




[2022-11-16 12:55:53,871] Done with chunk "1kkk".


                                                                                

[2022-11-16 12:55:53,885] Reading chunk "3kkk" data from "criteo/libsvm/train/3kkk".
[2022-11-16 12:55:53,917] Processing and saving to "criteo/vw/train/3kkk".




[2022-11-16 12:55:56,716] Done with chunk "3kkk".
time: 35.6 s (started: 2022-11-16 12:55:21 +00:00)


                                                                                

Spark is no longer needed:

In [31]:
spark.stop()

time: 876 ms (started: 2022-11-16 12:55:56 +00:00)


### Local data
[_(back to toc)_](#Table-of-contents)

Download all sampled data to local directory:

In [32]:
ensure_directory_exists(local_data_path)

time: 1.16 ms (started: 2022-11-16 12:55:57 +00:00)


In [33]:
def count_lines(path):
    lines = 0
    with open(path) as f:
        for i, _ in enumerate(f):
            lines = i
        return lines + 1

def download_data(remote_template, local_template, samples):
    for sample in samples:
        name = sample_name(sample)
        remote_path = remote_template.format(name)
        local_path = local_template.format(name)
        if os.path.exists(local_path):
            count = count_lines(local_path)
            if count == sample:
                logger.info('File "%s" is already loaded, skipping.', local_path)
                continue
            else:
                logger.info('File "%s" already exists but number of lines "%s" is wrong (must be "%s"), reloading.', local_path, count, sample)
        logger.info('Loading file "%s" as local file "%s".', remote_path, local_path)
        hdfs_get(remote_path, local_path)
        count = count_lines(local_path)
        logger.info('File loaded to "%s", number of lines is "%s".', local_path, count)
        # assert count == sample, 'File "{}" contains wrong number of lines "{}" (must be "{}").'.format(local_path, count, sample)

time: 2.88 ms (started: 2022-11-16 12:55:57 +00:00)


In [34]:
download_data(libsvm_test_template, local_libsvm_test_template, test_samples)

[2022-11-16 12:55:58,038] Loading file "criteo/libsvm/test/1kk" as local file "criteo/data/data.test.1kk.libsvm".
[2022-11-16 12:55:58,052] File loaded to "criteo/data/data.test.1kk.libsvm", number of lines is "1".
time: 15.6 ms (started: 2022-11-16 12:55:58 +00:00)


In [35]:
download_data(libsvm_train_template, local_libsvm_train_template, train_samples)

[2022-11-16 12:55:58,154] Loading file "criteo/libsvm/train/10k" as local file "criteo/data/data.train.10k.libsvm".
[2022-11-16 12:55:58,168] File loaded to "criteo/data/data.train.10k.libsvm", number of lines is "1".
[2022-11-16 12:55:58,170] Loading file "criteo/libsvm/train/30k" as local file "criteo/data/data.train.30k.libsvm".
[2022-11-16 12:55:58,181] File loaded to "criteo/data/data.train.30k.libsvm", number of lines is "1".
[2022-11-16 12:55:58,182] Loading file "criteo/libsvm/train/100k" as local file "criteo/data/data.train.100k.libsvm".
[2022-11-16 12:55:58,191] File loaded to "criteo/data/data.train.100k.libsvm", number of lines is "1".
[2022-11-16 12:55:58,193] Loading file "criteo/libsvm/train/300k" as local file "criteo/data/data.train.300k.libsvm".
[2022-11-16 12:55:58,203] File loaded to "criteo/data/data.train.300k.libsvm", number of lines is "1".
[2022-11-16 12:55:58,204] Loading file "criteo/libsvm/train/1kk" as local file "criteo/data/data.train.1kk.libsvm".
[2022-

In [36]:
download_data(vw_test_template, local_vw_test_template, test_samples)

[2022-11-16 12:55:58,384] Loading file "criteo/vw/test/1kk" as local file "criteo/data/data.test.1kk.vw".
[2022-11-16 12:55:58,398] File loaded to "criteo/data/data.test.1kk.vw", number of lines is "1".
time: 16.7 ms (started: 2022-11-16 12:55:58 +00:00)


In [37]:
download_data(vw_train_template, local_vw_train_template, train_samples)

[2022-11-16 12:55:58,502] Loading file "criteo/vw/train/10k" as local file "criteo/data/data.train.10k.vw".
[2022-11-16 12:55:58,515] File loaded to "criteo/data/data.train.10k.vw", number of lines is "1".
[2022-11-16 12:55:58,517] Loading file "criteo/vw/train/30k" as local file "criteo/data/data.train.30k.vw".
[2022-11-16 12:55:58,525] File loaded to "criteo/data/data.train.30k.vw", number of lines is "1".
[2022-11-16 12:55:58,526] Loading file "criteo/vw/train/100k" as local file "criteo/data/data.train.100k.vw".
[2022-11-16 12:55:58,533] File loaded to "criteo/data/data.train.100k.vw", number of lines is "1".
[2022-11-16 12:55:58,535] Loading file "criteo/vw/train/300k" as local file "criteo/data/data.train.300k.vw".
[2022-11-16 12:55:58,542] File loaded to "criteo/data/data.train.300k.vw", number of lines is "1".
[2022-11-16 12:55:58,543] Loading file "criteo/vw/train/1kk" as local file "criteo/data/data.train.1kk.vw".
[2022-11-16 12:55:58,551] File loaded to "criteo/data/data.tra

## Local training
[_(back to toc)_](#Table-of-contents)

Measuring model quality and ML engine technical metrics:

In [38]:
from functools import reduce
import sys 
from matplotlib import pyplot
from sklearn.metrics import (
    auc,
    log_loss,
    roc_curve,
)


def measure(engine, sample, test_file, time_file, predictions_file):
    
    def get_last_in_line(s):
        return s.rstrip().split( )[-1]

    def parse_elapsed_time(s):
        return reduce(lambda a, b: a * 60 + b, map(float, get_last_in_line(s).split(':')))

    def parse_max_memory(s):
        return int(get_last_in_line(s)) * 1024

    def parse_cpu(s):
        return float(get_last_in_line(s).rstrip('%')) / 100 


    elapsed = -1
    memory = -1
    cpu = -1

    with open(time_file, 'rb') as f:
        for line in f:
            if 'Elapsed (wall clock) time' in line:
                elapsed = parse_elapsed_time(line)
            elif 'Maximum resident set size' in line:
                memory = parse_max_memory(line)
            elif 'Percent of CPU' in line:
                cpu = parse_cpu(line)

    with open(test_file, 'rb') as f:
        labels = [line.rstrip().split(' ')[0] == '1' for line in f]

    with open(predictions_file, 'rb') as f:
        scores = [float(line.rstrip().split(' ')[0]) for line in f]

    fpr, tpr, _ = roc_curve(labels, scores)
    roc_auc = auc(fpr, tpr)
    ll = log_loss(labels, scores)
    
    figure = pyplot.figure(figsize=(6, 6))
    pyplot.plot(fpr, tpr, linewidth=2.0)
    pyplot.plot([0, 1], [0, 1], 'k--')
    pyplot.xlabel('FPR')
    pyplot.ylabel('TPR')
    pyplot.title('{} {} - {:.3f} ROC AUC'.format(engine, sample_name(sample), roc_auc))
    pyplot.show()

    return {
        'Engine': engine,
        'Train size': sample,
        'ROC AUC': roc_auc,
        'Log loss': ll,
        'Train time': elapsed,
        'Maximum memory': memory,
        'CPU load': cpu,
    }

time: 396 ms (started: 2022-11-16 12:55:58 +00:00)


Settings for VW & XGBoost and how to run them; I use (a little bit patched for correctness sake) GNU Time to measure running time, CPU load and memory consumption; configurations for VW & XGBoost are obtained via Hyperopt:

In [39]:
def get_time_command_and_file(train_file):
    time_file = train_file + '.time'
    print("time_file", time_file)
    return [
        # '/usr/local/bin/time',
        '/usr/bin/time',
        '-v',
        '--output=' + time_file,
    ], time_file

def get_vw_commands_and_predictions_file(train_file, test_file):
    model_file = train_file + '.model'
    predictions_file = test_file + '.predictions'
    return [
        'vw83',
        '--link=logistic',
        '--loss_function=logistic',
        '-b', '29',
        '-l', '0.3',
        '--initial_t', '1',
        '--decay_learning_rate', '0.5',
        '--power_t', '0.5',
        '--l1', '1e-15',
        '--l2', '0',
        '-d', train_file,
        '-f', model_file,
    ], [
        'vw83',
        '--loss_function=logistic',
        '-t',
        '-i', model_file,
        '-d', test_file,
        '-p', predictions_file,
    ], predictions_file


xgboost_conf = [
    'booster = gbtree',
    'objective = binary:logistic',
    'nthread = 24',
    'eval_metric = logloss',
    'max_depth = 7',
    'num_round = 200',
    'eta = 0.2',
    'gamma = 0.4',
    'subsample = 0.8',
    'colsample_bytree = 0.8',
    'min_child_weight = 20',
    'alpha = 3',
    'lambda = 100',
]


def get_xgboost_commands_and_predictions_file(train_file, test_file, cache=False):
    config_file = os.path.join(local_runtime_path, 'xgb.conf')
    ensure_directory_exists(local_runtime_path)
    with open(config_file, 'wb') as f:
        for line in xgboost_conf:
            print(line, file=f)
    model_file = train_file + '.model'
    predictions_file = test_file + '.predictions'
    if cache:
        train_file = train_file + '#' + train_file + '.cache'
    return [
        'xgboost',
        config_file,
        'data=' + train_file,
        'model_out=' + model_file,
    ], [
        'xgboost',
        config_file,
        'task=pred',
        'test:data=' + test_file,
        'model_in=' + model_file,
        'name_pred=' + predictions_file,
    ], predictions_file

def get_xgboost_ooc_commands_and_predictions_file(train_file, test_file):
    return get_xgboost_commands_and_predictions_file(train_file, test_file, cache=True)

time: 1.27 ms (started: 2022-11-16 12:55:59 +00:00)


In [40]:
engines = {
    'vw': (get_vw_commands_and_predictions_file, local_vw_train_template, local_vw_test_template),
    'xgb': (get_xgboost_commands_and_predictions_file, local_libsvm_train_template, local_libsvm_test_template),
    'xgb.ooc': (get_xgboost_ooc_commands_and_predictions_file, local_libsvm_train_template, local_libsvm_test_template),
}

time: 334 µs (started: 2022-11-16 12:55:59 +00:00)


Train & test everything:

In [41]:
import vowpalwabbit

time: 5.61 ms (started: 2022-11-16 12:55:59 +00:00)


In [42]:
import subprocess


BREAD_OUT = False

measurements = []

for sample in train_samples:
    for engine in engines:

        logger.info('Training "%s" on "%s" lines of data.', engine, sample)
        
        get_commands_and_predictions_file, train_template, test_template = engines[engine]

        train_file = train_template.format(sample_name(sample))
        test_file = test_template.format(sample_name(test_samples[0]))
        logger.info('Will train on "%s" and test on "%s".', train_file, test_file)

        command_time, time_file = get_time_command_and_file(train_file)
        command_engine_train, command_engine_test, predictions_file = get_commands_and_predictions_file(train_file, test_file)
        
        print("command_engine_train: ",command_engine_train)
        print("command_engine_test: ",command_engine_test)
        print("predictions_file: ",predictions_file)

        logger.info('Performing train.')
        subprocess.call(command_time + command_engine_train)

        logger.info('Performing test.')
        subprocess.call(command_engine_test)

        logger.info('Measuring results.')
        measurement = measure(engine, sample, test_file, time_file, predictions_file)
        logger.info(measurement)
        measurements.append(measurement)

[2022-11-16 12:55:59,556] Training "vw" on "10000" lines of data.
[2022-11-16 12:55:59,557] Will train on "criteo/data/data.train.10k.vw" and test on "criteo/data/data.test.1kk.vw".
time_file criteo/data/data.train.10k.vw.time
command_engine_train:  ['vw83', '--link=logistic', '--loss_function=logistic', '-b', '29', '-l', '0.3', '--initial_t', '1', '--decay_learning_rate', '0.5', '--power_t', '0.5', '--l1', '1e-15', '--l2', '0', '-d', 'criteo/data/data.train.10k.vw', '-f', 'criteo/data/data.train.10k.vw.model']
command_engine_test:  ['vw83', '--loss_function=logistic', '-t', '-i', 'criteo/data/data.train.10k.vw.model', '-d', 'criteo/data/data.test.1kk.vw', '-p', 'criteo/data/data.test.1kk.vw.predictions']
predictions_file:  criteo/data/data.test.1kk.vw.predictions
[2022-11-16 12:55:59,558] Performing train.
[2022-11-16 12:55:59,570] Performing test.


/usr/bin/time: cannot run vw83: No such file or directory


FileNotFoundError: [Errno 2] No such file or directory: 'vw83'

time: 718 ms (started: 2022-11-16 12:55:59 +00:00)


Load measurements:

In [None]:
import pandas


measurements_df = pandas.DataFrame(measurements).sort_values(by=['Engine', 'Train size'])


Plot measurements:

In [None]:
def extract_data_for_plotting(df, what):
    return reduce(
        lambda left, right: pandas.merge(left, right, how='outer', on='Train size'),
        map(
            lambda name: df[df.Engine == name][['Train size', what]].rename(columns={what: name}),
            df.Engine.unique(),
        ),
    )   

def plot_stuff(df, what, ylabel=None, **kwargs):
    data = extract_data_for_plotting(df, what).set_index('Train size')
    ax = data.plot(marker='o', figsize=(6, 6), title=what, grid=True, linewidth=2.0, **kwargs)  # xlim=(1e4, 4e9)
    if ylabel is not None:
        ax.set_ylabel(ylabel)


plot_stuff(measurements_df, 'ROC AUC', logx=True)
plot_stuff(measurements_df, 'Log loss', logx=True)
plot_stuff(measurements_df, 'Train time', loglog=True, ylabel='s')
plot_stuff(measurements_df, 'Maximum memory', loglog=True, ylabel='bytes')
plot_stuff(measurements_df, 'CPU load', logx=True)