# Datum benchmarks

This notebook compares execution times between Datum and Pandas on five simple tasks:
* querying individual Datum dataset entries vs. Pandas Dataframe rows
* adding/removing entries to/from a Datum dataset vs. rows to/from a Pandas Dataframe
* modifying individual Datum dataset entries vs. individual Pandas Dataframe rows
* modifying an attribute for all Datum dataset entries vs. a column value for all Pandas Dataframe rows
* iterating over Datum dataset entries and their observables vs. Pandas Dataframe rows (plus the rows of the corresponding observables)
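
The cells in this notebook time a single run with `%%time`, so short measurements can be noisy. As a rough sketch of a more repeatable measurement, the helper below (the name `best_of` is hypothetical and not part of Datum or Pandas) runs a callable several times with the standard library and keeps the fastest wall time:

```python
import time

def best_of(func, *args, repeat=5, **kwargs):
    """Call func(*args, **kwargs) `repeat` times and return the fastest wall time in seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return min(timings)

# Illustrative workload only: sum the first 100k integers.
elapsed = best_of(sum, range(100_000), repeat=3)
print(f"best of 3: {elapsed * 1e3:.3f} ms")
```

Any of the benchmark functions defined below could be passed in the same way, for example `best_of(query_entries, dataset, k=10000)`.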

In [2]:
import sys, os
import random
from pathlib import Path

import numpy as np

from datum.datasets import Entry, Observable, Dataset
from datum.readers import VocDetectionReader, CocoDetectionReader, ObsAttributeConstructor
from datum.formatters import VocDetectionFormatter
from datum.transformers import EntryMapper, ObservableMapper, AttributesTransformer

In [3]:
def get_coco_dataset_entries_obs():
    root_dir = Path('/home/clacroix/databases/coco/')
    reader = CocoDetectionReader(root_dir, 'jpg')
    coco_val = Dataset()
    reader.feed(coco_val, sets=['val2017'])
    entries = coco_val.entries_df
    observables = coco_val.observables_df

    return coco_val, entries, observables

dataset, entries, observables = get_coco_dataset_entries_obs()

100%|██████████| 5000/5000 [00:00<00:00, 35543.74it/s]
100%|██████████| 36781/36781 [00:00<00:00, 73550.85it/s]


## Task 1 : individual dataset entry vs Pandas Dataframe row querying

- query 100k random entries + associated observables from dataset : 264ms
- query 100k rows (corresponding to entries attributes only) from Dataframe : 13.8s

In [4]:
dataset, entries, observables = get_coco_dataset_entries_obs()

100%|██████████| 5000/5000 [00:00<00:00, 31057.05it/s]
100%|██████████| 36781/36781 [00:00<00:00, 79898.81it/s]


In [5]:
%%time
def query_entries(dataset, k=100):
    # Fetch k random entries together with their observables.
    n = len(dataset)
    for _ in range(k):
        idx = random.randint(0, n - 1)
        entry, observables = dataset[idx]
query_entries(dataset, k=10000)

CPU times: user 69.5 ms, sys: 135 µs, total: 69.7 ms
Wall time: 68.6 ms


In [6]:
%%time
def query_rows(df, k=100):
    # Fetch k random rows by position; use the df argument, not the global.
    n = len(df.index)
    for _ in range(k):
        idx = random.randint(0, n - 1)
        entry = df.iloc[idx]
query_rows(entries, k=10000)

CPU times: user 1.4 s, sys: 0 ns, total: 1.4 s
Wall time: 1.39 s
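
Part of the gap on the Pandas side likely comes from `iloc[idx]` building a new Series object for every row it returns. When rows only need to be read, one possible workaround is to convert the Dataframe once into a list of plain dicts and index into that list. The sketch below uses a small hypothetical Dataframe as a stand-in for `entries` and is not timed here:

```python
import random
import pandas as pd

# Hypothetical stand-in for the `entries` Dataframe used above.
df = pd.DataFrame({
    "filename": [f"img_{i}.jpg" for i in range(1000)],
    "width": [640] * 1000,
    "height": [480] * 1000,
})

# One-time conversion: each row becomes a plain Python dict.
records = df.to_dict("records")

def query_records(records, k=100):
    """Read k random rows from the pre-converted list of dicts."""
    n = len(records)
    rows = []
    for _ in range(k):
        rows.append(records[random.randint(0, n - 1)])
    return rows

rows = query_records(records, k=10)
```

The conversion itself has a cost and the dicts are a snapshot, so this only fits read-heavy access patterns.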


## Task 2 : individual dataset entry adding/removal vs. Pandas Dataframe row adding/removal

- add 1000 entries to dataset with 2367 entries : 7.32ms
- add 1000 rows to dataframe with 2367 rows (without memory preallocation) : 3.72s

In [7]:
dataset, entries, observables = get_coco_dataset_entries_obs()

100%|██████████| 5000/5000 [00:00<00:00, 59434.16it/s]
100%|██████████| 36781/36781 [00:00<00:00, 80780.74it/s]


In [8]:
%%time
def add_entries(dataset, k=100000):
    # Append k synthetic entries; the loop index keeps each name unique.
    for i in range(k):
        dataset.add_entry({'name': 'foo_' + str(i), 'filename': 'foo.jpg', 'dir': 'bar', 'set': 'train',
                           'width': 1280, 'height': 720})
add_entries(dataset, k=100)

CPU times: user 1.75 ms, sys: 119 µs, total: 1.87 ms
Wall time: 2.57 ms


In [9]:
%%time
def add_rows(df, k=100):
    # Enlarge the Dataframe one row at a time (no memory preallocation).
    n = len(df.index)
    for i in range(k):
        df.loc[n + i] = ['foo', 'foo.jpg', 'bar', 1280, 720, 3, 'train', 42, []]
add_rows(entries, k=100)

CPU times: user 744 ms, sys: 0 ns, total: 744 ms
Wall time: 753 ms
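
Enlarging a Dataframe through `.loc` one row at a time tends to reallocate and copy data on each assignment. A commonly used alternative is to collect the new rows first and append them with a single `pd.concat`. The sketch below assumes a reduced, hypothetical set of columns rather than the real `entries` schema and is not timed here:

```python
import pandas as pd

# Hypothetical stand-in for `entries`, with fewer columns than the real one.
df = pd.DataFrame({
    "name": ["a", "b"],
    "filename": ["a.jpg", "b.jpg"],
    "width": [640, 640],
    "height": [480, 480],
})

def add_rows_batched(df, k=100):
    """Build k new rows as dicts, then append them with a single concat."""
    new_rows = [
        {"name": f"foo_{i}", "filename": "foo.jpg", "width": 1280, "height": 720}
        for i in range(k)
    ]
    return pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)

bigger = add_rows_batched(df, k=100)
```

Unlike the in-place `.loc` assignment, this returns a new Dataframe, so the caller has to rebind the result.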


## Task 3 : individual dataset entries modifications vs. Pandas Dataframe individual row modification

- Modify 1 attribute from 10k random entries from dataset : 27.8ms
- Modify 1 column value from 10k random rows from dataframe : 4.46s

In [10]:
dataset, entries, observables = get_coco_dataset_entries_obs()

100%|██████████| 5000/5000 [00:00<00:00, 57251.21it/s]
100%|██████████| 36781/36781 [00:00<00:00, 70189.82it/s]


In [11]:
%%time
def modify_entries(dataset, k=1000):
    l = len(dataset)
    for k in range(k):
        idx = random.randint(0, l-1)
        dataset.update_entry_data(idx, {'filename': 'foo.jpg'})
modify_entries(dataset, k=1000)

CPU times: user 12.6 ms, sys: 29 µs, total: 12.6 ms
Wall time: 12.7 ms


In [12]:
%%time
def modify_rows(df, k=100):
    # Overwrite one column value in k random rows; use the df argument, not the global.
    n = len(df.index)
    for _ in range(k):
        idx = random.randint(0, n - 1)
        df.loc[idx, 'filename'] = 'foo.jpg'
modify_rows(entries, k=1000)

CPU times: user 388 ms, sys: 31 µs, total: 388 ms
Wall time: 385 ms


In [13]:
%%time
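def update_filename(row, idx):
    # Return the new filename for the targeted row, the current one otherwise.
    if row['idx'] == idx:
        return 'foo.jpg'
    return row['filename']

def modify_rows_apply(df, k=100):
    n = len(df.index)
    for _ in range(k):
        idx = random.randint(0, n - 1)
        # apply() returns a new Series; assign it back so the change is kept.
        df['filename'] = df.apply(lambda row: update_filename(row, idx), axis=1)
modify_rows_apply(entries, k=1)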

CPU times: user 68.6 ms, sys: 3.83 ms, total: 72.5 ms
Wall time: 72 ms
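
For writing a single cell, Pandas also provides the scalar accessor `DataFrame.at`, which is documented as the fast path for label-based scalar access compared with the more general `.loc`. A minimal sketch on a hypothetical Dataframe with a default integer index (not timed here):

```python
import random
import pandas as pd

# Hypothetical stand-in for the `entries` Dataframe.
df = pd.DataFrame({"filename": [f"img_{i}.jpg" for i in range(1000)]})

def modify_rows_at(df, k=100):
    """Overwrite 'filename' in k random rows using the scalar accessor."""
    n = len(df.index)
    touched = []
    for _ in range(k):
        idx = random.randint(0, n - 1)
        df.at[idx, "filename"] = "foo.jpg"
        touched.append(idx)
    return touched

touched = modify_rows_at(df, k=50)
```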


## Task 4 : modify 1 attribute for all entries and observables of a dataset vs. 1 column value for all rows of a Dataframe

- Modifying all entries from COCO (118k images) : 840ms
- Modifying all observables from COCO (860k objects) : 4.13s


In [14]:
dataset, entries, observables = get_coco_dataset_entries_obs()

100%|██████████| 5000/5000 [00:00<00:00, 30545.13it/s]
100%|██████████| 36781/36781 [00:00<00:00, 76590.56it/s]


In [15]:
%%time
upper = EntryMapper(['filename'], ['filename'], lambda f: f.upper())
transformer = AttributesTransformer(entries_mappers=[upper])
transformer.transform(dataset)

CPU times: user 41.9 ms, sys: 0 ns, total: 41.9 ms
Wall time: 41.4 ms


In [16]:
%%time
expand_xmin = ObservableMapper(['xmin'], ['xmin'], lambda x: x-1)
transformer = AttributesTransformer(observables_mappers=[expand_xmin])
transformer.transform(dataset)

CPU times: user 252 ms, sys: 0 ns, total: 252 ms
Wall time: 251 ms


In [17]:
%%time
entries['filename'] = entries['filename'].apply(lambda f: f.upper())

CPU times: user 6.5 ms, sys: 0 ns, total: 6.5 ms
Wall time: 6.44 ms


In [18]:
%%time
observables['xmin'] = observables['xmin'].apply(lambda x: x-1)

CPU times: user 28.9 ms, sys: 161 µs, total: 29.1 ms
Wall time: 28.2 ms
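
`Series.apply` with a Python lambda calls the function once per element. For these two particular transformations Pandas has vectorized equivalents, `Series.str.upper()` and plain arithmetic on the column. The sketch below uses small hypothetical Dataframes in place of `entries` and `observables`:

```python
import pandas as pd

# Hypothetical stand-ins for the `entries` and `observables` Dataframes.
entries_demo = pd.DataFrame({"filename": ["a.jpg", "b.jpg", "c.jpg"]})
observables_demo = pd.DataFrame({"xmin": [10, 25, 3]})

# Vectorized string method instead of apply(lambda f: f.upper()).
entries_demo["filename"] = entries_demo["filename"].str.upper()

# Vectorized arithmetic instead of apply(lambda x: x - 1).
observables_demo["xmin"] = observables_demo["xmin"] - 1
```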


## Task 5 : iteration over dataset entries + observables vs. iteration over Dataframe rows

* Iterate 100 times over same dataset with 2468 entries : 22.9ms
  * at each iteration, entry data and its corresponding observables data are available
* Iterate 100 times over rows from Dataframe with itertuples() : 33.7ms
  * N.B : at each iteration, only entry data is available
* Iterate 100 times over rows from Dataframe with itertuples() and a per-entry lookup of observables : 8.76s
  * at each iteration, entry data and its corresponding observables data are available
* Iterate 100 times over rows from Dataframe with iterrows() : 1.96s
  * at each iteration, only entry data is available

In [19]:
dataset, entries, observables = get_coco_dataset_entries_obs()

100%|██████████| 5000/5000 [00:00<00:00, 56538.29it/s]
100%|██████████| 36781/36781 [00:00<00:00, 81131.86it/s]


In [20]:
%%time
n = 1
for k in range(n):
    for entry, obs in dataset:
        filename = entry['filename']

CPU times: user 24.7 ms, sys: 4.04 ms, total: 28.7 ms
Wall time: 28.8 ms


In [21]:
%%time
n = 1
for k in range(n):
    for e in entries.itertuples():
        filename = e.filename

CPU times: user 27.2 ms, sys: 0 ns, total: 27.2 ms
Wall time: 26.3 ms


In [22]:
%%time
n = 1
for k in range(n):
    for e in entries.itertuples():
        filename = e.filename
        if isinstance(e[-1], list):
            obs = observables.loc[e[-1], :]

CPU times: user 1.71 s, sys: 28.1 ms, total: 1.74 s
Wall time: 1.71 s


In [23]:
%%time
n = 1
for k in range(n):
    for i, r in entries.iterrows():
        filename = r['filename']

CPU times: user 452 ms, sys: 0 ns, total: 452 ms
Wall time: 452 ms
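
Most of the cost in the `itertuples()` cell with observables appears to come from calling `observables.loc[...]` once per entry. One possible alternative is to group the observables once before the loop and then look them up in a dict. The sketch below assumes a hypothetical `entry_idx` column linking each observable to its entry, which may differ from the real `observables` schema, and is not timed here:

```python
import pandas as pd

# Hypothetical stand-ins; `entry_idx` is an assumed link column.
entries_demo = pd.DataFrame({"filename": ["a.jpg", "b.jpg", "c.jpg"]})
observables_demo = pd.DataFrame({
    "entry_idx": [0, 0, 2],
    "xmin": [10, 25, 3],
})

# Group once: entry index -> list of observable dicts.
obs_by_entry = {
    int(entry_idx): group.to_dict("records")
    for entry_idx, group in observables_demo.groupby("entry_idx")
}

pairs = []
for e in entries_demo.itertuples():
    # Entries without observables get an empty list.
    obs = obs_by_entry.get(e.Index, [])
    pairs.append((e.filename, len(obs)))
```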
