# Benchmark pandas Versus cuDF

Use `timeit` to benchmark pandas against RAPIDS cuDF.

## System Details

### GPU

In [4]:
!nvidia-smi -q



Timestamp                           : Tue Aug 13 07:29:07 2019
Driver Version                      : 418.56
CUDA Version                        : 10.1

Attached GPUs                       : 1
GPU 00000000:81:00.0
    Product Name                    : Tesla T4
    Product Brand                   : Tesla
    Display Mode                    : Enabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0561119011981
    GPU UUID                        : GPU-8b4068b3-1bcf-8dbe-978e-8eacb3c22801
    Minor Number                    : 0
    VBIOS Version                   : 90.04.38.00.03
    MultiGPU Board                  : No
    Board ID                        : 0x8100
    GPU Part Number                 : 900-2G183-0000-0

### CPU

In [5]:
!less /proc/cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
stepping        : 1
microcode       : 0xb000021
cpu MHz         : 2200.000
cache size      : 25600 KB
physical id     : 0
siblings        : 10
core id         : 0
cpu cores       : 10
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
[K:[K         : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma [7m/proc/cpuinfo[m[K

## Benchmark Setup

### Installations

Install v3io-generator to create the 1 GB dataset for the benchmark. <br>
You only need to create the dataset once; subsequent runs can reuse the same file.

In [6]:
!pip install -i https://test.pypi.org/simple/ v3io-generator --upgrade

Looking in indexes: https://test.pypi.org/simple/
Requirement already up-to-date: v3io-generator in /User/.pythonlibs/lib/python3.6/site-packages (0.0.27.dev0)


### Configurations

In [7]:
# Benchmark configurations
metric_names = ['cpu_utilization', 'latency', 'packet_loss', 'throughput']
nlargest = 10

### Imports

In [8]:
import os
import yaml
import time
import datetime
import json
import itertools

# Generator
from v3io_generator import metrics_generator, deployment_generator

# Dataframes
import cudf
import pandas as pd

### Create the Data Source

Use v3io-generator to create a time-series network-operations dataset for 100 companies, with 4 metrics per company (CPU utilization, latency, throughput, and packet loss).

Then, write the dataset to a line-delimited JSON file (one record per line) to be used as the data source.

In [9]:
# Create a metadata factory
dep_gen = deployment_generator.deployment_generator()
faker = dep_gen.get_faker()

# Design the metadata
dep_gen.add_level(name='company', number=100, level_type=faker.company)

# Generate a deployment structure
deployment_df = dep_gen.generate_deployment()

# Initialize the metric values
for metric in metric_names:
    deployment_df[metric] = 0

deployment_df.head()

|   | company | cpu_utilization | latency | packet_loss | throughput |
|---|---|---|---|---|---|
| 0 | Walker_Inc | 0 | 0 | 0 | 0 |
| 1 | Perkins__Pierce_and_Lloyd | 0 | 0 | 0 | 0 |
| 2 | Mills__Lara_and_Wilkins | 0 | 0 | 0 | 0 |
| 3 | Reyes-Ramos | 0 | 0 | 0 | 0 |
| 4 | Clark__Smith_and_Kelly | 0 | 0 | 0 | 0 |


In [10]:
metrics_configuration = yaml.safe_load("""
errors: {length_in_ticks: 50, rate_in_ticks: 150}
timestamps: {interval: 5s, stochastic_interval: false}
metrics:
  cpu_utilization:
    accuracy: 2
    distribution: normal
    distribution_params: {mu: 70, noise: 0, sigma: 10}
    is_threshold_below: true
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 100, min: 0, validate: true}
  latency:
    accuracy: 2
    distribution: normal
    distribution_params: {mu: 0, noise: 0, sigma: 5}
    is_threshold_below: true
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 100, min: 0, validate: true}
  packet_loss:
    accuracy: 0
    distribution: normal
    distribution_params: {mu: 0, noise: 0, sigma: 2}
    is_threshold_below: true
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 50, min: 0, validate: true}
  throughput:
    accuracy: 2
    distribution: normal
    distribution_params: {mu: 250, noise: 0, sigma: 20}
    is_threshold_below: false
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 300, min: 0, validate: true}
""")

In [11]:
met_gen = metrics_generator.Generator_df(metrics_configuration, 
                                         user_hierarchy=deployment_df, 
                                         initial_timestamp=time.time())

In [18]:
#source_file = '/tmp/ops.logs'
source_file = 'data/ops.logs'
metrics = met_gen.generate_range(start_time=datetime.datetime.now(),
                                 end_time=datetime.datetime.now()+datetime.timedelta(hours=62),
                                 as_df=True,
                                 as_iterator=False)

# Write the metrics to the source file, one JSON record per line
with open(source_file, 'w') as f:
    metrics.to_json(f,
                    orient='records',
                    lines=True)

### Validate the Target File Size

List the data directory to confirm that the generated file is close to the 1 GB target size.

In [19]:
!ls -lah data

total 0
-rw-r--r-- 1 iguazio iguazio 1.2G Aug 13 08:16 ops.logs
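
The same check can be done from Python. The following is a minimal sketch, not part of the original notebook: the helper name `file_size_mb` is hypothetical, and the demo runs on a small temporary file so that it works anywhere. In the notebook you would pass `source_file` instead.

```python
import os
import tempfile

def file_size_mb(path):
    """Return the size of the file at `path` in megabytes (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)

# Demonstrate on a small temporary file; in the notebook,
# pass source_file ('data/ops.logs') instead.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * (2 * 1024 * 1024))  # write exactly 2 MB
    demo_path = tmp.name

demo_size_mb = file_size_mb(demo_path)
print(f"{demo_size_mb:.1f} MB")

# Clean up the temporary file
os.remove(demo_path)
```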


In [20]:
!head data/ops.logs

{"company":"Walker_Inc","cpu_utilization":60.8022581748,"cpu_utilization_is_error":false,"latency":5.58591763,"latency_is_error":false,"packet_loss":1.7237422742,"packet_loss_is_error":false,"throughput":246.5403546954,"throughput_is_error":false,"timestamp":1565683755938}
{"company":"Perkins__Pierce_and_Lloyd","cpu_utilization":78.1321330134,"cpu_utilization_is_error":false,"latency":0.0,"latency_is_error":false,"packet_loss":0.7646416098,"packet_loss_is_error":false,"throughput":260.8775151366,"throughput_is_error":false,"timestamp":1565683755938}
{"company":"Mills__Lara_and_Wilkins","cpu_utilization":78.6219671645,"cpu_utilization_is_error":false,"latency":0.0,"latency_is_error":false,"packet_loss":0.0,"packet_loss_is_error":false,"throughput":245.173861719,"throughput_is_error":false,"timestamp":1565683755938}
{"company":"Reyes-Ramos","cpu_utilization":64.2627482575,"cpu_utilization_is_error":false,"latency":0.0,"latency_is_error":false,"packet_loss":0.0,"packet_loss_is_error":fals
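
Each line of the file is a standalone JSON record. As an illustration of the record layout, the sketch below parses the first record from the output above with the standard-library `json` module. It assumes that `timestamp` is in milliseconds since the Unix epoch, which is consistent with the notebook's run date.

```python
import json
from datetime import datetime, timezone

# First record from the `head` output above, copied verbatim
sample_line = '{"company":"Walker_Inc","cpu_utilization":60.8022581748,"cpu_utilization_is_error":false,"latency":5.58591763,"latency_is_error":false,"packet_loss":1.7237422742,"packet_loss_is_error":false,"throughput":246.5403546954,"throughput_is_error":false,"timestamp":1565683755938}'

record = json.loads(sample_line)

# Each metric has a companion "<metric>_is_error" flag
metrics_in_record = sorted(k for k in record if f"{k}_is_error" in record)

# Assumption: the timestamp is in milliseconds since the Unix epoch
ts = datetime.fromtimestamp(record["timestamp"] / 1000, tz=timezone.utc)

print(record["company"], metrics_in_record, ts.date())
```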

## Benchmark

The benchmark tests use the following flow:

- Read file
- Compute aggregations
- Get the n-largest values
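
To make the three steps concrete, here is a pure-Python sketch of what each step computes, run on a few made-up records. It uses neither pandas nor cuDF and is not part of the benchmark. The sample values and the helper name `aggregate` are assumptions for illustration only.

```python
import heapq
import io
import json
from collections import defaultdict

# Made-up line-delimited JSON sample standing in for the benchmark file
sample = io.StringIO(
    '{"company": "A", "cpu_utilization": 60.0}\n'
    '{"company": "B", "cpu_utilization": 80.0}\n'
    '{"company": "A", "cpu_utilization": 70.0}\n'
)

# Step 1: read the file, one JSON record per line
records = [json.loads(line) for line in sample]

# Step 2: group by company and compute min, max, and mean of one metric
def aggregate(records, metric):
    groups = defaultdict(list)
    for rec in records:
        groups[rec["company"]].append(rec[metric])
    return {
        company: {"min": min(vals), "max": max(vals), "mean": sum(vals) / len(vals)}
        for company, vals in groups.items()
    }

stats = aggregate(records, "cpu_utilization")

# Step 3: take the n records with the largest metric value (here n = 2)
top2 = heapq.nlargest(2, records, key=lambda rec: rec["cpu_utilization"])

print(stats)
print(top2)
```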

In [21]:
benchmark_file = source_file

The examples below use the `%%timeit` cell magic to run each cell in a loop. <br>
You can change the number of loops (`-n`) and runs (`-r`), for example:<br>
`%%timeit -n 1 -r 1`
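
Outside IPython, the standard-library `timeit` module offers the same controls. The following is a minimal sketch, assuming a hypothetical stand-in function `workload` in place of a benchmark cell; `number` corresponds to `-n` and `repeat` to `-r`.

```python
import timeit

def workload():
    # Hypothetical stand-in for the body of a benchmark cell
    return sum(i * i for i in range(100_000))

# number=1 and repeat=1 mirror `%%timeit -n 1 -r 1`;
# the result is a list with one total time (in seconds) per repeat
timings = timeit.repeat(workload, number=1, repeat=1)
print(f"{min(timings):.4f} s")
```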

### cuDF Benchmark

In [22]:
%%timeit

# Read file
gdf = cudf.read_json(benchmark_file, lines=True)

# Perform aggregation
ggdf = gdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})

# Get the n-largest values (from the original DataFrame)
raw_nlargest = gdf.nlargest(nlargest, 'cpu_utilization')

3.11 s ± 26.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### pandas Benchmark

In [17]:
%%timeit

# Read file
pdf = pd.read_json(benchmark_file, lines=True)

# Perform aggregation
gpdf = pdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})

# Get the n-largest values (from the original DataFrame)
raw_nlargest = pdf.nlargest(nlargest, 'cpu_utilization')

40.5 s ± 388 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Test Load Times

#### cuDF

In [24]:
%%timeit -r 2
gdf = cudf.read_json(benchmark_file, lines=True)

2.73 s ± 4.57 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)


#### pandas

In [36]:
%%timeit
pdf = pd.read_json(benchmark_file, lines=True)

41.1 s ± 651 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Test Aggregation

Load the file into memory with both libraries so that `timeit` measures only the aggregations.

In [4]:
gdf = cudf.read_json(benchmark_file, lines=True)
pdf = pd.read_json(benchmark_file, lines=True)

#### cuDF

In [5]:
%%timeit

ggdf = gdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})
raw_nlargest = gdf.nlargest(nlargest, 'cpu_utilization')

212 ms ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### pandas

In [6]:
%%timeit

gpdf = pdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})
raw_nlargest = pdf.nlargest(nlargest, 'cpu_utilization')

2.17 s ± 72.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
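
The speedup of cuDF over pandas is the ratio of the mean times. The sketch below does that arithmetic using only the mean values reported in the `timeit` outputs of this notebook; the dictionary labels are informal names for the three tests.

```python
# Mean times in seconds, copied from the timeit outputs above: (pandas, cuDF)
timings = {
    "read + aggregate + nlargest": (40.5, 3.11),
    "load only": (41.1, 2.73),
    "aggregate + nlargest only": (2.17, 0.212),
}

# Speedup = pandas time / cuDF time
speedups = {name: pandas_s / cudf_s for name, (pandas_s, cudf_s) in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

On this machine, that works out to roughly 13x for the full flow, 15x for loading, and 10x for the aggregations alone.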
