# Performance Comparison: pandas Versus RAPIDS cuDF

This tutorial uses `timeit` to benchmark pandas against RAPIDS cuDF on the same data set and the same operations.

<a id="rapids-conda-envir-setup"></a>
## Setting Up a RAPIDS conda Environment with cuDF and cuML

To use the cuDF and cuML RAPIDS libraries, create a RAPIDS conda environment and run this notebook with that environment's Python kernel.
For example, the following command creates a conda environment named `rapids` with RAPIDS version 0.14, Python 3.7, and CUDA Toolkit 10.1:

```sh
conda create -n rapids -c rapidsai -c nvidia -c anaconda -c conda-forge -c defaults ipykernel rapids=0.14 python=3.7 cudatoolkit=10.1
```

After that, make sure to open this notebook with the kernel named `conda-rapids`.
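
To confirm that the notebook is running on the intended kernel, you can inspect the interpreter from inside the notebook. The following is a minimal sketch; the helper name `kernel_environment` is hypothetical, and it assumes the usual conda layout in which the interpreter prefix of an environment named `rapids` ends with `envs/rapids`.

```python
import os
import sys


def kernel_environment():
    """Return details about the interpreter that runs the current kernel."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # CONDA_DEFAULT_ENV is set by `conda activate`; a Jupyter kernel
        # might not set it, so treat a missing value as "unknown".
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = kernel_environment()
print(f"Interpreter: {info['executable']}")
print(f"Prefix:      {info['prefix']}")
print(f"conda env:   {info['conda_env']}")
```

If the printed prefix does not point to the RAPIDS environment, the `import cudf` cell later in this notebook is likely to fail.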

## System Details

### GPU

In [1]:
!nvidia-smi -q



Timestamp                           : Thu Jul  2 15:45:49 2020
Driver Version                      : 440.31
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:00:1E.0
    Product Name                    : Tesla V100-SXM2-16GB
    Product Brand                   : Tesla
    Display Mode                    : Enabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0323617005627
    GPU UUID                        : GPU-43bd4553-f5b7-55ab-0633-ecba7c3a64d5
    Minor Number                    : 0
    VBIOS Version                   : 88.00.4F.00.09
    MultiGPU Board                  : No
    Board ID                        : 0x1e
    GPU Part Number                 : 900-2G

## Benchmark Setup

### Installations

Install v3io-generator to create a 1 GB data set for the benchmark.<br>
You only need to run the generator once, and then you can reuse the generated data set.

In [2]:
import sys
!{sys.executable} -m pip install -i https://test.pypi.org/simple/ v3io-generator
!{sys.executable} -m pip install faker
!{sys.executable} -m pip install pytimeparse

Looking in indexes: https://test.pypi.org/simple/
Collecting v3io-generator
  Downloading https://test-files.pythonhosted.org/packages/6c/f6/ba9045111de98747af2c94e10f3dbf74311e6bd3a033c7ea1ca84e084e82/v3io_generator-0.0.27.dev0-py3-none-any.whl (9.3 kB)
Installing collected packages: v3io-generator
Successfully installed v3io-generator-0.0.27.dev0
Collecting faker
  Using cached Faker-4.1.1-py3-none-any.whl (1.0 MB)
Collecting text-unidecode==1.3
  Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Installing collected packages: text-unidecode, faker
Successfully installed faker-4.1.1 text-unidecode-1.3
Collecting pytimeparse
  Using cached pytimeparse-1.1.8-py2.py3-none-any.whl (10.0 kB)
Installing collected packages: pytimeparse
Successfully installed pytimeparse-1.1.8


> **Note:** You must **restart the Jupyter kernel** to complete the installation.

### Imports

In [3]:
import os
import yaml
import time
import datetime
import json
import itertools

# Generator
from v3io_generator import metrics_generator, deployment_generator

# Dataframes
import cudf
import pandas as pd

### Configurations

In [4]:
# Benchmark configurations
metric_names = ['cpu_utilization', 'latency', 'packet_loss', 'throughput']
nlargest = 10
source_file = os.path.join(os.getcwd(), 'data', 'ops.logs') # Use full path

### Create the Data Source

Use v3io-generator to create a time-series network-operations data set for 100 companies with four metrics (CPU utilization, latency, packet loss, and throughput).<br>
Then, write the data set to a JSON Lines file (one JSON record per line) to use as the data source.

In [5]:
# Create a metadata factory
dep_gen = deployment_generator.deployment_generator()
faker=dep_gen.get_faker()

# Design the metadata
dep_gen.add_level(name='company',number=100,level_type=faker.company)

# Generate a deployment structure
deployment_df = dep_gen.generate_deployment()

# Initialize the metric values
for metric in metric_names:
    deployment_df[metric] = 0

deployment_df.head()

Unnamed: 0,company,cpu_utilization,latency,packet_loss,throughput
0,Williams_and_Sons,0,0,0,0
1,Guerrero_Ltd,0,0,0,0
2,Harris-Gutierrez,0,0,0,0
3,Shaw-Williams,0,0,0,0
4,Harris_Inc,0,0,0,0


Specify the metrics configuration for the generator.

In [6]:
metrics_configuration = yaml.safe_load("""
errors: {length_in_ticks: 50, rate_in_ticks: 150}
timestamps: {interval: 5s, stochastic_interval: false}
metrics:
  cpu_utilization:
    accuracy: 2
    distribution: normal
    distribution_params: {mu: 70, noise: 0, sigma: 10}
    is_threshold_below: true
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 100, min: 0, validate: true}
  latency:
    accuracy: 2
    distribution: normal
    distribution_params: {mu: 0, noise: 0, sigma: 5}
    is_threshold_below: true
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 100, min: 0, validate: true}
  packet_loss:
    accuracy: 0
    distribution: normal
    distribution_params: {mu: 0, noise: 0, sigma: 2}
    is_threshold_below: true
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 50, min: 0, validate: true}
  throughput:
    accuracy: 2
    distribution: normal
    distribution_params: {mu: 250, noise: 0, sigma: 20}
    is_threshold_below: false
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 300, min: 0, validate: true}
""")

Create the data according to the given hierarchy and metrics configuration.

In [7]:
met_gen = metrics_generator.Generator_df(metrics_configuration, 
                                         user_hierarchy=deployment_df, 
                                         initial_timestamp=time.time())

metrics = met_gen.generate_range(start_time=datetime.datetime.now(),
                                 end_time=datetime.datetime.now()+datetime.timedelta(hours=62),
                                 as_df=True,
                                 as_iterator=False)

# Verify that the source-file parent directory exists.
os.makedirs(os.path.dirname(source_file), exist_ok=True)

print(f'Saving generated data to: {source_file}')

# Write the metrics to the source file, one JSON record per line
with open(source_file, 'w') as f:
    metrics.to_json(f,
                    orient='records',
                    lines=True)

Saving generated data to: /User/data-ingestion-and-preparation/data/ops.logs
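
Because the generator needs to run only once, you might guard the generation cells with a check for an existing file. This is a sketch, not part of the original notebook: the helper `needs_generation` and its `min_bytes` parameter are hypothetical, and the path mirrors the `source_file` configuration used above.

```python
import os


def needs_generation(path, min_bytes=1):
    """Return True when `path` is missing or smaller than `min_bytes`."""
    try:
        return os.path.getsize(path) < min_bytes
    except OSError:
        # The file does not exist or cannot be read.
        return True


source_file = os.path.join(os.getcwd(), 'data', 'ops.logs')

if needs_generation(source_file):
    print('Data set not found; run the generation cells.')
else:
    print(f'Reusing existing data set: {source_file}')
```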


### Validate the Target File Size

Check the size of the generated file, in bytes, to confirm that it is roughly 1 GB.

In [8]:
from pathlib import Path
Path(source_file).stat().st_size

1207964564
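
The byte count is easier to read in larger units. The following sketch converts the value reported above; the helper name `to_gb_and_gib` is hypothetical.

```python
def to_gb_and_gib(num_bytes):
    """Convert a byte count to decimal gigabytes and binary gibibytes."""
    return num_bytes / 10**9, num_bytes / 2**30


size_bytes = 1207964564  # st_size reported by the previous cell
gb, gib = to_gb_and_gib(size_bytes)
print(f'{gb:.2f} GB ({gib:.2f} GiB)')
```

Either way, the file is slightly larger than the 1 GB target.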

In [9]:
# Preview the first 10 records (islice stops early if the file is shorter)
with open(source_file) as myfile:
    head = list(itertools.islice(myfile, 10))
print(head)

['{"company":"Williams_and_Sons","cpu_utilization":64.6440138248,"cpu_utilization_is_error":false,"latency":2.9965630871,"latency_is_error":false,"packet_loss":0.0,"packet_loss_is_error":false,"throughput":258.7732213917,"throughput_is_error":false,"timestamp":1593707325519}\n', '{"company":"Guerrero_Ltd","cpu_utilization":68.5296690547,"cpu_utilization_is_error":false,"latency":0.0,"latency_is_error":false,"packet_loss":0.0,"packet_loss_is_error":false,"throughput":288.8039306559,"throughput_is_error":false,"timestamp":1593707325519}\n', '{"company":"Harris-Gutierrez","cpu_utilization":55.8557277251,"cpu_utilization_is_error":false,"latency":1.7068227314,"latency_is_error":false,"packet_loss":1.6544231936,"packet_loss_is_error":false,"throughput":265.4031916784,"throughput_is_error":false,"timestamp":1593707325519}\n', '{"company":"Shaw-Williams","cpu_utilization":72.8668610421,"cpu_utilization_is_error":false,"latency":1.6477141418,"latency_is_error":false,"packet_loss":0.8709185994,
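
Each line of the file is a standalone JSON record. As a sketch of the record layout, the following parses the first record from the output above with the standard `json` module. Interpreting `timestamp` as milliseconds since the Unix epoch is an assumption based on its magnitude, not something the generator output states.

```python
import json
from datetime import datetime, timezone

# First record from the preview above, copied verbatim.
line = (
    '{"company":"Williams_and_Sons","cpu_utilization":64.6440138248,'
    '"cpu_utilization_is_error":false,"latency":2.9965630871,'
    '"latency_is_error":false,"packet_loss":0.0,'
    '"packet_loss_is_error":false,"throughput":258.7732213917,'
    '"throughput_is_error":false,"timestamp":1593707325519}\n'
)

record = json.loads(line)
for key, value in record.items():
    print(f'{key:28} {value!r}')

# Assumption: the timestamp is in epoch milliseconds.
ts = datetime.fromtimestamp(record['timestamp'] / 1000, tz=timezone.utc)
print(ts.isoformat())
```

Every metric has a companion `<metric>_is_error` flag, so each record carries ten fields.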

## Benchmark

The benchmark tests use the following flow:

- Read file
- Compute aggregations
- Get the n-largest values
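
The three steps above can be sketched on a tiny in-memory sample with pandas. The company names `A` and `B` and the metric values are made up for illustration; the column names follow the generated data set. The cuDF version of the benchmark uses the same method names on a `cudf` DataFrame.

```python
import io

import pandas as pd

# Hypothetical JSON Lines sample with two of the four metrics.
sample = io.StringIO(
    '{"company":"A","cpu_utilization":64.6,"latency":3.0}\n'
    '{"company":"B","cpu_utilization":68.5,"latency":0.0}\n'
    '{"company":"A","cpu_utilization":72.9,"latency":1.6}\n'
    '{"company":"B","cpu_utilization":55.9,"latency":1.7}\n'
)
sample_metrics = ['cpu_utilization', 'latency']

# 1. Read the file (one JSON record per line).
df = pd.read_json(sample, lines=True)

# 2. Compute min, max, and mean per company for each metric.
summary = df.groupby(['company']).agg(
    {k: ['min', 'max', 'mean'] for k in sample_metrics})

# 3. Get the rows with the n largest CPU-utilization values.
top = df.nlargest(2, 'cpu_utilization')

print(summary)
print(top)
```

The aggregation result has two-level columns, so a single statistic is addressed with a tuple such as `summary.loc['A', ('cpu_utilization', 'max')]`.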

In [10]:
benchmark_file = source_file

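In the following examples, the `%%timeit` cell magic runs each cell repeatedly and reports the mean and standard deviation of the run times.<br>
You can change the number of loops per run (`-n`) and the number of runs (`-r`):
```
%%timeit -n 1 -r 1
```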
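
Outside IPython, the standard-library `timeit` module provides a comparable measurement. The following sketch mirrors `-n` and `-r` with the `number` and `repeat` arguments of `timeit.repeat`. The helper `time_callable` is hypothetical, the workload is a stand-in for the real benchmark, and using the population standard deviation to summarize the runs is an assumption.

```python
import statistics
import timeit


def time_callable(func, number=1, repeat=2):
    """Time `func` and return (mean, std. dev.) in seconds per loop."""
    # timeit.repeat returns one total time per run; each run calls
    # `func` `number` times.
    totals = timeit.repeat(func, number=number, repeat=repeat)
    per_loop = [total / number for total in totals]
    return statistics.mean(per_loop), statistics.pstdev(per_loop)


# Stand-in workload; replace it with the operation you want to measure.
mean, stdev = time_callable(lambda: sum(range(100_000)), number=1, repeat=2)
print(f'{mean * 1e3:.3f} ms ± {stdev * 1e3:.3f} ms per loop '
      f'(2 runs, 1 loop each)')
```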

## Test Load Times

#### cuDF

In [13]:
%%timeit -n 1 -r 2
gdf = cudf.read_json(benchmark_file, lines=True)

5.04 s ± 35.7 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)


#### pandas

In [14]:
%%timeit -n 1 -r 2
pdf = pd.read_json(benchmark_file, lines=True)

36.7 s ± 202 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)


## Test Aggregation

Load the file into memory with both libraries first, so that `timeit` measures only the aggregations and not the load time.

In [15]:
gdf = cudf.read_json(benchmark_file, lines=True)
pdf = pd.read_json(benchmark_file, lines=True)

#### cuDF

In [16]:
%%timeit -n 1 -r 7

ggdf = (gdf.groupby(['company'])
           .agg({k: ['min', 'max', 'mean'] for k in metric_names}))
raw_nlargest = gdf.nlargest(nlargest, 'cpu_utilization')

246 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### pandas

In [17]:
%%timeit -n 1 -r 7

gpdf = (pdf.groupby(['company'])
           .agg({k: ['min', 'max', 'mean'] for k in metric_names}))
raw_nlargest = pdf.nlargest(nlargest, 'cpu_utilization')

1.82 s ± 38.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
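
To put the measurements side by side, the following sketch computes the speedup of cuDF over pandas from the mean times reported above. The helper name `speedup` is hypothetical, and the ratios describe only this run on this hardware.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Mean times reported by the cells above, in seconds.
load_speedup = speedup(36.7, 5.04)    # pandas vs. cuDF read_json
agg_speedup = speedup(1.82, 0.246)    # pandas vs. cuDF groupby + nlargest

print(f'Load:        {load_speedup:.1f}x')
print(f'Aggregation: {agg_speedup:.1f}x')
```

In both tests cuDF is roughly seven times faster than pandas.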
