# Performance tests on Streaming Inference

Java 1.8 must be installed:

In [1]:
%%bash
java -version

java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
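
The version check can also be done programmatically. The following is a minimal sketch that parses the banner shown above; the helper name `parse_java_version` is hypothetical, and the pattern only targets the legacy `1.x` numbering used by Java 8. When capturing the banner from a script, note that `java -version` writes it to standard error, not standard output.

```python
import re

def parse_java_version(banner: str):
    """Return (major, minor) from a `java -version` banner, or None if not found."""
    match = re.search(r'version "(\d+)\.(\d+)', banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Banner copied from the cell output above.
banner = 'java version "1.8.0_171"'
is_java_8 = parse_java_version(banner) == (1, 8)
print("Java 1.8 detected" if is_java_8 else "Java 1.8 not detected")
```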


The `bin` folder of Spark 2.2.1 for Hadoop 2.7 or later must also be on the PATH ([download](https://spark.apache.org/downloads.html)).
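
A quick way to confirm that both prerequisites are reachable before running the benchmark is to look the executables up on the PATH. This is only a sketch using the standard library; the helper name `on_path` is hypothetical.

```python
import shutil

def on_path(command: str) -> bool:
    """True when `command` resolves to an executable on the current PATH."""
    return shutil.which(command) is not None

# Both tools are invoked by the cells in this notebook.
for tool in ("java", "spark-submit"):
    print(tool, "found" if on_path(tool) else "missing from PATH")
```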

In [2]:
import pandas as pd

This function fills the *benchmark* database with test entities:

In [3]:
def createTestCollection(elements=120, entities=2, versions=2, depth=2, fields=2):
    !mkdir -p output
    out = !java -jar es.um.nosql.streaminginference.benchmark-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    --elements $elements \
    --entities $entities \
    --versions $versions \
    --depth $depth \
    --fields $fields \
    --mode mongo \
    --host localhost \
    --port 27017 \
    --database benchmark    
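
The `!` syntax above only works inside IPython. Outside a notebook, the same command could be assembled and launched with `subprocess`, as in the sketch below. The flags and the JAR name are copied from the cell above; the helper names `build_generator_command` and `run_generator` are hypothetical, and the sketch assumes the JAR sits in the working directory.

```python
import os
import subprocess

JAR = "es.um.nosql.streaminginference.benchmark-0.0.1-SNAPSHOT-jar-with-dependencies.jar"

def build_generator_command(elements=120, entities=2, versions=2, depth=2, fields=2):
    """Assemble the same command line that createTestCollection runs."""
    options = {
        "--elements": elements,
        "--entities": entities,
        "--versions": versions,
        "--depth": depth,
        "--fields": fields,
        "--mode": "mongo",
        "--host": "localhost",
        "--port": 27017,
        "--database": "benchmark",
    }
    command = ["java", "-jar", JAR]
    for flag, value in options.items():
        command += [flag, str(value)]
    return command

def run_generator(**kwargs):
    """Create the output folder and run the generator, raising on a non-zero exit."""
    os.makedirs("output", exist_ok=True)  # same effect as `mkdir -p output`
    return subprocess.run(build_generator_command(**kwargs),
                          check=True, capture_output=True, text=True)

print(" ".join(build_generator_command(elements=1000)))
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` surfaces a failed run instead of silently discarding it.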

This function runs the inference application on the previously created database and generates the *stats.csv* file:

In [4]:
def benchmarkSparkApp(interval=1000, block=200):
    out = !spark-submit --master local[*] es.um.nosql.streaminginference.json2dbschema-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    --mode mongo \
    --database benchmark \
    --host localhost \
    --port 27017 \
    --benchmark true \
    --interval $interval \
    --block-interval $block

The following function composes the previous functions to run a test with the given parameters:

In [5]:
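def benchmark(interval=1000, block=200, elements=120, entities=2, versions=2, depth=2, fields=2):
    global benchmarked
    # Start from a clean stats file.
    !rm -f output/stats.csv
    createTestCollection(elements, entities, versions, depth, fields)
    # The loop bound is the number of repetitions; here the Spark application runs once.
    for _ in range(1):
        benchmarkSparkApp(interval, block)
    # Load the statistics written by the Spark application.
    benchmarked = pd.read_csv("output/stats.csv")
    return benchmarked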

Generate a dataset of 1000 elements:

In [6]:
createTestCollection(1000)

Test of different block intervals for a dataset of 100 elements:

In [8]:
results = pd.DataFrame()
for block in [100, 200, 400, 600, 800, 1000]:
    df = benchmark(interval=1000, block=block, elements=100, entities=2, versions=2, depth=4, fields=2)
    display(df)
    df.to_csv("block-"+str(block)+".csv")

| Unnamed: 0 | BATCH_INTERVAL | BLOCK_INTERVAL | PROCESSING_INTERVAL | TOTAL_BATCHES | TOTAL_DELAY | TOTAL_PROCESSING | TOTAL_RECORDS | AVERAGE_DELAY | AVERAGE_PROCESSING | AVERAGE_RECORDS | MAX_PROCESSING | MAX_DELAY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 100 | 1002 | 1 | 385 | 384 | 100 | 385 | 384 | 100 | 384 | 385 |

| Unnamed: 0 | BATCH_INTERVAL | BLOCK_INTERVAL | PROCESSING_INTERVAL | TOTAL_BATCHES | TOTAL_DELAY | TOTAL_PROCESSING | TOTAL_RECORDS | AVERAGE_DELAY | AVERAGE_PROCESSING | AVERAGE_RECORDS | MAX_PROCESSING | MAX_DELAY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 200 | 1995 | 2 | 410 | 404 | 100 | 205 | 202 | 50 | 333 | 334 |

| Unnamed: 0 | BATCH_INTERVAL | BLOCK_INTERVAL | PROCESSING_INTERVAL | TOTAL_BATCHES | TOTAL_DELAY | TOTAL_PROCESSING | TOTAL_RECORDS | AVERAGE_DELAY | AVERAGE_PROCESSING | AVERAGE_RECORDS | MAX_PROCESSING | MAX_DELAY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 400 | 998 | 1 | 364 | 364 | 100 | 364 | 364 | 100 | 364 | 364 |

| Unnamed: 0 | BATCH_INTERVAL | BLOCK_INTERVAL | PROCESSING_INTERVAL | TOTAL_BATCHES | TOTAL_DELAY | TOTAL_PROCESSING | TOTAL_RECORDS | AVERAGE_DELAY | AVERAGE_PROCESSING | AVERAGE_RECORDS | MAX_PROCESSING | MAX_DELAY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 600 | 1000 | 1 | 407 | 402 | 100 | 407 | 402 | 100 | 402 | 407 |

| Unnamed: 0 | BATCH_INTERVAL | BLOCK_INTERVAL | PROCESSING_INTERVAL | TOTAL_BATCHES | TOTAL_DELAY | TOTAL_PROCESSING | TOTAL_RECORDS | AVERAGE_DELAY | AVERAGE_PROCESSING | AVERAGE_RECORDS | MAX_PROCESSING | MAX_DELAY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 800 | 993 | 1 | 355 | 354 | 100 | 355 | 354 | 100 | 354 | 355 |

| Unnamed: 0 | BATCH_INTERVAL | BLOCK_INTERVAL | PROCESSING_INTERVAL | TOTAL_BATCHES | TOTAL_DELAY | TOTAL_PROCESSING | TOTAL_RECORDS | AVERAGE_DELAY | AVERAGE_PROCESSING | AVERAGE_RECORDS | MAX_PROCESSING | MAX_DELAY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 1000 | 995 | 1 | 568 | 566 | 100 | 568 | 566 | 100 | 566 | 568 |
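
To compare the six runs above at a glance, the sketch below picks the block interval with the lowest `TOTAL_PROCESSING`. The values are copied from the outputs; the helper name `fastest_block_interval` is hypothetical, and the time unit is presumably milliseconds, which the notebook does not state. Each configuration was measured once on only 100 records, so differences of this size may well be noise and should not be read as a ranking.

```python
# (BLOCK_INTERVAL, TOTAL_BATCHES, TOTAL_PROCESSING) copied from the six outputs above.
runs = [
    (100, 1, 384),
    (200, 2, 404),
    (400, 1, 364),
    (600, 1, 402),
    (800, 1, 354),
    (1000, 1, 566),
]

def fastest_block_interval(rows):
    """Return the BLOCK_INTERVAL of the run with the smallest TOTAL_PROCESSING."""
    return min(rows, key=lambda row: row[2])[0]

best = fastest_block_interval(runs)
# Gap between the slowest and fastest run, in the same unit as TOTAL_PROCESSING.
spread = max(row[2] for row in runs) - min(row[2] for row in runs)
print(f"lowest TOTAL_PROCESSING at block interval {best}; spread across runs: {spread}")
```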


Test of different batch intervals for a dataset of 1000000 elements:

In [12]:
results = pd.DataFrame()
for batch in [3000]:
    df = benchmark(interval=batch, block=200, elements=1000000, entities=2, versions=2, depth=4, fields=2)
    display(df)
    df.to_csv("interval-"+str(batch)+".csv")

| Unnamed: 0 | BATCH_INTERVAL | BLOCK_INTERVAL | PROCESSING_INTERVAL | TOTAL_BATCHES | TOTAL_DELAY | TOTAL_PROCESSING | TOTAL_RECORDS | AVERAGE_DELAY | AVERAGE_PROCESSING | AVERAGE_RECORDS | MAX_PROCESSING | MAX_DELAY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3000 | 200 | 95941 | 32 | 229272 | 76298 | 702266 | 7164 | 2384 | 21945 | 33213 | 33216 |
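
The `AVERAGE_*` columns in this output appear to be the corresponding `TOTAL_*` columns divided by `TOTAL_BATCHES` and truncated to an integer. That is an inference from the numbers, not something the notebook states; the sketch below checks it against the row above. The helper name `floor_average` is hypothetical.

```python
# Values copied from the output row above.
total_batches = 32
totals = {"DELAY": 229272, "PROCESSING": 76298, "RECORDS": 702266}
reported_averages = {"DELAY": 7164, "PROCESSING": 2384, "RECORDS": 21945}

def floor_average(total: int, batches: int) -> int:
    """Integer (truncating) average of a total over a number of batches."""
    return total // batches

recomputed = {name: floor_average(value, total_batches) for name, value in totals.items()}
matches = recomputed == reported_averages
print("averages match totals // batches:", matches)
```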
