# Benchmarking DynamoDB Write Methods
This notebook explores a couple of different ways we can write large sets of predictions to DynamoDB.

### Reading From Disk
These are the precalculations/predictions that I generated earlier using the 2M compound reference library.

In [1]:
import pandas as pd
from ersilia_precalc_poc.read import get_predictions_from_dataframe

MODEL_ID = "eos3b5e"

In [2]:
df_all_predictions = pd.read_csv("../data/prediction_output.csv", usecols=[1,2,3])
df_all_predictions.head()

| | key | input | mw |
|---|---|---|---|
| 0 | PCQFQFRJSWBMEL-UHFFFAOYSA-N | COC(=O)C1=CC=CC2=C1C(=O)C1=CC([N+](=O)[O-])=CC... | 283.239 |
| 1 | MRSBJIAZTHGJAP-UHFFFAOYSA-N | CN(C)CCC1=CN(C)C2=CC=C(O)C=C12\n | 218.3 |
| 2 | CJUOVTMTGQENNQ-UHFFFAOYSA-N | CC1=C(S(=O)(=O)N2CCCCC2)C2=C(S1)N=CN(CC(=O)N1C... | 516.649 |
| 3 | OFCIHDDKDVHWGO-LICLKQGHSA-N | CN(C)CCOC1=CC=C(C(=O)/C=C/C2=CC=C(OC3=CC=CC=C3... | 387.479 |
| 4 | SZROQWMXFIMBGE-UHFFFAOYSA-N | O=C(CCC1=COC2=CC=CC(OCC3CCCCC3)=C2C1=O)C1=CC=C... | 406.478 |


I noticed that the CSV file includes the DataFrame index as its first column, which has no header and is read back in as `Unnamed: 0`. For now I have ignored it by passing `usecols=[1,2,3]`, but this needs a more robust fix. One option is to use `pandera` to enforce the same schema on the output when it is generated as we already enforce when reading the data frame from disk.
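
As a rough illustration of where the unnamed column comes from, the sketch below round-trips a one-row frame through an in-memory CSV. It assumes the prediction file was written with `DataFrame.to_csv` using its default `index=True`; the row is taken from the output above (without the trailing newline), and the variable names are hypothetical.

```python
import io

import pandas as pd

# One row copied from the head() output above, used purely for illustration.
df = pd.DataFrame(
    {
        "key": ["MRSBJIAZTHGJAP-UHFFFAOYSA-N"],
        "input": ["CN(C)CCC1=CN(C)C2=CC=C(O)C=C12"],
        "mw": [218.3],
    }
)

# Default to_csv writes the index as a leading column with an empty header.
with_index_csv = df.to_csv()
with_index = pd.read_csv(io.StringIO(with_index_csv))

# Option 1: do not write the index in the first place.
clean = pd.read_csv(io.StringIO(df.to_csv(index=False)))

# Option 2: keep the file as-is but select columns by name, not position.
by_name = pd.read_csv(io.StringIO(with_index_csv), usecols=["key", "input", "mw"])

print(list(with_index.columns))
print(list(clean.columns))
print(list(by_name.columns))
```

Selecting columns by name keeps working if the column order in the file changes, which positional `usecols=[1,2,3]` does not.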

In [3]:
%%time
predictions = get_predictions_from_dataframe(MODEL_ID, df_all_predictions)

CPU times: user 12 s, sys: 305 ms, total: 12.3 s
Wall time: 12.4 s


Converting the DataFrame into a Python list that is ready to write to DynamoDB takes about 12 s. That is acceptable given that it happens only once per model and this test uses the full set of 2M inputs.

### Method 1: Using the Built-in Batch Writer

In [4]:
from ersilia_precalc_poc.write import write_precalcs_batch_writer
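
The body of `write_precalcs_batch_writer` is not shown in this notebook. As a minimal sketch of the general pattern, assuming it wraps boto3's `Table.batch_writer()` context manager, a writer could look like the following. The function name `write_items_with_batch_writer` and the shape of each item are hypothetical; the table object is passed in so that the function itself has no AWS dependency.

```python
from typing import Any, Iterable, Mapping


def write_items_with_batch_writer(
    table: Any, items: Iterable[Mapping[str, Any]]
) -> int:
    """Write items through a boto3-style batch writer and return the count.

    `table` is assumed to expose `batch_writer()` as boto3's DynamoDB Table
    resource does. The batch writer buffers puts, sends them in batches, and
    resends any unprocessed items.
    """
    written = 0
    with table.batch_writer() as batch:
        for item in items:
            # Numeric attributes must be decimal.Decimal, not float,
            # when using the boto3 resource API.
            batch.put_item(Item=dict(item))
            written += 1
    return written


# Hypothetical usage against a real table (requires AWS credentials):
#   import boto3
#   table = boto3.resource("dynamodb").Table("precalculations-poc")
#   write_items_with_batch_writer(table, predictions[:1000])
```

Because requests are issued one batch at a time from a single thread, wall time is dominated by network round trips rather than CPU.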

Start by writing 1,000 records to the table and measuring the time taken.

In [7]:
%%time
write_precalcs_batch_writer("precalculations-poc", predictions[:1000])

CPU times: user 205 ms, sys: 15.1 ms, total: 220 ms
Wall time: 16.1 s


Next, increase to 10,000 records.

In [8]:
%%time
write_precalcs_batch_writer("precalculations-poc", predictions[1000:11000])

CPU times: user 3.1 s, sys: 196 ms, total: 3.3 s
Wall time: 2min 31s


Then 100,000 records.

In [9]:
%%time
write_precalcs_batch_writer("precalculations-poc", predictions[11000:111000])

CPU times: user 30.5 s, sys: 1.85 s, total: 32.3 s
Wall time: 25min 47s


Write time scales roughly linearly with the number of records: 16 s -> 151 s -> 1,547 s, or about 15 to 16 ms per record at every size.
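
One way to check the linearity claim is to compare the per-record latency across the three runs. The sketch below uses only the wall times reported above; the variable names are hypothetical.

```python
# (records written, wall time in seconds) for the three runs above.
runs = [(1_000, 16.1), (10_000, 151.0), (100_000, 1547.0)]

# Milliseconds per record; similar values across runs indicate linear scaling.
ms_per_record = [round(seconds / n * 1000, 2) for n, seconds in runs]

for (n, _), ms in zip(runs, ms_per_record):
    print(f"{n:>7,} records: {ms} ms per record")
```

The three values sit within about 1 ms of each other, so there is no sign of a fixed startup cost or of throttling at the larger sizes.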

Expected time to write all 2M compounds with a single worker:


In [10]:
# Extrapolate from the 100,000-record run (1547 s) to 2,000,000 records.
run_time = (1547 / 100_000) * 2_000_000  # seconds

run_time / 60  # minutes

515.6666666666666

About 516 minutes, or roughly 8.6 hours.
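
The same extrapolation can be expressed in hours, minutes, and seconds. This is a small sketch that assumes the linear scaling seen above holds up to 2M records; `extrapolate_seconds` is a hypothetical helper, and integer arithmetic is used to avoid floating-point noise.

```python
def extrapolate_seconds(
    measured_seconds: int, measured_records: int, target_records: int
) -> int:
    """Scale a measured wall time linearly to a different record count."""
    return measured_seconds * target_records // measured_records


# 100,000 records took 1547 s; scale to the full 2,000,000-compound library.
total_seconds = extrapolate_seconds(1547, 100_000, 2_000_000)

hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{total_seconds} s = {hours} h {minutes} min {seconds} s")
```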