# Parameters for lowest RAM

Mouse vs fly reference genomes

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Cart queue parameters

The cart queue is defined by two parameters: the maximum number of carts in the queue (`max-queued`) and the capacity of a cart (`max-cap`). Producer threads running local prefiltering add records to carts, and stellar search threads consume them.
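To make the role of the two parameters concrete, here is a toy model of a bounded cart queue. It is a sketch under simplifying assumptions, not valik's implementation: there is a single producer, records are plain integers, and the name `run_cart_queue` and its arguments are hypothetical. The point is that the number of buffered records is bounded by roughly `max_cap * max_queued`, which is why both parameters are expected to affect peak memory.

```python
import queue
import threading

def run_cart_queue(records, max_cap, max_queued, n_consumers=2):
    """Toy model (hypothetical, not valik code): one producer fills carts of
    up to max_cap records and pushes them onto a queue that holds at most
    max_queued carts; consumers drain the queue."""
    carts = queue.Queue(maxsize=max_queued)  # put() blocks while the queue is full
    consumed = []
    lock = threading.Lock()

    def producer():
        cart = []
        for rec in records:
            cart.append(rec)
            if len(cart) == max_cap:
                carts.put(cart)      # full cart goes to the queue
                cart = []
        if cart:
            carts.put(cart)          # flush the last, partially filled cart
        for _ in range(n_consumers):
            carts.put(None)          # one stop signal per consumer

    def consumer():
        while True:
            cart = carts.get()
            if cart is None:
                break
            with lock:
                consumed.extend(cart)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed

result = run_cart_queue(range(100), max_cap=10, max_queued=3)
```

With small `max_cap` and `max_queued` the producer blocks more often while waiting for consumers, which is one plausible reason for the longer runtimes measured below.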

In [7]:
df = pd.read_csv("cart_parameters.time", sep="\t")
df.head()

Unnamed: 0,time,mem,error-code,bins,min-len,er,cmin,cmax,repeat-mask,repeats,ibf-size,matches,truth-set-matches,true-matches,missed,min_overlap,max-cap,max-queued
0,154.11,14229632,0,1024,150,0.025,0,254,--keep-all-repeats,156,"4,3G",75143,74641,74158,0.01,10,20000,1024
1,98.01,13190204,0,1024,150,0.025,0,254,--keep-best-repeats,156,"4,3G",44325,74641,63850,0.15,10,20000,1024
2,34.8,11140172,0,1024,150,0.025,0,254,,156,"4,3G",2462,74641,4427,0.95,10,20000,1024
3,154.83,13517256,0,1024,150,0.025,0,254,--keep-all-repeats,156,"4,3G",75135,74641,74158,0.01,10,5000,1024
4,98.23,12551388,0,1024,150,0.025,0,254,--keep-best-repeats,156,"4,3G",44327,74641,63850,0.15,10,5000,1024


In [9]:
df["mem"].corr(df["max-cap"])

0.4127845444819974

In [11]:
df["mem"].corr(df["max-queued"])

0.4614943115504689

In [17]:
keep_all_df = df[df["repeat-mask"] == "--keep-all-repeats"]
keep_all_df["mem"].corr(keep_all_df["max-queued"])

0.7856057029724623

In [19]:
keep_best_df = df[df["repeat-mask"] == "--keep-best-repeats"]
keep_best_df["mem"].corr(keep_best_df["max-queued"])

0.7040909737492136
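The per-setting correlations can also be computed in one pass. The sketch below uses a hypothetical helper `corr_by_group` and made-up toy values (the column names `group`, `x`, `y` are placeholders, not columns of `cart_parameters.time`). It also shows why the pooled correlation can be much lower than the within-group ones: a group-level offset in `y` adds variance that `x` does not explain.

```python
import pandas as pd

def corr_by_group(df, group_col, x, y):
    """Pearson correlation of columns x and y within each value of group_col.
    Rows whose group value is missing are skipped (groupby default)."""
    return pd.Series({key: g[x].corr(g[y]) for key, g in df.groupby(group_col)})

# Toy values for illustration only: y rises with x in both groups,
# but group "b" sits at a higher baseline than group "a".
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "x": [1, 2, 3, 1, 2, 3],
    "y": [10, 11, 12, 20, 21, 22],
})

within = corr_by_group(toy, "group", "x", "y")  # about 1.0 in each group
pooled = toy["x"].corr(toy["y"])                # about 0.16 across all rows
```

For the real data the call would be `corr_by_group(df, "repeat-mask", "max-queued", "mem")`.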

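Within each repeat-mask setting, peak memory correlates more strongly with `max-queued` (0.79 and 0.70) than in the pooled data (0.46). Okay, of course it correlates, but how much of an effect does it have?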

In [43]:
keep_all_df[keep_all_df["mem"] == np.max(keep_all_df["mem"])]

Unnamed: 0,time,mem,error-code,bins,min-len,er,cmin,cmax,repeat-mask,repeats,ibf-size,matches,truth-set-matches,true-matches,missed,min_overlap,max-cap,max-queued
0,154.11,14229632,0,1024,150,0.025,0,254,--keep-all-repeats,156,"4,3G",75143,74641,74158,0.01,10,20000,1024


In [44]:
keep_all_df[keep_all_df["mem"] == np.min(keep_all_df["mem"])]

Unnamed: 0,time,mem,error-code,bins,min-len,er,cmin,cmax,repeat-mask,repeats,ibf-size,matches,truth-set-matches,true-matches,missed,min_overlap,max-cap,max-queued
21,219.19,10631028,0,1024,150,0.025,0,254,--keep-all-repeats,156,"4,3G",75142,74641,74158,0.01,10,10,10


In [54]:
np.round((np.max(keep_all_df["mem"]) - np.min(keep_all_df["mem"])) / np.max(keep_all_df["mem"]) * 100, 2)

25.29

Decreasing the queue parameters from `max-cap` 20000 and `max-queued` 1024 to 10 and 10 reduced the peak memory by 25%.

In [48]:
time_at_max_mem = keep_all_df[keep_all_df["mem"] == np.max(keep_all_df["mem"])]["time"].values[0]
time_at_min_mem = keep_all_df[keep_all_df["mem"] == np.min(keep_all_df["mem"])]["time"].values[0]

In [55]:
np.round((time_at_max_mem - time_at_min_mem) / time_at_max_mem * 100, 2)

-42.23

It also increased the runtime by 42% (from 154.11 to 219.19).
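The memory and runtime comparison can be folded into one helper. This is a minimal sketch: the function name `mem_time_tradeoff` is hypothetical, and the two rows are copied from the max-memory and min-memory outputs above (index 0 and 21).

```python
import pandas as pd

def mem_time_tradeoff(df, mem_col="mem", time_col="time"):
    """Compare the run with the highest peak memory to the run with the lowest.
    Returns (memory saved in %, runtime added in %), both relative to the
    high-memory run."""
    hi = df.loc[df[mem_col].idxmax()]
    lo = df.loc[df[mem_col].idxmin()]
    mem_saved = (hi[mem_col] - lo[mem_col]) / hi[mem_col] * 100
    time_added = (lo[time_col] - hi[time_col]) / hi[time_col] * 100
    return round(float(mem_saved), 2), round(float(time_added), 2)

# The two extreme --keep-all-repeats runs shown above.
extremes = pd.DataFrame(
    {
        "time": [154.11, 219.19],
        "mem": [14229632, 10631028],
        "max-cap": [20000, 10],
        "max-queued": [1024, 10],
    },
    index=[0, 21],
)

mem_saved, time_added = mem_time_tradeoff(extremes)
print(mem_saved, time_added)  # 25.29 42.23
```

Using `idxmax` and `idxmin` picks a single row each, so the result stays well defined even if several runs tie on memory.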

## Parallelization

The number of threads sets the number of both local prefiltering and stellar search jobs.

In [8]:
df = pd.read_csv("parallelization_search_valik.time", sep="\t")
df.head()

Unnamed: 0,time,mem,error-code,command,bins,threads,cart-max-cap,max-queued,min-len,er,...,cmax,repeat-mask,repeats,ibf-fpr,ibf-size,matches,truth-set-matches,true-matches,missed,min_overlap
0,841.55,11278436,0,valik-search,1024,2,15000,1024,150,0.025,...,254,--keep-best-repeats,0.01,208,"3,0G",50093,84323,73254,0.14,100
1,441.61,11124948,0,valik-search,1024,4,15000,1024,150,0.025,...,254,--keep-best-repeats,0.01,208,"3,0G",50097,84323,73254,0.14,100
2,250.85,11636988,0,valik-search,1024,8,15000,1024,150,0.025,...,254,--keep-best-repeats,0.01,208,"3,0G",50091,84323,73254,0.14,100
3,141.16,12079936,0,valik-search,1024,16,15000,1024,150,0.025,...,254,--keep-best-repeats,0.01,208,"3,0G",50097,84323,73254,0.14,100


In [11]:
(np.max(df["mem"]) - np.min(df["mem"])) / np.max(df["mem"]) * 100

7.905571685148001

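Decreasing the number of threads from 16 to 4 decreased the RAM peak by 8%. The 2-thread run used slightly more memory than the 4-thread run, so the minimum is not at the lowest thread count.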

In [12]:
time_at_max_mem = df[df["mem"] == np.max(df["mem"])]["time"].values[0]
time_at_min_mem = df[df["mem"] == np.min(df["mem"])]["time"].values[0]

In [13]:
np.round((time_at_max_mem - time_at_min_mem) / time_at_max_mem * 100, 2)

-212.84

It also increased the runtime by 213% (from 141.16 to 441.61, about 3.1 times as long).
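To see the trade-off for every thread count rather than only the two extremes, each run can be expressed relative to the run with the most threads. The sketch below copies `threads`, `time` and `mem` from the table above; the helper name `relative_to_most_threads` and the derived column names are hypothetical.

```python
import pandas as pd

def relative_to_most_threads(runs):
    """Add runtime and memory columns relative to the run with the most threads."""
    base = runs.loc[runs["threads"].idxmax()]
    out = runs.copy()
    # How many times longer each run takes than the baseline run.
    out["time_ratio"] = out["time"] / base["time"]
    # Peak memory saved compared with the baseline run, in percent.
    out["mem_saved_pct"] = (base["mem"] - out["mem"]) / base["mem"] * 100
    return out

# Values copied from parallelization_search_valik.time as displayed above.
runs = pd.DataFrame({
    "threads": [2, 4, 8, 16],
    "time": [841.55, 441.61, 250.85, 141.16],
    "mem": [11278436, 11124948, 11636988, 12079936],
})

summary = relative_to_most_threads(runs)
print(summary.round(2))
```

A `time_ratio` of about 3.13 for 4 threads is the same result as the 213% runtime increase, and `mem_saved_pct` peaks at 4 threads rather than at 2.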