# Benchmark: Implement Levenshtein term similarity matrix and fast SCM between corpora ([RaRe-Technologies/gensim PR #2016][#2016])

 [#2016]: https://github.com/RaRe-Technologies/gensim/pull/2016 (Implement Levenshtein term similarity matrix and fast SCM between corpora - Pull Request #2016)

In [1]:
!git rev-parse HEAD

0c3549b1b6cc736148d9e55de2e23fe90fea55f0


In [2]:
from copy import deepcopy
from datetime import timedelta
from itertools import product
import logging
from math import floor, ceil, log10
import pickle
from random import sample, seed, shuffle
from time import time

import numpy as np
import pandas as pd
from tqdm import tqdm_notebook as tqdm

from gensim.corpora import Dictionary
import gensim.downloader as api
from gensim.similarities.index import AnnoyIndexer
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import UniformTermSimilarityIndex
from gensim.similarities import LevenshteinSimilarityIndex
from gensim.models import WordEmbeddingSimilarityIndex
from gensim.utils import simple_preprocess

RANDOM_SEED = 12345

logger = logging.getLogger(__name__)

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.WARNING)
pd.set_option('display.max_rows', None, 'display.max_seq_items', None)

In [3]:
def benchmark_results(benchmark, configurations, results_filename):
    """Repeatedly run a benchmark callable given various configurations and
    get a list of results.

    Parameters
    ----------
    benchmark : callable tuple -> dict
        A benchmark callable that accepts a configuration and returns results.
    configurations : iterable of tuple
        An iterable of configurations that are used for calling the benchmark function.
    results_filename : str
        A filename of a file that will be used to persistently store the results using
        pickle. If the file exists, then the function will load the stored results
        instead of calling the benchmark callable.

    Returns
    -------
    list of dict
        The return values of the individual invocations of the benchmark callable.

    """
    try:
        # Reuse previously stored results if they exist.
        with open(results_filename, "rb") as file:
            results = pickle.load(file)
    except IOError:
        # Run the configurations in random order and store the results.
        configurations = list(configurations)
        shuffle(configurations)
        results = list(tqdm(
            (benchmark(configuration) for configuration in configurations),
            total=len(configurations), desc="benchmark"))
        with open(results_filename, "wb") as file:
            pickle.dump(results, file)
    return results
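
The load-or-compute behaviour of `benchmark_results` can be tried in isolation. The following is a minimal sketch that uses only the standard library: `cached_results` and `toy_benchmark` are hypothetical names introduced for illustration, the progress bar is left out, and the results file lives in a temporary directory.

```python
import os
import pickle
import tempfile
from itertools import product
from random import shuffle


def cached_results(benchmark, configurations, results_filename):
    """Load pickled results if the file exists; otherwise compute and store them."""
    try:
        with open(results_filename, "rb") as file:
            return pickle.load(file)
    except IOError:
        configurations = list(configurations)
        shuffle(configurations)  # run the configurations in random order
        results = [benchmark(configuration) for configuration in configurations]
        with open(results_filename, "wb") as file:
            pickle.dump(results, file)
        return results


calls = []  # records every real invocation of the toy benchmark


def toy_benchmark(configuration):
    size, repetition = configuration
    calls.append(configuration)
    return {"size": size, "repetition": repetition}


path = os.path.join(tempfile.mkdtemp(), "toy.results")
first = cached_results(toy_benchmark, product([10, 100], range(3)), path)
second = cached_results(toy_benchmark, product([10, 100], range(3)), path)
print(len(first), len(calls), first == second)
```

The second call never touches `toy_benchmark`, so deleting the results file is the way to force a fresh measurement.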

## Implement Levenshtein term similarity matrix

In Gensim PR [#1827][], we added a base implementation of the soft cosine measure (SCM). The base implementation would create term similarity matrices using a single complex procedure. In Gensim PR [#2016][], we split the procedure into:

- **TermSimilarityIndex** builder classes that produce the $k$ most similar terms for a given term $t$ that are distinct from $t$ along with the term similarities, and
- the **SparseTermSimilarityMatrix** director class that constructs term similarity matrices and consumes term similarities produced by **TermSimilarityIndex** instances.

One of the benefits of this separation is that we can easily measure the speed at which a **TermSimilarityIndex** builder class produces term similarities and compare this speed with the speed at which the **SparseTermSimilarityMatrix** director class consumes term similarities. This allows us to see which of the classes is the bottleneck that slows down the construction of term similarity matrices.
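
To make the split concrete, here is a toy sketch of the two roles in plain Python. It is not the Gensim implementation: `ToyTermSimilarityIndex` and `build_matrix` are hypothetical names, and the matrix is a plain dictionary keyed by (row, column) instead of a sparse matrix.

```python
class ToyTermSimilarityIndex:
    """Builder: produces the most similar terms for a given term."""

    def __init__(self, similarities):
        # similarities maps term -> {other_term: similarity}
        self.similarities = similarities

    def most_similar(self, term, topn=10):
        neighbours = [
            (other, similarity)
            for other, similarity in self.similarities.get(term, {}).items()
            if other != term]
        neighbours.sort(key=lambda pair: pair[1], reverse=True)
        for other, similarity in neighbours[:topn]:
            yield other, similarity


def build_matrix(index, terms, nonzero_limit=100):
    """Director: consumes term similarities and stores them column by column."""
    matrix = {(term, term): 1.0 for term in terms}  # ones on the diagonal
    for term in terms:
        for other, similarity in index.most_similar(term, topn=nonzero_limit):
            matrix[(other, term)] = similarity
    return matrix


index = ToyTermSimilarityIndex({
    "cat": {"dog": 0.6, "car": 0.1},
    "dog": {"cat": 0.6},
})
matrix = build_matrix(index, ["cat", "dog", "car"], nonzero_limit=1)
print(sorted(matrix.items()))
```

Because the director only calls `most_similar`, either side can be timed on its own, which is what the benchmarks in this notebook rely on.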

In this notebook, we measure all the currently available builder and director classes. For the measurements, we use the [Google News word embeddings][word2vec-google-news-300] distributed with the C implementation of Word2Vec. From the word embeddings, we will derive a dictionary of 2.01M terms.

 [word2vec-google-news-300]: https://github.com/mmihaltz/word2vec-GoogleNews-vectors (word2vec-GoogleNews-vectors)
 [#1827]: https://github.com/RaRe-Technologies/gensim/pull/1827 (Implement Soft Cosine Measure - Pull Request #1827)
 [#2016]: https://github.com/RaRe-Technologies/gensim/pull/2016 (Implement Levenshtein term similarity matrix and fast SCM between corpora - Pull Request #2016)

In [4]:
full_model = api.load("word2vec-google-news-300")

try:
    full_dictionary = Dictionary.load("matrix_speed.dictionary")
except IOError:
    full_dictionary = Dictionary([[term] for term in full_model.vocab.keys()])
    full_dictionary.save("matrix_speed.dictionary")

### Director class benchmark
#### SparseTermSimilarityMatrix
First, we measure the speed at which the **SparseTermSimilarityMatrix** director class consumes term similarities.

In [5]:
def benchmark(configuration):
    dictionary, nonzero_limit, symmetric, repetition = configuration
    index = UniformTermSimilarityIndex(dictionary)
    
    start_time = time()
    matrix = SparseTermSimilarityMatrix(
        index, dictionary, nonzero_limit=nonzero_limit, symmetric=symmetric,
        dtype=np.float16).matrix
    end_time = time()
    
    duration = end_time - start_time
    return {
        "dictionary_size": len(dictionary),
        "nonzero_limit": nonzero_limit,
        "matrix_nonzero": matrix.nnz,
        "repetition": repetition,
        "symmetric": symmetric,
        "duration": duration, }

In [6]:
dictionary_sizes = [10**k for k in range(3, int(ceil(log10(len(full_dictionary)))))]
seed(RANDOM_SEED)
dictionaries = []
for size in tqdm(dictionary_sizes, desc="dictionaries"):
    dictionary = Dictionary([sample(list(full_dictionary.values()), size)])
    dictionaries.append(dictionary)
dictionaries.append(full_dictionary)
nonzero_limits = [1, 10, 100]
symmetry = (True, False)
repetitions = range(10)

configurations = product(dictionaries, nonzero_limits, symmetry, repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.director_results")

The following tables show how long it takes to construct a term similarity matrix (the **duration** column), how many nonzero elements there are in the matrix (the **matrix_nonzero** column) and the mean term similarity consumption speed (the **consumption_speed** column) as we vary the dictionary size (the **dictionary_size** column), the maximum number of nonzero elements outside the diagonal in every column of the matrix (the **nonzero_limit** column), and the matrix symmetry constraint (the **symmetric** column). Ten independent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.
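
As a quick check of how the **consumption_speed** column relates to the other columns, the sketch below recomputes it from the mean duration reported for a dictionary of 10,000 terms with **nonzero_limit** of 1 and no symmetry constraint. The helper name `consumption_speed` is introduced here for illustration. The tables average the per-run speeds, so a value recomputed from a mean duration may differ in the last digits.

```python
def consumption_speed(dictionary_size, nonzero_limit, duration_seconds):
    """Term similarities offered to the director class per second."""
    return dictionary_size * nonzero_limit / duration_seconds


# 10,000 terms, nonzero_limit = 1, mean duration of 0.286091 s
speed = consumption_speed(10000, 1, 0.286091)
formatted = "%.02f Kword pairs / s" % (speed / 1000)
print(formatted)
```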

We can see that the symmetry constraint severely limits the number of nonzero elements in the resulting matrix. This in turn increases the consumption speed, since we end up throwing away most of the elements that we consume. The effects of the dictionary size on the mean term similarity consumption speed are minor.
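
The effect of the symmetry constraint can be illustrated with a toy model. The sketch below is a simplified greedy rule assumed for illustration, not the algorithm used by **SparseTermSimilarityMatrix**: every term proposes the first `nonzero_limit` other terms in dictionary order (as a uniform index would), and in the symmetric case a pair is stored only while both affected columns are below the limit. The function name `count_nonzero` is hypothetical.

```python
def count_nonzero(n_terms, nonzero_limit, symmetric):
    """Count the stored elements of a toy term similarity matrix."""
    entries = {(t, t) for t in range(n_terms)}  # the diagonal is always stored
    off_diagonal = [0] * n_terms  # off-diagonal elements stored per column
    for t in range(n_terms):
        # The first nonzero_limit terms other than t, in dictionary order.
        candidates = [c for c in range(min(n_terms, nonzero_limit + 1)) if c != t]
        for c in candidates[:nonzero_limit]:
            if not symmetric:
                entries.add((c, t))
                continue
            if (c, t) in entries:
                continue  # already stored as the mirror of an earlier pair
            if off_diagonal[t] < nonzero_limit and off_diagonal[c] < nonzero_limit:
                entries.update({(c, t), (t, c)})
                off_diagonal[t] += 1
                off_diagonal[c] += 1
            # Otherwise the similarity is consumed and thrown away.
    return len(entries)


for limit in (1, 10):
    print(limit, count_nonzero(1000, limit, False), count_nonzero(1000, limit, True))
```

Without the constraint the toy model stores `n_terms * (nonzero_limit + 1)` elements, which agrees with the **matrix_nonzero** values in the non-symmetric rows of the top table. With the constraint, the columns of the first few terms fill up quickly and nearly every later similarity is discarded.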

In [7]:
df = pd.DataFrame(results)
df["consumption_speed"] = df.dictionary_size * df.nonzero_limit / df.duration
df = df.groupby(["dictionary_size", "nonzero_limit", "symmetric"])

def display(df):
    df["duration"] = [timedelta(0, duration) for duration in df["duration"]]
    df["matrix_nonzero"] = [int(nonzero) for nonzero in df["matrix_nonzero"]]
    df["consumption_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["consumption_speed"]]
    return df

In [8]:
display(df.mean()).loc[
    [10000, len(full_dictionary)], :, :].loc[
    :, ["duration", "matrix_nonzero", "consumption_speed"]]

dictionary_size,nonzero_limit,symmetric,duration,matrix_nonzero,consumption_speed
10000,1,False,00:00:00.286091,20000,34.95 Kword pairs / s
10000,1,True,00:00:00.166287,10002,60.15 Kword pairs / s
10000,10,False,00:00:01.573833,110000,63.54 Kword pairs / s
10000,10,True,00:00:00.640328,10118,156.20 Kword pairs / s
10000,100,False,00:00:14.662728,1010000,68.20 Kword pairs / s
10000,100,True,00:00:05.233251,20198,191.09 Kword pairs / s
2010000,1,False,00:01:02.938585,4020000,31.94 Kword pairs / s
2010000,1,True,00:00:35.977733,2010002,55.87 Kword pairs / s
2010000,10,False,00:05:53.410117,22110000,56.88 Kword pairs / s
2010000,10,True,00:02:07.940066,2010118,157.11 Kword pairs / s


In [9]:
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [10000, len(full_dictionary)], :, :].loc[
    :, ["duration", "matrix_nonzero", "consumption_speed"]]

dictionary_size,nonzero_limit,symmetric,duration,matrix_nonzero,consumption_speed
10000,1,False,00:00:00.001519,0,0.19 Kword pairs / s
10000,1,True,00:00:00.002255,0,0.82 Kword pairs / s
10000,10,False,00:00:00.013232,0,0.53 Kword pairs / s
10000,10,True,00:00:00.009424,0,2.27 Kword pairs / s
10000,100,False,00:00:00.101245,0,0.47 Kword pairs / s
10000,100,True,00:00:00.021103,0,0.77 Kword pairs / s
2010000,1,False,00:00:00.205360,0,0.10 Kword pairs / s
2010000,1,True,00:00:00.091344,0,0.14 Kword pairs / s
2010000,10,False,00:00:01.252888,0,0.20 Kword pairs / s
2010000,10,True,00:00:00.302513,0,0.37 Kword pairs / s


### Builder class benchmark
#### UniformTermSimilarityIndex
First, we measure the speed at which the **UniformTermSimilarityIndex** builder class produces term similarities. **UniformTermSimilarityIndex** is a dummy class that just generates a sequence of constants. It produces many more term similarities per second than the **SparseTermSimilarityMatrix** class is capable of consuming, and its results will serve as an upper limit.
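
The builder benchmarks in this notebook drain `most_similar` through `zip(..., range(nonzero_limit))`, which stops pulling from a lazy iterator after at most `nonzero_limit` items. The sketch below shows the idiom on a hypothetical generator, `uniform_most_similar`, that stands in for a dummy index; it is an illustration and not the Gensim class.

```python
from itertools import islice


def uniform_most_similar(terms, query, term_similarity=0.5):
    """Lazily yield (term, similarity) for every term other than the query."""
    for term in terms:
        if term != query:
            yield term, term_similarity


terms = ["a", "b", "c", "d", "e"]
nonzero_limit = 2

# zip() ends as soon as range(nonzero_limit) is exhausted.
consumed = [pair for pair, _ in zip(uniform_most_similar(terms, "a"), range(nonzero_limit))]

# itertools.islice expresses the same truncation directly.
same = list(islice(uniform_most_similar(terms, "a"), nonzero_limit))
print(consumed, same)
```

Either form keeps the measured loop from iterating past the requested number of similarities, even if an index were to yield more than `topn` items.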

In [10]:
def benchmark(configuration):
    dictionary, nonzero_limit, repetition = configuration
    
    start_time = time()
    index = UniformTermSimilarityIndex(dictionary)
    end_time = time()
    constructor_duration = end_time - start_time
    
    start_time = time()
    for term in dictionary.values():
        for _j, _k in zip(index.most_similar(term, topn=nonzero_limit), range(nonzero_limit)):
            pass
    end_time = time()
    production_duration = end_time - start_time
    
    return {
        "dictionary_size": len(dictionary),
        "nonzero_limit": nonzero_limit,
        "repetition": repetition,
        "constructor_duration": constructor_duration,
        "production_duration": production_duration, }

In [11]:
nonzero_limits = [1, 10, 100, 1000]

configurations = product(dictionaries, nonzero_limits, repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.builder_results.uniform")

The following tables show how long it takes to retrieve the most similar terms for all terms in a dictionary (the **production_duration** column) and the mean term similarity production speed (the **production_speed** column) as we vary the dictionary size (the **dictionary_size** column), and the maximum number of most similar terms that will be retrieved (the **nonzero_limit** column). Ten independent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.

The **production_speed** grows with **nonzero_limit** and levels off at roughly 1,400 Kword pairs / s once **nonzero_limit** reaches 100.

In [12]:
df = pd.DataFrame(results)
df["processing_speed"] = df.dictionary_size ** 2 / df.production_duration
df["production_speed"] = df.dictionary_size * df.nonzero_limit / df.production_duration
df = df.groupby(["dictionary_size", "nonzero_limit"])

def display(df):
    df["constructor_duration"] = [timedelta(0, duration) for duration in df["constructor_duration"]]
    df["production_duration"] = [timedelta(0, duration) for duration in df["production_duration"]]
    df["processing_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["processing_speed"]]
    df["production_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["production_speed"]]
    return df

In [13]:
display(df.mean()).loc[
    [1000, len(full_dictionary)], :, :].loc[
    :, ["production_duration", "production_speed"]]

dictionary_size,nonzero_limit,production_duration,production_speed
1000,1,00:00:00.003828,261.66 Kword pairs / s
1000,10,00:00:00.009975,1002.80 Kword pairs / s
1000,100,00:00:00.073020,1372.84 Kword pairs / s
1000,1000,00:00:00.727086,1375.54 Kword pairs / s
2010000,1,00:00:08.315807,241.71 Kword pairs / s
2010000,10,00:00:21.027485,955.90 Kword pairs / s
2010000,100,00:02:24.223933,1393.70 Kword pairs / s
2010000,1000,00:23:57.702287,1398.09 Kword pairs / s


In [14]:
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [1000, len(full_dictionary)], :, :].loc[
    :, ["production_duration", "production_speed"]]

dictionary_size,nonzero_limit,production_duration,production_speed
1000,1,00:00:00.000163,10.67 Kword pairs / s
1000,10,00:00:00.000174,17.08 Kword pairs / s
1000,100,00:00:00.004030,67.59 Kword pairs / s
1000,1000,00:00:00.009082,16.98 Kword pairs / s
2010000,1,00:00:00.023885,0.70 Kword pairs / s
2010000,10,00:00:00.054096,2.46 Kword pairs / s
2010000,100,00:00:00.763313,7.42 Kword pairs / s
2010000,1000,00:00:06.162980,5.99 Kword pairs / s


#### LevenshteinSimilarityIndex
Next, we measure the speed at which the **LevenshteinSimilarityIndex** builder class produces term similarities. **LevenshteinSimilarityIndex** is currently just a naïve implementation that produces far fewer term similarities per second than the **SparseTermSimilarityMatrix** class is capable of consuming.

In [15]:
def benchmark(configuration):
    dictionary, nonzero_limit, workers, query_terms, repetition = configuration
    
    start_time = time()
    index = LevenshteinSimilarityIndex(dictionary, workers=workers)
    end_time = time()
    constructor_duration = end_time - start_time
    
    start_time = time()
    for term in query_terms:
        for _j, _k in zip(index.most_similar(term, topn=nonzero_limit), range(nonzero_limit)):
            pass
    end_time = time()
    production_duration = end_time - start_time
    
    return {
        "dictionary_size": len(dictionary),
        "mean_query_term_length": np.mean([len(term) for term in query_terms]),
        "nonzero_limit": nonzero_limit,
        "workers": workers,
        "repetition": repetition,
        "constructor_duration": constructor_duration,
        "production_duration": production_duration, }

In [16]:
nonzero_limits = [1, 10, 100]
workers = [2**k for k in range(5)]
seed(RANDOM_SEED)
min_dictionary = min(dictionaries, key=len)
query_terms = sample(list(min_dictionary.values()), 10)

configurations = product(dictionaries, nonzero_limits, workers, [query_terms], repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.builder_results.levenshtein")

The following tables show how long it takes to retrieve the most similar terms for ten randomly sampled terms from a dictionary (the **production_duration** column), the mean term similarity production speed (the **production_speed** column) and the mean term similarity processing speed (the **processing_speed** column) as we vary the dictionary size (the **dictionary_size** column), the maximum number of most similar terms that will be retrieved (the **nonzero_limit** column), and the number of worker processes (the **workers** column). Ten independent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.

The **production_speed** is proportional to **nonzero_limit / dictionary_size**. Notably, **production_speed** does not increase with **workers**.
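
This scaling is what one would expect from a brute-force scan: every query is compared against the whole dictionary regardless of how many results are kept. The sketch below illustrates the idea in plain Python. It is an assumption-laden stand-in, not the Gensim implementation: the normalized similarity `1 - distance / max_length` is chosen here for simplicity, and the names `levenshtein` and `most_similar` are ours.

```python
import heapq


def levenshtein(a, b):
    """Edit distance computed with the classic two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def most_similar(query, terms, topn):
    """Score every dictionary term once, then keep the topn best matches."""
    scored = (
        (term, 1.0 - levenshtein(query, term) / max(len(query), len(term)))
        for term in terms if term != query)
    return heapq.nlargest(topn, scored, key=lambda pair: pair[1])


dictionary_terms = ["cat", "cart", "dog", "cut", "coat"]
print(most_similar("cat", dictionary_terms, topn=2))
```

The number of distance computations per query equals the dictionary size, so asking for more results (a larger `topn`) raises the number of similarities produced per second, while a larger dictionary lowers it.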

In [17]:
df = pd.DataFrame(results)
df["processing_speed"] = df.dictionary_size * len(query_terms) / df.production_duration
df["production_speed"] = df.nonzero_limit * len(query_terms) / df.production_duration
df = df.groupby(["dictionary_size", "nonzero_limit", "workers"])

def display(df):
    df["constructor_duration"] = [timedelta(0, duration) for duration in df["constructor_duration"]]
    df["production_duration"] = [timedelta(0, duration) for duration in df["production_duration"]]
    df["processing_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["processing_speed"]]
    df["production_speed"] = ["%.02f word pairs / s" % speed for speed in df["production_speed"]]
    return df

In [18]:
display(df.mean()).loc[
    [1000, 1000000, len(full_dictionary)], :, [1, 16]].loc[
    :, ["production_duration", "production_speed", "processing_speed"]]

dictionary_size,nonzero_limit,workers,production_duration,production_speed,processing_speed
1000,1,1,00:00:00.088804,112.66 word pairs / s,112.66 Kword pairs / s
1000,1,16,00:00:00.110414,90.62 word pairs / s,90.62 Kword pairs / s
1000,10,1,00:00:00.088173,1135.12 word pairs / s,113.51 Kword pairs / s
1000,10,16,00:00:00.103538,966.10 word pairs / s,96.61 Kword pairs / s
1000,100,1,00:00:00.084138,11896.50 word pairs / s,118.96 Kword pairs / s
1000,100,16,00:00:00.098996,10104.65 word pairs / s,101.05 Kword pairs / s
1000000,1,1,00:00:58.555345,0.17 word pairs / s,170.86 Kword pairs / s
1000000,1,16,00:01:00.098409,0.17 word pairs / s,166.47 Kword pairs / s
1000000,10,1,00:00:58.418561,1.71 word pairs / s,171.23 Kword pairs / s
1000000,10,16,00:00:59.656511,1.68 word pairs / s,167.66 Kword pairs / s


In [19]:
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [1000, 1000000, len(full_dictionary)], :, [1, 16]].loc[
    :, ["production_duration", "production_speed", "processing_speed"]]

dictionary_size,nonzero_limit,workers,production_duration,production_speed,processing_speed
1000,1,1,00:00:00.002072,2.65 word pairs / s,2.65 Kword pairs / s
1000,1,16,00:00:00.002730,2.25 word pairs / s,2.25 Kword pairs / s
1000,10,1,00:00:00.002696,36.02 word pairs / s,3.60 Kword pairs / s
1000,10,16,00:00:00.001824,17.08 word pairs / s,1.71 Kword pairs / s
1000,100,1,00:00:00.002675,394.92 word pairs / s,3.95 Kword pairs / s
1000,100,16,00:00:00.001845,191.00 word pairs / s,1.91 Kword pairs / s
1000000,1,1,00:00:01.315360,0.00 word pairs / s,3.87 Kword pairs / s
1000000,1,16,00:00:01.375036,0.00 word pairs / s,3.80 Kword pairs / s
1000000,10,1,00:00:01.088038,0.03 word pairs / s,3.21 Kword pairs / s
1000000,10,16,00:00:00.909982,0.03 word pairs / s,2.56 Kword pairs / s


#### WordEmbeddingSimilarityIndex
Lastly, we measure the speed at which the **WordEmbeddingSimilarityIndex** builder class constructs an instance and produces term similarities. Gensim currently supports exact but slow nearest neighbor search, and also approximate nearest neighbor search using [ANNOY][]. We evaluate both options.

 [ANNOY]: https://github.com/spotify/annoy
          (Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk)

In [20]:
def benchmark(configuration):
    (model, dictionary), nonzero_limit, annoy_n_trees, query_terms, repetition = configuration
    use_annoy = annoy_n_trees > 0
    model.init_sims()
    
    start_time = time()
    if use_annoy:
        annoy = AnnoyIndexer(model, annoy_n_trees)
        kwargs = {"indexer": annoy}
    else:
        kwargs = {}
    index = WordEmbeddingSimilarityIndex(model, kwargs=kwargs)
    end_time = time()
    constructor_duration = end_time - start_time
    
    start_time = time()
    for term in query_terms:
        for _j, _k in zip(index.most_similar(term, topn=nonzero_limit), range(nonzero_limit)):
            pass
    end_time = time()
    production_duration = end_time - start_time
    
    return {
        "dictionary_size": len(dictionary),
        "mean_query_term_length": np.mean([len(term) for term in query_terms]),
        "nonzero_limit": nonzero_limit,
        "use_annoy": use_annoy,
        "annoy_n_trees": annoy_n_trees,
        "repetition": repetition,
        "constructor_duration": constructor_duration,
        "production_duration": production_duration, }

In [21]:
models = []
for dictionary in tqdm(dictionaries, desc="models"):
    if dictionary == full_dictionary:
        models.append(full_model)
        continue
    model = full_model.__class__(full_model.vector_size)
    model.vocab = {word: deepcopy(full_model.vocab[word]) for word in dictionary.values()}
    model.index2entity = []
    vector_indices = []
    for index, word in enumerate(full_model.index2entity):
        if word in model.vocab.keys():
            model.index2entity.append(word)
            model.vocab[word].index = len(vector_indices)
            vector_indices.append(index)
    model.vectors = full_model.vectors[vector_indices]
    models.append(model)
annoy_n_trees = [0] + [10**k for k in range(3)]
seed(RANDOM_SEED)
query_terms = sample(list(min_dictionary.values()), 1000)

configurations = product(zip(models, dictionaries), nonzero_limits, annoy_n_trees, [query_terms], repetitions)
results = benchmark_results(benchmark, configurations, "matrix_speed.builder_results.wordembeddings")


The following tables show how long it takes to construct an ANNOY index and the builder class instance (the **constructor_duration** column), how long it takes to retrieve the most similar terms for 1,000 randomly sampled terms from a dictionary (the **production_duration** column), the mean term similarity production speed (the **production_speed** column) and the mean term similarity processing speed (the **processing_speed** column) as we vary the dictionary size (the **dictionary_size** column), the maximum number of most similar terms that will be retrieved (the **nonzero_limit** column), and the number of constructed ANNOY trees (the **annoy_n_trees** column). Ten independent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.

If we do not use ANNOY (**annoy_n_trees**${}=0$), then **production_speed** is proportional to **nonzero_limit / dictionary_size**.
If we do use ANNOY (**annoy_n_trees**${}>0$), then **production_speed** is proportional to **nonzero_limit** / **annoy_n_trees**$^{1/2}$.

In [22]:
df = pd.DataFrame(results)
df["processing_speed"] = df.dictionary_size * len(query_terms) / df.production_duration
df["production_speed"] = df.nonzero_limit * len(query_terms) / df.production_duration
df = df.groupby(["dictionary_size", "nonzero_limit", "annoy_n_trees"])

def display(df):
    df["constructor_duration"] = [timedelta(0, duration) for duration in df["constructor_duration"]]
    df["production_duration"] = [timedelta(0, duration) for duration in df["production_duration"]]
    df["processing_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["processing_speed"]]
    df["production_speed"] = ["%.02f Kword pairs / s" % (speed / 1000) for speed in df["production_speed"]]
    return df

In [23]:
display(df.mean()).loc[
    [1000000, len(full_dictionary)], [1, 100], [0, 1, 100]].loc[
    :, ["constructor_duration", "production_duration", "production_speed", "processing_speed"]]

dictionary_size,nonzero_limit,annoy_n_trees,constructor_duration,production_duration,production_speed,processing_speed
1000000,1,0,00:00:00.000016,00:00:19.814856,0.05 Kword pairs / s,50467.35 Kword pairs / s
1000000,1,1,00:00:29.243768,00:00:00.086151,11.61 Kword pairs / s,11607994.56 Kword pairs / s
1000000,1,100,00:06:18.394023,00:00:00.145153,6.89 Kword pairs / s,6889489.33 Kword pairs / s
1000000,100,0,00:00:00.000014,00:00:21.404975,4.67 Kword pairs / s,46718.15 Kword pairs / s
1000000,100,1,00:00:29.327988,00:00:00.148678,672.60 Kword pairs / s,6725972.71 Kword pairs / s
1000000,100,100,00:06:17.906254,00:00:01.267254,78.91 Kword pairs / s,789115.67 Kword pairs / s
2010000,1,0,00:00:00.000013,00:01:55.708445,0.01 Kword pairs / s,17371.28 Kword pairs / s
2010000,1,1,00:01:30.093113,00:00:00.169142,5.91 Kword pairs / s,11883667.16 Kword pairs / s
2010000,1,100,00:23:21.211156,00:00:00.317731,3.15 Kword pairs / s,6341482.88 Kword pairs / s
2010000,100,0,00:00:00.000012,00:02:12.106273,0.76 Kword pairs / s,15215.26 Kword pairs / s
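
To put the construction cost of an ANNOY index in perspective, the sketch below extrapolates two rows of the table above (dictionary size 1,000,000, **nonzero_limit** of 1) to one query per dictionary term. This is a rough estimate under the assumption that query time scales linearly with the number of queries; it ignores the time spent by the director class and the fact that ANNOY results are approximate. The helper name `full_matrix_seconds` is hypothetical.

```python
def full_matrix_seconds(constructor_seconds, seconds_per_1000_queries, n_terms):
    """One-off construction cost plus one query for every dictionary term."""
    return constructor_seconds + seconds_per_1000_queries / 1000 * n_terms


n_terms = 1000000

# annoy_n_trees = 0: no index to build, 19.814856 s per 1,000 queries
exact = full_matrix_seconds(0.000016, 19.814856, n_terms)

# annoy_n_trees = 1: 29.243768 s to build, 0.086151 s per 1,000 queries
annoy_one_tree = full_matrix_seconds(29.243768, 0.086151, n_terms)

print("exact search:  %.1f hours" % (exact / 3600))
print("ANNOY, 1 tree: %.1f minutes" % (annoy_one_tree / 60))
```

Under these assumptions the one-off cost of building the index is small compared with the query time it saves once every term in the dictionary is queried.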


In [24]:
display(df.apply(lambda x: (x - x.mean()).std())).loc[
    [1000000, len(full_dictionary)], [1, 100], [0, 1, 100]].loc[
    :, ["constructor_duration", "production_duration", "production_speed", "processing_speed"]]

dictionary_size,nonzero_limit,annoy_n_trees,constructor_duration,production_duration,production_speed,processing_speed
1000000,1,0,00:00:00.000007,00:00:00.038433,0.00 Kword pairs / s,97.76 Kword pairs / s
1000000,1,1,00:00:00.037389,00:00:00.000601,0.08 Kword pairs / s,79750.61 Kword pairs / s
1000000,1,100,00:00:00.778346,00:00:00.000842,0.04 Kword pairs / s,39420.33 Kword pairs / s
1000000,100,0,00:00:00,00:00:00.019706,0.00 Kword pairs / s,43.04 Kword pairs / s
1000000,100,1,00:00:00.230572,00:00:00.000249,1.13 Kword pairs / s,11255.12 Kword pairs / s
1000000,100,100,00:00:00.365938,00:00:00.004160,0.26 Kword pairs / s,2582.72 Kword pairs / s
2010000,1,0,00:00:00,00:00:00.165553,0.00 Kword pairs / s,24.79 Kword pairs / s
2010000,1,1,00:00:00.054403,00:00:00.000622,0.02 Kword pairs / s,43763.82 Kword pairs / s
2010000,1,100,00:02:36.227334,00:00:00.017605,0.15 Kword pairs / s,308488.25 Kword pairs / s
2010000,100,0,00:00:00.000001,00:00:00.546129,0.00 Kword pairs / s,62.80 Kword pairs / s
