

    Solve the following using Python:

        generate 1 million random integers (1..10000)
        find all 77s
        measure the time (using %timeit or %%timeit)



In [2]:
import numpy as np
import pandas as pd

In [4]:
array = np.random.randint(1, 10001, size=1000000)  # the upper bound is exclusive, so this yields 1..10000

# 1. Solution

In [24]:
df = pd.DataFrame(array)

In [10]:
df['is77'] = df[df == 77].notna()

In [12]:
%%timeit
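# df[df == 77] replaces every non-77 entry with NaN; notna() then turns that into a boolean flag
df['is77'] = df[df == 77].notna()
# keep only the rows where the flag is set
df[df['is77'] == True]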

91.8 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
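
For comparison, the same search can be run directly on the NumPy array, without building a DataFrame first. This is a minimal sketch, not a measured result: `rng`, `values`, `mask`, and `positions` are names introduced here for illustration, and no timing is claimed for it.

```python
import numpy as np

# Hypothetical stand-in for `array` above: 1 million integers in 1..10000.
# Generator.integers excludes the upper bound by default, hence 10001.
rng = np.random.default_rng(seed=0)
values = rng.integers(1, 10001, size=1_000_000)

# Boolean mask: True wherever the element equals 77.
mask = values == 77

# Indices of all 77s, in ascending order.
positions = np.flatnonzero(mask)

print(len(positions), "occurrences of 77")
```

Wrapping the last two statements in a `%%timeit` cell would give a number that can be compared with the other solutions on the same machine.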


# 2. Solution

In [15]:
%%timeit
seventy_sevens = []
for index, value in enumerate(array):
    if value == 77:
        seventy_sevens.append((index, value))

680 ms ± 41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# 3. Solution

In [16]:
from collections import defaultdict

In [17]:
number_lookup = defaultdict(list)  # a dictionary whose missing keys start out as empty lists

In [18]:
number_lookup

defaultdict(list, {})

In [23]:
%%timeit
for index, number in enumerate(array):
    number_lookup[number].append(index)
# Builds a dictionary that maps each value in the array to the list of indices where it occurs.
# Note: %%timeit runs this cell several times, so number_lookup accumulates duplicate indices here.

783 ms ± 51.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [20]:
number_lookup

defaultdict(list,
            {8225: [0,
              18742,
              25629,
              30846,
              32598,
              33873,
              41809,
              59793,
              61556,
              70832,
              76847,
              90825,
              108529,
              109705,
              118428,
              118671,
              128135,
              146528,
              146735,
              155106,
              172713,
              179302,
              207713,
              218063,
              241650,
              252230,
              254266,
              258519,
              267780,
              272775,
              283076,
              299354,
              299732,
              302029,
              336705,
              340306,
              341947,
              359615,
              375109,
              383652,
              385808,
              385982,
              396526,
              400574,
              412290,
  

In [22]:
%%timeit
number_lookup[77]

443 ns ± 130 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


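- The lookup itself is now much faster: about 443 ns, compared with about 92 ms for the pandas filter and 680 ms for the plain loop
- But I have to take into account that creating the lookup takes time as well: about 783 ms here, which is more than a single scan with either of the other solutions
- If you have a stable dataset and you want to find values frequently, the overhead of creating the lookup will pay off: with the timings above, from the second query on compared with the plain loop, and after about nine queries compared with the pandas filter
- If the underlying data changes a lot, we will also have to maintain the lookup, which will be costly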
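
The maintenance cost from the last point can be made concrete with a small sketch. The `IndexedList` class and its `find` and `update` methods are hypothetical names invented for this illustration, and it uses a smaller dataset from the standard `random` module rather than the array above.

```python
from collections import defaultdict
import random


class IndexedList:
    """A list of numbers plus a value -> positions lookup kept in sync."""

    def __init__(self, values):
        self.values = list(values)
        self.lookup = defaultdict(set)
        # One-off cost: a full pass over the data to build the lookup.
        for position, value in enumerate(self.values):
            self.lookup[value].add(position)

    def find(self, value):
        """Return the sorted positions of `value` without scanning the list."""
        return sorted(self.lookup.get(value, ()))

    def update(self, position, new_value):
        """Change one element and pay the extra cost of fixing the lookup."""
        old_value = self.values[position]
        self.lookup[old_value].discard(position)
        if not self.lookup[old_value]:
            del self.lookup[old_value]
        self.values[position] = new_value
        self.lookup[new_value].add(position)


random.seed(0)
data = IndexedList(random.randint(1, 10000) for _ in range(100_000))

data.update(0, 77)   # every write now touches the list and the lookup
print(0 in data.find(77))
data.update(0, 78)
print(0 in data.find(77))
```

Every call to `update` does work that a plain list assignment would not need, which is the price paid for the fast `find`.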

#### Indexes can be very helpful if you query a lot of data with WHERE clauses
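
A database index plays the same role as the lookup dictionary. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table `numbers`, the index `idx_numbers_value`, and the helper `query_plan` are assumptions made up for this example. The exact wording of the query plan depends on the SQLite version, so no output is shown.

```python
import random
import sqlite3

random.seed(0)
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE numbers (id INTEGER PRIMARY KEY, value INTEGER)")
connection.executemany(
    "INSERT INTO numbers (value) VALUES (?)",
    ((random.randint(1, 10000),) for _ in range(100_000)),
)

query = "SELECT id FROM numbers WHERE value = 77"


def query_plan(sql):
    """Return SQLite's description of how it would execute `sql`."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(row[-1] for row in rows)


# Without an index, SQLite has to scan the whole table.
plan_without_index = query_plan(query)
ids_without_index = sorted(row[0] for row in connection.execute(query))

# The index is built once, like number_lookup, and maintained on every write.
connection.execute("CREATE INDEX idx_numbers_value ON numbers (value)")
plan_with_index = query_plan(query)
ids_with_index = sorted(row[0] for row in connection.execute(query))

print(plan_without_index)
print(plan_with_index)
```

The query returns the same rows in both cases. Only the way the database finds them changes.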