# Performance Benchmarks

This notebook benchmarks the performance-related choices made in this repository.

1. pandas: `.at` vs `.iloc` accessors.
2. spaCy: `nlp.pipe` vs `nlp`.


In [1]:
import os

os.getpid()

1952

## Pandas accessor performance

In [2]:
import pandas as pd

df = pd.read_csv("~/Downloads/Geolocated_places_climate_with_LGA_and_remoteness_with_text.csv")

In [3]:
%time for i in range(max(len(df), 100)): df['text'].iloc[i]  # middle of the three

CPU times: user 164 ms, sys: 1.78 ms, total: 166 ms
Wall time: 166 ms


In [4]:
%time for i in range(max(len(df), 100)): df.iloc[i]['text']  # slowest

CPU times: user 1.37 s, sys: 7.03 ms, total: 1.38 s
Wall time: 1.38 s


In [5]:
%time for i in range(max(len(df), 100)): df.at[i, 'text']  # fastest

CPU times: user 80.7 ms, sys: 1.38 ms, total: 82.1 ms
Wall time: 81.4 ms
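
The three accessors are interchangeable here only because the DataFrame has a default `RangeIndex`: `.at` looks up by index label, while `.iloc` looks up by position. A minimal sketch on a made-up frame (the data and the name `toy` are illustrative, not taken from the dataset above) shows the difference once labels and positions diverge:

```python
import pandas as pd

# Toy frame with a non-default index so labels and positions differ.
toy = pd.DataFrame({"text": ["a", "b", "c"], "n": [1, 2, 3]}, index=[10, 20, 30])

by_series_iloc = toy["text"].iloc[1]  # select the column, then position 1
by_row_iloc = toy.iloc[1]["text"]     # build the whole row as a Series, then pick the column
by_label = toy.at[20, "text"]         # scalar lookup by index label
by_position = toy.iat[1, toy.columns.get_loc("text")]  # scalar lookup by position

print(by_series_iloc, by_row_iloc, by_label, by_position)
```

`df.iloc[i]` has to materialise a full row before the column is selected, which is a plausible reason for it being the slowest of the three.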


## nlp.pipe vs cached spaCy docs

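This compares running `nlp.pipe` on the fly against reading pre-computed (cached) spaCy docs from the DataFrame. Sample sizes considered:

- n = 10
- n = 100
- n = 1,000
- n = 10,000

The recorded runs below use n = 50,000.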
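
A comparison across several values of n can be organised as a small timing loop. The sketch below is only an outline of that idea: `expensive` is a hypothetical stand-in for an expensive per-text step such as `nlp(text)`, and `texts` is made-up data, so the printed timings say nothing about spaCy itself.

```python
from time import perf_counter


def expensive(text: str) -> list:
    """Hypothetical stand-in for an expensive per-text step such as nlp(text)."""
    return text.lower().split()


texts = [f"Sample text number {i}" for i in range(10_000)]
cache = [expensive(t) for t in texts]  # pre-computed once, like df['doc']

timings = {}
for n in (10, 100, 1_000, 10_000):
    start = perf_counter()
    recomputed = [expensive(t) for t in texts[:n]]
    recompute_s = perf_counter() - start

    start = perf_counter()
    cached = [cache[i] for i in range(n)]
    cached_s = perf_counter() - start

    assert recomputed == cached  # both paths must give the same result
    timings[n] = (recompute_s, cached_s)
    print(f"n = {n:>6}: recompute {recompute_s:.6f}s, cached {cached_s:.6f}s")
```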

In [6]:
import pandas as pd
import spacy

nlp = spacy.load('en_core_web_sm')


def perform_operation_on(doc: spacy.tokens.Doc):
    pass


# n: int = 10_000
n: int = 50_000
df = pd.read_csv("~/Downloads/Geolocated_places_climate_with_LGA_and_remoteness_with_text.csv", nrows=n)
f"n = {n}"

'n = 50000'

In [7]:
from time import time
s = time()
print("Caching spacy docs...", end='')
df['doc'] = list(nlp.pipe(df.loc[:, 'text'], n_process=-1))
print(f"Done. {time() - s}s elapsed.")

Caching spacy docs...Done. 21.758426666259766s elapsed.


In [8]:
%%timeit -r 3 -n 1
for doc in nlp.pipe(df.loc[:n, 'text']):
    perform_operation_on(doc)

49.7 s ± 1.32 s per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [9]:
%%timeit -r 3 -n 1
for i in range(n):
    perform_operation_on(df.at[i, 'doc'])

81.6 ms ± 801 µs per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [10]:
%%timeit -r 3 -n 1
df.loc[:, 'doc'].apply(perform_operation_on)

4.54 ms ± 431 µs per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [11]:
from tqdm import tqdm

In [12]:
%%timeit -r 3 -n 1
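# Note: nlp.pipe_names is already a list, so [nlp.pipe_names] is a nested list and
# no component appears to be disabled (the timing matches In [8]). The later cells
# that pass disable=[nlp.pipe_names] are affected in the same way.
for doc in nlp.pipe(df.loc[:n, 'text'], disable=[nlp.pipe_names]):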
    perform_operation_on(doc)

50.7 s ± 575 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
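
The timing above is essentially the same as the run with nothing disabled. One plausible explanation, assuming component names are filtered with a plain membership test against `disable`: `nlp.pipe_names` is already a list, so `[nlp.pipe_names]` is a list containing one list, and no component name is equal to it. The sketch uses hard-coded names in place of a real pipeline, and `active_components` is a hypothetical helper, not a spaCy function.

```python
# Hard-coded stand-in for nlp.pipe_names (the names used later in this notebook).
pipe_names = ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]


def active_components(names, disable):
    """Hypothetical filter: keep every component whose name is not in `disable`."""
    return [name for name in names if name not in disable]


nested = active_components(pipe_names, disable=[pipe_names])  # list inside a list
flat = active_components(pipe_names, disable=pipe_names)      # flat list of names

print(nested)  # all six names survive: nothing was disabled
print(flat)    # empty list: everything was disabled
```

Under that assumption, passing the names as a flat list (as the cells with explicit component names do) is what actually skips the components.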


In [13]:
%%timeit -r 3 -n 1
for doc in nlp.pipe(df.loc[:n, 'text'], n_process=-1):
    perform_operation_on(doc)

26.9 s ± 32.4 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [14]:
%%timeit -r 3 -n 1
for doc in nlp.pipe(df.loc[:n, 'text'], disable=[nlp.pipe_names], n_process=-1):
    perform_operation_on(doc)

27.1 s ± 167 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [15]:
def stream(df: pd.DataFrame, col: str, n: int):
    for i in range(min(n, len(df))):
        yield df.at[i, col]

for i in stream(df, 'text', n):
    print(i); break

<TWEET> "Merry Crisis", "You cannot eat money", "Coal bludger", just some of the messages on show at the Solidarity Sit-Down outside Parliament House. The activists protesting over climate change inaction, they say is contributing to catastrophic bushfire conditions. @9NewsAdel <https://t.co/6qaEHbIXy5> </TWEET>



In [16]:
%%timeit -r 3 -n 1
for doc in nlp.pipe(stream(df, 'text', n), disable=[nlp.pipe_names], n_process=-1):
    perform_operation_on(doc)

25.5 s ± 555 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [17]:
%%timeit -r 3 -n 1
for doc in nlp.pipe(stream(df, 'text', n), disable=[nlp.pipe_names], n_process=4):
    perform_operation_on(doc)

24 s ± 100 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [18]:
%%timeit -r 3 -n 1
for doc in nlp.pipe(stream(df, 'text', n), disable=[nlp.pipe_names], n_process=4, batch_size=200):
    perform_operation_on(doc)

23.7 s ± 70.6 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [19]:
%%timeit -r 3 -n 1
for doc in nlp.pipe(stream(df, 'text', n), disable=[nlp.pipe_names], n_process=4, batch_size=200):
    continue

24 s ± 426 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [20]:
nlp_disabled = spacy.load('en_core_web_sm', disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'])
nlp_excluded = spacy.load('en_core_web_sm', exclude=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'])

In [21]:
%%timeit -r 3 -n 1
for doc in nlp_disabled.pipe(stream(df, 'text', n), n_process=4):
    continue

12.9 s ± 188 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [22]:
%%timeit -r 3 -n 1
for doc in nlp_excluded.pipe(stream(df, 'text', n), n_process=4):
    continue

12.4 s ± 160 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [23]:
%%timeit -r 3 -n 1
for doc in nlp.pipe(stream(df, 'text', n), disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'], n_process=4):
    continue

12.8 s ± 87.6 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [24]:
%%timeit -r 3 -n 1
for doc in nlp.pipe(stream(df, 'text', n), disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']):
    continue

1.48 s ± 14.3 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [25]:
%%timeit -r 3 -n 1
for doc in nlp.pipe(df.loc[:n, 'text'], disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']): continue

1.35 s ± 14.6 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [26]:
%%timeit -r 3 -n 1
for doc in nlp.pipe(df.loc[:n, 'text']): continue

52.7 s ± 392 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [27]:
%%timeit -r 3 -n 1
for doc in nlp.pipe(df.loc[:n, 'text'], n_process=4): continue

24.4 s ± 419 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [28]:
len(df)

50000

In [29]:
df.memory_usage(deep=True)['text']/1_000_000

16.229894

## pyarrow string dtype vs numpy objects

In [30]:
df['text'].memory_usage(deep=True)

16230022

In [31]:
df['text'].astype(pd.StringDtype(storage='pyarrow')).memory_usage(deep=True)

9122782

In [32]:
df['doc'].memory_usage(deep=True)

8800128

## Count Uniques

In [40]:
# Python implementation to count words in a trie

# Alphabet size (number of symbols)
ALPHABET_SIZE = 26

# Trie node
class TrieNode:

    def __init__(self):
        # isLeaf is true if the node represents
        # end of a word
        self.isLeaf = False
        self.children = [None for i in range(ALPHABET_SIZE)]


root = TrieNode()

# If not present, inserts key into trie
# If the key is prefix of trie node, just
# marks leaf node
def insert(key):

    length = len(key)

    pCrawl = root

    for level in range(length):

        index = ord(key[level]) - ord('a')
        if pCrawl.children[index] is None:
            pCrawl.children[index] = TrieNode()

        pCrawl = pCrawl.children[index]

    # mark last node as leaf
    pCrawl.isLeaf = True


# Function to count number of words
def wordCount(root):

    result = 0

    # Leaf denotes end of a word
    if root.isLeaf:
        result += 1

    for i in range(ALPHABET_SIZE):
        if root.children[i] is not None:
            result += wordCount(root.children[i])

    return result

# Driver Program

# Input keys (use only 'a' through 'z'
# and lower case)
keys = ["the", "a", "there", "answer", "any", "by", "bye", "their"]

root = TrieNode()

# Construct Trie
for i in range(len(keys)):
    insert(keys[i])

print(wordCount(root))

8
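
The array-based trie above only accepts lowercase `a` to `z`. As a sketch of the same counting idea without that restriction, the nodes can be plain dictionaries keyed by character. The names `END`, `insert_word`, and `count_words` are introduced for this example only.

```python
END = object()  # sentinel key marking the end of a word


def insert_word(trie: dict, word: str) -> None:
    """Walk or create one nested dict per character, then mark the word end."""
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True


def count_words(trie: dict) -> int:
    """Count end markers with an explicit stack instead of recursion."""
    total, stack = 0, [trie]
    while stack:
        node = stack.pop()
        for key, child in node.items():
            if key is END:
                total += 1
            else:
                stack.append(child)
    return total


trie: dict = {}
for word in ["the", "a", "there", "answer", "any", "by", "bye", "their", "the", "naïve"]:
    insert_word(trie, word)

print(count_words(trie))  # 9 unique words: the duplicate "the" is counted once
```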


In [None]:
from tqdm import tqdm

from spacy.matcher import Matcher

is_alpha = Matcher(nlp.vocab)
is_alpha.add("is_alpha", patterns=[
    [{"IS_ALPHA": True, "IS_ASCII": True}]
])


In [60]:
%load_ext memory_profiler

In [61]:
%%memit
root = TrieNode()
for doc in tqdm(df.loc[:, 'doc']):
    _is_alpha_doc = is_alpha(doc)
    for _, start, end in _is_alpha_doc:
        x = doc[start:end].text.lower()
        insert(x)

wordCount(root)

100%|██████████| 50000/50000 [00:04<00:00, 12087.52it/s]


peak memory: 3037.06 MiB, increment: 0.00 MiB


In [63]:
import sys
sys.getsizeof(root)  # shallow size only: excludes the child nodes

48

In [62]:
%%memit
uniqs = set()
for doc in tqdm(df.loc[:, 'doc']):
    _is_alpha_doc = is_alpha(doc)
    for _, start, end in _is_alpha_doc:
        x = doc[start:end].text.lower()
        uniqs.add(x)

len(uniqs)

100%|██████████| 50000/50000 [00:02<00:00, 18791.71it/s]

peak memory: 3015.31 MiB, increment: 0.05 MiB





In [65]:
sys.getsizeof(uniqs)/1_000_000  # MB, shallow size only: the hash table, not the strings

2.097368
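
`sys.getsizeof` reports only the shallow size of an object, so the 48 bytes for `root` exclude every child node, and the figure for `uniqs` covers the set's hash table but not the strings it holds. A rough way to compare the two structures is to sum `sys.getsizeof` over everything reachable from them. `deep_sizeof` below is a hypothetical helper written for this sketch, and the small `Node` class stands in for `TrieNode`; the results are estimates and vary by Python version.

```python
import sys


def deep_sizeof(obj, seen=None) -> int:
    """Approximate total bytes reachable from obj, counting each object once."""
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    elif hasattr(obj, "__dict__"):
        size += deep_sizeof(vars(obj), seen)
    return size


class Node:
    """Minimal stand-in for TrieNode: 26 child slots plus a leaf flag."""

    def __init__(self):
        self.isLeaf = False
        self.children = [None] * 26


root = Node()
node = root
for ch in "there":
    index = ord(ch) - ord("a")
    node.children[index] = Node()
    node = node.children[index]
node.isLeaf = True

words = {"there"}
print(sys.getsizeof(root), deep_sizeof(root))    # shallow vs deep size of the trie
print(sys.getsizeof(words), deep_sizeof(words))  # shallow vs deep size of the set
```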

In [33]:
df.head()

Unnamed: 0,year,month,day,lat_mid,lon_mid,screen_name,tweet_id,retweet,text,geometry,tweet_lga,lga_code_2020,lga_name_2020,state_code_2016,state_name_2016,remoteness,remote_level,doc
0,2019,11,29,-34.92162,138.598244,G_Westgarth,1.2e+18,False,"<TWEET> ""Merry Crisis"", ""You cannot eat money""...","c(138.598244, -34.92162)",Adelaide (C),40070.0,Adelaide (C),4.0,South Australia,Major Cities of Australia,1.0,"(<, TWEET, >, "", Merry, Crisis, "", ,, "", You, ..."
1,2019,12,30,-34.92877,138.599702,adelparklands,1.21e+18,False,<TWEET> #adelaideparklands #picoftheday \nThe ...,"c(138.599702, -34.92877)",Adelaide (C),40070.0,Adelaide (C),4.0,South Australia,Major Cities of Australia,1.0,"(<, TWEET, >, #, adelaideparklands, , #, pico..."
2,2020,1,29,-34.925639,138.600768,timklapdor,1.22e+18,False,<TWEET> Same academics who would have their su...,"c(138.6007685, -34.925639)",Adelaide (C),40070.0,Adelaide (C),4.0,South Australia,Major Cities of Australia,1.0,"(<, TWEET, >, Same, academics, who, would, hav..."
3,2020,1,30,-34.925639,138.600768,timklapdor,1.22e+18,False,<TWEET> Care to explain @UniSuperNews? You're ...,"c(138.6007685, -34.925639)",Adelaide (C),40070.0,Adelaide (C),4.0,South Australia,Major Cities of Australia,1.0,"(<, TWEET, >, Care, to, explain, @UniSuperNews..."
4,2020,2,19,-34.925639,138.600768,timklapdor,1.23e+18,False,<TWEET> FYI: the time for change is now. With ...,"c(138.6007685, -34.925639)",Adelaide (C),40070.0,Adelaide (C),4.0,South Australia,Major Cities of Australia,1.0,"(<, TWEET, >, FYI, :, the, time, for, change, ..."


In [34]:
import os
os.getpid()

1952

In [8]:
hashtags = ("#tag1", "#tag2", "#tag3", "#tag4")
import re
pattern = re.compile(r'#tag4')

In [9]:
%timeit '#tag4' in hashtags

52.7 ns ± 0.578 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


In [10]:
%timeit [pattern.match(ht) for ht in hashtags]

536 ns ± 2.43 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
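
The two cells above do slightly different work: the `in` test returns a single boolean, while the list comprehension builds a list of match objects, and `match` only anchors at the start of each string. A like-for-like sketch with the standard `timeit` module follows, with a `frozenset` added for comparison. The names `tag_set`, `by_tuple`, `by_set`, and `by_regex` are introduced here, and absolute timings will differ by machine.

```python
import re
from timeit import timeit

hashtags = ("#tag1", "#tag2", "#tag3", "#tag4")
tag_set = frozenset(hashtags)
pattern = re.compile(r"#tag4")


def by_tuple() -> bool:
    return "#tag4" in hashtags  # linear scan with equality checks


def by_set() -> bool:
    return "#tag4" in tag_set  # hash lookup


def by_regex() -> bool:
    # fullmatch requires the whole tag to match the pattern, like == does
    return any(pattern.fullmatch(ht) for ht in hashtags)


for fn in (by_tuple, by_set, by_regex):
    seconds = timeit(fn, number=100_000)
    print(f"{fn.__name__}: {seconds / 100_000 * 1e9:.1f} ns per call")
```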


## Benchmarking equals vs in

In [17]:
import pandas as pd
ht_series: pd.Series = pd.Series(hashtags, name='hashtags')

In [18]:
%timeit ht_series.apply(lambda x: x == '#tag1') | ht_series.apply(lambda x: x == '#tag2')

93.9 µs ± 915 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [19]:
%timeit ht_series.apply(lambda x: x in ('#tag1', '#tag2'))

33.3 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [20]:
%timeit ht_series.apply(lambda x: x == '#tag1')

33.5 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [21]:
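# Note: ('#tag1') is a plain string, not a 1-tuple, so this is a substring test;
# a 1-tuple would be written ('#tag1',).
%timeit ht_series.apply(lambda x: x in ('#tag1'))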

32 µs ± 92.2 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


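Conclusion:
For OR conditions, a single `apply` with `in` is roughly 3x faster than OR-ing two separate `==` applies (33.3 µs vs 93.9 µs), most likely because each extra `apply` pass over the Series adds overhead.
For a single value, `==` and `in` take about the same time (33.5 µs vs 32 µs).
`in` should be used for OR operations.

The same may hold for AND operations, but that is a guess; it was not benchmarked here.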
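
For OR conditions over a Series, pandas also provides `Series.isin`, which performs the membership test without a Python-level lambda per element. It was not benchmarked in this notebook, so the sketch below only checks that it gives the same mask as the `apply` versions; whether it is faster on such a small Series is an open question.

```python
import pandas as pd

ht_series = pd.Series(("#tag1", "#tag2", "#tag3", "#tag4"), name="hashtags")

mask_or = ht_series.apply(lambda x: x == "#tag1") | ht_series.apply(lambda x: x == "#tag2")
mask_in = ht_series.apply(lambda x: x in ("#tag1", "#tag2"))
mask_isin = ht_series.isin(["#tag1", "#tag2"])

# All three approaches select the same rows.
print(mask_isin.tolist())  # [True, True, False, False]
```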