# Notebook 5

# February 26 2024

# Count-min sketch

We are in the streaming model: data items pass by one at a time, and each data item $x$ has an ID. Suppose we want to know, given an ID, the number of times data with that ID has appeared in the stream; call this $c_x$. To answer this exactly, we would need to store a table of all IDs we have seen, along with the number of times we have seen each. Suppose instead we are willing to accept an approximate answer in return for a very small structure. That is where the count-min sketch is of use.

Given values of $\epsilon$ and $\delta$, the count-min sketch computes, for any item $x$, an approximate frequency $\hat{c}_x$ such that $c_x \leq \hat{c}_x$ always, and $\hat{c}_x \leq c_x + \epsilon n$ with probability at least $1-\delta$, where $n$ is the number of items seen so far. Thus the answer given may overestimate the frequency, but it will never underestimate it.


The count-min sketch is a 2-dimensional array $A$ of counters with width $w=2/\epsilon$ and height $h=\log_2 \frac{1}{\delta}$. This is the entire structure, so its size is $\frac{2}{\epsilon}\log_2 \frac{1}{\delta}$ counters. We assume we have $h$ independent hash functions $hash_i$, $i \in [h]$, each mapping IDs to the $w$ buckets of a row. The two operations are increment and query:

- $Increment(x)$: For all $i \in [h]$, increment $A[i][hash_i(x)]$
- $Query(x)$: Return $\min_{i \in [h]} A[i][hash_i(x)]$

Now, let's prove that it works as claimed.

Let $X_{x,i}$ be $A[i][hash_i(x)] - c_x$, that is, the number of excess items in the $i$th row for item $x$, those beyond $c_x$.

We know $E[X_{x,i}] \leq n/w$, as at most $n$ items that are not $x$ have been added, and each has a probability of $1/w$ of incrementing the same bucket as $x$. Thus by Markov's inequality $Pr[X_{x,i} \geq \overbrace{2n/w}^{n\epsilon}] \leq \frac{1}{2}$. (Note that the collisions within a row are *not* independent events, but Markov's inequality only needs the expectation, so that is fine.)
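
The expectation bound and the Markov step can be spelled out as follows. This is a sketch under the assumption (not stated explicitly above) that each $hash_i$ sends any two distinct IDs to the same bucket with probability $1/w$; it also uses the fact that $X_{x,i} \geq 0$, since counters are only ever incremented:

$$ E[X_{x,i}] = \sum_{y \neq x} c_y \cdot Pr[hash_i(y) = hash_i(x)] = \frac{1}{w}\sum_{y \neq x} c_y \leq \frac{n}{w},
\qquad
Pr\left[X_{x,i} \geq \frac{2n}{w}\right] \leq \frac{E[X_{x,i}]}{2n/w} \leq \frac{1}{2}
$$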

This is the chance that a single row's value is above the desired error threshold of $n\epsilon$. But as we use the min, failure occurs only when *all* $h$ locations are above the desired error threshold. Each row uses its own independent hash function, so we can just multiply the probabilities, and this happens with probability at most $\delta$:

$$ Pr\left[ \bigwedge_{i=1}^h [X_{x,i} \geq n\epsilon] \right] \leq \frac{1}{2^h} = \delta
$$

Thus, with $\epsilon = \delta = 0.01$, you can get a value accurate to within an additive error of $1\%$ of $n$, with $1\%$ failure probability, using an array of $200 \times 7 = 1400$ counters.
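
The sizing arithmetic can be checked with a few lines of Python. This is a minimal sketch: the helper name `sketch_dimensions` is hypothetical, and rounding both dimensions up to whole numbers is an assumption that matches what the class later in this notebook does.

```python
import math

def sketch_dimensions(epsilon, delta):
    """Return (width, height, total counters) for the given guarantees.

    Hypothetical helper for illustration; both dimensions are rounded up.
    """
    width = math.ceil(2 / epsilon)              # w = 2/epsilon
    height = math.ceil(math.log(1 / delta, 2))  # h = log2(1/delta)
    return width, height, width * height

print(sketch_dimensions(0.01, 0.01))  # (200, 7, 1400)
```

Note that the size depends only on $\epsilon$ and $\delta$, not on the length of the stream or the number of distinct IDs.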







Here is the code in a class. Most of the code is for printing things out in a nice way. The four lines of code in the bodies of the `add` and `count` functions are the meat of the count-min sketch.

In [10]:
import math

class countMin:
    def __init__(self,epsilon,delta,verbose=False):
        self._w=math.ceil(2/epsilon)
        self._h=math.ceil(math.log(1/delta,2))
        self._epsilon=epsilon
        self._delta=delta
        if verbose:
            print("Initializing with width=",self._w,
                  " and height=",self._h,
                  " total size=",self._w*self._h,
                  "for epsilon=", epsilon,
                  "delta=",delta)
        self._A=[ [0]*self._w for i in range(self._h)]
        self._n=0
    def __repr__(self):
        return "".join( (repr(self._A[i])+"\n" for i in range(self._h) ) )
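    def add(self,x,amount=1):
        # Increment one counter per row; hashing the pair (h,x) gives each row its own hash function
        for h in range(self._h):
            self._A[h][hash((h,x))%self._w]+=amount
        self._n+=amount
    def count(self,x):
        # The smallest of the h counters is the estimate; it can overcount but never undercounts
        return min (self._A[h][hash((h,x))%self._w] for h in range(self._h))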
    def __len__(self):
        return self._n
    def printCount(self,x):
        print("Count of \""+x+
             "\" is between ",max(0,self.count(x)-int(self._n*self._epsilon)),
             "and ",self.count(x),
              "with probability at least",1-self._delta)

In [34]:
# A simple demo   
CM=countMin(0.1,0.01,True)
CM.add("Tiger")
CM.add("Fish")
CM.add("Bird")

print(repr(CM))
CM.printCount("Tiger")        

Initializing with width= 20  and height= 7  total size= 140 for epsilon= 0.1 delta= 0.01
[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0]
[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

Count of "Tiger" is between  1 and  1 with probability at least 0.99
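
One caveat about the table printed above: Python salts the built-in `hash()` of strings per process (unless `PYTHONHASHSEED` is set), so the positions of the counters will differ from run to run. The following is a minimal sketch of a reproducible alternative, not the method used in the class above. The names `stable_bucket` and `StableCountMin` are hypothetical, and using SHA-256 as the row hash is an assumption made for illustration.

```python
import hashlib
from collections import Counter

def stable_bucket(row, x, width):
    """Map (row, x) to a bucket in [0, width), identically in every process."""
    digest = hashlib.sha256(f"{row}:{x}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % width

class StableCountMin:
    """Hypothetical count-min variant with a reproducible hash."""
    def __init__(self, width, height):
        self.width, self.height = width, height
        self.table = [[0] * width for _ in range(height)]

    def add(self, x, amount=1):
        # One increment per row, each row using its own bucket for x
        for row in range(self.height):
            self.table[row][stable_bucket(row, x, self.width)] += amount

    def count(self, x):
        # Minimum over the rows, as in the query operation
        return min(self.table[row][stable_bucket(row, x, self.width)]
                   for row in range(self.height))

stream = ["tiger", "fish", "bird", "tiger", "fish", "tiger"]
sketch = StableCountMin(width=20, height=7)
exact = Counter()
for item in stream:
    sketch.add(item)
    exact[item] += 1

# The estimate can only overshoot the true count, never undershoot it.
for item, true_count in exact.items():
    assert sketch.count(item) >= true_count
```

A cryptographic hash is slower than the built-in one; it is used here only because it is in the standard library and gives the same buckets on every run.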


Here is a longer example where I load the collected works of Shakespeare, break it into a list of words, and then put the words into a countMin sketch and use it to estimate their frequency.

In [35]:
# Download http://www.gutenberg.org/files/100/100-0.txt and save it as data/shakespeare.txt
with open("data/shakespeare.txt","r",encoding="utf-8") as f:
    shakespere=f.read()
print(shakespere[5000:6000]) # Test to make sure the file is as expected

ven thee to give?
Profitless usurer why dost thou use
So great a sum of sums yet canst not live?
For having traffic with thy self alone,
Thou of thy self thy sweet self dost deceive,
Then how when nature calls thee to be gone,
What acceptable audit canst thou leave?
  Thy unused beauty must be tombed with thee,
  Which used lives th’ executor to be.


                    5

Those hours that with gentle work did frame
The lovely gaze where every eye doth dwell
Will play the tyrants to the very same,
And that unfair which fairly doth excel:
For never-resting time leads summer on
To hideous winter and confounds him there,
Sap checked with frost and lusty leaves quite gone,
Beauty o’er-snowed and bareness every where:
Then were not summer’s distillation left
A liquid prisoner pent in walls of glass,
Beauty’s effect with beauty were bereft,
Nor it nor no remembrance what it was.
  But flowers distilled though they with winter meet,
  Leese but their show, their substance still lives sweet.


In [36]:
import re
words=re.findall(r"[\w']+", shakespere) #This breaks it into words
print(words[:100]) # Test showing the first 100 words of the document


['The', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Complete', 'Works', 'of', 'William', 'Shakespeare', 'by', 'William', 'Shakespeare', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', 'You', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 're', 'use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www', 'gutenberg', 'org', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'eBook', 'Title']


In [40]:
# Now, let's make a countMin sketch and add our words (in lower case)

CM=countMin(epsilon=0.001,delta=0.01,verbose=True)
for word in words:
    CM.add(word.lower())
print("Words added: ",len(CM))

Initializing with width= 2000  and height= 7  total size= 14000 for epsilon= 0.001 delta= 0.01
Words added:  983333


Note our original data set had 5.8 million characters, and we have reduced this to a table of 14,000 counters.

In [41]:
CM.printCount("the")
CM.printCount("thy")
CM.printCount("King") # There are none as we converted to lower case
CM.printCount("king")

Count of "the" is between  29281 and  30264 with probability at least 0.99
Count of "thy" is between  3430 and  4413 with probability at least 0.99
Count of "King" is between  0 and  105 with probability at least 0.99
Count of "king" is between  2163 and  3146 with probability at least 0.99
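
The width of each interval above is the additive slack $\epsilon n$ from the guarantee, truncated to an integer. As a quick check, this snippet recomputes the lower bounds from the upper bounds printed above; the numbers are copied from that run, and the variable names are only for illustration.

```python
n = 983333         # words added, from the run above
epsilon = 0.001    # value passed to countMin
slack = int(n * epsilon)  # largest overestimate the guarantee allows

# Upper bounds (the sketch's estimates) as printed in the run above
upper = {"the": 30264, "thy": 4413, "King": 105, "king": 3146}

# Lower bound is the estimate minus the slack, clamped at zero
lower = {word: max(0, est - slack) for word, est in upper.items()}
print(slack, lower)
```

The estimate of 105 for `King`, a word that was never added, is pure collision noise, and it is well inside the allowed slack of 983.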


# Homework

With the countMin sketch, given an item we can estimate its frequency. In the Shakespeare data set, we saw that `the` was fairly frequent, but we had to know to ask about `the`. What if we wanted to know all the words with frequency above a certain threshold, such as more than $2\epsilon n$, without having to guess what they were? Show how to modify the class above to add a method `mostFrequent()` that reports the words with frequency above $2\epsilon n$. Most importantly, you cannot store all the words! Since the list you return has size at most $1/(2\epsilon)$, you should store at most roughly that many words.