# Lecture 4

# March 6, 2023


# Streaming

A streaming algorithm is one that looks at the data once, one element at a time. Instead of storing the data, it keeps only a very small amount of state, just enough to compute what it wants.

Streaming algorithms are useful not only in their original context, where data goes by so quickly that you cannot store it, but also in the big data context, where a data set is so big that you don't want to look at it more than once and copying or moving it is out of the question.

## Streaming and iterators

In Python (and other similar languages) the construct `for x in X` works like a stream, where the data is not stored and you can look at each item in `X` once. How do you create such a stream? One way is to use what Python calls a generator, where each element of the stream is produced with a `yield` statement. Execution then alternates between the `for` loop and the generator, and nothing is stored.

In [2]:
import random
import math

In [3]:
def someRandom(maxcount,maxnum,debug=False):
    for i in range(random.randint(1,maxcount)):
        if debug:
            print("SomeRandom")
        yield(random.randint(0,maxnum))
        
for x in someRandom(10,100,True):
    print(x)

SomeRandom
74


In [4]:
# Good code to compute the average
# See how execution alternates back and forth from the generator
# And nothing is stored
l=0
s=0
for x in someRandom(10,1000,debug=True):
    print("Average")
    s+=x
    l+=1
print(s/l)

SomeRandom
Average
SomeRandom
Average
SomeRandom
Average
SomeRandom
Average
SomeRandom
Average
SomeRandom
Average
SomeRandom
Average
572.0


In [5]:
# Bad code to compute the average
# All values are saved in A
A=[x for x in someRandom(10,1000,debug=True)]
print("A created")
print(sum(A)/len(A))


SomeRandom
SomeRandom
SomeRandom
SomeRandom
SomeRandom
SomeRandom
SomeRandom
A created
782.2857142857143


# Median streaming

In order to compute the median in the streaming model, you can use exactly the same sampling method we discussed previously. The only question is how to get a uniform sample in the streaming model. The answer: given a stream of data $x_1, \ldots, x_n$, at step $i$ replace the current sample with $x_i$ with probability $\frac{1}{i}$. Thus the chance that $x_i$ is the final sample is the chance that $x_i$ is picked at step $i$ times the chance that none of the $x_j$ with $j>i$ are picked, each of which is passed over with probability $\frac{j-1}{j}$:

$$
\displaystyle \frac{1}{i} \prod_{j={i+1}}^n \frac{j-1}{j}
$$

This telescopes to $\frac{1}{n}$:

$$
\displaystyle  \frac{\prod_{j={i+1}}^n (j-1)}{i\prod_{j={i+1}}^n j}
=
\displaystyle  \frac{\prod_{j={i}}^{n-1} j}{n\prod_{j={i}}^{n-1} j}
=\frac{1}{n}$$
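The replacement rule above can be sketched in a few lines. This is only a minimal illustration, not code from the lecture: the name `streaming_sample` and its `rng` parameter are assumptions made here, and the loop at the end is just an empirical check that each of 5 items is chosen about 20% of the time.

```python
import random

def streaming_sample(stream, rng=random):
    """Return one item from `stream`, keeping only the current sample in memory."""
    sample = None
    for i, x in enumerate(stream, start=1):
        # Replace the current sample with x_i with probability 1/i.
        # For i == 1 this always succeeds, because rng.random() < 1.0.
        if rng.random() < 1.0 / i:
            sample = x
    return sample

# Empirical check: each of the 5 items should be chosen about 20% of the time.
rng = random.Random(0)
trials = 20000
counts = [0] * 5
for _ in range(trials):
    counts[streaming_sample(range(5), rng)] += 1
print([c / trials for c in counts])
```

Note that this straightforward version draws one random number per stream item.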







# Distinct elements: Theory

Notation:

- Let $d$ be the number of distinct elements. Name them $x_1,\ldots x_d$.
- Let $z(x_i)$ be the number of trailing zeros in the binary representation of $h(x_i)$, where $h$ is a hash function whose outputs we treat as random binary strings
- Let $z = \max_i z(x_i)$. This is the maximum number of zeros
- Let $\hat{d} = 2^{z+\frac{1}{2}}$. This is the output approximation for $d$.

What we want to prove is that $\hat{d}$ is usually close to $d$. Specifically, we will bound the probability that they are within a factor of 3 of each other.

Let us proceed with the upper bound. We want to ask: what is the chance that $\hat{d}\geq 3d$? As usual we want to break things down into indicator random variables so that we can use linearity of expectation.

We define the indicator random variable $z_j(x_i)$, which is 1 if $z(x_i)\geq j$ and 0 otherwise. Since the chance that a random binary string ends with at least $j$ 0's is $\frac{1}{2^j}$, $E[z_j(x_i)] = \frac{1}{2^j}$. We let $Y_j = \sum_{i=1}^d z_j(x_i)$. Thus $Y_j$ is the number of distinct items in the stream whose zero counts are at least $j$. Thus if $z$, the maximum number of zeros, is $j$, we know $Y_j \geq 1$ and $Y_{j+1}=0$. Thus we can relate the maximum to a sum, which we know how to deal with.
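The claim that a random binary string ends with at least $j$ zeros with probability $\frac{1}{2^j}$ can be checked empirically. The following is a rough sketch under the assumption that hash values behave like uniformly random 32-bit strings; the helper names `count_trailing_zeros` and `fraction_with_at_least` are made up for this illustration.

```python
import random

def count_trailing_zeros(x):
    """Number of trailing zero bits of a positive integer x."""
    count = 0
    while x % 2 == 0:
        x //= 2
        count += 1
    return count

def fraction_with_at_least(j, values):
    """Fraction of values whose binary representation ends in at least j zeros."""
    return sum(1 for v in values if count_trailing_zeros(v) >= j) / len(values)

rng = random.Random(1)
BITS = 32
# Setting bit number BITS keeps every value positive without touching the low BITS bits.
values = [rng.getrandbits(BITS) | (1 << BITS) for _ in range(200000)]
for j in range(6):
    # Observed fraction next to the predicted value 1/2^j
    print(j, fraction_with_at_least(j, values), 1 / 2**j)
```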

Let's look at the chance that $\hat{d}$ is too big by a factor of three. All logarithms here are base 2.

$$
\begin{align*}
&P[\hat{d} \geq 3d] \\
&= P[ 2^{z+\frac{1}{2}} \geq 2^{\log (3d)} ]
\\
&= P[ z+\frac{1}{2} \geq \log (3d)]
\\
&= P[ z \geq \log (3d)-\frac{1}{2}]
\\
&= P[Y_{\log (3d)- \frac{1}{2}} \geq 1]
\\
&\leq  \frac{E[Y_{\log (3d)- \frac{1}{2}}]}{1}
& \text{Markov: }Pr[X\geq  a ] \leq \frac{E[X]}{a}
\\
& =  E\left[\sum_{i=1}^d z_{\log (3d)- \frac{1}{2}}(x_i) \right]
&\text{As }Y_j = \sum_{i=1}^d z_j(x_i)
\\
& =  \sum_{i=1}^d E[z_{\log (3d)- \frac{1}{2}}(x_i) ]
\\
& =  \sum_{i=1}^d \frac{1}{2^{\log (3d)- \frac{1}{2} }}
\\
& =  \sum_{i=1}^d \frac{\sqrt{2}}{3d} 
\\
& =  \frac{\sqrt{2}}{3}\approx 47\%
\end{align*}
$$

Next, let's look at the chance that $\hat{d}$ is too small. *We did not do the math for this in class, and it is a little different from the too big case*.
As Markov is useless in this case, we introduce the third commonly used inequality:
Chebyshev, which bounds how likely a random variable is to be far from its expected value, as a function of its variance: 

$$ Pr[ | X - E[X] | \geq k ] \leq \frac{Var[X]}{k^2}$$

The variance $Var[X]$ has the easy-to-use formula $E[X^2]-E[X]^2$. This is very easy to use for indicator random variables, as these are always 0 or 1, which are the two numbers that don't change when you square them; thus for indicator random variables $E[X^2]$ is $E[X]$. Finally, for independent variables $Var[X+Y]=Var[X]+Var[Y]$.
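As a worked instance of these facts (with $p$ introduced here to denote $P[X=1]$), take an indicator random variable $X$. Then $E[X]=E[X^2]=p$ and

$$
Var[X] = E[X^2]-E[X]^2 = p - p^2 = p(1-p) \leq p = E[X]
$$

For example, with $p=\frac{1}{4}$ the variance is $\frac{1}{4}-\frac{1}{16}=\frac{3}{16}$, which is indeed at most $\frac{1}{4}$.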

OK, so let's look at the chance that $\hat{d}$ is too small by a factor of 3.


$$
\begin{align*}
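&P[\hat{d} \leq \frac{d}{3}] \\
&= P[ 2^{z+\frac{1}{2}} \leq 2^{\log \frac{d}{3}} ]
\\
&= P[ z+\frac{1}{2} \leq \log \frac{d}{3}]
\\
&= P[ z \leq \log \frac{d}{3} - \frac{1}{2}]
\\
&= P[Y_{\log \frac{d}{3}+ \frac{1}{2}} = 0 ]
& \text{As $z \leq m$ exactly when $Y_{m+1} = 0$}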
\\
&\leq  P[ | Y_{\log \frac{d}{3}+ \frac{1}{2}} - E[Y_{\log \frac{d}{3}+ \frac{1}{2}}]|  \geq  E[Y_{\log \frac{d}{3}+ \frac{1}{2}}] ]
\\
&\leq  \frac{Var[Y_{\log \frac{d}{3}+ \frac{1}{2}}]}{E[Y_{\log \frac{d}{3}+ \frac{1}{2}}]^2}
& \text{Chebyshev: }Pr[ | X - E[X] | \geq k ] \leq \frac{Var[X]}{k^2}
\\
&\leq  \frac{E[Y_{\log \frac{d}{3}+ \frac{1}{2}}]}{E[Y_{\log \frac{d}{3}+ \frac{1}{2}}]^2}
& \text{*}
\\
&\leq  \frac{1}{E[Y_{\log \frac{d}{3}+ \frac{1}{2}}]}
\\
&\leq  \frac{\sqrt{2}}{3}
&\text{As  above, $E[Y_m] = \frac{d}{2^m}$}
\end{align*}
$$

Point * holds because, assuming the hash values of distinct elements are independent, $Var[Y_m]=Var[\sum_{i=1}^d z_m(x_i)]=\sum_{i=1}^d Var [z_m(x_i)] =
\sum_{i=1}^d (E[ z_m(x_i)^2]-E[ z_m(x_i)]^2 )\leq \sum_{i=1}^d E[ z_m(x_i)^2]= \sum_{i=1}^d E[ z_m(x_i)] = E[\sum_{i=1}^d z_m(x_i)] = E[Y_m]
$.
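The two bounds can be sanity-checked with a small simulation. This is only a sketch under an idealised assumption: each distinct item's hash is modelled as a fresh random 64-bit number rather than the output of a real hash function. The names `count_trailing_zeros` and `error_rates`, and the choice of $d=200$, are made up for this illustration.

```python
import math
import random

def count_trailing_zeros(x):
    """Number of trailing zero bits of a positive integer x."""
    count = 0
    while x % 2 == 0:
        x //= 2
        count += 1
    return count

def error_rates(d, trials, rng):
    """Estimate P[d_hat >= 3d] and P[d_hat <= d/3] for d distinct items."""
    too_big = 0
    too_small = 0
    for _ in range(trials):
        # Idealised hash values: one fresh 64-bit random number per distinct item.
        # Setting bit 64 keeps each value positive without changing its low 64 bits.
        z = max(count_trailing_zeros(rng.getrandbits(64) | (1 << 64)) for _ in range(d))
        d_hat = 2 ** (z + 0.5)
        if d_hat >= 3 * d:
            too_big += 1
        if d_hat <= d / 3:
            too_small += 1
    return too_big / trials, too_small / trials

rng = random.Random(2)
too_big, too_small = error_rates(d=200, trials=2000, rng=rng)
print("too big:  ", too_big)
print("too small:", too_small)
print("bound:    ", math.sqrt(2) / 3)
```

Both observed rates should come out below $\frac{\sqrt{2}}{3}$; the bounds are upper bounds, so the observed rates may be considerably smaller.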

So the probability that $\hat{d}$ is within a factor of 3 of the real value of $d$ is at least $1-\frac{2 \sqrt{2}}{3}\approx 0.057$.
A guaranteed success rate of 6% is hardly impressive. But by running several independent computations of the distinct elements and taking the median value, we can boost the success rate. This is exactly the same as the approximate median finding algorithm with $\epsilon= 1-\frac{2 \sqrt{2}}{3}\approx 0.057$. 

Thus, using the formula from the previous lecture, to get an answer that is correct within a factor of 3 with an at most $\gamma$ chance of failure, one should run the sketch
$\frac{6}{(0.057)^2} \ln \frac{2}{\gamma}$
times in parallel and take the median.

In [6]:
def numberOfApproxDistinctNeeded(failure):
    # epsilon = 1 - 2*sqrt(2)/3, approximately 0.057; round up to a whole number of runs
    return math.ceil((6/(0.057)**2)*math.log(2/failure))

for failure in (0.1,0.01,0.001,0.00001,0.000001):
    print(failure,numberOfApproxDistinctNeeded(failure))

0.1 5533
0.01 9785
0.001 14037
1e-05 22542
1e-06 26794
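To see the effect of taking the median, here is a small simulation sketch. It again models hash values as fresh random 64-bit numbers, which is an idealisation, and the names `one_estimate` and `success_rate`, the choice $d=200$, and the use of 15 runs are all illustrative assumptions rather than values from the lecture.

```python
import random
import statistics

def count_trailing_zeros(x):
    """Number of trailing zero bits of a positive integer x."""
    count = 0
    while x % 2 == 0:
        x //= 2
        count += 1
    return count

def one_estimate(d, rng):
    """One run of the sketch on d distinct items with idealised random hashes."""
    z = max(count_trailing_zeros(rng.getrandbits(64) | (1 << 64)) for _ in range(d))
    return 2 ** (z + 0.5)

def success_rate(d, runs, trials, rng):
    """Fraction of trials where the median of `runs` estimates is within a factor of 3 of d."""
    good = 0
    for _ in range(trials):
        median = statistics.median(one_estimate(d, rng) for _ in range(runs))
        if d / 3 < median < 3 * d:
            good += 1
    return good / trials

rng = random.Random(3)
single = success_rate(d=200, runs=1, trials=300, rng=rng)
boosted = success_rate(d=200, runs=15, trials=300, rng=rng)
print("single run:       ", single)
print("median of 15 runs:", boosted)
```

An odd number of runs is used so that the median is always one of the actual estimates.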


# Distinct elements, code

Here is the code for zeros. There are many ways you can do this, and in languages like C it can be done very quickly as the CPU has circuitry to compute it directly.

In [16]:
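def zeros(x):
    # Count the trailing zero bits of x by halving until x is odd
    if x == 0:
        return 0 # 0 would loop forever below; returning 0 is a convention chosen here
    answer=0
    while x % 2 == 0:
        x=x//2
        answer+=1
    return answer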

In [17]:
for x in [1,4,16,24,768,1000,10001]:
    print(x," has ",zeros(x)," zeros")

1  has  0  zeros
4  has  2  zeros
16  has  4  zeros
24  has  3  zeros
768  has  8  zeros
1000  has  3  zeros
10001  has  0  zeros
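One of the other ways to count trailing zeros is a standard bit trick, sketched below. The name `zeros_bit_trick` is made up for this illustration, and the function assumes `x` is nonzero.

```python
def zeros_bit_trick(x):
    """Trailing zero bits of a nonzero integer x, without a loop."""
    # x & -x isolates the lowest set bit of x;
    # its bit_length is the position of that bit plus one.
    return (x & -x).bit_length() - 1

for x in [1, 4, 16, 24, 768, 1000, 10001]:
    print(x, " has ", zeros_bit_trick(x), " zeros")
```

This relies on Python integers behaving like two's complement numbers under `&`, so it also works for the negative values that `hash` can return.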


Here is code for computing the exact number of distinct elements. It uses the fact that inserting into a set does nothing if the item is there already. Note that this uses lots of space!

In [18]:
def distinctElements(strings):
    S=set()
    for string in strings:
        S.add(string)
    return len(S)

Here is the code for computing the approximate number of distinct elements. It uses very little space.

In [10]:
def maxZeros(strings,seed=0):
    return max(zeros(hash((string,seed))) for string in strings )

In [11]:
def approxDistinctElements(strings):
    seed=random.random()
    return int(2**(maxZeros(strings,seed)+0.5)) # the int gets rid of the clutter after the decimal point

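Here is a demo. Note that the stream contains $100000\cdot 50\cdot 10$ characters, 50 million characters, and the exact method has to store every distinct string, which is $100000\cdot 50$ characters, 5 million characters. The approximate method stores one integer. 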

In [20]:
def randomStringOfChars(length):
    letters = "qwertyuiopasdfghjklzxcvbnm"
    return "".join((random.choice(letters) for i in range(length)))

A=[randomStringOfChars(50) for s in range(100000)]
print("  Real distinct elements: ",distinctElements(A*10)) #Ten copies of A
print("           Maximum zeros: ",maxZeros(A*10))
print("Approx distinct elements: ",approxDistinctElements(A*10))

  Real distinct elements:  100000
           Maximum zeros:  17
Approx distinct elements:  185363


In [21]:
#Download http://www.gutenberg.org/files/100/100-0.txt and save it as data/shakespeare.txt
with open("data/shakespeare.txt","r") as f: # the with block closes the file after reading
    shakespere=f.read()
print(shakespere[5000:6000]) # Test to make sure the file is as expected

ven thee to give?
Profitless usurer why dost thou use
So great a sum of sums yet canst not live?
For having traffic with thy self alone,
Thou of thy self thy sweet self dost deceive,
Then how when nature calls thee to be gone,
What acceptable audit canst thou leave?
  Thy unused beauty must be tombed with thee,
  Which used lives th’ executor to be.


                    5

Those hours that with gentle work did frame
The lovely gaze where every eye doth dwell
Will play the tyrants to the very same,
And that unfair which fairly doth excel:
For never-resting time leads summer on
To hideous winter and confounds him there,
Sap checked with frost and lusty leaves quite gone,
Beauty o’er-snowed and bareness every where:
Then were not summer’s distillation left
A liquid prisoner pent in walls of glass,
Beauty’s effect with beauty were bereft,
Nor it nor no remembrance what it was.
  But flowers distilled though they with winter meet,
  Leese but their show, their substance still lives sweet.


In [14]:
import re
words=re.findall(r"[\w']+", shakespere) #This breaks it into words
print(words[:100]) # Test showing the first 100 words of the document

['The', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Complete', 'Works', 'of', 'William', 'Shakespeare', 'by', 'William', 'Shakespeare', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', 'You', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 're', 'use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www', 'gutenberg', 'org', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'eBook', 'Title']


In [22]:
print("               Real distinct elements: ",distinctElements(words))
print("             Approx distinct elements: ",approxDistinctElements(words))
# Sort the 99 estimates first; the median is then the middle one, at index 49
print("Median of 99 Approx distinct elements: ",sorted(approxDistinctElements(words) for i in range(99))[49])

               Real distinct elements:  35963
             Approx distinct elements:  11585
Median of 99 Approx distinct elements:  46340


Note that in the above code, each of the approximate distinct element counts was computed separately; that is, it passes through the data 99 times. As an exercise, write code that passes through the data only once and returns the median of the approximate distinct element counts. 

# Homework

In order to get a uniform sample in the streaming model, we needed to take the $i$th item with probability $\frac{1}{i}$. This involves computing a random number. 

Recall that to compute a 10%-approximate median with a failure rate of $0.01\%$ we needed about 6000 random samples. Implemented in a straightforward way, this would mean 6000 random numbers would be needed for each data item, which is a lot!

So your homework is to create a method to compute a sample in a stream that is *almost* uniform, that is, the probability of picking a given item is within a factor of two of $\frac{1}{n}$ if the stream has had $n$ items so far. Instead of generating $n$ random numbers in total, your method should generate far fewer: a number logarithmic in $n$.

Present your method, code it, and prove that it works.
