# Algorithms for Data Science

## Counting Distinct Items

### 1. Preliminaries 

The objective of this lab is to implement the Flajolet-Martin approach to counting distinct items. First, we generate a universe $U$ of $N$ random strings of length $12$, and sample $d$ of them to form the set $D$ of distinct items that will appear in the stream.

In [1]:
import random
from string import ascii_lowercase

#parameters
N = 256 #universe of N 
d = 3 #distinct items
stream_size = 10000

#generate N random strings of length 12
U = []
for _ in range(N):
  U.append(''.join(random.choice(ascii_lowercase) for i in range(12)))

D = random.sample(U,k=d)

print(D)

['fvhfbehnvvjl', 'hnfxvvpyogpj', 'vbnvzputlatp']


### 2. Flajolet-Martin: Creating a Hash Function, Estimating Distinct Items Using Trailing 0s

Below we define a hash function $h(x, n)$ that takes a hashable value $x$ and a range size $n$, and returns a value in $0,\dots,n-1$; the code calls it with $n = 2N$. We simulate a stream by drawing random items from $D$, count the trailing $0$s in the binary representation of each hash value, keep the maximum count $R$, and output $2^R$ as the estimate.

In [3]:
import math
import random
from datetime import datetime

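#seed with a float timestamp: recent Python versions reject a datetime object as a seed
random.seed(datetime.now().timestamp())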

def h(x,n):
  return hash(x)%n

#method for counting trailing 0s
def trailing_0(x):
  x1 = x
  t0 = 0
  while x1%2==0 and x1!=0:
    t0 += 1
    x1 = x1 // 2 #integer division avoids the float round trip
  return t0

#simulating the stream
R = 0
for _ in range(stream_size):
  #take a random string from the distinct pool
  s = random.choice(D)
  #check its hash value
  hv = h(s,2*N) #to allow more space for hash values
  r = trailing_0(hv)
  if r>R: R=r

est = int(math.pow(2,R))

print('Estimation of distinct items: %d'%est)

Estimation of distinct items: 2
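
Python's built-in `hash` is salted per process for strings, so the estimate above can change from one run to the next. The following is a minimal sketch of the same single-hash estimator with a reproducible hash, assuming SHA-256 from `hashlib` as a stand-in for $h$. The names `stable_hash`, `trailing_zeros`, `fm_estimate` and the sample items are hypothetical and not part of the lab code.

```python
import hashlib
import random

def stable_hash(x, n):
    """Hash a string to 0..n-1 using SHA-256 (same value on every run)."""
    digest = hashlib.sha256(x.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n

def trailing_zeros(x):
    """Number of trailing zero bits of x; 0 is mapped to 0, as in trailing_0."""
    if x == 0:
        return 0
    # x & -x isolates the lowest set bit; its position is the zero count
    return (x & -x).bit_length() - 1

def fm_estimate(stream, n):
    """Single-hash Flajolet-Martin estimate 2**R over an iterable of strings."""
    R = 0
    for s in stream:
        R = max(R, trailing_zeros(stable_hash(s, n)))
    return 2 ** R

items = ["alpha", "beta", "gamma"]  # hypothetical distinct items
rng = random.Random(0)              # local generator, leaves the global seed alone
stream = [rng.choice(items) for _ in range(1000)]
est = fm_estimate(stream, 512)
print('Estimation of distinct items: %d' % est)
```

Because the estimate depends only on the set of items seen, repeating an item in the stream never changes $R$; this is what makes the method a distinct-count estimator.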


### 3. **TASK** Flajolet-Martin: Using Multiple Hash Functions

Implement the refined version of the above estimator, using multiple ($k$) hash functions (use the method of generating several pairs of numbers presented last time in the lab) and compute:
1. The average of the $k$ estimators
2. The median of the $k$ estimators
3. Divide the estimators into groups (vary the group size); take the median in each group and then the average over the groups.

Compare the three methods' final outputs. What do you notice?

_Note_: you can use the _statistics_ module from the Python standard library (available since Python 3.4) to compute medians, averages, and other statistics.

In [18]:
# YOUR CODE HERE
import statistics

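k=20 #number of hash functions
p = 122354367 #modulus of the hash family; note: divisible by 3, so not prime (a prime is the usual choice)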

a_list=[]
b_list=[]

for i in range(k):

    a = random.randrange(p)
    a_list.append(a)
    b = random.randrange(p)
    b_list.append(b)


def h(x,a,b,p,n):
  return ((a*hash(x)+b)%p)%n

#method for counting trailing 0s
def trailing_0(x):
  x1 = x
  t0 = 0
  while x1%2==0 and x1!=0:
    t0 += 1
    x1 = x1 // 2 #integer division avoids the float round trip
  return t0

list_r=[]

stream = random.choices(D, k=stream_size)

for i in range(k):
  #process the same stream with the i-th hash function
  R = 0
  for s in stream:
    #hash the item with the i-th (a,b) pair into 0..N-1
    hv = h(s,a_list[i],b_list[i],p,N)
    r = trailing_0(hv)
    if r>R: R=r

  list_r.append(int(math.pow(2,R)))


print(f'the mean of the estimators {statistics.mean(list_r)}')

print(f'the median of the estimators {statistics.median(list_r)}')

#median within each group of size n, then mean (and median) across the groups
for n in (4, 2, 10):
    groups = [list_r[i:i + n] for i in range(0, k, n)]
    group_medians = [statistics.median(g) for g in groups]
    print(f'the mean of the estimators by {len(groups)} groups {statistics.mean(group_medians)}')
    print(f'the median of the estimators by {len(groups)} groups {statistics.median(group_medians)}')

the mean of the estimators 7.35
the median of the estimators 4.0
the mean of the estimators by 5 groups 4.7
the median of the estimators by 5 groups 4.0
the mean of the estimators by 10 groups 7.35
the median of the estimators by 10 groups 3.0
the mean of the estimators by 2 groups 3.0
the median of the estimators by 2 groups 3.0


## Interpretation

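In this run, the plain mean gives the largest error (7.35 against the true $d = 3$): every estimator is a power of two, so a few hash functions with a large $R$ dominate the average. The median (4.0) is much closer.

When we divide the estimators into groups, the result depends on the group size. With few, large groups (2 groups of 10), the mean of the group medians is 3.0, which matches $d$. With many small groups (10 groups of 2), the mean of the group medians falls back to 7.35, because the median of two values is their average, while the median of the group medians still gives 3.0.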
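
The effect of a single large power of two on each combining rule can be checked on a small hand-made list. This is only an illustrative sketch: the list `toy` and the helper `combine` are hypothetical and are not the estimators produced in the run above.

```python
import statistics

def combine(estimates, group_size):
    """Return (mean, median, mean of group medians) for a list of estimates."""
    mean_all = statistics.mean(estimates)
    median_all = statistics.median(estimates)
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    group_medians = [statistics.median(g) for g in groups]
    return mean_all, median_all, statistics.mean(group_medians)

# seven small estimates and one outlier (64), all powers of two
toy = [2, 4, 4, 2, 64, 4, 2, 4]

print(combine(toy, 4))  # (10.75, 4.0, 3.5)
print(combine(toy, 2))  # (10.75, 4.0, 10.75)
```

With groups of 4, the outlier is absorbed by its group's median; with groups of 2, each group median is just the average of its two values, so the mean of the group medians equals the plain mean.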


