# Practice Session 08: Data streams

In this session we will take a large corpus of queries and compute statistics on them using methods for data stream sampling.

# 0. Preliminaries

The dataset we will use contains the 1,000 most prolific users from the AOL Query Log (2006), a dataset released for research and later retracted [[download link](https://github.com/wasiahmad/aol_query_log_analysis)]. The idea of this practice is to obtain some statistics on this file **without** loading the entire file into main memory.

## 0.1. Required imports

In [1]:
import io
import csv
import math
import random
import statistics

## 0.2. How to iterate through this file

In [2]:
INPUT_FILE = "user_queries.csv"
with io.open(INPUT_FILE) as file:
    reader = csv.reader(file, delimiter="\t")
    for timestamp, userid, query in reader:
        # Prints 0.01% of lines
        if random.random() < 0.0001:
            print("On %s user %s issued query %s" % (timestamp, userid, query)) 

On 2006-03-04 13:03:17 user u1142189 issued query yesterdays tractor
On 2006-03-05 16:31:08 user u7011826 issued query lumina therapy clinic
On 2006-03-16 17:09:07 user u2544636 issued query bud and alley's seaside fl
On 2006-03-24 08:12:59 user u4633322 issued query haaretz
On 2006-03-26 21:38:42 user u8555813 issued query google
On 2006-03-28 15:27:39 user u4555258 issued query gold country casino
On 2006-03-28 23:46:55 user u351790 issued query red bull health drink
On 2006-03-29 22:51:54 user u8408524 issued query silverthorn resort lake shasta
On 2006-03-30 05:12:53 user u6754960 issued query so yesterday lyrics
On 2006-03-30 08:13:34 user u4205230 issued query yahoo
On 2006-04-02 12:33:58 user u8439139 issued query clive owen
On 2006-04-08 15:34:12 user u8221292 issued query history of plessey v. ferguson
On 2006-04-08 16:09:15 user u3273964 issued query mens warehouse
On 2006-04-08 19:21:37 user u6100295 issued query suffolk horse
On 2006-04-08 20:39:48 user u825602 issued query

# 1. Determine approximately the top-5 queries

In this query log the most frequent queries are:

* "google" (1.6% of the queries)
* "ebay" (1.5%)
* "yahoo" (1.3%)
* "myspace" (1.0%)
* "craigslist" (0.5%)

Instead of loading the entire query log in main memory, we will use reservoir sampling to determine approximately the top-5 queries.

**Reservoir sampling**: In reservoir sampling, if we have a reservoir of size S:

* We store the first S elements of the stream
* When the n<sup>th</sup> element arrives, with n > S (let's call it X<sub>n</sub>):
   * With probability 1 - S/n, we ignore this element.
   * With probability S/n, we:
      * Discard a random element from the reservoir
      * Add element X<sub>n</sub> to the reservoir (calling *add_to_reservoir*)
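The steps above can be sketched on a generic stream as follows. This is only a minimal illustration, not the notebook's reference solution: the function name `sample_stream` and the `rng` parameter are hypothetical, and the stream is a plain range of integers rather than the query log.

```python
import random


def sample_stream(stream, reservoir_size, rng=random):
    """Return (items seen, uniform sample of at most reservoir_size items)."""
    reservoir = []
    n = 0
    for item in stream:
        n += 1
        if len(reservoir) < reservoir_size:
            # The first reservoir_size items are always stored.
            reservoir.append(item)
        elif rng.random() < reservoir_size / n:
            # Keep item n with probability reservoir_size / n:
            # evict a random stored item first, then store the new one.
            del reservoir[rng.randrange(len(reservoir))]
            reservoir.append(item)
    return n, reservoir


seen, sample = sample_stream(range(10_000), 100)
print(seen, len(sample))  # 10000 100
```

The sample itself changes from run to run, but its size never exceeds `reservoir_size`, whatever the length of the stream.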

[**CODE**] Implement a function `add_to_reservoir(reservoir, item, max_reservoir_size)` that adds an item to the reservoir while keeping its size at most *max_reservoir_size*. If the reservoir is already of size *max_reservoir_size*, a random item is selected and evicted *before* adding the new item, so that the reservoir never exceeds its maximum size.

In [3]:
def add_to_reservoir(reservoir, item, max_reservoir_size):
    # YOUR CODE HERE
    assert(len(reservoir) <= max_reservoir_size)

[**CODE**] Iterate through the file using the reservoir sampling method seen in class. In this function you will decide, for every item, whether to call *add_to_reservoir* or to ignore the item.

In [4]:
def reservoir_sampling(filename, reservoir_size):
    reservoir = []

    with io.open(filename) as file:
        reader = csv.reader(file, delimiter="\t")
        i = 0
        for timestamp, userid, query in reader:
            i += 1
            # YOUR CODE HERE: decide whether to call add_to_reservoir or not
            
    return i, reservoir

num_lines, reservoir = reservoir_sampling(INPUT_FILE, 500)

print("Number of queries seen    : %d" % num_lines)
print("Number of queries sampled : %d" % len(reservoir) )

Number of queries seen    : 318023
Number of queries sampled : 500


[**CODE**] Write code to list the top-5 queries found by looking at frequencies in the reservoir. If you see a query C times in the reservoir, you can estimate the query appears *C x dataset_size / reservoir_size* times in the entire dataset (*dataset_size* is the size of the entire dataset). Dividing this quantity by *dataset_size* and multiplying by 100 will give you the percentage.
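As a small worked example of this estimate, the sketch below uses a hypothetical toy reservoir of 10 items and an assumed dataset size of 1000 (neither comes from the query log). It also shows that the dataset size cancels out, so the percentage is simply *100 x C / reservoir_size*.

```python
from collections import Counter

# Hypothetical toy reservoir and dataset size, for illustration only.
toy_reservoir = ["google"] * 4 + ["ebay"] * 3 + ["yahoo"] * 2 + ["haaretz"]
toy_dataset_size = 1000

percentages = {}
for query, c in Counter(toy_reservoir).most_common(3):
    estimated_total = c * toy_dataset_size / len(toy_reservoir)
    # Same value as 100 * c / len(toy_reservoir): the dataset size cancels out.
    percentages[query] = 100 * estimated_total / toy_dataset_size

for query, pct in percentages.items():
    print("%s %.1f%%" % (query, pct))
```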

In [5]:
freq = {}
for item in reservoir:
    freq[item] = reservoir.count(item)

most_frequent_items = sorted([(frequency, query) for query, frequency in freq.items()], reverse=True)[:5]
    
# YOUR_CODE_HERE

ebay 2.0%
myspace 1.8%
google 1.6%
msn 0.8%
yahoo mail 0.6%


[**REPORT**] For various sizes of the reservoir: 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, list the top-5 queries and their estimated frequency. 

[**REPORT**] Find by trial and error, and include in your report, the minimum reservoir size you would have to use to have an overlap of 3/5 between the queries found by the approximate method and the actual top-5.
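One possible way to measure the overlap is a set intersection. The sketch below compares the actual top-5 listed at the start of this section with the approximate top-5 from the sample output shown earlier; the variable names are illustrative.

```python
# Actual top-5 and the approximate top-5 from the sample run with a reservoir of 500.
actual_top5 = {"google", "ebay", "yahoo", "myspace", "craigslist"}
approx_top5 = {"ebay", "myspace", "google", "msn", "yahoo mail"}

overlap = actual_top5 & approx_top5
print("Overlap: %d/5" % len(overlap))  # Overlap: 3/5
```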

# 2. Determine approximately the number of users

We will estimate the number of distinct users without building a dictionary or hash table of users; instead, we will use the Flajolet-Martin probabilistic counting method.

**Flajolet-Martin probabilistic counting**:

* For every element *u* in the stream:
   * Compute hash value *h(u)*
   * Let *r(u)* be the number of trailing zeroes in *h(u)*
   * Maintain *R* as the maximum value of *r(u)* seen so far
* Output *2<sup>R</sup>* as our estimate for the number of distinct elements *u* seen
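A minimal sketch of these steps on a toy stream is shown below. It makes several assumptions: the hash function is a salted `hashlib.sha256` truncated to 64 bits (chosen here only so that results are reproducible across runs), the names `trailing_zeroes` and `fm_estimate` are hypothetical, and a hash value of 0 is treated as having no trailing zeroes.

```python
import hashlib


def trailing_zeroes(number):
    """Number of trailing zero bits; 0 is treated as having none (a convention)."""
    if number == 0:
        return 0
    count = 0
    while number & 1 == 0:
        count += 1
        number >>= 1
    return count


def fm_estimate(stream, salt=""):
    """Flajolet-Martin estimate 2**R of the number of distinct items in stream."""
    max_zeroes = 0
    for item in stream:
        digest = hashlib.sha256((salt + str(item)).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash value
        max_zeroes = max(max_zeroes, trailing_zeroes(h))
    return 2 ** max_zeroes


# Toy stream: 50 distinct ids, each repeated 20 times.
toy_stream = ["u%d" % (i % 50) for i in range(1000)]
print(fm_estimate(toy_stream, salt="a"), fm_estimate(toy_stream, salt="b"))
```

Repeated elements hash to the same value, so they never change *R*; only the set of distinct elements matters. Each salt gives a different estimate, and every estimate is a power of two.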

Use this function to count trailing zeroes in the binary representation of a number.

In [6]:
def count_trailing_zeroes(number):
    # Zero has no lowest set bit; return 0 here to avoid an infinite loop
    if number == 0:
        return 0
    count = 0
    while number & 1 == 0:
        count += 1
        number = number >> 1
    return count

Use this function to generate a random hash function. Note this generates a function, so you can do `hash_function = random_hash_function()` and then call `hash_function(x)` to compute the hash value of `x`.

In [7]:
def random_hash_function():
    salt = random.random()
    return lambda string: hash(string + str(salt))

[**CODE**] Perform *number_of_passes* passes over the file, reading the entire file on each pass (we don't use the reservoir in this part). In each pass, create a new hash function and use it to hash userids. Keep the maximum number of trailing zeroes *R* seen in the hash value of a userid; the estimate for that pass is *2<sup>R</sup>*.

In [8]:
number_of_passes = 10

estimates = []

for i in range(number_of_passes):
    # YOUR_CODE_HERE: read the file and generate an estimate
    
    estimates.append(estimate)
    print("Estimate on pass %d: %d" % (i+1, estimate))
    
print("* Average of estimates: %.1f" % statistics.mean(estimates))
print("* Median  of estimates: %.1f" % statistics.median(estimates))

Estimate on pass 1: 512
Estimate on pass 2: 512
Estimate on pass 3: 256
Estimate on pass 4: 8192
Estimate on pass 5: 8192
Estimate on pass 6: 1024
Estimate on pass 7: 128
Estimate on pass 8: 4096
Estimate on pass 9: 1024
Estimate on pass 10: 1024
* Average of estimates: 2496.0
* Median  of estimates: 1024.0


[**REPORT**] Include in your report the median of estimates obtained in 3 separate runs of your algorithm; each run should do 10 passes over the file. Indicate why the median of estimates is preferable to the average of estimates.

*Note: in this dataset, the actual number of users is 1000.*
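To compare the two summaries, the sketch below recomputes them for the ten estimates of the sample run shown earlier, and then again after replacing one estimate with a much larger power of two. The value 2<sup>20</sup> is hypothetical and was chosen only for illustration.

```python
import statistics

# Estimates from the sample run shown earlier.
estimates_run = [512, 512, 256, 8192, 8192, 1024, 128, 4096, 1024, 1024]
print("%.1f" % statistics.mean(estimates_run))    # 2496.0
print("%.1f" % statistics.median(estimates_run))  # 1024.0

# Hypothetical variant: one pass happens to produce a very large estimate.
with_outlier = estimates_run[:]
with_outlier[3] = 2 ** 20
print("%.1f" % statistics.mean(with_outlier))
print("%.1f" % statistics.median(with_outlier))   # 1024.0
```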

# 3. Deliver

Deliver:

* A zip file containing your notebook (.ipynb file) with all the [**CODE**] parts implemented.
* A 2-page PDF report including all parts of this notebook marked with "[**REPORT**]"

The report should end with the following statement: **I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.**