In [1]:
%pylab inline
import pandas as pd

import numpy as np
from __future__ import division
import itertools

import matplotlib.pyplot as plt
import seaborn as sns

import logging
logger = logging.getLogger()

Populating the interactive namespace from numpy and matplotlib


4 Mining Data Streams
============

Data:    
1. database   
2. stream:     
    + data is lost forever unless it is processed immediately or stored on arrival.    
    + data arrives so rapidly that it isn't feasible to store all of it.   
    
    solution: summarization     
        + sample/filter, then estimate     
        + fixed-length "window"

### 4.1 The Stream Data Model

Examples of Stream Sources:   

+ Sensor Data:      
  Deploy a million sensors    
  
+ Image Data:     
  satellites, surveillance cameras

+ Internet and Web Traffic

![Fig 4.1: A data-stream-management system](files/res/figure_4_1.png)

##### 4.1.3 Stream Queries
two ways:

1. standing queries     
   preset, permanently executing, and produce outputs at appropriate times.    

2. ad-hoc queries    
   a question asked once about the current state of a stream or streams.    
   + a common approach is to store a **sliding window** of each stream in the working store.
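
The two query styles above can be sketched in a few lines. This is a minimal illustration, not a real stream-management system: the class name `StreamProcessor`, its methods, and the choice of "maximum value ever seen" as the standing query are all assumptions made for the example.

```python
from collections import deque


class StreamProcessor:
    """Toy processor: one standing query plus a sliding window for ad-hoc queries."""

    def __init__(self, window_size):
        # Working store: only the most recent `window_size` elements are kept;
        # deque(maxlen=...) silently drops the oldest element when full.
        self.window = deque(maxlen=window_size)
        # Standing query state (hypothetical choice): maximum value ever seen.
        self.max_seen = None

    def on_arrival(self, value):
        """Called once per stream element; updates the standing query."""
        self.window.append(value)
        if self.max_seen is None or value > self.max_seen:
            self.max_seen = value

    def ad_hoc(self, func):
        """Answer a one-off question using only the current window."""
        return func(list(self.window))


processor = StreamProcessor(window_size=3)
for reading in [5, 9, 4, 2, 8]:
    processor.on_arrival(reading)

standing_answer = processor.max_seen                      # over the whole stream
window_max = processor.ad_hoc(max)                        # over the last 3 readings
window_mean = processor.ad_hoc(lambda xs: sum(xs) / len(xs))
```

The standing query sees every element but keeps only constant state, while the ad-hoc query can ask anything, but only about what the window still holds.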
  
  
**generalizations** about stream algorithms:

1. Often, an **approximate answer** is much more **efficient** than an **exact solution**.

2. **hash function** introduces useful **randomness** into the algorithm's behavior $\to$ approximate answer

### 4.2 Sampling Data in a Stream

extract reliable samples from a stream:    
select $S \subset B$, $\, s.t. \, E[f(S)] = E[f(B)]$.

Example   
Prob: What fraction of the typical user's queries were repeated over the past month?     

given: a user has issued $s$ search queries one time, $d$ queries two times, and no queries more than twice.

solutions:

- _REAL_: $$tf = \frac{d}{s+d}$$

- _BAD_: store 1/10th of the stream **elements**.    
  + for s: $1/10 s$ one time    
  + for d:      
    $(1/10)*(1/10)*d = 1/100 d$ two times (both occurrences are sampled),      
    $(1/10*9/10 + 9/10*1/10) d = 18/100d$ one time (exactly one occurrence is sampled).
  + calc: 
  $$\hat{tf} = \frac{1/100d}{1/10s + 1/100d + 18/100d} = \frac{d}{10s + 19d}$$

- _GOOD_: pick 1/10th of the **users** and take all their searches for the sample.    
  $$\hat{tf} = \frac{d}{s+d}$$     
  
  sample: **hash**    
  obtain a sample consisting of any rational fraction $a/b$ of the users by hashing user names to $b$ buckets. Add the search query to the sample if the hash value is less than $a$.
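
The gap between the _BAD_ and _GOOD_ estimates can be checked with a small simulation. This is a sketch under assumptions not in the notes: Python 3, synthetic user and query names, `s = d = 10` for every user, and `md5` as a stand-in hash function (any hash that spreads user names evenly would do).

```python
import hashlib
import random
from collections import Counter


def bucket(key, b):
    """Hash a string key to one of b buckets (md5 is an arbitrary choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % b


def repeated_fraction(sample):
    """Fraction of distinct (user, query) pairs that occur twice in the sample."""
    counts = Counter(sample)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 2) / len(counts)


s, d, n_users = 10, 10, 2000
stream = []
for u in range(n_users):
    user = "user%d" % u
    stream += [(user, "single%d" % i) for i in range(s)]      # issued once
    stream += [(user, "double%d" % i) for i in range(d)] * 2  # issued twice

rng = random.Random(0)  # fixed seed so the run is repeatable

# BAD: keep each stream element independently with probability 1/10.
bad_sample = [e for e in stream if rng.random() < 0.1]
# GOOD: keep every element of the users that hash into bucket 0 of 10 (a/b = 1/10).
good_sample = [e for e in stream if bucket(e[0], 10) < 1]

true_tf = repeated_fraction(stream)       # d / (s + d)
bad_tf = repeated_fraction(bad_sample)    # close to d / (10s + 19d)
good_tf = repeated_fraction(good_sample)  # equals d / (s + d)
```

Sampling by user keeps both occurrences of a repeated query together, which is why the estimate matches the true fraction, while element sampling usually splits them.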
  
  **The General Sampling Problem**   
  Our stream $x$ consists of tuples with $n$ components $\{x_n\}$ (e.g. user, query, time), and a subset of the components are the _key_ components $x_k$ (e.g. user).     
  To take a sample of fraction $a/b$, we hash the _key_ value $h(x_k)$ of each tuple to $b$ buckets $0, 1, \dots, b-1$, and accept the tuple if $h(x_k) < a$.
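
A possible shape for this general rule is sketched below. The helper names `key_bucket` and `in_sample`, the dictionary representation of tuples, and the use of `sha1` are assumptions for illustration only.

```python
import hashlib


def key_bucket(key_values, b):
    """Hash a tuple of key components to a bucket in 0..b-1 (sha1 is arbitrary)."""
    data = repr(key_values).encode("utf-8")
    return int(hashlib.sha1(data).hexdigest(), 16) % b


def in_sample(record, key_fields, a, b):
    """Accept the record if the hash of its key components is less than a."""
    key = tuple(record[f] for f in key_fields)
    return key_bucket(key, b) < a


records = [
    {"user": "alice", "query": "cats", "time": 1},
    {"user": "bob", "query": "dogs", "time": 2},
    {"user": "alice", "query": "cats", "time": 3},
    {"user": "carol", "query": "fish", "time": 4},
]

# Key = user only: every tuple of a given user is kept or dropped together.
sample = [r for r in records if in_sample(r, ("user",), a=1, b=3)]
sampled_users = {r["user"] for r in sample}
```

Choosing `("user", "query")` as the key instead would sample user-query pairs rather than whole users, so the key should match the unit that the downstream estimate is about.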
  
  **varying the sample size**    
  storage is limited, while the number of users/queries grows as time goes on  ==> decrease the selected fraction $a/b$.    
  
  solution:   
  1. $h(x) \in \{0, 1, \dots, B-1\}$, where $B$ is sufficiently large.    
  2. maintain a threshold $t$; we accept a tuple when $h(x) < t$.       
  3. if the allotted space is exceeded, set $t = t - 1$ and remove all samples with $h(x) = t$ (the new value of $t$).    
     [opt] efficient:   
     + lower $t$ by more than 1.    
     + maintaining an index on the hash value to find all those tuples quickly.
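
The threshold scheme above could look roughly like the following. It is only a sketch: the class name `ShrinkingSample`, the starting threshold $t = B$ (so that every tuple is accepted at first), the `md5` hash, and the synthetic stream are all assumptions. The dictionary `by_hash` plays the role of the index on hash values.

```python
import hashlib
from collections import defaultdict


class ShrinkingSample:
    """Key-based sample whose acceptance threshold t drops when space runs out."""

    def __init__(self, capacity, num_buckets=1 << 16):
        self.capacity = capacity          # allotted space, in stored tuples
        self.num_buckets = num_buckets    # B
        self.threshold = num_buckets      # t, assumed to start at B (accept all)
        self.by_hash = defaultdict(list)  # index: hash value -> stored tuples
        self.size = 0

    def _h(self, key):
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.num_buckets

    def add(self, key, record):
        h = self._h(key)
        if h >= self.threshold:
            return  # key is outside the current sample
        self.by_hash[h].append(record)
        self.size += 1
        # Space exceeded: lower t one step at a time and drop all tuples whose
        # hash equals the new t; the index makes each removal a single lookup.
        while self.size > self.capacity and self.threshold > 0:
            self.threshold -= 1
            self.size -= len(self.by_hash.pop(self.threshold, []))


sampler = ShrinkingSample(capacity=50, num_buckets=1024)
for i in range(2000):
    user = "user%d" % (i % 200)  # 200 synthetic users, 10 queries each
    sampler.add(user, (user, "query%d" % i))

kept = [rec for recs in sampler.by_hash.values() for rec in recs]
kept_users = {user for user, _ in kept}
```

Because $t$ only ever decreases, a user who is still in the sample at the end was never evicted, so all of that user's tuples are present.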