# Reservoir sampling

Reservoir sampling is a method for maintaining an in-memory pool of k items while ensuring that each new item ecountered in data stream has a fair chance of being added into the pool.

For that purpose, a counter of seen items is kept in memory. All new items will be added to the reservoir until it reaches maximum capacity. After that, each new item has a chance $\frac{k}{N_i}$ for being added into the reservoir, where $N_i$ is the current count of total number of items seen. If new element is added, then it replaces a random item in the reservoir.

Standard batch data mining methods can then be applied on reservoir at any time with no modifications needed for stream setting.

In [110]:
//package main

import (
	"math/rand"
)

// Reservoir implements simple reservoir filter
type Reservoir struct {
	k        int
	total    uint64
	switches uint64
	sample   []interface{}
}

// InitReservoir instantiates new Reservoir struct
func InitReservoir(k int) (r *Reservoir, err error) {
	r = &Reservoir{
		k:        k,
		total:    0,
		switches: 0,
		sample:   make([]interface{}, k),
	}
	return r, nil
}

// Add new item to reservoir
func (r *Reservoir) Add(item interface{}) *Reservoir {
	r.total++
	if len(r.sample) < r.k {
		r.sample = append(r.sample, item)
	} else {
		if rand.Float64() < (float64(r.k) / float64(r.total)) {
			r.sample[rand.Intn(r.k)] = item
			r.switches++
		}
	}
	return r
}

// GetSample is a helper to return size of sampled data
func (r *Reservoir) GetSample() []interface{} {
	return r.sample
}

// GetK is a helper to return all sampled values
func (r *Reservoir) GetK() int {
	return r.k
}

// GetTotal is a helper to return number of items seen
func (r *Reservoir) GetTotal() uint64 {
	return r.total
}

// GetSwitches is a helper to return number of items seen
func (r *Reservoir) GetSwitches() uint64 {
	return r.switches
}

In [111]:
import (
    "os"
    "bufio"
    "fmt"
    "strings"
)
func readLine(path string, k int) []interface{} {
    reservoir, _ := InitReservoir(k)
    
    inFile, _ := os.Open(path)
    defer inFile.Close()
    scanner := bufio.NewScanner(inFile)
    scanner.Split(bufio.ScanLines)
    
    for scanner.Scan() {
        words := strings.Split(scanner.Text(), " ")
        for _, word := range words {
            reservoir.Add(word)
        }
    }
    return reservoir.GetSample()
}

In [114]:
src := "/home/jovyan/data/SDM/logs/apache-short.log"
// trivial and useless example
k := 15
finalSample := readLine(src, k)

In [115]:
for _, item := range finalSample {
    fmt.Println(item)
}

44.19.216.236
(KHTML,
NT
"http://www.patel.org/post.asp"
Safari/5340"
/wp-admin
Linux
"Mozilla/5.0
(Macintosh;
<nil>
85.83.140.71
(Windows
4961
200
95)


Note that example code does not consider the possiblity that new item is already in reservoir.