# [1. Top k frequent items](https://leetcode.com/problems/top-k-frequent-elements/)
- Given
    + An integer array `A`
    + An integer `K`
- Return the k most frequent elements
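
As a quick reference point for the solutions below, here is a minimal Python sketch of the expected behavior. The function name `top_k_frequent` is made up for illustration; it simply leans on `collections.Counter` from the standard library.

```python
from collections import Counter

def top_k_frequent(A, K):
    """Return the K most frequent values in A, most frequent first."""
    return [value for value, _ in Counter(A).most_common(K)]

print(top_k_frequent([1, 1, 1, 2, 2, 3], 2))  # [1, 2]
```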


## 1.1 Array - Non Stream Problem

#### O(NlogN)

```C++
class Solution {
public:
    vector<int> topKFrequent(vector<int> &A, int K) {
        // Count frequency
        unordered_map<int, int> num_counter;
        for(int &a: A) num_counter[a] += 1;

        // Sort unordered_map
        vector<pair<int, int>> tmp(num_counter.begin(), num_counter.end());
        sort(tmp.begin(), tmp.end(), [](const pair<int, int> &a, const pair<int, int> &b){return a.second > b.second;});

        // Get top K
        vector<int> ans;
        for(int k=0; k<K && k<tmp.size(); ++k) {
            ans.push_back(tmp[k].first);
        }
        return ans;
    }
};
```

#### O(NlogK)

```C++
class Solution {
private:
    struct Compare {
        bool operator() (const pair<int,int> &a, const pair<int,int> &b) const {
            return (a.second > b.second);
        }
    };
public:
    vector<int> topKFrequent(vector<int> &A, int K) {
        // Count frequency
        unordered_map<int, int> num_counter;
        for(int &a: A) num_counter[a] += 1;

        // min heap, {a, freq},  size K
        priority_queue<
            pair<int,int>,
            vector<pair<int,int>>,
            Compare> H;

        // Maintain top K in min Heap
        for(auto &[a, freq]: num_counter) {
            H.push( {a, freq} );
            if(H.size() > K) H.pop();
        }

        // Get top K
        vector<int> ans;
        while(!H.empty()) {
            ans.push_back(H.top().first);
            H.pop();
        }
        return ans;
    }
};
```

#### Bucket sort - O(N)

```C++
class Solution {
public:
    vector<int> topKFrequent(vector<int> &A, int K) {
        // Count frequency
        int max_freq = 0;
        unordered_map<int, int> num_counter;
        for(int &a:A) {
            num_counter[a] += 1;
            max_freq = max(max_freq, num_counter[a]);
        }

        // Put into buckets, label = frequency
        vector<vector<int>> buckets(max_freq+1, vector<int>());
        for(auto &[a, freq]: num_counter) {
            buckets[freq].push_back(a);
        }

        // Sort buckets by descending frequency
        reverse(begin(buckets), end(buckets));

        // Get top K
        vector<int> ans;
        for(vector<int> &bucket: buckets) {
            for(int &x:bucket) {
                ans.push_back(x);
                if(ans.size() == K) return ans;
            }
        }

        return ans;
    }
};
```


## 1.2 Stream - Heavy hitters Problem
- The array becomes a stream: $N \to \infty$
    - Given a stream of items: $S = (i_1, i_2, i_3, \dots, i_N)$
    - Output: the k most frequent items
- Constraint: storage size $O(k \log N)$

#### Solution
- If we know the frequency of each item, we can maintain the top k in a min heap of size k
    + **Note**: this needs a special min heap of `{key, val}` pairs
    + Sorted by `val`
    + Searchable by `key`

```python
'''
Maintain a min heap
    max_size = k
    entries (i_p, f_p)
    Sorted by f_p: [top] f_min ....
'''
H = min_heap()  # assumed: ordered by val, searchable by key

def get_top_k(new_item):
    i_p = new_item
    f_p = get_frequency(i_p)

    if i_p in H:
        H.update(i_p, f_p)
    elif H.size() < k:
        H.add(i_p, f_p)
    else:
        i_min, f_min = H.get_min()

        if f_min < f_p:
            H.pop_min()
            H.add(i_p, f_p)

    # Most frequent first
    return sorted(H, key=lambda entry: entry[1], reverse=True)
```
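
The "special min heap" can be sketched as a binary heap plus a dictionary from key to heap position. The code below is one possible sketch, not a definitive implementation: the names `IndexedMinHeap` and `top_k_stream` are made up for illustration, and an exact dictionary of counts stands in for `get_frequency`, so it does not yet meet the storage constraint.

```python
class IndexedMinHeap:
    """Min heap of [val, key] entries ordered by val, with lookup by key."""

    def __init__(self):
        self.heap = []  # list of [val, key]
        self.pos = {}   # key -> index of its entry in self.heap

    def __len__(self):
        return len(self.heap)

    def __contains__(self, key):
        return key in self.pos

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.heap[i][0] >= self.heap[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.heap[child][0] < self.heap[smallest][0]:
                    smallest = child
            if smallest == i:
                break
            self._swap(i, smallest)
            i = smallest

    def add(self, key, val):
        self.heap.append([val, key])
        self.pos[key] = len(self.heap) - 1
        self._sift_up(len(self.heap) - 1)

    def update(self, key, val):
        i = self.pos[key]
        self.heap[i][0] = val
        self._sift_up(i)
        self._sift_down(self.pos[key])

    def get_min(self):
        val, key = self.heap[0]
        return key, val

    def pop_min(self):
        key, val = self.get_min()
        self._swap(0, len(self.heap) - 1)
        self.heap.pop()
        del self.pos[key]
        if self.heap:
            self._sift_down(0)
        return key, val


def top_k_stream(stream, k):
    counts = {}  # exact counts, a stand-in for get_frequency
    H = IndexedMinHeap()
    for item in stream:
        counts[item] = counts.get(item, 0) + 1
        f = counts[item]
        if item in H:
            H.update(item, f)
        elif len(H) < k:
            H.add(item, f)
        elif H.get_min()[1] < f:
            H.pop_min()
            H.add(item, f)
    # (key, val) pairs, most frequent first
    return sorted(((key, val) for val, key in H.heap), key=lambda kv: -kv[1])


print(top_k_stream([1, 2, 1, 3, 1, 2, 4, 2, 1], 2))  # [(1, 4), (2, 3)]
```

The dictionary `pos` is what makes `item in H` and `update` cheap; a plain `heapq` list would need a linear scan to find an existing key.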

#### Approximate way to implement `get_frequency`
- An exact `get_frequency(item)` needs O(N) space in the worst case (one counter per distinct item)
- An approximate `get_frequency` can fit in O(k log N) space




#### Count sketch
- t hash functions `h_i(item)`: $h_1, h_2, \dots, h_t$
    + Each function maps an item to one of b buckets: $\{1, 2, \dots, b\}$ (uniformly distributed)
- t hash functions `s_i(item)`: $s_1, s_2, \dots, s_t$
    + Each function maps an item to a sign in $\{-1, +1\}$ (uniformly distributed)
- C: counter matrix of size $t \times b$

<img src="./assets/1.png" width="500"/>


```python
def update(item):
    for i in range(t):  # all t rows: 0 .. t-1
        bucket = h[i](item)
        sign = s[i](item)
        C[i][bucket] += sign

def get_approx_frequency(item):
    return median(C[i][h[i](item)] * s[i](item) for i in range(t))
```
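
A runnable version of the pseudocode above might look like the following. This is a sketch under assumptions that are not in the notes: items are non-negative integers, each hash is an affine function modulo the Mersenne prime $2^{31} - 1$ with coefficients drawn from a seeded `random.Random`, and `t` is odd so the median is one of the row estimates. The class name `CountSketch` and its parameters are illustrative.

```python
import random
from statistics import median

class CountSketch:
    P = 2_147_483_647  # the Mersenne prime 2^31 - 1 (assumed modulus)

    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.t, self.b = t, b
        self.C = [[0] * b for _ in range(t)]  # t x b counter matrix
        # One (a, c) pair per row for the bucket hash, one for the sign hash.
        self.h_params = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(t)]
        self.s_params = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(t)]

    def _bucket(self, i, item):
        a, c = self.h_params[i]
        return (a * item + c) % self.P % self.b

    def _sign(self, i, item):
        a, c = self.s_params[i]
        return 1 if (a * item + c) % self.P % 2 == 0 else -1

    def update(self, item, count=1):
        for i in range(self.t):
            self.C[i][self._bucket(i, item)] += self._sign(i, item) * count

    def estimate(self, item):
        return median(
            self.C[i][self._bucket(i, item)] * self._sign(i, item)
            for i in range(self.t)
        )

sketch = CountSketch(t=5, b=64)
for _ in range(7):
    sketch.update(42)
print(sketch.estimate(42))  # 7: with a single distinct item there are no collisions
```

With more distinct items than buckets, collisions add signed noise to individual rows, and the median across rows is what keeps the estimate close to the true count.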

#### How Count Sketch works
- Suppose we use a single counter $C$ (one hash function $h$ that sends every item to the same bucket) with a uniformly distributed sign hash $s$

```python
C = 0

def update(item):
    global C
    C += s(item)

def get_frequency(item):
    return C * s(item)
```

- `get_frequency(item)` returns a random variable. Its expectation is $f_i$, the true frequency of the item, so the estimate is unbiased; its variance comes from the other items that share the counter

$$E[C \cdot s(\text{item})] = f_i$$
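
A short derivation of why the single-counter estimate is unbiased. It assumes (this is not stated in the notes) that the signs $s(j)$ are pairwise independent and uniform on $\{-1, +1\}$. Here $f_j$ is the frequency of item $j$ and $i$ is the queried item.

$$C = \sum_j f_j \, s(j)$$

$$C \cdot s(i) = f_i \, s(i)^2 + \sum_{j \neq i} f_j \, s(j) \, s(i) = f_i + \sum_{j \neq i} f_j \, s(j) \, s(i)$$

$$E[C \cdot s(i)] = f_i + \sum_{j \neq i} f_j \, E[s(j) \, s(i)] = f_i$$

$$Var[C \cdot s(i)] = \sum_{j \neq i} f_j^2$$

The cross terms vanish in expectation because $E[s(j) \, s(i)] = 0$ for $j \neq i$, and the variance is exactly the squared mass of the other items landing in the same counter.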

- To reduce the variance
    + Increase the number of $h_i$: $t$
    + Use median instead of mean (paper)
    + Increase the number of buckets: $b$

# 2. Count sketch Applied Problem: Nice vs Naughty
- Given 2 lists and a threshold:
    + List A of size N (format: `id number_of_good_behaviors`): the number of good behaviors of child `id` within a year
    + List B of size N (format: `id number_of_bad_behaviors`): the number of bad behaviors of child `id` within a year
    + A threshold

- Answer Q queries (format: `id`)
    + If number of good behaviors - number of bad behaviors >= threshold for child `id`: output 1 (nice)
    + Otherwise the child is naughty: output 0
- **Note**
    - N is extremely large (a stream)
    - ids do not arrive in sorted order

#### Input format

```
N
threshold
id number_of_good_behaviors
id number_of_good_behaviors
...
id number_of_bad_behaviors
id number_of_bad_behaviors
Q
id_1 id_2 id_3 ...
```

- Example Input

```
3
2
3 42
8 50001
11 230040
8 50000
3 40
11 230040
2
8 3
```

- Example Output

```
0 1
```

- Explanation
    + There are three children, with ids 3, 8, and 11
    + Child 8 is naughty since 50001 − 50000 < 2
    + Child 3 is nice since 42 − 40 $\geq$ 2

## Solution
- Apply Count Sketch data structure
    + t = 3
    + b = 21157

```python
def update(C, id, number_of_behaviors):
    for i in range(t):  # all t rows: 0 .. t-1
        bucket = h[i](id)
        sign = s[i](id)
        C[i][bucket] += sign * number_of_behaviors

def estimate(C, id):
    return median(C[i][h[i](id)] * s[i](id) for i in range(t))
```



#### Code
```C++
#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;

const int BUCKETS = 21157;

// Hash arithmetic is done in long long: a*child_id overflows a 32-bit int,
// which is undefined behavior and can yield a negative bucket index.
int h_0(int child_id) {
    long long p = 982451933;
    long long a = 982452277, b = 982453051;
    return (a*child_id + b) % p % BUCKETS;
}
int h_1(int child_id) {
    long long p = 982453117;
    long long a = 982453601, b = 982453393;
    return (a*child_id + b) % p % BUCKETS;
}
int h_2(int child_id) {
    long long p = 982453397;
    long long a = 982462417, b = 982452479;
    return (a*child_id + b) % p % BUCKETS;
}

int s_0(int child_id) {
    long long p = 1000000007;
    long long a = 1190494759, b = 1190492651;

    int res = (a*child_id + b) % p % 2;
    if(res == 0) return 1;
    return -1;
}
int s_1(int child_id) {
    long long p = 1190485151;
    long long a = 1190485453, b = 1190485633;

    int res = (a*child_id + b) % p % 2;
    if(res == 0) return 1;
    return -1;
}
int s_2(int child_id) {
    long long p = 1190486201;
    long long a = 1190469499, b = 1190469689;

    int res = (a*child_id + b) % p % 2;
    if(res == 0) return 1;
    return -1;
}


void update(vector<vector<int>> &C, int child_id, int val) {
    C[0][h_0(child_id)] = (C[0][h_0(child_id)] + s_0(child_id)*val);
    C[1][h_1(child_id)] = (C[1][h_1(child_id)] + s_1(child_id)*val);
    C[2][h_2(child_id)] = (C[2][h_2(child_id)] + s_2(child_id)*val);
}
int estimate(vector<vector<int>> &C, int child_id) {
    int val_0 = C[0][h_0(child_id)] * s_0(child_id);
    int val_1 = C[1][h_1(child_id)] * s_1(child_id);
    int val_2 = C[2][h_2(child_id)] * s_2(child_id);

    // Return median
    vector<int> ans({val_0, val_1, val_2});
    sort(ans.begin(), ans.end());
    return ans[1];
}


void solve() {
    int N; cin >> N;
    int threshold; cin >> threshold;

    // List A: Good
    vector<vector<int>> A(3, vector<int>(BUCKETS, 0));
    int child_id, val;
    for(int x=0; x<N; ++x) {
        cin >> child_id >> val;
        update(A, child_id, val);
    }

     // List B: Bad
    vector<vector<int>> B(3, vector<int>(BUCKETS, 0));
    for(int x=0; x<N; ++x) {
        cin >> child_id >> val;
        update(B, child_id, val);
    }

    // Ans queries
    int Q; cin >> Q;
    for(int q=0; q<Q; ++q) {
        cin >> child_id;

        int est_A = estimate(A, child_id);
        int est_B = estimate(B, child_id);

        if(est_A - est_B >= threshold) cout << "1 ";
        else cout << "0 ";
    }
}
```
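
Because a count sketch is linear in its updates, one variation is to keep a single sketch of `good - bad` instead of two separate sketches. The Python sketch below tries this on the example input, reusing the hash constants from the C++ code. The function `nice_or_naughty` and its list-based interface are made up for illustration, and this estimator is not guaranteed to give the same answers as `est_A - est_B` when collisions occur.

```python
from statistics import median

BUCKETS = 21157

# (a, b, p) triples copied from the C++ hash functions above.
H_PARAMS = [
    (982452277, 982453051, 982451933),
    (982453601, 982453393, 982453117),
    (982462417, 982452479, 982453397),
]
S_PARAMS = [
    (1190494759, 1190492651, 1000000007),
    (1190485453, 1190485633, 1190485151),
    (1190469499, 1190469689, 1190486201),
]

def bucket(i, child_id):
    a, b, p = H_PARAMS[i]
    return (a * child_id + b) % p % BUCKETS

def sign(i, child_id):
    a, b, p = S_PARAMS[i]
    return 1 if (a * child_id + b) % p % 2 == 0 else -1

def update(C, child_id, val):
    for i in range(len(C)):
        C[i][bucket(i, child_id)] += sign(i, child_id) * val

def estimate(C, child_id):
    return median(C[i][bucket(i, child_id)] * sign(i, child_id) for i in range(len(C)))

def nice_or_naughty(good, bad, threshold, queries):
    """good / bad: lists of (id, count). Returns 1 (nice) or 0 (naughty) per query."""
    D = [[0] * BUCKETS for _ in range(3)]  # one sketch holding good - bad
    for child_id, val in good:
        update(D, child_id, val)
    for child_id, val in bad:
        update(D, child_id, -val)  # bad behaviors are subtracted
    return [1 if estimate(D, q) >= threshold else 0 for q in queries]

good = [(3, 42), (8, 50001), (11, 230040)]
bad = [(8, 50000), (3, 40), (11, 230040)]
print(nice_or_naughty(good, bad, 2, [8, 3]))  # [0, 1]
```

The three ids in the example fall into distinct buckets in every row, so the estimates here are exact; on a real stream the answers are only approximate.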