# Guha's Algorithm

Consider the following **simple** algorithm, parameterized by a threshold $\tau$ and the number of centers $k$:

```
Maintain a set R of representatives of the stream that are at least 2tau apart from one another
    If the stream contains an item >= 2tau from every item in R, then add it to R
If at some stage R has size at least k+1, then FAIL
Otherwise, output R
```
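The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `simple_k_center`, the use of `None` to signal FAIL, and the default Euclidean metric (`math.dist` on coordinate tuples) are all assumptions made for the example.

```python
from math import dist

def simple_k_center(stream, k, tau, metric=dist):
    """Return at most k representatives, or None to signal FAIL (hypothetical API)."""
    reps = []
    for x in stream:
        # x becomes a representative if it is at least 2*tau from every current one
        if all(metric(x, r) >= 2 * tau for r in reps):
            reps.append(x)
            if len(reps) > k:
                # k+1 points pairwise at least 2*tau apart: FAIL
                return None
    return reps
```

For instance, on the one-dimensional stream `[(0,), (1,), (10,), (11,), (20,)]` with `k=2`, a threshold of `tau=1` fails, while `tau=6` succeeds with two representatives.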

If the algorithm outputs $R$, then $R$ is a solution of cost at most $2\tau$: it never held more than $k$ items, and every other item in the stream is within $2\tau$ of some item in $R$ (otherwise it would have been added).

If it returns FAIL, the **Lemma** states that there is no solution of cost less than $\tau$ (that is, the optimal cost is at least $\tau$).
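The Lemma itself is not restated in these notes. A standard pigeonhole argument, sketched here with $C$ denoting a hypothetical set of $k$ centers of cost below $\tau$ and $d$ the distance function (both names introduced only for this sketch), would go as follows. On FAIL, $R$ holds $k+1$ points that are pairwise at least $2\tau$ apart, so two of them, $x$ and $y$, must be served by the same center $c\in C$, and

$$d(x,y)\le d(x,c)+d(c,y)<\tau+\tau=2\tau$$

which contradicts $d(x,y)\ge 2\tau$.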

## Parallel Processing

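We produce an $(\alpha+O(\epsilon))$-factor approximation of the optimal cost, where $\alpha$ is the approximation factor of the **simple** algorithm: a run that succeeds with threshold $\tau$ has cost at most $\alpha\tau$ (here $\alpha=2$).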

**Initialize** by determining a lower bound $c$ on optimal cost. **Let** $p=\lceil\log_{1+\epsilon}(\frac{\alpha}{\epsilon})\rceil$. We will at all times maintain $p$ instances of the **simple** algorithm.

Initially, the thresholds for the instances are: $c(1+\epsilon), c(1+\epsilon)^2, \cdots, c(1+\epsilon)^p$

Whenever an instance returns FAIL:
* Raise its threshold by a factor of $(1+\epsilon)^p$
* The instance cannot revisit the items it has already seen, so it treats its current representatives $R$ as a stand-in for the part of the stream already read

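For example, if the instance with threshold $c(1+\epsilon)$ fails, its new threshold is $c(1+\epsilon)^{p+1}$, so the $p$ instances now cover the thresholds from $c(1+\epsilon)^2$ to $c(1+\epsilon)^{p+1}$.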

Because the thresholds keep increasing, some instance has to (eventually) succeed.
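The procedure in this section could be sketched roughly as below. Several things are assumptions of the sketch rather than part of the notes: the lower bound `c` is passed in as a positive number instead of being computed, `k` is at least 1, `alpha` defaults to 2, items are coordinate tuples compared with `math.dist`, and a failed instance rebuilds its set by greedily re-filtering its old representatives under the raised threshold (one reading of "relies on the representatives $R$"). The names `offer` and `parallel_k_center` are hypothetical.

```python
from math import ceil, dist, log

def offer(inst, x, k, step, metric):
    """Offer item x to one instance; on FAIL, raise its threshold and rebuild."""
    if all(metric(x, r) >= 2 * inst["tau"] for r in inst["reps"]):
        inst["reps"].append(x)
    while len(inst["reps"]) > k:  # FAIL: k+1 representatives
        inst["tau"] *= step  # raise the threshold by (1 + eps)^p
        old, inst["reps"] = inst["reps"], []
        # The old representatives stand in for the stream read so far.
        for r in old:
            if all(metric(r, s) >= 2 * inst["tau"] for s in inst["reps"]):
                inst["reps"].append(r)

def parallel_k_center(stream, k, c, eps, alpha=2.0, metric=dist):
    """Return (threshold, representatives) of the smallest-threshold instance."""
    p = ceil(log(alpha / eps) / log(1 + eps))
    step = (1 + eps) ** p
    # Initial thresholds c(1+eps), c(1+eps)^2, ..., c(1+eps)^p
    instances = [{"tau": c * (1 + eps) ** i, "reps": []} for i in range(1, p + 1)]
    for x in stream:
        for inst in instances:
            offer(inst, x, k, step, metric)
    # Every instance currently holds at most k representatives;
    # report the one with the smallest threshold.
    best = min(instances, key=lambda inst: inst["tau"])
    return best["tau"], best["reps"]
```

A call such as `parallel_k_center([(0,), (1,), (10,), (11,), (20,), (21,)], k=2, c=0.1, eps=0.5)` returns a threshold together with at most two representatives. Each instance stores only its threshold and at most $k+1$ points at any time.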

## Space requirement

Initially, $O(k)$ space to determine a lower bound

Then $p$ copies of an $O(k)$-space **simple** algorithm

Hence, with $\alpha$ being constant, the total space needed is
$$O\left(k\log_{1+\epsilon}\frac{\alpha}{\epsilon}\right)=O\left(\frac{k\log\frac{1}{\epsilon}}{\log(1+\epsilon)}\right)=O\left(\frac{k}{\epsilon}\log\frac{1}{\epsilon}\right)$$

using the fact that $\log(1+\epsilon)\approx\epsilon$ for small $\epsilon$.
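As a quick numerical illustration of how the number of instances grows, the following sketch computes $p$ for a few values of $\epsilon$ and compares it with $\frac{1}{\epsilon}\log\frac{\alpha}{\epsilon}$. The helper name `num_instances` and the choice $\alpha=2$ are assumptions made for the example.

```python
from math import ceil, log

def num_instances(alpha, eps):
    """p = ceil(log_{1+eps}(alpha / eps)), the number of parallel instances."""
    return ceil(log(alpha / eps) / log(1 + eps))

for eps in (0.5, 0.1, 0.01):
    p = num_instances(2.0, eps)
    estimate = (1 / eps) * log(2.0 / eps)
    # The ratio p / estimate approaches 1 as eps shrinks, since log(1+eps) ~ eps
    print(f"eps={eps}: p={p}, (1/eps)*log(alpha/eps)={estimate:.1f}")
```

By construction, $p$ is the smallest integer with $(1+\epsilon)^p\ge\alpha/\epsilon$.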

## Performance analysis

Focus on the instance with the smallest threshold that succeeds

Let $t$ be its final threshold
* The threshold of this **instance** might have increased several times as the stream flowed
* Let's say it increased $j$ times
* So it went through the thresholds $t_1, t_2, \cdots, t_{j+1}=t$
* Break up the stream $\sigma$ into $j+1$ phases, each corresponding to a different threshold: $\sigma_1, \sigma_2, \cdots, \sigma_{j+1}$

In particular, threshold $t_i=t_{j+1}/((1+\epsilon)^p)^{j+1-i}$

And we choose $p=\lceil\log_{1+\epsilon}(\frac{\alpha}{\epsilon})\rceil$ so that $(1+\epsilon)^p\ge\alpha/\epsilon$

Hence $t_i\le(\frac{\epsilon}{\alpha})^{j+1-i}t$

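Let $R_i$ be the summary (set of representatives) held at the end of phase $i$, with $R_0=\emptyset$.

The instance builds $R_i$ from two inputs:
* The "new" part of the stream in phase $i$: $\sigma_i$
* The summary of the previous parts of the stream: $R_{i-1}$

So the k-center cost it directly controls is $\Delta(R_{i-1}\cdot\sigma_i, R_i)$, the cost of covering the sequence $R_{i-1}\cdot\sigma_i$ with the centers $R_i$.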

The cost is $\Delta(R_{i-1}\cdot\sigma_i, R_i)\le\alpha t_i\le\alpha(\frac{\epsilon}{\alpha})^{j+1-i}t$

The next step is to relate $\Delta(R_{i-1}\cdot\sigma_i, R_i)$ to $\Delta(\sigma_1\cdot\sigma_2\cdot\cdots\cdot\sigma_i, R_i)$

We claim that the cost between $\sigma_1\cdot\sigma_2\cdot\cdots\cdot\sigma_i$ and $R_i$ is at most the cost between $R_{i-1}\cdot\sigma_i$ and $R_i$ plus the cost between $\sigma_1\cdot\sigma_2\cdot\cdots\cdot\sigma_{i-1}$ and $R_{i-1}$.

If $\Delta(\sigma_1\cdot\sigma_2\cdot\cdots\cdot\sigma_i, R_i)$ is attained by an item of $\sigma_i$, that is, it equals $\Delta(\sigma_i, R_i)$, then the claim holds because $\Delta(\sigma_i, R_i)\le\Delta(R_{i-1}\cdot\sigma_i, R_i)$.

If $\Delta(\sigma_1\cdot\sigma_2\cdot\cdots\cdot\sigma_i, R_i)$ is attained by an item of the earlier phases, that is, it equals $\Delta(\sigma_1\cdot\sigma_2\cdot\cdots\cdot\sigma_{i-1}, R_i)$
* Then the triangle inequality tells us that this cost is at most $\Delta(\sigma_1\cdot\sigma_2\cdot\cdots\cdot\sigma_{i-1}, R_{i-1})$ plus the cost from $R_{i-1}$ to $R_i$, and the latter is at most $\Delta(R_{i-1}\cdot\sigma_i, R_i)$

By recursion, $\Delta(\sigma_1\cdot\cdots\cdot\sigma_{j+1}, R_{j+1})$ is at most 
$$\Delta(R_0\cdot\sigma_1, R_1)+\Delta(R_1\cdot\sigma_2, R_2)+\cdots+\Delta(R_j\cdot\sigma_{j+1}, R_{j+1})$$

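And we can bound each of these terms using $\Delta(R_{i-1}\cdot\sigma_i, R_i)\le\alpha(\frac{\epsilon}{\alpha})^{j+1-i}t$, which gives at most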
$$\alpha(\frac{\epsilon}{\alpha})^jt+\alpha(\frac{\epsilon}{\alpha})^{j-1}t+\cdots+\alpha(\frac{\epsilon}{\alpha})^0t$$

We can take $\alpha t$ out as a common factor, and we get
$$\alpha t\left(1+\frac{\epsilon}{\alpha}+\frac{\epsilon^2}{\alpha^2}+\cdots\right)=\alpha t+\epsilon t+\text{smaller terms}=(\alpha+O(\epsilon))t$$
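To make the "smaller terms" explicit, one can sum the geometric series in closed form. The extra assumption $\epsilon\le\alpha/2$ is introduced here only to get a clean constant; it is not stated in the notes.

$$\alpha t\sum_{m=0}^{j}\left(\frac{\epsilon}{\alpha}\right)^m\le\frac{\alpha t}{1-\epsilon/\alpha}=\left(\alpha+\frac{\alpha\epsilon}{\alpha-\epsilon}\right)t\le(\alpha+2\epsilon)t$$

The last step uses $\frac{\alpha}{\alpha-\epsilon}\le 2$, which holds exactly when $\epsilon\le\alpha/2$.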

We choose $t$ to be the smallest succeeding threshold, so the cost of an optimal solution is at least $t/(1+\epsilon)$ (the last failing threshold).

Hence our solution is within a factor of $(1+\epsilon)(\alpha+O(\epsilon))=\alpha+O(\epsilon)$ of optimal.