# Clustering and Representatives

A typical approach to designing a streaming algorithm is to maintain a succinct, updatable summary of the stream so far. This summary might be composed of **representatives** of the stream.

Suppose the stream comprises a collection of points in space and you want to cluster the points so that similar items are grouped together. One approach might be to choose a representative from each cluster.

Clustering is related to another problem called **facility location**:
* If we want to serve a large family of potential customers, where do we open our warehouses?
* Or fire stations?

# K-Center

In this subject, we focus mostly on the k-center objective. We can build at most $k$ warehouses/facilities/cluster centers.

We seek to minimize the **worst** service cost over all customers, where a customer's service cost is the distance to its nearest facility. In other words, we focus only on the customer that is furthest from all the facilities (as when siting fire stations).

That is, for each customer, find the nearest serving point. Then find the customer with the largest distance to their serving point; this distance is the cost we want to minimize.
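The objective described above can be written down directly. The following is a minimal sketch, assuming points are coordinate tuples and distance is Euclidean; the function name `kcenter_cost` and the sample data are made up for illustration.

```python
import math

def kcenter_cost(points, centers):
    """k-center cost: the largest distance from any point to its nearest center.

    Assumes Euclidean distance between coordinate tuples.
    """
    return max(min(math.dist(p, c) for c in centers) for p in points)

# Hypothetical customers and facilities on a line.
customers = [(0, 0), (1, 0), (5, 0), (6, 0)]
facilities = [(0.5, 0), (5.5, 0)]
print(kcenter_cost(customers, facilities))  # 0.5
```

Every customer here is exactly 0.5 away from its nearest facility, so the worst service cost is 0.5.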

First, we cannot hope to solve this problem **exactly** in a reasonable amount of time: k-center is NP-hard. The best guarantee achievable in polynomial time (unless P = NP) is an answer at most **twice** as costly as the optimal.

Here's an algorithm that achieves a solution within twice the optimal cost.

```
Pick a point arbitrarily

While we have picked fewer than k points:
    Pick the point that is farthest from the points picked so far

Return the k picked points
```
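The greedy procedure above could be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: it assumes Euclidean distance, takes the first point as the arbitrary starting pick, and the name `greedy_k_center` is hypothetical.

```python
import math

def greedy_k_center(points, k):
    """Farthest-first traversal: pick k centers from `points` (all held in memory)."""
    centers = [points[0]]  # arbitrary first pick (assumption: use the first point)
    # nearest[i] = distance from points[i] to its closest center picked so far
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # The next center is the point farthest from all centers picked so far.
        i = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[i])
        nearest = [min(d, math.dist(p, points[i])) for p, d in zip(points, nearest)]
    return centers

points = [(0, 0), (1, 0), (10, 0), (11, 0), (20, 0)]
print(greedy_k_center(points, 3))  # [(0, 0), (20, 0), (10, 0)]
```

Note that the sketch scans the full list of points on every iteration, which is why this approach does not carry over to a stream.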

But this algorithm requires us to store all the points in memory.

Here we introduce two algorithms suitable for streams: the doubling algorithm, using $O(k)$ space, and Guha's $2+\epsilon$ approximation algorithm, using $O(\frac{k}{\epsilon})$ space.

At all times, we maintain an estimate $\tau$ of the cost of the solution.

We rely on the following **Lemma**:

If the input contains $k+1$ points that are pairwise at distance at least $t$ from one another, then no matter which set of $k$ representatives we choose (centers may differ from those points), the solution cost must be at least $t/2$.

Proof: with only $k$ centers and $k+1$ points, by the pigeonhole principle at least one center $c$ must serve two of the points, say $p$ and $q$. If the solution cost were less than $t/2$, then both $d(p,c)$ and $d(q,c)$ would be less than $t/2$. By the triangle inequality, $d(p,q)\le d(p,c)+d(c,q) < t/2 + t/2 = t$, which contradicts $d(p,q)\ge t$.

# Doubling Algorithm

```
Initialize by taking the first k+1 elements from the stream and letting (y,z) be the closest pair among them.

Let tau be d(y,z) and let the representatives R be those k+1 elements, except z.

For each new item x:
    If its minimum distance to an element of R > 2 tau:
        Add x to R
        While |R| > k:
            Double tau
            Find a maximal subset R* of R so that for every pair of distinct items in R*, their distance is at least tau
            Let R be R*
```
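The pseudocode above might be turned into runnable Python along the following lines. This is a sketch under stated assumptions: points are coordinate tuples with Euclidean distance, the first $k+1$ stream items are distinct (otherwise $\tau$ starts at 0 and doubling never makes progress), and the maximal subset is found greedily. The names `doubling_k_center` and `maximal_separated_subset` are hypothetical.

```python
import math
from itertools import combinations, islice

def maximal_separated_subset(reps, tau):
    """Greedily keep points that are at least tau from every point kept so far."""
    kept = []
    for p in reps:
        if all(math.dist(p, q) >= tau for q in kept):
            kept.append(p)
    return kept

def doubling_k_center(stream, k):
    """Return (representatives, tau) after one pass over the stream."""
    stream = iter(stream)
    first = list(islice(stream, k + 1))
    if len(first) < k + 1:
        raise ValueError("need at least k+1 items to initialize")
    # (y, z) is the closest pair among the first k+1 items; drop z.
    i, j = min(combinations(range(k + 1), 2),
               key=lambda ij: math.dist(first[ij[0]], first[ij[1]]))
    tau = math.dist(first[i], first[j])
    if tau == 0:
        raise ValueError("assumption violated: first k+1 items must be distinct")
    reps = first[:j] + first[j + 1:]
    for x in stream:
        if min(math.dist(x, r) for r in reps) > 2 * tau:
            reps.append(x)
            while len(reps) > k:
                tau *= 2
                reps = maximal_separated_subset(reps, tau)
    return reps, tau

stream = [(0, 0), (1, 0), (3, 0), (10, 0), (11, 0), (30, 0)]
print(doubling_k_center(stream, 2))  # ([(0, 0), (30, 0)], 16.0)
```

In this small run the item furthest from the returned representatives is `(11, 0)`, at distance 11, which is within $2\tau = 32$. The greedy choice of the maximal subset is one option among several; any maximal subset satisfies the analysis.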

The algorithm has three properties:

1. For all pairs of distinct items in $R$, their distance is at least $\tau$.
2. The k-center cost of $R$ with respect to the whole stream (so far) is at most $2\tau$.
3. After initialization, and just before each reset of $R$ (doubling of $\tau$), an optimal solution has cost at least $\tau/2$.

From property 2, at the end of the stream, the k-center cost of $R$ with respect to the whole stream is at most $2\tau$:
* Call the final $R$ $\hat{R}$ and the final $\tau$ $\hat{\tau}$; let $\sigma$ be the stream and $\Delta$ the k-center cost function.
* $\Delta(\sigma, \hat{R})\le 2\hat{\tau}$

From property 3, just before the last time we update $R$, an optimal solution has cost at least $\tau/2$, where $\tau$ is the estimate at that moment. When we update $R$ for the final time, we have $\hat{\tau}=2\tau$, so the ratio of $\Delta(\sigma, \hat{R})$ to the optimal cost is at most $\frac{2\hat{\tau}}{\tau/2}=\frac{4\tau}{\tau/2}=8$.

That is, our solution has cost at most **8** times the optimal cost.

We still need to show that Property 2 holds after $R$ is reset, i.e. that the cost of the new $R$ on the stream is at most $2\tau$.

Focus on the old $R$ and the new $R^*$. Since Property 2 held before the reset with the old value of $\tau$, after $\tau$ was doubled but before we chose the new $R^*$ we have $\Delta(\sigma, R)\le\tau$.

Consider an arbitrary item $x$ in the stream $\sigma$, and let $x_R$ be its closest point in $R$, so that $d(x, x_R)\le\tau$:
* If $x_R$ is also in $R^*$, then $x$ is within $\tau$ of $R^*$.
* If $x_R\notin R^*$, then since $R^*$ is a maximal subset of $R$ whose points are pairwise at distance at least $\tau$, every point of $R$ outside $R^*$, including $x_R$, is within $\tau$ of some point of $R^*$ (otherwise it could be added to $R^*$).
* So by the triangle inequality, $x$ is within $2\tau$ of $R^*$.

