## Spectral clustering

The most common one (von Luxburg):

1. Similarity matrix $S$
2. Construct similarity graph -> Adjacency matrix $W$
3. Graph Laplacian $L = D - W$, where $D$ is diagonal matrix of node degrees.
4. Get $V$ by computing the first $k$ eigenvectors
5. Cluster the nodes with $k$-means

## How to get adjacency matrix
- $\epsilon$-neighborhood graph
- $k$-nearest neighbor graph
    - directed
    - undirected
- fully connected graph (a weighted graph)
- (MST, and tons of others)

## To get graph Laplacian
- Ordinay: $$L = D - W$$
- Normalized: $$L = I - D^{-1}W$$
- Symmetric: $$L = I - D^{-1/2}WD^{-1/2}$$
- Generalization for directed graph based on random walk theory
    $$\Theta = (\Pi^{1/2}P\Pi^{-1/2} + \Pi^{-1/2}P^T\Pi^{1/2})/2$$
    $$L = I - \Theta$$
- ...

## Eigendecomposition / Singular Value Decomposition

- diagonal, triangular, tridiagonal, Hessenburg form, ...
- spar.........................se

### Power iteration

Gives the eigenvector corresponding to the largest eigenvalue.

```
some initial x0,
for k = 1, 2, ...:
    y[k] = A x[k-1]
    x[k] = y[k] / norm(y[k-1])
```

### QR Iteration

Gives all eigenvalues and eigenvectors of A:

```
A[0] = A
for k=1, 2, ...:
    Compute QR factorization: Q[k]R[k] = A[k-1]
    A[k] = R[k]Q[k]
```

### Lanczos/Arnoldi Iteration

1. Basically it produces a upper Hessenburg matrix column by column using **only matrix multiplication.**
2. We use the eigenvalues of the result Hessenburg matrix as approximation.
3. $\text{computeSVD}$ in Spark (ARPARCK)

---





---









---

In [48]:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName='spectral')
spark = pyspark.sql.SparkSession(sc)

In [50]:
from pyspark.mllib.linalg.distributed import MatrixEntry, CoordinateMatrix
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.mllib.clustering import PowerIterationClustering, PowerIterationClusteringModel

In [51]:
from pyspark.sql.functions import sqrt
from pyspark.ml.evaluation import ClusteringEvaluator

In [52]:
txt = sc.textFile('./data/email-Eu-core.txt')
txt.take(10)

['0 1', '2 3', '2 4', '5 6', '5 7', '8 9', '10 11', '12 13', '12 14', '15 16']

## Data

In [53]:
txt = txt.map(lambda x: x.split(' ')).map(lambda x: (int(x[0]) ,int(x[1])))
txt.take(10)

[(0, 1),
 (2, 3),
 (2, 4),
 (5, 6),
 (5, 7),
 (8, 9),
 (10, 11),
 (12, 13),
 (12, 14),
 (15, 16)]

In [54]:
N = txt.flatMap(lambda x: [int(xx) for xx in x]).max() + 1
N

1005

## Power Iteration

In [64]:
K = 5
print(K)

5


In [68]:
similarities = txt.map(lambda x: tuple([x[0], x[1], 1.0]))

In [69]:
model = PowerIterationClustering.train(similarities, K, maxIterations=10)

In [72]:
model.assignments().take(20)

[Assignment(id=454, cluster=0),
 Assignment(id=386, cluster=2),
 Assignment(id=522, cluster=2),
 Assignment(id=324, cluster=2),
 Assignment(id=180, cluster=2),
 Assignment(id=320, cluster=2),
 Assignment(id=752, cluster=0),
 Assignment(id=586, cluster=1),
 Assignment(id=408, cluster=1),
 Assignment(id=428, cluster=4),
 Assignment(id=986, cluster=3),
 Assignment(id=996, cluster=2),
 Assignment(id=464, cluster=0),
 Assignment(id=346, cluster=2),
 Assignment(id=14, cluster=2),
 Assignment(id=466, cluster=2),
 Assignment(id=24, cluster=0),
 Assignment(id=520, cluster=1),
 Assignment(id=912, cluster=2),
 Assignment(id=302, cluster=3)]

## Lanczos Iteration

In [55]:
upper_entries = txt.map(lambda x: MatrixEntry(int(x[0]), int(x[1]), 1.0))
lower_entries = txt.map(lambda x: MatrixEntry(int(x[1]), int(x[0]), 1.0))
degrees = upper_entries.map(lambda entry: (entry.i, entry.value)).reduceByKey(lambda a, b: a + b)
W = CoordinateMatrix(upper_entries.union(lower_entries), numCols=N, numRows=N)

In [None]:
entries = degrees.map(lambda x: MatrixEntry(x[0], x[0], x[1]))
D = CoordinateMatrix(entries, numCols=N, numRows=N)
L = D.toBlockMatrix().subtract(W.toBlockMatrix()).toCoordinateMatrix()

In [56]:
entries = degrees.map(lambda x: MatrixEntry(x[0], x[0], 1/x[1]))
D_inv = CoordinateMatrix(entries, numCols=N, numRows=N).toBlockMatrix()
I = CoordinateMatrix(sc.range(N).map(lambda i: MatrixEntry(i, i, 1.0)), numCols=N, numRows=N).toBlockMatrix()
L = I.subtract(D_inv.multiply(W.toBlockMatrix())).toCoordinateMatrix()

In [14]:
entries = degrees.map(lambda x: MatrixEntry(x[0], x[0], 1/sqrt(x[1])))
D_invsq = CoordinateMatrix(entries, numCols=N, numRows=N).toBlockMatrix()
I = sc.range(N).map(lambda i: MatrixEntry(i, i, 1.0), N, N)
tmp = D_invsq.multiply(W.toBlockMatrix()).multiply(D_invsq)
L = I.toBlockMatrix().subtract(tmp)

pyspark.mllib.linalg.distributed.CoordinateMatrix

In [57]:
K = 5
print(K)

5


In [58]:
svd = L.toRowMatrix().computeSVD(k=K, computeU=False)
print(type(svd.s))
print(type(svd.V))

<class 'pyspark.mllib.linalg.DenseVector'>
<class 'pyspark.mllib.linalg.DenseMatrix'>


In [59]:
V = svd.V.toArray().tolist()
VV = spark.createDataFrame(V)
kmeans = KMeans().setK(K).setSeed(1)
vecAssembler = VectorAssembler(inputCols=VV.schema.names, outputCol='features')
VV = vecAssembler.transform(VV)

In [60]:
model = kmeans.fit(VV.select('features'))
clusters = model.transform(VV)

In [61]:
clusters.select('prediction').show()

+----------+
|prediction|
+----------+
|         0|
|         0|
|         0|
|         0|
|         3|
|         3|
|         0|
|         3|
|         0|
|         0|
|         0|
|         3|
|         3|
|         1|
|         1|
|         0|
|         0|
|         2|
|         3|
|         3|
+----------+
only showing top 20 rows



In [62]:
clusters.describe('prediction').show()

+-------+------------------+
|summary|        prediction|
+-------+------------------+
|  count|              1005|
|   mean| 0.408955223880597|
| stddev|0.9934644014343972|
|    min|                 0|
|    max|                 4|
+-------+------------------+



In [42]:
clusters.toPandas().to_csv('out/email_assignment_normalized.csv')

In [63]:
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(clusters)
print(silhouette)

0.7158510975063149


In [47]:
sc.stop()

---
References:
- Von Luxburg, Ulrike. "A tutorial on spectral clustering." Statistics and computing 17, no. 4 (2007): 395-416.