# Expectation-Maximization Clustering

### Maximization
$$
q_{mk} = \frac{\sum\limits_{n=1}^{N} r_{nk}I(t_m \in d_n)}{\sum\limits_{n=1}^{N}r_{nk}};
\alpha_{k} = \frac{1}{N} \sum\limits_{n=1}^{N} r_{nk}
$$
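
The two updates above can also be written as matrix operations. The following is a minimal NumPy sketch under that reading, not the notebook's own implementation: `m_step`, `R_toy`, and `M_toy` are illustrative names, and the binary matrix stands in for the indicator $I(t_m \in d_n)$.

```python
import numpy as np

def m_step(R, M):
    """Vectorized M-step (illustrative helper).

    R: (N, K) responsibilities r_nk
    M: (N, m) binary document-term matrix, M[n, j] = 1 if term j is in document n
    """
    Nk = R.sum(axis=0)       # sum_n r_nk for each cluster, shape (K,)
    alpha = Nk / R.shape[0]  # priors alpha_k
    Q = (M.T @ R) / Nk       # q_mk, shape (m, K)
    return Q, alpha

# Toy input: 3 documents, 2 terms, 2 clusters (made-up values)
R_toy = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
M_toy = np.array([[1, 0], [1, 1], [0, 1]])
Q_toy, alpha_toy = m_step(R_toy, M_toy)
print(alpha_toy)
print(Q_toy)
```

Dividing by `Nk` assumes every cluster has non-zero total responsibility; a cluster that receives no weight would need separate handling.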

### Expectation
$$
r_{nk} = \frac{\alpha_k\left(\prod_{t_m \in d_n}q_{mk}\right)
\left(\prod_{t_m \not\in d_n}(1-q_{mk})\right)}
{\sum\limits_{k'=1}^{K}\alpha_{k'}\left(\prod_{t_m \in d_n}q_{mk'}\right)
\left(\prod_{t_m \not\in d_n}(1-q_{mk'})\right)}
$$
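
Because the formula multiplies one factor per vocabulary term, the products can underflow for large vocabularies. One possible workaround is to evaluate the same expression in log space. The sketch below assumes that approach; `e_step` and the clipping constant `eps` are hypothetical additions, and the clipping means results differ very slightly from the exact formula when some $q_{mk}$ is exactly 0 or 1.

```python
import numpy as np

def e_step(M, Q, alpha, eps=1e-12):
    """Responsibilities r_nk computed in log space (illustrative helper).

    M: (N, m) binary document-term matrix
    Q: (m, K) term probabilities q_mk
    alpha: (K,) priors alpha_k
    eps: assumed clipping constant that keeps log() finite
    """
    Qc = np.clip(Q, eps, 1 - eps)
    # log alpha_k + sum of log q_mk over present terms
    #             + sum of log(1 - q_mk) over absent terms
    log_p = np.log(alpha) + M @ np.log(Qc) + (1 - M) @ np.log(1 - Qc)
    log_p -= log_p.max(axis=1, keepdims=True)  # stabilize before exponentiating
    R = np.exp(log_p)
    return R / R.sum(axis=1, keepdims=True)    # normalize over clusters

# Toy input: 2 documents, 2 terms, 2 clusters (made-up values)
M_toy = np.array([[1, 0], [0, 1]])
Q_toy = np.array([[0.9, 0.1], [0.2, 0.8]])
alpha_toy = np.array([0.5, 0.5])
R_toy = e_step(M_toy, Q_toy, alpha_toy)
print(R_toy.round(3))
```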

### Example
<table>
    <tr>
        <th>DocID</th>
        <th>Tokens</th>
        <th>Class</th>
    </tr>
    <tr>
        <td>0</td>
        <td>apple ios mac book fruit</td>
        <td>A</td>
    </tr>
    <tr>
        <td>1</td>
        <td>apple mac book apple store</td>
        <td>A</td>
    </tr>
    <tr>
        <td>2</td>
        <td>microsoft ibm apple oracle</td>
        <td>A</td>
    </tr>
    <tr>
        <td>3</td>
        <td>apple banana mango fruit</td>
        <td>B</td>
    </tr>
    <tr>
        <td>4</td>
        <td>apple fruit mango</td>
        <td>B</td>
    </tr>
</table>

In [1]:
import numpy as np
from IPython.display import display, HTML

docs = [
    ['apple', 'ios', 'mac', 'book', 'fruit'],
    ['apple', 'mac', 'book', 'apple', 'store'],
    ['microsoft', 'ibm', 'apple', 'oracle'],
    ['apple', 'banana', 'mango', 'fruit'],
    ['apple', 'fruit', 'mango']
]
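# Vocabulary; set order is arbitrary, so the row order of the term tables can vary between runs
terms = list(set([x for y in docs for x in y]))
# Binary document-term matrix: M[n, j] = 1 if term j occurs in document n
M = np.array([[1 if x in y else 0 for x in terms] for y in docs])
N, K, m = len(docs), 2, len(terms)  # documents, clusters, vocabulary size
A = np.zeros(K)       # cluster priors alpha_k
Q = np.zeros((m, K))  # per-cluster term probabilities q_mk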

## Random initialization of $r_{nk}$

In [2]:
R = np.zeros((N, K))
for doc in range(N):
    # K = 2 here, so a single uniform draw a gives the pair (a, 1 - a)
    a = np.random.uniform()
    R[doc] = [a, 1-a]

In [3]:
R

array([[0.05605402, 0.94394598],
       [0.44017889, 0.55982111],
       [0.44729537, 0.55270463],
       [0.0036333 , 0.9963667 ],
       [0.68571744, 0.31428256]])
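
The `[a, 1-a]` initialization only works for two clusters. For a general number of clusters, one option is to draw each row from a flat Dirichlet distribution so that it is non-negative and sums to 1. This is a sketch of that alternative, not what the notebook does; `N_demo`, `K_demo`, `R_init`, and the seed are illustrative.

```python
import numpy as np

N_demo, K_demo = 5, 3           # illustrative sizes: 5 documents, 3 clusters
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Each row is one draw from Dirichlet(1, ..., 1): a random point on the simplex
R_init = rng.dirichlet(np.ones(K_demo), size=N_demo)

print(R_init.shape)        # (5, 3)
print(R_init.sum(axis=1))  # every row sums to 1 (up to rounding)
```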

## Functions

In [4]:
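def maximization(K, R):
    # M-step: update the globals A (priors) and Q (term probabilities) in place from R
    for k in range(K):
        r_k = R[:,k].sum()  # total responsibility assigned to cluster k
        A[k] = r_k / N
        for word in range(m):
            sigma_doc = 0.0
            for doc in range(N):
                # only documents that contain the term contribute
                sigma_doc += R[doc,k] * M[doc,word]
            Q[word][k] = sigma_doc / r_k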

In [5]:
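def estimate(doc, k, Q, A):
    # Unnormalized posterior of cluster k for one document:
    # alpha_k * prod(q_mk over present terms) * prod(1 - q_mk over absent terms)
    q_doc = np.zeros(m)
    for word in range(m):
        if M[doc,word] > 0:
            q_doc[word] = Q[word,k]
        else:
            q_doc[word] = 1 - Q[word,k]
    return A[k] * q_doc.prod()

def expectation(K, Q, A):
    # E-step: normalize the per-cluster estimates so each row of the global R sums to 1
    for doc in range(N):
        k_estimation = np.array([estimate(doc, k, Q, A) for k in range(K)])
        for k in range(K):
            R[doc][k] = k_estimation[k] / k_estimation.sum()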
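
A common way to monitor EM is to track the data log-likelihood, which exact EM updates do not decrease. The sketch below shows one way to compute it for this Bernoulli mixture. It is an assumption-laden add-on rather than part of the notebook: `log_likelihood`, `eps`, and the toy arrays are hypothetical names and values.

```python
import numpy as np

def log_likelihood(M, Q, alpha, eps=1e-12):
    """Sum over documents of log sum_k alpha_k * p(d_n | cluster k).

    M: (N, m) binary document-term matrix, Q: (m, K), alpha: (K,).
    eps is an assumed clipping constant that keeps log() finite.
    """
    Qc = np.clip(Q, eps, 1 - eps)
    # log of alpha_k * prod(q_mk) * prod(1 - q_mk) for every (document, cluster) pair
    log_joint = np.log(alpha) + M @ np.log(Qc) + (1 - M) @ np.log(1 - Qc)
    # log-sum-exp over clusters, then sum over documents
    return float(np.logaddexp.reduce(log_joint, axis=1).sum())

# Toy input: 2 documents, 2 terms, 2 clusters (made-up values)
M_toy = np.array([[1, 0], [0, 1]])
Q_toy = np.array([[0.9, 0.1], [0.2, 0.8]])
alpha_toy = np.array([0.5, 0.5])
ll_toy = log_likelihood(M_toy, Q_toy, alpha_toy)
print(ll_toy)
```

Calling such a function once per iteration and stopping when the value changes by less than a small tolerance is one alternative to a fixed iteration count.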

In [8]:
def to_table(title, data, cols, rows):
    # Build an HTML table with a title; values are rounded to 3 decimals
    header = "<tr>" + "".join(["<th>{}</th>".format(x) for x in [''] + list(cols)]) + "</tr>"
    trs = []
    for i, c in enumerate(rows):
        tr = "<tr>" + "<td>{}</td>".format(c)
        tr += "".join(["<td>{}</td>".format(round(x, 3)) for x in data[i]])
        tr += "</tr>"
        trs.append(tr)
    table = "<h3>{}</h3><table>{}{}</table>".format(
        title,
        header,
        "".join(trs)
    )
    return table

def show(r, q, a):
    # Display the three tables (r_nk, q_mk, priors) side by side
    table = "<table><tr><td style='vertical-align: top;'>{}</td><td style='vertical-align: top;'>{}</td><td style='vertical-align: top;'>{}</td></tr></table>".format(
        r, q, a
    )
    display(HTML(table))

## Start

In [9]:
TR = to_table('$r_{nk}$', R, range(K), range(N))
TQ = to_table('$q_{mk}$', Q, range(K), terms)
TA = to_table('$a_{k}$', [A], range(K), ['priors'])
show(TR, TQ, TA)

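<h3>$r_{nk}$</h3>
<table>
    <tr><th>DocID</th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>0</td><td>0.056</td><td>0.944</td></tr>
    <tr><td>1</td><td>0.44</td><td>0.56</td></tr>
    <tr><td>2</td><td>0.447</td><td>0.553</td></tr>
    <tr><td>3</td><td>0.004</td><td>0.996</td></tr>
    <tr><td>4</td><td>0.686</td><td>0.314</td></tr>
</table>
<h3>$q_{mk}$</h3>
<table>
    <tr><th>Term</th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>mac</td><td>0.0</td><td>0.0</td></tr>
    <tr><td>ibm</td><td>0.0</td><td>0.0</td></tr>
    <tr><td>book</td><td>0.0</td><td>0.0</td></tr>
    <tr><td>banana</td><td>0.0</td><td>0.0</td></tr>
    <tr><td>store</td><td>0.0</td><td>0.0</td></tr>
    <tr><td>microsoft</td><td>0.0</td><td>0.0</td></tr>
    <tr><td>mango</td><td>0.0</td><td>0.0</td></tr>
    <tr><td>oracle</td><td>0.0</td><td>0.0</td></tr>
    <tr><td>fruit</td><td>0.0</td><td>0.0</td></tr>
    <tr><td>ios</td><td>0.0</td><td>0.0</td></tr>
    <tr><td>apple</td><td>0.0</td><td>0.0</td></tr>
</table>
<h3>$a_{k}$</h3>
<table>
    <tr><th></th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>priors</td><td>0.0</td><td>0.0</td></tr>
</table>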


## Iterate

In [10]:
for iteration in range(10):
    maximization(K, R)
    expectation(K, Q, A)
    TR = to_table('$r_{nk}$', R, range(K), range(N))
    TQ = to_table('$q_{mk}$', Q, range(K), terms)
    TA = to_table('$a_{k}$', [A], range(K), ['priors'])
    display(HTML("<h2>ITERATION {}</h2>".format(iteration+1)))
    show(TR, TQ, TA)

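<h2>ITERATION 1</h2>
<h3>$r_{nk}$</h3>
<table>
    <tr><th>DocID</th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>0</td><td>0.014</td><td>0.986</td></tr>
    <tr><td>1</td><td>0.414</td><td>0.586</td></tr>
    <tr><td>2</td><td>0.902</td><td>0.098</td></tr>
    <tr><td>3</td><td>0.003</td><td>0.997</td></tr>
    <tr><td>4</td><td>0.383</td><td>0.617</td></tr>
</table>
<h3>$q_{mk}$</h3>
<table>
    <tr><th>Term</th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>mac</td><td>0.304</td><td>0.447</td></tr>
    <tr><td>ibm</td><td>0.274</td><td>0.164</td></tr>
    <tr><td>book</td><td>0.304</td><td>0.447</td></tr>
    <tr><td>banana</td><td>0.002</td><td>0.296</td></tr>
    <tr><td>store</td><td>0.27</td><td>0.166</td></tr>
    <tr><td>microsoft</td><td>0.274</td><td>0.164</td></tr>
    <tr><td>mango</td><td>0.422</td><td>0.389</td></tr>
    <tr><td>oracle</td><td>0.274</td><td>0.164</td></tr>
    <tr><td>fruit</td><td>0.456</td><td>0.67</td></tr>
    <tr><td>ios</td><td>0.034</td><td>0.28</td></tr>
    <tr><td>apple</td><td>1.0</td><td>1.0</td></tr>
</table>
<h3>$a_{k}$</h3>
<table>
    <tr><th></th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>priors</td><td>0.327</td><td>0.673</td></tr>
</table>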


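<h2>ITERATION 2</h2>
<h3>$r_{nk}$</h3>
<table>
    <tr><th>DocID</th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>0</td><td>0.0</td><td>1.0</td></tr>
    <tr><td>1</td><td>0.203</td><td>0.797</td></tr>
    <tr><td>2</td><td>1.0</td><td>0.0</td></tr>
    <tr><td>3</td><td>0.0</td><td>1.0</td></tr>
    <tr><td>4</td><td>0.031</td><td>0.969</td></tr>
</table>
<h3>$q_{mk}$</h3>
<table>
    <tr><th>Term</th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>mac</td><td>0.249</td><td>0.479</td></tr>
    <tr><td>ibm</td><td>0.526</td><td>0.03</td></tr>
    <tr><td>book</td><td>0.249</td><td>0.479</td></tr>
    <tr><td>banana</td><td>0.002</td><td>0.304</td></tr>
    <tr><td>store</td><td>0.241</td><td>0.178</td></tr>
    <tr><td>microsoft</td><td>0.526</td><td>0.03</td></tr>
    <tr><td>mango</td><td>0.225</td><td>0.492</td></tr>
    <tr><td>oracle</td><td>0.526</td><td>0.03</td></tr>
    <tr><td>fruit</td><td>0.233</td><td>0.792</td></tr>
    <tr><td>ios</td><td>0.008</td><td>0.3</td></tr>
    <tr><td>apple</td><td>1.0</td><td>1.0</td></tr>
</table>
<h3>$a_{k}$</h3>
<table>
    <tr><th></th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>priors</td><td>0.343</td><td>0.657</td></tr>
</table>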


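<h2>ITERATION 3</h2>
<h3>$r_{nk}$</h3>
<table>
    <tr><th>DocID</th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>0</td><td>0.0</td><td>1.0</td></tr>
    <tr><td>1</td><td>0.004</td><td>0.996</td></tr>
    <tr><td>2</td><td>1.0</td><td>0.0</td></tr>
    <tr><td>3</td><td>0.0</td><td>1.0</td></tr>
    <tr><td>4</td><td>0.0</td><td>1.0</td></tr>
</table>
<h3>$q_{mk}$</h3>
<table>
    <tr><th>Term</th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>mac</td><td>0.165</td><td>0.477</td></tr>
    <tr><td>ibm</td><td>0.81</td><td>0.0</td></tr>
    <tr><td>book</td><td>0.165</td><td>0.477</td></tr>
    <tr><td>banana</td><td>0.0</td><td>0.266</td></tr>
    <tr><td>store</td><td>0.165</td><td>0.212</td></tr>
    <tr><td>microsoft</td><td>0.81</td><td>0.0</td></tr>
    <tr><td>mango</td><td>0.025</td><td>0.523</td></tr>
    <tr><td>oracle</td><td>0.81</td><td>0.0</td></tr>
    <tr><td>fruit</td><td>0.025</td><td>0.788</td></tr>
    <tr><td>ios</td><td>0.0</td><td>0.266</td></tr>
    <tr><td>apple</td><td>1.0</td><td>1.0</td></tr>
</table>
<h3>$a_{k}$</h3>
<table>
    <tr><th></th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>priors</td><td>0.247</td><td>0.753</td></tr>
</table>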


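<h2>ITERATION 4</h2>
<h3>$r_{nk}$</h3>
<table>
    <tr><th>DocID</th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>0</td><td>0.0</td><td>1.0</td></tr>
    <tr><td>1</td><td>0.0</td><td>1.0</td></tr>
    <tr><td>2</td><td>1.0</td><td>0.0</td></tr>
    <tr><td>3</td><td>0.0</td><td>1.0</td></tr>
    <tr><td>4</td><td>0.0</td><td>1.0</td></tr>
</table>
<h3>$q_{mk}$</h3>
<table>
    <tr><th>Term</th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>mac</td><td>0.004</td><td>0.5</td></tr>
    <tr><td>ibm</td><td>0.996</td><td>0.0</td></tr>
    <tr><td>book</td><td>0.004</td><td>0.5</td></tr>
    <tr><td>banana</td><td>0.0</td><td>0.25</td></tr>
    <tr><td>store</td><td>0.004</td><td>0.249</td></tr>
    <tr><td>microsoft</td><td>0.996</td><td>0.0</td></tr>
    <tr><td>mango</td><td>0.0</td><td>0.5</td></tr>
    <tr><td>oracle</td><td>0.996</td><td>0.0</td></tr>
    <tr><td>fruit</td><td>0.0</td><td>0.751</td></tr>
    <tr><td>ios</td><td>0.0</td><td>0.25</td></tr>
    <tr><td>apple</td><td>1.0</td><td>1.0</td></tr>
</table>
<h3>$a_{k}$</h3>
<table>
    <tr><th></th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>priors</td><td>0.201</td><td>0.799</td></tr>
</table>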


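<h2>ITERATION 5</h2>
<h3>$r_{nk}$</h3>
<table>
    <tr><th>DocID</th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>0</td><td>0.0</td><td>1.0</td></tr>
    <tr><td>1</td><td>0.0</td><td>1.0</td></tr>
    <tr><td>2</td><td>1.0</td><td>0.0</td></tr>
    <tr><td>3</td><td>0.0</td><td>1.0</td></tr>
    <tr><td>4</td><td>0.0</td><td>1.0</td></tr>
</table>
<h3>$q_{mk}$</h3>
<table>
    <tr><th>Term</th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>mac</td><td>0.0</td><td>0.5</td></tr>
    <tr><td>ibm</td><td>1.0</td><td>0.0</td></tr>
    <tr><td>book</td><td>0.0</td><td>0.5</td></tr>
    <tr><td>banana</td><td>0.0</td><td>0.25</td></tr>
    <tr><td>store</td><td>0.0</td><td>0.25</td></tr>
    <tr><td>microsoft</td><td>1.0</td><td>0.0</td></tr>
    <tr><td>mango</td><td>0.0</td><td>0.5</td></tr>
    <tr><td>oracle</td><td>1.0</td><td>0.0</td></tr>
    <tr><td>fruit</td><td>0.0</td><td>0.75</td></tr>
    <tr><td>ios</td><td>0.0</td><td>0.25</td></tr>
    <tr><td>apple</td><td>1.0</td><td>1.0</td></tr>
</table>
<h3>$a_{k}$</h3>
<table>
    <tr><th></th><th>k = 0</th><th>k = 1</th></tr>
    <tr><td>priors</td><td>0.2</td><td>0.8</td></tr>
</table>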


Iterations 6 through 10 print the same three tables as iteration 5 (to three decimal places): the responsibilities, term probabilities, and priors no longer change.
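
Once the responsibilities stop changing, each document can be assigned to its most probable cluster. The following small sketch does that with `argmax`; `R_final` is copied from the converged $r_{nk}$ table of this run, while `labels` and `clusters` are illustrative names.

```python
import numpy as np

# Converged responsibilities from this run (rows: documents 0..4, columns: clusters 0 and 1)
R_final = np.array([
    [0.0, 1.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
])

labels = R_final.argmax(axis=1)  # index of the most probable cluster per document
clusters = {k: np.flatnonzero(labels == k).tolist() for k in range(R_final.shape[1])}

print(labels.tolist())  # [1, 1, 0, 1, 1]
print(clusters)         # {0: [2], 1: [0, 1, 3, 4]}
```

In this run document 2 ends up alone in cluster 0, which differs from the A/B grouping in the example table; since the initialization is random, other runs may converge to a different split.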
