K-Means
===
- [t-SNE, Dimension Reduction](#t_SNE-(Visualization-for-High-dimensional-Data)), visualization of high-dimensional data
- [KMeans, Math background](#Mathematical-Background)
- [Association among data](#Apriori's-Algorithm)

In [None]:
import numpy as np

In [None]:
from sklearn.cluster import KMeans
import pandas as pd

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

In [None]:
# generate 6 vectors with 4 dimensions, each element being an integer in [0, 9]
X = np.random.randint(10,size=(6,4))
X

In [None]:
# 3-clusters
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
kmeans.cluster_centers_

In [None]:
# cluster label of each sample, e.g. 2 samples in cluster 0, 3 in cluster 1, and 1 in cluster 2 (varies with the random X)
kmeans.labels_

In [None]:
# generate 2 additional samples
y=np.random.randint(10,size=(2,4))
y

In [None]:
# determine which cluster each one belongs to 
kmeans.predict(y)

t_SNE (Visualization for High-dimensional Data)
---
One-dimensional data, `{(x1,c1),(x2,c2), ...}`, can be visualized in a 2D picture, and 2-dimensional data, `{(x1,y1,c1),(x2,y2,c2), ...}`, in a 3D picture. But how can data with more than 2 dimensions be visualized? Using the mathematical concept of a "manifold", t-SNE embeds such data into 2 dimensions so that it can still be plotted.

In [None]:
# Make more data
X = np.random.randint(10,size=(100,4))
df = pd.DataFrame(X)

In [None]:
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
kmeans.cluster_centers_

In [None]:
r1 = pd.Series(kmeans.labels_).value_counts()  # count the samples in each cluster
r2 = pd.DataFrame(kmeans.cluster_centers_)

In [None]:
r=pd.concat([r2, r1], axis = 1)
r

In [None]:
r.columns = list(df.columns) + ['Num']
r

In [None]:
r = pd.concat([df, pd.Series(kmeans.labels_, index = df.index)], axis = 1)
r.columns = list(df.columns) + ['Cluster']


In [None]:
df1=df.copy()

In [None]:
tsne = TSNE()
tsne.fit_transform(df1);


In [None]:
tsne = pd.DataFrame(tsne.embedding_, index = df1.index)

In [None]:
tsne

In [None]:

#plt.rcParams['font.serif'] = ['SimHei']  # needed to display Chinese labels correctly
#plt.rcParams['axes.unicode_minus'] = False

plt.figure(figsize=(6,6))

d = tsne[r['Cluster'] == 0]
plt.plot(d[0], d[1], 'r.')
d = tsne[r['Cluster'] == 1]
plt.plot(d[0], d[1], 'go')
d = tsne[r['Cluster'] == 2]
plt.plot(d[0], d[1], 'b*')
plt.show()

In [None]:
# introduce seaborn style
import seaborn as sns
sns.set()

Mathematical Background
---
1. Initialize the $k$ cluster centers: choose $k$ samples, $y_i$, $i=1,2,\cdots,k$, at random from the dataset of $n$ samples, where each sample $X_n=\{x_{n1},x_{n2},\cdots,x_{nm}\}$ has $m$ features;
2. Assign each sample to the cluster whose center is closest to it;
3. Re-calculate the center of each cluster as the mean of its members;
4. Compare the new centers with the previous ones; if they changed, repeat from step (2); if not, output the clusters.
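The steps above can be sketched in plain NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the function name `kmeans_sketch`, the `max_iter` cap, the `seed` argument, and the rule that an empty cluster keeps its previous center are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain k-means following the four steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    def assign(centers):
        # Euclidean distance from every sample to every center, then the closest one
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        return dists.argmin(axis=1)

    # Step 1: pick k distinct samples as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to its closest center
        labels = assign(centers)
        # Step 3: re-calculate each center as the mean of its members
        # (assumption: an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(centers)

X_demo = np.random.default_rng(1).integers(10, size=(100, 4))
centers, labels = kmeans_sketch(X_demo, k=3)
print(centers)
print(np.bincount(labels, minlength=3))  # number of samples per cluster
```

Because the initial centers are random, different seeds may lead to different final clusters, which is also why `random_state` is fixed in the `KMeans` calls earlier.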

Apriori's Algorithm
===

Support and Confidence
---

1. Suppose that the sample space is $\Omega=\{Event_1,Event_2,\cdots\}$, and each event can include elements $A$, $B$, etc., e.g.
     ```
      Event 1: A, B, C, D
      Event 2: C, D
      Event 3: B, C, D
      Event 4: A, D
      Event 5: A, C, D
     ```
  a). Support of $A$, the frequency with which $A$ is included in the events:
    $$ \text{support}(A)=\frac{P(A)}{P(\Omega)}=\left(\color{red}{\frac{3}{5}}\right)$$
    $A$ appears in 3 of the 5 events.
  b). Confidence of $A\to B$:
    $$\text{confidence}(B|A)=P(A\to B)=\frac{\text{support}(A\cap B)}{\text{support}(A)}=\left(\color{red}{\frac{1}{3}}\right)$$
    $A$ and $B$ occur together in 1 of the 3 events that contain $A$.
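The two definitions above can be checked with a few lines of Python. This is a minimal sketch in which each event is stored as a set; the helper names `support` and `confidence` are chosen for this example only.

```python
# The five events from the example above, each stored as a set of items
events = [
    {"A", "B", "C", "D"},
    {"C", "D"},
    {"B", "C", "D"},
    {"A", "D"},
    {"A", "C", "D"},
]

def support(itemset, events):
    """Fraction of events that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= e for e in events) / len(events)

def confidence(antecedent, consequent, events):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, events) / support(antecedent, events)

print(support({"A"}, events))            # 3 of 5 events, i.e. 0.6
print(confidence({"A"}, {"B"}, events))  # 1 of the 3 events with A, i.e. about 0.333
```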



Algorithm
---
Consider the case where the minimal support, <font color="brown">0.4</font>, and the minimal confidence, <font color="brown">0.6</font>, are given:
   
a).   

|C1, Support|L1, Support|
|---|---|
|{A} → 0.6|{A} → 0.6 ✔️|
|{B} → 0.4|{B} → 0.4 ✔️|
|{C} → 0.8|{C} → 0.8 ✔️|
|{D} → 1.0|{D} → 1.0 ✔️|

where the itemsets in L1 are those that satisfy the condition **support ≥ 0.4**; in other words, they occur frequently.

b).

|C2, Support|L2, Support|
|---|---|
|<del>{A,B} → 0.2</del>||
|{A,C} → 0.4|{A,C} → 0.4 ✔️|
|{A,D} → 0.6|{A,D} → 0.6 ✔️|
|{B,C} → 0.4|{B,C} → 0.4 ✔️|
|{B,D} → 0.4|{B,D} → 0.4 ✔️|
|{C,D} → 0.8|{C,D} → 0.8 ✔️|

c).

|C3, Support|L3, Support|
|---|---|
|{A,C,D} → 0.4|{A,C,D} → 0.4 ✔️|
|{B,C,D} → 0.4|{B,C,D} → 0.4 ✔️|

In C3, every non-empty proper subset of a candidate has to appear in L1 and L2. Thus only
{A,C,D} and {B,C,D} qualify, e.g.
- {A,B,C} ▶︎ {A,B} ✘
- {A,B,D} ▶︎ {A,B} ✘
- {A,C,D} ▶︎ {A,C}, {A,D}, {C,D} ✔️
- {B,C,D} ▶︎ {B,C}, {B,D}, {C,D} ✔️


|Rule|Support|Confidence|
|---|---:|---:|
|A → D|60%|100%|
|D → A|60%|60%|
|B → C|40%|100%|
|C → B|40%|50%|
|B → D|40%|100%|
|D → B|40%|40%|
|C → D|80%|100%|
|D → C|80%|80%|
|A → C|40%|66.7%|
|C → A|40%|50%|
|A,C → D|40%|100%|
|A,D → C|40%|66.7%|
|C,D → A|40%|50%|
|A → C,D|40%|66.7%|
|C → A,D|40%|50%|
|D → A,C|40%|40%|
|B,C → D|40%|100%|
|B,D → C|40%|100%|
|C,D → B|40%|50%|
|B → C,D|40%|100%|
|C → B,D|40%|50%|
|D → B,C|40%|40%|

The rule B → C, with confidence 100%, shows that item C is always chosen whenever item B is chosen.
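The level-wise procedure (C1 → L1 → C2 → L2 → ...) can be sketched as follows. This is a small, unoptimized illustration run on the five events from the example; the names `apriori_sketch` and `rules` are hypothetical helpers written for this notebook, not part of any library.

```python
from itertools import combinations

events = [
    {"A", "B", "C", "D"},
    {"C", "D"},
    {"B", "C", "D"},
    {"A", "D"},
    {"A", "C", "D"},
]

def apriori_sketch(events, min_support):
    """Return {frozenset: support} for every frequent itemset (level-wise search)."""
    n = len(events)

    def support(itemset):
        return sum(itemset <= e for e in events) / n

    items = sorted(set().union(*events))
    # L1: single items that reach the minimal support
    level = {frozenset([i]): support(frozenset([i])) for i in items}
    level = {s: v for s, v in level.items() if v >= min_support}
    frequent = dict(level)
    k = 2
    while level:
        # C_k: unions of two frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune any candidate with a (k-1)-subset that is not frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # L_k: candidates that reach the minimal support
        level = {c: support(c) for c in candidates}
        level = {c: v for c, v in level.items() if v >= min_support}
        frequent.update(level)
        k += 1
    return frequent

def rules(frequent, min_confidence):
    """Yield (antecedent, consequent, support, confidence) for each qualifying rule."""
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                # every subset of a frequent itemset is itself frequent
                conf = sup / frequent[lhs]
                if conf >= min_confidence:
                    yield lhs, itemset - lhs, sup, conf

frequent = apriori_sketch(events, min_support=0.4)
for lhs, rhs, sup, conf in rules(frequent, min_confidence=0.6):
    print(sorted(lhs), "->", sorted(rhs), f"support={sup:.2f}", f"confidence={conf:.2f}")
```

Only rules whose confidence reaches the minimal confidence are printed, so the output is a filtered view of the full rule list.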