K-Means
===
- [t-SNE, Dimension Reduction](#t_SNE-(Visualization-for-High-dimensional-Data)), visualization of high-dimensional data
- [KMeans, Math background](#Mathematical-Background)
- [Association among data](#Apriori's-Algorithm)


[Windows Python packages](https://www.lfd.uci.edu/~gohlke/pythonlibs/)

[goo.gl](https://docs.google.com/forms/d/e/1FAIpQLSeoNLyNEOSC3sy0JQVYxIyCIRbXYZe_batJ-vszreVYGbRaXw/viewform)

In [None]:
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

In [None]:
# generate 6 vectors with 4 dimensions, each element being an integer in [0, 9]
X = np.random.randint(10,size=(6,4))
X

In [None]:
# 3-clusters
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
kmeans.cluster_centers_

In [None]:
kmeans.labels_

In [None]:
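# cluster index (0, 1 or 2) of each sample, e.g. 2 samples in cluster 0, 2 in cluster 1 and 2 in cluster 2; the exact split depends on the random data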
kmeans.labels_

In [None]:
# generate 2 additional samples
y=np.random.randint(10,size=(2,4))
y

In [None]:
# determine which cluster each one belongs to 
kmeans.predict(y)

t_SNE (Visualization for High-dimensional Data)
---
One-dimensional data, `{(x1,c1),(x2,c2), ...}`, can be visualized in a 2D picture, and 2-dimensional data, `{(x1,y1,c1),(x2,y2,c2), ...}`, in a 3D picture. But how can data with more than 2 dimensions be visualized? Using the mathematical concept of a "manifold", t-SNE embeds such data into 2 or 3 dimensions, so it can still be plotted.

In [None]:
# Make more data
X = np.random.randint(10,size=(100,4))
df = pd.DataFrame(X)

In [None]:
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
kmeans.cluster_centers_

In [None]:
r1 = pd.Series(kmeans.labels_).value_counts() 
r2 = pd.DataFrame(kmeans.cluster_centers_)

In [None]:
r2

In [None]:
r=pd.concat([r2, r1], axis = 1)
r

In [None]:
r.columns = list(df.columns) + ['Num']
r

In [None]:
r = pd.concat([df, pd.Series(kmeans.labels_, index = df.index)], axis = 1)
r.columns = list(df.columns) + ['Cluster']


In [None]:
df1=df.copy()

In [None]:
TSNE?

In [None]:
tsne = TSNE()
tsne.fit_transform(df1);


In [None]:
tsne3 = TSNE(n_components=3)
tsne3.fit_transform(df1);


In [None]:
tsne = pd.DataFrame(tsne.embedding_, index = df1.index)

In [None]:
tsne3 = pd.DataFrame(tsne3.embedding_, index = df1.index)

In [None]:

#plt.rcParams['font.serif'] = ['SimHei'] # display Chinese labels correctly
#plt.rcParams['axes.unicode_minus'] = False

plt.figure(figsize=(6,6))

d = tsne[r['Cluster'] == 0]
plt.plot(d[0], d[1], 'r.')
d = tsne[r['Cluster'] == 1]
plt.plot(d[0], d[1], 'go')
d = tsne[r['Cluster'] == 2]
plt.plot(d[0], d[1], 'b*')
plt.show()

In [None]:

#plt.rcParams['font.serif'] = ['SimHei'] # display Chinese labels correctly
#plt.rcParams['axes.unicode_minus'] = False

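# gca(projection=...) is no longer accepted by recent Matplotlib; add_subplot creates the 3D axes
A = plt.figure(figsize=(12,8)).add_subplot(projection='3d')

# color and marker are passed by keyword: the 4th positional argument of scatter is zdir, not a format string
d = tsne3[r['Cluster'] == 0]
A.scatter(d[0], d[1], d[2], c='r', marker='.')
d = tsne3[r['Cluster'] == 1]
A.scatter(d[0], d[1], d[2], c='g', marker='o')
d = tsne3[r['Cluster'] == 2]
A.scatter(d[0], d[1], d[2], c='b', marker='*')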
A.view_init(elev=20., azim=-95)
#plt.show()

In [None]:

#plt.rcParams['font.serif'] = ['SimHei'] # display Chinese labels correctly
#plt.rcParams['axes.unicode_minus'] = False

plt.figure(figsize=(6,6))

d = tsne[r['Cluster'] == 0]
plt.plot(d[0], d[1], 'r.')
d = tsne[r['Cluster'] == 1]
plt.plot(d[0], d[1], 'go')
d = tsne[r['Cluster'] == 2]
plt.plot(d[0], d[1], 'b*')
plt.show()

In [None]:
# introduce seaborn style
import seaborn as sns
sns.set()

Mathematical Background
---
1. Initialize the $k$ cluster centers: choose $k$ samples, $y_i$, $i=1,2,\cdots,k$, at random from the dataset, in which each data point $X_n=\{x_{n1},x_{n2},\cdots,x_{nm}\}$ has $m$ features;
2. Assign each data point to the cluster whose center is closest to it;
3. Recalculate the center of each cluster;
4. Compare the new centers with the previous ones; if they differ, repeat from step (2);
5. If they did not change, output the clusters.
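The steps above can be sketched in plain NumPy. This is only an illustration, not the implementation used by `sklearn.cluster.KMeans`; the function name `kmeans_sketch`, its parameters, and the demo data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Illustrative K-Means loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialization: pick k distinct samples as the starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: each sample joins the cluster with the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each center becomes the mean of its members;
        # an empty cluster keeps its previous center
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Demo data with the same shape as in the notebook: 100 samples, 4 features
demo_X = np.random.default_rng(1).integers(0, 10, size=(100, 4))
demo_centers, demo_labels = kmeans_sketch(demo_X, k=3)
print(demo_centers)
print(np.bincount(demo_labels, minlength=3))
```

The result depends on the random initial centers, so different seeds can give different clusters.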

Apriori's Algorithm
===

Support and Confidence
---

1. Suppose that the sample space is $\Omega=\{Event_1,Event_2,\cdots\}$, and each event can include elements $A, B,$ etc., e.g.
     ```
      Event 1: A, B, C, D
      Event 2: C, D
      Event 3: B, C, D
      Event 4: A, D
      Event 5: A, C, D
     ```
  a). Support of $A$, the fraction of events that include $A$:
    $$ \text{support}(A)=\frac{P(A)}{P(\Omega)}=\left(\color{red}{\frac{3}{5}}\right)$$
    $A$ appears in 3 of the 5 events.
  b). Confidence of $A\to B$, the fraction of events containing $A$ that also contain $B$:
    $$\text{confidence}(B|A)=P(A\to B)=\frac{\text{support}(A\cap B)}{\text{support}(A)}=\left(\color{red}{\frac{1}{3}}\right)$$
    $A$ and $B$ occur together in 1 event, out of the 3 events that contain $A$.
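Both definitions can be checked with a few lines of Python. This is a minimal sketch: the helper names `support` and `confidence` and the `events` list are illustrative assumptions, not part of any library.

```python
# The five example events, each stored as a set of items
events = [
    {'A', 'B', 'C', 'D'},  # Event 1
    {'C', 'D'},            # Event 2
    {'B', 'C', 'D'},       # Event 3
    {'A', 'D'},            # Event 4
    {'A', 'C', 'D'},       # Event 5
]

def support(itemset, events):
    """Fraction of events that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= event for event in events) / len(events)

def confidence(antecedent, consequent, events):
    """Support of both sides together, divided by the support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, events) / support(antecedent, events)

print('support(A)      =', support({'A'}, events))
print('confidence(A→B) =', round(confidence({'A'}, {'B'}, events), 3))
```

The printed values should match the fractions 3/5 and 1/3 derived by hand.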



Algorithm
---
Consider the case where the minimum support, <font color="brown">0.4</font>, and the minimum confidence, <font color="brown">0.7</font>, are given:
   
a).   

|C1, Support|L1, Support|
|---|---|
|{A} → 0.6|{A} → 0.6 ✔️|
|{B} → 0.4|{B} → 0.4 ✔️|
|{C} → 0.8|{C} → 0.8 ✔️|
|{D} → 1.0|{D} → 1.0 ✔️|

where L1 holds the itemsets of C1 that satisfy the condition **support ≥ 0.4**; in other words, the itemsets that occur frequently.

b).

|C2, Support|L2, Support|
|---|---|
|<del>{A,B} → 0.2</del>||
|{A,C} → 0.4|{A,C} → 0.4 ✔️|
|{A,D} → 0.6|{A,D} → 0.6 ✔️|
|{B,C} → 0.4|{B,C} → 0.4 ✔️|
|{B,D} → 0.4|{B,D} → 0.4 ✔️|
|{C,D} → 0.8|{C,D} → 0.8 ✔️|

c).

|C3, Support|L3, Support|
|---|---|
|{A,C,D} → 0.4|{A,C,D} → 0.4 ✔️|
|{B,C,D} → 0.4|{B,C,D} → 0.4 ✔️|

In C3, every non-empty proper subset of a candidate must already appear in L1 or L2. Thus {A,B,C} is pruned, and only {A,C,D} and {B,C,D} qualify, e.g.
- {A,B,C} ▶︎ <font color="red">{A,B} ✘</font>, {A,C} ✔️, {B,C} ✔️
- {A,C,D} ▶︎ {A,C}, {A,D}, {C,D} ✔️
- {B,C,D} ▶︎ {B,C}, {B,D}, {C,D} ✔️


|Rule|Support|Confidence|Accepted|
|---|---:|---:|---|
|A → C|40%|66.7%| |
|C → A|40%|50%| |
|A → D|60%|100%| ✔️|
|D → A|60%|60%| |
|✝︎B → C|40%|100%| ✝︎✔️|
|C → B|40%|50%| |
|✝︎B → D|40%|100%| ✝︎✔️|
|D → B|40%|40%| |
|C → D|80%|100%| ✔️|
|D → C|80%|80%| ✔️|
|A,C → D|40%|100%| ✔️|
|B,C → D|40%|100%| ✔️|
|B,D → C|40%|100%| ✔️|
|C,D → B|40%|50%| |
|✝︎B → C,D|40%|100%| ✝︎✔️|
|C → B,D|40%|50%| |
|D → B,C|40%|40%| |

where `✝︎` marks rules that express the same inference. For example, the rule B → C shows that item C is also chosen whenever item B is chosen.
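The level-wise search (C1 → L1 → C2 → L2 → C3 → L3) can be sketched as follows. This is a hedged illustration written for this example only: the function name `apriori_sketch` and the variable `demo_events` are assumptions, and the code is not taken from any package.

```python
from itertools import combinations

# The five example events, each stored as a set of items
demo_events = [
    {'A', 'B', 'C', 'D'},
    {'C', 'D'},
    {'B', 'C', 'D'},
    {'A', 'D'},
    {'A', 'C', 'D'},
]

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every frequent itemset, level by level."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted(set().union(*transactions))
    # L1: single items that reach the minimum support
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    frequent = {}
    size = 1
    while current:
        frequent.update({s: support(s) for s in current})
        size += 1
        # Candidates: unions of two frequent itemsets that have exactly `size` items
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        # Prune: every subset with one item fewer must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, size - 1))}
        # Keep only the candidates that reach the minimum support
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

frequent = apriori_sketch(demo_events, min_support=0.4)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

The pruning step is what keeps the candidate sets small: a candidate is dropped before its support is counted as soon as one of its subsets is known to be infrequent.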

Frequent-Pattern (FP) Growth
---
FP-Growth is an improvement on Apriori designed to eliminate some of Apriori's heavy bottlenecks, notably the repeated candidate generation and database scans.

Finding all the association rules is not hard, only tedious. Here, `pyfpgrowth` is used to find the association rules among the set of events under consideration. The package can be installed with `pip`.

Steps
---
The first step is to count every item across all the transactions.

1. Count the items:<br>
**Transactions**= [ A: 3, B: 2, C: 4, D: 5]
2. Set the minimum threshold to 2 (i.e. 2/5=0.4) and drop any item below it; here every item stays:<br>
**Transactions**= [ A: 3, B: 2, C: 4, D: 5]
3. Sort the list by the count of each item, in descending order:<br>
**Transactions**= [D: 5, C: 4, A: 3, B: 2 ]
4. Build the FP-tree:

Build the tree
---
Event 1: [ 'A', 'B', 'C', 'D']
```
              D[1]
              ↙︎
          C[1]
           ↙︎
         A[1]  
         ↙︎
       B[1]  
```
Event 5 ['A','C','D'] → Event 3 ['B','C','D']

```
              D[2]               D[3]
              ↙︎                 ↙︎
          C[2]                C[3]
           ↙︎                  ↙︎ ↘︎
         A[2]               A[2]  B[1]
         ↙︎                  ↙︎
       B[1]               B[1]
```
Event 2 ['C', 'D'] → Event 4 [ 'A', 'D']
```
               D[4]                              D[5]      
              ↙︎                                ↙︎   ↘︎ 
           C[4]                             C[4]     A[1]  
           ↙︎  ↘︎                            ↙︎  ↘︎         
         A[2]  B[1]                       A[2]   B[1]          
        ↙︎                               ↙︎  
      B[1]                             B[1]        
```
**Finally**, check for every sub-tree whether its confidence is above the given threshold.

**Note.** The FP-growth algorithm scans the data much faster than the Apriori algorithm, because it needs only two passes over the transactions.
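The tree construction walked through above can be reproduced with a short sketch. It only builds the prefix tree with its counts; the header table and the mining step of a full FP-growth implementation are left out. The names `FPNode`, `build_fp_tree`, `show_tree` and `demo_events` are assumptions made for this example and are not the internals of `pyfpgrowth`.

```python
from collections import Counter

class FPNode:
    """One node of the tree: an item, its count, and its child nodes."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # First pass: count every item and keep those that reach the threshold
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = FPNode(None)
    # Second pass: insert each transaction, most frequent item first
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item)
            node = node.children[item]
            node.count += 1
    return root

def show_tree(node, depth=0):
    """Print the tree with one level of indentation per depth."""
    for child in node.children.values():
        print('  ' * depth + '%s[%d]' % (child.item, child.count))
        show_tree(child, depth + 1)

demo_events = [['A', 'B', 'C', 'D'], ['C', 'D'], ['B', 'C', 'D'],
               ['A', 'D'], ['A', 'C', 'D']]
tree = build_fp_tree(demo_events, min_count=2)
show_tree(tree)
```

Because every transaction is sorted by the same global item order, transactions that share frequent items also share a path from the root, which is what makes the tree compact.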

In [None]:
import pyfpgrowth

transactions = [ [ 'A', 'B', 'C', 'D'], [ 'C', 'D'], [ 'B', 'C', 'D'], [ 'A', 'D'] ,['A','C','D']]

patterns = pyfpgrowth.find_frequent_patterns(transactions, 2)
rules = pyfpgrowth.generate_association_rules(patterns, 0.7)

print(rules)

In [None]:
patterns

The output `rules` is a dictionary in key-value format,
`{'key1':'value1',...}`, where
- key-*n* is a tuple of items, the left-hand side of the rule,
- value-*n* is a two-element pair: the tuple of items on the right-hand side of the rule, and the confidence of the rule,
- use `rules.items()` to retrieve the entries one by one.

In [None]:
for k, v in rules.items():
    print(k, v)

In [None]:
for k, v in rules.items():
    karr=', '.join(k)
    varr=', '.join(v[0])

    print('{%s} --> {%s}, %.2f' %(karr, varr,v[1]))

In [None]:
W  = '\033[0m'  # white (normal)
K  = '\033[30m' # black
R  = '\033[31m' # red
G  = '\033[32m' # green
O  = '\033[1;33m' # orange
B  = '\033[34m' # blue
P  = '\033[35m' # purple
T =  '\033[1;33;43m' #Title


In [None]:
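def aprioi(dataset,support=0.2,confidence=0.6):
    """Print the association rules that pyfpgrowth (FP-growth) finds in `dataset`."""
    # convert the relative support into an absolute number of transactions
    transaction=int(len(dataset)*support)
    transactions = dataset

    patterns = pyfpgrowth.find_frequent_patterns(transactions, transaction)
    rules = pyfpgrowth.generate_association_rules(patterns, confidence)
    # header line, colored with the ANSI codes defined above
    print(T,B,"\tRules",R,"\t\tConfidence",W)
    for k, v in rules.items():
        # k: left-hand side, v[0]: right-hand side, v[1]: confidence
        karr=', '.join(k)
        varr=', '.join(v[0])
        #print(" {1:6} --> {1:10}".format(karr,varr),'{0:8.3f}'.format(v[1]))
        print(' [{k:6s}] --> [{s:6s}]   {v:5.3f} '.format(k=karr,s=varr,v=v[1]))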

In [None]:
aprioi(transactions,support=0.4,confidence=0.7)

In [None]:
import pandas as pd
inputfile = '../data/apriori.csv'
data = pd.read_csv(inputfile, header=None, dtype = object)
b=data.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0

In [None]:
b

In [None]:
import time

In [None]:
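# time.clock() was removed in Python 3.8; perf_counter() is the replacement for timing
start = time.perf_counter()
# FP-Growth start
aprioi(b,support=0.06,confidence=0.75)
# End
elapsed = time.perf_counter()-start
print('\nSearch finished, elapsed time: %0.2f s' %(elapsed))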