## <font color='green'> RCV1 - Clustering (Sparse Matrix)</font>

### <font color='green'> 1. Description</font>
    
This demo clusters the documents of the RCV1 dataset using k-means. The dataset is downloaded by scikit-learn.

The dataset comes with manually assigned classes, which are used to check whether the clustering is working appropriately.
(A document can have multiple manually assigned classes, so only the first one is used.)
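
The check relies on the homogeneity score, which is 1.0 when every cluster contains documents of a single class. As a minimal sketch on made-up labels (the toy lists below are illustrative and not part of the dataset), the three cases should score 1.0, 1.0 and 0.0 respectively, up to floating-point error:

```python
from sklearn import metrics

truth = [0, 0, 1, 1]

# Cluster ids are arbitrary: swapping them still gives a perfect score.
perfect = metrics.homogeneity_score(truth, [1, 1, 0, 0])

# Splitting a class across clusters keeps every cluster pure.
oversplit = metrics.homogeneity_score(truth, [0, 1, 2, 3])

# A single cluster holding both classes says nothing about the class.
mixed = metrics.homogeneity_score(truth, [0, 0, 0, 0])

print(perfect, oversplit, mixed)
```

Because over-splitting is not penalized, the score tends to rise with the number of clusters, so it is most useful for comparing runs that use the same `n_clusters`.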

### <font color='green'> 2. Data Preprocessing</font>

In [1]:
from sklearn.datasets import fetch_rcv1
from sklearn import metrics
import numpy as np

rcv1 = fetch_rcv1()
X = rcv1.data
target = rcv1.target

# A document can have multiple classes; keep only the first one,
# i.e. the first column index stored for each row of the sparse target matrix
idx = target.indices
off = target.indptr
offsize = off.shape[0]
y = [idx[i] for i in off[0:offsize-1]]
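
The indexing above works because of how a CSR matrix is laid out. Here is a small sketch on a made-up 3x3 label matrix (`toy_target` is a hypothetical stand-in for `rcv1.target`), assuming every row has at least one label:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy label matrix: 3 documents x 3 classes (1 = class assigned).
toy_target = csr_matrix(np.array([[0, 1, 1],
                                  [1, 0, 0],
                                  [0, 0, 1]]))

# indices holds the column ids of the nonzeros, row after row;
# indptr[i] is the position in indices where row i starts.
first_label = toy_target.indices[toy_target.indptr[:-1]]
print(first_label.tolist())  # [1, 0, 2]

# A row without labels would pick up an entry from a later row
# (or fail for the last row), so check that no row is empty.
labels_per_row = np.diff(toy_target.indptr)
assert (labels_per_row > 0).all()
```

The fancy-indexing form `indices[indptr[:-1]]` selects the same entries as the list comprehension in the cell above, but returns a NumPy array instead of a Python list.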

In [2]:
n_clusters = 100
max_iter = 10
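
To see what these two parameters control before running on the full dataset, here is a minimal sketch on a tiny synthetic sparse matrix. The data, `X_toy`, and the `random_state` value are assumptions made for illustration; the constructor arguments mirror the ones used in the sections below.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)

# 60 rows with a single nonzero each: rows 0-19 in column 0,
# rows 20-39 in column 1, rows 40-59 in column 2.
rows = np.arange(60)
cols = np.repeat([0, 1, 2], 20)
vals = 5.0 + 0.1 * rng.rand(60)
X_toy = csr_matrix((vals, (rows, cols)), shape=(60, 5))

# n_clusters bounds the number of groups, max_iter the number of
# k-means iterations for the single random initialization (n_init=1).
km = KMeans(n_clusters=3, max_iter=10, init="random", n_init=1, random_state=0)
labels = km.fit_predict(X_toy)

print(labels.shape, km.n_iter_)
```

With `init="random"` and `n_init=1` the result depends on the initial centers, so a single run may stop in a local optimum. A small `max_iter` keeps the run time bounded, possibly at the cost of stopping before convergence.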

### <font color='green'> 3. Implementation using Frovedis</font>

In [3]:
# train
import os, time
from frovedis.mllib.cluster import KMeans as frovKMeans
from frovedis.exrpc.server import FrovedisServer
FrovedisServer.initialize("mpirun -np 8 {}".format(os.environ['FROVEDIS_SERVER']))

frov_km = frovKMeans(n_clusters=n_clusters, max_iter=max_iter, init='random', n_init=1)
start = time.time()
frov_y_pred = frov_km.fit_predict(X)
end = time.time()
print("Frovedis train time: {:.3f} sec".format(end-start))
frov_homogeneity_score = metrics.homogeneity_score(y, frov_y_pred)
print('Frovedis Homogeneity Score : '+ str(frov_homogeneity_score))

Frovedis train time: 2.994 sec
Frovedis Homogeneity Score : 0.5026113149386877


### <font color='green'> 4. Implementation using scikit-learn</font>

In [4]:
from sklearn.cluster import KMeans as skKMeans

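# algorithm="full" (classic Lloyd iterations) performed better here than "auto", which selects "elkan"
# Note: newer scikit-learn releases removed n_jobs (1.0) and renamed "full" to "lloyd" (1.1);
# adjust these arguments if the call below is rejected by your version.
sk_km = skKMeans(n_clusters=n_clusters, max_iter=max_iter, init='random', algorithm="full", n_init=1, n_jobs=12)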
start = time.time()
sk_y_pred = sk_km.fit_predict(X)
end = time.time()
print("scikit-learn train time: {:.3f} sec".format(end-start))
sk_homogeneity_score = metrics.homogeneity_score(y, sk_y_pred)
print('scikit-learn Homogeneity Score : '+ str(sk_homogeneity_score))



scikit-learn train time: 26.446 sec
scikit-learn Homogeneity Score : 0.5071450369250687
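
The two runs can be put side by side with a little arithmetic. The sketch below simply copies the timings and scores printed above into hypothetical variables:

```python
# Values copied from the cell outputs above.
frov_time, sk_time = 2.994, 26.446  # seconds
frov_h, sk_h = 0.5026113149386877, 0.5071450369250687

speedup = sk_time / frov_time
h_gap = sk_h - frov_h

print("speedup: {:.2f}x, homogeneity gap: {:.4f}".format(speedup, h_gap))
# speedup: 8.83x, homogeneity gap: 0.0045
```

These figures come from a single run of each implementation with one random initialization (`n_init=1`), using 8 MPI processes for Frovedis and `n_jobs=12` for scikit-learn, so they are indicative rather than a controlled benchmark.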
