# Implementation from Scratch

<br />

I am going to implement the algorithms using as few libraries as possible, relying mainly on NumPy.

## [Task 1] Create a Class of K-means

<br />

I am going to create a class for K-means, a nonhierarchical clustering method.

#### Artificial Dataset for Clustering

<br />

I prepare an artificial dataset to validate the clustering.

The `make_blobs` function also returns the correct labels, but I do not use them in this task.

In [1]:
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, n_features=2, centers=4, cluster_std=0.5, shuffle=True, random_state=0)

In [2]:
X

array([[ 0.72086751,  3.71347124],
       [-1.89468423,  7.96898545],
       [ 1.35678894,  4.36462484],
       [ 1.05374379,  4.49286859],
       [ 1.59141542,  4.90497725],
       [ 0.78260667,  4.15263595],
       [-1.95751686,  3.87291474],
       [-0.77354537,  7.87923564],
       [ 0.12313498,  5.27917503],
       [-1.43284669,  7.71577043],
       [-0.92819001,  7.02698199],
       [-1.74836345,  7.06307447],
       [-1.26789718,  7.25141327],
       [-0.98661744,  7.74968685],
       [-0.81984047,  7.50994722],
       [ 2.99684287,  0.22378413],
       [ 1.46870582,  1.86947425],
       [-0.33533163,  3.390122  ],
       [-1.86407034,  2.93379754],
       [ 2.62496786,  0.28025075],
       [ 2.11114739,  3.57660449],
       [-1.8219901 ,  7.61654999],
       [-1.91186205,  3.18750686],
       [ 2.28809874,  0.12954182],
       [ 0.5285368 ,  4.49723858],
       [-1.57613028,  2.58614312],
       [-0.565433  ,  3.65813966],
       [ 0.802314  ,  4.38196181],
       [ 2.79939362,

In [3]:
X = X.T

In [4]:
X

array([[ 0.72086751, -1.89468423,  1.35678894,  1.05374379,  1.59141542,
         0.78260667, -1.95751686, -0.77354537,  0.12313498, -1.43284669,
        -0.92819001, -1.74836345, -1.26789718, -0.98661744, -0.81984047,
         2.99684287,  1.46870582, -0.33533163, -1.86407034,  2.62496786,
         2.11114739, -1.8219901 , -1.91186205,  2.28809874,  0.5285368 ,
        -1.57613028, -0.565433  ,  0.802314  ,  2.79939362,  2.64465731,
         1.7190373 , -0.93564005,  2.14398059,  2.06051753, -1.21986433,
         1.13280393, -1.497272  ,  1.85367905, -0.1666378 , -1.89928142,
         1.04829186, -1.44356727, -1.57006498, -1.98331513, -1.87418794,
        -1.86097353,  1.61986895, -1.84482705,  0.72144399,  0.5323772 ,
         0.3498724 ,  1.89949126, -1.2386086 , -1.74448079, -0.96358605,
        -1.26041884, -0.8623605 ,  2.4198128 ,  2.23345072, -0.65424088,
        -1.42525273,  1.51989121,  2.11872357,  1.74265969,  1.42002502,
        -0.69842598, -2.18485772, -1.32890066,  2.1

#### Objective Function

<br />

K-means fits a dataset by finding the assignments $r_{nk}$ and the central points $\mu_k$ that minimize the sum of squared errors (SSE).

I also use the SSE to choose the number of clusters $K$ with the elbow method.

$$
SSE = \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|X_n - \mu_k\|^2
$$

$n$: index of a data point

$k$: index of a cluster

$X_n$: $n$th data point

$\mu_k$: $k$th central point

$r_{nk}$: 1 if the data point $X_n$ is in the cluster $k$, 0 if not.
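
As a minimal sketch of the formula above, the SSE can be computed with NumPy from a one-hot assignment matrix. The function name `compute_sse` and the toy arrays are illustrative assumptions and not part of the class below. Note that samples are in rows here, so `X` has shape `(N, n_features)`.

```python
import numpy as np

def compute_sse(X, centroids, r):
    """SSE = sum_n sum_k r_nk * ||X_n - mu_k||^2 (hypothetical helper).

    X: (N, n_features), centroids: (K, n_features), r: (N, K) one-hot.
    """
    # Squared distance from every point to every centroid, shape (N, K)
    diff = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # r_nk keeps only the distance to the assigned centroid
    return float((r * sq_dist).sum())

# Toy data: two points per cluster (made up for illustration)
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
mu_toy = np.array([[0.0, 1.0], [4.0, 1.0]])
r_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
sse_toy = compute_sse(X_toy, mu_toy, r_toy)
print(sse_toy)  # each point is at distance 1 from its centroid, so SSE = 4.0
```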

In [5]:
import numpy as np
import random

In [13]:
# Create a class of K-means from scratch

class ScratchKMeans():
    """
    Implementation of K-means from scratch

    Parameters
    ----------
    k: int
        The number of clusters

    num_iter: int
        The maximum number of iterations

    Attributes
    ----------
    self.centroids: ndarray, shape (n_features, k)
        Central points found by fit
    """

    def __init__(self, k, num_iter):
        # Record hyperparameters as attributes
        self.k = k
        self.iter = num_iter


    def fit(self, X):
        """
        Fit K-means to the dataset.

        Parameters
        ----------
        X: ndarray whose shape is (n_samples,n_features)
            Features of train dataset
        """

        # Transform the array to move its features to rows: (n_features, n_samples)
        X = X.T

        # Set initial central points: k distinct data points chosen at random
        np.random.seed(32)
        index = np.arange(X.shape[1])
        np.random.shuffle(index)
        k_index = index[:self.k]
        centroids = X[:,k_index]

        for _ in range(self.iter):
            # Assign every point to its nearest central point, then recompute the central points
            new_centroids = self.update_centroids(X, self.assign_cluster(X, centroids))
            # Stop once the central points no longer move
            if np.allclose(centroids, new_centroids):
                break
            centroids = new_centroids

        # Keep the final central points on the instance
        self.centroids = centroids
    
    
    def assign_cluster(self, X, centroids):
        # Build the (n_samples, k) table r_nk: 1 in the column of the nearest central point, 0 elsewhere
        cluster_table = np.zeros((X.shape[1], self.k))
        for i in range(X.shape[1]):
            min_d = np.inf
            label = 0
            for j in range(self.k):
                # Euclidean distance between point i and central point j
                d = np.linalg.norm(X[:,i]-centroids[:,j])
                if d < min_d:
                    min_d = d
                    label = j
            # Store 1 rather than the distance, so a point at distance 0 still counts as assigned
            cluster_table[i,label] = 1

        return cluster_table
    
    
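    def update_centroids(self, X, cluster_table):
        # Move each central point to the mean of the points assigned to it
        # (assumes every cluster has at least one member)
        centroids = np.zeros((X.shape[0], self.k))
        for i in range(self.k):
            # Points with a nonzero entry in column i belong to cluster i
            members = cluster_table[:,i] != 0
            centroids[:,i] = X[:,members].mean(axis=1)

        return centroids

In [14]:
kmeans = ScratchKMeans(2,1000)

In [15]:
# X was transposed to (n_features, n_samples) in In [3]; fit expects (n_samples, n_features)
kmeans.fit(X.T)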
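
The assignment and update steps that the class performs with explicit loops can also be sketched with NumPy broadcasting. This is an illustrative variant under stated assumptions, not the class's own code: the function name `kmeans_sketch`, its `seed` parameter, the samples-in-rows layout, and the toy data are all hypothetical, and an empty cluster simply keeps its previous central point. Running it for several values of $K$ and comparing the returned SSE is the idea behind the elbow method.

```python
import numpy as np

def kmeans_sketch(X, k, num_iter=100, seed=0):
    """Hypothetical vectorized K-means; X has shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(num_iter):
        # Squared distances between all points and all centroids, shape (n_samples, k)
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    # Final assignment and SSE for the returned centroids
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)
    sse = float(sq_dist[np.arange(len(X)), labels].sum())
    return centroids, labels, sse

# Toy data: two tight, well-separated groups (made up for illustration)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
for k in (1, 2, 3):
    _, _, sse = kmeans_sketch(X_toy, k)
    print(k, sse)  # SSE for each K; the elbow is where the drop levels off
```

The sketch uses its own `np.random.default_rng(seed)` generator rather than `np.random.seed`, so it does not change the global random state used elsewhere in the notebook.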