# Sparse Learning, Distance and (k)NN method

> Weitong Zhang
> 2015011493
>
> <zwt15@mails.tsinghua.edu.cn>

## Voronoi Grid in Euclidean Space

We are about to prove the following statement:

$$\begin{cases}
\forall i\ \|\vec x_i - \vec s_1 \|_2^2 \ge \|\vec x_0 - \vec s_1 \|_2^2\\
\forall i\ \|\vec x_i - \vec s_2 \|_2^2 \ge \|\vec x_0 - \vec s_2 \|_2^2\\
\forall t\ \in [0,1], \vec s = t\vec s_1 + (1-t) \vec s_2
\end{cases} \Rightarrow \forall i\ \|\vec x_i - \vec s \|_2^2 \ge \|\vec x_0 - \vec s \|_2^2
$$

$$\begin{cases}
\|\vec x_i - \vec s \|_2^2 - \|\vec x_0 - \vec s \|_2^2 = \sum_j (x_{ij} - s_j)^2 - \sum_j (x_{0j} - s_j)^2 = \sum_j (x_{ij} - x_{0j})(x_{ij} + x_{0j} - 2s_j)\\
\|\vec x_i - \vec s_1 \|_2^2 - \|\vec x_0 - \vec s_1 \|_2^2 = \sum_j (x_{ij} - s_{1j})^2 - \sum_j (x_{0j} - s_{1j})^2 = \sum_j (x_{ij} - x_{0j})(x_{ij} + x_{0j} - 2s_{1j}) \ge 0\\
\|\vec x_i - \vec s_2 \|_2^2 - \|\vec x_0 - \vec s_2 \|_2^2 = \sum_j (x_{ij} - s_{2j})^2 - \sum_j (x_{0j} - s_{2j})^2 = \sum_j (x_{ij} - x_{0j})(x_{ij} + x_{0j} - 2s_{2j}) \ge 0
\end{cases}$$

Since $\vec s = t\vec s_1 + (1-t)\vec s_2$ gives $s_j = t s_{1j} + (1-t) s_{2j}$, the first expression is the convex combination, with weights $t$ and $1-t$, of the latter two. Therefore,

$$ \begin{aligned}
\|\vec x_i - \vec s \|_2^2 - \|\vec x_0 - \vec s \|_2^2 = \sum_j (x_{ij} - s_j)^2 - \sum_j (x_{0j} - s_j)^2 = \sum_j (x_{ij} - x_{0j})(x_{ij} + x_{0j} - 2s_j) \\
= t(\sum_j (x_{ij} - x_{0j})(x_{ij} + x_{0j} - 2s_{1j})) + (1-t) (\sum_j (x_{ij} - x_{0j})(x_{ij} + x_{0j} - 2s_{2j})) \ge 0
\end{aligned}$$

Therefore, each Voronoi cell is a convex set.
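
The statement can also be checked numerically. The following is a minimal sketch, not part of the proof: it draws random sites and random points, groups the points by nearest site, and tests that a random convex combination of two points from the same cell stays in that cell. The helper names (`sq_dist`, `check_convexity`), the dimension, the number of sites and the tolerance are all illustrative assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def check_convexity(n_sites=6, n_points=2000, dim=3, seed=0, tol=1e-12):
    """Return (ok, checked): ok is False if any convex combination leaves its cell."""
    rng = random.Random(seed)

    def rand_pt():
        return [rng.uniform(-1, 1) for _ in range(dim)]

    sites = [rand_pt() for _ in range(n_sites)]

    # Assign every random point to the Voronoi cell of its nearest site.
    cells = {}
    for _ in range(n_points):
        s = rand_pt()
        nearest = min(range(n_sites), key=lambda i: sq_dist(s, sites[i]))
        cells.setdefault(nearest, []).append(s)

    checked = 0
    for i, pts in cells.items():
        for s1, s2 in zip(pts, pts[1:]):
            t = rng.random()
            s = [t * a + (1 - t) * b for a, b in zip(s1, s2)]
            d0 = sq_dist(s, sites[i])
            # s must be at least as close to site i as to every other site.
            if any(sq_dist(s, xj) < d0 - tol for xj in sites):
                return False, checked
            checked += 1
    return True, checked

ok, checked = check_convexity()
print(ok, checked)
```

A passing run is only evidence for the sampled points; the algebraic argument above is what establishes convexity.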

## Error rate of NN method

### Error rate of the Bayesian method

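For $x \in [0,\frac{cr}{c-1}]$, every category is equally likely, so the Bayes classifier can do no better than choose randomly among all of the $c$ categories, and it errs with probability $\frac{c-1}{c}$. Since $x$ falls in this interval with probability $\frac{cr}{c-1}$ and no error occurs elsewhere: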

$$P^* = \sum_i P(w_i)P_{err} = \frac{cr}{c-1} \times \frac{c-1}{c} = r$$

### Error rate of NN method

Suppose that the number of samples in the training dataset is sufficient, i.e. for each $x$ where $p(x|w_i) \ne 0$, there are enough samples belonging to $w_i$.

For each $x\in w_i$: if $x \in [i,i+ 1 - \frac{cr}{c-1}]$, the NN method makes no error; if $x\in [0,\frac{cr}{c-1}]$, the nearest sample of $x$ is equally likely to belong to any of the $c$ categories, so the probability of error is $\frac{c-1}{c}$.

Therefore, the error rate of NN method is 

$$\frac{cr}{c-1} \times \frac{c-1}{c} = r = P^*$$
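
The result can be checked with a small Monte Carlo simulation. This is a sketch under stated assumptions: equal priors $P(w_i) = \frac1c$, and class-conditional densities that are uniform with height 1 on the shared interval $[0,\frac{cr}{c-1}]$ and on the private interval $[i,i+1-\frac{cr}{c-1}]$. The function names and the sample sizes are illustrative.

```python
import bisect
import random

def sample(n, c, r, rng):
    """Draw n (x, label) pairs under the assumed densities; labels are 1..c."""
    w = c * r / (c - 1)                 # width (and probability) of the shared interval
    pts = []
    for _ in range(n):
        label = rng.randint(1, c)       # equal priors
        if rng.random() < w:
            x = rng.uniform(0.0, w)     # shared interval [0, w]
        else:
            x = label + rng.uniform(0.0, 1.0 - w)   # private interval [i, i + 1 - w]
        pts.append((x, label))
    return pts

def nn_error_rate(c=3, r=0.2, n_train=5000, n_test=20000, seed=0):
    """Empirical 1-NN error rate on freshly drawn test points."""
    rng = random.Random(seed)
    train = sorted(sample(n_train, c, r, rng))
    xs = [x for x, _ in train]
    errors = 0
    for x, label in sample(n_test, c, r, rng):
        j = bisect.bisect_left(xs, x)
        # In one dimension the nearest neighbour is one of the two sorted neighbours.
        cands = [i for i in (j - 1, j) if 0 <= i < len(xs)]
        nearest = min(cands, key=lambda i: abs(xs[i] - x))
        errors += train[nearest][1] != label
    return errors / n_test

print(nn_error_rate(c=3, r=0.2))
```

With enough training samples the estimate should be close to $r$, in line with the derivation above.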

## Minkowski Distance

The Minkowski distance is defined as:

$$ D(X,Y) = (\sum_i |x_i - y_i|^p )^{1/p}$$

According to the definition of distance, a distance should obey the following rules:

$$\begin{aligned}
D(p,q) & \ge 0, D(p,q) = 0 \Leftrightarrow p = q\\
D(p,q) &= D(q,p)\\
D(p,q) &\le D(p,z) + D(q,z)
\end{aligned}$$

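Owing to the fact that $|x| = |-x| \ge 0, |x| = 0 \Leftrightarrow x = 0$, the first two rules are easily satisfied. For the third rule (the triangle inequality), writing the difference of two points as the sum of two differences reduces it to the Minkowski inequality, which we now prove for $p > 1$ (for $p = 1$ it follows directly from $|x_i + y_i| \le |x_i| + |y_i|$):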

$$ (\sum_i |x_i + y_i|^p )^{1/p} \le (\sum_i |x_i|^p )^{1/p} + (\sum_i |y_i|^p )^{1/p}$$

$$ \sum_i |x_i + y_i|^p = \sum_i |x_i + y_i| \times |x_i + y_i|^{p-1} \le \sum_i |x_i| \times |x_i + y_i|^{p-1} + \sum_i |y_i| \times |x_i + y_i|^{p-1}$$

Now we use Hölder's inequality:

$$\sum_i |u_i v_i| \le (\sum_i |u_i|^p)^{\frac1p}(\sum_i |v_i|^q)^{\frac1q}, \text{ where } p, q > 1,\ \frac1p + \frac1q = 1$$

Therefore:

$$\sum_i |x_i| \times |x_i + y_i|^{p-1} \le (\sum_i |x_i|^p)^{\frac1p}(\sum_i|x_i + y_i|^{(p-1)q})^{\frac1q}, \text{ where } \frac1p + \frac1q = 1 \Rightarrow pq - q = p$$

Therefore:

$$\sum_i |x_i| \times |x_i + y_i|^{p-1} + \sum_i |y_i| \times |x_i + y_i|^{p-1} \le (\sum_i |x_i|^p)^{\frac1p}(\sum_i |x_i + y_i|^p)^{\frac1q} + (\sum_i |y_i|^p)^{\frac1p}(\sum_i |x_i + y_i|^p)^{\frac1q}$$

Therefore, since $\frac1p + \frac1q = 1$, we get

$$\begin{aligned}
&\sum_i |x_i + y_i|^p \le ((\sum_i |y_i|^p)^{\frac1p} + (\sum_i |x_i|^p)^{\frac1p})(\sum_i |x_i + y_i|^p)^{\frac1q}\\
&\Leftrightarrow \frac{\sum_i |x_i + y_i|^p}{(\sum_i |x_i + y_i|^p)^{\frac1q}} \le (\sum_i |y_i|^p)^{\frac1p} + (\sum_i |x_i|^p)^{\frac1p}\\
&\Leftrightarrow (\sum_i |x_i + y_i|^p )^{1/p} \le (\sum_i |x_i|^p )^{1/p} + (\sum_i |y_i|^p )^{1/p}
\end{aligned}$$

The proof above uses [Hölder's inequality](https://en.wikipedia.org/wiki/Hölder%27s_inequality#Proof_of_Hölder's_inequality).
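
As a sanity check on the definition and on the triangle inequality, here is a minimal pure-Python sketch. The function names `minkowski` and `triangle_holds` are illustrative, and $p = \infty$ is handled as the Chebyshev limit.

```python
import math
import random

def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)                       # limit p -> infinity (Chebyshev)
    return sum(d ** p for d in diffs) ** (1.0 / p)

def triangle_holds(p, trials=1000, dim=5, seed=0, tol=1e-9):
    """Check D(x, y) <= D(x, z) + D(y, z) on random triples of points."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y, z = ([rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3))
        if minkowski(x, y, p) > minkowski(x, z, p) + minkowski(y, z, p) + tol:
            return False
    return True

for p in (1, 1.5, 2, 3, math.inf):
    print(p, minkowski([0, 0], [3, 4], p), triangle_holds(p))
```

The random test only samples the inequality; it does not replace the proof.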


## Programming

### Getting the MNIST data

To keep the uploaded files small, we download the MNIST data from the website on each run.

In [1]:
import gzip
import matplotlib.pyplot as plt
import numpy as np
import os
import shutil
import struct
import sys
try: 
    from urllib.request import urlretrieve 
except ImportError: 
    from urllib import urlretrieve
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
import time

In [3]:
def loadData(src, cimg):
    print ('Downloading ' + src)
    gzfname, h = urlretrieve(src, './delete.me')
    try:
        with gzip.open(gzfname) as gz:
            # Read magic number (big-endian, like the other header fields; 0x803 marks an image file).
            n = struct.unpack('>I', gz.read(4))
            if n[0] != 0x00000803:
                raise Exception('Invalid file: unexpected magic number.')
            # Read number of entries.
            n = struct.unpack('>I', gz.read(4))[0]
            if n != cimg:
                raise Exception('Invalid file: expected {0} entries.'.format(cimg))
            crow = struct.unpack('>I', gz.read(4))[0]
            ccol = struct.unpack('>I', gz.read(4))[0]
            if crow != 28 or ccol != 28:
                raise Exception('Invalid file: expected 28 rows/cols per image.')
            # Read data.
            res = np.frombuffer(gz.read(cimg * crow * ccol), dtype = np.uint8)
    finally:
        os.remove(gzfname)
    return res.reshape((cimg, crow * ccol))

def loadLabels(src, cimg):
    print ('Downloading ' + src)
    gzfname, h = urlretrieve(src, './delete.me')
    try:
        with gzip.open(gzfname) as gz:
            # Read magic number (big-endian, like the other header fields; 0x801 marks a label file).
            n = struct.unpack('>I', gz.read(4))
            if n[0] != 0x00000801:
                raise Exception('Invalid file: unexpected magic number.')
            # Read number of entries.
            n = struct.unpack('>I', gz.read(4))
            if n[0] != cimg:
                raise Exception('Invalid file: expected {0} rows.'.format(cimg))
            # Read labels.
            res = np.frombuffer(gz.read(cimg), dtype = np.uint8)
    finally:
        os.remove(gzfname)
    return res.reshape((cimg, 1))

def try_download(dataSrc, labelsSrc, cimg):
    data = loadData(dataSrc, cimg)
    labels = loadLabels(labelsSrc, cimg)
    return data.astype(np.float32)/256.0,labels

# URLs for the train image and label data
url_train_image = 'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz'
url_train_labels = 'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz'
num_train_samples = 60000
train_data,train_labels = try_download(url_train_image, url_train_labels, num_train_samples)

# URLs for the test image and label data
url_test_image = 'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz'
url_test_labels = 'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
num_test_samples = 10000
test_data,test_labels = try_download(url_test_image, url_test_labels, num_test_samples)

train_sum = 60000
test_sum = 10000

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz


### Kernel of NN method

In [None]:
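def main_nn(size=60000,k=5,p=2):
    # Train on a random subset of `size` samples. The full set needs no split
    # (recent scikit-learn versions reject an empty test split).
    if size < train_sum:
        data_used, _, label_used, _ = train_test_split(train_data,train_labels,train_size = size)
    else:
        data_used, label_used = train_data, train_labels
    neigh = KNeighborsClassifier(n_neighbors=k,p=p,n_jobs=-1)
    # time.clock() was removed in Python 3.8; perf_counter() measures wall-clock time.
    start = time.perf_counter()
    neigh.fit(data_used,np.ravel(label_used))
    predict = neigh.predict(test_data)
    end = time.perf_counter()
    return accuracy_score(test_labels,predict),end-start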

### The influence of number of samples on the performance of the classifier

According to the algorithm, the space complexity is $\mathcal O(N)$ (assuming the sort can be done in place). The time complexity is $\mathcal O(N\log N) + \mathcal O(Nd)$, where $N$ is the number of samples and $d$ is the dimension of the features.
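
The two terms of this estimate can be seen in a brute-force sketch of classifying one query. This is an illustration only, assuming a plain linear scan: the `KNeighborsClassifier` used in the experiments may pick a different search structure internally, and `knn_predict` is a hypothetical helper name.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k=5):
    """Classify one query by majority vote among its k nearest training samples."""
    # Squared Euclidean distance to every training sample: O(N d).
    dists = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x]
    # Sort the sample indices by distance: O(N log N).
    order = sorted(range(len(train_x)), key=dists.__getitem__)
    # Majority vote among the k closest samples.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_x, train_y, [0.2, 0.2], k=3))
print(knn_predict(train_x, train_y, [5.5, 5.5], k=3))
```

Replacing the full sort by a partial selection of the $k$ smallest distances would lower the sorting term, but the distance computation still touches every sample.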

The accuracy of the classifier is expected to keep increasing as the number of samples grows, but at a diminishing rate.

In this experiment, we set $k = 5$ and use the Euclidean distance.

In [None]:
for samples in [6,60,600,6000,60000]:
    p,t = main_nn(size=samples)
    print('Samples = {}, Accuracy = {:.1f}%, TimeElapsed = {:e}s'.format(samples,p * 100,t))

### The influence of $k$ on the performance of the classifier

We test $k = 1,3,5,7,9,11$ with 6,000 samples to check the influence of $k$ on the performance of the classifier. The distance function is still the Euclidean distance.

As the results show, the accuracy does not increase with $k$. This might be because the number of samples is too large, so we repeat the experiment with 600 samples.

With 600 samples, the accuracy of the classifier drops as $k$ grows. This might be because the training set is too small, so that the more distant neighbours (for example the 9th or 11th nearest sample) often do not belong to the same category as the test sample.

In [None]:
print('In 6,000 samples: ')
for k in [1,3,5,7,9,11]:
    p,t = main_nn(size=6000,k=k)
    print('k = {}, Accuracy = {:.1f}%, TimeElapsed = {:e}s'.format(k,p * 100,t))
print('In 600 samples: ')    
for k in [1,3,5,7,9,11]:
    p,t = main_nn(size=600, k=k)
    print('k = {}, Accuracy = {:.1f}%, TimeElapsed = {:e}s'.format(k,p * 100,t))

### Using different distance function

We run this experiment with the Minkowski distance for $p = 1$ (Manhattan distance), $p = 2$ (Euclidean distance, used above) and $p = \infty$ (Chebyshev distance).

In [None]:
for s in [1,2,float('inf')]:
    p,t = main_nn(p=s)
    print('p = {}, Accuracy = {:.1f}%, TimeElapsed = {:e}s'.format(s,p * 100,t))

### Finding the weight for each element

Our idea is to split the training dataset into a training part and a validation part, and to use the validation part to correct the weight vector $\vec a$. The whole algorithm is given below.

In [None]:
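def main_nn_with_a(size=60000):
    # Random subset of `size` samples. The full set needs no split
    # (recent scikit-learn versions reject an empty test split).
    if size < train_sum:
        data_used, _, label_used, _ = train_test_split(train_data,train_labels,train_size = size)
    else:
        data_used, label_used = train_data, train_labels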
    A = np.zeros(28 * 28)
    lr = 0.01
    epoch = 1
    batch_size = 5000
    inner_iter = 10
    for batch in range(epoch):
        eA = 1.0 / (1 + np.exp(-A))
        td,vd,tl,vl = train_test_split(train_data,train_labels,train_size = batch_size, test_size = batch_size)
        for inner in range(inner_iter):
            neigh = KNeighborsClassifier(n_neighbors=1,n_jobs=-1)
            neigh.fit(td * eA,np.ravel(tl))
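            # Index of the nearest training sample for every validation sample.
            pre = neigh.kneighbors(vd * eA, return_distance=False).ravel()
            pre_label = np.ravel(tl)[pre]
            nei = td[pre]
            acc = 0
            for i in range(batch_size):
                delta = nei[i] - vd[i]
                # Per-pixel difference normalised by the squared norm (guarded against zero).
                l = np.abs(delta) / max(np.linalg.norm(delta) ** 2, 1e-12)
                if pre_label[i] != vl[i, 0]:
                    # Misclassified: raise the weights of the pixels that differ.
                    A = A + lr * l
                else:
                    # Correct: lower them slightly.
                    acc = acc + 1
                    A = A - 1e-2 * lr * l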
#         if batch % 100 == 0:
#             print ('Minibatch {}: accuracy: {:.1f}%, max A = {}'
#                    .format(batch,acc / batch_size * 100, np.max(eA)))
            
    eA = 1.0 / (1 + np.exp(-A))
    neigh = KNeighborsClassifier(n_neighbors=1,n_jobs=-1)
    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    neigh.fit(data_used * eA,np.ravel(label_used))
    predict = neigh.predict(test_data * eA)
    end = time.perf_counter()
    neigh_old = KNeighborsClassifier(n_neighbors=1,n_jobs=-1)
    
    start_old = time.perf_counter()
    neigh_old.fit(data_used,np.ravel(label_used))
    predict_old = neigh_old.predict(test_data)
    end_old = time.perf_counter()
    return accuracy_score(test_labels,predict),end-start, \
        accuracy_score(test_labels,predict_old),end_old-start_old, eA

p,t,p_old,t_old,A = main_nn_with_a(size=60000)
print('New: Accuracy = {:.1f}%, TimeElapsed = {:e}s\tOld: Accuracy = {:.1f}%, TimeElapsed = {:e}s'
      .format(p * 100,t,p_old * 100,t_old))
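
Multiplying every feature by `eA` before the nearest-neighbour search amounts to using a weighted Euclidean distance with per-feature weights $\sigma(a_j)^2$. The short sketch below illustrates this equivalence on plain lists; the function names are illustrative and are not part of the code above.

```python
import math

def sigmoid(a):
    """Logistic function, the same mapping as 1 / (1 + exp(-A)) in the cell above."""
    return 1.0 / (1.0 + math.exp(-a))

def scaled_distance(x, y, A):
    """Euclidean distance after multiplying feature j of both points by sigmoid(A[j])."""
    w = [sigmoid(a) for a in A]
    return math.sqrt(sum((wj * xj - wj * yj) ** 2 for wj, xj, yj in zip(w, x, y)))

def weighted_distance(x, y, A):
    """Weighted Euclidean distance with weights sigmoid(A[j]) ** 2."""
    return math.sqrt(sum(sigmoid(a) ** 2 * (xj - yj) ** 2
                         for a, xj, yj in zip(A, x, y)))

x, y, A = [0.0, 0.0, 1.0], [3.0, 4.0, 1.0], [2.0, -1.0, 0.5]
print(scaled_distance(x, y, A), weighted_distance(x, y, A))
```

With `A` initialised to zeros every weight equals $0.5$, so all distances are halved and the neighbour ranking is the same as for the unweighted classifier; only the updates to `A` can change it.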

### Tangent distance on MNIST