# Information Retrieval in High Dimensional Data
## Lab #2, 26.10.2017
## Statistical Decision making

### Task 1

Consider the two dimensional, discrete random variable $X = [X_1\,X_2]^T$ subjected to the joint probability density $p_X$ as described in the following table.

$\begin{array}{c||cc} p_X(X_1,X_2) & X_2=0 & X_2=1\\ \hline
X_1 = 0 & 0.4 & 0.3 \\
X_1 = 1 & 0.2 & 0.1
\end{array}$

a) Compute the marginal probability densities $p_{X_1}, p_{X_2}$ and the coonditional probability $P(X_2=0|X_1=0)$ as well as the expected value $\mathbb{E}[X]$ and the covariance matrix $\mathbb{E}[(X-\mathbb{E}[X])(X-\mathbb{E}[X])^T]$.

In [2]:
import numpy as np

In [11]:
# Init needed variables
joint_prob = np.array([[0.4, 0.3], [0.2, 0.1]])
results = np.array([[0, 1, 0, 1],
                    [0, 0, 1, 1]])

In [12]:
px1 = np.sum(joint_prob, axis=1)
px2 = np.sum(joint_prob, axis=0)

print ("Marginal distribution density px1: {}".format(px1))
print ("Marginal distribution density px2: ",px2)

Marginal distribution density px1: [ 0.7  0.3]
Marginal distribution density px2:  [ 0.6  0.4]


In [13]:
cond_prob = joint_prob[0][0]/px1[0]
print ("Conditional probability P(X2=0|X1=0): {:.3f}".format(cond_prob))

Conditional probability P(X2=0|X1=0): 0.571


In [14]:
EX = np.dot(results, np.ravel(joint_prob, 'F'))
print ("Expected Value EX:", EX)

Expected Value EX: [ 0.3  0.4]


In [15]:
results_centered = results - np.reshape(EX, (2,1))
CovX = np.dot(np.dot(results_centered, np.diag(joint_prob.ravel('F'))), results_centered.T)
print ("Covariance Matrix: \n{}".format(CovX))

Covariance Matrix: 
[[ 0.21 -0.02]
 [-0.02  0.24]]


b) Write a PYTHON function toyrnd that expects the positive integer parameter n as its input an returns a matrix $X$ of size (2,n), containing $n$ samples drawn independently from the distribution $p_X$, as its output.

In [20]:
def toyrnd(n):
    x1 = np.random.choice([0, 1], (1, n), p=px1)
    x2 = np.random.choice([0, 1], (1, n), p=px2)
    return np.vstack((x1,x2))

[[0 1 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 0 0 1 0 1 0 1 0
  0 0 1 0 1 0 1 1 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0
  0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0]
 [1 0 0 1 1 0 0 0 0 1 1 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0
  0 0 1 0 0 1 1 1 1 0 1 1 1 1 1 1 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0
  1 1 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0]]


c) Verify your results in a) by generating 10000 samples with toyrnd and computing the respective empirical values

In [48]:
samples = toyrnd(10000)

p_x1equ1 = (samples[0].sum()/len(samples[0]))
p_x2equ1 = (samples[1].sum()/len(samples[1]))
p_x1 = np.array([1-p_x1equ1, p_x1equ1])
p_x2 = np.array([1-p_x2equ1, p_x2equ1])

print ("Empirical Marginal distribution density p_x1: {}".format(p_x1))
print ("Empirical Marginal distribution density p_x2: ",p_x2)

Empirical Marginal distribution density p_x1: [ 0.6986  0.3014]
Empirical Marginal distribution density p_x2:  [ 0.5997  0.4003]


In [49]:
cond_prob_empirical = samples[1, samples[0,:]==0]
np.sum(cond_prob_empirical==0)/len(cond_prob_empirical)

0.5986258230747209

### Task 2
The MNS trainign set consists of handwritten digits from 0 to 9, stored as PNG files of size 28 x 28 an indexed by label. Download the provided ZIP file from Moodle and make yourself familiar with the directory structure.

a) Grayscale images are typically described as matrices of uint8 values. For numerical calculations, it is more sensible to work with floating point numbers. Load two (arbitrary) images from the database and convert them to matrices I1 and I2 of float64 values in the interval $[0, 1]$.

In [3]:
import imageio
import os

In [7]:
path = './mnist/d4/'
filenames = os.listdir(path)
im1 = imageio.imread(path + filenames[0])
I1 = im1/255
im2 = imageio.imread(path +  filenames[10])
I2 = im2/255

b) The matrix equivalent of te euclidean norm $\| \cdot \|_2$ is the $Frobenius$ norm. For any matrix $\mathbf{A} \in \mathbb{R}^{m\, \times \, n}$, it is defined as

\begin{equation} \|\mathbf{A}\|_F = \sqrt{tr(\mathbf{A}^T\mathbf{A})} \end{equation}

where tr denotes the trace of a matrix. Compute the distance $\|\mathbf{I}_1 - \mathbf{I}_2 \|_F$ between the images I1 and I2 using three different procedures in PYTHON:
- Running the numpy.linalg.norm function with the 'fro' parameter
- Directly applying formula (1)
- Computing the euclidean norm between the vectorized images

In [15]:
D = I1 - I2

In [16]:
dist1 = np.linalg.norm(D, ord='fro')
dist1

9.0642043267400751

In [20]:
dist2 = np.sqrt(np.trace(np.dot(D.T,D)))
dist2

9.0642043267400751

In [21]:
d = np.reshape(D, 784)
dist3 = np.linalg.norm(d)
dist3

9.0642043267400751

c) In the following, we want to solve a simple classification problem by applying <i>k-Nearest Neighbors</i>. To this end, choose two digits classes, e.g. 0 and 1, and lod n_train = 500 images from each class to the workspace. Convert them according to subtask a) and store them in vectorized form in the matrix X_train of size (784, 2*n_train). Provide an indicator vector Y_train pf length 2*n_train that assigns the respective digit class label to each column of X_train.  
From each of the two classes, choose another set of n_test=10 images and create according matrices X_test and Y_test. Now, for each sample in the test set, determine the k = 20 training samples with the smallest Frobenius distance to it and store their indices in the (2*n_test, k) matrix NN. Generate a vector Y_kNN containing the respective estimated class labels by performing a majority vote on NN. Compare the result with Y_test.

In [19]:
X_train = np.zeros((784, 1000))
Y_train = np.zeros(1000, dtype='int64')
digitclass0 = 0
digitclass1 = 1
path0 = './mnist/d' + str(digitclass0) +'/'
path1 = './mnist/d' + str(digitclass1) +'/'
filenames0 = os.listdir(path0)
filenames1 = os.listdir(path1)

for i in range(500):
    im0 = imageio.imread(path0 + filenames0[i])
    im1 = imageio.imread(path1 + filenames1[i])
    X_train[:, i] = np.reshape(im0, 784)/255
    X_train[:, 500 + i] = np.reshape(im1, 784)/255
    Y_train[i] = digitclass0
    Y_train[500 + i] = digitclass1

In [20]:
X_test = np.zeros((784, 20))
Y_test = np.zeros(20, dtype='int64')

for i in range(500, 510):
    im0 = imageio.imread(path0 + filenames0[i])
    im1 = imageio.imread(path1 + filenames1[i])
    X_test[:, i-500] = np.reshape(im0, 784)
    X_test[:, i-490] = np.reshape(im1, 784)
    Y_test[i - 500] = digitclass0
    Y_test[i - 490]= digitclass1

In [30]:
distances = np.zeros(1000)
NN = np.zeros((20, 20))
Y_kNN = np.zeros(20, dtype='int64')
for i in range(20):
    for j in range(1000):
        DIFF = X_test[:, i] - X_train[:, j]
        distances[j] = np.linalg.norm(DIFF)
    indices = np.argsort(distances)
    labels = Y_train[indices[:20]]
    bins = np.bincount(labels)
    Y_kNN[i] = np.argsort(bins)[-1]

In [32]:
print('Labels determined by k-Nearest Neighbors:')
print(Y_kNN)
print('\nLabels of training set')
print(Y_test)

Labels determined by k-Nearest Neighbors:
[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1]

Labels of training set
[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1]
