# COMP5328 - Advanced Machine Learning

## Assignment 2 - Learning with Noisy Data

**Lecturer**: Tongliang Liu

**Tutors** : Zhuozhuo Tu, Liu Liu

**Group Members** : Chen Chen, Yutong Cao, Yixiong Fang

**Objectives:**

The goal of this assignment is to study how to learn with label noise. Specifically, you need to use at least two methods to classify real-world images with noisy labels into a set of categories. Then, you need to compare the performance of these classifiers and analyze the robustness of the label-noise methods.
The datasets are quite large, so you need to choose your methods carefully and perhaps perform a pre-processing step to reduce the amount of computation. Part of your marks will depend on the performance of your classifier on the test set.

**Requirements:**
- sklearn 0.20.0 (the development version, download here)
- multiprocessing
- numpy
- matplotlib

## 1. Import Library

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from os import cpu_count
from sklearn.decomposition import IncrementalPCA as PCA
from sklearn.model_selection import train_test_split
from multiprocessing import Pool
from scipy.spatial.distance import cdist
from numpy import exp  # scipy.exp is a deprecated alias of numpy.exp
from itertools import product

import csv
import time
import numpy as np
import matplotlib.pyplot as plt

## 2. Load Data

1. Training features and labels:
    - Xtr: shape=(10000, d). There are 10,000 instances. The raw data are 28 × 28 (for Fashion-MNIST) or 32 × 32 × 3 (for CIFAR) images, which are reshaped to features with dimension d = 784 or d = 3072.
    - Str: shape=(10000, 1). There are 10,000 noisy labels for the corresponding instances.
    - These 10,000 instances belong to two categories, labelled 0 and 1. The training labels are noisy: the flip rates are $\rho_0 = p(S = 1 \mid Y = 0) = 0.2$ and $\rho_1 = p(S = 0 \mid Y = 1) = 0.4$, where $S$ and $Y$ are the variables of noisy labels and true labels, respectively.
    - Do not use all 10,000 examples to train your models. You are required to independently and randomly sample 8,000 of the 10,000 examples to train every classifier. The reported performance of each model should be the average performance of at least 10 learned classifiers.
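The sampling protocol above can be sketched as follows. This is only an illustration, not the notebook's own evaluation loop: the helper name `repeated_subsample_scores` and the `fit_and_score` callback are hypothetical, and the toy data stands in for `Xtr` and `Str`. It assumes NumPy 1.17 or later for `np.random.default_rng`.

```python
import numpy as np

def repeated_subsample_scores(X, s, fit_and_score, n_runs=10, n_train=8000, seed=0):
    """Average a score over several independent random training subsets.

    fit_and_score is a hypothetical callback: it receives the sampled
    features and noisy labels, trains a classifier, and returns one number.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_runs):
        # Draw n_train distinct row indices; each run is sampled independently
        idx = rng.choice(X.shape[0], size=n_train, replace=False)
        scores.append(fit_and_score(X[idx], s[idx]))
    return float(np.mean(scores)), float(np.std(scores))

# Toy stand-in data: the "score" here is just the fraction of label-1 examples
X_demo = np.zeros((10000, 4))
s_demo = np.arange(10000) % 2
mean_score, std_score = repeated_subsample_scores(
    X_demo, s_demo, lambda Xs, ss: ss.mean()
)
print(mean_score, std_score)
```

In practice `fit_and_score` would train one of the label-noise methods on the sampled subset and return its accuracy on `Xts`, `Yts`, so the reported figure is the mean over the 10 runs.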

In [17]:
def load_data(filepath, name, standardlize=False, PCA=False, n=100):
    # NOTE: the PCA flag shadows the imported PCA class inside this function
    dataset = np.load(filepath)
    Xtr = dataset['Xtr'].astype(float)
    Str = dataset['Str'].ravel()
    Xts = dataset['Xts'].astype(float)
    Yts = dataset['Yts'].ravel()
    print('====Loading Dataset: {}===='.format(name))
    print('Training Set:')
    print('Xtr.shape = {}'.format(Xtr.shape))
    print('Str.shape = {}'.format(Str.shape))
    print('Testing Set:')
    print('Xts.shape = {}'.format(Xts.shape))
    print('Yts.shape = {}'.format(Yts.shape))
    if standardlize:
        # TODO
        pass  # `continue` is only valid inside a loop and is a SyntaxError here
    if PCA:
        # TODO
        pass
    return Xtr, Str, Xts, Yts
Xtr, Str, Xts, Yts = load_data('../input_data/mnist_dataset.npz', 'mnist')

====Loading Dataset: mnist====
Training Set:
Xtr.shape = (10000, 784)
Str.shape = (10000,)
Testing Set:
Xts.shape = (2000, 784)
Yts.shape = (2000,)


In [None]:
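def estimateBeta(S, prob, rho0, rho1):
    """Per-example weights from noisy labels S in {0, 1} and prob, an (n, 2)
    array whose column k is the estimated p(S=k|x)."""
    S = S.astype(int)
    # rho[s] is the rate at which the other class is flipped into label s
    rho = np.array([rho1, rho0])
    # Probability assigned to each example's observed noisy label
    prob = prob[:, 0] * (1 - S) + prob[:, 1] * S
    beta = (prob - rho[S].ravel()) / (1 - rho0 - rho1) / prob
    return beta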
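
For reference, the quantity returned by `estimateBeta` has the form of an importance weight for class-conditional label noise. The sketch below is a derivation under one stated assumption, namely that the noisy label $S$ depends on $x$ only through the true label $Y$; it uses the flip rates $\rho_0 = p(S = 1 \mid Y = 0)$ and $\rho_1 = p(S = 0 \mid Y = 1)$ defined earlier.

$$
p(S = 1 \mid x) = (1 - \rho_1)\, p(Y = 1 \mid x) + \rho_0\, p(Y = 0 \mid x) = (1 - \rho_0 - \rho_1)\, p(Y = 1 \mid x) + \rho_0
$$

$$
p(Y = 1 \mid x) = \frac{p(S = 1 \mid x) - \rho_0}{1 - \rho_0 - \rho_1}, \qquad
p(Y = 0 \mid x) = \frac{p(S = 0 \mid x) - \rho_1}{1 - \rho_0 - \rho_1}
$$

$$
\beta(x, s) = \frac{p(Y = s \mid x)}{p(S = s \mid x)} = \frac{p(S = s \mid x) - \rho_{1 - s}}{(1 - \rho_0 - \rho_1)\, p(S = s \mid x)}
$$

As a worked example with $\rho_0 = 0.2$ and $\rho_1 = 0.4$: an instance with $s = 1$ and $p(S = 1 \mid x) = 0.6$ gets $\beta = (0.6 - 0.2) / (0.4 \times 0.6) = 5/3 \approx 1.67$. The weight becomes negative whenever the estimated $p(S = s \mid x)$ falls below $\rho_{1 - s}$.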

In [3]:
dataset1 = np.load('../input_data/mnist_dataset.npz')
dataset2 = np.load('../input_data/cifar_dataset.npz')
size_image1 = 28
dim_image1 = 1
size_image2 = 32
dim_image2 = 3

In [5]:
print(dataset1)

<numpy.lib.npyio.NpzFile object at 0x10b7895f8>


In [None]:
#to store the data splits
data_cache={}

#transform dataset1
Xtr1 = dataset1 ['Xtr'].astype(float)
Str1 = dataset1 ['Str'].ravel()
Xts1 = dataset1 ['Xts'].astype(float)
Yts1 = dataset1 ['Yts'].ravel()
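scaler = StandardScaler()
# Transposing makes StandardScaler normalize each image (row) across its own pixels,
# so refitting on the test set does not use any cross-sample statistics
Xts1 = scaler.fit_transform(Xts1.T).T
Xtr1 = scaler.fit_transform(Xtr1.T).T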
data_cache[1]=(Xtr1,Str1,Xts1,Yts1)

#transform dataset2
Xtr2 = dataset2 ['Xtr'].astype(float)
Str2 = dataset2 ['Str'].ravel()
Xts2 = dataset2 ['Xts'].astype(float)
Yts2 = dataset2 ['Yts'].ravel()
#scaler = StandardScaler()
Xts2 = scaler.fit_transform(Xts2.T).T
Xtr2 = scaler.fit_transform(Xtr2.T).T
data_cache[2]=(Xtr2,Str2,Xts2,Yts2)

pca = PCA(n_components=100)
pca.fit(Xtr2)
Xtr2=pca.transform(Xtr2)
Xts2=pca.transform(Xts2)
print('pca explained variance:',sum(pca.explained_variance_ratio_))
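plot = False  # set to True to preview the first 30 test images
dset = 1      # dataset to preview: 1 = dataset1 (28x28x1), 2 = dataset2 (32x32x3)
if plot:
    if dset == 1:
        xplot, labels = Xts1, Yts1
        dim_image, size_image = dim_image1, size_image1
    else:
        # Map the PCA-reduced test set back to pixel space, then re-standardize each image
        xplot = scaler.fit_transform(pca.inverse_transform(Xts2).T).T
        labels = Yts2
        dim_image, size_image = dim_image2, size_image2
    plt.figure()
    for i in range(30):
        image = xplot[i].reshape(dim_image, size_image, size_image).transpose([1, 2, 0])
        plt.subplot(5, 6, i + 1)
        # Drop the single channel axis so grayscale images have shape (H, W)
        plt.imshow(image.squeeze(), interpolation='bicubic')
        plt.title(labels[i])
    plt.show()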