**<p style='text-align: right;'>Ver. 1.0.1</p>**

# Introductory Applied Machine Learning (IAML) Coursework 2 - Semester 2, 2020-21

### Authors: Hiroshi Shimodaira and Jinhong Lu

## Important Instructions

#### It is important that you follow the instructions below carefully for things to work properly.

You need to set up and activate your environment as you would for your labs (see the Learn section on Labs).  **You will need to use Noteable to create one of the files you will submit (the PDF)**.  Do **NOT** create the PDF in any other way, as we will not be able to mark it.  If you want to develop your answers in your own environment, make sure you are using the same packages we are using by running the imports cell below.

Read the instructions in this notebook carefully, especially where asked to name variables with a specific name. Wherever you are required to produce code you should use code cells, otherwise you should use markdown cells to report results and explain answers. In most cases we indicate the nature of answer we are expecting (code/text), and also provide the required code/markdown cell.

- We will use the IAML Learn page for any announcements, updates, and FAQs on this assignment. Please visit the page frequently to find the latest information.
- Data files that you will be using are located in the ./datasets directory that is included in the [git repository](https://github.com/uoe-iaml/DL-S2-2021-CW2) for this coursework.
- There is a helper file 'iaml02cw2_helpers.py' in the git repository, which you should upload to your environment.
- Some of the topics in this coursework are covered in weeks 7 and 8 of the course. Focus first on questions on topics that you have covered already, and come back to the other questions as the lectures progress.
- Keep your answers brief and concise.
- Make sure to show all your code/working.
- Write readable code. While we do not expect you to follow PEP8 to the letter, the code should be adequately understandable, with plots/visualisations correctly labelled. Do use inline comments when doing something non-standard.
- When asked to present numerical values, make sure to represent real numbers in the appropriate precision to exemplify your answer. 
- When you use libraries specified in this coursework, you should use the default parameters unless specified explicitly.
- The criteria on which you will be judged include the quality of the textual answers and/or any plots asked for.

- You will see <html>\\pagebreak</html> at the start of each subquestion.  ***Do not remove these, if you do we will not be able to mark your coursework.***

#### Good Scholarly Practice
Please remember the University requirement regarding all assessed work for credit. Details about this can be found at:
http://web.inf.ed.ac.uk/infweb/admin/policies/academic-misconduct
Specifically, this assignment should be your own individual work. We will employ tools for detecting misconduct.

Moreover, please note that Piazza is NOT a forum for discussing the solutions of the assignment. You may, in exceptional circumstances, ask private questions.

### SUBMISSION Mechanics
This assignment will account for 30% of your final mark. We ask you to submit answers to all questions.

You will submit (1) a PDF of your Notebook via Gradescope, and (2) the Notebook itself via Learn.  Your grade will be based on the PDF; we will only use the Notebook if we need to see details.  **You must use the following procedure to create the materials to submit**.

1. Make sure your Notebook and the datasets are in Noteable and will run.  If you developed your answers in Noteable, this is already done.

2. Select **Kernel->Restart & Run All** to create a clean copy of your submission; this will run the cells in order from top to bottom.  This may take a while (a few hours) to complete. Ensure that all the output and plots have completed before you proceed.

3. Select **File->Download as->PDF via LaTeX (.pdf)** and wait for the PDF to be created and downloaded.

4. Select **File->Download as->Notebook (.ipynb)**

5. You now should have in your download folder the pdf and the notebook.  Rename them sNNNNNNN.pdf and sNNNNNNN.ipynb, where sNNNNNNN is your matriculation number (student number).

**Details on submission instructions will be announced and documented on Learn well before the deadline**. 

The submission deadline for this assignment is **23rd March 2021 at 16:00 UK time (UTC)**.  Don't leave it to the last minute!

### Tips on experiments
- Some experiments may take a long time to complete (e.g. more than 10 minutes, or possibly more than an hour for Question 3.2, depending on conditions). It is a good idea to test your code with a small number of samples while debugging. You should use the whole data set when you write your report.
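One way to follow the tip above is to debug on a small, class-balanced subset before running on everything. The sketch below is only an illustration on toy arrays: `debug_subset`, `X_toy`, and `y_toy` are hypothetical names that are not part of the coursework files, and it uses its own random generator so that the global NumPy seed is left untouched.

```python
import numpy as np

def debug_subset(X, y, n_per_class, seed=0):
    """Return a small class-balanced subset of (X, y) for quick debugging runs."""
    rng = np.random.default_rng(seed)  # local generator; global seed is unaffected
    keep = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)   # indices of all samples of class c
        keep.append(rng.choice(idx, size=min(n_per_class, idx.size), replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]

# Toy stand-in data (20 samples, 4 classes), not the coursework data
X_toy = np.arange(40, dtype=float).reshape(20, 2)
y_toy = np.repeat(np.arange(4), 5)

X_dbg, y_dbg = debug_subset(X_toy, y_toy, n_per_class=2)
print(X_dbg.shape, np.bincount(y_dbg))
```

Remember to switch back to the whole data set for the results you report.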

#### IMPORTS
Execute the cell below to import all packages you will be using for this assignment.  If you are not using Noteable, make sure the python and package version numbers reported match the python and package numbers specified in the comment at the end of this cell.

In [None]:
import os
import platform
import sys
import sklearn
import scipy
import numpy as np
np.random.seed(260393)
# import pandas as pd
# import seaborn as sns
import matplotlib as mp
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster, leaves_list
from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture

import warnings 
warnings.filterwarnings('ignore')

print("All packages imported!")
print("python=={}".format(platform.python_version()))
# print("seaborn=={}".format(sns.__version__))
print("scikit-learn=={}".format(sklearn.__version__))
# print("pandas=={}".format(pd.__version__))
print("numpy=={}".format(np.__version__))
print("scipy=={}".format(scipy.__version__))
print("matplotlib=={}".format(mp.__version__))

# You should see this output:
# python==3.7.6
# seaborn==0.11.0        (printed only if the seaborn lines above are uncommented)
# scikit-learn==0.23.2
# pandas==1.1.4          (printed only if the pandas lines above are uncommented)
# numpy==1.19.4
# scipy==1.5.3
# matplotlib==3.2.2

\pagebreak

# Question 1 Data analysis

#### 69 marks out of 163 for this coursework

### EMNIST Handwritten Character Dataset

This question employs the [EMNIST handwritten character data set](https://www.nist.gov/itl/iad/image-group/emnist-dataset). Each character image is represented as 28-by-28 pixels in gray scale (ranging from 0 to 255), stored as a row vector of 784 elements (28 × 28 = 784) in column-major order. The coursework uses a subset of the original EMNIST data set, restricted to the 26 lower-case letters of the English alphabet. Label numbers are given in alphabetical order, where label 0 corresponds to 'a' and 25 to 'z'. There are 1800 training samples and 300 test samples for each class. Note that you will find some errors (wrong labels and wrong letters) in the data set, but we use the data set as it is.
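The label convention described above (0 for 'a' up to 25 for 'z') can be sketched with the standard library. The helper names `label_to_letter` and `letter_to_label` are hypothetical and are not provided by 'iaml02cw2_helpers.py'.

```python
import string

def label_to_letter(label):
    """Map a 0-based class label (0..25) to its lower-case letter."""
    return string.ascii_lowercase[label]

def letter_to_label(letter):
    """Map a lower-case letter to its 0-based class label."""
    return string.ascii_lowercase.index(letter)

print(label_to_letter(0), label_to_letter(25))  # a z
print(letter_to_label('d'))                     # 3
```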

***Loading data:***
Upload the data set file "data1.mat" to your environment, make sure that you have "iaml02cw2_helpers.py" in your environment, and run the following cell.

In [None]:
# Load the data set and apply some changes
from iaml02cw2_helpers import *
Xtrn_org, Ytrn_org, Xtst_org, Ytst_org = load_EMNIST_subset()
Xtrn = Xtrn_org / 255.0   # Training data : (46800, 784)
Xtst = Xtst_org / 255.0   # Testing data : (7800, 784)
Ytrn = Ytrn_org - 1       # Labels for Xtrn : (46800,)
Ytst = Ytst_org - 1       # Labels for Xtst : (7800,)
Xmean = np.mean(Xtrn, axis=0)
Xtrn_nm = Xtrn - Xmean; Xtst_nm = Xtst - Xmean  # Mean-normalised versions of data


Xtrn and Ytrn are training data and corresponding labels, whereas Xtst and Ytst are test data and labels. Xtrn_nm and Xtst_nm are mean-normalised versions of Xtrn and Xtst, respectively.

\pagebreak

# ========== Question 1.1 --- [4 marks] ==========

Show the minimum, maximum, mean, and variance of the pixel values for the first sample in Xtrn.

In [None]:
# Your code goes here

\pagebreak

# ========== Question 1.2 --- [6 marks] ==========


Display the images of the first and last sample in the training data Xtrn in the following two ways.

1. [Code] Display the images using Matplotlib's imshow(). You should display the images in gray scale.
2. [Code] Display the images using the print() function, where you display a character '\*' when the value of a pixel is greater than 0 (zero), and display ' ' (space) otherwise, so that an image is displayed using 28 lines, each of which has a length of 28 characters.

Note that an image of 28-by-28 pixels is stored in a vector of 784 elements in ***column-major order*** instead of row-major order. You need to be careful about the order when you recover the original 28-by-28 array from a vector so that the image is displayed properly.
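To make the difference between the two orderings concrete, here is a minimal sketch on a toy 2-by-3 array (not EMNIST data). It assumes NumPy's `order='F'` option, which selects column-major (Fortran-style) ordering, whereas NumPy's default is row-major.

```python
import numpy as np

# A toy 2-by-3 "image" (not EMNIST data)
img = np.array([[1, 2, 3],
                [4, 5, 6]])

# Column-major flattening walks down each column first
vec = img.flatten(order='F')
print(vec)  # [1 4 2 5 3 6]

# Reshaping with NumPy's default (row-major) order scrambles the pixels
wrong = vec.reshape(2, 3)
print(wrong)

# Reshaping in column-major order recovers the original array
right = vec.reshape(2, 3, order='F')
print(right)
```

For a square image, reshaping in the default row-major order and then transposing gives the same result as reshaping with `order='F'`.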

In [None]:
#(1) Your code goes here

In [None]:
#(2) Your code goes here

\pagebreak

# ========== Question 1.3 --- [10 marks] ==========

Using Xtrn and the Euclidean distance measure, for the first four classes, i.e., 'a','b','c', and 'd', find the two closest samples and four furthest samples of that class to the mean of the class.
1. [Code] Display the images of the mean vectors and samples you found in a 4-by-7 grid, where the top row corresponds to the images for class 'a', and the bottom to those for class 'd'. The seven columns are the image of the mean vector and the images of the first and second closest and fourth, third, second, and first furthest samples to the mean vector for that class, respectively (from left to right). For each image sample, you should provide the class number and the sample number in the data set. Note that we use 0-based indexing.
2. [Text] Discuss possible issues when we use this data set for classification tasks.

In [None]:
#(1) Your code goes here

(2) ***Your answer goes here***

\pagebreak

# ========== Question 1.4 --- [12 marks] ==========

Apply Principal Component Analysis (PCA) to Xtrn_nm using sklearn.decomposition.PCA, and answer the following questions:

1. [Code] Report the variances of the projected data for the first five principal components.
2. [Code] Plot a graph of the cumulative explained variance ratio as a function of the number of principal components, $K$, where $1 \le K \le 784$.
3. [Code] Find the minimum number of principal components required to explain 50%, 60%, 70%, 80%, and 90% of the total variance, respectively.
4. [Code] Display the images of the first 10 principal components in a 2-by-5 grid, putting the image of first principal component on the top left corner, followed by the one of second component to the right. 
5. [Text] Based on the images you obtained above, discuss your findings briefly.


In [None]:
#(1) Your code goes here

In [None]:
#(2) Your code goes here

In [None]:
#(3) Your code goes here

In [None]:
#(4) Your code goes here

(5) ***Your answer goes here***

\pagebreak

# ========== Question 1.5 --- [15 marks] ==========

We now consider applying dimensionality reduction with PCA to a sample and reconstructing the sample from the dimensionality-reduced sample.
Answer the following questions:

1. [Text] Describe this process using mathematical formulae.
2. [Code] For each class of 'a', 'b', 'c', and 'd', and for each number of principal components K=5,10,20,40,80,160,320, find the first sample in Xtrn_nm, apply the dimensionality reduction, reconstruct that sample, and display the image of that reconstructed sample in a 4-by-7 grid, where each row corresponds to a class and each column corresponds to a value of K. For each reconstructed sample, you should provide the root mean square error. Note that you should add Xmean to each reconstructed sample to display the corresponding image.
3. [Text] Explain your findings.

(1) ***Your answer goes here***

In [None]:
# (2) Your code goes here

(3) ***Your answer goes here***

\pagebreak

# ========== Question 1.6 --- [6 marks] ==========

Applying k-means clustering to Xtrn with various values of k, answer (in brief) the following questions:

1. [Code] Show a graph of the sum of squared errors (SSE) as a function of k.
2. [Text] Discuss your findings. 

In [None]:
#(1) Your code goes here

(2) ***Your answer goes here***

\pagebreak

# ========== Question 1.7 --- [16 marks] ==========

Apply hierarchical clustering with Ward's linkage to all the samples of letter 'b' in Xtrn. Answer the following questions.

 1. [Code] Plot a dendrogram with scipy.cluster.hierarchy.dendrogram(), in which you specify orientation='right'.
 2. [Code] For each of the last four clusters to the root node, i.e., the four nodes from the root node on the dendrogram, find the number of samples (i.e. the number of leaf nodes) that belong to the cluster.
 3. [Code] For each of the last four clusters described above, display the image of the cluster centre and the image of the sample that is closest to the cluster centre. For each image sample, you should provide the sample number in the data set. Note that cluster centre is the mean vector of samples in that cluster. 
 4. [Text] Discuss your findings.

In [None]:
#(1) Your code goes here

In [None]:
#(2) Your code goes here

In [None]:
#(3) Your code goes here

(4) ***Your answer goes here***

\pagebreak

# Question 2  Classification of handwritten characters

#### 62 marks out of 163 for this coursework

We use the same data set as the one in Question 1. We use Xtrn_nm for training and Xtst_nm for testing.



\pagebreak

# ========== Question 2.1 --- [14 marks] ==========

We consider applying multiclass logistic regression classification to the data set.
You should use sklearn.linear_model.LogisticRegression() with the default parameters.
Make sure that you use Xtrn_nm for training and Xtst_nm for testing.
Answer (in brief) the following questions:  

1. [Text] Explain how you can extend the original logistic regression for binary classification to multiclass logistic regression.
2. [Code] Carry out a classification experiment with multiclass logistic regression, and report the classification accuracy for the training set and test set.
3. [Code] Find the top five classes that were misclassified most in the test set. You should provide the class numbers, corresponding alphabet letters, and the numbers of misclassifications. 
4. [Text] For each class that you identified in 3 above, make a quick investigation and explain possible reasons for misclassification.

(1) ***Your answer goes here***

In [None]:
#(2) Your code goes here

In [None]:
#(3) Your code goes here

(4) ***Your answer goes here***

\pagebreak

# ========== Question 2.2 --- [10 marks] ==========

We now look into the effect of the number of training samples ($N$) on classification accuracy using multiclass logistic regression.
1. [Code] Carry out a classification experiment for $N$ = 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 30000, 46800, and plot the accuracy for the training set (Xtrn_nm) and test set (Xtst_nm) on the same graph. Note that you should always use the whole test set (Xtst_nm) for testing irrespective of $N$.
2. [Text] Discuss your findings based on the graphs (results) you obtained, comparing the graphs for training and testing.

In [None]:
#(1) Your code goes here

(2) ***Your answer goes here***

\pagebreak

# ========== Question 2.3 --- [6 marks] ==========

We consider applying Support Vector Machines (SVMs) to the data set, using sklearn.svm.SVC().
For each of the two conditions shown below, carry out a classification experiment, report classification accuracy and confusion matrix (in numbers instead of in graphical representation such as heatmap) for the test set. You may share code between the two conditions. Make sure that you use Xtrn_nm for training and Xtst_nm for testing.

1. [Code] Condition A: SVM with a linear kernel and C=1
2. [Code] Condition B: SVM with an RBF kernel and C=1.
3. [Text] Discuss your findings, comparing the results for the two conditions.

In [None]:
#(1) Your code goes here

In [None]:
#(2) Your code goes here

(3) ***Your answer goes here***

\pagebreak

# ========== Question 2.4 --- [12 marks] ==========

We used default parameters for the SVM in Question 2.3. We now want to tune the parameters by using cross-validation.

1. [Text] Explain why we employ cross validation to determine the value of C.
2. [Code] To reduce the time for experiments, pick the first 400 training samples from each class (from 0 ('a') to 25 ('z')) to create Xsmall, so that Xsmall contains 400\*26=10400 samples in total and its first 400 samples correspond to 'a' and the last 400 to 'z'. Accordingly, create the labels, Ysmall.
By using 3-fold cross validation and Xsmall only, estimate the classification accuracy of an SVM classifier with an RBF kernel, while you vary the penalty parameter C in the range $10^{-2}$ to $10^4$ (use 13 values spaced equally in log space, where the logarithm base is 10). Set the kernel coefficient parameter gamma to 'auto' for this question. You should use sklearn.model_selection.cross_val_score.
Display the mean cross-validation classification accuracy for each C, and plot it against C by using a log-scale for the x-axis.
3. [Text] Report the highest obtained mean accuracy score and the value of C which yielded it.
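As a small illustration of what "13 values spaced equally in log space" means for the range of C given above, the following sketch uses `np.logspace`; the variable name `C_values` is only a suggestion.

```python
import numpy as np

# 13 values from 10^-2 to 10^4, equally spaced in log10 space
C_values = np.logspace(-2, 4, num=13)
print(C_values)

# The base-10 exponents increase in steps of 0.5
print(np.log10(C_values))
```

Because consecutive values differ by a constant factor rather than a constant step, a logarithmic x-axis spreads them evenly on a plot.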

(1) ***Your answer goes here***

In [None]:
#(2) Your code goes here

(3) ***Your answer goes here***

\pagebreak

# ========== Question 2.5 --- [20 marks] ==========

We now want to improve the classification accuracy for the multiclass logistic regression model from the one obtained in Question 2.1. Answer the following questions.
1. [Text] Discuss possible approaches, and decide which one to implement. Note that you should not use other classification models; you should stick to the multiclass logistic regression approach.
2. [Code] Implement the approach you have chosen, carry out a classification experiment, and report accuracy for the training set and test set.
3. [Text] Briefly investigate the result and report your findings.

(1) ***Your answer goes here***

In [None]:
#(2) Your code goes here

(3) ***Your answer goes here***

\pagebreak

# Question 3  Classification of sequential data

#### 32 marks out of 163 for this coursework

### Human Activity Recognition data set (UniMiB SHAR)

The aim of this task is to predict types of activities of daily living (ADL) from acceleration samples acquired with a smartphone. We use a subset of [UniMiB SHAR Activities](http://www.sal.disco.unimib.it/technologies/unimib-shar/).
Activities of 30 people (subjects) were recorded at a sampling frequency of 50Hz and parameterised as a fixed-length sequence of (x,y,z)-acceleration vectors. Each sample is represented as a vector of 453 elements, whose original shape was 151-by-3.
There are nine activities: 'StandingUpFS', 'StandingUpFL', 'Walking', 'Running', 'GoingUpS', 'Jumping', 'GoingDownS', 'LyingDownFS', 'SittingDown'.

For training and evaluation, we employ leave-one-subject-out cross-validation, in which, for each of 30 subjects, the data of the remaining 29 subjects is used for training and the data of the left-out subject is used for validation (testing).
We will mainly employ ***macro F1 score*** as the evaluation measure. Note that we cannot expect good F1 scores for test data in this question; scores as low as around 0.4 are possible.
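For reference, the macro F1 score is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a frequent one. The sketch below illustrates this on hypothetical toy labels (not the UniMiB SHAR data), assuming scikit-learn's `f1_score` with `average='macro'`. `zero_division=0` sets the score of a class that is never predicted to 0 without raising a warning.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical toy labels: class 0 is frequent, class 2 is rare and never predicted
y_true = np.array([0, 0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 1, 0, 0])

per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
accuracy = np.mean(y_true == y_pred)

print("per-class F1:", per_class)
print("macro F1:    ", macro)     # mean of the per-class scores
print("accuracy:    ", accuracy)  # dominated by the frequent class
```

In this toy case the accuracy (5 of 7 correct) is noticeably higher than the macro F1, because the missed rare class pulls the macro average down.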

### Loading data: ###

Upload the data set files, 'adl_data.mat', 'adl_labels.mat', 'adl_train_idxssubjective_folds.mat', and 'adl_test_idxssubjective_folds.mat' to your environment, and run the following cell.

In [None]:
# Load the data set and apply some changes
from iaml02cw2_helpers import *
X, Y, train_idx, test_idx = load_UniMiB_SHAR_ADL()
# Change labels and indices for Python (zero-indexing)
Y = Y-1
train_idx = train_idx - 1
test_idx = test_idx - 1


- X: whole data (shape=(7579,453))
- Y: labels for X (shape=(7579,3)). The first column is the type of activity, the second column is the subject number, and the third column is the gender of the subject.
- train_idx: list of training data indices for the leave-one-subject-out. For subject k, train_idx\[k\] gives indices to X that should be used for training
- test_idx:  list of test data indices for the leave-one-subject-out. For subject k, test_idx\[k\] gives indices to X that should be used for testing
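The structure of leave-one-subject-out folds can be sketched on toy data as follows. This only illustrates the idea: how the provided train_idx and test_idx were actually built is not stated here, and `subjects`, `toy_train_idx`, and `toy_test_idx` are hypothetical names.

```python
import numpy as np

# Hypothetical subject ID per sample (toy data: 3 subjects, 7 samples)
subjects = np.array([0, 0, 1, 1, 1, 2, 2])

# For each subject s: train on everyone else, test on subject s
toy_train_idx = [np.flatnonzero(subjects != s) for s in np.unique(subjects)]
toy_test_idx = [np.flatnonzero(subjects == s) for s in np.unique(subjects)]

for k, (tr, te) in enumerate(zip(toy_train_idx, toy_test_idx)):
    # Within a fold, training and test indices never overlap
    assert np.intersect1d(tr, te).size == 0
    print("fold", k, "train:", tr, "test:", te)
```

Since no subject appears on both sides of a fold, the test score reflects how well a model generalises to a person it has never seen.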

\pagebreak

# ========== Question 3.1 --- [7 marks] ==========

Answer the following questions: 
1. [Code] Using a barplot, plot the frequency of each activity in the whole data set.
2. [Text] Explain your findings and discuss possible issues with the data set when we use it for activity classification tasks.

In [None]:
#(1) Your code goes here

(2) ***Your answer goes here***

\pagebreak

# ========== Question 3.2 --- [10 marks] ==========

Answer the following questions: 
1. [Code] Carry out an experiment with k-nearest neighbours classification with the Euclidean distance measure for k = 1, 2, 3, 5, 7, 10, display *macro F1 scores* for training and testing, and plot the scores on a single figure. Make sure that you employ leave-one-subject-out cross validation.
2. [Text] Discuss your findings.

In [None]:
#(1) Your code goes here

(2) ***Your answer goes here***

\pagebreak

# ========== Question 3.3 --- [15 marks] ==========

We now consider using Gaussian Mixture Models (GMMs) in sklearn, where we employ a GMM for each activity class $C_i$ to estimate $P(X|C_i)$. In classification, we find the class $C_j$ such that $P(X,C_j) > P(X,C_i)$ for all $i \neq j$, where $P(X,C_i) = P(X|C_i)P(C_i)$. We assume equal prior probabilities, i.e., $P(C_i) = \frac{1}{9}$ for all $i$, in this question.
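Under the equal-prior assumption, the prior is a constant factor and drops out of the comparison, so the decision rule can be written in terms of the class-conditional log likelihood alone. The short derivation below uses the same symbols as above; $\hat{C}$ is notation introduced here for the predicted class.

$$
\hat{C} = \arg\max_{i} P(X, C_i) = \arg\max_{i} P(X|C_i)\,P(C_i) = \arg\max_{i} \tfrac{1}{9}\,P(X|C_i) = \arg\max_{i} \log P(X|C_i)
$$

The last step holds because the logarithm is monotonically increasing, so taking logs does not change which class attains the maximum.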

1. [Code] Train GMMs with diagonal covariance matrices on the training data and calculate the per-sample average log likelihood  and macro-F1 score on the training data and test data for each K = 1,2,3,5,10,20,40, where K denotes the number of mixture components.
Display the obtained likelihood values and F1 scores in separate tables, where columns correspond to K and rows correspond to training data and test data, providing an appropriate table title.
Plot the likelihoods and F1 scores in separate graphs, where the x-axis represents K, using different colours and markers for training and test conditions, and providing an appropriate figure title.
Make sure that you employ leave-one-subject-out cross validation to obtain the requested information. Note that, in each cross-validation fold, you should first obtain a per-sample average log likelihood for each class using the data of that class instead of the whole data, and then take the unweighted average over all the classes irrespective of class size. 
2. [Text] Discuss your findings, comparing the results for training and testing.
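The difference between the unweighted average over classes requested above and a pooled average over all samples can be seen in a small numerical sketch. The log-likelihood values here are made up for illustration, and `loglik_by_class` is a hypothetical name.

```python
import numpy as np

# Made-up per-sample log likelihoods, grouped by class (class 0 has more samples)
loglik_by_class = {
    0: np.array([-10.0, -12.0, -11.0, -11.0]),
    1: np.array([-20.0, -22.0]),
}

# Step 1: per-sample average log likelihood within each class
per_class_mean = {c: v.mean() for c, v in loglik_by_class.items()}

# Step 2: unweighted average over classes, irrespective of class size
unweighted = np.mean(list(per_class_mean.values()))

# For comparison only: pooling all samples weights larger classes more heavily
pooled = np.concatenate(list(loglik_by_class.values())).mean()

print(per_class_mean)  # class 0: -11.0, class 1: -21.0
print(unweighted)      # -16.0
print(pooled)          # about -14.33
```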

In [None]:
#(1) Your code goes here

(2) ***Your answer goes here***

\pagebreak

In [None]:
# This cell's output will confirm all cells have been run if you select Kernel->Restart & Run All.
# Wait until you see the output printed
print("*****************************")
print("*                           *")
print("* All cells have been run!! *")
print("*                           *")
print("*****************************")