You will be implementing networks to recognize handwritten Hiragana symbols, using the Kuzushiji-MNIST dataset (KMNIST for short). The Japanese language changed significantly when Japan reformed its education system in 1868, and the majority of Japanese people today cannot read texts published more than 150 years ago. The dataset contains 10 Hiragana characters with 7000 samples per class.
- Implement a model NetLin which computes a linear function of the pixels in the image, followed by log softmax. Run the code by typing:
python3 kuzu_main.py --net lin
Produce the final accuracy and confusion matrix. Note that the rows of the confusion matrix indicate the target character, while the columns indicate the one chosen by the network. (0="o", 1="ki", 2="su", 3="tsu", 4="na", 5="ha", 6="ma", 7="ya", 8="re", 9="wo").
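A minimal sketch of what NetLin could look like in PyTorch. The class name and the 28x28 KMNIST input size follow the assignment; the exact module structure shown here is an assumption, not the reference solution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetLin(nn.Module):
    """Linear function of the 28x28 input pixels, followed by log softmax."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(28 * 28, 10)  # 784 pixels -> 10 classes

    def forward(self, x):
        x = x.view(x.shape[0], -1)        # flatten each image to a 784-vector
        return F.log_softmax(self.fc(x), dim=1)
```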
- Implement a fully connected 2-layer network NetFull (i.e. one hidden layer, plus the output layer), using tanh at the hidden nodes and log softmax at the output layer. Run the code by typing:
python3 kuzu_main.py --net full
Experiment with different numbers of hidden nodes (multiples of 10) to determine a value that achieves high accuracy (at least 84%) on the test set. Produce the final accuracy and confusion matrix.
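A sketch of one possible NetFull implementation. The hidden-layer size is the value you are asked to tune; the default of 120 here is an arbitrary placeholder, not a recommended setting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetFull(nn.Module):
    """Two-layer fully connected network: tanh hidden layer, log softmax output."""
    def __init__(self, hidden=120):   # hidden size is the hyperparameter to tune
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, hidden)
        self.fc2 = nn.Linear(hidden, 10)

    def forward(self, x):
        x = x.view(x.shape[0], -1)    # flatten each image to a 784-vector
        h = torch.tanh(self.fc1(x))   # tanh at the hidden nodes
        return F.log_softmax(self.fc2(h), dim=1)
```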
- Implement a convolutional network called NetConv, with two convolutional layers plus one fully connected layer, all using the relu activation function, followed by the output layer, using log softmax. You are free to choose for yourself the number and size of the filters, the metaparameter values (learning rate and momentum), and whether to use max pooling or a fully convolutional architecture. Run the code by typing:
python3 kuzu_main.py --net conv
Your network should consistently achieve at least 93% accuracy on the test set after 10 training epochs. Produce the final accuracy and confusion matrix.
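One possible NetConv sketch using max pooling. The filter counts (16 and 32), kernel size (5), and hidden size (128) are illustrative choices left open by the assignment, not required values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetConv(nn.Module):
    """Two conv layers and one fully connected layer (all relu), log softmax output."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, padding=2)
        self.fc1 = nn.Linear(32 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # 28x28 -> 14x14
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # 14x14 -> 7x7
        x = x.view(x.shape[0], -1)                  # flatten feature maps
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)
```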
- Briefly discuss the following points:
- the relative accuracy of the three models,
- the confusion matrix for each model: which characters are most likely to be mistaken for which other characters, and why?
NetLin final accuracy: Test set: Average loss: 1.0102, Accuracy: 6967/10000 (70%)
Confusion Matrix
NetFull final accuracy: Test set: Average loss: 0.4974, Accuracy: 8492/10000 (85%)
Confusion Matrix
NetConv final accuracy: Test set: Average loss: 0.2481, Accuracy: 9387/10000 (94%)
Confusion Matrix
It is clear from the results above that accuracy improves as model complexity increases. NetLin, the simplest of the three models, had the lowest accuracy at around 70%. NetFull, which adds a hidden layer with tanh activation, performed better at 85%. NetConv, which combines convolutional layers with a fully connected layer using relu activations, achieved the highest accuracy of the three at 94%.
Fig 1.4 – Most frequent misclassifications (red = most frequent, orange = 2nd most frequent, yellow = 3rd most frequent)
If we look at NetConv and its three most frequent misclassifications, in descending order from Fig 1.4 above, this model is most likely to mistake:
- は (ha) for す (su)
- き (ki) for ま (ma)
- お (o) for な (na)
All three models misclassify は (ha) as す (su) and き (ki) as ま (ma). However, NetLin and NetFull most often mistake お (o) for や (ya) or は (ha), whereas NetConv mistakes it for な (na).
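The most frequent confusion pairs discussed above can be read off a confusion matrix programmatically. A sketch assuming the matrix is a 10x10 NumPy array with rows as targets and columns as predictions (matching the convention stated in the assignment); the helper name `top_confusions` is hypothetical.

```python
import numpy as np

# Class labels in KMNIST index order, as given in the assignment.
labels = ["o", "ki", "su", "tsu", "na", "ha", "ma", "ya", "re", "wo"]

def top_confusions(conf, k=3):
    """Return the k largest off-diagonal entries as (target, predicted, count)."""
    conf = conf.copy().astype(float)
    np.fill_diagonal(conf, 0)                    # ignore correct predictions
    flat = np.argsort(conf, axis=None)[::-1][:k] # indices of largest confusions
    return [(labels[i // 10], labels[i % 10], int(conf.flat[i])) for i in flat]
```

Running this on each model's confusion matrix gives the ranked misclassification pairs shown in Fig 1.4.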
To understand why some characters may be mistaken for others, we examine three comparisons below. In all three models, は (ha) is often mistaken for す (su), and き (ki) for ま (ma); for both pairs, the circled regions in the comparisons highlight two very similar features shared by the two characters.
For the NetConv model, お (o) is often mistaken for な (na), whereas the other two models, NetLin and NetFull, mistake お (o) for either や (ya) or は (ha). By inspection, the misclassification made by NetConv is more plausible than those made by NetLin and NetFull for this example.