# Rot-MNIST data visualization

<b>Rot-MNIST</b> is a variant of the popular MNIST dataset where digits are rotated in-plane by arbitrary angles. The dataset can be found at https://www.dropbox.com/s/0fxwai3h84dczh0/mnist_rotation_new.zip.

This notebook visualizes the training and test sets to give an understanding of the dataset.

In [None]:
# Importing the necessary dependencies below
import argparse
import os
import random
import sys
import time
from urllib.request import urlopen
import zipfile
sys.path.append('../')
import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt

### Loading the dataset

In [None]:
# Specify the relative path of the directory that contains the mnist_rotation_new folder.
data_dir = '.'

In [None]:
# Load the train, validation and test sets
rmnist_dir = os.path.join(data_dir, 'mnist_rotation_new')
train = np.load(os.path.join(rmnist_dir, 'rotated_train.npz'))
valid = np.load(os.path.join(rmnist_dir, 'rotated_valid.npz'))
test = np.load(os.path.join(rmnist_dir, 'rotated_test.npz'))
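
Before computing statistics, it can help to confirm that each split has the layout the rest of this notebook relies on. The sketch below assumes each archive exposes an `x` array of flattened 28x28 images and a `y` array of digit labels, as used in the cells that follow. `check_split` is a hypothetical helper written for illustration, not part of the dataset's tooling, and it runs here on a synthetic stand-in rather than the real files.

```python
import numpy as np

def check_split(split, name):
    """Sanity-check one data split (hypothetical helper).

    Assumes `split` maps 'x' to an (N, 784) array of flattened 28x28
    images and 'y' to an array of N labels in the range 0..9.
    """
    x, y = np.asarray(split['x']), np.asarray(split['y'])
    assert x.ndim == 2 and x.shape[1] == 28 * 28, f'{name}: unexpected x shape {x.shape}'
    assert y.shape[0] == x.shape[0], f'{name}: x and y disagree on sample count'
    assert y.min() >= 0 and y.max() <= 9, f'{name}: labels outside 0..9'
    return x.shape[0]

# Synthetic stand-in so the sketch runs without the dataset on disk.
demo_split = {'x': np.zeros((5, 784)), 'y': np.array([0, 1, 2, 3, 9])}
n_demo = check_split(demo_split, 'demo')
```

With the real data loaded, the same call would be `check_split(train, 'train')`, and likewise for `valid` and `test`.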

## Statistics of the data

#### Samples in training, validation and test sets

In [None]:
print('No. training samples: ', train['x'].shape[0])
print('No. validation samples: ', valid['x'].shape[0])
print('No. test samples: ', test['x'].shape[0])

#### Analyzing sample counts per class for the data subsets

In [None]:
def analyze_class_distribution_tabuler(data, flag):
    '''
    Behaviour:
        Displays sample fraction present in each class in 
        the data.
    Args:
        data (Numpy array):  Contains the dataset
        flag (String): Flag for data-split
    '''
        
    data_count = []
    data_frac = []
    for i in range(10):
        data_count.append((data['y']==i).sum())
        data_frac.append(((data['y']==i).sum()/data['y'].shape[0]).round(3))
    # Creating a table containing counts and fractions of each class in the given split
    from IPython.display import HTML, display
    import tabulate
    table = [["Class", 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, "Total"],
             ["Samples", data_count[0], data_count[1], data_count[2], data_count[3],
              data_count[4], data_count[5], data_count[6], data_count[7], data_count[8],
              data_count[9], data['y'].shape[0]],
             ["Fraction", data_frac[0], data_frac[1], data_frac[2], data_frac[3],
              data_frac[4], data_frac[5], data_frac[6], data_frac[7], data_frac[8],
              data_frac[9], 1.0]]
    print(flag, ' set of ', data['y'].shape[0], ' samples')
    display(HTML(tabulate.tabulate(table, tablefmt='html')))

In [None]:
analyze_class_distribution_tabuler(data = train, flag = "Training")

In [None]:
analyze_class_distribution_tabuler(data = valid, flag = "Validation")

In [None]:
analyze_class_distribution_tabuler(data = test, flag = "Test")

The class distribution shows that the data is roughly balanced: no class accounts for less than 9% of the samples, and the largest class fraction is about 11% in each of the three subsets.

This implies that the standard cross-entropy loss is sufficient, and no additional class-balancing methods are needed to train a classifier on this dataset.
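
The balance check described above can also be expressed compactly with `np.bincount`. This is a minimal sketch, assuming the labels are the integers 0 to 9 stored under `y`. `class_fractions` and the imbalance ratio are illustrative names introduced here, and the labels below are synthetic, so the numbers say nothing about Rot-MNIST itself.

```python
import numpy as np

def class_fractions(labels, num_classes=10):
    """Return per-class counts and fractions for integer labels."""
    labels = np.asarray(labels).astype(np.int64).ravel()
    counts = np.bincount(labels, minlength=num_classes)
    fractions = counts / counts.sum()
    return counts, fractions

# Synthetic labels standing in for data['y'].
demo_labels = np.array([0, 1, 1, 2, 2, 2, 9, 9, 9, 9])
counts, fractions = class_fractions(demo_labels)

# Ratio of the largest class to the smallest non-empty class;
# values close to 1 indicate a balanced split.
imbalance_ratio = counts.max() / counts[counts > 0].min()
```

On the real data, `class_fractions(train['y'])` would return the same counts and fractions that the tables above display.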

## Data visualization

#### Training samples

In [None]:
def display_image_tile(data, tile_size):
    '''
    Behaviour:
        Displays a large tile with tile_size times tile_size
        images randomly chosen from the data.
    
    Args:
        data (Numpy array):  Contains the dataset.
        tile_size (int): Number of images along each side of the square tile.
    '''
    tile = np.zeros((28*tile_size, 28*tile_size))
    for i in range(tile_size):
        for j in range(tile_size):
            # randrange excludes the upper bound, so idx is always a valid index
            idx = random.randrange(data['x'].shape[0])
            tile[i*28:(i+1)*28, j*28:(j+1)*28] = np.reshape(data['x'][idx,:], (28,28))
    # Plotting the tile
    plt.figure(figsize = (15,15))
    plt.imshow(tile)
    ax = plt.gca()
    ax.axes.xaxis.set_visible(False)
    ax.axes.yaxis.set_visible(False)
    plt.show()



In [None]:
display_image_tile(data = train, tile_size = 25)

Above we see 625 samples from the training set. The dataset comprises rotated variants of the 10 MNIST digits. 
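
The tile can also be assembled without the nested loops. The sketch below is one possible vectorized variant, assuming the images are stored as flattened rows of length 784 as in `data['x']`. `build_tile` and its `seed` parameter are hypothetical names introduced for illustration, and the demo input is synthetic.

```python
import numpy as np

def build_tile(images, tile_size, side=28, seed=0):
    """Arrange tile_size**2 randomly chosen images into one square array.

    Assumes `images` has shape (N, side*side), like data['x'] above.
    """
    rng = np.random.default_rng(seed)
    # integers() excludes the upper bound, so every index is valid.
    idx = rng.integers(0, images.shape[0], size=tile_size * tile_size)
    grid = images[idx].reshape(tile_size, tile_size, side, side)
    # Reorder axes to (grid row, pixel row, grid column, pixel column)
    # so that the final reshape lays the images out side by side.
    tile = grid.transpose(0, 2, 1, 3).reshape(tile_size * side, tile_size * side)
    return tile, idx

# Synthetic images so the sketch runs without the dataset on disk.
demo_images = np.arange(6 * 784, dtype=np.float64).reshape(6, 784)
tile, idx = build_tile(demo_images, tile_size=3)
```

The returned `tile` can be passed to `plt.imshow` exactly as in `display_image_tile`. Returning `idx` as well makes it possible to look up the labels of the displayed samples.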

#### Test samples

In [None]:
display_image_tile(data = test, tile_size = 25)

Similar to the training set, the 625 test samples shown above also comprise in-plane rotations at arbitrary angles.

### Context of the dataset

In this dataset, the training set comprises only 10000 samples. Studies using this dataset argue that the limited range of rotations seen in the training set is not sufficient for a model to learn all rotational variations of the digits. Moreover, the train and test sets overlap relatively little in terms of orientation. Regular CNNs trained only on the orientations present in the training set are therefore expected to generalize poorly to the test set, where the orientations differ.
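
To get a feel for what an in-plane rotation by an arbitrary angle does to a 28x28 image, the sketch below rotates an array about its centre using nearest-neighbour sampling in plain NumPy. `rotate_image` is a hypothetical helper for illustration only. It is not how Rot-MNIST was generated, and a library routine with proper interpolation would normally be preferable.

```python
import numpy as np

def rotate_image(img, angle_deg):
    """Rotate a 2-D image about its centre by angle_deg (nearest neighbour).

    Pixels whose source location falls outside the image are set to zero.
    """
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, find the source pixel.
    dy, dx = ys - cy, xs - cx
    src_y = np.rint(cy + dy * np.cos(theta) - dx * np.sin(theta)).astype(int)
    src_x = np.rint(cx + dy * np.sin(theta) + dx * np.cos(theta)).astype(int)
    inside = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.zeros_like(img)
    out[inside] = img[src_y[inside], src_x[inside]]
    return out

# Synthetic vertical bar standing in for a digit.
demo_digit = np.zeros((28, 28))
demo_digit[4:24, 13:15] = 1.0
rotated_digit = rotate_image(demo_digit, 45.0)
```

With this sign convention, a 90 degree rotation of a square image matches `np.rot90(img, k=-1)`. Applying `rotate_image` to a reshaped training sample at several angles gives a quick visual impression of the orientation variations a classifier has to cope with.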