# HOMEWORK 12

In this homework you are going to inspect the GTSRB (German Traffic Sign Recognition Benchmark) dataset. The dataset contains images of various classes of traffic signs used in Germany (and the whole EU). The objective of this homework is to go through the steps described below and to implement the necessary code.

At the end, as usual, there will be a couple of questions for you to answer. In addition, the last section of this homework is optional and, if you choose to do it, you'll earn an extra point :-)

In [None]:
import os
import cv2
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = [15, 10]

### Step 0

Go to the GTSRB dataset official site ([link](https://benchmark.ini.rub.de/gtsrb_dataset.html)) to learn more about the dataset.

### Step 1

Download the dataset ([link](https://www.kaggle.com/meowmeowmeowmeowmeow/gtsrb-german-traffic-sign)) and unzip it.

### Step 2

For this homework, you will be working with the training set. Open `Train.csv` and see what it contains. Then load the dataset and plot random samples.

In [None]:
# Load the training labels
root = 'data/GTSRB' # Path to the dataset location, e.g., '/data/janko/dataset/GTSRB'
data = pd.read_csv(os.path.join(root, 'Train.csv'))

# Number of training samples (amount of samples in data)
num_samples = len(data)

print(num_samples)

# Show random data samples
for ii in range(15):
    # Get random index
    idx = np.random.randint(0, num_samples)
    # Load image
    img = cv2.imread(os.path.join(root, data.iloc[idx]['Path']))
    # Convert image to RGB
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    
    # Show image
    plt.subplot(3,5, ii+1), plt.imshow(img), plt.title(data.iloc[idx]['ClassId'])

### Step 3

Inspect the dataset by computing and plotting the per-class histogram.

In [None]:
# Extract class identifiers
# Hint: Check the csv 
ids = data['ClassId'].values

Compute the per-class histogram. You can use any approach you want (e.g. `numpy`). It's also worth looking at the `Counter` class from the `collections` module ([link](https://docs.python.org/3/library/collections.html#collections.Counter)) ;-)
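
As a minimal sketch of the approaches mentioned above, the snippet below counts labels in three ways on a toy array. The name `toy_ids` is only a stand-in for the real `ids` vector, so nothing here depends on the dataset being downloaded.

```python
from collections import Counter

import numpy as np

# Toy label vector standing in for data['ClassId'].values (not real GTSRB labels)
toy_ids = np.array([0, 1, 1, 2, 2, 2, 5])

# Approach 1: Counter builds a dict-like mapping label -> count
counter_hist = Counter(toy_ids.tolist())

# Approach 2: np.unique returns the sorted labels together with their counts
labels, counts = np.unique(toy_ids, return_counts=True)
numpy_hist = dict(zip(labels.tolist(), counts.tolist()))

# Approach 3: np.bincount also reports labels with zero samples (here 3 and 4)
dense_counts = np.bincount(toy_ids, minlength=6)

print(counter_hist)
print(numpy_hist)
print(dense_counts)
```

`Counter` and `np.unique` only list the labels that actually occur, while `np.bincount` keeps an entry for every integer label up to the maximum, which can be handy for spotting empty classes.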

In [None]:
from collections import Counter
hist = Counter(ids)

plt.bar(hist.keys(), hist.values()), plt.grid(True)
plt.xlabel('Traffic Sign ID'), plt.ylabel('Counts')

In [None]:
# Per-class statistics: image count, shape spread and mean brightness
results = []

for class_id in hist.keys():
    data_by_class = data[data['ClassId'] == class_id]

    # Load every image of this class in grayscale.
    # Note: the image sizes are also listed in Train.csv, so the files only need to be read for the brightness.
    images = [cv2.cvtColor(cv2.imread(os.path.join(root, img_path)), cv2.COLOR_BGR2GRAY) for img_path in data_by_class['Path']]

    info_shape = [image.shape for image in images]
    # info_shape = [image.shape[1::-1] for image in images]

    total_count = len(images)
    unic_shape_by_class = np.unique(info_shape, axis=0)
    unic_shape_count_by_class = len(unic_shape_by_class)
    max_shape_by_class = np.max(info_shape, axis=0)
    min_shape_by_class = np.min(info_shape, axis=0)
    mean_shape_by_class = np.mean(info_shape, axis=0)
    mean_bright = np.mean([np.mean(image) for image in images])

    results.append([class_id, total_count, unic_shape_count_by_class, min_shape_by_class, max_shape_by_class, mean_shape_by_class, mean_bright])

results = sorted(results, key=lambda a_entry: a_entry[1])

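# Rows mix scalars and small arrays, so store them as objects (recent NumPy versions reject ragged input otherwise)
results = np.array(results, dtype=object)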

fig, axs = plt.subplots(2,1)
collabel = ('class_id', 'image_count', 'unique_shapes_count', 'min_shape', 'max_shape', 'mean_shape', 'mean_bright')
axs[0].axis('tight')
axs[0].axis('off')
the_table = axs[0].table(cellText = results, colLabels = collabel, loc = 'top')

# Column 0 holds the class id, column 6 the mean brightness
axs[1].bar(results[:, 0].astype(float), results[:, 6].astype(float))
axs[1].set_xlabel('class_id')
axs[1].set_ylabel('mean_bright')
axs[1].grid(True)
plt.show()


In [None]:
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

In [None]:

num_classes = 10
for ii in range(15):
    idx = np.random.randint(0, len(y_train))
    # CIFAR10 images are RGB, so no colormap is needed; y_train has shape (N, 1), hence the extra index
    plt.subplot(3,5,ii+1), plt.imshow(x_train[idx, ...]), plt.title(y_train[idx, 0])

In [None]:
centers = np.arange(0, num_classes + 1)
counts, bounds = np.histogram(y_train, bins=centers-0.5)

plt.bar(centers[:-1], counts), plt.grid(True)
plt.xlabel('Class ID'), plt.ylabel('counts')

In [None]:
# Boolean mask selecting all samples of one class (class 7 as an example)
mask = (y_train == 7).flatten()
print(mask.shape)
# Grayscale copy of the whole training set
x_train_gray = np.array([cv2.cvtColor(image, cv2.COLOR_RGB2GRAY) for image in x_train])

b = y_train[mask]
a = x_train_gray[mask, ...]
print(b)
print(len(mask), len(b), len(x_train_gray))
print(y_train.shape)
print(x_train.shape)
print(x_train_gray.shape)
print(b.shape)
print(a.shape)

In [None]:
results = []

for class_id in range(num_classes):
    
    # Boolean mask for the current class (named `mask` to avoid shadowing the built-in `filter`)
    mask = (y_train == class_id).flatten()
    y_train_ = y_train[mask]
    x_train_ = x_train[mask, ...]

    info_shape = [image.shape for image in x_train_]
    # info_shape = [image.shape[1::-1] for image in images]

    total_count = len(x_train_)
    unic_shape_by_class = np.unique(info_shape, axis=0)
    unic_shape_count_by_class = len(unic_shape_by_class)
    max_shape_by_class = np.max(info_shape, axis=0)
    min_shape_by_class = np.min(info_shape, axis=0)
    mean_shape_by_class = np.mean(info_shape, axis=0)
    print(x_train_.shape)
    mean_bright = np.mean([np.mean(cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)) for image in x_train_])

    results.append([class_id, total_count, unic_shape_count_by_class, min_shape_by_class, max_shape_by_class, mean_shape_by_class, mean_bright])

results = sorted(results, key=lambda a_entry: a_entry[1])

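# Rows mix scalars and small arrays, so store them as objects (recent NumPy versions reject ragged input otherwise)
results = np.array(results, dtype=object)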

fig, axs = plt.subplots(2,1)
collabel = ('class_id', 'image_count', 'unique_shapes_count', 'min_shape', 'max_shape', 'mean_shape', 'mean_bright')
axs[0].axis('tight')
axs[0].axis('off')
the_table = axs[0].table(cellText = results, colLabels = collabel, loc = 'top')

# Column 0 holds the class id, column 6 the mean brightness
axs[1].bar(results[:, 0].astype(float), results[:, 6].astype(float))
axs[1].set_xlabel('class_id')
axs[1].set_ylabel('mean_bright')
axs[1].grid(True)
plt.show()

### Questions

Please answer the following questions:
* Do you consider the dataset to be balanced? If so, why? If not, why?

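    No. A balanced dataset would have roughly the same number of images in every class, but the per-class histogram is far from uniform. Image sizes also vary greatly within and between classes, although that is a separate issue from class balance.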

* Are there any classes that are (significantly) over-represented or under-represented?

    Yes. 

        Over-represented classes: 10, 38, 12, 13, 1, 2.

        Under-represented classes: 0, 19, 27, 37, 32, 41, 42.
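
One way to back the answers above with numbers is to compute an imbalance ratio and flag classes relative to the median count. The sketch below uses made-up counts (`toy_counts` is hypothetical, not the real GTSRB histogram), and the factor 1.5 is an arbitrary threshold chosen only for illustration.

```python
import numpy as np

# Hypothetical per-class counts (class id -> number of images); not the real GTSRB numbers
toy_counts = {0: 200, 1: 2000, 2: 2100, 3: 1200, 4: 250}

ids = np.array(list(toy_counts.keys()))
counts = np.array(list(toy_counts.values()), dtype=float)

# Ratio between the largest and the smallest class; 1.0 would mean perfectly balanced
imbalance_ratio = counts.max() / counts.min()

# Flag classes relative to the median count; the factor 1.5 is an arbitrary choice
median = np.median(counts)
over = ids[counts > 1.5 * median].tolist()
under = ids[counts < median / 1.5].tolist()

print(imbalance_ratio, over, under)
```

With the real data, `toy_counts` could be replaced by the `hist` counter computed in Step 3.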


### Optional

Perform a further analysis on the dataset and draw some conclusion from it.

Hint 1: Unlike MNIST or CIFAR10, this dataset contains images with various spatial resolutions. Is there anything we can tell about the resolution distribution?

    In the CIFAR10 dataset, all images have the same resolution: 32x32.
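
For a dataset with varying resolutions, the distribution can be summarised straight from a labels table. This is only a sketch on a toy table: it assumes the CSV exposes the image size in columns named `Width` and `Height`, which should be checked against the actual `Train.csv` header.

```python
import pandas as pd

# Toy table standing in for Train.csv; the 'Width' and 'Height' column names are assumed
toy = pd.DataFrame({
    'Width':   [30, 32, 60, 120, 45, 28],
    'Height':  [30, 34, 58, 118, 47, 29],
    'ClassId': [0, 0, 1, 1, 2, 2],
})

# Overall spread of the resolution
size_summary = toy[['Width', 'Height']].agg(['min', 'median', 'max'])

# An aspect ratio close to 1 means the crops are nearly square
toy['Aspect'] = toy['Width'] / toy['Height']

# Median width per class shows whether some classes are systematically smaller
median_width = toy.groupby('ClassId')['Width'].median()

print(size_summary)
print(median_width)
```

A histogram of `Width` (for example with `toy['Width'].hist()`) would show the same information graphically.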

Hint 2: What about the brightness distribution? Are there classes that are significantly brighter than others?

    In the CIFAR10 dataset, the mean brightness is approximately the same across classes, with no large differences.
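
To compare brightness between classes when images have different sizes, one option is to reduce every image to a single mean value first and then aggregate per class. The sketch below does this on synthetic grayscale images; `toy_images` and `toy_labels` are invented for illustration, and class 1 is made darker on purpose.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic grayscale images of different sizes; class 1 is deliberately darker
toy_images = [rng.integers(100, 200, size=(h, w), dtype=np.uint8) for h, w in [(30, 30), (48, 52)]]
toy_images += [rng.integers(0, 60, size=(h, w), dtype=np.uint8) for h, w in [(25, 27), (64, 60)]]
toy_labels = np.array([0, 0, 1, 1])

# One brightness value per image, so differently sized images can be compared
per_image = np.array([img.mean() for img in toy_images])

# Mean and standard deviation of the brightness for every class
stats = {
    int(c): (per_image[toy_labels == c].mean(), per_image[toy_labels == c].std())
    for c in np.unique(toy_labels)
}

for class_id, (mean_b, std_b) in stats.items():
    print(class_id, round(mean_b, 1), round(std_b, 1))
```

Reporting the standard deviation next to the mean helps to judge whether a difference between two classes is larger than the variation inside each class.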