# Model metrics comparison

In [1]:
import os
import sys
import json

sys.path.insert(0, '../src')

import numpy as np
import pandas as pd

from loader import ImagesDataset

In [2]:
SEED = 11
rng = np.random.default_rng(SEED)

In [3]:
DATA_PATH = '../images'
SHAPE = (128, 128, 3)

dataset = ImagesDataset(path=DATA_PATH, preload=False, encode_labels=True)
classes = dataset.label_encoder.classes_.tolist()
n_classes = len(classes)

dataset.split(shuffle=SEED)

train_uniques, train_counts = np.unique(dataset.labels[dataset.train.indices], return_counts=True)
train_shares = train_counts / train_counts.sum()

100%|████████████████████████████████████████████| 9/9 [00:00<00:00, 446.02it/s]


In [4]:
classwise_metrics = ['precision', 'recall']
general_metrics = ['accuracy', 'f1_score']
metrics_names = classwise_metrics + general_metrics

metrics = {name: {} for name in metrics_names}

METRICS_PATH = '../metrics'
for file in sorted(os.listdir(METRICS_PATH)):
    if not file.endswith('.json'):
        continue

    file_path = os.path.join(METRICS_PATH, file)
    with open(file_path, 'r') as f:
        metrics_dict = json.load(f)
    
    model_name = file.split('.')[0].split('_')[1]
    for name in metrics_names:
        metrics[name][model_name] = metrics_dict[name]

In [5]:
html = ''

styles = [
    {
        'selector': "caption",
        'props': [
            ('font-family', 'monospace'),
            ("font-size", "150%")
        ]
    },
    {
        'selector': "tbody",
        'props': [
            ('font-family', 'monospace'),
            ('text-align', 'right'),
        ]
    },
]

for metrics_name in metrics_names:
    metrics_dict = metrics[metrics_name]

    if metrics_name in classwise_metrics:
        metrics_dict = {'shares': train_shares, **metrics_dict}
        index = classes
    else:
        metrics_dict = {'shares': 1 / n_classes, **metrics_dict}
        index = ['\u00A0' * 3 + 'altogether']

    df = pd.DataFrame(data=metrics_dict, index=index)
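    # Stored values are fractions in [0, 1]; show them as percentages with one decimal.
    formatter = lambda value: f"{value * 100:.1f}"
    styled_df = (
        df.style
        .background_gradient('RdYlGn', vmin=0, vmax=1, axis=0)
        .format(formatter)
        .set_caption(metrics_name)
        .set_table_styles(styles)
    )
    display(styled_df)

    # Styler.render() was removed in pandas 2.0; to_html() (pandas >= 1.3) replaces it.
    html += styled_df.to_html()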
    
with open('../report/report.html', 'w') as f:
    f.write(html)

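precision (values in %)

| class | shares | GUESS | LOGISTIC | VGG | VGG+KMEANS | VGG+LGBM | EFFICIENTNET | EFFICIENTNET+KNN |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| ArtDeco | 4.7 | 7.1 | 0.0 | 75.0 | 66.7 | 40.0 | 50.0 | 42.9 |
| Cubism | 25.7 | 30.0 | 34.2 | 64.1 | 72.5 | 71.2 | 71.1 | 72.0 |
| Impressionism | 17.0 | 15.7 | 40.0 | 51.7 | 57.1 | 54.5 | 63.8 | 64.4 |
| Japonism | 14.1 | 10.8 | 31.2 | 75.9 | 74.2 | 63.6 | 67.7 | 67.7 |
| Naturalism | 15.2 | 14.3 | 41.9 | 83.8 | 78.9 | 80.0 | 73.3 | 73.9 |
| Rococo | 8.6 | 9.1 | 20.0 | 52.0 | 48.3 | 65.0 | 52.8 | 54.1 |
| cartoon | 4.9 | 12.5 | 12.5 | 63.6 | 72.7 | 77.8 | 53.3 | 61.5 |
| photo | 9.7 | 13.6 | 22.6 | 79.2 | 66.7 | 71.9 | 81.8 | 87.5 |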


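recall (values in %)

| class | shares | GUESS | LOGISTIC | VGG | VGG+KMEANS | VGG+LGBM | EFFICIENTNET | EFFICIENTNET+KNN |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| ArtDeco | 4.7 | 8.3 | 0.0 | 50.0 | 50.0 | 50.0 | 30.0 | 30.0 |
| Cubism | 25.7 | 33.3 | 18.1 | 81.9 | 80.6 | 79.2 | 84.3 | 84.3 |
| Impressionism | 17.0 | 17.4 | 26.1 | 65.2 | 69.6 | 65.2 | 68.2 | 65.9 |
| Japonism | 14.1 | 9.5 | 35.7 | 52.4 | 54.8 | 50.0 | 60.0 | 60.0 |
| Naturalism | 15.2 | 13.3 | 40.0 | 68.9 | 66.7 | 71.1 | 62.3 | 64.2 |
| Rococo | 8.6 | 10.5 | 57.9 | 68.4 | 73.7 | 68.4 | 65.5 | 69.0 |
| cartoon | 4.9 | 10.5 | 15.8 | 36.8 | 42.1 | 36.8 | 50.0 | 50.0 |
| photo | 9.7 | 10.3 | 24.1 | 65.5 | 69.0 | 79.3 | 64.3 | 75.0 |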


accuracy (values in %)

| | shares | GUESS | LOGISTIC | VGG | VGG+KMEANS | VGG+LGBM | EFFICIENTNET | EFFICIENTNET+KNN |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| altogether | 12.5 | 17.6 | 27.8 | 65.8 | 67.3 | 66.5 | 67.0 | 68.4 |


f1_score (values in %)

| | shares | GUESS | LOGISTIC | VGG | VGG+KMEANS | VGG+LGBM | EFFICIENTNET | EFFICIENTNET+KNN |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| altogether | 12.5 | 14.2 | 26.2 | 64.5 | 65.2 | 64.0 | 62.3 | 63.9 |


### Conclusion

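Clustering and nearest-neighbour models on top of network embeddings produce the highest metric values so far: KMeans on VGG embeddings has the best F1 score and KNN on EfficientNet embeddings has the best accuracy. F1 is basically a harmonic mean of precision and recall, so with no extra assumptions about the importance of specific classes (e.g. that errors on smaller classes must be minimized at all costs), it is a reasonable single number to compare the models by.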
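To make the harmonic-mean remark concrete, here is a minimal sketch in plain Python. The precision and recall values are made up for illustration, and macro-averaging (a plain mean of per-class F1) is an assumption, since the notebook does not show how the stored `f1_score` was aggregated.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(precisions, recalls):
    """Unweighted mean of per-class F1 scores (assumed aggregation)."""
    scores = [f1_score(p, r) for p, r in zip(precisions, recalls)]
    return sum(scores) / len(scores)


# Hypothetical values, not taken from the tables above.
balanced = f1_score(0.70, 0.70)       # precision and recall agree
lopsided = f1_score(0.95, 0.45)       # same arithmetic mean, very different F1
arithmetic_mean = (0.95 + 0.45) / 2

print(f"balanced={balanced:.3f} lopsided={lopsided:.3f} mean={arithmetic_mean:.3f}")
```

The harmonic mean punishes the lopsided case: a model cannot buy a high F1 with precision alone, which is why F1 and accuracy can rank the same models differently.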

🥇 First place is shared between two embedding-based approaches: KMeans on VGG embeddings and KNN on EfficientNet ones. The former is 1.3 percentage points better in F1-score (65.2 vs. 63.9), while the latter is 1.1 points better in Accuracy (68.4 vs. 67.3). Choose your fighter depending on what you wish to achieve.

🥈 The honorable second place is also taken by two participants: pure VGG itself and the LGBM trained on VGG embeddings. Vanilla convnet classification beats boosting by 0.5 percentage points in F1-score (64.5 vs. 64.0), while losing 0.7 points in Accuracy (65.8 vs. 66.5).

🥉 Finally, third place goes to EfficientNet. It does well in overall Accuracy (67.0%), but falls behind in F1-score (62.3%, the lowest among the non-baseline models). It also handles some rare classes poorly: _ArtDeco_ recall is only 30.0% and _cartoon_ precision is 53.3%.
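The ranking can be double-checked from the two "altogether" rows in the output above. This is a minimal sketch, assuming pandas is available; the numbers are copied from those rows, while `summary` and `gap` are names introduced here for illustration only.

```python
import pandas as pd

# Values (in %) copied from the two "altogether" rows: accuracy first, F1 second.
summary = pd.DataFrame(
    {
        "accuracy": [17.6, 27.8, 65.8, 67.3, 66.5, 67.0, 68.4],
        "f1_score": [14.2, 26.2, 64.5, 65.2, 64.0, 62.3, 63.9],
    },
    index=[
        "GUESS", "LOGISTIC", "VGG", "VGG+KMEANS",
        "VGG+LGBM", "EFFICIENTNET", "EFFICIENTNET+KNN",
    ],
)

# Winner per metric.
best_accuracy = summary["accuracy"].idxmax()
best_f1 = summary["f1_score"].idxmax()

# Difference between the two leaders, in percentage points.
gap = (summary.loc["VGG+KMEANS"] - summary.loc["EFFICIENTNET+KNN"]).round(1)

print(summary.sort_values("f1_score", ascending=False))
print(f"best accuracy: {best_accuracy}, best F1: {best_f1}")
print(gap)
```

Sorting by a different column is enough to see how the podium changes with the metric you care about.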

### Criticism

There is much more to test and research here. The most important ideas that are not yet implemented are:

- Validate like a pro:
    - The metrics above were obtained with a single fixed seed, so any estimate of model quality based solely on them is rather weak;
    - Moreover, no separate validation subset was set aside, which also raises questions about overfitting;
    - Finally, what would make those metrics a decent approximation of what we should expect from the trained models _in production_, when confronted with real data, is definitely __cross-validation__. This fella, while rather greedy for resources (computation time grows in proportion to the number of train-test splits), is among the most reliable sources of information about the quality of your model.
- Enrich the data:
    - The dataset is small, and even simple augmentations like flip, rotate, cutout, crop and noise might significantly enrich it and thereby improve the model's ability to generalize and reduce overfitting;
    - Also, there are open datasets out there containing images of various styles (e.g. ArtGan). They might contribute a lot.
- Research hyperparameters:
    - The parametrization fixed in the presented experiments was obtained via a short search over a limited parameter domain. A longer, more detailed search with a tool like Optuna might gain a few percentage points in the metrics.
- Try more architectures and algorithms:
    - Only a few of the many existing neural network architectures were checked; others might be heavier, but also more performant. And vision transformers are also a thing.
    - There is also more to try on top of embeddings than just KMeans and KNN. Go for DBSCAN, SOM, Affinity Propagation, hierarchical clustering and many others. Some of them might suit image embeddings better simply because of the nature of that data. Low feature variance in poorly represented classes means that picking one fixed number (n_clusters or n_neighbors) and expecting the algorithm to perform equally well on clusters of different sizes is rather naive. Instead, some kind of class-specific parameter should be taken into account while clustering. Just be sure not to overfit on the test subset. That is why you need a validation set.
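
The single-seed and cross-validation points from the list above can be sketched with scikit-learn. This is an illustration under stated assumptions, not the pipeline used in this notebook: the embeddings are replaced by synthetic data from `make_classification`, and the fold count, repeat count and `n_neighbors=5` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for image embeddings: 8 classes, 32 features (assumption).
X, y = make_classification(
    n_samples=400,
    n_features=32,
    n_informative=8,
    n_classes=8,
    n_clusters_per_class=1,
    random_state=11,
)

# 5 stratified folds repeated 3 times with different shuffles: 15 fits in total.
# Stratification keeps class shares similar in every fold.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=11)
model = KNeighborsClassifier(n_neighbors=5)

# Macro-averaged F1 weights every class equally, whatever its size.
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")

print(f"macro F1: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} fits")
```

Reporting the mean together with the spread makes it easier to tell whether a gap of about one percentage point between two models is a real difference or just seed noise.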