# Synthetic data. Preliminary study.

Main branch => https://github.com/albertovpd/viu_tfm-deep_vision_classification/tree/synthetic_data_study

With a confusion matrix and a classification report we can see which classes our models struggle with. But how do you check which individual images are particularly hard to classify correctly?

This notebook answers that question. In previous branches of the project:
- 150 pictures of each class were taken to create the test set.
- the rest:
  - were shuffled and divided into a train set (80%) and a validation set (20%). This process was repeated 5 times, to create 5 subfolders with different configurations of train and validation sets => https://github.com/albertovpd/viu_tfm-deep_vision_classification/tree/kfolds_validation

  - the same model was trained and validated with each of these 5 subfolders, saving a different configuration each time.
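
The split described above can be sketched as follows. This is a minimal illustration, not the code used in the linked branch: the function name `split_test_and_folds`, its parameters, the file names, and the choice to split class by class are all assumptions.

```python
import random

def split_test_and_folds(files_by_class, n_test=150, n_folds=5, train_frac=0.8, seed=0):
    """Hypothetical helper: hold out n_test files per class for the test set,
    then build n_folds independently shuffled train/validation splits of the rest."""
    rng = random.Random(seed)
    test = {}
    folds = [{"train": {}, "val": {}} for _ in range(n_folds)]
    for cls, files in files_by_class.items():
        files = sorted(files)
        rng.shuffle(files)
        test[cls] = files[:n_test]          # never reused in train/validation
        rest = files[n_test:]
        for fold in folds:
            shuffled = rest[:]
            rng.shuffle(shuffled)           # a new shuffle for every fold
            cut = int(len(shuffled) * train_frac)
            fold["train"][cls] = shuffled[:cut]
            fold["val"][cls] = shuffled[cut:]
    return test, folds

# toy example with made-up file names: 200 pictures per class
toy = {cls: ["{}_{}.jpg".format(cls, i) for i in range(200)] for cls in ("Bedroom", "Kitchen")}
test_split, fold_splits = split_test_and_folds(toy)
print(len(test_split["Bedroom"]),
      len(fold_splits[0]["train"]["Bedroom"]),
      len(fold_splits[0]["val"]["Bedroom"]))
```

Because every fold reshuffles the same remaining pictures, a given picture can be in the train set of one fold and in the validation set of another.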

Now I'm going to check, in all folders, which pictures are misclassified by all models. Those will be *the hardest pictures to classify*, and I'll work on creating synthetic data for them.
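
The idea of "misclassified by all models" boils down to a set intersection of per-model error indexes. Below is a minimal sketch with toy labels. The helpers `misclassified_indexes` and `indexes_missed_by_all` are hypothetical names used only for illustration; the real helpers live in `misclassification_functions`, which is not shown in this notebook.

```python
def misclassified_indexes(y_pred, y_true):
    """Positions where the predicted label differs from the true label."""
    return [i for i, (p, t) in enumerate(zip(y_pred, y_true)) if p != t]

def indexes_missed_by_all(per_model_indexes):
    """Indexes that appear in every model's list of errors."""
    if not per_model_indexes:
        return []
    return sorted(set.intersection(*map(set, per_model_indexes)))

# toy data: 6 pictures, 3 models (made-up predictions)
y_true = [0, 1, 2, 3, 4, 0]
toy_predictions = {
    "model_a": [0, 1, 1, 3, 0, 1],
    "model_b": [0, 2, 1, 3, 4, 1],
    "model_c": [1, 1, 1, 3, 4, 2],
}
per_model = [misclassified_indexes(p, y_true) for p in toy_predictions.values()]
hardest = indexes_missed_by_all(per_model)
print(hardest)  # [2, 5]
```

The indexes only map back to files if the pictures are read in the same order for every model, which is why the datasets below are loaded with `shuffle=False`.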





In [None]:
# Google Drive stuff
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

Mounted at /content/drive/


In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Mar 22 09:55:16 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    46W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# tf
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


- libs

In [None]:
%tensorflow_version 2.x
# batch ingestion of pics without pickle
from tensorflow.keras.preprocessing import image_dataset_from_directory

# nns
from tensorflow.keras.applications import ResNet50 

from tensorflow.keras import Model
from tensorflow.keras.models import load_model # Sequential
from tensorflow.keras import layers 

# optimization
from tensorflow.keras.optimizers import SGD #Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy, categorical_crossentropy
from tensorflow.keras.callbacks import EarlyStopping

# inference, misclassification & file-listing helpers => written in the misclassification_functions module
import sys
sys.path.append("/content/drive/My Drive/2-Estudios/viu-master_ai/tfm-deep_vision/src")
from misclassification_functions import inferences_target_list, get_misclassified, get_list_of_files

import numpy as np
%matplotlib inline

# navigating through folders
import os

# to copy files from one path to another folder
import shutil

- paths

In [None]:
base_folder = "/content/drive/My Drive/2-Estudios/viu-master_ai/tfm-deep_vision/"
test_folder = base_folder+"input/dataset_1test_5trainval_folders/test_ds/"

# input
reg_input = base_folder+"input/dataset_1test_5trainval_folders/train_val_ds/trainval_regular_partitions/"
misclassified_folder = base_folder+"input/dataset_1test_5trainval_folders/misclassifications/"

# models stored in 
output_folder = base_folder + "output/"  # base_folder already ends with "/"

- create folder for pictures misclassified for all models

In [None]:
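# create a folder with 5 subfolders, 1 for each class
# (same class names, in the same order, used when loading the datasets below)
class_names = ["Bedroom", "Bathroom", "Dinning", "Livingroom", "Kitchen"]
for cls in class_names:
    folder = misclassified_folder + cls
    os.makedirs(folder, exist_ok=True)
    # drop the 49-character "/content/drive/My Drive/2-Estudios/viu-master_ai/" prefix
    print(folder[49:])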
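
The loops further down call `relocating_misclassified`, whose definition does not appear in this notebook. As a rough idea of what a copy step like that might involve, here is a sketch under stated assumptions: the function name `copy_to_class_folders` is hypothetical, it takes a list of file paths (not indexes), and it assumes the class is the name of each picture's parent folder. It is not the author's implementation.

```python
import os
import shutil
import tempfile

def copy_to_class_folders(paths, destination_root):
    """Hypothetical helper: copy each picture into destination_root/<class>/,
    where <class> is the name of the picture's parent folder."""
    copied = []
    for src in paths:
        cls = os.path.basename(os.path.dirname(src))
        dst_dir = os.path.join(destination_root, cls)
        os.makedirs(dst_dir, exist_ok=True)
        copied.append(shutil.copy(src, dst_dir))  # returns the new file's path
    return copied

# toy run in a temporary folder with a made-up picture
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "train_ds", "Bedroom", "pic_001.jpg")
    os.makedirs(os.path.dirname(src))
    with open(src, "w") as fh:
        fh.write("fake image")
    copied = copy_to_class_folders([src], os.path.join(tmp, "misclassifications"))
    relative_copies = [os.path.relpath(p, tmp) for p in copied]
    copies_exist = all(os.path.isfile(p) for p in copied)
print(relative_copies, copies_exist)
```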

- common parameters

In [None]:
image_size = (128,128)
batch_size = 128
epochs = 250
opt = SGD(momentum=0.9) 

- models!

In [None]:
# available models
onlyfiles = [f for f in os.listdir(output_folder)
             if os.path.isfile(os.path.join(output_folder, f)) and ".h5" in f]
for fname in sorted(onlyfiles):
    print(fname)

In [None]:
# build a dict with the models I want: one ResNet50 per fold, keyed by file name
model_files = ["resnet50_NOdataAug_dropoutFirst007_regKfolds_fold{}.h5".format(i) for i in range(5)]
models_dict = {name: load_model(output_folder + name) for name in model_files}

In [None]:
folders = os.listdir(reg_input)
folders

['fold2', 'fold4', 'fold0', 'fold3', 'fold1']

- train set folders (each fold has a subfolder called train_ds, containing one subfolder per class, on which we evaluate all models)

In [None]:
for f in folders:
  print("===>", f)
  misclassified_train_folders=[]
  # -- train dataset for each folder
  train_path = reg_input+f+"/"+'train_ds/'
  # print("\n train dataset:", "\n", train_path)
  train_ds = image_dataset_from_directory(
            train_path,
            class_names=["Bedroom","Bathroom","Dinning","Livingroom","Kitchen"],
            seed=None,
            validation_split=None, 
            subset=None,
            image_size= image_size,
            batch_size= batch_size,
            color_mode='rgb',
            shuffle=False 
            )        
  # list of paths for analysed images
  pic_list = get_list_of_files(train_path)

  for nn in models_dict:
        # inferences and real values
        y_pred, y_target = inferences_target_list(models_dict[nn], train_ds)
        
        # misclassified ones
        misclassified = get_misclassified(y_pred, y_target)
        print("elements misclassified in {} for model {}: ".format(f, nn), len(misclassified))
        misclassified_train_folders.append(misclassified)

  # get the indexes of the pics misclassified by all models in this folder
  common_misclassified = list(set.intersection(*map(set, misclassified_train_folders)))  
  print("- total common misclassified in {} for all models: ".format(f), len(common_misclassified))

  # labels associated with those indexes, for testing 
  # target_misclassified = [y_target[i] for i in common_misclassified]
  # pred_misclassified = [y_pred[i] for i in common_misclassified]

  # get the paths associated with those indexes
  pic_list_misclassified = [pic_list[i] for i in common_misclassified]

  # copy those pics to the global misclassification folder
  relocating_misclassified(common_misclassified)

===> fold2
Found 3598 files belonging to 5 classes.
elements misclassified in fold2 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold0.h5:  714
elements misclassified in fold2 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold1.h5:  692
elements misclassified in fold2 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold2.h5:  384
elements misclassified in fold2 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold3.h5:  722
elements misclassified in fold2 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold4.h5:  271
- total misclassified in fold2 for all models:  167
===> fold4
Found 3598 files belonging to 5 classes.
elements misclassified in fold4 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold0.h5:  695
elements misclassified in fold4 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold1.h5:  691
elements misclassified in fold4 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold2.h5:  468
elements misclassified in fo

In [None]:
# shutil.copy won't duplicate an existing file in a destination folder, it will overwrite it
sum([len(files) for r, d, files in os.walk(misclassified_folder)])

778
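
A small, self-contained check of the overwrite behaviour noted in the comment above, using temporary folders and made-up file names (nothing here touches the project folders):

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # two source pictures with the same file name in different (made-up) folds
    src_a = os.path.join(tmp, "fold0", "pic.jpg")
    src_b = os.path.join(tmp, "fold1", "pic.jpg")
    dst = os.path.join(tmp, "misclassifications")
    for path, content in ((src_a, "first"), (src_b, "second")):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as fh:
            fh.write(content)
    os.makedirs(dst)

    shutil.copy(src_a, dst)
    shutil.copy(src_b, dst)  # same file name: replaces the first copy

    files_in_dst = os.listdir(dst)
    with open(os.path.join(dst, "pic.jpg")) as fh:
        final_content = fh.read()

print(files_in_dst, final_content)  # ['pic.jpg'] second
```

Because copies with the same file name overwrite each other, the file count in the destination folder can be lower than the number of copy operations.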

- validation set folders (each fold has a subfolder called val_ds, containing one subfolder per class, on which we evaluate all models)

In [None]:

for f in folders:
  print(f)
  misclassified_val_folders=[]
  # --validation dataset for each folder
  val_path = reg_input+f+"/"+"val_ds"
  val_ds = image_dataset_from_directory(
          val_path,
          class_names=["Bedroom","Bathroom","Dinning","Livingroom","Kitchen"],
          seed=None,
          validation_split=None, 
          subset=None,
          image_size= image_size,
          batch_size= batch_size,
          color_mode='rgb',
          shuffle=False 
      )      
  # list of paths for analysed images
  pic_list_val = get_list_of_files(val_path)

  for nn in models_dict:
      # inferences and real values
      y_pred_val, y_target_val = inferences_target_list(models_dict[nn], val_ds)
      
      # misclassified ones
      misclassified_val = get_misclassified(y_pred_val, y_target_val)
      print("elements misclassified in {} for model {}: ".format(f, nn), len(misclassified_val))
      misclassified_val_folders.append(misclassified_val)

  # get the indexes of the pics misclassified by all models in this folder
  common_misclassified_val = list(set.intersection(*map(set, misclassified_val_folders)))
  print("- total common misclassified in {} for all models: ".format(f), len(common_misclassified_val))

  # paths associated with those indexes
  pic_list_misclassified_val = [pic_list_val[i] for i in common_misclassified_val]
  print(len(pic_list_misclassified_val))

  # copy those pics to the global misclassification folder
  relocating_misclassified(common_misclassified_val)

fold2
Found 902 files belonging to 5 classes.
elements misclassified in fold2 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold0.h5:  173
elements misclassified in fold2 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold1.h5:  178
elements misclassified in fold2 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold2.h5:  208
elements misclassified in fold2 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold3.h5:  167
elements misclassified in fold2 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold4.h5:  99
- total misclassified in fold2 for all models:  61
61
fold4
Found 911 files belonging to 5 classes.
elements misclassified in fold4 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold0.h5:  193
elements misclassified in fold4 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold1.h5:  182
elements misclassified in fold4 for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold2.h5:  125
elements misclassified in fold4 for mod

In [None]:
sum([len(files) for r, d, files in os.walk(misclassified_folder)])

1007

- test set folder (a folder with 5 subfolders, one for each class). Each subfolder has 150 pics that don't appear in the train/validation folders.

In [None]:
test_ds = image_dataset_from_directory(
    test_folder,
      class_names=["Bedroom","Bathroom","Dinning","Livingroom","Kitchen"],
      seed=None,
      validation_split=None, 
      subset=None,
      image_size= image_size,
      batch_size= batch_size,
      color_mode='rgb',
      shuffle=False 
  )     
# list of paths for analysed images
pic_list = get_list_of_files(test_folder)
misclassified_test_folder = []
for nn in models_dict:
      # inferences and real values
      y_pred_test, y_target_test = inferences_target_list(models_dict[nn], test_ds)
      
      # misclassified ones
      misclassified_test = get_misclassified(y_pred_test, y_target_test)
      print("elements misclassified in test dataset for model {}: ".format(nn), len(misclassified_test))
      misclassified_test_folder.append(misclassified_test)

# get the indexes of the test pics misclassified by all models
common_misclassified_test = list(set.intersection(*map(set, misclassified_test_folder)))  
print("- total common misclassified in test dataset for all models: ", len(common_misclassified_test))

# get the paths associated with those indexes
pic_list_misclassified_test = [pic_list[i] for i in common_misclassified_test]
print(len(pic_list_misclassified_test))

# copy those pics to the global misclassification folder
relocating_misclassified(common_misclassified_test)

Found 750 files belonging to 5 classes.
elements misclassified in test dataset for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold0.h5:  234
elements misclassified in test dataset for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold1.h5:  221
elements misclassified in test dataset for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold2.h5:  220
elements misclassified in test dataset for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold3.h5:  234
elements misclassified in test dataset for model resnet50_NOdataAug_dropoutFirst007_regKfolds_fold4.h5:  215
- total common misclassified in test dataset for all models:  134
134


In [None]:
sum([len(files) for r, d, files in os.walk(misclassified_folder)])

1141