#### Creation of TFRecords (for IRNV2) and creation of image-input for ensemble (based on IRNV2 output)
The TFRecords are created by following a tutorial by Kwot Sin, which can be found here:
###### https://kwotsin.github.io/tech/2017/01/29/tfrecords.html
Some alterations have been made, as my approach required Stratified K-Folds and I needed to include the unique image ID in the TFRecord. The ID is necessary to map the IRNV2 output (probabilities) back to the correct image, as IRNV2 shuffles the batches. Thus, two modules have been changed; the changed versions have 'daan' in their name.

###### Intermediate output (TFRecords) used for "GH - Image Classification - IRNV2 + CLR"
###### Intermediate output (strat_****_data) used for "GH - Text Classification" (and further manipulation in this notebook)
###### Final output of this notebook used for: "GH - Ensemble model"

In [6]:
import math
import os
import sys
import tensorflow as tf
import random
from dataset_utils_png import _dataset_exists, _get_filenames_and_classes, write_label_file, _convert_dataset
from dataset_utils_png_daan import _convert_dataset_daan
import pandas as pd

##### The TFRecords will be created, to make the image data suitable for the IRNV2 model.

Creation of dummy classes for the images: the standard code does a 'raw split' into train/test, whereas I want Stratified K-Folds.

In [3]:
#Raw string, so the backslashes are never interpreted as escape sequences
dataset_dir = r'C:/Users\studentid\Desktop\JADS - Master Thesis\Data\ImagesFinalData'


dataset_main_folder_list = [name for name in os.listdir(dataset_dir) if os.path.isdir(os.path.join(dataset_dir,name))]
dataset_root = os.path.join(dataset_dir, dataset_main_folder_list[0])
directories = []
class_names = []

for filename in os.listdir(dataset_root):
    path = os.path.join(dataset_root, filename)
    if os.path.isdir(path):
        directories.append(path)
        class_names.append(filename)

dummy_classes_strat = []
photo_filenames = []
for directory in directories:
    for filename in os.listdir(directory):
        path = os.path.join(directory, filename)
        dummy_classes_strat.append(directory)
        photo_filenames.append(path)


In [5]:
flags = tf.app.flags
#tf.app.run


#State your dataset directory
flags.DEFINE_string('dataset_dir', r'C:/Users\studentid\Desktop\JADS - Master Thesis\Data\ImagesFinalData', 'String: Your dataset directory')

# Proportion of dataset to be used for evaluation
flags.DEFINE_float('validation_size', 0.25, 'Float: The proportion of examples in the dataset to be used for validation')

# The number of shards to split the dataset into.
flags.DEFINE_integer('num_shards', 10, 'Int: Number of shards to split the TFRecord files into')

# Seed for repeatability.
flags.DEFINE_integer('random_seed', 0, 'Int: Random seed to use for repeatability.')

#Output filename for the naming the TFRecord file
flags.DEFINE_string('tfrecord_filename', 'productimages', 'String: The output filename to name your TFRecord file')

FLAGS = flags.FLAGS

remaining_args = FLAGS([sys.argv[0]] + [flag for flag in sys.argv if flag.startswith("--")]) #Addition, to empty flags.

assert(remaining_args == [sys.argv[0]])

#### The Excel file 'labels_left' is used later on to create the label encoding based on the labels that are left.
The 'labels_left' file is created a few blocks below, but has to be loaded here already, because the mapping of class names to ids has to be based on the 275 labels that remain in the data, not on the number of labels present before all labels with fewer than 4 products were dropped. The drop itself is also done below, when train(1), test(1), train(2), and val(2) are created.

In [2]:
excel_labels275 = pd.read_excel('labels_left.xlsx')
print(len(excel_labels275))
excel_labels275 = excel_labels275.drop_duplicates()
print(len(excel_labels275))
list_labels275 = list(excel_labels275['labels_left'])
#list_labels275

5230
275


In [6]:
#### photo_filenames, class_names = _get_filenames_and_classes('C:/Users\studentid\Desktop\JADS - Master Thesis\Data\ImagesPaddedLabel')
class_names_sorted = [int(x) for x in list_labels275]
class_names_sorted.sort(key=float)
class_names_string = [str(item) for item in class_names_sorted]
class_names_to_ids = dict(zip(class_names_string, range(len(class_names_string))))

In [7]:
len(class_names_sorted)

275
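
The reason for converting the labels to integers before sorting can be seen in a small sketch. The label values below are made up for illustration; only the sort-then-enumerate pattern mirrors the cells above.

```python
# Hypothetical label values; the real notebook reads 275 labels from 'labels_left.xlsx'.
raw_labels = ["100", "9", "25", "9", "100"]

# Remove duplicates, then sort numerically (as integers), not alphabetically.
numeric_sorted = sorted({int(label) for label in raw_labels})
string_sorted = sorted(set(raw_labels))

# Map every class name (as a string) to a consecutive integer id.
class_names_to_ids = {str(name): idx for idx, name in enumerate(numeric_sorted)}

print(numeric_sorted)      # [9, 25, 100]
print(string_sorted)       # ['100', '25', '9']
print(class_names_to_ids)  # {'9': 0, '25': 1, '100': 2}
```

Sorting the strings directly would put '100' before '25', so the ids would no longer follow the numeric order of the labels.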

### Creation of train(1), test(1), train(2), and val(2). 

In [8]:
#The first split is made here. 
#The complete dataset is split in train(1) and test(1)
#Hereafter, test(1) has to be split in train(2) and val(2)

from sklearn.model_selection import train_test_split
train_1, test_1, y_train_1, y_test_1 = train_test_split(photo_filenames, dummy_classes_strat,
                                                    stratify= dummy_classes_strat, 
                                                    test_size=0.25, random_state = 1)

#The creation of train(1) is now complete. This set does not need further processing.

In [9]:
print(len(train_1))
print(len(test_1))

16068
5356


In [10]:
#First, the classes with fewer than 4 products have to be dropped again,
#to allow for grid search and a correct stratified split
test_1_df = pd.DataFrame({'image_test': test_1})
y_test_1_df = pd.DataFrame({'test_label': y_test_1})
total_test_1 = pd.concat([test_1_df, y_test_1_df], axis=1)

#Drop the labels with too few products
#Test_1 will not be used in the end, only its resulting datasets train(2) and val(2)
counts = total_test_1['test_label'].value_counts()
total_test_1 = total_test_1[total_test_1['test_label'].isin(counts[counts > 3].index)]
total_test_1 = total_test_1.reset_index(drop=True)
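
The filter above keeps only the labels that occur more than three times. The same idea in plain Python, on made-up data, looks roughly like this (`keep_frequent` is a hypothetical helper, not part of the notebook):

```python
from collections import Counter

def keep_frequent(items, labels, min_count=4):
    """Keep only the (item, label) pairs whose label occurs at least min_count times."""
    counts = Counter(labels)
    return [(item, label) for item, label in zip(items, labels)
            if counts[label] >= min_count]

# Hypothetical toy data: label 'a' occurs 4 times, label 'b' only twice.
items = ["img0", "img1", "img2", "img3", "img4", "img5"]
labels = ["a", "a", "b", "a", "b", "a"]

kept = keep_frequent(items, labels)
print(kept)  # [('img0', 'a'), ('img1', 'a'), ('img3', 'a'), ('img5', 'a')]
```

With at least four samples per label, a 75/25 stratified split can place samples of that label on both sides.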

In [12]:
#The drop has reduced the number of labels; this many remain
total_test_1['test_label'].nunique()

275

In [13]:
#Creation of second split
#Here, the test set created above is used as 'total dataset', and is split in a new train and val set.

train_2, val_2, y_train_2, y_val_2 = train_test_split(total_test_1['image_test'], total_test_1['test_label'],
                                                    stratify= total_test_1['test_label'], 
                                                    test_size=0.25, random_state = 1)

In [14]:
print(len(train_2))
print(len(val_2))

3922
1308
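
The set sizes printed above follow from the `test_size=0.25` setting. As a quick sanity check, assuming scikit-learn rounds the test share up and assigns the remainder to the training side, the arithmetic can be reproduced as follows (`split_sizes` is a hypothetical helper):

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

# First split: the complete dataset (16068 + 5356 = 21424 images).
print(split_sizes(21424, 0.25))  # (16068, 5356)

# Second split: test(1) after dropping the rare labels (5230 images remain).
print(split_sizes(5230, 0.25))   # (3922, 1308)
```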


In [15]:
#Make sure the data formats are equal again (train_2 and val_2 are pandas Series, whereas train_1 is a list)
train_2 = list(train_2)
val_2 = list(val_2)

In [16]:
#Necessary preprocessing to create image-ids, needed later on
#The file name (without its directory) serves as the unique image name
unique_images_train1 = [os.path.basename(image) for image in train_1]
unique_images_train2 = [os.path.basename(image) for image in train_2]
unique_images_val2 = [os.path.basename(image) for image in val_2]

In [17]:
#Necessary processing to create dictionaries with the image_ids
unique_images_to_ids_train1 = dict(zip(unique_images_train1, range(len(unique_images_train1))))
unique_images_to_ids_train2 = dict(zip(unique_images_train2, range(len(unique_images_train2))))
unique_images_to_ids_val2 = dict(zip(unique_images_val2, range(len(unique_images_val2))))

#Check if lengths are still correct
print(len(unique_images_to_ids_train1))
print(len(unique_images_to_ids_train2))
print(len(unique_images_to_ids_val2))

16068
3922
1308
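
Because the ids are assigned through a dictionary keyed on the file name, two images with the same file name in different folders would collapse into a single entry. The length check above guards against that; a more explicit variant, sketched on made-up paths, could look like this (`build_image_ids` is a hypothetical helper):

```python
import ntpath
from collections import Counter

def build_image_ids(paths):
    """Map each file name to a consecutive id; fail loudly on duplicate names."""
    # ntpath handles Windows-style separators on any operating system.
    names = [ntpath.basename(path) for path in paths]
    duplicates = [name for name, count in Counter(names).items() if count > 1]
    if duplicates:
        raise ValueError(f"Duplicate file names: {duplicates}")
    return {name: idx for idx, name in enumerate(names)}

# Hypothetical paths, only for illustration.
paths = [r"C:\data\12\a\1000000001.png", r"C:\data\34\b\1000000002.png"]
print(build_image_ids(paths))  # {'1000000001.png': 0, '1000000002.png': 1}
```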


In [21]:
# Finally, write the labels file:
labels_to_class_names = dict(zip(range(len(class_names_string)), class_names_string))
write_label_file(labels_to_class_names, FLAGS.dataset_dir)


#### Code to make sure train/validation sets are same for text and images!

Because the text data and image data were split separately from each other, we have to make sure that both base models (text classification & image classification) get the same training data and the same validation data.
Otherwise the meta-classifier will not be able to classify, as the two sets of base-model outputs would not refer to the same products.

In [23]:
#Create a copy of train_1, train_2 & val_2
#So they can afterwards be converted to new dataframes
#To match with the textdata, for creating the same text & image sets
copy_train_1 = train_1[:]
copy_train_2 = train_2[:]
copy_val_2 = val_2[:]

Therefore, a match has to be made between the training images and the 'textdata' file, to make a separation in that data (based on the stratified split made here). The same goes for the validation images.

In [24]:
import pandas as pd

image_files_train1 = pd.DataFrame({'image_train1': copy_train_1})
image_files_train2 = pd.DataFrame({'image_train2': copy_train_2})
image_files_validation2 = pd.DataFrame({'image_validation2': copy_val_2})

In [2]:
#This is the format each string of images has, below is what it needs to be
image_files_validation2['image_validation2'][26]

#### Import 'imagedata.xlsx', the file that will be matched against

In [26]:
totaldata = pd.read_excel('imagedata.xlsx')

In [27]:
#All the '/' are replaced by '\\', now the strings are equal
image_files_train1['image_train1'] = image_files_train1['image_train1'].str.replace('/','\\')
image_files_train2['image_train2'] = image_files_train2['image_train2'].str.replace('/','\\')
image_files_validation2['image_validation2'] = image_files_validation2['image_validation2'].str.replace('/','\\')

totaldata['Images'] = totaldata['Images'].str.replace('/','\\')


In [28]:
#The strings still differ, as the image files sit two folder levels deeper than the paths in 'imagedata.xlsx'.
#These parts of the string have to be omitted.

def drop_nested_folders(path):
    #Keep the first 57 characters (base path) and the last 15 characters (file name part)
    return path[:57] + path[-15:]

edited_train_string1 = [drop_nested_folders(row) for row in image_files_train1['image_train1']]
edited_train_string2 = [drop_nested_folders(row) for row in image_files_train2['image_train2']]
edited_val_string2 = [drop_nested_folders(row) for row in image_files_validation2['image_validation2']]
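
The fixed offsets 57 and 15 only work as long as every path has exactly the same prefix length and file-name length. A less brittle alternative, sketched here on a made-up path and assuming the two extra folder levels sit directly above the file, is to drop them structurally (`drop_two_levels` is a hypothetical helper, not part of the notebook):

```python
from pathlib import PureWindowsPath

def drop_two_levels(path):
    """Remove the two folder levels directly above the file name."""
    p = PureWindowsPath(path)
    # p.parents[0] is the folder containing the file, p.parents[2] is two levels higher.
    return str(p.parents[2] / p.name)

# Hypothetical path, only for illustration.
print(drop_two_levels(r"C:\Data\Images\split\123\1000018772.png"))
# C:\Data\Images\1000018772.png
```

`PureWindowsPath` also accepts mixed `/` and `\` separators and always renders backslashes, so the separate replace step would not be needed for these strings.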

In [29]:
#Write new DF, with the strings edited. Now the two df's can be matched
image_files_train_edit1 = pd.DataFrame({'image_train_edit1': edited_train_string1})
image_files_train_edit2 = pd.DataFrame({'image_train_edit2': edited_train_string2})
image_files_val_edit2 = pd.DataFrame({'image_val_edit2': edited_val_string2})

In [30]:
#The two new datasets are created; these will be used in both 'Text Classification'
#and 'Image Classification' (the latter already created with TFRecords)

strat_train1_data = pd.merge(left=image_files_train_edit1, right=totaldata, left_on=image_files_train_edit1['image_train_edit1'], right_on=totaldata['Images'])
strat_train2_data = pd.merge(left=image_files_train_edit2, right=totaldata, left_on=image_files_train_edit2['image_train_edit2'], right_on=totaldata['Images'])

strat_val2_data = pd.merge(left=image_files_val_edit2, right=totaldata, left_on=image_files_val_edit2['image_val_edit2'], right_on=totaldata['Images'])

In [32]:
#Strip the directory prefix (the first 84 characters) from each label path,
#so that only the label itself remains
labels_left = [label[84:] for label in total_test_1['test_label']]

In [33]:
labels_left_df = pd.DataFrame({'labels_left': labels_left})

In [34]:
labels_left_df.to_excel("labels_left.xlsx", index=False)

In [35]:
# TRAIN1 needs to have the same number of classes as TRAIN2 and VAL2,
# because InceptionResNet V2 requires the same number of classes in train/test.
# So a new merge is needed, to make sure they have the same labels.
# The old 'directory structure', where the images are saved, is required again.
# Simply concatenate them, as the structure is retained by the earlier merge.
strat_train1_data = pd.concat([strat_train1_data, image_files_train1], axis=1)

In [36]:
#Here, all train(1) rows whose label is not among the remaining labels are dropped.

unique_labels_left = pd.DataFrame({'Label':labels_left_df['labels_left'].unique()})
unique_labels_left['Label'] = pd.to_numeric(unique_labels_left['Label'])
strat_train1_data = strat_train1_data[strat_train1_data['Label'].isin(unique_labels_left['Label'])]

In [37]:
#strat_train1_data['image_train1'][0]
new_train_1 = list(strat_train1_data['image_train1'])

#Get unique ids
unique_images_new_train1 = [os.path.basename(new_image1) for new_image1 in new_train_1]
    
unique_images_to_ids_new_train1 = dict(zip(unique_images_new_train1, range(len(unique_images_new_train1))))


In [None]:
#Train1 has been slightly reduced, now the new TFRecord is created.
len(strat_train1_data)

#Now the TFRecords file is written
_convert_dataset_daan('train', new_train_1, class_names_to_ids, unique_images_to_ids_new_train1,
                 dataset_dir = FLAGS.dataset_dir,
                 tfrecord_filename = FLAGS.tfrecord_filename,
                 _NUM_SHARDS = FLAGS.num_shards)

#Train2
_convert_dataset_daan('traintest', train_2, class_names_to_ids, unique_images_to_ids_train2,
                 dataset_dir = FLAGS.dataset_dir,
                 tfrecord_filename = FLAGS.tfrecord_filename,
                 _NUM_SHARDS = FLAGS.num_shards)

#Val_2
_convert_dataset_daan('validation', val_2, class_names_to_ids, unique_images_to_ids_val2,     
                 dataset_dir = FLAGS.dataset_dir,
                 tfrecord_filename = FLAGS.tfrecord_filename,
                 _NUM_SHARDS = FLAGS.num_shards)
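
The `num_shards` flag controls how many TFRecord files each split is written to. The body of `_convert_dataset_daan` is not shown in this notebook; assuming it follows the convention of the TF-Slim conversion scripts, each shard receives a contiguous slice of the file list, roughly as in this sketch (`shard_ranges` is a hypothetical helper):

```python
import math

def shard_ranges(n_files, num_shards):
    """Return the (start, end) index range of the files written to each shard."""
    per_shard = int(math.ceil(n_files / float(num_shards)))
    return [(shard * per_shard, min((shard + 1) * per_shard, n_files))
            for shard in range(num_shards)]

# val(2) holds 1308 images and is written to 10 shards.
ranges = shard_ranges(1308, 10)
print(ranges[0])   # (0, 131)
print(ranges[-1])  # (1179, 1308)
```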

In [37]:
#Finally, all the datasets have been created.

strat_train1_data.to_excel("strat_train1_data.xlsx", index=False)
strat_train2_data.to_excel("strat_train2_data.xlsx", index=False)
strat_val2_data.to_excel("strat_val2_data.xlsx", index=False)

In [1]:
#strat_val2_data

Done! The two files created here are sent to the Jupyter Notebook 'Text Classification'. There these two files will be used as the training & validation data (train is split into train/test inside the GridSearch algorithms)

#### Last section in this notebook: rematch 'image_id' output of InceptionResNetV2 with the actual images
#### This is the case for train2 and val2 (train2 is traindata for svm, val2 is val data)

In [39]:
#For each of the images in 'inception_output.xlsx', the 'image_id' will be rematched with its actual '.png' file. 
#This is done to be able to perfectly match the 'text_output' and 'image_output', to make them serve as input for final model.

print(list(unique_images_to_ids_val2.keys())[list(unique_images_to_ids_val2.values()).index(0)])

1000018772.png


In [40]:
#The output of the InceptionResNetV2 model, as preprocessed in 'Preprocess Image Output.ipynb', is imported
inception_output_train2 = pd.read_excel('inception-output_train2.xlsx')
inception_output_val2 = pd.read_excel('inception-output_val2.xlsx')

# CREATE: Inception_output_train2 and Inception_output_val2


In [41]:
rematched_images_train2 = []
for row in inception_output_train2['ImageID']:
    pngname1 = list(unique_images_to_ids_train2.keys())[list(unique_images_to_ids_train2.values()).index(row)]
    rematched_images_train2.append(pngname1)

rematched_images_val2 = []
for row in inception_output_val2['ImageID']:
    pngname2 = list(unique_images_to_ids_val2.keys())[list(unique_images_to_ids_val2.values()).index(row)]
    rematched_images_val2.append(pngname2)
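
Each lookup above rebuilds two lists and scans them, which is quadratic in the number of images. Since the ids are unique, the dictionary can be inverted once instead. A minimal sketch with made-up entries:

```python
# Hypothetical name -> id mapping, in the same shape as unique_images_to_ids_val2.
images_to_ids = {"1000018772.png": 0, "1000018773.png": 1, "1000018774.png": 2}

# Invert once: id -> name. This is only safe because every id occurs exactly once.
ids_to_images = {image_id: name for name, image_id in images_to_ids.items()}
assert len(ids_to_images) == len(images_to_ids)

# Hypothetical 'ImageID' column of the model output, in shuffled order.
model_output_ids = [2, 0, 1]
rematched = [ids_to_images[image_id] for image_id in model_output_ids]
print(rematched)  # ['1000018774.png', '1000018772.png', '1000018773.png']
```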

In [42]:
#rematched_images_val2

In [43]:
basepath_train2 = []
for row1 in image_files_train2['image_train2']:
    pathofimage1 = row1[:57]
    dashsign1 = '/'
    total1 = pathofimage1 + dashsign1
    basepath_train2.append(total1)

basepath_val2 = []
for row2 in image_files_validation2['image_validation2']:
    pathofimage2 = row2[:57]
    dashsign2 = '/'
    total2 = pathofimage2 + dashsign2
    basepath_val2.append(total2)

In [44]:
len(rematched_images_train2)
len(basepath_train2)

3922

In [45]:
#Create DF's for both the train2 and val2 dataset
basepath_train2_df = pd.DataFrame({'basepath_train2': basepath_train2})
rematched_images_train2_df = pd.DataFrame({'pngnames_train2': rematched_images_train2})
rematched_images_train2_df['matched_images'] = basepath_train2_df['basepath_train2'] + '' + rematched_images_train2_df['pngnames_train2']
rematched_images_train2_df['matched_images'] = rematched_images_train2_df['matched_images'].str.replace('/', '\\')

basepath_val2_df = pd.DataFrame({'basepath_val2': basepath_val2})
rematched_images_val2_df = pd.DataFrame({'pngnames_val2': rematched_images_val2})
rematched_images_val2_df['matched_images'] = basepath_val2_df['basepath_val2'] + '' + rematched_images_val2_df['pngnames_val2']
rematched_images_val2_df['matched_images'] = rematched_images_val2_df['matched_images'].str.replace('/', '\\')

In [46]:
#rematched_images_df['matched_images'][0]

In [46]:
len(inception_output_train2)

3922

In [47]:
inception_output_train2 = pd.concat([inception_output_train2, rematched_images_train2_df], axis=1)

inception_output_val2 = pd.concat([inception_output_val2, rematched_images_val2_df], axis=1)


In [3]:
#train2_dataset['matched_images'][0]
#inception_output_val2

In [49]:
train2_dataset = pd.read_excel('strat_train2_data.xlsx')
val2_dataset = pd.read_excel('strat_val2_data.xlsx')

In [4]:
#val2_dataset

In [51]:
matches_counter = 0

for row in train2_dataset['image_train_edit2']:
    for image in inception_output_train2['matched_images']:
        if row == image:
            matches_counter += 1

#CHECK IF ALL IMAGES MATCH IN BOTH DATASETS, AND THEY DO! 
print(matches_counter)

3922
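
The nested loop above counts matching pairs; a count equal to the number of rows is reassuring, but duplicate paths could inflate it. A set comparison checks the same thing more directly and also shows which paths are missing on either side. A sketch with made-up values (`compare_keys` is a hypothetical helper):

```python
def compare_keys(left, right):
    """Return the values found only in `left` and only in `right`."""
    left_set, right_set = set(left), set(right)
    return sorted(left_set - right_set), sorted(right_set - left_set)

# Hypothetical path columns from the text data and from the image output.
text_paths = ["a.png", "b.png", "c.png"]
image_paths = ["c.png", "a.png", "d.png"]

only_text, only_image = compare_keys(text_paths, image_paths)
print(only_text)   # ['b.png']
print(only_image)  # ['d.png']
```

Two empty lists mean that both datasets contain exactly the same set of paths.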


In [52]:
#The two frames are now merged, to create the dataframe that will be used as input for the final layer of the ensemble model.
image_input_train2_ensemble = train2_dataset.merge(inception_output_train2, left_on='image_train_edit2', right_on='matched_images', how='inner')

image_input_val2_ensemble = val2_dataset.merge(inception_output_val2, left_on='image_val_edit2', right_on='matched_images', how='inner')

In [53]:
len(image_input_val2_ensemble)
#image_input_val2_ensemble

1308

In [5]:
#image_input_val2_ensemble

In [55]:
#Some unnecessary columns are dropped, before writing this to an excel file.
#What is important to remain in the data, are the probability scores, and their respective labels
cols = [1,2,3,4,5,6,7,8,9,10,12,17,18]

image_input_train2_ensemble.drop(image_input_train2_ensemble.columns[cols],axis=1,inplace=True)

image_input_val2_ensemble.drop(image_input_val2_ensemble.columns[cols],axis=1,inplace=True)

In [56]:
#image_input_val2_ensemble

In [57]:
#As the labels were changed (due to stratification in train/test(1) and train/val(2)), the label 'names' have to be remapped.
#The 'Predictions' and 'Labels' columns, resulting from the Inception model, still hold the encoded labels instead of the actual label names.
#That's why there is a mismatch at the moment between 'Label' and 'Labels'.
ids_to_class_names = {v: k for k, v in class_names_to_ids.items()}

image_input_train2_ensemble = image_input_train2_ensemble.replace({"Labels": ids_to_class_names})
image_input_train2_ensemble = image_input_train2_ensemble.replace({"Predictions": ids_to_class_names})

image_input_val2_ensemble = image_input_val2_ensemble.replace({"Labels": ids_to_class_names})
image_input_val2_ensemble = image_input_val2_ensemble.replace({"Predictions": ids_to_class_names})



In [62]:
#Remapping is a success!
#image_input_val2_ensemble

In [58]:
#Finally, write the dataframe to Excel. This file will later on be used, in combination with the text output.
image_input_train2_ensemble.to_excel("image_input_train2_ensemble.xlsx", header=True)

image_input_val2_ensemble.to_excel("image_input_val2_ensemble.xlsx", header=True)