# Isolation of Training Data

This notebook prepares the data for the next set of segmentation options: neural networks.

## Author: Alexander Goudemond, Student Number: 219030365

We cannot conduct semantic segmentation, but we can conduct instance segmentation. This notebook therefore prepares our data for training, and a separate notebook will train the neural networks.

# Imports


In [1]:
from os import getcwd, walk, mkdir, stat, remove
from os import sep # used later on, in a function, to print directory contents
from os.path import exists, basename, join

from shutil import copyfile

from PIL.Image import fromarray
import cv2

import matplotlib.pyplot as plt
import numpy as np

# Directory of Images

In [2]:
def get_manual_segmentation_directories(startPath):
    location_array = []
    acceptable_folders = ["SEG"]

    for root, dirs, files in walk(startPath):
        # skip this folder
        if ("OriginalZipped" in root):
            continue

        elif (root[ -3 : ] not in acceptable_folders):
            continue

        location_array.append(root)
    
    return location_array
###
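The function above checks the last three characters of each path, so it would also accept a folder whose name merely ends in "SEG". A minimal sketch of a stricter variant, assuming that an exact match on the final path component is what is wanted (the name `find_seg_directories` and the `skip_token` parameter are illustrative, not part of the notebook):

```python
from os import walk
from os.path import basename


def find_seg_directories(start_path, skip_token="OriginalZipped"):
    """Return every directory under start_path whose final component is 'SEG'."""
    found = []
    for root, dirs, files in walk(start_path):
        # Skip anything inside the archive folder, as the original function does.
        if skip_token in root:
            continue
        # Compare the whole folder name rather than the last three characters.
        if basename(root) == "SEG":
            found.append(root)
    # walk() does not guarantee an order, so sort for reproducible results.
    return sorted(found)
```

Sorting the result is a design choice made here so that later index-based pairing does not depend on the order in which the file system returns folders.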

In [3]:
current_directory = getcwd()
desired_directory = "..\\..\\Comp700_Segmented"

In [4]:
path = (current_directory + "\\" + desired_directory)
location_array = get_manual_segmentation_directories(path)
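The path built above keeps the literal `..\..` segments and only works with Windows separators. A small sketch of a portable alternative, assuming the data folder sits two levels above the working directory as it does here (the helper name `build_data_path` is hypothetical):

```python
from os.path import join, normpath, pardir


def build_data_path(base_directory, folder_name="Comp700_Segmented"):
    """Join base_directory with two parent steps and folder_name, then collapse the '..' parts."""
    # pardir is '..' on every platform; join() inserts the right separator.
    return normpath(join(base_directory, pardir, pardir, folder_name))
```

Calling `build_data_path(getcwd())` would then give a path without `..` segments, which makes the printed folder lists easier to read.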

In [5]:
# first 10
print( location_array[0:10] ) 
print("Number of folders:", len( location_array ) ) 

['c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\01_GT\\SEG', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\01_ST\\SEG', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\02_GT\\SEG', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\02_ST\\SEG', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-MuSC\\BF-C2DL-MuSC\\01_GT\\SEG', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-MuSC\\BF-C2DL-MuSC\\01_ST\\SEG', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-MuSC\\BF-C2DL-MuSC\\02_GT\\SEG', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-MuSC\\BF-C2DL-MuSC\\02_ST\\SEG', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\DIC-C2DH-HeLa\\DIC-C2DH-HeLa\\01_GT\\SEG

We can now investigate how many pictures each of those 36 folders contains. These 36 folders act as the memo (the reference answers) for our dataset, so collecting their file names lets us fetch the matching images as our training images:

In [6]:
# loop through location_array, looking at the number of images:

image_quantities = []

for item in location_array:
    for root, dirs, files in walk(item):
        image_quantities.append(len(files))

image_quantities[0:10]
    

[49, 1764, 8, 1764, 50, 1376, 50, 1376, 9, 84]

In [7]:
# use image_quantities and location_array to generate a collection of paths for each image

y_training_data_paths = []
temp_array = []

for item in location_array:

    for root, dirs, files in walk(item):
        # temp_array = files
        for name in files:
            temp_array.append(item + "\\" + name)

    y_training_data_paths.append(temp_array)
    temp_array = [] # reset


In [8]:
y_training_data_paths

[['c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\01_GT\\SEG\\man_seg0058.tif',
  'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\01_GT\\SEG\\man_seg0108.tif',
  'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\01_GT\\SEG\\man_seg0126.tif',
  'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\01_GT\\SEG\\man_seg0175.tif',
  'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\01_GT\\SEG\\man_seg0183.tif',
  'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\01_GT\\SEG\\man_seg0218.tif',
  'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\01_GT\\SEG\\man_seg0332.tif',
  'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC

Sanity check:

In [9]:
index = -1

for row in y_training_data_paths:
    index += 1
    if (len(row) != image_quantities[index]):
        print("Incompatible sizes!")

Fantastic!

We can now go through each element of y_training_data_paths and extract the image number and extension. These will then be used to fetch the image paths for x_training_data_paths.

Note that location_array often holds two folders of segmented images per sequence, a _GT folder and an _ST folder. So we have the opportunity for up to twice as many training sets!

To see this:

In [10]:
location_array[0:10]

['c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\01_GT\\SEG',
 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\01_ST\\SEG',
 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\02_GT\\SEG',
 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\02_ST\\SEG',
 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-MuSC\\BF-C2DL-MuSC\\01_GT\\SEG',
 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-MuSC\\BF-C2DL-MuSC\\01_ST\\SEG',
 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-MuSC\\BF-C2DL-MuSC\\02_GT\\SEG',
 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-MuSC\\BF-C2DL-MuSC\\02_ST\\SEG',
 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\DIC-C2DH-HeLa\\DIC-C2DH-HeLa\\01

So we can split y_training_data_paths into two arrays, one for the GT folders and one for the ST folders:

In [11]:
gt_y_training_data_paths = []
st_y_training_data_paths = []
index = -1

for item in location_array:
    index += 1
    if ("_GT" in item):
        gt_y_training_data_paths.append(y_training_data_paths[index])
    else:
        st_y_training_data_paths.append(y_training_data_paths[index])

print("GT_ :", len(gt_y_training_data_paths))
print("ST_ :", len(st_y_training_data_paths))

GT_ : 20
ST_ : 16


This is expected because not all of the data sets have ST folders. The data sets missing ST folders are:

-- Fluo-C2DL-Huh7 (x2)

-- Fluo-N2DH-SIM+ (x2)

Let us now look at the x data:

In [12]:
def get_segmentation_directories(startPath):
    location_array = []
    acceptable_folders = ["\\01", "\\02"]

    for root, dirs, files in walk(startPath):
        # skip this folder
        if ("OriginalZipped" in root):
            continue

        elif (root[ -3 : ] not in acceptable_folders) or ("(1)" in root):
            continue

        location_array.append(root)
    
    return location_array
###

In [13]:
path = (current_directory + "\\" + desired_directory)
image_location_array = get_segmentation_directories(path)

print(image_location_array[0:10])
print(len(image_location_array))

['c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\01', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-HSC\\BF-C2DL-HSC\\02', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-MuSC\\BF-C2DL-MuSC\\01', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\BF-C2DL-MuSC\\BF-C2DL-MuSC\\02', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\DIC-C2DH-HeLa\\DIC-C2DH-HeLa\\01', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\DIC-C2DH-HeLa\\DIC-C2DH-HeLa\\02', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\Fluo-C2DL-Huh7\\Fluo-C2DL-Huh7\\01', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\Fluo-C2DL-Huh7\\Fluo-C2DL-Huh7\\02', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\Comp700_Segmented\\Fluo-C2DL-MSC\\Fluo-C2DL-MSC\\01', 'c:\\Users\\G5\\Documents\\GitHub\\COMP700\\..\\..\\C

Now all we have to do is loop through all images inside the respective paths in image_location_array, and extract the corresponding images!

In [14]:
def extractNumber(path, symbol):
    right_most_index = path.rfind(symbol)
    return (path[right_most_index + 1 : ])
###

print(y_training_data_paths[7][3])
extractNumber(y_training_data_paths[7][3], "g")

c:\Users\G5\Documents\GitHub\COMP700\..\..\Comp700_Segmented\BF-C2DL-MuSC\BF-C2DL-MuSC\02_ST\SEG\man_seg0003.tif


'0003.tif'
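`extractNumber` depends on the last "g" in the whole path, which works for names such as `man_seg0003.tif` but would break if the file name contained no "g". A hedged sketch of a regular-expression alternative that returns only the digits; the function name `extract_frame_id` is an assumption for illustration, and it returns `'0003'` rather than `'0003.tif'`:

```python
import re


def extract_frame_id(path):
    """Return the digit run just before the file extension (e.g. '0003'), or None."""
    # Split on both separators so Windows-style paths parse on any platform.
    file_name = re.split(r"[\\/]", path)[-1]
    # Capture the digits that sit directly in front of the extension.
    match = re.search(r"(\d+)\.[A-Za-z0-9]+$", file_name)
    return match.group(1) if match else None
```

For the path printed above, `extract_frame_id` would return `'0003'`; a file name without digits before its extension yields `None` instead of a misleading substring.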

In [15]:
image_number = ""
row = 0
column = 0

gt_x_training_data_paths = []
temp_array = []
count = -1

for item in image_location_array:
    count += 1
    # print(item)
    for root, dirs, files in walk(item):
        # print(files)
        for image in files:
            # initialize
            if (column == 0):
                image_number = extractNumber(gt_y_training_data_paths[row][column], "g")
                column += 1

            if (image_number in image):
                # print(count)
                temp_array.append(item + "\\" + image)

                # stop at final index position
                if (column < len(gt_y_training_data_paths[row]) ):
                    image_number = extractNumber(gt_y_training_data_paths[row][column], "g")
                    column += 1

        row += 1; column = 0
        gt_x_training_data_paths.append(temp_array)
        temp_array = []

    print("folder", (count+1), "of", len(image_location_array) , "complete")


folder 1 of 20 complete
folder 2 of 20 complete
folder 3 of 20 complete
folder 4 of 20 complete
folder 5 of 20 complete
folder 6 of 20 complete
folder 7 of 20 complete
folder 8 of 20 complete
folder 9 of 20 complete
folder 10 of 20 complete
folder 11 of 20 complete
folder 12 of 20 complete
folder 13 of 20 complete
folder 14 of 20 complete
folder 15 of 20 complete
folder 16 of 20 complete
folder 17 of 20 complete
folder 18 of 20 complete
folder 19 of 20 complete
folder 20 of 20 complete


Let's do a sanity check to ensure we have corresponding images:

In [16]:
for row in range(len(gt_x_training_data_paths)):
    print( len(gt_x_training_data_paths[row]), ":::", len(gt_y_training_data_paths[row]) )

49 ::: 49
8 ::: 8
50 ::: 50
50 ::: 50
9 ::: 9
9 ::: 9
8 ::: 8
5 ::: 5
18 ::: 18
33 ::: 33
30 ::: 30
20 ::: 20
65 ::: 65
150 ::: 150
28 ::: 28
8 ::: 8
15 ::: 15
19 ::: 19
2 ::: 2
2 ::: 2


Let's repeat the process for st_x_training_data_paths:

Recall that not every data set has ST folders, so st_y_training_data_paths has no entries for:

-- Fluo-C2DL-Huh7 (x2)

-- Fluo-N2DH-SIM+ (x2)

In [21]:
image_number = ""
row = 0
column = 0

st_x_training_data_paths = []
temp_array = []
count = -1

for item in image_location_array:
    count += 1
    # print(item)
    for root, dirs, files in walk(item):
        # print(files)
        for image in files:

            # stop because these folders don't have ST folders
            test = image_location_array[count]
            if ("Fluo-C2DL-Huh7" in test) or ("Fluo-N2DH-SIM+" in test):
                break

            # initialize
            if (column == 0):
                image_number = extractNumber(st_y_training_data_paths[row][column], "g")
                column += 1

            if (image_number in image):
                # print(count)
                temp_array.append(item + "\\" + image)

                # stop at final index position
                if (column < len(st_y_training_data_paths[row]) ):
                    image_number = extractNumber(st_y_training_data_paths[row][column], "g")
                    column += 1

        column = 0
        if (len(temp_array) != 0):
            row += 1
            st_x_training_data_paths.append(temp_array)
            temp_array = []

    
    print("folder", (count+1), "of", len(image_location_array) , "complete")


folder 1 of 20 complete
folder 2 of 20 complete
folder 3 of 20 complete
folder 4 of 20 complete
folder 5 of 20 complete
folder 6 of 20 complete
folder 7 of 20 complete
folder 8 of 20 complete
folder 9 of 20 complete
folder 10 of 20 complete
folder 11 of 20 complete
folder 12 of 20 complete
folder 13 of 20 complete
folder 14 of 20 complete
folder 15 of 20 complete
folder 16 of 20 complete
folder 17 of 20 complete
folder 18 of 20 complete
folder 19 of 20 complete
folder 20 of 20 complete


In [22]:
for row in range(len(st_x_training_data_paths)):
    print( len(st_x_training_data_paths[row]), ":::", len(st_y_training_data_paths[row]) )

1764 ::: 1764
1764 ::: 1764
32 ::: 1376
1376 ::: 1376
84 ::: 84
84 ::: 84
48 ::: 48
48 ::: 48
92 ::: 92
92 ::: 92
92 ::: 92
92 ::: 92
115 ::: 115
115 ::: 115
300 ::: 300
300 ::: 300
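In the output above, the third row reports 32 images against 1376 masks, so that folder was not matched completely. Scanning a long printout for such rows is error-prone; a small sketch of a helper that lists only the mismatching rows (the name `find_length_mismatches` is illustrative, not part of the notebook):

```python
def find_length_mismatches(x_rows, y_rows):
    """Return (row_index, len_x, len_y) for every row whose lengths differ."""
    mismatches = []
    # zip() stops at the shorter list, so compare len(x_rows) and len(y_rows) separately.
    for index, (x_row, y_row) in enumerate(zip(x_rows, y_rows)):
        if len(x_row) != len(y_row):
            mismatches.append((index, len(x_row), len(y_row)))
    return mismatches
```

Calling it as `find_length_mismatches(st_x_training_data_paths, st_y_training_data_paths)` would return an empty list when every folder is fully paired, which is easy to assert on before training.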


TBD - do x_training for st and gt