 <font size="6">  Image file inspection </font> 

---

In this notebook, we will

* check the composition of the image files: unique sample ids, image sizes, and the markers in the panel
* move unwanted images out of the folder
* spot missing samples
* spot repeated images

## Load libraries

In [1]:
import glob
import PIL
import numpy as np
from collections import Counter
import pandas as pd
from PIL import Image 
import tifffile

## Check image number and image size

List all image files in a folder with glob

In [2]:
folder_path = '../data/tif/*.tif'
file_list = sorted(glob.glob(folder_path))

Count how many images there are of each size

In [3]:
print('File number: ',len(file_list))
# PIL reports size as (width, height)
image_size_list = [np.array(Image.open(x).size) for x in file_list]
# axis=0 treats each (width, height) pair as one item; without it np.unique
# flattens the array and counts widths and heights separately
img_size, count = np.unique(image_size_list, axis=0, return_counts=True)
for size, n in zip(img_size, count):
    print('There are ',str(n), 'images of size ',str(size),'\n')

File number:  2
There are  2 images of size  [1340 1004] 
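
As an alternative to `np.unique`, the same tally can be sketched with `collections.Counter` over `(width, height)` tuples. The `sizes` list below is made-up illustration data standing in for `Image.open(path).size`, and `count_sizes` is a hypothetical helper name, not part of this notebook.

```python
from collections import Counter

def count_sizes(sizes):
    """Count how often each (width, height) pair occurs."""
    # Tuples are hashable, so each pair is counted as one unit
    return Counter(tuple(s) for s in sizes)

# Made-up sizes standing in for Image.open(path).size
sizes = [(1340, 1004), (1340, 1004), (1000, 1000)]
size_counts = count_sizes(sizes)
for (width, height), n in size_counts.most_common():
    print(f'There are {n} images of size {width} x {height}')
```

Counting whole tuples also handles square images, where width and height are the same number.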



## Check image number and size per sample

In [4]:
## Take sample id from the file name (the text before the first space)
patient_id_list = np.array([x.split('/')[-1].split(' ')[0] for x in file_list])
patient_id,total_number_list = np.unique(patient_id_list,return_counts=True)

## Check per sample image composition
print('Patient number: ', len(patient_id))
## Columns: 0 = sample id, 1 = image width (first element of PIL's size), 2 = counter
df = pd.DataFrame([patient_id_list,[x[0] for x in image_size_list], [1]*len(patient_id_list)]).T
df.groupby([0,1]).count()

Patient number:  2


| 0 | 1 | 2 |
|---|---|---|
| Sample1 | 1340 | 1 |
| Sample2 | 1340 | 1 |
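
The sample id is taken as the text before the first space in the file name. A minimal sketch of that convention, using only the standard library; the file names and the helper `sample_id_from_path` are hypothetical and only illustrate the parsing step:

```python
from collections import Counter
from pathlib import PurePosixPath

def sample_id_from_path(path):
    """Return the text before the first space in the file name."""
    return PurePosixPath(path).name.split(' ')[0]

# Hypothetical file names, for illustration only
paths = [
    '../data/tif/Sample1 [100,200].tif',
    '../data/tif/Sample1 [300,400].tif',
    '../data/tif/Sample2 [100,200].tif',
]
images_per_sample = Counter(sample_id_from_path(p) for p in paths)
print(dict(images_per_sample))
```

If the naming scheme differs (for example, an underscore instead of a space), only the helper needs to change.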


## Check the marker in the image

In [5]:
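## Placeholder channel names for the demo data
marker_list = ['chan1', 'chan2', 'chan3', 'chan4', 'chan5', 'chan6', 'chan7']
## For real Vectra data, read the marker names from the TIFF page descriptions instead:
# marker_list = [x.description.split('<Name>')[1].split(' ')[0].split('<')[0].split('+')[0] for x in tifffile.TiffFile(file_list[0]).pages[:-1]]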
print(marker_list)

['chan1', 'chan2', 'chan3', 'chan4', 'chan5', 'chan6', 'chan7']
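
The commented-out expression for real Vectra data chains several `split` calls on each page description. The sketch below spells out those steps one at a time. The description strings are made up for illustration (the real metadata layout may differ), and `marker_from_description` is a hypothetical helper name.

```python
def marker_from_description(description):
    """Extract the marker name that follows the <Name> tag."""
    name = description.split('<Name>')[1]  # text after the opening tag
    name = name.split(' ')[0]              # keep the first word only
    name = name.split('<')[0]              # drop the closing tag, if still attached
    return name.split('+')[0]              # drop a '+' and anything after it

# Made-up descriptions; real page descriptions may look different
descriptions = [
    '<Desc><Name>DAPI</Name></Desc>',
    '<Desc><Name>CD8+ (Opal 570)</Name></Desc>',
]
markers = [marker_from_description(d) for d in descriptions]
print(markers)
```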


## Inconsistent samples (optional)

When a study has multiple panels, the set of samples may differ between panels. Given a gold-standard list of sample ids, we detect the samples that are missing from the current panel and the extra samples that appear only in it.

In [6]:
current_sample = np.unique(df[0])

Get the gold-standard sample ids

In [7]:
## One option is to take the sample ids from another panel, as in the commented-out code
#first_panel_file_list = sorted(glob.glob('../data/tif/*.tif'))
standard_id = ['Sample1', 'Sample2']  #[x.split('/')[-2] for x in first_panel_file_list]

In [8]:
extra_patients = [x for x in current_sample if x not in standard_id]
print('Extra sample found in this panel: ',extra_patients)
missing_patients = [x for x in standard_id if x not in current_sample]
print('Missing sample from this panel: ',missing_patients)

Extra sample found in this panel:  []
Missing sample from this panel:  []
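
The two list comprehensions above amount to set differences. A small sketch of the same comparison with Python sets, using made-up sample ids so that both results are non-empty; `compare_samples` is a hypothetical helper name:

```python
def compare_samples(current, standard):
    """Return (extra, missing) sample ids, each as a sorted list."""
    current_set, standard_set = set(current), set(standard)
    extra = sorted(current_set - standard_set)    # in this panel only
    missing = sorted(standard_set - current_set)  # in the gold standard only
    return extra, missing

# Made-up ids for illustration
extra, missing = compare_samples(['Sample1', 'Sample3'], ['Sample1', 'Sample2'])
print('Extra sample found in this panel: ', extra)
print('Missing sample from this panel: ', missing)
```

Set lookups also stay fast when the panels contain many samples.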


## Move extra files out of the folder (optional)

In [9]:
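import shutil
import os

## NOTE: 'Lung_Panel18_TIF' must be a folder name that appears in folder_path;
## if it does not, the destination equals the source and the file is skipped
for file_path in file_list:
    ## Compare the parsed sample id, so that 'Sample1' does not also match 'Sample10'
    sample_id = file_path.split('/')[-1].split(' ')[0]
    if sample_id in extra_patients:
        destination = file_path.replace('Lung_Panel18_TIF','Lung_Panel18_TIF_extra')
        if destination == file_path:
            continue
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        shutil.move(file_path,destination)
        print(file_path, ' moved to ',destination)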
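
Before moving anything, it can help to list the planned moves without touching the disk. The sketch below is one way to do such a dry run. The `_extra` suffix on the parent folder, the file names, and the helper `plan_moves` are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

def plan_moves(paths, extra_ids, suffix='_extra'):
    """Return (source, destination) pairs for files whose sample id is in extra_ids."""
    moves = []
    for path in paths:
        p = PurePosixPath(path)
        # Sample id = text before the first space in the file name
        if p.name.split(' ')[0] in extra_ids:
            # Sibling folder: same parent, folder name plus suffix
            destination = p.parent.with_name(p.parent.name + suffix) / p.name
            moves.append((str(p), str(destination)))
    return moves

# Hypothetical file names, for illustration only
paths = ['data/tif/Sample1 [1,2].tif', 'data/tif/Sample10 [1,2].tif']
planned = plan_moves(paths, extra_ids=['Sample10'])
for source, destination in planned:
    print(source, ' would move to ', destination)
```

Once the printed plan looks right, each pair can be passed to `shutil.move`.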

## Repeated Image Removal 

Build a unique id from each file name in the folder and find the ids that occur more than once, which indicates a repeated scan

In [10]:
## Get the unique id from each file, the method depends on the naming system
unique_id = [(x.split('/')[-1].split(' ')[0] + '-'+x.split('/')[-1].split('[')[-1].split(']')[0]) for x in file_list]

In [11]:
## Counter tallies every id in one pass instead of calling list.count for each element
print('Repeated id: ',set(x for x, n in Counter(unique_id).items() if n > 1))

Repeated id:  set()
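
Knowing that an id is repeated is only half of the task; to decide which scan to keep, it helps to see the files behind each repeated id. A minimal sketch, assuming file names of the form `<sample> <scan> [<x>,<y>].tif` (made up here) and a hypothetical helper `group_repeated`:

```python
from collections import defaultdict

def group_repeated(paths, ids):
    """Map each id that occurs more than once to the files that share it."""
    groups = defaultdict(list)
    for path, uid in zip(paths, ids):
        groups[uid].append(path)
    return {uid: files for uid, files in groups.items() if len(files) > 1}

# Hypothetical file names: Sample1 was scanned twice at the same position
paths = [
    'Sample1 Scan1 [100,200].tif',
    'Sample1 Scan2 [100,200].tif',
    'Sample2 Scan1 [100,200].tif',
]
# Same id rule as above: sample id plus the text inside the square brackets
ids = [p.split('/')[-1].split(' ')[0] + '-' + p.split('/')[-1].split('[')[-1].split(']')[0]
       for p in paths]
repeated = group_repeated(paths, ids)
print(repeated)
```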


## Inspection after correction

In [12]:
file_list = sorted(glob.glob(folder_path))

In [13]:
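## Recompute the sample ids from the refreshed file list; otherwise the check reuses the ids collected before the correction
current_sample = np.unique([x.split('/')[-1].split(' ')[0] for x in file_list])
extra_patients = [x for x in current_sample if x not in standard_id]
print('Extra sample found in this panel: ',extra_patients)
missing_patients = [x for x in standard_id if x not in current_sample]
print('Missing sample from this panel: ',missing_patients)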

Extra sample found in this panel:  []
Missing sample from this panel:  []


In [14]:
print('File number: ',len(file_list))
# PIL reports size as (width, height)
image_size_list = [np.array(Image.open(x).size) for x in file_list]
# axis=0 treats each (width, height) pair as one item; without it np.unique
# flattens the array and counts widths and heights separately
img_size, count = np.unique(image_size_list, axis=0, return_counts=True)
for size, n in zip(img_size, count):
    print('There are ',str(n), 'images of size ',str(size),'\n')

## Take sample id from the file name (the text before the first space)
patient_id_list = np.array([x.split('/')[-1].split(' ')[0] for x in file_list])
patient_id,total_number_list = np.unique(patient_id_list,return_counts=True)

## Check per sample image composition
print('Patient number: ', len(patient_id))
## Columns: 0 = sample id, 1 = image width (first element of PIL's size), 2 = counter
df = pd.DataFrame([patient_id_list,[x[0] for x in image_size_list], [1]*len(patient_id_list)]).T
df.groupby([0,1]).count()

File number:  2
There are  2 images of size  [1340 1004] 

Patient number:  2


| 0 | 1 | 2 |
|---|---|---|
| Sample1 | 1340 | 1 |
| Sample2 | 1340 | 1 |
