# Clinical Heart Failure Detection Using Whole-Slide Images of H&E Tissue

**Version**
- **0.02**: Prepare Train/Validate/Test Labels and Images 
- **0.01**: Prepare Train/Validate/Test Images

**Improvement Opportunity**
- Convert code sections in data preparation for train/validation/test to functions

## Download Dataset

### Download Train, Validate and Test Images
- Source Link to the Dataset / Annotation File: https://idr.openmicroscopy.org/webclient/?show=project-402
- Follow the instructions at the following link and install the IBM Aspera Desktop Client to download the dataset.
- Copy the downloaded folders to the '**data/images**' folder in the working directory that contains this Jupyter Notebook:
  - 'held-out_validation'
  - 'training'

### Download Label Information for Train, Validate and Test Images 
- The dataset page links to the GitHub repository that holds the annotation file: https://idr.openmicroscopy.org/webclient/?show=project-402
- Source link for the annotation file: https://github.com/IDR/idr0042-nirschl-wsideeplearning/tree/master/experimentA
- Download the file '**idr0042-experimentA-annotation.csv**' and copy it to the '**data/labels/**' folder in the working directory that contains this Jupyter Notebook.
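
Before moving on, it can help to confirm that the downloads landed where the notebook expects them. The following is a minimal sketch, assuming the folder and file names listed in the steps above; `missing_paths` and `EXPECTED` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
from pathlib import Path

# Paths the notebook expects, relative to the working directory
# (taken from the download steps above).
EXPECTED = [
    "data/images/training",
    "data/images/held-out_validation",
    "data/labels/idr0042-experimentA-annotation.csv",
]

def missing_paths(root=".", expected=EXPECTED):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]

missing = missing_paths()
if missing:
    print("Missing:", missing)
else:
    print("All expected paths found.")
```

An empty result means the folder layout matches; anything listed still needs to be downloaded or moved.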

## References 
#### Data Preparation
- Reading an image
  - matplotlib: https://stackoverflow.com/questions/9298665/cannot-import-scipy-misc-imread
  - pathlib: https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f#:~:text=To%20use%20it%2C%20you%20just,for%20the%20current%20operating%20system.
  - OpenCV: https://www.geeksforgeeks.org/python-opencv-cv2-imread-method/
- Load multiple images into a numpy array
  - glob / os.listdir: https://stackoverflow.com/questions/39195113/how-to-load-multiple-images-in-a-numpy-array
  - glob / cv2: https://medium.com/@muskulpesent/create-numpy-array-of-images-fecb4e514c4b
- Load a CSV file
  - Datacamp: https://www.datacamp.com/community/tutorials/pandas-read-csv?utm_source=adwords_ppc&utm_campaignid=1455363063&utm_adgroupid=65083631748&utm_device=c&utm_keyword=&utm_matchtype=b&utm_network=g&utm_adpostion=&utm_creative=278443377095&utm_targetid=dsa-429603003980&utm_loc_interest_ms=&utm_loc_physical_ms=9061994&gclid=EAIaIQobChMIz5TKz-v17QIV1AorCh0bfw96EAAYASAAEgKiGPD_BwE
- Split a String
  - Python Central: https://www.pythoncentral.io/cutting-and-slicing-strings-in-python/


## Data Preparation

### Understand Images Folder Structure and Number of Images Available

**Training/Validation**
- \..\training\fold_1: 770 images for training
- \..\training\test_fold_1: 374 images for validation
- Total = 770 + 374 = 1144 images

**Test**
- \..\held-out_validation: 1155 images for testing
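
The counts above can be reproduced rather than read off a file browser. This is a minimal sketch, assuming the images sit under 'data/images' as described in the download steps; `count_pngs` is a hypothetical helper name.

```python
import glob
import os

def count_pngs(folder):
    """Count the .png files directly inside folder (non-recursive)."""
    return len(glob.glob(os.path.join(folder, '*.png')))

# Folder names follow the structure listed above.
for folder in ['data/images/training/fold_1',
               'data/images/training/test_fold_1',
               'data/images/held-out_validation']:
    print(folder, count_pngs(folder))
```

A folder that does not exist simply reports 0, so a zero count is a hint that the path or the download needs checking.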

### Understand Annotation File and Label Information Available

Relevant columns of interest:
- Column A: Dataset Name: Classifies each row/instance as 'training' or 'test'
- Column B: Image Name: Specifies filename of the image for the row/instance
- Column Z: Experimental Condition [Diagnosis]: has 3 classes:
  - 'chronic heart failure'
  - 'heart tissue pathology' - We will treat this as 'not chronic heart failure'
  - 'not chronic heart failure'
  
Breakdown of training/test instances:
- **training**
  - 'chronic heart failure' = 517
  - 'not chronic heart failure' = 627
- **test**
  - 'chronic heart failure' = 517
  - 'not chronic heart failure' = 638

Total '**training**' = 517 + 627 = 1144
- Note: 'validate' is a portion of this 'training' set.

Total '**test**' = 517 + 638 = 1155
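
The per-split class counts can also be computed from the annotation file instead of being tallied by hand. The sketch below assumes the column names listed above and folds 'heart tissue pathology' into 'not chronic heart failure'; `class_counts` is a hypothetical helper, and the toy frame only stands in for the real CSV.

```python
import pandas as pd

DATASET_COL = 'Dataset Name'
DIAGNOSIS_COL = 'Experimental Condition [Diagnosis]'

def class_counts(labels):
    """Cross-tabulate dataset split against diagnosis after merging
    'heart tissue pathology' into 'not chronic heart failure'."""
    merged = labels[DIAGNOSIS_COL].replace(
        {'heart tissue pathology': 'not chronic heart failure'}
    )
    return pd.crosstab(labels[DATASET_COL], merged)

# Toy rows for illustration only; with the real data, pass the
# DataFrame returned by pd.read_csv on the annotation file.
toy = pd.DataFrame({
    DATASET_COL: ['training', 'training', 'training', 'test'],
    DIAGNOSIS_COL: ['chronic heart failure', 'heart tissue pathology',
                    'not chronic heart failure', 'chronic heart failure'],
})
print(class_counts(toy))
```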

### Load libraries to aid in converting images to arrays

We will convert the images to arrays so that we can feed them to our CNN model.

In [1]:
# install OpenCV package
# pip install opencv-python

In [2]:
import cv2

In [3]:
import glob

In [4]:
import numpy as np

### Load libraries to aid reading CSV file to dataframe

We will import the annotation file into a pandas DataFrame so that we can access the label information.

In [5]:
import pandas as pd

In [6]:
labels = pd.read_csv('data/labels/idr0042-experimentA-annotation.csv')

In [7]:
labels

Unnamed: 0,Dataset Name,Image Name,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Organism Part],Term Source 2 REF,Term Source 2 Accession,Characteristics [Diagnosis],Term Source 3 REF,...,Characteristics [Ethnic or Racial Group],Term Source 6 REF,Term Source 6 Accession,Characteristics [Age],Characteristics [Individual],Characteristics [Clinical History],Protocol REF,Protocol REF.1,Experimental Condition [Diagnosis],Channels
0,training,33381_0_fal_10_0.png,Homo sapiens,NCBITaxon,NCBITaxon_9606,heart,UBERON,UBERON_0000948,chronic heart failure,SNOMED,...,African American,SNOMED,SNOMED_S-62310,65 years,33381,ischemic cardiomyopathy,treatment protocol,image acquisition,chronic heart failure,RGB
1,training,33381_0_fal_14_0.png,Homo sapiens,NCBITaxon,NCBITaxon_9606,heart,UBERON,UBERON_0000948,chronic heart failure,SNOMED,...,African American,SNOMED,SNOMED_S-62310,65 years,33381,ischemic cardiomyopathy,treatment protocol,image acquisition,chronic heart failure,RGB
2,training,33381_0_fal_16_0.png,Homo sapiens,NCBITaxon,NCBITaxon_9606,heart,UBERON,UBERON_0000948,chronic heart failure,SNOMED,...,African American,SNOMED,SNOMED_S-62310,65 years,33381,ischemic cardiomyopathy,treatment protocol,image acquisition,chronic heart failure,RGB
3,training,33381_0_fal_18_0.png,Homo sapiens,NCBITaxon,NCBITaxon_9606,heart,UBERON,UBERON_0000948,chronic heart failure,SNOMED,...,African American,SNOMED,SNOMED_S-62310,65 years,33381,ischemic cardiomyopathy,treatment protocol,image acquisition,chronic heart failure,RGB
4,training,33381_0_fal_25_0.png,Homo sapiens,NCBITaxon,NCBITaxon_9606,heart,UBERON,UBERON_0000948,chronic heart failure,SNOMED,...,African American,SNOMED,SNOMED_S-62310,65 years,33381,ischemic cardiomyopathy,treatment protocol,image acquisition,chronic heart failure,RGB
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2294,test,36175_1_nrm_18_0.png,Homo sapiens,NCBITaxon,NCBITaxon_9606,heart,UBERON,UBERON_0000948,not chronic heart failure,SNOMED,...,Caucasian,SNOMED,SNOMED_S-0003D,53 years,36175,normal cardiovascular function by cardiac cath...,treatment protocol,image acquisition,not chronic heart failure,RGB
2295,test,36175_1_nrm_1_0.png,Homo sapiens,NCBITaxon,NCBITaxon_9606,heart,UBERON,UBERON_0000948,not chronic heart failure,SNOMED,...,Caucasian,SNOMED,SNOMED_S-0003D,53 years,36175,normal cardiovascular function by cardiac cath...,treatment protocol,image acquisition,not chronic heart failure,RGB
2296,test,36175_1_nrm_20_0.png,Homo sapiens,NCBITaxon,NCBITaxon_9606,heart,UBERON,UBERON_0000948,not chronic heart failure,SNOMED,...,Caucasian,SNOMED,SNOMED_S-0003D,53 years,36175,normal cardiovascular function by cardiac cath...,treatment protocol,image acquisition,not chronic heart failure,RGB
2297,test,36175_1_nrm_21_0.png,Homo sapiens,NCBITaxon,NCBITaxon_9606,heart,UBERON,UBERON_0000948,not chronic heart failure,SNOMED,...,Caucasian,SNOMED,SNOMED_S-0003D,53 years,36175,normal cardiovascular function by cardiac cath...,treatment protocol,image acquisition,not chronic heart failure,RGB


In [8]:
print(labels['Dataset Name'])

0       training
1       training
2       training
3       training
4       training
          ...   
2294        test
2295        test
2296        test
2297        test
2298        test
Name: Dataset Name, Length: 2299, dtype: object


In [9]:
print(labels['Dataset Name'][0])

training


In [10]:
type(labels['Dataset Name'][0])

str

In [11]:
print(labels['Image Name'])

0       33381_0_fal_10_0.png
1       33381_0_fal_14_0.png
2       33381_0_fal_16_0.png
3       33381_0_fal_18_0.png
4       33381_0_fal_25_0.png
                ...         
2294    36175_1_nrm_18_0.png
2295     36175_1_nrm_1_0.png
2296    36175_1_nrm_20_0.png
2297    36175_1_nrm_21_0.png
2298     36175_1_nrm_2_0.png
Name: Image Name, Length: 2299, dtype: object


In [12]:
print(labels['Image Name'][0])

33381_0_fal_10_0.png


In [13]:
type(labels['Image Name'][0])

str

In [14]:
print(labels['Experimental Condition [Diagnosis]'])

0           chronic heart failure
1           chronic heart failure
2           chronic heart failure
3           chronic heart failure
4           chronic heart failure
                  ...            
2294    not chronic heart failure
2295    not chronic heart failure
2296    not chronic heart failure
2297    not chronic heart failure
2298    not chronic heart failure
Name: Experimental Condition [Diagnosis], Length: 2299, dtype: object


In [15]:
print(labels['Experimental Condition [Diagnosis]'][0])

chronic heart failure


In [16]:
type(labels['Experimental Condition [Diagnosis]'][0])

str

In [17]:
# confirm 'no info' cells have been encoded as 'nan'... check one entry
print(labels['Characteristics [Disease Subtype]'][463])

nan
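
Checking a single entry confirms the encoding, but a per-column count shows how widespread the missing values are. A minimal sketch, assuming missing cells are read as NaN as shown above; `missing_per_column` is a hypothetical helper and the toy frame uses placeholder values.

```python
import pandas as pd

def missing_per_column(frame):
    """Return the number of NaN cells in each column."""
    return frame.isna().sum()

# Toy frame with placeholder values; one subtype cell is missing.
toy = pd.DataFrame({
    'Image Name': ['a.png', 'b.png'],
    'Characteristics [Disease Subtype]': ['placeholder subtype', None],
})
print(missing_per_column(toy))
```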


### Prepare Train Images and Train Labels

In [18]:
# read all the filenames with extension as 'png' into the filelist
filelist_train = glob.glob('data/images/training/fold_1/*.png')

In [19]:
# confirm the list contains the expected number of files
len(filelist_train)

770

We need to extract label info (for an image) from the labels dataframe by using the filename of the image. Let us do a proof of concept for one image. We can then apply the logic to all images. 

In [20]:
# check what an element in the filelist contains
# it has both the directory information and the filename
# we need to split the string to get the filename
# the filename can then be used to look up the label info in the labels dataframe
filelist_train[0]

'data/images/training/fold_1\\33381_0_fal_10_0.png'

In [21]:
# split the path into directory and filename using the platform's separator rules
import os
directory, filename = os.path.split(filelist_train[0])

In [22]:
# gives the directory info
directory

'data/images/training/fold_1'

In [23]:
# gives the filename we need
filename

'33381_0_fal_10_0.png'
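
The separator that `glob` places before the filename depends on the operating system: a backslash in the output above, a forward slash on Linux and macOS. The sketch below shows how the standard library extracts the filename under both sets of rules; the POSIX-style path is an assumed example of what the same entry would look like on Linux or macOS.

```python
import ntpath
import posixpath

windows_style = 'data/images/training/fold_1\\33381_0_fal_10_0.png'
posix_style = 'data/images/training/fold_1/33381_0_fal_10_0.png'

# ntpath applies Windows rules and posixpath applies Linux/macOS rules;
# os.path is an alias for whichever one matches the running platform.
print(ntpath.basename(windows_style))
print(posixpath.basename(posix_style))
```

Both calls return the bare filename, so code that uses `os.path.basename` does not need to know which separator `glob` produced.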

In [24]:
# use the filename to find the label info using the label dataframe
index = 0
for img_name in labels['Image Name']:
    if (img_name == filename):
        lbl_name = labels['Experimental Condition [Diagnosis]'][index]
    index+=1

In [25]:
# check the label info matches with what we need - Yes, it does. 
lbl_name

'chronic heart failure'
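
Scanning the whole 'Image Name' column for every image gets slow as the number of images grows. One alternative is to build a filename-to-label dictionary once and look each image up directly. This is a sketch of that idea, not the notebook's method; `build_label_lookup` is a hypothetical helper and the toy frame reuses two rows shown in the output above.

```python
import pandas as pd

IMAGE_COL = 'Image Name'
LABEL_COL = 'Experimental Condition [Diagnosis]'

def build_label_lookup(labels):
    """Return a dict mapping each image filename to its diagnosis string."""
    return dict(zip(labels[IMAGE_COL], labels[LABEL_COL]))

# Toy frame standing in for the real annotation file.
toy = pd.DataFrame({
    IMAGE_COL: ['33381_0_fal_10_0.png', '36175_1_nrm_18_0.png'],
    LABEL_COL: ['chronic heart failure', 'not chronic heart failure'],
})
lookup = build_label_lookup(toy)
print(lookup['33381_0_fal_10_0.png'])
```

A filename that is absent from the annotation file raises a `KeyError`, which surfaces mismatches immediately instead of leaving a stale label in place.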

In [26]:
# read 1st file in the list
img = cv2.imread(filelist_train[0])

Now that the proof of concept works for one instance, let us apply the logic to all instances.

In [27]:
# read each image file and append it to the list of images
# strip the directory from the path to get the bare filename
# use the filename to find the label in the labels dataframe and append it to the list of labels
import os
train_images = []
train_labels = []
for file in filelist_train:
    img = cv2.imread(file)
    train_images.append(img)
    # 'Image Name' holds bare filenames, so compare against the filename, not the full path
    filename = os.path.basename(file)
    lbl_name = None  # reset so a missing match cannot reuse the previous image's label
    index = 0
    for img_name in labels['Image Name']:
        if (img_name == filename):
            lbl_name = labels['Experimental Condition [Diagnosis]'][index]
        index+=1
    if (lbl_name == 'chronic heart failure'):
        lbl_name = 1
    elif (lbl_name == 'not chronic heart failure'):
        lbl_name = 0
    elif (lbl_name == 'heart tissue pathology'):
        lbl_name = 0
    train_labels.append(lbl_name)

Convert images to numpy arrays and confirm shape is as required for CNN. 

In [28]:
# confirm the list contains the expected number of images
len(train_images)

770

In [29]:
# train_images is a list
type(train_images)

list

In [30]:
# convert list to a numpy array and the values to float
train_images = np.array(train_images, dtype = 'float32')

In [31]:
# check the shape to confirm it is ready for CNN
# number of instances, height, width, number of channels
# number of instances = number of images
# number of channels = 3, as these are color images (cv2.imread returns them in BGR order)
train_images.shape

(770, 250, 250, 3)

Convert labels to numpy arrays and confirm shape is as required for CNN. 

In [32]:
len(train_labels)

770

In [33]:
type(train_labels)

list

In [34]:
train_labels[0]

1

In [35]:
# convert list to a numpy array and the values to int64
train_labels = np.array(train_labels, dtype = 'int64')

In [36]:
# check the shape to confirm it is ready for CNN
train_labels.shape

(770,)
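
The shape confirms how many labels there are but not what they contain; an array in which every label is 1 would have the same shape. A quick class tally is a cheap sanity check. This is a minimal sketch with a hypothetical helper name, `label_summary`, demonstrated on a toy array.

```python
import numpy as np

def label_summary(label_array):
    """Return a dict mapping each class value to its number of occurrences."""
    values, counts = np.unique(label_array, return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts)}

# Toy labels for illustration; in the notebook, pass train_labels instead.
print(label_summary(np.array([1, 0, 1, 1], dtype='int64')))
```

If only one class shows up for a split that should contain both, the label lookup is worth re-checking before training.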

### Prepare Validation Images and Validation Labels

In [37]:
# read all the filenames with extension as 'png' into the filelist
filelist_validation = glob.glob('data/images/training/test_fold_1/*.png')

In [38]:
# read each image file and append it to the list of images
# strip the directory from the path to get the bare filename
# use the filename to find the label in the labels dataframe and append it to the list of labels
import os
validation_images = []
validation_labels = []
for file in filelist_validation:
    img = cv2.imread(file)
    validation_images.append(img)
    # 'Image Name' holds bare filenames, so compare against the filename, not the full path
    filename = os.path.basename(file)
    lbl_name = None  # reset so a missing match cannot reuse the previous image's label
    index = 0
    for img_name in labels['Image Name']:
        if (img_name == filename):
            lbl_name = labels['Experimental Condition [Diagnosis]'][index]
        index+=1
    if (lbl_name == 'chronic heart failure'):
        lbl_name = 1
    elif (lbl_name == 'not chronic heart failure'):
        lbl_name = 0
    elif (lbl_name == 'heart tissue pathology'):
        lbl_name = 0
    validation_labels.append(lbl_name)

In [39]:
# convert list to a numpy array and the values to float
validation_images = np.array(validation_images, dtype = 'float32')

In [40]:
# check the shape to confirm it is ready for CNN
validation_images.shape

(374, 250, 250, 3)

In [41]:
# convert list to a numpy array and the values to int64
validation_labels = np.array(validation_labels, dtype = 'int64')

In [42]:
# check the shape to confirm it is ready for CNN
validation_labels.shape

(374,)

### Prepare Test Images and Test Labels

In [43]:
# read all the filenames with extension as 'png' into the filelist
filelist_test = glob.glob('data/images/held-out_validation/*.png')

In [44]:
# read each image file and append it to the list of images
# strip the directory from the path to get the bare filename
# use the filename to find the label in the labels dataframe and append it to the list of labels
import os
test_images = []
test_labels = []
for file in filelist_test:
    img = cv2.imread(file)
    test_images.append(img)
    # 'Image Name' holds bare filenames, so compare against the filename, not the full path
    filename = os.path.basename(file)
    lbl_name = None  # reset so a missing match cannot reuse the previous image's label
    index = 0
    for img_name in labels['Image Name']:
        if (img_name == filename):
            lbl_name = labels['Experimental Condition [Diagnosis]'][index]
        index+=1
    if (lbl_name == 'chronic heart failure'):
        lbl_name = 1
    elif (lbl_name == 'not chronic heart failure'):
        lbl_name = 0
    elif (lbl_name == 'heart tissue pathology'):
        lbl_name = 0
    test_labels.append(lbl_name)

In [45]:
# convert list to a numpy array and the values to float
test_images = np.array(test_images, dtype = 'float32')

In [46]:
# check the shape to confirm it is ready for CNN
test_images.shape

(1155, 250, 250, 3)

In [47]:
# convert list to a numpy array and the values to int64
test_labels = np.array(test_labels, dtype = 'int64')

In [48]:
# check the shape to confirm it is ready for CNN
test_labels.shape

(1155,)
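
The train, validation and test cells repeat the same steps, which is the improvement opportunity noted at the top of this notebook. One possible refactoring is sketched below under a few assumptions: `load_split` and `LABEL_TO_CLASS` are hypothetical names, the image reader is passed in as an argument (in this notebook it would be `cv2.imread`), and files are sorted for a reproducible order, which the original cells do not do.

```python
import glob
import os

import numpy as np

# Diagnosis string -> binary class, as used throughout this notebook.
LABEL_TO_CLASS = {
    'chronic heart failure': 1,
    'not chronic heart failure': 0,
    'heart tissue pathology': 0,
}

def load_split(pattern, label_lookup, read_image):
    """Load the images matching a glob pattern together with their binary labels.

    pattern      -- glob pattern, e.g. 'data/images/training/fold_1/*.png'
    label_lookup -- dict mapping image filename -> diagnosis string
    read_image   -- callable taking a path and returning an image array
    """
    images, classes = [], []
    for path in sorted(glob.glob(pattern)):
        img = read_image(path)
        if img is None:  # cv2.imread returns None when a file cannot be read
            raise IOError('could not read ' + path)
        # Look up by bare filename, since the annotation file has no directories.
        diagnosis = label_lookup[os.path.basename(path)]
        images.append(img)
        classes.append(LABEL_TO_CLASS[diagnosis])
    return (np.array(images, dtype='float32'),
            np.array(classes, dtype='int64'))

# Intended use in this notebook (assumes `labels` and `cv2` from the cells above):
# lookup = dict(zip(labels['Image Name'], labels['Experimental Condition [Diagnosis]']))
# train_images, train_labels = load_split('data/images/training/fold_1/*.png', lookup, cv2.imread)
```

Passing the reader as an argument keeps the function independent of OpenCV, so it can be exercised with a stand-in reader when the image files are not at hand.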