# Human Protein Atlas Image Classification


Image classification of microscope slides based on mixed protein patterns.

### Project Description
The objective of this project is to determine which organelles a protein localizes to in the microscope images. This breaks down into two parts: first, identifying where the protein appears (in general) in the image, and second, labelling each organelle in which the protein is present.

In particular, we aim to build a model that makes reliable predictions even when the images contain a mixture of cell types with different morphologies.

### Task 
The problem is a multi-label image classification task. Each image, which may contain a mixture of different cell types, receives a prediction for each of the 28 labels (indices 0 to 27).

The training data provided by the Kaggle competition includes a *train.csv* file that lists each image ID together with its identified protein labels.
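
Because one image can carry several labels at once, each label can be scored independently instead of competing in a single softmax. The sketch below illustrates that idea with per-label sigmoid scores and a 0.5 threshold. Both the function name `predict_labels` and the threshold are illustrative assumptions, not settings taken from this project.

```python
import numpy as np

NUM_LABELS = 28  # labels 0..27

def predict_labels(logits, threshold=0.5):
    """Return the label indices whose sigmoid score exceeds `threshold`.

    `logits` is a 1-D array of raw per-label scores (hypothetical model output).
    """
    logits = np.asarray(logits, dtype=np.float64)
    scores = 1.0 / (1.0 + np.exp(-logits))  # independent sigmoid per label
    return [int(i) for i in np.flatnonzero(scores > threshold)]

# Made-up scores: strong evidence for labels 0 and 25, weak evidence elsewhere
example_logits = np.full(NUM_LABELS, -3.0)
example_logits[[0, 25]] = 2.0
print(predict_labels(example_logits))  # [0, 25]
```

Any number of labels, including none, can pass the threshold for a given image, which is what distinguishes this from single-label classification.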


### Requirements

- keras
- tensorflow
- numpy
- pydot
- pandas
- OpenCV (opencv-python)
- matplotlib

In [0]:
# libraries

from keras.layers import Dense, Activation
from keras.layers import Input, Flatten
from keras.models import Sequential

import cv2              # openCV image processing
import numpy as np      # linear algebra
import pandas as pd     # data processing
import pydot            # graphing/visualization
import matplotlib

import os
import gc
import random
import csv

Using TensorFlow backend.


### Labels

In [0]:
NUCLEOPLASM                   = 0
NUCLEAR_MEMBRANE              = 1
NUCLEOLI                      = 2
NUCLEOLI_FIBRILLAR_CENTER     = 3
NUCLEAR_SPECKLES              = 4
NUCLEAR_BODIES                = 5
ENDOPLASMIC_RETICULUM         = 6
GOLGI_APPARATUS               = 7
PEROXISOMES                   = 8
ENDOSOMES                     = 9
LYSOSOMES                     = 10
INTERMEDIATE_FILAMENTS        = 11
ACTIN_FILAMENTS               = 12
FOCAL_ADHESION_SITES          = 13
MICROTUBULES                  = 14
MICROTUBULE_ENDS              = 15
CYTOKINETIC_BRIDGE            = 16
MITOTIC_SPINDLE               = 17
MICROTUBULE_ORGANIZING_CENTER = 18
CENTROSOME                    = 19
LIPID_DROPLETS                = 20
PLASMA_MEMBRANE               = 21
CELL_JUNCTIONS                = 22
MITOCHONDRIA                  = 23
AGGRESOME                     = 24
CYTOSOL                       = 25
CYTOPLASMIC_BODIES            = 26
RODS_AND_RINGS                = 27
LEN_LABELS = 28

### Some Constants

In [0]:
LABELS_DIR = 'dataset'
TRAIN_DIR = 'dataset/train'
TEST_DIR = 'dataset/test'
TRAIN_SUBSET_SIZE = 10 # Number of training images to sample at random (the full set has 31072)

### Image Loading

The first step is to take our dataset and prepare it so that it's ready for our model.

Images are split into four filters/layers:
- **green**: the protein of interest
- **blue**: the nucleus
- **red**: the microtubules
- **yellow**: the endoplasmic reticulum

For this project we will mostly be interested in the **green** filter, which will be used to predict the label, while the other filters will be used as references.
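
One possible way to keep the reference filters alongside the green one is to stack all four into a single multi-channel array. This is only a sketch of that option: the helper `stack_channels` is hypothetical, and the arrays are synthetic stand-ins for images that would normally come from `cv2.imread`.

```python
import numpy as np

def stack_channels(green, blue, red, yellow):
    """Stack four single-channel images into one (H, W, 4) array.

    Green goes first because it carries the protein of interest.
    """
    channels = [np.asarray(c) for c in (green, blue, red, yellow)]
    if len({c.shape for c in channels}) != 1:
        raise ValueError("all four channels must have the same shape")
    return np.stack(channels, axis=-1)

# Synthetic 4x6 grayscale stand-ins for the four filter images
height, width = 4, 6
green = np.full((height, width), 10, dtype=np.uint8)
blue = np.full((height, width), 20, dtype=np.uint8)
red = np.full((height, width), 30, dtype=np.uint8)
yellow = np.full((height, width), 40, dtype=np.uint8)

sample = stack_channels(green, blue, red, yellow)
print(sample.shape)  # (4, 6, 4)
```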

In [0]:
# Get images by layer

# training set images
train_green  = [(TRAIN_DIR+'/{}').format(i) for i in os.listdir(TRAIN_DIR) if 'green'  in i]

# Select TRAIN_SUBSET_SIZE images from the training set at random
random.shuffle(train_green)
train_green = train_green[:TRAIN_SUBSET_SIZE]
train_green_ids = [i[:-10].replace((TRAIN_DIR+'/'),'') 
                       for i in train_green] # remove '_green.png' and '_green.tif'

# Retrieve the other three layers for our subset (NB: unsorted)
train_blue =   [(TRAIN_DIR+'/{}').format(i) for i in os.listdir(TRAIN_DIR)
                  if ((i[:-9] in train_green_ids) and ('blue' in i))]
train_red =    [(TRAIN_DIR+'/{}').format(i) for i in os.listdir(TRAIN_DIR)
                  if ((i[:-8] in train_green_ids) and ('red' in i))]
train_yellow = [(TRAIN_DIR+'/{}').format(i) for i in os.listdir(TRAIN_DIR)
                  if ((i[:-11] in train_green_ids) and ('yellow' in i))]

# force garbage collection to make sure memory isn't wasted
gc.collect()

# test set images
test_green  = [(TEST_DIR+'/{}').format(i) for i in os.listdir(TEST_DIR) if 'green'  in i]
test_blue   = [(TEST_DIR+'/{}').format(i) for i in os.listdir(TEST_DIR) if 'blue'   in i]
test_red    = [(TEST_DIR+'/{}').format(i) for i in os.listdir(TEST_DIR) if 'red'    in i]
test_yellow = [(TEST_DIR+'/{}').format(i) for i in os.listdir(TEST_DIR) if 'yellow' in i]

FileNotFoundError: ignored

In [0]:
# Testing the above by seeing if the IDs are the same

train_green.sort()
train_blue.sort()
train_red.sort()
train_yellow.sort()
print(train_green, end='\n\n')
print(train_blue, end='\n\n')
print(train_red, end='\n\n')
print(train_yellow, end='\n\n')
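
Comparing the printed lists by eye gets harder as the subset grows. A minimal sketch of an automated check follows. It assumes file names follow the `<id>_<colour>.<ext>` pattern, and the helpers `image_id` and `same_ids` are hypothetical names introduced here.

```python
import os

COLOURS = ('green', 'blue', 'red', 'yellow')

def image_id(path):
    """Strip the directory, the extension and the trailing '_<colour>' from a path."""
    stem = os.path.splitext(os.path.basename(path))[0]
    id_part, _, colour = stem.rpartition('_')
    if colour not in COLOURS:
        raise ValueError('unexpected file name: ' + path)
    return id_part

def same_ids(*path_lists):
    """True when every list of paths covers exactly the same set of image IDs."""
    id_sets = [{image_id(p) for p in paths} for paths in path_lists]
    return all(s == id_sets[0] for s in id_sets)

# Hypothetical file names, not real dataset entries
greens = ['dataset/train/abc_green.png', 'dataset/train/def_green.png']
blues = ['dataset/train/def_blue.png', 'dataset/train/abc_blue.png']
reds = ['dataset/train/abc_red.png']

print(same_ids(greens, blues))  # True
print(same_ids(greens, reds))   # False
```

Because the comparison uses sets, it does not depend on the lists being sorted first.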

### Label Loading

Loading the labels from *train.csv*

In [0]:
training_labels = {} # dictionary with key = photo ID and entry = list of labels

with open('dataset/train.csv', newline='') as label_file:
  csvreader = csv.reader(label_file, delimiter=',')
  for row in csvreader:
    # skip blank rows and the header row, whose first column is "Id"
    if row and row[0] != "Id":
      training_labels[row[0]] = row[1].split(' ') # labels are separated by spaces
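
The same parsing can be expressed with `csv.DictReader`, which reads the header row and exposes columns by name. The sketch below assumes the columns are called `Id` and `Target`. The `Target` name and the two sample rows are assumptions made for illustration, and `load_labels` is a hypothetical helper.

```python
import csv
import io

def load_labels(csv_file):
    """Map each image ID to its list of integer labels.

    Assumes a header row with the columns 'Id' and 'Target', where
    'Target' holds space-separated label indices.
    """
    reader = csv.DictReader(csv_file)
    return {row['Id']: [int(tag) for tag in row['Target'].split()]
            for row in reader}

# Two made-up rows standing in for the real file
sample_csv = io.StringIO("Id,Target\nimg_a,16 0\nimg_b,7 1 2 0\n")
labels_by_id = load_labels(sample_csv)
print(labels_by_id)  # {'img_a': [16, 0], 'img_b': [7, 1, 2, 0]}
```

Converting the tags to integers at load time means later steps can use them directly as array indices.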

### Image Preprocessing

In [0]:
# Based on Keras and OpenCV docs

def read_and_process(image_paths):
  '''
  returns a list of images and a list of label tuples
  '''

  images = [] # will be used by keras as x
  labels = [] # will be used by keras as y

  for img in image_paths:
    # build image list
    images.append(cv2.imread(img, cv2.IMREAD_GRAYSCALE)) # load images using cv2.imread()

    # build labels/target/tag list
    # Check labels at each step and build labels list accordingly
    # --> order is important
    # TODO: label building is not implemented yet, so labels stays empty

  return images, labels

In [0]:
read_and_process(train_green)
read_and_process(test_green)
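
The note that label order is important means the i-th label entry must belong to the i-th image. One way this alignment could be done is sketched below. It assumes paths end in `_green.png` and that labels are held in a dictionary keyed by image ID. The helper `labels_in_image_order` and the sample data are hypothetical.

```python
import os

def labels_in_image_order(image_paths, labels_by_id, suffix='_green.png'):
    """Return one label list per path, in the same order as `image_paths`."""
    ordered = []
    for path in image_paths:
        name = os.path.basename(path)
        if not name.endswith(suffix):
            raise ValueError('unexpected file name: ' + path)
        image_id = name[:-len(suffix)]  # drop the '_green.png' suffix
        ordered.append(labels_by_id[image_id])
    return ordered

# Hypothetical paths and labels, not real dataset entries
paths = ['dataset/train/b_green.png', 'dataset/train/a_green.png']
labels_by_id = {'a': ['0', '25'], 'b': ['7']}
print(labels_in_image_order(paths, labels_by_id))  # [['7'], ['0', '25']]
```

Looking labels up by ID for each path keeps the two lists aligned even after the paths have been shuffled.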

### One Hot Encoding

In [0]:
def one_hot_encode(train_tags):
  ''' Creates a one hot encoded dictionary of all labels for each pic'''
  encoded = dict()
  for image_id, tags in train_tags.items():
    # create empty vector with one slot per label
    encoding = np.zeros(LEN_LABELS, dtype='uint8')
    # mark 1 for each tag in the vector
    for tag in tags:
      encoding[int(tag)] = 1
    encoded[image_id] = encoding
  return encoded
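
A dictionary of encoded vectors usually has to become a single array before training. The sketch below shows one way to stack the vectors into a target matrix in a chosen ID order, and how to decode a row back into label indices. The helpers `to_target_matrix` and `decode` and the sample data are hypothetical.

```python
import numpy as np

NUM_LABELS = 28  # labels 0..27

def to_target_matrix(encoded, ordered_ids):
    """Stack per-image vectors into an (n_images, NUM_LABELS) array."""
    return np.stack([encoded[image_id] for image_id in ordered_ids])

def decode(vector):
    """Return the label indices that are set to 1 in an encoded vector."""
    return [int(i) for i in np.flatnonzero(vector)]

# Build a small encoded dictionary from made-up labels
encoded = {}
for image_id, tags in {'a': [0, 25], 'b': [7]}.items():
    vector = np.zeros(NUM_LABELS, dtype='uint8')
    vector[tags] = 1
    encoded[image_id] = vector

y = to_target_matrix(encoded, ['b', 'a'])
print(y.shape)       # (2, 28)
print(decode(y[1]))  # [0, 25]
```

Passing the ID order explicitly keeps the rows of `y` aligned with whatever order the images were loaded in.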