# Invasive Species Monitoring

This is my solution to the Kaggle Invasive Species Monitoring competition. The task is to predict whether each image contains invasive hydrangea.

https://www.kaggle.com/c/invasive-species-monitoring

In [1]:
import keras
import util
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

Using TensorFlow backend.


# Load Datasets
Since we will be using a generator, we don't need to load any image files into memory up front; all we need are the file paths :)
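To make the idea concrete, here is a minimal sketch of a path-based batch generator. It is not the generator used inside `util`, which isn't shown in this notebook; `batch_generator` and `fake_load_image` are hypothetical names, and the loader is a placeholder for whatever actually reads and resizes the image.

```python
def batch_generator(file_paths, labels, batch_size, load_image):
    """Yield (images, labels) batches forever, reading files only when needed."""
    n = len(file_paths)
    while True:  # Keras-style generators loop indefinitely
        for start in range(0, n, batch_size):
            batch_paths = file_paths[start:start + batch_size]
            # Disk access happens here, one batch at a time
            images = [load_image(p) for p in batch_paths]
            yield images, labels[start:start + batch_size]


# Hypothetical loader that only records which paths were opened
opened = []

def fake_load_image(path):
    opened.append(path)
    return "pixels of " + path


paths = ["../input/train/{}.jpg".format(i) for i in range(1, 6)]
labels = [0, 0, 1, 0, 1]

gen = batch_generator(paths, labels, batch_size=2, load_image=fake_load_image)
first_images, first_labels = next(gen)
print(opened)  # only the two files of the first batch have been touched
```

Only the list of paths lives in memory; each image is read when its batch is requested and can be discarded afterwards.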

## Training Set

In [2]:
train_files, train_set = util.load_train()

train_set.head()

Unnamed: 0,name,invasive
0,../input/train/1.jpg,0
1,../input/train/2.jpg,0
2,../input/train/3.jpg,1
3,../input/train/4.jpg,0
4,../input/train/5.jpg,1


## Test Set

In [3]:
test_files, test_set = util.load_test()
    
test_set.head()

Unnamed: 0,name,invasive
0,1,0.5
1,2,0.5
2,3,0.5
3,4,0.5
4,5,0.5


# Define CNN Model Architecture

In [5]:
img_height = 800
img_width = 800
img_channels = 3
img_dim = (img_height, img_width, img_channels)

model = util.inceptionv3(img_dim=img_dim)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 800, 800, 3)       0         
_________________________________________________________________
batch_renormalization_1 (Bat (None, 800, 800, 3)       12        
_________________________________________________________________
inception_v3 (Model)         (None, 23, 23, 2048)      21802784  
_________________________________________________________________
global_average_pooling2d_1 ( (None, 2048)              0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 2048)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 2049      
Total params: 21,804,845
Trainable params: 21,770,407
Non-trainable params: 34,438
___________________________________________________________
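The numbers in the summary can be checked by hand. The sketch below assumes `util.inceptionv3` wraps the stock Keras InceptionV3 without its classification head (the helper itself isn't shown), and tracks only the unpadded layers, since those are the ones that shrink the feature map. The function names are made up for illustration.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded ('valid') convolution or pooling layer."""
    return (size - kernel) // stride + 1


def inception_v3_feature_size(size):
    # (kernel, stride) of each size-reducing layer in the stock Keras InceptionV3.
    # 'same'-padded stride-1 layers and 1x1 convolutions keep the size and are omitted.
    valid_layers = [
        (3, 2),  # stem conv
        (3, 1),  # stem conv
        (3, 2),  # stem max-pool
        (3, 1),  # stem conv
        (3, 2),  # stem max-pool
        (3, 2),  # first grid reduction
        (3, 2),  # second grid reduction
    ]
    for kernel, stride in valid_layers:
        size = conv_out(size, kernel, stride)
    return size


print(inception_v3_feature_size(800))  # spatial size for our 800x800 input
print(inception_v3_feature_size(299))  # spatial size for the default 299x299 input

# Parameter totals, taken from the summary above
dense_params = 2048 * 1 + 1  # one weight per pooled feature plus a bias
total_params = 12 + 21802784 + dense_params
print(total_params, 21770407 + 34438)  # both should match "Total params"
```

For an 800x800 input this gives a 23x23 feature map, which matches the `(None, 23, 23, 2048)` row in the summary.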

# Train Model
Here we train the model with 5-fold cross-validation. The submission file contains the average of the test predictions from all folds. The prediction array for each fold is also saved, in case we want to hand-pick results from an individual fold.

In [None]:
batch_size = 5
epochs = 50
n_fold = 5
img_size = (img_height, img_width)
kf = KFold(n_splits=n_fold, shuffle=True)

# Ground-truth labels for the training images
train_label = np.array(train_set['invasive'])

test_pred = util.train_model(model, batch_size, epochs, img_size, train_set, train_label, test_files, n_fold, kf)

test_set['invasive'] = test_pred
test_set.to_csv('./submissions/submission.csv', index=False)
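The internals of `util.train_model` aren't shown here. As a rough sketch of the fold-averaging scheme described above, assuming one array of test predictions per fold, it might look like this. `kfold_indices` and `toy_fit_predict` are hypothetical stand-ins (for `KFold(...).split` and for the actual training and prediction step), not functions from `util`.

```python
import numpy as np


def kfold_indices(n_samples, n_fold, seed=0):
    """Hypothetical stand-in for KFold(n_splits=n_fold, shuffle=True).split()."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_fold)
    for i in range(n_fold):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, folds[i]


def average_over_folds(fit_predict, n_samples, n_test, n_fold=5):
    """Train once per fold and average the per-fold test predictions."""
    fold_preds = np.zeros((n_fold, n_test))
    for i, (train_idx, val_idx) in enumerate(kfold_indices(n_samples, n_fold)):
        # fit_predict is assumed to train on train_idx, validate on val_idx,
        # and return one prediction per test image
        fold_preds[i] = fit_predict(train_idx, val_idx)
    return fold_preds, fold_preds.mean(axis=0)


# Toy stand-in for a model: "predicts" the fraction of samples held out
def toy_fit_predict(train_idx, val_idx):
    return np.full(3, len(val_idx) / (len(train_idx) + len(val_idx)))


fold_preds, test_pred = average_over_folds(toy_fit_predict, n_samples=10, n_test=3)
print(fold_preds.shape, test_pred)
```

Keeping `fold_preds` around is what makes it possible to pick a single fold's predictions later instead of the average.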

## Pseudo Labeling
Once you have a model that scores well, you can use its predictions on the test set as labels to expand your training set. This is called pseudo-labeling, a semi-supervised learning technique.

In [None]:
# Use submission from best scoring model to create semi-supervised dataset
path = './submissions/submission.csv'
    
pred_set = util.get_pred_set(path, test_files)

# Only use subset of semi-supervised dataset
pred_set = pred_set[:1000]

# Combine training set with semi-supervised training set
pseudo_set = pd.concat([train_set, pred_set], axis=0)

pseudo_label = np.array(pseudo_set['invasive'])

print('We combined {} training images with {} test predicted images and now have a combined training set of {} images'.format(
    len(train_set), len(pred_set), len(pseudo_set)))
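The cell above simply takes the first 1000 predictions. A common variant, which is not what this notebook does, keeps only the predictions the model is confident about and turns them into hard 0/1 labels. A minimal sketch, with made-up thresholds and a hypothetical `select_confident` helper:

```python
import numpy as np


def select_confident(pred, low=0.1, high=0.9):
    """Return indices and hard 0/1 labels for predictions outside (low, high)."""
    pred = np.asarray(pred, dtype=float)
    keep = np.flatnonzero((pred <= low) | (pred >= high))
    return keep, (pred[keep] >= high).astype(int)


# Toy data: five labelled training images and five test predictions
toy_train_label = np.array([0, 0, 1, 0, 1])
toy_test_pred = np.array([0.97, 0.55, 0.02, 0.91, 0.40])

keep, toy_pseudo_label = select_confident(toy_test_pred)
combined_label = np.concatenate([toy_train_label, toy_pseudo_label])
print(keep, toy_pseudo_label, len(combined_label))
```

The thresholds trade off how many extra images you gain against how many wrong labels you let into the training set.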

## Train Model w/ Pseudo Labeling

In [None]:
test_pred = util.train_model(model, batch_size, epochs, img_size, pseudo_set, pseudo_label, test_files, n_fold, kf)

test_set['invasive'] = test_pred
test_set.to_csv('./submissions/submission.csv', index=False)