# Humpback Whale Identification Challenge

## Capstone Project

Bo Liu

July 24th, 2018

## I. Definition

### Project Overview

After centuries of intense whaling, recovering whale populations still have a hard time adapting to warming oceans and struggle to compete every day with the industrial fishing industry for food.

To aid whale conservation efforts, scientists use photo surveillance systems to monitor ocean activity. They use the shape of whales’ tails and unique markings found in footage to identify what species of whale they’re analyzing and meticulously log whale pod dynamics and movements. For the past 40 years, most of this work has been done manually by individual scientists, leaving a huge trove of data untapped and underutilized.

The challenge is to build an algorithm to identify whale species in images. I will analyze Happy Whale’s database of over 25,000 images, gathered from research institutions and public contributors. I am excited to help open rich fields of understanding for marine mammal population dynamics around the globe.

Happy Whale is the organization that provides the data and the problem. It is a platform that uses image processing algorithms to let anyone submit their whale photos and have them automatically identified.
This challenge originally comes from kaggle.com; more information can be found [here](https://www.kaggle.com/c/whale-categorization-playground).

### Problem Statement

The training data contains thousands of images of humpback whale flukes. Individual whales
have been identified by researchers and given an Id. The problem we are going to solve is to
predict the whale Id of each image in the test set. What makes this such a challenge is that there are
only a few examples for each of the 3,000+ whale Ids. One potential solution is to train a
convolutional neural network (CNN) to identify those whale flukes.

Additionally, the problem can be quantified as follows: for an arbitrary image of a whale fluke $x$
(where $x$ takes the form of a series of points) and the corresponding whale Id $y$, find a model $f(x)$ that
maps $x$ to a candidate Id $\hat y$.

Performance on the problem can be measured by the precision at cutoff $k$.

The problem can be reproduced by checking the [challenge website]() and the public datasets of whale flukes.

### Metrics

We use accuracy as the metric to evaluate the performance of our model during training.

$$accuracy(y,\hat y) = \frac{1}{n_{samples}} \sum^{n_{samples}-1}_{i=0} 1(\hat y_i = y_i)$$

where $\hat y_i$ is the predicted label of the $i$-th sample, $y_i$ is its true label, $n_{samples}$ is the number of samples, and $1(x)$ is the indicator function.
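As a small illustration of the accuracy formula, the sketch below counts matching labels and divides by the number of samples. The function name `accuracy` and the toy whale Ids are made up for this example and are not taken from the project code.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    # The indicator function 1(y_hat_i = y_i) becomes a simple equality check.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels (hypothetical Ids): three of the four predictions are correct.
y_true = ["w_1", "w_2", "new_whale", "w_2"]
y_pred = ["w_1", "w_3", "new_whale", "w_2"]
print(accuracy(y_true, y_pred))  # 0.75
```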

Our final solution is measured by the Mean Average Precision @ 5 (MAP@5), as defined on the official website:
$$MAP@5 = \frac{1}{U} \sum^{U}_{u=1} \sum^{\min(n,5)}_{k=1} P(k)$$
where $U$ is the number of images, $P(k)$ is the precision at cutoff $k$, and $n$ is the number of predictions
per image.
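The following is a minimal sketch of how MAP@5 could be computed locally. It assumes that each image has exactly one correct Id, so that only the first correct prediction among the top five contributes, with a score of $1/k$ at rank $k$. The function name `map_at_5` and the example Ids are illustrative assumptions; the official score is the one computed by the challenge website.

```python
def map_at_5(true_ids, predictions):
    """Mean Average Precision @ 5, assuming one correct Id per image.

    true_ids:    list with the correct Id of each image.
    predictions: list of ranked Id lists (best guess first), one per image.
    """
    if len(true_ids) != len(predictions):
        raise ValueError("true_ids and predictions must have the same length")
    if not true_ids:
        raise ValueError("at least one image is required")
    total = 0.0
    for true_id, ranked in zip(true_ids, predictions):
        # Only the first five predictions are considered.
        for k, predicted_id in enumerate(ranked[:5], start=1):
            if predicted_id == true_id:
                total += 1.0 / k  # precision at the rank of the first hit
                break
    return total / len(true_ids)


true_ids = ["w_1", "w_2", "new_whale"]
predictions = [
    ["w_1", "new_whale", "w_9"],            # hit at rank 1 -> 1.0
    ["new_whale", "w_2", "w_1"],            # hit at rank 2 -> 0.5
    ["w_3", "w_4", "w_5", "w_6", "w_7"],    # no hit        -> 0.0
]
score = map_at_5(true_ids, predictions)
print(score)  # 0.5
```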

## II. Analysis

### Data Exploration

The training data contains thousands of images of humpback whale flukes. Individual whales have been identified by researchers and given an Id.

There are two folders containing the training images and the test images respectively, and one CSV file
that maps each training image to the right Id. A template for the submission file is also provided.

- train - a folder containing 9850 training images of humpback whale flukes. 
- test - a folder containing 15610 test images to predict the whale Id. 
- train.csv - maps the training Image to the appropriate whale Id. 
- sample_submission.csv - a sample submission file in the correct format

Whales that are not predicted to match a label identified in the training data should be labeled as new_whale.
Each image includes a humpback whale fluke, so correctly identifying those
images is directly related to solving the problem described above.

The dataset can be obtained from the Kaggle website [here](https://www.kaggle.com/c/whale-categorization-playground/data).

The training dataset will be used to train our model, while the test dataset will be used to test
our model and create a submission file. Our final results can be evaluated on MAP@5 by uploading our submission to the [official challenge website](https://www.kaggle.com/c/whale-categorization-playground/leaderboard).

Because the total size of the datasets is over 700MB, I will not upload them with the submission of this final project.

The training images have various sizes and color types, so I convert all images to grayscale and resize them to 64x64. Figure 1 shows some of the training images.

![image.png](https://github.com/freefrog1986/Humpback-Whale-Identification-Challenge/blob/master/sample_images.png?raw=true)

Figure 1: Samples of training images
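The grayscale conversion and resizing step can be sketched as follows. The image library actually used in the project is not stated, so this example relies on NumPy only; the luminance weights (0.299, 0.587, 0.114) and the nearest-neighbour resizing are assumptions chosen for illustration, and the helper names `to_gray` and `resize_nearest` are hypothetical.

```python
import numpy as np


def to_gray(image):
    """Convert an (H, W, 3) RGB array to an (H, W) grayscale array.

    Arrays that are already 2-D are returned unchanged (as float32).
    """
    image = np.asarray(image, dtype=np.float32)
    if image.ndim == 2:
        return image
    # Assumed luminance weights; other grayscale conventions exist.
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return image[..., :3] @ weights


def resize_nearest(image, size=(64, 64)):
    """Resize a 2-D array to `size` with nearest-neighbour sampling."""
    height, width = image.shape
    rows = np.arange(size[0]) * height // size[0]
    cols = np.arange(size[1]) * width // size[1]
    return image[rows[:, None], cols]


# A random array stands in for a real photo of size 300x500.
rgb = np.random.default_rng(0).integers(0, 256, size=(300, 500, 3))
small = resize_nearest(to_gray(rgb))
print(small.shape)  # (64, 64)
```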

### Exploratory Visualization

The competition states that it is hard because "there are only a few examples for each of 3,000+ whale ids". There are actually 4251 categories in the training set.

There are too many categories to plot the image count for each one, so instead we plot the number of categories against the number of images per category, as shown in Figure 2.

From the figure below we can see that the number of samples per category is very unbalanced.

![](https://github.com/freefrog1986/Humpback-Whale-Identification-Challenge/blob/master/distribution_samples.png?raw=true)

Figure 2: The number of categories by the number of images in the category
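The quantity plotted in Figure 2 can be computed with two counting passes: first the number of images per Id, then the number of Ids that share each image count. The sketch below uses a short hypothetical list of Ids in place of the Id column of train.csv.

```python
from collections import Counter

# Hypothetical Ids standing in for the Id column of train.csv.
ids = ["new_whale", "w_1", "w_1", "w_2", "new_whale", "w_3", "new_whale"]

# Pass 1: how many images does each Id have?
images_per_id = Counter(ids)

# Pass 2: how many Ids have exactly 1 image, 2 images, and so on?
ids_per_count = Counter(images_per_id.values())

for n_images, n_ids in sorted(ids_per_count.items()):
    print(f"{n_ids} categories have {n_images} image(s)")
```

On the real data, the sorted items of `ids_per_count` give the x and y values of the bar chart.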

### Algorithms and Techniques

An applicable solution to the problem is to train a CNN model to predict the Ids.
The solution can be quantified as follows: train a model $f(x)$ on the training data x_train, then apply
the trained model $f(x)$ to the test data x_test to predict the candidate Id $\hat y$.

First, we will apply a shallow CNN model with 8 layers as our benchmark model.

Then, a more complex model will be built to try to get better performance. VGG16 is a good choice: it is a convolutional neural network proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. The model achieves 92.7% top-5 test accuracy on ImageNet, a dataset of over 14 million images belonging to 1000 classes.

Finally, I try a 'combined CNN model', in which I train a shallow CNN model for each class and combine all of those models to predict on the test dataset.

### Benchmark

[This link](https://www.kaggle.com/gimunu/data-augmentation-with-keras-into-cnn) provides one possible solution using a traditional CNN in Keras, which I will use as the benchmark model.

The workflow of the benchmark model is:

importing the data -> One hot encoding on the labels -> Image augmentation -> Building and training model -> Predictions on test samples 
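To illustrate the one-hot encoding step of this workflow, here is a minimal NumPy sketch. The helper name `one_hot` and the example Ids are assumptions for illustration; the benchmark kernel may use a different encoder.

```python
import numpy as np


def one_hot(labels):
    """Return (encoded, classes): one row per label, one column per class."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    encoded = np.zeros((len(labels), len(classes)), dtype=np.float32)
    # Set a single 1 in each row, at the column of that row's class.
    encoded[np.arange(len(labels)), [index[label] for label in labels]] = 1.0
    return encoded, classes


labels = ["w_2", "new_whale", "w_1", "w_2"]
encoded, classes = one_hot(labels)
print(classes)        # ['new_whale', 'w_1', 'w_2']
print(encoded.shape)  # (4, 3)
```

The returned `classes` list is needed later to map a predicted column index back to a whale Id.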

It is a simple CNN model containing 8 layers as shown below.

Input -> conv2D@1 -> Conv2D@2 -> MaxPooling2D@3 -> Conv2D@4 -> MaxPooling2D@5
-> Dropout -> Flatten -> Dense@6 -> Dense@7 -> Output@8

Here ’Input’ represents the input layer, ’conv2D’ a convolutional layer, ’MaxPooling2D’
a max pooling layer, ’Dense’ a fully-connected layer, and ’Output’ the output
layer. ’@n’ indicates that this is the ’n’th layer. ’Dropout’ and ’Flatten’ represent the dropout and flatten
operations respectively, and they are not counted as layers. All parameters are listed below:

- batch size - 128
- epochs - 9
- conv2D@1 - number of kernels is 48, kernel size is (3, 3), strides is (1, 1), padding type is ’valid’, activation function is ’relu’.
- conv2D@2 - number of kernels is 48, kernel size is (3, 3), strides is (1, 1), padding type is ’valid’, activation function is ’sigmoid’.
- MaxPooling2D@3 - pool size is (3, 3), strides is (1, 1), padding type is ’valid’.
- conv2D@4 - number of kernels is 48, kernel size is (5, 5), strides is (1, 1), padding type is ’valid’, activation function is ’sigmoid’.
- MaxPooling2D@5 - pool size is (3, 3), strides is (1, 1), padding type is ’valid’.
- dropout - Fraction of the input units to drop is 0.33.
- flatten - no parameters.
- Dense@6 - number of units is 36, activation function is ’sigmoid’.
- dropout - Fraction of the input units to drop is 0.33.
- Dense@7 - number of units is 36, activation function is ’sigmoid’.
- output@8 - number of units is equal to the number of unique labels, activation function is ’softmax’.
- loss function - cross-entropy
- optimizer - Adadelta.
- metrics - ’accuracy’.
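As a worked check of the parameters listed above, the sketch below traces the spatial size of the feature maps through the benchmark network. It assumes a 64x64 single-channel input (the grayscale size stated in the Data Exploration section) and the standard output size rule for ’valid’ padding, `(size - window) // stride + 1`. The helper name `valid_output_size` is hypothetical.

```python
def valid_output_size(size, window, stride=1):
    """Output length of a convolution or pooling with 'valid' padding."""
    return (size - window) // stride + 1


# (layer name, kernel or pool size); all strides are 1 in the benchmark model.
layers = [
    ("conv2D@1", 3),
    ("conv2D@2", 3),
    ("MaxPooling2D@3", 3),
    ("conv2D@4", 5),
    ("MaxPooling2D@5", 3),
]

size = 64  # assumed input: 64x64 grayscale image
sizes = {}
for name, window in layers:
    size = valid_output_size(size, window)
    sizes[name] = size
    print(f"{name}: {size}x{size}x48")

# Dropout does not change the shape; Flatten turns the maps into one vector.
flattened_units = size * size * 48
print("Flatten:", flattened_units)  # 129792
```

Because every stride is 1, the feature maps shrink only slightly (64 to 52), so the first fully-connected layer receives a very large input vector.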

This model achieved an accuracy of xxx on the training dataset and xxx on the testing dataset. After uploading our final result to the challenge website, we got a score of 0.32660.

A script for this model, named ’benchmark.py’, is provided in the submission package.

## III. Methodology

### Data Preprocessing

Four preprocessing steps were adopted before building the model:
- Converting all images to grayscale
- Resizing all images to (64, 64)
- Image augmentation
- Sample balancing

Image augmentation is used with the following settings:
- Set the input mean to 0 over the dataset, feature-wise.
- Divide inputs by the standard deviation of the dataset, feature-wise.
- Apply ZCA whitening with epsilon set to 1e-06.
- Random rotations within 10 degrees.
- Random width shifts within 10 percent of the total width.
- Random height shifts within 10 percent of the total height.
- Random shear with a shear intensity of 0.1.
- Random zoom with a zoom range of 0.1.
- Points outside the boundaries of the input are filled with the nearest points.
- The data is multiplied by 1./255 to rescale the image.
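The settings above can be collected into a dictionary of keyword arguments. This sketch assumes the augmentation is done with Keras's `ImageDataGenerator`, as in the benchmark kernel, and uses that class's argument names; the dictionary name `augmentation_kwargs` is made up for this example, and the height shift value is an assumption taken to mirror the width shift.

```python
# Keyword arguments mirroring the augmentation settings listed above.
# Assumption: they are intended for keras.preprocessing.image.ImageDataGenerator.
augmentation_kwargs = {
    "featurewise_center": True,             # set input mean to 0 over the dataset
    "featurewise_std_normalization": True,  # divide inputs by the dataset std
    "zca_whitening": True,                  # apply ZCA whitening
    "zca_epsilon": 1e-06,                   # epsilon for ZCA whitening
    "rotation_range": 10,                   # random rotations, in degrees
    "width_shift_range": 0.1,               # fraction of total width
    "height_shift_range": 0.1,              # assumed: mirrors the width shift
    "shear_range": 0.1,                     # shear intensity
    "zoom_range": 0.1,                      # random zoom range
    "fill_mode": "nearest",                 # fill points outside the boundaries
    "rescale": 1.0 / 255,                   # rescale pixel values
}

for name, value in augmentation_kwargs.items():
    print(f"{name} = {value}")
```

With a Keras version that provides the class, the generator could then be created as `ImageDataGenerator(**augmentation_kwargs)`. The feature-wise statistics and ZCA whitening require calling the generator's `fit` method on the training images before use.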

According to the Exploratory Visualization section, the number of samples in each category is very unbalanced, so it is necessary to reduce sample bias by weighting the data. The weighting formula is:
$$w = \frac {1}{x^{\frac{3}{4}}}$$
where $x$ is the number of samples of a particular class and $w$ is the weight for that class.
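A minimal sketch of the weighting formula, using a hypothetical helper `class_weights` and toy Ids: a class with a single sample gets weight 1, while a class with 16 samples gets weight $1/16^{3/4} = 1/8$.

```python
from collections import Counter


def class_weights(labels, power=0.75):
    """Weight each class by 1 / x**power, where x is its sample count."""
    counts = Counter(labels)
    return {label: 1.0 / count ** power for label, count in counts.items()}


# Toy data: 16 images of 'new_whale' and 1 image of 'w_1'.
labels = ["new_whale"] * 16 + ["w_1"]
weights = class_weights(labels)
print(weights["w_1"])        # 1.0
print(weights["new_whale"])  # approximately 0.125
```

The exponent of 3/4 softens the correction: frequent classes are down-weighted, but less strongly than with a plain inverse frequency ($1/x$).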

### Implementation

The model implemented in the final project is VGG16. As introduced in the 'Algorithms and Techniques' section, it is a well-performing model proposed by K. Simonyan and A. Zisserman.

As described in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”, our model has 16 weight layers. The generic design of the model is listed below:

- We use filters with a very small receptive field: 3 × 3


parameters

### Refinement

- Epochs: 9 -> 50
- Batch size: 128 -> 64 -> 256

## IV. Results

### Model Evaluation and Validation

### Justification

## V. Conclusion

### Free-Form Visualization

### Reflection

### Improvement