# Prediction

This notebook demonstrates the prediction methods used in this project, along with the evaluation scheme.

In [14]:
import sys, os

import plotly
plotly.offline.init_notebook_mode(connected=True)

sys.path.append('..')
import planet.predict, planet.util

import numpy

data_dir = '../data'

First, we load all the label data.

In [15]:
all_tags = planet.util.read_tags(os.path.join(data_dir, 'train_v2.csv'))
tag_indices = planet.util.get_tag_indices(all_tags)
label_names = tag_indices.keys()
all_labels = planet.util.tags_to_labels(all_tags, tag_indices)
(num_all, num_labels) = all_labels.shape

Then we split the labeled data into a training set and a test set.

In [16]:
train_inds, test_inds = next(planet.util.split_data(num_all, 2))
num_train = len(train_inds)
num_test = len(test_inds)

train_labels = all_labels[train_inds, :]
test_labels = all_labels[test_inds, :]

## Random Classifier

In order to establish a baseline for performance, we first use a classifier that assigns labels at random by flipping an unbiased coin for each label.

In [4]:
pred_labels_rand = planet.predict.random(num_test, num_labels)

rand_fig = planet.predict.make_scores_plot(pred_labels_rand, test_labels, label_names, 'Random')
plotly.offline.iplot(rand_fig, filename=os.path.join(data_dir, 'rand_scores.html'))

First, note that the recall (`tp / (tp + fn)`) of this classifier is roughly `0.5`, because the number of true positives (`tp`) and the number of false negatives (`fn`) should each be about half the number of positive labels (`p/2`). Also note that there is a bit more fluctuation for rarer labels like *conventional_mine*. The average used here, and elsewhere in these analyses, is a micro-average: it pools the counts (such as `tp` and `fn`) across all samples and labels before computing the score.

Next, note that the precision (`tp / (tp + fp)`) of this classifier roughly follows the empirical distribution of the labels (see the `Data Exploration` notebook for comparison). That's because the number of false positives (`fp`) should be roughly half the number of negative occurrences (`n/2`), leading to a precision of `p/2 / (p/2 + n/2) = p / (p + n)`, which is the empirical probability of the label.
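The two observations above can be checked with a small simulation. The following is a minimal sketch that does not use the project's `planet.predict.random`; the function name, the synthetic label frequency of `0.1`, and the sample count are assumptions chosen for illustration.

```python
import random


def simulate_coin_flip_classifier(num_samples, label_prob, seed=0):
    """Return (recall, precision) of an unbiased coin-flip classifier
    on one synthetic label that is positive with probability label_prob."""
    rng = random.Random(seed)
    tp = fp = fn = 0
    for _ in range(num_samples):
        truth = rng.random() < label_prob  # synthetic ground truth
        pred = rng.random() < 0.5          # unbiased coin flip
        if truth and pred:
            tp += 1
        elif pred:
            fp += 1
        elif truth:
            fn += 1
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return recall, precision


# Hypothetical label that is positive in about 10% of samples.
recall, precision = simulate_coin_flip_classifier(100_000, label_prob=0.1)
print(round(recall, 2), round(precision, 2))
```

With this many samples, the recall should land close to `0.5` and the precision close to the label frequency of `0.1`.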

Finally, note that the F2 score of this classifier is a little closer to the recall than to the precision, which is expected because it is a weighted harmonic mean of recall and precision that weights recall more heavily than precision.
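For reference, the general F-beta score is `(1 + beta^2) * precision * recall / (beta^2 * precision + recall)`, and F2 is the case `beta = 2`. The sketch below is a standalone helper, not the scoring code used by `planet.predict`; the example precision and recall values are made up to resemble a rare label under the random classifier.

```python
def f_beta(precision, recall, beta=2.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta = 1 gives the usual F1 score.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid 0/0 when both scores are zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: low precision, recall near 0.5.
precision, recall = 0.1, 0.5
f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)
print(round(f1, 3), round(f2, 3))
```

Both scores fall between the precision and the recall, and F2 sits closer to the recall than F1 does.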

## Empirical Random Classifier

Instead of using a threshold of 0.5 for every label, we can use each label's empirical probability, estimated from the training set.

In [5]:
train_label_probs = numpy.mean(train_labels, axis=0, keepdims=True)
pred_labels_emp_rand = planet.predict.empirical_random(num_test, train_label_probs)

emp_rand_fig = planet.predict.make_scores_plot(pred_labels_emp_rand, test_labels, label_names, 'Empirical Random')
plotly.offline.iplot(emp_rand_fig, filename=os.path.join(data_dir, 'emp_rand_scores.html'))

Introducing these probabilities increases the average recall a little, increases the average precision a lot, and balances the two scores. Note that even though recall decreased for many labels, it increased overall because some labels occur much more frequently than others, so predicting those labels more often boosts the overall score.
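The effect on the pooled recall can be worked out in expectation. The sketch below assumes that each label is predicted independently with a fixed probability (which may differ in detail from `planet.predict.empirical_random`), and the three label frequencies are hypothetical rather than taken from the dataset.

```python
def expected_micro_recall(label_freqs, predict_probs):
    """Expected micro-averaged recall when label l occurs with frequency
    label_freqs[l] and is predicted independently with probability
    predict_probs[l]. Counts are expressed per sample."""
    expected_tp = sum(q * p for q, p in zip(label_freqs, predict_probs))
    total_positives = sum(label_freqs)
    return expected_tp / total_positives


# Hypothetical frequencies: one very common, one moderate, one rare label.
freqs = [0.9, 0.3, 0.01]

coin = expected_micro_recall(freqs, [0.5] * len(freqs))  # unbiased coin
empirical = expected_micro_recall(freqs, freqs)          # predict at label frequency
print(round(coin, 3), round(empirical, 3))
```

Under these assumptions the per-label recall of the rare label drops from `0.5` to `0.01`, yet the pooled recall rises because the common label dominates the counts. The expected number of predicted positives equals the expected number of actual positives, so the pooled precision matches the pooled recall, which is one way to see why the two scores become balanced.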

## Nearest Neighbors Classifier

The simplest supervised learning method is a nearest neighbors classifier. First we try resizing each image to 1x1, which effectively represents each image by its mean RGB triple.
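To make the 1x1 representation concrete, here is a minimal sketch that averages all pixels of a tiny image by hand. It assumes the resize averages over the whole image; `planet.util.resize_images` may use a different interpolation, in which case the result is only approximately the mean. The `mean_rgb` helper and the 2x2 image are illustrative.

```python
def mean_rgb(image):
    """Average an image, given as rows of (r, g, b) pixels,
    into a single RGB triple."""
    pixels = [pixel for row in image for pixel in row]
    count = len(pixels)
    return tuple(sum(pixel[c] for pixel in pixels) / count for c in range(3))


# A made-up 2x2 image: black, red, green, and yellow pixels.
tiny_image = [
    [(0, 0, 0), (255, 0, 0)],
    [(0, 255, 0), (255, 255, 0)],
]
triple = mean_rgb(tiny_image)
print(triple)
```

Each image is thereby reduced to a point in a 3-dimensional color space, and the nearest neighbors are found among those points.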

In [17]:
all_names = list(all_tags.keys())

image_size = (224, 224)
image_dir = os.path.join(data_dir, 'train-jpg')
train_names = [all_names[i] for i in train_inds]
test_names = [all_names[i] for i in test_inds]

test_images = planet.util.read_images(image_dir, test_names, out_size=image_size)
train_images = planet.util.read_images(image_dir, train_names, out_size=image_size)

100%|██████████| 20240/20240 [03:12<00:00, 105.04it/s]
100%|██████████| 20239/20239 [03:09<00:00, 106.69it/s]


In [23]:
train_images_sp = planet.util.resize_images(train_images, (1, 1))
test_images_sp = planet.util.resize_images(test_images, (1, 1))

ks = range(1, 27, 2)
knn_scores = planet.predict.score_k_nearest_neighbors(train_images_sp, train_labels, ks, 5)
        
knn_scores_fig = planet.util.make_bar_plot(ks, knn_scores, 'F2 Scores of kNN on 1x1 Images')
plotly.offline.iplot(knn_scores_fig, filename=os.path.join(data_dir, 'knns_f2_scores.html'))

100%|██████████| 20239/20239 [00:13<00:00, 1470.37it/s]
100%|██████████| 20240/20240 [00:14<00:00, 1428.63it/s]
100%|██████████| 65/65 [00:05<00:00, 10.77it/s]


In [25]:
best_k = ks[numpy.argmax(knn_scores)]
knn = planet.predict.KNearestNeighbors(train_images_sp, train_labels, k=best_k)
pred_labels_knn = knn.predict(test_images_sp)

knn_fig = planet.predict.make_scores_plot(pred_labels_knn, test_labels, label_names, 'F2 Scores of kNN (k={}) on 1x1 Images'.format(best_k))
plotly.offline.iplot(knn_fig, filename=os.path.join(data_dir, 'knn_scores.html'))


Precision is ill-defined and being set to 0.0 in labels with no predicted samples.


F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
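The two warnings above appear when some label is never predicted for any test sample: then `tp + fp` is zero and the precision is `0 / 0`. The sketch below illustrates the convention the warnings describe, which is to report `0.0` in that case; the helper name is hypothetical and is not part of the scoring code used here.

```python
def precision_or_zero(tp, fp):
    """Precision tp / (tp + fp), reported as 0.0 when the label
    was never predicted (tp + fp == 0)."""
    predicted_positives = tp + fp
    if predicted_positives == 0:
        return 0.0  # ill-defined case: no predicted samples for this label
    return tp / predicted_positives


never_predicted = precision_or_zero(tp=0, fp=0)
sometimes_predicted = precision_or_zero(tp=3, fp=1)
print(never_predicted, sometimes_predicted)
```

A label that triggers this warning therefore shows a precision and F2 score of zero in the plot, even though the classifier made no wrong positive predictions for it.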



## Convolutional Neural Network

Next, we try a convolutional neural network. More specifically, we use the VGG19 architecture with weights pre-trained on the ImageNet classification task to transform each image into 512 response fields of size 7x7, then train a traditional 3-layer neural network that takes those fields as inputs. Here, we just load the previously trained weights for the extra layers and predict the test labels.
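The 7x7 size follows from the architecture: the VGG19 convolutional base contains five 2x2 max-pooling stages, each halving the spatial resolution, and its last block has 512 channels. The sketch below only does this arithmetic; the helper name is illustrative, and whether `VGG19ConvNeuralNetwork` flattens the fields before its extra layers is an assumption.

```python
def vgg_feature_shape(input_size=224, num_pool_stages=5, channels=512):
    """Spatial shape of the VGG-style feature map after the pooling stages,
    assuming each stage halves the width and height."""
    side = input_size // (2 ** num_pool_stages)
    return (side, side, channels)


shape = vgg_feature_shape()
# Number of inputs to the extra layers if the fields are flattened (assumption).
flattened_size = shape[0] * shape[1] * shape[2]
print(shape, flattened_size)
```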

In [26]:
force_train = False
cnn_file_path = os.path.join(data_dir, 'vgg19_cnn.h5')
if force_train or not os.path.exists(cnn_file_path):
    cnn = planet.predict.VGG19ConvNeuralNetwork.from_data(train_images, train_labels, 32)
    cnn.write(cnn_file_path)
else:
    cnn = planet.predict.VGG19ConvNeuralNetwork.from_file(cnn_file_path)
    
pred_labels_cnn = cnn.predict(test_images)

cnn_fig = planet.predict.make_scores_plot(pred_labels_cnn, test_labels, label_names, 'F2 Scores of CNN (H={}) on 224x224 Images'.format(32))
plotly.offline.iplot(cnn_fig, filename=os.path.join(data_dir, 'cnn_scores.html'))




Precision is ill-defined and being set to 0.0 in labels with no predicted samples.


F-score is ill-defined and being set to 0.0 in labels with no predicted samples.

