# Planet: Understanding the Amazon deforestation from Space challenge

Special thanks to the kernel contributors of this challenge (especially @anokas and @Kaggoo) who helped me find a starting point for this notebook.

The whole code, including the `data_helper.py` and `keras_helper.py` files, is available on GitHub [here](https://github.com/EKami/planet-amazon-deforestation), and the notebook can be found in the same repository [here](https://github.com/EKami/planet-amazon-deforestation/blob/master/notebooks/amazon_forest_notebook.ipynb).

**If you found this notebook useful, some upvotes would be greatly appreciated! :)**

Start by selecting the GPU to use and adding the helper files to the Python path.

In [None]:
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"

In [None]:
import sys

sys.path.append('../src')
sys.path.append('../tests')

## Import required modules

In [None]:
import gc
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

import data_helper
import keras_helper_ResNet50
import keras_helper_VGG19
import keras_helper_DenseNet121
from keras_helper import AmazonKerasClassifier
from keras_helper_ResNet50 import AmazonKerasClassifier_ResNet50
from keras_helper_VGG19 import AmazonKerasClassifier_VGG19
from keras_helper_DenseNet121 import AmazonKerasClassifier_DenseNet121
from keras.callbacks import EarlyStopping, ModelCheckpoint
from kaggle_data.downloader import KaggleDataDownloader
from keras import backend

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Print the TensorFlow version for reproducibility (the Keras module is used directly from the TensorFlow framework).

In [None]:
tf.__version__

In [None]:
nb_decal = 4

## Inspect image labels
Visualize what the training set looks like

In [None]:
#train_jpeg_dir, test_jpeg_dir, test_jpeg_additional, train_csv_file = data_helper.get_jpeg_data_files_paths()
train_jpeg_dir = "/shared_datasets/kaggle/Amazon/data/train-jpg"
test_jpeg_dir = "/shared_datasets/kaggle/Amazon/data/test-jpg"
test_jpeg_additional = "/shared_datasets/kaggle/Amazon/data/test-jpg-additional"
train_csv_file = "/shared_datasets/kaggle/Amazon/data/train_v2.csv"

## Define hyperparameters
Define the hyperparameters of our neural network.

In [None]:
img_resize = (224, 224) # The resize size of each image
validation_split_size = 0.2
batch_size = 48

## Data preprocessing
Preprocess the data in order to fit it into the Keras model.

Due to the huge amount of memory the resulting matrices take, the preprocessing is split into several steps:
- Preprocess the training data (images and labels) and train the neural net with it
- Delete the training data and call the gc to free up memory
- Preprocess the first testing set
- Predict the first testing set labels
- Delete the first testing set
- Preprocess the second testing set
- Predict the second testing set labels and append them to the first testing set
- Delete the second testing set
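
The staged approach above can be sketched in plain Python. This is only a minimal illustration, not the notebook's actual pipeline: `predict_in_stages`, the toy loaders, and the toy predictor are hypothetical stand-ins for `data_helper.preprocess_test_data` and the classifier.

```python
import gc


def predict_in_stages(loaders, predict):
    """Run `predict` on one dataset at a time, releasing each before loading the next."""
    all_predictions = []
    for load in loaders:
        data = load()                      # only one dataset is held in memory at a time
        all_predictions.extend(predict(data))
        del data                           # drop the last reference to the large object
        gc.collect()                       # ask the garbage collector to reclaim it now
    return all_predictions


# Hypothetical stand-ins for the two test sets and for the model.
toy_loaders = [lambda: [1, 2, 3], lambda: [4, 5]]
staged_predictions = predict_in_stages(toy_loaders, lambda batch: [x * 10 for x in batch])
print(staged_predictions)
```

The predictions of the second set are appended to those of the first, which mirrors the last steps of the list.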

In [None]:
x_train, y_train, y_map = data_helper.preprocess_train_data(train_jpeg_dir, train_csv_file, img_resize)
# Free up all available memory space after this heavy operation
gc.collect();

In [None]:
#X_train, X_valid, Y_train, Y_valid = train_test_split(x_train, y_train, test_size=validation_split_size, random_state=42)

#del x_train, y_train
#gc.collect()
"""
X_train, X_valid, Y_train, Y_valid = data_helper.decal(X_train, X_valid, Y_train, Y_valid, 5, nb_decal)
gc.collect()
"""

## Create a checkpoint

Creating a checkpoint saves the best model weights across all epochs of the training process. This ensures that we always use the best weights when predicting on the test set, rather than the default behaviour of keeping the weights from the last epoch. Note that the active checkpoint below monitors the training accuracy (`acc`) rather than the validation accuracy (`val_acc`).

In [None]:
filepath="weights.best_ResNet50_no_val.hdf5"
proba_to_save="../proba_file_ResNet50_no_val.npy"
file_to_save="../submission_file_ResNet50_no_val.csv"

In [None]:
#checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True)
checkpoint = ModelCheckpoint(filepath, monitor='acc', verbose=1, save_best_only=True)
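
The idea behind `save_best_only=True` can be illustrated without Keras. The sketch below is a simplified stand-in, not Keras's implementation: the function name and the accuracy values are made up, and a real checkpoint would write the weights to `filepath` where the comment indicates.

```python
def track_best_epochs(metric_per_epoch, mode="max"):
    """Return (best_epoch, best_value, epochs_at_which_a_save_would_happen)."""
    best_value = None
    best_epoch = None
    saved_epochs = []
    for epoch, value in enumerate(metric_per_epoch):
        improved = (
            best_value is None
            or (mode == "max" and value > best_value)
            or (mode == "min" and value < best_value)
        )
        if improved:
            best_value, best_epoch = value, epoch
            saved_epochs.append(epoch)  # a real checkpoint would save the weights here
    return best_epoch, best_value, saved_epochs


toy_accuracies = [0.80, 0.85, 0.83, 0.90, 0.88]  # made-up values for illustration
ckpt_epoch, ckpt_value, ckpt_saves = track_best_epochs(toy_accuracies)
print(ckpt_epoch, ckpt_value, ckpt_saves)
```

Epochs that do not improve on the best value seen so far leave the saved file untouched, so the file always holds the best weights.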

## Define and Train model

Here we define the model and begin training. 

Before starting the training process, you should first set a learning rate annealing schedule by choosing a series of learning rates (`learn_rates`) with a corresponding number of epochs for each (`epochs_arr`).

Alternatively, if you just want to run one training session at a fixed learning rate and number of epochs, you can input a single entry for each of these.

In [None]:
classifier = AmazonKerasClassifier_ResNet50(img_resize, len(y_map))

train_losses, val_losses = [], []
epochs_arr = [10, 10, 5]
learn_rates = [0.001, 0.0001, 0.00001]
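
To see how the two lists pair up, the sketch below expands them into one learning rate per epoch. `expand_schedule` is a hypothetical helper for illustration only; the training loop itself simply iterates over `zip(learn_rates, epochs_arr)`.

```python
def expand_schedule(learn_rates, epochs_arr):
    """Return a flat list with the learning rate used at each epoch."""
    if len(learn_rates) != len(epochs_arr):
        raise ValueError("learn_rates and epochs_arr must have the same length")
    schedule = []
    for learn_rate, epochs in zip(learn_rates, epochs_arr):
        schedule.extend([learn_rate] * epochs)  # repeat the rate for its number of epochs
    return schedule


toy_schedule = expand_schedule([0.001, 0.0001, 0.00001], [10, 10, 5])
print(len(toy_schedule), toy_schedule[0], toy_schedule[10], toy_schedule[-1])
```

With the values used in this notebook, the schedule spans 25 epochs in total, each stage running at a learning rate ten times smaller than the previous one.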

In [None]:
i=0
for learn_rate, epochs in zip(learn_rates, epochs_arr):   
    if i==0:
        # First stage: freeze the base model layers so they are not updated
        for layer in classifier.base_model.layers:
            layer.trainable = False

    if i > 0:
        #X_train, X_valid, Y_train, Y_valid = data_helper.decal(X_train, X_valid, Y_train, Y_valid, 5, 1)
        #gc.collect()
        # Later stages: reload the best checkpoint, then unfreeze the base model
        classifier.load_weights(filepath)
        for layer in classifier.base_model.layers:
            layer.trainable = True
        
    tmp_train_losses = classifier.train_model(x_train, y_train, learn_rate, epochs, 
                                                                           batch_size, validation_split_size=validation_split_size, 
                                                                           train_callbacks=[checkpoint])
 
    """
    tmp_train_losses, tmp_val_losses, fbeta_score = classifier.train_model(X_train, X_valid, Y_train, Y_valid, learn_rate, epochs, 
                                                                           batch_size, validation_split_size=validation_split_size, 
                                                                           train_callbacks=[checkpoint])
    """
    train_losses += tmp_train_losses
    del tmp_train_losses
    gc.collect()
    i+=1

## Load Best Weights

Here you should load back in the best weights that were automatically saved by ModelCheckpoint during training

In [None]:
classifier = AmazonKerasClassifier_ResNet50(img_resize, 17)
classifier.load_weights(filepath)
print("Weights loaded")

## Monitor the results

Check that we do not overfit by plotting the losses of the train and validation sets. In this no-validation run `val_losses` stays empty, so only the training curve is drawn.

In [None]:
plt.plot(train_losses, label='Training loss')
plt.plot(val_losses, label='Validation loss')
plt.legend();
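
When both curves are available, the same check can be summarised numerically. The sketch below is a rough heuristic under stated assumptions: `summarize_losses` is a hypothetical helper and the loss values are made up.

```python
def summarize_losses(train_losses, val_losses):
    """Return the epoch with the lowest validation loss and the final train/validation gap."""
    if not val_losses:
        raise ValueError("val_losses is empty; nothing to compare against")
    best_epoch = min(range(len(val_losses)), key=lambda epoch: val_losses[epoch])
    final_gap = val_losses[-1] - train_losses[-1]
    return best_epoch, final_gap


toy_train = [0.90, 0.60, 0.40, 0.30, 0.20]  # made-up training losses
toy_val = [1.00, 0.70, 0.50, 0.55, 0.65]    # made-up validation losses
loss_best_epoch, loss_final_gap = summarize_losses(toy_train, toy_val)
print(loss_best_epoch, round(loss_final_gap, 2))
```

A validation loss that rises after its minimum while the training loss keeps falling is the usual sign of overfitting.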

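Look at our `fbeta_score`. This is only a score when `train_model` returns one, as in the commented-out call above; otherwise the name still refers to the function imported from sklearn.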

In [None]:
fbeta_score
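
For reference, the F-beta score of a single image can be computed by hand. The sketch below assumes beta = 2, which is what the `optimise_f2_thresholds` helper name suggests; `fbeta_single` and the tag sets are hypothetical and are not part of `keras_helper`.

```python
def fbeta_single(true_tags, predicted_tags, beta=2.0):
    """F-beta score for one image, given its true and predicted tag sets."""
    true_tags, predicted_tags = set(true_tags), set(predicted_tags)
    true_positives = len(true_tags & predicted_tags)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_tags)
    recall = true_positives / len(true_tags)
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


toy_score = fbeta_single(
    {"primary", "clear", "road"},            # made-up ground truth
    {"primary", "clear", "water", "haze"},   # made-up prediction
)
print(toy_score)
```

Because beta = 2 weights recall more heavily than precision, missing a true tag costs more than predicting an extra one.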

Before launching our predictions, let's preprocess the test data and delete the old training data matrices.

Now let's launch the predictions on the additional dataset (updated on 05/05/2017 on Kaggle).

In [None]:
"""
classifier = AmazonKerasClassifier_ResNet50(img_resize, 17)
for i in range(3, 5):
    filepath = "weights.best_" + str(i) + ".hdf5"
    x_train, y_train, y_map = data_helper.preprocess_train_data(train_jpeg_dir, train_csv_file, img_resize)
    # Free up all available memory space after this heavy operation
    gc.collect();
    
    X_train, X_valid, Y_train, Y_valid = train_test_split(x_train, y_train, test_size=validation_split_size, random_state=42)
    del x_train, y_train
    gc.collect()
    
    _, X_valid, _, Y_valid = data_helper.decal(X_train, X_valid, Y_train, Y_valid, 5, i)
    del X_train, Y_train
    gc.collect()
    
    classifier.load_weights(filepath)
    prediction_val = classifier.predict(X_valid)
    thresholds = data_helper.optimise_f2_thresholds(Y_valid, prediction_val)
    np.save("threshold" + str(i) + ".npy", thresholds)
    print(str(i) + " done")
    del X_valid, Y_valid
"""

In [None]:
#del X_train, Y_train
#gc.collect()

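# NOTE: X_valid and Y_valid come from the train_test_split cell above, which is
# commented out in this no-validation run; re-enable it before running this cell.
prediction_val = classifier.predict(X_valid)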
thresholds = data_helper.optimise_f2_thresholds(Y_valid, prediction_val)

del X_valid, Y_valid, prediction_val
gc.collect()

#del X_train, Y_train, X_valid, Y_valid
#gc.collect()
#thresholds = [0.2] * 17#len(labels_set)

In [None]:
x_test, x_test_filename = data_helper.preprocess_test_data(test_jpeg_dir, img_resize)
predictions = classifier.predict_TTA(x_test)

del x_test
gc.collect()

x_test, x_test_filename_additional = data_helper.preprocess_test_data(test_jpeg_additional, img_resize)
new_predictions = classifier.predict_TTA(x_test)

del x_test
gc.collect()

Before mapping our predictions to their appropriate labels we need to figure out what threshold to take for each class.

To do so we use the per-class thresholds computed earlier with `data_helper.optimise_f2_thresholds` on the validation predictions.
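
One common way to pick a threshold per class is a greedy search that maximises the mean F2 score on held-out predictions. The sketch below is an assumption about how a helper such as `optimise_f2_thresholds` might work, not its actual code: `mean_f2`, `binarize`, `search_thresholds`, and the toy data are all hypothetical.

```python
def mean_f2(y_true, y_pred):
    """Mean per-sample F2 score for rows of 0/1 labels."""
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        tp = sum(t and p for t, p in zip(true_row, pred_row))
        if tp:
            precision = tp / sum(pred_row)
            recall = tp / sum(true_row)
            total += 5 * precision * recall / (4 * precision + recall)
    return total / len(y_true)


def binarize(probs, thresholds):
    """Turn probabilities into 0/1 labels using one threshold per class."""
    return [[int(p > t) for p, t in zip(row, thresholds)] for row in probs]


def search_thresholds(y_true, probs, candidates, start=0.2):
    """Tune one class at a time, keeping the other thresholds fixed."""
    thresholds = [start] * len(probs[0])
    for c in range(len(thresholds)):
        best_score, best_t = -1.0, start
        for t in candidates:
            trial = thresholds[:c] + [t] + thresholds[c + 1:]
            score = mean_f2(y_true, binarize(probs, trial))
            if score > best_score:
                best_score, best_t = score, t
        thresholds[c] = best_t
    return thresholds


toy_truth = [[1, 0], [1, 1], [0, 1]]                  # made-up labels
toy_probs = [[0.9, 0.4], [0.6, 0.35], [0.1, 0.8]]     # made-up probabilities
candidate_values = [i / 10 for i in range(1, 10)]     # 0.1, 0.2, ..., 0.9
found = search_thresholds(toy_truth, toy_probs, candidate_values)
print(found)
```

Because the starting value is among the candidates, each step can only keep or improve the score, so the result is never worse than a flat 0.2 threshold on the data used for the search.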

Now let's map our predictions to their tags, using the thresholds we just retrieved.
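
The mapping step amounts to keeping, for each image, every tag whose predicted probability exceeds that tag's threshold. The sketch below shows the idea on toy data; `tags_from_probabilities` is a hypothetical function, and the classifier's own `map_predictions_TTA` may differ in its details.

```python
def tags_from_probabilities(probabilities, labels_map, thresholds):
    """Return, for each row of probabilities, the list of tags above their threshold."""
    tags_per_image = []
    for row in probabilities:
        tags = [labels_map[i] for i, p in enumerate(row) if p > thresholds[i]]
        tags_per_image.append(tags)
    return tags_per_image


toy_labels = {0: "clear", 1: "primary", 2: "water"}   # a small made-up label map
toy_rows = [[0.9, 0.8, 0.1], [0.2, 0.95, 0.6]]        # made-up probabilities
toy_thresholds = [0.5, 0.5, 0.3]                      # made-up per-class thresholds
toy_tags = tags_from_probabilities(toy_rows, toy_labels, toy_thresholds)
print([" ".join(tags) for tags in toy_tags])
```

Joining each list with spaces gives the format of the `tags` column built later for the submission file.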

In [None]:
y_map = {0: 'agriculture',
 1: 'artisinal_mine',
 2: 'bare_ground',
 3: 'blooming',
 4: 'blow_down',
 5: 'clear',
 6: 'cloudy',
 7: 'conventional_mine',
 8: 'cultivation',
 9: 'habitation',
 10: 'haze',
 11: 'partly_cloudy',
 12: 'primary',
 13: 'road',
 14: 'selective_logging',
 15: 'slash_burn',
 16: 'water'}

In [None]:
#predictions_tot = np.load("/home/jb/amazon/predictions_tot.npy")
predictions_tot = np.vstack((predictions, new_predictions))
np.save(proba_to_save, predictions_tot)
predicted_labels = classifier.map_predictions_TTA(predictions_tot, y_map, thresholds)

In [None]:
# TODO complete
tags_pred = np.array(predictions_tot).T
_, axs = plt.subplots(5, 4, figsize=(15, 20))
axs = axs.ravel()

for i, tag_vals in enumerate(tags_pred):
    sns.boxplot(tag_vals, orient='v', palette='Set2', ax=axs[i]).set_title(y_map[i])

Finally, let's assemble and visualize our predictions for the test dataset.

In [None]:
tags_list = [' '.join(map(str, tags)) for tags in predicted_labels]

x_test_filename_tot = np.hstack((x_test_filename, x_test_filename_additional))
final_data = [[filename.split(".")[0], tags] for filename, tags in zip(x_test_filename_tot, tags_list)]

In [None]:
final_df = pd.DataFrame(final_data, columns=['image_name', 'tags'])
final_df.head()

If there are a lot of `primary` and `clear` tags, this final dataset may be legit...

And save it to a submission file

In [None]:
final_df.to_csv(file_to_save, index=False)
classifier.close()

That's it, we're done!

In [None]:
"""
del x_test
gc.collect()

x_test, x_test_filename_additional = data_helper.preprocess_test_data(test_jpeg_additional, img_resize)
new_predictions = classifier.predict_TTA(x_test)
#new_predictions = classifier.predict(x_test)

del x_test
gc.collect()
predictions = np.vstack((predictions, new_predictions))
x_test_filename = np.hstack((x_test_filename, x_test_filename_additional))
print("Predictions shape: {}\nFiles name shape: {}\n1st predictions entry:\n{}".format(predictions.shape, 
                                                                              x_test_filename.shape,
                                                                              predictions[0]))

del x_test
gc.collect()

x_test, x_test_filename_additional = data_helper.preprocess_test_data(test_jpeg_additional, img_resize)
x_test_filename = np.hstack((x_test_filename, x_test_filename_additional))

predicted_labels = classifier.map_predictions(predictions, y_map, thresholds)
"""