# Iceberg Image Detection

To avoid another disaster like the Titanic, we want to build a model that predicts from satellite images whether an object is an iceberg or a ship.

We use a dataset of satellite radar images. Our goal is to understand it and to train a model that predicts whether a given image shows a ship or an iceberg. With these predictions, we would know which areas to avoid when traveling by sea.


## 1) Data acquisition
We load the data from its location and explore it:

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Importing the data
df_train = pd.read_json('data/iceberg/train.json')

ValueError: Unexpected character found when decoding array value (2)
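
The `ValueError` above comes from the JSON parser. A plausible cause (an assumption, since the file itself is not shown here) is that `train.json` is truncated, still compressed, or otherwise not valid JSON. A minimal sketch, using only the standard library, that reports where decoding fails; the helper name `diagnose_json` is hypothetical:

```python
import json


def diagnose_json(path):
    """Return (records, None) on success, or (None, message) if decoding fails."""
    try:
        with open(path, 'r', encoding='utf-8') as fh:
            return json.load(fh), None
    except json.JSONDecodeError as err:
        # The parser reports the position of the first invalid character
        return None, f"line {err.lineno}, column {err.colno}: {err.msg}"
    except UnicodeDecodeError as err:
        # Typical for binary content, e.g. an archive that was not extracted
        return None, f"not UTF-8 text: {err.reason}"


# Possible usage with the path from the cell above:
# records, problem = diagnose_json('data/iceberg/train.json')
# print(problem or f"{len(records)} records loaded")
```

If the message points at the very end of the file, the download was probably cut short; if the file is not text at all, it likely still needs to be extracted.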

In [2]:
# Checking some train data examples
df_train.head()

NameError: name 'df_train' is not defined

In [None]:
df_train.info()

## 2) Data preparation
### 2.1) Meaning of the variables
<ul>
<li>id - the id of the image</li>
<li>band_1, band_2 - the flattened image data. Each band is a list of 5625 values, i.e. 75x75 pixels. These values are not the usual non-negative integers found in image files: they are floats with a physical meaning, expressed in dB. Band 1 and Band 2 are radar backscatter signals produced by different polarizations at a particular incidence angle: HH (transmit and receive horizontally) and HV (transmit horizontally, receive vertically). More background on the satellite imagery can be found here.</li>
<li>inc_angle - the incidence angle at which the image was taken. This field has missing values marked as "na"; the images with "na" incidence angles are all in the training data to prevent leakage.</li>
<li>is_iceberg - the target variable, set to 1 if it is an iceberg, and 0 if it is a ship. This field only exists in train.json.</li>
</ul>
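
To make the band description concrete, here is a small sketch that turns one flattened band into a 75x75 array and converts dB values to linear power with the standard relation `10 ** (dB / 10)`. The helper names `band_to_image` and `db_to_linear` are hypothetical, and the constant input is synthetic, not taken from the dataset:

```python
import numpy as np


def band_to_image(flat_band, side=75):
    """Reshape a flattened band (side * side values) into a square image."""
    image = np.asarray(flat_band, dtype=np.float64)
    if image.size != side * side:
        raise ValueError(f"expected {side * side} values, got {image.size}")
    return image.reshape(side, side)


def db_to_linear(db_values):
    """Convert decibel values to linear power ratios."""
    return np.power(10.0, np.asarray(db_values, dtype=np.float64) / 10.0)


# Synthetic band: 5625 pixels, all at -20 dB
flat = np.full(5625, -20.0)
image = band_to_image(flat)
print(image.shape)  # (75, 75)
print(db_to_linear(image).max())
```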

### 2.2) Missing data
Let us check for the presence of <b>na</b> values in the <i>inc_angle</i> variable:

In [None]:
len(df_train[df_train['inc_angle']=='na'])

In [None]:
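# Replace the 'na' strings with NaN so they can be handled as missing values.
# The result is assigned back: replace(..., inplace=True) on a column selection
# may act on a temporary copy and leave df_train unchanged.
df_train['inc_angle'] = df_train['inc_angle'].replace('na', np.nan)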

In [None]:
# Now the df info shows that there are null values in inc_angle which was not visible before
df_train.info()

In [None]:
df_train['inc_angle'] = df_train['inc_angle'].replace('na',np.nan)

# Let's drop the NaN values.
trainData_noNaN = df_train.dropna()

# From the noNaN dataset, let's get the mean and standard deviation.
incAngleTrain_noNaN = np.array(trainData_noNaN['inc_angle'], dtype=float)
incAngleMean = incAngleTrain_noNaN.mean(dtype=np.float64)
incAngleStd = incAngleTrain_noNaN.std(dtype=np.float64)

# Using the mean and standard deviation, normalize the incidence angle to zero mean and unit standard deviation.
df_train['inc_angle'] -= incAngleMean
df_train['inc_angle'] /= incAngleStd

# Replace the NaN values with the mean value, 0.0
df_train['inc_angle'] = df_train['inc_angle'].replace(np.nan, 0.0)

incAngleTrain = df_train['inc_angle']
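
The same idea (standardize using only the known angles, then fill the gaps with the new mean of 0) can be sketched with NumPy's NaN-aware reductions. This is an illustrative variant, not the notebook's own code; the function name `standardize_with_missing` and the sample angles are made up:

```python
import numpy as np


def standardize_with_missing(values):
    """Standardize to zero mean and unit std, ignoring NaN, then set NaN to 0."""
    values = np.asarray(values, dtype=np.float64)
    mean = np.nanmean(values)
    std = np.nanstd(values)  # population std (ddof=0), like ndarray.std()
    scaled = (values - mean) / std
    # After standardization the mean is 0, so 0.0 is the mean imputation
    return np.where(np.isnan(scaled), 0.0, scaled), mean, std


# Synthetic incidence angles with one missing value
angles = np.array([30.0, 40.0, np.nan, 50.0])
scaled, mean, std = standardize_with_missing(angles)
print(scaled, mean, std)
```

Note that computing the statistics on the full table, as done above, uses rows that later end up in the validation set; fitting them on the training rows only would avoid that.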

### 2.3) Exploratory Data analysis
Do we really need two variables (band 1 & 2)?
Let us check their distributions:

In [None]:
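# Flatten every image of each band into one long vector of dB values
totalBand1 = np.concatenate([np.asarray(v, dtype=float) for v in df_train['band_1']])
totalBand2 = np.concatenate([np.asarray(v, dtype=float) for v in df_train['band_2']])

# sns.distplot is deprecated in recent seaborn releases;
# histplot with kde=True draws a comparable histogram plus density curve
sns.histplot(totalBand1, kde=True, stat='density', label='Band 1')
sns.histplot(totalBand2, kde=True, stat='density', label='Band 2')
plt.legend()
plt.autoscale()
plt.xlabel('Radar Backscatter (dB)')
plt.ylabel('Density')
plt.savefig('Radar Backscatter Distribution.jpg')
plt.show()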
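
As a numeric complement to the plot, one could compare the two bands through their means, standard deviations, and pixel-wise correlation. The sketch below uses synthetic data (the values are invented for illustration, not dataset results), and `band_summary` is a hypothetical helper:

```python
import numpy as np


def band_summary(band_1_rows, band_2_rows):
    """Summarize two bands given as sequences of flattened images."""
    b1 = np.concatenate([np.asarray(r, dtype=np.float64) for r in band_1_rows])
    b2 = np.concatenate([np.asarray(r, dtype=np.float64) for r in band_2_rows])
    return {
        'mean_1': b1.mean(), 'std_1': b1.std(),
        'mean_2': b2.mean(), 'std_2': b2.std(),
        # Pearson correlation between corresponding pixels of the two bands
        'corr': np.corrcoef(b1, b2)[0, 1],
    }


# Synthetic stand-in for df_train['band_1'] and df_train['band_2']
rng = np.random.default_rng(0)
fake_band_1 = [rng.normal(-20.0, 5.0, 5625) for _ in range(4)]
fake_band_2 = [row - 6.0 + rng.normal(0.0, 2.0, 5625) for row in fake_band_1]
summary = band_summary(fake_band_1, fake_band_2)
print(summary)
```

A correlation close to 1 would suggest the second band adds little information; a clearly lower value would support keeping both.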

## Comment:
### _This plot shows how the density of radar backscatter values is distributed for bands 1 and 2. The band 2 density starts at about 0.135 for a backscatter of about -30 dB, then drops quickly from -20 dB to reach 0 at -10 dB. In contrast, the band 1 density rises to about 0.08, then decreases gradually from -10 dB to reach 0 at 0 dB._

### 2.4) Image visualisation
Since bands 1 and 2 are images, let us plot them alongside their labels.

Given the distribution visualised, should we keep both variables? Why?

In [None]:
f, axarr = plt.subplots(nrows=2, ncols=6, sharex=True, sharey=True, figsize=(6,2), dpi=300)
for img in range(6):
    axarr[0, img].imshow(np.array(df_train['band_1'][img]).reshape((75,75)),cmap='binary_r')
    if df_train['is_iceberg'][img]==0:
        axarr[0, img].set_title('Ship')
    else:
        axarr[0, img].set_title('Iceberg')
    axarr[1, img].imshow(np.array(df_train['band_2'][img]).reshape((75,75)),cmap='binary_r')
axarr[0, 0].set_ylabel('Band 1')
axarr[1, 0].set_ylabel('Band 2')
plt.savefig('radar.jpg')
plt.show()

### 2.5) Data splitting and preparation
We need to split our data into two sets: training and validation.

We also need to reshape and normalize the images.

In [None]:
from sklearn.model_selection import train_test_split

# TO-DO :
#   - Separate the target from the features
#   - Split the data into train and validation
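# Select columns by name rather than by position, so the split does not
# depend on the column order produced by read_json
Y_train = df_train['is_iceberg']
X_train = df_train[['band_1', 'band_2', 'inc_angle']]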
X_train, X_valid, Y_train, Y_valid = train_test_split(X_train, Y_train, random_state=1, train_size=0.8)
print(X_train.shape, Y_train.shape)
print(X_valid.shape, Y_valid.shape)
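
The split above is purely random, so the share of icebergs may differ between the two sets. A hand-rolled stratified split, similar in spirit to passing `stratify=` to `train_test_split`, can be sketched as follows. The function name `stratified_split_indices` and the synthetic labels are assumptions made for illustration:

```python
import numpy as np


def stratified_split_indices(labels, train_fraction=0.8, seed=1):
    """Split row indices so each class keeps roughly the same share in both sets."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, valid_idx = [], []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        rng.shuffle(cls_idx)  # shuffle within the class
        cut = int(round(train_fraction * cls_idx.size))
        train_idx.extend(cls_idx[:cut])
        valid_idx.extend(cls_idx[cut:])
    return np.sort(np.array(train_idx)), np.sort(np.array(valid_idx))


# Synthetic labels: 60 ships (0) and 40 icebergs (1)
labels = np.array([0] * 60 + [1] * 40)
train_idx, valid_idx = stratified_split_indices(labels)
print(labels[train_idx].mean(), labels[valid_idx].mean())
```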

In [None]:
# Function used to reshape and normalize
def reshape_normalize(band):
    """Reshape each flattened band to 75x75 and min-max scale it to [0, 1]."""
    images = []
    for vector in band:
        bandMatrix = np.array(vector).reshape((75, 75))
        bandMatrix = (bandMatrix - bandMatrix.min()) / (bandMatrix.max() - bandMatrix.min())
        images.append(bandMatrix)
    # Stack into (n_images, 75, 75), then add the channel axis expected by Conv2D
    return np.stack(images)[..., np.newaxis]
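
For larger datasets, the per-image loop can be replaced by a vectorized version that relies on NumPy broadcasting. This is a sketch under one stated difference: a constant image (max equal to min) is mapped to zeros here instead of producing a division by zero. The name `reshape_normalize_vectorized` is hypothetical:

```python
import numpy as np


def reshape_normalize_vectorized(band, side=75):
    """Min-max scale every image to [0, 1]; returns shape (n, side, side, 1)."""
    images = np.asarray(list(band), dtype=np.float64).reshape(-1, side, side)
    lo = images.min(axis=(1, 2), keepdims=True)  # per-image minimum
    hi = images.max(axis=(1, 2), keepdims=True)  # per-image maximum
    span = np.where(hi > lo, hi - lo, 1.0)       # guard against constant images
    return ((images - lo) / span)[..., np.newaxis]


# Synthetic band: two random images and one constant image
rng = np.random.default_rng(0)
fake_band = [rng.normal(-20.0, 5.0, 5625).tolist() for _ in range(2)]
fake_band.append([-15.0] * 5625)
scaled_images = reshape_normalize_vectorized(fake_band)
print(scaled_images.shape)  # (3, 75, 75, 1)
```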

In [None]:
band_1_train = reshape_normalize(X_train['band_1'])
band_2_train = reshape_normalize(X_train['band_2'])
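# Pass the angle as a plain float array so Keras does not receive an object-typed column
angle_train = X_train['inc_angle'].to_numpy(dtype='float32')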

## 3) Model creation
Now that our data is ready, we need to create our model.

Unlike the sequential models that we have seen before, this time we will create a non-sequential model. This model has three separate branches that are later merged:

<ol>
<li>Convolutional Model for the first band</li>
<li>Convolutional Model for the second band</li>
<li>Input layer for the inc_angle</li>
</ol>

These three 'models' are then merged and fed to a fully connected neural network (Dense layers).

In [None]:
import keras
from keras.layers import Input, Dense, Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dropout
from keras.models import Model, Sequential
from keras import optimizers
from keras import regularizers

band1 = Input(shape=(75,75,1))
band2 = Input(shape=(75,75,1))
# shape must be a tuple: (1) is just the integer 1, (1,) is a one-element tuple
angle = Input(shape=(1,))

# TO-DO :
#   - Complete the two convolution models
#   - Combine the results of the convolution models with inc_angle variable and feed them to an mlp

# Define the model architecture
# Start with Convolutional Model 1 for Band 1 Data
# The input shape is already defined by the Input layer, so input_shape is not repeated here
conv1 = Conv2D(filters=64, kernel_size=5, strides=1, padding='same', activation='elu')(band1)
conv1 = MaxPooling2D(pool_size=2,strides=2)(conv1)
conv1 = Conv2D(filters= 128, kernel_size=4, strides=1, padding='same', activation='elu')(conv1)
conv1 = MaxPooling2D(pool_size=2,strides=2)(conv1)

conv1 = GlobalAveragePooling2D()(conv1)

# Continue with Convolutional Model 2 for Band 2 Data
# (same architecture as for band 1; the input shape comes from the Input layer)
conv2 = Conv2D(filters=64, kernel_size=5, strides=1, padding='same', activation='elu')(band2)
conv2 = MaxPooling2D(pool_size=2,strides=2)(conv2)
conv2 = Conv2D(filters= 128, kernel_size=4, strides=1, padding='same', activation='elu')(conv2)
conv2 = MaxPooling2D(pool_size=2,strides=2)(conv2)

conv2 = GlobalAveragePooling2D()(conv2)

# Combine the convolution models' outputs as well as the inc_angle
merge = keras.layers.concatenate([conv1, conv2, angle])

# Let's use a final multi-layer perceptron to weigh the three inputs
mlp = Dense(500, activation='elu')(merge)
mlp = Dense(256, activation='elu')(mlp)
mlp = Dense(128, activation='elu')(mlp)

output = Dense(1, activation='sigmoid')(mlp)

model = Model(inputs = [band1, band2, angle], outputs = output)

#Compile the model
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

model.summary()
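
The shapes printed by `model.summary()` can be checked by hand. The sketch below redoes the arithmetic for one convolutional branch and for the merged vector, assuming the pooling layers use Keras' default `'valid'` padding (no padding argument is given in the code above). The helper `max_pool_size` is hypothetical:

```python
def max_pool_size(size, pool=2, stride=2):
    """Output size of a pooling layer with 'valid' padding."""
    return (size - pool) // stride + 1


size = 75                    # input images are 75x75
# Conv2D with padding='same' and strides=1 keeps the spatial size unchanged
size = max_pool_size(size)   # after the first pooling layer: 37
size = max_pool_size(size)   # after the second pooling layer: 18

branch_features = 128        # GlobalAveragePooling2D keeps one value per filter
merged_width = 2 * branch_features + 1  # two branches plus inc_angle: 257

# Parameters of the first Dense layer: weights plus biases
dense_500_params = merged_width * 500 + 500  # 129000

print(size, merged_width, dense_500_params)
```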

In [None]:
# TO-DO : Train the model
model.fit(x=[band_1_train, band_2_train, angle_train], y=Y_train, epochs=50, verbose=1)

In [None]:
# Prepare the data for the validation set
band_1_test = reshape_normalize(X_valid['band_1'])
band_2_test = reshape_normalize(X_valid['band_2'])
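# Pass the angle as a plain float array so Keras does not receive an object-typed column
angle_test = X_valid['inc_angle'].to_numpy(dtype='float32')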

In [None]:
# Evaluate your model
# TO-DO : Specify the parameters for this function
score = model.evaluate([band_1_test, band_2_test, angle_test], Y_valid)
# These scores are computed on the validation split, not on a separate test set
print("Validation loss:", score[0])
print("Validation accuracy:", score[1])
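
To see what the two numbers returned by `model.evaluate` mean, the loss and the accuracy can be recomputed from predicted probabilities with NumPy. This is a sketch on invented values, not on model output; the clipping constant `eps` and the 0.5 threshold are assumptions chosen for illustration:

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_prob = np.clip(np.asarray(y_prob, dtype=np.float64), eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded probability matches the label."""
    predicted = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(predicted == np.asarray(y_true)))


# Invented labels (1 = iceberg, 0 = ship) and predicted probabilities
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.1])
print("loss:", binary_crossentropy(y_true, y_prob))
print("accuracy:", binary_accuracy(y_true, y_prob))
```

The third sample is an iceberg predicted with probability 0.4, so it counts as a misclassification and contributes the largest share of the loss.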