## Prepare HappyWhales data
This notebook organizes the training data for HappyWhales.

In [1]:
import numpy as np
import pandas as pd
from PIL import Image

In [2]:
training_data = pd.read_csv("C:/Users/DNNeu/kaggle/HappyWhale/Data/train.csv", index_col=0)

In [3]:
training_data.head()

image,species,individual_id
00021adfb725ed.jpg,melon_headed_whale,cadddb1636b9
000562241d384d.jpg,humpback_whale,1a71fbb72250
0007c33415ce37.jpg,false_killer_whale,60008f293a2b
0007d9bca26a99.jpg,bottlenose_dolphin,4b00fe572063
00087baf5cef7a.jpg,humpback_whale,8e5253662392


In [4]:
unique_ids = np.array(training_data['individual_id'])
n_images = unique_ids.shape[0]
unique_ids = np.unique(unique_ids)
n_individuals = unique_ids.shape[0]
print("There are {} unique individual ids and a total of {} images".format(n_individuals, n_images))

There are 15587 unique individual ids and a total of 51033 images


In [6]:
unique_ids

array(['0013f1f5f2f0', '001618e0a31e', '0018a0f40586', ...,
       'fffb11ff4575', 'fffe15363b92', 'ffff6255f559'], dtype=object)

## Mapping from individual_id to label

We need a mapping from individual_id strings to integer labels that can be used as targets when training a neural network.

In [6]:
pd.DataFrame(unique_ids).to_csv("labels_map.csv")

In [7]:
labels_map = np.array(pd.read_csv("labels_map.csv"))

In [8]:
labels_map

array([[0, '0013f1f5f2f0'],
       [1, '001618e0a31e'],
       [2, '0018a0f40586'],
       ...,
       [15584, 'fffb11ff4575'],
       [15585, 'fffe15363b92'],
       [15586, 'ffff6255f559']], dtype=object)
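As a minimal sketch of why the row index doubles as the label, the round trip can be reproduced on toy data. The ids below and the in-memory buffer (standing in for labels_map.csv) are assumptions for illustration only. np.unique returns the ids sorted, to_csv writes the integer index as the first column, and read_csv brings both columns back.

```python
import io

import numpy as np
import pandas as pd

# Toy ids with repeats (hypothetical values, not taken from train.csv)
toy_ids = np.array(["b2", "a1", "b2", "c3", "a1"])

# np.unique drops duplicates and returns the ids in sorted order
toy_unique = np.unique(toy_ids)

# Write to an in-memory buffer instead of a file on disk;
# the DataFrame's integer index becomes the first CSV column
buffer = io.StringIO()
pd.DataFrame(toy_unique).to_csv(buffer)
buffer.seek(0)

# Reading it back yields one row per id: [label, individual_id]
toy_labels_map = np.array(pd.read_csv(buffer))
print(toy_labels_map)
```

Because the ids are sorted before they are written, the label assigned to each id is stable as long as the set of ids does not change.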

Suppose we want to look up the target label for a given individual id:

In [9]:
def get_target(individual_id, labels_map=labels_map):
    # the row index of the matching id in labels_map is also its integer label
    target_label = np.where(labels_map[:, 1] == individual_id)[0][0]
    return target_label
    
random_image_ix = np.random.randint(n_images)
random_individual_id = training_data.iloc[random_image_ix].individual_id
print("individual id:", random_individual_id)
random_target_label = get_target(random_individual_id)
print("target:", random_target_label)

individual id: 02a722f078a2
target: 166


Now let's confirm that this target corresponds to this individual id:

In [11]:
labels_map[random_target_label, 1]

'02a722f078a2'
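get_target scans the whole array on every call. As an alternative sketch (not what this notebook uses), the same two-column array can be turned into dictionaries once, giving constant-time lookups in both directions. The toy array and the names id_to_label and label_to_id are hypothetical.

```python
import numpy as np

# Toy stand-in for labels_map: each row is [label, individual_id]
toy_labels_map = np.array([[0, "a1"], [1, "b2"], [2, "c3"]], dtype=object)

# Build the forward mapping once: individual_id -> integer label
id_to_label = {individual_id: int(label) for label, individual_id in toy_labels_map}

# And the reverse mapping: integer label -> individual_id
label_to_id = {label: individual_id for individual_id, label in id_to_label.items()}

target = id_to_label["b2"]
print("target:", target)
print("individual id:", label_to_id[target])
```

Looking an id up and then mapping the label back should always return the id you started with, which is the same check performed in the cell above.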

This labels_map.csv file can then be used to organize a directory of images.

## Create an Image Directory For Training

This will create an image directory structure with one directory per individual, named after that individual's integer label and containing copies of that individual's images. Individuals with fewer than 10 images are skipped.

In [12]:
image_files = np.array(training_data.loc[training_data.individual_id=='cadddb1636b9'].index)

In [13]:
image_files

array(['00021adfb725ed.jpg'], dtype=object)
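The boolean-mask lookup above handles one individual at a time. As a sketch on assumed toy data, the same grouping can be done for every individual in one pass with groupby, optionally dropping individuals with too few images. The names toy_training_data, files_by_id, and min_images are hypothetical.

```python
import pandas as pd

# Toy stand-in for training_data: image filenames as the index
toy_training_data = pd.DataFrame(
    {"individual_id": ["x", "y", "x", "z", "x", "y"]},
    index=pd.Index(
        ["1.jpg", "2.jpg", "3.jpg", "4.jpg", "5.jpg", "6.jpg"], name="image"
    ),
)

min_images = 2  # hypothetical threshold; any positive integer works

# One pass over the table: collect the image filenames of each individual,
# keeping only individuals that have at least min_images images
files_by_id = {
    individual_id: list(group.index)
    for individual_id, group in toy_training_data.groupby("individual_id")
    if len(group) >= min_images
}
print(files_by_id)
```

This avoids re-filtering the full table once per individual, which may matter when there are thousands of ids.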

In [11]:
import os
import shutil

src_dir = "C:/Users/DNNeu/kaggle/HappyWhale/Data/train_images"
parent_dir = "C:/Users/DNNeu/kaggle/HappyWhale/Data_Small"

# loop over unique ids
for individual_id in unique_ids:
    # collect all images of the current individual
    image_files = np.array(training_data.loc[training_data.individual_id == individual_id].index)
    if len(image_files) < 10:
        continue  # keep only individuals with at least 10 examples
    # the directory is named after the individual's integer label
    directory = str(get_target(individual_id))
    path = os.path.join(parent_dir, directory)
    print("path:", path)
    os.makedirs(path, exist_ok=True)

    for f in image_files:
        # copy the file to this location (the original stays in train_images)
        src = os.path.join(src_dir, f)
        dst = os.path.join(path, f)
        shutil.copyfile(src, dst)

path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\85
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\115
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\145
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\151
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\160
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\168
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\177
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\180
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\197
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\200
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\203
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\227
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\234
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\247
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\254
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\299
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\304
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\306
...

path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\2742
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\2798
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\2801
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\2802
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\2807
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\2811
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\2816
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\2821
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\2864
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\2910
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\2932
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\2957
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\2968
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\3042
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\3044
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\3048
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\3052
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\3086
...

path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5503
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5520
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5545
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5561
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5590
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5600
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5622
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5624
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5626
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5641
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5693
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5748
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5749
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5769
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5787
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5798
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5824
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\5839
...

path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8257
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8295
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8343
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8356
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8391
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8443
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8450
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8487
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8490
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8513
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8545
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8546
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8549
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8558
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8565
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8566
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8611
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\8613
...

path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\10833
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\10905
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\10909
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\10918
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\10967
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\10985
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\11022
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\11027
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\11037
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\11059
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\11062
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\11105
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\11108
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\11118
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\11121
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\11125
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\11135
...

path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\13953
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\13959
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\14009
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\14010
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\14016
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\14029
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\14036
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\14044
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\14059
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\14073
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\14095
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\14113
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\14117
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\14137
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\14163
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\14174
path: C:/Users/DNNeu/kaggle/HappyWhale/Data_Small\14177
...
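To check the copy step without touching the real dataset, the same logic can be exercised on temporary directories. This is a sketch under stated assumptions: the helper copy_into_label_dirs, the toy filenames, and the placeholder file contents are all hypothetical, and only the directory layout (one folder per integer label) mirrors the loop above.

```python
import os
import shutil
import tempfile


def copy_into_label_dirs(files_by_label, src_dir, dst_dir):
    """Copy each file into dst_dir/<label>/ and return the file count per label."""
    copied = {}
    for label, files in files_by_label.items():
        label_dir = os.path.join(dst_dir, str(label))
        os.makedirs(label_dir, exist_ok=True)
        for name in files:
            shutil.copyfile(os.path.join(src_dir, name), os.path.join(label_dir, name))
        # count what actually landed in the directory
        copied[label] = len(os.listdir(label_dir))
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, "train_images")
    dst_dir = os.path.join(tmp, "Data_Small")
    os.makedirs(src_dir)

    # Toy mapping from integer label to image filenames
    toy_files = {0: ["a.jpg", "b.jpg"], 2: ["c.jpg"]}

    # Create small placeholder files in place of real images
    for files in toy_files.values():
        for name in files:
            with open(os.path.join(src_dir, name), "wb") as fh:
                fh.write(b"placeholder")

    copied_counts = copy_into_label_dirs(toy_files, src_dir, dst_dir)
    created_dirs = sorted(os.listdir(dst_dir))
    source_still_present = sorted(os.listdir(src_dir))

print(copied_counts)
print(created_dirs)
```

Since shutil.copyfile copies rather than moves, the source directory keeps all of its files after the run.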