<a href="https://www.kaggle.com/chasset/happywhale-introduction?scriptVersionId=89584797" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction to the Happywhale competition

We can take a first look at the [data](https://www.kaggle.com/c/happy-whale-and-dolphin/data) using Kaggle's data explorer. The images vary widely in quality and framing, from a close-up of a dorsal fin to a distant view of the animal's back.

Let’s take a more complete look. First, add the data through the Kaggle UI. The files are then available in the */kaggle/input/happy-whale-and-dolphin* folder.

In [None]:
!ls -l /kaggle/input/happy-whale-and-dolphin

## Data loading

Let’s load the image metadata described in the `train.csv` file.

The training data describes 51 033 images, with one ID field for the image filename and two other fields. The latter give the animal's species and which individual it is.


In [2]:
import pandas as pd
df = pd.read_csv("/kaggle/input/happy-whale-and-dolphin/train.csv")
df

Unnamed: 0,image,species,individual_id
0,00021adfb725ed.jpg,melon_headed_whale,cadddb1636b9
1,000562241d384d.jpg,humpback_whale,1a71fbb72250
2,0007c33415ce37.jpg,false_killer_whale,60008f293a2b
3,0007d9bca26a99.jpg,bottlenose_dolphin,4b00fe572063
4,00087baf5cef7a.jpg,humpback_whale,8e5253662392
...,...,...,...
51028,fff639a7a78b3f.jpg,beluga,5ac053677ed1
51029,fff8b32daff17e.jpg,cuviers_beaked_whale,1184686361b3
51030,fff94675cc1aef.jpg,blue_whale,5401612696b9
51031,fffbc5dd642d8c.jpg,beluga,4000b3d7c24e
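
Before going further, it can help to confirm the basic counts and check for obvious quality problems. The following is a minimal sketch, assuming the three columns shown above; the `summarize` helper and the small `sample` frame are hypothetical stand-ins so that the snippet runs without the Kaggle files.

```python
import pandas as pd

# Tiny in-memory stand-in for train.csv (rows copied from the preview above).
sample = pd.DataFrame(
    {
        "image": ["00021adfb725ed.jpg", "000562241d384d.jpg", "00087baf5cef7a.jpg"],
        "species": ["melon_headed_whale", "humpback_whale", "humpback_whale"],
        "individual_id": ["cadddb1636b9", "1a71fbb72250", "8e5253662392"],
    }
)


def summarize(frame: pd.DataFrame) -> dict:
    """Return row and distinct-value counts plus two simple quality flags."""
    return {
        "images": len(frame),
        "species": frame["species"].nunique(),
        "individuals": frame["individual_id"].nunique(),
        "missing_values": int(frame.isna().sum().sum()),
        "duplicated_images": int(frame["image"].duplicated().sum()),
    }


summary = summarize(sample)
print(summary)
```

Calling `summarize(df)` on the full metadata should give the totals quoted in this notebook, provided the file matches the version used here.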


# Data quality

## Field `species`

In the taxonomy [dataset](https://www.kaggle.com/chasset/happywhalespeciesclassification), the authors propose corrections to the `species` field.

In [3]:
df  # no correction applied yet: the output below is the unmodified metadata

Unnamed: 0,image,species,individual_id
0,00021adfb725ed.jpg,melon_headed_whale,cadddb1636b9
1,000562241d384d.jpg,humpback_whale,1a71fbb72250
2,0007c33415ce37.jpg,false_killer_whale,60008f293a2b
3,0007d9bca26a99.jpg,bottlenose_dolphin,4b00fe572063
4,00087baf5cef7a.jpg,humpback_whale,8e5253662392
...,...,...,...
51028,fff639a7a78b3f.jpg,beluga,5ac053677ed1
51029,fff8b32daff17e.jpg,cuviers_beaked_whale,1184686361b3
51030,fff94675cc1aef.jpg,blue_whale,5401612696b9
51031,fffbc5dd642d8c.jpg,beluga,4000b3d7c24e
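
A correction of the `species` field could be applied with a simple label mapping. The following is only a sketch: the `SPECIES_FIXES` dictionary and the `fix_species` helper are hypothetical, and the two entries are illustrative assumptions. The actual mapping should be taken from the taxonomy dataset.

```python
import pandas as pd

# Hypothetical correction table: label as found -> corrected label.
# These entries are assumptions for illustration, not the taxonomy dataset itself.
SPECIES_FIXES = {
    "bottlenose_dolpin": "bottlenose_dolphin",
    "kiler_whale": "killer_whale",
}


def fix_species(frame: pd.DataFrame, fixes: dict = SPECIES_FIXES) -> pd.DataFrame:
    """Return a copy of `frame` whose species labels are replaced according to `fixes`."""
    return frame.assign(species=frame["species"].replace(fixes))


toy = pd.DataFrame({"species": ["bottlenose_dolpin", "kiler_whale", "beluga"]})
fixed = fix_species(toy)
print(fixed["species"].tolist())
```

Labels absent from the mapping are left untouched, and the input frame is not modified, so the corrected and original labels can be compared side by side.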


## Data bias

### Around individuals

The 51 033 pictures show 15 587 individuals. The distribution is highly skewed:

- 75% of individuals appear in only one or two pictures
- a single individual accounts for 400 pictures
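
The two figures in the list above can be computed directly from the per-individual counts. Here is a minimal sketch on a toy frame; the data and variable names are made up for illustration, and only the `individual_id` column name comes from the dataset.

```python
import pandas as pd

# Toy metadata: individual "a" has 5 images, "b" has 2, "c" and "d" have 1 each.
toy = pd.DataFrame(
    {
        "image": [f"img_{i}.jpg" for i in range(9)],
        "individual_id": ["a"] * 5 + ["b"] * 2 + ["c", "d"],
    }
)

# Number of images available for each individual.
images_per_individual = toy["individual_id"].value_counts()

# Fraction of individuals that appear in at most two images.
share_at_most_two = (images_per_individual <= 2).mean()

# Image count of the most photographed individual.
largest = images_per_individual.max()

print(share_at_most_two, largest)
```

Replacing `toy` with `df` gives the corresponding values for the full training metadata.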

In [None]:
# Count the images available for each individual
individuals = df.drop(columns=['species']).groupby(['individual_id']).count().rename(columns={'image': 'images_count'})
individuals

In [None]:
individuals.describe()

In [None]:
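from matplotlib import pyplot as plt
# Normalized histogram of the number of images per individual
plt.hist(individuals.images_count, density=True, facecolor='g', alpha=0.75)
plt.xlabel('Images per individual')
plt.ylabel('Density')
plt.show()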

### Around species

The 51 033 images cover 30 species. Again, the distribution is skewed. Fraser's dolphin is described by only 14 images, while bottlenose dolphin (9664), beluga (7443) and humpback whale (7392) are overrepresented.

In [None]:
# Count the images available for each species, most frequent first
species = df.drop(columns=['individual_id']).groupby(['species']).count().rename(columns={'image': 'images_count'}).sort_values(by=['images_count'], ascending=False)
print(species.shape)
species

In [None]:
species.describe()

In [None]:
# Normalized histogram of the number of images per species
plt.hist(species.images_count, density=True, facecolor='g', alpha=0.75)
plt.xlabel('Images per species')
plt.show()

## Model quality discussion

Altogether, the data contains 51 033 images of 15 587 individuals from 30 species. However, few species have a good ratio of images per individual. This will affect prediction quality: many species and individuals will be difficult to predict.

In [None]:
# Count the distinct individuals per species
individuals_counts = df.drop(columns=['image']).groupby(['species']).nunique().rename(columns={'individual_id': 'individuals_count'})
counts = pd.merge(species, individuals_counts, how='left', on=['species'])
# Average number of images per individual, for each species
counts = counts.assign(ratio=counts.images_count / counts.individuals_count).sort_values(by=['ratio'], ascending=False)
counts

In [None]:
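# Normalized histogram of the images-per-individual ratio across species
plt.hist(counts.ratio, density=True, facecolor='g', alpha=0.75)
plt.xlabel('Images per individual (by species)')
plt.show()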
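
The average ratio can hide individuals that appear in a single image, which are probably among the hardest to predict. As a complementary view, the following sketch computes, for each species, the share of individuals with exactly one image. The toy frame and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy metadata: beluga "a" has 2 images, beluga "b" has 1, blue whales "c" and "d" have 1 each.
toy = pd.DataFrame(
    {
        "species": ["beluga", "beluga", "beluga", "blue_whale", "blue_whale"],
        "individual_id": ["a", "a", "b", "c", "d"],
    }
)

# Number of images for each (species, individual) pair.
per_individual = toy.groupby(["species", "individual_id"]).size()

# Share of individuals in each species that appear in exactly one image.
singleton_share = (per_individual == 1).groupby(level="species").mean()

print(singleton_share)
```

Running the same two steps on `df` would give the share per species for the real data. A value close to 1 means that almost every individual of that species is seen only once.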