<a href="https://colab.research.google.com/github/cosmo3769/SSL-study/blob/eda/EDA_iNaturalist_aves.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setting up the Kaggle API to fetch the dataset

Go to your Kaggle account and generate an API token. A file named "kaggle.json" will be downloaded to your local system. Upload **kaggle.json** to the Colab session so that the Kaggle API can be used from this notebook.

In [None]:
%%capture
# Install the kaggle library. The %%capture cell magic must be the first line of the cell.
! pip install kaggle

In [None]:
! mkdir -p ~/.kaggle

In [None]:
! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json
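
The three shell steps above (create the config directory, copy the token, restrict its permissions) can also be done from Python. The following is a minimal sketch: the helper name `install_kaggle_token` and its parameters are hypothetical and not part of the kaggle package, and the demo runs in a temporary directory with a placeholder token rather than touching the real home directory.

```python
import os
import shutil
import stat
import tempfile


def install_kaggle_token(token_path, config_dir):
    """Copy an API token into config_dir and make it readable by the owner only.

    Hypothetical helper for illustration; not part of the kaggle package.
    """
    os.makedirs(config_dir, exist_ok=True)  # no error if the directory exists
    target = os.path.join(config_dir, "kaggle.json")
    shutil.copy(token_path, target)
    os.chmod(target, 0o600)  # same effect as `chmod 600`
    return target


# Demo in a throwaway directory with a placeholder token file.
demo_root = tempfile.mkdtemp()
fake_token = os.path.join(demo_root, "kaggle.json")
with open(fake_token, "w") as f:
    f.write('{"username": "example", "key": "placeholder"}')

installed = install_kaggle_token(fake_token, os.path.join(demo_root, ".kaggle"))
mode = stat.S_IMODE(os.stat(installed).st_mode)
print(installed, oct(mode))
```

In Colab the call would presumably be `install_kaggle_token("kaggle.json", os.path.expanduser("~/.kaggle"))`.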

## Dataset

The dataset is taken from the Kaggle competition [Semi-Supervised Recognition Challenge - FGVC7](https://www.kaggle.com/competitions/semi-inat-2020/data). The [GitHub page](https://github.com/cvl-umass/semi-inat-2020) explains the dataset in detail.

The dataset is divided into the following splits:

| Split | Details | Classes | Images |
| ----- | ------- | ------- | ------ |
| Train | Labeled | 200 | 3,959 |
| Train | Unlabeled, in-class | 200 | 26,640 |
| Train | Unlabeled, out-of-class | - | 122,208 |
| Val | Labeled | 200 | 2,000 |
| Test | Public | 200 | 4,000 |
| Test | Private | 200 | 4,000 |
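
To put the table in perspective, the short sketch below computes how small the labeled portion of the training data is. The counts are copied from the table; the dictionary keys are names chosen here for illustration.

```python
# Image counts copied from the table above; key names are illustrative.
split_sizes = {
    "train_labeled": 3959,
    "train_unlabeled_in_class": 26640,
    "train_unlabeled_out_of_class": 122208,
    "val_labeled": 2000,
    "test_public": 4000,
    "test_private": 4000,
}

# Sum only the three training splits.
train_total = sum(n for name, n in split_sizes.items() if name.startswith("train"))
labeled_fraction = split_sizes["train_labeled"] / train_total

print(f"training images: {train_total}")
print(f"labeled fraction of training images: {labeled_fraction:.2%}")
```

Only a small share of the training images carries a label, which is what makes this a semi-supervised problem.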

In [None]:
! kaggle competitions download -c semi-inat-2020

Downloading semi-inat-2020.zip to /content
100% 14.2G/14.3G [02:21<00:00, 122MB/s]
100% 14.3G/14.3G [02:21<00:00, 108MB/s]


In [None]:
%%capture
! unzip semi-inat-2020.zip

In [None]:
import os 

ANNOTATION_DIR = '/content/annotation/'
# os.listdir(ANNOTATION_DIR)

TRAINVAL_LABELLED_DIR = '/content/trainval_images/trainval_images/'
# os.listdir(TRAINVAL_LABELLED_DIR)

TRAIN_UNLABELLED_INCLASS_DIR = '/content/u_train_in/u_train_in/'
# os.listdir(TRAIN_UNLABELLED_INCLASS_DIR)

TRAIN_UNLABELLED_OUTCLASS_DIR = '/content/u_train_out/u_train_out/'
# os.listdir(TRAIN_UNLABELLED_OUTCLASS_DIR)

TEST_DIR = '/content/test/test/'
# os.listdir(TEST_DIR)

## Annotation Format

The dataset follows the annotation format of the COCO dataset. The annotations are stored in [JSON format](https://www.json.org/json-en.html) and are organized as follows:

```
{
  "info" : info,
  "images" : [image],
  "annotations" : [annotation],
}

info{
  "year" : int,
  "version" : str,
  "description" : str,
  "contributor" : str,
  "url" : str,
  "date_created" : datetime,
}

image{
  "id" : int,
  "width" : int,
  "height" : int,
  "file_name" : str
}

annotation{
  "id" : int,
  "image_id" : int,
  "category_id" : int
}

```
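
As a concrete illustration of this layout, the sketch below builds a tiny made-up annotation file and links each annotation to its image through `image_id`. The records are invented for the example and use only the standard library.

```python
import json

# A tiny, made-up annotation file in the layout described above.
raw = json.dumps({
    "info": {"year": 2020, "version": "1.0", "description": "toy example"},
    "images": [
        {"id": 0, "width": 500, "height": 388, "file_name": "trainval_images/0/0.jpg"},
        {"id": 1, "width": 500, "height": 375, "file_name": "trainval_images/1/0.jpg"},
    ],
    "annotations": [
        {"id": 0, "image_id": 0, "category_id": 0},
        {"id": 1, "image_id": 1, "category_id": 1},
    ],
})

data = json.loads(raw)

# Index images by id so each annotation can look up its image record.
images_by_id = {image["id"]: image for image in data["images"]}

# Map every file name to the category id of its annotation.
labels = {
    images_by_id[ann["image_id"]]["file_name"]: ann["category_id"]
    for ann in data["annotations"]
}
print(labels)
```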



## Labelled training annotations

Showing the **annotations of labelled training images** from the annotation file [anno_l_train.json](/content/annotation/annotation/anno_l_train.json).

In [None]:
import json
import pandas as pd

file = ANNOTATION_DIR + 'annotation/anno_l_train.json'

# Load the annotation file with Python's json module.
with open(file, 'r') as f:
    data = json.load(f)

# Flatten the "annotations" records into a DataFrame.
annotations_labelled_training = pd.json_normalize(data, record_path=['annotations'])

annotations_labelled_training

Unnamed: 0,image_id,id,category_id
0,0,0,0
1,1,1,0
2,2,2,0
3,3,3,0
4,4,4,0
...,...,...,...
3954,3954,3954,199
3955,3955,3955,199
3956,3956,3956,199
3957,3957,3957,199


In [None]:
annotations_labelled_training.shape

(3959, 3)

In [None]:
annotations_labelled_training.columns

Index(['image_id', 'id', 'category_id'], dtype='object')

In [None]:
annotations_labelled_training.dtypes

image_id       int64
id             int64
category_id    int64
dtype: object

In [None]:
annotations_labelled_training['category_id'].value_counts()

23     43
13     42
5      37
73     36
26     36
       ..
197     7
181     7
193     6
199     6
185     5
Name: category_id, Length: 200, dtype: int64
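
The counts above range from 43 images for the largest class down to 5 for the smallest, a ratio of 8.6. A minimal sketch of how such an imbalance ratio could be computed follows, using toy labels in place of the real `category_id` column.

```python
from collections import Counter

# Toy labels standing in for the category_id column.
category_ids = [0, 0, 0, 0, 1, 1, 2]

counts = Counter(category_ids)
largest = max(counts.values())
smallest = min(counts.values())

# Ratio of the most frequent class to the least frequent class.
imbalance_ratio = largest / smallest

print(counts.most_common())
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```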

In [None]:
annotations_labelled_training['category_id'].unique()

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 18

Showing the **image annotations of the labelled training images** from the file [anno_l_train.json](/content/annotation/annotation/anno_l_train.json).

In [None]:
import json
import pandas as pd

file = ANNOTATION_DIR + 'annotation/anno_l_train.json'

# Load the annotation file with Python's json module.
with open(file, 'r') as f:
    data = json.load(f)

# Flatten the "images" records into a DataFrame.
images_annotations_labelled_training = pd.json_normalize(data, record_path=['images'])

images_annotations_labelled_training

Unnamed: 0,file_name,width,height,id
0,trainval_images/0/0.jpg,500,388,0
1,trainval_images/0/1.jpg,500,375,1
2,trainval_images/0/2.jpg,500,375,2
3,trainval_images/0/3.jpg,500,331,3
4,trainval_images/0/4.jpg,500,387,4
...,...,...,...,...
3954,trainval_images/199/1.jpg,500,375,3954
3955,trainval_images/199/2.jpg,500,333,3955
3956,trainval_images/199/3.jpg,500,375,3956
3957,trainval_images/199/4.jpg,500,375,3957


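Concatenating the two DataFrames column-wise. This relies on the annotation rows and the image rows being in the same order, which the outputs above suggest: `image_id` and `id` both run upward from 0.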

In [None]:
training_labelled = pd.concat([annotations_labelled_training , images_annotations_labelled_training.drop(['id'], axis = 1)], axis = 1)
training_labelled

Unnamed: 0,image_id,id,category_id,file_name,width,height
0,0,0,0,trainval_images/0/0.jpg,500,388
1,1,1,0,trainval_images/0/1.jpg,500,375
2,2,2,0,trainval_images/0/2.jpg,500,375
3,3,3,0,trainval_images/0/3.jpg,500,331
4,4,4,0,trainval_images/0/4.jpg,500,387
...,...,...,...,...,...,...
3954,3954,3954,199,trainval_images/199/1.jpg,500,375
3955,3955,3955,199,trainval_images/199/2.jpg,500,333
3956,3956,3956,199,trainval_images/199/3.jpg,500,375
3957,3957,3957,199,trainval_images/199/4.jpg,500,375
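
Column-wise concatenation pairs rows purely by position. If the row order of the two frames could not be trusted, joining on the key would be the safer route. The sketch below shows that alternative on toy frames whose image rows are deliberately shuffled. It is not what the notebook does, and the frame contents are invented.

```python
import pandas as pd

# Toy frames; the image rows are deliberately in a different order.
annotations = pd.DataFrame({
    "id": [0, 1, 2],
    "image_id": [0, 1, 2],
    "category_id": [0, 0, 1],
})
images = pd.DataFrame({
    "id": [2, 0, 1],
    "file_name": [
        "trainval_images/1/0.jpg",
        "trainval_images/0/0.jpg",
        "trainval_images/0/1.jpg",
    ],
    "width": [500, 500, 500],
    "height": [375, 388, 375],
})

# Join each annotation to its image by key rather than by row position.
# validate="one_to_one" raises if a key is duplicated on either side.
merged = annotations.merge(
    images.rename(columns={"id": "image_id"}),
    on="image_id",
    how="left",
    validate="one_to_one",
)
print(merged[["image_id", "category_id", "file_name"]])
```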


## Labelled validation annotations

Showing the **annotation of labelled validation images** from the annotation file [anno_val.json](/content/annotation/annotation/anno_val.json).

In [None]:
import json
import pandas as pd

file = ANNOTATION_DIR + 'annotation/anno_val.json'

# Load the annotation file with Python's json module.
with open(file, 'r') as f:
    data = json.load(f)

# Flatten the "annotations" records into a DataFrame.
annotations_labelled_validation = pd.json_normalize(data, record_path=['annotations'])

annotations_labelled_validation

Unnamed: 0,image_id,id,category_id
0,0,0,0
1,1,1,0
2,2,2,0
3,3,3,0
4,4,4,0
...,...,...,...
1995,1995,1995,199
1996,1996,1996,199
1997,1997,1997,199
1998,1998,1998,199


Showing the **image annotations of the labelled validation images** from the annotation file [anno_val.json](/content/annotation/annotation/anno_val.json).

In [None]:
import json
import pandas as pd

file = ANNOTATION_DIR + 'annotation/anno_val.json'

# Load the annotation file with Python's json module.
with open(file, 'r') as f:
    data = json.load(f)

# Flatten the "images" records into a DataFrame.
images_annotations_labelled_validation = pd.json_normalize(data, record_path=['images'])

images_annotations_labelled_validation

Unnamed: 0,file_name,width,height,id
0,trainval_images/0/30.jpg,500,278,0
1,trainval_images/0/31.jpg,500,333,1
2,trainval_images/0/32.jpg,375,500,2
3,trainval_images/0/33.jpg,500,375,3
4,trainval_images/0/34.jpg,500,375,4
...,...,...,...,...
1995,trainval_images/199/11.jpg,500,375,1995
1996,trainval_images/199/12.jpg,500,333,1996
1997,trainval_images/199/13.jpg,500,333,1997
1998,trainval_images/199/14.jpg,500,333,1998


Concatenating the two DataFrames column-wise

In [None]:
validation_labelled = pd.concat([annotations_labelled_validation , images_annotations_labelled_validation.drop(['id'], axis = 1)], axis = 1)
validation_labelled

Unnamed: 0,image_id,id,category_id,file_name,width,height
0,0,0,0,trainval_images/0/30.jpg,500,278
1,1,1,0,trainval_images/0/31.jpg,500,333
2,2,2,0,trainval_images/0/32.jpg,375,500
3,3,3,0,trainval_images/0/33.jpg,500,375
4,4,4,0,trainval_images/0/34.jpg,500,375
...,...,...,...,...,...,...
1995,1995,1995,199,trainval_images/199/11.jpg,500,375
1996,1996,1996,199,trainval_images/199/12.jpg,500,333
1997,1997,1997,199,trainval_images/199/13.jpg,500,333
1998,1998,1998,199,trainval_images/199/14.jpg,500,333


## Unlabelled training in class annotations

Showing the **annotations of the unlabelled in-class images** from the annotation file [anno_u_train_in.json](/content/annotation/annotation/anno_u_train_in.json).

**NOTE - Since the images are unlabelled, every image is assigned a category id of -1.**

In [None]:
import json
import pandas as pd

file = ANNOTATION_DIR + 'annotation/anno_u_train_in.json'

# Load the annotation file with Python's json module.
with open(file, 'r') as f:
    data = json.load(f)

# Flatten the "annotations" records into a DataFrame.
annotations_unlabelled_inclass_training = pd.json_normalize(data, record_path=['annotations'])

annotations_unlabelled_inclass_training

Unnamed: 0,image_id,id,category_id
0,0,0,-1
1,1,1,-1
2,2,2,-1
3,3,3,-1
4,4,4,-1
...,...,...,...
26635,26635,26635,-1
26636,26636,26636,-1
26637,26637,26637,-1
26638,26638,26638,-1
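
The displayed rows only show the head and tail of the frame, so the claim that every `category_id` is -1 is worth checking directly. A minimal sketch of such a check follows, run on a toy frame that stands in for the real one.

```python
import pandas as pd

# Toy stand-in for the flattened "annotations" records of an unlabelled split.
unlabelled = pd.DataFrame({
    "id": [0, 1, 2],
    "image_id": [0, 1, 2],
    "category_id": [-1, -1, -1],
})

# True only if every row carries the placeholder label -1.
all_unlabelled = bool((unlabelled["category_id"] == -1).all())

# The distinct labels present, as a plain Python list.
distinct_labels = sorted(unlabelled["category_id"].unique().tolist())

print(all_unlabelled, distinct_labels)
```

The same two expressions could be applied to `annotations_unlabelled_inclass_training` in the notebook.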


Showing the **image annotations of the unlabelled in-class images** from the annotation file [anno_u_train_in.json](/content/annotation/annotation/anno_u_train_in.json).

In [None]:
import json
import pandas as pd

file = ANNOTATION_DIR + 'annotation/anno_u_train_in.json'

# Load the annotation file with Python's json module.
with open(file, 'r') as f:
    data = json.load(f)

# Flatten the "images" records into a DataFrame.
images_annotations_unlabelled_inclass_training = pd.json_normalize(data, record_path=['images'])

images_annotations_unlabelled_inclass_training

Unnamed: 0,file_name,width,height,id
0,u_train_in/0.jpg,375,500,0
1,u_train_in/1.jpg,375,500,1
2,u_train_in/2.jpg,375,500,2
3,u_train_in/3.jpg,380,245,3
4,u_train_in/4.jpg,500,333,4
...,...,...,...,...
26635,u_train_in/26635.jpg,500,375,26635
26636,u_train_in/26636.jpg,500,281,26636
26637,u_train_in/26637.jpg,500,394,26637
26638,u_train_in/26638.jpg,500,333,26638


Concatenating the two DataFrames column-wise

In [None]:
training_unlabelled_inclass = pd.concat([annotations_unlabelled_inclass_training , images_annotations_unlabelled_inclass_training.drop(['id'], axis = 1)], axis = 1)
training_unlabelled_inclass

Unnamed: 0,image_id,id,category_id,file_name,width,height
0,0,0,-1,u_train_in/0.jpg,375,500
1,1,1,-1,u_train_in/1.jpg,375,500
2,2,2,-1,u_train_in/2.jpg,375,500
3,3,3,-1,u_train_in/3.jpg,380,245
4,4,4,-1,u_train_in/4.jpg,500,333
...,...,...,...,...,...,...
26635,26635,26635,-1,u_train_in/26635.jpg,500,375
26636,26636,26636,-1,u_train_in/26636.jpg,500,281
26637,26637,26637,-1,u_train_in/26637.jpg,500,394
26638,26638,26638,-1,u_train_in/26638.jpg,500,333


## Unlabelled training out of class annotations

Showing the **annotations of the unlabelled out-of-class images** from the annotation file [anno_u_train_out.json](/content/annotation/annotation/anno_u_train_out.json).

**NOTE - Since the images are unlabelled, every image is assigned a category id of -1.**

In [None]:
import json
import pandas as pd

file = ANNOTATION_DIR + 'annotation/anno_u_train_out.json'

# Load the annotation file with Python's json module.
with open(file, 'r') as f:
    data = json.load(f)

# Flatten the "annotations" records into a DataFrame.
annotations_unlabelled_outclass_training = pd.json_normalize(data, record_path=['annotations'])

annotations_unlabelled_outclass_training

Unnamed: 0,image_id,id,category_id
0,0,0,-1
1,1,1,-1
2,2,2,-1
3,3,3,-1
4,4,4,-1
...,...,...,...
122203,122203,122203,-1
122204,122204,122204,-1
122205,122205,122205,-1
122206,122206,122206,-1


Showing the **image annotations of the unlabelled out-of-class images** from the annotation file [anno_u_train_out.json](/content/annotation/annotation/anno_u_train_out.json).

In [None]:
import json
import pandas as pd

file = ANNOTATION_DIR + 'annotation/anno_u_train_out.json'

# Load the annotation file with Python's json module.
with open(file, 'r') as f:
    data = json.load(f)

# Flatten the "images" records into a DataFrame.
images_annotations_unlabelled_outclass_training = pd.json_normalize(data, record_path=['images'])

images_annotations_unlabelled_outclass_training

Unnamed: 0,file_name,width,height,id
0,u_train_out/0.jpg,500,377,0
1,u_train_out/1.jpg,500,333,1
2,u_train_out/2.jpg,500,331,2
3,u_train_out/3.jpg,500,333,3
4,u_train_out/4.jpg,375,500,4
...,...,...,...,...
122203,u_train_out/122203.jpg,333,500,122203
122204,u_train_out/122204.jpg,500,333,122204
122205,u_train_out/122205.jpg,500,337,122205
122206,u_train_out/122206.jpg,500,298,122206


Concatenating the two DataFrames column-wise

In [None]:
training_unlabelled_outclass = pd.concat([annotations_unlabelled_outclass_training , images_annotations_unlabelled_outclass_training.drop(['id'], axis = 1)], axis = 1)
training_unlabelled_outclass

Unnamed: 0,image_id,id,category_id,file_name,width,height
0,0,0,-1,u_train_out/0.jpg,500,377
1,1,1,-1,u_train_out/1.jpg,500,333
2,2,2,-1,u_train_out/2.jpg,500,331
3,3,3,-1,u_train_out/3.jpg,500,333
4,4,4,-1,u_train_out/4.jpg,375,500
...,...,...,...,...,...,...
122203,122203,122203,-1,u_train_out/122203.jpg,333,500
122204,122204,122204,-1,u_train_out/122204.jpg,500,333
122205,122205,122205,-1,u_train_out/122205.jpg,500,337
122206,122206,122206,-1,u_train_out/122206.jpg,500,298


## Test annotations

Showing the **image annotations of the test images** from the annotation file [anno_test.json](/content/annotation/annotation/anno_test.json).

**NOTE - Since this is the test data, the annotation file contains no category annotations; those are what we have to predict.**

In [None]:
import json
import pandas as pd

file = ANNOTATION_DIR + 'annotation/anno_test.json'

# Load the annotation file with Python's json module.
with open(file, 'r') as f:
    data = json.load(f)

# Flatten the "images" records into a DataFrame.
images_annotations_test = pd.json_normalize(data, record_path=['images'])

images_annotations_test

Unnamed: 0,file_name,width,height,id
0,test/0.jpg,500,375,0
1,test/1.jpg,500,375,1
2,test/2.jpg,500,474,2
3,test/3.jpg,500,375,3
4,test/4.jpg,500,295,4
...,...,...,...,...
7995,test/7995.jpg,500,287,7995
7996,test/7996.jpg,500,333,7996
7997,test/7997.jpg,500,375,7997
7998,test/7998.jpg,500,333,7998


## Dataset Split into training and validation 

Splitting the [trainval_images](/content/trainval_images/trainval_images) dataset (which contains both the training and the validation images) into separate training and validation datasets, according to the **file_name** column of the concatenated **training_labelled** and **validation_labelled** annotation DataFrames.

### Training Split

Creating a separate directory named **train** for the training dataset, then copying the training image files from [trainval_images](/content/trainval_images/trainval_images) into the [train](/content/train/train) folder.

In [None]:
!mkdir -p train/train

In [None]:
import os

TRAIN_DIR = '/content/train/train/'

# Create one sub-directory per category id (0 to 199).
# range() avoids a hand-written list and no longer shadows the built-in `list`.
for category_id in range(200):
    train_category_dir = os.path.join(TRAIN_DIR, str(category_id))
    os.makedirs(train_category_dir, exist_ok=True)

In [None]:
source_path = '/content/trainval_images/trainval_images/'
destination_path = '/content/train/train/'

In [None]:
import shutil

# Strip the leading 'trainval_images/' so the paths are relative to source_path.
# regex=False treats the pattern as a literal string.
training = training_labelled['file_name'].str.replace('trainval_images/', '', regex=False)

for filename in training:
    source = os.path.join(source_path, filename)
    destination = os.path.join(destination_path, filename)
    shutil.copy(source, destination)

### Validation Split

Creating a separate directory named **val** for the validation dataset, then copying the validation image files from [trainval_images](/content/trainval_images/trainval_images) into the [val](/content/val/val) folder.

In [None]:
!mkdir -p val/val

In [None]:
import os

VAL_DIR = '/content/val/val/'

# Create one sub-directory per category id (0 to 199).
# range() avoids a hand-written list and no longer shadows the built-in `list`.
for category_id in range(200):
    val_category_dir = os.path.join(VAL_DIR, str(category_id))
    os.makedirs(val_category_dir, exist_ok=True)

In [None]:
source_path = '/content/trainval_images/trainval_images/'
destination_path = '/content/val/val/'

In [None]:
import shutil

# Strip the leading 'trainval_images/' so the paths are relative to source_path.
# regex=False treats the pattern as a literal string.
validation = validation_labelled['file_name'].str.replace('trainval_images/', '', regex=False)

for filename in validation:
    source = os.path.join(source_path, filename)
    destination = os.path.join(destination_path, filename)
    shutil.copy(source, destination)
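
The training and validation splits follow the same steps: create the class directories, strip the path prefix, and copy each file. Those steps could be folded into one reusable function, sketched below. The function name `copy_split` and its parameters are hypothetical, and the demo runs on placeholder files in a temporary directory instead of the real dataset.

```python
import shutil
import tempfile
from pathlib import Path


def copy_split(file_names, source_root, destination_root, prefix="trainval_images/"):
    """Copy '<prefix><class>/<name>' files to destination_root/<class>/<name>.

    Hypothetical helper for illustration; returns the number of files copied.
    """
    source_root = Path(source_root)
    destination_root = Path(destination_root)
    copied = 0
    for file_name in file_names:
        # Drop the prefix so the path is relative to source_root.
        relative = file_name[len(prefix):] if file_name.startswith(prefix) else file_name
        destination = destination_root / relative
        # Create the class directory on demand.
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(source_root / relative, destination)
        copied += 1
    return copied


# Demo: three placeholder images, two of which belong to the training split.
root = Path(tempfile.mkdtemp())
source = root / "trainval_images"
for relative in ["0/0.jpg", "0/30.jpg", "199/1.jpg"]:
    path = source / relative
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(b"placeholder")

train_files = ["trainval_images/0/0.jpg", "trainval_images/199/1.jpg"]
n_copied = copy_split(train_files, source, root / "train")
copied_files = sorted(
    p.relative_to(root / "train").as_posix() for p in (root / "train").rglob("*.jpg")
)
print(n_copied, copied_files)
```

With the notebook's variables, the calls would presumably look like `copy_split(training_labelled['file_name'], source_path, '/content/train/train/')` and the equivalent for the validation frame.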