# Introduction

The metadata for this competition is stored in JSON format.

We would like to process these JSON files so that we can easily build our training matrices.

To do this, we will read every JSON file and normalize its contents into dataframes.
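
As a minimal sketch of what normalizing JSON data means here, the example below runs `pd.json_normalize` on a tiny, made-up list of records. The record contents (including the nested `size` field) are hypothetical and only illustrate how nested keys become flat columns; the real metadata files may not contain nested fields at all.

```python
import pandas as pd

# Hypothetical records, loosely shaped like an "images" list in a metadata file.
records = [
    {"id": "img-1", "location": 3, "size": {"width": 1920, "height": 1080}},
    {"id": "img-2", "location": 20, "size": {"width": 1280, "height": 1024}},
]

# Each record becomes one row; nested dictionaries are flattened
# into dot-separated column names such as "size.width".
example_df = pd.json_normalize(records)
print(example_df.columns.tolist())
```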

# Data ingestion and processing


We will do all data ingestion and processing in a single loop.

In [1]:
import numpy as np
import pandas as pd
import os
import json

In [2]:
json_folder_path = "/kaggle/input/iwildcam2021-fgvc8/metadata"
list_of_files = os.listdir(json_folder_path)

for file_name in list_of_files:
    json_path = os.path.join(json_folder_path, file_name)
    print(f"Current json processed: {file_name}")
    # prepare the dataframe name prefix once per file, e.g.
    # "iwildcam2021_train_annotations.json" -> "train_annotations"
    file_name_split = file_name.split(".")[0].split("_")
    file_name_str = file_name_split[1] + "_" + file_name_split[2]
    with open(json_path) as json_file:
        # read each json
        json_data = json.load(json_file)
    # for each top-level item in the json
    for key, value in json_data.items():
        print(f"\tCurrent json item processed: {key} length: {len(value)}")
        data_frame_name = f"{file_name_str}_{key}_df"
        print(f"\tDynamic dataframe created: {data_frame_name}")
        # dynamic creation of a dataframe, using vars()[data_frame_name]
        # (at notebook top level, vars() is the global namespace)
        vars()[data_frame_name] = pd.json_normalize(value)
        # save the dataframe as a CSV file
        vars()[data_frame_name].to_csv(f"{data_frame_name}.csv", index=False)

Current json processed: iwildcam2021_megadetector_results.json
	Current json item processed: info length: 3
	Dynamic dataframe created: megadetector_results_info_df
	Current json item processed: images length: 263504
	Dynamic dataframe created: megadetector_results_images_df
	Current json item processed: detection_categories length: 2
	Dynamic dataframe created: megadetector_results_detection_categories_df
Current json processed: iwildcam2021_test_information.json
	Current json item processed: images length: 60214
	Dynamic dataframe created: test_information_images_df
Current json processed: iwildcam2021_train_annotations.json
	Current json item processed: images length: 203314
	Dynamic dataframe created: train_annotations_images_df
	Current json item processed: annotations length: 203314
	Dynamic dataframe created: train_annotations_annotations_df
	Current json item processed: categories length: 205
	Dynamic dataframe created: train_annotations_categories_df
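
Creating variables through `vars()` relies on the loop running at the top level of the notebook. A common alternative, sketched below, is to collect the dataframes in a dictionary keyed by the same generated names. The `json_data` content and the `frames` dictionary are hypothetical stand-ins for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for one parsed metadata file.
json_data = {
    "images": [{"id": "a", "width": 1280}, {"id": "b", "width": 1920}],
    "categories": [{"id": 0, "name": "empty"}],
}

# Store every normalized dataframe under its generated name.
frames = {}
for key, value in json_data.items():
    frames[f"train_annotations_{key}_df"] = pd.json_normalize(value)

print(list(frames))
```

A dataframe is then retrieved with, for example, `frames["train_annotations_images_df"]`, which also works unchanged inside functions.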


In [3]:
print(megadetector_results_images_df.shape)
megadetector_results_images_df.head()

(263504, 3)


Unnamed: 0,detections,id,max_detection_conf
0,"[{'category': '1', 'bbox': [0.6529, 0.5425, 0....",905a3c8c-21bc-11ea-a13a-137349068a90,0.999
1,"[{'category': '1', 'bbox': [0.0147, 0.0, 0.985...",905a3fc0-21bc-11ea-a13a-137349068a90,0.696
2,[],905a420e-21bc-11ea-a13a-137349068a90,0.0
3,"[{'category': '1', 'bbox': [0.0, 0.4669, 0.185...",905a4416-21bc-11ea-a13a-137349068a90,1.0
4,"[{'category': '1', 'bbox': [0.0, 0.0494, 0.528...",905a579e-21bc-11ea-a13a-137349068a90,0.999


Let's further process `megadetector_results_images_df.detections`.

Let's find the maximum number of detections per image across all the data.

In [4]:
megadetector_results_images_df['detections_count'] = megadetector_results_images_df["detections"].apply(len)

In [5]:
print(f"Max detections: {megadetector_results_images_df['detections_count'].max()}")

Max detections: 34


We will keep this data in this format for now.
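
Should a flat table be needed later, one possible approach is to produce one row per detection with `explode` and `pd.json_normalize`. The sketch below uses a small, made-up dataframe; the `conf` key and the `[x_min, y_min, width, height]` reading of `bbox` are assumptions about the detector output, since the preview above only shows truncated entries.

```python
import pandas as pd

# Hypothetical miniature of megadetector_results_images_df.
images_df = pd.DataFrame({
    "id": ["img-1", "img-2", "img-3"],
    "detections": [
        [{"category": "1", "conf": 0.999, "bbox": [0.65, 0.54, 0.10, 0.20]},
         {"category": "2", "conf": 0.400, "bbox": [0.00, 0.00, 0.50, 0.50]}],
        [],  # image without detections
        [{"category": "1", "conf": 0.700, "bbox": [0.01, 0.00, 0.98, 0.90]}],
    ],
})

# One row per detection; empty lists become NaN and are dropped.
exploded = images_df.explode("detections").dropna(subset=["detections"])

# Expand each detection dictionary into columns and keep the image id.
detections_df = pd.json_normalize(exploded["detections"].tolist())
detections_df.insert(0, "image_id", exploded["id"].to_numpy())

# Split the bbox list into four columns (assumed order, see the note above).
bbox_df = pd.DataFrame(
    detections_df["bbox"].tolist(),
    columns=["x_min", "y_min", "width", "height"],
)
detections_df = pd.concat([detections_df.drop(columns="bbox"), bbox_df], axis=1)
print(detections_df)
```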

In [6]:
print(megadetector_results_info_df.shape)
megadetector_results_info_df.head()

(1, 3)


Unnamed: 0,format_version,detector,detection_completion_time
0,1.0,megadetector_v3,2020-01-10 08:49:05


In [7]:
print(megadetector_results_detection_categories_df.shape)
megadetector_results_detection_categories_df.head()

(1, 2)


Unnamed: 0,2,1
0,person,animal
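
Because the detection categories arrive as a single wide row, a plain dictionary can be more convenient for translating the `category` codes found in the detections. A minimal sketch, using a hypothetical `categories_wide` frame rebuilt from the values shown above:

```python
import pandas as pd

# Hypothetical copy of megadetector_results_detection_categories_df.
categories_wide = pd.DataFrame([{"2": "person", "1": "animal"}])

# Turn the single row into a code -> label lookup.
category_map = categories_wide.iloc[0].to_dict()
print(category_map["1"])
```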


In [8]:
print(test_information_images_df.shape)
test_information_images_df.head()

(60214, 10)


Unnamed: 0,height,id,seq_id,location,width,datetime,file_name,seq_frame_num,seq_num_frames,sub_location
0,1024,8b31d3be-21bc-11ea-a13a-137349068a90,a91ebc18-0cd3-11eb-bed1-0242ac1c0002,20,1280,2013-06-09 16:01:38.000,8b31d3be-21bc-11ea-a13a-137349068a90.jpg,0,10,
1,1024,8cf202be-21bc-11ea-a13a-137349068a90,a91ebc18-0cd3-11eb-bed1-0242ac1c0002,20,1280,2013-06-09 16:01:39.000,8cf202be-21bc-11ea-a13a-137349068a90.jpg,1,10,
2,1024,8a87e62e-21bc-11ea-a13a-137349068a90,a91ebc18-0cd3-11eb-bed1-0242ac1c0002,20,1280,2013-06-09 16:01:40.000,8a87e62e-21bc-11ea-a13a-137349068a90.jpg,2,10,
3,1024,8e6994f4-21bc-11ea-a13a-137349068a90,a91ebc18-0cd3-11eb-bed1-0242ac1c0002,20,1280,2013-06-09 16:01:41.000,8e6994f4-21bc-11ea-a13a-137349068a90.jpg,3,10,
4,1024,948b29e2-21bc-11ea-a13a-137349068a90,a91ebc18-0cd3-11eb-bed1-0242ac1c0002,20,1280,2013-06-09 16:01:42.000,948b29e2-21bc-11ea-a13a-137349068a90.jpg,4,10,


In [9]:
print(train_annotations_images_df.shape)
train_annotations_images_df.head()

(203314, 10)


Unnamed: 0,seq_num_frames,location,datetime,id,seq_id,width,height,file_name,sub_location,seq_frame_num
0,6,3,2013-06-05 05:44:19.000,8b02698a-21bc-11ea-a13a-137349068a90,30048d32-7d42-11eb-8fb5-0242ac1c0002,1920,1080,8b02698a-21bc-11ea-a13a-137349068a90.jpg,0.0,0
1,6,3,2013-06-05 05:44:20.000,8e5b81de-21bc-11ea-a13a-137349068a90,30048d32-7d42-11eb-8fb5-0242ac1c0002,1920,1080,8e5b81de-21bc-11ea-a13a-137349068a90.jpg,0.0,1
2,6,3,2013-06-05 05:44:21.000,8c6be0e4-21bc-11ea-a13a-137349068a90,30048d32-7d42-11eb-8fb5-0242ac1c0002,1920,1080,8c6be0e4-21bc-11ea-a13a-137349068a90.jpg,0.0,2
3,6,3,2013-06-05 05:44:22.000,8fdf7998-21bc-11ea-a13a-137349068a90,30048d32-7d42-11eb-8fb5-0242ac1c0002,1920,1080,8fdf7998-21bc-11ea-a13a-137349068a90.jpg,0.0,3
4,6,3,2013-06-05 05:44:23.000,96093c50-21bc-11ea-a13a-137349068a90,30048d32-7d42-11eb-8fb5-0242ac1c0002,1920,1080,96093c50-21bc-11ea-a13a-137349068a90.jpg,0.0,4


In [10]:
print(train_annotations_annotations_df.shape)
train_annotations_annotations_df.head()

(203314, 3)


Unnamed: 0,id,image_id,category_id
0,a292dd3c-21bc-11ea-a13a-137349068a90,96b00332-21bc-11ea-a13a-137349068a90,73
1,a0afcfc0-21bc-11ea-a13a-137349068a90,879d74d8-21bc-11ea-a13a-137349068a90,4
2,a306e9c0-21bc-11ea-a13a-137349068a90,9017f7aa-21bc-11ea-a13a-137349068a90,227
3,9eed94c4-21bc-11ea-a13a-137349068a90,90d93c58-21bc-11ea-a13a-137349068a90,250
4,a2a4dd7a-21bc-11ea-a13a-137349068a90,887cd0ec-21bc-11ea-a13a-137349068a90,2


In [11]:
print(train_annotations_categories_df.shape)
train_annotations_categories_df.head()

(205, 2)


Unnamed: 0,id,name
0,0,empty
1,2,tayassu pecari
2,3,dasyprocta punctata
3,4,cuniculus paca
4,6,puma concolor
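
With the three training dataframes available, one way to move toward a training matrix is to join annotations with their images and category names. The sketch below uses tiny, made-up frames with the same column names as above; the `train_df` name, the suffixes and the `category_name` column are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniatures of the three training dataframes.
images_df = pd.DataFrame({
    "id": ["img-1", "img-2"],
    "location": [3, 3],
    "file_name": ["img-1.jpg", "img-2.jpg"],
})
annotations_df = pd.DataFrame({
    "id": ["ann-1", "ann-2"],
    "image_id": ["img-2", "img-1"],
    "category_id": [4, 0],
})
categories_df = pd.DataFrame({"id": [0, 4], "name": ["empty", "cuniculus paca"]})

train_df = (
    annotations_df
    # attach image metadata; both frames have an "id" column, hence the suffixes
    .merge(images_df, left_on="image_id", right_on="id",
           suffixes=("_annotation", "_image"))
    # attach the readable category name
    .merge(categories_df.rename(columns={"id": "category_id",
                                         "name": "category_name"}),
           on="category_id", how="left")
)
print(train_df[["file_name", "category_id", "category_name"]])
```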
