# Explore the dataset


In this notebook, we will perform an EDA (Exploratory Data Analysis) on the processed Waymo dataset (data in the `processed` folder). In the first part, you will create a function to display an image and its bounding boxes.

In [1]:
from utils import get_dataset

In [8]:
import os
import glob

import matplotlib
matplotlib.use('TkAgg')
            
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import matplotlib.image as mpimg

import numpy as np
from PIL import Image

import cv2
import tensorflow as tf

In [3]:
paths = glob.glob('data/waymo/training_and_validation/*')
i = 0
#filename = os.path.basename(paths)
print(paths[i])
dataset = get_dataset(paths[i])
print('-----------------------------------------------------------')
print(dataset)

data/waymo/training_and_validation/segment-1005081002024129653_5313_150_5333_150_with_camera_labels.tfrecord
INFO:tensorflow:Reading unweighted datasets: ['data/waymo/training_and_validation/segment-1005081002024129653_5313_150_5333_150_with_camera_labels.tfrecord']
INFO:tensorflow:Reading record datasets for input file: ['data/waymo/training_and_validation/segment-1005081002024129653_5313_150_5333_150_with_camera_labels.tfrecord']
INFO:tensorflow:Number of filenames to read: 1
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_deterministic`.
Instructions for updating:
Use `tf.data.Dataset.map()
-----------------------------------------------------------
<DatasetV1Adapter shapes: {image: (None, None, 3), source_id: (), key: (), filename: (), groundtruth_image_confidences: (None,), groundtruth_verified_neg_classes: (

The same `print(dataset)` output, reformatted for readability:

```
<
	DatasetV1Adapter shapes: 
	{
		image: (None, None, 3), 
		source_id: (), 
		key: (), 
		filename: (), 
		groundtruth_image_confidences: (None,), 
		groundtruth_verified_neg_classes: (None,), 
		groundtruth_not_exhaustive_classes: (None,), 
		groundtruth_boxes: (None, 4), 
		groundtruth_area: (None,), 
		groundtruth_is_crowd: (None,), 
		groundtruth_difficult: (None,), 
		groundtruth_group_of: (None,), 
		groundtruth_weights: (None,), 
		groundtruth_classes: (None,), 
		groundtruth_image_classes: (None,), 
		original_image_spatial_shape: (2,)
	}, 
	types: 
	{
		image: tf.uint8, 
		source_id: tf.string, 
		key: tf.string, 
		filename: tf.string, 
		groundtruth_image_confidences: tf.float32, 
		groundtruth_verified_neg_classes: tf.int64, 
		groundtruth_not_exhaustive_classes: tf.int64, 
		groundtruth_boxes: tf.float32, 
		groundtruth_area: tf.float32, 
		groundtruth_is_crowd: tf.bool, 
		groundtruth_difficult: tf.int64, 
		groundtruth_group_of: tf.bool, 
		groundtruth_weights: tf.float32, 
		groundtruth_classes: tf.int64, 
		groundtruth_image_classes: tf.int64, 
		original_image_spatial_shape: tf.int32
	}
>
```

## Write a function to display an image and the bounding boxes

Implement the `display_instances` function below. This function takes a batch as input and displays each image with its corresponding bounding boxes. The only requirement is that the classes be color coded (e.g., vehicles in red, pedestrians in blue, cyclists in green).

In [6]:
def display_instances(batch):
    """
    This function takes a batch from the dataset and displays each image with
    the associated bounding boxes.
    """
    # ADD CODE HERE

    ##### Color assignment
    # color for different classes (1: vehicle, 2: pedestrian, 4: cyclist)
    colormap = {1:'blue', 2:'green', 4:'red'}
    
    ##### Set up the subplot grid: 5 columns, enough rows for the batch, figure size (20, 10)
    num_col = 5
    num_row = (len(batch) + num_col -1) // num_col
    # squeeze=False keeps ax two-dimensional even when there is only one row
    f, ax = plt.subplots(num_row, num_col, figsize=(20, 10), squeeze=False)
    
    ##### Loop over the batch with index and data
    for idx, batch_data in enumerate(batch):
        ##### Extract the image data
        img = batch_data["image"]
        ##### Compute the subplot position (row x, column y)
        x = idx // num_col
        y = idx % num_col       
        
        ##### Show the image in its subplot
        ax[x, y].imshow(img)
        
        ##### Get the bounding boxes and classes
        gt_boxes = batch_data["groundtruth_boxes"]
        gt_classes = batch_data["groundtruth_classes"]
        ##### Loop over each object
        for bb, obj_class in zip(gt_boxes, gt_classes):
            ##### Get the normalized box corners and scale them to pixels:
            ##### y by the image height (shape[0]), x by the image width (shape[1])
            y1, x1, y2, x2 = bb
            y1 *= img.shape[0]
            x1 *= img.shape[1]
            y2 *= img.shape[0]
            x2 *= img.shape[1]
            ##### Create the rectangle patch for the bounding box
            rec = Rectangle((x1, y1), x2-x1, y2-y1, facecolor='none', edgecolor=colormap[obj_class])
            ##### Add the bounding box to the image
            ax[x, y].add_patch(rec)
    plt.tight_layout()
    plt.show()
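
The scaling step inside `display_instances` can be checked in isolation. The sketch below assumes, as the unpacking order in the function suggests, that each box is stored as normalized `[ymin, xmin, ymax, xmax]`. The helper name `to_pixel_rect` is hypothetical and not part of the notebook.

```python
def to_pixel_rect(box, height, width):
    """Convert a normalized [ymin, xmin, ymax, xmax] box to (x, y, w, h) in pixels.

    Assumption: coordinates are fractions of the image size in [0, 1].
    """
    ymin, xmin, ymax, xmax = box
    x = xmin * width    # x scales with the image width (shape[1])
    y = ymin * height   # y scales with the image height (shape[0])
    w = (xmax - xmin) * width
    h = (ymax - ymin) * height
    return x, y, w, h


# A box covering the right half of a 640 (high) x 1280 (wide) image
print(to_pixel_rect([0.0, 0.5, 1.0, 1.0], height=640, width=1280))
# (640.0, 0.0, 640.0, 640.0)
```

The returned tuple matches the `(x, y), width, height` arguments of `matplotlib.patches.Rectangle`. On square images a swapped height and width goes unnoticed, which is why a non-square test case is useful.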

## Display 10 images 

Using the dataset created in the second cell and the function you just coded, display 10 random images with the associated bounding boxes. You can use the methods `take` and `shuffle` on the dataset.

In [9]:
## STUDENT SOLUTION HERE

batch = dataset.shuffle(100).take(10)
display_instances(list(batch.as_numpy_iterator()))

The resulting figure is saved as the following image.  

![Display_10_images](00_report_data\Exploratory_Data_Analysis\Display_10_images.PNG)
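
Note that `shuffle(100)` draws from a buffer of 100 elements rather than from the whole dataset, so `shuffle(100).take(10)` can only return frames from near the start of the file. The following pure-Python sketch is a simplified model of that buffering, not TensorFlow's implementation; `buffered_shuffle` is a hypothetical name.

```python
import itertools
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]            # emit a random buffered element
        buffer[idx] = item           # and replace it with the incoming one
    rng.shuffle(buffer)              # input exhausted: drain what is left
    yield from buffer


# Take 10 elements from 1000 "frames" with a buffer of 100
sample = list(itertools.islice(buffered_shuffle(range(1000), 100, seed=0), 10))
print(sample)
print(max(sample) < 110)  # True: only early frames can appear
```

To sample uniformly across a whole file, the buffer would need to be at least as large as the number of frames in it.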

## Additional EDA

In this last part, you are free to perform any additional analysis of the dataset. What else would you like to know about the data?
For example, think about data distribution. So far, you have only looked at a single file...
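
One way to look beyond a single file is to repeat a per-file statistic over every path in `paths` and aggregate the results. The sketch below is an assumption about how that could be organized: `count_classes_across_files` and its `load_fn` parameter are hypothetical, and `load_fn` stands in for whatever yields the samples of one file as dictionaries with a `groundtruth_classes` entry. The demo uses made-up in-memory data instead of real tfrecords.

```python
from collections import Counter


def count_classes_across_files(paths, load_fn):
    """Return (total, per_file) class counts.

    load_fn(path) must yield samples shaped like {'groundtruth_classes': [...]}.
    """
    total = Counter()
    per_file = {}
    for path in paths:
        counts = Counter()
        for sample in load_fn(path):
            # int() also accepts NumPy integer scalars
            counts.update(int(c) for c in sample['groundtruth_classes'])
        per_file[path] = counts
        total.update(counts)
    return total, per_file


# Made-up stand-in data: two "files" with two and one samples
fake_files = {
    'a.tfrecord': [{'groundtruth_classes': [1, 1, 2]}, {'groundtruth_classes': [1]}],
    'b.tfrecord': [{'groundtruth_classes': [4, 1]}],
}
total, per_file = count_classes_across_files(fake_files, lambda p: fake_files[p])
print(total)
```

A `Counter` also records class IDs that were not anticipated, whereas a dictionary pre-filled with a fixed set of keys would raise a `KeyError` on an unexpected ID.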

In [10]:
##### Take 100 samples from the dataset
batch = dataset.shuffle(100).take(100)

In [11]:
##### Collect the images
def get_images(batch):    
    images = []
    for idx, batch_data in enumerate(batch):
        img = batch_data["image"]
        images.append(img)
    return images
    
##### Save the images as JPEG files
def save_jpg(images, save_dir='jpg_images'):
    os.makedirs(save_dir, exist_ok=True)
    for idx, img in enumerate(images):
        file_path = os.path.join(save_dir, 'image' + str(idx) + '.jpg')
        ##### encode_jpeg returns the encoded bytes as a string tensor,
        ##### so write them with tf.io.write_file rather than cv2.imwrite
        encoded = tf.image.encode_jpeg(img, format='rgb')
        tf.io.write_file(file_path, encoded)

In [12]:
##### main
value_range = (0, 255)  # renamed so the built-in range() is not shadowed
save_dir='jpg_images'
images = get_images(batch)
#save_jpg(images, save_dir)

##### Debug
#plt.imshow(images[0])
#plt.show()
print(type(images[0]))
print(len(images))

<class 'tensorflow.python.framework.ops.EagerTensor'>
100


In [13]:
def pil2cv(image):
    ''' PIL image -> OpenCV image '''
    new_image = np.array(image, dtype=np.uint8)
    if new_image.ndim == 2:  # grayscale
        pass
    elif new_image.shape[2] == 3:  # color
        new_image = cv2.cvtColor(new_image, cv2.COLOR_RGB2BGR)
    elif new_image.shape[2] == 4:  # with alpha channel
        new_image = cv2.cvtColor(new_image, cv2.COLOR_RGBA2BGRA)
    return new_image

In [14]:
##### Load the JPEG images
def open_jpg_images(image_dir):
    images = glob.glob(image_dir)
    jpg_images = [mpimg.imread(x) for x in images]
    return jpg_images

##### Show a histogram
def show_histogram(target_type, images, range=(0, 255)):
#    images = [mpimg.imread(x) for x in images]
#    images = [cv2.cvtColor(x, cv2.COLOR_RGB2BGR) for x in images]
    plot_data = [target_type(img) for img in images]
    plt.hist(plot_data, range=range, bins=20)
    plt.show()

def red_mean(img):
    return img[...,0].numpy().mean()

def green_mean(img):
    return img[...,1].numpy().mean()

def blue_mean(img):
    return img[...,2].numpy().mean()

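def bright_value_mean(img):
    # pil2cv returns a BGR image, so convert from BGR (not RGB) to HSV
    img = pil2cv(img)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    return img[..., 2].mean()

def hue_mean(img):
    # pil2cv returns a BGR image, so convert from BGR (not RGB) to HSV
    img = pil2cv(img)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    return img[..., 0].mean()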
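
The hue statistic depends on the channel order handed to the color conversion, because OpenCV images are BGR while the notebook treats the dataset images as RGB. The sketch below uses the standard-library `colorsys` module to mimic OpenCV's 8-bit HSV scaling (hue in 0 to 179, saturation and value in 0 to 255). `hsv_8bit` is a hypothetical helper, not an OpenCV function.

```python
import colorsys


def hsv_8bit(r, g, b):
    """Convert 8-bit RGB to HSV using OpenCV's 8-bit ranges (H: 0-179, S/V: 0-255)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return int(round(h * 180)) % 180, int(round(s * 255)), int(round(v * 255))


blue = (0, 0, 255)                    # a pure blue pixel as (R, G, B)
print(hsv_8bit(*blue))                # (120, 255, 255)
print(hsv_8bit(*reversed(blue)))      # (0, 255, 255): channels swapped, blue reads as red
```

The value channel is the maximum of the three color channels, so it is unaffected by the channel order. Only the hue changes when the channels are swapped.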

In [15]:
image_dir = 'jpg_images/*.jpg'
jpg_images = open_jpg_images(image_dir)

In [16]:
show_histogram(red_mean, images)

![red_histogram](00_report_data\Exploratory_Data_Analysis\red_histogram.png)

In [17]:
show_histogram(green_mean, images)

![green_histogram](00_report_data\Exploratory_Data_Analysis\green_histogram.png)

In [18]:
show_histogram(blue_mean, images)

![blue_histogram](00_report_data\Exploratory_Data_Analysis\blue_histogram.png)

In [19]:
show_histogram(bright_value_mean, images)

![bright_value_mean_histogram](00_report_data\Exploratory_Data_Analysis\bright_value_mean_histogram.png)

In [20]:
show_histogram(hue_mean, images)

![hue_histogram](00_report_data\Exploratory_Data_Analysis\hue_histogram.png)

In [21]:
##### Investigate the number of objects per class

In [22]:
##### Count the number of objects per class
def cnt_object_per_class(dataset):
    ##### Per-class object counters (1: vehicle, 2: pedestrian, 4: cyclist)
    obj_cnt_per_class = {1:0, 2:0, 4:0}

    for data in dataset.take(20000):
        for gt_c in data['groundtruth_classes'].numpy():
            obj_cnt_per_class[gt_c] += 1   
    return obj_cnt_per_class

In [23]:
# distributing data in bar graph
def display_object_per_class(dataset):
    ##### Get the number of objects per class
    obj_cnt_per_class = cnt_object_per_class(dataset)
    
    ##### Map class names to object counts
    obj_per_classes = {'vehicles':obj_cnt_per_class[1], 'pedestrians':obj_cnt_per_class[2],'cyclists':obj_cnt_per_class[4]}
    classes_name = list(obj_per_classes.keys())

    ##### Collect the object counts
    num_of_object = [obj_per_classes[c] for c in classes_name]

    ##### Create the figure
    fig = plt.figure(figsize=(10,5))

    ##### Configure and show the bar chart
    plt.bar(classes_name,num_of_object,color=['blue','green','red'],width=0.4)
    plt.xlabel("classes_name")
    plt.ylabel("num_of_object")
    plt.title("distribution of num of object per classes")
    plt.show()

In [24]:
display_object_per_class(dataset)

![object_per_class](00_report_data\Exploratory_Data_Analysis\object_per_class.png)