# Explore the dataset


In this notebook, we will perform an EDA (Exploratory Data Analysis) on the processed Waymo dataset (data in the `processed` folder). In the first part, you will create a function to display images and their bounding boxes.

In [None]:
import os
import glob
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import cv2
tf.get_logger().setLevel("ERROR") #Turn off annoying tf INFO messages 
from tqdm.notebook import trange
from IPython import display
from collections import Counter

from utils import get_dataset

import exploratory_data_analysis as eda

In [None]:
#Set up matplotlib
%matplotlib inline
plt.rcParams["figure.figsize"] = [18, 12]

#Set the random seed so we can produce reproducible results when we want them
np.random.seed(1)

#Initialize GroundTruthAnnotator we will use for the exercises below
visualizer = eda.GroundTruthAnnotator()

## Write a function to display an image and the bounding boxes

Implement the `display_instances` function below. This function takes a batch as input and displays an image with its corresponding bounding boxes. The only requirement is that the classes should be color coded (e.g., vehicles in red, pedestrians in blue, cyclists in green).

In [None]:
## STUDENT SOLUTION HERE
# See GroundTruthAnnotator.batch_annotate_ground_truth() in
# exploratory_data_analysis.py, which completes the same functionality as
# display_instances()
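
For readers who do not have `exploratory_data_analysis.py` at hand, here is a minimal sketch of the color-coding idea using only NumPy. It is not the `GroundTruthAnnotator` implementation. The function name `draw_boxes`, the class ids (1 = vehicle, 2 = pedestrian, 4 = cyclist), and the normalized `[ymin, xmin, ymax, xmax]` box layout are all assumptions, so adjust them to match your data.

```python
import numpy as np

# Assumed class ids: 1 = vehicle (red), 2 = pedestrian (blue), 4 = cyclist (green).
CLASS_COLORS = {1: (255, 0, 0), 2: (0, 0, 255), 4: (0, 255, 0)}


def draw_boxes(image, boxes, classes, thickness=2):
    """Return a copy of an RGB uint8 image with color-coded box outlines.

    `boxes` is assumed to hold normalized [ymin, xmin, ymax, xmax] rows.
    """
    annotated = image.copy()
    height, width = annotated.shape[:2]
    t = thickness
    for (ymin, xmin, ymax, xmax), class_id in zip(boxes, classes):
        # Unknown classes fall back to white
        color = CLASS_COLORS.get(int(class_id), (255, 255, 255))
        # Convert normalized coordinates to pixel indices inside the image
        y0, y1 = np.clip([int(ymin * height), int(ymax * height)], 0, height - 1)
        x0, x1 = np.clip([int(xmin * width), int(xmax * width)], 0, width - 1)
        annotated[y0:y0 + t, x0:x1 + 1] = color                  # top edge
        annotated[max(y1 - t + 1, 0):y1 + 1, x0:x1 + 1] = color  # bottom edge
        annotated[y0:y1 + 1, x0:x0 + t] = color                  # left edge
        annotated[y0:y1 + 1, max(x1 - t + 1, 0):x1 + 1] = color  # right edge
    return annotated
```

The returned array can be passed straight to `plt.imshow`. Drawing into a copy keeps the original image untouched for later statistics.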

## Display 10 images 

Using the dataset created in the second cell and the function you just coded, display 10 random images with the associated bounding boxes. You can use the methods `take` and `shuffle` on the dataset.

In [None]:
## STUDENT SOLUTION HERE
dataset = get_dataset("data/train/segment-1005081002024129653_5313_150_5333_150_with_camera_labels.tfrecord")
dataset = dataset.shuffle(10)
batch = dataset.take(10)
mosaic_img_path = visualizer.batch_annotate_ground_truth(batch)
print(mosaic_img_path)
img = cv2.imread(mosaic_img_path)
plt.imshow(img)
plt.axis("off")
plt.show()

## Additional EDA

In this last part, you are free to perform any additional analysis of the dataset. What else would you like to know about the data?
For example, think about data distribution. So far, you have only looked at a single file...

**-------- STUDENT SOLUTION BELOW --------**

### Compute Statistical Analysis on Dataset

First, let's process the provided data to decide how we want to split it into training and validation sets. Let's pull a batch of images from each TFRecord file, annotate the images with ground truth labels, and calculate various statistics about class frequency, physical characteristics, image characteristics, etc.

In [None]:
BATCH_SIZE = 10

tfrecord_paths = glob.glob("data/train/*.tfrecord")

mosaic_paths = []
train_and_val_stats = eda.StatisticsAggregator()

for tfrecord_idx in trange(len(tfrecord_paths)):
    #Get a batch from this tfrecord file
    tfrecord_path = tfrecord_paths[tfrecord_idx]
    dataset = get_dataset(tfrecord_path)
    dataset = dataset.shuffle(BATCH_SIZE)
    batch = dataset.take(BATCH_SIZE)

    #Create a ground truth visualization mosaic for this batch
    basename = os.path.basename(tfrecord_path)
    mosaic_path = visualizer.batch_annotate_ground_truth(batch, basename)
    mosaic_paths.append(mosaic_path) 

    #Calculate statistics for this batch and add to the aggregated statistics
    #for the combined training and validation sets (we will figure out how we
    #want to split them later)
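    batch = list(batch.as_numpy_iterator()) #FIXME: call this in StatisticsAggregator
    #Aggregate once per batch (the loop variable was unused, so the old loop
    #re-added the same batch once for every element in it)
    train_and_val_stats.calculateAllStats(batch)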

#Save out the min, median, and max images of the inspected dataset for various
#image attributes. FIXME: remove
train_and_val_stats.img_examples["mean_brightness"].saveExampleImages("mean_brightness")
train_and_val_stats.img_examples["contrast"].saveExampleImages("contrast")
train_and_val_stats.img_examples["sharpness"].saveExampleImages("sharpness")
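
To make the class-frequency part of the aggregation concrete, here is a small sketch of how per-image class counts could be collected with `Counter`. It is not the `StatisticsAggregator` code. The helper name `count_classes_per_image`, the `LABELS` map, and the `"groundtruth_classes"` key are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical label map; the notebook's real one lives in eda.CLASS_TO_LABEL_MAP.
LABELS = {1: "vehicle", 2: "pedestrian", 4: "cyclist"}


def count_classes_per_image(batch, labels=LABELS):
    """Return {class_id: [count in image 0, count in image 1, ...]}.

    Each element of `batch` is assumed to be a dict holding a
    "groundtruth_classes" sequence of integer class ids.
    """
    freqs = {class_id: [] for class_id in labels}
    for elem in batch:
        counts = Counter(int(c) for c in elem["groundtruth_classes"])
        for class_id in labels:
            # A Counter returns 0 for classes that are absent from this image
            freqs[class_id].append(counts[class_id])
    return freqs
```

Recording a zero for absent classes matters: without it, the frequency histograms would only describe images that contain the class at all.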
          

### Analyzing Image Characteristics

In this section, we will analyze the visual characteristics of the images in the dataset, without regard to the classes in the images.

First, let's just get a quick qualitative visual idea of what the dataset looks like. We'll display annotated mosaics for 10 batches of images (images in one batch all come from the same trip).

In [None]:
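#Sample without replacement so the same mosaic is not shown twice
imgs_to_show = np.random.choice(mosaic_paths, min(10, len(mosaic_paths)), replace=False)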
for img_path in imgs_to_show:
    img = cv2.imread(img_path)
    plt.imshow(img)
    plt.axis("off")
    plt.tight_layout()
    plt.show()

Next, let's get an idea of the range of images encountered across several different attributes such as image mean brightness, image contrast, and image sharpness. This will give us insight into the range of visual conditions encountered while driving (e.g. dark because of night, low contrast because of fog, blurry because of raindrops on the lens or motion blur). We will display the images representing the minimum, median, and maximum for each attribute in order to get a visual qualitative sense of the range which exists in the dataset:

In [None]:
for attribute_name, example in train_and_val_stats.img_examples.items():
    imgs = [example.min_img, example.median_img, example.max_img]
    values = {"min": example.min_value, 
              "median": example.median_value, 
              "max": example.max_value}
    fig, axes = plt.subplots(1, 3)
    for j, key in enumerate(values.keys()):
        axes[j].imshow(imgs[j])
        axes[j].axis("off")
        axes[j].set_title("%s %s = %.0f" % (key, attribute_name, values[key]))
    fig.tight_layout()
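
The definitions of these three attributes live in `exploratory_data_analysis.py` and are not shown in this notebook. As a rough sketch of one common way to compute them (an assumption, not necessarily what `StatisticsAggregator` does), brightness can be taken as the mean gray level, contrast as the standard deviation of the gray levels (RMS contrast), and sharpness as the variance of a Laplacian-filtered image:

```python
import numpy as np


def to_gray(image):
    """Convert an H x W x 3 RGB image to float grayscale (BT.601 luma weights)."""
    return image[..., :3].astype(np.float64) @ np.array([0.299, 0.587, 0.114])


def mean_brightness(image):
    """Average gray level; low values suggest a dark (e.g. night) image."""
    return float(to_gray(image).mean())


def contrast(image):
    """RMS contrast: the standard deviation of the gray levels."""
    return float(to_gray(image).std())


def sharpness(image):
    """Variance of a 4-neighbor Laplacian over interior pixels.

    Blurry images have weak edges, so their Laplacian response varies little.
    """
    g = to_gray(image)
    laplacian = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
                 - 4.0 * g[1:-1, 1:-1])
    return float(laplacian.var())
```

All three functions return a single float per image, which makes it easy to sort images by an attribute and pick the minimum, median, and maximum examples.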

Next, let's get a more complete quantitative insight into the brightness, sharpness, and contrast across the whole dataset by plotting histograms for each attribute: 

In [None]:
%matplotlib inline
for attribute_name in train_and_val_stats.img_attributes:
    plt.title("img " + attribute_name)
    plt.hist(train_and_val_stats.img_statistics[attribute_name], bins="auto")
    plt.ylabel("Frequency")
    plt.show()

### Analyzing Class Characteristics

In this section, we will analyze the characteristics of the ground truth classes found in the images, e.g. how frequently each class appears, its size, and where it tends to appear in the image.
 

#### Class Frequency

First, let's get a sense of how common each class is in the images by plotting a histogram of class frequency per image for each class:

In [None]:
for class_idx in eda.CLASS_TO_LABEL_MAP.keys():
    #Integer bin edges 0..max+1 so every count, including the max, gets its own bin
    bins = list(range(max(train_and_val_stats.class_freqs[class_idx]) + 2))
    plt.hist(train_and_val_stats.class_freqs[class_idx], bins, log=True)
    label = eda.CLASS_TO_LABEL_MAP[class_idx]
    plt.title("%s frequency histogram" % label)
    plt.xlabel("number of %ss in image" % label)
    plt.ylabel("frequency")
    plt.show()
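
Because per-image counts are integers, the choice of bin edges decides whether each count gets its own bar. Here is a small sketch with `np.histogram`, where the helper name `count_histogram` is hypothetical: edges running from 0 to max + 1 put each value `k` in the bin `[k, k + 1)`, with the last bin also including its right edge.

```python
import numpy as np


def count_histogram(counts):
    """Histogram integer counts so that every value 0..max gets its own bin."""
    counts = np.asarray(counts)
    # Edges 0, 1, ..., max + 1: bin k covers [k, k + 1) and so holds only value k
    edges = np.arange(0, counts.max() + 2)
    hist, _ = np.histogram(counts, bins=edges)
    return hist, edges
```

The same `edges` array can be passed as the `bins` argument of `plt.hist`.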

#### Class Size

Next, let's get a sense of how large the classes appear in the images by plotting a histogram of their bounding box widths and heights:

In [None]:
%matplotlib inline
for class_idx in eda.CLASS_TO_LABEL_MAP.keys():
    fig, axes = plt.subplots(1, 2)
    axes[0].hist(train_and_val_stats.class_widths[class_idx], bins="auto", log=True)
    axes[0].set_title("width")
    axes[0].set_ylabel("frequency")

    axes[1].hist(train_and_val_stats.class_heights[class_idx], bins="auto", log=True)
    axes[1].set_title("height")

    plt.suptitle("%s size" % eda.CLASS_TO_LABEL_MAP[class_idx])
    plt.tight_layout()
    plt.show()
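
As a sketch of where such widths and heights could come from, assume the boxes are stored as normalized `[ymin, xmin, ymax, xmax]` rows. That layout is an assumption, and `box_sizes` is a hypothetical helper rather than part of `eda`. The pixel sizes then follow from scaling the coordinate differences by the image dimensions:

```python
import numpy as np


def box_sizes(boxes, image_height, image_width):
    """Return (widths, heights) in pixels for normalized [ymin, xmin, ymax, xmax] boxes."""
    # reshape(-1, 4) also handles an image without any boxes
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    heights = (boxes[:, 2] - boxes[:, 0]) * image_height
    widths = (boxes[:, 3] - boxes[:, 1]) * image_width
    return widths, heights
```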

#### Class Locations in Images

Finally, let's get a sense of where the classes tend to be located in the images by plotting a scatter plot of the bounding box centers. We will let the points be semi-transparent to help us get a sense of density.

In [None]:
%matplotlib inline
for class_idx in eda.CLASS_TO_LABEL_MAP.keys():
    xs, ys = zip(*train_and_val_stats.class_bounding_box_centers[class_idx])

    plt.scatter(xs, ys, alpha=0.01)
    plt.xlim(0, 1)
    plt.ylim(0, 1)

    label = eda.CLASS_TO_LABEL_MAP[class_idx]
    plt.title("%s locations in image" % label)
    plt.show()
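
Semi-transparent scatter points give a visual impression of density. A 2D histogram of the same centers gives explicit counts per region. Below is a minimal sketch, again assuming normalized `[ymin, xmin, ymax, xmax]` boxes; `box_centers` and `center_density` are hypothetical helpers, not functions from `eda`.

```python
import numpy as np


def box_centers(boxes):
    """Return (xs, ys) centers for normalized [ymin, xmin, ymax, xmax] boxes."""
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    xs = (boxes[:, 1] + boxes[:, 3]) / 2.0
    ys = (boxes[:, 0] + boxes[:, 2]) / 2.0
    return xs, ys


def center_density(xs, ys, bins=20):
    """Count box centers on a bins x bins grid over the unit square.

    The result is indexed as density[x_bin, y_bin].
    """
    density, _, _ = np.histogram2d(xs, ys, bins=bins, range=[[0, 1], [0, 1]])
    return density
```

Since the grid is indexed `[x_bin, y_bin]`, pass the transposed array (`density.T`) to `plt.imshow` so that rows correspond to y.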