# xView Vehicle Object Detection Data Prep

This notebook prepares data for training an object detection model on the xView dataset.

Once you have downloaded the training images and labels from the xView competition site, unzip them and upload the contents of each zip file to an S3 bucket that you have read/write access to. Provide the URI of this bucket in the cell below. This is the only setup required to run this notebook.

In [None]:
base_dir = "s3://raster-vision-xview-example"

The steps we'll take to prepare the data are as follows:

- Download the xView object detection labels from S3
- Filter out all of the non-vehicle bounding boxes from the labels. Combine all vehicle types into one class.
- Subset the entire xView dataset to only include the images that are most densely populated with vehicles.
- Split the selected images randomly into 80%/20% training and validation sets
- Split the vehicle labels by image, save off a label GeoJSON file per image, and upload to S3


This process saves the split labels to S3 and writes a `train_scenes.csv` and a `val_scenes.csv`, which are used by the experiment at `xview.object_detection`.

In [None]:
import os
import json
import random
import copy

### Get the xView label data from S3

In [None]:
label_path = '/opt/data/xview/labels/xView_train.geojson'
remote_label_path = os.path.join(base_dir, 'xView_train.geojson')
if not os.path.exists(label_path):
    !aws s3 cp $remote_label_path $label_path

### Filter out non-vehicle labels

The xView dataset includes labels for a number of different types of objects. We are only interested in building a detector for objects that can be categorized as vehicles (e.g. 'small car', 'passenger vehicle', 'bus'). We have pre-determined the ids that map to vehicle labels and will use them to extract all the vehicles from the whole xView label set. In this section we also assign a class name of 'vehicle' to all of the resulting labels.

In [None]:
vehicle_type_ids = [17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 32, 
                    53, 54, 55, 56, 57, 59, 60, 61, 62, 63, 64, 65, 66]

In [None]:
# Parse the label GeoJSON directly from the file handle
with open(label_path) as f:
    label_js = json.load(f)

In [None]:
vehicle_features = []
for f in label_js['features']:
    if f['properties']['type_id'] in vehicle_type_ids:
        f['properties']['class_name'] = 'vehicle'
        vehicle_features.append(f)
label_js['features'] = vehicle_features
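
As a quick sanity check on the filtering step, the sketch below applies the same `type_id` test to a small made-up feature collection and tallies which type ids survive. The names `toy_labels` and `toy_vehicle_ids` are hypothetical stand-ins for `label_js` and `vehicle_type_ids`, and the toy features carry only the properties used here.

```python
from collections import Counter

# Toy stand-in for the xView label GeoJSON (real features have more fields).
toy_labels = {
    "type": "FeatureCollection",
    "features": [
        {"properties": {"type_id": 18, "image_id": "a.tif"}},
        {"properties": {"type_id": 73, "image_id": "a.tif"}},
        {"properties": {"type_id": 19, "image_id": "b.tif"}},
        {"properties": {"type_id": 18, "image_id": "b.tif"}},
    ],
}

# Hypothetical subset of vehicle_type_ids; a set gives fast membership tests.
toy_vehicle_ids = {18, 19}

# Keep only the features whose type_id is a vehicle id.
kept = [
    f for f in toy_labels["features"]
    if f["properties"]["type_id"] in toy_vehicle_ids
]

# Tally the surviving type ids to confirm nothing unexpected slipped through.
kept_type_counts = Counter(f["properties"]["type_id"] for f in kept)
print(kept_type_counts)
```

Running the same tally on the real `vehicle_features` list would show how many labels each vehicle type contributes.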

### Subset images with the most vehicles

In this section we determine which images contain the most vehicles and are therefore the best candidates for this experiment.

In [None]:
# Count the vehicle labels in each image
image_to_vehicle_counts = {}
for f in label_js['features']:
    image_id = f['properties']['image_id']
    image_to_vehicle_counts[image_id] = image_to_vehicle_counts.get(image_id, 0) + 1

In [None]:
experiment_image_count = round(len(image_to_vehicle_counts.keys()) * 0.1)
sorted_images_and_counts = sorted(image_to_vehicle_counts.items(), key=lambda x: x[1])
selected_images_and_counts = sorted_images_and_counts[-experiment_image_count:]
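
The counting and selection cells above can also be expressed with `collections.Counter`. The following is a minimal sketch on made-up data, not the notebook's own method; `densest_images`, `toy_features`, and the `fraction` parameter are hypothetical names introduced for illustration.

```python
from collections import Counter

def densest_images(features, fraction):
    """Return (image_id, count) pairs for the most densely labeled images."""
    # Count the labels attached to each image.
    counts = Counter(f["properties"]["image_id"] for f in features)
    # Number of images to keep, rounded like the notebook's selection step.
    keep = round(len(counts) * fraction)
    # most_common returns the highest counts first.
    return counts.most_common(keep)

# Four toy images with 5, 2, 9 and 1 labels respectively.
toy_image_ids = ["a.tif"] * 5 + ["b.tif"] * 2 + ["c.tif"] * 9 + ["d.tif"] * 1
toy_features = [{"properties": {"image_id": i}} for i in toy_image_ids]

top = densest_images(toy_features, fraction=0.5)
print(top)
```

Note that this returns the images in descending order of count, whereas the sorted-and-sliced list in the notebook is ascending.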

### Split into train and validation

Split up training and validation data. Use 80% of images in the training set and 20% in the validation set.

In [None]:
ratio = 0.8
training_sample_size = round(ratio * experiment_image_count)
train_sample = random.sample(range(experiment_image_count), training_sample_size)

train_images = []
test_images = []

In [None]:
# Visit every selected image; indices not in train_sample go to validation
for i in range(experiment_image_count):
    img = selected_images_and_counts[i][0]
    img_uri = os.path.join(base_dir, 'train_images', img)
    if i in train_sample:
        train_images.append(img_uri)
    else:
        test_images.append(img_uri)                
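
Because the split is random, rerunning the notebook produces different training and validation sets. If reproducibility matters, one option is a seeded shuffle, sketched below under the assumption that a fixed seed is acceptable for your experiment. The helper `split_train_val`, its `seed` parameter, and the `example-bucket` URIs are hypothetical.

```python
import random

def split_train_val(items, train_ratio=0.8, seed=0):
    """Shuffle items with a fixed seed and cut them into two disjoint lists."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(train_ratio * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Ten made-up image URIs standing in for the selected xView images.
toy_uris = [
    "s3://example-bucket/train_images/{}.tif".format(i) for i in range(10)
]
toy_train, toy_val = split_train_val(toy_uris)
print(len(toy_train), len(toy_val))
```

Every item lands in exactly one of the two lists, so the validation set is always the complement of the training set.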

### Divide labels up by image

Using a single vehicle label GeoJSON for all of the training and validation images can become unwieldy. Instead, we divide the labels so that each image has its own GeoJSON file. We save each of these files locally and then upload them to the base S3 directory you provided at the outset.

Finally, we create the CSV files that the experiment uses to load the training and validation data.

In [None]:
processed_labels_dir = '/opt/data/xview/processed_labels/'

In [None]:
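def subset_labels(tiff_list, processed_labels_dir):
    # Make sure the local output directory exists before writing to it
    os.makedirs(processed_labels_dir, exist_ok=True)

    def labels_for_image(tiff_uri):
        # Collect the features whose image_id matches this image's file name
        tiff_basename = os.path.basename(tiff_uri)
        tiff_features = []
        for l in label_js['features']:
            image_id = l['properties']['image_id']
            if image_id == tiff_basename:
                tiff_features.append(l)
        # Shallow copy keeps the top-level GeoJSON keys and swaps in the subset
        labels_subset = copy.copy(label_js)
        labels_subset['features'] = tiff_features
        return labels_subset

    # Write one GeoJSON file per image in tiff_list
    for i in tiff_list:
        basename = os.path.splitext(os.path.basename(i))[0]
        tiff_geojson = labels_for_image(i)
        with open(os.path.join(processed_labels_dir, '{}.geojson'.format(basename)), 'w') as file:
            file.write(json.dumps(tiff_geojson, indent=4))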

In [None]:
subset_labels(train_images, processed_labels_dir)
subset_labels(test_images, processed_labels_dir)
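
`subset_labels` rescans the full feature list once per image. If that becomes slow on the full label set, the features could instead be grouped by `image_id` in a single pass. This is a sketch of that alternative on toy data, not a change to the cells above; `group_features_by_image` and `toy_features` are hypothetical names.

```python
from collections import defaultdict

def group_features_by_image(features):
    """Map each image_id to the list of features that belong to it."""
    grouped = defaultdict(list)
    for feature in features:
        grouped[feature["properties"]["image_id"]].append(feature)
    return grouped

# Toy features carrying only the property used for grouping.
toy_features = [
    {"properties": {"image_id": "a.tif", "type_id": 18}},
    {"properties": {"image_id": "b.tif", "type_id": 19}},
    {"properties": {"image_id": "a.tif", "type_id": 21}},
]

by_image = group_features_by_image(toy_features)
for image_id, image_features in sorted(by_image.items()):
    print(image_id, len(image_features))
```

Each per-image list could then be written out as the `features` member of that image's GeoJSON.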

In [None]:
!aws s3 cp --recursive $processed_labels_dir $base_dir

In [None]:
def create_csv(images, csv_name):
    csv_rows = []
    for img in images:
        basename = os.path.splitext(os.path.basename(img))[0]
        labels_path = os.path.join(base_dir,'{}.geojson'.format(basename))
        csv_rows.append('"{}","{}"'.format(img, labels_path))
    with open('/opt/data/xview/{}.csv'.format(csv_name), 'w') as f:
        f.write('\n'.join(csv_rows))

In [None]:
create_csv(train_images, 'train_scenes')
create_csv(test_images, 'val_scenes')
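
Each CSV row pairs an image URI with its label URI, both quoted. The sketch below writes rows in that two-column shape with Python's `csv` module and reads them back, which is one way to check that a scene file parses as expected. It uses a temporary directory and made-up `example-bucket` URIs rather than the notebook's `/opt/data/xview` paths.

```python
import csv
import os
import tempfile

# Made-up (image URI, label URI) pairs in the same shape create_csv writes.
toy_rows = [
    ("s3://example-bucket/train_images/1.tif", "s3://example-bucket/1.geojson"),
    ("s3://example-bucket/train_images/2.tif", "s3://example-bucket/2.geojson"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_scenes.csv")

    # QUOTE_ALL wraps every field in double quotes, as the notebook's rows do.
    with open(csv_path, "w", newline="") as f:
        csv.writer(f, quoting=csv.QUOTE_ALL).writerows(toy_rows)

    # Read the file back and confirm each row has an image and a label URI.
    with open(csv_path, newline="") as f:
        parsed = [tuple(row) for row in csv.reader(f)]

print(parsed)
```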