# SpaceNet Rio Chip Classification Data Prep

This notebook prepares data for training a chip classification model on the Rio SpaceNet dataset.

The only setting you need to change before running this notebook is the S3 URI below, which must point to a location you have write access to:

In [None]:
target_label_prefix = "s3://raster-vision-spacenet/AOI_1_Rio/buildingLabels"

The steps we'll take to make the data are as follows:

- Download the building labels and AOI from the SpaceNet AWS public dataset bucket
- Use the AOI and the image bounds to determine which images can be used for training and validation
- Split the building labels by image, save off a label GeoJSON file per image, and upload to S3
- Split the labeled images into a training and validation set, using the percentage of the AOI each covers, aiming at an 80%/20% split.

This process saves the split labels to S3 and writes a `training_scenes.csv` and a `val_scenes.csv`, which are used by the experiment at `spacenet.chip_classification`.

In [None]:
import os
import json
import rastervision as rv
import boto3
import botocore
import rasterio
from shapely.geometry import (Polygon, shape)

In [None]:
s3 = boto3.client('s3')

## Get the label and AOI data from the SpaceNet public dataset on AWS

In [None]:
label_path = '/opt/data/spacenet/rio/labels/Rio_Buildings_Public_AOI_v2.geojson'
aoi_path = '/opt/data/spacenet/rio/labels/Rio_OUTLINE_Public_AOI.geojson'
# Check each file separately so a missing AOI is fetched even if the labels already exist.
if not os.path.exists(label_path):
    !aws s3 cp s3://spacenet-dataset/AOI_1_Rio/srcData/buildingLabels/Rio_Buildings_Public_AOI_v2.geojson $label_path
if not os.path.exists(aoi_path):
    !aws s3 cp s3://spacenet-dataset/AOI_1_Rio/srcData/buildingLabels/Rio_OUTLINE_Public_AOI.geojson $aoi_path
    

## Use the AOI to determine what images are inside the training set

Here we compare the AOI to the image extents to determine which images we can use for training and validation. We're using `rasterio`'s ability to read the metadata from raster data on S3 without downloading the whole image.

In [None]:
aoi = None
with open(aoi_path) as f:
    aoi = shape(json.loads(f.read())['features'][0]['geometry'])

In [None]:
aoi

In [None]:
bucket = 'spacenet-dataset'
key = 'AOI_5_Khartoum/AOI_5_Khartoum_Train.tar.gz'

In [None]:
prefix = 'AOI_1_Rio/srcData/mosaic_3band/'
# Build the full s3:// URI for every object under the prefix.
image_files = ['s3://{}/{}'.format(bucket, obj['Key'])
               for obj in s3.list_objects(Bucket=bucket, Prefix=prefix)['Contents']]
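
Note that `list_objects` returns at most 1,000 keys per call, so a listing over a larger prefix would be cut short. The following is a minimal sketch of a paginated listing, assuming a boto3 S3 client is passed in. The helper name `list_image_uris` and the `.tif` suffix filter are illustrative assumptions and are not part of the original notebook.

```python
def list_image_uris(s3_client, bucket, prefix, suffix='.tif'):
    """Return s3:// URIs for every key under `prefix` that ends with `suffix`.

    Hypothetical helper: it uses the boto3 `list_objects_v2` paginator so that
    listings longer than one page (1,000 keys) are not truncated.
    """
    paginator = s3_client.get_paginator('list_objects_v2')
    uris = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from a page when no keys match.
        for obj in page.get('Contents', []):
            # Assumption: the images are GeoTIFFs; this also skips the
            # bare "directory" key if one exists under the prefix.
            if obj['Key'].endswith(suffix):
                uris.append('s3://{}/{}'.format(bucket, obj['Key']))
    return uris
```

With the names used in this notebook, it could be called as `image_files = list_image_uris(s3, bucket, prefix)`.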

In [None]:
def bounds_to_shape(bounds):
    """Convert a rasterio bounding box into a closed shapely Polygon."""
    return Polygon([[bounds.left, bounds.bottom],
                    [bounds.left, bounds.top],
                    [bounds.right, bounds.top],
                    [bounds.right, bounds.bottom],
                    [bounds.left, bounds.bottom]])

# Open each image only to read its bounds from the metadata.
image_to_extents = {}
for img in image_files:
    with rasterio.open(img, 'r') as ds:
        image_to_extents[img] = bounds_to_shape(ds.bounds)


In [None]:
intersecting_images = []
for img in image_to_extents:
    if image_to_extents[img].intersects(aoi):
        intersecting_images.append(img)

In [None]:
intersecting_images
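
The `intersects` check above is carried out by shapely against the full AOI polygon. For intuition, the special case of two axis-aligned rectangles reduces to comparing intervals on each axis. The sketch below is pure Python, and the helper name `bboxes_intersect` is hypothetical. It is only a coarse check: two bounding boxes can overlap even when the shapes inside them do not.

```python
def bboxes_intersect(a, b):
    """Return True if two axis-aligned boxes share at least one point.

    Each box is a (left, bottom, right, top) tuple, the same order that
    rasterio uses for `ds.bounds`.
    """
    a_left, a_bottom, a_right, a_top = a
    b_left, b_bottom, b_right, b_top = b
    # The boxes overlap only if their x-intervals and y-intervals both overlap.
    # Boxes that merely touch along an edge or corner count as intersecting.
    return (a_left <= b_right and b_left <= a_right and
            a_bottom <= b_top and b_bottom <= a_top)
```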

## Match labels to images

Find the labels that intersect each image's bounding box and save them to a label GeoJSON file named after the image. Then upload the files to the S3 URI at `target_label_prefix`.

In [None]:
label_js = None
with open(label_path) as f:
    label_js = json.loads(f.read())

In [None]:
# Add a class_id and class_name to the properties of each feature
for feature in label_js['features']:
    feature['properties']['class_id'] = 1
    feature['properties']['class_name'] = 'building'

In [None]:
image_to_features = {}
for img in intersecting_images:
    image_to_features[img] = []
    bbox = image_to_extents[img]
    for feature in label_js['features']:
        if shape(feature['geometry']).intersects(bbox):
            image_to_features[img].append(feature)
    

In [None]:
processed_labels_dir = '/opt/data/spacenet/rio/processed_labels/'
if not os.path.isdir(processed_labels_dir):
    os.makedirs(processed_labels_dir)
for img in image_to_features:
    fc = {}
    fc['type'] = 'FeatureCollection'
    fc['crs'] = label_js['crs']
    fc['features'] = image_to_features[img]
    basename = os.path.splitext(os.path.basename(img))[0]
    with open(os.path.join(processed_labels_dir,'{}.geojson'.format(basename)), 'w') as f:
        f.write(json.dumps(fc, indent=4))
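
The cell above assumes the source label file has a top-level `crs` member. A minimal sketch of a helper that builds the same per-image structure but omits `crs` when none is supplied is shown below. The name `make_feature_collection` is an assumption for illustration and does not come from the original notebook.

```python
import json


def make_feature_collection(features, crs=None):
    """Wrap a list of GeoJSON features in a FeatureCollection dict.

    Hypothetical helper: `crs` is copied through only when it is given,
    so a source file without a `crs` member does not raise a KeyError.
    """
    fc = {'type': 'FeatureCollection', 'features': list(features)}
    if crs is not None:
        fc['crs'] = crs
    return fc


# Example with a single placeholder feature (made-up data for illustration).
example = make_feature_collection(
    [{'type': 'Feature', 'properties': {'class_id': 1}, 'geometry': None}])
serialized = json.dumps(example, indent=4)
```

In the loop above it could be used as `fc = make_feature_collection(image_to_features[img], label_js.get('crs'))`.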

In [None]:
!ls $processed_labels_dir


In [None]:
!aws s3 cp --recursive $processed_labels_dir $target_label_prefix

## Split into train and validation

Split up training and validation data. The AOI is oddly shaped and there are not many images, so we'll split the scenes between training and validation roughly based on how much of the AOI each scene covers.

Create a CSV that our experiments will use to load up the training and validation data.



In [None]:
# Split training and validation
ratio = 0.8
aoi_area = aoi.area
images_to_area = {}
for img in intersecting_images:
    area = image_to_extents[img].intersection(aoi).area
    images_to_area[img] = area / aoi_area

train_imgs = []
val_imgs = []
train_area_covered = 0
for img in sorted(intersecting_images, reverse=True, key=lambda img: images_to_area[img]):
    if train_area_covered < ratio:
        train_imgs.append(img)
        train_area_covered += images_to_area[img]
    else:
        val_imgs.append(img)
# Area values are fractions of the AOI, so format them as percentages.
print("{} training images covering {:.1%} of the AOI.".format(len(train_imgs), train_area_covered))
print("{} validation images covering {:.1%} of the AOI.".format(len(val_imgs), 1 - train_area_covered))
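
The split above is greedy: scenes are visited from largest to smallest AOI share and assigned to training until the running total reaches the target ratio. The sketch below isolates that logic in a function so it can be tried on toy numbers. The function name and the example fractions are made up for illustration.

```python
def greedy_area_split(area_fractions, ratio=0.8):
    """Split scene names into (train, val) by descending share of the AOI.

    `area_fractions` maps a scene name to the fraction of the AOI it covers.
    Scenes go to training until the accumulated fraction reaches `ratio`,
    and the rest go to validation.
    """
    train, val = [], []
    covered = 0.0
    for name in sorted(area_fractions, key=area_fractions.get, reverse=True):
        if covered < ratio:
            train.append(name)
            covered += area_fractions[name]
        else:
            val.append(name)
    return train, val, covered


# Toy fractions chosen to be exact in binary floating point.
toy = {'a': 0.5, 'b': 0.25, 'c': 0.1875, 'd': 0.0625}
train, val, covered = greedy_area_split(toy, ratio=0.75)
```

Because the scene that crosses the threshold is still added to training, the training share can overshoot the target. If scene extents overlap one another, the fractions can also sum to more than 1.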

In [None]:
csv_rows = []
for img in train_imgs:
    basename = os.path.splitext(os.path.basename(img))[0]
    labels_path = os.path.join(target_label_prefix,'{}.geojson'.format(basename))
    csv_rows.append('"{}","{}"'.format(img, labels_path))
with open('/opt/data/spacenet/rio/training_scenes.csv', 'w') as f:
    f.write('\n'.join(csv_rows))

In [None]:
csv_rows = []
for img in val_imgs:
    basename = os.path.splitext(os.path.basename(img))[0]
    labels_path = os.path.join(target_label_prefix,'{}.geojson'.format(basename))
    csv_rows.append('"{}","{}"'.format(img, labels_path))
with open('/opt/data/spacenet/rio/val_scenes.csv', 'w') as f:
    f.write('\n'.join(csv_rows))
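
The two cells above differ only in the image list and the output path. One possible consolidation is sketched below, using the standard `csv` module for quoting. The helper name `write_scene_csv` is hypothetical. Unlike the cells above, `csv.writer` ends the last row with a newline.

```python
import csv
import posixpath


def write_scene_csv(path, image_uris, label_prefix):
    """Write one quoted "image_uri","label_uri" row per image.

    Hypothetical helper: the label URI is `label_prefix` plus the image's
    base name with a .geojson extension, as in the cells above.
    """
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL, lineterminator='\n')
        for img in image_uris:
            # posixpath keeps forward slashes, which is what S3 URIs use.
            basename = posixpath.splitext(posixpath.basename(img))[0]
            label_uri = posixpath.join(label_prefix, '{}.geojson'.format(basename))
            writer.writerow([img, label_uri])
```

With the notebook's variables this could be called as `write_scene_csv('/opt/data/spacenet/rio/val_scenes.csv', val_imgs, target_label_prefix)`.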