# [SpaceNet](https://aws.amazon.com/public-datasets/spacenet/)

"The current SpaceNet corpus includes **thousands of square kilometers of high resolution imagery** collected from **DigitalGlobe’s commercial satellites** which includes **8-band multispectral data**. This dataset is being made public to advance the development of **algorithms to automatically extract geometric features such as roads, building footprints, and points of interest using satellite imagery**. The currently available Areas of Interest (AOI) are **Rio De Janeiro**, Paris, Las Vegas, Shanghai and Khartoum."

### 0. Dependencies
Install the [AWS Command Line Interface (CLI)](https://aws.amazon.com/cli/) and make sure you have an active AWS account. Then configure the CLI with `aws configure`.
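
Before going further, it can help to confirm that the setup above is in place. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the credentials check is a heuristic that looks only at the two standard environment variables and at `~/.aws/credentials` (the file `aws configure` writes by default). Other credential sources, such as IAM roles, are not detected.

```python
import os
import shutil


def aws_cli_on_path():
    """Return True if an `aws` executable can be found on PATH."""
    return shutil.which("aws") is not None


def credentials_look_configured(home=None, environ=None):
    """Heuristic only: True if env-var credentials are set or the
    default shared credentials file exists under the home directory."""
    environ = os.environ if environ is None else environ
    if environ.get("AWS_ACCESS_KEY_ID") and environ.get("AWS_SECRET_ACCESS_KEY"):
        return True
    home = os.path.expanduser("~") if home is None else home
    return os.path.isfile(os.path.join(home, ".aws", "credentials"))


print("aws CLI on PATH:", aws_cli_on_path())
print("credentials look configured:", credentials_look_configured())
```

If either line prints `False`, installing the CLI or rerunning `aws configure` is the likely next step.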

### 1. Accessing the SpaceNet Data on AWS
The data consists of [GeoTIFF](https://en.wikipedia.org/wiki/GeoTIFF) satellite imagery and corresponding [GeoJSON](https://en.wikipedia.org/wiki/GeoJSON) building footprints.

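The `spacenet-dataset` S3 bucket is a [Requester Pays bucket](https://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html), so request and data transfer costs are billed to your AWS account rather than to the bucket owner. We download the files with [Boto 3](https://boto3.readthedocs.io/en/latest/index.html), the Amazon Web Services (AWS) SDK for Python, and pass `RequestPayer` with each request to acknowledge the charge.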

In [2]:
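import boto3
# Import S3Transfer explicitly rather than relying on boto3.s3.transfer
# having been loaded as a side effect of creating the client.
from boto3.s3.transfer import S3Transfer

client = boto3.client("s3")
# https://boto3.readthedocs.io/en/latest/reference/customizations/s3.html#boto3.s3.transfer.S3Transfer
transfer = S3Transfer(client)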

bucket = "spacenet-dataset"
# 20 tiffs listed in "AOI_1_Rio_manifest.txt".
names = [
    "013022223310.tif",
    "013022232021.tif",
    "013022232201.tif",
    "013022223131.tif",
    "013022223113.tif",
    "013022223103.tif",
    "013022223133.tif",
    "013022223132.tif",
    "013022223301.tif",
    "013022223112.tif",
    "013022232020.tif",
    "013022232200.tif",
    "013022223123.tif",
    "013022232022.tif",
    "013022223130.tif",
    "013022232002.tif",
    "013022232023.tif",
    "013022223311.tif",
    "013022232003.tif",
    "013022223121.tif"
]

key_prefix = "AOI_1_Rio/srcData/mosaic_3band/"
!mkdir -p /tmp/spacenet-data
filename_prefix = "/tmp/spacenet-data/"

key_filename_tuples = [
    (key_prefix + name, filename_prefix + name)
    for name in names]

# Download takes 4 minutes for 3-band, 16 minutes for 8-band.
import time
start = time.time()

for (key, filename) in key_filename_tuples:
    transfer.download_file(
        bucket=bucket, key=key, filename=filename,
        extra_args={"RequestPayer":"requester"}
    )

end = time.time()
download_time = end - start
minutes = int(download_time) // 60
print("Download time: %d minutes" % minutes)

Download time: 5 minutes
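
Because every request against a Requester Pays bucket is billed, rerunning the download cell fetches (and pays for) all 20 files again. The following is a minimal sketch of one way to avoid that by skipping files already on disk. The helper names are hypothetical; `transfer` is assumed to be any object with the same `download_file` method as the `S3Transfer` instance created above. The check is existence only, not size or checksum.

```python
import os


def build_download_plan(names, key_prefix, local_dir):
    """Pair each S3 key with the local path it should be saved to."""
    return [(key_prefix + name, os.path.join(local_dir, name)) for name in names]


def download_missing(transfer, bucket, plan):
    """Download only files not already on disk.

    Returns a (downloaded, skipped) tuple of counts.
    """
    downloaded = 0
    skipped = 0
    for key, filename in plan:
        if os.path.exists(filename):
            skipped += 1
            continue
        transfer.download_file(
            bucket=bucket, key=key, filename=filename,
            extra_args={"RequestPayer": "requester"},
        )
        downloaded += 1
    return downloaded, skipped


def format_elapsed(seconds):
    """Format a duration in seconds as 'M min S s'."""
    minutes, secs = divmod(int(seconds), 60)
    return "%d min %d s" % (minutes, secs)
```

With the variables from the cell above, this could be called as `download_missing(transfer, bucket, build_download_plan(names, key_prefix, filename_prefix))`, and the timing reported with `format_elapsed(end - start)`.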


### 2. Ingest Images with GeoPySpark

[GeoPySpark](https://github.com/locationtech-labs/geopyspark) is a Python binding for [GeoTrellis](https://github.com/locationtech/geotrellis), a Scala library designed for high-performance reading, writing, and processing of raster data on Spark.

Refer to the [Ingesting a Grayscale Image](https://geopyspark.readthedocs.io/en/latest/tutorials/greyscale_ingest_example.html) tutorial for a breakdown of the code.
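
The ingest cell that follows reprojects the tiles to EPSG:3857 under a zoomed layout and then builds a pyramid from zoom 12 down to zoom 1. To give a feel for what those zoom levels mean, here is a small standalone sketch of the standard Web Mercator tile arithmetic. It does not call GeoPySpark; the function names are hypothetical, and the 256-pixel tile size is an assumption (a common default for web map tiles), so the layout GeoPySpark actually produces may differ.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used by EPSG:3857
TILE_SIZE_PX = 256          # assumed tile size, not read from GeoPySpark


def zoom_levels(start_zoom, end_zoom):
    """Zoom levels from start_zoom down to end_zoom, inclusive."""
    return list(range(start_zoom, end_zoom - 1, -1))


def tiles_per_side(zoom):
    """Number of tiles along one side of the world at a zoom level."""
    return 2 ** zoom


def meters_per_pixel(zoom, tile_size=TILE_SIZE_PX):
    """Pixel size in Web Mercator meters (ground resolution at the equator)."""
    world_width_m = 2 * math.pi * EARTH_RADIUS_M
    return world_width_m / (tile_size * 2 ** zoom)


for z in zoom_levels(12, 1):
    n = tiles_per_side(z)
    print("zoom %2d: %4d x %-4d tiles, %.1f m/px" % (z, n, n, meters_per_pixel(z)))
```

Each step down the pyramid halves the number of tiles per side and doubles the pixel size, which is why 12 separate layers come out of a single `pyramid` call with `start_zoom=12` and `end_zoom=1`.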

In [3]:
from pyspark import SparkContext
from geopyspark import geopyspark_conf
conf = geopyspark_conf("local[*]", "spacenet-ingest")
geopysc = SparkContext.getOrCreate(conf)

In [4]:
# Ingest takes X minutes.
import time
start = time.time()

from geopyspark.geotrellis.geotiff import get
from geopyspark.geotrellis.constants import SPATIAL, ZOOM
from geopyspark.geotrellis.catalog import write

# Read the GeoTiff locally
rdd = get(geopysc, SPATIAL, "file:///tmp/spacenet-data/")
# collect_metadata() fails with the scala.MatchError shown in the output below;
# see https://github.com/locationtech/geotrellis/issues/2268
metadata = rdd.collect_metadata()

# tile the rdd to the layout defined in the metadata
laid_out = rdd.tile_to_layout(metadata)

# reproject the tiled rasters using a ZoomedLayoutScheme
reprojected = laid_out.reproject("EPSG:3857", scheme=ZOOM)#.cache().repartition(200)

# pyramid the TiledRasterRDD to create 12 new TiledRasterRDDs
# one for each zoom level
pyramided = reprojected.pyramid(start_zoom=12, end_zoom=1)

# Save each TiledRasterRDD locally
for tiled in pyramided:
    write("file:///tmp/spacenet-catalog", "spacenet-ingest", tiled)

end = time.time()
ingest_time = end - start
minutes = int(ingest_time) // 60
print("Ingest time: %d minutes" % minutes)

Py4JJavaError: An error occurred while calling o29.collectMetadata.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost, executor driver): scala.MatchError: (53971,53970) (of class scala.Tuple2$mcII$sp)
	at geotrellis.raster.io.geotiff.reader.TiffTagsReader$.readTag(TiffTagsReader.scala:114)
	at geotrellis.raster.io.geotiff.reader.TiffTagsReader$.read(TiffTagsReader.scala:102)
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readGeoTiffInfo(GeoTiffReader.scala:288)
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readMultiband(GeoTiffReader.scala:174)
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readMultiband(GeoTiffReader.scala:152)
	at geotrellis.raster.io.geotiff.MultibandGeoTiff$.apply(MultibandGeoTiff.scala:93)
	at geotrellis.spark.io.RasterReader$$anon$2.readFully(RasterReader.scala:98)
	at geotrellis.spark.io.RasterReader$$anon$2.readFully(RasterReader.scala:96)
	at geotrellis.spark.io.hadoop.HadoopGeoTiffRDD$$anonfun$apply$1$$anonfun$apply$2.apply(HadoopGeoTiffRDD.scala:111)
	at geotrellis.spark.io.hadoop.HadoopGeoTiffRDD$$anonfun$apply$1$$anonfun$apply$2.apply(HadoopGeoTiffRDD.scala:110)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185)
	at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1336)
	at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1011)
	at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1009)
	at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
	at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1981)
	at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
	at geotrellis.spark.TileLayerMetadata$.collectMetadataWithCRS(TileLayerMetadata.scala:147)
	at geotrellis.spark.TileLayerMetadata$.fromRdd(TileLayerMetadata.scala:237)
	at geotrellis.spark.package$withCollectMetadataMethods.collectMetadata(package.scala:194)
	at geopyspark.geotrellis.ProjectedRasterRDD.collectMetadata(RasterRDD.scala:212)
	at geopyspark.geotrellis.RasterRDD.collectMetadata(RasterRDD.scala:188)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: scala.MatchError: (53971,53970) (of class scala.Tuple2$mcII$sp)
	at geotrellis.raster.io.geotiff.reader.TiffTagsReader$.readTag(TiffTagsReader.scala:114)
	at geotrellis.raster.io.geotiff.reader.TiffTagsReader$.read(TiffTagsReader.scala:102)
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readGeoTiffInfo(GeoTiffReader.scala:288)
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readMultiband(GeoTiffReader.scala:174)
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readMultiband(GeoTiffReader.scala:152)
	at geotrellis.raster.io.geotiff.MultibandGeoTiff$.apply(MultibandGeoTiff.scala:93)
	at geotrellis.spark.io.RasterReader$$anon$2.readFully(RasterReader.scala:98)
	at geotrellis.spark.io.RasterReader$$anon$2.readFully(RasterReader.scala:96)
	at geotrellis.spark.io.hadoop.HadoopGeoTiffRDD$$anonfun$apply$1$$anonfun$apply$2.apply(HadoopGeoTiffRDD.scala:111)
	at geotrellis.spark.io.hadoop.HadoopGeoTiffRDD$$anonfun$apply$1$$anonfun$apply$2.apply(HadoopGeoTiffRDD.scala:110)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185)
	at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1336)
	at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1011)
	at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1009)
	at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
	at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more
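
The trace shows the failure inside GeoTrellis's `TiffTagsReader`, that is, while the TIFF tags of a downloaded file are being parsed. The sketch below does not identify or fix the cause (the linked GeoTrellis issue is the place to follow that). It is only a first diagnostic step, using the standard library, that reports the byte order and TIFF variant declared in a file's header so the files can be described accurately when investigating. The function name is hypothetical.

```python
import struct


def read_tiff_header(data):
    """Parse the first 4 bytes of a TIFF file.

    Returns (byte_order, variant): byte_order is "little" or "big",
    variant is "classic" (magic number 42) or "bigtiff" (magic number 43).
    """
    if len(data) < 4:
        raise ValueError("need at least 4 bytes of header")
    order_mark = data[:2]
    if order_mark == b"II":
        byte_order, fmt = "little", "<H"
    elif order_mark == b"MM":
        byte_order, fmt = "big", ">H"
    else:
        raise ValueError("not a TIFF byte-order mark: %r" % (order_mark,))
    (magic,) = struct.unpack(fmt, data[2:4])
    if magic == 42:
        variant = "classic"
    elif magic == 43:
        variant = "bigtiff"
    else:
        raise ValueError("unexpected TIFF magic number: %d" % magic)
    return byte_order, variant
```

For one of the downloaded files this could be used as `read_tiff_header(open("/tmp/spacenet-data/013022223310.tif", "rb").read(4))`.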
