# [Dependencies](https://spacenetchallenge.github.io/#Dependencies)
> The [AWS Command Line Interface (CLI)](https://aws.amazon.com/cli/) must be installed, and you need an active AWS account. Configure the AWS CLI by running `aws configure`.

# [Accessing the SpaceNet Data on AWS](https://aws.amazon.com/public-datasets/spacenet/#Accessing_the_SpaceNet_Data_on_AWS)
> The SpaceNet dataset is being released in several Areas of Interest (AOIs). All AOIs follow a similar directory structure and data format. The data consists of GeoTIFF satellite imagery and corresponding GeoJSON building footprints. You can use the following [aws-cli](https://aws.amazon.com/cli/) command to list all files available in the dataset (details of the file structure are below):

> `aws s3 ls spacenet-dataset --request-payer requester`

> For more detailed information on how to access specific files within the dataset, see [here](https://github.com/SpaceNetChallenge/utilities/tree/master/content/download_instructions).

> _The spacenet-dataset S3 bucket is provided as a Requester Pays bucket, see [here](https://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html) for more information._

# Downloading Rio raster and vector data with [Boto](https://boto3.readthedocs.io/en/latest/index.html)
Since the bucket is Requester Pays, an anonymous `curl` request for an image is rejected. Instead, Boto, the AWS SDK for Python, provides an interface for downloading files from Requester Pays buckets. The [S3Transfer](https://boto3.readthedocs.io/en/latest/reference/customizations/s3.html#boto3.s3.transfer.S3Transfer) class has a `download_file` method whose `extra_args` can carry a `RequestPayer` entry.

In [1]:
import os
import boto3
from boto3.s3.transfer import S3Transfer

bucket = "spacenet-dataset"

# Key prefixes inside the bucket for the Rio area of interest
aoi_path = "AOI_1_Rio"
aoi_data_path = os.path.join(aoi_path, "srcData")
building_labels_path = os.path.join(aoi_data_path, "buildingLabels")
mosaic_3band_path = os.path.join(aoi_data_path, "mosaic_3band")

client = boto3.client("s3")
transfer = S3Transfer(client)

def download_if_not_exists(key, filename):
    """Download the object at `key` to `filename` unless the file is already on disk."""
    if not os.path.exists(filename):
        # RequestPayer acknowledges that the caller is billed for the transfer
        transfer.download_file(
            bucket=bucket, key=key, filename=filename,
            extra_args={"RequestPayer": "requester"})

mosaic_3band_object_list = client.list_objects_v2(
    Bucket=bucket, Prefix=mosaic_3band_path,
    RequestPayer='requester')
mosaic_3band_key = [obj["Key"] for obj in mosaic_3band_object_list["Contents"]][0]
mosaic_3band_filename = os.path.join("/tmp", mosaic_3band_key.split("/")[-1])
download_if_not_exists(mosaic_3band_key, mosaic_3band_filename)

outline_filename = "Rio_OUTLINE_Public_AOI.geojson"
outline_key = os.path.join(building_labels_path, outline_filename)
download_if_not_exists(outline_key, outline_filename)

buildings_filename = "Rio_Buildings_Public_AOI_v2.geojson"
buildings_key = os.path.join(building_labels_path, buildings_filename)
download_if_not_exists(buildings_key, buildings_filename)
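
A single `list_objects_v2` call returns at most 1,000 keys, so a prefix with more objects needs pagination. The following is a minimal sketch of a paginated listing, assuming a boto3 S3 client like the `client` created above; the helper name `iter_keys` is hypothetical and not part of boto3.

```python
def iter_keys(client, bucket, prefix, request_payer="requester"):
    """Yield every object key under `prefix`, following pagination.

    `client` is expected to be a boto3 S3 client. `iter_keys` is an
    illustrative helper, not a boto3 function.
    """
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(
        Bucket=bucket, Prefix=prefix, RequestPayer=request_payer)
    for page in pages:
        # A page has no "Contents" entry when it holds no objects
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

With the variables from the cell above, `list(iter_keys(client, bucket, mosaic_3band_path))` would collect all mosaic keys rather than only the first page.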

# Wrangling imagery with [GDAL](http://www.gdal.org/gdal_translate.html)
At the time of this demo, the GeoTrellis GeoTIFF reader fails on the downloaded mosaic with "Compression type JPEG is not supported by [this reader](https://github.com/locationtech/geotrellis/blob/master/raster/src/main/scala/geotrellis/raster/io/geotiff/compression/Decompressor.scala#L119-L122)", so we use [gdal_translate](http://www.gdal.org/gdal_translate.html) to rewrite the image with a different compression type (LZW in the cell below).

In [2]:
from osgeo import gdal

catalog_uri = os.path.join("/tmp", "catalog.tif")

# Rewrite the mosaic with LZW compression, skipping the work if the output already exists
if not os.path.exists(catalog_uri):
    gdal.Translate(
        destName=catalog_uri, srcDS=mosaic_3band_filename,
        creationOptions=['COMPRESS=LZW'])
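
The same recompression can be done with the `gdal_translate` command-line tool instead of the Python bindings. The sketch below assumes `gdal_translate` is on the `PATH`; the helper names `gdal_translate_command` and `recompress` are hypothetical wrappers written for this illustration.

```python
import subprocess

def gdal_translate_command(src, dst, compression="LZW"):
    """Build the gdal_translate argument list that rewrites `src` to `dst`
    with the given GeoTIFF compression creation option."""
    return ["gdal_translate", "-co", f"COMPRESS={compression}", src, dst]

def recompress(src, dst, compression="LZW"):
    """Run gdal_translate; raises CalledProcessError if the tool exits non-zero."""
    subprocess.run(gdal_translate_command(src, dst, compression), check=True)
```

Building the argument list separately from running it makes the command easy to inspect or log before any file is written.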

# Ingesting imagery for fast viewing with [GeoPySpark](https://github.com/locationtech-labs/geopyspark)

In [3]:
import geopyspark as gps
from pyspark import SparkContext
conf = gps.geopyspark_conf("local[*]", "spacenet-ingest")
conf.set(key='spark.ui.enabled', value='true')
sc = SparkContext.getOrCreate(conf)

# Read the LZW-compressed copy that gdal.Translate wrote to /tmp/catalog.tif above.
# Pointing this URI at a path where the file does not exist (an earlier run used
# file:///home/hadoop/notebooks/catalog.tif) fails with java.io.FileNotFoundException
# as soon as the layer is first evaluated.
catalog_uri = "file:///tmp/catalog.tif"
# The following operation takes about X seconds on a reasonably capable 4-core laptop
rdd = gps.geotrellis.geotiff.get(
    gps.geotrellis.constants.LayerType.SPATIAL,
    catalog_uri,
    max_tile_size=512,
    num_partitions=500)

laid_out = rdd.tile_to_layout(layout=gps.GlobalLayout(), target_crs=3857)
reprojected = laid_out.reproject("EPSG:3857").cache().repartition(600)
pyramided = reprojected.pyramid(start_zoom=12, end_zoom=1)

for tiled in pyramided:
    gps.geotrellis.catalog.write("file:///tmp/spacenet-catalog", "spacenet-ingest", tiled)
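
To get a feel for what a zoom level means in a global Web Mercator (EPSG:3857) layout, the sketch below computes which tile contains a longitude/latitude pair under the conventional XYZ "slippy map" scheme, where zoom `z` divides the world into `2**z` by `2**z` tiles. Whether the keys written to the catalog follow exactly this convention is an assumption, and the function name `lonlat_to_tile` is hypothetical.

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Return the (column, row) of the Web Mercator tile containing the point,
    using the conventional XYZ scheme with row 0 at the northern edge."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# Approximate coordinates for Rio de Janeiro, used only for illustration
rio_lon, rio_lat = -43.2, -22.9
for zoom in range(12, 0, -1):
    print(zoom, lonlat_to_tile(rio_lon, rio_lat, zoom))
```

Each step down in zoom halves the number of tiles per axis, which is why the lower pyramid levels are much cheaper to write than zoom 12.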


# Showing Rio’s outline, imagery, and building footprints on a map with [GeoNotebook](https://github.com/OpenGeoscience/geonotebook)

In [4]:
from geonotebook.wrappers import VectorData
outline_vector = VectorData(outline_filename)
outline_polygons = [polygon for polygon in outline_vector.polygons]
outline_polygon = outline_polygons[0]
outline_centroid = outline_polygon.centroid
x = outline_centroid.x
y = outline_centroid.y
z = 12
M.set_center(x, y, z);
M.add_layer(outline_vector, name=outline_key);

In [5]:
import numpy as np
from PIL import Image

def render_image(tile):
    cells = tile.cells
    # Color correct with a linear stretch between hand-picked ("magic") bounds
    magic_min, magic_max = 4000, 15176
    norm_range = magic_max - magic_min
    cells = cells.astype('int32')
    # Clamp non-zero cells into [magic_min, magic_max]
    cells[(cells != 0) & (cells < magic_min)] = magic_min
    cells[(cells != 0) & (cells > magic_max)] = magic_max
    colored = ((cells - magic_min) * 255) / norm_range
    # Band 2 is used as red, band 1 as green, and band 0 as blue
    (r, g, b) = (colored[2], colored[1], colored[0])
    # Make a pixel fully transparent when all three bands hold the no-data value
    alpha = np.full(r.shape, 255)
    alpha[(cells[0] == tile.no_data_value) & \
          (cells[1] == tile.no_data_value) & \
          (cells[2] == tile.no_data_value)] = 0
    rgba = np.dstack([r, g, b, alpha]).astype('uint8')
    #return Image.fromarray(colored[1], mode='P')
    return Image.fromarray(rgba, mode='RGBA')

# tms_server = gps.TMS.build(pyramided, display=render_image)
# M.add_layer(TMSRasterData(tms_server), name="mosaic")
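
The color correction in `render_image` is a clamp followed by a linear rescale to the 0-255 range. As a worked example of that arithmetic on a single value, here is a minimal pure-Python sketch; the function name `stretch_to_byte` is hypothetical, and unlike `render_image` it does not treat zero-valued cells specially.

```python
def stretch_to_byte(value, lo=4000, hi=15176):
    """Clamp `value` to [lo, hi], then rescale linearly so lo -> 0 and hi -> 255.

    The fractional part is truncated, as it is when a float array is cast
    to uint8.
    """
    clamped = min(max(value, lo), hi)
    return int((clamped - lo) * 255 / (hi - lo))

for raw in (100, 4000, 9588, 15176, 20000):
    print(raw, stretch_to_byte(raw))
```

Values below the lower bound collapse to 0 and values above the upper bound collapse to 255, so the bounds decide how much contrast is kept in the mid-range.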

In [6]:
buildings_vector = VectorData(buildings_filename)
M.add_layer(buildings_vector, name=buildings_key);

