
Ingest NLCD 2016 Layers into Geoprocessing Service #3393

Closed
6 tasks done
rajadain opened this issue May 12, 2021 · 8 comments
Assignees
Labels
PA DEP Funding Source: Pennsylvania Department of Environment Protection

Comments

@rajadain
Member

rajadain commented May 12, 2021

From the MRLC site, https://www.mrlc.gov/data/nlcd-2016-land-cover-conus, download and ingest the NLCD 2016 data into the MMW Geoprocessing Service. The data should be ingested in the EPSG:5070 (CONUS Albers) projection. This should be both a data ingest for calculations and a tiled image ingest for visualization.

The tasks for this card include:

  • Analyzing the NLCD 2016 download to see its contents
    • This may include exploring it in QGIS
    • Extracting the various layers into component files (2016, 2011, 2006, 2001)
    • Other preparations before the ingest
  • Getting the environment set up for ingesting: this could be local, or on EC2
  • Ingest the 2016 NLCD layer into GeoTrellis
  • Create a pyramided visual layer for NLCD 2016
  • Ingest the 2016 version of NLCD 2011 into GeoTrellis
  • Create a pyramided visual layer for new NLCD 2011
@rajadain rajadain added the PA DEP Funding Source: Pennsylvania Department of Environment Protection label May 12, 2021
@rajadain rajadain changed the title Ingest NLCD 2016 Layer into Geoprocessing Service Ingest NLCD 2016 Layers into Geoprocessing Service May 12, 2021
@rajadain rajadain self-assigned this Jun 10, 2021
@rajadain
Member Author

These are the files in the download:

$ du -sh *

4.1M	NLCD2016_spatial_metadata
 28K	NLCD_2016_Land_Cover_L48.xml
 16G	NLCD_2016_Land_Cover_L48_20190424.ige
 32K	NLCD_2016_Land_Cover_L48_20190424.img
3.9G	NLCD_2016_Land_Cover_L48_20190424.rde
425M	NLCD_2016_Land_Cover_L48_20190424.rrd

The data is in the ERDAS IMAGINE file format:

$ gdalinfo NLCD_2016_Land_Cover_L48_20190424.img

Driver: HFA/Erdas Imagine Images (.img)
Files: NLCD_2016_Land_Cover_L48_20190424.img
       nlcd_2016_land_cover_l48_20190424.ige
       nlcd_2016_land_cover_l48_20190424.rrd
       nlcd_2016_land_cover_l48_20190424.rde
Size is 161190, 104424
Coordinate System is:
PROJCRS["Albers_Conical_Equal_Area",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4326]],
    CONVERSION["unnamed",
        METHOD["Albers Equal Area",
            ID["EPSG",9822]],
        PARAMETER["Latitude of false origin",23,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8821]],
        PARAMETER["Longitude of false origin",-96,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8822]],
        PARAMETER["Latitude of 1st standard parallel",29.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8823]],
        PARAMETER["Latitude of 2nd standard parallel",45.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8824]],
        PARAMETER["Easting at false origin",0,
            LENGTHUNIT["meters",1],
            ID["EPSG",8826]],
        PARAMETER["Northing at false origin",0,
            LENGTHUNIT["meters",1],
            ID["EPSG",8827]]],
    CS[Cartesian,2],
        AXIS["easting",east,
            ORDER[1],
            LENGTHUNIT["meters",1]],
        AXIS["northing",north,
            ORDER[2],
            LENGTHUNIT["meters",1]]]
Data axis to CRS axis mapping: 1,2
Origin = (-2493045.000000000000000,3310005.000000000000000)
Pixel Size = (30.000000000000000,-30.000000000000000)
Corner Coordinates:
Upper Left  (-2493045.000, 3310005.000) (130d13'58.18"W, 48d42'26.63"N)
Lower Left  (-2493045.000,  177285.000) (119d47' 9.98"W, 21d44'32.31"N)
Upper Right ( 2342655.000, 3310005.000) ( 63d40'19.89"W, 49d10'37.43"N)
Lower Right ( 2342655.000,  177285.000) ( 73d35'40.55"W, 22d 4'36.23"N)
Center      (  -75195.000, 1743645.000) ( 96d52'22.83"W, 38d43' 4.71"N)
Band 1 Block=512x512 Type=Byte, ColorInterp=Palette
  Description = Layer_1
  Min=0.000 Max=95.000
  Minimum=0.000, Maximum=95.000, Mean=30.478, StdDev=32.691
  Overviews: 80595x52212, 40298x26106, 20149x13053, 10075x6527, 5038x3264, 2519x1632, 1260x816, 630x408, 315x204, 158x102, 79x51
  Metadata:
    LAYER_TYPE=thematic
    OVERVIEWS_ALGORITHM=IMAGINE Nearest Neighbor Resampling
    STATISTICS_EXCLUDEDVALUES=
    STATISTICS_HISTOBINVALUES=7854240512|0|0|0|0|0|0|0|0|0|0|470743994|571568|0|0|0|0|0|0|0|0|258084082|133062048|62536718|22510859|0|0|0|0|0|0|92107968|0|0|0|0|0|0|0|0|0|840903850|1026422406|325741623|0|0|0|0|0|0|0|0|1955705216|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1242679969|0|0|0|0|0|0|0|0|0|563964334|1459015111|0|0|0|0|0|0|0|391909523|0|0|0|0|131904779|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
    STATISTICS_HISTOMAX=255
    STATISTICS_HISTOMIN=0
    STATISTICS_HISTONUMBINS=256
    STATISTICS_MAXIMUM=95
    STATISTICS_MEAN=30.478061895963
    STATISTICS_MEDIAN=21
    STATISTICS_MINIMUM=0
    STATISTICS_MODE=0
    STATISTICS_SKIPFACTORX=1
    STATISTICS_SKIPFACTORY=1
    STATISTICS_STDDEV=32.691287884651
  Color Table (RGB with 256 entries)
...
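A quick arithmetic sanity check on that output (all numbers copied from the gdalinfo listing above, nothing new): the origin, pixel size, and raster dimensions reproduce the reported corner coordinates, and the pixel count also accounts for the 16G `.ige` payload in the `du` listing, so the data is essentially uncompressed single-byte pixels.

```shell
# Origin + size * 30 m pixels should reproduce the reported corners:
awk 'BEGIN { print -2493045 + 161190 * 30 }'   # Upper Right easting, expect 2342655
awk 'BEGIN { print 3310005 - 104424 * 30 }'    # Lower Left northing, expect 177285

# 161190 x 104424 one-byte pixels is roughly the 16G .ige file:
awk 'BEGIN { printf "%.1f GiB\n", 161190 * 104424 / 1024 ^ 3 }'   # 15.7 GiB
```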

@rajadain
Member Author

The file opens in QGIS, showing the data correctly:

[screenshot: NLCD 2016 land cover rendered in QGIS]

@rajadain
Member Author

Notably, we should be using https://www.mrlc.gov/data/nlcd-land-cover-conus-all-years which has the 2016 edition of data for all years, since we also need to ingest the 2016 version of NLCD 2011, and potentially the 2016 version of NLCD 2006 and NLCD 2001 as well.

@rajadain
Member Author

As can be seen in the gdalinfo output above (#3393 (comment)), the original ERDAS IMAGINE file is not in EPSG:5070, the projection used by older versions of NLCD and the one that MMW has standardized on, but in a custom non-EPSG Albers projection. I converted the original file to an EPSG:5070 GeoTiff using the following commands:

$ gdal_translate NLCD_2016_Land_Cover_L48_20190424.img nlcd-2016-mrls.tif
$ gdalwarp -t_srs 'EPSG:5070' nlcd-2016-mrls.tif nlcd-2016-mrls-5070.tif

Then, to verify that this transformation did not result in any loss of data, I cropped both GeoTiffs, the one in the original projection and the EPSG:5070 one, to the Delaware River Basin, using the DRB GeoJSON we use in the app:

$ gdalwarp -cutline drb.geojson -crop_to_cutline nlcd-2016-mrls.tif nlcd-2016-mrls-cropped.tif
$ gdalwarp -cutline drb.geojson -crop_to_cutline nlcd-2016-mrls-5070.tif nlcd-2016-mrls-5070-cropped.tif

Then, I reprojected each cropped GeoTiff, one in the original projection and one in EPSG:5070, to EPSG:3857:

$ gdalwarp -t_srs 'EPSG:3857' nlcd-2016-mrls-cropped.tif nlcd-2016-mrls-cropped-3857.tif
$ gdalwarp -t_srs 'EPSG:3857' nlcd-2016-mrls-5070-cropped.tif nlcd-2016-mrls-5070-cropped-3857.tif

The idea was that any differences between the final EPSG:3857 files would indicate data loss in going from the original projection to EPSG:5070.

For a raster diff, I performed a subtract operation in QGIS:

[screenshot: raster subtract operation in QGIS]

And ran statistics and histogram for the diff:

$ gdalinfo -stats -hist nlcd-2016-mrls-5070-cropped-3857-diff.tif

...
256 buckets from -47.0961 to 2.09608:
  1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 139490096 0 0 0 0 0 0 0 0 0 1 
  NoData Value=-3.4028234663852886e+38
  Metadata:
    STATISTICS_MAXIMUM=2
    STATISTICS_MEAN=-3.4411044224625e-07
    STATISTICS_MINIMUM=-47
    STATISTICS_STDDEV=0.0039875769431737
    STATISTICS_VALID_PERCENT=100

This revealed that the highest value of the diff was 2, much lower than the valid values of the data, which start at 11. Nearly every pixel fell in the histogram bucket containing 0, which indicates that the two EPSG:3857 files, one created from the original projection and one from EPSG:5070, were nearly identical.
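Reading the histogram above, four buckets hold a single pixel each and everything else (139,490,096 pixels) sits in the bucket containing zero, so the fraction of differing pixels is vanishingly small (a rough tally; counts copied from the gdalinfo output):

```shell
# 4 outlier pixels out of 139490100 total (139490096 + 4), per the
# histogram in the gdalinfo -stats -hist output above.
awk 'BEGIN { printf "%.6f%% of pixels differ\n", 4 / 139490100 * 100 }'   # 0.000003%
```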

I also hexdumped both files and ran a text diff on them:

$ hexdump nlcd-2016-mrls-cropped-3857.tif > nlcd-2016-mrls-cropped-3857.tif.hexdump
$ hexdump nlcd-2016-mrls-5070-cropped-3857.tif > nlcd-2016-mrls-5070-cropped-3857.tif.hexdump
$ diff -u nlcd-2016-mrls-cropped-3857.tif.hexdump nlcd-2016-mrls-5070-cropped-3857.tif.hexdump

This showed a total difference of 64 bytes out of 126 MB, making the files virtually identical:

--- nlcd-2016-mrls-cropped-3857.tif.hexdump	2021-06-15 10:55:46.000000000 -0400
+++ nlcd-2016-mrls-5070-cropped-3857.tif.hexdump	2021-06-15 10:57:16.000000000 -0400
@@ -3869,10 +3869,10 @@
 0016bc0 6f 6c 65 3d 22 64 65 73 63 72 69 70 74 69 6f 6e
 0016bd0 22 3e 4c 61 79 65 72 5f 31 3c 2f 49 74 65 6d 3e
 0016be0 0a 3c 2f 47 44 41 4c 4d 65 74 61 64 61 74 61 3e
-0016bf0 0a 00 5b 8b f8 6e 5e aa 43 40 5b 8b f8 6e 5e aa
+0016bf0 0a 00 e0 89 f9 6e 5e aa 43 40 e0 89 f9 6e 5e aa
 0016c00 43 40 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0016c10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
-0016c20 00 00 f3 7c 69 6b f4 51 60 c1 fc 56 df 7a f6 10
+0016c20 00 00 85 dc 69 6b f4 51 60 c1 12 77 df 7a f6 10
 0016c30 54 41 00 00 00 00 00 00 00 00 01 00 01 00 00 00
 0016c40 07 00 00 04 00 00 01 00 01 00 01 04 00 00 01 00
 0016c50 01 00 02 04 b1 87 19 00 00 00 01 08 b1 87 07 00
@@ -114161,7 +114161,7 @@
 0a0c300 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29
 *
 0a0c320 29 29 29 29 29 29 29 29 29 29 29 29 29 29 2b 2b
-0a0c330 2b 2b 2b 2b 2b 2b 2b 2b 2b 29 29 29 29 29 29 29
+0a0c330 2b 2b 2b 2b 2b 2b 2b 2b 29 29 29 29 29 29 29 29
 0a0c340 15 15 29 29 29 51 51 51 51 51 51 51 29 29 29 29
 0a0c350 29 29 29 29 29 2b 2b 2b 29 29 29 29 29 29 29 29
 0a0c360 29 29 29 2b 2b 2b 2a 2a 2b 2b 51 51 51 51 51 51
@@ -211393,7 +211393,7 @@
 0ed4610 29 29 29 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b
 0ed4620 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b
 0ed4630 2b 2b 2b 2b 2b 2b 2b 29 29 29 29 29 29 29 29 29
-0ed4640 29 29 2b 29 29 2b 29 29 29 29 29 29 29 29 2b 2b
+0ed4640 29 29 2b 29 2b 2b 29 29 29 29 29 29 29 29 2b 2b
 0ed4650 2b 29 29 29 29 29 29 2b 2b 2b 2b 2b 2b 2b 2b 2b
 0ed4660 29 2b 29 2b 2b 29 2b 2b 2b 2b 2b 2b 2b 2b 29 29
 0ed4670 2b 29 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 29 29
@@ -321140,7 +321140,7 @@
 14d35e0 29 29 29 29 29 29 29 29 29 29 29 29 29 29 2b 2b
 14d35f0 2b 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29
 14d3600 29 29 29 29 2b 2b 2b 2b 2a 2b 2b 2b 2b 29 29 15
-14d3610 2b 2a 2b 2b 34 2b 2b 2b 2b 2b 2a 2b 2b 5a 2a 2a
+14d3610 2b 2a 2b 2b 34 2b 2b 2b 2b 2b 2a 2b 5a 5a 2a 2a
 14d3620 2b 2b 29 29 29 29 2b 29 29 29 29 2b 2b 2a 2a 2a
 14d3630 2b 29 2b 2b 29 29 29 29 29 2b 2b 29 29 29 29 29
 14d3640 29 2b 2b 2b 15 15 2b 2b 34 47 47 2a 2b 2b 29 29
@@ -492760,7 +492760,7 @@
 1e0b650 2b 2a 2a 2b 2b 2b 2b 2a 2a 2a 15 15 2b 2b 2a 2a
 1e0b660 2a 2b 2b 2b 2b 2a 2a 2b 2a 15 2b 2b 2b 29 2b 29
 1e0b670 29 2b 29 2b 2b 2b 2b 29 29 29 2b 2b 29 2a 2a 2a
-1e0b680 2a 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2a
+1e0b680 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2a
 1e0b690 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b
 1e0b6a0 2b 2b 2a 2a 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 29 29
 1e0b6b0 15 15 2b 29 2b 2b 2b 2b 2b 2b 2b 15 2a 2a 2a 0b

With this verification, I am satisfied that using gdal_translate and gdalwarp -t_srs 'EPSG:5070' to convert the incoming ERDAS IMAGINE files into GeoTiffs ready for ingesting works correctly and does not lose any data along the way.

@rajadain
Member Author

In order to get our ETL tool working again, I had to install Scala 2.11 and Spark v2.xx.

Also, I had to ensure that both AWS_PROFILE=azavea-datahub and AWS_REGION=us-east-1 were explicitly set as environment variables.
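Concretely, that amounts to exporting both before running the ETL (values as given above):

```shell
# Required so the ETL tool picks up the right AWS credentials and region
export AWS_PROFILE=azavea-datahub
export AWS_REGION=us-east-1
```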

@rajadain
Member Author

I decided to paint the visual tiles via GDAL, since it is more reliable and does not create the artifacts with EPSG:5070 that GeoTrellis used to. That issue has since been fixed in more recent versions of GeoTrellis, but upgrading the ETL pipeline would be a large effort.

I made a t3.medium EC2 instance in the azavea-datahub account, and added 128GB of SSD. There, I downloaded the NLCD Full Zip. Then, I created this script:

#!/bin/sh

set -ex

# Example Usage: ./make-tiles.sh 2016

# Unzip the relevant files for the given year only (quote the pattern so
# the shell does not expand it before unzip sees it)
unzip NLCD_Land_Cover_L48_20190424_full_zip.zip "NLCD_$1*"

# Reproject to EPSG:3857 (Web Mercator), the projection used on web maps
docker run -v $PWD:/data -w /data --rm -ti osgeo/gdal:alpine-normal-3.3.0 gdalwarp -t_srs 'EPSG:3857' NLCD_$1_Land_Cover_L48_20190424.img nlcd-$1-30m.img

# Convert the color-mapped raster to RGBA
docker run -v $PWD:/data -w /data --rm -ti osgeo/gdal:alpine-normal-3.3.0 gdal_translate -of vrt -expand rgba nlcd-$1-30m.img nlcd-$1-30m.vrt

# Paint tiles, from zoom levels 0-13
mkdir nlcd-$1-30m
docker run -v $PWD:/data -w /data --rm -ti osgeo/gdal:alpine-normal-3.3.0 gdal2tiles.py --zoom=0-13 --resampling=near --resume --xyz nlcd-$1-30m.vrt nlcd-$1-30m/

# Upload tiles to S3
aws s3 sync --exclude "*.aux.xml" --acl public-read nlcd-$1-30m/ s3://tiles.us-east-1.azavea.com/nlcd-$1-30m/

# Delete all working files
# NOTE: This needs to be sudo because the tiles created from within docker are owned by `root`
sudo rm -rf NLCD_$1_Land_Cover_L48* nlcd-$1-30m*

Note: I went with running GDAL via Docker because it was hard to find a recent enough native build for Ubuntu whose gdal2tiles.py supports the --xyz parameter, which is only available since GDAL 3.1.

Then, I ran the script for each year:

$ ./make-tiles.sh 2016
$ ./make-tiles.sh 2011
$ ./make-tiles.sh 2006
$ ./make-tiles.sh 2001

I ran these scripts inside a tmux session so I could log out of the server and keep them running in the background. It takes roughly 8-10 hours to fully process one of these versions.
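For a sense of scale (a back-of-the-envelope estimate, not a measurement from the run): a full XYZ pyramid for zooms 0 through 13 holds the sum of 4^z tiles per zoom worldwide. CONUS covers only a fraction of that, but it is still a very large number of PNGs per year, which is consistent with the multi-hour runtime.

```shell
# Upper bound on XYZ tile count for zooms 0-13 (4^z tiles per zoom level)
awk 'BEGIN { s = 0; for (z = 0; z <= 13; z++) s += 4 ^ z; print s }'   # 89478485
```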

@rajadain
Member Author

rajadain commented Jun 24, 2021

Using the work in this branch, https://github.com/azavea/civic-apps-etl/pull/15, I created a new c4.2xlarge EC2 instance with 8 CPU cores and 16GB of RAM, installed Scala 2.11, Spark 2.4.8, jq, awscli, and docker.io on it, downloaded the NLCD full zip, and then ran the following script:

#!/bin/bash

set -ex

# Unzip the relevant files for the given year only (quote the pattern so
# the shell does not expand it before unzip sees it)
unzip NLCD_Land_Cover_L48_20190424_full_zip.zip "NLCD_$1*"

# Reproject to EPSG:5070 (Conus Albers), the projection we use for analysis in MMW
docker run -v $PWD:/data -w /data --rm -ti osgeo/gdal:alpine-normal-3.3.0 gdalwarp -t_srs 'EPSG:5070' NLCD_$1_Land_Cover_L48_20190424.img dev/ca-etl/local-test-data/nlcd-$1-30m-epsg5070.tif

# Run the ingest
pushd dev/ca-etl
# make mmw-create-jar  # I had made the JAR on my local and copied it to the server to skip this step
make YEAR=$1 mmw-ingest-nlcd
popd

# Delete all working files
rm -rf NLCD_$1_Land_Cover* dev/ca-etl/local-test-data/* dev/ca-etl/local-test-catalogs/*

Then ran it with:

$ ./make-rdds.sh 2016
$ ./make-rdds.sh 2011
$ ./make-rdds.sh 2006
$ ./make-rdds.sh 2001

@rajadain
Member Author

The RDDs have successfully been ingested and are available in the datahub:

❯ aws --profile=azavea-datahub s3 ls s3://datahub-catalogs-us-east-1/ | grep epsg5070-512-byte
                           PRE nlcd-2001-30m-epsg5070-512-byte/
                           PRE nlcd-2006-30m-epsg5070-512-byte/
                           PRE nlcd-2011-30m-epsg5070-512-byte/
                           PRE nlcd-2016-30m-epsg5070-512-byte/

As are the tiles:

❯ aws --profile=azavea-datahub s3 ls s3://tiles.us-east-1.azavea.com/ | grep 'nlcd.*30m'
                           PRE nlcd-2001-30m/
                           PRE nlcd-2006-30m/
                           PRE nlcd-2011-30m/
                           PRE nlcd-2016-30m/

rajadain added a commit that referenced this issue Jun 28, 2021