<a href="https://colab.research.google.com/github/Vizzuality/soils-revealed-data/blob/master/soils_revealed_get_baseline_soc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prepare data for the soils-revealed project

https://github.com/Vizzuality/soils-revealed-data

`Edward P. Morris (vizzuality.)`

## Description
This notebook downloads the baseline (2000) soils layer from a WCS source.

```
MIT License

Copyright (c) 2020 Vizzuality

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```

# Setup

Instructions for setting up the computing environment.

In [0]:
%%bash
# Remove sample_data
rm -r sample_data

## Linux dependencies

Instructions for adding Linux system packages (including node, etc.).

In [3]:
# Packages for projections and geospatial processing
!apt install -q -y libspatialindex-dev libproj-dev proj-data proj-bin libgeos-dev

Reading package lists...
Building dependency tree...
Reading state information...
proj-data is already the newest version (4.9.3-2).
proj-data set to manually installed.
The following additional packages will be installed:
  libspatialindex-c4v5 libspatialindex4v5
Suggested packages:
  libgdal-doc
The following NEW packages will be installed:
  libgeos-dev libproj-dev libspatialindex-c4v5 libspatialindex-dev
  libspatialindex4v5 proj-bin
0 upgraded, 6 newly installed, 0 to remove and 31 not upgraded.
Need to get 860 kB of archives.
After this operation, 5,014 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libgeos-dev amd64 3.6.2-1build2 [73.1 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libspatialindex4v5 amd64 1.8.5-5 [219 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libspatialindex-c4v5 amd64 1.8.5-5 [51.7 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libproj-dev amd64 4

In [4]:
# Fix for curl certificates (rasterio virtual connectors)
# RasterioIOError: CURL error: error setting certificate verify locations:   CAfile: /etc/pki/tls/certs/ca-bundle.crt   CApath: none
!apt install ca-certificates
#!export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
!mkdir -p /etc/pki/tls/certs
!cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt
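# NOTE: `!export` runs in a temporary subshell, so the variable does not persist; the copy above is the effective fix
!export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt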
!ls /etc/pki/tls/certs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
ca-certificates is already the newest version (20180409).
0 upgraded, 0 newly installed, 0 to remove and 31 not upgraded.
ca-bundle.crt


## Python packages

Consider pinning package versions so that the environment stays reproducible.

`!pip install -q <package-name>`
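One way to act on the advice about versions is to record what is currently installed and reuse it as pinned requirements. The following is a minimal sketch using only the standard library; the helper names `installed_versions` and `as_requirements` are hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def as_requirements(versions):
    """Format the installed packages as pinned `name==version` lines."""
    return [f"{n}=={v}" for n, v in sorted(versions.items()) if v is not None]


# Packages that are not installed are skipped rather than raising an error
found = installed_versions(["xarray", "rioxarray", "zarr"])
print("\n".join(as_requirements(found)))
```

The printed lines can be pasted into a `!pip install -q` call, for example `!pip install -q "xarray==<version>"`, so that a later run installs the same versions.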

In [5]:
# connect to Google cloud storage, WebDAV
!pip install -q gcsfs webdavclient3

  Building wheel for webdavclient3 (setup.py) ... done


In [6]:
# geospatial tools
!pip install -q country-converter geopandas owslib

  Building wheel for country-converter (setup.py) ... done


In [7]:
# xarray and Zarr tools
!pip install -q cftime netcdf4 nc-time-axis zarr xarray xclim rioxarray regionmask sparse xarray-extras

  Building wheel for zarr (setup.py) ... do

In [8]:
# Show python package versions
!pip list

Package                  Version        
------------------------ ---------------
absl-py                  0.9.0          
affine                   2.3.0          
alabaster                0.7.12         
albumentations           0.1.12         
altair                   4.1.0          
asciitree                0.3.3          
asgiref                  3.2.7          
astor                    0.8.1          
astropy                  4.0.1.post1    
astunparse               1.6.3          
atari-py                 0.2.6          
atomicwrites             1.4.0          
attrs                    19.3.0         
audioread                2.1.8          
autograd                 1.3            
Babel                    2.8.0          
backcall                 0.1.0          
beautifulsoup4           4.6.3          
bleach                   3.1.5          
blis                     0.4.1          
bokeh                    1.4.0          
boltons                  20.1.0         
boto            

## Authorisation

Setting up connections and authorisation to cloud services.

### Google Cloud

This can be done in the URL or via adding service account credentials.

If you do not share the notebook, you can mount your Drive and transfer the credentials to disk. Note that if the notebook is shared, you always need to authenticate via URL.
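Before activating a service account it can help to confirm that the credentials file is the expected kind of key. The sketch below assumes the standard layout of a Google Cloud service-account key file (a JSON object with `type`, `project_id`, `private_key` and `client_email`); the function name `check_service_account_key` is hypothetical, and the demo uses a dummy key written to a temporary directory.

```python
import json
import os
import tempfile

# Fields a service-account key file is expected to carry (assumed standard layout)
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(path, expected_project=None):
    """Return a list of problems found in the key file (empty if none)."""
    with open(path, "r") as f:
        key = json.load(f)
    problems = [f"missing field: {k}" for k in REQUIRED_FIELDS if k not in key]
    if key.get("type") != "service_account":
        problems.append(f"unexpected type: {key.get('type')!r}")
    if expected_project and key.get("project_id") != expected_project:
        problems.append(f"unexpected project: {key.get('project_id')!r}")
    return problems


# Demo with a dummy key; the values are placeholders, not real credentials
with tempfile.TemporaryDirectory() as tmp:
    dummy_path = os.path.join(tmp, "dummy-key.json")
    with open(dummy_path, "w") as f:
        json.dump({"type": "service_account", "project_id": "soc-platform",
                   "private_key": "placeholder", "client_email": "placeholder"}, f)
    print(check_service_account_key(dummy_path, expected_project="soc-platform"))
```

An empty list means the file looks like a service-account key for the expected project; it does not prove that the key is valid or still active.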

In [0]:
# Set the Google Cloud project id
project_id = 'soc-platform'
gc_creds = "soc-platform-6a9bf204638c.json"
username = "edward-morris-vizzuality@soc-platform.iam.gserviceaccount.com"
gcs_prefix = "gs://vizz-data-transfer"

In [0]:
# For auth WITHOUT service account
# https://cloud.google.com/resource-manager/docs/creating-managing-projects
#from google.colab import auth
#auth.authenticate_user()
#!gcloud config set project {project_id}

In [0]:
# If the notebook is shared
#from google.colab import drive
#drive.mount('/content/drive')

In [0]:
# If Drive is mounted, copy GC credentials to home (place in your GDrive, and connect Drive)
!cp "/content/drive/My Drive/{gc_creds}" "/root/.{gc_creds}"

In [13]:
# Auth WITH service account
!gcloud auth activate-service-account {username} --key-file=/root/.{gc_creds} --project={project_id}


Activated service account credentials for: [edward-morris-vizzuality@soc-platform.iam.gserviceaccount.com]


In [14]:
# Test GC auth
!gsutil ls {gcs_prefix}

gs://vizz-data-transfer/SOC_maps/
gs://vizz-data-transfer/land-cover/
gs://vizz-data-transfer/movie-tiles/


# Utils

Generic helper functions used in the subsequent processing. For easy navigation, each function is separated into its own section named after the function.

## copy_gcs

In [0]:
import os
import subprocess

def copy_gcs(source_list, dest_list, opts=""):
  """
  Use gsutil to copy each corresponding item in source_list
  to dest_list.

  Example:
  copy_gcs(["gs://my-bucket/data-file.csv"], ["."])

  """
  for s, d in zip(source_list, dest_list):
    cmd = f"gsutil -m cp -r {opts} {s} {d}"
    print(f"Processing: {cmd}")
    # subprocess.call returns the exit code of the command (0 means success)
    r = subprocess.call(cmd, shell=True)
    if r == 0:
      print("Copy succeeded")
    else:
      print("Copy failed")
  print("Finished copy")
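A possible variant of `copy_gcs` builds the command as an argument list, so that paths containing spaces are not split by the shell, and reports which sources failed. This is a sketch, not the notebook's method: the names `build_gsutil_cp` and `copy_gcs_checked` and the `dry_run` parameter are assumptions introduced here.

```python
import shlex
import subprocess


def build_gsutil_cp(source, dest, opts=""):
    """Build the gsutil copy command as an argument list (no shell involved)."""
    return ["gsutil", "-m", "cp", "-r", *shlex.split(opts), source, dest]


def copy_gcs_checked(source_list, dest_list, opts="", dry_run=False):
    """Copy each source to its destination; return the sources that failed."""
    failed = []
    for s, d in zip(source_list, dest_list):
        cmd = build_gsutil_cp(s, d, opts)
        print("Processing:", " ".join(shlex.quote(c) for c in cmd))
        if dry_run:
            # Only show the command; nothing is executed
            continue
        if subprocess.run(cmd).returncode != 0:
            failed.append(s)
    return failed


# Dry run: prints the command without needing gsutil to be installed
copy_gcs_checked(["gs://my-bucket/data-file.csv"], ["."], dry_run=True)
```

Without `dry_run=True` the function needs `gsutil` on the `PATH`; if it is missing, `subprocess.run` raises `FileNotFoundError` instead of returning an exit code.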

## get_image_meta

In [0]:
# Get image metadata
# FIXME is MultiPolygon the correct way to store footprint?
# FIXME can the footprint be simplified?
# FIXME what happens with complex NA image structures?
import json
import os
from collections import namedtuple

import rasterio

def get_image_meta(uri_prefix, file_path, gcs_creds=False):
  """
  Get image metadata

  Uses rasterio, which can handle virtual file systems
  
  Arguments
  ---------
  uri_prefix : str
      The uri prefix string.  
  file_path : str
      The file path of the asset.
  gcs_creds : str
      Path to GCS credentials JSON. #FIXME! 

  Returns
  ------- 
  A named tuple of image properties formatted for earthengine manifest
  """
  # Set GCS creds
  if gcs_creds:
    # FIXME!
    with open("adc.json", "r") as f:
      gcs_creds = json.load(f)
    #print(gcs_creds)
    os.environ['GS_SECRET_ACCESS_KEY'] = gcs_creds.get('client_secret')
    os.environ['GS_ACCESS_KEY_ID'] = gcs_creds.get('client_id')
    #print(os.environ['GS_ACCESS_KEY_ID'])
  
    # hack for ubuntu/debian
    os.environ['CURL_CA_BUNDLE'] = "/etc/ssl/certs/ca-certificates.crt"

  # Make path
  p = os.path.join(uri_prefix,file_path)
  #print(p)
  
  # Get metadata
  with rasterio.open(p, 'r') as src:
    
    # band info
    Band = namedtuple('Band', ['idx', 'dtype', 'description', 'units', 'nodata'])
    bands = [Band(idx=tmp1, dtype=tmp2.swapcase(), description=tmp3, units=tmp4, nodata=tmp5) for tmp1, tmp2, tmp3, tmp4, tmp5 in zip(
    src.indexes, src.dtypes, src.descriptions, src.units, src.nodatavals)]

    # affine transform
    def get_affine_transform(tmp2):
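      # FIXME check mapping! to_gdal() returns the coefficients in GDAL order:
      # (translate_x, scale_x, shear_x, translate_y, shear_y, scale_y),
      # which is not the order that the labels below assume.
      AffineTransform = namedtuple('AffineTransform', ["scale_x", "shear_x", "translate_x", "shear_y", "scale_y", "translate_y"])
      return AffineTransform(**{
          "scale_x": tmp2[0],
          "shear_x": tmp2[1],
          "translate_x": tmp2[2],
          "shear_y": tmp2[3],
          "scale_y": tmp2[4],
          "translate_y": tmp2[5]})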
    aft = get_affine_transform([tmp2 for tmp2 in src.transform.to_gdal()])
    
    # crs
    CRS = namedtuple('CRS', ['epsg'])
    crs = CRS(epsg=src.crs.to_epsg())
    
    # tags
    tags = src.tags()

    # media type (driver)
    mt = src.driver
    
  # The with block above has already closed the dataset
  imageMeta = namedtuple('imageMeta', ["bands", "affine_transform", "crs", "properties", "mtype"])
  return imageMeta(bands=bands, affine_transform=aft, crs=crs, properties=tags, mtype=mt)

# Examples
# --------
# Singleband image
#uri_prefix = "https://storage.googleapis.com/skydipper-water-quality"
#file_path = "cloud-masks/SENTINEL_2_reference_cloud_masks_Baetens_Hagolle/S2A_MSIL1C_20161221T085352_N0204_R107_T33KWP_20161221T091140.tif"
#print(get_image_meta(uri_prefix, file_path))

# Multiband image
#uri_prefix = "https://storage.googleapis.com/skydipper_materials"
#file_path = "gee_data/L7_AOI_00.tif"
#print(get_image_meta(uri_prefix, file_path))
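The `FIXME check mapping!` note in `get_image_meta` concerns how the six coefficients returned by `to_gdal()` map onto the named fields. GDAL orders a geotransform as (x origin, pixel width, row rotation, y origin, column rotation, pixel height). The sketch below shows one way to relabel a tuple in that order; `affine_from_gdal` is a hypothetical helper, and the sample tuple uses the six values printed for the SoilGrids layers in this notebook.

```python
from collections import namedtuple

AffineTransform = namedtuple(
    "AffineTransform",
    ["scale_x", "shear_x", "translate_x", "shear_y", "scale_y", "translate_y"],
)


def affine_from_gdal(gt):
    """Relabel a GDAL-ordered geotransform as named affine coefficients."""
    # GDAL order: x origin, pixel width, row rotation,
    #             y origin, column rotation, pixel height
    translate_x, scale_x, shear_x, translate_y, shear_y, scale_y = gt
    return AffineTransform(
        scale_x=scale_x,
        shear_x=shear_x,
        translate_x=translate_x,
        shear_y=shear_y,
        scale_y=scale_y,
        translate_y=translate_y,
    )


# Geotransform values as reported for the 250 m SoilGrids layers
gt = (-19949750.0, 250.0, 0.0, 8361000.0, 0.0, -250.0)
print(affine_from_gdal(gt))
```

With this labelling the pixel size appears as `scale_x=250.0` and `scale_y=-250.0`, and the upper-left corner as `translate_x` and `translate_y`, which matches the 250 m resolution of the source data.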


## print_ds_size

In [0]:
def print_ds_size(ds):
  print('ds size in GB {:0.2f}\n'.format(ds.nbytes / 1e9))
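The size that `print_ds_size` reports can be cross-checked by hand, since the uncompressed size of an array is the product of its dimensions times the bytes per value. The sketch below uses the dimensions reported for the dataset built later in this notebook (one band, y: 58034, x: 159246, five `uint16` variables); `estimate_nbytes` is a hypothetical helper.

```python
def estimate_nbytes(shape, itemsize, n_vars=1):
    """Uncompressed size in bytes of n_vars arrays with the given shape."""
    n_values = 1
    for length in shape:
        n_values *= length
    return n_values * itemsize * n_vars


# (band, y, x) shape, 2 bytes per uint16 value, 5 data variables
size = estimate_nbytes((1, 58034, 159246), itemsize=2, n_vars=5)
print('estimated size in GB {:0.2f}'.format(size / 1e9))
# prints: estimated size in GB 92.42
```

The coordinate arrays add only a few megabytes on top of this, so the estimate agrees with `ds.nbytes` to two decimal places.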

# Processing

Data processing organised into sections.

## Get datasets
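The cells in this section build each remote path by joining `uri_prefix` and a file name with `os.path.join`. That works for these inputs on Linux, but on POSIX systems the prefix is dropped when the second part starts with `/`, and the separator depends on the operating system. A minimal alternative is sketched below; `join_uri` is a hypothetical helper, not part of the original notebook.

```python
import posixpath


def join_uri(prefix, path):
    """Join a URI prefix and a relative path with exactly one slash."""
    return prefix.rstrip("/") + "/" + path.lstrip("/")


prefix = "https://files.isric.org/soilgrids/data/recent/ocs"

# Both forms give the same URL
print(join_uri(prefix, "ocs_0-30cm_mean.vrt"))
print(join_uri(prefix, "/ocs_0-30cm_mean.vrt"))

# For comparison: POSIX path joining discards the prefix for an absolute path
print(posixpath.join(prefix, "/ocs_0-30cm_mean.vrt"))
```

The last line prints only `/ocs_0-30cm_mean.vrt`, which is why a leading slash in the file path is worth avoiding when `os.path.join` is used for URLs.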

In [20]:
import os
import subprocess

# Examine datasets
uri_prefix = "https://files.isric.org/soilgrids/data/recent/ocs"
file_paths = ["ocs_0-30cm_mean.vrt", "ocs_0-30cm_Q0.5.vrt", "ocs_0-30cm_Q0.05.vrt", "ocs_0-30cm_Q0.95.vrt", "ocs_0-30cm_uncertainty.vrt"]  

for fp in file_paths:
  print(get_image_meta(uri_prefix, fp, gcs_creds=False))
  p = os.path.join(uri_prefix, fp)
  cmd = f"rio info {p}"
  r = subprocess.getoutput(cmd)
  print(r)

imageMeta(bands=[Band(idx=1, dtype='INT16', description=None, units=None, nodata=-32768.0)], affine_transform=AffineTransform(scale_x=-19949750.0, shear_x=250.0, translate_x=0.0, shear_y=8361000.0, scale_y=0.0, translate_y=-250.0), crs=CRS(epsg=None), properties={'Code_version': 'v2.0.0', 'Covariates': 'clm_wcl_srdyrsum,clm_wcl_s04rad,clm_wcl_p10tot,veg_mod_eviyravg,clm_wcl_p09tot,veg_mod_nppy15,mor_env_demm,clm_wcl_s08rad,clm_wcl_srdyrstd,clm_wcl_s07rad,clm_wcl_bio19,clm_wcl_p05tot,clm_wcl_bio17,mor_mrg_twi,clm_mod_lstd03std,clm_mod_lstd04std,clm_wcl_bio16,clm_wcl_bio08,clm_mod_lstd09std,clm_mod_lstd11std,clm_wcl_bio18,mor_mrg_vdp,clm_wcl_p04tot,clm_mod_lstd05std,clm_mod_lstd12std,clm_mod_lstd10std,clm_mod_lstd02std,clm_wcl_p12tot,clm_mod_lstd01std,veg_mod_evient,veg_mod_evirng,veg_mod_evimax,veg_mod_evievn,luc_gfc_trely10,mor_mrg_vbf', 'Litter_layers': 'FALSE', 'Model': 'Quantile Regression Forests', 'Model_type': 'ranger', 'Mtry': '15', 'Number_trees': '200', 'Other_synth_profiles':

In [38]:
# Create dataset (note this just creates the structure, ie., does not get data yet)
import xarray as xr
import rioxarray as rxr
import os
import pandas as pd

# Time coordinate
# check if this should be CF defined?
time = pd.to_datetime(['2000-01-01'])

# Variable attributes
# see https://www.nodc.noaa.gov/data/formats/netcdf/v2.0/ for guide to attributes 
attrs_list = [
{
    "standard_name": "mean_of_mass_content_of_organic_carbon_in_zero_to_thirty_centimeter_depth_soil_layer",
    "long_name": "Soil organic content (mean, 0-30cm)",
    "units": "t/ha",
    "description": "The mean vertically integrated mass content (or stock) [M/A] of organic carbon in the 0 to 30 cm depth soil layer."
},
{
    "standard_name": "median_of_mass_content_of_organic_carbon_in_zero_to_thirty_centimeter_depth_soil_layer",
    "long_name": "Soil organic content (median, 0-30cm)",
    "units": "t/ha",
    "description": "The median of vertically integrated mass content (or stock) [M/A] of organic carbon in the 0 to 30 cm depth soil layer."
},
{
    "standard_name": "five_percent_quantile_of_mass_content_of_organic_carbon_in_zero_to_thirty_centimeter_depth_soil_layer",
    "long_name": "Soil organic content (Q0.05, 0-30cm)",
    "units": "t/ha",
    "description": "The 0.05 quantile of vertically integrated mass content (or stock) [M/A] of organic carbon in the 0 to 30 cm depth soil layer."
},
{
    "standard_name": "ninety_five_percent_quantile_of_mass_content_of_organic_carbon_in_zero_to_thirty_centimeter_depth_soil_layer",
    "long_name": "Soil organic content (Q0.95, 0-30cm)",
    "units": "t/ha",
    "description": "The 0.95 quantile of vertically integrated mass content (or stock) [M/A] of organic carbon in the 0 to 30 cm depth soil layer."
},
{
    "standard_name": "uncertainty_of_mass_content_of_organic_carbon_in_zero_to_thirty_centimeter_depth_soil_layer",
    "long_name": "Soil organic content (uncertainty, 0-30cm)",
    "units": "t/ha",
    "description": "The uncertainty of vertically integrated mass content (or stock) [M/A] of organic carbon in the 0 to 30 cm depth soil layer."
}]

# Dataset attributes
attrs = {
    "Conventions": "CF-1.6",
    "title": "Global gridded soil properties",
    "description": "Global spatial distribution of soil properties mapped using state-of-the-art machine learning methods.",
    "institution": "ISRIC",
    "license" : "[CC-BY 4.0 License](https://creativecommons.org/licenses/by/4.0/)",
    "comment" : "SoilGrids is a system for global digital soil mapping that uses state-of-the-art machine learning methods to map the spatial distribution of soil properties across the globe. SoilGrids prediction models are fitted using over 230 000 soil profile observations from the WoSIS database and a series of environmental covariates. Covariates were selected from a pool of over 400 environmental layers from Earth observation derived products and other environmental information including climate, land cover and terrain morphology. The outputs of SoilGrids are global soil property maps at six standard depth intervals (according to the GlobalSoilMap IUSS working group and its specifications) at a spatial resolution of 250 meters. Prediction uncertainty is quantified by the lower and upper limits of a 90% prediction interval. The SoilGrids maps are publicly available under the [CC-BY 4.0 License](https://creativecommons.org/licenses/by/4.0/).",
    "history": "",
    "references" : "[Article describing soil profiles](https://doi.org/10.5194/essd-12-299-2020) , [Soil grids platform](https://soilgrids.org/)"

}

def assign_name(da, name):
  da.name = name
  return da 

# create dataset
ds = xr.merge(\
[\
 rxr.open_rasterio(os.path.join(uri_prefix, fp)).rename(os.path.splitext(fp)[0]).assign_attrs(at)\
 for fp, at in zip(file_paths, attrs_list)\
 ])\
.assign_coords({'time':time})\
.assign_attrs(attrs)\
.chunk({'x':512, 'y':512, 'band':-1})

print_ds_size(ds)
with xr.set_options(display_width=120):
  print(ds)

ds size in GB 92.42

<xarray.Dataset>
Dimensions:                 (band: 1, time: 1, x: 159246, y: 58034)
Coordinates:
  * band                    (band) int64 1
  * y                       (y) float64 8.361e+06 8.361e+06 8.36e+06 8.36e+06 ... -6.147e+06 -6.147e+06 -6.147e+06
  * x                       (x) float64 -1.995e+07 -1.995e+07 -1.995e+07 -1.995e+07 ... 1.986e+07 1.986e+07 1.986e+07
    spatial_ref             int64 0
  * time                    (time) datetime64[ns] 2000-01-01
Data variables:
    ocs_0-30cm_mean         (band, y, x) uint16 dask.array<chunksize=(1, 512, 512), meta=np.ndarray>
    ocs_0-30cm_Q0.5         (band, y, x) uint16 dask.array<chunksize=(1, 512, 512), meta=np.ndarray>
    ocs_0-30cm_Q0.05        (band, y, x) uint16 dask.array<chunksize=(1, 512, 512), meta=np.ndarray>
    ocs_0-30cm_Q0.95        (band, y, x) uint16 dask.array<chunksize=(1, 512, 512), meta=np.ndarray>
    ocs_0-30cm_uncertainty  (band, y, x) uint16 dask.array<chunksize=(1, 512, 512), meta