## mlhub_datadownload.ipynb

Andrew Burt - a.burt@ucl.ac.uk

### Overview

This notebook can be used to download crop type data (i.e., labelled ground and satellite imagery data) from Radiant MLHub (https://dashboard.mlhub.earth), and prepare these data for subsequent processing and visualisation (e.g., to work with [datadashboard](datadashboard.ipynb)).

The user must have registered an account with MLHub and generated an API key, which is entered when the cell below is run. The dictionary "datavariables" must also be defined, with paths to the directories where the data and metadata will be stored.

In [1]:
import os
import getpass
datavariables = {
                  "datadir":"../data/",
                  "metadatadir":"../data/metadata/"
                }
key = getpass.getpass()
os.environ["MLHUB_API_KEY"] = key
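
The directories named in "datavariables" need to exist before anything is written to them. The following is a minimal sketch of one way to create them, not part of the original workflow: the helper name `ensure_directories` is hypothetical, and the example runs in a temporary location so that it can be tried without touching "../data/".

```python
import os
import tempfile

def ensure_directories(variables):
    """Create every directory named in `variables` (a dict of name -> path)."""
    for path in variables.values():
        os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return variables

# Demonstration in a throwaway location (an assumption for illustration);
# in this notebook, `datavariables` would be passed instead.
root = tempfile.mkdtemp()
example = {
    "datadir": os.path.join(root, "data", ""),                  # trailing separator kept,
    "metadatadir": os.path.join(root, "data", "metadata", ""),  # as in datavariables
}
ensure_directories(example)
```

The trailing separator matters here because later cells build file paths by concatenating a directory and a file name inside an f-string.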



### Datasets

The datasets stored on MLHub are accessed here through the Python client (radiant-mlhub). The following cell uses this package to list the available crop type datasets.

In [2]:
import radiant_mlhub
datasets = radiant_mlhub.client.list_datasets()
for ds in datasets:
    if "crops" in ds["id"]:
        print(ds["id"])

ref_african_crops_kenya_02
ref_african_crops_uganda_01
ref_african_crops_tanzania_01
ref_african_crops_kenya_01
su_african_crops_ghana
su_african_crops_south_sudan


### Collections

MLHub refers to the labelled ground and satellite imagery data as labels and source imagery, respectively. The labels and source imagery comprising each dataset are separated into collections. The collections contained in the dataset 'su_african_crops_ghana' are enumerated below: the four available collections contain the labels, and the Planet, Sentinel-1 and Sentinel-2 source imagery.

In [3]:
import radiant_mlhub
dataset = "su_african_crops_ghana"
collections = radiant_mlhub.client.list_collections()
for collection in collections:
    if dataset in collection["id"]:
        print(collection["id"])

su_african_crops_ghana_labels
su_african_crops_ghana_source_planet
su_african_crops_ghana_source_s1
su_african_crops_ghana_source_s2
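
The naming pattern in this output (a dataset id followed by "_labels" or "_source_...") can be used to separate labels from source imagery programmatically. The sketch below assumes that pattern holds for the dataset of interest; the helper name `split_collections` and the id "some_other_dataset_labels" are hypothetical. It matches on the dataset id as a prefix, which is slightly stricter than the substring test used above.

```python
def split_collections(collection_ids, dataset):
    """Group the collection ids of one dataset into labels and source imagery."""
    groups = {"labels": [], "source": []}
    prefix = dataset + "_"
    for cid in collection_ids:
        if not cid.startswith(prefix):
            continue  # belongs to a different dataset
        suffix = cid[len(prefix):]
        if suffix == "labels":
            groups["labels"].append(cid)
        elif suffix.startswith("source_"):
            groups["source"].append(cid)
    return groups

# Ids copied from the output above, plus one hypothetical id from another dataset.
ids = [
    "su_african_crops_ghana_labels",
    "su_african_crops_ghana_source_planet",
    "su_african_crops_ghana_source_s1",
    "su_african_crops_ghana_source_s2",
    "some_other_dataset_labels",
]
groups = split_collections(ids, "su_african_crops_ghana")
print(groups["labels"])
print(groups["source"])
```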


### Downloading collections

Collections can be downloaded in their entirety in the form of a tarball. However, if the user has limited resources, or is running in Binder, [datadashboard](datadashboard.ipynb) requires only the labels to be downloaded (items within the source imagery collections can be downloaded individually on demand, as discussed in the following section).

In [4]:
import glob
import tarfile
# Download the labels; uncomment the lines below to also download the Sentinel-1 and Sentinel-2 source imagery
radiant_mlhub.client.download_archive("su_african_crops_ghana_labels",datavariables["datadir"],overwrite=False)
#radiant_mlhub.client.download_archive("su_african_crops_ghana_source_s1",datavariables["datadir"],overwrite=False)
#radiant_mlhub.client.download_archive("su_african_crops_ghana_source_s2",datavariables["datadir"],overwrite=False)
# Unpack every downloaded tarball belonging to this dataset into the data directory
tarballs = glob.glob(f'{datavariables["datadir"]}{dataset}_*.tar.gz')
for tarball in tarballs:
    with tarfile.open(tarball,"r:gz") as tar:
        tar.extractall(path=datavariables["datadir"])

5.0M [00:01,  2.90M/s]                
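
Since `extractall` writes whatever paths an archive contains, a more defensive variant of the unpacking step can check each member first. The following is a sketch under stated assumptions rather than what this notebook does: the helper name `extract_tarball` is hypothetical, it only checks member names (it does not inspect symbolic links inside the archive), and the demonstration builds its own small archive in a temporary directory.

```python
import os
import tarfile
import tempfile

def extract_tarball(tarball, destination):
    """Extract `tarball` into `destination`, refusing members that would land outside it."""
    destination = os.path.realpath(destination)
    with tarfile.open(tarball, "r:gz") as tar:
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(destination, member.name))
            if os.path.commonpath([destination, target]) != destination:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(path=destination)

# Demonstration with a throwaway archive (an assumption for illustration).
workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, "sample.txt")
with open(sample, "w") as fp:
    fp.write("hello")
archive = os.path.join(workdir, "sample.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(sample, arcname="sample/sample.txt")
outdir = os.path.join(workdir, "out")
os.makedirs(outdir)
extract_tarball(archive, outdir)
```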


### Metadata

#### Items inside a collection

[datadashboard](datadashboard.ipynb) requires knowledge of the items available inside each collection (e.g., a labelled ground tile or Sentinel-2 image). The following commented-out code retrieves the metadata of each item inside a collection, and stores them in a single JSON file per collection in the metadata directory.

Note: this is time consuming, and has already been completed for the collections "su_african_crops_ghana_labels", "su_african_crops_ghana_source_s1" and "su_african_crops_ghana_source_s2" in the dataset "su_african_crops_ghana". The resulting files are provided as tarballs in the metadata directory, and it is necessary to unpack them (as done by the uncommented code below).

In [5]:
#import json
#for i in range(len(collections)):
#    if dataset in collections[i]["id"]:
#        it = radiant_mlhub.client.list_collection_items(collections[i]["id"],limit=None)
#        items = list(it)
#        for j in range(len(items)):
#            items[j]["assets"].pop("documentation",None)
#        with open(f'{datavariables["metadatadir"]}{collections[i]["id"]}.json',"w") as fp:
#            json.dump(items,fp,separators=(",",":"))            
import glob
import tarfile
# Unpack the pre-generated metadata tarballs for this dataset
tarballs = glob.glob(f'{datavariables["metadatadir"]}{dataset}_*.tar.gz')
for tarball in tarballs:
    with tarfile.open(tarball,"r:gz") as tar:
        tar.extractall(path=datavariables["metadatadir"])
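
Once unpacked, each metadata file holds a JSON list of item dictionaries, as written by the commented-out code above. The following sketch shows one way such a file might be read back and indexed by item id. It assumes every item carries an "id" key; the helper name `load_items`, the file name and the two example items are hypothetical and stand in for a real metadata file.

```python
import json
import os
import tempfile

def load_items(path):
    """Read a JSON list of item dictionaries and index it by item id."""
    with open(path) as fp:
        items = json.load(fp)
    return {item["id"]: item for item in items}

# Hypothetical stand-in for a metadata file, written in the same compact style.
example_items = [
    {"id": "item_a", "assets": {"labels": {"href": "hypothetical_a.tif"}}},
    {"id": "item_b", "assets": {"labels": {"href": "hypothetical_b.tif"}}},
]
path = os.path.join(tempfile.mkdtemp(), "example_collection.json")
with open(path, "w") as fp:
    json.dump(example_items, fp, separators=(",", ":"))

index = load_items(path)
print(len(index), "items")
```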

#### Label crop types

The labels collection (i.e., the labelled ground data) comprises N tiles, in which each pixel value encodes a crop type (for the dataset "su_african_crops_ghana", see https://doi.org/10.34911/rdnt.ry138p). A list of dictionaries is required to link pixel value to crop type. Here, this list is output to a single JSON file in the metadata directory. An additional "colour" pair has been included to provide a unique and distinct RGB value for each crop type (the palette was generated on https://mokole.com/palette.html).

In [6]:
import json
labelids = [
            {"crop":"unknown","id":0,"colour":"#000000"},
            {"crop":"ground nut","id":1,"colour":"#00008b"},
            {"crop":"maize","id":2,"colour":"#daa520"},
            {"crop":"rice","id":3,"colour":"#8b008b"},
            {"crop":"soya bean","id":4,"colour":"#ff4500"},
            {"crop":"yam","id":5,"colour":"#ffff00"},
            {"crop":"intercrop","id":6,"colour":"#00ff00"},
            {"crop":"sorghum","id":7,"colour":"#00fa9a"},
            {"crop":"okra","id":8,"colour":"#dc143c"},
            {"crop":"cassava","id":9,"colour":"#00bfff"},
            {"crop":"millet","id":10,"colour":"#0000ff"},
            {"crop":"tomato","id":11,"colour":"#ff00ff"},
            {"crop":"cowpea","id":12,"colour":"#1e90ff"},
            {"crop":"sweet potato","id":13,"colour":"#db7093"},
            {"crop":"babala beans","id":14,"colour":"#eee8aa"},
            {"crop":"salad vegetables","id":15,"colour":"#ff1493"},
            {"crop":"bra and ayoyo","id":16,"colour":"#808080"},
            {"crop":"watermelon","id":17,"colour":"#556b2f"},
            {"crop":"zabla","id":18,"colour":"#483d8b"},
            {"crop":"nili","id":19,"colour":"#008000"},
            {"crop":"kpalika","id":20,"colour":"#9acd32"},
            {"crop":"cotton","id":21,"colour":"#20b2aa"},
            {"crop":"akata","id":22,"colour":"#ffa07a"},
            {"crop":"nyenabe","id":23,"colour":"#ee82ee"},
            {"crop":"pepper","id":24,"colour":"#e6e6fa"} 
           ]
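# Write the list, "colour" pairs included, as compact JSON
# (skipkeys expects a boolean and does not filter keys by name, so it is omitted)
with open(f'{datavariables["metadatadir"]}{dataset}_labels_id.json',"w") as fp:
    json.dump(labelids,fp,separators=(",",":"))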
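
For visualisation, the list above is typically turned into a lookup from pixel value to crop name and colour. The following is a minimal sketch of that step, using three entries copied from the list; the helper names `hex_to_rgb` and `build_lookup` are hypothetical and not part of [datadashboard](datadashboard.ipynb).

```python
def hex_to_rgb(colour):
    """Convert a "#rrggbb" string into an (r, g, b) tuple of integers."""
    colour = colour.lstrip("#")
    return tuple(int(colour[i:i + 2], 16) for i in (0, 2, 4))

def build_lookup(labelids):
    """Map each pixel value to its crop name and RGB colour."""
    return {
        entry["id"]: (entry["crop"], hex_to_rgb(entry["colour"]))
        for entry in labelids
    }

# Three entries copied from the list above.
sample = [
    {"crop": "unknown", "id": 0, "colour": "#000000"},
    {"crop": "ground nut", "id": 1, "colour": "#00008b"},
    {"crop": "maize", "id": 2, "colour": "#daa520"},
]
lookup = build_lookup(sample)
print(lookup[2])  # ('maize', (218, 165, 32))
```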