# Raw data

## Overview

In this project we use the following raw data:

* *Corine Land Cover* (CLC) from 2018 and information about the class nomenclature.

* Sentinel-2 grid

* Harmonized Landsat Sentinel-2 (HLS)

This notebook describes how the raw data is collected and created.
All data described here is stored in the *data/raw* folder.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import geopandas as gpd
import pandas as pd
from pathlib import Path
from shapely import wkt
from urllib.request import urlretrieve

import nasa_hls

from src import configs
prjconf = configs.ProjectConfigParser()

We will prepare the data for the following tiles in this notebook:

In [2]:
tilenames = prjconf.get("Params", "tiles").split(" ")
tilenames

['32UNU', '32UPU', '32UQU', '33UUP']
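
The tile list is stored as a single space-separated string in the project configuration. As a rough illustration, such a value could be parsed with the standard-library `configparser` as sketched below. The section and option names mirror the call above, but the INI text is a made-up stand-in for the real configuration file, and `ProjectConfigParser` may work differently internally.

```python
from configparser import ConfigParser

# Hypothetical stand-in for the project's configuration file.
EXAMPLE_INI = """
[Params]
tiles = 32UNU 32UPU 32UQU 33UUP
"""


def read_tiles(ini_text):
    """Return the tile names stored as a whitespace-separated option."""
    parser = ConfigParser()
    parser.read_string(ini_text)
    # split() without arguments tolerates repeated spaces and line breaks,
    # unlike split(" ").
    return parser.get("Params", "tiles").split()


tiles = read_tiles(EXAMPLE_INI)
print(tiles)
```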

## Create raw data

### CLC

We downloaded the *Corine Land Cover Raster - 100m* dataset manually (registration required) from the [Copernicus Land Monitoring Service](https://land.copernicus.eu/pan-european/corine-land-cover/clc2018?tab=download) and extracted it into *data/raw/clc/clc2018_clc2018_v2018_20b2_raster100m*.
The most important file is *data/raw/clc/clc2018_clc2018_v2018_20b2_raster100m/clc2018_clc2018_V2018.20b2.tif*.

We copied the CLC legend from the *CORINE LAND COVER LEGEND* table found on the [nomenclature site of clc.gios.gov.pl](http://clc.gios.gov.pl/index.php/9-gorne-menu/clc-informacje-ogolne/58-klasyfikacja-clc-2), pasted it into LibreOffice Calc and saved it as a ';'-separated CSV file under *data/raw/clc/clc_legend_raw.csv*.

The following cell creates an extended legend in which the empty cells are filled and the level 1 to 3 class IDs are added. Once the cell has been executed, the result can be found under *data/raw/clc/clc_legend.csv*.

In [3]:
path__clc_legend_raw = prjconf.get_path("Raw", "rootdir") / "clc" / "clc_legend_raw.csv"
path__clc_legend = prjconf.get_path("Raw", "rootdir") / "clc" / "clc_legend.csv"

if not path__clc_legend.exists():
    # Keep only the 44 class rows of the raw legend.
    clc_legend = pd.read_csv(path__clc_legend_raw, delimiter=";").iloc[0:44, :]
    clc_legend.columns = ["l1_name", "l2_name", "l3_name", "grid_code", "rgb"]
    # Derive the numeric IDs from the leading "x.y.z" code of the level-3 name.
    clc_legend_ids = clc_legend["l3_name"].str[:5].str.split(".", expand=True)
    clc_legend["l1_id"] = clc_legend_ids[0].astype("uint8")
    clc_legend["l2_id"] = (clc_legend_ids[0] + clc_legend_ids[1]).astype("uint8")
    clc_legend["l3_id"] = (clc_legend_ids[0] + clc_legend_ids[1] + clc_legend_ids[2]).astype("int")
    # Strip the leading codes from the class names.
    clc_legend["l1_name"] = clc_legend["l1_name"].str[3:]
    clc_legend["l2_name"] = clc_legend["l2_name"].str[4:]
    clc_legend["l3_name"] = clc_legend["l3_name"].str[6:]
    # Fill the empty cells with the last value above them.
    # ffill() replaces fillna(method="ffill"), which is deprecated in recent pandas.
    clc_legend = clc_legend.ffill()
    clc_legend.to_csv(path__clc_legend, index=False)
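
To make the ID derivation in the cell above easier to follow, here is a minimal plain-Python sketch of the same string handling for a single level-3 label. The helper name `clc_ids` is hypothetical, and the label format (`x.y.z` followed by one space and the class name) is an assumption inferred from the slicing indices used in the cell, not checked against the raw legend file.

```python
def clc_ids(l3_label):
    """Split a level-3 label into (l1_id, l2_id, l3_id, name)."""
    # The first five characters hold the "x.y.z" code, the name starts at index 6.
    code, name = l3_label[:5], l3_label[6:]
    l1, l2, l3 = code.split(".")
    # The IDs are built by concatenating the digits, e.g. "3" + "1" + "2" -> 312.
    return int(l1), int(l1 + l2), int(l1 + l2 + l3), name


print(clc_ids("3.1.2 Coniferous forest"))
# → (3, 31, 312, 'Coniferous forest')
```

Level-3 IDs are three-digit numbers and can exceed 255, which is why the cell stores `l3_id` as `int` rather than `uint8`.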

Fast access to important file paths:

In [4]:
print(prjconf.get_path("Raw", "clc"))
print(prjconf.get_path("Raw", "clc_legend"))

/home/ben/Devel/Projects/classify-hls/data/raw/clc/clc2018_clc2018_v2018_20b2_raster100m/clc2018_clc2018_V2018.20b2.tif
/home/ben/Devel/Projects/classify-hls/data/raw/clc/clc_legend.csv


### Tile grid

We download the Sentinel-2 grid in the following cell.
The link to this grid file comes from the [bencevans/sentinel-2-grid GitHub project](https://github.com/bencevans/sentinel-2-grid).

In [5]:
url = 'https://unpkg.com/sentinel-2-grid/data/grid.json'
path__tile_grid = prjconf.get_path("Raw", "tile_grid")
path__tile_grid.parent.mkdir(exist_ok=True, parents=True)
if not path__tile_grid.exists():
    urlretrieve(url, path__tile_grid)
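
The download cell follows a "fetch only if missing" pattern, so re-running the notebook does not download the file again. A reusable version of that pattern might look like the sketch below. The helper `download_if_missing` and its `fetch` parameter are hypothetical and not part of the project code; `fetch` defaults to `urlretrieve` and exists only so the function can be tried without network access.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, path, fetch=urlretrieve):
    """Download url to path unless the file already exists.

    Returns True if a download was triggered and False otherwise.
    """
    path = Path(path)
    if path.exists():
        return False
    # Create the target folder first, as the notebook cell does.
    path.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, path)
    return True


# Usage (not executed here because it needs network access):
# download_if_missing("https://unpkg.com/sentinel-2-grid/data/grid.json",
#                     path__tile_grid)
```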

From this file we create a separate footprint file for each tile we want to process.
This is a good starting point for using Snakemake later.

In [6]:
overwrite = False  # set to True to recreate existing footprint files

footprints_exist = [prjconf.get_path("Raw", "tile_footprint", tile).exists() for tile in tilenames]
if overwrite or not all(footprints_exist):
    tile_grid = gpd.read_file(path__tile_grid)
    for tile in tilenames:
        path__tile_footprint = prjconf.get_path("Raw", "tile_footprint", tile)
        if not Path(path__tile_footprint).exists() or overwrite:
            # Select the grid row of this tile and reproject it to the tile's UTM zone.
            footprint = tile_grid[tile_grid["name"] == tile]
            footprint = footprint.to_crs(epsg=footprint["epsg"].values[0])
            # Replace the reprojected geometry with the UTM footprint stored as WKT.
            footprint["geometry"] = footprint["utmWkt"].apply(wkt.loads)
            Path(path__tile_footprint).parent.mkdir(parents=True, exist_ok=True)
            footprint.to_file(path__tile_footprint, driver="GPKG")
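
The selection step of the cell above (one footprint per tile, picked by the `name` attribute) can also be illustrated without GeoPandas. The sketch below uses only the standard library and a tiny made-up FeatureCollection. It does not reproject or write GeoPackage files, and the helper `split_by_tile` is hypothetical.

```python
import json

# Tiny made-up FeatureCollection; the real grid file is far larger and its
# features carry geometries and more properties.
GRID_JSON = """{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"name": "32UNU"}, "geometry": null},
  {"type": "Feature", "properties": {"name": "32UPU"}, "geometry": null},
  {"type": "Feature", "properties": {"name": "31UGQ"}, "geometry": null}
]}"""


def split_by_tile(grid, tilenames):
    """Map each requested tile name to a one-tile FeatureCollection."""
    wanted = set(tilenames)
    per_tile = {}
    for feature in grid["features"]:
        name = feature["properties"]["name"]
        if name in wanted:
            per_tile[name] = {"type": "FeatureCollection", "features": [feature]}
    return per_tile


per_tile = split_by_tile(json.loads(GRID_JSON), ["32UNU", "32UPU"])
print(sorted(per_tile))
```

Keeping one file per tile means each tile can later be processed independently of the others.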

Fast access to important parameters and file paths:

In [7]:
print(prjconf.get_path("Raw", "tile_grid"))

for tile in tilenames:
    print(prjconf.get_path("Raw", "tile_footprint", tile))

/home/ben/Devel/Projects/classify-hls/data/raw/footprints/tiles/tiles_grid.geojson
/home/ben/Devel/Projects/classify-hls/data/raw/footprints/tiles/footprint_32UNU.gpkg
/home/ben/Devel/Projects/classify-hls/data/raw/footprints/tiles/footprint_32UPU.gpkg
/home/ben/Devel/Projects/classify-hls/data/raw/footprints/tiles/footprint_32UQU.gpkg
/home/ben/Devel/Projects/classify-hls/data/raw/footprints/tiles/footprint_33UUP.gpkg


### HLS

We download data from the [Harmonized Landsat Sentinel-2 (HLS) Product](https://hls.gsfc.nasa.gov/) with the [nasa_hls Python package](https://benmack.github.io/nasa_hls/build/html/index.html) in the following cell.

In [8]:
for tile in tilenames:
    df_datasets = nasa_hls.get_available_datasets(products=["L30"],
                                                  years=[2018],
                                                  tiles=[tile],
                                                  return_list=False)
    print(f"Number of scenes queried for tile {tile}: {df_datasets.shape[0]}")
    dir__hls_tile = prjconf.get_path("Raw", "hls_tile", tile=tile)
    nasa_hls.download_batch(dir__hls_tile, df_datasets)
    
    path__hls_tile_lut = prjconf.get_path("Raw", "hls_tile_lut", tile=tile)
    if not path__hls_tile_lut.exists():
        hdf_files = list(dir__hls_tile.rglob("*.hdf"))
        df = nasa_hls.dataframe_from_hdf_paths(hdf_files)
        df["tile"] = tile
        df.to_csv(path__hls_tile_lut, index=False)
    else:
        pass 
        # TODO: it would be good to check if there are new files and rewrite the csv ONLY if this is the case 

100%|██████████| 1/1 [00:01<00:00,  1.33s/it]
100%|██████████| 67/67 [00:00<00:00, 2801.14it/s]


Number of scenes queried for tile 32UNU: 67


100%|██████████| 1/1 [00:01<00:00,  1.29s/it]
100%|██████████| 67/67 [00:00<00:00, 2839.37it/s]


Number of scenes queried for tile 32UPU: 67


100%|██████████| 1/1 [00:00<00:00,  1.01it/s]
100%|██████████| 49/49 [00:00<00:00, 2901.08it/s]


Number of scenes queried for tile 32UQU: 49


100%|██████████| 1/1 [00:01<00:00,  1.08s/it]
  0%|          | 0/67 [00:00<?, ?it/s]

Number of scenes queried for tile 33UUP: 67


100%|██████████| 67/67 [04:00<00:00,  3.59s/it] 
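
The TODO in the cell above asks for the lookup table to be rewritten only when new files have arrived. One possible way to detect that is sketched below: it compares the HDF file names on disk with those listed in the CSV. The helper `lut_is_stale` is hypothetical, and the column name `path` is an assumption, since the columns returned by `nasa_hls.dataframe_from_hdf_paths` are not shown in this notebook.

```python
import csv
from pathlib import Path


def lut_is_stale(hdf_dir, lut_csv, path_column="path"):
    """Return True if the HDF files on disk differ from those listed in the CSV."""
    on_disk = {p.name for p in Path(hdf_dir).rglob("*.hdf")}
    lut_csv = Path(lut_csv)
    if not lut_csv.exists():
        # No lookup table yet, so it has to be written in any case.
        return True
    with lut_csv.open(newline="") as f:
        # path_column is an assumed column name holding the HDF file paths.
        listed = {Path(row[path_column]).name for row in csv.DictReader(f)}
    return on_disk != listed
```

In the download loop, the condition `if not path__hls_tile_lut.exists():` could then be replaced by `if lut_is_stale(dir__hls_tile, path__hls_tile_lut):`, provided the assumed column exists.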


Fast access to important directories:

In [13]:
for tile in tilenames:
    print(prjconf.get_path("Raw", "hls_tile", tile))
    print(prjconf.get_path("Raw", "hls_tile_lut", tile=tile))

/home/ben/Devel/Projects/classify-hls/data/raw/hls/32UNU
/home/ben/Devel/Projects/classify-hls/data/raw/hls/hls_32UNU_lut.csv
/home/ben/Devel/Projects/classify-hls/data/raw/hls/32UPU
/home/ben/Devel/Projects/classify-hls/data/raw/hls/hls_32UPU_lut.csv
/home/ben/Devel/Projects/classify-hls/data/raw/hls/32UQU
/home/ben/Devel/Projects/classify-hls/data/raw/hls/hls_32UQU_lut.csv
/home/ben/Devel/Projects/classify-hls/data/raw/hls/33UUP
/home/ben/Devel/Projects/classify-hls/data/raw/hls/hls_33UUP_lut.csv
