# World cereal


The aim of this notebook is to generate a Q1 dataset from the [World Cereal](https://zenodo.org/records/7593734) dataset, ready to be ingested into the EOTDL.


## Legends


To interpret the land cover (LC), crop type (CT) and irrigation (IRR) codes of each feature, we need to download and normalize the legend, which is stored [here](https://zenodo.org/records/7584463). You can skip this section if you have already formatted the legends. By default, they are stored in the `legends` folder.


In [2]:
import pandas as pd

df = pd.read_excel(
    "https://zenodo.org/records/7584463/files/WorldCereal_LC_CT_IRR_legends.xlsx",
    sheet_name="Legend",
)
df.head()

Unnamed: 0,LAND COVER,Unnamed: 1,Unnamed: 2,Unnamed: 3,CROP TYPE,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,IRRIGATION,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15
0,,Name,Final Values,,Level 0,Level 1,Level 2,Name,Final Values,Landcover,,Level 0,Level 1,Level 2,Name,Final Values
1,0.0,No information,0,,0,,,Unknown,0,0,,0,,,no information,0
2,10.0,Cropland,10,,1000,,,Cereals,1000,11,,100,,,rainfed,100
3,11.0,Annual cropland,11,,,1100,,Wheat,1100,11,,200,,,irrigated,200
4,12.0,Perennial cropland,12,,,,1110,Winter wheat,1110,11,,,210,,fully irrigated,210


In [4]:
from src.utils import curate_legend

land_cover_df = curate_legend(df, 0, 3)
crop_type_df = curate_legend(df, 4, 9)
irrigation_df = curate_legend(df, 11, 16)
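
`curate_legend` lives in `src/utils` and is not shown in this notebook. Judging only from the outputs below, it appears to select a positional range of columns from the raw sheet and promote the first row to the header. A minimal sketch under that assumption follows; the name `curate_legend_sketch` and the exact clean-up steps are hypothetical, not the project's implementation.

```python
import pandas as pd


def curate_legend_sketch(df: pd.DataFrame, start: int, end: int) -> pd.DataFrame:
    """Hypothetical stand-in for src.utils.curate_legend (assumed behaviour)."""
    # Keep only the columns of one legend block, e.g. 0:3 for land cover.
    legend = df.iloc[:, start:end].copy()
    # The first row of the sheet holds the real column names.
    legend.columns = legend.iloc[0].tolist()
    legend = legend.iloc[1:]
    # Drop rows that are completely empty for this legend block.
    return legend.dropna(how="all")


# Tiny synthetic frame mimicking the layout of the "Legend" sheet.
raw = pd.DataFrame(
    [
        [None, "Name", "Final Values", None, "Level 0"],
        [0.0, "No information", 0, None, 0],
        [10.0, "Cropland", 10, None, 1000],
        [None, None, None, None, None],
    ]
)
land_cover = curate_legend_sketch(raw, 0, 3)
print(land_cover)
```

The `end` index is exclusive here, which would match calls such as `curate_legend(df, 4, 9)` returning five columns.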

In [5]:
land_cover_df.head()

Unnamed: 0,NaN,Name,Final Values
1,0,No information,0
2,10,Cropland,10
3,11,Annual cropland,11
4,12,Perennial cropland,12
5,13,Grassland *,13


In [6]:
crop_type_df.head()

Unnamed: 0,Level 0,Level 1,Level 2,Name,Final Values
1,0.0,,,Unknown,0
2,1000.0,,,Cereals,1000
3,,1100.0,,Wheat,1100
4,,,1110.0,Winter wheat,1110
5,,,1120.0,Spring wheat,1120


In [7]:
irrigation_df.head()

Unnamed: 0,Level 0,Level 1,Level 2,Name,Final Values
1,0.0,,,no information,0
2,100.0,,,rainfed,100
3,200.0,,,irrigated,200
4,,210.0,,fully irrigated,210
5,,,213.0,fully irrigated - surface,213


Now that our legends are normalized as DataFrames, we can save them for later use. By default, they are saved in the `legends` folder.


In [17]:
from os import makedirs

makedirs("legends", exist_ok=True)
land_cover_df.to_csv("legends/land_cover.csv", index=False)
crop_type_df.to_csv("legends/crop_type.csv", index=False)
irrigation_df.to_csv("legends/irrigation.csv", index=False)
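
Once saved, a legend can be reloaded to translate the numeric codes stored in each feature into readable names. The sketch below is self-contained: it writes a few rows copied from the land cover output above to a temporary folder instead of `legends`, and the helper name `load_code_names` is an assumption made for illustration.

```python
import tempfile
from os.path import join

import pandas as pd


def load_code_names(csv_path: str) -> dict:
    """Map each 'Final Values' code to its 'Name' (hypothetical helper)."""
    legend = pd.read_csv(csv_path)
    return dict(zip(legend["Final Values"].astype(int), legend["Name"]))


# Rows copied from the land cover legend shown earlier in this notebook.
sample = pd.DataFrame(
    {
        "Name": ["No information", "Cropland", "Annual cropland"],
        "Final Values": [0, 10, 11],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = join(tmp_dir, "land_cover.csv")
    sample.to_csv(path, index=False)
    land_cover_names = load_code_names(path)

print(land_cover_names[11])
```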

## Q1 generation


Let's generate the Q1 dataset. First of all, we should create a new STAC catalog or use an existing one if we want to append new collections.


In [28]:
import pystac

catalog = pystac.Catalog(id="world-cereal", description="World Cereal Catalog")

If you already have a STAC catalog for the dataset, uncomment the following lines and set the path to your catalog.


In [34]:
# catalog = pystac.Catalog.from_file(
#     "world_cereal/catalog.json"
# )  # set your catalog path here

In [35]:
catalog

By default, this notebook iterates over every unzipped dataset folder stored in the `data` folder, but this can be changed. Every `.zip` file from [here](https://zenodo.org/records/7593734) should therefore be downloaded and unzipped there beforehand.


In [36]:
from os.path import join, isdir
from os import listdir

data_dir = "data"  # Change this to your data directory
world_cereal_dirs = [
    join(data_dir, dir) for dir in listdir(data_dir) if isdir(join(data_dir, dir))
]
world_cereal_dirs[:5]

['data/COPERNICUS-GEOGLAM']

The approach is as follows. First, we generate a STAC collection for every `.zip` file. Then, instead of generating a STAC item for each shapefile, we generate a STAC subcollection for each shapefile, holding its assets (such as the `PDF` and `XLSX` files) and its metadata (such as country and year). Finally, we generate a STAC item for each feature in the shapefile, with its properties: geometry, LC, CT, IRR, and so on.

An example of what the dataset structure would look like:

- world_cereal.json (catalog)

  - AAFC Crop Inventory.json (collection, this is an example .zip from the dataset)

    - 2016_CAN_AAFC-ACIGTD_POINT_110.json (subcollection, this is a shapefile, and the PDF and XLSX files would be referenced here)
      - feature_1.json (item, this is a feature from the shapefile)
      - feature_2.json
      - feature_3.json
      - …
      - (up to +500k)
    - 2017_CAN_AAFC-ACIGTD_POINT_110.json (subcollection, another shapefile)
      - feature_1.json
      - …

  - LPIS_2017_BE_Flanders_full_POLY_110.json (collection, another example .zip)
    - 2017_BE_Flanders_full_POLY_110.json (subcollection)
      - feature_1.json
      - ….

> Important: we export the shapefiles as Parquet files, both to speed up the process and to be able to ingest them into the EOTDL, as a shapefile is made up of several files.


In [37]:
from glob import glob
from os.path import basename, join, splitext, exists
import geopandas as gpd
from src.utils import (
    save_shapefiles_as_parquet,
    get_files_extent,
    generate_stac_item,
    XLSX_MEDIA_TYPE,
)


for dir in world_cereal_dirs:
    collection_id = basename(dir)
    if catalog.get_child(collection_id) is not None:
        print(f"Skipping {collection_id} as it already exists")
        continue

    shapefiles = glob(join(dir, "*.shp"))
    if len(shapefiles) == 0:
        print(f"Skipping {dir} as no shapefiles found")
        continue

    # Export the shapefiles to parquet files to speed up the process
    # and to be able to ingest them into the EOTDL, as a shapefile is
    # made up of several files
    parquet_files = save_shapefiles_as_parquet(shapefiles)

    # Add collection to the catalog
    spatial_extent, temporal_extent = get_files_extent(parquet_files)
    extent = pystac.Extent(
        spatial=pystac.SpatialExtent(bboxes=spatial_extent),
        temporal=pystac.TemporalExtent(intervals=[temporal_extent]),
    )

    collection = pystac.Collection(
        id=collection_id, description=collection_id, extent=extent
    )
    catalog.add_child(collection)

    # Generate a subcollection for every parquet file
    for file in parquet_files:
        file_name = splitext(basename(file))[0]
        file_gdf = gpd.read_parquet(file)
        spatial_extent, temporal_extent = get_files_extent([file])
        extent = pystac.Extent(
            spatial=pystac.SpatialExtent(bboxes=spatial_extent),
            temporal=pystac.TemporalExtent(intervals=[temporal_extent]),
        )
        # Create properties
        country = file_name.split("_")[1]
        year = file_name.split("_")[0]
        properties = {
            "country": country,
            "year": year,
        }
        # Create collection
        file_collection = pystac.Collection(
            id=file_name, description=file_name, extent=extent, extra_fields=properties
        )
        # Add PDF and XLSX assets
        # Important: if no PDF shares the shapefile's name but the folder
        # contains exactly one PDF, that PDF is used instead
        pdf = file.replace(".parquet", ".pdf")
        additional_pdf_files = glob(join(dir, "*.pdf"))
        if len(additional_pdf_files) == 1 and not exists(pdf):
            pdf = additional_pdf_files[0]
        if exists(pdf):
            file_collection.add_asset(
                "pdf",
                pystac.Asset(
                    href=pdf,
                    media_type=pystac.MediaType.PDF,
                    title="PDF",
                ),
            )
        xlsx = file.replace(".parquet", ".xlsx")
        if exists(xlsx):
            file_collection.add_asset(
                "xlsx",
                pystac.Asset(
                    href=xlsx,
                    media_type=XLSX_MEDIA_TYPE,
                    title="XLSX",
                ),
            )
        collection.add_child(file_collection)

        # Create a STAC item for every feature and add it to the subcollection
        for _, feature in file_gdf.iterrows():
            item = generate_stac_item(feature, file_name)
            file_collection.add_item(item)
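
The `country` and `year` properties above rely on every shapefile name starting with `<year>_<country>_`, as in the example names listed earlier. A small sketch that makes this assumption explicit and fails loudly on unexpected names; `parse_file_name` is a hypothetical helper, not part of `src.utils`:

```python
def parse_file_name(file_name: str) -> dict:
    """Extract year and country from a name such as 2016_CAN_AAFC-ACIGTD_POINT_110."""
    parts = file_name.split("_")
    # Assumed convention: a four-digit year first, then the country code.
    if len(parts) < 2 or not (parts[0].isdigit() and len(parts[0]) == 4):
        raise ValueError(f"Unexpected file name format: {file_name!r}")
    return {"year": parts[0], "country": parts[1]}


print(parse_file_name("2016_CAN_AAFC-ACIGTD_POINT_110"))
print(parse_file_name("2017_BE_Flanders_full_POLY_110"))
```

A check like this could be useful because a name that does not follow the convention would otherwise silently produce wrong metadata.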

Now we can save the catalog to our desired location.


In [38]:
catalog.normalize_and_save(
    root_href="world_cereal", catalog_type=pystac.CatalogType.SELF_CONTAINED
)