# Creating datasets


You can leverage EOTDL tools to create a brand new dataset. Here we cover:

1. **Data exploration**: given an area of interest, query the Sentinel data available for your dataset.
2. **Data access**: download your data for creating the dataset.
3. **Data preparation**: clean your data, perform feature engineering, data analysis, labelling, etc.

Once your dataset is ready, you can ingest it into EOTDL, as shown in the previous notebook, and start working with it like any other dataset in the repository.


## Exploration


First of all, let's explore the area of interest (AoI) that we have selected for this workshop. In this case we have chosen the [Boadella reservoir](https://es.wikipedia.org/wiki/Embalse_de_Darnius_Boadella) in Catalonia, Spain, whose geometry is stored in the data folder as `example_data/boadella.geojson`. Here we use [leafmap](https://leafmap.org/) to visualize it, but feel free to use your preferred solution.


In [1]:
# !pip install leafmap

In [1]:
import leafmap
import geopandas as gpd

in_geojson = "example_data/boadella.geojson"
gdf = gpd.read_file(in_geojson)

centroid_coords = gdf["geometry"].centroid
centroid = [
    centroid_coords.y.values[0],
    centroid_coords.x.values[0],
]  # We are going to use the centroid later

m = leafmap.Map(center=centroid, zoom=13)
m.add_geojson(in_geojson, layer_name="Boadella reservoir")
m

Map(center=[42.347577325903515, 2.815024677909404], controls=(ZoomControl(options=['position', 'zoom_in_text',…

When creating AI-Ready datasets it is common to work at a fixed resolution. You can either retrieve full scenes and cut patches, or use EOTDL functionality to generate appropriate bounding boxes. Since we want every image in the dataset to be 512x512 pixels, we take the centroid that we extracted from the GeoJSON and generate a bounding box that yields a 512x512 pixel image at 10 m resolution, since we are going to use Sentinel data.


In [2]:
from eotdl.tools import bbox_from_centroid

boadella_bbox = bbox_from_centroid(
    x=centroid[0], y=centroid[1], pixel_size=10, width=512, height=512
)
boadella_bbox

[2.784022776094264, 42.324467423078886, 2.8460492944612303, 42.37067879125418]
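
To get a feel for these numbers: a 512x512 image at 10 m per pixel covers 5120 m per side, so the box extends 2560 m from the centroid in each direction. The following is a rough sketch of that arithmetic, assuming a spherical approximation of about 111,320 m per degree of latitude. It is not EOTDL's implementation, and `approx_bbox_from_centroid` is a hypothetical helper, so expect small differences from the output above.

```python
import math

METERS_PER_DEGREE_LAT = 111_320.0  # rough spherical approximation (assumption)


def approx_bbox_from_centroid(lat, lon, pixel_size=10, width=512, height=512):
    """Return [min_lon, min_lat, max_lon, max_lat] around a centroid (hypothetical helper)."""
    half_width_m = width * pixel_size / 2
    half_height_m = height * pixel_size / 2
    # A degree of longitude shrinks with the cosine of the latitude
    meters_per_degree_lon = METERS_PER_DEGREE_LAT * math.cos(math.radians(lat))
    dlat = half_height_m / METERS_PER_DEGREE_LAT
    dlon = half_width_m / meters_per_degree_lon
    return [lon - dlon, lat - dlat, lon + dlon, lat + dlat]


# Centroid of the Boadella reservoir, rounded from the map output above
approx_bbox = approx_bbox_from_centroid(42.347577, 2.815025)
print(approx_bbox)
```

The result should agree with `boadella_bbox` to roughly three decimal places, which is enough to sanity-check the order of the coordinates (longitude first, then latitude).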

Let's visualize the bounding box on a map!


In [4]:
from eotdl.tools import bbox_to_polygon

# Create a polygon from the bbox
boadella_polygon = bbox_to_polygon(boadella_bbox)
# Create a GeoDataFrame from the polygon
gdf = gpd.GeoDataFrame(geometry=[boadella_polygon])
# Save the bounding box as a GeoJSON file so that it can be added to the map
gdf.to_file(
    "example_data/boadella_bbox.geojson", driver="GeoJSON"
)

m.add_geojson("example_data/boadella_bbox.geojson", layer_name="Boadella bbox")
m

Map(bottom=776066.0, center=[42.347577325903515, 2.815024677909404], controls=(ZoomControl(options=['position'…

Now that we have our desired bounding box, we can look for available Sentinel-2 imagery within it. This can be done through the EOTDL.

First, we can check which collections are supported in the EOTDL:


In [5]:
from eotdl.access import SUPPORTED_COLLECTION_IDS

SUPPORTED_COLLECTION_IDS

['sentinel-1-grd',
 'sentinel-2-l1c',
 'sentinel-2-l2a',
 'dem',
 'hls',
 'landsat-ot-l2',
 'landsat-ot-l1',
 'landsat-tm-l2',
 'landsat-tm-l1']

If we want to look for available Sentinel-2 imagery in our AoI, we must define a range of dates in which to search for the images. We have already defined a time interval for this workshop, which is in the `example_data/dates.csv` file.


In [8]:
import csv

dates = list()
with open("example_data/dates.csv", "r") as file:
    reader = csv.reader(file)
    for row in reader:
        dates.append(row[0])
dates.sort()

dates[:5]

['2020-01-13', '2020-01-28', '2020-02-02', '2020-06-21', '2020-09-14']

Although we have the specific dates, we are going to search the entire time interval as a demonstration.


> We use Sentinel Hub under the hood, so you will need appropriate credentials. You can generate them automatically from your user [profile](https://www.eotdl.com/profile) by accepting the terms and conditions. When you log in to the EOTDL, via the library or CLI, we retrieve and store this information for you, so you don't need to worry about it. However, there are a couple of gotchas: <br><br> 1. If you already have a Sentinel Hub account with the same email as your EOTDL account, you will need to retrieve the credentials from the Sentinel Hub Dashboard and set them as environment variables. <br> 2. The credentials generated via EOTDL may expire after some time (we are working on this). If this happens, let us know in Discord so we can fix the issue. <br><br> In any case, you can provide your own credentials by setting the appropriate environment variables: `SH_CLIENT_ID` and `SH_CLIENT_SECRET`.
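
If you provide your own credentials, it can help to check that both variables are set before searching or downloading. The snippet below is a minimal sketch using only the standard library; `missing_sentinel_hub_vars` is a hypothetical helper and not part of EOTDL.

```python
import os

REQUIRED_VARS = ("SH_CLIENT_ID", "SH_CLIENT_SECRET")


def missing_sentinel_hub_vars(env=None):
    """Return the names of required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_sentinel_hub_vars()
if missing:
    print("Missing Sentinel Hub credentials:", ", ".join(missing))
```

Inside a notebook, the variables can be set with `os.environ["SH_CLIENT_ID"] = "..."` and `os.environ["SH_CLIENT_SECRET"] = "..."` before calling the access functions.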


In [9]:
from eotdl.access import search_sentinel_imagery

time_interval = (dates[0], dates[-1])

r = search_sentinel_imagery(time_interval, boadella_bbox, "sentinel-2-l2a")
response = list(r)
response[:5]

[{'id': 'S2B_MSIL2A_20220601T103629_N0400_R008_T31TDG_20220601T135543',
  'properties': {'datetime': '2022-06-01T10:49:26Z', 'eo:cloud_cover': 0.23}},
 {'id': 'S2B_MSIL2A_20220601T103629_N0400_R008_T31TDH_20220601T135543',
  'properties': {'datetime': '2022-06-01T10:49:14Z', 'eo:cloud_cover': 12.82}},
 {'id': 'S2A_MSIL2A_20220527T103631_N0400_R008_T31TDG_20220527T183616',
  'properties': {'datetime': '2022-05-27T10:49:34Z', 'eo:cloud_cover': 85.6}},
 {'id': 'S2A_MSIL2A_20220527T103631_N0400_R008_T31TDH_20220527T183616',
  'properties': {'datetime': '2022-05-27T10:49:19Z', 'eo:cloud_cover': 30.42}},
 {'id': 'S2B_MSIL2A_20220522T103629_N0400_R008_T31TDG_20220522T124154',
  'properties': {'datetime': '2022-05-22T10:49:27Z', 'eo:cloud_cover': 12.99}}]

The results make sense: consecutive acquisitions are 5 days apart, which matches the [revisit time](https://docs.sentinel-hub.com/api/latest/data/sentinel-2-l2a/#basic-facts) of Sentinel-2.

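As a final step, let's check how many results the search returned. Note that this counts scenes, so a date covered by two tiles (such as `T31TDG` and `T31TDH` above) appears twice.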


In [10]:
print(len(response))

342
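
Because several results can share a date, the number of distinct acquisition dates is smaller than the number of results. The sketch below derives the unique dates and the gaps between them from entries shaped like the search output above. The sample entries are copied from that output, `acquisition_dates` is a hypothetical helper, and the timestamp format is assumed to always match the one shown.

```python
from datetime import datetime

# Sample entries copied from the search output above
sample_results = [
    {"id": "S2B_MSIL2A_20220601T103629_N0400_R008_T31TDG_20220601T135543",
     "properties": {"datetime": "2022-06-01T10:49:26Z", "eo:cloud_cover": 0.23}},
    {"id": "S2B_MSIL2A_20220601T103629_N0400_R008_T31TDH_20220601T135543",
     "properties": {"datetime": "2022-06-01T10:49:14Z", "eo:cloud_cover": 12.82}},
    {"id": "S2A_MSIL2A_20220527T103631_N0400_R008_T31TDG_20220527T183616",
     "properties": {"datetime": "2022-05-27T10:49:34Z", "eo:cloud_cover": 85.6}},
    {"id": "S2A_MSIL2A_20220527T103631_N0400_R008_T31TDH_20220527T183616",
     "properties": {"datetime": "2022-05-27T10:49:19Z", "eo:cloud_cover": 30.42}},
    {"id": "S2B_MSIL2A_20220522T103629_N0400_R008_T31TDG_20220522T124154",
     "properties": {"datetime": "2022-05-22T10:49:27Z", "eo:cloud_cover": 12.99}},
]


def acquisition_dates(results):
    """Return the sorted unique acquisition dates found in the results."""
    return sorted({
        datetime.strptime(r["properties"]["datetime"], "%Y-%m-%dT%H:%M:%SZ").date()
        for r in results
    })


unique_dates = acquisition_dates(sample_results)
# Days between consecutive acquisition dates
gaps = [(later - earlier).days for earlier, later in zip(unique_dates, unique_dates[1:])]
print(len(sample_results), "results on", len(unique_dates), "dates; gaps in days:", gaps)
# prints: 5 results on 3 dates; gaps in days: [5, 5]
```

Calling `acquisition_dates(response)` on the full search result should work the same way, provided every entry follows this structure.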


To sum up this section, we have explored our AoI, generated a bounding box and a time interval in which to look for imagery and searched for Sentinel-2 imagery.


## Download


The next step is to download the images. On the one hand, we can download image by image, as follows.


In [11]:
from eotdl.access import download_sentinel_imagery

first_date = dates[0]

download_sentinel_imagery(
    "data/sentinel_2", first_date, boadella_bbox, "sentinel-2-l2a"
)

On the other hand, we can search for and download all available images within a time interval, as follows. This is the recommended way to download in bulk, but it has a drawback: we cannot control the quality of the images beforehand, for example by checking their cloud cover.


In [12]:
demonstration_dates = (dates[0], dates[2])

download_sentinel_imagery(
    output="data/sentinel_2",
    time_interval=demonstration_dates,
    bounding_box=boadella_bbox,
    collection_id="sentinel-2-l2a",
)
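
One way around the cloud cover limitation is to filter the search results first and then download only the dates you keep. The sketch below assumes entries shaped like the search output shown earlier. `dates_below_cloud_cover` is a hypothetical helper, the threshold of 20 is an arbitrary choice, and `eo:cloud_cover` describes the whole tile rather than our bounding box, so treat it as a coarse filter.

```python
def dates_below_cloud_cover(results, max_cloud_cover=20.0):
    """Return sorted dates (YYYY-MM-DD) with at least one result at or under the threshold."""
    return sorted({
        r["properties"]["datetime"][:10]  # keep only the date part of the timestamp
        for r in results
        if r["properties"]["eo:cloud_cover"] <= max_cloud_cover
    })


# Properties copied from the search output shown earlier (ids omitted for brevity)
sample_results = [
    {"properties": {"datetime": "2022-06-01T10:49:26Z", "eo:cloud_cover": 0.23}},
    {"properties": {"datetime": "2022-06-01T10:49:14Z", "eo:cloud_cover": 12.82}},
    {"properties": {"datetime": "2022-05-27T10:49:34Z", "eo:cloud_cover": 85.6}},
    {"properties": {"datetime": "2022-05-27T10:49:19Z", "eo:cloud_cover": 30.42}},
    {"properties": {"datetime": "2022-05-22T10:49:27Z", "eo:cloud_cover": 12.99}},
]

clear_dates = dates_below_cloud_cover(sample_results)
print(clear_dates)
# prints: ['2022-05-22', '2022-06-01']
```

Each date in `clear_dates` could then be passed to `download_sentinel_imagery` in the same way as the single-date call above.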

That's all! We have downloaded the images for our dataset. Let's check them!


In [13]:
from glob import glob

rasters = glob("data/sentinel_2/*.tiff")
rasters[:5]

['data/sentinel_2/sentinel-2-l2a_2020-01-28.tiff',
 'data/sentinel_2/sentinel-2-l2a_2020-02-02.tiff',
 'data/sentinel_2/sentinel-2-l2a_2020-01-13.tiff',
 'data/sentinel_2/sentinel-2-l2a_2020-01-18.tiff',
 'data/sentinel_2/sentinel-2-l2a_2020-01-23.tiff']

One last optional step is to rename the images and clean up the directory.


In [14]:
files = glob('data/sentinel_2/*')
for file in files:
    new_file_name = file.replace('sentinel-2-l2a', 'Boadella').replace('.tiff', '.tif')
    ! mv $file $new_file_name

!rm -r data/sentinel_2/*.json

In [15]:
!ls data/sentinel_2

Boadella_2020-01-13.tif Boadella_2020-01-23.tif Boadella_2020-02-02.tif
Boadella_2020-01-18.tif Boadella_2020-01-28.tif
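
The cell above relies on notebook shell magics and Unix commands (`mv`, `rm`). If you are not in a notebook, or are working on Windows, a sketch of the same cleanup using only `pathlib` could look like the following. `tidy_downloads` is a hypothetical helper, not part of EOTDL.

```python
from pathlib import Path


def tidy_downloads(folder, old_prefix="sentinel-2-l2a", new_prefix="Boadella"):
    """Delete *.json files and rename *.tiff rasters to *.tif with a new prefix."""
    folder = Path(folder)
    # Materialize the listings first, since we modify the directory while looping
    for json_file in list(folder.glob("*.json")):
        json_file.unlink()
    renamed = []
    for raster in sorted(folder.glob("*.tiff")):
        new_name = raster.name.replace(old_prefix, new_prefix)
        target = raster.with_name(new_name).with_suffix(".tif")
        raster.rename(target)
        renamed.append(target.name)
    return renamed


# Example call (commented out so that nothing is renamed by accident):
# tidy_downloads("data/sentinel_2")
```

The function returns the new file names, which makes it easy to confirm the result without listing the directory.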


## Data Preparation


As the final step towards creating our training dataset, we need to make the data AI-Ready. Many tasks can be performed here, such as:

- **Data cleaning**: remove corrupted images, remove images with too much cloud cover, etc.
- **Feature engineering**: calculate vegetation indices, calculate statistics, etc.
- **Data analysis**: plot time series, plot histograms, etc.
- **Labelling**: create labels for the images, etc.

For each one, feel free to use your favourite tools. Here we are going to demonstrate labelling using [SCANEO](https://github.com/earthpulse/scaneo).

SCANEO is a labelling web application for tagging satellite images (to identify, e.g., objects present, terrain types, etc.) quickly and easily. Labelling is a necessary step in preparing satellite data for processing by neural networks, and SCANEO also enables active learning.

Before running the web interface, we need to make sure that the `scaneo` package is installed on our machine and, if not, install it.


In [15]:
# !pip install scaneo

You can run `scaneo` with the following options:


In [13]:
!scaneo --help

 Usage: scaneo [OPTIONS]

╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --port                -p      INTEGER  Port to run the server on             │
│                                        [default: 8000]                       │
│ --host                -h      TEXT     Host to run the server on             │
│                                        [default: localhost]                  │
│ --workers             -w      INTEGER  

You can run `scaneo` by opening a terminal and running:

```
scaneo
```

You can then access the web interface at `http://localhost:8000`.

> You can change the host and port with `scaneo --host 0.0.0.0 --port 8000`.

In [14]:
!scaneo --host 0.0.0.0 --port 8000

Running command: uvicorn api:app --port 8000 --host 0.0.0.0 --app-dir /Users/juan/Desktop/eotdl/.venv/lib/python3.12/site-packages/scaneo
INFO:     Started server process [22214]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     127.0.0.1:55843 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:55843 - "GET /_app/immutable/assets/0.rJhS_7rg.css HTTP/1.1" 200 OK
INFO:     127.0.0.1:55845 - "GET /_app/immutable/chunks/entry.D-chmopU.js HTTP/1.1" 200 OK
INFO:     127.0.0.1:55847 - "GET /_app/immutable/entry/app.Ce3jy-cx.js HTTP/1.1" 200 OK
INFO:     127.0.0.1:55844 - "GET /_app/immutable/entry/start.qFyZ7ZKQ.js HTTP/1.1" 200 OK
INFO:     127.0.0.1:55848 - "GET /_app/immutable/chunks/prelo

Your annotations will be stored alongside the images as GeoJSON files containing the segmentation masks as multipolygons, bounding boxes for detection tasks, or classification labels.


In [15]:
!ls data/sentinel_2/*.geojson

data/sentinel_2/Boadella_2020-02-02.geojson
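
Since the annotations are plain GeoJSON, you can inspect them with the standard library before ingesting. The sketch below counts features per geometry type. It assumes only the generic GeoJSON `FeatureCollection` layout; the `label` property in the sample is a made-up placeholder, as the exact properties written by SCANEO are not described here, and `summarize_geojson` is a hypothetical helper.

```python
import json
from collections import Counter


def summarize_geojson(geojson_text):
    """Count features per geometry type in a GeoJSON FeatureCollection."""
    collection = json.loads(geojson_text)
    return Counter(
        # Features without a geometry are counted under "None"
        (feature.get("geometry") or {}).get("type", "None")
        for feature in collection.get("features", [])
    )


# Hand-written sample; the "label" property is a hypothetical placeholder
sample = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"label": "water"},
            "geometry": {
                "type": "MultiPolygon",
                "coordinates": [[[[2.80, 42.34], [2.82, 42.34], [2.82, 42.36], [2.80, 42.34]]]],
            },
        }
    ],
})

print(summarize_geojson(sample))
```

To apply it to a real annotation file, read the file first, for example with `Path("data/sentinel_2/Boadella_2020-02-02.geojson").read_text()` from `pathlib`.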


Once your data is ready, you can ingest it into EOTDL, as shown in the previous notebook, and start working with it like any other dataset in the repository.


In [None]:
text = """---
name: Boadella-tutorial
authors: 
  - Juan B. Pedro
license: free
source: https://github.com/earthpulse/eotdl/blob/develop/tutorials/tutorials/notebooks/04_creating.ipynb
---

# Boadella-tutorial

This is a toy dataset created during the community webinar.
"""

with open("data/sentinel_2/README.md", "w") as outfile:
    outfile.write(text)

In [17]:
from eotdl.datasets import ingest_dataset

ingest_dataset("data/sentinel_2")

Ingesting folder
Ingesting directory: data/sentinel_2


Preparing files: 100%|██████████| 8/8 [00:00<00:00, 916.16it/s]
Ingesting files: 100%|██████████| 8/8 [00:04<00:00,  1.65it/s]


PosixPath('data/sentinel_2/catalog.parquet')

If you add more images or labels to the dataset, you can ingest it again and a new version will be generated automatically.

## Learn more with our use cases

There is much more on EOTDL and SCANEO for creating and labelling datasets as well as training models in the [EOTDL use cases](https://github.com/earthpulse/eotdl/tree/main/tutorials/usecases) section.