In [1]:
%load_ext autoreload
%autoreload 2

# Creating datasets

This is probably the most interesting part of the tutorial, where you get to leverage EOTDL tools to create a brand new dataset. Here we cover:

1. **Data exploration**: given an area of interest, query available Sentinel data for your dataset.
2. **Data access**: download your data for creating the dataset.
3. **Data preparation**: clean your data, perform feature engineering, data analysis, labelling, etc.

Once your dataset is ready, you can ingest it to the EOTDL as we saw in the previous notebook and start working with it like any other dataset in the repository.

## Exploration

First of all, let's explore the area of interest that we have selected for this workshop. In this case we have chosen the [Boadella reservoir](https://es.wikipedia.org/wiki/Embalse_de_Darnius_Boadella) in Catalonia, Spain, whose geometry is in the data folder as `workshop_data/boadella.geojson`. Here we use [leafmap](https://leafmap.org/) to visualize it, but feel free to use your preferred solution.

In [2]:
# !pip install leafmap

In [5]:
import leafmap
import geopandas as gpd

in_geojson = 'workshop_data/boadella.geojson'
gdf = gpd.read_file(in_geojson)

centroid_coords = gdf['geometry'].centroid
centroid = [centroid_coords.y.values[0], centroid_coords.x.values[0]]   # We are going to use the centroid later

m = leafmap.Map(center=centroid, zoom=13)
m.add_geojson(in_geojson, layer_name="Boadella reservoir")
m

Map(center=[42.347577325903515, 2.815024677909404], controls=(ZoomControl(options=['position', 'zoom_in_text',…

When creating AI-Ready datasets, it is common to work at a fixed resolution. You can either retrieve full scenes and cut patches, or use EOTDL functionality to generate appropriate bounding boxes. Since we want all the images in the dataset to be 512x512 pixels, we take the centroid that we extracted earlier from the GeoJSON and generate a bounding box that yields a 512x512 pixel image at 10 m resolution, since we are going to use Sentinel data.

In [6]:
from eotdl.tools import bbox_from_centroid

boadella_bbox = bbox_from_centroid(x=centroid[0], y=centroid[1], pixel_size=10, width=512, height=512)
boadella_bbox

[2.7920278066359443, 42.330578684998784, 2.8380215491828635, 42.36457137143557]
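
To make the geometry behind this step more concrete, here is a minimal sketch of how a bounding box could be derived from a centroid, an image size and a pixel size. It assumes a simple spherical approximation (about 111,320 m per degree of latitude). The function `approx_bbox_from_centroid` and the constant are illustrative assumptions, not EOTDL's implementation, so the numbers may differ from those returned by `bbox_from_centroid`.

```python
import math

# Assumption: mean length of one degree of latitude, in metres.
METERS_PER_DEGREE = 111_320.0


def approx_bbox_from_centroid(lat, lon, pixel_size, width, height):
    """Return [min_lon, min_lat, max_lon, max_lat] for an image of
    `width` x `height` pixels of `pixel_size` metres centred on (lat, lon).

    Illustrative approximation only; not the EOTDL implementation.
    """
    half_height_deg = (height * pixel_size / 2) / METERS_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude.
    half_width_deg = (width * pixel_size / 2) / (
        METERS_PER_DEGREE * math.cos(math.radians(lat))
    )
    return [
        lon - half_width_deg,
        lat - half_height_deg,
        lon + half_width_deg,
        lat + half_height_deg,
    ]


bbox = approx_bbox_from_centroid(42.3476, 2.8150, pixel_size=10, width=512, height=512)
print(bbox)
```

For production use, reprojecting to a metric CRS (for example the local UTM zone) is usually more accurate than this degree-based approximation.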

Let's visualize the bounding box on a map!

In [7]:
from eotdl.tools import bbox_to_polygon

# Create a polygon from the bbox
boadella_polygon = bbox_to_polygon(boadella_bbox)
# Create a GeoDataFrame from the polygon
gdf = gpd.GeoDataFrame(geometry=[boadella_polygon])
# Save the bounding box as a GeoJSON file so that it can be added to the map
gdf.to_file('workshop_data/boadella_bbox.geojson', driver='GeoJSON')

m.add_geojson('workshop_data/boadella_bbox.geojson', layer_name="Boadella bbox")
m

Map(bottom=776016.0, center=[42.347577325903515, 2.815024677909404], controls=(ZoomControl(options=['position'…
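
If you prefer not to go through a GeoDataFrame, the conversion from a bounding box to a polygon is simple enough to write by hand. The sketch below builds a GeoJSON `Polygon` from a `[min_x, min_y, max_x, max_y]` list using only the standard library. The helper name `bbox_to_geojson_polygon` is hypothetical and is not part of EOTDL.

```python
import json


def bbox_to_geojson_polygon(bbox):
    """Convert [min_x, min_y, max_x, max_y] into a GeoJSON Polygon dict."""
    min_x, min_y, max_x, max_y = bbox
    # GeoJSON rings are closed: the first and last positions are identical.
    # The exterior ring is listed counterclockwise.
    ring = [
        [min_x, min_y],
        [max_x, min_y],
        [max_x, max_y],
        [min_x, max_y],
        [min_x, min_y],
    ]
    return {"type": "Polygon", "coordinates": [ring]}


polygon = bbox_to_geojson_polygon(
    [2.7920278066359443, 42.330578684998784, 2.8380215491828635, 42.36457137143557]
)
print(json.dumps(polygon, indent=2))
```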

Now that we have our desired bounding box, we can look for available Sentinel-2 imagery over it. This can be done through the EOTDL.

First, we can check which Sentinel sensors are supported in the EOTDL.

In [8]:
from eotdl.access import SUPPORTED_SENSORS

SUPPORTED_SENSORS

('sentinel-1-grd', 'sentinel-2-l1c', 'sentinel-2-l2a', 'dem')

If we want to look for available Sentinel-2 imagery in our AoI, we must define a range of dates in which to search for the images. We have already defined a time interval for this workshop, which is in the `workshop_data/dates.csv` file.

In [9]:
import csv

dates = list()
with open("workshop_data/dates.csv", "r") as file:
    reader = csv.reader(file)
    for row in reader:
        dates.append(row[0])
dates.sort()

dates[:5]

['2020-01-13', '2020-01-28', '2020-02-02', '2020-06-21', '2020-09-14']

Although we have the specific dates, we are going to search over the entire time interval, as a demonstration.

> We use Sentinel Hub under the hood, so you will need appropriate credentials. You can generate them automatically from your user [profile](https://www.eotdl.com/profile) by accepting the terms and conditions. When you log in to the EOTDL, via the library or the CLI, we retrieve and store this information for you, so you don't need to worry about it. However, there are a couple of gotchas: <br><br> 1. If you already have a Sentinel Hub account with the same email as your EOTDL account, you will need to retrieve the credentials from the Sentinel Hub Dashboard. <br> 2. The credentials generated via EOTDL will expire after some time. <br><br> In any case, you can provide your own credentials by setting the appropriate environment variables: `SH_CLIENT_ID` and `SH_CLIENT_SECRET`.
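
If you go the environment-variable route, it can help to check that both variables are set before launching a search. The following is a minimal sketch using only the standard library. The helper `missing_sh_credentials` is hypothetical, and it only checks that the variables exist, not that the credentials are still valid.

```python
import os

REQUIRED_VARS = ("SH_CLIENT_ID", "SH_CLIENT_SECRET")


def missing_sh_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_sh_credentials()
if missing:
    print("Missing Sentinel Hub credentials:", ", ".join(missing))
else:
    print("Sentinel Hub credentials found in the environment.")
```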

In [10]:
from eotdl.access import search_sentinel_imagery

time_interval = (dates[0], dates[-1])

r = search_sentinel_imagery(time_interval, boadella_bbox, 'sentinel-2-l2a')
response = list(r)
response[:5]

The dates of the returned images make sense, as the [revisit time](https://docs.sentinel-hub.com/api/latest/data/sentinel-2-l2a/#basic-facts) for Sentinel-2 is 5 days.
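
As a quick sanity check of the 5-day cadence, we can compute the gaps between consecutive acquisitions. The sketch below uses the first five dates listed in `workshop_data/dates.csv` (shown earlier) rather than the search response itself, and every gap turns out to be a multiple of 5 days. The same idea could be applied to the datetimes in the search results.

```python
from datetime import date

# First five acquisition dates listed in workshop_data/dates.csv (see above).
sample_dates = ["2020-01-13", "2020-01-28", "2020-02-02", "2020-06-21", "2020-09-14"]

parsed = [date.fromisoformat(d) for d in sample_dates]
# Days elapsed between each pair of consecutive acquisitions.
gaps = [(later - earlier).days for earlier, later in zip(parsed, parsed[1:])]

print(gaps)                            # [15, 5, 140, 85]
print(all(g % 5 == 0 for g in gaps))   # True
```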

As a final step, let's check the number of dates with available images.

In [17]:
print(len(response))

342


To sum up this section, we have explored our AoI, generated a bounding box and a time interval in which to look for imagery and searched for Sentinel-2 imagery.

Let's continue with the next section and download the images!

## Download

The next step is to download the images. On the one hand, we can download image by image, as follows.

In [11]:
from eotdl.access import download_sentinel_imagery

first_date = dates[0]

# Uncomment to demonstrate
# download_sentinel_imagery('workshop_data/sentinel_2', first_date, boadella_bbox, 'sentinel-2-l2a')

On the other hand, we can search and download all available images within a time interval, as follows. This is the recommended way for a bulk download, but it has the drawback that we cannot control the quality of the images; for example, we do not know their cloud cover beforehand.
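
One way to regain some control is to search first, filter the results, and then download only the dates you keep. The sketch below filters hypothetical search results by cloud cover. It assumes the records are shaped like STAC items carrying an `eo:cloud_cover` property (a percentage), which may or may not be present in the actual search response. The sample records and the helper `filter_by_cloud_cover` are made up for illustration.

```python
# Hypothetical search results, shaped like STAC items. The `eo:cloud_cover`
# property (percentage, 0-100) is an assumption about the response content.
items = [
    {"id": "scene-a", "properties": {"datetime": "2020-01-13T10:46:00Z", "eo:cloud_cover": 3.2}},
    {"id": "scene-b", "properties": {"datetime": "2020-01-18T10:46:00Z", "eo:cloud_cover": 87.5}},
    {"id": "scene-c", "properties": {"datetime": "2020-01-28T10:46:00Z", "eo:cloud_cover": 12.0}},
]


def filter_by_cloud_cover(items, max_cloud_cover=20.0):
    """Keep items whose reported cloud cover is at most `max_cloud_cover`."""
    kept = []
    for item in items:
        cloud_cover = item.get("properties", {}).get("eo:cloud_cover")
        # Items without the property are skipped rather than assumed clear.
        if cloud_cover is not None and cloud_cover <= max_cloud_cover:
            kept.append(item)
    return kept


clear_dates = [
    item["properties"]["datetime"][:10] for item in filter_by_cloud_cover(items)
]
print(clear_dates)  # ['2020-01-13', '2020-01-28']
```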

In [12]:
from eotdl.access import search_and_download_sentinel_imagery

# Uncomment to demonstrate
# search_and_download_sentinel_imagery(
#     output='workshop_data/sentinel_2',
#     time_interval=dates[:3],
#     bounding_box=boadella_bbox,
#     sensor='sentinel-2-l2a'
# )

However, the `workshop_data/dates.csv` file already contains the acquisition dates of valid, cloud-free, good-quality images. Downloading them one by one is a slower but safer solution. So, let's download them!

In [13]:
for date in dates:
    download_sentinel_imagery('workshop_data/sentinel_2', date, boadella_bbox, 'sentinel-2-l2a')

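
A long download loop can fail partway through, for instance if the credentials expire. The sketch below shows one way to keep going and record which dates failed so that they can be retried later. `download_all` and `fake_download` are hypothetical helpers written for this example, and the failure is simulated.

```python
def download_all(dates, download):
    """Call `download(date)` for every date, collecting failures.

    `download` is a hypothetical callable wrapping the real download call.
    """
    succeeded, failed = [], {}
    for d in dates:
        try:
            download(d)
            succeeded.append(d)
        except Exception as error:  # broad on purpose: keep the loop going
            failed[d] = str(error)
    return succeeded, failed


def fake_download(d):
    # Stand-in for the real call; fails on one date for demonstration.
    if d == "2020-01-28":
        raise RuntimeError("Invalid or expired account.")


ok, errors = download_all(["2020-01-13", "2020-01-28", "2020-02-02"], fake_download)
print(ok)      # ['2020-01-13', '2020-02-02']
print(errors)  # {'2020-01-28': 'Invalid or expired account.'}
```

In the notebook, `download` could be something like `lambda d: download_sentinel_imagery('workshop_data/sentinel_2', d, boadella_bbox, 'sentinel-2-l2a')`.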

That's all! We have downloaded the images for our dataset. Let's check them!

In [None]:
from glob import glob

rasters = glob('workshop_data/sentinel_2/*.tif')
rasters[:5]

We can look for their metadata files, too.

In [14]:
jsons = glob('workshop_data/sentinel_2/*.json')
jsons[:5]

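
To verify that every raster has its metadata file, the two listings can be paired by file name. This is a minimal sketch that assumes each `.tif` has a `.json` with the same base name in the same folder. That naming convention is an assumption, and `pair_rasters_with_metadata` is a hypothetical helper.

```python
from pathlib import Path


def pair_rasters_with_metadata(folder):
    """Map each .tif file in `folder` to its same-named .json file (or None)."""
    folder = Path(folder)
    if not folder.is_dir():
        return {}
    pairs = {}
    for raster in sorted(folder.glob("*.tif")):
        metadata = raster.with_suffix(".json")
        pairs[raster.name] = metadata.name if metadata.exists() else None
    return pairs


pairs = pair_rasters_with_metadata("workshop_data/sentinel_2")
missing = [name for name, meta in pairs.items() if meta is None]
print(f"{len(pairs)} rasters, {len(missing)} without metadata")
```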

It looks amazing! One last, optional step is to rename the images, keeping the acquisition date but replacing the sensor type in the filename with `Boadella`. This is a simple way of "labelling" the downloaded images so that they can be easily ingested by the EOTDL and used to generate STAC metadata in the next steps. It is not mandatory, but it will be useful for our use case.

In [15]:
import os

files = glob('workshop_data/sentinel_2/*')
for file in files:
    # Keep the acquisition date, replace the sensor name with `Boadella`
    new_file_name = file.replace('sentinel-2-l2a', 'Boadella')
    os.rename(file, new_file_name)


## Data Preparation

As the final step towards creating our training dataset, we need to make the data AI-Ready. There is a multitude of tasks that can be performed here, such as:

- **Data cleaning**: remove corrupted images, remove images with too much cloud cover, etc.
- **Feature engineering**: calculate vegetation indices, calculate statistics, etc.
- **Data analysis**: plot time series, plot histograms, etc.
- **Labelling**: create labels for the images, etc.
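
As an example of feature engineering, the sketch below computes the Normalized Difference Vegetation Index, NDVI = (NIR - Red) / (NIR + Red). For Sentinel-2, the near-infrared and red bands are B08 and B04. To stay dependency-free it works on nested lists with made-up values. In practice you would more likely read the bands with a raster library and use array operations.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for two equally shaped bands.

    Bands are nested lists of reflectance values; NDVI = (NIR - Red) / (NIR + Red).
    Pixels where NIR + Red is zero are set to 0.0 to avoid division by zero.
    """
    result = []
    for nir_row, red_row in zip(nir, red):
        row = []
        for n, r in zip(nir_row, red_row):
            total = n + r
            row.append((n - r) / total if total != 0 else 0.0)
        result.append(row)
    return result


# Toy 2x2 bands (made-up reflectance values, not real Sentinel-2 data).
nir_band = [[0.5, 0.4], [0.0, 0.3]]
red_band = [[0.1, 0.4], [0.0, 0.1]]
print(ndvi(nir_band, red_band))
```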

For each one, feel free to use your favourite tools. Here we are going to demonstrate labelling using SCANEO.

SCANEO is a labelling web application that allows tagging satellite images (to identify, e.g., objects present, terrain types, etc.) in an easy and fast way. Such labelling is a necessary step to prepare satellite data for processing by neural networks, and it enables active learning.

Before running the web interface, we need to make sure the `scaneo` package is installed on our machine and, if not, install it.

In [19]:
# !pip install scaneo  


You can run `scaneo` with the following options:

In [20]:
!scaneo --help

 Usage: scaneo [OPTIONS]

╭─ Options ──────────────────────────────────────────────────────────────────╮
│ --port      -p      INTEGER  Port to run the server on [default: 8000]     │
│ --reload    -r               Reload the server when files change           │
│                              [default: True]                               │
│ --host      -h      TEXT

As we can see, `scaneo` offers several options, such as the port on which to run the server, the host, environment parameters, and so on. In this workshop, all we need to do is give the path to our dataset.

In [21]:
!scaneo --data workshop_data/sentinel_2

Environment file .env not found.
Running command: IMAGE=vector DATA=workshop_data/sentinel_2 uvicorn api:app --port 8000 --host localhost --reload --app-dir /home/juan/miniconda3/envs/eotdl/lib/python3.8/site-packages/scaneo
INFO:     Will watch for changes in these directories: ['/home/juan/Desktop/eotdl/tutorials/workshops/bids23']
INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [82043] using StatReload
INFO:     Started server process [82045]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
^C
INFO:     Shutting down
INFO:     Finished server process [82045]
ERROR:    Traceback (most recent call last):
  File "/home/juan/miniconda3/envs/eotdl/lib/python3.8/site-packages/starlette/routing.py", line 674, in lifespan
    await receive()
  File "/home/juan/minico
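
If you want to assemble the command from Python, for example to pick a different port when 8000 is busy, a small helper can build it. The sketch below only uses flags that appear above (`--data`, `--port`, `--host`). The helper `build_scaneo_command` is hypothetical, and the command is printed rather than executed.

```python
import shlex


def build_scaneo_command(data_dir, port=8000, host="localhost"):
    """Build the `scaneo` command line for a given data folder."""
    return [
        "scaneo",
        "--data", str(data_dir),
        "--port", str(port),
        "--host", host,
    ]


command = build_scaneo_command("workshop_data/sentinel_2", port=8001)
print(shlex.join(command))
# To launch the server from Python (blocks until stopped):
# import subprocess; subprocess.run(command, check=True)
```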

In [23]:
%%html
<iframe src="http://localhost:8000/" width="100%" height="700"></iframe>

![scaneo](./images/scaneo.png)

Once your data is ready, you can ingest it to the EOTDL as we saw in the previous notebook and start working with it like any other dataset in the repository.

In [24]:
import yaml

metadata = {
    'authors': ['Fran Martin', 'Juan B. Pedro'],
    'license': 'free',
    'source': 'https://earthpulse.ai',
    'name': 'Boadella-BiDS23',
}

with open('workshop_data/Boadella/metadata.yml', 'w') as outfile:
    yaml.dump(metadata, outfile, default_flow_style=False)

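
Note that opening a file for writing raises `FileNotFoundError` when the target folder does not exist yet. The sketch below creates the folder first and checks that the metadata has the fields used in the cell above. Treating those four fields as the expected set is an assumption made for this example, and both helpers are hypothetical.

```python
from pathlib import Path

# Assumption: the fields used in the cell above are the ones that matter.
EXPECTED_FIELDS = ("authors", "license", "source", "name")


def missing_fields(metadata):
    """Return the expected metadata fields that are absent or empty."""
    return [field for field in EXPECTED_FIELDS if not metadata.get(field)]


def prepare_metadata_path(dataset_dir, filename="metadata.yml"):
    """Create `dataset_dir` if needed and return the metadata file path."""
    folder = Path(dataset_dir)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / filename


example = {"authors": ["Fran Martin", "Juan B. Pedro"], "license": "free"}
print(missing_fields(example))  # ['source', 'name']
```

With these helpers, the path returned by `prepare_metadata_path('workshop_data/Boadella')` could be passed to `open(..., 'w')` before calling `yaml.dump`.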

In [None]:
!eotdl datasets ingest -p workshop_data/Boadella 

However, you might want to wait for the next tutorial, where you will learn how to generate STAC metadata for this dataset in order to ingest it to the EOTDL as a Q1 or Q2 dataset, leveraging advanced functionality.

## Roadmap

...

## Discussion and Contribution opportunities

...
