# Creating datasets


This is probably the most interesting part of the tutorial, where you get to leverage EOTDL tools to create a brand new dataset. Here we cover:

1. **Data exploration**: given an area of interest, query available Sentinel data for your dataset.
2. **Data access**: download your data for creating the dataset.
3. **Data preparation**: clean your data, perform feature engineering, data analysis, labelling, etc.

Once your dataset is ready, you can ingest it into the EOTDL, as we saw in the previous notebook, and start working with it like any other dataset in the repository.


## Exploration


First of all, let's explore the area of interest (AoI) that we have selected for this workshop. In this case we have chosen the [Boadella reservoir](https://es.wikipedia.org/wiki/Embalse_de_Darnius_Boadella) in Catalonia, Spain, whose geometry is stored in the data folder as `workshop_data/boadella.geojson`. Here we use [leafmap](https://leafmap.org/) to visualize it, but feel free to use your preferred solution.


In [None]:
# !pip install leafmap

In [None]:
import leafmap
import geopandas as gpd

in_geojson = "workshop_data/boadella.geojson"
gdf = gpd.read_file(in_geojson)

centroid_coords = gdf["geometry"].centroid
centroid = [
    centroid_coords.y.values[0],
    centroid_coords.x.values[0],
]  # We are going to use the centroid later

m = leafmap.Map(center=centroid, zoom=13)
m.add_geojson(in_geojson, layer_name="Boadella reservoir")
m

When creating AI-ready datasets it is common to work at a fixed resolution. You can either retrieve full scenes and cut patches, or use EOTDL functionality to generate appropriate bounding boxes. We want every image in the dataset to be 512x512 pixels. Since we are going to use Sentinel data at 10 m resolution, we take the centroid that we extracted from the GeoJSON and generate a bounding box around it that yields a 512x512 pixel image at that resolution.
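
The size of the bounding box follows from simple arithmetic: 512 pixels at 10 m per pixel span 5,120 m on each side. The sketch below only illustrates that reasoning and is not how EOTDL computes the box: `approx_bbox_from_centroid` is a hypothetical helper that uses a rough spherical approximation of about 111,320 m per degree of latitude, the coordinate order of its output is chosen for illustration, and the centroid is an approximate location near the reservoir.

```python
import math


def approx_bbox_from_centroid(lat, lon, pixel_size=10, width=512, height=512):
    """Approximate (min_lon, min_lat, max_lon, max_lat) around a WGS84 centroid.

    Hypothetical helper for illustration only; it is not part of EOTDL.
    """
    # Half of the image footprint in metres along each axis
    half_width_m = width * pixel_size / 2
    half_height_m = height * pixel_size / 2

    # Rough spherical approximation (assumption): metres per degree
    metres_per_deg_lat = 111_320.0
    metres_per_deg_lon = metres_per_deg_lat * math.cos(math.radians(lat))

    dlat = half_height_m / metres_per_deg_lat
    dlon = half_width_m / metres_per_deg_lon
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)


# Illustrative coordinates near the Boadella reservoir (approximate)
approx_bbox = approx_bbox_from_centroid(lat=42.35, lon=2.82)
side_m = 512 * 10
print(side_m, approx_bbox)
```

For real work, prefer the EOTDL helper used in the next cell.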


In [None]:
from eotdl.tools import bbox_from_centroid

boadella_bbox = bbox_from_centroid(
    x=centroid[0], y=centroid[1], pixel_size=10, width=512, height=512
)
boadella_bbox

Let's visualize the bounding box on a map!


In [None]:
from eotdl.tools import bbox_to_polygon

# Create a polygon from the bbox
boadella_polygon = bbox_to_polygon(boadella_bbox)
# Create a GeoDataFrame from the polygon
gdf = gpd.GeoDataFrame(geometry=[boadella_polygon])
# Save the bounding box as a GeoJSON file so it can be added to the map
gdf.to_file("workshop_data/boadella_bbox.geojson", driver="GeoJSON")

m.add_geojson("workshop_data/boadella_bbox.geojson", layer_name="Boadella bbox")
m

Now that we have our desired bounding box, we can look for available Sentinel-2 imagery over it. This can be done through the EOTDL.

First, let's check which Sentinel collections are supported in the EOTDL:


In [None]:
from eotdl.access import SUPPORTED_COLLECTION_IDS

SUPPORTED_COLLECTION_IDS

If we want to look for available Sentinel-2 imagery in our AoI, we must define a range of dates in which to search for the images. We have already defined a time interval for this workshop, which is in the `workshop_data/dates.csv` file.


In [None]:
import csv

dates = list()
with open("workshop_data/dates.csv", "r") as file:
    reader = csv.reader(file)
    for row in reader:
        dates.append(row[0])
dates.sort()

dates[:5]

Although we have the specific dates, we are going to search over the entire time interval, just as a demonstration.
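
Taking the first and last entries of the sorted list only gives a valid interval if the strings sort chronologically, which holds for ISO 8601 dates (`YYYY-MM-DD`). A minimal sketch, assuming that format and using made-up sample dates rather than the contents of `dates.csv`:

```python
from datetime import date

# Hypothetical sample; the real values live in workshop_data/dates.csv
sample_dates = ["2020-01-13", "2020-01-03", "2020-01-08"]

# Parsing validates the format and guarantees true chronological ordering
parsed = sorted(date.fromisoformat(d) for d in sample_dates)

# (start, end) pair in the same shape as the tuple used for the search
sample_interval = (parsed[0].isoformat(), parsed[-1].isoformat())
span_days = (parsed[-1] - parsed[0]).days
print(sample_interval, span_days)
```

If the file used a different date format, parsing it explicitly like this would surface the problem early instead of producing a wrong interval.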


> We use Sentinel Hub under the hood, so you will need appropriate credentials. You can generate them automatically from your user [profile](https://www.eotdl.com/profile) by accepting the terms and conditions. When you log in to the EOTDL, via the library or CLI, we retrieve and store this information for you, so you don't need to worry about it. However, there are a couple of gotchas: <br><br> 1. If you already have a Sentinel Hub account with the same email as your EOTDL account, you will need to retrieve the credentials from the Sentinel Hub Dashboard and set them as environment variables. <br> 2. The credentials generated via EOTDL may expire after some time (we are working on this). If this happens, let us know on Discord so we can fix the issue. <br><br> In any case, you can provide your own credentials by setting the appropriate environment variables: `SH_CLIENT_ID` and `SH_CLIENT_SECRET`.
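
If you provide your own credentials, a quick check like the one below can tell you whether both variables are visible to the notebook process. This is only a sketch: `missing_sh_credentials` is a hypothetical helper, not part of EOTDL, and the placeholder values are not real credentials.

```python
import os


def missing_sh_credentials(env=os.environ):
    """Return the names of the Sentinel Hub variables that are unset or empty."""
    required = ("SH_CLIENT_ID", "SH_CLIENT_SECRET")
    return [name for name in required if not env.get(name)]


# To set them from Python instead of the shell (placeholder values):
# os.environ["SH_CLIENT_ID"] = "<your-client-id>"
# os.environ["SH_CLIENT_SECRET"] = "<your-client-secret>"

# Demonstration with a made-up environment that lacks the secret
demo_missing = missing_sh_credentials({"SH_CLIENT_ID": "abc"})
print(demo_missing)
```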


In [None]:
from eotdl.access import search_sentinel_imagery

time_interval = (dates[0], dates[-1])

r = search_sentinel_imagery(time_interval, boadella_bbox, "sentinel-2-l2a")
response = list(r)
response[:5]

These results make sense, as the [revisit time](https://docs.sentinel-hub.com/api/latest/data/sentinel-2-l2a/#basic-facts) for Sentinel-2 is 5 days.

As a final step, let's check the number of dates with available images.
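
Keep in mind that the number of search results and the number of distinct dates can differ, for example when more than one scene covers the area on the same day. The sketch below counts distinct acquisition dates, assuming each result is a dictionary with an ISO timestamp under `properties["datetime"]`. That response shape is an assumption, and the sample results are made up.

```python
from datetime import datetime

# Made-up results; the structure is an assumption about the search response
sample_response = [
    {"id": "scene-a", "properties": {"datetime": "2020-01-03T10:50:26Z"}},
    {"id": "scene-b", "properties": {"datetime": "2020-01-03T10:50:40Z"}},
    {"id": "scene-c", "properties": {"datetime": "2020-01-08T10:50:21Z"}},
]


def acquisition_dates(results):
    """Return the sorted distinct calendar dates found in the results."""
    days = set()
    for item in results:
        # Replace the trailing "Z" so older Python versions can parse it
        stamp = item["properties"]["datetime"].replace("Z", "+00:00")
        days.add(datetime.fromisoformat(stamp).date())
    return sorted(days)


unique_days = acquisition_dates(sample_response)
print(len(sample_response), len(unique_days))
```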


In [None]:
print(len(response))

To sum up this section, we have explored our AoI, generated a bounding box and a time interval in which to look for imagery and searched for Sentinel-2 imagery.


## Download


The next step is to download the images. On the one hand, we can download image by image, as follows.


In [None]:
from eotdl.access import download_sentinel_imagery

first_date = dates[0]

# Download the image for a single date
download_sentinel_imagery(
    "data/sentinel_2", first_date, boadella_bbox, "sentinel-2-l2a"
)

On the other hand, we can search for and download all available images within a time interval, as follows. This is the recommended way to do a bulk download, but it has a drawback: we cannot control the quality of the images, for example by checking their cloud cover beforehand.
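
One way to work around this is to search first, filter the results yourself, and then download date by date. The sketch below assumes each search result exposes a cloud-cover percentage under `properties["eo:cloud_cover"]`, the name used by the STAC `eo` extension. Whether the EOTDL search response includes that field is an assumption, `clear_dates` is a hypothetical helper, and the sample results are made up.

```python
# Made-up search results; the "eo:cloud_cover" field is an assumption
sample_results = [
    {"properties": {"datetime": "2020-01-03T10:50:26Z", "eo:cloud_cover": 12.5}},
    {"properties": {"datetime": "2020-01-08T10:50:21Z", "eo:cloud_cover": 87.0}},
    {"properties": {"datetime": "2020-01-13T10:50:24Z", "eo:cloud_cover": 3.1}},
]


def clear_dates(results, max_cloud_cover=20.0):
    """Return the dates (YYYY-MM-DD) of results at or below the threshold."""
    selected = []
    for item in results:
        props = item["properties"]
        # Treat a missing value as fully cloudy so it is never selected by accident
        if props.get("eo:cloud_cover", 100.0) <= max_cloud_cover:
            selected.append(props["datetime"][:10])
    return sorted(set(selected))


dates_to_download = clear_dates(sample_results)
print(dates_to_download)
```

Each selected date could then be passed to the single-date form of `download_sentinel_imagery` shown earlier.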


In [None]:
# Download every available image between the first and third workshop dates
demonstration_dates = (dates[0], dates[2])

download_sentinel_imagery(
    output="data/sentinel_2",
    time_interval=demonstration_dates,
    bounding_box=boadella_bbox,
    collection_id="sentinel-2-l2a",
)

That's all! We have downloaded the images for our dataset. Let's check them!


In [None]:
from glob import glob

rasters = glob("data/sentinel_2/*.tiff")
rasters[:5]

One last optional step is to rename the images and clean up the directory.
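
If you prefer not to rely on shell commands (for example on Windows), the same renaming and cleanup can be sketched with `pathlib`. `tidy_directory` is a hypothetical helper, and the example runs against a temporary directory with made-up file names, so it does not touch your downloads.

```python
import tempfile
from pathlib import Path


def tidy_directory(folder, old="sentinel-2-l2a", new="Boadella"):
    """Rename rasters (prefix and .tiff -> .tif) and delete JSON files."""
    folder = Path(folder)
    # Materialize the listing first so renames do not disturb the iteration
    for path in sorted(folder.iterdir()):
        if path.suffix == ".json":
            path.unlink()
        elif path.suffix == ".tiff":
            renamed = path.with_name(path.name.replace(old, new))
            path.rename(renamed.with_suffix(".tif"))
    return sorted(p.name for p in folder.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    # Made-up file names standing in for downloaded files
    for name in ("sentinel-2-l2a_2020-01-03.tiff", "sentinel-2-l2a_2020-01-03.json"):
        (Path(tmp) / name).touch()
    remaining = tidy_directory(tmp)

print(remaining)
```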


In [None]:
files = glob('data/sentinel_2/*')
for file in files:
    new_file_name = file.replace('sentinel-2-l2a', 'Boadella').replace('.tiff', '.tif')
    ! mv $file $new_file_name

!rm -r data/sentinel_2/*.json

In [None]:
!ls data/sentinel_2

## Discussion and contribution opportunities


Feel free to ask questions now (live or through Discord) and make suggestions for future improvements.

- What features concerning data exploration would you like to see?
- What other features concerning data download would you like to see?
- What features and tools concerning data preparation would you like to see?
- What does your typical workflow look like?
- Do you already use any labelling tool?
- What does your ideal labelling tool look like?
