# Download Sentinel-2 Data from EarthDaily Analytics' Earth Platform

This notebook provides a guide for downloading [Sentinel-2](https://sentinels.copernicus.eu/web/sentinel/missions/sentinel-2) data from the [EarthDaily Analytics Earth Platform](https://earthplatform.eds.earthdaily.com/) and displaying some of the data downloaded from this platform.

Treat this notebook as a template for downloading Sentinel-2 data for an area **you** are interested in learning more about, and as a starting point for **your** own analysis, wherever your personal whimsy leads.

We start by introducing techniques for querying the Platform using a GeoJSON Polygon created for the Lower Mainland, recommend how you can go about creating your own, and then proceed with an example of querying and displaying data from this store.

### What you Need to Run this Notebook:

(1) A file located in the same directory as this notebook called `.env` containing the following format and values:

```bash
CLIENT_ID="very_real_client_id_for_earthdaily_analytics_platform"
CLIENT_SECRET="very_real_client_secret_for_earthdaily_analytics_platform"
AUTH_TOKEN_URL="very_real_url_for_connecting_to_earthdaily_platform"
API_URL="very_real_api_url_for_reaching_earthdaily_dataset"
```

These values should be provided to you, but if you do not have them, please ask someone in the know.

(2) jazz hands?
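
Before moving on, it can help to confirm that all four values from item (1) are actually set. The sketch below is one hedged way to do that. `REQUIRED_KEYS` and `missing_settings` are names made up for this illustration, and the function only inspects a mapping such as `os.environ` (for example after `load_dotenv` has populated it).

```python
import os
from typing import Mapping

# The four keys the .env file is expected to define (taken from the list above).
REQUIRED_KEYS = ("CLIENT_ID", "CLIENT_SECRET", "AUTH_TOKEN_URL", "API_URL")


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required keys that are absent or empty in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


missing = missing_settings()
if missing:
    print(f"Missing settings: {', '.join(missing)}")
```

If the printed list is not empty, the authentication cell further down will most likely fail, so it is worth fixing the `.env` file first.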

Now that we have that squared away we are going to start by installing some required Python packages for this notebook.

In [None]:
# pip install folium fsspec geopandas mapclassify python-dotenv pystac-client shapely

In [143]:
from datetime import datetime, timedelta
from pathlib import Path
import json
import os
import requests
import typing as T
import glob
import shutil
import csv
import sys

from dotenv import load_dotenv
from pystac_client import Client
from pystac.item import Item
from shapely.geometry import shape
from shapely.geometry import Polygon
from shapely.prepared import prep
from shapely.geometry import MultiPolygon, Point
import rasterio
from rasterio.features import rasterize
from rasterio.windows import Window
from dateutil.parser import parse

import cv2
import folium
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tifffile as tiff
from PIL import Image
import ast
import imagecodecs
import torch


## Areas of Interest

To start, we need to create an area of interest that we want to analyze further.

If you have an area in mind already that you want to explore and learn more about, you can create your own GeoJSON file at [geojson.io](https://geojson.io/) or using an open source service like [QGIS](https://qgis.org/en/site/).

You can also download pre-built Shapefiles from services like [OpenStreetMap](https://download.geofabrik.de/) or from some of the resources mentioned in [this post by Carleton University](https://library.carleton.ca/find/gis/geospatial-data/shapefiles-canada-united-states-and-world).

We are going to start with a GeoJSON polygon that I've pre-built using geojson.io's interface. It covers some of Vancouver, BC, Canada, and we will use it to inspect some Sentinel-2 tiles from the EarthDaily platform.

In [2]:


# Example GeoJSON geometry
geojson_geometry = {
    # I drew a Polygon at geojson.io and put the contents of the `geometry` attribute here
    "type": "Polygon",
    "coordinates": [
        [
            [-123.19443904574592, 49.24932487550103],
            [-123.17802762978633, 49.13635374192958],
            [-123.03251301430745, 49.04966338761224],
            [-122.73054282797881, 49.06113525720673],
            [-122.68349674929038, 49.13420622071877],
            [-122.77868300152072, 49.255752174748864],
            [-122.98984330819309, 49.29073056863652],
            [-123.19443904574592, 49.24932487550103],
        ]
    ],
}

# Convert the GeoJSON geometry into a shapely geometry
# shapely_geometry = shape(geojson_geometry)

# Please Note: You can also create, download, or import your own Shapefile to use with GeoPandas
# and create a GeoDataFrame from it. To do this see the docs at:
# https://geopandas.org/en/stable/docs/user_guide/io.html#reading-spatial-data
# and take a look at the geopandas read_file method.

# Create a GeoDataFrame with the shapely geometry
# Note that the Coordinate Reference System (CRS) defines how the two-dimensional,
# projected geometry relates to real world locations.
# gdf = gpd.GeoDataFrame([{"geometry": shapely_geometry}], crs="EPSG:4326")
gdf = gpd.read_file('../shapefiles/campbell_river.shp')
gdf = gdf.set_crs("EPSG:4326")
shapely_geometry = gdf.geometry[0]
gdf = gdf[gdf.geometry.is_valid]
aoi = gdf[gdf.geometry.is_valid].unary_union

# Calculate the centroid of your GeoDataFrame to center the map
centroid = gdf.geometry.centroid.unary_union.centroid

# Create a folium map centered on the centroid of your shapefile
m = folium.Map(location=[centroid.y, centroid.x], zoom_start=10)

# Add the GeoDataFrame as a layer to the folium map
folium.GeoJson(gdf, name="geojson").add_to(m)

# Add layer control to toggle the geojson layer
folium.LayerControl().add_to(m)

# Display the map
m
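
If you draw your own polygon, a quick sanity check can catch the most common mistakes before any query is sent. The sketch below assumes a plain GeoJSON `Polygon` dictionary like `geojson_geometry` above. `check_polygon` is a hypothetical helper, not part of any library, and it only looks at the outer ring.

```python
def check_polygon(geometry: dict) -> tuple[float, float, float, float]:
    """Validate the outer ring of a GeoJSON Polygon and return its bounding box
    as (min_lon, min_lat, max_lon, max_lat)."""
    if geometry.get("type") != "Polygon":
        raise ValueError("expected a GeoJSON Polygon")
    ring = geometry["coordinates"][0]
    # A GeoJSON linear ring has at least four positions and repeats the first one at the end.
    if len(ring) < 4 or ring[0] != ring[-1]:
        raise ValueError("outer ring must have at least 4 positions and be closed")
    lons = [position[0] for position in ring]
    lats = [position[1] for position in ring]
    # GeoJSON stores positions as [longitude, latitude]; swapped values often fail this test.
    if not all(-180 <= lon <= 180 for lon in lons) or not all(-90 <= lat <= 90 for lat in lats):
        raise ValueError("coordinates are out of range; are longitude and latitude swapped?")
    return (min(lons), min(lats), max(lons), max(lats))


# A small triangle built from the first three vertices of the polygon above.
example = {
    "type": "Polygon",
    "coordinates": [
        [
            [-123.19443904574592, 49.24932487550103],
            [-123.17802762978633, 49.13635374192958],
            [-123.03251301430745, 49.04966338761224],
            [-123.19443904574592, 49.24932487550103],
        ]
    ],
}
print(check_polygon(example))
```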




## Downloading Data

To download Sentinel-2 data from the Earth Platform we are going to use the [SpatioTemporal Asset Catalog (STAC)](https://stacspec.org/en) interface - a common language used to describe geospatial information. To do this, we need to authenticate with this platform using our precious credentials and then we can use the GeoDataFrame object we've already created to download data from the platform.

We demonstrate how to obtain an authorization token from this service in the following cell.

In [105]:
load_dotenv(
    # You will need to create this file with your
    # CLIENT_ID, CLIENT_SECRET, AUTH_TOKEN_URL, and API_URL
    dotenv_path=".env"
)

CLIENT_ID = os.environ.get("CLIENT_ID")
CLIENT_SECRET = os.environ.get("CLIENT_SECRET")
AUTH_TOKEN_URL = os.environ.get("AUTH_TOKEN_URL")
API_URL = os.environ.get("API_URL")


def get_new_token(client_id: str, client_secret: str, auth_token_url: str):
    """
    Authenticate with the Earth Platform and obtain a new access token.
    """
    token_req_payload = {"grant_type": "client_credentials"}
    token_response = requests.post(
        auth_token_url,
        data=token_req_payload,
        allow_redirects=False,
        auth=(client_id, client_secret),
    )
    token_response.raise_for_status()  # Raise an exception if the request failed

    tokens = json.loads(token_response.text)
    return tokens["access_token"]


token = get_new_token(CLIENT_ID, CLIENT_SECRET, AUTH_TOKEN_URL)
# Open a client to the STAC API
client = Client.open(API_URL, headers={"Authorization": f"Bearer {token}"})
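
Access tokens obtained with the client-credentials grant usually expire, and the download loop later in this notebook can run for a long time. The following is a minimal sketch of reusing a token until shortly before it expires. It assumes the token response carries the optional OAuth2 `expires_in` field (the notebook does not show whether the Earth Platform returns it), and `TokenCache` is a hypothetical helper.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Reuse an access token until shortly before it expires (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[], dict],
        clock: Callable[[], float] = time.monotonic,
        margin: float = 30.0,
    ):
        self._fetch = fetch      # returns the parsed token response as a dict
        self._clock = clock      # injectable so the behaviour can be tested
        self._margin = margin    # refresh this many seconds before expiry
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            payload = self._fetch()
            self._token = payload["access_token"]
            # Assumption: `expires_in` is given in seconds. If it is missing, the token
            # is treated as already expired, so the next call fetches a new one.
            self._expires_at = now + float(payload.get("expires_in", 0)) - self._margin
        return self._token
```

In this notebook, `fetch` could be a small function that posts to `AUTH_TOKEN_URL` in the same way as `get_new_token` but returns the whole response dictionary instead of only the `access_token` value.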

## Getting overlapping tiles

Currently, we split a coarse outline of Canada into a regular grid of cells roughly 5 degrees wide, and keep only the cells that intersect the selected area's polygon. These cells are then used as the search geometries when querying for satellite tiles.

In [4]:
def load_canada_map() -> gpd.GeoDataFrame:
    # CRS is EPSG:4326
    gdf = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
    return gdf[gdf.name == "Canada"]

def grid_bounds(geom, delta):
    # Convert a larger shapefile into grids...
    # Logic retrieved from:
    # https://www.matecdev.com/posts/shapely-polygon-gridding.html
    minx, miny, maxx, maxy = geom.bounds
    nx = int((maxx - minx)/delta)
    ny = int((maxy - miny)/delta)
    gx, gy = np.linspace(minx,maxx,nx), np.linspace(miny,maxy,ny)
    grid = []
    for i in range(len(gx)-1):
        for j in range(len(gy)-1):
            poly_ij = Polygon([[gx[i],gy[j]],[gx[i],gy[j+1]],[gx[i+1],gy[j+1]],[gx[i+1],gy[j]]])
            grid.append(poly_ij)

    return grid

def partition(geom, delta):
    prepared_geom = prep(geom)
    grid = list(filter(prepared_geom.intersects, grid_bounds(geom, delta)))
    return grid

In [5]:
areas_geojson = load_canada_map()

polygons = partition(areas_geojson.iloc[0].geometry, 5)



In [6]:
def get_intersecting_polygons(polygons, aoi):
    """
    Given a group of polygons and an area of interest (which may be a larger
    MultiPolygon), return a list of the polygons that intersect the area of interest.
    """
    intersecting_polygons = []
    for polygon in polygons:
        if aoi.intersects(polygon):
            intersecting_polygons.append(polygon)
    return intersecting_polygons

wanted_polygons = get_intersecting_polygons(polygons, aoi)
len(polygons), len(wanted_polygons)

(91, 2)
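
To see what the gridding step does without shapely, here is a simplified sketch that works on bounding boxes only. It mirrors the arithmetic of `grid_bounds` (including the fact that `nx` evenly spaced edges produce `nx - 1` cells, so cells come out somewhat wider than `delta`), but `grid_cells`, `linspace`, and `boxes_intersect` are names invented for this illustration, and the box test is a stand-in for the prepared-geometry `intersects` call used above.

```python
def linspace(lo: float, hi: float, n: int) -> list[float]:
    """Evenly spaced values from lo to hi inclusive (pure-Python stand-in for np.linspace)."""
    if n < 2:
        return [lo] * n
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def grid_cells(minx: float, miny: float, maxx: float, maxy: float, delta: float):
    """Split a bounding box into cells given as (minx, miny, maxx, maxy) tuples."""
    nx = int((maxx - minx) / delta)
    ny = int((maxy - miny) / delta)
    xs, ys = linspace(minx, maxx, nx), linspace(miny, maxy, ny)
    return [
        (xs[i], ys[j], xs[i + 1], ys[j + 1])
        for i in range(len(xs) - 1)
        for j in range(len(ys) - 1)
    ]


def boxes_intersect(a, b) -> bool:
    """True when two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


cells = grid_cells(0, 0, 20, 10, delta=5)
aoi_box = (1, 1, 2, 2)  # made-up area of interest
wanted = [cell for cell in cells if boxes_intersect(cell, aoi_box)]
print(len(cells), len(wanted))  # 3 1
```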

## Query the STAC API

Now that we have established a connection to the client, we can use the geometry we've selected and obtain Sentinel-2 tiles for that area of interest. We are going to start by returning the available items for our area of interest within our query constraints and will then decide which tiles we want to download for further use.

In [7]:
def remove_small_tiles(
    gdf: gpd.GeoDataFrame, min_area: float = 1000.0, reproject: bool = True
):
    """
    Given a GeoDataFrame as input, remove all geometries whose area is not larger than
    `min_area`, expressed in square kilometres (this assumes a metre-based projection).

    Returns:
        gdf: A GeoDataFrame with the small geometries removed.
    """
    gdf_projected = gdf.copy()
    if reproject:
        lcc_crs = "+proj=lcc +lat_1=40 +lat_2=65 +lon_0=125 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"
        gdf_projected = gdf.to_crs(lcc_crs)

    # The projected CRS uses metres, so areas are in m^2; divide by 1e6 to get km^2.
    gdf_projected["area_km2"] = gdf_projected["geometry"].area / 1e6
    gdf["proj_area"] = gdf_projected["area_km2"]
    gdf = gdf.loc[gdf["proj_area"] > min_area]

    return gdf

def add_geometries_iteratively(
    gdf: gpd.GeoDataFrame, intersection_threshold: float = 0.95, debug: bool = False
) -> tuple[MultiPolygon, gpd.GeoDataFrame]:
    """
    Given a GeoDataFrame as input, construct an area iteratively by merging its geometries.
    A row is left out of the selection when more than `intersection_threshold` of one of
    its polygons is already covered by the merged geometry.

    Returns:
        merged_geometry: A shapely Polygon or MultiPolygon object representing the merged geometry.
        selected_geometries: A GeoDataFrame of the rows that were selected to be merged.
    """
    assert intersection_threshold < 1, "The intersection threshold must be less than 1."
    # Initialize an empty geometry
    merged_geometry = None
    selected_geometries = []

    # Iterate over each row in the GeoDataFrame
    for idx, row in gdf.iterrows():
        geometry = row["geometry"]
        intersected = False  # Flag to track if the row has intersected geometry

        # Check if the current geometry is a MultiPolygon
        if isinstance(geometry, MultiPolygon):
            # Iterate over each polygon in the MultiPolygon
            for polygon in geometry.geoms:
                if merged_geometry is None:
                    merged_geometry = polygon
                else:
                    if (
                        merged_geometry.intersection(polygon).area / polygon.area
                    ) > intersection_threshold:
                        intersected = True
                    else:
                        merged_geometry = merged_geometry.union(polygon)
        else:
            # If the current geometry is not a MultiPolygon
            if merged_geometry is None:
                merged_geometry = geometry
            else:
                if (
                    merged_geometry.intersection(geometry).area / geometry.area
                ) > intersection_threshold:
                    intersected = True
                else:
                    merged_geometry = merged_geometry.union(geometry)

        # Print a message if the row has no non-intersecting area
        if intersected:
            if debug:
                print(
                    f"Row {idx} has no area that does not already intersect with the merged geometry."
                )
        else:
            selected_geometries.append(row)

    return (merged_geometry, gpd.GeoDataFrame(selected_geometries, crs=gdf.crs))

def get_all_overlapping_tiles(polygons, start_date, end_date):
    print(f"Getting area for: {len(polygons)} polygons")
    all_items, all_gdfs = [], []
    for polygon in polygons:
        poly_obj = {
            "type": "Polygon",
            "coordinates": list(polygon.__geo_interface__["coordinates"])
        }
        items, tile_gdf = get_sentinel2_data(client, poly_obj, start_date, end_date)
        if len(items) == 0:
            print("No items found for given area... Not great.")
            continue


        tile_gdf = remove_small_tiles(tile_gdf, reproject=True)
        _, tile_gdf = add_geometries_iteratively(tile_gdf)

        wanted_gdf = tile_gdf[tile_gdf.intersects(aoi)]

        wanted_tiles = [name.split("/")[-1] for name in wanted_gdf["earthsearch:s3_path"].tolist()]
        wanted_items = [item for item in items if item.id in wanted_tiles]
        all_items.append(wanted_items)
        all_gdfs.append(wanted_gdf)


    return all_items, all_gdfs

def get_sentinel2_data(
    client: Client,
    aoi: dict,
    start_date: str,
    end_date: str,
    cloud_cover: float = 10,
    max_items: int = 500,
):
    """
    Download Sentinel-2 data from the Earth Platform.
    """
    query = client.search(
        collections=["sentinel-2-l2a"],
        datetime=f"{start_date}T00:00:00.000000Z/{end_date}T00:00:00.000000Z",  # 2023-07-10T00:00:00.000000Z/2023-07-20T00:00:00.000000Z
        intersects=aoi,  # The area of interest; you can also query by bbox, or other geometry
        query={"eo:cloud_cover": {"lte": cloud_cover}},
        sortby=[
            {
                "field": "properties.eo:cloud_cover",
                "direction": "asc",
            },  # Sort by cloud cover from lowest to highest
        ],
        limit=max_items,  # This is the number of items to be returned per page
        max_items=max_items,  # This is number of items to page over
    )

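    items = list(query.items())
    if len(items) == 0:
        # Return empty results so the caller can skip this area instead of aborting the run
        print(
            "No items found, try enlarging search area or increasing cloud cover threshold."
        )
        return items, gpd.GeoDataFrame(geometry=[], crs="EPSG:4326")
    print(f"Found: {len(items):d} tiles.")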

    # Convert STAC items into a GeoJSON FeatureCollection
    stac_json = query.item_collection_as_dict()
    gdf = gpd.GeoDataFrame.from_features(stac_json, crs="EPSG:4326")

    return items, gdf
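
To make the search constraints easier to inspect, the sketch below collects the same arguments that `get_sentinel2_data` passes to `client.search` into a plain dictionary. How pystac-client serializes them on the wire is not shown in this notebook, so treat `build_search_body` (a hypothetical helper) as an illustration of the parameters rather than the exact request.

```python
from datetime import date


def build_search_body(
    aoi: dict, start_date: str, end_date: str, cloud_cover: float = 10, limit: int = 500
) -> dict:
    """Collect the search constraints used in this notebook into one dictionary."""
    start = date.fromisoformat(start_date)  # raises ValueError on malformed dates
    end = date.fromisoformat(end_date)
    if end < start:
        raise ValueError("end_date must not be earlier than start_date")
    return {
        "collections": ["sentinel-2-l2a"],
        # Both ends are pinned to midnight UTC, as in get_sentinel2_data.
        "datetime": f"{start.isoformat()}T00:00:00.000000Z/{end.isoformat()}T00:00:00.000000Z",
        "intersects": aoi,
        "query": {"eo:cloud_cover": {"lte": cloud_cover}},
        "sortby": [{"field": "properties.eo:cloud_cover", "direction": "asc"}],
        "limit": limit,
    }


body = build_search_body({"type": "Polygon", "coordinates": []}, "2023-07-10", "2023-07-20")
print(body["datetime"])
```

Because both ends of the interval sit at midnight, a window whose start and end dates are the same collapses to a single instant, so a single-day window probably needs its end date pushed forward by one day to match any acquisitions.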

In [None]:
# Assumed example search window (the dates from the comment in get_sentinel2_data);
# change these to the period you are interested in.
start_date, end_date = "2023-07-10", "2023-07-20"
sentinel_items, sentinel_gdf = get_all_overlapping_tiles(wanted_polygons, start_date, end_date)
# combine all gdf in sentinel_gdf to a single gdf
sentinel_gdf = gpd.GeoDataFrame(pd.concat(sentinel_gdf), crs=sentinel_gdf[0].crs)
# combine all items in sentinel_items to a single list
sentinel_items = [item for sublist in sentinel_items for item in sublist]
print(f"Found {len(sentinel_items)} Sentinel-2 items/tiles.")

In [None]:
# Take a look at the tiles found for the given query on the map
sentinel_gdf.explore(color="green")

In [None]:
# The items object contains a list of STAC items which includes
# metadata about the satellite imagery and links to the actual data.
# You can access the first item in the list like this:
sentinel_items[0]
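
Each STAC item carries an `assets` mapping whose entries hold the download links. As a rough illustration, the sketch below pulls the `href` values for a few asset keys out of an item expressed as a plain dictionary (for a pystac `Item`, `item.to_dict()` gives a dictionary of this general shape). The example item is hand-written for this illustration: it contains only the fields the sketch needs and reuses a URL that appears in the download log later in this notebook.

```python
def asset_hrefs(item: dict, wanted: list[str]) -> dict[str, str]:
    """Return {asset key: href} for the wanted asset keys that the item actually has."""
    assets = item.get("assets", {})
    return {key: assets[key]["href"] for key in wanted if key in assets}


# Hand-written, heavily trimmed example item (illustrative only).
example_item = {
    "id": "S2B_9UYQ_20200730_1_L2A",
    "assets": {
        "red": {
            "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/9/U/YQ/2020/7/S2B_9UYQ_20200730_1_L2A/B04.tif"
        }
    },
}
print(asset_hrefs(example_item, ["red", "scl"]))
```

Keys that are missing from the item (here `scl`) are skipped silently, which matches how `download_files_for_item` below only downloads keys that are present in `item.assets`.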

## Download our tiles

Now we want to download our Sentinel-2 tiles from EarthDaily's Earth Store. In this example we are going to download all of the high, medium, and low resolution bands that are captured by the Sentinel-2 satellites. To learn a bit more about the significance of each captured band we are downloading, see this tutorial [here](https://gisgeography.com/sentinel-2-bands-combinations/).

Note that the information we download in the below cells does not exhaust the available data for each tile. You can find links to different files like class labels and metadata on the tile by downloading more files found for a given Sentinel-2 item.

In [8]:
# This cell creates utility functions to download the files associated with a STAC item to a local
# file system.

def download_file(href: str, outpath: Path):
    """
    Given a URL, download the file to the specified path.
    """
    with requests.get(href, stream=True) as r:
        r.raise_for_status()
        with open(outpath, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)


def download_files_for_item(
    item: Item, asset_dict: dict[str, str], outpath: Path, debug: bool = True
) -> bool:
    """
    Save all files of interest for a given item.

    If one file fails to download return False, otherwise return True.
    """
    if not outpath.exists():
        outpath.mkdir(exist_ok=True, parents=True)

    for key, value in asset_dict.items():
        if debug:
            print(f"Downloading {key} and relabeling to {value}")
        if key in item.assets:
            if key in ["tileinfo_metadata"]:
                file_outpath = outpath / f"{value}.json"
            else:
                file_outpath = outpath / f"{value}.tiff"
            if not file_outpath.exists():
                try:
                    download_file(item.assets[key].href, file_outpath)
                except requests.ConnectionError:
                    print(
                        f"Failed to download {item.assets[key].href} for item {item.id}"
                    )
                    return False
                except requests.exceptions.ReadTimeout:
                    print(
                        f"Experienced a read timeout for {item.assets[key].href} for item {item.id}"
                    )
                    return False
            else:
                print(f"Skipping {item.assets[key].href} as it already exists.")

    return True
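
One thing to be aware of: `download_file` writes straight to the final path, and `download_files_for_item` skips files that already exist, so a download that is interrupted midway could later be mistaken for a complete file. A possible mitigation, sketched below under the assumption that the data arrives as an iterable of byte chunks (such as `r.iter_content(...)`), is to write to a temporary name and rename it only once everything has been written. `write_atomically` and the `.part` suffix are choices made for this illustration.

```python
import os
from pathlib import Path
from typing import Iterable


def write_atomically(chunks: Iterable[bytes], outpath: Path) -> None:
    """Write chunks to a temporary file and move it into place only on success."""
    outpath.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = outpath.with_name(outpath.name + ".part")
    try:
        with open(tmp_path, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        # os.replace overwrites the destination in one step on the same filesystem.
        os.replace(tmp_path, outpath)
    except BaseException:
        # Remove the partial file so a later run does not mistake it for a finished download.
        tmp_path.unlink(missing_ok=True)
        raise
```

Inside `download_file`, the loop over `r.iter_content(chunk_size=8192)` could be handed to this function in place of writing to `outpath` directly.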

In [9]:
# Spatial Resolution Data Found Here: https://sentinels.copernicus.eu/web/sentinel/user-guides/sentinel-2-msi/resolutions/spatial
# 10 meter resolution bands, order is important for Max resolution bands, B04=Red, B03=Green, B02=Blue, B08=NIR
MAX_RESOLUTION = ["B04", "B03", "B02", "B08"]
MID_RESOLUTION = [
    "B05",
    "B06",
    "B07",
    "B8A",
    "B11",
    "B12",
    # "scl",
]  # 20 meter resolution bands
MIN_RESOLUTION = ["B01", "B09"]  # 60 meter resolution bands
# MIN_RESOLUTION = ["B01", "B09", "aot"]  # 60 meter resolution bands

MIN_RESOLUTION_SIZE = 128
MID_RESOLUTION_SIZE = 384
MAX_RESOLUTION_SIZE = 768

EXTRACTED_BANDS = [
    ("B02", 10),
    ("B03", 10),
    ("B04", 10),
    ("B08", 10),
    ("B05", 20),
    ("B06", 20),
    ("B07", 20),
    ("B8A", 20),
    ("B11", 20),
    ("B12", 20),
    # ("scl", 20),
    ("B01", 60),
    ("B09", 60),
    # ("aot", 60),
]

# This ordering is based on the order of the bands read into the dataloader
# Basically - read 10m, 20m, 60m bands in that order
BAND_ORDERING = [
    "B04",
    "B03",
    "B02",
    "B08",
    "B05",
    "B06",
    "B07",
    "B8A",
    "B11",
    "B12",
    "B01",
    "B09",
]

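def get_band_specification(filepath: Path) -> tuple[str, int]:
    """
    Map a downloaded band file to its band name and tiling window size (in pixels),
    based on the file stem. Returns ("", 0) if the file is not one of the known bands.
    """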
    for band in MIN_RESOLUTION:
        if band == filepath.stem:
            return (band, MIN_RESOLUTION_SIZE)
    for band in MID_RESOLUTION:
        if band == filepath.stem:
            return (band, MID_RESOLUTION_SIZE)
    for band in MAX_RESOLUTION:
        if band == filepath.stem:
            return (band, MAX_RESOLUTION_SIZE)

    return ("", 0)
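
The three window sizes above are not arbitrary. Multiplying each window size by the pixel size of its resolution group gives the same ground footprint, which is what lets a 10 m, a 20 m, and a 60 m window with the same row and column index describe the same patch of ground. A quick check of that arithmetic, using the resolutions and sizes defined above:

```python
# Pixel size in metres -> window size in pixels, as defined in the cell above.
WINDOW_SIZES = {10: 768, 20: 384, 60: 128}

# Ground footprint (in metres) covered by one side of a window at each resolution.
footprints = {resolution: resolution * size for resolution, size in WINDOW_SIZES.items()}
print(footprints)  # {10: 7680, 20: 7680, 60: 7680}
```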

In [10]:
def get_aoi_shape(gdf: gpd.GeoDataFrame, target_crs: str) -> MultiPolygon:
    return gdf.to_crs(target_crs).unary_union


def create_masks(gdf: gpd.GeoDataFrame, out_path: Path, bounding_box: Polygon, meta: dict):
    gdf = gdf[gdf.intersects(bounding_box)]
    if len(gdf) == 0:
        raise ValueError("Intersecting area must have at least one polygon.")
    geometries = [(geom, value) for geom, value in zip(gdf.geometry, gdf.label)]

    with rasterio.open(out_path, "w+", **meta) as dest:
        out_arr = dest.read(1)
        burned = rasterize(
            geometries,
            out=out_arr,
            transform=dest.transform,
            fill=0,
            default_value=0,
            dtype=rasterio.uint8,
        )
        dest.write_band(1, burned)

def create_coordinates_file(bound: list, original_crs: T.Any, segment_output_file: Path):
    """
    Create a JSON file with the corner coordinates of a segment.

    Args:
        bound: list of (x, y) corner coordinates of the segment, in the raster's CRS
        original_crs: CRS to convert the coordinates into before saving
        segment_output_file: Path to the segment file; the JSON file is written next to it

    Returns:
        None
    """
    tile_geometry = [Point(coord) for coord in bound]

    tile_gdf = gpd.GeoDataFrame(geometry=tile_geometry)
    # NOTE: the raster CRS is hard-coded to UTM zone 9N here; rasters in another UTM zone
    # would need their own CRS for the conversion below to be correct.
    tile_gdf.crs = "EPSG:32609"

    tile_gdf = tile_gdf.to_crs(original_crs)

    # get a list of the coordinates of the segment
    coordinates = list(list(point.coords)[0] for point in tile_gdf.geometry)

    geojson_dict = {
        "geometry": {
            "coordinates": [coordinates]
        }
    }
    coordinate_file = segment_output_file.parent / "coordinates.json"
    if not os.path.exists(coordinate_file):
        # file does not exist, create json file
        with open(coordinate_file, 'w') as f:
            json.dump(geojson_dict, f)

def generate_tiles(
    input_file: Path,
    output_dir: Path,
    band_name: str,
    window_size: int = 768,
    aoi_gdf: T.Optional[gpd.GeoDataFrame] = None,
) -> tuple[int, int, list[Polygon]]:
    """
    Split Sentinel files into square tiles of a given window size.
    """
    all_bounds = []
    with rasterio.open(input_file.as_posix()) as src:
        height, width = src.shape

        # Calculate the number of segments in both dimensions
        num_rows = (height + window_size - 1) // window_size
        num_cols = (width + window_size - 1) // window_size
        num_processed = 0
        skipped = 0
        blank = 0
        misshapen = 0
        no_mask = 0
        aoi_gdf["label"] = 1
        original_crs = aoi_gdf.crs
        aoi_gdf = aoi_gdf.to_crs(src.crs)
        aoi = aoi_gdf.unary_union

        for row in range(num_rows):
            for col in range(num_cols):
                row_start = row * window_size
                col_start = col * window_size

                # Calculate the actual segment size based on the remaining pixels
                seg_height = min(window_size, height - row_start)
                seg_width = min(window_size, width - col_start)
                seg_window = Window(col_start, row_start, seg_width, seg_height)
                # skip if window is not a perfect square or the shape of the window is not the same as the window size
                if seg_height != seg_width or seg_height != window_size:
                    skipped += 1
                    misshapen += 1
                    continue

                seg_data = src.read(window=seg_window)
                seg_profile = src.profile.copy()
                seg_transform = src.window_transform(
                    seg_window
                )  # don't use rasterio.windows.transform...

                # Create a Polygon for the window's bounds
                bound = [
                    seg_transform * (0, 0),
                    seg_transform * (seg_window.width, 0),
                    seg_transform * (seg_window.width, seg_window.height),
                    seg_transform * (0, seg_window.height),
                ]
                polygon = Polygon(bound)

                if not polygon.intersects(aoi):
                    no_mask += 1
                    continue

                all_bounds.append(polygon)
                seg_profile.update(
                    width=seg_width,
                    height=seg_height,
                    transform=seg_transform,
                    compress="lzw",
                )

                segment_output_file = output_dir / f"{row}_{col}" / f"{band_name}.tif"
                segment_output_file.parent.mkdir(parents=True, exist_ok=True)


                create_coordinates_file(bound, original_crs, segment_output_file)

                with rasterio.open(
                    segment_output_file, "w", **seg_profile
                ) as segment_dst:
                    segment_dst.write(seg_data)

                mask_path = output_dir / f"{row}_{col}" / "mask.tif"
                if not mask_path.exists() and window_size == 768:
                    create_masks(aoi_gdf, mask_path, polygon, seg_profile)

                num_processed += 1

        print(
            f"Processed tiles: {num_processed}, blank tiles: {blank}, misshapen tiles: {misshapen}, no_mask: {no_mask}"
        )

        return num_processed, skipped, all_bounds
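
`generate_tiles` rounds the number of rows and columns up, then skips any window that is not a full `window_size` square, so the partial windows along the right and bottom edges are counted as misshapen and dropped. The sketch below reproduces just that counting. The 10980 x 10980 raster size is an assumption (it is the usual size of a 10 m Sentinel-2 band, but the notebook does not state it), and `count_windows` is a hypothetical helper.

```python
def count_windows(height: int, width: int, window_size: int) -> tuple[int, int]:
    """Return (windows visited, full square windows) for a raster of the given shape."""
    # Same ceiling division as in generate_tiles.
    num_rows = (height + window_size - 1) // window_size
    num_cols = (width + window_size - 1) // window_size
    # Only windows that fit entirely inside the raster are kept.
    full = (height // window_size) * (width // window_size)
    return num_rows * num_cols, full


total, full = count_windows(10980, 10980, 768)
print(total, full, total - full)  # 225 196 29
```

Under that assumption, 29 of the 225 windows in each band are skipped as misshapen, before the area-of-interest filter removes any more.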

In [38]:
high_resolution_bands = {"red": "B04", "green": "B03", "blue": "B02", "nir": "B08"}
mid_resolution_bands = {
    "rededge1": "B05",
    "rededge2": "B06",
    "rededge3": "B07",
    "nir08": "B8A",
    "swir16": "B11",
    "swir22": "B12",
}
low_resolution_bands = {"coastal": "B01", "nir09": "B09"}
other_files = {
    "scl": "scl",  # Scene Classification Map
    "tileinfo_metadata": "metadata",  # Tile Metadata
}
all_download_files = { # Modify this variable to change the files that are downloaded
    **high_resolution_bands,
    **mid_resolution_bands,
    **low_resolution_bands,
    **other_files,
}


In [41]:
def download_and_tile_files(
    gdf: gpd.GeoDataFrame,
    items: list[Item],
    download_files: dict[str, str],
    aoi_gdf: gpd.GeoDataFrame,
    output_dir: Path):
    """
    Given a GeoDataFrame of items and a list of STAC items, download the
    files to a given output directory.

    Parameters:
      items: A list of items that correspond to tiles found in the GeoDataFrame and include paths to files to be
        downloaded in this function.
      download_files: A dictionary of strings where the keys correspond to names of items on the Earth Platform
        and the values correspond to their Sentinel-2 name.
    """
    gdf["downloaded"] = False
    num_downloaded = 0  # Number of tiles for which every requested file was downloaded
    for index, tile in enumerate(items):
        dt_obj = datetime.strptime(tile.properties["datetime"], "%Y-%m-%dT%H:%M:%S.%fZ")
        formatted_date = dt_obj.strftime("%Y%m%d")
        out_path = output_dir / tile.id / formatted_date
        success = download_files_for_item(tile, download_files, out_path)

        if success:
            gdf.loc[
                gdf["s2:granule_id"] == tile.properties["s2:granule_id"], "downloaded"
            ] = True
            for file in out_path.iterdir():
                band_name, window_size = get_band_specification(file)
                if band_name and window_size:
                    out_dir = file.parent / "tiles"
                    if not out_dir.exists():
                        out_dir.mkdir(parents=True)
                    generate_tiles(file, out_dir, band_name, window_size, aoi_gdf)
            num_downloaded += 1
        else:
            print(f"Unable to download file for item with id: {tile.id} at index: {index} in items list.")

    print(
        f"Downloaded all bands for {num_downloaded} tiles. Failed to download at least one "
        + f"band or file for {len(items) - num_downloaded} tiles."
    )


# output_dir = Path("../content/sentinel_tiles") # Used for Google CoLab

# # We are only going to download 2 tiles here, but feel free to modify this function
# # call to download more data!
# download_and_tile_files(sentinel_gdf[0:2], sentinel_items[0:2], all_download_files, gdf, output_dir)
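
Putting `download_and_tile_files` and `generate_tiles` together, each tiled band ends up under `output_dir / <item id> / <YYYYMMDD> / tiles / <row>_<col> / <band>.tif`. The sketch below rebuilds that path from its parts. `tile_band_path` is a hypothetical helper, and the item id and timestamp in the example are illustrative values, not taken from a real query.

```python
from datetime import datetime
from pathlib import Path


def tile_band_path(
    output_dir: Path, item_id: str, item_datetime: str, row: int, col: int, band: str
) -> Path:
    """Rebuild the location where generate_tiles writes one band of one tile."""
    # Same timestamp format as in download_and_tile_files; a timestamp without
    # fractional seconds would raise a ValueError here.
    dt_obj = datetime.strptime(item_datetime, "%Y-%m-%dT%H:%M:%S.%fZ")
    date_dir = dt_obj.strftime("%Y%m%d")
    return output_dir / item_id / date_dir / "tiles" / f"{row}_{col}" / f"{band}.tif"


path = tile_band_path(
    Path("../datasets/sentinel2"),
    "S2B_9UYQ_20200730_1_L2A",      # illustrative item id
    "2020-07-30T19:20:10.000000Z",  # illustrative timestamp
    row=0,
    col=3,
    band="B04",
)
print(path.as_posix())
```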

In [101]:
date_list = [
#  ['2017-04-17', '2017-04-26'],
#  ['2017-05-09', '2017-05-22'],
#  ['2017-05-25', '2017-06-08'],
#  ['2017-06-13', '2017-06-27'],
#  ['2017-06-29', '2017-07-12'],
#  ['2017-07-15', '2017-07-30'],
#  ['2017-07-31', '2017-08-14'],
#  ['2017-08-17', '2017-09-01'],
#  ['2017-09-02', '2017-09-17'],
#  ['2017-09-18', '2017-10-03'],
#  ['2017-10-05', '2017-10-20'],
#  ['2017-10-23', '2017-11-02'],
#  ['2018-03-12', '2018-03-20'],
#  ['2018-04-15', '2018-04-26'],
#  ['2018-05-05', '2018-05-20'],
#  ['2018-05-22', '2018-06-06'],
#  ['2018-06-09', '2018-06-23'],
#  ['2018-06-25', '2018-07-07'],
#  ['2018-07-11', '2018-07-26'],
#  ['2018-07-27', '2018-08-11'],
#  ['2018-08-12', '2018-08-25'],
#  ['2018-08-28', '2018-09-06'],
# ##  ['2018-09-13', '2018-09-13'],
#  ['2018-09-29', '2018-10-14'],
#  ['2018-10-19', '2018-10-27'],
#  ['2019-03-18', '2019-04-01'],
#  ['2019-04-24', '2019-05-09'],
#  ['2019-05-10', '2019-05-24'],
#  ['2019-05-27', '2019-06-11'],
#  ['2019-06-13', '2019-06-28'],
#  ['2019-06-29', '2019-07-14'],
#  ['2019-07-15', '2019-07-29'],
#  ['2019-08-03', '2019-08-18'],
#  ['2019-08-19', '2019-09-03'],
#  ['2019-09-04', '2019-09-18'],
#  ['2019-09-20', '2019-10-05'],
#  ['2019-10-08', '2019-10-14'],
## ['2019-10-30', '2019-10-30'],
## ['2019-11-28', '2019-11-28'],
#  ['2020-03-16', '2020-03-29'],
#  ['2020-04-09', '2020-04-22'],
#  ['2020-04-25', '2020-05-08'],
# ## ['2020-06-07', '2020-06-16'],
#  ['2020-06-30', '2020-07-12'],
#  ['2020-07-24', '2020-07-28'],
#  ['2020-07-29', '2020-07-31'],
#  ['2020-08-11', '2020-08-17'],
#  ['2020-08-27', '2020-09-10'],
#  ['2020-09-18', '2020-09-22'],
## ['2020-10-30', '2020-10-30'],
## ['2020-12-21', '2020-12-21'],
 ['2021-04-01', '2021-04-16'],
 ['2021-04-17', '2021-04-25'],
 ['2021-05-05', '2021-05-20'],
 ['2021-05-21', '2021-06-05'],
 ['2021-06-08', '2021-06-22'],
 ['2021-06-25', '2021-07-09'],
 ['2021-07-11', '2021-07-26'],
 ['2021-07-28', '2021-08-12'],
 ['2021-08-15', '2021-08-27'],
 ['2021-09-02', '2021-09-14'],
## ['2021-09-22', '2021-09-22'],
## ['2021-11-03', '2021-11-05'],
## ['2021-12-11', '2021-12-13'],
## ['2023-01-24', '2023-01-24']
 ]

downloaded_dates = [['2022-02-06', '2022-02-17'],
 ['2022-03-09', '2022-03-09'],
 ['2022-04-07', '2022-04-07'],
 ['2022-05-26', '2022-05-29'],
 ['2022-06-28', '2022-07-11'],
 ['2022-07-14', '2022-07-26'],
 ['2022-07-31', '2022-08-15'],
 ['2022-08-17', '2022-09-01'],
 ['2022-09-02', '2022-09-17'],
 ['2022-09-19', '2022-10-04'],
 ['2022-10-05', '2022-10-20'],
 ['2022-10-22', '2022-11-05'],
 ['2022-11-08', '2022-11-20'],
 ['2022-11-29', '2022-11-29']]
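
Since these date windows are maintained by hand, a small check can catch typos before a long download run starts. The sketch below is a suggestion rather than part of the original workflow. `check_windows` is a hypothetical helper that reports windows whose end precedes their start, or that start on or before the previous window's end.

```python
from datetime import date


def check_windows(windows: list[list[str]]) -> list[str]:
    """Return human-readable problems found in a list of [start, end] date strings."""
    parsed = [(date.fromisoformat(start), date.fromisoformat(end)) for start, end in windows]
    problems = []
    for i, (start, end) in enumerate(parsed):
        if end < start:
            problems.append(f"window {i}: end date is before start date")
        if i > 0 and start <= parsed[i - 1][1]:
            problems.append(f"window {i}: overlaps or precedes the previous window")
    return problems


# The first two active windows from date_list above.
print(check_windows([["2021-04-01", "2021-04-16"], ["2021-04-17", "2021-04-25"]]))  # []
```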

In [106]:
for start_date, end_date in date_list:
    print("--------------------")
    print(f"Start date: {start_date}")
    print(f"End date: {end_date}")
    print("--------------------")
    print(f"Dropping {len(gdf) - len(gdf[gdf.geometry.is_valid])} invalid geometries.")
    gdf = gdf[gdf.geometry.is_valid]
    aoi = gdf[gdf.geometry.is_valid].unary_union

    all_items, all_gdfs = get_all_overlapping_tiles(wanted_polygons, start_date, end_date)
    if len(all_items) == 0:
        continue

    # all_gdfs = gpd.GeoDataFrame(pd.concat(all_gdfs), crs=all_gdfs[0].crs)
    items = [item for sublist in all_items for item in sublist] # covers western canada
    gdfs = pd.concat(all_gdfs)
    print(f"Number of items: {len(items)}")
    print(f"Number of gdfs: {len(gdfs)}")

    output_dir = Path("../datasets/sentinel2")
    download_and_tile_files(gdfs, items, all_download_files, gdf, output_dir)

--------------------
Start date: 2020-07-29
End date: 2020-07-31
--------------------
Dropping 0 invalid geometries.
Getting area for: 2 polygons
Found: 69 tiles.
Found: 60 tiles.
Number of items: 14
Number of gdfs: 14
Downloading red and relabeling to B04
Skipping https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/9/U/YQ/2020/7/S2B_9UYQ_20200730_1_L2A/B04.tif as it already exists.
Downloading green and relabeling to B03
Skipping https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/9/U/YQ/2020/7/S2B_9UYQ_20200730_1_L2A/B03.tif as it already exists.
Downloading blue and relabeling to B02
Skipping https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/9/U/YQ/2020/7/S2B_9UYQ_20200730_1_L2A/B02.tif as it already exists.
Downloading nir and relabeling to B08
Skipping https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/9/U/YQ/2020/7/S2B_9UYQ_20200730_1_L2A/B08.tif as it already exists.
Downloading rededge1 and relabeling to

## Organize wildfire data
This section organizes the wildfire point records by ignition date and location.

In [126]:
fire_df.drop(fire_df.index, inplace=True)

In [154]:
# Get all filenames in the directory into a list
filenames = os.listdir('../datasets/bc_fire_points/')

# Read all the csv files into one dataframe
fire_df_1 = pd.concat([pd.read_csv('../datasets/bc_fire_points/' + f) for f in filenames[1:]])

fire_df_2 = pd.read_csv('../datasets/bc_fire_points/' + filenames[0])


In [155]:
# drop columns that are not needed
fire_df_1 = fire_df_1.drop(["Fire Number", "X", "Y"], axis=1)

# Merge the two columns into one
fire_df_1['coordinates'] = fire_df_1['LONGITUDE'].astype(str) + ', ' + fire_df_1['LATITUDE'].astype(str)

# Rename the IGNITION DATE column to date
fire_df_1 = fire_df_1.rename(columns={"IGNITION DATE": "date"})

fire_df_1['date'] = pd.to_datetime(fire_df_1['date'], errors='coerce')
fire_df_1 = fire_df_1.dropna()

# Convert the date column to YYYYMMDD
fire_df_1['date'] = fire_df_1['date'].dt.strftime('%Y%m%d')

# drop columns that are not needed
fire_df_1 = fire_df_1.drop(["LATITUDE", "LONGITUDE"], axis=1)
fire_df_1


Unnamed: 0,date,coordinates
0,20210421,"-123.824717, 51.48205"
1,20210626,"-121.962133, 52.050917"
2,20210709,"-120.414017, 53.4259"
3,20210828,"-122.7818, 52.520117"
4,20210802,"-121.535912, 52.139225"
...,...,...
1866,20191030,"-126.0737, 49.286"
1867,20190626,"-125.7094, 49.3239"
1868,20190626,"-125.4045, 49.346"
1869,20190626,"-125.4014, 49.3429"


In [156]:
# drop columns that are not needed
fire_df_2 = fire_df_2.drop(["Fire Number", "X", "Y"], axis=1)

# Merge the two columns into one
fire_df_2['coordinates'] = fire_df_2['LONGITUDE'].astype(str) + ', ' + fire_df_2['LATITUDE'].astype(str)

# Rename the IGNITION DATE column to date
fire_df_2 = fire_df_2.rename(columns={"IGNITION DATE": "date"})

fire_df_2['date'] = pd.to_datetime(fire_df_2['date'], errors='coerce')
fire_df_2 = fire_df_2.dropna()

# Convert the date column to YYYYMMDD
fire_df_2['date'] = fire_df_2['date'].dt.strftime('%Y%m%d')

# drop columns that are not needed
fire_df_2 = fire_df_2.drop(["LATITUDE", "LONGITUDE"], axis=1)
fire_df_2

Unnamed: 0,date,coordinates
0,20220717,"-121.019467, 52.623"
1,20220824,"-122.375417, 51.0147"
2,20220823,"-118.872217, 49.543983"
3,20220812,"-120.567867, 50.472167"
4,20220823,"-121.4897, 49.221417"
...,...,...
1796,20220822,"-126.947533, 54.082667"
1797,20220714,"-121.446867, 49.782967"
1798,20220819,"-126.409033, 54.18675"
1799,20230116,"-118.743783, 52.43955"


In [158]:
# Combine the two dataframes
fire_df = pd.concat([fire_df_2, fire_df_1])
fire_df

Unnamed: 0,date,coordinates
0,20220717,"-121.019467, 52.623"
1,20220824,"-122.375417, 51.0147"
2,20220823,"-118.872217, 49.543983"
3,20220812,"-120.567867, 50.472167"
4,20220823,"-121.4897, 49.221417"
...,...,...
1866,20191030,"-126.0737, 49.286"
1867,20190626,"-125.7094, 49.3239"
1868,20190626,"-125.4045, 49.346"
1869,20190626,"-125.4014, 49.3429"
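
The two cleaning cells above apply the same steps to `fire_df_1` and `fire_df_2`. As a sketch, those steps could be wrapped in a single helper. The function name `clean_fire_points` and the sample rows are hypothetical and used only for illustration; the column names are the ones the cells above already rely on.

```python
import pandas as pd


def clean_fire_points(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from the cells above to one raw fire table."""
    # Drop columns that are not needed
    out = df.drop(columns=["Fire Number", "X", "Y"])
    # Merge longitude and latitude into one "lon, lat" string
    out["coordinates"] = out["LONGITUDE"].astype(str) + ", " + out["LATITUDE"].astype(str)
    out = out.rename(columns={"IGNITION DATE": "date"})
    # Unparseable dates become NaT and are dropped with the other missing values
    out["date"] = pd.to_datetime(out["date"], errors="coerce")
    out = out.dropna()
    out["date"] = out["date"].dt.strftime("%Y%m%d")
    return out.drop(columns=["LATITUDE", "LONGITUDE"])


# Hypothetical sample rows; the last one has an invalid date
sample = pd.DataFrame(
    {
        "Fire Number": ["A1", "A2", "A3"],
        "IGNITION DATE": ["2021-04-21", "2021-06-26", "not a date"],
        "LATITUDE": [51.48205, 52.050917, 49.0],
        "LONGITUDE": [-123.824717, -121.962133, -120.0],
        "X": [0, 0, 0],
        "Y": [0, 0, 0],
    }
)
cleaned = clean_fire_points(sample)
print(cleaned)
```

With a helper like this, each raw table would be cleaned with one call before the `pd.concat` step.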


In [159]:
fire_df.to_csv('../dataset_tables/fire_points.csv', index=False)

In [None]:
fire_df = pd.read_csv('../dataset_tables/fire_points.csv')

In [160]:
# Create a polygon from a list of coordinates
coordinates_list = [[-128.54948032,51.37397123],
    [-123.42622011591264, 51.26222303279471],
    [-123.52937300599798, 48.20558993286168],
    [-128.3913125586851, 48.7163322815431]]
wanted_polygon = Polygon(coordinates_list)

# Create a column called points
fire_df['points'] = fire_df['coordinates'].apply(lambda x: Point(float(x.split(',')[0]), float(x.split(',')[1])))

# filter column points by the polygon
fire_df = fire_df[fire_df['points'].apply(lambda x: wanted_polygon.contains(x))]

# sort the dataframe by date
fire_df = fire_df.sort_values(by='date')

In [161]:
fire_df

Unnamed: 0,date,coordinates,points
1192,20150526,"-123.7458, 48.8463",POINT (-123.7458 48.8463)
3031,20180312,"-125.1046, 50.1684",POINT (-125.1046 50.1684)
1731,20180415,"-124.0843, 49.1989",POINT (-124.0843 49.1989)
3053,20180416,"-125.3536, 50.0375",POINT (-125.3536 50.0375)
1748,20180418,"-124.312, 49.3026",POINT (-124.312 49.3026)
...,...,...,...
1794,20221117,"-125.311517, 49.42095",POINT (-125.311517 49.42095)
113,20221119,"-123.9036, 49.529167",POINT (-123.9036 49.529167)
576,20221120,"-124.669383, 49.289117",POINT (-124.669383 49.289117)
1741,20221129,"-123.722233, 48.6723",POINT (-123.722233 48.6723)
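
The filter above keeps only the fire points that fall inside `wanted_polygon`. The sketch below runs the same test on two sample coordinate strings. `parse_point` is a hypothetical helper, and wrapping the polygon with `prep` (a prepared geometry, which shapely offers for repeated predicate checks against one shape) is an optional variation, not what the cell above does.

```python
from shapely.geometry import Point, Polygon
from shapely.prepared import prep

# Same corner coordinates as coordinates_list above
region = Polygon(
    [
        [-128.54948032, 51.37397123],
        [-123.42622011591264, 51.26222303279471],
        [-123.52937300599798, 48.20558993286168],
        [-128.3913125586851, 48.7163322815431],
    ]
)
prepared_region = prep(region)


def parse_point(text: str) -> Point:
    """Turn a 'lon, lat' string into a shapely Point."""
    lon, lat = (float(value) for value in text.split(","))
    return Point(lon, lat)


samples = ["-123.7458, 48.8463", "-121.019467, 52.623"]
inside = [s for s in samples if prepared_region.contains(parse_point(s))]
print(inside)
```

Note that `contains` is false for a point lying exactly on the polygon boundary; `covers` would include such points.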


## Organize satellite file names into dataframe

This section copies the downloaded images into `prepared_dataset`. In summary:

- Downloaded images are stored in the `download_file` folder.
- Under `download_file`, images are grouped by the polygon they belong to.
- Each polygon has two types of images: GeoTIFFs with band information, and a mask file.
- GeoTIFFs are copied to the `prepared_dataset/image_directory_{date}_{group_id}` folder.
- Mask files are copied to the `prepared_dataset/mask_directory_{date}_{group_id}` folder.

In [162]:
# Empty raw_df (left over from a previous run) before rebuilding it below
raw_df.drop(raw_df.index, inplace=True)

In [115]:
def fast_scandir(dirname: str) -> list:
    """
    Scan and return all subfolders of a directory.
    """
    subfolders= [f.path for f in os.scandir(dirname) if f.is_dir()]
    for dirname in list(subfolders):
        subfolders.extend(fast_scandir(dirname))
    return subfolders

def get_subfolders_with_keyword(keyword: str, subfolders_list: list) -> list:
    subfolders_with_keyword_list = []

    for folder in subfolders_list:
        if keyword in folder:
            subfolders_with_keyword_list.append(folder)

    return subfolders_with_keyword_list

In [163]:
source_path = "../datasets/sentinel2"
subfolders_list = fast_scandir(source_path)
print(f"Number of subfolders: {len(subfolders_list)}")

subfolders_with_keyword_list = get_subfolders_with_keyword("tiles/", subfolders_list) # note that we need the / to get folders
print(f"Number of subfolders with keyword: {len(subfolders_with_keyword_list)}")

Number of subfolders: 18142
Number of subfolders with keyword: 16780
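
As an alternative to the recursive `fast_scandir`, the standard library's `os.walk` can list every nested folder in one loop. This is only a sketch: `list_subfolders` is a hypothetical name, and the temporary folder layout below is made up so the example can run anywhere.

```python
import os
import tempfile


def list_subfolders(root: str) -> list:
    """Return every directory below root (root itself is not included)."""
    subfolders = []
    for current_dir, dirnames, _filenames in os.walk(root):
        subfolders.extend(os.path.join(current_dir, d) for d in dirnames)
    return subfolders


# Build a small, made-up folder tree to scan
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "S2A_example", "20181004", "tiles", "3_4"))
    os.makedirs(os.path.join(root, "S2A_example", "20181004", "tiles", "3_3"))
    found = list_subfolders(root)
    # Keep only folders that sit inside a "tiles" folder
    tile_folders = [f for f in found if "tiles" + os.sep in f]
    n_found, n_tiles = len(found), len(tile_folders)

print(n_found, n_tiles)
```

As in the cell above, the trailing separator in the keyword keeps the `tiles` folder itself out of the result.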


In [164]:
# Organize the tile information into a dataframe
raw_df = pd.DataFrame(subfolders_with_keyword_list, columns=["path_name"])
detailed_df = pd.DataFrame([x.rsplit('/') for x in raw_df['path_name']])

# insert detailed_df into raw_df
raw_df = pd.concat([raw_df, detailed_df], axis=1)
raw_df.drop([0, 1, 5], axis=1, inplace=True)

# rename columns
raw_df = raw_df.rename(columns={"path_name": "path_name_sentinel2", 2: "satellite", 3: "imagery_id", 4: "date", 6: "tile_id"})

# Generate the saving path from date and tile_id, starting with "../prepared_dataset"
raw_df['saving_path'] = raw_df.apply(lambda x: f"../prepared_dataset/{x['date']}/{x['tile_id']}", axis=1)
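
The cell above relies on each tile folder path having the same depth, so that fixed positions map to the satellite, imagery id, date, and tile id. The sketch below makes that assumption explicit with `pathlib`. `parse_tile_path` is a hypothetical helper, and the example path is an assumed layout inferred from the column mapping above.

```python
from pathlib import PurePosixPath


def parse_tile_path(path_name: str) -> dict:
    """Split a tile folder path into its named components."""
    parts = PurePosixPath(path_name).parts
    # Assumed layout: .../<satellite>/<imagery_id>/<date>/tiles/<tile_id>
    satellite, imagery_id, date, _tiles, tile_id = parts[-5:]
    return {
        "satellite": satellite,
        "imagery_id": imagery_id,
        "date": date,
        "tile_id": tile_id,
        "saving_path": f"../prepared_dataset/{date}/{tile_id}",
    }


info = parse_tile_path(
    "../datasets/sentinel2/S2A_9UYR_20181004_0_L2A/20181004/tiles/3_4"
)
print(info)
```

Unpacking from the end of the path means the helper does not depend on how deep `../datasets` sits, but it fails loudly if a path has fewer than five components.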

In [165]:
# open a file called coordinates.json from path_name_sentinel2
# save the content of the file into a new column called 'coordinates'

def get_coordinates_from_json(path_name_sentinel2: str) -> str:
    with open(f"{path_name_sentinel2}/coordinates.json", "r") as file:
        tile_info = json.load(file)
        coordinates = tile_info['geometry']['coordinates'][0]
        coordinates_str = ",".join(str(tuple(coord)) for coord in coordinates)
        return coordinates_str

raw_df['coordinates'] = raw_df.apply(lambda x: get_coordinates_from_json(x['path_name_sentinel2']), axis=1)

In [166]:
raw_df

Unnamed: 0,path_name_sentinel2,satellite,imagery_id,date,tile_id,saving_path,coordinates
0,../datasets/sentinel2/S2A_9UYR_20181004_0_L2A/...,sentinel2,S2A_9UYR_20181004_0_L2A,20181004,3_4,../prepared_dataset/20181004/3_4,"(-125.7607154787941, 50.29994118390697),(-125...."
1,../datasets/sentinel2/S2A_9UYR_20181004_0_L2A/...,sentinel2,S2A_9UYR_20181004_0_L2A,20181004,3_3,../prepared_dataset/20181004/3_3,"(-125.8683865696154, 50.30289429674448),(-125...."
2,../datasets/sentinel2/S2A_9UYR_20181004_0_L2A/...,sentinel2,S2A_9UYR_20181004_0_L2A,20181004,9_4,../prepared_dataset/20181004/9_4,"(-125.788519596963, 49.88615910147051),(-125.6..."
3,../datasets/sentinel2/S2A_9UYR_20181004_0_L2A/...,sentinel2,S2A_9UYR_20181004_0_L2A,20181004,9_3,../prepared_dataset/20181004/9_3,"(-125.89527020203685, 49.88906946882801),(-125..."
4,../datasets/sentinel2/S2A_9UYR_20181004_0_L2A/...,sentinel2,S2A_9UYR_20181004_0_L2A,20181004,12_5,../prepared_dataset/20181004/12_5,"(-125.69590068808976, 49.67626604565151),(-125..."
...,...,...,...,...,...,...,...
16775,../datasets/sentinel2/S2B_10UCA_20180512_0_L2A...,sentinel2,S2B_10UCA_20180512_0_L2A,20180512,13_3,../prepared_dataset/20180512/13_3,"(-131.45048004952844, 49.628402834406735),(-13..."
16776,../datasets/sentinel2/S2B_10UCA_20180512_0_L2A...,sentinel2,S2B_10UCA_20180512_0_L2A,20180512,4_0,../prepared_dataset/20180512/4_0,"(-131.80499804996853, 50.24215420271168),(-131..."
16777,../datasets/sentinel2/S2B_10UCA_20180512_0_L2A...,sentinel2,S2B_10UCA_20180512_0_L2A,20180512,11_0,../prepared_dataset/20180512/11_0,"(-131.77701636880613, 49.759208128694205),(-13..."
16778,../datasets/sentinel2/S2B_10UCA_20180512_0_L2A...,sentinel2,S2B_10UCA_20180512_0_L2A,20180512,1_13,../prepared_dataset/20180512/1_13,"(-130.41154640225076, 50.474664553619796),(-13..."


In [120]:
raw_df.iloc[0]['coordinates']

'(-125.7607154787941, 50.29994118390697),(-125.65306262838969, 50.29688850061719),(-125.65789652279332, 50.22793508254463),(-125.7653945658079, 50.23098034961218)'
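
Because the tile coordinates are stored as a string of tuples, they have to be parsed back before they can be used as geometry. The sketch below shows that round trip with `ast.literal_eval` from the standard library. The ring values are shortened, made-up numbers.

```python
import ast

# Made-up ring in the same [lon, lat] form that coordinates.json provides
ring = [[-125.76, 50.3], [-125.65, 50.3], [-125.66, 50.23], [-125.77, 50.23]]

# Same serialization as get_coordinates_from_json
coordinates_str = ",".join(str(tuple(coord)) for coord in ring)

# literal_eval reads "(a, b),(c, d)" as a tuple of tuples
tile_vertices = [list(coord) for coord in ast.literal_eval(coordinates_str)]

# A fire coordinate string "lon, lat" evaluates to a single flat tuple
fire_lon_lat = [float(value) for value in ast.literal_eval("-123.7458, 48.8463")]

print(tile_vertices)
print(fire_lon_lat)
```

One caveat of this format: a ring with only one vertex would evaluate to a single flat tuple rather than a tuple of tuples, so the parsing assumes at least two vertices.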

## Merge folder names for the satellite and fire information

In [167]:
import ast

# Convert the coordinate strings into lists of coordinates
raw_df['coordinates'] = raw_df['coordinates'].apply(lambda x: [list(coord) for coord in ast.literal_eval(x)])
fire_df['coordinates'] = fire_df['coordinates'].apply(lambda x: [float(coord) for coord in ast.literal_eval(x)])

# Convert coordinates list in raw_df to Polygon
raw_df['coordinates'] = raw_df['coordinates'].apply(lambda x: Polygon(x))

# Convert coordinates list in fire_df to Point
fire_df['coordinates'] = fire_df['coordinates'].apply(lambda x: Point(x))

# Convert date(int) to datetime(YYYY-MM-DD)
fire_df['date'] = pd.to_datetime(fire_df['date'], format='%Y%m%d')
raw_df['date'] = pd.to_datetime(raw_df['date'], format='%Y%m%d')

In [168]:
fire_df

Unnamed: 0,date,coordinates,points
1192,2015-05-26,POINT (-123.7458 48.8463),POINT (-123.7458 48.8463)
3031,2018-03-12,POINT (-125.1046 50.1684),POINT (-125.1046 50.1684)
1731,2018-04-15,POINT (-124.0843 49.1989),POINT (-124.0843 49.1989)
3053,2018-04-16,POINT (-125.3536 50.0375),POINT (-125.3536 50.0375)
1748,2018-04-18,POINT (-124.312 49.3026),POINT (-124.312 49.3026)
...,...,...,...
1794,2022-11-17,POINT (-125.311517 49.42095),POINT (-125.311517 49.42095)
113,2022-11-19,POINT (-123.9036 49.529167),POINT (-123.9036 49.529167)
576,2022-11-20,POINT (-124.669383 49.289117),POINT (-124.669383 49.289117)
1741,2022-11-29,POINT (-123.722233 48.6723),POINT (-123.722233 48.6723)


In [169]:
# Create a new column in raw_df called "if_fire" and set it to False
raw_df['if_fire'] = False
raw_df.head()

Unnamed: 0,path_name_sentinel2,satellite,imagery_id,date,tile_id,saving_path,coordinates,if_fire
0,../datasets/sentinel2/S2A_9UYR_20181004_0_L2A/...,sentinel2,S2A_9UYR_20181004_0_L2A,2018-10-04,3_4,../prepared_dataset/20181004/3_4,POLYGON ((-125.7607154787941 50.29994118390697...,False
1,../datasets/sentinel2/S2A_9UYR_20181004_0_L2A/...,sentinel2,S2A_9UYR_20181004_0_L2A,2018-10-04,3_3,../prepared_dataset/20181004/3_3,POLYGON ((-125.8683865696154 50.30289429674448...,False
2,../datasets/sentinel2/S2A_9UYR_20181004_0_L2A/...,sentinel2,S2A_9UYR_20181004_0_L2A,2018-10-04,9_4,../prepared_dataset/20181004/9_4,"POLYGON ((-125.788519596963 49.88615910147051,...",False
3,../datasets/sentinel2/S2A_9UYR_20181004_0_L2A/...,sentinel2,S2A_9UYR_20181004_0_L2A,2018-10-04,9_3,../prepared_dataset/20181004/9_3,POLYGON ((-125.89527020203685 49.8890694688280...,False
4,../datasets/sentinel2/S2A_9UYR_20181004_0_L2A/...,sentinel2,S2A_9UYR_20181004_0_L2A,2018-10-04,12_5,../prepared_dataset/20181004/12_5,POLYGON ((-125.69590068808976 49.6762660456515...,False


In [170]:
# Compare each fire's date and location (Point) in fire_df
# with each tile's date and footprint (Polygon) in raw_df.
# If the fire point lies inside the tile polygon and the tile was captured
# on the ignition date or up to two days before it, set if_fire to True.
for index, row in fire_df.iterrows():
    for index2, row2 in raw_df.iterrows():
        point = row['coordinates']
        polygon = row2['coordinates']
        date = row['date']
        date2 = row2['date']
        if polygon.contains(point) and timedelta(days=0) <= date - date2 <= timedelta(days=2):
            raw_df.at[index2, 'if_fire'] = True
            print(f"fire time: {date} and tile time: {date2}.")

fire time: 2018-05-14 00:00:00 and tile time: 2018-05-14 00:00:00.
fire time: 2018-05-23 00:00:00 and tile time: 2018-05-22 00:00:00.
fire time: 2018-06-20 00:00:00 and tile time: 2018-06-18 00:00:00.
fire time: 2018-06-20 00:00:00 and tile time: 2018-06-18 00:00:00.
fire time: 2018-06-20 00:00:00 and tile time: 2018-06-18 00:00:00.
fire time: 2018-06-20 00:00:00 and tile time: 2018-06-18 00:00:00.
fire time: 2018-06-20 00:00:00 and tile time: 2018-06-18 00:00:00.
fire time: 2018-06-20 00:00:00 and tile time: 2018-06-18 00:00:00.
fire time: 2018-07-04 00:00:00 and tile time: 2018-07-03 00:00:00.
fire time: 2018-07-11 00:00:00 and tile time: 2018-07-11 00:00:00.
fire time: 2018-07-11 00:00:00 and tile time: 2018-07-11 00:00:00.
fire time: 2018-07-18 00:00:00 and tile time: 2018-07-16 00:00:00.
fire time: 2018-07-31 00:00:00 and tile time: 2018-07-31 00:00:00.
fire time: 2018-08-07 00:00:00 and tile time: 2018-08-07 00:00:00.
fire time: 2018-08-07 00:00:00 and tile time: 2018-08-07 00:00
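
The date condition in the loop above is easy to misread, so here it is isolated as a small sketch using only the standard library. `tile_precedes_fire` is a hypothetical name; the logic mirrors the `timedelta` comparison in the loop.

```python
from datetime import datetime, timedelta


def tile_precedes_fire(fire_date, tile_date, max_lag=timedelta(days=2)):
    """True when the tile was captured on the ignition date or up to max_lag before it."""
    return timedelta(days=0) <= fire_date - tile_date <= max_lag


fire = datetime(2018, 5, 23)
same_day = tile_precedes_fire(fire, datetime(2018, 5, 23))
one_day_before = tile_precedes_fire(fire, datetime(2018, 5, 22))
three_days_before = tile_precedes_fire(fire, datetime(2018, 5, 20))
one_day_after = tile_precedes_fire(fire, datetime(2018, 5, 24))

print(same_day, one_day_before, three_days_before, one_day_after)
```

In other words, a tile is labeled as a fire tile only if the image was taken shortly before or on the day of ignition, never after it.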

In [173]:
raw_df['if_fire'].value_counts()

if_fire
False    16744
True        36
Name: count, dtype: int64

In [178]:
raw_df.to_csv('../dataset_tables/raw_df1.csv', index=False)

In [174]:
# Randomly select 1% of the rows in each month where 'if_fire' is False
# and merge these rows with the rows where 'if_fire' is True

raw_df['month'] = raw_df['date'].dt.month

# For rows where 'if_fire' is False, group by month and randomly select 1% of the rows in each group
df_false = raw_df[raw_df['if_fire'] == False].groupby('month').apply(lambda x: x.sample(frac=0.01))

# Get the rows where 'if_fire' is True
df_true = raw_df[raw_df['if_fire'] == True]

# Merge the two dataframes
df_sentinel2 = pd.concat([df_false, df_true])

df_sentinel2.drop(columns=['month'], inplace=True)

df_sentinel2


Unnamed: 0,path_name_sentinel2,satellite,imagery_id,date,tile_id,saving_path,coordinates,if_fire
"(3, 3925)",../datasets/sentinel2/S2B_9UXS_20190318_0_L2A/...,sentinel2,S2B_9UXS_20190318_0_L2A,2019-03-18,11_2,../prepared_dataset/20190318/11_2,POLYGON ((-127.3671432187832 50.68006747113423...,False
"(3, 6174)",../datasets/sentinel2/S2A_10UCA_20200321_1_L2A...,sentinel2,S2A_10UCA_20200321_1_L2A,2020-03-21,0_10,../prepared_dataset/20200321/0_10,POLYGON ((-130.73863718301166 50.5393115473436...,False
"(3, 5735)",../datasets/sentinel2/S2A_9UXR_20180318_0_L2A/...,sentinel2,S2A_9UXR_20180318_0_L2A,2018-03-18,6_7,../prepared_dataset/20180318/6_7,POLYGON ((-126.84916812523272 50.1179485953767...,False
"(3, 11450)",../datasets/sentinel2/S2A_10UDB_20180315_1_L2A...,sentinel2,S2A_10UDB_20180315_1_L2A,2018-03-15,11_3,../prepared_dataset/20180315/11_3,POLYGON ((-130.0900231414613 50.68640863150005...,False
"(3, 4291)",../datasets/sentinel2/S2A_9UXS_20190320_0_L2A/...,sentinel2,S2A_9UXS_20190320_0_L2A,2019-03-20,11_3,../prepared_dataset/20190320/11_3,POLYGON ((-127.2584889553842 50.67849433697920...,False
...,...,...,...,...,...,...,...,...
14380,../datasets/sentinel2/S2A_9UYR_20220801_0_L2A/...,sentinel2,S2A_9UYR_20220801_0_L2A,2022-08-01,3_4,../prepared_dataset/20220801/3_4,POLYGON ((-125.7607154787941 50.29994118390697...,True
14391,../datasets/sentinel2/S2A_9UYR_20220801_0_L2A/...,sentinel2,S2A_9UYR_20220801_0_L2A,2022-08-01,3_2,../prepared_dataset/20220801/3_2,POLYGON ((-125.97607529975805 50.3057477994724...,True
15930,../datasets/sentinel2/S2A_9UYR_20180703_0_L2A/...,sentinel2,S2A_9UYR_20180703_0_L2A,2018-07-03,9_6,../prepared_dataset/20180703/9_6,POLYGON ((-125.57507206004699 49.8800440026963...,True
16049,../datasets/sentinel2/S2A_9UXR_20220920_0_L2A/...,sentinel2,S2A_9UXR_20220920_0_L2A,2022-09-20,3_12,../prepared_dataset/20220920/3_12,"POLYGON ((-126.300924835124 50.31374939214619,...",True


In [175]:
df_sentinel2['if_fire'].value_counts()

if_fire
False    167
True      36
Name: count, dtype: int64
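
The per-month sampling above can also be written with `DataFrameGroupBy.sample`, which accepts a `random_state` so the selection is repeatable and which keeps the original row index instead of building a `(month, row)` index. This is a sketch on made-up data, not the notebook's own method; the frame `tiles` below is hypothetical.

```python
import pandas as pd

# Made-up table: 200 non-fire tiles in March and 200 in July
tiles = pd.DataFrame({"month": [3] * 200 + [7] * 200, "if_fire": [False] * 400})

negatives = tiles[~tiles["if_fire"]]
# Draw 1% of the rows within each month, reproducibly
sampled = negatives.groupby("month").sample(frac=0.01, random_state=42)

print(sampled)
```

Fixing the seed matters here because the sampled negatives decide which tiles are copied into `prepared_dataset` later on.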

In [176]:
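def split_data(df):
    """Shuffle df and label roughly 60% of rows 'train', 20% 'test', and the rest 'val'."""
    train_size = int(len(df) * 0.6)
    test_size = int(len(df) * 0.2)

    # Shuffle with a fixed seed so the split is reproducible
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)

    df['split'] = 'val'

    # .loc slices by label and includes both endpoints, so the second assignment
    # overwrites row `train_size` and 'test' ends up with test_size + 1 rows
    df.loc[:train_size, 'split'] = 'train'
    df.loc[train_size:train_size+test_size, 'split'] = 'test'

    return df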

df_true = df_sentinel2[df_sentinel2['if_fire'] == True]
df_false = df_sentinel2[df_sentinel2['if_fire'] == False]

df_sentinel2 = pd.concat([split_data(df_true), split_data(df_false)])
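
Label-based `.loc` slices include both endpoints, which makes exact split sizes hard to reason about. The sketch below is a position-based alternative that builds the label list directly, so the sizes are exactly `int(0.6 * n)` and `int(0.2 * n)` with the remainder going to validation. `assign_splits` is a hypothetical name, and its sizes differ slightly from those produced by `split_data` above.

```python
import pandas as pd


def assign_splits(df, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle df and add a 'split' column with exact train/test/val sizes."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    # Build the labels by position: train first, then test, then val
    labels = ["train"] * n_train + ["test"] * n_test
    labels += ["val"] * (len(shuffled) - len(labels))
    shuffled["split"] = labels
    return shuffled


# Made-up frame with 36 rows, the same size as the fire subset above
frame = pd.DataFrame({"tile": range(36)})
labelled = assign_splits(frame)
counts = labelled["split"].value_counts().to_dict()
print(counts)
```

Applying the helper separately to the fire and non-fire subsets, as the cell above does with `split_data`, keeps both classes represented in every split.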

In [177]:
df_sentinel2

Unnamed: 0,path_name_sentinel2,satellite,imagery_id,date,tile_id,saving_path,coordinates,if_fire,split
0,../datasets/sentinel2/S2B_9UYR_20180522_0_L2A/...,sentinel2,S2B_9UYR_20180522_0_L2A,2018-05-22,8_6,../prepared_dataset/20180522/8_6,POLYGON ((-125.5701791979101 49.94899566498456...,True,train
1,../datasets/sentinel2/S2B_9UYS_20180807_1_L2A/...,sentinel2,S2B_9UYS_20180807_1_L2A,2018-08-07,7_13,../prepared_dataset/20180807/7_13,POLYGON ((-124.7367744909437 50.88987134337061...,True,train
2,../datasets/sentinel2/S2A_9UYR_20190522_0_L2A/...,sentinel2,S2A_9UYR_20190522_0_L2A,2019-05-22,7_8,../prepared_dataset/20190522/7_8,POLYGON ((-125.3512876632677 50.01140717529444...,True,train
3,../datasets/sentinel2/S2B_9UXR_20180618_0_L2A/...,sentinel2,S2B_9UXR_20180618_0_L2A,2018-06-18,7_11,../prepared_dataset/20180618/7_11,POLYGON ((-126.42357531752924 50.0401949435548...,True,train
4,../datasets/sentinel2/S2B_9UYQ_20220816_0_L2A/...,sentinel2,S2B_9UYQ_20220816_0_L2A,2022-08-16,2_6,../prepared_dataset/20220816/2_6,POLYGON ((-125.6041482479572 49.46469315711875...,True,train
...,...,...,...,...,...,...,...,...,...
162,../datasets/sentinel2/S2A_10UDU_20180703_0_L2A...,sentinel2,S2A_10UDU_20180703_0_L2A,2018-07-03,4_4,../prepared_dataset/20180703/4_4,POLYGON ((-129.93788566570646 48.4728277358531...,False,val
163,../datasets/sentinel2/S2A_10UCV_20180809_0_L2A...,sentinel2,S2A_10UCV_20180809_0_L2A,2018-08-09,10_11,../prepared_dataset/20180809/10_11,POLYGON ((-130.5778654389746 48.95110402298405...,False,val
164,../datasets/sentinel2/S2B_9UXR_20190424_0_L2A/...,sentinel2,S2B_9UXR_20190424_0_L2A,2019-04-24,4_5,../prepared_dataset/20190424/4_5,POLYGON ((-127.05829770719795 50.2597936447649...,False,val
165,../datasets/sentinel2/S2A_10UCV_20200828_0_L2A...,sentinel2,S2A_10UCV_20200828_0_L2A,2020-08-28,1_2,../prepared_dataset/20200828/1_2,POLYGON ((-131.55302938550602 49.5554760291356...,False,val


In [179]:
# move column split to the front
cols = list(df_sentinel2.columns)
cols = [cols[-1]] + cols[:-1]
df_sentinel2 = df_sentinel2[cols]

In [180]:
df_sentinel2

Unnamed: 0,split,path_name_sentinel2,satellite,imagery_id,date,tile_id,saving_path,coordinates,if_fire
0,train,../datasets/sentinel2/S2B_9UYR_20180522_0_L2A/...,sentinel2,S2B_9UYR_20180522_0_L2A,2018-05-22,8_6,../prepared_dataset/20180522/8_6,POLYGON ((-125.5701791979101 49.94899566498456...,True
1,train,../datasets/sentinel2/S2B_9UYS_20180807_1_L2A/...,sentinel2,S2B_9UYS_20180807_1_L2A,2018-08-07,7_13,../prepared_dataset/20180807/7_13,POLYGON ((-124.7367744909437 50.88987134337061...,True
2,train,../datasets/sentinel2/S2A_9UYR_20190522_0_L2A/...,sentinel2,S2A_9UYR_20190522_0_L2A,2019-05-22,7_8,../prepared_dataset/20190522/7_8,POLYGON ((-125.3512876632677 50.01140717529444...,True
3,train,../datasets/sentinel2/S2B_9UXR_20180618_0_L2A/...,sentinel2,S2B_9UXR_20180618_0_L2A,2018-06-18,7_11,../prepared_dataset/20180618/7_11,POLYGON ((-126.42357531752924 50.0401949435548...,True
4,train,../datasets/sentinel2/S2B_9UYQ_20220816_0_L2A/...,sentinel2,S2B_9UYQ_20220816_0_L2A,2022-08-16,2_6,../prepared_dataset/20220816/2_6,POLYGON ((-125.6041482479572 49.46469315711875...,True
...,...,...,...,...,...,...,...,...,...
162,val,../datasets/sentinel2/S2A_10UDU_20180703_0_L2A...,sentinel2,S2A_10UDU_20180703_0_L2A,2018-07-03,4_4,../prepared_dataset/20180703/4_4,POLYGON ((-129.93788566570646 48.4728277358531...,False
163,val,../datasets/sentinel2/S2A_10UCV_20180809_0_L2A...,sentinel2,S2A_10UCV_20180809_0_L2A,2018-08-09,10_11,../prepared_dataset/20180809/10_11,POLYGON ((-130.5778654389746 48.95110402298405...,False
164,val,../datasets/sentinel2/S2B_9UXR_20190424_0_L2A/...,sentinel2,S2B_9UXR_20190424_0_L2A,2019-04-24,4_5,../prepared_dataset/20190424/4_5,POLYGON ((-127.05829770719795 50.2597936447649...,False
165,val,../datasets/sentinel2/S2A_10UCV_20200828_0_L2A...,sentinel2,S2A_10UCV_20200828_0_L2A,2020-08-28,1_2,../prepared_dataset/20200828/1_2,POLYGON ((-131.55302938550602 49.5554760291356...,False


## Organize satellite files by tiles

In [181]:
def fast_scandir(dirname: str) -> list:
    """
    Scan and return all subfolders of a directory.

    Parameters:
    dirname: root directory name to be scanned.

    Returns:
    A list of subfolders under the given directory.
    """
    subfolders = [f.path for f in os.scandir(dirname) if f.is_dir()]
    for dirname in list(subfolders):
        subfolders.extend(fast_scandir(dirname))
    return subfolders


def get_folders_with_keyword(keyword: str, folders_list: list) -> list:
    """
    From a list of folders, cherry pick the ones with given keyword.

    Parameters:
    keyword: keyword that folders directory must contain.
    folders_list: list of folders to be searched.

    Returns:
    A list of paths to folders that contain the given keyword.
    """
    folders_with_keyword_list = []

    for folder in folders_list:
        if keyword in folder:
            folders_with_keyword_list.append(folder)

    return folders_with_keyword_list


def get_list_of_files_in_directory(directory_name: str, keyword: str = ".tif") -> list:
    """
    Given a folder directory, get a list of files whose names end with a given suffix.
    For this use case, we search for ".tif" files.

    Parameters:
    directory_name: name of the folder directory.
    keyword: suffix that the file names inside the directory must end with.

    Returns:
    A list of paths to files under the specified directory whose names end with the suffix.
    """
    return [f"{directory_name}/{f}" for f in os.listdir(directory_name) if f.endswith(keyword)]


def move_file(source_path: str, label: str, id: int) -> None:
    """
    Copy one file from source to destination (the source file is left in place).
    The destination folder is named from the label, date, and id. The copied file
    retains its original name.

    Parameters:
    source_path: path to the file to be copied.
    label: either "image" or "mask".
    id: suffix to the destination folder name.
    """
    file_name = source_path.split("/")[-1]
    file_date = source_path.split("/")[-4]
    destination_folder = f"../prepared_dataset/{label}_directory_{file_date}_{id}"
    destination_path = os.path.join(destination_folder, file_name)


    if not os.path.isdir(destination_folder):
        os.makedirs(destination_folder)

    if os.path.isfile(destination_path):
        print("File exists.")
        return

    shutil.copy(source_path, destination_path)
    print(f"File copied to destination: {destination_path}.")

def get_path_str(source_path: str, label: str, id: int) -> str:
    """
    Get the name of the destination folder that `move_file` uses for a file.

    Parameters:
    source_path: path to the file to be copied.
    label: either "image" or "mask".
    id: suffix to the destination folder name.

    Returns:
    The destination folder name, in the form "{label}_directory_{date}_{id}".
    """
    file_date = source_path.split("/")[-4]
    path_name = f"{label}_directory_{file_date}_{id}"

    return path_name


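def batch_move_files(source_path_list: list, df: pd.DataFrame) -> pd.DataFrame:
    """
    Copy files from a list of source paths. In this use case, images under the `tiles` folders of `download_file`
    are copied to the `prepared_dataset` folder. Images that are originally inside the same directory are grouped
    using the same id. Band GeoTIFFs are put under `image_directory_{date}_{id}` while mask TIFFs are put under
    `mask_directory_{date}_{id}`. The destination folder names are also recorded in the `data_path` and
    `mask_path` columns of `df`.

    Parameters:
    source_path_list: list of source paths.
    df: dataframe with a `path_name_sentinel2` column; `data_path` and `mask_path` are filled in.

    Returns:
    The updated dataframe.
    """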
    path_dict = {}

    for i in range(len(source_path_list)):
        current_path = source_path_list[i]
        current_folder = current_path.rsplit(
            "/", 1)[0]  # split on the last occurrence

        if current_folder not in path_dict:
            path_dict[current_folder] = len(path_dict)
            df.loc[df['path_name_sentinel2'] == current_folder, 'data_path'] = get_path_str(current_path, "image", path_dict[current_folder])
            df.loc[df['path_name_sentinel2'] == current_folder, 'mask_path'] = get_path_str(current_path, "mask", path_dict[current_folder])

        current_id = path_dict[current_folder]

        if "mask" in current_path:
            move_file(current_path, "mask", current_id)
        else:
            move_file(current_path, "image", current_id)

    return df
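
Before copying thousands of files, it can help to preview where each one would go. The sketch below reproduces the grouping logic of `batch_move_files` without touching the file system. `plan_destinations` is a hypothetical helper, the example paths are an assumed layout, and unlike the function above it checks for "mask" in the file name only rather than in the whole path.

```python
from pathlib import PurePosixPath


def plan_destinations(file_paths: list) -> dict:
    """Map each source file to its planned destination path (dry run, no copying)."""
    folder_ids = {}
    plan = {}
    for path in file_paths:
        p = PurePosixPath(path)
        # Files from the same source folder share one sequential id
        group_id = folder_ids.setdefault(str(p.parent), len(folder_ids))
        label = "mask" if "mask" in p.name else "image"
        # Assumed layout: .../<date>/tiles/<tile_id>/<file>
        date = p.parts[-4]
        plan[path] = f"../prepared_dataset/{label}_directory_{date}_{group_id}/{p.name}"
    return plan


base = "../datasets/sentinel2/S2B_9UYR_20180522_0_L2A/20180522/tiles"
plan = plan_destinations(
    [f"{base}/8_6/B08.tif", f"{base}/8_6/mask.tif", f"{base}/8_7/B08.tif"]
)
for source, destination in plan.items():
    print(source, "->", destination)
```

Printing the plan first makes it easy to spot a path with an unexpected depth before any folder is created.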


In [182]:
# Get all folders from download_file folder
# source_path = os.path.join(os.path.dirname(__file__), "../datasets/sentinel2")
# subfolders_list = fast_scandir(source_path)

# Get subfolders_list from the path_name_sentinel2 column
subfolders_list = df_sentinel2['path_name_sentinel2'].tolist()

# Get folders that are under tiles folder
# note that we need the / to get folders
subfolders_with_keyword_list = get_folders_with_keyword(
    "tiles/", subfolders_list)

all_files = []
for subfolder in subfolders_with_keyword_list:
    current_list = get_list_of_files_in_directory(subfolder)
    all_files.extend(current_list)

# Add two columns to the dataframe
df_sentinel2['data_path'] = ''
df_sentinel2['mask_path'] = ''
df_sentinel2 = batch_move_files(all_files, df_sentinel2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sentinel2['data_path'] = ''
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sentinel2['mask_path'] = ''


File copied to destination: ../prepared_dataset/image_directory_20180522_0/B08.tif.
File copied to destination: ../prepared_dataset/image_directory_20180522_0/B09.tif.
File copied to destination: ../prepared_dataset/image_directory_20180522_0/B8A.tif.
File copied to destination: ../prepared_dataset/mask_directory_20180522_0/mask.tif.
File copied to destination: ../prepared_dataset/image_directory_20180522_0/B02.tif.
File copied to destination: ../prepared_dataset/image_directory_20180522_0/B03.tif.
File copied to destination: ../prepared_dataset/image_directory_20180522_0/B01.tif.
File copied to destination: ../prepared_dataset/image_directory_20180522_0/B04.tif.
File copied to destination: ../prepared_dataset/image_directory_20180522_0/B11.tif.
File copied to destination: ../prepared_dataset/image_directory_20180522_0/B05.tif.
File copied to destination: ../prepared_dataset/image_directory_20180522_0/B07.tif.
File copied to destination: ../prepared_dataset/image_directory_20180522_0/B

In [183]:
df_sentinel2

Unnamed: 0,split,path_name_sentinel2,satellite,imagery_id,date,tile_id,saving_path,coordinates,if_fire,data_path,mask_path
0,train,../datasets/sentinel2/S2B_9UYR_20180522_0_L2A/...,sentinel2,S2B_9UYR_20180522_0_L2A,2018-05-22,8_6,../prepared_dataset/20180522/8_6,POLYGON ((-125.5701791979101 49.94899566498456...,True,image_directory_20180522_0,mask_directory_20180522_0
1,train,../datasets/sentinel2/S2B_9UYS_20180807_1_L2A/...,sentinel2,S2B_9UYS_20180807_1_L2A,2018-08-07,7_13,../prepared_dataset/20180807/7_13,POLYGON ((-124.7367744909437 50.88987134337061...,True,image_directory_20180807_1,mask_directory_20180807_1
2,train,../datasets/sentinel2/S2A_9UYR_20190522_0_L2A/...,sentinel2,S2A_9UYR_20190522_0_L2A,2019-05-22,7_8,../prepared_dataset/20190522/7_8,POLYGON ((-125.3512876632677 50.01140717529444...,True,image_directory_20190522_2,mask_directory_20190522_2
3,train,../datasets/sentinel2/S2B_9UXR_20180618_0_L2A/...,sentinel2,S2B_9UXR_20180618_0_L2A,2018-06-18,7_11,../prepared_dataset/20180618/7_11,POLYGON ((-126.42357531752924 50.0401949435548...,True,image_directory_20180618_3,mask_directory_20180618_3
4,train,../datasets/sentinel2/S2B_9UYQ_20220816_0_L2A/...,sentinel2,S2B_9UYQ_20220816_0_L2A,2022-08-16,2_6,../prepared_dataset/20220816/2_6,POLYGON ((-125.6041482479572 49.46469315711875...,True,image_directory_20220816_4,mask_directory_20220816_4
...,...,...,...,...,...,...,...,...,...,...,...
162,val,../datasets/sentinel2/S2A_10UDU_20180703_0_L2A...,sentinel2,S2A_10UDU_20180703_0_L2A,2018-07-03,4_4,../prepared_dataset/20180703/4_4,POLYGON ((-129.93788566570646 48.4728277358531...,False,image_directory_20180703_198,mask_directory_20180703_198
163,val,../datasets/sentinel2/S2A_10UCV_20180809_0_L2A...,sentinel2,S2A_10UCV_20180809_0_L2A,2018-08-09,10_11,../prepared_dataset/20180809/10_11,POLYGON ((-130.5778654389746 48.95110402298405...,False,image_directory_20180809_199,mask_directory_20180809_199
164,val,../datasets/sentinel2/S2B_9UXR_20190424_0_L2A/...,sentinel2,S2B_9UXR_20190424_0_L2A,2019-04-24,4_5,../prepared_dataset/20190424/4_5,POLYGON ((-127.05829770719795 50.2597936447649...,False,image_directory_20190424_200,mask_directory_20190424_200
165,val,../datasets/sentinel2/S2A_10UCV_20200828_0_L2A...,sentinel2,S2A_10UCV_20200828_0_L2A,2020-08-28,1_2,../prepared_dataset/20200828/1_2,POLYGON ((-131.55302938550602 49.5554760291356...,False,image_directory_20200828_201,mask_directory_20200828_201


In [184]:
# drop some columns of the dataframe
df_sentinel2 = df_sentinel2.drop(columns=["path_name_sentinel2", "satellite", "tile_id", "saving_path"])

In [185]:
df_sentinel2

Unnamed: 0,split,imagery_id,date,coordinates,if_fire,data_path,mask_path
0,train,S2B_9UYR_20180522_0_L2A,2018-05-22,POLYGON ((-125.5701791979101 49.94899566498456...,True,image_directory_20180522_0,mask_directory_20180522_0
1,train,S2B_9UYS_20180807_1_L2A,2018-08-07,POLYGON ((-124.7367744909437 50.88987134337061...,True,image_directory_20180807_1,mask_directory_20180807_1
2,train,S2A_9UYR_20190522_0_L2A,2019-05-22,POLYGON ((-125.3512876632677 50.01140717529444...,True,image_directory_20190522_2,mask_directory_20190522_2
3,train,S2B_9UXR_20180618_0_L2A,2018-06-18,POLYGON ((-126.42357531752924 50.0401949435548...,True,image_directory_20180618_3,mask_directory_20180618_3
4,train,S2B_9UYQ_20220816_0_L2A,2022-08-16,POLYGON ((-125.6041482479572 49.46469315711875...,True,image_directory_20220816_4,mask_directory_20220816_4
...,...,...,...,...,...,...,...
162,val,S2A_10UDU_20180703_0_L2A,2018-07-03,POLYGON ((-129.93788566570646 48.4728277358531...,False,image_directory_20180703_198,mask_directory_20180703_198
163,val,S2A_10UCV_20180809_0_L2A,2018-08-09,POLYGON ((-130.5778654389746 48.95110402298405...,False,image_directory_20180809_199,mask_directory_20180809_199
164,val,S2B_9UXR_20190424_0_L2A,2019-04-24,POLYGON ((-127.05829770719795 50.2597936447649...,False,image_directory_20190424_200,mask_directory_20190424_200
165,val,S2A_10UCV_20200828_0_L2A,2020-08-28,POLYGON ((-131.55302938550602 49.5554760291356...,False,image_directory_20200828_201,mask_directory_20200828_201


In [186]:
df_sentinel2['if_fire'].value_counts()

if_fire
False    167
True      36
Name: count, dtype: int64

## Data process

After marking the data with fire occurrence, we can process the data to calculate the features we need.

In [219]:
def save_all_numerical_values_to_df(split, dataset):
    columns = ['ID', 'split', 'Label', 'NDVI', 'NBR', 'NDWI', 'NDBI', 'RGB']
    rows = []
    # Loop through all the indices in the dataset
    for index in range(len(dataset)):
        numerical_values = dataset.get_numerical_values(index)
        if numerical_values:  # Check if the dictionary is not empty
            ndvi_list = numerical_values["NDVI"].flatten().tolist()
            nbr_list = numerical_values["NBR"].flatten().tolist()
            ndwi_list = numerical_values["NDWI"].flatten().tolist()
            ndbi_list = numerical_values["NDBI"].flatten().tolist()
            rgb_list = numerical_values["RGB"].flatten().tolist()
            location = numerical_values["Location"].replace(
                "image_directory_", "")

            # Collect the values as one row
            rows.append({
                "ID": location,
                "split": split,
                "Label": "",
                "NDVI": ndvi_list,
                "NBR": nbr_list,
                "NDWI": ndwi_list,
                "NDBI": ndbi_list,
                "RGB": rgb_list,
            })

    # Build the dataframe once, rather than appending row by row with the
    # private DataFrame._append method
    new_df = pd.DataFrame(rows, columns=columns)
    return new_df

In [188]:
import dataset

df = df_sentinel2[df_sentinel2['split'] == 'val']
sat_dataset = dataset.SATDataset(split='val',
                                data_path=Path("../prepared_dataset"),
                                df=df)
len(sat_dataset)

40

In [189]:
from importlib import reload

In [231]:
reload(dataset)

<module 'dataset' from '/Users/glenn_hyh/Documents/github/bc-wildfire-prediction/notebooks/dataset.py'>

In [191]:
import dataset

# Create the processed_images_folder directory if it does not already exist
os.makedirs("../processed_images_folder", exist_ok=True)

for split in ['test', 'train', 'val']:
    df = df_sentinel2[df_sentinel2['split'] == split]
    sat_dataset = dataset.SATDataset(split=split,
                                    data_path=Path("../prepared_dataset"),
                                    df=df)
    for index in range(len(sat_dataset)):
        # Get the filepaths for the current index
        filepath = sat_dataset.filepaths[index]

        # Extract the original directory name
        original_dir_name = Path(filepath).name

        # Get the transformed images
        images_dict = sat_dataset.get_images(index)

        # Create a new directory path for the processed images
        processed_dir_path_str = f"../processed_images_folder/{original_dir_name}"
        processed_dir_path = Path(processed_dir_path_str)
        processed_dir_path.mkdir(parents=True, exist_ok=True)

        for img_type, image in images_dict.items():
            # Skip saving the mask if you only want the indices and RGB images
            if img_type == "Mask":
                continue

            if img_type == "NBR":
                # Apply color mapping for NBR
                plt.imshow(image, cmap=plt.cm.RdYlGn, vmin=-1, vmax=1)
                plt.colorbar(orientation='vertical')
                plt.axis('off')

                # Render the current figure and copy its pixels into a numpy array.
                # buffer_rgba() replaces tostring_rgb(), which is deprecated in
                # recent Matplotlib releases; the alpha channel is dropped.
                fig = plt.gcf()
                fig.canvas.draw()
                image_np = np.asarray(fig.canvas.buffer_rgba())[..., :3].copy()

                # Convert numpy array to PIL Image
                pil_image = Image.fromarray(image_np)

                # Define the filename, replacing .tif with .png for NBR
                filename = f"{img_type}.png"

                # Close the figure to free memory
                plt.close(fig)
            else:
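                # Normalize the image data to 0-255 for other image types.
                # Replace NaN/inf with 0 and clip negatives so the min-max
                # scaling below does not produce invalid values.
                image = np.nan_to_num(image, nan=0.0, posinf=0.0, neginf=0.0)
                image = np.clip(image, 0, None)
                value_range = np.max(image) - np.min(image)
                if value_range == 0:
                    # Constant image: nothing to scale, save it as all black
                    image_8bit = np.zeros(image.shape, dtype='uint8')
                else:
                    image_8bit = ((image - np.min(image)) /
                                  value_range * 255).astype('uint8')

                # If the image has more than three channels, keep only the first three
                if image_8bit.ndim > 2 and image_8bit.shape[2] > 3:
                    # Multi-band images (e.g., 4 bands) are reduced to 3 bands before saving as JPEG
                    image_8bit = image_8bit[:, :, :3]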

                # Create the PIL Image from the numpy array
                pil_image = Image.fromarray(image_8bit)

                # Define the filename, replacing .tif with .jpg for other image types
                filename = f"{img_type}.jpg"

            # Define the full path for the file
            filepath = processed_dir_path / filename

            # Save the image
            pil_image.save(filepath)



In [232]:
numerical_df = pd.DataFrame(columns=['ID', 'split', 'Label', 'NDVI', 'NBR', 'NDWI', 'NDBI', 'RGB'])

# Instantiate the dataset
for split in ['test', 'train', 'val']:
    df = df_sentinel2[df_sentinel2['split'] == split]
    sat_dataset = dataset.SATDataset(split=split,
                                    data_path=Path("../prepared_dataset"),
                                    df=df)
    new_df = save_all_numerical_values_to_df(split, sat_dataset)
    numerical_df = pd.concat([numerical_df, new_df])
    # add the calculated_df to the numerical_df
    # numerical_df = pd.concat([numerical_df, calculated_df])


Filepath: image_directory_20220809_21
Location info: image_directory_20220809_21
Filepath: image_directory_20180703_22
Location info: image_directory_20180703_22
Filepath: image_directory_20220809_23
Location info: image_directory_20220809_23
Filepath: image_directory_20220801_24


  


Location info: image_directory_20220801_24
Filepath: image_directory_20220920_25
Location info: image_directory_20220920_25
Filepath: image_directory_20180618_26
Location info: image_directory_20180618_26
Filepath: image_directory_20190529_27
Location info: image_directory_20190529_27
Filepath: image_directory_20220730_28
Location info: image_directory_20220730_28
Filepath: image_directory_20221117_136
Location info: image_directory_20221117_136
Filepath: image_directory_20180901_137
Location info: image_directory_20180901_137
Filepath: image_directory_20181021_138
Location info: image_directory_20181021_138
Filepath: image_directory_20180618_139




Location info: image_directory_20180618_139
Filepath: image_directory_20180315_140
Location info: image_directory_20180315_140
Filepath: image_directory_20220920_141
Location info: image_directory_20220920_141
Filepath: image_directory_20180315_142
Location info: image_directory_20180315_142
Filepath: image_directory_20221015_143
Location info: image_directory_20221015_143
Filepath: image_directory_20190615_144
Location info: image_directory_20190615_144
Filepath: image_directory_20190531_145
Location info: image_directory_20190531_145
Filepath: image_directory_20180731_146
Location info: image_directory_20180731_146
Filepath: image_directory_20180703_147




Location info: image_directory_20180703_147
Filepath: image_directory_20180522_148
Location info: image_directory_20180522_148
Filepath: image_directory_20180616_149
Location info: image_directory_20180616_149
Filepath: image_directory_20200321_150
Location info: image_directory_20200321_150
Filepath: image_directory_20190529_151
Location info: image_directory_20190529_151
Filepath: image_directory_20180711_152




Location info: image_directory_20180711_152
Filepath: image_directory_20190906_153
Location info: image_directory_20190906_153
Filepath: image_directory_20220801_154
Location info: image_directory_20220801_154
Filepath: image_directory_20220809_155
Location info: image_directory_20220809_155
Filepath: image_directory_20220725_156
Location info: image_directory_20220725_156
Filepath: image_directory_20221015_157
Location info: image_directory_20221015_157
Filepath: image_directory_20181013_158
Location info: image_directory_20181013_158
Filepath: image_directory_20220801_159
Location info: image_directory_20220801_159
Filepath: image_directory_20200321_160
Location info: image_directory_20200321_160
Filepath: image_directory_20221008_161




Location info: image_directory_20221008_161
Filepath: image_directory_20190827_162
Location info: image_directory_20190827_162
Filepath: image_directory_20200418_163
Location info: image_directory_20200418_163
Filepath: image_directory_20220714_164
Location info: image_directory_20220714_164
Filepath: image_directory_20221008_165
Location info: image_directory_20221008_165
Filepath: image_directory_20180422_166




Location info: image_directory_20180422_166
Filepath: image_directory_20220722_167
Location info: image_directory_20220722_167
Filepath: image_directory_20181021_168
Location info: image_directory_20181021_168
Filepath: image_directory_20200907_169
Location info: image_directory_20200907_169
Filepath: image_directory_20180522_0
Location info: image_directory_20180522_0
Filepath: image_directory_20180807_1




Location info: image_directory_20180807_1
Filepath: image_directory_20190522_2
Location info: image_directory_20190522_2
Filepath: image_directory_20180618_3
Location info: image_directory_20180618_3
Filepath: image_directory_20220816_4
Location info: image_directory_20220816_4
Filepath: image_directory_20220801_5
Location info: image_directory_20220801_5
Filepath: image_directory_20190929_6
Location info: image_directory_20190929_6
Filepath: image_directory_20180711_7
Location info: image_directory_20180711_7
Filepath: image_directory_20190827_8
Location info: image_directory_20190827_8
Filepath: image_directory_20180514_9
Location info: image_directory_20180514_9
Filepath: image_directory_20221117_10
Location info: image_directory_20221117_10
Filepath: image_directory_20220920_11




Location info: image_directory_20220920_11
Filepath: image_directory_20200730_12
Location info: image_directory_20200730_12
Filepath: image_directory_20220809_13
Location info: image_directory_20220809_13
Filepath: image_directory_20180618_14
Location info: image_directory_20180618_14
Filepath: image_directory_20200416_15
Location info: image_directory_20200416_15
Filepath: image_directory_20180711_16
Location info: image_directory_20180711_16
Filepath: image_directory_20220725_17
Location info: image_directory_20220725_17
Filepath: image_directory_20190807_18




Location info: image_directory_20190807_18
Filepath: image_directory_20200730_19
Location info: image_directory_20200730_19
Filepath: image_directory_20180817_20
Location info: image_directory_20180817_20
Filepath: image_directory_20190804_36
Location info: image_directory_20190804_36
Filepath: image_directory_20180807_37
Location info: image_directory_20180807_37
Filepath: image_directory_20200730_38
Location info: image_directory_20200730_38
Filepath: image_directory_20180731_39




Location info: image_directory_20180731_39
Filepath: image_directory_20180815_40
Location info: image_directory_20180815_40
Filepath: image_directory_20190529_41
Location info: image_directory_20190529_41
Filepath: image_directory_20180817_42
Location info: image_directory_20180817_42
Filepath: image_directory_20180715_43
Location info: image_directory_20180715_43
Filepath: image_directory_20180814_44


  


Location info: image_directory_20180814_44
Filepath: image_directory_20220831_45
Location info: image_directory_20220831_45
Filepath: image_directory_20180422_46
Location info: image_directory_20180422_46
Filepath: image_directory_20180512_47
Location info: image_directory_20180512_47
Filepath: image_directory_20190424_48
Location info: image_directory_20190424_48
Filepath: image_directory_20200727_49


  


Location info: image_directory_20200727_49
Filepath: image_directory_20200413_50
Location info: image_directory_20200413_50
Filepath: image_directory_20190529_51
Location info: image_directory_20190529_51
Filepath: image_directory_20200905_52
Location info: image_directory_20200905_52
Filepath: image_directory_20180618_53
Location info: image_directory_20180618_53
Filepath: image_directory_20181021_54
Location info: image_directory_20181021_54
Filepath: image_directory_20200416_55
Location info: image_directory_20200416_55
Filepath: image_directory_20190721_56
Location info: image_directory_20190721_56
Filepath: image_directory_20180422_57
Location info: image_directory_20180422_57
Filepath: image_directory_20190424_58
Location info: image_directory_20190424_58
Filepath: image_directory_20200321_59




Location info: image_directory_20200321_59
Filepath: image_directory_20190928_60
Location info: image_directory_20190928_60
Filepath: image_directory_20180715_61
Location info: image_directory_20180715_61
Filepath: image_directory_20220920_62
Location info: image_directory_20220920_62
Filepath: image_directory_20180703_63


  


Location info: image_directory_20180703_63
Filepath: image_directory_20220725_64
Location info: image_directory_20220725_64
Filepath: image_directory_20221015_65
Location info: image_directory_20221015_65
Filepath: image_directory_20190522_66
Location info: image_directory_20190522_66
Filepath: image_directory_20181004_67
Location info: image_directory_20181004_67
Filepath: image_directory_20220907_68
Location info: image_directory_20220907_68
Filepath: image_directory_20191009_69
Location info: image_directory_20191009_69
Filepath: image_directory_20181013_70
Location info: image_directory_20181013_70
Filepath: image_directory_20180318_71
Location info: image_directory_20180318_71
Filepath: image_directory_20190827_72
Location info: image_directory_20190827_72
Filepath: image_directory_20190906_73
Location info: image_directory_20190906_73
Filepath: image_directory_20180715_74
Location info: image_directory_20180715_74
Filepath: image_directory_20200905_75
Location info: image_directo



Location info: image_directory_20180517_77
Filepath: image_directory_20221015_78
Location info: image_directory_20221015_78
Filepath: image_directory_20190827_79
Location info: image_directory_20190827_79
Filepath: image_directory_20190805_80
Location info: image_directory_20190805_80
Filepath: image_directory_20190807_81
Location info: image_directory_20190807_81
Filepath: image_directory_20180517_82
Location info: image_directory_20180517_82
Filepath: image_directory_20220920_83
Location info: image_directory_20220920_83
Filepath: image_directory_20200418_84
Location info: image_directory_20200418_84
Filepath: image_directory_20220920_85
Location info: image_directory_20220920_85
Filepath: image_directory_20190906_86
Location info: image_directory_20190906_86
Filepath: image_directory_20200816_87
Location info: image_directory_20200816_87
Filepath: image_directory_20180315_88
Location info: image_directory_20180315_88
Filepath: image_directory_20180716_89
Location info: image_directo



Location info: image_directory_20180807_90
Filepath: image_directory_20180315_91
Location info: image_directory_20180315_91
Filepath: image_directory_20180512_92
Location info: image_directory_20180512_92
Filepath: image_directory_20181021_93
Location info: image_directory_20181021_93
Filepath: image_directory_20221002_94
Location info: image_directory_20221002_94
Filepath: image_directory_20180511_95
Location info: image_directory_20180511_95
Filepath: image_directory_20190529_96
Location info: image_directory_20190529_96
Filepath: image_directory_20190320_97
Location info: image_directory_20190320_97
Filepath: image_directory_20200908_98
Location info: image_directory_20200908_98
Filepath: image_directory_20190529_99
Location info: image_directory_20190529_99
Filepath: image_directory_20180728_100




Location info: image_directory_20180728_100
Filepath: image_directory_20180815_101
Location info: image_directory_20180815_101
Filepath: image_directory_20180716_102
Location info: image_directory_20180716_102
Filepath: image_directory_20180318_103
Location info: image_directory_20180318_103
Filepath: image_directory_20221015_104
Location info: image_directory_20221015_104
Filepath: image_directory_20200816_105




Location info: image_directory_20200816_105
Filepath: image_directory_20190318_106
Location info: image_directory_20190318_106
Filepath: image_directory_20190928_107
Location info: image_directory_20190928_107
Filepath: image_directory_20181004_108
Location info: image_directory_20181004_108
Filepath: image_directory_20200730_109
Location info: image_directory_20200730_109
Filepath: image_directory_20180522_110
Location info: image_directory_20180522_110
Filepath: image_directory_20220831_111




Location info: image_directory_20220831_111
Filepath: image_directory_20180522_112
Location info: image_directory_20180522_112
Filepath: image_directory_20180511_113
Location info: image_directory_20180511_113
Filepath: image_directory_20190827_114
Location info: image_directory_20190827_114
Filepath: image_directory_20190424_115
Location info: image_directory_20190424_115
Filepath: image_directory_20200413_116




Location info: image_directory_20200413_116
Filepath: image_directory_20190929_117
Location info: image_directory_20190929_117
Filepath: image_directory_20220714_118
Location info: image_directory_20220714_118
Filepath: image_directory_20180704_119
Location info: image_directory_20180704_119
Filepath: image_directory_20200730_120
Location info: image_directory_20200730_120
Filepath: image_directory_20190531_121




Location info: image_directory_20190531_121
Filepath: image_directory_20190807_122
Location info: image_directory_20190807_122
Filepath: image_directory_20190827_123
Location info: image_directory_20190827_123
Filepath: image_directory_20180814_124
Location info: image_directory_20180814_124
Filepath: image_directory_20221016_125
Location info: image_directory_20221016_125
Filepath: image_directory_20190906_126




Location info: image_directory_20190906_126
Filepath: image_directory_20180616_127
Location info: image_directory_20180616_127
Filepath: image_directory_20220925_128
Location info: image_directory_20220925_128
Filepath: image_directory_20180703_129
Location info: image_directory_20180703_129
Filepath: image_directory_20180715_130
Location info: image_directory_20180715_130
Filepath: image_directory_20190529_131
Location info: image_directory_20190529_131
Filepath: image_directory_20180822_132
Location info: image_directory_20180822_132
Filepath: image_directory_20220927_133
Location info: image_directory_20220927_133
Filepath: image_directory_20200908_134
Location info: image_directory_20200908_134
Filepath: image_directory_20180715_135
Location info: image_directory_20180715_135
Filepath: image_directory_20220831_29
Location info: image_directory_20220831_29
Filepath: image_directory_20190522_30
Location info: image_directory_20190522_30
Filepath: image_directory_20180731_31
Location 



Location info: image_directory_20180716_32
Filepath: image_directory_20180807_33
Location info: image_directory_20180807_33
Filepath: image_directory_20200416_34
Location info: image_directory_20200416_34
Filepath: image_directory_20180618_35
Location info: image_directory_20180618_35
Filepath: image_directory_20221003_170


  


Location info: image_directory_20221003_170
Filepath: image_directory_20190720_171
Location info: image_directory_20190720_171
Filepath: image_directory_20180731_172
Location info: image_directory_20180731_172
Filepath: image_directory_20220808_173
Location info: image_directory_20220808_173
Filepath: image_directory_20180616_174
Location info: image_directory_20180616_174
Filepath: image_directory_20221015_175
Location info: image_directory_20221015_175
Filepath: image_directory_20190721_176
Location info: image_directory_20190721_176
Filepath: image_directory_20180616_177
Location info: image_directory_20180616_177
Filepath: image_directory_20220808_178
Location info: image_directory_20220808_178
Filepath: image_directory_20190424_179
Location info: image_directory_20190424_179
Filepath: image_directory_20180703_180




Location info: image_directory_20180703_180
Filepath: image_directory_20221015_181
Location info: image_directory_20221015_181
Filepath: image_directory_20190906_182
Location info: image_directory_20190906_182
Filepath: image_directory_20180512_183
Location info: image_directory_20180512_183
Filepath: image_directory_20221002_184


  


Location info: image_directory_20221002_184
Filepath: image_directory_20200321_185
Location info: image_directory_20200321_185
Filepath: image_directory_20180616_186
Location info: image_directory_20180616_186
Filepath: image_directory_20181021_187
Location info: image_directory_20181021_187
Filepath: image_directory_20190929_188
Location info: image_directory_20190929_188
Filepath: image_directory_20221008_189


  


Location info: image_directory_20221008_189
Filepath: image_directory_20190807_190
Location info: image_directory_20190807_190
Filepath: image_directory_20220809_191
Location info: image_directory_20220809_191
Filepath: image_directory_20220808_192
Location info: image_directory_20220808_192
Filepath: image_directory_20220824_193
Location info: image_directory_20220824_193
Filepath: image_directory_20220725_194
Location info: image_directory_20220725_194
Filepath: image_directory_20190928_195
Location info: image_directory_20190928_195
Filepath: image_directory_20221117_196
Location info: image_directory_20221117_196
Filepath: image_directory_20190430_197
Location info: image_directory_20190430_197
Filepath: image_directory_20180703_198
Location info: image_directory_20180703_198
Filepath: image_directory_20180809_199
Location info: image_directory_20180809_199
Filepath: image_directory_20190424_200
Location info: image_directory_20190424_200
Filepath: image_directory_20200828_201
Loca



In [233]:
numerical_df

Unnamed: 0,ID,split,Label,NDVI,NBR,NDWI,NDBI,RGB
0,20220809_21,test,,"[0.40542083574962706, 0.13816436809367577, 0.2...","[-0.018048837374210626, -0.022215183884555796,...","[-0.16766662211622213, -0.041781886145547904, ...","[-0.1275460740849826, -0.11788744998250435, -0...","[0.27549177385735396, 0.4641771758131303, 0.35..."
1,20180703_22,test,,"[0.7396573604930866, 0.7108559240781527, 0.692...","[-0.07242936052469821, -0.05202998152450064, -...","[-0.589373437062575, -0.5949060634243009, -0.5...","[-0.19794920275961803, -0.20065247476994347, -...","[0.10070696919469906, 0.17385974194288453, 0.0..."
2,20220809_23,test,,"[-0.46648036970684925, -0.5741617430043857, -0...","[-0.12319780225163679, 0.09724291559828845, -0...","[0.4741320360487367, 0.591565137478413, 0.2928...","[-0.10035606638638488, -0.25121087843002143, 0...","[2.46936874509603, 2.5183713309702886, 3.00157..."
3,20220801_24,test,,"[-0.12356214478431785, -0.24088930636702904, 0...","[-0.03425114915927098, -0.09652637774532737, -...","[0.18924738879964226, 0.2506837463154132, -0.0...","[0.1157166595520524, 0.11414583646627384, -0.0...","[0.7418210902451574, 0.848803292831117, 0.8327..."
4,20220920_25,test,,"[0.5271381528541318, 0.6601627329626689, 0.529...","[0.015264221634757618, -0.006630177384957283, ...","[-0.19656343075590893, -0.29809935980711, -0.2...","[-0.3562317303804691, -0.2896162542387093, -0....","[0.26063187598154774, 0.5651808683691067, 0.29..."
...,...,...,...,...,...,...,...,...
35,20180703_198,val,,"[0.6010225563573091, 0.5221744990291163, 0.488...","[0.04004692676391484, -0.00922206219296032, -0...","[-0.2101894345826825, -0.16382999662797534, -0...","[-0.4339237911002304, -0.3343184997589737, -0....","[0.19655055155649576, 0.5147459711420965, 0.29..."
36,20180809_199,val,,"[0.5622912894360781, 0.5476280041906767, 0.502...","[-0.0345677463549987, 0.0035426425371962718, -...","[-0.2030226320486741, -0.2051228022036792, -0....","[-0.17661895733731914, -0.20467742516086251, -...","[0.23267592165321566, 0.5501744982109611, 0.36..."
37,20190424_200,val,,"[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ..."
38,20200828_201,val,,"[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ..."
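
Some rows in the preview above appear to hold only `nan` values in their index columns, which usually means the source tile had no valid pixels. A minimal sketch for flagging such rows follows. It assumes each cell is a flat list of floats, as produced by `save_all_numerical_values_to_df`; the helper name `is_all_nan` and the two-row demo frame are hypothetical.

```python
import numpy as np
import pandas as pd


def is_all_nan(values) -> bool:
    """Return True when every entry of a list-valued cell is NaN."""
    arr = np.asarray(values, dtype=float)
    return bool(np.isnan(arr).all())


# Hypothetical stand-in for numerical_df with one valid and one empty sample.
demo_df = pd.DataFrame({
    "ID": ["20220809_21", "20190424_200"],
    "NDVI": [[0.41, 0.14], [float("nan"), float("nan")]],
})

# Flag the samples whose NDVI values are all NaN.
demo_df["ndvi_all_nan"] = demo_df["NDVI"].apply(is_all_nan)
usable_df = demo_df[~demo_df["ndvi_all_nan"]]
```

The same check can be applied to the other list-valued columns before any statistics are computed from them. Note that an empty list also counts as all NaN under this definition.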


In [234]:
# Add an "ID" column to df_sentinel2, derived from data_path by removing
# the "image_directory_" prefix. Skip the insert if the column already
# exists so that the cell can be re-run without raising a ValueError.
if "ID" not in df_sentinel2.columns:
    df_sentinel2.insert(0, "ID", df_sentinel2['data_path'].str.replace("image_directory_", "", regex=False))

In [235]:
# For every row in df_sentinel2, copy its if_fire value into the Label
# column of the numerical_df rows that share the same ID

for _, row in df_sentinel2.iterrows():
    label = row['if_fire']
    image_id = row['ID']  # avoid shadowing the built-in id()
    matches = numerical_df['ID'] == image_id
    if matches.any():
        numerical_df.loc[matches, 'Label'] = label
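
The row-by-row loop works, but the same label transfer can be written as a lookup. The sketch below is an alternative under the assumption that `ID` is unique in the metadata table; the two small frames are hypothetical stand-ins for `df_sentinel2` and `numerical_df`.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
meta_demo = pd.DataFrame({
    "ID": ["20180522_0", "20180703_198"],
    "if_fire": [True, False],
})
features_demo = pd.DataFrame({
    "ID": ["20180703_198", "20180522_0", "unmatched"],
    "Label": ["", "", ""],
})

# Build an ID -> if_fire lookup and map it onto the feature table.
fire_by_id = meta_demo.set_index("ID")["if_fire"]
mapped = features_demo["ID"].map(fire_by_id)

# Keep the existing Label wherever the ID has no match in the metadata.
features_demo["Label"] = mapped.where(mapped.notna(), features_demo["Label"])
```

`Series.map` with a Series argument needs a unique index, so duplicated IDs would have to be resolved first.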


In [236]:
numerical_df.sort_values(by='ID', inplace=True)

In [237]:
# Show the rows where Label is True
numerical_df[numerical_df['Label'] == True]

Unnamed: 0,ID,split,Label,NDVI,NBR,NDWI,NDBI,RGB
9,20180514_9,train,True,"[0.6033185770513726, 0.41760846533043583, 0.35...","[0.012524571158500312, -0.022856919184525214, ...","[-0.3018938654845362, -0.19483902474036877, -0...","[-0.3265207274633461, -0.23490578591091738, -0...","[0.21483211426189855, 0.4656107925803516, 0.25..."
0,20180522_0,train,True,"[0.41767031805476257, 0.23446854539333492, 0.1...","[-0.013169502004029675, -0.029835211333464194,...","[-0.21135593805524552, -0.09005081795502748, -...","[-0.006296666292050169, 0.03407789893674559, 0...","[0.26375933103110666, 0.41804536941072035, 0.3..."
14,20180618_14,train,True,"[0.5087346928419032, 0.5053017252174287, 0.334...","[-0.019091056746911635, 0.030959749428965663, ...","[-0.19479827854141304, -0.1919816061925272, -0...","[-0.2747215589469209, -0.2726033075254354, -0....","[0.2761270643080269, 0.5714995114096294, 0.391..."
5,20180618_26,test,True,"[0.758579340701943, 0.6869592455432909, 0.7333...","[0.15442686131502828, -0.05406714801692072, -0...","[-0.6472365158143287, -0.5724427832405139, -0....","[-0.29394967051653925, -0.11136600649036897, -...","[0.052109864546122514, 0.0812896290664508, 0.0..."
3,20180618_3,train,True,"[0.40702503626903275, 0.5043751478562788, 0.56...","[-0.09362067739440764, 0.09447736443181912, 0....","[-0.08296217027756833, -0.20214230478937625, -...","[-0.25012911132899374, -0.42965709166290234, -...","[0.174518737859858, 0.35065617958743167, 0.318..."
6,20180618_35,val,True,"[0.5312659014601802, 0.4611401612021387, 0.498...","[-0.0130898193211847, 4.899379646278345e-05, -...","[-0.2457690034204794, -0.22009610842133673, -0...","[-0.11546230298813316, -0.10101522174856746, -...","[0.15617379151227306, 0.30888660379618, 0.1559..."
1,20180703_22,test,True,"[0.7396573604930866, 0.7108559240781527, 0.692...","[-0.07242936052469821, -0.05202998152450064, -...","[-0.589373437062575, -0.5949060634243009, -0.5...","[-0.19794920275961803, -0.20065247476994347, -...","[0.10070696919469906, 0.17385974194288453, 0.0..."
16,20180711_16,train,True,"[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ..."
7,20180711_7,train,True,"[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ..."
3,20180716_32,val,True,"[0.4895514038649967, 0.4070313580218455, 0.416...","[-0.04977949259755378, -0.06408025547354929, -...","[-0.15607310168883434, -0.09471733797602057, -...","[-0.14395491790046716, -0.12247556369596832, -...","[0.18742823973438108, 0.3992620575875729, 0.12..."


In [239]:
df_sentinel2.sort_values(by='ID', inplace=True)
df_sentinel2[df_sentinel2['if_fire'] == True]

Unnamed: 0,ID,split,imagery_id,date,coordinates,if_fire,data_path,mask_path
9,20180514_9,train,S2A_9UYR_20180514_0_L2A,2018-05-14,POLYGON ((-125.46837629296328 49.8768393522646...,True,image_directory_20180514_9,mask_directory_20180514_9
0,20180522_0,train,S2B_9UYR_20180522_0_L2A,2018-05-22,POLYGON ((-125.5701791979101 49.94899566498456...,True,image_directory_20180522_0,mask_directory_20180522_0
14,20180618_14,train,S2B_9UXR_20180618_0_L2A,2018-06-18,POLYGON ((-126.32410702760176 49.8997673952883...,True,image_directory_20180618_14,mask_directory_20180618_14
26,20180618_26,test,S2B_9UXR_20180618_0_L2A,2018-06-18,POLYGON ((-126.20931857999868 50.0352373902161...,True,image_directory_20180618_26,mask_directory_20180618_26
3,20180618_3,train,S2B_9UXR_20180618_0_L2A,2018-06-18,POLYGON ((-126.42357531752924 50.0401949435548...,True,image_directory_20180618_3,mask_directory_20180618_3
35,20180618_35,val,S2B_9UXR_20180618_0_L2A,2018-06-18,POLYGON ((-126.31643949143137 50.0377655632022...,True,image_directory_20180618_35,mask_directory_20180618_35
22,20180703_22,test,S2A_9UYR_20180703_0_L2A,2018-07-03,POLYGON ((-125.57507206004699 49.8800440026963...,True,image_directory_20180703_22,mask_directory_20180703_22
16,20180711_16,train,S2B_9UYR_20180711_0_L2A,2018-07-11,POLYGON ((-124.95818368695447 49.5837280906687...,True,image_directory_20180711_16,mask_directory_20180711_16
7,20180711_7,train,S2B_9UYQ_20180711_1_L2A,2018-07-11,POLYGON ((-124.95831731349323 49.5821130753058...,True,image_directory_20180711_7,mask_directory_20180711_7
32,20180716_32,val,S2A_9UXR_20180716_0_L2A,2018-07-16,POLYGON ((-126.22125450281315 49.8282569308730...,True,image_directory_20180716_32,mask_directory_20180716_32


In [240]:
numerical_df['Label'].value_counts()

Label
False    167
True      36
Name: count, dtype: int64

In [241]:
numerical_df.to_csv('../dataset_tables/numerical_df_1.csv', index=False)
df_sentinel2.to_csv('../dataset_tables/df_sentinel2_1.csv', index=False)
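
One thing to keep in mind when reloading `numerical_df_1.csv`: the list-valued columns are written as strings such as `"[0.1, nan]"`, so they need to be parsed back into lists. The sketch below shows one possible way to do that with a `read_csv` converter. The helper name `parse_list_cell` is hypothetical, the demo frame stands in for `numerical_df`, and an in-memory buffer is used instead of the real file.

```python
import io
import json
import math

import pandas as pd


def parse_list_cell(cell: str) -> list:
    """Parse a stringified list of floats such as '[0.1, nan]'."""
    # Python writes nan/inf, while JSON spells them NaN/Infinity.
    return json.loads(cell.replace("nan", "NaN").replace("inf", "Infinity"))


# Hypothetical two-row frame standing in for numerical_df.
demo = pd.DataFrame({
    "ID": ["a", "b"],
    "NDVI": [[0.1, 0.2], [float("nan"), 0.5]],
})

# Round-trip through CSV using an in-memory buffer.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer, converters={"NDVI": parse_list_cell})
```

For the real table, the same converter would be passed for each of the `NDVI`, `NBR`, `NDWI`, `NDBI`, and `RGB` columns.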