# BigEarthNet data exploration and sampling for Sentinel-2

In this exercise, we sample the BigEarthNet Sentinel-2 data. The goal is to document the process used to reduce the large BigEarthNet dataset to a manageable sample.

# Problem: usability of BigEarthNet

BigEarthNet is a very useful and incredibly rich dataset for land cover scene classification using Sentinel-1 and Sentinel-2 data. It is, however, difficult to use because of its large size and the way it is currently distributed, i.e. as a single large tar.gz file.

Using it requires downloading a 65 GB file and then extracting it. Our analysis and experiments found that, once extracted to its original size, it amounts to 120 GB of data with 5


Useful links:

https://www.kaggle.com/code/nilesh789/land-cover-classification-with-eurosat-dataset

https://www.kaggle.com/datasets/kmader/satellite-images-of-hurricane-damage

- resnet50 model hub with bigearthnet:

https://www.kaggle.com/models/google/resnet50?tfhub-redirect=true

https://lgslm.medium.com/land-use-and-land-cover-classification-using-a-resnet-deep-learning-architecture-e353e7131ea4

data

https://github.com/jerpint/bigearthnet/blob/main/notebooks/bigearthnet_demo.ipynb




# Set up environment and load libraries

- load libraries
- install packages and tools
- authenticate to google drive and gcp account

In [None]:
#!pip uninstall tensorflow -y
#!pip install  tensorflow==2.13 #specific version needed for BERT

In [None]:
###### Library used in this script
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import os, glob

#ML imports
import sklearn
from sklearn import metrics
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


#Keras import
from tensorflow import keras
#from keras.models import Model
import tensorflow as tf

from tensorflow.keras.optimizers import Adam
from tensorflow.keras import layers
from tensorflow.keras.layers import Input, Conv2D, Concatenate, Activation, MaxPool2D, UpSampling2D, Conv2DTranspose
from tensorflow.keras import models
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model


#from tensorflow.keras.utils import np_utils
sns.set_style('darkgrid')
pd.set_option('display.max_colwidth', None)

In [None]:
#install gdal to run from the terminal
!sudo add-apt-repository ppa:ubuntugis/ppa -y
!sudo apt-get update
!sudo apt-get install gdal-bin
!gdalinfo --version


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# set up libraries for GIS/geospatial work
try:  # try the following block of code
  import geopandas as gpd
except ImportError:  # if geopandas is missing, install it and retry the import
  !pip install geopandas
  import geopandas as gpd

try:
  import contextily as ctx
  import rasterio
except ImportError:  # install the missing packages, then retry both imports
  !apt install libproj-dev proj-data proj-bin
  !apt install libgeos-dev
  !pip install cython
  !pip install cartopy
  !pip install rasterio
  !pip install contextily
  import contextily as ctx
  import rasterio
import fiona # library for reading/writing GIS files, comes w/ geopandas
from shapely.geometry import Point, LineString, Polygon

!pip install pyproj
# Mapping,vector related imports
import geopandas as gpd
#import descartes #nessary for plotting in geopandas
from cartopy import crs as ccrs
from pyproj import Proj
#from osgeo import osr
from shapely import geometry
from shapely.geometry import Point
from shapely.geometry import box
from shapely.geometry import shape
from shapely.geometry import Polygon

from collections import OrderedDict
#rasterio imports, gdal and imagery utility
import rasterio
from rasterio.windows import Window
from osgeo import gdal
from rasterio import plot
from PIL import Image

#xarray and rio
!pip install rioxarray
!pip install mapclassify
!pip install earthpy

import mapclassify
import folium
import pyproj as proj
import xarray as xr
import rioxarray as rxr
import earthpy as et
import earthpy.plot as ep
from folium.utilities import none_max
import folium
from pyproj import Transformer


In [None]:
#GCP account authentication
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [None]:
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


# Functions
In the next part of the script, we declare the functions used in the script. It is good practice to place functions at the beginning of a script or in an external source file. The following function is declared here:

* **create_dir_and_check_existence**: create an output directory given a path. The output directory will be the working directory throughout the analysis.


In [None]:
def create_dir_and_check_existence(path):
    """Create the directory at `path`, or report that it already exists."""
    try:
        os.makedirs(path)
    except FileExistsError:
        print("directory already exists")

# Parameters and Arguments

It is good practice to set all parameters and input arguments at the beginning of the script. This allows for better control and can make modifications of the scripts for other applications easier. Some arguments relate to path directories, input files and general parameters for use in the analyses (e.g. proportion of hold out).


In [None]:
############################################################################
#####  Parameters and argument set up ###########

#ARG 1
in_dir = '/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/data'
out_dir = '/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/'

#in_filename = 'Tweets.csv'
out_suffix = 'lc_2024-02-22'
test_proportion = 0.2
random_seed= 42
create_out_dir = True

#ARG 7
## Input data
data_dir = '/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/data'
#ARG 8
run_model = True #if True, model is trained, note this may take several hours.

#ARG 9
# Use pre-trained model if run_model is False
model_path = None
#model_path ='/content/gdrive/MyDrive/Colab Notebooks/deep-learning-nlp-intro/intro_transfer_learning_BERT_USE/outputs/output_data_transfer_learning_bert_2024-01-25'
#ARG 10
epoch_val = 100
#ARG 11


In [None]:
################# START SCRIPT ###############################

######### PART 0: Set up the output dir ################

#set up the working directory
#Create output directory

if create_out_dir:
    out_dir_new = "output_data_"+out_suffix
    out_dir = os.path.join(out_dir,"outputs",out_dir_new)
    create_dir_and_check_existence(out_dir)
    os.chdir(out_dir)        #set working directory
else:
    os.chdir(out_dir) #use working dir defined earlier


directory already exists


In [None]:
print(out_dir)

/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/outputs/output_data_lc_2024-02-22


In [None]:
!pwd

/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/outputs/output_data_lc_2024-02-22


# 0. Workflow/pipeline

We describe here the pipeline we set up for this exercise:
1. **Download and unzip data**
- Obtain data from the website or link provided
- Quick exploration: this is not done in the notebook but we document the steps
2. **Use bigearthnet_common to explore data**
- mapping patches to countries and seasons
- combining both mappings into a single data frame
3. **Subset: Generate sampling data**
 - sampling summer patches
 - copying the sampled patch folders
4. **Conclusions**


# 1. **Download and unzip data**
- Obtain data from the website or link provided
- Quick exploration: this is not done in the notebook but we document the steps


In [None]:
#took 87 min and 3 s
#!wget https://bigearth.net/downloads/BigEarthNet-S2-v1.0.tar.gz
#%cd /content
#!tar -xvzf '/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/data/BigEarthNet-S2-v1.0.tar.gz'

#https://askubuntu.com/questions/25347/what-command-do-i-need-to-unzip-extract-a-tar-gz-file
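
Extracting the full archive is the expensive step. As a rough sketch, and assuming the archive stores each patch as `<root>/<patch_name>/<file>` (which matches the `BigEarthNet-v1.0/<patch>` source paths used later in this notebook), the standard-library `tarfile` module can walk through the archive and extract only selected patches. The function name `extract_patches` and its arguments are hypothetical and are not part of the original workflow.

```python
import os
import tarfile


def extract_patches(tar_path, patch_names, out_dir):
    """Extract only the folders of the requested patches from a .tar.gz archive.

    Assumes archive members are stored as '<root>/<patch_name>/<file>'.
    Returns the sorted names of the patches for which files were extracted.
    """
    wanted = set(patch_names)
    found = set()
    os.makedirs(out_dir, exist_ok=True)
    # 'r:gz' opens the gzip-compressed archive; members are read one by one
    with tarfile.open(tar_path, mode="r:gz") as tar:
        for member in tar:
            parts = member.name.split("/")
            # parts[1] is the patch folder name under the archive root
            if member.isfile() and len(parts) >= 3 and parts[1] in wanted:
                tar.extract(member, path=out_dir)
                found.add(parts[1])
    return sorted(found)
```

The archive still has to be read end to end, so this saves disk space rather than download time.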

- https://docs.kai-tub.tech/bigearthnet_common/intro.html
- https://docs.kai-tub.tech/bigearthnet_common/10_sets.html
- https://github.com/jerpint/bigearthnet/tree/main?tab=readme-ov-file

https://bigearth.net/

https://docs.kai-tub.tech/ben-docs/libraries.html


# 2. **Use bigearthnet_common to explore data**
- mapping patches to countries and seasons
- combining both mappings into a single data frame


In [None]:
#!pip install "jax>=0.4.9"
!pip install bigearthnet_common

Collecting bigearthnet_common
  Downloading bigearthnet_common-2.8.0-py3-none-any.whl (6.8 MB)
Collecting colorama<0.5.0,>=0.4.3 (from typer[all]>=0.6->bigearthnet_common)
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting shellingham<2.0.0,>=1.3.0 (from typer[all]>=0.6->bigearthnet_common)
  Downloading shellingham-1.5.4-py2.py3-none-any.whl (9.8 kB)
Installing collected packages: shellingham, colorama, bigearthnet_common
Successfully installed bigearthnet_common-2.8.0 colorama-0.4.6 shellingham-1.5.4


In [None]:
#import bigearthnet_common as ben
#ben.get_patches_to_country_mapping()
#https://github.com/kai-tub/bigearthnet_common

In [None]:
# Quote the requirement so the shell does not treat '<' and '>' as redirections
!pip install "typing-extensions<4.6.0,>=3.6.6"


In [None]:
from bigearthnet_common.base import s1_to_s2_patch_name
from bigearthnet_common.constants import BEN_S1_RE, BEN_S2_RE
from bigearthnet_common.example_data import (
    get_s1_example_folder_path,
    get_s1_example_patch_path,
    get_s2_example_folder_path,
    get_s2_example_patch_path,
)


In [None]:
'''
https://docs.kai-tub.tech/bigearthnet_common/intro.html
ben_build_csv_sets <FILE_PATH> S2 --seasons Winter --seasons Summer --countries Serbia --remove-unrecommended-dl-patches
'''

In [None]:
import bigearthnet_common
val_patches = bigearthnet_common.base.get_patches_to_country_mapping(use_s2_patch_names=True)
print(type(val_patches))
len(val_patches)


<class 'dict'>


590326

In [None]:
list(val_patches.items())[:5] #first 5 items for dictionary

[('S2B_MSIL2A_20170906T101019_33_85', 'Finland'),
 ('S2A_MSIL2A_20170803T094031_78_45', 'Serbia'),
 ('S2A_MSIL2A_20170717T113321_67_66', 'Ireland'),
 ('S2B_MSIL2A_20171219T095409_2_66', 'Austria'),
 ('S2B_MSIL2A_20180522T093029_31_41', 'Finland')]
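
The patch names printed above appear to encode the satellite, the acquisition date and two numeric indices. `bigearthnet_common` ships its own pattern for this (`BEN_S2_RE`, imported earlier). The sketch below uses only the standard library to make the structure explicit. The pattern is inferred from the names shown in this notebook and is an assumption rather than an official specification; `parse_patch_name` and the keys `index_1` and `index_2` are hypothetical names.

```python
import re
from datetime import date

# Pattern inferred from the printed patch names (assumption):
# <satellite>_MSIL2A_<YYYYMMDD>T<time>_<index>_<index>
PATCH_RE = re.compile(
    r"^(?P<satellite>S2[AB])_MSIL2A_(?P<day>\d{8})T(?P<time>\d+)_(?P<i1>\d+)_(?P<i2>\d+)$"
)


def parse_patch_name(patch):
    """Split a Sentinel-2 patch name into satellite, acquisition date and indices."""
    match = PATCH_RE.match(patch)
    if match is None:
        raise ValueError(f"Unexpected patch name: {patch!r}")
    day = match.group("day")
    return {
        "satellite": match.group("satellite"),
        "date": date(int(day[:4]), int(day[4:6]), int(day[6:8])),
        "index_1": int(match.group("i1")),
        "index_2": int(match.group("i2")),
    }


info = parse_patch_name("S2B_MSIL2A_20170906T101019_33_85")
print(info["satellite"], info["date"].isoformat(), info["index_1"], info["index_2"])
```

The time part is matched with `\d+` rather than a fixed width because some names in this notebook carry a five-digit time (for example `T95030`).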

In [None]:
df_country = pd.DataFrame.from_dict(val_patches, orient='index',columns=['country']).reset_index()
df_country.columns = ['patch','country']
print((df_country.shape))
print(type(df_country))
df_country.head()

(590326, 2)
<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,patch,country
0,S2B_MSIL2A_20170906T101019_33_85,Finland
1,S2A_MSIL2A_20170803T094031_78_45,Serbia
2,S2A_MSIL2A_20170717T113321_67_66,Ireland
3,S2B_MSIL2A_20171219T095409_2_66,Austria
4,S2B_MSIL2A_20180522T093029_31_41,Finland


In [None]:
val_patches_season = bigearthnet_common.base.get_patches_to_season_mapping(use_s2_patch_names=True)
print(type(val_patches_season))
len(val_patches_season)

<class 'dict'>


590326

In [None]:
list(val_patches_season.items())[:5] #first 5 items for dictionary

[('S2B_MSIL2A_20170906T101019_33_85', 'Fall'),
 ('S2A_MSIL2A_20170803T094031_78_45', 'Summer'),
 ('S2A_MSIL2A_20170717T113321_67_66', 'Summer'),
 ('S2B_MSIL2A_20171219T095409_2_66', 'Winter'),
 ('S2B_MSIL2A_20180522T093029_31_41', 'Spring')]

In [None]:
#Signature for reference: pd.DataFrame.from_dict(data, orient='columns', dtype=None, columns=None)
df_season = pd.DataFrame.from_dict(val_patches_season, orient='index',columns=['season']).reset_index()
df_season.columns = ['patch','season']
print((df_season.shape))
print(type(df_season))
df_season.head()

(590326, 2)
<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,patch,season
0,S2B_MSIL2A_20170906T101019_33_85,Fall
1,S2A_MSIL2A_20170803T094031_78_45,Summer
2,S2A_MSIL2A_20170717T113321_67_66,Summer
3,S2B_MSIL2A_20171219T095409_2_66,Winter
4,S2B_MSIL2A_20180522T093029_31_41,Spring


In [None]:
df_bigearthnet = df_country.merge(df_season, on='patch')
print((df_bigearthnet.shape))
print(type(df_bigearthnet))
df_bigearthnet.head()

(590326, 3)
<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,patch,country,season
0,S2B_MSIL2A_20170906T101019_33_85,Finland,Fall
1,S2A_MSIL2A_20170803T094031_78_45,Serbia,Summer
2,S2A_MSIL2A_20170717T113321_67_66,Ireland,Summer
3,S2B_MSIL2A_20171219T095409_2_66,Austria,Winter
4,S2B_MSIL2A_20180522T093029_31_41,Finland,Spring


# 3. **Subset: Generate sampling data**
 - sampling summer patches
 - copying the sampled patch folders


In [None]:
df_bigearthnet_sample = (df_bigearthnet.loc[df_bigearthnet['season'].isin(['Summer'])]
               .sample(frac=0.002).reset_index(drop=True))
print(df_bigearthnet_sample.shape)
df_bigearthnet_sample.head()

(258, 3)


Unnamed: 0,patch,country,season
0,S2A_MSIL2A_20170803T094031_75_14,Serbia,Summer
1,S2A_MSIL2A_20170827T092031_19_79,Serbia,Summer
2,S2B_MSIL2A_20170801T095029_2_63,Austria,Summer
3,S2A_MSIL2A_20170704T112111_61_31,Portugal,Summer
4,S2A_MSIL2A_20170613T101032_74_30,Finland,Summer


In [None]:
df_bigearthnet_sample['country'].value_counts()

Finland        80
Portugal       45
Serbia         38
Ireland        38
Lithuania      23
Austria        17
Switzerland    12
Belgium         5
Name: country, dtype: int64
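
The counts above show that a simple random sample is dominated by a few countries: Finland contributes 80 patches and Belgium only 5. The sample is also drawn without a fixed seed, so it changes on every run. If a more balanced and reproducible sample is wanted, one option (not used in this notebook) is to draw a capped number of patches per country with a fixed seed. The sketch below runs on a toy data frame with made-up patch names; `sample_per_group` is a hypothetical helper.

```python
import pandas as pd


def sample_per_group(df, group_col, n_per_group, seed=42):
    """Draw up to n_per_group rows from each group, reproducibly."""
    parts = []
    for _, group in df.groupby(group_col):
        # cap n at the group size so small groups are kept whole
        n = min(n_per_group, len(group))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)


# Toy stand-in for df_bigearthnet (hypothetical patch names)
toy = pd.DataFrame({
    "patch": [f"patch_{i}" for i in range(10)],
    "country": ["Finland"] * 6 + ["Serbia"] * 3 + ["Belgium"] * 1,
})
balanced = sample_per_group(toy, "country", n_per_group=2, seed=42)
print(balanced["country"].value_counts().to_dict())
```

In this notebook the same seed could come from the `random_seed` parameter defined in the parameters section.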

In [None]:
source_folders = df_bigearthnet_sample['patch'].tolist()
print(source_folders[:5])

['S2A_MSIL2A_20170803T094031_75_14', 'S2A_MSIL2A_20170827T092031_19_79', 'S2B_MSIL2A_20170801T095029_2_63', 'S2A_MSIL2A_20170704T112111_61_31', 'S2A_MSIL2A_20170613T101032_74_30']


In [None]:
df_bigearthnet_sample['destination_folder'] = 'sampling_1/'+ df_bigearthnet_sample['patch']
df_bigearthnet_sample['source_folder'] = 'BigEarthNet-v1.0/'+ df_bigearthnet_sample['patch']
print(df_bigearthnet_sample.shape)
df_bigearthnet_sample.head()

(258, 5)


Unnamed: 0,patch,country,season,destination_folder,source_folder
0,S2A_MSIL2A_20170803T094031_75_14,Serbia,Summer,sampling_1/S2A_MSIL2A_20170803T094031_75_14,BigEarthNet-v1.0/S2A_MSIL2A_20170803T094031_75_14
1,S2A_MSIL2A_20170827T092031_19_79,Serbia,Summer,sampling_1/S2A_MSIL2A_20170827T092031_19_79,BigEarthNet-v1.0/S2A_MSIL2A_20170827T092031_19_79
2,S2B_MSIL2A_20170801T095029_2_63,Austria,Summer,sampling_1/S2B_MSIL2A_20170801T095029_2_63,BigEarthNet-v1.0/S2B_MSIL2A_20170801T095029_2_63
3,S2A_MSIL2A_20170704T112111_61_31,Portugal,Summer,sampling_1/S2A_MSIL2A_20170704T112111_61_31,BigEarthNet-v1.0/S2A_MSIL2A_20170704T112111_61_31
4,S2A_MSIL2A_20170613T101032_74_30,Finland,Summer,sampling_1/S2A_MSIL2A_20170613T101032_74_30,BigEarthNet-v1.0/S2A_MSIL2A_20170613T101032_74_30


In [None]:
df_bigearthnet_sample.to_csv('big_earthnet_sampling_1.csv',index=False)

In [None]:
import os
import shutil

def copy_folders(source_folders, destination_folder):
    """
    Copy a list of folders to a destination folder.

    Args:
    - source_folders: List of source folder paths.
    - destination_folder: Destination folder path.

    Returns:
    - List of destination paths that were actually copied.
    """
    copied = []
    for folder in source_folders:
        if os.path.isdir(folder):
            folder_name = os.path.basename(os.path.normpath(folder))
            destination_path = os.path.join(destination_folder, folder_name)
            # dirs_exist_ok (Python 3.8+) lets the copy be re-run without failing
            shutil.copytree(folder, destination_path, dirs_exist_ok=True)
            print(f"Folder '{folder_name}' copied to '{destination_folder}'")
            copied.append(destination_path)
        else:
            print(f"'{folder}' is not a valid folder path")
    return copied


# Example usage with placeholder paths that do not exist:
source_folders = ["source_folder1", "source_folder2"]
destination_folder = "destination_folder"
copy_folders(source_folders, destination_folder)


'source_folder1' is not a valid folder path
'source_folder2' is not a valid folder path


[]

In [None]:
df_bigearthnet_sample['destination_folder'].tolist()

['sampling_1/S2A_MSIL2A_20170803T094031_75_14',
 'sampling_1/S2A_MSIL2A_20170827T092031_19_79',
 'sampling_1/S2B_MSIL2A_20170801T095029_2_63',
 'sampling_1/S2A_MSIL2A_20170704T112111_61_31',
 'sampling_1/S2A_MSIL2A_20170613T101032_74_30',
 'sampling_1/S2A_MSIL2A_20170818T103021_22_54',
 'sampling_1/S2B_MSIL2A_20170817T101019_30_57',
 'sampling_1/S2A_MSIL2A_20170617T113321_79_89',
 'sampling_1/S2A_MSIL2A_20170803T094031_90_0',
 'sampling_1/S2B_MSIL2A_20170829T105019_66_11',
 'sampling_1/S2A_MSIL2A_20170613T101032_77_67',
 'sampling_1/S2B_MSIL2A_20170818T112109_25_3',
 'sampling_1/S2A_MSIL2A_20170717T113321_60_54',
 'sampling_1/S2A_MSIL2A_20170816T095031_76_36',
 'sampling_1/S2B_MSIL2A_20170718T115359_23_56',
 'sampling_1/S2A_MSIL2A_20170816T095031_71_42',
 'sampling_1/S2A_MSIL2A_20170717T113321_79_55',
 'sampling_1/S2B_MSIL2A_20170831T95030_17_84',
 'sampling_1/S2A_MSIL2A_20170813T112121_69_34',
 'sampling_1/S2B_MSIL2A_20170817T101019_33_85',
 'sampling_1/S2B_MSIL2A_20170709T094029_17_8

In [None]:
copy_folders(df_bigearthnet_sample['source_folder'].tolist(),'sampling_1')



In [None]:
#https://github.com/kai-tub/ben-docs/blob/main/docs/raw-data.ipynb
#https://github.com/avulaankith/BigEarthNet


In [None]:
list_dirs_data = os.listdir(os.path.join(in_dir,'sampling_1'))
print(len(list_dirs_data))
list_dirs_data[:5]

258


['S2B_MSIL2A_20170808T094029_66_6',
 'S2B_MSIL2A_20170814T100029_32_37',
 'S2B_MSIL2A_20170802T092029_14_66',
 'S2A_MSIL2A_20170818T103021_7_57',
 'S2A_MSIL2A_20170701T093031_55_90']

In [None]:
os.listdir(os.path.join(in_dir, 'sampling_1','S2B_MSIL2A_20170808T094029_66_6'))

['S2B_MSIL2A_20170808T094029_66_6_B09.tif',
 'S2B_MSIL2A_20170808T094029_66_6_B12.tif',
 'S2B_MSIL2A_20170808T094029_66_6_B8A.tif',
 'S2B_MSIL2A_20170808T094029_66_6_B07.tif',
 'S2B_MSIL2A_20170808T094029_66_6_B06.tif',
 'S2B_MSIL2A_20170808T094029_66_6_B01.tif',
 'S2B_MSIL2A_20170808T094029_66_6_B08.tif',
 'S2B_MSIL2A_20170808T094029_66_6_B03.tif',
 'S2B_MSIL2A_20170808T094029_66_6_B02.tif',
 'S2B_MSIL2A_20170808T094029_66_6_B05.tif',
 'S2B_MSIL2A_20170808T094029_66_6_B11.tif',
 'S2B_MSIL2A_20170808T094029_66_6_labels_metadata.json',
 'S2B_MSIL2A_20170808T094029_66_6_B04.tif']
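
The listing above shows 12 band GeoTIFFs plus one `labels_metadata.json` file for the patch. A small check like the one below can confirm that every copied patch folder is complete before further processing. It is a sketch that assumes every patch follows the same naming as this one; the band list is taken from the listing above, and `missing_files` is a hypothetical helper.

```python
import os
import tempfile

# Band suffixes taken from the folder listing above
EXPECTED_BANDS = ["B01", "B02", "B03", "B04", "B05", "B06",
                  "B07", "B08", "B8A", "B09", "B11", "B12"]


def missing_files(patch_dir):
    """Return the file names expected in a patch folder but not found on disk."""
    patch = os.path.basename(os.path.normpath(patch_dir))
    expected = [f"{patch}_{band}.tif" for band in EXPECTED_BANDS]
    expected.append(f"{patch}_labels_metadata.json")
    present = set(os.listdir(patch_dir))
    return [name for name in expected if name not in present]


# Demonstration on a temporary folder with B12 and the metadata file left out
with tempfile.TemporaryDirectory() as tmp:
    patch_name = "S2B_MSIL2A_20170808T094029_66_6"
    patch_dir = os.path.join(tmp, patch_name)
    os.makedirs(patch_dir)
    for band in EXPECTED_BANDS[:-1]:
        open(os.path.join(patch_dir, f"{patch_name}_{band}.tif"), "w").close()
    demo_missing = missing_files(patch_dir)
print(demo_missing)
```

On the real data, the function could be applied to each entry of `list_dirs_data` joined with `in_dir` and `'sampling_1'`.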

Next, we build a list of the band files and stack them into a GDAL virtual raster (VRT).

In [None]:
#https://github.com/bparment1/Earth_Observation_Remote_Sensing/blob/main/n5_workflow_sentinel2_processing_VT_flooding.ipynb

In [None]:
#os.listdir(os.path.join(in_dir, 'sampling_1','S2B_MSIL2A_20170808T094029_66_6')) #lists every file; glob is needed to keep only the .tif files

fileglob = "*.tif"
in_path = os.path.join(in_dir, 'sampling_1','S2B_MSIL2A_20170808T094029_66_6')
path_raster_sat = os.path.join(in_path,fileglob) #pattern matching the band rasters

files_raster_sat = glob.glob(path_raster_sat,recursive=True)
files_sat_df = pd.DataFrame({'files_sat':files_raster_sat})
files_sat_df #TODO: should only extract the RGB bands (B02, B03, B04)

Unnamed: 0,files_sat
0,/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/data/sampling_1/S2B_MSIL2A_20170808T094029_66_6/S2B_MSIL2A_20170808T094029_66_6_B09.tif
1,/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/data/sampling_1/S2B_MSIL2A_20170808T094029_66_6/S2B_MSIL2A_20170808T094029_66_6_B12.tif
2,/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/data/sampling_1/S2B_MSIL2A_20170808T094029_66_6/S2B_MSIL2A_20170808T094029_66_6_B8A.tif
3,/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/data/sampling_1/S2B_MSIL2A_20170808T094029_66_6/S2B_MSIL2A_20170808T094029_66_6_B07.tif
4,/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/data/sampling_1/S2B_MSIL2A_20170808T094029_66_6/S2B_MSIL2A_20170808T094029_66_6_B06.tif
5,/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/data/sampling_1/S2B_MSIL2A_20170808T094029_66_6/S2B_MSIL2A_20170808T094029_66_6_B01.tif
6,/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/data/sampling_1/S2B_MSIL2A_20170808T094029_66_6/S2B_MSIL2A_20170808T094029_66_6_B08.tif
7,/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/data/sampling_1/S2B_MSIL2A_20170808T094029_66_6/S2B_MSIL2A_20170808T094029_66_6_B03.tif
8,/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/data/sampling_1/S2B_MSIL2A_20170808T094029_66_6/S2B_MSIL2A_20170808T094029_66_6_B02.tif
9,/content/gdrive/MyDrive/Colab Notebooks/air-pollution-remote-sensing-deep-learning/data/sampling_1/S2B_MSIL2A_20170808T094029_66_6/S2B_MSIL2A_20170808T094029_66_6_B05.tif


In [None]:
raster_file_list_m = files_sat_df['files_sat'].tolist()

In [None]:
def generate_files_data_df(in_path):
    """
    Generate data frame for the input raster files. This function assumes a specific directory tree structure.
    We use the structure provided by Volodymyr Mnih as part of his PhD thesis (https://www.cs.toronto.edu/~vmnih/data/).
    This data is also used in numerous publications. The data frame will contain, for each scene id, a corresponding
    sat (RGB aerial imagery) and map (classified building image) file as well as a label named 'type' corresponding to
    train, test, and validation data.

    Input Arguments:

    in_path: parent directory path to the data files

    Return Outputs:

    files_df: pandas data frame containing directory path to training, testing and validation images.

    """

    fileglob = "*.tiff"
    path_raster_sat = os.path.join(in_path+'/**/sat/',fileglob) #raw raster
    fileglob = "*.tif"
    path_raster_map = os.path.join(in_path+'/**/map/',fileglob) #classified

    files_raster_sat = glob.glob(path_raster_sat,recursive=True)
    files_raster_map = glob.glob(path_raster_map,recursive=True)

    files_map_df = pd.DataFrame({'files_map':files_raster_map})
    files_sat_df = pd.DataFrame({'files_sat':files_raster_sat})

    files_sat_df['scene_id_sat'] = files_sat_df['files_sat'].apply(lambda x: os.path.basename(x).replace('.tiff',""))
    files_map_df['scene_id_map'] = files_map_df['files_map'].apply(lambda x: os.path.basename(x).replace('.tif',""))

    files_df = (files_sat_df.merge(files_map_df,
                                  left_on='scene_id_sat',
                                  right_on='scene_id_map',
                                  how='inner')
                             .drop(columns=['scene_id_map'])
                             .rename(columns={'scene_id_sat':'scene_id'})
               )
    files_df['type']= files_df['files_sat'].apply(lambda x: os.path.basename(os.path.dirname(os.path.dirname(x)))) #make it more elegant later

    return files_df

In [None]:
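#raster_file_list_m = [f'/vsicurl/{u}' for u in raster_file_list]
# Keep only the RGB bands, ordered red (B04), green (B03), blue (B02)
bands_pattern_val_list = ['B04','B03','B02']
raster_file_list_rgb = [f for band in bands_pattern_val_list
                        for f in raster_file_list_m if f.endswith(f'_{band}.tif')]

ds = gdal.BuildVRT('RGB_after.vrt',
                   raster_file_list_rgb,
                   separate=True, #keep files in separate bands, useful for files covering the same area
                   VRTNodata=0,
                   srcNodata=0)
#print(ds.RasterCount)
ds = None #close the dataset so the VRT is written to disk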

In [None]:
dst = rasterio.open('RGB_after.vrt')
plot.show(dst)

In [None]:
out_filename,cmd_str=create_RGB(in_filename='RGB_after.vrt',
           out_filename='RGB_after_stretched.tif',
           scale_list=None,
           out_dtype='Byte')
print(cmd_str)
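
`create_RGB` is not defined anywhere in this notebook, so the call above cannot run as shown. Its arguments suggest that it rescales the VRT into an 8-bit (`Byte`) image for display. As a stand-in for the rescaling step only, the sketch below applies a percentile stretch with NumPy to a toy array. `stretch_to_byte` is a hypothetical helper, not the author's `create_RGB`, and the 2nd and 98th percentile cut-offs are an assumption.

```python
import numpy as np


def stretch_to_byte(band, lower_pct=2, upper_pct=98):
    """Linearly rescale a 2-D array to 0-255 between two percentiles."""
    band = band.astype("float64")
    lo, hi = np.percentile(band, [lower_pct, upper_pct])
    if hi <= lo:
        # constant image: avoid division by zero
        return np.zeros(band.shape, dtype=np.uint8)
    scaled = (band - lo) / (hi - lo)
    # values outside the percentile range are clipped to 0 or 255
    return (np.clip(scaled, 0.0, 1.0) * 255).astype(np.uint8)


# Toy integer values standing in for one Sentinel-2 band
rng = np.random.default_rng(0)
toy_band = rng.integers(0, 4000, size=(120, 120))
byte_band = stretch_to_byte(toy_band)
print(byte_band.dtype, byte_band.min(), byte_band.max())
```

Applying the function to each of the three RGB bands read with `rasterio` and stacking the results would give an array suitable for `plot.show`.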

# 4. **Conclusions**

In [None]:
############################# END OF SCRIPT ###################################