# GeoGrapher tutorial - the basics

This tutorial shows how to use GeoGrapher to create a remote sensing dataset
from vector data. If you are reading the HTML version in the documentation
and would prefer the actual `.ipynb` file, you can find it
[here](https://github.com/dida-do/GeoGrapher/tree/main/notebooks/tutorial_nb_basics.ipynb).
As vector data, we will use bounding boxes for sports stadiums.
You can download the file `stadiums.geojson` containing the vector data from [here](https://github.com/dida-do/GeoGrapher/tree/main/notebooks/stadiums.geojson).

Contents:

1. Creating an empty dataset
2. Adding vector data
3. Downloading rasters for the vector data
4. Loading and saving a connector

## 1. Creating an empty dataset

First, we import `geographer` along with a few other packages we will need.

In [1]:
import geographer as gg
import geopandas as gpd
from pathlib import Path

The GeoGrapher library is built around the `Connector` class. A connector organizes a dataset of raster and vector data. It keeps track of the containment and intersection relations between raster and vector data in a bipartite graph. See our blogpost for a detailed explanation of why we want to keep track of this information.

To create an empty dataset, we use the `from_scratch` factory method:

In [2]:
from geographer import Connector

DATA_DIR = Path("gg_example_dataset")

connector = Connector.from_scratch(
    data_dir=DATA_DIR,
    task_vector_classes=["football", "baseball"],
)

This creates a connector with a dataset in `DATA_DIR`.

The `task_vector_classes` argument defines the classes that objects can belong to for multi-class segmentation. It is used when creating labels (see [this tutorial notebook](https://github.com/dida-do/GeoGrapher/tree/main/notebooks/tutorial_nb_cut_label_cluster.ipynb)). It is optional and not important at this point.

The most important attributes of a connector are its `rasters` and `vectors` attributes. These are geopandas GeoDataFrames. The `vectors` GeoDataFrame contains the vector geometries of the stadiums as well as tabular information about the stadiums (name, country, etc.). It also contains a `"raster_count"` column, which we will explain later. The `rasters` GeoDataFrame contains as geometries the bounding boxes of the rasters in our dataset as well as tabular information about the rasters (e.g. raster name, date, etc.).

In [4]:
connector.vectors

| vector_name | geometry | raster_count |
|---|---|---|


In [5]:
connector.rasters

| raster_name | geometry |
|---|---|


As you can see, both GeoDataFrames are empty.

## 2. Adding vector data
Let's try adding our stadiums to the `vectors`. 

First, we read a GeoDataFrame containing the vector data from disk. You can download the example geojson file [here](https://github.com/dida-do/GeoGrapher/tree/main/notebooks/stadiums.geojson).

In [6]:
stadiums = gpd.read_file("stadiums.geojson")
stadiums

| | vector_name | location | type | geometry |
|---|---|---|---|---|
| 0 | Munich Olympiastadion | Munich, Germany | football | POLYGON Z ((11.54677 48.17472 0.00000, 11.5446... |
| 1 | Munich Track and Field Stadium1 | Munich, Germany | football | POLYGON Z ((11.54382 48.17279 0.00000, 11.5438... |
| 2 | Munich Olympia Track and Field2 | Munich, Germany | football | POLYGON Z ((11.54686 48.17892 0.00000, 11.5468... |
| 3 | Munich Staedtisches Stadion Dantestr | Munich, Germany | football | POLYGON Z ((11.52913 48.16874 0.00000, 11.5291... |
| 4 | Vasil Levski National Stadium | Sofia, Bulgaria | football | POLYGON Z ((23.33410 42.68813 0.00000, 23.3340... |
| 5 | Bulgarian Army Stadium | Sofia, Bulgaria | football | POLYGON Z ((23.34065 42.68492 0.00000, 23.3406... |
| 6 | Arena Sofia | Sofia, Bulgaria | football | POLYGON Z ((23.34018 42.68318 0.00000, 23.3401... |
| 7 | Jingu Baseball Stadium | Tokyo, Japan | baseball | POLYGON Z ((139.71597 35.67490 0.00000, 139.71... |
| 8 | Japan National Stadium | Tokyo, Japan | football | POLYGON Z ((139.71482 35.67644 0.00000, 139.71... |


It will be convenient to set the index to the `vector_name` column, so that it matches the index of the connector's `vectors` GeoDataFrame, which is also named `vector_name`:

In [7]:
stadiums = stadiums.set_index("vector_name")
stadiums

| vector_name | location | geometry |
|---|---|---|
| Munich Olympiastadion | Munich, Germany | POLYGON Z ((11.54677 48.17472 0.00000, 11.5446... |
| Munich Track and Field Stadium1 | Munich, Germany | POLYGON Z ((11.54382 48.17279 0.00000, 11.5438... |
| Munich Olympia Track and Field2 | Munich, Germany | POLYGON Z ((11.54686 48.17892 0.00000, 11.5468... |
| Munich Staedtisches Stadion Dantestr | Munich, Germany | POLYGON Z ((11.52913 48.16874 0.00000, 11.5291... |
| Vasil Levski National Stadium | Sofia, Bulgaria | POLYGON Z ((23.33410 42.68813 0.00000, 23.3340... |
| Bulgarian Army Stadium | Sofia, Bulgaria | POLYGON Z ((23.34065 42.68492 0.00000, 23.3406... |
| Arena Sofia | Sofia, Bulgaria | POLYGON Z ((23.34018 42.68318 0.00000, 23.3401... |
| Jingu Baseball Stadium | Tokyo, Japan | POLYGON Z ((139.71597 35.67490 0.00000, 139.71... |
| Japan National Stadium | Tokyo, Japan | POLYGON Z ((139.71482 35.67644 0.00000, 139.71... |


Now, we can integrate the vector features into the dataset, i.e. into the connector:

In [12]:
connector.add_to_vectors(stadiums)

The stadiums have now been added to the connector's `vectors` GeoDataFrame:

In [13]:
connector.vectors

| vector_name | geometry | raster_count | location |
|---|---|---|---|
| Munich Olympiastadion | POLYGON Z ((11.54677 48.17472 0.00000, 11.5446... | 0 | Munich, Germany |
| Munich Track and Field Stadium1 | POLYGON Z ((11.54382 48.17279 0.00000, 11.5438... | 0 | Munich, Germany |
| Munich Olympia Track and Field2 | POLYGON Z ((11.54686 48.17892 0.00000, 11.5468... | 0 | Munich, Germany |
| Munich Staedtisches Stadion Dantestr | POLYGON Z ((11.52913 48.16874 0.00000, 11.5291... | 0 | Munich, Germany |
| Vasil Levski National Stadium | POLYGON Z ((23.33410 42.68813 0.00000, 23.3340... | 0 | Sofia, Bulgaria |
| Bulgarian Army Stadium | POLYGON Z ((23.34065 42.68492 0.00000, 23.3406... | 0 | Sofia, Bulgaria |
| Arena Sofia | POLYGON Z ((23.34018 42.68318 0.00000, 23.3401... | 0 | Sofia, Bulgaria |
| Jingu Baseball Stadium | POLYGON Z ((139.71597 35.67490 0.00000, 139.71... | 0 | Tokyo, Japan |
| Japan National Stadium | POLYGON Z ((139.71482 35.67644 0.00000, 139.71... | 0 | Tokyo, Japan |


## 3. Downloading rasters for the vector data

To download rasters for the stadiums, we use the `RasterDownloaderForVectors` class. It needs to be passed a `DownloaderForSingleVector` to interface with the particular data source for our rasters, and a `RasterDownloadProcessor` to process the downloaded files. In this example, we would like to download Sentinel-2 rasters, so we choose the `SentinelDownloaderForSingleVector` to interface with the [Copernicus Open Access Hub](https://scihub.copernicus.eu/) and the `Sentinel2Processor` to convert the downloaded zipped .SAFE files to GeoTiff files (see [here](https://sentinels.copernicus.eu/web/sentinel/user-guides/sentinel-2-msi/data-formats) for an explanation of the Sentinel-2 data format). GeoTiff is a georeferenced variant of the Tiff image format that is commonly used for remote sensing rasters.

Here, we define the downloader:

In [14]:
from geographer.downloaders import (
    RasterDownloaderForVectors,
    SentinelDownloaderForSingleVector,
    Sentinel2Processor,
)

downloader_for_single_vector = SentinelDownloaderForSingleVector()
download_processor = Sentinel2Processor()

downloader = RasterDownloaderForVectors(
    downloader_for_single_vector=downloader_for_single_vector,
    download_processor=download_processor,
)

To use the Copernicus SciHub API, we need a username and password. You can sign up for an account [here](https://scihub.copernicus.eu/dhus/#/self-registration). The username and password are assumed to be stored in an .ini file, whose path we define here:

In [15]:
credentials_ini_path = DATA_DIR / "copernicus_scihub_credentials.ini"
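
The exact layout of the credentials file is not shown in this tutorial. As a minimal sketch, assuming a hypothetical `[login]` section with `username` and `password` keys (check the GeoGrapher documentation for the layout it actually expects), such a file could be written and read back with Python's standard `configparser` module. The helper names `write_credentials_ini` and `read_credentials_ini` are made up for this illustration.

```python
import configparser
import tempfile
from pathlib import Path
from typing import Tuple


def write_credentials_ini(path: Path, username: str, password: str) -> None:
    """Write credentials to an .ini file.

    ASSUMPTION: the section name "login" and the key names are illustrative,
    not confirmed by the GeoGrapher documentation.
    """
    # interpolation=None keeps passwords containing '%' intact
    config = configparser.ConfigParser(interpolation=None)
    config["login"] = {"username": username, "password": password}
    with path.open("w") as ini_file:
        config.write(ini_file)


def read_credentials_ini(path: Path) -> Tuple[str, str]:
    """Read the (username, password) pair back from the .ini file."""
    config = configparser.ConfigParser(interpolation=None)
    config.read(path)
    return config["login"]["username"], config["login"]["password"]


# Demo in a temporary directory so that no real credentials file is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "copernicus_scihub_credentials.ini"
    write_credentials_ini(demo_path, "my_username", "my_password")
    credentials = read_credentials_ini(demo_path)
```

As the comment in the download call notes, a `(username, password)` tuple can also be supplied directly instead of a path.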

To download rasters and add them to our dataset, we then run the following command:

In [14]:
downloader.download(
    connector=connector,
    target_raster_count=2,  # optional, defaults to 1. See explanation below.
    credentials=credentials_ini_path,  # could also directly supply (username, password) tuple
    producttype="L2A",
    max_percent_cloud_coverage=10,
    resolution=10,  # resolution of extracted GeoTiff
    date=("NOW-364DAYS", "NOW"),
    area_relation="Contains",
)

0it [00:00, ?it/s]

Downloading S2A_MSIL2A_20220722T092041_N0400_R093_T34TFN_20220722T134859.zip:   0%|          | 0.00/1.18G [00:…

MD5 checksumming:   0%|          | 0.00/1.18G [00:00<?, ?B/s]

2022-09-21 22:50:49,613 - geographer.downloaders.sentinel2_safe_unpacking - INFO - Using all zero band for gml mask ('CLOUDS', 'B00') for S2A_MSIL2A_20220722T092041_N0400_R093_T34TFN_20220722T134859.SAFE


Downloading S2A_MSIL2A_20220413T092031_N0400_R093_T34TFN_20220413T123632.zip:   0%|          | 0.00/1.23G [00:…

MD5 checksumming:   0%|          | 0.00/1.23G [00:00<?, ?B/s]

2022-09-21 22:57:35,719 - geographer.downloaders.sentinel2_safe_unpacking - INFO - Using all zero band for gml mask ('CLOUDS', 'B00') for S2A_MSIL2A_20220413T092031_N0400_R093_T34TFN_20220413T123632.SAFE


Downloading S2A_MSIL2A_20220627T100611_N0400_R022_T32UPU_20220627T162810.zip:   0%|          | 0.00/967M [00:0…

MD5 checksumming:   0%|          | 0.00/967M [00:00<?, ?B/s]

2022-09-21 23:03:51,514 - geographer.downloaders.sentinel2_safe_unpacking - INFO - Using all zero band for gml mask ('CLOUDS', 'B00') for S2A_MSIL2A_20220627T100611_N0400_R022_T32UPU_20220627T162810.SAFE


Downloading S2B_MSIL2A_20220804T101559_N0400_R065_T32UPU_20220804T130854.zip:   0%|          | 0.00/1.21G [00:…

MD5 checksumming:   0%|          | 0.00/1.21G [00:00<?, ?B/s]

2022-09-21 23:10:43,443 - geographer.downloaders.sentinel2_safe_unpacking - INFO - Using all zero band for gml mask ('CLOUDS', 'B00') for S2B_MSIL2A_20220804T101559_N0400_R065_T32UPU_20220804T130854.SAFE


Downloading S2A_MSIL2A_20220412T012701_N0400_R074_T54SUE_20220412T042315.zip:   0%|          | 0.00/1.22G [00:…

MD5 checksumming:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

2022-09-21 23:17:38,499 - geographer.downloaders.sentinel2_safe_unpacking - INFO - Using all zero band for gml mask ('CLOUDS', 'B00') for S2A_MSIL2A_20220412T012701_N0400_R074_T54SUE_20220412T042315.SAFE


Downloading S2A_MSIL2A_20220701T012711_N0400_R074_T54SUE_20220701T043318.zip:   0%|          | 0.00/1.22G [00:…

MD5 checksumming:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

2022-09-21 23:24:31,570 - geographer.downloaders.sentinel2_safe_unpacking - INFO - Using all zero band for gml mask ('CLOUDS', 'B00') for S2A_MSIL2A_20220701T012711_N0400_R074_T54SUE_20220701T043318.SAFE


Notice that we set the optional `target_raster_count` argument to 2. It defines the number of distinct rasters that each stadium should be contained in once the download has finished. The `rasters` attribute now contains information about the rasters:

In [12]:
connector.rasters

| raster_name | raster_processed? | timestamp | orig_crs_epsg_code | geometry |
|---|---|---|---|---|
| S2A_MSIL2A_20220722T092041_N0400_R093_T34TFN_20220722T134859.tif | True | 2022-07-22-09:20:41 | 32634 | POLYGON ((23.54663 42.33578, 23.58754 43.32358... |
| S2A_MSIL2A_20220413T092031_N0400_R093_T34TFN_20220413T123632.tif | True | 2022-04-13-09:20:31 | 32634 | POLYGON ((23.54663 42.33578, 23.58754 43.32358... |
| S2A_MSIL2A_20220627T100611_N0400_R022_T32UPU_20220627T162810.tif | True | 2022-06-27-10:06:11 | 32632 | POLYGON ((11.79809 47.73104, 11.85244 48.71769... |
| S2B_MSIL2A_20220804T101559_N0400_R065_T32UPU_20220804T130854.tif | True | 2022-08-04-10:15:59 | 32632 | POLYGON ((11.79809 47.73104, 11.85244 48.71769... |
| S2A_MSIL2A_20220412T012701_N0400_R074_T54SUE_20220412T042315.tif | True | 2022-04-12-01:27:01 | 32654 | POLYGON ((140.00972 35.15084, 139.99743 36.140... |
| S2A_MSIL2A_20220701T012711_N0400_R074_T54SUE_20220701T043318.tif | True | 2022-07-01-01:27:11 | 32654 | POLYGON ((140.00972 35.15084, 139.99743 36.140... |
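
The raster names follow the Sentinel-2 product naming convention, which encodes the mission, processing level, sensing time, processing baseline, relative orbit, and tile. The sketch below is not part of GeoGrapher, and `parse_s2_name` is a hypothetical helper. It illustrates how the `timestamp` and `orig_crs_epsg_code` values in the table relate to the raster name, under the assumption that the original CRS is the UTM zone of the raster's tile.

```python
import re
from datetime import datetime

# Sentinel-2 product names look like
# S2A_MSIL2A_<sensing time>_N<baseline>_R<orbit>_T<tile>_<discriminator>
S2_NAME_PATTERN = re.compile(
    r"^(?P<mission>S2[A-Z])_MSI(?P<level>L1C|L2A)_(?P<sensing>\d{8}T\d{6})"
    r"_N(?P<baseline>\d{4})_R(?P<orbit>\d{3})_T(?P<tile>\d{2}[A-Z]{3})"
    r"_(?P<discriminator>\d{8}T\d{6})"
)


def parse_s2_name(name: str) -> dict:
    """Split a Sentinel-2 product (or derived raster) name into its parts."""
    match = S2_NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Not a recognized Sentinel-2 product name: {name}")
    info = match.groupdict()

    # Sensing start time, formatted like the `timestamp` column above
    sensing = datetime.strptime(info["sensing"], "%Y%m%dT%H%M%S")
    info["timestamp"] = sensing.strftime("%Y-%m-%d-%H:%M:%S")

    # The tile ID starts with the UTM zone number and the MGRS latitude band.
    # Bands N to X lie in the northern hemisphere (EPSG 326xx),
    # bands C to M in the southern hemisphere (EPSG 327xx).
    utm_zone = int(info["tile"][:2])
    latitude_band = info["tile"][2]
    info["utm_epsg"] = (32600 if latitude_band >= "N" else 32700) + utm_zone
    return info


info = parse_s2_name(
    "S2A_MSIL2A_20220722T092041_N0400_R093_T34TFN_20220722T134859.tif"
)
print(info["timestamp"], info["utm_epsg"])  # 2022-07-22-09:20:41 32634
```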


Now, let's take another look at the `vectors`. It contains a `"raster_count"` column. This column tells us how many rasters each vector feature (in our case, each stadium) is fully contained in. Previously, these values were all 0, but now they are all 2. This reflects the value of 2 we passed to the optional `target_raster_count` argument above.

In [29]:
connector.vectors

| vector_name | raster_count | location | download_exception | type | geometry |
|---|---|---|---|---|---|
| Munich Olympiastadion | 2 | Munich, Germany | NoImgsForVectorFeatureFoundError('No images fo... | football | POLYGON Z ((11.54677 48.17472 0.00000, 11.5446... |
| Munich Track and Field Stadium1 | 2 | Munich, Germany | NoImgsForVectorFeatureFoundError('No images fo... | football | POLYGON Z ((11.54382 48.17279 0.00000, 11.5438... |
| Munich Olympia Track and Field2 | 2 | Munich, Germany | NoImgsForVectorFeatureFoundError('No images fo... | football | POLYGON Z ((11.54686 48.17892 0.00000, 11.5468... |
| Munich Staedtisches Stadion Dantestr | 2 | Munich, Germany | NoImgsForVectorFeatureFoundError('No images fo... | football | POLYGON Z ((11.52913 48.16874 0.00000, 11.5291... |
| Vasil Levski National Stadium | 2 | Sofia, Bulgaria | NoImgsForVectorFeatureFoundError('No images fo... | football | POLYGON Z ((23.33410 42.68813 0.00000, 23.3340... |
| Bulgarian Army Stadium | 2 | Sofia, Bulgaria | NoImgsForVectorFeatureFoundError('No images fo... | football | POLYGON Z ((23.34065 42.68492 0.00000, 23.3406... |
| Arena Sofia | 2 | Sofia, Bulgaria | NoImgsForVectorFeatureFoundError('No images fo... | football | POLYGON Z ((23.34018 42.68318 0.00000, 23.3401... |
| Jingu Baseball Stadium | 2 | Tokyo, Japan | NoImgsForVectorFeatureFoundError('No images fo... | baseball | POLYGON Z ((139.71597 35.67490 0.00000, 139.71... |
| Japan National Stadium | 2 | Tokyo, Japan | NoImgsForVectorFeatureFoundError('No images fo... | football | POLYGON Z ((139.71482 35.67644 0.00000, 139.71... |
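
To make the meaning of `raster_count` and `target_raster_count` concrete, here is a simplified sketch that does not use GeoGrapher. It assumes axis-aligned bounding boxes given as `(minx, miny, maxx, maxy)` tuples with made-up coordinates, and the helper names `contains`, `raster_counts`, and `missing_rasters` are hypothetical. It counts, for each vector feature, the rasters that fully contain it and reports which features still fall short of the target.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def contains(outer: Box, inner: Box) -> bool:
    """Return True if `inner` lies completely within `outer`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


def raster_counts(vectors: Dict[str, Box], rasters: Dict[str, Box]) -> Dict[str, int]:
    """Count, for each vector feature, the rasters that fully contain it."""
    return {
        vector_name: sum(contains(raster_box, vector_box) for raster_box in rasters.values())
        for vector_name, vector_box in vectors.items()
    }


def missing_rasters(counts: Dict[str, int], target_raster_count: int) -> Dict[str, int]:
    """Return how many more rasters each feature needs to reach the target."""
    return {
        name: target_raster_count - count
        for name, count in counts.items()
        if count < target_raster_count
    }


# Made-up example data, not taken from the tutorial dataset
vectors = {
    "stadium_a": (11.54, 48.17, 11.55, 48.18),
    "stadium_b": (23.33, 42.68, 23.34, 42.69),
}
rasters = {
    "raster_1": (11.0, 47.7, 12.0, 48.7),
    "raster_2": (11.5, 48.0, 12.5, 49.0),
    "raster_3": (23.0, 42.3, 24.0, 43.3),
}

counts = raster_counts(vectors, rasters)
shortfall = missing_rasters(counts, target_raster_count=2)
print(counts)     # {'stadium_a': 2, 'stadium_b': 1}
print(shortfall)  # {'stadium_b': 1}
```

In this toy example, `stadium_b` would need one more containing raster to reach a target of 2.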


The connector keeps track of the containment and intersection relations between vector features and rasters in the form of an internal bipartite graph. We can ask questions about this graph, such as which rasters contain (or intersect) a given vector feature (stadium):

In [14]:
# rasters containing a vector feature
vector_name = "Munich Olympiastadion"
containing_rasters = connector.rasters_containing_vector(vector_name)
print(f"rasters containing {vector_name}:\n{containing_rasters} \n")

# vector features intersecting a raster
raster_name = containing_rasters[0]
intersecting_vectors = connector.vectors_intersecting_raster(raster_name)
print(f"vector features (stadiums) intersecting {raster_name}:\n{intersecting_vectors}")

rasters containing Munich Olympiastadion:
['S2A_MSIL2A_20220627T100611_N0400_R022_T32UPU_20220627T162810.tif', 'S2B_MSIL2A_20220804T101559_N0400_R065_T32UPU_20220804T130854.tif'] 

vector features (stadiums) intersecting S2A_MSIL2A_20220627T100611_N0400_R022_T32UPU_20220627T162810.tif:
['Munich Staedtisches Stadion Dantestr', 'Munich Olympia Track and Field2', 'Munich Olympiastadion', 'Munich Track and Field Stadium1']
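
The internal data structure behind these queries is not shown in this tutorial. As a rough illustration only, and not GeoGrapher's actual implementation, a bipartite graph with labeled edges can be sketched with nested dictionaries. The class `ToyBipartiteGraph` is hypothetical; only its two query method names are borrowed from the connector methods used above. The sketch assumes that containment implies intersection, which is consistent with the output above, where the stadiums contained in a raster are listed as intersecting it.

```python
from collections import defaultdict
from typing import Dict, List


class ToyBipartiteGraph:
    """Toy bipartite graph: vector nodes on one side, raster nodes on the other.

    Each edge carries a relation label, either "contains" (the raster fully
    contains the vector feature) or "intersects" (they merely overlap).
    """

    def __init__(self) -> None:
        # Store every edge in both directions for fast lookups
        self._vector_to_rasters: Dict[str, Dict[str, str]] = defaultdict(dict)
        self._raster_to_vectors: Dict[str, Dict[str, str]] = defaultdict(dict)

    def add_edge(self, vector: str, raster: str, relation: str) -> None:
        if relation not in ("contains", "intersects"):
            raise ValueError(f"Unknown relation: {relation}")
        self._vector_to_rasters[vector][raster] = relation
        self._raster_to_vectors[raster][vector] = relation

    def rasters_containing_vector(self, vector: str) -> List[str]:
        """Rasters connected to the vector by a "contains" edge."""
        edges = self._vector_to_rasters.get(vector, {})
        return [raster for raster, relation in edges.items() if relation == "contains"]

    def vectors_intersecting_raster(self, raster: str) -> List[str]:
        """Vectors connected to the raster by any edge.

        Containment implies intersection, so both edge labels count.
        """
        return list(self._raster_to_vectors.get(raster, {}))


# Made-up example data
graph = ToyBipartiteGraph()
graph.add_edge("stadium_a", "raster_1", "contains")
graph.add_edge("stadium_a", "raster_2", "intersects")
graph.add_edge("stadium_b", "raster_1", "contains")

print(graph.rasters_containing_vector("stadium_a"))   # ['raster_1']
print(graph.vectors_intersecting_raster("raster_1"))  # ['stadium_a', 'stadium_b']
```

Storing each edge in both directions costs some memory but makes both kinds of query a simple dictionary lookup.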


## 4. Loading and saving a connector

To save the connector, we use the `save` method. This will save the connector to the `connector` subdirectory of the connector's `data_dir`:

In [16]:
connector.save()

In our case, saving the connector wasn't actually necessary, since the `downloader`'s `download` method automatically saves the connector.

To load an existing connector, we use the `from_data_dir` method:

In [None]:
connector = Connector.from_data_dir(DATA_DIR)

To see how to cut this dataset and create labels for it so that it can be used for machine learning, read through the [Creating an ML dataset tutorial notebook](https://github.com/dida-do/GeoGrapher/tree/main/notebooks/tutorial_nb_cut_label_cluster.ipynb).
