# Streetscapes workspaces

This notebook illustrates how to use project workspaces in Streetscapes. You can load different data sources and ML models, process streetview images and save generated data to your workspace.

In [2]:
# --------------------------------------
import warnings

warnings.filterwarnings("ignore")

# --------------------------------------
import ibis

ibis.options.interactive = True

# --------------------------------------
from streetscapes.models import ModelType
from streetscapes.sources import SourceType
from streetscapes.streetview import SVWorkspace

Load a workspace (or create it if it doesn't exist). You can also pass the path to a `.env` file containing configuration options for the workspace. By default, the workspace looks for a `.env` file inside the workspace directory; if none exists there, it walks up the parent directories until it finds one or reaches the root of the file system. Environment variables are also recognised automatically.

In [None]:
ws = SVWorkspace("./Amsterdam", create=True)
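
The `.env` lookup described above can be pictured with a short standard-library sketch. This is not the Streetscapes implementation: `find_env_file` is a hypothetical helper, and the `EXAMPLE_OPTION` entry is a placeholder rather than a real configuration key.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_env_file(start: Path) -> Optional[Path]:
    """Return the nearest `.env` file at or above `start`, or None (hypothetical helper)."""
    start = start.resolve()
    # Check the starting directory first, then each parent up to the file system root.
    for directory in (start, *start.parents):
        candidate = directory / ".env"
        if candidate.is_file():
            return candidate
    return None


# Demo in a temporary tree: the `.env` file lives two levels above the workspace.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    workspace = root / "projects" / "Amsterdam"
    workspace.mkdir(parents=True)
    (root / ".env").write_text("EXAMPLE_OPTION=1\n")  # placeholder content
    found = find_env_file(workspace)
    found_in_root = found == root / ".env"

print(found_in_root)
```

Searching from the workspace outwards means that a `.env` file closer to the workspace takes precedence over one higher up in the tree.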

Add some data sources to the workspace. Supported data sources are available through the `SourceType` enum.

In [None]:
gss = ws.add_source(SourceType.GlobalStreetscapes)
kv = ws.add_source(SourceType.KartaView)
mp = ws.add_source(SourceType.Mapillary)

In [None]:
gss, kv, mp

Show the contents of the workspace directory.

In [None]:
ws.show_contents()

Let's see where a new file would be placed. The file doesn't have to exist; this is just a path constructed relative to the workspace root directory.

In [None]:
test_file_path = ws.get_workspace_path("test/test_file.txt")
test_file_path

Get the image URLs for a KartaView and a Mapillary image.

<span style="color:red;">NOTE</span>: Please make sure that you have a Mapillary token specified in the `.env` file associated with this workspace.

In [None]:
kv_img_url = kv.get_image_url(1208524)
kv_img_url

In [None]:
mp_img_url = mp.get_image_url("4911230068985425")
mp_img_url

## Loading data sources

Let's have a more detailed look into the data sources that we have loaded. Starting with the `Global Streetscapes` source, let's see how we can load and manipulate subsets of the available data. First, show the contents of the root directory for the `Global Streetscapes` source.

In [None]:
gss.show_contents()

Load and display information from the `info.csv` file from the `Global Streetscapes` source. All datasets below are Ibis tables: queries are built lazily and executed by the backend, which keeps subsetting and filtering fast even on large datasets.

In [None]:
info = gss.load_csv("info", gss.root_dir)
info.count()

In [None]:
info.head()

Load the entire `streetscapes.parquet` set from `Global Streetscapes`. We don't need to specify the `.parquet` extension since we are using the `load_parquet()` method.

In [None]:
streetscapes = gss.load_parquet("streetscapes")
streetscapes.head()

Filter the loaded dataset by view direction (i.e., select only entries that have `view_direction` set to `side`) and show the results.

In [None]:
streetscapes.select("uuid", "view_direction").filter(streetscapes["view_direction"] == "side")

Now, prepare a more complex subset of `Global Streetscapes` by selecting entries for the city of Amsterdam with a view direction set to `side`. The subset name is composed of a path relative to the root directory of the workspace (`subsets`) and the file name (`amsterdam`). By default, subsets are saved as `parquet` files, so we don't have to specify the extension. Here, we specify that we would like to recreate the subset at every run of the notebook (`recreate=True`), as well as that we would like to save the file.

In [None]:
# Subset name (path relative to the root directory of the workspace + file name without the .parquet extension)
subset = "subsets/amsterdam"

# Criteria used to filter the large Global Streetscapes dataset.
criteria = {"city": "Amsterdam", "view_direction": "side"}

# Columns to keep in the subset.
columns = {"uuid", "source", "city", "lat", "lon", "orig_id"}

# Create the subset and assign it to a variable that we can use below.
# The method also returns the path to the saved subset if the dataset was saved to disk (triggered by save=True).
(ams, ams_path) = ws.load_dataset(gss, subset, criteria=criteria, columns=columns, recreate=True, save=True)
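
To make the roles of `criteria` and `columns` concrete, here is a minimal plain-Python sketch of the same idea on a few made-up rows. It assumes that criteria are equality tests combined with a logical AND, which matches the description above but is not confirmed against the library internals; `make_subset` and the toy rows are hypothetical.

```python
# Toy rows standing in for the much larger Global Streetscapes table (made-up values).
rows = [
    {"uuid": "a1", "source": "Mapillary", "city": "Amsterdam", "view_direction": "side", "orig_id": "101"},
    {"uuid": "a2", "source": "KartaView", "city": "Amsterdam", "view_direction": "front", "orig_id": "202"},
    {"uuid": "a3", "source": "Mapillary", "city": "Rotterdam", "view_direction": "side", "orig_id": "303"},
]

toy_criteria = {"city": "Amsterdam", "view_direction": "side"}
toy_columns = ["uuid", "source", "orig_id"]


def make_subset(rows, criteria, columns):
    """Keep rows matching every criterion (assumed AND), then keep only `columns`."""
    matching = (
        row for row in rows
        if all(row.get(key) == value for key, value in criteria.items())
    )
    return [{column: row[column] for column in columns} for row in matching]


toy_subset = make_subset(rows, toy_criteria, toy_columns)
print(toy_subset)
```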

Check the path to the saved subset file. It should be a path relative to the root directory of the workspace.

In [None]:
ams_path

Here, we check the first few lines of the subset...

In [None]:
ams.head()

...and the total number of rows.

In [None]:
ams.count()

Let's load the subset from the saved file and verify that it is identical to the one assigned to `ams`.

In [None]:
ams_loaded = ws.load_parquet("subsets/amsterdam")

In [None]:
ams_loaded.head()

In [None]:
ams_loaded.count()

## Downloading images

Next, we will download images for the Amsterdam subset that we created above by using the image sources that we loaded into our workspace. We can download all the images corresponding to a data source in one go. However, the unified API of all image sources requires that the dataset contain two hardwired columns: `source` and `image_id`. This is a design choice to avoid having to handle potentially very different sources. Ibis makes it trivial to remap column names with the `select` method by providing a dictionary with the desired column names as keys and the existing columns that they map to as values.

In [None]:
src_table = ams.select({"source": "source", "image_id": "orig_id"})

Extract the source types in the table. A source must be supported ***and*** loaded in order to be recognised.

In [None]:
source_types = ws.get_source_types_from_table(src_table)
source_types
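
The "supported ***and*** loaded" rule can be illustrated with a small sketch. Everything here is hypothetical: `ToySourceType` is a stand-in enum rather than the real `SourceType`, and the case-insensitive name matching is an assumption made for the example.

```python
from enum import Enum


class ToySourceType(Enum):
    """Stand-in for the set of supported sources (hypothetical members)."""
    MAPILLARY = "mapillary"
    KARTAVIEW = "kartaview"
    EXAMPLE = "example"


def recognised_sources(names, loaded):
    """Return the source types that are both supported and loaded."""
    recognised = set()
    for name in names:
        try:
            source_type = ToySourceType(name.lower())
        except ValueError:
            continue  # not a supported source
        if source_type in loaded:
            recognised.add(source_type)
    return recognised


# Only two of the three supported sources have been loaded.
loaded = {ToySourceType.MAPILLARY, ToySourceType.KARTAVIEW}
# "Unknown" is unsupported; "Example" is supported but not loaded.
table_sources = ["Mapillary", "KartaView", "Mapillary", "Unknown", "Example"]

result = recognised_sources(table_sources, loaded)
print(sorted(source.name for source in result))
```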

Now we are ready to instruct the workspace manager to download the images. We can request only a sample of all the images (useful for initial prototyping and demonstrations like this one). Only missing images will be downloaded.

In [None]:
sample = ws.download_images(src_table, sample=10)
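
The combination of sampling and skipping images that are already on disk can be sketched as follows. This is only an illustration of the idea: `plan_downloads` is a hypothetical helper, and the `<image_id>.jpeg` file naming is an assumption, not the layout that Streetscapes necessarily uses.

```python
import random
import tempfile
from pathlib import Path


def plan_downloads(image_ids, image_dir, sample=None, seed=0):
    """Optionally sample the IDs, then list the ones with no file on disk yet."""
    ids = list(image_ids)
    if sample is not None and sample < len(ids):
        ids = random.Random(seed).sample(ids, sample)
    # Assumed naming scheme: one `<image_id>.jpeg` file per image.
    missing = [i for i in ids if not (Path(image_dir) / f"{i}.jpeg").exists()]
    return ids, missing


with tempfile.TemporaryDirectory() as tmp:
    image_dir = Path(tmp)
    # Pretend that two images were downloaded in an earlier run.
    for image_id in ("101", "102"):
        (image_dir / f"{image_id}.jpeg").write_bytes(b"")
    all_ids = ["101", "102", "103", "104", "105"]
    _, missing_all = plan_downloads(all_ids, image_dir)
    sampled, missing_sampled = plan_downloads(all_ids, image_dir, sample=3)

print(missing_all)   # only the IDs without a file on disk
print(len(sampled))  # size of the requested sample
```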

Peek into the sample.

In [None]:
sample

Some of the images that we request might not exist any more on the image source. Figure out which images have been downloaded and which ones are missing.

In [None]:
existing, missing = ws.check_image_status(sample)
print(f"==[ existing: {[i.name for e in existing.values() for i in e]}")
print(f"==[ missing: {missing}")

## Loading models

We can load various models and apply them to the image data that we have loaded or generated so far. Currently, Streetscapes supports two segmentation models:

- `MaskFormer`: A relatively small and nimble model that recognises objects from a fixed number of categories.
- `DinoSAM`: A combination of two independent models that work together to perform instance segmentation simply by providing a prompt. It is much more flexible than `MaskFormer` in that it recognises *arbitrary categories*; however, it is much slower.

We will segment the images with both models to illustrate their differences. First, we load the models using an API analogous to that for data sources, with the exception that models are spawned globally and can be reused across multiple workspaces since they are workspace-agnostic. This is another design choice to minimise memory consumption for potentially large models.

In [None]:
mf = ws.spawn_model(ModelType.MaskFormer)
ds = ws.spawn_model(ModelType.DinoSAM)

We define the categories of objects that we would like to look for in the images that we are segmenting. Categories are defined hierarchically as a nested dictionary. Subcategories (such as `window` and `door` below) will be identified as separate categories, but the pixels that they occupy will be subtracted from those attributed to their parent (here, `building`). In this way, it is possible to extract building façades excluding windows and doors. Internally, this nested dictionary is flattened, and any overlaps are handled after instances of the corresponding categories have been identified.

In [None]:
labels = {
    "building": {
        "window": None,
        "door": None,
    },
    "vegetation": None,
    "car": None,
    "truck": None,
    "road": None,
}
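
The flattening and overlap handling described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: `flatten_labels` and `subtract_children` are hypothetical helpers, and masks are represented as sets of `(row, column)` pixels instead of arrays.

```python
def flatten_labels(tree, parent=None):
    """Map every label in the nested dictionary to its parent (None for top level)."""
    flat = {}
    for label, children in tree.items():
        flat[label] = parent
        if children:
            flat.update(flatten_labels(children, parent=label))
    return flat


def subtract_children(masks, parents):
    """Remove the pixels of each subcategory from the mask of its direct parent."""
    result = {label: set(pixels) for label, pixels in masks.items()}
    for child, parent in parents.items():
        if parent in result and child in masks:
            result[parent] -= masks[child]
    return result


toy_labels = {"building": {"window": None, "door": None}, "road": None}
parents = flatten_labels(toy_labels)

# Toy masks: the window and the door lie inside the building.
toy_masks = {
    "building": {(0, 0), (0, 1), (1, 0), (1, 1)},
    "window": {(0, 1)},
    "door": {(1, 1)},
}
facade = subtract_children(toy_masks, parents)["building"]

print(parents)
print(sorted(facade))
```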

Segment all images contained in a dataset. Here, we use the `sample` dataset that we created above. A batch size (defaulting to `10`) can be specified to speed up the segmentation, but here we use a batch size of 1 to show the progress more clearly.

In [None]:
mf_segmentations = ws.segment_from_dataset(sample, mf, labels, batch_size=1)

The masks and the instances are saved as separate files with the same name as the input image but in different formats (NumPy archived arrays and Parquet files, respectively) so that they can be loaded later together. Here, we print the name of the file containing the mask for the first segmented image.

In [None]:
mf_segmentations[0].mask_path.name

Show the categories (out of the ones that we requested) identified by this model.

In [None]:
mf_segmentations[0].get_instance_table().select('label').distinct()

Visualise the parts of the image corresponding to some object categories of interest (here, we ask for everything that is labelled as a `building`).

In [None]:
mf_segmentations[0].visualise('building')

If the `visualise()` method is called without an argument, all the identified categories are visualised.

In [None]:
mf_segmentations[0].visualise()

We can also extract instances for individual categories and visualise them in isolation.

In [None]:
buildings = mf_segmentations[0].get_instances("building")
buildings[0].visualise(mf_segmentations[0].get_image())

We will now execute the same pipeline with the `DinoSAM` model.

In [None]:
ds_segmentations = ws.segment_from_dataset(sample, ds, labels, batch_size=1)

In [None]:
ds_segmentations[0].get_instance_table().select('label').distinct()

In [None]:
ds_segmentations[0].visualise()

In [None]:
ds_vis = ds_segmentations[0].get_instances("building")

In [None]:
ds_vis[3].visualise(ds_segmentations[0].get_image())