## Create a subset of the global-streetscapes dataset

In [20]:
# --------------------------------------
import warnings

warnings.filterwarnings("ignore")

# --------------------------------------
import ibis
ibis.options.interactive = True

# --------------------------------------
import streetscapes as scs

### Create or load the subset

In [21]:
# Directory containing CSV files
data_dir = scs.conf.CSV_DIR

# Directory containing Parquet files
parquet_dir = scs.conf.PARQUET_DIR

# Name of the subset to create
subset = "amsterdam_side"

Load the entire dataset. We are going to progressively extract subsets from it below.

In [22]:
df_all = scs.load_subset()

Streetscapes | 2025-03-18@10:28:24 | Loading 'streetscapes.parquet'...
Streetscapes | 2025-03-18@10:28:24 | Done


### Subset dataset

In this case, we select images of Amsterdam that were taken during the day with a viewing direction from the side. First, we filter by city.

In [23]:
df_ams = df_all[df_all["city"] == "Amsterdam"]

Show a data excerpt.

In [24]:
df_ams.head()

Filter the remainder by lighting condition. First, we check what options there are in the data.

In [25]:
df_ams[["lighting_condition"]].distinct()

Filter by lighting condition (here, we use `day`).

In [26]:
df_day = df_ams[df_ams["lighting_condition"] == "day"]
df_day.columns

['uuid',
 'source',
 'orig_id',
 'glare',
 'lighting_condition',
 'pano_status',
 'platform',
 'quality',
 'reflection',
 'view_direction',
 'weather',
 'lat',
 'lon',
 'datetime_local',
 'year',
 'month',
 'day',
 'hour',
 'width',
 'height',
 'heading',
 'projection_type',
 'hFoV',
 'vFoV',
 'sequence_index',
 'sequence_id',
 'sequence_img_count',
 'Bird',
 'Ground-Animal',
 'Curb',
 'Fence',
 'Guard-Rail',
 'Barrier',
 'Wall',
 'Bike-Lane',
 'Crosswalk---Plain',
 'Curb-Cut',
 'Parking',
 'Pedestrian-Area',
 'Rail-Track',
 'Road',
 'Service-Lane',
 'Sidewalk',
 'Bridge',
 'Building',
 'Tunnel',
 'Person',
 'Bicyclist',
 'Motorcyclist',
 'Other-Rider',
 'Lane-Marking---Crosswalk',
 'Lane-Marking---General',
 'Mountain',
 'Sand',
 'Sky',
 'Snow',
 'Terrain',
 'Vegetation',
 'Water',
 'Banner',
 'Bench',
 'Bike-Rack',
 'Billboard',
 'Catch-Basin',
 'CCTV-Camera',
 'Fire-Hydrant',
 'Junction-Box',
 'Mailbox',
 'Manhole',
 'Phone-Booth',
 'Pothole',
 'Street-Light',
 'Pole',
 'Traffic-Sig

Finally, filter by view direction (we use `side` here).

In [27]:
df_side = df_day[df_day["view_direction"] == "side"]
df_side.columns


Check how many rows are left after filtering.

In [28]:
df_side.count()

┌──────┐
│ 3728 │
└──────┘

### Create dataframe to download images

Keep only the information needed to download the images and save it to a Parquet file.

In [29]:
df_to_download = df_side[["uuid", "source", "orig_id"]]
df_to_download.head()

In [30]:
df_to_download.to_parquet(parquet_dir / f"{subset}.parquet")

In [31]:
# Read the saved subset back in to verify the file (this rebinds `df_ams`)
df_ams = ibis.read_parquet(parquet_dir / f"{subset}.parquet")

In [32]:
df_ams.head()

We can achieve the same outcome with a Streetscapes function, `scs.load_subset`. For now, basic conditions can be specified using the `operator` module, such as equal to (`operator.eq`), greater than or less than (`operator.gt` / `operator.lt`), and so forth. A criterion without an explicit operator is implicitly interpreted as `operator.eq`. We are working on more sophisticated filtering options.
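
To make the convention concrete, here is a minimal sketch in plain Python of how a criteria dictionary of this shape could be interpreted. It is not the Streetscapes implementation: the helper `matches`, the sample rows, and the use of `operator.gt` on `hour` are assumptions made purely for illustration.

```python
import operator


def matches(row, criteria):
    """Return True if `row` satisfies every criterion (hypothetical helper)."""
    for column, condition in criteria.items():
        if isinstance(condition, tuple):
            # Explicit form: (operator function, value)
            op, value = condition
        else:
            # A bare value is shorthand for (operator.eq, value)
            op, value = operator.eq, condition
        if not op(row[column], value):
            return False
    return True


# Made-up rows standing in for dataset records
rows = [
    {"uuid": "a1", "city": "Amsterdam", "view_direction": "side", "hour": 9},
    {"uuid": "b2", "city": "Amsterdam", "view_direction": "front", "hour": 14},
    {"uuid": "c3", "city": "Amsterdam", "view_direction": "side", "hour": 15},
    {"uuid": "d4", "city": "Berlin", "view_direction": "side", "hour": 16},
]

criteria = {
    "city": "Amsterdam",                      # implicit operator.eq
    "view_direction": (operator.eq, "side"),  # explicit operator.eq
    "hour": (operator.gt, 12),                # strictly after 12:00
}

selected = [row["uuid"] for row in rows if matches(row, criteria)]
print(selected)  # ['c3']
```

All criteria are combined with a logical AND in this sketch, so a row is kept only if every condition holds.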

In [None]:
# Define the criteria for creating the subset
criteria = {
    "city": "Amsterdam", # Equivalent to "city": (operator.eq, "Amsterdam")
    "view_direction": "side",
    "lighting_condition": "day",
}

# Define the columns to keep in the subset
columns = ["uuid", "source", "orig_id", "hour"]

# Create or load the subset
df_city = scs.load_subset(
    subset,
    criteria=criteria,
    columns=columns,
    recreate=True,
    save=False,
)

Streetscapes | 2025-03-18@10:49:26 | Creating subset 'amsterdam_side'...
Streetscapes | 2025-03-18@10:49:26 | Done


Make sure that the number of rows matches what we obtained above.

In [47]:
df_city.count()

┌──────┐
│ 3728 │
└──────┘

Show a data excerpt.

In [48]:
df_city.head()