# Remarkable Trees Paris – Advanced Pipeline

This notebook explores the distribution and characteristics of remarkable trees across Paris neighborhoods using data from [Paris Open Data](https://opendata.paris.fr/explore/dataset/arbresremarquablesparis/information/). The dataset, created in 2006 by the Direction des Espaces Verts et de l'Environnement - Ville de Paris, includes geo-located remarkable trees found in diverse locations such as gardens, cemeteries, streets, schools, and early childhood institutions. These trees are notable for their age, size, rarity, or historical significance.

The study maps these trees to their respective neighborhoods (quartiers) and enriches the data with the following neighborhood-level metrics:
- **Count of remarkable trees**: Total number of remarkable trees per neighborhood.
- **Average circumference**: Mean circumference of trees (in cm) per neighborhood.
- **Average height**: Mean height of trees (in meters) per neighborhood.
- **Most common genus**: The predominant tree genus in each neighborhood.
- **Oldest plantation date**: The earliest recorded plantation date per neighborhood.
- **Summary of resumes**: An LLM-generated summary (in English) of the combined 'Résumé' (summary notes) for all remarkable trees in each neighborhood.
- **Summary of descriptions**: An LLM-generated summary (in English) of the combined 'Descriptif' (detailed descriptions) for all remarkable trees in each neighborhood.

Through this pipeline, the notebook processes the data, applies spatial filters, and visualises the enriched metrics on interactive maps, offering insights into how remarkable trees are distributed and characterized across Paris.
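
As a rough illustration of what the numeric and categorical metrics above amount to, the sketch below computes them with a plain pandas `groupby` on a small made-up table. The toy rows and the `trees_count` column name are assumptions for illustration only. The pipeline itself relies on UrbanMapper enrichers rather than on this code.

```python
import pandas as pd

# Toy stand-in for the tree table once each tree is mapped to a quartier
# (values are invented for illustration).
trees = pd.DataFrame({
    "nearest_quartier": ["A", "A", "B"],
    "circonference en cm": [300.0, 500.0, 420.0],
    "hauteur en m": [20.0, 30.0, 25.0],
    "genre": ["Platanus", "Platanus", "Quercus"],
    "date de plantation": ["1850-01-01", "1790-01-01", "1900-01-01"],
})
trees["date de plantation"] = pd.to_datetime(trees["date de plantation"])

# One row per quartier, one column per metric.
metrics = trees.groupby("nearest_quartier").agg(
    trees_count=("genre", "size"),
    avg_circonference=("circonference en cm", "mean"),
    avg_hauteur=("hauteur en m", "mean"),
    most_common_genre=("genre", lambda s: s.mode().iloc[0]),
    oldest_plantation_date=("date de plantation", "min"),
)
print(metrics)
```

The two LLM-based summaries follow the same group-then-aggregate pattern, with a text summariser as the aggregation function.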

⚠️ Please note: about the documentation's interactive examples ⚠️

Some of our Jupyter notebooks cannot be run interactively in the documentation and are displayed as is. Feel free to install the library and run them locally. When a notebook is interactive, you can see the output of each cell. Because storing datasets in a GitHub (or any other Git) repository is not good practice, we tried to load urban datasets from `HuggingFace` using `from_huggingface(.)` rather than `from_file(.)`, which requires a local file. However, (1) this was not always possible, since some datasets are not on `HuggingFace`, and (2) nothing prevents you from using `from_file(.)` or any other loader listed in the API reference's `Loader` module.

In [None]:
#####################################################################################

# ⚠️ INFORMATION ABOUT THE CURRENT CELL ⚠️
# The following shows custom aggregation functions 
# used later on in the pipeline

# Make sure to export your OpenAI API key as an environment variable
# in your terminal session (the OpenAI client typically reads OPENAI_API_KEY).

#####################################################################################

import pandas as pd
import ell

def most_common_genre(series):
    if series.empty:
        return None
    mode = series.mode()
    return mode.iloc[0] if not mode.empty else None

def oldest_plantation_date(series):
    if series.empty:
        return None
    if not pd.api.types.is_datetime64_any_dtype(series):
        try:
            series = pd.to_datetime(series, errors='coerce', utc=True)
        except Exception as e:
            raise ValueError(f"Could not convert series to datetime: {e}")
    return series.min()

@ell.simple(model="gpt-4")
def summarize_texts(texts: str):
    """You are an expert urban planner who writes concise summaries for the urban offices of city councils."""
    return f"Summarise the following texts very concisely, and write the entire output in English:\n\n{texts}"

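def summarize_resumes(series):
    # Drop missing values and cast to str so that join never receives a float NaN.
    texts = series.dropna().astype(str)
    if texts.empty:
        return None
    combined_text = " ".join(texts)
    try:
        return summarize_texts(combined_text)
    except Exception as e:
        # Fall back to a placeholder so one failed LLM call does not abort the pipeline.
        print(f"Error generating summary: {e}")
        return "Summary unavailable"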
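
To try the join-then-summarise pattern without calling the OpenAI API, the sketch below swaps the LLM call for a hypothetical stub, `fake_summarize`, and runs the aggregation through a pandas `groupby`. The stub, the `summarize_offline` helper, and the toy rows are assumptions for illustration and are not part of the pipeline.

```python
import pandas as pd

def fake_summarize(text: str) -> str:
    # Hypothetical stand-in for the LLM call: truncates instead of summarising.
    return text[:40]

def summarize_offline(series: pd.Series):
    # Drop missing notes so that str.join only ever sees strings.
    texts = series.dropna().astype(str)
    if texts.empty:
        return None
    return fake_summarize(" ".join(texts))

# Toy data: quartier "B" has no note at all.
notes = pd.DataFrame({
    "nearest_quartier": ["A", "A", "B"],
    "Résumé": ["Platane d'Orient planté en 1814.", None, None],
})
summaries = notes.groupby("nearest_quartier")["Résumé"].agg(summarize_offline)
print(summaries)
```

A quartier with no notes yields a missing value rather than an empty prompt, which avoids spending an API call on nothing.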

In [None]:
#####################################################################################

# ⚠️ INFORMATION ABOUT THE CURRENT CELL ⚠️
# The raw data needs some wrangling before it can be used: the
# coordinates are stored as a single "Geo point" string. Hence the
# "manual" load below, which creates a pre-processed version of the dataset.

#####################################################################################

from urban_mapper import CSVLoader
import urban_mapper

file_path = "./arbresremarquablesparis.csv"
df = CSVLoader(file_path, "idbase", "idbase", separator=";")._load_data_from_file()

df[['latitude', 'longitude']] = df['Geo point'].str.split(',', expand=True)
df['latitude'] = df['latitude'].str.strip().astype(float)
df['longitude'] = df['longitude'].str.strip().astype(float)

df = df.drop(columns=["Geo point"])
df.to_parquet("./trees_paris.parquet")

mapper = urban_mapper.UrbanMapper()
mapper.table_vis.interactive_display(df)
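
The coordinate split above can be checked in isolation with pandas alone. The sketch below applies the same steps to two made-up `Geo point` strings, one with a space after the comma and one without. The sample values are assumptions for illustration.

```python
import pandas as pd

# Two invented "lat, lon" strings in the same shape as the raw column.
raw = pd.DataFrame({"Geo point": ["48.8566, 2.3522", "48.8606,2.3376"]})

# Split once on the comma, then strip whitespace before casting to float.
coords = raw["Geo point"].str.split(",", expand=True)
raw["latitude"] = coords[0].str.strip().astype(float)
raw["longitude"] = coords[1].str.strip().astype(float)
raw = raw.drop(columns=["Geo point"])

print(raw)
```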

In [None]:
from urban_mapper.pipeline import UrbanPipeline
import urban_mapper as um

pipeline = UrbanPipeline([
    ("urban_layer", (
        um.UrbanMapper().urban_layer
        .with_type("region_neighborhoods")
        .from_place("Paris, France")
        .with_mapping(
            longitude_column="longitude",
            latitude_column="latitude",
            output_column="nearest_quartier"
        )
        .build()
    )),
    ("loader", (
        um.UrbanMapper().loader
        .from_file("./trees_paris.parquet")
        .with_columns(longitude_column="longitude", latitude_column="latitude")
        .build()
    )),
    ("filter", um.UrbanMapper().filter.with_type("BoundingBoxFilter").build()),
    ("enrich_trees_count", (
        um.UrbanMapper().enricher
        .with_data(group_by="nearest_quartier")
        .count_by(output_column="ramarquable_trees_count")
        .build()
    )),
    ("enrich_avg_circonference", (
        um.UrbanMapper().enricher
        .with_data(group_by="nearest_quartier", values_from="circonference en cm")
        .aggregate_by(method="mean", output_column="avg_circonference")
        .build()
    )),
    ("enrich_avg_hauteur", (
        um.UrbanMapper().enricher
        .with_data(group_by="nearest_quartier", values_from="hauteur en m")
        .aggregate_by(method="mean", output_column="avg_hauteur")
        .build()
    )),
    ("enrich_most_common_genre", (
        um.UrbanMapper().enricher
        .with_data(group_by="nearest_quartier", values_from="genre")
        .aggregate_by(method=most_common_genre, output_column="most_common_genre")
        .build()
    )),
    ("enrich_oldest_plantation", (
        um.UrbanMapper().enricher
        .with_data(group_by="nearest_quartier", values_from="date de plantation")
        .aggregate_by(method=oldest_plantation_date, output_column="oldest_plantation_date")
        .build()
    )),
    ("enrich_resume_summary", (
        um.UrbanMapper().enricher
        .with_data(group_by="nearest_quartier", values_from="Résumé")
        .aggregate_by(method=summarize_resumes, output_column="resume_summary")
        .build()
    )),
    ("enrich_description_summary", (
        um.UrbanMapper().enricher
        .with_data(group_by="nearest_quartier", values_from="Descriptif")
        .aggregate_by(method=summarize_resumes, output_column="descriptif_summary")
        .build()
    )),
    ("visualiser", (
        um.UrbanMapper().visual
        .with_type("Interactive")
        .with_style({
            "tiles": "CartoDB dark_matter",
            "tooltip": [
                "ramarquable_trees_count",
                "avg_circonference",
                "avg_hauteur",
                "most_common_genre",
                "oldest_plantation_date",
                "resume_summary",
                "descriptif_summary",
                "name"
            ],
            "colorbar_text_color": "white",
        })
        .build()
    ))
])
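
Conceptually, the `BoundingBoxFilter` step keeps only the records whose coordinates fall within the extent of the urban layer. The sketch below approximates that idea with plain pandas. The bounds are rough, illustrative values rather than ones taken from the library, and UrbanMapper's actual implementation may differ.

```python
import pandas as pd

# Illustrative, approximate extent as (min_lon, min_lat, max_lon, max_lat).
bounds = (2.22, 48.81, 2.47, 48.91)
min_lon, min_lat, max_lon, max_lat = bounds

# Two invented points: the first inside the extent, the second east of it.
points = pd.DataFrame({
    "longitude": [2.35, 2.60],
    "latitude": [48.86, 48.86],
})

# Keep rows whose longitude and latitude both fall inside the box.
inside = (
    points["longitude"].between(min_lon, max_lon)
    & points["latitude"].between(min_lat, max_lat)
)
kept = points[inside]
print(kept)
```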

In [None]:
# Execute the pipeline
mapped_data, enriched_layer = pipeline.compose_transform()

In [None]:
# Visualise the enriched metrics
fig = pipeline.visualise([
    "ramarquable_trees_count",
    "avg_circonference",
    "avg_hauteur",
    "most_common_genre",
    "oldest_plantation_date",
    "resume_summary",
    "descriptif_summary",
])

fig

In [None]:
# Save the pipeline
pipeline.save("./remarquable_trees_paris.dill")

In [None]:
# Export the pipeline to JupyterGIS for collaborative exploration
pipeline.to_jgis(
    filepath="remarquable_trees_paris_with_llm.JGIS",
    urban_layer_name="Remarkable Trees in Paris analysis",
)