# Performance Comparison Demo: GeoParquet vs. Shapefile with DuckDB

This notebook compares the performance of querying geospatial data stored as GeoParquet files and as Shapefiles, both with `geopandas` alone and with DuckDB.

The core logic for creating the datasets and running the benchmarks is located in `main.py`.

In [None]:
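# 1. Fetch the project code
# Make sure the repo is public, or provide a token
!git clone https://github.com/your-username/your-project.git

# 2. Enter the project folder
# %cd is a Colab "magic command" that changes the current working directory
%cd your-project

# 3. Install uv (pip is the quickest way to get it)
print("--- Installing uv ---")
!pip install uv

# 4. Install the dependencies with uv
# `uv pip install .` installs the project in the current directory,
# along with the dependencies it declares (e.g. in pyproject.toml)
#
# Note: in a temporary environment like Colab there is no need to create a venv;
# installing straight into Colab's "system environment" is the simplest option.
# The --system flag tells uv to target the system Python instead of a virtual environment.
print("\n--- Installing dependencies with uv ---")
!uv pip install --system .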

In [6]:
import os
import sys
from IPython.display import display, Markdown

# Add the current directory to the system path to allow importing main.py
if os.getcwd() not in sys.path:
    sys.path.append(os.getcwd())

try:
    from main import create_datasets, benchmark_queries
    display(Markdown("Successfully imported `create_datasets` and `benchmark_queries` from `main.py`."))
except ImportError as e:
    display(Markdown(f"<font color='red'>Error: Could not import functions from `main.py`. Please ensure `main.py` is in the same directory and contains `create_datasets` and `benchmark_queries` functions.</font><br>Details: {e}"))
    # Exit or raise error if crucial functions are missing
    raise

Successfully imported `create_datasets` and `benchmark_queries` from `main.py`.

## 1. Create Datasets

This step generates a dataset of 1,000,000 random points and saves it as both a GeoParquet file (`data/points.parquet`) and a Shapefile (`data/points.shp`). This process might take a few moments.

In [7]:
# Execute the function to create the datasets
create_datasets()

--- Creating a dataset with 1000000 points ---
Saving to GeoParquet...
Saving to Shapefile...
--- Datasets created ---


## 2. Run Benchmarks

Next, run the benchmarks. This step queries the GeoParquet file and the Shapefile twice each: once with `geopandas` directly and once through `DuckDB`. Each run prints its execution time and the number of points found. This process might also take some time, depending on the dataset size and your system's performance.
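
How `benchmark_queries` measures time is not shown in this notebook, so the following is only a minimal sketch of one common way to time a query: wrap it in `time.perf_counter()` calls and take the median of a few repeats. The names `time_query` and `bbox_query`, the repeat count, and the in-memory bounding-box workload are all hypothetical stand-ins, not the contents of `main.py`.

```python
import random
import time
from statistics import median


def time_query(label, query_fn, repeats=3):
    """Run query_fn `repeats` times and report the median wall-clock time."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = query_fn()
        durations.append(time.perf_counter() - start)
    elapsed = median(durations)
    print(f"{label}: {elapsed:.4f} seconds, found {len(result)} points.")
    return elapsed, len(result)


# Hypothetical stand-in workload: random points held in memory,
# filtered by a small bounding box. The real benchmark reads files instead.
rng = random.Random(42)
points = [(rng.uniform(-180, 180), rng.uniform(-90, 90)) for _ in range(100_000)]


def bbox_query():
    """Return the points that fall inside a 20 x 20 degree box around the origin."""
    return [(x, y) for x, y in points if -10 <= x <= 10 and -10 <= y <= 10]


elapsed, n_found = time_query("In-memory bbox filter", bbox_query)
```

Taking the median of several repeats reduces the effect of one-off costs such as a cold file cache, which can otherwise dominate a single measurement.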

In [8]:
# Execute the benchmarking function
benchmark_queries()


--- Benchmarking Queries ---

-- Geopandas without DuckDB --
Shapefile with Geopandas: 1.4814 seconds, found 6205 points.
GeoParquet with Geopandas: 0.4540 seconds, found 6205 points.

-- With DuckDB --
Shapefile with DuckDB: 2.2476 seconds, found 6205 points.
GeoParquet with DuckDB: 0.2179 seconds, found 6205 points.


## Summary of Results

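The output above shows the time taken for each query method. In this run, **GeoParquet with DuckDB** was the fastest at 0.2179 seconds, roughly twice as fast as GeoParquet with `geopandas` (0.4540 seconds) and roughly ten times as fast as the Shapefile with DuckDB (2.2476 seconds). Both GeoParquet queries beat both Shapefile queries, and DuckDB was slower than `geopandas` on the Shapefile (2.2476 vs. 1.4814 seconds), which suggests that the gain comes from pairing DuckDB with the Parquet format rather than from DuckDB alone. All four methods found the same 6205 points.

These timings depend on the dataset size and the system. More generally, for large datasets and complex spatial operations, GeoParquet with DuckDB tends to be the most performant combination, thanks to Parquet's columnar storage and DuckDB's in-process analytical engine.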
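
To make the relative differences easier to read, the short sketch below divides each reported time by the fastest one. The timing values are copied from the benchmark output above; the variable names (`timings`, `speedups`) are illustrative and not part of `main.py`.

```python
# Timings in seconds, copied from the benchmark output in this notebook.
timings = {
    "Shapefile with Geopandas": 1.4814,
    "GeoParquet with Geopandas": 0.4540,
    "Shapefile with DuckDB": 2.2476,
    "GeoParquet with DuckDB": 0.2179,
}

# Identify the fastest method and express every time as a multiple of it.
fastest_label = min(timings, key=timings.get)
fastest = timings[fastest_label]
speedups = {label: seconds / fastest for label, seconds in timings.items()}

# Print the methods from fastest to slowest.
for label, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{label:<28} {seconds:>7.4f} s  {speedups[label]:>5.2f}x the fastest time")
```

The same dictionary could be filled with timings from another machine or dataset size to check whether the ranking holds there.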