# Fetch Open Datasets of PV locations

Many of these datasets are located in [Zenodo](https://zenodo.org/), a general-purpose open-access repository developed under the European OpenAIRE program and operated by CERN. Others are hosted in figshare, a web-based platform for sharing research data and other types of content. The rest are hosted in GitHub repositories or other open-access platforms.The datasets are available in various formats, including CSV, GeoJSON, and shapefiles, and raster masks. We'll be using open-source Python libraries to download and process them into properly georeferenced geoparquet files
that will serve as a base for our duckdb tables that we'll manage with dbt

Here we list the dataset titles alongside their first author, DOI links, and their number of labels:
- "A solar panel dataset of very high resolution satellite imagery to support the Sustainable Development Goals" - C. Clark et al, 2023 | [paper DOI](https://doi.org/10.1038/s41597-023-02539-8) | [dataset DOI](https://doi.org/10.6084/m9.figshare.22081091.v3) | 2,542 object labels (per spatial resolution)
- "A global inventory of photovoltaic solar energy generating units" - L. Kruitwagen et al, 2021 | [paper DOI](https://doi.org/10.1038/s41586-021-03957-7) | [dataset DOI](https://doi.org/10.5281/zenodo.5005867) | 50,426 for training, cross-validation, and testing; 68,661 predicted polygon labels 
- "A harmonised, high-coverage, open dataset of solar photovoltaic installations in the UK" - D. Stowell et al, 2020 | [paper DOI](https://doi.org/10.1038/s41597-020-00739-0) | [dataset DOI](https://zenodo.org/records/4059881) | 265,418 data points (over 255,000 are stand-alone installations, 1067 solar farms, and rest are subcomponents within solar farms)
- "A crowdsourced dataset of aerial images with annotated solar photovoltaic arrays and installation metadata" - G. Kasmi, 2023 | [paper DOI](https://doi.org/10.1038/s41597-023-01951-4) | [dataset DOI](https://doi.org/10.5281/zenodo.6865878) | > 28K points of PV installations; 13K+ segmentation masks for PV arrays; metadata for 8K+ installations
- "Georectified polygon database of ground-mounted large-scale solar photovoltaic sites in the United States" - K. Sydny, 2023 | [paper DOI](https://doi.org/10.1038/s41597-023-02644-8) | [dataset DOI](https://www.sciencebase.gov/catalog/item/6671c479d34e84915adb7536) | 4186 data points (Note: these correspond to PV _facilities_ rather than individual panel arrays or objects and need filtering of duplicates with other datasets and further processing to extract the PV arrays in the facility)
- "Vectorized solar photovoltaic installation dataset across China in 2015 and 2020" - J. Liu et al, 2024 | [paper DOI](https://doi.org/10.1038/s41597-024-04356-z) | [dataset link](https://github.com/qingfengxitu/ChinaPV) | 3,356 PV labels (inspect quality!)
- "Multi-resolution dataset for photovoltaic panel segmentation from satellite and aerial imagery" - H. Jiang, 2021 | [paper DOI](https://doi.org/10.5194/essd-13-5389-2021) | [dataset DOI](https://doi.org/10.5281/zenodo.5171712) | 3,716 samples of PV data points
- "An Artificial Intelligence Dataset for Solar Energy Locations in India" - A. Ortiz, 2022 | [paper DOI](https://doi.org/10.1038/s41597-022-01499-9) | [dataset link 1](https://researchlabwuopendata.blob.core.windows.net/solar-farms/solar_farms_india_2021.geojson) or [dataset link 2](https://raw.githubusercontent.com/microsoft/solar-farms-mapping/refs/heads/main/data/solar_farms_india_2021_merged_simplified.geojson) | 117 geo-referenced points of solar installations across India
- "GloSoFarID: Global multispectral dataset for Solar Farm IDentification in satellite imagery" - Z. Yang, 2024 | [paper DOI](https://doi.org/10.48550/arXiv.2404.05180) | [dataset DOI](https://github.com/yzyly1992/GloSoFarID/tree/main/data_coordinates) | 6,793 PV samples across 3 years (double counting of samples)
- "Distributed solar photovoltaic array location and extent dataset for remote sensing object identification" - K. Bradbury, 2016 | [paper DOI](https://doi.org/10.1038/sdata.2016.106) | [dataset DOI](https://doi.org/10.6084/m9.figshare.3385780.v4) | polygon annotations for 19,433 PV modules in 4 cities in California, USA

In [36]:
from IPython.display import display
from IPython.display import clear_output
from tqdm import tqdm
import ipywidgets as widgets
from ipywidgets import Layout
from dotenv import load_dotenv

import numpy as np
import xarray as xr
from branca.colormap import linear
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import pandas as pd
# import matplotlib.pyplot as plt

import geopandas as gpd
import rasterio
import shapely
import pygeohash
import folium
import lonboard
import pydeck as pdk
# import openeo 
# import pystac_client

# import easystac
# import cubo

import duckdb as dd 
import datahugger
import sciencebasepy
from seedir import seedir

# python libraries
import os
import json
import requests
import urllib.parse
from pathlib import Path
import subprocess
import tempfile
import shutil
import pprint as pp
import time
import re
from zipfile import ZipFile
import random

In [2]:
# create dict of metadata for datasets
# this will be used for interactive widget and managing downloads

# for maxar dataset
# Catalogue ID 1040050029DC8C00; use to find geospatial extent coords
# The geocoordinates for each solar panel object may be determined using the native resolution labels (found in the labels_native directory). 
# The center and width values for each object, along with the relative location information provided by the naming convention for each label, 
# may be used to determine the pixel coordinates for each object in the full, corresponding native resolution tile. 
# The pixel coordinates may be translated to geocoordinates using the EPSG:32633 coordinate system and the following geotransform for each tile:

# Tile 1: (307670.04, 0.31, 0.0, 5434427.100000001, 0.0, -0.31)
# Tile 2: (312749.07999999996, 0.31, 0.0, 5403952.860000001, 0.0, -0.31)
# Tile 3: (312749.07999999996, 0.31, 0.0, 5363320.540000001, 0.0, -0.31)
# see here on gdal format geotransform: https://gdal.org/en/stable/tutorials/geotransforms_tut.html

# look into adding dataset crs or projection to metadata dict
# note that most of these details are hardcoded and difficult to parse ahead of time
# load environment variables
load_dotenv()
DATASET_DIR = Path(os.getenv('DATA_PATH'))
dataset_metadata = {
    'deu_maxar_vhr_2023': {
        'doi': '10.6084/m9.figshare.22081091.v3',
        'repo': 'figshare',
        'compression': 'zip',
        'label_fmt': 'yolo_fmt_txt',
        'has_imgs': False,
        'label_count': 2542 # solar panel objects (ie not individual panels)
    },
    'uk_crowdsourced_pv_2020': {
        'doi': '10.5281/zenodo.4059881',
        'repo': 'zenodo',
        'compression': None,
        'label_fmt': 'geojson',
        'geom_type': {'features': ['Point', 'Polygon', 'MultiPolygon']},
        'crs': None, # default to WGS84 when processing
        'has_imgs': False,
        'label_count': 265418
    },
    # note for report later: Maxar Technologies (MT) was primarily used to determine the extent of solar arrays
    'usa_eia_large_scale_pv_2023': {
        'doi': '10.5281/zenodo.8038684',
        'repo': 'sciencebase',
        'compression': 'zip',
        'label_fmt': 'shapefile',
        'has_imgs': False,
        'label_count': 4186
    },
    'chn_med_res_pv_2024': {
        # using github files since zenodo shapefiles fail to load in QGIS
        'doi': 'https://github.com/qingfengxitu/ChinaPV/tree/main',
        'repo': 'github',
        'compression': None,
        'label_fmt': 'shapefile',
        'has_imgs': False,
        'label_count': 3356
    },
    'usa_cali_usgs_pv_2016': {
        'doi': '10.6084/m9.figshare.3385780.v4',
        'repo': 'figshare',
        'compression': None,
        'label_fmt': 'geojson',
        'crs': 'NAD83',
        'geom_type': {'features': 'Polygon'},
        'has_imgs': False,
        'label_count': 19433
    },
    'chn_jiangsu_vhr_pv_2021': {
        'doi': '10.5281/zenodo.5171712',
        'repo': 'zenodo',
        'compression': 'zip',
        # look into geotransform details for processing these labels
        'label_fmt': 'pixel_mask',
        'has_imgs': True,
        'label_count': 3716
    },
    'ind_pv_solar_farms_2022': {
        'doi': 'https://raw.githubusercontent.com/microsoft/solar-farms-mapping/refs/heads/main/data/solar_farms_india_2021_merged_simplified.geojson',
        'repo': 'github',
        'compression': None,
        'label_fmt': 'geojson',
        'geom_type': {'features': 'MultiPolygon'}, 
        'crs': 'WGS84',
        'has_imgs': False,
        'label_count': 117
    },
    'fra_west_eur_pv_installations_2023': {
        'doi': '10.5281/zenodo.6865878',
        'repo': 'zenodo',
        'compression': 'zip',
        'label_fmt': 'json',
        'geom_type': {'Polygon': ['Point']},
        'crs': None, 
        'has_imgs': True, 
        'label_count': (13303, 7686)
    },
    'global_pv_inventory_sent2_spot_2021': {
        'doi': '10.5281/zenodo.5005867',
        'repo': 'zenodo',
        'compression': None,
        'label_fmt': 'geojson',
        'geom_type': ['Polygon'],
        'crs': 'WGS84',
        'has_imgs': False,
        'label_count': 50426
    },
    'global_pv_inventory_sent2_2024': {
        'doi': 'https://github.com/yzyly1992/GloSoFarID/tree/main/data_coordinates',
        'repo': 'github',
        'compression': None,
        'label_fmt': 'json',
        'crs': None, # default to WGS84 when processing
        'geom_type': ['Point'], # normal json with no geometry attribute
        'has_imgs': True, 
        'label_count': 6793
    }

}

dataset_choices = [
    # 'global_pv_inventory_sent2_2024',
    'global_pv_inventory_sent2_spot_2021',
    # 'fra_west_eur_pv_installations_2023',
    'ind_pv_solar_farms_2022',
    'usa_cali_usgs_pv_2016',
    # 'chn_med_res_pv_2024',
    # 'usa_eia_large_scale_pv_2023',
    # 'uk_crowdsourced_pv_2020',
    # 'deu_maxar_vhr_2023'   
]

In [3]:
# Initialize a list to store selected datasets
# mostly gen by github copilot with Claude 3.7 model
selected_datasets = dataset_choices.copy()

def format_dataset_info(dataset):
    """Create a formatted HTML table for dataset metadata"""
    metadata = dataset_metadata[dataset]
    
    # Create table with metadata
    html = f"""
    <style>
    .dataset-table {{
        border-collapse: collapse;
        width: 30%;
        margin: 20px auto;
        font-family: Arial, sans-serif;
    }}
    .dataset-table th, .dataset-table td {{
        border: 1px solid #ddd;
        padding: 8px;
        text-align: left;
    }}
    .dataset-table th {{
        background-color: #f2f2f2;
        font-weight: bold;
    }}
    </style>
    <table class="dataset-table">
        <tr><th>Metadata</th><th>Value</th></tr>
        <tr><td>DOI/URL</td><td>{metadata['doi']}</td></tr>
        <tr><td>Repository</td><td>{metadata['repo']}</td></tr>
        <tr><td>Compression</td><td>{metadata['compression'] or 'None'}</td></tr>
        <tr><td>Label Format</td><td>{metadata['label_fmt']}</td></tr>
        <tr><td>Has Images</td><td>{metadata['has_imgs']}</td></tr>
        <tr><td>Label Count</td><td>{metadata.get('label_count', 'Unknown')}</td></tr>
    </table>
    """
    return html

# Create an accordion to display selected datasets with centered layout
dataset_accordion = widgets.Accordion(
    children=[widgets.HTML(format_dataset_info(ds)) for ds in selected_datasets],
    layout=Layout(width='50%', margin='0 auto')
)
for i, ds in enumerate(selected_datasets):
    dataset_accordion.set_title(i, ds)

# Define a function to add or remove datasets
def manage_datasets(action, dataset=None):
    global selected_datasets, dataset_accordion
    
    if action == 'add' and dataset and dataset not in selected_datasets:
        selected_datasets.append(dataset)
    elif action == 'remove' and dataset and dataset in selected_datasets:
        selected_datasets.remove(dataset)
    
    # Update the accordion with current selections
    dataset_accordion.children = [widgets.HTML(format_dataset_info(ds)) for ds in selected_datasets]
    for i, ds in enumerate(selected_datasets):
        dataset_accordion.set_title(i, ds)
    
    f"Currently selected datasets: {len(selected_datasets)}"

# Create dropdown for available datasets
dataset_dropdown = widgets.Dropdown(
    options=list(dataset_metadata.keys()),
    description='Dataset:',
    disabled=False,
    layout=Layout(width='70%', margin='20 20 auto 20 20')
)

# Create buttons for actions
add_button = widgets.Button(description="Add Dataset", button_style='success')
remove_button = widgets.Button(description="Remove Dataset", button_style='danger')

# Define button click handlers
def on_add_clicked(b):
    manage_datasets('add', dataset_dropdown.value)

def on_remove_clicked(b):
    manage_datasets('remove', dataset_dropdown.value)

# Link buttons to handlers
add_button.on_click(on_add_clicked)
remove_button.on_click(on_remove_clicked)

## Dataset Selection Interface
#### Use the dropdown and buttons below to customize which solar panel datasets will be fetched and processed.
- Select a dataset from the dropdown:
    - Click "Add Dataset" to include it in processing
    - Click "Remove Dataset" to exclude it
- View metadata table in the selected dataset's dropdown

In [4]:
# Display the widgets
display(widgets.HBox([dataset_dropdown, add_button, remove_button]))
display(dataset_accordion)

HBox(children=(Dropdown(description='Dataset:', layout=Layout(margin='20 20 auto 20 20', width='70%'), options…

Accordion(children=(HTML(value='\n    <style>\n    .dataset-table {\n        border-collapse: collapse;\n     …

# Fetching and Organizing datasets for later-preprocessing

We will use [datahugger](https://j535d165.github.io/datahugger/) to fetch datasets hosted in Zenodo, figshare, and GitHub. 

We will sciencebase for the dataset hosted in the USGS ScienceBase Catalog.
We will pre-process and convert datasets into geojson, if not already formatted, and manage these using [geopandas](https://geopandas.org/). These will be further processed into geoparquet files for use in duckdb tables used to manage and later consolidate the datasets with dbt.  
- The datasets will be stored in the `data/` directory
    - the geoparquet files will be stored in the `data/geoparquet/` directory

#### Processing

In [19]:
# move to utility functions later
# def fetch_github_repo_files(dataset_name, 


# use the metadata to fetch the dataset files using datahugger
def fetch_dataset_files(dataset_name, max_mb=100, force=False):
    metadata = dataset_metadata[dataset_name]
    doi = metadata['doi']
    repo = metadata['repo']
    compression = metadata['compression']
    label_fmt = metadata['label_fmt']
    # convert to bytes
    max_dl = max_mb * 1024 * 1024
    dataset_dir = os.path.join(os.getenv('DATA_PATH'), 'raw', 'labels', dataset_name)
    geofile_regex = r'^(.*\.(geojson|json|shp|zip|csv))$'
    dst = os.path.join(os.getcwd(), dataset_dir)
    dst_p = Path(dst)

    # prettyprint metadata and dst info
    # pp.pprint(metadata)
    # print(f"Destination: {dataset_dir}")
    # print(f"Max download size: {max_mb} MB")
    # print(f"Force Download: {force}")

    dataset_tree = {}

    # TODO: move different repo handling to separate functions

    # use datahugger to fetch files from most repos
    if repo in ['figshare', 'zenodo']:

        ds_tree = datahugger.get(doi, dst, max_file_size=max_dl, force_download=force)
        # compare files to be fetched (after filtering on max file size) with existing files  
        files_to_fetch = [f['name'] for f in ds_tree.dataset.files if f['size'] <= max_dl]
        ds_files = [os.path.join(root, fname) for root, dirs, files in os.walk(dst_p) for fname in files if re.match(geofile_regex, fname)]
        # flag for avoiding extracting zip when already extracted
        # is_unzipped = all(f in ds_files for f in files_to_fetch) and len(ds_files) > 1
        # TODO: handle .zip files that consist of a redundant copy of the entire dataset
        if metadata['compression'] == 'zip' and any(f.endswith('.zip') for f in ds_files):
            print(f"Dataset metadata for {dataset_name} indicates handling of one or more downloaded zip files.")
            # check if the zip file was fetched and directly extract if it's the only file in the dataset
            extracted_files = []
            if len(ds_files) <= 2 and ds_files[0].endswith('.zip'):
                zip_file = dst_p / ds_files[0]
                # print(f"Found single zip file for dataset: {zip_file}")
                # extract the zip file and delete it 
                with ZipFile(zip_file, 'r') as zip_ref:
                    extracted_files = zip_ref.namelist()
                    zip_ref.extractall(dst)
                
                # remove the zip file
                # try:
                #     os.remove(zip_file)
                #     print(f"Removed {os.path.relpath(zip_file)} after extraction")
                # except Exception as e:
                #     print(f"Error removing {zip_file}: {e}")
                # check if zip file consisted of a single dir and move contents up one level
                top_level_dir = dst_p / extracted_files[0]
                if top_level_dir.is_dir():
                    # move only first level dirs and files to our dataset dir
                    for item in top_level_dir.iterdir():
                        if item.name.endswith('.zip'):
                            continue
                        # don't copy if already exists and is non-empty
                        # TODO: add non-empty check
                        elif os.path.exists(dst_p / item.name):
                            print(f"Skipping {item} as it already exists in {os.path.relpath(dst)}")
                            continue
                        elif item.parent == top_level_dir:
                            print(f"Moving {item} to {os.path.relpath(dst)}")
                            shutil.move(item, dst)
                    # remove the top level dir
                    shutil.rmtree(top_level_dir)

                ds_files = [os.path.join(root, fname) for root, dirs, files in os.walk(dst_p) for fname in files if re.match(geofile_regex, fname)]
                print(f"Moved items from {os.path.relpath(top_level_dir)} to:\n{os.path.relpath(dst_p)}")
                print(f"After extraction and moving, we have {len(ds_files)} files in {os.path.relpath(dst)}:\n{ds_files}")

            elif len(ds_files) > 2:
                # multiple files in addition to the zip file; handle on case by case basis
                print(f"Multiple files found in {dst_p}:\n{os.listdir(dst_p)}")
        # no further processing needed; get file list directly from datahugger
        else: 
            ds_files = [os.path.join(root, fname) for root, dirs, files in os.walk(dst_p) for fname in files if re.match(geofile_regex, fname)]


        dataset_tree = {
            'dataset': dataset_name,
            'output_dir': ds_tree.output_folder,
            'files': ds_files,
            'fs_tree': seedir(dst_p, depthlimit=5, printout=False, regex=True, include_files=geofile_regex)
        }

    elif repo == 'github':
        # Handle GitHub repositories using git partial cloning of repo 
        
        # Create destination directory if it doesn't exist
        os.makedirs(dst, exist_ok=True)
        # Parse the GitHub URL
        # [user, repo, tree, branch, rest of path]
        parts = doi.replace('https://github.com/', '').split('/')
        repo_path = f"{parts[0]}/{parts[1]}"
        
        # Extract branch and path
        branch = 'main'  # Default branch
        path = ''
        
        # check if local path exists and contains expected files
        if os.path.exists(dst) and any(os.path.splitext(fname)[1] in ['.geojson', '.json', '.shp', '.zip'] for fname in os.listdir(dst)) and not force:  
            print(f"Destination path for {dataset_name}'s repo already exists and contains expected files.")
            # print in bold
            print(f"\033[1mSkipping Download!\033[0m")
            # fetch dataset dir info from Pathlib and tree from seedir 
            tree = seedir(dst_p, depthlimit=5, printout=False, regex=True, include_files=geofile_regex)
            # get list of files in Path object that satisfy regex
            ds_files = [os.path.join(root, fname) for root, dirs, files in os.walk(dst_p) for fname in files if re.match(geofile_regex, fname)]
            dataset_tree = {
                'dataset': dataset_name,
                'output_dir': dst,
                'files': ds_files,
                'fs_tree': tree
            }

        # Check if it's a folder/repository or a single file
        elif '/blob/' not in doi and 'raw.githubusercontent.com' not in doi:
            try:
                if 'tree' in parts:
                    tree_index = parts.index('tree')
                    branch = parts[tree_index + 1]
                    path = '/'.join(parts[tree_index + 2:]) if len(parts) > tree_index + 2 else ''
                
                # Create a temporary directory for the sparse checkout
                with tempfile.TemporaryDirectory() as temp_dir:
                    # Initialize the git repository and set up sparse checkout
                    commands = [f"git clone --filter=blob:limit={max_mb}m --depth 1 https://github.com/{repo_path}.git {dataset_name}"]
                    # print(f"Running commands: {commands}")
                    # Execute git commands
                    for cmd in commands:
                        
                        process = subprocess.run(cmd, shell=True, cwd=temp_dir, 
                                               capture_output=True, text=True)
                        # show command output (debug)
                        print(f"Command stdout: {process.stdout}")
                        if process.returncode != 0:
                            raise Exception(f"Git command failed: {cmd}\n{process.stderr}")
                    
                    # Copy only the files in the dir specified in DOI/URL
                    repo_ds_dir = os.path.join(temp_dir, dataset_name, path) if path else os.path.join(temp_dir, dataset_name)
                    files_list = []
                    #
                    for root, _, files in os.walk(repo_ds_dir):
                        for file in files:
                            if file.startswith('.git'):
                                continue
                            src_file = os.path.join(root, file)
                            # Create relative path
                            rel_path = os.path.relpath(src_file, repo_ds_dir)
                            dst_file = os.path.join(dst, rel_path)
                            
                            # Create destination directory if needed
                            os.makedirs(os.path.dirname(dst_file), exist_ok=True)
                            
                            # Copy the file
                            shutil.copy2(src_file, dst_file)
                            files_list.append(dst_file)
                            print(f"Copied {rel_path} to ./{dataset_dir}/{rel_path}")

                dataset_tree = {
                    'dataset': dataset_name,
                    'output_dir': dst,
                    'files': files_list,
                    'fs_tree': seedir(dst_p, depthlimit=5, printout=False, regex=True, include_files=geofile_regex)
                }
                
            except Exception as e:
                print(f"Error performing git clone: {e}")
                return None
        else:
            # It's a single file (raw URL or blob URL)
            try:
                # Convert blob URL to raw URL if needed
                if '/blob/' in doi:
                    raw_url = doi.replace('github.com', 'raw.githubusercontent.com').replace('/blob/', '/')
                else:
                    raw_url = doi
                
                # Extract filename from URL
                filename = os.path.basename(urllib.parse.urlparse(raw_url).path)
                local_file_path = os.path.join(dst, filename)
                
                # Download the file
                response = requests.get(raw_url, stream=True)
                response.raise_for_status()
                
                # Check file size
                file_size = int(response.headers.get('content-length', 0))
                if file_size > max_dl:
                    print(f"File size ({file_size} bytes) exceeds maximum allowed size ({max_dl * 1024 * 1024} MB)")
                    return None
                
                with open(local_file_path, 'wb') as f:
                    for chunk in tqdm(response.iter_content(chunk_size=8192), desc=f"Downloading {filename}", unit='KB'):
                        f.write(chunk)
                print(f"Downloaded {filename} to {os.path.relpath(local_file_path)}")
                dataset_tree = {
                    'dataset': dataset_name,
                    'output_dir': dst,
                    'files': [local_file_path],
                    'fs_tree': ds_tree.tree()
                }
                
            except Exception as e:
                print(f"Error downloading GitHub file: {e}")

    elif repo == 'sciencebase':
        # Initialize ScienceBase client
        # sb = sciencebasepy.SbSession()
        
        # # Extract the item ID from the DOI or URL
        # # DOIs like 10.5281/zenodo.8038684 or URLs with item ID
        # item_id = doi.split('/')[-1] if '/' in doi else doi
        
        # try:
        #     # Get item details
        #     item = sb.get_item(item_id)
            
        #     # Create destination directory
        #     os.makedirs(dst, exist_ok=True)
            
        #     # Download all files associated with the item
        #     downloaded_files = []
            
        #     # Get item files
        #     files = sb.get_item_file_info(item_id)
            
        #     for file_info in files:
        #         file_name = file_info['name']
        #         file_url = file_info['url']
                
        #         # Check file size if available
        #         if 'size' in file_info and file_info['size'] > max_dl:
        #             print(f"Skipping file {file_name} as it exceeds the maximum download size")
        #             continue
                
        #         # Download the file
        #         local_file_path = os.path.join(dst, file_name)
        #         sb.download_file(file_url, local_file_path)
                
        #         downloaded_files.append(local_file_path)
        #         print(f"Downloaded {file_name} to {local_file_path}")
        print("Not Implemented yet")
        return None

    print(f"Fetched {len(dataset_tree['files'])} dataset files for {dataset_name} in {os.path.relpath(dataset_tree['output_dir'])}:")
    print(dataset_tree['fs_tree'])

    return dataset_tree

In [20]:
# check datahugger get arguments
# print(datahugger.get.__doc__)

In [21]:
# test_ds = selected_datasets[2]
# test_doi = dataset_metadata[test_ds]['doi']
# max_mb = 500
# dst_dir = os.path.join(os.getcwd(), os.getenv('DATA_PATH'), 'raw', 'labels', test_ds)
# t = datahugger.get(test_doi, dst_dir, print_only=True)

In [22]:
# iterate through the selected datasets and fetch files
# iterate through the selected datasets and fetch files
ds_trees = {}
max_mb = int(os.getenv('MAX_LABEL_MB', 100))
print(f"Fetching {len(selected_datasets)} datasets with files of max size {max_mb} MB")

# Create widgets for controlling the fetching process
fetch_output = widgets.Output(
    layout=widgets.Layout(
        width='80%', 
        border='1px solid #ddd', 
        padding='10px',
        overflow='auto'
    )
)
# Apply direct CSS styling for text wrapping (Note: unvalidated)
display(widgets.HTML("""
<style>
.jupyter-widgets-output-area pre {
    white-space: pre-wrap !important;       /* CSS3 */
    word-wrap: break-word !important;        /* Internet Explorer 5.5+ */
    overflow-wrap: break-word !important;
    max-width: 100%;
}
</style>
"""))
control_panel = widgets.VBox(layout=widgets.Layout(width='20%', padding='10px', overflow='auto', word_wrap='break-word'))
fetch_button = widgets.Button(description="Fetch Next Dataset", button_style="primary")
progress_label = widgets.HTML("Waiting to start...")
dataset_index = 0

# Function to fetch the next dataset
def fetch_next_dataset(button=None):
    global dataset_index
    global dataset_metadata
    
    if dataset_index >= len(selected_datasets):
        with fetch_output:
            print("All datasets have been fetched!")
            progress_label.value = f"<b>Completed:</b> {dataset_index}/{len(selected_datasets)} datasets"
        fetch_button.disabled = True
        return
    
    dataset = selected_datasets[dataset_index]
    progress_label.value = f"<b>Fetching:</b> {dataset_index+1}/{len(selected_datasets)}<br><b>Current:</b> {dataset}"
    
    with fetch_output:
        clear_output(wait=True)
        print(f"Fetching dataset files for {dataset} using DOI/URL:\n {dataset_metadata[dataset]['doi']}")
        ds_tree = fetch_dataset_files(dataset, max_mb=max_mb, force=force_download_checkbox.value)
        
        if ds_tree:
            ds_trees[dataset] = ds_tree
            # update metadata dict with local filesystem info
            dataset_metadata[dataset]['output_dir'] = ds_tree['output_dir']
            dataset_metadata[dataset]['files'] = ds_tree['files']
            dataset_metadata[dataset]['tree'] = ds_tree['fs_tree']
            # print the dataset file tree
        else:
            print(f"Failed to fetch dataset {dataset}")
    
    dataset_index += 1
    progress_label.value = f"<b>Completed:</b> {dataset_index}/{len(selected_datasets)}<br><b>Next:</b> {selected_datasets[dataset_index] if dataset_index < len(selected_datasets) else 'Done'}"

# Add a checkbox for force download option
force_download_checkbox = widgets.Checkbox(
    value=False,
    description='Force Download',
    tooltip='If checked, download will be forced even if files exist locally',
    layout=widgets.Layout(width='auto')
)

# Configure the button callback
fetch_button.on_click(fetch_next_dataset)

# Create the control panel
dataset_progress = widgets.HTML(f"Datasets selected: {len(selected_datasets)}")
fetch_status = widgets.HTML(
    f"Status: Ready to begin",
    layout=widgets.Layout(margin="10px 0")
)

# Create the control panel with left alignment
control_panel.children = [
    widgets.HTML("<h3 style='align:left;'>Fetch Control</h3>"), 
    dataset_progress,
    force_download_checkbox,
    widgets.HTML("<hr style='margin:10px 0'>"),
    progress_label,
    fetch_button
]

# Add custom CSS to ensure alignment
display(widgets.HTML("""
<style>
.widget-html {
    text-align: left !important;
}
.widget-checkbox {
    justify-content: flex-start !important;
}
.widget-button {
    width: 100% !important;
}
</style>
"""))

Fetching 3 datasets with files of max size 300 MB


HTML(value='\n<style>\n.jupyter-widgets-output-area pre {\n    white-space: pre-wrap !important;       /* CSS3…

HTML(value='\n<style>\n.widget-html {\n    text-align: left !important;\n}\n.widget-checkbox {\n    justify-co…

#### Fetching selected datasets and visualizing metadata and file structure

In [23]:
# Display the widget layout
display(widgets.HBox([fetch_output, control_panel]))

# Set up for first fetch
if selected_datasets:
    progress_label.value = f"<b>Ready to start:</b><br>First dataset: {selected_datasets[0]}"
else:
    progress_label.value = "<b>No datasets selected</b>"
    fetch_button.disabled = True

HBox(children=(Output(layout=Layout(border='1px solid #ddd', overflow='auto', padding='10px', width='80%')), V…

In [24]:
# keep subset of metadata dict for selected datasets
selected_metadata = {ds: dataset_metadata[ds] for ds in selected_datasets}
get_ds_files = lambda ds: dataset_metadata[ds]['files']
get_ds_dir = lambda ds: dataset_metadata[ds]['output_dir']
fra_ds_folder = 'replication'
# make a manual selection of the set of files we'll use from each dataset
selected_ds_files = {
    # 'global_pv_inventory_sent2_2024':
        # [f for f in get_ds_files('global_pv_inventory_sent2_2024') if f.endswith('.json')],
    'global_pv_inventory_sent2_spot_2021':
        [f for f in get_ds_files('global_pv_inventory_sent2_spot_2021') if f.endswith('polygons.geojson') or f.endswith('set.geojson')],
    # 'fra_west_eur_pv_installations_2023':
    #     [os.path.join(root, fname) for root, _, files in os.walk(get_ds_dir('fra_west_eur_pv_installations_2023')) for fname in files ],
    'ind_pv_solar_farms_2022': 
        [f for f in get_ds_files('ind_pv_solar_farms_2022') if f.endswith('.geojson')],
    'usa_cali_usgs_pv_2016':
        # grab all except the normal json
        [f for f in get_ds_files('usa_cali_usgs_pv_2016') if not f.endswith('.json')]
}

# build and output tree for selected datasets
selected_ds_dirs = [get_ds_dir(ds) for ds in selected_datasets]
print("All selected datasets have been fetched with the following file tree:\n")
selected_ds_tree = seedir(DATASET_DIR / 'raw' / 'labels', depthlimit=10, printout=True, regex=False, include_folders=selected_datasets, style='plus')

All selected datasets have been fetched with the following file tree:

labels/
+-global_pv_inventory_sent2_spot_2021/
| +-predicted_set.geojson
| +-cv_polygons.geojson
| +-cv_tiles.geojson
| +-test_tiles.geojson
| +-test_polygons.geojson
| +-trn_polygons.geojson
| +-trn_tiles.geojson
| +-global_pv_inventory_all.zip
+-ind_pv_solar_farms_2022/
| +-solar_farms_india_2021_merged_simplified.geojson
+-usa_cali_usgs_pv_2016/
  +-SolarArrayPolygons.geojson
  +-.DS_Store
  +-polygonVertices_LatitudeLongitude.csv
  +-polygonVertices_PixelCoordinates.csv
  +-polygonDataExceptVertices.csv
  +-SolarArrayPolygons.json


In [30]:
# cali_files = selected_ds_files['usa_cali_usgs_pv_2016']
# cali_str = "\n".join([f"{i}: {os.path.relpath(f)}" for i, f in enumerate(cali_files)])
# print(f"\n\nSelected dataset files:\n{cali_str}")
# global_files = selected_ds_files['global_pv_inventory_sent2_spot_2021']
# global_str = "\n".join([f"{i}: {os.path.relpath(f)}" for i, f in enumerate(global_files)])
# print(f"\n\nSelected dataset files:\n{global_str}")
# india_files = selected_ds_files['ind_pv_solar_farms_2022']
# india_str = "\n".join([f"{i}: {os.path.relpath(f)}" for i, f in enumerate(india_files)])
# print(f"\n\nSelected dataset files:\n{india_str}")

#### Global inventory of solar PV units (Kruitwagen et al, 2021)

From Zenodo:
```
Repository contents:

trn_tiles.geojson: 18,570 rectangular areas-of-interest used for sampling training patch data.

trn_polygons.geojson: 36,882 polygons obtained from OSM in 2017 used to label training patches.

cv_tiles.geojson: 560 rectangular areas-of-interest used for sampling cross-validation data seeded from WRI GPPDB

cv_polygons.geojson: 6,281 polygons corresponding to all PV solar generating units present in cv_tiles.geojson at the end of 2018.

test_tiles.geojson: 122 rectangular regions-of-interest used for building the test set.

test_polygons.geojson: 7,263 polygons corresponding to all utility-scale (>10kW) solar generating units present in test_tiles.geojson at the end of 2018.

predicted_polygons.geojson: 68,661 polygons corresponding to predicted polygons in global deployment, capturing the status of deployed photovoltaic solar energy generating capacity at the end of 2018.
```

In [48]:
# additional preprocessing specific to each dataset (mostly attaching any included metadata)
def global_pv_inventory_spot_processing(gdf, dataset_name, output_dir, subset_bbox=None, geom_type='Polygon', rm_invalid=True):
    all_cols = [
        'unique_id', 'area', 'confidence', 'install_date', 'iso-3166-1', 'iso-3166-2', 'gti', 'pvout', 'capacity_mw', 'match_id', 'wdpa_10km', 'LC_CLC300_1992', 'LC_CLC300_1993',
        'LC_CLC300_1994', 'LC_CLC300_1995', 'LC_CLC300_1996', 'LC_CLC300_1997', 'LC_CLC300_1998', 'LC_CLC300_1999', 'LC_CLC300_2000', 'LC_CLC300_2001', 'LC_CLC300_2002',
        'LC_CLC300_2003', 'LC_CLC300_2004', 'LC_CLC300_2005', 'LC_CLC300_2006', 'LC_CLC300_2007', 'LC_CLC300_2008', 'LC_CLC300_2009', 'LC_CLC300_2010', 'LC_CLC300_2011',
        'LC_CLC300_2012', 'LC_CLC300_2013', 'LC_CLC300_2014', 'LC_CLC300_2015', 'LC_CLC300_2016', 'LC_CLC300_2017', 'LC_CLC300_2018', 'mean_ai', 'GCR', 'eff', 'ILR',
        'area_error', 'lc_mode', 'lc_arid', 'lc_vis', 'geometry', 'aoi_idx', 'aoi', 'id', 'Country', 'Province', 'Project', 'WRI_ref', 'Polygon Source', 'Date', 'building',
        'operator', 'generator_source', 'amenity', 'landuse', 'power_source', 'shop', 'sport', 'tourism', 'way_area', 'access', 'construction', 'denomination', 'historic',
        'leisure', 'man_made', 'natural', 'ref', 'religion', 'surface', 'z_order', 'layer', 'name', 'barrier', 'addr_housenumber', 'office', 'power', 'osm_id', 'military'
    ]
    # remove unwanted columns
    keep_cols = ['geometry', 'unique_id', 'area', 'confidence', 'install_date', 'capacity_mw', 'iso-3166-2', 'pvout', 'osm_id', 'Project', 'construction']
    print(f"Filtering from {len(all_cols)} columns to {len(keep_cols)} columns:\n{keep_cols}")
    gdf = gdf[keep_cols]
    return gdf
def global_pv_inventory_sent2_processing(gdf, dataset_name, output_dir, subset_bbox=None, geom_type='Polygon', rm_invalid=True):
    return gdf
def india_pv_solar_farms_processing(gdf, dataset_name, output_dir, subset_bbox=None, geom_type='Polygon', rm_invalid=True):
    return gdf
def usa_cali_usgs_pv_processing(gdf, dataset_name, output_dir, subset_bbox=None, geom_type='Polygon', rm_invalid=True):
    return gdf
def usa_eia_large_scale_pv_processing(gdf, dataset_name, output_dir, subset_bbox=None, geom_type='Polygon', rm_invalid=True):
    return gdf
def usa_eia_large_scale_pv_processing(gdf, dataset_name, output_dir, subset_bbox=None, geom_type='Polygon', rm_invalid=True):
    return gdf

def filter_duplicates(gdf, geom_type='Polygon', overlap_thresh=0.75):
    """
    Remove duplicate geometries from a GeoDataFrame based on a specified overlap threshold.
    
    Args:
        gdf (GeoDataFrame): Input GeoDataFrame.
        geom_type (str): Geometry type to filter by. Default is 'Polygon'.
        overlap_thresh (float): Overlap threshold for removing duplicates. Default is 0.75.
        
    Returns:
        gdf (GeoDataFrame): GeoDataFrame with duplicates removed.
    """
    # First identify exact duplicates
    gdf = gdf.drop_duplicates('geometry')
    
    # Identify geometries that overlap substantially
    overlaps = []
    # Use spatial index for efficiency
    spatial_index = gdf.sindex
    
    for idx, geom in enumerate(gdf.geometry):
        # Find potential overlaps using the spatial index
        possible_matches = list(spatial_index.intersection(geom.bounds))
        # Remove self from matches
        if idx in possible_matches:
            possible_matches.remove(idx)
        
        for other_idx in possible_matches:
            other_geom = gdf.iloc[other_idx].geometry
            if geom.intersects(other_geom):
                # Calculate overlap percentage (relative to the smaller polygon)
                intersection_area = geom.intersection(other_geom).area
                min_area = min(geom.area, other_geom.area)
                overlap_percentage = intersection_area / min_area
                
                # If overlap is significant (e.g., >75%)
                if overlap_percentage > overlap_thresh:
                    # Keep the geometry with the larger area
                    if geom.area < other_geom.area:
                        overlaps.append(idx)
                    
                    else:
                        overlaps.append(other_idx)
                        break
    
    # Remove overlapping geometries
    if overlaps:
        print(f"Removing {len(overlaps)} geometries with >{overlap_thresh*100}% overlap")
        gdf = gdf.drop(gdf.index[overlaps]).reset_index(drop=True)
    
    return gdf

# basic processing for geojson, shapefiles, and already georeferenced data
def process_geojson(geojson_files, dataset_name, output_dir=None, subset_bbox=None, geom_type='Polygon', rm_invalid=True, overlap_thresh=0.75, out_fmt='geoparquet'):
    """
    Process a GeoJSON file and return a GeoDataFrame.
    
    Args:
        file_path (str): Path to the GeoJSON file.
        dataset_name (str): Name of the dataset.
        geom_type (str): Geometry type to filter by. Default is 'Polygon'.
        
    Returns:
        gdf (GeoDataFrame): Processed GeoDataFrame.
    """
    if output_dir and not os.path.exists(output_dir):
        os.makedirs(output_dir, exist_ok=True)
    ds_dataframes = []

    for fname in geojson_files:
        if fname.endswith('.geojson') or fname.endswith('.json'):
            # Check if the file is a valid GeoJSON
            try:
                gdf = gpd.read_file(fname)
            except Exception as e:
                print(f"Error reading {os.path.relpath(fname)}: {e}")
                continue
            ds_dataframes.append(gdf)
    
    if len(ds_dataframes) == 0:
        print(f"No valid GeoJSON files found in {dataset_name}.")
        print(f"Skipping dataset {dataset_name}")
        return None
        
    # Concatenate all dataframes into a single GeoDataFrame
    gdf = gpd.GeoDataFrame(pd.concat(ds_dataframes, ignore_index=True))
    # make sure the geometry column is included and named correctly
    if 'geometry' not in gdf.columns:
        gdf['geometry'] = gdf.geometry

    # Basic info about the dataset
    print(f"Loaded geodataframe with raw counts of {len(gdf)} PV installations")
    print(f"Coordinate reference system: {gdf.crs}")
    print(f"Available columns: {gdf.columns.tolist()}")
    
    # Add dataset name as a new column
    gdf['dataset'] = dataset_name
    
    # Convert to WGS84 if not already in that CRS
    if gdf.crs is not None and gdf.crs.to_string() != 'EPSG:4326':
        # convert to WGS84 in cases of other crs (eg NAD83 for Cali dataset)
        gdf = gdf.to_crs(epsg=4326)
    if subset_bbox is not None:
        # Filter the GeoDataFrame by the georeferenced bounding box
        gdf = gdf.cx[subset_bbox[0]:subset_bbox[2], subset_bbox[1]:subset_bbox[3]]
    
    # DQ and cleaning
    # check for missing and invalid geometries
    invalid_geoms = gdf[gdf.geometry.is_empty | ~gdf.geometry.is_valid]
    if len(invalid_geoms) > 0 and rm_invalid:
        print(f"Warning: {len(invalid_geoms)} invalid or empty geometries found and will be removed.")
        # Optionally remove invalid geometries
        gdf = gdf[~gdf.geometry.is_empty & gdf.geometry.is_valid].reset_index(drop=True)
    # Eliminating duplicates and geometries that overlap too much
    if geom_type == 'Polygon':
        gdf = filter_duplicates(gdf, geom_type=geom_type, overlap_thresh=overlap_thresh)

    # perform any additional processing specific to the dataset for metadata and other attributes
    if dataset_name == 'global_pv_inventory_sent2_2024':
        print("Processing global_pv_inventory_sent2_2024 metadata")
        gdf = global_pv_inventory_sent2_processing(gdf, dataset_name, output_dir, subset_bbox=subset_bbox, geom_type=geom_type)
    elif dataset_name == 'global_pv_inventory_sent2_spot_2021':
        print("Processing global_pv_inventory_sent2_spot_2021 metadata")
        gdf = global_pv_inventory_spot_processing(gdf, dataset_name, output_dir, subset_bbox=subset_bbox, geom_type=geom_type)
    elif dataset_name == 'ind_pv_solar_farms_2022':
        print("Processing ind_pv_solar_farms_2022 metadata")
        gdf = india_pv_solar_farms_processing(gdf, dataset_name, output_dir, subset_bbox=subset_bbox, geom_type=geom_type)
    elif dataset_name == 'usa_cali_usgs_pv_2016':
        print("Processing usa_cali_usgs_pv_2016 metadata")
        gdf = usa_cali_usgs_pv_processing(gdf, dataset_name, output_dir, subset_bbox=subset_bbox, geom_type=geom_type)
    elif dataset_name == 'usa_eia_large_scale_pv_2023':
        print("Processing usa_eia_large_scale_pv_2023 metadata")
        gdf = usa_eia_large_scale_pv_processing(gdf, dataset_name, output_dir, subset_bbox=subset_bbox, geom_type=geom_type)
    
    # add some basic geometry info
    # if not gdf.crs.is_geographic:
    #     gdf['area_m2'] = gdf.geometry.area
    # else:
    #     # todo: check if other crs is more appropriate
    gdf_proj = gdf['geometry'].to_crs(epsg=4326)
    gdf['area_m2'] = gdf_proj.geometry.area
    
    # gdf['centroid_lon'] = gdf.geometry.centroid.x
    gdf['centroid_lon'] = gdf_proj.geometry.centroid.x
    # gdf['centroid_lat'] = gdf.geometry.centroid.y
    gdf['centroid_lat'] = gdf_proj.geometry.centroid.y
    # use gpd conversion argument
    # gdf['bbox'] = gdf.geometry.apply(lambda geom: geom.bounds) 

    print(f"After filtering and cleaning, we have {len(gdf)} PV installations")
    print(f"Coordinate reference system: {gdf.crs}")
    print(f"Available columns: {gdf.columns.tolist()}")

    if output_dir:
        out_path = os.path.join(output_dir, f"{dataset_name}_processed.{out_fmt}")

        if out_fmt == 'geoparquet':
            gdf.to_parquet(out_path, 
                index=None, 
                compression='snappy',
                geometry_encoding='WKB', 
                write_covering_bbox=True,
                schema_version='1.1.0')
        else:
            gdf.to_file(out_path, driver='GeoJSON', index=None)
        print(f"Saved processed GeoDataFrame to {os.path.relpath(out_path)}")
    
    return gdf


In [None]:
random.shuffle(selected_datasets)
# go through the selected datasets and process them
for ds in selected_datasets:
    ds_files = selected_ds_files[ds]
    ds_dir = get_ds_dir(ds)
    out_dir = DATASET_DIR / 'raw' / 'labels' / 'geoparquet'
    print(f"Processing dataset {ds} with {len(ds_files)} files in {os.path.relpath(ds_dir)}")
    ds_gdf = process_geojson(
                geojson_files=ds_files,
                dataset_name=ds,
                output_dir=out_dir
    )
    if ds_gdf is not None:
        
        display(ds_gdf.describe())
        print(ds_gdf.info)
        display(ds_gdf)

Processing dataset usa_cali_usgs_pv_2016 with 4 files in datasets/raw/labels/usa_cali_usgs_pv_2016
Loaded geodataframe with raw counts of 19433 PV installations
Coordinate reference system: EPSG:4326
Available columns: ['polygon_id', 'centroid_latitude', 'centroid_longitude', 'centroid_latitude_pixels', 'centroid_longitude_pixels', 'city', 'area_pixels', 'area_meters', 'image_name', 'nw_corner_of_image_latitude', 'nw_corner_of_image_longitude', 'se_corner_of_image_latitude', 'se_corner_of_image_longitude', 'datum', 'projection_zone', 'resolution', 'jaccard_index', 'polygon_vertices_pixels', 'geometry']
Processing usa_cali_usgs_pv_2016 metadata
After filtering and cleaning, we have 19373 PV installations
Coordinate reference system: EPSG:4326
Available columns: ['polygon_id', 'centroid_latitude', 'centroid_longitude', 'centroid_latitude_pixels', 'centroid_longitude_pixels', 'city', 'area_pixels', 'area_meters', 'image_name', 'nw_corner_of_image_latitude', 'nw_corner_of_image_longitude',

  return lib.intersects(a, b, **kwargs)
  return lib.intersection(a, b, **kwargs)

  gdf['area_m2'] = gdf_proj.geometry.area

  gdf['centroid_lon'] = gdf_proj.geometry.centroid.x

  gdf['centroid_lat'] = gdf_proj.geometry.centroid.y


Unnamed: 0,polygon_id,centroid_latitude,centroid_longitude,centroid_latitude_pixels,centroid_longitude_pixels,area_pixels,area_meters,nw_corner_of_image_latitude,nw_corner_of_image_longitude,se_corner_of_image_latitude,se_corner_of_image_longitude,resolution,jaccard_index,area_m2,centroid_lon,centroid_lat
count,19373.0,19373.0,19373.0,19373.0,19373.0,19373.0,19373.0,19373.0,19373.0,19373.0,19373.0,19373.0,19373.0,19373.0,19373.0,19373.0
mean,9809.361276,36.784912,-119.947159,2385.155993,2588.654589,445.239668,40.07157,36.791225,-119.956025,36.778193,-119.938606,0.3,0.605667,4.056519e-09,-119.947159,36.784912
std,5697.759931,0.87445,0.597423,1483.504468,1472.819678,1377.865131,124.007862,0.874867,0.597034,0.87417,0.597506,1.065286e-13,0.388086,1.259295e-08,0.597423,0.87445
min,1.0,34.14519,-121.380875,2.667675,2.878649,4.229101,0.380619,34.149319,-121.382306,34.138531,-121.365522,0.3,0.0,3.886139e-11,-121.380875,34.14519
25%,4889.0,36.787572,-119.856832,1065.071283,1355.111998,131.249715,11.812474,36.793267,-119.864906,36.78015,-119.847606,0.3,0.0,1.194895e-09,-119.856832,36.787572
50%,9744.0,36.832362,-119.754707,2261.083358,2592.542739,223.570045,20.121304,36.835711,-119.765072,36.822581,-119.747778,0.3,0.822004,2.034797e-09,-119.754707,36.832362
75%,14655.0,36.868902,-119.685668,3732.191838,3806.929754,379.341839,34.140766,36.876231,-119.694044,36.863103,-119.676789,0.3,0.888372,3.454263e-09,-119.685668,36.868902
max,19863.0,38.068118,-119.133756,4997.249972,5987.691893,68384.5702,6154.611318,38.070422,-119.151072,38.056661,-119.131089,0.3,1.0,6.319998e-07,-119.133756,38.068118


Unnamed: 0,polygon_id,centroid_latitude,centroid_longitude,centroid_latitude_pixels,centroid_longitude_pixels,city,area_pixels,area_meters,image_name,nw_corner_of_image_latitude,...,datum,projection_zone,resolution,jaccard_index,polygon_vertices_pixels,geometry,dataset,area_m2,centroid_lon,centroid_lat
0,1,36.92631,-119.840555,107.618458,3286.151487,Fresno,1513.254134,136.192872,11ska460890,36.926336,...,NAD83,11,0.3,0.91402,"[ [ 3360.4950690000001, 131.63116400000001 ], ...","POLYGON ((-119.8403 36.92625, -119.84068 36.92...",usa_cali_usgs_pv_2016,1.376423e-08,-119.840555,36.92631
1,2,36.926477,-119.840561,45.977659,3286.352946,Fresno,1727.907934,155.511714,11ska460890,36.926336,...,NAD83,11,0.3,0.829071,"[ [ 3361.1538460000002, 69.615385000000003 ], ...","POLYGON ((-119.84031 36.92642, -119.8408 36.92...",usa_cali_usgs_pv_2016,1.571668e-08,-119.840561,36.926477
2,3,36.926542,-119.840506,22.280851,3303.465657,Fresno,1242.184349,111.796591,11ska460890,36.926336,...,NAD83,11,0.3,0.937961,"[ [ 3358.0157260000001, 48.136862999999998 ], ...","POLYGON ((-119.84032 36.92648, -119.84032 36.9...",usa_cali_usgs_pv_2016,1.129864e-08,-119.840506,36.926542
3,4,36.921008,-119.842847,2048.362567,2547.366116,Fresno,688.93342,62.004008,11ska460890,36.926336,...,NAD83,11,0.3,0.842634,"[ [ 2571.5917159999999, 2068.0493099999999 ], ...","POLYGON ((-119.84276 36.92096, -119.84277 36.9...",usa_cali_usgs_pv_2016,6.26639e-09,-119.842847,36.921008
4,5,36.920976,-119.842906,2060.01489,2529.504997,Fresno,1060.890554,95.48015,11ska460890,36.926336,...,NAD83,11,0.3,0.890998,"[ [ 2563.7810650000001, 2091.3984220000002 ], ...","POLYGON ((-119.84279 36.92089, -119.84299 36.9...",usa_cali_usgs_pv_2016,9.649632e-09,-119.842906,36.920976


Processing dataset ind_pv_solar_farms_2022 with 1 files in datasets/raw/labels/ind_pv_solar_farms_2022
Loaded geodataframe with raw counts of 1363 PV installations
Coordinate reference system: EPSG:4326
Available columns: ['State', 'Area', 'Latitude', 'Longitude', 'fid', 'geometry']
Processing ind_pv_solar_farms_2022 metadata
After filtering and cleaning, we have 1285 PV installations
Coordinate reference system: EPSG:4326
Available columns: ['State', 'Area', 'Latitude', 'Longitude', 'fid', 'geometry', 'dataset', 'area_m2', 'centroid_lon', 'centroid_lat']
Saved processed GeoDataFrame to datasets/raw/labels/geoparquet/ind_pv_solar_farms_2022_processed.geoparquet


  return lib.intersects(a, b, **kwargs)
  return lib.intersection(a, b, **kwargs)

  gdf['area_m2'] = gdf_proj.geometry.area

  gdf['centroid_lon'] = gdf_proj.geometry.centroid.x

  gdf['centroid_lat'] = gdf_proj.geometry.centroid.y


Unnamed: 0,Area,Latitude,Longitude,fid,area_m2,centroid_lon,centroid_lat
count,1285.0,1285.0,1285.0,1285.0,1285.0,1285.0,1285.0
mean,849996.5,18.544576,77.210345,713.745525,1.770171e-05,77.208736,18.540253
std,1908736.0,5.722106,2.792198,426.892108,9.321063e-05,2.790056,5.725651
min,813.5934,8.53558,69.025734,1.0,2.698065e-08,69.025705,8.535348
25%,119013.1,14.169228,75.875751,355.0,1.87539e-06,75.875809,14.169181
50%,305394.7,17.556943,77.392034,706.0,5.318159e-06,77.392046,17.556284
75%,765914.8,23.038854,78.350397,1063.0,1.646252e-05,78.348523,23.038629
max,30744920.0,31.962736,91.276737,4421.0,0.002451388,91.276729,31.961923


Unnamed: 0,State,Area,Latitude,Longitude,fid,geometry,dataset,area_m2,centroid_lon,centroid_lat
0,Karnataka,307270.1,13.094437,78.284459,677,"MULTIPOLYGON (((78.28741 13.09153, 78.2881 13....",ind_pv_solar_farms_2022,4e-06,78.286072,13.092826
1,Karnataka,139467.5,13.08387,78.292082,678,"MULTIPOLYGON (((78.29433 13.07967, 78.29461 13...",ind_pv_solar_farms_2022,7e-06,78.291715,13.082543
2,Karnataka,549756.2,13.121554,78.343837,675,"MULTIPOLYGON (((78.3433 13.12249, 78.34427 13....",ind_pv_solar_farms_2022,2e-06,78.343961,13.121596
3,Karnataka,524094.8,13.747866,77.552639,676,"MULTIPOLYGON (((77.55266 13.74717, 77.55364 13...",ind_pv_solar_farms_2022,2e-06,77.552326,13.747899
4,Karnataka,1701728.0,13.551716,77.514227,665,"MULTIPOLYGON (((77.50913 13.55098, 77.51162 13...",ind_pv_solar_farms_2022,1.6e-05,77.512698,13.550867


Processing dataset global_pv_inventory_sent2_spot_2021 with 4 files in datasets/raw/labels/global_pv_inventory_sent2_spot_2021




Loaded geodataframe with raw counts of 119087 PV installations
Coordinate reference system: EPSG:4326
Available columns: ['unique_id', 'area', 'confidence', 'install_date', 'iso-3166-1', 'iso-3166-2', 'gti', 'pvout', 'capacity_mw', 'match_id', 'wdpa_10km', 'LC_CLC300_1992', 'LC_CLC300_1993', 'LC_CLC300_1994', 'LC_CLC300_1995', 'LC_CLC300_1996', 'LC_CLC300_1997', 'LC_CLC300_1998', 'LC_CLC300_1999', 'LC_CLC300_2000', 'LC_CLC300_2001', 'LC_CLC300_2002', 'LC_CLC300_2003', 'LC_CLC300_2004', 'LC_CLC300_2005', 'LC_CLC300_2006', 'LC_CLC300_2007', 'LC_CLC300_2008', 'LC_CLC300_2009', 'LC_CLC300_2010', 'LC_CLC300_2011', 'LC_CLC300_2012', 'LC_CLC300_2013', 'LC_CLC300_2014', 'LC_CLC300_2015', 'LC_CLC300_2016', 'LC_CLC300_2017', 'LC_CLC300_2018', 'mean_ai', 'GCR', 'eff', 'ILR', 'area_error', 'lc_mode', 'lc_arid', 'lc_vis', 'geometry', 'aoi_idx', 'aoi', 'id', 'Country', 'Province', 'Project', 'WRI_ref', 'Polygon Source', 'Date', 'building', 'operator', 'generator_source', 'amenity', 'landuse', 'power

  return lib.intersects(a, b, **kwargs)
  return lib.intersection(a, b, **kwargs)


Removing 41613 geometries with >75.0% overlap
Processing global_pv_inventory_sent2_spot_2021 metadata
Filtering from 85 columns to 11 columns:
['geometry', 'unique_id', 'area', 'confidence', 'install_date', 'capacity_mw', 'iso-3166-2', 'pvout', 'osm_id', 'Project', 'construction']



  gdf['area_m2'] = gdf_proj.geometry.area

  gdf['centroid_lon'] = gdf_proj.geometry.centroid.x

  gdf['centroid_lat'] = gdf_proj.geometry.centroid.y


After filtering and cleaning, we have 85375 PV installations
Coordinate reference system: EPSG:4326
Available columns: ['geometry', 'unique_id', 'area', 'confidence', 'install_date', 'capacity_mw', 'iso-3166-2', 'pvout', 'osm_id', 'Project', 'construction', 'area_m2', 'centroid_lon', 'centroid_lat']


ArrowInvalid: ("Could not convert 'yes' with type str: tried to convert to double", 'Conversion failed for column area with type object')

In [46]:
selected_datasets[2]

'global_pv_inventory_sent2_spot_2021'

In [37]:
random.shuffle.__doc__

'Shuffle list x in place, and return None.\n\n        Optional argument random is a 0-argument function returning a\n        random float in [0.0, 1.0); if it is the default None, the\n        standard random.random will be used.\n\n        '

#### France West Europe PV Installations 2023

From [research publication](https://doi.org/10.1038/s41597-023-01951-4): 
```
The Git repository contains the raw crowdsourcing data and all the material necessary to re-generate our training dataset and technical validation.  
It is structured as follows: the raw subfolder contains the raw annotation data from the two annotation campaigns and the raw PV installations’ metadata.  
The replication subfolder contains the compiled data used to generate our segmentation masks.  
The validation subfolder contains the compiled data necessary to replicate the analyses presented in the technical validation section.
```

We will be using the `replication` subfolder to generate our PV polygons geojson file.

In [None]:
print(selected_metadata['fra_west_eur_pv_installations_2023'].keys())
print(selected_metadata['fra_west_eur_pv_installations_2023']['files'])
data_out = Path(os.getenv('DATA_PATH'))
fra_files = selected_metadata['fra_west_eur_pv_installations_2023']['files']
fra_out = selected_metadata['fra_west_eur_pv_installations_2023']['output_dir']
ds_sub = os.path.join(fra_out, 'replication')
fra_sub_files = '\n'.join([os.path.relpath(f, data_out) for f in fra_files if f.startswith(ds_sub)])
print(f"Subdir files:\n{fra_sub_files}")

In [None]:
# bespoke pre-processing for datsets not directly available in geojson or shapefile format
# parse the point or polygon json files with geopandas, transform raw polygons or points features into proper geometry for geojson conversion
from shapely.geometry import Polygon, Point, MultiPolygon
import json

# TODO: make function for processing of france json geometries

def france_eur_pv_preprocess(ds_metadata, ds_subdir, metadata_dir='raw', crs=None, geom_type='Polygon'):
    ds_dir = Path(ds_metadata['output_dir'])
    data_dir = ds_dir / ds_subdir
    metadata_file = 'raw-metadata_df.csv' if metadata_dir == 'raw' else 'metadata_df.csv'
    metadata_file = ds_dir / metadata_dir / metadata_file
    coords_file = "polygon-analysis.json" if geom_type == 'Polygon' else "point-analysis.json"
    # keep files that are in the specified subdir and have the above filename
    geom_files = [fpath for fpath in ds_metadata['files'] if fpath.startswith(data_dir) and fpath.endswith(coords_file)]
    crs = crs or 'EPSG:4326' # default to WGS84

    # load the metadata file
    metadata_df = pd.read_csv(metadata_file)
    print(f"Loaded '{metadata_file.split('/')[-1]}' with {len(metadata_df)} rows")

    # load into geopandas, inspect the data, and add metadata_df to separate pd dataframe
    raw_features = []
    for geom_file_path in geom_files:
        campaign_name = Path(geom_file_path).parent.name
        print(f"Processing {campaign_name} campaign...")
        
        with open(geom_file_path, 'r') as f:
            geom_data = json.load(f)
        feat_types = set([f['type'] for f in geom_data])
        print(f"Feature types: {feat_types}")
    
        for idx, feature_dict in enumerate(geom_data):
            # Skip empty dictionaries
            if not feature_dict:
                continue
            
            try:
                feature_id = feature_dict.get('id', idx) # Use index if ID is not present

                # extract geometry and coordinates
                if geom_type == 'Polygon':
                    # feat_dict = [{'polygons': [{'points': {'x': <px_coord>, 'y': <px_coord>}, ...}]}, ...]
                    coords = feature_dict['polygons']
                    if isinstance(coords, list) and len(coords) > 0:
                            # Handle multiple polygons
                            polygons = []
                            for poly_coords in coords:
                                if len(poly_coords) >= 3:  # Need at least 3 points for a polygon
                                    polygons.append(Polygon(poly_coords))
                            
                            if len(polygons) == 1:
                                geometry = polygons[0]
                            else:
                                geometry = MultiPolygon(polygons)
                                
                            # Create feature dictionary with properties
                            feature = {
                                'id': feature_id,
                                'campaign': campaign_name,
                                'geometry': geometry
                            }
                    raw_features.append(feature)
                elif geom_type == 'Point':
                    # feat_dict = [{'clicks': [{'@type': 'Point', 'x': <px_coord>, 'y': <px_coord>}, ...]}, ...]
                    coords = feature_dict['clicks']
                    if isinstance(coords, list) and len(coords) > 0:
                        points = []
                        for point_coords in coords:
                            if 'x' not in point_coords or 'y' not in point_coords:
                                continue
                            else:
                                points.append(Point(point_coords['x'], point_coords['y']))
                    raw_features.extend(points)
            except Exception as e:
                print(f"Error processing feature {feature_dict}: {e}")
                continue

    if raw_features:
        # Convert to GeoDataFrame
        gdf = gpd.GeoDataFrame(raw_features, crs=crs)
        # add metadata to the gdf
        if 'id' in gdf.columns:
            gdf['id'] = gdf['id'].astype(str)
        # Ensure CRS is set
        if gdf.crs is None:
            gdf.set_crs(crs, inplace=True)
        elif str(gdf.crs) != crs:
            gdf = gdf.to_crs(crs)
        # need to add geotransform if available to convert pixel coords to lat/lon

        # gdf['source_dataset'] = add in calling function
    
    return gdf, metadata_df

In [None]:
# import the different datasets and convert to geoparquet with geopandas
def gdp_load_and_gpq_convert(dataset_name, label_fmt='geojson', crs=None, geom_type='Polygon'):

    crs = crs or 'EPSG:4326' # default to WGS84 if crs is None
    

In [None]:
t.tree.__doc__

# Visualization Functions for PV Data

After processing the datasets into standardized geoparquet format, we'll use the following visualization libraries to explore and present the data:

- **Folium**: For interactive web maps with various basemaps and markers
- **Pydeck**: For high-performance 3D and large-scale visualizations
- **Lonboard**: For GPU-accelerated geospatial visualization of large datasets

Each library has specific strengths that we'll leverage for different visualization needs.

## Folium Visualization Functions

Folium is excellent for creating interactive web maps with various basemaps and markers. It's particularly useful for visualizing geographic distributions and creating choropleth maps.

In [None]:
def create_folium_cluster_map(gdf, zoom_start=3, title="PV Installation Clusters"):
    """
    Create a cluster map of PV installations using Folium.
    
    Parameters:
    -----------
    gdf : GeoDataFrame
        GeoDataFrame containing PV installation data with geometry column
    zoom_start : int
        Initial zoom level for the map
    title : str
        Title for the map
        
    Returns:
    --------
    folium.Map
        Interactive Folium map with clustered markers
    """
    # Ensure the GeoDataFrame is in WGS84 (EPSG:4326) for Folium compatibility
    if gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")
    
    # Get centroid of all points to center the map
    center_lat = gdf.geometry.centroid.y.mean()
    center_lon = gdf.geometry.centroid.x.mean()
    
    # Create base map
    m = folium.Map(location=[center_lat, center_lon], zoom_start=zoom_start,
                  tiles='CartoDB positron')
    
    # Add title
    title_html = f'''
             <h3 align="center" style="font-size:16px"><b>{title}</b></h3>
             '''
    m.get_root().html.add_child(folium.Element(title_html))
    
    # Add marker cluster
    marker_cluster = folium.plugins.MarkerCluster().add_to(m)
    
    # Add markers for each PV installation
    for idx, row in gdf.iterrows():
        # Get the centroid if the geometry is a polygon
        if row.geometry.geom_type in ['Polygon', 'MultiPolygon']:
            centroid = row.geometry.centroid
            popup_text = f"ID: {idx}"
            
            # Add additional information if available in the dataframe
            for col in ['capacity_mw', 'area_sqm', 'installation_date', 'source_dataset']:
                if col in gdf.columns:
                    popup_text += f"<br>{col}: {row[col]}"
            
            folium.Marker(
                location=[centroid.y, centroid.x],
                popup=folium.Popup(popup_text, max_width=300),
                icon=folium.Icon(color='green', icon='solar-panel', prefix='fa')
            ).add_to(marker_cluster)
    
    return m

def create_folium_choropleth(gdf, column, bins=8, cmap='YlOrRd', 
                             title="PV Installation Density"):
    """
    Create a choropleth map of PV installations using Folium.
    
    Parameters:
    -----------
    gdf : GeoDataFrame
        GeoDataFrame containing PV installation data with geometry column
    column : str
        Column name to use for choropleth coloring
    bins : int
        Number of bins for choropleth map
    cmap : str
        Matplotlib colormap name
    title : str
        Title for the map
        
    Returns:
    --------
    folium.Map
        Interactive Folium choropleth map
    """
    # Ensure the GeoDataFrame is in WGS84 (EPSG:4326)
    if gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")
    
    # Get centroid of all points to center the map
    center_lat = gdf.geometry.centroid.y.mean()
    center_lon = gdf.geometry.centroid.x.mean()
    
    # Create base map
    m = folium.Map(location=[center_lat, center_lon], zoom_start=3,
                  tiles='CartoDB positron')
    
    # Add title
    title_html = f'''
             <h3 align="center" style="font-size:16px"><b>{title}</b></h3>
             '''
    m.get_root().html.add_child(folium.Element(title_html))
    
    # Create choropleth layer
    folium.Choropleth(
        geo_data=gdf,
        name='choropleth',
        data=gdf,
        columns=[gdf.index.name if gdf.index.name else 'index', column],
        key_on='feature.id',
        fill_color=cmap,
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name=column,
        bins=bins
    ).add_to(m)
    
    # Add hover functionality
    style_function = lambda x: {'fillColor': '#ffffff', 
                                'color': '#000000', 
                                'fillOpacity': 0.1, 
                                'weight': 0.1}
    highlight_function = lambda x: {'fillColor': '#000000', 
                                    'color': '#000000', 
                                    'fillOpacity': 0.5, 
                                    'weight': 0.1}
    
    # Add tooltips
    folium.GeoJson(
        gdf,
        style_function=style_function,
        highlight_function=highlight_function,
        tooltip=folium.GeoJsonTooltip(
            fields=[column],
            aliases=[column.replace('_', ' ').title()],
            style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;")
        )
    ).add_to(m)
    
    # Add layer control
    folium.LayerControl().add_to(m)
    
    return m

def create_folium_heatmap(gdf, intensity_column=None, radius=15, 
                          title="PV Installation Heatmap"):
    """
    Create a heatmap of PV installations using Folium.
    
    Parameters:
    -----------
    gdf : GeoDataFrame
        GeoDataFrame containing PV installation data with geometry column
    intensity_column : str, optional
        Column name to use for heatmap intensity; if None, all points have equal weight
    radius : int
        Radius for heatmap points (in pixels)
    title : str
        Title for the map
        
    Returns:
    --------
    folium.Map
        Interactive Folium heatmap
    """
    # Ensure the GeoDataFrame is in WGS84 (EPSG:4326)
    if gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")
    
    # Get centroids for all geometries
    if any(gdf.geometry.geom_type.isin(['Polygon', 'MultiPolygon'])):
        centroids = gdf.geometry.centroid
    else:
        centroids = gdf.geometry
    
    # Get coordinates for heatmap
    heat_data = [[point.y, point.x] for point in centroids]
    
    # Add intensity if specified
    if intensity_column and intensity_column in gdf.columns:
        heat_data = [[point.y, point.x, float(intensity)] 
                    for point, intensity in zip(centroids, gdf[intensity_column])]
    
    # Get centroid of all points to center the map
    center_lat = sum(point[0] for point in heat_data) / len(heat_data)
    center_lon = sum(point[1] for point in heat_data) / len(heat_data)
    
    # Create base map
    m = folium.Map(location=[center_lat, center_lon], zoom_start=4,
                  tiles='CartoDB positron')
    
    # Add title
    title_html = f'''
             <h3 align="center" style="font-size:16px"><b>{title}</b></h3>
             '''
    m.get_root().html.add_child(folium.Element(title_html))
    
    # Add heatmap layer
    folium.plugins.HeatMap(
        heat_data,
        radius=radius,
        blur=10,
        gradient={0.4: 'blue', 0.65: 'lime', 1: 'red'}
    ).add_to(m)
    
    return m

## PyDeck Visualization Functions

PyDeck is excellent for high-performance 3D visualizations and handling large datasets. It's particularly useful for creating layered maps with multiple types of data.

In [None]:
def create_pydeck_scatterplot(gdf, color_column=None, size_scale=100, 
                             title="PV Installation Map"):
    """
    Create a scatterplot of PV installations using PyDeck.
    
    Parameters:
    -----------
    gdf : GeoDataFrame
        GeoDataFrame containing PV installation data with geometry column
    color_column : str, optional
        Column name to use for point coloring
    size_scale : float
        Scaling factor for point size
    title : str
        Title for the map
        
    Returns:
    --------
    pydeck.Deck
        Interactive PyDeck map with scatterplot layer
    """
    # Ensure the GeoDataFrame is in WGS84 (EPSG:4326)
    if gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")
    
    # Convert to DataFrame with lat/lon columns
    df = pd.DataFrame({
        'lat': gdf.geometry.centroid.y,
        'lon': gdf.geometry.centroid.x
    })
    
    # Add additional columns from the original GeoDataFrame
    for col in gdf.columns:
        if col != 'geometry':
            df[col] = gdf[col]
    
    # Handle color mapping
    if color_column and color_column in df.columns:
        # Check if the column is numeric
        if pd.api.types.is_numeric_dtype(df[color_column]):
            color_scale = [
                [0, [65, 182, 196]],
                [0.33, [127, 205, 187]],
                [0.66, [199, 233, 180]],
                [1, [237, 248, 177]]
            ]
            
            # Normalize the values
            df['color_value'] = (df[color_column] - df[color_column].min()) / (df[color_column].max() - df[color_column].min())
            get_color = f"[r, g, b]"
            
            # Create a calculated color column using the scale
            df['r'] = df['color_value'].apply(lambda x: int(
                np.interp(x, [scale[0] for scale in color_scale], [scale[1][0] for scale in color_scale])
            ))
            df['g'] = df['color_value'].apply(lambda x: int(
                np.interp(x, [scale[0] for scale in color_scale], [scale[1][1] for scale in color_scale])
            ))
            df['b'] = df['color_value'].apply(lambda x: int(
                np.interp(x, [scale[0] for scale in color_scale], [scale[1][2] for scale in color_scale])
            ))
            
        else:
            # For categorical data, use hash of category for color
            unique_cats = df[color_column].unique()
            color_map = {cat: [int(h) % 256 for h in str(hash(cat))[:3]] for cat in unique_cats}
            df['r'] = df[color_column].map(lambda x: color_map[x][0])
            df['g'] = df[color_column].map(lambda x: color_map[x][1])
            df['b'] = df[color_column].map(lambda x: color_map[x][2])
            
        get_color = "[r, g, b]"
    else:
        # Default color: green for solar panels
        get_color = "[0, 128, 0]"  # Green
    
    # Calculate point size - use area if available
    if 'area_sqm' in df.columns:
        get_size = f"Math.sqrt(area_sqm) * {size_scale/100}"
    elif 'capacity_mw' in df.columns:
        get_size = f"Math.sqrt(capacity_mw) * {size_scale/10}"
    else:
        get_size = str(size_scale)
    
    # Create ScatterplotLayer
    layer = pdk.Layer(
        'ScatterplotLayer',
        df,
        get_position=['lon', 'lat'],
        get_radius=get_size,
        get_fill_color=get_color,
        pickable=True,
        opacity=0.8,
        stroked=True,
        filled=True
    )
    
    # Set initial view state to center on data
    view_state = pdk.ViewState(
        longitude=df['lon'].mean(),
        latitude=df['lat'].mean(),
        zoom=3,
        pitch=0
    )
    
    # Create tooltip
    tooltip = {
        "html": "<b>ID:</b> {index}<br>"
    }
    
    # Add additional fields to tooltip if available
    for col in ['capacity_mw', 'area_sqm', 'installation_date', 'source_dataset']:
        if col in df.columns:
            tooltip["html"] += f"<b>{col.replace('_', ' ').title()}:</b> {{{col}}}<br>"
    
    if color_column:
        tooltip["html"] += f"<b>{color_column.replace('_', ' ').title()}:</b> {{{color_column}}}"
    
    # Create deck
    deck = pdk.Deck(
        layers=[layer],
        initial_view_state=view_state,
        tooltip=tooltip,
        map_style='light'
    )
    
    return deck

def create_pydeck_polygons(gdf, color_column=None, extrusion_column=None, 
                          extrusion_scale=100, title="PV Installation Map"):
    """
    Create a 3D polygon map of PV installations using PyDeck.
    
    Parameters:
    -----------
    gdf : GeoDataFrame
        GeoDataFrame containing PV installation data with Polygon geometry
    color_column : str, optional
        Column name to use for polygon coloring
    extrusion_column : str, optional
        Column name to use for polygon height extrusion
    extrusion_scale : float
        Scaling factor for extrusion height
    title : str
        Title for the map
        
    Returns:
    --------
    pydeck.Deck
        Interactive PyDeck map with polygon layer
    """
    # Ensure the GeoDataFrame is in WGS84 (EPSG:4326)
    if gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")
    
    # Filter to only include polygons
    poly_gdf = gdf[gdf.geometry.geom_type.isin(['Polygon', 'MultiPolygon'])]
    
    if len(poly_gdf) == 0:
        return "Error: No polygon geometries found in the GeoDataFrame."
    
    # Convert to a format PyDeck can use
    polygon_data = []
    
    for idx, row in poly_gdf.iterrows():
        geom = row.geometry
        
        # Handle both Polygon and MultiPolygon
        polygons = [geom] if geom.geom_type == 'Polygon' else list(geom.geoms)
        
        for poly in polygons:
            # Extract exterior coordinates
            exterior_coords = list(poly.exterior.coords)
            coords = [[point[0], point[1]] for point in exterior_coords]
            
            # Create a feature for each polygon
            feature = {
                'polygon': coords,
                'index': idx
            }
            
            # Add additional properties
            for col in poly_gdf.columns:
                if col != 'geometry':
                    feature[col] = row[col] if not pd.isna(row[col]) else None
            
            polygon_data.append(feature)
    
    # Create DataFrame from polygon data
    df = pd.DataFrame(polygon_data)
    
    # Handle color mapping
    if color_column and color_column in df.columns:
        # Check if the column is numeric
        if pd.api.types.is_numeric_dtype(df[color_column]):
            color_scale = [
                [0, [65, 182, 196]],
                [0.33, [127, 205, 187]],
                [0.66, [199, 233, 180]],
                [1, [237, 248, 177]]
            ]
            
            # Normalize the values
            df['color_value'] = (df[color_column] - df[color_column].min()) / (df[color_column].max() - df[color_column].min())
            
            # Create a calculated color column using the scale
            df['r'] = df['color_value'].apply(lambda x: int(
                np.interp(x, [scale[0] for scale in color_scale], [scale[1][0] for scale in color_scale])
            ))
            df['g'] = df['color_value'].apply(lambda x: int(
                np.interp(x, [scale[0] for scale in color_scale], [scale[1][1] for scale in color_scale])
            ))
            df['b'] = df['color_value'].apply(lambda x: int(
                np.interp(x, [scale[0] for scale in color_scale], [scale[1][2] for scale in color_scale])
            ))
            
            get_color = "[r, g, b]"
        else:
            # For categorical data, use hash of category for color
            unique_cats = df[color_column].unique()
            color_map = {cat: [int(h) % 256 for h in str(hash(str(cat)))[:3]] for cat in unique_cats}
            df['r'] = df[color_column].map(lambda x: color_map[x][0])
            df['g'] = df[color_column].map(lambda x: color_map[x][1])
            df['b'] = df[color_column].map(lambda x: color_map[x][2])
            
            get_color = "[r, g, b]"
    else:
        # Default color: green for solar panels
        get_color = "[0, 128, 0]"  # Green
    
    # Handle extrusion
    if extrusion_column and extrusion_column in df.columns:
        get_elevation = f"{extrusion_column} * {extrusion_scale}"
    else:
        get_elevation = str(extrusion_scale)
    
    # Create PolygonLayer
    layer = pdk.Layer(
        'PolygonLayer',
        df,
        get_polygon='polygon',
        get_fill_color=get_color,
        get_elevation=get_elevation,
        elevation_scale=1,
        extruded=True,
        filled=True,
        wireframe=True,
        pickable=True,
        opacity=0.6,
        auto_highlight=True
    )
    
    # Find center of polygons for the view
    all_coords = []
    for poly in df['polygon']:
        all_coords.extend(poly)
    
    center_lon = np.mean([coord[0] for coord in all_coords])
    center_lat = np.mean([coord[1] for coord in all_coords])
    
    # Set initial view state
    view_state = pdk.ViewState(
        longitude=center_lon,
        latitude=center_lat,
        zoom=10,
        pitch=45,
        bearing=0
    )
    
    # Create tooltip
    tooltip = {
        "html": "<b>ID:</b> {index}<br>"
    }
    
    # Add additional fields to tooltip if available
    for col in ['capacity_mw', 'area_sqm', 'installation_date', 'source_dataset']:
        if col in df.columns:
            tooltip["html"] += f"<b>{col.replace('_', ' ').title()}:</b> {{{col}}}<br>"
    
    if color_column:
        tooltip["html"] += f"<b>{color_column.replace('_', ' ').title()}:</b> {{{color_column}}}<br>"
    
    if extrusion_column:
        tooltip["html"] += f"<b>{extrusion_column.replace('_', ' ').title()}:</b> {{{extrusion_column}}}"
    
    # Create deck
    deck = pdk.Deck(
        layers=[layer],
        initial_view_state=view_state,
        tooltip=tooltip,
        map_style='light'
    )
    
    return deck

def create_pydeck_heatmap(gdf, weight_column=None, intensity=1, radius=1000,
                         title="PV Installation Heatmap"):
    """
    Create a heatmap of PV installations using PyDeck.
    
    Parameters:
    -----------
    gdf : GeoDataFrame
        GeoDataFrame containing PV installation data with geometry column
    weight_column : str, optional
        Column name to use for heatmap weighting
    intensity : float
        Intensity of the heatmap
    radius : float
        Radius of influence for each point (in meters)
    title : str
        Title for the map
        
    Returns:
    --------
    pydeck.Deck
        Interactive PyDeck heatmap
    """
    # Ensure the GeoDataFrame is in WGS84 (EPSG:4326)
    if gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")
    
    # Convert to DataFrame with lat/lon columns
    df = pd.DataFrame({
        'lat': gdf.geometry.centroid.y,
        'lon': gdf.geometry.centroid.x
    })
    
    # Add weight column if specified
    if weight_column and weight_column in gdf.columns:
        df['weight'] = gdf[weight_column]
        get_weight = 'weight'
    else:
        get_weight = 1
    
    # Create HeatmapLayer
    layer = pdk.Layer(
        'HeatmapLayer',
        df,
        get_position=['lon', 'lat'],
        get_weight=get_weight,
        pickable=False,
        opacity=0.8,
        radius_pixels=radius/100,  # Convert meters to pixels roughly
        intensity=intensity,
        threshold=0.05,
        color_range=[
            [1, 152, 189],
            [73, 227, 206],
            [216, 254, 181],
            [254, 237, 177],
            [254, 173, 84],
            [209, 55, 78]
        ]
    )
    
    # Set initial view state to center on data
    view_state = pdk.ViewState(
        longitude=df['lon'].mean(),
        latitude=df['lat'].mean(),
        zoom=3,
        pitch=0
    )
    
    # Create deck
    deck = pdk.Deck(
        layers=[layer],
        initial_view_state=view_state,
        map_style='light'
    )
    
    return deck

## Lonboard Visualization Functions

Lonboard is a GPU-accelerated geospatial visualization library that's excellent for handling very large datasets. It's particularly useful for creating high-performance interactive visualizations of millions of data points.

In [None]:
def create_lonboard_map(gdf, color_column=None, size_column=None, size_scale=1,
                       title="PV Installation Map"):
    """
    Create an interactive map of PV installations using Lonboard.
    
    Parameters:
    -----------
    gdf : GeoDataFrame
        GeoDataFrame containing PV installation data with geometry column
    color_column : str, optional
        Column name to use for point coloring
    size_column : str, optional
        Column name to use for point sizing
    size_scale : float
        Scaling factor for point size
    title : str
        Title for the map
        
    Returns:
    --------
    lonboard.Map
        Interactive Lonboard map
    """
    # Ensure the GeoDataFrame is in WGS84 (EPSG:4326)
    if gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")
    
    # Handle color mapping if specified
    if color_column and color_column in gdf.columns:
        color = gdf[color_column]
    else:
        color = None
    
    # Handle size mapping if specified
    if size_column and size_column in gdf.columns:
        size = gdf[size_column] * size_scale
    else:
        size = size_scale
    
    # Create the map
    m = lonboard.Map()
    
    # Handle different geometry types
    if all(gdf.geometry.geom_type.isin(['Point'])):
        # For point geometries
        m.add_layer(
            lonboard.ScatterplotLayer(
                gdf,
                get_color=color,
                get_radius=size,
                opacity=0.8,
                pickable=True,
                auto_highlight=True
            )
        )
    elif all(gdf.geometry.geom_type.isin(['Polygon', 'MultiPolygon'])):
        # For polygon geometries
        m.add_layer(
            lonboard.GeoJsonLayer(
                gdf,
                get_fill_color=color,
                get_line_color=[0, 0, 0, 200],
                get_line_width=2,
                opacity=0.8,
                pickable=True,
                auto_highlight=True
            )
        )
    else:
        # For mixed geometries, convert to points (centroids) for simplicity
        gdf_centroids = gdf.copy()
        gdf_centroids['geometry'] = gdf_centroids.geometry.centroid
        
        m.add_layer(
            lonboard.ScatterplotLayer(
                gdf_centroids,
                get_color=color,
                get_radius=size,
                opacity=0.8,
                pickable=True,
                auto_highlight=True
            )
        )
    
    return m

def create_lonboard_heatmap(gdf, weight_column=None, radius=1000,
                          intensity=1, title="PV Installation Heatmap"):
    """
    Create a heatmap of PV installations using Lonboard.
    
    Parameters:
    -----------
    gdf : GeoDataFrame
        GeoDataFrame containing PV installation data with geometry column
    weight_column : str, optional
        Column name to use for heatmap weighting
    radius : float
        Radius of influence for each point (in meters)
    intensity : float
        Intensity of the heatmap
    title : str
        Title for the map
        
    Returns:
    --------
    lonboard.Map
        Interactive Lonboard heatmap
    """
    # Ensure the GeoDataFrame is in WGS84 (EPSG:4326)
    if gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")
    
    # Handle weight mapping if specified
    if weight_column and weight_column in gdf.columns:
        weight = gdf[weight_column]
    else:
        weight = 1
    
    # Get centroids for all geometries
    gdf_centroids = gdf.copy()
    if not all(gdf.geometry.geom_type.isin(['Point'])):
        gdf_centroids['geometry'] = gdf_centroids.geometry.centroid
    
    # Create the map
    m = lonboard.Map()
    
    # Add heatmap layer
    m.add_layer(
        lonboard.HeatmapLayer(
            gdf_centroids,
            get_weight=weight,
            radius_pixels=int(radius/100),  # Convert meters to pixels roughly
            intensity=intensity,
            threshold=0.05,
            color_range=[
                [1, 152, 189, 255],
                [73, 227, 206, 255],
                [216, 254, 181, 255],
                [254, 237, 177, 255],
                [254, 173, 84, 255],
                [209, 55, 78, 255]
            ]
        )
    )
    
    return m

def create_lonboard_aggregation(gdf, resolution=8, color_scale='viridis',
                              title="PV Installation Density"):
    """
    Create a hexbin aggregation map of PV installations using Lonboard.
    
    Parameters:
    -----------
    gdf : GeoDataFrame
        GeoDataFrame containing PV installation data with geometry column
    resolution : int
        Resolution of hexbins (higher = more detailed)
    color_scale : str
        Matplotlib colormap name for coloring
    title : str
        Title for the map
        
    Returns:
    --------
    lonboard.Map
        Interactive Lonboard hexbin aggregation map
    """
    # Ensure the GeoDataFrame is in WGS84 (EPSG:4326)
    if gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")
    
    # Get centroids for all geometries
    gdf_centroids = gdf.copy()
    if not all(gdf.geometry.geom_type.isin(['Point'])):
        gdf_centroids['geometry'] = gdf_centroids.geometry.centroid
    
    # Create the map
    m = lonboard.Map()
    
    # Add hexbin layer
    m.add_layer(
        lonboard.H3HexagonLayer(
            gdf_centroids,
            get_hex_id=lambda row: h3.geo_to_h3(row.geometry.y, row.geometry.x, resolution),
            get_fill_color="colorScale",
            color_scale=color_scale,
            opacity=0.8,
            pickable=True,
            auto_highlight=True
        )
    )
    
    return m

## Example Usage

Here are some examples of how to use these visualization functions with your processed PV datasets. After loading your geoparquet files into GeoDataFrames, you can use these functions to create interactive visualizations.

In [None]:
# Example usage (commented out until you have processed your datasets)
"""
# Load a processed geoparquet file
gdf = gpd.read_parquet('data/geoparquet/combined_pv_dataset.parquet')

# Basic visualizations with each library
# 1. Create a Folium cluster map
folium_map = create_folium_cluster_map(gdf, zoom_start=2, title="Global PV Installations")
display(folium_map)

# 2. Create a PyDeck 3D visualization
if 'capacity_mw' in gdf.columns:
    pydeck_map = create_pydeck_polygons(
        gdf, 
        color_column='source_dataset',
        extrusion_column='capacity_mw',
        extrusion_scale=100,
        title="3D PV Installation Capacity"
    )
    display(pydeck_map)

# 3. Create a Lonboard heatmap for large datasets
lonboard_map = create_lonboard_heatmap(
    gdf,
    weight_column='area_sqm' if 'area_sqm' in gdf.columns else None,
    radius=2000,
    intensity=2,
    title="Global PV Installation Density"
)
display(lonboard_map)
"""

## Advanced Visualization: Multi-layer Comparison

For more sophisticated analysis, you might want to compare multiple datasets or visualize multiple attributes simultaneously. Here's an example of how to create a multi-layer visualization using PyDeck.

In [None]:
# Example of multi-layer visualization (commented out until datasets are processed)
"""
def create_multi_dataset_comparison(gdfs_dict, base_color_scale=None):
    '''
    Create a multi-layer comparison of different PV datasets
    
    Parameters:
    -----------
    gdfs_dict : dict
        Dictionary of {dataset_name: gdf} pairs
    base_color_scale : list, optional
        Base color scale to use for differentiation
        
    Returns:
    --------
    pydeck.Deck
        Interactive PyDeck map with multiple layers
    '''
    if base_color_scale is None:
        base_color_scale = [
            [255, 0, 0],  # Red
            [0, 255, 0],  # Green
            [0, 0, 255],  # Blue
            [255, 255, 0],  # Yellow
            [255, 0, 255],  # Magenta
            [0, 255, 255],  # Cyan
        ]
    
    # Create layers list
    layers = []
    
    # Track all coordinates to determine view center
    all_lats = []
    all_lons = []
    
    # Create a layer for each dataset with a unique color
    for i, (name, gdf) in enumerate(gdfs_dict.items()):
        # Ensure the GeoDataFrame is in WGS84 (EPSG:4326)
        if gdf.crs != "EPSG:4326":
            gdf = gdf.to_crs("EPSG:4326")
        
        # Get color for this dataset
        color_idx = i % len(base_color_scale)
        color = base_color_scale[color_idx]
        
        # Convert to DataFrame with lat/lon columns
        df = pd.DataFrame({
            'lat': gdf.geometry.centroid.y,
            'lon': gdf.geometry.centroid.x,
            'dataset': name
        })
        
        # Add additional columns from the original GeoDataFrame
        for col in gdf.columns:
            if col != 'geometry':
                df[col] = gdf[col]
        
        # Create ScatterplotLayer for this dataset
        layer = pdk.Layer(
            'ScatterplotLayer',
            df,
            get_position=['lon', 'lat'],
            get_radius=100,
            get_fill_color=color + [180],  # Add alpha value
            pickable=True,
            opacity=0.8,
            stroked=True,
            filled=True,
            id=f"scatter-{name}"  # Add ID for legend
        )
        
        layers.append(layer)
        
        # Track coordinates
        all_lats.extend(df['lat'].tolist())
        all_lons.extend(df['lon'].tolist())
    
    # Set initial view state to center on all data
    view_state = pdk.ViewState(
        longitude=np.mean(all_lons),
        latitude=np.mean(all_lats),
        zoom=3,
        pitch=0
    )
    
    # Create tooltip
    tooltip = {
        "html": "<b>Dataset:</b> {dataset}<br>"
    }
    
    # Create deck
    deck = pdk.Deck(
        layers=layers,
        initial_view_state=view_state,
        tooltip=tooltip,
        map_style='light'
    )
    
    return deck

# After processing your datasets:
# gdfs = {
#    'Global PV Inventory': gpd.read_parquet('data/geoparquet/global_pv_inventory.parquet'),
#    'USA PV Data': gpd.read_parquet('data/geoparquet/usa_pv_data.parquet'),
#    'UK PV Data': gpd.read_parquet('data/geoparquet/uk_pv_data.parquet')
# }
# multi_comparison = create_multi_dataset_comparison(gdfs)
# display(multi_comparison)
"""

## Conclusion

These visualization functions provide a comprehensive toolkit for exploring and presenting your PV installation data. Each library has its strengths:

- **Folium**: Best for quick interactive web maps with various basemaps and standard visualization types
- **PyDeck**: Excellent for 3D visualizations and handling larger datasets with complex visualizations
- **Lonboard**: Best performance for very large datasets with GPU acceleration

You can customize these functions further based on your specific analysis needs and the attributes available in your processed datasets.