# Summary

This document details a Python script that tiles tree-inventory data for Pasadena, California into a regular grid and, for each grid cell that contains trees, saves an image chip and a matching CSV of tree positions. The imagery is NAIP aerial imagery served through Google Earth Engine (GEE). Below is an overview of its key components and functionality:

## Key Components and Functionality

- **Libraries Used**: Imports `pandas`, `geopandas`, `shapely`, `ee` (Earth Engine), `os`, `urllib.request`, and `tqdm` for data manipulation, geospatial operations, API interactions, file handling, and progress tracking.
- **GEE Authentication**: Authenticates and initializes the GEE API to access the imagery.
- **Data Input**: Loads tree data from `pasadena.csv`, standardizing column names to lowercase.
- **GeoDataFrame Setup**: Converts tree coordinates into a GeoDataFrame in EPSG:2229 (NAD83 / California State Plane Zone 5, in US survey feet).
- **Grid Setup**: Defines square grid cells 112 meters on a side (converted to feet to match the CRS) and 224x224-pixel images, which works out to 0.5 meters per pixel.
- **Grid Calculation**: Determines grid extent from tree data and calculates bottom-left corners of grid cells.
- **Cell Processing**:
  - A function `process_grid_cell` processes each grid cell by:
    - Defining cell bounds and selecting trees within them.
    - Skipping empty cells.
    - Creating a polygon in EPSG:2229, converting it to WGS84 for GEE.
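    - Fetching the most recent NAIP image from 2020 to 2022 that intersects the cell, generating a thumbnail URL for a false-color (near-infrared, red, green) composite, and saving it as a PNG.
    - Calculating relative pixel coordinates (`x`, `y`) for trees within the image, measured from the top-left corner.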
    - Saving tree data (`treeID`, `x`, `y`) as a CSV file.
- **Execution**: Loops through all grid cells with a `tqdm` progress bar, processing each one.
- **Output**: Saves PNG images and CSV files to `chips/pasadena_data`.

## Purpose

The script prepares geospatial tree data for urban or environmental analysis, producing a structured dataset of images and tree locations suitable for the GreenCity project.

In [1]:
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon
import ee
import os
import urllib.request
from tqdm import tqdm

In [2]:
# Authenticate and initialize the Earth Engine API
ee.Authenticate()
# Initialize GEE
ee.Initialize(project='')  # Replace with your GEE project ID

In [3]:
# Load Pasadena tree data
df = pd.read_csv("pasadena.csv")
df.columns = df.columns.str.lower()

In [4]:
# Create GeoDataFrame in EPSG:2229 (State Plane California Zone V, feet)
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["x_coordinate"], df["y_coordinate"]),
    crs="EPSG:2229",
)

In [5]:
# Define grid parameters
meters_per_side = 112  # Side length of each grid cell in meters
feet_per_side = meters_per_side * 3.28084  # Convert meters to feet
pixel_dim = 224  # Image dimensions in pixels (224x224)
output_dir = "chips/pasadena_data"
os.makedirs(output_dir, exist_ok=True)
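
The grid parameters above imply a fixed ground resolution per pixel. The following sketch works through that arithmetic and also compares the notebook's conversion factor (3.28084, the international foot) with the US survey foot that EPSG:2229 is defined in. The variable names `intl_feet_per_side`, `survey_feet_per_side`, and `difference_ft` are introduced here for illustration and do not appear in the notebook.

```python
# Grid parameters as defined in the notebook
meters_per_side = 112
pixel_dim = 224

# Ground resolution of each image chip
meters_per_pixel = meters_per_side / pixel_dim

# The notebook's factor (international foot) versus the US survey foot
# (exactly 1200/3937 m), the linear unit of EPSG:2229
intl_feet_per_side = meters_per_side * 3.28084
survey_feet_per_side = meters_per_side * 3937 / 1200
difference_ft = intl_feet_per_side - survey_feet_per_side

print(f"{meters_per_pixel} m/pixel")
print(f"side: {intl_feet_per_side:.5f} ft (international), "
      f"{survey_feet_per_side:.5f} ft (US survey)")
print(f"difference: {difference_ft:.5f} ft")
```

The two factors differ by less than a thousandth of a foot per cell side, so the choice has no practical effect at 0.5 m per pixel.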

In [6]:
# Calculate grid bounds
min_x, min_y, max_x, max_y = gdf.total_bounds

In [7]:
# Generate grid cell bottom-left corners
x_steps = int((max_x - min_x) / feet_per_side) + 1
y_steps = int((max_y - min_y) / feet_per_side) + 1
grid_cells = [
    (min_x + i * feet_per_side, min_y + j * feet_per_side)
    for i in range(x_steps)
    for j in range(y_steps)
]
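
The step count and corner ordering are easier to see on a toy extent. The sketch below wraps the same arithmetic in a hypothetical helper, `grid_corners` (not part of the notebook), and runs it on made-up bounds with a cell side of 100 units.

```python
def grid_corners(min_x, min_y, max_x, max_y, side):
    """Return bottom-left corners of square cells covering the extent.

    Mirrors the notebook's logic: the `+ 1` adds a final, possibly
    partial, column and row so that points at max_x / max_y are covered.
    """
    x_steps = int((max_x - min_x) / side) + 1
    y_steps = int((max_y - min_y) / side) + 1
    return [
        (min_x + i * side, min_y + j * side)
        for i in range(x_steps)
        for j in range(y_steps)
    ]


# Toy extent: 250 units wide, 120 units tall, 100-unit cells
corners = grid_corners(0.0, 0.0, 250.0, 120.0, 100.0)
print(len(corners))
print(corners)
```

Here the extent needs 3 columns and 2 rows, so 6 corners are produced, ordered column by column. Because cells are treated as half-open intervals when trees are selected, the extra column and row also catch trees lying exactly on the maximum x or y of the extent.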

In [8]:
# Function to process each grid cell
def process_grid_cell(cell_x, cell_y):
    # Define cell bounds in EPSG:2229
    cell_min_x = cell_x
    cell_max_x = cell_x + feet_per_side
    cell_min_y = cell_y
    cell_max_y = cell_y + feet_per_side

    # Select trees within this cell
    trees_in_cell = gdf[
        (gdf["x_coordinate"] >= cell_min_x)
        & (gdf["x_coordinate"] < cell_max_x)
        & (gdf["y_coordinate"] >= cell_min_y)
        & (gdf["y_coordinate"] < cell_max_y)
    ]

    if trees_in_cell.empty:
        return  # Skip cells with no trees

    # Create polygon for the cell in EPSG:2229
    cell_polygon = Polygon(
        [
            (cell_min_x, cell_min_y),
            (cell_max_x, cell_min_y),
            (cell_max_x, cell_max_y),
            (cell_min_x, cell_max_y),
        ]
    )

    # Convert cell bounds to WGS84 for GEE
    cell_gdf = gpd.GeoSeries([cell_polygon], crs="EPSG:2229").to_crs("EPSG:4326")
    cell_coords = list(cell_gdf.geometry[0].exterior.coords)
    region = ee.Geometry.Polygon(cell_coords)

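    # Fetch the most recent NAIP image from 2020-2022 that intersects the cell
    # (filterDate treats the end date as exclusive)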
    image = (
        ee.ImageCollection("USDA/NAIP/DOQQ")
        .filterBounds(region)
        .filterDate("2020-01-01", "2022-12-31")
        .sort("system:time_start", False)
        .first()
        .clip(region)
    )

    # Generate thumbnail URL in EPSG:2229 projection
    url = image.getThumbURL({
        'region': region,
        'dimensions': f'{pixel_dim}x{pixel_dim}',
        'format': 'png',
        'bands': ['N', 'R', 'G'],  # False-color composite: near-infrared, red, green
        'min': 0,
        'max': 255,
        'crs': 'EPSG:2229',  # Ensure image projection matches tree coordinates
    })

    # Save image
    image_filename = f"image_{cell_x:.0f}_{cell_y:.0f}.png"
    image_path = os.path.join(output_dir, image_filename)
    urllib.request.urlretrieve(url, image_path)

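    # Calculate relative pixel coordinates (origin at the image's top-left corner)
    scale_x = (cell_max_x - cell_min_x) / pixel_dim
    scale_y = (cell_max_y - cell_min_y) / pixel_dim
    trees_in_cell = trees_in_cell.copy()
    px = ((trees_in_cell["x_coordinate"] - cell_min_x) / scale_x).astype(int)
    py = ((cell_max_y - trees_in_cell["y_coordinate"]) / scale_y).astype(int)
    # Clamp to valid indices: a tree exactly on the cell's bottom edge would
    # otherwise map to row pixel_dim, one past the last valid row
    trees_in_cell["x"] = px.clip(0, pixel_dim - 1)
    trees_in_cell["y"] = py.clip(0, pixel_dim - 1)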

    # Prepare CSV data with treeID, x, y
    csv_data = trees_in_cell[["objectid", "x", "y"]].rename(columns={"objectid": "treeID"})

    # Save CSV
    csv_filename = f"image_{cell_x:.0f}_{cell_y:.0f}.csv"
    csv_path = os.path.join(output_dir, csv_filename)
    csv_data.to_csv(csv_path, index=False)
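
The coordinate conversion inside `process_grid_cell` can be isolated and checked without Earth Engine. The sketch below is a minimal scalar version under stated assumptions: `to_pixel` is a hypothetical helper, the cell is square, and the toy cell side of 224 units is chosen so that one unit equals one pixel. The y axis is flipped because image rows count downward from the top edge, while map northings increase upward.

```python
def to_pixel(x, y, cell_min_x, cell_min_y, side, pixel_dim=224):
    """Map a projected (x, y) point to (column, row) in the cell's image."""
    scale = side / pixel_dim            # map units per pixel
    cell_max_y = cell_min_y + side
    col = int((x - cell_min_x) / scale)
    row = int((cell_max_y - y) / scale)  # flip: row 0 is the top edge
    # Clamp so a point on the cell's bottom edge cannot land at row
    # pixel_dim, which is one past the last valid index
    col = min(max(col, 0), pixel_dim - 1)
    row = min(max(row, 0), pixel_dim - 1)
    return col, row


# Toy cell with its bottom-left corner at the origin, 224 units per side
print(to_pixel(112.5, 112.5, 0.0, 0.0, 224.0))  # near the centre
print(to_pixel(0.0, 0.0, 0.0, 0.0, 224.0))      # bottom-left corner
print(to_pixel(223.9, 223.9, 0.0, 0.0, 224.0))  # near the top-right corner
```

The bottom-left corner case matters in practice: the grid origin is taken from the minimum tree coordinates, so the southernmost tree sits exactly on the bottom edge of its cell.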

In [9]:
# Process all grid cells with progress bar
for cell_x, cell_y in tqdm(grid_cells, desc="Processing grid cells"):
    process_grid_cell(cell_x, cell_y)

Processing grid cells: 100%|██████████| 8787/8787 [20:15<00:00,  7.23it/s]  


In [10]:
print(f"Processing complete. Images and CSVs saved to {output_dir}")

Processing complete. Images and CSVs saved to chips/pasadena_data
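
Once the run finishes, it can be worth confirming that every saved tree position falls inside its 224x224 image. The sketch below uses only the standard library and assumes the CSV layout written above (`treeID`, `x`, `y`). The helper name `out_of_range_rows` and the sample rows are made up for illustration; the demo writes a throwaway file rather than reading real output.

```python
import csv
import os
import tempfile


def out_of_range_rows(csv_path, pixel_dim=224):
    """Return rows whose x or y lies outside [0, pixel_dim - 1]."""
    bad = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            x, y = int(row["x"]), int(row["y"])
            if not (0 <= x < pixel_dim and 0 <= y < pixel_dim):
                bad.append(row)
    return bad


# Demo on a temporary file with one deliberately invalid row (y = 224)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "image_0_0.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["treeID", "x", "y"])
        writer.writerows([[1, 10, 20], [2, 223, 0], [3, 5, 224]])
    bad_rows = out_of_range_rows(path)

print([row["treeID"] for row in bad_rows])
```

To check real output, the same function could be called on each `.csv` file in `chips/pasadena_data`.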
