# Epic 7: Data Processing and API Logic for Threatened Plant Index Visualization

Name: Zihan

### Step 1: Load and Preprocess Data 📂
This cell establishes the **data foundation** for our backend service.

1.  **Import Libraries**: We import `pandas` for general data processing, `geopandas` for handling geospatial data, and `shapely.wkt` for parsing WKT (Well-Known Text) geometry strings.
2.  **Load Data**:
    * `Table14_TSX_Table_VIC_version4.csv` is loaded into a standard pandas DataFrame (`df_tsx`), containing the yearly index values and weather variables for each state covered by the dataset.
    * `Table15_StateShapeTable.csv` is also loaded, containing the state boundary information needed for map drawing.
3.  **Core Transformation**: The most critical step converts the `MULTIPOLYGON` strings, stored as plain text in the CSV file, into **actual geometric objects** by applying `shapely.wkt.loads` to the `geometry` column. Passing the result to `gpd.GeoDataFrame(...)` then upgrades the regular DataFrame into a `GeoDataFrame` (`gdf_states`) that `geopandas` can plot and serialize.
4.  **Validation**: The final `print` statements confirm that the data loaded correctly and reveal that the two tables cover different regions: the TSX data has five states and territories plus a `National` series, while the shape data has nine regions and no `National` entry.

In [None]:
import pandas as pd
import geopandas as gpd
from shapely import wkt # Used to convert strings back to geometry objects

# --- 1. Load and Preprocess Data ---

# Load time series data
df_tsx = pd.read_csv('Table14_TSX_Table_VIC_version4.csv')

# Load geographic shape data
df_shapes = pd.read_csv('Table15_StateShapeTable.csv')

# Convert WKT (Well-Known Text) strings from CSV into real geometry objects
# This is the key step that turns a pandas DataFrame into a GeoDataFrame
df_shapes['geometry'] = df_shapes['geometry'].apply(wkt.loads)
gdf_states = gpd.GeoDataFrame(df_shapes, geometry='geometry')

# Check the data
print("TSX Data Head:")
display(df_tsx.head())
print("\nGeoDataFrame Head:")
display(gdf_states.head())
print(f"\nAvailable states in TSX data: {df_tsx['state'].unique()}")
print(f"Available states in Shape data: {gdf_states['state'].unique()}")

TSX Data Head:


Unnamed: 0,year,index_value,value_type,annual_mean_temp,annual_precip_sum,annual_radiation_sum,state
0,2000,1.0,Historical,14.75799,615.8,5848.04,Victoria
1,2001,0.851265,Historical,14.526712,564.2,5662.22,Victoria
2,2002,0.738937,Historical,14.698013,432.4,5789.7603,Victoria
3,2003,0.728083,Historical,14.443562,571.0,5895.1,Victoria
4,2004,0.647233,Historical,14.277344,621.2,5806.97,Victoria



GeoDataFrame Head:


Unnamed: 0,state,geometry
0,New South Wales,"MULTIPOLYGON (((-31.509 159.062, -31.509 159.0..."
1,Victoria,"MULTIPOLYGON (((-39.158 146.293, -39.157 146.2..."
2,Queensland,"MULTIPOLYGON (((-10.683 142.531, -10.682 142.5..."
3,South Australia,"MULTIPOLYGON (((-38.063 140.66, -38.062 140.66..."
4,Western Australia,"MULTIPOLYGON (((-35.191 117.87, -35.191 117.87..."



Available states in TSX data: ['Victoria' 'New South Wales' 'South Australia' 'Western Australia'
 'Australian Capital Territory' 'National']
Available states in Shape data: ['New South Wales' 'Victoria' 'Queensland' 'South Australia'
 'Western Australia' 'Tasmania' 'Northern Territory'
 'Australian Capital Territory' 'Other Territories']
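
The coverage gap between the two tables can be made explicit with set differences. The sketch below hard-codes the state names from the printed output so that it runs on its own; in the notebook, `set(df_tsx['state'])` and `set(gdf_states['state'])` would presumably serve the same purpose. The variable names are illustrative, not part of the original code.

```python
# State names copied from the printed output above.
tsx_states = {
    "Victoria", "New South Wales", "South Australia",
    "Western Australia", "Australian Capital Territory", "National",
}
shape_states = {
    "New South Wales", "Victoria", "Queensland", "South Australia",
    "Western Australia", "Tasmania", "Northern Territory",
    "Australian Capital Territory", "Other Territories",
}

# Regions that have a boundary but no TSX series (cannot be coloured on the map).
shapes_without_tsx = sorted(shape_states - tsx_states)
# TSX series that have no boundary (cannot be drawn as a map region).
tsx_without_shape = sorted(tsx_states - shape_states)

print(shapes_without_tsx)
print(tsx_without_shape)
```

Under these inputs, four regions have a boundary but no index values, and only `National` has index values but no boundary, so the frontend needs a policy for both cases.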


### Step 2: Define Core Backend Functions 🛠️
In this cell, we define three core functions that **simulate the Web API that will be built later**. Each function serves one specific frontend need by extracting the required information from the loaded data and formatting it for transport.

1.  `get_map_base_geojson()`:
    *   **Purpose**: Provides the **basic geographic outlines** needed for the frontend to draw the Australian map.
    *   **Work**: Directly converts the `gdf_states` `GeoDataFrame` into **GeoJSON** format. This is a standard geospatial data format that frontend map libraries (like Leaflet, Mapbox) can use directly.
    *   **Characteristic**: This data is static; the frontend only needs to request it once when the page loads.

2.  `get_choropleth_data_for_year(year)`:
    *   **Purpose**: Provides the **data** used to render the Choropleth Map colors based on the user-selected **year**.
    *   **Work**: Accepts a year, filters the `df_tsx` data for that year, and formats it into a simple Python dictionary of the form `{state_name: index_value}`. The `National` series is included as one of the keys.
    *   **Characteristic**: The returned data packet is very **lightweight**, suitable for frequent requests when the user interacts with the time slider.

3.  `get_state_timeseries_data(state)`:
    *   **Purpose**: Provides the complete time series data needed to draw the right-hand **line chart** based on the **state** clicked by the user.
    *   **Work**: Accepts a state name, filters `df_tsx` for **all relevant data** (TSX index, weather variables, data type, etc.) for that state from 2000 to 2027.
    *   **Characteristic**: Returns data in JSON record array format (`[{...}, {...}]`), which the frontend can easily iterate over to plot the chart.
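
To illustrate how the first two payloads fit together, here is a minimal sketch that attaches a year's values to the map features by matching on the `state` property. In practice this join would likely happen in the frontend's JavaScript; Python is used here only to mirror the logic. The helper `join_values_to_features` and the sample inputs are hypothetical, and the numeric values are illustrative rather than taken from the dataset.

```python
import json


def join_values_to_features(geojson_text, values_by_state):
    """Hypothetical helper: attach an index value to each feature via its `state` property."""
    collection = json.loads(geojson_text)
    for feature in collection["features"]:
        name = feature["properties"]["state"]
        # .get() yields None for regions that have no TSX value in the chosen year
        feature["properties"]["index_value"] = values_by_state.get(name)
    return collection


# Tiny stand-in for the output of get_map_base_geojson() (geometries omitted)
sample_geojson = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"state": "Victoria"}, "geometry": None},
        {"type": "Feature", "properties": {"state": "Queensland"}, "geometry": None},
    ],
})
# Stand-in for the output of get_choropleth_data_for_year(...); values are illustrative
sample_values = {"Victoria": 0.45, "National": 0.55}

joined = join_values_to_features(sample_geojson, sample_values)
for feature in joined["features"]:
    print(feature["properties"])
```

Regions without a value end up with `None`, which a map layer could render in a neutral colour, while keys such as `National` that match no feature are simply ignored by the join.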

In [None]:
# --- 2. Define Core Functions for Backend ---

def get_map_base_geojson():
    """
    Function: Provides geographic boundary data for all states (in GeoJSON format).
    Usage: Called once during frontend initialization to draw the base map outline.
    Returns: A GeoJSON formatted string.
    """
    # to_json() automatically converts the GeoDataFrame to GeoJSON
    return gdf_states.to_json()

def get_choropleth_data_for_year(year: int):
    """
    Function: Gets the TSX index for all states for a specified year.
    Usage: Called by the frontend when the user slides the time slider to update map colors for that year.
    Parameter: year - The selected year (e.g., 2010)
    Returns: A dictionary where keys are state names and values are the TSX index for that state. e.g., {'Victoria': 0.45, ...}
    """
    if year not in df_tsx['year'].unique():
        return {"error": f"Year {year} not found in data."}
        
    # Filter data for the specified year
    data_for_year = df_tsx[df_tsx['year'] == year]
    
    # Convert result to {state: index_value} dictionary format for easy frontend use
    result = data_for_year.set_index('state')['index_value'].to_dict()
    return result

def get_state_timeseries_data(state: str):
    """
    Function: Gets the complete time series data (TSX and all weather variables) for a specified state.
    Usage: Called by the frontend when the user clicks a state on the map to draw the line chart on the right.
    Parameter: state - The selected state name (e.g., "Victoria")
    Returns: A JSON string encoding a list of records, one per year, with all columns for the state.
    """
    if state not in df_tsx['state'].unique():
        return {"error": f"State '{state}' not found in data."}

    # Filter data for the specified state
    state_data = df_tsx[df_tsx['state'] == state].copy()
    
    # For easy frontend processing, directly convert the filtered DataFrame to JSON
    # The 'records' orientation generates a list of dictionaries, ideal for frontend rendering
    # e.g., [{'year': 2000, 'index_value': 1.0, ...}, {'year': 2001, ...}]
    return state_data.to_json(orient='records')
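
One thing worth checking before the GeoJSON reaches a map library: the geometry preview shows pairs such as `-31.509 159.062`, which look like latitude followed by longitude, whereas GeoJSON (RFC 7946) specifies longitude first. If the shape CSV really does store latitude first, the positions would need to be swapped. The following is a minimal sketch under that assumption; `swap_axis_order` and `to_lon_lat` are hypothetical helpers, the sample coordinates are illustrative, and only geometry types with a `coordinates` member (such as `MultiPolygon`) are handled.

```python
import json


def swap_axis_order(coords):
    """Recursively turn every [a, b, ...] position into [b, a, ...]."""
    if coords and isinstance(coords[0], (int, float)):
        # A single position: swap the first two numbers, keep any extras (e.g. altitude)
        return [coords[1], coords[0], *coords[2:]]
    return [swap_axis_order(part) for part in coords]


def to_lon_lat(geojson_text):
    """Return a copy of the GeoJSON text with every position swapped."""
    collection = json.loads(geojson_text)
    for feature in collection["features"]:
        geometry = feature["geometry"]
        if geometry is not None:
            geometry["coordinates"] = swap_axis_order(geometry["coordinates"])
    return json.dumps(collection)


# Illustrative MultiPolygon with positions stored as [latitude, longitude]
sample = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"state": "New South Wales"},
        "geometry": {
            "type": "MultiPolygon",
            "coordinates": [[[[-31.5, 159.0], [-31.5, 159.1],
                              [-31.6, 159.1], [-31.5, 159.0]]]],
        },
    }],
})

fixed = json.loads(to_lon_lat(sample))
print(fixed["features"][0]["geometry"]["coordinates"][0][0][0])
```

If the axis order turns out to be correct already, this step should be skipped, since applying it would introduce the very problem it is meant to fix.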

### Step 3: Test Functions in the Notebook ✅
The purpose of this cell is to **verify that the functions defined in Step 2 work correctly** and produce data in the expected format. Doing this before migrating the logic to a real web server greatly simplifies the debugging process.

1.  **Test GeoJSON**: We call `get_map_base_geojson()` and print the first 200 characters of its output. The goal is to confirm it successfully generates a valid GeoJSON string starting with `{"type": "FeatureCollection", ...}`.
2.  **Test Choropleth Data**: Using `2010` as an example, we call `get_choropleth_data_for_year()`. The output is a dictionary whose keys are state names (plus `National`) and whose values are the TSX index for 2010.
3.  **Test Time Series Data**: Using `'Victoria'` and `'National'` as examples, we call `get_state_timeseries_data()`. The output is a JSON array in which each object holds one year's complete data for the requested state, which indicates that the function filters by state and formats the data as intended.

Since all tests return output in the expected format, we can be reasonably confident that the core logic is **correct**, and it is ready for the next step: integration into the Flask web application.
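
Checking printed output by eye works in a notebook, but the same expectations can also be written as assertions that fail loudly. The sketch below is one possible shape for such checks; the `check_*` helpers are hypothetical, and the inline samples are trimmed stand-ins for the real return values so that the block runs on its own.

```python
import json


def check_geojson(text):
    """Assert that text parses as a non-empty FeatureCollection with a `state` property."""
    data = json.loads(text)
    assert data["type"] == "FeatureCollection"
    assert len(data["features"]) > 0
    assert all("state" in f["properties"] for f in data["features"])


def check_choropleth(values):
    """Assert that the payload is a {state: number} mapping, not an error dictionary."""
    assert "error" not in values
    assert all(isinstance(v, (int, float)) for v in values.values())


def check_timeseries(text):
    """Assert that text parses as a non-empty list of records for a single state."""
    records = json.loads(text)
    assert isinstance(records, list) and len(records) > 0
    assert all("year" in r and "index_value" in r for r in records)
    assert len({r["state"] for r in records}) == 1


# Trimmed stand-ins; in the notebook these would be the three function outputs
sample_geojson = '{"type": "FeatureCollection", "features": [{"type": "Feature", "properties": {"state": "Victoria"}, "geometry": null}]}'
sample_choropleth = {"Victoria": 0.454594001858494, "National": 0.552198806798463}
sample_timeseries = '[{"year": 2000, "index_value": 1.0, "state": "Victoria"}, {"year": 2001, "index_value": 0.8512650869, "state": "Victoria"}]'

check_geojson(sample_geojson)
check_choropleth(sample_choropleth)
check_timeseries(sample_timeseries)
print("all checks passed")
```

Checks like these could later move into a test suite for the web application without changes to the functions themselves.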

In [None]:
# --- 3. Test these functions in the notebook ---

print("\n--- Testing Functions ---")

# Test 1: Get GeoJSON for the map base
print("\n[Test] GeoJSON for map base:")
geojson_data = get_map_base_geojson()
# GeoJSON is long, print first 200 chars to check format
print(geojson_data[:200] + "...") 

# Test 2: Get Choropleth data for 2010
print("\n[Test] Choropleth data for year 2010:")
choropleth_data_2010 = get_choropleth_data_for_year(2010)
print(choropleth_data_2010)

# Test 3: Get timeseries data for Victoria
print("\n[Test] Timeseries data for Victoria:")
victoria_timeseries = get_state_timeseries_data('Victoria')
print(victoria_timeseries)

# Test 4: Get timeseries data for the aggregate 'National' series
# ('National' appears in df_tsx['state'] but has no boundary in the shape data)
print("\n[Test] Timeseries data for National:")
national_timeseries = get_state_timeseries_data('National')
print(national_timeseries)


--- Testing Functions ---

[Test] GeoJSON for map base:
{"type": "FeatureCollection", "features": [{"id": "0", "type": "Feature", "properties": {"state": "New South Wales"}, "geometry": {"type": "MultiPolygon", "coordinates": [[[[-31.508856162538123, 159.0...

[Test] Choropleth data for year 2010:
{'Victoria': 0.454594001858494, 'New South Wales': 0.433957342449423, 'South Australia': 0.971380029575459, 'Western Australia': 0.438504885066282, 'Australian Capital Territory': 1.15055990457538, 'National': 0.552198806798463}

[Test] Timeseries data for Victoria:
[{"year":2000,"index_value":1.0,"value_type":"Historical","annual_mean_temp":14.75799,"annual_precip_sum":615.8,"annual_radiation_sum":5848.04,"state":"Victoria"},{"year":2001,"index_value":0.8512650869,"value_type":"Historical","annual_mean_temp":14.526712,"annual_precip_sum":564.2,"annual_radiation_sum":5662.22,"state":"Victoria"},{"year":2002,"index_value":0.7389365802,"value_type":"Historical","annual_mean_temp":14.698013,"an