# Data Preparation: Enriching Threatened Species Index with Historical Weather Data for Time Series Analysis

Name: Zihan

### Workflow Summary

This Jupyter notebook documents the full process, from loading the raw data to generating the final dataset, in four core steps:

1.  **Main Data Processing (TSX Index Data Wrangling)**
    * Load and merge Threatened Species Index (TSX) data from six CSV files: one national file and five state/territory files.
    * Standardize column names and formats to produce the main dataset `combined_df`, which contains records from 2000 to 2021.

2.  **External Data Strategy**
    * Enrich the time series model's predictive power by introducing historical weather data as exogenous variables.
    * Use the Open-Meteo historical weather API as the data source, with three weather indicators: annual mean temperature, annual total precipitation, and annual total shortwave radiation.

3.  **Weather Data Acquisition & Processing**
    * Use the official `openmeteo-requests` Python client with caching and automatic retries for stable, efficient data retrieval.
    * Call the API with each state or territory's capital city coordinates to obtain daily weather data from 2000 to 2024.
    * Aggregate the daily data to annual level (mean for temperature, sum for precipitation and radiation), then compute `National` values as the mean across the regions.
    * Clean the aggregated data to produce the complete weather dataset `weather_df`.

4.  **Final Data Merge & Storage**
    * Use a **Right Merge** to combine the main data (`combined_df`) with the weather data (`weather_df`), producing `final_df` with one row per region and year for 2000-2024 (index columns are `NaN` for 2022-2024).
    * Save `final_df` as a CSV file (`Table14_TSX_Table_VIC_version3.csv`), the direct input for the next phase: **SARIMAX model predictive analysis**.

### Step 1: Import Libraries and Set Paths

In this cell, we import the `os` and `pandas` libraries, then define the directory that contains the data files and the list of filenames to process. All subsequent steps rely on these definitions.

In [None]:
# Cell 1
import os
import pandas as pd

# Define directory containing data files
# Use forward slash '/' which works well on Windows, Mac, and Linux
data_directory = '01_raw_data/06_tsx_table_vic'

# Define list of target files to process (already renamed)
target_files = [
    "National.csv",
    "Australian_Capital_Territory.csv",
    "New_South_Wales.csv",
    "South_Australia.csv",
    "Victoria.csv",
    "Western_Australia.csv"
]

print("Libraries imported, paths set.")

Libraries imported, paths set.


### Step 2: Loop Through, Process, and Merge Data

This is the core processing step. The code iterates over each filename in the `target_files` list and does the following:
1.  Read corresponding CSV file.
2.  Add `state` column based on filename.
3.  Rename `value`, `low`, `high` columns.
4.  Reorder columns according to final required sequence.
5.  Store processed data (DataFrame) in temporary list.

In [None]:
# Cell 2
# Initialize empty list to store each processed DataFrame
all_dataframes = []

print(f"Starting to process files from directory '{data_directory}'...")

# Iterate through filename list
for filename in target_files:
    filepath = os.path.join(data_directory, filename)

    if not os.path.exists(filepath):
        print(f"Warning: File not found, skipping -> {filepath}")
        continue

    # 1. Read CSV file
    df = pd.read_csv(filepath)
    
    # 2. Extract state/territory name from filename and add as new column
    state = filename.replace('.csv', '').replace('_', ' ')
    df['state'] = state
    
    # 3. Rename columns to match final structure
    df.rename(columns={
        'value': 'index_value',
        'low': 'index_conf_low',
        'high': 'index_conf_high'
    }, inplace=True)
    
    # 4. Reorder columns according to specified sequence
    df = df[['year', 'state', 'index_value', 'index_conf_low', 'index_conf_high']]
    
    # 5. Add processed DataFrame to list
    all_dataframes.append(df)
    
    print(f"Processed: {filename}")

print("\nAll files processed.")

Starting to process files from directory '01_raw_data/06_tsx_table_vic'...
Processed: National.csv
Processed: Australian_Capital_Territory.csv
Processed: New_South_Wales.csv
Processed: South_Australia.csv
Processed: Victoria.csv
Processed: Western_Australia.csv

All files processed.
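
The per-file logic above can also be packaged as a small function that makes the column check explicit. This is a minimal sketch rather than the notebook's own code: `load_tsx_frame` and `REQUIRED_COLUMNS` are hypothetical names, and the in-memory CSV stands in for a real file.

```python
import io
import os

import pandas as pd

# Raw column names the loop above relies on (hypothetical constant name).
REQUIRED_COLUMNS = ["year", "value", "low", "high"]


def load_tsx_frame(source, filename):
    """Read one TSX CSV and return it in the standardized column layout."""
    df = pd.read_csv(source)

    # Fail early with a clear message if an expected column is absent.
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"{filename} is missing columns: {missing}")

    df = df.rename(columns={
        "value": "index_value",
        "low": "index_conf_low",
        "high": "index_conf_high",
    })
    # "New_South_Wales.csv" -> "New South Wales"
    df["state"] = os.path.splitext(filename)[0].replace("_", " ")
    return df[["year", "state", "index_value", "index_conf_low", "index_conf_high"]]


# Illustrative data only, not taken from the real files.
sample_csv = "year,value,low,high\n2000,1.0,1.0,1.0\n2001,0.85,0.75,0.95\n"
sample = load_tsx_frame(io.StringIO(sample_csv), "New_South_Wales.csv")
print(sample)
```

Raising on a missing column surfaces a malformed file at load time instead of as a `KeyError` during the column reorder.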


### Step 3: Final Merge and Data Type Conversion

The previous step produced a list of six separate DataFrames. We now concatenate them into a single dataset and explicitly cast each column to its required data type.

In [None]:
# Cell 3
# Merge all DataFrames in list into one
combined_df = pd.concat(all_dataframes, ignore_index=True)

# Ensure final data types are correct
combined_df['year'] = combined_df['year'].astype(int)
combined_df['state'] = combined_df['state'].astype(str)
combined_df['index_value'] = combined_df['index_value'].astype(float)
combined_df['index_conf_low'] = combined_df['index_conf_low'].astype(float)
combined_df['index_conf_high'] = combined_df['index_conf_high'].astype(float)

print("Data successfully merged and data types converted.")

Data successfully merged and data types converted.
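
Before relying on the merged panel, it can help to confirm that every region covers the same years and that no (year, state) pair appears twice. The sketch below is one possible check on a toy DataFrame; `check_panel` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def check_panel(df, first_year, last_year):
    """Return a list of human-readable problems found in a year/state panel."""
    problems = []

    # Each (year, state) combination should appear at most once.
    if df.duplicated(subset=["year", "state"]).any():
        problems.append("duplicate (year, state) rows")

    # Each region should cover every year in the expected range.
    expected_years = set(range(first_year, last_year + 1))
    for state, group in df.groupby("state"):
        missing = expected_years - set(group["year"])
        if missing:
            problems.append(f"{state}: missing years {sorted(missing)}")
    return problems


# Toy data: "National" lacks the 2001 record.
demo = pd.DataFrame({
    "year": [2000, 2001, 2000],
    "state": ["Victoria", "Victoria", "National"],
    "index_value": [1.0, 0.85, 1.0],
})
issues = check_panel(demo, 2000, 2001)
print(issues)
```

For the real data, the call would be `check_panel(combined_df, 2000, 2021)`, and an empty list would mean the 6 regions each have all 22 years.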


### Step 4: Check Final Results

Finally, we verify data is ready as expected using two commands:
1.  `combined_df.head()`: Display first 5 rows of final dataset to visually check content and format.
2.  `combined_df.info()`: Display dataset summary including total rows, column names, non-null counts, and data types to confirm structure is correct.

In [None]:
# Cell 4
# Display first 5 rows of final DataFrame to check data correctness
print("--- Final Merged Data (First 5 Rows) ---")
combined_df.head()

--- Final Merged Data (First 5 Rows) ---


| | year | state | index_value | index_conf_low | index_conf_high |
|---|---|---|---|---|---|
| 0 | 2000 | National | 1.0 | 1.0 | 1.0 |
| 1 | 2001 | National | 0.894786 | 0.779023 | 1.014265 |
| 2 | 2002 | National | 0.833242 | 0.71714 | 0.965727 |
| 3 | 2003 | National | 0.852811 | 0.712254 | 1.026288 |
| 4 | 2004 | National | 0.799586 | 0.658812 | 0.968141 |


In [None]:
# Cell 5
# Display DataFrame structure information (column names, non-null counts, data types)
print("\n--- Final Data Structure and Types ---")
combined_df.info()


--- Final Data Structure and Types ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132 entries, 0 to 131
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   year             132 non-null    int32  
 1   state            132 non-null    object 
 2   index_value      132 non-null    float64
 3   index_conf_low   132 non-null    float64
 4   index_conf_high  132 non-null    float64
dtypes: float64(3), int32(1), object(1)
memory usage: 4.8+ KB


### Step 5: Install, Setup, and Prepare Weather Data Parameters

In this step, we switch to a more reliable data acquisition method: the official `openmeteo-requests` Python client. This cell completes all preparation for data acquisition.

**1. Reasons for Using Official Client:**
-   **Smart Caching**: Saves API responses locally, so subsequent runs are fast and do not consume API quota.
-   **Automatic Retries**: Retries requests that fail due to transient network errors (up to 5 retries with backoff in this setup), making the run more robust.

**2. Preparation Process in This Cell:**
-   **Install Libraries**: Ensure `openmeteo-requests` and dependencies are installed.
-   **Define Parameters**:
    -   **Capital Coordinates**: Create dictionary with state capitals and their coordinates.
    -   **API Parameters**: Define required weather variables and time range (2000-01-01 to 2024-12-31).
-   **Setup Client**: Initialize API client with caching and retry functionality for next step's data acquisition.

In [None]:
# 1. Install required libraries (if not already installed)
# If running in Jupyter, use ! prefix to execute pip command
# !pip install openmeteo-requests requests-cache retry-requests numpy pandas

import openmeteo_requests
import pandas as pd
import requests_cache
from retry_requests import retry

# 2. Setup Open-Meteo API client with caching and retry functionality
cache_session = requests_cache.CachedSession('.cache', expire_after=-1)
retry_session = retry(cache_session, retries=5, backoff_factor=0.2)
openmeteo = openmeteo_requests.Client(session=retry_session)

# 3. Define request parameters
# Define capital city coordinates for each state/territory
capital_coords = {
    "Victoria": {"lat": -37.81, "lon": 144.96},
    "New South Wales": {"lat": -33.87, "lon": 151.21},
    "South Australia": {"lat": -34.93, "lon": 138.60},
    "Western Australia": {"lat": -31.95, "lon": 115.86},
    "Australian Capital Territory": {"lat": -35.28, "lon": 149.13},
}
start_date = "2000-01-01"
end_date = "2024-12-31"
base_url = "https://archive-api.open-meteo.com/v1/archive"

# The API client expects the variables as a list.
# Order matters: Step 6 reads the results by position (0, 1, 2).
daily_vars_list = ["precipitation_sum", "temperature_2m_mean", "shortwave_radiation_sum"]

print("✅ Official API client successfully set up!")

✅ Official API client successfully set up!


### Step 6: Use Official Client to Acquire and Aggregate Weather Data

Now we use the `openmeteo` client from previous step to acquire 2000-2024 weather data for each state.

This code block performs these core operations:
1.  Iterate through each state's capital coordinates.
2.  Use the official client to request 25 years of daily data for each state.
3.  Parse the returned NumPy arrays into a pandas DataFrame.
4.  **Resample** daily data to annual level and **aggregate**:
    -   Temperature: annual **average**.
    -   Precipitation: annual **sum**.
    -   Radiation: annual **sum**.
5.  Store processed annual data in list for next merge step.

In [None]:
# Initialize empty list to store each state's processed annual weather data
annual_weather_data_list = []

print("🚀 Starting to acquire and process state weather data using official client...")

# Iterate through each state
for state, coords in capital_coords.items():
    print(f"\n--- Processing [{state}] ---")
    
    params = {
        "latitude": coords["lat"],
        "longitude": coords["lon"],
        "start_date": start_date,
        "end_date": end_date,
        "daily": daily_vars_list,
        "timezone": "auto"
    }

    try:
        # Use official client to call API
        responses = openmeteo.weather_api(base_url, params=params)
        response = responses[0]

        # --- Parse returned data ---
        daily = response.Daily()
        
        daily_precipitation_sum = daily.Variables(0).ValuesAsNumpy()
        daily_temperature_2m_mean = daily.Variables(1).ValuesAsNumpy()
        daily_shortwave_radiation_sum = daily.Variables(2).ValuesAsNumpy()

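        # --- Build the date index with pd.date_range(), as in the official client example ---
        # Note: the timestamps are interpreted as UTC, while timezone="auto" requests local days;
        # for Australian locations this likely places each local day on the previous UTC date
        # (see the 1999 row in Step 8).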
        daily_data = {"date": pd.date_range(
            start = pd.to_datetime(daily.Time(), unit = "s", utc = True),
            end = pd.to_datetime(daily.TimeEnd(), unit = "s", utc = True),
            freq = pd.Timedelta(seconds = daily.Interval()),
            inclusive = "left"
        )}
        # ----------------------------------------------------------------------

        daily_data["precipitation_sum"] = daily_precipitation_sum
        daily_data["temperature_2m_mean"] = daily_temperature_2m_mean
        daily_data["shortwave_radiation_sum"] = daily_shortwave_radiation_sum
        
        daily_df = pd.DataFrame(data=daily_data)
        daily_df.set_index('date', inplace=True)
        
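        # --- Annual aggregation: mean temperature, summed precipitation and radiation ---
        # The 'YE' (year-end) alias requires pandas >= 2.2; older versions use 'A' or 'Y'.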
        annual_agg_df = daily_df.resample('YE').agg({
            'temperature_2m_mean': 'mean',
            'precipitation_sum': 'sum',
            'shortwave_radiation_sum': 'sum'
        })
        
        annual_agg_df['state'] = state
        annual_weather_data_list.append(annual_agg_df)
        
        print(f"✅ Successfully acquired and processed [{state}] data.")

    except Exception as e:
        print(f"❌ Error processing [{state}]: {e}")

print("\n🎉 All state data processing completed.")

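🚀 Starting to acquire and process state weather data using official client...

--- Processing [Victoria] ---
✅ Successfully acquired and processed [Victoria] data.

--- Processing [New South Wales] ---
✅ Successfully acquired and processed [New South Wales] data.

--- Processing [South Australia] ---
✅ Successfully acquired and processed [South Australia] data.

--- Processing [Western Australia] ---
✅ Successfully acquired and processed [Western Australia] data.

--- Processing [Australian Capital Territory] ---
✅ Successfully acquired and processed [Australian Capital Territory] data.

🎉 All state data processing completed.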
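
The annual aggregation can be illustrated on synthetic daily data without calling the API. Note one deliberate substitution: this sketch groups by `index.year` instead of using `resample('YE')`, which yields the same yearly buckets while also running on older pandas versions. All values are made up, and `daily_demo`, `annual_demo`, and `days_per_year` are illustrative names.

```python
import numpy as np
import pandas as pd

# Two full years of constant synthetic daily values (2000 is a leap year).
dates = pd.date_range("2000-01-01", "2001-12-31", freq="D")
daily_demo = pd.DataFrame(
    {
        "temperature_2m_mean": np.full(len(dates), 15.0),
        "precipitation_sum": np.full(len(dates), 2.0),
        "shortwave_radiation_sum": np.full(len(dates), 16.0),
    },
    index=dates,
)

# Mean for temperature, sum for precipitation and radiation.
annual_demo = daily_demo.groupby(daily_demo.index.year).agg({
    "temperature_2m_mean": "mean",
    "precipitation_sum": "sum",
    "shortwave_radiation_sum": "sum",
})
annual_demo.index.name = "year"

# Days contributing to each year; a very small count flags a partial year.
days_per_year = daily_demo.groupby(daily_demo.index.year).size()

print(annual_demo)
print(days_per_year)
```

Keeping a day count next to the sums makes a partial year easy to spot, since a sum over a handful of days is not comparable with a full-year total.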


### Step 7: Merge Data and Calculate National Averages

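Now we concatenate the per-state data from the previous step into a single DataFrame. We then group by year and average across states to create a "National" series, which is appended to the DataFrame. Note that this is an unweighted mean of the capital-city values, not an area- or population-weighted national figure.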

In [None]:
# Merge all states' annual weather data
weather_df = pd.concat(annual_weather_data_list)

# Extract year as regular column
weather_df['year'] = weather_df.index.year
weather_df.reset_index(drop=True, inplace=True)

# Calculate national annual averages
# Group by year and average all numeric columns
national_df = weather_df.groupby('year').mean(numeric_only=True).reset_index()
national_df['state'] = 'National'

# Append national data to main DataFrame
weather_df = pd.concat([weather_df, national_df], ignore_index=True)

# Rename columns for better readability
weather_df.rename(columns={
    'temperature_2m_mean': 'annual_mean_temp',
    'precipitation_sum': 'annual_precip_sum',
    'shortwave_radiation_sum': 'annual_radiation_sum'
}, inplace=True)

# Adjust column order
weather_df = weather_df[['year', 'state', 'annual_mean_temp', 'annual_precip_sum', 'annual_radiation_sum']]

print("Final weather dataset 'weather_df' generated.")

Final weather dataset 'weather_df' generated.
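
The "National" calculation amounts to a plain group-by-year mean. The sketch below shows it on toy numbers and also counts how many regions feed each yearly average, which would reveal a region that failed to download. The data and the names `states_demo`, `national_demo`, and `regions_per_year` are illustrative only.

```python
import pandas as pd

# Toy annual data for two regions (values are made up).
states_demo = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "state": ["Victoria", "New South Wales", "Victoria", "New South Wales"],
    "annual_mean_temp": [14.0, 18.0, 15.0, 19.0],
    "annual_precip_sum": [600.0, 1000.0, 500.0, 900.0],
})

# Unweighted mean across regions for each year.
national_demo = (
    states_demo.groupby("year")
    .mean(numeric_only=True)
    .reset_index()
    .assign(state="National")
)

# Number of distinct regions behind each yearly average.
regions_per_year = states_demo.groupby("year")["state"].nunique()

print(national_demo)
print(regions_per_year)
```

Because the mean is taken over whatever regions are present, a missing region silently changes the "National" values, so checking `regions_per_year` before trusting them is a cheap safeguard.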


### Step 8: Check Final Weather Data

Finally, we check `weather_df` content and structure to ensure data is ready as expected.

In [None]:
weather_df[weather_df['state'] == "Australian Capital Territory"]

| | year | state | annual_mean_temp | annual_precip_sum | annual_radiation_sum |
|---|---|---|---|---|---|


In [None]:
# Display first few rows to check data format
weather_df.head()

| | year | state | annual_mean_temp | annual_precip_sum | annual_radiation_sum |
|---|---|---|---|---|---|
| 0 | 1999 | Victoria | 14.71625 | 0.0 | 24.59 |
| 1 | 2000 | Victoria | 14.75799 | 615.799988 | 5848.040039 |
| 2 | 2001 | Victoria | 14.526712 | 564.200012 | 5662.220215 |
| 3 | 2002 | Victoria | 14.698013 | 432.399994 | 5789.760254 |
| 4 | 2003 | Victoria | 14.443562 | 571.0 | 5895.100098 |


In [None]:
# Display last few rows to see calculated 'National' data
weather_df.tail()

| | year | state | annual_mean_temp | annual_precip_sum | annual_radiation_sum |
|---|---|---|---|---|---|
| 125 | 2020 | National | 16.554989 | 899.525024 | 6237.892578 |
| 126 | 2021 | National | 16.314037 | 849.174988 | 6297.140137 |
| 127 | 2022 | National | 16.460024 | 1122.150024 | 6253.627441 |
| 128 | 2023 | National | 16.750156 | 736.325012 | 6551.685059 |
| 129 | 2024 | National | 17.195477 | 731.349976 | 6567.745117 |


In [None]:
# Check data structure and types
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   year                  130 non-null    int32  
 1   state                 130 non-null    object 
 2   annual_mean_temp      130 non-null    float32
 3   annual_precip_sum     130 non-null    float32
 4   annual_radiation_sum  130 non-null    float32
dtypes: float32(3), int32(1), object(1)
memory usage: 3.2+ KB


### Step 9: Data Cleaning and Final Confirmation

Based on the checks in Step 8, two cleaning tasks remain:
1.  **Re-acquire Missing Data**: The `Australian Capital Territory` filter in Step 8 returned no rows, meaning that region's data was not acquired (attributed here to API rate limits). Re-running Steps 6 and 7 fills the gap.
2.  **Remove Anomalous Year**: Filter out the spurious 1999 records. The Victoria row in Step 8 shows radiation of 24.59 against roughly 5,800 for a full year, so it covers about one day rather than a year.

After these steps, `weather_df` is a clean, complete dataset.

In [None]:
# Cleaning step 1: Filter out erroneous 1999 data
# Keep only records with year >= 2000
original_rows = len(weather_df)
weather_df = weather_df[weather_df['year'] >= 2000].copy()

print(f"Filtered out 1999 data. Rows changed from {original_rows} to {len(weather_df)}.")

# Cleaning step 2: Check if all state data is complete
# Normally should have 6 regions (5 states/territories + 1 National)
if len(weather_df['state'].unique()) < 6:
    print("\nDetected missing state data (possibly due to API limits), recommend returning to re-run Steps 6 and 7.")
    print("Due to caching, re-running will only request previously failed parts and will be fast.")
else:
    print("\n✅ All state/territory data is complete.")

# Final confirmation
print("\n--- Final Data Overview After Cleaning ---")
weather_df.info()

print("\n--- Data Year Range ---")
print(f"From {weather_df['year'].min()} to {weather_df['year'].max()}")

Filtered out 1999 data. Rows changed from 156 to 150.

✅ All state/territory data is complete.

--- Final Data Overview After Cleaning ---
<class 'pandas.core.frame.DataFrame'>
Index: 150 entries, 1 to 155
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   year                  150 non-null    int32  
 1   state                 150 non-null    object 
 2   annual_mean_temp      150 non-null    float32
 3   annual_precip_sum     150 non-null    float32
 4   annual_radiation_sum  150 non-null    float32
dtypes: float32(3), int32(1), object(1)
memory usage: 4.7+ KB

--- Data Year Range ---
From 2000 to 2024
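
A likely explanation for the 1999 rows, offered as an assumption rather than a confirmed diagnosis: Step 6 requests `timezone: "auto"`, so each daily value belongs to a local calendar day, but the date index is built with `utc=True`. Local midnight in eastern Australia is 10 to 11 hours ahead of UTC, so 2000-01-01 local time falls on 1999-12-31 in UTC. The sketch below reproduces the effect with plain pandas and no API call. Here `utc_offset_seconds` is a hand-set illustrative value (UTC+11); in the notebook it would come from the API response.

```python
import pandas as pd

# Illustrative offset: UTC+11 (eastern Australia during daylight saving).
utc_offset_seconds = 11 * 3600

# Unix timestamp of local midnight on 2000-01-01 for a UTC+11 location.
unix_seconds = int(pd.Timestamp("2000-01-01", tz="UTC").timestamp()) - utc_offset_seconds

# Interpreting the raw timestamp as UTC lands on the previous calendar day.
as_utc = pd.to_datetime(unix_seconds, unit="s", utc=True)

# Adding the offset back recovers the local calendar date.
shifted = pd.to_datetime(unix_seconds + utc_offset_seconds, unit="s", utc=True)

print(as_utc)   # 1999-12-31 13:00:00+00:00
print(shifted)  # 2000-01-01 00:00:00+00:00
```

If this is the cause, adding the response's UTC offset to `daily.Time()` and `daily.TimeEnd()` before building the date range (the official client examples read it via `response.UtcOffsetSeconds()`), or requesting `timezone: "GMT"`, should avoid the spurious year and keep every day in its own calendar year. This has not been verified against the notebook's data.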


### Step 10: Merge Main Data with Weather Data

Now we merge `combined_df` (main data, 2000-2021) and `weather_df` (weather data, 2000-2024) into final dataset `final_df`.

We use a **Right Merge** with the weather data as the base, so every year in the weather data is retained. For 2022-2024, where only weather data exists, the main data columns (such as `index_value`) are filled with `NaN`.

In [None]:
# Use 'year' and 'state' as common keys for right merge
final_df = pd.merge(combined_df, weather_df, on=['year', 'state'], how='right')

# Check head of merged dataset (should show 2000-2004 rows with all columns populated)
print("--- Merged Data (Head) ---")
display(final_df.head())

# Check tail of merged dataset (should show 2024 data with index_value as NaN)
print("\n--- Merged Data (Tail) ---")
display(final_df.tail())

--- Merged Data (Head) ---


| | year | state | index_value | index_conf_low | index_conf_high | annual_mean_temp | annual_precip_sum | annual_radiation_sum |
|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | Victoria | 1.0 | 1.0 | 1.0 | 14.75799 | 615.799988 | 5848.040039 |
| 1 | 2001 | Victoria | 0.851265 | 0.75391 | 0.95416 | 14.526712 | 564.200012 | 5662.220215 |
| 2 | 2002 | Victoria | 0.738937 | 0.623807 | 0.873159 | 14.698013 | 432.399994 | 5789.760254 |
| 3 | 2003 | Victoria | 0.728083 | 0.578651 | 0.924713 | 14.443562 | 571.0 | 5895.100098 |
| 4 | 2004 | Victoria | 0.647233 | 0.500643 | 0.853355 | 14.277344 | 621.200012 | 5806.970215 |



--- Merged Data (Tail) ---


| | year | state | index_value | index_conf_low | index_conf_high | annual_mean_temp | annual_precip_sum | annual_radiation_sum |
|---|---|---|---|---|---|---|---|---|
| 145 | 2020 | National | 0.305946 | 0.202542 | 0.489085 | 15.825681 | 890.140015 | 6209.967773 |
| 146 | 2021 | National | 0.331837 | 0.196334 | 0.582872 | 15.465296 | 883.320007 | 6234.124023 |
| 147 | 2022 | National | NaN | NaN | NaN | 15.654364 | 1109.099976 | 6152.21582 |
| 148 | 2023 | National | NaN | NaN | NaN | 16.01475 | 711.420044 | 6518.726074 |
| 149 | 2024 | National | NaN | NaN | NaN | 16.47016 | 703.160034 | 6553.233887 |
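
To confirm that the right merge behaves as intended, `pd.merge` offers two built-in aids: `indicator=True` adds a `_merge` column showing where each row came from, and `validate="one_to_one"` raises an error if either side has duplicate keys. The sketch below applies both to toy frames; `index_demo`, `weather_demo`, and the values in them are illustrative.

```python
import pandas as pd

# Toy index data ends in 2021; toy weather data extends to 2022.
index_demo = pd.DataFrame({
    "year": [2021],
    "state": ["National"],
    "index_value": [0.33],
})
weather_demo = pd.DataFrame({
    "year": [2021, 2022],
    "state": ["National", "National"],
    "annual_mean_temp": [15.5, 15.7],
})

merged_demo = pd.merge(
    index_demo,
    weather_demo,
    on=["year", "state"],
    how="right",
    validate="one_to_one",  # raise if (year, state) is duplicated on either side
    indicator=True,         # adds a "_merge" column: "both" or "right_only"
)

print(merged_demo)
print(merged_demo["_merge"].value_counts())
```

On the real data, the rows marked `right_only` should be exactly the 2022-2024 records; any other year in that group would point to a key mismatch, such as a differently spelled state name.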


### Step 11: Store Final Merged Data

We have now cleaned, aggregated, and merged the original species index data with the weather data acquired from the API.

Final step: save `final_df`, which covers 2000-2024 (with index columns populated through 2021), as a CSV file. This file is the input for the next analysis phase (the SARIMAX prediction model).

The code creates the output folder `02_wrangled_data` if it does not already exist. The path is relative to the notebook's working directory, so it resolves to `01_data_wrangling/02_wrangled_data` when the notebook runs from `01_data_wrangling`.

In [None]:
import os

# Define output folder (relative to the working directory) and filename
output_folder = '02_wrangled_data'
output_filename = 'Table14_TSX_Table_VIC_version3.csv'
full_filepath = os.path.join(output_folder, output_filename)

# Ensure output folder exists, create if not exists
os.makedirs(output_folder, exist_ok=True)

# Save final_df as CSV file
# index=False prevents pandas from writing DataFrame index to file, keeping data clean
final_df.to_csv(full_filepath, index=False)

print(f"🎉 Data successfully saved to: {full_filepath}")

🎉 Data successfully saved to: 02_wrangled_data\Table14_TSX_Table_VIC_version3.csv
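
A quick round-trip read is one way to confirm the saved file is usable downstream. The sketch below writes a toy frame to a temporary directory and reads it back, so it does not touch the real output file. The frame and the file name `demo.csv` are illustrative.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a missing index value, mimicking the 2022-2024 rows.
demo = pd.DataFrame({
    "year": [2021, 2022],
    "state": ["National", "National"],
    "index_value": [0.33, float("nan")],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.csv")
    demo.to_csv(path, index=False)  # same call pattern as the cell above
    reloaded = pd.read_csv(path)

print(reloaded)
print(reloaded.dtypes)
```

Missing values are written as empty fields and come back as `NaN`, so the 2022-2024 rows stay distinguishable from real index values when the modelling notebook loads the file.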
