
# Drought Features: Weekly Aggregation for FIPS 6107

This notebook builds a **feature dataset for drought modeling** by aggregating weather variables over rolling windows, using only the **weekly-scored dates** as targets.

## What it does
- Loads a CSV of daily (or sub-daily) weather data with a `score` column available on **weekly** rows.
- Filters to **FIPS 6107**.
- For every weekly (scored) date, computes rolling-window aggregates:
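  - **Precipitation**: sums of `PRECTOT` over 7/30/90/180 days
  - **Temperature**: means of `T2M`, `T2M_MAX`, `T2M_MIN`, `T2M_RANGE`, and `TS` over 7/30/90/180 days, plus the `T2M_MAX` maximum over 7/30 days
  - **Humidity proxies**: means of `QV2M`, `T2MDEW`, and `T2MWET` over 7/30/90 days
  - **Wind** (10 m and 50 m): mean speed, maximum of the daily maxima, and means of the daily minimum and daily range, over 7/30 days
  - **Pressure**: means of `PS` over 7/30 days
  - **Derived metrics**: precipitation deficits, temperature anomalies, and VPD proxies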
- Exports a new dataset with one row per weekly (scored) date.

> Tip: If you want to change the FIPS or add/remove features, look for the **Parameters** and **Feature Engineering** sections below.



## Parameters
Update these to point to your input data and desired output file name.


In [54]:

import pandas as pd
import numpy as np

# --- File paths ---
# To generate feature files for all three splits, run the notebook once per split
# (train, test, validation), uncommenting the matching pair of paths below.

# INPUT_CSV = '../data/archive/train_timeseries/train_timeseries.csv'        
# OUTPUT_CSV = '../data/generated_data/fips_6107_drought_features_train.csv'

# INPUT_CSV = '../data/archive/test_timeseries/test_timeseries.csv'        
# OUTPUT_CSV = '../data/generated_data/fips_6107_drought_features_test.csv'

INPUT_CSV = '../data/archive/validation_timeseries/validation_timeseries.csv'        
OUTPUT_CSV = '../data/generated_data/fips_6107_drought_features_validation.csv'


# --- Target FIPS ---
TARGET_FIPS = 6107  # Tulare County, CA (FIPS 06107; the leading zero is dropped when read as an integer)
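
Instead of commenting path pairs in and out, the same three pairs can be derived from a split name. This is a minimal sketch that assumes the directory layout shown in the paths above; `paths_for_split` and its `data_root` argument are hypothetical names, not part of the original notebook.

```python
def paths_for_split(split, fips, data_root="../data"):
    """Return (input_csv, output_csv) for 'train', 'test', or 'validation'."""
    if split not in {"train", "test", "validation"}:
        raise ValueError(f"unknown split: {split!r}")
    # Mirrors the folder and file naming used in the Parameters cell
    input_csv = f"{data_root}/archive/{split}_timeseries/{split}_timeseries.csv"
    output_csv = f"{data_root}/generated_data/fips_{fips}_drought_features_{split}.csv"
    return input_csv, output_csv


example_input, example_output = paths_for_split("validation", 6107)
```

Looping over the three split names with this helper would produce all three output files in one run.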



## Load and prepare the data
- Reads the input CSV
- Parses the `date` column to datetime
- Filters to the target **FIPS**
- Sorts by date to ensure correct rolling-window logic


In [None]:
# Load your data
df = pd.read_csv(INPUT_CSV)

# Convert date to datetime
df['date'] = pd.to_datetime(df['date'])

# Filter for TARGET_FIPS only
df_filtered = df[df['fips'] == TARGET_FIPS].copy()

# Sort by date to ensure proper rolling calculations
df_filtered = df_filtered.sort_values('date').reset_index(drop=True)

# Identify rows with scores (weekly observations)
scored_dates = df_filtered[df_filtered['score'].notna()]['date'].values

print(f"Total rows in input: {len(df):,}")
print(f"Rows for FIPS {TARGET_FIPS}: {len(df_filtered):,}")
print(f"Weekly (scored) dates found: {len(scored_dates):,}")
df_filtered.head()


Total rows in input: 2,268,840
Rows for FIPS 6107: 730
Weekly (scored) dates found: 104


Unnamed: 0,fips,date,PRECTOT,PS,QV2M,T2M,T2MDEW,T2MWET,T2M_MAX,T2M_MIN,...,TS,WS10M,WS10M_MAX,WS10M_MIN,WS10M_RANGE,WS50M,WS50M_MAX,WS50M_MIN,WS50M_RANGE,score
0,6107,2017-01-01,0.02,85.32,3.4,0.04,-3.87,-3.83,4.02,-5.14,...,-1.68,1.52,2.76,0.3,2.46,1.98,3.17,0.37,2.8,
1,6107,2017-01-02,0.0,85.75,2.82,-1.62,-5.95,-5.9,2.27,-4.92,...,-2.34,1.34,2.17,0.76,1.41,1.98,2.87,0.83,2.04,
2,6107,2017-01-03,2.21,86.07,4.0,1.27,-1.49,-1.48,5.76,-1.45,...,-0.18,1.2,2.14,0.46,1.68,1.73,2.8,0.61,2.19,4.957
3,6107,2017-01-04,35.05,85.91,5.45,2.67,2.67,2.67,4.99,0.5,...,2.2,2.06,2.55,1.39,1.16,3.58,4.34,2.67,1.67,
4,6107,2017-01-05,15.63,85.64,5.12,1.88,1.3,1.31,5.25,-4.74,...,1.16,2.48,3.93,1.12,2.81,3.68,5.08,1.81,3.26,
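
The load step above reads all 2,268,840 rows into memory and then keeps 730 of them. If memory is tight, one option is to filter while reading. The sketch below assumes the same `fips` and `date` columns; `load_fips` is a hypothetical helper, and the in-memory CSV holds made-up values so the example runs without the real file.

```python
import io

import pandas as pd


def load_fips(csv_source, fips, chunksize=100_000):
    """Read a CSV in chunks and keep only the rows for one FIPS code."""
    kept = []
    for chunk in pd.read_csv(csv_source, parse_dates=["date"], chunksize=chunksize):
        kept.append(chunk[chunk["fips"] == fips])
    out = pd.concat(kept, ignore_index=True)
    # Sort by date so trailing-window logic sees rows in time order
    return out.sort_values("date").reset_index(drop=True)


# Tiny stand-in for the real file (values are illustrative only)
toy_csv = io.StringIO(
    "fips,date,PRECTOT,score\n"
    "6107,2017-01-02,0.0,\n"
    "1001,2017-01-01,1.5,\n"
    "6107,2017-01-01,0.02,\n"
    "6107,2017-01-03,2.21,4.957\n"
)
toy = load_fips(toy_csv, fips=6107, chunksize=2)
```

The result has the same shape as `df_filtered`, so the rest of the notebook could use it unchanged.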



## Feature engineering on weekly (scored) dates
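For each **target (weekly)** date, compute aggregates over trailing windows of **7, 30, 90, and 180 days** (some feature families use only a subset). Each window includes the target date and excludes its lower bound, so the 7-day window covers the target date and the six days before it. The result is a new table with one row per weekly date.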
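
The window bounds are easiest to see on a toy date series. This sketch uses made-up dates and the same mask expression as the feature loop; it also shows that a target date near the start of the series gets a shorter window than requested.

```python
import pandas as pd

dates = pd.Series(pd.date_range("2017-01-01", periods=14, freq="D"))
target_date = pd.Timestamp("2017-01-10")

# Upper bound inclusive, lower bound exclusive: (target - 7 days, target]
mask_7d = (dates <= target_date) & (dates > target_date - pd.Timedelta(days=7))
window = dates[mask_7d]  # 2017-01-04 through 2017-01-10, seven daily rows

# A target close to the start of the data has fewer rows available
early_date = pd.Timestamp("2017-01-03")
mask_early = (dates <= early_date) & (dates > early_date - pd.Timedelta(days=7))
early_window = dates[mask_early]  # only 2017-01-01 through 2017-01-03
```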


In [56]:

# Create the new dataset - start with scored weeks only
result_rows = []

for target_date in scored_dates:
    target_date = pd.Timestamp(target_date)
    
    # Get the score for this week
    score_value = df_filtered.loc[df_filtered['date'] == target_date, 'score'].values[0]
    
    # Create masks for different rolling windows ending on target_date (exclusive of the lower bound)
    mask_7d = (df_filtered['date'] <= target_date) & (df_filtered['date'] > target_date - pd.Timedelta(days=7))
    mask_30d = (df_filtered['date'] <= target_date) & (df_filtered['date'] > target_date - pd.Timedelta(days=30))
    mask_90d = (df_filtered['date'] <= target_date) & (df_filtered['date'] > target_date - pd.Timedelta(days=90))
    mask_180d = (df_filtered['date'] <= target_date) & (df_filtered['date'] > target_date - pd.Timedelta(days=180))
    
    # Initialize row with basic info
    row = {'fips': TARGET_FIPS, 'date': target_date, 'score': score_value}
    
    # PRECIPITATION FEATURES - Sum over windows
    row['prec_sum_7d'] = df_filtered.loc[mask_7d, 'PRECTOT'].sum()
    row['prec_sum_30d'] = df_filtered.loc[mask_30d, 'PRECTOT'].sum()
    row['prec_sum_90d'] = df_filtered.loc[mask_90d, 'PRECTOT'].sum()
    row['prec_sum_180d'] = df_filtered.loc[mask_180d, 'PRECTOT'].sum()
    
    # TEMPERATURE FEATURES - Mean over windows
    for period, mask in [('7d', mask_7d), ('30d', mask_30d), ('90d', mask_90d), ('180d', mask_180d)]:
        row[f't2m_mean_{period}'] = df_filtered.loc[mask, 'T2M'].mean()
        row[f't2m_max_mean_{period}'] = df_filtered.loc[mask, 'T2M_MAX'].mean()
        row[f't2m_min_mean_{period}'] = df_filtered.loc[mask, 'T2M_MIN'].mean()
        row[f't2m_range_mean_{period}'] = df_filtered.loc[mask, 'T2M_RANGE'].mean()
        row[f'ts_mean_{period}'] = df_filtered.loc[mask, 'TS'].mean()
    
    # TEMPERATURE MAX for heatwave detection (7d and 30d only)
    row['t2m_max_7d'] = df_filtered.loc[mask_7d, 'T2M_MAX'].max()
    row['t2m_max_30d'] = df_filtered.loc[mask_30d, 'T2M_MAX'].max()
    
    # HUMIDITY PROXIES - Mean over 7/30/90d
    for period, mask in [('7d', mask_7d), ('30d', mask_30d), ('90d', mask_90d)]:
        row[f'qv2m_mean_{period}'] = df_filtered.loc[mask, 'QV2M'].mean()
        row[f't2mdew_mean_{period}'] = df_filtered.loc[mask, 'T2MDEW'].mean()
        row[f't2mwet_mean_{period}'] = df_filtered.loc[mask, 'T2MWET'].mean()
    
    # WIND FEATURES - Mean and Max over 7/30d (evaporative demand)
    for period, mask in [('7d', mask_7d), ('30d', mask_30d)]:
        # 10m wind
        row[f'ws10m_mean_{period}'] = df_filtered.loc[mask, 'WS10M'].mean()
        row[f'ws10m_max_{period}'] = df_filtered.loc[mask, 'WS10M_MAX'].max()
        row[f'ws10m_min_mean_{period}'] = df_filtered.loc[mask, 'WS10M_MIN'].mean()
        row[f'ws10m_range_mean_{period}'] = df_filtered.loc[mask, 'WS10M_RANGE'].mean()
        
        # 50m wind
        row[f'ws50m_mean_{period}'] = df_filtered.loc[mask, 'WS50M'].mean()
        row[f'ws50m_max_{period}'] = df_filtered.loc[mask, 'WS50M_MAX'].max()
        row[f'ws50m_min_mean_{period}'] = df_filtered.loc[mask, 'WS50M_MIN'].mean()
        row[f'ws50m_range_mean_{period}'] = df_filtered.loc[mask, 'WS50M_RANGE'].mean()
    
    # PRESSURE FEATURES - Mean over 7/30d
    row['ps_mean_7d'] = df_filtered.loc[mask_7d, 'PS'].mean()
    row['ps_mean_30d'] = df_filtered.loc[mask_30d, 'PS'].mean()
    
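    # ADDITIONAL DROUGHT-RELEVANT FEATURES
    # Precipitation deficit: recent total minus the longer window's total scaled
    # to the same length. Positive values mean the recent window was wetter than
    # the longer-term average; negative values indicate a deficit.
    row['prec_deficit_30v90d'] = row['prec_sum_30d'] - (row['prec_sum_90d'] / 3)    # 90 / 30 = 3
    row['prec_deficit_7v30d'] = row['prec_sum_7d'] - (row['prec_sum_30d'] / 4.3)    # 30 / 7 is about 4.3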
    
    # Temperature anomaly (recent vs longer-term)
    row['temp_anomaly_7v90d'] = row['t2m_mean_7d'] - row['t2m_mean_90d']
    row['temp_anomaly_30v180d'] = row['t2m_mean_30d'] - row['t2m_mean_180d']
    
    # Vapor pressure deficit proxy (dew point depression: air temperature minus dew point)
    row['vpd_proxy_7d'] = row['t2m_mean_7d'] - row['t2mdew_mean_7d']
    row['vpd_proxy_30d'] = row['t2m_mean_30d'] - row['t2mdew_mean_30d']
    
    result_rows.append(row)

# Create final dataframe
df_drought_features = pd.DataFrame(result_rows)
print(f"Dataset created with {len(df_drought_features)} weekly observations")
print(f"Total columns (including fips, date, score): {len(df_drought_features.columns)}")
df_drought_features.head()


Dataset created with 104 weekly observations
Total columns (including fips, date, score): 62


Unnamed: 0,fips,date,score,prec_sum_7d,prec_sum_30d,prec_sum_90d,prec_sum_180d,t2m_mean_7d,t2m_max_mean_7d,t2m_min_mean_7d,...,ws50m_min_mean_30d,ws50m_range_mean_30d,ps_mean_7d,ps_mean_30d,prec_deficit_30v90d,prec_deficit_7v30d,temp_anomaly_7v90d,temp_anomaly_30v180d,vpd_proxy_7d,vpd_proxy_30d
0,6107,2017-01-03,4.957,2.23,2.23,2.23,2.23,-0.103333,4.016667,-3.836667,...,0.603333,2.343333,85.713333,85.713333,1.486667,1.711395,0.0,0.0,3.666667,3.666667
1,6107,2017-01-10,4.0,134.65,136.88,136.88,136.88,3.47,6.834286,-0.364286,...,1.544,3.299,86.044286,85.945,91.253333,102.817442,1.072,0.0,2.238571,2.667
2,6107,2017-01-17,3.912,34.05,170.93,170.93,170.93,2.084286,8.041429,-2.647143,...,1.181765,3.918235,85.787143,85.88,113.953333,-5.701163,-0.184538,0.0,4.372857,3.369412
3,6107,2017-01-24,2.9203,82.1,253.03,253.03,253.03,0.312857,3.615714,-2.37,...,1.327917,4.162917,85.24,85.693333,168.686667,23.255814,-1.385476,0.0,2.235714,3.03875
4,6107,2017-01-31,2.9214,0.0,253.01,253.03,253.03,3.591429,10.512857,-1.478571,...,1.235,4.106667,86.682857,85.936667,168.666667,-58.839535,1.465622,0.069527,9.631429,4.548
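
The two precipitation deficit columns can be checked by hand against the preview. The sketch below reproduces the 2017-01-31 row from its window sums; `prec_deficit` is a hypothetical helper that restates the notebook's formula. Note the sign convention: a negative value means the recent window was drier than the longer-term average.

```python
def prec_deficit(recent_sum, longer_sum, ratio):
    """Recent-window total minus the longer window's total scaled to the same length."""
    return recent_sum - longer_sum / ratio


# Window sums for 2017-01-31 taken from the preview above
prec_sum_7d, prec_sum_30d, prec_sum_90d = 0.0, 253.01, 253.03

d_30v90 = prec_deficit(prec_sum_30d, prec_sum_90d, 3)   # 90 days / 30 days = 3
d_7v30 = prec_deficit(prec_sum_7d, prec_sum_30d, 4.3)   # 30 days / 7 days, rounded to 4.3

print(round(d_30v90, 6), round(d_7v30, 6))
```

Both values agree with the `prec_deficit_30v90d` and `prec_deficit_7v30d` entries shown for that row.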



## Export results
Saves the engineered features to CSV for downstream modeling.


In [57]:

# Save to CSV
df_drought_features.to_csv(OUTPUT_CSV, index=False)

print(f"Saved features to: {OUTPUT_CSV}")
print(f"Columns ({len(df_drought_features.columns)}):")
print(df_drought_features.columns.tolist())


Saved features to: ../data/generated_data/fips_6107_drought_features_validation.csv
Columns (62):
['fips', 'date', 'score', 'prec_sum_7d', 'prec_sum_30d', 'prec_sum_90d', 'prec_sum_180d', 't2m_mean_7d', 't2m_max_mean_7d', 't2m_min_mean_7d', 't2m_range_mean_7d', 'ts_mean_7d', 't2m_mean_30d', 't2m_max_mean_30d', 't2m_min_mean_30d', 't2m_range_mean_30d', 'ts_mean_30d', 't2m_mean_90d', 't2m_max_mean_90d', 't2m_min_mean_90d', 't2m_range_mean_90d', 'ts_mean_90d', 't2m_mean_180d', 't2m_max_mean_180d', 't2m_min_mean_180d', 't2m_range_mean_180d', 'ts_mean_180d', 't2m_max_7d', 't2m_max_30d', 'qv2m_mean_7d', 't2mdew_mean_7d', 't2mwet_mean_7d', 'qv2m_mean_30d', 't2mdew_mean_30d', 't2mwet_mean_30d', 'qv2m_mean_90d', 't2mdew_mean_90d', 't2mwet_mean_90d', 'ws10m_mean_7d', 'ws10m_max_7d', 'ws10m_min_mean_7d', 'ws10m_range_mean_7d', 'ws50m_mean_7d', 'ws50m_max_7d', 'ws50m_min_mean_7d', 'ws50m_range_mean_7d', 'ws10m_mean_30d', 'ws10m_max_30d', 'ws10m_min_mean_30d', 'ws10m_range_mean_30d', 'ws50m_mean_30


## Quick QA checks
Missing values, dtypes, and a preview.


In [58]:

print("\nDataset info:")
df_drought_features.info()  # info() prints its report itself and returns None

print("\nMissing values (total):", df_drought_features.isnull().sum().sum())

# Peek
df_drought_features.head(10)



Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 62 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   fips                  104 non-null    int64         
 1   date                  104 non-null    datetime64[ns]
 2   score                 104 non-null    float64       
 3   prec_sum_7d           104 non-null    float64       
 4   prec_sum_30d          104 non-null    float64       
 5   prec_sum_90d          104 non-null    float64       
 6   prec_sum_180d         104 non-null    float64       
 7   t2m_mean_7d           104 non-null    float64       
 8   t2m_max_mean_7d       104 non-null    float64       
 9   t2m_min_mean_7d       104 non-null    float64       
 10  t2m_range_mean_7d     104 non-null    float64       
 11  ts_mean_7d            104 non-null    float64       
 12  t2m_mean_30d          104 non-null    float64       
 13  t2m_

Unnamed: 0,fips,date,score,prec_sum_7d,prec_sum_30d,prec_sum_90d,prec_sum_180d,t2m_mean_7d,t2m_max_mean_7d,t2m_min_mean_7d,...,ws50m_min_mean_30d,ws50m_range_mean_30d,ps_mean_7d,ps_mean_30d,prec_deficit_30v90d,prec_deficit_7v30d,temp_anomaly_7v90d,temp_anomaly_30v180d,vpd_proxy_7d,vpd_proxy_30d
0,6107,2017-01-03,4.957,2.23,2.23,2.23,2.23,-0.103333,4.016667,-3.836667,...,0.603333,2.343333,85.713333,85.713333,1.486667,1.711395,0.0,0.0,3.666667,3.666667
1,6107,2017-01-10,4.0,134.65,136.88,136.88,136.88,3.47,6.834286,-0.364286,...,1.544,3.299,86.044286,85.945,91.253333,102.817442,1.072,0.0,2.238571,2.667
2,6107,2017-01-17,3.912,34.05,170.93,170.93,170.93,2.084286,8.041429,-2.647143,...,1.181765,3.918235,85.787143,85.88,113.953333,-5.701163,-0.184538,0.0,4.372857,3.369412
3,6107,2017-01-24,2.9203,82.1,253.03,253.03,253.03,0.312857,3.615714,-2.37,...,1.327917,4.162917,85.24,85.693333,168.686667,23.255814,-1.385476,0.0,2.235714,3.03875
4,6107,2017-01-31,2.9214,0.0,253.01,253.03,253.03,3.591429,10.512857,-1.478571,...,1.235,4.106667,86.682857,85.936667,168.666667,-58.839535,1.465622,0.069527,9.631429,4.548
5,6107,2017-02-07,2.34,69.58,215.38,322.61,322.61,5.907143,9.331429,2.691429,...,0.957667,4.304,85.991429,85.934667,107.843333,19.491628,3.084774,0.192632,3.45,4.686667
6,6107,2017-02-14,1.3605,20.63,172.31,343.24,343.24,6.912857,12.482857,2.491429,...,1.004333,3.931667,86.164286,86.015333,57.896667,-19.442093,3.45419,0.612667,4.948571,5.292333
7,6107,2017-02-21,0.451,47.99,148.18,391.23,391.23,5.934286,9.675714,2.361429,...,0.912,3.916333,85.615714,86.056667,17.77,13.529535,2.142363,1.293077,3.788571,5.220333
8,6107,2017-02-28,0.0,1.62,139.82,392.85,392.85,0.262857,5.21,-3.467143,...,0.881333,3.686,85.778571,85.919333,8.87,-30.896279,-3.110363,1.739446,4.132857,4.840667
9,6107,2017-03-07,0.0,9.2,131.21,402.05,402.05,4.057143,10.832857,-2.074286,...,0.905667,4.08,86.188571,85.937333,-2.806667,-21.313953,0.611385,0.912576,6.911429,4.658667
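
One thing the missing-value count does not catch: in the preview above, the first rows have identical `prec_sum_30d`, `prec_sum_90d`, and `prec_sum_180d` values, because the series starts on 2017-01-01 and the longer windows contain only the few days available. A possible extra check is to count the days in each window and flag incomplete ones. This is a sketch that assumes one row per day; `window_day_counts` is a hypothetical helper and the dates are illustrative.

```python
import pandas as pd


def window_day_counts(daily_dates, target_dates, windows=(7, 30, 90, 180)):
    """Count the daily rows that fall in each trailing window (t - n days, t]."""
    daily_dates = pd.Series(pd.to_datetime(daily_dates))
    rows = []
    for t in pd.to_datetime(target_dates):
        row = {"date": t}
        for n in windows:
            in_window = (daily_dates <= t) & (daily_dates > t - pd.Timedelta(days=n))
            n_days = int(in_window.sum())
            row[f"n_days_{n}d"] = n_days
            row[f"complete_{n}d"] = n_days >= n  # True only for a full window
        rows.append(row)
    return pd.DataFrame(rows)


daily = pd.date_range("2017-01-01", "2017-12-31", freq="D")
targets = ["2017-01-03", "2017-01-31", "2017-07-04"]
coverage = window_day_counts(daily, targets)
```

Rows with an incomplete window could be dropped or masked before modeling, depending on how much early history the model can afford to lose.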



## Notes & Next Steps
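- The per-date loop rescans the county's rows for every window, which is fine for 730 rows but slow on large inputs. If speed matters, set `date` as the index and use time-based `pd.Series.rolling` (for example `rolling('7D')`) to compute the trailing windows in one pass.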
- You can generalize this notebook to loop over **multiple FIPS** and then concatenate results.
- For modeling, consider normalizing/standardizing features and adding **lag** versions of key metrics.
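
The first two notes can be combined: compute the windows with time-based `rolling` and apply the same function to each county. The sketch below covers only two feature families on made-up data; `rolling_features` is a hypothetical helper, and the second FIPS code is an arbitrary placeholder. With a sorted `DatetimeIndex`, `rolling('7D')` uses the window `(t - 7 days, t]` by default, which matches the masks in the loop.

```python
import numpy as np
import pandas as pd


def rolling_features(daily, windows=(7, 30)):
    """Trailing-window precipitation sums and temperature means for one county."""
    daily = daily.sort_values("date").set_index("date")
    out = pd.DataFrame(index=daily.index)
    for n in windows:
        # Offset windows are closed on the right: (t - n days, t]
        out[f"prec_sum_{n}d"] = daily["PRECTOT"].rolling(f"{n}D").sum()
        out[f"t2m_mean_{n}d"] = daily["T2M"].rolling(f"{n}D").mean()
    out["score"] = daily["score"]
    # Keep only the weekly (scored) dates
    return out[out["score"].notna()].reset_index()


# Toy daily data for two counties (values are illustrative only)
dates = pd.date_range("2017-01-01", periods=60, freq="D")
frames = []
for fips in (6107, 6029):
    score = np.where(np.arange(60) % 7 == 2, 1.0, np.nan)  # one score per week
    frames.append(pd.DataFrame({
        "fips": fips,
        "date": dates,
        "PRECTOT": 1.0,
        "T2M": np.arange(60, dtype=float),
        "score": score,
    }))
toy = pd.concat(frames, ignore_index=True)

# One feature table per county, then concatenate
features = pd.concat(
    [rolling_features(group).assign(fips=fips) for fips, group in toy.groupby("fips")],
    ignore_index=True,
)
```

Extending the helper to the remaining feature families would follow the same pattern, one `rolling` call per column and window.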
