- Time Series Data Preparation
- Reads weather data with dates as the index, recognizing it's temporal (time-based).
- Treats missing data carefully by analyzing null patterns and forward-filling when appropriate.
- Filters out unreliable columns (with too many missing values) to ensure model integrity.
- Index Handling & Time Features
- Converts the index to datetime to unlock powerful time-based functionality.
- Extracts features like year, month, and day-of-year — crucial for time series seasonality analysis.
- Feature Engineering for Forecasting
- Constructs a forecasting target by shifting the temperature data forward — predicting tomorrow using today’s data.
- Adds rolling averages to capture short-term trends and momentum in temperature and precipitation.
- Introduces percent change in rolling features — highlighting trend direction and volatility.
- Computes seasonal averages (monthly & day-of-year) using expanding window statistics to reflect cumulative climate behavior over time.
- Machine Learning on Time Series
- Implements Ridge Regression, a linear model that adds regularization to prevent overfitting.
- Excludes non-predictive or leakage-prone columns (
target
,station
,name
) from the input features.
- Backtesting: Simulated Real-World Forecasting
- Defines a custom backtesting loop: simulates how a model would perform in production by iteratively training on historical data and testing on the future.
- Ensures temporal integrity (i.e., never trains on future data).
- Stores prediction results and evaluates both accuracy and magnitude of error.
- Model Evaluation and Interpretation
- Uses mean absolute error (MAE) to measure forecast accuracy — a common metric in regression problems.
- Explores distribution of prediction error over time to identify model behavior and reliability.
- Sorts and filters high-error days to better understand when the model struggles (e.g., extreme weather events).
- Visualization for Insight
- Plots weather metrics (e.g., snow depth) and prediction error frequencies — essential for communicating patterns, anomalies, or performance degradation visually.
- Key Modeling Principles Reinforced
- Importance of data quality filtering before modeling.
- Use of time-aware features like lag, rolling stats, and expanding averages for better temporal modeling.
- Practice of modular thinking (via reusable functions like
backtest()
andcompute_rolling()
). - Insight into the limitations of simple models and when additional complexity might be warranted (e.g., nonlinear models, exogenous variables).