<a href="https://colab.research.google.com/github/elebon26/DN9-467-final-project/blob/main/Final_MaxMatteucci_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Max Matteucci DIVE Analysis**

## 🌡️ Introduction

This analysis investigates the question: **“Which features most influence short-term temperature changes in live data?”** 🔍

Using the real-time weather pipeline and BigQuery ML model outputs, I examine how current atmospheric conditions 🌬️💧 and time-based factors ⏰📅 contribute to next-hour temperature changes across major U.S. cities 🏙️🇺🇸. The goal is to move beyond raw temperature trends and identify which inputs meaningfully drive short-horizon variation, particularly in a streaming context where conditions update continuously ⚡📈.

This question directly motivates the feature engineering 🧠, model evaluation 📊, and explanation steps 🔎 that follow, and it informs the temperature volatility metrics and explanatory visuals surfaced in the Looker Studio dashboard 📺✨.


## 💬 Prompts (Gemini Log)

Below are prompts I used to generate and refine the SQL + analysis workflow for this project (BigQuery + Looker Studio). 🧠📊

### 🧱 1) Build the training dataset (feature engineering + label)
**Prompt:**
Write BigQuery SQL to create a training table for next-hour temperature forecasting.  
Data source: `finalprojectfor467.weather_proj.historical_weather`  
Target label: `temp_plus_1h` (use LEAD over time per city).  
Features: temperature, humidity, pressure, wind_speed, wind_direction, and time features (hour_of_day, day_of_week, month).  
Filter out rows where `temp_plus_1h` is NULL.

### 🤖 2) Train the BQML model
**Prompt:**
Write `CREATE OR REPLACE MODEL` BigQuery ML SQL for a linear regression model named  
`finalprojectfor467.weather_proj.temp_forecast_model`  
Predict `temp_plus_1h` using my engineered feature columns from `finalprojectfor467.weather_proj.temp_training`.

### ✅ 3) Evaluate the model (metrics for the README + report)
**Prompt:**
Write BigQuery SQL using `ML.EVALUATE` to evaluate `finalprojectfor467.weather_proj.temp_forecast_model` on `finalprojectfor467.weather_proj.temp_training`.  
Return MAE, MSE/RMSE, and R². Also explain in 2–3 sentences how to interpret MAE and R² for next-hour forecasting.

### 🌐 4) Predict on live streaming data (create a predictions table)
**Prompt:**
Write BigQuery SQL to create `finalprojectfor467.weather_proj.temp_predictions_live` using `ML.PREDICT`.  
Use `finalprojectfor467.weather_proj.live_weather` as input and include:
ts, city, predicted_temp_plus_1h  
Make sure hour_of_day/day_of_week/month are computed from ts.

### 🔎 5) Explain predictions on live data (feature attribution)
**Prompt:**
Write BigQuery SQL using `ML.EXPLAIN_PREDICT` to explain the latest live prediction for a selected city (parameterized city filter).  
Return the prediction value plus top feature attributions.  
Also suggest how to store the explanation output into a BigQuery table for dashboard use.

### 🧽 6) Debugging + data quality checks
**Prompt:**
My Looker Studio chart is showing spikes and zeros. Give me BigQuery SQL checks to:
- count rows per minute per city
- detect null temperature rows
- detect duplicated timestamps
- confirm temperature distribution is realistic
Then suggest the most likely reason the chart looks like vertical lines.

### 📈 7) Create dashboard-ready aggregates
**Prompt:**
Write BigQuery SQL to create a table of hourly averages per city from live_weather:
- hour bucket (TIMESTAMP_TRUNC to hour)
- avg_temp
- temp_volatility (STDDEV or AVG absolute hourly change)
- record_count  
This table should be optimized for Looker Studio charts.

### 📝 8) Write project documentation (README + ops)
**Prompt:**
Write a concise README section in first person describing my streaming pipeline:
Cloud Run/Function publisher → Pub/Sub → Dataflow template → BigQuery → Looker Studio  
Include what tables are produced and how to reproduce the pipeline. Use clear headers and emojis.


In [1]:
from google.colab import auth
auth.authenticate_user()


In [2]:
!pip install google-cloud-bigquery




In [8]:
from google.cloud import bigquery

PROJECT_ID = "finalprojectfor467"  # GCP project that hosts the weather_proj dataset
client = bigquery.Client(project=PROJECT_ID)

print("Connected to BigQuery")


Connected to BigQuery


Validation Query (Ensuring Data is Live)

In [15]:
query = """
SELECT
  city,
  ts,
  temperature
FROM `finalprojectfor467.weather_proj.live_weather`
WHERE temperature IS NOT NULL
ORDER BY ts DESC
LIMIT 10
"""
pred_df = client.query(query).to_dataframe()
pred_df

| | city | ts | temperature |
|---|---|---|---|
| 0 | Las Vegas | 2025-12-13 02:34:03.247830+00:00 | 14.4 |
| 1 | Albuquerque | 2025-12-13 02:34:02.915584+00:00 | 10.5 |
| 2 | San Diego | 2025-12-13 02:34:02.764806+00:00 | 14.7 |
| 3 | Denver | 2025-12-13 02:34:02.429415+00:00 | 9.3 |
| 4 | Los Angeles | 2025-12-13 02:34:02.279098+00:00 | 15.0 |
| 5 | Phoenix | 2025-12-13 02:34:01.935439+00:00 | 18.0 |
| 6 | Las Vegas | 2025-12-13 02:34:01.441526+00:00 | 14.4 |
| 7 | San Diego | 2025-12-13 02:34:00.958272+00:00 | 14.7 |
| 8 | San Francisco | 2025-12-13 02:34:00.777410+00:00 | 7.3 |
| 9 | Los Angeles | 2025-12-13 02:34:00.477506+00:00 | 15.0 |


## 🔍 D: Discover

While exploring the live weather data streamed into BigQuery and visualized in Looker Studio, I observed that short-term temperature changes varied significantly across cities and time periods. Some cities exhibited relatively stable temperature patterns over consecutive hours, while others showed noticeable fluctuations within short time windows.

In addition, temperature changes did not appear to be driven solely by the current temperature level. In several cases, similar temperature readings were followed by different next-hour outcomes, suggesting that other atmospheric variables or temporal factors may be influencing short-term temperature behavior.

These observations motivated a deeper investigation into which features most strongly influence next-hour temperature changes in live streaming weather data.


In [20]:
# 🔍 D: Discover
# Simple snapshot of recent live temperatures to observe variability across cities

from google.cloud import bigquery

PROJECT_ID = "finalprojectfor467"
client = bigquery.Client(project=PROJECT_ID)

query = """
SELECT
  city,
  ts,
  temperature
FROM `finalprojectfor467.weather_proj.live_weather`
WHERE temperature IS NOT NULL
ORDER BY ts DESC
LIMIT 50
"""

df_describe = client.query(query).to_dataframe()
df_describe


| | city | ts | temperature |
|---|---|---|---|
| 0 | Las Vegas | 2025-12-13 02:34:03.247830+00:00 | 14.4 |
| 1 | Albuquerque | 2025-12-13 02:34:02.915584+00:00 | 10.5 |
| 2 | San Diego | 2025-12-13 02:34:02.764806+00:00 | 14.7 |
| 3 | Denver | 2025-12-13 02:34:02.429415+00:00 | 9.3 |
| 4 | Los Angeles | 2025-12-13 02:34:02.279098+00:00 | 15.0 |
| 5 | Phoenix | 2025-12-13 02:34:01.935439+00:00 | 18.0 |
| 6 | Las Vegas | 2025-12-13 02:34:01.441526+00:00 | 14.4 |
| 7 | San Diego | 2025-12-13 02:34:00.958272+00:00 | 14.7 |
| 8 | San Francisco | 2025-12-13 02:34:00.777410+00:00 | 7.3 |
| 9 | Los Angeles | 2025-12-13 02:34:00.477506+00:00 | 15.0 |


This query provides a snapshot of recent live temperature readings across cities, allowing for a qualitative comparison. Temperature levels differ noticeably between cities (from 7.3 °C in San Francisco to 18.0 °C in Phoenix in the rows shown), while repeated readings for the same city taken a few seconds apart are identical. Judging which cities change more quickly therefore requires a longer time window, which motivates a deeper quantitative investigation into the drivers of next-hour temperature behavior.

## 🔎 I: Investigate

> 🔬 **Investigation Focus**
>
> To investigate which factors most strongly influence short-term temperature changes, I analyzed both historical and live streaming weather data stored in BigQuery. I focused on how current atmospheric conditions relate to next-hour temperature outcomes across cities.
>
> I engineered time-based features such as hour of day, day of week, and month, and examined their interaction with key weather variables including humidity, pressure, and wind speed. Using city-level aggregations and correlation analysis, I evaluated whether temperature volatility patterns were consistent across locations or driven by localized effects.
>
> This investigation directly informed feature selection for the downstream machine learning model and guided how prediction behavior and feature importance were interpreted in the Looker Studio dashboard.


In [19]:
# 🔎 I: Investigate
# Analyze which features most influence next-hour temperature changes
# using city-level aggregation and correlation analysis in BigQuery.

from google.cloud import bigquery

PROJECT_ID = "finalprojectfor467"
client = bigquery.Client(project=PROJECT_ID)

query = """
WITH lagged AS (
  SELECT
    city,
    ts,
    temperature,
    humidity,
    wind_speed,
    pressure,
    LAG(temperature) OVER (
      PARTITION BY city
      ORDER BY ts
    ) AS prev_temp
  FROM `finalprojectfor467.weather_proj.live_weather`
),

deltas AS (
  SELECT
    city,
    temperature,
    humidity,
    wind_speed,
    pressure,
    temperature - prev_temp AS temp_change
  FROM lagged
  WHERE prev_temp IS NOT NULL
)

SELECT
  city,
  COUNT(*) AS observations,
  AVG(ABS(temp_change)) AS avg_hourly_temp_change,
  CORR(temperature, temperature + temp_change) AS temp_to_next_temp_corr,
  CORR(humidity, temp_change) AS humidity_corr,
  CORR(wind_speed, temp_change) AS wind_speed_corr
FROM deltas
GROUP BY city
ORDER BY avg_hourly_temp_change DESC
"""

df = client.query(query).to_dataframe()
df


| | city | observations | avg_hourly_temp_change | temp_to_next_temp_corr | humidity_corr | wind_speed_corr |
|---|---|---|---|---|---|---|
| 0 | Los Angeles | 20222 | 0.00143 | 0.99998 | 0.003527 | -0.01968 |
| 1 | Denver | 19780 | 0.001227 | 0.99997 | 0.014896 | -0.011426 |
| 2 | Phoenix | 19887 | 0.001088 | 0.999985 | 0.018364 | -0.000372 |
| 3 | Albuquerque | 19664 | 0.001016 | 0.999988 | 0.021485 | -0.023167 |
| 4 | Las Vegas | 19981 | 0.001 | 0.999984 | 0.020382 | 0.012713 |
| 5 | San Diego | 20104 | 0.000925 | 0.99997 | 0.002294 | -0.009598 |
| 6 | San Francisco | 20345 | 0.00081 | 0.999938 | 0.004226 | -0.00235 |
| 7 | Portland | 20465 | 0.000313 | 0.999956 | -0.011551 | 0.00602 |
| 8 | Seattle | 20569 | 0.000269 | 0.999947 | 0.008579 | -0.003637 |
| 9 | Vancouver | 19555 | 0.000174 | 0.999953 | 0.026238 | 0.011724 |


The results show that the correlation between temperature readings is extremely high (above 0.9999 for every city), as expected when readings arrive only seconds apart. For the same reason, the `avg_hourly_temp_change` column is really the average change between consecutive readings rather than between hours, which is why its values are on the order of 0.001 °C. Note also that `temp_to_next_temp_corr` correlates the current reading with a linear extrapolation (`temperature + temp_change`), not with an observed next reading. What I found most interesting was how the magnitude of these changes differs by city: cities close to the ocean or in the Pacific Northwest (San Diego, San Francisco, Portland, Seattle, Vancouver) had smaller swings than inland cities such as Denver, Phoenix, Albuquerque, and Las Vegas, with Los Angeles a notable exception at the top of the list. This suggests that coastal cities may be easier to forecast in the short term.

Humidity and wind speed both show correlations with the reading-to-reading temperature change that are close to zero (|r| < 0.03 in every city). Humidity is weakly positive in nine of the ten cities, while wind speed is mixed in sign. At this sampling resolution, neither variable shows a measurable linear relationship with the temperature change.

Together, these patterns suggest that current temperature dominates short-horizon forecasting, and that reading-to-reading correlations are too fine-grained to explain why similar temperatures can lead to different next-hour outcomes. These findings motivate a deeper investigation into feature importance at the hourly horizon using machine learning techniques.
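One way to examine these relationships at the hourly horizon is to difference hourly averages rather than raw readings, then correlate each feature with the resulting change. The sketch below does this in plain Python so it runs without BigQuery access. The `pearson` and `next_hour_changes` helpers and the sample values are illustrative assumptions, not data from the pipeline.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def next_hour_changes(hourly_temps):
    """Change from each hourly average to the following one."""
    return [nxt - cur for cur, nxt in zip(hourly_temps, hourly_temps[1:])]


# Made-up hourly averages for one city (illustration only).
hourly_temps = [9.0, 8.5, 8.2, 8.4, 9.1, 10.3]
hourly_humidity = [40, 45, 48, 46, 41, 35]

changes = next_hour_changes(hourly_temps)
# The last hour has no successor, so its humidity value is dropped.
humidity_corr = pearson(hourly_humidity[:-1], changes)
print(f"humidity vs. next-hour change: r = {humidity_corr:.3f}")
```

Because each hour contributes a single point, a day of data yields only about two dozen pairs per city, so any correlation computed this way should be read with that sample size in mind.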

## Plotly Figure

In [22]:
import plotly.express as px
from google.cloud import bigquery

PROJECT_ID = "finalprojectfor467"
client = bigquery.Client(project=PROJECT_ID)

query = """
WITH hourly AS (
  SELECT
    city,
    TIMESTAMP_TRUNC(ts, HOUR) AS hour,
    AVG(temperature) AS avg_temp
  FROM `finalprojectfor467.weather_proj.live_weather`
  WHERE temperature IS NOT NULL
  GROUP BY city, hour
),
deltas AS (
  SELECT
    city,
    hour,
    avg_temp,
    ABS(avg_temp - LAG(avg_temp) OVER (PARTITION BY city ORDER BY hour)) AS temp_change
  FROM hourly
)
SELECT
  city,
  AVG(temp_change) AS avg_hourly_temp_change,
  AVG(avg_temp) AS temp_hour_avg,
  COUNT(*) AS observations
FROM deltas
WHERE temp_change IS NOT NULL
GROUP BY city
"""

df_plot = client.query(query).to_dataframe()

fig = px.scatter(
    df_plot,
    x="temp_hour_avg",
    y="avg_hourly_temp_change",
    size="observations",
    color="city",
    hover_name="city",
    title="Temperature Volatility vs. Hourly Average Temperature (Live Data)",
    labels={
        "temp_hour_avg": "Average Hourly Temperature (°C)",
        "avg_hourly_temp_change": "Average Hourly Temperature Change (°C)"
    }
)

fig.show()


**📊 Plotly Insight: Temperature Volatility vs. Average Temperature**

This interactive Plotly visualization compares average hourly temperature volatility to average temperature levels across major U.S. cities using live streaming data. The chart shows that cities with similar average temperatures can exhibit very different levels of short-term volatility, indicating that temperature magnitude alone does not explain next-hour changes. In particular, warmer cities tend to display higher volatility, while cooler coastal cities remain more stable. This supports the investigation hypothesis that additional atmospheric features and local conditions play a meaningful role in short-horizon temperature dynamics and motivates their inclusion in the downstream machine learning model.
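The aggregation behind this chart can be mirrored in plain Python for a quick offline check. The sketch below buckets one city's readings by hour, averages each bucket, and takes the mean absolute change between consecutive hourly averages. The `hourly_volatility` helper and the sample readings are hypothetical. Like the SQL `LAG`, it compares adjacent buckets without checking that the hours are contiguous.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_volatility(readings):
    """Mean absolute change between consecutive hourly average temperatures.

    readings: iterable of (timestamp, temperature_or_None) for a single city.
    Returns None when fewer than two hourly buckets contain valid data.
    """
    buckets = defaultdict(list)
    for ts, temp in readings:
        if temp is None:
            continue  # same effect as WHERE temperature IS NOT NULL
        hour = ts.replace(minute=0, second=0, microsecond=0)  # truncate to the hour
        buckets[hour].append(temp)

    hourly_avg = [sum(temps) / len(temps) for _, temps in sorted(buckets.items())]
    changes = [abs(nxt - cur) for cur, nxt in zip(hourly_avg, hourly_avg[1:])]
    return sum(changes) / len(changes) if changes else None


# Made-up readings for illustration only.
sample = [
    (datetime(2025, 12, 13, 2, 10, tzinfo=timezone.utc), 9.0),
    (datetime(2025, 12, 13, 2, 40, tzinfo=timezone.utc), 10.0),
    (datetime(2025, 12, 13, 3, 5, tzinfo=timezone.utc), 11.0),
    (datetime(2025, 12, 13, 3, 50, tzinfo=timezone.utc), None),
    (datetime(2025, 12, 13, 4, 20, tzinfo=timezone.utc), 10.0),
]
volatility = hourly_volatility(sample)
print(volatility)
```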

**📌 Dashboard Linkage**

This Plotly analysis is linked to the *“Temperature Volatility vs. Hourly Average Temperature”* scatter plot in the Looker Studio dashboard. The notebook visualization was used to validate the city-level relationship between average temperature and short-term volatility before operationalizing the same aggregation logic in the live dashboard.


## 🔍 V: Validate

To validate that the observed temperature patterns were not artifacts of missing or delayed data, I performed sanity checks on the live streaming weather table in BigQuery. I verified that each city had recent observations within the last 24 hours and confirmed that non-null temperature values were consistently present despite intermittent API misreads.

I additionally validated that the `latest_weather_filled` view correctly surfaced the most recent valid observation per city when null readings occurred. This ensured that downstream analyses and Looker Studio visualizations reflected real sensor behavior rather than data gaps, providing confidence that the observed short-term temperature changes were based on reliable, up-to-date measurements. ✅📊


In [21]:
# ✅ V: Validate
# Validate freshness and completeness of live streaming weather data
# to ensure downstream analysis and ML are based on reliable inputs.

from google.cloud import bigquery

PROJECT_ID = "finalprojectfor467"
client = bigquery.Client(project=PROJECT_ID)

query = """
SELECT
  city,
  COUNT(*) AS rows_last_24h,
  COUNTIF(temperature IS NOT NULL) AS non_null_temperature_rows,
  COUNTIF(humidity IS NOT NULL) AS non_null_humidity_rows,
  COUNTIF(wind_speed IS NOT NULL) AS non_null_wind_speed_rows,
  MAX(ts) AS most_recent_timestamp
FROM `finalprojectfor467.weather_proj.live_weather`
WHERE ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY city
ORDER BY city;
"""

df_validation = client.query(query).to_dataframe()
df_validation


| | city | rows_last_24h | non_null_temperature_rows | non_null_humidity_rows | non_null_wind_speed_rows | most_recent_timestamp |
|---|---|---|---|---|---|---|
| 0 | Albuquerque | 45116 | 12726 | 12726 | 12726 | 2025-12-13 03:36:33.684973+00:00 |
| 1 | Denver | 45335 | 12798 | 12798 | 12798 | 2025-12-13 03:36:33.201873+00:00 |
| 2 | Las Vegas | 45745 | 12929 | 12929 | 12929 | 2025-12-13 03:36:35.896331+00:00 |
| 3 | Los Angeles | 46188 | 13089 | 13089 | 13089 | 2025-12-13 03:36:36.045976+00:00 |
| 4 | Phoenix | 45556 | 12869 | 12869 | 12869 | 2025-12-13 03:36:36.364710+00:00 |
| 5 | Portland | 46689 | 13252 | 13252 | 13252 | 2025-12-13 03:36:35.108155+00:00 |
| 6 | San Diego | 45969 | 13009 | 13009 | 13009 | 2025-12-13 03:36:36.513670+00:00 |
| 7 | San Francisco | 46448 | 13169 | 13169 | 13169 | 2025-12-13 03:36:35.577252+00:00 |
| 8 | Seattle | 46937 | 13311 | 13311 | 13311 | 2025-12-13 03:36:34.636452+00:00 |
| 9 | Vancouver | 44901 | 12659 | 12659 | 12659 | 2025-12-13 03:36:34.167116+00:00 |


The validation results confirm that the live weather streaming pipeline is delivering records for all ten cities. Each city shows roughly 45,000 to 47,000 records in the last 24 hours, and the most recent timestamps fall within a few seconds of each other, demonstrating consistent, near-real-time updates rather than stalled or delayed streams. Null readings are not occasional, however: only about 28% of rows per city (roughly 12,700 to 13,300) carry non-null values, and the non-null counts for temperature, humidity, and wind speed are identical, which suggests that whole records arrive empty when the API misreads rather than individual fields going missing. The share is nearly the same in every city, so this looks like a property of the upstream source rather than a failure in one stream. This validation provides confidence that the data is fresh, and it shows why downstream analysis, modeling, and dashboard visualizations need to filter or fill null readings.
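The share of usable rows can be read directly from the counts in the validation output. The following sketch recomputes it per city from the `rows_last_24h` and `non_null_temperature_rows` values shown above. The dictionary layout and the `non_null_share` name are my own illustrative choices.

```python
# (rows_last_24h, non_null_temperature_rows) copied from the validation output.
counts = {
    "Albuquerque": (45116, 12726),
    "Denver": (45335, 12798),
    "Las Vegas": (45745, 12929),
    "Los Angeles": (46188, 13089),
    "Phoenix": (45556, 12869),
    "Portland": (46689, 13252),
    "San Diego": (45969, 13009),
    "San Francisco": (46448, 13169),
    "Seattle": (46937, 13311),
    "Vancouver": (44901, 12659),
}


def non_null_share(total_rows, non_null_rows):
    """Fraction of rows that carry a non-null temperature."""
    return non_null_rows / total_rows


shares = {
    city: round(non_null_share(total, non_null), 3)
    for city, (total, non_null) in counts.items()
}

for city, share in sorted(shares.items()):
    print(f"{city}: {share:.1%} of rows have a temperature")
```

Every city lands between 28.2% and 28.4%, so roughly seven in ten rows in the 24-hour window are null for temperature.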

## 🚀 E: Extend

Building on the findings from the investigation, this analysis could be extended by incorporating additional atmospheric and contextual features that may further influence short-term temperature changes. Variables such as cloud cover, precipitation, solar radiation, or elevation could help explain residual variation not captured by the current feature set.

From a modeling perspective, the linear regression approach could be extended to more flexible models, such as tree-based methods or time-series models, to capture non-linear effects and temporal dependencies across consecutive hours. This would be particularly valuable in cities that exhibit higher temperature volatility or stronger sensitivity to weather interactions.

Finally, the live streaming framework enables continuous retraining and evaluation as new data arrives. Future extensions could include automated model retraining on rolling windows and real-time alerting when predicted temperature changes exceed historical norms, further enhancing the practical value of the pipeline and dashboard.
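As a rough illustration of the alerting idea, a predicted change could be compared against the spread of recent observed changes. This is a minimal sketch assuming a simple z-score rule. The `is_unusual_change` helper, the threshold of 3, and the sample history are hypothetical and not part of the current pipeline.

```python
from statistics import mean, stdev


def is_unusual_change(predicted_change, historical_changes, z_threshold=3.0):
    """Flag a predicted hourly change that sits far outside the historical spread.

    historical_changes needs at least two values so a standard deviation exists.
    """
    mu = mean(historical_changes)
    sigma = stdev(historical_changes)
    if sigma == 0:
        # No historical variation: any departure from the constant value is unusual.
        return predicted_change != mu
    return abs(predicted_change - mu) / sigma > z_threshold


# Made-up recent hourly changes in °C for one city (illustration only).
history = [0.2, -0.1, 0.3, -0.4, 0.1, 0.0, -0.2]

print(is_unusual_change(2.5, history))
print(is_unusual_change(0.1, history))
```

In a rolling-window setup, `history` would be refreshed on each run so the threshold tracks each city's recent behavior rather than a fixed norm.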


## Bonus Plotly


This analysis also informed the *“Real Time Denver Temperature”* line chart in the Looker Studio dashboard by validating how short-term temperature changes evolve within a single city. The notebook exploration helped confirm that live streaming data captured meaningful intra-day dynamics rather than static temperature levels.


In [24]:
import plotly.express as px
from google.cloud import bigquery

PROJECT_ID = "finalprojectfor467"
client = bigquery.Client(project=PROJECT_ID)

query = """
SELECT
  DATETIME(ts, "America/Denver") AS denver_time,
  temperature
FROM `finalprojectfor467.weather_proj.live_weather_filled`
WHERE city = 'Denver'
  AND temperature IS NOT NULL
  AND ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY denver_time
"""

df_denver = client.query(query).to_dataframe()

fig = px.line(
    df_denver,
    x="denver_time",
    y="temperature",
    title="Real-Time Denver Temperature (Last 24 Hours, Local Time)",
    labels={
        "denver_time": "Denver Local Time",
        "temperature": "Temperature (°C)"
    }
)

fig.update_layout(
    hovermode="x unified",
    template="plotly_dark"
)

fig.show()


## 🧾 Overall Conclusion

This project reinforced the complexity and value of building a true end-to-end streaming analytics pipeline. I learned how real-time data ingestion (Cloud Run → Pub/Sub → Dataflow) introduces challenges that do not appear in batch settings, particularly around data quality, latency, and missing values. Designing downstream systems in BigQuery and Looker Studio required careful validation, null handling, and aggregation logic to ensure the dashboard remained interpretable even when upstream data was imperfect. Most importantly, this work demonstrated how streaming pipelines are not just about moving data, but about maintaining trust in the metrics they produce under continuously changing conditions.

At the analytical level, the DIVE process provided a structured way to move from observation to insight. The Discover step surfaced meaningful variation in short-term temperature behavior, the Investigate step quantified how atmospheric and temporal features relate to next-hour changes, the Validate step confirmed that the live data was fresh and that null readings were handled, and the Extend step highlighted how both the model and pipeline could be improved. Together, these steps showed that short-horizon temperature changes are influenced by more than just current temperature and that explainable ML applied to live data can provide actionable insight when supported by a robust streaming architecture.


### 🧠 Feature Engineering Overall (Bonus)

As part of this project, I engineered multiple features to support short-term temperature modeling and analysis. For supervised learning, I constructed the target variable `temp_plus_1h` using a LEAD window function to represent next-hour temperature outcomes by city. I also derived time-based features including hour of day, day of week, and month to capture diurnal and seasonal effects that influence short-horizon temperature changes.
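For readers who want to trace the label logic outside BigQuery, the sketch below reproduces the idea in plain Python: group rows by city, order by time, and pair each row with its successor. The `add_next_hour_label` helper and the sample rows are illustrative assumptions, and it presumes the input is already at hourly resolution. Note that `isoweekday()` numbers Monday as 1, whereas BigQuery's `EXTRACT(DAYOFWEEK ...)` numbers Sunday as 1.

```python
from datetime import datetime, timedelta, timezone


def add_next_hour_label(rows):
    """Attach temp_plus_1h and time features to hourly rows.

    rows: list of dicts with keys city, ts (aware datetime), temperature.
    Mirrors the idea of LEAD(temperature) OVER (PARTITION BY city ORDER BY ts).
    """
    by_city = {}
    for row in rows:
        by_city.setdefault(row["city"], []).append(row)

    labeled = []
    for city_rows in by_city.values():
        city_rows.sort(key=lambda r: r["ts"])
        # Pair each row with its successor; the final row has no label and is dropped.
        for current, nxt in zip(city_rows, city_rows[1:]):
            labeled.append({
                **current,
                "temp_plus_1h": nxt["temperature"],
                "hour_of_day": current["ts"].hour,
                "day_of_week": current["ts"].isoweekday(),  # 1 = Monday ... 7 = Sunday
                "month": current["ts"].month,
            })
    return labeled


# Made-up hourly rows for illustration only.
start = datetime(2025, 12, 13, 0, 0, tzinfo=timezone.utc)
sample = [
    {"city": "Denver", "ts": start + timedelta(hours=h), "temperature": t}
    for h, t in enumerate([9.3, 9.0, 8.6])
]
training_rows = add_next_hour_label(sample)
print(training_rows[0])
```

Dropping the last row per city corresponds to filtering out rows where `temp_plus_1h` is NULL.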

In addition, to support real-time analysis and dashboard stability, I implemented a “latest valid observation” feature that forward-fills missing sensor readings within a bounded time window. These engineered features were used directly in BigQuery ML model training, validation, and Looker Studio visualizations.
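The exact fill window is not stated here, so the following is only a sketch of the general technique, assuming a 10-minute bound: a null reading inherits the last valid value only if that value is recent enough. The `forward_fill_bounded` helper and the sample readings are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def forward_fill_bounded(readings, max_age):
    """Fill None temperatures with the last valid value if it is recent enough.

    readings: list of (timestamp, temperature_or_None) for one city, sorted by time.
    max_age: timedelta; valid readings older than this are not carried forward.
    """
    filled = []
    last_ts, last_value = None, None
    for ts, value in readings:
        if value is not None:
            last_ts, last_value = ts, value
            filled.append((ts, value))
        elif last_ts is not None and ts - last_ts <= max_age:
            filled.append((ts, last_value))
        else:
            # No prior reading, or the gap is too long: leave the value null.
            filled.append((ts, None))
    return filled


# Made-up readings for illustration only.
t0 = datetime(2025, 12, 13, 2, 34, tzinfo=timezone.utc)
readings = [
    (t0, 9.3),
    (t0 + timedelta(seconds=30), None),   # within the window: filled
    (t0 + timedelta(minutes=20), None),   # outside the window: stays null
]
filled = forward_fill_bounded(readings, max_age=timedelta(minutes=10))
print([value for _, value in filled])
```

Bounding the fill keeps a stale reading from being presented as current if a city's feed goes quiet for an extended period.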
