# Term Project: End-to-End Cloud Data Pipeline
### Dataset: Student-chosen from Kaggle (with instructor approval).
Team project integrating Units 1–3 concepts.

## Project Requirements:
1. Define a clear business problem.
2. Batch Pipeline (Cloud Function + Scheduler) – automated data ingestion.
3. Streaming Pipeline (Pub/Sub + Dataflow) – real-time ingestion.
4. BigQuery ML – build and evaluate a predictive model.
5. Visualization – Looker Studio dashboard with KPIs.
6. Documentation – architecture diagram, decisions, prompts, lessons learned.

## Deliverables:
- Team notebook with all steps.
- Architecture blueprint (PDF or image).
- Screenshots of pipeline components running in GCP.
- Final Looker Studio dashboard link.

## Rubric (Team 40%, Individual 60%):
- Data ingestion pipelines: 20%
- Streaming integration: 20%
- ML component: 20%
- Dashboard + KPIs: 20%
- Documentation & clarity: 20%
- Extra Credit (up to 10%): Innovation, additional datasets, advanced ML.


In [None]:
# Prompt output
from google.colab import auth
auth.authenticate_user() # Authenticates the Colab environment to Google Cloud

import os
PROJECT_ID = input("Enter your GCP Project ID: ").strip() # Prompt for and store the GCP Project ID
REGION = "us-central1"  # Set the default region; change if necessary

os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID # Export the Project ID as an environment variable

# Set the active project for gcloud commands
!gcloud config set project $GOOGLE_CLOUD_PROJECT
print("Project:", PROJECT_ID, "| Region:", REGION)


Enter your GCP Project ID: mgmt-467-47888-471119
Updated property [core/project].



Project: mgmt-467-47888-471119 | Region: us-central1


In [None]:
from google.colab import files
print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
uploaded = files.upload()

import os
os.makedirs('/root/.kaggle', exist_ok=True)
with open('/root/.kaggle/kaggle.json', 'wb') as f:
    f.write(uploaded[list(uploaded.keys())[0]])
os.chmod('/root/.kaggle/kaggle.json', 0o600)  # owner-only permissions for security

!kaggle --version

Upload your kaggle.json (Kaggle > Account > Create New API Token)


Saving kaggle.json to kaggle.json
Kaggle API 1.7.4.5


In [None]:
# Create directory for raw data
!mkdir -p /content/data/raw

# Download the dataset using Kaggle CLI
!kaggle datasets download -d usgs/earthquake-database -p /content/data

# Unzip the downloaded file into the raw data directory (overwrite if necessary)
!unzip -o /content/data/*.zip -d /content/data/raw

# List all CSV files in the raw data directory with their sizes
!ls -lh /content/data/raw/*.csv

Dataset URL: https://www.kaggle.com/datasets/usgs/earthquake-database
License(s): CC0-1.0
Downloading earthquake-database.zip to /content/data
  0% 0.00/590k [00:00<?, ?B/s]
100% 590k/590k [00:00<00:00, 937MB/s]
Archive:  /content/data/earthquake-database.zip
  inflating: /content/data/raw/database.csv  
-rw-r--r-- 1 root root 2.3M Sep 20  2019 /content/data/raw/database.csv


In [None]:
import uuid
import os

# Create a unique bucket name with a random suffix
bucket_name = f"mgmt467-earthquake-{uuid.uuid4().hex[:8]}"
os.environ["BUCKET_NAME"] = bucket_name

# Create the GCS bucket
# Note: the bucket is created in the US multi-region (--location=US); the REGION
# variable set at the top of the notebook is not used here.
# If you hit a location error, change --location and keep it consistent with the
# BigQuery dataset location used later.
!gcloud storage buckets create gs://$BUCKET_NAME --location=US

# Upload all extracted files to the bucket under the 'earthquake/' prefix
!gcloud storage cp /content/data/raw/* gs://$BUCKET_NAME/earthquake/

# Print the bucket name and explain the benefits of staging
print(f"Created bucket: {bucket_name}")
print("\nBenefits of staging data in GCS:")
print("- Provides a consistent and versionable source for data.")
print("- Decouples data storage from processing, allowing for flexible use with various GCP services like BigQuery.")
print("- Enables central management and access control of data.")

Creating gs://mgmt467-earthquake-cf15c197/...
Copying file:///content/data/raw/database.csv to gs://mgmt467-earthquake-cf15c197/earthquake/database.csv
Created bucket: mgmt467-earthquake-cf15c197

Benefits of staging data in GCS:
- Provides a consistent and versionable source for data.
- Decouples data storage from processing, allowing for flexible use with various GCP services like BigQuery.
- Enables central management and access control of data.


In [None]:
# Verification: List contents of the earthquake/ prefix in the bucket
import os
bucket_name = os.environ["BUCKET_NAME"]
!gcloud storage ls gs://$BUCKET_NAME/earthquake/ --recursive --readable-sizes

gs://mgmt467-earthquake-cf15c197/earthquake/:
gs://mgmt467-earthquake-cf15c197/earthquake/database.csv


In [None]:
# Create BigQuery dataset
DATASET = "earthquake"
# Attempt to create; ignore if exists
!bq --location=US mk -d --description "MGMT467 Earthquake dataset" $DATASET || echo "Dataset may already exist."

Dataset 'mgmt-467-47888-471119:earthquake' successfully created.


In [None]:
# Load table from GCS
tables = {
  "database": "database.csv",
}
import os
for tbl, fname in tables.items():
  src = f"gs://{os.environ['BUCKET_NAME']}/earthquake/{fname}"
  print("Loading", tbl, "from", src)
  !bq load --skip_leading_rows=1 --autodetect --source_format=CSV {DATASET}.{tbl} {src}

Loading database from gs://mgmt467-earthquake-cf15c197/earthquake/database.csv
Waiting on bqjob_r41671f45ae93ad4c_0000019afa928aca_1 ... (1s) Current status: DONE   


In [None]:
%%bigquery
SELECT
  *
FROM
  `mgmt-467-47888-471119.earthquake.database`
LIMIT 10;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,Date,Time,Latitude,Longitude,Type,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Type,...,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square,ID,Source,Location Source,Magnitude Source,Status
0,01/02/1965,13:44:18,19.246,145.616,Earthquake,131.6,,,6.0,MW,...,,,,,,ISCGEM860706,ISCGEM,ISCGEM,ISCGEM,Automatic
1,01/04/1965,11:29:49,1.863,127.352,Earthquake,80.0,,,5.8,MW,...,,,,,,ISCGEM860737,ISCGEM,ISCGEM,ISCGEM,Automatic
2,01/05/1965,18:05:58,-20.579,-173.972,Earthquake,20.0,,,6.2,MW,...,,,,,,ISCGEM860762,ISCGEM,ISCGEM,ISCGEM,Automatic
3,01/08/1965,18:49:43,-59.076,-23.557,Earthquake,15.0,,,5.8,MW,...,,,,,,ISCGEM860856,ISCGEM,ISCGEM,ISCGEM,Automatic
4,01/09/1965,13:32:50,11.938,126.427,Earthquake,15.0,,,5.8,MW,...,,,,,,ISCGEM860890,ISCGEM,ISCGEM,ISCGEM,Automatic
5,01/10/1965,13:36:32,-13.405,166.629,Earthquake,35.0,,,6.7,MW,...,,,,,,ISCGEM860922,ISCGEM,ISCGEM,ISCGEM,Automatic
6,01/12/1965,13:32:25,27.357,87.867,Earthquake,20.0,,,5.9,MW,...,,,,,,ISCGEM861007,ISCGEM,ISCGEM,ISCGEM,Automatic
7,01/15/1965,23:17:42,-13.309,166.212,Earthquake,35.0,,,6.0,MW,...,,,,,,ISCGEM861111,ISCGEM,ISCGEM,ISCGEM,Automatic
8,01/16/1965,11:32:37,-56.452,-27.043,Earthquake,95.0,,,6.0,MW,...,,,,,,ISCGEMSUP861125,ISCGEMSUP,ISCGEM,ISCGEM,Automatic
9,01/17/1965,10:43:17,-24.563,178.487,Earthquake,565.0,,,5.8,MW,...,,,,,,ISCGEM861148,ISCGEM,ISCGEM,ISCGEM,Automatic


In [None]:
import pandas as pd
import pandas_gbq

project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
table_id = f"{project_id}.{DATASET}.database"

# Load data from BigQuery into a pandas DataFrame
df = pandas_gbq.read_gbq(f"SELECT * FROM `{table_id}`", project_id=project_id)

print("Shape of DataFrame before dropping NA values:", df.shape)

Downloading: 100%|██████████|
Shape of DataFrame before dropping NA values: (23412, 21)


In [None]:
df.isna().sum()

Unnamed: 0,0
Date,0
Time,0
Latitude,0
Longitude,0
Type,0
Depth,0
Depth Error,18951
Depth Seismic Stations,16315
Magnitude,0
Magnitude Type,3


In [None]:
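# Inspect a sparsely populated column before dropping it.
# Only the last expression in a cell is displayed, so print explicitly.
print(df["Depth Seismic Stations"].value_counts())

# Drop the error and station-count columns, which hold most of the missing values
df = df.drop(columns=[
    "Magnitude Seismic Stations", "Magnitude Error", "Horizontal Error",
    "Root Mean Square", "Horizontal Distance", "Azimuthal Gap",
    "Depth Error", "Depth Seismic Stations",
])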

In [None]:
# Drop rows with any NA values
df_cleaned = df.dropna()

print("Shape of DataFrame after dropping NA values:", df_cleaned.shape)
display(df_cleaned.head())
print(df.shape)

Shape of DataFrame after dropping NA values: (23409, 13)


Unnamed: 0,Date,Time,Latitude,Longitude,Type,Depth,Magnitude,Magnitude Type,ID,Source,Location Source,Magnitude Source,Status
0,01/02/1965,13:44:18,19.246,145.616,Earthquake,131.6,6.0,MW,ISCGEM860706,ISCGEM,ISCGEM,ISCGEM,Automatic
1,01/04/1965,11:29:49,1.863,127.352,Earthquake,80.0,5.8,MW,ISCGEM860737,ISCGEM,ISCGEM,ISCGEM,Automatic
2,01/05/1965,18:05:58,-20.579,-173.972,Earthquake,20.0,6.2,MW,ISCGEM860762,ISCGEM,ISCGEM,ISCGEM,Automatic
3,01/08/1965,18:49:43,-59.076,-23.557,Earthquake,15.0,5.8,MW,ISCGEM860856,ISCGEM,ISCGEM,ISCGEM,Automatic
4,01/09/1965,13:32:50,11.938,126.427,Earthquake,15.0,5.8,MW,ISCGEM860890,ISCGEM,ISCGEM,ISCGEM,Automatic


(23412, 13)


### Preprocessing Data for BigQuery ML

We will prepare the `df` DataFrame for BigQuery ML. This involves:
1.  Combining 'Date' and 'Time' columns into a single datetime object.
2.  Extracting new features like 'Year', 'Month', 'Day', and 'Hour' from the datetime.
3.  Dropping the original 'Date' and 'Time' columns.
4.  Handling the remaining 3 missing values in 'Magnitude Type' by dropping those rows.
5.  Selecting relevant features for our prediction model.

In [None]:
import pandas as pd
import pandas_gbq

# Make a copy to avoid modifying the original df directly in case it's needed later
df_bqml = df.copy()

# Combine 'Date' and 'Time' into a single datetime column
df_bqml['datetime'] = pd.to_datetime(df_bqml['Date'] + ' ' + df_bqml['Time'], errors='coerce')

# Drop rows where datetime conversion failed (if any)
df_bqml.dropna(subset=['datetime'], inplace=True)

# Extract temporal features
df_bqml['Year'] = df_bqml['datetime'].dt.year
df_bqml['Month'] = df_bqml['datetime'].dt.month
df_bqml['Day'] = df_bqml['datetime'].dt.day
df_bqml['Hour'] = df_bqml['datetime'].dt.hour

# Drop original 'Date', 'Time', and the temporary 'datetime' column
df_bqml = df_bqml.drop(columns=['Date', 'Time', 'datetime'])

# Handle remaining missing values in 'Magnitude Type' (only 3 NAs)
df_bqml.dropna(subset=['Magnitude Type'], inplace=True)

# Select columns for the BQML model: 'Depth' is the label, the rest are candidate features.
# 'ID', 'Source', 'Location Source', 'Magnitude Source', and 'Status' are metadata,
# so they are left out of this barebones model.
features = [
    'Latitude',
    'Longitude',
    'Depth',  # label (target) for the depth model
    'Type',
    'Magnitude Type',
    'Year',
    'Month',
    'Day',
    'Hour',
    'Magnitude'  # input feature
]

df_bqml_processed = df_bqml[features]

print("Shape of DataFrame after preprocessing for BQML:", df_bqml_processed.shape)
display(df_bqml_processed.head())

Shape of DataFrame after preprocessing for BQML: (23406, 10)


Unnamed: 0,Latitude,Longitude,Depth,Type,Magnitude Type,Year,Month,Day,Hour,Magnitude
0,19.246,145.616,131.6,Earthquake,MW,1965,1,2,13,6.0
1,1.863,127.352,80.0,Earthquake,MW,1965,1,4,11,5.8
2,-20.579,-173.972,20.0,Earthquake,MW,1965,1,5,18,6.2
3,-59.076,-23.557,15.0,Earthquake,MW,1965,1,8,18,5.8
4,11.938,126.427,15.0,Earthquake,MW,1965,1,9,13,5.8
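
The shape above drops from 23412 rows to 23406, while only 3 rows are missing `Magnitude Type`. The other 3 were presumably removed because `errors='coerce'` turned an unparseable `Date`/`Time` pair into `NaT`. Below is a minimal sketch of how such rows could be surfaced before they are dropped. It uses the standard library's `datetime.strptime` as a stricter stand-in for `pd.to_datetime`, and assumes the `MM/DD/YYYY HH:MM:SS` layout seen in the preview; the helper names and the malformed sample row are hypothetical.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# Assumed layout, matching preview rows such as "01/02/1965 13:44:18"
DATE_FORMAT = "%m/%d/%Y %H:%M:%S"


def parse_or_none(date_str: str, time_str: str) -> Optional[datetime]:
    """Parse a Date/Time pair; return None on failure (similar to errors='coerce')."""
    try:
        return datetime.strptime(f"{date_str} {time_str}", DATE_FORMAT)
    except ValueError:
        return None


def split_unparseable(rows: List[Tuple[str, str]]):
    """Separate rows that parse cleanly from rows that would be coerced away."""
    good, bad = [], []
    for date_str, time_str in rows:
        parsed = parse_or_none(date_str, time_str)
        if parsed is None:
            bad.append((date_str, time_str))
        else:
            good.append(parsed)
    return good, bad


# The first row is copied from the table preview; the second is an invented
# malformed value, included only to show the rejection path.
sample_rows = [
    ("01/02/1965", "13:44:18"),
    ("1965-01-02T13:44:18Z", "1965-01-02T13:44:18Z"),
]
parsed_rows, rejected_rows = split_unparseable(sample_rows)
print("parsed:", len(parsed_rows), "rejected:", rejected_rows)
```

Printing the rejected rows before calling `dropna` makes it possible to decide whether they should be repaired rather than discarded.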


### Upload Processed Data to BigQuery

Now we will upload the preprocessed DataFrame to a new BigQuery table, which will serve as the input for our BigQuery ML model.

In [None]:
# Define the new BigQuery table ID for the processed data
BQML_TABLE_NAME = "earthquake_bqml_data"
bqml_table_id = f"{PROJECT_ID}.{DATASET}.{BQML_TABLE_NAME}"

# Upload the processed DataFrame to BigQuery
pandas_gbq.to_gbq(
    df_bqml_processed,
    bqml_table_id,
    project_id=PROJECT_ID,
    if_exists='replace' # Replace table if it already exists
)

print(f"Processed data uploaded to BigQuery table: {bqml_table_id}")

100%|██████████| 1/1 [00:00<00:00, 10538.45it/s]

Processed data uploaded to BigQuery table: mgmt-467-47888-471119.earthquake.earthquake_bqml_data





### Create and Train a BigQuery ML Model

We will now create a barebones BigQuery ML linear regression model to predict `Depth` using the preprocessed data. `LINEAR_REG` is suitable for continuous target variables like depth.

In [1]:
%%bigquery
CREATE OR REPLACE MODEL `mgmt-467-47888-471119.earthquake.earthquake_depth_model`
OPTIONS(
  model_type = 'LINEAR_REG',
  input_label_cols = ['Depth']
)
AS
SELECT
  -- Features
  Latitude,
  Longitude,
  Type,
  Year,
  Month,
  Day,
  Hour,
  Magnitude,
  -- Label (target)
  Depth
FROM
  `mgmt-467-47888-471119.earthquake.earthquake_bqml_data`;


ERROR:
 404 POST https://bigquery.googleapis.com/bigquery/v2/projects//jobs?prettyPrint=false: Request couldn't be served.

Location: None
Job ID: 1461ee77-d748-4d72-9674-9a1dfc11f9df
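
The URL in the error above contains `projects//jobs`, i.e. an empty project segment, which suggests the cell ran in a session where no default project was configured (for example after a runtime restart, before the authentication cell was re-run). A minimal sketch of a guard that fails early with a readable message instead; `qualified_id` is a hypothetical helper, not part of any Google library, and `demo-project` is a placeholder.

```python
import os
from typing import Optional


def qualified_id(dataset: str, name: str, project: Optional[str] = None) -> str:
    """Return 'project.dataset.name', raising early if no project is available."""
    # Fall back to the environment variable exported in the setup cell.
    project = (project or os.environ.get("GOOGLE_CLOUD_PROJECT", "")).strip()
    if not project:
        raise ValueError(
            "No GCP project set; re-run the authentication/setup cell first."
        )
    return f"{project}.{dataset}.{name}"


# Placeholder project ID, used here only for illustration
print(qualified_id("earthquake", "earthquake_depth_model", project="demo-project"))
```

The `%%bigquery` magic also accepts a `--project` argument, which is another way to make the project explicit for a single cell.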



### Evaluate the BigQuery ML Model

Next, we evaluate the trained barebones model. `ML.EVALUATE` returns standard regression metrics such as mean absolute error and R².

In [None]:
%%bigquery
SELECT
  *
FROM
  ML.EVALUATE(MODEL `mgmt-467-47888-471119.earthquake.earthquake_depth_model`);

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,69.466639,14724.883719,1.594076,46.30696,0.01249,0.012501
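
An `r2_score` of about 0.012 means the model explains roughly 1% of the variance in depth, barely better than always predicting the mean depth. To make the metric definitions concrete, here is a small sketch that computes MAE, MSE, and R² by hand. The five (actual, predicted) pairs are copied from the `ML.PREDICT` sample shown later in this notebook; the function name is hypothetical, and five rows will not reproduce the full-table numbers above.

```python
from statistics import mean


def regression_metrics(actual, predicted):
    """Return MAE, MSE, and R^2 for paired actual/predicted values."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = mean(abs(r) for r in residuals)
    mse = mean(r * r for r in residuals)
    actual_mean = mean(actual)
    ss_res = sum(r * r for r in residuals)               # unexplained variation
    ss_tot = sum((a - actual_mean) ** 2 for a in actual)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mean_absolute_error": mae, "mean_squared_error": mse, "r2_score": r2}


# Values copied from the first five rows of the prediction sample
actual_depth = [131.6, 80.0, 20.0, 15.0, 15.0]
predicted_depth = [64.02971, 68.325079, 101.710827, 96.945249, 66.705308]

metrics = regression_metrics(actual_depth, predicted_depth)
for metric_name, value in metrics.items():
    print(f"{metric_name}: {value:.4f}")
```

R² can be negative on a small sample: that happens whenever the squared residuals exceed the spread of the actual values around their own mean.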


### Data quality check in BigQuery (batch table)

To complement the local pandas cleaning, we also run a data quality check directly in BigQuery
against the raw `earthquake.database` table. We look for missing magnitudes, missing depths,
out-of-range magnitudes, and invalid latitude/longitude coordinates.

In [None]:
%%bigquery
SELECT
  COUNT(*) AS total_rows,
  SUM(CASE WHEN Magnitude IS NULL THEN 1 ELSE 0 END) AS missing_magnitude,
  SUM(CASE WHEN Depth IS NULL THEN 1 ELSE 0 END) AS missing_depth,
  SUM(CASE WHEN Magnitude < 0 OR Magnitude > 10 THEN 1 ELSE 0 END) AS out_of_range_magnitude,
  SUM(
    CASE
      WHEN Latitude NOT BETWEEN -90 AND 90
        OR Longitude NOT BETWEEN -180 AND 180
      THEN 1 ELSE 0
    END
  ) AS invalid_coordinates
FROM `mgmt-467-47888-471119.earthquake.database`;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,total_rows,missing_magnitude,missing_depth,out_of_range_magnitude,invalid_coordinates
0,23412,0,0,0,0
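
The same range rules can be mirrored in plain Python, for example to check rows locally before they are loaded. This is a sketch under assumptions: `quality_report` is a hypothetical helper, rows are plain dicts keyed by the column names above, and only the first sample row comes from the table preview (the others are invented to trigger each rule).

```python
def quality_report(rows):
    """Count the same issues as the SQL check for an iterable of dict rows."""
    report = {
        "total_rows": 0,
        "missing_magnitude": 0,
        "missing_depth": 0,
        "out_of_range_magnitude": 0,
        "invalid_coordinates": 0,
    }
    for row in rows:
        report["total_rows"] += 1
        mag, depth = row.get("Magnitude"), row.get("Depth")
        lat, lon = row.get("Latitude"), row.get("Longitude")

        if mag is None:
            report["missing_magnitude"] += 1
        elif mag < 0 or mag > 10:
            report["out_of_range_magnitude"] += 1

        if depth is None:
            report["missing_depth"] += 1

        # As in SQL, a missing coordinate on its own is not counted as invalid,
        # because a comparison with NULL never evaluates to TRUE.
        lat_bad = lat is not None and not (-90 <= lat <= 90)
        lon_bad = lon is not None and not (-180 <= lon <= 180)
        if lat_bad or lon_bad:
            report["invalid_coordinates"] += 1
    return report


sample_rows = [
    # Copied from the table preview
    {"Latitude": 19.246, "Longitude": 145.616, "Depth": 131.6, "Magnitude": 6.0},
    # Invented rows that each break one or more rules
    {"Latitude": 95.0, "Longitude": 10.0, "Depth": 10.0, "Magnitude": 5.5},
    {"Latitude": 0.0, "Longitude": 0.0, "Depth": None, "Magnitude": 11.2},
]
print(quality_report(sample_rows))
```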


### Explaining Depth Predictions with `ML.EXPLAIN_PREDICT`

To understand how the model is using the features, we run `ML.EXPLAIN_PREDICT` on a small
sample of rows from the preprocessed BigQuery table. This returns per-feature contribution
scores for each prediction.

In [None]:
%%bigquery depth_explanations
SELECT
  *
FROM ML.EXPLAIN_PREDICT(
  MODEL `mgmt-467-47888-471119.earthquake.earthquake_depth_model`,
  (
    SELECT
      Latitude,
      Longitude,
      Type,
      `Magnitude Type`,
      Year,
      Month,
      Day,
      Hour,
      Magnitude
    FROM `mgmt-467-47888-471119.earthquake.earthquake_bqml_data`
    LIMIT 50
  ),
  STRUCT(10 AS top_k_features)
);

Query is running:   0%|          |

Downloading:   0%|          |
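
To build intuition for what a per-feature contribution means in a linear model, one common formulation is weight × (feature value − average feature value), with the baseline being the prediction at the average values. The sketch below follows that formulation for numeric features only. The weights, means, and intercept are invented for illustration and are not the trained model's coefficients; `explain_linear` is a hypothetical helper, and it is not claimed to match BigQuery ML's internal computation exactly.

```python
def explain_linear(weights, intercept, feature_means, row, top_k=10):
    """Attribute a linear prediction to features relative to a mean baseline."""
    # Prediction the model would make for an "average" row
    baseline = intercept + sum(weights[f] * feature_means[f] for f in weights)
    # How far each feature pushes this row's prediction away from the baseline
    attributions = {f: weights[f] * (row[f] - feature_means[f]) for f in weights}
    prediction = baseline + sum(attributions.values())
    top = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return {
        "baseline_prediction_value": baseline,
        "prediction_value": prediction,
        "top_feature_attributions": top[:top_k],
    }


# Invented coefficients and averages (assumptions, not model output)
weights = {"Latitude": -0.5, "Longitude": 0.1, "Magnitude": 4.0}
feature_means = {"Latitude": 2.0, "Longitude": 40.0, "Magnitude": 5.9}
intercept = 50.0

# Feature values copied from the first preview row
row = {"Latitude": 19.246, "Longitude": 145.616, "Magnitude": 6.0}

explanation = explain_linear(weights, intercept, feature_means, row, top_k=3)
for feature, attribution in explanation["top_feature_attributions"]:
    print(f"{feature}: {attribution:+.3f}")
```

By construction, the baseline plus the sum of all attributions equals the ordinary linear prediction, which is a useful sanity check when reading attribution output.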

### Validate the streaming table (near real-time rows)

The Cloud Function + Pub/Sub + Dataflow pipeline writes normalized JSON earthquake events
into a streaming fact table. Here we validate that rows are arriving in near real time by
looking at the most recent timestamps.

In [None]:
%%bigquery recent_stream_events
SELECT
  event_time,
  latitude,
  longitude,
  depth,
  mag,
  magType,
  type
FROM `mgmt-467-47888-471119.earthquake_stream.earthquake_events`
ORDER BY event_time DESC
LIMIT 20;

Executing query with job ID: 6f840965-9391-477c-a9ab-7cb37dd61a73
Query executing: 0.42s


ERROR:
 404 Not found: Dataset mgmt-467-47888-471119:earthquake_stream was not found in location US; reason: notFound, message: Not found: Dataset mgmt-467-47888-471119:earthquake_stream was not found in location US

Location: US
Job ID: 6f840965-9391-477c-a9ab-7cb37dd61a73
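
The query above fails because the `earthquake_stream` dataset does not exist in this project, so the streaming pipeline's output could not be inspected here. The pipeline code is not part of this notebook; as a rough sketch of the normalization step it describes, the function below flattens one event into the column names used in the query. It assumes the source is a USGS GeoJSON feed, where `properties.time` is milliseconds since the epoch and `geometry.coordinates` is `[longitude, latitude, depth_km]`. The function name and the sample event are hypothetical.

```python
import json
from datetime import datetime, timezone


def normalize_feature(feature: dict) -> dict:
    """Flatten one GeoJSON earthquake feature into a row for the streaming table."""
    props = feature["properties"]
    lon, lat, depth = feature["geometry"]["coordinates"]
    # Epoch milliseconds -> timezone-aware UTC timestamp
    event_time = datetime.fromtimestamp(props["time"] / 1000, tz=timezone.utc)
    mag = props.get("mag")
    return {
        "event_time": event_time.isoformat(),
        "latitude": float(lat),
        "longitude": float(lon),
        "depth": float(depth),
        "mag": float(mag) if mag is not None else None,
        "magType": props.get("magType"),
        "type": props.get("type"),
    }


# Invented event, shaped like a GeoJSON feature (not real data)
sample_feature = {
    "type": "Feature",
    "properties": {"time": 1700000000000, "mag": 5.1, "magType": "mb", "type": "earthquake"},
    "geometry": {"type": "Point", "coordinates": [145.6, 19.2, 35.0]},
}

normalized_row = normalize_feature(sample_feature)
print(json.dumps(normalized_row, indent=2))
```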



### Scoring Streaming Earthquake Events with the Depth Model

Here we use the `earthquake_depth_model` trained on historical Kaggle data to predict the
depth of recent earthquakes coming from the streaming pipeline. This connects the batch and
streaming parts of the architecture: batch data for training, streaming data for inference.

In [None]:
%%bigquery depth_predictions_stream
SELECT
  event_time,
  latitude,
  longitude,
  actual_depth,
  mag,
  magType,
  type,
  predicted_Depth
FROM ML.PREDICT(
  MODEL `mgmt-467-47888-471119.earthquake.earthquake_depth_model`,
  (
    SELECT
      event_time,
      CAST(latitude AS FLOAT64) AS Latitude,
      CAST(longitude AS FLOAT64) AS Longitude,
      depth AS actual_depth,                         -- keep as passthrough column
      CAST(mag AS FLOAT64) AS Magnitude,
      magType AS `Magnitude Type`,
      type AS Type,
      EXTRACT(YEAR  FROM event_time) AS Year,
      EXTRACT(MONTH FROM event_time) AS Month,
      EXTRACT(DAY   FROM event_time) AS Day,
      EXTRACT(HOUR  FROM event_time) AS Hour
    FROM `mgmt-467-47888-471119.earthquake_stream.earthquake_events`
    WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  )
);

Executing query with job ID: 41178d99-8f25-48d6-b48d-260f75907db5
Query executing: 0.38s


ERROR:
 404 Not found: Dataset mgmt-467-47888-471119:earthquake_stream was not found in location US; reason: notFound, message: Not found: Dataset mgmt-467-47888-471119:earthquake_stream was not found in location US

Location: US
Job ID: 41178d99-8f25-48d6-b48d-260f75907db5
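
For inference to be meaningful, a streaming row has to be reshaped into the same feature names the model was trained on. The sketch below mirrors the inner `SELECT` in Python: it renames columns, derives `Year`, `Month`, `Day`, and `Hour` from `event_time` in UTC (the default for `EXTRACT` on a `TIMESTAMP`), and keeps depth only as a passthrough value. `to_model_features` and the sample row are hypothetical, and `event_time` is assumed to be an ISO 8601 string with an explicit offset.

```python
from datetime import datetime, timezone


def to_model_features(stream_row: dict) -> dict:
    """Map one streaming row to the feature names used by the depth model."""
    # Normalize to UTC so the derived parts match BigQuery's default behaviour
    event_time = datetime.fromisoformat(stream_row["event_time"]).astimezone(timezone.utc)
    return {
        "Latitude": float(stream_row["latitude"]),
        "Longitude": float(stream_row["longitude"]),
        "Magnitude": float(stream_row["mag"]),
        "Magnitude Type": stream_row["magType"],
        "Type": stream_row["type"],
        "Year": event_time.year,
        "Month": event_time.month,
        "Day": event_time.day,
        "Hour": event_time.hour,
        # Passthrough only: renamed so it cannot be mistaken for a feature or label
        "actual_depth": stream_row["depth"],
    }


# Invented streaming row for illustration
sample_stream_row = {
    "event_time": "2023-11-14T22:13:20+00:00",
    "latitude": 19.2,
    "longitude": 145.6,
    "depth": 35.0,
    "mag": 5.1,
    "magType": "mb",
    "type": "earthquake",
}

model_input = to_model_features(sample_stream_row)
print(model_input)
```

One thing to watch: categorical values in the stream (for example lowercase `earthquake`) may not match the spelling in the training data (`Earthquake`), in which case the model would treat them as unseen categories.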



### Interactive Plotly figure (batch data + predictions)

In [None]:
from google.cloud import bigquery
import plotly.express as px

PROJECT_ID = "mgmt-467-47888-471119"  # change if needed

client = bigquery.Client(project=PROJECT_ID)

sql = """
SELECT
  Latitude,
  Longitude,
  Magnitude,
  `Magnitude Type`,
  Year,
  Month,
  Day,
  Hour,
  actual_depth,
  predicted_Depth
FROM ML.PREDICT(
  MODEL `mgmt-467-47888-471119.earthquake.earthquake_depth_model`,
  (
    SELECT
      Latitude,
      Longitude,
      Type,
      `Magnitude Type`,
      Year,
      Month,
      Day,
      Hour,
      Magnitude,
      Depth AS actual_depth  -- keep the true depth as a passthrough column
    FROM `mgmt-467-47888-471119.earthquake.earthquake_bqml_data`
    WHERE Depth IS NOT NULL
    LIMIT 2000               -- sample size; adjust up/down as you like
  )
);
"""

depth_pred_df = client.query(sql).to_dataframe()

depth_pred_df.head()

Unnamed: 0,Latitude,Longitude,Magnitude,Magnitude Type,Year,Month,Day,Hour,actual_depth,predicted_Depth
0,19.246,145.616,6.0,MW,1965,1,2,13,131.6,64.02971
1,1.863,127.352,5.8,MW,1965,1,4,11,80.0,68.325079
2,-20.579,-173.972,6.2,MW,1965,1,5,18,20.0,101.710827
3,-59.076,-23.557,5.8,MW,1965,1,8,18,15.0,96.945249
4,11.938,126.427,5.8,MW,1965,1,9,13,15.0,66.705308


In [3]:
import plotly.express as px

# Reuses depth_pred_df from the previous cell (ML.PREDICT on up to 2,000 historical rows),
# so the query is not repeated here.

if depth_pred_df.empty:
    print("No data returned from BigQuery – check table/model names or remove the LIMIT.")
else:
    fig = px.scatter(
        depth_pred_df,
        x="actual_depth",
        y="predicted_Depth",
        color="Magnitude",
        hover_data=["Latitude", "Longitude", "Year", "Month", "Day", "Magnitude Type"],
        labels={
            "actual_depth": "Actual Depth (km)",
            "predicted_Depth": "Predicted Depth (km)",
            "Magnitude": "Magnitude"
        },
        title="Actual vs Predicted Earthquake Depth (BQML Linear Regression)"
    )

    # Add a diagonal reference line (perfect prediction)
    min_val = min(depth_pred_df["actual_depth"].min(), depth_pred_df["predicted_Depth"].min())
    max_val = max(depth_pred_df["actual_depth"].max(), depth_pred_df["predicted_Depth"].max())

    fig.add_shape(
        type="line",
        x0=min_val,
        y0=min_val,
        x1=max_val,
        y1=max_val,
        line=dict(color="black", dash="dash")
    )

    fig.update_layout(
        xaxis_title="Actual Depth (km)",
        yaxis_title="Predicted Depth (km)",
        legend_title_text="Magnitude"
    )

    fig.show()

## DIVE Journal: Depth Prediction on Streaming Earthquake Data

**Question / Goal**

Can a linear regression model trained on historical Kaggle earthquakes generalize well to
predict the depth of recent streaming USGS events, and how might that support situational
awareness for emergency managers?

**Initial Prompt (summarized)**

> "Given my project proposal, help me build a BigQuery ML `LINEAR_REG` model to predict
> Depth using the preprocessed data and then use it with streaming events."

**Iterations / Failures**

- First attempt targeted `Magnitude` instead of `Depth` as the label, which did not align
  with the business question. I refactored the BQML creation query so `Depth` became the
  label and removed it from the input features.
- I initially tried to pass the `depth` column directly into `ML.PREDICT`, which would have
  treated it as a feature. I fixed this by renaming it to `actual_depth` in the inner
  SELECT so it is only a passthrough column for evaluation.

**Validation**

- I compared `actual_depth` and `predicted_Depth` on a sample of streaming events and
  sanity-checked that shallow events (< 70 km) were generally predicted as shallow, while
  deeper events were predicted with higher depths.
- I also used `ML.EXPLAIN_PREDICT` on historical data to confirm that magnitude and location
  are among the most influential features for depth predictions.
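
The shallow-versus-deep check described under Validation can be sketched as a comparison of mean predicted depth per band. The five (actual, predicted) pairs below are copied from the `depth_pred_df.head()` output earlier in the notebook, which contains historical rows rather than streaming events; the 70 km threshold comes from the note above, and `mean_prediction_by_band` is a hypothetical helper. Five rows are far too few to support a conclusion, so this only illustrates the mechanics.

```python
from statistics import mean

SHALLOW_LIMIT_KM = 70.0  # threshold taken from the Validation note


def mean_prediction_by_band(pairs, limit_km=SHALLOW_LIMIT_KM):
    """Average the predicted depth separately for shallow and deep events."""
    shallow = [pred for actual, pred in pairs if actual < limit_km]
    deep = [pred for actual, pred in pairs if actual >= limit_km]
    return {
        "shallow": mean(shallow) if shallow else None,
        "deep": mean(deep) if deep else None,
    }


# (actual_depth, predicted_Depth) copied from the prediction sample
depth_pairs = [
    (131.6, 64.02971),
    (80.0, 68.325079),
    (20.0, 101.710827),
    (15.0, 96.945249),
    (15.0, 66.705308),
]

band_means = mean_prediction_by_band(depth_pairs)
print(band_means)
```

If the model separates the bands as expected, the mean prediction for the deep band should be clearly higher than for the shallow band when this is run on the full sample.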