## Batch Data

In [29]:
#EXAMPLE (from LLM) — Auth + Project/Region (commented; write your own cell using the prompt)
from google.colab import auth
auth.authenticate_user()

import os
PROJECT_ID = input("Enter your GCP Project ID: ").strip()
REGION = "us-central1"  # keep consistent; change if instructed
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
print("Project:", PROJECT_ID, "| Region:", REGION)

# Set active project for gcloud/BigQuery CLI
!gcloud config set project $GOOGLE_CLOUD_PROJECT
!gcloud config get-value project
#Done: Auth + Project/Region set

Enter your GCP Project ID: mgmt-467-2500
Project: mgmt-467-2500 | Region: us-central1
Updated property [core/project].
mgmt-467-2500


Generate a single Colab code cell that:

Prompts me to upload kaggle.json,
Saves to ~/.kaggle/kaggle.json with 0600 permissions,
Prints kaggle --version. Add comments about security and reproducibility.

In [30]:
# # EXAMPLE (from LLM) — Kaggle setup (commented)
from google.colab import files
print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
uploaded = files.upload()

import os
os.makedirs('/root/.kaggle', exist_ok=True)
with open('/root/.kaggle/kaggle.json', 'wb') as f:
    f.write(uploaded[list(uploaded.keys())[0]])
os.chmod('/root/.kaggle/kaggle.json', 0o600)  # owner-only

!kaggle --version

Upload your kaggle.json (Kaggle > Account > Create New API Token)


Saving kaggle (1).json to kaggle (1) (1).json
Kaggle API 1.7.4.5


Keep raw files under /content/data/raw for predictable paths and auditing. Dataset: usgs/significant-earthquakes-1965-2016

Build Prompt
Generate a Colab code cell that:

Creates /content/data/raw,
Downloads the dataset to /content/data with Kaggle CLI,
Unzips into /content/data/raw (overwrite OK),
Lists all CSVs with sizes in a neat table. Include comments describing each step.

In [31]:
# # EXAMPLE (from LLM) — Download & unzip (commented)
!mkdir -p /content/data/raw
!kaggle datasets download -d usgs/earthquake-database -p /content/data
!unzip -o /content/data/*.zip -d /content/data/raw
# List CSV inventory
!ls -lh /content/data/raw/*.csv


Dataset URL: https://www.kaggle.com/datasets/usgs/earthquake-database
License(s): CC0-1.0
earthquake-database.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  /content/data/earthquake-database.zip
  inflating: /content/data/raw/database.csv  
-rw-r--r-- 1 root root 2.3M Sep 20  2019 /content/data/raw/database.csv


In [32]:
#EXAMPLE (from LLM) — GCS staging (commented)
import uuid, os
import subprocess # Needed for checking bucket existence more robustly

# Retrieve PROJECT_ID from the environment variable set in a previous cell
PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")

# REGION is set as a Python variable in a previous cell (SX_mavKv4IY2).
# Let's ensure it's also set in the environment for consistency with gcloud commands
REGION_PYTHON_VAR = "us-central1" # Assuming this value from kernel state
os.environ["REGION"] = REGION_PYTHON_VAR # Set environment variable
GCS_REGION = REGION_PYTHON_VAR.upper() # Convert to uppercase for GCS bucket location

# Generate a unique bucket name (as per the original example from the user)
bucket_name = f"{PROJECT_ID}-earthquake-data"
os.environ["BUCKET_NAME"] = bucket_name # Set environment variable for subsequent !gcloud commands

# Check if the bucket already exists. If not, create it.
print(f"Checking for bucket: gs://{bucket_name}")
# Using subprocess.run to capture stderr and check return code directly.
# A non-zero return code typically indicates the bucket was not found.
result = subprocess.run(
    ["gcloud", "storage", "buckets", "describe", f"gs://{bucket_name}", f"--project={PROJECT_ID}"],
    capture_output=True, text=True, check=False
)

if result.returncode != 0: # If command failed, it implies the bucket does not exist
    print(f"GCS bucket gs://{bucket_name} not found. Creating in region {GCS_REGION}...")
    # Use shell command with environment variables. `gcloud` will pick up $BUCKET_NAME, $GCS_REGION, $PROJECT_ID.
    !gcloud storage buckets create gs://$BUCKET_NAME --location=$GCS_REGION --project=$PROJECT_ID
else:
    print(f"GCS bucket gs://{bucket_name} already exists. Skipping creation.")

# Copy raw data to the GCS bucket
print(f"Copying /content/data/raw/* to gs://{bucket_name}/earthquake/")
!gcloud storage cp /content/data/raw/* gs://$BUCKET_NAME/earthquake/ --project=$PROJECT_ID

print("Bucket:", bucket_name)
# Verify contents
print(f"Verifying contents in gs://{bucket_name}/earthquake/")
!gcloud storage ls gs://$BUCKET_NAME/earthquake/ --project=$PROJECT_ID

Checking for bucket: gs://mgmt-467-2500-earthquake-data
GCS bucket gs://mgmt-467-2500-earthquake-data already exists. Skipping creation.
Copying /content/data/raw/* to gs://mgmt-467-2500-earthquake-data/earthquake/
Copying file:///content/data/raw/database.csv to gs://mgmt-467-2500-earthquake-data/earthquake/database.csv
Bucket: mgmt-467-2500-earthquake-data
Verifying contents in gs://mgmt-467-2500-earthquake-data/earthquake/
gs://mgmt-467-2500-earthquake-data/earthquake/database.csv


Cell A: Create (idempotently) dataset netflix in US multi-region; if it exists, print a friendly message.
Cell B: Load tables from gs://$BUCKET_NAME/netflix/: users, movies, watch_history, recommendation_logs, search_logs, reviews with --skip_leading_rows=1 --autodetect --source_format=CSV. Finish with row-count queries for each table.

In [33]:
#EXAMPLE (from LLM) — BigQuery dataset (commented)
DATASET="earthquake"
#Attempt to create; ignore if exists
!bq --location=US mk -d --description "MGMT467 Earthquake dataset" $DATASET || echo "Dataset may already exist."

BigQuery error in mk operation: Dataset 'mgmt-467-2500:earthquake' already
exists.
Dataset may already exist.


# Task
Update the code in cell `xibmXGdfumhO` to correct the `bq load` command by directly injecting the Python variables `DATASET`, `tbl`, and `src` into the shell command string, and fix the `bq query` command by constructing the full query string using f-strings in Python with `PROJECT_ID`, `DATASET`, and `tbl` before passing it to the `bq query` shell command. Confirm that the cell runs successfully, loads the data into BigQuery, and then accurately performs the row count queries, displaying the results without any errors.

## Update BigQuery Load and Query Code

### Subtask:
Modify the code in the cell `xibmXGdfumhO` to correct the `bq load` command by including the `src` variable as the source argument and ensuring the destination table is correctly formatted (`$DATASET.$tbl`). Additionally, fix the `bq query` command by performing all string formatting (including `tbl`, `PROJECT_ID`, and `DATASET`) within Python using f-strings before the command is passed to the shell, to prevent syntax errors.


**Reasoning**:
The subtask requires modifying the existing code cell `xibmXGdfumhO` to correct the `bq load` and `bq query` commands. I will update the `bq load` command to correctly interpolate Python variables and fix the `bq query` command by pre-formatting the SQL query string in Python using f-strings before passing it to the shell command.



In [34]:
tables = {
   "database": "database.csv"
}
import os

# Ensure PROJECT_ID and DATASET are available for f-strings
PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", PROJECT_ID)
DATASET = DATASET # Already defined in kernel state

# Fix: Force BUCKET_NAME to the correct one where files were actually uploaded
# based on previous cell's execution output, as there was an inconsistency
# between the Python variable 'bucket_name' and the effective
# $BUCKET_NAME used by gcloud storage commands in the prior cell.
# The files were observed to be uploaded to 'gs://mgmt-467-2500-earthquake-data/earthquake/database.csv'
os.environ["BUCKET_NAME"] = "mgmt-467-2500-earthquake-data"


for tbl, fname in tables.items():
   src = f"gs://{os.environ['BUCKET_NAME']}/earthquake/{fname}"
   print("Loading", tbl, "from", src)
   # Corrected bq load command using Python variable interpolation
   # Added quotes around {src} to ensure it's treated as a single argument by the shell.
   !bq load --skip_leading_rows=1 --autodetect --source_format=CSV {DATASET}.{tbl} "{src}"

# Row counts
for tbl in tables.keys():
   # Pre-format the SQL query string using f-strings in Python
   query_string = f"SELECT '{tbl}' AS table_name, COUNT(*) AS n FROM `{PROJECT_ID}.{DATASET}.{tbl}`"
   # Escape backticks in the query string to prevent shell interpretation
   # Corrected Python escape sequence for a literal backslash followed by a backtick
   query_string_escaped = query_string.replace('`', '\\`')

   # Pass the fully formed query string to bq query directly, with escaped backticks
   print(f"Running query: {query_string}") # For debugging
   !bq query --nouse_legacy_sql "{query_string_escaped}"


Loading database from gs://mgmt-467-2500-earthquake-data/earthquake/database.csv
Waiting on bqjob_r2b82cf5382e99b0f_0000019b0f2fa377_1 ... (1s) Current status: DONE   
Running query: SELECT 'database' AS table_name, COUNT(*) AS n FROM `mgmt-467-2500.earthquake.database`
+------------+-------+
| table_name |   n   |
+------------+-------+
| database   | 23412 |
+------------+-------+


### Converting 'Date' column to `DATE` type in BigQuery

To convert the 'Date' column in your `earthquake.database` table to a proper `DATE` type, we'll use a `CREATE OR REPLACE TABLE` statement. This will rebuild the table with the updated schema.

We've identified two main string formats for dates in your data:
1. `MM/DD/YYYY`
2. `YYYY-MM-DDTHH:MM:SS.000Z` (an ISO timestamp which needs to be cast to `DATE`)

Since `SAFE.PARSE_DATE` and `TRY_PARSE_DATE` were not recognized in your environment, we'll use a `CASE` statement with `REGEXP_CONTAINS` to check the format of each 'Date' string. This allows us to apply the correct parsing function (`PARSE_DATE` or `PARSE_TIMESTAMP`) only when the format matches, and `CAST` the result of `PARSE_TIMESTAMP` to `DATE`. If neither format matches, the date will be `NULL`.

In [40]:
create_table_with_converted_date_sql = f"""CREATE OR REPLACE TABLE `{PROJECT_ID}.{DATASET}.database` AS
SELECT
    Time,
    Latitude,
    Longitude,
    Type,
    Depth,
    `Depth Error`,
    `Depth Seismic Stations`,
    Magnitude,
    `Magnitude Type`,
    `Magnitude Error`,
    `Magnitude Seismic Stations`,
    `Azimuthal Gap`,
    `Horizontal Distance`,
    `Horizontal Error`,
    `Root Mean Square`,
    ID,
    Source,
    `Location Source`,
    `Magnitude Source`,
    Status,
    CASE
        WHEN REGEXP_CONTAINS(Date, r'^\\d{{1,2}}/\\d{{1,2}}/\\d{{4}}$') THEN PARSE_DATE('%m/%d/%Y', Date)
        WHEN REGEXP_CONTAINS(Date, r'^\\d{{4}}-\\d{{2}}-\\d{{2}}T\\d{{2}}:\\d{{2}}:\\d{{2}}\\.\\d{{3}}Z$') THEN CAST(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%E*S%Ez', Date) AS DATE)
        ELSE NULL -- Return NULL if neither format matches
    END AS Date -- Overwrite the original Date column with the new DATE type
FROM
    `{PROJECT_ID}.{DATASET}.database`
"""

print("Generated BigQuery SQL query to convert 'Date' column in the existing table:")
print(create_table_with_converted_date_sql)

# Escape backticks for shell command execution
create_table_with_converted_date_sql_escaped = create_table_with_converted_date_sql.replace('`', '\\`')

print("Executing BigQuery query to convert the 'Date' column...")
!bq query --nouse_legacy_sql "{create_table_with_converted_date_sql_escaped}"

print("\nVerifying the schema of the updated table...")
from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID)
table_id_updated = f"{PROJECT_ID}.{DATASET}.database"
table_updated = client.get_table(table_id_updated)

print(f"Schema for table {table_id_updated}:")
for field in table_updated.schema:
    print(f"- {field.name}: {field.field_type}")


Generated BigQuery SQL query to convert 'Date' column in the existing table:
CREATE OR REPLACE TABLE `mgmt-467-2500.earthquake.database` AS
SELECT
    Time,
    Latitude,
    Longitude,
    Type,
    Depth,
    `Depth Error`,
    `Depth Seismic Stations`,
    Magnitude,
    `Magnitude Type`,
    `Magnitude Error`,
    `Magnitude Seismic Stations`,
    `Azimuthal Gap`,
    `Horizontal Distance`,
    `Horizontal Error`,
    `Root Mean Square`,
    ID,
    Source,
    `Location Source`,
    `Magnitude Source`,
    Status,
    CASE
        WHEN REGEXP_CONTAINS(Date, r'^\d{1,2}/\d{1,2}/\d{4}$') THEN PARSE_DATE('%m/%d/%Y', Date)
        WHEN REGEXP_CONTAINS(Date, r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z$') THEN CAST(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%E*S%Ez', Date) AS DATE)
        ELSE NULL -- Return NULL if neither format matches
    END AS Date -- Overwrite the original Date column with the new DATE type
FROM
    `mgmt-467-2500.earthquake.database`

Executing BigQuery query to convert

# Task
Create a BigQuery SQL query to prepare training data for a linear regression model. The query should select `Magnitude`, `Latitude`, `Longitude`, `Depth`, `Depth Error`, `Depth Seismic Stations`, and `Location Source` from the `earthquake.database` table. Ensure all selected columns are not null. Engineer new features: an interaction term for `Latitude * Longitude` and a polynomial term for `POW(Depth, 2)`. Store this SQL query in a Python variable for subsequent use in model creation.

## Feature Engineering and Training Data Preparation

### Subtask:
Generate a BigQuery SQL query to select relevant features, engineer new features (interaction and polynomial terms), and filter out null values to prepare training data for a linear regression model. Store this query in a Python variable.


**Reasoning**:
The subtask requires generating a BigQuery SQL query to prepare training data. I will define a Python variable `training_data_sql` and construct the SQL query string using an f-string, incorporating the specified columns, engineered features, and null value filters.



In [35]:
training_data_sql = f"""#standardSQL
SELECT
    Magnitude,
    Latitude,
    Longitude,
    Depth,
    `Depth Error`,
    `Depth Seismic Stations`,
    `Location Source`,
    POW(Depth, 2) AS DepthSquared
FROM
    `{PROJECT_ID}.{DATASET}.database`
"""

print("Generated BigQuery SQL query for training data:")
print(training_data_sql)

Generated BigQuery SQL query for training data:
#standardSQL
SELECT
    Magnitude,
    Latitude,
    Longitude,
    Depth,
    `Depth Error`,
    `Depth Seismic Stations`,
    `Location Source`,
    POW(Depth, 2) AS DepthSquared
FROM
    `mgmt-467-2500.earthquake.database`



## Create BigQuery ML Regression Model

### Subtask:
Construct and execute a `CREATE MODEL` BigQuery ML statement to train a `LINEAR_REG` model to predict 'Magnitude'.


**Reasoning**:
I need to construct the BigQuery ML `CREATE OR REPLACE MODEL` statement by combining existing variables like `PROJECT_ID`, `DATASET`, `MODEL_NAME`, and the `training_data_sql` that was generated previously. I will store this in a Python variable, print it, and then execute it using the `bq query` command, ensuring proper escaping of backticks.



In [36]:
MODEL_NAME = 'earthquake_Depth_predictor'

# Redefine training_data_sql with aliased column names to handle spaces
# This ensures the CREATE MODEL statement uses valid identifiers.
training_data_sql = f"""#standardSQL
SELECT
    Magnitude,
    Latitude,
    Longitude,
    Depth,
    `Depth Error` AS Depth_Error,  -- Alias column with space
    `Depth Seismic Stations` AS Depth_Seismic_Stations, -- Alias column with space
    `Location Source` AS Location_Source, -- Alias column with space
    POW(Depth, 2) AS DepthSquared
FROM
    `{PROJECT_ID}.{DATASET}.database`
"""

create_model_sql = f"""CREATE OR REPLACE MODEL `{PROJECT_ID}.{DATASET}.{MODEL_NAME}`
OPTIONS(
    model_type='LINEAR_REG',
    input_label_cols=['Depth']
)
AS
{training_data_sql}"""

print("Generated BigQuery ML CREATE MODEL query:")
print(create_model_sql)

# Escape backticks in the query string to prevent shell interpretation issues
# Using '\`' for a literal backslash followed by a backtick to avoid SyntaxWarning
create_model_sql_escaped = create_model_sql.replace('`', '\`')

print("Executing BigQuery ML CREATE MODEL query...")
!bq query --nouse_legacy_sql "{create_model_sql_escaped}"

Generated BigQuery ML CREATE MODEL query:
CREATE OR REPLACE MODEL `mgmt-467-2500.earthquake.earthquake_Depth_predictor`
OPTIONS(
    model_type='LINEAR_REG',
    input_label_cols=['Depth']
)
AS
#standardSQL
SELECT
    Magnitude,
    Latitude,
    Longitude,
    Depth,
    `Depth Error` AS Depth_Error,  -- Alias column with space
    `Depth Seismic Stations` AS Depth_Seismic_Stations, -- Alias column with space
    `Location Source` AS Location_Source, -- Alias column with space
    POW(Depth, 2) AS DepthSquared
FROM
    `mgmt-467-2500.earthquake.database`

Executing BigQuery ML CREATE MODEL query...



invalid escape sequence '\`'


invalid escape sequence '\`'


invalid escape sequence '\`'



Waiting on bqjob_r1f369616ebd9f42b_0000019b0f2fc78e_1 ... (23s) Current status: DONE   

b'[query] The input data has NULL values in one or more columns: Depth_Error, Depth_Seismic_Stations. BQML automatically handles null values (See https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-auto-preprocessing#imputation). If null values represent a special value in the data, replace them with the desired value before training and then retry.'



## Evaluate BigQuery ML Model Performance

### Subtask:
Run an ML.EVALUATE BigQuery ML query on the newly trained regression model to assess its performance.


**Reasoning**:
I need to construct the BigQuery ML ML.EVALUATE statement, using the previously defined PROJECT_ID, DATASET, and MODEL_NAME variables to evaluate the trained model. Then, I will print the query, escape any backticks to ensure correct shell execution, and finally execute the query using the bq command.



In [37]:
evaluate_model_sql = f"""SELECT * FROM ML.EVALUATE(MODEL `{PROJECT_ID}.{DATASET}.{MODEL_NAME}`) """

print("Generated BigQuery ML EVALUATE MODEL query:")
print(evaluate_model_sql)

import google.cloud.bigquery

# Initialize the BigQuery client (if not already initialized, ensure PROJECT_ID is set)
# Assuming client is already initialized from the previous step, or re-initialize if needed.
client = google.cloud.bigquery.Client(project=PROJECT_ID)

print("Executing BigQuery ML EVALUATE MODEL query using Python client...")
# Execute the query
query_job = client.query(evaluate_model_sql)
results = query_job.result() # Wait for the job to complete and get results

# Display results
for row in results:
    print(row)


Generated BigQuery ML EVALUATE MODEL query:
SELECT * FROM ML.EVALUATE(MODEL `mgmt-467-2500.earthquake.earthquake_Depth_predictor`) 
Executing BigQuery ML EVALUATE MODEL query using Python client...
Row((25.45399183619078, 9805.526034555902, 0.5893087382057156, 18.184320516220396, 0.3894838843623398, 0.38975437410492664), {'mean_absolute_error': 0, 'mean_squared_error': 1, 'mean_squared_log_error': 2, 'median_absolute_error': 3, 'r2_score': 4, 'explained_variance': 5})


**Reasoning**:
The previous execution successfully evaluated the model, but generated a `SyntaxWarning` due to an incorrect escape sequence for the backtick. I will fix the Python string literal for escaping backticks by using a double backslash (`\\`) to represent a literal backslash in the `replace` method, which will correctly escape the backtick for the shell without Python syntax warnings.



In [38]:
evaluate_model_sql = f"""SELECT * FROM ML.EVALUATE(MODEL `{PROJECT_ID}.{DATASET}.{MODEL_NAME}`) """

print("Generated BigQuery ML EVALUATE MODEL query:")
print(evaluate_model_sql)

# Escape backticks in the query string to prevent shell interpretation issues
# Use '\\`' for a literal backslash followed by a backtick to avoid SyntaxWarning
evaluate_model_sql_escaped = evaluate_model_sql.replace('`', '\\`')

print("Executing BigQuery ML EVALUATE MODEL query...")
!bq query --nouse_legacy_sql "{evaluate_model_sql_escaped}"

Generated BigQuery ML EVALUATE MODEL query:
SELECT * FROM ML.EVALUATE(MODEL `mgmt-467-2500.earthquake.earthquake_Depth_predictor`) 
Executing BigQuery ML EVALUATE MODEL query...
+---------------------+--------------------+------------------------+-----------------------+--------------------+---------------------+
| mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error |      r2_score      | explained_variance  |
+---------------------+--------------------+------------------------+-----------------------+--------------------+---------------------+
|   25.45399183619078 |  9805.526034555902 |     0.5893087382057156 |    18.184320516220396 | 0.3894838843623398 | 0.38975437410492664 |
+---------------------+--------------------+------------------------+-----------------------+--------------------+---------------------+


In [39]:
import pandas as pd
import plotly.express as px
import google.cloud.bigquery

# Construct the ML.PREDICT query
predict_sql = f"""SELECT
    Depth AS actual_depth, -- Corrected to actual_depth
    predicted_Depth AS predicted_depth
FROM
    ML.PREDICT(
        MODEL `{PROJECT_ID}.{DATASET}.{MODEL_NAME}`,
        (SELECT
            Magnitude,
            Latitude,
            Longitude,
            Depth,
            `Depth Error` AS Depth_Error,
            `Depth Seismic Stations` AS Depth_Seismic_Stations,
            `Location Source` AS Location_Source,
            POW(Depth, 2) AS DepthSquared
        FROM
            `{PROJECT_ID}.{DATASET}.database`)
    ) """

print("Executing BigQuery ML PREDICT query to get actual vs. predicted values...")
client = google.cloud.bigquery.Client(project=PROJECT_ID)
query_job = client.query(predict_sql)

# Convert results to a Pandas DataFrame
predictions_df = query_job.to_dataframe()

# Create the scatter plot
fig = px.scatter(
    predictions_df,
    x='actual_depth',
    y='predicted_depth',
    title='Actual Depth vs. Predicted Depth',
    labels={'actual_depth': 'Actual Depth', 'predicted_depth': 'Predicted Depth'},
    trendline='ols',
    trendline_color_override='red',
    template='plotly_white'
)

fig.update_layout(
    xaxis_title='Actual Depth',
    yaxis_title='Predicted Depth',
    hovermode='closest'
)
fig.show()

print("First 5 rows of predictions:")
display(predictions_df.head())

Executing BigQuery ML PREDICT query to get actual vs. predicted values...


First 5 rows of predictions:


Unnamed: 0,actual_depth,predicted_depth
0,131.6,69.734415
1,80.0,51.970083
2,20.0,35.705118
3,15.0,40.474071
4,15.0,41.179681


## Final Task

### Subtask:
Provide a summary of the BigQuery ML model creation and evaluation process, including the features used, the model's performance metrics, and any insights gained or recommendations for further model improvement.


## Visualize Model Evaluation Metrics

### Subtask:
Create a Plotly figure (bar chart) to visualize the BigQuery ML evaluation metrics.

**Reasoning**:
I will re-execute the `ML.EVALUATE` query using the BigQuery Python client to obtain the model's performance metrics. These metrics will then be converted into a Pandas DataFrame, making it easy to create a bar chart with Plotly Express to visualize each metric's value.