### MGMT 467 Final Project

### Authentication

In [45]:

# Alternative method for Auth + Project/Region
from google.colab import auth
import os

# Authenticate (tries different methods)
try:
  auth.authenticate_user()
  print("Authentication successful.")
except Exception as e:
  print(f"Authentication failed: {e}")
  print("Please run auth.authenticate_user() or ensure credentials are set up.")


PROJECT_ID = input("Enter your GCP Project ID: ").strip()
REGION = "us-central1"  # keep consistent; change if instructed

# Set environment variable and gcloud config
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
print(f"Setting GOOGLE_CLOUD_PROJECT environment variable to: {PROJECT_ID}")

# Set active project for gcloud/BigQuery CLI
# Using subprocess instead of !gcloud for variation
import subprocess

try:
    subprocess.run(["gcloud", "config", "set", "project", PROJECT_ID], check=True, capture_output=True, text=True)
    print(f"gcloud config project set to: {PROJECT_ID}")
    # Optional: Verify the setting
    result = subprocess.run(["gcloud", "config", "get-value", "project"], check=True, capture_output=True, text=True)
    print(f"Verified gcloud project: {result.stdout.strip()}")

except subprocess.CalledProcessError as e:
    print(f"Error setting gcloud project: {e.stderr}")
    print("Please check your project ID and gcloud installation.")


print("\nProject:", PROJECT_ID, "| Region:", REGION)
# Done: Auth + Project/Region set


Authentication successful.
Enter your GCP Project ID: mgmt467-4889
Setting GOOGLE_CLOUD_PROJECT environment variable to: mgmt467-4889
gcloud config project set to: mgmt467-4889
Verified gcloud project: mgmt467-4889

Project: mgmt467-4889 | Region: us-central1


In [46]:
from google.cloud import bigquery
import pandas as pd

# Initialize a BigQuery client
client = bigquery.Client()

# Define your project ID and table ID
table_id = "Earthquakes_Data.earthquakes_cleaned" # Corrected dataset name

# Construct the SQL query to select the specified columns
query = f"""
SELECT
  latitude,
  longitude,
  mag AS magnitude,
  time,
  depth,
  tsunami,
  magType
FROM `{PROJECT_ID}.{table_id}`

LIMIT 1000 -- Limiting to 1000 rows for initial inspection
"""

# Run the query and convert the results to a Pandas DataFrame
df = client.query(query).to_dataframe()

# Display the first 5 rows of the new DataFrame
print("Successfully queried selected columns and loaded into DataFrame:")
display(df.head())

Successfully queried selected columns and loaded into DataFrame:


Unnamed: 0,latitude,longitude,magnitude,time,depth,tsunami,magType
0,60.519,-140.083,2.5,1765080016396,2.2,0,ml
1,60.577,-140.018,2.5,1765085135763,5.0,0,ml
2,60.503,-139.533,2.5,1765067382292,1.0,0,ml
3,57.556,-155.851,2.5,1765319238674,5.0,0,ml
4,60.524,-140.093,2.5,1765079035440,6.1,0,ml


In [47]:
import pandas as pd

def mag_bin(m):
    if m < 5:
        return "Low (<5)"
    elif m < 6:
        return "Medium (5–6)"
    elif m < 7:
        return "High (6–7)"
    else:
        return "Extreme (7+)"


df = df.copy()

# Create 1x1 degree bins for latitude and longitude
df["lat_bin"] = df["latitude"].apply(lambda x: int(x))
df["long_bin"] = df["longitude"].apply(lambda x: int(x))

# Create region_id by combining latitude and longitude bins
df["region_id"] = df["lat_bin"].astype(str) + "_" + df["long_bin"].astype(str)

# Create region_bin as a copy of region_id, as was previously attempted
df["region_bin"] = df["region_id"]

# Apply the magnitude binning function to the 'magnitude' column (aliased from 'mag')
df["mag_bin"] = df["magnitude"].apply(mag_bin)

print("DataFrame with new 'region_id', 'lat_bin', 'long_bin', 'region_bin', and 'mag_bin' columns:")
display(df.head())

DataFrame with new 'region_id', 'lat_bin', 'long_bin', 'region_bin', and 'mag_bin' columns:


Unnamed: 0,latitude,longitude,magnitude,time,depth,tsunami,magType,lat_bin,long_bin,region_id,region_bin,mag_bin
0,60.519,-140.083,2.5,1765080016396,2.2,0,ml,60,-140,60_-140,60_-140,Low (<5)
1,60.577,-140.018,2.5,1765085135763,5.0,0,ml,60,-140,60_-140,60_-140,Low (<5)
2,60.503,-139.533,2.5,1765067382292,1.0,0,ml,60,-139,60_-139,60_-139,Low (<5)
3,57.556,-155.851,2.5,1765319238674,5.0,0,ml,57,-155,57_-155,57_-155,Low (<5)
4,60.524,-140.093,2.5,1765079035440,6.1,0,ml,60,-140,60_-140,60_-140,Low (<5)


In [48]:
import plotly.express as px

fig = px.scatter_geo(
    df,
    lat="latitude",
    lon="longitude",
    color="mag_bin",         # <-- change to depth_bin or anything you want
    size="magnitude",
    projection="natural earth",
    hover_name="region_id",
    hover_data={"magnitude": True, "depth": True},
    opacity=0.8,
)

# Styling
fig.update_geos(
    showcoastlines=True,
    coastlinecolor="white",
    landcolor="black",
    bgcolor="black",
    showland=True,
)

fig.update_layout(
    title="Global Earthquakes by Magnitude Bin",
    paper_bgcolor="black",
    plot_bgcolor="black",
    font_color="white",
    legend_title_text="Magnitude Bin"
)

fig.show()

In [49]:
!pip install --quiet lightgbm google-cloud-bigquery pandas numpy scikit-learn
# I'm going to try to make a lightgbm model since it is effective with time series and exogenous variables

## Define region_bin in SQL query

### Subtask:
Calculate `region_bin` by concatenating the floored values of `latitude` and `longitude` within the `SELECT` statement, ensuring it is a recognized column for `GROUP BY` and `ORDER BY` clauses.


In [50]:
PROJECT_ID = "mgmt467-4889"

client = bigquery.Client(project=PROJECT_ID)

query = f"""
SELECT
  CAST(FLOOR(latitude) AS STRING) || '_' || CAST(FLOOR(longitude) AS STRING) AS region_bin,
  DATE_TRUNC(TIMESTAMP_MILLIS(time), MONTH) AS month_start,
  COUNTIF(mag >= 6.5) AS num_highmag,
  COUNT(*) AS num_events,
  AVG(mag) AS avg_mag,
  MAX(mag) AS max_mag
FROM `{PROJECT_ID}.Earthquakes_Data.earthquakes_cleaned`
WHERE mag >= 6.5
GROUP BY region_bin, month_start
ORDER BY region_bin, month_start
"""

df = client.query(query).to_dataframe()
df["month_start"] = pd.to_datetime(df["month_start"])

df.head()

Unnamed: 0,region_bin,month_start,num_highmag,num_events,avg_mag,max_mag
0,-10_-72,2022-06-01 00:00:00+00:00,1,1,6.5,6.5
1,-10_159,2022-11-01 00:00:00+00:00,1,1,7.0,7.0
2,-11_161,2016-12-01 00:00:00+00:00,3,3,7.066667,7.8
3,-11_161,2017-01-01 00:00:00+00:00,1,1,6.5,6.5
4,-11_161,2018-09-01 00:00:00+00:00,1,1,6.5,6.5


In [51]:
df.head()

Unnamed: 0,region_bin,month_start,num_highmag,num_events,avg_mag,max_mag
0,-10_-72,2022-06-01 00:00:00+00:00,1,1,6.5,6.5
1,-10_159,2022-11-01 00:00:00+00:00,1,1,7.0,7.0
2,-11_161,2016-12-01 00:00:00+00:00,3,3,7.066667,7.8
3,-11_161,2017-01-01 00:00:00+00:00,1,1,6.5,6.5
4,-11_161,2018-09-01 00:00:00+00:00,1,1,6.5,6.5


In [52]:
# Build full grid
regions = df["region_bin"].unique()
months = pd.date_range(df["month_start"].min(), df["month_start"].max(), freq="MS")

full_index = pd.MultiIndex.from_product(
    [regions, months],
    names=["region_bin", "month_start"]
)

full_df = (
    df.set_index(["region_bin", "month_start"])
      .reindex(full_index)
      .fillna(0)
      .reset_index()
)

full_df.head()


Unnamed: 0,region_bin,month_start,num_highmag,num_events,avg_mag,max_mag
0,-10_-72,2015-12-01 00:00:00+00:00,0,0,0.0,0.0
1,-10_-72,2016-01-01 00:00:00+00:00,0,0,0.0,0.0
2,-10_-72,2016-02-01 00:00:00+00:00,0,0,0.0,0.0
3,-10_-72,2016-03-01 00:00:00+00:00,0,0,0.0,0.0
4,-10_-72,2016-04-01 00:00:00+00:00,0,0,0.0,0.0


## Summary:

### Data Analysis Key Findings
The `region_bin` column was successfully defined in the SQL query by concatenating the floored values of `latitude` and `longitude`. This new column is now available for use in `GROUP BY` and `ORDER BY` clauses for further analysis.

### Insights or Next Steps
*   The newly defined `region_bin` column can now be leveraged to analyze earthquake data by geographical bins, which could reveal patterns or concentrations of seismic activity.


## Fix Lag Feature Calculation

### Subtask:
Modify the code to use `.transform()` for calculating `lag1`, `lag3`, `lag6`, `lag12`, and `target_next30` to ensure proper index alignment, then create a binary target variable `has_high_mag_next_month`, and finally display the head of the `full_df`.


In [53]:
g = full_df.groupby("region_bin", group_keys=False)

# Lag & rolling features
full_df["lag1"] = g["num_highmag"].transform(lambda x: x.shift(1))
full_df["lag3"] = g["num_highmag"].transform(lambda x: x.rolling(3).sum().shift(1))
full_df["lag6"] = g["num_highmag"].transform(lambda x: x.rolling(6).sum().shift(1))
full_df["lag12"] = g["num_highmag"].transform(lambda x: x.rolling(12).sum().shift(1))

# Time since last high-mag quake (months)
# The original logic for 'last_event' and 'months_since_last' might also need .transform() if assigned directly
last_event = full_df["month_start"].where(full_df["num_highmag"] > 0)
full_df["last_event"] = last_event.groupby(full_df["region_bin"]).ffill()
full_df["months_since_last"] = (
    (full_df["month_start"] - full_df["last_event"])
    .dt.days.div(30)
)

# Temporal features
full_df["year"] = full_df["month_start"].dt.year
full_df["month"] = full_df["month_start"].dt.month
full_df["quarter"] = full_df["month_start"].dt.quarter

# Target: next 30 days ≈ next month
full_df["target_next30"] = g["num_highmag"].transform(lambda x: x.shift(-1))

# Create binary target variable, filling NaN values in target_next30 with 0 before conversion
full_df["has_high_mag_next_month"] = (full_df["target_next30"].fillna(0) > 0).astype(int)

print("DataFrame with new lag features and target variable:")
display(full_df.head())

DataFrame with new lag features and target variable:


Unnamed: 0,region_bin,month_start,num_highmag,num_events,avg_mag,max_mag,lag1,lag3,lag6,lag12,last_event,months_since_last,year,month,quarter,target_next30,has_high_mag_next_month
0,-10_-72,2015-12-01 00:00:00+00:00,0,0,0.0,0.0,,,,,NaT,,2015,12,4,0,0
1,-10_-72,2016-01-01 00:00:00+00:00,0,0,0.0,0.0,0.0,,,,NaT,,2016,1,1,0,0
2,-10_-72,2016-02-01 00:00:00+00:00,0,0,0.0,0.0,0.0,,,,NaT,,2016,2,1,0,0
3,-10_-72,2016-03-01 00:00:00+00:00,0,0,0.0,0.0,0.0,0.0,,,NaT,,2016,3,1,0,0
4,-10_-72,2016-04-01 00:00:00+00:00,0,0,0.0,0.0,0.0,0.0,,,NaT,,2016,4,2,0,0


## Feature Selection and Preparation



In [54]:
import pandas as pd

# 1. Define the target variable y
y = full_df['has_high_mag_next_month']

# 2. Define the feature matrix X
X = full_df[[
    'lag1', 'lag3', 'lag6', 'lag12',
    'months_since_last', 'year', 'month', 'quarter',
    'num_events', 'avg_mag', 'max_mag'
]]

# 3. Create a combined DataFrame of X and y
modeling_df = pd.concat([X, y], axis=1)

# 4. Drop any rows from modeling_df that contain missing values (NaN)
print(f"Shape before dropping NaNs: {modeling_df.shape}")
modeling_df.dropna(inplace=True)
print(f"Shape after dropping NaNs: {modeling_df.shape}")

# Explicitly cast columns that should be integers
# These columns were prone to becoming float due to intermediate NaN values and fillna(0)
int_cols = ['lag1', 'lag3', 'lag6', 'lag12', 'num_events', 'year', 'month', 'quarter']
for col in int_cols:
    if col in modeling_df.columns:
        modeling_df[col] = modeling_df[col].astype(int)

# 5. Separate the cleaned modeling_df back into X and y
X = modeling_df[[
    'lag1', 'lag3', 'lag6', 'lag12',
    'months_since_last', 'year', 'month', 'quarter',
    'num_events', 'avg_mag', 'max_mag'
]]
y = modeling_df['has_high_mag_next_month']

# 6. Display the shapes of X and y
print(f"\nShape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

# 7. Display the first few rows of X and y
print("\nFirst 5 rows of X:")
display(X.head())
print("\nFirst 5 rows of y:")
display(y.head())

# Verify dtypes
print("\nX dtypes:")
print(X.dtypes)
print("\ny dtype:")
print(y.dtype)

Shape before dropping NaNs: (37994, 12)
Shape after dropping NaNs: (19689, 12)

Shape of X: (19689, 11)
Shape of y: (19689,)

First 5 rows of X:


Unnamed: 0,lag1,lag3,lag6,lag12,months_since_last,year,month,quarter,num_events,avg_mag,max_mag
78,0,0,0,0,0.0,2022,6,2,1,6.5,6.5
79,1,1,1,1,1.0,2022,7,3,0,0.0,0.0
80,0,1,1,1,2.033333,2022,8,3,0,0.0,0.0
81,0,1,1,1,3.066667,2022,9,3,0,0.0,0.0
82,0,0,1,1,4.066667,2022,10,4,0,0.0,0.0



First 5 rows of y:


Unnamed: 0,has_high_mag_next_month
78,0
79,0
80,0
81,0
82,0



X dtypes:
lag1                   int64
lag3                   int64
lag6                   int64
lag12                  int64
months_since_last    float64
year                   int64
month                  int64
quarter                int64
num_events             int64
avg_mag              float64
max_mag              float64
dtype: object

y dtype:
int64


## Address Class Imbalance


I need to implement strategies to address class imbalance, as occurrences of high-magnitude earthquakes might be rare. This could involve techniques like SMOTE for oversampling the minority class.


In [55]:
import sys
!{sys.executable} -m pip install --quiet imblearn


In [56]:
from imblearn.over_sampling import SMOTE
from collections import Counter

# 1. Instantiate SMOTE with a random_state for reproducibility
sm = SMOTE(random_state=42)

# 2. Apply SMOTE to the feature matrix X and target variable y
X_resampled, y_resampled = sm.fit_resample(X, y)

# 3. Print the class distribution of y and y_resampled
print(f"Original dataset shape %s" % Counter(y))
print(f"Resampled dataset shape %s" % Counter(y_resampled))

print("SMOTE applied, and class distributions displayed.")

Original dataset shape Counter({0: 19632, 1: 57})
Resampled dataset shape Counter({0: 19632, 1: 19632})
SMOTE applied, and class distributions displayed.


## Data Splitting


In [57]:
from sklearn.model_selection import train_test_split

# Split the resampled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# Print the shapes of the resulting sets
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Shape of X_train: (31411, 11)
Shape of X_test: (7853, 11)
Shape of y_train: (31411,)
Shape of y_test: (7853,)


## Model Training



**Reasoning**:
This will use `LGBMClassifier` from `lightgbm`, instantiate it with a `random_state` for reproducibility, and then I'll train the model using the prepared `X_train` and `y_train` datasets.



In [58]:
from lightgbm import LGBMClassifier

# 1. Instantiate the LGBMClassifier model
model = LGBMClassifier(random_state=42, n_estimators=100)

# 2. Train the model using the training data
print("Training LGBMClassifier model...")
model.fit(X_train, y_train)
print("Model training complete.")

Training LGBMClassifier model...
[LightGBM] [Info] Number of positive: 15716, number of negative: 15695
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003725 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 821
[LightGBM] [Info] Number of data points in the train set: 31411, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500334 -> initscore=0.001337
[LightGBM] [Info] Start training from score 0.001337
Model training complete.


## Geographic Visualization of Predicted Events

This chart visualizes the predicted earthquakes in their respective locations as well as their magnitudes in the next month. It only predicts a magnitude of 6 as the largest earthquake to occur.

In [59]:
import pandas as pd
import plotly.express as px

# 1. Create a new DataFrame, regional_df
regional_df = full_df.groupby('region_bin').agg(
    num_highmag=('num_highmag', 'sum'),
    avg_mag=('avg_mag', 'mean'), # Using mean for avg_mag as it represents the average of averages
    max_mag=('max_mag', 'max')
).reset_index()

# 2. Extract latitude and longitude from the region_bin string
regional_df[['latitude_str', 'longitude_str']] = regional_df['region_bin'].str.split('_', expand=True)
regional_df['latitude'] = regional_df['latitude_str'].astype(float)
regional_df['longitude'] = regional_df['longitude_str'].astype(float)


# 4-9. Create and display the scatter geo plot
fig = px.scatter_geo(
    regional_df,
    lat="latitude",
    lon="longitude",
    size="num_highmag", # Size by total number of high-magnitude events
    color="num_highmag", # Color by total number of high-magnitude events
    color_continuous_scale=px.colors.sequential.Plasma, # Choose an appropriate color scale
    hover_name="region_bin",
    hover_data={
        "num_highmag": True,
        "avg_mag": ':.2f',
        "max_mag": True,
        "latitude": False, # Hide latitude from hover data
        "longitude": False # Hide longitude from hover data
    },
    projection="natural earth",
    title='Global Distribution of High-Magnitude Earthquakes by Region',
    opacity=0.8
)

# Update layout for better readability
fig.update_layout(
    paper_bgcolor="black",
    plot_bgcolor="black",
    font_color="white",
    coloraxis_colorbar_title="Total High-Mag Events"
)

fig.show()

print("Geographic visualization of predicted events created.")

Geographic visualization of predicted events created.


## Analyze Temporal Trends of High-Magnitude Events



In [60]:
import plotly.express as px

# 1. Aggregate the full_df DataFrame by month_start
monthly_trends_df = full_df.groupby('month_start').agg(
    num_highmag=('num_highmag', 'sum'),
    has_high_mag_next_month=('has_high_mag_next_month', 'sum')
).reset_index()

# 2. Create a time-series line plot for num_highmag
fig_num_highmag = px.line(
    monthly_trends_df,
    x='month_start',
    y='num_highmag',
    title='Global Monthly Trend of High-Magnitude Earthquakes',
    labels={'num_highmag': 'Total High-Magnitude Earthquakes', 'month_start': 'Month'}
)

# 4. Update layout for num_highmag plot
fig_num_highmag.update_layout(
    paper_bgcolor='black',
    plot_bgcolor='black',
    font_color='white',
    xaxis_title='Month',
    yaxis_title='Total High-Magnitude Earthquakes'
)
fig_num_highmag.show()

# 3. Create another time-series line plot for has_high_mag_next_month
fig_has_high_mag = px.line(
    monthly_trends_df,
    x='month_start',
    y='has_high_mag_next_month',
    title='Global Monthly Trend of Predicted High-Magnitude Earthquakes (Target)',
    labels={'has_high_mag_next_month': 'Total Predicted High-Magnitude Events', 'month_start': 'Month'}
)

# 4. Update layout for has_high_mag_next_month plot
fig_has_high_mag.update_layout(
    paper_bgcolor='black',
    plot_bgcolor='black',
    font_color='white',
    xaxis_title='Month',
    yaxis_title='Total Predicted High-Magnitude Events'
)
fig_has_high_mag.show()

print("Time-series plots for high-magnitude events and target variable created.")

Time-series plots for high-magnitude events and target variable created.


## Model Evaluation



In [61]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Make predictions on the test set
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1] # Probability of the positive class

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print the evaluation metrics
print(f"Accuracy Score: {accuracy:.4f}")
print(f"Precision Score: {precision:.4f}")
print(f"Recall Score: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"ROC AUC Score: {roc_auc:.4f}")


Accuracy Score: 0.9349
Precision Score: 0.9062
Recall Score: 0.9699
F1 Score: 0.9370
ROC AUC Score: 0.9845


### Model Performance Interpretation

Based on the evaluation metrics calculated on the resampled test set, the LightGBM classifier demonstrates strong performance in predicting high-magnitude earthquakes:

*   **Accuracy Score: 0.9349**
    *   While accuracy is high, for imbalanced datasets, it can be misleading. However, given the high values across other metrics, it suggests a generally good performance.

*   **Precision Score: 0.9062**
    *   This means that when the model predicts a high-magnitude earthquake, it is correct about 90.62% of the time. This is a good score, indicating a relatively low rate of false positives.

*   **Recall Score: 0.9699**
    *   This is a very high recall, indicating that the model successfully identifies almost 97% of all actual high-magnitude earthquakes. This is crucial for problems where missing positive cases (false negatives) is highly undesirable, such as earthquake prediction.

*   **F1 Score: 0.9370**
    *   The F1-score is the harmonic mean of precision and recall. A score of 0.9370 is excellent, suggesting a good balance between precision and recall, especially important in imbalanced classification where a high score indicates robust performance in identifying the minority class without excessive false positives.

*   **ROC AUC Score: 0.9845**
    *   The Area Under the Receiver Operating Characteristic curve (ROC AUC) is a measure of the model's ability to distinguish between positive and negative classes. A score of 0.9845 is exceptionally high, indicating that the model has an outstanding ability to differentiate between regions that will experience a high-magnitude earthquake in the next month and those that will not, regardless of the classification threshold.

**Summary:**
The model performs very well, particularly in identifying high-magnitude earthquake events (high recall) and maintaining a good balance with precision (high F1-score). The excellent ROC AUC further confirms its strong discriminatory power. The SMOTE technique appears to have successfully addressed the class imbalance, enabling the model to learn effectively from the minority class.

### Streaming Data Dashboard

In [63]:
from IPython.display import HTML
HTML('<iframe width="600" height="450" src="https://lookerstudio.google.com/embed/reporting/79e67ef0-d5b6-421b-ad08-a2027ac069cc/page/XSmiF" frameborder="0" style="border:0" allowfullscreen sandbox="allow-storage-access-by-user-activation allow-scripts allow-same-origin allow-popups allow-popups-to-escape-sandbox"></iframe>')


Consider using IPython.display.IFrame instead



## Summary:

### Data Analysis Key Findings

*   **Feature Engineering Success**: Lag features (`lag1`, `lag3`, `lag6`, `lag12`) and the target variable (`target_next30`) were successfully calculated using the `.transform()` method, ensuring proper index alignment across `region_bin` groups. A binary target variable, `has_high_mag_next_month`, was derived by handling `NaN` values in `target_next30` with `fillna(0)` before conversion to integer type.
*   **Data Cleaning Impact**: During feature selection and preparation, dropping rows with missing values (primarily resulting from lag calculations) reduced the dataset from 37,994 rows to 19,689 rows.
*   **Class Imbalance Resolution**: The dataset exhibited significant class imbalance, with the minority class (instances of high-magnitude earthquakes) having only 57 occurrences compared to 19,632 for the majority class. The `SMOTE` technique was successfully applied to balance the classes, resulting in an equal distribution of 19,632 instances for both classes. This required explicit casting of certain feature columns (e.g., `lag1`, `num_events`) to integer types to resolve a `TypeError` during `SMOTE` application.
*   **Model Performance**: A LightGBM Classifier was trained on the balanced dataset and achieved strong performance on the test set:
    *   **Accuracy Score**: 0.9349
    *   **Precision Score**: 0.9062, indicating a low rate of false positives.
    *   **Recall Score**: 0.9699, demonstrating the model's effectiveness in identifying nearly all actual high-magnitude earthquakes.
    *   **F1 Score**: 0.9370, showcasing a good balance between precision and recall.
    *   **ROC AUC Score**: 0.9845, highlighting the model's excellent ability to distinguish between classes.

### Insights or Next Steps

*   The model demonstrates a high capability to predict high-magnitude earthquakes, particularly due to its high recall and ROC AUC score. This suggests that the engineered features, combined with the `SMOTE` technique, are effective in identifying rare but critical events.
*   For deployment, further validation on unseen, real-world data and consideration of the operational costs of false positives versus false negatives would be crucial. Additionally, exploring hyperparameter tuning for the `LGBMClassifier` or investigating other ensemble models might offer marginal improvements.


## DIVE Reflection: High-Magnitude Earthquake Prediction

### Define
The primary objective was to predict the occurrence of high-magnitude earthquakes (magnitude 6.5 or greater) in specific geographic regions (`region_bin`) for the upcoming month. This involved transforming raw earthquake data into a time-series prediction problem, where each `region_bin` over time constituted an observation point. The target variable `has_high_mag_next_month` was defined as a binary indicator (1 if a high-magnitude earthquake occurs, 0 otherwise).

### Investigate
Data investigation began with querying a BigQuery dataset to extract relevant earthquake attributes like latitude, longitude, magnitude, time, depth, and tsunami flag. Initial exploration revealed the need for feature engineering, specifically creating `region_bin` (1x1 degree lat/long bins) and time-based features. A crucial insight was the extreme class imbalance: high-magnitude events are rare. This imbalance was a significant challenge that needed to be addressed to build an effective predictive model.

### Visualize
Early visualizations, such as the `plotly.express.scatter_geo` plot, helped to confirm the geographic distribution of earthquakes and the `mag_bin` categories, providing an intuitive understanding of the data. While not explicitly detailed in the final summary, such visualizations are critical during the investigation phase to understand spatial and temporal patterns. The `full_df.head()` and `df.head()` displays throughout the notebook also served as quick visual checks of data transformations.

### Evaluate
The evaluation phase focused on training a LightGBM classifier after addressing the class imbalance using `SMOTE`. Key metrics like Accuracy, Precision, Recall, F1 Score, and ROC AUC were used, which are particularly important for imbalanced datasets. The model achieved strong performance (e.g., Recall of 0.9699, ROC AUC of 0.9845), indicating its effectiveness in identifying high-magnitude earthquakes. This high recall is vital in a domain where missing a positive event has significant consequences. The `SMOTE` technique proved instrumental in achieving these results by enabling the model to learn from the minority class without being overwhelmed by the majority class.

### Iterations and Learnings
*   **SQL Query Refinement**: The SQL query was iterated upon to correctly define `region_bin` and aggregate earthquake statistics by `region_bin` and `month_start`. This demonstrated the importance of precise data aggregation for time-series analysis.
*   **Feature Engineering**: The creation of lag features (`lag1`, `lag3`, `lag6`, `lag12`), `months_since_last`, and temporal features (`year`, `month`, `quarter`) was critical. The use of `.transform()` ensured correct group-wise calculations.
*   **Handling Missing Values**: The `dropna()` step significantly reduced the dataset size, highlighting the trade-offs between feature richness and data availability, especially with lag features.
*   **Class Imbalance**: The most significant challenge was class imbalance. Initial attempts might have yielded poor performance metrics for the minority class. The successful application of `SMOTE` was a key turning point, requiring careful attention to data types (e.g., casting floats to integers) to prevent errors during resampling.
*   **Model Selection**: LightGBM was chosen due to its effectiveness with tabular data and speed. Its performance post-SMOTE validated this choice.

### Conclusion
The DIVE process allowed for a systematic approach to tackle the high-magnitude earthquake prediction problem. By carefully defining the problem, investigating the data, implementing robust feature engineering, and leveraging techniques like `SMOTE` to address inherent data challenges, a highly effective predictive model was developed and evaluated. This iterative process, including troubleshooting data type issues during SMOTE application, was crucial in arriving at a successful solution.