# Task 3: Complete Analysis - From EDA to Predictive Modeling

**Student:** Nicholas Fleischhauer  
**Date:** December 2, 2025  
**Dataset:** NOAA Storm Events (2020-2025)

---

## Two-Phase Analysis

### **PART 1: Exploratory Data Analysis (Task 2 - RDD Aggregations)**
- Descriptive statistics using RDD operations
- Aggregations to understand patterns in human harm
- Identify which event types and locations are most affected

### **PART 2: Predictive Modeling (Task 3 - Random Forest)**
- Machine Learning to quantify predictive importance
- Feature importance analysis
- Answer: Which factors are STRONGEST predictors?

---

# Task 2: Predictors of Human Harm - Aggregation Job

**Student:** Nicholas Fleischhauer  
**Date:** November 23, 2025

## Research Question
Which factors are the strongest predictors of human harm? Can we determine if 'human' factors (location) are more predictive than 'storm' factors (EVENT_TYPE, MAGNITUDE_TYPE)?

## Summarization/Aggregation Job Overview
This notebook performs PySpark aggregation operations on the NOAA Storm Events dataset to:
1. **Aggregate human harm** (injuries + deaths) by storm factors (EVENT_TYPE, MAGNITUDE_TYPE)
2. **Aggregate human harm** by location factors (STATE, CZ_NAME)  
3. **Count event frequency** per location as a population density proxy
4. **Analyze combined factors** (EVENT_TYPE × STATE) to identify patterns

**Dataset:** NOAA Storm Events (2020-2025 subset, ~371K rows, 51 columns)  
**Operations:** GroupBy aggregations using PySpark RDD transformations (map, filter, reduceByKey)
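
The (sum, count) pattern behind these aggregations can be sketched without Spark. The snippet below is a minimal pure-Python illustration on made-up records; `reduce_by_key` is a hypothetical helper that mimics what `reduceByKey` does for each key on a local list, not Spark's implementation.

```python
def reduce_by_key(pairs, combine):
    """Merge values that share a key, mimicking RDD.reduceByKey on a local list."""
    merged = {}
    for key, value in pairs:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged

# Toy (event_type, harm) records; these values are made up for illustration.
events = [("Tornado", 3), ("Heat", 1), ("Tornado", 5), ("Heat", 0), ("Flood", 2)]

# filter: keep harmful events; map: attach a count of 1 to each harm value.
pairs = [(etype, (harm, 1)) for etype, harm in events if harm > 0]

# reduceByKey: add harm totals and event counts component-wise.
totals = reduce_by_key(pairs, lambda a, b: (a[0] + b[0], a[1] + b[1]))

# mapValues: derive the average from (total, count).
summary = {k: (total, n, total / n) for k, (total, n) in totals.items()}

for etype, (total, n, avg) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
    print(f"{etype}: Total={total}, Count={n}, Avg={avg:.2f}")
```

Carrying the count alongside the sum is what allows the average to be computed after the reduce step, since per-key averages cannot be merged pair-wise on their own.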


In [1]:
from pyspark.sql import SparkSession
import pyspark
from pyspark import SparkContext
import math


In [2]:
# Create SparkSession for CSV reading, then get SparkContext for RDD operations
# Configure for GCS access (uncomment auth configs after I get the service account key)

# Base configuration
spark_builder = SparkSession.builder \
    .appName("HumanHarmAnalysis") \
    .config("spark.jars", "/opt/spark/jars/gcs-connector-hadoop3-2.2.11.jar")

# TODO:
# Uncomment these lines when using the gcs-key.json in the project root:
# spark_builder = spark_builder \
#     .config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
#     .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "/home/sparkdev/app/gcs-key.json")

spark = spark_builder.getOrCreate()
sc = spark.sparkContext

# Silence verbose Spark logs - only show warnings and errors
sc.setLogLevel("WARN")



25/12/03 06:08:52 WARN DependencyUtils: Local jar /opt/spark/jars/gcs-connector-hadoop3-2.2.11.jar does not exist, skipping.
25/12/03 06:08:53 ERROR SparkContext: Failed to add /opt/spark/jars/gcs-connector-hadoop3-2.2.11.jar to Spark environment
java.io.FileNotFoundException: Jar /opt/spark/jars/gcs-connector-hadoop3-2.2.11.jar not found

In [3]:
# Load data using Spark's CSV reader (handles quotes, escaping properly)
# Then convert to RDD for RDD operations

# Local subset file (for testing)
csv_path = "/home/sparkdev/app/Task2/storm_g2020.csv"

# TODO: Use this after I get the service account key
# Full dataset from GCS (uncomment when GCS is configured)
# csv_path = "gs://msds-694-cohort-14-group12/storm_data.csv"


# Read CSV with proper handling of headers, quotes, and escaping
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .option("multiLine", "true") \
    .csv(csv_path)

# Convert DataFrame to RDD of Row objects
rdd = df.rdd

print(f"RDD loaded: {rdd.count()} rows")
print(f"Number of partitions: {rdd.getNumPartitions()}")



RDD loaded: 371544 rows
Number of partitions: 1


                                                                                

In [4]:
# Preview the data - RDD of Row objects
print("First few rows:")
for row in rdd.take(3):
    print(row)

# Get column names from DataFrame for reference
print(f"\nColumn names ({len(df.columns)} total):")
print(df.columns)


First few rows:



Row(BEGIN_YEARMONTH=202006, BEGIN_DAY=24, BEGIN_TIME=1620, END_YEARMONTH=202006, END_DAY=24, END_TIME=1620, EPISODE_ID=149684.0, EVENT_ID=902190, STATE='GEORGIA', STATE_FIPS=13.0, YEAR=2020, MONTH_NAME='June', EVENT_TYPE='Thunderstorm Wind', CZ_TYPE='C', CZ_FIPS=321, CZ_NAME='WORTH', WFO='TAE', BEGIN_DATE_TIME='24-JUN-20 16:20:00', CZ_TIMEZONE='EST-5', END_DATE_TIME='24-JUN-20 16:20:00', INJURIES_DIRECT=0, INJURIES_INDIRECT=0, DEATHS_DIRECT=0, DEATHS_INDIRECT=0, DAMAGE_PROPERTY='0.00K', DAMAGE_CROPS='0.00K', SOURCE='911 Call Center', MAGNITUDE=50.0, MAGNITUDE_TYPE='EG', FLOOD_CAUSE=None, CATEGORY=None, TOR_F_SCALE=None, TOR_LENGTH=None, TOR_WIDTH=None, TOR_OTHER_WFO=None, TOR_OTHER_CZ_STATE=None, TOR_OTHER_CZ_FIPS=None, TOR_OTHER_CZ_NAME=None, BEGIN_RANGE=1.0, BEGIN_AZIMUTH='W', BEGIN_LOCATION='DOLES', END_RANGE=1.0, END_AZIMUTH='W', END_LOCATION='DOLES', BEGIN_LAT=31.7, BEGIN_LON=-83.89, END_LAT=31.7, END_LON=-83.89, EPISODE_NARRATIVE='As is typical during summer, scattered afternoon 

                                                                                

## Key Columns (access by name using row['COLUMN_NAME'])
- **STATE** - State name
- **EVENT_TYPE** - Type of storm event
- **CZ_NAME** - County/Zone name
- **INJURIES_DIRECT** - Direct injuries
- **INJURIES_INDIRECT** - Indirect injuries
- **DEATHS_DIRECT** - Direct deaths
- **DEATHS_INDIRECT** - Indirect deaths
- **MAGNITUDE** - Storm magnitude
- **MAGNITUDE_TYPE** - Type of magnitude measurement

Note: Since we read CSV as DataFrame then converted to RDD, rows are Row objects.
Access columns by name: `row['STATE']` or `row.STATE`


In [5]:
def safe_int(x):
    """Safely convert to int, return 0 if None/null"""
    if x is None:
        return 0
    try:
        return int(float(x))
    except (ValueError, TypeError):
        return 0

def calculate_total_harm(row):
    """Calculate total human harm: injuries + deaths (direct + indirect)"""
    # Row objects support access by column name: row['COLUMN'] returns None for null values
    # (a column name that does not exist raises an error instead of returning None)
    injuries_direct = safe_int(row['INJURIES_DIRECT'])
    injuries_indirect = safe_int(row['INJURIES_INDIRECT'])
    deaths_direct = safe_int(row['DEATHS_DIRECT'])
    deaths_indirect = safe_int(row['DEATHS_INDIRECT'])
    return injuries_direct + injuries_indirect + deaths_direct + deaths_indirect

# Create RDD with total harm calculated
rdd_with_harm = rdd.map(lambda row: (row, calculate_total_harm(row)))

# Filter to only rows with harm > 0
rdd_harm = rdd_with_harm.filter(lambda x: x[1] > 0)

# Calculate metrics
events_with_harm = rdd_harm.count()
total_harm_sum = int(rdd_harm.map(lambda x: x[1]).sum())

# Display results with clean formatting
print("="*60)
print("DATASET OVERVIEW: Human Harm Analysis")
print("="*60)
print(f"Total events with harm:        {events_with_harm:,} events")
print(f"Total people harmed:           {total_harm_sum:,} people")
print(f"Average harm per event:        {total_harm_sum/events_with_harm:.2f} people/event")
print("="*60)



DATASET OVERVIEW: Human Harm Analysis
Total events with harm:        4,604 events
Total people harmed:           17,541 people
Average harm per event:        3.81 people/event


                                                                                

## Analysis 1: Human Harm by Storm Factors


In [6]:
# Aggregate total harm by EVENT_TYPE
harm_by_event = (
    rdd_harm
    .map(lambda x: (x[0]['EVENT_TYPE'] if x[0]['EVENT_TYPE'] else 'UNKNOWN', x[1]))
    .filter(lambda x: x[0] and x[0] != "")  # Only events with valid type
    .mapValues(lambda v: (v, 1))  # (harm, count)
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # Sum harm and count
    .mapValues(lambda x: (x[0], x[1], x[0] / x[1] if x[1] > 0 else 0))  # (total_harm, count, avg_harm)
)

# Sort by total harm descending
harm_by_event_sorted = harm_by_event.sortBy(lambda x: x[1][0], ascending=False)

# Total is the summed harm for the event type: 2 hurricanes with 100 deaths each give Total=200
# Count is the number of events with harm: the same 2 hurricanes give Count=2
# Avg is the average harm per harmful event: the same 2 hurricanes give Avg=100
print("Top 10 Event Types by Total Human Harm:")
for event_type, (total_harm, count, avg_harm) in harm_by_event_sorted.take(10):
    print(f"{event_type}: Total={int(total_harm)}, Count={count}, Avg={avg_harm:.2f}")


Top 10 Event Types by Total Human Harm:



Tornado: Total=4146, Count=515, Avg=8.05
Excessive Heat: Total=3561, Count=332, Avg=10.73
Heat: Total=1646, Count=569, Avg=2.89
Thunderstorm Wind: Total=1193, Count=577, Avg=2.07
Winter Weather: Total=931, Count=303, Avg=3.07
Wildfire: Total=742, Count=131, Avg=5.66
Rip Current: Total=590, Count=382, Avg=1.54
Flash Flood: Total=549, Count=256, Avg=2.14
Lightning: Total=402, Count=229, Avg=1.76
Winter Storm: Total=393, Count=139, Avg=2.83


                                                                                

In [7]:
# Aggregate total harm by MAGNITUDE_TYPE
harm_by_magnitude = (
    rdd_harm
    .map(lambda x: (x[0]['MAGNITUDE_TYPE'] if x[0]['MAGNITUDE_TYPE'] else 'NONE', x[1]))
    .filter(lambda x: x[0] and x[0] != "")
    .mapValues(lambda v: (v, 1))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    .mapValues(lambda x: (x[0], x[1], x[0] / x[1] if x[1] > 0 else 0))
)

harm_by_magnitude_sorted = harm_by_magnitude.sortBy(lambda x: x[1][0], ascending=False)

print("\nHarm by Magnitude Type:")
for mag_type, (total_harm, count, avg_harm) in harm_by_magnitude_sorted.collect():
    print(f"{mag_type}: Total={int(total_harm)}, Count={count}, Avg={avg_harm:.2f}")



Harm by Magnitude Type:



NONE: Total=15906, Count=3766, Avg=4.22
EG: Total=1418, Count=738, Avg=1.92
MG: Total=208, Count=98, Avg=2.12
ES: Total=9, Count=2, Avg=4.50


                                                                                

### Key Findings: Storm Factors and Human Harm

**MAGNITUDE_TYPE Analysis:**

The MAGNITUDE_TYPE field records the type of magnitude measurement for storm events. The categories include:
- **EG** (Estimated Gust) - estimated wind speed in knots
- **MG** (Measured Gust) - measured wind speed in knots
- **ES** (Estimated Sustained) - estimated sustained wind speed
- **NONE** - no magnitude measurement recorded

**Critical:** Events with **NONE** as the magnitude type account for the vast majority of human harm (15,906 people across 3,766 events, averaging 4.22 people per event). This significantly outpaces events with recorded wind measurements (EG averages only 1.92 people/event).

This suggests that **events recorded without a magnitude type** (floods, tornadoes without wind speed data, extreme temperatures, and other weather phenomena that don't record magnitude) account for **more human harm** than events with estimated or measured wind speeds. Note that these averages are computed only over events that caused harm (`rdd_harm`), so they describe how severe a harmful event tends to be, not how likely an event is to cause harm. The finding supports the hypothesis that different storm factors may predict harm differently, and that magnitude measurements alone are insufficient predictors of human impact.

For our Random Forest modeling in later phases, this indicates that:
1. EVENT_TYPE (the type of storm) may be a stronger predictor than MAGNITUDE_TYPE
2. Location factors (STATE, CZ_NAME) could be even more important if they correlate with severe non-wind events
3. We should engineer features that capture the severity of events beyond just wind/hail measurements
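
To make the comparison above concrete, the sketch below recomputes each magnitude type's share of total harm and of harmful events. The figures are copied from the `Harm by Magnitude Type` output; nothing else is assumed.

```python
# (total_harm, harmful_event_count) per MAGNITUDE_TYPE, copied from the output above.
harm_by_magnitude = {
    "NONE": (15906, 3766),
    "EG": (1418, 738),
    "MG": (208, 98),
    "ES": (9, 2),
}

total_harm = sum(h for h, _ in harm_by_magnitude.values())
total_events = sum(n for _, n in harm_by_magnitude.values())

# Share of all harm and share of all harmful events for each magnitude type.
shares = {
    mag: (harm / total_harm, n / total_events)
    for mag, (harm, n) in harm_by_magnitude.items()
}

for mag, (harm_share, event_share) in shares.items():
    print(f"{mag}: {harm_share:.1%} of harm, {event_share:.1%} of harmful events")
```

The totals match the dataset overview (17,541 people across 4,604 events), and NONE holds a larger share of the harm than of the harmful events, which is what drives its higher average.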


## Analysis 2: Human Harm by Location Factors


In [8]:
# Aggregate total harm by STATE
harm_by_state = (
    rdd_harm
    .map(lambda x: (x[0]['STATE'] if x[0]['STATE'] else 'UNKNOWN', x[1]))
    .filter(lambda x: x[0] and x[0] != "")
    .mapValues(lambda v: (v, 1))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    .mapValues(lambda x: (x[0], x[1], x[0] / x[1] if x[1] > 0 else 0))
)

harm_by_state_sorted = harm_by_state.sortBy(lambda x: x[1][0], ascending=False)

print("Top 10 States by Total Human Harm:")
for state, (total_harm, count, avg_harm) in harm_by_state_sorted.take(10):
    print(f"{state}: Total={int(total_harm)}, Count={count}, Avg={avg_harm:.2f}")


Top 10 States by Total Human Harm:



TEXAS: Total=2915, Count=316, Avg=9.22
ARIZONA: Total=2181, Count=709, Avg=3.08
CALIFORNIA: Total=967, Count=303, Avg=3.19
KENTUCKY: Total=967, Count=113, Avg=8.56
MISSOURI: Total=888, Count=172, Avg=5.16
TENNESSEE: Total=803, Count=130, Avg=6.18
MISSISSIPPI: Total=617, Count=110, Avg=5.61
OKLAHOMA: Total=591, Count=95, Avg=6.22
FLORIDA: Total=573, Count=259, Avg=2.21
GEORGIA: Total=435, Count=97, Avg=4.48


                                                                                

## Note About Location

Population and population density per state will likely matter for this kind of analysis. It's possible they skew the statistics so that one state appears to have higher harm simply because it has more people, or a denser population in high-risk regions.

**TODO:** Adjust for these concerns in a future exploration.
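
A minimal sketch of what that adjustment could look like, assuming an external population table is available. `harm_per_100k` is a hypothetical helper, and the state names and numbers are placeholders chosen only to show how a ranking can flip; they are not real harm totals or census figures.

```python
def harm_per_100k(harm_by_state, population_by_state):
    """Return harm per 100,000 residents for states with a known, non-zero population."""
    return {
        state: harm * 100_000 / population_by_state[state]
        for state, harm in harm_by_state.items()
        if population_by_state.get(state)
    }

# Placeholder values, NOT real harm totals or census figures.
harm = {"STATE_A": 300, "STATE_B": 100}
population = {"STATE_A": 3_000_000, "STATE_B": 500_000}

rates = harm_per_100k(harm, population)
ranked = sorted(rates, key=rates.get, reverse=True)

print(rates)   # per-capita rates
print(ranked)  # states ordered by rate, highest first
```

In this toy case STATE_A has three times the raw harm but half the per-capita rate, which is the kind of reversal the note above warns about.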

In [9]:
# Aggregate total harm by STATE and CZ_NAME (County/Zone)
harm_by_cz = (
    rdd_harm
    .map(lambda x: ((
        x[0]['STATE'] if x[0]['STATE'] else 'UNKNOWN',
        x[0]['CZ_NAME'] if x[0]['CZ_NAME'] else 'UNKNOWN'
    ), x[1]))
    .filter(lambda x: x[0][0] and x[0][0] != "" and x[0][1] and x[0][1] != "")
    .mapValues(lambda v: (v, 1))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    .mapValues(lambda x: (x[0], x[1], x[0] / x[1] if x[1] > 0 else 0))
)

harm_by_cz_sorted = harm_by_cz.sortBy(lambda x: x[1][0], ascending=False)

print("\nTop 10 Counties/Zones by Total Human Harm:")
for (state, cz), (total_harm, count, avg_harm) in harm_by_cz_sorted.take(10):
    print(f"{state}, {cz}: Total={int(total_harm)}, Count={count}, Avg={avg_harm:.2f}")



Top 10 Counties/Zones by Total Human Harm:



TEXAS, DALLAS: Total=1427, Count=14, Avg=101.93
ARIZONA, CENTRAL PHOENIX: Total=1027, Count=188, Avg=5.46
MISSOURI, DOUGLAS: Total=352, Count=2, Avg=176.00
TEXAS, DENTON: Total=315, Count=23, Avg=13.70
OKLAHOMA, TULSA: Total=309, Count=23, Avg=13.43
NEVADA, LAS VEGAS VALLEY: Total=271, Count=58, Avg=4.67
KENTUCKY, GRAVES: Total=239, Count=5, Avg=47.80
KENTUCKY, HOPKINS: Total=234, Count=3, Avg=78.00
TENNESSEE, DAVIDSON: Total=226, Count=12, Avg=18.83
ARIZONA, TUCSON METRO AREA: Total=225, Count=84, Avg=2.68


                                                                                

## Analysis 3: Event Frequency by Location (Proxy for Population Density)

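This partially addresses the concerns in Analysis 2 by using the event count per county/zone as a rough stand-in for population. The proxy is imperfect: the top-10 output below includes marine zones (for example, ATLANTIC SOUTH coastal waters), which log many events but have no resident population.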

In [10]:
# Count events per CZ_NAME - more events = likely more populated area
# This will be used as a proxy for population density in later modeling
event_count_by_cz = (
    rdd
    .map(lambda row: ((
        row['STATE'] if row['STATE'] else 'UNKNOWN',
        row['CZ_NAME'] if row['CZ_NAME'] else 'UNKNOWN'
    ), 1))
    .filter(lambda x: x[0][0] and x[0][0] != "" and x[0][1] and x[0][1] != "")
    .reduceByKey(lambda a, b: a + b)
)

event_count_sorted = event_count_by_cz.sortBy(lambda x: x[1], ascending=False)

print("\nTop 10 Counties/Zones by Event Count (Population Proxy):")
for (state, cz), count in event_count_sorted.take(10):
    print(f"{state}, {cz}: {count} events")



Top 10 Counties/Zones by Event Count (Population Proxy):



ILLINOIS, COOK: 797 events
ATLANTIC SOUTH, VOLUSIA-BREVARD COUNTY LINE TO SEBASTIAN INLET 0-20NM: 754 events
PENNSYLVANIA, ALLEGHENY: 753 events
ALABAMA, LAUDERDALE: 714 events
ARIZONA, MARICOPA: 710 events
ALABAMA, COLBERT: 618 events
ATLANTIC NORTH, CHESAPEAKE BAY SANDY PT TO N BEACH MD: 609 events
OKLAHOMA, OKLAHOMA: 579 events
COLORADO, EL PASO: 578 events
TEXAS, TARRANT: 572 events


                                                                                

## Analysis 4: Combined Storm + Location Factors


In [11]:
# Aggregate harm by (EVENT_TYPE, STATE) pairs
# This shows interaction between storm type and location
harm_by_event_state = (
    rdd_harm
    .map(lambda x: ((
        x[0]['EVENT_TYPE'] if x[0]['EVENT_TYPE'] else 'UNKNOWN',
        x[0]['STATE'] if x[0]['STATE'] else 'UNKNOWN'
    ), x[1]))
    .filter(lambda x: x[0][0] and x[0][0] != "" and x[0][1] and x[0][1] != "")
    .mapValues(lambda v: (v, 1))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    .mapValues(lambda x: (x[0], x[1], x[0] / x[1] if x[1] > 0 else 0))
)

harm_by_event_state_sorted = harm_by_event_state.sortBy(lambda x: x[1][0], ascending=False)

print("\nTop 10 (Event Type, State) Combinations by Total Harm:")
for (event_type, state), (total_harm, count, avg_harm) in harm_by_event_state_sorted.take(10):
    print(f"{event_type} in {state}: Total={int(total_harm)}, Count={count}, Avg={avg_harm:.2f}")



Top 10 (Event Type, State) Combinations by Total Harm:



Excessive Heat in TEXAS: Total=1428, Count=29, Avg=49.24
Excessive Heat in ARIZONA: Total=1163, Count=190, Avg=6.12
Heat in ARIZONA: Total=871, Count=454, Avg=1.92
Tornado in KENTUCKY: Total=763, Count=26, Avg=29.35
Tornado in TENNESSEE: Total=627, Count=41, Avg=15.29
Tornado in TEXAS: Total=495, Count=56, Avg=8.84
Tornado in MISSISSIPPI: Total=472, Count=47, Avg=10.04
Heat in TEXAS: Total=406, Count=37, Avg=10.97
Wildfire in CALIFORNIA: Total=364, Count=43, Avg=8.47
Drought in MISSOURI: Total=350, Count=1, Avg=350.00


                                                                                

## Summary & Next Steps

This Task 2 analysis provides:
1. **Baseline aggregations** comparing storm factors (EVENT_TYPE, MAGNITUDE_TYPE) vs. location factors (STATE, CZ_NAME)
2. **Event frequency proxy** for population density (more events = likely more populated)
3. **Combined factor analysis** showing interactions between storm and location factors

**For Future Phases (Random Forest Modeling):**
- Use these aggregations to engineer features
- Join with actual population density data (external source)
- Train Random Forest model with:
  - Storm features: EVENT_TYPE, MAGNITUDE_TYPE, MAGNITUDE
  - Location features: STATE, CZ_NAME, EVENT_COUNT_PER_CZ (population proxy)
  - Interaction features: EVENT_TYPE × STATE, MAGNITUDE × EVENT_COUNT
- Calculate feature importance to determine which factors are strongest predictors:
  - Candidate methods: SHAP values, permutation importance, and Gini (impurity-based) importance
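
Of the importance methods listed, permutation importance is the simplest to illustrate. The sketch below is a pure-Python toy under stated assumptions: the data is synthetic, the "model" is a fixed threshold rule rather than a trained Random Forest, and `permutation_importance` is a hypothetical helper written for this example.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows where the model's prediction equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_features, seed=0):
    """Accuracy drop when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the link between feature j and the label
        X_perm = [row[:j] + (v,) + row[j + 1:] for row, v in zip(X, column)]
        importances.append(baseline - accuracy(model, X_perm, y))
    return importances

# Synthetic data: the label depends only on feature 0; feature 1 is noise.
data_rng = random.Random(42)
X = [(data_rng.random(), data_rng.random()) for _ in range(500)]
y = [int(row[0] > 0.5) for row in X]

# Stand-in "model": a fixed rule that happens to match how the labels were made.
model = lambda row: int(row[0] > 0.5)

imps = permutation_importance(model, X, y, n_features=2)
print(f"feature 0: {imps[0]:.3f}, feature 1: {imps[1]:.3f}")
```

Shuffling the informative feature costs a large share of the accuracy, while shuffling the ignored feature costs nothing. With a heavily imbalanced target, a metric other than accuracy would likely be a better choice for the drop being measured.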

In [12]:
# Stop Spark session
spark.stop()


---
---

# PART 2: PREDICTIVE MODELING WITH RANDOM FOREST

## Transition from EDA to Machine Learning

In PART 1, I used **RDD operations** to explore patterns:
- Tornadoes and Heat cause most harm
- Texas and Arizona most affected
- Events without magnitude data often more dangerous

**These findings were DESCRIPTIVE** - they told us what happened.

Now in PART 2, I use **Random Forest** to make this PREDICTIVE:

### Why Random Forest?

1. **Feature Importance** - Directly answers which factors are strongest PREDICTORS
2. **Handles Mixed Features** - Categorical (EVENT_TYPE, STATE) + Numeric (MAGNITUDE)
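3. **Class Imbalance** - 98.8% of events have no harm. Random Forest does not correct for this on its own, so accuracy alone is misleading (always predicting "no harm" already scores about 98.8%); ranking metrics such as AUC, and possibly class weighting, are needed to judge and train the model fairly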
4. **Non-Linear** - Captures interactions (e.g., Tornado in Texas vs elsewhere)
5. **Interpretable** - Clear importance scores for stakeholders
6. **Industry Standard** - Proven for risk prediction

**Goal:** Transform EDA insights into quantified predictive importance.
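
One way to see how much the 98.8% imbalance matters is to compute the accuracy of a trivial majority-class predictor, along with the "balanced" class weights that would equalize the two classes. This is a sketch using the row count and harmful-event count reported in PART 1; the balanced-weight formula is a common heuristic, not something this notebook applies.

```python
# Counts reported in PART 1: 371,544 rows, of which 4,604 have harm > 0.
total_events = 371_544
harmful_events = 4_604
counts = {0: total_events - harmful_events, 1: harmful_events}

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(counts.values()) / total_events

# "Balanced" weights: total / (n_classes * class_count), so that each class
# contributes the same total weight to training.
n_classes = len(counts)
class_weights = {label: total_events / (n_classes * n) for label, n in counts.items()}

print(f"Majority-class accuracy: {baseline_accuracy:.4f}")
for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Any reported accuracy should be read against that baseline. In Spark, weights like these could be attached as a column and supplied through the `weightCol` parameter of `RandomForestClassifier` in recent Spark versions; whether that improves this model is untested here.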

---

## 1. Setup and Data Loading


In [13]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum, count, when, coalesce, lit
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
import time
import json


In [14]:
# Create SparkSession
spark = SparkSession.builder \
    .appName("Task3_HarmPrediction") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("WARN")

print(f"Spark version: {spark.version}")
print(f"Spark UI: {spark.sparkContext.uiWebUrl}")


Spark version: 4.0.1
Spark UI: http://671658836a76:4040


25/12/03 06:09:41 ERROR SparkContext: Failed to add file:/opt/spark/jars/gcs-connector-hadoop3-2.2.11.jar to Spark environment
java.io.FileNotFoundException: Jar /opt/spark/jars/gcs-connector-hadoop3-2.2.11.jar not found

In [15]:
# Load data
csv_path = "/home/sparkdev/app/Task2/storm_g2020.csv"
# TODO: Change to GCS for final run
# csv_path = "gs://msds-694-cohort-14-group12/storm_data.csv"

df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .option("multiLine", "true") \
    .csv(csv_path)

print(f"✓ Loaded {df.count():,} rows with {len(df.columns)} columns")



✓ Loaded 371,544 rows with 51 columns



## 2. Feature Engineering


In [16]:
# Create target variable
df = df.withColumn('TOTAL_HARM',
    coalesce(col('INJURIES_DIRECT'), lit(0)) +
    coalesce(col('INJURIES_INDIRECT'), lit(0)) +
    coalesce(col('DEATHS_DIRECT'), lit(0)) +
    coalesce(col('DEATHS_INDIRECT'), lit(0))
)

df = df.withColumn('has_harm', when(col('TOTAL_HARM') > 0, 1).otherwise(0))

# NOTE: Even printing statistics on the full dataset before splitting exposes
# test-set information, so this step may need to move after the split.
# It may be unnecessary if only the train and validation sets are used locally
# and the test set is the unseen data in the cloud. In that case, I did not have
# time to make the local data disjoint from the cloud data.
print("Class distribution:")
df.groupBy('has_harm').count().orderBy('has_harm').show()


Class distribution:



+--------+------+
|has_harm| count|
+--------+------+
|       0|366940|
|       1|  4604|
+--------+------+



                                                                                

In [17]:
# Create event count per county (population proxy)
event_counts = df.groupBy('STATE', 'CZ_NAME') \
    .agg(count('*').alias('EVENT_COUNT_PER_CZ'))

df = df.join(event_counts, on=['STATE', 'CZ_NAME'], how='left')
df = df.withColumn('EVENT_COUNT_PER_CZ', coalesce(col('EVENT_COUNT_PER_CZ'), lit(1)))

# Handle missing values
df = df.withColumn('MAGNITUDE', coalesce(col('MAGNITUDE'), lit(0.0)))
df = df.withColumn('MAGNITUDE_TYPE', coalesce(col('MAGNITUDE_TYPE'), lit('NONE')))

# Filter invalid rows
df = df.filter(
    col('EVENT_TYPE').isNotNull() &
    col('STATE').isNotNull()
)

print(f"✓ Clean dataset: {df.count():,} rows")



✓ Clean dataset: 371,544 rows


                                                                                

## 3. Train/Validation/Test Split


In [18]:
# Select modeling columns: five features plus the has_harm label
feature_cols = ['EVENT_TYPE', 'STATE', 'MAGNITUDE_TYPE', 'MAGNITUDE', 'EVENT_COUNT_PER_CZ', 'has_harm']
df_model = df.select(feature_cols)

# Split: 70% train, 15% validation, 15% test (NO caching - simpler & avoids warnings)
seed = 42
train_df, temp_df = df_model.randomSplit([0.7, 0.3], seed=seed)
val_df, test_df = temp_df.randomSplit([0.5, 0.5], seed=seed)

print(f"Train set: {train_df.count():,}")
print(f"Validation set: {val_df.count():,}")
print(f"Test set (SACRED): {test_df.count():,}")



Train set: 260,518



Validation set: 55,665



Test set (SACRED): 55,361
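
`randomSplit` assigns rows by random sampling, so the resulting sizes only approximate the requested weights. The short check below, using the counts printed above, compares the realized proportions with the 70/15/15 target implied by splitting 0.7/0.3 and then halving the 0.3 remainder.

```python
# Split sizes copied from the output above.
sizes = {"train": 260_518, "validation": 55_665, "test": 55_361}
total = sum(sizes.values())

# Target fractions: 0.7 for train, then the remaining 0.3 split in half.
expected = {"train": 0.7, "validation": 0.3 * 0.5, "test": 0.3 * 0.5}

realized = {name: n / total for name, n in sizes.items()}

for name in sizes:
    print(f"{name}: {realized[name]:.4f} (target {expected[name]:.2f})")
```

The three sets sum to the full 371,544 rows, and each realized fraction is within a fraction of a percentage point of its target.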


## 4. Build ML Pipeline & Train Model


In [19]:
# Build pipeline
indexers = [
    StringIndexer(inputCol='EVENT_TYPE', outputCol='EVENT_TYPE_idx', handleInvalid='keep'),
    StringIndexer(inputCol='STATE', outputCol='STATE_idx', handleInvalid='keep'),
    StringIndexer(inputCol='MAGNITUDE_TYPE', outputCol='MAGNITUDE_TYPE_idx', handleInvalid='keep')
]

encoders = [
    OneHotEncoder(inputCol='EVENT_TYPE_idx', outputCol='EVENT_TYPE_vec'),
    OneHotEncoder(inputCol='STATE_idx', outputCol='STATE_vec'),
    OneHotEncoder(inputCol='MAGNITUDE_TYPE_idx', outputCol='MAGNITUDE_TYPE_vec')
]

assembler = VectorAssembler(
    inputCols=['EVENT_TYPE_vec', 'STATE_vec', 'MAGNITUDE_TYPE_vec', 'MAGNITUDE', 'EVENT_COUNT_PER_CZ'],
    outputCol='features',
    handleInvalid='keep'
)

# Random Forest
rf = RandomForestClassifier(
    labelCol='has_harm',
    featuresCol='features',
    numTrees=100,
    maxDepth=10,
    seed=seed
)

pipeline = Pipeline(stages=indexers + encoders + [assembler, rf])

print("✓ Pipeline built with 3 indexers, 3 encoders, 1 assembler, 1 RF classifier")


✓ Pipeline built with 3 indexers, 3 encoders, 1 assembler, 1 RF classifier
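
Because the pipeline one-hot encodes the categorical columns, any importance vector reported over the assembled `features` column has one entry per vector slot rather than one per original column. To compare EVENT_TYPE against STATE as a whole, the per-slot values need to be summed by group. The sketch below shows that bookkeeping in plain Python; `group_importances` is a hypothetical helper, and the importance values and slot counts are invented for illustration (real slot counts depend on the number of distinct categories seen during fitting).

```python
def group_importances(importances, slot_sizes):
    """Sum per-slot importances back into one score per original column.

    slot_sizes is an ordered list of (column_name, number_of_vector_slots),
    in the same order as the VectorAssembler inputs.
    """
    grouped, start = {}, 0
    for name, size in slot_sizes:
        grouped[name] = sum(importances[start:start + size])
        start += size
    if start != len(importances):
        raise ValueError("slot sizes do not cover the importance vector")
    return grouped

# Invented numbers: 3 one-hot slots, 2 one-hot slots, then 2 numeric columns.
importances = [0.20, 0.10, 0.05, 0.15, 0.05, 0.25, 0.20]
slot_sizes = [
    ("EVENT_TYPE", 3),
    ("STATE", 2),
    ("MAGNITUDE", 1),
    ("EVENT_COUNT_PER_CZ", 1),
]

grouped = group_importances(importances, slot_sizes)
for name, score in sorted(grouped.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```

Grouping this way keeps a column with many categories from looking unimportant just because its importance is spread thinly across many slots.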


In [20]:
# Train model (SIMPLIFIED - removed caching to avoid issues)
print("=" * 60)
print("TRAINING RANDOM FOREST MODEL")
print("=" * 60)
print("Training on 260K rows...")
print("(This may take 2-5 minutes...)")
print()

start_time = time.time()
model = pipeline.fit(train_df)
train_time = time.time() - start_time

print(f"✓ Training completed in {train_time:.2f}s")
print()

# Evaluate on validation set
print("Evaluating model on validation set...")
val_pred = model.transform(val_df)

evaluator_auc = BinaryClassificationEvaluator(labelCol='has_harm', metricName='areaUnderROC')
evaluator_acc = MulticlassClassificationEvaluator(labelCol='has_harm', metricName='accuracy')

val_auc = evaluator_auc.evaluate(val_pred)
val_acc = evaluator_acc.evaluate(val_pred)

print(f"\nValidation AUC: {val_auc:.4f}")
print(f"Validation Accuracy: {val_acc:.4f}")
print("=" * 60)


TRAINING RANDOM FOREST MODEL
Training on 260K rows...
(This may take 2-5 minutes...)




25/12/03 06:09:56 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


25/12/03 06:10:12 WARN DAGScheduler: Broadcasting large task binary with size 1366.6 KiB





25/12/03 06:10:13 WARN DAGScheduler: Broadcasting large task binary with size 1674.3 KiB



✓ Training completed in 27.24s

Evaluating model on validation set...




Validation AUC: 0.8085
Validation Accuracy: 0.9889
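
`areaUnderROC` measures ranking quality rather than thresholded correctness: it equals the probability that a randomly chosen harmful event receives a higher score than a randomly chosen harmless one, with ties counting half. The pure-Python sketch below illustrates that definition on made-up scores. The helper `roc_auc` and the toy data are hypothetical, and this is not how Spark computes the metric internally.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count half
    return wins / (len(pos) * len(neg))

# Toy data (made up): 2 harmful events among 8, every score below 0.5.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.01, 0.02, 0.02, 0.03, 0.05, 0.20, 0.04, 0.30]

auc = roc_auc(labels, scores)
# Accuracy at a 0.5 threshold: every row is predicted "no harm".
accuracy = sum((s >= 0.5) == bool(y) for y, s in zip(labels, scores)) / len(labels)
print(f"AUC = {auc:.3f}, accuracy at 0.5 threshold = {accuracy:.3f}")
```

In the toy data no score crosses 0.5, so accuracy simply reflects the share of harmless rows, while the AUC still shows how well the scores separate the two classes.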




## 5. SACRED Test Set Evaluation


In [21]:
print("="*60)
print("EVALUATING ON SACRED TEST SET (FIRST & ONLY TIME)")
print("="*60)

test_pred = model.transform(test_df)

test_auc = evaluator_auc.evaluate(test_pred)
test_acc = evaluator_acc.evaluate(test_pred)

# TODO: revise the wording of these log messages
print(f"\nTest AUC: {test_auc:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")

print("\nConfusion Matrix:")
test_pred.groupBy('has_harm', 'prediction').count().orderBy('has_harm', 'prediction').show()


EVALUATING ON SACRED TEST SET (FIRST & ONLY TIME)




Test AUC: 0.8313
Test Accuracy: 0.9880

Confusion Matrix:



+--------+----------+-----+
|has_harm|prediction|count|
+--------+----------+-----+
|       0|       0.0|54636|
|       0|       1.0|    6|
|       1|       0.0|  660|
|       1|       1.0|   59|
+--------+----------+-----+
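
With only 719 harmful events among 55,361 test rows, accuracy alone says little: predicting "no harm" for every row would already score about 98.7%. A minimal sketch, using only the counts from the confusion matrix above (the variable names are illustrative and not part of the notebook), derives precision, recall, and F1 for the harm class:

```python
# Counts copied from the confusion matrix printed above.
tn, fp, fn, tp = 54636, 6, 660, 59

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
baseline_accuracy = (tn + fp) / total      # always predicting "no harm"
precision = tp / (tp + fp)                 # how often a harm prediction is right
recall = tp / (tp + fn)                    # share of harmful events that are caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:          {accuracy:.4f}")
print(f"majority baseline: {baseline_accuracy:.4f}")
print(f"precision (harm):  {precision:.4f}")
print(f"recall (harm):     {recall:.4f}")
print(f"F1 (harm):         {f1:.4f}")
```

Recall works out to roughly 8%: the model rarely flags harm, is usually right when it does, and misses most harmful events. For this dataset the AUC and the per-class metrics are therefore more informative than accuracy.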



## 6. Feature Importance (Answer Research Question)


In [22]:
# Extract feature importance with detailed breakdown
print("Extracting feature importance for stakeholder interpretation...")
print("="*80)

rf_model = model.stages[-1]
feature_importance = rf_model.featureImportances.toArray()

# Get indexers to map back to original categories
event_type_indexer = model.stages[0]
state_indexer = model.stages[1]
mag_type_indexer = model.stages[2]

# Get encoders for dimensions
event_type_encoder = model.stages[3]
state_encoder = model.stages[4]
mag_type_encoder = model.stages[5]

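# Get number of categories from indexers, then calc one-hot widths as n-1.
# This assumes OneHotEncoder's default dropLast=True and no extra "unseen label"
# bucket from the indexers. The three widths plus the 2 numeric features must
# equal len(feature_importance); otherwise every slice below is shifted.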
n_event_cats = len(event_type_indexer.labels)
n_state_cats = len(state_indexer.labels)
n_mag_cats = len(mag_type_indexer.labels)

n_event = max(1, n_event_cats - 1)
n_state = max(1, n_state_cats - 1)
n_mag = max(1, n_mag_cats - 1)

print(f"\nFeature vector breakdown:")
print(f"  EVENT_TYPE one-hot:      indices 0-{n_event-1} ({n_event_cats} categories → {n_event} features)")
print(f"  STATE one-hot:           indices {n_event}-{n_event+n_state-1} ({n_state_cats} categories → {n_state} features)")
print(f"  MAGNITUDE_TYPE one-hot:  indices {n_event+n_state}-{n_event+n_state+n_mag-1} ({n_mag_cats} categories → {n_mag} features)")
print(f"  MAGNITUDE (numeric):     index {n_event+n_state+n_mag}")
print(f"  EVENT_COUNT_PER_CZ:      index {n_event+n_state+n_mag+1}")
print(f"  TOTAL FEATURES:          {len(feature_importance)}")
print()

# Aggregate importances by feature group
event_type_imp = sum(feature_importance[:n_event])
state_imp = sum(feature_importance[n_event:n_event+n_state])
mag_type_imp = sum(feature_importance[n_event+n_state:n_event+n_state+n_mag])
magnitude_imp = feature_importance[n_event+n_state+n_mag] if len(feature_importance) > n_event+n_state+n_mag else 0
event_count_imp = feature_importance[n_event+n_state+n_mag+1] if len(feature_importance) > n_event+n_state+n_mag+1 else 0

# DETAILED BREAKDOWN FOR STAKEHOLDERS
print("="*80)
print("DETAILED FEATURE IMPORTANCE BREAKDOWN (For Stakeholder Presentation)")
print("="*80)
print()
print("INDIVIDUAL FEATURE GROUP IMPORTANCE:")
print("-" * 80)
print(f"1. EVENT_TYPE (storm type):           {event_type_imp:.6f} ({event_type_imp/sum(feature_importance)*100:.2f}%)")
print(f"2. STATE (location):                  {state_imp:.6f} ({state_imp/sum(feature_importance)*100:.2f}%)")
print(f"3. MAGNITUDE_TYPE (wind measurement): {mag_type_imp:.6f} ({mag_type_imp/sum(feature_importance)*100:.2f}%)")
print(f"4. MAGNITUDE (wind speed):            {magnitude_imp:.6f} ({magnitude_imp/sum(feature_importance)*100:.2f}%)")
print(f"5. EVENT_COUNT_PER_CZ (pop. proxy):   {event_count_imp:.6f} ({event_count_imp/sum(feature_importance)*100:.2f}%)")
print()

# Calculate storm vs location groupings
storm_imp = event_type_imp + mag_type_imp + magnitude_imp
location_imp = state_imp + event_count_imp

print("="*80)
print("HIGH-LEVEL SUMMARY (Storm vs Location)")
print("="*80)
print()
print("STORM-RELATED FACTORS (what type of weather event):")
print(f"  - EVENT_TYPE:      {event_type_imp:.6f} ({event_type_imp/sum(feature_importance)*100:.2f}%)")
print(f"  - MAGNITUDE_TYPE:  {mag_type_imp:.6f} ({mag_type_imp/sum(feature_importance)*100:.2f}%)")
print(f"  - MAGNITUDE:       {magnitude_imp:.6f} ({magnitude_imp/sum(feature_importance)*100:.2f}%)")
print(f"  TOTAL STORM:       {storm_imp:.6f} ({storm_imp/sum(feature_importance)*100:.2f}%)")
print()
print("LOCATION-RELATED FACTORS (where the event occurs):")
print(f"  - STATE:           {state_imp:.6f} ({state_imp/sum(feature_importance)*100:.2f}%)")
print(f"  - EVENT_COUNT:     {event_count_imp:.6f} ({event_count_imp/sum(feature_importance)*100:.2f}%)")
print(f"  TOTAL LOCATION:    {location_imp:.6f} ({location_imp/sum(feature_importance)*100:.2f}%)")
print()

print("="*80)
print("RESEARCH QUESTION ANSWER")
print("="*80)
# Shares and gap are the same in both branches, so compute them once
total_imp = sum(feature_importance)
storm_pct = storm_imp / total_imp * 100
location_pct = location_imp / total_imp * 100
diff_pct = abs(storm_imp - location_imp) / total_imp * 100

# Dynamic interpretation based on magnitude of difference
if diff_pct < 10:
    intensity = "slightly"
elif diff_pct < 30:
    intensity = "moderately"
else:
    intensity = "significantly"

if storm_imp > location_imp:
    print("✓ STORM factors are MORE predictive of human harm")
    print(f"  Storm factors: {storm_pct:.1f}%")
    print(f"  Location factors: {location_pct:.1f}%")
    print(f"  Difference: {diff_pct:.1f} percentage points")
    print()
    print("Interpretation: The TYPE of weather event (tornado, flood, heat, etc.)")
    print(f"is {intensity} more important ({storm_pct:.1f}% vs {location_pct:.1f}%) than WHERE it occurs when predicting harm.")
else:
    print("✓ LOCATION factors are MORE predictive of human harm")
    print(f"  Location factors: {location_pct:.1f}%")
    print(f"  Storm factors: {storm_pct:.1f}%")
    print(f"  Difference: {diff_pct:.1f} percentage points")
    print()
    print("Interpretation: WHERE a weather event occurs (which state, population)")
    print(f"is {intensity} more important ({location_pct:.1f}% vs {storm_pct:.1f}%) than the TYPE of event when predicting harm.")
print("="*80)


Extracting feature importance for stakeholder interpretation...

Feature vector breakdown:
  EVENT_TYPE one-hot:      indices 0-53 (55 categories → 54 features)
  STATE one-hot:           indices 54-121 (69 categories → 68 features)
  MAGNITUDE_TYPE one-hot:  indices 122-125 (5 categories → 4 features)
  MAGNITUDE (numeric):     index 126
  EVENT_COUNT_PER_CZ:      index 127
  TOTAL FEATURES:          131

DETAILED FEATURE IMPORTANCE BREAKDOWN (For Stakeholder Presentation)

INDIVIDUAL FEATURE GROUP IMPORTANCE:
--------------------------------------------------------------------------------
1. EVENT_TYPE (storm type):           0.710642 (71.06%)
2. STATE (location):                  0.227856 (22.79%)
3. MAGNITUDE_TYPE (wind measurement): 0.010218 (1.02%)
4. MAGNITUDE (wind speed):            0.004380 (0.44%)
5. EVENT_COUNT_PER_CZ (pop. proxy):   0.000001 (0.00%)

HIGH-LEVEL SUMMARY (Storm vs Location)

STORM-RELATED FACTORS (what type of weather event):
  - EVENT_TYPE:      0.710642 (7
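
The slice arithmetic in the cell above is only valid if the group widths add up to the length of the importance vector. In this run they do not: 54 + 68 + 4 one-hot slots plus 2 numeric features is 128, while the model reports 131 features. One possible explanation (an assumption, since the indexer settings are not shown in this section) is that each one-hot block is as wide as its label count, for example because the indexers reserve an extra bucket for unseen labels. The sketch below uses a hypothetical helper, `group_slices`, to build the slices from explicit widths and to fail loudly on a mismatch.

```python
def group_slices(widths, total_features):
    """Map each feature group to its (start, stop) range in the assembled vector.

    `widths` maps group name -> number of vector slots, listed in
    VectorAssembler input order. Raises if the widths do not cover the vector.
    """
    covered = sum(widths.values())
    if covered != total_features:
        raise ValueError(
            f"group widths cover {covered} slots but the vector has {total_features}"
        )
    slices, start = {}, 0
    for name, width in widths.items():
        slices[name] = (start, start + width)
        start += width
    return slices

# Widths as computed in the cell above (n - 1 per categorical column).
assumed = {"EVENT_TYPE": 54, "STATE": 68, "MAGNITUDE_TYPE": 4,
           "MAGNITUDE": 1, "EVENT_COUNT_PER_CZ": 1}
try:
    group_slices(assumed, total_features=131)
except ValueError as err:
    print("mismatch:", err)

# Hypothetical alternative: one slot per label (55, 69, 5), which does sum to 131.
alternative = dict(assumed, EVENT_TYPE=55, STATE=69, MAGNITUDE_TYPE=5)
for name, (start, stop) in group_slices(alternative, total_features=131).items():
    print(f"{name:20s} indices {start}-{stop - 1}")
```

Under that alternative layout `MAGNITUDE` would sit at index 129 and `EVENT_COUNT_PER_CZ` at index 130 rather than 126 and 127, so the per-group percentages are worth re-deriving once the actual widths are confirmed, for instance from the `ml_attr` metadata on the `features` column.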

### 6.1 Top Contributing Event Types and States (Actionable Insights)


In [23]:
# Show TOP contributing event types and states for stakeholders
print("="*80)
print("TOP CONTRIBUTING FACTORS (Most Actionable for Stakeholders)")
print("="*80)
print()

# Get the labels from indexers
event_type_labels = event_type_indexer.labels
state_labels = state_indexer.labels

# Get top EVENT_TYPES by importance
event_type_importances = feature_importance[:n_event]
top_event_indices = sorted(range(len(event_type_importances)), 
                          key=lambda i: event_type_importances[i], 
                          reverse=True)[:10]

print("TOP 10 EVENT TYPES (Storm Types) for Predicting Harm:")
print("-" * 80)
for rank, idx in enumerate(top_event_indices, 1):
    if idx < len(event_type_labels):
        event_name = event_type_labels[idx]
        importance = event_type_importances[idx]
        pct = (importance / sum(feature_importance)) * 100
        print(f"{rank:2d}. {event_name:30s} - {importance:.6f} ({pct:.2f}%)")
print()

# Get top STATES by importance
state_importances = feature_importance[n_event:n_event+n_state]
top_state_indices = sorted(range(len(state_importances)), 
                          key=lambda i: state_importances[i], 
                          reverse=True)[:10]

print("TOP 10 STATES (Locations) for Predicting Harm:")
print("-" * 80)
for rank, idx in enumerate(top_state_indices, 1):
    if idx < len(state_labels):
        state_name = state_labels[idx]
        importance = state_importances[idx]
        pct = (importance / sum(feature_importance)) * 100
        print(f"{rank:2d}. {state_name:30s} - {importance:.6f} ({pct:.2f}%)")
print()

print("="*80)
print("STAKEHOLDER RECOMMENDATIONS:")
print("="*80)
print("Based on feature importance analysis:")
print()

# Dynamic recommendations based on actual importance values
storm_pct = storm_imp/sum(feature_importance)*100
location_pct = location_imp/sum(feature_importance)*100
diff = abs(storm_pct - location_pct)

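# Name the leading categories from the rankings above instead of hard-coding them
top_event_names = ", ".join(event_type_labels[i] for i in top_event_indices[:3]
                            if i < len(event_type_labels))
top_state_names = ", ".join(state_labels[i].title() for i in top_state_indices[:2]
                            if i < len(state_labels))

if storm_imp > location_imp:
    print(f"1. PRIORITY: FOCUS on EVENT TYPE (storm characteristics) - {storm_pct:.1f}% importance")
    print(f"   - Train responders for top event types ({top_event_names}, etc.)")
    print("   - Develop event-specific response protocols")
    print("   - Stock emergency supplies tailored to these specific events")
    print()
    print(f"2. SECONDARY: Target high-risk locations - {location_pct:.1f}% importance")
    print(f"   - Allocate resources to top states ({top_state_names}, etc.)")
    print("   - Enhance warning systems in vulnerable regions")
    print()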
else:
    print(f"1. PRIORITY: FOCUS on LOCATION (where events occur) - {location_pct:.1f}% importance")
    print(f"   - Allocate resources to top states listed above")
    print(f"   - Enhance regional warning systems")
    print()
    print(f"2. SECONDARY: Event type awareness - {storm_pct:.1f}% importance")
    print(f"   - Train for top event types")
    print(f"   - Event-specific protocols")
    print()

if diff < 10:
    print(f"3. NOTE: Storm ({storm_pct:.1f}%) and location ({location_pct:.1f}%) factors are close")
    print(f"   - Difference is only {diff:.1f} percentage points")
    print(f"   - Best strategy: Consider BOTH event type AND location together")
else:
    print(f"3. NOTE: Clear dominant factor ({max(storm_pct, location_pct):.1f}% vs {min(storm_pct, location_pct):.1f}%)")
    print(f"   - Focus resources on the dominant factor")
    print(f"   - Secondary factor still relevant but less critical")

print("="*80)


TOP CONTRIBUTING FACTORS (Most Actionable for Stakeholders)

TOP 10 EVENT TYPES (Storm Types) for Predicting Harm:
--------------------------------------------------------------------------------
 1. Rip Current                    - 0.437450 (43.75%)
 2. Heat                           - 0.096486 (9.65%)
 3. Avalanche                      - 0.068776 (6.88%)
 4. Lightning                      - 0.048487 (4.85%)
 5. Tornado                        - 0.015374 (1.54%)
 6. Wildfire                       - 0.005993 (0.60%)
 7. Hurricane (Typhoon)            - 0.005545 (0.55%)
 8. Hail                           - 0.005440 (0.54%)
 9. Marine Strong Wind             - 0.003445 (0.34%)
10. Excessive Heat                 - 0.003432 (0.34%)

TOP 10 STATES (Locations) for Predicting Harm:
--------------------------------------------------------------------------------
 1. WYOMING                        - 0.169631 (16.96%)
 2. UTAH                           - 0.011579 (1.16%)
 3. WISCONSIN            

## 7. Save Results


In [24]:
# Save results
results = {
    'test_auc': float(test_auc),
    'test_accuracy': float(test_acc),
    'val_auc': float(val_auc),
    'val_accuracy': float(val_acc),
    'storm_importance': float(storm_imp),
    'location_importance': float(location_imp),
    'event_type_importance': float(event_type_imp),
    'state_importance': float(state_imp),
    'magnitude_importance': float(magnitude_imp),
    'event_count_importance': float(event_count_imp)
}

with open('/home/sparkdev/app/Task3_results.json', 'w') as f:
    json.dump(results, f, indent=2)

print("✓ Results saved to Task3_results.json")
print("\nSummary:")
print(json.dumps(results, indent=2))


✓ Results saved to Task3_results.json

Summary:
{
  "test_auc": 0.8312743629681817,
  "test_accuracy": 0.9879698704864436,
  "val_auc": 0.8084900646485705,
  "val_accuracy": 0.9888799065840295,
  "storm_importance": 0.7252400577656135,
  "location_importance": 0.22785755121787488,
  "event_type_importance": 0.7106420203071844,
  "state_importance": 0.22785642733593744,
  "magnitude_importance": 0.00438001568949174,
  "event_count_importance": 1.1238819374258197e-06
}
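
The saved summary omits `MAGNITUDE_TYPE`, but because `storm_importance` was computed as the sum of three groups, its value can be recovered by subtraction. A small sketch using the numbers printed above (the variable names are illustrative) also adds up the two top-level totals:

```python
import json

# Values copied from the summary printed above.
results = json.loads("""
{
  "storm_importance": 0.7252400577656135,
  "location_importance": 0.22785755121787488,
  "event_type_importance": 0.7106420203071844,
  "state_importance": 0.22785642733593744,
  "magnitude_importance": 0.00438001568949174,
  "event_count_importance": 1.1238819374258197e-06
}
""")

# storm = event_type + magnitude_type + magnitude, so solve for magnitude_type.
magnitude_type_importance = (
    results["storm_importance"]
    - results["event_type_importance"]
    - results["magnitude_importance"]
)
accounted = results["storm_importance"] + results["location_importance"]

print(f"magnitude_type_importance (recovered): {magnitude_type_importance:.6f}")
print(f"storm + location: {accounted:.4f}")
print(f"unassigned share: {1 - accounted:.4f}")
```

The two totals cover about 0.953 of the importance. If the importance vector is normalized to sum to 1, as Spark documents for its tree ensembles, the remaining share of roughly 0.047 is not attributed to either group. Saving `magnitude_type_importance` alongside the other keys would make the file self-contained.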


In [25]:
# Stop Spark
spark.stop()
print("✓ Complete!")


✓ Complete!
