## 🌦 NOAA Daily Weather Dataset Description

The NOAA (National Oceanic and Atmospheric Administration) daily weather dataset used here contains historical observations from seven weather stations: six in California and one in Mexico (`SAN BERNARDINO LAGUNAS, MX`). Observations are recorded daily from 1992 through 2020 and cover a wide range of meteorological variables.

### 🧾 Key Features:
| Column     | Description |
|------------|-------------|
| `STATION`  | Unique station ID from NOAA’s GHCN database |
| `NAME`     | Name of the weather station |
| `LATITUDE` | Latitude of the weather station |
| `LONGITUDE`| Longitude of the weather station |
| `ELEVATION`| Elevation of the station (meters) |
| `DATE`     | Date of the observation |
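| `AWND`     | Average daily wind speed (mph) |
| `PGTM`     | Peak gust time (HHMM) |
| `PRCP`     | Precipitation (inches) |
| `TMAX`     | Maximum daily temperature (°F) |
| `TMIN`     | Minimum daily temperature (°F) |
| `WDF2`     | Direction of fastest 2-minute wind (degrees) |
| `WDF5`     | Direction of fastest 5-second wind (degrees) |
| `WSF2`     | Speed of fastest 2-minute wind (mph) |
| `WSF5`     | Speed of fastest 5-second wind (mph) |

**Note on units:** raw GHCN-Daily files store these variables in tenths of metric units (tenths of °C, tenths of mm, tenths of m/s). The sample rows shown later (for example, `TMAX` = 59.0 and `PRCP` = 0.07 at Fresno in January) indicate that this export uses standard units instead: °F, inches, and mph.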

---

## 🔥 Importance of NOAA Weather Data in Predicting Wildfire Cause

Weather conditions are among the most critical environmental factors influencing wildfire **ignition**, **spread**, and **severity**. Including NOAA weather data lets us place each wildfire in the context of the weather around its ignition, which may correlate with its cause.

### ✅ Why it Matters:
- **High wind speeds** accelerate fire spread whatever the cause, and can also contribute to ignition (for example, through downed power lines).
- **High temperatures** and **low precipitation** dry out vegetation, making it more flammable.
- **Wind direction and speed** help assess whether fire spread was driven by weather conditions.

By aligning wildfire ignition records with historical weather conditions, we can improve the accuracy of classification models that predict the **cause of fire** (e.g., human vs. natural causes).

---

## 🎯 Purpose of Using NOAA Dataset

1. **Feature Engineering:** Create environment-based predictors (e.g., max temp, wind speed, rainfall levels).
2. **Modeling Fire Cause:** Use weather context to classify wildfire causes using supervised learning.
3. **Spatial-Temporal Enrichment:** Merge weather and fire records by location/date to model ignition conditions.
4. **Understanding Correlation:** Analyze how environmental variables contribute to specific fire causes.
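
The spatial-temporal enrichment step can be sketched as follows. This is only an illustration under stated assumptions, not the merge used in this project: the fire table and its columns (`fire_id`, `lat`, `lon`, `ignition_date`) are hypothetical, the toy stations and values are made up, and each fire is simply matched to its nearest station by great-circle distance.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

def attach_nearest_weather(fires, weather):
    """Attach same-day weather from the closest station to each fire record."""
    stations = weather.drop_duplicates("STATION")[["STATION", "LATITUDE", "LONGITUDE"]]
    # Distance matrix: one row per fire, one column per station.
    dist = haversine_km(
        fires["lat"].to_numpy()[:, None], fires["lon"].to_numpy()[:, None],
        stations["LATITUDE"].to_numpy()[None, :], stations["LONGITUDE"].to_numpy()[None, :],
    )
    out = fires.copy()
    out["STATION"] = stations["STATION"].to_numpy()[dist.argmin(axis=1)]
    out["station_km"] = dist.min(axis=1)
    # Align the weather date with the (hypothetical) fire ignition date column.
    weather_keyed = weather.rename(columns={"DATE": "ignition_date"})
    return out.merge(weather_keyed, how="left", on=["STATION", "ignition_date"])

# Toy data: station IDs, coordinates, and values are made up for illustration.
toy_weather = pd.DataFrame({
    "STATION": ["A", "B"],
    "LATITUDE": [36.78, 33.94],
    "LONGITUDE": [-119.72, -118.40],
    "DATE": pd.to_datetime(["2020-08-01", "2020-08-01"]),
    "TMAX": [100.0, 80.0],
})
toy_fires = pd.DataFrame({
    "fire_id": [1],
    "lat": [36.50],
    "lon": [-119.50],
    "ignition_date": pd.to_datetime(["2020-08-01"]),
})
merged = attach_nearest_weather(toy_fires, toy_weather)
print(merged[["fire_id", "STATION", "station_km", "TMAX"]])
```

The `station_km` column makes it possible to drop or down-weight fires that are far from every station, since weather measured at a distant station may not describe conditions at the ignition point.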

---
📚 **Source:**
This dataset is obtained from NOAA’s **Global Historical Climatology Network (GHCN)**. You can access the data [here](https://www.ncdc.noaa.gov/cdo-web/).

> 🔄 The NOAA dataset enhances the temporal and environmental granularity of wildfire data, leading to more **explainable**, **reliable**, and **data-driven** predictions of fire causes.


# Data Exploration: Handling Missing Values

## 1. Identifying Missing Values

In this step, we load the dataset and count the missing values in each column.


In [1]:
import pandas as pd

# Define the file path
file_path = r"C:\Users\annis\OneDrive\Desktop\California Wildfire\Data Files\3996332.csv"

# Load the CSV file into a DataFrame
weather_df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame to confirm it's loaded correctly
print(weather_df.head())



       STATION                                  NAME  LATITUDE  LONGITUDE  \
0  USW00093193  FRESNO YOSEMITE INTERNATIONAL, CA US  36.77999 -119.72016   
1  USW00093193  FRESNO YOSEMITE INTERNATIONAL, CA US  36.77999 -119.72016   
2  USW00093193  FRESNO YOSEMITE INTERNATIONAL, CA US  36.77999 -119.72016   
3  USW00093193  FRESNO YOSEMITE INTERNATIONAL, CA US  36.77999 -119.72016   
4  USW00093193  FRESNO YOSEMITE INTERNATIONAL, CA US  36.77999 -119.72016   

   ELEVATION      DATE   ACMH   ACSH  AWND  DAPR  ...  WT11  WT13  WT14  WT16  \
0      101.9  1/1/1992   40.0    0.0  3.80   NaN  ...   NaN   NaN   NaN   NaN   
1      101.9  1/2/1992  100.0  100.0  4.70   NaN  ...   NaN   NaN   NaN   NaN   
2      101.9  1/3/1992  100.0  100.0  3.80   NaN  ...   NaN   NaN   NaN   1.0   
3      101.9  1/4/1992   90.0   80.0  8.72   NaN  ...   NaN   NaN   NaN   NaN   
4      101.9  1/5/1992  100.0  100.0  8.50   NaN  ...   NaN   NaN   NaN   1.0   

   WT18  WT21  WT22  WV01  WV03  WV20  
0   NaN   

In [2]:
weather_df.columns

Index(['STATION', 'NAME', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE', 'ACMH',
       'ACSH', 'AWND', 'DAPR', 'FMTM', 'MDPR', 'PGTM', 'PRCP', 'SNOW', 'SNWD',
       'TAVG', 'TMAX', 'TMIN', 'TSUN', 'WDF1', 'WDF2', 'WDF5', 'WDFG', 'WDFM',
       'WESD', 'WSF1', 'WSF2', 'WSF5', 'WSFG', 'WSFM', 'WT01', 'WT02', 'WT03',
       'WT04', 'WT05', 'WT07', 'WT08', 'WT09', 'WT10', 'WT11', 'WT13', 'WT14',
       'WT16', 'WT18', 'WT21', 'WT22', 'WV01', 'WV03', 'WV20'],
      dtype='object')

In [8]:
weather_df.shape

(55999, 50)

In [6]:
weather_df.isnull().sum()

STATION          0
NAME             0
LATITUDE         0
LONGITUDE        0
ELEVATION        0
DATE             0
ACMH         51102
ACSH         51100
AWND          8296
DAPR         55994
FMTM         29975
MDPR         55994
PGTM         27012
PRCP           395
SNOW         34329
SNWD         33523
TAVG         34391
TMAX          2341
TMIN          2345
TSUN         47204
WDF1         54202
WDF2         12594
WDF5         13680
WDFG         51233
WDFM         54910
WESD         38467
WSF1         54202
WSF2         12591
WSF5         13671
WSFG         51233
WSFM         54910
WT01         40637
WT02         53504
WT03         55535
WT04         55988
WT05         55618
WT07         55011
WT08         40158
WT09         55829
WT10         55987
WT11         55994
WT13         48722
WT14         55573
WT16         52052
WT18         55991
WT21         55885
WT22         55981
WV01         55926
WV03         55997
WV20         55992
dtype: int64

## 2. Handling Missing Values

After counting the missing values, we calculate the percentage of missing values per column. Columns with more than 50% missing data are considered unreliable and are dropped to maintain the quality of the dataset. `PGTM` (about 48% missing) falls just under this threshold and is kept.

Once these columns are dropped, we are left with the following important columns for further analysis:

- **`STATION`**
- **`NAME`**
- **`LATITUDE`**
- **`LONGITUDE`**
- **`ELEVATION`**
- **`DATE`**
- **`AWND`**
- **`PGTM`**
- **`PRCP`**
- **`TMAX`**
- **`TMIN`**
- **`WDF2`**
- **`WDF5`**
- **`WSF2`**
- **`WSF5`**

In [9]:
missing_percent = weather_df.isnull().sum() / len(weather_df) * 100
missing_percent.sort_values(ascending=False)

WV03         99.996429
WT11         99.991071
MDPR         99.991071
DAPR         99.991071
WV20         99.987500
WT18         99.985714
WT04         99.980357
WT10         99.978571
WT22         99.967857
WV01         99.869641
WT21         99.796425
WT09         99.696423
WT05         99.319631
WT14         99.239272
WT03         99.171414
WT07         98.235683
WDFM         98.055322
WSFM         98.055322
WSF1         96.791014
WDF1         96.791014
WT02         95.544563
WT16         92.951660
WDFG         91.489134
WSFG         91.489134
ACMH         91.255201
ACSH         91.251629
WT13         87.005125
TSUN         84.294362
WT01         72.567367
WT08         71.711995
WESD         68.692298
TAVG         61.413597
SNOW         61.302880
SNWD         59.863569
FMTM         53.527742
PGTM         48.236576
WDF5         24.429008
WSF5         24.412936
WDF2         22.489687
WSF2         22.484330
AWND         14.814550
TMIN          4.187575
TMAX          4.180432
PRCP       

In [10]:
# Drop columns with more than 50% missing values
threshold = 50
cols_to_drop = missing_percent[missing_percent > threshold].index
weather_df.drop(columns=cols_to_drop, inplace=True)


## 3. Imputing Remaining Missing Values

For the remaining columns, we choose to impute missing values using the **median**. 

### Why Impute with Median?

The **median** is less sensitive to outliers than the mean. Weather data contains extreme values (e.g., storms or temperature spikes) that pull the mean away from typical conditions, while the median stays close to them.

- **Robustness to Outliers**: A few extreme readings do not shift the median, so they do not distort the imputed value.
- **Representative Fill Values**: The imputed value reflects a typical day rather than an average inflated by rare events. Filling many gaps with one constant still narrows a column's spread, which matters most for columns with many gaps such as `PGTM` (about 48% missing).
- **Caveats**: `WDF2` and `WDF5` are compass directions and `PGTM` is a clock time (HHMM), so their medians are only rough placeholders. The median is also computed across all stations, which mixes coastal and inland climates.


In [11]:
# Optional: fill remaining NaNs in numeric columns with the column-wise median
weather_df.fillna(weather_df.median(numeric_only=True), inplace=True)
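
The cell above uses one pooled median per column. As a possible refinement (not the approach used in this notebook), the median can be computed per station so that, for instance, a gap at one airport is not filled from another station's climate. The sketch below is a minimal illustration on made-up toy values; `impute_station_median` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def impute_station_median(df, cols, station_col="STATION"):
    """Fill NaNs in `cols` with each station's own median, then the overall median."""
    out = df.copy()
    for col in cols:
        # Median of the column within each station, broadcast back to every row.
        station_median = out.groupby(station_col)[col].transform("median")
        # Fall back to the pooled median if a station has no values at all.
        out[col] = out[col].fillna(station_median).fillna(out[col].median())
    return out

# Toy data: two made-up stations with one gap each.
toy = pd.DataFrame({
    "STATION": ["A", "A", "A", "B", "B", "B"],
    "TMAX": [90.0, np.nan, 94.0, 60.0, 62.0, np.nan],
})
filled = impute_station_median(toy, ["TMAX"])
print(filled)
```

With this toy input, the gap at station A is filled from A's values only, and likewise for B.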

In [12]:
weather_df.columns 

Index(['STATION', 'NAME', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE', 'AWND',
       'PGTM', 'PRCP', 'TMAX', 'TMIN', 'WDF2', 'WDF5', 'WSF2', 'WSF5'],
      dtype='object')

In [14]:
weather_df['STATION'].value_counts()

STATION
USW00093193    10593
USW00023174    10593
USW00023188    10593
USW00003171     8216
USW00023257     8185
MXN00021053     5947
US1CASR0055     1872
Name: count, dtype: int64

In [15]:
weather_df['NAME'].value_counts()

NAME
FRESNO YOSEMITE INTERNATIONAL, CA US        10593
LOS ANGELES INTERNATIONAL AIRPORT, CA US    10593
SAN DIEGO INTERNATIONAL AIRPORT, CA US      10593
RIVERSIDE MUNICIPAL AIRPORT, CA US           8216
MERCED MUNICIPAL AIRPORT, CA US              8185
SAN BERNARDINO LAGUNAS, MX                   5947
SAN BERNARDINO 5.1 NW, CA US                 1872
Name: count, dtype: int64
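
The counts above include `SAN BERNARDINO LAGUNAS, MX`, a station outside California. If the analysis should be limited to California stations, one option is to filter on the `, CA US` suffix that the California station names share. This is a sketch on toy rows (the `TMAX` values are made up), and `keep_california_stations` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def keep_california_stations(df, name_col="NAME"):
    """Keep only rows whose station name ends with the ', CA US' suffix."""
    return df[df[name_col].str.endswith(", CA US")].copy()

# Toy rows using two station names from the output above; TMAX values are made up.
toy = pd.DataFrame({
    "NAME": ["FRESNO YOSEMITE INTERNATIONAL, CA US", "SAN BERNARDINO LAGUNAS, MX"],
    "TMAX": [59.0, 70.0],
})
ca_only = keep_california_stations(toy)
print(ca_only)
```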

In [16]:
weather_df.isnull().sum()

STATION      0
NAME         0
LATITUDE     0
LONGITUDE    0
ELEVATION    0
DATE         0
AWND         0
PGTM         0
PRCP         0
TMAX         0
TMIN         0
WDF2         0
WDF5         0
WSF2         0
WSF5         0
dtype: int64

In [17]:
weather_df.head()

|   | STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DATE | AWND | PGTM | PRCP | TMAX | TMIN | WDF2 | WDF5 | WSF2 | WSF5 |
|---|---------|------|----------|-----------|-----------|------|------|------|------|------|------|------|------|------|------|
| 0 | USW00093193 | FRESNO YOSEMITE INTERNATIONAL, CA US | 36.77999 | -119.72016 | 101.9 | 1/1/1992 | 3.8 | 2100.0 | 0.0 | 59.0 | 32.0 | 270.0 | 270.0 | 15.0 | 19.0 |
| 1 | USW00093193 | FRESNO YOSEMITE INTERNATIONAL, CA US | 36.77999 | -119.72016 | 101.9 | 1/2/1992 | 4.7 | 222.0 | 0.0 | 46.0 | 40.0 | 270.0 | 270.0 | 15.0 | 19.0 |
| 2 | USW00093193 | FRESNO YOSEMITE INTERNATIONAL, CA US | 36.77999 | -119.72016 | 101.9 | 1/3/1992 | 3.8 | 523.0 | 0.07 | 52.0 | 41.0 | 270.0 | 270.0 | 15.0 | 19.0 |
| 3 | USW00093193 | FRESNO YOSEMITE INTERNATIONAL, CA US | 36.77999 | -119.72016 | 101.9 | 1/4/1992 | 8.72 | 2309.0 | 0.0 | 60.0 | 43.0 | 270.0 | 270.0 | 15.0 | 19.0 |
| 4 | USW00093193 | FRESNO YOSEMITE INTERNATIONAL, CA US | 36.77999 | -119.72016 | 101.9 | 1/5/1992 | 8.5 | 422.0 | 1.02 | 59.0 | 44.0 | 270.0 | 270.0 | 15.0 | 19.0 |


In [19]:
import pandas as pd
weather_df['DATE'] = pd.to_datetime(weather_df['DATE'])
weather_df['DATE']

0       1992-01-01
1       1992-01-02
2       1992-01-03
3       1992-01-04
4       1992-01-05
           ...    
55994   2020-12-27
55995   2020-12-28
55996   2020-12-29
55997   2020-12-30
55998   2020-12-31
Name: DATE, Length: 55999, dtype: datetime64[ns]
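
The conversion above lets pandas infer the date format. The raw strings (`1/1/1992`, `1/2/1992`, ...) are month-first, as the consecutive January days in the parsed output confirm. Passing the format explicitly avoids any day/month ambiguity and raises an error on rows that do not match. The sketch below assumes every row uses this same `M/D/YYYY` layout.

```python
import pandas as pd

# Sample strings in the same layout as the raw DATE column.
raw_dates = pd.Series(["1/1/1992", "1/2/1992", "12/31/2020"])

# Month-first format; a value that does not match raises a ValueError.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y")
print(parsed)
```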

In [20]:
weather_df.head()

|   | STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DATE | AWND | PGTM | PRCP | TMAX | TMIN | WDF2 | WDF5 | WSF2 | WSF5 |
|---|---------|------|----------|-----------|-----------|------|------|------|------|------|------|------|------|------|------|
| 0 | USW00093193 | FRESNO YOSEMITE INTERNATIONAL, CA US | 36.77999 | -119.72016 | 101.9 | 1992-01-01 | 3.8 | 2100.0 | 0.0 | 59.0 | 32.0 | 270.0 | 270.0 | 15.0 | 19.0 |
| 1 | USW00093193 | FRESNO YOSEMITE INTERNATIONAL, CA US | 36.77999 | -119.72016 | 101.9 | 1992-01-02 | 4.7 | 222.0 | 0.0 | 46.0 | 40.0 | 270.0 | 270.0 | 15.0 | 19.0 |
| 2 | USW00093193 | FRESNO YOSEMITE INTERNATIONAL, CA US | 36.77999 | -119.72016 | 101.9 | 1992-01-03 | 3.8 | 523.0 | 0.07 | 52.0 | 41.0 | 270.0 | 270.0 | 15.0 | 19.0 |
| 3 | USW00093193 | FRESNO YOSEMITE INTERNATIONAL, CA US | 36.77999 | -119.72016 | 101.9 | 1992-01-04 | 8.72 | 2309.0 | 0.0 | 60.0 | 43.0 | 270.0 | 270.0 | 15.0 | 19.0 |
| 4 | USW00093193 | FRESNO YOSEMITE INTERNATIONAL, CA US | 36.77999 | -119.72016 | 101.9 | 1992-01-05 | 8.5 | 422.0 | 1.02 | 59.0 | 44.0 | 270.0 | 270.0 | 15.0 | 19.0 |


## 4. Data Cleaning Summary

Missing values were handled in two steps:
1. Dropping columns with more than 50% missing data.
2. Imputing the remaining missing values with the median.

The result is a cleaner dataset containing the following columns:

- **`STATION`**: Station identifier.
- **`NAME`**: Name of the station.
- **`LATITUDE`**: Latitude of the station.
- **`LONGITUDE`**: Longitude of the station.
- **`ELEVATION`**: Elevation of the station.
- **`DATE`**: The date of the observation.
- **`AWND`**: Average wind speed.
- **`PGTM`**: Peak gust time.
- **`PRCP`**: Precipitation.
- **`TMAX`**: Maximum temperature.
- **`TMIN`**: Minimum temperature.
- **`WDF2`**: Direction of the fastest 2-minute wind.
- **`WDF5`**: Direction of the fastest 5-second wind.
- **`WSF2`**: Speed of the fastest 2-minute wind.
- **`WSF5`**: Speed of the fastest 5-second wind.

This cleaned dataset is now ready for further exploration and modeling for **wildfire cause prediction**.


In [21]:
# Save the cleaned weather DataFrame to a CSV file
weather_df.to_csv('Weather_Database.csv', index=False)

In [25]:
weather_df.shape

(55999, 15)

## 🌲 NOAA Weather Data Columns: Full Forms & Wildfire Relevance

### 🛰️ Station Metadata
- **STATION**: Unique identifier for the weather station.
- **NAME**: Name of the station (helps geolocate the readings).
- **LATITUDE** / **LONGITUDE**: Coordinates of the weather station — crucial for mapping to fire locations.
- **ELEVATION**: Station elevation above sea level (in meters) — impacts temperature, wind, and precipitation.

These help in **spatial matching** with California wildfire locations (e.g., FRAP dataset).

---

### 📅 Temporal Data
- **DATE**: The observation date (YYYY-MM-DD) — used to align weather conditions with wildfire start dates.

---

### 🌬️ Wind Features
- **AWND**: *Average Daily Wind Speed* (mph in this standard-unit export; tenths of m/s in raw GHCN-Daily)  
  ⏩ Higher wind speeds increase how fast a wildfire can spread.

- **WDF2** / **WDF5**: *Direction of Fastest 2-minute / 5-second Wind* (degrees)  
  ⏩ Gives the direction of the day's strongest sustained wind and strongest gust, which is important for modeling spread direction.

- **WSF2** / **WSF5**: *Fastest 2-minute / 5-second Wind Speed*  
  ⏩ Sudden wind gusts can ignite or rapidly expand fires — useful for early risk detection.
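
Because `WDF2` and `WDF5` are compass directions, 350° and 10° are close even though their numeric values are far apart. One common way to make that visible to a model is to encode each direction as a sine/cosine pair. The sketch below is only an illustration on toy values; `encode_direction` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def encode_direction(degrees):
    """Return (sin, cos) components of a compass direction given in degrees."""
    radians = np.deg2rad(degrees)
    return np.sin(radians), np.cos(radians)

# Toy directions: 350 and 10 degrees are neighbors on the compass.
toy = pd.DataFrame({"WDF2": [350.0, 10.0, 270.0]})
toy["WDF2_sin"], toy["WDF2_cos"] = encode_direction(toy["WDF2"])
print(toy)
```

In the encoded space the first two rows end up close together, while the raw degree values differ by 340.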

---

### 🌧️ Precipitation
- **PRCP**: *Daily Precipitation* (inches in this standard-unit export; tenths of mm in raw GHCN-Daily)  
  ⏩ Less rainfall = drier vegetation → higher fire risk.

---

### 🌬️ Gust Timing
- **PGTM**: *Peak Gust Time* (HHMM)  
  ⏩ When the peak wind gust occurred. Combined with wind speed and fire start times, it can reveal ignition patterns.

---

### 🌡️ Temperature
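- **TMAX**: *Maximum Daily Temperature* (°F in this standard-unit export; tenths of °C in raw GHCN-Daily)  
  ⏩ Hot days dry out fuel sources faster, increasing risk.

- **TMIN**: *Minimum Daily Temperature* (°F in this standard-unit export; tenths of °C in raw GHCN-Daily)  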
  ⏩ Helps understand diurnal variation; low nighttime temps slow spread, but warm nights can maintain combustion.
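
The diurnal variation mentioned above can be turned into an explicit feature by subtracting `TMIN` from `TMAX`. The sketch below uses the first three Fresno rows from the cleaned output shown earlier; the column names `DTR` and `TMEAN` are hypothetical choices for illustration.

```python
import pandas as pd

# TMAX / TMIN values taken from the first three rows of the cleaned DataFrame.
toy = pd.DataFrame({"TMAX": [59.0, 46.0, 52.0], "TMIN": [32.0, 40.0, 41.0]})

# Diurnal temperature range: how much the temperature swings within a day.
toy["DTR"] = toy["TMAX"] - toy["TMIN"]
# Simple midpoint of the daily extremes as a rough daily mean.
toy["TMEAN"] = (toy["TMAX"] + toy["TMIN"]) / 2
print(toy)
```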

---

## 🔥 Summary: Why These Features Matter for Wildfires

These variables represent **key fire-driving weather patterns**:
- Wind (spread and direction)
- Temperature (fuel ignition and drying)
- Precipitation (fuel moisture)
- Location + Time (to align with fire incidents)
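
A single day's `PRCP` says little about fuel moisture, so one possible feature is a trailing precipitation total per station. The sketch below is an assumption-laden illustration: `add_rolling_precip` is a hypothetical helper, the toy values are made up, and the window counts rows, so it only equals a window in days when each station has exactly one row per day with no gaps.

```python
import pandas as pd

def add_rolling_precip(df, window_days=30):
    """Add a trailing precipitation total per station over `window_days` rows."""
    out = df.sort_values(["STATION", "DATE"]).copy()
    out[f"PRCP_{window_days}D"] = (
        out.groupby("STATION")["PRCP"]
        .transform(lambda s: s.rolling(window_days, min_periods=1).sum())
    )
    return out

# Toy data: one made-up station with five consecutive days.
toy = pd.DataFrame({
    "STATION": ["A"] * 5,
    "DATE": pd.date_range("2020-01-01", periods=5, freq="D"),
    "PRCP": [0.0, 0.5, 0.0, 1.0, 0.25],
})
rolled = add_rolling_precip(toy, window_days=3)
print(rolled)
```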

When merged with the **California FRAP wildfire dataset**, they can help **build predictive models**, **assess fire risk**, or **analyze historical trends**.

---

### 🚀 Data Cleaning & Readiness

The dataset has been **cleaned** by dropping columns with more than 50% missing data and imputing the remaining missing values with the median. It is now **ready to be merged** with other relevant datasets, such as the **California FRAP wildfire dataset**, for wildfire cause prediction and risk assessment.
