### Weather Disaster Prediction 
#### DS 3000 Final Project
Members: Luke Abbatessa, Daniel Gilligan, Ruby Potash, Megan Putnam 

**Data Information**

Data set: https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/

Datasets include three types of data related to weather events recorded for a given year: 
1. Storm details
2. Fatalities
3. Locations 

The naming convention for the CSV files is as follows: "StormEvents_[file_type]-ftp_v1.0_d[data_year]_c[creation_date].csv.gz"
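As a rough illustration of the naming convention, the sketch below splits a file name into its three bracketed parts with a regular expression. The helper name `parse_storm_filename` is hypothetical, and the pattern assumes a four-digit data year and an eight-digit creation date, which matches the 2021 file names used in this notebook but has not been checked against every file in the directory.

```python
import re

# Pattern mirroring the naming convention quoted above.
# Assumption: data_year is 4 digits and creation_date is 8 digits (YYYYMMDD).
FILENAME_PATTERN = re.compile(
    r"StormEvents_(?P<file_type>[a-z]+)-ftp_v1\.0_"
    r"d(?P<data_year>\d{4})_c(?P<creation_date>\d{8})\.csv\.gz"
)


def parse_storm_filename(filename):
    """Split a Storm Events file name into its labeled parts (hypothetical helper)."""
    match = FILENAME_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    return match.groupdict()


parts = parse_storm_filename("StormEvents_details-ftp_v1.0_d2021_c20221018.csv.gz")
print(parts)
```

A check like this could be used to confirm that the three files loaded for a given year share the same data year before they are merged.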

In [1]:
# Import necessary libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

In [2]:
LOCATIONS_2021 = "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d2021_c20221018.csv.gz"
DETAILS_2021 = "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2021_c20221018.csv.gz"
FATALITIES_2021 = "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d2021_c20221018.csv.gz"
COLS_OF_INTEREST = [
    "BEGIN_YEARMONTH", "END_YEARMONTH", "EVENT_TYPE", "CZ_TYPE", "EPISODE_NARRATIVE", 
    "EVENT_NARRATIVE", "INJURIES_DIRECT", "DEATHS_DIRECT", "DAMAGE_PROPERTY", "DAMAGE_CROPS"
]

In [3]:
# Import 3 files for storm events 2021 
loc_df = pd.read_csv(LOCATIONS_2021)
detail_df = pd.read_csv(DETAILS_2021)
fatal_df = pd.read_csv(FATALITIES_2021)

In [4]:
# Missing data counts for columns and rows 
dfs = {"locations":loc_df, "details":detail_df, "fatalities":fatal_df}

for df_name, df in dfs.items():
    print(f"\n\033[1m{df_name}")
    print("\033[0mShape:", df.shape)
    print("\nData missing per col:", df.isnull().sum().sort_values(ascending=False), sep="\n")
    print("\nNumber of rows (right) missing X columns (left):", df.isnull().sum(axis=1).value_counts(), sep="\n")


locations
Shape: (58271, 11)

Data missing per col:
YEARMONTH         0
EPISODE_ID        0
EVENT_ID          0
LOCATION_INDEX    0
RANGE             0
AZIMUTH           0
LOCATION          0
LATITUDE          0
LONGITUDE         0
LAT2              0
LON2              0
dtype: int64

Number of rows (right) missing X columns (left):
0    58271
dtype: int64

details
Shape: (61110, 51)

Data missing per col:
CATEGORY              61071
TOR_OTHER_WFO         60884
TOR_OTHER_CZ_STATE    60884
TOR_OTHER_CZ_FIPS     60884
TOR_OTHER_CZ_NAME     60884
TOR_F_SCALE           59567
TOR_LENGTH            59567
TOR_WIDTH             59567
FLOOD_CAUSE           54061
MAGNITUDE_TYPE        37350
MAGNITUDE             31058
BEGIN_AZIMUTH         25955
BEGIN_RANGE           25955
BEGIN_LOCATION        25955
END_RANGE             25955
END_AZIMUTH           25955
END_LOCATION          25955
BEGIN_LAT             25955
BEGIN_LON             25955
END_LAT               25955
END_LON     

The locations file has no missing values. The details file has widespread missing data, presumably in many cases because a variable does not apply to a given storm (e.g., tornado metrics for non-tornado events). In the fatalities file, only the age and sex columns have missing values.
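One way to probe the "not applicable" explanation is to compute the share of missing values per column within each event type. The sketch below uses a small made-up DataFrame (the rows and values are hypothetical, not taken from the NOAA files); only the column names follow the details file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: tornado columns are filled only for tornado rows
toy_details = pd.DataFrame({
    "EVENT_TYPE": ["Tornado", "Hail", "Flood", "Tornado"],
    "TOR_F_SCALE": ["EF1", np.nan, np.nan, "EF0"],
    "MAGNITUDE": [np.nan, 1.25, np.nan, np.nan],
})

# Fraction of missing values in each column, grouped by event type
missing_share = (
    toy_details.drop(columns="EVENT_TYPE")
    .isnull()
    .groupby(toy_details["EVENT_TYPE"])
    .mean()
)
print(missing_share)
```

If a column such as TOR_F_SCALE is fully populated for tornado rows and fully missing elsewhere, that would support treating its gaps as structural rather than as data-quality problems.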

Consulted Statistics Globe to confirm the process of merging two pandas DataFrames:
https://statisticsglobe.com/merge-pandas-dataframes-based-on-particular-column-python
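Because `pd.merge` defaults to an inner join, rows whose key appears in only one DataFrame are dropped, and a key that repeats on one side repeats the matching rows from the other. The following is a minimal sketch on made-up data (the toy frames and their values are hypothetical) showing both effects, along with the `indicator` and `validate` options that can help audit a merge.

```python
import pandas as pd

# Hypothetical toy frames keyed on EVENT_ID
toy_details = pd.DataFrame({
    "EVENT_ID": [1, 2, 3],
    "EVENT_TYPE": ["Hail", "Flood", "Tornado"],
})
toy_locations = pd.DataFrame({
    "EVENT_ID": [1, 1, 2],
    "LOCATION_INDEX": [1, 2, 1],
})

# Default inner join: event 3 has no location row, so it disappears;
# event 1 has two location rows, so its details appear twice.
# validate="many_to_one" raises if EVENT_ID is not unique on the right side.
inner = pd.merge(toy_locations, toy_details, on="EVENT_ID", validate="many_to_one")

# Outer join with indicator=True adds a _merge column showing where each row came from
outer = pd.merge(toy_locations, toy_details, on="EVENT_ID", how="outer", indicator=True)

print(inner)
print(outer["_merge"].value_counts())
```

Comparing row counts before and after a merge in this way is one quick check for unintended row loss or duplication.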

In [5]:
# Merge the three dataframes on Event ID
def merge_dfs(df1, df2, col):
    """Merge two dataframes on a common column"""
    joined_df = pd.merge(df1, df2, on=col)
    return joined_df

loc_detail_df = merge_dfs(loc_df, detail_df, "EVENT_ID")

joined_df = merge_dfs(loc_detail_df, fatal_df, "EVENT_ID")
print(joined_df.head(5))

   YEARMONTH  EPISODE_ID_x  EVENT_ID  LOCATION_INDEX  RANGE AZIMUTH  \
0     202102        155272    936331               1   2.96     SSW   
1     202102        155272    936331               2   1.58       N   
2     202102        155279    936389               1   1.60      NW   
3     202102        155279    936389               1   1.60      NW   
4     202102        155279    936389               2   1.01     NNW   

    LOCATION  LATITUDE  LONGITUDE     LAT2  ...  FAT_YEARMONTH  FAT_DAY  \
0  KAHAKULOA   20.9611  -156.5692  2057666  ...         202102       18   
1   KIPAHULU   20.6725  -156.0739  2040350  ...         202102       18   
2     WAIHEE   20.9455  -156.5385  2056730  ...         202102       27   
3     WAIHEE   20.9455  -156.5385  2056730  ...         202102       27   
4   KIPAHULU   20.6640  -156.0743  2039840  ...         202102       27   

   FAT_TIME  FATALITY_ID  FATALITY_TYPE        FATALITY_DATE  FATALITY_AGE  \
0         0        42429              I  02/

In [6]:
# Filter the joined dataframe based on the columns of interest
def extract_df_cols(df, cols_lst):
    """Extract certain columns from a dataframe"""
    # Return a copy so later column assignments do not trigger SettingWithCopyWarning
    return df[cols_lst].copy()

filtered_df = extract_df_cols(joined_df, COLS_OF_INTEREST)
print(filtered_df.head(5))
print(" ")

print(filtered_df.dtypes)

   BEGIN_YEARMONTH  END_YEARMONTH  EVENT_TYPE CZ_TYPE  \
0           202102         202102  Heavy Rain       C   
1           202102         202102  Heavy Rain       C   
2           202102         202102  Heavy Rain       C   
3           202102         202102  Heavy Rain       C   
4           202102         202102  Heavy Rain       C   

                                   EPISODE_NARRATIVE  \
0  A surface front stalled near Kauai, along with...   
1  A surface front stalled near Kauai, along with...   
2  Gusty trade winds helped keep showers moving a...   
3  Gusty trade winds helped keep showers moving a...   
4  Gusty trade winds helped keep showers moving a...   

                                     EVENT_NARRATIVE  INJURIES_DIRECT  \
0  A 26-year-old woman died when she was swept ou...                0   
1  A 26-year-old woman died when she was swept ou...                0   
2  Two hikers went missing along the Waikamoi Tra...                0   
3  Two hikers went missing a

Consulted Stack Overflow for making a new column from a string slice of another column:
https://stackoverflow.com/questions/25789445/pandas-make-new-column-from-string-slice-of-another-column
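As a small illustration of the string-slice technique, the sketch below splits a YYYYMM value into year and month on a made-up two-row DataFrame (the values are hypothetical). It also shows an integer-arithmetic alternative, which is an assumption-free equivalent only when every value is a six-digit YYYYMM integer.

```python
import pandas as pd

# Hypothetical toy data in YYYYMM form
toy = pd.DataFrame({"BEGIN_YEARMONTH": [202102, 202111]})

# Option 1: cast to string, then slice by character position
as_text = toy["BEGIN_YEARMONTH"].astype(str)
toy["BEGIN_YEAR"] = as_text.str[:4].astype(int)
toy["BEGIN_MONTH"] = as_text.str[4:].astype(int)

# Option 2: integer arithmetic, no string conversion needed
year_arith = toy["BEGIN_YEARMONTH"] // 100
month_arith = toy["BEGIN_YEARMONTH"] % 100

print(toy)
```

Both options give the same year and month here; the string route generalizes more easily to fields that are not purely numeric.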

In [7]:
def change_col_types(df, col, data_type):
    """Return the given column converted to data_type, leaving df unchanged"""
    return df[col].astype(data_type)

# Convert BEGIN_YEARMONTH and END_YEARMONTH to strings to allow for slicing
filtered_df["BEGIN_YEARMONTH"] = change_col_types(filtered_df, "BEGIN_YEARMONTH", pd.StringDtype())
filtered_df["END_YEARMONTH"] = change_col_types(filtered_df, "END_YEARMONTH", pd.StringDtype())

# Separate BEGIN_YEARMONTH and END_YEARMONTH into year and month columns
filtered_df["BEGIN_YEAR"] = filtered_df.BEGIN_YEARMONTH.str[:4]
filtered_df["BEGIN_MONTH"] = filtered_df.BEGIN_YEARMONTH.str[4:]

filtered_df["END_YEAR"] = filtered_df.END_YEARMONTH.str[:4]
filtered_df["END_MONTH"] = filtered_df.END_YEARMONTH.str[4:]

# Drop the now-unnecessary columns BEGIN_YEARMONTH and END_YEARMONTH
filtered_df = filtered_df.drop(["BEGIN_YEARMONTH", "END_YEARMONTH"], axis=1)

print(filtered_df.columns)
print(" ")

print(filtered_df.head(5))

Index(['EVENT_TYPE', 'CZ_TYPE', 'EPISODE_NARRATIVE', 'EVENT_NARRATIVE',
       'INJURIES_DIRECT', 'DEATHS_DIRECT', 'DAMAGE_PROPERTY', 'DAMAGE_CROPS',
       'BEGIN_YEAR', 'BEGIN_MONTH', 'END_YEAR', 'END_MONTH'],
      dtype='object')
 
   EVENT_TYPE CZ_TYPE                                  EPISODE_NARRATIVE  \
0  Heavy Rain       C  A surface front stalled near Kauai, along with...   
1  Heavy Rain       C  A surface front stalled near Kauai, along with...   
2  Heavy Rain       C  Gusty trade winds helped keep showers moving a...   
3  Heavy Rain       C  Gusty trade winds helped keep showers moving a...   
4  Heavy Rain       C  Gusty trade winds helped keep showers moving a...   

                                     EVENT_NARRATIVE  INJURIES_DIRECT  \
0  A 26-year-old woman died when she was swept ou...                0   
1  A 26-year-old woman died when she was swept ou...                0   
2  Two hikers went missing along the Waikamoi Tra...                0   
3  Two hikers w


In [8]:
# Change the types of certain columns
filtered_df["EVENT_TYPE"] = change_col_types(filtered_df, "EVENT_TYPE", pd.StringDtype())
filtered_df["CZ_TYPE"] = change_col_types(filtered_df, "CZ_TYPE", pd.StringDtype())
filtered_df["EPISODE_NARRATIVE"] = change_col_types(filtered_df, "EPISODE_NARRATIVE", pd.StringDtype())
filtered_df["EVENT_NARRATIVE"] = change_col_types(filtered_df, "EVENT_NARRATIVE", pd.StringDtype())
filtered_df["DAMAGE_PROPERTY"] = change_col_types(filtered_df, "DAMAGE_PROPERTY", pd.StringDtype())
filtered_df["DAMAGE_CROPS"] = change_col_types(filtered_df, "DAMAGE_CROPS", pd.StringDtype())
filtered_df["BEGIN_YEAR"] = change_col_types(filtered_df, "BEGIN_YEAR", int)
filtered_df["BEGIN_MONTH"] = change_col_types(filtered_df, "BEGIN_MONTH", int)
filtered_df["END_YEAR"] = change_col_types(filtered_df, "END_YEAR", int)
filtered_df["END_MONTH"] = change_col_types(filtered_df, "END_MONTH", int)

print(filtered_df.dtypes)

EVENT_TYPE           string
CZ_TYPE              string
EPISODE_NARRATIVE    string
EVENT_NARRATIVE      string
INJURIES_DIRECT       int64
DEATHS_DIRECT         int64
DAMAGE_PROPERTY      string
DAMAGE_CROPS         string
BEGIN_YEAR            int64
BEGIN_MONTH           int64
END_YEAR              int64
END_MONTH             int64
dtype: object
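
DAMAGE_PROPERTY and DAMAGE_CROPS are still strings at this point. One possible way to convert such values to numbers is sketched below, assuming they look like "10.00K" with an optional K, M, or B suffix. The helper `parse_damage`, the multiplier table, and the sample values are all hypothetical; this version multiplies by the suffix instead of relying on a fixed number of decimal places.

```python
import pandas as pd

# Assumed meaning of the suffixes: thousand, million, billion
SUFFIX_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_damage(value):
    """Convert a damage string such as '10.00K' to a float (hypothetical helper)."""
    if pd.isna(value) or value == "":
        return float("nan")
    text = str(value).strip().upper()
    multiplier = SUFFIX_MULTIPLIERS.get(text[-1])
    if multiplier is None:
        # No recognized suffix: treat the whole string as a plain number
        return float(text)
    return float(text[:-1]) * multiplier


# Hypothetical sample values, including a missing entry and a suffix-free number
toy_damage = pd.Series(["10.00K", "1.50M", "0.00K", None, "250"])
parsed = toy_damage.map(parse_damage)
print(parsed)
```

Missing entries stay missing (NaN), so they can be handled explicitly in a later cleaning step.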


In [9]:
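# DAMAGE_PROPERTY and DAMAGE_CROPS are strings with a magnitude suffix:
# "K" (3 zeros), "M" (6 zeros), or "B" (9 zeros).
# Values in both columns already have two digits after the decimal point, so
# deleting "." accounts for 2 of those zeros and each suffix adds the rest:
# "K" -> 3 - 2 = 1 zero, "M" -> 6 - 2 = 4 zeros, "B" -> 9 - 2 = 7 zeros.
letter_num = {"K": "0"*(3-2), "M": "0"*(6-2), "B": "0"*(9-2), ".": ""}
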
for col in ["DAMAGE_PROPERTY", "DAMAGE_CROPS"]:
    # Expand the suffix into zeros and drop the decimal point, then convert to a number
    filtered_df[col] = pd.to_numeric(
        filtered_df[col].str.translate(str.maketrans(letter_num))
    )


def get_unique_col_vals(df):
    """Gather unique values from each dataframe column"""
    for col in df:
        print(df[col].unique())
        
get_unique_col_vals(filtered_df)

<StringArray>
[              'Heavy Rain',                'Lightning',
              'Flash Flood',                  'Tornado',
        'Thunderstorm Wind',       'Marine Strong Wind',
                    'Flood',              'Debris Flow',
                     'Hail', 'Marine Thunderstorm Wind']
Length: 10, dtype: string
<StringArray>
['C', 'Z']
Length: 2, dtype: string
<StringArray>
[                                                                                         'A surface front stalled near Kauai, along with an upper trough, brought heavy rain and thunderstorms to parts of the Aloha State, and flash flooding to Kauai.  Downpours over windward East Maui during this episode contributed to the death of a 26-year-old woman who was swept out to sea when Waioka Pond near Hana became swollen by waters coming down the slope from near the Haleakala summit.  The costs of any damages were not available.',
                                                'Gusty trade winds helped keep 