# Messy IPYNB File

### Reading in the dataset and looking at basic summary information

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("tsunami-events.tsv", sep='\t')
# The data relates to tsunami runups: records of locations where tsunami effects were observed
df.head()

Unnamed: 0,Search Parameters,Year,Mo,Dy,Hr,Mn,Sec,Tsunami Event Validity,Tsunami Cause Code,Earthquake Magnitude,...,Total Missing,Total Missing Description,Total Injuries,Total Injuries Description,Total Damage ($Mil),Total Damage Description,Total Houses Destroyed,Total Houses Destroyed Description,Total Houses Damaged,Total Houses Damaged Description
0,[],,,,,,,,,,...,,,,,,,,,,
1,,-2000.0,,,,,,1.0,1.0,,...,,,,,,4.0,,,,
2,,-1610.0,,,,,,4.0,6.0,,...,,,,,,3.0,,,,
3,,-1365.0,,,,,,1.0,1.0,,...,,,,,,3.0,,,,
4,,-1300.0,,,,,,2.0,0.0,6.0,...,,,,,,,,,,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2714 entries, 0 to 2713
Data columns (total 46 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Search Parameters                   1 non-null      object 
 1   Year                                2713 non-null   float64
 2   Mo                                  2565 non-null   float64
 3   Dy                                  2457 non-null   float64
 4   Hr                                  1399 non-null   float64
 5   Mn                                  1314 non-null   float64
 6   Sec                                 913 non-null    float64
 7   Tsunami Event Validity              2713 non-null   float64
 8   Tsunami Cause Code                  2709 non-null   float64
 9   Earthquake Magnitude                1530 non-null   float64
 10  Vol                                 150 non-null    float64
 11  More Info                           0 non-n
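
The `head()` and `info()` output above suggests that row 0 is a placeholder: `Search Parameters` has its single non-null value there, and every other column in that row is empty. A minimal sketch of dropping it, using a small stand-in frame (`df_demo`, a hypothetical name) rather than the real file. Treating that row as an export artifact is an assumption.

```python
import pandas as pd

# Stand-in for the first rows of the real file (values copied from the head() output)
df_demo = pd.DataFrame({
    "Search Parameters": ["[]", None, None],
    "Year": [None, -2000.0, -1610.0],
    "Tsunami Cause Code": [None, 1.0, 6.0],
})

# Keep only rows that have a Year, then drop the now-empty helper column
events = (
    df_demo[df_demo["Year"].notna()]
    .drop(columns=["Search Parameters"])
    .reset_index(drop=True)
)
print(events)
```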


***Notes:***
* Topics of interest:
    * locations of earthquakes
    * intensity vs damage
    * how much damage and how many deaths tsunamis cause (historical vs. last 5-10 years)
* Notable:
    * Look into whether there are reasons why some data is missing (year, geographic location, etc.)
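
One way to start on the missing-data question is to check whether the share of missing values changes with the year of the event. The sketch below uses a toy frame with invented values, and the era boundaries are an arbitrary choice for illustration, not something taken from the dataset.

```python
import pandas as pd

# Toy stand-in for df; values are illustrative, not taken from the dataset
demo = pd.DataFrame({
    "Year": [-2000, 1700, 1850, 1960, 2004, 2011],
    "Hr": [None, None, None, 19.0, 0.0, 5.0],
    "Earthquake Magnitude": [None, None, 7.5, 9.5, 9.1, 9.1],
})

# Bucket events into eras (bin edges are an arbitrary choice for this sketch);
# right=False makes each bin include its left edge and exclude its right edge
eras = pd.cut(
    demo["Year"],
    bins=[-3000, 1800, 1900, 2100],
    labels=["before 1800", "1800-1899", "1900 onward"],
    right=False,
)

# Share of missing values per column within each era
missing_share = demo.drop(columns="Year").isna().groupby(eras, observed=False).mean()
print(missing_share)
```

On the real data, the same pattern could be applied to `df` with any of the columns listed by `df.info()`.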

### Column Differentiation

#### What makes [Name] and [Name Description] different?

General answer (source: https://www.ngdc.noaa.gov/hazel/view/hazards/tsunami/event-data):
* [Name] columns have numbers whenever possible (deaths = # of deaths, houses destroyed = # of houses destroyed).
* [Name Description] columns hold categories of the numerical data (for example: none, few, some, many, very many; or none, limited, moderate, severe, extreme). When the literature gives only a description of an event instead of a number (for example, the number of deaths), the value is coded into a category representing a range of numbers. When an event has an actual number, that number is also coded into a category for consistency.
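
To see how the two kinds of column line up in practice, one option is to look at the range of counts observed inside each category code. This is a rough sketch on invented rows. The column names `Total Deaths` and `Total Death Description` are the ones that appear in this notebook's outputs, but the values below are made up.

```python
import pandas as pd

# Toy rows; the counts and codes are made up for illustration
demo = pd.DataFrame({
    "Total Deaths": [3, 40, 75, 500, 2000, None],
    "Total Death Description": [1, 1, 2, 3, 4, 3],
})

# Rows where both the count and the category are present
both = demo.dropna(subset=["Total Deaths", "Total Death Description"])

# Observed range of counts inside each category code
ranges = (
    both.groupby("Total Death Description")["Total Deaths"]
    .agg(["min", "max", "count"])
)
print(ranges)
```

Run against the real `df`, the min/max per code should give an empirical picture of the range each category covers.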

### Some Summary Statistics

In [4]:
# summary statistics on some column categories
deaths = df['Total Deaths'].describe()
damage = df['Total Damage ($Mil)'].describe()
damage_descrip = df['Total Damage Description'].describe()
print("Deaths:\n" + str(deaths))
print("\nDamage:\n" + str(damage))
print("\nDamage Description:\n" + str(damage_descrip))

Deaths:
count       575.000000
mean       4052.154783
std       19732.670343
min           1.000000
25%           5.000000
50%          50.000000
75%         834.500000
max      316000.000000
Name: Total Deaths, dtype: float64

Damage:
count       140.000000
mean       4248.415779
std       21925.412749
min           0.003000
25%           2.000000
50%          34.000000
75%         509.250000
max      220085.456000
Name: Total Damage ($Mil), dtype: float64

Damage Description:
count    1123.000000
mean        2.150490
std         1.058691
min         1.000000
25%         1.000000
50%         2.000000
75%         3.000000
max         4.000000
Name: Total Damage Description, dtype: float64
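
`Total Damage Description` holds category codes rather than measurements, so the mean and standard deviation reported by `describe()` are hard to interpret. A frequency table is probably a better summary. A small sketch on toy codes (the values are invented):

```python
import pandas as pd

# Toy codes standing in for df['Total Damage Description']
codes = pd.Series([1, 1, 2, 3, 4, None, 2, 1], name="Total Damage Description")

# Count of events per category code, in code order; NaN rows are excluded by default
counts = codes.value_counts().sort_index()

# Same table as shares of the non-missing rows
shares = codes.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```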


### Looking into the most common tsunami cause

In [5]:
# Figuring out the cause code for each tsunami, turning numbers into categories, plotting a bar chart of counts of each cause
# 0 - Unknown
# 1 - Earthquake
# 2 - Questionable Earthquake
# 3 - Earthquake and Landslide
# 4 - Volcano and Earthquake
# 5 - Volcano, Earthquake, and Landslide
# 6 - Volcano
# 7 - Volcano and Landslide
# 8 - Landslide
# 9 - Meteorological
# 10 - Explosion
# 11 - Astronomical Tide


# Basing this on event validity: only plotting tsunami events that range from questionable tsunami to definite tsunami
# Leaving out entries that signify an erroneous entry, a minor disturbance in an inland river, or a very doubtful tsunami

# Should I condense categories or leave them be? (i.e. 1, 2, 3, 4, 5 all fall under earthquake; 3, 5, 7, 8 all under landslide; etc.)
# This would cause some duplication - I'll try both and see

causes = df[['Tsunami Event Validity','Tsunami Cause Code']]
causes.head()




Unnamed: 0,Tsunami Event Validity,Tsunami Cause Code
0,,
1,1.0,1.0
2,4.0,6.0
3,1.0,1.0
4,2.0,0.0
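
The plan in the comments above (filter by validity, turn codes into categories, count each cause, and try condensed groups) could be sketched as follows. The label dictionary is copied from the comment list in the cell. The validity cutoff of 2 and the `Volcano` group membership are assumptions to check against the NOAA documentation, and the toy frame only stands in for `causes`.

```python
import pandas as pd

# Labels copied from the comment list in the cell above
CAUSE_LABELS = {
    0: "Unknown", 1: "Earthquake", 2: "Questionable Earthquake",
    3: "Earthquake and Landslide", 4: "Volcano and Earthquake",
    5: "Volcano, Earthquake, and Landslide", 6: "Volcano",
    7: "Volcano and Landslide", 8: "Landslide", 9: "Meteorological",
    10: "Explosion", 11: "Astronomical Tide",
}

# Toy stand-in for the `causes` frame
causes_demo = pd.DataFrame({
    "Tsunami Event Validity": [None, 1.0, 4.0, 1.0, 2.0, 4.0, 3.0],
    "Tsunami Cause Code": [None, 1.0, 6.0, 1.0, 0.0, 1.0, 3.0],
})

# Assumption: validity codes 2 and above mean "questionable" through "definite"
MIN_VALIDITY = 2
valid = causes_demo[causes_demo["Tsunami Event Validity"] >= MIN_VALIDITY]

# Cause codes as integers, skipping events with no recorded cause
cause_codes = valid["Tsunami Cause Code"].dropna().astype(int)

# Option A: one label per event
cause_counts = cause_codes.map(CAUSE_LABELS).value_counts()

# Option B: condensed groups; an event can count toward more than one group
GROUPS = {
    "Earthquake": [1, 2, 3, 4, 5],
    "Volcano": [4, 5, 6, 7],    # assumption, derived from the labels above
    "Landslide": [3, 5, 7, 8],
}
group_counts = pd.Series(
    {name: int(cause_codes.isin(members).sum()) for name, members in GROUPS.items()}
)

print(cause_counts)
print(group_counts)
```

With matplotlib installed, `cause_counts.plot(kind="bar")` would draw the bar chart. Option B double-counts combined causes by design, which is the duplication mentioned in the comments.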


Idea for another question: does the cause of a tsunami lead to differing impacts? For example, do volcano events cause larger tsunamis than landslides?
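
A first pass at this question might group events by cause code and compare an impact column across groups. The sketch below uses `Total Damage Description` as the impact measure only because it appears in this notebook. That choice is an assumption, the rows are invented, and a column measuring tsunami size could be swapped in.

```python
import pandas as pd

# Toy rows; cause codes and damage categories are invented for illustration
demo = pd.DataFrame({
    "Tsunami Cause Code": [1, 1, 1, 6, 6, 8, 8, None],
    "Total Damage Description": [1, 2, 3, 4, 4, 2, None, 3],
})

# Per-cause sample size and median damage category, using complete rows only
summary = (
    demo.dropna(subset=["Tsunami Cause Code", "Total Damage Description"])
    .groupby("Tsunami Cause Code")["Total Damage Description"]
    .agg(["count", "median"])
)
print(summary)
```

The `count` column matters here: causes with only a handful of events will not support a meaningful comparison.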

In [13]:
# Cleaning df by removing some columns I won't be using
t_df = df.drop(['More Info', 'Tsunami Magnitude (Abe)',
                'Deaths', 'Missing', 'Injuries', 'Damage ($Mil)', 'Houses Destroyed', 'Houses Damaged',
                'Total Deaths', 'Total Missing', 'Total Damage ($Mil)', 'Total Houses Destroyed',
                'Total Houses Damaged'], axis=1)
t_df.head()

Unnamed: 0,Search Parameters,Year,Mo,Dy,Hr,Mn,Sec,Tsunami Event Validity,Tsunami Cause Code,Earthquake Magnitude,...,Damage Description,Houses Destroyed Description,Houses Damaged Description,Total Death Description,Total Missing Description,Total Injuries,Total Injuries Description,Total Damage Description,Total Houses Destroyed Description,Total Houses Damaged Description
0,[],,,,,,,,,,...,,,,,,,,,,
1,,-2000.0,,,,,,1.0,1.0,,...,4.0,,,3.0,,,,4.0,,
2,,-1610.0,,,,,,4.0,6.0,,...,3.0,,,3.0,,,,3.0,,
3,,-1365.0,,,,,,1.0,1.0,,...,,,,,,,,3.0,,
4,,-1300.0,,,,,,2.0,0.0,6.0,...,,,,,,,,,,
