
This notebook explores several kinds of data visualization with the `Plotly` package, an interactive plotting tool.    
Hover the mouse over a figure to see the dataframe values behind each data point.   

If you like this notebook or have other ideas about data visualization for tsunamis, please upvote and let me know!   
    
Thanks in advance! 🙂

<img src="http://1721181113.rsc.cdn77.org/data/images/full/27154/tsunami.jpg" width="1000">

--- 
# First of all, a preliminary look at the meaning of each field:   

### Tsunami Cause Code

Valid values: 0 to 11  
The source of the tsunami:  
0	Unknown  
1	Earthquake  
2	Questionable Earthquake  
3	Earthquake and Landslide  
4	Volcano and Earthquake  
5	Volcano, Earthquake, and Landslide  
6	Volcano  
7	Volcano and Landslide  
8	Landslide  
9	Meteorological  
10	Explosion  
11	Astronomical Tide  

---
### Tsunami Validity Number

Valid values: -1 to 4   
Validity of the actual tsunami occurrence is indicated by a numerical rating of the reports of that event:   
-1	erroneous entry   
0	event that only caused a seiche or disturbance in an inland river   
1	very doubtful tsunami   
2	questionable tsunami   
3	probable tsunami   
4	definite tsunami   

---
### Number of Deposits  
The number of tsunami deposits associated with a particular tsunami event. In the NOAA database, this number is a link that displays those deposits.

---
### Number Of Runups
The number of runup locations associated with a particular tsunami event. In the NOAA database, this number is a link that displays those runup locations.
>The tsunami database may also include errors that are unique to that database. One of the most important measurements associated with a tsunami event is the maximum runup height or water height reached above sea level in meters. Unfortunately, it is not always clear which reference level was used. The tsunami database also includes locations where the tsunami was observed, called runup locations. The same problem that occurs when identifying earthquake epicenters can occur when assigning runup locations, where the names of localities were incorrectly transcribed or where some localities had identical or very similar names. In addition, names of locations can change over time adding to the possibility of errors. If tsunami arrival and travel times are available for specific runup locations, they are included in the database. These data are valuable in verifying tsunami travel time models. The definition used in this database is the arrival or travel time of the first wave that arrives at a runup location. The first wave may not have been the largest wave, therefore the travel time reported in the original source may have been the second or third wave.   
Source: NOAA (https://www.ngdc.noaa.gov/hazard/tsunami-db-intro.html#intro)
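
The code tables above can be kept as lookup dictionaries so that plots and tables show readable labels instead of numbers. This is a minimal sketch: the names `CAUSE_LABELS`, `VALIDITY_LABELS`, and `label_codes` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import pandas as pd

# Labels copied from the code tables listed above
CAUSE_LABELS = {
    0: "Unknown", 1: "Earthquake", 2: "Questionable Earthquake",
    3: "Earthquake and Landslide", 4: "Volcano and Earthquake",
    5: "Volcano, Earthquake, and Landslide", 6: "Volcano",
    7: "Volcano and Landslide", 8: "Landslide", 9: "Meteorological",
    10: "Explosion", 11: "Astronomical Tide",
}
VALIDITY_LABELS = {
    -1: "erroneous entry",
    0: "event that only caused a seiche or disturbance in an inland river",
    1: "very doubtful tsunami",
    2: "questionable tsunami",
    3: "probable tsunami",
    4: "definite tsunami",
}

def label_codes(codes: pd.Series, labels: dict) -> pd.Series:
    """Map numeric codes to text labels; codes not in `labels` become NaN."""
    return codes.map(labels)

# Toy input (not taken from the dataset): two known codes and one missing value
example = pd.Series([1.0, 8.0, None])
labelled = label_codes(example, CAUSE_LABELS)
print(labelled.tolist())
```

The codes are stored as floats in the CSV, which is why the toy input uses `1.0` and `8.0` rather than integers.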

# Import packages and load data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#sns.set_style("whitegrid")
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import warnings
warnings.filterwarnings("ignore")

In [2]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/historical-data-of-tsunamis18002021/tsunami historical data from 1800 to 2021.csv


In [3]:
df = pd.read_csv('/kaggle/input/historical-data-of-tsunamis18002021/tsunami historical data from 1800 to 2021.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2162 entries, 0 to 2161
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Year                       2161 non-null   float64
 1   Mo                         2112 non-null   float64
 2   Dy                         2054 non-null   float64
 3   Hr                         1338 non-null   float64
 4   Mn                         1289 non-null   float64
 5   Sec                        941 non-null    float64
 6   Tsunami Event Validity     2161 non-null   float64
 7   Tsunami Cause Code         2158 non-null   float64
 8   Earthquake Magnitude       1293 non-null   float64
 9   Deposits                   2161 non-null   float64
 10  Country                    2161 non-null   object 
 11  Location Name              2159 non-null   object 
 12  Latitude                   1892 non-null   float64
 13  Longitude                  1892 non-null   float

This dataset contains many useful fields; in this notebook, however, we keep only the following columns for further analysis:

In [5]:
# Copy the selection so that later column assignments do not act on a view of df
data = df[['Year','Mo','Dy','Latitude','Longitude','Tsunami Event Validity','Tsunami Cause Code',
           'Deposits','Country','Location Name','Number of Runups','Total Deaths']].copy()

# Rename the columns and drop the first row (index 0), which has no Year
data = data.rename(columns = {'Tsunami Event Validity':'Validity',
                              'Tsunami Cause Code':'Code',
                              'Location Name':'Location',
                              'Number of Runups':'Runups',
                              'Total Deaths':'Death'})
data = data.drop([0])
data.isnull().sum()

Year            0
Mo             49
Dy            107
Latitude      269
Longitude     269
Validity        0
Code            3
Deposits        0
Country         0
Location        2
Runups          0
Death        1669
dtype: int64

In [6]:
data['Code'].value_counts()

1.0     1522
0.0      210
8.0      104
3.0       96
6.0       91
9.0       90
2.0       16
4.0       14
7.0       11
11.0       2
10.0       1
5.0        1
Name: Code, dtype: int64

Most tsunamis are caused by earthquakes (code 1).

In [7]:
data['Deposits'].value_counts(ascending=False)

0.0      2021
1.0        91
2.0        13
3.0        10
4.0         6
8.0         4
6.0         3
20.0        2
10.0        2
5.0         2
11.0        1
26.0        1
14.0        1
9.0         1
19.0        1
144.0       1
7.0         1
Name: Deposits, dtype: int64

Most tsunamis are not considered 'dangerous' here, as they have no recorded deposits (that is, they did not affect inland properties).

# Part 1: General EDA

Given the data above, questions like the following come to mind: 
- Which country is most threatened by the tsunami? 
- Are there more and more tsunamis?
- ...   


In [8]:
data.head(3)

Unnamed: 0,Year,Mo,Dy,Latitude,Longitude,Validity,Code,Deposits,Country,Location,Runups,Death
1,1800.0,6.0,2.0,,,2.0,1.0,0.0,PORTUGAL,AZORES,1.0,
2,1802.0,1.0,4.0,45.3,14.4,2.0,1.0,0.0,CROATIA,BAKAR,3.0,
3,1802.0,3.0,19.0,17.2,-62.4,2.0,1.0,0.0,ANTIGUA AND BARBUDA,ANTIGUA ISLAND & ST. CHRISTOPHER,2.0,


In [9]:
data.loc[(data['Year']>=1800)&(data['Year']<1850), 'Century'] = 'Early_19'
data.loc[(data['Year']>=1850)&(data['Year']<1900), 'Century'] = 'Late_19'
data.loc[(data['Year']>=1900)&(data['Year']<1950), 'Century'] = 'Early_20'
data.loc[(data['Year']>=1950)&(data['Year']<2000), 'Century'] = 'Late_20'
data.loc[data['Year']>=2000, 'Century'] = 'Early_21'
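
The same half-century binning can also be written with `pd.cut`, which keeps the bin edges and labels in one place. This is a small sketch on toy years rather than the dataset; `right=False` makes each bin include its lower edge and exclude its upper edge, matching the `>=` and `<` comparisons above. The variable names are illustrative.

```python
import pandas as pd

# Toy years chosen to sit on and around the bin boundaries
years = pd.Series([1800.0, 1849.0, 1850.0, 1999.0, 2000.0, 2021.0])

bin_edges = [1800, 1850, 1900, 1950, 2000, float("inf")]
bin_labels = ["Early_19", "Late_19", "Early_20", "Late_20", "Early_21"]

# right=False gives intervals of the form [lower, upper)
century = pd.cut(years, bins=bin_edges, labels=bin_labels, right=False)
print(century.astype(str).tolist())
```

The result is a categorical column, so the periods keep their chronological order when used as a plot facet.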

In [10]:
data_dd = pd.DataFrame({
                        'Year': data.Year.value_counts().sort_index().index,
                        'Count': data.Year.value_counts().sort_index(),
                        }).reset_index(drop=True)

data_dd = data_dd.merge(data[['Year', 'Century']].drop_duplicates('Year'),
                        how='left',
                        on='Year')
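
As a cross-check, the per-year table can also be built in one step with `groupby(...).size()`. The sketch below uses a toy frame with made-up rows; it relies on each year belonging to exactly one `Century` label, which holds for the binning above.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "Year": [1800.0, 1800.0, 1802.0, 1900.0],
    "Century": ["Early_19", "Early_19", "Early_19", "Early_20"],
})

# One row per (Year, Century) pair with the number of events in that year
per_year = toy.groupby(["Year", "Century"]).size().reset_index(name="Count")
print(per_year)
```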

# BarPlot: General Tsunami Analysis

In [11]:
fig = px.bar(
             data_dd,
             x='Year',
             y='Count',
             template='gridon',
             color_discrete_sequence =px.colors.qualitative.Set3,
             title='<b> Fig.1 Tsunami Counts in Different Years'
             )

fig.show()

**`Fig.1` suggests a small peak roughly once every ten years.**   
**Roughly every 50 years, there is a period in which tsunamis occur frequently around the world.**   
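
One way to examine these apparent cycles more closely is to aggregate the counts per decade, which smooths out year-to-year noise. This is only a sketch on toy years; `counts_per_decade` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def counts_per_decade(years: pd.Series) -> pd.Series:
    """Count events per decade, ignoring rows with a missing year."""
    decades = (years.dropna() // 10 * 10).astype(int)
    return decades.value_counts().sort_index()

# Toy years for illustration only
toy_years = pd.Series([1801.0, 1809.0, 1815.0, 1820.0, 1821.0, 1829.0, None])
decade_counts = counts_per_decade(toy_years)
print(decade_counts)
```

On the real data, `counts_per_decade(data['Year'])` could be passed to `px.bar` in the same way as the yearly counts.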

In [12]:
# Count events per (Validity, Code) pair; size() gives the same counts as
# value_counts() and names the 'Count' column explicitly across pandas versions
df_count = data.groupby(['Validity', 'Code']).size().reset_index(name='Count')

In [13]:
fig = px.bar(
             df_count,
             x='Validity',
             y='Count',
             color='Code',
             template='gridon',
             color_continuous_scale = 'teal',
             width=1000,
             height=500,
             title='<b> Fig.2 Tsunami Code Count in Different Validity'
            )

fig.show()

**From `Fig.2`, most recorded tsunamis did in fact happen and were caused by earthquakes.**       
**A small fraction of the records are misreported tsunamis attributed to meteorological causes.**
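
To put numbers on statements like these, the share of each cause code within each validity level can be computed with `pd.crosstab`. This is a sketch on toy values, not the dataset.

```python
import pandas as pd

# Toy records for illustration only
toy = pd.DataFrame({
    "Validity": [4, 4, 4, 1, 1],
    "Code":     [1, 1, 6, 9, 1],
})

# normalize='index' turns each row (validity level) into fractions that sum to 1
share = pd.crosstab(toy["Validity"], toy["Code"], normalize="index")
print(share.round(2))
```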

In [14]:
fig = px.bar(
             df_count,
             x='Code',
             y='Count',
             color='Validity',
             template='gridon',
             color_continuous_scale = px.colors.carto.Teal,
             log_y=False,
             width=1000,
             height=500,
             title='<b>Fig.3 Tsunami Count in Different Code'
            )

fig.show()

**From `Fig.3`, the vast majority of recorded tsunamis in the world are triggered by earthquakes.**   
**Other factors such as volcanic eruptions account for only a few.**

# PiePlot: Country Analysis

In [15]:
# Count records per country; naming the columns explicitly works across pandas versions
country = data['Country'].value_counts().rename_axis('Country').reset_index(name='Count')

# Group countries with fewer than 20 recorded tsunamis under 'Other countries'
country.loc[country['Count'] < 20, 'Country'] = 'Other countries' 
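
After this relabelling, the table contains many separate rows named 'Other countries'. If a single row per slice is preferred, the counts can be summed explicitly, as in this sketch on toy counts (the numbers are made up).

```python
import pandas as pd

# Toy counts for illustration only
counts = pd.DataFrame({
    "Country": ["USA", "JAPAN", "FIJI", "TONGA"],
    "Count": [300, 250, 12, 7],
})

# Relabel the small countries, then collapse them into one row
counts.loc[counts["Count"] < 20, "Country"] = "Other countries"
grouped = (
    counts.groupby("Country", as_index=False)["Count"]
    .sum()
    .sort_values("Count", ascending=False)
    .reset_index(drop=True)
)
print(grouped)
```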

In [16]:
fig = px.pie(
             country,
             names='Country',
             values='Count',
             template='gridon',
             color_discrete_sequence=px.colors.carto.Teal,
             width=1000, 
             height=500,
             title='<b>Fig.4 Tsunami Count for Different Country'
            )

fig.update_traces(
                  textposition='inside',
                  textinfo='percent+label'
                  )

fig.show()

**From `Fig.4`, the top five countries with the most tsunami records (including erroneous entries) in the world are:**
- USA
- Indonesia
- Japan
- Chile
- Greece

# GeoPlot: Worldwide Tsunami Density

In [17]:
# Set a feature named 'Century'

map = pd.DataFrame(df[['Country','Year','Latitude','Longitude']])
map = map.dropna()
map.loc[(map['Year']>=1800)&(map['Year']<1900), 'Century'] = 19
map.loc[(map['Year']>=1900)&(map['Year']<2000), 'Century'] = 20
map.loc[map['Year']>=2000, 'Century'] = 21

In [18]:
fig = px.density_mapbox(
                        map,
                        lat='Latitude',
                        lon='Longitude',
                        hover_name='Country',
                        radius=8,
                        color_continuous_scale=px.colors.carto.Teal,
                        #color_continuous_scale = px.colors.diverging.Geyser,
                        center=dict(lat=0, lon=210), zoom=0.5,
                        )

fig.update_layout(
                  mapbox_style="carto-positron",
                  width=1000, 
                  height=500,
                  title_text="<b>Fig.5 World Tsunami Density",
                  title_x=0.5
                 )

fig.show()

**From `Fig.5`, the majority of tsunamis occurred around the Pacific Ocean and the Mediterranean Sea.**

# Part 2: Further Data Visualization

In [19]:
data.head(3)

Unnamed: 0,Year,Mo,Dy,Latitude,Longitude,Validity,Code,Deposits,Country,Location,Runups,Death,Century
1,1800.0,6.0,2.0,,,2.0,1.0,0.0,PORTUGAL,AZORES,1.0,,Early_19
2,1802.0,1.0,4.0,45.3,14.4,2.0,1.0,0.0,CROATIA,BAKAR,3.0,,Early_19
3,1802.0,3.0,19.0,17.2,-62.4,2.0,1.0,0.0,ANTIGUA AND BARBUDA,ANTIGUA ISLAND & ST. CHRISTOPHER,2.0,,Early_19


# BarPlot: Study Reported Tsunami Kind
Not every recorded tsunami actually occurred, so if we include tsunamis that were recorded erroneously or are unlikely to have happened, our results will be skewed.   
In this part, we divide tsunamis into 3 kinds:
- True tsunamis (validity 4)
- Maybe tsunamis (validity 1 to 3)
- Fake tsunamis (validity -1 or 0)

In [20]:
# Check the definite tsunamis' frequency around world

data[data['Validity']==4].Country.value_counts().head(5)

JAPAN        182
USA           71
INDONESIA     67
RUSSIA        63
CHILE         42
Name: Country, dtype: int64

In [21]:
for i,j in enumerate(data[data['Validity']==4].Country.unique()):
    
    #Choose the top 5 countries with the most tsunami recorded as 'definite'
    if len(data[(data['Country']==j)&(data['Validity']==4)]) >=42:
        data.loc[data['Country']==j, 'Country_id'] = i
        
    else:
        data.loc[data['Country']==j, 'Country_id'] = None  
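
The loop above hard-codes the threshold of 42 definite tsunamis. A sketch of an alternative that selects the top countries by rank instead is shown below, on toy data; the column name `is_top` is a hypothetical stand-in for `Country_id`.

```python
import pandas as pd

# Toy records for illustration only
toy = pd.DataFrame({
    "Country":  ["JAPAN", "JAPAN", "JAPAN", "USA", "USA", "FIJI", "USA"],
    "Validity": [4, 4, 4, 4, 4, 4, 2],
})

# Rank countries by their number of 'definite' (validity 4) records
definite_counts = toy.loc[toy["Validity"] == 4, "Country"].value_counts()
top_countries = definite_counts.head(2).index

# Flag every record (of any validity) that belongs to a top country
toy["is_top"] = toy["Country"].isin(top_countries)
print(toy)
```

With the real data, `head(5)` would pick the five countries listed in the previous output, provided there are no ties at the cut-off.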

In [22]:
# Set a feature to describe the tsunami kind

data.loc[data['Validity']==4, 'Kind'] = 'True'
data.loc[(data['Validity']>=1)&(data['Validity']<=3), 'Kind'] = 'Maybe'
data.loc[(data['Validity']== -1)|(data['Validity']== 0), 'Kind'] = 'Fake'

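# dropna() removes every row with a missing value in any column (Death, Latitude,
# Country_id, ...), so data_2 holds only top-5-country events with complete records
data_2 = data.dropna().copy()
data_2['Count'] = 1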
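
The three `Kind` assignments can also be expressed as one `np.select` call, which makes the fallback for unexpected validity values explicit. This is a sketch on toy validity values; the `'Unknown'` default is an assumption added for illustration and does not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy validity values covering every branch
validity = pd.Series([4, 3, 1, 0, -1])

conditions = [
    validity == 4,            # definite tsunami
    validity.between(1, 3),   # doubtful, questionable, or probable
    validity.isin([-1, 0]),   # erroneous entry or seiche only
]
kinds = ["True", "Maybe", "Fake"]

# Rows matching no condition (for example a missing validity) get 'Unknown'
kind = np.select(conditions, kinds, default="Unknown")
print(kind.tolist())
```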

In [23]:
#fig = px.violin(data, 
#                y="Year", 
#                color="Kind",
#                template='gridon',
#                color_discrete_sequence=px.colors.qualitative.Pastel,
#                #box=True, 
#                #points="suspectedoutliers", 
#                width=1000, height=500,
#                violinmode='overlay', 
#                hover_data=data.columns,
#                title='<b> Development of Tsunami Detection Technology'
#               )
#fig.show()

#fig = px.violin(data, 
#                y="Year", 
#                x='Effect',
#                #color="Kind",
#                template='gridon',
#                color_discrete_sequence=px.colors.qualitative.Set2,
#                #box=True, 
#                #points="suspectedoutliers", 
#                width=1000, height=500,
#                violinmode='overlay', 
#                hover_data=data.columns,
#                title='<b> Development of Tsunami Detection Technology'
#               )
#fig.show()

In [24]:
fig = px.bar(
             data_2,
             x="Country",
             y="Count", 
             color="Country",
             facet_col= 'Century',
             facet_row='Kind',
             template='gridon',
             color_discrete_sequence=px.colors.carto.Geyser,
             width=1000,
             height=600,
             title='<b>Fig.6 True Tsunami or Fake Tsunami? <br> Situation for Top5 Countries in Different Century'
            )

fig.update_xaxes(tickangle=45)

fig.show()

**According to `Fig.6`:**
1. **The USA has the most tsunami records overall, but Japan has the most tsunamis recorded as 'definite', and these have become more frequent in recent years.**  
2. **Furthermore, the USA has the most 'Fake' tsunami records.**   
3. **Fewer and fewer tsunamis are recorded as merely possible ('Maybe') over time.**  
4. **Russia experienced an unusually high number of tsunamis in the second half of the twentieth century compared to other times.**

# BarPlot: Study Deposit to Assess Tsunami Risk
The number of deposits counts the tsunami deposits associated with a particular tsunami event.   
>Sedimentary deposits left by tsunamis can be used to extend the record of tsunamis to improve risk assessment. The two primary factors in tsunami risk, tsunami frequency and magnitude, can be addressed through field and modeling studies of tsunami deposits.   
---- Bruce E Jaffe

In [25]:
data.Deposits.value_counts()

0.0      2021
1.0        91
2.0        13
3.0        10
4.0         6
8.0         4
6.0         3
20.0        2
10.0        2
5.0         2
11.0        1
26.0        1
14.0        1
9.0         1
19.0        1
144.0       1
7.0         1
Name: Deposits, dtype: int64

In [26]:
# Classify Tsunamis' Effect

data.loc[data['Deposits']==0, 'Effect'] = 'None'
data.loc[data['Deposits']==1, 'Effect'] = 'Once'
data.loc[(data['Deposits']>1)&(data['Deposits']<10), 'Effect'] = 'Several'
data.loc[data['Deposits']>=10, 'Effect'] = 'Dozens'
data['Count'] = 1

In [27]:
fig = px.bar(
             data[(data['Validity']==4)&(~data['Country_id'].isnull())], 
             x="Country",
             y="Count", 
             color="Country",
             facet_col= 'Century',
             facet_row='Effect',
             template='gridon',
             log_y=True,
             color_discrete_sequence=px.colors.carto.Geyser,
             width=1000,
             height=600,
             title='<b>Fig.7 Does the Tsunami Affect Inland Properties? <br> Deposit Study for Top5 Countries in Different Century'
            )

fig.update_xaxes(tickangle=45)

fig.show()

**According to `Fig.7`:**
1. **The majority of tsunamis have no recorded impact on inland areas, as they left no deposits.**
2. **Both Indonesia and the USA experienced tsunamis that generated dozens of deposits; these were very dangerous tsunamis.**
3. **The deposits can inform coastal risk assessment for future generations.**

# ViolinPlot: Combine Tsunami Kind and its Damage
Through data visualization, some of our conjectures...

In [28]:
fig = make_subplots(
                    rows=1,
                    cols=2,
                    subplot_titles = ['a) Tsunami Authenticity','b) Tsunami Damage Level'],
                    shared_yaxes=True
                   )

fig.add_trace(
    go.Violin(
                y=data.Year, 
                x=data.Kind,
                name='Kind',
                marker=dict(color="teal", size=4, opacity=0.8)
                #legendgroup='Kind'
                ),
                row=1,
                col=1
                )

fig.add_trace(
    go.Violin( 
                y=data.Year, 
                x=data.Effect,
                name='Effect',
                marker=dict(color='rgb(237,187,138)', size=4, opacity=0.8)
                ),
                row=1,
                col=2
                )

fig.update_layout(
                  height=400,
                  width=1000, 
                  title_text="<b>Fig.8 Development of Tsunami Detection Technology",
                  title_x=0.5,
                  template='gridon',
                  #yaxis=dict(range=[1750,2090])
                 )

fig.show()

**According to `Fig.8`, our conjectures are:**   
- **a)**
    - Over time, we have become increasingly certain about whether a reported tsunami actually occurred.
- **b)**
    - At the same time, the danger that tsunamis pose to us is increasing.

# GeoPlot: Analysis for Selected Country
The location distribution of tsunamis in the following three countries:
- Japan
- USA
- Indonesia

In [29]:
Japan = data[data['Country']== 'JAPAN'].copy()  # copy so the edit below does not touch data
Japan.loc[Japan['Deposits']==0, 'Deposits'] =0.05 # make the data points visible on the plot
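
The same two steps are repeated for each country below, so they could be wrapped in a small helper. This is a sketch under that assumption; `country_subset` and its `min_size` parameter are hypothetical names, and the toy frame is made up.

```python
import pandas as pd

def country_subset(data: pd.DataFrame, country: str, min_size: float = 0.05) -> pd.DataFrame:
    """Return a copy of one country's rows, with zero deposits raised to
    `min_size` so the points stay visible when `Deposits` drives marker size."""
    subset = data[data["Country"] == country].copy()
    subset.loc[subset["Deposits"] == 0, "Deposits"] = min_size
    return subset

# Toy frame for illustration only
toy = pd.DataFrame({
    "Country": ["JAPAN", "JAPAN", "USA"],
    "Deposits": [0.0, 3.0, 0.0],
})
japan = country_subset(toy, "JAPAN")
print(japan)
```

Because the helper works on a copy, the original frame keeps its true zero values for any later analysis.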

In [30]:
fig = px.scatter_mapbox(
                        Japan,
                        lat='Latitude',
                        lon='Longitude', 
                        color="Deposits", 
                        size="Deposits",
                        color_continuous_scale=px.colors.carto.Tealrose, 
                        hover_name='Location',
                        hover_data=['Year','Kind','Death'],
                        size_max=33, 
                        zoom=3.8,
                        center=dict(lat=38, lon=139),
                        title = '<b>Fig.9 Japan Tsunami Distribution with Deposit Numbers'
                        )

fig.update_layout(
                  mapbox_style="carto-positron",
                  width=1000,
                  height=550,
                  title_x=0.5,
                  )

fig.show()

**According to `Fig.9`,**
1. **More dangerous tsunamis hit northern Japan than southern Japan.**
2. **Three dangerous tsunamis hit Japan in 1983, 1993, and 2011, killing a total of 18,764 people.**
3. **The terrible tsunami that struck Honshu Island in 2011 killed 18,429 people.**
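
Totals like these can be read off the hover data, or computed directly. The sketch below assumes that 'most dangerous' means the events with the largest `Deposits` values, which is an interpretation rather than something stated in the notebook; the toy rows are made up.

```python
import pandas as pd

# Toy events for illustration only
toy = pd.DataFrame({
    "Year":     [1900, 1950, 2000, 2010],
    "Deposits": [0.0, 2.0, 5.0, 1.0],
    "Death":    [None, 10.0, 100.0, 1.0],
})

# Take the events with the most deposits and add up their recorded deaths
top_events = toy.nlargest(2, "Deposits")
total_deaths = top_events["Death"].sum()  # missing death counts are skipped by sum()
print(top_events[["Year", "Deposits", "Death"]])
print(total_deaths)
```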

We can zoom in on the map to see more detail about each tsunami, such as the city and date.


In [31]:
USA = data[data['Country']== 'USA'].copy()  # copy so the edit below does not touch data
USA.loc[USA['Deposits']==0, 'Deposits'] =0.05  # make the data points visible on the plot

In [32]:
fig = px.scatter_mapbox(
                        USA,
                        lat='Latitude',
                        lon='Longitude', 
                        color="Deposits", 
                        size="Deposits",
                        color_continuous_scale=px.colors.carto.Tealrose, 
                        hover_name='Location',
                        hover_data=['Year','Kind','Death'],
                        size_max=35, 
                        zoom=1.9,
                        center=dict(lat=44, lon=-125),
                        title = '<b>Fig.10 USA Tsunami Distribution with Deposit Numbers'
                       )

fig.update_layout(
                  mapbox_style="carto-positron",
                  width=1000,
                  height=550,
                  title_x=0.5
                 )

fig.show()

**According to `Fig.10`,**
1. **The most dangerous tsunamis to hit the USA occurred around the Gulf of Alaska.**
2. **The western USA is more vulnerable to tsunamis than the eastern USA.**
3. **The five most dangerous tsunamis happened over 50 years ago, in 1964, 1946, 1883, 1957, and 1891, and killed a total of 301 people.**   

We can zoom in on the map to see more detail about each tsunami, such as the city and date.


In [33]:
Indonesia = data[data['Country']== 'INDONESIA'].copy()  # copy so the edit below does not touch data
Indonesia.loc[Indonesia['Deposits']==0, 'Deposits'] =0.05  # make the data points visible on the plot

In [34]:
fig = px.scatter_mapbox(
                        Indonesia,
                        lat='Latitude',
                        lon='Longitude', 
                        color="Deposits", 
                        size="Deposits",
                        color_continuous_scale=px.colors.carto.Tealrose, 
                        hover_name='Location',
                        hover_data=['Year','Kind','Death'],
                        size_max=144,   #set the size as the deposits numbers
                        zoom=3,
                        center=dict(lat=-1, lon=114),
                        title = '<b>Fig.11 Indonesia Tsunami Distribution with Deposit Numbers'
                       )

fig.update_layout(
                  mapbox_style="carto-positron",
                  width=1000,
                  height=550,
                  title_x=0.5,
                 )

fig.show()

**According to `Fig.11`,**
1. **The tsunami that hit the coast of Sumatra, Indonesia, in 2004 is the most devastating and deadliest tsunami ever recorded; it killed 227,899 people.**
2. **The five most dangerous tsunamis in Indonesia killed a total of 267,856 people; they occurred in 2004, 1883, 1992, 1994, and 2006.**   

We can zoom in on the map to see more detail about each tsunami, such as the city and date.


# Off Topic
Browse the color palettes in the different color modules of the `Plotly` package.   
Pick your favourite for plotting figures~

In [35]:
# Choose 'carto' as an example
fig = px.colors.carto.swatches()
fig.show()