![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fdata-viz-of-the-week&branch=main&subPath=costliest-disasters/costliest-natural-disasters.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"></a>

# Callysto's Weekly Data Visualization 

## Costliest Natural Disasters

### Recommended Grade levels: 5-9

### Instructions

Click "Cell" and select "Run All".

This will import the data and run all the code, so you can see this week's data visualization. Scroll back to the top after you’ve run the cells.

![instructions](https://github.com/callysto/data-viz-of-the-week/blob/main/images/instructions.png?raw=true)

**You don't need to do any coding to view the visualizations**.

The plots generated in this notebook are interactive. You can hover over and click on elements to see more information. 

Email contact@callysto.ca if you experience issues.

### About this Notebook

Callysto's Weekly Data Visualization is a learning resource that aims to develop data literacy skills. We provide Grades 5-12 teachers and students with a data visualization, like a graph, to interpret. This companion resource walks learners through how the data visualization is created and interpreted by a data scientist. 

The steps of the data analysis process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer?
2. Gather - Find the data source(s) you will need. 
3. Organize - Arrange the data, so that you can easily explore it. 
4. Explore - Examine the data to look for evidence to answer the question. This includes creating visualizations. 
5. Interpret - Describe what's happening in the data visualization. 
6. Communicate - Explain how the evidence answers the question. 

## Question

What were the most expensive natural disasters in Canada? 

### Goal

Our goal is to show which natural disasters led to the greatest financial costs and use visualizations to discover any patterns to their impact.

The dataset is taken from [Public Safety Canada](https://www.publicsafety.gc.ca/cnt/rsrcs/cndn-dsstr-dtbs/index-en.aspx), and contains information on Canadian natural disaster events from the years 1900 to 2019.

### Background

Weather events and natural disasters have the potential to cause huge amounts of damage to property. Have you ever wondered what the most expensive natural disasters and weather events are in Canada? We are going to explore the costliest natural disasters in the 2010 decade in this notebook. 


## Gather

### Code: 

Run the code cells below to import the libraries we need for this project. Libraries are collections of pre-made code that make it easier to analyze our data.

In [None]:
%pip install -r requirements.txt
import pyodide_http
pyodide_http.patch_all()
import pandas as pd
import plotly.express as px
print('Libraries imported')

[pandas](https://pandas.pydata.org/) is a library that helps us with data analysis, and [Plotly Express](https://plotly.com/python/plotly-express/) is a library that helps us to make visualizations. Without importing these libraries we would have to use much more code to analyze our data and generate visualizations. We import the libraries with abbreviations, or aliases, so that we have less typing to do in each line of our code below.

### Data
We are using data from [Public Safety Canada](https://www.publicsafety.gc.ca/cnt/rsrcs/cndn-dsstr-dtbs/index-en.aspx) on natural disasters. Run the code below to populate the data into a dataframe.

#### Import the Data

In [None]:
data = pd.read_csv('data/CDD.txt', sep='\t')
data

### Comment on the data

The dataframe above is a data structure that allows Python to display data in an easily readable format, similar to a spreadsheet. 

As we can see from the numbers below the dataframe itself, this dataset has 867 rows and 23 columns. Each row represents a disaster, and each column describes an aspect of that disaster.

Next up, let's take a look at the names of the columns in our dataset. Run the code cell below to generate a list of all the columns.

In [None]:
for i in data.columns:
    print(i)

Let's have a look at the `EVENT TYPE` column.

In [None]:
data['EVENT TYPE'].unique()

We can filter the dataset so it only contains event types that we recognize.

In [None]:
event_types = ['Flood','Winter Storm','Heat Event','Storm - Unspecified / Other',
               'Hurricane / Typhoon / Tropical Storm','Storm Surge','Drought','Avalanche','Cold Event','Geomagnetic Storm']
data = data[data['EVENT TYPE'].isin(event_types)]
data['EVENT TYPE'].unique()
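
As a minimal sketch of what this kind of filter does, the example below uses a small made-up dataframe (the name `sample` and its rows are assumptions, not values from the real dataset). It keeps only the listed event types and then counts how many events of each type remain.

```python
import pandas as pd

# Hypothetical sample rows; the real notebook loads its data from data/CDD.txt.
sample = pd.DataFrame({
    'EVENT TYPE': ['Flood', 'Flood', 'Winter Storm', 'Tornado', 'Drought', 'Flood'],
})

# Keep only the event types we want to study, like the cell above does.
wanted_types = ['Flood', 'Winter Storm', 'Drought']
filtered = sample[sample['EVENT TYPE'].isin(wanted_types)]

# Count how many events of each type remain after filtering.
type_counts = filtered['EVENT TYPE'].value_counts()
print(type_counts)
```

Running something similar on the real `data` dataframe would show which event types appear most often.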

To answer our question, a few columns that are going to be most useful in analyzing the data are `ESTIMATED TOTAL COST` and `NORMALIZED TOTAL COST`. These columns will help us to understand the total costs of specific disaster events. 

The difference between `ESTIMATED TOTAL COST` and `NORMALIZED TOTAL COST` comes down to **inflation**. Inflation is when the cost associated with an item goes up over time; for example, because of inflation, the cost of a specific product like shoes will increase over time. Because of this increase, we need to standardize costs to a specific year. This is called **normalization**, and this data has been normalized to 2016 as that is the last year that had data to which we could normalize.
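
To make normalization more concrete, here is a minimal sketch of one common approach: multiply a cost by the ratio of a price index in the base year to the price index in the year of the event. The index values and the `normalize_cost` function are hypothetical, made up for this illustration. We are not claiming this is exactly how the Canadian Disaster Database calculated its normalized costs.

```python
# Hypothetical price index values (NOT real statistics), for illustration only.
price_index = {1996: 80.0, 2016: 120.0}

def normalize_cost(cost, event_year, base_year=2016):
    """Convert a cost from event_year dollars into base_year dollars."""
    return cost * price_index[base_year] / price_index[event_year]

# A disaster that cost $10,000,000 in 1996, expressed in 2016 dollars.
normalized = normalize_cost(10_000_000, event_year=1996)
print(normalized)
```

With these made-up index values, prices rose by half between 1996 and 2016, so the normalized cost is one and a half times the original cost.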

This data set has a lot of different values in it and we could create a number of different visualizations with the data. To illustrate this we have created the visualization below. The code below will create a bar graph that lists each event type and totals the `NORMALIZED TOTAL COST` for each event.

We don't need to drop rows that have `nan` or a number as the `EVENT TYPE` here, because the filter above already removed them.

In [None]:
px.bar(data, x='EVENT TYPE', y='NORMALIZED TOTAL COST', title='Total Costs of Natural Disasters', height=800)

This bar graph shows which types of disasters cost the most according to `NORMALIZED TOTAL COST`. Run the code cell below to generate a bar graph that compares event types to the total insurance costs. 

In [None]:
px.bar(data[data['EVENT TYPE'].isin(event_types)], x='EVENT TYPE', y='INSURANCE PAYMENTS', title='Insurance Payments for Natural Disasters', height=800)

From the second bar graph we can see which `EVENT TYPE` category had the highest insurance costs. Do you notice any similarities or differences between the two bar graphs? 
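
If you would like to compare the totals as numbers instead of bar heights, one possible approach is to group the rows by event type and add up each cost column. The sketch below uses a small made-up dataframe (the name `sample` and all of its values are assumptions), but the same pattern should work on the real `data` dataframe.

```python
import pandas as pd

# Hypothetical sample rows; None stands for a missing value.
sample = pd.DataFrame({
    'EVENT TYPE': ['Flood', 'Flood', 'Winter Storm', 'Drought'],
    'NORMALIZED TOTAL COST': [100.0, 250.0, 75.0, None],
    'INSURANCE PAYMENTS': [40.0, None, 30.0, 10.0],
})

# Add up both cost columns for each event type (missing values are skipped),
# then sort so the costliest event type comes first.
totals = (
    sample.groupby('EVENT TYPE')[['NORMALIZED TOTAL COST', 'INSURANCE PAYMENTS']]
    .sum()
    .sort_values('NORMALIZED TOTAL COST', ascending=False)
)
print(totals)
```

Notice that a missing value counts as zero in the sum, which is one reason to check for missing data before drawing conclusions.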

## Organize

An important part of the data science process is cleaning up and organizing your data so it can be useful for finding observations. Part of cleaning involves 
- identifying missing data
- removing missing data
- ensuring the data is all in the same format
- identifying and dealing with outliers. 

Many of our fields have data in them; in data science we call that non-null. If you were to forget to write your name on your paper, the name field would be null because there is no data in it. Once you write your name on your work, that value becomes non-null, because it now has data in it. When we run the code below, look at which fields have a higher number of non-null cells, as that number varies quite a bit by column. A few of the fields have very few non-null cells, meaning most of the data is not available. 

In pandas, missing data is identified as `NaN` ('Not a Number'), so we want to see how much of our dataframe contains missing data. We do this by asking for the fields where the data is 'non-null'. Non-null essentially means those fields have actual data in them, or that they are *not* missing values.
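
Here is a minimal sketch of counting non-null and missing values, using a tiny made-up dataframe (the name `sample` and its values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; np.nan marks a missing value.
sample = pd.DataFrame({
    'EVENT TYPE': ['Flood', 'Winter Storm', 'Drought'],
    'EVACUATED': [200, np.nan, np.nan],
    'NORMALIZED TOTAL COST': [100.0, 75.0, np.nan],
})

# count() gives the number of non-null values in every column.
non_null_counts = sample.count()

# isna() marks each missing value as True; sum() then counts them per column.
missing_counts = sample.isna().sum()

print(non_null_counts)
print(missing_counts)
```

For every column, the non-null count and the missing count add up to the number of rows.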

Let's look at the column names and how many non-null values each one contains. The code below returns the names of the numeric columns, along with the number of non-null values inside each of them.

In [None]:
data.describe().loc['count']

The code cell above only described the columns that contain numbers. There are a couple of columns that should be dates, so we can check what data type they are.

In [None]:
print(type(data['EVENT START DATE'].iloc[0]))
print(type(data['EVENT END DATE'].iloc[3]))

Let's convert the `DATE` columns to date values.

In [None]:
data.loc[:, 'EVENT START DATE'] = pd.to_datetime(data['EVENT START DATE'], format='%m/%d/%Y %I:%M:%S %p')
data.loc[:, 'EVENT END DATE'] = pd.to_datetime(data['EVENT END DATE'], format='%m/%d/%Y %I:%M:%S %p')

print(type(data['EVENT START DATE'].iloc[0]))
print(type(data['EVENT END DATE'].iloc[3]))
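
The `format` text tells pandas how to read each date: `%m` is the month, `%d` the day, `%Y` the four-digit year, `%I:%M:%S` the hour, minute, and second on a 12-hour clock, and `%p` is AM or PM. The sketch below applies the same format to two made-up date strings (they are assumptions, not rows from the dataset).

```python
import pandas as pd

# Hypothetical date strings written in the same style as the format above.
date_text = pd.Series(['06/19/2013 12:00:00 AM', '12/21/2013 12:00:00 AM'])

# Convert the text into real datetime values.
dates = pd.to_datetime(date_text, format='%m/%d/%Y %I:%M:%S %p')

# Once converted, we can pull out parts of the date, like the year or month.
print(dates.dt.year.tolist())
print(dates.dt.month.tolist())
```

Real datetime values can be sorted, subtracted, and placed on a time axis, which plain text cannot do reliably.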

## Explore

The columns that are going to be useful to answer our question are the ones with information about costs. Because our question is about costs, we need to make sure we look at data that references cost. Costs rise over time, so the column `NORMALIZED TOTAL COST` takes into account [inflation](https://en.wikipedia.org/wiki/Inflation).

The Canadian Disasters Database, where this data is taken from, defines a disaster as:

>A disaster is an interruption in time and space of normal processes causing death, injury or homelessness, economic or property loss, and/or significant environmental damage. The interruption is beyond the coping capacity of the community and/or is beyond the assumed risk factors of human activity. Assumed risk is inherent in most human activity such as transportation and handling of dangerous goods. The interruption precludes war.

They also provide definitions for some of their fields:

**EVACUATED**
If the exact or estimated number of people evacuated from the area during a disaster is known, it is placed in this field. Otherwise, the field contains a zero.

**DOLDAM**
If the exact or estimated value of the damage in millions of dollars is known, it is placed in this field. Otherwise, the field contains a zero. Note that the values shown are estimates given in the dollar value at the time of the disaster, and inflation is not taken into account.

**INJURED**
If the exact or estimated number of injuries is known, it is placed in this field. Otherwise, the field contains a zero.

**COMENG / COMFRA**
These two fields allow for comments in English and French, respectively. Where possible, they include a brief outline of the disaster and a qualitative description of the resulting damage.

**PLACE NAME / LAT / LONG**
A location of the disaster is included to assign latitude and longitudinal coordinates to the disaster. For some records, this indicates a central or approximate location, as the disasters may cover large regions.

Knowing this, let's create some visualizations with dates on the x-axis. For example, the column `INJURED / INFECTED` shows the number of people injured or infected in each disaster.

In [None]:
px.scatter(data, x='EVENT START DATE', y='INJURED / INFECTED', title='Number of Injured/Infected Individuals for Each Event', color='EVENT TYPE')

As the financial cost is the main question we're trying to answer, let's visualize the `NORMALIZED TOTAL COST` column.

In [None]:
px.scatter(data, x='EVENT START DATE', y='NORMALIZED TOTAL COST', title='Normalized Total Costs of Natural Disasters', color='EVENT TYPE', height=800)

We can see that there are a few disasters that were very costly. We can look at just the ones that had a normalized total cost greater than $50,000,000 and print out the comments column to read more about them.

In [None]:
# Loop over the disasters with a normalized total cost above $50,000,000
for _, row in data[data['NORMALIZED TOTAL COST'] > 50000000].iterrows():
    print('$', row['NORMALIZED TOTAL COST'], row['EVENT TYPE'])
    print(row['COMMENTS'])
    print('------------------------')
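
The filter inside the square brackets is sometimes called a boolean mask: it keeps only the rows where the condition is true. Below is a minimal sketch on made-up rows (the name `sample` and its values are assumptions) that also sorts the results so the costliest event is listed first.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only.
sample = pd.DataFrame({
    'EVENT TYPE': ['Flood', 'Winter Storm', 'Drought', 'Flood'],
    'NORMALIZED TOTAL COST': [80_000_000, 20_000_000, 60_000_000, 5_000_000],
})

threshold = 50_000_000

# Keep only the rows whose cost is above the threshold.
costly = sample[sample['NORMALIZED TOTAL COST'] > threshold]

# Sort from the most expensive to the least expensive.
costly = costly.sort_values('NORMALIZED TOTAL COST', ascending=False)

for _, row in costly.iterrows():
    # int() makes sure the cost prints as a whole number with comma separators.
    print(f"${int(row['NORMALIZED TOTAL COST']):,} {row['EVENT TYPE']}")
```

Changing `threshold` is an easy way to explore how many disasters count as "very costly".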

In the next visualization we'll compare the total insurance payments to the normalized total cost for that particular event.

In [None]:
title = 'Insurance Payments Compared to Costs of Disasters'
px.scatter(data, x='NORMALIZED TOTAL COST', y='INSURANCE PAYMENTS', color='EVENT TYPE', hover_data=['PLACE','EVENT START DATE'], title=title)

## Interpret

In the scatter plot you can see the total cost of disaster events on the x-axis and the amount of insurance payments for those disaster events on the y-axis. The normalized total cost on the x-axis is shown in millions of dollars, while the insurance payments on the y-axis are shown in billions of dollars.

The color of the dots represents the different types of disaster events; for example, floods are a different color than winter storms. You can click on the labels in the legend to hide or show event categories.

Most of the events fall in the same range: at or under half a billion dollars in insurance payments, and under fifty million dollars in normalized total cost. Which events are outside of that range?
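
Besides hovering over the plot, one possible way to find events outside a range is to filter with two conditions at once. This is only a sketch on made-up rows: the place names and dollar values are assumptions, and the two limits are the ones mentioned above.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only.
sample = pd.DataFrame({
    'PLACE': ['Town A', 'Town B', 'Town C'],
    'NORMALIZED TOTAL COST': [10_000_000, 45_000_000, 90_000_000],
    'INSURANCE PAYMENTS': [100_000_000, 700_000_000, 1_500_000_000],
})

cost_limit = 50_000_000        # fifty million dollars
insurance_limit = 500_000_000  # half a billion dollars

# The | symbol means "or": keep a row if either value is above its limit.
outside = sample[
    (sample['NORMALIZED TOTAL COST'] > cost_limit)
    | (sample['INSURANCE PAYMENTS'] > insurance_limit)
]
print(outside['PLACE'].tolist())
```

Each condition needs its own parentheses so that pandas checks it before combining the results.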

## Reflect on what you see

After making your visualization, the next step is to use the data and your visualization to answer the question. Look at and interact with the visualization above. When you hover your mouse over the points, you’ll notice more information appears. You can also use the legend to make event types appear and disappear.

#### Think about the following questions.

* What do you notice about these graphs?
* What do you wonder about the data?
* What kind of inferences can you make based on this data?
* Is there another way to visualize this data that would change your interpretation of the information? 


#### Use the fill-in-the-blank prompts to summarize your thoughts.
* "I used to think _______"
* "Now I think _______"
* "I wish I knew more about _______"
* "These data visualizations remind me of _______"
* "I really like _______"

## Communicate

If you have not yet done so, use the plot to answer our question about which natural disaster was the most expensive. 
Once we understand the costs of natural disasters, how can we use that information?

How can you communicate that information? What kind of product could you create to share that information with your school community and wider community?

Consider tagging Callysto on [Twitter](https://twitter.com/callysto_canada), [YouTube](https://www.youtube.com/Callysto), [TikTok](https://www.tiktok.com/@callysto_canada), [Facebook](https://www.facebook.com/callystocanada/), or [LinkedIn](https://www.linkedin.com/company/callysto-canada/) if you decide to share your reflections or projects on social media.

## Further Resources

For more information on the costliest weather events between 2012 and 2016, check out this article from the [Weather Network](https://www.theweathernetwork.com/ca/news/article/the-top-five-costliest-canadian-natural-disasters-of-the-2010s).

You may find the following video about the 2013 Calgary floods interesting. Run the cell below to display the YouTube video. 

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('jgw06p4jeh8')

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)