![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Callysto's Weekly Data Visualization 

## Costliest Natural Disasters

### Recommended Grade levels: 5-9

### Instructions

#### "Run" the cells to see the graphs

Click "Cell" and select "Run All".

This will import the data and run all the code, so you can see this week's data visualization. Scroll to the top after you’ve run the cells.

![instructions](https://github.com/callysto/data-viz-of-the-week/blob/main/images/instructions.png?raw=true)

**You don't need to do any coding to view the visualizations**.

The plots generated in this notebook are interactive. You can hover over and click on elements to see more information. 

Email contact@callysto.ca if you experience issues.

### About this Notebook

Callysto's Weekly Data Visualization is a learning resource that aims to develop data literacy skills. We provide Grades 5-12 teachers and students with a data visualization, like a graph, to interpret. This companion resource walks learners through how the data visualization is created and interpreted by a data scientist. 

The steps of the data analysis process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer?
2. Gather - Find the data source(s) you will need. 
3. Organize - Arrange the data, so that you can easily explore it. 
4. Explore - Examine the data to look for evidence to answer the question. This includes creating visualizations. 
5. Interpret - Describe what's happening in the data visualization. 
6. Communicate - Explain how the evidence answers the question. 

## Question

What were the most expensive natural disasters in Canada? 

### Goal

Our goal is to show which natural disasters led to the greatest financial costs and use visualizations to discover any patterns to their impact.

The dataset is taken from [Public Safety Canada](https://www.publicsafety.gc.ca/cnt/rsrcs/cndn-dsstr-dtbs/index-en.aspx), and contains information on Canadian natural disaster events from the years 1900 to 2019.

### Background

Weather events and natural disasters have the potential to cause huge amounts of damage to property. Have you ever wondered what the most expensive natural disasters and weather events are in Canada? In this notebook, we are going to explore the costliest natural disasters of the 2010s.


# Gather

### Code: 

The next step is to set up the notebook. To set up this notebook, run the code cells below to import the libraries we need for this project. In short, libraries are pre-made code that make it easier to analyze our data.

In [None]:
import pandas as pd
import plotly.express as px

Pandas is a library that helps us with data analysis, and Plotly Express is a library that helps us to make visualizations. Without importing these libraries we would have to use much more code to analyze our data and generate visualizations. We import the libraries with abbreviations, or aliases, so that we have less typing to do in each line of our code below. 

### Data
We are using data from [Public Safety Canada](https://www.publicsafety.gc.ca/cnt/rsrcs/cndn-dsstr-dtbs/index-en.aspx) on natural disasters. Run the code below to populate the data into a dataframe.

#### Import the Data

In [None]:
data = pd.read_csv('data/CDD.txt', sep='\t')
data

### Comment on the data

The dataframe above is a data structure that allows Python to display data in an easily readable format, similar to a spreadsheet.

According to the numbers beneath the dataframe, this dataset has 867 rows and 23 columns. Each row represents a disaster, and each column describes an aspect of that disaster. If we look at the numbers in the leftmost column of the dataframe, we can see that they end at 866, one less than the number of rows. Why is that?

Python (like many programming languages) actually starts counting at 0. There are some really interesting technical reasons why that's the case, but for now just keep that in mind as we move through this notebook.
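
Here is a small sketch of this zero-based counting. It uses a made-up three-row table rather than the real dataset, so the place names and values are invented for illustration only.

```python
import pandas as pd

# Made-up miniature table; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "EVENT TYPE": ["Flood", "Wildfire", "Tornado"],
        "PLACE": ["Town A", "Town B", "Town C"],
    }
)

n_rows, n_cols = toy.shape      # (number of rows, number of columns)
first_label = toy.index[0]      # row labels start at 0
last_label = toy.index[-1]      # and end at n_rows - 1

print(f"{n_rows} rows, {n_cols} columns")
print(f"row labels run from {first_label} to {last_label}")
```

With three rows, the labels run from 0 to 2, in the same way that a table with 867 rows is labelled from 0 to 866.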

Run the code cell below to generate a list of all the columns available in this dataset. To answer our question, a few columns that are going to be most useful in analyzing the data are `ESTIMATED TOTAL COST` and `NORMALIZED TOTAL COST`. These columns will help us to understand the total costs of specific disaster events. 

Inflation is when the cost of an item goes up over time. For example, because of inflation, the cost of a specific item like a pair of shoes will go up over time. Because costs go up over time, we need to standardize costs to a specific year. This is called cost normalization. This data has been normalized to 2016 because that is the last year that was available to normalize the data with.

In [None]:
for i in data.columns:
    print(i)

We can look at a specific column in the data by using the code below. The column `INJURED / INFECTED` shows the number of people injured or infected in each disaster. Run the code cell below to show the values in this column. 

In [None]:
data["INJURED / INFECTED"]

The code below keeps only the rows where the `INJURED / INFECTED` value is greater than zero. We focus on these rows because, for this visualization, it is easier to examine the events where at least one person was recorded as injured or infected.

Many values in the `INJURED / INFECTED` column are 0 or `NaN`, and the first line of code below removes all of those rows. The second line sorts the remaining rows from the lowest number in that column to the greatest and displays the last 10, which are the 10 largest values.

In [None]:
filtered_data = data[data['INJURED / INFECTED'] > 0]
filtered_data.sort_values(by='INJURED / INFECTED').tail(10)
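
As a side illustration, the sketch below compares two ways of filtering on a made-up column: keeping values greater than zero, and keeping values that are simply not missing. The numbers are invented, and this is only one way to look at the difference.

```python
import pandas as pd

# Made-up values: one missing entry, two zeros, two positive counts.
toy = pd.DataFrame({"INJURED / INFECTED": [float("nan"), 0, 12, 0, 3]})
col = toy["INJURED / INFECTED"]

positive_only = toy[col > 0]        # drops missing values AND zeros
recorded_only = toy[col.notna()]    # drops only missing values, keeps zeros

print(len(toy), len(positive_only), len(recorded_only))
```

The `> 0` filter is the stricter of the two. Events recorded with zero injuries remain available if you filter with `.notna()` instead.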

The code above displayed the filtered data as a dataframe again. The code below shows only the single column we were examining.

In [None]:
data[data['INJURED / INFECTED'] > 0]['INJURED / INFECTED']

The code above allowed us to look at the data in a specific column; each column in this dataset represents a specific piece of information about all of the disasters. We can use the code below to generate a graph of the number of injured/infected individuals for each disaster.

In [None]:
px.bar(data[data['INJURED / INFECTED'] > 0], x='EVENT START DATE', y='INJURED / INFECTED', title='NUMBER OF INJURED/INFECTED INDIVIDUALS FOR EACH EVENT')
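
Dates read from a text file are often stored as plain text, which can make a date axis harder to read. The sketch below shows how `pd.to_datetime` turns text into real dates. The table and the month/day/year layout are assumptions made for illustration; the real `EVENT START DATE` column may be formatted differently.

```python
import pandas as pd

# Made-up date strings; the real column's format may differ.
toy = pd.DataFrame(
    {
        "EVENT START DATE": ["06/19/2013", "05/01/2016", "07/08/2013"],
        "INJURED / INFECTED": [5, 2, 9],
    }
)

# Convert text to real dates, stating the assumed month/day/year layout explicitly.
toy["EVENT START DATE"] = pd.to_datetime(toy["EVENT START DATE"], format="%m/%d/%Y")

# Real dates sort chronologically and expose parts such as the year.
toy = toy.sort_values("EVENT START DATE")
years = toy["EVENT START DATE"].dt.year.tolist()
print(years)
```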

Because that is a very long list it may be more useful to find the minimum and maximum values.  We can find the maximum and minimum values in that column with the following blocks of code. 

In [None]:
max_value = data['INJURED / INFECTED'].max()
print("Maximum value in column 'INJURED / INFECTED': ", max_value)

In [None]:
min_value = data['INJURED / INFECTED'].min()
print("Minimum value in column 'INJURED / INFECTED': ", min_value)

The code above told us the range of values in the `INJURED / INFECTED` column is 0-945. It may have been obvious from looking at the column that the minimum value was `0`, but the Python code made it much easier to find the maximum value given the length of the column.
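
Knowing the maximum value is useful, but we may also want to know which event it belongs to. The sketch below uses `idxmax()` on a made-up table to look up the whole row that holds the largest value. The places and numbers are invented for illustration.

```python
import pandas as pd

# Made-up table with one missing value.
toy = pd.DataFrame(
    {
        "PLACE": ["Town A", "Town B", "Town C"],
        "INJURED / INFECTED": [4, 30, float("nan")],
    }
)

# idxmax() returns the row label of the largest value (missing values are skipped).
label_of_max = toy["INJURED / INFECTED"].idxmax()
worst_event = toy.loc[label_of_max]

print(worst_event["PLACE"], worst_event["INJURED / INFECTED"])
```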

Now that we have examined the data in a specific column, we are going to examine the data in a specific row. Run the code cell below to examine the row at position 50 (the 51st row, since counting starts at 0). Each row represents a specific disaster, so this row shows all of the data available in this dataset on one disaster event.

In [None]:
display(data.iloc[50])

We can see that this particular event in row 50 was a flood that occurred in Fort McMurray in 2016. By looking at the comments column, we can see the particular dates on which the flood affected the community. It is also evident from the data that the `ESTIMATED TOTAL COST` and `NORMALIZED TOTAL COST` of this event were considerable. It may be helpful to write out those costs in standard form to understand how high the costs of this event were.
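
Large numbers are sometimes displayed in scientific notation, which can be hard to read. As a small sketch, the helper below (a hypothetical function written for this example, not part of pandas) writes a dollar amount out in full with separators between the thousands. The amount used is made up.

```python
def format_dollars(amount):
    """Return a dollar amount with thousands separators and no cents."""
    return f"${amount:,.0f}"

# Made-up amount written in scientific notation.
example_cost = 1.5e9
print(format_dollars(example_cost))
```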

The `COMMENTS` field may be of particular interest; when we read that column we can get more specific information on some of the events. Under `EVENT TYPE`, we can see some categories that the disasters might belong to.

We can focus on these aspects of the data, and more, in the next steps. Because we are trying to answer a question about costs, the columns that will be most useful are the ones that contain cost information. The column `NORMALIZED TOTAL COST` takes inflation into account. Costs naturally rise over time, and that is what we call inflation. For instance, if a loaf of bread costs $5.00 today, the cost of that bread will eventually rise. The amount it rises depends on various factors in the national and global economies.
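
One common way to normalize a cost is to scale it by the ratio of a price index in the base year to the price index in the year of the event. The sketch below illustrates that idea only. The function name and index values are hypothetical, and the exact method used to produce `NORMALIZED TOTAL COST` is not described in this notebook.

```python
def normalize_cost(cost, index_then, index_base):
    """Convert a past cost into base-year dollars using a price-index ratio."""
    return cost * index_base / index_then

# Hypothetical index values: prices doubled between the event year and the base year.
cost_in_event_year = 5.00
print(normalize_cost(cost_in_event_year, index_then=50.0, index_base=100.0))
```

In this made-up case, a $5.00 cost from the earlier year counts as $10.00 in base-year dollars, which is why older events can look cheaper than they really were if inflation is ignored.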

The Canadian Disasters Database, where this data is taken from, defines a disaster with the following definition:

>A disaster is an interruption in time and space of normal processes causing death,
injury or homelessness, economic or property loss, and/or significant environmental
damage. The interruption is beyond the coping capacity of the community and/or is
beyond the assumed risk factors of human activity. Assumed risk is inherent in most
human activity such as transportation and handling of dangerous goods. The
interruption precludes war.


The Canadian Disaster Database provides the following definitions for some of its fields.

**EVACUATED**
If the exact or estimated number of people evacuated from the area during a disaster is known, it
is placed in this field. Otherwise, the field contains a zero.

**DOLDAM**
If the exact or estimated value of the damage in millions of dollars is known, it is placed in this
field. Otherwise, the field contains a zero. Note that the values shown are estimates given in the
dollar value at the time of the disaster, and inflation is not taken into account.

**INJURED**
If the exact or estimated number of injuries is known, it is placed in this field. Otherwise, the field
contains a zero.

**COMENG / COMFRA**
These two fields allow for comments in English and French, respectively. Where possible, they
include a brief outline of the disaster and a qualitative description of the resulting damage.

**PLACE NAME / LAT / LONG**
A location of the disaster is included to assign latitude and longitudinal coordinates to the
disaster. For some records, this indicates a central or approximate location, as the disasters may
cover large regions.

# Organize

An important part of the data science process is cleaning up and organizing your data so it can be useful for finding observations. Part of cleaning involves:
- identifying missing data
- removing missing data
- ensuring the data is all in the same format
- identifying and dealing with outliers. 

Many of our fields have data in them; in data science we call that non-null. If you were to forget to write your name on your paper, the name field would be null because there is no data in it. Once you write your name on your work, that value is non-null, because it has data in it. When we run the `data.info()` code below, look at which fields have a higher number of non-null cells, as that number varies quite a bit by column. A few of the fields have very few non-null cells, meaning most of the data is not available.


In pandas, missing data is usually shown as `NaN` ('Not a Number'), so we want to see how much of our dataframe contains missing data. We do this by asking for the fields where the data is 'non-null'. Non-null essentially means those fields have actual data in them, or that they are *not* missing values.
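
As a small sketch of how missing values can be counted, the example below builds a made-up table and tallies the missing and non-null values in each column. The column names and numbers are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Made-up table with some missing values.
toy = pd.DataFrame(
    {
        "EVENT TYPE": ["Flood", "Wildfire", "Tornado"],
        "FATALITIES": [0, float("nan"), 2],
        "EVACUATED": [float("nan"), float("nan"), 100],
    }
)

missing_per_column = toy.isna().sum()      # count of missing values in each column
non_null_per_column = toy.notna().sum()    # count of values that are present

print(missing_per_column)
print(non_null_per_column)
```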

Let's look at what the column names are and how many non-null values each contains. The `data.info()` function prints all of the column names, along with the number of non-null values inside each column. The number of non-null values is shown in the `Non-Null Count` column.

In [None]:
data.info()


As the financial cost is the main question we're trying to answer, we are using the `NORMALIZED TOTAL COST` for our analysis and visualization. The `NORMALIZED TOTAL COST` differs from the `ESTIMATED TOTAL COST` by taking inflation into account. The data spans the years 1900 to 2019, and the real value of money has steadily decreased over that time, so we need to account for that. The year 2016 is the last year the data can be normalized to. Because this is the column we are most curious about, we want to omit any rows that don't include an amount for that column:

In [None]:
data = data[data['NORMALIZED TOTAL COST'].notna()]
data.info()

There are now fewer rows because we have removed all of the rows that do not have data for our column of interest. A missing value in that column means the dataset does not record a normalized total cost for that event.
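
The sketch below shows the same kind of row removal on a made-up table, and compares the `.notna()` filter with the pandas method `dropna(subset=...)`, which gives the same result. The places and costs are invented for illustration.

```python
import pandas as pd

# Made-up table where one event has no recorded cost.
toy = pd.DataFrame(
    {
        "PLACE": ["Town A", "Town B", "Town C"],
        "NORMALIZED TOTAL COST": [1_000_000.0, float("nan"), 250_000.0],
    }
)

by_mask = toy[toy["NORMALIZED TOTAL COST"].notna()]        # filter with a True/False mask
by_dropna = toy.dropna(subset=["NORMALIZED TOTAL COST"])   # drop rows missing this column

print(len(toy), "rows before,", len(by_mask), "rows after")
print(by_mask.equals(by_dropna))
```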

We're also interested in the types of events that are included in the data. We can look at the `EVENT SUBGROUP` to see what types of events exist, and the code below extracts the unique values in that column: 

In [None]:
list(data['EVENT SUBGROUP'].unique())

It makes sense that events would be 'Meteorological - Hydrological', but what does '25' mean? Let's check out rows that have that value for `EVENT SUBGROUP`:

In [None]:
data[data['EVENT SUBGROUP']!='Meteorological - Hydrological']

Now we can get rid of any rows where the `EVENT SUBGROUP` is not 'Meteorological - Hydrological'. This process removes the one outlier in our data that does not fit this category and makes our data cleaner to work with.

In [None]:
data = data[data['EVENT SUBGROUP']=='Meteorological - Hydrological']
data

You can now see our data frame only includes events where the `EVENT SUBGROUP` is equal to 'Meteorological - Hydrological'. Look under `EVENT TYPE` to see more information about what each of these events were. 
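
One way to summarize a column of categories is `value_counts()`, which tallies how often each value appears. The sketch below uses a made-up list of event types, so the counts are for illustration only. A tally like this can also help spot unexpected values that appear only once.

```python
import pandas as pd

# Made-up event types.
toy = pd.DataFrame({"EVENT TYPE": ["Flood", "Wildfire", "Flood", "Tornado", "Flood"]})

# value_counts() tallies how often each category appears, most frequent first.
events_per_type = toy["EVENT TYPE"].value_counts()
print(events_per_type)
```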
 

# Explore

The next part of the data science process is generating a visualization to help us answer our question. This part of the data science process is really exciting! A visualization is often a graph, but it can be any way that we can visually show our data. In our case, we are going to use a scatter plot. A visualization helps to understand what kind of story our data is telling us. 

Run the code below to generate a scatter plot from the data that will help us to answer our question. Each point represents a specific event.

The size of each point represents the normalized total cost, and the color represents the type of event. The insurance payments paid out for each event are shown on the y axis.

In [None]:
fig = px.scatter(data, x="NORMALIZED TOTAL COST", y="INSURANCE PAYMENTS",
                 title='Total Cost of Disasters Compared to Insurance Payments',
                 hover_data=["PLACE", "COMMENTS"],  # a list keeps the hover fields in a fixed order
                 size='NORMALIZED TOTAL COST',
                 color='EVENT TYPE',
                 height=600)

fig.show()

# Interpret

In the scatter plot, the total cost of each disaster event is on the x axis and the insurance payments for that event are on the y axis. The color of the dots represents the type of disaster event; for example, tornadoes are a different color than winter storms. Each event is represented separately in this visualization: if there are several floods, each one has its own dot, but they are all the same color because they belong to the same category. You can click on an event type in the legend to the right to hide or show that category and change which events appear in the visualization. Larger dots show events that had a larger normalized total cost. The normalized total cost on the x axis is shown in millions of dollars, while the insurance payments on the y axis are shown in billions of dollars.

If you hover your cursor over a particular dot, you will see more details about the event represented by that dot.

Most of the events fall in the same range of cost: at or under half a billion dollars in insurance payments, with a normalized total cost under fifty million dollars. Which events are outside of that range? You can find more information about those events by hovering over their points on the scatter plot.
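
When a few events cost far more than the rest, the lower-cost events get squeezed together on a regular axis. A logarithmic scale spreads them out, because each step on a log scale is a multiplication rather than an addition. The sketch below uses made-up costs to show the idea with plain Python. Plotly Express functions such as `px.scatter` also accept `log_x=True` and `log_y=True`, although values of zero cannot be shown on a log axis.

```python
import math

# Made-up costs in dollars, each ten times larger than the one before.
costs = [2_000_000, 20_000_000, 200_000_000, 2_000_000_000]

# Gaps between neighbouring costs on a regular (linear) scale.
linear_gaps = [b - a for a, b in zip(costs, costs[1:])]

# Gaps between the same costs after taking the base-10 logarithm.
log_values = [math.log10(c) for c in costs]
log_gaps = [b - a for a, b in zip(log_values, log_values[1:])]

print(linear_gaps)                        # gaps grow tenfold each step
print([round(g, 6) for g in log_gaps])    # gaps are all the same size
```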

The plot shows one event whose costs are within the range of the other disasters but slightly elevated. This dot is green, indicating it is a flood, and hovering over it shows that it took place in Toronto. A blue dot shows a storm on Labour Day 1991 where the `INSURANCE PAYMENTS` were within the range of the majority of the events, but the `NORMALIZED TOTAL COST` was between fifty million and 100 million dollars, which is slightly elevated.

There are three events in particular that have a much higher cost than the rest of the events. What can you find out from the scatter plot about those particular events? These events were all covered fairly extensively in the news. As an extension to this notebook, you could do some more research on these events.

The code below will create a bar graph that lists each event type and totals the `NORMALIZED TOTAL COST` for each event type.

In [None]:
fig = px.bar(data, x='EVENT TYPE', y='NORMALIZED TOTAL COST')
fig.show()

This bar graph shows which types of disasters cost the most, measured by `NORMALIZED TOTAL COST`. Run the code cell below to generate a bar graph that compares event types to the total insurance payments.

In [None]:
fig = px.bar(data, x='EVENT TYPE', y='INSURANCE PAYMENTS')
fig.show()

From the second bar graph we can see which `EVENT TYPE` category had the highest insurance costs. Do you notice any similarities or differences between the two bar graphs? 
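
A total can be large either because each event was costly or because there were many events of that type. The sketch below, on made-up numbers, uses `groupby` to show the total, the number of events, and the average cost per event side by side. It is only one possible way to compare event types.

```python
import pandas as pd

# Made-up costs: three small floods and one large wildfire.
toy = pd.DataFrame(
    {
        "EVENT TYPE": ["Flood", "Flood", "Flood", "Wildfire"],
        "NORMALIZED TOTAL COST": [10.0, 20.0, 30.0, 90.0],
    }
)

# For each event type: total cost, number of events, and average cost per event.
summary = toy.groupby("EVENT TYPE")["NORMALIZED TOTAL COST"].agg(["sum", "count", "mean"])
print(summary)
```

In this invented example the two totals are fairly close, but the average cost per event is very different, which shows how the number of events can change the story a bar graph tells.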

# Reflect on what you see

After making your visualization the next step is to use the data and your visualization to answer the question. Look at and interact with the visualization above. When you hover your mouse over the plots, you’ll notice more information appears. You can also use the legend to make plots appear and disappear.

#### Think about the following questions.

* What do you notice about these graphs?
* What do you wonder about the data?
* What kind of inferences can you make based on this data?
* Is there another way to visualize this data that would change your interpretation of the information?


#### Use the fill-in-the-blank prompts to summarize your thoughts.
* "I used to think _______"
* "Now I think _______"
* "I wish I knew more about _______"
* "These data visualizations remind me of _______"
* "I really like _______"

# Communicate

If you have not yet done so, use the plots to answer our question about which natural disasters were the most expensive.
Once we understand the costs of natural disasters, how can we use that information?

How can you communicate that information? What kind of product could you create to share that information with your school community and wider community?

Consider tagging Callysto on [Twitter](https://twitter.com/callysto_canada), [YouTube](https://www.youtube.com/Callysto), [TikTok](https://www.tiktok.com/@callysto_canada), [Facebook](https://www.facebook.com/callystocanada/), or [LinkedIn](https://www.linkedin.com/company/callysto-canada/) if you decide to share your reflections or projects on social media.

# Further Resources

For more information on the costliest weather events between 2012 and 2016, check out this article from the [Weather Network](https://www.theweathernetwork.com/ca/news/article/the-top-five-costliest-canadian-natural-disasters-of-the-2010s).

You may find the following video about the 2013 Calgary floods interesting. Run the cell below to display the YouTube video.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('jgw06p4jeh8')

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)