In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hurricanes.ipynb")

# Are Female Hurricanes Deadlier than Male Hurricanes?

### Applying Data Visualizations and Numerical Summaries

In this lab assignment, we will work with data from a study published in the peer-reviewed journal PNAS.  Using descriptive statistical techniques we have learned in class, we will evaluate the claims made in [this article](https://www.pnas.org/doi/full/10.1073/pnas.1402786111) titled *Female hurricanes are deadlier than male hurricanes* by Jung, Shavitt et al.  First, let's start with a brief introduction to this notebook.

## Part 1. What is a Jupyter notebook?
This webpage is called a Jupyter notebook. A notebook is a place to write code and view the results of that code.  It is also a place to share and write text.
In a notebook, each rectangle containing text or code is called a *cell*.

**Text cells** (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.
After you edit a text cell, click the "run cell" button at the top that looks like ▶| or hold down `shift` + `return` to confirm any changes. 

**Code cells** contain code in the Python 3 language. Running a code cell will execute all of the code it contains.
To run the code in a code cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, either press ▶| or hold down `shift` + `return`.

**Activity 1:** This is a text cell. It is the cell type where we can type text that isn't code. Go ahead and double click in this cell and you will see that you can edit it. 

**Type something here:** ....

**Activity 2:** Click on the code cell below and run the code:

In [None]:
#This cell is a code cell. It is where we can type code that can be executed.
# The hash symbol (#) at the start of this line makes this text a comment, not code.

print("Hello, World! \N{EARTH GLOBE ASIA-AUSTRALIA}!")

And this one:

In [None]:
# This code cell imports some Python libraries that we will be using throughout this notebook
# Don't worry about what they are, just run this cell before running any other cells below this one

from datascience import *
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import otter
grader = otter.Notebook("hurricanes.ipynb")

%matplotlib inline
plt.style.use('fivethirtyeight')

# Part 2: Summary Statistics
We will start by loading the data and taking a look at the data matrix provided as documentation in the study.  In the *Materials and Methods* section of [the paper](https://www.pnas.org/doi/full/10.1073/pnas.1402786111), the researchers disclose that they "removed two hurricanes, Katrina in 2005 (1833 deaths) and Audrey in 1957 (416 deaths)" from this dataset. The first line of code sets the name `hurricanes_data` to the table that contains the data.  The second line of code displays the first 10 observations of the dataset.

In [None]:
hurricanes_data = Table.read_table('hurricanes1950_2012.csv')
hurricanes_data

**Question 2.1** What is the total number of observations in this dataset?  How many variables are there?
- Set `num_observations` to the total number of observations 
- Set `num_variables` to the total number of variables.

In [None]:
#replace the ... with the correct answer
num_observations = ...
num_variables = ...

For this lab, we will focus on two of the variables: `Gender_MF` and `alldeaths`.  Run the next code cell without making any changes so we can:
1. Make two tables from that table, one with observations where the variable `Gender_MF` is *female* and one where the variable `Gender_MF` is *male*.
2. Produce summary statistics for the variable `alldeaths` for both tables.


In [None]:
# Just run this cell
# Step 1: make 2 tables
female = hurricanes_data.where('Gender_MF', 'Female')
male = hurricanes_data.where('Gender_MF', 'Male')
# Step 2: summary statistics - start by making arrays
deaths_f = female.column('alldeaths')
deaths_m = male.column('alldeaths')
# sample sizes
n_f = len(deaths_f)
n_m = len(deaths_m)
# means
mean_f = np.mean(deaths_f)
mean_m = np.mean(deaths_m)
# mins
min_f = np.min(deaths_f)
min_m = np.min(deaths_m)
# first quartiles
q1_f = percentile(25, deaths_f)
q1_m = percentile(25, deaths_m)
# second quartiles
q2_f = percentile(50, deaths_f)
q2_m = percentile(50, deaths_m)
# third quartiles
q3_f = percentile(75, deaths_f)
q3_m = percentile(75, deaths_m)
# max
max_f = np.max(deaths_f)
max_m = np.max(deaths_m)
# standard deviations
sd_f = np.std(deaths_f)
sd_m = np.std(deaths_m)
# display summary stats
print('Female-named hurricanes')
print('n = ', n_f, ' mean = ', mean_f, ' min = ', min_f, 
      ' Q1 = ', q1_f, ' Q2 = ', q2_f, ' Q3 = ', q3_f, 
      ' max = ', max_f, ' std dev = ', sd_f) 
print('Male-named hurricanes')
print('n = ', n_m, ' mean = ', mean_m, ' min = ', min_m, 
      ' Q1 = ', q1_m, ' Q2 = ', q2_m, ' Q3 = ', q3_m, 
      ' max = ', max_m, ' std dev = ', sd_m) 


### Measures of Center
We use measures of center to determine what is a **"typical"** observation for a particular variable.  Measures of center also give us a starting point for comparing two or more groups. The choice of what measure of center is most appropriate depends on the data.  
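
To see why the choice matters, here is a minimal sketch using made-up numbers (not the hurricane data). It uses Python's built-in `statistics` module rather than the libraries imported above, and the names `typical` and `with_extreme` are hypothetical. The two lists differ only in their largest value.

```python
from statistics import mean, median

# Made-up counts for illustration only (not from the hurricane data).
typical = [1, 2, 3, 4, 5]
with_extreme = [1, 2, 3, 4, 50]  # same list, but the largest value is much bigger

# Compare how each measure of center responds to the one extreme value.
print("typical:      mean =", mean(typical), " median =", median(typical))
print("with_extreme: mean =", mean(with_extreme), " median =", median(with_extreme))
```

Compare how far each measure moves between the two lists.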

**Question 2.2** For each group, we want to compare the measures of center presented in the summary statistics.  We need to determine whether:
1. The mean is greater than the median.
2. The mean is less than the median.
3. The mean and the median are equal or close in value. 
- Set `center_female` and `center_male` to either 1, 2 or 3 based on what you observe in the summary statistics.

In [None]:
center_female = ...
center_male = ...

**Question 2.3** What can your answers to the last question tell us about the likely shape of the distribution?  Based on the values for the mean and median for both the female-named and male-named hurricanes, the likely shape of both distributions is:
1. Symmetric
2. Right-skewed
3. Left-skewed
- Set `shape` to either 1, 2 or 3 based on your choice from the list above.

In [None]:
shape = ...

**Question 2.4** Compare the median numbers of deaths for each group and the mean numbers of deaths for each group.  
1. What are the medians equal to for each group? If we use the medians, does one group typically have more deaths? If so, which one?  
2. What are the means equal to for each group? If we use the means, does one group typically have more deaths? If so, which one? 
 
**Double click on the cell below to edit the cell and answer this question. Make sure to answer all parts to this question.**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Measures of Variability
In addition to learning what is **"typical"** for a variable, we also need to know what variability or spread to expect for that variable.  We use measures of variability for this.  The standard deviation is listed in the summary statistics above and we can compute the other three measures of variability.  Run the code cell below to see those summary statistics again.  You will use those to complete question 2.5.

In [None]:
# Just run this cell
print('Female-named hurricanes')
print('n = ', n_f, ' mean = ', mean_f, ' min = ', min_f, ' Q1 = ', q1_f, ' Q2 = ', q2_f, ' Q3 = ', q3_f, ' max = ', max_f, ' std dev = ', sd_f) 
print('Male-named hurricanes')
print('n = ', n_m, ' mean = ', mean_m, ' min = ', min_m, ' Q1 = ', q1_m, ' Q2 = ', q2_m, ' Q3 = ', q3_m, ' max = ', max_m, ' std dev = ', sd_m) 
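
Each of the other measures of variability can be computed from numbers that already appear in a summary like the one above. The sketch below uses made-up summary values (not the hurricane data), and every name starting with `toy_` is hypothetical.

```python
# Made-up summary statistics for illustration only (not from the hurricane data).
toy_min, toy_q1, toy_q3, toy_max = 2, 5, 12, 40
toy_sd = 9.5

toy_iqr = toy_q3 - toy_q1          # spread of the middle 50% of the data
toy_range = toy_max - toy_min      # distance between the two extremes
toy_variance = round(toy_sd ** 2)  # variance is the standard deviation squared

print(f'IQR = {toy_iqr}, range = {toy_range}, variance = {toy_variance}')
```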

**Question 2.5** Compute the IQR, range and variance for the variable `alldeaths` for both the female-named and male-named hurricanes and set the corresponding names in the code cell to the correct values. Names ending in "_f" correspond to the statistics for the female-named hurricanes and names ending in "_m" correspond to the statistics for the male-named hurricanes.  You can add a code cell after this markdown cell to have a place to do calculations if you would like.

In [None]:
# Compute the IQRs. 
iqr_f = ...
iqr_m = ...
# Compute the ranges
range_f = ...
range_m = ...
# Compute the variances and round to the nearest whole number.
variance_f = ...
variance_m = ...
# Don't change the code below this comment.
print(f'Measures of variability for female-named hurricanes: IQR = {iqr_f}, range = {range_f}, variance = {variance_f}')
print(f'Measures of variability for male-named hurricanes: IQR = {iqr_m}, range = {range_m}, variance = {variance_m}')

### Finding Outliers
What do you notice about the different measures of variability and how they compare between the two groups?  How do we decide which one best represents the variability for this dataset?  In addition to considering the shape of the distribution of the variable `alldeaths`, we should also determine whether there are outliers.   We do this by calculating the lower and upper fences, which are also known as the minimum and maximum whisker reaches, respectively.  When we build boxplots with numerical data, this is how we determine how far the whiskers could reach before marking observations as outliers.  In the next problem, you will calculate the lower and upper fences for the variable `alldeaths` for the female-named and male-named hurricanes.

Lower Fence:
$$ Q1 - 1.5 \times IQR $$
Upper Fence:
$$ Q3 + 1.5 \times IQR $$
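
As a minimal sketch of how these two formulas are applied, the code below uses a made-up list of values (not the hurricane data). The quartiles are assumed values chosen by hand for this toy list, and every name starting with `toy_` is hypothetical.

```python
# Made-up values for illustration only (not from the hurricane data).
toy_deaths = [0, 1, 2, 3, 5, 8, 13, 60]
toy_q1, toy_q3 = 1, 8  # assumed quartiles for this toy list

toy_iqr = toy_q3 - toy_q1
toy_lower_fence = toy_q1 - 1.5 * toy_iqr
toy_upper_fence = toy_q3 + 1.5 * toy_iqr

# An observation is flagged as an outlier if it falls outside either fence.
toy_outliers = [x for x in toy_deaths if x < toy_lower_fence or x > toy_upper_fence]

print(f'lower fence = {toy_lower_fence}, upper fence = {toy_upper_fence}')
print('outliers:', toy_outliers)
```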

**Question 2.6** Compute the lower and upper fences for both the female-named and male-named hurricanes.  

In [None]:
# Lower Fences
lower_fence_f = ...
lower_fence_m = ...
# Upper Fences
upper_fence_f = ...
upper_fence_m = ...
# Don't change the code below this comment.
print(f'Female-named hurricanes: lower fence = {lower_fence_f}, upper fence = {upper_fence_f}')
print(f'Male-named hurricanes: lower fence = {lower_fence_m}, upper fence = {upper_fence_m}')

**A Negative Number of Deaths?!** 
Pause here for a second and think about what it means for the lower fence to be a negative number.  Does it make sense in this context for the lower fence to be negative?  Can we have a negative number of deaths?   This can happen when a variable has a natural boundary that observations cannot go beyond.  When a fence falls past that boundary, it is often further evidence that there is skew in the data.

**Question 2.7** What is the smallest number of deaths that could happen during a hurricane event?  In other words, what is the natural boundary for this variable? Set `natural_min` to this value.  *Hint: what is the min for each group?*

In [None]:
natural_min = ...

**How many outliers are there?** Since the lower fence is less than the min for both groups, we know there are not any outliers on the lower end of the data.  What about on the upper end? The code in the next two cells will help us determine how many outliers there are for both groups. 

In [None]:
# Just run this cell
female.where('alldeaths', are.above(upper_fence_f))

In [None]:
# Just run this cell
male.where('alldeaths', are.above(upper_fence_m))

**Question 2.8**  How many outliers are there for the female-named hurricanes?  How many outliers for the male-named hurricanes?  Set `outliers_f` and `outliers_m` equal to the correct values.

In [None]:
outliers_f = ...
outliers_m = ...

### Effect of Outliers on Sample Statistics
Outliers are observations that are extreme with respect to the rest of the data. They also have a bigger impact on some summary statistics than others.  Below, we will demonstrate the impact of those outliers on this dataset, but in general, just removing outliers without investigating them further is not good practice.  In fact, studying outliers helps us learn more about our data, like whether there has been an issue with data collection or if there is something we didn't fully understand about our data before collecting it.  


In [None]:
# Just run this cell
# Step 1: new tables without outliers
female_no_outliers = female.where('alldeaths', are.below_or_equal_to(upper_fence_f))
male_no_outliers = male.where('alldeaths', are.below_or_equal_to(upper_fence_m))
# Step 2: summary statistics - start by making arrays
deaths_f2 = female_no_outliers.column('alldeaths')
deaths_m2 = male_no_outliers.column('alldeaths')
# means
mean_f2 = np.mean(deaths_f2)
mean_m2 = np.mean(deaths_m2)
# first quartiles
q1_f2 = percentile(25, deaths_f2)
q1_m2 = percentile(25, deaths_m2)
# second quartiles
q2_f2 = percentile(50, deaths_f2)
q2_m2 = percentile(50, deaths_m2)
# third quartiles
q3_f2 = percentile(75, deaths_f2)
q3_m2 = percentile(75, deaths_m2)
# standard deviations
sd_f2 = np.std(deaths_f2)
sd_m2 = np.std(deaths_m2)

In [None]:
# Just run this cell
print('Summary Statistics with Outliers')
print('Female-named hurricanes')
print( ' mean = ', mean_f, ' Q1 = ', q1_f, ' Q2 = ', q2_f, ' Q3 = ', q3_f,  ' std dev = ', sd_f) 
print('Male-named hurricanes')
print( ' mean = ', mean_m,  ' Q1 = ', q1_m, ' Q2 = ', q2_m, ' Q3 = ', q3_m, ' std dev = ', sd_m) 
print('                  ')
print('Summary Statistics without Outliers')
print('Female-named hurricanes')
print( ' mean = ', mean_f2, ' Q1 = ', q1_f2, ' Q2 = ', q2_f2, ' Q3 = ', q3_f2,  ' std dev = ', sd_f2) 
print('Male-named hurricanes')
print( ' mean = ', mean_m2,  ' Q1 = ', q1_m2, ' Q2 = ', q2_m2, ' Q3 = ', q3_m2, ' std dev = ', sd_m2) 

**Question 2.9** After the outliers were removed, which *two* statistics changed the least of all of the summary statistics?  Use the summary statistics above from before and after the outliers were removed and set `robust` to the correct choice from the list below.  
1. The mean and median changed the least.
2. The mean and standard deviation changed the least.
3. The standard deviation and the range changed the least.
4. The standard deviation and the IQR changed the least. 
5. The IQR and the median changed the least.

In [None]:
robust = ...

# Part 3: Visualizations 
Now we will create some data visualizations to learn more about this dataset.  Run the next cell to create a histogram of the variable `alldeaths`.

In [None]:
# Just run this cell
hurricanes_data.hist('alldeaths',  bins = np.arange(0, 261, 10), unit = "number of deaths")

Take a look at the histogram above. Does the shape of the distribution of the numerical variable `alldeaths` confirm your answer to question 2.3 earlier in the lab?  What do you notice about the horizontal axis?  What about the vertical axis?

The width of each bin is specified to be 10 in the code above.  The `hist` method we used to make the histogram above uses the convention that each bin includes the data at the left endpoint of the bin, but not the right endpoint.  This means that the first bin on the left-hand side includes observations where 0 deaths occurred, but does not include observations where 10 deaths occurred.

The vertical axis is a density scale and the height of each bin is the percent of observations that fall into that bin, relative to the bin width.  This means that the percent of all observations that are in each bin is found by multiplying the width of each bin by the height of each bin.  This also means that the total area of all the bars in this histogram is 100%.

$$ \text{area of bar} = \text{percent of observations in bin} = \text{height of bar}\times\text{width of bin} $$
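
The relationship between bar height, bin width and percent can be checked by hand on a small example. This is a minimal sketch using made-up values (not the hurricane data) and plain Python instead of the `hist` method. The names `toy_deaths`, `bin_starts`, `percents` and `heights` are hypothetical.

```python
# Made-up values for illustration only (not from the hurricane data).
toy_deaths = [0, 3, 7, 9, 10, 12, 25, 29]
bin_width = 10
bin_starts = [0, 10, 20]  # bins are [0, 10), [10, 20), [20, 30)

percents = []
heights = []
for start in bin_starts:
    # The left endpoint is included in the bin, the right endpoint is not.
    count = sum(1 for x in toy_deaths if start <= x < start + bin_width)
    percent = 100 * count / len(toy_deaths)
    percents.append(percent)
    heights.append(percent / bin_width)  # height is percent per unit

for start, percent, height in zip(bin_starts, percents, heights):
    print(f'[{start}, {start + bin_width}): area = {percent}%, height = {height} percent per unit')

# The areas of all the bars add up to 100 percent.
print('total area =', sum(h * bin_width for h in heights))
```

Note that the value 10 is counted in the second bin, not the first, because of the left-endpoint convention.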

**Question 3.1** If the height of the first bin on the left is 6.3 percent per number of deaths, what percent of the hurricanes in this dataset had fewer than 10 deaths? Set `under_10` to the correct percentage *without* the % sign.

In [None]:
under_10 = ...

The next code cell will create side-by-side boxplots.  Run the cell and look closely at the output.

In [None]:
# Just run this cell
f = pd.DataFrame({'female-named':deaths_f})
m = pd.DataFrame({'male-named':deaths_m})
df = pd.concat([f, m], axis = 1) 
df.boxplot()

Looking at these boxplots should confirm what you learned from the numerical summaries from earlier in this lab.  Notice on the graph where the upper fences would be for each boxplot.  The whiskers should stop at or below what you calculated for both graphs.  Also notice how many outliers you can see.  This should also align with what you found earlier in this lab.  
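
The reason a whisker can stop *below* the fence is that, under the usual 1.5 × IQR boxplot convention, it ends at an actual observation. The sketch below illustrates this with made-up values (not the hurricane data). The fence is an assumed value for this toy list, and every name starting with `toy_` is hypothetical.

```python
# Made-up values for illustration only (not from the hurricane data).
toy_deaths = [0, 1, 2, 3, 5, 8, 13, 60]
toy_upper_fence = 18.5  # assumed upper fence for this toy list

# The upper whisker ends at the largest observation that does not exceed the fence.
toy_whisker_end = max(x for x in toy_deaths if x <= toy_upper_fence)

# Observations beyond the fence are drawn as individual outlier points.
toy_outlier_points = [x for x in toy_deaths if x > toy_upper_fence]

print('whisker ends at', toy_whisker_end, 'and outliers are', toy_outlier_points)
```

Plotting libraries may also compute quartiles slightly differently from the `percentile` function used earlier, so fences read off a graph can differ a little from hand calculations.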

**Question 3.2** Use what you have learned in this lab so far about the shape of the distribution of this variable and the presence of outliers to determine what the most appropriate measures of center and variability are.  Assign `summary_stats` to the best answer from the choices below.
1. The most appropriate measure of center is the mean and the most appropriate measure of variability is the standard deviation because the distribution is skewed and there are outliers.
2. The most appropriate measure of center is the mean and the most appropriate measure of variability is the standard deviation because the distribution is symmetric.
3. The most appropriate measure of center is the median and the most appropriate measure of variability is the IQR because the distribution is symmetric.
4. The most appropriate measure of center is the median and the most appropriate measure of variability is the IQR because the distribution is skewed and there are outliers.
5. The most appropriate measure of center is the mean because it includes all the values and the most appropriate measure of variability is the standard deviation because it is computed by using the mean.

In [None]:
summary_stats = ...

# Part 4: The Impact of Hurricane Naming History on the Data
From 1953 until 1978, the United States only used female names for storms.  According to NOAA's [Tropical Cyclone Naming History and Retired Names](https://www.nhc.noaa.gov/aboutnames_history.shtml), it wasn't until the year 1979 that male and female names were both used.  In the next coding cell, we will make a subset of the data that only includes hurricanes from 1979 through 2012.  Then we will make a new data visualization to compare the two groups.


In [None]:
# Just run this cell
# Step 1: create table where the year is greater than 1978
after_1978 = hurricanes_data.where('Year', are.above(1978))
# Step 2: create tables for female and male names
female_1979 = after_1978.where('Gender_MF', 'Female')
male_1979 = after_1978.where('Gender_MF', 'Male')
# Code for making a boxplot using pandas
f2 = pd.DataFrame({'female-named':female_1979.column('alldeaths')})
m2 = pd.DataFrame({'male-named':male_1979.column('alldeaths')})
df2 = pd.concat([f2, m2], axis = 1) 
print('Deaths from Hurricanes 1979 - 2012')
df2.boxplot()

What do you notice about the two boxplots?  Which one has the bigger IQR?  How do the medians compare?  Can you guess which one might have the larger mean and standard deviation?  Run the next cell to see the corresponding summary statistics for this data.  

In [None]:
# Just run this cell to see summary statistics
# Create arrays of just one variable
deaths_f_1979 = female_1979.column('alldeaths')
deaths_m_1979 = male_1979.column('alldeaths')
# sample sizes
n_f_1979 = len(deaths_f_1979)
n_m_1979 = len(deaths_m_1979)
# means
mean_f_1979 = np.mean(deaths_f_1979)
mean_m_1979 = np.mean(deaths_m_1979)
# mins
min_f_1979 = np.min(deaths_f_1979)
min_m_1979 = np.min(deaths_m_1979)
# first quartiles
q1_f_1979 = percentile(25, deaths_f_1979)
q1_m_1979 = percentile(25, deaths_m_1979)
# second quartiles
q2_f_1979 = percentile(50, deaths_f_1979)
q2_m_1979 = percentile(50, deaths_m_1979)
# third quartiles
q3_f_1979 = percentile(75, deaths_f_1979)
q3_m_1979 = percentile(75, deaths_m_1979)
# max
max_f_1979 = np.max(deaths_f_1979)
max_m_1979 = np.max(deaths_m_1979)
# standard deviations
sd_f_1979 = np.std(deaths_f_1979)
sd_m_1979 = np.std(deaths_m_1979)
# display summary stats
print('Female-named hurricanes, 1979 - 2012')
print('n = ', n_f_1979, ' mean = ', mean_f_1979, ' min = ', min_f_1979,
      ' Q1 = ', q1_f_1979, ' Q2 = ', q2_f_1979, ' Q3 = ', q3_f_1979, 
      ' max = ', max_f_1979, ' std dev = ', sd_f_1979) 
print('Male-named hurricanes, 1979 - 2012')
print('n = ', n_m_1979, ' mean = ', mean_m_1979, ' min = ', min_m_1979, 
      ' Q1 = ', q1_m_1979, ' Q2 = ', q2_m_1979, ' Q3 = ', q3_m_1979, 
      ' max = ', max_m_1979, ' std dev = ', sd_m_1979) 

**Question 4.1** Using the boxplots and the summary statistics above, answer the following questions in the markdown cell below.  
1. What are the medians equal to?  If we use the medians to compare the two groups, does one group typically have more deaths? If so, which one?  
2. What are the means equal to? If we use the means, does one group typically have more deaths? If so, which one? 
3. What are the IQRs equal to?  If we use the IQRs to compare the two groups, is the variability greater for one group than the other? If so, which one? 
4. What are the standard deviations equal to?  If we use the standard deviations to compare the two groups, is the variability greater for one group than the other? If so, which one?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### One more look at the impact of outliers 
Look back up at the side by side boxplot of the 1979 - 2012 data.  How do you think the presence of the outlier that has the highest number of deaths impacts the mean?  Let's investigate.  In the next cell, we will identify this hurricane by name.  

In [None]:
# Just run this cell
most_deaths = after_1978.where('alldeaths', are.above(100)).column('Name').item(0)
print(f'The hurricane that had the highest number of deaths was Hurricane {most_deaths}.')

**Do a quick search of this hurricane's name on the internet.** How did this hurricane compare to other Atlantic storms recorded to date?  Let's see the impact of this outlier on the mean.  Run the next cell.

In [None]:
# Just run this cell. 
mean_f_1979_2 = np.mean(after_1978.where('alldeaths', are.below(100)).where('Gender_MF', 'Female').column('alldeaths'))
mean_m_1979_2 = np.mean(after_1978.where('alldeaths', are.below(100)).where('Gender_MF', 'Male').column('alldeaths'))

print('After the outlier is dropped, the mean number of deaths caused by female-named hurricanes is equal to', round(mean_f_1979_2), 'deaths.')
print('After the outlier is dropped, the mean number of deaths caused by male-named hurricanes is equal to', round(mean_m_1979_2), 'deaths.')

# Part 5: Conclusions
Having investigated this dataset with numerical and visual summaries, do you think that the claim "Female hurricanes are deadlier than male hurricanes" is as clear-cut as the researchers state in the title of their paper?

**Question 5.1** If we use the median to represent a typical value for this data, are female-named hurricanes deadlier?  How does the choice of the measure of center change the conclusion?  


_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 5.2** How could the fact that all of the hurricanes had female names from 1953 through 1978 bias the results?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**CONGRATULATIONS!** You have finished this Jupyter notebook assignment! 
Be sure to...

- run all of the cells in this notebook, 
- check that you have answers for the open response questions,
- choose **Save and Export Notebook As** and then **PDF** from the **File** menu,
- submit the .pdf file on **Canvas**.