## IS590PR 
Examples to introduce Pandas and typical operations with a single large "table" of data.

The first data file used here comes from 
https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ngdc.mgg.hazards:G10147

In [1]:
import pandas as pd
import numpy as np

In [2]:
v = pd.read_csv('Data/NOAA_Signif_Volcano_Eruptions_20181003.txt',
               sep='\t')  # it's a TAB-separated file.

In [27]:
v.head()

Unnamed: 0,Year,Month,Day,TSU,EQ,Name,Location,Country,Latitude,Longitude,...,TOTAL_MISSING,TOTAL_MISSING_DESCRIPTION,TOTAL_INJURIES,TOTAL_INJURIES_DESCRIPTION,TOTAL_DAMAGE_MILLIONS_DOLLARS,TOTAL_DAMAGE_DESCRIPTION,TOTAL_HOUSES_DESTROYED,TOTAL_HOUSES_DESTROYED_DESCRIPTION,TD_MIN,TD_MAX
0,-4360,,,,,Macauley Island,Kermadec Is,New Zealand,-30.2,-178.47,...,,,,,,,,,,
1,-4350,,,,,Kikai,Ryukyu Is,Japan,30.78,130.28,...,,,,,,3.0,,3.0,101.0,1000.0
2,-4050,,,,,Masaya,Nicaragua,Nicaragua,11.984,-86.161,...,,,,,,,,,,
3,-4000,,,,,Pago,New Britain-SW Pac,Papua New Guinea,-5.58,150.52,...,,,,,,1.0,,,1.0,50.0
4,-3580,,,,,Taal,Luzon-Philippines,Philippines,14.002,120.993,...,,,,,,,,,,


In [None]:
v.tail()

After reviewing whether the data got loaded and formatted properly, we see that some data types are inappropriate.  Especially notice how columns like 'Month', 'Day', and 'TOTAL_DEATHS' should be integers, right?

We could force those to be integers, but unfortunately, if we do that with NumPy's plain integer types, we lose support for np.nan values, which mean "Not a Number".  If you look through the data you'll see there are many numeric columns that have missing values.  With NaN support, we automatically get correct results for calculations like arithmetic means or standard deviations even when some values are missing, because pandas skips NaN in those calculations by default.
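To make the trade-off concrete, here is a minimal sketch on a made-up Series (not the NOAA file). It shows that NaN forces a float dtype while aggregations still skip the gaps. It also shows pandas' nullable `Int64` dtype, which is an option this notebook does not otherwise use; treat it as a possible alternative rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a numeric column with one gap.
months = pd.Series([8.0, np.nan, 12.0, 5.0])

# The NaN forces a float dtype, but mean() skips it by default.
print(months.dtype)
print(months.mean())  # average of 8, 12 and 5 only

# Nullable integer dtype: whole numbers are kept, the gap becomes <NA>.
months_int = months.astype('Int64')
print(months_int.dtype)
print(months_int)
```

Note the capital "I" in `'Int64'`; the lowercase `'int64'` is the plain NumPy type and would raise an error on the missing value.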

### Grouping data
Let's calculate some statistics that require grouping.  You can usually tell you need grouping because the English-language description of your result has the word "each" or "per" in it:

* Number of distinct volcanoes *per country*.
* Number of eruptions *per year*.
* Number of eruptions *per century*.
* etc.
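As a small illustration of the "per" idea, the first bullet could be sketched like this. The frame below is a hand-made toy, not the NOAA data, and `nunique()` is one reasonable way to count distinct names within each group.

```python
import pandas as pd

# Toy rows: one row per eruption, so volcano names repeat.
toy = pd.DataFrame({
    'Country': ['Italy', 'Italy', 'Italy', 'Japan', 'Japan'],
    'Name': ['Vesuvius', 'Vesuvius', 'Etna', 'Kikai', 'Kikai'],
})

# "Distinct volcanoes per country": group by Country, then count
# the unique values of Name inside each group.
per_country = toy.groupby('Country')['Name'].nunique()
print(per_country)
```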

### Number of eruptions per year
First try just getting the rows for recent years (2010 onward).  We first calculate a Series of Boolean values by vectorizing a comparison.  That Series is then used to index the dataframe, so we get back just the rows where it is True.


In [None]:
v[v['Year'] >= 2010]
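The two steps (build the Boolean Series, then index with it) can be seen separately on a small illustrative frame. The rows here are made up for the example and the variable names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'Year': [1991, 2010, 2018, 2018],
    'Name': ['A', 'B', 'C', 'D'],  # placeholder names
})

# Step 1: the comparison is applied to every row at once.
mask = toy['Year'] == 2018
print(mask.tolist())

# Step 2: indexing with the mask keeps only the True rows.
only_2018 = toy[mask]

# Conditions combine with & and |; each one needs its own parentheses.
recent = toy[(toy['Year'] >= 2010) & (toy['Year'] <= 2018)]
print(len(only_2018), len(recent))
```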

But to count PER year, we need to do two things:
* groupby() the 'Year' column to collapse all rows with matching years together.
* call the size() method to give us the number of rows that got collapsed.

Note that the count() method sounds more intuitive here than size() but it does something different: count() returns the number of non-missing values in each column of each group, so columns with NaN values give smaller numbers than the row count.
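A minimal sketch of the difference, using a made-up frame in which one value is missing:

```python
import numpy as np
import pandas as pd

# Toy data: two rows for 2000 (one with a missing value), one for 2001.
toy = pd.DataFrame({
    'Year': [2000, 2000, 2001],
    'TOTAL_DEATHS': [5.0, np.nan, 12.0],
})

# size() counts rows per group, regardless of missing values.
sizes = toy.groupby('Year').size()

# count() counts only the non-missing values in the chosen column.
counts = toy.groupby('Year')['TOTAL_DEATHS'].count()

print(sizes)
print(counts)
```

For the year 2000, size() reports 2 rows while count() reports 1, because one of the two TOTAL_DEATHS values is NaN.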

In [14]:
v.groupby('Year').size()

Year,Number of rows
-4360,1
-4350,1
-4050,1
-4000,1
-3580,1
-3550,1
-2420,1
-2040,1
-1900,1
-1860,1
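The same pattern extends to "per century" if we first derive a grouping key from the year. The sketch below uses a handful of Year values that appear in the outputs of this notebook. The bucketing rule (floor the year to a multiple of 100) is an assumption made for illustration, and it ignores the fact that the historical calendar has no year 0.

```python
import pandas as pd

years = pd.Series([-4360, -2420, 79, 1631, 1682, 1690, 1698, 1779],
                  name='Year')

# Floor division rounds toward negative infinity, so -4360 falls in
# the bucket starting at -4400 and 1631 in the bucket starting at 1600.
century_start = (years // 100) * 100

# Group the years by their bucket and count the rows in each.
per_century = years.groupby(century_start).size()
print(per_century)
```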


### Tricky Counting
Suppose we have to produce a report that lists countries by the total (approximate) number of deaths from volcanoes.

It turns out that this is harder than it seems, because of the way the data is currently encoded.  If we just add up the "TOTAL_DEATHS" column values, we'll be **way underreporting** the fatalities.  Review the NOAA published metadata regarding columns like TOTAL_DEATHS_DESCRIPTION.

You'll see that column only contains 1, 2, 3, or 4, which are approximate range indicators.  TOTAL_DEATHS is only filled out if they're confident in a **specific** number.

### Bulk modifications of selected data
So we need to create some new columns to help us figure out the range of the estimated death tolls...

In [None]:
deaths = v[['Year', 'Day', 'Month', 'TOTAL_DEATHS',  'TOTAL_DEATHS_DESCRIPTION', 'Country', 'Name', 'Location']]

In [None]:
deaths.head()

In [None]:
deaths[deaths['TOTAL_DEATHS'] > 0]

In [None]:
deaths['TOTAL_DEATHS'].max()

In [15]:
conv = {1: (1, 50),
        2: (51, 100), 
        3: (101, 1000),
        4: (1001, 100000)
       }

In [None]:
deaths.head()

In [16]:
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0.
v['TD_MIN'] = np.nan
v['TD_MAX'] = np.nan

In [17]:
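# For each range code 1-4, fill in the estimated minimum and maximum deaths.
for x in range(1, 5):
    rows = v['TOTAL_DEATHS_DESCRIPTION'] == x  # Boolean mask for this code
    v.loc[rows, 'TD_MIN'] = conv[x][0]
    v.loc[rows, 'TD_MAX'] = conv[x][1]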
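The loop is easy to follow, but the same lookup can also be written without a loop by using `Series.map()` with a dictionary. This is only an alternative sketch: the `codes` Series is a made-up stand-in for the TOTAL_DEATHS_DESCRIPTION column, and `td_min` / `td_max` are hypothetical names.

```python
import numpy as np
import pandas as pd

conv = {1: (1, 50),
        2: (51, 100),
        3: (101, 1000),
        4: (1001, 100000)}

# Stand-in for the description column: codes with some missing values.
codes = pd.Series([np.nan, 3.0, np.nan, 1.0])

# Split the (min, max) tuples into two plain lookup dictionaries.
min_lookup = {code: lo for code, (lo, hi) in conv.items()}
max_lookup = {code: hi for code, (lo, hi) in conv.items()}

# map() replaces each code by its lookup value; codes that are
# missing (or not in the dictionary) come out as NaN.
td_min = codes.map(min_lookup)
td_max = codes.map(max_lookup)
print(pd.DataFrame({'TD_MIN': td_min, 'TD_MAX': td_max}))
```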

In [21]:
v[['TD_MIN','TD_MAX','TOTAL_DEATHS','TOTAL_DEATHS_DESCRIPTION']]

Unnamed: 0,TD_MIN,TD_MAX,TOTAL_DEATHS,TOTAL_DEATHS_DESCRIPTION
0,,,,
1,101.0,1000.0,,3.0
2,,,,
3,1.0,50.0,,1.0
4,,,,
5,,,,
6,,,,
7,,,,
8,,,,
9,,,,


Next, wherever a precise death toll is recorded in TOTAL_DEATHS, we want to copy it into both TD_MIN and TD_MAX, so that it replaces the coarser range estimate for those rows.

In [20]:
rows = v['TOTAL_DEATHS'] > 0
v.loc[rows, 'TD_MIN'] = v['TOTAL_DEATHS']
v.loc[rows, 'TD_MAX'] = v['TOTAL_DEATHS']
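To see what this assignment does, here is the same step on a three-row toy frame. The values mirror three of the Vesuvius rows shown later in this notebook, with the range columns pre-filled from the code table above. The right-hand side is a full column, and pandas aligns it by index so only the selected rows receive values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'TOTAL_DEATHS': [2100.0, np.nan, 4.0],
    'TD_MIN': [1001.0, 1.0, 1.0],      # from range codes 4, 1, 1
    'TD_MAX': [100000.0, 50.0, 50.0],
})

# NaN > 0 evaluates to False, so rows without an exact count are skipped.
rows = toy['TOTAL_DEATHS'] > 0

# Index alignment copies each exact count into its own row only.
toy.loc[rows, 'TD_MIN'] = toy['TOTAL_DEATHS']
toy.loc[rows, 'TD_MAX'] = toy['TOTAL_DEATHS']
print(toy)
```

The middle row keeps its 1 to 50 range because it has no exact count, while the other two rows collapse to a single number.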

### How many deaths are estimated for all time?


In [26]:
print('Total Deaths estimated between {} and {}.'.format(
    v['TD_MIN'].sum(),
    v['TD_MAX'].sum()))

Total Deaths estimated between 338870.0 and 479890.0.
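With TD_MIN and TD_MAX in place, the per-country report described under "Tricky Counting" becomes another grouping problem. The following is a sketch on a small toy frame whose values are taken from rows visible in this notebook; on the real data the same expression would be applied to `v`. Sorting by TD_MAX is an arbitrary choice made here.

```python
import pandas as pd

toy = pd.DataFrame({
    'Country': ['Italy', 'Italy', 'Japan', 'Papua New Guinea'],
    'TD_MIN': [2100.0, 4000.0, 101.0, 1.0],
    'TD_MAX': [2100.0, 4000.0, 1000.0, 50.0],
})

# Sum both bounds per country, then rank by the upper bound.
by_country = (toy.groupby('Country')[['TD_MIN', 'TD_MAX']]
                 .sum()
                 .sort_values('TD_MAX', ascending=False))
print(by_country)
```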


### How many eruptions has Mt. Vesuvius had?

In [5]:
v[v['Name'] == 'Vesuvius']

Unnamed: 0,Year,Month,Day,TSU,EQ,Name,Location,Country,Latitude,Longitude,...,TOTAL_DEATHS,TOTAL_DEATHS_DESCRIPTION,TOTAL_MISSING,TOTAL_MISSING_DESCRIPTION,TOTAL_INJURIES,TOTAL_INJURIES_DESCRIPTION,TOTAL_DAMAGE_MILLIONS_DOLLARS,TOTAL_DAMAGE_DESCRIPTION,TOTAL_HOUSES_DESTROYED,TOTAL_HOUSES_DESTROYED_DESCRIPTION
6,-2420,,,,,Vesuvius,Italy,Italy,40.821,14.426,...,,,,,,,,,,
28,79,8.0,25.0,,,Vesuvius,Italy,Italy,40.821,14.426,...,2100.0,4.0,,,,,,,,
42,787,,,,,Vesuvius,Italy,Italy,40.821,14.426,...,,1.0,,,,,,1.0,,
118,1631,12.0,16.0,TSU,EQ,Vesuvius,Italy,Italy,40.821,14.426,...,4000.0,4.0,,,,,,4.0,,4.0
141,1682,8.0,12.0,,,Vesuvius,Italy,Italy,40.821,14.426,...,4.0,1.0,,,,,,,,
143,1690,2.0,3.0,TSU,,Vesuvius,Italy,Italy,40.821,14.426,...,,,,,,,,,,
149,1698,5.0,,TSU,,Vesuvius,Italy,Italy,40.821,14.426,...,,,,,,,,,,
157,1714,6.0,30.0,TSU,,Vesuvius,Italy,Italy,40.821,14.426,...,,,,,,,,,,
175,1737,5.0,20.0,,,Vesuvius,Italy,Italy,40.821,14.426,...,2.0,1.0,,,,,,2.0,,
195,1779,8.0,8.0,,,Vesuvius,Italy,Italy,40.821,14.426,...,,1.0,,,,,,,,


In [9]:
v[v['Name'] == 'Vesuvius'].shape[0]

18
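If we wanted the same count for every volcano at once, `value_counts()` is one convenient option. The sketch below uses a short made-up list of names rather than the full Name column.

```python
import pandas as pd

# Made-up sample of the Name column: one entry per recorded eruption.
names = pd.Series(['Vesuvius', 'Vesuvius', 'Taal', 'Vesuvius', 'Kikai'],
                  name='Name')

# Counting one volcano: True counts as 1 when a Boolean mask is summed.
n_vesuvius = int((names == 'Vesuvius').sum())

# Counting all volcanoes: rows per distinct name, most frequent first.
per_volcano = names.value_counts()
print(n_vesuvius)
print(per_volcano)
```

Keep in mind that the dataset only lists significant eruptions, so these counts are not a volcano's complete eruption history.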