<h1>Data Analysis and Interpretation Capstone</h1>

<h3>Module 1. Identify Your Data and Research Question</h3>

My code and data investigation come first, followed by outlines of my <b><i>drafts</i></b> for the:
<div style="margin-left:25px;"><ul style="list-style-type:none;">
    <li>1) project title</li>
    <li>2) research question</li>
    <li>3) motivation or rationale for wanting to answer the research question</li>
    <li>4) potential implications of answering the research question</li>
</ul></div>

I will use the Storm Event data, which contains extensive National Weather Service records of storm events by state and county.

<h2>DATA INVESTIGATION</h2>

<h4>SET UP</h4>

<i>Import packages and set options</i>

In [1]:
import pandas as pd

In [2]:
pd.set_option('display.max_rows', 12)

<i>Read in the data set</i>

In [3]:
use_cols = ['DEATHS_DIRECT','STATE','MONTH_NAME','EVENT_TYPE','CZ_TYPE','NUM_EVENTS','EVENT_ID']
data = pd.read_csv('storm_events_data/storm_event_data.csv', usecols=use_cols)

Check that there is a unique identifier included to use as the index

In [4]:
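# Print the total row count first, then the number of distinct values in each column.
# EVENT_ID can serve as the index if its distinct count equals the row count.
print('Unique counts for each column\n' + str(data.shape[0]) + '\n')
for each in data.columns:
    print(str(data[each].unique().shape[0]) + ' : ' + each)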

Unique counts for each column
166048

166048 : EVENT_ID
67 : STATE
12 : MONTH_NAME
51 : EVENT_TYPE
2 : CZ_TYPE
18 : DEATHS_DIRECT
110 : NUM_EVENTS
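
The same uniqueness check can be written more directly with `Series.is_unique` and `DataFrame.nunique`. The following is a minimal sketch on a small hypothetical frame (the name `toy` and its values are made up for illustration and are not taken from the storm data):

```python
import pandas as pd

# Hypothetical toy frame standing in for the storm data (values are made up).
toy = pd.DataFrame({
    'EVENT_ID': [101, 102, 103, 104],
    'STATE': ['TEXAS', 'KANSAS', 'TEXAS', 'IOWA'],
    'DEATHS_DIRECT': [0, 0, 1, 0],
})

# A column is a safe index only if every one of its values is distinct.
id_is_unique = toy['EVENT_ID'].is_unique

# Number of distinct values per column (NaN is excluded by default).
unique_counts = toy.nunique()

print(id_is_unique)
print(unique_counts)
```

Note that `nunique()` ignores missing values by default, whereas `unique()` counts NaN as a value, so the two can differ on columns with gaps.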


In [5]:
data.set_index('EVENT_ID', inplace=True)
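
Selecting columns and setting the index can also be done in a single `read_csv` call through its `usecols` and `index_col` parameters. This is a sketch under assumptions: the in-memory CSV text, the `EXTRA` column, and the name `toy` are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for storm_event_data.csv.
csv_text = io.StringIO(
    "EVENT_ID,STATE,DEATHS_DIRECT,EXTRA\n"
    "1,TEXAS,0,x\n"
    "2,KANSAS,1,y\n"
)

# usecols keeps only the named columns; index_col promotes EVENT_ID to the index.
toy = pd.read_csv(
    csv_text,
    usecols=['EVENT_ID', 'STATE', 'DEATHS_DIRECT'],
    index_col='EVENT_ID',
)

print(toy)
```

Doing the uniqueness check first, as above, is still worthwhile because `index_col` does not verify that the index values are distinct.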

<i>Investigate data</i>

The number of rows and columns of the data

In [6]:
data.shape

(166048, 6)

In [7]:
cols = ['DEATHS_DIRECT','STATE','MONTH_NAME','EVENT_TYPE','CZ_TYPE','NUM_EVENTS']

Check some features of the columns

In [8]:
data[cols].isnull().sum()

DEATHS_DIRECT    0
STATE            0
MONTH_NAME       0
EVENT_TYPE       0
CZ_TYPE          0
NUM_EVENTS       0
dtype: int64

In [9]:
data[cols[0]].value_counts(dropna=False)

0     165214
1        701
2         77
3         27
4         11
5          3
       ...  
11         1
12         1
13         1
19         1
24         1
8          1
Name: DEATHS_DIRECT, dtype: int64

In [10]:
data[cols[1]].value_counts(dropna=False)

TEXAS             14167
KANSAS             7180
IOWA               6259
NEBRASKA           6080
OKLAHOMA           5451
KENTUCKY           5423
                  ...  
GUAM                 39
VIRGIN ISLANDS       36
E PACIFIC            21
GULF OF ALASKA       17
AMERICAN SAMOA       14
HAWAII WATERS         9
Name: STATE, dtype: int64

In [11]:
data[cols[2]].value_counts(dropna=False)

June         27163
July         20877
May          18231
January      16917
February     16870
April        15427
August       13308
March        12232
September     7963
December      7192
November      5411
October       4457
Name: MONTH_NAME, dtype: int64

In [12]:
data[cols[3]].value_counts(dropna=False)

Thunderstorm Wind        41351
Hail                     28516
Winter Weather           12704
Flash Flood              11501
Winter Storm              9923
Drought                   9652
                         ...  
Dense Smoke                  6
Lakeshore Flood              6
Marine Tropical Storm        5
Tropical Depression          4
Marine Dense Fog             3
Marine Lightning             1
Name: EVENT_TYPE, dtype: int64

In [13]:
data[cols[4]].value_counts(dropna=False)

C    96756
Z    69292
Name: CZ_TYPE, dtype: int64

In [14]:
data[cols[5]].value_counts(dropna=False)

1     8509
2     8496
3     8034
4     7420
5     7160
6     6708
      ... 
86      86
84      84
82      82
81      81
79      79
72      72
Name: NUM_EVENTS, dtype: int64

<img src="usa-states-map.gif">

<h6>The image above is subject to Microsoft fair use copyright via Bing search services</h6>

<h3>Some notes after looking at the data, along with a description of the values in each column</h3>

<b><i>DEATHS_DIRECT</i>: </b>

<i>The number of deaths directly related to the weather event.</i>

About 99.5% of events have 0 deaths. The next most frequent death counts are 1, then 2, 3, 4, and 5, which is much as we would hope.
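
The 99.5% figure follows from the value counts in Out[9] and the row count in Out[6]. A small worked check, with the two numbers copied from those outputs:

```python
# Counts copied from the notebook outputs above.
zero_death_events = 165214   # rows with DEATHS_DIRECT == 0
total_events = 166048        # total rows in the data set

share_zero = zero_death_events / total_events
print(f'{share_zero:.1%}')

# On the full frame the same share could be computed as:
# (data['DEATHS_DIRECT'] == 0).mean()
```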

The maximum number of deaths from one incident is 43, which occurred in Washington (event id 505782). The minimum is 0, which accounts for the majority of events.
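
One way to locate the deadliest event is `Series.idxmax`, which returns the index label of the largest value. The sketch below uses a hypothetical toy frame indexed by EVENT_ID (the ids and values are made up), not the real data:

```python
import pandas as pd

# Hypothetical toy frame indexed by EVENT_ID, mirroring the notebook's layout.
toy = pd.DataFrame(
    {'STATE': ['TEXAS', 'KANSAS', 'IOWA'], 'DEATHS_DIRECT': [0, 3, 1]},
    index=pd.Index([11, 12, 13], name='EVENT_ID'),
)

# idxmax returns the index label (here the event id) of the maximum value.
worst_id = toy['DEATHS_DIRECT'].idxmax()
worst_row = toy.loc[worst_id]

print(worst_id, worst_row['STATE'], worst_row['DEATHS_DIRECT'])
print(toy['DEATHS_DIRECT'].min())
```

If several events tie for the maximum, `idxmax` returns only the first one, so a filter on the maximum value would be needed to list them all.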

<b><i>STATE</i>:</b>

<i>The state name where the event occurred.</i>

The top five states where weather incidents originated are all aligned down the vertical middle of the USA: TEXAS, KANSAS, IOWA, NEBRASKA, and OKLAHOMA. You can see this on the map above.

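There are 67 unique values, which is more than the 50 US states. The value counts above show that at least some of the extra values are other categories covering territories and marine areas, such as GUAM, VIRGIN ISLANDS, E PACIFIC, GULF OF ALASKA, AMERICAN SAMOA, and HAWAII WATERS.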
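
The values that are not states can be listed by filtering against a reference list of the 50 states. This is a sketch: the `US_STATES` list is written out by hand here, and `state_col` is a hypothetical sample using a few names seen in the output above.

```python
import pandas as pd

# The 50 US states in upper case, matching the casing in the STATE column.
US_STATES = [
    'ALABAMA', 'ALASKA', 'ARIZONA', 'ARKANSAS', 'CALIFORNIA', 'COLORADO',
    'CONNECTICUT', 'DELAWARE', 'FLORIDA', 'GEORGIA', 'HAWAII', 'IDAHO',
    'ILLINOIS', 'INDIANA', 'IOWA', 'KANSAS', 'KENTUCKY', 'LOUISIANA',
    'MAINE', 'MARYLAND', 'MASSACHUSETTS', 'MICHIGAN', 'MINNESOTA',
    'MISSISSIPPI', 'MISSOURI', 'MONTANA', 'NEBRASKA', 'NEVADA',
    'NEW HAMPSHIRE', 'NEW JERSEY', 'NEW MEXICO', 'NEW YORK',
    'NORTH CAROLINA', 'NORTH DAKOTA', 'OHIO', 'OKLAHOMA', 'OREGON',
    'PENNSYLVANIA', 'RHODE ISLAND', 'SOUTH CAROLINA', 'SOUTH DAKOTA',
    'TENNESSEE', 'TEXAS', 'UTAH', 'VERMONT', 'VIRGINIA', 'WASHINGTON',
    'WEST VIRGINIA', 'WISCONSIN', 'WYOMING',
]

# Hypothetical sample of the STATE column (not the full data).
state_col = pd.Series(
    ['TEXAS', 'GUAM', 'KANSAS', 'E PACIFIC', 'HAWAII WATERS', 'TEXAS'],
    name='STATE',
)

# Keep only the values that are not one of the 50 states.
not_states = sorted(state_col[~state_col.isin(US_STATES)].unique())
print(not_states)
```

On the real data, replacing `state_col` with `data['STATE']` would show which of the 67 values fall outside the 50 states.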

<b><i>MONTH_NAME</i>:</b>

<i>Name of the month for the event in this record.</i>

January to August seems to be the worst period for these events, with September to December relatively quiet.
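
Because `value_counts` sorts by frequency, the seasonal pattern is easier to see once the months are put back in calendar order. A minimal sketch using the counts from Out[11] (the variable names are illustrative):

```python
import pandas as pd

# Counts copied from the MONTH_NAME value counts above.
month_counts = pd.Series({
    'June': 27163, 'July': 20877, 'May': 18231, 'January': 16917,
    'February': 16870, 'April': 15427, 'August': 13308, 'March': 12232,
    'September': 7963, 'December': 7192, 'November': 5411, 'October': 4457,
})

calendar_order = [
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December',
]

# Reorder the counts chronologically instead of by frequency.
by_calendar = month_counts.reindex(calendar_order)

# Label-based slices are inclusive of both end points.
jan_to_aug = by_calendar.loc['January':'August'].sum()
sep_to_dec = by_calendar.loc['September':'December'].sum()

print(by_calendar)
print(jan_to_aug, sep_to_dec)
```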

<b><i>EVENT_TYPE</i>:</b>

<i> The chosen event name should be the one that most accurately describes the meteorological event leading to fatalities, injuries, damage, etc. The only events permitted in Storm Data are listed in Table 1 of Section 2.1.1 of NWS Directive 10-1605 on the <a href="http://www.nws.noaa.gov/directives/sym/pd01016005curr.pdf">NWS website</a>.</i>

The top four occurring events are <u>Thunderstorm Wind</u>, <u>Hail</u>, <u>Winter Weather</u>, and <u>Flash Flood</u>. Thunderstorm wind and hail are by far the most frequent, occurring 41,351 and 28,516 times respectively.

<b><i>CZ_TYPE</i>:</b>

<i>Indicates whether the event happened in a (C) county/parish or (Z) zone.</i>

Most events are designated C, or county/parish: 96,756 (about 58%) against 69,292 (about 42%) for Z, or zone. The Z designation applies to events that take effect over a large area such as drought, blizzard, or dense fog.

The C designation applies to more locally based events such as thunderstorm wind, flash flood, or hail.

There is a third designation, M, or marine, for events that take place over water such as a waterspout. It does not appear in this data set, where CZ_TYPE has only the two values C and Z.
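
The link between event type and designation could be checked with a cross-tabulation. The sketch below uses a hypothetical toy frame whose rows are made up to match the examples in the text; it is not a sample of the real data.

```python
import pandas as pd

# Hypothetical rows: local events tagged C, wide-area events tagged Z.
toy = pd.DataFrame({
    'EVENT_TYPE': ['Hail', 'Hail', 'Drought', 'Thunderstorm Wind', 'Blizzard'],
    'CZ_TYPE': ['C', 'C', 'Z', 'C', 'Z'],
})

# Rows are event types, columns are designations, cells are event counts.
table = pd.crosstab(toy['EVENT_TYPE'], toy['CZ_TYPE'])

# Share of events per designation.
shares = toy['CZ_TYPE'].value_counts(normalize=True)

print(table)
print(shares)
```

Run on the full frame, a table like this would show whether each event type falls entirely under one designation or is split between the two.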

<b><i>NUM_EVENTS</i>:</b>

<i>The number of events associated with each severe weather outbreak.</i>

At the bottom of the value counts for the number of events per episode (the end of the Out[14] section), there is an unusual pattern: the count equals the value itself. For example, the value 86 occurs 86 times and the value 72 occurs 72 times.
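
One possible explanation, which is an assumption and not something confirmed by the data documentation, is that every event row carries the size of its episode, so an episode with n events contributes n rows with the value n. Under that assumption, dividing each count by its value gives the implied number of episodes of that size. A sketch on made-up values:

```python
import pandas as pd

# Hypothetical rows: one episode of 3 events and two episodes of 2 events,
# each event row repeating the size of its episode (an assumption).
num_events = pd.Series([3, 3, 3, 2, 2, 2, 2], name='NUM_EVENTS')

counts = num_events.value_counts()

# If every row repeats its episode size, count / value = number of episodes.
implied_episodes = counts / counts.index.to_numpy()

print(counts)
print(implied_episodes)
```

A result of exactly 1 for a given value would correspond to the pattern seen for the large values in Out[14], and non-integer results would count against this explanation.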

<h2>DRAFTS</h2>

<h4>Project Title</h4>

The Association between the Number of Deaths and the State and Time Weather Events Occur

<h4>Research Question</h4>

This analysis aims to identify the best predictors of the number of fatalities from among storm event characteristics, including the month the event happened, the designation of the event, the state it started in, the type of event, and the number of associated events.
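
As a first descriptive look at candidate predictors, the mean number of direct deaths can be compared across the categories of each variable. This is only an exploratory sketch on a hypothetical toy frame (made-up values); it is not the modelling method for the project, which is not decided in this draft.

```python
import pandas as pd

# Hypothetical toy frame with two candidate predictors (values are made up).
toy = pd.DataFrame({
    'STATE': ['TEXAS', 'TEXAS', 'KANSAS', 'KANSAS'],
    'MONTH_NAME': ['June', 'July', 'June', 'July'],
    'DEATHS_DIRECT': [0, 2, 0, 0],
})

# Mean deaths per event within each category of a predictor.
mean_by_state = toy.groupby('STATE')['DEATHS_DIRECT'].mean()
mean_by_month = toy.groupby('MONTH_NAME')['DEATHS_DIRECT'].mean()

print(mean_by_state)
print(mean_by_month)
```

Large differences between category means would suggest a variable is worth carrying into a formal model, though with 99.5% of events at zero deaths the means will be small and sensitive to a few severe events.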

<h4>Motivation</h4>

As an insurance analyst, and given the recent extreme weather partly caused by last year's El Niño (2016), I am interested in the potential impact on businesses and people. By identifying which characteristics are important for predicting negative outcomes, such as deaths, we can identify what information to gather from the insured and potentially warn them about the risks they face so they can react responsibly.

<h4>Potential Implications</h4>

Predicting risk reliably would allow more accurate pricing of products, and less information may need to be collected, making insurance quicker, easier, and potentially cheaper to buy.