<a href="https://colab.research.google.com/github/danielscurlock/DS-Unit-1-Sprint-1-Data-Wrangling-and-Storytelling/blob/master/notebooks/Explanatory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **Explanatory Analysis of 1.88 Million Wildfires:**
by: ***Daniel Scurlock*** (3/2/20)


---



# *Dataset Description:*
https://www.kaggle.com/rtatman/188-million-us-wildfires

This data publication contains a spatial database of wildfires that occurred in the United States from 1992 to 2015. It is the third update of a publication originally generated to support the national Fire Program Analysis (FPA) system. The wildfire records were acquired from the reporting systems of federal, state, and local fire organizations. The following core data elements were required for records to be included in this data publication: discovery date, final fire size, and a point location at least as precise as Public Land Survey System (PLSS) section (1-square mile grid). The data were transformed to conform, when possible, to the data standards of the National Wildfire Coordinating Group (NWCG). Basic error-checking was performed and redundant records were identified and removed, to the degree possible. The resulting product, referred to as the Fire Program Analysis fire-occurrence database (FPA FOD), includes 1.88 million geo-referenced wildfire records, representing a total of 140 million acres burned during the 24-year period.


---


***I will try to be as thorough as possible. My goal here is to ask questions and explain the answers with visualizations.***

In [0]:
!pip install geoplot

In [0]:
# Get things ready for analysis
import pandas as pd
from scipy.stats import ttest_ind, ttest_ind_from_stats
from scipy.special import stdtr
import matplotlib.pyplot as plt
import descartes 
from shapely.geometry import Point, Polygon

In [0]:
# Load the previously created CSV of cleaned wildfire records
df = pd.read_csv('/content/drive/My Drive/Lambda/DSU1-BUILD/data/firesfinal.csv', index_col=0)

# **A regional analysis of wildfires**
I will compare and contrast wildfire data by region. I want to know the number of fires in each region and the average acreage burned per fire. I also want to identify the most frequent cause of fires in each region, and the months in which fires are most frequent.
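As a rough sketch of how these four questions could be answered in one pass, the block below groups the records by `REGION` and takes the count, the mean of `FIRE_SIZE`, and the mode of `STAT_CAUSE_DESCR` and `MONTH`. The helper name `regional_overview` and the tiny `toy` frame are hypothetical and only there to show the shape of the result; the column names are the ones used in `firesfinal.csv`.

```python
import pandas as pd


def regional_overview(fires: pd.DataFrame) -> pd.DataFrame:
    """Return one row per region: fire count, mean size, top cause, busiest month."""
    grouped = fires.groupby("REGION")
    return pd.DataFrame({
        "fires": grouped.size(),
        "mean_acres": grouped["FIRE_SIZE"].mean(),
        # Series.mode() returns every tied value in sorted order; keep the first
        "top_cause": grouped["STAT_CAUSE_DESCR"].agg(lambda s: s.mode().iloc[0]),
        "busiest_month": grouped["MONTH"].agg(lambda s: s.mode().iloc[0]),
    })


# Tiny made-up frame, purely illustrative (not taken from the real dataset)
toy = pd.DataFrame({
    "REGION": ["West", "West", "West", "South", "South"],
    "FIRE_SIZE": [0.1, 0.3, 2.0, 1.0, 3.0],
    "STAT_CAUSE_DESCR": ["Lightning", "Lightning", "Arson",
                         "Debris Burning", "Debris Burning"],
    "MONTH": ["July", "July", "August", "March", "March"],
})

overview = regional_overview(toy)
print(overview)
```

On the real data the call would be `regional_overview(df)`. Taking the mode is only one way to define "most frequent"; ties are broken alphabetically here, which may or may not be what is wanted.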

In [24]:
df.head()

Unnamed: 0,STAT_CAUSE_DESCR,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,DATE,YEAR,MONTH,DAY,WEEKDAY,HEINSELMAN,STATE_NAME,REGION,DIVISION
0,Other,0.1,A,40.036944,-121.005833,CA,2005-02-02,2005,February,2,Wednesday,small,California,West,Pacific
1,Lightning,0.25,A,38.933056,-120.404444,CA,2004-05-12,2004,May,12,Wednesday,small,California,West,Pacific
2,Debris Burning,0.1,A,38.984167,-120.735556,CA,2004-05-31,2004,May,31,Monday,small,California,West,Pacific
3,Lightning,0.1,A,38.559167,-119.913333,CA,2004-06-28,2004,June,28,Monday,small,California,West,Pacific
4,Lightning,0.1,A,38.559167,-119.933056,CA,2004-06-28,2004,June,28,Monday,small,California,West,Pacific


In [30]:
df.dtypes

STAT_CAUSE_DESCR     object
FIRE_SIZE           float64
FIRE_SIZE_CLASS      object
LATITUDE            float64
LONGITUDE           float64
STATE                object
DATE                 object
YEAR                  int64
MONTH                object
DAY                   int64
WEEKDAY              object
HEINSELMAN           object
STATE_NAME           object
REGION               object
DIVISION             object
dtype: object

In [25]:
# What kind of regions do we have?
df['REGION'].value_counts()

South        950358
West         589422
Midwest      178933
Northeast    139671
Name: REGION, dtype: int64

In [0]:
# Subset for each region
south = df[df['REGION'] == 'South']
west = df[df['REGION'] == 'West']
midwest = df[df['REGION'] == 'Midwest']
northeast = df[df['REGION'] == 'Northeast']

# Statistical significance
Next, I look for statistically significant differences between regions that are worth discussing.

In [0]:
# I want to perform some testing, both on numerical
# and categorical data.

# The first test is a t-test on fire size.
# The t-test tells us how significant the differences in
# mean fire size are between regions, and whether those
# differences could have happened by chance.

dfs = [[south, 'south'],
       [west, 'west'], 
       [midwest, 'midwest'], 
       [northeast, 'northeast']]


tmplist = []
for rega in dfs:
  for regb in dfs:
    if rega[1] != regb[1]:
      t, p = ttest_ind(rega[0]['FIRE_SIZE'], regb[0]['FIRE_SIZE'], nan_policy='omit')
      tmplist.append([rega[1], regb[1], abs(t), p])

tpdf = pd.DataFrame(tmplist, columns=['rega', 'regb', 't-value', 'p-value'])

# Reject the null hypothesis when p falls below the 0.05 significance level
tpdf['conclusion'] = ['fail to reject' if x >= 0.05 else 'reject' for x in tpdf['p-value']]

In [64]:
tpdf.head(20)

Unnamed: 0,rega,regb,t-value,p-value,conclusion
0,south,west,34.088523,1.3590869999999999e-254,reject
1,south,midwest,2.138238,0.03249762,reject
2,south,northeast,8.72573,2.648306e-18,reject
3,west,south,34.088523,1.3590869999999999e-254,reject
4,west,midwest,14.917556,2.5753629999999996e-50,reject
5,west,northeast,15.940912,3.3694209999999997e-57,reject
6,midwest,south,2.138238,0.03249762,reject
7,midwest,west,14.917556,2.5753629999999996e-50,reject
8,midwest,northeast,16.681996,1.8837560000000002e-62,reject
9,northeast,south,8.72573,2.648306e-18,reject


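The results of the t-tests were pretty straightforward: every pair of regions shows a statistically significant difference in mean acreage burned per wildfire. The strength of that evidence varies, though. South vs. Midwest is the weakest result (t ≈ 2.14, p ≈ 0.03), while South vs. West is the strongest (t ≈ 34.09). One possible explanation is the distance between regions, although the tests themselves do not establish that. I can use these differences in the rest of the analysis.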
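The nested loop above tests every ordered pair, so each comparison appears twice (south/west and west/south). A minimal sketch of an alternative is shown here: it uses `itertools.combinations` so that four regions yield six unique pairs, and it swaps in Welch's variant of the t-test (`equal_var=False`), which does not assume the regions share a variance. That swap is my assumption, not what the cell above does, and the helper name `pairwise_welch` and the `toy` values are hypothetical.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import ttest_ind


def pairwise_welch(fires: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Run Welch's t-test on FIRE_SIZE once for each unordered pair of regions."""
    samples = {
        region: group["FIRE_SIZE"].dropna()
        for region, group in fires.groupby("REGION")
    }
    rows = []
    for a, b in combinations(sorted(samples), 2):
        # equal_var=False selects Welch's t-test (unequal variances allowed)
        t, p = ttest_ind(samples[a], samples[b], equal_var=False)
        conclusion = "reject" if p < alpha else "fail to reject"
        rows.append([a, b, abs(t), p, conclusion])
    return pd.DataFrame(
        rows, columns=["region_a", "region_b", "t-value", "p-value", "conclusion"]
    )


# Made-up sizes, purely illustrative (not taken from the real dataset)
toy = pd.DataFrame({
    "REGION": ["West"] * 5 + ["South"] * 5 + ["Midwest"] * 5,
    "FIRE_SIZE": [10, 12, 11, 13, 9,
                  1.0, 2.0, 1.5, 2.5, 1.0,
                  1.2, 1.8, 1.6, 2.4, 1.1],
})

welch = pairwise_welch(toy)
print(welch)
```

On the real data the call would be `pairwise_welch(df)`. Because several pairs are tested at once, a stricter `alpha` could be passed in if multiple comparisons are a concern.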

In [0]:
# Build a dataframe showing, for each region,
# the number of fires and the mean acreage,
# then plot it

regions = ['South', 'West', 'Midwest', 'Northeast']

tmp = []

for r in regions:
  tmpdf = df[df['REGION'] == r]
  
  total_fires = len(tmpdf)  # number of fire records in this region
  mean_acres = tmpdf['FIRE_SIZE'].mean() 
  
  tmp.append([r, 
              total_fires, 
              mean_acres])

regdf = pd.DataFrame(tmp, columns=['Region', 
                                   'Total Fires', 
                                   'Average Acreage'])



In [66]:
regdf.head()

Unnamed: 0,Region,Total Fires,Average Acreage
0,South,950358,27.395502
1,West,589422,182.758497
2,Midwest,178933,33.018378
3,Northeast,139671,2.351031
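The comment in the cell above mentions plotting this table. One way that could look is sketched below: two side-by-side bar charts, one for the number of fires and one for the mean acreage. The values are copied from the `regdf` output above so the block runs on its own; the function name `plot_region_summary`, the figure size, and the axis labels are my own choices rather than anything from the original notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Values copied from the regdf output shown above
regdf = pd.DataFrame({
    "Region": ["South", "West", "Midwest", "Northeast"],
    "Total Fires": [950358, 589422, 178933, 139671],
    "Average Acreage": [27.395502, 182.758497, 33.018378, 2.351031],
})


def plot_region_summary(summary: pd.DataFrame):
    """Draw fire counts and mean fire size per region as two bar charts."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))

    summary.plot.bar(x="Region", y="Total Fires", ax=axes[0], legend=False)
    axes[0].set_title("Fires per region")
    axes[0].set_ylabel("Number of fires")

    summary.plot.bar(x="Region", y="Average Acreage", ax=axes[1], legend=False)
    axes[1].set_title("Average fire size per region")
    axes[1].set_ylabel("Mean acres burned per fire")

    fig.tight_layout()
    return fig, axes


fig, axes = plot_region_summary(regdf)
```

Putting the two panels next to each other makes the contrast visible: the South has the most fires, while the West has by far the largest average fire.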
