# MODIS Open Question Exploration (v1.0)
Casey A Graff

August 11th, 2017

In [None]:
REP_DIR = '/home/cagraff/Documents/dev/fire_prediction/'
SRC_DIR = REP_DIR + 'src/'
DATA_DIR = REP_DIR + 'data/'

# Load system-wide packages
import os
from matplotlib import pyplot as plt
import numpy as np
from scipy.stats import cumfreq
from scipy.stats import pearsonr
%matplotlib inline

# Load project packages
os.chdir(SRC_DIR)
from features.loaders import load_modis_df
from features.helper import date_util as du

plt.rcParams['figure.figsize'] = [15,7]

In [None]:
# Load data
modis_df = load_modis_df(os.path.join(DATA_DIR, 'interim/modis/modis_df_2007-2016.pkl'))

In [None]:
# Add local_time column to modis_df
modis_df = modis_df.assign(datetime_local=map(lambda x: du.utc_to_local_time(x[0], x[1], du.round_to_nearest_quarter_hour), zip(modis_df.index, modis_df.lon)))

In [None]:
modis_df
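
The implementation of `du.utc_to_local_time` is not shown in this notebook. As a rough illustration only, the sketch below assumes that "local time" means mean solar time derived from longitude (4 minutes per degree) and that the rounding helper snaps to the nearest 15 minutes. Both function names here are hypothetical stand-ins, not the project's code.

```python
from datetime import datetime, timedelta

def utc_to_solar_local_time(dt_utc, lon_deg):
    """Approximate local mean solar time (assumption: 4 minutes per degree of longitude)."""
    return dt_utc + timedelta(minutes=4.0 * lon_deg)

def round_to_quarter_hour(dt):
    """Round a datetime to the nearest 15 minutes."""
    base = dt.replace(minute=0, second=0, microsecond=0)
    seconds = (dt - base).total_seconds()
    quarters = int(seconds / 900.0 + 0.5)  # 900 s = 15 min
    return base + timedelta(minutes=15 * quarters)

# A detection at 22:40 UTC over central Alaska (lon = -150) is 10 hours behind UTC
example_utc = datetime(2015, 6, 21, 22, 40)
example_local = round_to_quarter_hour(utc_to_solar_local_time(example_utc, -150.0))
```

Using solar time rather than the civil time zone keeps the overpass times comparable across the whole longitude range of Alaska.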

## Timing of satellite measurements

At what times of the day do the satellites fly over Alaska (in local time)? Are the passes evenly spaced, as is expected at the equator? Is there a significant difference between day and night detections? Is there a difference between the north and south of Alaska?

### Temporal distribution of detections


In [None]:
def dt_to_hours_frac(dt):
    return dt.hour + (dt.minute / 60.)

# Plot a histogram of the times of measurements across all of Alaska


times = map(dt_to_hours_frac, modis_df.datetime_local)

_ = plt.hist(times, 75, facecolor='green', alpha=.75)
_ = plt.title('Time distribution of fire detections')
_ = plt.xlabel('Local Time (hourly)')

There are substantially fewer detections in the evening and early-morning passes than at mid-day. However, with two satellites each making a day and a night pass, four groupings would be expected, and only three are visible. This suggests that the two satellites overlap during one time period (most likely mid-day).

### Temporal distribution of detections (separated by satellite)

In [None]:
# Plot a histogram of the times of measurements across all of Alaska (Aqua vs Terra)
times_terra = map(dt_to_hours_frac, modis_df[modis_df.sat=='T'].datetime_local)
times_aqua = map(dt_to_hours_frac, modis_df[modis_df.sat=='A'].datetime_local)

_ = plt.hist(times_terra, 75, facecolor='red', alpha=.5)
_ = plt.hist(times_aqua, 75, facecolor='blue', alpha=.5)
_ = plt.title('Time distribution of fire detections')
_ = plt.xlabel('Local Time (hourly)')

The orbits of Aqua and Terra overlap at mid-day. The satellites usually fly over at approximately 2:30 a.m., 11:00 a.m., 1:00 p.m., and 9:00 p.m. (local time).

If the satellites are "double-counting" fire pixels during their overlap, then the drop-off in detections between mid-day and evening is much smaller than originally thought. However, the drop-off into early morning is still significant.
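
One way to probe the double-counting question is to count how many (day, grid cell) pairs were seen by both satellites. The sketch below is a minimal illustration on made-up records, not a result from `modis_df`. The function name, the tuple layout, and the 0.05 degree cell size are all assumptions chosen for the example.

```python
from collections import defaultdict

def count_shared_cells(detections, cell_deg=0.05):
    """Count (day, cell) keys seen by both Terra ('T') and Aqua ('A').

    detections: iterable of (day, lat, lon, sat) tuples (hypothetical layout).
    Returns (number of keys seen by both satellites, total number of keys).
    """
    sats_by_key = defaultdict(set)
    for day, lat, lon, sat in detections:
        key = (day, int(lat // cell_deg), int(lon // cell_deg))
        sats_by_key[key].add(sat)
    shared = sum(1 for sats in sats_by_key.values() if {'T', 'A'} <= sats)
    return shared, len(sats_by_key)

# Toy records (made up for illustration)
toy = [
    ('2015-06-21', 64.81, -147.52, 'T'),
    ('2015-06-21', 64.82, -147.53, 'A'),  # same day and cell as the Terra record
    ('2015-06-21', 66.12, -150.22, 'A'),
    ('2015-06-22', 64.81, -147.52, 'T'),
]
shared, total = count_shared_cells(toy)
```

A high ratio of `shared` to `total` during the mid-day window would support the double-counting explanation. The cell size matters: too coarse and neighbouring fires merge, too fine and the same fire seen from two viewing angles is split.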

### Temporal distribution of detections (separated by satellite & northern latitude)

In [None]:
# Calculate lat midpoint
from features.geometry.grid_conversion import alaska_bb
lat_midpoint = (alaska_bb[1] + alaska_bb[0]) / 2.

# Filter only northern Alaska
modis_df_north = modis_df[modis_df.lat > lat_midpoint]

print 'Comprises {:.2f}% of total data'.format(len(modis_df_north)/(.01 * len(modis_df)))

# Plot a histogram of the times of measurements in only northern Alaska (Aqua vs Terra)
times_terra = map(dt_to_hours_frac, modis_df_north[modis_df_north.sat=='T'].datetime_local)
times_aqua = map(dt_to_hours_frac, modis_df_north[modis_df_north.sat=='A'].datetime_local)

_ = plt.hist(times_terra, 75, facecolor='red', alpha=.5)
_ = plt.hist(times_aqua, 75, facecolor='blue', alpha=.5)
_ = plt.title('Time distribution of fire detections in northern Alaska')
_ = plt.xlabel('Local Time (hourly)')

The difference between all of Alaska and just the northern half is minimal because the northern half comprises about 85% of all fire detections. It is difficult to determine whether there is a significant shift in the overlap closer to the pole.

### Temporal distribution of detections (separated by satellite & southern latitude)

In [None]:
# Filter only southern Alaska
modis_df_south = modis_df[modis_df.lat < lat_midpoint]

print 'Comprises {:.2f}% of total data'.format(len(modis_df_south)/(.01 * len(modis_df)))

# Plot a histogram of the times of measurements in only southern Alaska (Aqua vs Terra)
times_terra = map(dt_to_hours_frac, modis_df_south[modis_df_south.sat=='T'].datetime_local)
times_aqua = map(dt_to_hours_frac, modis_df_south[modis_df_south.sat=='A'].datetime_local)

_ = plt.hist(times_terra, 75, facecolor='red', alpha=.5)
_ = plt.hist(times_aqua, 75, facecolor='blue', alpha=.5)
_ = plt.title('Time distribution of fire detections in southern Alaska')
_ = plt.xlabel('Local Time (hourly)')

The separation at mid-day is slightly more noticeable in the south, but it still seems rather minimal.

## Confidence

Can the confidence metric be used to throw out likely false alarms or to weight detections? Is there sufficient variance in the metric to be worth considering (or are most values very high)?

According to Giglio et al. (2003), the confidence is the geometric mean of the following five sub-confidences (each sub-confidence is smoothed to [0, 1]):

* C(1) = temperature of the pixel in the 4 µm channel (T_4)
* C(2) = mean absolute deviation of T_4 relative to "background"
* C(3) = mean absolute deviation of T_4 - T_11 relative to "background"
* C(4) = decreases based on number of adjacent pixels covered in clouds
* C(5) = decreases based on number of adjacent pixels that are water
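
The geometric mean has a property worth keeping in mind when reading the histogram: if any single sub-confidence is 0, the overall confidence is 0, no matter how strong the others are. The sketch below illustrates this with made-up sub-confidence values. The function name is hypothetical, and the values stay on the [0, 1] scale rather than the 0 to 100 scale of the `conf` column.

```python
def detection_confidence(sub_confidences):
    """Geometric mean of sub-confidences, each assumed to lie in [0, 1]."""
    product = 1.0
    for c in sub_confidences:
        product *= c
    return product ** (1.0 / len(sub_confidences))

# Made-up values: a hot pixel with no adjacent cloud or water ...
strong = detection_confidence([0.9, 0.8, 0.85, 1.0, 1.0])
# ... and the same pixel with C(5) driven to zero by adjacent water
vetoed = detection_confidence([0.9, 0.8, 0.85, 1.0, 0.0])
```

Here `vetoed` is exactly 0 even though the thermal terms are unchanged, so a single term can act as a veto rather than a gradual penalty.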

In [None]:
print 'Percent of conf ==100 is {}%'.format(sum(modis_df.conf==100) / (.01 * len(modis_df)))
print 'Percent of conf >=66 is {}%'.format(sum(modis_df.conf>=66) / (.01 * len(modis_df)))
print 'Percent of conf >=50 is {}%'.format(sum(modis_df.conf>=50) / (.01 * len(modis_df)))
print 'Percent of conf >=33 is {}%'.format(sum(modis_df.conf>=33) / (.01 * len(modis_df)))
print 'Percent of conf ==0 is {}%'.format(sum(modis_df.conf==0) / (.01 * len(modis_df)))

# Plot histogram
_, ax1 = plt.subplots()
_ = ax1.hist(modis_df.conf, 100, facecolor='blue', alpha=.5)
_ = plt.title('Confidence distribution')
_ = plt.xlabel('Confidence (0 to 100)')

# Plot "CDF" line
ax2 = ax1.twinx()
CY, _, _, _ = cumfreq(list(modis_df.conf), 100)
_ = ax2.plot(CY/CY[-1])

There is a substantial amount of variance in the confidence. About 15% of the data falls below 50% confidence, and a non-trivial share (3%) of confidences are exactly 0%.

Further investigation into the spatial distribution of the confidence may be insightful. It is possible that lower-confidence detections occur primarily around the coast or mountains, where there are more adjacent water or cloud pixels.
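
A first pass at the spatial question could bin detections into a coarse latitude/longitude grid and compare the mean confidence per cell. The sketch below is a minimal illustration on made-up records. The function name, the tuple layout, and the 1 degree cell size are assumptions, and no result from `modis_df` is implied.

```python
from collections import defaultdict

def mean_conf_by_cell(records, cell_deg=1.0):
    """Mean confidence per grid cell.

    records: iterable of (lat, lon, conf) tuples (hypothetical layout).
    Returns a dict mapping (lat_index, lon_index) to mean confidence.
    """
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum of conf, count]
    for lat, lon, conf in records:
        key = (int(lat // cell_deg), int(lon // cell_deg))
        totals[key][0] += conf
        totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

# Toy records (made up): two interior detections, two near the coast
toy = [
    (64.8, -147.5, 90),
    (64.2, -147.9, 70),
    (60.1, -151.3, 0),
    (60.7, -151.8, 20),
]
cells = mean_conf_by_cell(toy)
```

Plotting these per-cell means on a map would show whether low-confidence cells cluster along the coastline or mountain ranges.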

## Fire Radiative Power

How much does the FRP vary for fire pixels? Can it be used as a feature (or a target variable)? Is there a high correlation between FRP and confidence?

### FRP Variance

In [None]:
# Plot a histogram of FRP values across all of Alaska (clipped to 0-500 MW)
print 'Percent of FRP >=500 is {}%'.format(sum(modis_df.FRP>=500)/(.01 * len(modis_df)))
print 'Max value of FRP is {}'.format(max(modis_df.FRP))
_ = plt.hist(modis_df.FRP, 200, range=(0,500), facecolor='green', alpha=.75)
_ = plt.title('Distribution of fire radiative power (FRP)')
_ = plt.xlabel('FRP (MW)')

The FRP varies significantly (with some values going over 5000 MW), while most values fall into the range of 0 to 100 MW. If these values are reliable (and the wide range isn't attributable to some sensor issue), then FRP may be useful in separating small fires (within a pixel) from larger ones.

Does the FRP have more to do with the fuel that is burning (e.g. brush vs. house)?

It may be worth investigating whether higher temperatures are related to larger fire events. If so, is the larger event triggering a greater FRP, or are fire pixels with higher FRP more likely to spread? The opposite could also be true: a hot, isolated fire may have a large FRP because the calculation may use background pixels as a baseline. Information on how FRP is calculated would be useful.

### FRP Variance (log scale)

In [None]:
# Plot a histogram of FRP values across all of Alaska using log-spaced bins
print 'Percent of FRP >=500 is {}%'.format(sum(modis_df.FRP>=500)/(.01 * len(modis_df)))
print 'Max value of FRP is {}'.format(max(modis_df.FRP))
_ = plt.hist(modis_df.FRP, bins=10 ** np.linspace(np.log10(.01), np.log10(500), 200), facecolor='green', alpha=.75)
# Use a log x-axis so the log-spaced bins appear with equal width
_ = plt.xscale('log')
_ = plt.title('Distribution of fire radiative power (FRP), log scale')
_ = plt.xlabel('FRP (MW)')

### Correlation between FRP and confidence

In [None]:
# Plot FRP vs conf
_ = plt.scatter(modis_df.conf, modis_df.FRP)
_ = plt.title('FRP vs confidence')
_ = plt.xlabel('Confidence (0 to 100)')
_ = plt.ylabel('FRP (MW)')

# Plot lin. regr. line
fit = np.polyfit(modis_df.conf, modis_df.FRP, 1)
fit_fn = np.poly1d(fit) 
_ = plt.plot(modis_df.conf, fit_fn(modis_df.conf), '--k')

pcc, _ = pearsonr(modis_df.conf, modis_df.FRP)
print 'Pearson correlation coefficient is {}'.format(pcc)

There seems to be a very loose positive trend between the confidence and FRP. The large cluster at zero confidence is now even more interesting, because none of the nearby confidences have FRP values nearly as high as those found exactly at zero. This suggests that at least some of these zero-confidence fire detections were very abruptly brought to zero.
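
The scatter-plot observation can be checked numerically by comparing the typical FRP of zero-confidence detections with that of low but non-zero confidences. The sketch below is a minimal illustration on made-up (confidence, FRP) pairs. The function names and the `low_max=10` band are assumptions, and the numbers are not taken from `modis_df`.

```python
def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0

def frp_by_conf_band(pairs, low_max=10):
    """Median FRP for conf == 0 and for 0 < conf <= low_max."""
    zero = [frp for conf, frp in pairs if conf == 0]
    low = [frp for conf, frp in pairs if 0 < conf <= low_max]
    return median(zero), median(low)

# Toy (conf, FRP in MW) pairs, made up for illustration
toy = [(0, 850.0), (0, 40.0), (0, 1200.0),
       (5, 12.0), (8, 20.0), (10, 15.0), (90, 300.0)]
zero_median, low_median = frp_by_conf_band(toy)
```

If the zero-confidence median on the real data is far above the low-confidence median, that would be consistent with a single sub-confidence forcing otherwise strong detections to zero.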

## Fire Type
Is the fire type being used to remove all non-vegetation fires?

Previously, it was not. However, it is now being used, including for the data above.

In [None]:
old_num_detections = 96218
num_removed = old_num_detections - len(modis_df)
print 'Num of detections removed is {}'.format(num_removed)
print 'Percent of detections removed is {}%'.format(num_removed / (.01 * old_num_detections))

While we are now ignoring these non-vegetation fires, the amount previously included was extremely small.

## Leap Year
Does the MODIS dataset include leap days in its dates?

**Yes; however, no vegetation fire detections occurred on leap day in Alaska between 2007 and 2016.**

In [None]:
print 'Number of fire detections on Feb. 29th (leap day) is {}'.format(
    sum((modis_df.index.month==2) & (modis_df.index.day==29)))