## Introduction
The goal of this project is to predict wildfire ignition risk in California using a neural network trained on historical weather and wildfire data. Two approaches were attempted and scaled: a multilayer perceptron (MLP) and a long short term memory (LSTM) network. The most effective approach was a parallel long short term memory network (pLSTM). The neural network portion of this project had six major phases, each described in more detail below.

1. [Feature addition and smoothing](https://github.com/gperdrizet/wildfire/blob/master/notebooks/add_features.ipynb) - Added min, mean and max features for each weather variable and smoothed the whole dataset with a daily average.
2. [Fully stratified sampling](https://github.com/gperdrizet/wildfire/blob/master/notebooks/recursive_sampling.ipynb) - Dataset was split into fully stratified blocks of ~500,000 observations.
3. [Small scale MLP](https://github.com/gperdrizet/wildfire/blob/master/notebooks/keras_MLP_skopt.ipynb) - The stratified samples were used to optimize and test a 'deep' neural network architecture using a binary 'fire/no fire' paradigm.
4. [Scaled MLP]() - Insights gained from the small scale MLP were applied to the entire dataset.
5. [Single LSTM](https://github.com/gperdrizet/wildfire/blob/master/notebooks/keras_LSTM_skopt_one_spatial_bin.ipynb) - A simple long short term memory based network architecture was optimized and tested on the full 22 years of data in just one geospatial bin.
6. [Parallel LSTM](https://github.com/gperdrizet/wildfire/blob/master/notebooks/keras_parallel_LSTM_410input.ipynb) - A parallel LSTM network consisting of 410 inputs, one for each geospatial bin, was trained on all 22 years of data.

## 1. Feature addition and smoothing
The motivation for adding min, max and mean features and then smoothing by daily averaging was to 1) reduce the size of the dataset and 2) make the resolution of the weather data more closely match that of the fire data (i.e. day of fire discovery).

The first step was to add the min, max and mean features using a sliding window of 24 hr. The calculated values were added at the right edge of the window. This was done to reflect the temporal nature of the data - the conditions which led to a fire are likely to have occurred on and/or before the day that the fire was discovered. The results of the feature addition for a few representative weather variables are shown below.

![Example of three weather variables showing the effect of adding min, max and mean](https://github.com/gperdrizet/wildfire/blob/master/figures/min_max_mean_added.png?raw=true)
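The windowing step can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the `air_temp` column, the synthetic 3-hourly series and the `add_window_features` helper are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-hourly weather series for one location (column name assumed)
index = pd.date_range("2015-01-01", periods=24, freq="3h")
weather = pd.DataFrame({"air_temp": np.arange(24, dtype=float)}, index=index)


def add_window_features(df, columns, window="24h"):
    """Append trailing-window min, max and mean columns for each variable."""
    out = df.copy()
    for col in columns:
        # A time-based rolling window is trailing by default, so each value
        # is written at the right edge of the 24 hr window it summarizes.
        rolling = df[col].rolling(window)
        out[f"{col}_min"] = rolling.min()
        out[f"{col}_max"] = rolling.max()
        out[f"{col}_mean"] = rolling.mean()
    return out


features = add_window_features(weather, ["air_temp"])
print(features.tail(3))
```

Because the window is trailing, each row only summarizes observations at or before its own timestamp, which matches the reasoning that conditions leading to a fire precede its discovery.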

The second step in the data preparation was to further smooth the data by taking the daily mean of each weather variable. This was done to match the resolution of the weather data to that of the fire data. The weather data has a periodicity of 3 hours, while most of the fire data is no more precise than 'discovery day'. The results of the smoothing for a few representative weather variables are shown below.

![Example of three weather variables showing the effect of smoothing via daily average](https://github.com/gperdrizet/wildfire/blob/master/figures/smoothed_data.png?raw=true)
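As a rough sketch of the daily averaging, 3-hourly observations can be collapsed to one row per day with a pandas resample. The `air_temp` column and the synthetic data are assumptions for illustration; with several geospatial bins, the same operation would presumably be applied per bin.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-hourly series covering two full days (8 observations per day)
index = pd.date_range("2015-01-01", periods=16, freq="3h")
weather = pd.DataFrame({"air_temp": np.arange(16, dtype=float)}, index=index)

# Collapse to one row per calendar day by averaging that day's observations
daily = weather.resample("D").mean()
print(daily)
```

This reduces the row count by a factor of eight for 3-hourly data and puts the weather on the same daily grid as the fire discovery dates.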

## 2. Fully stratified sampling 


The motivation behind this phase of the project was twofold: 1) to reduce the size of the dataset for initial testing by breaking it up into samples of ~500,000 observations and 2) to ensure that the distribution of all variables is as similar as possible across the samples.

Early on in this project, it was observed that some features which *should* be of high importance were important in some training runs, but not in others (ex: month - California has a clear fire season). Ultimately, the best explanation was lucky vs. unlucky train/test splits. If the distribution of a given variable does not match reasonably well between the training and testing sets, it will have poor predictive power. Shown below is a comparison of several representative weather variables between two of the samples.

![Comparison of representative weather variable distributions between fully stratified samples](https://github.com/gperdrizet/wildfire/blob/master/figures/stratified_sample_comparison.png?raw=true)

A recursive strategy was employed to accomplish the sampling. The data was split randomly in two and a two-sample Kolmogorov–Smirnov test was conducted on all variables between the two halves. If none of the weather variable distributions were found to be significantly different (p >= 0.3), the split was accepted for further recursion. The base condition was n <= 500,000.
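The recursion described above might look roughly like the sketch below, which uses `scipy.stats.ks_2samp`. The function names, the column names and the demo data are hypothetical. Retrying with a fresh random split when a split is rejected (up to `max_tries`) is an assumption, since the text does not say how rejected splits were handled.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def halves_match(left, right, columns, p_threshold=0.3):
    """True if no variable differs significantly between the two halves."""
    return all(
        ks_2samp(left[col], right[col]).pvalue >= p_threshold for col in columns
    )


def stratified_blocks(df, columns, max_size=500_000, p_threshold=0.3,
                      max_tries=100, rng=None):
    """Recursively halve df until every block has at most max_size rows."""
    rng = np.random.default_rng(rng)
    # Base condition: the block is small enough to be used as a sample
    if len(df) <= max_size:
        return [df]
    half = len(df) // 2
    for _ in range(max_tries):
        order = rng.permutation(len(df))
        left, right = df.iloc[order[:half]], df.iloc[order[half:]]
        # Accept the split only if every variable passes the KS test
        if halves_match(left, right, columns, p_threshold):
            return (
                stratified_blocks(left, columns, max_size, p_threshold, max_tries, rng)
                + stratified_blocks(right, columns, max_size, p_threshold, max_tries, rng)
            )
    # Assumed behavior: give up after max_tries rejected splits
    raise RuntimeError("no acceptable split found")


# Small synthetic demo with a reduced block size so it runs quickly
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "air_temp": rng.normal(size=4000),
    "wind_speed": rng.gamma(2.0, size=4000),
})
blocks = stratified_blocks(demo, ["air_temp", "wind_speed"], max_size=1000, rng=rng)
print([len(block) for block in blocks])
```

Note that a threshold of p >= 0.3 is far stricter than the usual 0.05 cutoff for rejecting a split, so a sizeable fraction of random splits are discarded before one is accepted.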

## 3. Small scale multilayer perceptron
The first neural network constructed was a feed forward multilayer perceptron. The network was trained, validated and tested using a different fully stratified sample for each of the training, validation and testing datasets. scikit-optimize was used to conduct Gaussian process based (Bayesian) optimization of the model hyperparameters. The parameters optimized were:
1. Learning rate
2. Number of hidden layers
3. Units per hidden layer
4. Dropout rate in a single distal dropout layer
5. L2 regularization lambda coefficient
6. The output class weighting

## 4. Scaled multilayer perceptron

The scaled multilayer perceptron model took the network architecture and hyperparameters from section 3 and applied them to the entire dataset.

## 5. Single geospatial long short term memory network
The single input LSTM network works on the weather and fire time series from just one location in California. It consists of one LSTM layer, then one fully connected layer and a single output. It was optimized using scikit-optimize. The hyperparameters investigated were:

1. Learning rate
2. Size of the sliding time window used to construct samples
3. The number of LSTM units
4. The number of fully connected layers after the LSTM
5. The number of units in each fully connected layer
6. The L2 regularization lambda term in the fully connected layers
7. The output class weighting
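The sliding time window mentioned in item 2 turns a single daily time series into LSTM training samples. A minimal sketch is shown below. The `make_windows` helper is hypothetical, and pairing each window with the fire label of the day immediately after it is an assumption, since the text does not specify how labels were aligned.

```python
import numpy as np


def make_windows(features, labels, window):
    """Build inputs of shape (samples, window, n_features) and their labels."""
    X, y = [], []
    for end in range(window, len(features)):
        # Input: the `window` days immediately before day `end`
        X.append(features[end - window:end])
        # Target (assumed): fire / no fire on day `end`
        y.append(labels[end])
    return np.asarray(X), np.asarray(y)


# Hypothetical series: 10 days, 3 weather variables, binary fire label
features = np.arange(30, dtype=float).reshape(10, 3)
labels = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])

X, y = make_windows(features, labels, window=4)
print(X.shape, y.shape)
```

A longer window gives the LSTM more history per sample but yields fewer samples from the same series, which is why the window size is worth tuning alongside the other hyperparameters.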

## 6. Parallel geospatial long short term memory network
The parallel geospatial LSTM network is the most complex and original of the models presented here. It is an unconventional architecture tailored to the problem at hand and built with the Keras functional API. It consists of 410 input LSTM layers, each corresponding to one geospatial bin in California. The output from each LSTM is then fed through a merge layer and into a sequence of fully connected hidden layers. The final output is a fully connected layer with 410 units - one for the fire risk in each of the 410 geospatial bins.
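The overall shape of such a network can be sketched with the Keras functional API as below. This is an illustration of the idea, not the notebook's model: the `build_parallel_lstm` helper, the layer sizes, the window length and the feature count are placeholders, and using concatenation as the merge layer and a sigmoid on the output are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_parallel_lstm(n_bins=410, window=5, n_features=8,
                        lstm_units=4, hidden_units=(64,)):
    """One LSTM branch per geospatial bin, merged into shared dense layers."""
    inputs, branches = [], []
    for i in range(n_bins):
        # Each bin gets its own input and its own LSTM weights
        inp = keras.Input(shape=(window, n_features), name=f"bin_{i}")
        inputs.append(inp)
        branches.append(layers.LSTM(lstm_units, name=f"lstm_{i}")(inp))

    # Merge step (assumed to be a concatenation of the LSTM outputs)
    x = layers.Concatenate(name="merge")(branches)
    for j, units in enumerate(hidden_units):
        x = layers.Dense(units, activation="relu", name=f"hidden_{j}")(x)

    # One output unit per bin; sigmoid is an assumption for a per-bin risk
    outputs = layers.Dense(n_bins, activation="sigmoid", name="fire_risk")(x)
    return keras.Model(inputs=inputs, outputs=outputs)


# Tiny instance so the sketch builds quickly; the real network has 410 bins
model = build_parallel_lstm(n_bins=3)
print(model.output_shape)
```

Sharing the dense layers after the merge lets the prediction for one bin draw on the weather history of every other bin, which a set of independent single-bin LSTMs could not do.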