<table style="width: 100%; border-collapse: collapse;" border="0">
<tr>
<td><b>Created:</b> Monday 30 January 2017</td>
<td style="text-align: right;"><a href="https://www.github.com/rhyswhitley/fire_limitation">github.com/rhyswhitley/fire_limitation</a></td>
</tr>
</table>

<div>
<center>
<font face="Times">
<br>
<h1>Quantifying the uncertainty of a global fire limitation model using Bayesian inference</h1>
<h2>Part 1: Staging data for analysis</h2>
<br>
<br>
<sup>1,* </sup>Douglas Kelley, 
<sup>2 </sup>Ioannis Bistinas, 
<sup>3, 4 </sup>Chantelle Burton, 
<sup>1 </sup>Tobias Marthews, 
<sup>5 </sup>Rhys Whitley
<br>
<br>
<br>
<sup>1 </sup>Centre for Ecology and Hydrology, Maclean Building, Crowmarsh Gifford, Wallingford, Oxfordshire, United Kingdom
<br>
<sup>2 </sup>Vrije Universiteit Amsterdam, Faculty of Earth and Life Sciences, Amsterdam, Netherlands
<br>
<sup>3 </sup>Met Office United Kingdom, Exeter, United Kingdom
<br>
<sup>4 </sup>Geography, University of Exeter, Exeter, United Kingdom
<br>
<sup>5 </sup>Natural Perils Pricing, Commercial & Consumer Portfolio & Pricing, Suncorp Group, Sydney, Australia
<br>
<br>
<h3>Summary</h3>
<hr>
<p> 
This notebook processes the separate netCDF4 files for the model drivers (X<sub>i=1, 2, ... M</sub>) and the model target (Y) into a unified tabular data frame, exported as a comma-separated values (CSV) file. This file is subsequently used in the Bayesian inference study that forms the second notebook in this experiment. Pre-processing the data separately from the analysis allows it to be staged quickly on demand. Other file formats (e.g. an SQLite3 database file) may be more advantageous where greater compression is needed.
</p>
<br>
<b>You will need to run this notebook to prepare the dataset before you attempt the Bayesian analysis in Part 2</b>.
<br>
<br>
<br>
<i>Python code and calculations below</i>
<br>
<hr>
</font>
</center>
</div>

## Load libraries


In [64]:
# data munging and analytical libraries 
import re
import os
import numpy as np
import pandas as pd
from netCDF4 import Dataset 

# graphical libraries
import matplotlib.pyplot as plt
%matplotlib inline

# set paths
inPath = "../outputs/Australia_region/"
outPath = "../data/AUS_inference_data_2020"

startMnth = 12
nmonths_import = 18*12 + 1

## Import and clean data

Walk the input directory and collect all netCDF files that correspond to the model drivers and target, skipping the `full_period` files and `SE_TempBLRegion.nc`.

In [65]:
driver_paths_all = [os.path.join(dp, f) for (dp, _, fn) in os.walk(inPath) for f in fn if f.endswith('.nc')]
driver_paths_all = [path for path in driver_paths_all if 'full_period' not in  path]
driver_paths_all = [path for path in driver_paths_all if 'SE_TempBLRegion.nc' not in  path]
driver_names_all = [re.search('^[a-zA-Z_]*', os.path.basename(fp)).group(0) for fp in driver_paths_all]

file_table = pd.DataFrame({'filepath': driver_paths_all, 'file_name': driver_names_all})

def checkFilename(driver_name, sat):
    
    if "firecount" in driver_name:
        if sat in driver_name: return True
        return False
    
    return True

# Satellite-specific filtering (via checkFilename) is currently disabled,
# so every file found above is used.
driver_paths = np.array(driver_paths_all)
driver_names = np.array(driver_names_all)
print(driver_paths)

driver_info = [driver_paths, driver_names]

['../outputs/Australia_region/burnt_area-GFED4s_2.5degree_2001-2016.nc'
 '../outputs/Australia_region/firecount-SE_Aus_2001_onwards.nc'
 '../outputs/Australia_region/climate/from_2001/emc-2001-2020.nc'
 '../outputs/Australia_region/climate/from_2001/air2001-2020.nc'
 '../outputs/Australia_region/climate/from_2001/lightning2001-2020.nc'
 '../outputs/Australia_region/climate/from_2001/precip-2001-2020.nc'
 '../outputs/Australia_region/climate/from_2001/relative_humidity2001-2020.nc'
 '../outputs/Australia_region/climate/from_2001/rhumMax.2001-2020.nc'
 '../outputs/Australia_region/climate/from_2001/rhumMaxMax.2001-2020.nc'
 '../outputs/Australia_region/climate/from_2001/soilw.0-10cm.gauss.2001-2020.nc'
 '../outputs/Australia_region/climate/from_2001/swnd.2001-2020.nc'
 '../outputs/Australia_region/climate/from_2001/swndMax.2001-2020.nc'
 '../outputs/Australia_region/climate/from_2001/tmax.2001-2020.nc'
 '../outputs/Australia_region/climate/from_2001/tmaxMax.2001-2020.nc'
 '../outputs/Aus
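
As a minimal sketch of how the driver names are derived, the snippet below applies the same regular expression to a few file names copied from the listing above. The helper `driver_name` and the directory `some/dir` are hypothetical and only serve the illustration.

```python
import os
import re

# File names copied from the listing above
examples = [
    'burnt_area-GFED4s_2.5degree_2001-2016.nc',
    'air2001-2020.nc',
    'soilw.0-10cm.gauss.2001-2020.nc',
    'rhumMax.2001-2020.nc',
]

def driver_name(path):
    """Hypothetical helper: leading run of letters/underscores in the base name."""
    return re.search('^[a-zA-Z_]*', os.path.basename(path)).group(0)

# 'some/dir' is a placeholder directory, not a path from the project
names = [driver_name(os.path.join('some/dir', f)) for f in examples]
print(names)
```

Because the pattern stops at the first digit, dot or hyphen, two files whose names begin with the same letters map to the same driver name. This may be why some names repeat in the shape listing further down.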

Define a function to extract the variable values from each netCDF4 file. Each variable is flattened from a 3-dimensional array to a 1-dimensional one, pooling all values both spatially and temporally. Records shorter than the import window are padded by repeating their final 12 months.

Don't know if this is the correct way to do this, but will come back to it once I understand the model (and its optimisation) better.

In [66]:
def nc_extract(fpath):
    print(fpath)
    print("Processing: {0}".format(fpath))
    with Dataset(fpath, 'r') as nc_file:
        try:
            # read the 3D array (time first) and drop the first startMnth months
            gdata = np.array(nc_file.variables['variable'][:,:,:])
            gdata = gdata[startMnth:,:,:]
        except Exception:
            # files without a readable 3D 'variable' field are skipped (returns None)
            return
        if (len(gdata) < nmonths_import):
            # record is too short: pad by repeating the final 12 months
            lastYr = gdata[-12:]
            nmonths_missing = nmonths_import - gdata.shape[0]
            nyrs_extra = nmonths_missing // 12
            nmths_extra = nmonths_missing - 12 * nyrs_extra

            addon = np.tile(lastYr, (nyrs_extra, 1, 1))
            gdata = np.append(gdata, addon, 0)
            gdata = np.append(gdata, lastYr[0:nmths_extra], 0)
        else:
            gdata = gdata[:nmonths_import, :, :]
        # large negative fill-value flags become NaN so they can be dropped later
        gdata[gdata < -9E9] = np.nan
        # np.array() above already yields a plain ndarray, so no mask handling is needed
        return gdata.flatten()

Execute the above function on all netCDF4 file paths.

In [67]:
values = [nc_extract(dp) for dp in driver_paths]

../outputs/Australia_region/burnt_area-GFED4s_2.5degree_2001-2016.nc
Processing: ../outputs/Australia_region/burnt_area-GFED4s_2.5degree_2001-2016.nc
../outputs/Australia_region/firecount-SE_Aus_2001_onwards.nc
Processing: ../outputs/Australia_region/firecount-SE_Aus_2001_onwards.nc
../outputs/Australia_region/climate/from_2001/emc-2001-2020.nc
Processing: ../outputs/Australia_region/climate/from_2001/emc-2001-2020.nc
../outputs/Australia_region/climate/from_2001/air2001-2020.nc
Processing: ../outputs/Australia_region/climate/from_2001/air2001-2020.nc
../outputs/Australia_region/climate/from_2001/lightning2001-2020.nc
Processing: ../outputs/Australia_region/climate/from_2001/lightning2001-2020.nc
../outputs/Australia_region/climate/from_2001/precip-2001-2020.nc
Processing: ../outputs/Australia_region/climate/from_2001/precip-2001-2020.nc
../outputs/Australia_region/climate/from_2001/relative_humidity2001-2020.nc
Processing: ../outputs/Australia_region/climate/from_2001/relative_humidit
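
To make the padding and flattening in `nc_extract` easier to follow, here is a small sketch on a synthetic array. The helper `pad_with_last_year` and the toy grid are hypothetical; they mirror the logic of the function above but are not part of the notebook's pipeline.

```python
import numpy as np

def pad_with_last_year(gdata, n_target):
    """Hypothetical helper: trim to n_target months, or pad by repeating the last 12."""
    n_have = gdata.shape[0]
    if n_have >= n_target:
        return gdata[:n_target]
    last_year = gdata[-12:]
    n_years, n_months = divmod(n_target - n_have, 12)
    parts = [gdata] + [last_year] * n_years + [last_year[:n_months]]
    return np.concatenate(parts, axis=0)

# 24 months on a 2 x 3 grid; every cell holds its month index (0..23)
toy = np.arange(24, dtype=float).reshape(24, 1, 1) * np.ones((1, 2, 3))

# Ask for one extra year plus one month, as nmonths_import does with its "+ 1"
padded = pad_with_last_year(toy, 24 + 12 + 1)
flat = padded.flatten()

print(padded.shape)           # months, rows, columns
print(padded[-13:, 0, 0])     # month indices of the padded tail at one cell
print(flat.shape)             # all months and cells pooled into one vector
```

The padded tail repeats months 12 to 23 and then starts the cycle again at month 12, so a short record such as the burnt area file is extended with a copy of its final year rather than with missing values.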

Check that every extracted vector has the same length, then turn the list into a dataframe for the analysis.

In [68]:
for v, n in zip(values, driver_names):
    print(n)
    print(v.shape)

burnt_area
(54684,)
firecount
(54684,)
emc
(54684,)
air
(54684,)
lightning
(54684,)
precip
(54684,)
relative_humidity
(54684,)
rhumMax
(54684,)
rhumMaxMax
(54684,)
soilw
(54684,)
swnd
(54684,)
swndMax
(54684,)
tmax
(54684,)
tmaxMax
(54684,)
wetdays
(54684,)
cropland
(54684,)
fract_agr
(54684,)
pasture
(54684,)
population_density
(54684,)
MaxOverMean_soilw
(54684,)
MeanAnnaul_soilw
(54684,)
nonetreecover
(54684,)
treecover
(54684,)
vegcover
(54684,)
MeanAnnual_soilw
(54684,)
MaxOverMean_soilw
(54684,)
MeanAnnual_soilw
(54684,)


In [69]:
# turn the list of flattened arrays into a dataframe (one column per driver)
def makeDataFrame(values, driver_names):
    df = pd.DataFrame(np.array(values).T, columns=driver_names)
    df.info()
    # drop all null rows (these are ocean cells and are not needed in the optimisation)
    df.dropna(inplace=True)
    return df

fire_df = makeDataFrame(values, driver_names)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54684 entries, 0 to 54683
Data columns (total 27 columns):
burnt_area            26040 non-null float32
firecount             26040 non-null float32
emc                   26040 non-null float32
air                   26040 non-null float32
lightning             26040 non-null float32
precip                26040 non-null float32
relative_humidity     26040 non-null float32
rhumMax               26040 non-null float32
rhumMaxMax            26040 non-null float32
soilw                 26040 non-null float32
swnd                  26040 non-null float32
swndMax               26040 non-null float32
tmax                  26040 non-null float32
tmaxMax               26040 non-null float32
wetdays               26040 non-null float32
cropland              26040 non-null float32
fract_agr             26040 non-null float32
pasture               26040 non-null float32
population_density    26040 non-null float32
MaxOverMean_soilw     26040 non-null
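
The construction above can be sketched on toy data: stacking equal-length vectors, transposing so each becomes a column, and dropping rows that contain a null. The values and the three-column layout are made up for illustration. The last lines show one way to spot repeated column names, which is an optional check and not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the flattened driver arrays (values are made up)
toy_values = [
    np.array([0.1, np.nan, 0.3, 0.4]),     # target with one ocean cell
    np.array([10.0, np.nan, 30.0, 40.0]),
    np.array([1.0, np.nan, 3.0, 4.0]),
]
toy_names = ['burnt_area', 'precip', 'tmax']

# Rows are grid-cell/month samples, columns are drivers
toy_df = pd.DataFrame(np.array(toy_values).T, columns=toy_names)
toy_clean = toy_df.dropna()
print(toy_clean)

# Optional check: flag column names that occur more than once
dupes = pd.Index(['treecover', 'MeanAnnual_soilw', 'MeanAnnual_soilw']).duplicated()
print(dupes)
```

Note that `dropna` keeps the original row labels, which is consistent with the gaps in the index of the `tail` output further down.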

Check that we've built it correctly.

In [70]:
fire_df.tail(10)

Unnamed: 0,burnt_area,firecount,emc,air,lightning,precip,relative_humidity,rhumMax,rhumMaxMax,soilw,...,pasture,population_density,MaxOverMean_soilw,MeanAnnaul_soilw,nonetreecover,treecover,vegcover,MeanAnnual_soilw,MaxOverMean_soilw.1,MeanAnnual_soilw.1
54609,0.000103252,2.664952,0.225115,23.937891,811.06012,1.894206,73.983871,19.0,63.0,0.418116,...,0.02291739,20.278248,1.279936,0.364582,0.633996,0.217875,0.851871,0.364582,1.484281,0.307859
54622,0.0,2.667886,0.194782,20.699989,47.038242,0.060897,58.612904,10.0,31.0,0.0,...,0.0009379424,10.914135,4.032984,0.067369,0.634177,0.124854,0.759031,0.067369,3.28106,0.088745
54623,2.449216e-07,0.0,0.15372,23.068541,93.908463,0.179685,44.55645,5.0,22.0,0.0,...,0.01640226,24.349199,2.62349,0.113537,0.717812,0.083411,0.801224,0.113537,3.351756,0.111221
54624,4.421814e-05,0.087942,0.135244,25.321762,170.265564,0.487633,39.419353,8.0,24.0,0.0,...,5.930538e-05,1.83521,11.749692,0.007124,0.727891,0.062109,0.79,0.007124,6.230038,0.039342
54625,1.082842e-05,0.025806,0.146679,25.505629,302.761139,0.841489,44.177418,10.0,33.0,0.116142,...,-1.255226e-18,2.567215,1.943547,0.123041,0.725469,0.059948,0.785417,0.123041,2.317078,0.160295
54626,6.192541e-08,8.39992,0.18953,23.228214,423.701141,0.980087,59.040321,11.0,45.0,0.303826,...,0.01605949,5.741051,1.349902,0.297553,0.698984,0.120599,0.819583,0.297553,1.323735,0.279937
54627,3.961237e-06,9.667947,0.243641,21.349989,755.699097,1.399891,78.80645,20.0,63.0,0.394512,...,0.03286197,53.304996,1.336598,0.335186,0.576953,0.349557,0.92651,0.335186,1.494655,0.289251
54642,1.711179e-05,0.315652,0.195679,20.131447,202.003326,1.828684,58.403225,7.0,27.0,0.0,...,0.01945221,25.281853,2.889338,0.068072,0.752437,0.182239,0.934676,0.068072,3.436119,0.080267
54643,0.0001193489,0.052988,0.194018,20.384668,153.439835,2.391738,57.991936,11.0,39.0,0.0,...,0.1247526,82.593353,3.252257,0.054652,0.691615,0.240339,0.931953,0.054652,3.293818,0.071328
54644,9.288811e-07,15.048178,0.216472,19.814505,401.682892,2.62409,66.403229,20.0,52.0,0.220507,...,0.0587946,5.707099,2.491471,0.112132,0.47028,0.463988,0.934268,0.112132,2.155988,0.112192


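Export this to disk for use by the analysis notebook. Note that the cell below writes a plain, uncompressed CSV; gzip compression can be requested through the `compression` argument of `to_csv` to save space. With roughly 26,000 non-null values per column here the export is quick, but a larger dataset with millions of rows may take some time.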

In [71]:

outPathi = outPath + '.csv'
savepath = os.path.expanduser(outPathi)
fire_df.to_csv(savepath, index=False)
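
If a compressed export is preferred, a minimal sketch with pandas looks like the following. The toy frame, the temporary directory and the file name `toy_inference_data.csv.gz` are assumptions for illustration; the notebook itself writes an uncompressed `.csv`.

```python
import os
import tempfile
import pandas as pd

# Toy frame standing in for fire_df (values are made up)
toy_df = pd.DataFrame({'burnt_area': [0.0, 0.5], 'precip': [1.25, 2.0]})

with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, 'toy_inference_data.csv.gz')
    # write a gzip-compressed CSV without the index column
    toy_df.to_csv(gz_path, index=False, compression='gzip')
    # read_csv infers gzip compression from the .gz extension
    round_trip = pd.read_csv(gz_path)

print(round_trip.equals(toy_df))
```

If this route is taken, the analysis notebook would need to read the `.csv.gz` path instead of the `.csv` one.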

<div>
<br>
<br>
<br>
<center>
<font size="5">
<a style="font-weight: bold; size: 5" href="http://localhost:8888/notebooks/notebooks/bayesian_inference.ipynb">Part 2: click here</a>
</font>
</center>
</div>