# Wildfire Severity Prediction using Raster Tools and SciKit-Learn

#### REQUIRED SOFTWARE
- Python
    - raster-tools
    - scikit-learn
    - tqdm
    - numpy
    - dask
    - geopandas
    - dask_geopandas
    - joblib
    - ray

- Command Line
    - gdal
    - cdo
    - nco (provides the `ncks` command used in the NDVI processing script)
    - wget

#### About 300 GB of free disk space is recommended for the full Wildfire Severity Prediction process. With less space, the data may need to be processed in chunks.
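
Free space can be checked from Python before any downloads start. The sketch below uses only the standard library; the helper name `has_free_space` is illustrative and the 300 GB default simply mirrors the recommendation above.

```python
import shutil


def has_free_space(path=".", required_gb=300):
    """Return True if the drive holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    if not has_free_space("."):
        print("Less than 300 GB free: consider processing the data in chunks.")
```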

# 1. Data Collection

### 1.1 The data needed to run the script comes from multiple sources.  The table below lists each source and where to download it.

<table><tr><th>Source</th><th>Link</th><th>Description</th></tr>
<tr><td>MTBS Fire Data</td><td>https://www.mtbs.gov/direct-download</td><td>Fire Bundles -> Burned Areas Boundaries & Burn Severity Mosaics -> 1986-2020 of desired state</td></tr>
<tr><td>DEM Data</td><td>https://earthexplorer.usgs.gov/</td><td>Data sets -> Digital elevation -> CONUS aspect, flow_acc, orig_dem, slope</td></tr>
<tr><td>gridMET Climate Data</td><td>https://www.climatologylab.org/gridmet.html</td><td>use the provided scripts</td></tr>
<tr><td>AdaptWest Climate Data</td><td>https://adaptwest.databasin.org/pages/adaptwest-climatena/</td><td>Climate Normals -> 1991-2020 period -> 33 Bioclimatic variables zip</td></tr>
<tr><td>DayMet Climate Data</td><td>https://daac.ornl.gov/cgi-bin/dataset_lister.pl?p=32</td><td>use the provided scripts</td></tr>
<tr><td>Landfire Data</td><td>https://www.landfire.gov/version_download.php</td><td>LF 2016 Remap -> Fuel Veg Type 2020 & 40 Scott/Burgan Fuel Models 2020</td></tr>
<tr><td>Biomass Data</td><td>https://rangelands.app/products/</td><td>use the provided scripts</td></tr>
<tr><td>NDVI Data</td><td>https://www.ncei.noaa.gov/data/land-normalized-difference-vegetation-index/access/</td><td>need 1986-2020, either manually or using the provided scripts</td></tr>
<tr><td>State Borders</td><td>https://www2.census.gov/geo/tiger/GENZ2018/shp/</td><td>need cb_2018_us_state_5m.zip</td></tr>
</table>

#### Scripted methods are available to obtain some of the data.  They are as follows:

#### Biomass download and processing:

In [None]:
import os
import subprocess

# Set the desired area of interest.
# Change the values below to match your area of interest
# (a tool such as http://bboxfinder.com/ is a convenient way to get the coordinates).
STATE = "Oregon"
LONG_MIN = -124.85
LONG_MAX = -116.33
LAT_MIN = 41.86
LAT_MAX = 46.23

# Loop through years 2020 to 1986 to download biomass data
years = list(range(2020, 1985, -1))

for year in years:
    tif_path = f"out{year}_{STATE}.tif"

    # Download biomass data clipped to the bounding box
    # (-projwin expects: upper-left x, upper-left y, lower-right x, lower-right y)
    subprocess.run(["gdal_translate", "-co", "compress=lzw", "-co", "tiled=yes", "-co", "bigtiff=yes",
                    f"/vsicurl/http://rangeland.ntsg.umt.edu/data/rap/rap-vegetation-biomass/v3/vegetation-biomass-v3-{year}.tif",
                    "-projwin", str(LONG_MIN), str(LAT_MAX), str(LONG_MAX), str(LAT_MIN),
                    tif_path], check=True)

    # Convert to netCDF format. The output is named {year}_biomass.nc so that it
    # matches the input expected by the date-fixing script.
    subprocess.run(["gdal_translate", "-of", "netCDF", "-co", "FORMAT=NC4",
                    tif_path, f"{year}_biomass.nc"], check=True)

    # Remove the temporary GeoTIFF. os.remove is used because subprocess.run
    # does not expand shell wildcards such as "*.tif".
    os.remove(tif_path)
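
`gdal_translate -projwin` takes the window as upper-left x, upper-left y, lower-right x, lower-right y, which is why the call above passes `LONG_MIN, LAT_MAX, LONG_MAX, LAT_MIN` rather than min/max pairs. A small helper makes that ordering explicit; the name `projwin_args` is hypothetical and not part of the original scripts.

```python
def projwin_args(long_min, lat_min, long_max, lat_max):
    """Build the -projwin arguments for gdal_translate from a lon/lat bounding box."""
    if long_min >= long_max or lat_min >= lat_max:
        raise ValueError("bounding box minimums must be smaller than maximums")
    # Order required by gdal_translate: ulx, uly, lrx, lry
    return ["-projwin", str(long_min), str(lat_max), str(long_max), str(lat_min)]


# The Oregon bounding box used in the script above
oregon = projwin_args(-124.85, 41.86, -116.33, 46.23)
```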

After the biomass netCDF files are ready, the dates need to be fixed. Save and run the following script to fix the dates and combine the files into one netCDF file.

In [None]:
import glob
import os
import shutil
import subprocess

# Set the desired state
STATE = "OR"

# Loop through years 2020 to 1986
years = list(range(2020, 1985, -1))

for year in years:
    # Every fourth year in 1986-2020 is a leap year (2000 included)
    if year % 4 == 0:
        year_days = 366
    else:
        year_days = 365

    # Fix dates and bands of biomass data
    subprocess.run(["cdo", "settaxis,{}-01-01,00:00,{}days".format(year, year_days),
                    "{}_biomass.nc".format(year),
                    "{}_biomass_fixed.nc".format(year)], check=True)

    # Split the data into individual bands
    subprocess.run(["cdo", "splitvar",
                    "{}_biomass_fixed.nc".format(year),
                    "{}_{}_biomass_".format(STATE, year)], check=True)

    # Remove temporary files
    os.remove("{}_biomass.nc".format(year))
    os.remove("{}_biomass_fixed.nc".format(year))

# Create directories for bands
os.makedirs("b1", exist_ok=True)
os.makedirs("b2", exist_ok=True)

# Move band files to respective directories.
# os.rename does not expand wildcards, so the files are matched with glob first.
for path in glob.glob("*Band1.nc"):
    shutil.move(path, "b1")
for path in glob.glob("*Band2.nc"):
    shutil.move(path, "b2")

# Change directory to b1
os.chdir("b1")

# Concatenate band files
subprocess.run(["cdo", "-f", "nc4", "-z", "zip", "cat", "*.nc",
                "biomass_afg_1986_2020_{}.nc".format(STATE)], check=True)

# Change directory to b2
os.chdir("../b2")

# Concatenate band files
subprocess.run(["cdo", "-f", "nc4", "-z", "zip", "cat", "*.nc",
                "biomass_pfg_1986_2020_{}.nc".format(STATE)], check=True)

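The `year % 4 == 0` test in the script is sufficient for 1986-2020, since 2000 is a leap year. If the year range is ever extended, the standard library's `calendar.isleap` also handles century years correctly. A minimal sketch (the function name `days_in_year` is illustrative):

```python
import calendar


def days_in_year(year):
    """Return 366 for leap years and 365 otherwise."""
    return 366 if calendar.isleap(year) else 365


# Leap years within the 1986-2020 range used throughout this document
leap_years = [y for y in range(1986, 2021) if calendar.isleap(y)]
```
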
#### gridMET download and processing:

In [None]:
import subprocess

# Loop through years 2020 to 1986
years = list(range(2020, 1985, -1))

for year in years:
    # Download vpd, srad, and pdsi data for each year
    subprocess.run(["wget", "-nc", "-c", "-nd",
                    f"http://www.northwestknowledge.net/metdata/data/vpd_{year}.nc"], check=True)
    subprocess.run(["wget", "-nc", "-c", "-nd",
                    f"http://www.northwestknowledge.net/metdata/data/srad_{year}.nc"], check=True)
    subprocess.run(["wget", "-nc", "-c", "-nd",
                    f"http://www.northwestknowledge.net/metdata/data/pdsi_{year}.nc"], check=True)

# Merge the downloaded data for each variable
for var in ["vpd", "srad", "pdsi"]:
    subprocess.run(["cdo", "-f", "nc4", "-z", "zip", "cat", f"{var}_*.nc",
                    f"{var}_1986-2020.nc"], check=True)

    # Calculate weekly mean for each variable
    subprocess.run(["cdo", "-f", "nc4", "-z", "zip", "-timselmean,7",
                    f"{var}_1986-2020.nc", f"{var}_1986_2020_weekly.nc"], check=True)
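
Before merging, it can help to confirm that every yearly file is present, since the wildcard merge only picks up files that exist and a missing year would leave a gap in the combined time series. The sketch below assumes the `{var}_{year}.nc` naming used by the download loop; `missing_yearly_files` is a hypothetical helper, not part of the original scripts.

```python
import os


def missing_yearly_files(variables, years, directory="."):
    """List the expected {var}_{year}.nc files that are not present in `directory`."""
    missing = []
    for var in variables:
        for year in years:
            name = f"{var}_{year}.nc"
            if not os.path.isfile(os.path.join(directory, name)):
                missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_yearly_files(["vpd", "srad", "pdsi"], range(1986, 2021))
    if gaps:
        print("Missing files:", ", ".join(gaps))
```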


#### dayMET download and processing:

In [None]:
import subprocess

# Loop through years 2020 to 1986
years = list(range(2020, 1985, -1))

for year in years:
    # Download tmax and tmin data for each year
    subprocess.run(["wget", "-nc", "-c", "-nd",
                    f"https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2131/daymet_v4_tmax_monavg_na_{year}.nc"],
                   check=True)
    subprocess.run(["wget", "-nc", "-c", "-nd",
                    f"https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2131/daymet_v4_tmin_monavg_na_{year}.nc"],
                   check=True)

# Merge the downloaded data for each variable
for var in ["tmax", "tmin"]:
    subprocess.run(["cdo", "-f", "nc4", "-z", "zip", "cat", f"daymet_v4_{var}_monavg_na_*.nc",
                    f"{var}_1986_2020.nc"], check=True)

#### NDVI download and processing:

In [None]:
import subprocess

# Loop through years 2020 to 1986
years = list(range(2020, 1985, -1))

for year in years:
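    # Download NDVI data for each year.
    # --cut-dirs=3 strips data/land-normalized-difference-vegetation-index/access/
    # so that files land in ./{year}/, which is where the averaging script looks.
    subprocess.run(["wget", "-erobots=off", "-nv", "-m", "-np", "-nH", "--cut-dirs=3",
                    "--reject", "index.html*", f"https://www.ncei.noaa.gov/data/land-normalized-difference-vegetation-index/access/{year}/"],
                   check=True)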

Average and combine the NDVI data:

In [None]:
import os
import subprocess

# Loop through years 2020 to 1986
years = list(range(2020, 1985, -1))

# Make sure the output directory for the weekly files exists
os.makedirs("weekly", exist_ok=True)

for year in years:
    # Change directory to the current year
    os.chdir(str(year))

    # Concatenate daily NDVI files into a single yearly file
    subprocess.run(["cdo", "cat", "*.nc", f"{year}_ndvi_daily.nc"], check=True)

    # Calculate weekly average NDVI for each year
    subprocess.run(["cdo", "-f", "nc4", "-z", "zip", "-timselmean,7",
                    f"{year}_ndvi_daily.nc", f"{year}_ndvi_weeklyavg.nc"], check=True)

    # Remove unnecessary variables and write the result into the 'weekly' directory.
    # Writing to a separate output file with -O (overwrite) keeps ncks from
    # stopping to ask about an existing output file.
    subprocess.run(["ncks", "-O", "-x", "-v", "TIMEOFDAY,QA",
                    f"{year}_ndvi_weeklyavg.nc",
                    f"../weekly/{year}_ndvi_weeklyavg.nc"], check=True)

    # Change directory back to the parent directory
    os.chdir("..")

# Change directory to the 'weekly' directory
os.chdir("weekly")

# Concatenate all the yearly files into a single file
subprocess.run(["cdo", "cat", "*.nc", "ndvi_1986_2020_weeklyavg.nc"], check=True)

# 2. Building the Dataset

#### The script used to build the dataset is located in the following path: Fire_Prediction/build_mtbs_dataframe.py

##### Once the data has been collected, the dataset can be built.  The files must be arranged in specific folders for the script to work.  The structure is as follows:

<table><tr><th>Feature</th><th>Folder</th></tr>
<tr><td>MTBS Data</td><td>data/MTBS_Data</td></tr>
<tr><td>DEM Data</td><td>data/terrain</td></tr>
<tr><td>GridMet Climate Data</td><td>data/FeatureData/gridmet</td></tr>
<tr><td>DayMet Climate Data</td><td>data/FeatureData/daymet</td></tr>
<tr><td>AdaptWest Climate Data</td><td>data/FeatureData/adaptwest</td></tr>
<tr><td>Landfire Fuel Data</td><td>data/FeatureData/landfire</td></tr>
<tr><td>Biomass Data</td><td>data/FeatureData/biomass</td></tr>
<tr><td>NDVI Data</td><td>data/FeatureData/ndvi</td></tr>
<tr><td>State Borders</td><td>data/FeatureData/state_borders</td></tr>
</table>

With the data prepped, the folder structure in ```data/``` should look something like this:

```
├── FeatureData
│   ├── adaptwest
│   │   └── check1
│   ├── biomass
│   │   └── all_biomass
│   │       ├── b1
│   │       └── b2
│   ├── daymet
│   │   ├── tmax_monthavg
│   │   └── tmin_monthavg
│   ├── gridmet
│   ├── landfire
│   │   ├── LF2020_FBFM40_200_CONUS
│   │   │   ├── ...
│   │   └── LF2020_FVT_200_CONUS
│   │       ├── ...
│   ├── ndvi
│   └── state_borders
├── MTBS_Data
│   ├── MTBS_BSmosaics
│   │   ├── 1986
│   │   ├── ...
│   ├── mtbs_burn_mosaics
│   └── mtbs_perimeter_data
├── temp
│   ├── check1
│   ├── check2
│   ├── ...
└── terrain
    ├── us_aspect
    │   ├── ...
    ├── us_flow_acc
    │   ├── ...
    ├── us_orig_dem
    │   ├── ...
    └── us_slope
        ├── ...

```
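
The empty skeleton can be created up front so that downloads have somewhere to go. This sketch builds only the top-level folders shown in the tree above; `create_data_skeleton` is a hypothetical helper, and the dataset-specific subfolders (for example the Landfire and terrain archives) are expected to come from the extracted downloads.

```python
import os

# Top-level folders taken from the tree above
DATA_SUBDIRS = [
    "FeatureData/adaptwest",
    "FeatureData/biomass",
    "FeatureData/daymet",
    "FeatureData/gridmet",
    "FeatureData/landfire",
    "FeatureData/ndvi",
    "FeatureData/state_borders",
    "MTBS_Data",
    "temp",
    "terrain",
]


def create_data_skeleton(data_loc):
    """Create the expected folders under `data_loc` and return their paths."""
    created = []
    for sub in DATA_SUBDIRS:
        path = os.path.join(data_loc, *sub.split("/"))
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        created.append(path)
    return created
```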

##### The script generates a Dask dataframe in Parquet format.  It has 21 data columns: 20 features and 1 target variable (`mtbs`), alongside bookkeeping columns such as `geometry`, `state`, and `ig_date`.  The data columns are as follows:

<table><tr><th>Feature</th><th>Column Name</th><th>Temporal Range</th></tr>
<tr><td>MTBS Severity Rating</td><td>mtbs</td><td>Const</td></tr>
<tr><td>MTBS Fire Year</td><td>year</td><td>Const</td></tr>
<tr><td>DEM elevation</td><td>dem</td><td>Const</td></tr>
<tr><td>DEM slope</td><td>dem_slope</td><td>Const</td></tr>
<tr><td>DEM aspect</td><td>dem_aspect</td><td>Const</td></tr>
<tr><td>DEM flow accumulation</td><td>dem_flow_acc</td><td>Const</td></tr>
<tr><td>DEM hillshade</td><td>hillshade</td><td>Const</td></tr>
<tr><td>GridMet Drought Index</td><td>gm_pdsi</td><td>Weekly avg</td></tr>
<tr><td>GridMet Solar Radiation</td><td>gm_srad</td><td>Weekly avg</td></tr>
<tr><td>GridMet Vapor Pressure Deficit</td><td>gm_vpd</td><td>Weekly avg</td></tr>
<tr><td>DayMet Temp Max</td><td>dm_tmax</td><td>Monthly avg of daily max</td></tr>
<tr><td>DayMet Temp Min</td><td>dm_tmin</td><td>Monthly avg of daily min</td></tr>
<tr><td>AdaptWest Mean Annual Temp</td><td>aw_mat</td><td>Yearly avg</td></tr>
<tr><td>AdaptWest Mean Temp Warmest Month</td><td>aw_mwmt</td><td>Avg of 1 month</td></tr>
<tr><td>AdaptWest Mean Temp Coldest Month</td><td>aw_mcmt</td><td>Avg of 1 month</td></tr>
<tr><td>AdaptWest Temp Difference</td><td>aw_td</td><td>Diff of mwmt and mcmt</td></tr>
<tr><td>Landfire Vegetation Type</td><td>landfire_fvt</td><td>2020 update</td></tr>
<tr><td>Landfire Fuel Model</td><td>landfire_fbfm40</td><td>2020 update</td></tr>
<tr><td>Biomass Annuals</td><td>biomass_afg</td><td>Yearly avg</td></tr>
<tr><td>Biomass Perennials</td><td>biomass_pfg</td><td>Yearly avg</td></tr>
<tr><td>Normalized Difference Vegetation Index</td><td>ndvi</td><td>Weekly avg</td></tr>
</table>

## 2.1 Running the script
The script can be run from the command line with the following command:

```python build_mtbs_dataframe.py```

##### Some user configuration is required before executing. The following changes need to be made to the script:

The DATA_LOC and TMP_LOC paths must be set, along with the STATE being processed.  STATE filters the MTBS data down to fires in the state of interest and must be the state's two-letter abbreviation.  For example, to process California, set STATE to 'CA'.

In [None]:
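from os.path import join as pjoin  # assumption: pjoin is an alias for os.path.join

# Root data directory NOTE: this directory will vary by user
DATA_LOC = "/home/jakebova/Fire_Prediction/Oregon/data"
# Location for temporary storage
TMP_LOC = pjoin(DATA_LOC, "temp")

# the state to process (two-letter abbreviation; "OR" matches the Oregon data path above)
STATE = "OR"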

The PATHS object holds the filepaths for the data.  It should look something like this:

In [None]:
# Location of clipped DEM files
DEM_DATA_DIR = pjoin(TMP_LOC, "dem_data")

# location of feature data files
FEATURE_DIR = pjoin(DATA_LOC, "FeatureData")
EDNA_DIR = pjoin(DATA_LOC, "terrain")
MTBS_DIR = pjoin(DATA_LOC, "MTBS_Data")

PATHS = {
    "states": pjoin(FEATURE_DIR, "state_borders/cb_2018_us_state_5m.shp"),
    "dem": pjoin(EDNA_DIR, "us_orig_dem/orig_dem/hdr.adf"),
    "dem_slope": pjoin(EDNA_DIR, "us_slope/slope/hdr.adf"),
    "dem_aspect": pjoin(EDNA_DIR, "us_aspect/aspect/hdr.adf"),
    "dem_flow_acc": pjoin(EDNA_DIR, "us_flow_acc/flow_acc/hdr.adf"),
    "gm_pdsi": pjoin(FEATURE_DIR, "gridmet/pdsi_weekly.nc"),
    "gm_srad": pjoin(FEATURE_DIR, "gridmet/srad_weekly.nc"),
    "gm_vpd": pjoin(FEATURE_DIR, "gridmet/vpd_1984-2020_weekly.nc"),
    "aw_mat": pjoin(FEATURE_DIR, "adaptwest/Normal_1991_2020_MAT.tif"),
    "aw_mcmt": pjoin(FEATURE_DIR, "adaptwest/Normal_1991_2020_MCMT.tif"),
    "aw_mwmt": pjoin(FEATURE_DIR, "adaptwest/Normal_1991_2020_MWMT.tif"),
    "aw_td": pjoin(FEATURE_DIR, "adaptwest/Normal_1991_2020_TD.tif"),
    "dm_tmax": pjoin(FEATURE_DIR, "daymet/tmax_monthavg/dm_tmax_1986_2020.nc"),
    "dm_tmin": pjoin(FEATURE_DIR, "daymet/tmin_monthavg/dm_tmin_1986_2020.nc"),
    "biomass_afg": pjoin(
        FEATURE_DIR, "biomass/1986_2020_biomass_afg_{}.nc".format(STATE.lower())
    ),
    "biomass_pfg": pjoin(
        FEATURE_DIR, "biomass/1986_2020_biomass_pfg_{}.nc".format(STATE.lower())
    ),
    "landfire_fvt": pjoin(
        FEATURE_DIR, "landfire/LF2020_FVT_200_CONUS/Tif/LC20_FVT_200.tif"
    ),
    "landfire_fbfm40": pjoin(
        FEATURE_DIR, "landfire/LF2020_FBFM40_200_CONUS/Tif/LC20_F40_200.tif"
    ),
    "ndvi": pjoin(FEATURE_DIR, "ndvi/1985_2020_ndvi_weekly.nc"),
    "mtbs_root": pjoin(MTBS_DIR, "MTBS_BSmosaics/"),
    "mtbs_perim": pjoin(MTBS_DIR, "mtbs_perimeter_data/mtbs_perims_DD.shp"),
}
YEARS = list(range(1986, 2021))
GM_KEYS = list(filter(lambda x: x.startswith("gm_"), PATHS))
AW_KEYS = list(filter(lambda x: x.startswith("aw_"), PATHS))
DM_KEYS = list(filter(lambda x: x.startswith("dm_"), PATHS))
BIOMASS_KEYS = list(filter(lambda x: x.startswith("biomass_"), PATHS))
LANDFIRE_KEYS = list(filter(lambda x: x.startswith("landfire_"), PATHS))
NDVI_KEYS = list(filter(lambda x: x.startswith("ndvi"), PATHS))
DEM_KEYS = list(filter(lambda x: x.startswith("dem"), PATHS))
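
Because a wrong path only surfaces once the script reaches that feature, it may be worth checking the whole `PATHS` dictionary before starting a long run. A minimal sketch, assuming `PATHS` maps names to file or directory paths as above (`find_missing_paths` is a hypothetical helper):

```python
import os


def find_missing_paths(paths):
    """Return the sorted keys of `paths` whose file or directory does not exist."""
    return sorted(key for key, path in paths.items() if not os.path.exists(path))


if __name__ == "__main__":
    # Stand-in dictionary; in the real script this would be PATHS
    example = {"here": os.getcwd(), "nowhere": os.path.join(os.getcwd(), "no_such_file.nc")}
    print("Missing:", find_missing_paths(example))
```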

### Building the dataset requires running the script several times.  Between runs, the ```if``` checks in ```__main__``` must be toggled by hand:

1. When creating a new dataset, enable this block (change its `if 0:` to `if 1:`), leave the other blocks disabled, and run the script:
```
    if 0:
        # code below for creating a new dataset for a new state / region
        df = build_mtbs_df(
            YEARS,
            year_to_mtbs_file,
            year_to_perims,
            STATE,
            out_path=mtbs_df_path,
        )
```

2. Once that is done, enable the following block, disable the others, and run the script:
```
    if 0:
        df = dgpd.read_parquet(mtbs_df_path)
        clip_and_save_dem_rasters(DEM_KEYS, PATHS, state_shape, STATE)
        df = add_columns_to_df(
            df,
            DEM_KEYS,
            extract_dem_data,
            mtbs_df_temp_path,
            # Save results in serial to avoid segfaulting. Something about the
            # dem computations makes segfaults extremely likely when saving
            # The computations require a lot of memory which may be what
            # triggers the fault.
            parallel=False,
        )

```
3. After the DEM data is added, the hillshade and year data can be extracted. Enable the following block, disable the others, and run the script:
```
    if 0:
        with ProgressBar():
            df = dgpd.read_parquet(mtbs_df_path)
        df = df.assign(hillshade=U8.type(0))
        df = df.map_partitions(hillshade_partition, 45, 180, meta=df._meta)
        df = df.assign(year=U16.type(0))
        df = df.map_partitions(timestamp_to_year_part, meta=df._meta)

        print(df.head())

        print("Repartitioning and saving ")
        df = df.repartition(partition_size="100MB").reset_index(drop=True)
        with ProgressBar():
            # df.to_parquet(mtbs_df_temp_path)
            df.to_parquet(mtbs_df_temp_path)
```
4. After the hillshade and year data are added, the remaining feature data can be added.
<br><b>NOTE</b>: This must be done one feature group at a time (i.e. DM_KEYS, then NDVI_KEYS, etc.)
<br>Enable the following block, disable the others, and run the script:
```
    if 0:
        # code below used to add new features to the dataset
        with ProgressBar():
            df = dgpd.read_parquet(mtbs_df_path)
        # NOTE: DM_KEYS adding to DF on {date}
        df = add_columns_to_df(
            df, DM_KEYS, partition_extract_nc, checkpoint_1_path, parallel=False
        )
        df = df.repartition(partition_size="100MB").reset_index(drop=True)
        print("Repartitioning")
        with ProgressBar():
            df.to_parquet(mtbs_df_temp_path)
```
The function passed to ```add_columns_to_df``` depends on the file type of the feature being added.<br>
For netCDF data, use the ```partition_extract_nc``` function.<br>
For TIF data, use the ```extract_tif_data``` function.<br>
The mtbs_df_path and mtbs_df_temp_path must be swapped each time a new feature group is added, so that every run reads the dataframe written by the previous run (i.e. add DM_KEYS, swap the paths, add NDVI_KEYS, swap again, etc.)

5. When adding netCDF data, the NDVI and DayMet data require special handling in the ```netcdf_to_raster``` function.<br>
Note the code below from the function:
```
    # BELOW FOR NDVI ONLY!!!
    # nc_ds = nc_ds.drop_vars(
    #     ["latitude_bnds", "longitude_bnds", "time_bnds"]
    # ).rio.write_crs("EPSG:4326")
    # ABOVE FOR NDVI ONLY!!!
    # nc_ds = nc_ds.rio.write_crs("EPSG:5071")  # FOR DAYMET ONLY!!
    # nc_ds = nc_ds.rio.write_crs(
    #     nc_ds.coords["lambert_conformal_conic"].spatial_ref
    # )  # FOR DAYMET ONLY!!
    # nc_ds = nc_ds.rename({"lambert_conformal_conic": "crs"})  # FOR DAYMET ONLY!!
    # nc_ds = nc_ds.drop_vars(["lat", "lon"])  # FOR DAYMET ONLY!!
    # nc_ds = nc_ds.rename_vars({"x": "lon", "y": "lat"})  # FOR DAYMET ONLY!!
```

If adding any other data, leave all of these lines commented out.<br>
If adding DayMet (DM) data, uncomment the lines marked FOR DAYMET ONLY.<br>
If adding NDVI data, uncomment the lines between the NDVI ONLY markers.<br>
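
The path swapping described in step 4 amounts to alternating between two Parquet locations: each feature group reads the output of the previous one. The sketch below only plans the read/write order and touches no data. `plan_feature_passes` is a hypothetical helper, and the strings stand in for the key lists and path variables used in the script.

```python
def plan_feature_passes(feature_groups, path_a, path_b):
    """Return (group, read_path, write_path) tuples that alternate between two paths."""
    plan = []
    read_path, write_path = path_a, path_b
    for group in feature_groups:
        plan.append((group, read_path, write_path))
        # The dataframe just written becomes the input of the next pass.
        read_path, write_path = write_path, read_path
    return plan


passes = plan_feature_passes(
    ["DM_KEYS", "NDVI_KEYS", "GM_KEYS"], "mtbs_df_path", "mtbs_df_temp_path"
)
for group, src, dst in passes:
    print(f"{group}: read {src} -> write {dst}")
```

Keeping a written plan like this next to the script makes it easier to tell which of the two locations currently holds the most complete dataframe.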

#### Once all the features are added to the dataset, the final product is a Dask GeoDataFrame in parquet format.  The output of df.head() should look something like this:

In [None]:
   mtbs                          geometry state    ig_date   gm_pdsi  gm_srad  gm_vpd  aw_mat  aw_mcmt  aw_mwmt      aw_td  ...  dem_aspect  dem_flow_acc  landfire_fvt  landfire_fbfm40  biomass_afg  biomass_pfg    ndvi   dm_tmax    dm_tmin  hillshade  year
0     2  POINT (-1830990.000 2456610.000)    OR 1986-03-20  2.779999    164.5    0.41     8.7     -1.9     20.9  22.799999  ...  305.295502           0.0        2967.0            102.0        165.0        891.0  0.0782 -3.049677 -16.621613        179  1986
1     2  POINT (-1830960.000 2456610.000)    OR 1986-03-20  2.779999    164.5    0.41     8.7     -1.9     20.9  22.799999  ...  333.025360           0.0        2967.0            102.0        411.0        638.0  0.0782 -3.049677 -16.621613        179  1986
2     2  POINT (-1830990.000 2456580.000)    OR 1986-03-20  2.779999    164.5    0.41     8.7     -1.9     20.9  22.799999  ...  253.358459           0.0        2967.0            102.0        388.0        508.0  0.0782 -3.049677 -16.621613        181  1986
3     2  POINT (-1830960.000 2456580.000)    OR 1986-03-20  2.779999    164.5    0.41     8.7     -1.9     20.9  22.799999  ...  254.867813           0.0        2080.0            122.0        411.0        638.0  0.0782 -3.049677 -16.621613        181  1986
4     2  POINT (-1830930.000 2456580.000)    OR 1986-03-20  2.779999    164.5    0.41     8.7     -1.9     20.9  22.799999  ...  339.071350           0.0        2123.0            122.0        379.0        492.0  0.0782 -3.049677 -16.621613        180  1986

# 3. Model Training

#### Once the dataset is constructed, the model can be trained.  The model is trained using the ```rf_class_mtbs.py``` script.  The script can be run from the command line with the following command:

```python rf_class_mtbs.py```

#### We use a random forest classifier with mostly default parameters to predict the class of each cell in the dataset.

##### There are several options for training the model.  The following changes can be made to the script:

In this code:
```
register_ray()
pd.set_option("display.max_rows", 500)
p = 0.2
SEED = 42
TREES = 50
filename = "rf_mtbs_oregon_50t20d.sav"
TRAIN = True
TEST = False

STATE = "OR"
DATA_LOC = "/home/jakebova/Fire_Prediction/Oregon/data"
TMP_LOC = pjoin(DATA_LOC, "temp")

DF_LOC = pjoin(TMP_LOC, f"{STATE}_mtbs_df_with_ids")
```

The following changes can be made:
- ```p``` is the fraction of the dataset to sample for training.  The default is 0.2, which is 20% of the dataset.  It can be set to any value between 0 and 1.  Larger values need more memory; if the script crashes, try reducing ```p```.
- ```SEED``` is the random seed used for the model.  The default is 42, but this can be changed to any integer.
- ```TREES``` is the number of trees in the model.  The default is 50, but any positive integer works.  We have had the most success in the 100 to 150 range.  More trees need more memory; if the script crashes, try reducing ```TREES```.
- ```filename``` is the name of the file to save the model to.  The default is ```rf_mtbs_oregon_50t20d.sav```, but this can be changed to any string.
- ```TRAIN``` is a boolean value that determines whether or not to train the model.  The default is ```True```, but this can be changed to ```False``` to load a previously trained model.
- ```TEST``` is a boolean value that determines whether or not to test the model.  The default is ```False```, but this can be changed to ```True``` to test the model.
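
The default filename appears to encode the configuration (`50t` for 50 trees, `20d` for 20% of the data). That reading is an assumption, but if it holds, deriving the name from `TREES` and `p` keeps saved models from overwriting one another when the settings change. `model_filename` below is a hypothetical helper, not part of the original script.

```python
def model_filename(region, trees, p):
    """Build a model filename that records the tree count and the sampled percentage."""
    percent = round(p * 100)  # e.g. 0.2 -> 20
    return f"rf_mtbs_{region}_{trees}t{percent}d.sav"


# Reproduces the default filename used in the script
default_name = model_filename("oregon", 50, 0.2)
```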


The verbosity of the scikit-learn random forest classifier can be changed by changing the value of verbose (typically 0, 1, or 2) in the following line:
```
forest = RandomForestClassifier(n_estimators=TREES, verbose=1, random_state=SEED)
```

##### Once complete (anywhere from ~30 minutes to several hours depending on configuration), the model will be saved to the ```filename``` location. The model can then be used to predict the class of any dataset with the same features as the training dataset.

# 4. Model Evaluation

## The output of the model training script should look something like this (the sample below is from a run that sampled 40% of the data and used 25 trees):

```
Sampling 40.0% of data
Cleaning data
Splitting data into train and test sets
Instantiating model with 25 trees
Training model
****************************************
Accuracy score: 0.8380
****************************************
****************************************
Confusion matrix:
[[1583439  399576   28033    5013]
 [ 279435 4527912  250977   23102]
 [  32641  352893 1482150  121551]
 [   5747   25944  130280  968496]]
****************************************
****************************************
Classification Report
              precision    recall  f1-score   support

           1       0.83      0.79      0.81   2016061
           2       0.85      0.89      0.87   5081426
           3       0.78      0.75      0.76   1989235
           4       0.87      0.86      0.86   1130467

    accuracy                           0.84  10217189
   macro avg       0.83      0.82      0.83  10217189
weighted avg       0.84      0.84      0.84  10217189

****************************************
****************************************
Area under ROC Curve:
0.9616861528386682
****************************************
****************************************
Feature Importances (Impurity based):
dem                0.097426
biomass_pfg        0.094367
dem_aspect         0.087795
dem_slope          0.084736
biomass_afg        0.079713
dm_tmax            0.054383
dm_tmin            0.053868
ndvi               0.050503
dem_flow_acc       0.050460
gm_srad            0.049898
aw_mwmt            0.044709
gm_pdsi            0.039106
aw_td              0.038576
landfire_fbfm40    0.038562
landfire_fvt       0.036355
gm_vpd             0.035572
aw_mat             0.032097
aw_mcmt            0.031874
dtype: float64
END RUN .. dumping files
```
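
The summary numbers can be checked directly against the confusion matrix. Assuming scikit-learn's convention that rows are true classes and columns are predicted classes, accuracy is the diagonal sum over the total, recall divides each diagonal entry by its row sum, and precision divides it by its column sum. The sketch below uses plain Python and the matrix printed above:

```python
# Confusion matrix from the sample output (rows: true class, columns: predicted class)
confusion = [
    [1583439, 399576, 28033, 5013],
    [279435, 4527912, 250977, 23102],
    [32641, 352893, 1482150, 121551],
    [5747, 25944, 130280, 968496],
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n_classes))

# Overall accuracy: correctly classified cells over all cells
accuracy = correct / total

# Recall per class: diagonal entry over its row sum (all cells truly in that class)
recall = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]

# Precision per class: diagonal entry over its column sum (all cells predicted as that class)
precision = [
    confusion[i][i] / sum(row[i] for row in confusion) for i in range(n_classes)
]

print(f"accuracy: {accuracy:.4f}")
for cls in range(n_classes):
    print(f"class {cls + 1}: precision {precision[cls]:.2f}, recall {recall[cls]:.2f}")
```

The row sums also reproduce the `support` column of the classification report, which is a quick way to confirm the matrix orientation.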