# Modelling Results 
[06/16/2023]

## Methodology

### Data

Modelling uses two MISR datasets covering 2000-2021: (1) MISR matched with species concentrations (MISR_CSN) and (2) MISR matched with PM2.5 concentrations (MISR_AQS). Both datasets are merged with daily information on fire clusters and smoke plumes near each site. Each dataset is linked below so the experiments can be reproduced.

- [Original MISR_CSN data](https://github.com/cindyellow/PM25-Fire/blob/main/data/merged/MergedAll_MISR_CSN_2000_2021.csv)
- [Original MISR_AQS data](https://github.com/cindyellow/PM25-Fire/blob/main/data/merged/MergedAll_MISR_AQS_2000_2021.zip)
- [MISR_CSN with AOD imputed](https://github.com/cindyellow/PM25-Fire/blob/main/data/imputed/misr-csn_aod-only_imputed.csv)
- [MISR_CSN with AOD products imputed](https://github.com/cindyellow/PM25-Fire/blob/main/data/imputed/misr-aqs_aod-prod_imputed.zip)
- [MISR_AQS with AOD imputed](https://github.com/cindyellow/PM25-Fire/blob/main/data/imputed/misr-aqs_aod-only_imputed.zip)
- [MISR_AQS with AOD products imputed](https://github.com/cindyellow/PM25-Fire/blob/main/data/imputed/misr-aqs_aod-prod_imputed.zip)

### Features
Variables used in modelling include:
- Temporal: year, month
- Spatial: site longitude, site latitude, elevation
- Fire: distance (km) to closest fire cluster, average & variance of closest cluster's FRP, number of points in closest cluster
- Smoke: binary variables for whether the site lies in light, medium, and/or heavy smoke plume
- AOD: AOD, AOD products
- Other: land code of site (representing ecosystem), sampling method (POC)

### Transformation & Imputation
All continuous independent variables are transformed using the [Standard Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler.fit_transform) and categorical features are one-hot encoded. We also noticed a heavily right-skewed distribution for the target variables, so we log-transformed them during modelling.
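
The transformation step can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the column names (`aod`, `elevation`, `land_code`, `pm25`) and the toy values are hypothetical, and the source does not say whether a plain log or a shifted variant such as `log1p` was used for the target, so the plain natural log here is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame; column names and values are hypothetical stand-ins.
df = pd.DataFrame({
    "aod": [0.10, 0.25, 0.40, 0.15],
    "elevation": [120.0, 300.0, 45.0, 900.0],
    "land_code": ["81", "42", "81", "23"],
    "pm25": [4.2, 11.0, 35.5, 7.3],
})

continuous = ["aod", "elevation"]
categorical = ["land_code"]

# Scale continuous features and one-hot encode categorical ones.
# sparse_threshold=0 forces a dense array so the result is easy to inspect.
preprocess = ColumnTransformer(
    [
        ("scale", StandardScaler(), continuous),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0,
)
X = preprocess.fit_transform(df[continuous + categorical])

# Log-transform the right-skewed target (assumes strictly positive values).
y_log = np.log(df["pm25"])
```

In practice the transformer should be fit on the training split only and then applied to the test split, so that test-set statistics do not leak into the scaling.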

Because AOD has many missing values, we also tested whether imputation improves model performance. Missing values were imputed with KNN using each observation's 10 nearest neighbors. Outliers were not removed before this step, on the assumption that KNN imputation is fairly robust to them. The experiments below compare models fit on non-imputed data (i.e. only observations with non-NA AOD features) against models fit on imputed data (i.e. all observations, with NAs in the AOD features imputed).
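
A minimal sketch of 10-neighbor KNN imputation is shown below, assuming scikit-learn's `KNNImputer` (the source does not name the implementation used). The data are random placeholders, with column 0 standing in for AOD.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Synthetic feature matrix; column 0 plays the role of AOD.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[::5, 0] = np.nan  # knock out every fifth value to mimic missing AOD

missing = np.isnan(X)

# Each missing entry is filled with the mean of that feature over the
# 10 nearest rows, with distance computed on the observed features.
imputer = KNNImputer(n_neighbors=10)
X_imputed = imputer.fit_transform(X)
```

Observed values pass through unchanged; only the entries flagged in `missing` are filled.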

### Training the models

The data were split into 70% training and 30% test sets. Since the current purpose of these models is to investigate the importance of AOD, fire, and smoke information for predicting concentration, we only experimented with XGBoost and Random Forest to maintain interpretability. More complex models can be fitted later if desired.
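
The split-and-fit step can be sketched as below for the Random Forest case. The synthetic data, the hyperparameters, and the fixed random seed are assumptions for illustration only; the source does not report the settings actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the real features/target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# 70% training / 30% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set with the two metrics reported here.
pred = model.predict(X_test)
r2 = r2_score(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
```

An XGBoost model could be swapped in through the same `fit`/`predict` interface, assuming the `xgboost` package's scikit-learn wrapper is available.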

In our case, outliers are defined as values greater than 3 times the 75th percentile of the variable. They were removed from both the non-imputed and imputed datasets prior to training. Observations with missing values in non-AOD variables (including the target) were also dropped, as we did not have the means to impute them.
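
As a small worked example of the outlier rule, the sketch below computes the cutoff and filters a toy array. The helper name `outlier_cutoff` is hypothetical, and keeping values exactly equal to the cutoff is an assumption.

```python
import numpy as np

def outlier_cutoff(values, multiplier=3.0):
    """Return multiplier times the 75th percentile, ignoring NaNs."""
    return multiplier * np.nanpercentile(values, 75)

values = np.array([1.0, 2.0, 3.0, 4.0, 50.0])

cutoff = outlier_cutoff(values)   # 75th percentile is 4.0, so cutoff is 12.0
kept = values[values <= cutoff]   # drops the single value above the cutoff
```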



## Species

### Outliers Summary

| Species | # Non-NAs (Imputed Model) | # Outliers Removed (Imputed Model) | Outlier Cutoff (Imputed Model)| 
| -------- | ------- | ------- | ------- |
| Sulfate | 1654 (complete data amt) | 29 (92) | 5.61 (5.04) |
| Nitrate | 1645 | 107 (310) | 9.15 (10.32) |
| Dust | 1662 | 35 (137) | 4.10 (3.23) |
| EC | 338 | 2 (16) | 3.27 (3.32) |
| OC | 338 | 3 (13) | 12.17 (10.95) |

### Imputation & Feature Group

A total of 3580 observations were imputed for AOD and AOD products using 1709 non-NA observations. The table below shows, for each species, the best R2 of the original and imputed models across the two feature groups (AOD vs. AOD products). The best feature group and model are selected by comparing the two R2 values for each species.

| Species | Original R2 | Imputed R2 | Original RMSE | Imputed RMSE | Best Feature Group | Best Model |
| -------- | ------- | ------- | ------- | ------- | ------- | ------- |
| Sulfate | **0.69** | 0.45 | 0.21 | 0.26 | aod | Random Forest |
| Nitrate | **0.61** | 0.44 | 0.34 | 0.45 | aod-prod | Random Forest |
| Dust | 0.34 | **0.36** | 0.26 | 0.25 | aod | Random Forest |
| EC | **0.58** | 0.45 | 0.19 | 0.23 | aod | Random Forest |
| OC | 0.22 | **0.27** | 0.35 | 0.40 | aod-prod | Random Forest |

- For sulfate, nitrate, and EC, imputation does not improve model performance on unseen data, whereas some improvement in R2 is seen for dust and OC (at the cost of RMSE for the latter). 
- AOD product features outperform AOD alone for nitrate and OC, but only marginally. Since the best OC result also depends on imputation, it is still better to opt for the simpler model using only AOD.

### Variable Importance & Fitted vs True Graphs

Here, we look at the 10 most important variables and the plot of fitted versus true values on the test set for each species' best model. Dark blue denotes the one-to-one line, whereas light blue indicates the fitted linear regression line.
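
For reference, a top-10 ranking of this kind can be extracted from a fitted Random Forest roughly as follows. This sketch assumes the rankings come from scikit-learn's impurity-based `feature_importances_`, which the source does not confirm. The feature names and data are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Placeholder data: 12 hypothetical features, only feature_0 drives the target.
rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(12)]
X = pd.DataFrame(rng.normal(size=(300, 12)), columns=feature_names)
y = 3.0 * X["feature_0"] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort and keep the 10 largest.
importances = pd.Series(rf.feature_importances_, index=feature_names)
top10 = importances.sort_values(ascending=False).head(10)
```

Impurity-based importances can favor high-cardinality or continuous features, so permutation importance on the test set is a common cross-check.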

1. Sulfate

<img src="../img/sulfate/ft-imp/sulfate-aod_only-ft_imp_rf.png" width="450" height="400"/> <img src="../img/sulfate/fitted/sulfate-fitted_rf.png" width="450" height="400"/> 


2. Nitrate

<img src="../img/nitrate/ft-imp/nitrate-aod_only-ft_imp_rf.png" width="450" height="400"/> <img src="../img/nitrate/fitted/nitrate-fitted_rf.png" width="450" height="400"/> 

3. Dust

<img src="../img/imputed/ft-imp/dust-aod_only-ft_imp_rf.png" width="450" height="400"/><img src="../img/imputed/fitted/dust-aod_only-fitted_rf.png" width="450" height="400"/>

4. EC

<img src="../img/EC/ft-imp/EC-aod_only-ft_imp_rf.png" width="450" height="400"/> <img src="../img/EC/fitted/EC-fitted_rf.png" width="450" height="400"/> 

5. OC

<img src="../img/imputed/ft-imp/OC-aod_prod-ft_imp_rf.png" width="450" height="400"/><img src="../img/imputed/fitted/OC-aod_prod-fitted_rf.png" width="450" height="400"/>


- AOD is always in the top 2 when used in the model. 
- Month and year are usually in the top 3 important features as well, followed by site location.
- Number of points in the closest fire cluster is the most important among fire/smoke variables.
- Distance to closest fire follows.
- Average FRP is often in the top 10.
- Smoke only appears in the top 10 for nitrate.
- For dust and OC, the fitted models tend to overestimate values.

## PM2.5 Mass

### Outliers
- In the model fitted on non-NAs, 916 PM2.5 outliers were removed using a cutoff of 38.1. In the imputed model, 3894 outliers were removed using a cutoff of 37.37.
- Values less than 0 were replaced with 0 since they are likely due to error.
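
The negative-value replacement amounts to clipping at zero. A minimal sketch with made-up values:

```python
import numpy as np

# Made-up PM2.5 readings, including two negative values.
pm25 = np.array([-0.4, 0.0, 5.2, 12.7, -1.1])

# Replace anything below 0 with 0; values at or above 0 are untouched.
pm25_clean = np.clip(pm25, 0, None)
```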

### Imputation & Feature Group

For modelling, we followed the same procedure as for species concentration. The table below shows the performance for each feature group:

| Original R2 | Imputed R2 | Original RMSE | Imputed RMSE | Feature Group | Best Model |
| ------- | ------- | ------- | ------- | ------- | ------- |
| **0.76** | 0.63 | 0.27 | 0.38 | aod | Random Forest |
| **0.74** | 0.65 | 0.29 | 0.36 | aod-prod | Random Forest |

In both cases, the model built on imputed data does not outperform the base model, so we should opt for modelling with non-NAs using only the AOD variable.
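
The selection described above is a comparison of test-set R2 across the four combinations. The sketch below reproduces it with the values from the PM2.5 table. The record layout is an illustrative choice, not the project's code.

```python
# R2 and RMSE values copied from the PM2.5 table above.
results = [
    {"feature_group": "aod", "data": "original", "r2": 0.76, "rmse": 0.27},
    {"feature_group": "aod", "data": "imputed", "r2": 0.63, "rmse": 0.38},
    {"feature_group": "aod-prod", "data": "original", "r2": 0.74, "rmse": 0.29},
    {"feature_group": "aod-prod", "data": "imputed", "r2": 0.65, "rmse": 0.36},
]

# Pick the combination with the highest test-set R2.
best = max(results, key=lambda r: r["r2"])
```

Here `best` is the non-imputed, AOD-only model, which also has the lowest RMSE of the four.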

### Variable Importance

Now, let's take a look at the 10 most important features for the best model.

<img src="../img/PM25/PM25-aod_only-ft_imp_rf.png" width="450" height="400"/><img src="../img/PM25/PM25-aod_only-fitted_rf.png" width="450" height="400"/>

- The important variables are similar to those in the species models.
- [Land code 81](https://www.mrlc.gov/data/legends/national-land-cover-database-class-legend-and-description) also appeared among the important features. It refers to the ecosystem consisting of "pasture/hay-areas of grasses, legumes, or grass-legume mixtures planted for livestock grazing or the production of seed or hay crops, typically on a perennial cycle. Pasture/hay vegetation accounts for greater than 20% of total vegetation."