In [4]:
import pandas as pd

In [16]:
X = pd.read_csv("./X.csv")
y = pd.read_csv("./y.csv")
Xlasso = pd.read_csv("results/Xlasso.csv")
XElasticNet = pd.read_csv("results/XElasticNet.csv")
RandomForestScores = pd.read_csv("results/RandomForestScores.csv")

# Forest Fires Feature Selection Mini Project

## Introduction


### Data Information

The Forest Fires data set is available from the UCI Machine Learning Repository [here](http://archive.ics.uci.edu/ml/datasets/Forest+Fires).

Citation for this data set:

[Cortez and Morais, 2007] P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimarães, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9. Available at: [http://www.dsi.uminho.pt/~pcortez/fires.pdf](http://www3.dsi.uminho.pt/pcortez/fires.pdf)
    
#### Attributes:

1. `X` - x-axis spatial coordinate within the Montesinho park map: 1 to 9 
2. `Y` - y-axis spatial coordinate within the Montesinho park map: 2 to 9 
3. `month` - month of the year: 'jan' to 'dec' 
4. `day` - day of the week: 'mon' to 'sun' 
5. `FFMC` - FFMC index from the FWI system: 18.7 to 96.20 
6. `DMC` - DMC index from the FWI system: 1.1 to 291.3 
7. `DC` - DC index from the FWI system: 7.9 to 860.6 
8. `ISI` - ISI index from the FWI system: 0.0 to 56.10 
9. `temp` - temperature in Celsius degrees: 2.2 to 33.30 
10. `RH` - relative humidity in %: 15.0 to 100 
11. `wind` - wind speed in km/h: 0.40 to 9.40 
12. `rain` - outside rain in mm/m2 : 0.0 to 6.4 
13. `area` - the burned area of the forest (in ha): 0.00 to 1090.84 


#### Model and Feature Selection Process:

I will try to predict the `area` variable with regression models.

 - First, I fit a Random Forest regression with a limited tree depth (the `depth` hyperparameter) on all features.
 - Then I use Lasso (L1 regularization) and ElasticNet (L1+L2 regularization) regression to select features. I do not use Ridge (L2 regularization) because it does not shrink any feature weights to exactly zero.
 - As a last step, I fit a Random Forest regression, again with a limited tree depth, on the features selected by Lasso and on those selected by ElasticNet.


### Response Variable and Predictors:

**Response Variable:** `area`, the burned area of the forest (in ha).
- The original paper applied a log transformation to this variable because the *variable is very skewed towards 0.0*. After fitting the models, the outputs were post-processed with the inverse of the ln(x+1) transform.

**Predictors:** The categorical variables `month` and `day` need to be encoded as dummy variables.
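The dummy encoding could look like the minimal sketch below. It assumes `pandas.get_dummies` with `m` and `d` prefixes, chosen to match column names such as `m_mar` and `d_fri` that appear later in this notebook. The small `raw` frame and its values are illustrative stand-ins, not the actual data or the code used for this project.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the raw data (illustrative values).
raw = pd.DataFrame({
    "month": ["mar", "oct", "oct"],
    "day": ["fri", "tue", "sat"],
    "temp": [8.2, 18.0, 14.6],
})

# One 0/1 indicator column per category, named m_<month> and d_<day>.
encoded = pd.get_dummies(raw, columns=["month", "day"], prefix=["m", "d"], dtype=int)

print(encoded.columns.tolist())
```

Numeric columns such as `temp` pass through unchanged, and each categorical column is replaced by its indicator columns.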


The area variable before `log(area+1)` transformation:

![AreaBeforeTransformation](results/AreaBeforeTransformation.png)


The area variable after `log(area+1)` transformation:

![AreaAfterTransformation](results/AreaAfterTransformation.png)



As the histograms show, the log transformation spreads out the values of the `area` variable.
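The transform and its inverse can be sketched with NumPy's `log1p` and `expm1`. The `area` values below are illustrative, and this is not necessarily the exact code behind the histograms.

```python
import numpy as np

# Illustrative burned areas in ha, including the common value 0.0.
area = np.array([0.0, 0.36, 6.43, 1090.84])

y_log = np.log1p(area)       # ln(area + 1); defined at 0.0, unlike a plain log
area_back = np.expm1(y_log)  # inverse transform: exp(y) - 1

print(np.round(y_log, 3))
print(np.allclose(area_back, area))
```

Predictions made on the log scale would be passed through `np.expm1` in the same way to get back to hectares.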

In [10]:
# The full set of predictors:
X.head()

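| Unnamed: 0 | 7 | 5 | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 | ... | 0.8 | 0.9 | 0.10 | 1.1 | 0.11 | 0.12 | 0.13 | 0.14 | 0.15 | 0.16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 4 | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 7 | 4 | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 8 | 6 | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 8 | 6 | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 8 | 6 | 92.3 | 85.3 | 488.0 | 14.7 | 22.2 | 29 | 5.4 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |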


### Lasso Regression (L1 Regularization )

I apply Lasso regression for feature selection and keep the features with non-zero weights, which gives 18 features. The selected set may not always be the same, since we do not use cross-validation and the L1 optimum is not guaranteed to be unique.

The optimum for Lasso is at `alpha=0.01`; I select the 18 features at that value.

![](results/LassoError.png)
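Keeping the non-zero-weighted features can be sketched as follows, assuming scikit-learn's `Lasso`. The data are synthetic stand-ins (`X_demo`, `y_demo` are hypothetical names), and `alpha=0.1` is an arbitrary value for the demo rather than the `alpha=0.01` reported above for the real data.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: 10 features, but only the first three influence the target.
X_demo = rng.normal(size=(200, 10))
y_demo = (3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 1.5 * X_demo[:, 2]
          + rng.normal(scale=0.1, size=200))

lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)

# Features whose weights were not shrunk to exactly zero.
selected = np.flatnonzero(lasso.coef_)
X_selected = X_demo[:, selected]

print(selected, X_selected.shape)
```

With a DataFrame, the same index array could be used as `X.columns[selected]` to recover the names of the kept columns.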

In [11]:
# Predictors kept after Lasso regularization:
Xlasso.head()

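| Unnamed: 0 | X | Y | FFMC | DMC | DC | ISI | temp | RH | wind | m_dec | m_feb | m_jun | m_mar | m_sep | d_fri | d_sat | d_thu | d_tue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 5 | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 7 | 4 | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 7 | 4 | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 8 | 6 | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 8 | 6 | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |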


### ElasticNet Regression (L1+L2 Regularization )

I apply ElasticNet regression for feature selection and keep the features with non-zero weights, which again gives 18 features. The selected set may not always be the same, since we do not use cross-validation. Unlike Lasso, however, the ElasticNet optimum is unique.

The optimum for ElasticNet is at `alpha=0.01`; I select the 18 features at that value.

![](results/ElasticNetError.png)
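To make the uniqueness remark concrete, here is the objective minimized by scikit-learn's `ElasticNet`, assuming that is the implementation used here (the notebook does not say). The symbol $\rho$ stands for the `l1_ratio` parameter and $n$ for the number of samples:

$$
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw\rVert_2^2 \;+\; \alpha\,\rho\,\lVert w\rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w\rVert_2^2
$$

With $\rho = 1$ the last term vanishes and the objective reduces to Lasso, which is convex but not strictly convex, so several weight vectors can reach the same minimum when features are collinear. For $\alpha > 0$ and $\rho < 1$ the squared L2 term makes the objective strictly convex, so the minimizer is unique.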

In [12]:
# Predictors kept after ElasticNet regularization:
XElasticNet.head()

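| Unnamed: 0 | X | Y | FFMC | DMC | DC | ISI | temp | RH | wind | m_dec | m_feb | m_jun | m_mar | m_sep | d_fri | d_sat | d_thu | d_tue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 5 | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 7 | 4 | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 7 | 4 | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 8 | 6 | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 8 | 6 | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |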


## Random Forest Scores:

The Lasso features give the best validation score, but we cannot conclude that Lasso is the best selection method, since all of the score values are still very low.

In [22]:
# Test scores of RandomForest with depth=5:
RandomForestScores

| Unnamed: 0 | RandomForest with all features(29) | RandomForest with Lasso features(18) | RandomForest with ElasticNet features(18) |
|---|---|---|---|
| 0 | -0.043066 | 0.014617 | -0.000965 |
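To help read scores this close to zero, the sketch below fits a depth-5 forest to a pure-noise target. It assumes the scores above are R², which is what `RandomForestRegressor.score` returns by default; the notebook does not state the metric. The data and the names `X_demo` and `y_demo` are synthetic and hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Features carry no information about the target, so there is nothing to learn.
X_demo = rng.normal(size=(300, 5))
y_demo = rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

forest = RandomForestRegressor(max_depth=5, n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# R^2 on held-out data; it can be negative when the model does worse
# than always predicting the mean of the test targets.
r2 = forest.score(X_test, y_test)
print(f"test R^2: {r2:.3f}")
```

Under this reading, an R² near zero means the model does about as well as predicting the mean, and a negative R² means it does worse.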


### Conclusion

My approach may not be the best one for this problem. We need to examine other regression models for predicting the burned area and compare their predictive power. We may also need another forest fire data set to extend this research.