# Forest Fires ML Regression Project

In this notebook we use tree-based regression models and split the data into training and testing sets. For more details about the dataset and the previous models, see [Notebook-01-Linear-Regression](http://localhost:8888/notebooks/Notebook-01-Linear-Regression.ipynb).

In [1]:
# importing frameworks and libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.feature_selection import RFE

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import ExtraTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.svm import LinearSVR
from sklearn.svm import SVR

# importing tools and functions
import os
from IPython.display import display
import scoringfn

# setting notebook for visualization
%matplotlib inline
plt.rcParams['figure.figsize'] = (7, 5)
plt.rcParams['font.size'] = 12

In [2]:
os.chdir('Data')

In [3]:
data = pd.read_csv('forestfires_processed.csv')
data.drop(columns = ['Unnamed: 0'], inplace = True)
display(data.head())

|   | X | Y | sin(month) | sin(day) | log(FFMC) | DMC | log(DC) | log(ISI) | temp | log(RH) | wind | log(rain) | log(area) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 5 | 0.14112 | -0.958924 | 4.468204 | 26.2 | 4.55703 | 1.808289 | 8.2 | 3.951244 | 6.7 | 0.0 | 0.0 |
| 1 | 7 | 4 | -0.544021 | 0.909297 | 4.517431 | 35.4 | 6.507427 | 2.04122 | 18.0 | 3.526361 | 0.9 | 0.0 | 0.0 |
| 2 | 7 | 4 | -0.544021 | -0.279415 | 4.517431 | 43.7 | 6.533643 | 2.04122 | 14.6 | 3.526361 | 1.3 | 0.0 | 0.0 |
| 3 | 8 | 6 | 0.14112 | -0.958924 | 4.529368 | 33.3 | 4.363099 | 2.302585 | 8.3 | 4.584967 | 4.0 | 0.182322 | 0.0 |
| 4 | 8 | 6 | 0.14112 | 0.656987 | 4.503137 | 51.3 | 4.636669 | 2.360854 | 11.4 | 4.60517 | 1.8 | 0.0 | 0.0 |


In [4]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(['log(area)'], axis = 1), data['log(area)'], random_state = 42)

def evaluate_model(model):
    """Fit `model` on the current global training split and print train/test R2."""
    model.fit(X_train, y_train)
    train_preds = model.predict(X_train)
    test_preds = model.predict(X_test)
    print(f'R2 score on training set: {r2_score(y_train, train_preds): .6f}')
    print(f'R2 score on  testing set: {r2_score(y_test, test_preds): .6f}')

models = [LinearRegression(), Ridge(), Lasso(), ElasticNet()]
models += [DecisionTreeRegressor(), ExtraTreeRegressor(), RandomForestRegressor(), ExtraTreesRegressor()]
for model in models:
    print(type(model).__name__)
    evaluate_model(model)
    print()

LinearRegression
R2 score on training set:  0.026702
R2 score on  testing set:  0.005866

Ridge
R2 score on training set:  0.026502
R2 score on  testing set:  0.004903

Lasso
R2 score on training set:  0.010500
R2 score on  testing set: -0.016328

ElasticNet
R2 score on training set:  0.010590
R2 score on  testing set: -0.017659

DecisionTreeRegressor
R2 score on training set:  0.996112
R2 score on  testing set: -0.733428

ExtraTreeRegressor
R2 score on training set:  0.996112
R2 score on  testing set: -0.729064

RandomForestRegressor
R2 score on training set:  0.837031
R2 score on  testing set: -0.057673

ExtraTreesRegressor
R2 score on training set:  0.996112
R2 score on  testing set: -0.164139
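
Several of the test scores above are negative. As a reminder of what that means, here is a minimal pure-Python sketch of the definition R2 = 1 - SS_res / SS_tot, which is the quantity `r2_score` reports. The helper name `r2` and the toy numbers are made up for illustration and are not part of the notebook.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # spread around the mean
    return 1 - ss_res / ss_tot

# Toy targets (hypothetical values, not taken from the forest fires data)
y_true = [0.0, 1.0, 2.0, 3.0]
mean_baseline = [1.5] * 4            # always predict the mean of y_true
noisy_model = [3.0, 0.0, 3.0, 0.0]   # predictions further off than the mean

r2_baseline = r2(y_true, mean_baseline)
r2_noisy = r2(y_true, noisy_model)

print(r2_baseline)  # 0.0
print(r2_noisy)     # -3.0
```

A score of 0 corresponds to always predicting the mean of the evaluated targets, so a negative test score means the model does worse than that constant baseline on the test set.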



In [5]:
# The row index has no physical meaning as a predictor, but keeping it gave better scores.
# Here we reload the data and keep the 'Unnamed: 0' column (the row index) as a feature.
data = pd.read_csv('forestfires_processed.csv')

In [7]:
display(data.head())

|   | Unnamed: 0 | X | Y | sin(month) | sin(day) | log(FFMC) | DMC | log(DC) | log(ISI) | temp | log(RH) | wind | log(rain) | log(area) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 7 | 5 | 0.14112 | -0.958924 | 4.468204 | 26.2 | 4.55703 | 1.808289 | 8.2 | 3.951244 | 6.7 | 0.0 | 0.0 |
| 1 | 1 | 7 | 4 | -0.544021 | 0.909297 | 4.517431 | 35.4 | 6.507427 | 2.04122 | 18.0 | 3.526361 | 0.9 | 0.0 | 0.0 |
| 2 | 2 | 7 | 4 | -0.544021 | -0.279415 | 4.517431 | 43.7 | 6.533643 | 2.04122 | 14.6 | 3.526361 | 1.3 | 0.0 | 0.0 |
| 3 | 3 | 8 | 6 | 0.14112 | -0.958924 | 4.529368 | 33.3 | 4.363099 | 2.302585 | 8.3 | 4.584967 | 4.0 | 0.182322 | 0.0 |
| 4 | 4 | 8 | 6 | 0.14112 | 0.656987 | 4.503137 | 51.3 | 4.636669 | 2.360854 | 11.4 | 4.60517 | 1.8 | 0.0 | 0.0 |


In [6]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(['log(area)'], axis = 1), data['log(area)'], random_state = 42)

for model in models:
    print(type(model).__name__)
    evaluate_model(model)
    print()

LinearRegression
R2 score on training set:  0.105297
R2 score on  testing set:  0.143035

Ridge
R2 score on training set:  0.104917
R2 score on  testing set:  0.141892

Lasso
R2 score on training set:  0.080623
R2 score on  testing set:  0.090320

ElasticNet
R2 score on training set:  0.080641
R2 score on  testing set:  0.090384

DecisionTreeRegressor
R2 score on training set:  1.000000
R2 score on  testing set: -0.208375

ExtraTreeRegressor
R2 score on training set:  1.000000
R2 score on  testing set:  0.302810

RandomForestRegressor
R2 score on training set:  0.911996
R2 score on  testing set:  0.482617

ExtraTreesRegressor
R2 score on training set:  1.000000
R2 score on  testing set:  0.527880
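
Keeping the row index as a feature raised the test score of every model. One possible explanation, which this notebook does not verify, is that the order of the rows in the file is related to the target, so the index leaks information about it. The following is a minimal sketch of how such a relationship could be checked, using synthetic data only; every name and number in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
row_index = np.arange(n)

# Hypothetical target that drifts upward with file position, plus noise
target_sorted = 0.01 * row_index + rng.normal(scale=0.5, size=n)
# The same values in shuffled order: position no longer carries information
target_shuffled = rng.permutation(target_sorted)

# Pearson correlation between row position and target in each case
corr_sorted = np.corrcoef(row_index, target_sorted)[0, 1]
corr_shuffled = np.corrcoef(row_index, target_shuffled)[0, 1]

print(f'ordered rows : {corr_sorted: .3f}')
print(f'shuffled rows: {corr_shuffled: .3f}')
```

On the notebook's own data, the analogous check would be something like `data['Unnamed: 0'].corr(data['log(area)'])`. A correlation clearly different from zero would suggest that the improved scores come from row ordering rather than from the fire and weather features.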

