# Lab 02 CARTs

In [1]:
import os
import pandas as pd

Creating the paths to load the data

In [2]:
training_path=os.path.join('data','wildfires_train.csv')
testing_path=os.path.join('data','wildfires_test.csv')

Loading the training and testing data sets. Joining them with `pd.concat` turned out to be unnecessary because the data is already split.

In [3]:
training_dat=pd.read_csv(training_path)
testing_dat=pd.read_csv(testing_path)

In [4]:
# pd.concat takes a list of data frames
# wildfires=pd.concat([training_dat, testing_dat])  # not needed: no reason to concatenate them

In [5]:
training_dat.head()

Unnamed: 0,x,y,temp,humidity,windspd,winddir,rain,days,vulnerable,other,ranger,pre1950,heli,resources,traffic,burned,wlf
0,7.834467,8.306801,99.506964,65.940704,7.614523,W,3.7e-05,127,1157.377161,0,0,1,0,117.067076,med,791.620319,0
1,2.694922,3.551933,69.887657,31.895045,6.534184,E,4e-05,115,1134.429689,0,1,0,1,127.598019,hi,451.951898,0
2,6.498186,4.106111,91.15293,57.606073,11.580965,SE,4.1e-05,119,1209.603068,0,0,0,1,132.273679,hi,584.451361,1
3,8.750841,8.887995,54.360593,46.16672,15.383351,E,4e-05,112,1118.691631,0,0,0,0,116.482609,hi,589.681584,1
4,9.20021,9.810147,77.442791,25.490945,7.096639,NW,4.5e-05,146,1319.237687,0,0,1,0,136.52175,lo,1010.567058,0


Looking at the first few rows of the data set, we can see that most of the attributes are numerical and only 'winddir', 'traffic', and 'wlf' are categorical. However, 'wlf' will not be used in this analysis.

I will need to transform the attributes appropriately by building a pipeline with scikit-learn's `ColumnTransformer`.

In [6]:
training_dat.shape

(350, 17)

The training data set has only 350 observations across 17 attributes. With so few observations, I will most likely use 5-fold CV.

In [7]:
# checking for missingness in the data
training_dat.isna().aggregate('sum')

x             0
y             0
temp          0
humidity      0
windspd       0
winddir       0
rain          0
days          0
vulnerable    0
other         0
ranger        0
pre1950       0
heli          0
resources     0
traffic       0
burned        0
wlf           0
dtype: int64

Here we can see that no column has missing values, so we can apply transformers to the data and fit models to it directly. The data is already split into training and testing sets.

#### Transforming data using scikit-learn

In [8]:
# first argument is the data set, then the test_size, then the random_state
# wildfires_train, wildfires_test=train_test_split(wildfires, test_size=0.2, random_state=21)

Now we will use scikit-learn's `SimpleImputer` to account for any missingness in future data.

In [9]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

one_hot_encoder=OneHotEncoder()

I will separate the data into the predictors and the response.

In [10]:
training_dat=training_dat.drop('wlf', axis=1)

In [11]:
x_train=training_dat.drop('burned', axis=1)
y_train=training_dat['burned'].copy()

Here I will build a pipeline that imputes the median for any future missing numerical values. The categorical variables will be one-hot encoded in a separate step.

In [12]:
numerical_pipeline=Pipeline([
    ('imputer', SimpleImputer(strategy='median'))
])
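
As a minimal sketch of what the median strategy does, the example below fits a `SimpleImputer` on a small made-up matrix (not the wildfire data) and then fills gaps in new rows with the medians learned during fitting. The array names are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric matrix with one missing value per column.
X_fit = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [np.nan, 30.0],
                  [9.0, 50.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_fit)

# One median per column, computed while ignoring the missing entries.
learned_medians = imputer.statistics_

# New rows with gaps are filled with the medians learned from X_fit,
# not with medians of the new rows themselves.
X_new = np.array([[np.nan, 20.0],
                  [4.0, np.nan]])
X_filled = imputer.transform(X_new)
```

Because the training data here has no missing values, the imputer leaves it unchanged; it only matters if later data arrives with gaps.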

I need to specify the columns that will go into each pipeline

In [13]:
train_num=x_train.drop(['winddir', 'traffic'], axis='columns')

# this creates the list that will be fed into the Column Transformer
num_attribs=list(train_num)

The categorical attributes will be one-hot encoded, so I list them separately for the categorical step of the transformer.

In [14]:
cat_attribs=['winddir', 'traffic']

In [15]:
from sklearn.compose import ColumnTransformer

full_pipeline=ColumnTransformer([
    ('num', numerical_pipeline, num_attribs),
    ('categorical', one_hot_encoder, cat_attribs)
])

Creating the clean data that will be fed into the algorithms.

In [16]:
x_train_prepared=full_pipeline.fit_transform(x_train)
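
To make the effect of the column transformer concrete, here is a small sketch on a made-up four-row frame that borrows a few column names from the data set. The frame and the `toy_*` names are illustrative assumptions, not part of the lab.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows: two numerical and two categorical columns.
toy = pd.DataFrame({
    "temp": [99.5, 69.9, 91.2, 54.4],
    "windspd": [7.6, 6.5, 11.6, 15.4],
    "winddir": ["W", "E", "SE", "E"],
    "traffic": ["med", "hi", "hi", "lo"],
})

toy_num_attribs = ["temp", "windspd"]
toy_cat_attribs = ["winddir", "traffic"]

toy_pipeline = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median"))]), toy_num_attribs),
    ("categorical", OneHotEncoder(), toy_cat_attribs),
])

toy_prepared = toy_pipeline.fit_transform(toy)

# 2 numerical columns + 3 winddir levels + 3 traffic levels = 8 columns.
n_rows, n_cols = toy_prepared.shape
```

Each categorical column expands into one indicator column per level seen during fitting, so the prepared matrix is wider than the original frame.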

Now that I have the data prepared, I will use a Bagging Regressor to predict the number of hectares burned by the fire.

In [17]:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

In [18]:
bag_regressor=BaggingRegressor(DecisionTreeRegressor(),
                               n_estimators=500, # this tells BaggingRegressor to build 500 decision trees
                               max_samples=1, # NOTE: the integer 1 means one row per tree; the default is the float 1.0 (100% of rows)
                               bootstrap=True,
                               n_jobs=-1 # this tells the alg to use all available cores
                              )

In [19]:
bag_regressor.fit(x_train_prepared, y_train) # make sure to feed the prepared data onto the algorithm

BaggingRegressor(base_estimator=DecisionTreeRegressor(criterion='mse',
                                                      max_depth=None,
                                                      max_features=None,
                                                      max_leaf_nodes=None,
                                                      min_impurity_decrease=0.0,
                                                      min_impurity_split=None,
                                                      min_samples_leaf=1,
                                                      min_samples_split=2,
                                                      min_weight_fraction_leaf=0.0,
                                                      presort=False,
                                                      random_state=None,
                                                      splitter='best'),
                 bootstrap=True, bootstrap_features=False, max_features=1.0,
                 max_sample

To reduce overfitting, the trees can be regularized: increase the hyperparameters that start with `min_` (for example `min_samples_leaf` or `min_samples_split`) or decrease those that start with `max_` (for example `max_depth` or `max_leaf_nodes`).
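
A minimal sketch of this idea, using synthetic data rather than the wildfire set: an unrestricted tree is compared with one constrained by `min_samples_leaf` and `max_depth`. The specific values 10 and 4 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: one informative feature plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 50 * X[:, 0] + rng.normal(0, 25, size=200)

# Default settings let the tree grow until every leaf is pure.
unrestricted = DecisionTreeRegressor(random_state=0).fit(X, y)

# Raising a min_* value and lowering a max_* value limits tree growth.
regularized = DecisionTreeRegressor(min_samples_leaf=10,
                                    max_depth=4,
                                    random_state=0).fit(X, y)

n_leaves_unrestricted = unrestricted.get_n_leaves()
n_leaves_regularized = regularized.get_n_leaves()
train_r2_unrestricted = unrestricted.score(X, y)
```

The unrestricted tree reproduces the training data almost exactly, which is the overfitting the constraints are meant to prevent.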

The fit above, however, does not use cross validation. We can use `sklearn.model_selection.cross_val_predict` instead: it splits the training data into folds and predicts each fold with a model trained on the remaining folds, so every observation gets a prediction from a model that never saw it.

In [20]:
from sklearn.model_selection import cross_val_predict

`cross_val_predict` accepts an optional dictionary of fit parameters, which it forwards to the estimator's `fit` method. Hyperparameters such as `n_estimators` are not passed this way; they are set on the estimator itself before it is handed to `cross_val_predict`.

The function returns a NumPy array with one out-of-fold prediction per row of the data passed into it.
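
The sketch below shows the shape of the result on synthetic data (not the wildfire set), with a single decision tree standing in for the bagging model to keep it quick. The names `tree` and `out_of_fold` are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with 50 rows.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 3 * X[:, 0] + rng.normal(0, 1, size=50)

# Hyperparameters are set on the estimator itself.
tree = DecisionTreeRegressor(max_depth=3, random_state=0)

# Each row is predicted by a model trained on the other four folds.
out_of_fold = cross_val_predict(tree, X, y, cv=5)
```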

In [21]:
# hyperparameters for the bagging model, kept here for reference
parameters={
    'n_estimators':500, # default value is 10
    'max_samples':1, # NOTE: the integer 1 means one row per tree; the class default is the float 1.0
    'bootstrap':True,
    'n_jobs':-1
}

In [22]:
bagging_reg_predict=BaggingRegressor(DecisionTreeRegressor(),
                                     n_estimators=500, # this tells BaggingRegressor to build 500 decision trees
                                     max_samples=1,
                                     bootstrap=True,
                                     n_jobs=-1,
                                     random_state=402)

In [23]:
y_train_predict=cross_val_predict(bagging_reg_predict,
                                 x_train_prepared,
                                 y_train,
                                 cv=5)

In [24]:
# let's see what y_train_predict looks like
y_train_predict=pd.DataFrame(y_train_predict)
y_train_predict

Unnamed: 0,0
0,675.790317
1,675.790317
2,675.790317
3,675.790317
4,675.790317
...,...
345,714.711354
346,714.711354
347,714.711354
348,714.711354
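
The identical predictions within each fold follow from how `max_samples` is interpreted: an integer is a row count, while a float is a fraction of the training set. The sketch below uses synthetic data (not the wildfire set) and illustrative variable names to contrast `max_samples=1` with `max_samples=1.0`.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with a strong signal in the first feature.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 10 * X[:, 0] + rng.normal(0, 1, size=100)

# Integer 1: every tree is trained on a single row, so each tree predicts
# a constant and the ensemble average is the same for every input.
one_row = BaggingRegressor(DecisionTreeRegressor(), n_estimators=200,
                           max_samples=1, random_state=0).fit(X, y)

# Float 1.0: every tree is trained on a bootstrap sample the size of the data.
all_rows = BaggingRegressor(DecisionTreeRegressor(), n_estimators=200,
                            max_samples=1.0, random_state=0).fit(X, y)

# Range (max minus min) of the predictions from each ensemble.
spread_one_row = np.ptp(one_row.predict(X))
spread_all_rows = np.ptp(all_rows.predict(X))
```

With one row per tree the ensemble cannot respond to the features at all, which matches the repeated values in the table above.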


Now I will combine the predicted and observed values to calculate the MSE.

In [25]:
y_train_comp=pd.DataFrame(y_train)

comp_data=pd.concat([y_train_comp, y_train_predict], axis=1)
comp_data=comp_data.rename({0:'burned_hat'}, axis=1)
comp_data

Unnamed: 0,burned,burned_hat
0,791.620319,675.790317
1,451.951898,675.790317
2,584.451361,675.790317
3,589.681584,675.790317
4,1010.567058,675.790317
...,...,...
345,509.784673,714.711354
346,846.705612,714.711354
347,610.056881,714.711354
348,896.484081,714.711354


In [26]:
comp_data['squared_error']=(comp_data['burned'] - comp_data['burned_hat'])**2
comp_data['squared_error'].agg('mean')

88390.46456134808

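This is the cross-validated mean squared error for the **untuned** Bagging Regressor. Note that the predictions shown are identical within a fold: `max_samples=1` is an integer, so each tree was trained on a single observation and the ensemble cannot respond to the features. Later on, I will tune this same model using a randomized grid search.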
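
The same quantity can be computed without building a comparison frame. The sketch below uses only the first five observed and predicted values from the table above, so its result will not match the full-data MSE; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# First five rows of the comparison table (observed and predicted).
burned = np.array([791.620319, 451.951898, 584.451361, 589.681584, 1010.567058])
burned_hat = np.array([675.790317, 675.790317, 675.790317, 675.790317, 675.790317])

# Manual calculation, as in the cell above.
mse_manual = np.mean((burned - burned_hat) ** 2)

# Equivalent library call.
mse_sklearn = mean_squared_error(burned, burned_hat)

# The square root puts the error back in the units of the response.
rmse = np.sqrt(mse_sklearn)
```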

To keep all the training observations for each predictor while sampling the features, set `bootstrap=False` and `max_samples=1.0`, and set `bootstrap_features=True` and/or `max_features` to a value below 1.0; this is called the **Random Subspaces** method. Sampling both features and training instances is called the **Random Patches** method.
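
The two sampling schemes can be sketched with `BaggingRegressor` as follows. The data is synthetic, and the fractions 0.5 and 0.7 are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with six features.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 6))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=100)

# Random Subspaces: every tree sees all rows but only half of the features.
random_subspaces = BaggingRegressor(DecisionTreeRegressor(), n_estimators=20,
                                    bootstrap=False, max_samples=1.0,
                                    max_features=0.5, random_state=0).fit(X, y)

# Random Patches: every tree sees a sample of the rows and half of the features.
random_patches = BaggingRegressor(DecisionTreeRegressor(), n_estimators=20,
                                  bootstrap=True, max_samples=0.7,
                                  max_features=0.5, random_state=0).fit(X, y)

# Number of features each tree in the Random Patches ensemble was given.
features_per_tree = {len(f) for f in random_patches.estimators_features_}
```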

By sampling features, we trade an increase in bias for a decrease in variance.

### Random Forests

In [27]:
from sklearn.ensemble import RandomForestRegressor

In [28]:
rf_regressor=RandomForestRegressor(n_estimators=500,
                                  bootstrap=True,
                                  oob_score=True,
                                  random_state=402)

y_rf_pred=cross_val_predict(rf_regressor,
                           x_train_prepared,
                           y_train,
                           cv=5)

In [29]:
y_rf_pred=pd.DataFrame(y_rf_pred)
y_rf_pred

Unnamed: 0,0
0,718.999097
1,563.665488
2,704.792992
3,525.139538
4,921.579463
...,...
345,544.614606
346,748.036509
347,629.764562
348,744.130741


In [30]:
comp_rf=pd.concat([y_train_comp, y_rf_pred], axis=1)
comp_rf=comp_rf.rename({0:'burned_hat'}, axis=1)
comp_rf

Unnamed: 0,burned,burned_hat
0,791.620319,718.999097
1,451.951898,563.665488
2,584.451361,704.792992
3,589.681584,525.139538
4,1010.567058,921.579463
...,...,...
345,509.784673,544.614606
346,846.705612,748.036509
347,610.056881,629.764562
348,896.484081,744.130741


In [31]:
comp_rf['squared_error']=(comp_rf['burned'] - comp_rf['burned_hat']) ** 2
comp_rf['squared_error'].agg('mean')

8043.10963654247

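The Random Forest MSE (about 8,043) is much lower than the bagging model's (about 88,390), with both using 5-fold CV. The comparison is not like-for-like, however: the bagging model was fit with `max_samples=1`, which restricts each tree to a single observation, while the random forest trained each tree on a full bootstrap sample.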
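
The forest above was created with `oob_score=True`, but `cross_val_predict` fits clones of the estimator, so the out-of-bag estimate is never read. As a sketch of how it could be inspected after a direct fit, the example below uses synthetic data (not the wildfire set) and illustrative variable names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data with two informative features.
rng = np.random.RandomState(402)
X = rng.uniform(0, 10, size=(200, 4))
y = 40 * X[:, 0] - 15 * X[:, 1] + rng.normal(0, 20, size=200)

rf = RandomForestRegressor(n_estimators=200,
                           bootstrap=True,
                           oob_score=True,
                           random_state=402)
rf.fit(X, y)

# R^2 computed on rows each tree did not see in its bootstrap sample.
oob_r2 = rf.oob_score_

# Per-row out-of-bag predictions, usable for an MSE without separate CV.
oob_pred = rf.oob_prediction_
oob_mse = np.mean((y - oob_pred) ** 2)
```

The out-of-bag estimate plays a role similar to cross validation here, since each row is scored only by trees that were not trained on it.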