# Lab 02 CARTs

In [74]:
import os
import pandas as pd

Creating the paths to load the data

In [75]:
training_path=os.path.join('data','wildfires_train.csv')
testing_path=os.path.join('data','wildfires_test.csv')

Loading the data sets and joining them with `pd.concat`

In [76]:
training_dat=pd.read_csv(training_path)
testing_dat=pd.read_csv(testing_path)

In [77]:
# pd.concat takes a list of data frames
wildfires=pd.concat([training_dat, testing_dat])
wildfires=wildfires.drop('wlf', axis=1)
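A point worth checking after stacking frames: `pd.concat` keeps each frame's original row labels, so index values can repeat unless `ignore_index=True` is passed. Whether that matters here depends on how the two CSV files are indexed, which the notebook does not show. A minimal sketch with two hypothetical toy frames (`a` and `b` are not the wildfire data):

```python
import pandas as pd

# Two hypothetical frames, each with the default 0..n-1 index.
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

# Default behaviour: the original row labels are kept, so they repeat.
stacked = pd.concat([a, b])

# ignore_index=True renumbers the rows 0..n-1 after stacking.
renumbered = pd.concat([a, b], ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels are harmless for `train_test_split`, which splits by position, but they make label-based lookups with `.loc` ambiguous.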

In [78]:
wildfires.head()

| | x | y | temp | humidity | windspd | winddir | rain | days | vulnerable | other | ranger | pre1950 | heli | resources | traffic | burned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.834467 | 8.306801 | 99.506964 | 65.940704 | 7.614523 | W | 3.7e-05 | 127 | 1157.377161 | 0 | 0 | 1 | 0 | 117.067076 | med | 791.620319 |
| 1 | 2.694922 | 3.551933 | 69.887657 | 31.895045 | 6.534184 | E | 4e-05 | 115 | 1134.429689 | 0 | 1 | 0 | 1 | 127.598019 | hi | 451.951898 |
| 2 | 6.498186 | 4.106111 | 91.15293 | 57.606073 | 11.580965 | SE | 4.1e-05 | 119 | 1209.603068 | 0 | 0 | 0 | 1 | 132.273679 | hi | 584.451361 |
| 3 | 8.750841 | 8.887995 | 54.360593 | 46.16672 | 15.383351 | E | 4e-05 | 112 | 1118.691631 | 0 | 0 | 0 | 0 | 116.482609 | hi | 589.681584 |
| 4 | 9.20021 | 9.810147 | 77.442791 | 25.490945 | 7.096639 | NW | 4.5e-05 | 146 | 1319.237687 | 0 | 0 | 1 | 0 | 136.52175 | lo | 1010.567058 |


Looking at the first few rows of the data set, we can see that most of the attributes are numerical and only 'winddir' and 'traffic' are categorical.

I will need to transform the numerical and categorical attributes separately by building a pipeline with `ColumnTransformer`.
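Rather than reading the column types off `head()`, the split into numerical and categorical attributes can also be derived from the dtypes. The sketch below uses a small hypothetical frame (`toy`, with made-up values) in place of the wildfire data:

```python
import pandas as pd

# Hypothetical toy frame mimicking a few wildfire columns (values are made up).
toy = pd.DataFrame({
    "temp": [99.5, 69.9, 91.2],
    "days": [127, 115, 119],
    "winddir": ["W", "E", "SE"],
    "traffic": ["med", "hi", "hi"],
})

# Columns that are not numeric are the candidates for one-hot encoding.
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Columns with a numeric dtype go to the numerical pipeline.
num_cols = toy.select_dtypes(include="number").columns.tolist()

print(cat_cols)
print(num_cols)
```

A dtype check treats 0/1 indicator columns as numeric, so any decision to handle such flags differently still has to be made by hand.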

In [79]:
wildfires.shape

(500, 16)

The data set has only 500 observations and 16 columns: 15 predictors plus the target, `burned`.

In [80]:
# checking for missingness in the data
wildfires.isna().aggregate('sum')

x             0
y             0
temp          0
humidity      0
windspd       0
winddir       0
rain          0
days          0
vulnerable    0
other         0
ranger        0
pre1950       0
heli          0
resources     0
traffic       0
burned        0
dtype: int64

Here we can see that no column has missing values. Since there is no missingness in the data, we can apply transformers and fit models without dropping or filling any rows. But first, the data must be split into training and testing data.

#### Transforming data using scikit-learn

In [81]:
from sklearn.model_selection import train_test_split

In [82]:
# first argument is the data set, then the test_size, then the random_state
wildfires_train, wildfires_test=train_test_split(wildfires, test_size=0.2, random_state=21)
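To illustrate what the split above produces, here is a small sketch on a hypothetical 500-row frame (`toy`, synthetic values). It checks the resulting sizes and that a fixed `random_state` reproduces the same split:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 500-row frame standing in for the wildfire data.
toy = pd.DataFrame({"x": np.arange(500), "burned": np.arange(500) * 2.0})

# Same arguments as in the notebook: 20% held out, fixed seed.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=21)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=21)

print(train_a.shape, test_a.shape)          # (400, 2) (100, 2)
print(train_a.index.equals(train_b.index))  # True: same seed, same split
```

With 500 rows and `test_size=0.2`, 400 rows go to training and 100 to testing, which matches the `Length: 400` shown for the training labels later in the notebook.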

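Now we will build a numerical pipeline with scikit-learn. Although the current data has no missing values, a `SimpleImputer` with the median strategy is included to handle any missingness in future data, followed by a `StandardScaler`.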

In [83]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

In [84]:
numerical_pipeline=Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('standard_scaler', StandardScaler())
])
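As a quick check of what this pipeline does, the sketch below runs the same two steps on a hypothetical one-column array containing a missing value. The array `X` is made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature with one missing entry.
X = np.array([[1.0], [2.0], [np.nan], [3.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("standard_scaler", StandardScaler()),
])

out = pipe.fit_transform(X)

# The imputer learned the median of the observed values (1, 2, 3).
print(pipe.named_steps["imputer"].statistics_)  # [2.]
# After scaling, the column has mean 0; the imputed row sits at the centre.
print(out.ravel())
```

The missing entry is replaced by the median, 2.0, before scaling, so it ends up at 0 in the standardized column.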

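I need to specify the columns that will go into each pipeline.

In [85]:
# numerical predictors only: drop the categorical columns and the target 'burned'
# so the label does not leak into the prepared features
wildfires_num=wildfires.drop(['winddir', 'traffic', 'burned'], axis='columns')
num_attribs=list(wildfires_num)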

The categorical attributes will be one-hot encoded, so I will create the categorical step of the pipeline as well.

In [86]:
from sklearn.preprocessing import OneHotEncoder

In [87]:
one_hot_encoder=OneHotEncoder()

In [88]:
cat_attribs=['winddir', 'traffic']

In [89]:
from sklearn.compose import ColumnTransformer

full_pipeline=ColumnTransformer([
    ('num', numerical_pipeline, num_attribs),
    ('categorical', one_hot_encoder, cat_attribs)
])
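The sketch below shows the same `ColumnTransformer` pattern on a hypothetical toy split (`train` and `test`, with made-up values). Two choices are assumptions added for illustration and are not in the notebook: `handle_unknown='ignore'`, so that a category unseen during fitting does not raise an error, and `sparse_threshold=0`, so that the output is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; 'burned' plays the role of the target.
train = pd.DataFrame({
    "temp": [99.5, 69.9, 91.2, 54.4],
    "winddir": ["W", "E", "SE", "E"],
    "burned": [791.6, 452.0, 584.5, 589.7],
})
test = pd.DataFrame({"temp": [77.4], "winddir": ["NW"], "burned": [1010.6]})

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("standard_scaler", StandardScaler()),
])

prep = ColumnTransformer(
    [
        ("num", num_pipe, ["temp"]),
        # Assumption: ignore categories that were not seen during fit.
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["winddir"]),
    ],
    sparse_threshold=0,  # assumption: always return a dense array
)

# Fit on the training rows only, then reuse the fitted transformer on test.
train_prepared = prep.fit_transform(train)
test_prepared = prep.transform(test)

print(train_prepared.shape)  # (4, 4): 1 scaled column + 3 wind directions
print(test_prepared)
```

Columns not listed in any transformer, here `burned`, are dropped by default (`remainder='drop'`), so the target never reaches the prepared features. The unseen direction `NW` is encoded as all zeros in the one-hot columns.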

Creating the clean data that excludes the label/target.

In [90]:
wildfires=wildfires_train.drop('burned', axis=1)
wildfires_labels=wildfires_train['burned'].copy()

In [91]:
wildfires_labels

56      335.889600
57     1172.924723
176     516.859659
300     563.943221
124     871.281377
          ...     
48      887.634781
260     424.163153
312     549.893486
207     851.064431
107     677.784273
Name: burned, Length: 400, dtype: float64

In [92]:
wildfires_prepared=full_pipeline.fit_transform(wildfires_train)

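Now that I have the data prepared, I will use a bagging ensemble to predict the number of hectares burned by the fire. Because `burned` is a continuous target, this is a regression problem, so the regressor versions of the ensembles are needed rather than the classifiers.

In [93]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import BaggingRegressor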
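Since `burned` is a continuous quantity, a regression ensemble is the natural fit for this target. The sketch below shows how a bagging model for a continuous target could be fitted with `BaggingRegressor`. The data is synthetic and the hyperparameters (`n_estimators=50`, `random_state=21`) are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

# Synthetic stand-in for the prepared features and the 'burned' target.
rng = np.random.default_rng(21)
X = rng.normal(size=(200, 3))
y = 500 + 100 * X[:, 0] + rng.normal(scale=5, size=200)

# Bagging fits each base model (a decision tree by default) on a bootstrap
# sample of the rows and averages their predictions.
model = BaggingRegressor(n_estimators=50, random_state=21)
model.fit(X, y)

preds = model.predict(X[:5])
print(preds.shape)  # (5,)
```

In the notebook's setting, `X` would be the prepared training matrix and `y` the training labels. The score on the training rows is optimistic, so the held-out test set is the place to judge the model.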