# Multiclass Classification
In this notebook, we show the analysis and modeling of what we expected to use to predict wildfires. It was a multiclass classification because we had 0 (represents the coordinate having no fire on that day), 1 (first day of the fire), and 2 (all the other days of the fire -- 2nd day of fire until containment date). In addition to building models for the multiclass classification, we looked at a pixel factor of 3, 4, and 5. We have generated different pickle files with numbers that correspond to the factor number. 

The issue that we found was that the part where we determine if a date was the first day of the fire vs all the other days of the fire was wrong. Some of the dates listed could be under both categories, so we realized that it would be best to use binary classification. As a result, the results here are not usable, so we changed it to a simple yes fire and no fire task. The binary classification task is shown in another notebook.

In [1]:
import pandas as pd
import pickle as pkl
import numpy as np 
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.metrics import confusion_matrix
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

**In this part of the notebook, we did multiclass classification with a factor = 5. We used the following models: CART, Random Forest, LDA, Vanilla Bagging, and Gradient Boosting Classifier.**

## Import data

In [2]:
#pixel factor is 5
fire_data = pd.DataFrame(pd.read_pickle('data/gridmetmc5.pkl'))
fire_data.columns = pd.read_pickle('data/gridmetColsmc5.pkl')
fire_data.info()
fire_data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122342 entries, 0 to 122341
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   day                122342 non-null  datetime64[ns]
 1   burning_index_g    122342 non-null  object        
 2   relative_humidity  122342 non-null  object        
 3   air_temperature    122342 non-null  object        
 4   wind_speed         122342 non-null  object        
 5   fire               122342 non-null  object        
dtypes: datetime64[ns](1), object(5)
memory usage: 5.6+ MB


Unnamed: 0,day,burning_index_g,relative_humidity,air_temperature,wind_speed,fire
0,2017-01-01,0.0,55.100002,273.799988,4.6,0
1,2017-01-01,0.0,43.700001,277.0,4.8,0
2,2017-01-01,0.0,47.799999,276.0,5.1,0
3,2017-01-01,0.0,63.900002,272.100006,5.6,0
4,2017-01-01,0.0,53.299999,273.299988,6.0,0
...,...,...,...,...,...,...
122337,2021-12-31,2.0,53.900002,280.200012,1.4,0
122338,2021-12-31,7.0,54.200001,276.700012,2.2,0
122339,2021-12-31,0.0,47.700001,271.700012,2.9,0
122340,2021-12-31,0.0,59.5,269.399994,3.0,0


In [3]:
fire_data['month'] = pd.DatetimeIndex(fire_data['day']).month
fire_data['date'] = pd.DatetimeIndex(fire_data['day']).day
fire_data['year'] = pd.DatetimeIndex(fire_data['day']).year
fire_data.info()
fire_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122342 entries, 0 to 122341
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   day                122342 non-null  datetime64[ns]
 1   burning_index_g    122342 non-null  object        
 2   relative_humidity  122342 non-null  object        
 3   air_temperature    122342 non-null  object        
 4   wind_speed         122342 non-null  object        
 5   fire               122342 non-null  object        
 6   month              122342 non-null  int64         
 7   date               122342 non-null  int64         
 8   year               122342 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(5)
memory usage: 8.4+ MB


Unnamed: 0,day,burning_index_g,relative_humidity,air_temperature,wind_speed,fire,month,date,year
0,2017-01-01,0.0,55.100002,273.799988,4.6,0,1,1,2017
1,2017-01-01,0.0,43.700001,277.0,4.8,0,1,1,2017
2,2017-01-01,0.0,47.799999,276.0,5.1,0,1,1,2017
3,2017-01-01,0.0,63.900002,272.100006,5.6,0,1,1,2017
4,2017-01-01,0.0,53.299999,273.299988,6.0,0,1,1,2017


In [4]:
fire_data.isnull().sum()

day                  0
burning_index_g      0
relative_humidity    0
air_temperature      0
wind_speed           0
fire                 0
month                0
date                 0
year                 0
dtype: int64

In [7]:
fire_data_int = fire_data.drop(columns = ['fire', 'day'], index=1).apply(pd.to_numeric)
fire_data_int['fire'] = fire_data['fire']
fire_data_int['day'] = fire_data['day']
fire_data_int.info()
fire_data_int.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 122341 entries, 0 to 122341
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   burning_index_g    122341 non-null  float64       
 1   relative_humidity  122341 non-null  float64       
 2   air_temperature    122341 non-null  float64       
 3   wind_speed         122341 non-null  float64       
 4   month              122341 non-null  int64         
 5   date               122341 non-null  int64         
 6   year               122341 non-null  int64         
 7   fire               122341 non-null  object        
 8   day                122341 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(4), int64(3), object(1)
memory usage: 9.3+ MB


Unnamed: 0,burning_index_g,relative_humidity,air_temperature,wind_speed,month,date,year,fire,day
0,0.0,55.100002,273.799988,4.6,1,1,2017,0,2017-01-01
2,0.0,47.799999,276.0,5.1,1,1,2017,0,2017-01-01
3,0.0,63.900002,272.100006,5.6,1,1,2017,0,2017-01-01
4,0.0,53.299999,273.299988,6.0,1,1,2017,0,2017-01-01
5,0.0,56.0,271.299988,5.8,1,1,2017,0,2017-01-01


## Get Sample (n=40000)

In [8]:
fire_data_int = fire_data_int.sample(n=40000, random_state=88)
fire_data_int.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 95367 to 4468
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   burning_index_g    40000 non-null  float64       
 1   relative_humidity  40000 non-null  float64       
 2   air_temperature    40000 non-null  float64       
 3   wind_speed         40000 non-null  float64       
 4   month              40000 non-null  int64         
 5   date               40000 non-null  int64         
 6   year               40000 non-null  int64         
 7   fire               40000 non-null  object        
 8   day                40000 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(4), int64(3), object(1)
memory usage: 3.1+ MB


### Split into train and test

In [9]:
y = fire_data_int['fire']
X = fire_data_int.drop(columns = ['fire'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=88)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((28000, 8), (12000, 8), (28000,), (12000,))

In [10]:
len(fire_data_int[fire_data_int.fire == 0]), len(fire_data_int[fire_data_int.fire == 1]), len(fire_data_int[fire_data_int.fire == 2])


(23702, 526, 15772)

In [11]:
fire_data_int

Unnamed: 0,burning_index_g,relative_humidity,air_temperature,wind_speed,month,date,year,fire,day
95367,22.0,31.400000,282.899994,1.4,11,24,2020,2,2020-11-24
86764,49.0,17.500000,304.799988,2.2,7,18,2020,0,2020-07-18
32089,29.0,21.200001,296.899994,2.7,4,24,2018,0,2018-04-24
59254,35.0,20.300001,302.600006,2.7,6,4,2019,0,2019-06-04
78148,37.0,21.200001,290.000000,3.5,3,12,2020,0,2020-03-12
...,...,...,...,...,...,...,...,...,...
4332,0.0,51.900002,274.200012,5.2,3,6,2017,0,2017-03-06
41103,83.0,13.900001,299.299988,4.7,9,6,2018,2,2018-09-06
69813,43.0,11.400001,292.600006,2.0,11,8,2019,0,2019-11-08
25955,17.0,41.500000,280.500000,1.8,1,23,2018,0,2018-01-23


In [12]:
season_dict = {1: 'Winter',
               2: 'Winter',
               3: 'Spring', 
               4: 'Spring',
               5: 'Spring',
               6: 'Summer',
               7: 'Summer',
               8: 'Summer',
               9: 'Fall',
               10: 'Fall',
               11: 'Fall',
               12: 'Winter'}
fire_data_season = fire_data_int.copy()
fire_data_season['season'] = fire_data_season['month'].apply(lambda x: season_dict[x])
fire_data_season = pd.get_dummies(fire_data_season, columns=['season'])
fire_data_season

Unnamed: 0,burning_index_g,relative_humidity,air_temperature,wind_speed,month,date,year,fire,day,season_Fall,season_Spring,season_Summer,season_Winter
95367,22.0,31.400000,282.899994,1.4,11,24,2020,2,2020-11-24,1,0,0,0
86764,49.0,17.500000,304.799988,2.2,7,18,2020,0,2020-07-18,0,0,1,0
32089,29.0,21.200001,296.899994,2.7,4,24,2018,0,2018-04-24,0,1,0,0
59254,35.0,20.300001,302.600006,2.7,6,4,2019,0,2019-06-04,0,0,1,0
78148,37.0,21.200001,290.000000,3.5,3,12,2020,0,2020-03-12,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4332,0.0,51.900002,274.200012,5.2,3,6,2017,0,2017-03-06,0,1,0,0
41103,83.0,13.900001,299.299988,4.7,9,6,2018,2,2018-09-06,1,0,0,0
69813,43.0,11.400001,292.600006,2.0,11,8,2019,0,2019-11-08,1,0,0,0
25955,17.0,41.500000,280.500000,1.8,1,23,2018,0,2018-01-23,0,0,0,1


# Models

First, we can check the VIFs of our variables. They are all below 5, which is great! 

In [13]:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# The dataframe passed to VIF must include the intercept term. We add it the same way we did before.
#first argument: training dataset
#second argument: columns of independent variables as a list that we want to look through
def VIF(df, columns):
    values = sm.add_constant(df[columns]).values
    num_columns = len(columns)+1
    vif = [variance_inflation_factor(values, i) for i in range(num_columns)]
    return pd.Series(vif[1:], index=columns)

cols = ['burning_index_g', 'relative_humidity', 'air_temperature',
       'wind_speed', 'month', 'date', 'year']
VIF(X_train, cols)

burning_index_g      3.070369
relative_humidity    3.351614
air_temperature      2.449430
wind_speed           1.362207
month                1.117083
date                 1.001301
year                 1.014998
dtype: float64

In [14]:
y = fire_data_int['fire']
X = fire_data_int.drop(columns = ['fire', 'day'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=88)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((28000, 7), (12000, 7), (28000,), (12000,))

In [15]:
y_train = y_train.astype('int')
y_test = y_test.astype('int')
y_train

31105     0
102328    0
79476     0
18025     2
42567     2
         ..
41515     2
122272    0
90108     2
51791     0
21636     0
Name: fire, Length: 28000, dtype: int64

In [16]:
def scores(y_test, y_pred):
    cm  = confusion_matrix(y_test, y_pred)
    acc = accuracy_score(y_test, y_pred)
    TPR = cm[1][1] / sum(cm[1])            # TP/(TP+FP)
    FPR = cm[0][1] / sum(cm[0])            # FP/(FP+TN)
    PRE = cm[1][1] / (cm[0][1] + cm[1][1]) # TP/(TP+FP)
    print(f'Accuracy: {acc:.4f}, TPR: {TPR:.4f}, FPR: {FPR:.4f}, Precision: {PRE:.4f}')
    return acc, TPR, FPR, PRE

## Baseline

In [17]:
cm = confusion_matrix(y_test,[0]*y_test.shape[0])
baseline_acc = y_test.value_counts().values.max() / len(y_test)
baseline_TPR = cm[1][1] / sum(cm[1])
baseline_FPR = cm[0][1] / sum(cm[0])
baseline_PRE = 0 
print(f'Accuracy: {baseline_acc:.4f}, TPR: {baseline_TPR:.4f}, FPR: {baseline_FPR:.4f}, Precision: {baseline_PRE:.4f}')

Accuracy: 0.5918, TPR: 0.0000, FPR: 0.0000, Precision: 0.0000


## CART: Decision Tree Classification

In [18]:
# without cross validation 
dtc = DecisionTreeClassifier(min_samples_leaf=5, min_samples_split=20, 
                             random_state=88)
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
dtc_acc,dtc_TPR,dtc_FPR,dtc_PRE = scores(y_test, y_pred)

Accuracy: 0.8891, TPR: 0.6353, FPR: 0.0051, Precision: 0.7500


In [19]:
# with cross validation 
grid_values = {'ccp_alpha': np.linspace(0, 0.10, 201),
               'min_samples_leaf': [5],
               'min_samples_split': [20],
               'random_state': [88]}

dtc = DecisionTreeClassifier(random_state=88)
dtc_cv = GridSearchCV(dtc, param_grid=grid_values, scoring='accuracy', cv=10) 
dtc_cv.fit(X_train, y_train)
y_pred = dtc_cv.best_estimator_.predict(X_test)
dtc_cv_acc,dtc_cv_TPR,dtc_cv_FPR,dtc_cv_PRE = scores(y_test, y_pred)

Accuracy: 0.8993, TPR: 0.0941, FPR: 0.0018, Precision: 0.5517


## Random Forest

In [20]:
# without cross validation: vanilla bagging
rf = RandomForestClassifier(max_features=X_train.shape[1], min_samples_leaf=5, 
                            n_estimators=500, random_state=88)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

rf_acc,rf_TPR,rf_FPR,rf_PRE = scores(y_test, y_pred)

Accuracy: 0.9163, TPR: 0.5412, FPR: 0.0028, Precision: 0.8214


In [21]:
# with cross validation 
grid_values = {'max_features': np.arange(1, X_train.shape[1]),
               'min_samples_leaf': [5],
               'n_estimators': [500],
               'random_state': [88]}

rf_c = RandomForestClassifier(random_state=88)
rf_cv = GridSearchCV(rf_c, param_grid=grid_values, scoring='accuracy', cv=10)
rf_cv.fit(X_train, y_train)

y_pred = rf_cv.predict(X_test)

rf_max_features = rf_cv.best_params_['max_features']
print(f'Best max_features: {rf_max_features}')
rf_cv_acc,rf_cv_TPR,rf_cv_FPR,rf_cv_PRE = scores(y_test, y_pred)

Best max_features: 6
Accuracy: 0.9167, TPR: 0.5000, FPR: 0.0027, Precision: 0.8173


## LDA

In [22]:
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
y_pred = lda.predict(X_test)
lda_acc,lda_TPR,lda_FPR,lda_PRE = scores(y_test, y_pred)

Accuracy: 0.7675, TPR: 0.0000, FPR: 0.0000, Precision: nan


  PRE = cm[1][1] / (cm[0][1] + cm[1][1]) # TP/(TP+FP)


## Vanilla Bagging

In [23]:
total_features = len(X_train.columns)
bagging = RandomForestClassifier(max_features=X_train.shape[1], min_samples_leaf=5, 
                            n_estimators=500, random_state=88)
bagging.fit(X_train, y_train)
y_pred = bagging.predict(X_test)
bagging_acc,bagging_TPR,bagging_FPR,bagging_PRE = scores(y_test, y_pred)

Accuracy: 0.9163, TPR: 0.5412, FPR: 0.0028, Precision: 0.8214


In [24]:
total_features = len(X_train.columns)
bagging = RandomForestClassifier(max_features = total_features, random_state=88)
bagging.fit(X_train, y_train)
y_pred = bagging.predict(X_test)
bagging_acc,bagging_TPR,bagging_FPR,bagging_PRE = scores(y_test, y_pred)

Accuracy: 0.9144, TPR: 0.6706, FPR: 0.0032, Precision: 0.8321


## Gradient Boosting Classifier

In [25]:
# wihtout cross validation
gbc = GradientBoostingClassifier(n_estimators=3300, max_leaf_nodes=10, random_state=88) 
gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)
gbc_acc,gbc_TPR,gbc_FPR,gbc_PRE = scores(y_test, y_pred)

Accuracy: 0.8996, TPR: 0.4882, FPR: 0.0025, Precision: 0.8218


## Comparison table

In [26]:
#Create Comparison Table
#These lines are provided for you to help construct a comparison table.
#It is not requred to follow this format. + You need to find ACC, TPR, FPR, PRE for each model that you choose.
comparison_data_5 = {'Baseline':[baseline_acc,baseline_TPR,baseline_FPR, baseline_PRE],
                   'Gradient Boosting Classifer':[gbc_acc,gbc_TPR,gbc_FPR, gbc_PRE],
                   'Decision Tree Classifier':[dtc_acc,dtc_TPR,dtc_FPR,dtc_PRE],
                   'Decision Tree Classifier with CV':[dtc_cv_acc,dtc_cv_TPR,dtc_cv_FPR,dtc_cv_PRE],
                   'Random Forest':[rf_acc,rf_TPR, rf_FPR,rf_PRE],
                   'Random Forest with CV':[rf_acc,rf_TPR, rf_FPR,rf_PRE],
                   'Linear Discriminant Analysis':[lda_acc,lda_TPR, lda_FPR,lda_PRE],
                   'Vanilla Bagging': [bagging_acc,bagging_TPR,bagging_FPR,bagging_PRE]}

comparison_table_5 = pd.DataFrame(data=comparison_data_5, index=['Accuracy', 'TPR', 'FPR','PRE']).transpose()
comparison_table_5.style.set_properties(**{'font-size': '12pt',}).set_table_styles([{'selector': 'th', 'props': [('font-size', '10pt')]}])
comparison_table_5

Unnamed: 0,Accuracy,TPR,FPR,PRE
Baseline,0.591833,0.0,0.0,0.0
Gradient Boosting Classifer,0.899583,0.488235,0.002534,0.821782
Decision Tree Classifier,0.889083,0.635294,0.005069,0.75
Decision Tree Classifier with CV,0.899333,0.094118,0.00183,0.551724
Random Forest,0.91625,0.541176,0.002816,0.821429
Random Forest with CV,0.91625,0.541176,0.002816,0.821429
Linear Discriminant Analysis,0.7675,0.0,0.0,
Vanilla Bagging,0.914417,0.670588,0.003239,0.832117


<hr style="height:5px;border-width:5;color:black;background-color:black">

**In this part of the notebook, we did multiclass classification with a factor = 3. We used the following models: CART, Random Forest, LDA, Vanilla Bagging, and Gradient Boosting Classifier.**

## Import data

In [27]:
#pixel factor is 3
fire_data = pd.DataFrame(pd.read_pickle('data/gridmetmc3.pkl'))
fire_data.columns = pd.read_pickle('data/gridmetColsmc3.pkl')
fire_data.info()
fire_data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350592 entries, 0 to 350591
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   day                350592 non-null  datetime64[ns]
 1   burning_index_g    350592 non-null  object        
 2   relative_humidity  350592 non-null  object        
 3   air_temperature    350592 non-null  object        
 4   wind_speed         350592 non-null  object        
 5   fire               350592 non-null  object        
dtypes: datetime64[ns](1), object(5)
memory usage: 16.0+ MB


Unnamed: 0,day,burning_index_g,relative_humidity,air_temperature,wind_speed,fire
0,2017-01-01,0.0,52.400002,276.100006,4.3,0
1,2017-01-01,0.0,47.600002,275.399994,3.9,0
2,2017-01-01,0.0,45.5,276.799988,4.2,0
3,2017-01-01,0.0,41.700001,278.100006,4.5,0
4,2017-01-01,0.0,48.600002,275.600006,4.8,0
...,...,...,...,...,...,...
350587,2021-12-31,0.0,47.100002,272.200012,2.9,0
350588,2021-12-31,0.0,52.5,271.700012,2.9,0
350589,2021-12-31,16.0,54.0,272.299988,3.2,0
350590,2021-12-31,10.0,55.400002,273.399994,4.0,0


In [28]:
fire_data['month'] = pd.DatetimeIndex(fire_data['day']).month
fire_data['date'] = pd.DatetimeIndex(fire_data['day']).day
fire_data['year'] = pd.DatetimeIndex(fire_data['day']).year
fire_data.info()
fire_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350592 entries, 0 to 350591
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   day                350592 non-null  datetime64[ns]
 1   burning_index_g    350592 non-null  object        
 2   relative_humidity  350592 non-null  object        
 3   air_temperature    350592 non-null  object        
 4   wind_speed         350592 non-null  object        
 5   fire               350592 non-null  object        
 6   month              350592 non-null  int64         
 7   date               350592 non-null  int64         
 8   year               350592 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(5)
memory usage: 24.1+ MB


Unnamed: 0,day,burning_index_g,relative_humidity,air_temperature,wind_speed,fire,month,date,year
0,2017-01-01,0.0,52.400002,276.100006,4.3,0,1,1,2017
1,2017-01-01,0.0,47.600002,275.399994,3.9,0,1,1,2017
2,2017-01-01,0.0,45.5,276.799988,4.2,0,1,1,2017
3,2017-01-01,0.0,41.700001,278.100006,4.5,0,1,1,2017
4,2017-01-01,0.0,48.600002,275.600006,4.8,0,1,1,2017


In [29]:
fire_data.isnull().sum()

day                  0
burning_index_g      0
relative_humidity    0
air_temperature      0
wind_speed           0
fire                 0
month                0
date                 0
year                 0
dtype: int64

In [32]:
fire_data_int = fire_data.drop(columns = ['fire', 'day'], index=1).apply(pd.to_numeric)
fire_data_int['fire'] = fire_data['fire']
fire_data_int['day'] = fire_data['day']
fire_data_int.info()
fire_data_int.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 350591 entries, 0 to 350591
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   burning_index_g    350591 non-null  float64       
 1   relative_humidity  350591 non-null  float64       
 2   air_temperature    350591 non-null  float64       
 3   wind_speed         350591 non-null  float64       
 4   month              350591 non-null  int64         
 5   date               350591 non-null  int64         
 6   year               350591 non-null  int64         
 7   fire               350591 non-null  object        
 8   day                350591 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(4), int64(3), object(1)
memory usage: 26.7+ MB


Unnamed: 0,burning_index_g,relative_humidity,air_temperature,wind_speed,month,date,year,fire,day
0,0.0,52.400002,276.100006,4.3,1,1,2017,0,2017-01-01
2,0.0,45.5,276.799988,4.2,1,1,2017,0,2017-01-01
3,0.0,41.700001,278.100006,4.5,1,1,2017,0,2017-01-01
4,0.0,48.600002,275.600006,4.8,1,1,2017,0,2017-01-01
5,0.0,49.799999,274.899994,5.1,1,1,2017,0,2017-01-01


## Get Sample (n=40000)

In [33]:
fire_data_int = fire_data_int.sample(n=40000, random_state=88)
fire_data_int.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 193128 to 327180
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   burning_index_g    40000 non-null  float64       
 1   relative_humidity  40000 non-null  float64       
 2   air_temperature    40000 non-null  float64       
 3   wind_speed         40000 non-null  float64       
 4   month              40000 non-null  int64         
 5   date               40000 non-null  int64         
 6   year               40000 non-null  int64         
 7   fire               40000 non-null  object        
 8   day                40000 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(4), int64(3), object(1)
memory usage: 3.1+ MB


### Split into train and test

In [34]:
y = fire_data_int['fire']
X = fire_data_int.drop(columns = ['fire'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=88)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((28000, 8), (12000, 8), (28000,), (12000,))

In [35]:
len(fire_data_int[fire_data_int.fire == 0]), len(fire_data_int[fire_data_int.fire == 1]), len(fire_data_int[fire_data_int.fire == 2])



(26076, 485, 13439)

In [36]:
fire_data_int

Unnamed: 0,burning_index_g,relative_humidity,air_temperature,wind_speed,month,date,year,fire,day
193128,41.0,17.300001,296.899994,2.5,10,3,2019,2,2019-10-03
257376,66.0,16.200001,306.299988,3.1,9,2,2020,0,2020-09-02
158166,25.0,45.700001,284.000000,6.4,4,4,2019,0,2019-04-04
163414,28.0,22.700001,288.500000,1.7,5,2,2019,0,2019-05-02
172277,46.0,25.300001,298.700012,4.2,6,17,2019,2,2019-06-17
...,...,...,...,...,...,...,...,...,...
323694,69.0,14.000000,305.899994,2.6,8,13,2021,0,2021-08-13
46134,78.0,13.800000,303.200012,4.6,8,29,2017,1,2017-08-29
168584,0.0,38.700001,291.399994,3.1,5,29,2019,0,2019-05-29
89903,33.0,20.200001,289.799988,3.3,4,14,2018,0,2018-04-14


In [37]:
season_dict = {1: 'Winter',
               2: 'Winter',
               3: 'Spring', 
               4: 'Spring',
               5: 'Spring',
               6: 'Summer',
               7: 'Summer',
               8: 'Summer',
               9: 'Fall',
               10: 'Fall',
               11: 'Fall',
               12: 'Winter'}
fire_data_season = fire_data_int.copy()
fire_data_season['season'] = fire_data_season['month'].apply(lambda x: season_dict[x])
fire_data_season = pd.get_dummies(fire_data_season, columns=['season'])
fire_data_season

Unnamed: 0,burning_index_g,relative_humidity,air_temperature,wind_speed,month,date,year,fire,day,season_Fall,season_Spring,season_Summer,season_Winter
193128,41.0,17.300001,296.899994,2.5,10,3,2019,2,2019-10-03,1,0,0,0
257376,66.0,16.200001,306.299988,3.1,9,2,2020,0,2020-09-02,1,0,0,0
158166,25.0,45.700001,284.000000,6.4,4,4,2019,0,2019-04-04,0,1,0,0
163414,28.0,22.700001,288.500000,1.7,5,2,2019,0,2019-05-02,0,1,0,0
172277,46.0,25.300001,298.700012,4.2,6,17,2019,2,2019-06-17,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
323694,69.0,14.000000,305.899994,2.6,8,13,2021,0,2021-08-13,0,0,1,0
46134,78.0,13.800000,303.200012,4.6,8,29,2017,1,2017-08-29,0,0,1,0
168584,0.0,38.700001,291.399994,3.1,5,29,2019,0,2019-05-29,0,1,0,0
89903,33.0,20.200001,289.799988,3.3,4,14,2018,0,2018-04-14,0,1,0,0


# Models

First, we can check the VIFs of our variables. They are all below 5, which is great! 

In [38]:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# The dataframe passed to VIF must include the intercept term. We add it the same way we did before.
#first argument: training dataset
#second argument: columns of independent variables as a list that we want to look through
def VIF(df, columns):
    values = sm.add_constant(df[columns]).values
    num_columns = len(columns)+1
    vif = [variance_inflation_factor(values, i) for i in range(num_columns)]
    return pd.Series(vif[1:], index=columns)

cols = ['burning_index_g', 'relative_humidity', 'air_temperature',
       'wind_speed', 'month', 'date', 'year']
VIF(X_train, cols)

burning_index_g      3.065444
relative_humidity    3.329797
air_temperature      2.521403
wind_speed           1.318651
month                1.115389
date                 1.001623
year                 1.015574
dtype: float64

In [39]:
y = fire_data_int['fire']
X = fire_data_int.drop(columns = ['fire', 'day'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=88)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((28000, 7), (12000, 7), (28000,), (12000,))

In [40]:
y_train = y_train.astype('int')
y_test = y_test.astype('int')
y_train

51848     0
145017    0
167656    0
187509    2
188372    0
         ..
138535    0
326284    2
250325    2
204985    2
232407    0
Name: fire, Length: 28000, dtype: int64

In [41]:
def scores(y_test, y_pred):
    cm  = confusion_matrix(y_test, y_pred)
    acc = accuracy_score(y_test, y_pred)
    TPR = cm[1][1] / sum(cm[1])            # TP/(TP+FP)
    FPR = cm[0][1] / sum(cm[0])            # FP/(FP+TN)
    PRE = cm[1][1] / (cm[0][1] + cm[1][1]) # TP/(TP+FP)
    print(f'Accuracy: {acc:.4f}, TPR: {TPR:.4f}, FPR: {FPR:.4f}, Precision: {PRE:.4f}')
    return acc, TPR, FPR, PRE

## Baseline

In [42]:
cm = confusion_matrix(y_test,[0]*y_test.shape[0])
baseline_acc = y_test.value_counts().values.max() / len(y_test)
baseline_TPR = cm[1][1] / sum(cm[1])
baseline_FPR = cm[0][1] / sum(cm[0])
baseline_PRE = 0 
print(f'Accuracy: {baseline_acc:.4f}, TPR: {baseline_TPR:.4f}, FPR: {baseline_FPR:.4f}, Precision: {baseline_PRE:.4f}')

Accuracy: 0.6503, TPR: 0.0000, FPR: 0.0000, Precision: 0.0000


## CART: Decision Tree Classification

In [43]:
# without cross validation 
dtc = DecisionTreeClassifier(min_samples_leaf=5, min_samples_split=20, 
                             random_state=88)
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
dtc_acc,dtc_TPR,dtc_FPR,dtc_PRE = scores(y_test, y_pred)

Accuracy: 0.8207, TPR: 0.6643, FPR: 0.0076, Precision: 0.6118


In [44]:
# with cross validation 
grid_values = {'ccp_alpha': np.linspace(0, 0.10, 201),
               'min_samples_leaf': [5],
               'min_samples_split': [20],
               'random_state': [88]}

dtc = DecisionTreeClassifier(random_state=88)
dtc_cv = GridSearchCV(dtc, param_grid=grid_values, scoring='accuracy', cv=10) 
dtc_cv.fit(X_train, y_train)
y_pred = dtc_cv.best_estimator_.predict(X_test)
dtc_cv_acc,dtc_cv_TPR,dtc_cv_FPR,dtc_cv_PRE = scores(y_test, y_pred)

Accuracy: 0.8459, TPR: 0.0000, FPR: 0.0000, Precision: nan


  PRE = cm[1][1] / (cm[0][1] + cm[1][1]) # TP/(TP+FP)


## Random Forest

In [45]:
# without cross validation: vanilla bagging
rf = RandomForestClassifier(max_features=X_train.shape[1], min_samples_leaf=5, 
                            n_estimators=500, random_state=88)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

rf_acc,rf_TPR,rf_FPR,rf_PRE = scores(y_test, y_pred)

Accuracy: 0.8568, TPR: 0.4643, FPR: 0.0037, Precision: 0.6915


In [46]:
# with cross validation 
grid_values = {'max_features': np.arange(1, X_train.shape[1]),
               'min_samples_leaf': [5],
               'n_estimators': [500],
               'random_state': [88]}

rf_c = RandomForestClassifier(random_state=88)
rf_cv = GridSearchCV(rf_c, param_grid=grid_values, scoring='accuracy', cv=10)
rf_cv.fit(X_train, y_train)

y_pred = rf_cv.predict(X_test)

rf_max_features = rf_cv.best_params_['max_features']
print(f'Best max_features: {rf_max_features}')
rf_cv_acc,rf_cv_TPR,rf_cv_FPR,rf_cv_PRE = scores(y_test, y_pred)

Best max_features: 6
Accuracy: 0.8564, TPR: 0.3786, FPR: 0.0027, Precision: 0.7162


## LDA

In [47]:
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
y_pred = lda.predict(X_test)
lda_acc,lda_TPR,lda_FPR,lda_PRE = scores(y_test, y_pred)

Accuracy: 0.7645, TPR: 0.0000, FPR: 0.0000, Precision: nan


  PRE = cm[1][1] / (cm[0][1] + cm[1][1]) # TP/(TP+FP)


## Vanilla Bagging

In [48]:
total_features = len(X_train.columns)
bagging = RandomForestClassifier(max_features=X_train.shape[1], min_samples_leaf=5, 
                            n_estimators=500, random_state=88)
bagging.fit(X_train, y_train)
y_pred = bagging.predict(X_test)
bagging_acc,bagging_TPR,bagging_FPR,bagging_PRE = scores(y_test, y_pred)

Accuracy: 0.8568, TPR: 0.4643, FPR: 0.0037, Precision: 0.6915


In [49]:
total_features = len(X_train.columns)
bagging = RandomForestClassifier(max_features = total_features, random_state=88)
bagging.fit(X_train, y_train)
y_pred = bagging.predict(X_test)
bagging_acc,bagging_TPR,bagging_FPR,bagging_PRE = scores(y_test, y_pred)

Accuracy: 0.8503, TPR: 0.5786, FPR: 0.0049, Precision: 0.6807


## Gradient Boosting Classifier

In [50]:
# wihtout cross validation
gbc = GradientBoostingClassifier(n_estimators=3300, max_leaf_nodes=10, random_state=88) 
gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)
gbc_acc,gbc_TPR,gbc_FPR,gbc_PRE = scores(y_test, y_pred)

Accuracy: 0.8327, TPR: 0.4643, FPR: 0.0037, Precision: 0.6915


## Comparison table

In [51]:
#Create Comparison Table
#These lines are provided for you to help construct a comparison table.
#It is not requred to follow this format. + You need to find ACC, TPR, FPR, PRE for each model that you choose.
comparison_data_3 = {'Baseline':[baseline_acc,baseline_TPR,baseline_FPR, baseline_PRE],
                   'Gradient Boosting Classifer':[gbc_acc,gbc_TPR,gbc_FPR, gbc_PRE],
                   'Decision Tree Classifier':[dtc_acc,dtc_TPR,dtc_FPR,dtc_PRE],
                   'Decision Tree Classifier with CV':[dtc_cv_acc,dtc_cv_TPR,dtc_cv_FPR,dtc_cv_PRE],
                   'Random Forest':[rf_acc,rf_TPR, rf_FPR,rf_PRE],
                   'Random Forest with CV':[rf_acc,rf_TPR, rf_FPR,rf_PRE],
                   'Linear Discriminant Analysis':[lda_acc,lda_TPR, lda_FPR,lda_PRE],
                   'Vanilla Bagging': [bagging_acc,bagging_TPR,bagging_FPR,bagging_PRE]}

comparison_table_3 = pd.DataFrame(data=comparison_data_3, index=['Accuracy', 'TPR', 'FPR','PRE']).transpose()
comparison_table_3.style.set_properties(**{'font-size': '12pt',}).set_table_styles([{'selector': 'th', 'props': [('font-size', '10pt')]}])
comparison_table_3

Unnamed: 0,Accuracy,TPR,FPR,PRE
Baseline,0.650333,0.0,0.0,0.0
Gradient Boosting Classifer,0.832667,0.464286,0.003716,0.691489
Decision Tree Classifier,0.82075,0.664286,0.00756,0.611842
Decision Tree Classifier with CV,0.845917,0.0,0.0,
Random Forest,0.856833,0.464286,0.003716,0.691489
Random Forest with CV,0.856833,0.464286,0.003716,0.691489
Linear Discriminant Analysis,0.7645,0.0,0.0,
Vanilla Bagging,0.850333,0.578571,0.004869,0.680672


<hr style="height:5px;border-width:5;color:black;background-color:black">

**In this part of the notebook, we did multiclass classification with a factor = 4. We used the following models: CART, Random Forest, LDA, Vanilla Bagging, and Gradient Boosting Classifier.**

## Import data

In [85]:
#pixel factor is 4
fire_data = pd.DataFrame(pd.read_pickle('data/gridmetmc4.pkl'))
fire_data.columns = pd.read_pickle('data/gridmetColsmc4.pkl')
fire_data.info()
fire_data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202686 entries, 0 to 202685
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   day                202686 non-null  datetime64[ns]
 1   burning_index_g    202686 non-null  object        
 2   relative_humidity  202686 non-null  object        
 3   air_temperature    202686 non-null  object        
 4   wind_speed         202686 non-null  object        
 5   fire               202686 non-null  object        
dtypes: datetime64[ns](1), object(5)
memory usage: 9.3+ MB


Unnamed: 0,day,burning_index_g,relative_humidity,air_temperature,wind_speed,fire
0,2017-01-01,0.0,54.900002,275.100006,4.4,0
1,2017-01-01,0.0,43.799999,277.399994,4.4,0
2,2017-01-01,0.0,42.200001,277.600006,4.7,0
3,2017-01-01,0.0,51.400002,274.799988,5.1,0
4,2017-01-01,0.0,53.900002,274.0,5.4,0
...,...,...,...,...,...,...
202681,2021-12-31,12.0,47.299999,275.799988,2.7,0
202682,2021-12-31,0.0,47.700001,271.700012,2.9,0
202683,2021-12-31,0.0,59.5,269.399994,3.0,0
202684,2021-12-31,12.0,55.200001,273.200012,3.7,0


In [86]:
fire_data['month'] = pd.DatetimeIndex(fire_data['day']).month
fire_data['date'] = pd.DatetimeIndex(fire_data['day']).day
fire_data['year'] = pd.DatetimeIndex(fire_data['day']).year
fire_data.info()
fire_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202686 entries, 0 to 202685
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   day                202686 non-null  datetime64[ns]
 1   burning_index_g    202686 non-null  object        
 2   relative_humidity  202686 non-null  object        
 3   air_temperature    202686 non-null  object        
 4   wind_speed         202686 non-null  object        
 5   fire               202686 non-null  object        
 6   month              202686 non-null  int64         
 7   date               202686 non-null  int64         
 8   year               202686 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(5)
memory usage: 13.9+ MB


Unnamed: 0,day,burning_index_g,relative_humidity,air_temperature,wind_speed,fire,month,date,year
0,2017-01-01,0.0,54.900002,275.100006,4.4,0,1,1,2017
1,2017-01-01,0.0,43.799999,277.399994,4.4,0,1,1,2017
2,2017-01-01,0.0,42.200001,277.600006,4.7,0,1,1,2017
3,2017-01-01,0.0,51.400002,274.799988,5.1,0,1,1,2017
4,2017-01-01,0.0,53.900002,274.0,5.4,0,1,1,2017


In [87]:
fire_data.isnull().sum()

day                  0
burning_index_g      0
relative_humidity    0
air_temperature      0
wind_speed           0
fire                 0
month                0
date                 0
year                 0
dtype: int64

In [88]:
fire_data_int = fire_data.drop(columns = ['fire', 'day'], index=1).apply(pd.to_numeric)
fire_data_int['fire'] = fire_data['fire']
fire_data_int['day'] = fire_data['day']
fire_data_int.info()
fire_data_int.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 202685 entries, 0 to 202685
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   burning_index_g    202685 non-null  float64       
 1   relative_humidity  202685 non-null  float64       
 2   air_temperature    202685 non-null  float64       
 3   wind_speed         202685 non-null  float64       
 4   month              202685 non-null  int64         
 5   date               202685 non-null  int64         
 6   year               202685 non-null  int64         
 7   fire               202685 non-null  object        
 8   day                202685 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(4), int64(3), object(1)
memory usage: 15.5+ MB


Unnamed: 0,burning_index_g,relative_humidity,air_temperature,wind_speed,month,date,year,fire,day
0,0.0,54.900002,275.100006,4.4,1,1,2017,0,2017-01-01
2,0.0,42.200001,277.600006,4.7,1,1,2017,0,2017-01-01
3,0.0,51.400002,274.799988,5.1,1,1,2017,0,2017-01-01
4,0.0,53.900002,274.0,5.4,1,1,2017,0,2017-01-01
5,0.0,54.799999,273.200012,5.9,1,1,2017,0,2017-01-01


## Get Sample (n=40000)

In [89]:
fire_data_int = fire_data_int.sample(n=40000, random_state=88)
fire_data_int.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 24453 to 23227
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   burning_index_g    40000 non-null  float64       
 1   relative_humidity  40000 non-null  float64       
 2   air_temperature    40000 non-null  float64       
 3   wind_speed         40000 non-null  float64       
 4   month              40000 non-null  int64         
 5   date               40000 non-null  int64         
 6   year               40000 non-null  int64         
 7   fire               40000 non-null  object        
 8   day                40000 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(4), int64(3), object(1)
memory usage: 3.1+ MB


### Split into train and test

In [90]:
y = fire_data_int['fire']
X = fire_data_int.drop(columns = ['fire'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=88)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((28000, 8), (12000, 8), (28000,), (12000,))

In [91]:
len(fire_data_int[fire_data_int.fire == 0]), len(fire_data_int[fire_data_int.fire == 1]), len(fire_data_int[fire_data_int.fire == 2])



(24240, 546, 15214)

In [92]:
fire_data_int

Unnamed: 0,burning_index_g,relative_humidity,air_temperature,wind_speed,month,date,year,fire,day
24453,0.0,16.800001,303.899994,2.4,8,9,2017,2,2017-08-09
84707,0.0,76.000000,274.100006,3.5,2,3,2019,0,2019-02-03
185165,0.0,36.100002,293.600006,4.7,7,27,2021,0,2021-07-27
8635,0.0,35.700001,286.399994,4.6,3,19,2017,0,2017-03-19
46815,0.0,74.200005,276.500000,5.8,2,26,2018,0,2018-02-26
...,...,...,...,...,...,...,...,...,...
83055,0.0,68.400002,279.700012,7.6,1,19,2019,0,2019-01-19
197583,21.0,46.500000,284.000000,3.2,11,16,2021,0,2021-11-16
132937,29.0,31.400000,285.700012,2.8,4,12,2020,0,2020-04-12
101611,50.0,13.900001,303.100006,2.0,7,5,2019,2,2019-07-05


In [93]:
season_dict = {1: 'Winter',
               2: 'Winter',
               3: 'Spring', 
               4: 'Spring',
               5: 'Spring',
               6: 'Summer',
               7: 'Summer',
               8: 'Summer',
               9: 'Fall',
               10: 'Fall',
               11: 'Fall',
               12: 'Winter'}
fire_data_season = fire_data_int.copy()
fire_data_season['season'] = fire_data_season['month'].apply(lambda x: season_dict[x])
fire_data_season = pd.get_dummies(fire_data_season, columns=['season'])
fire_data_season

Unnamed: 0,burning_index_g,relative_humidity,air_temperature,wind_speed,month,date,year,fire,day,season_Fall,season_Spring,season_Summer,season_Winter
24453,0.0,16.800001,303.899994,2.4,8,9,2017,2,2017-08-09,0,0,1,0
84707,0.0,76.000000,274.100006,3.5,2,3,2019,0,2019-02-03,0,0,0,1
185165,0.0,36.100002,293.600006,4.7,7,27,2021,0,2021-07-27,0,0,1,0
8635,0.0,35.700001,286.399994,4.6,3,19,2017,0,2017-03-19,0,1,0,0
46815,0.0,74.200005,276.500000,5.8,2,26,2018,0,2018-02-26,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
83055,0.0,68.400002,279.700012,7.6,1,19,2019,0,2019-01-19,0,0,0,1
197583,21.0,46.500000,284.000000,3.2,11,16,2021,0,2021-11-16,1,0,0,0
132937,29.0,31.400000,285.700012,2.8,4,12,2020,0,2020-04-12,0,1,0,0
101611,50.0,13.900001,303.100006,2.0,7,5,2019,2,2019-07-05,0,0,1,0


# Models

First, we can check the VIFs of our variables. They are all below 5, which is great! 

In [94]:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# The dataframe passed to VIF must include the intercept term. We add it the same way we did before.
#first argument: training dataset
#second argument: columns of independent variables as a list that we want to look through
def VIF(df, columns):
    values = sm.add_constant(df[columns]).values
    num_columns = len(columns)+1
    vif = [variance_inflation_factor(values, i) for i in range(num_columns)]
    return pd.Series(vif[1:], index=columns)

cols = ['burning_index_g', 'relative_humidity', 'air_temperature',
       'wind_speed', 'month', 'date', 'year']
VIF(X_train, cols)

burning_index_g      3.072436
relative_humidity    3.419862
air_temperature      2.501815
wind_speed           1.347985
month                1.121446
date                 1.000991
year                 1.016323
dtype: float64

In [95]:
y = fire_data_int['fire']
X = fire_data_int.drop(columns = ['fire', 'day'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=88)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((28000, 7), (12000, 7), (28000,), (12000,))

In [96]:
y_train = y_train.astype('int')
y_test = y_test.astype('int')
y_train

115986    2
17246     2
164471    0
186893    2
191783    2
         ..
180260    0
163950    0
143831    2
79585     0
119044    2
Name: fire, Length: 28000, dtype: int64

In [97]:
def scores(y_test, y_pred):
    cm  = confusion_matrix(y_test, y_pred)
    acc = accuracy_score(y_test, y_pred)
    TPR = cm[1][1] / sum(cm[1])            # TP/(TP+FP)
    FPR = cm[0][1] / sum(cm[0])            # FP/(FP+TN)
    PRE = cm[1][1] / (cm[0][1] + cm[1][1]) # TP/(TP+FP)
    print(f'Accuracy: {acc:.4f}, TPR: {TPR:.4f}, FPR: {FPR:.4f}, Precision: {PRE:.4f}')
    return acc, TPR, FPR, PRE

## Baseline

In [98]:
cm = confusion_matrix(y_test,[0]*y_test.shape[0])
baseline_acc = y_test.value_counts().values.max() / len(y_test)
baseline_TPR = cm[1][1] / sum(cm[1])
baseline_FPR = cm[0][1] / sum(cm[0])
baseline_PRE = 0 
print(f'Accuracy: {baseline_acc:.4f}, TPR: {baseline_TPR:.4f}, FPR: {baseline_FPR:.4f}, Precision: {baseline_PRE:.4f}')

Accuracy: 0.6037, TPR: 0.0000, FPR: 0.0000, Precision: 0.0000


## CART: Decision Tree Classification

In [99]:
# without cross validation 
dtc = DecisionTreeClassifier(min_samples_leaf=5, min_samples_split=20, 
                             random_state=88)
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
dtc_acc,dtc_TPR,dtc_FPR,dtc_PRE = scores(y_test, y_pred)

Accuracy: 0.8827, TPR: 0.7785, FPR: 0.0064, Precision: 0.7160


In [100]:
# with cross validation 
grid_values = {'ccp_alpha': np.linspace(0, 0.10, 201),
               'min_samples_leaf': [5],
               'min_samples_split': [20],
               'random_state': [88]}

dtc = DecisionTreeClassifier(random_state=88)
dtc_cv = GridSearchCV(dtc, param_grid=grid_values, scoring='accuracy', cv=10) 
dtc_cv.fit(X_train, y_train)
y_pred = dtc_cv.best_estimator_.predict(X_test)
dtc_cv_acc,dtc_cv_TPR,dtc_cv_FPR,dtc_cv_PRE = scores(y_test, y_pred)

Accuracy: 0.8898, TPR: 0.0268, FPR: 0.0021, Precision: 0.2105


## Random Forest

In [101]:
# without cross validation: vanilla bagging
rf = RandomForestClassifier(max_features=X_train.shape[1], min_samples_leaf=5, 
                            n_estimators=500, random_state=88)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

rf_acc,rf_TPR,rf_FPR,rf_PRE = scores(y_test, y_pred)

Accuracy: 0.9093, TPR: 0.6242, FPR: 0.0026, Precision: 0.8304


In [102]:
# with cross validation 
grid_values = {'max_features': np.arange(1, X_train.shape[1]),
               'min_samples_leaf': [5],
               'n_estimators': [500],
               'random_state': [88]}

rf_c = RandomForestClassifier(random_state=88)
rf_cv = GridSearchCV(rf_c, param_grid=grid_values, scoring='accuracy', cv=10)
rf_cv.fit(X_train, y_train)

y_pred = rf_cv.predict(X_test)

rf_max_features = rf_cv.best_params_['max_features']
print(f'Best max_features: {rf_max_features}')
rf_cv_acc,rf_cv_TPR,rf_cv_FPR,rf_cv_PRE = scores(y_test, y_pred)

Best max_features: 6
Accuracy: 0.9086, TPR: 0.5638, FPR: 0.0023, Precision: 0.8317


## LDA

In [103]:
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
y_pred = lda.predict(X_test)
lda_acc,lda_TPR,lda_FPR,lda_PRE = scores(y_test, y_pred)

Accuracy: 0.7631, TPR: 0.0000, FPR: 0.0000, Precision: nan


  PRE = cm[1][1] / (cm[0][1] + cm[1][1]) # TP/(TP+FP)


## Vanilla Bagging

In [104]:
total_features = len(X_train.columns)
bagging = RandomForestClassifier(max_features=X_train.shape[1], min_samples_leaf=5, 
                            n_estimators=500, random_state=88)
bagging.fit(X_train, y_train)
y_pred = bagging.predict(X_test)
bagging_acc,bagging_TPR,bagging_FPR,bagging_PRE = scores(y_test, y_pred)

Accuracy: 0.9093, TPR: 0.6242, FPR: 0.0026, Precision: 0.8304


In [105]:
total_features = len(X_train.columns)
bagging = RandomForestClassifier(max_features = total_features, random_state=88)
bagging.fit(X_train, y_train)
y_pred = bagging.predict(X_test)
bagging_acc,bagging_TPR,bagging_FPR,bagging_PRE = scores(y_test, y_pred)

Accuracy: 0.9063, TPR: 0.7450, FPR: 0.0035, Precision: 0.8162


## Gradient Boosting Classifier

In [106]:
# wihtout cross validation
gbc = GradientBoostingClassifier(n_estimators=3300, max_leaf_nodes=10, random_state=88) 
gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)
gbc_acc,gbc_TPR,gbc_FPR,gbc_PRE = scores(y_test, y_pred)

Accuracy: 0.8890, TPR: 0.5302, FPR: 0.0032, Precision: 0.7745


## Comparison table

In [107]:
#Create Comparison Table
#These lines are provided for you to help construct a comparison table.
#It is not requred to follow this format. + You need to find ACC, TPR, FPR, PRE for each model that you choose.
comparison_data_4 = {'Baseline':[baseline_acc,baseline_TPR,baseline_FPR, baseline_PRE],
                   'Gradient Boosting Classifer':[gbc_acc,gbc_TPR,gbc_FPR, gbc_PRE],
                   'Decision Tree Classifier':[dtc_acc,dtc_TPR,dtc_FPR,dtc_PRE],
                   'Decision Tree Classifier with CV':[dtc_cv_acc,dtc_cv_TPR,dtc_cv_FPR,dtc_cv_PRE],
                   'Random Forest':[rf_acc,rf_TPR, rf_FPR,rf_PRE],
                   'Random Forest with CV':[rf_acc,rf_TPR, rf_FPR,rf_PRE],
                   'Linear Discriminant Analysis':[lda_acc,lda_TPR, lda_FPR,lda_PRE],
                   'Vanilla Bagging': [bagging_acc,bagging_TPR,bagging_FPR,bagging_PRE]}

comparison_table_4 = pd.DataFrame(data=comparison_data_4, index=['Accuracy', 'TPR', 'FPR','PRE']).transpose()
comparison_table_4.style.set_properties(**{'font-size': '12pt',}).set_table_styles([{'selector': 'th', 'props': [('font-size', '10pt')]}])
comparison_table_4

Unnamed: 0,Accuracy,TPR,FPR,PRE
Baseline,0.603667,0.0,0.0,0.0
Gradient Boosting Classifer,0.889,0.530201,0.003175,0.77451
Decision Tree Classifier,0.882667,0.778523,0.00635,0.716049
Decision Tree Classifier with CV,0.88975,0.026846,0.002071,0.210526
Random Forest,0.909333,0.624161,0.002623,0.830357
Random Forest with CV,0.909333,0.624161,0.002623,0.830357
Linear Discriminant Analysis,0.763083,0.0,0.0,
Vanilla Bagging,0.906333,0.744966,0.003451,0.816176


<hr style="height:5px;border-width:5;color:black;background-color:black">

Let's look at all 3 output tables when our pixel is a factor of 3, 4, and 5 respectively: 

In [108]:
comparison_table_3

Unnamed: 0,Accuracy,TPR,FPR,PRE
Baseline,0.650333,0.0,0.0,0.0
Gradient Boosting Classifer,0.832667,0.464286,0.003716,0.691489
Decision Tree Classifier,0.82075,0.664286,0.00756,0.611842
Decision Tree Classifier with CV,0.845917,0.0,0.0,
Random Forest,0.856833,0.464286,0.003716,0.691489
Random Forest with CV,0.856833,0.464286,0.003716,0.691489
Linear Discriminant Analysis,0.7645,0.0,0.0,
Vanilla Bagging,0.850333,0.578571,0.004869,0.680672


In [109]:
comparison_table_4

Unnamed: 0,Accuracy,TPR,FPR,PRE
Baseline,0.603667,0.0,0.0,0.0
Gradient Boosting Classifer,0.889,0.530201,0.003175,0.77451
Decision Tree Classifier,0.882667,0.778523,0.00635,0.716049
Decision Tree Classifier with CV,0.88975,0.026846,0.002071,0.210526
Random Forest,0.909333,0.624161,0.002623,0.830357
Random Forest with CV,0.909333,0.624161,0.002623,0.830357
Linear Discriminant Analysis,0.763083,0.0,0.0,
Vanilla Bagging,0.906333,0.744966,0.003451,0.816176


In [110]:
comparison_table_5

Unnamed: 0,Accuracy,TPR,FPR,PRE
Baseline,0.591833,0.0,0.0,0.0
Gradient Boosting Classifer,0.899583,0.488235,0.002534,0.821782
Decision Tree Classifier,0.889083,0.635294,0.005069,0.75
Decision Tree Classifier with CV,0.899333,0.094118,0.00183,0.551724
Random Forest,0.91625,0.541176,0.002816,0.821429
Random Forest with CV,0.91625,0.541176,0.002816,0.821429
Linear Discriminant Analysis,0.7675,0.0,0.0,
Vanilla Bagging,0.914417,0.670588,0.003239,0.832117
