# Forest Cover Type - Prediction

Hi! Thanks for checking out this notebook. We'll be working on the Forest Cover Type dataset, which contains tree observations from four areas of the Roosevelt National Forest in Colorado.

In this notebook we'll try to predict the forest cover type from the cartographic variables the dataset provides, comparing a deep neural network (DNN) to a random forest model.


In [2]:
%load_ext tensorboard

import tensorflow as tf
import datetime
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

!rm -rf ./logs/ 

Let's import the data and format the target so it ranges from 0 to 6 rather than from 1 to 7.

In [3]:
data = pd.read_csv('../input/covtype.csv')

data.Cover_Type = data.Cover_Type - 1
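The reason for the shift is that sparse categorical cross-entropy, the loss used later in this notebook, treats each integer label as an index into the vector of predicted class probabilities. The following is a minimal NumPy sketch of that idea, not Keras' actual implementation; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability of the true class (illustrative sketch).

    probs:  (n_samples, n_classes) array of predicted class probabilities.
    labels: (n_samples,) array of integer class ids in [0, n_classes).
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_classes = probs.shape[1]
    if labels.min() < 0 or labels.max() >= n_classes:
        raise ValueError("labels must lie in [0, n_classes)")
    # Each label is used directly as a column index into probs.
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(-np.log(true_class_probs).mean())

# Toy check with 7 classes and uniform predictions (made-up data).
uniform = np.full((3, 7), 1 / 7)
raw_labels = np.array([1, 4, 7])   # original 1..7 encoding
shifted_labels = raw_labels - 1    # 0..6 encoding used in this notebook
loss = sparse_categorical_crossentropy(uniform, shifted_labels)
print(loss)
```

With the original 1 to 7 encoding, label 7 would point past the last of the 7 output columns, which is why this sketch rejects `raw_labels`.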

Now let's do the train-test split with `stratify`, so that both subsets keep the same class proportions as the original dataset. Also, there is no need to drop one of the one-hot encoded categories, since neither of the models we'll use is linear.

In [4]:
from sklearn.model_selection import train_test_split

dt = data.copy(deep=True)

X = dt[[col for col in data.columns if col != 'Cover_Type']]
y = dt[['Cover_Type']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, 
                                                    random_state=42, stratify=y)
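To see what `stratify` buys us, here is a small sketch on made-up, imbalanced labels (the class proportions, array sizes and helper name are assumptions chosen for illustration, not values from the dataset). It compares the class proportions of each subset with those of the full label vector.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Hypothetical imbalanced labels: roughly 80% / 15% / 5%.
y_toy = rng.choice(3, size=2000, p=[0.80, 0.15, 0.05])
X_toy = rng.normal(size=(2000, 4))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy)

def class_proportions(labels, n_classes=3):
    """Fraction of samples belonging to each class."""
    return np.bincount(labels, minlength=n_classes) / len(labels)

full_props = class_proportions(y_toy)
train_props = class_proportions(y_tr)
test_props = class_proportions(y_te)

# Largest deviation of either subset from the full-data proportions.
max_gap = max(np.abs(train_props - full_props).max(),
              np.abs(test_props - full_props).max())
print(full_props, train_props, test_props, max_gap)
```

The same check can be run on `y_train` and `y_test` from the cell above to confirm the split behaves as intended.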


The next step is to scale the numerical features (the first 10 columns), since they are on very different scales. The outliers are neither numerous nor extreme, so standard scaling should work fine. Note that the scaler is fit on the training data only, so that no information from the test set leaks into the model:

In [5]:
from sklearn.preprocessing import StandardScaler

X_train_num = X_train.iloc[:,:10]
X_test_num = X_test.iloc[:,:10]

std_scaler = StandardScaler()
std_scaler.fit(X_train_num)

X_train_num = pd.DataFrame(std_scaler.transform(X_train_num), 
                        columns=X_train_num.columns,
                        index=X_train.index)

X_train = pd.concat([X_train_num, X_train.iloc[:,10:]], axis=1)

X_test_num = pd.DataFrame(std_scaler.transform(X_test_num), 
                        columns=X_test_num.columns,
                        index=X_test_num.index)

X_test = pd.concat([X_test_num, X_test.iloc[:,10:]], axis=1)
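For intuition, the fit/transform pattern above boils down to the following NumPy sketch. The toy features are made up for illustration; the one detail worth noting is that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical features on very different scales.
train = rng.normal(loc=[3000.0, 15.0], scale=[300.0, 7.0], size=(1000, 2))
test = rng.normal(loc=[3000.0, 15.0], scale=[300.0, 7.0], size=(250, 2))

# Statistics are estimated on the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0)

# The same training statistics are applied to both splits.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```

The scaled training columns have mean 0 and standard deviation 1 by construction, while the test columns are only approximately standardized. That small mismatch is expected and is exactly what avoiding leakage looks like.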

TensorBoard setup:

In [8]:
log_dir="logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

First we'll try the deep neural network, with dropout for regularization. We define the model's architecture, then compile and train it:

In [6]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(X_train.shape[1], )),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(7, activation='softmax')
])
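As a rough sense of the model's size, the sketch below counts the trainable parameters of the dense layers (dropout layers add none). It assumes 54 input features, which is the usual width of this dataset (10 numerical columns plus the one-hot columns) but is not stated explicitly in this notebook; `model.summary()` gives the authoritative numbers.

```python
# Units of each Dense layer, in order, as defined in the model above.
layer_units = [64, 64, 32, 16, 7]
n_features = 54  # assumption: 10 numerical + 44 one-hot columns

def dense_param_counts(n_inputs, units):
    """Weights plus biases for a stack of fully connected layers."""
    counts = []
    for n_out in units:
        # n_inputs * n_out weights, plus n_out biases.
        counts.append((n_inputs + 1) * n_out)
        n_inputs = n_out
    return counts

counts = dense_param_counts(n_features, layer_units)
total_params = sum(counts)
print(counts, total_params)
```

Under that assumption the network has 10,407 trainable parameters, so it is a fairly small model for a training set of over 400,000 rows.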

In [7]:
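# With integer labels and a sparse loss, 'accuracy' resolves to sparse
# categorical accuracy. 'categorical_accuracy' expects one-hot targets,
# so it is dropped here to avoid reporting a misleading number.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])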

In [10]:
model.fit(x=X_train.values, 
          y=y_train.values.ravel(), 
          epochs=25, 
          validation_data=(X_test.values, y_test.values.ravel()), 
          callbacks=[tensorboard_callback])

Train on 435759 samples, validate on 145253 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<tensorflow.python.keras.callbacks.History at 0x7f6e690aded0>

#### Using the DNN and training for 25 epochs, we achieve an accuracy close to 88% on the validation set.

Now let's try a random forest for comparison's sake:

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

rfc = RandomForestClassifier(n_estimators = 10, 
                             criterion = 'entropy', 
                             random_state = 42,
                             class_weight = 'balanced')

rfc.fit(X_train, y_train.values.ravel())
rfc_predictions = rfc.predict(X_test) 

bal_accuracy = balanced_accuracy_score(y_test, rfc_predictions)
print(f'Balanced accuracy: {bal_accuracy}')

cm = confusion_matrix(y_test, rfc_predictions) 
print(f'\n Confusion Matrix: \n {cm}')

Balanced accuracy: 0.8811119312939465

 Confusion Matrix: 
 [[50198  2641     1     0     8     1   111]
 [ 2815 67651   165     1    85    88    20]
 [    1   204  8476    42     6   209     0]
 [    0     3   108   550     0    26     0]
 [   45   504    27     0  1788     9     0]
 [   11   184   508    26     3  3610     0]
 [  320    32     0     0     2     0  4774]]


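#### Once again we land close to 88%, this time in balanced accuracy, and the random forest got there in a fraction of the training time.

Note that the two figures are not the same metric: the DNN's 88% is plain accuracy, while balanced accuracy is the mean of the per-class recalls. From the confusion matrix above, the random forest's plain accuracy is about 94% (137,047 correct out of 145,253), so on a like-for-like basis it outperforms the DNN on this split.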
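To make the relationship between the confusion matrix and the reported score explicit, the sketch below recomputes the random forest's metrics from the matrix printed above (rows are true classes, columns are predictions, following scikit-learn's convention).

```python
import numpy as np

# Confusion matrix copied from the random forest output above.
cm = np.array([
    [50198,  2641,     1,     0,     8,     1,   111],
    [ 2815, 67651,   165,     1,    85,    88,    20],
    [    1,   204,  8476,    42,     6,   209,     0],
    [    0,     3,   108,   550,     0,    26,     0],
    [   45,   504,    27,     0,  1788,     9,     0],
    [   11,   184,   508,    26,     3,  3610,     0],
    [  320,    32,     0,     0,     2,     0,  4774],
])

# Recall per class: correct predictions divided by the true class size.
per_class_recall = cm.diagonal() / cm.sum(axis=1)

# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_accuracy = per_class_recall.mean()

# Plain accuracy weights every sample equally instead.
overall_accuracy = cm.diagonal().sum() / cm.sum()

print(np.round(per_class_recall, 3))
print(balanced_accuracy, overall_accuracy)
```

The per-class recalls show where the score is lost: the three smallest classes (indices 3, 4 and 5) sit well below the two dominant ones, which pulls the balanced accuracy down even though most samples are classified correctly.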

It seems that the heavy class imbalance is hurting our predictive models, particularly on the rarer classes. With this in mind, synthetic data generation seems like a reasonable way to further increase the accuracy of our models, as does trying out other classification algorithms. For the time being I will leave the notebook as it is, and maybe I will extend it in the future.
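One lighter-weight option to try before generating synthetic data is class weighting, which the random forest above already uses through `class_weight='balanced'`. That heuristic assigns each class the weight `n_samples / (n_classes * class_count)`. The sketch below applies it to the test-set class sizes (the row sums of the confusion matrix), purely for illustration; in practice the weights would be computed from the training labels.

```python
import numpy as np

# Test-set class sizes, taken from the row sums of the confusion matrix.
# Illustrative only: real weights should come from the training labels.
class_counts = np.array([52960, 70825, 8938, 687, 2373, 4342, 5128])

n_samples = class_counts.sum()
n_classes = len(class_counts)

# 'balanced' heuristic: rarer classes receive proportionally larger weights.
weights = n_samples / (n_classes * class_counts)

# Dictionary form, mapping class id to weight.
class_weight = {int(c): float(w) for c, w in enumerate(weights)}
print(class_weight)
```

A dictionary of this shape can be passed to Keras through the `class_weight` argument of `model.fit`, which would give the DNN the same kind of rebalancing the random forest received. Whether it actually improves results here is untested.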