# Forest Cover

Codecademy Exercise: Deep Learning Portfolio Project

Forest Cover Classification

## Preprocess and explore the dataset

Load the dataset, preprocess it, and conduct some exploratory data analysis to understand the data.

| Name                               | Data Type    | Measurement                 | Description                                   |
| ---                                | ---          | ---                         | ---                                           |
| Elevation                          | quantitative | meters                      | Elevation in meters                           |
| Aspect                             | quantitative | azimuth                     | Aspect in degrees azimuth                     |
| Slope                              | quantitative | degrees                     | Slope in degrees                              |
| Horizontal_Distance_To_Hydrology   | quantitative | meters                      | Horz Dist to nearest surface water features   |
| Vertical_Distance_To_Hydrology     | quantitative | meters                      | Vert Dist to nearest surface water features   |
| Horizontal_Distance_To_Roadways    | quantitative | meters                      | Horz Dist to nearest roadway                  |
| Hillshade_9am                      | quantitative | 0 to 255 index              | Hillshade index at 9am, summer solstice       |
| Hillshade_Noon                     | quantitative | 0 to 255 index              | Hillshade index at noon, summer solstice      |
| Hillshade_3pm                      | quantitative | 0 to 255 index              | Hillshade index at 3pm, summer solstice       |
| Horizontal_Distance_To_Fire_Points | quantitative | meters                      | Horz Dist to nearest wildfire ignition points |
| Wilderness_Area (4 binary columns) | qualitative  | 0 (absence) or 1 (presence) | Wilderness area designation                   |
| Soil_Type (40 binary columns)      | qualitative  | 0 (absence) or 1 (presence) | Soil Type designation                         |
| Cover_Type (7 types)               | integer      | 1 to 7                      | Forest Cover Type designation                 |

The seven cover types, listed in the order of their integer codes (1 to 7), are:
1. Spruce/Fir
2. Lodgepole Pine
3. Ponderosa Pine
4. Cottonwood/Willow
5. Aspen
6. Douglas-fir
7. Krummholz
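
Since the preprocessing cell below shifts `Cover_Type` from the range 1 to 7 down to 0 to 6, a small lookup can make predictions easier to read. The following is a minimal sketch, assuming the list above follows the dataset's integer codes in order; `COVER_TYPE_NAMES` and `cover_type_name` are illustrative names that do not appear in the notebook.

```python
# Cover type names in code order (assumed to match codes 1..7 in the dataset).
COVER_TYPE_NAMES = [
    'Spruce/Fir',
    'Lodgepole Pine',
    'Ponderosa Pine',
    'Cottonwood/Willow',
    'Aspen',
    'Douglas-fir',
    'Krummholz',
]


def cover_type_name(zero_based_label: int) -> str:
    """Return the cover type name for a label in the shifted range [0, 6]."""
    if not 0 <= zero_based_label < len(COVER_TYPE_NAMES):
        raise ValueError(f'label out of range: {zero_based_label}')
    return COVER_TYPE_NAMES[zero_based_label]


print(cover_type_name(0))  # Spruce/Fir
print(cover_type_name(6))  # Krummholz
```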

In [None]:
import pandas as pd
from collections import Counter

# Load the data into pandas
data = pd.read_csv('forest_cover.csv')

# pd.set_option('mode.chained_assignment', None) # suppress warning message

# Convert wilderness area columns to bool
boolean_column_names = data.filter(like='Wilderness_Area').columns.tolist()
data[boolean_column_names] = data[boolean_column_names].astype('bool')

# Convert soil type columns to bool
boolean_column_names = data.filter(like='Soil_Type').columns.tolist()
data[boolean_column_names] = data[boolean_column_names].astype('bool')

# convert cover type to range [0, 6]
data['Cover_Type'] = data['Cover_Type'] - 1

# print columns and their respective types
print('Data Columns and Types')
data.info() # info() prints directly and returns None, so no print() is needed

# print the class distribution
print('\nClass Distribution')
print(Counter(data['Cover_Type']))

# print the first five entries in the dataset and the summary stats
# print('\nDataset')
# print(data.head(5))
# print('\nSummary Stats')
# print(data.describe())
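
The raw counts from `Counter` can be hard to compare at a glance. As a rough sketch, the class shares and the accuracy of always predicting the most frequent class could be computed as shown here; `toy_labels` is a made-up stand-in for `data['Cover_Type']`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical zero-based labels standing in for data['Cover_Type'].
toy_labels = pd.Series([0, 1, 1, 1, 2, 1, 0, 1, 6, 1])

# Share of each class, ordered by class label.
class_share = toy_labels.value_counts(normalize=True).sort_index()
print(class_share)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = class_share.max()
print(f'{majority_baseline=:.2f}')
```

A model's test accuracy is only informative to the extent that it exceeds this baseline.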

## Build and train the model

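Split the dataset into training and test sets, standardize the numerical features, and one-hot encode the labels.  
Then create and train the model using TensorFlow with Keras. The cells below hold out a test set only; no separate validation set is created.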

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

import tensorflow as tf

# split the data into labels and features
labels = data.iloc[:, -1] # select the last column
features = data.iloc[:, 0:-1] # select all columns except the last

# split the data into a training set and a test set
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.30, random_state=42)

# standardize the numerical features
numerical_features = features.select_dtypes(include=['float64', 'int64'])
numerical_columns = numerical_features.columns
ct = ColumnTransformer([('numeric', StandardScaler(), numerical_columns)], remainder='passthrough')
features_train_scaled = ct.fit_transform(features_train)
features_test_scaled = ct.transform(features_test)

# convert the integer encoded labels into binary vectors
labels_train = tf.keras.utils.to_categorical(labels_train, dtype='int64')
labels_test = tf.keras.utils.to_categorical(labels_test, dtype='int64')
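
If a validation set is wanted in addition to the test set, one option is to call `train_test_split` twice. The sketch below uses synthetic arrays (`toy_features`, `toy_labels`) in place of the real data, and the 20% validation fraction and the use of `stratify` are assumptions, not choices made in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the notebook's features and labels.
rng = np.random.default_rng(42)
toy_features = rng.normal(size=(1000, 5))
toy_labels = np.arange(1000) % 7  # seven classes, roughly balanced

# First hold out 30% as the test set, preserving class proportions.
x_rest, x_test, y_rest, y_test = train_test_split(
    toy_features, toy_labels, test_size=0.30,
    stratify=toy_labels, random_state=42)

# Then carve a validation set out of the remaining 70% (20% of it here).
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.20,
    stratify=y_rest, random_state=42)

print(len(x_train), len(x_val), len(x_test))
```

The validation features would need the same `ct.transform` step and the validation labels the same `to_categorical` step as the test set before being passed to `model.fit` through its `validation_data` argument.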

In [None]:
# build the model
num_features = features_train.shape[1]
num_classes = 7

print(f'{num_features=}')
print(f'{num_classes=}')

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.InputLayer(input_shape=(num_features,))) # input layer; the shape must be a tuple
model.add(tf.keras.layers.Dense(32, activation='relu')) # hidden layer
model.add(tf.keras.layers.Dense(16, activation='relu')) # hidden layer
model.add(tf.keras.layers.Dense(8, activation='relu')) # hidden layer
model.add(tf.keras.layers.Dense(num_classes, activation='softmax')) # output layer

model.summary()

# initialize the Adam optimizer
opt = tf.keras.optimizers.Adam(learning_rate=0.01)

# compile the model
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
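
The parameter counts reported by `model.summary()` can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 54 input features, which is what the column table implies (10 quantitative columns, 4 wilderness area columns, 40 soil type columns); `dense_param_count` is an illustrative helper, not a Keras function.

```python
# Layer widths: input features, three hidden layers, output classes.
# 54 inputs is an assumption based on the column table (10 + 4 + 40).
layer_sizes = [54, 32, 16, 8, 7]


def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


per_layer = [dense_param_count(a, b)
             for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer)       # [1760, 528, 136, 63]
print(sum(per_layer))  # 2487
```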

In [None]:
# train the model
model.fit(x=features_train_scaled, y=labels_train, epochs=20, batch_size=128, verbose=1)

In [None]:
import numpy as np
from sklearn.metrics import classification_report

# evaluate on the test set: per-class precision, recall, and F1-score
y_estimate = model.predict(features_test_scaled)
y_estimate = np.argmax(y_estimate, axis=1)
y_true = np.argmax(labels_test, axis=1)
print(classification_report(y_true, y_estimate))
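
A confusion matrix can complement the classification report by showing which classes get mistaken for which. This is a minimal sketch on made-up labels (`toy_true`, `toy_pred`); in the notebook, `y_true` and `y_estimate` would take their place.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes; not results from the model above.
toy_true = np.array([0, 0, 1, 1, 2])
toy_pred = np.array([0, 1, 1, 1, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(toy_true, toy_pred)
print(cm)
```

Off-diagonal entries count misclassifications; here, one sample of class 0 was predicted as class 1.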