# PREDICTING FOREST TYPE 
## Final Project, Python for Data Science 
### Willamette University MSDS 
by Charles Hanks, Carter McMahon, & Cleighton Roberts

[Dataset](https://www.kaggle.com/datasets/uciml/forest-cover-type-dataset)

# TASK 4 : IMPLEMENTATION & EVALUATION 

## Loading Libraries: 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

## Loading Dataset

In [2]:
ds = pd.read_csv('/Users/chanks/workspace/forest/data/covtype.csv')

### Data Preprocessing 

In [3]:
scale = StandardScaler()

ds.columns.get_loc('Cover_Type') # finding index of dependent variable 

numerical_features = ds.iloc[:,0:10] # subsetting numerical features we want to scale
scaled_num_features =  pd.DataFrame(scale.fit_transform(numerical_features), columns = numerical_features.columns)

ds = pd.concat([scaled_num_features, ds.iloc[:,10:]], axis = 1) #combining scaled numerical features with other variables

### Train/Test Split 

In [4]:
# Subsetting features
X = ds.drop(ds.columns[-1], axis = 1) # feature subset that excludes the 'Cover_Type' target variable (last column)
y = ds[ds.columns[-1]] # target variable 'Cover_Type'

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 650, test_size = 0.3)
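
In the preprocessing cell, the scaler is fit on the full dataset before the split, so the test rows contribute to the scaling statistics. The sketch below shows one way to fit the scaler on the training rows only. It runs on a small hypothetical stand-in frame (the `demo` data, the two `Soil_Type` columns, and all `_tr`/`_te` names are assumptions for illustration, not the project's data).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for covtype.csv: 3 continuous columns, 2 binary columns, 1 target
rng = np.random.default_rng(650)
n_rows = 200
demo = pd.DataFrame({
    "Elevation": rng.normal(2900, 280, n_rows),
    "Aspect": rng.uniform(0, 360, n_rows),
    "Slope": rng.normal(14, 7, n_rows),
    "Soil_Type1": rng.integers(0, 2, n_rows),
    "Soil_Type2": rng.integers(0, 2, n_rows),
    "Cover_Type": rng.integers(1, 8, n_rows),
})

X_demo = demo.drop(columns="Cover_Type")
y_demo = demo["Cover_Type"]

# Split first, so the test rows never influence the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=650)

# Scale only the continuous columns; binary indicator columns pass through unchanged
num_cols = ["Elevation", "Aspect", "Slope"]
scaler = ColumnTransformer([("num", StandardScaler(), num_cols)], remainder="passthrough")

X_tr_scaled = scaler.fit_transform(X_tr)  # mean and std estimated from training rows only
X_te_scaled = scaler.transform(X_te)      # the same statistics are reused for the test rows

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With a dataset as large as this one the effect on accuracy is likely small, but fitting on the training rows keeps the test score an honest estimate.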

### Task 4.1 : Random Forest Classifier 
Implement a random forest classifier using scikit-learn. Create a subsection in your report describing the chosen parameters of the classifier and training method, and how they were set. Also, report the accuracy of the resulting method.

In [None]:
rf = RandomForestClassifier(random_state=650)
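
The cell above only constructs the classifier. A minimal sketch of the remaining steps (fit, predict, score) follows, run on synthetic stand-in data from `make_classification` so that it finishes in seconds. The synthetic data and the `_demo`, `_tr`, `_te` names are assumptions for illustration; the resulting accuracy says nothing about the forest cover data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 7 classes, mirroring the 7 cover types (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=12, n_informative=6,
    n_classes=7, n_clusters_per_class=1, random_state=650,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=650)

# Same constructor call as in the notebook cell: default hyperparameters, fixed seed
rf_demo = RandomForestClassifier(random_state=650)
rf_demo.fit(X_tr, y_tr)

# Score on the held-out rows; accuracy_score takes (y_true, y_pred)
rf_pred = rf_demo.predict(X_te)
rf_acc = accuracy_score(y_te, rf_pred)
print(f"random forest accuracy on synthetic data: {rf_acc:.3f}")
```

In the notebook itself, the same three calls would use `X_train`, `y_train`, `X_test`, and `y_test` from the split above.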

### Task 4.2 : Support Vector Machine 
Modify your solution in Task 4.1 to use a Support Vector Machine (SVC) classifier using scikit-learn with the appropriate parameters including kernel = 'linear'. Report the accuracy of the resulting method. Compare the results of the SVM model with the random forest model implemented in task 4.1. You may also justify the other parameters used or tuned in both models and how that affects the results.

In [None]:
#model_svc = SVC(kernel = 'linear')
#model_svc.fit(X_train, y_train)

The training attempt above takes an exceedingly long time, because the fit time of SVC scales at least quadratically with the number of samples. Per the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html):

*"For large datasets consider using LinearSVC or SGDClassifier instead, possibly after a Nystroem transformer or other Kernel Approximation"*

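Our dataset has 581,012 observations. We will proceed with the LinearSVC classifier instead. Note that LinearSVC is not identical to SVC with kernel = 'linear': it uses the liblinear solver, a squared hinge loss by default, and a one-vs-rest scheme for multiclass problems, so results may differ slightly.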

In [8]:
from sklearn.svm import LinearSVC

model_lsvc = LinearSVC(random_state = 650)
model_lsvc.fit(X_train,y_train) #model took about 6 minutes to fit, that is much more efficient than SVC(). 



In [9]:
#predicting forest type on test data using first linear SVC model:  
lsvc_predict = model_lsvc.predict(X_test)

#print accuracy score: 
print(accuracy_score(y_test, lsvc_predict)) # accuracy_score expects (y_true, y_pred)

0.7115958325683863


### SVC round 2: adding 5-fold Cross Validation, a tuning grid & PCA 

#### 5-fold Cross Validation: 

In [16]:
cv_scores = cross_val_score(model_lsvc, X, y, cv=5) # returns an array of 5 accuracy scores, one per fold (a clone of the model is refit on each fold)
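
`cross_val_score` returns one score per fold rather than a reusable object, so the usual next step is to summarize the array. A minimal sketch on synthetic stand-in data (the data and the `_cv` names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Small synthetic stand-in so the 5 refits finish quickly (assumption for illustration)
X_cv, y_cv = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=650,
)

model_cv = LinearSVC(random_state=650, max_iter=5000)

# One accuracy score per fold; the estimator is cloned and refit for each fold
scores_cv = cross_val_score(model_cv, X_cv, y_cv, cv=5)

print("fold accuracies:", scores_cv.round(3))
print(f"mean = {scores_cv.mean():.3f}, std = {scores_cv.std():.3f}")
```

The mean gives a more stable accuracy estimate than a single train/test split, and the standard deviation indicates how much that estimate varies across folds.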


#### Principal Component Analysis: 

In [10]:
from sklearn.decomposition import PCA

pca = PCA(n_components = 15) #set baseline principal components to k = 15
pca_features = pca.fit_transform(X)
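
Before refitting on the reduced features, it can help to check how much variance 15 components retain. The sketch below inspects `explained_variance_ratio_` on synthetic stand-in data, and also shows that passing a fraction as `n_components` lets PCA choose the number of components. The data, the 0.95 threshold, and the `_pca`/`_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in feature matrix with 20 columns (assumption for illustration)
X_pca, _ = make_classification(
    n_samples=500, n_features=20, n_informative=8, random_state=650,
)

# Fixed k = 15, as in the notebook cell
pca_demo = PCA(n_components=15).fit(X_pca)
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)
print(f"variance retained by 15 components: {cum_var[-1]:.3f}")

# Alternative: keep the smallest number of components explaining at least 95% of the variance
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_pca)
print("components needed for 95% variance:", pca_95.n_components_)
```

If the retained variance is well below 1, some of the accuracy lost after PCA may simply reflect discarded signal.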



In [12]:
X_train, X_test, y_train, y_test = train_test_split(pca_features, y, test_size=0.3, random_state=650)

In [13]:
model_lsvc.fit(X_train,y_train)



In [14]:
#predicting forest type on test data using the PCA features: 
lsvc_predict_pca = model_lsvc.predict(X_test)

#print accuracy score: 
print(accuracy_score(y_test, lsvc_predict_pca)) # results are worse than including all features....

0.6749529557554618


#### Hyperparameter Tuning Grid:
We are going to use a tuning grid to find the best values for the following LinearSVC hyperparameters: 
1. C : penalty for misclassifying a row (larger values regularize less) 

2. max_iter : maximum number of iterations the solver is allowed to run. 

In [16]:
from sklearn.model_selection import GridSearchCV

grid = {
    'C': [.1,1,10], 
    'max_iter': [1000,2000,3000]
}

grid_search = GridSearchCV(model_lsvc, grid, cv = 5)
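
The cell above builds the search object but does not run it. A minimal sketch of running the search and reading the result follows, on synthetic stand-in data so that the 45 fits (9 combinations times 5 folds) complete quickly. The data and the `_gs`/`_demo` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Small synthetic stand-in (assumption for illustration)
X_gs, y_gs = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=650,
)

# Same grid as in the notebook cell
grid_demo = {
    "C": [0.1, 1, 10],
    "max_iter": [1000, 2000, 3000],
}

search_demo = GridSearchCV(LinearSVC(random_state=650), grid_demo, cv=5)
search_demo.fit(X_gs, y_gs)  # fits every combination on every fold, then refits the best one

print("best parameters:", search_demo.best_params_)
print(f"best mean CV accuracy: {search_demo.best_score_:.3f}")
```

After fitting, `search_demo.best_estimator_` holds the model refit on all the data passed to `fit`, and can be used directly for prediction.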





### Comparison of Random Forest vs. Support Vector Machine 

## Intro

The goal of our project was to predict the types of trees present in different areas of the Roosevelt National Forest in Colorado. We built a model using cartographic variables such as shadow coverage, distance to nearby landmarks, soil type, and local topography to classify tree type. 

## Problem Statement

The main problem we will address is how to identify the type of a tree based on its surroundings. If we know certain characteristics about a piece of land, such as elevation or the hillshade at noon, what sort of trees would we likely find there? 


## Research Questions

* What cartographic feature is most predictive of tree type? 
* Which trees are most resilient to harsh geographic and climatic conditions? 
* Which trees need the least amount of sunlight? 


## Methodology

### Random Forest
The final random forest model that we chose differed from the default hyperparameters only in that the number of estimators was set to 250 rather than 100. We made that choice after testing higher values for min_samples_leaf, min_samples_split, and n_estimators using GridSearchCV. The results of that cross-validated hyperparameter tuning grid showed that the default values for min_samples_leaf and min_samples_split were best in our case, and that 250 was better than 100 for n_estimators. Due to our limited access to computing power, 250 was the highest value that we tested for n_estimators, so it is possible that an even higher value could have yielded better accuracy scores.
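
A search of this kind can be sketched as follows. Only the defaults (min_samples_leaf = 1, min_samples_split = 2, n_estimators = 100) and the value 250 come from the description above; the other grid values, the 3-fold setting, the synthetic data, and the `_rf` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in so the search runs in seconds (assumption for illustration)
X_rf, y_rf = make_classification(
    n_samples=300, n_features=12, n_informative=6,
    n_classes=3, n_clusters_per_class=1, random_state=650,
)

# Defaults plus one higher value for each hyperparameter (higher values are hypothetical,
# except n_estimators = 250)
rf_grid = {
    "n_estimators": [100, 250],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
}

rf_search = GridSearchCV(RandomForestClassifier(random_state=650), rf_grid, cv=3)
rf_search.fit(X_rf, y_rf)

print("best parameters:", rf_search.best_params_)
print(f"best mean CV accuracy: {rf_search.best_score_:.3f}")
```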

### Support Vector Machine 
Given the computational expense of training an SVM model on a dataset of this size, we used the RandomizedSearchCV method to tune the model. We focused on 2 hyperparameters: 'C', the misclassification penalty, and 'max_iter', the maximum number of iterations the solver runs. We chose a logarithmic range of values to try for 'C', and for each value, a random 'max_iter' value between 1000 and 5000. Each sampled combination of these hyperparameters was evaluated with 3-fold cross validation. From this randomized search, the best value for 'C' was 14.37, and the best value for 'max_iter' was 1902. 
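
One way such a randomized search can be set up is sketched below. The bounds of the log-uniform range for 'C', the number of sampled combinations (`n_iter=10`), the synthetic data, and the `_rs` names are assumptions for illustration; only the 1000 to 5000 range for 'max_iter' and the 3 folds come from the description above.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import LinearSVC

# Small synthetic stand-in (assumption for illustration)
X_rs, y_rs = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=650,
)

param_dist = {
    "C": loguniform(1e-2, 1e2),       # hypothetical bounds; sampled uniformly on a log scale
    "max_iter": randint(1000, 5001),  # integers from 1000 to 5000 (upper bound is exclusive)
}

rand_search = RandomizedSearchCV(
    LinearSVC(random_state=650),
    param_dist,
    n_iter=10,          # number of sampled combinations (assumption)
    cv=3,               # 3-fold cross validation
    random_state=650,
)
rand_search.fit(X_rs, y_rs)

print("best parameters:", rand_search.best_params_)
print(f"best mean CV accuracy: {rand_search.best_score_:.3f}")
```

Unlike an exhaustive grid, the cost here is fixed by `n_iter` regardless of how wide the ranges are, which is what makes the approach practical on a dataset of this size.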

## Results

### Random Forest 
Despite our hyperparameter tuning efforts, the resulting improvement in accuracy over default settings was minimal. The tuned model produced an accuracy score of 0.952521 compared to 0.95201 with the default hyperparameter settings.

## Discussion and Implications

## Conclusion

## References

## Appendix