# TASK 5: Machine Learning Implementation & Evaluation 
## Predicting Forest Type 

### Final Project, Python for Data Science 
#### Willamette University MSDS 
by Charles Hanks, Carter McMahon, & Cleighton Roberts 

<br>
<br>
<br>

## Intro

The goal of our project was to predict the types of trees present in different areas of the Roosevelt National Forest in Colorado. We built a model using cartographic variables such as shadow coverage, distance to nearby landmarks, soil type, and local topography to classify tree type. 

## Problem Statement

The main problem we will address is how to identify the type of a tree based on its surroundings. If we know certain characteristics about a piece of land, such as elevation or the hillshade at noon, what sort of trees would we likely find there? 

## Primary Research Question

Which cartographic features are most predictive of tree type in the Roosevelt National Forest?


## Methodology

### Random Forest
The final random forest model differed from the default hyperparameters only in the number of estimators, which we set to 250 rather than 100. We made that choice after testing higher values for min_samples_leaf, min_samples_split, and n_estimators using GridSearchCV. The cross-validated grid search showed that the default values for min_samples_leaf and min_samples_split were best in our case, and that 250 was better than 100 for n_estimators. Because of our limited access to computing power, 250 was the highest value we tested for n_estimators, so an even higher value might have yielded better accuracy scores.
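The grid search described above can be sketched roughly as follows. This is a minimal illustration, not our exact notebook code: the synthetic data from `make_classification`, the specific candidate values other than 100 and 250, and the 3-fold setting are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the forest data (assumption: the real features
# and labels would be loaded from the project dataset instead).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)

# Hypothetical grid: the scikit-learn defaults (1, 2, 100) plus larger values.
param_grid = {
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
    "n_estimators": [100, 250],
}

# Every combination in the grid is scored with cross-validated accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,  # assumed fold count for this sketch
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

Because the grid is exhaustive, the cost grows with the product of the list lengths, which is why the upper bound on n_estimators matters when computing power is limited.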

### Support Vector Machine 
Given the computational expense of training an SVM on a dataset of this size, we used RandomizedSearchCV to tune the model. We focused on two hyperparameters: 'C', the misclassification penalty, and 'max_iter', the maximum number of iterations the solver runs. We sampled 'C' from a logarithmic range and, for each sampled value, drew a random 'max_iter' between 1000 and 5000. Each combination was evaluated with 3-fold cross-validation. The best value found for 'C' was 14.37, and the best value for 'max_iter' was 1902. Unfortunately, these parameters produced only a negligible increase in model performance.
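A randomized search of this kind might look like the sketch below. The bounds of the log-uniform range for 'C', the number of sampled combinations, the use of `LinearSVC` as the estimator, and the synthetic data are assumptions for illustration; only the 1000 to 5000 range for 'max_iter' and the 3-fold cross-validation come from our description.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the forest data (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)

# C is drawn on a logarithmic scale; max_iter is a random integer in [1000, 5000].
param_distributions = {
    "C": loguniform(1e-2, 1e2),     # assumed bounds for this sketch
    "max_iter": randint(1000, 5001),  # upper bound is exclusive
}

search = RandomizedSearchCV(
    LinearSVC(),
    param_distributions,
    n_iter=5,  # assumed number of sampled combinations
    cv=3,      # 3-fold cross-validation
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

Unlike an exhaustive grid, the cost here is fixed by `n_iter` rather than by the size of the search space, which is what makes the approach practical for a slow-to-train model.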

### K Nearest Neighbors 
We wanted to try a KNN model for two reasons. First, KNN is a simpler and faster algorithm than RF and SVM. Second, we suspected that it would perform well on geospatial data, given that it is based on the distance between data points. The best value for the hyperparameter K was 5.
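One common way to choose K is to compare cross-validated accuracy across a handful of candidates. The sketch below illustrates that idea only; the candidate values, the fold count, and the synthetic data are assumptions, and it is not necessarily the procedure we used to arrive at K = 5.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the forest data (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)

candidate_ks = [1, 3, 5, 7, 9]  # hypothetical candidates
mean_scores = {}
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 3 folds for this value of K.
    mean_scores[k] = cross_val_score(knn, X, y, cv=3, scoring="accuracy").mean()

# Pick the K with the highest mean cross-validated accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, round(mean_scores[best_k], 4))
```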

## Results

### Random Forest 
Despite our hyperparameter tuning efforts, the resulting improvement in accuracy over default settings was minimal. The tuned model produced an accuracy score of 0.952521 compared to 0.95201 with the default hyperparameter settings.

### Support Vector Machine 
Our best SVM model was the baseline LinearSVC model trained on 70% of the dataset. The accuracy score of this model was 0.72.
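For reference, a baseline of this shape (default LinearSVC, 70% of the rows for training, accuracy on the remainder) can be sketched as follows. The synthetic data and the random seed are assumptions, so the printed score will not match the figure reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the forest data (assumption).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=6, n_classes=3, random_state=0
)

# Hold out 30% of the rows for testing, leaving 70% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Baseline: LinearSVC with its default hyperparameters.
baseline = LinearSVC()
baseline.fit(X_train, y_train)

accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(round(accuracy, 4))
```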

### K-Nearest Neighbors 
Our KNN model's performance was a pleasant surprise. With k = 5, the accuracy score was 0.92.


## Discussion & Implications 


Based on our process of training and tuning the SVM models, we conclude that this dataset is not well suited to a Support Vector Machine model. During our exploratory data analysis, we found significant overlap among classes. That is to say, many of the different tree types share similar characteristics; after all, they are in the same forest. SVMs do not perform well on very large datasets or when classes overlap (that is, when the dataset has a lot of 'noise'). These two disadvantages explain why the accuracy of our SVM models did not rise above roughly 0.72.

KNN performed better than we expected, given how simple the algorithm is compared to RF and SVM. We suspect that KNN performed well because many of our features are spatial distances, and KNN finds neighbors based on Euclidean distance in n-dimensional feature space.

Random Forest is a tried and true model that produces consistently good results. We attribute this to the fact that our dataset contained many one-hot encoded (1/0) columns, so the model could quickly determine tree type by eliminating certain potential tree types. For example, there were tree types that were never found in certain soil types. An ensemble of decision trees excels with this sort of data. 




### Findings

Elevation is the cartographic feature most predictive of tree type. We arrived at this conclusion by examining the variable importance plot of our Random Forest model. Other important features include the distance from a road, from a body of water, and from a point where a fire has started.
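A ranking like the one behind a variable importance plot can be read from a fitted random forest's `feature_importances_` attribute, which holds impurity-based importances. The sketch below shows the idea under stated assumptions: the data is synthetic and the feature names are placeholders, so the ordering it prints says nothing about the real forest data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names and synthetic data (assumptions for this sketch).
feature_names = [f"feature_{i}" for i in range(8)]
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)

# Sort feature indices from most to least important.
order = np.argsort(forest.feature_importances_)[::-1]
ranked = [(feature_names[i], float(forest.feature_importances_[i])) for i in order]

for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances sum to 1 across features, so each value can be read as a share of the model's total split quality.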


## Conclusion

Our interpretation of our modeling is that human actions have a significant impact on the types of trees that we find in the Roosevelt National Forest. Setting aside the importance of elevation, which is expected, the next two most important features are distance to roadways and distance to fire points. We have reason to believe that distance to human activity is connected to the types of trees we find in the forest. This is yet another example of how our species shapes our environment.



## References

[Top 4 Advantages and Disadvantages of Support Vector Machine or SVM](https://dhirajkumarblog.medium.com/top-4-advantages-and-disadvantages-of-support-vector-machine-or-svm-a3c06a2b107#:~:text=SVM%20algorithm%20is%20not%20suitable,samples%2C%20the%20SVM%20will%20underperform.)



[K-Nearest Neighbors (KNN) Algorithm in Machine Learning](https://www.enjoyalgorithms.com/blog/k-nearest-neighbours-in-ml)