# Forest Cover Prediction
## by: Avengers End Game
### authors: Naga Chandrasekaran, Scott Gatzemeier, Aidan Jackson, and Andi Morey Peterson

### Executive Summary

The goal of this project is to classify the forest cover type in four wilderness areas of the Roosevelt National Forest in Northern Colorado. These areas represent forests with minimal human-caused disturbance, so the existing cover types are more a result of ecological processes than of forest management practices. An accurate model would allow the US Forest Service (USFS) to predict the predominant cover type to plant in reforestation efforts across the 800,000 acres of the Roosevelt National Forest.

![Picture Credit to: alchetron.com](https://i.ytimg.com/vi/Yi4ICw5L4h0/maxresdefault.jpg)

The data was collected by the USFS and the US Geological Survey and provided to Kaggle by Colorado State University.  Each entry of the dataset represents a 30x30 meter cell.  The project will attempt to predict one of seven cover types using features such as elevation, slope, soil type, wilderness areas, aspect, and distance measures.


#### Table of Contents:

1) Understand, Clean and Format Data - *Should we move AFTER EDA? - Andi*

2) Exploratory Data Analysis

3) Feature Engineering & Selection

4) Initial Machine Learning Models

5) Hyperparameter Tuning

6) Evaluate the Best Model

7) Interpret Model Results

8) Summary & Conclusions

### Exploratory Data Analysis

In [1]:
import numpy as np
import pandas as pd 
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
#Read data for analysis
raw_data=pd.read_csv('../input/forest-cover-type-dataset/covtype.csv')
print('Data Dimensions for Raw Data:')
print('   Number of Records:', raw_data.shape[0])
print('   Number of Features:', raw_data.shape[1])
train_data=pd.read_csv('../input/forest-cover-type-prediction/train.csv')
print('Data Dimensions for Train Data:')
print('   Number of Records:', train_data.shape[0])
print('   Number of Features:', train_data.shape[1])
test_data=pd.read_csv('../input/forest-cover-type-prediction/test.csv')
print('Data Dimensions for Test Data:')
print('   Number of Records:', test_data.shape[0])
print('   Number of Features:', test_data.shape[1])
print('Feature Names:')
print(raw_data.columns)

FileNotFoundError: [Errno 2] No such file or directory: '../input/forest-cover-type-dataset/covtype.csv'

#### Feature Information:

The seven cover types are:

1. Spruce/Fir
2. Lodgepole Pine
3. Ponderosa Pine
4. Cottonwood/Willow
5. Aspen
6. Douglas-fir
7. Krummholz

First, we look at the distribution of the cover types, which shows whether the classes we are trying to predict are imbalanced.

In [None]:
#Plot distribution of cover types
fig, ax = plt.subplots(1, 2, figsize=(10,5), sharey=True)

ax[0].set(ylabel='Percentage of Data')

class_dist=raw_data.groupby('Cover_Type').size()/raw_data.shape[0]
class_label=pd.DataFrame(class_dist,columns=['Size'])
sns.barplot(ax=ax[0],x=class_label.index,y='Size',data=class_label)
ax[0].set_title("Cover Types in Raw Dataset")
ax[0].set(xlabel='Cover Type')

class_dist=train_data.groupby('Cover_Type').size()/train_data.shape[0]
class_label=pd.DataFrame(class_dist,columns=['Size'])
sns.barplot(ax=ax[1],x=class_label.index,y='Size',data=class_label)
ax[1].set_title("Cover Types in Train Dataset")
ax[1].set(xlabel='Cover Type')

plt.suptitle('Cover Types for Each Dataset', fontsize=18)
plt.show()

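The training set that we will use for model building is evenly distributed across the seven cover types, but the raw dataset is skewed. We should keep this mismatch in mind: a model fit to the balanced training data may overfit to that distribution and generalize poorly to the imbalanced raw data.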
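To put a number on the difference between the two distributions, one option is to compare per-class proportions and a simple max-to-min count ratio. The sketch below is illustrative only: the helper names `class_proportions` and `imbalance_ratio` are hypothetical, and the label values are synthetic stand-ins, not the real cover type counts.

```python
import pandas as pd

def class_proportions(labels: pd.Series) -> pd.Series:
    """Return the share of rows in each class, sorted by class label."""
    return labels.value_counts(normalize=True).sort_index()

def imbalance_ratio(labels: pd.Series) -> float:
    """Ratio of the most common class count to the least common class count."""
    counts = labels.value_counts()
    return counts.max() / counts.min()

# Tiny synthetic labels for illustration only (not the real cover type counts)
balanced = pd.Series([1, 2, 3, 1, 2, 3], name="Cover_Type")
skewed = pd.Series([1, 1, 1, 1, 2, 3], name="Cover_Type")

print(class_proportions(skewed))
print("balanced ratio:", imbalance_ratio(balanced))
print("skewed ratio:", imbalance_ratio(skewed))
```

In the notebook, the same helpers could be applied to `train_data['Cover_Type']` and `raw_data['Cover_Type']`; a ratio close to 1 indicates a balanced set.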

#### Feature Information (cont):

The rest of the data that we can use as features for models are as follows:

- Elevation (continuous) - Elevation in meters
- Aspect (continuous) - Aspect in degrees azimuth
- Slope (continuous) - Slope in degrees
- Horizontal_Distance_To_Hydrology (continuous) - Horizontal distance to nearest surface water features
- Vertical_Distance_To_Hydrology (continuous) - Vertical distance to nearest surface water features
- Horizontal_Distance_To_Roadways (continuous) - Horizontal distance to nearest roadway
- Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice
- Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice
- Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice
- Horizontal_Distance_To_Fire_Points (continuous) - Horizontal distance to nearest wildfire ignition points
- Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation
- Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation
- Cover_Type (7 types, integers 1 to 7) - Forest Cover Type designation
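Because the wilderness area and soil type are each spread over several binary columns, it can help to check that every row has exactly one indicator set and to collapse the group into a single label for plotting. The following is a minimal sketch, assuming the columns are named with a shared prefix followed by a number (for example `Wilderness_Area1`); the helper `collapse_one_hot` is hypothetical and the rows are synthetic.

```python
import pandas as pd

def collapse_one_hot(df: pd.DataFrame, prefix: str) -> pd.Series:
    """Collapse binary indicator columns sharing a prefix into one integer label per row."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    # Each row should have exactly one indicator set to 1
    if not (df[cols].sum(axis=1) == 1).all():
        raise ValueError(f"found rows without exactly one '{prefix}' column set")
    # idxmax gives the name of the column holding the 1; strip the prefix to keep the number
    return df[cols].idxmax(axis=1).str[len(prefix):].astype(int)

# Synthetic rows for illustration only; column naming is an assumption
toy = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0],
    "Wilderness_Area2": [0, 0, 1],
    "Wilderness_Area3": [0, 1, 0],
})
area = collapse_one_hot(toy, "Wilderness_Area")
print(area.tolist())
```

The `ValueError` doubles as a data-quality check: it is raised if any row has zero or multiple indicators set within the group.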

First, let's take a look at our continuous data.

In [None]:
#plot the continuous data
fig, ax = plt.subplots(2, 5, figsize=(20,10))
continuous_data=raw_data.loc[:,'Elevation':'Horizontal_Distance_To_Fire_Points']

for i, col in enumerate(continuous_data.columns):
    sns.histplot(ax=ax[int(i/5),i%5],data=continuous_data[col],bins=250)
    #sns.distplot(ax=ax[int(i/5),i%5],a=continuous_data[col])
    
plt.suptitle('Histograms of Continuous Features', fontsize=18)
plt.show()

In [None]:
# Bin the aspect values once and reuse the result
counts, bin_edges = np.histogram(raw_data["Aspect"], bins=60)
# Convert azimuth (clockwise from north) to the counterclockwise-from-east angle used by plt.polar
theta = (90 - bin_edges[:-1]) * np.pi / 180
plt.figure(figsize=(15,5))
plt.polar(theta, counts)
plt.title("Aspect in Polar Coordinates")
#plt.xticks([90*np.pi/180, 180*np.pi/180, 270*np.pi/180, 0*np.pi/180], ['N', 'W', 'S', 'E'])
#plt.xticks([0], ['E'])
plt.show()

*EDA to add here:  Aspect chart ... correlation chart (to show how the hillshade and slope are correlated with each other... - Andi*
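One possible starting point for the correlation check noted above is to compute the Pearson correlation matrix of the continuous features and list the most strongly correlated pairs. This is only a sketch: the helper `top_correlations` is hypothetical, and the data is synthetic with an invented relationship between `Slope` and `Hillshade_Noon`, so the values say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

def top_correlations(df: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by absolute value so strong negative correlations rank alongside positive ones
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Synthetic stand-in columns for illustration only; values are not from the dataset
rng = np.random.default_rng(0)
slope = rng.uniform(0, 40, size=200)
toy = pd.DataFrame({
    "Slope": slope,
    "Hillshade_Noon": 250 - 1.5 * slope + rng.normal(0, 2, size=200),
    "Elevation": rng.uniform(2000, 3500, size=200),
})
print(top_correlations(toy, n=1))
```

In the notebook, the function could be called on `continuous_data`, and the full matrix from `continuous_data.corr()` could be passed to `sns.heatmap` for the chart itself.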

In [None]:
fig, ax = plt.subplots(2, 5, figsize=(25,10))
raw_data['Cover_Type']=raw_data['Cover_Type'].astype('category') #To convert target class into category
for i, col in enumerate(continuous_data.columns):
    #plt.figure(i,figsize=(8,4))
    sns.boxplot(ax=ax[int(i/5),i%5], x=raw_data['Cover_Type'], y=col, data=raw_data)
    
plt.suptitle('Boxplots of Continuous Features', fontsize=18)
plt.show()