# Forest Cover Type Prediction
## W207 Final Project: Baseline Report
### _Team RanForest: Anu Yadav, Naga Akkineni, Lina Gurevich, Rishi Majumder_


## 1. Problem Introduction

Natural resources are key to keeping the ecological life cycle in balance. A larger goal of this project is to map the types of tree populations found in different regions and to understand how deforestation, whether through natural causes such as forest fires or through human activity, creates lasting environmental impact.

The Roosevelt National Forest is located in north central Colorado and covers a total area of 813,799 acres (1,271.56 sq mi). There are six officially designated wilderness areas lying within Roosevelt National Forest that are part of the National Wilderness Preservation System. 

In this study we focus on four wilderness areas characterized by minimal human-caused disturbance, so that the existing forest cover types are more a result of ecological processes than of forest management practices.

1 - **Rawah Wilderness Area** (119.4 square miles)  
Elevation ranges from 8,400 ft to 13,000 ft.

2 - **Neota Wilderness Area** (15.51 square miles)  
Elevation ranges from 10,000 ft to 11,896 ft.

3 - **Comanche Peak Wilderness Area** (104.4 square miles)  
Elevation ranges from 8,000 ft to 12,702 ft.

4 - **Cache la Poudre Wilderness Area** (14.47 square miles)  
Elevation ranges from 6,200 ft to 8,600 ft.

There are **7** types of forest cover that are common in the park:

<img src="Cover Types.PNG"> 

Our task is to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form (not scaled) and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type.
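
The binary-column encoding described above can be illustrated with a small sketch. It assumes that each qualitative variable is spread over indicator columns sharing a prefix and ending in a 1-based index (for example `Wilderness_Area1` to `Wilderness_Area4`), and that exactly one indicator is set per row. Those column names are an assumption, and `decode_indicator` is a hypothetical helper, not part of the project code.

```python
def decode_indicator(row, prefix):
    """Return the 1-based index of the single indicator column set to 1.

    `row` maps column names to values. Columns are assumed to be named
    `<prefix><index>` (a naming assumption, not confirmed by the source).
    """
    active = [
        int(name[len(prefix):])
        for name, value in row.items()
        if name.startswith(prefix) and int(value) == 1
    ]
    if len(active) != 1:
        raise ValueError(
            f"expected exactly one active {prefix} column, got {active}"
        )
    return active[0]


# Toy row: a cell located in wilderness area 3 (values are illustrative only).
example_row = {
    "Elevation": 2596,
    "Wilderness_Area1": 0,
    "Wilderness_Area2": 0,
    "Wilderness_Area3": 1,
    "Wilderness_Area4": 0,
}
area = decode_indicator(example_row, "Wilderness_Area")
print(area)
```

With a pandas dataframe, `df[cols].idxmax(axis=1)` returns the name of the active indicator column for each row, which serves the same purpose.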

## 2. Exploratory Data Analysis

The original dataset contains the following features that can be roughly split into 7 groups:

1. *Landscape*  
**Elevation** - Elevation in meters  
**Aspect** - Aspect in degrees azimuth  
**Slope** - Slope in degrees 


2. *Proximity to water supplies*  
**Horizontal_Distance_To_Hydrology** - Horizontal distance to the nearest surface water feature  
**Vertical_Distance_To_Hydrology** - Vertical distance to the nearest surface water feature  


3. *Proximity to human-caused habitat alteration*  
**Horizontal_Distance_To_Roadways** - Horizontal distance to the nearest roadway  


4. *Sun exposure*  
**Hillshade_9am (0 to 255 index)** - Hillshade index at 9am, summer solstice  
**Hillshade_Noon (0 to 255 index)** - Hillshade index at noon, summer solstice  
**Hillshade_3pm (0 to 255 index)** - Hillshade index at 3pm, summer solstice  


5. *Wildfire risk factor*  
**Horizontal_Distance_To_Fire_Points** - Horizontal distance to the nearest wildfire ignition point  


6. *Location within the park*  
**Wilderness_Area (4 binary columns, 0 = absence or 1 = presence)** - Wilderness area designation  


7. *Type of soil*  
**Soil_Type (40 binary columns, 0 = absence or 1 = presence)** - Soil Type designation

Prior to conducting a thorough quantitative analysis, we can reason that any of the features listed above may affect the type of the predominant forest cover in a particular area. Therefore, we will retain all the features at this stage of our investigation.
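
The grouping above can be written down as a simple lookup, which is handy for selecting or plotting one group at a time. This is a minimal sketch: the group keys are illustrative labels, and the numbered suffixes on the binary columns (`Wilderness_Area1`, `Soil_Type1`, and so on) are an assumed naming scheme.

```python
# Group names follow the list above; the numbered suffixes on the binary
# columns (Wilderness_Area1, Soil_Type1, ...) are an assumed naming scheme.
feature_groups = {
    "landscape": ["Elevation", "Aspect", "Slope"],
    "water_proximity": [
        "Horizontal_Distance_To_Hydrology",
        "Vertical_Distance_To_Hydrology",
    ],
    "road_proximity": ["Horizontal_Distance_To_Roadways"],
    "sun_exposure": ["Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm"],
    "wildfire_risk": ["Horizontal_Distance_To_Fire_Points"],
    "wilderness_area": [f"Wilderness_Area{i}" for i in range(1, 5)],
    "soil_type": [f"Soil_Type{i}" for i in range(1, 41)],
}

# Flatten the groups into a single list of feature column names.
all_features = [col for cols in feature_groups.values() for col in cols]

for group, cols in feature_groups.items():
    print(f"{group}: {len(cols)} column(s)")
print("total features:", len(all_features))
```

Counted this way, the seven groups account for 54 columns: 10 numeric features plus 44 binary indicators.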

In [8]:
# Import necessary modules and libraries
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.metrics
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# To install xgboost package with conda, run:
# conda install -c anaconda py-xgboost
#from xgboost import XGBClassifier

# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# Apply seaborn's default plot styling.
sns.set()

In [9]:
# Read the training dataset into a pandas dataframe
raw_data = pd.read_csv("train.csv")
print("Original dataset shape: ", raw_data.shape)

Original dataset shape:  (15120, 56)
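
A quick way to sanity-check the loaded frame is to separate the identifier, feature, and target columns and to count missing values. The sketch below uses a tiny synthetic frame in place of `train.csv`. The column names `Id` and `Cover_Type` are assumptions about the file layout, and `split_columns` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def split_columns(df, id_col="Id", target_col="Cover_Type"):
    """Split a frame into (features, target), dropping the identifier column.

    The default column names are assumptions about the file layout.
    """
    X = df.drop(columns=[id_col, target_col])
    y = df[target_col]
    return X, y


# Synthetic stand-in for the real data (random values, illustrative only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Id": np.arange(1, 7),
    "Elevation": rng.integers(1800, 3900, size=6),
    "Slope": rng.integers(0, 50, size=6),
    "Cover_Type": rng.integers(1, 8, size=6),
})

X, y = split_columns(toy)
print("features:", X.shape, "target:", y.shape)
print("missing values:", int(toy.isna().sum().sum()))
```

Applied to the real data, the same call would be `X, y = split_columns(raw_data)`, provided those two column names match the file.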


**Observations:**  

- The original dataset consists of 15,120 observations and 56 columns: a row identifier (`Id`), the 54 features listed above, and 1 target variable.