# Forest Cover Problem

## Objective
The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. Each observation is a 30m x 30m patch. You are asked to predict an integer classification for the forest cover type. The seven types are:

1 - Spruce/Fir
2 - Lodgepole Pine
3 - Ponderosa Pine
4 - Cottonwood/Willow
5 - Aspen
6 - Douglas-fir
7 - Krummholz

The training set (15120 observations) contains both features and the Cover_Type. The test set contains only the features. You must predict the Cover_Type for every row in the test set (565892 observations).
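Since predictions are integer codes, a lookup table can make plots and reports easier to read. The following is a minimal sketch: the names `COVER_TYPES` and `cover_type_name` are hypothetical helpers introduced here for illustration, and the labels are copied from the list above.

```python
# Integer codes for the seven cover types, copied from the list above
COVER_TYPES = {
    1: "Spruce/Fir",
    2: "Lodgepole Pine",
    3: "Ponderosa Pine",
    4: "Cottonwood/Willow",
    5: "Aspen",
    6: "Douglas-fir",
    7: "Krummholz",
}


def cover_type_name(code):
    """Return the readable name for an integer Cover_Type code (hypothetical helper)."""
    return COVER_TYPES[int(code)]


print(cover_type_name(5))
```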

## Steps to complete the project
Completing this project involves the following steps:
1. Read in data and clean
2. Data visualization to uncover trends
3. Feature engineering
4. Training algorithm
5. Testing algorithm
6. Optional - Ensembling

## Reading data and cleaning

Let us start by importing the core data science libraries and reading in the training and testing data. The training and testing data should be combined and shuffled for exploratory analysis.

In [1]:
# Importing data handling libraries numpy and pandas
import numpy as np
import pandas as pd

# Importing visualization libraries matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")
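# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
combined = pd.concat([train, test], ignore_index=True)  # Does not keep the original index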
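The cell above combines the two frames but does not shuffle them. One possible way to add the shuffle is sketched below on toy data, assuming `DataFrame.sample` with `frac=1` is an acceptable shuffling method. The frames `toy_train` and `toy_test`, the column `feature`, and all values are made up for illustration; only `Cover_Type` comes from the dataset description.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (values are made up)
toy_train = pd.DataFrame({"feature": [10, 20, 30], "Cover_Type": [5, 5, 2]})
toy_test = pd.DataFrame({"feature": [40, 50]})

# Stack the two frames; rows from the test frame get NaN for Cover_Type
toy_combined = pd.concat([toy_train, toy_test], ignore_index=True)

# Shuffle all rows reproducibly, then rebuild a clean 0..n-1 index
toy_shuffled = toy_combined.sample(frac=1, random_state=0).reset_index(drop=True)

print(toy_shuffled)
```

Note that the combined frame has missing `Cover_Type` values for every test row, so a NaN check on it would flag the target column even when the features are complete.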

Before we go into the data visualization, we need to check if there are any missing values in the training or testing datasets. 

In [8]:
# Checking NaN in training dataset
train.isnull().values.any()

False

In [9]:
# Checking NaN in testing dataset
test.isnull().values.any()

False

An initial check reveals that there are no missing values in the form of NaN. However, dataset creators sometimes record a missing value as '0' or as another value that does not make sense in context. This will be covered in the data visualization part of the notebook.
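As a first pass before plotting, the share of zeros in each column can be tabulated. This is a rough sketch rather than a definitive test: `zero_fraction` is a hypothetical helper, and the toy columns `feature_a` and `feature_b` are invented for illustration.

```python
import pandas as pd


def zero_fraction(df):
    """Fraction of entries equal to 0 in each numeric column, largest first."""
    numeric = df.select_dtypes(include="number")
    return (numeric == 0).mean().sort_values(ascending=False)


# Toy frame with made-up values: half of feature_a is zero, none of feature_b
toy = pd.DataFrame({"feature_a": [0, 0, 3, 4], "feature_b": [1, 2, 3, 4]})
zeros = zero_fraction(toy)
print(zeros)
```

A high share of zeros does not prove that values are missing. For one-hot encoded columns, for example, zeros are expected, so each flagged column still needs to be judged in context.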

Furthermore, one-hot encoding should be performed on categorical variables. However, the metadata suggests that all potential categorical variables have already been one-hot encoded. Let us check whether these encoded variables are in fact categorical; if not, we will factorize them. (http://pbpython.com/categorical-encoding.html)
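One way to run that check is to confirm that a group of indicator columns contains only 0 and 1 and that exactly one column is active per row. The sketch below assumes the one-hot columns of a group share a common name prefix; `check_one_hot` is a hypothetical helper, and the prefix `zone_` and the toy values are invented for illustration.

```python
import pandas as pd


def check_one_hot(df, prefix):
    """Check that columns starting with `prefix` are 0/1 and sum to 1 in every row."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    block = df[cols]
    is_binary = bool(block.isin([0, 1]).all().all())
    one_per_row = bool((block.sum(axis=1) == 1).all())
    return is_binary, one_per_row


# Toy frame with a valid one-hot group (made-up values)
good = pd.DataFrame({"zone_1": [1, 0, 0], "zone_2": [0, 0, 1], "zone_3": [0, 1, 0]})
# Toy frame where the second row has two active columns
bad = pd.DataFrame({"zone_1": [1, 1], "zone_2": [0, 1]})

print(check_one_hot(good, "zone_"))
print(check_one_hot(bad, "zone_"))
```

If either flag comes back `False`, the group is not a clean one-hot encoding and would need a closer look before modeling.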

Any further data cleaning requires a more in-depth look at the data through visualizations.

## Exploratory Data Analysis

There are a couple of objectives I would like to meet in the EDA:
1. Check for the distribution of variables and normalize outliers
2. Standardize all variables
3. 