## Machine Learning

Now that we have done some feature engineering and a little data exploration, it is time to train an algorithm to predict the forest cover type. There are a couple of things to keep in mind as we begin building the model:

1. The algorithm should be non-parametric: parametric algorithms often make assumptions that may not fit the form of our data, especially since many of our variables are not normally distributed.

2. Reduce overfitting by using stratified k-fold cross-validation: repeatedly training and testing on different folds gives a more honest estimate of how the model generalizes and helps us catch overfitting, a problem encountered in many non-parametric algorithms.
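
To make the second point concrete, here is a minimal sketch of what "stratified" means in practice. It uses toy imbalanced labels (an assumption for illustration, not the forest cover data) and checks that every test fold keeps roughly the same class mix as the full dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels (an assumption for illustration, not the forest data)
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # feature values do not affect how folds are built

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# Share of class 1 in each held-out fold
fold_minority_share = []
for train_idx, test_idx in skf.split(X, y):
    fold_minority_share.append(y[test_idx].mean())

print(fold_minority_share)
```

Each held-out fold should contain close to 20% of class 1, mirroring the full label vector. A plain k-fold split gives no such guarantee.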

Let's get started! We'll first load in our initial libraries and data.

In [43]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")

Let's do some cross-validation. Since the problem already gave us separate training and testing datasets, I will run cross-validation on the training dataset only.

I will first do a little pre-processing to separate the target from the features.

In [2]:
Y_train = train["Cover_Type"]
X_train = train.drop("Cover_Type", axis = 1)

The non-parametric algorithms I would like to test are decision trees, random forests, k-nearest neighbors, and support vector machines.

I will iterate through each model and print out the cross-validation score for each.

In [3]:
# Libraries
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Building models
models = []
models.append(('RF', RandomForestClassifier()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('SVM', SVC()))

results = []
names = []

for name, model in models:
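    # Stratified 10-fold CV. random_state only takes effect with shuffle=True
    # (recent scikit-learn versions raise an error otherwise), so it is
    # omitted here to keep the original unshuffled folds.
    kfold = model_selection.StratifiedKFold(n_splits=10)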
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring= 'accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

RF: 0.571098 (0.085077)
KNN: 0.092063 (0.056787)
CART: 0.535979 (0.085821)
SVM: 0.102778 (0.022351)
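
KNN and SVM are both sensitive to feature scale, and the feature set here still includes Id-like columns, so unscaled inputs are one plausible (but unverified) contributor to the scores above. The sketch below is not part of the original analysis. It uses synthetic data with a deliberately large-scale, uninformative column (both are assumptions) to show how a scaler can be wrapped in a pipeline so that it is refit inside each cross-validation fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the forest cover dataset)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

# Append an uninformative, Id-like column with a much larger scale
rng = np.random.default_rng(0)
X = np.hstack([X, rng.normal(scale=1000.0, size=(X.shape[0], 1))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# Raw features: the large-scale column dominates the distance computation
raw_scores = cross_val_score(KNeighborsClassifier(), X, y,
                             cv=cv, scoring="accuracy")

# The scaler is refit on each training fold, which avoids leaking test-fold
# statistics into the model
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_scores = cross_val_score(scaled_knn, X, y, cv=cv, scoring="accuracy")

print("raw:    %f (%f)" % (raw_scores.mean(), raw_scores.std()))
print("scaled: %f (%f)" % (scaled_scores.mean(), scaled_scores.std()))
```

Whether scaling helps on the real data would need to be checked by running the same comparison on `X_train` and `Y_train`.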



In [6]:
# Fitting models
svmModel = SVC().fit(X_train, Y_train)
cartModel = DecisionTreeClassifier().fit(X_train, Y_train)
knnModel = KNeighborsClassifier().fit(X_train, Y_train)
rfModel = RandomForestClassifier().fit(X_train, Y_train)


Well, this didn't work out very well! The cross-validation scores are extremely low. As a last attempt to boost accuracy, I will try a stacked ensemble method.

Stacking requires us to create new features from the base models' predictions. Another algorithm, the meta-model (usually logistic regression), is then trained on these new features to predict the actual labels. This is why a stacked ensemble is often called a second-order model.
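
As a rough illustration of the idea, here is a minimal stacking sketch on synthetic data (an assumption; it does not use the forest cover dataset). It differs from the cells in this notebook in two ways: it builds the meta-features from out-of-fold predictions via `cross_val_predict`, so no row is scored by a model that saw it during fitting, and it uses class probabilities instead of predicted labels. The `base_models` dictionary and variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the forest cover dataset)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

base_models = {
    "CART": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Level 1: out-of-fold class probabilities from each base model,
# stacked side by side as the new feature matrix
meta_features = np.hstack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")
    for model in base_models.values()
])

# Level 2: the meta-model learns how to combine the base models' outputs
meta_model = LogisticRegression(max_iter=1000).fit(meta_features, y)

print(meta_features.shape)
```

scikit-learn also ships `sklearn.ensemble.StackingClassifier`, which wraps the same two-level procedure in a single estimator.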

We will begin by putting new features into the dataset that correspond to the model predictions. 

In [8]:
# Start from a copy of the features so that `train` is left untouched
# and the target column does not leak into the meta-features
train_meta = X_train.copy()

# Add each base model's predictions as a new feature
train_meta["SVM"] = svmModel.predict(X_train)
train_meta["KNN"] = knnModel.predict(X_train)
train_meta["CART"] = cartModel.predict(X_train)
train_meta["RF"] = rfModel.predict(X_train)

In [10]:
from sklearn.linear_model import LogisticRegression

ensemble_model = LogisticRegression().fit(train_meta, Y_train)

Let us now predict the test data with the ensemble model. I will first restructure the test data by adding the base models' predictions as features.

In [39]:
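# Work on copies: assigning `test` directly only creates aliases, so columns
# added to test_meta would also show up in test_x and break predict()
test_x = test.copy()        # original features only, for the base models
test_meta = test.copy()     # features plus base-model predictions

# Add predictions in the same column order used for the training data
test_meta["SVM"] = svmModel.predict(test_x)
test_meta["KNN"] = knnModel.predict(test_x)
test_meta["CART"] = cartModel.predict(test_x)
test_meta["RF"] = rfModel.predict(test_x)

In [None]:
# predict() returns a NumPy array, so wrap it in a DataFrame before saving
pred = ensemble_model.predict(test_meta)
pd.DataFrame({"Id": test["Id"], "Cover_Type": pred}).to_csv("../data/pred.csv", index=False)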

Unfortunately, we don't have the answers for the test dataset, so we cannot score the predictions ourselves. We will submit the CSV file of predictions to the contest instead.

In [45]:
testDummy = test
testDummy["SVM"] = ""
test.dtypes

Unnamed: 0                              int64
Id                                      int64
Soil_Type1                              int64
Soil_Type10                             int64
Soil_Type11                             int64
Soil_Type12                             int64
Soil_Type13                             int64
Soil_Type14                             int64
Soil_Type15                             int64
Soil_Type16                             int64
Soil_Type17                             int64
Soil_Type18                             int64
Soil_Type19                             int64
Soil_Type2                              int64
Soil_Type20                             int64
Soil_Type21                             int64
Soil_Type22                             int64
Soil_Type23                             int64
Soil_Type24                             int64
Soil_Type25                             int64
Soil_Type26                             int64
Soil_Type27                       

In [42]:
X_train.dtypes

Unnamed: 0                              int64
Id                                      int64
Soil_Type1                              int64
Soil_Type10                             int64
Soil_Type11                             int64
Soil_Type12                             int64
Soil_Type13                             int64
Soil_Type14                             int64
Soil_Type15                             int64
Soil_Type16                             int64
Soil_Type17                             int64
Soil_Type18                             int64
Soil_Type19                             int64
Soil_Type2                              int64
Soil_Type20                             int64
Soil_Type21                             int64
Soil_Type22                             int64
Soil_Type23                             int64
Soil_Type24                             int64
Soil_Type25                             int64
Soil_Type26                             int64
Soil_Type27                       