# MACHINE LEARNING: A general way to run and compare the most common supervised learning algorithms with R-project

By: Hector Alvaro Rojas | Data Science, Visualizations and Applied Statistics | September 2017<br>
Url: [http://www.arqmain.net]   &nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;   GitHub: [https://github.com/arqmain]
<hr>

## I Introduction

This project presents a general way to run and compare several supervised learning algorithms applied to a classification problem with R, evaluating them and selecting the best one according to a performance measure (the accuracy score).

The supervised learning algorithms to be considered here are:

* Logistic Regression (LR)
* Linear Discriminant Analysis (LDA)
* K-Nearest Neighbors (KNN)
* Classification and Regression Trees (CART)
* Random Forest Classifier (RF)
* Gaussian Naive Bayes (NB)
* Support Vector Machines (SVM)

I am not using any specific library, like caret, to control the central modeling procedure. If I use one of them, it will only support some part of the modeling process, not drive it.

The famous iris flowers dataset is used as the data source. The dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. These columns are the variables (features): SepalLength, SepalWidth, PetalLength and PetalWidth.

The fifth column is the species of the flower observed. All observed flowers belong to one of three species: Iris-setosa, Iris-versicolor and Iris-virginica. You can learn more about this dataset on [Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set).

The dataset is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Iris), but in this project I will use a copy that I download from [here](http://www.arqmain.net/MLearning/Datasets/iris.csv).

I have already checked, summarized and visualized the data as part of other machine learning projects. You can find those results [here](https://github.com/arqmain/Machine_Learning/blob/master/R_MLearning/MLearning_Classification_Comparison_R_Caret/README.md).

The accuracy score of the models will be measured using the "train/test split" method. I will split the dataset into two parts, datatrain and validation, train the models on "datatrain" and validate them on "validation".

The size of the split can depend on the size and specifics of your dataset, although it is common to use between 60% and 80% of the data for training and the remainder for testing. I will use 60% for the training dataset and 40% for the validation (testing) dataset.

Special care is needed if the dataset has more than one classification variable, or if the classes are not equally represented in the data. As always, each split should be a faithful picture of the original dataset, for example by preserving the class proportions in both parts.
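A split that preserves the class proportions can be sketched in base R as follows. This is only an illustration: it assumes the built-in `iris` data frame (whose column names and species labels differ slightly from the CSV used in this project), and `stratified_split` is a hypothetical helper, not the `createDataPartition` function used later.

```r
# Hypothetical helper: sample the same fraction of rows within each class
stratified_split <- function(labels, p, seed = 7) {
  set.seed(seed)
  idx_by_class <- split(seq_along(labels), labels)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    # sample positions, then map them back to row indices
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  }), use.names = FALSE)
  sort(train_idx)
}

train_idx  <- stratified_split(iris$Species, p = 0.60)
datatrain  <- iris[train_idx, ]
validation <- iris[-train_idx, ]

# Each species keeps the same share (1/3) in both parts
table(datatrain$Species)
table(validation$Species)
```

With 50 flowers per species and p = 0.60, each species contributes 30 rows to the training part and 20 rows to the validation part.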

A train/test split is fast, which helps when the algorithm is slow to train, and it produces performance estimates with lower bias when the dataset is large. It is ideal for large datasets (millions of records) where there is strong evidence that both splits of the data are representative of the underlying problem.

A drawback of this technique is that its estimate can have a high variance: differences between the training and test datasets can result in meaningful differences in the estimated accuracy.
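The variance of a single split can be made visible by repeating the split with different seeds and looking at the spread of the validation accuracy. The sketch below is not part of the project's pipeline: it assumes the built-in `iris` data and uses a hypothetical nearest-centroid classifier (`fit_centroids`, `predict_centroids`) only to stay free of package dependencies.

```r
# Hypothetical nearest-centroid classifier (illustration only)
fit_centroids <- function(x, y) {
  # one row of feature means per class
  t(sapply(split(x, y), colMeans))
}

predict_centroids <- function(centroids, x) {
  x <- as.matrix(x)
  # squared Euclidean distance from every row to every centroid
  d <- sapply(seq_len(nrow(centroids)), function(k) {
    rowSums(sweep(x, 2, centroids[k, ])^2)
  })
  rownames(centroids)[max.col(-d, ties.method = "first")]
}

# Validation accuracy of one stratified 60/40 split for a given seed
split_accuracy <- function(seed, p = 0.60) {
  set.seed(seed)
  train_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                             function(idx) sample(idx, round(p * length(idx)))))
  train <- iris[train_idx, ]
  valid <- iris[-train_idx, ]
  centroids <- fit_centroids(train[, 1:4], train$Species)
  mean(predict_centroids(centroids, valid[, 1:4]) == valid$Species)
}

acc <- sapply(1:20, split_accuracy)
range(acc)  # spread of the validation accuracy across 20 different splits
sd(acc)
```

The wider the range, the less a single split should be trusted when two models differ by only one or two misclassified flowers.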

Some good links to get more information about the train/test split method are:

* [  Model Selection and Train/Validation/Test Sets ](https://es.coursera.org/learn/machine-learning/lecture/QGKbr/model-selection-and-train-validation-sets)
* [  Comparing machine learning models in scikit-learn ](https://www.youtube.com/watch?v=0pP4EwWJgIU&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=5)

This project has two general objectives:

<b><i>1) Show a general way to build the models with R.<br>
2) Show a way to select the models using the "train/test split" method and the accuracy score.</i></b>


## I Loading and checking the data

As mentioned in the introduction, I read a copy of the dataset directly from http://www.arqmain.net/MLearning/Datasets/iris.csv.

In [1]:
# URL of the dataset
filename <- "http://www.arqmain.net/MLearning/Datasets/iris.csv"

# load the CSV file from the URL; Species is read as a factor,
# which the code below relies on (not the default in R >= 4.0)
df <- read.csv(filename, header=TRUE, stringsAsFactors=TRUE)
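A few basic checks on the loaded data can be sketched as below. To keep the example runnable offline it uses the built-in `iris` data frame as a stand-in for `df`; that substitution is an assumption, and the column names and species labels of the CSV differ slightly.

```r
# Stand-in for the data frame read from the CSV (assumption: same structure)
df <- iris

dim(df)            # number of rows and columns
sapply(df, class)  # four numeric features and one factor
table(df$Species)  # class balance
anyNA(df)          # TRUE if any value is missing
```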

## II Getting train and test datasets


In [2]:
library(caret)

# split training and testing dataset
percentage <- 0.60
set.seed(7)
# stratified sample of 60% of the rows in the original dataset, used for training
train_index <- createDataPartition(df$Species, p=percentage, list=FALSE)
# use the remaining 40% of the data for validation
validation <- df[-train_index,]
# use the sampled 60% of the data to train the models
datatrain <- df[train_index,]

head(validation)

head(datatrain)

Loading required package: lattice
Loading required package: ggplot2


   SepalLength SepalWidth PetalLength PetalWidth     Species
5          5.0        3.6         1.4        0.2 Iris-setosa
11         5.4        3.7         1.5        0.2 Iris-setosa
13         4.8        3.0         1.4        0.1 Iris-setosa
14         4.3        3.0         1.1        0.1 Iris-setosa
18         5.1        3.5         1.4        0.3 Iris-setosa
21         5.4        3.4         1.7        0.2 Iris-setosa

  SepalLength SepalWidth PetalLength PetalWidth     Species
1         5.1        3.5         1.4        0.2 Iris-setosa
2         4.9        3.0         1.4        0.2 Iris-setosa
3         4.7        3.2         1.3        0.2 Iris-setosa
4         4.6        3.1         1.5        0.2 Iris-setosa
6         5.4        3.9         1.7        0.4 Iris-setosa
7         4.6        3.4         1.4        0.3 Iris-setosa


## III Building models
 
I have considered both linear (LR and LDA) and nonlinear (KNN, CART, RF, NB and SVM) algorithms. Every model is trained on the same "datatrain" dataset and evaluated on the same "validation" dataset, both created once with a fixed random number seed, so the results are directly comparable.
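The fit, predict and score steps repeated in each subsection can be wrapped in a small helper. The sketch below is one possible way to do it, not the code used to produce the results in this project: `evaluate_model` is a hypothetical function, the split is a stand-in built on the built-in `iris` data, and only two of the seven models are shown.

```r
library(MASS)   # lda
library(rpart)  # rpart

# Hypothetical helper: fit on train, report accuracy on train and validation
evaluate_model <- function(fit_fun, predict_fun, train, valid, target = "Species") {
  model <- fit_fun(train)
  acc <- function(d) mean(predict_fun(model, d) == d[[target]])
  c(train = acc(train), validation = acc(valid))
}

# Stand-in 60/40 stratified split on the built-in iris data (assumption)
set.seed(7)
train_idx  <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                            function(idx) sample(idx, 30)))
datatrain  <- iris[train_idx, ]
validation <- iris[-train_idx, ]

# Each model is described by a fitting function and a prediction function
models <- list(
  LDA  = list(fit  = function(d) lda(Species ~ ., data = d),
              pred = function(m, d) predict(m, d)$class),
  CART = list(fit  = function(d) rpart(Species ~ ., data = d),
              pred = function(m, d) predict(m, d, type = "class"))
)

results <- t(sapply(models, function(m)
  evaluate_model(m$fit, m$pred, datatrain, validation)))
round(results, 4)
```

Adding another algorithm then only requires one more entry in the `models` list.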

### 3.1 Logistic Regression (LR)

In [3]:
library(VGAM)
#Build the model
model1<-vglm(Species ~ .,family = "multinomial",data=datatrain)
#Summarize the model
summary(model1)

Loading required package: stats4
Loading required package: splines

Attaching package: 'VGAM'

The following object is masked from 'package:caret':

    predictors







"some quantities such as z, residuals, SEs may be inaccurate due to convergence at a half-step"


Call:
vglm(formula = Species ~ ., family = "multinomial", data = datatrain)


Pearson residuals:
                        Min        1Q     Median       3Q     Max
log(mu[,1]/mu[,3]) -0.02579 -0.001040  5.313e-13 0.003001 0.02556
log(mu[,2]/mu[,3]) -0.50131 -0.004555 -2.259e-05 0.001302 1.20400

Coefficients: 
              Estimate Std. Error z value Pr(>|z|)
(Intercept):1   85.750    203.410   0.422    0.673
(Intercept):2   81.084     73.110   1.109    0.267
SepalLength:1    3.144     53.916   0.058    0.954
SepalLength:2    2.629      4.198   0.626    0.531
SepalWidth:1    23.289     41.971   0.555    0.579
SepalWidth:2    19.741     21.679   0.911    0.363
PetalLength:1  -21.791     47.588  -0.458    0.647
PetalLength:2  -15.925     14.311  -1.113    0.266
PetalWidth:1   -46.236    107.562  -0.430    0.667
PetalWidth:2   -43.321     39.213  -1.105    0.269

Number of linear predictors:  2 

Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])

Residual deviance: 3.08 

In [6]:
#Predict datatrain using the model
x<-datatrain[,1:4]
y<-datatrain[,5]
probability<-predict(model1,x,type="response")
#Pick the most probable class per row and map its column index to the species label
datatrain$pred_lr<-factor(levels(y)[apply(probability,1,which.max)],levels=levels(y))

#Accuracy of the model
mtab<-table(datatrain$pred_lr,datatrain$Species)
confusionMatrix(mtab)

"fitted probabilities numerically 0 or 1 occurred"

Confusion Matrix and Statistics

                 
                  Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              30               0              0
  Iris-versicolor           0              29              0
  Iris-virginica            0               1             30

Overall Statistics
                                          
               Accuracy : 0.9889          
                 95% CI : (0.9396, 0.9997)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9833          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Iris-setosa Class: Iris-versicolor
Sensitivity                      1.0000                 0.9667
Specificity                      1.0000                 1.0000
Pos Pred Value                   1.0000                 1.0000
Neg Pred Value                   1.0000                 0.9836
Pre

In [7]:
#Predict validation using the model
x<-validation[,1:4]
y<-validation[,5]
probability<-predict(model1,x,type="response")
#Pick the most probable class per row and map its column index to the species label
validation$pred_lr<-factor(levels(y)[apply(probability,1,which.max)],levels=levels(y))

#Accuracy of the model
mtab<-table(validation$pred_lr,validation$Species)
confusionMatrix(mtab)

"fitted probabilities numerically 0 or 1 occurred"

Confusion Matrix and Statistics

                 
                  Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              20               0              0
  Iris-versicolor           0              20              3
  Iris-virginica            0               0             17

Overall Statistics
                                          
               Accuracy : 0.95            
                 95% CI : (0.8608, 0.9896)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.925           
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Iris-setosa Class: Iris-versicolor
Sensitivity                      1.0000                 1.0000
Specificity                      1.0000                 0.9250
Pos Pred Value                   1.0000                 0.8696
Neg Pred Value                   1.0000                 1.0000
Pre

### 3.2 Linear Discriminant Analysis (LDA)

In [8]:
library(MASS)
#Build the model
model2<-lda(Species~SepalLength+SepalWidth+PetalLength+PetalWidth,data=datatrain)
#Summarize the model
summary(model2)

        Length Class  Mode     
prior    3     -none- numeric  
counts   3     -none- numeric  
means   12     -none- numeric  
scaling  8     -none- numeric  
lev      3     -none- character
svd      2     -none- numeric  
N        1     -none- numeric  
call     3     -none- call     
terms    3     terms  call     
xlevels  0     -none- list     

In [9]:
#Predict datatrain using the model
x<-datatrain[,1:4]
y<-datatrain[,5]

#Predict using the model
datatrain$pred_lda<-predict(model2,x)$class

#Accuracy of the model
mtab<-table(datatrain$pred_lda,datatrain$Species)
confusionMatrix(mtab)

Confusion Matrix and Statistics

                 
                  Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              30               0              0
  Iris-versicolor           0              28              0
  Iris-virginica            0               2             30

Overall Statistics
                                         
               Accuracy : 0.9778         
                 95% CI : (0.922, 0.9973)
    No Information Rate : 0.3333         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9667         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: Iris-setosa Class: Iris-versicolor
Sensitivity                      1.0000                 0.9333
Specificity                      1.0000                 1.0000
Pos Pred Value                   1.0000                 1.0000
Neg Pred Value                   1.0000                 0.9677
Prevalence 

In [10]:
#Predict validation using the model
x<-validation[,1:4]
y<-validation[,5]

#Predict using the model
validation$pred_lda<-predict(model2,x)$class

#Accuracy of the model
mtab<-table(validation$pred_lda,validation$Species)
confusionMatrix(mtab)

Confusion Matrix and Statistics

                 
                  Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              20               0              0
  Iris-versicolor           0              20              1
  Iris-virginica            0               0             19

Overall Statistics
                                          
               Accuracy : 0.9833          
                 95% CI : (0.9106, 0.9996)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.975           
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Iris-setosa Class: Iris-versicolor
Sensitivity                      1.0000                 1.0000
Specificity                      1.0000                 0.9750
Pos Pred Value                   1.0000                 0.9524
Neg Pred Value                   1.0000                 1.0000
Pre

### 3.3 Support Vector Machine (SVM)

In [11]:
library(kernlab)
#Build the model
model3<-ksvm(Species~SepalLength+SepalWidth+PetalLength+PetalWidth,data=datatrain)
#Summarize the model
summary(model3)



Attaching package: 'kernlab'

The following object is masked from 'package:VGAM':

    nvar

The following object is masked from 'package:ggplot2':

    alpha



Length  Class   Mode 
     1   ksvm     S4 

In [12]:
#Predict datatrain using the model
x<-datatrain[,1:4]
y<-datatrain[,5]
#Predict using the model
datatrain$pred_svm<-predict(model3,x,type="response")

#Accuracy of the model
mtab<-table(datatrain$pred_svm,datatrain$Species)
confusionMatrix(mtab)

Confusion Matrix and Statistics

                 
                  Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              30               0              0
  Iris-versicolor           0              29              1
  Iris-virginica            0               1             29

Overall Statistics
                                         
               Accuracy : 0.9778         
                 95% CI : (0.922, 0.9973)
    No Information Rate : 0.3333         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9667         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: Iris-setosa Class: Iris-versicolor
Sensitivity                      1.0000                 0.9667
Specificity                      1.0000                 0.9833
Pos Pred Value                   1.0000                 0.9667
Neg Pred Value                   1.0000                 0.9833
Prevalence 

In [13]:
#Predict validation using the model
x<-validation[,1:4]
y<-validation[,5]
#Predict using the model
validation$pred_svm<-predict(model3,x,type="response")

#Accuracy of the model
mtab<-table(validation$pred_svm,validation$Species)
confusionMatrix(mtab)

Confusion Matrix and Statistics

                 
                  Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              20               0              0
  Iris-versicolor           0              20              3
  Iris-virginica            0               0             17

Overall Statistics
                                          
               Accuracy : 0.95            
                 95% CI : (0.8608, 0.9896)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.925           
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Iris-setosa Class: Iris-versicolor
Sensitivity                      1.0000                 1.0000
Specificity                      1.0000                 0.9250
Pos Pred Value                   1.0000                 0.8696
Neg Pred Value                   1.0000                 1.0000
Pre

### 3.4 k-Nearest Neighbors (KNN)

In [15]:
library(caret)
#Build the model
model4<-knn3(Species~SepalLength+SepalWidth+PetalLength+PetalWidth,data=datatrain,k=5)
#Summarize the model
summary(model4)


        Length Class  Mode   
learn   2      -none- list   
k       1      -none- numeric
terms   3      terms  call   
xlevels 0      -none- list   
theDots 0      -none- list   

In [16]:
#Predict datatrain using the model
x<-datatrain[,1:4]
y<-datatrain[,5]
#Predict using the model
datatrain$pred_knn<-predict(model4,x,type="class")

#Accuracy of the model
mtab<-table(datatrain$pred_knn,datatrain$Species)
confusionMatrix(mtab)

Confusion Matrix and Statistics

                 
                  Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              30               0              0
  Iris-versicolor           0              29              2
  Iris-virginica            0               1             28

Overall Statistics
                                          
               Accuracy : 0.9667          
                 95% CI : (0.9057, 0.9931)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.95            
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Iris-setosa Class: Iris-versicolor
Sensitivity                      1.0000                 0.9667
Specificity                      1.0000                 0.9667
Pos Pred Value                   1.0000                 0.9355
Neg Pred Value                   1.0000                 0.9831
Pre

In [17]:
#Predict validation using the model
x<-validation[,1:4]
y<-validation[,5]
#Predict using the model
validation$pred_knn<-predict(model4,x,type="class")

#Accuracy of the model
mtab<-table(validation$pred_knn,validation$Species)
confusionMatrix(mtab)


Confusion Matrix and Statistics

                 
                  Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              20               0              0
  Iris-versicolor           0              19              1
  Iris-virginica            0               1             19

Overall Statistics
                                          
               Accuracy : 0.9667          
                 95% CI : (0.8847, 0.9959)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.95            
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Iris-setosa Class: Iris-versicolor
Sensitivity                      1.0000                 0.9500
Specificity                      1.0000                 0.9750
Pos Pred Value                   1.0000                 0.9500
Neg Pred Value                   1.0000                 0.9750
Pre

### 3.5 Naive Bayes

In [18]:
library(e1071)
#Build the model
model5<-naiveBayes(Species~SepalLength+SepalWidth+PetalLength+PetalWidth,data=datatrain)
#Summarize the model
summary(model5)

        Length Class  Mode     
apriori 3      table  numeric  
tables  4      -none- list     
levels  3      -none- character
call    5      -none- call     

In [19]:
#Predict datatrain using the model
x<-datatrain[,1:4]
y<-datatrain[,5]
#Predict using the model
datatrain$pred_naive<-predict(model5,x)

#Accuracy of the model
mtab<-table(datatrain$pred_naive,datatrain$Species)
confusionMatrix(mtab)


Confusion Matrix and Statistics

                 
                  Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              30               0              0
  Iris-versicolor           0              28              2
  Iris-virginica            0               2             28

Overall Statistics
                                          
               Accuracy : 0.9556          
                 95% CI : (0.8901, 0.9878)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9333          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Iris-setosa Class: Iris-versicolor
Sensitivity                      1.0000                 0.9333
Specificity                      1.0000                 0.9667
Pos Pred Value                   1.0000                 0.9333
Neg Pred Value                   1.0000                 0.9667
Pre

In [20]:
#Predict validation using the model
x<-validation[,1:4]
y<-validation[,5]
#Predict using the model
validation$pred_naive<-predict(model5,x)

#Accuracy of the model
mtab<-table(validation$pred_naive,validation$Species)
confusionMatrix(mtab)


Confusion Matrix and Statistics

                 
                  Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              20               0              0
  Iris-versicolor           0              19              1
  Iris-virginica            0               1             19

Overall Statistics
                                          
               Accuracy : 0.9667          
                 95% CI : (0.8847, 0.9959)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.95            
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Iris-setosa Class: Iris-versicolor
Sensitivity                      1.0000                 0.9500
Specificity                      1.0000                 0.9750
Pos Pred Value                   1.0000                 0.9500
Neg Pred Value                   1.0000                 0.9750
Pre

### 3.6 Classification and Regression Trees (CART)

In [21]:
library(rpart)
#Build the model
model6<-rpart(Species~SepalLength+SepalWidth+PetalLength+PetalWidth,data=datatrain)
#Summarize the model
summary(model6)

Call:
rpart(formula = Species ~ SepalLength + SepalWidth + PetalLength + 
    PetalWidth, data = datatrain)
  n= 90 

         CP nsplit  rel error     xerror       xstd
1 0.5000000      0 1.00000000 1.20000000 0.06324555
2 0.4333333      1 0.50000000 0.76666667 0.07903742
3 0.0100000      2 0.06666667 0.08333333 0.03621779

Variable importance
PetalLength  PetalWidth SepalLength  SepalWidth 
         34          31          23          12 

Node number 1: 90 observations,    complexity param=0.5
  predicted class=Iris-setosa      expected loss=0.6666667  P(node) =1
    class counts:    30    30    30
   probabilities: 0.333 0.333 0.333 
  left son=2 (30 obs) right son=3 (60 obs)
  Primary splits:
      PetalLength < 2.6  to the left,  improve=30.000000, (0 missing)
      PetalWidth  < 0.7  to the left,  improve=30.000000, (0 missing)
      SepalLength < 5.45 to the left,  improve=20.384400, (0 missing)
      SepalWidth  < 3.15 to the right, improve= 9.579832, (0 missing)
  Surrogate s

In [22]:
#Predict datatrain using the model
x<-datatrain[,1:4]
y<-datatrain[,5]
#Predict using the model
datatrain$pred_cart<-predict(model6,x,type="class")

#Accuracy of the model
mtab<-table(datatrain$pred_cart,datatrain$Species)
confusionMatrix(mtab)


Confusion Matrix and Statistics

                 
                  Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              30               0              0
  Iris-versicolor           0              29              3
  Iris-virginica            0               1             27

Overall Statistics
                                          
               Accuracy : 0.9556          
                 95% CI : (0.8901, 0.9878)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9333          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Iris-setosa Class: Iris-versicolor
Sensitivity                      1.0000                 0.9667
Specificity                      1.0000                 0.9500
Pos Pred Value                   1.0000                 0.9062
Neg Pred Value                   1.0000                 0.9828
Pre

In [23]:
#Predict validation using the model
x<-validation[,1:4]
y<-validation[,5]
#Predict using the model
validation$pred_cart<-predict(model6,x,type="class")

#Accuracy of the model
mtab<-table(validation$pred_cart,validation$Species)
confusionMatrix(mtab)

Confusion Matrix and Statistics

                 
                  Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              20               0              0
  Iris-versicolor           0              19              3
  Iris-virginica            0               1             17

Overall Statistics
                                         
               Accuracy : 0.9333         
                 95% CI : (0.838, 0.9815)
    No Information Rate : 0.3333         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9            
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: Iris-setosa Class: Iris-versicolor
Sensitivity                      1.0000                 0.9500
Specificity                      1.0000                 0.9250
Pos Pred Value                   1.0000                 0.8636
Neg Pred Value                   1.0000                 0.9737
Prevalence 

### 3.7 Random Forest (RF)

In [24]:
library(randomForest)
#Build the model
model7<-randomForest(Species~SepalLength+SepalWidth+PetalLength+PetalWidth,data=datatrain)
#Summarize the model
summary(model7)

randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:ggplot2':

    margin



                Length Class  Mode     
call               3   -none- call     
type               1   -none- character
predicted         90   factor numeric  
err.rate        2000   -none- numeric  
confusion         12   -none- numeric  
votes            270   matrix numeric  
oob.times         90   -none- numeric  
classes            3   -none- character
importance         4   -none- numeric  
importanceSD       0   -none- NULL     
localImportance    0   -none- NULL     
proximity          0   -none- NULL     
ntree              1   -none- numeric  
mtry               1   -none- numeric  
forest            14   -none- list     
y                 90   factor numeric  
test               0   -none- NULL     
inbag              0   -none- NULL     
terms              3   terms  call     

In [25]:
#Predict datatrain using the model
x<-datatrain[,1:4]
y<-datatrain[,5]
#Predict using the model
datatrain$pred_randomforest<-predict(model7,x)

#Accuracy of the model
mtab<-table(datatrain$pred_randomforest,datatrain$Species)
confusionMatrix(mtab)

Confusion Matrix and Statistics

                 
                  Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              30               0              0
  Iris-versicolor           0              30              0
  Iris-virginica            0               0             30

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9598, 1)
    No Information Rate : 0.3333     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: Iris-setosa Class: Iris-versicolor
Sensitivity                      1.0000                 1.0000
Specificity                      1.0000                 1.0000
Pos Pred Value                   1.0000                 1.0000
Neg Pred Value                   1.0000                 1.0000
Prevalence                       0.3333    

In [26]:
#Predict validation using the model
x<-validation[,1:4]
y<-validation[,5]
#Predict using the model
validation$pred_randomforest<-predict(model7,x)

#Accuracy of the model
mtab<-table(validation$pred_randomforest,validation$Species)
confusionMatrix(mtab)

Confusion Matrix and Statistics

                 
                  Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              20               0              0
  Iris-versicolor           0              19              2
  Iris-virginica            0               1             18

Overall Statistics
                                          
               Accuracy : 0.95            
                 95% CI : (0.8608, 0.9896)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.925           
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Iris-setosa Class: Iris-versicolor
Sensitivity                      1.0000                 0.9500
Specificity                      1.0000                 0.9500
Pos Pred Value                   1.0000                 0.9048
Neg Pred Value                   1.0000                 0.9744
Pre

## IV Selecting best model
 
We now have 7 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

We can create a summary table with the accuracy and the kappa index on both the train and validation datasets, to get a general overview of the behavior of our models.
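Both measures can be computed by hand from a confusion matrix. The sketch below uses the LR confusion matrix on the validation dataset shown in section 3.1 and the standard definition of Cohen's kappa; the variable names are illustrative.

```r
# Confusion matrix of the LR model on the validation dataset
# (rows: predicted, columns: observed), copied from section 3.1
species <- c("Iris-setosa", "Iris-versicolor", "Iris-virginica")
mtab <- matrix(c(20,  0,  0,
                  0, 20,  3,
                  0,  0, 17),
               nrow = 3, byrow = TRUE,
               dimnames = list(predicted = species, observed = species))

n           <- sum(mtab)
accuracy    <- sum(diag(mtab)) / n                       # observed agreement
expected    <- sum(rowSums(mtab) * colSums(mtab)) / n^2  # agreement expected by chance
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = cohen_kappa), 4)
# accuracy 0.95 and kappa 0.925, the values printed by confusionMatrix in section 3.1
```

Kappa rescales the accuracy so that 0 corresponds to chance-level agreement (here 1/3, because the three species are equally frequent) and 1 to perfect agreement.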

<table style="width:60%" align="left">
  <thead>
  <tr>
    <th colspan="1">Model</th>
    <th colspan="2">Accuracy</th>
    <th colspan="2">Kappa</th>
  </tr>
  <tr>
    <th></th>
    <th>Train</th>
    <th>Validation</th>
    <th>Train</th>
    <th>Validation</th>
  </tr>
  </thead>
  <tbody>
  <tr>
    <th>LR</th>
    <td>0.9889</td>
    <td>0.9500</td>
    <td>0.9833</td>
    <td>0.9250</td>
  </tr>
  <tr>
    <th>LDA</th>
    <td>0.9778</td>
    <td>0.9833</td>
    <td>0.9667</td>
    <td>0.9750</td>
  </tr>
  <tr>
    <th>SVM</th>
    <td>0.9778</td>
    <td>0.9500</td>
    <td>0.9667</td>
    <td>0.9250</td>
  </tr>
  <tr>
    <th>KNN</th>
    <td>0.9667</td>
    <td>0.9667</td>
    <td>0.9500</td>
    <td>0.9500</td>
  </tr>
  <tr>
    <th>NB</th>
    <td>0.9556</td>
    <td>0.9667</td>
    <td>0.9333</td>
    <td>0.9500</td>
  </tr>
  <tr>
    <th>CART</th>
    <td>0.9556</td>
    <td>0.9333</td>
    <td>0.9333</td>
    <td>0.9000</td>
  </tr>
  <tr>
    <th>RF</th>
    <td>1.0000</td>
    <td>0.9500</td>
    <td>1.0000</td>
    <td>0.9250</td>
  </tr>
  </tbody>
</table>

As expected, all models reach high accuracy on both the train and validation datasets, but the model should be selected on the validation score, because testing a model on the data used to train it is not a sound evaluation procedure. The next table therefore summarizes the accuracy, its 95% confidence interval and the kappa index on the validation dataset only.

<table align="left">
  <thead>
  <tr>
    <th>Model</th>
    <th>Accuracy</th>
    <th>95% CI</th>
    <th>Kappa</th>
  </tr>
  </thead>
  <tbody>
  <tr>
    <td>LR</td>
    <td>0.9500</td>
    <td>(0.8608, 0.9896)</td>
    <td>0.925</td>
  </tr>
  <tr>
    <td>LDA</td>
    <td>0.9833</td>
    <td>(0.9106, 0.9996)</td>
    <td>0.975</td>
  </tr>
  <tr>
    <td>SVM</td>
    <td>0.9500</td>
    <td>(0.8608, 0.9896)</td>
    <td>0.925</td>
  </tr>
  <tr>
    <td>KNN</td>
    <td>0.9667</td>
    <td>(0.8847, 0.9959)</td>
    <td>0.950</td>
  </tr>
  <tr>
    <td>NB</td>
    <td>0.9667</td>
    <td>(0.8847, 0.9959)</td>
    <td>0.950</td>
  </tr>
  <tr>
    <td>CART</td>
    <td>0.9333</td>
    <td>(0.8380, 0.9815)</td>
    <td>0.900</td>
  </tr>
  <tr>
    <td>RF</td>
    <td>0.9500</td>
    <td>(0.8608, 0.9896)</td>
    <td>0.925</td>
  </tr>
  </tbody>
</table>

On this split, LDA reaches the highest validation accuracy (0.9833), although with only 60 validation flowers the confidence intervals of all seven models overlap.
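The 95% confidence intervals are the ones printed by `confusionMatrix`. As far as I know, caret obtains them from an exact binomial test, so under that assumption they can be reproduced with base R's `binom.test`. The sketch uses the LR validation result from section 3.1 (57 of 60 flowers classified correctly).

```r
# 57 of the 60 validation flowers were classified correctly by the LR model
correct <- 57
total   <- 60

# exact (Clopper-Pearson) 95% interval for the accuracy
ci <- binom.test(correct, total)$conf.int
round(c(accuracy = correct / total, lower = ci[1], upper = ci[2]), 4)
```

The result should agree with the interval printed for LR on the validation dataset in section 3.1. The width of the interval is mainly driven by the small validation sample.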

## V Conclusions

This project provides a general way to apply machine learning algorithms with R without using any specific library to control the central modeling procedure.

The idea here is to provide a basic understanding of how to get started with a machine learning problem by applying the most common algorithms to the well-known "iris" dataset.

Because training and testing a model on the same dataset is not an optimal evaluation procedure, the "train/test split" method was used to build the models and study their quality. The class variable (Species) was equally represented in both splits, with a fraction of 1/3 for each iris species.

The performance of each model was measured using the accuracy score. All models showed high accuracy on both the train and validation datasets.

Choosing the right machine learning algorithm is the ideal way to achieve higher accuracy, but that is easier said than done.

The models can be improved by tuning their parameters. Every machine learning model comes with a variety of parameters to tune, and these parameters can be vitally important to the performance of our classifier.
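As an illustration of parameter tuning, the sketch below tries several values of k for a k-nearest-neighbors classifier and compares the validation accuracy. It is not the `knn3` model from section 3.4: `knn_predict` is a hypothetical base-R implementation, and the split is a stand-in built on the built-in `iris` data.

```r
# Hypothetical base-R k-NN, kept short for illustration
knn_predict <- function(train_x, train_y, new_x, k) {
  train_x <- as.matrix(train_x)
  new_x   <- as.matrix(new_x)
  apply(new_x, 1, function(row) {
    d <- colSums((t(train_x) - row)^2)         # squared distances to all training rows
    nearest <- order(d)[seq_len(k)]            # indices of the k closest rows
    names(which.max(table(train_y[nearest])))  # majority vote (ties: first level)
  })
}

# Stand-in 60/40 stratified split on the built-in iris data (assumption)
set.seed(7)
train_idx  <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                            function(idx) sample(idx, 30)))
datatrain  <- iris[train_idx, ]
validation <- iris[-train_idx, ]

# Validation accuracy for each candidate value of k
k_values <- c(1, 3, 5, 7, 9, 11)
val_acc <- sapply(k_values, function(k) {
  pred <- knn_predict(datatrain[, 1:4], datatrain$Species, validation[, 1:4], k)
  mean(pred == validation$Species)
})
data.frame(k = k_values, validation_accuracy = round(val_acc, 4))
```

Note that if k is chosen on the validation dataset, the accuracy reported on that same dataset becomes optimistic. A third held-out split, or cross validation, avoids this.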

Finally, cross validation is an approach that estimates the performance of a machine learning algorithm with less variance than a single train/test split. "K-fold Cross Validation", "Leave One Out Cross Validation" and "Repeated Random Test-Train Splits" are the best-known methods. The door is open for anybody who wants to extend this project by using one of those cross validation methods to measure the accuracy of the models involved. I plan to develop such a project myself after finishing some other machine learning work that I am already doing.
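A k-fold cross validation can be sketched in a few lines. The example below is a minimal version under stated assumptions: it uses the built-in `iris` data, LDA from the MASS package as the model, and a hypothetical helper `k_fold_accuracy` that keeps the folds balanced by species.

```r
library(MASS)  # lda

# Hypothetical helper: stratified k-fold cross validation of an LDA model
k_fold_accuracy <- function(data, k = 10, seed = 7) {
  set.seed(seed)
  # assign fold labels within each species so the folds stay balanced
  fold <- integer(nrow(data))
  for (idx in split(seq_len(nrow(data)), data$Species)) {
    fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  # train on k - 1 folds, test on the remaining one
  sapply(seq_len(k), function(i) {
    model <- lda(Species ~ ., data = data[fold != i, ])
    pred  <- predict(model, data[fold == i, ])$class
    mean(pred == data$Species[fold == i])
  })
}

acc <- k_fold_accuracy(iris, k = 10)
mean(acc)  # cross-validated accuracy estimate
sd(acc)    # variability across folds
```

Every observation is used for testing exactly once, so the averaged estimate depends less on one particular split than the single train/test split used in this project.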

