In [1]:
df = read.csv('http://cssbook.net/d/mediause.csv')
model = lm(formula = 'newspaper ~ age + gender', data = df)
# summary(model) would give a lot more info, but we only care about the coefficients:
model


Call:
lm(formula = "newspaper ~ age + gender", data = df)

Coefficients:
(Intercept)          age       gender  
   -0.08956      0.06762      0.17666  


In [2]:
gender = c(1,0)
age = c(20,40)
newdata = data.frame(age, gender)
predict(model, newdata)
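
The prediction in the cell above is the fitted linear equation evaluated at the new values. A minimal sketch that reproduces it by hand, assuming the rounded coefficients printed by the first cell (so the result should agree with `predict()` only to about three decimals). The names `coefs`, `newdata_manual` and `manual_pred` are illustrative and not part of the original notebook:

```r
# Coefficients copied (rounded) from the printed model above
coefs = c(intercept = -0.08956, age = 0.06762, gender = 0.17666)

# The same two hypothetical cases as in the cell above
newdata_manual = data.frame(age = c(20, 40), gender = c(1, 0))

# y_hat = intercept + b_age * age + b_gender * gender
manual_pred = coefs["intercept"] +
  coefs["age"] * newdata_manual$age +
  coefs["gender"] * newdata_manual$gender

print(round(unname(manual_pred), 2))
# [1] 1.44 2.62
```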

In [3]:
library(tidyverse)
library(rsample)
library(glue)

df = read.csv('http://cssbook.net/d/mediause.csv')
df = na.omit(df %>% mutate(usesinternet=recode(internet, .default='user', `0`='non-user')))

set.seed(42)
df$usesinternet = as.factor(df$usesinternet)
print("How many people used online news at all?")
print(table(df$usesinternet))


split = initial_split(df, prop = .8)
traindata = training(split)
testdata  = testing(split)

X_train = select(traindata, c('age', 'gender', 'education'))
y_train = traindata$usesinternet
X_test = select(testdata, c('age', 'gender', 'education'))
y_test = testdata$usesinternet

print(glue("We have {nrow(X_train)} training and {nrow(X_test)} test cases."))

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.2.1     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ dplyr   0.8.4
✔ tidyr   1.0.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Attaching package: ‘glue’


The following object is masked from ‘package:dplyr’:

    collapse




[1] "How many people used online news at all?"

non-user     user 
     803     1262 
We have 1653 training and 412 test cases.


In [4]:
library(caret)
library(naivebayes)

myclassifier = train(x = X_train, y = y_train, method = "naive_bayes")
y_pred = predict(myclassifier, newdata = X_test)

Loading required package: lattice


Attaching package: ‘caret’


The following object is masked from ‘package:purrr’:

    lift


naivebayes 0.9.6 loaded



In [5]:
install.packages('randomForest')

Installing package into ‘/home/damian/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)



In [6]:
print(confusionMatrix(y_pred, y_test))

print("Confusion matrix:")
# Rows are the true classes, columns are the predicted classes
confmat = table(testdata$usesinternet, y_pred)
print(confmat)

print('Precision for predicting non-users and users, respectively:')
# Precision: correct predictions divided by all predictions of that class (column totals)
precision = diag(confmat) / colSums(confmat)
print(precision)


print('Recall for predicting non-users and users, respectively:')
# Recall: correct predictions divided by all true members of that class (row totals)
recall = diag(confmat) / rowSums(confmat)
print(recall)

Confusion Matrix and Statistics

          Reference
Prediction non-user user
  non-user       57   26
  user          113  216
                                          
               Accuracy : 0.6626          
                 95% CI : (0.6147, 0.7082)
    No Information Rate : 0.5874          
    P-Value [Acc > NIR] : 0.001026        
                                          
                  Kappa : 0.2466          
                                          
 Mcnemar's Test P-Value : 2.999e-13       
                                          
            Sensitivity : 0.3353          
            Specificity : 0.8926          
         Pos Pred Value : 0.6867          
         Neg Pred Value : 0.6565          
             Prevalence : 0.4126          
         Detection Rate : 0.1383          
   Detection Prevalence : 0.2015          
      Balanced Accuracy : 0.6139          
                                          
       'Positive' Class : non-user        
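
The statistics reported by `confusionMatrix()` can be recomputed from the four printed counts, which shows how they relate to precision and recall. A small sketch, assuming the counts are copied correctly from the output above. `confmat_printed` and the other names are illustrative. Note that here the rows are predictions and the columns are the true classes, the opposite orientation of `table(testdata$usesinternet, y_pred)`:

```r
# Counts copied from the confusionMatrix() output above:
# rows are predictions, columns are the true (reference) classes
confmat_printed = matrix(c(57, 113, 26, 216), nrow = 2,
                         dimnames = list(Prediction = c("non-user", "user"),
                                         Reference = c("non-user", "user")))

# Precision: of the cases predicted as a class, the share that truly belongs to it
precision_by_class = diag(confmat_printed) / rowSums(confmat_printed)
# Recall: of the cases that truly belong to a class, the share that was found
recall_by_class = diag(confmat_printed) / colSums(confmat_printed)
# Accuracy: share of all cases on the diagonal
accuracy = sum(diag(confmat_printed)) / sum(confmat_printed)

print(round(precision_by_class, 4))  # non-user 0.6867, user 0.6565
print(round(recall_by_class, 4))     # non-user 0.3353, user 0.8926
print(round(accuracy, 4))            # 0.6626
```

With "non-user" as the positive class, precision for non-users matches the reported Pos Pred Value, recall for non-users matches Sensitivity, and recall for users matches Specificity.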
            

In [7]:
library(tidyverse)
library(caret)

myclassifier = train(x = X_train, y = y_train, method = 'glm', family = "binomial")
y_pred = predict(myclassifier, newdata = X_test)


In [8]:
library(tidyverse)
library(caret)
library(LiblineaR)

# We standardize our features to have M = 0 and SD = 1 using the preProcess argument.
# This is necessary because our features are measured on different scales, and SVMs
# are sensitive to the scale of their inputs.
# Rescaling to a range of [0, 1] or [-1, 1] may also be acceptable.

myclassifier = train(x = X_train, y = y_train,  preProcess = c("center", "scale"), method = "svmLinear3")
y_pred = predict(myclassifier, newdata = X_test)
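
To make the effect of `preProcess = c("center", "scale")` concrete, here is a small sketch on made-up toy data using base R's `scale()`, which performs the same centering and scaling per column. The data frame `toy` and its values are invented for illustration and are not taken from the media use dataset:

```r
# Toy data on very different scales (values are made up for illustration)
toy = data.frame(age = c(20, 40, 60), education = c(1, 2, 6))

# scale() subtracts each column's mean and divides by its standard deviation
toy_scaled = scale(toy)

# After standardizing, every column has mean 0 and standard deviation 1
print(round(colMeans(toy_scaled), 10))
print(apply(toy_scaled, 2, sd))
```

One difference worth keeping in mind: when `preProcess` is passed to `train()`, the means and standard deviations are estimated on the training data and then reused when predicting on new data.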

In [9]:
library(tidyverse)
library(caret)
library(randomForest)

myclassifier = train(x = X_train, y = y_train, method = "rf")
y_pred = predict(myclassifier, newdata = X_test)


randomForest 4.6-14

Type rfNews() to see new features/changes/bug fixes.


Attaching package: ‘randomForest’


The following object is masked from ‘package:dplyr’:

    combine


The following object is masked from ‘package:ggplot2’:

    margin




note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .



In [10]:
library(tidyverse)
library(caret)

myclassifier = train(x = X_train, y = y_train, method = 'glm', family = 'binomial', metric = 'Accuracy',
                     trControl = trainControl(method = "cv", number = 5, returnResamp = 'all', savePredictions = TRUE))
print(myclassifier$resample)
print(myclassifier$results)

   Accuracy     Kappa parameter Resample
1 0.6646526 0.2376911      none    Fold1
2 0.6042296 0.0767380      none    Fold2
3 0.6283988 0.1552794      none    Fold3
4 0.6636364 0.2225901      none    Fold4
5 0.6484848 0.1889143      none    Fold5
  parameter  Accuracy     Kappa AccuracySD    KappaSD
1      none 0.6418804 0.1762426 0.02566539 0.06408044
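
The summary row is simply the mean and standard deviation of the per-fold values. A short check, assuming the five fold accuracies are copied from the output above (`fold_accuracy` is an illustrative name):

```r
# Per-fold accuracies copied from myclassifier$resample above
fold_accuracy = c(0.6646526, 0.6042296, 0.6283988, 0.6636364, 0.6484848)

# Should reproduce the Accuracy and AccuracySD columns of myclassifier$results
# (approximately 0.6419 and 0.0257)
print(mean(fold_accuracy))
print(sd(fold_accuracy))
```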


In [13]:
# caret's "rf" method exposes only mtry as a tuning parameter, so the grid may
# contain only an mtry column. With three predictors, mtry can range from 1 to 3.
# The number of trees cannot be tuned via tuneGrid; it can instead be passed
# on to randomForest() as the ntree argument of train().
tunegrid <- expand.grid(mtry = 1:3)

train_control = trainControl(method = "cv", number = 5)

rf_gridsearch = train(x = X_train, y = y_train, method = "rf",
                      metric = 'Accuracy', trControl = train_control,
                      tuneGrid = tunegrid)
print(rf_gridsearch)



In [14]:
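# Inspect a grid with an extra .number column: caret's "rf" method accepts only
# an mtry column, so train() rejects a grid like this one
tunegrid <- expand.grid(.mtry=c(2), .number = c(10, 50, 100))
tunegrid

  .mtry .number
1     2      10
2     2      50
3     2     100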


In [15]:
# Create the grid of parameters
grid <- expand.grid(Loss=c('L1','L2'),
                   cost=c(100,1000))

# Train the model using our previously defined parameters
gridsearch = train(x = X_train, y = y_train,  preProcess = c("center", "scale"), 
                 method = "svmLinear3",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid = grid)
gridsearch

L2 Regularized Support Vector Machine (dual) with Linear Kernel 

1653 samples
   3 predictor
   2 classes: 'non-user', 'user' 

Pre-processing: centered (3), scaled (3) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1322, 1323, 1323, 1322, 1322 
Resampling results across tuning parameters:

  Loss  cost  Accuracy   Kappa     
  L1     100  0.6249162  0.06702208
  L1    1000  0.5952888  0.09515472
  L2     100  0.6443028  0.18235676
  L2    1000  0.6443028  0.18235676

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were cost = 100 and Loss = L2.
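
The selection rule stated in the output ("largest value" of Accuracy) can be illustrated with the printed results. A minimal sketch, assuming the accuracies are copied from the table above. `results_printed` and `best` are illustrative names, and `which.max()` returns the first row among ties, which happens to agree with the model reported here but is not necessarily how caret breaks ties in general:

```r
# Cross-validated accuracies copied from the grid search output above
results_printed = data.frame(
  Loss = c("L1", "L1", "L2", "L2"),
  cost = c(100, 1000, 100, 1000),
  Accuracy = c(0.6249162, 0.5952888, 0.6443028, 0.6443028)
)

# Pick the row with the highest accuracy (first one in case of a tie)
best = results_printed[which.max(results_printed$Accuracy), ]
print(best)
```

On a fitted caret object, the selected combination is stored in the `bestTune` element, for example `gridsearch$bestTune`.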