## *Table of Contents*

    * Cars Logistic Regression
    * Demo

 
## *Cars Logistic Regression*

Your goal is to use logistic regression to predict whether a used car at an auto auction is a good or bad buy, using the training and test data sets you just prepared. Unlike the data cleaning step, you have much more freedom in how you tackle the problem here.

However, once you have your final model, use confusionMatrix with a 0.5 cutoff, or a cutoff of your choosing, to check the accuracy of your model.

 

Don’t feel compelled to include every feature in your model. Looking at your training set, use the features that seem to best predict whether the car is a bad buy. You may want to draw some plots to see how closely a given feature correlates with a bad buy.

Include the confusionMatrix output in a short one- to two-page outline of your modeling process and results. Don’t worry about explaining the statistical principles or methods. Focus on the results and the motivation behind your choices.

 

Example topics you could include are:

    The choices you made in constructing and processing your data set.
    Did you do anything different or extra in the data cleaning step?
    Your final model.
    If you use a different cutoff, why did you choose that value?
    The odds ratios and their interpretation.

 

Feel free to include anything you feel is necessary and relevant. However, do try to keep the text down to a page or two. It’s OK if plots and figures push the outline beyond that; just make sure the actual text is limited.

In [3]:
##> label=mycars
# Load car data into mycars and take a quick look at it
# We see this data set has 72983 obs. of  34 variables with int, chr and num data types
# 
mycars <- read.csv(rcloud.home("training.csv", user="jl2408"),
                  header = TRUE, sep=",", stringsAsFactors = FALSE)
str(mycars)
head(mycars)
# 

In [4]:
# Check what percentage of observations is missing in each variable, so that variables with too many gaps can be removed
# We also use the Amelia missmap to check this visually.
# No values show up as NA. Note that is.null() tests whether an object is NULL, not whether
# a cell holds the string "NULL", so we count the literal "NULL" entries instead.
miss_obs_per_var <- apply(mycars, 2, function(x) sum(is.na(x))/length(x)*100)
miss_obs_per_var
# 
null_obs_per_var <- apply(mycars, 2, function(x) sum(x == "NULL", na.rm = TRUE)/length(x)*100)
null_obs_per_var
# 
library(Amelia)
missmap(mycars, main = "Missing values vs observed")

In [5]:
# From str output, looks like AUCGUART and PRIMEUNIT have a lot of "NULL" - verify and remove them
# Also, we do not need PurchDate since we have car year and age information - remove it
table(mycars$AUCGUART)
table(mycars$PRIMEUNIT)
# 
mycars$PurchDate <- NULL  
mycars$AUCGUART <- NULL
mycars$PRIMEUNIT <- NULL 

In [6]:
# Check IsBadBuy
# About 87.7% of the cars are good buys
table(mycars$IsBadBuy)
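
The class balance also sets the bar for any classifier: a model that always predicts "good buy" already scores the majority-class share. A minimal sketch on a small made-up vector (the `is_bad_buy` values are illustrative and are not taken from training.csv):

```r
# Hypothetical labels: 1 = bad buy, 0 = good buy (illustrative values only)
is_bad_buy <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0)

# Share of each class
class_share <- prop.table(table(is_bad_buy))
print(class_share)

# Accuracy of a model that always predicts the majority class
baseline_accuracy <- max(class_share)
print(baseline_accuracy)
```

Any accuracy reported later is worth reading against this baseline rather than against zero.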

In [7]:
library(ggplot2)
# Check Auction, VehYear, VehicleAge
# Plot histograms of vehicle year and age
# The distributions look close to normal
table(mycars$Auction)
table(mycars$VehYear)
ggplot(mycars, aes(x = VehYear)) + geom_histogram(breaks=seq(2001,2010,by=1), col="red",fill="green",alpha=.2) + ggtitle("Vehicle Year Histogram")
table(mycars$VehicleAge)
ggplot(mycars, aes(x = VehicleAge)) + geom_histogram(breaks=seq(0,9,by=1), col="red",fill="green",alpha=.2) + ggtitle("Vehicle Age Histogram")
mycars$Auction <- as.factor(mycars$Auction)
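
One way to see how closely a feature tracks bad buys is to compute the bad-buy rate within each level of that feature. The sketch below uses a hypothetical toy data frame (`toy` and its values are made up for illustration); the same call should work on `mycars` once the column types are settled.

```r
# Hypothetical toy data: vehicle age in years and a 0/1 bad-buy flag
toy <- data.frame(
  VehicleAge = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  IsBadBuy   = c(0, 0, 0, 0, 0, 1, 0, 1, 1, 1)
)

# The mean of a 0/1 flag within a group is the bad-buy rate for that group
bad_rate_by_age <- tapply(toy$IsBadBuy, toy$VehicleAge, mean)
print(bad_rate_by_age)
```

Passing `bad_rate_by_age` to `barplot()` gives a quick visual; a rate that rises or falls steadily across levels suggests the feature is worth trying in the model.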

In [8]:
# Check Make
# There is 1 TOYOTA SCION - move it to SCION
# There is 1 HUMMER - Move it to CADILLAC since they are both luxury with GM
# There are 2 PLYMOUTH - Move it to CHRYSLER
# 
table(mycars$Make)
mycars$Make[grep("TOYOTA SCION", mycars$Make , ignore.case=TRUE, fixed=FALSE)] <-"SCION"
mycars$Make[grep("HUMMER", mycars$Make , ignore.case=TRUE, fixed=FALSE)] <-"CADILLAC"
mycars$Make[grep("PLYMOUTH", mycars$Make , ignore.case=TRUE, fixed=FALSE)] <-"CHRYSLER"
table(mycars$Make)
# 
mycars$Make <- as.factor(mycars$Make)

In [9]:
# Check Model
# Looks like we can extract model, powertrain, cylinder information
table(mycars$Model)
#aa <- unique(sort(mycars$Model))
#head(aa, 100)
# 

In [10]:
# About 25% of models are tagged 2WD and 4.7% FWD; ~30% carry one of the 4 tags. Assume all others are 2WD.
# FWD (front-wheel drive) is a two-wheel-drive layout, so it is grouped with 2WD
# Create a new variable called Drive
cnt <- length(mycars$Model)
length(grep("2WD", mycars$Model , ignore.case=TRUE, fixed=FALSE))/cnt
length(grep("4WD", mycars$Model , ignore.case=TRUE, fixed=FALSE))/cnt
length(grep("FWD", mycars$Model , ignore.case=TRUE, fixed=FALSE))/cnt
length(grep("AWD", mycars$Model , ignore.case=TRUE, fixed=FALSE))/cnt
# 
mycars$Drive <- "2WD"
mycars$Drive[grep("4WD", mycars$Model , ignore.case=TRUE, fixed=FALSE)] <- "4WD"
mycars$Drive[grep("FWD", mycars$Model , ignore.case=TRUE, fixed=FALSE)] <- "2WD"
mycars$Drive[grep("AWD", mycars$Model , ignore.case=TRUE, fixed=FALSE)] <- "AWD"
mycars$Drive <- as.factor(mycars$Drive)
table(mycars$Drive)
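
The drive tag can also be pulled out with a single regular expression instead of one grep per tag. This is a minimal sketch on made-up model strings (`model_names` is hypothetical, and the tag list is assumed to be exactly the four checked above); how the extracted tags are then grouped into levels is a separate choice.

```r
# Hypothetical model strings in the style of the Model column
model_names <- c("1500 RAM PICKUP 2WD", "CARAVAN GRAND FWD V6", "TAURUS",
                 "EXPLORER 4WD V6", "EQUINOX AWD V6")

# Match any one of the four tags as a whole word
pattern <- "\\b(2WD|4WD|AWD|FWD)\\b"
hit <- regexpr(pattern, model_names, ignore.case = TRUE, perl = TRUE)

# regmatches() returns only the matched strings, so fill them in by position
drive_tag <- rep(NA_character_, length(model_names))
drive_tag[hit > 0] <- toupper(regmatches(model_names, hit))
print(drive_tag)
```

Rows without a tag stay `NA`, which keeps "no tag in the string" distinct from an assumed default until you decide how to treat it.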

In [11]:
# Extract cylinder
# Looks like we have V6, V8 and some V12 + 4C and 6C
# Create a new variable Cylinder
length(grep("V4", mycars$Model , ignore.case=TRUE, fixed=FALSE))/cnt
length(grep("V6", mycars$Model , ignore.case=TRUE, fixed=FALSE))/cnt
length(grep("V8", mycars$Model , ignore.case=TRUE, fixed=FALSE))/cnt
length(grep("V10", mycars$Model , ignore.case=TRUE, fixed=FALSE))/cnt
length(grep("V12", mycars$Model , ignore.case=TRUE, fixed=FALSE))/cnt
length(grep("4C", mycars$Model , ignore.case=TRUE, fixed=FALSE))/cnt
length(grep("6C", mycars$Model , ignore.case=TRUE, fixed=FALSE))/cnt
length(grep("8C", mycars$Model , ignore.case=TRUE, fixed=FALSE))/cnt
length(grep("10C", mycars$Model , ignore.case=TRUE, fixed=FALSE))/cnt
length(grep("12C", mycars$Model , ignore.case=TRUE, fixed=FALSE))/cnt
# 
mycars$Model[grep("V12", mycars$Model , ignore.case=TRUE, fixed=FALSE)]
# 
mycars$Cylinder <- "V4"
mycars$Cylinder[grep("V6", mycars$Model , ignore.case=TRUE, fixed=FALSE)] <- "V6"
mycars$Cylinder[grep("V8", mycars$Model , ignore.case=TRUE, fixed=FALSE)] <- "V8"
mycars$Cylinder[grep("V12", mycars$Model , ignore.case=TRUE, fixed=FALSE)] <- "V12"
mycars$Cylinder[grep("6C", mycars$Model , ignore.case=TRUE, fixed=FALSE)] <- "V6"
table(mycars$Cylinder)
mycars$Cylinder <- as.factor(mycars$Cylinder)

In [12]:
# Extract model
# Try to reduce the levels: after combining some models, split the model string and extract the first word as the model
mycars$Model[grep("1500 Ram", mycars$Model , ignore.case=TRUE, fixed=FALSE)] <-"Ram"
mycars$Model[c(grep("1500 Silverado", mycars$Model , ignore.case=TRUE, fixed=FALSE), grep("1500HD Silverado", mycars$Model , ignore.case=TRUE, fixed=FALSE))] <-"Silverado"
mycars$Model[grep("1500 Sierra", mycars$Model , ignore.case=TRUE, fixed=FALSE)] <-"Sierra"
mycars$Model[grep("4 Runner", mycars$Model , ignore.case=TRUE, fixed=FALSE)] <-"4_Runner"
mycars$Model[grep("L Series", mycars$Model , ignore.case=TRUE, fixed=FALSE)] <-"L_Series"
mycars$Model <-sapply(strsplit(as.character(mycars$Model), ' '), "[[", 1)
mycars$Model <- as.factor(mycars$Model)

In [13]:
# Check Trim
# Not much I can do here
table(mycars$Trim)
mycars$Trim <- as.factor(mycars$Trim)

In [14]:
# Check SubModel
table(mycars$SubModel)
aa <- unique(sort(mycars$SubModel))
head(aa, 100)

In [15]:
# Extract door from SubModel
# Create a new variable called Doors
mycars$Doors <- "4D"
mycars$Doors[grep("2D", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "2D"
mycars$Doors[grep("3D", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "3D"
mycars$Doors[grep("5D", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "5D"
table(mycars$Doors)
mycars$Doors <- as.factor(mycars$Doors)

In [16]:
# Extract SubModel
# Create a new variable called Type
mycars$Type <- "SEDAN"
mycars$Type[grep("PASSENGER", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "PASSENGER"
mycars$Type[grep("CAB", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "CAB"
mycars$Type[grep("CUV", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "CUV"
mycars$Type[grep("MINIVAN", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "MINIVAN"
mycars$Type[grep("UTILITY", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "UTILITY"
mycars$Type[grep("SPORT", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "SPORT"
mycars$Type[grep("SUV", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "SUV"
mycars$Type[grep("WAGON", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "WAGON"
mycars$Type[grep("CONVERTIBLE", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "CONVERTIBLE"
mycars$Type[grep("HATCHBACK", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "HATCHBACK"
mycars$Type[grep("SPYDER", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "SPYDER"
mycars$Type[grep("LIFTBACK", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "HATCHBACK"
mycars$Type[grep("CROSSOVER", mycars$SubModel , ignore.case=TRUE, fixed=FALSE)] <- "SUV"
# 
table(mycars$Type)
mycars$Type <- as.factor(mycars$Type)

In [17]:
# Check Color
# Move "NULL" and "NOT AVAIL" to "OTHER"
table(mycars$Color)
mycars$Color[mycars$Color == "NULL" | mycars$Color == "NOT AVAIL"] <- "OTHER"
table(mycars$Color)
mycars$Color <- as.factor(mycars$Color)

In [18]:
# Check Transmission
# Move "Manual" to "MANUAL"
# Make "NULL" and "" NA so that they can be imputed later
table(mycars$Transmission)
mycars$Transmission[grep("Manual", mycars$Transmission , ignore.case=FALSE, fixed=FALSE)] <- "MANUAL"
mycars$Transmission[mycars$Transmission == "NULL"] <- NA
mycars$Transmission[mycars$Transmission == ""] <- NA
mycars$Transmission <- factor(mycars$Transmission, levels=c("AUTO","MANUAL"))
table(mycars$Transmission)

In [19]:
# Check WheelTypeID and WheelType
# Looks like 1 = Alloy, 2 = Covers, 3 = Special and 0 or NULL = NULL
# We can just use WheelTypeID and move NULL to 0
# Remove WheelType
table(mycars$WheelTypeID)
table(mycars$WheelType)
mycars$WheelTypeID[grep("NULL", mycars$WheelTypeID , ignore.case=TRUE, fixed=FALSE)] <- 0
mycars$WheelTypeID <- factor(mycars$WheelTypeID, levels=c(0,1,2,3))
mycars$WheelType <- NULL 
table(mycars$WheelTypeID)
# str(mycars)

In [20]:
# Check VehOdo
ggplot(mycars, aes(x=VehOdo)) + geom_line(stat = "density")

In [21]:
# Check Nationality
# Move "NULL" to "OTHER"
table(mycars$Nationality)
mycars$Nationality[grep("NULL", mycars$Nationality , ignore.case=TRUE, fixed=FALSE)] <- "OTHER"
mycars$Nationality <- factor(mycars$Nationality, levels=c("AMERICAN","OTHER", "OTHER ASIAN", "TOP LINE ASIAN"))
table(mycars$Nationality)

In [22]:
# Check Size
# Move "NULL" to NA so it can be imputed later
table(mycars$Size)
mycars$Size[grep("NULL", mycars$Size , ignore.case=TRUE, fixed=FALSE)] <- NA
table(mycars$Size)
mycars$Size <- as.factor(mycars$Size)

In [23]:
# Check TopThreeAmericanName
# Move "NULL" to "OTHER"
table(mycars$TopThreeAmericanName)
mycars$TopThreeAmericanName[grep("NULL", mycars$TopThreeAmericanName , ignore.case=TRUE, fixed=FALSE)] <- "OTHER"
mycars$TopThreeAmericanName <- factor(mycars$TopThreeAmericanName, levels=c("CHRYSLER","FORD", "GM", "OTHER"))
table(mycars$TopThreeAmericanName)

In [24]:
# Check BYRNO
ggplot(mycars, aes(x=BYRNO)) + geom_line(stat = "density") 

In [25]:
# Check VehBCost
ggplot(mycars, aes(x=VehBCost)) + geom_line(stat = "density") 

In [26]:
# Check warrantyCost
ggplot(mycars, aes(x=WarrantyCost)) + geom_line(stat = "density") 

In [27]:
# Check NULL and "" for Prices
table(mycars$MMRAcquisitionAuctionAveragePrice[mycars$MMRAcquisitionAuctionAveragePrice %in% c("NULL","","0")])
table(mycars$MMRAcquisitionAuctionCleanPrice[mycars$MMRAcquisitionAuctionCleanPrice %in% c("NULL","","0")])
table(mycars$MMRAcquisitionRetailAveragePrice[mycars$MMRAcquisitionRetailAveragePrice %in% c("NULL","","0")])
table(mycars$MMRAcquisitionRetailCleanPrice[mycars$MMRAcquisitionRetailCleanPrice %in% c("NULL","","0")])
table(mycars$MMRCurrentAuctionAveragePrice[mycars$MMRCurrentAuctionAveragePrice %in% c("NULL","","0")])
table(mycars$MMRCurrentAuctionCleanPrice[mycars$MMRCurrentAuctionCleanPrice %in% c("NULL","","0")])
table(mycars$MMRCurrentRetailAveragePrice[mycars$MMRCurrentRetailAveragePrice %in% c("NULL","","0")])
table(mycars$MMRCurrentRetailCleanPrice[mycars$MMRCurrentRetailCleanPrice %in% c("NULL","","0")])

In [28]:
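# Convert 0 and "NULL" to NA so they can be imputed later
# Select the eight MMR price columns by name instead of by position (17:24),
# so the step still works if columns are added or removed earlier
price_cols <- grep("^MMR", names(mycars), value = TRUE)
mycars[price_cols] <- lapply(mycars[price_cols], function(x) ifelse(x %in% c("NULL", "0"), NA, x))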
table(mycars$MMRAcquisitionAuctionAveragePrice[mycars$MMRAcquisitionAuctionAveragePrice %in% c("NULL","","0")])
table(mycars$MMRAcquisitionAuctionCleanPrice[mycars$MMRAcquisitionAuctionCleanPrice %in% c("NULL","","0")])
table(mycars$MMRAcquisitionRetailAveragePrice[mycars$MMRAcquisitionRetailAveragePrice %in% c("NULL","","0")])
table(mycars$MMRAcquisitionRetailCleanPrice[mycars$MMRAcquisitionRetailCleanPrice %in% c("NULL","","0")])
table(mycars$MMRCurrentAuctionAveragePrice[mycars$MMRCurrentAuctionAveragePrice %in% c("NULL","","0")])
table(mycars$MMRCurrentAuctionCleanPrice[mycars$MMRCurrentAuctionCleanPrice %in% c("NULL","","0")])
table(mycars$MMRCurrentRetailAveragePrice[mycars$MMRCurrentRetailAveragePrice %in% c("NULL","","0")])
table(mycars$MMRCurrentRetailCleanPrice[mycars$MMRCurrentRetailCleanPrice %in% c("NULL","","0")])

In [29]:
# Convert Prices to integer
# 
mycars$MMRAcquisitionAuctionAveragePrice <- as.integer(mycars$MMRAcquisitionAuctionAveragePrice)
mycars$MMRAcquisitionAuctionCleanPrice <- as.integer(mycars$MMRAcquisitionAuctionCleanPrice)
mycars$MMRAcquisitionRetailAveragePrice <- as.integer(mycars$MMRAcquisitionRetailAveragePrice)
mycars$MMRAcquisitionRetailCleanPrice <- as.integer(mycars$MMRAcquisitionRetailCleanPrice)
mycars$MMRCurrentAuctionAveragePrice <- as.integer(mycars$MMRCurrentAuctionAveragePrice)
mycars$MMRCurrentAuctionCleanPrice <- as.integer(mycars$MMRCurrentAuctionCleanPrice)
mycars$MMRCurrentRetailAveragePrice <- as.integer(mycars$MMRCurrentRetailAveragePrice) 
mycars$MMRCurrentRetailCleanPrice <- as.integer(mycars$MMRCurrentRetailCleanPrice)
# 
# str(mycars)

In [30]:
# Create a function that randomly imputes missing values by sampling (with replacement)
# from the observed values of each variable, which follows that variable's distribution.
random.imp <- function(data, imp.method = c("bootstrap", "pca"), ...) {
  imp.method <- match.arg(imp.method)
  if (imp.method == "bootstrap") {
    if (is.vector(data)) {
      mis <- is.na(data)
      imputed <- data
      imputed[mis] <- sample(data[!mis], sum(mis), replace = TRUE)
    } else if (is.matrix(data) || is.data.frame(data)) {
      imputed <- data
      for (j in 1:ncol(data)) {
        mis <- is.na(data[, j])
        if (sum(mis) == length(data[, j])) {
          warning(paste("variable", names(data)[j], "has no observation"))
        } else {
          imputed[mis, j] <- sample(data[!mis, j], sum(mis), replace = TRUE)
        }
      }
    } else {
      stop("Unexpected data type: data must be vector, matrix, or data frame.")
    }
  } else if (imp.method == "pca") {
    stop("pca imputation is not implemented in current version.")
  }
  # else {
  #   imputed <- pca(data, nPcs = 3, method = "bpca")@completeObs
  # }
  return(as.data.frame(imputed))
}

In [31]:
# Check VNST
table(mycars$VNST)
mycars$VNST <- as.factor(mycars$VNST)

In [32]:
# Check result of data cleaning
summary(mycars)
str(mycars)
library(Amelia)
missmap(mycars, main = "Missing values vs observed")

In [33]:
# Impute the NA
mycars_imp <-random.imp(mycars)
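
After a bootstrap imputation it is worth confirming that no NA remains and that every filled value comes from the observed values. The sketch below repeats the core sampling step on a hypothetical numeric vector `x` (made-up values), so it runs on its own and does not call `random.imp`.

```r
set.seed(1)

# Hypothetical variable with two missing entries
x <- c(10, 12, NA, 15, 11, NA, 14, 13)

# Bootstrap imputation: fill each gap with a random draw from the observed values
mis <- is.na(x)
x_imp <- x
x_imp[mis] <- sample(x[!mis], sum(mis), replace = TRUE)

# Sanity checks: nothing missing, and observed entries are untouched
print(anyNA(x_imp))
print(identical(x_imp[!mis], x[!mis]))

# Compare the distribution before and after imputation
print(summary(x[!mis]))
print(summary(x_imp))
```

One caveat of `sample()`: when it is handed a single number `n` rather than a vector, it samples from `1:n`, so a numeric column with only one observed value would be filled with unintended values.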

In [34]:
# Check distribution of various prices
ggplot(mycars_imp, aes(x=MMRAcquisitionAuctionAveragePrice)) + geom_line(stat = "density")
ggplot(mycars_imp, aes(x=MMRAcquisitionAuctionCleanPrice)) + geom_line(stat = "density") 
ggplot(mycars_imp, aes(x=MMRAcquisitionRetailAveragePrice)) + geom_line(stat = "density") 
ggplot(mycars_imp, aes(x=MMRAcquisitionRetailCleanPrice)) + geom_line(stat = "density") 
ggplot(mycars_imp, aes(x=MMRCurrentAuctionAveragePrice)) + geom_line(stat = "density") 
ggplot(mycars_imp, aes(x=MMRCurrentAuctionCleanPrice)) + geom_line(stat = "density") 
ggplot(mycars_imp, aes(x=MMRCurrentRetailAveragePrice)) + geom_line(stat = "density") 
ggplot(mycars_imp, aes(x=MMRCurrentRetailCleanPrice)) + geom_line(stat = "density") 
# 

In [35]:
# Review final set of data
summary(mycars_imp)
str(mycars_imp)
library(Amelia)
missmap(mycars_imp, main = "Missing values vs observed")

In [36]:
# Split the data into one training set and three test sets.
# When using random numbers, it's usually a good idea to set the seed. This
# makes your results repeatable.
# 
set.seed(100)
# 
# first we create an index on all the rows in the mycars_imp dataset
index <- 1:nrow(mycars_imp)
# 
# this command randomly samples two-thirds (length/1.5) of the index numbers for testing
testindex <- sample(index, trunc(length(index)/1.5))
# 
# all the index numbers not in the sample (the remaining third) become our training set.
cars.trainset <- mycars_imp[-testindex, ]
# 
# the sampled rows go into our test set.
cars.testset <- mycars_imp[testindex, ]
# 
# further split the test set into 3
# first we create an index on all the rows in the cars.testset dataset
index <- 1:nrow(cars.testset)
# 
# this command randomly samples 1/3 of the index numbers 
testindex <- sample(index, trunc(length(index)/3))
# 
# the first test set.
cars.testset1 <- cars.testset[testindex, ]
# 
# the rest goes to cars.testset.tmp
cars.testset.tmp <- cars.testset[-testindex, ]
# 
# further split cars.testset.tmp into 2
# first we create an index on all the rows in the cars.testset.tmp dataset
index <- 1:nrow(cars.testset.tmp)
# 
# this command randomly samples 1/2 of the index numbers 
testindex <- sample(index, trunc(length(index)/2))
# 
# the second test set.
cars.testset2 <- cars.testset.tmp[testindex, ]
# 
# the rest becomes the third test set.
cars.testset3 <- cars.testset.tmp[-testindex, ]
# 
# 
str(cars.trainset)
str(cars.testset1)
str(cars.testset2)
str(cars.testset3)
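
The nested sampling can also be written as a single labelled shuffle: build one label per row, shuffle the labels, and split on them. A minimal sketch on a hypothetical 18-row data frame (`toy`, the label names, and the set sizes are illustrative assumptions):

```r
set.seed(100)

# Hypothetical data: 18 rows identified by id
toy <- data.frame(id = 1:18)

# One label per row: a third for training, the rest shared by three test sets
labels <- rep(c("train", "test1", "test2", "test3"), times = c(6, 4, 4, 4))

# Shuffle the labels so that rows are assigned at random, then split on them
parts <- split(toy, sample(labels))

# Each row lands in exactly one set
print(sapply(parts, nrow))
```

Because every row receives exactly one label, the sets cannot overlap, and the proportions are visible in one place.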
# 

In [37]:
# Try various sets of columns in the logistic regression.
# The summary report shows which variables are significant; refit with the
# significant ones only, then add others back based on their significance levels.
# Adding some variables causes the regression to fail to converge.
# 
Columns <- c("IsBadBuy", "Auction", "VehYear", "VehicleAge", "Transmission",
             "WheelTypeID", "VehOdo", "Nationality", "TopThreeAmericanName",
             "BYRNO", "VNZIP1", "VehBCost", "IsOnlineSale", "WarrantyCost",
             "MMRAcquisitionAuctionAveragePrice",
             "MMRAcquisitionAuctionCleanPrice",
             "MMRAcquisitionRetailAveragePrice",
             "MMRAcquisitionRetailCleanPrice",
             "MMRCurrentAuctionAveragePrice",
             "MMRCurrentAuctionCleanPrice",
             "MMRCurrentRetailAveragePrice",
             "MMRCurrentRetailCleanPrice")

In [38]:
# Do the regression and review summary report
cars.model <- glm(formula = IsBadBuy ~ ., data = cars.trainset[,(Columns)], 
          family=binomial(logit))
summary(cars.model)
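
The assignment asks for odds ratios, which come from exponentiating the fitted coefficients. The sketch below fits a one-variable logistic regression on simulated data (the data frame `toy`, the sample size, and the true coefficients are assumptions chosen for illustration) and reports odds ratios with Wald confidence intervals; the same two calls should apply to `cars.model`.

```r
set.seed(42)

# Simulated data: the log-odds of a bad buy rise by 0.3 per year of age (assumed)
n <- 2000
age <- sample(1:9, n, replace = TRUE)
bad <- rbinom(n, 1, plogis(-3 + 0.3 * age))
toy <- data.frame(IsBadBuy = bad, VehicleAge = age)

fit <- glm(IsBadBuy ~ VehicleAge, data = toy, family = binomial(link = "logit"))

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(fit))

# Wald 95% intervals, also moved to the odds-ratio scale
ci <- exp(confint.default(fit))
print(cbind(OR = odds_ratios, ci))
```

An odds ratio above 1 for `VehicleAge` means each extra year multiplies the odds of a bad buy by that factor, holding the other predictors fixed; a ratio below 1 means the odds shrink.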

In [39]:
# Run the predictions on the three test sets and use confusionMatrix to check the accuracy
# Looks like we can achieve an average accuracy of about 0.895 over the three test sets
library(caret)
# 
# test1: create a new reference to the cars test data
p <- predict(cars.model, cars.testset1, type='response')
my.cars <- cars.testset1
# if p > 0.5, we're predicting the car is a bad buy
my.cars$predicted.IsBadBuy <- as.integer(p > 0.50)
# the table command allows us to see how often our predictions were accurate.
table(my.cars$IsBadBuy, my.cars$predicted.IsBadBuy)
# confusionMatrix expects factors with matching levels: predictions first, reference second
# positive="1" treats a bad buy as the positive class
confusionMatrix(factor(my.cars$predicted.IsBadBuy, levels=c(0,1)), factor(my.cars$IsBadBuy, levels=c(0,1)), positive="1")
# 
# test2: create a new reference to the cars test data
p <- predict(cars.model, cars.testset2, type='response')
my.cars <- cars.testset2
my.cars$predicted.IsBadBuy <- as.integer(p > 0.50)
table(my.cars$IsBadBuy, my.cars$predicted.IsBadBuy)
confusionMatrix(factor(my.cars$predicted.IsBadBuy, levels=c(0,1)), factor(my.cars$IsBadBuy, levels=c(0,1)), positive="1")
# 
# test3: create a new reference to the cars test data
p <- predict(cars.model, cars.testset3, type='response')
my.cars <- cars.testset3
my.cars$predicted.IsBadBuy <- as.integer(p > 0.50)
table(my.cars$IsBadBuy, my.cars$predicted.IsBadBuy)
confusionMatrix(factor(my.cars$predicted.IsBadBuy, levels=c(0,1)), factor(my.cars$IsBadBuy, levels=c(0,1)), positive="1")
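
If a cutoff other than 0.5 is considered, it helps to tabulate accuracy, sensitivity, and specificity over a range of cutoffs before choosing one. The sketch below does this in base R on simulated labels and probabilities (`actual`, `prob`, and the simulation settings are assumptions for illustration, not output of `cars.model`).

```r
set.seed(7)

# Simulated test set: about 12% bad buys, with higher scores for bad buys on average
n <- 1000
actual <- rbinom(n, 1, 0.12)
prob <- plogis(rnorm(n, mean = ifelse(actual == 1, -1, -2.5)))

# Evaluate a grid of cutoffs
cutoffs <- seq(0.1, 0.9, by = 0.1)
metrics <- t(sapply(cutoffs, function(cut) {
  pred <- as.integer(prob > cut)
  c(cutoff      = cut,
    accuracy    = mean(pred == actual),
    sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0))
}))
print(round(metrics, 3))
```

With classes this unbalanced, accuracy alone can hide a model that rarely flags a bad buy, so the sensitivity column is usually the one that motivates moving the cutoff below 0.5.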