## *Table of Contents*

* Cars Data Cleaning
* Load Data and Get Directory Path
* Resolve Basic Issues
* Model and SubModel Variables
* Remove Variables/Observations
* Impute Missing Values
* Split Data
* Demo Signup & Prep
* Next Steps

 
## *Cars Data Cleaning*

For your final project you will work on a Kaggle challenge data set called cars. The goal of the challenge is to predict whether a used car bought at an auto auction is a good or bad buy. The variables listed below are used to predict the binary IsBadBuy variable.

| Field Name | Definition |
|---|---|
| RefID | Unique (sequential) number assigned to vehicles |
| IsBadBuy | Identifies if the kicked vehicle was an avoidable purchase |
| PurchDate | The date the vehicle was purchased at auction |
| Auction | Auction provider at which the vehicle was purchased |
| VehYear | The manufacturer’s year of the vehicle |
| VehicleAge | The years elapsed since the manufacturer’s year |
| Make | Vehicle manufacturer |
| Model | Vehicle model |
| Trim | Vehicle trim level |
| SubModel | Vehicle submodel |
| Color | Vehicle color |
| Transmission | Vehicle’s transmission type (Automatic, Manual) |
| WheelTypeID | The type ID of the vehicle wheel |
| WheelType | The vehicle wheel type description (Alloy, Covers) |
| VehOdo | The vehicle’s odometer reading |
| Nationality | The manufacturer’s country |
| Size | The size category of the vehicle (Compact, SUV, etc.) |
| TopThreeAmericanName | Identifies if the manufacturer is one of the top three American manufacturers |
| MMRAcquisitionAuctionAveragePrice | Acquisition price for this vehicle in average condition at time of purchase |
| MMRAcquisitionAuctionCleanPrice | Acquisition price for this vehicle in above-average condition at time of purchase |
| MMRAcquisitionRetailAveragePrice | Acquisition price for this vehicle in the retail market in average condition at time of purchase |
| MMRAcquisitonRetailCleanPrice | Acquisition price for this vehicle in the retail market in above-average condition at time of purchase |
| MMRCurrentAuctionAveragePrice | Acquisition price for this vehicle in average condition as of current day |
| MMRCurrentAuctionCleanPrice | Acquisition price for this vehicle in above-average condition as of current day |
| MMRCurrentRetailAveragePrice | Acquisition price for this vehicle in the retail market in average condition as of current day |
| MMRCurrentRetailCleanPrice | Acquisition price for this vehicle in the retail market in above-average condition as of current day |
| PRIMEUNIT | Identifies if the vehicle would have a higher demand than a standard purchase |
| AcquisitionType | Identifies how the vehicle was acquired (auction buy, trade-in, etc.) |
| AUCGUART | The level of guarantee provided by the auction for the vehicle (green light: guaranteed/arbitrable, yellow light: caution/issue, red light: sold as is) |
| KickDate | Date the vehicle was kicked back to the auction |
| BYRNO | Unique number assigned to the buyer that purchased the vehicle |
| VNZIP | ZIP code where the car was purchased |
| VNST | State where the car was purchased |
| VehBCost | Acquisition cost paid for the vehicle at time of purchase |
| IsOnlineSale | Identifies if the vehicle was originally purchased online |
| WarrantyCost | Warranty price (term = 36 months and mileage = 36K) |

## *Load Data and Get Directory Path*

Note: The following is a general guide to attacking this problem; feel free to approach the data cleaning process in other ways. However, keep in mind that many of the general goals in this guide have to be achieved in some manner before a logistic regression can be performed and/or trusted.

The first thing you must do is load the data into RCloud using the File Upload GUI. The file will then be in your home directory on RCloud. You can see the absolute path to your file by running `list.files(rcloud.home(), full.names=TRUE)`. However, it is good practice to use `rcloud.home()` to specify the path to your file rather than hard-coding absolute paths. You may also create directories using R, e.g. `dir.create(rcloud.home("cars"))`, and move files into that directory using `file.rename(from=rcloud.home("filename"), to=file.path(rcloud.home("cars"), "filename"))`, or use an RCloud shell cell and the command

mv /opt/data/share01/yourID/filename /opt/data/share01/cars

and then access files in that directory using rcloud.home("cars"); note you can create directories with any name, not just cars.
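The directory steps above can be sketched with base R alone. This is only an illustration: a temporary directory stands in for `rcloud.home()` so the chunk runs outside RCloud, and `example.csv` is a hypothetical file name.

```{R}
# Stand-in for rcloud.home(); inside RCloud you would call rcloud.home() instead.
home_dir <- file.path(tempdir(), "rcloud_home_demo")
dir.create(home_dir, showWarnings = FALSE)

# A hypothetical uploaded file.
src <- file.path(home_dir, "example.csv")
write.csv(data.frame(x = 1:3), src, row.names = FALSE)

# Create a subdirectory and move the file into it.
cars_dir <- file.path(home_dir, "cars")
dir.create(cars_dir, showWarnings = FALSE)
moved <- file.rename(from = src, to = file.path(cars_dir, "example.csv"))

# Confirm where the file now lives.
list.files(home_dir, recursive = TRUE, full.names = TRUE)
```

Building every path from one base directory means the notebook keeps working if the base location changes.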

Note: The test data set provided by the website is optional for this assignment. We will be performing all of our work solely on the training data set. Feel free to run your final model on the cleaned (by you) test data set and submit it to Kaggle to see how well your model does against others submitted to the competition.

## *Resolve Basic Issues*

Next, examine the data and fix basic issues, such as dropping variables that are obviously of no use or fixing miscoded values. At this point I dropped the PurchDate variable because I did not want to deal with date variables. However, if you are up to the challenge, feel free to leave it in and attack the problem. Note that you will not be able to just throw it into a logistic regression raw: that would leave you with too many levels, an issue we will discuss in further detail later. It can, however, be used in a logistic regression in terms of months or years, and the raw data can be useful in the exploratory analysis phase.

```{R}
#corrects zero values to NA for certain variables
#cars_train[,c(18:25)] <-apply(cars_train[,c(18:25)], 2, function(x) x <-ifelse(x == 0, NA, x))
```
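If you decide to keep PurchDate, one possible way to reduce it to months or years is sketched below. The month/day/year format of the sample strings is an assumption; check the actual file before reusing the `format` argument.

```{R}
# Hypothetical purchase dates in month/day/year form (the real format may differ).
purch_date_raw <- c("12/7/2009", "1/15/2010", "6/30/2010")
purch_date <- as.Date(purch_date_raw, format = "%m/%d/%Y")

# Coarse versions with only a few levels, which are usable as model inputs.
purch_year  <- factor(format(purch_date, "%Y"))
purch_month <- factor(format(purch_date, "%m"))

table(purch_year)
table(purch_month)
```

A year factor has a handful of levels and a month factor at most twelve, whereas the raw date would contribute one level per distinct day.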

 
## *Model and SubModel Variables*

Next we need to deal with the Model and SubModel variables. There are two major goals we want to achieve in this section. The first is to extract information contained in the model names. For simplicity, you only have to extract the number of doors from the SubModel variable. However, feel free to extract other pieces of information if you like. Be aware, though, that most of the other pieces of information contained in the model variables are too spotty to make up a variable of their own.

Hint: The grep function could be a useful function here.
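As a rough illustration of the hint, the sketch below pulls a door count out of submodel strings. The sample values and the assumption that doors appear as a digit followed by `D` (for example `4D`) are illustrative; inspect the real SubModel values before relying on the pattern.

```{R}
# Hypothetical SubModel values; the real data may use other conventions.
submodel <- c("4D SEDAN LS", "2D COUPE", "QUAD CAB 4.7L SLT", "4D SUV 4.2L", NA)

# Which entries contain a door marker such as "4D"? (NA gives FALSE.)
has_doors <- grepl("[0-9]D\\b", submodel, perl = TRUE)

# Extract the digit in front of "D"; leave everything else as NA.
doors <- rep(NA_integer_, length(submodel))
doors[has_doors] <- as.integer(
  sub("^.*?([0-9])D\\b.*$", "\\1", submodel[has_doors], perl = TRUE)
)

doors
```

Entries without a door marker stay `NA`, so they can be handled later together with the other missing values.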

 

The second goal is to reduce the number of levels in the model variables. In this case there are 863 levels in SubModel and 1063 levels in Model. This leads to over-parameterization (a saturated model), which causes NAs to start showing up in the regression because there is too little variance to estimate the parameters. There are several ways to attack this problem. I used a hands-on approach to group levels, but you can do a similar thing with more advanced statistical techniques such as k-means. Feel free to approach the problem in any way you see fit.
```{R}
#reduces the number of model levels
cars_train$NewModel <-sapply(strsplit(as.character(cars_train$Model), ' '), "[[", 1)
cars_train$NewModel[grep("1500 Ram", cars_train$Model , ignore.case=TRUE, fixed=FALSE)] <-"Ram"
cars_train$NewModel[c(grep("1500 Silverado", cars_train$Model , ignore.case=TRUE, fixed=FALSE),
                      grep("1500HD Silverado", cars_train$Model , ignore.case=TRUE, fixed=FALSE))] <-"Silverado"
cars_train$NewModel[grep("1500 Sierra", cars_train$Model , ignore.case=TRUE, fixed=FALSE)] <-"Sierra"
cars_train$NewModel[grep("4 Runner", cars_train$Model , ignore.case=TRUE, fixed=FALSE)] <-"4 Runner"
cars_train$NewModel[grep("L Series", cars_train$Model , ignore.case=TRUE, fixed=FALSE)] <-"L Series"
cars_train$NewModel <-as.factor(cars_train$NewModel)
```
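Another simple way to cut down levels, shown here only as a possible complement to the hand grouping, is to lump rarely seen levels into a single catch-all level. The helper `lump_rare` and the cutoff of 2 are hypothetical choices for illustration.

```{R}
# Hypothetical helper: replace levels seen fewer than min_count times with "Other".
lump_rare <- function(x, min_count) {
  x <- as.character(x)
  counts <- table(x)
  rare <- names(counts)[counts < min_count]
  x[x %in% rare] <- "Other"
  factor(x)
}

# Toy model names, not taken from the real data.
toy_models <- c("Ram", "Ram", "Ram", "Silverado", "Silverado", "Viper")
toy_lumped <- lump_rare(toy_models, min_count = 2)
table(toy_lumped)
```

A suitable cutoff depends on how many observations you want behind each estimated coefficient.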

 

 
## *Remove Variables/Observations*

The last thing you need to do before imputation is to remove variables and observations with too many missing values. Imputation can be a wonderful tool, but beyond a certain share of missing values the imputed values dominate the variable or observation, so the result mostly reflects how we imputed. That leaves the question of how much is too much. I can’t really answer that for you. However, I can say that you generally don’t want to impute if more than 50% is missing, and you generally wouldn’t even go that far with less advanced statistical techniques such as the basic one used below.
```{R}
#check to see how many observations are missing in each variable then null (removes) those with too many
miss_obs_per_var <-apply(cars_train, 2, function(x) sum(is.na(x))/length(x)*100)
```
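The removal step itself could look something like the sketch below. The toy data frame and the 50% cutoff are assumptions for illustration; pick the threshold that suits your imputation method.

```{R}
# Toy data frame, not the real cars data.
toy_missing <- data.frame(
  odo   = c(1, NA, 3, 4),
  price = c(NA, NA, NA, 4),
  make  = c("x", "y", "z", "w")
)

# Percentage of missing values per variable.
toy_miss_pct <- colMeans(is.na(toy_missing)) * 100
toy_miss_pct

# Keep only variables with at most 50% missing (assumed cutoff).
toy_reduced <- toy_missing[, toy_miss_pct <= 50, drop = FALSE]

# The same idea per observation: drop rows with more than 50% missing.
toy_row_pct <- rowMeans(is.na(toy_reduced)) * 100
toy_reduced <- toy_reduced[toy_row_pct <= 50, , drop = FALSE]
names(toy_reduced)
```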

 

Before we impute, feel free to do anything else you feel needs to be done in the data cleaning process. What we have done is by no means all that can be done, nor even all that should be done. Don’t forget, this is a general guideline; statistics is not easily made into an algorithm. If it were, I might not have a job.

 
## *Impute Missing Values*

Now we can impute the missing values. This can be done using the random.imp function found in the mi library. The random.imp function randomly imputes the missing values using the distribution of the variables in which the missing values are found. There are many other imputation methods and approaches. Feel free to try out and use other methods (I’m partial to multiple imputation also found in the mi library myself). More information on imputation can be found here.

 
```{R}
#Randomly imputes based upon the distribution of the variable.
library(mi)
cars_train_imp <-random.imp(cars_train)
```
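To see what random imputation does compared with a simpler fill-in, here is a small sketch on a toy vector. It does not use the mi package; it just resamples the observed values, which is the idea described above, and compares the result with mean imputation.

```{R}
set.seed(1)

# Toy vector with missing values.
toy_x <- c(10, 12, NA, 15, NA, 11, 14, NA, 13, 12)
toy_mis <- is.na(toy_x)

# Random imputation: draw replacements from the observed values.
toy_random <- toy_x
toy_random[toy_mis] <- sample(toy_x[!toy_mis], sum(toy_mis), replace = TRUE)

# Mean imputation: every missing value gets the same number.
toy_mean <- toy_x
toy_mean[toy_mis] <- mean(toy_x, na.rm = TRUE)

# Mean imputation shrinks the variance below that of the observed values.
c(observed = var(toy_x[!toy_mis]), random = var(toy_random), mean = var(toy_mean))
```

Drawing from the observed values keeps imputed entries on the variable's own scale and spread, whereas filling in the mean piles extra mass at a single point.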

 
## *Split Data*

Finally we will split the data into training and test sets. There are several methods and philosophies around doing this; however, the general rule is as follows: if you have enough data, put 40% into one training data set and 20% into each of three test data sets.
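One compact way to get a 40/20/20/20 split is to shuffle a vector of group labels and split on it. The helper `split_labels` below is a hypothetical name, and the toy data frame is only there to make the chunk runnable.

```{R}
# Hypothetical helper: assign each of n rows to one of four groups at random.
split_labels <- function(n, props = c(train = 0.4, test1 = 0.2, test2 = 0.2, test3 = 0.2)) {
  sizes <- floor(n * props)
  # Any rows left over after rounding down go to the first (training) group.
  sizes[1] <- sizes[1] + (n - sum(sizes))
  sample(rep(names(props), times = sizes))
}

set.seed(100)
toy_split_data <- data.frame(id = 1:50, value = rnorm(50))
toy_labels <- split_labels(nrow(toy_split_data))

# split() returns a list with one data frame per group.
toy_parts <- split(toy_split_data, toy_labels)
sapply(toy_parts, nrow)
```

Setting the seed before sampling keeps the split repeatable between runs.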

 
## *Demo Signup & Prep*

It’s time to schedule your demo! This is about the halfway point for the iteration, the perfect time to get signed up so we can ensure your demo will take place during the demo week set aside for your group.

Please follow the steps below before continuing with the iteration:

* Review the demo help guide. It has additional info on scheduling and tips to be successful in your demo.
* Find your Topic Advisor. Please note that Topic Advisors can change with each iteration due to resource needs. Be in the habit of checking your Topic Advisor each time you schedule demos.
    * Go to your OLE Study Group.
    * In the Milestones section, click on the task for the current iteration demo week (see Example 1 below).
    * The task will expand to show a description where the Topic Advisor’s name and ID are listed.
* Send an Outlook invite.
    * Access the Topic Advisor’s calendar in Outlook. Their calendar is public, so you should be able to access it. If you have any issues, please engage the Topic Advisor.
    * There will be blocks of time set aside each day for demos.
    * Find a day and time that is available within those blocks and that works with your schedule (see Example 2 below).
    * Send an invite for a 30-minute meeting to the Topic Advisor.
    * The Topic Advisor will accept the invite and you are now scheduled!
* Demo prep
    * Start creating a document (PowerPoint, Word, OneNote, etc.) for use during your demo.
    * This will be used to help show the Topic Advisor how you progressed through the iteration.
    * Think about adding screenshots or plan on live demos to help supplement the document and demonstrate your learnings.
    * Update the document as you move through the last week of the iteration.

Load training data into R
```{R}
list.files(rcloud.home(), full.names=TRUE)
cars_train <- read.csv(rcloud.home("training.csv", user="jl2408"), sep=",", header = TRUE, check.names = FALSE)
```

Explore the data
```{R}
str(cars_train)
summary(cars_train)
```

How many cars are bad buys?
```{R}
table(cars_train$IsBadBuy)
```

```{R}
# drop column PurchDate
cars_train <- subset(cars_train, select = -c(PurchDate))

# convert factor into numeric
#cars_train[,c(18:25)] <- #as.numeric(levels(cars_train[,c(18:25)]))[cars_train[,c(18:25)]]


#corrects zero values to NA for certain variables
cars_train[,c(18:25)] <-apply(cars_train[,c(18:25)], 2, function(x) x <-ifelse(x == 0, NA, x))
str(cars_train)
```
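Positional indices such as `18:25` shift whenever a column is added or dropped. A possible alternative, sketched here on a toy data frame, is to select the price columns by name; the `^MMR` prefix pattern assumes all of the price columns, and only those, start with `MMR`.

```{R}
# Toy data frame with two MMR price columns and one unrelated column.
toy_prices <- data.frame(
  VehOdo = c(0, 50000),
  MMRCurrentAuctionCleanPrice = c(0, 7000),
  MMRCurrentRetailAveragePrice = c(9000, 0)
)

# Select the price columns by name prefix instead of by position.
toy_mmr_cols <- grep("^MMR", names(toy_prices), value = TRUE)

# Turn zeros into NA in those columns only.
toy_prices[toy_mmr_cols] <- lapply(
  toy_prices[toy_mmr_cols],
  function(x) replace(x, x %in% 0, NA)
)
toy_prices
```

The zero in `VehOdo` is left alone because that column does not match the prefix.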

```{R}
levels(cars_train$PRIMEUNIT)
levels(cars_train$AUCGUART)
table(cars_train$PRIMEUNIT)
table(cars_train$AUCGUART)

# drop these 2 columns
cars_train <- subset(cars_train, select = -c(PRIMEUNIT,AUCGUART))
```

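```{R}
# apply(cars_train, 2, function (x) any(is.na(x)))

# apply(cars_train, 2, function (x) any(is.infinite(x)))

# check what percentage of observations is missing in each variable,
# so that variables with too many missing values can be removed
miss_obs_per_var <-apply(cars_train, 2, function(x) sum(is.na(x))/length(x)*100)
miss_obs_per_var

# is.null() tests the whole column object rather than each element, so it would always report 0 here.
# Count literal "NULL" strings instead (assumption: the raw file may store missing values that way).
null_obs_per_var <-apply(cars_train, 2, function(x) sum(x == "NULL", na.rm = TRUE)/length(x)*100)
null_obs_per_var

library(Amelia)
missmap(cars_train, main = "Missing values vs observed")
```
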
```{R}
# Randomly imputes based upon the distribution of the variable.
random.imp <- function ( data , imp.method = c( "bootstrap", "pca" ) , ... ) {
  imp.method <- match.arg ( imp.method )
  if(imp.method=="bootstrap"){
    if( is.vector( data ) ) {
      # replace each missing value with a random draw from the observed values
      mis       <- is.na ( data )
      imputed   <- data
      imputed[ mis ] <- sample( data[ !mis ], sum( mis ), replace = TRUE )
    }
    else if( is.matrix( data ) || is.data.frame( data )  ){
      imputed <- data
      # impute column by column
      for( j in seq_len( ncol ( data ) ) ) {
        mis  <- is.na ( data[,j] )
        if( sum(mis) == length(data[,j])){
          warning( paste( "variable", colnames(data)[j], "has no observation" ) )
        }
        else {
          imputed[mis,j] <- sample( data[!mis, j], sum( mis ), replace = TRUE )
        }
      }
    }
    else{
      stop ( "Unexpected data type: data must be vector, matrix, or data frame." )
    }
  }
  else if (imp.method=="pca"){
    stop ( "pca imputation is not implemented in current version." )
  }
  #else{
  #    imputed <- pca( data, nPcs = 3, method = "bpca" )@completeObs
  #}
  return( as.data.frame( imputed ) )
}

cars_train_imp <-random.imp(cars_train)
```

Replace missing values with the average
```{R}
# Replace missing values in each MMR price column with that column's rounded mean.
# The columns are matched by name prefix and kept numeric: converting a price
# to a factor would create one level per distinct price.
mmr_cols <- grep("^MMR", names(cars_train), value = TRUE)
for (col in mmr_cols) {
  # go through character so that factor columns yield their labels, not level codes
  x <- suppressWarnings(as.numeric(as.character(cars_train[[col]])))
  x[is.na(x)] <- round(mean(x, na.rm = TRUE))
  cars_train[[col]] <- x
}
```


Check what is in Model
```{R}
t <- cars_train$Model
head(t, 20)
length(levels(t))
t2 <- as.factor(sapply(strsplit(as.character(t), ' '), "[[", 1))
levels(t2)
length(levels(t2))
#t[grep("^[0-9]+", t , ignore.case=TRUE, fixed=FALSE)]
t[grep("^25+", t , ignore.case=TRUE, fixed=FALSE)]
```

Check what is in SubModel
```{R}
t <- cars_train$SubModel
head(t, 20)
length(levels(t))
t2 <- as.factor(sapply(strsplit(as.character(t), ' '), "[[", 1))
levels(t2)
length(levels(t2))
```

```{R}
# reduces the number of model levels
cars_train$Model <- as.character(cars_train$Model)
cars_train$Model[grep("1500 Ram", cars_train$Model , ignore.case=TRUE, fixed=FALSE)] <-"Ram"
cars_train$Model[c(grep("1500 Silverado", cars_train$Model , ignore.case=TRUE, fixed=FALSE),
                      grep("1500HD Silverado", cars_train$Model , ignore.case=TRUE, fixed=FALSE))] <-"Silverado"
cars_train$Model[grep("1500 Sierra", cars_train$Model , ignore.case=TRUE, fixed=FALSE)] <-"Sierra"
cars_train$Model[grep("4 Runner", cars_train$Model , ignore.case=TRUE, fixed=FALSE)] <-"4_Runner"
cars_train$Model[grep("L Series", cars_train$Model , ignore.case=TRUE, fixed=FALSE)] <-"L_Series"
cars_train$Model <-sapply(strsplit(as.character(cars_train$Model), ' '), "[[", 1)
cars_train$Model <-as.factor(cars_train$Model)

length(levels(cars_train$Model))
```

```{R}
# reduces the number of submodel levels
cars_train$SubModel <- as.character(cars_train$SubModel)
cars_train$SubModel <-sapply(strsplit(as.character(cars_train$SubModel), ' '), "[[", 1)
cars_train$SubModel <-as.factor(cars_train$SubModel)

length(levels(cars_train$SubModel))
```

```{R}
# Split Data
# When sampling at random, it's usually a good idea to set the seed. This
# guarantees your results are repeatable.
set.seed(100)

# first we create an index on all the rows in the cars_train dataset
index <- 1:nrow(cars_train)

# randomly sample 60% of the index numbers; these rows will form the three test sets
testindex <- sample(index, trunc(length(index) * 0.6))

# the index numbers not in the sample (40%) become our training set
cars.trainset <- cars_train[-testindex, ]

# the sampled rows (60%) go into the combined test set
cars.testset <- cars_train[testindex, ]

# further split the combined test set into 3 parts of about 20% of the data each
# first we create an index on all the rows in the cars.testset dataset
index <- 1:nrow(cars.testset)

# randomly sample 1/3 of the index numbers
testindex <- sample(index, trunc(length(index)/3))

# the first test set
cars.testset1 <- cars.testset[testindex, ]

# the rest goes to cars.testset.tmp
cars.testset.tmp <- cars.testset[-testindex, ]

# further split cars.testset.tmp into 2
# first we create an index on all the rows in the cars.testset.tmp dataset
index <- 1:nrow(cars.testset.tmp)

# randomly sample 1/2 of the index numbers
testindex <- sample(index, trunc(length(index)/2))

# the second test set
cars.testset2 <- cars.testset.tmp[testindex, ]

# the rest becomes the third test set
cars.testset3 <- cars.testset.tmp[-testindex, ]

str(cars.trainset)
str(cars.testset1)
str(cars.testset2)
str(cars.testset3)
```

```{R}

table(cars.testset1$IsBadBuy)
table(cars.testset1$TopThreeAmericanName)
table(cars_train$TopThreeAmericanName)
# cars.model = glm(IsBadBuy ~ Auction + VehYear + VehicleAge + Make + Model + Trim + SubModel + Color + Transmission + WheelTypeID + WheelType + VehOdo + Nationality + Size + TopThreeAmericanName + MMRAcquisitionAuctionAveragePrice + MMRAcquisitionAuctionCleanPrice + MMRAcquisitionRetailAveragePrice + MMRAcquisitionRetailCleanPrice + MMRCurrentAuctionAveragePrice + MMRCurrentAuctionCleanPrice + MMRCurrentRetailAveragePrice + MMRCurrentRetailCleanPrice + BYRNO + VNZIP1 + VNST + VehBCost + IsOnlineSale + WarrantyCost, family=binomial(logit), data=cars.trainset)

# cars.model = glm(IsBadBuy ~ Auction + VehYear + VehicleAge + Make + Model + Trim + SubModel + WheelType + VehOdo + Size + MMRAcquisitionAuctionAveragePrice + MMRAcquisitionAuctionCleanPrice + MMRAcquisitionRetailAveragePrice + MMRAcquisitionRetailCleanPrice + MMRCurrentAuctionAveragePrice + MMRCurrentAuctionCleanPrice + MMRCurrentRetailAveragePrice + MMRCurrentRetailCleanPrice + BYRNO + VNZIP1 + VehBCost + WarrantyCost, family=binomial(logit), data=cars.trainset)

# cars.model = glm(IsBadBuy ~ VehYear + VehicleAge + Make + Model + SubModel + VehOdo + Size + MMRAcquisitionAuctionCleanPrice + VehBCost, family=binomial(logit), data=cars.trainset)

# cars.model = glm(IsBadBuy ~ VehicleAge + Make + Model + VehOdo + Size + VehBCost, family=binomial(logit), data=cars.trainset)

cars.model = glm(IsBadBuy ~ VehicleAge + VehOdo + VehBCost + MMRCurrentAuctionCleanPrice + MMRCurrentRetailAveragePrice + WarrantyCost, family=binomial(logit), data=cars.trainset)

summary(cars.model)
# predicted probability of a bad buy for each car in the first test set
p <- predict(cars.model, cars.testset1, type='response')

# create a new reference to the cars test data
my.cars<-cars.testset1
# if p > 0.1, we're predicting the car is a bad buy
my.cars$predicted.IsBadBuy <- p>0.1
# the table command allows us to see how often our predictions were accurate
# (rows: actual IsBadBuy, columns: predicted)
table(my.cars$IsBadBuy,my.cars$predicted.IsBadBuy)
```

```{R}
library(psychometric)
CIr(r=0.376, n = 61, level=.95)
```