In [74]:
## Title: Prediction of House Prices using R

Introduction

1.....Load Package

2.....Read data

3.....Data Exploration

4.....Variable Selection  

5.....Missing Value

6.....Data Visualization

7.....Data Partitioning

8.....Build the model

9.....Final Prediction

10....Checking Accuracy

1. Import Libraries

In [80]:
library(ggplot2)
library(reshape)
library(e1071)
library(forecast)


2. Import Dataset

In [81]:
data <- read.csv(file = '../input/housing/data.csv')


3. Data Exploration

In [82]:
head(data)

In [83]:
tail(data)

In [84]:
dim(data)

In [85]:
summary(data)

In [86]:
colnames(data)

In [87]:
unique(data$city)

4. Variable Selection

In [88]:
housing.df <- data[,c("price","bedrooms","sqft_living","floors",
                  "sqft_lot", "condition", "view", "yr_built")]
head(housing.df)

5. Checking Missing Values

In [89]:
sapply(housing.df[,1:8], function(x) sum(is.na(x)))
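
If any of the counts above were non-zero, the affected rows would need to be handled before modelling. The following is a minimal sketch on a made-up data frame (`toy.df` is hypothetical, not the housing data), assuming that simply dropping incomplete rows is acceptable:

```r
# Toy data frame with two missing values (illustrative only, not the housing data)
toy.df <- data.frame(price    = c(100, 200, NA, 400),
                     bedrooms = c(2, NA, 3, 4))

# Count missing values per column
na.counts <- colSums(is.na(toy.df))
print(na.counts)

# Keep only the rows that have no missing values in any column
toy.complete <- toy.df[complete.cases(toy.df), ]
print(nrow(toy.complete))
```

Dropping rows is only one option; imputing a value such as the column median may be preferable when many rows are affected.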

Converting the yr_built column to house_age, the age of each house in years (current year minus the year built)

In [90]:
housing.df$house_age <- as.integer(format(Sys.Date(), "%Y")) - housing.df$yr_built

drops <- c("yr_built")
housing.df = housing.df[ , !(names(housing.df) %in% drops)]
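
Because `Sys.Date()` changes over time, `house_age` depends on when the notebook is run. One option, sketched here on made-up values, is to fix the reference year. The name `ref.year` and its value of 2014 are assumptions chosen for illustration, not taken from the dataset.

```r
# Hypothetical reference year (an assumption; ideally the year the sales were recorded)
ref.year <- 2014L

# Made-up construction years (illustrative only)
toy.df <- data.frame(yr_built = c(1955L, 1990L, 2010L))

# Age in years relative to the fixed reference year
toy.df$house_age <- ref.year - toy.df$yr_built

# Drop the original column, mirroring the cell above
toy.df$yr_built <- NULL
print(toy.df)
```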


6. Data Visualization

Correlation matrix

In [91]:
cor(housing.df)

Plotting Correlation matrix

In [92]:
cor.mat <- round(cor(housing.df),2) 
melted.cor.mat <- melt(cor.mat)
ggplot(melted.cor.mat, aes(x = X1, y = X2, fill = value)) + 
  geom_tile() + 
  geom_text(aes(x = X1, y = X2, label = value))


Plotting Scatterplot matrix

In [93]:
pairs(price ~., data = housing.df,
      main = "Scatterplot Matrix")

Plotting boxplots to check outliers

In [94]:
par(mfrow=c(2, 3))  # divide graph area into 2 rows and 3 columns
boxplot(housing.df$bedrooms, main="Bedrooms")
boxplot(housing.df$sqft_living, main="sqft_living")
boxplot(housing.df$floors, main="floors")
boxplot(housing.df$condition, main="condition")
boxplot(housing.df$view, main="view")
boxplot(housing.df$house_age, main="house_age")
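
The points drawn beyond the whiskers can also be listed numerically. A small sketch using base R's `boxplot.stats()`, which applies the same whisker rule as `boxplot()` with its default settings; the `bedrooms` vector below is made up for illustration:

```r
# Made-up bedroom counts with one extreme value (illustrative only)
bedrooms <- c(2, 3, 3, 3, 4, 4, 5, 33)

# boxplot.stats() flags values lying more than 1.5 times the hinge spread
# (roughly the interquartile range) beyond the lower or upper hinge
stats <- boxplot.stats(bedrooms)
outliers <- stats$out
print(outliers)
```

Whether such values should be removed, capped, or kept is a modelling decision; the rule only flags candidates.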

Plotting density plot to check normality

In [95]:
par(mfrow=c(2, 3)) 

# The y-axis of a kernel density plot shows density, not frequency
plot(density(housing.df$bedrooms), main="Density Plot: Bedrooms", ylab="Density",
     sub=paste("Skewness:", round(e1071::skewness(housing.df$bedrooms), 2)))  
polygon(density(housing.df$bedrooms), col="green")

plot(density(housing.df$sqft_living), main="Density Plot: sqft_living", ylab="Density",
     sub=paste("Skewness:", round(e1071::skewness(housing.df$sqft_living), 2)))  
polygon(density(housing.df$sqft_living), col="orange")

plot(density(housing.df$sqft_lot), main="Density Plot: sqft_lot", ylab="Density",
     sub=paste("Skewness:", round(e1071::skewness(housing.df$sqft_lot), 2)))  
polygon(density(housing.df$sqft_lot), col="green")

plot(density(housing.df$condition), main="Density Plot: condition", ylab="Density",
     sub=paste("Skewness:", round(e1071::skewness(housing.df$condition), 2)))  
polygon(density(housing.df$condition), col="orange")

plot(density(housing.df$floors), main="Density Plot: floors", ylab="Density",
     sub=paste("Skewness:", round(e1071::skewness(housing.df$floors), 2)))  
polygon(density(housing.df$floors), col="green")
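
The skewness shown under each plot summarises asymmetry: values near 0 suggest a roughly symmetric distribution, while positive values indicate a longer right tail. The sketch below computes the plain moment-based skewness by hand on made-up vectors. The helper `moment.skewness` is hypothetical, and its result may differ slightly from `e1071::skewness`, which offers several variants through its `type` argument.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2 (hypothetical helper)
moment.skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Made-up samples: one symmetric, one with a long right tail
x.sym   <- c(1, 2, 3, 4, 5)
x.right <- c(1, 2, 2, 3, 3, 3, 10)

skew.sym   <- moment.skewness(x.sym)
skew.right <- moment.skewness(x.right)
print(c(symmetric = skew.sym, right.tailed = skew.right))
```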

7. Data Partitioning

In [96]:
set.seed(123)
# randomly sample 70% of the row IDs for training
train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.7)
# randomly sample 30% of the row IDs for validation (setdiff used to draw records not in training set)
valid.rows <- sample(setdiff(rownames(housing.df), train.rows), dim(housing.df)[1]*0.3)

train <- housing.df[train.rows, ]
valid <- housing.df[valid.rows, ]
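
A quick sanity check on a partition is to confirm that the two sets are disjoint and together cover the data. The sketch below uses a made-up 10-row data frame and integer row indices instead of row names; `toy.df`, `train.idx`, and `valid.idx` are illustrative names, and taking all remaining rows for validation is an assumption that simplifies the second sampling step.

```r
set.seed(123)

# Made-up data frame (illustrative only)
toy.df <- data.frame(price = 1:10, sqft_living = 11:20)
n <- nrow(toy.df)

# Sample 70% of the row indices for training
train.idx <- sample(n, round(0.7 * n))
# Use every remaining row for validation
valid.idx <- setdiff(seq_len(n), train.idx)

train.toy <- toy.df[train.idx, ]
valid.toy <- toy.df[valid.idx, ]

# The sets must not overlap and must cover all rows
print(length(intersect(train.idx, valid.idx)))
print(nrow(train.toy) + nrow(valid.toy) == n)
```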

8. Building a Model

In [97]:
reg <- lm(price~bedrooms + sqft_living + floors + sqft_lot + condition + view + house_age,
          data = housing.df, subset = train.rows) 
tr.res <- data.frame(train$price , reg$fitted.values, reg$residuals)
head(tr.res)

9. Final Prediction on Validation Data

In [98]:
options(digits = 2)
pred <- predict(reg, newdata = valid)
pred

10. Checking Accuracy

In [99]:
accuracy(reg$fitted.values, train$price)
accuracy(pred, valid$price)
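
`accuracy()` from the forecast package reports error measures including ME, RMSE, MAE, and MAPE. To make their meaning concrete, here is a sketch that computes them by hand with base R on made-up values (the `actual` and `predicted` vectors are illustrative, not model output):

```r
# Made-up actual and predicted prices (illustrative only)
actual    <- c(200, 300, 400, 500)
predicted <- c(210, 290, 430, 480)

# Error is actual minus predicted
errors <- actual - predicted

me   <- mean(errors)                       # mean error (bias)
rmse <- sqrt(mean(errors^2))               # root mean squared error
mae  <- mean(abs(errors))                  # mean absolute error
mape <- mean(abs(errors / actual)) * 100   # mean absolute percentage error

print(c(ME = me, RMSE = rmse, MAE = mae, MAPE = mape))
```

Comparing these measures between the training and validation sets gives a rough indication of overfitting: a validation RMSE far above the training RMSE suggests the model does not generalise well.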