# Introduction
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, we will predict the final price of each home. The dataset contains 37 numeric and 43 categorical (factor) variables. In this notebook we will use different regression techniques, such as lasso and ridge, to predict house prices.

After loading the libraries used in this notebook, we import the datasets. We then combine the train and test data so that the pre-processing steps are applied to both. Before combining them, we need to make sure that both datasets have the same features. Since there is no `SalePrice` column in the test data, we can either remove this column from the train data or add it to the test data with every value set to `0`. We will go with the latter.
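
The combine-then-split idea can be tried on a small example first. The sketch below uses made-up data frames (`toy_train`, `toy_test` and their values are hypothetical, not the Kaggle files) and only base R. It shows that `rbind()` keeps the row order, so the training rows can be recovered by position afterwards.

```r
# Hypothetical stand-ins for train.csv and test.csv
toy_train <- data.frame(Id = 1:3,
                        LotArea = c(8450, 9600, 11250),
                        SalePrice = c(208500, 181500, 223500))
toy_test <- data.frame(Id = 4:5,
                       LotArea = c(11622, 14267))

# rbind() needs the same columns in both data frames,
# so the test data gets a placeholder target
toy_test$SalePrice <- 0

toy_df <- rbind(toy_train, toy_test)

# Row order is preserved: the first nrow(toy_train) rows are the training rows
n_train <- nrow(toy_train)
toy_train_again <- toy_df[seq_len(n_train), ]
toy_test_again <- toy_df[(n_train + 1):nrow(toy_df), ]
```

Splitting by `nrow()` rather than by hard-coded row numbers keeps the split correct if the input files change size.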



In [None]:
#Loading libraries
library(tidyverse)
library(glmnet) # to apply lasso, ridge and elastic net regression
library(e1071) # to apply the skewness() function

#Importing datasets
train = read.csv('../input/houseprices/train.csv')
test = read.csv('../input/houseprices/test.csv')

#Before we combine train and test data, we have to make sure the number of features is the same. For that purpose, we add a SalePrice column to the test data.
test$SalePrice = 0

#Combining the datasets
df = rbind(train, test)


# Missing Data Handling

Before applying the models, there must not be any NA values in our dataset. First we will find the NA values, and then we will replace them with suitable values. There are different ways to replace an NA, such as using the median or the mean. Here we will check which features have NAs and then decide how to replace them. In some columns NA means that the house does not have the feature, so we will replace NA with "None" for categorical columns and with 0 for numerical columns. In some of the categorical (factor) features we will replace NA with the mode, since the most common value is the replacement most likely to be correct. We will replace NAs in some of the numerical columns with the mean or the median: if a column has many outliers we prefer the median, otherwise the mean. Finally, we assume that if `GarageYrBlt` is `NA`, the garage was built in the same year as the house.


How to replace NA values

* **Mode:** "MSZoning", "Utilities", "Functional", "Exterior1st", "Exterior2nd", "Electrical", "KitchenQual", "GarageCars", "SaleType"

* **Median:** "BsmtFinSF2","BsmtUnfSF"

* **None:** "PoolQC", "MiscFeature","Alley", "Fence", "FireplaceQu","GarageFinish", "GarageQual","GarageCond","BsmtCond", "BsmtExposure", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType", "GarageType"    

* **0:** "LotFrontage","MasVnrArea","BsmtFullBath","BsmtHalfBath","TotalBsmtSF","GarageArea"

* **Mean:** "BsmtFinSF1"
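
The replacement rules listed above can be sketched on a toy data frame. The values and the helper `mode_value()` are hypothetical and only illustrate the idea; the column names are borrowed from the list for readability.

```r
# Hypothetical toy data with one NA per column
toy <- data.frame(
  MSZoning   = c("RL", "RM", "RL", NA),
  BsmtUnfSF  = c(100, 300, NA, 5000),
  PoolQC     = c(NA, "Gd", "Ex", "Fa"),
  GarageArea = c(480, NA, 600, 520),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent non-missing value
# (table() ignores NA by default)
mode_value <- function(x) names(which.max(table(x)))

# Mode for a categorical column where NA is a genuine gap
toy$MSZoning[is.na(toy$MSZoning)] <- mode_value(toy$MSZoning)

# Median for a numeric column with an outlier (5000 would pull the mean up)
toy$BsmtUnfSF[is.na(toy$BsmtUnfSF)] <- median(toy$BsmtUnfSF, na.rm = TRUE)

# "None" / 0 where NA means the house does not have the feature
toy$PoolQC[is.na(toy$PoolQC)] <- "None"
toy$GarageArea[is.na(toy$GarageArea)] <- 0
```

In this toy column the median of the observed `BsmtUnfSF` values is 300, while the mean would be 1800, which is why the median is the safer choice when outliers are present.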



In [None]:
#check for the NA values and order the columns that have NA values
col_NA=which(colSums(is.na(df))>0)
sort(colSums(sapply(df[col_NA],is.na)),decreasing = TRUE)

In [None]:
#replacing NA values with 0
y=c("LotFrontage","MasVnrArea","BsmtFullBath","BsmtHalfBath","TotalBsmtSF","GarageArea")
df[,y]=apply(df[,y],2,function(x){ replace(x,is.na(x),0)})

#replacing NA values with "None"
y=c("PoolQC", "MiscFeature","Alley","Fence","FireplaceQu","GarageFinish", "GarageQual","GarageCond","BsmtCond", "BsmtExposure", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType", "GarageType")
df[,y]=apply(df[,y],2,function(x){replace(x,is.na(x),"None")})

#replacing NA with Mean
df["BsmtFinSF1"]= replace(df["BsmtFinSF1"], is.na(df["BsmtFinSF1"]),mean(df[["BsmtFinSF1"]],na.rm = T))

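#replacing NA with mode
#note: apply() turns the selected columns into a character matrix, so GarageCars is stored as character after this step
y=c("MSZoning","Utilities","Functional","Exterior1st","Exterior2nd","Electrical","KitchenQual","GarageCars","SaleType")
df[,y]=apply(df[,y],2,function(x){replace(x,is.na(x),names(which.max(table(x))))})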

#replacing NA with median
y=c("BsmtFinSF2","BsmtUnfSF")
df[,y]=apply(df[,y],2,function(x){replace(x,is.na(x),median(x,na.rm = T))})

#we assume the garage was built in the same year as the house (GarageYrBlt = YearBuilt)
df$GarageYrBlt[is.na(df$GarageYrBlt)]=df$YearBuilt[is.na(df$GarageYrBlt)]

#final check to see whether any NA values are left
sort(colSums(sapply(df[col_NA],is.na)),decreasing = TRUE)

# Changing Feature Classes

We have handled the NA values in the dataset. Now we need to check the feature classes. We will convert some of the numerical features to categorical (factor) features and all of the text (character) features to factors. At this point the dataset has 44 character, 27 integer and 10 numeric features.
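
As a small illustration of these class changes, the sketch below uses a made-up three-column data frame (the values are hypothetical). It converts every character column to a factor and also turns a numeric code that is really a category into a factor.

```r
# Hypothetical toy data: one text column, one true number, one numeric code
toy <- data.frame(
  Street     = c("Pave", "Grvl", "Pave"),
  LotArea    = c(8450L, 9600L, 11250L),
  MSSubClass = c(60L, 20L, 60L),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor
is_chr <- sapply(toy, is.character)
toy[is_chr] <- lapply(toy[is_chr], factor)

# MSSubClass is a dwelling-type code, not a quantity, so treat it as a factor
toy$MSSubClass <- factor(toy$MSSubClass)

# Count the columns per class, as done with table(sapply(df, class))
class_counts <- table(sapply(toy, class))
```

After these steps `class_counts` reports two factor columns and one integer column, and no character columns remain.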

In [None]:
table(sapply(df,class))

In [None]:
#converting characters to factor
classes=sapply(df,class)
classes_character=names(classes[classes=="character"])
df[classes_character]= lapply(df[classes_character],factor)

In [None]:
#as we can see from the table, we do not have any character variable anymore
table(sapply(df,class))

In [None]:
#convert ordered quality factors into numeric scores
df$ExterQual= recode(df$ExterQual,"Po"=1,"Fa"=2,"TA"=3,"Gd"=4,"Ex"=5)
df$ExterCond=recode(df$ExterCond,"Po"=1,"Fa"=2,"TA"=3,"Gd"=4,"Ex"=5)
df$GarageQual=recode(df$GarageQual,"None"=0,"Po"=1,"Fa"=2,"TA"=3,"Gd"=4,"Ex"=5)
df$GarageCond=recode(df$GarageCond,"None"=0,"Po"=1,"Fa"=2,"TA"=3,"Gd"=4,"Ex"=5)
df$BsmtExposure= recode(df$BsmtExposure,"None"=0,"No"=1,"Mn"=2,"Av"=3,"Gd"=4)
df$KitchenQual= recode(df$KitchenQual,"None"=0,"Po"=1,"Fa"=2,"TA"=3,"Gd"=4,"Ex"=5)
df$BsmtFinType1=recode(df$BsmtFinType1,"None"=0,"Unf"=1,"LwQ"=2,"Rec"=3,"BLQ"=4,"ALQ"=5,"GLQ"=6)
df$BsmtFinType2=recode(df$BsmtFinType2,"None"=0,"Unf"=1,"LwQ"=2,"Rec"=3,"BLQ"=4,"ALQ"=5,"GLQ"=6)
df$HeatingQC= recode(df$HeatingQC,"None"=0,"Po"=1,"Fa"=2,"TA"=3,"Gd"=4,"Ex"=5)
df$Functional= recode(df$Functional,"Sal"=1,"Sev"=2,"Maj2"=3,"Maj1"=4,"Mod"=5,"Min2"=6,"Min1"=7,"Typ"=8)
df$FireplaceQu= recode(df$FireplaceQu,"None"=0,"Po"=1,"Fa"=2,"TA"=3,"Gd"=4,"Ex"=5)
df$GarageFinish= recode(df$GarageFinish,"None"=0,"Unf"=1,"RFn"=2,"Fin"=3)
df$PoolQC= recode(df$PoolQC,"None"=0,"Po"=1,"Fa"=2,"TA"=3,"Gd"=4,"Ex"=5)
df$Fence=recode(df$Fence,"None"=0,"MnWw"=1,"GdWo"=2,"MnPrv"=3,"GdPrv"=4)
df$BsmtQual= recode(df$BsmtQual,"None"=0,"Po"=1,"Fa"=2,"TA"=3,"Gd"=4,"Ex"=5)
df$BsmtCond= recode(df$BsmtCond,"None"=0,"Po"=1,"Fa"=2,"TA"=3,"Gd"=4,"Ex"=5)

In [None]:
#MSSubClass, MoSold and YearBuilt are read as numeric; we convert them to factors.
df$MSSubClass=factor(df$MSSubClass)
df$MoSold=factor(df$MoSold)
df$YearBuilt=factor(df$YearBuilt)

# Feature Engineering
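
One step in this section converts the factor `YearBuilt` back to numbers in order to compute the age of the house. The sketch below, on a made-up vector of years, shows why the conversion goes through `levels()`: calling `as.numeric()` directly on a factor returns the internal level codes, not the original years.

```r
# Hypothetical years stored as a factor
year_built <- factor(c(2003, 1976, 2001, 1976))

# Direct conversion returns the level codes: 3 1 2 1
codes <- as.numeric(year_built)

# Indexing the numeric levels by the factor recovers the years
years <- as.numeric(levels(year_built))[year_built]

# Age relative to 2010, as in this notebook
age <- 2010 - years
```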

In [None]:
#total area feature = basement area + first floor area + second floor area
df$TotalSF=df$TotalBsmtSF + df$X1stFlrSF + df$X2ndFlrSF 

#total bath
df$TotalBath= df$BsmtFullBath+df$BsmtHalfBath*0.5+df$FullBath+df$HalfBath*0.5

#garage score
df$GarageScore=df$GarageArea*df$GarageQual

#remodeled
df$Remodeled=ifelse(df$YrSold-df$YearRemodAdd<2,1,0)

#age of the house
#convert the factor back to its numeric values (without any loss) to calculate the age of the house.
df$YearBuilt = as.numeric(levels(df$YearBuilt)[df$YearBuilt])
df$Age=2010-df$YearBuilt
#convert YearBuilt to factor again.
df$YearBuilt=factor(df$YearBuilt)

# Dealing with Outliers
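
The cell in this section caps extreme values instead of dropping rows. A minimal sketch of the two capping styles it uses is shown below on a made-up vector; `cap_at_largest_below()` is a hypothetical helper, not part of any package.

```r
# Hypothetical living-area values with two extreme entries
gr_liv_area <- c(1500, 2200, 4476, 5642, 4676)

# Style 1: cap at the threshold itself (as done for LotArea at 50000)
capped_at_threshold <- pmin(gr_liv_area, 4500)

# Style 2: cap at the largest observed value that does not exceed the threshold
# (as done for GrLivArea, where values above 4500 become 4476)
cap_at_largest_below <- function(x, threshold) {
  cap <- max(x[x <= threshold])
  x[x > threshold] <- cap
  x
}
capped_at_observed <- cap_at_largest_below(gr_liv_area, 4500)
```

Both styles keep every row, which matters here because the test rows must all receive a prediction.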

In [None]:
df$GrLivArea[df$GrLivArea>4500]=4476 #capped at 4476
df$LotArea[df$LotArea>50000]= 50000 #capped at 50000
df$X1stFlrSF[df$X1stFlrSF>3000]=3000  #capped at 3000
df$TotalBsmtSF[df$TotalBsmtSF>3000]=3000  #capped at 3000
df$TotalBath[df$TotalBath>4.5]=4.5 #capped at 4.5

# Skewness
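
The effect of the log transform used in this section can be checked on a small made-up sample. The sketch below computes a simple moment-based skewness in base R; `moment_skewness()` is a hypothetical helper, and its result may differ slightly from what `skewness()` in `e1071` returns. The prices are invented for illustration.

```r
# Hypothetical right-skewed prices: one very expensive house
prices <- c(100000, 120000, 130000, 150000, 160000, 180000, 200000, 755000)

# Hypothetical helper: third central moment divided by the
# second central moment raised to the power 1.5
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- moment_skewness(prices)
skew_log <- moment_skewness(log(prices))

# Predictions made on the log scale are mapped back with exp()
round_trip <- exp(log(prices))
```

For this sample the skewness of the logged prices is smaller than that of the raw prices, which is the reason for modelling `log(SalePrice)` and applying `exp()` to the predictions.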

In [None]:
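#skewness
ggplot(train,aes(SalePrice))+geom_density()

#we take log(SalePrice) as our output since its distribution is closer to a normal distribution.
#the placeholder SalePrice of 0 in the test rows becomes -Inf here, but that column is dropped from the test split later.
df$SalePrice=log(df$SalePrice)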

In [None]:
#New distribution
ggplot(train,aes(log(SalePrice)))+geom_density()

# Lasso Regression
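
The model-fitting calls in this section pass the data frames through `data.matrix()`. The sketch below, on a made-up data frame, shows what that function does with a factor column: each level is replaced by its integer code, in the alphabetical order of the levels.

```r
# Hypothetical toy data with one factor and one numeric column
toy <- data.frame(
  Street  = factor(c("Pave", "Grvl", "Pave")),
  LotArea = c(8450, 9600, 11250)
)

# data.matrix() returns a numeric matrix; factors become level codes
m <- data.matrix(toy)

# Levels are sorted alphabetically, so Grvl = 1 and Pave = 2
street_codes <- unname(m[, "Street"])
```

This means unordered factors enter the model as a single numeric column of level codes, which is worth keeping in mind when reading the coefficients.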

In [None]:
#split df to train and test
df_train=df[1:1460,]
df_test=df[1461:2919,]

df_test$SalePrice=NULL #drop the placeholder target; the test data has one column fewer than the training data.

#dividing x and y values and creating matrix for lasso reg.
y=df_train%>%select(SalePrice)
x=df_train%>%select(-SalePrice)

#grid of candidate lambda values for cross-validation
lambdas=10^seq(-3,3,length.out=100)

lasso.cv=cv.glmnet(data.matrix(x),data.matrix(y), alpha=1, lambda=lambdas, nfolds = 10 )
lasso_best=glmnet(data.matrix(x),data.matrix(y),alpha = 1, lambda = lasso.cv$lambda.min)
lasso_best_pred=exp(predict(lasso_best,data.matrix(df_test)))

lasso_best_pred_sub=data.frame(Id=test$Id,SalePrice=lasso_best_pred)
colnames(lasso_best_pred_sub)= c("Id","SalePrice")
write.csv(lasso_best_pred_sub,file = "lasso_best_pred_sub11.csv", row.names = F)

# Ridge Regression

In [None]:
ridge.cv=cv.glmnet(data.matrix(x),data.matrix(y), alpha=0, lambda=lambdas, nfolds = 10 )
ridge_best=glmnet(data.matrix(x),data.matrix(y),alpha = 0, lambda = ridge.cv$lambda.min)
ridge_best_pred=exp(predict(ridge_best,data.matrix(df_test)))

ridge_best_pred_sub=data.frame(Id=test$Id,SalePrice=ridge_best_pred)
colnames(ridge_best_pred_sub)= c("Id","SalePrice")
write.csv(ridge_best_pred_sub,file = "ridge_best_pred_sub11.csv", row.names = F)