### Multiple Linear Regression

In this practice, we will use the same data set as in the simple linear regression practice. We will add more variables to the models to see whether they give us a better linear model. 

#### Read the data

Load the Framingham dataset from the directory '/datasets/framingham/'. The following few lines are the same as in the simple linear regression practice; they recreate the same data here. 

In [None]:
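# read the Framingham data and derive pulse pressure (systolic minus diastolic)
fr <- read.csv("../../../datasets/framingham/framingham.csv")
fr["pulseP"] <- fr$sysBP - fr$diaBP
# keep adults who are not on blood-pressure medication, split by sex;
# select= keeps a subset of columns by position
fr_male   <- subset(fr, male==1 & age > 18 & BPMeds == 0, select=c(2,11:14,17))
fr_female <- subset(fr, male==0 & age > 18 & BPMeds == 0, select=c(2,11:14,17))
head(fr_male)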

**Activity 1:** Now, let's see if we can model pulse pressure with multiple independent variables. 

In [None]:
pp_female1 <- lm(pulseP ~ age, data=fr_female)
summary(pp_female1)

# add BMI to pp_female1 model
pp_female2 <- lm(pulseP ~ <what goes in here>, data=fr_female)
summary(pp_female2)

# add heartRate to pp_female2 model
pp_female3 <- lm(pulseP ~ <what goes in here>, data=fr_female)
summary(pp_female3)

As we can see, $R^2$ increases slightly each time a new variable is added to the model. Let's do the same for males. 

In [None]:
pp_male1 <- lm(pulseP ~ age, data=fr_male)
summary(pp_male1)

pp_male2 <- lm(pulseP ~ <what goes in here>, data=fr_male)
summary(pp_male2)

pp_male3 <- lm(pulseP ~ <what goes in here>, data=fr_male)
summary(pp_male3)


For males, we cannot model the pulse pressure all that well; $R^2$ does not get noticeably better.
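
A small change in $R^2$ is hard to judge on its own, because plain $R^2$ never decreases when a predictor is added, even a useless one. Adjusted $R^2$ and an F-test between nested models are two common ways to check whether an extra variable earns its place. The following is a minimal sketch on simulated data, not on the Framingham data: the data frame `sim` and the columns `pulse`, `age`, and `noise` are made up for illustration.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for the notebook's variables (assumption: not real data)
age   <- runif(n, 30, 70)
noise <- rnorm(n)                          # predictor unrelated to the response
pulse <- 20 + 0.5 * age + rnorm(n, sd = 8)
sim   <- data.frame(pulse, age, noise)

m1 <- lm(pulse ~ age, data = sim)          # smaller model
m2 <- lm(pulse ~ age + noise, data = sim)  # same model plus one extra term

# Plain R^2 cannot go down when a term is added; adjusted R^2 can
r2  <- c(m1 = summary(m1)$r.squared,     m2 = summary(m2)$r.squared)
adj <- c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
print(r2)
print(adj)

# F-test comparing the two nested models
print(anova(m1, m2))
```

The same three calls can be applied to `pp_male1`, `pp_male2`, and `pp_male3` once the blanks above are filled in.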

#### House sales data
Let's look at another data set: house sales in King County.

In [None]:
hs <- read.csv("../../../datasets/house_sales_in_king_county/kc_house_data.csv",header=TRUE)
head(hs)
str(hs)

Let's start by modeling the sale price with the square footage of the house. 

In [None]:
# model the price given square footage of living space.
hs_mreg1 <- lm(<what goes in here>, data=hs)
summary(hs_mreg1)

As we can see, sqft_living is a good predictor for the price. Let's see if we can improve this model with additional variables.

In [None]:
# add the second variable: bedrooms
hs_mreg2 <- lm(price ~ <what goes in here>, data=hs)
summary(hs_mreg2)

# add the third variable: sqft_lot
hs_mreg3 <- lm(price ~ <what goes in here>, data=hs)
summary(hs_mreg3)

# add the fourth variable: floors
hs_mreg4 <- lm(price ~ <what goes in here>, data=hs)
summary(hs_mreg4)

# add the fifth variable: bathrooms
hs_mreg5 <- lm(price ~ <what goes in here>, data=hs)
summary(hs_mreg5)

Adding the number of bedrooms as another variable helped to improve the model, but the other additional variables 
(lot square footage, number of floors, number of bathrooms) did not improve the model at all. Let's try 
a couple of variables that should make a real difference: waterfront and view.

In [None]:
hs_mreg6 <- lm(price ~ sqft_living + bedrooms + waterfront + view, data=hs)
summary(hs_mreg6)

$R^2$ jumped to **0.56**; this is a better model for the price of a house. The other variables (lat, long, zip code, etc.) 
are not really expected to make a difference. A zip code is a label rather than a quantity, so we don't expect a **linear** relationship between a house's 
price and its zip code unless the numeric order of the codes happens to track something meaningful, such as demographics. Let's try it and see.

In [None]:
# add zipcode to hs_mreg6
hs_mreg7 <- lm(price ~ <what goes in here>, data=hs)
summary(hs_mreg7)
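
One caveat about the model above: `zipcode` is read as a number, so `lm` fits a single slope to it, as if a larger code meant a proportionally higher or lower price. When a code is really a label, wrapping it in `factor()` estimates a separate offset per code instead, at the cost of one coefficient per distinct code. The following is a minimal sketch on simulated data; the column `zone`, its four codes, and the price levels are hypothetical and are not taken from the King County file.

```r
set.seed(2)
n <- 300

# Hypothetical area codes: the numbers are labels, not quantities
zone <- sample(c(101, 205, 310, 450), n, replace = TRUE)

# Each zone gets its own price level, unrelated to the size of its code
level <- c("101" = 300, "205" = 650, "310" = 250, "450" = 500)
price <- unname(level[as.character(zone)]) + rnorm(n, sd = 40)
sim   <- data.frame(price = price, zone = zone)

m_num <- lm(price ~ zone, data = sim)          # one slope for the numeric code
m_fac <- lm(price ~ factor(zone), data = sim)  # one offset per zone

r2_num <- summary(m_num)$r.squared
r2_fac <- summary(m_fac)$r.squared
print(c(numeric = r2_num, factor = r2_fac))
```

Because a straight line through the codes is a special case of one mean per code, the factor model always fits at least as well; whether the extra coefficients are worth it is a separate question.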

As we expected, the numeric zipcode does not make much of a difference. How about latitude or longitude? Depending on the geography of King County, they might make a difference. Let's see.

In [None]:
# add lat to the model hs_mreg6
hs_mreg8 <- lm(price ~ <what goes in here>, data=hs)
summary(hs_mreg8)

# add long to the model hs_mreg6
hs_mreg9 <- lm(price ~ <what goes in here>, data=hs)
summary(hs_mreg9)

Latitude made a big difference! $R^2$ is **0.63**. Let's find out why. Take a look at the [King County map](https://www.google.com/maps/place/King+County,+WA/@47.4319563,-122.3574591,9z/data=!3m1!4b1!4m5!3m4!1s0x54905c8c832d7837:0xe280ab6b8b64e03e!8m2!3d47.5480339!4d-121.9836029).
Latitude measures north-south position, so now it should be clear why a north-to-south change in location has an effect on the house price.
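
The $R^2$ values quoted throughout this section are easier to compare when they sit side by side rather than in separate `summary()` printouts. The following is a minimal sketch of one way to collect them, using the built-in `mtcars` data as a stand-in so that it runs without the house sales file; the list name `models` and the table name `comparison` are illustrative choices.

```r
# mtcars ships with R; it stands in for the house data here
models <- list(
  m1 = lm(mpg ~ wt, data = mtcars),
  m2 = lm(mpg ~ wt + hp, data = mtcars),
  m3 = lm(mpg ~ wt + hp + qsec, data = mtcars)
)

# One row per model: number of predictors, R^2, residual standard error
comparison <- data.frame(
  predictors = sapply(models, function(m) length(coef(m)) - 1),
  r2         = sapply(models, function(m) summary(m)$r.squared),
  resid_se   = sapply(models, function(m) summary(m)$sigma)
)
print(round(comparison, 3))
```

Putting `hs_mreg1` through `hs_mreg9` into the list in place of the `mtcars` models would give the same kind of table for the house price models.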