# Regression Week 1: Simple Linear Regression



In this notebook we will use data on house sales in King County to predict house prices using simple (one-input) linear regression.

In [1]:
train <- read.csv('kc_house_train_data.csv', header=TRUE)
test <- read.csv('kc_house_test_data.csv', header=TRUE)

## Building a Generic Simple Linear Regression Function

In [2]:
simple_linear_regression <- function(input_feature, output){
    x <- input_feature
    y <- output
    # means of the input and the output
    mean_x <- mean(x)
    mean_y <- mean(y)
    # mean of the product of input and output
    mean_xy <- mean(x * y)
    # mean of x^2
    mean_xx <- mean(x^2)
    # numerator and denominator of the closed-form slope
    numerator <- mean_xy - mean_x * mean_y
    denominator <- mean_xx - mean_x * mean_x
    # slope of the fitted line
    slope <- numerator / denominator
    # intercept of the fitted line
    intercept <- mean_y - slope * mean_x
    final <- list(intercept = intercept, slope = slope)
    return(final)
}
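
For reference, the closed-form expressions used in the function can be recovered by minimizing the sum of squared errors of the line y = a + b\*x. The derivation below is a short sketch of that standard argument; the symbols a (intercept) and b (slope) follow the notation used later in this notebook, and the bar denotes a sample mean.

$$
\begin{aligned}
\frac{\partial}{\partial a}\sum_{i=1}^{N}\left(y_i - a - b\,x_i\right)^2 = 0
&\;\Rightarrow\; a = \bar{y} - b\,\bar{x}, \\
\frac{\partial}{\partial b}\sum_{i=1}^{N}\left(y_i - a - b\,x_i\right)^2 = 0
&\;\Rightarrow\; \overline{xy} - a\,\bar{x} - b\,\overline{x^2} = 0, \\
\text{substituting } a
&\;\Rightarrow\; b = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}.
\end{aligned}
$$

The numerator and denominator here correspond to `numerator` and `denominator` in the function, and the first line gives the `intercept` expression.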

We can test that our function works by passing it data where we know the answer. In particular, we can generate a feature and place the output exactly on a line: output = 1 + 1\*input_feature. Both the slope and the intercept should then be 1.

In [3]:
test_feature <- 1:5
test_output <- (1+1*test_feature)
test_ <- simple_linear_regression(test_feature,test_output)
test_intercept <- test_$intercept
test_slope <- test_$slope
paste('Intercept:',test_$intercept,' and Slope:',test_$slope)
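
As an additional sanity check, the same test data can be fitted with R's built-in `lm()`, which performs ordinary least squares. This is only a sketch and not part of the original exercise; the names `check_feature`, `check_output`, `check_fit`, and `check_coef` are hypothetical and chosen to avoid clashing with the variables above.

```r
# Cross-check sketch: fit the same straight-line data with R's built-in lm().
# check_feature and check_output are hypothetical names mirroring the test data.
check_feature <- 1:5
check_output <- 1 + 1 * check_feature

# lm() fits ordinary least squares; coef() returns c(intercept, slope).
check_fit <- lm(check_output ~ check_feature)
check_coef <- unname(coef(check_fit))
print(check_coef)
```

Both coefficients should agree with the closed-form estimates up to floating-point rounding.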

In [4]:
final_ <- simple_linear_regression(train$sqft_living, train$price)
paste('Intercept:',final_$intercept,' and Slope:',final_$slope)

In [5]:
actual_intercept = final_$intercept
actual_slope = final_$slope

## Predicting Values

Now that we have the model parameters (intercept and slope), we can make predictions. Because arithmetic on R vectors is vectorized, it is easy to multiply a vector by a constant and add a constant value. Complete the following function to return the predicted output given the input_feature, slope, and intercept:

In [6]:
get_regression_prediction <- function(input_feature, intercept, slope){
    output <- (intercept+slope*input_feature)
    return(output)
}
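
Because the arithmetic is vectorized, the same function works on a whole vector of inputs at once. The sketch below repeats the function so it runs on its own; the intercept of 50000 and slope of 200 are made-up illustration values, not the fitted parameters, and `demo_sqft` and `demo_prices` are hypothetical names.

```r
# Same definition as above, repeated so this block runs standalone.
get_regression_prediction <- function(input_feature, intercept, slope){
    output <- (intercept + slope * input_feature)
    return(output)
}

# Made-up parameters for illustration only (not the fitted model).
demo_sqft <- c(1000, 2000, 3000)
demo_prices <- get_regression_prediction(demo_sqft, intercept = 50000, slope = 200)
print(demo_prices)  # 250000 450000 650000
```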

Now that we can calculate a prediction given the slope and intercept, let's make one. Use (or alter) the following to find the estimated price for a house with 2650 square feet according to the square feet model we estimated above.

**Quiz Question: Using your slope and intercept from (4), what is the predicted price for a house with 2650 sqft?**

In [7]:
house_sqft = 2650
estimated <- get_regression_prediction(house_sqft,actual_intercept,actual_slope)
paste('The predicted price for a house with 2650 sqft is ',estimated)

## Residual Sum of Squares
Now that we have a model and can make predictions, let's evaluate the model using the Residual Sum of Squares (RSS). Recall that RSS is the sum of the squares of the residuals, and a residual is just the difference between the true output and the predicted output.

Complete the following (or write your own) function to compute the RSS of a simple linear regression model given the input_feature, output, intercept and slope:

In [8]:
residual_sum_square <- function(input_feature,output,intercept,slope){
    predicted <- (intercept+slope*input_feature)
    #calculating the RSS
    residual = output - predicted
    RSS = sum(residual^2)
    return(RSS)
}
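
To see the function produce a non-zero value, here is a small hand-checkable sketch. The data and the line (intercept 0, slope 2) are made up for illustration, and `demo_x`, `demo_y`, and `demo_rss` are hypothetical names; the function is repeated so the block runs on its own.

```r
# Same definition as above, repeated so this block runs standalone.
residual_sum_square <- function(input_feature, output, intercept, slope){
    predicted <- (intercept + slope * input_feature)
    residual <- output - predicted
    RSS <- sum(residual^2)
    return(RSS)
}

# Made-up data: predictions are 2, 4, 6, so the residuals are 0, 0, 1.
demo_x <- c(1, 2, 3)
demo_y <- c(2, 4, 7)
demo_rss <- residual_sum_square(demo_x, demo_y, intercept = 0, slope = 2)
print(demo_rss)  # 1
```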

In [9]:
#the RSS for test will be 0
residual_sum_square(test_feature,test_output,test_intercept,test_slope)

Now use your function to calculate the RSS on the training data for the square feet model calculated above.

**Quiz Question: According to this function and the slope and intercept from the square feet model, what is the RSS for the simple linear regression using square feet to predict prices on TRAINING data?**

In [10]:
rss_price_on_sqft <- residual_sum_square(train$sqft_living, train$price, actual_intercept, actual_slope)
paste('The RSS for predicting price based on sqft is',format(rss_price_on_sqft, scientific = TRUE))

## Predict the square feet given price

What if we want to predict the square feet given the price? Since we have the equation y = a + b\*x, we can solve it for x: if we have the intercept (a), the slope (b), and the price (y), the estimated square feet is x = (y - a)/b.

Complete the following function to compute the inverse regression estimate, i.e. predict the input_feature given the output.

In [11]:
inverse_regression <- function(output, intercept, slope){
    input_feature <- (output-intercept)/slope
    return(input_feature)
}
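
One way to convince yourself the inversion is right is a round trip: predict outputs from known inputs, then invert them and check that the inputs come back. The sketch below repeats both functions so it runs standalone; the intercept of 10 and slope of 2 are made-up values, and `demo_input`, `demo_output`, and `recovered_input` are hypothetical names.

```r
# Same definitions as in the notebook, repeated so this block runs standalone.
get_regression_prediction <- function(input_feature, intercept, slope){
    return(intercept + slope * input_feature)
}
inverse_regression <- function(output, intercept, slope){
    return((output - intercept) / slope)
}

# Made-up parameters for illustration only.
demo_input <- c(1000, 2000)
demo_output <- get_regression_prediction(demo_input, intercept = 10, slope = 2)
recovered_input <- inverse_regression(demo_output, intercept = 10, slope = 2)
print(recovered_input)  # 1000 2000
```

Note that this inverts the fitted line algebraically; it is generally not the same as fitting a separate regression of square feet on price.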

Now that we have a function to compute the square feet given the price from our simple regression model, let's see how big we might expect a house that costs $800,000 to be.

**Quiz Question: According to this function and the regression slope and intercept from (4), what is the estimated square feet for a house costing $800,000?**

In [12]:
my_house_price = 800000
sqft_output = inverse_regression(my_house_price,actual_intercept,actual_slope)
paste('The Estimated sqft for a house costing $800,000 is ',sqft_output)

## New Model: estimate prices from bedrooms

We have made one model for predicting house prices using square feet, but there are many other features in the sales data.
Use your simple linear regression function to estimate the regression parameters for predicting prices based on the number of bedrooms. Use the training data.

In [17]:
#Estimating the slope and intercept for price and bedrooms
bdrmfinal <- simple_linear_regression(train$bedrooms,train$price)
bdrm_intercept <- bdrmfinal$intercept
bdrm_slope <- bdrmfinal$slope
paste('Intercept:',bdrmfinal$intercept,' and Slope:',bdrmfinal$slope)

## Test your Linear Regression Algorithm

Now we have two models for predicting the price of a house. How do we know which one is better? Calculate the RSS on the TEST data (remember that this data wasn't involved in learning the model). Compute the RSS from predicting prices using bedrooms and from predicting prices using square feet.

**Quiz Question: Which model (square feet or bedrooms) has the lowest RSS on TEST data? Think about why this might be the case.**

In [14]:
#RSS for price and Square feet
price.sqft <- simple_linear_regression(train$sqft_living,train$price)
Rss.price.sqft <- residual_sum_square(test$sqft_living, test$price, price.sqft$intercept, price.sqft$slope)
paste('The RSS for Price and Square feet is ',format(Rss.price.sqft,scientific = TRUE))

In [15]:
#Rss for price and bedrooms
price.bedroom <- simple_linear_regression(train$bedrooms,train$price)
Rss.price.bedroom <- residual_sum_square(test$bedrooms, test$price, price.bedroom$intercept, price.bedroom$slope)
paste('The RSS for Price and Bedrooms is ',format(Rss.price.bedroom,scientific = TRUE))