## Week One - Simple Linear Regression

<p>First, import the necessary libraries. Then load the training and test data into pandas DataFrames.</p>

In [21]:
import pandas as pd

In [22]:
training_data = pd.read_csv('kc_house_train_data.csv')
test_data = pd.read_csv('kc_house_test_data.csv')

### Define a function that uses the closed form solution of simple linear regression to find the slope and intercept of our training data

In [23]:
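def simple_linear_regression(input_feature, output):
    # Closed-form least-squares fit for a single input feature.
    length = output.size
    input_sum = input_feature.sum()
    output_sum = output.sum()
    # slope = (sum(x*y) - sum(x)*sum(y)/N) / (sum(x^2) - sum(x)^2/N)
    numerator = (input_feature * output).sum() - ((input_sum * output_sum) / length)
    denominator = (input_feature ** 2).sum() - ((input_sum * input_sum) / length)
    slope = numerator / denominator
    # The fitted line passes through the point of means.
    intercept = output.mean() - (slope * input_feature.mean())
    return intercept, slope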
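
For reference, the closed-form estimates computed by the function above can be written as follows. The symbols are notation assumed here, not taken from the notebook: $x_i$ is the input feature, $y_i$ the output, $N$ the number of rows, $\hat{w}_1$ the slope and $\hat{w}_0$ the intercept.

$$
\hat{w}_1 = \frac{\sum_i x_i y_i - \frac{1}{N}\left(\sum_i x_i\right)\left(\sum_i y_i\right)}{\sum_i x_i^2 - \frac{1}{N}\left(\sum_i x_i\right)^2},
\qquad
\hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}
$$

The numerator and denominator correspond term by term to the two halves of the `slope` expression in the code, and the intercept line matches `output.mean() - slope * input_feature.mean()`.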

In [24]:
intercept, slope = simple_linear_regression(training_data['sqft_living'], training_data['price'])
print(str(intercept) + ' is the intercept, and ' + str(slope) + ' is the slope.')

-47116.0790674 is the intercept, and 281.958839628 is the slope.
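
One way to gain confidence in the closed-form fit is to run it on data where the answer is known and compare it with an independent least-squares routine. The sketch below is a minimal check under stated assumptions: `closed_form_fit`, `toy_x` and `toy_y` are illustrative names with made-up values, not part of the notebook, and `np.polyfit` is used only as a reference.

```python
import numpy as np
import pandas as pd

def closed_form_fit(x, y):
    # Same closed-form formulas as simple_linear_regression (illustrative copy).
    n = float(y.size)
    slope = ((x * y).sum() - x.sum() * y.sum() / n) / ((x ** 2).sum() - x.sum() ** 2 / n)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Made-up data lying exactly on y = 1 + 2x, so the expected fit is known.
toy_x = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
toy_y = 1.0 + 2.0 * toy_x

toy_intercept, toy_slope = closed_form_fit(toy_x, toy_y)

# np.polyfit with degree 1 returns [slope, intercept] from a least-squares fit.
ref_slope, ref_intercept = np.polyfit(toy_x.values, toy_y.values, 1)

print('closed form: intercept=%.6f slope=%.6f' % (toy_intercept, toy_slope))
print('np.polyfit:  intercept=%.6f slope=%.6f' % (ref_intercept, ref_slope))
```

Both lines should report an intercept of 1 and a slope of 2, up to floating-point rounding.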


### Define a function that will make predictions on an input column of data, given the slope and intercept

In [25]:
def get_regression_predictions(input_feature, intercept, slope):
    predicted_output = intercept + (slope*input_feature)
    return predicted_output

In [26]:
predict_example = get_regression_predictions(2650, intercept, slope)
print('For a 2650 sqft house, the predicted price is: ' + str(predict_example))

For a 2650 sqft house, the predicted price is: 700074.845946
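
Because the prediction is plain arithmetic, the same formula should apply to a single number or to a whole pandas column at once. The sketch below illustrates this with hypothetical coefficients and sizes chosen for easy arithmetic; `predict_line` is an illustrative stand-in for `get_regression_predictions`.

```python
import pandas as pd

def predict_line(input_feature, intercept, slope):
    # Same formula as get_regression_predictions; applies element-wise to a Series.
    return intercept + slope * input_feature

# Hypothetical coefficients and inputs, not the values fitted in this notebook.
demo_intercept, demo_slope = 10.0, 2.0
sizes = pd.Series([1.0, 2.0, 3.0])

single = predict_line(2.0, demo_intercept, demo_slope)    # one scalar input
column = predict_line(sizes, demo_intercept, demo_slope)  # a whole column at once

print(single)
print(column.tolist())
```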


### Define a function that will calculate the Residual Sum of Squares given inputs, outputs, and a model

In [27]:
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    # Predict every row at once; the arithmetic is applied element-wise.
    predictions = get_regression_predictions(input_feature, intercept, slope)
    # Residual = observed output minus predicted output.
    error = output - predictions
    RSS = (error ** 2).sum()
    return RSS

In [28]:
rss = get_residual_sum_of_squares(training_data['sqft_living'], training_data['price'], intercept, slope)
print('For the training set the RSS is: ' + str(rss))

For the training set the RSS is: 1.20191835418e+15
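
The RSS is the sum of squared differences between each observed output and the line's prediction, $\sum_i (y_i - (w_0 + w_1 x_i))^2$ in notation assumed here. A small worked example on made-up points can make the calculation concrete; `rss_for_line` and the toy values are illustrative, not from the notebook.

```python
import pandas as pd

def rss_for_line(input_feature, output, intercept, slope):
    # Residual = observed output minus the line's prediction.
    residuals = output - (intercept + slope * input_feature)
    return (residuals ** 2).sum()

# Made-up points, scored against the candidate line y = 1 + 2x.
toy_x = pd.Series([0.0, 1.0, 2.0])
toy_y = pd.Series([1.0, 4.0, 4.0])  # predictions are 1, 3, 5, so residuals are 0, 1, -1

toy_rss = rss_for_line(toy_x, toy_y, 1.0, 2.0)
print(toy_rss)
```

The squared residuals are 0, 1 and 1, so the RSS for this toy line is 2.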


### Invert the linear relationship to predict an input given an output

In [29]:
def inverse_regression_predictions(output, intercept, slope):
    estimated_input = (output - intercept)/slope
    return estimated_input

In [30]:
predict_input_example = inverse_regression_predictions(800000, intercept, slope)
print('The predicted size of a house costing $800,000 is: ' + str(predict_input_example) + ' sqft')

The predicted size of a house costing $800,000 is: 3004.39624516 sqft
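
Since the inverse function only rearranges the line equation, predicting an output and then inverting it should return the original input. The round trip below is a minimal sketch with hypothetical coefficients; `predict_output` and `predict_input` are illustrative stand-ins for the two notebook functions.

```python
def predict_output(input_feature, intercept, slope):
    # Forward direction: input (e.g. sqft) to output (e.g. price).
    return intercept + slope * input_feature

def predict_input(output, intercept, slope):
    # Inverse direction: solve output = intercept + slope * input for the input.
    return (output - intercept) / slope

# Hypothetical coefficients, not the values fitted in this notebook.
demo_intercept, demo_slope = -50000.0, 250.0

price = predict_output(2000.0, demo_intercept, demo_slope)
recovered_sqft = predict_input(price, demo_intercept, demo_slope)

print('price=%.1f recovered_sqft=%.1f' % (price, recovered_sqft))
```

Note that this is the algebraic inverse of the fitted line. Regressing square footage on price directly would generally give a different line.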


### Finally, train on the number of bedrooms, get the RSS and compare both models

In [31]:
bedroom_intercept, bedroom_slope = simple_linear_regression(training_data['bedrooms'], training_data['price'])

In [32]:
bedroom_rss = get_residual_sum_of_squares(test_data['bedrooms'], test_data['price'], bedroom_intercept, bedroom_slope)
print('For the bedroom test set the RSS is: ' + str(bedroom_rss))

For the bedroom test set the RSS is: 4.93363257576e+14


In [33]:
sqft_test_rss = get_residual_sum_of_squares(test_data['sqft_living'], test_data['price'], intercept, slope)
print('For the sqft test set the RSS is: ' + str(sqft_test_rss))

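For the sqft test set the RSS is: 2.75402933618e+14

<p>The square-footage model has the lower test RSS (about 2.75e+14 versus 4.93e+14 for bedrooms), so on this test set it predicts price better than the bedroom model.</p>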
