In [75]:
import turicreate as tc
data = tc.SFrame('home_data.sframe')

# Selection and summary statistics

In the notebook we covered in the module, we discovered which neighborhood (zip code) of Seattle had the highest average house sale price.

Now, take the sales data, select only the houses with this zip code, and compute the average price.

In [90]:
# group by zipcode to find the one with the highest average price
zip_avg = data.groupby('zipcode', {'avg_price': tc.aggregate.MEAN('price')})
highest = zip_avg.sort('avg_price', ascending=False).head(1)['zipcode'][0]
print(highest)

# show the mean price there
data[data['zipcode'] == highest]['price'].mean()

98039


2160606.5999999996
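
The group-then-select logic in the cell above can also be sketched without Turi Create. This is only an illustration: the rows are made-up toy values (not taken from `home_data.sframe`), and `mean_price_by_zip` is a hypothetical helper introduced here, not part of any library.

```python
from collections import defaultdict

# Toy rows with made-up values (NOT from home_data.sframe)
rows = [
    {"zipcode": "98039", "price": 2000000.0},
    {"zipcode": "98039", "price": 1000000.0},
    {"zipcode": "98004", "price": 1200000.0},
    {"zipcode": "98004", "price": 800000.0},
    {"zipcode": "98112", "price": 900000.0},
]

def mean_price_by_zip(rows):
    """Hypothetical helper: return {zipcode: mean price}, like groupby + MEAN."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in rows:
        totals[r["zipcode"]] += r["price"]
        counts[r["zipcode"]] += 1
    return {z: totals[z] / counts[z] for z in totals}

# Step 1: average price per zip code
zip_avg = mean_price_by_zip(rows)

# Step 2: zip code with the highest average
highest = max(zip_avg, key=zip_avg.get)

# Step 3: select only the rows in that zip code and average their prices
selected = [r["price"] for r in rows if r["zipcode"] == highest]
mean_selected = sum(selected) / len(selected)

print(highest, mean_selected)
```

The mean computed in step 3 equals the group average found in step 1, so the final filter mainly serves as a cross-check of the selection.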

# Filtering data

One of the key features we used in our model was the number of square feet of living space (‘sqft_living’) in the house.

For this part, we are going to use the idea of filtering (selecting) data.

Using such filters, first select the houses that have ‘sqft_living’ higher than 2000 sqft but no larger than 4000 sqft.

What fraction of all the houses have ‘sqft_living’ in this range?

In [94]:
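# combine two boolean masks with & to keep rows inside the range
# note: >= also keeps houses of exactly 2000 sqft; use > for a strict lower bound
filtered = data[(data['sqft_living'] >= 2000) & (data['sqft_living'] <= 4000)]
fraction = len(filtered) / len(data)
print(fraction)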

0.4266413732475825
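
The task asks for living areas higher than 2000 sqft, while the cell above uses `>=` for the lower bound. The two filters differ only for houses of exactly 2000 sqft. A minimal sketch on made-up values (not the real dataset) shows how the choice of bound can change the fraction:

```python
# Toy living areas in square feet (made-up values, NOT from home_data.sframe)
sqft_living = [1500, 2000, 2500, 3000, 4000, 4500, 1800, 2200]

# Strict lower bound: "higher than 2000 but no larger than 4000"
strict = [s for s in sqft_living if 2000 < s <= 4000]

# Inclusive lower bound: also keeps the house of exactly 2000 sqft
inclusive = [s for s in sqft_living if 2000 <= s <= 4000]

fraction_strict = len(strict) / len(sqft_living)
fraction_inclusive = len(inclusive) / len(sqft_living)

print(fraction_strict, fraction_inclusive)
```

On real data the gap depends on how many houses sit exactly on the boundary, so it may be small or zero.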


# Building a regression model with several more features

In the sample notebook, we built two regression models to predict house prices, one using just ‘sqft_living’ and the other one using a few more features:

In [96]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

Now, going back to the original dataset, you will build a model using the following features:

In [97]:
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',      # condition of house
    'grade',          # measure of quality of construction
    'waterfront',     # waterfront property
    'view',           # type of view
    'sqft_above',     # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',       # the year built
    'yr_renovated',   # the year renovated
    'lat', 'long',    # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',     # average lot size of 15 nearest neighbors
]

In [134]:
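# 80/20 train-test split with seed=0 for reproducibility;
# validation_set=None trains on the full training set without holding out validation rows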
training_set, test_set = data.random_split(.8, seed=0)
simple_model = tc.linear_regression.create(training_set, features=my_features, target='price', validation_set=None)
advanced_model = tc.linear_regression.create(training_set, features=advanced_features, target='price', validation_set=None)
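
Fixing the seed is what makes the split, and therefore the reported errors, reproducible. The sketch below illustrates the idea in plain Python. It is an assumption-laden illustration, not Turi Create's implementation: `random_split` here is a hypothetical stand-alone function that assigns each row to the first part with probability `fraction`, and the library's internal sampling may differ.

```python
import random

def random_split(rows, fraction, seed):
    """Hypothetical sketch: send each row to the first list with
    probability `fraction`, using a seeded generator for repeatability."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in row ids, not real house records

train, test = random_split(rows, 0.8, seed=0)
train_again, test_again = random_split(rows, 0.8, seed=0)

# Same seed, same split; sizes are only approximately 80/20
print(len(train), len(test), train == train_again)
```

With this kind of per-row sampling the split sizes are approximate rather than exact, which is why the training set is usually described as "about 80%" of the data.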

In [135]:
# evaluate both models on the held-out test set
simple_eval = simple_model.evaluate(test_set)
advanced_eval = advanced_model.evaluate(test_set)
print(simple_eval)
print(advanced_eval)
print('Difference in RMSE between the model trained with my_features and the one trained with advanced_features')
print(simple_eval['rmse'] - advanced_eval['rmse'])

{'max_error': 3152242.7848689086, 'rmse': 180439.0729664085}
{'max_error': 3170363.181385746, 'rmse': 155269.65792807937}
Difference in RMSE between the model trained with my_features and the one trained with advanced_features
25169.41503832914
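
To make the two reported metrics concrete, here is a minimal sketch of how RMSE and maximum error are conventionally computed. It assumes the `'rmse'` and `'max_error'` entries follow the standard definitions, and the prices and predictions are made-up toy values rather than outputs of the models above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def max_error(actual, predicted):
    """Largest absolute residual over all rows."""
    return max(abs(a - p) for a, p in zip(actual, predicted))

# Made-up sale prices and predictions from two hypothetical models
actual = [300000.0, 450000.0, 600000.0, 750000.0]
simple_pred = [330000.0, 410000.0, 600000.0, 700000.0]
advanced_pred = [310000.0, 440000.0, 590000.0, 730000.0]

simple_rmse = rmse(actual, simple_pred)
advanced_rmse = rmse(actual, advanced_pred)

print(simple_rmse, max_error(actual, simple_pred))
print(advanced_rmse, max_error(actual, advanced_pred))
print(simple_rmse - advanced_rmse)  # positive when the second model fits better
```

Note that the two metrics need not move together: in the notebook output above, the model trained with `advanced_features` has the lower RMSE but a slightly higher `max_error`.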
