# House Prices Part 2

In this notebook we build a more accurate linear regression model by using more of the features in the house sales data.

We will also show how Python can be used for data exploration, data transformation, and machine learning. These techniques are key to building intelligent applications.

In [2]:
# Start up turi and load the data
import turicreate as turi
houses = turi.SFrame('home_data.sframe')

### Task 1: Selection and Summary Statistics
- Take the sales data,
- select only the houses in the zip code with the highest average house price,
- compute the average price of those houses.
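
The steps above amount to a group-by on zip code followed by a mean. The following is a minimal plain-Python sketch of that idea, not the Turi Create code used in this notebook; the `sales` rows and the helper `mean_price_by_zipcode` are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical toy rows standing in for the sales data (not the real dataset).
sales = [
    {"zipcode": "98001", "price": 300000.0},
    {"zipcode": "98039", "price": 2000000.0},
    {"zipcode": "98039", "price": 2400000.0},
    {"zipcode": "98001", "price": 340000.0},
    {"zipcode": "98004", "price": 1300000.0},
]

def mean_price_by_zipcode(rows):
    """Return {zipcode: mean price} for a list of row dicts."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row["zipcode"]] += row["price"]
        counts[row["zipcode"]] += 1
    return {zipcode: totals[zipcode] / counts[zipcode] for zipcode in totals}

# Step 1: average price per zip code.
averages = mean_price_by_zipcode(sales)

# Step 2: the zip code whose average price is highest.
top_zipcode = max(averages, key=averages.get)

# Step 3: select only that zip code's rows and average their prices.
top_rows = [row for row in sales if row["zipcode"] == top_zipcode]
top_mean = sum(row["price"] for row in top_rows) / len(top_rows)

print(top_zipcode, top_mean)
```

Computing the per-group averages first avoids having to read the winning zip code off a plot by eye.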

In [3]:
turi.plot(houses['zipcode'], houses['price'])
# The zip code with the highest average price is 98039

<turicreate.visualization._plot.Plot at 0x7f2f6d0977f0>

In [4]:
houses['price'][houses['zipcode'] == '98039'].mean()

2160606.5999999996

### Task 2: Filtering Data
Filtering is another word for selecting data.
- Use logical filters to select rows of an SFrame.
- Select the houses with `sqft_living` higher than 2,000 but no larger than 4,000.
- Determine the fraction of all houses that have `sqft_living` within this range.
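
A logical filter is a sequence of booleans, one per row, that marks which rows to keep. The sketch below shows the idea in plain Python on made-up values; the list `sqft_living` here is hypothetical toy data, not the column from the real dataset.

```python
# Hypothetical living areas in square feet (toy values, not the real data).
sqft_living = [1180, 2570, 770, 1960, 4000, 5420, 2000, 3100]

# One boolean per row: True when the area is above 2000 and at most 4000.
mask = [2000 < sqft <= 4000 for sqft in sqft_living]

# Keep only the rows whose mask entry is True.
selected = [sqft for sqft, keep in zip(sqft_living, mask) if keep]

# Fraction of all rows that fall inside the range.
fraction = len(selected) / len(sqft_living)

print(selected, fraction)
```

Note that the lower bound is strict and the upper bound is inclusive, so a house of exactly 2,000 sq. ft. is excluded while one of exactly 4,000 sq. ft. is kept.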

In [10]:
subset = houses[(2000 < houses['sqft_living']) & (houses['sqft_living'] <= 4000)]

In [11]:
print(subset.num_rows())
print(houses.num_rows())

9118
21613


In [12]:
print(subset.num_rows() * 1.0 / houses.num_rows())

0.42187572294452413


# Build an Advanced Regression Model

In [13]:
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',      # condition of the house
    'grade',          # measure of quality of construction
    'waterfront',     # waterfront property
    'view',           # type of view
    'sqft_above',     # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',       # the year built
    'yr_renovated',   # the year renovated
    'lat', 'long',    # the lat-long of the parcel
    'sqft_living15',  # average living-area sq. ft. of the 15 nearest neighbors
    'sqft_lot15',     # average lot size of the 15 nearest neighbors
]

In [15]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

In [16]:
# Test / Train split
train_data, test_data = houses.random_split(0.8, seed=0)
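
To illustrate what a seeded train/test split does, here is a minimal plain-Python sketch. It assumes a simple per-row random assignment; whether Turi Create's `random_split` uses this exact scheme internally is an assumption, and the function below is a hypothetical stand-in, not the library's implementation.

```python
import random

def random_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))
train, test = random_split(rows, 0.8, seed=0)

# Re-running with the same seed yields the same partition.
train_again, test_again = random_split(rows, 0.8, seed=0)

print(len(train), len(test))
```

Fixing the seed matters here because both models are compared on the same held-out rows; a different seed would give a different test set and different error figures.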

In [18]:
adv_model = turi.linear_regression.create(train_data, target='price',
                                          features=advanced_features,
                                          validation_set=None)

In [19]:
my_model = turi.linear_regression.create(train_data, target='price',
                                         features=my_features,
                                         validation_set=None)

In [22]:
print("Fewer-features model")
print(my_model.evaluate(test_data))
print("Advanced model")
print(adv_model.evaluate(test_data))

Fewer-features model
{'max_error': 3152242.784868988, 'rmse': 180439.07296640595}
Advanced model
{'max_error': 3170363.181382781, 'rmse': 155269.6579279753}
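
For reference, the two metrics reported above can be sketched in plain Python using their standard definitions: RMSE is the square root of the mean squared prediction error, and the maximum error is the largest absolute prediction error. The `evaluate` function and the price lists below are hypothetical illustrations, not Turi Create's implementation or values from the dataset.

```python
import math

def evaluate(actual, predicted):
    """Return max absolute error and RMSE for two equal-length lists."""
    errors = [p - a for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    max_error = max(abs(e) for e in errors)
    return {"max_error": max_error, "rmse": rmse}

# Hypothetical prices in dollars (toy values, not from the dataset).
actual = [300000.0, 450000.0, 1200000.0, 650000.0]
predicted = [330000.0, 410000.0, 1100000.0, 650000.0]

metrics = evaluate(actual, predicted)
print(metrics)
```

Because RMSE averages over all test rows while the maximum error depends on a single worst prediction, a model can improve its RMSE and still have a slightly larger maximum error, as the advanced model does here.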


In [25]:
# Difference in RMSE between the two models
rmse_from_class = 179507.5721  # RMSE quoted in the class; it differs from the my_model RMSE above (180439.07)
diff = rmse_from_class - adv_model.evaluate(test_data)['rmse']
print(diff)
# The result is in dollars ($)

24237.9141720247
