# Launch Turi Create

In [1]:
import turicreate

# Load house sales data

In [2]:
sales = turicreate.SFrame('./home_data.sframe/')

In [3]:
sales

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront
7129300520,2014-10-13 00:00:00+00:00,221900.0,3.0,1.0,1180.0,5650.0,1.0,0
6414100192,2014-12-09 00:00:00+00:00,538000.0,3.0,2.25,2570.0,7242.0,2.0,0
5631500400,2015-02-25 00:00:00+00:00,180000.0,2.0,1.0,770.0,10000.0,1.0,0
2487200875,2014-12-09 00:00:00+00:00,604000.0,4.0,3.0,1960.0,5000.0,1.0,0
1954400510,2015-02-18 00:00:00+00:00,510000.0,3.0,2.0,1680.0,8080.0,1.0,0
7237550310,2014-05-12 00:00:00+00:00,1225000.0,4.0,4.5,5420.0,101930.0,1.0,0
1321400060,2014-06-27 00:00:00+00:00,257500.0,3.0,2.25,1715.0,6819.0,2.0,0
2008000270,2015-01-15 00:00:00+00:00,291850.0,3.0,1.5,1060.0,9711.0,1.0,0
2414600126,2015-04-15 00:00:00+00:00,229500.0,3.0,1.0,1780.0,7470.0,1.0,0
3793500160,2015-03-12 00:00:00+00:00,323000.0,3.0,2.5,1890.0,6560.0,2.0,0

view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat
0,3,7.0,1180.0,0.0,1955.0,0.0,98178,47.51123398
0,3,7.0,2170.0,400.0,1951.0,1991.0,98125,47.72102274
0,3,6.0,770.0,0.0,1933.0,0.0,98028,47.73792661
0,5,7.0,1050.0,910.0,1965.0,0.0,98136,47.52082
0,3,8.0,1680.0,0.0,1987.0,0.0,98074,47.61681228
0,3,11.0,3890.0,1530.0,2001.0,0.0,98053,47.65611835
0,3,7.0,1715.0,0.0,1995.0,0.0,98003,47.30972002
0,3,7.0,1060.0,0.0,1963.0,0.0,98198,47.40949984
0,3,7.0,1050.0,730.0,1960.0,0.0,98146,47.51229381
0,3,7.0,1890.0,0.0,2003.0,0.0,98038,47.36840673

long,sqft_living15,sqft_lot15
-122.25677536,1340.0,5650.0
-122.3188624,1690.0,7639.0
-122.23319601,2720.0,8062.0
-122.39318505,1360.0,5000.0
-122.04490059,1800.0,7503.0
-122.00528655,4760.0,101930.0
-122.32704857,2238.0,6819.0
-122.31457273,1650.0,9711.0
-122.33659507,1780.0,8113.0
-122.0308176,2390.0,7570.0


### 1. Selection and summary statistics:  
 - In the notebook we covered in the module, we discovered which neighborhood (zip code) of Seattle had the **highest average house sale price**.  Now, take the sales data, select only the houses with this zip code, and compute the average price.  Save this result to answer the quiz at the end.

In [4]:
# Compare the average sale price for three zip codes: 98039, 98004, 98040
print(sales[sales['zipcode']=='98039']["price"].mean())
print(sales[sales['zipcode']=='98004']["price"].mean())
print(sales[sales['zipcode']=='98040']["price"].mean())

2160606.5999999996
1355927.0977917982
1194230.0035460996
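
The step of finding the zip code with the highest average price can be sketched without Turi Create as a group-by followed by a mean. This is only an illustration in plain Python: `toy_sales` and `mean_price_by_zipcode` are hypothetical names, and the rows are made-up values, not taken from `home_data.sframe`.

```python
from collections import defaultdict

# Hypothetical toy rows of (zipcode, price); not real data from the notebook.
toy_sales = [
    ("98039", 2000000.0),
    ("98039", 1000000.0),
    ("98004", 1200000.0),
    ("98004", 800000.0),
    ("98178", 250000.0),
]


def mean_price_by_zipcode(rows):
    """Return {zipcode: mean price} for an iterable of (zipcode, price) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for zipcode, price in rows:
        totals[zipcode] += price
        counts[zipcode] += 1
    return {zipcode: totals[zipcode] / counts[zipcode] for zipcode in totals}


means = mean_price_by_zipcode(toy_sales)
# The zip code whose mean price is largest.
top_zipcode = max(means, key=means.get)
print(top_zipcode, means[top_zipcode])
```

Note that zip codes are kept as strings here, matching the string comparison `sales['zipcode']=='98039'` used in the cell above.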


### 2. Filtering data:
 - One of the key features we used in our model was the number of square feet of living space (`sqft_living`) in the house. For this part, we are going to use the idea of filtering (selecting) data.
    * In particular, we are going to use logical filters to select rows of an SFrame. You can find more info in the Logical Filter section of this documentation.
    * Using such filters, first select the houses that have `sqft_living` greater than 2000 sqft but no larger than 4000 sqft.
    * What fraction of all houses have `sqft_living` in this range?

In [13]:
# Keep rows with 2000 < sqft_living <= 4000 by combining two boolean masks with &
sales_living_filter = sales[(sales["sqft_living"] > 2000) & (sales["sqft_living"] <= 4000)]
# Fraction of all houses that fall in this range
print(len(sales_living_filter) / len(sales))

0.42187572294452413
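
The boundary handling in the filter above (strictly greater than 2000, at most 4000) is easy to get wrong, so here is a minimal plain-Python sketch of the same mask logic. The values in `toy_sqft_living` and the helper `fraction_in_range` are hypothetical and chosen only to exercise both boundaries.

```python
# Hypothetical living areas in square feet; not real data from the notebook.
toy_sqft_living = [1180.0, 2000.0, 2570.0, 4000.0, 5420.0]


def fraction_in_range(values, low, high):
    """Return the fraction of values v that satisfy low < v <= high."""
    # One boolean per value, analogous to the mask used to index the SFrame.
    mask = [low < v <= high for v in values]
    # True counts as 1, so sum(mask) is the number of selected values.
    return sum(mask) / len(values)


fraction = fraction_in_range(toy_sqft_living, 2000, 4000)
print(fraction)  # 2000.0 is excluded, 4000.0 is included
```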


### 3. Building a regression model with several more features:
- In the sample notebook, we built two regression models to predict house prices, one using just `sqft_living` and the other one using a few more features.

In [6]:
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',      # condition of house
    'grade',          # measure of quality of construction
    'waterfront',     # waterfront property
    'view',           # type of view
    'sqft_above',     # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',       # the year built
    'yr_renovated',   # the year renovated
    'lat', 'long',    # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',     # average lot size of 15 nearest neighbors
]

In [7]:
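# Split roughly 80% / 20% into training and test sets; the fixed seed makes the split repeatable
training_set, test_set = sales.random_split(0.8, seed=0)
# Note: despite its name, my_features_model is trained on advanced_features
my_features_model = turicreate.linear_regression.create(
    training_set,
    target='price',
    features=advanced_features,
    validation_set=None,
)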
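
To illustrate why a fixed seed matters for the split above, here is a rough plain-Python sketch of a seeded random split. It is not Turi Create's implementation: `seeded_split` is a hypothetical helper that assumes each row is assigned independently to the first part with probability `fraction`, so the split sizes are only approximately 80/20.

```python
import random


def seeded_split(rows, fraction, seed):
    """Split rows into two lists; each row lands in the first with probability `fraction`."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


toy_rows = list(range(1000))  # hypothetical row ids
train_a, test_a = seeded_split(toy_rows, 0.8, seed=0)
train_b, test_b = seeded_split(toy_rows, 0.8, seed=0)

# The same seed reproduces exactly the same split.
print(train_a == train_b, len(train_a), len(test_a))
```

Reusing the same seed across runs keeps the test set fixed, which is what makes RMSE values from different models comparable.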

In [8]:
print(my_features_model.evaluate(test_set))

{'max_error': 3170363.181382781, 'rmse': 155269.6579279753}
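
The two metrics reported above can be reproduced by hand. The sketch below assumes the usual definitions (RMSE as the square root of the mean squared residual, max error as the largest absolute residual); the helpers `rmse` and `max_error` and the toy prices are hypothetical and are not taken from the model or the data set.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def max_error(actual, predicted):
    """Largest absolute difference between actual and predicted values."""
    return max(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical prices; the residuals are 0, 0, 60000 and -80000.
toy_actual = [200000.0, 300000.0, 500000.0, 400000.0]
toy_predicted = [200000.0, 300000.0, 440000.0, 480000.0]

print(rmse(toy_actual, toy_predicted))       # 50000.0
print(max_error(toy_actual, toy_predicted))  # 80000.0
```

Because the residuals are squared before averaging, a few large misses dominate RMSE, which is why it can sit far below the max error, as in the output above.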


- What is the difference in RMSE between the model trained with `my_features` and the one trained with `advanced_features`?

In [11]:
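# 180547.56626296483 is a hard-coded value, presumably the test RMSE of the model trained with my_features
180547.56626296483 - float(my_features_model.evaluate(test_set)["rmse"])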

25277.908334989537