### <div align='center'> Model Selection </div>

----

What kinds of relationships exist between the housing features and the target (sale price)? Given the standard real-estate refrain of "location, location, location", the expectation is that a purely linear model will not be appropriate, since neighborhood housing values have no obvious linear relationship with latitude and longitude. Before examining our model choices, let's glance at the data to develop a better intuition about the types of relationships we'll be trying to model.

In [1]:
from datetime import timedelta, datetime
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import os
import pickle
import pymongo
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

from utils import *

In [2]:
ds, target = build_dataset()



### Latitude/Longitude
----

Let's visualize the relationship between location and price.

In [None]:
# Remove data points with missing locations (missing values appear to be encoded as -9999).
has_location = (ds[:, 7] > -9999) & (ds[:, 8] > -9999)
ds = ds[has_location]
target = target[has_location]

# Keep only sales after January 1, 2018.
recent = ds[:, 5] > datetime(year=2018, month=1, day=1)
ds = ds[recent]
target = target[recent]

In [None]:
plt.figure(figsize=(20, 10))
# Color each sale by log price so a few very expensive homes don't dominate the color scale.
log_price = np.log(target)
plt.scatter(ds[:, 7], ds[:, 8], c=log_price, cmap=plt.cm.coolwarm)
plt.colorbar(label='log(Sales Price)')
plt.xlabel('Latitude')
plt.ylabel('Longitude')
plt.title('Housing Sale Prices by Location')
plt.show()

Clearly, price does not vary linearly with latitude or longitude. We should either find a feature mapping that linearizes this relationship or use a non-linear model, especially since location is known to be an extremely important feature.
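To make the "linearizing mapping" option concrete, here is a minimal sketch on synthetic data (not the dataset loaded above). It assumes, purely for illustration, that log price falls off with distance from a single hypothetical city center; the center coordinates, the coefficients, and the helper names `fit_ols` and `r2` are all made up for this example. It compares an in-sample least-squares fit on raw latitude/longitude with one on the derived distance feature.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination, computed by hand."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_ols(X, y):
    """Fit ordinary least squares with an intercept; return in-sample predictions."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef


rng = np.random.default_rng(0)
n = 2000

# Hypothetical city center and synthetic sale locations scattered around it.
center = np.array([47.6, -122.3])
coords = center + rng.uniform(-0.5, 0.5, size=(n, 2))

# Assumed price model: log price decays with distance from the center, plus noise.
dist = np.linalg.norm(coords - center, axis=1)
log_price = 13.0 - 2.0 * dist + rng.normal(0.0, 0.1, size=n)

# Linear model on raw coordinates vs. linear model on the distance feature.
r2_raw = r2(log_price, fit_ols(coords, log_price))
r2_dist = r2(log_price, fit_ols(dist.reshape(-1, 1), log_price))

print(f"R^2 using raw lat/lon:       {r2_raw:.3f}")
print(f"R^2 using distance feature:  {r2_dist:.3f}")
```

Under these assumptions the raw coordinates explain almost none of the variance while the distance feature explains most of it. Real housing data is unlikely to have a single center, which is one argument for a non-linear model such as the `RandomForestRegressor` imported above, since it can learn the spatial structure without a hand-built feature.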


### Square Footage vs Sales Price
----

In [None]:
plt.scatter(ds[:,1], target)
plt.xlabel('Main Structure Square Footage')
plt.ylabel('Sales Price ($)')

In [None]:
plt.scatter(ds[:,2], target)
plt.xlabel('Lot Square Footage')
plt.ylabel('Sales Price ($)')

### Beds & Baths vs Sales Price
----

In [None]:
plt.scatter(ds[:,4], target)
plt.xlabel('# of Beds')
plt.ylabel('Sales Price ($)')

In [None]:
plt.scatter(ds[:,3], target)
plt.xlabel('# of Baths')
plt.ylabel('Sales Price ($)')

### Prior Sales vs Sales Price
----

In [None]:
plt.scatter(ds[:,10], target)
plt.xlabel('Last Sales Price ($)')
plt.ylabel('Sales Price ($)')