# Predicting House Sale Prices

In this project, we'll explore ways to build and improve a linear regression model using housing data for the city of Ames, Iowa, from 2006 to 2010. Information on the dataset can be found [here](https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627), and the column descriptions can be found [here](https://s3.amazonaws.com/dq-content/307/data_description.txt).

We'll start by importing our libraries, reading in our data, and then setting up a pipeline of functions that will help us quickly iterate over different models.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn import linear_model

pd.options.display.max_columns = 999

In [3]:
df = pd.read_csv('AmesHousing.tsv', delimiter='\t')

In [10]:
# Placeholder: returns the DataFrame unchanged for now
def transform_features(df):
    return df

# Keep a single feature plus the target column
def select_features(df):
    return df[['Gr Liv Area', 'SalePrice']]

def train_and_test(df):
    # Holdout validation: the first 1460 rows train the model, the rest test it
    train = df[:1460]
    test = df[1460:]

    # Use only numeric columns as features, excluding the target
    numeric_train = train.select_dtypes(include=['integer', 'float'])
    features = numeric_train.columns.drop('SalePrice')

    lr = linear_model.LinearRegression()
    lr.fit(train[features], train['SalePrice'])
    predictions = lr.predict(test[features])

    # RMSE is in the same units as SalePrice (dollars)
    mse = mean_squared_error(test['SalePrice'], predictions)
    rmse = np.sqrt(mse)

    return rmse


transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)

print("Root Mean Squared Error: ", round(rmse, 2))

Root Mean Squared Error:  57088.25
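
To make the metric concrete, here is a minimal sketch of how RMSE is computed, mirroring the `np.sqrt(mse)` step in `train_and_test`. The three sale prices and predictions are made-up illustrative values, not rows from the Ames data, and the helper name `rmse_from_arrays` is hypothetical.

```python
import numpy as np

def rmse_from_arrays(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Square each error, average the squares, then take the square root
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up values for illustration only
actual_prices = [200000, 150000, 300000]
predicted_prices = [210000, 140000, 320000]

example_rmse = rmse_from_arrays(actual_prices, predicted_prices)
print("Example RMSE:", round(example_rmse, 2))
```

Because the errors are squared before averaging, the single 20,000 miss pulls the result above the average absolute miss of about 13,333. The same effect means a few badly predicted houses can dominate the score reported above.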


## Feature Engineering
