# Predicting House Sale Prices

We will work with housing data from Ames, Iowa, United States, covering sales from 2006 to 2010, to predict house sale prices. The columns are described in the [data documentation](https://s3.amazonaws.com/dq-content/307/data_description.txt).

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn import linear_model

In [3]:
housing = pd.read_csv('AmesHousing.tsv', delimiter='\t')

In [4]:
def transform_features(df):
    return df

def select_features(df):
    return df[['Gr Liv Area', 'SalePrice']]

def train_and_test(df):
    
    # Hardcoded split: first 1460 rows for training, the rest for testing
    train = df[:1460]
    test = df[1460:]
    
    # Keep only numeric columns, which linear regression can use directly
    numeric_train = train.select_dtypes(include=['integer', 'float'])
    numeric_test = test.select_dtypes(include=['integer', 'float'])
    
    # SalePrice is the target feature
    features = numeric_train.columns.drop("SalePrice")
    
    # Train model
    lr = linear_model.LinearRegression()
    lr.fit(train[features], train["SalePrice"])
    
    # Tests model
    predictions = lr.predict(test[features])
    mse = mean_squared_error(test["SalePrice"], predictions)
    rmse = np.sqrt(mse)
    
    return rmse

train_and_test(transform_features(select_features(housing))) # Select then transform then evaluate

57088.25161263909
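
The number printed above is the root mean squared error (RMSE), expressed in the same units as `SalePrice` (dollars). As a minimal sketch of what `mean_squared_error` followed by `np.sqrt` computes, here is the same calculation by hand on three made-up prices; the values are hypothetical and not taken from the Ames data.

```python
import math

# Hypothetical sale prices and predictions, in dollars (not from the data set).
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]

# Average the squared errors, then take the square root to get back to dollars.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(round(rmse, 2))
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones.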

## Feature Engineering

Here we will do the following:

* Remove features that we don't want to use in the model, either because they have too many missing values or because they leak information about the sale
* Transform features into the proper format (numerical to categorical, scaling numerical features, filling in missing values, etc.)
* Create new features by combining other features
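
The three steps above can be sketched on a small toy frame before applying them to the real data. This is only an illustration: the column names and values below are hypothetical, and the 5% cutoff and mode-based fill are assumptions chosen to mirror the approach taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative, not from the Ames data.
toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 30 + [1.0] * 10,  # 75% missing
    "few_missing": [2.0] * 39 + [np.nan],          # 2.5% missing
    "year_built": [2000] * 40,
    "year_sold": [2008] * 40,
})

# 1. Remove: drop columns whose share of missing values exceeds 5%.
missing_share = toy.isnull().mean()
toy = toy.drop(columns=missing_share[missing_share > 0.05].index)

# 2. Transform: fill remaining gaps with each column's most frequent value.
toy = toy.fillna(toy.mode().iloc[0])

# 3. Create: combine two columns into a new feature.
toy["age_at_sale"] = toy["year_sold"] - toy["year_built"]

print(toy.columns.tolist())
```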

In [5]:
def transform_features(df):
    '''
    1. Takes in a dataframe and returns a transformed version
    2. First drops any column with more than 5% missing values
    3. For text columns, drops those with any missing values
    4. For numeric columns, fills in missing values with the mode of that column
    5. Combines 'Yr Sold' with 'Year Built' and 'Year Remod/Add' into 'Years Before Sale' and 'Years Since Remod'
    6. Drops non-useful columns or those that leak data about the final sale
    '''
    
    # 2 - drops columns with more than 5% of missing values
    num_missing = df.isnull().sum()
    drop_missing_cols = num_missing[(num_missing > len(df)/20)].sort_values()
    df = df.drop(drop_missing_cols.index, axis=1)
    
    # 3 - drops text columns with any missing values
    text_mv_counts = df.select_dtypes(include=['object']).isnull().sum().sort_values(ascending=False)
    drop_missing_cols_2 = text_mv_counts[text_mv_counts > 0]
    df = df.drop(drop_missing_cols_2.index, axis=1)
    
    # 4 - finds columns with missing values and fills the missing values with the mode
    num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
    fixable_numeric_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
    replacement_values_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_values_dict)
    
    # 5 - Creates new columns derived from existing ones, then drops rows where they are negative
    years_before_sale = df['Yr Sold'] - df['Year Built']
    years_since_remod = df['Yr Sold'] - df['Year Remod/Add']
    df['Years Before Sale'] = years_before_sale
    df['Years Since Remod'] = years_since_remod
    # Row labels are hardcoded for this data set: these rows hold the dirty negative values
    df = df.drop([1702, 2180, 2181], axis=0)

    # 6 - Drops non-useful columns or columns that leak data about the final sale
    df = df.drop(["PID", "Order", "Mo Sold", "Sale Condition", "Sale Type", "Year Built", "Year Remod/Add"], axis=1)
    return df

def select_features(df):
    return df[["Gr Liv Area", "SalePrice"]]

def train_and_test(df):  
    train = df[:1460]
    test = df[1460:]
    
    # `pd.DataFrame.select_dtypes()` returns only the columns of the
    # requested types as a data frame.
    numeric_train = train.select_dtypes(include=['integer', 'float'])
    numeric_test = test.select_dtypes(include=['integer', 'float'])
    
    # `numeric_train.columns` is an Index, so `pd.Index.drop()` removes the target label.
    features = numeric_train.columns.drop("SalePrice")
    lr = linear_model.LinearRegression()
    lr.fit(train[features], train["SalePrice"])
    predictions = lr.predict(test[features])
    mse = mean_squared_error(test["SalePrice"], predictions)
    rmse = np.sqrt(mse)
    
    return rmse

df = pd.read_csv("AmesHousing.tsv", delimiter="\t")
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)

rmse

55275.36731241307

## Feature Selection

Now that we've cleaned and transformed many of the features in the data set, we will move on to selecting which numerical features to use in the model.
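
One common way to screen numerical features is to rank them by their correlation with the target. The sketch below shows the idea on synthetic data, not on the Ames set: the column names, the random data, and the 0.4 cutoff are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: 'area' drives the target, 'noise' does not.
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=200)
toy = pd.DataFrame({
    "area": area,
    "noise": rng.normal(size=200),
    "price": 100 * area + rng.normal(scale=5000, size=200),
})

# Absolute Pearson correlation of each column with the target column.
corr_with_target = toy.corr()["price"].abs().drop("price")

# Keep columns above an illustrative cutoff (0.4 is an assumption, not a rule).
selected = corr_with_target[corr_with_target > 0.4].index.tolist()

print(selected)
```

A correlation cutoff only captures linear relationships with the target, so it is a first filter rather than a final answer.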