### Linear Regression Project - Predicting House Prices

Information about the dataset is available [here](https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627)

Information about the columns is available [here](https://s3.amazonaws.com/dq-content/307/data_description.txt)

Data is available to download [here](https://dsserver-prod-resources-1.s3.amazonaws.com/235/AmesHousing.txt)

#### Imports 

In [2]:
import pandas as pd
import numpy as np


In [5]:
df = pd.read_csv("AmesHousing.txt", sep="\t")

In [7]:
df.head(3)

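| | Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |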


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order            2930 non-null   int64  
 1   PID              2930 non-null   int64  
 2   MS SubClass      2930 non-null   int64  
 3   MS Zoning        2930 non-null   object 
 4   Lot Frontage     2440 non-null   float64
 5   Lot Area         2930 non-null   int64  
 6   Street           2930 non-null   object 
 7   Alley            198 non-null    object 
 8   Lot Shape        2930 non-null   object 
 9   Land Contour     2930 non-null   object 
 10  Utilities        2930 non-null   object 
 11  Lot Config       2930 non-null   object 
 12  Land Slope       2930 non-null   object 
 13  Neighborhood     2930 non-null   object 
 14  Condition 1      2930 non-null   object 
 15  Condition 2      2930 non-null   object 
 16  Bldg Type        2930 non-null   object 
 17  House Style   

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [44]:
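def train_and_test(df, features):
    # `features` must include the target 'SalePrice'; it is split off below.
    X = df[features].drop('SalePrice', axis=1)
    y = df['SalePrice']
    # Hold out 33% of the rows for testing; the fixed seed makes the split reproducible.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    predictions = lr.predict(X_test)
    # scikit-learn's argument order is (y_true, y_pred).
    mse = mean_squared_error(y_test, predictions)
    # Return the RMSE, which is in the same units as SalePrice (dollars).
    return np.sqrt(mse)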

In [22]:
def get_numeric_features(df):
    return df.select_dtypes(include=np.number).columns.to_list()

In [30]:
def select_features():
    return ['Gr Liv Area', 'SalePrice']

In [45]:
living_area_feature = select_features()
train_and_test(df, living_area_feature)

57585.82288777018

With the above-ground living area (`Gr Liv Area`) as the only feature, the linear regression model has an RMSE of about $57,586 on the held-out test set.
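To make the RMSE figure easier to interpret, here is a small worked example of the same calculation done by hand with NumPy. The two prices and predictions are made up for illustration and are not taken from the Ames data.

```python
import numpy as np

# Made-up prices for illustration only; not taken from the Ames data.
y_true = np.array([200_000, 150_000])
y_pred = np.array([210_000, 120_000])

errors = y_pred - y_true      # 10_000 and -30_000
mse = np.mean(errors ** 2)    # (1e8 + 9e8) / 2 = 5e8
rmse = np.sqrt(mse)           # about 22_360.68, in dollars
```

Because the square root undoes the squaring, the RMSE is in the same units as `SalePrice`, so it can be read as a typical prediction error in dollars, with large misses weighted more heavily than small ones.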

### Feature engineering 

* Columns with more than 25% missing values will be dropped
* Columns that leak information about the sale will be dropped
* Data will be formatted: text columns converted to the category type, numerical data scaled, and missing values imputed
* New features will be created by combining original features
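
The first three steps above could be sketched roughly as follows. This is a minimal sketch rather than the notebook's actual implementation: the helper name `transform_features`, the default list of leaking columns, the median imputation and the min-max scaling are all assumptions chosen for illustration. The fourth step is left out because the features to be combined are not named here.

```python
import numpy as np
import pandas as pd


def transform_features(df, missing_threshold=0.25,
                       leak_cols=("Mo Sold", "Yr Sold", "Sale Type", "Sale Condition")):
    """Hypothetical helper sketching the planned cleaning steps."""
    df = df.copy()

    # 1. Drop columns whose share of missing values exceeds the threshold.
    missing_share = df.isnull().mean()
    df = df.drop(columns=missing_share[missing_share > missing_threshold].index)

    # 2. Drop columns assumed to leak information about the sale
    #    (the default list is an assumption, not the author's choice).
    df = df.drop(columns=[c for c in leak_cols if c in df.columns])

    # 3a. Impute remaining numeric gaps with the column median (an assumption).
    numeric_cols = df.select_dtypes(include=np.number).columns.drop(
        "SalePrice", errors="ignore")
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    # 3b. Min-max scale numeric features to [0, 1]; the target is left untouched.
    col_min = df[numeric_cols].min()
    col_range = (df[numeric_cols].max() - col_min).replace(0, 1)
    df[numeric_cols] = (df[numeric_cols] - col_min) / col_range

    # 3c. Convert the remaining non-numeric columns to the category dtype.
    for col in df.select_dtypes(exclude=np.number).columns:
        df[col] = df[col].astype("category")

    return df
```

Calling `transform_features(df)` returns a cleaned copy and leaves the original DataFrame unchanged, so the baseline above can still be re-run on the raw data for comparison.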