# FIM590: Machine Learning Project 1
## Iowa Housing Prices Modeling
George Armentrout

In this project, information on approximately 2,900 real estate listings in Iowa is used to model the sale price of houses in the region.

Several different models will be employed. All of them are linear regressions, but each uses a different regularization method. Specifically, all regressions use the mean squared error loss function; one model has no regularization, one model uses Ridge Regression with $\lambda = 0.10, 0.30, 0.60$, and one model uses Lasso Regression with $\lambda = 0.02, 0.06, 0.10$.

This process comprises several steps, listed as follows:
1. Investigate the provided data and features and decide which features to use in the regression.
2. Clean the data to contain only the features to be utilized and the associated sale price.
3. Divide the given dataset into the necessary subsets (training, validation, and test sets).
4. Utilizing the package Scikit-Learn, implement the specified models.
5. Train the models using the training set.
6. Utilize the validation set to select the best model.
7. Evaluate the model's performance with the test data set.

### 1. Determining Features for Regression

In the dataset provided, there are 79 features listed with each real estate listing in addition to the sale price of the listing itself. Generally, including more features can allow for a more precise and accurate model. However, including too many features may not be in the best interest of the model. One reason is overfitting, where a model follows the training dataset so closely that it captures that dataset rather than the underlying phenomenon. Additionally, increasing the number of features raises the computational cost of fitting the regression. For this project, 47 features are considered, as follows:

- Lot area (square feet)
- Overall quality (scale: 1 to 10)
- Overall condition (scale: 1 to 10)
- Year built
- Year remodeled (= year built if no remodeling or additions)
- Finished basement (square feet)
- Unfinished basement (square feet)
- Total basement (square feet)
- First floor (square feet)
- Second floor (square feet)
- Living area (square feet)
- Number of full bathrooms
- Number of half bathrooms
- Number of bedrooms
- Total rooms above grade
- Number of fireplaces
- Parking spaces in garage
- Garage area (square feet)
- Wood deck (square feet)
- Open porch (square feet)
- Enclosed porch (square feet)
- Neighborhood (25 features)*
- Basement quality**

\* There are 25 unique neighborhoods in the dataset. Each neighborhood is implemented as a binary feature (1 if the listing is in the specified neighborhood, 0 if not).

\*\* Basement quality is a qualitative metric that is categorized as either 'Excellent,' 'Good,' 'Typical,' 'Fair,' 'Poor,' or 'No basement.' For this regression, this metric is assigned a value of 5, 4, 3, 2, 1, or 0, respectively.
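
The two encodings described in the footnotes can be sketched on a toy table. This is a minimal illustration rather than the notebook's own implementation: the rows are made up, the quality codes (`Ex`, `Gd`, `TA`, `Fa`, `Po`) are assumed to match the spreadsheet, and `pd.get_dummies` is used as one convenient way to build the binary neighborhood columns.

```python
import pandas as pd

# Toy listings; the column names mirror the dataset, the rows are made up.
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "CollgCr", "NAmes", "Veenker"],
    "BsmtQual": ["Gd", "TA", None, "Ex"],  # None stands in for "No basement"
})

# Ordinal encoding of basement quality: Ex=5 ... Po=1, anything else 0.
quality_scale = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}
toy["BsmtQualNum"] = toy["BsmtQual"].map(quality_scale).fillna(0).astype(int)

# One binary column per neighborhood that appears in the data.
dummies = pd.get_dummies(toy["Neighborhood"], dtype=int)
encoded = pd.concat([toy[["BsmtQualNum"]], dummies], axis=1)
print(encoded)
```

Each row has exactly one neighborhood column set to 1, and a listing with no basement receives a quality value of 0.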

### 2. Cleaning the Data

With the features selected above, much of the data in the dataset is superfluous. Additionally, several of the features must be derived from information provided in the dataset. To clean the dataset, first ensure that every feature needed for the regression is calculated and present, either in a pre-existing column or in a new column of the dataframe. The list below maps each feature to its column; entries marked **Added** are new columns.

- Lot area (square feet): "LotArea"
- Overall quality (scale: 1 to 10): "OverallQual"
- Overall condition (scale: 1 to 10): "OverallCond"
- Year built: "YearBuilt"
- Year remodeled (= year built if no remodeling or additions): "YearRemodAdd"
- Finished basement (square feet): **Added "BsmtFinSF"**
- Unfinished basement (square feet): "BsmtUnfSF"
- Total basement (square feet): "TotalBsmtSF"
- First floor (square feet): "1stFlrSF"
- Second floor (square feet): "2ndFlrSF"
- Living area (square feet): "GrLivArea"
- Number of full bathrooms: "FullBath"
- Number of half bathrooms: "HalfBath"
- Number of bedrooms: "BedroomAbvGr"
- Total rooms above grade: "TotRmsAbvGrd"
- Number of fireplaces: "Fireplaces"
- Parking spaces in garage: "GarageCars"
- Garage area (square feet): "GarageArea"
- Wood deck (square feet): "WoodDeckSF"
- Open porch (square feet): "OpenPorchSF"
- Enclosed porch (square feet): "EnclosedPorch"
- Neighborhood (25 features)*: **Added "neighborhoodName"**
- Basement quality**: **Added "BsmtQualNum"**

Please note, typically any bedrooms or bathrooms built below grade (in the basement) are not included in the count of bedrooms and bathrooms in appraisal, and therefore are not included in the number of full/half bathrooms or bedrooms. Similarly, no finished basement square footage is included in the living area feature.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split

housing_data = pd.read_excel("IA_House_Price_Original_Data.xlsx",header=3,index_col=0)
housing_data.head()

# Add Feature "BsmtFinSF" to count the square footage of the Finished Basement
housing_data["BsmtFinSF"] = housing_data["TotalBsmtSF"] - housing_data["BsmtUnfSF"]

# Add Feature "BsmtQualNum" to quantify the basement quality
basement_conditions = [
    (housing_data["BsmtQual"] == "Ex"),
    (housing_data["BsmtQual"] == "Gd"),
    (housing_data["BsmtQual"] == "TA"),
    (housing_data["BsmtQual"] == "Fa"),
    (housing_data["BsmtQual"] == "Po"),
    (housing_data["BsmtQual"] == "NA"),
]

basement_quality_numbers = [5,4,3,2,1,0]

housing_data["BsmtQualNum"] = np.select(basement_conditions,basement_quality_numbers)

# Add the 25 features denoting which neighborhood the listing is in.
housing_data["Blmngtn"] = np.where(housing_data["Neighborhood"]=="Blmngtn", 1,0)
housing_data["Blueste"] = np.where(housing_data["Neighborhood"]=="Blueste", 1,0)
housing_data["BrDale"] = np.where(housing_data["Neighborhood"]=="BrDale", 1,0)
housing_data["BrkSide"] = np.where(housing_data["Neighborhood"]=="BrkSide", 1,0)
housing_data["ClearCr"] = np.where(housing_data["Neighborhood"]=="ClearCr", 1,0)
housing_data["CollgCr"] = np.where(housing_data["Neighborhood"]=="CollgCr", 1,0)
housing_data["Crawfor"] = np.where(housing_data["Neighborhood"]=="Crawfor", 1,0)
housing_data["Edwards"] = np.where(housing_data["Neighborhood"]=="Edwards", 1,0)
housing_data["Gilbert"] = np.where(housing_data["Neighborhood"]=="Gilbert", 1,0)
housing_data["IDOTRR"] = np.where(housing_data["Neighborhood"]=="IDOTRR", 1,0)
housing_data["MeadowV"] = np.where(housing_data["Neighborhood"]=="MeadowV", 1,0)
housing_data["Mitchel"] = np.where(housing_data["Neighborhood"]=="Mitchel", 1,0)
housing_data["Names"] = np.where(housing_data["Neighborhood"]=="NAmes", 1,0)
housing_data["NoRidge"] = np.where(housing_data["Neighborhood"]=="NoRidge", 1,0)
housing_data["NPkVill"] = np.where(housing_data["Neighborhood"]=="NPkVill", 1,0)
housing_data["NridgHt"] = np.where(housing_data["Neighborhood"]=="NridgHt", 1,0)
housing_data["NWAmes"] = np.where(housing_data["Neighborhood"]=="NWAmes", 1,0)
housing_data["OldTown"] = np.where(housing_data["Neighborhood"]=="OldTown", 1,0)
housing_data["SWISU"] = np.where(housing_data["Neighborhood"]=="SWISU", 1,0)
housing_data["Sawyer"] = np.where(housing_data["Neighborhood"]=="Sawyer", 1,0)
housing_data["SawyerW"] = np.where(housing_data["Neighborhood"]=="SawyerW", 1,0)
housing_data["Somerst"] = np.where(housing_data["Neighborhood"]=="Somerst", 1,0)
housing_data["StoneBr"] = np.where(housing_data["Neighborhood"]=="StoneBr", 1,0)
housing_data["Timber"] = np.where(housing_data["Neighborhood"]=="Timber", 1,0)
housing_data["Veenker"] = np.where(housing_data["Neighborhood"]=="Veenker", 1,0)
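
Because each binary column is built from a hand-typed label, a misspelled label silently produces a column of zeros. One way to catch this is to confirm that every listing is flagged by exactly one neighborhood column. The helper below, `check_one_hot`, is a hypothetical name introduced for this sketch, and the toy rows (including the deliberate typo) are made up.

```python
import numpy as np
import pandas as pd

def check_one_hot(df, source_col, dummy_cols):
    """Return the labels in source_col whose rows are not flagged exactly once."""
    row_totals = df[dummy_cols].sum(axis=1)
    return sorted(df.loc[row_totals != 1, source_col].unique())

toy = pd.DataFrame({"Neighborhood": ["Gilbert", "Timber", "SWISU"]})
toy["Gilbert"] = np.where(toy["Neighborhood"] == "Gilbert", 1, 0)
toy["Timber"] = np.where(toy["Neighborhood"] == "Timbr", 1, 0)  # deliberate typo
toy["SWISU"] = np.where(toy["Neighborhood"] == "SWISU", 1, 0)

missed = check_one_hot(toy, "Neighborhood", ["Gilbert", "Timber", "SWISU"])
print(missed)
```

An empty list means every listing was matched; any label that appears points to a comparison string worth rechecking.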

Next, although not technically necessary for the regression, remove the unneeded data from the dataset. This keeps the data organized and intentional and can make analyzing the features in the regression easier.

In [2]:
clean_housing_data = housing_data[["LotArea", "OverallQual", "OverallCond", "YearBuilt", "YearRemodAdd", "BsmtFinSF", 
                                   "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "GrLivArea", "FullBath", "HalfBath",
                                   "BedroomAbvGr", "TotRmsAbvGrd", "Fireplaces", "GarageCars", "GarageArea", "WoodDeckSF",
                                   "OpenPorchSF", "EnclosedPorch", "BsmtQualNum", "Blmngtn", "Blueste", "BrDale", "BrkSide",
                                   "ClearCr","CollgCr","Crawfor","Edwards","Gilbert","IDOTRR","MeadowV","Mitchel","Names","NoRidge",
                                   "NPkVill","NridgHt","NWAmes","OldTown","SWISU","Sawyer","SawyerW","Somerst","StoneBr","Timber",
                                   "Veenker","SalePrice"]]

clean_housing_data.tail()

| Id | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | BsmtFinSF | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | ... | NWAmes | OldTown | SWISU | Sawyer | SawyerW | Somerst | StoneBr | Timber | Veenker | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2915 | 1936 | 4 | 7 | 1970 | 1970 | 0 | 546 | 546 | 546 | 546 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 90500 |
| 2916 | 1894 | 4 | 5 | 1970 | 1970 | 252 | 294 | 546 | 546 | 546 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 71000 |
| 2917 | 20000 | 5 | 7 | 1960 | 1996 | 1224 | 0 | 1224 | 1224 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 131000 |
| 2918 | 10441 | 5 | 5 | 1992 | 1992 | 337 | 575 | 912 | 970 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 132000 |
| 2919 | 9627 | 7 | 5 | 1993 | 1994 | 758 | 238 | 996 | 996 | 1004 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 188000 |


### 3. Partition the Training, Validation, and Test Sets

First, divide the cleaned dataset into the features and the sale price. Then use scikit-learn's train_test_split() function to randomly generate the training, validation, and test sets (1,800, 600, and 519 listings, respectively). Because this function only splits a dataset into two groups, it is called twice: once to separate the training set from the remainder, and once to split the remainder into the validation and test sets.

In [3]:
housing_features = clean_housing_data.drop(columns="SalePrice")
housing_prices = clean_housing_data["SalePrice"]

# Integer sizes request exact row counts: 1800 training, 600 validation, and the remaining 519 test.
x_train, x_rest, y_train, y_rest = train_test_split(housing_features, housing_prices, train_size=1800)
x_validate, x_test, y_validate, y_test = train_test_split(x_rest, y_rest, train_size=600)

# Now, (x_train, y_train), (x_validate, y_validate), (x_test, y_test) represent the three datasets.
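
The two-stage split can be checked on placeholder arrays of the same length. The sketch below is an illustration under assumptions: the arrays are synthetic, and the fixed `random_state` is an addition made here so that the split is repeatable from run to run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 2919
X = np.arange(n_rows * 2).reshape(n_rows, 2)  # placeholder features
y = np.arange(n_rows)                         # placeholder targets, one unique value per row

# First split: 1800 training rows, everything else held back.
x_train, x_rest, y_train, y_rest = train_test_split(
    X, y, train_size=1800, random_state=0)
# Second split: 600 validation rows, the remainder becomes the test set.
x_validate, x_test, y_validate, y_test = train_test_split(
    x_rest, y_rest, train_size=600, random_state=0)

print(len(x_train), len(x_validate), len(x_test))
```

Passing integer sizes avoids any rounding that a fractional `train_size` might introduce, and the three subsets never share a row.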

### 4. Implement the Specified Models

#### Linear Regression
First, consider a traditional linear regression model with no form of regularization. In this model, the cost function is just a standard mean square error function. Specifically, given $n$ features and $m$ points in a dataset, the cost function is:

$$
J(\theta) = \frac{1}{2m}\sum_{i=1}^m(y^{(i)}-h_\theta(x^{(i)}))^2
$$

This model can be implemented utilizing scikit-learn's LinearRegression class.
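
As a small worked example, the cost function can be evaluated directly with NumPy. This sketch assumes the hypothesis is linear, $h_\theta(x) = \theta^T x$, with a leading column of ones in the feature matrix to carry the intercept; the function name `mse_cost` and the three data points are made up for illustration.

```python
import numpy as np

def mse_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((y - X @ theta)^2); X includes a column of ones."""
    m = len(y)
    residuals = y - X @ theta
    return float(residuals @ residuals) / (2 * m)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

cost_perfect = mse_cost(np.array([0.0, 1.0]), X, y)  # theta that fits y = x exactly
cost_zero = mse_cost(np.array([0.0, 0.0]), X, y)     # theta that predicts 0 everywhere
print(cost_perfect, cost_zero)
```

The exact fit gives a cost of 0, while predicting zero everywhere gives $(1 + 4 + 9)/(2 \cdot 3) = 14/6 \approx 2.33$.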


#### Ridge Regression
The ridge regression model is similar to linear regression, but its cost function has an additional term based on the squared magnitude of the $\theta$ values. Specifically, the cost function is:

$$
J(\theta) = \frac{1}{2m}\sum_{i=1}^m(y^{(i)}-h_\theta(x^{(i)}))^2 + \frac{\lambda}{m}\sum_{j=1}^n(\theta_j)^2
$$

Here, the parameter $\lambda$ controls how strongly the additional term is weighted. For this project, three different $\lambda$ values will be tested ($\lambda = 0.10, 0.30, 0.60$). This model will be implemented utilizing scikit-learn's Ridge class.
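
It may help to see how $\lambda$ in the ridge cost function relates to the `alpha` argument of scikit-learn's Ridge class. Setting the gradient of $J(\theta)$ to zero gives $(X^TX + 2\lambda I)\theta = X^Ty$, while Ridge minimizes $\|y - Xw\|^2 + \alpha\|w\|^2$, so under the cost function as written here $\lambda$ would correspond to `alpha = 2 * lambda`. The sketch below checks this on synthetic data and omits the intercept for simplicity; it is an illustration under those assumptions, not a statement about how the values in this project were chosen.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

lam = 0.30
# Closed-form minimizer of J(theta): (X^T X + 2*lam*I) theta = X^T y.
theta_closed = np.linalg.solve(X.T @ X + 2 * lam * np.eye(3), X.T @ y)

# scikit-learn's Ridge minimizes ||y - Xw||^2 + alpha * ||w||^2 (no 1/(2m) factor).
ridge = Ridge(alpha=2 * lam, fit_intercept=False).fit(X, y)
print(np.allclose(theta_closed, ridge.coef_))
```

When $\lambda$ is passed directly as `alpha`, the values are best read as scikit-learn's own penalty strengths.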

#### Lasso Regression
The lasso regression model is very similar to ridge regression, except the additional term utilizes absolute value rather than squaring the coefficients.

$$
J(\theta) = \frac{1}{2m}\sum_{i=1}^m(y^{(i)}-h_\theta(x^{(i)}))^2 + \frac{\lambda}{m}\sum_{j=1}^n|\theta_j|
$$

Similar to the ridge regression, the parameter $\lambda$ establishes sensitivity to the additional term. For this regression, three different $\lambda$ values will be utilized ($\lambda = 0.02,0.06,0.10$). This model will be implemented utilizing scikit-learn's Lasso class.
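
A practical consequence of the absolute-value penalty is that lasso can set coefficients exactly to zero, whereas ridge only shrinks them. The sketch below illustrates this on synthetic data in which only the first two of ten features affect the target; the data and the choice of `alpha = 0.10` for both models are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 1 influence the target; the other eight are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.10).fit(X, y)
ridge = Ridge(alpha=0.10).fit(X, y)

print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```

In this sense lasso also acts as a feature selector, which can be useful when many candidate features are included.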

### 5. Training the Models
Here, the models are implemented with the specified scikit-learn classes and trained on the training set. Each fitted model then predicts the sale prices for the validation set, and the mean squared error of those predictions is computed.

In [10]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error


### Linear Regression
linear = LinearRegression()
linear.fit(x_train,y_train)
linear_pred = linear.predict(x_validate)
linear_error = mean_squared_error(y_validate, linear_pred)
print("Linear Mean Squared Error: ", linear_error)



### Ridge Regression
ridge1 = Ridge(alpha = 0.10)
ridge1.fit(x_train,y_train)
ridge1_pred = ridge1.predict(x_validate)
ridge1_error = mean_squared_error(y_validate, ridge1_pred)
print("Ridge 0.10 Mean Squared Error: ", ridge1_error)

ridge3 = Ridge(alpha = 0.30)
ridge3.fit(x_train,y_train)
ridge3_pred = ridge3.predict(x_validate)
ridge3_error = mean_squared_error(y_validate, ridge3_pred)
print("Ridge 0.30 Mean Squared Error: ", ridge3_error)

ridge6 = Ridge(alpha = 0.60)
ridge6.fit(x_train,y_train)
ridge6_pred = ridge6.predict(x_validate)
ridge6_error = mean_squared_error(y_validate, ridge6_pred)
print("Ridge 0.60 Mean Squared Error: ", ridge6_error)



### Lasso Regression
lasso02 = Lasso(alpha = 0.02)
lasso02.fit(x_train,y_train)
lasso02_pred = lasso02.predict(x_validate)
lasso02_error = mean_squared_error(y_validate, lasso02_pred)
print("Lasso 0.02 Mean Squared Error: ", lasso02_error)

lasso06 = Lasso(alpha = 0.06)
lasso06.fit(x_train,y_train)
lasso06_pred = lasso06.predict(x_validate)
lasso06_error = mean_squared_error(y_validate, lasso06_pred)
print("Lasso 0.06 Mean Squared Error: ", lasso06_error)

lasso10 = Lasso(alpha = 0.10)
lasso10.fit(x_train,y_train)
lasso10_pred = lasso10.predict(x_validate)
lasso10_error = mean_squared_error(y_validate, lasso10_pred)
print("Lasso 0.10 Mean Squared Error: ", lasso10_error)

Linear Mean Squared Error:  676292215.7821465
Ridge 0.10 Mean Squared Error:  676302512.988753
Ridge 0.30 Mean Squared Error:  676335347.4006902
Ridge 0.60 Mean Squared Error:  676412977.666321
Lasso 0.02 Mean Squared Error:  676276723.0564508
Lasso 0.06 Mean Squared Error:  676270698.4284937
Lasso 0.10 Mean Squared Error:  676264680.2460374
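
The fit, predict, and score steps repeat for every model, so they can also be written as a loop. The sketch below additionally standardizes the features with `StandardScaler`, because ridge and lasso penalties depend on the scale of each feature and coordinate descent tends to converge more easily on scaled inputs. The scaling step, the helper name `validation_errors`, and the synthetic data are all additions for illustration and are not part of the run shown above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def validation_errors(models, x_train, y_train, x_validate, y_validate):
    """Fit each model on scaled training data; return {name: validation MSE}."""
    errors = {}
    for name, model in models.items():
        pipeline = make_pipeline(StandardScaler(), model)
        pipeline.fit(x_train, y_train)
        errors[name] = mean_squared_error(y_validate, pipeline.predict(x_validate))
    return errors

models = {"Linear": LinearRegression()}
models.update({f"Ridge {a:.2f}": Ridge(alpha=a) for a in (0.10, 0.30, 0.60)})
models.update({f"Lasso {a:.2f}": Lasso(alpha=a) for a in (0.02, 0.06, 0.10)})

# Synthetic stand-in data with very different feature scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)) * np.array([1.0, 10.0, 100.0, 1000.0, 1.0])
y = X @ np.array([5.0, 0.5, 0.05, 0.005, 0.0]) + rng.normal(size=300)

errors = validation_errors(models, X[:200], y[:200], X[200:], y[200:])
for name, err in errors.items():
    print(f"{name}: {err:.4f}")
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, so no information from the validation set leaks into training.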




### 6. Utilize the Validation Set to Select the Best Model

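Now, select the model with the smallest mean squared error on the validation set. Based on the values computed above, the best model is the Lasso model with $\lambda = 0.10$ (validation MSE of about $6.7626 \times 10^8$), although all seven models differ by only about 0.02%.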
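
The selection can also be done programmatically rather than by eye. The sketch below copies the validation errors printed in the training step into a dictionary (the dictionary itself is an addition here) and picks the entry with the smallest value. It also takes the square root of the best MSE, which expresses the error in the same units as the sale price.

```python
# Validation MSEs copied from the output of the training step.
validation_mse = {
    "Linear": 676292215.7821465,
    "Ridge 0.10": 676302512.988753,
    "Ridge 0.30": 676335347.4006902,
    "Ridge 0.60": 676412977.666321,
    "Lasso 0.02": 676276723.0564508,
    "Lasso 0.06": 676270698.4284937,
    "Lasso 0.10": 676264680.2460374,
}

# The key with the smallest value is the selected model.
best_name = min(validation_mse, key=validation_mse.get)
best_rmse = validation_mse[best_name] ** 0.5
print(best_name, round(best_rmse, 2))
```

For this run, the root mean squared error of the selected model is roughly 26,000 in the units of SalePrice.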

### 7. Evaluating _____ Model with the Test Set

With the best model determined, its performance can be evaluated on the held-out test set, again using mean squared error.

In [11]:
# test_predict = _____.predict(x_test)
# test_error = mean_squared_error(y_test, test_predict)
# print(test_error)
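
A minimal sketch of the test-set evaluation follows. It uses synthetic data and a stand-in Lasso model, since the selected model is left open above; the helper name `evaluate_on_test` is hypothetical. In the notebook, the fitted model chosen in step 6 together with `x_test` and `y_test` would take the place of the stand-ins.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

def evaluate_on_test(model, x_test, y_test):
    """Return (MSE, RMSE) of an already-fitted model on held-out data."""
    mse = mean_squared_error(y_test, model.predict(x_test))
    return mse, float(np.sqrt(mse))

# Synthetic stand-in data: 80 rows for fitting, 40 rows held out.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([4.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=120)

selected = Lasso(alpha=0.10).fit(X[:80], y[:80])  # stand-in for the chosen model
test_mse, test_rmse = evaluate_on_test(selected, X[80:], y[80:])
print(f"Test MSE: {test_mse:.3f}, RMSE: {test_rmse:.3f}")
```

Because the test set plays no part in training or model selection, this error is the fairest available estimate of how the model would perform on new listings.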