# Project Description

###### In this project we explore and predict housing prices from a variety of features, using the scikit-learn (sklearn) package. The data is in the file 'real_estate_info.csv', downloaded from kaggle.com ('Exercise: Machine Learning Competitions') and renamed from 'test.csv'.

### Set up file paths and import necessary libraries

In [12]:
#Importing relevant libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

#Importing data set
home_data = pd.read_csv('real_estate_info.csv')
home_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


### Set up variables for analysis. X holds the columns of 'real_estate_info.csv' named in the list features; y is SalePrice, the variable we are predicting. Then set up a random forest regression

In [100]:
y = home_data.SalePrice

#List of features. Start with LotArea, YearBuilt, 1stFlrSF, 2ndFlrSF, FullBath, BedroomAbvGr, TotRmsAbvGrd
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
#Variable X
X = home_data[features]
print(X.head())

#Split variables using train_test_split into training data and validation data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)
features

   LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
0     8450       2003       856       854         2             3   
1     9600       1976      1262         0         2             3   
2    11250       2001       920       866         2             3   
3     9550       1915       961       756         1             3   
4    14260       2000      1145      1053         2             4   

   TotRmsAbvGrd  
0             8  
1             6  
2             6  
3             7  
4             9  


['LotArea',
 'YearBuilt',
 '1stFlrSF',
 '2ndFlrSF',
 'FullBath',
 'BedroomAbvGr',
 'TotRmsAbvGrd']

In [7]:
#Define random forest (rf) model
rf_model = RandomForestRegressor(random_state = 1)

#Fit the model
rf_model.fit(train_X, train_y)

### Explore predictions and mean absolute error of rf_model

In [11]:
#Predictions
rf_predictions = rf_model.predict(val_X)

#Mean absolute error; sklearn's argument order is (y_true, y_pred)
rf_mae = mean_absolute_error(val_y, rf_predictions)
print(f'Validation Mean Absolute Error for Random Forest Model: {rf_mae}')

Validation Mean Absolute Error for Random Forest Model: 21857.15912981083
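To make the metric concrete, here is a minimal sketch of what mean absolute error computes, written in plain Python. The helper name `mean_abs_error` and the three toy prices are assumptions made for illustration; they are not part of the notebook or of sklearn. The sketch also shows that the value does not depend on which argument is passed first, although sklearn's signature is `(y_true, y_pred)`.

```python
def mean_abs_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy sale prices (illustrative values, not taken from the data set)
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 270000]

# Absolute errors are 10000, 10000 and 30000, so the mean is 50000 / 3
toy_mae = mean_abs_error(actual, predicted)
print(f'Toy MAE: {toy_mae:.2f}')

# Swapping the arguments gives the same value because |t - p| == |p - t|
print(mean_abs_error(predicted, actual) == toy_mae)
```

An MAE of about 21857 therefore means the model's validation predictions are off by roughly that many dollars on average.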


## Analyze data using a Decision Tree Regressor model

##### Test a variety of tree sizes to find which results in lowest MAE, and compare MAE to MAE from rf_model

In [21]:
#define function get_mae to return MAE of the model using Decision Tree Model
def get_mae(tree_size, train_X, val_X, train_y, val_y):
    dtr_model = DecisionTreeRegressor(max_leaf_nodes = tree_size, random_state = 1)
    dtr_model.fit(train_X, train_y)
    predictions = dtr_model.predict(val_X)
    dtr_mae = mean_absolute_error(val_y, predictions)
    return(dtr_mae)

#Loop through different tree sizes to evaluate lowest MAE
for i in [5, 50, 100, 500, 1000, 5000]:
    mae = get_mae(i, train_X, val_X, train_y, val_y)
    print(f'Max Leaf Nodes: {i} \t MAE: {mae}')

Max Leaf Nodes: 5 	 MAE: 35044.51299744237
Max Leaf Nodes: 50 	 MAE: 27405.930473214907
Max Leaf Nodes: 100 	 MAE: 27282.50803885739
Max Leaf Nodes: 500 	 MAE: 28357.63027292342
Max Leaf Nodes: 1000 	 MAE: 28933.530593607305
Max Leaf Nodes: 5000 	 MAE: 28942.75890410959


In [43]:
#Lowest MAE in this loop occurs at Max Leaf Nodes = 100, so the true minimum may lie between 50 and 500.
#Rerun the loop over every size in this range and keep only the lowest MAE
min_mae = 27282.50803885739 #Set this as default minimum, value of MAE at Max Leaf Nodes = 100
for i in range(50, 501):
    mae = get_mae(i, train_X, val_X, train_y, val_y)
    if min_mae > mae:
        min_mae = mae
print(min_mae)

26704.033546536175


##### We can stop here: the lowest decision tree MAE (26704.03) is still higher than the random forest MAE (21857.16). Had it been lower, we could have adjusted the code to report which Max Leaf Nodes value produced it, but this is not necessary. By this measure the decision tree model is less accurate than the random forest model.
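For completeness, recovering the tree size together with its MAE could look like the sketch below. The helper `best_tree_size` is a hypothetical name, and the stand-in scorer simply looks up the six MAE values printed by the first loop, so the block runs without the data set.

```python
def best_tree_size(candidate_sizes, score_fn):
    """Return (size, score) for the candidate with the lowest score."""
    scores = {size: score_fn(size) for size in candidate_sizes}
    best = min(scores, key=scores.get)
    return best, scores[best]

# MAE values copied from the printed output of the first loop
printed_mae = {
    5: 35044.51299744237,
    50: 27405.930473214907,
    100: 27282.50803885739,
    500: 28357.63027292342,
    1000: 28933.530593607305,
    5000: 28942.75890410959,
}

best_size, best_mae = best_tree_size(printed_mae, printed_mae.get)
print(f'Best Max Leaf Nodes: {best_size} \t MAE: {best_mae}')
```

In the notebook itself, the scorer would instead be `lambda n: get_mae(n, train_X, val_X, train_y, val_y)` and the candidates `range(50, 501)`, which also removes the need for a hard-coded starting minimum.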

## Continuing using the Random Forest Regression Model, using a wider variety of features

###### Many columns contain categorical data instead of numerical data. To use them in the model, we first encode each category as an integer code
##### We will start by mapping Utilities, Neighborhood, HouseStyle, RoofMatl, BsmtFinType2, GarageType, SaleCondition to numerical data

In [57]:
#to_dict takes a column name from home_data as input and generates a dictionary where each unique
#piece of categorical data corresponds to a unique number
def to_dict(col):
    cat_list = home_data[col].tolist()
    cat_dict = {}
    num_val = 1
    for n in cat_list:
        if n not in cat_dict:
            cat_dict[n] = num_val
            num_val += 1
    return(cat_dict)

In [92]:
#Map each column of categorical data described above to numerical data by calling function to_dict
home_data['Utilities'] = home_data['Utilities'].map(to_dict('Utilities'))
home_data['Neighborhood'] = home_data['Neighborhood'].map(to_dict('Neighborhood'))
home_data['HouseStyle'] = home_data['HouseStyle'].map(to_dict('HouseStyle'))
home_data['RoofMatl'] = home_data['RoofMatl'].map(to_dict('RoofMatl'))
home_data['BsmtFinType2'] = home_data['BsmtFinType2'].map(to_dict('BsmtFinType2'))
home_data['GarageType'] = home_data['GarageType'].map(to_dict('GarageType'))
home_data['SaleCondition'] = home_data['SaleCondition'].map(to_dict('SaleCondition'))
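The mapping above assigns integers in first-seen order, which gives the categories a numeric order they do not naturally have. As a point of comparison, here is a minimal sketch of two built-in pandas alternatives: `pd.factorize`, which produces the same kind of integer codes, and `pd.get_dummies`, which creates one indicator column per category. The frame `toy` and its values are assumptions made for illustration, not rows from 'real_estate_info.csv'.

```python
import pandas as pd

# Toy frame standing in for home_data; the values are illustrative only
toy = pd.DataFrame({
    'GarageType': ['Attchd', 'Detchd', 'Attchd', None],
    'LotArea': [8450, 9600, 11250, 9550],
})

# Integer codes in first-seen order (0-based); missing values are coded as -1
codes, categories = pd.factorize(toy['GarageType'])
print(codes.tolist(), list(categories))

# One indicator column per category; a row with a missing value is all zeros
encoded = pd.get_dummies(toy, columns=['GarageType'])
print(encoded.columns.tolist())
```

Whether indicator columns actually improve the random forest here is not something this sketch shows; it would need to be checked against the validation MAE.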

In [101]:
#Features that will be added to total features to be analyzed
new_features = ['Utilities', 'Neighborhood', 'HouseStyle', 'RoofMatl', 'BsmtFinType2', 'GarageType', 'SaleCondition']

#Extending previous features list to include new features
features.extend(new_features)
features

['LotArea',
 'YearBuilt',
 '1stFlrSF',
 '2ndFlrSF',
 'FullBath',
 'BedroomAbvGr',
 'TotRmsAbvGrd',
 'Utilities',
 'Neighborhood',
 'HouseStyle',
 'RoofMatl',
 'BsmtFinType2',
 'GarageType',
 'SaleCondition']

In [106]:
#defining variables
y2 = home_data.SalePrice
X2 = home_data[features]

#Split up data into training and validation data
train_X2, val_X2, train_y2, val_y2 = train_test_split(X2, y2, random_state = 1)

In [107]:
#The model was defined earlier; calling fit again retrains it from scratch on the new variables
rf_model.fit(train_X2, train_y2)

In [108]:
#Test predictions with validation data
rf_predictions2 = rf_model.predict(val_X2)

#Find mean absolute error; sklearn's argument order is (y_true, y_pred)
rf_mae2 = mean_absolute_error(val_y2, rf_predictions2)
print(f'Validation Mean Absolute Error for Random Forest Model: {rf_mae2}')

Validation Mean Absolute Error for Random Forest Model: 20911.140589041097


###### Adding the encoded categorical columns to the features list yields a better model on this validation split: the new MAE (20911.14) is lower than the previous rf_mae (21857.16). The difference is small, about 946 or roughly 4%, but it is still an improvement.
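The size of these gaps can be checked with a few lines of arithmetic. The sketch below uses the three MAE values printed in this notebook; the helper name `relative_improvement` is an assumption made for illustration.

```python
# MAE values copied from the notebook output above
rf_mae = 21857.15912981083          # random forest, 7 numeric features
rf_mae2 = 20911.140589041097        # random forest, 14 features
best_tree_mae = 26704.033546536175  # best decision tree found in range(50, 501)

def relative_improvement(baseline, candidate):
    """Percentage by which candidate's error is lower than baseline's."""
    return (baseline - candidate) / baseline * 100

rf_gain = relative_improvement(rf_mae, rf_mae2)
forest_vs_tree = relative_improvement(best_tree_mae, rf_mae)

print(f'Extra features vs. first random forest: {rf_gain:.2f}% lower MAE')
print(f'First random forest vs. best decision tree: {forest_vs_tree:.2f}% lower MAE')
```

Both figures come from a single train/validation split with random_state = 1, so the smaller gap in particular could shift with a different split.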