<h1>Retail Housing - Price Projections</h1>

<h3>The steps in this notebook are:</h3>

<ol>
<li>Load packages needed
<li>Read data
<li>Analyze data
<li>Feature Engineering
<li>Split data - training vs projection
<li>Train model - Decision Tree
<li>Make projections
<li>Review results
<li>Improve model - Ensemble (Random Forest)
<li>Review results
</ol>

<h3>Data Source and Approach</h3>
This project uses the Ames housing data available on Kaggle. The training file has 81 columns (an Id, 79 descriptive features, and the SalePrice target) covering 1,460 homes in Ames, Iowa, sold between 2006 and 2010. Models are trained on a random 75% of these homes and evaluated on the remaining 25%, which is the default split of <code>train_test_split</code>.

<h3>Step 1 - Load packages needed</h3>

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
print('Done')

<h3>Step 2 - Read data</h3>

In [None]:
# read the training data
home_data_df = pd.read_csv('train.csv')
print('Done')

<h3>Step 3 - Analyze data</h3>

In [None]:
home_data_df.head()

In [None]:
# call describe() (with parentheses) to get the summary statistics
home_data_df.describe()
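
Summary statistics do not show which columns have gaps or non-numeric types, and both matter before choosing features. The sketch below shows one way to check, using a small toy frame (illustrative values, with a hypothetical name toy_df) in place of home_data_df.

```python
import numpy as np
import pandas as pd

# Toy stand-in for home_data_df; values are for illustration only
toy_df = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550],
    'YearBuilt': [2003, 1976, 2001, 1915],
    'LotFrontage': [65.0, np.nan, 68.0, np.nan],
    'SalePrice': [208500, 181500, 223500, 140000],
})

# Count missing values per column, largest first
missing_counts = toy_df.isnull().sum().sort_values(ascending=False)
print(missing_counts[missing_counts > 0])

# Column dtypes: non-numeric columns would need encoding before modeling
print(toy_df.dtypes)
```

Running the same two checks on home_data_df should indicate whether the selected features need imputation or encoding.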


<h3>Step 4 - Feature Engineering</h3>

In [None]:
# Create target object and call it y
#target = ['Id', 'SalePrice']
target = ['SalePrice']
y = home_data_df[target]
print('y ------------------')
print(y.head())

print(" ")

# Create features and call it X
#features = ['Id', 'LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data_df[features]
print('X ------------------')
print(X.head())

print('Done')
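
Selecting the target with a list of column names yields a one-column DataFrame, while selecting with a single name yields a one-dimensional Series. scikit-learn estimators generally expect a one-dimensional target and may warn when given a column vector. A minimal sketch of the difference, on a toy frame with illustrative values:

```python
import pandas as pd

# Toy stand-in for home_data_df; values are for illustration only
toy_df = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'SalePrice': [208500, 181500, 223500],
})

y_frame = toy_df[['SalePrice']]   # list of names -> DataFrame, shape (3, 1)
y_series = toy_df['SalePrice']    # single name  -> Series, shape (3,)
y_flat = y_frame.values.ravel()   # flatten a one-column DataFrame to 1-D

print(y_frame.shape, y_series.shape, y_flat.shape)
```

The notebook keeps y as a DataFrame because later cells access val_y.SalePrice, so flattening at fit time is the less invasive option here.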

<h3>Step 5 - Split data - training vs projection</h3>

In [None]:
# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
print('Done')
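
train_test_split shuffles the rows and, by default, holds out 25% of them for validation. If the evaluation should instead mimic forecasting the most recent sales (for example, train on houses sold before 2010 and evaluate on houses sold in 2010), a chronological split is an alternative. The sketch below assumes the data has a YrSold column and uses a toy frame with illustrative values; the cutoff year and the reduced feature list are assumptions for the example.

```python
import pandas as pd

# Toy stand-in for home_data_df; values are for illustration only
toy_df = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550, 14260, 14115],
    'YearBuilt': [2003, 1976, 2001, 1915, 2000, 1993],
    'YrSold': [2008, 2007, 2010, 2006, 2010, 2009],
    'SalePrice': [208500, 181500, 223500, 140000, 250000, 143000],
})
demo_features = ['LotArea', 'YearBuilt']

# Rows sold in the cutoff year or later form the holdout set
is_holdout = toy_df['YrSold'] >= 2010

train_X = toy_df.loc[~is_holdout, demo_features]
train_y = toy_df.loc[~is_holdout, 'SalePrice']
val_X = toy_df.loc[is_holdout, demo_features]
val_y = toy_df.loc[is_holdout, 'SalePrice']

print(len(train_X), len(val_X))
```

A chronological holdout tends to give a more honest estimate when prices drift over time, at the cost of a validation set whose size depends on how many sales fall in the final year.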

In [None]:
# debugging - inspect the first rows of each split (comment out when not needed)
print('train_X ------------')
print(train_X.head())
print('train_y ------------')
print(train_y.head())
print('val_X   ------------')
print(val_X.head())
print('val_y   ------------')
print(val_y.head())


<h3>Step 6 - Train model - Decision Tree</h3>

In [None]:
# Specify Model
iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)
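
The value max_leaf_nodes=100 is one choice among many: too few leaves underfit and too many overfit. One common way to pick a value is to compare validation MAE across several candidates. The sketch below does this on synthetic data from make_regression; the helper get_mae and the candidate list are hypothetical, not part of the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data stands in for the housing features
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=10.0, random_state=1)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)


def get_mae(max_leaf_nodes, tr_X, va_X, tr_y, va_y):
    """Fit a tree with the given leaf budget and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, model.predict(va_X))


# Hypothetical candidate sizes; widen or narrow the range as needed
candidates = [5, 25, 50, 100]
mae_by_size = {n: get_mae(n, tr_X, va_X, tr_y, va_y) for n in candidates}
best_size = min(mae_by_size, key=mae_by_size.get)

print(mae_by_size)
print('Best max_leaf_nodes:', best_size)
```

Applied to the notebook's train_X, val_X, train_y, and val_y, the same loop would show whether 100 leaves is a reasonable setting for this feature set.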

<h3>Step 7 - Make projections</h3>

In [None]:
# Make validation predictions (the mean absolute error is calculated in Step 8)
val_predictions = iowa_model.predict(val_X)
print(type(val_predictions))
print(val_predictions)
print('Done')

<h3>Step 8 - Review results</h3>

In [None]:
# mean_absolute_error takes the true values first, then the predictions
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE for Decision Tree (max_leaf_nodes=100): {:,.0f}".format(val_mae))
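
Mean absolute error is the average of the absolute differences between actual and predicted prices, so it is expressed in the same units as SalePrice. A small worked example with made-up prices shows that a manual calculation agrees with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices for illustration
actual = np.array([200000, 150000, 310000])
predicted = np.array([210000, 140000, 300000])

# Each prediction is off by 10,000, so the average absolute error is 10,000
manual_mae = np.mean(np.abs(actual - predicted))
sk_mae = mean_absolute_error(actual, predicted)

print(manual_mae, sk_mae)
```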

In [None]:
# Create CSV output for further analysis
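# Look up each validation row's Id in the source data instead of using the row index
output = pd.DataFrame({'Id': home_data_df.loc[val_y.index, 'Id'],
                       'SalePrice': val_y.SalePrice})
output.to_csv('IDsAndPrices - Actual.csv', index=False)

print('Done')

In [None]:
# Keep the validation index so each prediction stays aligned with its Id
val_pred_df = pd.DataFrame(val_predictions, columns=['SalePrice'], index=val_y.index)
print(val_pred_df.head())

output = pd.DataFrame({'Id': home_data_df.loc[val_y.index, 'Id'],
                       'SalePrice': val_pred_df.SalePrice})
output.to_csv('IDsAndPrices - Predict.csv', index=False)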

print('Done')
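
Instead of comparing two CSV files by hand, the actual and predicted prices can be placed side by side in one DataFrame. The sketch below uses small stand-ins (val_y_demo and val_predictions_demo, both hypothetical names with made-up values) for val_y and val_predictions.

```python
import numpy as np
import pandas as pd

# Stand-ins for val_y and val_predictions; values are made up
val_y_demo = pd.DataFrame({'SalePrice': [208500, 140000, 250000]},
                          index=[0, 3, 4])
val_predictions_demo = np.array([200000.0, 150000.0, 245000.0])

# Reuse the validation index so rows line up with the original data
comparison = pd.DataFrame({
    'Actual': val_y_demo['SalePrice'],
    'Predicted': pd.Series(val_predictions_demo, index=val_y_demo.index),
})
comparison['AbsError'] = (comparison['Actual'] - comparison['Predicted']).abs()

# Largest misses first
print(comparison.sort_values('AbsError', ascending=False))
```

Sorting by absolute error puts the worst predictions at the top, which is often where missing features become apparent.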

<h3>Step 9 - Improve model - Ensemble (Random Forest)</h3>

In [None]:
# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state=1)
# Flatten the one-column target to 1-D, which is the shape scikit-learn expects
rf_model.fit(train_X, train_y.values.ravel())


In [None]:
rf_val_predictions = rf_model.predict(val_X)

print('Done')

<h3>Step 10 - Review results</h3>

In [None]:
# mean_absolute_error takes the true values first, then the predictions
rf_val_mae = mean_absolute_error(val_y, rf_val_predictions)

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))
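
A single train/validation split can favor one model by chance. Cross-validation averages the error over several splits and gives a steadier comparison between the decision tree and the random forest. The sketch below runs on synthetic data from make_regression; the fold count and n_estimators=50 are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data stands in for the housing features
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=1)

models = {
    'Decision Tree': DecisionTreeRegressor(max_leaf_nodes=100, random_state=1),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=1),
}

cv_mae = {}
for name, model in models.items():
    # scikit-learn reports negated MAE so that higher is better; flip the sign
    scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring='neg_mean_absolute_error')
    cv_mae[name] = -scores.mean()
    print("{}: {:,.1f}".format(name, cv_mae[name]))
```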

In [None]:
# project template

# path to file you will use for predictions
#test_data_path = '../input/test.csv'

# read test data file using pandas
#test_data = pd.read_csv(test_data_path)

# create test_X which comes from test_data but includes only the columns you used for prediction.
# The list of columns is stored in a variable called features
#features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
#test_X = test_data[features]

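# fit a model on all of the training data before predicting on the test set
#rf_model_on_full_data = RandomForestRegressor(random_state=1)
#rf_model_on_full_data.fit(X, y.values.ravel())

# make predictions which we will submit.
#test_preds = rf_model_on_full_data.predict(test_X)

# The lines below show how to save predictions in the format used for competition scoring

#output = pd.DataFrame({'Id': test_data.Id,
#                       'SalePrice': test_preds})
#output.to_csv('IDsAndPrices.csv', index=False)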
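
The commented template above can be exercised end to end on toy data. The sketch below is one possible version: train_demo and test_demo are hypothetical stand-ins with illustrative values for the Kaggle train and test files, and n_estimators=20 is chosen only to keep it fast.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-ins for train.csv and test.csv; values are for illustration only
train_demo = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'LotArea': [8450, 9600, 11250, 9550],
    'YearBuilt': [2003, 1976, 2001, 1915],
    'SalePrice': [208500, 181500, 223500, 140000],
})
test_demo = pd.DataFrame({
    'Id': [1461, 1462],
    'LotArea': [11622, 14267],
    'YearBuilt': [1961, 1958],
})
demo_features = ['LotArea', 'YearBuilt']

# Fit on every labeled row, since no validation holdout is needed at this stage
rf_model_on_full_data = RandomForestRegressor(n_estimators=20, random_state=1)
rf_model_on_full_data.fit(train_demo[demo_features], train_demo['SalePrice'])

# Predict for the unlabeled rows and pair each prediction with its Id
test_preds = rf_model_on_full_data.predict(test_demo[demo_features])
submission = pd.DataFrame({'Id': test_demo['Id'], 'SalePrice': test_preds})

print(submission)
# submission.to_csv('IDsAndPrices.csv', index=False) would write the file
```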
