# Project Part II: Predicting Housing Prices - Build Your Own Model

 

### Grading Scheme

Your grade for the project will be based on your training RMSE and test RMSE. The thresholds are as follows:

Points | 9 | 7 | 5 | 3
--- | --- | --- | --- | ---
Training RMSE | Less than 60k | [60k, 120k) | [120k, 200k) | More than 200k

Points | 9 | 7 | 5 | 3
--- | --- | --- | --- | ---
Test RMSE | Less than 65k | [65k, 130k) | [130k, 230k) | More than 230k

The 20% of submissions with the lowest test RMSE will receive two additional points.


In [1]:
# Some Imports You Might Need
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import zipfile
import os
import re

from sklearn.preprocessing import OneHotEncoder
from sklearn import linear_model as lm
from sklearn import ensemble

# Plot settings
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.size'] = 12

# Extract Dataset
with zipfile.ZipFile('cook_county_contest_data.zip') as item:
    item.extractall()
    
    
# Note: we filtered the data in cook_county_contest_data,
# so please use this dataset instead of the old one.

### Note

This notebook is specifically designed to guide you through the process of exporting your model's predictions on the test dataset for submission so you can see how your model performs.

Most of what you did in Project Part I should be transferable here.

## Step 1. Set up all the helper functions for your `process_data_fm` function.

**Copy-paste all of the helper functions your `process_data_fm` needs into the following cell**. You **do not** have to fill out all of the functions in the cell below; only fill out those that are actually useful to your feature engineering pipeline.

In [2]:
from sklearn.preprocessing import OneHotEncoder

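def add_total_bedrooms(data):
    """
    Input:
      data (data frame): a data frame containing at least the Description column.
    Output:
      a copy of data with an added 'Bedrooms' column (0 where no count is found).
    """
    with_rooms = data.copy()
    # The dot is escaped so that it only matches a literal decimal point.
    pattern = r'(\d+)(?:\.\d+)? of which (?:are|is) bedrooms?'
    # expand=False returns a Series holding the single capture group.
    bedrooms = with_rooms['Description'].str.extract(pattern, expand=False)
    # fillna returns a new object, so its result has to be assigned.
    with_rooms['Bedrooms'] = pd.to_numeric(bedrooms).fillna(0).astype(int)
    return with_rooms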

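def ohe_roof_material(data):
    """
    One-hot-encodes roof material.  New columns are of the form x0_MATERIAL.
    """
    ohe = OneHotEncoder()
    encoded = ohe.fit_transform(data[['Roof Material']]).toarray()
    # Build the x0_ names from categories_ so they do not depend on the
    # scikit-learn version (get_feature_names was removed in scikit-learn 1.2).
    columns = ['x0_' + str(category) for category in ohe.categories_[0]]
    new_columns = pd.DataFrame(data=encoded, columns=columns, index=data.index)
    return data.join(new_columns)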

def substitute_roof_material(data):
    """
    Input:
      data (data frame): a data frame containing a 'Roof Material' column.  Its values
                         should be limited to those found in the codebook
    Output:
      data frame identical to the input except with a refactored 'Roof Material' column
    """
    dictionary = {1:'Shingle/Asphalt', 2:'Tar&Gravel', 3:'Slate', 4:'Shake', 5:'Tile', 6:'Other'}
    data['Roof Material'] = data['Roof Material'].replace(dictionary)
    return data
  
def process_data_gm(data, pipeline_functions, prediction_col):
    """Process the data for a guided model."""
    for function, arguments, keyword_arguments in pipeline_functions:
        if keyword_arguments and (not arguments):
            data = data.pipe(function, **keyword_arguments)
        elif (not keyword_arguments) and (arguments):
            data = data.pipe(function, *arguments)
        else:
            data = data.pipe(function)
    X = data.drop(columns=[prediction_col]).to_numpy()
    y = data.loc[:, prediction_col].to_numpy()
    return X, y

def add_in_expensive_neighborhood(data, neighborhoods):
    """
    Input:
      data (data frame): a data frame containing a 'Neighborhood Code' column with values
        found in the codebook
      neighborhoods (list of ints): neighborhood codes pre-identified
        as expensive
    Output:
      data frame identical to the input with the addition of a binary
      in_expensive_neighborhood column
    """
    data['in_expensive_neighborhood'] = data['Neighborhood Code'].isin(neighborhoods).astype(int)
    return data

def find_expensive_neighborhoods(data, n=3, metric=np.median):
    """
    Input:
      data (data frame): should contain at least a 'Neighborhood Code' column
        and a numeric 'Log Sale Price' column
      n (int): the number of top values desired
      metric (function): function used for aggregating the data in each neighborhood.
        for example, np.median for median prices
    
    Output:
      a list of the codes of the top n most expensive neighborhoods as measured by the metric function
    """
    neighborhoods = data.groupby('Neighborhood Code').agg({'Log Sale Price':metric}).sort_values('Log Sale Price',ascending=False).head(n).index
    
    # This makes sure the final list contains the generic int type used in Python3, not specific ones used in numpy.
    return [int(code) for code in neighborhoods]

def select_columns(data, *columns):
    """Select only columns passed as arguments."""
    # *columns arrives as a tuple; pass a list so .loc treats it as a set of labels.
    return data.loc[:, list(columns)]
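
Because the test data has no `Sale Price` column, any feature derived from prices (such as the expensive-neighborhood flag) has to be learned from the training data and then reused. The following is a minimal sketch under that assumption, using tiny made-up data frames; the values and the name `expensive_codes` are illustrative and not part of the project data.

```python
import numpy as np
import pandas as pd

# Made-up training rows; the real data has many more columns.
toy_train = pd.DataFrame({
    'Neighborhood Code': [10, 10, 20, 20, 30, 30],
    'Sale Price': [100_000, 120_000, 500_000, 520_000, 250_000, 260_000],
})
# Made-up test rows: note there is no 'Sale Price' column here.
toy_test = pd.DataFrame({'Neighborhood Code': [20, 30, 40]})

# Learn the expensive neighborhoods once, from the training data only.
toy_train['Log Sale Price'] = np.log(toy_train['Sale Price'])
median_log_price = toy_train.groupby('Neighborhood Code')['Log Sale Price'].median()
expensive_codes = [int(code) for code in median_log_price.nlargest(2).index]

# Reuse the stored codes to flag rows in any data set, train or test.
toy_test['in_expensive_neighborhood'] = (
    toy_test['Neighborhood Code'].isin(expensive_codes).astype(int)
)
print(expensive_codes)
print(toy_test)
```

In a real pipeline the stored list would be computed before calling `process_data_fm` on the test set, since that function only receives `data`.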

## Step 2. Setup your `process_data_fm` function

**Copy your implementation of `process_data_fm` from Part I into the following cell.**

Check the following additional requirement, and **change your `process_data_fm` function if needed so that it is satisfied**:
- Unlike Part I, we will not be expecting your `process_data_fm` function to return both the design matrix `X` and the observed target vector `y`; your function should now **only return X**.


In [3]:
# Please include all of your feature engineering process inside this function.
# Do not modify the parameters of the function below. 
# Note that data will no longer have the column Sale Price in it directly, so plan your feature engineering process around that.
def process_data_fm(data):
    # Replace the following lines with your own feature engineering pipeline
    X = data.copy()  # work on a copy so the caller's data frame is not modified
    X['Log Building Square Feet'] = np.log(X['Building Square Feet'])
    X = add_total_bedrooms(X)
    X = substitute_roof_material(X)
    X = ohe_roof_material(X)
    X = select_columns(X, 'Bedrooms', 'Log Building Square Feet', 'x0_Other', 'x0_Shake', 'x0_Shingle/Asphalt', 'x0_Slate', 'x0_Tar&Gravel', 'x0_Tile')
    return X
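
One thing to watch with the example pipeline: `ohe_roof_material` fits a fresh encoder on whatever data it receives, so a test set that happens to lack a roof material would produce fewer columns than the training set. One possible guard, sketched here with made-up frames, is to reindex the test matrix to the training columns. The names `X_train_toy` and `X_test_toy` are illustrative only.

```python
import pandas as pd

# Made-up design matrices: the test frame has no 'x0_Slate' column
# because no test row had that roof material.
X_train_toy = pd.DataFrame({
    'Bedrooms': [3, 2],
    'x0_Shingle/Asphalt': [1.0, 0.0],
    'x0_Slate': [0.0, 1.0],
})
X_test_toy = pd.DataFrame({
    'Bedrooms': [4],
    'x0_Shingle/Asphalt': [1.0],
})

# Reorder to the training columns, filling missing indicators with 0 and
# dropping any column that the model never saw during training.
X_test_aligned = X_test_toy.reindex(columns=X_train_toy.columns, fill_value=0)
print(X_test_aligned)
```

An alternative with the same goal is to fit a single encoder on the training data and reuse it for the test data.
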

## Step 3. Train your model

Run the following cell to import the new set of training data to fit your model on. **You can use any regression model; the following is just an example.** If your `process_data_fm` satisfies all the specified requirements, the cell should run without any errors.

**As usual**, your model will predict the log-transformed sale price, and our grading will transform your predictions back to the normal values.

In [4]:
def rmse(predicted, actual):
    """
    Calculates RMSE from actual and predicted values
    Input:
      predicted (1D array): vector of predicted/fitted values
      actual (1D array): vector of actual values
    Output:
      a float, the root-mean square error
    """
    return np.sqrt(np.mean((actual - predicted)**2))
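
The grading thresholds are stated in dollars, while the model predicts log prices, so an RMSE computed on the log scale cannot be compared to them directly. Here is a minimal sketch of the back-transformation, assuming predictions are natural logs as produced by `np.log`; the prices are made up.

```python
import numpy as np

def rmse(predicted, actual):
    """Root-mean-square error between two 1D arrays."""
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Made-up sale prices and log-scale predictions for three houses.
actual_price = np.array([200_000.0, 350_000.0, 500_000.0])
predicted_log_price = np.log(np.array([210_000.0, 330_000.0, 520_000.0]))

# RMSE on the log scale is not in dollars, so it says little about the thresholds.
log_rmse = rmse(predicted_log_price, np.log(actual_price))

# Undo the log with np.exp to get an RMSE in dollars.
dollar_rmse = rmse(np.exp(predicted_log_price), actual_price)
print(log_rmse, dollar_rmse)
```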

In [5]:
train_data = pd.read_csv('cook_county_contest_train.csv', index_col='Unnamed: 0')
y_train = np.log(train_data['Sale Price'])
train_data = train_data.drop(columns=['Sale Price'])
X_train = process_data_fm(train_data)
model = ensemble.RandomForestRegressor(n_estimators=1000)
# You can use other models
model.fit(X_train, y_train);
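
Training RMSE alone tends to be optimistic for flexible models such as random forests, so holding out part of the training data can give a rough preview of test error. The sketch below uses synthetic data; the split size, the seed, and the small `n_estimators` are arbitrary choices for illustration, and `dollar_rmse` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a design matrix and log sale prices.
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 5000, size=(400, 1))  # e.g. building square feet
y_demo = np.log(100 * X_demo[:, 0]) + rng.normal(0, 0.2, size=400)

# Hold out 20% of the rows; the model never sees them while fitting.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
demo_model = RandomForestRegressor(n_estimators=50, random_state=0)
demo_model.fit(X_fit, y_fit)

def dollar_rmse(model, X, y_log):
    """RMSE in dollars for a model that predicts log prices."""
    return np.sqrt(np.mean((np.exp(model.predict(X)) - np.exp(y_log)) ** 2))

train_error = dollar_rmse(demo_model, X_fit, y_fit)
val_error = dollar_rmse(demo_model, X_val, y_val)
print(f'train RMSE: {train_error:,.0f}  validation RMSE: {val_error:,.0f}')
```

With the notebook's own variables, the same split could be applied to `X_train` and `y_train` before fitting the final model on all of the training data.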

## Step 4. Make Predictions on the Test Dataset

Run the following cell to predict the log sale price for the test dataset and export your model's predictions as a CSV file called `predictions.csv`. As noted in Step 3, the predictions stay on the log scale; the grading transforms them back.

In [6]:
test_data = pd.read_csv('cook_county_contest_test.csv', index_col='Unnamed: 0')
X_test = process_data_fm(test_data)
y_test_predicted = model.predict(X_test)
predictions = pd.DataFrame({'Sale Price': y_test_predicted})
predictions.to_csv('predictions.csv')
print('Your predictions have been exported as predictions.csv. Please download the file and submit it to Canvas. ')

Your predictions have been exported as predictions.csv. Please download the file and submit it to Canvas. 
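
Before uploading, it can be worth reading the file back to confirm that it holds one finite prediction per test row. The following is a small sketch using a temporary file and made-up predictions; `check_predictions` is a hypothetical helper, not part of the assignment.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def check_predictions(path, expected_rows):
    """Return True if the CSV has one finite 'Sale Price' per expected row."""
    saved = pd.read_csv(path, index_col=0)
    has_column = 'Sale Price' in saved.columns
    return bool(
        has_column
        and len(saved) == expected_rows
        and np.isfinite(saved['Sale Price']).all()
    )

# Made-up log-price predictions written the same way as in the cell above.
demo_predictions = pd.DataFrame({'Sale Price': [12.1, 12.6, 13.0]})
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, 'predictions.csv')
    demo_predictions.to_csv(demo_path)
    looks_ok = check_predictions(demo_path, expected_rows=3)
    wrong_length = check_predictions(demo_path, expected_rows=4)
print(looks_ok, wrong_length)
```

In the notebook itself, the equivalent call would be `check_predictions('predictions.csv', len(test_data))`.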
