# Multiple Linear Regression
If our columns are numeric, it's straightforward to add them to a linear regression model. Below we add an extra feature, `GrLivArea`, to the model.

In [312]:
import pandas as pd
from utils import evaluate_model
from sklearn import linear_model

def train_model(training_set):
    features = encode_features(training_set)
    labels = training_set['SalePrice']
    predictor = linear_model.LinearRegression()
    
    predictor.fit(features, labels)
    
    def model(input_data):
        input_features = encode_features(input_data)
        output_value = predictor.predict(input_features)
        return output_value
    
    return(model)

In [313]:
# Load the training data
training_set = pd.read_csv("housing_price_data/training_data.csv")

In [314]:
# This function returns the transformed set of values we will use to train and evaluate
def encode_features(data):
    return(data[[
        'BedroomAbvGr',
        'GrLivArea'
    ]])


In [315]:
multi_linear_model = train_model(training_set)
evaluate_model(multi_linear_model)

The model is inaccurate by $40051.22 on average.


40051.221312056878
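
The `evaluate_model` helper comes from `utils`, and its source is not shown in this notebook. Judging by its output, it presumably reports a mean absolute error (MAE) on some held-out data. The sketch below is only an assumption of that behaviour: the function name `mean_absolute_error_of`, the toy test set and the toy model are all hypothetical.

```python
import numpy as np
import pandas as pd

def mean_absolute_error_of(model, test_set, label_column="SalePrice"):
    """Hypothetical stand-in for utils.evaluate_model; the real helper may differ."""
    predictions = np.asarray(model(test_set), dtype=float)
    actual = test_set[label_column].to_numpy(dtype=float)
    # MAE: the average absolute gap between predicted and actual prices
    mae = float(np.mean(np.abs(predictions - actual)))
    print("The model is inaccurate by $%.2f on average." % mae)
    return mae

# Made-up test set and a toy model that always predicts 150000
toy_test_set = pd.DataFrame({"SalePrice": [100000, 200000, 180000]})

def toy_model(data):
    return np.full(len(data), 150000.0)

toy_mae = mean_absolute_error_of(toy_model, toy_test_set)
```

Any function that takes a DataFrame and returns one prediction per row, such as the `model` closure returned by `train_model`, could be passed in the same way.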

## Exercise

Modify the code above to add `LotArea` to the multilinear model and observe how the model's performance changes.

## Working with categorical variables (Part 1)
To make use of categorical data, we first need to encode it as a number. As a simple example, we'll encode the variable `CentralAir` as 0 if there is no central air conditioning, and 1 if there is. We can do this with a boolean comparison, relying on the fact that `True == 1` in Python.
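
As a quick illustration of the boolean trick, here is a minimal sketch on a few made-up `CentralAir` values (the sample Series is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Made-up sample values for illustration only
central_air = pd.Series(["Y", "N", "Y"])

# The comparison yields booleans, which behave like 1 and 0 in arithmetic
as_bool = central_air == "Y"
as_int = as_bool.astype(int)

print(as_bool.tolist())
print(as_int.tolist())
```

Casting with `astype(int)` is optional for the model in this notebook, but it makes the 0/1 encoding explicit when inspecting the data.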

In [316]:
def encode_features(data):
    features = data.copy()
    
    # Encode Central Air where Y is 1, and N is 0
    features['CentralAir'] = features['CentralAir'] == 'Y'
    
    features = features[[
        'BedroomAbvGr',
        'GrLivArea',
        'CentralAir'
    ]]
    
    return(features)

In [317]:
central_air_model = train_model(training_set)
evaluate_model(central_air_model)

The model is inaccurate by $38058.32 on average.


38058.319991158562

## Exercise

Add whether or not the house has a pool to the model in the code above (look at the `PoolArea` variable). Compare this model's accuracy now that the presence of a pool is a feature. Did it improve or get worse?

# Working with ordinal variables
If the categories of a variable follow a clear rank, then we can label them by this rank. An example of this is the basement condition column, `BsmtCond`.

    BsmtCond: Evaluates the general condition of the basement

       Ex	Excellent
       Gd	Good
       TA	Typical - slight dampness allowed
       Fa	Fair - dampness or some cracking or settling
       Po	Poor - Severe cracking, settling, or wetness

We would encode this as Po:1, Fa:2, TA:3, Gd:4, and Ex:5.

For houses without a basement (i.e. where `BsmtCond` is `NaN`), we use a default value of 0.
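
A minimal sketch of this ordinal encoding on a handful of made-up `BsmtCond` values (the sample Series is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up sample, including one house with no basement (NaN)
bsmt_cond = pd.Series(["TA", "Gd", np.nan, "Po", "Ex"])

# Rank each category from worst (1) to best (5)
rank = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# map() leaves missing values as NaN, so fillna(0) supplies the default
encoded = bsmt_cond.map(rank).fillna(0).astype(int)
print(encoded.tolist())
```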


In [318]:
def encode_features(data):
    features = data.copy()
    
    # Encode Central Air where Y is 1, and N is 0
    features['CentralAir'] = features['CentralAir'] == 'Y'

    # Encode basement condition as an ordinal rank (Po=1 ... Ex=5)
    bsmt_cond_map = {'Ex':5,'Gd':4,'TA':3,'Fa':2,'Po':1}
    features['BsmtQuality'] = features['BsmtCond'].map(bsmt_cond_map).fillna(0) # Some houses have no basement

    features = features[[
        'BedroomAbvGr',
        'GrLivArea',
        'BsmtQuality',
        'CentralAir'
    ]]
    
    return(features)

In [319]:
basement_model = train_model(training_set)
evaluate_model(basement_model)

The model is inaccurate by $37885.68 on average.


37885.684919333209


## Exercise

Add kitchen quality as a feature in the code above (see `KitchenQual` in `data_description.txt`).

# Working with categorical variables (Part 2)


If the categories have no natural order, we use a technique called "one-hot encoding". This involves creating a new column for each category, which is 1 when a row belongs to that category and 0 otherwise.

For example, the `Electrical` variable in the housing prices dataset contains the following categories:

       SBrkr	Standard Circuit Breakers & Romex
       FuseA	Fuse Box over 60 AMP and all Romex wiring (Average)	
       FuseF	60 AMP Fuse Box and mostly Romex wiring (Fair)
       FuseP	60 AMP Fuse Box and mostly knob & tube wiring (poor)
       Mix	Mixed

We would encode the table

|Electrical|
|----------|
|FuseA|
|FuseF|
|FuseP|
|Mix|
|SBrkr|

as

|FuseA|FuseF|FuseP|Mix|SBrkr|
|-----|-----|-----|---|-----|
|1|0|0|0|0|
|0|1|0|0|0|
|0|0|1|0|0|
|0|0|0|1|0|
|0|0|0|0|1|

In [320]:
def encode_electrical(electrical):
    one_hot_encoding = pd.DataFrame()
    one_hot_encoding['FuseA'] = electrical == 'FuseA'
    one_hot_encoding['FuseF'] = electrical == 'FuseF'
    one_hot_encoding['FuseP'] = electrical == 'FuseP'
    one_hot_encoding['Mix']   = electrical == 'Mix'
    one_hot_encoding['SBrkr'] = electrical == 'SBrkr'
    return(one_hot_encoding)


In [321]:
encode_electrical(training_set['Electrical']).head(10)

Unnamed: 0,FuseA,FuseF,FuseP,Mix,SBrkr
0,False,False,False,False,True
1,False,False,False,False,True
2,False,False,False,False,True
3,False,False,False,False,True
4,False,False,False,False,True
5,False,False,False,False,True
6,False,False,False,False,True
7,False,False,False,False,True
8,False,True,False,False,False
9,False,False,False,False,True
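
pandas also provides `pd.get_dummies`, which builds the same kind of one-hot table without writing one comparison per category. The sketch below is an alternative to the hand-written `encode_electrical`, not a replacement the notebook relies on; the sample values are made up. Declaring the categories up front is an assumption made here so that every column appears even when a category is absent from the data.

```python
import pandas as pd

# The full list of Electrical categories from the data description
categories = ["FuseA", "FuseF", "FuseP", "Mix", "SBrkr"]

# Made-up sample; FuseF and FuseP do not occur in it
electrical = pd.Series(
    pd.Categorical(["SBrkr", "FuseA", "Mix", "SBrkr"], categories=categories)
)

# One column per declared category, including the unobserved ones
one_hot = pd.get_dummies(electrical)
print(one_hot)
```

Fixing the category list keeps the training and test feature tables aligned, which matters because the linear model expects the same columns at prediction time.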


In [322]:
def encode_features(data):
    features = data.copy()
    
    # Encode Central Air where Y is 1, and N is 0
    features['CentralAir'] = features['CentralAir'] == 'Y'

    # Encode basement condition as an ordinal rank (Po=1 ... Ex=5)
    bsmt_cond_map = {'Ex':5,'Gd':4,'TA':3,'Fa':2,'Po':1}
    features['BsmtQuality'] = features['BsmtCond'].map(bsmt_cond_map).fillna(0) # Some houses have no basement

    # Encode electrical
    features = pd.concat([features, encode_electrical(features['Electrical'])], axis=1)

    features = features[[
        'BedroomAbvGr',
        'GrLivArea',
        'CentralAir', 
        'FuseA',
        'FuseF',
        'FuseP', 
        'Mix',
        'SBrkr', 
        'BsmtQuality'
    ]]

    return(features)

In [323]:
electrical_model = train_model(training_set)
evaluate_model(electrical_model)


The model is inaccurate by $37271.98 on average.


37271.976859797694

## Exercise

Encode `Heating` using one-hot encoding.

# Combining features

Sometimes, combining some features together will aid the machine learning process. In this example we will multiply the number of bedrooms by the number of bathrooms:
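
To see why a product of two columns can help, consider a synthetic target that depends on bedrooms times bathrooms. A linear model cannot represent that product from the two columns alone, but it can once the product is supplied as its own feature. This is a minimal sketch on made-up data; the variable names and the price formula are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
beds = rng.integers(1, 6, size=200)    # made-up bedroom counts (1 to 5)
baths = rng.integers(1, 4, size=200)   # made-up bathroom counts (1 to 3)
price = 10000.0 * beds * baths         # synthetic target driven by the product

plain = np.column_stack([beds, baths])
combined = np.column_stack([beds, baths, beds * baths])

# R^2 on the same data, with and without the combined feature
score_plain = LinearRegression().fit(plain, price).score(plain, price)
score_combined = LinearRegression().fit(combined, price).score(combined, price)
print(score_plain, score_combined)
```

Real housing data is far noisier than this, so the gain from a combined feature will usually be smaller.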

In [324]:
def encode_features(data):
    features = data.copy()
    
    # Encode Central Air where Y is 1, and N is 0
    features['CentralAir'] = features['CentralAir'] == 'Y'

    # Encode basement condition as an ordinal rank (Po=1 ... Ex=5)
    bsmt_cond_map = {'Ex':5,'Gd':4,'TA':3,'Fa':2,'Po':1}
    features['BsmtQuality'] = features['BsmtCond'].map(bsmt_cond_map).fillna(0) # Some houses have no basement

    # Encode electrical
    features = pd.concat([features, encode_electrical(features['Electrical'])], axis=1)
    
    # Combine bed and bath    
    features['BedBath'] = features['FullBath'] * features['BedroomAbvGr']

    features = features[[
        'BedroomAbvGr',
        'GrLivArea',
        'CentralAir', 
        'FuseA',
        'FuseF',
        'FuseP', 
        'Mix',
        'SBrkr', 
        'BsmtQuality',
        'BedBath'
    ]]

    return(features)

In [325]:
bedroom_bathroom_model = train_model(training_set)
evaluate_model(bedroom_bathroom_model)

The model is inaccurate by $36891.25 on average.


36891.251064716809

# Scaling Features

Many machine learning algorithms work best when all variables are on a similar scale, ideally between -1 and 1. Let's compare `LotArea` to `SalePrice`:

In [326]:
training_set[['LotArea','SalePrice']].describe()

Unnamed: 0,LotArea,SalePrice
count,1168.0,1168.0
mean,10521.577055,180590.277397
std,10678.605035,78815.697902
min,1300.0,34900.0
25%,7555.0,129900.0
50%,9423.0,162950.0
75%,11608.5,214000.0
max,215245.0,755000.0


We can transform these variables to be on the same scale using a preprocessing trick called _min/max scaling_. 

`MinMaxScale(X) = (X - min(X))/(max(X) - min(X))` 
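
The formula can be checked by hand against scikit-learn's `MinMaxScaler`. This is a minimal sketch on three made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One made-up column; MinMaxScaler expects a 2-D array (rows x columns)
x = np.array([[10.0], [20.0], [40.0]])

# Apply the formula directly, column by column
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The library version should give the same result
scaled_by_sklearn = MinMaxScaler().fit_transform(x)
print(manual.ravel())
```

The smallest value maps to 0, the largest to 1, and everything else falls in between in proportion.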

In [327]:
from sklearn.preprocessing import MinMaxScaler

In [328]:
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(training_set[['LotArea','SalePrice']]),
             columns=['LotArea','SalePrice'])
scaled.head()

Unnamed: 0,LotArea,SalePrice
0,0.03342,0.241078
1,0.038795,0.203583
2,0.046507,0.261908
3,0.038561,0.145952
4,0.060576,0.298709


In [329]:
scaled.describe()

Unnamed: 0,LotArea,SalePrice
count,1168.0,1168.0
mean,0.043103,0.20232
std,0.049913,0.109451
min,0.0,0.0
25%,0.029236,0.131926
50%,0.037968,0.177823
75%,0.048183,0.248715
max,1.0,1.0


We can put these back to their original scale using the `inverse_transform` method.

In [330]:
pd.DataFrame(scaler.inverse_transform(scaled),
             columns=["LotArea","SalePrice"]).head()

Unnamed: 0,LotArea,SalePrice
0,8450.0,208500.0
1,9600.0,181500.0
2,11250.0,223500.0
3,9550.0,140000.0
4,14260.0,250000.0


In [331]:
training_set[["LotArea","SalePrice"]].head()

Unnamed: 0,LotArea,SalePrice
0,8450,208500
1,9600,181500
2,11250,223500
3,9550,140000
4,14260,250000


Let's apply scaling to our model. We'll keep the combined bedroom and bathroom feature from the previous section, and scale both our inputs and our output.

In [338]:
def encode_features(data, scaler=None):
    features = data.copy()
    
    # Encode Central Air where Y is 1, and N is 0
    features['CentralAir'] = features['CentralAir'] == 'Y'

    # Encode basement condition as an ordinal rank (Po=1 ... Ex=5)
    bsmt_cond_map = {'Ex':5,'Gd':4,'TA':3,'Fa':2,'Po':1}
    features['BsmtQuality'] = features['BsmtCond'].map(bsmt_cond_map).fillna(0) # Some houses have no basement

    # Encode electrical
    features = pd.concat([features, encode_electrical(features['Electrical'])], axis=1)
    
    # Combine bed and bath    
    features['BedBath'] = features['FullBath'] * features['BedroomAbvGr']
    
    # Scale numeric values
    scaled_columns = [
        'FullBath',
        'BedroomAbvGr', 
        'BedBath',
        'GrLivArea',
        'CentralAir', 
        'FuseA',
        'FuseF',
        'FuseP', 
        'Mix',
        'SBrkr', 
        'BsmtQuality'
    ]
    scaled_features = features[scaled_columns]
    # Fit a new scaler only when none is supplied (i.e. at training time)
    if scaler is None:
        scaler = MinMaxScaler()
        scaler.fit(scaled_features)
        
    scaled_features = pd.DataFrame(scaler.transform(scaled_features), columns=scaled_columns)
    
    return scaled_features, scaler

def encode_label(data):
    # MinMaxScaler expects 2-D input, so select SalePrice as a one-column DataFrame
    labels = data[['SalePrice']]
    scaler = MinMaxScaler()
    scaler.fit(labels)    
    labels = pd.DataFrame(scaler.transform(labels), columns=['SalePrice'])
    return (labels['SalePrice'], scaler)

def decode_label(data, scaler):
    # Predictions are 1-D: reshape to a column for the scaler, then flatten again
    return scaler.inverse_transform(data.reshape(-1, 1)).ravel()

def train_model(training_set):
    features, feature_scaler = encode_features(training_set)
    labels, label_scaler = encode_label(training_set)
    predictor = linear_model.LinearRegression()
    
    predictor.fit(features, labels)
    
    def model(input_data):
        # encode_features returns (features, scaler); only the features are needed here
        input_features, _ = encode_features(input_data, scaler=feature_scaler)
        output_value = predictor.predict(input_features)
        return decode_label(output_value, label_scaler)
    
    return(model)

In [339]:
scaled_model = train_model(training_set)
evaluate_model(scaled_model)

How does the model do on the data it was trained on? Let's check the error against the original training data.

In [303]:
# Let's calculate the MAE against the training data
import numpy as np
np.mean(np.abs(scaled_model(training_set) - training_set['SalePrice']))



31912.800597684025

The error on the training data is lower than any of the average errors reported by `evaluate_model` above. This is an example of over-fitting, where the model is fitted too tightly to the training data and is not general enough to accommodate the test data. This has happened due to the non-linear relationship of the features and the degrees of freedom allowed by having so many dimensions.

# Competition Exercise

Build a linear model that achieves as low a score as possible. The winner will receive an Eliiza water bottle.