# Multiple Linear Regression

In [None]:
import pandas as pd
import numpy as np
from sklearn import linear_model

# Load training set
training_set = pd.read_csv("https://raw.githubusercontent.com/eliiza/ml-training-data/master/housing_price_data/training_data.csv")

# Use same evaluate_model function as before
def evaluate_model(model_fn, print_result=False):
    '''
    Takes a function model_fn and evaluates its predictive accuracy
    against the housing prices test set.
    If print_result is True, prints a human-readable summary; otherwise
    returns the mean absolute error as an unrounded float.
    '''
    test_data = pd.read_csv("https://raw.githubusercontent.com/eliiza/ml-training-data/master/housing_price_data/test_data.csv")
    actual_values = test_data['SalePrice']
    # Pass in all columns except SalePrice
    test_input = test_data.filter(regex='^(?!SalePrice$).*')
    predicted_saleprice = model_fn(test_input)
    mae = np.mean(np.abs(predicted_saleprice-actual_values))
    if print_result:
        return print("The model is inaccurate by $%.2f on average." % mae)
    else:
        return mae
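
The error measure used in `evaluate_model` is the mean absolute error. As a minimal sketch of what it computes, here it is applied to three made-up sale prices (the numbers are hypothetical and not taken from the housing data):

```python
import numpy as np

# Hypothetical sale prices, used only to illustrate the metric
actual = np.array([200000.0, 150000.0, 320000.0])
predicted = np.array([210000.0, 140000.0, 300000.0])

# Mean absolute error: average size of the prediction errors, ignoring sign
mae = np.mean(np.abs(predicted - actual))
print("The model is inaccurate by $%.2f on average." % mae)
```

The individual errors are 10000, 10000 and 20000, so the average is about 13333.33.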

## Numeric Features

If our columns are numeric, it's straightforward to add them to a linear regression model. 

Below we are adding an extra feature to the simple `OverallQual` model: `GrLivArea`.

In [None]:
# Create linear regression model function based on 'OverallQual' and 'GrLivArea'
# returns the model predictions
    
# Create features data frame and train model    
training_features = training_set[['OverallQual','GrLivArea']]
predictor = linear_model.LinearRegression()
predictor.fit(training_features, training_set['SalePrice'])
    
# Define model function which outputs predictions 
def multi_linear_model(input_data):
    return(predictor.predict(input_data[['OverallQual','GrLivArea']]))

In [None]:
# Evaluate model
evaluate_model(multi_linear_model, print_result=True)
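
To see what a model with two features learns, it can help to fit one on data where the answer is known. The sketch below uses synthetic data rather than the housing set; the column names `a` and `b` and the variable `toy_predictor` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic data (not the housing set): target = 3*a + 2*b + 5, no noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({'a': rng.uniform(0, 10, 50), 'b': rng.uniform(0, 10, 50)})
toy_target = 3 * toy['a'] + 2 * toy['b'] + 5

toy_predictor = linear_model.LinearRegression()
toy_predictor.fit(toy[['a', 'b']], toy_target)

# One coefficient per feature column, in column order, plus a single intercept
print(toy_predictor.coef_, toy_predictor.intercept_)
```

The fitted coefficients should come out close to 3 and 2, with an intercept close to 5. The same `coef_` and `intercept_` attributes can be inspected on the housing `predictor`.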

**Question:** How does this compare with the accuracy of the single variable models in the previous notebook?

## Exercise

Modify the code above to add `YearBuilt` to the multilinear model and observe the performance of the model.

## Working with categorical variables (Part 1)
In order to make use of categorical data, we first need to encode it as a number. As a simple example, we'll encode the variable `CentralAir` as 0 if there is no central air conditioning, and 1 if there is central air conditioning. We can do this by using a Boolean comparison operation, and relying on the fact that `True == 1`.

**Key:** We need to apply the same feature engineering to the train and test datasets. To make this easy and repeatable, we will write a function to carry out the desired encoding (`encode_data()`) each time. In this case, it encodes the `CentralAir` variable as described above and selects the `OverallQual`, `GrLivArea` and `CentralAir` columns.

In [None]:
def encode_data(data):
    """
    Create copy of input data and transform string values (Y/N) of 'CentralAir' column to Booleans,
    return data frame with specified training columns
    """
    features = data.copy()
    features['CentralAir'] = features['CentralAir'] == 'Y'
    return features[['OverallQual','GrLivArea','CentralAir']]
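
As a small check of the Boolean comparison, the sketch below applies it to a hand-made column (hypothetical values in the same Y/N format). The explicit `astype(int)` step is optional and only makes the 0/1 encoding visible.

```python
import pandas as pd

# Hypothetical toy column in the same Y/N format as 'CentralAir'
toy = pd.DataFrame({'CentralAir': ['Y', 'N', 'Y']})

as_bool = toy['CentralAir'] == 'Y'   # Boolean Series: True, False, True
as_int = as_bool.astype(int)         # explicit 0/1 encoding: 1, 0, 1

print(as_int.tolist())
```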

In [None]:
# Create linear regression model function using the above encode_data feature engineering
# returns the model predictions

# Create features data frame and train model
training_features = encode_data(training_set)
predictor = linear_model.LinearRegression()
predictor.fit(training_features, training_set['SalePrice'])

# Define model function which outputs predictions
def central_air_model(input_data):
    input_features = encode_data(input_data)
    return(predictor.predict(input_features))

In [None]:
# Evaluate
evaluate_model(central_air_model, print_result=True)

## Exercise

Add whether or not the house has a pool to this model in the code above. (Look at the `PoolArea` variable).  
Compare this model's accuracy now that the presence of a pool is a feature.  
Did it improve it or make it worse? 

# Working with ordinal variables
If the categories of a variable follow a clear rank, then we can label them by this rank. An example of this is the basement condition column (`BsmtCond`).

    BsmtCond: Evaluates the general condition of the basement

       Ex	Excellent
       Gd	Good
       TA	Typical - slight dampness allowed
       Fa	Fair - dampness or some cracking or settling
       Po	Poor - Severe cracking, settling, or wetness

We would encode this as Po:1, Fa:2, TA:3, Gd:4, and Ex:5.

For houses without a basement (i.e. where `BsmtCond` is `NaN`), we use a default value of 0.


In [None]:
# Function to use 'OverallQual' and 'GrLivArea', and create 'BsmtQuality' feature
def encode_data(data):
    """
    Create copy of input data and transform categories to numerical values,
    return data frame with specified training columns        
    """
    features = data.copy()

    # Create dictionary to map category string to numerical value
    bsmt_cond_map = {'Ex':5,'Gd':4,'TA':3,'Fa':2,'Po':1}

    # Create 'BsmtQuality' feature by mapping the numerical value to the basement condition
    # and fill gaps where map failed - Some houses have no basement
    features['BsmtQuality'] = features['BsmtCond'].map(bsmt_cond_map).fillna(0)

    return features[['OverallQual','GrLivArea','BsmtQuality']]


# Create features data frame and train model
training_features = encode_data(training_set)
predictor = linear_model.LinearRegression()
predictor.fit(training_features, training_set['SalePrice'])

# Define model function which outputs predictions
def basement_cond_model(input_data):
    input_features = encode_data(input_data)
    return(predictor.predict(input_features))

In [None]:
# Evaluate
evaluate_model(basement_cond_model, print_result=True)
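
The rank mapping can be checked in isolation on a small hand-made column. This is only a sketch: the values in `toy_cond` are hypothetical, and the dictionary is the same one used in `encode_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; NaN stands for a house with no basement
toy_cond = pd.Series(['Gd', 'TA', np.nan, 'Po', 'Ex'])
bsmt_cond_map = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1}

# map() leaves NaN for missing or unmapped values; fillna(0) gives them rank 0
toy_rank = toy_cond.map(bsmt_cond_map).fillna(0)
print(toy_rank.tolist())
```

The missing entry ends up with rank 0, below `Po`.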

## Exercise

Add kitchen quality as a feature in the code above: see `KitchenQual` in [data_description.txt](https://github.com/eliiza/ml-training-data/blob/master/housing_price_data/data_description.txt).

Does this make the model perform better?

# Working with categorical variables (Part 2)

If there is no order, we use a technique called 'one hot encoding'.  
This involves creating a new column for each category.

For example, the `Electrical` variable in the housing prices dataset contains the following categories:

       SBrkr	Standard Circuit Breakers & Romex
       FuseA	Fuse Box over 60 AMP and all Romex wiring (Average)	
       FuseF	60 AMP Fuse Box and mostly Romex wiring (Fair)
       FuseP	60 AMP Fuse Box and mostly knob & tube wiring (poor)
       Mix	Mixed

We would encode the table

|Electrical|
|:---------|
|FuseA|
|FuseF|
|FuseP|
|Mix|
|SBrkr|
  

as 

|FuseA|FuseF|FuseP|Mix|SBrkr|
|:-:|:-:|:-:|:-:|:-:|
| 1 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 1 |

In [None]:
def encode_electrical(electrical):
    """
    Create data frame with one column per category in 'electrical' column, rows are Boolean with respect to
    category string in electrical column.
    """
    one_hot_encoding = pd.DataFrame()
    one_hot_encoding['FuseA'] = electrical == 'FuseA'
    one_hot_encoding['FuseF'] = electrical == 'FuseF'
    one_hot_encoding['FuseP'] = electrical == 'FuseP'
    one_hot_encoding['Mix']   = electrical == 'Mix'
    one_hot_encoding['SBrkr'] = electrical == 'SBrkr'
    return(one_hot_encoding)

In [None]:
encode_electrical(training_set['Electrical']).head(20)
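
pandas also provides `pd.get_dummies`, which can build these columns automatically. The sketch below shows one possible alternative to the hand-written function, not the approach this notebook uses. The toy column is hypothetical. Declaring the categories up front with `pd.CategoricalDtype` is one way to make sure the train and test sets end up with the same columns.

```python
import pandas as pd

# Category list taken from the Electrical description above
electrical_categories = ['FuseA', 'FuseF', 'FuseP', 'Mix', 'SBrkr']

# Hypothetical toy column that happens to contain only two of the categories
toy_electrical = pd.Series(['SBrkr', 'FuseA', 'SBrkr'])

# Declaring the categories up front keeps the output columns fixed,
# even for categories that never occur in this particular data set
category_type = pd.CategoricalDtype(categories=electrical_categories)
toy_one_hot = pd.get_dummies(toy_electrical.astype(category_type))

# Cast to int so the result looks the same whichever dtype get_dummies returns
toy_one_hot = toy_one_hot.astype(int)
print(toy_one_hot)
```

Without the fixed category list, `get_dummies` would only create columns for the values that are present, so a test set missing one category would produce fewer columns than the training set.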

In [None]:
# Function to use 'OverallQual' and 'GrLivArea', and one hot encode 'Electrical'
def encode_data(data):
    """
    Create copy of input data and transform categories to Boolean values,
    return data frame with specified training columns        
    """

    # Create one hot encoded data frame
    electrical = encode_electrical(data['Electrical'])

    # Join one hot encoded data frame and feature columns from training data
    data = pd.concat([electrical,data[["OverallQual","GrLivArea"]]],axis=1)

    return(data)


# Create features data frame and train model    
training_features = encode_data(training_set)
predictor = linear_model.LinearRegression()
predictor.fit(training_features, training_set['SalePrice'])

# Define model function which outputs predictions
def electrical_model(input_data):
    input_features = encode_data(input_data)
    return(predictor.predict(input_features))

In [None]:
# Evaluate
evaluate_model(electrical_model, print_result=True)

## Exercise

Encode `Heating` using one-hot encoding.

# Scaling features

Many machine learning algorithms work better when all variables are on a similar scale, ideally a small range such as 0 to 1 or -1 to 1.  
Let's compare `OverallQual` to `GrLivArea`:

In [None]:
training_set[['OverallQual','GrLivArea']].describe()

We can transform these variables to be on the same scale using a preprocessing trick called _min/max scaling_. 

`MinMaxScale(X) = (X - min(X))/(max(X) - min(X))` 
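
As a quick sanity check, the formula can be applied by hand and compared with scikit-learn's `MinMaxScaler`. The sketch below uses three hypothetical living-area values rather than the housing data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical living-area values, one feature column
x = np.array([[1000.0], [1500.0], [3000.0]])

# Apply the formula directly, column by column
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# MinMaxScaler with its default feature_range=(0, 1) should agree
from_sklearn = MinMaxScaler().fit_transform(x)
print(manual.ravel(), np.allclose(manual, from_sklearn))
```

The smallest value maps to 0, the largest to 1, and 1500 maps to 0.25 because it sits a quarter of the way between them.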

In [None]:
# The sklearn library has a function that can do this for us
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Define scale function
scaler = MinMaxScaler()

# Create data frame by fitting the scale function on the 'OverallQual' and 'GrLivArea' columns
# of the training set and transforming their values as columns in the data frame
scaled = pd.DataFrame(scaler.fit_transform(training_set[['OverallQual','GrLivArea']].astype(float)),
             columns=['OverallQual','GrLivArea'])

# Inspect the result
scaled.describe()

We can put these back to their original scale using the `inverse_transform` method.

In [None]:
pd.DataFrame(scaler.inverse_transform(scaled),
             columns=['OverallQual','GrLivArea']).head()

Do they match with the original data?

In [None]:
training_set[['OverallQual','GrLivArea']].head()

# Combining features

We can also add new features to our model by combining two or more existing features. For example, let's
multiply `OverallQual` by `GrLivArea`. These are often called **interaction** variables.
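
To illustrate why an interaction variable can help, the sketch below builds synthetic data (not the housing set) in which the target is exactly the product of two columns. The names `qual`, `area`, `interact` and the helper `training_mae` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic data (not the housing set): the target is exactly qual * area
rng = np.random.default_rng(0)
toy = pd.DataFrame({'qual': rng.integers(1, 11, 100).astype(float),
                    'area': rng.uniform(500, 3000, 100)})
toy_target = toy['qual'] * toy['area']
toy['interact'] = toy['qual'] * toy['area']

def training_mae(columns):
    # Fit on the given columns and return the mean absolute error on the same rows
    model = linear_model.LinearRegression().fit(toy[columns], toy_target)
    return np.mean(np.abs(model.predict(toy[columns]) - toy_target))

mae_plain = training_mae(['qual', 'area'])
mae_interact = training_mae(['qual', 'area', 'interact'])
print(mae_plain, mae_interact)
```

A linear model can only add up weighted columns, so it cannot represent a product unless the product is supplied as its own column. On real data the gain is usually smaller than in this constructed case.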

We'll also scale our input features. When we carry out scaling, we need to pass the scaling object to the model function, so that it can apply the same scaling process to unseen input data.

In [None]:
def encode_data(data,scaler = None):
    """
    Create copy of the selected columns, add the interaction feature, scale all features and
    return data frame with specified, scaled training columns together with the scaler
    """

    data = data[['OverallQual','GrLivArea']].copy()
    # New feature is (unscaled) product of selected columns
    data['QualAreaInteract'] = data['OverallQual'] * data['GrLivArea']

    # Convert to float data type for scaling process
    data = data.astype(float)

    # Create and fit scaler if no scaler is given
    if scaler is None:
        scaler = MinMaxScaler()
        scaler.fit(data)

    # Replace data with a data frame of the scaled features, keeping the original row index
    data = pd.DataFrame(scaler.transform(data), columns = data.columns, index = data.index)
    return(data,scaler)


# Create features data frame and train model    
training_features, training_scaler = encode_data(training_set)
predictor = linear_model.LinearRegression()
predictor.fit(training_features, training_set['SalePrice'])

# Define model function which outputs predictions, using the training_scaler from above
def interaction_model(input_data, scaler = training_scaler):
    input_features,returned_scaler = encode_data(input_data,scaler)
    predictions = predictor.predict(input_features)
    return(predictions)

In [None]:
# Evaluate
evaluate_model(interaction_model, print_result=True)
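
The reason for reusing the training scaler on unseen data can be shown on a single made-up feature. In this sketch the arrays `train` and `unseen` and the name `toy_scaler` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and unseen values for a single feature
train = np.array([[1000.0], [2000.0], [3000.0]])
unseen = np.array([[500.0], [2500.0], [4000.0]])

# Fit on the training data only, so min and max come from the training set
toy_scaler = MinMaxScaler().fit(train)

# Reusing the fitted scaler maps unseen data with the same min and max;
# values outside the training range land outside [0, 1]
unseen_scaled = toy_scaler.transform(unseen)
print(unseen_scaled.ravel())
```

Fitting a fresh scaler on the unseen data would instead map 500 to 0 and 4000 to 1, so the same house would get different feature values at training and prediction time.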

# Competition Exercise (optional)

Build a multiple linear model to achieve as low a score as possible!