# Feature Encoding

First, let's load the data.

In [None]:
# Load the training data
import pandas as pd
training_set = pd.read_csv("housing_price_data/training_data.csv")

If our columns are numeric, it's straightforward to add them to a linear regression model. Below, we look at an extra numeric feature, `GrLivArea`, alongside `BedroomAbvGr`.

In [None]:
training_set[['BedroomAbvGr','GrLivArea']].head()

## Encoding binary values

To make use of categorical data, we first need to encode it as a number. As a simple example, we'll encode the variable `CentralAir` as 0 if there is no central air conditioning, and 1 if there is. We can do this with a boolean comparison, relying on the fact that `True == 1` and `False == 0`.

Run the cell below to see the output of encoding `CentralAir` this way. The values are shown as `True`/`False`, which behave as 1/0 in arithmetic.

In [None]:
(training_set['CentralAir'] == 'Y').head()
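
If explicit 0/1 integers are preferred over booleans, the comparison can be followed by `astype(int)`. A minimal sketch on a small made-up Series (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample values standing in for training_set['CentralAir']
central_air = pd.Series(['Y', 'N', 'Y', 'Y'])

# The comparison yields booleans; astype(int) makes the 0/1 encoding explicit
central_air_encoded = (central_air == 'Y').astype(int)
print(central_air_encoded.tolist())  # → [1, 0, 1, 1]
```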

## Encoding ordinal values
If the categories of a variable follow a clear rank, then we can label them by this rank. An example of this is the basement condition column, `BsmtCond`.

    BsmtCond: Evaluates the general condition of the basement

       Ex	Excellent
       Gd	Good
       TA	Typical - slight dampness allowed
       Fa	Fair - dampness or some cracking or settling
       Po	Poor - Severe cracking, settling, or wetness

We would encode this as Po:1, Fa:2, TA:3, Gd:4, and Ex:5.

For houses without a basement (i.e. `BsmtCond` is `NaN`), we use a default value of 0.


In [None]:
# Encode basement condition as an ordinal rank (Po:1 up to Ex:5)
bsmt_cond_map = {'Ex':5,'Gd':4,'TA':3,'Fa':2,'Po':1}
training_set['BsmtCond'].map(bsmt_cond_map).fillna(0).head() # Some houses have no basement
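
The same ranking can also be expressed with an ordered `pd.Categorical`, which is an alternative to the dictionary above rather than the method used in this notebook. The sketch below uses made-up sample values; it relies on the fact that category codes start at 0 and missing values get the code -1, so adding 1 reproduces the Po:1 to Ex:5 mapping with 0 for "no basement".

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; np.nan stands for a house without a basement
bsmt_cond = pd.Series(['TA', 'Gd', np.nan, 'Po', 'Ex'])

# Categories are listed from worst to best, so codes run 0..4 and NaN gets -1
ordered = pd.Categorical(bsmt_cond, categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered=True)

# Shift by one so that Po:1 ... Ex:5 and "no basement" becomes 0
bsmt_cond_encoded = pd.Series(ordered.codes + 1, index=bsmt_cond.index)
print(bsmt_cond_encoded.tolist())  # → [3, 4, 0, 1, 5]
```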

## Encoding categorical values


If there is no order, we use a technique called 'one hot encoding'. This involves creating a new column for each category, holding a 1 in the column that matches the row's category and a 0 in every other column.

For example, the `Electrical` variable in the housing prices dataset contains the following categories:

       SBrkr	Standard Circuit Breakers & Romex
       FuseA	Fuse Box over 60 AMP and all Romex wiring (Average)	
       FuseF	60 AMP Fuse Box and mostly Romex wiring (Fair)
       FuseP	60 AMP Fuse Box and mostly knob & tube wiring (poor)
       Mix	Mixed

We would encode the table

|Electrical|
|----------|
|FuseA|
|FuseF|
|FuseP|
|Mix|
|SBrkr|
  
as

|FuseA|FuseF|FuseP|Mix|SBrkr|
|-----|-----|-----|---|-----|
|1|0|0|0|0|
|0|1|0|0|0|
|0|0|1|0|0|
|0|0|0|1|0|
|0|0|0|0|1|

In [None]:
def encode_electrical(electrical):
    # One 0/1 column per category, aligned to the index of the input column
    one_hot_encoding = pd.DataFrame(index=electrical.index)
    for category in ['FuseA', 'FuseF', 'FuseP', 'Mix']:
        one_hot_encoding[category] = (electrical == category).astype(int)
    # 'SBrkr' is the standard default: any row not matched above
    # (including a missing value) is counted as 'SBrkr'
    one_hot_encoding['SBrkr'] = 1 - one_hot_encoding.sum(axis=1)

    return one_hot_encoding


In [None]:
encode_electrical(training_set['Electrical']).head()
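
pandas also provides `pd.get_dummies`, which builds the one-hot columns automatically. A minimal sketch on made-up sample values, offered as an alternative to the hand-written function above:

```python
import pandas as pd

# Hypothetical sample values standing in for training_set['Electrical']
electrical = pd.Series(['SBrkr', 'FuseA', 'Mix', 'FuseF', 'FuseP'])

# One 0/1 column per category found in the data; dtype=int avoids boolean columns
one_hot = pd.get_dummies(electrical, dtype=int)
print(one_hot)
```

Two differences from `encode_electrical` are worth keeping in mind: `get_dummies` only creates columns for categories that actually occur in the data, and by default a missing value produces a row of all zeros rather than being counted as `SBrkr`.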

## Combining features

Sometimes, combining some features together will aid the machine learning process. In this example we will multiply the number of bedrooms by the number of bathrooms:

In [None]:
bed_and_bath = training_set['FullBath'] * training_set['BedroomAbvGr']
bed_and_bath.head(10)

# Scaling Features

Many machine learning algorithms work best when all variables are on a similar scale, for example between 0 and 1 or between -1 and 1. Let's compare `LotArea` to `SalePrice`:

In [None]:
training_set[['LotArea','SalePrice']].describe()

We can transform these variables to be on the same scale using a preprocessing trick called _min/max scaling_. 

`MinMaxScale(X) = (X - min(X))/(max(X) - min(X))` 
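
To make the formula concrete, here is the same calculation done by hand with pandas on a few made-up lot areas (the numbers are illustrative only). The smallest value maps to 0 and the largest to 1.

```python
import pandas as pd

# Hypothetical lot areas, chosen so the arithmetic is easy to follow
lot_area = pd.Series([2000.0, 5000.0, 8000.0, 12000.0])

# (X - min(X)) / (max(X) - min(X))
lot_area_scaled = (lot_area - lot_area.min()) / (lot_area.max() - lot_area.min())
print(lot_area_scaled.tolist())  # → [0.0, 0.3, 0.6, 1.0]
```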

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(training_set[['LotArea','SalePrice']]),
             columns=['LotArea','SalePrice'])
scaled.head()

In [None]:
scaled.describe()

We can put these back to their original scale using the `inverse_transform` method.

In [None]:
pd.DataFrame(scaler.inverse_transform(scaled),
             columns=["LotArea","SalePrice"]).head()

In [None]:
training_set[["LotArea","SalePrice"]].head()
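
For the default 0 to 1 range, undoing min/max scaling amounts to rearranging the formula: `X = scaled * (max(X) - min(X)) + min(X)`. The sketch below checks this round trip by hand on made-up prices; it illustrates the arithmetic and is not the scaler's internal code.

```python
import pandas as pd

# Hypothetical sale prices
sale_price = pd.Series([100000.0, 150000.0, 300000.0])
lo, hi = sale_price.min(), sale_price.max()

# Forward: scale to the range 0..1
scaled_price = (sale_price - lo) / (hi - lo)

# Backward: rearrange the formula to recover the original values
restored = scaled_price * (hi - lo) + lo
print(restored.tolist())
```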

We can also add new features to our model by combining two or more existing features. For example, let's multiply bedrooms by bathrooms. We'll also scale our input and output.

# Putting it all together

Let's take what we've learnt above and create an `encode_features` function that encodes a number of features.

In [None]:
from utils import evaluate_model
from sklearn import linear_model

def encode_features(data, scaler=None):
    features = data.copy()
    
    # Encode Central Air where Y is 1, and N is 0
    features['CentralAir'] = features['CentralAir'] == 'Y'

    # Encode basement condition as an ordinal rank (Po:1 up to Ex:5)
    bsmt_cond_map = {'Ex':5,'Gd':4,'TA':3,'Fa':2,'Po':1}
    features['BsmtQuality'] = features['BsmtCond'].map(bsmt_cond_map).fillna(0) # Some houses have no basement

    # Encode electrical
    features = pd.concat([features, encode_electrical(features['Electrical'])], axis=1)
    
    # Combine bed and bath    
    features['BedBath'] = features['FullBath'] * features['BedroomAbvGr']
    
    # Scale numeric values
    scaled_columns = [
        'FullBath',
        'BedroomAbvGr', 
        'BedBath',
        'GrLivArea',
        'CentralAir', 
        'FuseA',
        'FuseF',
        'FuseP', 
        'Mix',
        'SBrkr', 
        'BsmtQuality'
    ]
    scaled_features = features[scaled_columns]
    if scaler is None:
        # No scaler supplied: fit a new one on this data (the training case)
        scaler = MinMaxScaler()
        scaler.fit(scaled_features)
        
    scaled_features = pd.DataFrame(scaler.transform(scaled_features), columns=scaled_columns)
    
    return scaled_features, scaler

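def encode_label(data):
    # MinMaxScaler expects 2D input, so select the label as a one-column DataFrame
    labels = data[['SalePrice']]
    scaler = MinMaxScaler()
    scaler.fit(labels)
    labels = pd.DataFrame(scaler.transform(labels), columns=['SalePrice'])
    return labels['SalePrice'], scaler

def decode_label(data, scaler):
    # Reshape the 1D predictions into a single column, then flatten the result again
    column = pd.Series(data).to_numpy().reshape(-1, 1)
    return scaler.inverse_transform(column).ravel()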

def train_model(training_set):
    features, feature_scaler = encode_features(training_set)
    labels, label_scaler = encode_label(training_set)
    predictor = linear_model.LinearRegression()
    
    predictor.fit(features,labels)
    
    def model(input_data):
        input_features,_ = encode_features(input_data, scaler=feature_scaler)
        output_value = predictor.predict(input_features)
        return decode_label(output_value, label_scaler)
    
    return model

In [None]:
data, _ = encode_features(training_set)
print(data.head())  # a bare expression in the middle of a cell is not displayed
scaled_model = train_model(training_set)
evaluate_model(scaled_model)
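
Note that `train_model` fits the feature scaler once on the training data and reuses it inside `model`. The by-hand sketch below, with made-up numbers, shows what that implies: new data is scaled with the training minimum and maximum, so a value outside the training range lands outside 0 to 1.

```python
import pandas as pd

# Hypothetical living areas: values seen during training and unseen values
train_area = pd.Series([1000.0, 1500.0, 2000.0])
new_area = pd.Series([1250.0, 2500.0])

# Use the minimum and maximum from the training data only
train_min, train_max = train_area.min(), train_area.max()
new_area_scaled = (new_area - train_min) / (train_max - train_min)
print(new_area_scaled.tolist())  # → [0.25, 1.5]
```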

## Exercise

Try some of the following:
- Add `LotArea` to the linear regression model and observe how its performance changes.
- Add whether or not the house has a pool to the model in the code above (look at the `PoolArea` variable).
- Add kitchen quality as a feature in the code above (see `KitchenQual` in `data_description.txt`).
- Add `Heating` using one-hot encoding.

Compare your new model's accuracy to that of the model above. Did it improve or get worse?