# The scikit-learn API


We now have a sense of the workflow necessary to use machine learning models.

The steps we take are:

1. _feature engineering_: Encode all the data we want to use into numbers. Perform any other transformations you think are necessary (e.g. scaling). We end up with a 2D array `X_train` containing all the columns we want to process, as well as an array `y_train` containing the column we want to predict (possibly scaled).

2. _train our model_:  This can be done with the command `model.fit(X_train,y_train)`.

3. _predict_: Use our freshly trained model to make some predictions.  This is done with the command `model.predict(input)`.

4. _evaluate_: Measure the accuracy of the predictions on the test set. We can use the `evaluate_model` function, which reports the mean absolute error.
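To make the four steps concrete, here is a minimal sketch of the whole workflow. It runs on synthetic data rather than the housing set, so the arrays, coefficients and variable names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption, not the housing data set)
rng = np.random.default_rng(0)
raw_train = rng.uniform(0, 100, size=(200, 3))
y_train = 3.0 * raw_train[:, 0] + 0.5 * raw_train[:, 1] + rng.normal(0, 1, 200)
raw_test = rng.uniform(0, 100, size=(50, 3))
y_test = 3.0 * raw_test[:, 0] + 0.5 * raw_test[:, 1] + rng.normal(0, 1, 50)

# Step 1: feature engineering (here just scaling, fitted on the training data only)
scaler = MinMaxScaler().fit(raw_train)
X_train = scaler.transform(raw_train)
X_test = scaler.transform(raw_test)

# Step 2: train the model
model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)

# Step 3: predict
predictions = model.predict(X_test)

# Step 4: evaluate with the mean absolute error
mae = np.mean(np.abs(predictions - y_test))
print("Mean absolute error: %.2f" % mae)
```

Note that the scaler is fitted once on the training data and then reused for the test data, which mirrors how the `scaler` argument is used later in this notebook.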

We will now apply this workflow to different algorithms from the scikit-learn toolkit: Gradient Boosting, Random Forests and Support Vector Regression.

In [None]:
# Import modules
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
import sklearn.ensemble
import sklearn.svm

In [None]:
# Import data
training_set = pd.read_csv("https://raw.githubusercontent.com/eliiza/ml-training-data/master/housing_price_data/training_data.csv")

In [None]:
# Use same evaluate_model function as before
def evaluate_model(model_fn, print_result=False):
    '''
    Takes a prediction function model_fn, which maps a dataframe of
    test inputs to predicted sale prices, and measures its mean
    absolute error (MAE) against the housing prices test set.
    If print_result is True, prints a human-readable summary;
    otherwise returns the MAE as an unrounded float.
    '''
    test_data = pd.read_csv("https://raw.githubusercontent.com/eliiza/ml-training-data/master/housing_price_data/test_data.csv")
    actual_values = test_data['SalePrice']
    # Pass in all columns except SalePrice
    test_input = test_data.filter(regex='^(?!SalePrice$).*')
    predicted_saleprice = model_fn(test_input)
    mae = np.mean(np.abs(predicted_saleprice-actual_values))
    if print_result:
        # Print a summary; nothing useful to return in this case
        print("The model is inaccurate by $%.2f on average." % mae)
        return None
    return mae

## Step 1: Feature Engineering

We are going to use some of the features we looked at in previous notebooks, initially:

- Numeric:
    - `OverallQual`
    - `GrLivArea`
    - `BedroomAbvGr`
    - `FullBath`
    - `YearBuilt`
- Engineered:
    - `QualAreaInteract` (from `OverallQual` and `GrLivArea`)
- Categorical (to be one-hot encoded):
    - `CentralAir`
    - `Electrical`
    - `Heating`

### Exercise

- Review the code below - we are simply making a function to do the feature engineering steps in a repeatable manner.
- Add any features missing from the function that you found improved accuracy in the previous notebook, **for example the best ordinal variable.**
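If you need a reminder of how an ordinal variable can be encoded, the sketch below shows one possible approach. The column name `KitchenQual` and the quality codes in the mapping are assumptions chosen for illustration; substitute whichever ordinal variable and ordering you found worked best.

```python
import pandas as pd

# Hypothetical ordering of quality codes, from worst to best
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Invented sample column standing in for the real data
toy = pd.DataFrame({"KitchenQual": ["TA", "Gd", "Ex", "Fa"]})

# Map each code to its rank; codes missing from the mapping become NaN
encoded = toy["KitchenQual"].map(quality_order)
print(encoded.tolist())
```

Inside `encode_data`, a line of this kind would sit under the `# Ordinal feature(s)` comment, before the features are converted to floats and scaled.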

In [None]:
# Step 1: Feature Engineering
def encode_data(data,scaler = None):
    """
    Encode a dataframe of house price data using the desired feature engineering process.
    The scaler argument lets you either fit a new scaler (scaler = None)
    or reuse previously fitted scaling parameters,
    e.g. when you want to encode test data using the scaling parameters from the training dataset.
    Returns a dataframe of engineered features and the scaler object.
    """
    
    # Numerical features
    # Copy the selected columns so that new columns can be added below
    # without pandas raising a SettingWithCopyWarning
    features = data[['OverallQual','GrLivArea','BedroomAbvGr','FullBath','YearBuilt']].copy()
    features['QualAreaInteract'] = features['OverallQual'] * features['GrLivArea']
    
    # Ordinal feature(s)
    
    # Categorical features - one-hot encode using pre-written helper functions (below)
    features['CentralAir'] = data['CentralAir'] == 'Y'
    electrical = encode_electrical(data['Electrical'])
    heating = encode_heating(data['Heating'])
    features = pd.concat([features,electrical,heating],axis=1)
 
    # Convert to float data type for scaling process
    features = features.astype(float)
    
    # Scale all the features
    # If no `scaler` object is passed in, fit a new one on these features
    # If a `scaler` object is passed in, reuse its scaling parameters
    if scaler is None:
        scaler = MinMaxScaler()
        scaler.fit(features)
    # Reuse the existing column names so that any features added above are carried through
    features = pd.DataFrame(scaler.transform(features), columns=features.columns)
    
    # Return the desired data frame and the scaling parameters used
    return(features,scaler)


# Helper functions for one hot encoding

def encode_electrical(electrical):
    """
    Create data frame with one column per category in 'electrical' column, rows are Boolean with respect to
    category string in electrical column.
    """
    one_hot_encoding = pd.DataFrame()
    one_hot_encoding['FuseA'] = electrical == 'FuseA'
    one_hot_encoding['FuseF'] = electrical == 'FuseF'
    one_hot_encoding['FuseP'] = electrical == 'FuseP'
    one_hot_encoding['Mix']   = electrical == 'Mix'
    one_hot_encoding['SBrkr'] = electrical == 'SBrkr'
    return(one_hot_encoding)

def encode_heating(heating):
    """
    Create data frame with one column per category in 'heating' column, rows are Boolean with respect to
    category string in heating column.
    """
    one_hot_encoding = pd.DataFrame()
    one_hot_encoding['GasA'] = heating == 'GasA'
    one_hot_encoding['GasW'] = heating == 'GasW'
    one_hot_encoding['Grav'] = heating == 'Grav'
    one_hot_encoding['Wall'] = heating == 'Wall'
    return(one_hot_encoding)

## Example: Gradient Boosting

We will be using scikit-learn's Gradient Boosting implementation: [sklearn.ensemble.GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html). Other implementations of gradient boosting exist, such as [XGBoost](https://xgboost.readthedocs.io/en/latest/), but we won't use them here.

In [None]:
# Step 1: Feature engineering
training_features, training_scaler = encode_data(training_set)

In [None]:
# Gradient Boosting 
# Step 2: Train the model
predictor = sklearn.ensemble.GradientBoostingRegressor()
predictor.fit(training_features, training_set['SalePrice'])
    
# Step 3: Create a function that can make predictions using the above trained model and scaler
def boosting_model(input_data, scaler = training_scaler):
    input_features,_ = encode_data(input_data,scaler)
    predictions = predictor.predict(input_features)
    return(predictions)

In [None]:
# Step 4: Evaluate
evaluate_model(boosting_model, print_result=True)

**Question:** How does this accuracy compare with your previous models?

This disciplined workflow makes it easy to try other models.  
This [cheat sheet](https://www.analyticsvidhya.com/infographics/Scikit-Learn-Infographic.pdf) provides a list of models supported by scikit-learn.  
Let's try two more...

## Exercise: Errors

- Explore the predictions and errors in this Gradient Boosting model. 
- Are the outliers the same as in the previous models you've fitted?
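One possible way to explore the errors is to put the actual values, predictions and errors side by side and sort by the size of the error. The sketch below does this on synthetic data, so the columns `a` and `b` and the model fitted here are assumptions for illustration; for the housing data you would load the test set and call your prediction function instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data (an assumption, not the housing data set)
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.uniform(0, 1, size=(300, 2)), columns=["a", "b"])
y = 100 * X["a"] + 20 * X["b"] + rng.normal(0, 5, 300)

# Train on the first 200 rows, hold out the last 100
model = GradientBoostingRegressor(random_state=0)
model.fit(X.iloc[:200], y.iloc[:200])

# Collect actual values, predictions and errors for the held-out rows
results = X.iloc[200:].copy()
results["actual"] = y.iloc[200:]
results["predicted"] = model.predict(X.iloc[200:])
results["error"] = results["predicted"] - results["actual"]
results["abs_error"] = results["error"].abs()

# The rows with the largest absolute errors are the candidates for outliers
worst = results.sort_values("abs_error", ascending=False).head(5)
print(worst)
```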

## Exercise: Random Forest

Implement a model using the same workflow as above with [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). Hint: older versions of scikit-learn warn that the default number of trees is changing; setting `n_estimators = 100` within `RandomForestRegressor()` avoids the warning.

How does it perform?

## Exercise: Support Vector Regression

Implement a model using [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html). Hint: older versions of scikit-learn warn that the default for `gamma` is changing; setting `gamma='auto'` within `SVR()` avoids the warning.

Is this performance surprising? Have a quick look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) - any ideas how you might improve its performance?
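As a hint, the sketch below compares a default `SVR` with one using a much larger `C` on synthetic data whose target is on the scale of house prices. The data, the helper name `svr_mae` and the specific values of `C` and `epsilon` are assumptions for illustration, not recommended settings for the housing data.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data with a target in the hundreds of thousands (an assumption)
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(200, 2))
y = 200_000 + 100_000 * X[:, 0] + 50_000 * X[:, 1] + rng.normal(0, 5_000, 200)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

def svr_mae(**kwargs):
    """Fit an SVR with the given hyperparameters and return its test MAE."""
    model = SVR(**kwargs).fit(X_train, y_train)
    return np.mean(np.abs(model.predict(X_test) - y_test))

# Default settings: C is small relative to the size of the target values
mae_default = svr_mae()

# Much larger C (and epsilon) to match the scale of the target
mae_large_c = svr_mae(C=1e6, epsilon=1000)

print("Default SVR MAE:  %.0f" % mae_default)
print("Large-C SVR MAE:  %.0f" % mae_large_c)
```

With the default `C`, the model can barely move its predictions away from a constant when the target values are this large, which is one reason the hyperparameters matter so much for SVR.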

## Exercise: Algorithm hyperparameters (optional)

Each of these algorithms has various hyperparameters that can be adjusted to fine tune the model.

Hyperparameters are the options used to direct the learning process. Parameters are the learned model coefficients.
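The distinction can be seen directly on a scikit-learn model, as in the rough sketch below. It uses synthetic data (an assumption for illustration). For a tree-based model such as Gradient Boosting, the learned parameters are the fitted trees rather than a list of coefficients.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, max_depth=3)

# Hyperparameters: chosen by us before training, available via get_params()
hyperparameters = model.get_params()
for name in ["n_estimators", "learning_rate", "max_depth"]:
    print(name, "=", hyperparameters[name])

# Synthetic data where only the first feature matters (an assumption)
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(100, 2))
y = 10 * X[:, 0] + rng.normal(0, 0.1, 100)
model.fit(X, y)

# Learned from the data during fit(): the trees and what they imply
print("Number of fitted trees:", len(model.estimators_))
print("Feature importances:", model.feature_importances_)
```

Calling `get_params()` on a model created with no arguments is also a quick way to see the default value of every hyperparameter.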

- Gradient Boosting
    - Main tuning hyperparameters:
         - `n_estimators`
         - `learning_rate`
         - `max_depth`
- Random Forests
    - Main tuning hyperparameters:
         - `n_estimators`
         - `max_features`
         - `max_depth`
- Support Vector Regression
    - Main tuning hyperparameters:
        - `C`
        - `epsilon`
        - `kernel`
        
        
- Different algorithms respond to hyperparameter adjustments differently. For this data problem, the performance of the SVR model is *highly dependent* on the values of its hyperparameters; tuning them can bring the SVR model closer to the accuracy of the others.


- Have a look at the documentation to see what each of these hyperparameters does.
- What are their default values?
- Play around with changing these hyperparameters to see if you can improve the model accuracies.
- For the SVR model - try much larger values for `epsilon` and `C`!
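Changing hyperparameters by hand works well for exploration. A more systematic option, which this notebook does not otherwise use, is scikit-learn's `GridSearchCV`, which tries every combination in a grid and scores each one by cross-validation. The sketch below is a minimal example on synthetic data; the data and the grid values are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the housing data set)
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(200, 3))
y = 100 * X[:, 0] + 30 * X[:, 1] ** 2 + rng.normal(0, 2, 200)

# Candidate values for the main Gradient Boosting hyperparameters
param_grid = {
    "n_estimators": [50, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

# Score each combination by 3-fold cross-validated mean absolute error
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
# The score is the negated MAE, so flip the sign to report it
print("Cross-validated MAE: %.2f" % -search.best_score_)
```

Because the search scores each combination using only the training data, the test set stays untouched for the final evaluation.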