# The scikit-learn API


We now have a sense of the workflow necessary to use machine learning models.

The steps we take are:

1. _feature engineering_: Encode all the data we want to use as numbers, and perform any other transformations we think are necessary (e.g. scaling). We end up with a 2D array `X_train` containing all the feature columns, as well as an array `y_train` containing the column we want to predict (possibly scaled).

2. _train our model_: Fit the model to the training data with `model.fit(X_train, y_train)`.

3. _predict_: Use the freshly trained model to make predictions with `model.predict(X_test)`.

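4. _evaluate_: Compare the predictions against the known values. In this notebook, the `train_and_evaluate_model` function defined below does this using the mean absolute error (MAE).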
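
The four steps above can be sketched end to end on a small synthetic dataset. The data, the choice of `StandardScaler` and the variable names are assumptions made for illustration; they are not the housing data used in the rest of this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration): y = 3*x0 - 2*x1 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 1. Feature engineering: fit the scaler on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Train the model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# 3. Predict on data the model has not seen
y_pred = model.predict(X_test_scaled)

# 4. Evaluate with the mean absolute error
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.3f}")
```

Note that the scaler is fitted on the training set and only applied to the test set, which mirrors how `fit_transform` and `transform` are used later in this notebook.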

We will now apply this workflow to different algorithms from the scikit-learn toolkit: Gradient Boosting, Random Forests and Support Vector Regression.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.model_selection import train_test_split

def load_housing_data(test_size=0.2, random_state=None):
    # Load the data from Eliiza's GitHub repository
    raw_data = pd.read_csv("https://raw.githubusercontent.com/eliiza/ml-training-data/master/housing_price_data/housing_data.csv") 

    # Separate labels from feature columns.
    X = raw_data.drop('SalePrice', axis=1)
    y = raw_data['SalePrice']
    
    # Split the dataset with the requested proportions.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    
    # Return in standard order.
    return (X_train, y_train), (X_test, y_test)

(X_train, y_train), (X_test, y_test) = load_housing_data(random_state=100)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def train_and_evaluate_model(pipeline, model=None):
    """Fit the pipeline and model on the training data; return the test-set MAE."""
    # Create a fresh default model on each call instead of sharing one instance
    if model is None:
        model = LinearRegression()

    # Load the data
    (X_train, y_train), (X_test, y_test) = load_housing_data(random_state=100)

    # Fit the input pipeline on the training data and transform it
    X_train_prepared = pipeline.fit_transform(X_train)

    # Train the model
    model.fit(X_train_prepared, y_train)

    # Transform the test data with the already-fitted pipeline
    X_test_prepared = pipeline.transform(X_test)

    # Make some predictions
    y_pred = model.predict(X_test_prepared)

    # Calculate the mean absolute error
    mae = mean_absolute_error(y_test, y_pred)

    return mae

## Step 1: Feature Engineering

Initially, we will use some of the features we looked at in previous notebooks:

- Numeric:
    - `OverallQual`
    - `GrLivArea`
    - `LotFrontage`
    - `LotArea`
    - `YearBuilt`
    - `MSSubClass`
    - `OverallCond`
- Categorical (to be one-hot encoded):
    - `Electrical`
    - `Heating`
    - `Neighborhood`
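
As a reminder of what one-hot encoding produces, here is a minimal sketch on a toy column. The category values are illustrative and typed in by hand rather than read from the dataset, and `handle_unknown='ignore'` is an option chosen for this sketch so that a category seen only at prediction time is encoded as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training and test columns (illustrative values, not read from the dataset)
train_heating = np.array([['GasA'], ['GasW'], ['GasA']])
test_heating = np.array([['GasW'], ['Grav']])  # 'Grav' is never seen in training

# Learn the categories from the training column only
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_heating)
print(encoder.categories_)

# Each row becomes one column per learned category; the unseen
# category 'Grav' becomes a row of zeros
encoded_test = encoder.transform(test_heating).toarray()
print(encoded_test)
```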

### Exercise

- Review the code below. It simply wraps the feature engineering steps in a pipeline so that they can be repeated consistently.
- Add any missing features that you found helped improve accuracy in the previous notebook, **for example, ordinal variables.**
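
One possible way to handle an ordinal variable is sketched below with `OrdinalEncoder`. The column name `ExterQual` and the quality ordering are assumptions used for illustration, so check them against the data description before relying on them. The toy DataFrame stands in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Assumed ordering from worst to best: 'Po' maps to 0 and 'Ex' maps to 4
quality_order = ['Po', 'Fa', 'TA', 'Gd', 'Ex']

# Toy stand-in for one ordinal column, including a missing value
toy = pd.DataFrame(
    {'ExterQual': ['TA', 'Gd', np.nan, 'Ex', 'TA']}, dtype=object
)

ordinal_pipeline = Pipeline([
    # Fill missing values with the most common category ('TA' in this toy data)
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # Map each category to its position in quality_order
    ('ordinal_encoder', OrdinalEncoder(categories=[quality_order])),
])

encoded = ordinal_pipeline.fit_transform(toy)
print(encoded.ravel())
```

A pipeline like this could be added to a `ColumnTransformer` as a third entry alongside the numeric and one-hot pipelines, with its own list of ordinal feature names.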

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Numeric features:
numeric_features = ['OverallQual', 'GrLivArea', 'LotFrontage', 'LotArea', 'YearBuilt', 'MSSubClass', 'OverallCond']

numeric_pipeline = Pipeline([
    ('simple_imputer', SimpleImputer(strategy='median')),
    ('standard_scaler', StandardScaler())
])

# Categorical features:
one_hot_features = ['Electrical', 'Heating', 'Neighborhood']

one_hot_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # Ignore categories that appear in the test set but not in the training set
    ('one_hot_encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine into single pipeline:
pipeline = ColumnTransformer([
    ('numeric_pipeline', numeric_pipeline, numeric_features),
    ('one_hot_pipeline', one_hot_pipeline, one_hot_features)
])

train_and_evaluate_model(pipeline)

## Example: Gradient Boosting

We will use scikit-learn's gradient boosting implementation: [sklearn.ensemble.GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html). Other implementations of gradient boosting exist, including [XGBoost](https://xgboost.readthedocs.io/en/latest/), but we won't use them here.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

train_and_evaluate_model(pipeline, GradientBoostingRegressor())

**Question:** How does this error (MAE) compare with your previous models?

This disciplined workflow makes it easy to try other models.  
This [cheat sheet](https://www.analyticsvidhya.com/infographics/Scikit-Learn-Infographic.pdf) lists models supported by scikit-learn.  
Let's try two more...

## Exercise: Errors

- Explore the predictions and errors in this Gradient Boosting model. 
- Are the outliers the same as in the previous models you've fitted?
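
One way to start exploring the errors is to put the actual values, predictions and errors side by side and sort by absolute error. The sketch below uses synthetic data from `make_regression` so that it runs on its own; the column names in the `errors` DataFrame are hypothetical choices, not part of any library.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumed for illustration)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)

# Collect actual values, predictions and errors in one table
errors = pd.DataFrame({
    'actual': y_test,
    'predicted': model.predict(X_test),
})
errors['error'] = errors['predicted'] - errors['actual']
errors['abs_error'] = errors['error'].abs()

# The rows with the largest absolute errors are candidate outliers
worst = errors.sort_values('abs_error', ascending=False).head(5)
print(worst)
```

With the housing data, the index of the test set could be kept on this table so that the worst rows can be compared with the outliers from earlier models.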

## Exercise: Random Forest

Implement a model using the same workflow as above with [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). Hint: in scikit-learn versions before 0.22, set `n_estimators=100` within `RandomForestRegressor()` to avoid a warning about changing argument defaults. From version 0.22 onwards, 100 is already the default.

How does it perform?

## Exercise: Support Vector Regression

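Implement a model using [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html). Hint: in scikit-learn versions before 0.22, set `gamma='auto'` within `SVR()` to avoid a warning about changing argument defaults. From version 0.22 onwards, the default is `gamma='scale'` and no warning is raised.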

Is this performance surprising? Have a quick look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html). Do you have any ideas for how you might improve it?

## Exercise: Algorithm hyperparameters (optional)

Each of these algorithms has various hyperparameters that can be adjusted to fine-tune the model.

Hyperparameters are the options used to direct the learning process. Parameters are the learned model coefficients.
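
The distinction can be seen directly on a scikit-learn estimator: hyperparameters are passed to the constructor and listed by `get_params()`, while learned parameters appear as attributes ending in an underscore after `fit`. The sketch below uses `LinearRegression` on synthetic, noise-free data, which is an assumption made to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (assumed for illustration): y = 3*x0 - 2*x1
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1]

# Hyperparameters are chosen by us before training
model = LinearRegression(fit_intercept=True)
hyperparameters = model.get_params()
print(hyperparameters)

# Parameters are learned from the data during fit
model.fit(X, y)
print(model.coef_, model.intercept_)
```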

- Gradient Boosting
    - Main tuning hyperparameters:
        - `n_estimators`
        - `learning_rate`
        - `max_depth`
- Random Forests
    - Main tuning hyperparameters:
        - `n_estimators`
        - `max_features`
        - `max_depth`
- Support Vector Regression
    - Main tuning hyperparameters:
        - `C`
        - `epsilon`
        - `kernel`
        
        
Different algorithms respond differently to hyperparameter adjustments. For this problem, the performance of the SVR model is *highly dependent* on its hyperparameter values. Tuning them can bring the SVR model closer to the accuracy of the others.


- Have a look at the documentation to see what each of these hyperparameters does.
- What are their default values?
- Play around with changing these hyperparameters to see if you can improve the model accuracies.
- For the SVR model, try much larger values for `epsilon` and `C`!
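
To see why larger values matter for SVR, here is a minimal sketch on synthetic targets with a house-price-like scale. The data, the helper name `svr_mae` and the specific values `C=100_000` and `epsilon=1_000` are assumptions chosen for illustration, not recommended settings for the housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic targets in the hundreds of thousands (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (180_000 + 40_000 * X[:, 0] + 20_000 * X[:, 1]
     + rng.normal(scale=5_000, size=200))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

def svr_mae(**hyperparameters):
    """Hypothetical helper: fit an SVR with the given hyperparameters, return test MAE."""
    model = SVR(**hyperparameters)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))

# Default hyperparameters versus much larger C and epsilon
mae_default = svr_mae()
mae_tuned = svr_mae(C=100_000, epsilon=1_000)
print(f"default MAE: {mae_default:.0f}")
print(f"tuned MAE:   {mae_tuned:.0f}")
```

When the target values are this large, the default `C` leaves the model heavily regularised relative to the size of the errors, so its predictions stay close to a constant. Raising `C` lets the model follow the data, and `epsilon` sets how large an error is tolerated without penalty.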