# The scikit-learn API


We now have a sense of the workflow necessary to use machine learning models.

The steps we take are:

1. _feature engineering_: Encode all the data we want to use as numbers, and apply any other transformations that seem necessary (e.g. scaling). We end up with a 2D array `X_train` containing all the columns we want to process, and an array `y_train` containing the column we want to predict (possibly scaled).

2. _train our model_: This is done with the command `model.fit(X_train, y_train)`.

3. _predict_: Use our freshly trained model to make some predictions. This is done with the command `model.predict(input)`.

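4. _evaluate_: Measure how good the predictions are. Here we use the `evaluate_model` function imported from `utils`, which reports how far off the model is on average.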
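
To make the contract behind steps 2 and 3 concrete, here is a minimal sketch of steps 1 to 3 in plain Python. `TinyLinearRegressor` is a hypothetical stand-in written for this illustration, not a scikit-learn class; it only mimics the `fit(X, y)` / `predict(X)` convention that scikit-learn estimators follow, and the data is made up.

```python
class TinyLinearRegressor:
    """Hypothetical one-feature least-squares model with a fit/predict interface."""

    def fit(self, X, y):
        # X is a 2D list (one row per example); this toy model uses only column 0.
        xs = [row[0] for row in X]
        mean_x = sum(xs) / len(xs)
        mean_y = sum(y) / len(y)
        covariance = sum((x - mean_x) * (t - mean_y) for x, t in zip(xs, y))
        variance = sum((x - mean_x) ** 2 for x in xs)
        self.slope_ = covariance / variance
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self  # returning self mirrors the scikit-learn convention

    def predict(self, X):
        return [self.slope_ * row[0] + self.intercept_ for row in X]


# Step 1: feature engineering. Encode the raw records as numbers and scale them.
raw_sizes = [50.0, 80.0, 120.0, 200.0]                   # made-up floor areas
y_train = [100_000.0, 160_000.0, 240_000.0, 400_000.0]   # made-up prices
largest = max(raw_sizes)
X_train = [[size / largest] for size in raw_sizes]       # 2D: one row per example

# Step 2: train the model.
model = TinyLinearRegressor()
model.fit(X_train, y_train)

# Step 3: predict. New inputs go through the same scaling as the training data.
X_new = [[100.0 / largest]]
predictions = model.predict(X_new)
print(predictions)
```

Because the rest of the workflow only relies on `fit` and `predict`, any object that offers those two methods can be dropped into step 2 without touching the other steps.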



In [9]:
import xgboost
from utils import evaluate_model, encode_features, encode_label, decode_label
import pandas as pd
import sklearn

def train_xgboost_model(training_set):
    # Step 1: Encode Features
    features, feature_scaler = encode_features(training_set)
    labels, label_scaler = encode_label(training_set)
    
    # Step 2: Train the model
    predictor = xgboost.XGBRegressor()
    
    predictor.fit(features, labels)
    
    def model(input_data):
        # Encode new inputs with the scaler that was fitted on the training data
        input_features, _ = encode_features(input_data, scaler=feature_scaler)
        
        # Step 3: Make predictions
        output_value = predictor.predict(input_features)
        return decode_label(output_value, label_scaler)
    
    return model

In [10]:
training_set = pd.read_csv("housing_price_data/training_data.csv")

In [11]:
xgboost_model = train_xgboost_model(training_set)

# Step 4: Evaluate
evaluate_model(xgboost_model)

The model is inaccurate by $30151.39 on average.




30151.387762200342
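
The implementation of `evaluate_model` lives in `utils` and is not shown here. Its message reads like a mean absolute error: the average, over a set of houses, of the absolute difference between the predicted and the actual price. Assuming that is what it reports, the computation could be sketched as follows. The function name and the numbers below are made up for illustration.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predictions and true values."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


# Made-up prices, for illustration only.
actual_prices = [200_000.0, 310_000.0, 150_000.0]
predicted_prices = [210_000.0, 300_000.0, 180_000.0]

error = mean_absolute_error(predicted_prices, actual_prices)
print(f"The model is inaccurate by ${error:.2f} on average.")
```

A lower value means the predictions sit closer to the true prices, which is what makes this number useful for comparing models.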

This disciplined workflow makes it easy to try other models. This [cheat sheet](https://www.analyticsvidhya.com/infographics/Scikit-Learn-Infographic.pdf) lists the models supported by scikit-learn. Here is an implementation using `SVR`, scikit-learn's support vector regressor:

In [12]:
import sklearn.svm as svm

def train_svr_model(training_set):
    # Step 1: Encode Features
    features, feature_scaler = encode_features(training_set)
    labels, label_scaler = encode_label(training_set)
    
    # Step 2: Train the model
    predictor = svm.SVR()
    
    predictor.fit(features, labels)
    
    def model(input_data):
        # Encode new inputs with the scaler that was fitted on the training data
        input_features, _ = encode_features(input_data, scaler=feature_scaler)
        
        # Step 3: Make predictions
        output_value = predictor.predict(input_features)
        return decode_label(output_value, label_scaler)
    
    return model

In [13]:
svr_model = train_svr_model(training_set)
evaluate_model(svr_model)

The model is inaccurate by $38669.74 on average.




38669.738630703934
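
`train_xgboost_model` and `train_svr_model` differ only in the line that constructs the predictor, so the shared steps can be factored out. The sketch below is one possible way to do that and is not part of the original notebook: `make_trainer` is a hypothetical helper, and `toy_encode_features`, `toy_encode_label`, `toy_decode_label`, and `MeanPredictor` are toy stand-ins that only imitate the call signatures used above so that the block runs on its own.

```python
def make_trainer(make_predictor, encode_features, encode_label, decode_label):
    """Build a training function for any predictor that offers fit and predict."""
    def train(training_set):
        # Step 1: encode features and labels
        features, feature_scaler = encode_features(training_set)
        labels, label_scaler = encode_label(training_set)

        # Step 2: train the model
        predictor = make_predictor()
        predictor.fit(features, labels)

        def model(input_data):
            # Reuse the scaler fitted on the training data
            input_features, _ = encode_features(input_data, scaler=feature_scaler)
            # Step 3: make predictions
            return decode_label(predictor.predict(input_features), label_scaler)

        return model
    return train


# Toy stand-ins (hypothetical) that imitate the helper signatures used above.
def toy_encode_features(rows, scaler=None):
    if scaler is None:
        scaler = max(row["size"] for row in rows)
    return [[row["size"] / scaler] for row in rows], scaler

def toy_encode_label(rows):
    scaler = max(row["price"] for row in rows)
    return [row["price"] / scaler for row in rows], scaler

def toy_decode_label(values, scaler):
    return [value * scaler for value in values]

class MeanPredictor:
    """Toy predictor that always returns the mean of the training labels."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


train_mean_model = make_trainer(
    MeanPredictor, toy_encode_features, toy_encode_label, toy_decode_label
)
toy_training_set = [
    {"size": 50.0, "price": 100_000.0},
    {"size": 100.0, "price": 300_000.0},
]
mean_model = train_mean_model(toy_training_set)
result = mean_model([{"size": 75.0}])
print(result)
```

With the real helpers from `utils`, the same factory would be called as `make_trainer(svm.SVR, encode_features, encode_label, decode_label)`, assuming those helpers keep the signatures used in the cells above.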

Exercise: Implement a model using [SGDRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html), following the same four steps.