# The scikit-learn API


We now have a sense of the workflow necessary to use machine learning models.

The steps we take are:

1. _feature engineering_: Encode all the data we want to use as numbers, and apply any other transformations you think are necessary (e.g. scaling). We end up with a 2D array `X_train` containing all the feature columns the model will learn from, as well as an array `y_train` containing the column we want to predict (possibly scaled).

2. _train our model_: This is done with the command `model.fit(X_train, y_train)`.

3. _predict_: Use our freshly trained model to make some predictions. This is done with the command `model.predict(input)`.

4. _evaluate_: Measure how good the predictions are. We can use the `evaluate_model` function imported from `utils`.
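The four steps above can be sketched end to end on a toy problem. This is only an illustration under stated assumptions: the data is synthetic, `LinearRegression` and `StandardScaler` stand in for whichever model and encoding you actually use, and the error measure in step 4 is a root-mean-squared error computed by hand rather than the `evaluate_model` helper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): y = 3*x0 - 2*x1 + 5 plus small noise.
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 100, size=(200, 2))
y = 3 * X_raw[:, 0] - 2 * X_raw[:, 1] + 5 + rng.normal(0, 0.1, size=200)

# Step 1: feature engineering (here, only scaling), fitted on the first 150 rows.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_raw[:150])
y_train = y[:150]

# Step 2: train the model.
model = LinearRegression()
model.fit(X_train, y_train)

# Step 3: predict on the held-out rows, reusing the scaler fitted on the training rows.
X_test = scaler.transform(X_raw[150:])
predictions = model.predict(X_test)

# Step 4: evaluate with a root-mean-squared error on the held-out rows.
rmse = float(np.sqrt(np.mean((predictions - y[150:]) ** 2)))
print(rmse)
```

Every scikit-learn style regressor exposes the same `fit` and `predict` methods, so swapping `LinearRegression` for another model changes only the line that constructs it.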



In [1]:
import xgboost
from utils import evaluate_model, encode_features, encode_label, decode_label
import pandas as pd
import sklearn

def train_xgboost_model(training_set):
    # Step 1: Encode Features
    features, feature_scaler = encode_features(training_set)
    labels, label_scaler = encode_label(training_set)
    
    # Step 2: Train the model
    predictor = xgboost.XGBRegressor()
    
    predictor.fit(features, labels)
    
    def model(input_data):
        input_features, _ = encode_features(input_data, scaler=feature_scaler)
        
        # Step 3: Make predictions
        output_value = predictor.predict(input_features)
        return decode_label(output_value, label_scaler)
    
    return model
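The inner `model` function passes `scaler=feature_scaler` so that new inputs are scaled with the statistics learned from the training set rather than with statistics recomputed on the new data. The real `encode_features` lives in `utils` and is not shown here; the following is a hypothetical stand-in (`encode_features_sketch`, numeric columns only, using `StandardScaler`) that illustrates the fit-once, reuse-later pattern.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def encode_features_sketch(frame, scaler=None):
    """Hypothetical stand-in for utils.encode_features (numeric columns only)."""
    numeric = frame.select_dtypes(include="number").to_numpy(dtype=float)
    if scaler is None:
        # Training time: learn the mean and standard deviation of each column.
        scaler = StandardScaler().fit(numeric)
    # Training and prediction time: apply the (possibly reused) scaler.
    return scaler.transform(numeric), scaler

# Made-up example rows, not taken from the housing data.
train = pd.DataFrame({"area": [50.0, 100.0, 150.0], "rooms": [1.0, 2.0, 3.0]})
new = pd.DataFrame({"area": [100.0], "rooms": [2.0]})

train_features, fitted = encode_features_sketch(train)
new_features, same = encode_features_sketch(new, scaler=fitted)
print(new_features)
```

Because the new row equals the training mean in both columns, it is mapped to zeros, which would not happen if a fresh scaler were fitted on the single new row.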

In [2]:
training_set = pd.read_csv("housing_price_data/training_data.csv")

In [3]:
xgboost_model = train_xgboost_model(training_set)

# Step 4: Evaluate
evaluate_model(xgboost_model)

[ 224417.703125    672388.875       131598.546875    142147.109375
  177939.03125     364340.6875      282854.9375      314666.8125
  123160.5625      141114.1875      142388.578125    105892.1171875
  303076.21875     246289.921875    679541.875       131271.296875
  171365.609375    134029.625       217494.015625    221952.46875
  207514.828125    223359.6875      175564.875       195570.84375     154607.5
  174722.890625    138362.734375    181725.265625    192129.890625
  172479.890625    169788.234375    145447.28125     121331.8125
  217494.015625    145447.28125     220146.296875    131598.546875
  267661.84375     131271.296875    217494.015625    142147.109375
  207514.828125    244050.109375    168473.421875     97836.6953125
  146777.21875     129307.21875     123160.5625      155282.765625
  170942.984375     69358.8984375   141114.1875      122833.3125
  141114.1875      244050.109375    130979.640625    186624.515625
  135024.09375     208190.546875    122833.3125      27



30151.387762200342
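The output above is the array of predicted prices followed by a single score. The `evaluate_model` helper is defined in `utils` and its metric is not shown in this notebook; assuming it reports something like a root-mean-squared error in the units of the label, a minimal sketch of such a metric could look like this (the function name `rmse` and the example numbers are made up for illustration).

```python
import numpy as np

def rmse(predicted, actual):
    """Root-mean-squared error; an assumed metric, not necessarily what utils uses."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Two made-up predictions, each off by 10,000 from the true price.
example = rmse([200000.0, 150000.0], [210000.0, 140000.0])
print(example)
```

Under that assumption, a lower score means the predictions are closer to the true prices on average.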

This disciplined workflow makes it easy to try other models. This [cheat sheet](https://www.analyticsvidhya.com/infographics/Scikit-Learn-Infographic.pdf) provides a list of models supported by scikit-learn. Here's an implementation using `SVR`, the support vector regressor from `sklearn.svm`:

In [4]:
import sklearn.svm as svm

def train_svr_model(training_set):
    # Step 1: Encode Features
    features, feature_scaler = encode_features(training_set)
    labels, label_scaler = encode_label(training_set)
    
    # Step 2: Train the model
    predictor = svm.SVR()
    
    predictor.fit(features, labels)
    
    def model(input_data):
        input_features, _ = encode_features(input_data, scaler=feature_scaler)
        
        # Step 3: Make predictions
        output_value = predictor.predict(input_features)
        return decode_label(output_value, label_scaler)
    
    return model

In [5]:
svr_model = train_svr_model(training_set)
evaluate_model(svr_model)

[ 248333.92019058  428464.69071442  137086.55887891  151412.34630052
  205438.1170974   400778.37810921  272352.5249677   325293.97883576
  117606.65224325  147702.95253787  159638.87908792   89630.49263401
  289412.02584133  246362.2031974   505755.56416158  159834.06464234
  182642.72174285  125850.35399951  229088.0945472   243511.90984259
  225106.45019929  239804.65161714  187668.78693134  224879.07424051
  172593.30281342  202793.31675208  138980.79658888  204277.20186566
  223425.56654219  194504.83916983  179385.12929799  163553.90225233
  111728.14330288  237253.63798054  160140.72567017  222043.3784478
  132684.92683255  279719.86859822  152449.38316296  229190.1814281
  153618.7294045   225106.45019929  243987.37882687  212125.59642818
  114294.45952729  167355.27724411  125190.10321328  114716.22797381
  175475.12360066  189785.54356206   51457.82949049  147101.59584179
  138105.1624904   146400.07398763  244089.38066549  145416.52530116
  207748.40406476  139088.35679479  



38669.738630703934
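The two training functions above differ only in the line that constructs the predictor, so the shared steps can be factored into one function that takes the model constructor as an argument. The sketch below is an assumption-laden illustration rather than part of the notebook: `train_model` is a hypothetical name, it works on plain arrays with `StandardScaler` instead of the `utils` helpers, and `Ridge` and `KNeighborsRegressor` are arbitrary example models.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

def train_model(make_predictor, X, y):
    """Train any estimator exposing fit/predict and return a prediction function."""
    # Step 1: scale features and labels, keeping the scalers for later.
    feature_scaler = StandardScaler().fit(X)
    label_scaler = StandardScaler().fit(y.reshape(-1, 1))

    # Step 2: train whichever model the caller asked for.
    predictor = make_predictor()
    predictor.fit(feature_scaler.transform(X),
                  label_scaler.transform(y.reshape(-1, 1)).ravel())

    def model(X_new):
        # Step 3: predict in scaled space, then map back to the original units.
        scaled = predictor.predict(feature_scaler.transform(X_new))
        return label_scaler.inverse_transform(scaled.reshape(-1, 1)).ravel()

    return model

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))
y = 4 * X[:, 0] + X[:, 1]

factories = {"ridge": Ridge, "knn": KNeighborsRegressor}
models = {name: train_model(factory, X, y) for name, factory in factories.items()}
outputs = {name: m(X[:5]) for name, m in models.items()}
print(outputs)
```

Passing the class itself (for example `Ridge`) works because calling it with no arguments builds an estimator with default settings; a `lambda` can be passed instead when non-default hyperparameters are needed.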

Exercise: Implement a model using [SGDRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html)