# The SciKit API


We now have a sense of the workflow necessary to use machine learning models.

The steps we take are:

1. _feature engineering_: Encode all the data we want to use into numbers. Perform any other transformations you think are necessary (e.g. scaling). We end up with a 2D array `X_train` containing all the columns we want to process, as well as an array `y_train` containing the column we want to predict (possibly scaled)

2. _train our model_:  This can be done with the command `model.fit(X_train,y_train)`

3. _predict_: Use our freshly trained model to make some predictions.  This is done with the command `model.predict(input)``

4. _evaluate_: We can use the `evaluate_model` function.



In [4]:
import xgboost
from utils import evaluate_model
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import sklearn


# Step 1: Feature Engineering
def encode_data(data,scaler = None):
    data = data.copy()
    data = data[['FullBath','BedroomAbvGr','LotArea']]
    data['BedBath'] = data['FullBath'] * data['BedroomAbvGr']
        
    if(not scaler):
        scaler = MinMaxScaler()
        scaler.fit(data)
    pd.DataFrame(scaler.transform(data), columns = ['FullBath','BedroomAbvGr','LotArea','BedBath'])
    return(data,scaler)



def train_xgboost_model(training_set):
    # Step 2: Train the model
    training_features, scaler = encode_data(training_set)
    predictor = xgboost.XGBRegressor()
    predictor.fit(training_features, training_set['SalePrice'])
    
    # Step 3: Make predictions
    def xgboost_model(input_data):
        input_features,_ = encode_data(input_data,scaler)
        predictions = predictor.predict(input_features)
        return(predictions)
    return(xgboost_model)

In [5]:
training_set = pd.read_csv("housing_price_data/training_data.csv")
xgboost_model = train_xgboost_model(training_set)
evaluate_model(xgboost_model)

The model is inaccurate by $39753.53 on average.


39753.53293557363

This disciplined workflow makes it easy to try other models. This [cheat sheet](https://www.analyticsvidhya.com/infographics/Scikit-Learn-Infographic.pdf) provides a list of models supported by scikit learn. Here's an implementation of SVRRegressor

In [6]:
import sklearn.svm
def train_svr_model(training_set):
    # Step 2: Train the model
    training_features, scaler = encode_data(training_set)
    predictor = sklearn.svm.SVR()
    predictor.fit(training_features, training_set['SalePrice'])
    
    # Step 3: Make predictions
    def svr_model(input_data):
        input_features,_ = encode_data(input_data,scaler)
        predictions = predictor.predict(input_features)
        return(predictions)
    return(svr_model)

In [7]:
svr_model= train_svr_model(training_set)
evaluate_model(svr_model)

The model is inaccurate by $55334.82 on average.


55334.82226589019

Exercise: Implement a model using [SGDRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html)