# 3b. Support Vector Regressor

We can now load either the PCA data or the UMAP data and use a supervised learning algorithm to predict the outcome variable from the scaled, reduced data. I have created two separate files for this step, one per algorithm (Random Forest Regressor and Support Vector Regressor). They are designed to work independently of one another, so running only one of the two is sufficient, but it is helpful to run both and compare the results.

In [21]:
import numpy as np
import pandas as pd
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

In [22]:
class SVregressor:
    
    '''
        INIT FUNCTION:
        
        -- This __init__ function is slightly different from the ones used in KNN Imputation, PCA, and UMAP.
        
        -- In addition to specifying "IDs", specify an outcome variable to predict. "IDs" should still include this variable.
        
        -- self.X is an array containing all the data except the IDs; self.y is an array of the values of the outcome variable.
    '''
    
    def __init__(self, datafile, outcome, IDs = []):
        self.df = pd.read_csv(datafile)
        # Pass the column names by keyword; the positional axis argument was removed in pandas 2.0
        self.X = np.array(self.df.drop(columns=IDs))
        self.y = np.array(self.df[outcome])
        self.Xdf = pd.DataFrame(self.X)
        self.ydf = pd.DataFrame(self.y)
        
    '''
        REGRESS METHOD:
        
        -- Specify a kernel, a scoring method, and a number of folds for cross-validation.
        
        -- self.predictions stores the cross-validated predicted values; self.scores stores the specified metric of model performance for each fold.
    '''    
    
    def regress(self, kernel, scoring, cv=5):
        self.svr_model = SVR(kernel=kernel)
        self.predictions = cross_val_predict(self.svr_model, self.X, self.y, cv=cv)
        self.scores = cross_val_score(self.svr_model, self.X, self.y, cv=cv, scoring=scoring)
        print(self.scores)
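
The cross-validation pattern inside `regress` can be tried without the project CSV. The following is a minimal sketch on synthetic data: `make_regression`, the sample and feature counts, and the `demo_` names are assumptions made for illustration and are not part of the original pipeline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the reduced data (assumption: 200 rows, 5 components)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = SVR(kernel="linear")

# One out-of-fold prediction per row: each row is predicted by a model that never saw it
demo_predictions = cross_val_predict(model, X_demo, y_demo, cv=5)

# One score per fold; with this scorer the values are negated mean squared errors
demo_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")

print(demo_predictions.shape)
print(demo_scores)
```

With an integer `cv` and a regressor, both functions use unshuffled K-fold splits, so the predictions and the scores come from the same partition of the rows.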

In [23]:
#Linear kernel
#Inputting PCA data for this example
#Outcome variable is presence, IDs are presence and labvisitid
linear = SVregressor("SCALED_PCA_DATA.csv", 'presence', IDs = ['labvisitid', 'presence'])
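
To make the role of `IDs` and `outcome` concrete, here is a small sketch of the split the constructor performs. The miniature table and the column names `PC1` and `PC2` are hypothetical; the real file's component columns may be named differently.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the reduced dataset; PC1 and PC2 are assumed column names
toy = pd.DataFrame({
    "labvisitid": [101, 102, 103],
    "presence": [2.0, 3.5, 1.0],
    "PC1": [0.1, -0.4, 0.3],
    "PC2": [1.2, 0.7, -0.9],
})

ids = ["labvisitid", "presence"]

# Predictors: everything except the ID columns (which include the outcome)
X_toy = np.array(toy.drop(columns=ids))
# Target: the outcome column only
y_toy = np.array(toy["presence"])

print(X_toy.shape)  # (3, 2)
print(y_toy)
```

Listing the outcome in `IDs` keeps it out of the predictors; leaving it in would let the model see the value it is meant to predict.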

In [24]:
#Outputting (negative) mean squared error
#sklearn scorers follow a "higher is better" convention, so error metrics such as mean squared error are negated,...
#...which is why the scorer is named 'neg_mean_squared_error'
linear.regress('linear','neg_mean_squared_error', cv=5)

[-0.57381543 -0.87892184 -0.5007123  -0.67230961 -0.58374557]
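
Because the scorer negates the error, flipping the sign recovers the ordinary mean squared error, and a square root then gives an error in the units of the outcome. A short sketch using the fold scores printed above (the variable names are illustrative only):

```python
import numpy as np

# Fold scores copied from the linear-kernel output above
neg_mse = np.array([-0.57381543, -0.87892184, -0.5007123, -0.67230961, -0.58374557])

mse = -neg_mse        # ordinary (positive) mean squared error per fold
rmse = np.sqrt(mse)   # root mean squared error, in the units of the outcome

print("MSE per fold:", mse)
print("RMSE per fold:", rmse)
```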


In [25]:
#RBF kernel
rbf = SVregressor("SCALED_PCA_DATA.csv", 'presence', IDs = ['labvisitid', 'presence'])

In [26]:
#Outputting (negative) mean squared error
rbf.regress('rbf', 'neg_mean_squared_error', cv=5)

[-0.62976218 -0.69928623 -0.49504079 -0.55943645 -0.56887076]


In [27]:
#Comparing scores of (negative) mean squared error from both kernels
#The RBF kernel performs better on average (mean MSE of about 0.590 vs. 0.642 for the linear kernel)
print("Linear:", linear.scores)
print("RBF:", rbf.scores)

Linear: [-0.57381543 -0.87892184 -0.5007123  -0.67230961 -0.58374557]
RBF: [-0.62976218 -0.69928623 -0.49504079 -0.55943645 -0.56887076]
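
The comparison can also be made numerically rather than by eye. The sketch below copies the fold scores printed above into arrays (the variable names are illustrative) and summarizes them. Both runs used `cv=5` on the same data, which for a regressor means the same unshuffled folds, so a fold-by-fold comparison is meaningful.

```python
import numpy as np

# Fold scores copied from the printed output above
linear_scores = np.array([-0.57381543, -0.87892184, -0.5007123, -0.67230961, -0.58374557])
rbf_scores = np.array([-0.62976218, -0.69928623, -0.49504079, -0.55943645, -0.56887076])

# Flip the sign to report ordinary mean squared error (lower is better)
linear_mse = -linear_scores.mean()
rbf_mse = -rbf_scores.mean()

# Scores are negated errors, so a higher score means a lower error on that fold
rbf_wins = int((rbf_scores > linear_scores).sum())

print(f"Linear mean MSE: {linear_mse:.3f}")  # 0.642
print(f"RBF mean MSE:    {rbf_mse:.3f}")     # 0.590
print(f"Folds where RBF has lower error: {rbf_wins} of {len(rbf_scores)}")  # 4 of 5
```

The linear kernel is ahead only on the first fold, and its error on the second fold is noticeably higher than on any other, which accounts for much of the gap in the means.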
