# Support Vector Machine Modeling

## 1. Identify the model approach(es), describe, and justify the selection

Support vector machines (SVMs) are a supervised learning method used for classification and regression. Here, an SVM will be used to classify patients as either readmission (readmitted within 30 days of being discharged) or non_readmission. An SVM tries to find a hyperplane (decision boundary) that separates the classes of observations in feature space. Unlike probabilistic models, the SVM does not estimate class probabilities; instead, it directly computes a separating hyperplane (as in the notes). Moreover, SVMs are *large margin* classifiers: among the separating hyperplanes, we look for the one that is as far as possible from the nearest points of either class. In other words, we minimize the norm of the parameter vector $\theta$ subject to the projection of each point x onto $\theta$ being large enough to place it on the correct side of the margin.
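
For reference, the large-margin idea can be written as the standard hard-margin optimization problem. The notation is an assumption, since the text only names $\theta$: labels are coded as $y_i \in \{-1, +1\}$ and $b$ is an intercept term.

$$
\min_{\theta,\, b} \; \frac{1}{2}\lVert \theta \rVert^2
\quad \text{subject to} \quad
y_i \left( \theta^\top x_i + b \right) \ge 1, \qquad i = 1, \dots, n.
$$

Under this scaling the margin has width $2 / \lVert \theta \rVert$, which is why minimizing the norm of $\theta$ is the same as maximizing the margin.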

Support vector machines are useful in this problem because they work well in high-dimensional spaces, and we have many variables with which to predict readmission. They are also memory efficient: the decision function uses only a subset of the training points, the support vectors, which are the points closest to the decision boundary (points that lie farther from the boundary are easy to classify and do not affect it). SVMs have several tuning parameters, including the kernel and the regularization term, which control the trade-off between overfitting and bias. I will test several kernels and then investigate the regularization term. I expect that I will need something more nuanced than a basic Gaussian or linear kernel, and I predict that the most accurate models will come from polynomial kernels.
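
The point about support vectors can be checked directly on a fitted model. The sketch below uses synthetic, well-separated two-class data (an assumption for illustration only, not the patient data) and reads scikit-learn's `support_` and `n_support_` attributes to show how few training points define the decision function.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, well-separated two-class data (illustrative assumption, not the patient data).
X_demo, y_demo = make_blobs(n_samples=200, centers=[[-3, -3], [3, 3]],
                            cluster_std=1.0, random_state=1)

clf = SVC(kernel="linear", C=1.0).fit(X_demo, y_demo)

# support_ holds the indices of the training points used by the decision function.
n_sv = clf.support_.shape[0]
print(f"{n_sv} of {X_demo.shape[0]} training points are support vectors")
print("support vectors per class:", clf.n_support_)
```

On overlapping classes, such as the readmission data is likely to have, a much larger share of the points will typically end up as support vectors.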


In [11]:
%matplotlib inline
import numpy as np
import scipy as sp
import pandas as pd
import pymc3 as pm
import seaborn as sns
import theano
import theano.tensor as tt
import matplotlib.pyplot as plt
import matplotlib.cm as cmap
from sklearn import svm
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, make_scorer

sns.set_context('notebook')

# Seed NumPy's random number generator (after the import) so results are reproducible
np.random.seed(1)

## 2. Code, parameterize, and run model (including visualization)


We'll start with the simplest SVM model: a linear kernel with scikit-learn's default regularization (C=1.0).

In [12]:
data_path = "/Users/sarahmaddox/Desktop/Data/"

In [21]:
# Feature matrix, indexed by patient
X = pd.read_csv(data_path+"x_data.csv", index_col = "patient_id")
del X['Unnamed: 0']  # drop the leftover unnamed index column

# Target: the readmission indicator column
y_data = pd.read_csv(data_path+"y_data.csv")
y = y_data.pop("readmitted_true")


In [17]:
svc = svm.SVC(kernel='linear')
svc.fit(X, y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [19]:
svc.score(X,y)

0.71825000000000006
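
A training accuracy like this is easier to interpret next to a majority-class baseline. The sketch below is illustrative only: it uses synthetic data with an assumed 70/30 class split rather than the patient data, and compares a linear SVC with scikit-learn's `DummyClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC

# Synthetic stand-in data with roughly a 70/30 class split (assumed, not the patient data).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.7, 0.3], random_state=1)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
svc_demo = SVC(kernel="linear").fit(X_demo, y_demo)

baseline_acc = baseline.score(X_demo, y_demo)
svc_acc = svc_demo.score(X_demo, y_demo)
print("baseline accuracy:  ", baseline_acc)
print("linear SVC accuracy:", svc_acc)
```

If the two numbers are close, the classifier is adding little beyond the class frequencies, however high the raw accuracy looks.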

In [6]:
from matplotlib.colors import ListedColormap
from sklearn.base import clone

# Color maps taken from a 3-class example (iris); with binary labels
# only the first and last colors are used
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

def plot_estimator(estimator, X, y, ax=None):
    """Fit a copy of `estimator` on two features and plot its decision regions."""
    # Accept pandas objects as well as NumPy arrays
    try:
        X, y = X.values, y.values
    except AttributeError:
        pass

    if ax is None:
        _, ax = plt.subplots()

    # Fit a clone so the estimator passed in (e.g. svc fit on all features) is not overwritten
    estimator = clone(estimator).fit(X, y)
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    ax.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    ax.axis('tight')
    ax.axis('off')
    plt.tight_layout()

In [None]:
%matplotlib inline
plot_estimator(svc, X[['bmi_mean','bmi_std']], y)

## 3. Cross-validation


In [22]:
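from sklearn import model_selection

# Hold out 40% of the data. Note: svc above was fit on all of X,
# so this test split has already been seen by svc.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X.values, y.values, test_size=0.4, random_state=0)

# 5-fold cross-validation on the full data set: cross_val_score refits a
# copy of svc on each training fold and scores it on the held-out fold.
# The reported spread is two standard deviations of the fold scores.
scores = model_selection.cross_val_score(svc, X.values, y.values, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))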

Accuracy: 0.72 (+/- 0.01)
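
One possible refinement, sketched here on synthetic data (an assumption, not the patient data), is to wrap feature scaling and the SVC in a single pipeline so that the scaler is re-fit on each training fold only. The pipeline, the explicit `StratifiedKFold` splitter and its settings are illustrative choices, not part of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with an assumed 70/30 class split.
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.7, 0.3], random_state=1)

# Scaling lives inside the pipeline, so it is fit on each training fold only.
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores_demo = cross_val_score(model, X_demo, y_demo, cv=cv)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_demo.mean(), scores_demo.std() * 2))
```

For classifiers, `cross_val_score` already stratifies when `cv` is an integer; the explicit splitter mainly adds shuffling with a fixed seed.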


In [23]:
pred_train = svc.predict(X_train)
pred_test = svc.predict(X_test)

In [24]:
pd.crosstab(y_train, pred_train, 
            rownames=["Actual"], colnames=["Predicted"])

| Actual \ Predicted | 0.0 | 1.0 |
|---|---|---|
| 0.0 | 3245 | 116 |
| 1.0 | 1224 | 215 |


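This model does a poor job of identifying readmissions: of the 1,439 patients in the training split who were actually readmitted, only 215 (about 15%) are predicted as 1, while 1,224 are predicted as 0. The model is biased towards predicting that the patient is not readmitted within 30 days, which is the majority class (about 70% of this split), so its 72% accuracy is only slightly better than always predicting 0.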
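
The figures behind this reading can be worked out from the cross-tabulation. The short calculation below copies the four counts from the table; the variable names are illustrative.

```python
# Counts copied from the cross-tabulation above (rows: actual, columns: predicted).
tn, fp = 3245, 116   # actual 0: predicted 0, predicted 1
fn, tp = 1224, 215   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
recall = tp / (tp + fn)            # share of actual readmissions that were caught
precision = tp / (tp + fp)         # share of predicted readmissions that were correct
majority_rate = (tn + fp) / total  # accuracy of always predicting class 0

print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"precision={precision:.3f} majority baseline={majority_rate:.3f}")
```

Recall on class 1 is the number to watch here, since missing a readmission is the error this model makes most often.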

## 4. Goodness of fit assessments, performance characteristics (including visualization)


## 5. Improvements to model/tuning of parameters; model selection methods, justification of improvements/tests


### Kernel Type

### Regularization

C corresponds to the inverse of the regularization strength, so the choice of C trades bias against variance:

- large C (weak regularization) = low bias, high variance
- small C (strong regularization) = high bias, low variance

In an SVM, a lot of regularization (small C) means that the model will have a "soft margin" that allows some points to fall inside the margin or cross the decision boundary and be misclassified.
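
A search over kernel type and C could be set up along the following lines. This is a sketch under stated assumptions: the data are synthetic, and the grid values and the scaling step are illustrative choices rather than results from this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumed, not the patient data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

# make_pipeline names the SVC step "svc", hence the "svc__" parameter prefix.
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__kernel": ["linear", "rbf", "poly"],
    "svc__C": [0.01, 0.1, 1, 10, 100],  # small C = strong regularization
}

# 5-fold cross-validated accuracy for every kernel/C combination.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best CV accuracy: %0.2f" % search.best_score_)
```

Given the class imbalance seen earlier, accuracy may not be the most informative criterion; `GridSearchCV` accepts a `scoring` argument if recall or another metric is preferred.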


## 6. Comparison of models; identification of best model


## 7. Results


## 8. Implications of model and conclusions