## Support Vector Machine (SVM) Example

Data Set: sample_data/voting_records.csv

Category: Supervised<br/>
Type: Classification<br/>

Notes: The data shows how different congressmen voted on a number of bills. We are trying to create a model that predicts whether a congressman is a Republican or a Democrat based on their voting patterns.<br/>

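The data is not clean: missing votes are recorded as '?', and the votes themselves are stored as the strings 'y' and 'n'. We first clean it, then fit the model. A scikit-learn Pipeline chains the imputation step and the classifier so that both run in order.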

In [1]:
import pandas as pd
import numpy as np

In [2]:
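# Note: read_csv is called without header=None, so pandas takes the first record as the column names.
df = pd.read_csv('data/voting_records.csv')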
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 434 entries, 0 to 433
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   republican  434 non-null    object
 1   n           434 non-null    object
 2   y           434 non-null    object
 3   n.1         434 non-null    object
 4   y.1         434 non-null    object
 5   y.2         434 non-null    object
 6   y.3         434 non-null    object
 7   n.2         434 non-null    object
 8   n.3         434 non-null    object
 9   n.4         434 non-null    object
 10  y.4         434 non-null    object
 11  ?           434 non-null    object
 12  y.5         434 non-null    object
 13  y.6         434 non-null    object
 14  y.7         434 non-null    object
 15  n.5         434 non-null    object
 16  y.8         434 non-null    object
dtypes: object(17)
memory usage: 57.8+ KB
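
The column names reported by `df.info()` (`republican`, `n`, `y`, `n.1`, ...) are data values rather than labels, which means one record is lost to the header. A minimal sketch of loading such a file without that loss follows. The two-row sample and the column names `party` and `vote_01` ... `vote_16` are hypothetical placeholders, not names from the data set; the rest of this example keeps the original loading code and its outputs.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as the file (no header row).
sample = io.StringIO(
    "republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y\n"
    "democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n\n"
)

# Illustrative column names; the file itself does not supply any.
columns = ['party'] + [f'vote_{i:02d}' for i in range(1, 17)]

# header=None keeps the first record as data;
# na_values='?' converts the missing-vote marker to NaN while loading.
df_named = pd.read_csv(sample, header=None, names=columns, na_values='?')

print(df_named.shape)
print(df_named['party'].tolist())
```

With a real file, the `io.StringIO` object would be replaced by the file path.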


In [3]:
# Replace '?' with NaN and, because scikit-learn estimators expect numeric
# features, convert 'y'/'n' to 1/0.
df[df == '?'] = np.nan
df[df == 'n'] = 0
df[df == 'y'] = 1

X = df.drop(['republican'], axis=1)
y = df['republican']

X.head()

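|   | n | y | n.1 | y.1 | y.2 | y.3 | n.2 | n.3 | n.4 | y.4 | ? | y.5 | y.6 | y.7 | n.5 | y.8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1 | 0 | 1.0 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 1.0 | 1 | 1 | 0 | NaN |
| 1 | NaN | 1 | 1 | NaN | 1.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0.0 | 1 | 1 | 0 | 0.0 |
| 2 | 0.0 | 1 | 1 | 0.0 | NaN | 1 | 0 | 0 | 0 | 0 | 1 | 0.0 | 1 | 0 | 0 | 1.0 |
| 3 | 1.0 | 1 | 1 | 0.0 | 1.0 | 1 | 0 | 0 | 0 | 0 | 1 | NaN | 1 | 1 | 1 | 1.0 |
| 4 | 0.0 | 1 | 1 | 0.0 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.0 | 1 | 1 | 1 | 1.0 |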
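
The boolean-mask assignments in the cell above leave every column with `object` dtype. One possible alternative, sketched here on a hypothetical three-row frame (the names `raw`, `vote_a`, `vote_b`, and `party` are made up for illustration), is to map the vote strings per column. `Series.map` with a dict turns any value missing from the dict into NaN, so '?' is handled without a separate step and the result is a float column.

```python
import pandas as pd

# Hypothetical miniature frame in the same shape as the raw data.
raw = pd.DataFrame({
    'party': ['republican', 'democrat', 'democrat'],
    'vote_a': ['n', '?', 'y'],
    'vote_b': ['y', 'y', 'n'],
})

vote_map = {'y': 1.0, 'n': 0.0}

# Values absent from vote_map ('?' here) become NaN.
X_small = raw.drop(columns='party').apply(lambda col: col.map(vote_map))
y_small = raw['party']

print(X_small.dtypes)        # both columns are float64
print(X_small.isna().sum())  # count of missing votes per column
```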


In [7]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Set up the pipeline steps
steps = [
    ('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('svm', SVC())
]

pipeline = Pipeline(steps)

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit + predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    democrat       0.99      0.94      0.96        83
  republican       0.90      0.98      0.94        48

    accuracy                           0.95       131
   macro avg       0.95      0.96      0.95       131
weighted avg       0.96      0.95      0.95       131
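
To see what the `imputation` step does on its own, the sketch below runs `SimpleImputer` with the same settings on a small, made-up array (`votes` is hypothetical, not taken from the data set). With `strategy='most_frequent'`, each NaN is replaced by the most common value in its column, as learned during `fit`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical votes: rows are members, columns are bills, NaN is a missing vote.
votes = np.array([
    [1.0, 0.0],
    [np.nan, 0.0],
    [1.0, np.nan],
    [0.0, 1.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
filled = imputer.fit_transform(votes)

print(imputer.statistics_)  # per-column fill values learned by fit
print(filled)
```

Inside the pipeline, these fill values are learned from `X_train` only and then reused when transforming `X_test`, so no information from the test set enters the imputation.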

