# Support Vector Machines

### Support Vector Machine (SVM): The "Maximal Margin" Classifier

The Analogy: Drawing the Widest Possible Line

Imagine you have red and blue marbles on a table, and they are mostly separated into two groups. Your job is to draw a straight line to divide them. There are many lines you could draw, but you want the one that creates the widest possible "no-man's-land" between the reds and the blues.

An SVM finds this "widest street." It focuses on the marbles that are hardest to separate (the ones closest to the other color) and draws the boundary as far away from them as possible.

In simple terms: It's great at finding the clearest and most robust boundary to separate different groups of things.

### Financial Applications:

Bankruptcy Prediction: Classifying companies as likely to fail/survive

Stock Market Classification: Bull/bear market prediction

Credit Rating Classification: Assigning credit ratings

Market Regime Detection: Identifying different market conditions

### Companies Using It:

Bank of America: Fraud detection systems

Barclays: Credit card fraud prevention

HSBC: Money laundering detection

Deutsche Bank: Risk classification

In [2]:
import os

import pandas as pd
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc
from sklearn.model_selection import train_test_split

In [34]:
df_bankdata = pd.read_csv("Datasets/bank.csv")

In [35]:
df_bankdata.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


In [36]:
df_bankdata.shape

(4521, 17)

In [37]:
df_bankdata.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

In [38]:
df_bankdata.isnull().sum()

age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

#### Perform standardization on numeric variables

In [18]:
print("Total number of class labels: {}".format(df_bankdata.shape[0]))
print("Number of people opted for Term Deposit: {}".format(df_bankdata[df_bankdata.y == 'yes'].shape[0]))
print("Number of people not opted for Term Deposit: {}".format(df_bankdata[df_bankdata.y == 'no'].shape[0]))

Total number of class labels: 4521
Number of people opted for Term Deposit: 521
Number of people not opted for Term Deposit: 4000


In [39]:
# We convert our target class to 1 & 0
df_bankdata['y'] = (df_bankdata['y']=='yes').astype(int)

In [40]:
# Using select_dtypes() to select only the non-numeric type variable
column_type = ['object']
df_bank_data_category_cols = df_bankdata.select_dtypes(column_type)

# This will give you the names of the non-numeric variables
category_column_names = df_bank_data_category_cols.columns.values.tolist()
category_column_names

['job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month',
 'poutcome']

In [41]:
from sklearn.preprocessing import OneHotEncoder

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' to avoid multicollinearity

for each_col in category_column_names:
    # Reshape the column to 2D array as required by sklearn
    column_data = df_bank_data_category_cols[each_col].values.reshape(-1, 1)
    
    # Fit and transform the data
    encoded_array = encoder.fit_transform(column_data)
    
    # Get feature names for the encoded columns
    feature_names = encoder.get_feature_names_out([each_col])
    
    # Convert to DataFrame
    dummy_var = pd.DataFrame(encoded_array, columns=feature_names, index=df_bankdata.index)
    
    # Join with main dataframe and drop original column
    df_joindata = df_bankdata.join(dummy_var)
    df_joindata.drop([each_col], axis=1, inplace=True)
    df_bankdata = df_joindata
   

In [42]:
df_bankdata

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,y,job_blue-collar,job_entrepreneur,...,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_other,poutcome_success,poutcome_unknown
0,30,1787,19,79,1,-1,0,0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,33,4789,11,220,1,339,4,0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,35,1350,16,185,1,330,1,0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,30,1476,3,199,4,-1,0,0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,59,0,5,226,1,-1,0,0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4516,33,-333,30,329,5,-1,0,0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4517,57,-3313,9,153,1,-1,0,0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4518,57,295,19,151,11,-1,0,0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4519,28,1137,6,129,4,211,3,0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [43]:
print(df_bankdata.head()) 

   age  balance  day  duration  campaign  pdays  previous  y  job_blue-collar  \
0   30     1787   19        79         1     -1         0  0              0.0   
1   33     4789   11       220         1    339         4  0              0.0   
2   35     1350   16       185         1    330         1  0              0.0   
3   30     1476    3       199         4     -1         0  0              0.0   
4   59        0    5       226         1     -1         0  0              1.0   

   job_entrepreneur  ...  month_jul  month_jun  month_mar  month_may  \
0               0.0  ...        0.0        0.0        0.0        0.0   
1               0.0  ...        0.0        0.0        0.0        1.0   
2               0.0  ...        0.0        0.0        0.0        0.0   
3               0.0  ...        0.0        1.0        0.0        0.0   
4               0.0  ...        0.0        0.0        0.0        1.0   

   month_nov  month_oct  month_sep  poutcome_other  poutcome_success  \
0       

In [44]:
# Separate features & response variable
X=df_bankdata.iloc[:, :-1]
Y=df_bankdata['y']

In [45]:
X.shape

(4521, 42)

In [46]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

In [47]:
Y_train.shape

(3616,)

In [48]:
Y_train.dtypes

dtype('int64')

#### Using default rbf kernel

In [49]:
svc_model = SVC(kernel='rbf') 
svc_model.fit(X_train, Y_train)

train_predictedvalues=svc_model.predict(X_train)
test_predictedvalues=svc_model.predict(X_test)

In [50]:
print('Train Accuracy Score:')
print(accuracy_score(Y_train,train_predictedvalues))

print('Test Accuracy Score:')
print(accuracy_score(Y_test,test_predictedvalues))

Train Accuracy Score:
0.8877212389380531
Test Accuracy Score:
0.8729281767955801


#### With Polynomial kernel

In [51]:
from sklearn.svm import SVC
from sklearn import metrics

svc_model = SVC(kernel='poly') 
svc_model.fit(X_train, Y_train)

train_predictedvalues=svc_model.predict(X_train)
test_predictedvalues=svc_model.predict(X_test)

print('Train Accuracy Score:')
print(metrics.accuracy_score(Y_train,train_predictedvalues))

print('Test Accuracy Score:')
print(metrics.accuracy_score(Y_test,test_predictedvalues))

Train Accuracy Score:
0.8877212389380531
Test Accuracy Score:
0.8729281767955801


#### Using linear kernel

In [52]:
from sklearn.svm import SVC
from sklearn import metrics

svc_model = SVC(kernel='linear') 
svc_model.fit(X_train, Y_train)

train_predictedvalues=svc_model.predict(X_train)
test_predictedvalues=svc_model.predict(X_test)

print('Train Accuracy Score:')
print(metrics.accuracy_score(Y_train,train_predictedvalues))

print('Test Accuracy Score:')
print(metrics.accuracy_score(Y_test,test_predictedvalues))

Train Accuracy Score:
0.9958517699115044
Test Accuracy Score:
0.9955801104972376
