# Support Vector Machines (SVM) - Regression

Data Source: [Protein](https://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+of+Protein+Tertiary+Structure)

**Attributes**
- RMSD - Size of the residue.
- F1 - Total surface area.
- F2 - Non-polar exposed area.
- F3 - Fractional area of exposed non-polar residue.
- F4 - Fractional area of exposed non-polar part of residue.
- F5 - Molecular mass weighted exposed area.
- F6 - Average deviation from standard exposed area of residue.
- F7 - Euclidean distance.
- F8 - Secondary structure penalty.
- F9 - Spatial distribution constraints (N,K Value).

Dependent variable = RMSD
Independent variables = F1, F2, F3, F4, F5, F6, F7, F8, F9

In [1]:
# Importing the necessary packages
import pandas as pd
import numpy as np
from sklearn.svm import LinearSVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from math import sqrt

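import warnings
# Silence library warnings to keep the output short; note that this also hides
# any ConvergenceWarning raised while fitting LinearSVR
warnings.filterwarnings("ignore")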

In [2]:
# Loading the dataset
protein = pd.read_csv("./protein/CASP.csv")
protein.head()

|   | RMSD | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.284 | 13558.3 | 4305.35 | 0.31754 | 162.173 | 1872791.0 | 215.359 | 4287.87 | 102 | 27.0302 |
| 1 | 6.021 | 6191.96 | 1623.16 | 0.26213 | 53.3894 | 803446.7 | 87.2024 | 3328.91 | 39 | 38.5468 |
| 2 | 9.275 | 7725.98 | 1726.28 | 0.22343 | 67.2887 | 1075648.0 | 81.7913 | 2981.04 | 29 | 38.8119 |
| 3 | 15.851 | 8424.58 | 2368.25 | 0.28111 | 67.8325 | 1210472.0 | 109.439 | 3248.22 | 70 | 39.0651 |
| 4 | 7.962 | 7460.84 | 1736.94 | 0.2328 | 52.4123 | 1021020.0 | 94.5234 | 2814.42 | 41 | 39.9147 |


In [3]:
# Display the characteristics of protein dataset
print("Dimensions of the dataset are: ", protein.shape)
print("The variables present in dataset are: \n", protein.columns)

Dimensions of the dataset are:  (45730, 10)
The variables present in dataset are: 
 Index(['RMSD', 'F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8', 'F9'], dtype='object')


In [4]:
# Check for missing values (isnull() is an alias of isna(), so both calls report the same counts)
print("Null values in the dataset are: \n", protein.isnull().sum())
print("Not available values in the dataset are: \n", protein.isna().sum())

Null values in the dataset are: 
 RMSD    0
F1      0
F2      0
F3      0
F4      0
F5      0
F6      0
F7      0
F8      0
F9      0
dtype: int64
Not available values in the dataset are: 
 RMSD    0
F1      0
F2      0
F3      0
F4      0
F5      0
F6      0
F7      0
F8      0
F9      0
dtype: int64
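
Since `isnull()` and `isna()` are the same check, one call is enough. The sketch below shows a more compact version that also reduces the per-column counts to a single total. It uses a small hypothetical frame (`toy`) instead of the CASP data, so the counts it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the protein data
toy = pd.DataFrame({"RMSD": [1.0, np.nan, 3.0], "F1": [10.0, 20.0, 30.0]})

missing_per_column = toy.isna().sum()          # missing count per column
total_missing = int(missing_per_column.sum())  # one number for the whole frame

print(missing_per_column)
print("Total missing values:", total_missing)
```

A total of zero, as in the protein output above, means no imputation step is needed before modelling.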


In [5]:
# Seed NumPy's global random generator so that the train-test split below is reproducible
np.random.seed(3000)

In [6]:
# Train-Test Split for both independent and dependent features
training, test = train_test_split(protein, test_size = 0.3)

x_trg = training.drop("RMSD", axis = 1)
y_trg = training["RMSD"]

x_test = test.drop("RMSD", axis = 1)
y_test = test["RMSD"]
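
An alternative to seeding NumPy's global generator is to pass `random_state` directly to `train_test_split`, which keeps the split reproducible regardless of any other random calls made earlier. The sketch below assumes a small synthetic frame (`toy`, hypothetical) in place of the CASP file and splits features and target in a single call.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the protein data: 100 rows, a target and two features
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "RMSD": rng.normal(size=100),
    "F1": rng.normal(size=100),
    "F2": rng.normal(size=100),
})

X = toy.drop("RMSD", axis=1)
y = toy["RMSD"]

# random_state fixes the shuffle; 3000 mirrors the seed used above
x_trg, x_test, y_trg, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3000
)
print(x_trg.shape, x_test.shape)
```

Splitting `X` and `y` together avoids the separate `drop` calls on the training and test frames, but the result is equivalent to the approach used in this notebook.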

### Model Building - SVM

In [7]:
svm_protein = LinearSVR(random_state = 0)

# Fitting the model
svm_protein.fit(x_trg, y_trg)

# Prediction on test set
svm_pred = svm_protein.predict(x_test)

# Calculate the RMSE for the model
svm_rmse = sqrt(mean_squared_error(y_test, svm_pred))
print("RMSE value for SVM model is: %0.3f" % svm_rmse)

RMSE value for SVM model is: 10.491
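
In the rows shown earlier, the features span very different ranges (F3 is below 1 while F5 is in the millions), and `LinearSVR` is sensitive to feature scale. A common remedy, not applied in the run above, is to standardize the inputs inside a pipeline. The following is a minimal sketch on synthetic data: the `make_regression` data, the column rescaling and the `max_iter` value are illustrative assumptions, not the CASP set, so its numbers say nothing about the protein results.

```python
from math import sqrt

import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

# Synthetic regression data; columns rescaled to mimic mixed feature ranges
X, y = make_regression(n_samples=500, n_features=9, noise=5.0, random_state=0)
X = X * np.array([1e4, 1e3, 1e-1, 1e2, 1e6, 1e2, 1e3, 1e1, 1e1])

x_trg, x_test, y_trg, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# StandardScaler is fit on the training fold only, then applied to the test fold
scaled_svm = make_pipeline(StandardScaler(), LinearSVR(random_state=0, max_iter=10000))
scaled_svm.fit(x_trg, y_trg)

scaled_rmse = sqrt(mean_squared_error(y_test, scaled_svm.predict(x_test)))
print("RMSE for scaled LinearSVR on synthetic data: %0.3f" % scaled_rmse)
```

Putting the scaler inside the pipeline means the test data never influences the scaling parameters, which keeps the RMSE estimate honest.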


### Compare with k-NN model

In [8]:
# Model Building - kNN: record the test RMSE for K = 1 .. 21
knn_rmselist = []

for K in range(1, 22):
    # Build the kNN model with K neighbors
    knn_protein = KNeighborsRegressor(n_neighbors = K)
    
    # Fit the model
    knn_protein.fit(x_trg, y_trg)
    
    # Predict via model
    knn_pred = knn_protein.predict(x_test)
    
    # Calculate the RMSE for the model
    knn_rmse = sqrt(mean_squared_error(y_test, knn_pred))
    print("RMSE value for kNN model is: %0.3f" %knn_rmse)
    
    knn_rmselist.append(knn_rmse)
    
print("\n") 
print("The least RMSE value using kNN is: %0.3f" %min(knn_rmselist))

RMSE value for kNN model is: 7.038
RMSE value for kNN model is: 6.327
RMSE value for kNN model is: 6.081
RMSE value for kNN model is: 5.975
RMSE value for kNN model is: 5.915
RMSE value for kNN model is: 5.856
RMSE value for kNN model is: 5.822
RMSE value for kNN model is: 5.809
RMSE value for kNN model is: 5.802
RMSE value for kNN model is: 5.785
RMSE value for kNN model is: 5.779
RMSE value for kNN model is: 5.780
RMSE value for kNN model is: 5.782
RMSE value for kNN model is: 5.782
RMSE value for kNN model is: 5.781
RMSE value for kNN model is: 5.793
RMSE value for kNN model is: 5.794
RMSE value for kNN model is: 5.798
RMSE value for kNN model is: 5.800
RMSE value for kNN model is: 5.802
RMSE value for kNN model is: 5.803


The least RMSE value using kNN is: 5.779
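
The loop reports the lowest RMSE but not which K produced it. A short sketch of how that K could be recovered, using the RMSE values copied from the output above (list position 0 corresponds to K = 1):

```python
import numpy as np

# RMSE values copied from the output above, for K = 1 .. 21
knn_rmselist = [7.038, 6.327, 6.081, 5.975, 5.915, 5.856, 5.822, 5.809, 5.802,
                5.785, 5.779, 5.780, 5.782, 5.782, 5.781, 5.793, 5.794, 5.798,
                5.800, 5.802, 5.803]

# List index 0 corresponds to K = 1, hence the +1
best_k = int(np.argmin(knn_rmselist)) + 1
print("Best K:", best_k, "with RMSE %0.3f" % knn_rmselist[best_k - 1])
# → Best K: 11 with RMSE 5.779
```

Because K is chosen here by looking at the test set, the reported minimum is slightly optimistic; a separate validation split or cross-validation would give a cleaner estimate.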


### Compare with Linear Regression model

In [9]:
# Model Building - Linear Regression
lr_protein = linear_model.LinearRegression()

# Fit the model
lr_protein.fit(x_trg, y_trg)
# score() returns R^2 (the coefficient of determination), not a classification accuracy
print("R^2 score of the LR model on training dataset is: ", lr_protein.score(x_trg, y_trg))
print("R^2 score of the LR model on test dataset is: ", lr_protein.score(x_test, y_test))

# Prediction via model
lr_pred = lr_protein.predict(x_test)
print("Coefficient of independent variables in the model is: \n", lr_protein.coef_)
print("Intercept in the model is: ", lr_protein.intercept_)

# Calculate the RMSE for the model
lr_rmse = sqrt(mean_squared_error(y_test, lr_pred))
print("RMSE value of LR model is: %0.3f" %lr_rmse)

R^2 score of the LR model on training dataset is:  0.2779751481134759
R^2 score of the LR model on test dataset is:  0.2921172345489872
Coefficient of independent variables in the model is: 
 [ 1.50138602e-03  1.36176886e-03  1.80011149e+01 -1.09646611e-01
 -3.45693711e-06 -2.30039034e-02 -1.26241391e-04  1.53926026e-02
 -1.09728772e-01]
Intercept in the model is:  6.02666508722966
RMSE value of LR model is: 5.155
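
For a regressor, `score()` returns the coefficient of determination R^2, which is tied to the mean squared error by R^2 = 1 - MSE / Var(y). The sketch below checks that relationship on hypothetical synthetic data (the arrays `X` and `y` and the coefficients are made up for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical synthetic data: linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

r2_from_score = model.score(X, y)                          # what .score() reports
r2_from_metric = r2_score(y, pred)                         # same quantity via sklearn.metrics
r2_from_mse = 1 - mean_squared_error(y, pred) / np.var(y)  # 1 - MSE / Var(y)

print(r2_from_score, r2_from_metric, r2_from_mse)
```

Read this way, the test value of about 0.29 reported above means the linear model explains roughly 29% of the variance in RMSD on the test set.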


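Test-set RMSE for each model (lower is better):

- SVM (LinearSVR): 10.491
- kNN (best at K = 11): 5.779
- Linear Regression: 5.155

On this split, Linear Regression gives the lowest RMSE, followed by kNN, with the linear SVM well behind. The SVM result comes from `LinearSVR` with default settings on unscaled features, so it reflects this particular configuration rather than SVMs in general.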
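
The comparison can also be made programmatically. A small sketch, using the RMSE values reported above (the dictionary name `rmse_by_model` is just an illustrative choice):

```python
# Test-set RMSE values reported above
rmse_by_model = {"SVM (LinearSVR)": 10.491, "kNN": 5.779, "Linear Regression": 5.155}

# Sort from lowest (best) to highest RMSE
ranking = sorted(rmse_by_model.items(), key=lambda item: item[1])
for name, rmse in ranking:
    print("%-18s %0.3f" % (name, rmse))

best_model = ranking[0][0]
```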