# **Support Vector Machine**

In [6]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import numpy as np

# Load the dataset
auto = pd.read_csv('/content/drive/MyDrive/Auto.csv')
print(auto.head())

    mpg  cylinders  displacement horsepower  weight  acceleration  year  \
0  18.0          8         307.0        130    3504          12.0    70   
1  15.0          8         350.0        165    3693          11.5    70   
2  18.0          8         318.0        150    3436          11.0    70   
3  16.0          8         304.0        150    3433          12.0    70   
4  17.0          8         302.0        140    3449          10.5    70   

   origin                       name  
0       1  chevrolet chevelle malibu  
1       1          buick skylark 320  
2       1         plymouth satellite  
3       1              amc rebel sst  
4       1                ford torino  


In [3]:
# Part (a): Create the binary target variable
# We create a binary variable that is 1 if mpg is above the median, and 0 otherwise.
median_mpg = auto['mpg'].median()
auto['high_mileage'] = (auto['mpg'] > median_mpg).astype(int)
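A median split with a strict `>` does not always give two equal classes, because rows tied with the median all fall into class 0. The sketch below illustrates this on a small made-up series (the values and the names `mpg_demo`, `high_demo`, and `class_counts` are illustrative assumptions, not the Auto data).

```python
import pandas as pd

# Toy stand-in for the 'mpg' column (made-up values, not the Auto dataset)
mpg_demo = pd.Series([18.0, 15.0, 18.0, 16.0, 17.0, 26.0, 31.0, 24.0])

# Same rule as above: 1 if strictly above the median, 0 otherwise
median_demo = mpg_demo.median()
high_demo = (mpg_demo > median_demo).astype(int)

# Count the rows in each class; values equal to the median land in class 0
class_counts = high_demo.value_counts().sort_index()
print("median:", median_demo)
print(class_counts.to_dict())
```

Running `auto['high_mileage'].value_counts()` on the real data would show whether the two classes are close enough to balanced for the F1-score to be a fair summary.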

In [4]:
# Drop non-predictive columns for the analysis, such as 'mpg' and 'name'
auto_data = auto.drop(columns=['mpg', 'name'])

In [7]:
# Part (b): Split the data into features and target variable
X = auto_data.drop(columns=['high_mileage'])
y = auto_data['high_mileage']

# Handle non-numeric data in 'horsepower' column if needed by converting to numeric
X['horsepower'] = pd.to_numeric(X['horsepower'], errors='coerce')

# Replace any NaNs in 'horsepower' with the median value of the column.
# Assigning the result back avoids the chained in-place fillna, which stops working in pandas 3.0.
X['horsepower'] = X['horsepower'].fillna(X['horsepower'].median())

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Define a range of C values for testing
C_values = [0.01, 0.1, 1, 10, 100]
results = {'C': [], 'F1_score': []}

# Fit a linear SVC for each C value and record cross-validation F1-scores
for C in C_values:
    svc_linear = SVC(kernel='linear', C=C)
    f1_scores = cross_val_score(svc_linear, X_scaled, y, cv=5, scoring='f1')
    results['C'].append(C)
    results['F1_score'].append(f1_scores.mean())

# Display the linear kernel SVC results
results_df = pd.DataFrame(results)
print("SVC Linear Kernel Cross-Validation Results")
print(results_df)

SVC Linear Kernel Cross-Validation Results
        C  F1_score
0    0.01  0.881162
1    0.10  0.827338
2    1.00  0.809832
3   10.00  0.833151
4  100.00  0.838883


The results show that the smallest value tested, C=0.01, gives the highest cross-validated F1-score (0.881) for predicting whether a car gets high or low gas mileage. A small C allows a wider margin and tolerates more misclassified training points, so the model does not try to fit every detail in the data. Larger values of C penalize margin violations more heavily, but the F1-score does not benefit: it falls to 0.827 at C=0.1 and 0.810 at C=1, then recovers only partially to 0.833 at C=10 and 0.839 at C=100. The model gains nothing from fitting the training data more tightly, so a low C works best for this task. Because 0.01 sits at the edge of the grid, even smaller values may be worth testing.
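The scores above come from scaling the full dataset once before cross-validation, so the scaler has already seen each fold's held-out rows. A minimal sketch of one common alternative is to put the scaler and the classifier in a pipeline so that scaling is refit inside every fold. It runs on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `mean_f1` are assumptions for illustration, and the scores it prints say nothing about the Auto data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (a stand-in, not the Auto dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, n_informative=4, random_state=0
)

mean_f1 = {}
for C in [0.01, 0.1, 1, 10, 100]:
    # The scaler is fit only on the training part of each fold
    model = make_pipeline(StandardScaler(), SVC(kernel='linear', C=C))
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='f1')
    mean_f1[C] = scores.mean()
    print(f"C={C}: mean F1 = {mean_f1[C]:.3f}")
```

On a dataset of this size the difference from scaling up front is often small, but the pipeline form keeps the cross-validation estimate free of information from the held-out fold.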

In [8]:
# For part (c): Test radial (RBF) and polynomial kernel SVMs
# Define parameter ranges for C, gamma, and degree
gamma_values = [0.1, 1, 10]
degree_values = [2, 3, 4]

# Initialize dictionary to store results
svm_results = {'Kernel': [], 'C': [], 'Gamma': [], 'Degree': [], 'F1_score': []}

# Test Radial Basis Function (RBF) Kernel
for C in C_values:
    for gamma in gamma_values:
        svc_rbf = SVC(kernel='rbf', C=C, gamma=gamma)
        f1_scores = cross_val_score(svc_rbf, X_scaled, y, cv=5, scoring='f1')
        svm_results['Kernel'].append('RBF')
        svm_results['C'].append(C)
        svm_results['Gamma'].append(gamma)
        svm_results['Degree'].append(None)  # Degree is not used for RBF
        svm_results['F1_score'].append(f1_scores.mean())

# Test Polynomial Kernel
for C in C_values:
    for degree in degree_values:
        svc_poly = SVC(kernel='poly', C=C, degree=degree)
        f1_scores = cross_val_score(svc_poly, X_scaled, y, cv=5, scoring='f1')
        svm_results['Kernel'].append('Polynomial')
        svm_results['C'].append(C)
        svm_results['Gamma'].append(None)  # Gamma is not varied here; SVC uses its default ('scale') for poly
        svm_results['Degree'].append(degree)
        svm_results['F1_score'].append(f1_scores.mean())

# Convert results to DataFrame and display
svm_results_df = pd.DataFrame(svm_results)
print("SVM RBF and Polynomial Kernel Results")
print(svm_results_df)

SVM RBF and Polynomial Kernel Results
        Kernel       C  Gamma  Degree  F1_score
0          RBF    0.01    0.1     NaN  0.697655
1          RBF    0.01    1.0     NaN  0.000000
2          RBF    0.01   10.0     NaN  0.000000
3          RBF    0.10    0.1     NaN  0.891554
4          RBF    0.10    1.0     NaN  0.834600
5          RBF    0.10   10.0     NaN  0.000000
6          RBF    1.00    0.1     NaN  0.824650
7          RBF    1.00    1.0     NaN  0.811188
8          RBF    1.00   10.0     NaN  0.565067
9          RBF   10.00    0.1     NaN  0.778045
10         RBF   10.00    1.0     NaN  0.779145
11         RBF   10.00   10.0     NaN  0.584012
12         RBF  100.00    0.1     NaN  0.768339
13         RBF  100.00    1.0     NaN  0.773366
14         RBF  100.00   10.0     NaN  0.584012
15  Polynomial    0.01    NaN     2.0  0.097586
16  Polynomial    0.01    NaN     3.0  0.781380
17  Polynomial    0.01    NaN     4.0  0.684184
18  Polynomial    0.10    NaN     2.0  0.514436
19

For the RBF kernel, the best performance occurs at C=0.1 and gamma=0.1, with an F1-score of about 0.89, slightly above the best linear result. When gamma gets too high (like 10), the F1-score drops sharply: each training point then influences only a very small neighborhood, so the model fits the training data too closely and struggles with new data. An F1-score of exactly zero, as seen at C=0.01 with gamma=1 or 10 and at C=0.1 with gamma=10, means that no high-mileage car was correctly identified in any fold.
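Two facts behind the RBF results can be checked directly. The sketch below, using made-up labels and points (all names are illustrative assumptions), shows that a classifier predicting only the negative class scores an F1 of zero, and that the RBF similarity between two fixed points shrinks rapidly as gamma grows.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics.pairwise import rbf_kernel

# Made-up labels: a model that predicts class 0 for every row
y_true = np.array([0, 1, 1, 0, 1, 0])
y_all_zero = np.zeros_like(y_true)

# No true positives, so the F1-score for the positive class is 0
f1_collapsed = f1_score(y_true, y_all_zero, zero_division=0)
print("F1 when only class 0 is predicted:", f1_collapsed)

# Two points at squared distance 2, so the kernel value is exp(-2 * gamma)
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])
similarities = {g: rbf_kernel(a, b, gamma=g)[0, 0] for g in [0.1, 1, 10]}
for g, s in similarities.items():
    print(f"gamma={g}: similarity = {s:.2e}")
```

With gamma=10 the two points are almost invisible to each other, which is consistent with a model that can only respond to points very close to its training data.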

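For the polynomial kernel, degree 3 with C around 10 seems to work well, giving F1-scores around 0.82 to 0.83. Higher degrees or very low C values don’t perform as well: at C=0.01, degree 3 reaches 0.781, while degree 4 drops to 0.684 and degree 2 to 0.098. This suggests that added complexity isn’t helpful for this dataset, and the polynomial kernel does not match the best RBF result of about 0.89.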
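The two nested loops for the RBF and polynomial kernels can also be expressed as a single grid search. The following is a sketch under stated assumptions rather than the method used above: it uses `GridSearchCV` with a scaling pipeline, a smaller C grid to keep the run short, and synthetic data from `make_classification` (the names `X_demo`, `y_demo`, and `search` are illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (a stand-in, not the Auto dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, n_informative=4, random_state=0
)

# make_pipeline names the SVC step 'svc', hence the 'svc__' parameter prefix
pipeline = make_pipeline(StandardScaler(), SVC())

# One sub-grid per kernel so that gamma is varied only for RBF
# and degree only for the polynomial kernel
param_grid = [
    {'svc__kernel': ['rbf'], 'svc__C': [0.1, 1, 10], 'svc__gamma': [0.1, 1, 10]},
    {'svc__kernel': ['poly'], 'svc__C': [0.1, 1, 10], 'svc__degree': [2, 3, 4]},
]

search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1')
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best mean F1: {search.best_score_:.3f}")
```

The full table of scores is available in `search.cv_results_`, which can be passed to `pd.DataFrame` to get a layout similar to the results table above.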