<a href="https://colab.research.google.com/github/elenachau/machine-learning/blob/main/social_advertising.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Objectives
* Understand support vector machines (SVMs)
* Apply an SVM to a dataset to extract meaningful insights
* Understand the different SVM kernels

## Dataset

The dataset chosen for this experiment is the Social Network Ads dataset. It contains 400 samples and 5 columns: `User ID`, `Gender`, `Age`, `EstimatedSalary`, and `Purchased`.

In [None]:
!wget https://raw.githubusercontent.com/shivang98/Social-Network-ads-Boost/master/Social_Network_Ads.csv

--2024-01-07 11:50:54--  https://raw.githubusercontent.com/shivang98/Social-Network-ads-Boost/master/Social_Network_Ads.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10926 (11K) [text/plain]
Saving to: ‘Social_Network_Ads.csv’


2024-01-07 11:50:54 (74.1 MB/s) - ‘Social_Network_Ads.csv’ saved [10926/10926]



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv('Social_Network_Ads.csv')
df.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User ID          400 non-null    int64 
 1   Gender           400 non-null    object
 2   Age              400 non-null    int64 
 3   EstimatedSalary  400 non-null    int64 
 4   Purchased        400 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 15.8+ KB


In [None]:
def convert_gender(value) -> int:
  """Encode gender as an integer: 1 for 'Male', 0 otherwise."""
  if value == 'Male':
    return 1
  return 0

df['Gender'] = df['Gender'].apply(convert_gender)
df

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,1,19,19000,0
1,15810944,1,35,20000,0
2,15668575,0,26,43000,0
3,15603246,0,27,57000,0
4,15804002,1,19,76000,0
...,...,...,...,...,...
395,15691863,0,46,41000,1
396,15706071,1,51,23000,1
397,15654296,0,50,20000,1
398,15755018,1,36,33000,0


In [None]:
features = df.drop(['User ID', 'Purchased'], axis=1)
target = df['Purchased']

x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=42)

In [None]:
svc = SVC()  # defaults: kernel='rbf', C=1.0, gamma='scale'
svc.fit(x_train, y_train)

train_pred = svc.predict(x_train)
test_pred = svc.predict(x_test)

print(f"Train accuracy: {accuracy_score(y_train, train_pred)}")
print(f"Test accuracy: {accuracy_score(y_test, test_pred)}")

Train accuracy: 0.7733333333333333
Test accuracy: 0.75
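One plausible reason for the modest accuracy of the default RBF kernel is that `Age` and `EstimatedSalary` sit on very different numeric scales, so kernel distances are dominated by salary. The sketch below illustrates that effect on synthetic data, not on the real CSV: the value ranges and the rule that the label depends only on age are assumptions made for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the ads data (an assumption, not the real CSV):
# the label depends only on age, while salary has a far larger numeric range.
rng = np.random.default_rng(0)
age = rng.uniform(18, 60, size=400)
salary = rng.uniform(15_000, 150_000, size=400)
X = np.column_stack([age, salary])
y = (age > 40).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# RBF SVM on raw features: distances are dominated by the salary column.
unscaled_acc = SVC().fit(X_tr, y_tr).score(X_te, y_te)

# Same model, but features are standardized inside a pipeline first.
scaled_model = make_pipeline(StandardScaler(), SVC())
scaled_acc = scaled_model.fit(X_tr, y_tr).score(X_te, y_te)

print(f"Unscaled test accuracy: {unscaled_acc:.3f}")
print(f"Scaled test accuracy:   {scaled_acc:.3f}")
```

Wrapping the scaler and classifier in one pipeline means the scaling statistics are learned from the training split only. Whether scaling helps on the real dataset would need to be checked by running it there.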


In [None]:
svc = SVC(kernel='linear', C=1)  # gamma is ignored by the linear kernel, so it is omitted
svc.fit(x_train, y_train)

train_pred = svc.predict(x_train)
test_pred = svc.predict(x_test)

print(f"Train accuracy: {accuracy_score(y_train, train_pred)}")
print(f"Test accuracy: {accuracy_score(y_test, test_pred)}")

Train accuracy: 0.8166666666666667
Test accuracy: 0.84
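The two runs above differ only in kernel: the default `SVC()` uses the RBF kernel, while the second run uses a linear one. As a rough illustration of how kernel choice changes the result, the sketch below loops over scikit-learn's four built-in kernels on a synthetic dataset (`make_moons`, chosen here only as an assumption-free stand-in with a curved class boundary). The numbers it prints say nothing about the ads data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data with a curved boundary (not the ads dataset).
X, y = make_moons(n_samples=400, noise=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit one SVC per kernel and record its test accuracy.
kernel_scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    model = SVC(kernel=kernel, C=1, gamma="scale")
    model.fit(X_tr, y_tr)
    kernel_scores[kernel] = model.score(X_te, y_te)

for kernel, acc in kernel_scores.items():
    print(f"{kernel:>8}: test accuracy = {acc:.3f}")
```

The same loop could be pointed at `x_train` and `x_test` from this notebook to compare kernels on the actual data.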


In [None]:
df['Purchased'].value_counts(normalize=True)

0    0.6425
1    0.3575
Name: Purchased, dtype: float64
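With roughly 64% of samples in class 0, accuracy alone can look better than the model really is, which is one motivation for an F1 score averaged with class weights. The sketch below, using made-up toy labels with a similar imbalance, shows what `average='weighted'` computes: the per-class F1 scores averaged by the number of true samples in each class.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels with roughly a 64/36 class split (illustrative values only).
y_true = np.array([0] * 9 + [1] * 5)
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

per_class_f1 = f1_score(y_true, y_pred, average=None)  # one F1 per class
support = np.bincount(y_true)                          # true samples per class

# Weighted F1 = per-class F1 averaged with class support as weights.
manual_weighted = np.average(per_class_f1, weights=support)
sklearn_weighted = f1_score(y_true, y_pred, average="weighted")

print("Per-class F1:", per_class_f1)
print("Manual weighted F1: ", manual_weighted)
print("sklearn weighted F1:", sklearn_weighted)
```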

In [None]:
svc = SVC(kernel='linear', C=1)  # gamma is ignored by the linear kernel, so it is omitted
svc.fit(x_train, y_train)

train_pred = svc.predict(x_train)
test_pred = svc.predict(x_test)

# Use the weighted average because the classes are imbalanced
print(f"Train weighted F1: {f1_score(y_train, train_pred, average='weighted')}")
print(f"Test weighted F1: {f1_score(y_test, test_pred, average='weighted')}")

Train weighted F1: 0.8175821090669908
Test weighted F1: 0.8408149405772497
