# Support Vector Machine
by: Ari Sulistiyo Prabowo
____

**Contents:**
1. What is Support Vector Machine (SVM)?
2. Applied SVM for binary classification
3. What is Multiclass Classification?
4. Applied ML for Multiclass Classification

## What is Support Vector Machine?
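Support vector machines (SVMs) are a set of supervised learning methods used for **classification**, **regression** and **outlier detection**.

A linear SVM separates two classes by finding the straight line (a hyperplane in higher dimensions) that leaves the widest possible margin between them. The training points closest to this boundary are called support vectors. With more than two classes, several such boundaries are combined.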

![picture](https://miro.medium.com/max/1400/1*NfVVZm9pcoy18dfigS3bFw.png)
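
As a minimal sketch of the idea above, the cell below fits a linear SVM on six made-up 2-D points (not taken from any dataset in this notebook) and prints the learned boundary `w . x + b = 0` together with the support vectors. The names `X_toy`, `y_toy` and `linear_clf` are illustrative only.

```python
import numpy as np
from sklearn import svm

# Two small, linearly separable clusters (made-up points for illustration)
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [5.0, 5.0], [5.5, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel fits a straight-line (hyperplane) decision boundary
linear_clf = svm.SVC(kernel="linear", C=1.0)
linear_clf.fit(X_toy, y_toy)

# The boundary is w . x + b = 0; support vectors are the points closest to it
w = linear_clf.coef_[0]
b = linear_clf.intercept_[0]
print("w:", w, "b:", b)
print("Support vectors:\n", linear_clf.support_vectors_)

# Classify two new points, one near each cluster
toy_predictions = linear_clf.predict([[1.0, 2.0], [6.0, 6.0]])
print("Predictions:", toy_predictions)
```

Only the support vectors determine where the boundary sits; moving the other points slightly would leave it unchanged.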

## Applied SVM for binary classification
This example uses human capital data to predict whether an employee should be **promoted (1)** or **not promoted (0)**.

In [None]:
# import library
import pandas as pd
from collections import Counter

# import ML library
from sklearn import svm, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier


# import evaluation metrics
from sklearn.metrics import classification_report, accuracy_score

import warnings
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/densaiko/data_science_learning/main/dataset/Human%20Capital.csv")
display(data.shape)
data.head()

(54808, 13)

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,awards_won,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,0,49.0,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,60.0,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,50.0,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,50.0,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,73.0,0


In [None]:
# target variable (is_promoted)
data['is_promoted'].value_counts(normalize=True)

0    0.91483
1    0.08517
Name: is_promoted, dtype: float64

The target variable is imbalanced: about 91% of employees are not promoted and only about 9% are. Therefore, the classes need to be brought closer to balance before training.

### Data Preprocessing

In [None]:
# Select desired columns
data = data[["department","education","gender","recruitment_channel",
             "no_of_trainings","age","previous_year_rating","length_of_service","awards_won","avg_training_score","is_promoted"]]

In [None]:
# Handle categorical (string) columns with one-hot encoding
data_encoded = pd.get_dummies(data, prefix_sep="_")
data_encoded = data_encoded.dropna()
display(data_encoded.shape)
data_encoded.head()

(48326, 24)

Unnamed: 0,no_of_trainings,age,previous_year_rating,length_of_service,awards_won,avg_training_score,is_promoted,department_Analytics,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,department_Technology,education_Bachelor's,education_Below Secondary,education_Master's & above,gender_f,gender_m,recruitment_channel_other,recruitment_channel_referred,recruitment_channel_sourcing
0,1,35,5.0,8,0,49.0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,1
1,1,30,5.0,4,0,60.0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,1,0,0
2,1,34,3.0,7,0,50.0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,1
3,2,39,1.0,10,0,50.0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,1,0,0
4,1,45,3.0,2,0,73.0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,0,0
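
To make the encoding step concrete, here is a small sketch on a made-up three-row frame (`toy` is hypothetical, not part of the dataset). It shows what `pd.get_dummies` produces, and that with the default `dummy_na=False` a missing category becomes a row of zeros rather than `NaN`, so a later `dropna()` does not remove it. The `dtype=int` argument is only there to print 0/1 instead of booleans on newer pandas versions.

```python
import pandas as pd

# Made-up rows for illustration; one education value is missing
toy = pd.DataFrame({
    "gender": ["f", "m", "m"],
    "education": ["Bachelor's", None, "Master's & above"],
    "age": [35, 30, 34],
})

# Numeric columns are kept; each string column becomes one 0/1 column per category
toy_encoded = pd.get_dummies(toy, prefix_sep="_", dtype=int)
print(toy_encoded)

# The missing education is encoded as zeros in every education_* column,
# so dropna() keeps all three rows
rows_after_dropna = len(toy_encoded.dropna())
print("Rows after dropna:", rows_after_dropna)
```

If rows with a missing category should be dropped, one option is to call `dropna()` before encoding instead of after.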


In [None]:
# Handle imbalanced data with SMOTE oversampling
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=43, sampling_strategy=1)

# Separating dependent and independent variable
X = data_encoded.drop(columns="is_promoted") #independent variable
y = data_encoded["is_promoted"] #dependent variable

In [None]:
# Performing train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

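# Oversample the training set only, so the test set keeps its original distribution
# (fit_resample replaces fit_sample, which was removed from imbalanced-learn)
X_train_smote, y_train_smote = sm.fit_resample(X_train, y_train)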

print("Before over sampling: {}".format(Counter(y_train)))
print("After over sampling: {}".format(Counter(y_train_smote)))

Before over sampling: Counter({0: 35346, 1: 3314})
After over sampling: Counter({0: 35346, 1: 35346})
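
The sketch below illustrates the core idea behind SMOTE: a synthetic minority sample is placed at a random position on the line segment between a minority point and one of its neighbors. This is a simplified illustration, not imbalanced-learn's implementation: it uses only the single nearest neighbor (SMOTE picks among several), and `minority` and `smote_like_sample` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(43)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]])

def smote_like_sample(points, rng):
    """Create one synthetic point between a random point and its nearest neighbor."""
    i = rng.integers(len(points))
    base = points[i]
    # Euclidean distance from the base point to every point
    dists = np.linalg.norm(points - base, axis=1)
    dists[i] = np.inf  # exclude the base point itself
    neighbor = points[np.argmin(dists)]
    # Pick a random position on the segment between base and neighbor
    gap = rng.random()
    return base + gap * (neighbor - base)

synthetic = np.array([smote_like_sample(minority, rng) for _ in range(5)])
print(synthetic)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies.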


### Modelling

In [None]:
# SVM Modelling
clf = svm.SVC()
clf.fit(X_train_smote, y_train_smote)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
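
The default `SVC` uses an RBF kernel, which relies on distances between samples, so features measured on larger scales can dominate the result. One common option, sketched here on synthetic stand-in data rather than the employee dataset, is to standardize the features inside a pipeline. The data and the names `X_demo`, `scaled_clf` and `demo_acc` are assumptions for illustration; whether scaling helps on the real data would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; in the notebook this would be X_train_smote / X_test
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=43)
X_demo[:, 0] *= 1000  # put one feature on a much larger scale than the others

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=43)

# The scaler is fit on the training split only, then reused on the test split
scaled_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
scaled_clf.fit(X_tr, y_tr)

demo_acc = scaled_clf.score(X_te, y_te)
print("Test accuracy with scaling: {:.3f}".format(demo_acc))
```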

In [None]:
# Evaluation
y_predict_train = clf.predict(X_train_smote)
y_predict_test = clf.predict(X_test)

training_acc = accuracy_score(y_train_smote, y_predict_train)
testing_acc = accuracy_score(y_test, y_predict_test)

print("Training Accuracy: {}".format(training_acc))
print("Testing Accuracy: {}".format(testing_acc))

print(classification_report(y_test, y_predict_test))

Training Accuracy: 0.7006167600294234
Testing Accuracy: 0.7122905027932961
              precision    recall  f1-score   support

           0       0.95      0.72      0.82      8812
           1       0.18      0.65      0.28       854

    accuracy                           0.71      9666
   macro avg       0.57      0.68      0.55      9666
weighted avg       0.89      0.71      0.77      9666
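
To read a report like the one above, it helps to see how accuracy, precision and recall diverge on imbalanced labels. The sketch below uses ten made-up labels (not the model's predictions) and also scores a baseline that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up labels: 8 negatives and 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Here TP=1, FP=3, FN=1, TN=5
acc = accuracy_score(y_true, y_pred)    # (TP + TN) / all
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
print("accuracy={:.2f} precision={:.2f} recall={:.2f} f1={:.2f}".format(acc, prec, rec, f1))

# A baseline that never predicts the positive class
y_all_zero = [0] * len(y_true)
baseline_acc = accuracy_score(y_true, y_all_zero)
baseline_rec = recall_score(y_true, y_all_zero)
print("baseline accuracy={:.2f} recall={:.2f}".format(baseline_acc, baseline_rec))
```

The baseline scores higher on accuracy while finding no positives at all, which is why the per-class precision and recall matter more than overall accuracy here.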



## What is Multiclass Classification?
Multiclass Classification is a classification task with more than two classes. Each sample can only be labeled as one class.
![picture](https://www.baeldung.com/wp-content/uploads/sites/4/2020/10/multiclass-svm3-e1601952776445.png)
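
As a short sketch of multiclass classification with an SVM, the cell below trains `SVC` on four synthetic clusters generated with `make_blobs` (stand-in data, not the FIFA dataset). With `decision_function_shape="ovr"`, the decision function returns one score column per class; the names `X_blobs`, `multi_clf` and `scores` are illustrative.

```python
from sklearn import svm
from sklearn.datasets import make_blobs

# Four synthetic clusters, one per class (labels 0 to 3)
X_blobs, y_blobs = make_blobs(n_samples=300, centers=4, random_state=43)

multi_clf = svm.SVC(decision_function_shape="ovr")
multi_clf.fit(X_blobs, y_blobs)

# One score per class for each sample
scores = multi_clf.decision_function(X_blobs[:5])
print("Score matrix shape:", scores.shape)
print("Classes:", multi_clf.classes_)
print("Predicted labels:", multi_clf.predict(X_blobs[:5]))
```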

## Applied Machine Learning for Multiclass Classification

In [None]:
data_2 = pd.read_csv("https://raw.githubusercontent.com/densaiko/data_science_learning/main/dataset/fifa_dataset.csv")
display(data_2.shape)
data_2.head()

(18207, 89)

Unnamed: 0.1,Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,Club Logo,Value,Wage,Special,Preferred Foot,International Reputation,Weak Foot,Skill Moves,Work Rate,Body Type,Real Face,Position,Jersey Number,Joined,Loaned From,Contract Valid Until,Height,Weight,LS,ST,RS,LW,LF,CF,RF,RW,LAM,CAM,RAM,LM,...,LB,LCB,CB,RCB,RB,Crossing,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,FKAccuracy,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Reactions,Balance,ShotPower,Jumping,Stamina,Strength,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,https://cdn.sofifa.org/teams/2/light/241.png,€110.5M,€565K,2202,Left,5.0,4.0,4.0,Medium/ Medium,Messi,Yes,RF,10.0,"Jul 1, 2004",,2021,5'7,159lbs,88+2,88+2,88+2,92+2,93+2,93+2,93+2,92+2,93+2,93+2,93+2,91+2,...,59+2,47+2,47+2,47+2,59+2,84.0,95.0,70.0,90.0,86.0,97.0,93.0,94.0,87.0,96.0,91.0,86.0,91.0,95.0,95.0,85.0,68.0,72.0,59.0,94.0,48.0,22.0,94.0,94.0,75.0,96.0,33.0,28.0,26.0,6.0,11.0,15.0,14.0,8.0,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,https://cdn.sofifa.org/teams/2/light/45.png,€77M,€405K,2228,Right,5.0,4.0,5.0,High/ Low,C. Ronaldo,Yes,ST,7.0,"Jul 10, 2018",,2022,6'2,183lbs,91+3,91+3,91+3,89+3,90+3,90+3,90+3,89+3,88+3,88+3,88+3,88+3,...,61+3,53+3,53+3,53+3,61+3,84.0,94.0,89.0,81.0,87.0,88.0,81.0,76.0,77.0,94.0,89.0,91.0,87.0,96.0,70.0,95.0,95.0,88.0,79.0,93.0,63.0,29.0,95.0,82.0,85.0,95.0,28.0,31.0,23.0,7.0,11.0,15.0,14.0,11.0,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,https://cdn.sofifa.org/teams/2/light/73.png,€118.5M,€290K,2143,Right,5.0,5.0,5.0,High/ Medium,Neymar,Yes,LW,10.0,"Aug 3, 2017",,2022,5'9,150lbs,84+3,84+3,84+3,89+3,89+3,89+3,89+3,89+3,89+3,89+3,89+3,88+3,...,60+3,47+3,47+3,47+3,60+3,79.0,87.0,62.0,84.0,84.0,96.0,88.0,87.0,78.0,95.0,94.0,90.0,96.0,94.0,84.0,80.0,61.0,81.0,49.0,82.0,56.0,36.0,89.0,87.0,81.0,94.0,27.0,24.0,33.0,9.0,9.0,15.0,15.0,11.0,€228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,https://cdn.sofifa.org/teams/2/light/11.png,€72M,€260K,1471,Right,4.0,3.0,1.0,Medium/ Medium,Lean,Yes,GK,1.0,"Jul 1, 2011",,2020,6'4,168lbs,,,,,,,,,,,,,...,,,,,,17.0,13.0,21.0,50.0,13.0,18.0,21.0,19.0,51.0,42.0,57.0,58.0,60.0,90.0,43.0,31.0,67.0,43.0,64.0,12.0,38.0,30.0,12.0,68.0,40.0,68.0,15.0,21.0,13.0,90.0,85.0,87.0,88.0,94.0,€138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,https://cdn.sofifa.org/teams/2/light/10.png,€102M,€355K,2281,Right,4.0,5.0,4.0,High/ High,Normal,Yes,RCM,7.0,"Aug 30, 2015",,2023,5'11,154lbs,82+3,82+3,82+3,87+3,87+3,87+3,87+3,87+3,88+3,88+3,88+3,88+3,...,73+3,66+3,66+3,66+3,73+3,93.0,82.0,55.0,92.0,82.0,86.0,85.0,83.0,91.0,91.0,78.0,76.0,79.0,91.0,77.0,91.0,63.0,90.0,75.0,91.0,76.0,61.0,87.0,94.0,79.0,88.0,68.0,58.0,51.0,15.0,13.0,5.0,10.0,13.0,€196.4M


In [None]:
# get the required variable
data_2 = data_2[["Position", 'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling',
       'Curve', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration',
       'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower',
       'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression',
       'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure',
       'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 'GKHandling',
       'GKKicking', 'GKPositioning', 'GKReflexes']]

In [None]:
# categorizing position
forward_player = ["ST", "LW", "RW", "LF", "RF", "RS","LS", "CF"]
midfielder_player = ["CM","RCM","LCM", "CDM","RDM","LDM", "CAM", "LAM", "RAM", "RM", "LM"]
defender_player = ["CB", "RCB", "LCB", "LWB", "RWB", "LB", "RB"]

# labeling the position
data_2.loc[data_2["Position"] == "GK", "Position"] = 0

data_2.loc[data_2["Position"].isin(defender_player), "Position"] = 1

data_2.loc[data_2["Position"].isin(midfielder_player), "Position"] = 2

data_2.loc[data_2["Position"].isin(forward_player), "Position"] = 3

#drop null values
data_2 = data_2.dropna()
data_2.shape

# distribution of target variable
data_2['Position'].value_counts(normalize=True)

2    0.376812
1    0.323249
3    0.188351
0    0.111589
Name: Position, dtype: float64
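
The same grouping can also be expressed as a lookup dictionary, which keeps the position-to-label rule in one place. This is an alternative sketch, not the approach used in the cell above; `position_groups`, `position_to_label` and `toy_positions` are hypothetical names, and the position lists are copied from that cell.

```python
import pandas as pd

# Label -> list of raw positions (0 = goalkeeper, 1 = defender, 2 = midfielder, 3 = forward)
position_groups = {
    0: ["GK"],
    1: ["CB", "RCB", "LCB", "LWB", "RWB", "LB", "RB"],
    2: ["CM", "RCM", "LCM", "CDM", "RDM", "LDM", "CAM", "LAM", "RAM", "RM", "LM"],
    3: ["ST", "LW", "RW", "LF", "RF", "RS", "LS", "CF"],
}

# Invert to raw position -> label
position_to_label = {
    pos: label for label, positions in position_groups.items() for pos in positions
}

# Toy series for illustration; positions missing from the lookup become NaN
toy_positions = pd.Series(["RF", "GK", "RCM", "CB", None])
toy_labels = toy_positions.map(position_to_label)
print(toy_labels)
```

Unmapped or missing positions come out as `NaN`, so they can be removed with `dropna()` before casting the labels to integers.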

In [None]:
# change the type of position variable into int
data_2['Position'] = data_2['Position'].astype('int64')

### Data Preprocessing

In [None]:
# Convert independent variable from object to numeric using One hot encoder
data_encoded = pd.get_dummies(data_2, prefix_sep="_")
data_encoded = data_encoded.dropna()
display(data_encoded.shape)
data_encoded.head()

(18147, 34)

Unnamed: 0,Position,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,FKAccuracy,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Reactions,Balance,ShotPower,Jumping,Stamina,Strength,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes
0,3,95.0,70.0,90.0,86.0,97.0,93.0,94.0,87.0,96.0,91.0,86.0,91.0,95.0,95.0,85.0,68.0,72.0,59.0,94.0,48.0,22.0,94.0,94.0,75.0,96.0,33.0,28.0,26.0,6.0,11.0,15.0,14.0,8.0
1,3,94.0,89.0,81.0,87.0,88.0,81.0,76.0,77.0,94.0,89.0,91.0,87.0,96.0,70.0,95.0,95.0,88.0,79.0,93.0,63.0,29.0,95.0,82.0,85.0,95.0,28.0,31.0,23.0,7.0,11.0,15.0,14.0,11.0
2,3,87.0,62.0,84.0,84.0,96.0,88.0,87.0,78.0,95.0,94.0,90.0,96.0,94.0,84.0,80.0,61.0,81.0,49.0,82.0,56.0,36.0,89.0,87.0,81.0,94.0,27.0,24.0,33.0,9.0,9.0,15.0,15.0,11.0
3,0,13.0,21.0,50.0,13.0,18.0,21.0,19.0,51.0,42.0,57.0,58.0,60.0,90.0,43.0,31.0,67.0,43.0,64.0,12.0,38.0,30.0,12.0,68.0,40.0,68.0,15.0,21.0,13.0,90.0,85.0,87.0,88.0,94.0
4,2,82.0,55.0,92.0,82.0,86.0,85.0,83.0,91.0,91.0,78.0,76.0,79.0,91.0,77.0,91.0,63.0,90.0,75.0,91.0,76.0,61.0,87.0,94.0,79.0,88.0,68.0,58.0,51.0,15.0,13.0,5.0,10.0,13.0


In [None]:
# Separating dependent and independent variable
X = data_encoded.drop(columns="Position") #independent variable
y = data_encoded["Position"] #dependent variable

# Performing train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

### Modelling

In [None]:
# Modelling using Gradient Boosting
lr_list = [0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1]

for learning_rate in lr_list:
  gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=learning_rate, 
                                      max_features=2, max_depth=2, random_state=43)
  
  gb_clf.fit(X_train, y_train)

  print("Learning rate: {}".format(learning_rate))
  print("Accuracy score (training): {:.3f}".format(gb_clf.score(X_train, y_train)))
  print("Accuracy score (validation): {:.3f}".format(gb_clf.score(X_test, y_test)))
  print()

Learning rate: 0.05
Accuracy score (training): 0.850
Accuracy score (validation): 0.848

Learning rate: 0.075
Accuracy score (training): 0.863
Accuracy score (validation): 0.860

Learning rate: 0.1
Accuracy score (training): 0.870
Accuracy score (validation): 0.864

Learning rate: 0.25
Accuracy score (training): 0.888
Accuracy score (validation): 0.876

Learning rate: 0.5
Accuracy score (training): 0.899
Accuracy score (validation): 0.874

Learning rate: 0.75
Accuracy score (training): 0.904
Accuracy score (validation): 0.870

Learning rate: 1
Accuracy score (training): 0.903
Accuracy score (validation): 0.862



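Validation accuracy peaks at a learning rate of 0.25 (0.876) and stays close at 0.5 (0.874) and 0.75 (0.870), while training accuracy keeps rising, so the gap between the two widens at higher rates. Learning rates of 0.5 and 0.75 can be used, although 0.25 generalizes slightly better on this split.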
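
The comparison can be made explicit with a few lines of code. The sketch below copies the scores printed by the loop above into a dictionary (`results` is a hypothetical name), then reports the gap between the first and second score for each learning rate and the rate with the highest validation accuracy.

```python
# learning rate -> (first printed score, validation score), copied from the output above
results = {
    0.05: (0.850, 0.848),
    0.075: (0.863, 0.860),
    0.1: (0.870, 0.864),
    0.25: (0.888, 0.876),
    0.5: (0.899, 0.874),
    0.75: (0.904, 0.870),
    1: (0.903, 0.862),
}

# A growing gap between the two scores suggests the model is starting to overfit
gaps = {lr: train_acc - val_acc for lr, (train_acc, val_acc) in results.items()}
for lr, gap in gaps.items():
    print("Learning rate {}: gap = {:.3f}".format(lr, gap))

# Learning rate with the highest validation accuracy
best_lr = max(results, key=lambda lr: results[lr][1])
print("Best validation accuracy at learning rate:", best_lr)
```

Since these scores come from a single train/test split, small differences between neighboring learning rates may not be meaningful; cross-validation would give a more reliable comparison.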

In [None]:
# Fit the final model with the chosen learning rate
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.75, 
                                      max_features=2, max_depth=2, random_state=43)
  
gb_clf.fit(X_train, y_train)

predictions = gb_clf.predict(X_test)
print("Classification Report")
print(classification_report(y_test, predictions))

Classification Report
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       387
           1       0.91      0.90      0.91      1163
           2       0.83      0.85      0.84      1398
           3       0.82      0.79      0.81       682

    accuracy                           0.87      3630
   macro avg       0.89      0.88      0.89      3630
weighted avg       0.87      0.87      0.87      3630



**Reference**
1. https://towardsdatascience.com/machine-learning-multiclass-classification-with-imbalanced-data-set-29f6a177c1a
2. https://www.geeksforgeeks.org/ml-gradient-boosting/