
As we just did with logistic regression, in this exercise we will classify penguin species in the penguin dataset, this time using a Support Vector Machine (SVM).

Data Scrubbing:
*   Removing rows with missing values
*   One-hot encoding for island and sex
*   Standardization using StandardScaler for all independent variables

Independent Variables:
*   bill_length_mm
*   bill_depth_mm
*   flipper_length_mm
*   body_mass_g
*   island
*   sex

Dependent Variable:
*   species

Evaluation:
*   Confusion matrix
*   Classification report

In [1]:
# 1) Import the following Python libraries: A) pandas B) train_test_split from Scikit-learn C) StandardScaler from Scikit-learn D) SVC from Scikit-learn E) confusion_matrix and classification_report from Scikit-learn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

In [2]:
# 2) Import dataset from the web: https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv')

In [3]:
# 3) Delete rows containing missing values
# Note: recent pandas versions raise a TypeError when both `how` and `thresh` are passed, so only `how` is set here
df.dropna(axis=0, how='any', inplace=True)
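
Before dropping rows, it can help to see how many values are missing in each column. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the penguin file) to show what `isna().sum()` and `dropna()` do.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not taken from penguins.csv)
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3],
    "sex": ["Male", "Female", None],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows that have no missing value in any column
cleaned = toy.dropna()
print(cleaned)
```

Note that `dropna()` keeps the original index labels of the surviving rows; `reset_index(drop=True)` can be chained if a clean 0..n-1 index is preferred.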

In [4]:
# 4) Convert non-numeric variables using one-hot encoding. These variables include sex and island.
df = pd.get_dummies(df, columns=['sex', 'island'])
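
The order of the columns produced by `get_dummies` matters later, because any data point passed to the model has to list its values in the same order. A minimal sketch on a made-up two-row frame (`toy` is an assumption for illustration) shows where the new columns land: the untouched columns come first, followed by the dummy columns for each encoded variable in the order given to `columns`, with categories sorted alphabetically.

```python
import pandas as pd

# Made-up rows that reuse some of the penguin column names
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "bill_length_mm": [39.1, 46.1],
    "sex": ["Male", "Female"],
    "island": ["Torgersen", "Biscoe"],
})

encoded = pd.get_dummies(toy, columns=["sex", "island"])

# Feature columns in the order the scaler and the model will see them
feature_names = list(encoded.drop("species", axis=1).columns)
print(feature_names)
```

Running `list(df.drop('species', axis=1).columns)` on the real data after step 4 is a quick way to confirm the order before building a data point by hand.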

In [5]:
# 5) Standardize the independent variables using StandardScaler
scaler = StandardScaler()
scaler.fit(df.drop('species',axis=1))
scaled_features = scaler.transform(df.drop('species',axis=1))
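
For intuition, standardization subtracts each column's mean and divides by its standard deviation, so every feature ends up with mean 0 and standard deviation 1. The sketch below reproduces that calculation for one column using only the standard library; the helper `standardize` and the sample values are hypothetical, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]

# Made-up flipper lengths in millimetres
flipper_lengths = [180.0, 190.0, 200.0]
scaled = standardize(flipper_lengths)
print(scaled)
```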

In [6]:
# 6) Assign the X and y variables
X = scaled_features
y = df['species']

In [7]:
# 7) Shuffle the dataset and split the data into test/train sets (70/30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
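
The split above is random on every run, and the scaler in step 5 was fitted on all rows before splitting, so statistics from the test rows influence the scaling. A common variation, sketched here on made-up data (`X_demo`, `y_demo` and the `random_state` value are assumptions), fixes the seed, stratifies by class, and fits the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data: 20 rows, 2 numeric features, 2 balanced classes
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(20, 2))
y_demo = np.array(["A"] * 10 + ["B"] * 10)

# random_state makes the split repeatable; stratify keeps class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=True, stratify=y_demo, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)
```

This is an alternative ordering of steps 5 to 7 rather than the approach used in this exercise.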

In [8]:
# 8) Assign the classification version of Support Vector Machine (SVC) as the model's algorithm
model = SVC()

In [9]:
# 9) Link model to X and y variables using the fit function
model.fit(X_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
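
The output above lists the defaults that `SVC()` used, including `kernel='rbf'`, `C=1.0` and `gamma='scale'`. According to the scikit-learn documentation, `gamma='scale'` resolves to `1 / (n_features * X.var())`. The sketch below checks this on made-up data (`X_demo` and `y_demo` are assumptions) by comparing a default model with one where that value is passed explicitly.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data: two small clusters, one per class
X_demo = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [1.0, 1.0], [0.9, 1.2], [1.1, 0.8],
])
y_demo = np.array([0, 0, 0, 1, 1, 1])

# Value that gamma='scale' is documented to resolve to
gamma_scale = 1.0 / (X_demo.shape[1] * X_demo.var())

default_model = SVC().fit(X_demo, y_demo)
explicit_model = SVC(kernel="rbf", C=1.0, gamma=gamma_scale).fit(X_demo, y_demo)

# Both models should produce the same decision values
same = np.allclose(
    default_model.decision_function(X_demo),
    explicit_model.decision_function(X_demo),
)
print(same)
```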

In [10]:
# 10) Run algorithm on test data to make predictions
model_test = model.predict(X_test)

In [11]:
# 11) Evaluate predictions by comparing the model's predictions and the actual outcome of the test data using a confusion matrix and classification report
print(confusion_matrix(y_test, model_test)) 
print(classification_report(y_test, model_test))

[[44  0  0]
 [ 0 20  0]
 [ 0  0 36]]
              precision    recall  f1-score   support

      Adelie       1.00      1.00      1.00        44
   Chinstrap       1.00      1.00      1.00        20
      Gentoo       1.00      1.00      1.00        36

    accuracy                           1.00       100
   macro avg       1.00      1.00      1.00       100
weighted avg       1.00      1.00      1.00       100
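
In the confusion matrix printed by scikit-learn, rows are the actual classes and columns are the predicted classes, both in sorted label order, so correct predictions sit on the diagonal. A small sketch with made-up labels (`y_true_demo` and `y_pred_demo` are assumptions, not results from the model above) shows how one misclassification appears off the diagonal, with row and column names attached for readability.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["Adelie", "Chinstrap", "Gentoo"]

# Made-up labels: the second Adelie is wrongly predicted as Chinstrap
y_true_demo = ["Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"]
y_pred_demo = ["Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"]

cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# Attach names so rows (actual) and columns (predicted) are explicit
cm_table = pd.DataFrame(
    cm,
    index=[f"actual {name}" for name in labels],
    columns=[f"predicted {name}" for name in labels],
)
print(cm_table)
```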



In [12]:
# 12) Make a prediction with the model using a sample data point (called 'penguin') and the predict function
# Data point to predict; values must follow the column order of df.drop('species', axis=1)
penguin = [
	39, #bill_length_mm
	18.5, #bill_depth_mm
	180, #flipper_length_mm
	3750, #body_mass_g
	0, #sex_Female
	1, #sex_Male
	0, #island_Biscoe
	0, #island_Dream
	1, #island_Torgersen
]

# Scale the data point with the fitted scaler (the model was trained on standardized features), then predict
new_penguin = model.predict(scaler.transform([penguin]))
new_penguin

array(['Adelie'], dtype=object)
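
Because the model was fitted on standardized features, any new data point needs the same scaling before it is passed to `predict`. One way to make that step hard to forget is to bundle the scaler and the classifier in a scikit-learn pipeline. The sketch below uses made-up measurements for two features only (`X_raw`, `y_raw` and the sample point are assumptions), so it illustrates the idea rather than reproducing the model above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Made-up rows: [flipper_length_mm, body_mass_g]
X_raw = np.array([
    [180.0, 3700.0], [185.0, 3800.0], [190.0, 3650.0],
    [215.0, 5000.0], [220.0, 5200.0], [225.0, 5400.0],
])
y_raw = np.array(["Adelie"] * 3 + ["Gentoo"] * 3)

# The pipeline scales inside fit() and predict(), so raw values can be passed in
pipeline = make_pipeline(StandardScaler(), SVC())
pipeline.fit(X_raw, y_raw)

prediction = pipeline.predict([[182.0, 3750.0]])
print(prediction)
```

With a pipeline, the scaler is also refitted on the training rows alone whenever the pipeline is fitted on a training split.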