## Support Vector Classifier (Number of Ingredients by Cuisine Types)

This notebook requires:
* trainEngineered.csv

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import classification_report

In [None]:
finalDF = pd.read_csv('trainEngineered.csv')
finalDF.head()

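| | greek | southern_us | filipino | indian | jamaican | spanish | italian | mexican | chinese | british | ... | cajun_creole | brazilian | french | japanese | irish | korean | moroccan | russian | general | cuisine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 1 | 0 | 2 | 0 | 0 | 6 | 7 | 1 | 0 | ... | 1 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 8 | greek |
| 1 | 0 | 5 | 0 | 1 | 3 | 0 | 2 | 1 | 1 | 2 | ... | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 14 | southern_us |
| 2 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 3 | 1 | 0 | ... | 1 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 18 | filipino |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | indian |
| 4 | 1 | 3 | 0 | 14 | 2 | 2 | 3 | 5 | 5 | 1 | ... | 3 | 1 | 1 | 6 | 0 | 1 | 3 | 0 | 22 | indian |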


### Split into Train and Test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(finalDF.drop(['cuisine'], axis = 1), 
                                                    finalDF['cuisine'], 
                                                    train_size = 0.8, 
                                                    random_state = 5)

### Feature Scaling

Since we are using a support vector machine, we scale the features to the [0, 1] range, which helps the solver converge faster during training. The scaler is fitted on the training set only.

In [None]:
scaler = MinMaxScaler()
scaledData = scaler.fit_transform(X_train)
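
As a minimal sketch of what the scaler does, the toy example below (made-up counts, not taken from trainEngineered.csv) compares `MinMaxScaler` with the per-column formula (x - min) / (max - min). The names `toy_train`, `toy_test` and `toy_scaler` are illustrative only. It also shows that a test row scaled with the training minimum and maximum can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy ingredient counts (made-up values for illustration)
toy_train = np.array([[0.0, 2.0],
                      [5.0, 4.0],
                      [10.0, 8.0]])
toy_test = np.array([[2.5, 11.0]])

# Fit on the training rows only, as in the notebook
toy_scaler = MinMaxScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)

# Same result computed by hand: (x - min) / (max - min), column by column
col_min = toy_train.min(axis=0)
col_max = toy_train.max(axis=0)
manual_scaled = (toy_train - col_min) / (col_max - col_min)

# The test row reuses the training min/max, so it is not clipped to [0, 1]
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled)
print(toy_test_scaled)
```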

In [None]:
# Feature column names: every cuisine type plus 'general' (all columns except the 'cuisine' label)
cuisineTypes = finalDF.columns[:-1]

In [None]:
scaledDF = pd.DataFrame(scaledData, columns=cuisineTypes)
scaledDF.head()

| | greek | southern_us | filipino | indian | jamaican | spanish | italian | mexican | chinese | british | ... | vietnamese | cajun_creole | brazilian | french | japanese | irish | korean | moroccan | russian | general |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.034483 | 0.071429 | 0.02381 | 0.0 | 0.058824 | 0.065574 | 0.047619 | 0.212121 | 0.05 | ... | 0.090909 | 0.086957 | 0.0 | 0.1 | 0.333333 | 0.0 | 0.222222 | 0.0 | 0.090909 | 0.206897 |
| 1 | 0.117647 | 0.068966 | 0.0 | 0.0 | 0.0 | 0.058824 | 0.081967 | 0.261905 | 0.030303 | 0.0 | ... | 0.090909 | 0.043478 | 0.076923 | 0.05 | 0.037037 | 0.0 | 0.0 | 0.0 | 0.0 | 0.068966 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.032787 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.103448 |
| 3 | 0.117647 | 0.137931 | 0.071429 | 0.119048 | 0.125 | 0.058824 | 0.04918 | 0.095238 | 0.030303 | 0.15 | ... | 0.0 | 0.086957 | 0.0 | 0.15 | 0.037037 | 0.5 | 0.0 | 0.125 | 0.363636 | 0.172414 |
| 4 | 0.117647 | 0.137931 | 0.0 | 0.02381 | 0.0 | 0.058824 | 0.147541 | 0.095238 | 0.030303 | 0.0 | ... | 0.0 | 0.130435 | 0.153846 | 0.15 | 0.0 | 0.0 | 0.0 | 0.0625 | 0.0 | 0.344828 |


## Linear Support Vector Classifier

In [None]:
max_iter=20
svc_model = LinearSVC(loss='squared_hinge', 
                      penalty='l2', 
                      dual=False, 
                      fit_intercept=True,
                      intercept_scaling=1, 
                      max_iter=max_iter,
                      multi_class='ovr')

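We set `dual=False` because the dataset has more samples than features (n_samples > n_features), the case in which scikit-learn recommends solving the primal problem instead of the dual.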
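
The rule of thumb can be written as a small check on the shape of the feature matrix. The sketch below uses synthetic data from `make_classification`, and `prefer_dual` is a hypothetical helper introduced only for illustration; it is not part of scikit-learn or of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC


def prefer_dual(X):
    """Hypothetical helper: solve the dual only when samples do not outnumber features."""
    n_samples, n_features = X.shape
    return n_samples <= n_features


# Synthetic stand-in for the engineered data: many rows, few columns
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# 300 samples > 10 features, so the helper returns False and the primal is solved
demo_model = LinearSVC(dual=prefer_dual(X_demo), max_iter=1000)
demo_model.fit(X_demo, y_demo)

print(prefer_dual(X_demo), demo_model.coef_.shape)
```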

In [None]:
svc_model.fit(scaledDF, y_train)

LinearSVC(dual=False, max_iter=20)

In [None]:
print("Score: ", svc_model.score(scaledDF, y_train))

Score:  0.7382067318268959


In [None]:
# Transform the test set with the min-max scaler fitted on the training set
X_test_tf = pd.DataFrame(scaler.transform(X_test), columns=cuisineTypes)

In [None]:
print(f'Max iterations: {max_iter}, Accuracy: {svc_model.score(X_test_tf, y_test)}')

Max iterations: 20, Accuracy: 0.7401634192331866


Linear SVC reaches about 74% test accuracy on the engineered dataset, slightly below the 77% obtained with the one-hot encoded version.

## Support Vector Classifier

`LinearSVC` handles multiclass problems with a one-vs-rest scheme, so we next try the general `SVC` with an RBF kernel, which uses one-vs-one internally.
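
To make the difference concrete: with k classes, one-vs-rest trains k binary classifiers, while one-vs-one trains k(k-1)/2. The sketch below illustrates this on synthetic data with 4 classes (not the cuisine data). Setting `decision_function_shape='ovo'` is an assumption made here only to expose the pairwise scores; the notebook keeps the default.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

n_classes = 4
X_toy, y_toy = make_classification(n_samples=400, n_features=6,
                                   n_informative=4, n_redundant=0,
                                   n_classes=n_classes,
                                   n_clusters_per_class=1,
                                   random_state=0)

# One-vs-rest: one weight vector per class
ovr_model = LinearSVC(dual=False).fit(X_toy, y_toy)
n_ovr = ovr_model.coef_.shape[0]

# One-vs-one: one decision score per pair of classes
ovo_model = SVC(kernel='rbf', decision_function_shape='ovo').fit(X_toy, y_toy)
n_ovo = ovo_model.decision_function(X_toy).shape[1]

print(n_ovr, n_ovo)  # k and k * (k - 1) / 2
```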

In [None]:
max_iter=3800
svc_model = SVC(kernel='rbf',
                max_iter=max_iter)

In [None]:
svc_model.fit(scaledDF, y_train)

SVC(max_iter=3800)

In [None]:
print("Score: ", svc_model.score(scaledDF, y_train))

Score:  0.7795342405481002


In [None]:
# Scale the test set with the scaler fitted on the training set
X_test_tf = pd.DataFrame(scaler.transform(X_test), columns=cuisineTypes)

In [None]:
print(f'Max iterations: {max_iter}, Accuracy: {svc_model.score(X_test_tf, y_test)}')

Max iterations: 3800, Accuracy: 0.7553739786297926


Although test accuracy improves slightly over linear SVC (about 75.5% versus 74.0%), it still lags behind the linear SVC trained on one-hot encoded data (77%).

## Cross Validation

In [None]:
X_validation, X_test, y_validation, y_test = train_test_split(finalDF.drop(['cuisine'], axis = 1), 
                                                              finalDF['cuisine'], 
                                                              train_size = 0.5, 
                                                              random_state = 42)

In [None]:
scaler = MinMaxScaler()
scaledData = scaler.fit_transform(X_validation)
scaledDF = pd.DataFrame(scaledData, columns=cuisineTypes)

In [None]:
scores = cross_validate(svc_model, scaledDF, y_validation, cv=5,
                        scoring='accuracy',
                        return_estimator=True)

In [None]:
print(scores['test_score'])

[0.75666164 0.75238813 0.75810913 0.74955997 0.75836057]


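The five folds score between about 75.0% and 75.8%, in line with the hold-out result, so SVC on the engineered dataset still does not beat the 77% reached by linear SVC on the one-hot encoded data.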
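
The cross-validation above scales the whole 50% split once before the folds are formed, so each fold's held-out rows contribute to the scaler's minimum and maximum. One possible variant, sketched here on synthetic data rather than trainEngineered.csv, wraps the scaler and the classifier in a scikit-learn pipeline so that scaling is re-fitted inside every fold. This is an alternative setup and not the one used to produce the scores reported above; the names `X_cv`, `y_cv` and `cv_pipeline` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the engineered features and cuisine labels
X_cv, y_cv = make_classification(n_samples=300, n_features=8,
                                 n_informative=5, n_classes=3,
                                 random_state=0)

# The scaler is fitted on each fold's training part only
cv_pipeline = make_pipeline(MinMaxScaler(), SVC(kernel='rbf'))

cv_results = cross_validate(cv_pipeline, X_cv, y_cv, cv=5,
                            scoring='accuracy')

print(cv_results['test_score'])
print(cv_results['test_score'].mean())
```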