## CS-471: Machine Learning
### **Submitted By**:
#### **Name**: Ayesh Ahmad
#### **CMS**: 365966
#### **Class**: BESE-12A
---
## Lab 7
#### Perform multi-class classification with Support Vector Machines. In the training and test data files, each row describes one plant instance with four predictors (leaf length, leaf width, flower length, and flower width). The target class “plant” takes exactly one of three values: “Arctica”, “Harlequin”, or “Caroliniana”.

##### Imports

In [19]:
import pandas as pd
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

##### Data Preprocessing and Analysis

In [20]:
def load_and_preprocess_data(train_file, test_file):
    train_data = pd.read_excel(train_file)
    test_data = pd.read_excel(test_file)
    
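    # The test file has no labels: fill the empty 'plant' column with the placeholder 'Unknown'
    test_data['plant'].fillna('Unknown', inplace=True)

    # Standardize the four features (the scaler is fit on the train and test rows combined)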
    combined_data = pd.concat([train_data, test_data], axis=0)
    scaler = StandardScaler()
    combined_data[['leaf.length', 'leaf.width', 'flower.length', 'flower.width']] = scaler.fit_transform(combined_data[['leaf.length', 'leaf.width', 'flower.length', 'flower.width']])
    train_data = combined_data.iloc[:len(train_data), :]
    test_data = combined_data.iloc[len(train_data):, :]
    train_data.reset_index(drop=True, inplace=True)
    test_data.reset_index(drop=True, inplace=True)
    
    # Separate features and labels
    X_train = train_data[['leaf.length', 'leaf.width', 'flower.length', 'flower.width']]
    y_train = train_data['plant']
    X_test = test_data[['leaf.length', 'leaf.width', 'flower.length', 'flower.width']]
    y_test = test_data['plant']
    
    return X_train, y_train, X_test, y_test

train_file = 'TrainingSet.xlsx'
test_file = 'TestingSet.xlsx'
X_train, y_train, X_test, y_test = load_and_preprocess_data(train_file, test_file)

print("Training data:")
print(X_train.head())
print("\nTraining labels:")
print(y_train.head())
print("\nTest data:")
print(X_test.head())
print("\nTest labels:")
print(y_test.head())

Training data:
   leaf.length  leaf.width  flower.length  flower.width
0    -0.537178    1.479398      -1.283389     -1.315444
1    -1.264185    0.788808      -1.226552     -1.315444
2    -1.264185   -0.131979      -1.340227     -1.447076
3    -1.870024   -0.131979      -1.510739     -1.447076
4    -0.052506    2.169988      -1.453901     -1.315444

Training labels:
0    Arctica
1    Arctica
2    Arctica
3    Arctica
4    Arctica
Name: plant, dtype: object

Test data:
   leaf.length  leaf.width  flower.length  flower.width
0    -1.748856   -0.362176      -1.340227     -1.315444
1    -1.506521    0.098217      -1.283389     -1.315444
2    -1.506521    0.788808      -1.340227     -1.183812
3    -1.385353    0.328414      -1.397064     -1.315444
4    -1.143017   -0.131979      -1.340227     -1.315444

Test labels:
0    Unknown
1    Unknown
2    Unknown
3    Unknown
4    Unknown
Name: plant, dtype: object
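
In `load_and_preprocess_data`, the `StandardScaler` is fit on the training and test rows together, so test-set statistics influence the scaling. A common alternative is to fit the scaler on the training rows only and reuse it for the test rows. The sketch below illustrates that pattern with made-up numbers (the arrays are illustrative, not the lab data), and the scaled values it would produce on the real files would differ slightly from those printed above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices standing in for the training and test sets.
X_train_raw = np.array([[5.0, 3.0], [6.0, 3.5], [7.0, 4.0]])
X_test_raw = np.array([[6.0, 3.5], [8.0, 4.5]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_raw)  # statistics come from the training rows only
X_test_scaled = scaler.transform(X_test_raw)        # reuse the training mean and standard deviation

print(X_train_scaled.mean(axis=0))  # close to zero in every column
print(X_test_scaled[0])             # this test row equals the training mean
```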


FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

  test_data['plant'].fillna('Unknown', inplace=True)
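
The warning above recommends assigning the result back instead of calling `fillna` in place on a column selection. A minimal sketch of that pattern, using a small made-up frame in place of the lab's test file:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the unlabeled test file (values are made up).
test_data = pd.DataFrame({
    "leaf.length": [4.4, 4.6],
    "plant": [np.nan, np.nan],
})

# Assign the filled column back rather than using inplace=True on a selection.
test_data["plant"] = test_data["plant"].fillna("Unknown")

print(test_data["plant"].tolist())
```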


##### Training the scikit-learn SVM Classifier using GridSearchCV
---
`GridSearchCV` runs 5-fold cross-validation by default and fits every combination in the parameter grid, so cross-validation and hyperparameter tuning happen in a single call.
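
The 5-fold default can also be stated explicitly. The sketch below is illustrative only: it assumes scikit-learn's bundled iris data as a stand-in for the lab's training set, uses a smaller grid to keep it quick, and adds shuffling with a fixed seed, which the default splitter does not do.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Stand-in data: scikit-learn's iris set, not the lab's TrainingSet.xlsx.
X, y = load_iris(return_X_y=True)

# Explicit 5-fold stratified splitter, shuffled with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

grid = {"C": [1, 10], "gamma": [0.1, 0.01], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), grid, cv=cv, refit=True)
search.fit(X, y)

print(search.n_splits_)                   # number of folds used
print(len(search.cv_results_["params"]))  # number of grid candidates
```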

In [25]:
params = {'C': [0.1, 1, 10, 100, 1000],
          'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
          'kernel': ['rbf']}

svm_classifier = GridSearchCV(SVC(), params, refit = True, verbose = 10)
svm_classifier.fit(X_train, y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV 1/5; 1/25] START C=0.1, gamma=1, kernel=rbf, penalty=l2.....................


ValueError: Invalid parameter 'penalty' for estimator SVC(C=0.1, gamma=1). Valid parameters are: ['C', 'break_ties', 'cache_size', 'class_weight', 'coef0', 'decision_function_shape', 'degree', 'gamma', 'kernel', 'max_iter', 'probability', 'random_state', 'shrinking', 'tol', 'verbose'].
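
The `ValueError` above names a `penalty` entry that is not part of the grid shown in this cell, so that output presumably comes from a run with a different grid. `SVC` has no `penalty` parameter; its regularization strength is set through `C`. A small sketch of checking grid keys against the estimator before fitting, using a hypothetical grid:

```python
from sklearn.svm import SVC

# Hypothetical grid that includes a key SVC does not support.
grid = {"C": [0.1, 1], "gamma": [1, 0.1], "kernel": ["rbf"], "penalty": ["l2"]}

valid = set(SVC().get_params())      # parameter names SVC accepts
unknown = sorted(set(grid) - valid)  # keys GridSearchCV would reject
print(unknown)

# Drop the unsupported keys before passing the grid to GridSearchCV.
clean_grid = {k: v for k, v in grid.items() if k in valid}
```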

##### Optimal Parameters & Score

In [22]:
print("Optimal Hyperparameters: ", svm_classifier.best_params_)
print("Optimal Hyperparameters Accuracy:", svm_classifier.best_score_)

Optimal Hyperparameters:  {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
Optimal Hyperparameters Accuracy: 0.9666666666666666
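
The score reported above is a cross-validation average over folds of the training data, so it says nothing yet about the unlabeled test file. A minimal sketch of where that number comes from, again assuming the iris data as a stand-in for the lab's training set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # stand-in for the lab's training data

search = GridSearchCV(SVC(), {"C": [1, 10], "gamma": [0.1, 0.01]})
search.fit(X, y)

# best_score_ is the mean test-fold score of the best-ranked candidate.
mean_scores = search.cv_results_["mean_test_score"]
print(search.best_score_ == mean_scores[search.best_index_])
```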


##### Predicting Test Data

In [23]:
predictions = svm_classifier.predict(X_test)
print(predictions)

['Arctica' 'Arctica' 'Arctica' 'Arctica' 'Arctica' 'Arctica' 'Harlequin'
 'Carolinian' 'Arctica' 'Arctica' 'Arctica' 'Harlequin' 'Arctica'
 'Harlequin' 'Harlequin' 'Carolinian' 'Harlequin' 'Carolinian'
 'Carolinian' 'Harlequin' 'Harlequin' 'Carolinian' 'Harlequin'
 'Carolinian' 'Harlequin' 'Harlequin' 'Carolinian' 'Carolinian'
 'Carolinian' 'Carolinian']
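
Because the test file carries no labels, these predictions cannot be scored directly. One way to estimate how well the classifier generalizes is to hold out part of the labeled training data. This is a sketch under that assumption, with iris as a stand-in and the hyperparameters reported earlier; it is not part of the original lab procedure.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # stand-in for the labeled training file

# Keep 20% of the labeled rows aside, preserving class proportions.
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline, so it is fit on X_fit only.
model = make_pipeline(StandardScaler(), SVC(C=10, gamma=0.1, kernel="rbf"))
model.fit(X_fit, y_fit)

holdout_accuracy = accuracy_score(y_hold, model.predict(X_hold))
print(f"Hold-out accuracy: {holdout_accuracy:.3f}")
```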


##### Adding Predictions to the Test File

In [24]:
data = pd.read_excel(test_file)
data.iloc[:,-1:] = predictions
data.to_excel('Predictions.xlsx', index=False)

FutureWarning: ... Value '['Arctica' 'Arctica' ... 'Carolinian' 'Carolinian']' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  data.iloc[:,-1:] = predictions
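
The warning above arises because the empty `plant` column is read as `float64` and then receives strings through `iloc`. Assigning the column by name replaces it wholesale and avoids the dtype clash. A minimal sketch with a made-up frame in place of the test file (the `to_excel` call would stay as in the cell above):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the test file: the label column is empty, hence float64.
data = pd.DataFrame({"leaf.length": [4.4, 6.1], "plant": [np.nan, np.nan]})
predictions = np.array(["Arctica", "Harlequin"])

# Replacing the whole column by name gives it the dtype of the new values.
data["plant"] = predictions

print(data["plant"].tolist())
```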
