# Feature Vector Multi-Class Classification

This notebook contains the pipeline for running classification experiments on the multi-class version of this problem. The experiments consider five classes: **Cardboard**, **Metal**, **Paper**, **Glass**, and **Plastic**. The <u>data set</u> is available in the following [Kaggle Repository](https://www.kaggle.com/asdasdasasdas/garbage-classification).

The notebook `clean_research_practice1_version.ipynb`, which works through the binary classification version of the problem, sets the baseline for this notebook. In case of doubt, consult it as the reference.

# Imports

In [6]:
import os
import errno

import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

import xgboost as xgb

from image_processing import applypca, applynmf
from evaluation_functions import (hyperparametertunning, learningcurve, 
                                  plotlearningcurve, multiclass_CV)

# from google.colab import drive
# drive.mount('/content/drive/', force_remount=True)

# Load the data

In [2]:
df = pd.read_csv('data/feature_extraction.csv', index_col=0)
# df = pd.read_csv('data/fine_tuning.csv', index_col=0)

X = df.iloc[:, 1:-5]
y = df.iloc[:,-5:].to_numpy().argmax(axis=1)
image_filenames = df.iloc[:, 0]
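
The slicing above assumes a particular column layout: the filename first, then the feature columns, then five one-hot class columns at the end. A minimal sketch of that layout on a toy frame follows; the column names, the toy values, and the alphabetical class order are assumptions made for illustration and are not taken from the real CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical class order (alphabetical); the real CSV may order columns differently.
CLASSES = ['cardboard', 'glass', 'metal', 'paper', 'plastic']

# Toy frame mimicking the assumed layout: filename, features, one-hot labels.
toy = pd.DataFrame({
    'filename': ['a.jpg', 'b.jpg', 'c.jpg'],
    'f0': [0.1, 0.4, 0.7],
    'f1': [0.2, 0.5, 0.8],
    'cardboard': [1, 0, 0],
    'glass': [0, 0, 1],
    'metal': [0, 0, 0],
    'paper': [0, 1, 0],
    'plastic': [0, 0, 0],
})

X_toy = toy.iloc[:, 1:-5]                            # feature columns only
y_toy = toy.iloc[:, -5:].to_numpy().argmax(axis=1)   # one-hot -> integer label
labels = [CLASSES[i] for i in y_toy]                 # integer label -> class name

print(y_toy, labels)
```

Because `argmax` returns the position of the 1 in each row, the integer labels are only meaningful together with the order of the one-hot columns.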

# Dimension Reduction

In [3]:
# Apply PCA
X_pca, pca = applypca(X) 

variance = pca.explained_variance_ratio_

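# Count the leading components needed to explain at least 80% of the variance
suma = 0  # cumulative explained variance ratio
cont = 0  # number of components accumulated so far
while suma < 0.8:
    suma += variance[cont]
    cont += 1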

X_pca80 = X_pca.iloc[:,:cont]

print("Original BoF shape:",X.shape)
print("BoF PCA shape:",X_pca.shape)
print("BoF PCA80 shape",X_pca80.shape)

Original BoF shape: (2390, 1280)
BoF PCA shape: (2390, 1280)
BoF PCA80 shape (2390, 168)
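
The loop above picks the number of leading components whose explained variance ratios add up to at least 0.8. As a sketch, the same count can be obtained with a cumulative sum; the helper name `n_components_for_variance` and the toy ratios are illustrative assumptions, not part of the notebook's modules.

```python
import numpy as np

def n_components_for_variance(variance_ratio, threshold=0.8):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(variance_ratio)
    # First index at which the cumulative ratio is >= threshold
    idx = int(np.searchsorted(cumulative, threshold, side='left'))
    if idx >= len(cumulative):
        raise ValueError('threshold is never reached by the given ratios')
    return idx + 1

# Toy ratios (exact in floating point), not the real PCA output
variance_toy = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
k = n_components_for_variance(variance_toy, 0.8)
print(k)
```

With the notebook's objects, `n_components_for_variance(pca.explained_variance_ratio_)` should play the role of `cont`.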


# Model Evaluation

## Experiment Setting

In [4]:
# Create Directory
root = 'experiments/'
# destination = '/content/drive/MyDrive/PI2/experiments'

try:
    os.mkdir(root)
except OSError as e:
    if e.errno == errno.EEXIST:
        print('Directory already exist')
    else:
        raise

# Create Experiment Directory
experiment = str(input('Type Experiment Name: '))
path = root + experiment
try:
    os.mkdir(path)
except OSError as e:
    if e.errno == errno.EEXIST:
        print('Directory already exist')
    else:
        raise
model_name = str(input('Type Model Name: '))

# Create training dictionary
X_dict = {'Regular':X,'PCA':X_pca,'PCA80':X_pca80}

Directory already exist
Type Experiment Name: feature_extraction_XGBoost
Type Model Name: XGBoost
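
The two `try`/`except` blocks above create the root and experiment directories while tolerating the case where they already exist. A more compact sketch uses `os.makedirs` with `exist_ok=True`; the helper name `make_experiment_dir` is hypothetical, and unlike the cell above it prints nothing when the directory is already there. A temporary directory is used here so the example does not touch the real `experiments/` folder.

```python
import os
import tempfile

def make_experiment_dir(root, experiment):
    """Create root/experiment (and root itself if needed); do nothing if it exists."""
    path = os.path.join(root, experiment)
    os.makedirs(path, exist_ok=True)
    return path

# Demonstration inside a throwaway directory
base = tempfile.mkdtemp()
demo_root = os.path.join(base, 'experiments')
demo_path = make_experiment_dir(demo_root, 'demo_experiment')
demo_path_again = make_experiment_dir(demo_root, 'demo_experiment')  # second call is a no-op
print(demo_path)
```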


## Hyperparameter Setting

In [None]:
%%capture

# Classifier - XGBoost
model = xgb.XGBClassifier(objective='multi:softmax')
hyperparameters = {
    "max_depth": [3, 4, 5, 7],
    "learning_rate": [0.1, 0.01, 0.05],
    "gamma": [0, 0.25, 1],
    "reg_lambda": [0, 1, 10],
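    # Note: scale_pos_weight is intended for binary objectives and likely has
    # no effect with multi:softmax; it still triples the grid size.
    "scale_pos_weight": [1, 3, 5],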
    "subsample": [0.8],
    "colsample_bytree": [0.5],
}

# Classifier - Random Forest
# model = RandomForestClassifier(n_jobs=-1)
# hyperparameters = {'n_estimators': [100, 250, 500],
#                   'max_depth': [4, 8, 16, 32],
#                   'criterion': ['gini', 'entropy']}

# Classifier - Gaussian Naive Bayes
# model = GaussianNB()
# hyperparameters = {}

# # Classifier - SVM
# model = SVC(probability=True)

# hyperparameters = {'C': [0.1, 1, 10, 100, 1000],
#               'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
#               'kernel': ['rbf', 'sigmoid', 'linear']}

# Classifier - Logistic
# model = LogisticRegression(multi_class='multinomial')

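# Note: 'l1' and 'elasticnet' need solver='saga' (the default 'lbfgs' supports
# only 'l2' or no penalty), and 'elasticnet' also needs l1_ratio.
# hyperparameters = {'C': np.logspace(0,4,10),
#                    'penalty': ['l1','l2','elasticnet','none']}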

# Tune Hyperparameters
param_dict, param_title_dictionary = hyperparametertunning(model, X_dict, y, 
                                                           hyperparameters, 5, 
                                                           'f1_macro')
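
The implementation of `hyperparametertunning` lives in `evaluation_functions` and is not shown here. As a rough sketch only, assuming it runs a cross-validated grid search once per entry of `X_dict`, the idea could look like the following; the function `tune_per_feature_set`, the synthetic data, and the small grid are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def tune_per_feature_set(model, X_dict, y, grid, cv, scoring):
    """Run one grid search per feature set and collect the best parameters."""
    best = {}
    for name, X in X_dict.items():
        # GridSearchCV clones the estimator, so `model` itself is left unfitted
        search = GridSearchCV(model, grid, cv=cv, scoring=scoring)
        search.fit(X, y)
        best[name] = search.best_params_
    return best

# Synthetic 3-class data standing in for the real feature vectors
X_demo, y_demo = make_classification(n_samples=150, n_features=10,
                                     n_informative=6, n_classes=3,
                                     random_state=0)
X_dict_demo = {'Regular': X_demo, 'First5': X_demo[:, :5]}

best_params = tune_per_feature_set(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X_dict_demo, y_demo, {'max_depth': [2, 4]}, 3, 'f1_macro')
print(best_params)
```

Scoring with `'f1_macro'` averages the per-class F1 scores with equal weight, so small classes count as much as large ones.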

## Learning Curves

In [None]:
%%capture
(train_sizes_dict, train_scores_mean_dict, train_scores_std_dict,
  test_scores_mean_dict, 
  test_scores_std_dict) = learningcurve(model, X_dict, y, 
                                        5, param_dict, 'f1_macro', 
                                        np.linspace(0.1,1,50))
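
The `learningcurve` helper is also defined outside this notebook. A minimal sketch of the underlying idea, assuming it wraps scikit-learn's `learning_curve` for each feature set, is shown below on synthetic data with a single feature set; the estimator and the data are stand-ins chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic 3-class data standing in for the real feature vectors
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=6, n_classes=3,
                                     random_state=0)

# Scores are computed at 5 training-set sizes with 5-fold cross-validation
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    cv=5, scoring='f1_macro', train_sizes=np.linspace(0.2, 1.0, 5))

# Mean and standard deviation across folds, one value per training size
train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
test_mean, test_std = test_scores.mean(axis=1), test_scores.std(axis=1)
print(train_sizes, test_mean)
```

The means and standard deviations per training size are the quantities that a learning-curve plot typically draws as a line with a shaded band.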

In [None]:
plotlearningcurve(model_name, param_dict, param_title_dictionary, 'f1_macro', 
                  train_sizes_dict, train_scores_mean_dict, 
                  train_scores_std_dict, test_scores_mean_dict,
                  test_scores_std_dict, path)

## Evaluating Performance Across Classes

In [None]:
CLASSES = ['cardboard', 'glass', 'metal', 'paper', 'plastic']
df, model_wrong_preds = multiclass_CV(model, 5, X_dict, y, param_dict, 
                                      param_title_dictionary, CLASSES, 
                                      model_name, path, image_filenames)
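
`multiclass_CV` comes from `evaluation_functions`, so its internals are not visible here. As a hedged sketch of per-class evaluation under cross-validation, assuming out-of-fold predictions are compared against the true labels, one could compute a confusion matrix, per-class F1 scores, and the indices of misclassified samples as follows; the synthetic data and variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import cross_val_predict

CLASSES_DEMO = ['cardboard', 'glass', 'metal', 'paper', 'plastic']
label_ids = list(range(len(CLASSES_DEMO)))

# Synthetic 5-class data standing in for the real feature vectors
X_demo, y_demo = make_classification(n_samples=250, n_features=12,
                                     n_informative=8, n_classes=5,
                                     random_state=0)

# Each sample is predicted by a model that did not see it during training
y_pred = cross_val_predict(LogisticRegression(max_iter=1000),
                           X_demo, y_demo, cv=5)

cm = confusion_matrix(y_demo, y_pred, labels=label_ids)      # rows: true, cols: predicted
per_class_f1 = f1_score(y_demo, y_pred, average=None, labels=label_ids)
wrong_idx = np.flatnonzero(y_pred != y_demo)                  # misclassified samples

for name, score in zip(CLASSES_DEMO, per_class_f1):
    print(f'{name}: F1 = {score:.3f}')
```

Indexing a filename series with `wrong_idx` would give the misclassified images, which may be the purpose of passing `image_filenames` in the cell above.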

In [None]:
df