# Lab 6

Version 2020-04-01

## Classification models

Use 4 datasets for classification problems, starting from the repositories listed in Course 5; for example, [ics.uci.edu](http://archive.ics.uci.edu/ml/datasets.php?format=mat&task=cla&att=&area=&numAtt=&numIns=&type=mvar&sort=nameUp&view=table). At least two of the datasets must contain missing values.


1. (20 points) Apply a missing value imputation method where needed; justify and document the method used.
1. (number of models \* number of datasets \* 1 point = 20 points) For each dataset, apply 5 classification models from scikit-learn. For each model, report the accuracy and the F1 score (see [sklearn.metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)) using 5-fold cross validation. Report the mean results for both the training folds and the test folds. Run the models with fixed hyperparameter values.
1. (number of models \* 4 points = 20 points) Document each model used in the Jupyter notebook, in Romanian. If the same algorithm is used for several datasets, you may write a separate section that documents the algorithms and refer to it from each dataset.
1. (number of models \* number of datasets \* 1 point = 20 points) Report the performance of each model using 5-fold cross validation. For each of the 5 runs, search for the optimal hyperparameters using 4-fold cross validation. The performance of the model is reported as the mean over the 5 runs.
    *Note:* the optimal hyperparameters may differ between the 5 runs, because each run uses different data for training/validation.

20 points are awarded by default.

Examples of classification models:
1. [Multi-layer Perceptron classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)
1. [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
1. [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
1. [Gaussian processes](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html#sklearn.gaussian_process.GaussianProcessClassifier)
1. [RBF kernel](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.RBF.html#sklearn.gaussian_process.kernels.RBF) (a kernel for Gaussian processes, not a standalone classifier)
1. [Decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)
1. [Random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
1. [Gaussian Naive bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB) 
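
The following is a minimal sketch of how several of the models listed above could be compared with fixed hyperparameters under 5-fold cross validation. It assumes scikit-learn's bundled copy of the Wine data (`load_wine`) as a stand-in dataset; the hyperparameter values and the `models` and `results` names are illustrative choices, not part of the assignment.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled Wine data stands in for one of the 4 datasets.
X, y = load_wine(return_X_y=True)

# Fixed (not tuned) hyperparameters; the values are illustrative only.
models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVM': SVC(C=1.0, kernel='rbf'),
    'Decision tree': DecisionTreeClassifier(max_depth=5, random_state=0),
    'Random forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'Gaussian NB': GaussianNB(),
}

results = {}
for name, model in models.items():
    # One call evaluates both metrics on the same 5 folds.
    scores = cross_validate(model, X, y, cv=5,
                            scoring=['accuracy', 'f1_macro'])
    results[name] = {
        'accuracy': scores['test_accuracy'].mean(),
        'f1_macro': scores['test_f1_macro'].mean(),
    }
    print(name, results[name])
```

Keeping the models in a dictionary makes it straightforward to repeat the same loop for each dataset.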

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

print('numpy:', np.__version__)
print('pandas:', pd.__version__)

import sklearn
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

### 1. Missing value imputation
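
A minimal sketch of mean imputation with `SimpleImputer`, assuming missing entries are already encoded as `np.nan`. The toy matrix `X_missing` is hypothetical. Some dataset files mark missing values with a placeholder such as `?`, in which case they would first have to be converted to `np.nan` (for example through the `na_values` argument of `pd.read_csv`).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix; np.nan marks the missing entries.
X_missing = np.array([[1.0, 2.0],
                      [np.nan, 4.0],
                      [3.0, np.nan],
                      [5.0, 6.0]])

# Replace each missing entry with the mean of the observed values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_imputed = imputer.fit_transform(X_missing)
print(X_imputed)
```

Other strategies supported by `SimpleImputer` are `'median'`, `'most_frequent'` and `'constant'`; which one is appropriate depends on the attribute type and should be justified per dataset.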

### 2. Accuracy and F1 score

### Wine Data Set
#### (http://archive.ics.uci.edu/ml/datasets/Wine)

#### Load data
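
As a portable alternative to a machine-specific file path, the sketch below loads the copy of the Wine data bundled with scikit-learn. This is an assumption about convenience, not the loading method used in this notebook; note that the bundled labels are 0, 1, 2, whereas the first column of `wine.data` uses 1, 2, 3. The names `X_alt` and `y_alt` are hypothetical.

```python
from sklearn.datasets import load_wine

# Bundled copy of the Wine data: no local file is needed.
wine = load_wine()
X_alt, y_alt = wine.data, wine.target

print(X_alt.shape, y_alt.shape)
print(wine.feature_names)
```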

In [None]:
# raw string, so that the backslashes in the Windows path are not treated as escape sequences
wine_data = pd.read_csv(r'D:\Projects\Facultate\II\II_Sem_2_10_REPOS\IDS\Lab6\Data\Wine\wine.data', header=None)
wine_data.head()

#### Interpret data

In [None]:
X_Wine = wine_data.values[:, 1:]
y_Wine = wine_data.values[:, 0] # the first column is the ground truth (class label)

#### Preprocess data

In [None]:
def print_ranges(X):
    for col_index in range(X.shape[1]):
        column = X[:, col_index]
        print(f'{np.min(column)} \t {np.max(column)}')
        
print_ranges(X_Wine)

#### Scale data
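
When the scaler is fitted on the full matrix before splitting, the minimum and maximum of the test samples influence the transformation applied to the training samples. One way to avoid this, sketched below under the assumption that the bundled `load_wine` data is an acceptable stand-in, is to put the scaler and the classifier in a `Pipeline`, so that the scaler is refitted on the training part of every fold.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)

# The scaler is fitted only on the training part of each fold.
pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
pipeline_scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')

print(pipeline_scores)
print(pipeline_scores.mean())
```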

In [None]:
def scale_data(m: np.ndarray) -> np.ndarray:
    # rescale every column to the [0, 1] range
    scaler = MinMaxScaler()
    return scaler.fit_transform(m)

In [None]:
X_Wine = scale_data(X_Wine)
print_ranges(X_Wine)

#### Split data

In [None]:
test_size = 1/3
random_state = 5
X_Wine_train, X_Wine_test, y_Wine_train, y_Wine_test = train_test_split(X_Wine, y_Wine, test_size=test_size, random_state=random_state)

### Multi-layer Perceptron classifier

#### Fit data

In [None]:
max_iters = 10000
model_MLPC_Wine = MLPClassifier(max_iter=max_iters)
model_MLPC_Wine.fit(X_Wine_train, y_Wine_train)

#### Predict

In [None]:
y_Wine_h_MLPC = model_MLPC_Wine.predict(X_Wine_test)

#### Accuracy

In [None]:
accuracy_MLPC_Wine = accuracy_score(y_Wine_test, y_Wine_h_MLPC)
print(f'Accuracy (MLPC): {accuracy_MLPC_Wine}')

#### Accuracy and F1 score using 5-fold cross validation
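
The assignment asks for mean scores on both the training folds and the test folds. A minimal sketch of one way to obtain them in a single call is `cross_validate` with `return_train_score=True`. The bundled `load_wine` data, the `random_state` value and the short metric names `accuracy` and `f1` are assumptions made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)

# Scaling happens inside the pipeline, so it is refitted for every fold.
model = make_pipeline(MinMaxScaler(),
                      MLPClassifier(max_iter=10000, random_state=0))

cv_results = cross_validate(model, X, y, cv=5,
                            scoring={'accuracy': 'accuracy', 'f1': 'f1_macro'},
                            return_train_score=True)

# Each entry holds one score per fold; report the mean over the 5 folds.
for key in ('train_accuracy', 'test_accuracy', 'train_f1', 'test_f1'):
    print(f'Mean {key}: {cv_results[key].mean():.3f}')
```

Here `train_*` scores are computed on the 4 folds used for fitting and `test_*` scores on the held-out fold, which differs from running a separate cross validation on a train split and on a test split.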

In [None]:
cv = 5
scoring_acc = 'accuracy'
scoring_f1 = 'f1_macro'
average_f1 = 'weighted'

In [None]:
#Accuracy (5-fold)
score_MLPC_Wine_accuracy = cross_val_score(model_MLPC_Wine, X_Wine, y_Wine, cv=cv, scoring=scoring_acc)
print(f'Accuracy (MLPC, 5-fold cv):\n {score_MLPC_Wine_accuracy}')
mean_score_MLPC_Wine_accuracy = score_MLPC_Wine_accuracy.mean()
print(f'Mean Accuracy (MLPC, 5-fold cv): {mean_score_MLPC_Wine_accuracy}')

In [None]:
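#F1 score (5-fold)
score_MLPC_Wine_f1 = cross_val_score(model_MLPC_Wine, X_Wine, y_Wine, cv=cv, scoring=scoring_f1)
print(f'F1 score (MLPC, 5-fold cv):\n {score_MLPC_Wine_f1}')
mean_score_MLPC_Wine_f1 = score_MLPC_Wine_f1.mean()
print(f'Mean F1 score (MLPC, 5-fold cv): {mean_score_MLPC_Wine_f1}')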

In [None]:
#Accuracy (5-fold) for training and test set
score_MLPC_Wine_accuracy_train = cross_val_score(model_MLPC_Wine, X_Wine_train, y_Wine_train, cv=cv, scoring=scoring_acc)
print(f'Accuracy (MLPC, 5-fold cv, train):\n {score_MLPC_Wine_accuracy_train}')

mean_score_MLPC_Wine_accuracy_train = score_MLPC_Wine_accuracy_train.mean()
print(f'Mean Accuracy (MLPC, 5-fold cv, train): {mean_score_MLPC_Wine_accuracy_train}')

print()

score_MLPC_Wine_accuracy_test = cross_val_score(model_MLPC_Wine, X_Wine_test, y_Wine_test, cv=cv, scoring=scoring_acc)
print(f'Accuracy (MLPC, 5-fold cv, test):\n {score_MLPC_Wine_accuracy_test}')

mean_score_MLPC_Wine_accuracy_test = score_MLPC_Wine_accuracy_test.mean()
print(f'Mean Accuracy (MLPC, 5-fold cv, test): {mean_score_MLPC_Wine_accuracy_test}')

In [None]:
#F1 score (5-fold) for training and test set
score_MLPC_Wine_f1_train = cross_val_score(model_MLPC_Wine, X_Wine_train, y_Wine_train, cv=cv, scoring=scoring_f1)
print(f'F1 score (MLPC, 5-fold cv, train):\n {score_MLPC_Wine_f1_train}')

mean_score_MLPC_Wine_f1_train = score_MLPC_Wine_f1_train.mean()
print(f'Mean F1 score (MLPC, 5-fold cv, train): {mean_score_MLPC_Wine_f1_train}')

print()

score_MLPC_Wine_f1_test = cross_val_score(model_MLPC_Wine, X_Wine_test, y_Wine_test, cv=cv, scoring=scoring_f1)
print(f'F1 score (MLPC, 5-fold cv, test):\n {score_MLPC_Wine_f1_test}')

mean_score_MLPC_Wine_f1_test = score_MLPC_Wine_f1_test.mean()
print(f'Mean F1 score (MLPC, 5-fold cv, test): {mean_score_MLPC_Wine_f1_test}')

#### Hyperparameter search

### KNN

#### Fit data

In [None]:
model_KNN_Wine = KNeighborsClassifier(n_neighbors=5)
model_KNN_Wine.fit(X_Wine_train, y_Wine_train)

#### Predict

In [None]:
y_Wine_h_KNN = model_KNN_Wine.predict(X_Wine_test)

#### Accuracy

In [None]:
accuracy_KNN_Wine = accuracy_score(y_Wine_test, y_Wine_h_KNN)
print(f'Accuracy (KNN): {accuracy_KNN_Wine}')

#### Accuracy and F1 score using 5-fold cross validation

In [None]:
cv = 5
scoring_acc = 'accuracy'
scoring_f1 = 'f1_macro'

In [None]:
#Accuracy (5-fold)
score_KNN_Wine_accuracy = cross_val_score(model_KNN_Wine, X_Wine, y_Wine, cv=cv, scoring=scoring_acc)
print(f'Accuracy (KNN, 5-fold cv):\n {score_KNN_Wine_accuracy}')
mean_score_KNN_Wine_accuracy = score_KNN_Wine_accuracy.mean()
print(f'Mean Accuracy (KNN, 5-fold cv): {mean_score_KNN_Wine_accuracy}')

In [None]:
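#F1 score (5-fold)
score_KNN_Wine_f1 = cross_val_score(model_KNN_Wine, X_Wine, y_Wine, cv=cv, scoring=scoring_f1)
print(f'F1 score (KNN, 5-fold cv):\n {score_KNN_Wine_f1}')
mean_score_KNN_Wine_f1 = score_KNN_Wine_f1.mean()
print(f'Mean F1 score (KNN, 5-fold cv): {mean_score_KNN_Wine_f1}')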

In [None]:
#Accuracy (5-fold) for training and test set
score_KNN_Wine_accuracy_train = cross_val_score(model_KNN_Wine, X_Wine_train, y_Wine_train, cv=cv, scoring=scoring_acc)
print(f'Accuracy (KNN, 5-fold cv, train):\n {score_KNN_Wine_accuracy_train}')

mean_score_KNN_Wine_accuracy_train = score_KNN_Wine_accuracy_train.mean()
print(f'Mean Accuracy (KNN, 5-fold cv, train): {mean_score_KNN_Wine_accuracy_train}')

print()

score_KNN_Wine_accuracy_test = cross_val_score(model_KNN_Wine, X_Wine_test, y_Wine_test, cv=cv, scoring=scoring_acc)
print(f'Accuracy (KNN, 5-fold cv, test):\n {score_KNN_Wine_accuracy_test}')

mean_score_KNN_Wine_accuracy_test = score_KNN_Wine_accuracy_test.mean()
print(f'Mean Accuracy (KNN, 5-fold cv, test): {mean_score_KNN_Wine_accuracy_test}')

In [None]:
#F1 score (5-fold) for training and test set
score_KNN_Wine_f1_train = cross_val_score(model_KNN_Wine, X_Wine_train, y_Wine_train, cv=cv, scoring=scoring_f1)
print(f'F1 score (KNN, 5-fold cv, train):\n {score_KNN_Wine_f1_train}')

mean_score_KNN_Wine_f1_train = score_KNN_Wine_f1_train.mean()
print(f'Mean F1 score (KNN, 5-fold cv, train): {mean_score_KNN_Wine_f1_train}')

print()

score_KNN_Wine_f1_test = cross_val_score(model_KNN_Wine, X_Wine_test, y_Wine_test, cv=cv, scoring=scoring_f1)
print(f'F1 score (KNN, 5-fold cv, test):\n {score_KNN_Wine_f1_test}')

mean_score_KNN_Wine_f1_test = score_KNN_Wine_f1_test.mean()
print(f'Mean F1 score (KNN, 5-fold cv, test): {mean_score_KNN_Wine_f1_test}')

#### Hyperparameter search

In [None]:
cv = 5
cv_4 = 4
scoring_acc = 'accuracy'
scoring_f1 = 'f1_micro'
average_f1 = 'micro'

In [None]:
parameter_grid = {'n_neighbors': list(range(1, 10)), 'p': [1, 2, 3, 4.7]}

def hyper_KNN(X_train, y_train, X_test, y_test, X, y) -> None:
    
    """
    :param X_train, y_train: training split used to refit the grid search
    :param X_test, y_test: hold-out split used for the final scores
    :param X: full input matrix
    :param y: ground truth
    """
    
    print("Nested cross validation: 5 outer folds, 4 inner folds")
    
    # initialize the classification model
    model = KNeighborsClassifier()
    
    # initialize GridSearch for accuracy (inner loop: 4-fold cross validation)
    grid_search_acc = GridSearchCV(estimator = model,
                                   param_grid = parameter_grid, 
                                   scoring = scoring_acc, 
                                   cv = cv_4,
                                   return_train_score = True)   
    
    # each of the 5 outer runs performs its own 4-fold hyperparameter search
    scores_acc = cross_val_score(grid_search_acc, X, y, cv = cv)
    
    print('Accuracy:')
    
    print(f'Scores for the 5 outer folds:\n {scores_acc}')
    print(f'Mean score over the 5 outer folds: {scores_acc.mean()}')
    
    grid_search_acc.fit(X_train, y_train)
    y_estimated_acc = grid_search_acc.predict(X_test)
        
    acc_score = accuracy_score(y_test, y_estimated_acc)
    print(f'accuracy_score(y_test, y_estimated_acc): {acc_score}') 
    
    best_params = grid_search_acc.best_params_
    print(f'Best parameters (searched on X_train): {best_params}')
    
    # initialize GridSearch for the F1 score (inner loop: 4-fold cross validation)
    grid_search_f1 = GridSearchCV(estimator = model,
                                  param_grid = parameter_grid,
                                  scoring = scoring_f1,
                                  cv = cv_4,
                                  return_train_score = True)
    
    # each of the 5 outer runs performs its own 4-fold hyperparameter search
    scores_f1 = cross_val_score(grid_search_f1, X, y, cv = cv)  
    
    print('F1 score:')
    
    print(f'Scores for the 5 outer folds:\n {scores_f1}')
    print(f'Mean score over the 5 outer folds: {scores_f1.mean()}')
    
    grid_search_f1.fit(X_train, y_train)
    y_estimated_f1 = grid_search_f1.predict(X_test)
        
    f1_sc = f1_score(y_test, y_estimated_f1, average=average_f1)
    print(f'f1_score(y_test, y_estimated_f1): {f1_sc}') 
    
    best_params = grid_search_f1.best_params_
    print(f'Best parameters (searched on X_train): {best_params}')

In [None]:
hyper_KNN(X_Wine_train, y_Wine_train, X_Wine_test, y_Wine_test, X_Wine, y_Wine)

### 3. Documentation

### 4. Hyperparameter search
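
A minimal sketch of the nested scheme described in the assignment: 5 outer folds for reporting performance, and within each outer run a 4-fold grid search for the hyperparameters. It assumes the bundled `load_wine` data and a small illustrative KNN grid; `return_estimator=True` is used so that the best hyperparameters of each outer run can be inspected, since they may differ from run to run.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([('scale', MinMaxScaler()),
                     ('knn', KNeighborsClassifier())])
# Illustrative grid; parameter names are prefixed with the pipeline step name.
param_grid = {'knn__n_neighbors': [1, 3, 5, 7, 9], 'knn__p': [1, 2]}

inner_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop: 4-fold grid search, refitted on each outer training part.
search = GridSearchCV(pipeline, param_grid, scoring='accuracy', cv=inner_cv)

# Outer loop: 5 runs, each scored on data the search never saw.
nested = cross_validate(search, X, y, cv=outer_cv, scoring='accuracy',
                        return_estimator=True)

for fold, (score, fitted) in enumerate(
        zip(nested['test_score'], nested['estimator']), start=1):
    print(f'Run {fold}: accuracy={score:.3f}, best={fitted.best_params_}')
print(f"Mean accuracy over the 5 runs: {nested['test_score'].mean():.3f}")
```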