# Lab 6

Version 2020-04-01

## Classification models

Use 4 data sets for classification problems, starting from the repositories listed in Lecture 5; for example, [ics.uci.edu](http://archive.ics.uci.edu/ml/datasets.php?format=mat&task=cla&att=&area=&numAtt=&numIns=&type=mvar&sort=nameUp&view=table). At least two of the data sets must contain missing values.


1. (20 points) Apply a missing value imputation method where needed; justify and document the method used.
1. (number of models \* number of data sets \* 1 point = 20 points) For each data set, apply 5 classification models from scikit-learn. For each model, report the accuracy and the F1 score (see [sklearn.metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)) using 5-fold cross-validation. Report the mean results for both the training folds and the test folds. Run the models with fixed hyperparameter values.
1. (number of models \* 4 points = 20 points) Document each model used in the Jupyter notebook, in Romanian. If the same algorithm is used for several data sets, you may write a separate section documenting the algorithms and refer to it from each data set.
1. (number of models \* number of data sets \* 1 point = 20 points) Report the performance of each model using 5-fold cross-validation. For each of the 5 runs, search for the optimal hyperparameters using 4-fold cross-validation. Report the model's performance as the mean over the 5 runs.
    *Note:* in each of the 5 runs the optimal hyperparameters may differ, because of the data used for training/validation.

20 points are awarded by default.

Examples of classification models:
1. [Multi-layer Perceptron classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)
1. [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
1. [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
1. <strike>[Gaussian processes](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html#sklearn.gaussian_process.GaussianProcessClassifier)</strike>
1. <strike>[RBF](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.RBF.html#sklearn.gaussian_process.kernels.RBF)</strike>
1. [Decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)
1. [Random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
1. <strike>[Gaussian Naive bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)</strike> 

In [None]:
import numpy as np
import pandas as pd

print('numpy:', np.__version__)
print('pandas:', pd.__version__)

import sklearn
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

### 3. Documentation

### Multi-layer Perceptron Classifier (MLPC)

### K-Neighbours Classifier (KNN)

### Support Vector Classifier (SVC)

### Decision Tree (DT)

### Random Forest (RF)

### Model testing

In [None]:
run_model_testing = False

#### Accuracy on prediction

In [None]:
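def accuracy_test(name: str, model, y_test, y_predicted) -> None:
    # Accuracy of already-computed predictions on the held-out test split.
    # `model` is unused; it is kept so that existing calls keep working.
    test_accuracy = accuracy_score(y_test, y_predicted)
    print(f'Accuracy ({name}): {test_accuracy}')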

#### Accuracy (5-fold)

In [None]:
def accuracy_5fold(name:str, model, X, y) -> None:
    
    scoring_acc = 'accuracy'
    cv = 5
    
    score_accuracy = cross_val_score(model, X, y, cv=cv, scoring=scoring_acc)
    print(f'Accuracy ({name}, {cv}-fold cv):\n {score_accuracy}')
    mean_score_accuracy = score_accuracy.mean()
    print(f'Mean Accuracy ({name}, {cv}-fold cv): {mean_score_accuracy}')
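
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling. The sketch below passes an explicit splitter instead, so that the folds are shuffled and reproducible. The synthetic data and the `*_skf` names are assumptions made for illustration; they are not part of the lab data sets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not one of the lab data sets).
X_skf, y_skf = make_classification(n_samples=200, n_features=8, random_state=5)

# Explicit splitter: stratified, shuffled, and reproducible via random_state.
splitter_skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)

scores_skf = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                             X_skf, y_skf,
                             cv=splitter_skf, scoring='accuracy')
print(f'Fold accuracies: {scores_skf}')
print(f'Mean accuracy: {scores_skf.mean()}')
```

The same splitter object can be passed as `cv` wherever this notebook passes the integer 5.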

#### F1 score (5-fold)

In [None]:
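def f1_5fold(name: str, model, y_test, y_predicted) -> None:
    # Despite its name, this function does not cross-validate: it scores
    # the predictions already made on the held-out test split.
    average = 'weighted'

    score_f1 = f1_score(y_test, y_predicted, average=average)
    print(f'F1 score ({name}, {average}, held-out test split):\n {score_f1}')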
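
This notebook uses two F1 averaging modes: `'weighted'` here and `'f1_macro'` in the score-measuring cell. The hand-made example below (hypothetical labels, chosen so the arithmetic is easy to check) illustrates how the two differ when classes are imbalanced.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: four samples of class 0, two of class 1.
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 0]

# Per-class F1: class 0 -> 0.75 (P = R = 3/4), class 1 -> 0.5 (P = R = 1/2).
f1_macro_demo = f1_score(y_true_demo, y_pred_demo, average='macro')
f1_weighted_demo = f1_score(y_true_demo, y_pred_demo, average='weighted')

# Macro: unweighted mean of the per-class scores, (0.75 + 0.5) / 2 = 0.625.
print(f'macro:    {f1_macro_demo}')
# Weighted: mean weighted by class support, (4 * 0.75 + 2 * 0.5) / 6 = 0.666...
print(f'weighted: {f1_weighted_demo}')
```

Macro averaging treats every class equally, so a poorly predicted minority class pulls it down more than it pulls down the weighted average.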

#### Score measuring

In [None]:
def score(name: str, scoring, model, X_train, y_train, X_test, y_test) -> None:
    # Cross-validates the model twice: once within the training split and
    # once within the test split. Every reported value is a held-out fold score.
    cv = 5

    scores_train = cross_val_score(model, X_train, y_train, cv=cv, scoring=scoring)
    print(f'Train {scoring} ({name}, {cv}-fold cv):\n {scores_train}')

    mean_score_train = scores_train.mean()
    print(f'Train Mean {scoring} ({name}, {cv}-fold cv): {mean_score_train}')

    print()

    scores_test = cross_val_score(model, X_test, y_test, cv=cv, scoring=scoring)
    print(f'Test {scoring} ({name}, {cv}-fold cv):\n {scores_test}')

    mean_score_test = scores_test.mean()
    print(f'Test Mean {scoring} ({name}, {cv}-fold cv): {mean_score_test}')
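
The assignment asks for mean results on both the training folds and the test folds. One possible way to get both from a single 5-fold run is `cross_validate` with `return_train_score=True`, sketched below. This is an alternative to the function above, not the notebook's method; the synthetic data and the `*_cv` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in data (assumption: not a lab data set).
X_cv, y_cv = make_classification(n_samples=300, n_features=10,
                                 n_informative=5, n_classes=3,
                                 random_state=5)

# One 5-fold run that scores each model on its own training folds
# and on the fold that was held out.
results_cv = cross_validate(DecisionTreeClassifier(max_depth=4, random_state=5),
                            X_cv, y_cv, cv=5,
                            scoring=['accuracy', 'f1_macro'],
                            return_train_score=True)

for metric in ('accuracy', 'f1_macro'):
    train_mean = results_cv[f'train_{metric}'].mean()
    test_mean = results_cv[f'test_{metric}'].mean()
    print(f'{metric}: train mean = {train_mean}, test mean = {test_mean}')
```

A large gap between the training-fold mean and the test-fold mean is a common sign of overfitting.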

#### Hyperparameter search

In [None]:
def hyper(model, parameter_grid, X, y, X_train, y_train, X_test, y_test) -> None:
    
    """
    :param X: input matrix
    :param y: ground truth
    """
    
    outer_cv = 5  # outer loop: performance estimate
    inner_cv = 4  # inner loop: hyperparameter search
    scoring = 'accuracy'
    n_jobs = 3
    
    # initialise GridSearch for accuracy
    grid_search = GridSearchCV(estimator = model,
                               param_grid = parameter_grid,
                               scoring = scoring,
                               cv = inner_cv,
                               return_train_score = True,
                               n_jobs = n_jobs)
    
    # each of the 5 outer runs performs a 4-fold cross-validated grid search
    scores = cross_val_score(grid_search, X, y, cv = outer_cv, n_jobs = n_jobs)

    print('Accuracy:')
    
    print(f'Scores for {outer_cv}-fold cross validation:\n {scores}')
    print(f'Mean score for {outer_cv}-fold cross validation: {scores.mean()}')
    
    # separate search on the training split, evaluated on the test split
    grid_search.fit(X_train, y_train)

    best_params = grid_search.best_params_
    print(f'Best parameters (fitted on X_train): {best_params}')
        
    y_estimated = grid_search.predict(X_test)   
    acc_score = accuracy_score(y_test, y_estimated)
    print(f'accuracy_score(y_test, y_estimated): {acc_score}') 
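
Because each outer run searches on different training data, the selected hyperparameters can differ from fold to fold. The sketch below shows one way to inspect them, using `cross_validate` with `return_estimator=True`. The synthetic data, the small KNN grid, and the `*_nested` names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not one of the lab data sets).
X_nested, y_nested = make_classification(n_samples=200, n_features=8,
                                         random_state=5)

# Inner loop: 4-fold grid search over a small, illustrative grid.
inner_search = GridSearchCV(KNeighborsClassifier(),
                            param_grid={'n_neighbors': [1, 3, 5, 7]},
                            scoring='accuracy', cv=4)

# Outer loop: 5 folds; keep the fitted search object of every fold.
outer_nested = cross_validate(inner_search, X_nested, y_nested, cv=5,
                              scoring='accuracy', return_estimator=True)

folds = zip(outer_nested['estimator'], outer_nested['test_score'])
for fold, (fitted_search, fold_score) in enumerate(folds, start=1):
    print(f'Fold {fold}: best = {fitted_search.best_params_}, '
          f'accuracy = {fold_score}')
print(f"Mean outer accuracy: {outer_nested['test_score'].mean()}")
```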

#### Helper function

In [None]:
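def print_separator(ch: str) -> None:
    # Prints a line of 69 copies of `ch`.
    print(ch * 69)


def print_separator_info(name: str, ch: str) -> None:
    # Prints `name` centred in a 69-character line padded with `ch`.
    l = len(name)
    r = 69 - l
    if l % 2 == 0:
        # r is odd here: the right side gets the extra padding character
        half = int(r/2) + 1
        print(f'{ch * (half - 1)}{name}{ch * half}')
    else:
        half = int(r/2)
        print(f'{ch * half}{name}{ch * half}')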

#### Apply model

In [None]:
def apply_model(name: str, model, parameter_grid, X, y, X_train, y_train, X_test, y_test) -> None:
    
    #Fit
    model.fit(X_train, y_train)

    #Predict
    y_predicted = model.predict(X_test)
    
    #Accuracy after prediction
    print('Accuracy on test set after prediction:')
    accuracy_test(name, model, y_test, y_predicted)
    print_separator('-')   

    #Accuracy score (5-fold)
    print('Accuracy score (5-fold):')
    accuracy_5fold(name, model, X, y)
    print_separator('-')   

    #F1 score (weighted) on the held-out test split
    print('F1 score (weighted, held-out test split):')
    f1_5fold(name, model, y_test, y_predicted)
    print_separator('-')

    #Accuracy (5-fold) for training and test set
    print('Accuracy (5-fold) for training and test set:')
    score(name, 'accuracy', model, X_train, y_train, X_test, y_test)
    print_separator('-')

    #F1 score (5-fold) for training and test set
    print('F1 score (5-fold) for training and test set:')
    score(name, 'f1_macro', model, X_train, y_train, X_test, y_test)
    print_separator('-')

    #Hyperparameter search
    print('Hyperparameter search:')
    hyper(model, parameter_grid, X, y,
          X_train, y_train,
          X_test, y_test)
    print()

#### Run models on data set

In [None]:
def run_models(X, y, X_train, y_train, X_test, y_test) -> None:
    
    print_separator_info('MLPC', '_')
    #Multi-layer Perceptron classifier
    max_iters = 10000
    model_MLPC = MLPClassifier(max_iter=max_iters)
    #apply Multi-layer Perceptron classifier
    parameter_grid_MLPC = {'alpha' : [0.0001, 0.001, 0.01, 0.0005, 0.005, 0.05]}
    apply_model('MLPC', model_MLPC, parameter_grid_MLPC, X, y, X_train, y_train, X_test, y_test)
    print()
    
    print_separator_info('KNN', '_')
    #KNN
    n_neighbours = 5
    model_KNN = KNeighborsClassifier(n_neighbors=n_neighbours)
    #apply KNN
    parameter_grid_KNN = {'n_neighbors': range(1, 10), 'p': [1, 2, 3, 4.7]}
    apply_model('KNN', model_KNN, parameter_grid_KNN, X, y, X_train, y_train, X_test, y_test)
    print()
    
    print_separator_info('SVC', '_')
    #SVC
    gamma = 'scale'
    kernel = 'rbf'
    model_SVC = SVC(kernel=kernel, gamma=gamma)
    #apply SVC
    parameter_grid_SVC = {'C': [0.1, 0.2, 0.3, 1, 10, 100], 'gamma': [0.01, 0.1, 0.03, 0.3, 0.5, 1]}
    apply_model('SVC', model_SVC, parameter_grid_SVC, X, y, X_train, y_train, X_test, y_test)
    print()
    
    print_separator_info('Decision Tree', '_')
    #Decision Tree
    criterion = 'gini'
    model_DT = DecisionTreeClassifier(criterion=criterion)
    #apply Decision Tree
    parameter_grid_DT = {'max_depth': range(1, 10), 'min_samples_split': range(2, 40), 'min_impurity_decrease':[0.01, 0.02, 0.03, 0.05, 0.1]}
    apply_model('Decision Tree', model_DT, parameter_grid_DT, X, y, X_train, y_train, X_test, y_test)
    print()
    
    print_separator_info('Random Forest', '_')
    #Random Forest
    criterion = 'gini'
    model_RF = RandomForestClassifier(criterion=criterion)
    #apply Random Forest
    parameter_grid_RF = {'max_depth': range(1, 10), 'min_samples_split': range(2, 5)}
    apply_model('Random Forest', model_RF, parameter_grid_RF, X, y, X_train, y_train, X_test, y_test)

### Data

#### Preprocess data

In [None]:
def print_ranges(X):
    """
    Prints the minimum and maximum of each column.
    """
    
    for col_index in range(X.shape[1]):
        column = X[:, col_index]
        print(f'{np.min(column)} \t {np.max(column)}')

#### Missing value imputer

In [None]:
def fill_missing_values(data, columns) -> pd.DataFrame:
    """
    Uses a KNNImputer to fill in the missing values in the data set.
    """
    
    n_neighbors = 2
    weights = 'uniform'
    imputer = KNNImputer(n_neighbors = n_neighbors, weights = weights)
    f_filled = imputer.fit_transform(data)
    df = pd.DataFrame(f_filled, columns = columns)
    return df
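
To make the imputation step concrete, here is a small worked example with the same settings as above (`n_neighbors = 2`, uniform weights). The toy matrix and the `*_demo` names are assumptions for illustration. Each missing entry is replaced by the mean of that column over the two nearest rows that have a value there.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing entries (illustrative values only).
toy_demo = np.array([[1.0, 2.0, np.nan],
                     [3.0, 4.0, 3.0],
                     [np.nan, 6.0, 5.0],
                     [8.0, 8.0, 7.0]])

imputer_demo = KNNImputer(n_neighbors=2, weights='uniform')
filled_demo = imputer_demo.fit_transform(toy_demo)
print(filled_demo)
# Row 0, column 2: its two nearest rows hold 3 and 5, so the value becomes 4.
# Row 2, column 0: its two nearest rows hold 3 and 8, so the value becomes 5.5.
```

Distances are computed only over the coordinates that both rows have, which is why rows with missing entries can still act as neighbours.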

#### Scale data

In [None]:
def scale_data(mat: np.ndarray) -> np.ndarray:
    """
    Scales the values of the input matrix to the range [0, 1], column by column.
    """

    scaler = MinMaxScaler()
    return scaler.fit_transform(mat)
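
`scale_data` fits the scaler on the whole matrix, so when it is called before splitting, the minimum and maximum of the test rows influence the scaling. A possible alternative, sketched below under assumptions, is to chain the imputer, the scaler and the classifier in a pipeline so that each preprocessing step is fitted only on the training folds. The synthetic data, the artificially removed values, and the `*_pipe` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not one of the lab data sets).
X_pipe, y_pipe = make_classification(n_samples=200, n_features=6,
                                     random_state=5)

# Remove roughly 5% of the values to simulate missing data.
rng_pipe = np.random.default_rng(5)
X_pipe[rng_pipe.random(X_pipe.shape) < 0.05] = np.nan

# Imputer and scaler are re-fitted on the training part of every fold.
pipeline_demo = make_pipeline(KNNImputer(n_neighbors=2),
                              MinMaxScaler(),
                              SVC(kernel='rbf', gamma='scale'))

scores_pipe = cross_val_score(pipeline_demo, X_pipe, y_pipe,
                              cv=5, scoring='accuracy')
print(f'Fold accuracies: {scores_pipe}')
print(f'Mean accuracy: {scores_pipe.mean()}')
```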

#### Augment data

In [None]:
def design_matrix(mat: np.ndarray) -> np.ndarray:
    """
    Prepends a column of ones to the input matrix.
    """

    n_rows = mat.shape[0]
    ones = np.ones((n_rows, 1))
    result = np.concatenate((ones, mat), axis=1)

    return result

### Satellite Image Data Set
#### (https://sci2s.ugr.es/keel/dataset.php?cod=71)

#### Load data

In [None]:
columns_Sat = ['Sp11', 'Sp12', 'Sp13', 'Sp14', 'Sp15', 'Sp16', 'Sp17', 'Sp18', 'Sp19', 
               'Sp21', 'Sp22', 'Sp23', 'Sp24', 'Sp25', 'Sp26', 'Sp27', 'Sp28', 'Sp29', 
               'Sp31', 'Sp32', 'Sp33', 'Sp34', 'Sp35', 'Sp36', 'Sp37', 'Sp38', 'Sp39', 
               'Sp41', 'Sp42', 'Sp43', 'Sp44', 'Sp45', 'Sp46', 'Sp47', 'Sp48', 'Sp49', 
               'Class']
path_Sat = r'./Data/SatImage/satimage.csv'
sat_dataframe = pd.read_csv(path_Sat, header=None, names=columns_Sat)
sat_dataframe.head()

#### Missing value imputation

In [None]:
sat_data = fill_missing_values(sat_dataframe, columns_Sat)
sat_data.head()

#### Interpret and preprocess data

In [None]:
X_Sat = sat_data.values[:, :-1]
y_Sat = sat_data.values[:, -1] # the last column is the ground truth

In [None]:
print_ranges(X_Sat)

#### Split data

In [None]:
test_size = 1/3
random_state = 5

X_Sat_train, X_Sat_test, y_Sat_train, y_Sat_test = \
train_test_split(X_Sat, y_Sat, 
                 test_size=test_size, 
                 random_state=random_state)

#### Augment matrix

In [None]:
X_Sat_train = design_matrix(X_Sat_train)
print(X_Sat_train)
X_Sat_test = design_matrix(X_Sat_test)
print(X_Sat_test)

#### Run models

In [None]:
if run_model_testing:
    run_models(X_Sat, y_Sat, 
               X_Sat_train, y_Sat_train, 
               X_Sat_test, y_Sat_test)

### Australian Credit Approval Data Set
#### (https://sci2s.ugr.es/keel/dataset.php?cod=71)

#### Load data

In [None]:
names_Aus = ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 
             'A11', 'A12', 'A13', 'A14', 
             'Class']
path_Aus = r'./Data/SatImage/australian.csv'
aus_dataframe = pd.read_csv(path_Aus, header=None, names=names_Aus)
aus_dataframe.head()

#### Missing value imputation

In [None]:
aus_data = fill_missing_values(aus_dataframe, names_Aus)
aus_data.head()

#### Interpret and preprocess data

In [None]:
X_Aus = aus_data.values[:, :-1]
y_Aus = aus_data.values[:, -1] # the last column is the ground truth

In [None]:
print_ranges(X_Aus)

#### Scale data

In [None]:
X_Aus = scale_data(X_Aus)
print_ranges(X_Aus)

#### Split data

In [None]:
test_size = 1/3
random_state = 5

X_Aus_train, X_Aus_test, y_Aus_train, y_Aus_test = \
train_test_split(X_Aus, y_Aus, 
                 test_size=test_size, 
                 random_state=random_state)

#### Augment matrix

In [None]:
X_Aus_train = design_matrix(X_Aus_train)
print(X_Aus_train[:5, :5])
X_Aus_test = design_matrix(X_Aus_test)
print(X_Aus_test[:5, :5])

#### Run models

In [None]:
if run_model_testing:
    run_models(X_Aus, y_Aus, 
               X_Aus_train, y_Aus_train, 
               X_Aus_test, y_Aus_test)

### Wine Data Set
#### (http://archive.ics.uci.edu/ml/datasets/Wine)

#### Load data

In [None]:
path_Wine = r'./Data/Wine/wine.data'
wine_data = pd.read_csv(path_Wine, header=None)
wine_data.head()

#### Interpret and preprocess data

In [None]:
X_Wine = wine_data.values[:, 1:]
y_Wine = wine_data.values[:, 0] # the first column is the ground truth

#### Scale data

In [None]:
print_ranges(X_Wine)
X_Wine = scale_data(X_Wine)

In [None]:
print_ranges(X_Wine)

#### Split data

In [None]:
test_size = 1/3
random_state = 5

X_Wine_train, X_Wine_test, y_Wine_train, y_Wine_test = \
train_test_split(X_Wine, y_Wine, 
                 test_size=test_size, 
                 random_state=random_state)

#### Augment matrix

In [None]:
X_Wine_train = design_matrix(X_Wine_train)
print(X_Wine_train[:5, :5])
X_Wine_test = design_matrix(X_Wine_test)
print(X_Wine_test[:5, :5])

#### Run models

In [None]:
if run_model_testing:
    run_models(X_Wine, y_Wine, 
               X_Wine_train, y_Wine_train, 
               X_Wine_test, y_Wine_test)

### Semeion Data

#### Load data

In [None]:
path_Semeion = r'./Data/Semeion/semeion.data'
semeion_data = pd.read_csv(path_Semeion, sep=r'\s+', header=None)
semeion_data.head()

#### Interpret and preprocess data

In [None]:
X_Semeion = semeion_data.values[:, :256]
y_Semeion = semeion_data.values[:, 256:] # the last ten columns are the ground truth

y_Semeion = np.argmax(y_Semeion, axis=1)
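
The `np.argmax` call above turns the ten indicator columns into a single integer label per row. A minimal illustration, using a hypothetical three-row array rather than the Semeion file:

```python
import numpy as np

# Hypothetical one-hot rows encoding the digits 0, 3 and 9.
one_hot_demo = np.array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                         [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])

# The position of the single 1 in each row is the class label.
labels_demo = np.argmax(one_hot_demo, axis=1)
print(labels_demo)  # [0 3 9]
```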

#### Split data

In [None]:
test_size = 1/3
random_state = 5

X_Semeion_train, X_Semeion_test, y_Semeion_train, y_Semeion_test = \
train_test_split(X_Semeion, y_Semeion, 
                 test_size=test_size, 
                 random_state=random_state)

#### Augment matrix

In [None]:
X_Semeion_train = design_matrix(X_Semeion_train)
print(X_Semeion_train)
X_Semeion_test = design_matrix(X_Semeion_test)
print(X_Semeion_test)

#### Run models on data set

In [None]:
if run_model_testing:
    run_models(X_Semeion, y_Semeion, 
               X_Semeion_train, y_Semeion_train, 
               X_Semeion_test, y_Semeion_test)