# Laborator 6

Versiunea 2020-04-01

## Modele de clasificare

Folositi 4 seturi de date pentru probleme de clasificare, plecand de la repository-urile specificate in Cursul 5; de exemplu, [ics.uci.edu](http://archive.ics.uci.edu/ml/datasets.php?format=mat&task=cla&att=&area=&numAtt=&numIns=&type=mvar&sort=nameUp&view=table). Cel putin doua seturi de date sa fie cu valori lipsa. 


1. (20 puncte) Aplicati o metoda de missing value imputation, unde este cazul; justificati si documentati metoda folosita.
1. (numar de modele * numar de seturi de date \* 1 punct = 20 de puncte) Pentru fiecare set de date aplicati 5 modele de clasificare din scikit learn. Pentru fiecare raportati: acuratete, scorul F1 - a se vedea [sklearn.metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) - folosind 5 fold cross validation. Raportati mediile rezultatelor atat pentru fold-urile de antrenare, cat si pentru cele de testare. Rularile se vor face cu valori fixate ale hiperparametrilor. 
1. (numar modele * 4 puncte = 20 puncte) Documentati in jupyter notebook fiecare din modelele folosite, in limba romana. Daca acelasi algoritm e folosit pentru mai multe seturi de date, puteti face o sectiune separata cu documentarea algoritmilor + trimitere la algoritm. 
1. (numar de modele * numar de seturi de date * 1 punct = 20 de puncte) Raportati performanta fiecarui model, folosind 5 fold cross validation. Pentru fiecare din cele 5 rulari, cautati hiperparametrii optimi folosind 4-fold cross validation. Performanta modelului va fi raportata ca medie a celor  5 rulari. 
    *Observatie:* la fiecare din cele 5 rulari, hiperparametrii optimi pot diferi, din cauza datelor utilizate pentru antrenare/validare. 

Se acorda 20 de puncte din oficiu. 

Exemple de modele de clasificare:
1. [Multi-layer Perceptron classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)
1. [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
1. [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
1. <strike>[Gaussian processes](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html#sklearn.gaussian_process.GaussianProcessClassifier)</strike>
1. <strike>[RBF](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.RBF.html#sklearn.gaussian_process.kernels.RBF)</strike>
1. [Decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)
1. [Random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
1. <strike>[Gaussian Naive bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)</strike> 

In [1]:
import numpy as np
import pandas as pd

print ('numpy:', np.__version__)
print ('pandas:', pd.__version__)

import sklearn
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process.kernels import ConstantKernel as C
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

numpy: 1.18.1
pandas: 1.0.1


### 1. Missing value imputation

### 2. Accuracy and F1 score

### 3. Documentation

### 4. Hyperparameter search

### Model

#### Accuracy on prediction

In [2]:
def accuracy_test(name:str, model, y_test, y_predicted) -> None:
    accuracy_test = accuracy_score(y_test, y_predicted)
    print(f'Accuracy ({name}): {accuracy_test}')

#### Accuracy (5-fold)

In [3]:
def accuracy_5fold(name:str, model, X, y) -> None:
    
    scoring_acc = 'accuracy'
    cv = 5
    
    score_accuracy = cross_val_score(model, X, y, cv=cv, scoring=scoring_acc)
    print(f'Accuracy ({name}, {cv}-fold cv):\n {score_accuracy}')
    mean_score_accuracy = score_accuracy.mean()
    print(f'Mean Accuracy ({name}, {cv}-fold cv): {mean_score_accuracy}')

#### F1 score (5-fold)

In [4]:
def f1_5fold(name:str, model, y_test, y_predicted) -> None:
    
    average = 'weighted'
    cv = 5
    
    score_f1 = f1_score(y_test, y_predicted, average=average)
    print(f'F1 score ({name}, {cv}-fold cv):\n {score_f1}')

#### Score measuring

In [5]:
def score(name:str, scoring, model, X_train, y_train, X_test, y_test) -> None:
    
    cv = 5
    
    score_accuracy_train = cross_val_score(model, X_train, y_train, cv=cv, scoring=scoring)
    print(f'Train {scoring} ({name}, {cv}-fold cv):\n {score_accuracy_train}')

    mean_score_accuracy_train = score_accuracy_train.mean()
    print(f'Train Mean {scoring} ({name}, {cv}-fold cv): {mean_score_accuracy_train}')

    print()

    score_accuracy_test = cross_val_score(model, X_test, y_test, cv=cv, scoring=scoring)
    print(f'Test {scoring} ({name}, {cv}-fold cv):\n {score_accuracy_test}')

    mean_score_accuracy_test = score_accuracy_test.mean()
    print(f'Test Mean {scoring} ({name}, {cv}-fold cv): {mean_score_accuracy_test}')

#### Hyperparameter search

In [6]:
def hyper(model, parameter_grid, X, y, X_train, y_train, X_test, y_test) -> None:
    
    """
    :param X: matricea de intrare
    :param y: ground truth
    """
    
    cv = 5
    nest_cv = 4
    scoring = 'accuracy'
    n_jobs = 3
    
    #initializam GridSearch pentru Accuracy
    grid_search = GridSearchCV(estimator = model,
                               param_grid = parameter_grid,
                               scoring = scoring,
                               cv = cv,
                               return_train_score = True,
                               n_jobs = n_jobs)
    
    #la fiecare rulare din 5-fold se face un 4-fold cross validation
    scores = cross_val_score(grid_search, X, y, cv = nest_cv, n_jobs = n_jobs)

    print('Accuracy:')
    
    print(f'Scores for {nest_cv}-fold cross validation:\n {scores}')
    print(f'Mean score for {nest_cv}-fold cross validation: {scores.mean()}')
    
    grid_search.fit(X_train, y_train)

    best_params = grid_search.best_params_
    print(f'Best parameters (X_test): {best_params}')
        
    y_estimated = grid_search.predict(X_test)   
    acc_score = accuracy_score(y_test, y_estimated)
    print(f'accuracy_score(y_test, y_estimated): {acc_score}') 

#### Helper function

In [7]:
def print_separator(ch: str) -> None:
    print(ch * 69)
def print_separator_info(name:str, ch: str) -> None:
    l = len(name)
    r = 69 - l
    if l % 2 == 0:
        half = int(r/2) + 1
        print(f'{ch * (half - 1)}{name}{ch * half}')    
    else:
        half = int(r/2)
        print(f'{ch * half}{name}{ch * half}')    

#### Apply model

In [8]:
def apply_model(name: str, model, parameter_grid, X, y, X_train, y_train, X_test, y_test) -> None:
    
    #Fit
    model.fit(X_train, y_train)

    #Predict
    y_predicted = model.predict(X_test)
    
    #Accuracy after prediction
    print('Accuracy on test set after prediction:')
    accuracy_test(name, model, y_test, y_predicted)
    print_separator('-')   

    #Accuracy score (5-fold)
    print('Accuracy score (5-fold):')
    accuracy_5fold(name, model, X, y)
    print_separator('-')   

    #F1 score (5-fold)
    print('F1 score (5-fold):')
    f1_5fold(name, model, y_test, y_predicted)
    print_separator('-')

    #Accuracy (5-fold) for training and test set
    print('Accuracy (5-fold) for training and test set:')
    score(name, 'accuracy', model, X_train, y_train, X_test, y_test)
    print_separator('-')

    #F1 score (5-fold) for training and test set
    print('F1 score (5-fold) for training and test set:')
    score(name, 'f1_macro', model, X_train, y_train, X_test, y_test)
    print_separator('-')

    #Hyperparameter search
    print('Hyperparameter search:')
    hyper(model, parameter_grid, X, y,
          X_train, y_train,
          X_test, y_test)
    print()

#### Run models on data set

In [9]:
def run_models(X, y, X_train, y_train, X_test, y_test) -> None:
    
    print_separator_info('MLPC', '_')
    #Multi-layer Perceptron classifier
    max_iters = 10000
    model_MLPC = MLPClassifier(max_iter=max_iters)
    #apply Multi-layer Perceptron classifier
    parameter_grid_MLPC = {'alpha' : [0.0001, 0.001, 0.01, 0.0005, 0.005, 0.05]}
    apply_model('MLPC', model_MLPC, parameter_grid_MLPC, X, y, X_train, y_train, X_test, y_test)
    print()
    
    print_separator_info('KNN', '_')
    #KNN
    n_neighbours = 5
    model_KNN = KNeighborsClassifier(n_neighbors=n_neighbours)
    #apply KNN
    parameter_grid_KNN = {'n_neighbors': range(1, 10), 'p': [1, 2, 3, 4.7]}
    apply_model('KNN', model_KNN, parameter_grid_KNN, X, y, X_train, y_train, X_test, y_test)
    print()
    
    print_separator_info('SVC', '_')
    #SVC
    gamma = 'scale'
    kernel = 'rbf'
    model_SVC = SVC(kernel=kernel, gamma=gamma)
    #apply SVC
    parameter_grid_SVC = {'C': [0.1, 0.2, 0.3, 1, 10, 100], 'gamma': [0.01, 0.1, 0.03, 0.3, 0.5, 1]}
    apply_model('SVC', model_SVC, parameter_grid_SVC, X, y, X_train, y_train, X_test, y_test)
    print()
    
    print_separator_info('Decision Tree', '_')
    #Decision Tree
    criterion = 'gini'
    model_DT = DecisionTreeClassifier(criterion=criterion)
    #apply Decision Tree
    parameter_grid_DT = {'max_depth': range(1, 10), 'min_samples_split': range(2, 40), 'min_impurity_decrease':[0.01, 0.02, 0.03, 0.05, 0.1]}
    apply_model('Decision Tree', model_DT, parameter_grid_DT, X, y, X_train, y_train, X_test, y_test)
    print()
    
    print_separator_info('Random Forest', '_')
    #Random Forest
    criterion = 'gini'
    model_RF = RandomForestClassifier()
    #apply Random Forest
    parameter_grid_RF = {'max_depth': range(1, 10), 'min_samples_split': range(2, 5)}
    apply_model('Random Forest', model_RF, parameter_grid_RF, X, y, X_train, y_train, X_test, y_test)

### Data

#### Preprocess data

In [10]:
def print_ranges(X):
    for col_index in range(X.shape[1]):
        column = X[:, col_index]
        print(f'{np.min(column)} \t {np.max(column)}')

#### Scale data

In [11]:
def scale_data(mat: np.array) -> np.array:    
    scaler = MinMaxScaler()
    scaler.fit(mat)
    return scaler.transform(mat)

#### Augment data

In [12]:
def design_matrix(mat: np.array) -> np.array:
    """
    Functia adauga o coloana de 1 inainte de prima coloana.
    """
    
    l, c = mat.shape
    aux = np.ones((l, 1))
    result = np.concatenate((aux, mat), axis = 1)
    
    return result

### Satellite Image Data Set
#### (https://sci2s.ugr.es/keel/dataset.php?cod=71)

In [13]:
columns = ['Sp11', 'Sp12',  'Sp13' ,   'Sp14'  ,  'Sp15'  , 'Sp16' , 'Sp17'  ,   'Sp18'   ,  'Sp19'   ,  'Sp21'   ,  'Sp22'   , 'Sp23'   , 'Sp24'   , 'Sp25'   , 'Sp26'   ,'Sp27'   , 'Sp28'   ,'Sp29'   , 'Sp31'   , 'Sp32'   ,'Sp33'   ,'Sp34'   , 'Sp35'   ,'Sp36'   ,'Sp37'   ,'Sp38'   , 'Sp39'   ,'Sp41'  , 'Sp42'   , 'Sp43'   , 'Sp44'   , 'Sp45'   , 'Sp46'   , 'Sp47'   , 'Sp48'   ,'Sp49', 'Class'   ]
path_Sat = 'D:\Projects\Facultate\II\II_Sem_2_10_REPOS\IDS\Lab6\Data\SatImage\satimage.csv'
satimage_data = pd.read_csv(path_Sat, header=None, names=columns)
satimage_data.head(20)

Unnamed: 0,Sp11,Sp12,Sp13,Sp14,Sp15,Sp16,Sp17,Sp18,Sp19,Sp21,...,Sp41,Sp42,Sp43,Sp44,Sp45,Sp46,Sp47,Sp48,Sp49,Class
0,92.0,115.0,120.0,94.0,84.0,102.0,106.0,79.0,84.0,,...,104.0,88.0,121.0,128.0,100.0,84.0,107.0,113.0,87.0,3
1,84.0,102.0,106.0,79.0,84.0,102.0,102.0,83.0,80.0,102.0,...,100.0,,107.0,113.0,87.0,84.0,99.0,104.0,79.0,3
2,84.0,102.0,,83.0,80.0,102.0,102.0,79.0,,,...,87.0,84.0,99.0,,79.0,84.0,,104.0,79.0,3
3,80.0,,102.0,79.0,84.0,94.0,102.0,79.0,80.0,94.0,...,79.0,84.0,99.0,104.0,79.0,,103.0,,79.0,3
4,,94.0,102.0,79.0,80.0,94.0,98.0,76.0,80.0,102.0,...,79.0,84.0,103.0,104.0,79.0,79.0,107.0,109.0,87.0,3
5,80.0,94.0,98.0,76.0,80.0,102.0,102.0,79.0,76.0,102.0,...,79.0,79.0,107.0,109.0,87.0,79.0,107.0,109.0,87.0,3
6,,102.0,106.0,83.0,76.0,102.0,106.0,87.0,80.0,98.0,...,87.0,79.0,103.0,104.0,83.0,79.0,,104.0,79.0,3
7,76.0,102.0,106.0,,80.0,,106.0,,,,...,83.0,79.0,103.0,104.0,79.0,79.0,95.0,100.0,79.0,3
8,76.0,89.0,98.0,,76.0,94.0,98.0,76.0,76.0,,...,75.0,75.0,91.0,96.0,71.0,,87.0,93.0,71.0,4
9,76.0,94.0,98.0,76.0,76.0,98.0,102.0,72.0,76.0,94.0,...,71.0,79.0,87.0,93.0,71.0,79.0,87.0,93.0,67.0,4


In [14]:
imputer = KNNImputer(n_neighbors=2,weights="uniform")
f_filled = imputer.fit_transform(satimage_data)
satdataframe=pd.DataFrame(f_filled, columns=columns) 
satdataframe.head(20)

Unnamed: 0,Sp11,Sp12,Sp13,Sp14,Sp15,Sp16,Sp17,Sp18,Sp19,Sp21,...,Sp41,Sp42,Sp43,Sp44,Sp45,Sp46,Sp47,Sp48,Sp49,Class
0,92.0,115.0,120.0,94.0,84.0,102.0,106.0,79.0,84.0,98.5,...,104.0,88.0,121.0,128.0,100.0,84.0,107.0,113.0,87.0,3.0
1,84.0,102.0,106.0,79.0,84.0,102.0,102.0,83.0,80.0,102.0,...,100.0,80.0,107.0,113.0,87.0,84.0,99.0,104.0,79.0,3.0
2,84.0,102.0,102.0,83.0,80.0,102.0,102.0,79.0,84.0,101.5,...,87.0,84.0,99.0,106.5,79.0,84.0,103.0,104.0,79.0,3.0
3,80.0,99.0,102.0,79.0,84.0,94.0,102.0,79.0,80.0,94.0,...,79.0,84.0,99.0,104.0,79.0,79.0,103.0,96.5,79.0,3.0
4,81.0,94.0,102.0,79.0,80.0,94.0,98.0,76.0,80.0,102.0,...,79.0,84.0,103.0,104.0,79.0,79.0,107.0,109.0,87.0,3.0
5,80.0,94.0,98.0,76.0,80.0,102.0,102.0,79.0,76.0,102.0,...,79.0,79.0,107.0,109.0,87.0,79.0,107.0,109.0,87.0,3.0
6,79.5,102.0,106.0,83.0,76.0,102.0,106.0,87.0,80.0,98.0,...,87.0,79.0,103.0,104.0,83.0,79.0,100.5,104.0,79.0,3.0
7,76.0,102.0,106.0,83.0,80.0,97.0,106.0,81.0,83.0,98.5,...,83.0,79.0,103.0,104.0,79.0,79.0,95.0,100.0,79.0,3.0
8,76.0,89.0,98.0,77.0,76.0,94.0,98.0,76.0,76.0,94.5,...,75.0,75.0,91.0,96.0,71.0,74.5,87.0,93.0,71.0,4.0
9,76.0,94.0,98.0,76.0,76.0,98.0,102.0,72.0,76.0,94.0,...,71.0,79.0,87.0,93.0,71.0,79.0,87.0,93.0,67.0,4.0


In [15]:
X_Sat = satdataframe.values[:, :-1]
y_Sat = satdataframe.values[:, -1] #ultima coloana reprezinta ground truth

In [16]:
print_ranges(X_Sat)

39.0 	 102.0
27.0 	 132.0
53.0 	 140.0
33.0 	 154.0
39.0 	 104.0
27.0 	 137.0
53.0 	 145.0
29.0 	 157.0
40.0 	 102.0
27.0 	 130.0
50.0 	 145.0
29.0 	 157.0
40.0 	 102.0
27.0 	 137.0
53.0 	 145.0
33.0 	 154.0
40.0 	 104.0
27.0 	 130.0
50.0 	 145.0
33.0 	 154.0
39.0 	 104.0
27.0 	 128.0
53.0 	 145.0
29.0 	 154.0
39.0 	 104.0
27.0 	 130.0
53.0 	 140.0
34.0 	 154.0
39.0 	 101.0
27.0 	 130.0
50.0 	 139.0
29.0 	 157.0
40.0 	 104.0
27.0 	 130.0
50.0 	 145.0
29.0 	 157.0


In [17]:
X_Sat = design_matrix(X_Sat)
print(X_Sat)

[[  1.  92. 115. ... 107. 113.  87.]
 [  1.  84. 102. ...  99. 104.  79.]
 [  1.  84. 102. ... 103. 104.  79.]
 ...
 [  1.  56.  68. ...  83.  92.  74.]
 [  1.  56.  68. ...  83.  96.  70.]
 [  1.  60.  71. ...  79. 108.  92.]]


In [18]:
test_size = 1/3
random_state = 5

X_Sat_train, X_Sat_test, y_Sat_train, y_Sat_test = \
train_test_split(X_Sat, y_Sat, 
                 test_size=test_size, 
                 random_state=random_state)

#### Run models

In [19]:
run_models(X_Sat, y_Sat, 
           X_Sat_train, y_Sat_train, 
           X_Sat_test, y_Sat_test)

________________________________MLPC_________________________________
Accuracy on test set after prediction:
Accuracy (MLPC): 0.7664422578974625
---------------------------------------------------------------------
Accuracy score (5-fold):
Accuracy (MLPC, 5-fold cv):
 [0.80672994 0.69430052 0.72884283 0.80397237 0.77374784]
Mean Accuracy (MLPC, 5-fold cv): 0.7615186994922964
---------------------------------------------------------------------
F1 score (5-fold):
F1 score (MLPC, 5-fold cv):
 0.7417719924111615
---------------------------------------------------------------------
Accuracy (5-fold) for training and test set:
Train accuracy (MLPC, 5-fold cv):
 [0.73316062 0.81476684 0.77331606 0.6865285  0.72150259]
Train Mean accuracy (MLPC, 5-fold cv): 0.7458549222797928

Test accuracy (MLPC, 5-fold cv):
 [0.72093023 0.80310881 0.71761658 0.71761658 0.74870466]
Test Mean accuracy (MLPC, 5-fold cv): 0.7415953729364984
---------------------------------------------------------------------
F

Test f1_macro (Random Forest, 5-fold cv):
 [0.84902908 0.85937188 0.86639678 0.87007344 0.8408179 ]
Test Mean f1_macro (Random Forest, 5-fold cv): 0.8571378163181889
---------------------------------------------------------------------
Hyperparameter search:
Accuracy:
Scores for 4-fold cross validation:
 [0.88950276 0.88674033 0.88328729 0.87836904]
Mean score for 4-fold cross validation: 0.8844748565330441
Best parameters (X_test): {'max_depth': 9, 'min_samples_split': 3}
accuracy_score(y_test, y_estimated): 0.8886587260486795



### Australian Satellite Image Data Set
#### (https://sci2s.ugr.es/keel/dataset.php?cod=71)

In [28]:
names = ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12', 'A13', 'A14','Class']
path_Aus = 'D:\Projects\Facultate\II\II_Sem_2_10_REPOS\IDS\Lab6\Data\SatImage\australian.csv'
australian_data = pd.read_csv(path_Sat, header=None, names=names)
australian_data.head(20)

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,Class
92.0,115.0,120.0,94.0,84.0,102.0,106.0,79.0,84.0,,102.0,83.0,101.0,126.0,133.0,103.0,92.0,112.0,118.0,85.0,84.0,103.0,104.0,81.0,102.0,,,104.0,88.0,121.0,128.0,100.0,84.0,107.0,113.0,87.0,3
84.0,102.0,106.0,79.0,84.0,102.0,102.0,83.0,80.0,102.0,102.0,79.0,92.0,,,85.0,84.0,103.0,104.0,81.0,84.0,,104.0,78.0,88.0,121.0,128.0,100.0,,107.0,113.0,87.0,84.0,99.0,104.0,79.0,3
84.0,102.0,,83.0,80.0,102.0,102.0,79.0,,,,79.0,84.0,103.0,104.0,81.0,84.0,99.0,104.0,78.0,84.0,,104.0,81.0,84.0,107.0,113.0,87.0,84.0,99.0,,79.0,84.0,,104.0,79.0,3
80.0,,102.0,79.0,84.0,94.0,102.0,79.0,80.0,94.0,98.0,,84.0,99.0,104.0,78.0,84.0,99.0,104.0,81.0,,99.0,104.0,81.0,84.0,99.0,104.0,79.0,84.0,99.0,104.0,79.0,,103.0,,79.0,3
,94.0,102.0,79.0,80.0,94.0,98.0,76.0,80.0,102.0,102.0,79.0,84.0,99.0,104.0,81.0,76.0,99.0,104.0,81.0,76.0,99.0,108.0,85.0,84.0,99.0,104.0,79.0,84.0,103.0,104.0,79.0,79.0,107.0,109.0,87.0,3
80.0,94.0,98.0,76.0,80.0,102.0,102.0,79.0,76.0,102.0,102.0,79.0,76.0,,104.0,81.0,76.0,99.0,108.0,85.0,76.0,103.0,118.0,88.0,84.0,103.0,104.0,79.0,79.0,107.0,109.0,87.0,79.0,107.0,109.0,87.0,3
,102.0,106.0,83.0,76.0,102.0,106.0,87.0,80.0,98.0,106.0,79.0,,107.0,118.0,,,112.0,118.0,88.0,80.0,107.0,113.0,85.0,79.0,107.0,113.0,87.0,79.0,103.0,104.0,83.0,79.0,,104.0,79.0,3
76.0,102.0,106.0,,80.0,,106.0,,,,102.0,76.0,80.0,112.0,118.0,88.0,,107.0,113.0,,80.0,95.0,100.0,78.0,79.0,,104.0,83.0,79.0,103.0,104.0,79.0,79.0,95.0,100.0,79.0,3
76.0,89.0,98.0,,76.0,94.0,98.0,76.0,76.0,,102.0,72.0,80.0,95.0,104.0,74.0,,91.0,104.0,74.0,76.0,95.0,100.0,78.0,75.0,91.0,96.0,75.0,75.0,91.0,96.0,71.0,,87.0,93.0,71.0,4
76.0,94.0,98.0,76.0,76.0,98.0,102.0,72.0,76.0,94.0,90.0,76.0,76.0,91.0,104.0,74.0,76.0,95.0,100.0,78.0,76.0,91.0,100.0,74.0,75.0,91.0,96.0,71.0,79.0,87.0,93.0,71.0,79.0,87.0,93.0,67.0,4


In [29]:
imputer = KNNImputer(n_neighbors=2,weights="uniform")
f_filled = imputer.fit_transform(australian_data)
ausdataframe=pd.DataFrame(f_filled, columns=names) 
ausdataframe.head(20)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,Class
0,104.0,81.0,102.0,129.0,128.0,104.0,88.0,121.0,128.0,100.0,84.0,107.0,113.0,87.0,3.0
1,104.0,78.0,88.0,121.0,128.0,100.0,87.0,107.0,113.0,87.0,84.0,99.0,104.0,79.0,3.0
2,104.0,81.0,84.0,107.0,113.0,87.0,84.0,99.0,104.0,79.0,84.0,97.0,104.0,79.0,3.0
3,104.0,81.0,84.0,99.0,104.0,79.0,84.0,99.0,104.0,79.0,79.5,103.0,98.5,79.0,3.0
4,108.0,85.0,84.0,99.0,104.0,79.0,84.0,103.0,104.0,79.0,79.0,107.0,109.0,87.0,3.0
5,118.0,88.0,84.0,103.0,104.0,79.0,79.0,107.0,109.0,87.0,79.0,107.0,109.0,87.0,3.0
6,113.0,85.0,79.0,107.0,113.0,87.0,79.0,103.0,104.0,83.0,79.0,104.0,104.0,79.0,3.0
7,100.0,78.0,79.0,96.5,104.0,83.0,79.0,103.0,104.0,79.0,79.0,95.0,100.0,79.0,3.0
8,100.0,78.0,75.0,91.0,96.0,75.0,75.0,91.0,96.0,71.0,78.0,87.0,93.0,71.0,4.0
9,100.0,74.0,75.0,91.0,96.0,71.0,79.0,87.0,93.0,71.0,79.0,87.0,93.0,67.0,4.0


In [30]:
X_Aus = ausdataframe.values[:, :-1]
y_Aus = ausdataframe.values[:, -1] #ultima coloana reprezinta ground truth

In [31]:
print_ranges(X_Aus)

53.0 	 145.0
29.0 	 154.0
39.0 	 104.0
27.0 	 130.0
53.0 	 140.0
34.0 	 154.0
39.0 	 101.0
27.0 	 130.0
50.0 	 139.0
29.0 	 157.0
40.0 	 104.0
27.0 	 130.0
50.0 	 145.0
29.0 	 157.0


In [32]:
X_Aus = design_matrix(X_Aus)
print(X_Aus)

[[  1.  104.   81.  ... 107.  113.   87. ]
 [  1.  104.   78.  ...  99.  104.   79. ]
 [  1.  104.   81.  ...  97.  104.   79. ]
 ...
 [  1.   89.   75.  ...  83.   92.   74. ]
 [  1.  109.   92.  ...  83.   91.5  70. ]
 [  1.  109.   96.  ...  79.  108.   92. ]]


In [33]:
test_size = 1/3
random_state = 5

X_Aus_train, X_Aus_test, y_Aus_train, y_Aus_test = \
train_test_split(X_Aus, y_Aus, 
                 test_size=test_size, 
                 random_state=random_state)

#### Run models

In [34]:
run_models(X_Aus, y_Aus, 
           X_Aus_train, y_Aus_train, 
           X_Aus_test, y_Aus_test)

________________________________MLPC_________________________________
Accuracy on test set after prediction:
Accuracy (MLPC): 0.7079233557742103
---------------------------------------------------------------------
Accuracy score (5-fold):
Accuracy (MLPC, 5-fold cv):
 [0.73166523 0.83160622 0.73920553 0.76165803 0.76770294]
Mean Accuracy (MLPC, 5-fold cv): 0.7663675880434118
---------------------------------------------------------------------
F1 score (5-fold):
F1 score (MLPC, 5-fold cv):
 0.6738215432966322
---------------------------------------------------------------------
Accuracy (5-fold) for training and test set:
Train accuracy (MLPC, 5-fold cv):
 [0.7992228  0.78367876 0.76813472 0.77979275 0.74611399]
Train Mean accuracy (MLPC, 5-fold cv): 0.7753886010362694

Test accuracy (MLPC, 5-fold cv):
 [0.69767442 0.75388601 0.77202073 0.76683938 0.74093264]
Test Mean accuracy (MLPC, 5-fold cv): 0.746270635016267
---------------------------------------------------------------------
F1

Test f1_macro (Random Forest, 5-fold cv):
 [0.7756506  0.84231922 0.82380817 0.80345323 0.81527271]
Test Mean f1_macro (Random Forest, 5-fold cv): 0.812100786621445
---------------------------------------------------------------------
Hyperparameter search:
Accuracy:
Scores for 4-fold cross validation:
 [0.88121547 0.8480663  0.83218232 0.84105045]
Mean score for 4-fold cross validation: 0.8506286344007605
Best parameters (X_test): {'max_depth': 9, 'min_samples_split': 2}
accuracy_score(y_test, y_estimated): 0.8482651475919213



### Wine Data Set
#### (http://archive.ics.uci.edu/ml/datasets/Wine)

#### Load data

In [None]:
wine_data = pd.read_csv('D:\Projects\Facultate\II\II_Sem_2_10_REPOS\IDS\Lab6\Data\Wine\wine.data', header=None)
wine_data.head()

#### Interpret and preprocess data

In [None]:
X_Wine = wine_data.values[:, 1:]
y_Wine = wine_data.values[:, 0] #prima coloana reprezinta ground truth

In [None]:
print_ranges(X_Wine)
X_Wine = scale_data(X_Wine)

In [None]:
print_ranges(X_Wine)

In [None]:
X_Wine = design_matrix(X_Wine)
print(X_Wine)

#### Split data

In [None]:
test_size = 1/3
random_state = 5

X_Wine_train, X_Wine_test, y_Wine_train, y_Wine_test = \
train_test_split(X_Wine, y_Wine, 
                 test_size=test_size, 
                 random_state=random_state)

#### Run models

In [None]:
run_models(X_Wine, y_Wine, 
           X_Wine_train, y_Wine_train, 
           X_Wine_test, y_Wine_test)

### Semeion Data

In [None]:
path_Semeion = 'D:\Projects\Facultate\II\II_Sem_2_10_REPOS\IDS\Lab6\Data\Semeion\semeion.data'
semeion_data = pd.read_csv(path_Semeion, sep=r'\s+', header=None)
semeion_data.head()

In [None]:
X_Semeion = semeion_data.values[:, :256]
y_Semeion = semeion_data.values[:, 256:] #ultimele zece coloane reprezinta ground truth

y_Semeion = np.argmax(y_Semeion, axis=1)
print(y_Semeion.shape, X_Semeion.shape)

In [None]:
test_size = 1/3
random_state = 5

X_Semeion_train, X_Semeion_test, y_Semeion_train, y_Semeion_test = \
train_test_split(X_Semeion, y_Semeion, 
                 test_size=test_size, 
                 random_state=random_state)

#### Run models on data set

In [None]:
run_models(X_Semeion, y_Semeion, 
           X_Semeion_train, y_Semeion_train, 
           X_Semeion_test, y_Semeion_test)