# 1. Machine Learning course project

Eye State Classification - EEG
https://www.kaggle.com/datasets/robikscube/eye-state-classification-eeg-dataset/data

Marija Cvetković 1940

Luka Kocić 2022

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, KFold


In [None]:
df = pd.read_csv("input-eeg.csv")

## 1. Data analysis

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df['eyeDetection'].value_counts()

To decide which information to include when training our model, we need to understand the basics of the columns we have.
Column descriptions:
Each of these parameters is an electrical signal from a specific position on the head, measured over time.

1. AF3 and AF4 are located near the eyebrows and are extremely sensitive to eye movements. When the eye closes or blinks, the signal should show a large peak.
2. F7 and F8 are close to the eyes, left and right.
3. F3 and F4 are also close to the eyes, but not as close as the previous two.
4. FC5 and FC6 capture a mix of frontal and central activity.
5. T7 and T8 are on the sides of the head, far from the eyes, so they should be less affected by blinks.
6. P7 and P8 cover sensory regions.
7. O1 and O2 are directly related to vision, so they should have a strong influence.

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), cmap='coolwarm', annot=True, fmt='.2f')
plt.title("Correlation matrix")
plt.show()

Based on the correlation matrix, we can conclude the following:
1. According to the correlation matrix, the eyeDetection column currently has no linear relationship with the other channels: all values lie between -0.06 and 0.00 (because eyeDetection is boolean).

2. There are high correlations between AF3 and F8 (1.00), AF3 and P8 (1.00), P8 and F8 (1.00), FC5 and O1 (1.00), PC8 and F8 (1.00), as well as between P7 and AF4 (0.99). These columns carry nearly duplicate information, which can make the model unstable. We can drop one column from each of the listed pairs.

To choose which column to drop, we will use the point-biserial correlation with the eyeDetection column:

In [None]:
from scipy.stats import pointbiserialr

for col in df.select_dtypes(include='number'):
    corr, p_value = pointbiserialr(df["eyeDetection"], df[col])
    print(f"{col}: {abs(corr):.4f}")
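
The point-biserial coefficient is the Pearson correlation for the special case where one variable is binary, so its magnitudes should match the eyeDetection row of the heatmap above. The sketch below checks that equivalence using only the standard library. The data is a hypothetical toy sample, not the EEG dataset, and the helper names `point_biserial` and `pearson` are introduced here for illustration.

```python
import math
import statistics

def point_biserial(binary, values):
    """Point-biserial correlation between a 0/1 variable and a numeric one."""
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    p = len(group1) / len(values)   # share of samples in class 1
    q = 1 - p                       # share of samples in class 0
    s = statistics.pstdev(values)   # population standard deviation
    return (statistics.mean(group1) - statistics.mean(group0)) / s * math.sqrt(p * q)

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical toy data, not taken from the EEG dataset.
eye_state = [0, 0, 0, 1, 1, 0, 1, 1]
signal = [4.1, 3.9, 4.3, 5.0, 5.4, 4.0, 4.8, 5.1]

r_pb = point_biserial(eye_state, signal)
r_pearson = pearson(eye_state, signal)
print(f"point-biserial: {r_pb:.4f}, Pearson: {r_pearson:.4f}")
```

Because the two numbers agree, ranking channels by point-biserial correlation gives the same order as ranking them by the absolute value of their Pearson correlation with eyeDetection.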

## 2. Descriptive analysis and data cleaning

In [None]:
df.isnull().sum()

In [None]:
df.duplicated().any()

In [None]:
df = df.drop_duplicates()

### Outlier detection

Box plot of all numeric columns

In [None]:
number_columns = df.select_dtypes(include='number')
number_columns.plot(kind='box', subplots=True, layout=(4,4), sharex=False, sharey=False, figsize=(20,16))
plt.show()

Visualizing outliers with a box plot, a scatter plot and a histogram:

In [None]:
for column in number_columns.columns:
    plt.figure(figsize=(14, 4))

    # Box plot
    plt.subplot(1, 3, 1)
    sns.boxplot(y=df[column], color='skyblue')
    plt.title(f"Box Plot - {column}")

    # Scatter plot
    plt.subplot(1, 3, 2)
    plt.scatter(x=range(len(df)), y=df[column], color='red')
    plt.title(f"Scatter Plot - {column}")
    plt.xlabel("Index")
    plt.ylabel("Value")

    # Histogram
    plt.subplot(1, 3, 3)
    plt.hist(df[column], bins=16, color='lightgreen', edgecolor='black')
    plt.title(f"Histogram - {column}")
    plt.xlabel("Value")
    plt.ylabel("Frequency")

    plt.tight_layout()
    plt.show()

Applying the IQR and Z-score methods to detect outliers:

In [None]:
for column in number_columns.columns:
    print(f"\n--- Column: {column} ---")

    # IQR method: flag values more than 1.5 * IQR outside the quartiles
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    donja_granica = Q1 - 1.5 * IQR
    gornja_granica = Q3 + 1.5 * IQR
    outlieri_iqr = df[(df[column] < donja_granica) | (df[column] > gornja_granica)]
    print("Outliers by the IQR method:")
    print(outlieri_iqr)

    # Z-score method: flag values more than 3 standard deviations from the mean
    mean = df[column].mean()
    std = df[column].std()
    z_scores = (df[column] - mean) / std
    outlieri_zscore = df[(z_scores > 3) | (z_scores < -3)]
    print("Outliers by the Z-score method:")
    print(outlieri_zscore)

In the cell above we found that the data contains outliers. The next step is to decide how best to handle them. First we check how many rows contain at least one outlier; if the number is not too large, we can simply delete those rows.

In [None]:
iqr_mask = pd.Series(False, index=df.index)
zscore_mask = pd.Series(False, index=df.index)

for column in number_columns.columns:
    # IQR
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    donja = Q1 - 1.5 * IQR
    gornja = Q3 + 1.5 * IQR

    iqr_mask |= (df[column] < donja) | (df[column] > gornja)

    # Z-score
    std = df[column].std()
    if std != 0:
        z = (df[column] - df[column].mean()) / std
        zscore_mask |= (z > 3) | (z < -3)

print(f"Rows with at least one IQR outlier: {iqr_mask.sum()}")
print(f"Rows with at least one Z-score outlier: {zscore_mask.sum()}")


A large number of rows contain at least one outlier, so we rule out deleting rows.
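
The two rules can flag very different numbers of rows, which is worth keeping in mind when comparing the counts printed above. The sketch below applies both rules by hand to a hypothetical toy sample (not EEG data) using only the standard library. `statistics.quantiles(..., method="inclusive")` is used because it interpolates linearly between data points, which should correspond to the default behaviour of the pandas `quantile` calls in the notebook.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR outside the first/third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std, like pandas Series.std()
    return [v for v in values if abs((v - mean) / std) > threshold]

# Hypothetical toy sample with one extreme value.
sample = [10, 11, 12, 12, 13, 13, 14, 15, 16, 60]

flagged_iqr = iqr_outliers(sample)
flagged_z = zscore_outliers(sample)
print("IQR:", flagged_iqr)
print("Z-score:", flagged_z)
```

In this sample the IQR rule flags the value 60, while the Z-score rule flags nothing: the extreme value inflates the standard deviation so much that its own z-score stays just below 3. The Z-score rule is therefore usually the more conservative of the two on data with heavy tails.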

## 4. Model training

### Splitting the data:

In [None]:
le = LabelEncoder()
y = df['eyeDetection']

X = df.drop(
    ['eyeDetection'],
    axis=1
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

### Lazy Predict

In [None]:
from lazypredict.Supervised import LazyClassifier

clf = LazyClassifier(
    verbose=0, 
    ignore_warnings=True, 
    custom_metric=None, 
    predictions=False,
)

models, predictions = clf.fit(X_train, X_test, y_train, y_test)

print(models)

Function that reports model performance:

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def report(y_test, y_pred, title):
    acc = accuracy_score(y_test, y_pred)
    cr = classification_report(y_test, y_pred, output_dict=True)
    cm = confusion_matrix(y_test, y_pred)
    return {
        "accuracy": acc,
        "classification_report": cr,  # dict form
        "confusion_matrix": cm
    }

### Random Forest

In [None]:
from sklearn.preprocessing import MinMaxScaler, RobustScaler


def run_random_forest_classifier(X_train, y_train, X_test, y_test, preprocessors):
    forest = RandomForestClassifier(random_state=42, n_jobs=1)
    pipe = make_pipeline(*preprocessors, forest)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    res = report(y_test, y_pred, "Random Forest Classifier")
    return res["accuracy"]

### Extra Trees Classifier

Works similarly to Random Forest Classifier, but chooses split thresholds at random; in practice it can reach comparable or higher accuracy. \
n_estimators - the number of trees

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

def run_extra_trees_classifier(X_train, y_train, X_test, y_test, preprocessors):
    et_classifier = ExtraTreesClassifier(n_estimators=100, criterion='gini', random_state=42)

    pipe = make_pipeline(*preprocessors, et_classifier)
    pipe.fit(X_train, y_train)

    y_pred = pipe.predict(X_test)

    res = report(y_test, y_pred, "Extra Trees Classifier")
    return res["accuracy"]        

### K-Nearest Neighbours (KNN)

Lazy learner: it stores the training data and defers all computation to prediction time. \
K - the number of nearest points taken into account

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

def run_knn_classifier(X_train, y_train, X_test, y_test, preprocessors):
    pipe = make_pipeline(*preprocessors, KNeighborsClassifier())

    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)

    res = report(y_test, y_pred, "K-Nearest Neighbour Classifier")
    return res["accuracy"]

### SVM

In [None]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

def run_svm_classifier(X_train, y_train, X_test, y_test, preprocessors):
    pipe = make_pipeline(*preprocessors, SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42))
    pipe.fit(X_train, y_train)

    y_pred = pipe.predict(X_test)

    res = report(y_test, y_pred, "SVM Classifier")
    return res["accuracy"]

### Logistic regression

Well suited to binary classification.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer

def run_logistic_regression_classifier(X_train, y_train, X_test, y_test, preprocessors):
    clf = LogisticRegression(
        n_jobs = 1
    )

    pipe = make_pipeline(*preprocessors, clf)

    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)

    res = report(y_test, y_pred, "Logistic Regression Classifier")
    return res["accuracy"]
    

### Naive Bayes

Naive method: the algorithm assumes that the features are independent of one another given the class.
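
To make the independence assumption concrete, here is a minimal standard-library sketch of a Gaussian Naive Bayes classifier. It is not the scikit-learn implementation used in the cell below; the function names and the toy data are hypothetical. The key line is the sum of per-feature log-likelihoods, which is exactly where the model treats features as independent within each class.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    model = {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(rows) / len(X), stats)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for label, (prior, stats) in model.items():
        # Independence assumption: the joint likelihood is a product of
        # per-feature likelihoods, i.e. a sum of per-feature log terms.
        scores[label] = math.log(prior) + sum(
            gaussian_log_pdf(x, mean, var) for x, (mean, var) in zip(row, stats)
        )
    return max(scores, key=scores.get)

# Hypothetical two-feature toy data, not taken from the EEG dataset.
X_toy = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 4.2], [3.2, 3.9], [2.9, 4.1]]
y_toy = [0, 0, 0, 1, 1, 1]

nb_model = fit_gaussian_nb(X_toy, y_toy)
pred_low = predict_gaussian_nb(nb_model, [1.1, 2.0])
pred_high = predict_gaussian_nb(nb_model, [3.1, 4.0])
print(pred_low, pred_high)
```

Since several EEG channels in this dataset are almost perfectly correlated, the assumption is clearly violated here, which is one plausible reason for Naive Bayes to lag behind the tree-based models.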

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, StratifiedKFold

def run_naive_bayes_classifier(X_train, y_train, X_test, y_test, preprocessors):
    pipe = make_pipeline(*preprocessors, GaussianNB())
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)

    res = report(y_test, y_pred, "Naive Bayes Classifier")
    return res["accuracy"]

### Decision Tree Classifier

Builds a decision tree based on the features.

In [None]:
from sklearn.tree import DecisionTreeClassifier

def run_decision_tree_classifier(X_train, y_train, X_test, y_test, preprocessors):
    clf = DecisionTreeClassifier(random_state=42)

    pipe = make_pipeline(*preprocessors, clf)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)

    res = report(y_test, y_pred, "Decision Tree Classifier")
    return res["accuracy"]

In [None]:
def run_classifiers_matrix(X_train, y_train, X_test, y_test, scalers_dict):
    algorithms = {
        "Random Forest": run_random_forest_classifier,
        "Extra Trees": run_extra_trees_classifier,
        "KNN": run_knn_classifier,
        "SVM": run_svm_classifier,
        "Logistic Regression": run_logistic_regression_classifier,
        "Naive Bayes": run_naive_bayes_classifier,
        "Decision Tree": run_decision_tree_classifier,
    }

    matrix = pd.DataFrame(index=algorithms.keys(), columns=scalers_dict.keys(), dtype=float)

    for algo_name, algo_fn in algorithms.items():
        for scaler_name, preprocessors in scalers_dict.items():
            acc = algo_fn(X_train, y_train, X_test, y_test, preprocessors) * 100
            matrix.loc[algo_name, scaler_name] = acc

    return matrix

def plot_matrix_heatmap(df, title="Accuracy matrix"):
    data = df.values.astype(float)

    fig, ax = plt.subplots(figsize=(10, 5))
    im = ax.imshow(data, aspect="auto") 

    ax.set_title(title)
    ax.set_xticks(np.arange(df.shape[1]))
    ax.set_yticks(np.arange(df.shape[0]))
    ax.set_xticklabels(df.columns, rotation=30, ha="right")
    ax.set_yticklabels(df.index)

    for i in range(df.shape[0]):
        for j in range(df.shape[1]):
            val = data[i, j]
            txt = "NA" if np.isnan(val) else f"{val:.3f}"
            ax.text(j, i, txt, ha="center", va="center")

    fig.colorbar(im, ax=ax)
    plt.tight_layout()
    plt.show()
    
scalers = {
    "None": [],
    "StandardScaler": [StandardScaler()],
    "MinMaxScaler": [MinMaxScaler()],
    "RobustScaler": [RobustScaler()],
}

acc_matrix1 = run_classifiers_matrix(X_train, y_train, X_test, y_test, scalers)
plot_matrix_heatmap(acc_matrix1, title="Model accuracy by scaler")
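
For reference, the three scalers compared in the matrix above differ mainly in how strongly a single extreme value influences the result. The sketch below reproduces their basic formulas with the standard library on a hypothetical toy column; it assumes the default settings of `StandardScaler`, `MinMaxScaler` and `RobustScaler` (population standard deviation, range 0 to 1, and the 25th to 75th percentile range respectively).

```python
import statistics

def standard_scale(values):
    """(x - mean) / std, using the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max_scale(values):
    """Map the minimum to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def robust_scale(values):
    """(x - median) / IQR, so extreme values do not set the scale."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

# Hypothetical toy column with one extreme value.
channel = [10, 11, 12, 12, 13, 13, 14, 15, 16, 60]

scaled_standard = standard_scale(channel)
scaled_minmax = min_max_scale(channel)
scaled_robust = robust_scale(channel)

for name, scaled in [("standard", scaled_standard),
                     ("min-max", scaled_minmax),
                     ("robust", scaled_robust)]:
    print(name, [round(v, 2) for v in scaled])
```

With min-max scaling the single extreme value squeezes all typical values into a narrow band near 0, whereas robust scaling keeps them spread out around 0. Tree-based models split on thresholds and are largely insensitive to such monotonic rescaling, so their rows in the matrix are expected to be nearly constant across scalers; distance-based and margin-based models (KNN, SVM, logistic regression) are the ones where the choice matters.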


In [None]:
df_no_outliers = df[~zscore_mask]

y_no_outliers = df_no_outliers['eyeDetection']

X_no_outliers = df_no_outliers.drop(
    ['eyeDetection'],
    axis=1
)

X_train_no_outliers, X_test_no_outliers, y_train_no_outliers, y_test_no_outliers = train_test_split(
    X_no_outliers, y_no_outliers, test_size=0.2, random_state=42
)

scalers = {
    "None": [],
    "StandardScaler": [StandardScaler()],
    "MinMaxScaler": [MinMaxScaler()],
    "RobustScaler": [RobustScaler()],
}

acc_matrix2 = run_classifiers_matrix(X_train_no_outliers, y_train_no_outliers, X_test_no_outliers, y_test_no_outliers, scalers)
plot_matrix_heatmap(acc_matrix2, title="Model accuracy by scaler after removing outliers")

In [None]:
plot_matrix_heatmap(acc_matrix2 - acc_matrix1, title="Change in model accuracy by scaler after removing outliers")

The matrix shows that accuracy improved for most models. Accuracy for Decision Tree and KNN dropped slightly, but by far less than the other models improved.

### Balancing

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 4))

# Sort by class label so the counts line up with the tick labels below
class_counts = y_train_no_outliers.value_counts().sort_index()

class_counts.plot(kind='bar', ax=ax[0], color=['steelblue', 'coral'])
ax[0].set_title('Class distribution in the training set')
ax[0].set_xlabel('Class')
ax[0].set_ylabel('Number of samples')
ax[0].set_xticklabels(['Eye Open (0)', 'Eye Closed (1)'], rotation=0)

class_counts.plot(kind='pie', ax=ax[1], autopct='%1.1f%%',
                  labels=['Eye Open (0)', 'Eye Closed (1)'],
                  colors=['steelblue', 'coral'])
ax[1].set_title('Class share')
ax[1].set_ylabel('')

plt.tight_layout()
plt.show()

imbalance_ratio = y_train_no_outliers.value_counts().max() / y_train_no_outliers.value_counts().min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}:1")

#### Balancing methods

1. **Random Oversampling** - duplicates samples of the minority class
2. **Random Undersampling** - reduces the majority class
3. **SMOTE** - generates synthetic samples of the minority class
4. **ADASYN** - adaptively generates synthetic samples
5. **SMOTETomek** - combines SMOTE and Tomek links
6. **SMOTEENN** - combines SMOTE and Edited Nearest Neighbours
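
As a rough illustration of how SMOTE-style methods create synthetic samples, the sketch below interpolates between a minority point and one of its nearest minority neighbours. It is a simplified standard-library version written for this explanation, not the imbalanced-learn implementation; the function name `smote_like_samples`, its parameters and the toy points are hypothetical.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical minority-class points in two dimensions.
minority_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (1.4, 1.1)]
new_points = smote_like_samples(minority_points, n_new=5)

for point in new_points:
    print(tuple(round(c, 3) for c in point))
```

Every synthetic point lies on a line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies. Random oversampling, by contrast, only repeats existing rows.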

In [None]:
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek, SMOTEENN

def evaluate_balancing_methods(X_train, y_train, X_test, y_test):
    balancing_methods = {
        "Random Oversampling": RandomOverSampler(random_state=42),
        "Random Undersampling": RandomUnderSampler(random_state=42),
        "SMOTE": SMOTE(random_state=42),
        "ADASYN": ADASYN(random_state=42),
        "SMOTETomek": SMOTETomek(random_state=42),
        "SMOTEENN": SMOTEENN(random_state=42)
    }
    
    results = []
    balanced_datasets = {}
    
    for method_name, sampler in balancing_methods.items():
        if sampler is None:
            X_train_balanced = X_train
            y_train_balanced = y_train
        else:
            X_train_balanced, y_train_balanced = sampler.fit_resample(X_train, y_train)
        
        balanced_datasets[method_name] = (X_train_balanced, y_train_balanced)
        
        res = run_classifiers_matrix(X_train_balanced, y_train_balanced, X_test, y_test, scalers) - acc_matrix2
        
        # res holds the difference to acc_matrix2, so the title says "change"
        plot_matrix_heatmap(res, title="Change in model accuracy - " + method_name)
        results.append({
            "Balancing Method": method_name,
            "Accuracy Matrix": res
        })
    
    return pd.DataFrame(results), balanced_datasets

balancing_results, balanced_datasets = evaluate_balancing_methods(
    X_train_no_outliers, 
    y_train_no_outliers, 
    X_test_no_outliers, 
    y_test_no_outliers
)

#### Analysis of the best balancing method

In [None]:
fig, axes = plt.subplots(2, 4, figsize=(18, 8))
axes = axes.ravel()

for idx, (method_name, (X_bal, y_bal)) in enumerate(balanced_datasets.items()):
    if idx < len(axes):
        counts = pd.Series(y_bal).value_counts().sort_index()
        axes[idx].bar(['Eye Open (0)', 'Eye Closed (1)'], counts.values, 
                     color=['steelblue', 'coral'], alpha=0.7)
        axes[idx].set_title(f'{method_name}\n({counts.sum()} samples)', fontsize=10)
        axes[idx].set_ylabel('Number of samples')
        axes[idx].grid(axis='y', alpha=0.3)
        
        for i, v in enumerate(counts.values):
            axes[idx].text(i, v + 50, str(v), ha='center', va='bottom', fontweight='bold')

for idx in range(len(balanced_datasets), len(axes)):
    axes[idx].axis('off')

plt.suptitle('Class distribution by balancing method', 
             fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, LabelEncoder
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

### 5.1 Defining the evaluation functions

In [None]:
def evaluate_all_models(X_train, X_test, y_train, y_test, step_name=""):
    models = {
        "Random Forest": RandomForestClassifier(random_state=42, n_jobs=1),
        "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=42, n_jobs=1),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "SVM": SVC(kernel='rbf', random_state=42),
        "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
        "Naive Bayes": GaussianNB(),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
    }
    
    results = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        
        acc = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')
        
        results.append({
            "Model": name,
            "Accuracy (%)": round(acc * 100, 2),
            "F1-Score (%)": round(f1 * 100, 2),
            "Step": step_name
        })
    
    return pd.DataFrame(results)

def plot_comparison(results_df, title="Model comparison"):
    fig, ax = plt.subplots(figsize=(12, 6))
    
    x = np.arange(len(results_df))
    width = 0.35
    
    bars1 = ax.bar(x - width/2, results_df['Accuracy (%)'], width, label='Accuracy', color='steelblue')
    bars2 = ax.bar(x + width/2, results_df['F1-Score (%)'], width, label='F1-Score', color='coral')
    
    ax.set_xlabel('Model')
    ax.set_ylabel('Score (%)')
    ax.set_title(title)
    ax.set_xticks(x)
    ax.set_xticklabels(results_df['Model'], rotation=45, ha='right')
    ax.legend()
    ax.set_ylim(0, 100)
    
    for bar in bars1:
        height = bar.get_height()
        ax.annotate(f'{height:.1f}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3), textcoords="offset points",
                    ha='center', va='bottom', fontsize=8)
    
    plt.tight_layout()
    plt.show()
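
The two metrics reported by `evaluate_all_models` can diverge when the classes are not equally frequent. The sketch below computes accuracy and the support-weighted F1 score by hand on hypothetical toy labels, using only the standard library; it is meant to mirror what `accuracy_score` and `f1_score(..., average='weighted')` report, not to replace them.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1 scores averaged with weights equal to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += count * f1
    return total / len(y_true)

# Hypothetical labels: four "open" (0) and two "closed" (1) samples.
y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1]

acc_toy = accuracy(y_true_toy, y_pred_toy)
f1_toy = weighted_f1(y_true_toy, y_pred_toy)
print(f"accuracy: {acc_toy:.4f}, weighted F1: {f1_toy:.4f}")
```

Here one missed minority sample lowers the weighted F1 score more than it lowers accuracy, because the F1 of the smaller class drops sharply.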

### Removing highly correlated features

Based on the earlier correlation analysis, we remove redundant features whose pairwise correlation exceeds the threshold (0.95 on the full data, 0.9 on the data without outliers).

In [None]:
import itertools
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

def get_high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9):
    corr = df.corr().abs()
    cols = list(corr.columns)

    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            v = corr.iloc[i, j]
            if v > threshold:
                pairs.append((cols[i], cols[j], float(v)))
    return pairs

def evaluate_hight_corr_drops2(X, y, pairs, max_drop=5):
    all_corr_cols = set()
    for col_a, col_b, _ in pairs:
        all_corr_cols.add(col_a)
        all_corr_cols.add(col_b)
    
    all_corr_cols = list(all_corr_cols)
    
    results = []
    
    for num_to_drop in range(1, min(max_drop + 1, len(all_corr_cols) + 1)):
        combinations = itertools.combinations(all_corr_cols, num_to_drop)
        
        for drop_set in combinations:
            drop_set = list(drop_set)
            X_reduced = X.drop(columns=drop_set)
            
            X_train, X_test, y_train, y_test = train_test_split(
                X_reduced, y, test_size=0.2, random_state=42
            )
            
            res = evaluate_all_models(X_train, X_test, y_train, y_test, 'Dropping columns')
            
            best_idx = res["Accuracy (%)"].idxmax()
            best_result = res.loc[best_idx]

            results.append({
                "Dropped Columns": drop_set,
                "Num Dropped": len(drop_set),
                "Best Classifier": best_result["Model"],
                "Accuracy (%)": best_result["Accuracy (%)"],
                "Remaining Features": X_reduced.shape[1]
            })
    
    results_df = pd.DataFrame(results)
    results_df = results_df.sort_values(by="Accuracy (%)", ascending=False)
    
    return results_df

print(evaluate_all_models(X_train,X_test, y_train, y_test, ''))
pairs = get_high_corr_pairs(X, threshold=0.95)
drop2_results = evaluate_hight_corr_drops2(X, y, pairs, max_drop=5)
print(f"Total variants: {len(drop2_results)}")
print("\nTop 10 combinations:")
print(drop2_results.head(10).to_string(index=False))


print(evaluate_all_models(X_train_no_outliers,X_test_no_outliers, y_train_no_outliers, y_test_no_outliers, ''))
pairs2 = get_high_corr_pairs(X_no_outliers, threshold=0.9)
drop2_results2 = evaluate_hight_corr_drops2(X_no_outliers, y_no_outliers, pairs2, max_drop=5)
print(f"Total variants: {len(drop2_results2)}")
print("\nTop 10 combinations without outliers:")
print(drop2_results2.head(10).to_string(index=False))

In [None]:
X_train_no_P8 = X_train_no_outliers.drop(columns=['P8'])
X_test_no_P8 = X_test_no_outliers.drop(columns=['P8'])

y_train_no_P8 = y_train_no_outliers
y_test_no_P8 = y_test_no_outliers

acc_matrix3 = run_classifiers_matrix(X_train_no_P8, y_train_no_P8, X_test_no_P8, y_test_no_P8, scalers)
plot_matrix_heatmap(acc_matrix3 - acc_matrix2, title="Change in model accuracy by scaler after removing column P8")

### Dimensionality reduction (PCA)

We apply PCA (Principal Component Analysis) for dimensionality reduction, trying several numbers of components and reporting the share of variance each setting retains.
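
If a target share of variance (for example 95%) is wanted rather than a fixed number of components, the count can be read off the cumulative explained-variance ratios. The sketch below shows that selection rule with the standard library. The ratios are hypothetical illustrative values, not results from this dataset, and `components_for_variance` is a helper name introduced here. In scikit-learn, passing a float between 0 and 1 as `n_components` to `PCA` applies the same kind of rule.

```python
from itertools import accumulate

def components_for_variance(ratios, target=0.95):
    """Smallest number of leading components whose ratios sum to >= target."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= target - 1e-12:  # tolerance for floating-point rounding
            return k
    return len(ratios)

# Hypothetical explained-variance ratios, sorted in descending order
# as in PCA's explained_variance_ratio_ attribute.
example_ratios = [0.55, 0.25, 0.10, 0.06, 0.03, 0.01]

n_needed = components_for_variance(example_ratios, target=0.95)
print(f"components needed for 95% variance: {n_needed}")
```

Because PCA is driven by variance, channels with a much larger numeric range dominate the leading components unless the data is scaled first, so the retained-variance figure should be read with the preprocessing in mind.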

In [None]:
pca_results = []

for n_components in [3, 5, 7, 9, 12]:
    pca = PCA(n_components=n_components)
    X_train_pca = pca.fit_transform(X_train_no_P8)
    X_test_pca = pca.transform(X_test_no_P8)
    
    explained_var = sum(pca.explained_variance_ratio_) * 100

    res = run_classifiers_matrix(X_train_pca, y_train_no_P8, X_test_pca, y_test_no_P8, scalers)

    plot_matrix_heatmap(res - acc_matrix3, title=f"PCA ({n_components} components, {explained_var:.2f}% variance) - Change in model accuracy by scaler")

    pca_results.append({
        "n_components": n_components,
        "explained_variance": explained_var,
        # run_classifiers_matrix already returns percentages, so no extra * 100
        "accuracy": res
    })

### Hyperparameter optimization

In [None]:
print("\n--- Random Forest optimization ---")
rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf = RandomForestClassifier(random_state=42, n_jobs=1)
rf_grid = RandomizedSearchCV(rf, rf_param_grid, n_iter=20, cv=5, scoring='accuracy', 
                              random_state=42, n_jobs=1, verbose=1)
rf_grid.fit(X_train_no_P8, y_train_no_P8)

print(f"Best parameters: {rf_grid.best_params_}")
print(f"Best CV accuracy: {rf_grid.best_score_*100:.2f}%")

rf_best = rf_grid.best_estimator_
rf_pred = rf_best.predict(X_test_no_P8)
print(f"Test accuracy: {accuracy_score(y_test_no_P8, rf_pred)*100:.2f}%")
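
To see why a randomized search is used here instead of an exhaustive grid, it helps to count the candidate combinations. The sketch below enumerates the Random Forest grid from the cell above with the standard library and draws 20 distinct combinations, mirroring `n_iter=20`. The sampled combinations will not match the ones scikit-learn picks, since this sketch uses Python's own random number generator.

```python
import itertools
import random

# Same search space as rf_param_grid above.
rf_space = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 20, 30, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Every combination an exhaustive grid search would have to evaluate.
all_combinations = [
    dict(zip(rf_space, values))
    for values in itertools.product(*rf_space.values())
]

# Randomized search: draw a fixed number of distinct combinations instead.
rng = random.Random(42)
sampled = rng.sample(all_combinations, k=20)

n_folds = 5
print(f"grid size: {len(all_combinations)}")
print(f"model fits for the full grid: {len(all_combinations) * n_folds}")
print(f"model fits for the randomized search: {len(sampled) * n_folds}")
```

With 5-fold cross-validation the full grid needs 540 model fits, while the randomized search needs 100, at the cost of possibly missing the best combination.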

In [None]:
# Extra Trees optimization
print("\n--- Extra Trees optimization ---")
et_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

et = ExtraTreesClassifier(random_state=42, n_jobs=1)
et_grid = RandomizedSearchCV(et, et_param_grid, n_iter=20, cv=5, scoring='accuracy', 
                              random_state=42, n_jobs=1, verbose=1)
et_grid.fit(X_train_no_P8, y_train_no_P8)

print(f"Best parameters: {et_grid.best_params_}")
print(f"Best CV accuracy: {et_grid.best_score_*100:.2f}%")

et_best = et_grid.best_estimator_
et_pred = et_best.predict(X_test_no_P8)
print(f"Test accuracy: {accuracy_score(y_test_no_P8, et_pred)*100:.2f}%")

In [None]:
# # KNN optimization
# print("\n--- KNN optimization ---")
# knn_param_grid = {
#     'n_neighbors': range(1, 300, 2),
#     'weights': ['uniform', 'distance'],
#     'metric': ['euclidean', 'manhattan', 'minkowski']
# }

# knn = KNeighborsClassifier()
# # GridSearchCV is exhaustive and takes no random_state argument
# knn_grid = GridSearchCV(knn, knn_param_grid, cv=5, scoring='accuracy', n_jobs=1, verbose=1)
# knn_grid.fit(X_train_no_P8, y_train_no_P8)

# print(f"Best parameters: {knn_grid.best_params_}")
# print(f"Best CV accuracy: {knn_grid.best_score_*100:.2f}%")

# knn_best = knn_grid.best_estimator_
# knn_pred = knn_best.predict(X_test_no_P8)
# print(f"Test accuracy: {accuracy_score(y_test_no_P8, knn_pred)*100:.2f}%")


In [None]:
# SVM optimization
# print("\n--- SVM optimization ---")
# svm_param_grid = {
#     'C': [0.1, 1, 10, 100],
#     'gamma': ['scale', 'auto', 0.01, 0.1],
#     'kernel': ['rbf', 'poly']
# }

# svm = SVC(random_state=42)
# svm_grid = RandomizedSearchCV(svm, svm_param_grid, n_iter=15, cv=5, scoring='accuracy', 
#                                random_state=42, n_jobs=1, verbose=1)
# svm_grid.fit(X_train_no_P8, y_train_no_P8)

# print(f"Best parameters: {svm_grid.best_params_}")
# print(f"Best CV accuracy: {svm_grid.best_score_*100:.2f}%")

# svm_best = svm_grid.best_estimator_
# svm_pred = svm_best.predict(X_test_no_P8)
# print(f"Test accuracy: {accuracy_score(y_test_no_P8, svm_pred)*100:.2f}%")

In [None]:
# Summary of results after hyperparameter optimization
print("\n" + "=" * 60)
print("Results after hyperparameter optimization")
print("=" * 60)

optimized_results = pd.DataFrame([
    {"Model": "Random Forest (opt)", "Accuracy (%)": round(accuracy_score(y_test_no_P8, rf_pred)*100, 2), 
     "F1-Score (%)": round(f1_score(y_test_no_P8, rf_pred, average='weighted')*100, 2), "Step": "Optimized"},
    {"Model": "Extra Trees (opt)", "Accuracy (%)": round(accuracy_score(y_test_no_P8, et_pred)*100, 2), 
     "F1-Score (%)": round(f1_score(y_test_no_P8, et_pred, average='weighted')*100, 2), "Step": "Optimized"},
])

print(optimized_results.to_string(index=False))
plot_comparison(optimized_results, "STEP 4 - Optimized models")


## 6. Model interpretability analysis

In this part we analyze model interpretability using:
- **Feature Importance** - the importance of features in tree-based models
- **SHAP values** - SHapley Additive exPlanations
- **LIME** - Local Interpretable Model-agnostic Explanations

### 6.1 Feature Importance (Random Forest & Extra Trees)

In [None]:
feature_names = X_train_no_P8.columns.tolist()

rf_importance = pd.DataFrame({
    'Feature': feature_names,
    'Importance': rf_best.feature_importances_
}).sort_values('Importance', ascending=False)

et_importance = pd.DataFrame({
    'Feature': feature_names,
    'Importance': et_best.feature_importances_
}).sort_values('Importance', ascending=False)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

axes[0].barh(rf_importance['Feature'], rf_importance['Importance'], color='steelblue')
axes[0].set_xlabel('Importance')
axes[0].set_title('Random Forest - Feature Importance')
axes[0].invert_yaxis()

axes[1].barh(et_importance['Feature'], et_importance['Importance'], color='coral')
axes[1].set_xlabel('Importance')
axes[1].set_title('Extra Trees - Feature Importance')
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

print("\nTop 5 most important features (Random Forest):")
print(rf_importance.head().to_string(index=False))
print("\nTop 5 most important features (Extra Trees):")
print(et_importance.head().to_string(index=False))

### SHAP analysis (SHapley Additive exPlanations)

SHAP values show how much each feature contributes to the model's prediction.
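
The idea behind these values can be shown on a tiny example. The sketch below computes exact Shapley values for a hypothetical two-feature function by averaging each feature's marginal contribution over all feature orderings. It uses only the standard library and is not how the `shap` package computes values for tree ensembles, which relies on faster specialised algorithms; `toy_model` and the inputs are made up for illustration.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Exact Shapley values by averaging over all feature orderings."""
    n = len(instance)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for feature in order:
            # Switch this feature from its baseline value to the actual value
            current[feature] = instance[feature]
            output = model(current)
            contributions[feature] += output - previous_output
            previous_output = output
    return [c / len(orderings) for c in contributions]

def toy_model(x):
    """Hypothetical model with an interaction term between the features."""
    a, b = x
    return 2 * a + 3 * b + a * b

instance = [1.0, 2.0]
baseline = [0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print("Shapley values:", phi)
print("sum:", sum(phi), "prediction gap:", toy_model(instance) - toy_model(baseline))
```

The values add up to the difference between the prediction for the instance and the prediction for the baseline, which is the additivity property that SHAP plots rely on. The interaction term is split evenly between the two features.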

In [None]:
import shap

explainer_rf = shap.Explainer(rf_best, X_train_no_P8)
shap_values_rf = explainer_rf(X_test_no_P8, check_additivity=False)

print("SHAP Summary Plot - Random Forest")
plt.figure(figsize=(12, 8))
shap.summary_plot(shap_values_rf, feature_names=feature_names, show=False)
plt.tight_layout()
plt.show()

In [None]:
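# For a classifier, recent SHAP versions return one set of values per class;
# these plots expect a single output, so select class 1 (eye closed) if needed.
shap_values_closed = shap_values_rf[:, :, 1] if len(shap_values_rf.shape) == 3 else shap_values_rf
shap.plots.beeswarm(shap_values_closed)
shap.plots.bar(shap_values_closed)
shap.plots.waterfall(shap_values_closed[0])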

### LIME analysis (Local Interpretable Model-agnostic Explanations)

LIME explains individual predictions by fitting a local interpretable model around each one.

In [None]:
from lime import lime_tabular

explainer_lime = lime_tabular.LimeTabularExplainer(
    X_train_no_P8.values,
    feature_names=feature_names,
    class_names=['Eye Open', 'Eye Closed'],
    mode='classification',
    random_state=42
)

print("LIME explanations for 3 instances:")
for i in [0, 50, 100]:
    exp = explainer_lime.explain_instance(
        X_test_no_P8.values[i], 
        rf_best.predict_proba, 
        num_features=10
    )
    print(f"\n--- Instance {i} ---")
    # Use the labels that belong to this test split (y_test comes from the earlier split)
    print(f"Actual class: {y_test_no_P8.iloc[i]}")
    print(f"Predicted class: {rf_best.predict(X_test_no_P8.iloc[i:i+1])[0]}")
    
    fig = exp.as_pyplot_figure()
    plt.title(f'LIME explanation - instance {i}')
    plt.tight_layout()
    plt.show()