PROJECT ASSIGNMENT


A neural network was built for the RCV1 dataset, which contains Reuters news articles labeled with categories that represent article topics. I use the TensorFlow library, which can be installed on Windows with the command:

pip install tensorflow


In [1]:
# All required imports
import tensorflow as tf
import numpy as np
from sklearn.datasets import fetch_rcv1
from tensorflow.keras.models import Sequential
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Dense, Dropout, Activation, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, precision_score, recall_score
from tensorflow.keras.callbacks import Callback


RCV1 contains 804,414 news articles, 47,236 word features, and 103 categories (classes). Since using only 50% of the samples is sufficient, I split the set accordingly.

In [2]:
# Load the data
rcv1 = fetch_rcv1()
X = rcv1.data
# RCV1 is multi-label; argmax keeps a single category per article
# (the lowest-indexed one that is set), giving integer class labels
y = rcv1.target.toarray().argmax(axis=1)


# Keep 50% of the samples, then split that half into training and test sets
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X, y, test_size=0.5, shuffle=True, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_train_full, y_train_full, test_size=0.2, shuffle=True, random_state=42)

# Scale the features
scaler = StandardScaler(with_mean=False)  # with_mean=False because the matrix is sparse
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
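
RCV1 is a multi-label dataset: one article can carry several categories at once. The `argmax` call in the cell above keeps only one of them. A minimal sketch on hypothetical toy rows, written in plain Python to mirror what `argmax(axis=1)` returns on a 0/1 indicator matrix:

```python
# Hypothetical indicator rows: 1 means the article carries that category.
targets = [
    [0, 1, 0, 1],  # categories 1 and 3
    [1, 0, 0, 0],  # category 0 only
    [0, 0, 1, 1],  # categories 2 and 3
]

def first_argmax(row):
    """Index of the first maximal entry, which is what numpy's argmax returns."""
    return max(range(len(row)), key=row.__getitem__)

labels = [first_argmax(row) for row in targets]
dropped = [sum(row) - 1 for row in targets]  # categories discarded per article

print(labels)   # [1, 0, 2]
print(dropped)  # [1, 0, 1]
```

The reduction turns the task into single-label classification, at the cost of discarding every category after the first one for multi-topic articles.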

In [3]:
# Check the shapes
print("X_train_full, X_train")
print(X_train_full.shape)
print(X_train.shape)

print("X_test_full, X_test")
print(X_test_full.shape)
print(X_test.shape)

print("Y_train_full, Y_train")
print(y_train_full.shape)
print(y_train.shape)

print("Y_test_full, Y_test")
print(y_test_full.shape)
print(y_test.shape)

# Number of classes (largest label index + 1)
num_classes = np.max(y) + 1
print(f"Number of classes: {num_classes}")

X_train_full, X_train
(402207, 47236)
(321765, 47236)
X_test_full, X_test
(402207, 47236)
(80442, 47236)
Y_train_full, Y_train
(402207,)
(321765,)
Y_test_full, Y_test
(402207,)
(80442,)
Number of classes: 103
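
The shapes printed above can be reproduced by hand. The sketch below assumes the rounding rule scikit-learn applies when only a fractional `test_size` is given (test part rounded up, remainder used for training); `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumed rule: the test part is rounded up, the rest goes to training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 804_414
half_train, half_test = split_sizes(n_total, 0.5)   # first split: keep 50%
n_train, n_test = split_sizes(half_train, 0.2)      # second split: 80/20

print(half_train, half_test)  # 402207 402207
print(n_train, n_test)        # 321765 80442
```

The network is therefore trained on roughly 40% of RCV1 and evaluated on roughly 10%; the other half (X_test_full) is not used in the cells shown here.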


From the dataset we conclude that the number of input neurons equals the number of features (47,236) and the number of output neurons equals the number of classes, 103.
All layers except the last use the ReLU activation function. The input feeds a 512-unit dense layer, which is followed by 5 further hidden layers.

The network is compiled with the 'compile' function, to which we pass 'accuracy' as the metric and the Adam optimizer.
model.summary() prints a text summary of the network's layers and parameter counts. The loss function is SparseCategoricalCrossentropy, because the labels are integer class indices rather than one-hot vectors.
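
As an illustration of the loss choice, the sketch below computes the cross-entropy for one sample in two ways: directly from an integer label (the "sparse" form) and from the equivalent one-hot vector. The probabilities are hypothetical and use 4 classes instead of 103.

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Loss for one sample: negative log of the probability given to the true class."""
    return -math.log(probs[label])

# Hypothetical softmax output over 4 classes.
probs = [0.1, 0.7, 0.15, 0.05]
label = 1  # integer class index, the format used for y_train

# The same loss computed from a one-hot target, as categorical cross-entropy would.
one_hot = [1.0 if i == label else 0.0 for i in range(len(probs))]
categorical = -sum(t * math.log(p) for t, p in zip(one_hot, probs))

sparse = sparse_categorical_crossentropy(probs, label)
print(sparse, categorical)
```

Both values agree, so the sparse form gives the same loss without building a 103-column one-hot matrix for every label.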

In [4]:

model = Sequential()

# Input followed by the first dense layer
model.add(Input(shape=(X_train.shape[1],)))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.3))  # Regularization

# First hidden layer
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.3))  # Regularization

# Second hidden layer
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.3))  # Regularization

# Third hidden layer
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.3))  # Regularization

# Fourth hidden layer
model.add(Dense(32))
model.add(Activation('relu'))
model.add(Dropout(0.3))  # Regularization

# Fifth hidden layer
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dropout(0.3))  # Regularization

# Output layer with one neuron per class
model.add(Dense(num_classes))
model.add(Activation('softmax'))  # Softmax activation for multi-class classification

# Compile the model with the accuracy metric
model.compile(
    optimizer=Adam(learning_rate=1e-3),
    loss=SparseCategoricalCrossentropy(),
    metrics=['accuracy']
)


# Print the model structure
model.summary()

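
The table that `model.summary()` prints is not reproduced in this notebook. As a rough cross-check, the trainable parameter count can be computed by hand, assuming the layer widths from the cell above (Activation and Dropout layers hold no parameters):

```python
# Layer widths: input features, the dense layers of the model, output classes.
layer_sizes = [47236, 512, 256, 128, 64, 32, 16, 103]

def dense_params(n_in, n_out):
    """A Dense layer holds an n_in x n_out weight matrix plus n_out biases."""
    return n_in * n_out + n_out

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer[0], total)  # 24185344 24362183
print(f"first layer share: {per_layer[0] / total:.2%}")
```

Over 99% of the parameters sit in the first dense layer, because it connects all 47,236 input features to 512 units.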


To track precision and recall in every epoch and to save a checkpoint of the best-performing model, we implement PrecisionRecallCallback.
The model is saved to the file best_precision_recall_model.keras whenever either metric improves on its best value so far.

In [5]:


class PrecisionRecallCallback(Callback):
    def __init__(self, X_val, y_val, filepath):
        super().__init__()
        self.X_test = X_val
        self.y_test = y_val
        self.filepath = filepath
        self.best_precision = -float('inf')
        self.best_recall = -float('inf')

    def on_epoch_end(self, epoch, logs=None):
        # Predict on the validation set
        y_pred = self.model.predict(self.X_test).argmax(axis=1)

        # Compute precision and recall
        precision = precision_score(self.y_test, y_pred, average='weighted', zero_division=0)
        recall = recall_score(self.y_test, y_pred, average='weighted', zero_division=0)

        # Print precision and recall
        print(f'Epoch {epoch + 1}: Precision: {precision:.4f}, Recall: {recall:.4f}')

        # Save the model when precision or recall improves
        if precision > self.best_precision or recall > self.best_recall:
            self.best_precision = max(self.best_precision, precision)
            self.best_recall = max(self.best_recall, recall)
            self.model.save(self.filepath)
            print(f'Model saved to {self.filepath} with Precision: {precision:.4f}, Recall: {recall:.4f}')

precision_recall_callback = PrecisionRecallCallback(X_test, y_test, 'best_precision_recall_model.keras')
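
The save rule in the callback fires when either metric beats its own best value, so the checkpoint file can end up holding an epoch whose recall is below the best recall seen. The sketch below replays that rule on hypothetical per-epoch values; `should_save` is an illustrative helper, not part of the notebook.

```python
def should_save(precision, recall, best_precision, best_recall):
    """Save rule from the callback: either metric beating its own best triggers a save."""
    return precision > best_precision or recall > best_recall

# Hypothetical (precision, recall) pairs per epoch, for illustration only.
epochs = [(0.60, 0.68), (0.70, 0.74), (0.72, 0.65)]

best_p = best_r = float("-inf")
saved = []
for epoch, (p, r) in enumerate(epochs, start=1):
    if should_save(p, r, best_p, best_r):
        best_p, best_r = max(best_p, p), max(best_r, r)
        saved.append(epoch)

print(saved)  # [1, 2, 3]
```

Epoch 3 overwrites the checkpoint even though its recall dropped from 0.74 to 0.65. If a single "best" model is wanted, one option is to compare a single combined score, such as the weighted F1, instead of two separate maxima.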


The model is trained epoch by epoch by calling the 'fit' function, which receives an instance of the PrecisionRecallCallback class defined above to track performance during training.

In [6]:
EPOCHS = 10
BATCH_SIZE = 256


history = model.fit(
    X_train,
    y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(X_test, y_test),
    callbacks=[precision_recall_callback]
)

Epoch 1/10
2514/2514 ━━━━━━━━━━━━━━━━━━━━ 42s 16ms/step
Epoch 1: Precision: 0.6016, Recall: 0.6793
Model saved to best_precision_recall_model.keras with Precision: 0.6016, Recall: 0.6793
1257/1257 ━━━━━━━━━━━━━━━━━━━━ 1019s 762ms/step - accuracy: 0.4306 - loss: 2.3362 - val_accuracy: 0.6793 - val_loss: 1.1497
Epoch 2/10
2514/2514 ━━━━━━━━━━━━━━━━━━━━ 127s 50ms/step
Epoch 2: Precision: 0.6494, Recall: 0.7081
Model saved to best_precision_recall_model.keras with Precision: 0.6494, Recall: 0.7081
1257/1257 ━━━━━━━━━━━━━━━━━━━━ 1044s 830ms/step - accuracy: 0.6580 - loss: 1.2009 - val_accuracy: 0.7081 - val_loss: 1.1065
Epoch 3/10
2514/2514 ━━━━━━━━━━━━━━━━━━━━ 47s 19ms/step
Epoch 3: Precision: 0.7000, Recall: 0.7404
Model saved to best_precision_recall_model.keras with Precision: 0.7000, Recall: 0.7404
1257/1257
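
In the log above, the Recall printed by the callback matches val_accuracy in epochs 1 and 2 (0.6793 and 0.7081). This is expected: with average='weighted', per-class recall is weighted by class frequency, which reduces to plain accuracy. A small plain-Python check on hypothetical labels:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with class-frequency (support) weights."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)

# Hypothetical labels and predictions for three classes.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]

acc = accuracy(y_true, y_pred)
rec = weighted_recall(y_true, y_pred)
print(acc, rec)
```

Weighted recall therefore adds no information beyond val_accuracy; macro-averaged recall would be the variant that treats rare and frequent classes equally.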

PART TWO of the project assignment covers hyperparameter search with the Optuna library. On Windows, the library is installed with the command:

pip install optuna

In [3]:
# All required imports
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
import optuna


LightGBM is an implementation of Gradient Boosting Decision Trees (GBDT), so we work with the following hyperparameters:
num_leaves - the maximum number of leaves per tree,
feature_fraction - the fraction of features used when building each tree,
bagging_fraction - the proportion of data samples each tree uses,
min_child_samples - the minimum number of samples each leaf must contain, which limits how far a node can be split, and
learning_rate - the learning rate, which determines how much the model adjusts its predictions at each boosting step based on the errors of the previous step.

After defining the hyperparameter ranges, we create a LightGBM model (working with 50% of the data) for training and evaluation. The model is optimized over 5 trials, maximizing accuracy.

The best model is then trained with the optimal hyperparameters on the full training set.
It is evaluated on the test set by computing precision, recall, and accuracy.
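
To show what the search ranges in the next cell mean, the sketch below draws configurations from the same ranges with plain random search. This is a simplification: Optuna's default sampler is more elaborate than uniform random draws, and `sample_params` is a hypothetical helper. The learning rate is sampled log-uniformly, which is the idea behind `log=True`.

```python
import math
import random

def sample_params(rng):
    """Draw one configuration from the notebook's ranges (plain random search)."""
    return {
        "num_leaves": rng.randint(16, 128),
        "feature_fraction": rng.uniform(0.6, 0.9),
        "bagging_fraction": rng.uniform(0.6, 0.9),
        "min_child_samples": rng.randint(10, 50),
        # Log-uniform: uniform in log space, then exponentiate, so that
        # 0.01-0.03 is explored about as often as 0.03-0.1.
        "learning_rate": math.exp(rng.uniform(math.log(0.01), math.log(0.1))),
    }

rng = random.Random(42)  # fixed seed so the draws are repeatable
samples = [sample_params(rng) for _ in range(5)]
for s in samples:
    print(s)
```

With only 5 trials, any sampler covers a five-dimensional space very thinly, so the best trial is better read as a reasonable configuration than as a true optimum.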

In [4]:
def objective(trial):
    # Define the ranges of the hyperparameters that Optuna will optimize
    param = {
        "objective": "multiclass",
        "metric": "multi_logloss",
        "verbosity": -1,
        "boosting_type": "gbdt",
        "num_leaves": trial.suggest_int("num_leaves", 16, 128),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.6, 0.9),
        # Note: LightGBM applies bagging_fraction only when bagging_freq is non-zero
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.6, 0.9),
        "min_child_samples": trial.suggest_int("min_child_samples", 10, 50),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1, log=True)
    }
    
    # Create and train the LightGBM model
    model = lgb.LGBMClassifier(**param)
    model.fit(X_train, y_train)

    # Predict on the test set
    y_pred = model.predict(X_test)

    # Compute accuracy
    accuracy = accuracy_score(y_test, y_pred)

    return accuracy

# Create the optimization study
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=5)

# Show the results
print("Number of finished trials: {}".format(len(study.trials)))
print("Best trial:")
trial = study.best_trial

print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))





# Train the model with the best hyperparameters
best_params = trial.params
best_model = lgb.LGBMClassifier(**best_params)
best_model.fit(X_train, y_train)

# Evaluate on the test set
y_pred = best_model.predict(X_test)
precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
accuracy = accuracy_score(y_test, y_pred)

print(f"Final Precision: {precision:.4f}, Recall: {recall:.4f}, Accuracy: {accuracy:.4f}")

# Compare performance with the neural network

neural_net_accuracy = 0.7645  # Not yet updated to the latest result

if accuracy > neural_net_accuracy:
    print("The LightGBM model is better than the neural network.")
else:
    print("The neural network is better than the LightGBM model.")


[I 2024-08-21 18:43:24,039] A new study created in memory with name: no-name-01fb8254-7749-424f-a110-672762b61404
[I 2024-08-22 08:43:45,324] Trial 0 finished with value: 0.7878222818925437 and parameters: {'num_leaves': 54, 'feature_fraction': 0.6080904263743, 'bagging_fraction': 0.8681621457560066, 'min_child_samples': 23, 'learning_rate': 0.04525045973660737}. Best is trial 0 with value: 0.7878222818925437.
[I 2024-08-22 12:09:16,906] Trial 1 finished with value: 0.8082593669973397 and parameters: {'num_leaves': 102, 'feature_fraction': 0.7288393575387057, 'bagging_fraction': 0.7211864764970205, 'min_child_samples': 23, 'learning_rate': 0.010018712969863282}. Best is trial 1 with value: 0.8082593669973397.
[I 2024-08-22 15:24:40,980] Trial 2 finished with value: 0.8115039407274807 and parameters: {'num_leaves': 78, 'feature_fraction': 0.8493269310740921, 'bagging_fraction': 0.7264558552312639, 'min_child_samples': 20, 'learning_rate': 0.015729898211458426}. Best is trial 2 with valu

Number of finished trials: 5
Best trial:
  Value: 0.8115039407274807
  Params: 
    num_leaves: 78
    feature_fraction: 0.8493269310740921
    bagging_fraction: 0.7264558552312639
    min_child_samples: 20
    learning_rate: 0.015729898211458426
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 84.782888 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1445131
[LightGBM] [Info] Number of data points in the train set: 321765, number of used features: 24167
[LightGBM] [Info] Start training from score -3.499945
[LightGBM] [Info] Start training from score -4.213154
[LightGBM] [Info] Start training from score -3.195500
[LightGBM] [Info] Start training from score -4.768056
[LightGBM] [Info] Start training from score -1.708494
[LightGBM] [Info] Start training from score -6.524598
[LightGBM] [Info] Start training from score -3.245217
[LightGBM] [Info] Start training from score -3.159276
[LightGBM] [Info] Start 

Once the neural network's accuracy is updated and the two models are compared on that metric, the NEURAL NETWORK is the better model for this problem.
In terms of precision and recall, LightGBM shows slightly better results (by 3% for precision and 5% for recall).

In [5]:
neural_net_accuracy = 0.8414

if accuracy > neural_net_accuracy:
    print("The LightGBM model is better than the neural network.")
else:
    print("The neural network is better than the LightGBM model.")

The neural network is better than the LightGBM model.
