# **Project Assignment No. 2**
For the first part of the project assignment, a neural network was built on the ***RCV1*** dataset.

To install the libraries needed to run the code for this part of the assignment (on Windows), run the following commands in Command Prompt:
```bash
pip install tensorflow
pip install scikit-learn
pip install imbalanced-learn
```

In [2]:
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam

The ***RCV1*** dataset contains 804,414 documents.
Each document is represented as a vector with 47,236 features, so the input layer of the neural network has 47,236 units. Since we classify over a dataset with 103 classes, the final layer has 103 units.

In [3]:
from sklearn.datasets import fetch_rcv1
from sklearn.model_selection import train_test_split

# load the RCV1 data
rcv1_train = fetch_rcv1(subset='train')
rcv1_test = fetch_rcv1(subset='test')

X_train = rcv1_train.data
y_train = rcv1_train.target.toarray().argmax(axis=1)  # convert to a single label per document
X_test = rcv1_test.data
y_test = rcv1_test.target.toarray().argmax(axis=1)

The dataset specification shows that the data is already split into training and test subsets, so on import it is enough to state which subset we want to fetch.

We then split each subset into its *data* and *target* parts.
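
As a small illustration of what the `argmax` step does to RCV1's multi-label targets, the sketch below mimics it on a toy indicator matrix in plain Python. The toy rows and the helper name `first_active_label` are made up for illustration. The point is that a document tagged with several categories keeps only the lowest-indexed one.

```python
def first_active_label(row):
    """Mimic argmax on a 0/1 indicator row: index of the first maximum."""
    return row.index(max(row))

# Toy multi-label targets: each row marks the categories assigned to one document.
toy_targets = [
    [0, 1, 0, 1],  # categories 1 and 3 -> collapsed to 1
    [0, 0, 0, 1],  # only category 3    -> 3
    [1, 1, 1, 0],  # categories 0, 1, 2 -> collapsed to 0
]

labels = [first_active_label(row) for row in toy_targets]
print(labels)
```

This turns the multi-label problem into a single-label one, which is what the 103-way softmax output later in the notebook expects.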

In [4]:
# keep 50% of the training and test sets
X_train, _, y_train, _ = train_test_split(X_train, y_train, test_size=0.5, random_state=42)
X_test, _, y_test, _ = train_test_split(X_test, y_test, test_size=0.5, random_state=42)

Because the dataset is so large, we halve the size of each subset we work with.

In [5]:
# split off the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.15, random_state=42)

We move 15% of the training set into a validation set.
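
The subset sizes that result from these two steps can be checked by hand. The sketch below assumes the documented RCV1 split of 23,149 training and 781,265 test documents, and assumes scikit-learn's rounding rule for a fractional `test_size` (the held-out part is rounded up). `split_sizes` is a hypothetical helper, not a library function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size, rounding the held-out part up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_train_half, _ = split_sizes(23_149, 0.5)          # keep half of the training subset
n_test_half, _ = split_sizes(781_265, 0.5)          # keep half of the test subset
n_train, n_valid = split_sizes(n_train_half, 0.15)  # carve out the validation set

print(n_train, n_valid, n_test_half)
```

The validation and test sizes computed this way agree with the shapes printed further down in the notebook (1,737 and 390,632 rows).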

In [6]:
# check the types
print(type(X_train))
print(type(X_test))
print(type(X_valid))

<class 'scipy.sparse._csr.csr_matrix'>
<class 'scipy.sparse._csr.csr_matrix'>
<class 'scipy.sparse._csr.csr_matrix'>


We see that the data is stored as **SciPy sparse matrices**. Converting a SciPy sparse matrix to a **dense NumPy array** is not feasible here because of the memory it would require, so we have to work with the data in sparse format. The *tensorflow* library is well suited to this (PyTorch's support for sparse tensors is more limited).
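
A rough back-of-the-envelope estimate shows why the dense conversion is out of reach. The sketch below takes the test-set shape and the roughly 0.16% density reported later in this notebook, and assumes 8-byte values and 4-byte indices. The helpers `dense_bytes` and `csr_bytes` are hypothetical, and the CSR figure is only an approximation.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory for a dense array that stores every entry explicitly."""
    return n_rows * n_cols * bytes_per_value

def csr_bytes(n_rows, n_cols, density, bytes_per_value=8, bytes_per_index=4):
    """Approximate CSR memory: values and column indices for non-zeros, plus row pointers."""
    nnz = int(n_rows * n_cols * density)
    return nnz * (bytes_per_value + bytes_per_index) + (n_rows + 1) * bytes_per_index

n_rows, n_cols = 390_632, 47_236   # shape of the reduced test set
density = 0.0016                   # roughly 0.16% non-zero values

print(f"dense: {dense_bytes(n_rows, n_cols) / 1e9:.1f} GB")
print(f"CSR:   {csr_bytes(n_rows, n_cols, density) / 1e9:.2f} GB")
```

Under these assumptions the dense array needs well over a hundred gigabytes, while the sparse representation stays in the range of a few hundred megabytes.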

In [7]:
from imblearn.over_sampling import SMOTE

# apply SMOTE to balance the training set
smote = SMOTE(random_state=42, k_neighbors=1)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
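
With `k_neighbors=1`, each synthetic sample is interpolated between a minority-class point and its single nearest same-class neighbor. The sketch below shows only that interpolation step in plain Python. It is a simplified illustration of the idea behind SMOTE, not imbalanced-learn's implementation, and the points and the helper `smote_sample` are hypothetical.

```python
import random

def smote_sample(x, neighbor, rng):
    """Create one synthetic point on the line segment between x and its neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
minority_point = [1.0, 0.0, 2.0]
nearest_same_class = [3.0, 0.0, 1.0]

synthetic = smote_sample(minority_point, nearest_same_class, rng)
print(synthetic)
```

Features that are zero in both points stay zero in the synthetic sample, so interpolating between two sparse documents again yields a sparse document.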

In [26]:
# check the number of features
n_features = X_train_res.shape[1]
print(f'{n_features}')

47236


The feature matrix (*data*) has 47,236 columns, and 0.16% of its entries are non-zero.

In [18]:
# check the shapes
print(f'X_train_res: {X_train_res.shape}')
print(f'X_valid: {X_valid.shape}')
print(f'X_test: {X_test.shape}')

X_train_res: (74556, 47236)
X_valid: (1737, 47236)
X_test: (390632, 47236)


In [10]:
# define the neural network architecture
model = Sequential([
    Input(shape=(X_train.shape[1],)),  # input layer
    Dense(2048, activation='relu'),    # 1st hidden layer
    Dropout(0.5),
    Dense(1024, activation='relu'),    # 2nd hidden layer
    Dropout(0.5),
    Dense(512, activation='relu'),     # 3rd hidden layer
    Dropout(0.5),
    Dense(103, activation='softmax')   # output layer
])

This defines a neural network with 47,236 units in the input layer, 103 units in the output layer, and 3 hidden layers. The hidden layers use the *ReLU* activation function, and each is followed by dropout with a rate of 0.5; the input layer itself applies no activation. Because we are solving a multiclass classification problem, the output layer uses *softmax*.
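
To get a feel for the size of this architecture, the number of trainable parameters can be counted by hand: each fully connected layer contributes one weight per input-output pair plus one bias per output. The sketch below is plain arithmetic under that rule; `dense_params` is a hypothetical helper.

```python
layer_sizes = [47_236, 2048, 1024, 512, 103]  # input, three hidden layers, output

def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)
print(total)
```

Dropout layers add no trainable parameters, and the first hidden layer accounts for the large majority of the weights because of the 47,236-dimensional input.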

In [11]:
# define the optimizer, loss function, and metric
model.compile(optimizer=Adam(),
              loss=SparseCategoricalCrossentropy(),
              metrics=['accuracy'])

# define a checkpoint that saves the best model
checkpoint = ModelCheckpoint('best_model.keras', monitor='val_accuracy', save_best_only=True, mode='max', verbose=1)

We use *SparseCategoricalCrossentropy* as the loss function because it suits multiclass classification with integer class labels. It computes the loss between the predictions and the true target values during training.

The optimizer automatically adjusts the model weights to minimize the loss function specified by *loss*.

The metric we track is accuracy. It is used to monitor model performance during training and validation, but it is not used directly when optimizing the weights.
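
For a single example, sparse categorical cross-entropy is the negative logarithm of the probability that the model assigns to the true class. The sketch below works this out in plain Python for a hypothetical three-class score vector. It illustrates the formula and is not the Keras implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(true_label, probs):
    """Negative log-probability of the true class (the label is an integer index)."""
    return -math.log(probs[true_label])

probs = softmax([2.0, 1.0, 0.1])
loss_confident = sparse_categorical_crossentropy(0, probs)  # true class has the highest score
loss_wrong = sparse_categorical_crossentropy(2, probs)      # true class has the lowest score

# A model that spreads probability evenly over 103 classes scores log(103).
uniform_loss = sparse_categorical_crossentropy(0, [1 / 103] * 103)
print(loss_confident, loss_wrong, uniform_loss)
```

The loss is small when the true class receives most of the probability mass and grows as that probability shrinks, which is the quantity the optimizer drives down.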

In [12]:
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

# train the model
history = model.fit(X_train_res, y_train_res,
                    epochs=5,
                    batch_size=64,
                    validation_data=(X_valid, y_valid),
                    callbacks=[checkpoint])

# load the best model that was saved earlier
best_model = tf.keras.models.load_model('best_model.keras')

# evaluate the best model
loss, accuracy = best_model.evaluate(X_test, y_test)
print(f'Test Accuracy: {accuracy}')

# predictions on the test set
y_pred = best_model.predict(X_test).argmax(axis=1)

# compute the metrics
precision = precision_score(y_test, y_pred, average='weighted', zero_division=1)
recall = recall_score(y_test, y_pred, average='weighted', zero_division=1)
f1 = f1_score(y_test, y_pred, average='weighted', zero_division=1)
print(f'Test Precision: {precision}')
print(f'Test Recall: {recall}')
print(f'Test F1 Score: {f1}')

print(classification_report(y_test, y_pred, zero_division=1))

Epoch 1/5
Epoch 1: val_accuracy improved from -inf to 0.78066, saving model to best_model.keras
1165/1165 ━━━━━━━━━━━━━━━━━━━━ 1298s 1s/step - accuracy: 0.7694 - loss: 0.9265 - val_accuracy: 0.7807 - val_loss: 1.0625
Epoch 2/5
Epoch 2: val_accuracy did not improve from 0.78066
1165/1165 ━━━━━━━━━━━━━━━━━━━━ 1237s 990ms/step - accuracy: 0.9921 - loss: 0.0373 - val_accuracy: 0.7761 - val_loss: 1.1865
Epoch 3/5
Epoch 3: val_accuracy did not improve from 0.78066
1165/1165 ━━━━━━━━━━━━━━━━━━━━ 1178s 1s/step - accuracy: 0.9944 - loss: 0.0260 - val_accuracy: 0

For the second part of the project assignment, a ***LightGBM classifier*** is trained on the same dataset.
Commands to install the libraries needed to run the code for the second part:
```bash
pip install lightgbm
pip install optuna
```

In [13]:
import lightgbm as lgb
import optuna

The data has already been loaded, split into the appropriate sets, and balanced, so we move straight to defining the hyperparameter ranges over which we will observe the model's behavior.

In [14]:
from sklearn.metrics import accuracy_score

# objective function that defines the training goal and the LightGBM hyperparameters to vary
def objective(trial):
    param = {
        "objective": "multiclass",
        "num_class": 103,
        "metric": "multi_logloss",
        "verbosity": -1,
        "boosting_type": "gbdt",
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "learning_rate": trial.suggest_float("learning_rate", 1e-1, 1.0, log=True),
    }

    
    model = lgb.LGBMClassifier(**param)
    model.fit(X_train_res, y_train_res, eval_set=[(X_valid, y_valid)], callbacks=[lgb.early_stopping(stopping_rounds=10)])
    y_pred = model.predict(X_valid)
    accuracy = accuracy_score(y_valid, y_pred)
    return accuracy
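
Note that early stopping in `objective` watches `multi_logloss` on the validation set, while the value returned to Optuna is validation accuracy. The stopping rule itself can be sketched in a few lines: training halts once the validation loss has not improved for a given number of rounds, and the best round seen so far is kept. The function below is a simplified stand-alone illustration with made-up loss values, not LightGBM's callback.

```python
def best_iteration_with_patience(val_losses, patience):
    """Return (best_iteration, stopped_early), with iterations counted from 1."""
    best_iter, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            return best_iter, True   # no improvement for `patience` rounds
    return best_iter, False          # ran out of iterations first

# Hypothetical validation losses: improve for 4 rounds, then stop improving.
losses = [1.00, 0.90, 0.85, 0.80, 0.82, 0.83, 0.81, 0.84]
print(best_iteration_with_patience(losses, patience=3))
print(best_iteration_with_patience(losses, patience=10))
```

The second call corresponds to the "Did not meet early stopping" case in the training log: the iteration budget runs out before the patience window does.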

Explanation of the value ranges chosen for the hyperparameters and how they affect the decision trees:
1. ***lambda_l1* and *lambda_l2* - L1 and L2 regularization**:
- range from 1e-8 to 10.0 (sampled on a log scale) - larger values strengthen regularization, which helps reduce overfitting
2. ***num_leaves* - maximum number of leaves per tree**:
- ranges from 2 to 256 - larger values allow more complex models
3. ***feature_fraction* - fraction of features used in each iteration**:
- ranges from 0.4 to 1.0 - smaller values help prevent overfitting
4. ***bagging_fraction* - fraction of samples used in each iteration**:
- ranges from 0.4 to 1.0 - smaller values help prevent overfitting
5. ***bagging_freq* - bagging frequency**:
- ranges from 1 to 7 - a value of k means the bagged subsample is redrawn every k iterations, so smaller values resample more often
6. ***min_child_samples* - minimum number of samples required in a leaf**:
- ranges from 5 to 100 - larger values reduce the risk of overfitting
7. ***learning_rate* - shrinkage applied to each tree's contribution**:
- ranges from 1e-1 to 1.0 (sampled on a log scale) - larger values allow faster convergence, at the risk of overshooting the optimal solution
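
The `log=True` flag used for the regularization and learning-rate ranges means values are drawn on a logarithmic scale, so every order of magnitude is equally likely to be explored. The sketch below illustrates that idea with Python's standard library. It is a stand-alone illustration and not Optuna's actual sampler, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
draws = [sample_log_uniform(1e-8, 10.0, rng) for _ in range(10_000)]

# On a log scale, roughly as many draws fall below 1e-4 as between 1e-4 and 1;
# a plain uniform draw on [1e-8, 10] would almost never land below 1e-4.
below = sum(d < 1e-4 for d in draws)
print(below / len(draws))
```

This is why the trial log contains `lambda_l1` and `lambda_l2` values spread across many orders of magnitude instead of clustering near 10.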

In [16]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)

print("Number of finished trials: {}".format(len(study.trials)))

print("Best trial:")
trial = study.best_trial

print("  Value: {}".format(trial.value))

print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

[I 2024-06-26 17:07:13,387] A new study created in memory with name: no-name-aa739667-f126-45db-a254-6e90d67ca412


Training until validation scores don't improve for 10 rounds
Did not meet early stopping. Best iteration is:
[98]	valid_0's multi_logloss: 0.75683


[I 2024-06-26 17:20:12,488] Trial 0 finished with value: 0.7962003454231433 and parameters: {'lambda_l1': 2.8564849841336635, 'lambda_l2': 1.619939623594285e-08, 'num_leaves': 109, 'feature_fraction': 0.7261803604019853, 'bagging_fraction': 0.8893758865019048, 'bagging_freq': 2, 'min_child_samples': 36}. Best is trial 0 with value: 0.7962003454231433.


Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[42]	valid_0's multi_logloss: 0.810578


[I 2024-06-26 17:44:13,341] Trial 1 finished with value: 0.8059873344847438 and parameters: {'lambda_l1': 2.4710764289326206e-07, 'lambda_l2': 6.627768698260412e-08, 'num_leaves': 187, 'feature_fraction': 0.8724652709931167, 'bagging_fraction': 0.6965413630295012, 'bagging_freq': 3, 'min_child_samples': 82}. Best is trial 1 with value: 0.8059873344847438.


Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[79]	valid_0's multi_logloss: 0.768031


[I 2024-06-26 17:56:27,576] Trial 2 finished with value: 0.7944732297063903 and parameters: {'lambda_l1': 4.092109204906268, 'lambda_l2': 0.04900165020324122, 'num_leaves': 89, 'feature_fraction': 0.9654306130513368, 'bagging_fraction': 0.8755153531480141, 'bagging_freq': 7, 'min_child_samples': 54}. Best is trial 1 with value: 0.8059873344847438.


Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[34]	valid_0's multi_logloss: 0.843977


[I 2024-06-26 18:11:08,540] Trial 3 finished with value: 0.7927461139896373 and parameters: {'lambda_l1': 1.88379041662297e-06, 'lambda_l2': 4.9383887885706253e-05, 'num_leaves': 120, 'feature_fraction': 0.7118441150837593, 'bagging_fraction': 0.9177891462106128, 'bagging_freq': 4, 'min_child_samples': 53}. Best is trial 1 with value: 0.8059873344847438.


Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[49]	valid_0's multi_logloss: 0.80889


[I 2024-06-26 18:37:16,342] Trial 4 finished with value: 0.7938975244674726 and parameters: {'lambda_l1': 1.5379259005882808e-05, 'lambda_l2': 0.0007279909093645176, 'num_leaves': 256, 'feature_fraction': 0.4523642568833109, 'bagging_fraction': 0.6126852074916422, 'bagging_freq': 5, 'min_child_samples': 48}. Best is trial 1 with value: 0.8059873344847438.


Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[37]	valid_0's multi_logloss: 0.861316


[I 2024-06-26 18:55:16,030] Trial 5 finished with value: 0.795048934945308 and parameters: {'lambda_l1': 0.0038351236183668345, 'lambda_l2': 0.00022073687835532778, 'num_leaves': 199, 'feature_fraction': 0.7051781223925677, 'bagging_fraction': 0.7856485271103513, 'bagging_freq': 5, 'min_child_samples': 27}. Best is trial 1 with value: 0.8059873344847438.


Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[45]	valid_0's multi_logloss: 0.765524


[I 2024-06-26 19:10:13,614] Trial 6 finished with value: 0.8094415659182499 and parameters: {'lambda_l1': 0.000138026955057922, 'lambda_l2': 0.3122995055500407, 'num_leaves': 140, 'feature_fraction': 0.7385020544767629, 'bagging_fraction': 0.8870672632019924, 'bagging_freq': 5, 'min_child_samples': 52}. Best is trial 6 with value: 0.8094415659182499.


Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[57]	valid_0's multi_logloss: 0.767324


[I 2024-06-26 19:20:56,790] Trial 7 finished with value: 0.8002302820955671 and parameters: {'lambda_l1': 0.4306881440675227, 'lambda_l2': 3.9124730233903544e-08, 'num_leaves': 215, 'feature_fraction': 0.7275771279681029, 'bagging_fraction': 0.41246764941517994, 'bagging_freq': 1, 'min_child_samples': 26}. Best is trial 6 with value: 0.8094415659182499.


Training until validation scores don't improve for 10 rounds
Did not meet early stopping. Best iteration is:
[98]	valid_0's multi_logloss: 0.722194


[I 2024-06-26 19:36:57,132] Trial 8 finished with value: 0.8094415659182499 and parameters: {'lambda_l1': 0.1164144099985862, 'lambda_l2': 5.744499994951523, 'num_leaves': 217, 'feature_fraction': 0.63587982266162, 'bagging_fraction': 0.41399158853734774, 'bagging_freq': 1, 'min_child_samples': 36}. Best is trial 6 with value: 0.8094415659182499.


Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[40]	valid_0's multi_logloss: 0.755189


[I 2024-06-26 19:46:00,810] Trial 9 finished with value: 0.8077144502014968 and parameters: {'lambda_l1': 0.00555979059001471, 'lambda_l2': 0.011021715110853526, 'num_leaves': 210, 'feature_fraction': 0.6791521309903719, 'bagging_fraction': 0.6429896011538936, 'bagging_freq': 3, 'min_child_samples': 99}. Best is trial 6 with value: 0.8094415659182499.


Number of finished trials: 10
Best trial:
  Value: 0.8094415659182499
  Params: 
    lambda_l1: 0.000138026955057922
    lambda_l2: 0.3122995055500407
    num_leaves: 140
    feature_fraction: 0.7385020544767629
    bagging_fraction: 0.8870672632019924
    bagging_freq: 5
    min_child_samples: 52


The best combination of hyperparameters has now been found and stored (*trial.params*). We train a final model with these hyperparameters, using the validation set for early stopping, and then evaluate it on the test set.
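
Since the study maximizes validation accuracy, the best trial is simply the one with the highest returned value. The sketch below replays that selection on the ten values copied from the log. It reproduces the outcome shown there and makes no claim about how Optuna resolves ties internally.

```python
# Validation accuracy reported for trials 0-9 in the log.
trial_values = [
    0.7962003454231433, 0.8059873344847438, 0.7944732297063903,
    0.7927461139896373, 0.7938975244674726, 0.795048934945308,
    0.8094415659182499, 0.8002302820955671, 0.8094415659182499,
    0.8077144502014968,
]

# max() keeps the first maximum, so the tie between trials 6 and 8 resolves to 6,
# which matches the "Best is trial 6" message in the log.
best_index = max(range(len(trial_values)), key=trial_values.__getitem__)
print(best_index, trial_values[best_index])
```

With only ten trials and several results within about one percentage point of each other, the ranking of the top configurations should be read with some caution.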

In [24]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# train the best model with the parameters found
best_params = trial.params

# remove redundant parameters
conflicting_params = {
    'colsample_bytree': 'feature_fraction',
    'reg_alpha': 'lambda_l1',
    'reg_lambda': 'lambda_l2',
    'subsample': 'bagging_fraction',
    'subsample_freq': 'bagging_freq'
}

# remove any alias listed in conflicting_params that appears in best_params
for redundant_param, primary_param in conflicting_params.items():
    if redundant_param in best_params:
        del best_params[redundant_param]

# add the required parameters
best_params['objective'] = 'multiclass'
best_params['num_class'] = 103
best_params['metric'] = 'multi_logloss'
best_params['force_col_wise'] = True
best_params['verbosity'] = -1

final_model = lgb.LGBMClassifier(**best_params)
final_model.fit(X_train_res, y_train_res, eval_set=[(X_valid, y_valid)], callbacks=[lgb.early_stopping(stopping_rounds=10)])

# evaluate on the test set
y_pred = final_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro', zero_division=1)
recall = recall_score(y_test, y_pred, average='macro', zero_division=1)
f1 = f1_score(y_test, y_pred, average='macro', zero_division=1)

print(f'Test Accuracy: {accuracy}')
print(f'Test Precision: {precision}')
print(f'Test Recall: {recall}')
print(f'Test F1 Score: {f1}')
print(classification_report(y_test, y_pred, zero_division=1))

Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[45]	valid_0's multi_logloss: 0.765524
Test Accuracy: 0.7608490855843864
Test Precision: 0.5865425844097253
Test Recall: 0.48791600605223717
Test F1 Score: 0.5051139602449142
              precision    recall  f1-score   support

           0       0.44      0.39      0.42     11770
           1       0.69      0.54      0.61      5678
           2       0.47      0.39      0.43     15967
           3       0.54      0.40      0.46      3348
           4       0.88      0.91      0.89     71052
           8       0.36      0.25      0.30       595
           9       0.71      0.68      0.70     15401
          14       0.62      0.75      0.68     16642
          18       0.57      0.48      0.52      9459
          19       0.41      0.25      0.31      1951
          20       0.15      0.01      0.02       641
          21       0.48      0.35      0.41      7747
          22       0.62   

# **Conclusion**:
After training both models we observe:
1. neural network:
- higher *precision*, *recall*, and *f1 score*, which suggests better performance in terms of correctness and coverage of the classes
- better per-class metrics overall, especially *precision* and *f1 score*
- lower *accuracy*

2. LightGBM:
- higher *accuracy*, i.e. a larger share of correct predictions
- lower *precision*, *recall*, and *f1 score*, which implies lower reliability in terms of correctness and coverage across all classes
- depends more on hyperparameters and tuning, and is in fact the less complex model in this case (for the sake of faster training and evaluation)

Based on the results obtained by testing both models, we can conclude that the **neural network** (with the architecture we defined) shows better overall performance when the metrics relevant to classification tasks are taken into account.
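
When reading these averaged metrics, it helps to keep in mind that the neural-network cell computes them with `average='weighted'`, while the LightGBM cell uses `average='macro'`. The toy example below, written in plain Python with made-up labels, shows how the two averages can differ on imbalanced data. The helper `f1_per_class` is hypothetical and only meant to make the arithmetic explicit.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score for one class, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy labels: class 0 is common and predicted well, class 1 is rare and predicted poorly.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]

labels = sorted(set(y_true))
scores = [f1_per_class(y_true, y_pred, c) for c in labels]
support = [y_true.count(c) for c in labels]

macro_f1 = sum(scores) / len(scores)                                      # every class counts equally
weighted_f1 = sum(s * n for s, n in zip(scores, support)) / sum(support)  # classes count by frequency

print(macro_f1, weighted_f1)
```

On the same predictions the weighted average comes out higher here because the frequent class dominates it, so figures computed with different averaging modes are not directly comparable.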