**Predicting perfume ratings with XGBoost and Random Forest**

First, import all the packages needed for the analysis.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
import re
from ast import literal_eval


The function *parse_season_ratings* parses the season ratings field of the dataset (and is reused for the day/night ratings, which share the same format), while *consolidate_notes* merges the top, middle and base notes into a single collection.

In [2]:
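def parse_season_ratings(rating_str):
    """Parse 'Label: 12.3%' pairs into a {label: percent} dict."""
    # One capture group for the label, one for the numeric percentage.
    pattern = r'([A-Za-z]+):\s*([0-9.]+)%'
    return {season: float(percent) for season, percent in re.findall(pattern, rating_str)}

def consolidate_notes(notes):
    """Flatten the top, middle and base notes of one perfume into one list."""
    all_notes = []
    for note_type in ['Top Notes', 'Middle Notes', 'Base Notes']:
        # Some perfumes do not list every note type, so skip the missing ones.
        if note_type in notes:
            all_notes.extend(notes[note_type])
    return all_notes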
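
To make the behavior of the two helpers concrete, here is a minimal sketch that runs them on toy input. The sample string and the sample dictionary are assumptions chosen to match the regular expression and the keys used above; they are not rows from the actual dataset.

```python
import re

def parse_season_ratings(rating_str):
    pattern = r'([A-Za-z]+):\s*([0-9.]+)%'
    return {label: float(percent) for label, percent in re.findall(pattern, rating_str)}

def consolidate_notes(notes):
    all_notes = []
    for note_type in ['Top Notes', 'Middle Notes', 'Base Notes']:
        if note_type in notes:
            all_notes.extend(notes[note_type])
    return all_notes

# Hypothetical raw cell value; the real column may be formatted differently.
season_cell = "Winter: 40.5% Spring: 20% Summer: 10.5% Fall: 29%"
seasons = parse_season_ratings(season_cell)
print(seasons)

# Hypothetical notes dictionary with no 'Middle Notes' entry.
notes_cell = {'Top Notes': ['Bergamot'], 'Base Notes': ['Musk', 'Vanilla']}
merged = consolidate_notes(notes_cell)
print(merged)
```

A missing note type is simply skipped, and any label that does not appear in the string is absent from the returned dictionary, which is why the later code reads the values with `.get(..., 0)`.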


Loading the dataset

In [3]:
file_path = "../datasets/mainDataset.csv"  
data = pd.read_csv(file_path, delimiter='|')

Extracting the relevant fields from the dataset that will serve as input to the algorithms.
Various combinations of fields were tried. (TODO: agree with Vukasin on what to write here)

In [4]:
data['Accords'] = data['Accords'].apply(literal_eval)
data['Notes'] = data['Notes'].apply(literal_eval)
data['Votes'] = data['Rating'].apply(lambda x: literal_eval(x)['votes'])
data['Rating'] = data['Rating'].apply(lambda x: literal_eval(x)['rating'])
data['Season ratings'] = data['Season ratings'].apply(parse_season_ratings)
data['Day ratings'] = data['Day ratings'].apply(parse_season_ratings)
data['Designers'] = data['Designers'].apply(literal_eval)
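
The cell above relies on `literal_eval` to turn string cells back into Python objects. The following sketch shows the idea on two hypothetical cell values; the exact contents of the `Rating` and `Accords` columns are assumptions inferred from the keys and types used in the code.

```python
from ast import literal_eval

# Hypothetical raw cells as they might appear after pd.read_csv (plain strings).
raw_rating = "{'rating': 4.21, 'votes': 1532}"
raw_accords = "['woody', 'citrus']"

# literal_eval only evaluates Python literals (dicts, lists, numbers, strings),
# so unlike eval() it does not execute arbitrary code found in the file.
parsed_rating = literal_eval(raw_rating)
accords = literal_eval(raw_accords)

votes = parsed_rating['votes']    # read first, while the dict is still available
rating = parsed_rating['rating']  # then keep only the numeric rating
print(rating, votes, accords)
```

The order matters in the original cell for the same reason: `Votes` is extracted before the `Rating` column is overwritten with the numeric value.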


Build one record per perfume and assemble the records into a DataFrame that holds every perfume from the dataset in a form suitable for modeling; the All Notes column holds all notes of a SINGLE perfume.

In [5]:
records = []
for _, row in data.iterrows():
    records.append({
        "Brand": row["Brand"],
        "Gender": row["Gender"],
        "Longevity": row["Longevity"],
        "Sillage": row["Sillage"],
        "Rating": row["Rating"],
        "Votes" : row["Votes"],
        "Season_Winter": row["Season ratings"].get("Winter", 0),
        "Season_Spring": row["Season ratings"].get("Spring", 0),
        "Season_Summer": row["Season ratings"].get("Summer", 0),
        "Season_Fall": row["Season ratings"].get("Fall", 0),
        "Day": row["Day ratings"].get("Day", 0),
        "Night": row["Day ratings"].get("Night", 0)
    })

structured_df = pd.DataFrame(records)
structured_df['All Notes'] = data['Notes'].apply(consolidate_notes)


The categorical columns Brand and Gender are converted to one-hot encoded columns, while every unique note found in All Notes is converted to a binary indicator column that marks whether that note is present in the perfume.

In [6]:
ohe = OneHotEncoder(sparse_output=False)
ohe_features = pd.DataFrame(
    ohe.fit_transform(structured_df[['Brand', 'Gender']]),
    columns=ohe.get_feature_names_out(['Brand', 'Gender'])
)

all_unique_notes = set(note for notes in structured_df['All Notes'] for note in notes)
note_df = pd.DataFrame(
    {f'Note_{note}': structured_df['All Notes'].apply(lambda x: 1 if note in x else 0)
     for note in all_unique_notes}
)

structured_df = pd.concat([structured_df, note_df], axis=1)

X = pd.concat([ohe_features, structured_df.drop(columns=['Brand', 'Gender', 'Rating', 'All Notes'])], axis=1)
y = structured_df['Rating']
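
The note encoding above can be illustrated without pandas. This is a minimal sketch on three made-up perfumes, not the notebook's implementation; the variable names are hypothetical. It sorts the note vocabulary, which is an addition: iterating over a plain `set` of strings gives no guaranteed order, so sorting keeps the column order stable between runs.

```python
# Toy stand-in for the 'All Notes' column: one list of notes per perfume.
all_notes_per_perfume = [
    ["Bergamot", "Musk"],
    ["Vanilla"],
    ["Musk", "Vanilla", "Rose"],
]

# Sorted vocabulary of every note that occurs at least once.
vocabulary = sorted({note for notes in all_notes_per_perfume for note in notes})
columns = [f"Note_{note}" for note in vocabulary]

# One row per perfume, one 0/1 indicator per note in the vocabulary.
matrix = [
    [int(note in notes) for note in vocabulary]
    for notes in all_notes_per_perfume
]

print(columns)
for row in matrix:
    print(row)
```

Each perfume ends up with as many indicator columns as there are distinct notes in the whole dataset, so the feature matrix is wide and mostly zeros.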


In [7]:

# Hold out 20% of the rows as a test set; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [8]:

# Random Forest with default hyperparameters.
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)

print("Random Forest R^2:", rf_model.score(X_test, y_test))
print("Random Forest MAE:", mean_absolute_error(y_test, y_pred))
print("Random Forest RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))


Random Forest R^2: 0.3775266246447101
Random Forest MAE: 0.15133368146214102
Random Forest RMSE: 0.19743249499089294
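
For reading the three reported numbers, it can help to see how they are defined. The sketch below computes R^2, MAE and RMSE by hand on made-up values; `regression_metrics` is a hypothetical helper written for this illustration, not part of scikit-learn, and the toy ratings are not taken from the dataset.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R^2, MAE and RMSE for two equally long sequences of numbers."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)               # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
    }

y_true = [3.0, 4.0, 5.0]
model_scores = regression_metrics(y_true, [3.5, 4.0, 4.5])
# Baseline that always predicts the mean of the true values.
baseline_scores = regression_metrics(y_true, [4.0, 4.0, 4.0])
print(model_scores)
print(baseline_scores)
```

A model that always predicts the mean rating scores an R^2 of 0, so R^2 measures how much of the variation in ratings is explained beyond that baseline, while MAE and RMSE are expressed in the same units as the rating itself.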


In [9]:

# XGBoost regressor with default hyperparameters.
xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train, y_train)
y_xgb_pred = xgb_model.predict(X_test)

print("XGBoost R^2:", xgb_model.score(X_test, y_test))
print("XGBoost MAE:", mean_absolute_error(y_test, y_xgb_pred))
print("XGBoost RMSE:", np.sqrt(mean_squared_error(y_test, y_xgb_pred)))


XGBoost R^2: 0.3314560345863199
XGBoost MAE: 0.15685051415047505
XGBoost RMSE: 0.20460829204292488
