# Steam Review Helpfulness Prediction

This notebook builds an end-to-end pipeline for predicting **whether a Steam review is helpful**, using:

1. **Data loading & merging**
   - Kaggle Steam review dataset (CSV)
   - `steam_games` metadata stored as one Python dict per line
2. **Simple baseline** – predict helpfulness from **review length only**
3. **TF–IDF + Logistic Regression** baseline
4. **DistilBERT model with game metadata**

All models use:
- **Balanced train set** (equal helpful / not helpful)
- **Realistic validation set** (original class proportions) for **threshold tuning**
- **Realistic test set** (original class proportions) for final evaluation




In [2]:
import os
import math
import random
import ast
from pathlib import Path

import numpy as np
import pandas as pd

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

RANDOM_STATE = 114514
np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)

DATA_DIR = Path("./data")  # adjust if needed

def load_python_dict_lines(path: Path):
    """Load a file where each line is a Python dict literal.

    Each line should look like:
    {'publisher': '...', 'genres': [...], 'id': '4570', ...}

    Returns a list of dicts.
    """
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")

    objs = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                obj = ast.literal_eval(line)
                objs.append(obj)
            except Exception:
                # Skip malformed lines silently
                continue
    return objs


## 1. Load & Merge Data

We assume:

- A large Kaggle-style CSV with columns at least:
  - `app_id`, `app_name`, `review_text`, `review_score`, `review_votes`
- A `steam_games` file where **each line is a Python dict literal**, e.g.:
  `{'publisher': 'SEGA', 'genres': [...], 'id': '4570', ...}`

Adjust the filenames / column names below as needed to match your actual files.
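
As a minimal sketch of why the loader relies on `ast.literal_eval` rather than a JSON parser: a line written as a Python dict literal uses single quotes, which `json.loads` rejects. The sample line below is hypothetical (the genre values are made up for illustration); only its shape follows the format described above.

```python
import ast
import json

# Hypothetical metadata line in dict-per-line format (values are illustrative).
line = "{'publisher': 'SEGA', 'genres': ['Action', 'Strategy'], 'id': '4570'}"

# JSON requires double-quoted strings, so this line is not valid JSON.
try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval safely evaluates Python literals (dicts, lists, strings,
# numbers) without executing arbitrary code.
record = ast.literal_eval(line)

print("Valid JSON?", parsed_as_json)
print("Parsed id:", record["id"], "| genres:", record["genres"])
```

If your metadata file turns out to be real JSON lines instead, `json.loads` per line would be the more natural choice.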


In [3]:
# Paths – change filenames to match your actual data
kaggle_csv_path = DATA_DIR / "dataset.csv"   # big CSV
games_path = DATA_DIR / "steam_games.json"              # or .txt; dict-per-line format

usecols = ["app_id", "app_name", "review_text", "review_score", "review_votes"]
print("Loading Kaggle CSV from:", kaggle_csv_path)
df_big = pd.read_csv(kaggle_csv_path, usecols=usecols, low_memory=False)
print("Raw Kaggle shape:", df_big.shape)
print(df_big.head())

df_big = df_big.dropna(subset=["review_text", "review_votes"]).copy()
df_big["review_text"] = df_big["review_text"].astype(str)
df_big["app_id"] = df_big["app_id"].astype(str)

print("\nLoading game metadata from:", games_path)
games_raw = load_python_dict_lines(games_path)
df_games = pd.DataFrame(games_raw)
df_games["id"] = df_games["id"].astype(str)
print("Games shape:", df_games.shape)
print(df_games.head())

df = df_big.merge(df_games, left_on="app_id", right_on="id", how="left", suffixes=("", "_game"))
print("\nMerged shape:", df.shape)


Loading Kaggle CSV from: data/dataset.csv
Raw Kaggle shape: (6417106, 5)
   app_id        app_name                                        review_text  \
0      10  Counter-Strike                                    Ruined my life.   
1      10  Counter-Strike  This will be more of a ''my experience with th...   
2      10  Counter-Strike                      This game saved my virginity.   
3      10  Counter-Strike  • Do you like original games? • Do you like ga...   
4      10  Counter-Strike           Easy to learn, hard to master.             

   review_score  review_votes  
0             1             0  
1             1             1  
2             1             0  
3             1             0  
4             1             1  

Loading game metadata from: data/steam_games.json
Games shape: (32135, 16)
          publisher                                             genres  \
0         Kotoshiro      [Action, Casual, Indie, Simulation, Strategy]   
1  Making Fun, Inc.           

## 2. Create Helpfulness Label

We define a binary helpfulness label from `review_votes`:

- **helpful = 1** if `review_votes >= 1` (has at least one helpful vote)
- **helpful = 0** otherwise


In [4]:
# Ensure review_votes is numeric
df["review_votes"] = pd.to_numeric(df["review_votes"], errors="coerce").fillna(0).astype(int)
df["helpful"] = (df["review_votes"] >= 1).astype(int)
df = df.dropna(subset=["review_text"]).copy()

print("Label distribution (raw):")
print(df["helpful"].value_counts())
print("\nLabel distribution (proportion):")
print(df["helpful"].value_counts(normalize=True))


Label distribution (raw):
helpful
0    5465356
1     944445
Name: count, dtype: int64

Label distribution (proportion):
helpful
0    0.852656
1    0.147344
Name: proportion, dtype: float64
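
With roughly 85% of reviews labelled not helpful, accuracy alone can be misleading, which is presumably why the later sections tune thresholds on macro-F1. The worked arithmetic below is a sketch on a hypothetical sample of 1,000 reviews (853 negative, 147 positive, chosen to mirror the proportions above) scored by a classifier that always predicts "not helpful".

```python
# Hypothetical sample mirroring the ~85/15 class split shown above.
n_neg, n_pos = 853, 147

# A trivial classifier that always predicts class 0 ("not helpful").
tp, fp = 0, 0        # it never predicts the positive class
fn = n_pos           # so every helpful review is missed
tn = n_neg           # and every unhelpful review is correct


def f1(true_pos, false_pos, false_neg):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0."""
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0


accuracy = (tp + tn) / (n_pos + n_neg)
f1_pos = f1(tp, fp, fn)
# For the negative class the roles swap: its "true positives" are tn,
# its false positives are fn, and its false negatives are fp.
f1_neg = f1(tn, fn, fp)
macro_f1 = (f1_pos + f1_neg) / 2

print(f"accuracy={accuracy:.3f}  f1_pos={f1_pos:.3f}  "
      f"f1_neg={f1_neg:.3f}  macro_f1={macro_f1:.3f}")
```

The always-negative classifier reaches about 0.85 accuracy but only about 0.46 macro-F1, so macro-F1 rewards a model only when it also finds helpful reviews.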


## 3. Build Train / Validation / Test Splits

- **Train**: balanced (same number of helpful and not-helpful reviews)
- **Validation (real)**: class proportions match full dataset
- **Test (real)**: class proportions match full dataset
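
The idea of drawing disjoint "realistic" splits can be sketched with a small helper. `sample_realistic` is a hypothetical name, not part of the notebook, and the sketch assumes the pool has a unique index so that sampled rows can be dropped by index.

```python
import pandas as pd


def sample_realistic(pool, total, pos_ratio, seed, label_col="helpful"):
    """Draw `total` rows with a fixed positive ratio; return (sample, remaining pool).

    Hypothetical helper; assumes `pool` has a unique index.
    """
    n_pos = int(total * pos_ratio)
    n_neg = total - n_pos
    pos = pool[pool[label_col] == 1]
    neg = pool[pool[label_col] == 0]
    picked = pd.concat([
        pos.sample(min(n_pos, len(pos)), random_state=seed),
        neg.sample(min(n_neg, len(neg)), random_state=seed + 1),
    ])
    # Drop the sampled rows so the next split cannot reuse them.
    remaining = pool.drop(picked.index)
    # Shuffle the sample but keep the original index for traceability.
    return picked.sample(frac=1.0, random_state=seed + 2), remaining


# Toy pool: 250 helpful and 750 not-helpful rows (25% positive).
toy = pd.DataFrame({"helpful": [1] * 250 + [0] * 750})

val, rest = sample_realistic(toy, total=40, pos_ratio=0.25, seed=0)
test, rest = sample_realistic(rest, total=40, pos_ratio=0.25, seed=1)

print(len(val), int(val["helpful"].sum()), len(test), len(rest))
```

Returning the remaining pool from each call makes it harder to sample validation and test rows from the same pool by accident.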


In [9]:
TRAIN_PER_CLASS = 100000
VAL_TOTAL = 10000
TEST_TOTAL = 20000

df_pos = df[df["helpful"] == 1].copy()
df_neg = df[df["helpful"] == 0].copy()

print("Positives available:", len(df_pos))
print("Negatives available:", len(df_neg))

if len(df_pos) == 0 or len(df_neg) == 0:
    raise ValueError("Dataset has only one helpfulness class after thresholding.")

# Overall class ratios in the full dataset
pos_ratio = len(df_pos) / len(df)
neg_ratio = len(df_neg) / len(df)
print(f"Overall pos_ratio={pos_ratio:.4f}, neg_ratio={neg_ratio:.4f}")

# -------------------------
# 3.1 Balanced TRAIN set
# -------------------------
train_pos_n = min(TRAIN_PER_CLASS, len(df_pos))
train_neg_n = min(TRAIN_PER_CLASS, len(df_neg))
train_pos = df_pos.sample(train_pos_n, random_state=RANDOM_STATE)
train_neg = df_neg.sample(train_neg_n, random_state=RANDOM_STATE)
df_train = pd.concat([train_pos, train_neg]).sample(frac=1.0, random_state=RANDOM_STATE).reset_index(drop=True)

# Remove train rows from pools
df_pos_rest = df_pos.drop(train_pos.index)
df_neg_rest = df_neg.drop(train_neg.index)

# -------------------------
# 3.2 REALISTIC VALIDATION set
# -------------------------
val_pos_target = int(VAL_TOTAL * pos_ratio)
val_neg_target = VAL_TOTAL - val_pos_target
val_pos_n = min(val_pos_target, len(df_pos_rest))
val_neg_n = min(val_neg_target, len(df_neg_rest))

val_pos = df_pos_rest.sample(val_pos_n, random_state=RANDOM_STATE + 1)
val_neg = df_neg_rest.sample(val_neg_n, random_state=RANDOM_STATE + 2)
df_val = pd.concat([val_pos, val_neg]).sample(frac=1.0, random_state=RANDOM_STATE + 3).reset_index(drop=True)

# Remove val rows from pools
df_pos_rest2 = df_pos_rest.drop(val_pos.index)
df_neg_rest2 = df_neg_rest.drop(val_neg.index)

# -------------------------
# 3.3 REALISTIC TEST set
# -------------------------
test_pos_target = int(TEST_TOTAL * pos_ratio)
test_neg_target = TEST_TOTAL - test_pos_target
test_pos_n = min(test_pos_target, len(df_pos_rest2))
test_neg_n = min(test_neg_target, len(df_neg_rest2))

test_pos = df_pos_rest2.sample(test_pos_n, random_state=RANDOM_STATE + 4)
test_neg = df_neg_rest2.sample(test_neg_n, random_state=RANDOM_STATE + 5)
df_test = pd.concat([test_pos, test_neg]).sample(frac=1.0, random_state=RANDOM_STATE + 6).reset_index(drop=True)

print("\nTrain distribution:")
print(df_train["helpful"].value_counts(normalize=True))
print("\nValidation distribution (realistic):")
print(df_val["helpful"].value_counts(normalize=True))
print("\nTest distribution (realistic):")
print(df_test["helpful"].value_counts(normalize=True))

display(df_test.head(6))

Positives available: 944445
Negatives available: 5465356
Overall pos_ratio=0.1473, neg_ratio=0.8527

Train distribution:
helpful
0    0.5
1    0.5
Name: proportion, dtype: float64

Validation distribution (realistic):
helpful
0    0.8527
1    0.1473
Name: proportion, dtype: float64

Test distribution (realistic):
helpful
0    0.8527
1    0.1473
Name: proportion, dtype: float64


| | app_id | app_name | review_text | review_score | review_votes | publisher | genres | app_name_game | title | url | ... | discount_price | reviews_url | specs | price | early_access | id | developer | sentiment | metascore | helpful |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22330 | The Elder Scrolls IV: Oblivion | Just buy it, you won't regret it! | 1 | 0 | Bethesda Softworks | [RPG] | The Elder Scrolls IV: Oblivion® Game of the Ye... | The Elder Scrolls IV: Oblivion® Game of the Ye... | http://store.steampowered.com/app/22330/The_El... | ... | | http://steamcommunity.com/app/22330/reviews/?b... | [Single-player, Steam Cloud] | 14.99 | False | 22330 | Bethesda Game Studios | Very Positive | 94.0 | 0 |
| 1 | 7940 | Call of Duty 4: Modern Warfare | Always been a fan of the Call of Duty franchis... | 1 | 0 | Activision, Aspyr (Mac) | [Action] | Call of Duty® 4: Modern Warfare® | Call of Duty® 4: Modern Warfare® | http://store.steampowered.com/app/7940/Call_of... | ... | | http://steamcommunity.com/app/7940/reviews/?br... | [Single-player, Multi-player] | 19.99 | False | 7940 | Infinity Ward,Aspyr (Mac) | Very Positive | 92.0 | 0 |
| 2 | 282070 | This War of Mine | Daytime gameplay is a little basic, but overal... | 1 | 0 | 11 bit studios | [Adventure, Indie, Simulation] | This War of Mine | This War of Mine | http://store.steampowered.com/app/282070/This_... | ... | | http://steamcommunity.com/app/282070/reviews/?... | [Single-player, Steam Achievements, Full contr... | 19.99 | False | 282070 | 11 bit studios | Overwhelmingly Positive | 83.0 | 0 |
| 3 | 226620 | Desktop Dungeons | Puzzle/strategy rogue-lite. Great to fire up f... | 1 | 0 | QCF Design | [Adventure, Casual, Indie, RPG, Strategy] | Desktop Dungeons | Desktop Dungeons | http://store.steampowered.com/app/226620/Deskt... | ... | | http://steamcommunity.com/app/226620/reviews/?... | [Single-player, Steam Achievements, Steam Trad... | 14.99 | False | 226620 | QCF Design | Very Positive | 82.0 | 0 |
| 4 | 286340 | FarSky | Great game. Loads of fun on it;however, the de... | -1 | 0 | Farsky Interactive | [Adventure, Indie] | FarSky | FarSky | http://store.steampowered.com/app/286340/FarSky/ | ... | | http://steamcommunity.com/app/286340/reviews/?... | [Single-player] | 4.99 | False | 286340 | Farsky Interactive | Mostly Positive | | 0 |
| 5 | 225260 | Brütal Legend | After Psychonauts (which I personally loved) D... | 1 | 0 | Double Fine Productions | [Action, Adventure, Strategy] | Brutal Legend | Brutal Legend | http://store.steampowered.com/app/225260/Bruta... | ... | | http://steamcommunity.com/app/225260/reviews/?... | [Single-player, Multi-player, Steam Achievemen... | 14.99 | False | 225260 | Double Fine Productions | Very Positive | 80.0 | 0 |


## 4. Prepare Game Metadata (for BERT)

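We will later fine-tune DistilBERT on both the review text and game metadata (genres, tags, price, metascore, sentiment).
The metadata processing is applied to the **combined train+val+test** DataFrame, which is then split back. Every transform here is row-wise with fixed cut-offs (nothing is fitted on the data), so combining the splits does not leak information from validation or test into training.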
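
Bucketing a numeric field into coarse text tokens can also be written in vectorized form. The sketch below uses `pandas.cut` as an alternative to a per-row Python function; it is not the notebook's own implementation. The cut-offs (60 and 80), the bucket names, and the `-1` sentinel for a missing metascore follow the code cell in this section, while the sample values are made up.

```python
import numpy as np
import pandas as pd

# Illustrative metascores; -1 stands for "missing", as in the notebook's fillna(-1).
metascores = pd.Series([-1, 45, 60, 79, 80, 94])

# right=False makes each bin closed on the left: [-inf, 0), [0, 60), [60, 80), [80, inf)
buckets = pd.cut(
    metascores,
    bins=[-np.inf, 0, 60, 80, np.inf],
    labels=["META_UNKNOWN", "META_LOW", "META_MEDIUM", "META_HIGH"],
    right=False,
).astype(str)

print(buckets.tolist())
```

On a few hundred thousand rows the difference from `.apply` is mostly a matter of speed; the resulting labels should be the same.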


In [11]:
def list_to_str(x):
    if isinstance(x, list):
        return " ".join(map(str, x))
    if isinstance(x, str):
        return x
    return ""

# Work on a combined frame so transforms are consistent
df_all = pd.concat([
    df_train.assign(_split="train"),
    df_val.assign(_split="val"),
    df_test.assign(_split="test"),
]).reset_index(drop=True)

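# DataFrame.get(col, "") would return a plain str if the column is missing,
# and str has no .apply; fall back to an empty-string Series instead.
empty_col = pd.Series("", index=df_all.index)
df_all["genres_str"] = df_all.get("genres", empty_col).apply(list_to_str)
df_all["tags_str"] = df_all.get("tags", empty_col).apply(list_to_str)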

def clean_price(x):
    if isinstance(x, str):
        if "free" in x.lower():
            return 0.0
        x = x.replace("$", "")
        try:
            return float(x)
        except Exception:
            return np.nan
    return x

if "price" in df_all.columns:
    df_all["price_num"] = df_all["price"].apply(clean_price)
else:
    df_all["price_num"] = 0.0

df_all["price_num"] = pd.to_numeric(df_all["price_num"], errors="coerce").fillna(0.0)

def price_bucket(p):
    if p == 0:
        return "FREE"
    if p < 10:
        return "CHEAP"
    if p < 30:
        return "MIDPRICE"
    return "EXPENSIVE"

df_all["price_bucket"] = df_all["price_num"].apply(price_bucket)

if "metascore" in df_all.columns:
    df_all["metascore_num"] = pd.to_numeric(df_all["metascore"], errors="coerce").fillna(-1)
else:
    df_all["metascore_num"] = -1

def metascore_bucket(m):
    if m < 0:
        return "META_UNKNOWN"
    if m < 60:
        return "META_LOW"
    if m < 80:
        return "META_MEDIUM"
    return "META_HIGH"

df_all["metascore_bucket"] = df_all["metascore_num"].apply(metascore_bucket)

if "sentiment" in df_all.columns:
    df_all["sentiment_str"] = df_all["sentiment"].fillna("")
else:
    df_all["sentiment_str"] = ""

if "review_score" in df_all.columns:
    df_all["review_score_num"] = pd.to_numeric(
        df_all["review_score"], errors="coerce"
    ).fillna(0)
else:
    df_all["review_score_num"] = 0

def review_score_bucket(s):
    if s == 1:
        return "RS_RECOMMEND"
    if s == -1:
        return "RS_NOT_RECOMMEND"
    return "RS_UNKNOWN"

df_all["review_score_bucket"] = df_all["review_score_num"].apply(review_score_bucket)

if "app_name" not in df_all.columns:
    df_all["app_name"] = df_all.get("title", "")

# Split back into train/val/test with enriched metadata
df_train = df_all[df_all["_split"] == "train"].drop(columns=["_split"]).reset_index(drop=True)
df_val = df_all[df_all["_split"] == "val"].drop(columns=["_split"]).reset_index(drop=True)
df_test = df_all[df_all["_split"] == "test"].drop(columns=["_split"]).reset_index(drop=True)


## 5. Simple Baseline – Review Length + Threshold Tuning

Steps:
- Compute review length (number of whitespace-separated tokens) on train/val/test.
- Use **train lengths** to propose candidate thresholds (quantiles).
- For each threshold, evaluate macro-F1 on the **validation set**.
- Pick the best threshold.
- Evaluate on the **realistic test set**.
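
The steps above can be sketched on toy data. `tune_threshold` is a hypothetical helper, and the lengths, labels, and candidate thresholds are invented for illustration; the same sweep applies to any score, whether a review length or a predicted probability.

```python
import numpy as np
from sklearn.metrics import f1_score


def tune_threshold(scores, y_true, candidates):
    """Return the candidate threshold with the highest macro-F1 (hypothetical helper)."""
    scores = np.asarray(scores)
    best_thr, best_f1 = None, -1.0
    for thr in candidates:
        preds = (scores >= thr).astype(int)   # predict "helpful" at or above thr
        f1 = f1_score(y_true, preds, average="macro")
        if f1 > best_f1:                      # ties keep the earliest candidate
            best_thr, best_f1 = float(thr), f1
    return best_thr, best_f1


# Toy validation data: review lengths and helpfulness labels.
val_lengths = np.array([3, 5, 8, 40, 60, 120, 150, 200])
val_labels = np.array([0, 0, 0, 0, 0, 1, 0, 1])

thr, f1 = tune_threshold(val_lengths, val_labels, candidates=[10, 50, 100, 180])
print(f"best threshold={thr:.0f}, macro-F1={f1:.4f}")
```

On this toy data the sweep selects 100: lower thresholds flag too many unhelpful reviews as helpful, and 180 misses one of the two helpful reviews.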


In [7]:
def text_len(s):
    return len(str(s).split())

train_lengths = df_train["review_text"].apply(text_len)
val_lengths = df_val["review_text"].apply(text_len)
test_lengths = df_test["review_text"].apply(text_len)

y_train_len = df_train["helpful"]
y_val_len = df_val["helpful"]
y_test_len = df_test["helpful"]

candidate_quantiles = np.linspace(0.1, 0.9, 17)
candidate_thresholds = [np.quantile(train_lengths, q) for q in candidate_quantiles]

best_thr = None
best_f1_macro = -1

for thr in candidate_thresholds:
    y_pred_val = (val_lengths >= thr).astype(int)
    f1_m = f1_score(y_val_len, y_pred_val, average="macro")
    if f1_m > best_f1_macro:
        best_f1_macro = f1_m
        best_thr = thr

print(f"Best length threshold (tokens): {best_thr:.2f}, val macro-F1 = {best_f1_macro:.4f}")

# Evaluate on test
y_pred_test = (test_lengths >= best_thr).astype(int)

acc_len = accuracy_score(y_test_len, y_pred_test)
f1_pos_len = f1_score(y_test_len, y_pred_test)
f1_macro_len = f1_score(y_test_len, y_pred_test, average="macro")

print("\nSimple length-based baseline (REALISTIC test):")
print("  Accuracy   :", acc_len)
print("  F1 (pos)   :", f1_pos_len)
print("  F1 macro   :", f1_macro_len)


Best length threshold (tokens): 106.00, val macro-F1 = 0.5180

Simple length-based baseline (REALISTIC test):
  Accuracy   : 0.7627
  F1 (pos)   : 0.1685353889278206
  F1 macro   : 0.515067881096361


## 6. TF–IDF + Logistic Regression + Threshold Tuning


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

X_train_text = df_train["review_text"]
y_train = df_train["helpful"]
X_val_text = df_val["review_text"]
y_val = df_val["helpful"]
X_test_text = df_test["review_text"]
y_test = df_test["helpful"]

tfidf = TfidfVectorizer(
    max_features=50000,
    ngram_range=(1, 2),
    min_df=5,
)

print("Fitting TF-IDF on training data...")
X_train_tfidf = tfidf.fit_transform(X_train_text)
X_val_tfidf = tfidf.transform(X_val_text)
X_test_tfidf = tfidf.transform(X_test_text)

clf_lr = LogisticRegression(
    max_iter=2000,
    n_jobs=-1,
    class_weight="balanced",
)

print("Training Logistic Regression...")
clf_lr.fit(X_train_tfidf, y_train)

y_prob_val = clf_lr.predict_proba(X_val_tfidf)[:, 1]

# Threshold tuning on validation set
thresholds = np.linspace(0.01, 0.99, 99)
best_thr_lr = 0.5
best_f1_macro_lr = -1

for thr in thresholds:
    y_pred_val = (y_prob_val >= thr).astype(int)
    f1_m = f1_score(y_val, y_pred_val, average="macro")
    if f1_m > best_f1_macro_lr:
        best_f1_macro_lr = f1_m
        best_thr_lr = thr

print(f"Best LR threshold on val: {best_thr_lr:.2f}, val macro-F1 = {best_f1_macro_lr:.4f}")

# Evaluate on test with tuned threshold
y_prob_test = clf_lr.predict_proba(X_test_tfidf)[:, 1]
y_pred_test_lr = (y_prob_test >= best_thr_lr).astype(int)

acc_lr = accuracy_score(y_test, y_pred_test_lr)
f1_pos_lr = f1_score(y_test, y_pred_test_lr)
f1_macro_lr = f1_score(y_test, y_pred_test_lr, average="macro")
roc_lr = roc_auc_score(y_test, y_prob_test)

print("\nTF-IDF + Logistic Regression (REALISTIC test, tuned threshold):")
print("  Accuracy   :", acc_lr)
print("  F1 (pos)   :", f1_pos_lr)
print("  F1 macro   :", f1_macro_lr)
print("  ROC-AUC    :", roc_lr)


Fitting TF-IDF on training data...
Training Logistic Regression...
Best LR threshold on val: 0.61, val macro-F1 = 0.5781

TF-IDF + Logistic Regression (REALISTIC test, tuned threshold):
  Accuracy   : 0.77675
  F1 (pos)   : 0.29048148736691565
  F1 macro   : 0.5790082103817698
  ROC-AUC    : 0.6677282480608898


## 7. DistilBERT with Game Metadata + Threshold Tuning


In [12]:
from datasets import Dataset
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    TrainingArguments,
    Trainer,
)

def build_bert_text(row):
    parts = []
    name = row.get("app_name", "")
    if isinstance(name, float):
        name = ""
    if name:
        parts.append(f"[GAME] {name}")
    parts.append(
        f"[PRICE] {row['price_bucket']} "
        f"[META] {row['metascore_bucket']} "
        f"[STORE_SENT] {row['sentiment_str']} "
        f"[USER_SCORE] {row['review_score_bucket']}"
    )
    if row.get("genres_str", ""):
        parts.append(f"[GENRES] {row['genres_str']}")
    if row.get("tags_str", ""):
        parts.append(f"[TAGS] {row['tags_str']}")
    parts.append(f"[REVIEW] {row['review_text']}")
    return " ".join(parts)

df_train["bert_text"] = df_train.apply(build_bert_text, axis=1)
df_val["bert_text"] = df_val.apply(build_bert_text, axis=1)
df_test["bert_text"] = df_test.apply(build_bert_text, axis=1)

X_train_b = df_train["bert_text"]
y_train_b = df_train["helpful"].astype(int)
X_val_b = df_val["bert_text"]
y_val_b = df_val["helpful"].astype(int)
X_test_b = df_test["bert_text"]
y_test_b = df_test["helpful"].astype(int)

train_ds = Dataset.from_dict({"text": X_train_b.tolist(), "label": y_train_b.tolist()})
val_ds = Dataset.from_dict({"text": X_val_b.tolist(), "label": y_val_b.tolist()})
test_ds = Dataset.from_dict({"text": X_test_b.tolist(), "label": y_test_b.tolist()})

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def tokenize_batch(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=256,
    )

train_ds = train_ds.map(tokenize_batch, batched=True)
val_ds = val_ds.map(tokenize_batch, batched=True)
test_ds = test_ds.map(tokenize_batch, batched=True)

train_ds = train_ds.remove_columns(["text"])
val_ds = val_ds.remove_columns(["text"])
test_ds = test_ds.remove_columns(["text"])
train_ds.set_format("torch")
val_ds.set_format("torch")
test_ds.set_format("torch")

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    acc = accuracy_score(labels, preds)
    f1_pos = f1_score(labels, preds)
    f1_macro = f1_score(labels, preds, average="macro")
    return {
        "accuracy": acc,
        "f1_pos": f1_pos,
        "f1_macro": f1_macro,
    }

training_args = TrainingArguments(
    output_dir="./bert_steam",
    do_train=True,
    do_eval=True,
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-5,
    weight_decay=0.01,
    logging_steps=1000,
    save_steps=1000,   # write a checkpoint every 1000 steps
    gradient_checkpointing=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,  # evaluate on validation; keep the test set held out
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)



print("Starting BERT training...")
trainer.train()

# Get validation probabilities for threshold tuning
val_pred = trainer.predict(val_ds)
val_logits = val_pred.predictions

import torch

val_probs = torch.softmax(torch.tensor(val_logits), dim=1).numpy()[:, 1]

thresholds = np.linspace(0.01, 0.99, 99)
best_thr_b = 0.5
best_f1_macro_b = -1

for thr in thresholds:
    y_pred_val_b = (val_probs >= thr).astype(int)
    f1_m = f1_score(y_val_b, y_pred_val_b, average="macro")
    if f1_m > best_f1_macro_b:
        best_f1_macro_b = f1_m
        best_thr_b = thr

print(f"Best BERT threshold on val: {best_thr_b:.2f}, val macro-F1 = {best_f1_macro_b:.4f}")

# Evaluate on test
test_pred = trainer.predict(test_ds)
test_logits = test_pred.predictions
test_probs = torch.softmax(torch.tensor(test_logits), dim=1).numpy()[:, 1]
y_pred_test_b = (test_probs >= best_thr_b).astype(int)

acc_b = accuracy_score(y_test_b, y_pred_test_b)
f1_pos_b = f1_score(y_test_b, y_pred_test_b)
f1_macro_b = f1_score(y_test_b, y_pred_test_b, average="macro")
roc_b = roc_auc_score(y_test_b, test_probs)

print("\nDistilBERT + game metadata (REALISTIC test, tuned threshold) results:")
print("  Accuracy   :", acc_b)
print("  F1 (pos)   :", f1_pos_b)
print("  F1 macro   :", f1_macro_b)
print("  ROC-AUC    :", roc_b)


Map: 100%|██████████| 200000/200000 [00:10<00:00, 19467.64 examples/s]
Map: 100%|██████████| 10000/10000 [00:00<00:00, 20473.97 examples/s]
Map: 100%|██████████| 20000/20000 [00:01<00:00, 18205.11 examples/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting BERT training...


| Step | Training Loss |
|---|---|
| 1000 | 0.6559 |
| 2000 | 0.6426 |
| 3000 | 0.6352 |
| 4000 | 0.6300 |
| 5000 | 0.6287 |
| 6000 | 0.6233 |
| 7000 | 0.6255 |
| 8000 | 0.6186 |
| 9000 | 0.6173 |
| 10000 | 0.6155 |


Best BERT threshold on val: 0.74, val macro-F1 = 0.6216



DistilBERT + game metadata (REALISTIC test, tuned threshold) results:
  Accuracy   : 0.79385
  F1 (pos)   : 0.3673469387755102
  F1 macro   : 0.6221049122094855
  ROC-AUC    : 0.7321295157564673
