# Australian Steam Recommendation – Baselines and Stronger Model

This notebook assumes you have already created `australian_steam_merged.csv` from the data-prep notebook.

We will:
1. Load the merged dataset
2. Build **text-only baselines**:
   - Majority class baseline
   - TF-IDF + Naive Bayes
   - TF-IDF + Logistic Regression
3. Build a **stronger model** that combines:
   - TF-IDF text features
   - Numeric user–item and game features (e.g., playtime, items_count, price, metascore)
4. Compare the models on accuracy, positive-class F1, macro-F1, and ROC-AUC.
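
To see why several metrics are reported rather than accuracy alone, here is a minimal sketch on made-up labels (8 positives, 2 negatives). The `toy_*` names are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Made-up imbalanced labels: 8 positives, 2 negatives
toy_y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

# A "model" that always predicts the positive class with a constant score
toy_y_pred = np.ones_like(toy_y_true)
toy_y_score = np.full(len(toy_y_true), 0.5)

toy_acc = accuracy_score(toy_y_true, toy_y_pred)        # 0.8
toy_f1_pos = f1_score(toy_y_true, toy_y_pred)           # 8/9, about 0.889
toy_f1_macro = f1_score(toy_y_true, toy_y_pred,
                        average='macro', zero_division=0)  # 4/9, about 0.444
toy_auc = roc_auc_score(toy_y_true, toy_y_score)        # 0.5: constant scores cannot rank

print(toy_acc, toy_f1_pos, toy_f1_macro, toy_auc)
```

Accuracy and positive-class F1 look respectable here even though the classifier never identifies a negative review. Macro-F1 and ROC-AUC expose that.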


In [None]:
import numpy as np
import pandas as pd
from pathlib import Path

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from scipy.sparse import hstack

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 180)

print('Libraries imported.')

## 1. Load Merged Dataset

We start from the preprocessed file `australian_steam_merged.csv`, where each row is a review with:
- `review_text`
- `label` (0/1 for not recommended / recommended)
- user–item features (`playtime_forever`, `playtime_2weeks`, `items_count`)
- game features (`price`, `metascore`, etc., depending on availability).

In [None]:
data_path = Path('australian_steam_merged.csv')
if not data_path.exists():
    raise FileNotFoundError(f'Cannot find {data_path}. Please run the data prep notebook first.')

df = pd.read_csv(data_path)
print('Loaded merged dataset with shape:', df.shape)
display(df.head())

## 2. Basic Cleaning and Train/Test Split

We keep only rows with non-null `review_text` and `label`, then create an 80/20 train/test split.
Because the classes are strongly imbalanced (most reviews are positive), we stratify on the label so that both splits keep the same class proportions.
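
The effect of stratification can be checked on a small made-up label vector. This is only a sketch; `toy_X` and `toy_y` are placeholders, not columns from the merged dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 90 positives, 10 negatives
toy_y = np.array([1] * 90 + [0] * 10)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder features

_, _, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.2,
    random_state=42,
    stratify=toy_y,  # keep the 90/10 ratio in both splits
)

print('Train positive rate:', toy_y_tr.mean())
print('Test positive rate :', toy_y_te.mean())
```

Without `stratify`, a small test split could by chance contain very few negative reviews, which would make the per-class metrics noisy.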

In [None]:
# Drop rows with missing text or label
df = df.dropna(subset=['review_text', 'label']).copy()
df['review_text'] = df['review_text'].astype(str)
df['label'] = df['label'].astype(int)

print('After cleaning:', df.shape)
print('Label distribution:')
print(df['label'].value_counts(normalize=True))

X_text = df['review_text']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X_text, y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

print('Train size:', len(X_train))
print('Test size:', len(X_test))

## 3. TF-IDF Text Representation

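We represent `review_text` using a TF-IDF bag-of-words model with unigrams and bigrams, capped at 50,000 features and ignoring terms that occur in fewer than 3 training reviews.
The vectorizer is fit on the training split only and is reused by all text-based models.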
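
As a small illustration of what the vectorizer produces, the sketch below fits unigram and bigram features on three made-up reviews and then transforms an unseen one. The `toy_*` names and the example strings are assumptions for demonstration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train_docs = ['great game', 'great fun', 'bad game']
toy_test_docs = ['great bad unknownword']

toy_vec = TfidfVectorizer(ngram_range=(1, 2))
toy_train_mat = toy_vec.fit_transform(toy_train_docs)  # learns vocabulary and IDF weights
toy_test_mat = toy_vec.transform(toy_test_docs)        # reuses them, learns nothing new

print(sorted(toy_vec.vocabulary_))
print('Train shape:', toy_train_mat.shape)
print('Non-zero test features:', toy_test_mat.nnz)
```

Terms and bigrams that were not seen during fitting (here `unknownword` and `great bad`) are silently dropped at transform time, which is why the vectorizer must be fit on the training split only.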

In [None]:
tfidf = TfidfVectorizer(
    max_features=50_000,
    ngram_range=(1, 2),
    min_df=3,
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print('TF-IDF train shape:', X_train_tfidf.shape)
print('TF-IDF test shape:', X_test_tfidf.shape)

## 4. Baseline 1 – Majority Class Predictor

We first consider a trivial baseline that always predicts the most frequent class in the training data.
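
An equivalent way to express this baseline is scikit-learn's `DummyClassifier`. The sketch below uses made-up labels and placeholder features, and it is an alternative formulation rather than the approach used in the cell that follows.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

toy_y_train = np.array([1, 1, 1, 0])
toy_y_test = np.array([1, 0, 1, 1, 0])

# DummyClassifier ignores the features, so zero-filled placeholders suffice
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_test = np.zeros((len(toy_y_test), 1))

toy_dummy = DummyClassifier(strategy='most_frequent')
toy_dummy.fit(toy_X_train, toy_y_train)

toy_dummy_pred = toy_dummy.predict(toy_X_test)
toy_dummy_acc = accuracy_score(toy_y_test, toy_dummy_pred)
print('Predictions:', toy_dummy_pred, 'Accuracy:', toy_dummy_acc)
```

The accuracy of such a baseline equals the share of the majority class in the test labels, which is the number every later model has to beat.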

In [None]:
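# Predict the most frequent training label for every test row
majority_class = y_train.mode()[0]
y_pred_majority = np.full(len(y_test), majority_class)

acc_majority = accuracy_score(y_test, y_pred_majority)
# zero_division=0 silences the undefined-precision warning for the class that is never predicted
f1_majority = f1_score(y_test, y_pred_majority, zero_division=0)
f1_macro_majority = f1_score(y_test, y_pred_majority, average='macro', zero_division=0)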

print('Majority baseline:')
print('  Accuracy   :', acc_majority)
print('  F1 (pos)   :', f1_majority)
print('  F1 macro   :', f1_macro_majority)

## 5. Baseline 2 – TF-IDF + Naive Bayes

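Next, we train a Multinomial Naive Bayes classifier on the TF-IDF features. This is a classic baseline for text classification; although the multinomial model is formulated for integer counts, it also works in practice with fractional TF-IDF weights.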
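
The same combination can be written as a single pipeline, which keeps vectorizer fitting and classifier training together. The sketch below uses four made-up reviews; `toy_nb` and the example texts are hypothetical, and `alpha=1.0` is simply the library default for the smoothing parameter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

toy_reviews = [
    'great game love it',
    'love this great fun',
    'terrible boring game',
    'boring waste refund',
]
toy_labels = [1, 1, 0, 0]

# alpha is the additive (Laplace) smoothing strength
toy_nb = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=1.0))
toy_nb.fit(toy_reviews, toy_labels)

toy_nb_pred = toy_nb.predict(['great fun', 'boring refund'])
toy_nb_proba = toy_nb.predict_proba(['great fun'])
print(toy_nb_pred, toy_nb_proba)
```

Smoothing matters because a word that never occurs in one class would otherwise get zero probability there and veto that class outright.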

In [None]:
nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)

y_pred_nb = nb.predict(X_test_tfidf)
y_prob_nb = nb.predict_proba(X_test_tfidf)[:, 1]

acc_nb = accuracy_score(y_test, y_pred_nb)
f1_nb = f1_score(y_test, y_pred_nb)
f1_macro_nb = f1_score(y_test, y_pred_nb, average='macro')
roc_nb = roc_auc_score(y_test, y_prob_nb)

print('Naive Bayes:')
print('  Accuracy   :', acc_nb)
print('  F1 (pos)   :', f1_nb)
print('  F1 macro   :', f1_macro_nb)
print('  ROC-AUC    :', roc_nb)

## 6. Baseline 3 – TF-IDF + Logistic Regression (Text Only)

We now train a logistic regression classifier on the TF-IDF features only. This is typically a strong baseline for text classification and provides a good reference point for more complex models.

In [None]:
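# n_jobs is omitted: it only parallelizes one-vs-rest fits across classes, which does not apply to a binary label
clf_lr_text = LogisticRegression(max_iter=2000)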
clf_lr_text.fit(X_train_tfidf, y_train)

y_pred_lr = clf_lr_text.predict(X_test_tfidf)
y_prob_lr = clf_lr_text.predict_proba(X_test_tfidf)[:, 1]

acc_lr = accuracy_score(y_test, y_pred_lr)
f1_lr = f1_score(y_test, y_pred_lr)
f1_macro_lr = f1_score(y_test, y_pred_lr, average='macro')
roc_lr = roc_auc_score(y_test, y_prob_lr)

print('Logistic Regression (text only):')
print('  Accuracy   :', acc_lr)
print('  F1 (pos)   :', f1_lr)
print('  F1 macro   :', f1_macro_lr)
print('  ROC-AUC    :', roc_lr)

## 7. Stronger Model – TF-IDF + Numeric Features (User–Item + Game)

To build a stronger model, we augment the TF-IDF text representation with numeric features such as:
- `playtime_forever`, `playtime_2weeks`
- `items_count`
- `price`, `metascore` (if available)

We then train a single logistic regression model on the concatenated feature space.
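
An alternative way to build the concatenated feature space is scikit-learn's `ColumnTransformer`, which applies a different transformer to each column and stacks the results. The sketch below is not the approach taken in the cells that follow (they use `scipy.sparse.hstack`). The toy frame and the `toy_*` names are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_df = pd.DataFrame({
    'review_text': ['great game love it', 'love this great fun',
                    'terrible boring game', 'boring waste refund'],
    'playtime_forever': [1200.0, 300.0, 15.0, 5.0],
    'label': [1, 1, 0, 0],
})
toy_cols = ['review_text', 'playtime_forever']

toy_features = ColumnTransformer([
    # A string selector passes a 1-D column, as TfidfVectorizer expects
    ('text', TfidfVectorizer(), 'review_text'),
    # A list selector passes a 2-D block, as StandardScaler expects
    ('num', StandardScaler(), ['playtime_forever']),
])

toy_model = Pipeline([
    ('features', toy_features),
    ('clf', LogisticRegression(max_iter=1000)),
])
toy_model.fit(toy_df[toy_cols], toy_df['label'])

toy_combined = toy_model.named_steps['features'].transform(toy_df[toy_cols])
print('Combined feature shape:', toy_combined.shape)
```

The pipeline form fits the vectorizer and the scaler on whatever data is passed to `fit`, which makes it harder to leak test statistics by accident.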

In [None]:
# Identify numeric feature columns that exist in the dataset
candidate_num_cols = ['playtime_forever', 'playtime_2weeks', 'items_count', 'price', 'metascore']
num_cols = [c for c in candidate_num_cols if c in df.columns]

print('Base numeric feature columns:', num_cols)

# Coerce to numeric in case a column was read as text, then fill missing values with 0
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce').fillna(0)

# Add log-transformed versions of the heavily skewed playtime features
for col in ['playtime_forever', 'playtime_2weeks']:
    if col in df.columns:
        log_col = 'log_' + col
        df[log_col] = np.log1p(df[col])
        num_cols.append(log_col)

print('Final numeric feature columns:', num_cols)

# Align numeric features with train/test splits using the same indices as X_train/X_test
X_train_num = df.loc[X_train.index, num_cols].values
X_test_num = df.loc[X_test.index, num_cols].values

# Standardize numeric features
scaler = StandardScaler()
X_train_num_scaled = scaler.fit_transform(X_train_num)
X_test_num_scaled = scaler.transform(X_test_num)

print('Numeric train shape:', X_train_num_scaled.shape)
print('Numeric test shape:', X_test_num_scaled.shape)
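
To illustrate why the playtime columns get a `log1p` version before scaling, here is a small sketch on made-up playtime values. The `toy_*` names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up playtimes in minutes: one heavy player dominates the raw scale
toy_playtime = np.array([[0.0], [10.0], [100.0], [10000.0]])

# log1p(x) = log(1 + x) is defined at 0 and compresses large values
toy_log = np.log1p(toy_playtime)

toy_raw_scaled = StandardScaler().fit_transform(toy_playtime)
toy_log_scaled = StandardScaler().fit_transform(toy_log)

print('Scaled raw:', toy_raw_scaled.ravel().round(2))
print('Scaled log:', toy_log_scaled.ravel().round(2))
```

On the raw scale the three smaller values end up almost indistinguishable after standardization, while the log version spreads them out. That is the motivation for adding the `log_` columns.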

In [None]:
# Combine TF-IDF (sparse) and numeric (dense) features; request CSR explicitly for efficient row access
X_train_full = hstack([X_train_tfidf, X_train_num_scaled], format='csr')
X_test_full = hstack([X_test_tfidf, X_test_num_scaled], format='csr')

print('Combined train shape:', X_train_full.shape)
print('Combined test shape:', X_test_full.shape)

In [None]:
# Train logistic regression on combined features
clf_full = LogisticRegression(max_iter=2000)  # n_jobs dropped: it has no effect for a binary label
clf_full.fit(X_train_full, y_train)

y_pred_full = clf_full.predict(X_test_full)
y_prob_full = clf_full.predict_proba(X_test_full)[:, 1]

acc_full = accuracy_score(y_test, y_pred_full)
f1_full = f1_score(y_test, y_pred_full)
f1_macro_full = f1_score(y_test, y_pred_full, average='macro')
roc_full = roc_auc_score(y_test, y_prob_full)

print('Stronger model (TF-IDF + numeric features):')
print('  Accuracy   :', acc_full)
print('  F1 (pos)   :', f1_full)
print('  F1 macro   :', f1_macro_full)
print('  ROC-AUC    :', roc_full)

## 8. Model Comparison Table

Finally, we collect the metrics of all models in a single table for the report and presentation. The majority baseline has no ROC-AUC entry because it produces no scores to rank.
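
Since the same four metrics are computed for every model, the rows of such a table could also be produced by a small helper. The sketch below shows one possible shape on made-up predictions. `summarize_metrics` is a hypothetical function, not part of this notebook or of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def summarize_metrics(name, y_true, y_pred, y_score=None):
    """Return one results row as a dict (hypothetical helper)."""
    return {
        'Model': name,
        'Accuracy': accuracy_score(y_true, y_pred),
        'F1 (pos)': f1_score(y_true, y_pred, zero_division=0),
        'F1 macro': f1_score(y_true, y_pred, average='macro', zero_division=0),
        # ROC-AUC needs scores; leave NaN for models without them
        'ROC-AUC': roc_auc_score(y_true, y_score) if y_score is not None else np.nan,
    }

toy_true = np.array([1, 1, 1, 0])
toy_rows = [
    summarize_metrics('always positive', toy_true, np.ones(4, dtype=int)),
    summarize_metrics('toy scorer', toy_true,
                      np.array([1, 1, 0, 0]), np.array([0.9, 0.8, 0.4, 0.2])),
]
toy_results = pd.DataFrame(toy_rows).round(3)
print(toy_results)
```

Rounding to three decimals keeps the table readable without hiding meaningful differences between models.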

In [None]:
results = [
    ['Majority', acc_majority, f1_majority, f1_macro_majority, None],
    ['TF-IDF + Naive Bayes', acc_nb, f1_nb, f1_macro_nb, roc_nb],
    ['TF-IDF + Logistic Regression (text only)', acc_lr, f1_lr, f1_macro_lr, roc_lr],
    ['TF-IDF + Logistic Regression (text + numeric)', acc_full, f1_full, f1_macro_full, roc_full],
]

results_df = pd.DataFrame(results, columns=['Model', 'Accuracy', 'F1 (pos)', 'F1 macro', 'ROC-AUC'])
display(results_df)