### Interactions Dataset Overview

| Column | Type    | Description                                                                                                                |
|--------|---------|----------------------------------------------------------------------------------------------------------------------------|
| `u`    | int64   | **User ID** – identifies the user who performed the interaction                                                            |
| `i`    | int64   | **Item ID** – identifies the item (e.g., book) the user interacted with                                                    |
| `t`    | float64 | **Timestamp** – the time when the interaction occurred, stored as a Unix timestamp (seconds since 1970-01-01 00:00:00 UTC) |
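
Since `t` holds Unix seconds, it can be converted to timezone-aware datetimes for inspection or calendar-based features. A minimal sketch, using a small made-up frame (`toy_interactions`, illustrative values only) in place of the real interactions data:

```python
import pandas as pd

# Toy stand-in for the interactions table (values are illustrative only).
toy_interactions = pd.DataFrame({
    "u": [1, 1, 2],
    "i": [10, 11, 10],
    "t": [0.0, 86400.0, 1700000000.0],
})

# Interpret `t` as seconds since the Unix epoch, in UTC.
toy_interactions["datetime"] = pd.to_datetime(
    toy_interactions["t"], unit="s", utc=True
)
```

The raw `t` column can still be used directly for sorting, since ordering by Unix seconds and ordering by datetime are equivalent.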

---

### Items Dataset Overview

| Column       | Type   | Description                                                                                                  |
|--------------|--------|--------------------------------------------------------------------------------------------------------------|
| `Title`      | object | **Book title** – the title of the book or item                                                               |
| `Author`     | object | **Author(s)** – the author(s) of the book; may be missing (NaN)                                              |
| `ISBN Valid` | object | **ISBN number(s)** – unique identifier(s) for the book; multiple ISBNs are separated by semicolons           |
| `Publisher`  | object | **Publisher** – the publishing company of the book                                                           |
| `Subjects`   | object | **Subjects or topics** – the thematic categories of the book; multiple subjects are separated by semicolons  |
| `i`          | int64  | **Item ID** – a unique numeric identifier for each item, corresponds to the `i` column in `interactions.csv` |


In [23]:
import pandas as pd
import scipy as sp
import numpy as np
from sklearn.preprocessing import MinMaxScaler

import src.data_loader as data_loader

In [24]:
# Load data
interactions = data_loader.load_interactions()  # u, i, t
items = data_loader.load_items()  # Title, Author, ISBN Valid, Publisher, Subjects, i

## Step 1 – Basic User & Book Features

**User Features**
- **Activity level**: total number of interactions per user
- **Preference statistics**: favorite subjects, authors, or genres
- **Interaction counts**: how many books each user has borrowed or rated

In [25]:
# User activity (number of interactions)
user_activity = interactions.groupby('u').size().reset_index(name='user_activity')

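# User's subject preference (count of interactions per subject).
# `Subjects` holds semicolon-separated values, so split it into one row per subject first.
interactions_with_cat = interactions.merge(items[['i','Subjects']], on='i', how='left')
interactions_with_cat['Subjects'] = interactions_with_cat['Subjects'].str.split(';')
interactions_with_cat = interactions_with_cat.explode('Subjects')
interactions_with_cat['Subjects'] = interactions_with_cat['Subjects'].str.strip()
user_category_pref = interactions_with_cat.groupby(['u','Subjects']).size().unstack(fill_value=0)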
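
The "favorite subjects" feature listed above can be read off a user-by-subject count matrix. A minimal sketch, assuming a small hand-made matrix (`toy_pref`, with made-up subjects and counts) shaped like `user_category_pref`:

```python
import pandas as pd

# Toy user-by-subject count matrix (rows: users, columns: subjects).
toy_pref = pd.DataFrame(
    {"Cooking": [1, 3], "Europe": [1, 0], "History": [2, 0]},
    index=pd.Index([1, 2], name="u"),
)

# Subject with the most interactions for each user.
# On ties, idxmax returns the first column that reaches the maximum.
favorite_subject = toy_pref.idxmax(axis=1)

# Share of each user's interactions per subject (each row sums to 1).
pref_share = toy_pref.div(toy_pref.sum(axis=1), axis=0)
```

Row-normalized shares are often easier to compare across users than raw counts, because very active users no longer dominate.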

**Book Features**
- **Author**: the creator of the book
- **Subjects / Categories**: main topics or genres
- **Publisher**: publishing house
- **Popularity**: number of interactions or borrows

In [26]:
# Item popularity
item_popularity = interactions['i'].value_counts().reset_index()
item_popularity.columns = ['i','popularity']

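# Merge book info; items with no interactions get popularity 0.
# Only `popularity` is filled, so missing metadata (e.g. Author) stays NaN.
items_features = items.merge(item_popularity, on='i', how='left')
items_features['popularity'] = items_features['popularity'].fillna(0).astype(int)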

## Step 2 – Encoding & Embeddings (for advanced models)

- **Discrete features** (Author, Publisher, Category) can be transformed using:
  - **One-hot encoding**: simple binary representation
  - **Embeddings**: learn dense vector representations for high-capacity models
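
Both options can be sketched with pandas alone. The example below is an illustration on a made-up `toy_items` frame (publisher names `A` and `B` are placeholders), not the notebook's actual encoding step:

```python
import pandas as pd

# Toy stand-in for the items table (publisher names are placeholders).
toy_items = pd.DataFrame({
    "i": [10, 11, 12],
    "Publisher": ["A", "B", "A"],
})

# One-hot encoding: one binary column per distinct publisher.
publisher_onehot = pd.get_dummies(toy_items["Publisher"], prefix="pub", dtype=int)

# Integer codes: the usual input to an embedding layer, which maps each
# code to a learned dense vector. Missing values would receive code -1.
publisher_codes, publisher_vocab = pd.factorize(toy_items["Publisher"])
```

One-hot columns grow with the number of distinct values, so for high-cardinality fields such as `Author` the integer codes are usually the more practical starting point.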

## Step 3 – Build User-Book Interaction Matrix

- Construct the **user-item matrix** representing interactions between users and books
- This matrix is used for:
  - **Collaborative Filtering (CF)**
  - **Matrix Factorization (MF)**
  - **Embedding-based models**
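
One possible way to build this matrix is a SciPy sparse matrix, which stores only the observed interactions. The sketch below uses a made-up `toy_interactions` frame and treats each interaction as an implicit count of 1 (an assumption; the notebook does not state which values the matrix should hold):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for the interactions table (values are illustrative only).
toy_interactions = pd.DataFrame({"u": [7, 7, 9, 9], "i": [100, 200, 100, 100]})

# Map raw IDs to contiguous row/column positions.
user_codes, user_ids = pd.factorize(toy_interactions["u"])
item_codes, item_ids = pd.factorize(toy_interactions["i"])

# Each interaction contributes 1; duplicate (user, item) pairs are summed
# when the CSR matrix is constructed.
values = np.ones(len(toy_interactions), dtype=np.float32)
user_item = csr_matrix(
    (values, (user_codes, item_codes)),
    shape=(len(user_ids), len(item_ids)),
)
```

`user_ids` and `item_ids` keep the mapping from matrix positions back to the original `u` and `i` values, which is needed to translate model output into item IDs.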

## Step 4 – Handle Cold Start

- **Identify cold-start users and items** (e.g., fewer than 2 interactions)
- For **cold users**:
  - Recommend **Top-Popular books**
  - Use **Content-based recommendations** based on item metadata

In [27]:
# Cold users (fewer than 2 interactions)
cold_users = user_activity.loc[user_activity['user_activity'] < 2, 'u'].tolist()
# Cold books (fewer than 2 interactions), including books with no interactions at all
cold_books = items_features.loc[items_features['popularity'] < 2, 'i'].tolist()

# Add flags (vectorized comparison instead of a per-row list lookup)
user_activity['cold_user'] = (user_activity['user_activity'] < 2).astype(int)
items_features['cold_item'] = (items_features['popularity'] < 2).astype(int)
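
The Top-Popular fallback for cold users could look roughly like this. It is a sketch on made-up data; `top_k` and the choice to skip items a user has already seen are assumptions, not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for the interactions table (values are illustrative only).
toy_interactions = pd.DataFrame({
    "u": [1, 1, 1, 2, 3, 3],
    "i": [10, 11, 12, 10, 10, 11],
})

# Rank items by interaction count, most popular first.
ranked_items = toy_interactions["i"].value_counts().index.tolist()
top_k = 2  # hypothetical list length

# Users with fewer than 2 interactions are treated as cold.
activity = toy_interactions.groupby("u").size()
cold = activity[activity < 2].index.tolist()

# Items each user has already interacted with.
seen = toy_interactions.groupby("u")["i"].agg(set)

# Recommend the most popular items the cold user has not seen yet.
cold_recs = {
    user: [item for item in ranked_items if item not in seen[user]][:top_k]
    for user in cold
}
```

Here user 2 is the only cold user and receives the two most popular books they have not yet interacted with.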

## Step 5 – Feature Normalization / Standardization

- Apply to **numerical features**, such as:
  - User activity level
  - Book popularity
- Ensures features are on a **comparable scale** for models (e.g., MF, embeddings, neural networks)

In [28]:
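# Use one scaler per feature so each keeps its own fitted min/max.
user_scaler = MinMaxScaler()
item_scaler = MinMaxScaler()
user_activity['user_activity_scaled'] = user_scaler.fit_transform(user_activity[['user_activity']]).ravel()
items_features['popularity_scaled'] = item_scaler.fit_transform(items_features[['popularity']]).ravel()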

## Step 6 – Build Training / Validation Sets

- **Split interactions** by:
  - Time-based split (train on past, validate on future)
  - Random split (if no temporal order needed)
- **Cold-start users/items** should be **tested separately** to evaluate handling of unseen cases

In [29]:
# Sort by timestamp
interactions = interactions.sort_values('t')
# 80% train, 10% val, 10% test
train_idx = int(0.8*len(interactions))
val_idx = int(0.9*len(interactions))

train_data = interactions.iloc[:train_idx]
val_data = interactions.iloc[train_idx:val_idx]
test_data = interactions.iloc[val_idx:]
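
To evaluate cold-start cases separately, the held-out rows can be partitioned by whether their user and item already appear in the training data. A minimal sketch on a made-up frame (`toy`), using a single 80/20 time-based split for brevity:

```python
import pandas as pd

# Toy interactions, already ordered by timestamp (values are illustrative only).
toy = pd.DataFrame({
    "u": [1, 2, 1, 2, 3, 1, 4, 2, 3, 5],
    "i": [10, 10, 11, 12, 10, 12, 11, 13, 11, 10],
    "t": [float(k) for k in range(10)],
}).sort_values("t")

# Time-based split: first 80% for training, the rest held out.
train_end = int(0.8 * len(toy))
train, holdout = toy.iloc[:train_end], toy.iloc[train_end:]

# A held-out row is "warm" only if both its user and its item appear in train.
known_user = holdout["u"].isin(train["u"])
known_item = holdout["i"].isin(train["i"])
warm_holdout = holdout[known_user & known_item]
cold_holdout = holdout[~(known_user & known_item)]
```

Reporting metrics on `warm_holdout` and `cold_holdout` separately shows how much of the overall score depends on users or items the model has never seen.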