# Restricted Boltzmann Machine (RBM)

This notebook demonstrates how to implement a Restricted Boltzmann Machine (RBM) using PyTorch to build a simple movie recommendation system. We use the **MovieLens 100K** dataset to model user preferences and recommend unseen movies based on learned interactions.

## Importing Libraries

In [None]:
# Data handling 
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# PyTorch for building the RBM
import torch
import torch.nn as nn
import torch.optim as optim

# Additional Utilities
import random
from sklearn.preprocessing import LabelEncoder

## Data Loading and Exploration

In [None]:
# Load the dataset
# u.data from MovieLens 100K (tab-separated: user_id, item_id, rating, timestamp)
df = pd.read_csv('data/u.data', sep='\t', names=['user_id', 'item_id', 'rating', 'timestamp'])
df

In [None]:
# Basic info about datatypes and nulls
print("Dataset Info:")
df.info()

In [None]:
# Unique users and movies
n_users = df['user_id'].nunique()
n_items = df['item_id'].nunique()
print(f"Number of unique users: {n_users}")
print(f"Number of unique items (movies): {n_items}")

In [None]:
# Set style and color palette
sns.set(style='whitegrid', palette='pastel', font_scale=1.1)
plt.rcParams['figure.figsize'] = (8, 5)

In [None]:
# Distribution of ratings
sns.countplot(data=df, x='rating', order=sorted(df['rating'].unique()), edgecolor='black')
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Number of Ratings')
plt.show()

We plotted how ratings are distributed across the dataset.

- Ratings **4 and 3 are the most frequent**, followed by 5.
- Comparatively few ratings fall below 3, which suggests users are more likely to give positive feedback.

**This informed our decision to binarize ratings**, treating ratings of 4 and 5 as liked (1) and ratings of 1 to 3 as not liked (0). Unrated movies are handled separately and are not counted as "not liked". This simplifies the modeling and focuses it on positive recommendations.

In [None]:
# User Activity
user_activity = df.groupby('user_id')['item_id'].count()
sns.histplot(user_activity, bins=30, edgecolor='black')
plt.title('User Rating Counts')
plt.xlabel('Number of Movies Rated')
plt.ylabel('Number of Users')
plt.show()

This plot shows how many movies each user has rated.

- Many users rated **relatively few movies** (fewer than 50).
- A small number of users are highly active, rating hundreds of movies.
- This creates a **long-tailed distribution** in which many users contribute little data.
 
Sparse user profiles can make it harder for the model to learn preferences accurately. However, we will proceed without filtering for now and observe model performance. Later, we may consider removing extremely inactive users to improve learning.
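
Before choosing a cutoff for inactive users, it can help to put numbers on the long tail. The sketch below uses a small made-up ratings frame (not MovieLens data) and a hypothetical helper, `activity_summary`, that is not part of this notebook. It reports the median number of ratings per user and the share of users below a chosen threshold.

```python
import pandas as pd

def activity_summary(ratings: pd.DataFrame, threshold: int) -> dict:
    """Summarize per-user rating counts (hypothetical helper for illustration)."""
    counts = ratings.groupby('user_id')['item_id'].count()
    return {
        'median_ratings': float(counts.median()),
        'share_below_threshold': float((counts < threshold).mean()),
    }

# Toy data for illustration only: user 1 rates 4 movies, user 2 rates 1, user 3 rates 2
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 3, 3],
    'item_id': [10, 11, 12, 13, 10, 11, 12],
})
summary = activity_summary(toy, threshold=3)
print(summary)
```

Running the same helper on `df` with a threshold such as 50 would show how much of the user base a filter at that level would remove.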

In [None]:
# Most Rated Movies
item_popularity = df.groupby('item_id')['user_id'].count()
sns.histplot(item_popularity, bins=30, edgecolor='black')
plt.title('Movie Rating Counts')
plt.xlabel('Number of Users per Movie')
plt.ylabel('Number of Movies')
plt.show()

This plot shows how often each movie was rated across users.

- A large number of movies are rated by **fewer than 20 users**.
- Only a few movies are rated by hundreds of users.
- The dataset is **sparse on the movie side**, so cold-start issues are likely for lesser-known movies.
- The model will tend to favor popular movies due to richer training data.

We will continue with all movies included for now and evaluate the model's effectiveness. Later, we can test whether filtering out movies with too few ratings improves recommendations.
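
If filtering turns out to be worth testing, one possible approach is to keep only rows whose movie appears at least a minimum number of times. The sketch below assumes a hypothetical helper, `filter_min_ratings`, and toy data. The threshold of 2 is arbitrary and chosen only so the toy example has something to drop.

```python
import pandas as pd

def filter_min_ratings(ratings: pd.DataFrame, column: str, min_count: int) -> pd.DataFrame:
    """Keep rows whose value in `column` occurs at least `min_count` times (hypothetical helper)."""
    # Count occurrences of each ID, then map the counts back onto the rows
    counts = ratings[column].map(ratings[column].value_counts())
    return ratings[counts >= min_count]

# Toy data for illustration only: movie 10 has 3 ratings, movie 11 has 2, movie 12 has 1
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 1, 2, 3],
    'item_id': [10, 10, 10, 11, 11, 12],
    'rating':  [5, 4, 3, 2, 4, 5],
})
filtered = filter_min_ratings(toy, column='item_id', min_count=2)
print(sorted(filtered['item_id'].unique().tolist()))
```

Passing `column='user_id'` instead would apply the same idea to inactive users.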

## Preprocessing

In [None]:
# Binarize ratings: 1 for 'liked' (4 or 5), 0 for others
# Vectorized comparison is equivalent to applying a lambda row by row
df['binarized_rating'] = (df['rating'] >= 4).astype(int)

# Create user-item matrix with binarized ratings
user_item_matrix = df.pivot(index='user_id', columns='item_id', values='binarized_rating')

# Fill missing values with -1 (indicates unrated)
user_item_matrix.fillna(-1, inplace=True)

# Map original user and movie IDs to their row/column positions in the matrix
user_id_mapping = {idx: i for i, idx in enumerate(user_item_matrix.index)}
movie_id_mapping = {idx: i for i, idx in enumerate(user_item_matrix.columns)}

# Convert to a float32 PyTorch tensor
training_data = torch.tensor(user_item_matrix.values, dtype=torch.float32)

# Clone the full matrix into train and test copies so ratings can be hidden for evaluation
train_data = training_data.clone()
test_data = training_data.clone()

# Hide two rated items per user in train_data; they stay visible in test_data.
# Note: test_data also keeps every training rating, so the held-out items are the
# positions where train_data == -1 but test_data != -1.
for user in range(training_data.shape[0]):
    rated_items = torch.where(training_data[user] != -1)[0]

    if len(rated_items) >= 10:
        test_items = random.sample(rated_items.tolist(), k=2)
        train_data[user, test_items] = -1  # hide from training
    else:
        test_data[user] = -1  # too few ratings for reliable testing
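
Because `test_data` starts as a full copy of the matrix, it still contains the ratings the model trains on. One way to score only the hidden ratings is to build a mask of positions that are unrated in the training copy but rated in the test copy. The sketch below illustrates this on toy tensors. The names `train_toy`, `test_toy`, and `predictions` are placeholders, and the predictions are hard-coded stand-ins rather than RBM output.

```python
import torch

# Toy matrices for illustration only: -1 = unrated, 0 = not liked, 1 = liked
test_toy = torch.tensor([[1., 0., -1., 1.],
                         [0., 1., 1., -1.]])
train_toy = test_toy.clone()
train_toy[0, 3] = -1  # pretend these two ratings were hidden from training
train_toy[1, 1] = -1

# Held-out positions: hidden in the training copy but known in the test copy
held_out = (train_toy == -1) & (test_toy != -1)

# Stand-in for model output; a trained RBM would produce these reconstructions
predictions = torch.tensor([[1., 0., 0., 1.],
                            [0., 0., 1., 1.]])

# Accuracy over the held-out positions only
accuracy = (predictions[held_out] == test_toy[held_out]).float().mean().item()
print(held_out.sum().item(), accuracy)
```

Restricting the comparison to `held_out` keeps ratings seen during training from inflating the score.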