# Preprocessing the MovieLens-1M Dataset

This notebook outlines the preprocessing steps used to create a synthetic variant of the [MovieLens-1M dataset (2003)](https://grouplens.org/datasets/movielens/1m/) with three gender classes (M/F/NB) as the sensitive attribute. An iterative k-core filter is then applied to the interaction data to remove sparse users and items, leaving a dense, well-supported subset.

## Imports and Configuration

The configuration reassigns 10% of the original users to the synthetic non-binary class (`NON_BINARY_FRAC`) and applies a 10-core filter by default (`K_CORE_FILTER`).

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import os

DATA_DIR = Path(os.getenv('PROJECT_ROOT', Path.cwd()))

NON_BINARY_FRAC = 0.1
RANDOM_SEED = 42
K_CORE_FILTER = 10

np.random.seed(RANDOM_SEED)

## Load User Data

User data is loaded and reduced to user IDs and gender labels, which serve as the sensitive attribute; all other columns are discarded. IDs are shifted to 0-based indexing to match the preprocessing convention.

In [2]:
users_df = pd.read_csv(
    filepath_or_buffer='users.dat',
    sep='::',
    engine='python',
    header=None,
    usecols=[0, 1],
    names=['user_id', 'gender']
)

users_df['user_id'] = users_df['user_id'] - 1
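
As a quick illustration of the parsing step, the sketch below reads two rows from an in-memory string instead of `users.dat`. It assumes the `UserID::Gender::Age::Occupation::Zip-code` column layout; the row values and the name `sample_users` are made up for the example.

```python
import io
import pandas as pd

# Two made-up rows in the assumed users.dat layout (illustrative values only)
raw = "1::F::25::3::00000\n2::M::35::7::11111\n"

sample_users = pd.read_csv(
    io.StringIO(raw),
    sep='::',
    engine='python',   # multi-character separators need the Python engine
    header=None,
    usecols=[0, 1],    # keep only user ID and gender
    names=['user_id', 'gender'],
)

# Shift to 0-based IDs, as in the cell above
sample_users['user_id'] = sample_users['user_id'] - 1
print(sample_users)
```

Only the first two columns are read, so the remaining fields never enter the DataFrame.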

## Create Non-Binary Gender Class

A synthetic non-binary class is created by randomly reassigning 10% of users, drawn from the existing male and female populations in proportion to their original ratio. The gender distribution is printed before and after the transformation.

In [3]:
male_count = users_df[users_df['gender'] == 'M'].shape[0]
female_count = users_df[users_df['gender'] == 'F'].shape[0]
total_users = male_count + female_count

print("=" * 60)
print("INITIAL GENDER DISTRIBUTION")
print("=" * 60)
print(f"{'Male:':<7} {male_count:<3} ({male_count/total_users*100:>4.1f}%)")
print(f"{'Female:':<7} {female_count:<3} ({female_count/total_users*100:>4.1f}%) \n")

num_non_binary = int(total_users * NON_BINARY_FRAC)

# Sample users to become non-binary (respecting existing gender ratio)
gender_counts = users_df['gender'].value_counts()
ratio_m_f = gender_counts['M'] / gender_counts['F']
num_nb_from_female = int(num_non_binary / (1 + ratio_m_f))
num_nb_from_male = num_non_binary - num_nb_from_female

print("=" * 60)
print("ASSIGNING NON-BINARY GENDERS")
print("=" * 60)
print(f"Sampling {NON_BINARY_FRAC*100:.0f}% of users to be non-binary.")
print(f"Sampling respects the existing M/F ratio ({ratio_m_f:.3f}):")
print(f"  - {num_nb_from_male} from male users")
print(f"  - {num_nb_from_female} from female users \n")

male_indices = users_df[users_df['gender'] == 'M'].sample(
    n=num_nb_from_male, random_state=RANDOM_SEED
).index
female_indices = users_df[users_df['gender'] == 'F'].sample(
    n=num_nb_from_female, random_state=RANDOM_SEED
).index

# Combine and assign non-binary
nb_indices = male_indices.union(female_indices)
users_df.loc[nb_indices, 'gender'] = 'NB'

male_count = users_df[users_df['gender'] == 'M'].shape[0]
female_count = users_df[users_df['gender'] == 'F'].shape[0]
nb_count = users_df[users_df['gender'] == 'NB'].shape[0]

assert total_users == male_count + female_count + nb_count, (
    f"Population mismatch after assigning non-binary genders. "
    f"Before: {total_users}, "
    f"After: {male_count + female_count + nb_count}"
)

print("=" * 60)
print("RESULTING GENDER DISTRIBUTION")
print("=" * 60)
print(f"{'Men:':<11} {male_count:>4} ({male_count/total_users*100:>4.1f}%)")
print(f"{'Women:':<11} {female_count:>4} ({female_count/total_users*100:>4.1f}%)")
print(f"{'Non-binary:':<11} {nb_count:>4} ({nb_count/total_users*100:>4.1f}%) \n")

INITIAL GENDER DISTRIBUTION
Male:   4331 (71.7%)
Female: 1709 (28.3%) 

ASSIGNING NON-BINARY GENDERS
Sampling 10% of users to be non-binary.
Sampling respects the existing M/F ratio (2.534):
  - 434 from male users
  - 170 from female users 

RESULTING GENDER DISTRIBUTION
Men:        3897 (64.5%)
Women:      1539 (25.5%)
Non-binary:  604 (10.0%) 
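
The allocation printed above can be reproduced by hand: the female share of the 604 sampled users is 604 / (1 + 2.534) ≈ 170.9, which truncates to 170, and the remaining 434 come from the male group. A minimal sketch of that arithmetic follows; the helper name `allocate_proportionally` is hypothetical and not part of the notebook.

```python
def allocate_proportionally(n_male, n_female, frac):
    """Split a sample of int(frac * total) users between two groups, keeping their ratio."""
    total = n_male + n_female
    n_sample = int(total * frac)
    ratio_m_f = n_male / n_female
    # The female share of the sample is 1 / (1 + M/F); int() truncates toward zero
    n_from_female = int(n_sample / (1 + ratio_m_f))
    # The remainder goes to the male group, so the two parts always sum to n_sample
    n_from_male = n_sample - n_from_female
    return n_from_male, n_from_female

print(allocate_proportionally(4331, 1709, 0.1))  # (434, 170)
```

Because the truncated remainder is assigned to the male group, the two counts always add up to the requested sample size.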



## Load Interaction Data

User–item interaction data is loaded and IDs are shifted to 0-based indexing. Ratings on the 1–5 scale are binarized: only 5-star ratings (rating > 4) count as positive, and everything else is negative. The original rating values are then discarded.

In [4]:
items_df = pd.read_csv(
    filepath_or_buffer='ratings.dat',
    sep='::',
    engine='python',
    header=None,
    usecols=[0, 1, 2],  # Skip timestamp
    names=['user_id', 'item_id', 'rating']
)

items_df['user_id'] = items_df['user_id'] - 1
items_df['item_id'] = items_df['item_id'] - 1

items_df['label'] = (items_df['rating'] > 4).astype(int)
items_df = items_df.drop(columns=['rating'])
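
To make the threshold concrete, here is a small sketch that applies the same comparison to one rating of each value. The DataFrame `ratings` is a toy example, and whole-star ratings are assumed.

```python
import pandas as pd

# One toy row per possible whole-star rating
ratings = pd.DataFrame({'rating': [1, 2, 3, 4, 5]})

# Same rule as above: strictly greater than 4 is positive
ratings['label'] = (ratings['rating'] > 4).astype(int)
print(ratings['label'].tolist())  # [0, 0, 0, 0, 1]
```

With whole-star ratings, `> 4` is equivalent to `== 5`, so a 4-star rating is treated as a negative label.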

## K-Core Filtering

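An iterative k-core filter removes users and items with fewer than k=10 interactions. Because dropping sparse items can push a user's count back below the threshold (and vice versa), both filters are repeated until the interaction table stops shrinking, which leaves a dense dataset.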

In [5]:
print("=" * 60)
print("BEFORE ITERATIVE K-CORE FILTERING")
print("=" * 60)
print(f"Total interactions: {len(items_df):,}")
print(f"Min interactions per user: {items_df['user_id'].value_counts().min()}")
print(f"Min interactions per item: {items_df['item_id'].value_counts().min()} \n")

users_before = set(items_df["user_id"].unique())

def iterative_filter(df, k_user=10, k_item=10):
    prev_shape = None
    current_df = df.copy()

    while prev_shape != current_df.shape:
        prev_shape = current_df.shape

        # Filter users
        user_counts = current_df['user_id'].value_counts()
        current_df = current_df[current_df['user_id'].map(user_counts) >= k_user]

        # Filter items
        item_counts = current_df['item_id'].value_counts()
        current_df = current_df[current_df['item_id'].map(item_counts) >= k_item]
    
    return current_df

items_df = iterative_filter(items_df, k_user=K_CORE_FILTER, k_item=K_CORE_FILTER)

users_after = set(items_df["user_id"].unique())

print("=" * 60)
print("AFTER ITERATIVE K-CORE FILTERING")
print("=" * 60)
print(f"Total interactions: {len(items_df):,}")
print(f"Min interactions per user: {items_df['user_id'].value_counts().min()}")
print(f"Min interactions per item: {items_df['item_id'].value_counts().min()} \n")

BEFORE ITERATIVE K-CORE FILTERING
Total interactions: 1,000,209
Min interactions per user: 20
Min interactions per item: 1 

AFTER ITERATIVE K-CORE FILTERING
Total interactions: 998,539
Min interactions per user: 17
Min interactions per item: 10 
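
The need for iteration is easiest to see on a toy table. The sketch below uses k=2 and six made-up interactions; the helper `k_core` is a hypothetical, simplified re-implementation of the loop above that also counts passes.

```python
import pandas as pd

def k_core(df, k):
    """Repeat user and item filtering until the number of rows stops changing."""
    passes = 0
    while True:
        n_before = len(df)
        # Drop users with fewer than k interactions
        df = df[df['user_id'].map(df['user_id'].value_counts()) >= k]
        # Drop items with fewer than k interactions
        df = df[df['item_id'].map(df['item_id'].value_counts()) >= k]
        passes += 1
        if len(df) == n_before:
            return df, passes

toy = pd.DataFrame({
    'user_id': [0, 0, 1, 1, 2, 2],
    'item_id': [0, 1, 0, 1, 0, 2],
})

core, passes = k_core(toy, k=2)
print(len(core), passes)                 # 4 3
print(sorted(core['user_id'].unique()))  # [0, 1]
```

In the first pass every user has two interactions, but item 2 is dropped, which leaves user 2 with a single interaction. The second pass removes that user, and the third pass confirms that nothing else changes.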



## Update Users DataFrame

User data is synchronized with the filtered interaction data by removing users that were eliminated during k-core filtering.

In [6]:
removed_users = users_before - users_after
unexpected_users = users_after - users_before

if unexpected_users:
    # Filtering can only drop users, so new IDs indicate a bug upstream
    raise ValueError(f"Unexpected condition: {len(unexpected_users)} users appeared after filtering.")

if removed_users:
    print(f"{len(removed_users)} users removed during filtering. Updating users dataframe... \n")
    valid_user_ids = items_df['user_id'].unique()
    users_df = users_df[users_df['user_id'].isin(valid_user_ids)].reset_index(drop=True)
else:
    print("No users removed during filtering. Proceeding without updating users dataframe. \n")

No users removed during filtering. Proceeding without updating users dataframe. 
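
No users were dropped in this run, so the synchronization branch is not exercised. As a sketch of what it would do, the toy example below removes a user who has no remaining interactions; `toy_users` and `toy_interactions` are made-up data.

```python
import pandas as pd

toy_users = pd.DataFrame({'user_id': [0, 1, 2], 'gender': ['M', 'F', 'NB']})
# User 1 has no interactions left after a hypothetical filtering step
toy_interactions = pd.DataFrame({'user_id': [0, 0, 2], 'item_id': [5, 6, 5]})

valid_ids = toy_interactions['user_id'].unique()
synced = toy_users[toy_users['user_id'].isin(valid_ids)].reset_index(drop=True)
print(synced['user_id'].tolist())  # [0, 2]
```

Note that the surviving user IDs are not renumbered, so they may no longer be contiguous after such a removal.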



## Train/Val/Test Split

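Interactions are split per user (80% train, 10% validation, remainder test) so that every user is represented in each split. Because the train and validation counts are rounded down, the test set absorbs the leftover interactions and ends up slightly larger than the validation set. Splitting per user reduces distribution shift between splits and supports a more realistic evaluation setting.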

In [7]:
def split_per_user(df, train_frac=0.8, val_frac=0.1, random_state=42):
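    """
    Split each user's interactions into train/val/test sets.

    Args:
        df: DataFrame with user_id, item_id and label columns
        train_frac: Fraction of each user's interactions used for training
        val_frac: Fraction of each user's interactions used for validation
        random_state: Random seed (the same seed is reused for every user)

    Returns:
        train_df, valid_df, test_df (the test set receives the remainder)
    """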
    def split_user_data(user_items):
        """
        Split a single user's interactions into train/val/test sets.
        """
        # Re-attach user_id in case the grouping column is not passed to this function
        user_id = user_items.name
        user_items = user_items.copy()
        user_items['user_id'] = user_id

        user_items = user_items.sample(frac=1, random_state=random_state) # Shuffle items
        num_items = len(user_items)

        num_train = int(train_frac * num_items)
        num_val = int(val_frac * num_items)

        split_labels = ['train'] * num_train + ['val'] * num_val + ['test'] * (num_items - num_train - num_val)
        user_items['split'] = split_labels

        return user_items

    # Apply splitting to each user's items
    df_with_splits = df.groupby('user_id', group_keys=False).apply(split_user_data)
    df_with_splits = df_with_splits[['user_id', 'item_id', 'label', 'split']] # Reorder columns

    train_df = df_with_splits[df_with_splits['split'] == 'train'].drop(columns=['split'])
    valid_df = df_with_splits[df_with_splits['split'] == 'val'].drop(columns=['split'])
    test_df = df_with_splits[df_with_splits['split'] == 'test'].drop(columns=['split'])

    return (
        train_df.reset_index(drop=True),
        valid_df.reset_index(drop=True),
        test_df.reset_index(drop=True)
    )

train_df, valid_df, test_df = split_per_user(items_df, random_state=RANDOM_SEED)

print("=" * 60)
print("DATASET SPLITS")
print("=" * 60)
print(f"{'Train size:':<11} {len(train_df):>7,}")
print(f"{'Valid size:':<11} {len(valid_df):>7,}")
print(f"{'Test size:':<11} {len(test_df):>7,}")
print(f"{'Total size:':<11} {len(train_df) + len(valid_df) + len(test_df):>7,} \n")

DATASET SPLITS
Train size: 796,389
Valid size:  97,199
Test size:  104,951
Total size: 998,539 
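
The test split is larger than the validation split because both `int()` truncations push the leftover interactions into the test set. The sketch below repeats the per-user arithmetic for a user with 17 interactions (the minimum after filtering) and one with 25; the helper name `split_sizes` is hypothetical.

```python
def split_sizes(num_items, train_frac=0.8, val_frac=0.1):
    """Return (train, val, test) counts for one user, mirroring the truncation above."""
    num_train = int(train_frac * num_items)
    num_val = int(val_frac * num_items)
    # Whatever is left after truncation goes to the test set
    num_test = num_items - num_train - num_val
    return num_train, num_val, num_test

print(split_sizes(17))  # (13, 1, 3)
print(split_sizes(25))  # (20, 2, 3)
```

For the sparsest user this yields 13 training, 1 validation and 3 test interactions, so the effective per-user proportions deviate from the nominal 80/10/10.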



## Summary and Save Output Files

Final dataset statistics are displayed, gender labels are mapped to integers (M=0, F=1, NB=2), and all processed data is saved as CSV files, including both an ordered and a row-shuffled version of the sensitive attribute data.

In [8]:
print("=" * 60)
print("FINAL DATASET SUMMARY")
print("=" * 60)
print(f"Total users: {len(users_df)}")
print(f"Total items: {items_df['item_id'].nunique()}")
print(f"Total interactions: {len(train_df) + len(valid_df) + len(test_df):,}")

male_final = (users_df['gender'] == 'M').sum()
female_final = (users_df['gender'] == 'F').sum()
nb_final = (users_df['gender'] == 'NB').sum()

print(f"\nFinal gender distribution:")
print(f" - {'Male:':<11} {male_final:>4} ({male_final/len(users_df)*100:.1f}%)")
print(f" - {'Female:':<11} {female_final:>4} ({female_final/len(users_df)*100:.1f}%)")
print(f" - {'Non-binary:':<11} {nb_final:>4} ({nb_final/len(users_df)*100:.1f}%) \n")

print("Mapping gender labels to integers...")
gender_mapping = {'M': 0, 'F': 1, 'NB': 2}
users_df['gender'] = users_df['gender'].map(gender_mapping)

# Randomized sensitive attribute dataset
users_random = users_df.sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True)

print(f"\nSaving processed files to: {DATA_DIR.parent.parent.name}/{DATA_DIR.parent.name}/{DATA_DIR.name} \n")

# Write into DATA_DIR so the files land where the message above says
users_df.to_csv(DATA_DIR / 'sensitive_attribute.csv', index=False)
users_random.to_csv(DATA_DIR / 'sensitive_attribute_random.csv', index=False)
train_df.to_csv(DATA_DIR / 'train.csv', index=False)
valid_df.to_csv(DATA_DIR / 'valid.csv', index=False)
test_df.to_csv(DATA_DIR / 'test.csv', index=False)

print("✓ All files saved successfully!")

FINAL DATASET SUMMARY
Total users: 6040
Total items: 3260
Total interactions: 998,539

Final gender distribution:
 - Male:       3897 (64.5%)
 - Female:     1539 (25.5%)
 - Non-binary:  604 (10.0%) 

Mapping gender labels to integers...

Saving processed files to: Three-Class-MPR/datasets/ml-1m-synthetic 

✓ All files saved successfully!
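
One property of the randomized file is worth noting: it is produced by shuffling the rows of `users_df`, so each user keeps their own label and only the row order differs. The sketch below checks this on a toy frame. How downstream code reads the file (by `user_id` or by row position) is not specified in this notebook, so treat that part as an assumption.

```python
import pandas as pd

toy = pd.DataFrame({'user_id': [0, 1, 2, 3], 'gender': [0, 1, 2, 0]})
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Sorting by user_id recovers the original table: the (user_id, gender) pairs are unchanged
restored = shuffled.sort_values('user_id').reset_index(drop=True)
print(restored['gender'].tolist() == toy['gender'].tolist())  # True
```

A consumer that looks labels up by `user_id` would therefore see identical labels in both files, whereas one that indexes by row position would see them permuted.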
