# Dataset Construction - MIND

This notebook converts a recommender system dataset (here, MIND) into a unified sequential format of plain-text strings, which serves as the input to the user simulator LLM.

A recommender system dataset usually consists of three critical components: users, items, and behaviors. Users and items are usually represented by their IDs, while behaviors are represented by the interactions between them.

The goal of this notebook is to transform the original data into a unified format made up of three types of JSON files:

- Item Feature JSON: This file contains all items in the dataset. Each element follows this format:
```json
{
    "item_id": "item_id (str), a unique identifier for the item",
    "item_description": "item_description (dict[str, str]), a dictionary of item attributes and their values. All values should be plain text that a human can understand. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise.",
    "item_features": "item_features (dict[str, Any]), a dictionary of item features (except for those in item_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_item_feature.jsonl`.

- User Feature JSON: This file contains all users in the dataset. Each element follows this format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "user_description": "user_description (dict[str, str]), a dictionary of user attributes and their values. All values should be plain text that a human can understand. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise.",
    "user_features": "user_features (dict[str, Any]), a dictionary of user features (except for those in user_description) and their values."
}
```
This JSON is stored as `<DATASET_NAME>_user_feature.jsonl`.

- Interaction JSON: This file contains all user behaviors in the dataset. Each element follows this format:
```json
{
    "user_id": "user_id (str), a unique identifier for the user",
    "item_id": "item_id (str), the ID of the item that the user has interacted with",
    "timestamp": "timestamp (str), the timestamp of the interaction in the format YYYY-MM-DD HH:MM:SS. When the timestamp is not available, it should be set to a random timestamp.",
    "behavior_features": "behavior_features (dict[str, Any]), a dictionary of behavior features and their values. All values should be plain text that a human can understand. Features that contain unreadable information (e.g. image, video, audio, URL, ID string, etc.) should be converted to plain text when possible and removed otherwise."
}
```
This JSON is stored as `<DATASET_NAME>_interaction.jsonl`.
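
As a rough illustration of the three record types above, the sketch below builds one record of each kind and serializes it as JSON Lines (one JSON object per line). All IDs and field values are made up for illustration, and `to_jsonl` is a hypothetical helper, not part of this notebook.

```python
import json

# Hypothetical records; every ID and value here is invented for illustration.
item_record = {
    "item_id": "N1",
    "item_description": {"title": "Example headline", "category": "news"},
    "item_features": {"time": 0},
}
user_record = {
    "user_id": "U1",
    "user_description": {},
    "user_features": {"history": ["N2", "N3"]},
}
interaction_record = {
    "user_id": "U1",
    "item_id": "N1",
    "timestamp": "2019-11-09 10:00:00",
    "behavior_features": {"impression_list": ["N4", "N1"], "item_pos": 1},
}


def to_jsonl(records):
    """Serialize a list of dicts as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


print(to_jsonl([item_record]))
print(to_jsonl([user_record]))
print(to_jsonl([interaction_record]))
```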


## Download the raw data

You can download the raw data from the MIND website:

- [MIND](https://msnews.github.io/)

You should download the `Training Set`, `Validation Set` and `Test Set`.

Alternatively, you can download the archives directly:

- [MINDlarge_train.zip](https://recodatasets.z20.web.core.windows.net/newsrec/MINDlarge_train.zip)
- [MINDlarge_dev.zip](https://recodatasets.z20.web.core.windows.net/newsrec/MINDlarge_dev.zip)
- [MINDlarge_test.zip](https://recodatasets.z20.web.core.windows.net/newsrec/MINDlarge_test.zip)

After downloading, unzip the archives and collect the `news.tsv` and `behaviors.tsv` files of all three splits in the same directory, renaming the validation and test files so that they do not overwrite the training files:

```bash
unzip MINDlarge_train.zip -d <DATASET_PATH>
unzip MINDlarge_dev.zip -d <DATASET_PATH>_dev
unzip MINDlarge_test.zip -d <DATASET_PATH>_test
mv <DATASET_PATH>_dev/behaviors.tsv <DATASET_PATH>/behaviors_valid.tsv
mv <DATASET_PATH>_dev/news.tsv <DATASET_PATH>/news_valid.tsv
mv <DATASET_PATH>_test/behaviors.tsv <DATASET_PATH>/behaviors_test.tsv
mv <DATASET_PATH>_test/news.tsv <DATASET_PATH>/news_test.tsv
```
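
After these commands, `<DATASET_PATH>` should hold six TSV files. The following is a minimal sketch for checking this before running the cells below. `EXPECTED_FILES` and `missing_files` are hypothetical names introduced here; the file names are taken from the `mv` commands above.

```python
from pathlib import Path

# File names produced by the unzip/mv commands above.
EXPECTED_FILES = [
    "news.tsv", "behaviors.tsv",
    "news_valid.tsv", "behaviors_valid.tsv",
    "news_test.tsv", "behaviors_test.tsv",
]


def missing_files(dataset_path):
    """Return the expected TSV files that are not present in dataset_path."""
    root = Path(dataset_path)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Replace the placeholder with your own dataset directory.
print(missing_files("<DATASET_PATH>"))
```

An empty list means all six files are in place.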




In [1]:
# Basic Setting of the Filtered Dataset
DATASET_NAME = "MIND"
DATASET_PATH = "<SOURCE_PATH>"
OUTPUT_PATH = "<PROJECT_PATH>/raws/"
MIN_INTERACTION_CNT = 5     # The minimum number of interactions for a user to be included in the dataset.
MAX_INTERACTION_CNT = 20  # The maximum number of interactions for a user to be included in the dataset.


In [None]:
# Step 1: Identify the file structure of the dataset

import os
# List all files in the dataset directory
print("Files in dataset directory:")
for file in os.listdir(DATASET_PATH):
    print(f"- {file}")


In [3]:
# Step 1.2 Preprocess
import pandas as pd

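# Append the validation behaviors to the training behaviors file.
# Note: mode="a" appends in place, so running this cell more than once
# duplicates the validation rows in behaviors.tsv.
impression = pd.read_csv(os.path.join(DATASET_PATH, "behaviors_valid.tsv"), sep="\t", header=None, names=["impression_id", "user_id", "time", "history", "impressions"])
impression.to_csv(os.path.join(DATASET_PATH, "behaviors.tsv"), index=False, sep="\t", header=False, mode="a")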

df = pd.read_csv(os.path.join(DATASET_PATH, "news.tsv"), sep="\t", header=None, names=["news_id", "category", "sub_category", "title", "abstract", "url", "title_entities", "abstract_entities"])
df_2 = pd.read_csv(os.path.join(DATASET_PATH, "news_valid.tsv"), sep="\t", header=None, names=["news_id", "category", "sub_category", "title", "abstract", "url", "title_entities", "abstract_entities"])
df_3 = pd.read_csv(os.path.join(DATASET_PATH, "news_test.tsv"), sep="\t", header=None, names=["news_id", "category", "sub_category", "title", "abstract", "url", "title_entities", "abstract_entities"])

news = pd.concat([df, df_2, df_3], ignore_index=True, axis=0)

# Keep a single row for news items that appear in more than one split
news.drop_duplicates("news_id", inplace=True)

news.to_csv(os.path.join(DATASET_PATH, "news_all.tsv"), index=False, sep="\t", header=None)
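
The merge of the three news files relies on `pd.concat` followed by `drop_duplicates`, which by default keeps the first occurrence of each `news_id`. Here is a toy sketch with made-up rows (the frames and values are assumptions for illustration only):

```python
import pandas as pd

# Made-up rows standing in for the train and validation news files.
train_news = pd.DataFrame({"news_id": ["N1", "N2"], "title": ["a", "b"]})
valid_news = pd.DataFrame({"news_id": ["N2", "N3"], "title": ["b (valid copy)", "c"]})

merged = pd.concat([train_news, valid_news], ignore_index=True)
# drop_duplicates keeps the first row seen for each news_id by default,
# so the copy from the earlier frame wins.
merged = merged.drop_duplicates("news_id")

print(merged)
```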


In [9]:
# Step 2: Load the dataset

from datetime import datetime
from tqdm import tqdm
tqdm.pandas()

def behavior_preprocess(path, min_history_length=5, min_behavior_counts=5, max_impressions_length=10):
    # Note: min_behavior_counts is accepted but not used by the filters below.
    # Read the behaviors file in chunks to limit memory usage.
    behavior_chunks = pd.read_csv(os.path.join(path, "behaviors.tsv"), sep="\t", header=None, names=["impression_id", "user_id", "time", "history", "impressions"], chunksize=100000)
    new_behaviors = []
    for behaviors in behavior_chunks:
        # Keep rows whose click history has at least min_history_length items.
        behaviors['history'] = behaviors['history'].progress_apply(lambda x: x.split() if pd.notna(x) else [])
        behaviors = behaviors[behaviors['history'].apply(len) >= min_history_length].copy()

        # Keep rows that show at most max_impressions_length candidates.
        behaviors['impressions'] = behaviors['impressions'].apply(lambda x: x.split())
        behaviors = behaviors[behaviors['impressions'].apply(len) <= max_impressions_length]

        # Keep rows with exactly one clicked candidate (label suffix "-1").
        behaviors = behaviors[behaviors['impressions'].apply(lambda x: sum(imp.endswith("-1") for imp in x) == 1)]

        new_behaviors.append(behaviors)
    return pd.concat(new_behaviors, axis=0)


def load_news(path: str, filename: str = "news_all.tsv") -> pd.DataFrame:
    news = pd.read_csv(
        os.path.join(path, filename),
        sep="\t", header=None,
        names=[
            "news_id", "category", "sub_category",
            "title", "abstract", "url",
            "title_entities", "abstract_entities"
        ]
    )
    return news

def process_impressions(impressions):
    item_ids = []
    positions = []
    
    for idx, imp in enumerate(impressions):
        item_id, value = imp.split('-')
        item_ids.append(item_id)
        if value == '1':
            positions.append(idx)
    return item_ids, positions

def process_user_impressions(input_df):
    processed_impressions = input_df['impressions'].apply(process_impressions)
    
    input_df['item_list'] = processed_impressions.apply(lambda x: x[0])
    input_df['label'] = processed_impressions.apply(lambda x: x[1])
    
    return input_df

behaviors = behavior_preprocess(DATASET_PATH)

impression = process_user_impressions(behaviors)

news = load_news(DATASET_PATH)
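
Each entry of the `impressions` column is a space-separated list of `<news_id>-<label>` tokens, where the label `1` marks the clicked item. The sketch below replays the parsing on a made-up string. `parse_impressions` is a hypothetical stand-alone variant of `process_impressions` above, and the IDs are invented.

```python
def parse_impressions(raw):
    """Split 'N1-0 N2-1' style text into item IDs and clicked positions."""
    item_ids, clicked_positions = [], []
    for idx, token in enumerate(raw.split()):
        # rsplit guards against a '-' inside the ID itself (an assumption;
        # the notebook's version uses a plain split).
        item_id, label = token.rsplit("-", 1)
        item_ids.append(item_id)
        if label == "1":
            clicked_positions.append(idx)
    return item_ids, clicked_positions


item_ids, clicked = parse_impressions("N10-0 N20-1 N30-0")
print(item_ids)  # ['N10', 'N20', 'N30']
print(clicked)   # [1]
```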

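In [None]:
# Step 3: Attach to each news item the earliest time it appears in an impression or a history

# `time` is read as text, so parse it first; otherwise `min` compares strings, not dates.
TIME_FORMAT = '%m/%d/%Y %I:%M:%S %p'
times = impression[['time', 'item_list', 'history']].copy()
times['time'] = pd.to_datetime(times['time'], format=TIME_FORMAT)
times1 = times[['time', 'item_list']].explode('item_list').groupby('item_list')['time'].min()
times2 = times[['time', 'history']].explode('history').groupby('history')['time'].min()
item_time = pd.concat([times1, times2]).groupby(level=0).min()
# Store the result as text in the original layout so that later cells can parse it as before.
news = news.join(item_time.dt.strftime(TIME_FORMAT).rename('time'), on='news_id')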
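
Because `time` is read from the TSV as text in a month/day/year layout, a minimum taken directly on the strings is lexicographic rather than chronological. The sketch below shows the difference on two made-up values; the timestamps are assumptions for illustration.

```python
import pandas as pd

# Made-up timestamps in the month/day/year 12-hour layout of the `time` column.
times = pd.Series(["11/9/2019 10:00:00 AM", "11/10/2019 9:00:00 AM"])

# String comparison goes character by character: "11/1..." sorts before "11/9...".
print(times.min())

# Parsing first gives the chronological minimum.
parsed = pd.to_datetime(times, format="%m/%d/%Y %I:%M:%S %p")
print(parsed.min())
```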

In [None]:
# Step 4: Convert the dataset to the unified format

from tqdm import tqdm
from datetime import datetime

tqdm.pandas()

item_df = news.progress_apply(
        lambda row: {
            "item_id": row["news_id"],
            "item_description": {
                "title": row["title"],
                # A missing abstract is read as NaN, which is truthy, so test it with pd.notna.
                "abstract": row["abstract"] if pd.notna(row["abstract"]) else "NA",
                "category": row["category"],
                "subcategory": row["sub_category"]
            },
            "item_features": {
                "title_entities": row["title_entities"] if pd.notna(row["title_entities"]) else "",
                "abstract_entities": row["abstract_entities"] if pd.notna(row["abstract_entities"]) else "",
                # %I (12-hour clock) is required for %p (AM/PM) to affect the parsed hour.
                "time": datetime.strptime(row["time"], '%m/%d/%Y %I:%M:%S %p').timestamp() if pd.notna(row["time"]) else 0
            }
        },
        axis=1
    ).tolist()

user_df = impression.groupby('user_id').agg({
    'history': 'first'  # Take the first history for each user
}).progress_apply(
    lambda row: {
        "user_id": row.name,  # row.name contains the user_id in grouped data
        "user_description": {},  # Empty since we don't have user descriptions
        "user_features": {
            "history": row["history"]
        }
    },
    axis=1
).tolist()

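interaction_df = impression.progress_apply(
    lambda row: {
        "user_id": row["user_id"],
        "item_id": row["item_list"][row["label"][0]],
        # Convert the raw time string to the YYYY-MM-DD HH:MM:SS layout of the unified format.
        "timestamp": datetime.strptime(row["time"], '%m/%d/%Y %I:%M:%S %p').strftime('%Y-%m-%d %H:%M:%S'),
        "behavior_features": {
            "impression_list": row["item_list"],
            "item_pos": row["label"][0]
        }
    },
    axis=1
).tolist()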
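
The `time` strings use a 12-hour clock with an AM/PM marker. In Python's `strptime`, `%p` only changes the parsed hour when the hour itself is read with `%I`; with `%H` the marker is matched but has no effect. A small sketch on a made-up value:

```python
from datetime import datetime

stamp = "11/9/2019 9:00:00 PM"  # made-up value for illustration

with_24h_directive = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S %p")
with_12h_directive = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")

print(with_24h_directive.hour)  # 9: the PM marker is ignored
print(with_12h_directive.hour)  # 21
```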


In [None]:
# Step 5: Save the dataset

item_df = pd.DataFrame(item_df)
user_df = pd.DataFrame(user_df)
interaction_df = pd.DataFrame(interaction_df)


print("\nFinal shapes:")
print(f"Items: {len(item_df)}")
print(f"Users: {len(user_df)}")
print(f"Interactions: {len(interaction_df)}")


print("\nSaving files...")
# Create output directory if it doesn't exist
os.makedirs(OUTPUT_PATH, exist_ok=True)
interaction_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_interaction.jsonl"), lines=True, orient="records")
user_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_user_feature.jsonl"), lines=True, orient="records")
item_df.to_json(os.path.join(OUTPUT_PATH, f"{DATASET_NAME}_item_feature.jsonl"), lines=True, orient="records")

print("\nProcessing complete.")
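
To sanity-check the output, the saved files can be read back one JSON object per line. The sketch below writes two made-up interaction records to a temporary directory and reloads them with the standard `json` module; the records and the file name are assumptions for illustration.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up interaction records in the unified format.
records = [
    {"user_id": "U1", "item_id": "N1", "timestamp": "2019-11-09 10:00:00",
     "behavior_features": {"impression_list": ["N4", "N1"], "item_pos": 1}},
    {"user_id": "U2", "item_id": "N4", "timestamp": "2019-11-09 11:00:00",
     "behavior_features": {"impression_list": ["N4", "N5"], "item_pos": 0}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_interaction.jsonl")
    pd.DataFrame(records).to_json(path, lines=True, orient="records")

    # Read the file back, skipping any blank trailing line.
    with open(path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded))
print(loaded[0]["behavior_features"])
```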