# What this notebook does

This preprocessing pipeline prepares the dataset for the hybrid fake review detection framework, combining:

- SGDCTH’s group-based spammer detection
- SL-GAD’s self-supervised graph representation learning (GNN encoder)

It converts raw Amazon review data into a heterogeneous information sub-network (HISN) with rich attribute features and temporal, textual, and behavioral signals.
The resulting dataset is ready for use in Stage 2: self-supervised GNN training and Stage 3: DBSCAN-based candidate group detection.

# Library

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch
from collections import Counter
import emoji
import re
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, balanced_accuracy_score, f1_score, precision_score, recall_score
from wordcloud import WordCloud
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Data

In [2]:
df = pd.read_json("Amazon_Fashion.jsonl", lines=True)
df.head()

Unnamed: 0,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase
0,5,Pretty locket,I think this locket is really pretty. The insi...,[],B00LOPVX74,B00LOPVX74,AGBFYI2DDIKXC5Y4FARTYDTQBMFQ,2020-01-09 00:06:34.489,3,True
1,5,A,Great,[],B07B4JXK8D,B07B4JXK8D,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,2020-12-20 01:04:06.701,0,True
2,2,Two Stars,One of the stones fell out within the first 2 ...,[],B007ZSEQ4Q,B007ZSEQ4Q,AHITBJSS7KYUBVZPX7M2WJCOIVKQ,2015-05-23 01:33:48.000,3,True
3,1,Won’t buy again,Crappy socks. Money wasted. Bought to wear wit...,[],B07F2BTFS9,B07F2BTFS9,AFVNEEPDEIH5SPUN5BWC6NKL3WNQ,2018-12-31 20:57:27.095,2,True
4,5,I LOVE these glasses,I LOVE these glasses! They fit perfectly over...,[],B00PKRFU4O,B00XESJTDE,AHSPLDNW5OOUK2PLH7GXLACFBZNQ,2015-08-13 14:29:26.000,0,True


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500939 entries, 0 to 2500938
Data columns (total 10 columns):
 #   Column             Dtype         
---  ------             -----         
 0   rating             int64         
 1   title              object        
 2   text               object        
 3   images             object        
 4   asin               object        
 5   parent_asin        object        
 6   user_id            object        
 7   timestamp          datetime64[ns]
 8   helpful_vote       int64         
 9   verified_purchase  bool          
dtypes: bool(1), datetime64[ns](1), int64(2), object(6)
memory usage: 174.1+ MB


# Preprocessing

## K-Core Filtering

K-Core filtering is applied to the review dataset for several reasons:
- It reduces sparsity: many products have only one review, and many reviewers have reviewed only one product.
- It gives a fairer evaluation, especially for a GNN, which needs review history and neighbors to learn from.
- It improves the stability and reproducibility of the data.
- It is computationally cheaper, since less data remains.

For this dataset we apply a 3-2-core rule, where
- each item must have at least 3 reviews, and
- each user must have at least 2 reviews.

The two filters are applied repeatedly until both requirements are met, because removing users can push an item back below its threshold and vice versa.
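To illustrate why a single pass is not enough, here is a minimal sketch on a made-up toy table (the users `u1`..`u5`, items `A`..`D`, and the helper name `k_core_toy` are all hypothetical). Dropping item `D` leaves `u4` and `u5` with one review each, and dropping those users in turn pushes item `C` below the item threshold.

```python
import pandas as pd

def k_core_toy(df, k_items=3, k_users=2):
    """Repeat item and user filtering until a full pass removes nothing."""
    while True:
        before = len(df)
        item_counts = df["asin"].value_counts()
        df = df[df["asin"].isin(item_counts[item_counts >= k_items].index)]
        user_counts = df["user_id"].value_counts()
        df = df[df["user_id"].isin(user_counts[user_counts >= k_users].index)]
        if len(df) == before:
            return df

# Hypothetical toy reviews: A and B are reviewed by u1..u3,
# C by u1, u4, u5, and D only by u4 and u5.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u3", "u1", "u4", "u5", "u4", "u5"],
    "asin":    ["A",  "A",  "A",  "B",  "B",  "B",  "C",  "C",  "C",  "D",  "D"],
})

core = k_core_toy(toy)
print(sorted(core["asin"].unique()), sorted(core["user_id"].unique()), len(core))
```

Only items `A` and `B` and users `u1`..`u3` survive, even though item `C` satisfied the item threshold at the start.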

In [4]:
def k_core_filtering(df, k_items = 3, k_users = 2):
    iteration = 0
    
    while True:
        iteration += 1
        print(f"Iteration {iteration}: {len(df)} rows")
        
        before_len = len(df)
        
        # First, keep only items with at least k_items reviews
        item_counts = df['asin'].value_counts()
        active_items = item_counts[item_counts >= k_items].index
        df = df[df['asin'].isin(active_items)]
        
        print(f"After item filtering: {len(df)} rows")
        
        # Then, keep only users with at least k_users reviews
        user_counts = df['user_id'].value_counts()
        active_users = user_counts[user_counts >= k_users].index
        df = df[df['user_id'].isin(active_users)]
        
        print(f"After user filtering: {len(df)} rows")
        
        after_len = len(df)
        
        print(f"Filtered {before_len - after_len} rows in this iteration.\n")
        
        if before_len == after_len:
            break
        
    return df

data = k_core_filtering(df)

Iteration 1: 2500939 rows
After item filtering: 1679162 rows
After user filtering: 398409 rows
Filtered 2102530 rows in this iteration.

Iteration 2: 398409 rows
After item filtering: 266837 rows
After user filtering: 211865 rows
Filtered 186544 rows in this iteration.

Iteration 3: 211865 rows
After item filtering: 192192 rows
After user filtering: 180792 rows
Filtered 31073 rows in this iteration.

Iteration 4: 180792 rows
After item filtering: 175586 rows
After user filtering: 172310 rows
Filtered 8482 rows in this iteration.

Iteration 5: 172310 rows
After item filtering: 170819 rows
After user filtering: 169845 rows
Filtered 2465 rows in this iteration.

Iteration 6: 169845 rows
After item filtering: 169384 rows
After user filtering: 169091 rows
Filtered 754 rows in this iteration.

Iteration 7: 169091 rows
After item filtering: 168945 rows
After user filtering: 168848 rows
Filtered 243 rows in this iteration.

Iteration 8: 168848 rows
After item filtering: 168802 rows
After user 

In [5]:
num_users = data['user_id'].nunique()
num_items = data['asin'].nunique()
num_reviews = len(data)

print(f"Number of unique users in the core filtering: {num_users}")
print(f"Number of unique items in the core filtering: {num_items}")
print(f"Total reviews (edges) in the core filtering: {num_reviews}")

if num_users > 0:
    print(f"Average reviews per user: {num_reviews / num_users:.2f}")
if num_items > 0:
    print(f"Average reviews per item: {num_reviews / num_items:.2f}")

Number of unique users in the core filtering: 69587
Number of unique items in the core filtering: 29282
Total reviews (edges) in the core filtering: 168728
Average reviews per user: 2.42
Average reviews per item: 5.76


After K-Core filtering we are left with 168,728 reviews, about 6.7% of the original 2,500,939 rows. The raw data is dominated by sparse users and items that add noise and contribute little to training the model and building a strong fake review detector.

In [6]:
data.to_csv("data_kcore_3_2.csv", index= False)

## Basic data cleaning

To keep the following steps running smoothly, the text fields (title and review body) need to be cleaned and normalized.

In [7]:
# Convert the timestamp column to datetime type
data["timestamp"] = pd.to_datetime(data["timestamp"], errors="coerce")

# Combine title + text
data["full_text"] = (data["title"] + " " + data["text"]).str.strip()

In [8]:
def clean_review(text):
    if not isinstance(text, str):
        return ""

    # 1. Lowercase
    text = text.lower()

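    # 2. Replace URLs with a placeholder (step 5 strips it again, so URLs are effectively removed)
    text = re.sub(r"http\S+|www\S+|https\S+", " <URL> ", text)

    # 3. Replace emails with a placeholder (also stripped by step 5)
    text = re.sub(r"\S+@\S+", " <EMAIL> ", text)

    # 4. Collapse runs of 3+ identical characters to 2 (e.g. coooool -> cool)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)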

    # 5. Remove weird symbols (keep normal punctuation)
    text = re.sub(r"[^a-z0-9\s\.,!?']", " ", text)

    # 6. Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()

    return text

data["full_text"] = data["full_text"].apply(clean_review)
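As a quick sanity check, the sketch below copies `clean_review` from the cell above and runs it on a made-up review (the sample string is an assumption, not taken from the dataset). It shows two side effects worth knowing: the uppercase `<URL>` and `<EMAIL>` placeholders are removed by step 5, and runs of repeated characters are cut down to two.

```python
import re

def clean_review(text):
    """Copy of the cleaning function above, repeated so the example is self-contained."""
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", " <URL> ", text)
    text = re.sub(r"\S+@\S+", " <EMAIL> ", text)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    text = re.sub(r"[^a-z0-9\s\.,!?']", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Hypothetical sample review
sample = "LOVE it!!! Sooooo cute & comfy, see https://example.com or mail me@example.com"
cleaned = clean_review(sample)
print(cleaned)  # love it!! soo cute comfy, see or mail
```

If the placeholders should survive as tokens, one option would be to use lowercase alphabetic placeholders, at the cost of adding artificial words to the text.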

In [9]:
data.head()

Unnamed: 0,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase,full_text
28,5,Great socks for the gym or for your house,Nice socks. They say they are for yoga/pilates...,[],B0856TH4LK,B0856TH4LK,AFSKPY37N3C43SOI5IEXEK5JSIYA,2020-04-18 10:24:18.621,0,False,great socks for the gym or for your house nice...
29,5,"Nice, warm merino wool socks!",I absolutely love these socks. Super soft and ...,[],B07ZWZ2595,B07ZWZ2595,AFSKPY37N3C43SOI5IEXEK5JSIYA,2020-01-19 11:23:39.201,0,False,"nice, warm merino wool socks! i absolutely lov..."
30,4,"Nice jacket, runs a little small","Not a bad rain jacket but, at the time I am wr...",[],B07NY72H7W,B07NY72H7W,AFSKPY37N3C43SOI5IEXEK5JSIYA,2019-08-12 19:43:59.659,0,False,"nice jacket, runs a little small not a bad rai..."
31,5,Great biking gloves for newbies or expert riders,"Great cycling gloves. Well made, form fitting ...",[],B07NX5RHZ2,B07NX5RHZ2,AFSKPY37N3C43SOI5IEXEK5JSIYA,2019-06-24 01:00:44.117,0,False,great biking gloves for newbies or expert ride...
32,3,Quality made but runs small,"Nicely made, lightweight windbreaker. Easily f...",[],B07H92ZCQ9,B07H92ZCQ9,AFSKPY37N3C43SOI5IEXEK5JSIYA,2019-02-09 11:46:44.435,0,False,"quality made but runs small nicely made, light..."


In [10]:
data.drop(columns="parent_asin", inplace=True)

In [11]:
data.head()

Unnamed: 0,rating,title,text,images,asin,user_id,timestamp,helpful_vote,verified_purchase,full_text
28,5,Great socks for the gym or for your house,Nice socks. They say they are for yoga/pilates...,[],B0856TH4LK,AFSKPY37N3C43SOI5IEXEK5JSIYA,2020-04-18 10:24:18.621,0,False,great socks for the gym or for your house nice...
29,5,"Nice, warm merino wool socks!",I absolutely love these socks. Super soft and ...,[],B07ZWZ2595,AFSKPY37N3C43SOI5IEXEK5JSIYA,2020-01-19 11:23:39.201,0,False,"nice, warm merino wool socks! i absolutely lov..."
30,4,"Nice jacket, runs a little small","Not a bad rain jacket but, at the time I am wr...",[],B07NY72H7W,AFSKPY37N3C43SOI5IEXEK5JSIYA,2019-08-12 19:43:59.659,0,False,"nice jacket, runs a little small not a bad rai..."
31,5,Great biking gloves for newbies or expert riders,"Great cycling gloves. Well made, form fitting ...",[],B07NX5RHZ2,AFSKPY37N3C43SOI5IEXEK5JSIYA,2019-06-24 01:00:44.117,0,False,great biking gloves for newbies or expert ride...
32,3,Quality made but runs small,"Nicely made, lightweight windbreaker. Easily f...",[],B07H92ZCQ9,AFSKPY37N3C43SOI5IEXEK5JSIYA,2019-02-09 11:46:44.435,0,False,"quality made but runs small nicely made, light..."


## Text Embedding

> Purpose:

Generate 768-dimensional dense text embeddings representing the semantic meaning of each review.

> Why this model:

Multilingual support (Amazon data spans multiple locales)

Strong semantic representation suitable for similarity and anomaly analysis.

> Output:

text_embs.shape = (num_reviews, 768)
→ used as review attribute features in the HISN graph.

In [12]:
from sentence_transformers import SentenceTransformer

In [13]:
MODEL = "sentence-transformers/paraphrase-xlm-r-multilingual-v1"
embedding_model = SentenceTransformer(MODEL)

data = data.reset_index(drop=True)
text_embs = embedding_model.encode(data["full_text"].tolist(), batch_size=512, show_progress_bar=True)
text_embs = np.array(text_embs)

Batches:   0%|          | 0/330 [00:00<?, ?it/s]

## Feature Engineering

### Basic numerical
- review_length: number of words in the review text --> detects unnatural review patterns
- dup_count: number of reviews with identical text --> measures duplication across reviews
- is_duplicate: binary flag for duplicated reviews --> detects text reuse or copy-paste
- helpful_vote -> log_helpful: log-transformed helpful votes --> smooths the heavy-tailed distribution
- verified_purchase -> verified_int: boolean cast to integer --> numeric input for scaling

In [14]:
data["review_length"] = data["text"].apply(lambda x: len(x.split()))
data["dup_count"] = data["text"].map(data["text"].value_counts())   # how many times this text appears
data["is_duplicate"] = (data["dup_count"] > 1).astype(int)
data["log_helpful"] = np.log1p(data["helpful_vote"])
data["verified_int"] = data["verified_purchase"].astype(int)
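The following toy example, with made-up texts and vote counts, shows what these columns look like. Note that `dup_count` counts exact matches on the raw `text` column, so near-duplicates that differ in case or punctuation are not caught.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset
toy = pd.DataFrame({
    "text": ["Great", "Great", "Runs small", "Great"],
    "helpful_vote": [0, 1, 9, 99],
})

# Each row receives the number of rows sharing exactly the same text
toy["dup_count"] = toy["text"].map(toy["text"].value_counts())
toy["is_duplicate"] = (toy["dup_count"] > 1).astype(int)

# log1p(x) = log(1 + x): defined at 0 and compresses large vote counts
toy["log_helpful"] = np.log1p(toy["helpful_vote"])

print(toy)
```

Here the vote counts 0, 1, 9, 99 map to log(1), log(2), log(10), log(100), so a 99-fold spread in raw votes shrinks to a range of roughly 0 to 4.6.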

### Temporal Cyclic Features

In this block we add time of day and day of year as continuous cyclic features, preserving periodic patterns.
- Reviews may exhibit temporal bursts typical of spam campaigns
- Cyclic encoding captures daily/seasonal rhythms better than raw timestamps
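A small sketch of why the sine/cosine pair is used (the helper `encode_hour` is a hypothetical name for illustration): on the raw scale, 23:00 and 00:00 are 23 units apart, while on the unit circle they are neighbors.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour in [0, 24) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# Distance between 23:00 and 00:00 versus 12:00 and 00:00
d_late_night = np.linalg.norm(encode_hour(23) - encode_hour(0))
d_noon = np.linalg.norm(encode_hour(12) - encode_hour(0))

print(f"23h vs 0h: {d_late_night:.3f}")
print(f"12h vs 0h: {d_noon:.3f}")
```

The first distance is small (one hour apart on the circle) and the second is the maximum possible value of 2 (opposite points on the circle).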

In [15]:
data["day"] = data["timestamp"].dt.dayofyear
data["hour"] = data["timestamp"].dt.hour
data["sin_day"] = np.sin(2*np.pi*data["day"]/365)
data["cos_day"] = np.cos(2*np.pi*data["day"]/365)
data["sin_hour"] = np.sin(2*np.pi*data["hour"]/24)
data["cos_hour"] = np.cos(2*np.pi*data["hour"]/24)

## Building User and Item Stats

In [16]:
user_stats = data.groupby("user_id").agg(
    n_reviews_user=("rating","count"),
    avg_rating_user=("rating","mean"),
    std_rating_user=("rating","std"),
    frac_verified_user=("verified_int","mean"),
    avg_len_user=("review_length","mean"),
    dup_ratio_user=("is_duplicate","mean")
).reset_index()

In [17]:
item_stats = data.groupby("asin").agg(
    n_reviews_item=("rating","count"),
    avg_rating_item=("rating","mean"),
    std_rating_item=("rating","std"),
    dup_ratio_item=("is_duplicate","mean")
).reset_index()

### User features engineering

These features detect bursts of reviews within a single day: how large the bursts are and how evenly a user's reviews are spread across days.

In [18]:
# -------------------------------------------------
# 1. Reviews per day statistics
# -------------------------------------------------

# Reviews per user per day
data['date'] = data['timestamp'].dt.date
user_daily = data.groupby(['user_id', 'date']).size().reset_index(name='reviews_per_day')

# Max reviews in a single day per user
user_burst = user_daily.groupby('user_id')['reviews_per_day'].max().reset_index()
user_burst.rename(columns={'reviews_per_day': 'max_reviews_per_day'}, inplace=True)

# Entropy of review days per user (spread vs bursty)
def entropy(arr):
    probs = arr / arr.sum()
    return -np.sum(probs * np.log(probs + 1e-10))

user_entropy = user_daily.groupby('user_id')['reviews_per_day'].apply(entropy).reset_index()
user_entropy.rename(columns={'reviews_per_day': 'day_entropy'}, inplace=True)
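To make the entropy feature concrete, the sketch below applies the same formula to two made-up users with five reviews each. Entropy is close to 0 when all reviews fall on one day and reaches log(n) when they are spread over n distinct days.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (natural log) of a vector of per-day review counts."""
    probs = counts / counts.sum()
    return -np.sum(probs * np.log(probs + 1e-10))

bursty = np.array([5])              # hypothetical user: 5 reviews on a single day
spread = np.array([1, 1, 1, 1, 1])  # hypothetical user: 5 reviews on 5 different days

print(f"bursty: {entropy(bursty):.4f}")
print(f"spread: {entropy(spread):.4f}  (log 5 = {np.log(5):.4f})")
```

Because the maximum depends on the number of active days, values are not directly comparable between users with very different activity levels. Dividing by the log of the number of active days would be one possible normalization, but that is not part of the pipeline above.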

These features measure how similar a user's reviews are to each other, to detect copy-pasted reviews across different products.

This is critical because in SL-GAD the contrastive module learns consistency between views of the same node.<br>
If a user already has low diversity (high text similarity), their embeddings will appear artificially stable.<br>
→ The model must learn to disentangle “stable because spammy” from “stable because natural”.
These similarity features help the GNN capture that nuance.

In [19]:
# -------------------------------------------------
# 2. Similarity Features (review text duplicates/similarity)
# -------------------------------------------------
similarities = []
for uid, group in data.groupby('user_id'):
    if len(group) > 1:
        embs = text_embs[group.index]
        sim_matrix = cosine_similarity(embs)
        # Upper triangle mean (excluding self)
        triu = sim_matrix[np.triu_indices_from(sim_matrix, k=1)]
        avg_sim = triu.mean() if len(triu) > 0 else 0
    else:
        avg_sim = 0
    similarities.append((uid, avg_sim))
    
user_sim = pd.DataFrame(similarities, columns=['user_id','avg_text_sim'])
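The upper-triangle averaging can be checked on a tiny example. This is a NumPy-only sketch with made-up 2-dimensional "embeddings" and a hypothetical helper `mean_pairwise_cosine`; it assumes no embedding row is all zeros.

```python
import numpy as np

def mean_pairwise_cosine(embs):
    """Mean cosine similarity over all unordered pairs of rows (0.0 if fewer than 2 rows)."""
    if len(embs) < 2:
        return 0.0
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = unit @ unit.T
    # k=1 skips the diagonal, so self-similarity is excluded
    return float(sim[np.triu_indices_from(sim, k=1)].mean())

# Two identical reviews and one orthogonal review
embs = np.array([[1.0, 0.0],
                 [1.0, 0.0],
                 [0.0, 1.0]])

avg_sim = mean_pairwise_cosine(embs)
print(round(avg_sim, 4))  # 0.3333: pairs are (1.0, 0.0, 0.0)
```

With only two reviews per user, which is common after 3-2-core filtering, the average reduces to a single pairwise similarity.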

These features describe the user-item network through degree centrality:
1. Users (n_unique_items)
- Meaning : Number of distinct items a user reviewed 
- Intuition : Diversity of reviewed items<br>
Low : Focused reviewer (could be genuine niche interest)<br>
High : possible spammer reviewing too many unrelated products

2. Items (n_unique_users)
- Meaning : Number of distinct users reviewing an item
- Intuition : Item popularity<br>
Very high → popular item (normal).<br>
Very low → suspicious item possibly targeted by fake reviews.

In [20]:
# -------------------------------------------------
# 3. Network-based Features
# -------------------------------------------------
# Degree centrality
user_degree = data.groupby('user_id')['asin'].nunique().reset_index()
user_degree.rename(columns={'asin': 'n_unique_items'}, inplace=True)

item_degree = data.groupby('asin')['user_id'].nunique().reset_index()
item_degree.rename(columns={'user_id': 'n_unique_users'}, inplace=True)

# Local density proxy: dup_count / (user_degree * item_degree); the small constant avoids division by zero
data['user_degree'] = data['user_id'].map(user_degree.set_index('user_id')['n_unique_items'])
data['item_degree'] = data['asin'].map(item_degree.set_index('asin')['n_unique_users'])
data['local_density'] = data['dup_count'] / (data['user_degree'] * data['item_degree'] + 1e-5)

### Stats and features merging

In [21]:
# Userstats
user_stats = user_stats.merge(user_burst, on='user_id', how='left')
user_stats = user_stats.merge(user_entropy, on='user_id', how='left')
user_stats = user_stats.merge(user_sim, on='user_id', how='left')
user_stats = user_stats.merge(user_degree, on='user_id', how='left')

user_stats['avg_text_sim'] = user_stats['avg_text_sim'].fillna(0)
user_stats.fillna(0, inplace=True)

# Itemstats
item_stats = item_stats.merge(item_degree, on='asin', how='left')


## Scaling

In [22]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
columns_to_be_scaled = ["rating","review_length","log_helpful","verified_int","dup_count"]
data_scaled = data[columns_to_be_scaled].copy()
data_scaled = pd.DataFrame(scaler.fit_transform(data_scaled), columns=data_scaled.columns)

In [23]:
for col in data_scaled.columns:
    data[col+"_scaled"] = data_scaled[col]

data.drop(columns=columns_to_be_scaled, inplace=True)
data.fillna(0, inplace=True)

In [24]:
data.isnull().sum()

title                   0
text                    0
images                  0
asin                    0
user_id                 0
timestamp               0
helpful_vote            0
verified_purchase       0
full_text               0
is_duplicate            0
day                     0
hour                    0
sin_day                 0
cos_day                 0
sin_hour                0
cos_hour                0
date                    0
user_degree             0
item_degree             0
local_density           0
rating_scaled           0
review_length_scaled    0
log_helpful_scaled      0
verified_int_scaled     0
dup_count_scaled        0
dtype: int64

In [25]:
review_features = np.hstack([
    text_embs,
    data[["rating_scaled","review_length_scaled","log_helpful_scaled",
        "verified_int_scaled","sin_day","cos_day","sin_hour","cos_hour","dup_count_scaled"]].values
])

In [26]:
print("Review feature matrix shape:", review_features.shape)
print("User stats shape:", user_stats.shape)
print("Item stats shape:", item_stats.shape)

Review feature matrix shape: (168728, 777)
User stats shape: (69587, 11)
Item stats shape: (29282, 6)


In [27]:
cols_to_keep = [
    "user_id", "asin", "timestamp",
    "rating_scaled", "review_length_scaled", "log_helpful_scaled",
    "verified_int_scaled", "dup_count_scaled",
    "sin_day", "cos_day", "sin_hour", "cos_hour",
    "user_degree", "item_degree", "local_density"
]

data = data[cols_to_keep]
print("✅ Final modeling-ready dataset shape:", data.shape)


✅ Final modeling-ready dataset shape: (168728, 15)


In [28]:
import pickle
np.save("nb2/review_features.npy", review_features)

# Save user and item stats
user_stats.to_csv("nb2/user_stats.csv", index=False)
item_stats.to_csv("nb2/item_stats.csv", index=False)

# Save processed DataFrame (with scaled columns)
data.to_parquet("nb2/reviews_processed.parquet", index=False)

# Save ID mappings (for graph construction later)
user2id = {u: i for i, u in enumerate(user_stats["user_id"].unique())}
item2id = {a: i for i, a in enumerate(item_stats["asin"].unique())}
review2id = {r: i for i, r in enumerate(data.index)}

with open("nb2/id_mappings.pkl", "wb") as f:
    pickle.dump({"user2id": user2id, "item2id": item2id, "review2id": review2id}, f)

print("✅ All processed data saved.")

✅ All processed data saved.


# NFS experiment

In [29]:
data = pd.read_parquet("nb2/reviews_processed.parquet")

In [31]:
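# 1️⃣ NRP: number of distinct reviewers per product
nrp = data.groupby('asin')['user_id'].nunique().reset_index()
nrp.columns = ['asin', 'NRP']

# 2️⃣ NDP: mean scaled duplicate count per product (using dup_count_scaled)
ndp = data.groupby('asin')['dup_count_scaled'].mean().reset_index()
ndp.columns = ['asin', 'NDP']

# 3️⃣ NTP: maximum number of reviews a product received in a single day
data['date'] = data['timestamp'].dt.date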
burst_window = data.groupby(['asin', 'date']).size().reset_index(name='reviews_per_day')
ntp = burst_window.groupby('asin')['reviews_per_day'].max().reset_index()
ntp.columns = ['asin', 'NTP']

# 4️⃣ Combine
nfs = nrp.merge(ndp, on='asin', how='left').merge(ntp, on='asin', how='left')
nfs.fillna(0, inplace=True)

# Weighted combination
alpha, beta, gamma = 0.4, 0.3, 0.3
nfs['NFS_pre'] = alpha * nfs['NRP'] + beta * nfs['NDP'] + gamma * nfs['NTP']

print("✅ NFS metrics calculated successfully!")
print(nfs.sort_values('NFS_pre', ascending=False).head(10))


✅ NFS metrics calculated successfully!
            asin  NRP       NDP  NTP     NFS_pre
63    B000GAWSDG  367 -0.089483    4  147.973155
57    B000FIS5U4  322 -0.093024    4  129.972093
8810  B017U1FDM6  239  0.148553    3   96.544566
6622  B00ZIK4NH8  195  0.102977    3   78.930893
3500  B00KA3TUNA  180  0.532277    3   73.059683
3507  B00KA3VX62  178  0.315525    3   72.194657
5624  B00UDF11O6  161  0.204466    2   65.061340
3504  B00KA3VEG6  157  0.154181    3   63.746254
3499  B00KA3SRVG  156  0.120379    3   63.336114
5308  B00SH9BD0W  147  0.089508    4   60.026852
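In the output above, `NFS_pre` is almost exactly `0.4 * NRP`, because NRP is a raw count in the hundreds while NDP is a standardized value near 0 and NTP is at most 4. If the three components are meant to contribute on comparable scales, one option is to normalize each before weighting. The sketch below uses min-max scaling on made-up values; the helper `min_max` and the toy numbers are assumptions, and this is not the scoring used in the rest of the notebook.

```python
import pandas as pd

# Hypothetical per-product components
nfs_toy = pd.DataFrame({
    "asin": ["A", "B", "C"],
    "NRP": [300, 10, 5],
    "NDP": [-0.1, 0.5, 2.0],
    "NTP": [4, 3, 1],
})

def min_max(s):
    """Rescale a Series to [0, 1]; a constant Series maps to all zeros."""
    span = s.max() - s.min()
    return (s - s.min()) / span if span > 0 else s * 0.0

alpha, beta, gamma = 0.4, 0.3, 0.3

# Raw weighting, as in the cell above
nfs_toy["raw"] = alpha * nfs_toy["NRP"] + beta * nfs_toy["NDP"] + gamma * nfs_toy["NTP"]

# Weighting after putting every component on a [0, 1] scale
nfs_toy["normalized"] = (alpha * min_max(nfs_toy["NRP"])
                         + beta * min_max(nfs_toy["NDP"])
                         + gamma * min_max(nfs_toy["NTP"]))

raw_rank = nfs_toy.sort_values("raw", ascending=False)["asin"].tolist()
norm_rank = nfs_toy.sort_values("normalized", ascending=False)["asin"].tolist()
print(raw_rank, norm_rank)
```

On this toy data the raw ranking follows NRP alone, while the normalized ranking lets the high duplicate score of product `C` move it above `B`.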


In [33]:
top_n = int(0.05 * len(data))    # NOTE: 5% of the number of reviews (8,436), i.e. about 29% of the 29,282 products, not the top 5% of products
target_asins = nfs.sort_values('NFS_pre', ascending=False).head(top_n)['asin']

In [34]:
target_data = data[data['asin'].isin(target_asins)].copy()
print("HISN base dataset:", target_data.shape)

HISN base dataset: (94043, 16)


In [35]:
data.shape

(168728, 16)

In [36]:
coverage = len(target_data) / len(data)
print(f"{coverage*100:.2f}% of reviews belong to target products")
print(f"{len(target_asins)} target products out of {data['asin'].nunique()} total")


55.74% of reviews belong to target products
8436 target products out of 29282 total
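The 8,436 target products correspond to 5% of the review count, which is about 29% of all products. If the intent is instead to keep the top 5% of products, the cutoff would be derived from the number of rows in the per-product table. A minimal sketch on made-up scores (the table `nfs_toy` and the variable `frac` are hypothetical):

```python
import pandas as pd

# Hypothetical per-product scores: 40 products with scores 0..39
nfs_toy = pd.DataFrame({
    "asin": [f"P{i}" for i in range(40)],
    "NFS_pre": range(40),
})

frac = 0.05
# Cutoff based on the number of products, with at least one product kept
top_n = max(1, int(frac * len(nfs_toy)))
target_asins_toy = nfs_toy.nlargest(top_n, "NFS_pre")["asin"]

print(top_n, target_asins_toy.tolist())
```

Switching to this cutoff in the notebook would change every count reported after this point, so the downstream outputs would need to be regenerated.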


In [37]:
print("Users:", target_data['user_id'].nunique())
print("Items:", target_data['asin'].nunique())
print("Reviews (edges):", len(target_data))

avg_reviews_per_user = len(target_data) / target_data['user_id'].nunique()
avg_reviews_per_item = len(target_data) / target_data['asin'].nunique()

print(f"Avg reviews per user: {avg_reviews_per_user:.2f}")
print(f"Avg reviews per item: {avg_reviews_per_item:.2f}")


Users: 53418
Items: 8436
Reviews (edges): 94043
Avg reviews per user: 1.76
Avg reviews per item: 11.15


In [38]:
# ============================================================
# 1️⃣  Start from your current target_data (after NFS filtering)
# ============================================================
print("Before filtering:")
print("Users:", target_data['user_id'].nunique())
print("Items:", target_data['asin'].nunique())
print("Reviews:", len(target_data))

# ============================================================
# 2️⃣  Filter out one-shot users (those with only 1 review)
# ============================================================
user_counts = target_data['user_id'].value_counts()
active_users = user_counts[user_counts >= 2].index

filtered_data = target_data[target_data['user_id'].isin(active_users)].copy()

print("\nAfter removing 1-review users:")
print("Users:", filtered_data['user_id'].nunique())
print("Items:", filtered_data['asin'].nunique())
print("Reviews:", len(filtered_data))

# ============================================================
# 3️⃣  Expand HISN neighborhood — include ALL reviews
#      from these remaining active users (even for non-target items)
# ============================================================
target_users = filtered_data['user_id'].unique()
expanded_data = data[data['user_id'].isin(target_users)].copy()

print("\nAfter expanding to all reviews of active users:")
print("Users:", expanded_data['user_id'].nunique())
print("Items:", expanded_data['asin'].nunique())
print("Reviews:", len(expanded_data))

# ============================================================
# 4️⃣  Optional — verify average connectivity
# ============================================================
avg_reviews_per_user = len(expanded_data) / expanded_data['user_id'].nunique()
avg_reviews_per_item = len(expanded_data) / expanded_data['asin'].nunique()

print(f"\nAvg reviews per user: {avg_reviews_per_user:.2f}")
print(f"Avg reviews per item: {avg_reviews_per_item:.2f}")

# ============================================================
# 5️⃣  Save this final HISN dataset for Stage 2 (SL-GAD training)
# ============================================================
expanded_data.to_parquet("nb2/hisn_final.parquet", index=False)
print("\n✅ Final HISN saved as 'hisn_final.parquet'")


Before filtering:
Users: 53418
Items: 8436
Reviews: 94043

After removing 1-review users:
Users: 29725
Items: 8430
Reviews: 70350

After expanding to all reviews of active users:
Users: 29725
Items: 15881
Reviews: 81922

Avg reviews per user: 2.76
Avg reviews per item: 5.16

✅ Final HISN saved as 'hisn_final.parquet'
