# Feature Engineering for Fraudulent Review Detection  
This notebook performs end-to-end feature engineering for the BT4012 fraud detection project.  
We engineer four groups of features:

1. **Reviewer (User) Data**  
   - Historical behaviour and engagement patterns  
   - Account age and activity frequency  
   - Normalised ratios such as `usefulCount_per_review` and `reviews_per_month`  
   - These features help identify suspicious user behaviour (e.g., very new accounts posting frequent reviews)

2. **Business (Restaurant) Data**  
   - Restaurant attributes, operational characteristics, and completeness of business profiles  
   - Category signals and filtered-review statistics  
   - Top-5 categorical encoding keeps one-hot encoding from producing a large number of sparse columns

3. **Reviewer–Business Relationship Features**  
   - Geographic match (same city/state)  
   - Exact location matching  
   - These features detect unusual patterns, such as users reviewing restaurants far outside their typical region

4. **Graph-Based Network Features**  
   - Reviewer graph constructed using shared restaurants  
   - Edge weights represent co-review frequency  
   - Extracted features include `max_weight`, `avg_weight`, and `clustering_coeff`  
   - This captures fraud rings, where multiple fake accounts target the same businesses

---

## Objective  
The goal of this notebook is to produce a **single, unified, feature-rich dataset** that captures behavioural, business, relational, and network-level characteristics for each review instance.

This dataset will be used to train downstream machine learning models for detecting fraudulent reviews.


In [109]:
# Imports
import pandas as pd
import numpy as np
import networkx as nx
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


In [110]:
path_reviewers = "../data/raw/Reviewers (Users) CSV.csv"
path_reviews = "../data/raw/Smaller_Reviews.csv"
path_restaurants = "../data/raw/Resturants CSV.csv"

# Load data
reviewers = pd.read_csv(path_reviewers)
reviews = pd.read_csv(path_reviews)
restaurants = pd.read_csv(path_restaurants)

# Columns that should remain un-prefixed
exclude = ['reviewID', 'reviewerID', 'restaurantID']

# Apply prefixes for column origin traceability
reviews.columns = [c if c in exclude else f"review_{c}" for c in reviews.columns]
reviewers.columns = [c if c in exclude else f"reviewer_{c}" for c in reviewers.columns]
restaurants.columns = [c if c in exclude else f"restaurant_{c}" for c in restaurants.columns]


In [111]:
# Convert dates
reviews['review_date'] = pd.to_datetime(reviews['review_date'], errors='coerce')
reviewers['reviewer_yelpJoinDate'] = pd.to_datetime(
    reviewers['reviewer_yelpJoinDate'], format="%B %Y", errors='coerce'
)

# Split reviewer location into city/state
loc_split = reviewers['reviewer_location'].str.split(",", n=1, expand=True)
reviewers['reviewer_city'] = loc_split[0].str.strip()
reviewers['reviewer_state'] = loc_split[1].str.strip()
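
As a quick check of the parsing above, the sketch below runs the same two steps on a few made-up rows (the values are illustrative and not taken from the dataset). Join dates that do not match `%B %Y` become `NaT`, and a location without a comma leaves the state missing.

```python
import pandas as pd

# Hypothetical sample rows, not from the project data
toy = pd.DataFrame({
    "reviewer_yelpJoinDate": ["October 2011", "March 2009", "unknown"],
    "reviewer_location": ["Chicago, IL", "New York, NY", "Chicago"],
})

# Same parsing as above: unparseable strings are coerced to NaT
toy["reviewer_yelpJoinDate"] = pd.to_datetime(
    toy["reviewer_yelpJoinDate"], format="%B %Y", errors="coerce"
)

# Split on the first comma only; rows without a comma get a missing state
loc_split = toy["reviewer_location"].str.split(",", n=1, expand=True)
toy["reviewer_city"] = loc_split[0].str.strip()
toy["reviewer_state"] = loc_split[1].str.strip()

print(toy)
```

Because the join date only has month precision, it is parsed as the first day of that month, so account ages derived from it are approximate.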


In [112]:
# Merge reviews with reviewer and restaurant attributes
df = (
    reviews
    .merge(reviewers, on="reviewerID", how="left")
    .merge(restaurants, on="restaurantID", how="left")
)


## Feature Engineering for Reviewers

In [113]:
# Account age
df['account_age_days'] = (df['review_date'] - df['reviewer_yelpJoinDate']).dt.days
df['account_age_months'] = (df['account_age_days'] / 30).clip(lower=0)

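# Safe denominators: zeros become NaN so ratios are NaN rather than inf
safe_review_count = df['reviewer_reviewCount'].replace(0, np.nan)
safe_age_months = df['account_age_months'].replace(0, np.nan)

# Reviewer activity
df['reviews_per_month'] = df['reviewer_reviewCount'] / safe_age_months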

# Per-review ratios
ratio_cols = [
    "usefulCount", "coolCount", "funnyCount",
    "complimentCount", "tipCount", "fanCount", "firstCount"
]

for metric in ratio_cols:
    colname = f"reviewer_{metric}"
    if colname in df.columns:
        df[f"{metric}_per_review"] = df[colname] / safe_review_count

# Engagement ratio
df['engagement_ratio'] = (
    df['reviewer_usefulCount'] +
    df['reviewer_coolCount'] +
    df['reviewer_funnyCount']
) / safe_review_count
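
The zero-to-NaN denominator guard used for the ratios can be illustrated on made-up numbers. This is a minimal sketch: the values are hypothetical, and `safe_age_months` is an illustrative name that applies the same guard to account age.

```python
import numpy as np
import pandas as pd

# Hypothetical reviewers; the second has no reviews and a zero-month-old account
toy = pd.DataFrame({
    "reviewer_reviewCount": [10, 0, 4],
    "reviewer_usefulCount": [5, 3, 0],
    "account_age_months": [5.0, 0.0, 2.0],
})

# Replace zeros with NaN so division yields NaN instead of inf
safe_review_count = toy["reviewer_reviewCount"].replace(0, np.nan)
safe_age_months = toy["account_age_months"].replace(0, np.nan)

toy["usefulCount_per_review"] = toy["reviewer_usefulCount"] / safe_review_count
toy["reviews_per_month"] = toy["reviewer_reviewCount"] / safe_age_months

print(toy[["usefulCount_per_review", "reviews_per_month"]])
```

Rows with a zero denominator end up as NaN, which can then be imputed explicitly instead of silently carrying infinite values into model training.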


## Feature Engineering for Businesses

In [114]:
# Clean restaurant location → extract city/state
parts = df['restaurant_location'].fillna("").str.split("-")

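def get_city_state(x):
    # The last "-"-separated part is expected to look like "City, ST"
    last = x[-1].strip()
    if ", " in last:
        # Split on the last ", " only, so exactly two values are returned
        city, state = last.rsplit(", ", 1)
        return (city, state)
    return (None, None)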

df['restaurant_city'], df['restaurant_state'] = zip(*parts.apply(get_city_state))

def get_neighbourhood(x):
    return x[-2].strip() if len(x) >= 2 else None

df['restaurant_neighbourhood'] = parts.apply(get_neighbourhood)

# Missing fields count
business_cols = [c for c in df.columns if c.startswith("restaurant_")]
df["restaurant_num_missing_fields"] = df[business_cols].isna().sum(axis=1)

# Feature availability count
restaurant_feature_cols = [
    "restaurant_GoodforKids", "restaurant_AcceptsCreditCards", "restaurant_Parking",
    "restaurant_Attire", "restaurant_GoodforGroups", "restaurant_PriceRange",
    "restaurant_TakesReservations", "restaurant_Delivery", "restaurant_Takeout",
    "restaurant_WaiterService", "restaurant_OutdoorSeating", "restaurant_WiFi",
    "restaurant_Alcohol", "restaurant_NoiseLevel", "restaurant_Ambience",
    "restaurant_HasTV", "restaurant_Caters", "restaurant_WheelchairAccessible"
]

df['restaurant_feature_count'] = df[restaurant_feature_cols].notna().sum(axis=1)

# Fill numeric missing values
num_cols = ['restaurant_reviewCount', 'restaurant_rating', 'restaurant_filReviewCount']
df[num_cols] = df[num_cols].fillna(0)

# Filtered reviews
df['restaurant_filtered_ratio'] = df['restaurant_filReviewCount'] / (df['restaurant_reviewCount'] + 1)
df['restaurant_filtered_diff'] = df['restaurant_reviewCount'] - df['restaurant_filReviewCount']
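
The location parsing at the top of this cell can be tried on a few hypothetical strings. The sketch below assumes the column looks like `address - neighbourhood - City, ST`, which is inferred from the parsing logic and not verified against the data. It splits on the last `", "` so that exactly two values come back.

```python
import pandas as pd

# Hypothetical location strings; the real column's format is an assumption
locs = pd.Series([
    "123 Main St - Lakeview - Chicago, IL",
    "Chicago, IL",
    None,
])

parts = locs.fillna("").str.split("-")

def get_city_state(x):
    # The last part is expected to look like "City, ST"
    last = x[-1].strip()
    if ", " in last:
        city, state = last.rsplit(", ", 1)
        return (city, state)
    return (None, None)

def get_neighbourhood(x):
    # The second-to-last part, when there is one
    return x[-2].strip() if len(x) >= 2 else None

cities, states = zip(*parts.apply(get_city_state))
neighbourhoods = parts.apply(get_neighbourhood).tolist()

print(cities, states, neighbourhoods)
```

Note that when a string has only two parts, the second-to-last part is returned as the neighbourhood whatever it actually contains, and an address that itself contains a hyphen will shift the parts.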


In [None]:
# Remove unnecessary business columns
cols_to_remove_before_any_processing = [
    "restaurant_name",
    "restaurant_address",
    "restaurant_phoneNumber",
    "restaurant_location",
    "restaurant_Hours",
    "restaurant_categories",
    "restaurant_webSite",
]

df = df.drop(columns=cols_to_remove_before_any_processing, errors="ignore")

## Reviewer–Restaurant Geographic Consistency Features


In [116]:
df['same_state'] = (df['reviewer_state'] == df['restaurant_state']).astype(int)
df['same_city'] = (df['reviewer_city'] == df['restaurant_city']).astype(int)
df['location_exact_match'] = (df['same_city'] & df['same_state']).astype(int)

# Geographic signals can help flag suspicious reviewers.
# Legitimate users typically review restaurants within reasonable proximity of
# where they live, while fraudulent accounts often show patterns that contradict
# normal geographic behaviour.
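
One property of these flags is worth seeing on a small example: in pandas, a comparison involving a missing value is `False`, so rows with unknown locations are encoded the same way as genuine mismatches. The rows below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a match, a mismatch, and a row with no location data
toy = pd.DataFrame({
    "reviewer_city": ["Chicago", "Chicago", np.nan],
    "reviewer_state": ["IL", "IL", np.nan],
    "restaurant_city": ["Chicago", "Boston", np.nan],
    "restaurant_state": ["IL", "MA", np.nan],
})

toy["same_state"] = (toy["reviewer_state"] == toy["restaurant_state"]).astype(int)
toy["same_city"] = (toy["reviewer_city"] == toy["restaurant_city"]).astype(int)
toy["location_exact_match"] = (toy["same_city"] & toy["same_state"]).astype(int)

print(toy[["same_state", "same_city", "location_exact_match"]])
```

If the distinction matters for the model, a separate missing-location indicator could be added; that is a suggestion and not something this notebook does.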



### Perform One Hot Encoding for model training

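We one-hot encode the categorical restaurant columns, including the location-based ones, because fraudulent reviews may be more common in certain restaurant locations. Each column keeps its five most frequent values and groups everything else, including missing values, under "Others".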
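
The top-k encoding pattern can be sketched on a toy column. The example below keeps the top 2 values for brevity where the notebook keeps 5, and the city values are made up.

```python
import pandas as pd

# Hypothetical column with one rare value and one missing value
toy = pd.DataFrame(
    {"restaurant_city": ["Chicago"] * 3 + ["Boston"] * 2 + ["Austin", None]}
)

# Keep the k most frequent values; everything else (including NaN) becomes "Others"
top2 = toy["restaurant_city"].value_counts().nlargest(2).index
toy["restaurant_city_top2"] = toy["restaurant_city"].where(
    toy["restaurant_city"].isin(top2), "Others"
)

# One-hot encode the reduced column
encoded = pd.get_dummies(toy, columns=["restaurant_city_top2"])
print(encoded.columns.tolist())
```

The number of dummy columns per feature is bounded by k + 1 regardless of how many distinct values the raw column has.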

In [117]:
restaurant_cols = [c for c in df.columns if c.startswith("restaurant_")]
restaurant_object_cols = (
    df[restaurant_cols].select_dtypes(include="object").columns.tolist()
)

for col in restaurant_object_cols:
    top5 = df[col].value_counts().nlargest(5).index
    df[col + "_top5"] = df[col].where(df[col].isin(top5), "Others")

df = pd.get_dummies(df, columns=[c + "_top5" for c in restaurant_object_cols])

df.drop(columns=restaurant_object_cols, inplace=True)


In [118]:
df.to_csv("../data/processed/reviewer_business_features.csv", index=False)


## Graph-Based Reviewer Network Features

To capture coordinated behaviour among suspicious reviewers, we construct a
reviewer–reviewer interaction graph based on shared restaurant activity.

Fake-review campaigns often involve multiple accounts operated by the same entity,
leaving behind a structural footprint in the form of tightly connected reviewer clusters.
This section transforms raw reviewer–restaurant interactions into graph-theoretic
features that quantify these relationships.
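
To make the construction concrete, here is a minimal sketch on a hypothetical review table with four reviewers and three restaurants. It de-duplicates reviewers per restaurant and uses `itertools.combinations` for the pairs, which is one way to write the pairing step and not necessarily how the cell below does it.

```python
from itertools import combinations

import networkx as nx
import pandas as pd

# Hypothetical reviews: a and b share two restaurants, d shares none
toy = pd.DataFrame({
    "restaurantID": ["r1", "r1", "r1", "r2", "r2", "r3"],
    "reviewerID":   ["a",  "b",  "c",  "a",  "b",  "d"],
})

graph = nx.Graph()
graph.add_nodes_from(toy["reviewerID"].unique())

# One edge per reviewer pair; weight = number of restaurants both reviewed
for reviewers_list in toy.groupby("restaurantID")["reviewerID"].unique():
    for r1, r2 in combinations(reviewers_list, 2):
        if graph.has_edge(r1, r2):
            graph[r1][r2]["weight"] += 1
        else:
            graph.add_edge(r1, r2, weight=1)

# Per-node features: strongest tie and weighted clustering coefficient
weights = {
    n: [graph[n][nbr]["weight"] for nbr in graph.neighbors(n)]
    for n in graph.nodes()
}
max_weight = {n: max(w) if w else 0 for n, w in weights.items()}
clustering = nx.clustering(graph, weight="weight")

print(max_weight)
print(clustering)
```

Here `a` and `b` end up with an edge of weight 2, while the isolated reviewer `d` gets 0 for both features. The number of pairs grows quadratically with the number of reviewers per restaurant, so popular restaurants dominate the running time on the full dataset.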


In [119]:
# Copy to avoid modifying the earlier df
merged = df.copy()

graph = nx.Graph()
graph.add_nodes_from(merged['reviewerID'].unique())

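# Build edges based on shared restaurants; reviewers are de-duplicated per
# restaurant so that any repeat reviews cannot create self-loops or inflate weights
restaurant_groups = merged.groupby('restaurantID')['reviewerID'].unique()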

for reviewers_list in restaurant_groups:
    for i in range(len(reviewers_list)):
        for j in range(i+1, len(reviewers_list)):
            r1, r2 = reviewers_list[i], reviewers_list[j]
            if graph.has_edge(r1, r2):
                graph[r1][r2]['weight'] += 1
            else:
                graph.add_edge(r1, r2, weight=1)

# Graph metrics
max_w = {}
avg_w = {}
for node in graph.nodes():
    weights = [graph[node][nbr]['weight'] for nbr in graph.neighbors(node)]
    max_w[node] = max(weights) if weights else 0
    avg_w[node] = np.mean(weights) if weights else 0

clustering = nx.clustering(graph, weight='weight')

graph_features = pd.DataFrame({
    "reviewerID": list(graph.nodes()),
    "max_weight": [max_w[n] for n in graph.nodes()],
    "avg_weight": [avg_w[n] for n in graph.nodes()],
    "clustering_coeff": [clustering[n] for n in graph.nodes()],
})

# Merge with base
merged = merged.merge(graph_features, on="reviewerID", how="left").fillna(0)


## Train-Test Split

In [None]:
X = merged.drop(columns=['review_flagged'])
y = merged['review_flagged']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_df = X_train.copy()
train_df["review_flagged"] = y_train

test_df = X_test.copy()
test_df["review_flagged"] = y_test

train_df.to_csv("../data/processed/train.csv", index=False)
test_df.to_csv("../data/processed/test.csv", index=False)
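
The effect of `stratify=y` can be checked on a toy frame. This is a sketch with a made-up 20% positive rate, not the project's actual class balance.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 20 flagged
toy = pd.DataFrame({
    "feature": range(100),
    "review_flagged": [1] * 20 + [0] * 80,
})

X = toy.drop(columns=["review_flagged"])
y = toy["review_flagged"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both splits keep the same share of flagged rows as the full data
print(y_train.mean(), y_test.mean())
```

Stratification preserves the class ratio, but it does not keep all reviews from one reviewer in the same split. If that matters for evaluation, a group-aware splitter would be needed, which is outside what this notebook does.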


In [121]:
count = 0
for col in df.columns:
    print(col)
    count += 1
print(count)

review_Unnamed: 0
review_date
reviewID
reviewerID
review_reviewContent
review_rating
review_usefulCount
review_coolCount
review_funnyCount
review_flagged
restaurantID
reviewer_name
reviewer_location
reviewer_yelpJoinDate
reviewer_friendCount
reviewer_reviewCount
reviewer_firstCount
reviewer_usefulCount
reviewer_coolCount
reviewer_funnyCount
reviewer_complimentCount
reviewer_tipCount
reviewer_fanCount
reviewer_city
reviewer_state
restaurant_reviewCount
restaurant_rating
restaurant_filReviewCount
account_age_days
account_age_months
reviews_per_month
usefulCount_per_review
coolCount_per_review
funnyCount_per_review
complimentCount_per_review
tipCount_per_review
fanCount_per_review
firstCount_per_review
engagement_ratio
restaurant_num_missing_fields
restaurant_feature_count
restaurant_filtered_ratio
restaurant_filtered_diff
same_state
same_city
location_exact_match
restaurant_GoodforKids_top5_No
restaurant_GoodforKids_top5_Others
restaurant_GoodforKids_top5_Yes
restaurant_AcceptsCreditCard

In [None]:
#removed:
#"review_Unnamed: 0","review_date","reviewID","reviewerID","review_reviewContent","review_rating",
#"review_flagged","restaurantID","reviewer_name","reviewer_location","reviewer_yelpJoinDate",
#"reviewer_city","reviewer_state","account_age_days",
[
"review_rating",
"review_usefulCount",
"review_coolCount",
"review_funnyCount",
"reviewer_friendCount",
"reviewer_reviewCount",
"reviewer_firstCount",
"reviewer_usefulCount",
"reviewer_coolCount",
"reviewer_funnyCount",
"reviewer_complimentCount",
"reviewer_tipCount",
"reviewer_fanCount",
"reviews_per_month",
"usefulCount_per_review",
"coolCount_per_review",
"funnyCount_per_review",
"complimentCount_per_review",
"tipCount_per_review",
"fanCount_per_review",
"firstCount_per_review",
"engagement_ratio",
"account_age_months",

"restaurant_reviewCount",
"restaurant_rating",
"restaurant_filReviewCount",
"restaurant_num_missing_fields",
"restaurant_feature_count",
"restaurant_filtered_ratio",
"restaurant_filtered_diff",

"same_state",
"same_city",
"location_exact_match",

"restaurant_GoodforKids_top5_No",
"restaurant_GoodforKids_top5_Others",
"restaurant_GoodforKids_top5_Yes",
"restaurant_AcceptsCreditCards_top5_No",
"restaurant_AcceptsCreditCards_top5_Others",
"restaurant_AcceptsCreditCards_top5_Yes",
"restaurant_Parking_top5_Others",
"restaurant_Parking_top5_Private Lot",
"restaurant_Parking_top5_Street",
"restaurant_Parking_top5_Street, Private Lot",
"restaurant_Parking_top5_Street, Valet",
"restaurant_Parking_top5_Valet",
"restaurant_Attire_top5_Casual",
"restaurant_Attire_top5_Dressy",
"restaurant_Attire_top5_Formal (Jacket Required)",
"restaurant_Attire_top5_Others",
"restaurant_GoodforGroups_top5_No",
"restaurant_GoodforGroups_top5_Others",
"restaurant_GoodforGroups_top5_Yes",
"restaurant_PriceRange_top5_$",
"restaurant_PriceRange_top5_$$",
"restaurant_PriceRange_top5_$$$",
"restaurant_PriceRange_top5_$$$$",
"restaurant_PriceRange_top5_Others",
"restaurant_PriceRange_top5_££",
"restaurant_TakesReservations_top5_No",
"restaurant_TakesReservations_top5_Others",
"restaurant_TakesReservations_top5_Yes",
"restaurant_Delivery_top5_No",
"restaurant_Delivery_top5_Others",
"restaurant_Delivery_top5_Yes",
"restaurant_Takeout_top5_No",
"restaurant_Takeout_top5_Others",
"restaurant_Takeout_top5_Yes",
"restaurant_WaiterService_top5_No",
"restaurant_WaiterService_top5_Others",
"restaurant_WaiterService_top5_Yes",
"restaurant_OutdoorSeating_top5_No",
"restaurant_OutdoorSeating_top5_Others",
"restaurant_OutdoorSeating_top5_Yes",
"restaurant_WiFi_top5_Free",
"restaurant_WiFi_top5_No",
"restaurant_WiFi_top5_Others",
"restaurant_WiFi_top5_Paid",
"restaurant_GoodFor_top5_Breakfast",
"restaurant_GoodFor_top5_Dinner",
"restaurant_GoodFor_top5_Late Night, Dinner",
"restaurant_GoodFor_top5_Lunch",
"restaurant_GoodFor_top5_Lunch, Dinner",
"restaurant_GoodFor_top5_Others",
"restaurant_Alcohol_top5_Beer & Wine Only",
"restaurant_Alcohol_top5_Full Bar",
"restaurant_Alcohol_top5_No",
"restaurant_Alcohol_top5_Others",
"restaurant_NoiseLevel_top5_Average",
"restaurant_NoiseLevel_top5_Loud",
"restaurant_NoiseLevel_top5_Others",
"restaurant_NoiseLevel_top5_Quiet",
"restaurant_NoiseLevel_top5_Very Loud",
"restaurant_Ambience_top5_Casual",
"restaurant_Ambience_top5_Classy",
"restaurant_Ambience_top5_Hipster, Casual",
"restaurant_Ambience_top5_Others",
"restaurant_Ambience_top5_Trendy",
"restaurant_Ambience_top5_Trendy, Casual",
"restaurant_HasTV_top5_No",
"restaurant_HasTV_top5_Others",
"restaurant_HasTV_top5_Yes",
"restaurant_Caters_top5_No",
"restaurant_Caters_top5_Others",
"restaurant_Caters_top5_Yes",
"restaurant_WheelchairAccessible_top5_No",
"restaurant_WheelchairAccessible_top5_Others",
"restaurant_WheelchairAccessible_top5_Yes",
"restaurant_city_top5_Chicago",
"restaurant_city_top5_Las Vegas",
"restaurant_city_top5_Los Angeles",
"restaurant_city_top5_New York",
"restaurant_city_top5_Others",
"restaurant_city_top5_San Francisco",
"restaurant_state_top5_CA",
"restaurant_state_top5_IL",
"restaurant_state_top5_MA",
"restaurant_state_top5_NY",
"restaurant_state_top5_Others",
"restaurant_state_top5_TX",
"restaurant_neighbourhood_top5_Lakeview",
"restaurant_neighbourhood_top5_Lincoln Park",
"restaurant_neighbourhood_top5_Near North Side",
"restaurant_neighbourhood_top5_Others",
"restaurant_neighbourhood_top5_The Loop",
"restaurant_neighbourhood_top5_Wicker Park",

"max_weight",
"avg_weight",
"clustering_coeff"
]


['review_usefulCount',
 'review_coolCount',
 'review_funnyCount',
 'reviewer_name',
 'reviewer_friendCount',
 'reviewer_reviewCount',
 'reviewer_firstCount',
 'reviewer_usefulCount',
 'reviewer_coolCount',
 'reviewer_funnyCount',
 'reviewer_complimentCount',
 'reviewer_tipCount',
 'reviewer_fanCount',
 'restaurant_reviewCount',
 'restaurant_rating',
 'restaurant_filReviewCount',
 'account_age_days',
 'account_age_months',
 'reviews_per_month',
 'usefulCount_per_review',
 'coolCount_per_review',
 'funnyCount_per_review',
 'complimentCount_per_review',
 'tipCount_per_review',
 'fanCount_per_review',
 'firstCount_per_review',
 'engagement_ratio',
 'restaurant_num_missing_fields',
 'restaurant_feature_count',
 'restaurant_filtered_ratio',
 'restaurant_filtered_diff',
 'same_state',
 'same_city',
 'location_exact_match',
 'restaurant_name_top5_Fogo de Chao',
 "restaurant_name_top5_Hot Doug's",
 'restaurant_name_top5_Others',
 'restaurant_name_top5_Piece Brewery and Pizzeria',
 'restaurant_n