FRAUD DETECTION FEATURE ENGINEERING


Author: Zeeshan
Date: January 2, 2026

Purpose: Create fraud indicator features from preprocessed data

My Feature Categories:
1. Competition Red Flags (bidder analysis)
2. Price Anomalies (amount analysis)
3. Timing Suspicions (date patterns)
4. Department Patterns (entity behavior)

Total features to create: 15-20

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

# Load preprocessed data
df = pd.read_csv('data/processed/preprocessed_data.csv')
df['pub_date'] = pd.to_datetime(df['pub_date'])

print(f"📥 Loaded {len(df):,} preprocessed records")
print("🔨 Starting feature engineering...")

📥 Loaded 26,205 preprocessed records
🔨 Starting feature engineering...


CATEGORY 1: COMPETITION RED FLAGS

My hypothesis: Fraud often involves reducing competition

Features I'm creating:
- single_bidder_flag: Only 1 company bid (major red flag)
- weak_competition: Fewer than 3 bidders (concerning)
- competition_score: Bidder count scaled to 0-100, capped at 10 bidders (lower = more suspicious)

In [2]:
print("\n1️⃣ Creating competition features...")

# Single bidder flag (1 = yes, 0 = no)
df['single_bidder_flag'] = (df['bidder_count'] == 1).astype(int)
single_count = df['single_bidder_flag'].sum()
single_pct = single_count / len(df) * 100
print(f"   ✓ single_bidder_flag: {single_count:,} tenders ({single_pct:.1f}%)")

# Weak competition (less than 3 bidders)
df['weak_competition'] = (df['bidder_count'] < 3).astype(int)
weak_count = df['weak_competition'].sum()
print(f"   ✓ weak_competition: {weak_count:,} tenders ({weak_count/len(df)*100:.1f}%)")

# Competition score (lower = more suspicious)
# Convert bidder count to 0-100 scale (capped at 10 bidders = 100)
df['competition_score'] = (df['bidder_count'].clip(upper=10) / 10 * 100)
avg_comp_score = df['competition_score'].mean()
print(f"   ✓ competition_score: Average = {avg_comp_score:.1f}/100")


1️⃣ Creating competition features...
   ✓ single_bidder_flag: 3,364 tenders (12.8%)
   ✓ weak_competition: 14,939 tenders (57.0%)
   ✓ competition_score: Average = 27.3/100

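One thing worth checking for these features: if bidder_count has missing values, comparisons such as == 1 and < 3 evaluate to False, so those rows silently get flag 0. A minimal sketch on toy data (the values are hypothetical, not taken from the dataset) shows the behavior:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); the last row has a missing bidder count
toy = pd.DataFrame({"bidder_count": [1, 2, 5, 12, np.nan]})

# Same expressions as in the competition cell above
toy["single_bidder_flag"] = (toy["bidder_count"] == 1).astype(int)
toy["weak_competition"] = (toy["bidder_count"] < 3).astype(int)
toy["competition_score"] = toy["bidder_count"].clip(upper=10) / 10 * 100

# Comparisons against NaN are False, so the missing row gets 0 in both
# binary flags, while its competition_score stays NaN.
n_missing = int(toy["bidder_count"].isna().sum())
print(toy)
print(f"Rows with missing bidder_count: {n_missing}")
```

Whether this matters here depends on the preprocessing step; if missing counts do occur, a separate indicator column for them may be worth considering.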

CATEGORY 2: PRICE ANOMALY FEATURES

My approach: Compare each tender to department averages

Features:
- price_vs_dept_avg: How much higher/lower than department average
- extreme_high_price: Significantly above normal (potential inflation)
- round_amount_flag: Exact lakhs/crores (often inflated estimates)
- threshold_game: Just under ₹10L or ₹25L limits (avoidance tactic)


In [3]:
print("\n2️⃣ Creating price anomaly features...")

# Calculate department-wise average prices
dept_avg_prices = df.groupby('dept_name')['contract_amount'].transform('mean')
df['dept_avg_amount'] = dept_avg_prices

# Price deviation from department average (percentage)
df['price_vs_dept_avg'] = ((df['contract_amount'] - df['dept_avg_amount']) /
                            df['dept_avg_amount'] * 100)

# Extreme high price (more than 100% above department average)
df['extreme_high_price'] = (df['price_vs_dept_avg'] > 100).astype(int)
extreme_count = df['extreme_high_price'].sum()
print(f"   ✓ extreme_high_price: {extreme_count:,} tenders (>100% above avg)")

# Round amount detection (exact multiple of 1 lakh)
df['round_amount_flag'] = (df['contract_amount'] % 100000 == 0).astype(int)
round_count = df['round_amount_flag'].sum()
print(f"   ✓ round_amount_flag: {round_count:,} tenders ({round_count/len(df)*100:.1f}%)")

# Threshold gaming (just under approval limits)
# In India: ₹10L and ₹25L are common approval thresholds
df['threshold_game'] = (
    ((df['contract_amount'] >= 950000) & (df['contract_amount'] < 1000000)) |
    ((df['contract_amount'] >= 2400000) & (df['contract_amount'] < 2500000))
).astype(int)
threshold_count = df['threshold_game'].sum()
print(f"   ✓ threshold_game: {threshold_count:,} tenders near limits")


2️⃣ Creating price anomaly features...
   ✓ extreme_high_price: 2,308 tenders (>100% above avg)
   ✓ round_amount_flag: 4,819 tenders (18.4%)
   ✓ threshold_game: 294 tenders near limits

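A department mean is sensitive to a few very large contracts: one outlier pulls the baseline up and makes every other tender in that department look cheap. As a possible robustness check (not what the cell above uses), the sketch below compares the mean baseline with a median baseline on hypothetical toy data. The columns dept_median_amount and price_vs_dept_median are illustrative names, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data: one department, four typical contracts and one very large one
toy = pd.DataFrame({
    "dept_name": ["A"] * 5,
    "contract_amount": [100_000, 120_000, 110_000, 90_000, 5_000_000],
})

grp = toy.groupby("dept_name")["contract_amount"]
toy["dept_avg_amount"] = grp.transform("mean")        # same baseline as the cell above
toy["dept_median_amount"] = grp.transform("median")   # illustrative alternative baseline

# Percentage deviation from each baseline
toy["price_vs_dept_avg"] = (
    (toy["contract_amount"] - toy["dept_avg_amount"]) / toy["dept_avg_amount"] * 100
)
toy["price_vs_dept_median"] = (
    (toy["contract_amount"] - toy["dept_median_amount"]) / toy["dept_median_amount"] * 100
)

print(toy[["contract_amount", "price_vs_dept_avg", "price_vs_dept_median"]].round(1))
```

In this toy case the single large contract is flagged under both baselines, but against the mean the four ordinary contracts all appear to sit far below "normal", which is an artifact of the outlier rather than a property of those tenders.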

CATEGORY 3: TIMING SUSPICION FEATURES

My observation from EDA: Timing patterns matter in fraud

Features:
- dec_rush: December awards (year-end budget exhaustion)
- march_rush: March awards (fiscal year-end in India)
- weekend_award: Saturday/Sunday (unusual for govt)
- quarter_end: Last month of quarter (budget pressure)

In [4]:
print("\n3️⃣ Creating timing suspicion features...")

# December rush (year-end budget exhaustion)
df['dec_rush'] = (df['tender_month'] == 12).astype(int)
dec_count = df['dec_rush'].sum()
print(f"   ✓ dec_rush: {dec_count:,} December tenders ({dec_count/len(df)*100:.1f}%)")

# March rush (Indian fiscal year ends March 31)
df['march_rush'] = (df['tender_month'] == 3).astype(int)
march_count = df['march_rush'].sum()
print(f"   ✓ march_rush: {march_count:,} March tenders ({march_count/len(df)*100:.1f}%)")

# Weekend award (suspicious timing)
df['weekend_award'] = (df['day_of_week'] >= 5).astype(int)
weekend_count = df['weekend_award'].sum()
print(f"   ✓ weekend_award: {weekend_count:,} weekend tenders ({weekend_count/len(df)*100:.1f}%)")

# Quarter-end pressure
df['quarter_end'] = df['tender_month'].isin([3, 6, 9, 12]).astype(int)
qtr_count = df['quarter_end'].sum()
print(f"   ✓ quarter_end: {qtr_count:,} quarter-end tenders ({qtr_count/len(df)*100:.1f}%)")


3️⃣ Creating timing suspicion features...
   ✓ dec_rush: 0 December tenders (0.0%)
   ✓ march_rush: 0 March tenders (0.0%)
   ✓ weekend_award: 0 weekend tenders (0.0%)
   ✓ quarter_end: 26,205 quarter-end tenders (100.0%)

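The output above has zero December, March and weekend tenders while every record counts as quarter-end, which means tender_month only takes the values 6 or 9 in this file. That may be genuine if the data covers a short period, but it is worth a sanity check. A minimal sketch, assuming tender_month and day_of_week are meant to be derived from pub_date (the preprocessing step is not shown here) and using hypothetical toy dates:

```python
import pandas as pd

# Hypothetical toy dates; in the notebook this column would be df['pub_date']
toy = pd.DataFrame({
    "pub_date": pd.to_datetime(
        ["2025-03-29", "2025-06-15", "2025-12-31", "2025-09-10"]
    )
})

# Re-derive the calendar fields directly from the date column
toy["tender_month"] = toy["pub_date"].dt.month
toy["day_of_week"] = toy["pub_date"].dt.dayofweek  # Monday=0 ... Sunday=6

# Same weekend rule as the cell above
toy["weekend_award"] = (toy["day_of_week"] >= 5).astype(int)

# Distribution checks: a single month or weekday dominating everything
# would explain all-zero flags like the ones printed above
print(toy["tender_month"].value_counts().sort_index())
print(toy["day_of_week"].value_counts().sort_index())
```

Running the same two value_counts calls on the real df would show whether the flags are empty because of the data's date range or because the precomputed columns do not match pub_date.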

CATEGORY 4: DEPARTMENT PATTERN FEATURES

My insight: Some departments may have systematic issues

Features:
- dept_tender_volume: How active is this department
- dept_single_bid_rate: Department's history of single bidders

In [5]:
print("\n4️⃣ Creating department behavior features...")

# Department tender volume
dept_counts = df.groupby('dept_name').size()
df['dept_tender_volume'] = df['dept_name'].map(dept_counts)
print(f"   ✓ dept_tender_volume: Range {df['dept_tender_volume'].min()}-{df['dept_tender_volume'].max()}")

# Department single-bidder rate
dept_single_bids = df[df['single_bidder_flag'] == 1].groupby('dept_name').size()
dept_single_rate = (dept_single_bids / dept_counts * 100).fillna(0)
df['dept_single_bid_rate'] = df['dept_name'].map(dept_single_rate)
print(f"   ✓ dept_single_bid_rate: Average {df['dept_single_bid_rate'].mean():.1f}%")


4️⃣ Creating department behavior features...
   ✓ dept_tender_volume: Range 1-9580
   ✓ dept_single_bid_rate: Average 12.8%

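Both department features can also be computed with groupby().transform(), which avoids the separate map step. The sketch below is an equivalent formulation on hypothetical toy data, not a change to the cell above. It also illustrates a caveat suggested by the volume range of 1-9580: a department with a single tender can only have a rate of 0% or 100%.

```python
import pandas as pd

# Hypothetical toy data: department B has only one tender
toy = pd.DataFrame({
    "dept_name": ["A", "A", "A", "A", "B", "C", "C"],
    "single_bidder_flag": [1, 0, 0, 0, 1, 0, 0],
})

grp = toy.groupby("dept_name")["single_bidder_flag"]

# Number of tenders per department, broadcast back to every row
toy["dept_tender_volume"] = grp.transform("count")

# Mean of a 0/1 flag is the share of single-bidder tenders; scale to percent
toy["dept_single_bid_rate"] = grp.transform("mean") * 100

print(toy.drop_duplicates("dept_name"))
```

Here department B ends up with a 100% single-bid rate from one record, so it may be worth reading dept_single_bid_rate together with dept_tender_volume rather than on its own.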

COMPOSITE RISK SCORE (My Weighted Formula)

My scoring logic:
- Single bidder: 30 points (highest weight - major red flag)
- Extreme price: 25 points (second highest - inflation indicator)
- Year/FY end: 20 points (timing pressure)
- Round amount: 15 points (lazy estimation)
- Threshold game: 10 points (deliberate avoidance)

Total: 0-100 risk score
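
As a worked example of the weights listed above, the sketch below scores one tender from its flags. The function risk_score and the key year_or_fy_end (standing for "dec_rush or march_rush") are illustrative names, and the flag values are hypothetical.

```python
# Weights copied from the scoring logic above; they sum to 100
WEIGHTS = {
    "single_bidder_flag": 30,
    "extreme_high_price": 25,
    "year_or_fy_end": 20,      # illustrative name for dec_rush OR march_rush
    "round_amount_flag": 15,
    "threshold_game": 10,
}

def risk_score(flags):
    """Sum the weights of every flag that is set (missing flags count as 0)."""
    return sum(weight for name, weight in WEIGHTS.items() if flags.get(name, 0))

# Hypothetical tender: single bidder with a round contract amount
example = {"single_bidder_flag": 1, "round_amount_flag": 1}
print(risk_score(example))                    # 30 + 15
print(risk_score({k: 1 for k in WEIGHTS}))    # all flags set
```

A tender with only the single-bidder flag scores 30, and one with every flag set reaches the maximum of 100.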

In [6]:
print("\n5️⃣ Creating composite risk score...")

df['fraud_risk_score'] = (
    df['single_bidder_flag'] * 30 +
    df['extreme_high_price'] * 25 +
    (df['dec_rush'] | df['march_rush']) * 20 +
    df['round_amount_flag'] * 15 +
    df['threshold_game'] * 10
).clip(0, 100)

# Categorize risk levels
# Note: pd.cut bins are right-closed by default, i.e. (0, 30], (30, 60], (60, 100],
# so a score of exactly 0 falls outside every bin and gets NaN
df['risk_category'] = pd.cut(
    df['fraud_risk_score'],
    bins=[0, 30, 60, 100],
    labels=['Low', 'Medium', 'High']
)

print(f"\n📊 Risk Distribution:")
print(df['risk_category'].value_counts())
print(f"\n   Average risk score: {df['fraud_risk_score'].mean():.1f}/100")
print(f"   High risk tenders: {(df['risk_category'] == 'High').sum():,}")


5️⃣ Creating composite risk score...

📊 Risk Distribution:
risk_category
Low       7605
Medium    1494
High        61
Name: count, dtype: int64

   Average risk score: 8.9/100
   High risk tenders: 61

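The three categories above add up to 7,605 + 1,494 + 61 = 9,160 records out of 26,205. The likely reason is that pd.cut uses right-closed intervals by default, so the first bin is (0, 30] and a score of exactly 0 is left uncategorized. A minimal sketch on hypothetical scores, showing the default behavior next to the include_lowest=True option:

```python
import pandas as pd

# Hypothetical risk scores, including an exact 0
scores = pd.Series([0, 15, 30, 45, 70])
bins = [0, 30, 60, 100]
labels = ["Low", "Medium", "High"]

# Default: bins are (0, 30], (30, 60], (60, 100]; a score of 0 becomes NaN
default_cut = pd.cut(scores, bins=bins, labels=labels)

# include_lowest=True closes the first bin on the left, so 0 is labeled Low
inclusive_cut = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "score": scores,
    "default": default_cut,
    "include_lowest": inclusive_cut,
}))
```

With include_lowest=True every tender would receive a category; whether zero-score tenders should count as Low or stay in a separate "no flags" group is a modeling choice.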

In [7]:
# Save complete dataset with all features
df.to_csv('data/processed/data_with_features.csv', index=False)

print(f"\n✅ FEATURE ENGINEERING COMPLETE")
print(f"   Original columns: 11")
print(f"   New features created: {len(df.columns) - 11}")
print(f"   Total columns now: {len(df.columns)}")
print(f"\n💾 Saved to: data/processed/data_with_features.csv")

# List all my created features
my_features = [
    'single_bidder_flag', 'weak_competition', 'competition_score',
    'price_vs_dept_avg', 'extreme_high_price', 'round_amount_flag', 'threshold_game',
    'dec_rush', 'march_rush', 'weekend_award', 'quarter_end',
    'dept_tender_volume', 'dept_single_bid_rate',
    'fraud_risk_score', 'risk_category'
]

print(f"\n📝 Features I engineered:")
for i, feat in enumerate(my_features, 1):
    print(f"{i}. {feat}")


✅ FEATURE ENGINEERING COMPLETE
   Original columns: 11
   New features created: 17
   Total columns now: 28

💾 Saved to: data/processed/data_with_features.csv

📝 Features I engineered:
1. single_bidder_flag
2. weak_competition
3. competition_score
4. price_vs_dept_avg
5. extreme_high_price
6. round_amount_flag
7. threshold_game
8. dec_rush
9. march_rush
10. weekend_award
11. quarter_end
12. dept_tender_volume
13. dept_single_bid_rate
14. fraud_risk_score
15. risk_category

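The summary cell hard-codes the original column count as 11 and reports 17 new columns, while the feature list names 15 (the helper column dept_avg_amount is one of the unlisted ones). A small sketch of deriving both numbers from the DataFrame itself, so the summary cannot drift from the data. It uses hypothetical toy data, and original_columns and new_features are illustrative names:

```python
import pandas as pd

# Hypothetical stand-in for the preprocessed data
toy = pd.DataFrame({
    "bidder_count": [1, 4],
    "contract_amount": [500_000, 1_234_567],
})

# Capture the column names before any features are added
original_columns = list(toy.columns)

toy["single_bidder_flag"] = (toy["bidder_count"] == 1).astype(int)
toy["round_amount_flag"] = (toy["contract_amount"] % 100_000 == 0).astype(int)

# Everything not present at load time was created during feature engineering
new_features = [c for c in toy.columns if c not in original_columns]

print(f"Original columns: {len(original_columns)}")
print(f"New features created: {len(new_features)}")
for i, feat in enumerate(new_features, 1):
    print(f"{i}. {feat}")
```

In the notebook, the equivalent would be to record list(df.columns) right after read_csv and build the printed list from the difference.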