# Ethereum Fraud Detection - Feature Analysis and Mapping

We will:
1. Look at each dataset separately
2. Map Ethereum features to standard fraud detection features
3. Create a standardized dataset with renamed features
4. Explain the relationship between standard and Ethereum features

In [1]:
import pandas as pd
import numpy as np

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Define file paths
TRANSACTION_FILE = '/Users/chery/Desktop/y2s2/IS4303/project code/Ethereum-Fraud-Detection/data/raw/transaction_dataset.csv'
FEATURES_FILE = '/Users/chery/Desktop/y2s2/IS4303/project code/Ethereum-Fraud-Detection/data/raw/eth_illicit_features.csv' 

## 1. Transaction Dataset Exploration

Let's look at the Ethereum transaction dataset:

In [2]:
# Load transaction dataset
trans_df = pd.read_csv(TRANSACTION_FILE)

print("=== Transaction Dataset ===")
print(f"Shape: {trans_df.shape} (rows, columns)")
print("Columns and their types:")
for col in sorted(trans_df.columns):
    print(f"- {col}: {trans_df[col].dtype}")

print("Sample of data (first 5 rows):")
print(trans_df.head())

print("Missing values:")
missing = trans_df.isnull().sum()
print(missing[missing > 0])  # Only show columns with missing values

FileNotFoundError: [Errno 2] No such file or directory: '/Users/chery/Desktop/y2s2/IS4303/project code/Ethereum-Fraud-Detection/data/raw/transaction_dataset.csv'

## 2. Features Dataset Exploration

Let's look at the Ethereum features dataset:

In [3]:
# Load features dataset
features_df = pd.read_csv(FEATURES_FILE)

print("=== Features Dataset ===")
print(f"Shape: {features_df.shape} (rows, columns)")
print("Columns and their types:")
for col in sorted(features_df.columns):
    print(f"- {col}: {features_df[col].dtype}")

print("Sample of data (first 5 rows):")
print(features_df.head())

print("Missing values:")
missing = features_df.isnull().sum()
print(missing[missing > 0])  # Only show columns with missing values

=== Features Dataset ===
Shape: (12146, 34) (rows, columns)
Columns and their types:
- activityDays: int64
- address: object
- avgTimeBetweenRecTnx: float64
- avgTimeBetweenSentTnx: float64
- avgValReceived: float64
- avgValSent: float64
- createdContracts: int64
- dailyMax: int64
- flag: int64
- giniRec: float64
- giniSent: float64
- lifetime: int64
- maxTimeBetweenRecTnx: float64
- maxTimeBetweenSentTnx: float64
- maxValReceived: float64
- maxValSent: float64
- minTimeBetweenRecTnx: float64
- minTimeBetweenSentTnx: float64
- minValReceived: float64
- minValSent: float64
- numUniqRecAddress: int64
- numUniqSentAddress: int64
- ratioRecSent: float64
- ratioRecTotal: float64
- ratioSentTotal: float64
- receivedTransactions: int64
- sentTransactions: int64
- stdBalanceEth: float64
- totalEtherBalance: float64
- totalEtherReceived: float64
- totalEtherSent: float64
- totalEtherSentContracts: float64
- totalTransactions: int64
- txFreq: float64
Sample of data (first 5 rows):
              

## 3. Feature Mapping

### Standard Fraud Detection Features vs Ethereum Features

1. **fraud_bool (Binary)**
   - Maps to: FLAG/flag
   - Purpose: Indicates fraudulent/non-fraudulent transactions
   - Value type: Binary

2. **income (Numeric)**
   - Maps to: total_ether_balance, total_ether_sent, total_ether_received
   - Purpose: Represents financial capacity/activity
   - Value type: Numeric

3. **name_email_similarity**
   - Not available in Ethereum data (transactions are pseudonymous)
   - Alternative: We can use address behavior patterns

4. **address_history**
   - Maps to: avg_time_between_sent_tx, avg_time_between_received_tx
   - Purpose: Captures account age and activity patterns
   - Value type: Numeric

5. **customer_age**
   - Not directly available in Ethereum data
   - Alternative: We can use account age and transaction patterns

6. **days_since_request**
   - Maps to: Time-based features (avgTimeBetweenSentTnx, avgTimeBetweenRecTnx)
   - Purpose: Captures temporal transaction patterns
   - Value type: Numeric

7. **transaction_amounts**
   - Maps to: min/max/avg_value_sent, min/max/avg_value_received
   - Purpose: Captures transaction value patterns
   - Value type: Numeric

8. **payment_type**
   - Maps to: Contract interactions vs normal transactions
   - Purpose: Distinguishes transaction types
   - Value type: Categorical

9. **address_activity**
   - Maps to: unique_sent_addresses, unique_received_addresses
   - Purpose: Captures interaction patterns
   - Value type: Numeric

10. **velocity**
    - Maps to: sent_transactions, received_transactions, total_transactions
    - Purpose: Captures transaction frequency
    - Value type: Numeric

In [4]:
# Define feature mappings
feature_mappings = {
    # Core fraud detection features
    'fraud_label': ['FLAG', 'flag'],
    'transaction_count': ['total transactions', 'totalTransactions'],
    'account_balance': ['total ether balance', 'totalEtherBalance'],
    
    # Transaction patterns
    'transaction_frequency_sent': ['Avg min between sent tnx', 'avgTimeBetweenSentTnx'],
    'transaction_frequency_received': ['Avg min between received tnx', 'avgTimeBetweenRecTnx'],
    'unique_contacts_sent': ['Unique Sent To Addresses', 'numUniqSentAddress'],
    'unique_contacts_received': ['Unique Received From Addresses', 'numUniqRecAddress'],
    
    # Transaction values
    'min_transaction_sent': ['min val sent', 'minValSent'],
    'max_transaction_sent': ['max val sent', 'maxValSent'],
    'avg_transaction_sent': ['avg val sent', 'avgValSent'],
    'min_transaction_received': ['min value received', 'minValReceived'],
    'max_transaction_received': ['max value received', 'maxValReceived'],
    'avg_transaction_received': ['avg val received', 'avgValReceived'],
    
    # Activity metrics
    'total_sent': ['total Ether sent', 'totalEtherSent'],
    'total_received': ['total ether received', 'totalEtherReceived'],
    'contract_interaction': ['total ether sent contracts', 'totalEtherSentContracts'],
    'contract_creation': ['Number of Created Contracts', 'createdContracts']
}

# Create standardized dataframes
trans_standardized = pd.DataFrame()
feat_standardized = pd.DataFrame()

# Map features with new standardized names
for new_col, [trans_col, feat_col] in feature_mappings.items():
    if trans_col in trans_df.columns and feat_col in features_df.columns:
        trans_standardized[new_col] = trans_df[trans_col]
        feat_standardized[new_col] = features_df[feat_col]

print("Standardized features available in both datasets:")
for col in sorted(trans_standardized.columns):
    print(f"- {col}")

Standardized features available in both datasets:
- account_balance
- avg_transaction_received
- avg_transaction_sent
- contract_creation
- contract_interaction
- fraud_label
- max_transaction_sent
- min_transaction_received
- min_transaction_sent
- total_received
- total_sent
- transaction_frequency_received
- transaction_frequency_sent
- unique_contacts_received
- unique_contacts_sent


## 4. Creating Final Dataset

Now we'll merge the datasets with standardized feature names that align with typical fraud detection features.

In [5]:
print("Dataset shapes before merging:")
print(f"Transaction dataset: {trans_standardized.shape}")
print(f"Features dataset: {feat_standardized.shape}")

# Merge datasets
merged_df = pd.concat([trans_standardized, feat_standardized], axis=0, ignore_index=True)

print("Final dataset shape:", merged_df.shape)
print("Final feature set:")
for col in sorted(merged_df.columns):
    print(f"- {col}")
    # Show basic statistics for each feature
    if merged_df[col].dtype in ['int64', 'float64']:
        stats = merged_df[col].describe()
        print(f"  * Type: {merged_df[col].dtype}")
        print(f"  * Range: {stats['min']:.2f} to {stats['max']:.2f}")
        print(f"  * Mean: {stats['mean']:.2f}")
    else:
        print(f"  * Type: {merged_df[col].dtype}")
        print(f"  * Unique values: {merged_df[col].nunique()}")
    print()

# Check for duplicates
duplicates = merged_df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# Save the final dataset
merged_df.to_csv('/Users/chery/Desktop/y2s2/IS4303/project code/Ethereum-Fraud-Detection/data/processed/merged_dataset.csv', index=False)
print("Saved standardized dataset to 'data/processed/merged_dataset.csv'")

Dataset shapes before merging:
Transaction dataset: (9841, 15)
Features dataset: (12146, 15)
Final dataset shape: (21987, 15)
Final feature set:
- account_balance
  * Type: float64
  * Range: -15605352.04 to 14288636.26
  * Mean: 585.71

- avg_transaction_received
  * Type: float64
  * Range: 0.00 to 283618.83
  * Mean: 73.83

- avg_transaction_sent
  * Type: float64
  * Range: 0.00 to 25533.61
  * Mean: 26.74

- contract_creation
  * Type: int64
  * Range: 0.00 to 9995.00
  * Mean: 2.64

- contract_interaction
  * Type: float64
  * Range: 0.00 to 39.00
  * Mean: 0.00

- fraud_label
  * Type: int64
  * Range: 0.00 to 1.00
  * Mean: 0.33

- max_transaction_sent
  * Type: float64
  * Range: 0.00 to 611102.01
  * Mean: 233.02

- min_transaction_received
  * Type: float64
  * Range: 0.00 to 25533.61
  * Mean: 27.20

- min_transaction_sent
  * Type: float64
  * Range: 0.00 to 25533.61
  * Mean: 5.60

- total_received
  * Type: float64
  * Range: 0.00 to 28581590.07
  * Mean: 6824.45

- tota

OSError: Cannot save file into a non-existent directory: 'Ethereum-Fraud-Detection/data/processed'