# DIFrauD Multilingual Data Quality Assessment
## Analyzing Language Diversity Impact on Classification Performance

**Course:** COSC 4371 Security Analytics - Fall 2025  
**Team Members:** Joseph Mascardo, Niket Gupta  

---

### Project Objective
Determine whether every sample in the DIFrauD dataset is in English, analyze the language distribution by class and domain, and measure how non-English samples affect classification performance.

### Research Questions
1. What is the language distribution across classes and domains in DIFrauD?
2. How does removing non-English samples affect classifier performance?
3. Do transformer-based models handle multilingual content better than traditional ML?

---

## External Sources and References

### Dataset
- **DIFrauD Dataset**: https://huggingface.co/datasets/difraud/difraud
- **Citation**: Boumber, D., et al. (2024). "Domain-Agnostic Adapter Architecture for Deception Detection." LREC-COLING 2024.

### Libraries Used
- **langdetect**: https://pypi.org/project/langdetect/ - Language detection (port of Google's language-detection)
- **datasets**: https://huggingface.co/docs/datasets/ - HuggingFace datasets library
- **transformers**: https://huggingface.co/docs/transformers/ - HuggingFace transformers for DistilBERT
- **scikit-learn**: https://scikit-learn.org/ - Traditional ML classifiers
- **pandas/numpy**: Data processing
- **matplotlib/seaborn**: Visualizations

### Key References
- Conneau, A., et al. (2020). "Unsupervised cross-lingual representation learning at scale." ACL 2020.
- Devlin, J., et al. (2019). "BERT: Pre-training of deep bidirectional transformers." NAACL 2019.
- Verma, R. M., et al. (2019). "Data quality for security challenges." ACM CCS 2019.

---
## 1. Environment Setup and Imports

**Steps taken:**
1. Install required packages
2. Import necessary libraries
3. Set random seeds for reproducibility

In [None]:
# Install required packages (run once)
# Source: Standard pip installation
!pip install -q datasets langdetect transformers torch scikit-learn pandas numpy matplotlib seaborn tqdm
!pip install -q spacy
!python -m spacy download en_core_web_sm -q

In [None]:
# Import libraries
import warnings
warnings.filterwarnings('ignore')

# Data handling
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
import time  # For training time tracking

# Dataset loading - Source: https://huggingface.co/docs/datasets/
from datasets import load_dataset, concatenate_datasets

# Language detection - Source: https://pypi.org/project/langdetect/
from langdetect import detect, detect_langs, LangDetectException

# spaCy for language detection validation - Source: https://spacy.io/
import spacy

# ML libraries - Source: https://scikit-learn.org/
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import (
    classification_report, confusion_matrix, f1_score, 
    precision_score, recall_score, accuracy_score,
    balanced_accuracy_score
)
from sklearn.preprocessing import LabelEncoder

# Deep Learning - Source: https://huggingface.co/docs/transformers/
import torch
from transformers import (
    DistilBertTokenizer, DistilBertForSequenceClassification,
    Trainer, TrainingArguments, EarlyStoppingCallback
)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Progress tracking
from tqdm import tqdm

# Statistical testing
from scipy import stats

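# Set random seeds for reproducibility
# langdetect is non-deterministic by default; seeding DetectorFactory makes
# repeated runs return the same result for short or ambiguous texts.
from langdetect import DetectorFactory

SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
DetectorFactory.seed = SEED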

print("All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

---
## 2. Load DIFrauD Dataset

**Steps taken:**
1. Load all 7 domains from HuggingFace
2. Combine train, validation, and test splits
3. Create unified DataFrame with domain labels

**Dataset Source:** https://huggingface.co/datasets/difraud/difraud
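
As a rough illustration of steps 2 and 3, the sketch below parses JSONL content line by line and tags each record with its domain and split before combining everything into one list. It uses only the standard library; `read_jsonl` and the toy records are hypothetical stand-ins, and the `text`/`label` field names are assumed to match the columns used later in this notebook.

```python
import io
import json

def read_jsonl(handle, domain, split):
    """Parse one JSONL stream and tag every record with its domain and split."""
    records = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        record["domain"] = domain
        record["split"] = split
        records.append(record)
    return records

# Toy stand-ins for two split files (contents are made up for illustration).
toy_files = {
    ("sms", "train"): '{"text": "Win a free prize now", "label": 1}\n'
                      '{"text": "See you at 5", "label": 0}\n',
    ("sms", "test"): '{"text": "Call me later", "label": 0}\n',
}

all_records = []
for (domain, split), content in toy_files.items():
    all_records.extend(read_jsonl(io.StringIO(content), domain, split))

print(len(all_records), all_records[0])
```

Keeping the `split` tag makes it possible to recover the dataset's original partition later, even after all splits are pooled.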

In [None]:
# Define all domains in DIFrauD dataset
DOMAINS = [
    'fake_news',
    'job_scams', 
    'phishing',
    'political_statements',
    'product_reviews',
    'sms',
    'twitter_rumours'
]

def load_difraud_dataset():
    """
    Load all domains from DIFrauD dataset by directly reading JSONL files.
    Source: HuggingFace datasets library
    Dataset: https://huggingface.co/datasets/difraud/difraud
    
    Note: The dataset uses legacy loading scripts no longer supported by datasets library,
    so we load directly from the JSONL files using data_files parameter.
    """
    all_data = []
    
    for domain in tqdm(DOMAINS, desc="Loading domains"):
        try:
            # Load directly from JSONL files using data_files parameter
            # This bypasses the deprecated loading script
            base_url = f"https://huggingface.co/datasets/difraud/difraud/resolve/main/{domain}"
            
            dataset = load_dataset(
                'json',
                data_files={
                    'train': f"{base_url}/train.jsonl",
                    'validation': f"{base_url}/validation.jsonl",
                    'test': f"{base_url}/test.jsonl"
                }
            )
            
            # Combine all splits
            for split in ['train', 'validation', 'test']:
                if split in dataset:
                    df_split = dataset[split].to_pandas()
                    df_split['domain'] = domain
                    df_split['split'] = split
                    all_data.append(df_split)
                    
        except Exception as e:
            print(f"Error loading {domain}: {e}")
    
    # Combine all data
    if len(all_data) == 0:
        raise ValueError("No data was loaded. Check dataset availability and internet connection.")
    
    df = pd.concat(all_data, ignore_index=True)
    return df

# Load the dataset
print("Loading DIFrauD dataset from HuggingFace...")
print("(Downloading JSONL files directly - this may take a few minutes)\n")
df = load_difraud_dataset()

print(f"\nDataset loaded successfully!")
print(f"Total samples: {len(df):,}")
print(f"Columns: {df.columns.tolist()}")

In [None]:
# Dataset overview
print("="*60)
print("DATASET OVERVIEW")
print("="*60)

print("\n--- Samples by Domain ---")
domain_counts = df.groupby('domain').agg({
    'text': 'count',
    'label': ['sum', 'mean']
}).round(3)
domain_counts.columns = ['Total', 'Deceptive', 'Deceptive_Ratio']
domain_counts['Non-Deceptive'] = domain_counts['Total'] - domain_counts['Deceptive']
print(domain_counts)

print("\n--- Overall Class Distribution ---")
print(f"Deceptive (label=1): {df['label'].sum():,} ({df['label'].mean()*100:.2f}%)")
print(f"Non-Deceptive (label=0): {(df['label']==0).sum():,} ({(1-df['label'].mean())*100:.2f}%)")

print("\n--- Sample Text Lengths ---")
df['text_length'] = df['text'].str.len()
print(df.groupby('domain')['text_length'].describe().round(1))

---
## 3. Language Detection Pipeline

**Steps taken:**
1. Implement language detection using `langdetect` library
2. Handle edge cases (short texts, detection errors)
3. Apply to all samples and record detected languages

**Source:** langdetect library - https://pypi.org/project/langdetect/  
**Note:** langdetect is a port of Google's language-detection library

In [None]:
def detect_language_safe(text, min_length=20):
    """
    Safely detect language of text with error handling.
    
    Source: langdetect library (https://pypi.org/project/langdetect/)
    
    Parameters:
    - text: Input text string
    - min_length: Minimum text length for reliable detection
    
    Returns:
    - Tuple of (detected_language_code, confidence_score)
    """
    if not isinstance(text, str) or len(text.strip()) < min_length:
        return ('unknown', 0.0)
    
    try:
        # Get language probabilities
        langs = detect_langs(text)
        # Return top language and its probability
        top_lang = langs[0]
        return (top_lang.lang, top_lang.prob)
    except LangDetectException:
        return ('unknown', 0.0)
    except Exception as e:
        return ('error', 0.0)

# Test the function
test_texts = [
    "This is a test message in English.",
    "Ceci est un message de test en français.",
    "Pathaya enketa maraikara pa",  # From SMS dataset (romanized Tamil, Latin script)
    "短文本"  # Short Chinese text (below min_length, so expected to return 'unknown')
]

print("Language Detection Test:")
for text in test_texts:
    lang, conf = detect_language_safe(text)
    print(f"  '{text[:40]}...' -> {lang} (conf: {conf:.2f})")

In [None]:
# Apply language detection to entire dataset
print("Detecting languages for all samples...")
print("(This may take several minutes)\n")

# Apply with progress bar
tqdm.pandas(desc="Detecting languages")
language_results = df['text'].progress_apply(detect_language_safe)

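# Extract language codes and confidence scores
df['detected_language'] = language_results.apply(lambda x: x[0])
df['language_confidence'] = language_results.apply(lambda x: x[1])

# Binary English flag, required by the validation cells in Sections 3.1 and 3.2.
# Samples tagged 'unknown' or 'error' count as non-English under this definition.
df['is_english'] = df['detected_language'] == 'en'

print("\nLanguage detection completed!")
print(f"Unique languages detected: {df['detected_language'].nunique()}")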

### 3.1 spaCy Language Detection and Cross-Validation

**Steps taken:**
1. Load the spaCy English pipeline (en_core_web_sm)
2. Score each text with a token-based heuristic (alphabetic tokens, POS tags, English stopwords), since spaCy has no built-in language identifier
3. Cross-validate between langdetect and spaCy detections
4. Calculate agreement metrics between the two methods
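
The agreement metric in step 4 is Cohen's kappa, which corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it by hand on made-up labels, as a way to see why a high raw agreement rate can still give a modest kappa when one class dominates. The function name and toy data are illustrative only; the notebook itself uses `sklearn.metrics.cohen_kappa_score`.

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: share of items on which both methods agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two methods' marginal label frequencies.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # both methods always emit the same single label
    return (p_o - p_e) / (1 - p_e)

# Toy flags (True = English) for ten texts; values are made up.
langdetect_flags = [True] * 8 + [False] * 2
spacy_flags = [True] * 7 + [False, False, True]

raw_agreement = sum(a == b for a, b in zip(langdetect_flags, spacy_flags)) / 10
kappa_toy = cohen_kappa(langdetect_flags, spacy_flags)
print(f"raw agreement: {raw_agreement:.2f}, kappa: {kappa_toy:.3f}")
```

Here the two methods agree on 8 of 10 texts, yet kappa is only 0.375, because both label 80% of texts as English and would agree often by chance alone.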

**Source:** spaCy - https://spacy.io/

**Note:** en_core_web_sm is trained on English text and does not identify languages itself, so the check below is a heuristic: it estimates how English-like a text is from the share of alphabetic tokens, tagged tokens, and English stopwords.
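
To make the scoring rule concrete, the sketch below reproduces the weighted sum used by `detect_english_spacy` (0.4 for recognition, 0.3 for parsing, and a stopword term capped at 0.3) and evaluates it on two made-up ratio triples. The input ratios are assumptions chosen for illustration, not measurements from the dataset.

```python
def composite_confidence(recognition_ratio, parse_ratio, stopword_ratio):
    """Weighted score mirroring the heuristic's formula."""
    return (recognition_ratio * 0.4
            + parse_ratio * 0.3
            + min(stopword_ratio * 2, 0.3))

# Hypothetical English-like text: mostly alphabetic tokens, many stopwords.
english_like = composite_confidence(0.95, 1.0, 0.45)

# Hypothetical text with alphabetic, tagged tokens but no English stopwords.
no_stopwords = composite_confidence(1.0, 1.0, 0.0)

print(round(english_like, 2), round(no_stopwords, 2))
```

The second case is worth noting: if the recognition and parse ratios are both close to 1, the score reaches 0.7 without any stopwords and still clears the 0.5 threshold. Whether that happens for real non-English text depends on how the model flags such tokens, so the disagreement analysis is the place to check it.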

In [None]:
# Load spaCy English model for language validation
# Source: https://spacy.io/models/en#en_core_web_sm

print("Loading spaCy English model (en_core_web_sm)...")
try:
    nlp = spacy.load("en_core_web_sm")
    print("spaCy model loaded successfully!")
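except OSError:
    # Model not installed: download it with the interpreter running this notebook
    print("Downloading en_core_web_sm model...")
    import subprocess
    import sys
    subprocess.run([sys.executable, "-m", "spacy", "download", "en_core_web_sm"], check=True)
    nlp = spacy.load("en_core_web_sm")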

def detect_english_spacy(text, min_length=20, sample_size=500):
    """
    Detect if text is English using spaCy's en_core_web_sm model.
    
    The approach:
    1. Process text with English model
    2. Check ratio of recognized tokens (tokens that are in English vocabulary)
    3. High ratio of recognized tokens = likely English
    
    Source: spaCy documentation - https://spacy.io/
    
    Parameters:
    - text: Input text string
    - min_length: Minimum text length for reliable detection
    - sample_size: Max characters to process (for efficiency)
    
    Returns:
    - Tuple of (is_english: bool, confidence: float, details: dict)
    """
    if not isinstance(text, str) or len(text.strip()) < min_length:
        return (False, 0.0, {'reason': 'text_too_short'})
    
    try:
        # Sample text for efficiency on long documents
        text_sample = text[:sample_size] if len(text) > sample_size else text
        
        # Process with spaCy
        doc = nlp(text_sample)
        
        # Count tokens and analyze
        total_tokens = len([t for t in doc if not t.is_space and not t.is_punct])
        
        if total_tokens == 0:
            return (False, 0.0, {'reason': 'no_tokens'})
        
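        # Count alphabetic tokens that are recognized (in vocabulary or have vectors).
        # Caveat: en_core_web_sm ships without static word vectors, so these flags
        # may carry little language signal; inspect the disagreement examples.
        recognized_tokens = sum(1 for t in doc if not t.is_space and not t.is_punct and 
                               (t.is_alpha and (not t.is_oov or t.has_vector)))
        
        # Count tokens with a POS tag. Punctuation is excluded so the count shares
        # the denominator above and the ratio stays within [0, 1].
        parsed_tokens = sum(1 for t in doc if not t.is_space and not t.is_punct and t.pos_ != '')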
        
        # Count stopwords (English stopwords indicate English text)
        stopword_count = sum(1 for t in doc if t.is_stop)
        
        # Calculate metrics
        recognition_ratio = recognized_tokens / total_tokens if total_tokens > 0 else 0
        parse_ratio = parsed_tokens / total_tokens if total_tokens > 0 else 0
        stopword_ratio = stopword_count / total_tokens if total_tokens > 0 else 0
        
        # Composite confidence score
        # High recognition + parsing + stopwords = likely English
        confidence = (recognition_ratio * 0.4 + parse_ratio * 0.3 + min(stopword_ratio * 2, 0.3))
        
        # Threshold for English detection
        is_english = confidence > 0.5
        
        details = {
            'total_tokens': total_tokens,
            'recognized_tokens': recognized_tokens,
            'recognition_ratio': round(recognition_ratio, 3),
            'stopword_ratio': round(stopword_ratio, 3),
            'parse_ratio': round(parse_ratio, 3)
        }
        
        return (is_english, round(confidence, 3), details)
        
    except Exception as e:
        return (False, 0.0, {'reason': f'error: {str(e)}'})

# Test spaCy detection
print("\nspaCy English Detection Test:")
for text in test_texts:
    is_eng, conf, details = detect_english_spacy(text)
    print(f"  '{text[:40]}...' -> English={is_eng} (conf: {conf:.2f})")

In [None]:
# Apply spaCy English detection to the dataset and cross-validate with langdetect
# Note: Processing entire dataset with spaCy is slow, so we use sampling for validation

print("="*60)
print("CROSS-VALIDATION: langdetect vs spaCy")
print("="*60)

# Sample for cross-validation (processing full dataset with spaCy is computationally expensive)
CROSS_VAL_SAMPLE_SIZE = 2000
np.random.seed(SEED)
sample_indices = np.random.choice(len(df), min(CROSS_VAL_SAMPLE_SIZE, len(df)), replace=False)
df_sample = df.iloc[sample_indices].copy()

print(f"\nCross-validating on {len(df_sample)} samples...")

# Apply spaCy detection to sample
tqdm.pandas(desc="spaCy detection")
spacy_results = df_sample['text'].progress_apply(detect_english_spacy)

df_sample['spacy_is_english'] = spacy_results.apply(lambda x: x[0])
df_sample['spacy_confidence'] = spacy_results.apply(lambda x: x[1])

# Compare results
# langdetect says English (is_english == True) vs spaCy says English (spacy_is_english == True)
agreement = (df_sample['is_english'] == df_sample['spacy_is_english']).mean()
print(f"\n--- Agreement Metrics ---")
print(f"Overall agreement rate: {agreement*100:.2f}%")

# Confusion matrix between the two methods
cross_tab = pd.crosstab(
    df_sample['is_english'].map({True: 'langdetect: English', False: 'langdetect: Non-English'}),
    df_sample['spacy_is_english'].map({True: 'spaCy: English', False: 'spaCy: Non-English'})
)
print("\nCross-tabulation (langdetect vs spaCy):")
print(cross_tab)

# Calculate Cohen's Kappa for inter-rater agreement
from sklearn.metrics import cohen_kappa_score
kappa = cohen_kappa_score(df_sample['is_english'], df_sample['spacy_is_english'])
print(f"\nCohen's Kappa (inter-method agreement): {kappa:.4f}")
# Interpretation bands follow Landis & Koch (1977)
print("  Interpretation: ", end="")
if kappa < 0:
    print("Poor agreement (below chance)")
elif kappa <= 0.20:
    print("Slight agreement")
elif kappa <= 0.40:
    print("Fair agreement")
elif kappa <= 0.60:
    print("Moderate agreement")
elif kappa <= 0.80:
    print("Substantial agreement")
else:
    print("Almost perfect agreement")

# Analyze disagreements
disagreements = df_sample[df_sample['is_english'] != df_sample['spacy_is_english']]
print(f"\n--- Disagreement Analysis ---")
print(f"Total disagreements: {len(disagreements)} ({len(disagreements)/len(df_sample)*100:.2f}%)")

if len(disagreements) > 0:
    # langdetect says English, spaCy says Non-English
    ld_eng_sp_non = disagreements[(disagreements['is_english'] == True) & (disagreements['spacy_is_english'] == False)]
    print(f"  langdetect=English, spaCy=Non-English: {len(ld_eng_sp_non)}")
    
    # langdetect says Non-English, spaCy says English
    ld_non_sp_eng = disagreements[(disagreements['is_english'] == False) & (disagreements['spacy_is_english'] == True)]
    print(f"  langdetect=Non-English, spaCy=English: {len(ld_non_sp_eng)}")
    
    # Show examples of disagreements
    print("\nExamples of disagreements:")
    for i, (idx, row) in enumerate(disagreements.head(3).iterrows()):
        print(f"\n  Example {i+1} (Domain: {row['domain']}):")
        print(f"    Text: '{row['text'][:100]}...'")
        print(f"    langdetect: {row['detected_language']} (conf: {row['language_confidence']:.2f})")
        print(f"    spaCy English: {row['spacy_is_english']} (conf: {row['spacy_confidence']:.2f})")

### 3.2 Manual Validation of Language Detection

**Purpose:** Document the manual validation process for language detection results.

This section provides a framework for manually validating a sample of language detection results 
to assess the accuracy of automated detection methods.

**Methodology:**
1. Randomly sample 100-500 instances from the dataset
2. Manually review each sample's detected language
3. Record agreement/disagreement with automated detection
4. Calculate validation metrics (accuracy, error types)
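
Once reviewers have filled in their assessments, step 4 can be sketched roughly as follows. The `assessment` field is a hypothetical column that a reviewer would add to the exported CSV; the rows below are made up. The agreement rate follows the definition used in this section (CORRECT assessments divided by all assessments).

```python
from collections import Counter

def summarize_manual_validation(rows):
    """Return (agreement_rate, assessment_counts, incorrect_by_domain)."""
    total = len(rows)
    counts = Counter(r["assessment"] for r in rows)
    # Agreement rate = CORRECT assessments / total assessments.
    agreement_rate = counts["CORRECT"] / total if total else 0.0
    # Error analysis: where do the INCORRECT detections concentrate?
    incorrect_by_domain = Counter(
        r["domain"] for r in rows if r["assessment"] == "INCORRECT"
    )
    return agreement_rate, counts, incorrect_by_domain

# Toy review results (made up for illustration).
reviewed = [
    {"domain": "sms", "detected_language": "en", "assessment": "CORRECT"},
    {"domain": "sms", "detected_language": "so", "assessment": "INCORRECT"},
    {"domain": "phishing", "detected_language": "en", "assessment": "CORRECT"},
    {"domain": "phishing", "detected_language": "en", "assessment": "CORRECT"},
    {"domain": "sms", "detected_language": "tl", "assessment": "UNCERTAIN"},
]

rate, counts, errors = summarize_manual_validation(reviewed)
print(f"agreement rate: {rate:.2f}", dict(counts), dict(errors))
```

Because UNCERTAIN rows stay in the denominator, this rate is a conservative estimate; reporting it alongside the rate over decided rows only would show how much the ambiguous cases matter.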

In [None]:
# Manual Validation Framework for Language Detection
# This cell generates a sample for manual validation and documents the process

print("="*80)
print("MANUAL VALIDATION SAMPLE")
print("="*80)

# Generate stratified sample for manual validation
MANUAL_VAL_SIZE = 200  # Sample size (100-500 recommended)
np.random.seed(SEED)

# Stratified sampling: ensure representation from each domain and both language classes
manual_val_samples = []

for domain in DOMAINS:
    domain_data = df[df['domain'] == domain]
    
    # Sample from English and Non-English
    english_samples = domain_data[domain_data['is_english'] == True]
    non_english_samples = domain_data[domain_data['is_english'] == False]
    
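    # Equal allocation: the same quota for every domain, split evenly between
    # English and non-English. Domains with few non-English samples fall short,
    # so the final sample can be smaller than MANUAL_VAL_SIZE.
    n_per_domain = MANUAL_VAL_SIZE // len(DOMAINS)
    n_english = min(len(english_samples), n_per_domain // 2)
    n_non_english = min(len(non_english_samples), n_per_domain // 2)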
    
    if n_english > 0:
        manual_val_samples.append(english_samples.sample(n=n_english, random_state=SEED))
    if n_non_english > 0:
        manual_val_samples.append(non_english_samples.sample(n=n_non_english, random_state=SEED))

manual_val_df = pd.concat(manual_val_samples, ignore_index=True)
print(f"Manual validation sample size: {len(manual_val_df)}")
print(f"Domains represented: {manual_val_df['domain'].nunique()}")

# Display sample distribution
print("\n--- Sample Distribution ---")
print(manual_val_df.groupby(['domain', 'is_english']).size().unstack(fill_value=0))

# Generate validation template
print("\n--- Manual Validation Template ---")
print("For each sample below, verify if the detected language is correct.")
print("Record your assessment as: CORRECT, INCORRECT, or UNCERTAIN\n")

# Show sample entries for manual review
print("Sample entries for manual validation:")
print("-" * 80)

for i, (idx, row) in enumerate(manual_val_df.head(10).iterrows()):
    print(f"\n[Sample {i+1}] Domain: {row['domain']}")
    print(f"  Text: '{row['text'][:150]}...'")
    print(f"  Detected: {row['detected_language']} (confidence: {row['language_confidence']:.2f})")
    print(f"  Auto-classified as English: {row['is_english']}")
    print(f"  Manual Assessment: _________ (CORRECT / INCORRECT / UNCERTAIN)")

# Save validation sample to CSV for offline manual review
manual_val_df[['text', 'label', 'domain', 'detected_language', 'language_confidence', 'is_english']].to_csv(
    'manual_validation_sample.csv', index=False
)
print("\n" + "-" * 80)
print(f"\nFull validation sample saved to 'manual_validation_sample.csv'")
print("Use this file for systematic manual validation of language detection results.")

# Document validation process
print("\n" + "="*80)
print("MANUAL VALIDATION DOCUMENTATION")
print("="*80)
print("""
VALIDATION PROCESS:
1. Sample Selection: Stratified sample of {size} instances across all domains
2. Review Criteria:
   - CORRECT: Detected language matches actual language of text
   - INCORRECT: Detected language does not match actual language
   - UNCERTAIN: Text is ambiguous, code-mixed, or too short to determine

EXPECTED OUTCOMES:
- Calculate agreement rate between automated and manual detection
- Identify systematic errors (e.g., specific languages misclassified)
- Estimate precision and recall of English detection

VALIDATION METRICS TO CALCULATE:
- Manual agreement rate = (CORRECT assessments) / (total assessments)
- Error analysis: distribution of INCORRECT by domain and detected language
- Confidence calibration: compare detection confidence vs accuracy

NOTE: After completing manual validation, update this section with results.
""".format(size=len(manual_val_df)))

---
## 4. Language Distribution Analysis

**Steps taken:**
1. Calculate language distribution overall
2. Analyze by class (deceptive vs non-deceptive)
3. Analyze by domain
4. Perform chi-square tests for significance
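
For a 2x2 table such as class versus English/non-English, the chi-square statistic in step 4 can be computed by hand, which helps when checking the library output. The sketch below is a standard-library illustration with made-up counts. One detail to keep in mind: `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so its statistic corresponds to `yates=True` here. The p-value uses the identity for one degree of freedom, p = erfc(sqrt(chi2 / 2)).

```python
import math

def chi_square_2x2(table, yates=False):
    """Chi-square test of independence for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)  # continuity correction
            chi2 += diff ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Toy counts: rows = class (non-deceptive, deceptive), cols = (English, non-English).
toy_table = [[90, 10], [60, 40]]

chi2_plain, p_plain = chi_square_2x2(toy_table)
chi2_yates, p_yates = chi_square_2x2(toy_table, yates=True)
print(f"uncorrected: {chi2_plain:.3f} (p={p_plain:.2e})")
print(f"with Yates:  {chi2_yates:.3f} (p={p_yates:.2e})")
```

With very large samples, even a small difference in language share between classes yields a tiny p-value, so the percentages within each class are the more informative quantity to report.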

In [None]:
# Overall language distribution
print("="*60)
print("OVERALL LANGUAGE DISTRIBUTION")
print("="*60)

lang_counts = df['detected_language'].value_counts()
lang_percentages = df['detected_language'].value_counts(normalize=True) * 100

lang_summary = pd.DataFrame({
    'Count': lang_counts,
    'Percentage': lang_percentages.round(2)
})
print(lang_summary.head(15))

# English vs Non-English ('unknown' and 'error' detections count as non-English)
df['is_english'] = df['detected_language'] == 'en'
print(f"\n--- English vs Non-English ---")
print(f"English samples: {df['is_english'].sum():,} ({df['is_english'].mean()*100:.2f}%)")
print(f"Non-English samples: {(~df['is_english']).sum():,} ({(~df['is_english']).mean()*100:.2f}%)")

In [None]:
# Language distribution by CLASS (deceptive vs non-deceptive)
print("="*60)
print("LANGUAGE DISTRIBUTION BY CLASS")
print("="*60)

class_lang_dist = pd.crosstab(
    df['label'].map({0: 'Non-Deceptive', 1: 'Deceptive'}),
    df['is_english'].map({True: 'English', False: 'Non-English'}),
    margins=True
)
print("\nCounts:")
print(class_lang_dist)

# Percentages within each class
class_lang_pct = pd.crosstab(
    df['label'].map({0: 'Non-Deceptive', 1: 'Deceptive'}),
    df['is_english'].map({True: 'English', False: 'Non-English'}),
    normalize='index'
) * 100
print("\nPercentages (within each class):")
print(class_lang_pct.round(2))

# Chi-square test for class vs language
contingency = pd.crosstab(df['label'], df['is_english'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
print(f"\nChi-square test (Class vs Language):")
print(f"  Chi-square statistic: {chi2:.4f}")
print(f"  p-value: {p_value:.4e}")
print(f"  Significant (p < 0.05): {'Yes' if p_value < 0.05 else 'No'}")

In [None]:
# Language distribution by DOMAIN
print("="*60)
print("LANGUAGE DISTRIBUTION BY DOMAIN")
print("="*60)

domain_lang_analysis = []

for domain in DOMAINS:
    domain_df = df[df['domain'] == domain]
    
    total = len(domain_df)
    english = domain_df['is_english'].sum()
    non_english = total - english
    
    # Top non-English languages
    non_eng_langs = domain_df[~domain_df['is_english']]['detected_language'].value_counts().head(3)
    top_non_eng = ', '.join([f"{lang}({cnt})" for lang, cnt in non_eng_langs.items()])
    
    domain_lang_analysis.append({
        'Domain': domain,
        'Total': total,
        'English': english,
        'Non-English': non_english,
        'English %': (english/total*100),
        'Non-English %': (non_english/total*100),
        'Top Non-English Languages': top_non_eng
    })

domain_lang_df = pd.DataFrame(domain_lang_analysis)
print(domain_lang_df.to_string(index=False))

In [None]:
# Detailed breakdown: Language distribution by Domain AND Class
print("="*60)
print("LANGUAGE DISTRIBUTION BY DOMAIN AND CLASS")
print("="*60)

detailed_analysis = []

for domain in DOMAINS:
    for label in [0, 1]:
        subset = df[(df['domain'] == domain) & (df['label'] == label)]
        
        if len(subset) == 0:
            continue
            
        total = len(subset)
        english = subset['is_english'].sum()
        
        # Get top 5 detected languages
        lang_dist = subset['detected_language'].value_counts().head(5).to_dict()
        
        detailed_analysis.append({
            'Domain': domain,
            'Class': 'Deceptive' if label == 1 else 'Non-Deceptive',
            'Total': total,
            'English': english,
            'English %': round(english/total*100, 2),
            'Non-English': total - english,
            'Non-English %': round((total-english)/total*100, 2),
            'Languages': lang_dist
        })

detailed_df = pd.DataFrame(detailed_analysis)
print(detailed_df[['Domain', 'Class', 'Total', 'English', 'English %', 'Non-English', 'Non-English %']].to_string(index=False))

In [None]:
# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Overall language distribution (top 10)
ax1 = axes[0, 0]
top_langs = df['detected_language'].value_counts().head(10)
colors = ['green' if lang == 'en' else 'coral' for lang in top_langs.index]
top_langs.plot(kind='bar', ax=ax1, color=colors)
ax1.set_title('Top 10 Detected Languages', fontsize=12)
ax1.set_xlabel('Language Code')
ax1.set_ylabel('Count')
ax1.tick_params(axis='x', rotation=45)

# Plot 2: English vs Non-English by domain
ax2 = axes[0, 1]
domain_lang_pivot = df.groupby('domain')['is_english'].agg(['sum', 'count'])
domain_lang_pivot['non_english'] = domain_lang_pivot['count'] - domain_lang_pivot['sum']
domain_lang_pivot[['sum', 'non_english']].plot(kind='bar', stacked=True, ax=ax2, 
                                                color=['green', 'coral'])
ax2.set_title('English vs Non-English by Domain', fontsize=12)
ax2.set_xlabel('Domain')
ax2.set_ylabel('Count')
ax2.legend(['English', 'Non-English'])
ax2.tick_params(axis='x', rotation=45)

# Plot 3: Non-English percentage by domain
ax3 = axes[1, 0]
non_eng_pct = domain_lang_df.set_index('Domain')['Non-English %']
non_eng_pct.plot(kind='bar', ax=ax3, color='coral')
ax3.set_title('Non-English Percentage by Domain', fontsize=12)
ax3.set_xlabel('Domain')
ax3.set_ylabel('Non-English %')
ax3.tick_params(axis='x', rotation=45)
ax3.axhline(y=non_eng_pct.mean(), color='red', linestyle='--', label=f'Mean: {non_eng_pct.mean():.1f}%')
ax3.legend()

# Plot 4: Language distribution by class
ax4 = axes[1, 1]
class_lang_pct.plot(kind='bar', ax=ax4, color=['green', 'coral'])
ax4.set_title('Language Distribution by Class', fontsize=12)
ax4.set_xlabel('Class')
ax4.set_ylabel('Percentage')
ax4.legend(['English', 'Non-English'])
ax4.tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.savefig('language_distribution_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nVisualization saved as 'language_distribution_analysis.png'")

---
## 5. Create Dataset Splits (English-only vs Full)

**Steps taken:**
1. Create filtered English-only dataset
2. Create full multilingual dataset
3. Split each version with the same stratified procedure and random seed (the two test sets are drawn separately, so they do not contain the same samples)
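
Because `prepare_train_test_data` is called separately on `df_full` and `df_english`, the two versions end up with different test samples. One alternative, shown here only as a sketch and not what this notebook does, is to split once on the full data and then filter each side by the English flag, so the English-only test set is a subset of the full test set. `stratified_split` and the toy arrays are hypothetical.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), preserving the label ratio in each part."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy data: 20 samples, balanced labels, every fifth sample non-English.
labels = [0] * 10 + [1] * 10
is_english = [i % 5 != 0 for i in range(20)]

# Split once on the full data ...
train_full, test_full = stratified_split(labels)
# ... then derive the English-only version inside each partition.
train_eng = [i for i in train_full if is_english[i]]
test_eng = [i for i in test_full if is_english[i]]

print(len(train_full), len(test_full), len(train_eng), len(test_eng))
```

With a shared partition, a model trained on either version can also be scored on the same English test samples, which separates the effect of the training data from the effect of the test data.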

In [None]:
# Create English-only and Full datasets
print("Creating dataset versions...\n")

# Full dataset (all languages)
df_full = df.copy()

# English-only dataset
df_english = df[df['is_english'] == True].copy()

print(f"Full dataset: {len(df_full):,} samples")
print(f"English-only dataset: {len(df_english):,} samples")
print(f"Samples removed: {len(df_full) - len(df_english):,} ({(1 - len(df_english)/len(df_full))*100:.2f}%)")

# Compare class distribution
print("\n--- Class Distribution Comparison ---")
print(f"Full - Deceptive: {df_full['label'].mean()*100:.2f}%")
print(f"English-only - Deceptive: {df_english['label'].mean()*100:.2f}%")

In [None]:
def prepare_train_test_data(df, test_size=0.2, random_state=42):
    """
    Prepare stratified train/test splits.
    Uses stratification to handle class imbalance.
    
    Source: scikit-learn train_test_split
    https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
    """
    X = df['text'].values
    y = df['label'].values
    domains = df['domain'].values
    
    X_train, X_test, y_train, y_test, domains_train, domains_test = train_test_split(
        X, y, domains,
        test_size=test_size,
        random_state=random_state,
        stratify=y
    )
    
    return X_train, X_test, y_train, y_test, domains_train, domains_test

# Prepare data for both versions
print("Preparing train/test splits...\n")

# Full dataset
X_train_full, X_test_full, y_train_full, y_test_full, domains_train_full, domains_test_full = \
    prepare_train_test_data(df_full)

# English-only dataset
X_train_eng, X_test_eng, y_train_eng, y_test_eng, domains_train_eng, domains_test_eng = \
    prepare_train_test_data(df_english)

print("Full Dataset:")
print(f"  Train: {len(X_train_full):,} | Test: {len(X_test_full):,}")
print(f"  Train class dist: {np.mean(y_train_full)*100:.2f}% deceptive")

print("\nEnglish-only Dataset:")
print(f"  Train: {len(X_train_eng):,} | Test: {len(X_test_eng):,}")
print(f"  Train class dist: {np.mean(y_train_eng)*100:.2f}% deceptive")

---
## 6. Traditional ML Classifiers (Random Forest & SVM)

**Steps taken:**
1. Create TF-IDF features
2. Train Random Forest and SVM classifiers
3. Evaluate on both dataset versions
4. Use F1-score as primary metric (suitable for imbalanced data)
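
To see how the two F1 averages in step 4 differ on imbalanced data, the sketch below computes per-class F1 by hand on made-up predictions. Macro F1 is the unweighted mean over classes, while weighted F1 weights each class by its support. The helper names are illustrative; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_and_weighted_f1(y_true, y_pred):
    """Return (macro F1, support-weighted F1)."""
    labels = sorted(set(y_true))
    scores = {l: per_class_f1(y_true, y_pred, l) for l in labels}
    macro = sum(scores.values()) / len(labels)
    weighted = sum(scores[l] * y_true.count(l) for l in labels) / len(y_true)
    return macro, weighted

# Toy imbalanced case: 8 non-deceptive (0) and 2 deceptive (1) samples.
y_true_toy = [0] * 8 + [1] * 2
y_pred_toy = [0] * 7 + [1] + [1, 0]

macro_f1, weighted_f1 = macro_and_weighted_f1(y_true_toy, y_pred_toy)
print(f"macro F1: {macro_f1:.4f}, weighted F1: {weighted_f1:.4f}")
```

The majority class scores 0.875 and the minority class 0.5, so the weighted average (0.80) looks considerably better than the macro average (0.6875). On imbalanced domains, macro F1 is the number that exposes weak minority-class performance.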

**Source:** scikit-learn - https://scikit-learn.org/
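
The TF-IDF weighting configured in the next cell can be approximated by hand as follows. This is a simplified sketch under stated assumptions: whitespace tokenization, unigrams only, and no `min_df`/`max_df` filtering. The formulas follow scikit-learn's documented defaults combined with `sublinear_tf=True`: tf = 1 + ln(count), idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization per document. The function name and documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (list of {term: weight} dicts, idf dict) for the given documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Sublinear term frequency dampens the effect of repeated terms.
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors, idf

toy_docs = ["free prize free", "call me now", "free call"]
toy_vectors, toy_idf = tfidf_vectors(toy_docs)

for doc, vec in zip(toy_docs, toy_vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

Rare terms such as "prize" receive a higher idf than terms shared across documents such as "free". In a multilingual corpus, this is one reason non-English tokens can end up with large weights despite appearing in few samples, provided they pass the `min_df` cutoff.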

In [None]:
def create_tfidf_features(X_train, X_test, max_features=10000):
    """
    Create TF-IDF features from text data.
    
    Source: scikit-learn TfidfVectorizer
    https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
    """
    vectorizer = TfidfVectorizer(
        max_features=max_features,
        ngram_range=(1, 2),  # Unigrams and bigrams
        min_df=2,           # Minimum document frequency
        max_df=0.95,        # Maximum document frequency
        sublinear_tf=True   # Apply sublinear tf scaling
    )
    
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)
    
    return X_train_tfidf, X_test_tfidf, vectorizer

print("Creating TF-IDF features...\n")

# Full dataset features
X_train_full_tfidf, X_test_full_tfidf, vectorizer_full = \
    create_tfidf_features(X_train_full, X_test_full)
print(f"Full dataset - TF-IDF shape: {X_train_full_tfidf.shape}")

# English-only features
X_train_eng_tfidf, X_test_eng_tfidf, vectorizer_eng = \
    create_tfidf_features(X_train_eng, X_test_eng)
print(f"English-only - TF-IDF shape: {X_train_eng_tfidf.shape}")

In [None]:
def train_and_evaluate_classifier(clf, X_train, X_test, y_train, y_test, clf_name, dataset_name):
    """
    Train classifier and return evaluation metrics.
    
    Uses metrics suitable for imbalanced datasets:
    - F1-Score (weighted and macro)
    - Balanced Accuracy
    - Precision and Recall
    
    Source: scikit-learn metrics
    """
    print(f"\nTraining {clf_name} on {dataset_name}...")
    clf.fit(X_train, y_train)
    
    # Predictions
    y_pred = clf.predict(X_test)
    
    # Calculate metrics
    metrics = {
        'Classifier': clf_name,
        'Dataset': dataset_name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Balanced_Accuracy': balanced_accuracy_score(y_test, y_pred),
        'F1_Weighted': f1_score(y_test, y_pred, average='weighted'),
        'F1_Macro': f1_score(y_test, y_pred, average='macro'),
        'Precision_Weighted': precision_score(y_test, y_pred, average='weighted'),
        'Recall_Weighted': recall_score(y_test, y_pred, average='weighted')
    }
    
    print(f"  F1 (weighted): {metrics['F1_Weighted']:.4f}")
    print(f"  F1 (macro): {metrics['F1_Macro']:.4f}")
    print(f"  Balanced Accuracy: {metrics['Balanced_Accuracy']:.4f}")
    
    return metrics, y_pred, clf

In [None]:
# Train Random Forest with timing
# Source: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

print("="*60)
print("RANDOM FOREST CLASSIFIER")
print("="*60)

rf_results = []
training_times = {}  # Track training times for all models

# Random Forest on Full Dataset
rf_full = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    min_samples_split=2,
    class_weight='balanced',  # Handle class imbalance
    random_state=SEED,
    n_jobs=-1
)

start_time = time.time()
metrics_rf_full, pred_rf_full, _ = train_and_evaluate_classifier(
    rf_full, X_train_full_tfidf, X_test_full_tfidf,
    y_train_full, y_test_full,
    'Random Forest', 'Full (Multilingual)'
)
training_times['RF_Full'] = time.time() - start_time
print(f"  Training time: {training_times['RF_Full']:.2f} seconds")
rf_results.append(metrics_rf_full)

# Random Forest on English-only Dataset
rf_eng = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    min_samples_split=2,
    class_weight='balanced',
    random_state=SEED,
    n_jobs=-1
)

start_time = time.time()
metrics_rf_eng, pred_rf_eng, _ = train_and_evaluate_classifier(
    rf_eng, X_train_eng_tfidf, X_test_eng_tfidf,
    y_train_eng, y_test_eng,
    'Random Forest', 'English-only'
)
training_times['RF_English'] = time.time() - start_time
print(f"  Training time: {training_times['RF_English']:.2f} seconds")
rf_results.append(metrics_rf_eng)
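
Both forests use `class_weight='balanced'`, which reweights classes inversely to their frequency: `n_samples / (n_classes * count_per_class)`. A small sketch of the resulting weights on a made-up 8:2 label vector (the `toy_*` names are hypothetical; DIFrauD's actual class ratio is not assumed here):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 8 negatives, 2 positives
toy_labels = np.array([0] * 8 + [1] * 2)

toy_weights = compute_class_weight(
    class_weight="balanced",
    classes=np.array([0, 1]),
    y=toy_labels,
)
print(toy_weights.tolist())  # [0.625, 2.5]
```

The minority class receives four times the weight of the majority class, mirroring the 8:2 imbalance.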

In [None]:
# Train SVM (Support Vector Machine) using LinearSVC with timing
# Source: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
# Note: LinearSVC is faster than SVC with kernel='linear' for large sparse datasets like TF-IDF

print("="*60)
print("SVM CLASSIFIER (LinearSVC)")
print("="*60)

svm_results = []

# LinearSVC on Full Dataset
svm_full = LinearSVC(
    C=1.0,
    class_weight='balanced',
    max_iter=10000,
    random_state=SEED
)

start_time = time.time()
metrics_svm_full, pred_svm_full, _ = train_and_evaluate_classifier(
    svm_full, X_train_full_tfidf, X_test_full_tfidf,
    y_train_full, y_test_full,
    'SVM (LinearSVC)', 'Full (Multilingual)'
)
training_times['SVM_Full'] = time.time() - start_time
print(f"  Training time: {training_times['SVM_Full']:.2f} seconds")
svm_results.append(metrics_svm_full)

# LinearSVC on English-only Dataset
svm_eng = LinearSVC(
    C=1.0,
    class_weight='balanced',
    max_iter=10000,
    random_state=SEED
)

start_time = time.time()
metrics_svm_eng, pred_svm_eng, _ = train_and_evaluate_classifier(
    svm_eng, X_train_eng_tfidf, X_test_eng_tfidf,
    y_train_eng, y_test_eng,
    'SVM (LinearSVC)', 'English-only'
)
training_times['SVM_English'] = time.time() - start_time
print(f"  Training time: {training_times['SVM_English']:.2f} seconds")
svm_results.append(metrics_svm_eng)
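
The timings above wrap `train_and_evaluate_classifier`, so they include prediction and metric computation as well as fitting. If fit-only timings are wanted, one option is a small wrapper like the sketch below; `timed_fit` and the synthetic data are hypothetical and not part of the notebook.

```python
import time

import numpy as np
from sklearn.svm import LinearSVC


def timed_fit(clf, X, y):
    """Hypothetical helper: fit clf and return (fitted clf, seconds spent in fit only)."""
    start = time.perf_counter()
    clf.fit(X, y)
    return clf, time.perf_counter() - start


# Synthetic, linearly separable toy data for illustration
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 5))
toy_y = (toy_X[:, 0] > 0).astype(int)

toy_clf, fit_seconds = timed_fit(
    LinearSVC(C=1.0, max_iter=10000, random_state=0), toy_X, toy_y
)
print(f"fit took {fit_seconds:.4f} s")
```

`time.perf_counter()` is used here because it is a monotonic, high-resolution clock intended for measuring short intervals.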

---
## 7. Transformer-Based Classifier (DistilBERT)

**Steps taken:**
1. Load pretrained DistilBERT model and tokenizer
2. Fine-tune on both dataset versions
3. Evaluate performance

**Source:** HuggingFace Transformers - https://huggingface.co/docs/transformers/  
**Model:** distilbert-base-uncased - https://huggingface.co/distilbert-base-uncased

In [None]:
# DistilBERT Dataset Class
class FraudDataset(torch.utils.data.Dataset):
    """
    Custom PyTorch Dataset for fraud detection.
    Source: PyTorch Dataset API
    """
    def __init__(self, texts, labels, tokenizer, max_length=256):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

def compute_metrics(eval_pred):
    """
    Compute metrics for HuggingFace Trainer.
    Uses metrics suitable for imbalanced data.
    """
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    return {
        'accuracy': accuracy_score(labels, predictions),
        'balanced_accuracy': balanced_accuracy_score(labels, predictions),
        'f1_weighted': f1_score(labels, predictions, average='weighted'),
        'f1_macro': f1_score(labels, predictions, average='macro'),
        'precision': precision_score(labels, predictions, average='weighted'),
        'recall': recall_score(labels, predictions, average='weighted')
    }
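
`compute_metrics` receives raw logits, so the `argmax` step is what turns them into class predictions. A toy sketch of that conversion and two of the resulting metrics, using invented logits (the `toy_*` names are hypothetical):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

# Hypothetical logits for 4 samples and 2 classes
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]])
toy_labels = np.array([0, 1, 1, 1])

# Highest logit per row gives the predicted class
toy_preds = np.argmax(toy_logits, axis=1)

toy_f1_macro = f1_score(toy_labels, toy_preds, average="macro")
toy_bal_acc = balanced_accuracy_score(toy_labels, toy_preds)
print(toy_preds.tolist(), round(toy_f1_macro, 4), round(toy_bal_acc, 4))
```

The third sample is misclassified, which lowers recall on class 1 to 2/3 and precision on class 0 to 1/2.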

In [None]:
def train_distilbert(X_train, X_test, y_train, y_test, dataset_name, epochs=3, batch_size=16):
    """
    Train DistilBERT classifier with training time tracking.
    
    Source: HuggingFace Transformers
    Model: distilbert-base-uncased
    https://huggingface.co/distilbert-base-uncased
    
    Returns:
    - metrics: dict of evaluation metrics
    - y_pred: predictions on test set
    - model: trained model
    - train_time: training time in seconds
    """
    print(f"\n{'='*60}")
    print(f"Training DistilBERT on {dataset_name}")
    print(f"{'='*60}")
    
    # Start timing (covers model loading, training, and per-epoch evaluation)
    start_time = time.time()
    
    # Load tokenizer and model
    model_name = 'distilbert-base-uncased'
    tokenizer = DistilBertTokenizer.from_pretrained(model_name)
    model = DistilBertForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2
    )
    
    # Create datasets
    train_dataset = FraudDataset(X_train, y_train, tokenizer)
    test_dataset = FraudDataset(X_test, y_test, tokenizer)
    
    # Training arguments
    training_args = TrainingArguments(
        output_dir=f'./results_{dataset_name.replace(" ", "_")}',
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        warmup_steps=500,  # NOTE: long for short runs (5,000 samples, batch 16, 2 epochs is roughly 626 steps on one device)
        weight_decay=0.01,
        logging_dir='./logs',
        logging_steps=100,
        eval_strategy='epoch',
        save_strategy='epoch',
        load_best_model_at_end=True,
        metric_for_best_model='f1_weighted',
        seed=SEED
    )
    
    # Initialize trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
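        # NOTE: the held-out test split doubles as the evaluation set, so
        # best-checkpoint selection sees test data and scores may be optimistic.
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics,
        # With the epochs=2 runs below, patience=2 cannot trigger before training ends.
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]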
    )
    
    # Train
    trainer.train()
    
    # Calculate training time
    train_time = time.time() - start_time
    
    # Evaluate
    eval_results = trainer.evaluate()
    
    # Get predictions for detailed metrics
    predictions = trainer.predict(test_dataset)
    y_pred = np.argmax(predictions.predictions, axis=1)
    
    metrics = {
        'Classifier': 'DistilBERT',
        'Dataset': dataset_name,
        'Accuracy': eval_results['eval_accuracy'],
        'Balanced_Accuracy': eval_results['eval_balanced_accuracy'],
        'F1_Weighted': eval_results['eval_f1_weighted'],
        'F1_Macro': eval_results['eval_f1_macro'],
        'Precision_Weighted': eval_results['eval_precision'],
        'Recall_Weighted': eval_results['eval_recall']
    }
    
    print(f"\nResults for {dataset_name}:")
    print(f"  F1 (weighted): {metrics['F1_Weighted']:.4f}")
    print(f"  F1 (macro): {metrics['F1_Macro']:.4f}")
    print(f"  Balanced Accuracy: {metrics['Balanced_Accuracy']:.4f}")
    print(f"  Training time: {train_time:.2f} seconds ({train_time/60:.2f} minutes)")
    
    return metrics, y_pred, model, train_time

In [None]:
# Train DistilBERT on both datasets
# Note: This may take significant time depending on GPU availability

distilbert_results = []

# Check if GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

# Sample size for faster training.
# For full-dataset training, set SAMPLE_SIZE to at least the training-set size;
# the min(...) calls below then select every training sample.
SAMPLE_SIZE = 5000  # Use smaller sample for demonstration
print(f"\nNote: Using sample of {SAMPLE_SIZE} for demonstration.")
print("Raise SAMPLE_SIZE to the training-set size for full training.\n")

# Sample data
np.random.seed(SEED)
sample_idx_full = np.random.choice(len(X_train_full), min(SAMPLE_SIZE, len(X_train_full)), replace=False)
sample_idx_eng = np.random.choice(len(X_train_eng), min(SAMPLE_SIZE, len(X_train_eng)), replace=False)

X_train_full_sample = X_train_full[sample_idx_full]
y_train_full_sample = y_train_full[sample_idx_full]

X_train_eng_sample = X_train_eng[sample_idx_eng]
y_train_eng_sample = y_train_eng[sample_idx_eng]
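
The subsample above is drawn uniformly at random, so its class ratio can drift from the full training set. One alternative, shown here only as a sketch on synthetic labels, is a stratified draw with `train_test_split`; the `toy_*` names are hypothetical and this is not what the notebook runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 90 negatives, 10 positives
toy_y = np.array([0] * 90 + [1] * 10)
toy_X = np.array([f"text {i}" for i in range(100)])

# Keep 20 samples while preserving the 9:1 class ratio
toy_X_sub, _, toy_y_sub, _ = train_test_split(
    toy_X, toy_y, train_size=20, stratify=toy_y, random_state=42
)
print(np.bincount(toy_y_sub).tolist())  # [18, 2]
```

Stratification matters most when the positive class is rare, since a uniform draw of a few thousand samples can under- or over-represent it by chance.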

In [None]:
# Train on Full Dataset
metrics_bert_full, pred_bert_full, model_full, bert_time_full = train_distilbert(
    X_train_full_sample, X_test_full[:1000],  # Smaller test set for speed
    y_train_full_sample, y_test_full[:1000],
    'Full (Multilingual)',
    epochs=2,
    batch_size=16
)
training_times['DistilBERT_Full'] = bert_time_full
distilbert_results.append(metrics_bert_full)

In [None]:
# Train on English-only Dataset
metrics_bert_eng, pred_bert_eng, model_eng, bert_time_eng = train_distilbert(
    X_train_eng_sample, X_test_eng[:1000],
    y_train_eng_sample, y_test_eng[:1000],
    'English-only',
    epochs=2,
    batch_size=16
)
training_times['DistilBERT_English'] = bert_time_eng
distilbert_results.append(metrics_bert_eng)

# Print training time summary
print("\n" + "="*60)
print("TRAINING TIME SUMMARY")
print("="*60)
for model_name, train_time in training_times.items():
    print(f"  {model_name}: {train_time:.2f}s ({train_time/60:.2f}m)")

---
## 8. Results Comparison and Analysis

**Steps taken:**
1. Compile all results
2. Calculate domain-wise performance
3. Compute aggregate metrics (mean and weighted)
4. Statistical significance testing

In [None]:
# Compile all results
all_results = rf_results + svm_results + distilbert_results
results_df = pd.DataFrame(all_results)

print("="*80)
print("OVERALL RESULTS COMPARISON")
print("="*80)
print(results_df.to_string(index=False))

In [None]:
# Calculate performance difference
print("\n" + "="*60)
print("PERFORMANCE DIFFERENCE (English-only vs Full)")
print("="*60)

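# Iterate over the names actually stored in results_df; the SVM rows are
# labelled 'SVM (LinearSVC)', so a hard-coded 'SVM' would match nothing.
for classifier in results_df['Classifier'].unique():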
    clf_results = results_df[results_df['Classifier'] == classifier]
    
    if len(clf_results) < 2:
        continue
        
    full_f1 = clf_results[clf_results['Dataset'].str.contains('Full')]['F1_Weighted'].values[0]
    eng_f1 = clf_results[clf_results['Dataset'].str.contains('English')]['F1_Weighted'].values[0]
    
    diff = eng_f1 - full_f1
    pct_change = (diff / full_f1) * 100
    
    print(f"\n{classifier}:")
    print(f"  Full dataset F1: {full_f1:.4f}")
    print(f"  English-only F1: {eng_f1:.4f}")
    print(f"  Difference: {diff:+.4f} ({pct_change:+.2f}%)")
    impact = 'Improved' if diff > 0 else 'Decreased' if diff < 0 else 'Unchanged'
    print(f"  Impact: {impact} with English-only data")

### 8.1 Statistical Significance Testing

**Steps taken:**
1. Calculate Cohen's d effect size for all model comparisons
2. Perform paired t-tests comparing English-only vs Full dataset performance
3. Use cross-validation to obtain multiple performance measurements for statistical testing

**Source:** 
- Cohen's d: Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences
- scipy.stats.ttest_rel: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html
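
To make the two statistics concrete, here is a worked sketch on five invented fold scores per condition. The numbers are illustrative only and are not results from this notebook; with equal group sizes the pooled variance reduces to the average of the two sample variances.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold F1 scores (not real results)
eng = np.array([0.90, 0.91, 0.92, 0.93, 0.94])
full = np.array([0.89, 0.89, 0.91, 0.91, 0.93])

# Cohen's d with pooled standard deviation (equal group sizes)
pooled_std = np.sqrt((np.var(eng, ddof=1) + np.var(full, ddof=1)) / 2)
toy_d = (eng.mean() - full.mean()) / pooled_std

# Paired t-test operates on the per-fold differences eng[i] - full[i]
toy_t, toy_p = stats.ttest_rel(eng, full)

print(f"d = {toy_d:.3f}, t = {toy_t:.3f}, p = {toy_p:.4f}")
```

Here d is about 0.86 and t about 5.7. The paired test is far more sensitive than the effect size suggests because the per-fold differences are very consistent, which is exactly the property that requires the folds to be genuinely matched.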

In [None]:
# Cohen's d Effect Size and Paired t-tests
# Source: Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences

def cohens_d(group1, group2):
    """
    Calculate Cohen's d effect size for two groups.
    
    Cohen's d = (mean1 - mean2) / pooled_std
    
    Interpretation (Cohen, 1988):
    - |d| < 0.2: negligible effect
    - 0.2 <= |d| < 0.5: small effect
    - 0.5 <= |d| < 0.8: medium effect
    - |d| >= 0.8: large effect
    
    Source: Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences
    """
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    
    # Pooled standard deviation
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    
    if pooled_std == 0:
        return 0.0
    
    d = (np.mean(group1) - np.mean(group2)) / pooled_std
    return d

def interpret_cohens_d(d):
    """Interpret Cohen's d effect size."""
    abs_d = abs(d)
    if abs_d < 0.2:
        return "negligible"
    elif abs_d < 0.5:
        return "small"
    elif abs_d < 0.8:
        return "medium"
    else:
        return "large"

print("="*80)
print("COHEN'S D EFFECT SIZE ANALYSIS")
print("="*80)
print("\nComparing English-only vs Full dataset performance across classifiers")
print("(Positive d = English-only performs better; Negative d = Full performs better)\n")

# We need cross-validation scores for proper paired t-tests
# Let's run stratified k-fold CV to get multiple measurements

print("Running 5-fold cross-validation for statistical testing...")
print("(This provides multiple performance measurements for paired t-tests)\n")

from sklearn.model_selection import StratifiedKFold

N_FOLDS = 5
cv = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

# Store CV results for statistical analysis
cv_results = {
    'RF_Full': [], 'RF_English': [],
    'SVM_Full': [], 'SVM_English': []
}

# Cross-validation for Random Forest
print("Cross-validating Random Forest...")
for fold, (train_idx, val_idx) in enumerate(cv.split(X_train_full_tfidf, y_train_full)):
    # Full dataset
    rf_cv = RandomForestClassifier(n_estimators=100, class_weight='balanced', 
                                    random_state=SEED, n_jobs=-1)
    rf_cv.fit(X_train_full_tfidf[train_idx], y_train_full[train_idx])
    y_pred_cv = rf_cv.predict(X_train_full_tfidf[val_idx])
    cv_results['RF_Full'].append(f1_score(y_train_full[val_idx], y_pred_cv, average='weighted'))

for fold, (train_idx, val_idx) in enumerate(cv.split(X_train_eng_tfidf, y_train_eng)):
    # English-only dataset
    rf_cv = RandomForestClassifier(n_estimators=100, class_weight='balanced', 
                                    random_state=SEED, n_jobs=-1)
    rf_cv.fit(X_train_eng_tfidf[train_idx], y_train_eng[train_idx])
    y_pred_cv = rf_cv.predict(X_train_eng_tfidf[val_idx])
    cv_results['RF_English'].append(f1_score(y_train_eng[val_idx], y_pred_cv, average='weighted'))

# Cross-validation for SVM
print("Cross-validating SVM...")
for fold, (train_idx, val_idx) in enumerate(cv.split(X_train_full_tfidf, y_train_full)):
    svm_cv = LinearSVC(C=1.0, class_weight='balanced', max_iter=10000, random_state=SEED)
    svm_cv.fit(X_train_full_tfidf[train_idx], y_train_full[train_idx])
    y_pred_cv = svm_cv.predict(X_train_full_tfidf[val_idx])
    cv_results['SVM_Full'].append(f1_score(y_train_full[val_idx], y_pred_cv, average='weighted'))

for fold, (train_idx, val_idx) in enumerate(cv.split(X_train_eng_tfidf, y_train_eng)):
    svm_cv = LinearSVC(C=1.0, class_weight='balanced', max_iter=10000, random_state=SEED)
    svm_cv.fit(X_train_eng_tfidf[train_idx], y_train_eng[train_idx])
    y_pred_cv = svm_cv.predict(X_train_eng_tfidf[val_idx])
    cv_results['SVM_English'].append(f1_score(y_train_eng[val_idx], y_pred_cv, average='weighted'))

print("\nCross-validation complete!\n")

# Calculate Cohen's d for each classifier
effect_sizes = {}

print("--- Cohen's d Effect Sizes ---")
for clf_name in ['RF', 'SVM']:
    full_scores = np.array(cv_results[f'{clf_name}_Full'])
    eng_scores = np.array(cv_results[f'{clf_name}_English'])
    
    d = cohens_d(eng_scores, full_scores)
    effect_sizes[clf_name] = d
    interpretation = interpret_cohens_d(d)
    
    print(f"\n{clf_name}:")
    # Use the sample standard deviation (ddof=1) for consistency with cohens_d
    print(f"  Full dataset CV F1: {full_scores.mean():.4f} (+/- {full_scores.std(ddof=1):.4f})")
    print(f"  English-only CV F1: {eng_scores.mean():.4f} (+/- {eng_scores.std(ddof=1):.4f})")
    print(f"  Cohen's d: {d:.4f} ({interpretation} effect)")
    print(f"  Direction: {'English-only better' if d > 0 else 'Full better'}")

# Paired t-tests
print("\n" + "="*80)
print("PAIRED T-TESTS (English-only vs Full)")
print("="*80)
print("\nUsing scipy.stats.ttest_rel for paired comparisons")
print("H0: No difference in performance between English-only and Full datasets")
print("H1: There is a significant difference in performance\n")

ttest_results = {}

for clf_name in ['RF', 'SVM']:
    full_scores = np.array(cv_results[f'{clf_name}_Full'])
    eng_scores = np.array(cv_results[f'{clf_name}_English'])
    
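    # Paired t-test
    # NOTE: the Full and English-only folds come from separate cv.split calls on
    # different datasets, so scores are paired by fold index only. Treat the
    # p-value as indicative; an unpaired test (stats.ttest_ind) is an alternative.
    # Source: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html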
    t_stat, p_value = stats.ttest_rel(eng_scores, full_scores)
    
    ttest_results[clf_name] = {'t_stat': t_stat, 'p_value': p_value}
    
    print(f"{clf_name}:")
    print(f"  t-statistic: {t_stat:.4f}")
    print(f"  p-value: {p_value:.4f}")
    print(f"  Significant at alpha=0.05: {'Yes' if p_value < 0.05 else 'No'}")
    print(f"  Significant at alpha=0.01: {'Yes' if p_value < 0.01 else 'No'}")
    print()

# Summary table
print("\n--- Statistical Analysis Summary ---")
summary_data = []
for clf_name in ['RF', 'SVM']:
    full_scores = np.array(cv_results[f'{clf_name}_Full'])
    eng_scores = np.array(cv_results[f'{clf_name}_English'])
    d = effect_sizes[clf_name]
    t_res = ttest_results[clf_name]
    
    summary_data.append({
        'Classifier': 'Random Forest' if clf_name == 'RF' else 'SVM (LinearSVC)',
        'Full_F1_Mean': f"{full_scores.mean():.4f}",
        'English_F1_Mean': f"{eng_scores.mean():.4f}",
        'Cohens_d': f"{d:.4f}",
        'Effect_Size': interpret_cohens_d(d),
        't_stat': f"{t_res['t_stat']:.4f}",
        'p_value': f"{t_res['p_value']:.4f}",
        'Significant': 'Yes' if t_res['p_value'] < 0.05 else 'No'
    })

summary_df = pd.DataFrame(summary_data)
print(summary_df.to_string(index=False))
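
The Full and English-only folds above are generated by separate `cv.split` calls on different datasets, so fold *i* of one is not the same split as fold *i* of the other. A hedged sketch of how much the pairing assumption can matter, comparing `ttest_rel` with Welch's unpaired test on invented scores (not results from this notebook):

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold F1 scores (not real results)
eng = np.array([0.90, 0.91, 0.92, 0.93, 0.94])
full = np.array([0.89, 0.89, 0.91, 0.91, 0.93])

# Paired: assumes eng[i] and full[i] come from the same fold
paired_t, paired_p = stats.ttest_rel(eng, full)

# Welch: treats the two sets of scores as independent samples
welch_t, welch_p = stats.ttest_ind(eng, full, equal_var=False)

print(f"paired: t = {paired_t:.3f}, p = {paired_p:.4f}")
print(f"welch:  t = {welch_t:.3f}, p = {welch_p:.4f}")
```

On these toy numbers the paired test is significant at 0.05 while Welch's test is not, so reporting both is a reasonable sensitivity check when the pairing is only nominal.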

In [None]:
# Domain-wise performance analysis
# Note: This requires per-domain evaluation which we'll compute here

def evaluate_by_domain(y_true, y_pred, domains):
    """
    Calculate performance metrics for each domain.
    """
    domain_metrics = []
    
    for domain in DOMAINS:
        mask = domains == domain
        if mask.sum() == 0:
            continue
            
        y_true_domain = y_true[mask]
        y_pred_domain = y_pred[mask]
        
        domain_metrics.append({
            'Domain': domain,
            'Samples': mask.sum(),
            'Accuracy': accuracy_score(y_true_domain, y_pred_domain),
            'F1_Weighted': f1_score(y_true_domain, y_pred_domain, average='weighted', zero_division=0),
            'F1_Macro': f1_score(y_true_domain, y_pred_domain, average='macro', zero_division=0)
        })
    
    return pd.DataFrame(domain_metrics)

print("="*60)
print("DOMAIN-WISE PERFORMANCE (Random Forest - Full Dataset)")
print("="*60)
domain_perf_full = evaluate_by_domain(y_test_full, pred_rf_full, domains_test_full)
print(domain_perf_full.to_string(index=False))

print("\n" + "="*60)
print("DOMAIN-WISE PERFORMANCE (Random Forest - English-only)")
print("="*60)
domain_perf_eng = evaluate_by_domain(y_test_eng, pred_rf_eng, domains_test_eng)
print(domain_perf_eng.to_string(index=False))

In [None]:
# Aggregate metrics (Mean and Weighted)
print("="*60)
print("AGGREGATE PERFORMANCE METRICS")
print("="*60)

# Mean performance across domains
print("\n--- Mean Performance (unweighted average across domains) ---")
print(f"Full Dataset - Mean F1: {domain_perf_full['F1_Weighted'].mean():.4f}")
print(f"English-only - Mean F1: {domain_perf_eng['F1_Weighted'].mean():.4f}")

# Weighted performance (weighted by number of samples)
print("\n--- Weighted Performance (weighted by domain size) ---")
weighted_f1_full = np.average(
    domain_perf_full['F1_Weighted'], 
    weights=domain_perf_full['Samples']
)
weighted_f1_eng = np.average(
    domain_perf_eng['F1_Weighted'], 
    weights=domain_perf_eng['Samples']
)
print(f"Full Dataset - Weighted F1: {weighted_f1_full:.4f}")
print(f"English-only - Weighted F1: {weighted_f1_eng:.4f}")
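
The two aggregates answer different questions: the unweighted mean treats every domain equally, while the sample-weighted mean is dominated by large domains. A toy sketch with two invented domains (values are illustrative only):

```python
import numpy as np

# Hypothetical per-domain F1 scores and sample counts
toy_f1 = np.array([0.9, 0.5])
toy_samples = np.array([900, 100])

toy_mean_f1 = toy_f1.mean()                                # every domain counts equally
toy_weighted_f1 = np.average(toy_f1, weights=toy_samples)  # large domains dominate

print(f"mean = {toy_mean_f1:.2f}, weighted = {toy_weighted_f1:.2f}")
```

A large gap between the two, as in this example (0.70 versus 0.86), signals that a small domain is performing much worse than the headline weighted figure suggests.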

In [None]:
# Visualization of results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: F1 Score comparison by classifier
ax1 = axes[0, 0]
classifiers = results_df['Classifier'].unique()
x = np.arange(len(classifiers))
width = 0.35

full_f1 = [results_df[(results_df['Classifier']==c) & (results_df['Dataset'].str.contains('Full'))]['F1_Weighted'].values[0] 
           if len(results_df[(results_df['Classifier']==c) & (results_df['Dataset'].str.contains('Full'))]) > 0 else 0
           for c in classifiers]
eng_f1 = [results_df[(results_df['Classifier']==c) & (results_df['Dataset'].str.contains('English'))]['F1_Weighted'].values[0]
          if len(results_df[(results_df['Classifier']==c) & (results_df['Dataset'].str.contains('English'))]) > 0 else 0
          for c in classifiers]

bars1 = ax1.bar(x - width/2, full_f1, width, label='Full (Multilingual)', color='coral')
bars2 = ax1.bar(x + width/2, eng_f1, width, label='English-only', color='green')
ax1.set_ylabel('F1 Score (Weighted)')
ax1.set_title('F1 Score by Classifier and Dataset')
ax1.set_xticks(x)
ax1.set_xticklabels(classifiers)
ax1.legend()
ax1.set_ylim(0, 1)

# Plot 2: Domain-wise F1 comparison
ax2 = axes[0, 1]
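x = np.arange(len(DOMAINS))
# Align scores to DOMAINS so a domain skipped by evaluate_by_domain (no samples)
# plots as a gap instead of causing a length mismatch
f1_by_domain_full = domain_perf_full.set_index('Domain')['F1_Weighted'].reindex(DOMAINS)
f1_by_domain_eng = domain_perf_eng.set_index('Domain')['F1_Weighted'].reindex(DOMAINS)
ax2.bar(x - width/2, f1_by_domain_full, width, label='Full', color='coral')
ax2.bar(x + width/2, f1_by_domain_eng, width, label='English-only', color='green')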
ax2.set_ylabel('F1 Score (Weighted)')
ax2.set_title('Domain-wise F1 Score (Random Forest)')
ax2.set_xticks(x)
ax2.set_xticklabels([d.replace('_', '\n') for d in DOMAINS], fontsize=8)
ax2.legend()

# Plot 3: Confusion Matrix (Full Dataset)
ax3 = axes[1, 0]
cm_full = confusion_matrix(y_test_full, pred_rf_full)
sns.heatmap(cm_full, annot=True, fmt='d', cmap='Blues', ax=ax3)
ax3.set_title('Confusion Matrix - Full Dataset (RF)')
ax3.set_xlabel('Predicted')
ax3.set_ylabel('Actual')

# Plot 4: Confusion Matrix (English-only)
ax4 = axes[1, 1]
cm_eng = confusion_matrix(y_test_eng, pred_rf_eng)
sns.heatmap(cm_eng, annot=True, fmt='d', cmap='Greens', ax=ax4)
ax4.set_title('Confusion Matrix - English-only (RF)')
ax4.set_xlabel('Predicted')
ax4.set_ylabel('Actual')

plt.tight_layout()
plt.savefig('classification_results.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nVisualization saved as 'classification_results.png'")

In [None]:
# Comprehensive Model Comparison Visualization
# Shows all models side-by-side with multiple metrics

print("="*80)
print("COMPREHENSIVE MODEL COMPARISON")
print("="*80)

fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Prepare data for visualization
metrics_to_plot = ['F1_Weighted', 'F1_Macro', 'Balanced_Accuracy', 'Precision_Weighted', 'Recall_Weighted']
classifiers = results_df['Classifier'].unique()
datasets = ['Full (Multilingual)', 'English-only']

# Color palette
colors_full = '#E74C3C'  # Coral/Red for Full
colors_eng = '#27AE60'   # Green for English-only

# Plot 1: Weighted F1 by classifier (grouped bar)
ax1 = axes[0, 0]
x = np.arange(len(classifiers))
width = 0.35

f1_full = [results_df[(results_df['Classifier']==c) & (results_df['Dataset'].str.contains('Full'))]['F1_Weighted'].values[0] 
           if len(results_df[(results_df['Classifier']==c) & (results_df['Dataset'].str.contains('Full'))]) > 0 else 0
           for c in classifiers]
f1_eng = [results_df[(results_df['Classifier']==c) & (results_df['Dataset'].str.contains('English'))]['F1_Weighted'].values[0]
          if len(results_df[(results_df['Classifier']==c) & (results_df['Dataset'].str.contains('English'))]) > 0 else 0
          for c in classifiers]

bars1 = ax1.bar(x - width/2, f1_full, width, label='Full Dataset', color=colors_full, alpha=0.8)
bars2 = ax1.bar(x + width/2, f1_eng, width, label='English-only', color=colors_eng, alpha=0.8)

# Add value labels on bars
for bar, val in zip(bars1, f1_full):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, f'{val:.3f}', 
             ha='center', va='bottom', fontsize=9)
for bar, val in zip(bars2, f1_eng):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, f'{val:.3f}', 
             ha='center', va='bottom', fontsize=9)

ax1.set_ylabel('F1 Score (Weighted)', fontsize=11)
ax1.set_title('F1 Score Comparison Across All Models', fontsize=12, fontweight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels(classifiers, fontsize=10)
ax1.legend(loc='lower right')
ax1.set_ylim(0, 1.1)
ax1.grid(axis='y', alpha=0.3)

# Plot 2: Balanced Accuracy comparison
ax2 = axes[0, 1]
ba_full = [results_df[(results_df['Classifier']==c) & (results_df['Dataset'].str.contains('Full'))]['Balanced_Accuracy'].values[0] 
           if len(results_df[(results_df['Classifier']==c) & (results_df['Dataset'].str.contains('Full'))]) > 0 else 0
           for c in classifiers]
ba_eng = [results_df[(results_df['Classifier']==c) & (results_df['Dataset'].str.contains('English'))]['Balanced_Accuracy'].values[0]
          if len(results_df[(results_df['Classifier']==c) & (results_df['Dataset'].str.contains('English'))]) > 0 else 0
          for c in classifiers]

ax2.bar(x - width/2, ba_full, width, label='Full Dataset', color=colors_full, alpha=0.8)
ax2.bar(x + width/2, ba_eng, width, label='English-only', color=colors_eng, alpha=0.8)
ax2.set_ylabel('Balanced Accuracy', fontsize=11)
ax2.set_title('Balanced Accuracy Comparison', fontsize=12, fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(classifiers, fontsize=10)
ax2.legend(loc='lower right')
ax2.set_ylim(0, 1.1)
ax2.grid(axis='y', alpha=0.3)

# Plot 3: Performance difference (English - Full)
ax3 = axes[0, 2]
f1_diff = [eng - full for eng, full in zip(f1_eng, f1_full)]
colors_diff = [colors_eng if d >= 0 else colors_full for d in f1_diff]

bars = ax3.bar(classifiers, f1_diff, color=colors_diff, alpha=0.8)
ax3.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
ax3.set_ylabel('F1 Difference (English - Full)', fontsize=11)
ax3.set_title('Performance Difference\n(Positive = English Better)', fontsize=12, fontweight='bold')
ax3.grid(axis='y', alpha=0.3)

# Add value labels
for bar, val in zip(bars, f1_diff):
    ypos = bar.get_height() + 0.005 if val >= 0 else bar.get_height() - 0.015
    ax3.text(bar.get_x() + bar.get_width()/2, ypos, f'{val:+.4f}', 
             ha='center', va='bottom' if val >= 0 else 'top', fontsize=10)

# Plot 4: Best-model comparison across multiple metrics (grouped bar chart)
ax4 = axes[1, 0]
metrics_labels = ['F1 Weighted', 'F1 Macro', 'Bal. Acc', 'Precision', 'Recall']
metrics_keys = ['F1_Weighted', 'F1_Macro', 'Balanced_Accuracy', 'Precision_Weighted', 'Recall_Weighted']

# Get best model performance on full vs english
best_full_idx = results_df[results_df['Dataset'].str.contains('Full')]['F1_Weighted'].idxmax()
best_eng_idx = results_df[results_df['Dataset'].str.contains('English')]['F1_Weighted'].idxmax()
best_full = results_df.loc[best_full_idx]
best_eng = results_df.loc[best_eng_idx]

full_vals = [best_full[k] for k in metrics_keys]
eng_vals = [best_eng[k] for k in metrics_keys]

x_radar = np.arange(len(metrics_labels))
ax4.bar(x_radar - 0.2, full_vals, 0.4, label=f'Best Full ({best_full["Classifier"]})', color=colors_full, alpha=0.8)
ax4.bar(x_radar + 0.2, eng_vals, 0.4, label=f'Best Eng ({best_eng["Classifier"]})', color=colors_eng, alpha=0.8)
ax4.set_xticks(x_radar)
ax4.set_xticklabels(metrics_labels, fontsize=9)
ax4.set_ylabel('Score', fontsize=11)
ax4.set_title('Best Model Performance (All Metrics)', fontsize=12, fontweight='bold')
ax4.legend(loc='lower right')
ax4.set_ylim(0, 1.1)
ax4.grid(axis='y', alpha=0.3)

# Plot 5: Training time comparison
ax5 = axes[1, 1]
time_data = []
time_labels = []
time_colors = []

for model in ['RF', 'SVM', 'DistilBERT']:
    for dataset in ['Full', 'English']:
        key = f'{model}_{dataset}'
        if key in training_times:
            time_data.append(training_times[key])
            time_labels.append(f'{model}\n({dataset[:3]})')
            time_colors.append(colors_full if dataset == 'Full' else colors_eng)

ax5.bar(range(len(time_data)), time_data, color=time_colors, alpha=0.8)
ax5.set_xticks(range(len(time_data)))
ax5.set_xticklabels(time_labels, fontsize=9)
ax5.set_ylabel('Training Time (seconds)', fontsize=11)
ax5.set_title('Training Time by Model and Dataset', fontsize=12, fontweight='bold')
ax5.grid(axis='y', alpha=0.3)

# Plot 6: Summary heatmap of all results
ax6 = axes[1, 2]
heatmap_data = results_df.pivot(index='Classifier', columns='Dataset', values='F1_Weighted')
sns.heatmap(heatmap_data, annot=True, fmt='.4f', cmap='RdYlGn', ax=ax6, 
            cbar_kws={'label': 'F1 Score'}, vmin=0.5, vmax=1.0)
ax6.set_title('F1 Score Heatmap', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.savefig('comprehensive_model_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nComprehensive visualization saved as 'comprehensive_model_comparison.png'")

---
## 9. Summary and Conclusions

### Key Findings

In [None]:
# Generate summary report
print("="*80)
print("FINAL SUMMARY REPORT")
print("="*80)

print("\n### Dataset Analysis ###")
print(f"Total samples analyzed: {len(df):,}")
print(f"English samples: {df['is_english'].sum():,} ({df['is_english'].mean()*100:.2f}%)")
print(f"Non-English samples: {(~df['is_english']).sum():,} ({(~df['is_english']).mean()*100:.2f}%)")
print(f"Unique languages detected: {df['detected_language'].nunique()}")

print("\n### Language Distribution by Domain ###")
print(domain_lang_df[['Domain', 'Total', 'Non-English', 'Non-English %']].to_string(index=False))

print("\n### Classification Performance Summary ###")
print(results_df[['Classifier', 'Dataset', 'F1_Weighted', 'Balanced_Accuracy']].to_string(index=False))

print("\n### Hypothesis Testing Results ###")
print("H1 (Data Composition): ", end="")
non_eng_pct = (~df['is_english']).mean() * 100
if non_eng_pct > 1:
    print(f"SUPPORTED - {non_eng_pct:.2f}% non-English content found")
else:
    print(f"NOT SUPPORTED - Only {non_eng_pct:.2f}% non-English content")

print("H2 (Performance Impact): ", end="")
# Compare best F1 scores
if len(results_df) > 0:
    full_best = results_df[results_df['Dataset'].str.contains('Full')]['F1_Weighted'].max()
    eng_best = results_df[results_df['Dataset'].str.contains('English')]['F1_Weighted'].max()
    if eng_best > full_best:
        print(f"SUPPORTED - English-only shows higher F1 ({eng_best:.4f} vs {full_best:.4f})")
    else:
        print(f"NOT SUPPORTED - Full dataset shows comparable/better F1 ({full_best:.4f} vs {eng_best:.4f})")

In [None]:
# Save results to CSV
results_df.to_csv('classification_results.csv', index=False)
domain_lang_df.to_csv('language_distribution_by_domain.csv', index=False)

# Save detailed language analysis
df[['text', 'label', 'domain', 'detected_language', 'language_confidence', 'is_english']].to_csv(
    'difraud_language_analysis.csv', index=False
)

print("\nResults saved to:")
print("  - classification_results.csv")
print("  - language_distribution_by_domain.csv")
print("  - difraud_language_analysis.csv")

---
## References and Sources

### Dataset
- **DIFrauD Dataset**: Boumber, D., et al. (2024). "Domain-Agnostic Adapter Architecture for Deception Detection." LREC-COLING 2024. Available at: https://huggingface.co/datasets/difraud/difraud

### Libraries and Code Sources
- **langdetect**: Language detection library (port of Google's language-detection). https://pypi.org/project/langdetect/
- **HuggingFace datasets**: Dataset loading library. https://huggingface.co/docs/datasets/
- **HuggingFace transformers**: Transformer models (DistilBERT). https://huggingface.co/docs/transformers/
- **scikit-learn**: ML classifiers (Random Forest, SVM) and metrics. https://scikit-learn.org/
- **DistilBERT model**: distilbert-base-uncased. https://huggingface.co/distilbert-base-uncased

### Academic References
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences.
- Conneau, A., et al. (2020). "Unsupervised cross-lingual representation learning at scale." ACL 2020.
- Devlin, J., et al. (2019). "BERT: Pre-training of deep bidirectional transformers." NAACL 2019.
- Verma, R. M., et al. (2019). "Data quality for security challenges." ACM CCS 2019.

### Metrics Choice Justification
- **F1-Score (Weighted)**: Used as the primary metric due to class imbalance in the DIFrauD dataset. Weighted F1 averages per-class F1 scores in proportion to class support, so it reflects the class distribution.
- **F1-Score (Macro)**: Unweighted average of per-class F1 scores, useful for evaluating performance on the minority class.
- **Balanced Accuracy**: Accounts for class imbalance by averaging recall across classes.
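
A toy sketch of why these metrics are reported alongside plain accuracy: a degenerate classifier that always predicts the majority class on an invented 9:1 label vector. The data and `toy_*` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Hypothetical labels: 9 negatives, 1 positive; classifier predicts all negative
toy_true = np.array([0] * 9 + [1])
toy_pred = np.zeros(10, dtype=int)

toy_acc = accuracy_score(toy_true, toy_pred)
toy_bal_acc = balanced_accuracy_score(toy_true, toy_pred)
toy_f1_weighted = f1_score(toy_true, toy_pred, average="weighted", zero_division=0)
toy_f1_macro = f1_score(toy_true, toy_pred, average="macro", zero_division=0)

print(f"accuracy={toy_acc:.3f}  balanced_accuracy={toy_bal_acc:.3f}")
print(f"f1_weighted={toy_f1_weighted:.3f}  f1_macro={toy_f1_macro:.3f}")
```

Accuracy (0.900) and weighted F1 (about 0.853) still look healthy, while balanced accuracy (0.500) and macro F1 (about 0.474) expose that the minority class is never detected.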

### Code Notes
- All code in this notebook is original unless otherwise noted
- API usage follows official documentation from respective libraries
- Random seed (42) used for reproducibility