# FakeScope Development.ipynb - Complete Guide

## Overview
This notebook implements a fake news detection pipeline that combines traditional ML baselines on TF-IDF features with a fine-tuned transformer (DistilBERT) and a soft-voting ensemble of the two.

## Chronology & Architecture

### Phase 1: Data Loading & Preprocessing (Cells 1-17)
- **Load datasets**: Two CSV files merged into unified dataset
- **Label normalization**: Maps various labels to binary (0=Fake, 1=True)
- **Text cleaning**: Removes punctuation, digits, stopwords (NLTK + sklearn + custom)
- **Deduplication**: MD5 content hashes to detect exact duplicates
- **EDA**: WordClouds, class distribution, top word/bigram frequencies
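
The deduplication step can be illustrated with a minimal sketch using only the standard library. The whitespace and case normalization inside `content_hash` and the helper names are assumptions made for illustration; the notebook may hash the cleaned text differently.

```python
import hashlib

def content_hash(text: str) -> str:
    """Return the MD5 hex digest of lowercased, whitespace-normalized text."""
    # Assumption: normalization happens before hashing so trivial spacing
    # or casing differences still count as the same article.
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def drop_exact_duplicates(texts):
    """Keep the first occurrence of each distinct content hash, in order."""
    seen = set()
    unique = []
    for text in texts:
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

articles = ["Markets rally today", "markets  rally today", "Senate passes bill"]
deduped = drop_exact_duplicates(articles)
print(deduped)
```

The same hash can later serve as the group key for the train/test split, so that copies of one article never land on both sides.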

**Key Configuration**:
```python
TOKEN_PATTERN = r'(?u)\b\w\w+\b'  # Skip 1-letter tokens
MIN_DF = 5                           # Drop very rare tokens
MAX_DF = 0.90                        # Drop extremely common tokens
NGRAM_RANGE = (1, 2)                 # Unigrams and bigrams
```

### Phase 2: Train/Test Split (Cells 18-19)
- **GroupShuffleSplit**: Prevents data leakage by grouping duplicate articles
- **25% test split** with random_state=42 for reproducibility
- **Content hash grouping**: Ensures same article variants stay in same split
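
A minimal sketch of the grouped split, assuming scikit-learn and NumPy are installed. The toy texts and the use of `np.unique` to derive group ids are illustrative stand-ins for the notebook's content hashes.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy corpus with two pairs of exact duplicates.
texts = np.array(["a b", "a b", "c d", "e f", "g h", "g h", "i j", "k l"])
labels = np.array([0, 0, 1, 1, 0, 0, 1, 0])

# Identical texts receive the same group id (stand-in for a content hash).
_, groups = np.unique(texts, return_inverse=True)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(texts, labels, groups=groups))

X_train, X_test = texts[train_idx], texts[test_idx]
y_train, y_test = labels[train_idx], labels[test_idx]

# No group may appear on both sides of the split.
shared_groups = set(groups[train_idx]) & set(groups[test_idx])
print(f"train={len(train_idx)} test={len(test_idx)} shared={len(shared_groups)}")
```

Note that `test_size` applies to groups rather than rows, so the realized row proportion can differ slightly from 25%.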

### Phase 3: Feature Extraction (Cell 20)
- **TF-IDF vectorization**: max_features=5000, custom stopwords
- **Train-only fitting**: Vectorizer fit on X_train, transform on X_test
- **Output**: Sparse matrices (train: ~75% samples, test: ~25%)
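
The train-only fitting rule can be sketched as follows, assuming scikit-learn is installed. The toy documents and the extra stopword are hypothetical, and `min_df` is lowered to 1 here only because the toy corpus is tiny (the guide's configuration uses 5).

```python
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Hypothetical custom stopword list: sklearn's English list plus one publisher name.
custom_stopwords = list(ENGLISH_STOP_WORDS.union({"reuters"}))

vectorizer = TfidfVectorizer(
    token_pattern=r"(?u)\b\w\w+\b",  # skip 1-letter tokens
    ngram_range=(1, 2),              # unigrams and bigrams
    min_df=1,                        # guide uses 5; lowered for the toy corpus
    max_df=0.90,                     # drop extremely common tokens
    max_features=5000,
    stop_words=custom_stopwords,
)

train_docs = [
    "senate passes budget bill",
    "shocking miracle cure revealed",
    "budget vote delayed again",
    "miracle diet shocks doctors",
]
test_docs = ["senate delays miracle vote", "unseen words only here"]

# Fit the vocabulary and IDF weights on training text only.
X_train_tfidf = vectorizer.fit_transform(train_docs)
# Test text is only transformed; terms unseen in training are ignored.
X_test_tfidf = vectorizer.transform(test_docs)

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Because the vocabulary comes from the training split alone, a test document made entirely of unseen terms maps to an all-zero row rather than leaking new features into the model.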

### Phase 4: Baseline Models (Cells 21-34)
- **Logistic Regression**: Linear baseline with L2 regularization
- **Decision Tree**: Non-linear baseline with max_depth tuning
- **Random Forest**: Ensemble baseline with GridSearchCV hyperparameter tuning
- **Evaluation**: Accuracy, F1, ROC/AUC, confusion matrices, feature importance
- **Best model selection**: Choose by AUC, save to best_baseline_model.joblib
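
As a rough sketch of selection by AUC, the snippet below trains three candidates on synthetic data and saves the winner. The synthetic dataset, the hyperparameters, and the temporary output directory are assumptions for illustration; the notebook works on the TF-IDF matrices and tunes the Random Forest with GridSearchCV.

```python
import os
import tempfile

from joblib import dump
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the TF-IDF features and binary labels.
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Score each candidate by ROC AUC on the held-out split.
auc_scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]
    auc_scores[name] = roc_auc_score(y_te, proba)

best_name = max(auc_scores, key=auc_scores.get)

# Saved to a temp directory here; the notebook writes to the project root.
model_path = os.path.join(tempfile.gettempdir(), "best_baseline_model.joblib")
dump(candidates[best_name], model_path)
print(best_name, round(auc_scores[best_name], 3))
```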

### Phase 5: Transformer Models (Cells 35-56)
- **DistilBERT fine-tuning**: HuggingFace Transformers with custom tokenization
- **2-stage training** (optional):
  1. **Stage 1**: Masked Language Modeling (MLM) on unlabeled corpus for domain adaptation
  2. **Stage 2**: Fine-tune adapted model on labeled fake news classification
- **Training config**: 3 epochs, batch_size=16, learning_rate=2e-5, MPS (Apple Silicon)
- **Cross-validation**: 5-fold StratifiedKFold for robust evaluation
- **Model persistence**: Save to ./distilbert_fakenews/
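
The cross-validation scheme can be sketched independently of the transformer, assuming scikit-learn is installed. The toy labels are illustrative, and the fine-tuning step is left as a comment because it depends on the notebook's own training code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy dataset: 30 fake (0) and 20 true (1) articles.
labels = np.array([0] * 30 + [1] * 20)
texts = np.array([f"doc {i}" for i in range(50)])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
fold_pos_rates = []
for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels), start=1):
    # A fresh model would be fine-tuned on texts[train_idx] here
    # and evaluated on texts[val_idx].
    fold_sizes.append(len(val_idx))
    fold_pos_rates.append(float(labels[val_idx].mean()))
    print(f"fold {fold}: val size={len(val_idx)}, positive rate={fold_pos_rates[-1]:.2f}")
```

Stratification keeps the class ratio of every validation fold equal to that of the full dataset, which makes per-fold metrics comparable.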

### Phase 6: Ensemble & Error Analysis (Cells 39-42)
- **Soft voting ensemble**:
  ```python
  ensemble_proba = 0.6 * bert_proba + 0.4 * rf_proba
  ensemble_pred = (ensemble_proba > 0.5).astype(int)
  ```
- **Weight rationale**: 60% transformer (semantic context) + 40% RF (robust features)
- **Error analysis**: Identify misclassified examples, show prediction details
- **Attention visualization**: Use bertviz to understand model focus
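
A small worked example of the soft-voting formula and the error-analysis step, using NumPy only. The probability arrays are made-up values chosen to show how the weighted average can both rescue and preserve individual model errors.

```python
import numpy as np

# Hypothetical P(True) outputs from each model for five articles.
y_true = np.array([1, 0, 1, 0, 1])
bert_proba = np.array([0.90, 0.20, 0.40, 0.70, 0.55])
rf_proba = np.array([0.80, 0.30, 0.70, 0.60, 0.30])

# Weighted soft vote: 60% transformer, 40% Random Forest.
ensemble_proba = 0.6 * bert_proba + 0.4 * rf_proba
ensemble_pred = (ensemble_proba > 0.5).astype(int)

# Error analysis: indices where the ensemble disagrees with the label.
wrong_idx = np.flatnonzero(ensemble_pred != y_true)
for i in wrong_idx:
    print(
        f"example {i}: true={y_true[i]} pred={ensemble_pred[i]} "
        f"bert={bert_proba[i]:.2f} rf={rf_proba[i]:.2f} ensemble={ensemble_proba[i]:.2f}"
    )
```

In this toy data, example 2 is misclassified by the transformer alone (0.40) but corrected by the ensemble (0.52), while example 4 goes the other way.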

### Phase 7: Fact-Checking Integration (Cells 58-70)
- **Google Fact Check API**: Query external fact-checkers for claims
- **Claim extraction**: Use spaCy to extract sentences >10 words
- **Credibility scoring**: Combine model predictions with fact-check verdicts
- **Comment generation**: Explanations based on score
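
The claim-extraction rule (keep sentences longer than 10 words) can be sketched without spaCy. The regex sentence splitter below is a simplified stand-in, not the notebook's method; with spaCy, the sentences would come from `doc.sents` instead. The function name and sample article are hypothetical.

```python
import re

MIN_WORDS = 10  # sentences must have more than this many words

def extract_claims(text, min_words=MIN_WORDS):
    """Split text into sentences and keep those longer than min_words."""
    # Naive split on whitespace that follows ., ! or ? (stand-in for spaCy).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > min_words]

article = (
    "Shocking news. The city council voted on Tuesday to approve a new budget "
    "that raises property taxes by three percent. Officials declined to comment."
)
claims = extract_claims(article)
print(claims)
```

Short sentences are dropped because they rarely carry a checkable factual claim, which keeps the number of fact-check API queries down.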

# Practical Usage Guide

## Environment Setup
```bash
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Mac/Linux

# Install dependencies
pip install -r requirements.txt

# Download spaCy model
python -m spacy download en_core_web_sm
```

## Running the Notebook
1. Open Development.ipynb in VS Code
2. Select Python interpreter: **Python: Select Interpreter** → Choose .venv
3. Run cells sequentially from top to bottom
4. Watch the cell output for errors and data-leakage warnings

## Expected Runtimes (M4 Mac)
- Data loading & preprocessing: ~2 min
- Baseline training: ~5 min
- DistilBERT fine-tuning: ~45 min (standard) or ~2 hours (2-stage with MLM)
- Ensemble & evaluation: ~1 min

## Key Variables & Their Purpose
- `df_news`: Main dataset after merging and cleaning
- `X_train`, `X_test`: Text data for train/test splits
- `y_train`, `y_test`: Binary labels (0=Fake, 1=True)
- `vectorizer`: TF-IDF transformer fitted on training data
- `X_train_tfidf`, `X_test_tfidf`: Sparse TF-IDF feature matrices
- `modellr`, `modeldt`, `rf`: Baseline models (LogReg, DecisionTree, RandomForest)
- `model`, `tokenizer`, `trainer`: DistilBERT components
- `ensemble_proba`, `ensemble_pred`: Combined predictions from transformer + RF

# Code Examples

## Loading Saved Models
```python
from joblib import load
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load baseline model
baseline = load('best_baseline_model.joblib')
vectorizer = load('tfidf_vectorizer.joblib')

# Load transformer
model = AutoModelForSequenceClassification.from_pretrained('./distilbert_fakenews')
tokenizer = AutoTokenizer.from_pretrained('./distilbert_fakenews')
```

## Making Predictions on New Text
```python
import torch

new_text = "Breaking: Scientists discover new planet in solar system"

# Clean text (use the clean_text function defined in the notebook)
cleaned = clean_text(new_text)

# Baseline prediction on the TF-IDF features
text_tfidf = vectorizer.transform([cleaned])
baseline_pred = baseline.predict(text_tfidf)[0]

# Transformer prediction on the raw text (inference mode, no gradients)
model.eval()
inputs = tokenizer(new_text, return_tensors='pt', truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
transformer_pred = torch.argmax(outputs.logits, dim=1).item()

print(f"Baseline: {'Fake' if baseline_pred == 0 else 'True'}")
print(f"Transformer: {'Fake' if transformer_pred == 0 else 'True'}")
```

# Common Issues & Solutions

## Issue: "ModuleNotFoundError: No module named 'googleapiclient'"
**Solution**: The import name is `googleapiclient`, but the pip package is named `google-api-python-client`:
```bash
pip install google-api-python-client
```

## Issue: "RuntimeError: MPS backend out of memory"
**Solution**: Reduce batch size in TrainingArguments:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=8,  # Reduce from 16
    per_device_eval_batch_size=8,
    # ... keep the remaining arguments unchanged
)
```

## Issue: "Perfect test accuracy (100%)"
**Solution**: This usually indicates data leakage. Check:
1. Duplicates across train and test: `train_hashes & test_hashes` should be an empty set
2. Publisher names in features: add them to `custom_stopwords`
3. Temporal leakage: use a chronological split if the data has timestamps
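
The first check can be sketched with the standard library alone. The sample texts and the `md5_hash` helper are illustrative; in the notebook, the hashes would be computed on the same cleaned text used for deduplication.

```python
import hashlib

def md5_hash(text: str) -> str:
    """MD5 hex digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

train_texts = ["senate passes budget", "budget vote delayed", "miracle cure found"]
test_texts = ["budget vote delayed", "new planet discovered"]

train_hashes = {md5_hash(t) for t in train_texts}
test_hashes = {md5_hash(t) for t in test_texts}

# Any hash present in both sets is an article leaked across the split.
overlap = train_hashes & test_hashes
leaked = [t for t in test_texts if md5_hash(t) in overlap]

print(f"{len(overlap)} leaked article(s): {leaked}")
```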

# Performance Metrics Interpretation

## Baseline Models (Expected Ranges)
- **Logistic Regression**: Accuracy ~92-95%, F1 ~0.92-0.95, AUC ~0.95-0.98
- **Decision Tree**: Accuracy ~88-92%, F1 ~0.88-0.92, AUC ~0.90-0.94
- **Random Forest**: Accuracy ~93-96%, F1 ~0.93-0.96, AUC ~0.96-0.99

## Transformer Models (Expected Ranges)
- **DistilBERT (standard)**: Accuracy ~97-99%, F1 ~0.97-0.99, AUC ~0.99+
- **DistilBERT (2-stage)**: Accuracy ~98-99.5%, F1 ~0.98-0.995, AUC ~0.995+

## Ensemble (Expected Improvement)
- Typically +0.5-1% accuracy over best individual model
- More robust to adversarial examples and edge cases
- Better calibrated probabilities for uncertainty estimation

## Red Flags
- **Accuracy > 99.5%**: Possible data leakage
- **Train accuracy >> Test accuracy**: Overfitting (reduce max_features, increase regularization)
- **F1 << Accuracy**: Class imbalance (use balanced class weights)
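- **AUC < Accuracy**: Scores rank examples poorly even though the 0.5 threshold classifies well; inspect the predicted probabilities (CalibratedClassifierCV can help if they are poorly calibrated)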
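
These checks could be automated with a small helper such as the sketch below. The function name and the `gap_tol` threshold of 0.05 used to interpret ">>" and "<<" are assumptions; only the 99.5% accuracy cutoff comes from the list above.

```python
def find_red_flags(train_acc, test_acc, f1, auc, gap_tol=0.05):
    """Return a list of warnings for suspicious metric combinations.

    All metrics are fractions in [0, 1]. gap_tol is a hypothetical
    tolerance for what counts as a large gap between two metrics.
    """
    flags = []
    if test_acc > 0.995:
        flags.append("accuracy > 99.5%: possible data leakage")
    if train_acc - test_acc > gap_tol:
        flags.append("train accuracy >> test accuracy: overfitting")
    if test_acc - f1 > gap_tol:
        flags.append("F1 << accuracy: class imbalance")
    if auc < test_acc:
        flags.append("AUC < accuracy: inspect probability scores")
    return flags

print(find_red_flags(train_acc=0.99, test_acc=0.90, f1=0.90, auc=0.95))
```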

# Project Structure

```
FakeScope/
├── Development.ipynb              # Main training notebook
├── LLM_Pipeline.ipynb            # Production inference pipeline
├── guide.ipynb                   # This guide
├── requirements.txt              # Python dependencies
├── .gitignore                    # Git exclusions
├── datasets/
│   └── input/
│       ├── alt/News.csv
│       └── alt 2/New Task.csv
├── best_baseline_model.joblib    # Saved baseline (created after training)
├── tfidf_vectorizer.joblib       # Saved vectorizer
├── distilbert_fakenews/          # Saved transformer model
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── tokenizer.json
│   └── vocab.txt
├── distilbert_news_adapted/      # Domain-adapted model (2-stage)
├── results/                      # Training checkpoints
├── mlm_results/                  # MLM training outputs
└── factcheck_cache.json          # Fact-check API cache
```