    {
      "cell_type": "markdown",
      "metadata": {"language": "markdown"},
      "source": [
        "# FakeScope Development.ipynb - Complete Guide",
        "",
        "## Overview",
        "This notebook implements a fake news detection pipeline that combines TF-IDF baselines (Logistic Regression, Decision Tree, Random Forest), a fine-tuned DistilBERT classifier, and a soft-voting ensemble.",
        "",
        "## Chronology & Architecture",
        "",
        "### Phase 1: Data Loading & Preprocessing (Cells 1-17)",
        "- **Load datasets**: Two CSV files (Fake.csv, True.csv) merged into a unified dataset",
        "- **Label normalization**: Maps various labels to binary (0=Fake, 1=True)",
        "- **Text cleaning**: Removes punctuation, digits, stopwords (NLTK + sklearn + custom)",
        "- **Deduplication**: MD5 content hashes to detect exact duplicates",
        "- **EDA**: WordClouds, class distribution, top word/bigram frequencies",
        "",
        "**Key Configuration**:",
        "```python",
        "TOKEN_PATTERN = r'(?u)\\b\\w\\w+\\b'  # Skip 1-letter tokens",
        "MIN_DF = 5                           # Drop very rare tokens",
        "MAX_DF = 0.90                        # Drop extremely common tokens",
        "NGRAM_RANGE = (1, 2)                 # Unigrams and bigrams",
        "```",
        "",
        "### Phase 2: Train/Test Split (Cells 18-19)",
        "- **GroupShuffleSplit**: Prevents data leakage by grouping duplicate articles",
        "- **25% test split** with random_state=42 for reproducibility",
        "- **Content hash grouping**: Ensures same article variants stay in same split",
        "",
        "### Phase 3: Feature Extraction (Cell 20)",
        "- **TF-IDF vectorization**: max_features=5000, custom stopwords",
        "- **Train-only fitting**: Vectorizer is fit on X_train only, then applied to X_test with transform",
        "- **Output**: Sparse matrices (train: ~75% samples, test: ~25%)",
        "",
        "### Phase 4: Baseline Models (Cells 21-34)",
        "- **Logistic Regression**: Linear baseline with L2 regularization",
        "- **Decision Tree**: Non-linear baseline with max_depth tuning",
        "- **Random Forest**: Ensemble baseline with GridSearchCV hyperparameter tuning",
        "- **Evaluation**: Accuracy, F1, ROC/AUC, confusion matrices, feature importance",
        "- **Best model selection**: Choose by AUC, save to best_baseline_model.joblib",
        "",
        "### Phase 5: Transformer Models (Cells 35-56)",
        "- **DistilBERT fine-tuning**: HuggingFace Transformers with custom tokenization",
        "- **2-stage training** (optional):",
        "  1. **Stage 1**: Masked Language Modeling (MLM) on unlabeled corpus for domain adaptation",
        "  2. **Stage 2**: Fine-tune adapted model on labeled fake news classification",
        "- **Training config**: 3 epochs, batch_size=16, learning_rate=2e-5, MPS (Apple Silicon)",
        "- **Cross-validation**: 5-fold StratifiedKFold for robust evaluation",
        "- **Model persistence**: Save to ./distilbert_fakenews/",
        "",
        "### Phase 6: Ensemble & Error Analysis (Cells 39-42)",
        "- **Soft voting ensemble**:",
        "  ```python",
        "  ensemble_proba = 0.6 * bert_proba + 0.4 * rf_proba",
        "  ensemble_pred = (ensemble_proba > 0.5).astype(int)",
        "  ```",
        "- **Weight rationale**: 60% transformer (semantic context) + 40% RF (robust features)",
        "- **Error analysis**: Identify misclassified examples, show prediction details",
        "- **Attention visualization**: Use bertviz to understand model focus",
        "",
        "### Phase 7: Fact-Checking Integration (Cells 58-70)",
        "- **Google Fact Check API**: Query external fact-checkers for claims",
        "- **Claim extraction**: Use spaCy to extract sentences >10 words",
        "- **Credibility scoring**: Combine model predictions with fact-check verdicts",
        "- **Comment generation**: T5/FLAN-style explanations based on score",
        "",
        "## Key Features & Best Practices",
        "",
        "### Data Leakage Prevention",
        "1. **Content hash grouping**: Prevents train/test contamination from duplicates",
        "2. **Train-only vectorizer fitting**: TF-IDF vocabulary built only from training data",
        "3. **No publisher names in features**: Custom stopwords remove 'reuters', 'ap', etc.",
        "",
        "### Model Reproducibility",
        "- Fixed random_state=42 across all splits and models",
        "- Saved models: best_baseline_model.joblib, tfidf_vectorizer.joblib, distilbert_fakenews/",
        "- Training logs: trainer_state.json tracks loss/accuracy per epoch",
        "",
        "### Performance Optimization",
        "- MPS device support for Apple Silicon GPU acceleration",
        "- Batch processing for transformer inference",
        "- Sparse matrix TF-IDF for memory efficiency",
        "",
        "## Configuration Flags",
        "",
        "### REBUILD_SPLIT_DEDUP (not yet implemented)",
        "Set to True to automatically deduplicate and regenerate train/test splits.",
        "",
        "### REFIT_VECTORIZE_TRAIN_ONLY (not yet implemented)",
        "Set to True to enforce strict train-only vectorizer fitting with validation checks."
      ]
    },

# Practical Usage Guide for Development.ipynb

## Quick Start
1. **Install dependencies**: Install the packages listed in requirements.txt, for example with `pip install -r requirements.txt`.
2. **Select Python interpreter**: Use a virtual environment and select it in VS Code.
3. **Run cells sequentially**: Start from the top, executing each cell in order.
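4. **Configure flags**: REBUILD_SPLIT_DEDUP and REFIT_VECTORIZE_TRAIN_ONLY are listed in the guide as not yet implemented. Until they are, deduplication and train-only vectorizer fitting happen in the regular preprocessing and feature-extraction cells.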
5. **Monitor outputs**: Check printed warnings for data leakage, model performance, and errors.
6. **Save models**: Use provided cells to save/load trained models for reproducibility.
7. **Troubleshooting**: If you encounter errors, use the safe_run wrapper and review error messages. Restart the kernel if needed.

## Example: Data Deduplication
```python
REBUILD_SPLIT_DEDUP = True  # Planned flag (listed as not yet implemented): deduplicate and rebuild the split
# Once the flag is wired up, run the deduplication cell to remove duplicate articles before splitting
```
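Because the flag is listed as not yet implemented, the following is only a minimal sketch of what a deduplication and leakage-safe split step might look like. It assumes a DataFrame with `text` and `label` columns (hypothetical names), hashes normalized text with MD5 as the guide describes, and uses the hash as the group key for `GroupShuffleSplit` with the guide's 25% test size and `random_state=42`. The `content_hash` helper and the toy rows are illustrative, not taken from the notebook.

```python
import hashlib

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def content_hash(text: str) -> str:
    """MD5 of case- and whitespace-normalized text (hypothetical helper)."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Toy rows standing in for the merged Fake.csv / True.csv data.
df = pd.DataFrame(
    {
        "text": [
            "Senate passes budget bill",
            "Senate passes budget bill",   # exact duplicate, dropped below
            "SENATE  passes budget bill",  # variant: same hash after normalization
            "Aliens endorse candidate",
            "Markets rally after rate cut",
            "Miracle cure found in kitchen",
            "Court upholds election result",
            "Celebrity secretly a robot",
        ],
        "label": [1, 1, 1, 0, 1, 0, 1, 0],  # 0=Fake, 1=True
    }
)

# Drop exact duplicates, then hash so remaining variants share a group key.
df = df.drop_duplicates(subset="text").reset_index(drop=True)
df["content_hash"] = df["text"].map(content_hash)

# Split by group: every row with the same hash lands on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(
    splitter.split(df["text"], df["label"], groups=df["content_hash"])
)
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

# Leakage check: no content hash may appear in both splits.
shared = set(train_df["content_hash"]) & set(test_df["content_hash"])
print(f"train={len(train_df)} test={len(test_df)} shared_hashes={len(shared)}")
```

Note that `test_size` applies to groups rather than rows here, so the row-level split is only approximately 75/25 when groups differ in size.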
## Example: Vectorizer Integrity
```python
REFIT_VECTORIZE_TRAIN_ONLY = True  # Planned flag (listed as not yet implemented): fit TF-IDF on training data only
# Once the flag is wired up, run the vectorizer integrity cell to check for test-only tokens
```
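As a rough illustration of what such an integrity check could do, the sketch below fits a `TfidfVectorizer` on training text only, transforms the test text with the same vocabulary, and lists test terms that the fitted vocabulary does not contain. The token pattern, n-gram range, `max_df`, and `max_features` come from the guide. The toy corpus and the relaxed `min_df=1` are assumptions made so the example runs on a few sentences.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # skip 1-letter tokens
NGRAM_RANGE = (1, 2)              # unigrams and bigrams

# Toy corpus (illustrative only).
X_train = [
    "senate passes budget bill",
    "markets rally after rate cut",
    "court upholds election result",
]
X_test = ["senate rejects budget bill", "markets fall after rate hike"]

# The guide's MIN_DF = 5 would prune every term in a corpus this small,
# so this sketch relaxes it to 1; the other settings follow the guide.
vectorizer = TfidfVectorizer(
    token_pattern=TOKEN_PATTERN,
    ngram_range=NGRAM_RANGE,
    max_features=5000,
    min_df=1,
    max_df=0.90,
)
X_train_tfidf = vectorizer.fit_transform(X_train)  # vocabulary is learned here only
X_test_tfidf = vectorizer.transform(X_test)        # test text reuses that vocabulary

# Terms that occur in the test text but not in the fitted vocabulary.
# transform() silently ignores them, which is the intended behavior.
analyzer = vectorizer.build_analyzer()
test_terms = {term for doc in X_test for term in analyzer(doc)}
unseen_terms = test_terms - set(vectorizer.vocabulary_)

# Both matrices must share the training vocabulary's width.
assert X_train_tfidf.shape[1] == X_test_tfidf.shape[1]
print(sorted(unseen_terms))
```

A non-empty `unseen_terms` set is expected and harmless. The leakage symptom would be the opposite: test-only terms showing up inside `vectorizer.vocabulary_`.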
## Example: Ensemble Prediction
```python
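# bert_proba and rf_proba: NumPy arrays of P(label=1) for the same rows, in the same order
ensemble_proba = 0.6 * bert_proba + 0.4 * rf_proba  # 60% DistilBERT, 40% Random Forest
ensemble_pred = (ensemble_proba > 0.5).astype(int)  # 0=Fake, 1=True
# Use ensemble_pred for final predictions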
```
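The formula above needs both inputs to be positive-class probabilities, while a sequence classifier typically returns raw logits. The sketch below shows one way to bridge that gap with NumPy, assuming logits of shape `(n_samples, 2)` with column 1 as the True class and Random Forest probabilities taken from `predict_proba(X)[:, 1]`. The helper names `softmax` and `soft_vote` are hypothetical.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def soft_vote(bert_logits, rf_proba_pos, bert_weight=0.6, threshold=0.5):
    """Weighted average of positive-class probabilities (hypothetical helper)."""
    bert_proba_pos = softmax(np.asarray(bert_logits, dtype=float))[:, 1]
    rf_proba_pos = np.asarray(rf_proba_pos, dtype=float)
    if bert_proba_pos.shape != rf_proba_pos.shape:
        raise ValueError("Both models must score the same rows in the same order")
    ensemble_proba = bert_weight * bert_proba_pos + (1 - bert_weight) * rf_proba_pos
    return ensemble_proba, (ensemble_proba > threshold).astype(int)


# Toy inputs: two articles scored by both models.
bert_logits = [[0.0, 0.0], [2.0, 0.0]]  # softmax gives P(True) = 0.5 and about 0.12
rf_proba_pos = [0.9, 0.6]               # e.g. rf.predict_proba(X_test_tfidf)[:, 1]

proba, preds = soft_vote(bert_logits, rf_proba_pos)
print(preds.tolist())  # [1, 0]
```

In the second row the Random Forest alone would predict True (0.6), but the 60% weight on the transformer pulls the ensemble below the 0.5 threshold.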
## Example: Error Recovery
```python
result = safe_run(trainer.train, error_msg="Training failed", fallback=None)
```
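The notebook's own `safe_run` is not shown in this guide. A wrapper with the same call shape might look like the sketch below, which assumes the helper calls the function, prints `error_msg` with the exception on failure, and returns `fallback`. The real implementation may differ.

```python
from typing import Any, Callable


def safe_run(
    func: Callable[..., Any],
    *args: Any,
    error_msg: str = "Operation failed",
    fallback: Any = None,
    **kwargs: Any,
) -> Any:
    """Call func(*args, **kwargs); on failure, report it and return fallback.

    Hypothetical stand-in for the notebook's helper of the same name.
    """
    try:
        return func(*args, **kwargs)
    except Exception as exc:  # KeyboardInterrupt and SystemExit still propagate
        print(f"{error_msg}: {type(exc).__name__}: {exc}")
        return fallback


# Usage with a function that fails and one that succeeds.
parsed_bad = safe_run(int, "not a number", error_msg="Parse failed", fallback=-1)
parsed_ok = safe_run(int, "42", error_msg="Parse failed", fallback=-1)
print(parsed_bad, parsed_ok)  # -1 42
```

Returning a fallback keeps later cells runnable, but it can also hide a failed training run, so it is worth checking the result for the fallback value before using it.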
## Example: Model Saving/Loading
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Save the fine-tuned model and its tokenizer to the same directory
trainer.save_model('./distilbert_fakenews')
tokenizer.save_pretrained('./distilbert_fakenews')

# Reload both from that directory
loaded_model = AutoModelForSequenceClassification.from_pretrained('./distilbert_fakenews')
loaded_tokenizer = AutoTokenizer.from_pretrained('./distilbert_fakenews')
```
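The baseline artifacts named in the guide (best_baseline_model.joblib and tfidf_vectorizer.joblib) can be saved and reloaded in a similar way with `joblib`. The sketch below is a minimal round trip under stated assumptions: the toy texts are illustrative, `LogisticRegression` stands in for whichever baseline wins on AUC, and a temporary directory is used so the example does not overwrite real artifacts.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (illustrative only); 0=Fake, 1=True.
texts = [
    "senate passes budget bill",
    "court upholds election result",
    "aliens endorse candidate",
    "miracle cure found in kitchen",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(random_state=42).fit(X, labels)

# Save the model and the vectorizer together: a reloaded model is only
# usable with the vocabulary it was trained on.
out_dir = Path(tempfile.mkdtemp())
joblib.dump(model, out_dir / "best_baseline_model.joblib")
joblib.dump(vectorizer, out_dir / "tfidf_vectorizer.joblib")

# Reload and score new text with the restored pair.
loaded_model = joblib.load(out_dir / "best_baseline_model.joblib")
loaded_vectorizer = joblib.load(out_dir / "tfidf_vectorizer.joblib")
new_X = loaded_vectorizer.transform(["senate passes election bill"])
proba_true = loaded_model.predict_proba(new_X)[:, 1]
print(proba_true.shape)  # (1,)
```

Files written by `joblib` are pickles, so they should be loaded only from trusted sources and preferably with the same scikit-learn version that produced them.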
## Troubleshooting Tips
- If you see 'ModuleNotFoundError', check your Python environment and install missing packages with pip.
- If you get memory errors, reduce the transformer batch size (16 in the guide) or the TF-IDF max_features (5000 in the guide).
- For Google Fact Check API errors, check your API key and network connection.
- Restart the kernel if you encounter persistent issues.
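
For the API tip above, it can help to separate key problems from network problems in code. The sketch below queries the Google Fact Check Tools `claims:search` endpoint using only the standard library and raises a distinct error for each failure mode. The environment variable name `GOOGLE_FACTCHECK_API_KEY` and the helper names are assumptions, not taken from the notebook.

```python
import json
import os
import urllib.error
import urllib.parse
import urllib.request

FACT_CHECK_URL = "https://factchecktools.googleapis.com/v1alpha1/claims:search"


def build_request_url(query: str, api_key: str, language: str = "en") -> str:
    """Build the claims:search URL with URL-encoded parameters."""
    params = {"query": query, "key": api_key, "languageCode": language}
    return f"{FACT_CHECK_URL}?{urllib.parse.urlencode(params)}"


def extract_verdicts(payload: dict) -> list:
    """Flatten the response into one record per (claim, review) pair."""
    verdicts = []
    for claim in payload.get("claims", []):
        for review in claim.get("claimReview", []):
            verdicts.append(
                {
                    "claim": claim.get("text", ""),
                    "publisher": review.get("publisher", {}).get("name", ""),
                    "rating": review.get("textualRating", ""),
                }
            )
    return verdicts


def search_fact_checks(query: str, timeout: float = 10.0) -> list:
    """Query the API, mapping each failure mode to a readable error."""
    api_key = os.environ.get("GOOGLE_FACTCHECK_API_KEY")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("Set GOOGLE_FACTCHECK_API_KEY before querying the API")
    url = build_request_url(query, api_key)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return extract_verdicts(json.load(response))
    except urllib.error.HTTPError as exc:  # must precede URLError (its parent class)
        raise RuntimeError(f"API returned HTTP {exc.code}; check the API key") from exc
    except urllib.error.URLError as exc:
        raise RuntimeError(f"Could not reach the API: {exc.reason}") from exc


# Offline check of the parsing step with a hand-written payload.
sample = {
    "claims": [
        {
            "text": "Example claim",
            "claimReview": [
                {"publisher": {"name": "Example Checker"}, "textualRating": "False"}
            ],
        }
    ]
}
print(extract_verdicts(sample))
```

A response with no matching fact checks has no `claims` key, which is why `extract_verdicts` falls back to an empty list instead of raising.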

# Updating requirements.txt

Recommended requirements (pin versions for reproducibility):

```

# Updating .gitignore

Recommended entries for Python/ML projects:

```