# 📚 Text Authenticity Detection - Training Notebook

## Welcome to Your First NLP Machine Learning Project!

In this notebook, we'll learn how to build a machine learning model to detect fake text. This is a practical introduction to:
- **Natural Language Processing (NLP)**
- **Feature Engineering**
- **Machine Learning Classification**
- **Model Evaluation**

---

## 🎯 The Problem

**Goal**: Given two text articles, determine which one is real and which one is AI-generated (fake).

**Why is this important?**
- Combating misinformation
- Protecting against AI-generated spam
- Understanding how AI-generated text differs from human writing

**What makes this challenging?**
- AI text is getting very sophisticated
- Subtle differences in writing style
- Need to extract meaningful features from text

---

## 📖 Learning Path

We'll build our solution in these steps:
1. **Explore the Data** - Understanding what we're working with
2. **Extract Simple Features** - Convert text to numbers
3. **Build a Simple Model** - Start with basic machine learning
4. **Evaluate Performance** - How good is our model?
5. **Improve Features** - Add more sophisticated text analysis
6. **Try Better Models** - Ensemble methods
7. **Advanced Techniques** - For those ready for more!

Let's begin! 🚀

## Step 1: Import Libraries and Setup

First, let's import the tools we'll need. Don't worry if you don't recognize all of these - we'll explain them as we go!

In [None]:
# Basic data manipulation
import pandas as pd  # For working with data tables
import numpy as np   # For numerical operations
import os           # For file operations

# Text processing
import re           # For text pattern matching
from collections import Counter  # For counting things

# Visualization - helps us understand our data
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('default')

# Text analysis tools
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from textstat import flesch_reading_ease  # Measures how easy text is to read

# Machine Learning tools
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

# Download required NLTK data
try:
    nltk.data.find('sentiment/vader_lexicon.zip')
except LookupError:
    nltk.download('vader_lexicon')

print("✅ All libraries loaded successfully!")
print("Let's start learning about text authenticity detection!")

## Step 2: Understanding Our Data

Before building any model, we need to understand what data we have. This is called **Exploratory Data Analysis (EDA)**.

### 📁 Data Structure
Our data is organized like this:
```
data/
├── train.csv          # Tells us which text is real for each article
├── train/             # Training text files
│   ├── article_0000/
│   │   ├── file_1.txt # One of these is real...
│   │   └── file_2.txt # ...and one is AI-generated
│   └── article_0001/
│       ├── file_1.txt
│       └── file_2.txt
└── test/              # Test files (we predict these)
```
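
A minimal sketch of walking this layout with `pathlib`, assuming the folder and file names shown above. The helper `list_article_pairs` is hypothetical (it is not defined elsewhere in this notebook), and the sketch builds a tiny throwaway copy of the layout so it runs without the real data.

```python
from pathlib import Path
import tempfile

def list_article_pairs(train_dir):
    """Return (article_name, file_1_path, file_2_path) for each complete article folder."""
    pairs = []
    for article_dir in sorted(Path(train_dir).glob("article_*")):
        file_1 = article_dir / "file_1.txt"
        file_2 = article_dir / "file_2.txt"
        # Skip folders that are missing either file
        if file_1.is_file() and file_2.is_file():
            pairs.append((article_dir.name, file_1, file_2))
    return pairs

# Build a small temporary copy of the layout (placeholder text, not real data)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("article_0000", "article_0001"):
        folder = Path(tmp) / "train" / name
        folder.mkdir(parents=True)
        (folder / "file_1.txt").write_text("first text", encoding="utf-8")
        (folder / "file_2.txt").write_text("second text", encoding="utf-8")

    pairs = list_article_pairs(Path(tmp) / "train")
    pair_names = [name for name, _, _ in pairs]
    print(pair_names)
```

Checking that both files exist before using a folder avoids surprises later when the texts are actually read.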

In [None]:
# Set up our file paths
# 🔧 MODIFY THESE PATHS for your data location
BASE_PATH = '/kaggle/input/fake-or-real-the-impostor-hunt/data'
TRAIN_CSV = f'{BASE_PATH}/train.csv'
TRAIN_DIR = f'{BASE_PATH}/train'
TEST_DIR = f'{BASE_PATH}/test'

# For reproducible results
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print(f"📁 Looking for data in: {BASE_PATH}")

# Let's see what we have
if os.path.exists(TRAIN_CSV):
    print("✅ Found training labels file")
else:
    print("❌ Training labels file not found - check your paths!")
    
if os.path.exists(TRAIN_DIR):
    print("✅ Found training text files")
else:
    print("❌ Training directory not found - check your paths!")

### 🔍 Let's Look at the Labels

The CSV file tells us which text (file_1.txt or file_2.txt) is the real one for each article.
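
To make the label encoding concrete, here is a small sketch that uses made-up rows in the same two-column shape (`id`, `real_text_id`). The values and the derived column names `real_file` and `fake_file` are illustrative assumptions, not part of the dataset.

```python
import io
import pandas as pd

# Hypothetical rows shaped like train.csv (values invented for illustration)
csv_text = "id,real_text_id\n0,1\n1,2\n2,1\n"
toy_labels = pd.read_csv(io.StringIO(csv_text))

# real_text_id is 1 or 2, so the fake file is always "the other one": 3 - real_text_id
toy_labels["real_file"] = "file_" + toy_labels["real_text_id"].astype(str) + ".txt"
toy_labels["fake_file"] = "file_" + (3 - toy_labels["real_text_id"]).astype(str) + ".txt"

print(toy_labels)
```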

In [None]:
# Load the labels
if os.path.exists(TRAIN_CSV):
    labels_df = pd.read_csv(TRAIN_CSV)
    
    print(f"📊 We have labels for {len(labels_df)} articles")
    print("\nFirst few rows:")
    print(labels_df.head())
    
    print("\n🎯 What the columns mean:")
    print("• 'id': Article number")
    print("• 'real_text_id': Which file is real (1 = file_1.txt, 2 = file_2.txt)")
    
    # Let's see the distribution
    real_text_counts = labels_df['real_text_id'].value_counts()
    print(f"\n📈 Distribution of real texts:")
    print(f"file_1.txt is real: {real_text_counts.get(1, 0)} times")
    print(f"file_2.txt is real: {real_text_counts.get(2, 0)} times")
    
    # Is our dataset balanced?
    balance_ratio = min(real_text_counts) / max(real_text_counts)
    if balance_ratio > 0.8:
        print("✅ Dataset is well balanced!")
    else:
        print("⚠️ Dataset is imbalanced - we might need to handle this")
else:
    print("❌ Cannot analyze labels - file not found")

### 📚 Reading the Actual Text Files

Now let's write a function to load the actual text content. This is a common pattern in data science - write reusable functions!

In [None]:
def load_text_data(train_dir, train_csv_path):
    """
    Load text data and combine with labels.
    
    This function demonstrates:
    - File handling
    - Data organization
    - Error handling (try/except)
    """
    print(f"📖 Loading text data...")
    
    # Load the labels first
    if not os.path.exists(train_csv_path):
        print(f"❌ Labels file not found: {train_csv_path}")
        return None
    
    train_df = pd.read_csv(train_csv_path)
    print(f"📊 Found labels for {len(train_df)} articles")
    
    # Now load the text files
    texts_data = []
    successful_loads = 0
    
    for idx, row in train_df.iterrows():
        article_id = row['id']
        real_text_id = row['real_text_id']
        
        # Build file paths
        article_dir = os.path.join(train_dir, f"article_{article_id:04d}")
        text_1_path = os.path.join(article_dir, "file_1.txt")
        text_2_path = os.path.join(article_dir, "file_2.txt")
        
        # Try to read both files
        try:
            with open(text_1_path, 'r', encoding='utf-8') as f:
                text_1 = f.read().strip()
            with open(text_2_path, 'r', encoding='utf-8') as f:
                text_2 = f.read().strip()
            
            # Organize the data
            texts_data.append({
                'article_id': article_id,
                'text_1': text_1,
                'text_2': text_2,
                'real_text_id': real_text_id,
                'real_text': text_1 if real_text_id == 1 else text_2,
                'fake_text': text_2 if real_text_id == 1 else text_1
            })
            successful_loads += 1
            
        except Exception as e:
            if successful_loads == 0:  # Only report failures until the first successful load
                print(f"⚠️ Couldn't load article {article_id}: {e}")
    
    if successful_loads == 0:
        print("❌ No text files loaded successfully")
        return None
        
    df = pd.DataFrame(texts_data)
    print(f"✅ Successfully loaded {len(df)} text pairs")
    
    if len(df) < len(train_df):
        print(f"⚠️ Only loaded {len(df)}/{len(train_df)} articles")
    
    return df

# Load our data
text_df = load_text_data(TRAIN_DIR, TRAIN_CSV)

if text_df is not None:
    print(f"\n📊 Final dataset shape: {text_df.shape}")
    print(f"Columns: {list(text_df.columns)}")
else:
    print("❌ Failed to load data - check your file paths!")

### 👀 Let's Look at Some Examples

The best way to understand text data is to read some examples!

In [None]:
if text_df is not None and len(text_df) > 0:
    # Pick an example to inspect (change sample_idx to view a different pair)
    sample_idx = 0  # Let's start with the first one
    sample = text_df.iloc[sample_idx]
    
    print(f"🔍 Example Article {sample['article_id']}")
    print(f"Real text is: file_{sample['real_text_id']}.txt")
    print("=" * 50)
    
    print("📄 REAL TEXT (first 300 characters):")
    print(sample['real_text'][:300] + "...\n")
    
    print("🤖 FAKE TEXT (first 300 characters):")
    print(sample['fake_text'][:300] + "...\n")
    
    print("🤔 DISCUSSION QUESTIONS:")
    print("• Can you spot any differences in writing style?")
    print("• Which text seems more natural to you?")
    print("• What patterns might help a computer tell them apart?")
    
    # Basic statistics
    print(f"\n📏 Basic Stats:")
    print(f"Real text length: {len(sample['real_text'])} characters")
    print(f"Fake text length: {len(sample['fake_text'])} characters")
    print(f"Real text words: {len(sample['real_text'].split())} words")
    print(f"Fake text words: {len(sample['fake_text'].split())} words")
    
else:
    print("❌ No data to show examples from")

### 📊 Data Overview

Let's get a high-level view of our dataset. This helps us understand what we're working with.

In [None]:
if text_df is not None and len(text_df) > 0:
    # Calculate basic statistics
    real_lengths = [len(text) for text in text_df['real_text']]
    fake_lengths = [len(text) for text in text_df['fake_text']]
    
    real_word_counts = [len(text.split()) for text in text_df['real_text']]
    fake_word_counts = [len(text.split()) for text in text_df['fake_text']]
    
    print("📊 DATASET OVERVIEW")
    print("=" * 30)
    print(f"Total articles: {len(text_df)}")
    print(f"Real text ID distribution: {text_df['real_text_id'].value_counts().to_dict()}")
    
    print("\n📏 TEXT LENGTHS (characters):")
    print(f"Real text - Average: {np.mean(real_lengths):.0f}, Range: {min(real_lengths)}-{max(real_lengths)}")
    print(f"Fake text - Average: {np.mean(fake_lengths):.0f}, Range: {min(fake_lengths)}-{max(fake_lengths)}")
    
    print("\n📝 WORD COUNTS:")
    print(f"Real text - Average: {np.mean(real_word_counts):.0f}, Range: {min(real_word_counts)}-{max(real_word_counts)}")
    print(f"Fake text - Average: {np.mean(fake_word_counts):.0f}, Range: {min(fake_word_counts)}-{max(fake_word_counts)}")
    
    # Are there obvious differences?
    length_diff = abs(np.mean(real_lengths) - np.mean(fake_lengths))
    word_diff = abs(np.mean(real_word_counts) - np.mean(fake_word_counts))
    
    print("\n🔍 FIRST OBSERVATIONS:")
    if length_diff > 100:
        print(f"• Text lengths differ significantly ({length_diff:.0f} characters on average)")
    else:
        print("• Text lengths are similar - length alone won't distinguish them")
        
    if word_diff > 20:
        print(f"• Word counts differ significantly ({word_diff:.0f} words on average)")
    else:
        print("• Word counts are similar - we need more sophisticated features")
        
    print("\n💡 This tells us we need to look beyond simple length metrics!")
    
else:
    print("❌ No data available for analysis")

---

## 🎓 Checkpoint: What We've Learned So Far

**Key Concepts:**
1. **Exploratory Data Analysis (EDA)** - Understanding your data before modeling
2. **File handling** - Reading data from multiple sources
3. **Data organization** - Structuring data for analysis

**Questions to Consider:**
- What patterns did you notice between real and fake text?
- What features might help distinguish them?
- How would you approach this problem?

---

## 🎉 Congratulations!

You've successfully built a machine learning system for text authenticity detection! 

### 🏆 **What You've Learned:**
- **Data Science Fundamentals**: Loading, exploring, and understanding text data
- **Feature Engineering**: Converting text into meaningful numerical features  
- **Machine Learning**: Training and evaluating classification models
- **Model Interpretation**: Understanding what your model learned
- **Scientific Thinking**: Comparing approaches and drawing conclusions

### 🌟 **Next Steps:**
- **Practice**: Try this approach on other text classification problems
- **Learn More**: Explore advanced NLP techniques (transformers, embeddings)
- **Build Portfolio**: Document this project for future employers
- **Share Knowledge**: Teach others what you've learned!

### 📚 **Resources for Further Learning:**
- **Books**: "Hands-On Machine Learning" by Aurélien Géron
- **Courses**: Fast.ai, Coursera Machine Learning courses
- **Libraries**: Explore spaCy, transformers, scikit-learn documentation
- **Practice**: Kaggle competitions, personal projects

**Remember**: The best way to learn data science is by doing projects like this one. Keep experimenting, keep learning, and most importantly - have fun with it! 🚀

---

*Happy learning, future data scientist!* 📊✨

### 🚀 **Intermediate Challenges**

1. **New Feature Creation**:
   - Design and implement a new feature (e.g., exclamation mark count, average paragraph length)
   - Add it to the feature extraction and test if it improves model performance

2. **Model Improvement**:
   - Try different Random Forest parameters (n_estimators, max_depth)
   - Implement cross-validation to get more reliable performance estimates

3. **Error Analysis**:
   - Find examples where your model makes mistakes
   - Analyze what makes these cases difficult to classify
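
As a possible starting point for the cross-validation challenge, the sketch below runs 5-fold stratified cross-validation with scikit-learn. It uses synthetic data from `make_classification` as a stand-in for the notebook's `X` and `y`, and the parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the notebook's feature matrix and labels
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=42)

forest = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)

# Each fold keeps roughly the same class proportions as the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(forest, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

With a small dataset, the spread across folds is often as informative as the mean: a large standard deviation suggests that a single train/test split could be misleading.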

### 🎓 **Advanced Projects**

1. **Ensemble Methods**:
   - Combine multiple models (e.g., average their predictions)
   - Research and implement a voting classifier

2. **Deep Learning**:
   - Research transformer models (BERT, RoBERTa) for text classification
   - Compare traditional features vs. deep learning embeddings

3. **Real-World Application**:
   - Test your model on completely new text (from news websites, social media)
   - Analyze where it succeeds and fails in the wild
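
For the voting classifier idea, one possible sketch uses scikit-learn's `VotingClassifier` with soft voting, which averages the predicted probabilities of its members. The data here is synthetic and the choice of two members is an assumption made only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

voter = VotingClassifier(
    estimators=[
        # Logistic regression gets its own scaler; the forest does not need one
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
    ],
    voting="soft",  # average predicted probabilities instead of counting votes
)
voter.fit(X_tr, y_tr)

ensemble_acc = voter.score(X_te, y_te)
print(f"Ensemble test accuracy: {ensemble_acc:.3f}")
```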

In [None]:
# EXERCISE WORKSPACE
# Use this cell to experiment and answer the exercises above!

print("💡 EXERCISE WORKSPACE")
print("=" * 25)
print("Use this space to:")
print("• Experiment with different features")
print("• Test your understanding") 
print("• Try the challenges in this notebook")
print()

# Example: Let's explore different articles
if text_df is not None and len(text_df) > 0:
    print("🔍 EXERCISE 1: Explore different articles")
    print("Try changing this number to explore different examples:")
    
    # TRY CHANGING THIS NUMBER (0 to len(text_df)-1)
    explore_idx = 0
    
    if explore_idx < len(text_df):
        sample = text_df.iloc[explore_idx]
        print(f"\nArticle {sample['article_id']}:")
        print(f"Real text length: {len(sample['real_text'])} chars")
        print(f"Fake text length: {len(sample['fake_text'])} chars")
        print(f"Real text preview: {sample['real_text'][:150]}...")
        print(f"Fake text preview: {sample['fake_text'][:150]}...")
        
        # Extract features for comparison
        real_features = {**extract_basic_features(sample['real_text']), **extract_style_features(sample['real_text'])}
        fake_features = {**extract_basic_features(sample['fake_text']), **extract_style_features(sample['fake_text'])}
        
        print("\n📊 Feature comparison for this pair:")
        for feature in ['char_count', 'readability_score', 'sentiment_score']:
            real_val = real_features.get(feature, 0)
            fake_val = fake_features.get(feature, 0)
            print(f"  {feature}: Real={real_val:.3f}, Fake={fake_val:.3f}")

print("\n🎯 YOUR TURN:")
print("1. Change 'explore_idx' to look at different articles")
print("2. Notice patterns between real and fake text")
print("3. Think about what features might help distinguish them")

## 🎯 Hands-On Exercises

### 🔰 **Beginner Exercises**

1. **Feature Analysis**: 
   - Look at the feature importance from your model
   - Pick the top 3 features and explain in plain English what they measure
   - Do these features make intuitive sense for detecting AI text?

2. **Data Exploration**:
   - Try changing the `sample_idx` in the example viewing code
   - Look at 3-4 different article pairs
   - Can you spot patterns that might help distinguish real from fake text?

3. **Model Comparison**:
   - Compare the performance of Logistic Regression vs Random Forest
   - Which model would you choose and why?

In [None]:
def extract_advanced_features(text):
    """
    ADVANCED: Extract more sophisticated text features.
    
    This function computes three simple statistics:
    - Word repetition (average occurrences per distinct word)
    - Complex-word ratio (words with more than two vowel groups)
    - Sentence-length variety (standard deviation divided by the mean)
    """
    if not text or len(text.strip()) == 0:
        return {
            'avg_word_freq': 0,
            'complex_words_ratio': 0,
            'sentence_variety': 0
        }
    
    features = {}
    words = text.lower().split()
    
    # 1. WORD FREQUENCY ANALYSIS
    # Measure how often each distinct word is repeated
    word_counts = Counter(words)
    if words:
        # Average frequency of words (higher = more repetitive)
        features['avg_word_freq'] = sum(word_counts.values()) / len(word_counts)
    else:
        features['avg_word_freq'] = 0
    
    # 2. COMPLEXITY ANALYSIS
    # Count "complex" words (more than 2 syllables - simplified)
    complex_word_count = 0
    for word in words:
        # Simple syllable counting: count vowel groups
        vowels = 'aeiouAEIOU'
        syllables = len([char for i, char in enumerate(word) 
                        if char in vowels and (i == 0 or word[i-1] not in vowels)])
        if syllables > 2:
            complex_word_count += 1
    
    features['complex_words_ratio'] = complex_word_count / len(words) if words else 0
    
    # 3. SENTENCE VARIETY
    # Measure diversity in sentence lengths
    sentences = [s.strip() for s in re.split(r'[.!?]+', text) if s.strip()]
    if len(sentences) > 1:
        sentence_lengths = [len(s.split()) for s in sentences]
        features['sentence_variety'] = np.std(sentence_lengths) / np.mean(sentence_lengths) if np.mean(sentence_lengths) > 0 else 0
    else:
        features['sentence_variety'] = 0
    
    return features

# Demonstrate advanced features (optional - only run if you want to explore)
print("🔬 ADVANCED FEATURES DEMO")
print("=" * 30)
print("These are more sophisticated features you could implement:")
print("\n1. N-gram analysis - Look at word patterns (e.g., 'the quick brown')")
print("2. Part-of-speech ratios - Analyze grammar patterns")
print("3. Named entity analysis - Count people, places, organizations")
print("4. Sentiment progression - How sentiment changes through the text")
print("5. Coherence measures - How well sentences connect to each other")
print("\n💡 Challenge: Try implementing one of these features!")

## 🚀 Advanced Section (Optional)

**Ready for more?** This section introduces advanced concepts for students who want to dive deeper into NLP and machine learning.

### 💡 Advanced Feature Ideas

Here are some sophisticated features you could add to improve performance:
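
For instance, N-gram analysis can start as simply as counting repeated word pairs. The sketch below is only a starting point under stated assumptions: `extract_bigram_features` and its two feature names are hypothetical and do not appear elsewhere in this notebook, and splitting on whitespace is a deliberately crude tokenizer.

```python
from collections import Counter

def extract_bigram_features(text):
    """Hypothetical helper: summarise how often adjacent word pairs repeat."""
    words = text.lower().split()
    bigrams = list(zip(words, words[1:]))  # adjacent word pairs
    if not bigrams:
        return {'unique_bigram_ratio': 0.0, 'max_bigram_count': 0}

    counts = Counter(bigrams)
    return {
        # 1.0 means no word pair is ever repeated
        'unique_bigram_ratio': len(counts) / len(bigrams),
        # How many times the most frequent pair occurs
        'max_bigram_count': max(counts.values()),
    }

demo = extract_bigram_features("the cat sat on the mat and the cat slept")
print(demo)
```

Because the function returns a dictionary, its output could be merged with the other feature dictionaries using the same `{**a, **b}` pattern used elsewhere in the notebook.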

---

## 🎓 Learning Checkpoint: What We've Accomplished

**Congratulations!** You've just built your first text classification system! Let's review what we learned:

### ✅ **Key Concepts Mastered:**
1. **Exploratory Data Analysis** - Understanding your data before modeling
2. **Feature Engineering** - Converting text into numerical features
3. **Train/Test Split** - Properly evaluating model performance
4. **Model Training** - Using machine learning algorithms
5. **Model Comparison** - Comparing different approaches
6. **Model Interpretation** - Understanding what the model learned

### 🧠 **Data Science Skills Developed:**
- File handling and data loading
- Text preprocessing and analysis
- Statistical comparison of groups
- Machine learning pipeline creation
- Performance evaluation and interpretation

### 🤔 **Discussion Questions:**
1. Which features were most important for distinguishing real from fake text?
2. Why might Random Forest perform differently than Logistic Regression?
3. What other features could we add to improve performance?
4. How could we test our model on completely new types of text?

---

In [None]:
if 'X_train' in locals() and X_train is not None:
    print("🌳 Training Random Forest model...")
    
    # Random Forest doesn't need scaled features (unlike Logistic Regression)
    rf_model = RandomForestClassifier(
        n_estimators=100,        # Number of trees in the forest
        random_state=RANDOM_STATE,
        max_depth=10,           # Prevent overfitting
        min_samples_split=5     # Prevent overfitting
    )
    
    # Train the model
    rf_model.fit(X_train, y_train)
    
    # Make predictions
    rf_train_pred = rf_model.predict(X_train)
    rf_test_pred = rf_model.predict(X_test)
    
    # Calculate accuracies
    rf_train_acc = accuracy_score(y_train, rf_train_pred)
    rf_test_acc = accuracy_score(y_test, rf_test_pred)
    
    print("✅ Random Forest training completed!")
    
    print(f"\n📊 MODEL COMPARISON:")
    print("=" * 40)
    print(f"{'Model':<20} {'Train Acc':<10} {'Test Acc':<10}")
    print("-" * 40)
    if 'train_accuracy' in locals():
        print(f"{'Logistic Regression':<20} {train_accuracy:<10.3f} {test_accuracy:<10.3f}")
    print(f"{'Random Forest':<20} {rf_train_acc:<10.3f} {rf_test_acc:<10.3f}")
    
    # Determine which model is better
    print(f"\n🏆 RESULTS:")
    if 'test_accuracy' in locals():
        if rf_test_acc > test_accuracy:
            improvement = (rf_test_acc - test_accuracy) * 100
            print(f"🎉 Random Forest wins! {improvement:.1f} percentage points better!")
        elif rf_test_acc < test_accuracy:
            difference = (test_accuracy - rf_test_acc) * 100
            print(f"🤔 Logistic Regression was better by {difference:.1f} percentage points")
        else:
            print("🤝 Both models perform similarly!")
    else:
        print(f"Random Forest test accuracy: {rf_test_acc:.3f} ({rf_test_acc*100:.1f}%)")
        
    # Check for overfitting
    rf_overfit = rf_train_acc - rf_test_acc
    if rf_overfit > 0.1:
        print(f"⚠️ Random Forest may be overfitting (gap: {rf_overfit:.3f})")
    else:
        print("✅ Random Forest shows good generalization")
        
else:
    print("❌ Cannot train Random Forest - no training data available")

### 🌳 Let's Try a More Powerful Model: Random Forest

Random Forest is an **ensemble method** - it combines many decision trees to make better predictions. Let's see if we can improve our results!
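
To make "combines many decision trees" concrete, the sketch below checks on synthetic data that a scikit-learn forest's predicted probabilities equal the average of its individual trees' probabilities. The dataset and the number of trees are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the competition data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# Ask every tree for its class probabilities on a few rows, then average them
tree_probs = np.mean(
    [tree.predict_proba(X_demo[:5]) for tree in forest.estimators_], axis=0
)
forest_probs = forest.predict_proba(X_demo[:5])

matches = np.allclose(tree_probs, forest_probs)
print(f"Forest output equals the average of its trees: {matches}")
```

Each tree is trained on a different bootstrap sample and considers random subsets of features, so individual trees disagree; averaging them smooths out those individual errors.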

In [None]:
if 'model' in locals() and model is not None:
    # Get feature importance from model coefficients
    feature_importance = pd.DataFrame({
        'feature': X_train.columns,
        'coefficient': model.coef_[0],
        'abs_coefficient': np.abs(model.coef_[0])
    }).sort_values('abs_coefficient', ascending=False)
    
    print("🧠 What our model learned (Feature Importance):")
    print("=" * 60)
    print(f"{'Feature':<25} {'Coefficient':<12} {'Interpretation'}")
    print("-" * 60)
    
    for _, row in feature_importance.head(8).iterrows():
        feature = row['feature']
        coef = row['coefficient']
        
        if coef > 0:
            interpretation = "Higher → text_1 more likely real"
        else:
            interpretation = "Higher → text_2 more likely real"
            
        print(f"{feature:<25} {coef:>+10.3f}  {interpretation}")
    
    print("\n📖 How to read this:")
    print("• Larger absolute coefficient = more important feature")
    print("• Positive coefficient = when this difference is positive, text_1 is more likely real")
    print("• Negative coefficient = when this difference is positive, text_2 is more likely real")
    
    # Show the most important features
    top_features = feature_importance.head(3)['feature'].tolist()
    print(f"\n🎯 Top 3 most important features: {', '.join(top_features)}")
    
    print("\n🤔 DISCUSSION QUESTIONS:")
    print("• Do these important features make intuitive sense?")
    print("• What do they tell us about the differences between real and AI text?")
    print("• How might we improve our features based on these insights?")
    
else:
    print("❌ No trained model available for analysis")

### 🔍 Understanding What Our Model Learned

One of the great things about Logistic Regression is that we can see which features it thinks are most important!
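
A small sketch of why coefficients are readable, using synthetic data in which one feature drives the label and the other is pure noise. The feature names `signal` and `noise` are invented for this example. Comparing coefficient sizes is only meaningful when features are on the same scale, which is why the sketch standardizes them first.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
signal = rng.normal(size=300)   # this feature determines the label
noise = rng.normal(size=300)    # this feature is unrelated to the label
y_demo = (signal + 0.3 * rng.normal(size=300) > 0).astype(int)
X_demo = np.column_stack([signal, noise])

# Standardize so coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X_demo)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y_demo)

signal_coef, noise_coef = clf.coef_[0]
print(f"signal coefficient: {signal_coef:+.3f}")
print(f"noise coefficient:  {noise_coef:+.3f}")
```

The informative feature receives a large positive coefficient while the noise feature stays close to zero, which is the same pattern to look for in the real feature table.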

In [None]:
if 'X_train' in locals() and X_train is not None:
    print("🎓 Training our first model...")
    
    # Step 1: Scale the features (important for logistic regression)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    print("📏 Features scaled (normalized to similar ranges)")
    
    # Step 2: Create and train the model
    model = LogisticRegression(random_state=RANDOM_STATE, max_iter=1000)
    model.fit(X_train_scaled, y_train)
    
    print("✅ Model training completed!")
    
    # Step 3: Make predictions
    train_predictions = model.predict(X_train_scaled)
    test_predictions = model.predict(X_test_scaled)
    
    # Step 4: Calculate accuracy
    train_accuracy = accuracy_score(y_train, train_predictions)
    test_accuracy = accuracy_score(y_test, test_predictions)
    
    print(f"\n📊 MODEL PERFORMANCE:")
    print(f"Training accuracy: {train_accuracy:.3f} ({train_accuracy*100:.1f}%)")
    print(f"Test accuracy: {test_accuracy:.3f} ({test_accuracy*100:.1f}%)")
    
    # Step 5: Interpret the results
    print(f"\n🤔 INTERPRETATION:")
    if test_accuracy > 0.8:
        print("🎉 Excellent! Our model is performing very well!")
    elif test_accuracy > 0.7:
        print("😊 Good performance! There's room for improvement.")
    elif test_accuracy > 0.6:
        print("😐 Moderate performance. We need better features or models.")
    else:
        print("😔 Poor performance. Back to the drawing board!")
    
    # Check for overfitting
    if train_accuracy - test_accuracy > 0.1:
        print("⚠️ Warning: Model might be overfitting (much better on training than test)")
    else:
        print("✅ Good: Similar performance on training and test data")
        
    print(f"\n🎯 Random guessing would give ~50% accuracy. Our model: {test_accuracy*100:.1f}%")
    
else:
    print("❌ Cannot train model - no training data available")

### 🤖 Training Our First Model

Let's start with a simple but effective algorithm: **Logistic Regression**

**Why Logistic Regression?**
- Easy to understand and interpret
- Fast to train
- Good baseline for text classification
- Shows us which features are most important
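
One optional way to package this baseline is a scikit-learn pipeline that bundles scaling and the classifier, so the scaler is always fitted on training data only. This is a sketch on synthetic data and an alternative arrangement, not necessarily how the rest of the notebook is organized.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# fit() scales using training statistics; score() reuses them on the test set
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
coefs = baseline.named_steps['logisticregression'].coef_[0]
print(f"Baseline test accuracy: {baseline_acc:.3f}")
print(f"Number of coefficients: {len(coefs)}")
```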

In [None]:
if X is not None and y is not None:
    # Split data: 80% for training, 20% for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, 
        test_size=0.2,           # 20% for testing
        random_state=RANDOM_STATE,  # For reproducible results
        stratify=y               # Keep same proportion of classes in both splits
    )
    
    print("📊 Data Split Results:")
    print(f"Training set: {len(X_train)} samples")
    print(f"Test set: {len(X_test)} samples")
    print(f"Training labels: {np.bincount(y_train)}")
    print(f"Test labels: {np.bincount(y_test)}")
    
    print("\n💡 Why this split matters:")
    print("• Training data: Model learns patterns from this")
    print("• Test data: We evaluate performance on this (model never sees it during training)")
    print("• Stratify: Ensures both sets have similar class distributions")
    
else:
    print("❌ No data available for splitting")

### 🔄 Splitting Our Data

**Why do we split data?** We need to test our model on data it hasn't seen during training. This tells us how well it will perform on new, unseen articles.
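
The sketch below shows what a stratified split does on a toy, deliberately imbalanced label array (70% class 0, 30% class 1). The numbers are invented for illustration and are unrelated to the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 toy samples: 35 of class 0 and 15 of class 1
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([0] * 35 + [1] * 15)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,       # 10 of the 50 samples go to the test set
    random_state=42,
    stratify=y_demo,     # preserve the 70/30 class ratio in both parts
)

print("Train class counts:", np.bincount(y_tr))
print("Test class counts: ", np.bincount(y_te))
```

Without `stratify`, a small test set could end up with very few examples of one class just by chance, which makes the accuracy estimate less trustworthy.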

In [None]:
def prepare_ml_data(text_df):
    """
    Prepare data for machine learning.
    
    For each article pair, we need to create features and a label:
    - Features: Differences between text_1 and text_2 
    - Label: Which text is real (1 = text_1 is real, 0 = text_2 is real)
    """
    print("🛠️ Preparing data for machine learning...")
    
    if text_df is None or len(text_df) == 0:
        print("❌ No data to prepare")
        return None, None
    
    X = []  # Features (input to our model)
    y = []  # Labels (what we want to predict)
    
    for idx, row in text_df.iterrows():
        # Extract features for both texts
        basic_feat_1 = extract_basic_features(row['text_1'])
        style_feat_1 = extract_style_features(row['text_1'])
        
        basic_feat_2 = extract_basic_features(row['text_2']) 
        style_feat_2 = extract_style_features(row['text_2'])

        # Combine features
        feat_1 = {**basic_feat_1, **style_feat_1}
        feat_2 = {**basic_feat_2, **style_feat_2}
        
        # Create difference features (this is key!)
        # Instead of absolute values, we look at differences between the texts
        feature_vector = {}
        for key in feat_1.keys():
            feature_vector[f'{key}_diff'] = feat_1[key] - feat_2[key]
            
        X.append(feature_vector)
        
        # Label: 1 if text_1 is real, 0 if text_2 is real
        label = 1 if row['real_text_id'] == 1 else 0
        y.append(label)
        
        # Show progress
        if (idx + 1) % 10 == 0:
            print(f"  Processed {idx + 1}/{len(text_df)} articles...")
    
    # Convert to a DataFrame (features) and a NumPy array (labels) for scikit-learn
    X_df = pd.DataFrame(X)
    y_array = np.array(y)
    
    print(f"✅ Prepared {len(X_df)} samples with {len(X_df.columns)} features")
    print(f"📊 Class distribution: {np.bincount(y_array)} (0=text_2 real, 1=text_1 real)")
    
    return X_df, y_array

# Prepare our machine learning data
if text_df is not None:
    X, y = prepare_ml_data(text_df)
    
    if X is not None:
        print(f"\n🔍 Feature names: {list(X.columns)}")
        print(f"\n📈 Example feature vector (first sample):")
        for feature, value in X.iloc[0].items():
            print(f"  {feature}: {value:.3f}")
            
        print(f"\n🎯 Corresponding label: {y[0]} ({'text_1 is real' if y[0] == 1 else 'text_2 is real'})")
else:
    print("❌ No text data available")

---

## Step 4: Building Our First Machine Learning Model

Now comes the exciting part - let's use machine learning to automatically classify text as real or fake!

### 🎯 The Machine Learning Process

1. **Prepare the data** - Combine features from both texts in each pair
2. **Split the data** - Keep some data for testing our model
3. **Train the model** - Let the algorithm learn patterns
4. **Evaluate performance** - How accurate is our model?
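
The "prepare the data" step can be sketched with two toy features. Each pair of texts becomes one row of differences (text_1 minus text_2), so swapping the two texts simply flips the sign of every feature. The helpers `toy_features` and `pair_to_row` are hypothetical stand-ins for illustration and are not the notebook's own extraction functions.

```python
def toy_features(text):
    """Two deliberately simple features, for illustration only."""
    words = text.split()
    return {
        'word_count': len(words),
        'avg_word_length': sum(len(w) for w in words) / len(words) if words else 0.0,
    }

def pair_to_row(text_1, text_2):
    """Turn a pair of texts into one row of difference features."""
    feat_1, feat_2 = toy_features(text_1), toy_features(text_2)
    return {f'{key}_diff': feat_1[key] - feat_2[key] for key in feat_1}

text_a = "a short real note"
text_b = "an artificially elongated counterpart sentence"

row = pair_to_row(text_a, text_b)
swapped = pair_to_row(text_b, text_a)

print(row)
print(swapped)
```

This symmetry is why a single model can handle either ordering: the label says which side is real, and the sign of each difference says which side has more of that feature.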

### 📊 Preparing Our Training Data

In [None]:
def extract_style_features(text):
    """
    Extract features that capture writing style and quality.
    These go beyond simple counting to analyze HOW the text is written.
    """
    if not text or len(text.strip()) == 0:
        return {
            'readability_score': 0,
            'sentiment_score': 0,
            'punctuation_ratio': 0,
            'capital_ratio': 0,
            'unique_word_ratio': 0
        }
    
    features = {}
    
    # 1. READABILITY - How easy is the text to read?
    try:
        features['readability_score'] = flesch_reading_ease(text)
    except Exception:  # Fall back to 0 if the readability score cannot be computed
        features['readability_score'] = 0
    
    # 2. SENTIMENT - What's the emotional tone?
    sia = SentimentIntensityAnalyzer()
    sentiment = sia.polarity_scores(text)
    features['sentiment_score'] = sentiment['compound']  # Overall sentiment (-1 to +1)
    
    # 3. PUNCTUATION - How much punctuation is used?
    punctuation_count = sum(1 for char in text if char in '.,!?;:-')
    features['punctuation_ratio'] = punctuation_count / len(text) if len(text) > 0 else 0
    
    # 4. CAPITALIZATION - How much text is capitalized?
    capital_count = sum(1 for char in text if char.isupper())
    features['capital_ratio'] = capital_count / len(text) if len(text) > 0 else 0
    
    # 5. VOCABULARY DIVERSITY - How varied is the word choice?
    words = text.lower().split()
    if words:
        unique_words = len(set(words))
        features['unique_word_ratio'] = unique_words / len(words)
    else:
        features['unique_word_ratio'] = 0
    
    return features

# Test the new features
if text_df is not None and len(text_df) > 0:
    sample_text = text_df.iloc[0]['real_text']
    style_features = extract_style_features(sample_text)
    
    print("🎨 Testing style features:")
    print("Text preview:", sample_text[:100] + "...")
    print("\nStyle features:")
    for feature_name, value in style_features.items():
        print(f"• {feature_name}: {value:.3f}")
        
    print("\n📖 What these features mean:")
    print("• readability_score: Higher = easier to read (typically 0-100, but can fall outside that range)")
    print("• sentiment_score: Emotional tone (-1 = negative, +1 = positive)")
    print("• punctuation_ratio: Fraction of text that's punctuation")
    print("• capital_ratio: Fraction of text that's uppercase")
    print("• unique_word_ratio: Vocabulary diversity (1 = all words unique)")

### 📈 Adding More Sophisticated Features

Basic length features are a good start, but we can do better! Let's add features that capture writing style and quality:

In [None]:
if text_df is not None and len(text_df) > 0:
    # Extract features for all texts
    print("🔄 Extracting features for all texts...")
    
    real_features = []
    fake_features = []
    
    for _, row in text_df.iterrows():  # The row index is not needed here
        real_feat = extract_basic_features(row['real_text'])
        fake_feat = extract_basic_features(row['fake_text'])
        
        real_features.append(real_feat)
        fake_features.append(fake_feat)
    
    # Convert to DataFrames for easy analysis
    real_df = pd.DataFrame(real_features)
    fake_df = pd.DataFrame(fake_features)
    
    print(f"✅ Extracted features for {len(real_df)} real and {len(fake_df)} fake texts")
    
    # Compare averages
    print("\n📊 FEATURE COMPARISON (Averages)")
    print("=" * 50)
    print(f"{'Feature':<20} {'Real Text':<12} {'Fake Text':<12} {'Difference'}")
    print("-" * 50)
    
    for feature in real_df.columns:
        real_avg = real_df[feature].mean()
        fake_avg = fake_df[feature].mean()
        diff = real_avg - fake_avg
        diff_pct = (diff / fake_avg * 100) if fake_avg != 0 else 0
        
        print(f"{feature:<20} {real_avg:<12.1f} {fake_avg:<12.1f} {diff:+6.1f} ({diff_pct:+.1f}%)")
    
    print("\n🤔 INTERPRETATION:")
    print("• Positive difference = Real texts have higher values")
    print("• Negative difference = Fake texts have higher values") 
    print("• Look for large percentage differences!")
    
    # Find the most discriminative features
    differences = []
    for feature in real_df.columns:
        real_avg = real_df[feature].mean()
        fake_avg = fake_df[feature].mean()
        if fake_avg != 0:
            diff_pct = abs((real_avg - fake_avg) / fake_avg * 100)
            differences.append((feature, diff_pct))
    
    differences.sort(key=lambda x: x[1], reverse=True)
    
    print("\n🎯 Most discriminative features:")
    for feature, diff_pct in differences[:3]:
        print(f"• {feature}: {diff_pct:.1f}% difference")
        
else:
    print("❌ No data available for feature extraction")
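
Percentage differences are easy to read, but they ignore how much each feature varies from text to text, and they become unstable when `fake_avg` is close to zero. One common alternative is a standardized difference (a Cohen's d-style effect size), which divides the gap between the two means by the pooled standard deviation. The sketch below is an optional illustration, not part of the notebook's pipeline; `standardized_difference` is a hypothetical helper name and the sample values are made up.

```python
from statistics import mean, stdev

def standardized_difference(real_values, fake_values):
    """Mean gap divided by the pooled standard deviation (Cohen's d style)."""
    n_real, n_fake = len(real_values), len(fake_values)
    if n_real < 2 or n_fake < 2:
        return 0.0  # Not enough data to estimate spread
    pooled_variance = (
        (n_real - 1) * stdev(real_values) ** 2
        + (n_fake - 1) * stdev(fake_values) ** 2
    ) / (n_real + n_fake - 2)
    if pooled_variance == 0:
        return 0.0  # Every value is identical, so there is no spread to scale by
    return (mean(real_values) - mean(fake_values)) / pooled_variance ** 0.5

# Made-up word counts, for illustration only
real_word_counts = [10, 12, 14]
fake_word_counts = [4, 6, 8]
effect = standardized_difference(real_word_counts, fake_word_counts)
print(f"Standardized difference: {effect:.2f}")
```

With the DataFrames from the cell above, the call would look like `standardized_difference(real_df[feature].tolist(), fake_df[feature].tolist())`. As with the percentage view, a positive value means real texts score higher on that feature.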

### 📊 Let's Compare Real vs Fake Features

Now let's see if these basic features can help us distinguish real from fake text:

In [None]:
def extract_basic_features(text):
    """
    Extract simple, interpretable features from text.
    
    Returns a dictionary of feature_name: value pairs.
    """
    if not isinstance(text, str) or len(text.strip()) == 0:
        # Handle empty or missing text (e.g. NaN from a blank CSV cell) gracefully
        return {
            'char_count': 0,
            'word_count': 0, 
            'sentence_count': 0,
            'avg_word_length': 0,
            'avg_sentence_length': 0
        }
    
    features = {}
    
    # Basic counts (split once and reuse the word list)
    words = text.split()
    features['char_count'] = len(text)
    features['word_count'] = len(words)
    
    # Count sentences (simple approach: count '.', '!' and '?').
    # Ellipses, decimals, and abbreviations inflate this count.
    sentence_endings = text.count('.') + text.count('!') + text.count('?')
    features['sentence_count'] = max(1, sentence_endings)  # At least 1 sentence
    
    # Average lengths (words is non-empty here because blank text returned early)
    features['avg_word_length'] = sum(len(word) for word in words) / len(words)
    features['avg_sentence_length'] = features['word_count'] / features['sentence_count']
    
    return features

# Test our function on one example
if text_df is not None and len(text_df) > 0:
    sample_text = text_df.iloc[0]['real_text']
    sample_features = extract_basic_features(sample_text)
    
    print("🧪 Testing our feature extraction:")
    print("Text preview:", sample_text[:100] + "...")
    print("\nExtracted features:")
    for feature_name, value in sample_features.items():
        print(f"• {feature_name}: {value:.1f}")
        
    print("\n💡 These numbers represent measurable properties of the text!")
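
Counting every `.`, `!`, and `?` treats an ellipsis as three sentences and a decimal point as a sentence break. A slightly more careful sketch, assuming we only want runs of end punctuation that are followed by whitespace or the end of the text, might look like this. `count_sentences` is a hypothetical helper, not a function the notebook defines, and it still miscounts abbreviations such as "Dr. Smith".

```python
import re

def count_sentences(text):
    """Count runs of ., ! or ? that are followed by whitespace or the end of the text."""
    endings = re.findall(r'[.!?]+(?=\s|$)', text)
    return max(1, len(endings))  # At least 1 sentence, matching extract_basic_features

sample = "Wait... really?! The value is 3.14. Done."
naive_count = sample.count('.') + sample.count('!') + sample.count('?')
regex_count = count_sentences(sample)

print("Naive count:", naive_count)  # → 8
print("Regex count:", regex_count)  # → 4
```

For this notebook the simple count is usually good enough, since the same bias applies to both the real and the fake text in each pair. The regex version matters more when texts contain many numbers or trailing ellipses.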

### 📏 Simple Features: Length and Structure

Let's start with the most obvious features - how long are the texts and how are they structured?

## Step 3: Feature Engineering - Converting Text to Numbers

**The Challenge**: Machine learning algorithms need numbers, but we have text. How do we convert text into meaningful features?

**Feature Engineering** is the process of extracting measurable properties from raw data. For text, we might look at:
- Length and structure
- Readability and complexity  
- Sentiment and tone
- Writing patterns

Let's start simple and build up our understanding!
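
To make "converting text to numbers" concrete: a model ultimately consumes rows of numbers in a fixed column order, so each text's feature dictionary has to become one row of a table. The sketch below shows that last step in plain Python. `to_matrix` is a hypothetical helper and the toy values are made up; in the notebook itself, `pd.DataFrame(list_of_dicts)` plays this role.

```python
def to_matrix(feature_dicts):
    """Turn a list of feature dictionaries into (column_names, rows of floats)."""
    # Use one sorted column list so every row has the same layout
    columns = sorted({name for features in feature_dicts for name in features})
    rows = [
        [float(features.get(name, 0.0)) for name in columns]  # Missing features become 0.0
        for features in feature_dicts
    ]
    return columns, rows

# Toy feature dictionaries, for illustration only
toy_features = [
    {"char_count": 11, "word_count": 2},
    {"char_count": 24, "word_count": 5, "sentence_count": 1},
]
columns, rows = to_matrix(toy_features)
print(columns)  # → ['char_count', 'sentence_count', 'word_count']
for row in rows:
    print(row)
```

Filling a missing feature with 0.0 is a simplification chosen for this example; whether zero is a sensible default depends on the feature.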