# Chapter 6: Data Validation
## Never Trust User Input

### Course: TCPRG4005 - Secure Programming

---

## Learning Objectives

1. **Understand** why input validation is critical
2. **Implement** proper input sanitization
3. **Prevent** injection attacks (SQL, command, etc.)
4. **Apply** buffer overflow protections
5. **Design** secure data processing pipelines

---

## ⚠️ Golden Rule

> **NEVER TRUST INPUT FROM THE USER**

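Many of the most serious vulnerability classes, including SQL injection, command injection, and cross-site scripting, come from treating user input as trusted, whether implicitly or explicitly.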

In [1]:
# Setup for data validation examples
import re
import html
import subprocess
import sqlite3
from pathlib import Path

print("🛡️ Chapter 6: Data Validation")
print("Securing applications against malicious input")
print("=" * 50)

🛡️ Chapter 6: Data Validation
Securing applications against malicious input
==================================================


# Example 1: SQL Injection Prevention

SQL injection remains one of the most common and dangerous vulnerabilities.

In [2]:
# SQL injection demonstration and prevention
def setup_demo_database():
    """Create a demo database for testing"""
    conn = sqlite3.connect(':memory:')
    cursor = conn.cursor()
    
    # Create users table
    cursor.execute('''
        CREATE TABLE users (
            id INTEGER PRIMARY KEY,
            username TEXT NOT NULL,
            password TEXT NOT NULL,
            email TEXT NOT NULL,
            is_admin BOOLEAN DEFAULT 0
        )
    ''')
    
    # Insert sample data
    # (plaintext passwords are for this demo only; store salted hashes in real systems)
    users = [
        ('alice', 'secret123', 'alice@company.com', 0),
        ('bob', 'hunter2', 'bob@company.com', 0),
        ('admin', 'S3cure!Admin', 'admin@company.com', 1)
    ]
    
    cursor.executemany(
        'INSERT INTO users (username, password, email, is_admin) VALUES (?, ?, ?, ?)',
        users
    )
    conn.commit()
    return conn

# Vulnerable function (DON'T DO THIS!)
def vulnerable_login(username, password, conn):
    """VULNERABLE: String concatenation opens SQL injection"""
    cursor = conn.cursor()
    
    # BAD: Direct string interpolation
    query = f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
    print(f"🔴 Vulnerable query: {query}")
    
    try:
        cursor.execute(query)
        result = cursor.fetchone()
        return result is not None
    except sqlite3.Error as e:
        print(f"SQL Error: {e}")
        return False

# Secure function (DO THIS!)
def secure_login(username, password, conn):
    """SECURE: Parameterized queries prevent injection"""
    cursor = conn.cursor()
    
    # GOOD: Parameterized query
    query = "SELECT * FROM users WHERE username = ? AND password = ?"
    print(f"✅ Secure query: {query}")
    print(f"   Parameters: {username}, {password}")
    
    try:
        cursor.execute(query, (username, password))
        result = cursor.fetchone()
        return result is not None
    except sqlite3.Error as e:
        print(f"SQL Error: {e}")
        return False

# Demonstrate SQL injection
conn = setup_demo_database()

print("SQL Injection Demonstration")
print("-" * 30)

# Normal login attempt
print("Normal login attempt:")
normal_result = vulnerable_login("alice", "secret123", conn)
print(f"Result: {normal_result}\n")

# SQL injection attempt
print("SQL injection attack:")
malicious_input = "admin' OR '1'='1' --"
injection_result = vulnerable_login(malicious_input, "anything", conn)
print(f"Result: {injection_result}")
print("🚨 Attack succeeded! Logged in without password!\n")

# Same attack on secure function
print("Same attack on secure function:")
secure_result = secure_login(malicious_input, "anything", conn)
print(f"Result: {secure_result}")
print("✅ Attack failed! Parameterized query prevented injection")

conn.close()

SQL Injection Demonstration
------------------------------
Normal login attempt:
🔴 Vulnerable query: SELECT * FROM users WHERE username = 'alice' AND password = 'secret123'
Result: True

SQL injection attack:
🔴 Vulnerable query: SELECT * FROM users WHERE username = 'admin' OR '1'='1' --' AND password = 'anything'
Result: True
🚨 Attack succeeded! Logged in without password!

Same attack on secure function:
✅ Secure query: SELECT * FROM users WHERE username = ? AND password = ?
   Parameters: admin' OR '1'='1' --, anything
Result: False
✅ Attack failed! Parameterized query prevented injection
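
Placeholders can only bind values, not identifiers such as column names, so a user-chosen sort column needs a different defense. The sketch below shows one possible approach. The `list_users` helper and the `ALLOWED_SORT_COLUMNS` set are hypothetical names introduced for illustration, not part of the chapter's code. The column name is checked against a fixed allowlist before it is interpolated, while the row limit is still bound as a parameter.

```python
import sqlite3

# Hypothetical allowlist: the only column names a caller may sort by
ALLOWED_SORT_COLUMNS = {"username", "email"}

def list_users(conn, sort_by="username", limit=10):
    """Return (username, email) rows sorted by an allowlisted column."""
    if sort_by not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"Unsupported sort column: {sort_by!r}")
    # The identifier comes from our own fixed set; the value is bound with '?'
    query = f"SELECT username, email FROM users ORDER BY {sort_by} LIMIT ?"
    return conn.execute(query, (limit,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("bob", "bob@company.com"), ("alice", "alice@company.com")],
)

print(list_users(conn, "email", 1))  # [('alice', 'alice@company.com')]

try:
    list_users(conn, "email; DROP TABLE users")
except ValueError as exc:
    print(f"Rejected: {exc}")
```

Rejecting anything outside the allowlist means the attacker's string never reaches the SQL text at all.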


# Example 2: Command Injection Prevention

Passing user input to a shell is extremely dangerous: metacharacters such as `;`, `|`, and `&&` let an attacker append commands of their own.

In [None]:
# Command injection demonstration
def vulnerable_ping(hostname):
    """VULNERABLE: Direct command execution"""
    command = f"ping -c 1 {hostname}"
    print(f"🔴 Executing: {command}")
    
    try:
        # DON'T DO THIS - vulnerable to command injection
        result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=5)
        return result.stdout
    except subprocess.TimeoutExpired:
        return "Command timed out"
    except Exception as e:
        return f"Error: {e}"

def secure_ping(hostname):
    """SECURE: Input validation and safe execution"""
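    # Validate hostname format: letters, digits, dots, and hyphens only.
    # fullmatch() rejects a trailing newline, and requiring an alphanumeric
    # first character keeps input like "-f" from being parsed as a ping option.
    hostname_pattern = r'[a-zA-Z0-9][a-zA-Z0-9.-]*'
    if not re.fullmatch(hostname_pattern, hostname):
        return "Invalid hostname format"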
    
    # Limit hostname length
    if len(hostname) > 253:  # Max DNS hostname length
        return "Hostname too long"
    
    # Use list form (no shell=True) and sanitized input
    command = ["ping", "-c", "1", hostname]
    print(f"✅ Executing: {command}")
    
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=5)
        return result.stdout if result.returncode == 0 else "Ping failed"
    except subprocess.TimeoutExpired:
        return "Command timed out"
    except Exception as e:
        return f"Error: {e}"

print("\nCommand Injection Demonstration")
print("-" * 35)

# Normal usage
print("Normal ping:")
normal_result = secure_ping("8.8.8.8")
print(f"Result: {normal_result}\n")

# Malicious input
print("Command injection attempt:")
# A harmless payload stands in for something destructive such as "rm -rf /"
malicious_command = "8.8.8.8; echo INJECTED"
print(f"Malicious input: {malicious_command}")

print("\nVulnerable function:")
# With shell=True the shell runs everything after the ';' as a second command.
# Never run this demo with a destructive payload: it would really execute.
vulnerable_result = vulnerable_ping(malicious_command)
print(vulnerable_result)
print("🚨 Vulnerable function executed the injected command!")

print("\nSecure function:")
secure_result = secure_ping(malicious_command)
print(f"Result: {secure_result}")
print("✅ Secure function blocked malicious input")
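
When the expected input has a well-defined type, parsing it with a dedicated library is usually stricter than a hand-written regex. The sketch below assumes the tool only needs to accept IPv4 literals (an assumption, since the chapter's `secure_ping` also accepts hostnames) and uses the standard library's `ipaddress` module. `build_ping_argv` is a hypothetical helper that only builds the argument list and does not run anything.

```python
import ipaddress

def build_ping_argv(host):
    """Return an argv list for ping, or raise ValueError for non-IPv4 input."""
    if not isinstance(host, str):
        raise ValueError("Host must be a string")
    # IPv4Address() raises a ValueError subclass for anything that is not
    # a dotted-quad literal, including shell metacharacters and option flags
    address = ipaddress.IPv4Address(host)
    # List form: each element is passed to the program as a single argument
    return ["ping", "-c", "1", str(address)]

for candidate in ["8.8.8.8", "8.8.8.8; echo INJECTED", "-f"]:
    try:
        print(build_ping_argv(candidate))
    except ValueError as exc:
        print(f"Rejected {candidate!r}: {exc}")
```

The returned list can be handed to `subprocess.run` without `shell=True`, as in `secure_ping`.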

# Example 3: Cross-Site Scripting (XSS) Prevention

Web applications must escape untrusted data before writing it into HTML, so that the browser treats it as text rather than markup or script.

In [None]:
# XSS prevention demonstration
def vulnerable_comment_display(user_comment):
    """VULNERABLE: Direct output without escaping"""
    html_output = f"""
    <div class="comment">
        <p>User says: {user_comment}</p>
    </div>
    """
    return html_output

def secure_comment_display(user_comment):
    """SECURE: HTML escaping prevents XSS"""
    # Escape HTML characters
    escaped_comment = html.escape(user_comment)
    
    html_output = f"""
    <div class="comment">
        <p>User says: {escaped_comment}</p>
    </div>
    """
    return html_output

def advanced_sanitization(user_input):
    """Strip known-dangerous patterns (a blocklist), then escape everything.

    Pattern stripping alone is easy to bypass; the final html.escape() call
    is what actually makes the result safe to place in HTML text.
    """
    # Remove script tags
    script_pattern = r'<script[^>]*>.*?</script>'
    cleaned = re.sub(script_pattern, '', user_input, flags=re.IGNORECASE | re.DOTALL)
    
    # Remove javascript: URLs
    js_pattern = r'javascript:[^"\'\s]*'
    cleaned = re.sub(js_pattern, '', cleaned, flags=re.IGNORECASE)
    
    # Remove event handlers (onclick, onload, etc.); the backreference \1
    # makes the closing quote match the opening one
    event_pattern = r'\bon\w+\s*=\s*(["\']).*?\1'
    cleaned = re.sub(event_pattern, '', cleaned, flags=re.IGNORECASE | re.DOTALL)
    
    # Escape remaining HTML
    escaped = html.escape(cleaned)
    
    return escaped

print("\nXSS Prevention Demonstration")
print("-" * 30)

# Normal comment
normal_comment = "I love this website!"
print("Normal comment:")
print(f"Input: {normal_comment}")
print(f"Vulnerable: {vulnerable_comment_display(normal_comment)}")
print(f"Secure: {secure_comment_display(normal_comment)}\n")

# Malicious XSS attempt
xss_payload = "<script>alert('XSS Attack!');</script>"
print("XSS attack attempt:")
print(f"Input: {xss_payload}")
print("Vulnerable output:")
print(vulnerable_comment_display(xss_payload))
print("🚨 Script would execute in browser!\n")

print("Secure output:")
print(secure_comment_display(xss_payload))
print("✅ Script tags escaped, attack prevented\n")

# Advanced XSS attempt
advanced_xss = '<img src="x" onerror="alert(\'Advanced XSS\')">'
print("Advanced XSS attempt:")
print(f"Input: {advanced_xss}")
print(f"Advanced sanitization: {advanced_sanitization(advanced_xss)}")
print("✅ Event handlers removed and HTML escaped")
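
Escaping depends on where the value lands in the page. A URL placed in an `href` attribute needs two checks: the scheme must be acceptable, and the value must be escaped for an attribute context. The sketch below is one way to do this, assuming only absolute `http` and `https` links should be allowed. `render_link` and `ALLOWED_SCHEMES` are hypothetical names introduced for illustration.

```python
import html
from urllib.parse import urlsplit

# Assumption for this sketch: only absolute http(s) links are acceptable
ALLOWED_SCHEMES = {"http", "https"}

def render_link(url, label):
    """Build an <a> tag from untrusted url and label values."""
    scheme = urlsplit(url).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"Disallowed URL scheme: {scheme!r}")
    # quote=True also escapes " and ', which matters inside attribute values
    safe_url = html.escape(url, quote=True)
    safe_label = html.escape(label)
    return f'<a href="{safe_url}">{safe_label}</a>'

print(render_link("https://example.com/?q=1&r=2", "<b>docs</b>"))
# <a href="https://example.com/?q=1&amp;r=2">&lt;b&gt;docs&lt;/b&gt;</a>

try:
    render_link("javascript:alert(1)", "click me")
except ValueError as exc:
    print(f"Rejected: {exc}")
```

Checking the scheme against an allowlist avoids having to enumerate every dangerous scheme, which is the weakness of the regex-stripping approach.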

# Example 4: Input Validation Framework

Build a comprehensive input validation system for different data types.

In [None]:
# Comprehensive input validation framework
class ValidationError(Exception):
    pass

class InputValidator:
    @staticmethod
    def validate_email(email):
        """Validate email format"""
        if not isinstance(email, str):
            raise ValidationError("Email must be a string")
        
        # Normalize before validating so the returned value is what was checked
        email = email.strip().lower()
        
        # Length check first, so oversized input never reaches the regex
        if len(email) > 254:  # RFC 5321 limit
            raise ValidationError("Email too long")
        
        # Basic email regex (simplified); fullmatch() also rejects a trailing newline
        email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
        if not re.fullmatch(email_pattern, email):
            raise ValidationError("Invalid email format")
        
        return email
    
    @staticmethod
    def validate_password(password):
        """Validate password strength"""
        if not isinstance(password, str):
            raise ValidationError("Password must be a string")
        
        if len(password) < 8:
            raise ValidationError("Password must be at least 8 characters")
        
        if len(password) > 128:
            raise ValidationError("Password too long")
        
        # Check for variety of characters
        has_upper = any(c.isupper() for c in password)
        has_lower = any(c.islower() for c in password)
        has_digit = any(c.isdigit() for c in password)
        has_special = any(c in "!@#$%^&*()_+-=[]{}|;:,.<>?" for c in password)
        
        if not (has_upper and has_lower and has_digit and has_special):
            raise ValidationError("Password must contain uppercase, lowercase, digit, and special character")
        
        return password
    
    @staticmethod
    def validate_integer(value, min_val=None, max_val=None):
        """Validate integer with optional bounds"""
        try:
            int_val = int(value)
        except (ValueError, TypeError):
            raise ValidationError("Must be a valid integer")
        
        if min_val is not None and int_val < min_val:
            raise ValidationError(f"Value must be at least {min_val}")
        
        if max_val is not None and int_val > max_val:
            raise ValidationError(f"Value must be no more than {max_val}")
        
        return int_val
    
    @staticmethod
    def validate_filename(filename):
        """Validate filename for security"""
        if not isinstance(filename, str):
            raise ValidationError("Filename must be a string")
        
        # Check for path traversal
        if '..' in filename or filename.startswith('/') or '\\' in filename:
            raise ValidationError("Invalid filename: contains path traversal")
        
        # Check for null bytes
        if '\x00' in filename:
            raise ValidationError("Invalid filename: contains null byte")
        
        # Length check
        if len(filename) > 255:
            raise ValidationError("Filename too long")
        
        # Character whitelist; fullmatch() also rejects a trailing newline,
        # which re.match() with '$' would let through
        allowed_pattern = r'[a-zA-Z0-9._-]+'
        if not re.fullmatch(allowed_pattern, filename):
            raise ValidationError("Filename contains invalid characters")
        
        return filename

# Demonstrate validation framework
validator = InputValidator()

print("\nInput Validation Framework")
print("-" * 30)

# Test cases
test_cases = [
    ("email", "user@example.com", "Valid email"),
    ("email", "invalid-email", "Invalid email"),
    ("password", "Str0ng!Pass", "Strong password"),
    ("password", "weak", "Weak password"),
    ("integer", "42", "Valid integer"),
    ("integer", "not_a_number", "Invalid integer"),
    ("filename", "document.pdf", "Safe filename"),
    ("filename", "../../../etc/passwd", "Malicious filename")
]

for test_type, test_value, description in test_cases:
    try:
        if test_type == "email":
            result = validator.validate_email(test_value)
        elif test_type == "password":
            result = validator.validate_password(test_value)
        elif test_type == "integer":
            result = validator.validate_integer(test_value, 1, 100)
        elif test_type == "filename":
            result = validator.validate_filename(test_value)
        
        print(f"✅ {description}: {result}")
    except ValidationError as e:
        print(f"❌ {description}: {e}")
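
Pattern checks on a filename can be complemented by a check on the final path: resolve it and confirm it still lies inside the intended directory. The sketch below uses `pathlib` for this. The `resolve_inside` helper and the `uploads` directory name are hypothetical, chosen for illustration only.

```python
from pathlib import Path

def resolve_inside(base_dir, filename):
    """Return the resolved path of filename, ensuring it stays under base_dir."""
    base = Path(base_dir).resolve()
    # resolve() collapses '..' segments and follows symlinks
    candidate = (base / filename).resolve()
    try:
        # relative_to() raises ValueError when candidate is not under base
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"{filename!r} escapes the base directory") from None
    return candidate

print(resolve_inside("uploads", "report.pdf").name)

for attempt in ["../../etc/passwd", "/etc/passwd"]:
    try:
        resolve_inside("uploads", attempt)
    except ValueError as exc:
        print(f"Rejected: {exc}")
```

Because the check runs on the resolved path, it also catches absolute paths and symlinks that point outside the base directory, which a character-level check cannot see.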

# Example 5: Buffer Overflow Prevention

Buffer overflows are rare in memory-safe languages such as Python, but they still matter whenever code calls into C libraries or other native code.

In [None]:
# Buffer overflow concepts in Python context
def demonstrate_buffer_concepts():
    """Demonstrate buffer overflow concepts"""
    print("Buffer Overflow Prevention Concepts")
    print("-" * 40)
    
    print("🐍 Python automatically handles memory management,")
    print("   but understanding buffer overflows is important")
    print("   when interfacing with C libraries or other languages.\n")
    
    # Simulate unsafe vs safe string operations
    print("Unsafe patterns (common in C/C++):")
    unsafe_examples = [
        "strcpy(buffer, user_input);  // No bounds checking",
        "sprintf(buffer, \"%s\", data);  // Can overflow",
        "gets(buffer);  // Never use - no size limit",
        "strcat(dest, src);  // Can exceed dest size"
    ]
    
    for example in unsafe_examples:
        print(f"  ❌ {example}")
    
    print("\nSafe alternatives:")
    safe_examples = [
        # strncpy does not add a terminator when the source fills the buffer
        "strncpy(buffer, user_input, sizeof(buffer)-1); buffer[sizeof(buffer)-1] = '\\0';",
        "snprintf(buffer, sizeof(buffer), \"%s\", data);",
        "fgets(buffer, sizeof(buffer), stdin);",
        "strncat(dest, src, sizeof(dest) - strlen(dest) - 1);"
    ]
    
    for example in safe_examples:
        print(f"  ✅ {example}")

def python_length_validation():
    """Show Python's built-in protections and how to add validation"""
    print("\nPython String Length Validation")
    print("-" * 35)
    
    # Python automatically manages memory, but we should still validate lengths
    def safe_string_handler(user_input, max_length=100):
        """Safely handle string input with length limits"""
        if not isinstance(user_input, str):
            raise ValueError("Input must be a string")
        
        if len(user_input) > max_length:
            raise ValueError(f"Input too long: {len(user_input)} > {max_length}")
        
        # Additional validation
        if '\x00' in user_input:  # Null byte
            raise ValueError("Input contains null byte")
        
        return user_input
    
    # Test with normal input
    try:
        result = safe_string_handler("Normal input")
        print(f"✅ Normal input accepted: '{result}'")
    except ValueError as e:
        print(f"❌ Error: {e}")
    
    # Test with too-long input
    try:
        long_input = "A" * 150  # Exceeds limit
        result = safe_string_handler(long_input)
        print(f"✅ Long input accepted: {len(result)} chars")
    except ValueError as e:
        print(f"❌ Long input rejected: {e}")
    
    # Test with null byte
    try:
        null_input = "Bad\x00Input"
        result = safe_string_handler(null_input)
        print("✅ Null byte input accepted")
    except ValueError as e:
        print(f"❌ Null byte input rejected: {e}")

demonstrate_buffer_concepts()
python_length_validation()
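
The length check matters most at the boundary with native code. The sketch below illustrates the idea with the standard library's `ctypes` module, assuming a C function that expects a 16-byte, NUL-terminated buffer. `BUFFER_SIZE` and `copy_to_c_buffer` are hypothetical names, and no real C function is called here.

```python
import ctypes

BUFFER_SIZE = 16  # hypothetical size expected by some C function

def copy_to_c_buffer(data):
    """Copy bytes into a fixed-size C buffer, refusing oversized input."""
    if not isinstance(data, bytes):
        raise TypeError("Data must be bytes")
    # Leave room for the terminating NUL byte
    if len(data) >= BUFFER_SIZE:
        raise ValueError(f"Input too long: {len(data)} >= {BUFFER_SIZE}")
    buf = ctypes.create_string_buffer(BUFFER_SIZE)  # zero-filled buffer
    buf.value = data
    return buf

buf = copy_to_c_buffer(b"hello")
print(buf.value, len(buf.raw))  # b'hello' 16

try:
    copy_to_c_buffer(b"A" * 64)
except ValueError as exc:
    print(f"Rejected: {exc}")
```

Validating before the copy keeps the decision in Python, where it can be reported as an ordinary exception rather than surfacing as memory corruption in native code.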

# Chapter 6 Summary

## Key Principles:

✅ **Never Trust User Input**:
- All input is potentially malicious
- Validate on the server side, not just client side
- Use whitelist validation when possible

✅ **SQL Injection Prevention**:
- Always use parameterized queries
- Never concatenate user input into SQL strings
- Use ORM frameworks or query builders that parameterize queries for you

✅ **Command Injection Prevention**:
- Avoid shell=True in subprocess calls
- Use command lists instead of strings
- Validate and sanitize all parameters

✅ **XSS Prevention**:
- Escape HTML output
- Use Content Security Policy (CSP)
- Sanitize input with whitelist approach

✅ **Input Validation Framework**:
- Validate data type, format, length, and range
- Provide clear error messages
- Fail securely (reject invalid input)

✅ **Buffer Overflow Prevention**:
- Use memory-safe languages when possible
- Always check bounds in C/C++
- Use safe string functions

## Best Practices:

- **Validate early** and reject invalid input immediately
- **Use established libraries** for common validations
- **Log validation failures** for security monitoring
- **Keep validation rules** up to date with threats
- **Test with malicious input** during development
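
Logging rejected input needs some care of its own, because the rejected value is attacker-controlled. The sketch below is one possible shape, using the standard `logging` module. `describe_rejected` and `validate_age` are hypothetical helpers: the value is passed through `repr()` and truncated so that embedded newlines cannot forge extra log lines and oversized input cannot flood the log.

```python
import logging

logger = logging.getLogger("validation")

def describe_rejected(value, max_chars=40):
    """Return a log-safe, bounded description of a rejected value."""
    text = repr(value)  # repr() escapes newlines and other control characters
    if len(text) > max_chars:
        text = text[:max_chars] + "...(truncated)"
    return text

def validate_age(raw):
    """Parse an age, logging a warning for any rejected input."""
    try:
        age = int(raw)
    except (TypeError, ValueError):
        logger.warning("Rejected age input: %s", describe_rejected(raw))
        raise ValueError("Age must be an integer") from None
    if not 0 <= age <= 150:
        logger.warning("Rejected out-of-range age: %s", describe_rejected(raw))
        raise ValueError("Age out of range")
    return age

print(validate_age("42"))  # 42

try:
    validate_age("12\nINFO fake log entry")
except ValueError as exc:
    print(f"Rejected: {exc}")
```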

## Common Mistakes to Avoid:

- Trusting client-side validation only
- Blacklist filtering (use whitelists instead)
- Insufficient input length checking
- Mixing data and commands in the same channel
- Not escaping output properly

> **Remember**: Defense in depth - validate input AND escape output!