# 🕷️ Web Scraping Concepts - Greek Derby RAG Chatbot

## Learning Objectives
By the end of this lesson, you will understand:
- Web scraping fundamentals and best practices
- HTML parsing and CSS selector strategies
- Anti-bot detection and evasion techniques
- Data cleaning and preprocessing methods
- Rate limiting and ethical scraping
- Error handling and resilience patterns
- Integration with RAG systems
- Legal and ethical considerations

---

## Q1: What is web scraping and how do we use it in our Greek Derby chatbot?

**Answer:**

Web scraping is the automated extraction of data from websites. Our Greek Derby RAG chatbot scrapes Gazzetta.gr for up-to-date information about Olympiakos and Panathinaikos, so its knowledge base is refreshed on each scraping run and keeps pace with the latest news.

### What is Web Scraping?

**Web Scraping Definition:**
Web scraping (also called web harvesting or web data extraction) is the process of automatically extracting data from websites using software tools. It involves:

- **HTTP Requests**: Sending requests to web servers
- **HTML Parsing**: Extracting structured data from HTML content
- **Data Processing**: Cleaning and transforming raw data
- **Storage**: Saving data for further use
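
The four stages can be sketched end to end without touching the network. This is a minimal illustration, assuming a hard-coded HTML string in place of the HTTP response and the standard library's `html.parser` in place of BeautifulSoup; `SAMPLE_HTML`, `ParagraphExtractor`, and `run_pipeline` are hypothetical names, not part of the chatbot's code.

```python
from html.parser import HTMLParser

# Stand-in for the body of an HTTP response (stage 1 is assumed to have run).
SAMPLE_HTML = """
<html><body>
  <nav>Home | Football</nav>
  <p>  Ο Ολυμπιακός   νίκησε.  </p>
  <p>Ο Παναθηναϊκός απάντησε.</p>
</body></html>
"""


class ParagraphExtractor(HTMLParser):
    """Stage 2: collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self.paragraphs[-1] += data


def run_pipeline(html):
    parser = ParagraphExtractor()
    parser.feed(html)                                           # stage 2: parse
    cleaned = [" ".join(p.split()) for p in parser.paragraphs]  # stage 3: clean
    return [p for p in cleaned if p]                            # stage 4: store


records = run_pipeline(SAMPLE_HTML)
print(records)
```

Navigation text is ignored because it sits outside any `<p>` element, which is the same idea the class-based strainers later in this lesson apply with BeautifulSoup.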

### Why Web Scraping for Our RAG Chatbot?

**Traditional Knowledge Base Limitations:**
- **Static Content**: Pre-written content becomes outdated
- **Limited Scope**: Cannot cover all possible questions
- **Manual Updates**: Requires constant manual maintenance
- **No Real-time Data**: Cannot provide current information

**Web Scraping Benefits:**
- **Real-time Updates**: Always current information
- **Comprehensive Coverage**: Access to vast amounts of data
- **Automated Updates**: No manual intervention needed
- **Dynamic Content**: Fresh news and statistics

### Our Web Scraping Architecture:

```
┌─────────────────────────────────────────────────────────────┐
│                    Web Scraping Pipeline                    │
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
│  │   Gazzetta  │    │   HTML      │    │   Content   │     │
│  │   .gr URLs  │───▶│   Parser    │───▶│   Processor │     │
│  └─────────────┘    └─────────────┘    └─────────────┘     │
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
│  │   Text      │    │   Vector    │    │   RAG      │     │
│  │   Splitter  │───▶│   Database  │───▶│   System   │     │
│  └─────────────┘    └─────────────┘    └─────────────┘     │
└─────────────────────────────────────────────────────────────┘
```

### Our Scraping Implementation:

#### 1. **Target Websites**
```python
# backend/standalone-service/greek_derby_chatbot.py
greek_derby_urls = [
    "https://www.gazzetta.gr/football/superleague/olympiakos",
    "https://www.gazzetta.gr/football/superleague/panathinaikos", 
    "https://www.gazzetta.gr/football/superleague",
    "https://www.gazzetta.gr",
]
```

**Why These URLs:**
- **Established Source**: Gazzetta.gr is a major Greek sports news site
- **Comprehensive Coverage**: Covers both teams and general football news
- **Regular Updates**: Fresh content published daily
- **Greek Language**: Native Greek content for better understanding

#### 2. **Scraping Libraries Used**
```python
# Web scraping imports
from langchain_community.document_loaders import WebBaseLoader
import bs4
import requests
```

**Library Choices:**
- **LangChain WebBaseLoader**: High-level abstraction for web scraping
- **BeautifulSoup (bs4)**: Powerful HTML parsing library
- **Requests**: Low-level HTTP library for custom requests

#### 3. **Multi-Strategy Scraping Approach**
```python
# backend/scheduler/update_vector_db.py
def load_fresh_content(self) -> List[Document]:
    """Load fresh content from Gazzetta.gr"""
    all_docs = []
    
    for url in self.greek_derby_urls:
        try:
            # Strategy 1: CSS Selector-based scraping
            loader = WebBaseLoader(
                web_paths=(url,),
                bs_kwargs=dict(
                    parse_only=bs4.SoupStrainer(
                        class_=("article-content", "article-title", "article-body", 
                               "content", "post-content", "entry-content", 
                               "post-body", "article-text", "main-content", 
                               "story-content", "article", "post", 
                               "content-area", "main", "body")
                    )
                ),
            )
            docs = loader.load()
            
            # Strategy 2: Fallback without selectors
            if not docs or all(len(doc.page_content.strip()) < 100 for doc in docs):
                loader_fallback = WebBaseLoader(web_paths=(url,))
                docs = loader_fallback.load()
            
            # Strategy 3: Content validation and filtering
            valid_docs = [doc for doc in docs if len(doc.page_content.strip()) > 50]
            all_docs.extend(valid_docs)  # keep results from every URL
            
        except Exception as e:
            self.logger.error(f"Failed to load {url}: {str(e)}")
            continue
    
    return all_docs
```

### Scraping Strategy Breakdown:

#### 1. **Primary Strategy: CSS Selector Targeting**
```python
# Keep only elements whose class attribute matches one of these names
bs4.SoupStrainer(
    class_=("article-content", "article-title", "article-body", 
           "content", "post-content", "entry-content", 
           "post-body", "article-text", "main-content", 
           "story-content", "article", "post", 
           "content-area", "main", "body")
)
```

**Benefits:**
- **Precision**: Extract only relevant content
- **Efficiency**: Skip navigation, ads, and irrelevant elements
- **Clean Data**: Better quality extracted content
- **Performance**: Faster processing with less data

#### 2. **Fallback Strategy: Full Page Scraping**
```python
# If selectors fail, scrape entire page
loader_fallback = WebBaseLoader(web_paths=(url,))
docs = loader_fallback.load()
```

**When to Use:**
- **Selector Failure**: When CSS selectors don't match
- **Site Changes**: When website structure changes
- **New Content Types**: For different article layouts
- **Backup Method**: Ensures we don't miss content
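
The primary-then-fallback pattern can be expressed independently of any scraping library. The sketch below assumes each strategy is a zero-argument callable returning text; `first_adequate` and the three sample strategies are hypothetical names used only for illustration.

```python
def first_adequate(strategies, min_chars=100):
    """Return (name, text) from the first strategy whose output is long enough.

    `strategies` is an ordered list of (name, zero-argument callable) pairs.
    A strategy that raises is treated as a miss, so later ones still run.
    """
    for name, strategy in strategies:
        try:
            text = strategy() or ""
        except Exception:
            continue
        if len(text.strip()) >= min_chars:
            return name, text
    return None, ""


def broken_selectors():
    # Simulates a site redesign that invalidates the selectors
    raise ValueError("selectors no longer match")


strategies = [
    ("selectors", broken_selectors),
    ("sparse", lambda: "menu"),        # too short to count as content
    ("full_page", lambda: "x" * 150),  # long enough to be accepted
]
chosen, text = first_adequate(strategies)
print(chosen, len(text))
```

Returning the strategy name alongside the text makes it easy to log how often the fallback fires, which is an early signal that the selectors need updating.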

#### 3. **Content Validation and Filtering**
```python
# Filter out low-quality content
valid_docs = [doc for doc in docs if len(doc.page_content.strip()) > 50]

# Additional filtering
if valid_docs:
    for doc in valid_docs:
        # Add metadata
        doc.metadata.update({
            'source': url,
            'scraped_at': datetime.now().isoformat(),
            'content_type': 'greek_football_news'
        })
```

**Quality Filters:**
- **Minimum Length**: Remove very short content (50 characters or fewer)
- **Content Type**: Focus on article content, not navigation
- **Language**: Ensure Greek language content
- **Relevance**: Filter for football-related content
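
The length filter is the only one shown in code above. A minimal sketch of how the language and relevance checks might be added, assuming plain strings as input and a made-up list of keyword stems (`FOOTBALL_STEMS` and `passes_quality_filters` are hypothetical, and a real stem list would need tuning against the site):

```python
import re

# Greek and Coptic block plus Greek Extended block
GREEK_LETTER = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]")

# Hypothetical accent-free stems, matched against lowercased text
FOOTBALL_STEMS = ("ολυμπιακ", "παναθην", "γκολ")


def passes_quality_filters(text, min_chars=50):
    """Apply the minimum-length, language, and relevance checks in order."""
    stripped = text.strip()
    if len(stripped) <= min_chars:          # minimum length
        return False
    if not GREEK_LETTER.search(stripped):   # language
        return False
    lowered = stripped.lower()
    return any(stem in lowered for stem in FOOTBALL_STEMS)  # relevance


derby_report = "Ο Ολυμπιακός νίκησε τον Παναθηναϊκό με 2-1 στο ντέρμπι της Κυριακής."
too_short = "Γκολ!"
english_only = "Olympiacos beat Panathinaikos 2-1 in the Athens derby on Sunday evening."
off_topic = "Ο καιρός αύριο θα είναι ηλιόλουστος σε όλη τη χώρα, με υψηλές θερμοκρασίες."

for sample in (derby_report, too_short, english_only, off_topic):
    print(passes_quality_filters(sample))
```

Only the first sample passes: the others fail on length, language, and relevance respectively.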

### Anti-Bot Detection and Evasion:

#### 1. **User Agent Configuration**
```python
import os

# Set a realistic user agent before importing WebBaseLoader, which reads USER_AGENT
os.environ['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
```

**Why User Agent Matters:**
- **Bot Detection**: Some websites block requests that carry a library's default user agent
- **Realistic Simulation**: Mimic real browser behavior
- **Rate Limiting**: Some sites have different limits for different user agents

#### 2. **Request Headers**
```python
# Add realistic headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}
```

#### 3. **Rate Limiting and Delays**
```python
import time
import random

# Add delays between requests
time.sleep(random.uniform(1, 3))  # Random delay between 1-3 seconds

# Respect robots.txt
# Check robots.txt before scraping
```

### Error Handling and Resilience:

#### 1. **Try-Except Blocks**
```python
for url in self.greek_derby_urls:
    try:
        # Scraping logic
        docs = loader.load()
        # Process documents
    except requests.exceptions.RequestException as e:
        self.logger.error(f"Network error for {url}: {str(e)}")
        continue
    except bs4.ParserRejectedMarkup as e:  # raised when the parser cannot handle the markup
        self.logger.error(f"Parsing error for {url}: {str(e)}")
        continue
    except Exception as e:
        self.logger.error(f"Unexpected error for {url}: {str(e)}")
        continue
```

#### 2. **Retry Logic**
```python
import time
from functools import wraps

def retry_on_failure(max_retries=3, delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # out of attempts: re-raise with the original traceback
                    time.sleep(delay * (2 ** attempt))  # Exponential backoff
            return None
        return wrapper
    return decorator

@retry_on_failure(max_retries=3, delay=2)
def scrape_url(url):
    # Scraping logic here
    pass
```
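
To see the backoff schedule without waiting, the same decorator pattern can take the sleep function as a parameter. This is a sketch under that assumption; `retry_with_backoff`, the `sleep` parameter, and `flaky_fetch` are hypothetical additions for illustration.

```python
import time
from functools import wraps


def retry_with_backoff(max_retries=3, delay=1, sleep=time.sleep):
    """Retry decorator with exponential backoff and an injectable sleep."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise
                    sleep(delay * (2 ** attempt))
        return wrapper
    return decorator


recorded_waits = []   # collects the delays instead of sleeping
calls = {"count": 0}


@retry_with_backoff(max_retries=3, delay=2, sleep=recorded_waits.append)
def flaky_fetch():
    """Fail twice, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky_fetch()
print(result, recorded_waits)  # ok [2, 4]
```

With `delay=2`, the waits double after each failure (2 seconds, then 4), and the third attempt succeeds, so no third wait is recorded.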

### Integration with RAG System:

#### 1. **Document Processing**
```python
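# Splitter import (older LangChain releases expose it as langchain.text_splitter)
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Process scraped content for RAG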
def process_scraped_content(self, docs):
    """Process scraped documents for RAG system"""
    
    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100,
        separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""]
    )
    
    # Split documents
    split_docs = text_splitter.split_documents(docs)
    
    # Add metadata
    for doc in split_docs:
        doc.metadata.update({
            'source_type': 'web_scraped',
            'language': 'greek',
            'domain': 'gazzetta.gr'
        })
    
    return split_docs
```
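
To make the `chunk_size=500` and `chunk_overlap=100` settings concrete, here is a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on the listed separators); `chunk_text` is a hypothetical helper that only illustrates how size and overlap interact.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


chunks = chunk_text("x" * 1200)
print([len(c) for c in chunks])  # [500, 500, 400]
```

Each window starts 400 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next. That repetition is what lets a sentence cut at a boundary still be retrieved whole from one of the two chunks.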

#### 2. **Vector Database Integration**
```python
# Store processed content in vector database
def store_in_vector_db(self, processed_docs):
    """Store processed documents in Pinecone"""
    
    # Add to vector store
    self.vector_store.add_documents(processed_docs)
    
    # Log success
    self.logger.info(f"Stored {len(processed_docs)} documents in vector database")
```

### Benefits for Our RAG Chatbot:

1. **Up-to-date Information**: News and updates are refreshed on every scraping run
2. **Comprehensive Knowledge**: Access to a large body of Greek football content
3. **Automated Updates**: Routine refreshes need no manual work, although selectors need attention when the site changes
4. **Dynamic Responses**: Can answer questions about recent events
5. **Language Accuracy**: Native Greek content for better understanding
6. **Scalable Data**: Can easily add more sources
7. **Cost Effective**: No licensing fees for publicly accessible pages, subject to the site's terms of use
8. **Reliable Source**: Gazzetta.gr is a trusted Greek sports news source


## Q2: How do we implement HTML parsing and CSS selector strategies?

**Answer:**

HTML parsing and CSS selector strategies are crucial for extracting relevant content from web pages. Our Greek Derby chatbot uses sophisticated parsing techniques to accurately extract Greek football content while filtering out irrelevant elements like navigation, ads, and scripts.

### HTML Parsing Fundamentals:

#### 1. **BeautifulSoup Integration with LangChain**

```python
# backend/scheduler/update_vector_db.py
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Configure BeautifulSoup parsing
loader = WebBaseLoader(
    web_paths=(url,),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("article-content", "article-title", "article-body", 
                   "content", "post-content", "entry-content", 
                   "post-body", "article-text", "main-content", 
                   "story-content", "article", "post", 
                   "content-area", "main", "body")
        )
    ),
)
```

**Why BeautifulSoup:**
- **Flexible Parsing**: Handles malformed HTML gracefully
- **CSS Selectors**: Powerful element selection capabilities
- **Integration**: Works seamlessly with LangChain
- **Performance**: Efficient parsing of large documents

#### 2. **CSS Selector Strategy**

```python
# Multi-level CSS selector approach
css_selectors = {
    'primary': [
        'article-content',      # Main article content
        'article-title',        # Article headlines
        'article-body',         # Article body text
        'content',              # Generic content area
        'post-content',         # Blog post content
        'entry-content',        # WordPress entry content
        'post-body',            # Post body text
        'article-text',         # Article text content
        'main-content',         # Main content area
        'story-content',        # News story content
    ],
    'secondary': [
        'article',              # Article elements
        'post',                 # Post elements
        'content-area',         # Content area
        'main',                 # Main section
        'body',                 # Body content
    ]
}
```

**Selector Hierarchy:**
1. **Primary Selectors**: Target specific content containers
2. **Secondary Selectors**: Fallback to broader content areas
3. **Full-Page Fallback**: Last resort when neither group matches, loading the page without any selector filter

### Advanced Parsing Techniques:

#### 1. **Content Type Detection**

```python
def detect_content_type(soup):
    """Detect the type of content being parsed"""
    
    # Check for article indicators
    if soup.find('article'):
        return 'article'
    
    # Check for news indicators
    if soup.find(class_=['news', 'story', 'post']):
        return 'news'
    
    # Check for list indicators
    if soup.find(['ul', 'ol']):
        return 'list'
    
    # Default to general content
    return 'general'

# Usage in parsing
content_type = detect_content_type(soup)
if content_type == 'article':
    # Use article-specific selectors
    selectors = ['article-content', 'article-body', 'entry-content']
else:
    # Use general selectors
    selectors = ['content', 'main', 'body']
```

#### 2. **Dynamic Selector Adaptation**

```python
def find_best_selectors(soup, url):
    """Dynamically find the best CSS selectors for a page"""
    
    # Common content patterns
    content_patterns = [
        {'tag': 'div', 'class': 'article-content'},
        {'tag': 'article', 'class': 'post'},
        {'tag': 'div', 'class': 'content'},
        {'tag': 'main', 'class': 'main-content'},
        {'tag': 'section', 'class': 'story'},
    ]
    
    best_selectors = []
    
    for pattern in content_patterns:
        elements = soup.find_all(pattern['tag'], class_=pattern['class'])
        if elements:
            # Check content quality
            total_text = sum(len(elem.get_text().strip()) for elem in elements)
            if total_text > 100:  # Minimum content threshold
                best_selectors.append(pattern)
    
    return best_selectors
```

#### 3. **Content Quality Assessment**

```python
def assess_content_quality(element):
    """Assess the quality of extracted content"""
    
    text = element.get_text().strip()
    
    # Quality metrics
    metrics = {
        'length': len(text),
        'word_count': len(text.split()),
        'paragraph_count': len(element.find_all('p')),
        'link_density': len(element.find_all('a')) / max(len(text.split()), 1),
        'has_headings': bool(element.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])),
        'has_images': bool(element.find_all('img')),
    }
    
    # Calculate quality score
    quality_score = 0
    
    # Length bonus
    if metrics['length'] > 200:
        quality_score += 2
    elif metrics['length'] > 100:
        quality_score += 1
    
    # Word count bonus
    if metrics['word_count'] > 50:
        quality_score += 2
    elif metrics['word_count'] > 20:
        quality_score += 1
    
    # Structure bonus
    if metrics['has_headings']:
        quality_score += 1
    
    # Link density penalty
    if metrics['link_density'] > 0.3:
        quality_score -= 1
    
    return quality_score, metrics
```
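
A worked example helps to see how the bonuses and the penalty combine. The sketch below repeats the scoring rules on precomputed metrics so it runs without BeautifulSoup; `score_from_metrics` and the two metric dictionaries are made up for illustration.

```python
def score_from_metrics(metrics):
    """Apply the same scoring rules as assess_content_quality to a metrics dict."""
    score = 0

    # Length bonus
    if metrics["length"] > 200:
        score += 2
    elif metrics["length"] > 100:
        score += 1

    # Word count bonus
    if metrics["word_count"] > 50:
        score += 2
    elif metrics["word_count"] > 20:
        score += 1

    # Structure bonus
    if metrics["has_headings"]:
        score += 1

    # Link density penalty
    if metrics["link_density"] > 0.3:
        score -= 1

    return score


# Hypothetical values for a full article and for a navigation-style block
article_like = {"length": 1200, "word_count": 180, "has_headings": True, "link_density": 0.02}
menu_like = {"length": 150, "word_count": 25, "has_headings": False, "link_density": 0.8}

print(score_from_metrics(article_like))  # 5
print(score_from_metrics(menu_like))     # 1
```

Scores range from -1 to 5, so a cutoff somewhere in between separates article bodies from link-heavy navigation blocks; the exact threshold is a tuning decision.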

### Greek Language Specific Parsing:

#### 1. **Greek Text Detection**

```python
import re

# Greek and Coptic block plus Greek Extended (polytonic) block
GREEK_PATTERN = re.compile(r'[\u0370-\u03FF\u1F00-\u1FFF]')

def is_greek_text(text):
    """Check if text contains at least one Greek character"""
    return bool(GREEK_PATTERN.search(text))

def greek_ratio(text):
    """Return the fraction of characters in text that are Greek"""
    return len(GREEK_PATTERN.findall(text)) / max(len(text), 1)

def filter_greek_content(docs, min_ratio=0.3):
    """Keep documents whose text is more than min_ratio Greek (30% by default)"""
    # A single Greek character is not enough: mixed pages with mostly
    # non-Greek text are dropped by the ratio check
    return [doc for doc in docs if greek_ratio(doc.page_content) > min_ratio]
```
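
One detail that range-based checks are sensitive to is Unicode normalization: an accented letter such as `ό` can arrive as a single code point or as a base letter followed by a combining accent, and the combining accent falls outside the Greek ranges. The sketch below normalizes to NFC before counting; `greek_share` is a hypothetical helper, and ignoring whitespace in the denominator is an assumption of this example.

```python
import re
import unicodedata

GREEK_LETTER = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]")


def greek_share(text):
    """Fraction of non-whitespace characters that are Greek, after NFC normalization."""
    normalized = unicodedata.normalize("NFC", text)
    visible = [ch for ch in normalized if not ch.isspace()]
    if not visible:
        return 0.0
    greek = sum(1 for ch in visible if GREEK_LETTER.match(ch))
    return greek / len(visible)


composed = unicodedata.normalize("NFC", "Ολυμπιακός")
decomposed = unicodedata.normalize("NFD", composed)

# The decomposed form carries one extra code point (the combining accent)
print(len(composed), len(decomposed))

# After normalization both forms count as fully Greek
print(greek_share(composed), greek_share(decomposed))
```

Without the `normalize` call, the decomposed form would score slightly below 1.0, which can push borderline mixed-language pages under the threshold.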

#### 2. **Greek-Specific Content Selectors**

```python
# Greek news site specific selectors
greek_content_selectors = [
    'article-content',
    'article-title', 
    'article-body',
    'news-content',
    'story-content',
    'post-content',
    'entry-content',
    'content',
    'main-content',
    'article',
    'post',
    'news-item',
    'story',
]

# Greek sports specific selectors
greek_sports_selectors = [
    'match-report',
    'team-news',
    'player-news',
    'transfer-news',
    'match-preview',
    'post-match',
    'analysis',
    'commentary',
]
```

### Error Handling in Parsing:

#### 1. **Parser Error Recovery**

```python
import bs4

def robust_html_parsing(html_content, url):
    """Robust HTML parsing with error recovery"""
    
    # Try each parser in turn; lxml and html5lib are optional installs,
    # and BeautifulSoup raises an exception if a requested parser is missing
    for parser in ('html.parser', 'lxml', 'html5lib'):
        try:
            return bs4.BeautifulSoup(html_content, parser)
        except Exception as e:
            print(f"{parser} failed for {url}: {e}")
    
    # Every parser failed
    return None
```

#### 2. **Content Extraction Fallbacks**

```python
def extract_content_with_fallbacks(soup, url):
    """Extract content with multiple fallback strategies"""
    
    # Strategy 1: CSS selectors
    content = extract_with_selectors(soup)
    if content and len(content.strip()) > 100:
        return content
    
    # Strategy 2: Tag-based extraction
    content = extract_with_tags(soup)
    if content and len(content.strip()) > 100:
        return content
    
    # Strategy 3: Text-based extraction
    content = extract_text_only(soup)
    if content and len(content.strip()) > 100:
        return content
    
    # Strategy 4: Full page text
    content = soup.get_text()
    return content

def extract_with_selectors(soup):
    """Extract content using CSS selectors"""
    selectors = ['article-content', 'content', 'main', 'article']
    
    for selector in selectors:
        elements = soup.find_all(class_=selector)
        if elements:
            text = ' '.join(elem.get_text() for elem in elements)
            if len(text.strip()) > 100:
                return text
    
    return None

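def extract_with_tags(soup):
    """Extract content using HTML tags"""
    # Remove script and style elements (this modifies the soup in place)
    for script in soup(["script", "style"]):
        script.decompose()
    
    # Try semantic containers in order and stop at the first tag type that
    # yields text, so nested containers are not collected more than once
    for tag in ['article', 'main', 'section']:
        text_parts = [element.get_text().strip() for element in soup.find_all(tag)]
        text_parts = [text for text in text_parts if len(text) > 50]  # Minimum length
        if text_parts:
            return ' '.join(text_parts)
    
    return None

def extract_text_only(soup):
    """Extract the text of paragraph elements only"""
    paragraphs = [p.get_text().strip() for p in soup.find_all('p')]
    return ' '.join(text for text in paragraphs if text)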
```

### Performance Optimization:

#### 1. **Selective Parsing**

```python
import bs4

def selective_parsing(html_content, target_selectors):
    """Parse only specific parts of HTML for better performance"""
    
    # Create a strainer that keeps elements whose class matches a target.
    # Filters passed to one SoupStrainer must all match the same element,
    # so an extra id filter here would narrow the result rather than widen it
    strainer = bs4.SoupStrainer(class_=target_selectors)
    
    # Parse only the strained content
    soup = bs4.BeautifulSoup(html_content, 'html.parser', parse_only=strainer)
    
    return soup
```

#### 2. **Caching Parsed Content**

```python
import hashlib
from collections import OrderedDict

_PARSE_CACHE = OrderedDict()  # content hash -> parsed soup
_MAX_CACHE_SIZE = 100

def parse_with_caching(url, html_content):
    """Parse HTML, reusing the result when the same content was seen before"""
    content_hash = hashlib.md5(html_content.encode()).hexdigest()
    
    # Check cache first
    if content_hash in _PARSE_CACHE:
        _PARSE_CACHE.move_to_end(content_hash)  # mark as recently used
        return _PARSE_CACHE[content_hash]
    
    # Parse and cache (failed parses are not stored)
    result = robust_html_parsing(html_content, url)
    if result is not None:
        _PARSE_CACHE[content_hash] = result
        if len(_PARSE_CACHE) > _MAX_CACHE_SIZE:
            _PARSE_CACHE.popitem(last=False)  # evict the least recently used entry
    return result
```

### Best Practices for CSS Selectors:

#### 1. **Selector Specificity**

```python
# Good: Specific selectors
selectors = [
    'article.news-item .content',
    'div.article-body p',
    'section.main-content article',
]

# Bad: Too generic
selectors = [
    'div',
    'p',
    'span',
]
```

#### 2. **Fallback Hierarchy**

```python
def get_content_with_hierarchy(soup):
    """Get content using a hierarchy of selectors"""
    
    hierarchy = [
        # Most specific
        'article.news-item .content',
        'div.article-body',
        'section.main-content',
        
        # Less specific
        'article',
        'div.content',
        'main',
        
        # Fallback
        'body'
    ]
    
    for selector in hierarchy:
        elements = soup.select(selector)
        if elements:
            content = ' '.join(elem.get_text() for elem in elements)
            if len(content.strip()) > 100:
                return content
    
    return None
```

### Benefits for Our RAG Chatbot:

1. **Accurate Content Extraction**: Precise targeting of relevant content
2. **Noise Reduction**: Filter out ads, navigation, and irrelevant elements
3. **Language Focus**: Prioritize Greek language content
4. **Quality Assurance**: Ensure high-quality content for the knowledge base
5. **Performance**: Efficient parsing with minimal resource usage
6. **Reliability**: Robust error handling and fallback strategies
7. **Maintainability**: Easy to update selectors when sites change
8. **Scalability**: Can handle multiple content types and sources


## Q3: How do we handle anti-bot detection and implement ethical scraping?

**Answer:**

Anti-bot detection and ethical scraping are crucial aspects of web scraping. Our Greek Derby chatbot implements sophisticated techniques to avoid detection while respecting website resources and terms of service, ensuring sustainable and responsible data collection.

### Understanding Anti-Bot Detection:

#### 1. **Common Bot Detection Methods**

```python
# Websites use various methods to detect bots
detection_methods = {
    'user_agent_check': 'Analyze User-Agent headers',
    'request_frequency': 'Monitor request rate and patterns',
    'javascript_challenges': 'Require JavaScript execution',
    'captcha_systems': 'Human verification challenges',
    'ip_blacklisting': 'Block suspicious IP addresses',
    'behavioral_analysis': 'Analyze browsing patterns',
    'header_validation': 'Check for missing or suspicious headers',
    'cookie_handling': 'Require proper cookie management',
    'session_management': 'Track session continuity',
    'device_fingerprinting': 'Identify unique device characteristics'
}
```

#### 2. **Our Anti-Detection Strategy**

```python
# backend/standalone-service/greek_derby_chatbot.py
import os

# One user agent string shared by the environment variable and the headers
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'

def setup_anti_detection():
    """Configure anti-detection measures"""
    
    # Set realistic user agent
    os.environ['USER_AGENT'] = USER_AGENT
    
    # Configure headers
    headers = {
        'User-Agent': USER_AGENT,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Cache-Control': 'max-age=0',
        'DNT': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
    }
    
    return headers
```

### Advanced Anti-Detection Techniques:

#### 1. **User Agent Rotation**

```python
import random

class UserAgentRotator:
    """Rotate user agents to avoid detection"""
    
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
        ]
    
    def get_random_user_agent(self):
        """Get a random user agent"""
        return random.choice(self.user_agents)
    
    def get_headers(self):
        """Get headers with random user agent"""
        return {
            'User-Agent': self.get_random_user_agent(),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
        }
```

#### 2. **Request Timing and Rate Limiting**

```python
import time
import random
import requests

class RateLimiter:
    """Implement intelligent rate limiting"""
    
    def __init__(self, min_delay=1, max_delay=3, burst_limit=5):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.burst_limit = burst_limit
        self.request_times = []
        self.last_request_time = 0
    
    def wait_if_needed(self):
        """Wait if rate limit would be exceeded"""
        now = time.time()
        
        # Remove old request times (older than 1 minute)
        self.request_times = [t for t in self.request_times if now - t < 60]
        
        # Check burst limit
        if len(self.request_times) >= self.burst_limit:
            sleep_time = 60 - (now - self.request_times[0])
            if sleep_time > 0:
                time.sleep(sleep_time)
                self.request_times = []
        
        # Random delay between requests
        delay = random.uniform(self.min_delay, self.max_delay)
        time.sleep(delay)
        
        # Record this request at the moment it is sent (after any sleeping)
        sent_at = time.time()
        self.request_times.append(sent_at)
        self.last_request_time = sent_at

# Usage in scraping
rate_limiter = RateLimiter(min_delay=2, max_delay=5, burst_limit=3)

for url in urls:
    rate_limiter.wait_if_needed()
    # Make request
    response = requests.get(url, headers=headers)
```
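
Because `RateLimiter` calls `time.sleep` directly, exercising it in a test means actually waiting. One possible variation, sketched below under the assumption that the clock and sleep functions are passed in, can be checked instantly with a fake clock. `SlidingWindowLimiter` and `FakeClock` are hypothetical names, and this version enforces only the window limit, not the random per-request delay.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` within any `window_seconds` span."""

    def __init__(self, max_requests, window_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        if max_requests < 1:
            raise ValueError("max_requests must be at least 1")
        self._max_requests = max_requests
        self._window = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()  # send times of recent requests, oldest first

    def _drop_expired(self, now):
        while self._stamps and now - self._stamps[0] >= self._window:
            self._stamps.popleft()

    def acquire(self):
        """Block until a request is allowed, then record and return its time."""
        now = self._clock()
        self._drop_expired(now)
        if len(self._stamps) >= self._max_requests:
            # Wait until the oldest request leaves the window
            self._sleep(self._window - (now - self._stamps[0]))
            now = self._clock()
            self._drop_expired(now)
        self._stamps.append(now)
        return now


class FakeClock:
    """Test double: sleeping only advances a counter."""

    def __init__(self):
        self.now = 0.0
        self.sleeps = []

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.sleeps.append(seconds)
        self.now += seconds


fake = FakeClock()
limiter = SlidingWindowLimiter(3, 60, clock=fake.time, sleep=fake.sleep)
stamps = [limiter.acquire() for _ in range(4)]
print(stamps, fake.sleeps)  # [0.0, 0.0, 0.0, 60.0] [60.0]
```

The first three requests go out immediately; the fourth waits a full window because all three slots were used at time zero.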

#### 3. **Session Management**

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class SessionManager:
    """Manage HTTP sessions with proper configuration"""
    
    def __init__(self):
        self.session = requests.Session()
        self.setup_session()
    
    def setup_session(self):
        """Configure session with retry strategy"""
        
        # Retry strategy
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
        )
        
        # Mount adapter
        adapter = HTTPAdapter(max_retries=retry_strategy)
        self.session.mount("http://", adapter)
        self.session.mount("https://", adapter)
        
        # Set default headers
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
        })
    
    def get(self, url, **kwargs):
        """Make GET request with session"""
        return self.session.get(url, **kwargs)
    
    def close(self):
        """Close session"""
        self.session.close()
```

### Ethical Scraping Practices:

#### 1. **Robots.txt Compliance**

```python
import urllib.robotparser
from urllib.parse import urljoin, urlparse

class RobotsTxtChecker:
    """Check robots.txt before scraping"""
    
    def __init__(self):
        self.robots_cache = {}
    
    def can_scrape(self, url, user_agent='*'):
        """Check if URL can be scraped according to robots.txt"""
        
        parsed_url = urlparse(url)
        robots_url = f"{parsed_url.scheme}://{parsed_url.netloc}/robots.txt"
        
        # Check cache first
        if robots_url in self.robots_cache:
            rp = self.robots_cache[robots_url]
        else:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(robots_url)
            rp.read()
            self.robots_cache[robots_url] = rp
        
        return rp.can_fetch(user_agent, url)
    
    def get_crawl_delay(self, url, user_agent='*'):
        """Get recommended crawl delay from robots.txt"""
        parsed_url = urlparse(url)
        robots_url = f"{parsed_url.scheme}://{parsed_url.netloc}/robots.txt"
        
        if robots_url in self.robots_cache:
            rp = self.robots_cache[robots_url]
            return rp.crawl_delay(user_agent)
        
        return None

# Usage
robots_checker = RobotsTxtChecker()

for url in urls:
    if robots_checker.can_scrape(url):
        delay = robots_checker.get_crawl_delay(url)
        if delay:
            time.sleep(delay)
        # Proceed with scraping
    else:
        print(f"Scraping not allowed for {url}")
```
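
The rules themselves can be checked offline, which is useful in tests. The sketch below feeds a made-up `robots.txt` body to the standard library parser through `parse()` instead of fetching it; the `derby-bot` user agent, the `example.com` URLs, and the sample rules are all hypothetical.

```python
import urllib.robotparser

# Made-up robots.txt content for illustration
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Build a RobotFileParser from text rather than from a network fetch."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(SAMPLE_ROBOTS_TXT)

print(rules.can_fetch("derby-bot", "https://example.com/football/superleague"))  # True
print(rules.can_fetch("derby-bot", "https://example.com/admin/login"))           # False
print(rules.crawl_delay("derby-bot"))                                            # 5
```

Passing the scraper's own user agent name, rather than `'*'`, lets the parser apply any rules a site has written for that specific bot and fall back to the wildcard rules otherwise.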

#### 2. **Respectful Request Patterns**

```python
class EthicalScraper:
    """Implement ethical scraping practices"""
    
    def __init__(self):
        self.robots_checker = RobotsTxtChecker()
        self.rate_limiter = RateLimiter()
        self.session_manager = SessionManager()
    
    def scrape_respectfully(self, urls):
        """Scrape URLs with ethical considerations"""
        
        for url in urls:
            try:
                # Check robots.txt
                if not self.robots_checker.can_scrape(url):
                    print(f"Skipping {url} - not allowed by robots.txt")
                    continue
                
                # Apply rate limiting
                self.rate_limiter.wait_if_needed()
                
                # Make request
                response = self.session_manager.get(url, timeout=30)
                
                # Check response status
                if response.status_code == 200:
                    yield response
                elif response.status_code == 429:  # Too Many Requests
                    # Back off before the next URL; this URL is skipped, not retried
                    print(f"Rate limited for {url}, pausing before the next URL...")
                    time.sleep(60)  # Wait 1 minute
                    continue
                else:
                    print(f"Error {response.status_code} for {url}")
                    
            except Exception as e:
                print(f"Error scraping {url}: {e}")
                continue
```

#### 3. **Data Usage Guidelines**

```python
class DataUsagePolicy:
    """Define data usage policies and restrictions"""
    
    def __init__(self):
        self.policies = {
            'max_requests_per_hour': 100,
            'max_requests_per_day': 1000,
            'respect_noindex': True,
            'respect_nofollow': True,
            'cache_duration': 3600,  # 1 hour
            'user_agent_identification': True,
            'contact_info_required': True,
        }
    
    def validate_request(self, url, request_count):
        """Validate if request is within policy limits"""
        
        # Check hourly limit
        if request_count['hourly'] >= self.policies['max_requests_per_hour']:
            return False, "Hourly limit exceeded"
        
        # Check daily limit
        if request_count['daily'] >= self.policies['max_requests_per_day']:
            return False, "Daily limit exceeded"
        
        return True, "Request allowed"
    
    def get_cache_duration(self):
        """Get recommended cache duration"""
        return self.policies['cache_duration']
```
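
`validate_request` expects a `request_count` dict with `'hourly'` and `'daily'` keys, but the lesson does not show where those numbers come from. One possible way to produce them is a small sliding-window counter. `RequestCounter`, `record`, and `counts` are hypothetical names introduced for this sketch, and it assumes timestamps are recorded in chronological order.

```python
import time
from collections import deque

class RequestCounter:
    """Hypothetical helper: counts requests made in the last hour and day."""

    HOUR = 3600
    DAY = 86400

    def __init__(self):
        self.timestamps = deque()  # request times in seconds, oldest first

    def record(self, now=None):
        """Register one request at time `now` (defaults to the current time)."""
        self.timestamps.append(time.time() if now is None else now)

    def counts(self, now=None):
        """Return the dict shape that validate_request() reads."""
        now = time.time() if now is None else now
        # Drop entries older than a day; they no longer affect either limit
        while self.timestamps and now - self.timestamps[0] >= self.DAY:
            self.timestamps.popleft()
        hourly = sum(1 for t in self.timestamps if now - t < self.HOUR)
        return {'hourly': hourly, 'daily': len(self.timestamps)}

counter = RequestCounter()
counter.record(now=0)        # more than a day before the check below
counter.record(now=90_000)   # two hours before the check
counter.record(now=97_000)   # within the last hour
request_count = counter.counts(now=97_200)
print(request_count)
```

The resulting dict can be passed straight to `DataUsagePolicy.validate_request(url, request_count)`.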

### Advanced Anti-Detection Techniques:

#### 1. **Proxy Rotation**

```python
import random

class ProxyRotator:
    """Rotate proxies to avoid IP blocking"""
    
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list
        self.current_proxy = None
    
    def get_random_proxy(self):
        """Get a random proxy from the list"""
        return random.choice(self.proxy_list)
    
    def get_proxy_dict(self):
        """Get proxy configuration for requests"""
        proxy = self.get_random_proxy()
        return {
            'http': proxy,
            'https': proxy
        }
    
    def rotate_proxy(self):
        """Rotate to a new proxy"""
        self.current_proxy = self.get_random_proxy()

# Usage with proxies
proxy_rotator = ProxyRotator([
    'http://proxy1:port',
    'http://proxy2:port',
    'http://proxy3:port',
])

# Make request with proxy
response = requests.get(url, proxies=proxy_rotator.get_proxy_dict())
```

#### 2. **JavaScript Rendering (Selenium)**

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class JavaScriptScraper:
    """Handle JavaScript-heavy websites"""
    
    def __init__(self, headless=True):
        self.driver = self.setup_driver(headless)
    
    def setup_driver(self, headless):
        """Setup Chrome driver with options"""
        options = Options()
        
        if headless:
            options.add_argument('--headless')
        
        # Anti-detection options
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument('--disable-blink-features=AutomationControlled')
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
        options.add_experimental_option('useAutomationExtension', False)
        
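        driver = webdriver.Chrome(options=options)
        
        # Hide the webdriver flag on every new document; a plain execute_script
        # call would only affect the page that is loaded at that moment
        driver.execute_cdp_cmd(
            'Page.addScriptToEvaluateOnNewDocument',
            {'source': "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"}
        )
        
        return driver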
    
    def scrape_with_js(self, url):
        """Scrape URL that requires JavaScript"""
        try:
            self.driver.get(url)
            
            # Wait for content to load
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.TAG_NAME, "body"))
            )
            
            # Get page source
            page_source = self.driver.page_source
            
            return page_source
            
        except Exception as e:
            print(f"Error scraping {url}: {e}")
            return None
    
    def close(self):
        """Close the driver"""
        self.driver.quit()
```

#### 3. **Request Fingerprinting Avoidance**

```python
class FingerprintAvoidance:
    """Avoid request fingerprinting"""
    
    @staticmethod
    def randomize_headers():
        """Randomize headers to avoid fingerprinting"""
        base_headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
        }
        
        # Randomly add optional headers
        optional_headers = {
            'DNT': '1',
            'Cache-Control': 'max-age=0',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
        }
        
        # Randomly include some optional headers
        for header, value in optional_headers.items():
            if random.random() > 0.5:
                base_headers[header] = value
        
        return base_headers
    
    @staticmethod
    def randomize_request_timing():
        """Randomize request timing patterns"""
        # Random delay between 1-5 seconds
        delay = random.uniform(1, 5)
        
        # Sometimes add longer delays
        if random.random() < 0.1:  # 10% chance
            delay += random.uniform(10, 30)
        
        return delay
```

### Legal and Ethical Considerations:

#### 1. **Terms of Service Compliance**

```python
class TermsOfServiceChecker:
    """Check and comply with terms of service"""
    
    def __init__(self):
        # Illustrative values only; confirm them against each site's published terms
        self.known_restrictions = {
            'gazzetta.gr': {
                'allowed': True,
                'rate_limit': 60,  # requests per hour
                'requires_attribution': True,
                'commercial_use': False,
            }
        }
    
    def check_tos(self, domain):
        """Check terms of service for domain"""
        return self.known_restrictions.get(domain, {
            'allowed': True,
            'rate_limit': 30,
            'requires_attribution': False,
            'commercial_use': False,
        })
    
    def is_compliant(self, domain, request_rate):
        """Check if current usage is compliant"""
        tos = self.check_tos(domain)
        
        if not tos['allowed']:
            return False, "Scraping not allowed"
        
        if request_rate > tos['rate_limit']:
            return False, f"Rate limit exceeded: {tos['rate_limit']} requests/hour"
        
        return True, "Compliant"
```

#### 2. **Data Privacy Compliance**

```python
class PrivacyCompliance:
    """Ensure privacy compliance in scraping"""
    
    def __init__(self):
        self.gdpr_requirements = {
            'data_minimization': True,
            'purpose_limitation': True,
            'storage_limitation': True,
            'accuracy': True,
            'security': True,
        }
    
    def anonymize_data(self, data):
        """Anonymize personal data in scraped content"""
        import re
        
        # Remove email addresses (note: '|' inside a character class is a literal pipe)
        data = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', data)
        
        # Remove phone numbers
        data = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', data)
        
        # Remove IP addresses
        data = re.sub(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', '[IP]', data)
        
        return data
    
    def validate_data_usage(self, data, purpose):
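        """Validate data usage against privacy requirements.
        
        Relies on is_data_necessary() and is_data_minimized(), which are
        project-specific checks not defined in this lesson.
        """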
        # Check if data is necessary for purpose
        if not self.is_data_necessary(data, purpose):
            return False, "Data not necessary for stated purpose"
        
        # Check if data is minimized
        if not self.is_data_minimized(data, purpose):
            return False, "Data not minimized"
        
        return True, "Privacy compliant"
```
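
A quick way to sanity-check the anonymization patterns is to run them on a made-up string. The function below is a standalone sketch of the same three substitutions; the sample address, phone number, and IP are invented for illustration.

```python
import re

EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
PHONE_RE = re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b')
IP_RE = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')

def anonymize(text):
    """Replace emails, 10-digit phone numbers and IPv4 addresses with placeholders."""
    text = EMAIL_RE.sub('[EMAIL]', text)
    text = PHONE_RE.sub('[PHONE]', text)
    text = IP_RE.sub('[IP]', text)
    return text

sample = "Contact fan@example.com or 210-555-0123 from 192.168.1.10"
masked = anonymize(sample)
print(masked)

# Numbers written with spaces or a country prefix are not matched by PHONE_RE
unmasked = anonymize("+30 210 555 0123")
print(unmasked)
```

The second call shows a limit of the phone pattern: it only recognizes ten digits separated by nothing, `-`, or `.`, so numbers formatted with spaces pass through unchanged and would need an additional pattern.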

### Benefits for Our RAG Chatbot:

1. **Sustainable Scraping**: Avoid getting blocked or banned
2. **Ethical Compliance**: Respect website resources and terms
3. **Reliable Data**: Consistent access to content
4. **Legal Safety**: Comply with applicable laws and regulations
5. **Resource Efficiency**: Optimize request patterns
6. **Long-term Viability**: Maintain access over time
7. **Reputation Management**: Build trust with content providers
8. **Quality Assurance**: Ensure high-quality, reliable data


## Q4: How do we implement data cleaning and preprocessing for scraped content?

**Answer:**

Data cleaning and preprocessing are essential steps in web scraping to ensure high-quality data for our RAG system. Our Greek Derby chatbot implements comprehensive data cleaning techniques to transform raw scraped content into clean, structured, and useful information.

### Why Data Cleaning Matters:

#### 1. **Common Data Quality Issues**

```python
# Typical issues in scraped data
data_quality_issues = {
    'html_entities': '&amp; &lt; &gt; &quot; &#39;',
    'extra_whitespace': 'Multiple   spaces   and\n\n\nnewlines',
    'html_tags': '<p>Text</p><div>More text</div>',
    'javascript_code': 'var x = 1; function() { return x; }',
    'navigation_text': 'Home | About | Contact | Login',
    'advertisement_text': 'Sponsored Content | Click Here',
    'duplicate_content': 'Same text repeated multiple times',
    'encoding_issues': 'Special characters not properly decoded',
    'irrelevant_content': 'Comments, footers, headers',
    'broken_text': 'Incomplete sentences or fragments'
}
```

#### 2. **Impact on RAG System**

**Poor Data Quality Effects:**
- **Noise in Embeddings**: Irrelevant content affects vector similarity
- **Reduced Accuracy**: Low-quality chunks lead to poor answers
- **Storage Waste**: Unnecessary data consumes vector database space
- **Performance Issues**: Processing irrelevant content slows down retrieval
- **User Experience**: Poor quality responses from the chatbot

### Our Data Cleaning Pipeline:

#### 1. **HTML Content Extraction**

```python
import re
import html
from bs4 import BeautifulSoup

class HTMLContentExtractor:
    """Extract clean text from HTML content"""
    
    def __init__(self):
        self.unwanted_tags = [
            'script', 'style', 'nav', 'header', 'footer', 
            'aside', 'advertisement', 'ad', 'sidebar',
            'menu', 'navigation', 'social', 'share',
            'comment', 'comments', 'related', 'recommended'
        ]
        
        self.unwanted_classes = [
            'ad', 'advertisement', 'sidebar', 'navigation',
            'menu', 'footer', 'header', 'social', 'share',
            'comment', 'related', 'recommended', 'sponsored'
        ]
    
    def extract_clean_text(self, html_content):
        """Extract clean text from HTML"""
        
        # Parse HTML
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Remove unwanted tags
        for tag in self.unwanted_tags:
            for element in soup.find_all(tag):
                element.decompose()
        
        # Remove elements with unwanted classes
        for class_name in self.unwanted_classes:
            for element in soup.find_all(class_=class_name):
                element.decompose()
        
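        # Get text content; the separator keeps adjacent block elements
        # (e.g. <p>Text</p><div>More</div>) from fusing into one word
        text = soup.get_text(separator=' ')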
        
        # Clean up text
        text = self.clean_text(text)
        
        return text
    
    def clean_text(self, text):
        """Clean extracted text"""
        
        # Decode HTML entities
        text = html.unescape(text)
        
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text)
        
        # Remove leading/trailing whitespace
        text = text.strip()
        
        # Remove common noise patterns
        text = self.remove_noise_patterns(text)
        
        return text
    
    def remove_noise_patterns(self, text):
        """Remove common noise patterns"""
        
        # Remove navigation patterns
        text = re.sub(r'Home\s*\|\s*About\s*\|\s*Contact', '', text)
        text = re.sub(r'Login\s*\|\s*Register\s*\|\s*Sign\s*Up', '', text)
        
        # Remove advertisement patterns
        text = re.sub(r'Sponsored\s*Content', '', text)
        text = re.sub(r'Click\s*Here\s*Now', '', text)
        text = re.sub(r'Ad\s*Advertisement', '', text)
        
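        # Remove social media patterns (group the alternatives so that bare
        # mentions of "Twitter" or "Instagram" in article text are kept)
        text = re.sub(r'Follow\s*us\s*on\s*(?:Facebook|Twitter|Instagram)', '', text)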
        text = re.sub(r'Share\s*on\s*Social\s*Media', '', text)
        
        # Remove cookie notices
        text = re.sub(r'This\s*website\s*uses\s*cookies', '', text)
        text = re.sub(r'Accept\s*Cookies\s*Privacy\s*Policy', '', text)
        
        return text
```

#### 2. **Greek Language Specific Cleaning**

```python
class GreekTextCleaner:
    """Clean Greek text content specifically"""
    
    def __init__(self):
        # Greek-specific patterns
        self.greek_patterns = {
            'diacritics': r'[άέήίόύώΆΈΉΊΌΎΏ]',
            'greek_chars': r'[\u0386\u0388-\u03CE]',  # Greek letters, accented forms included
            'greek_punctuation': r'[·΄]',
        }
        
        # Common Greek noise patterns
        self.greek_noise = [
            r'Σελίδα\s*\d+\s*από\s*\d+',  # Page X of Y
            r'Διαβάστε\s*περισσότερα',    # Read more
            r'Σχολιάστε\s*αυτό\s*το\s*άρθρο',  # Comment on this article
            r'Μοιραστείτε\s*στο\s*Facebook',   # Share on Facebook
            r'Ακολουθήστε\s*μας\s*στο\s*Twitter',  # Follow us on Twitter
        ]
    
    def clean_greek_text(self, text):
        """Clean Greek text content"""
        
        # Remove Greek noise patterns
        for pattern in self.greek_noise:
            text = re.sub(pattern, '', text, flags=re.IGNORECASE)
        
        # Normalize Greek diacritics
        text = self.normalize_greek_diacritics(text)
        
        # Remove extra punctuation
        text = re.sub(r'[·΄]+', '', text)
        
        # Clean up spacing
        text = re.sub(r'\s+', ' ', text)
        text = text.strip()
        
        return text
    
    def normalize_greek_diacritics(self, text):
        """Normalize Greek diacritics for consistency"""
        
        # Common diacritic mappings (tonos and dialytika)
        diacritic_map = {
            'ά': 'α', 'έ': 'ε', 'ή': 'η', 'ί': 'ι',
            'ό': 'ο', 'ύ': 'υ', 'ώ': 'ω',
            'Ά': 'Α', 'Έ': 'Ε', 'Ή': 'Η', 'Ί': 'Ι',
            'Ό': 'Ο', 'Ύ': 'Υ', 'Ώ': 'Ω',
            'ϊ': 'ι', 'ϋ': 'υ', 'ΐ': 'ι', 'ΰ': 'υ',
            'Ϊ': 'Ι', 'Ϋ': 'Υ'
        }
        
        for diacritic, base in diacritic_map.items():
            text = text.replace(diacritic, base)
        
        return text
    
    def is_greek_content(self, text):
        """Check if text is primarily Greek"""
        
        greek_chars = len(re.findall(self.greek_patterns['greek_chars'], text))
        total_chars = len(re.sub(r'\s', '', text))
        
        if total_chars == 0:
            return False
        
        greek_ratio = greek_chars / total_chars
        return greek_ratio > 0.3  # 30% Greek content threshold
```
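
Because `normalize_greek_diacritics` strips accents, any later keyword lookup against the cleaned text has to ignore accents as well, or names written as `Ολυμπιακός` will no longer be found. One way to compare consistently is Unicode decomposition from the standard library. `strip_accents` and `contains_term` are hypothetical helper names for this sketch.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks (tonos, dialytika) while keeping the base letters."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

def contains_term(text, term):
    """Accent- and case-insensitive substring test."""
    return strip_accents(term).casefold() in strip_accents(text).casefold()

cleaned = strip_accents('Ολυμπιακός - Παναθηναϊκός 2-1')
print(cleaned)

exact_match = 'Παναθηναϊκός' in cleaned            # accented term vs. stripped text
folded_match = contains_term(cleaned, 'Παναθηναϊκός')
print(exact_match, folded_match)
```

The plain `in` test fails on the cleaned text while the folded comparison succeeds, which is why the same folding should be applied to both sides of every lookup.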

#### 3. **Content Quality Assessment**

```python
class ContentQualityAssessor:
    """Assess and filter content quality"""
    
    def __init__(self):
        self.min_length = 50
        self.max_length = 10000
        self.min_word_count = 10
        self.max_link_density = 0.3
        self.min_sentence_count = 2
    
    def assess_quality(self, text, metadata=None):
        """Assess content quality and return score"""
        
        if not text or not text.strip():
            # Return the same 3-tuple shape as the normal path so callers can unpack it
            return 0, "Empty content", {}
        
        # Basic metrics
        length = len(text.strip())
        word_count = len(text.split())
        # Greek writes its question mark as ';', so count it as a sentence end too
        sentence_count = len(re.findall(r'[.!?;\u037E]+', text))
        
        # Link density
        link_count = len(re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text))
        link_density = link_count / max(word_count, 1)
        
        # Quality checks
        checks = {
            'length_ok': self.min_length <= length <= self.max_length,
            'word_count_ok': word_count >= self.min_word_count,
            'sentence_count_ok': sentence_count >= self.min_sentence_count,
            'link_density_ok': link_density <= self.max_link_density,
            'not_navigation': not self.is_navigation_text(text),
            'not_advertisement': not self.is_advertisement_text(text),
            'has_meaningful_content': self.has_meaningful_content(text),
        }
        
        # Calculate quality score
        quality_score = sum(checks.values()) / len(checks)
        
        # Determine quality level
        if quality_score >= 0.8:
            quality_level = "High"
        elif quality_score >= 0.6:
            quality_level = "Medium"
        else:
            quality_level = "Low"
        
        return quality_score, quality_level, checks
    
    def is_navigation_text(self, text):
        """Check if text is navigation content"""
        nav_patterns = [
            r'Home\s*\|\s*About\s*\|\s*Contact',
            r'Login\s*\|\s*Register',
            r'Menu\s*\|\s*Navigation',
            r'Σελίδα\s*\d+\s*από\s*\d+',  # Greek: Page X of Y
        ]
        
        for pattern in nav_patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return True
        
        return False
    
    def is_advertisement_text(self, text):
        """Check if text is advertisement content"""
        ad_patterns = [
            r'Sponsored\s*Content',
            r'Advertisement',
            r'Click\s*Here\s*Now',
            r'Buy\s*Now',
            r'Limited\s*Time\s*Offer',
            r'Διαφήμιση',  # Greek: Advertisement
            r'Προσφορά',  # Greek: Offer
        ]
        
        for pattern in ad_patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return True
        
        return False
    
    def has_meaningful_content(self, text):
        """Check if text has meaningful content"""
        
        # Check for meaningful words (not just common words)
        meaningful_words = [
            'match', 'game', 'team', 'player', 'coach', 'goal', 'score',
            'άγων', 'ομάδα', 'παίκτης', 'προπονητής', 'γκολ', 'σκορ',  # Greek
            'football', 'soccer', 'championship', 'league', 'derby',
            'ποδόσφαιρο', 'πρωτάθλημα', 'λίγκα', 'ντέρμπι',  # Greek
        ]
        
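        # Compare without accents: GreekTextCleaner may already have stripped them
        import unicodedata
        
        def fold(value):
            decomposed = unicodedata.normalize('NFD', value.lower())
            return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
        
        text_folded = fold(text)
        meaningful_count = sum(1 for word in meaningful_words if fold(word) in text_folded)
        
        return meaningful_count >= 2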
```
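
Since the score is simply the fraction of the seven checks that pass, it helps to see which scores are actually reachable. The sketch below reuses the thresholds from `assess_quality`; `quality_level` and `with_failures` are hypothetical helpers written only for this illustration.

```python
CHECK_NAMES = [
    'length_ok', 'word_count_ok', 'sentence_count_ok', 'link_density_ok',
    'not_navigation', 'not_advertisement', 'has_meaningful_content',
]

def quality_level(checks):
    """Map a dict of boolean checks to (score, level) using the lesson's thresholds."""
    score = sum(checks.values()) / len(checks)
    if score >= 0.8:
        level = "High"
    elif score >= 0.6:
        level = "Medium"
    else:
        level = "Low"
    return score, level

def with_failures(n_failed):
    """Build a checks dict in which the first n_failed checks fail."""
    return {name: i >= n_failed for i, name in enumerate(CHECK_NAMES)}

for n_failed in range(4):
    score, level = quality_level(with_failures(n_failed))
    print(f"{n_failed} failed -> score {score:.3f} ({level})")
```

With seven equally weighted checks, one failure still rates "High" (6/7), two failures rate "Medium" (5/7), and three failures drop to "Low" (4/7), so a 0.6 cutoff tolerates at most two failed checks.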

### Advanced Data Processing:

#### 1. **Content Deduplication**

```python
import hashlib
from collections import defaultdict

class ContentDeduplicator:
    """Remove duplicate content from scraped data"""
    
    def __init__(self):
        self.content_hashes = set()
        self.similarity_threshold = 0.8
    
    def deduplicate_documents(self, documents):
        """Remove duplicate documents"""
        
        unique_docs = []
        seen_hashes = set()
        
        for doc in documents:
            # Create content hash
            content_hash = self.create_content_hash(doc.page_content)
            
            if content_hash not in seen_hashes:
                seen_hashes.add(content_hash)
                unique_docs.append(doc)
            else:
                print(f"Duplicate document found: {doc.metadata.get('source', 'Unknown')}")
        
        return unique_docs
    
    def create_content_hash(self, content):
        """Create hash for content deduplication"""
        
        # Normalize content
        normalized = re.sub(r'\s+', ' ', content.strip().lower())
        
        # Create hash
        return hashlib.md5(normalized.encode()).hexdigest()
    
    def find_similar_content(self, documents):
        """Find similar content using text similarity"""
        
        from difflib import SequenceMatcher
        
        similar_groups = defaultdict(list)
        
        for i, doc1 in enumerate(documents):
            for j, doc2 in enumerate(documents[i+1:], i+1):
                similarity = SequenceMatcher(None, doc1.page_content, doc2.page_content).ratio()
                
                if similarity >= self.similarity_threshold:
                    similar_groups[i].append((j, similarity))
        
        return similar_groups
```
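
The normalization step inside `create_content_hash` decides what counts as "the same" document. A small standalone sketch of that step, with invented sample strings, shows what exact hashing catches and what it misses.

```python
import hashlib
import re

def content_hash(content):
    """Hash text after trimming, lowercasing and collapsing whitespace."""
    normalized = re.sub(r'\s+', ' ', content.strip().lower())
    return hashlib.md5(normalized.encode('utf-8')).hexdigest()

original = "Olympiakos  won the derby.\n"
reformatted = "olympiakos won   the DERBY."
edited = "Olympiakos won the derby!"

same = content_hash(original) == content_hash(reformatted)
different = content_hash(original) != content_hash(edited)
print(same, different)
```

Differences in case and spacing collapse to one hash, but a single changed character produces a different one. Near-duplicates of that kind are only found by the pairwise `find_similar_content` comparison, which grows quadratically with the number of documents.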

#### 2. **Content Enrichment**

```python
import re
from datetime import datetime

class ContentEnricher:
    """Enrich scraped content with metadata and context"""
    
    def __init__(self):
        self.greek_football_terms = {
            'teams': ['Ολυμπιακός', 'Παναθηναϊκός', 'ΑΕΚ', 'ΠΑΟΚ'],
            'competitions': ['Σούπερ Λίγκα', 'Champions League', 'Europa League'],
            'positions': ['επιθετικός', 'μέσος', 'αμυντικός', 'τερματοφύλακας'],
        }
    
    def enrich_document(self, doc, source_url):
        """Enrich document with metadata and context"""
        
        # Extract metadata from URL
        metadata = self.extract_url_metadata(source_url)
        
        # Detect content type
        content_type = self.detect_content_type(doc.page_content)
        metadata['content_type'] = content_type
        
        # Extract entities
        entities = self.extract_entities(doc.page_content)
        metadata['entities'] = entities
        
        # Detect language
        language = self.detect_language(doc.page_content)
        metadata['language'] = language
        
        # Add timestamp
        metadata['processed_at'] = datetime.now().isoformat()
        
        # Update document metadata
        doc.metadata.update(metadata)
        
        return doc
    
    def extract_url_metadata(self, url):
        """Extract metadata from URL"""
        
        from urllib.parse import urlparse
        
        parsed = urlparse(url)
        
        metadata = {
            'domain': parsed.netloc,
            'path': parsed.path,
            'source': url,
        }
        
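        # Extract section from path (split() returns [''] for an empty path,
        # so test the first element rather than the list itself)
        path_parts = parsed.path.strip('/').split('/')
        if path_parts[0]:
            metadata['section'] = path_parts[0]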
        
        return metadata
    
    def detect_content_type(self, content):
        """Detect type of content"""
        
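        # Lowercase and strip accents so keywords match whether or not
        # the text went through GreekTextCleaner first
        import unicodedata
        decomposed = unicodedata.normalize('NFD', content.lower())
        content_lower = ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
        
        if any(word in content_lower for word in ['match', 'game', 'αγων', 'αγωνας']):
            return 'match_report'
        elif any(word in content_lower for word in ['transfer', 'μεταγραφη', 'μεταγραφες']):
            return 'transfer_news'
        elif any(word in content_lower for word in ['injury', 'τραυματισμος', 'τραυματισμοι']):
            return 'injury_news'
        elif any(word in content_lower for word in ['preview', 'προεπισκοπηση', 'προβλεψεις']):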
            return 'match_preview'
        else:
            return 'general_news'
    
    def extract_entities(self, content):
        """Extract relevant entities from content"""
        
        entities = {
            'teams': [],
            'players': [],
            'competitions': [],
            'dates': [],
        }
        
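        # Match names accent-insensitively: cleaned text may have lost its accents
        import unicodedata
        
        def fold(value):
            decomposed = unicodedata.normalize('NFD', value)
            return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
        
        folded_content = fold(content)
        
        # Extract teams
        for team in self.greek_football_terms['teams']:
            if fold(team) in folded_content:
                entities['teams'].append(team)
        
        # Extract competitions
        for comp in self.greek_football_terms['competitions']:
            if fold(comp) in folded_content:
                entities['competitions'].append(comp)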
        
        # Extract dates (simple pattern)
        date_pattern = r'\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2}'
        entities['dates'] = re.findall(date_pattern, content)
        
        return entities
    
    def detect_language(self, content):
        """Detect primary language of content"""
        
        # Count accented Greek letters too (Ά, Έ-ώ), not only the unaccented ranges
        greek_chars = len(re.findall(r'[\u0386\u0388-\u03CE]', content))
        english_chars = len(re.findall(r'[a-zA-Z]', content))
        
        total_chars = greek_chars + english_chars
        
        if total_chars == 0:
            return 'unknown'
        
        greek_ratio = greek_chars / total_chars
        
        if greek_ratio > 0.5:
            return 'greek'
        elif greek_ratio > 0.1:
            return 'mixed'
        else:
            return 'english'
```
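
The URL-derived metadata is easiest to understand on concrete inputs. The sketch below is a standalone variant of the URL parsing step; `url_metadata` is a hypothetical name and the article path is invented, so it should not be read as a real Gazzetta.gr URL. Note that `''.split('/')` returns `['']`, so empty segments need to be filtered out before picking the section.

```python
from urllib.parse import urlparse

def url_metadata(url):
    """Return domain, path, source and (when present) the first path segment."""
    parsed = urlparse(url)
    metadata = {'domain': parsed.netloc, 'path': parsed.path, 'source': url}
    # Keep only non-empty path segments so a bare domain yields no section
    segments = [part for part in parsed.path.split('/') if part]
    if segments:
        metadata['section'] = segments[0]
    return metadata

# Hypothetical article URL, for illustration only
article = url_metadata('https://www.gazzetta.gr/football/superleague/some-article')
homepage = url_metadata('https://www.gazzetta.gr/')

print(article)
print(homepage)
```

Storing the section as metadata allows retrieval to be filtered later, for example restricting results to football pages.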

### Integration with RAG System:

#### 1. **Preprocessing for Vector Database**

```python
class RAGPreprocessor:
    """Preprocess content for RAG system"""
    
    def __init__(self):
        self.html_extractor = HTMLContentExtractor()
        self.greek_cleaner = GreekTextCleaner()
        self.quality_assessor = ContentQualityAssessor()
        self.deduplicator = ContentDeduplicator()
        self.enricher = ContentEnricher()
    
    def preprocess_documents(self, raw_documents):
        """Preprocess documents for RAG system"""
        
        processed_docs = []
        
        for doc in raw_documents:
            try:
                # Extract clean text
                clean_text = self.html_extractor.extract_clean_text(doc.page_content)
                
                # Clean Greek text
                clean_text = self.greek_cleaner.clean_greek_text(clean_text)
                
                # Assess quality
                quality_score, quality_level, checks = self.quality_assessor.assess_quality(clean_text)
                
                # Skip low-quality content
                if quality_score < 0.6:
                    print(f"Skipping low-quality content: {quality_level}")
                    continue
                
                # Update document content
                doc.page_content = clean_text
                
                # Enrich with metadata
                doc = self.enricher.enrich_document(doc, doc.metadata.get('source', ''))
                
                # Add quality metadata
                doc.metadata.update({
                    'quality_score': quality_score,
                    'quality_level': quality_level,
                    'quality_checks': checks
                })
                
                processed_docs.append(doc)
                
            except Exception as e:
                print(f"Error processing document: {e}")
                continue
        
        # Remove duplicates
        processed_docs = self.deduplicator.deduplicate_documents(processed_docs)
        
        return processed_docs
```

### Benefits for Our RAG Chatbot:

1. **High-Quality Data**: Clean, relevant content for better embeddings
2. **Language Accuracy**: Proper Greek text processing and normalization
3. **Noise Reduction**: Remove irrelevant content and advertisements
4. **Deduplication**: Avoid duplicate content in the knowledge base
5. **Metadata Enrichment**: Rich context for better retrieval
6. **Quality Assurance**: Ensure only high-quality content is stored
7. **Performance**: Optimized content for faster processing
8. **Scalability**: Efficient preprocessing for large amounts of data


## Q5: How do we implement error handling and resilience in web scraping?

**Answer:**

Error handling and resilience are critical for maintaining a robust web scraping system. Our Greek Derby chatbot implements comprehensive error handling strategies to ensure continuous operation even when facing network issues, website changes, or anti-bot measures.

### Common Scraping Challenges:

#### 1. **Network and Connection Issues**

```python
# Common network errors in web scraping (names as in requests.exceptions)
network_errors = {
    'ConnectionError': 'Cannot connect to the server (also raised when DNS resolution fails)',
    'Timeout': 'Request timed out',
    'ConnectTimeout': 'Timed out while connecting to the server',
    'SSLError': 'SSL/TLS handshake failed',
    'ProxyError': 'Proxy server connection failed',
    'TooManyRedirects': 'Too many redirects',
    'ChunkedEncodingError': 'Chunked encoding error',
    'ReadTimeout': 'Read operation timed out',
}
```

#### 2. **HTTP Status Code Errors**

```python
# HTTP status codes and their meanings
http_status_codes = {
    200: 'OK - Request successful',
    301: 'Moved Permanently - Redirect',
    302: 'Found - Temporary redirect',
    403: 'Forbidden - Access denied',
    404: 'Not Found - Page does not exist',
    429: 'Too Many Requests - Rate limited',
    500: 'Internal Server Error - Server error',
    502: 'Bad Gateway - Proxy error',
    503: 'Service Unavailable - Server overloaded',
    504: 'Gateway Timeout - Proxy timeout',
}
```
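
In practice the table above boils down to a decision per response: accept it, retry later, or give up on the URL. The sketch below groups the codes the same way the retry logic in this lesson does (retry on 429 and 5xx, no retry on 401, 403, and 404). `classify_status` and its return labels are hypothetical names chosen for this example.

```python
RETRYABLE = {429, 500, 502, 503, 504}   # temporary conditions worth retrying
PERMANENT = {401, 403, 404}             # retrying will not change the outcome

def classify_status(status_code):
    """Return 'ok', 'redirect', 'retry', 'skip' or 'unknown' for an HTTP status code."""
    if 200 <= status_code < 300:
        return 'ok'
    if 300 <= status_code < 400:
        return 'redirect'
    if status_code in RETRYABLE:
        return 'retry'
    if status_code in PERMANENT:
        return 'skip'
    return 'unknown'

for code in (200, 302, 404, 429, 503):
    print(code, classify_status(code))
```

Redirects rarely need manual handling because `requests` follows them by default for GET requests, so the 3xx branch mostly matters when `allow_redirects=False` is used.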

### Our Error Handling Strategy:

#### 1. **Comprehensive Exception Handling**

```python
import requests
from requests.exceptions import (
    RequestException, ConnectionError, Timeout, 
    TooManyRedirects, HTTPError, ProxyError
)
import time
import random
from functools import wraps

class ScrapingErrorHandler:
    """Handle errors in web scraping operations"""
    
    def __init__(self, max_retries=3, base_delay=1, max_delay=60):
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.error_counts = {}
    
    def handle_request_errors(self, func):
        """Decorator to handle request errors with retry logic"""
        
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            
            for attempt in range(self.max_retries + 1):
                try:
                    return func(*args, **kwargs)
                    
                except ProxyError as e:
                    # ProxyError subclasses ConnectionError, so it must be caught first
                    last_exception = e
                    self.log_error('ProxyError', str(e), attempt)
                    if attempt < self.max_retries:
                        self.wait_with_backoff(attempt)
                        continue
                    
                except ConnectionError as e:
                    last_exception = e
                    self.log_error('ConnectionError', str(e), attempt)
                    if attempt < self.max_retries:
                        self.wait_with_backoff(attempt)
                        continue
                    
                except Timeout as e:
                    last_exception = e
                    self.log_error('Timeout', str(e), attempt)
                    if attempt < self.max_retries:
                        self.wait_with_backoff(attempt)
                        continue
                    
                except HTTPError as e:
                    last_exception = e
                    # e.response can be None when the error was raised without a response
                    status_code = e.response.status_code if e.response is not None else None
                    self.log_error('HTTPError', f"Status {status_code}: {str(e)}", attempt)
                    
                    # Don't retry certain status codes
                    if status_code in [403, 404, 401]:
                        break
                    
                    if attempt < self.max_retries:
                        self.wait_with_backoff(attempt)
                        continue
                    
                except TooManyRedirects as e:
                    last_exception = e
                    self.log_error('TooManyRedirects', str(e), attempt)
                    if attempt < self.max_retries:
                        self.wait_with_backoff(attempt)
                        continue
                    
                except RequestException as e:
                    last_exception = e
                    self.log_error('RequestException', str(e), attempt)
                    if attempt < self.max_retries:
                        self.wait_with_backoff(attempt)
                        continue
                    
                except Exception as e:
                    last_exception = e
                    self.log_error('UnexpectedError', str(e), attempt)
                    break
            
            # If all retries failed, raise the last exception
            raise last_exception
        
        return wrapper
    
    def wait_with_backoff(self, attempt):
        """Wait with exponential backoff and jitter"""
        
        # Exponential backoff with jitter
        delay = min(
            self.base_delay * (2 ** attempt) + random.uniform(0, 1),
            self.max_delay
        )
        
        print(f"Waiting {delay:.2f} seconds before retry {attempt + 1}")
        time.sleep(delay)
    
    def log_error(self, error_type, message, attempt):
        """Log error information"""
        
        # Track error counts
        if error_type not in self.error_counts:
            self.error_counts[error_type] = 0
        self.error_counts[error_type] += 1
        
        print(f"[{error_type}] Attempt {attempt + 1}: {message}")
    
    def get_error_stats(self):
        """Get error statistics"""
        return self.error_counts.copy()
```
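
To make the backoff formula in `wait_with_backoff` concrete, the sketch below computes the delay schedule without sleeping. The `base_delay=1.0` and `max_delay=60.0` values are assumptions chosen for illustration (the handler's real values come from its constructor), and `jitter` is an extra parameter added here so the output is reproducible.

```python
import random


def backoff_delay(attempt, base_delay=1.0, max_delay=60.0, jitter=None):
    """Return the delay in seconds for a zero-based attempt number.

    Mirrors wait_with_backoff: base_delay * 2**attempt plus up to one
    second of jitter, capped at max_delay. The default values are
    illustrative assumptions, not the handler's confirmed settings.
    """
    if jitter is None:
        jitter = random.uniform(0, 1)
    return min(base_delay * (2 ** attempt) + jitter, max_delay)


# Fix jitter at 0 so the schedule is deterministic
schedule = [backoff_delay(attempt, jitter=0.0) for attempt in range(8)]
print(schedule)
```

With these assumed values the cap takes effect from the seventh attempt onward; without `max_delay`, the eighth retry would wait more than two minutes.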

#### 2. **Robust Request Implementation**

```python
import requests

class RobustScraper:
    """Implement robust scraping with comprehensive error handling"""
    
    def __init__(self):
        self.error_handler = ScrapingErrorHandler()
        self.session = self.setup_session()
    
    def setup_session(self):
        """Setup requests session with proper configuration"""
        
        session = requests.Session()
        
        # Configure retry strategy
        from requests.adapters import HTTPAdapter
        from urllib3.util.retry import Retry
        
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["HEAD", "GET", "OPTIONS"]
        )
        
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        
        # Set default headers
        session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
        })
        
        return session
    
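    def make_request(self, url, **kwargs):
        """Make HTTP request with error handling"""
        
        # error_handler is an instance attribute, so it cannot be used as a
        # class-level decorator; wrap the raw request at call time instead
        guarded = self.error_handler.handle_request_errors(self._send_request)
        return guarded(url, **kwargs)
    
    def _send_request(self, url, **kwargs):
        """Send a single GET request and raise on HTTP error statuses"""
        
        # Set default timeout
        kwargs.setdefault('timeout', 30)
        
        # Make request
        response = self.session.get(url, **kwargs)
        
        # Check for HTTP errors
        response.raise_for_status()
        
        return response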
    
    def scrape_url_safely(self, url):
        """Scrape URL with comprehensive error handling"""
        
        try:
            response = self.make_request(url)
            
            # Check content type
            content_type = response.headers.get('content-type', '')
            if 'text/html' not in content_type:
                print(f"Warning: Non-HTML content type: {content_type}")
            
            # Check content length
            content_length = len(response.content)
            if content_length < 100:
                print(f"Warning: Very short content ({content_length} bytes)")
            
            return response
            
        except Exception as e:
            print(f"Failed to scrape {url}: {e}")
            return None
```

### Advanced Resilience Patterns:

#### 1. **Circuit Breaker Pattern**

```python
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    """Implement circuit breaker pattern for scraping"""
    
    def __init__(self, failure_threshold=5, recovery_timeout=60, expected_exception=Exception):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED
    
    def call(self, func, *args, **kwargs):
        """Execute function with circuit breaker protection"""
        
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is OPEN")
        
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
            
        except self.expected_exception:
            self._on_failure()
            raise
    
    def _should_attempt_reset(self):
        """Check if circuit breaker should attempt reset"""
        return (
            time.time() - self.last_failure_time >= self.recovery_timeout
        )
    
    def _on_success(self):
        """Handle successful call"""
        self.failure_count = 0
        self.state = CircuitState.CLOSED
    
    def _on_failure(self):
        """Handle failed call"""
        self.failure_count += 1
        self.last_failure_time = time.time()
        
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Usage with circuit breaker
circuit_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=30)
scraper = RobustScraper()

def scrape_with_circuit_breaker(url):
    """Scrape URL with circuit breaker protection"""
    return circuit_breaker.call(scraper.make_request, url)
```
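
The state transitions are easier to follow when time is controlled. The sketch below is a compact, self-contained variant of the breaker above with an injectable clock; `DemoBreaker`, `failing_fetch`, and `working_fetch` are hypothetical names used only for this demonstration, not part of the project.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class DemoBreaker:
    """Compact circuit breaker with an injectable clock (demonstration only)."""

    def __init__(self, failure_threshold, recovery_timeout, clock):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # callable returning the current time in seconds
        self.failure_count = 0
        self.last_failure_time = None
        self.state = State.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.last_failure_time >= self.recovery_timeout:
                self.state = State.HALF_OPEN  # allow one trial call
            else:
                raise RuntimeError("Circuit breaker is OPEN")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = self.clock()
            if self.failure_count >= self.failure_threshold:
                self.state = State.OPEN
            raise
        self.failure_count = 0
        self.state = State.CLOSED
        return result


now = [0.0]  # fake clock so the demo does not need to sleep
breaker = DemoBreaker(failure_threshold=3, recovery_timeout=30, clock=lambda: now[0])


def failing_fetch(url):
    raise ConnectionError(f"cannot reach {url}")


def working_fetch(url):
    return f"<html>{url}</html>"


states = []
for _ in range(3):
    try:
        breaker.call(failing_fetch, "https://example.com")
    except ConnectionError:
        pass
    states.append(breaker.state)

# While OPEN, calls are rejected without running the function
try:
    breaker.call(working_fetch, "https://example.com")
    rejected = False
except RuntimeError:
    rejected = True

now[0] = 31.0  # recovery_timeout has elapsed
result = breaker.call(working_fetch, "https://example.com")
print(states, rejected, breaker.state, result)
```

The third failure opens the circuit, the next call is rejected immediately, and once the timeout has passed a single successful trial call closes the circuit again.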

#### 2. **Fallback Strategies**

```python
class FallbackScraper:
    """Implement fallback strategies for scraping"""
    
    def __init__(self):
        self.primary_scraper = RobustScraper()
        self.fallback_scrapers = [
            self._scrape_with_selenium,
            self._scrape_with_requests_alternative,
            self._scrape_with_cached_content,
        ]
    
    def scrape_with_fallbacks(self, url):
        """Scrape URL with multiple fallback strategies"""
        
        # Try primary scraper first
        try:
            result = self.primary_scraper.scrape_url_safely(url)
            if result and result.status_code == 200:
                return result
        except Exception as e:
            print(f"Primary scraper failed: {e}")
        
        # Try fallback strategies
        for i, fallback_scraper in enumerate(self.fallback_scrapers):
            try:
                print(f"Trying fallback strategy {i + 1}")
                result = fallback_scraper(url)
                if result:
                    return result
            except Exception as e:
                print(f"Fallback strategy {i + 1} failed: {e}")
                continue
        
        print("All scraping strategies failed")
        return None
    
    def _scrape_with_selenium(self, url):
        """Fallback: Use Selenium for JavaScript-heavy sites"""
        try:
            from selenium import webdriver
            from selenium.webdriver.chrome.options import Options
            
            options = Options()
            options.add_argument('--headless')
            options.add_argument('--no-sandbox')
            options.add_argument('--disable-dev-shm-usage')
            
            driver = webdriver.Chrome(options=options)
            try:
                driver.get(url)
                
                # Wait for content to load
                time.sleep(3)
                
                page_source = driver.page_source
            finally:
                # Always release the browser, even if loading the page fails
                driver.quit()
            
            # Create mock response object
            class MockResponse:
                def __init__(self, content):
                    self.content = content.encode()
                    self.status_code = 200
                    self.headers = {'content-type': 'text/html'}
            
            return MockResponse(page_source)
            
        except Exception as e:
            print(f"Selenium fallback failed: {e}")
            return None
    
    def _scrape_with_requests_alternative(self, url):
        """Fallback: Use alternative request configuration"""
        try:
            # Try with different user agent
            headers = {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15'
            }
            
            response = requests.get(url, headers=headers, timeout=60)
            response.raise_for_status()
            
            return response
            
        except Exception as e:
            print(f"Alternative requests failed: {e}")
            return None
    
    def _scrape_with_cached_content(self, url):
        """Fallback: Use cached content if available"""
        try:
            # Check if we have cached content
            cache_key = f"cached_{hash(url)}"
            # This would be implemented with actual caching logic
            print("No cached content available")
            return None
            
        except Exception as e:
            print(f"Cache fallback failed: {e}")
            return None
```
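
`_scrape_with_cached_content` is left as a stub above. One possible shape for the missing caching logic is sketched below, assuming a simple in-memory store with a maximum age; `PageCache`, `max_age`, and the `clock` parameter are hypothetical and not part of the project. The key is derived with `hashlib` rather than the built-in `hash()`, because `hash()` of a string differs between Python processes and so cannot be used for a cache that outlives one run.

```python
import hashlib
import time


class PageCache:
    """In-memory page cache keyed by a stable hash of the URL (hypothetical helper)."""

    def __init__(self, max_age=3600, clock=time.time):
        self.max_age = max_age  # seconds before an entry is considered stale
        self.clock = clock      # injectable so tests do not need to sleep
        self._entries = {}

    @staticmethod
    def make_key(url):
        # sha256 gives the same key in every process, unlike hash() on a str
        return "cached_" + hashlib.sha256(url.encode("utf-8")).hexdigest()

    def put(self, url, html):
        self._entries[self.make_key(url)] = (self.clock(), html)

    def get(self, url):
        """Return cached HTML, or None if missing or older than max_age."""
        entry = self._entries.get(self.make_key(url))
        if entry is None:
            return None
        stored_at, html = entry
        if self.clock() - stored_at > self.max_age:
            return None
        return html


now = [1000.0]
cache = PageCache(max_age=60, clock=lambda: now[0])
cache.put("https://example.com/derby", "<html>derby</html>")
fresh = cache.get("https://example.com/derby")
now[0] += 120  # two minutes later the entry is past max_age
stale = cache.get("https://example.com/derby")
print(fresh, stale)
```

A fallback built on this would return the cached HTML wrapped in the same kind of response object the other strategies return, and `None` when `get` finds nothing usable.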

#### 3. **Monitoring and Alerting**

```python
class ScrapingMonitor:
    """Monitor scraping operations and alert on issues"""
    
    def __init__(self):
        self.metrics = {
            'total_requests': 0,
            'successful_requests': 0,
            'failed_requests': 0,
            'error_types': {},
            'response_times': [],
            'last_successful_request': None,
            'last_failed_request': None,
        }
        
        self.alert_thresholds = {
            'failure_rate': 0.3,  # 30% failure rate
            'max_response_time': 30,  # 30 seconds
            'consecutive_failures': 5,
        }
    
    def record_request(self, success, error_type=None, response_time=None):
        """Record request metrics"""
        
        self.metrics['total_requests'] += 1
        
        if success:
            self.metrics['successful_requests'] += 1
            self.metrics['last_successful_request'] = time.time()
            # A success ends any run of consecutive failures
            self.metrics['consecutive_failures'] = 0
        else:
            self.metrics['failed_requests'] += 1
            self.metrics['last_failed_request'] = time.time()
            self.metrics['consecutive_failures'] = (
                self.metrics.get('consecutive_failures', 0) + 1
            )
            
            if error_type:
                if error_type not in self.metrics['error_types']:
                    self.metrics['error_types'][error_type] = 0
                self.metrics['error_types'][error_type] += 1
        
        # Compare against None so a measured time of 0 is still recorded
        if response_time is not None:
            self.metrics['response_times'].append(response_time)
            # Keep only last 100 response times
            if len(self.metrics['response_times']) > 100:
                self.metrics['response_times'] = self.metrics['response_times'][-100:]
        
        # Check for alerts
        self.check_alerts()
    
    def check_alerts(self):
        """Check if any alerts should be triggered"""
        
        total = self.metrics['total_requests']
        if total == 0:
            return
        
        # Check failure rate
        failure_rate = self.metrics['failed_requests'] / total
        if failure_rate > self.alert_thresholds['failure_rate']:
            self.trigger_alert(f"High failure rate: {failure_rate:.2%}")
        
        # Check response time
        if self.metrics['response_times']:
            avg_response_time = sum(self.metrics['response_times']) / len(self.metrics['response_times'])
            if avg_response_time > self.alert_thresholds['max_response_time']:
                self.trigger_alert(f"High response time: {avg_response_time:.2f}s")
        
        # Check consecutive failures (run length since the last success,
        # not total failures minus total successes)
        consecutive_failures = self.metrics.get('consecutive_failures', 0)
        if consecutive_failures >= self.alert_thresholds['consecutive_failures']:
            self.trigger_alert(f"Consecutive failures: {consecutive_failures}")
    
    def trigger_alert(self, message):
        """Trigger alert for monitoring issues"""
        print(f"ALERT: {message}")
        # In production, this would send to monitoring system
        # send_to_monitoring_system(message)
    
    def get_metrics(self):
        """Get current metrics"""
        return self.metrics.copy()
```
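
`check_alerts` computes the failure rate over every request since startup, so a burst of early failures keeps influencing the rate long after the site has recovered. One alternative, sketched here as an assumption rather than as part of the project, is to compute the rate over a sliding window of recent requests; `RollingFailureRate` and `window_size` are hypothetical names.

```python
from collections import deque


class RollingFailureRate:
    """Failure rate over the most recent window_size requests (hypothetical helper)."""

    def __init__(self, window_size=20):
        # deque with maxlen drops the oldest outcome automatically
        self.outcomes = deque(maxlen=window_size)

    def record(self, success):
        self.outcomes.append(bool(success))

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)


tracker = RollingFailureRate(window_size=10)
for _ in range(10):
    tracker.record(False)  # simulated outage
rate_during_outage = tracker.failure_rate()
for _ in range(10):
    tracker.record(True)   # recovery pushes the failures out of the window
rate_after_recovery = tracker.failure_rate()
print(rate_during_outage, rate_after_recovery)
```

After the same 20 requests a cumulative rate would still be 50%, above the 30% alert threshold, even though the last ten requests all succeeded.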

### Recovery and Self-Healing:

#### 1. **Automatic Recovery**

```python
class SelfHealingScraper:
    """Implement self-healing scraping system"""
    
    def __init__(self):
        self.monitor = ScrapingMonitor()
        self.circuit_breaker = CircuitBreaker()
        self.fallback_scraper = FallbackScraper()
        self.health_check_interval = 300  # 5 minutes
        self.last_health_check = 0
    
    def scrape_with_self_healing(self, url):
        """Scrape URL with self-healing capabilities"""
        
        # Check if health check is needed
        if time.time() - self.last_health_check > self.health_check_interval:
            self.perform_health_check()
            self.last_health_check = time.time()
        
        # Try scraping with circuit breaker
        def attempt():
            # scrape_with_fallbacks returns None instead of raising, so turn
            # that into an exception the circuit breaker can count
            result = self.fallback_scraper.scrape_with_fallbacks(url)
            if result is None:
                raise RuntimeError("No result")
            return result
        
        try:
            result = self.circuit_breaker.call(attempt)
            self.monitor.record_request(True)
            return result
            
        except Exception as e:
            self.monitor.record_request(False, type(e).__name__)
            return None
    
    def perform_health_check(self):
        """Perform health check on scraping system"""
        
        print("Performing health check...")
        
        # Test with a simple URL
        test_url = "https://httpbin.org/status/200"
        
        try:
            response = requests.get(test_url, timeout=10)
            if response.status_code == 200:
                print("Health check passed")
                return True
            else:
                print(f"Health check failed: Status {response.status_code}")
                return False
                
        except Exception as e:
            print(f"Health check failed: {e}")
            return False
    
    def get_system_status(self):
        """Get current system status"""
        
        metrics = self.monitor.get_metrics()
        circuit_state = self.circuit_breaker.state
        
        status = {
            'circuit_breaker_state': circuit_state.value,
            'metrics': metrics,
            'health_status': 'healthy' if self.perform_health_check() else 'unhealthy'
        }
        
        return status
```

### Benefits for Our RAG Chatbot:

1. **Continuous Operation**: System continues working despite errors
2. **Automatic Recovery**: Self-healing capabilities reduce manual intervention
3. **Quality Assurance**: Content-type and length checks flag suspicious responses before they are processed
4. **Monitoring**: Real-time monitoring of scraping health
5. **Fallback Strategies**: Multiple approaches ensure data availability
6. **Performance**: Circuit breakers prevent cascading failures
7. **Reliability**: Robust error handling increases system reliability
8. **Maintainability**: Clear error logging and monitoring for debugging


## Q6: How do we integrate web scraping with our RAG system for optimal performance?

**Answer:**

Integrating web scraping with our RAG system means coordinating scraping, cleaning, chunking, and embedding so that the pipeline stays fast and reliable and only useful content is indexed. Our Greek Derby chatbot does this with an integration pipeline that turns scraped pages into chunked documents, embeds them, and stores the embeddings in the vector database.

### Integration Architecture:

#### 1. **End-to-End Data Pipeline**

```python
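# Complete integration pipeline
# Import paths assume the split LangChain packages; adjust to your installed version
from datetime import datetime

from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

class RAGScrapingIntegration:
    """Integrate web scraping with RAG system"""
    
    def __init__(self, pinecone_index):
        # Initialize components
        self.scraper = SelfHealingScraper()
        self.preprocessor = RAGPreprocessor()
        # scrape_and_ingest calls self.extract_content; delegate to the
        # ContentExtractor defined in the next section
        self.extract_content = ContentExtractor().extract_content
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=100,
            separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""]
        )
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-small",
            dimensions=1024
        )
        # pinecone_index: an existing Pinecone index created with dimension 1024
        self.vector_store = PineconeVectorStore(
            embedding=self.embeddings,
            index=pinecone_index
        )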
    def scrape_and_ingest(self, urls):
        """Scrape URLs and ingest into RAG system"""
        
        all_documents = []
        
        for url in urls:
            try:
                # Scrape content
                response = self.scraper.scrape_with_self_healing(url)
                if not response:
                    continue
                
                # Extract content
                content = self.extract_content(response)
                if not content:
                    continue
                
                # Create document
                doc = Document(
                    page_content=content,
                    metadata={
                        'source': url,
                        'scraped_at': datetime.now().isoformat(),
                        'content_type': 'web_scraped'
                    }
                )
                
                all_documents.append(doc)
                
            except Exception as e:
                print(f"Error processing {url}: {e}")
                continue
        
        # Process documents
        processed_docs = self.preprocessor.preprocess_documents(all_documents)
        
        # Split into chunks
        split_docs = self.text_splitter.split_documents(processed_docs)
        
        # Add to vector store
        self.vector_store.add_documents(split_docs)
        
        return len(split_docs)
```
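
To show what `chunk_size=500` and `chunk_overlap=100` mean in practice, here is a deliberately simplified fixed-size splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on the listed separators first and merges pieces up to the size limit); `split_with_overlap` is a hypothetical helper that only illustrates how consecutive chunks share text.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size chunks whose ends overlap.

    Simplified illustration only: ignores sentence and paragraph
    boundaries, unlike the recursive splitter used in the pipeline.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


# 1200 characters of dummy text
text = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(text)
print([len(chunk) for chunk in chunks])
```

Each chunk starts 400 characters after the previous one, so the last 100 characters of one chunk are repeated at the start of the next. That overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.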

#### 2. **Content Extraction and Processing**

```python
class ContentExtractor:
    """Extract and process content from scraped responses"""
    
    def __init__(self):
        self.html_extractor = HTMLContentExtractor()
        self.greek_cleaner = GreekTextCleaner()
        self.quality_assessor = ContentQualityAssessor()
    
    def extract_content(self, response):
        """Extract clean content from HTTP response"""
        
        # Get HTML content
        html_content = response.content.decode('utf-8', errors='ignore')
        
        # Extract clean text
        clean_text = self.html_extractor.extract_clean_text(html_content)
        
        # Clean Greek text
        clean_text = self.greek_cleaner.clean_greek_text(clean_text)
        
        # Assess quality
        quality_score, quality_level, checks = self.quality_assessor.assess_quality(clean_text)
        
        # Skip low-quality content
        if quality_score < 0.6:
            print(f"Skipping low-quality content: {quality_level}")
            return None
        
        return clean_text
```

### Performance Optimization:

#### 1. **Batch Processing**

```python
class BatchProcessor:
    """Process multiple URLs in batches for efficiency"""
    
    def __init__(self, batch_size=10, max_workers=5):
        self.batch_size = batch_size
        self.max_workers = max_workers
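        self.scraper = SelfHealingScraper()
        self.preprocessor = RAGPreprocessor()
        # scrape_single_url calls self.extract_content; delegate to the
        # ContentExtractor defined above
        self.extract_content = ContentExtractor().extract_content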
    
    def process_urls_in_batches(self, urls):
        """Process URLs in batches with parallel processing"""
        
        from concurrent.futures import ThreadPoolExecutor, as_completed
        
        all_documents = []
        
        # Split URLs into batches
        url_batches = [urls[i:i + self.batch_size] for i in range(0, len(urls), self.batch_size)]
        
        for batch in url_batches:
            print(f"Processing batch of {len(batch)} URLs")
            
            # Process batch in parallel
            with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
                future_to_url = {
                    executor.submit(self.scrape_single_url, url): url 
                    for url in batch
                }
                
                for future in as_completed(future_to_url):
                    url = future_to_url[future]
                    try:
                        doc = future.result()
                        if doc:
                            all_documents.append(doc)
                    except Exception as e:
                        print(f"Error processing {url}: {e}")
            
            # Add delay between batches
            time.sleep(2)
        
        return all_documents
    
    def scrape_single_url(self, url):
        """Scrape a single URL and return document"""
        
        try:
            response = self.scraper.scrape_with_self_healing(url)
            if not response:
                return None
            
            # Extract content
            content = self.extract_content(response)
            if not content:
                return None
            
            # Create document
            doc = Document(
                page_content=content,
                metadata={
                    'source': url,
                    'scraped_at': datetime.now().isoformat(),
                    'content_type': 'web_scraped'
                }
            )
            
            return doc
            
        except Exception as e:
            print(f"Error scraping {url}: {e}")
            return None
```
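
The batching logic can be tried without touching the network. The sketch below isolates it into a standalone function and runs it against a stand-in fetch function; `chunked`, `process_in_batches`, and `fake_fetch` are hypothetical names, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(urls, fetch, batch_size=10, max_workers=5, delay=2.0):
    """Run fetch(url) for every URL, one batch at a time.

    Returns a dict mapping url -> result. URLs whose fetch raised an
    exception or returned None are left out.
    """
    results = {}
    batches = list(chunked(urls, batch_size))
    for index, batch in enumerate(batches):
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(fetch, url): url for url in batch}
            for future in as_completed(futures):
                url = futures[future]
                try:
                    result = future.result()
                except Exception as exc:
                    print(f"Error processing {url}: {exc}")
                    continue
                if result is not None:
                    results[url] = result
        if index < len(batches) - 1:
            time.sleep(delay)  # pause between batches, not after the last one
    return results


def fake_fetch(url):
    """Stand-in for a real scraper call (hypothetical)."""
    if url.endswith("/3"):
        raise ValueError("simulated failure")
    return f"content of {url}"


urls = [f"https://example.com/{i}" for i in range(7)]
results = process_in_batches(urls, fake_fetch, batch_size=3, max_workers=2, delay=0)
print(len(results))
```

Two design points differ slightly from the class above: results are keyed by URL so the caller can see which pages are missing, and the pause is skipped after the final batch since nothing follows it.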

#### 2. **Caching and Incremental Updates**

```python
class IncrementalUpdater:
    """Implement incremental updates for scraped content"""
    
    def __init__(self):
        self.cache = {}  # In production, use Redis or similar
        self.last_update = {}
        self.update_interval = 3600  # 1 hour
    
    def should_update_url(self, url):
        """Check if URL should be updated"""
        
        now = time.time()
        last_update = self.last_update.get(url, 0)
        
        return (now - last_update) > self.update_interval
    
    def get_cached_content(self, url):
        """Get cached content if available"""
        
        if url in self.cache:
            return self.cache[url]
        
        return None
    
    def cache_content(self, url, content, metadata):
        """Cache content for future use"""
        
        self.cache[url] = {
            'content': content,
            'metadata': metadata,
            'cached_at': time.time()
        }
        
        self.last_update[url] = time.time()
    
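    def update_incremental(self, urls, batch_processor=None):
        """Update only URLs that need updating"""
        
        # BatchProcessor (defined above) does the scraping; pass one in to reuse it
        batch_processor = batch_processor or BatchProcessor()
        
        urls_to_update = []
        
        for url in urls:
            if self.should_update_url(url):
                urls_to_update.append(url)
            elif self.get_cached_content(url):
                # Still fresh: keep the cached copy
                print(f"Using cached content for {url}")
        
        if urls_to_update:
            print(f"Updating {len(urls_to_update)} URLs")
            # Process URLs that need updating
            fresh_docs = batch_processor.process_urls_in_batches(urls_to_update)
            
            # Cache the results so these URLs are skipped until the interval expires
            for doc in fresh_docs:
                self.cache_content(doc.metadata['source'], doc.page_content, doc.metadata)
            
            return fresh_docs
        
        return []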
```
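
The updater above decides what to refresh purely by age. To also avoid re-ingesting a page whose text is identical to the last scrape, one option is to remember a hash of the content per URL. The sketch below assumes that approach; `ChangeTracker` and its methods are hypothetical, and the Greek strings are made-up sample text.

```python
import hashlib


class ChangeTracker:
    """Remember a hash of each URL's last-seen content (hypothetical helper)."""

    def __init__(self):
        self._hashes = {}

    @staticmethod
    def _digest(content):
        # Collapse whitespace so formatting-only changes are not treated as edits
        normalized = " ".join(content.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def has_changed(self, url, content):
        """Return True if content differs from the last version seen for url."""
        digest = self._digest(content)
        changed = self._hashes.get(url) != digest
        self._hashes[url] = digest
        return changed


tracker = ChangeTracker()
url = "https://example.com/derby-preview"
first = tracker.has_changed(url, "Ολυμπιακός - Παναθηναϊκός: προεπισκόπηση")
same = tracker.has_changed(url, "Ολυμπιακός  -  Παναθηναϊκός: προεπισκόπηση\n")
edited = tracker.has_changed(url, "Ολυμπιακός - Παναθηναϊκός: ενημέρωση")
print(first, same, edited)
```

Only pages for which `has_changed` returns `True` would need to be re-chunked and re-embedded, which saves embedding calls when a page is re-scraped but not edited.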

### Quality Assurance:

#### 1. **Content Validation**

```python
class ContentValidator:
    """Validate content quality before adding to RAG system"""
    
    def __init__(self):
        self.min_length = 100
        self.max_length = 5000
        self.min_quality_score = 0.6
        self.required_entities = ['teams', 'competitions']
    
    def validate_document(self, doc):
        """Validate document quality and relevance"""
        
        content = doc.page_content
        metadata = doc.metadata
        
        # Check length
        if len(content) < self.min_length or len(content) > self.max_length:
            return False, "Invalid content length"
        
        # Check quality score (expects 'quality_score' in metadata;
        # documents without it default to 0 and are rejected)
        quality_score = metadata.get('quality_score', 0)
        if quality_score < self.min_quality_score:
            return False, "Low quality score"
        
        # Check for required entities
        entities = metadata.get('entities', {})
        if not any(entities.get(entity, []) for entity in self.required_entities):
            return False, "Missing required entities"
        
        # Check language
        language = metadata.get('language', 'unknown')
        if language not in ['greek', 'mixed']:
            return False, "Not Greek content"
        
        return True, "Valid document"
    
    def validate_batch(self, documents):
        """Validate a batch of documents"""
        
        valid_docs = []
        invalid_docs = []
        
        for doc in documents:
            is_valid, reason = self.validate_document(doc)
            
            if is_valid:
                valid_docs.append(doc)
            else:
                invalid_docs.append((doc, reason))
                print(f"Invalid document: {reason}")
        
        return valid_docs, invalid_docs
```

#### 2. **Duplicate Detection and Removal**

```python
import hashlib
import re

class DuplicateDetector:
    """Detect and remove duplicate content"""
    
    def __init__(self):
        self.content_hashes = set()
        self.similarity_threshold = 0.8
    
    def is_duplicate(self, content):
        """Check if content is duplicate"""
        
        # Create content hash
        content_hash = self.create_content_hash(content)
        
        if content_hash in self.content_hashes:
            return True
        
        # Add to hashes
        self.content_hashes.add(content_hash)
        return False
    
    def create_content_hash(self, content):
        """Create hash for content deduplication"""
        
        # Normalize content
        normalized = re.sub(r'\s+', ' ', content.strip().lower())
        
        # Create hash
        return hashlib.md5(normalized.encode()).hexdigest()
    
    def remove_duplicates(self, documents):
        """Remove duplicate documents"""
        
        unique_docs = []
        
        for doc in documents:
            if not self.is_duplicate(doc.page_content):
                unique_docs.append(doc)
            else:
                print(f"Duplicate document found: {doc.metadata.get('source', 'Unknown')}")
        
        return unique_docs
```
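
`similarity_threshold` is set in `__init__`, but the hash check only catches texts that are identical after normalization. A minimal sketch of how the threshold could be applied follows, assuming Jaccard similarity over word shingles. That choice is an assumption, since the source does not say which similarity measure is intended, and the function names are hypothetical.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams in text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.8):
    """True if the texts are at least `threshold` similar."""
    return jaccard_similarity(text_a, text_b) >= threshold


base = " ".join(f"word{i}" for i in range(20))
variant = base + " extra"  # same article with one word appended
unrelated = " ".join(f"other{i}" for i in range(20))
print(is_near_duplicate(base, variant), is_near_duplicate(base, unrelated))
```

Comparing every new document against every stored one grows quadratically, so this is only practical for small batches; larger collections usually need an approximate method.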

### Monitoring and Metrics:

#### 1. **Performance Metrics**

```python
class ScrapingMetrics:
    """Track scraping and RAG integration metrics"""
    
    def __init__(self):
        self.metrics = {
            'total_urls_scraped': 0,
            'successful_scrapes': 0,
            'failed_scrapes': 0,
            'documents_processed': 0,
            'documents_added_to_vector_db': 0,
            'duplicate_documents_removed': 0,
            'low_quality_documents_skipped': 0,
            'processing_time': 0,
            'vector_db_size': 0,
        }
    
    def record_scrape_attempt(self, url, success):
        """Record scrape attempt"""
        
        self.metrics['total_urls_scraped'] += 1
        
        if success:
            self.metrics['successful_scrapes'] += 1
        else:
            self.metrics['failed_scrapes'] += 1
    
    def record_document_processing(self, total_docs, added_docs, duplicates, skipped):
        """Record document processing results"""
        
        self.metrics['documents_processed'] += total_docs
        self.metrics['documents_added_to_vector_db'] += added_docs
        self.metrics['duplicate_documents_removed'] += duplicates
        self.metrics['low_quality_documents_skipped'] += skipped
    
    def record_processing_time(self, start_time, end_time):
        """Record processing time"""
        
        processing_time = end_time - start_time
        self.metrics['processing_time'] = processing_time
    
    def get_metrics(self):
        """Get current metrics"""
        return self.metrics.copy()
    
    def get_success_rate(self):
        """Get scraping success rate"""
        
        total = self.metrics['total_urls_scraped']
        if total == 0:
            return 0
        
        return self.metrics['successful_scrapes'] / total
    
    def get_processing_efficiency(self):
        """Get document processing efficiency"""
        
        total = self.metrics['documents_processed']
        if total == 0:
            return 0
        
        return self.metrics['documents_added_to_vector_db'] / total
```

#### 2. **Real-time Monitoring**

```python
class RealTimeMonitor:
    """Monitor scraping and RAG integration in real-time"""
    
    def __init__(self):
        self.metrics = ScrapingMetrics()
        self.start_time = time.time()
        self.last_update = time.time()
    
    def monitor_scraping_progress(self, urls, processed_urls):
        """Monitor scraping progress"""
        
        # Guard against an empty URL list to avoid division by zero
        progress = len(processed_urls) / len(urls) * 100 if urls else 0.0
        elapsed_time = time.time() - self.start_time
        
        print(f"Scraping Progress: {progress:.1f}% ({len(processed_urls)}/{len(urls)})")
        print(f"Elapsed Time: {elapsed_time:.1f}s")
        
        # Calculate ETA
        if len(processed_urls) > 0:
            avg_time_per_url = elapsed_time / len(processed_urls)
            remaining_urls = len(urls) - len(processed_urls)
            eta = remaining_urls * avg_time_per_url
            print(f"ETA: {eta:.1f}s")
    
    def monitor_quality_metrics(self):
        """Monitor content quality metrics"""
        
        success_rate = self.metrics.get_success_rate()
        efficiency = self.metrics.get_processing_efficiency()
        
        print(f"Success Rate: {success_rate:.1%}")
        print(f"Processing Efficiency: {efficiency:.1%}")
        
        # Check for quality issues
        if success_rate < 0.8:
            print("WARNING: Low success rate detected")
        
        if efficiency < 0.7:
            print("WARNING: Low processing efficiency detected")
```

### Complete Integration Example:

```python
class CompleteRAGScrapingIntegration:
    """Complete integration of web scraping with RAG system"""
    
    def __init__(self, pinecone_index):
        self.scraper = SelfHealingScraper()
        self.batch_processor = BatchProcessor()
        self.incremental_updater = IncrementalUpdater()
        self.validator = ContentValidator()
        self.duplicate_detector = DuplicateDetector()
        self.preprocessor = RAGPreprocessor()  # used in Step 4
        self.monitor = RealTimeMonitor()
        self.vector_store = PineconeVectorStore(
            # Same embedding settings as RAGScrapingIntegration so vector sizes match the index
            embedding=OpenAIEmbeddings(model="text-embedding-3-small", dimensions=1024),
            index=pinecone_index
        )
    
    def run_complete_pipeline(self, urls):
        """Run complete scraping and RAG integration pipeline"""
        
        start_time = time.time()
        
        print("Starting complete RAG scraping pipeline...")
        
        # Step 1: Scrape URLs
        print("Step 1: Scraping URLs...")
        documents = self.batch_processor.process_urls_in_batches(urls)
        
        # Step 2: Validate documents
        print("Step 2: Validating documents...")
        valid_docs, invalid_docs = self.validator.validate_batch(documents)
        
        # Step 3: Remove duplicates
        print("Step 3: Removing duplicates...")
        unique_docs = self.duplicate_detector.remove_duplicates(valid_docs)
        
        # Step 4: Process for RAG
        print("Step 4: Processing for RAG...")
        processed_docs = self.preprocessor.preprocess_documents(unique_docs)
        
        # Step 5: Add to vector database
        print("Step 5: Adding to vector database...")
        self.vector_store.add_documents(processed_docs)
        
        # Step 6: Record metrics
        end_time = time.time()
        self.monitor.metrics.record_processing_time(start_time, end_time)
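        self.monitor.metrics.record_document_processing(
            len(documents),                     # documents scraped
            len(processed_docs),                # documents added to the vector DB
            len(valid_docs) - len(unique_docs), # duplicates removed
            len(invalid_docs)                   # low-quality documents skipped
        )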
        
        # Step 7: Report results
        self.report_results()
        
        return len(processed_docs)
    
    def report_results(self):
        """Report pipeline results"""
        
        metrics = self.monitor.metrics.get_metrics()
        
        print("\n" + "="*50)
        print("PIPELINE RESULTS")
        print("="*50)
        print(f"Total URLs scraped: {metrics['total_urls_scraped']}")
        print(f"Successful scrapes: {metrics['successful_scrapes']}")
        print(f"Failed scrapes: {metrics['failed_scrapes']}")
        print(f"Documents processed: {metrics['documents_processed']}")
        print(f"Documents added to vector DB: {metrics['documents_added_to_vector_db']}")
        print(f"Duplicates removed: {metrics['duplicate_documents_removed']}")
        print(f"Low quality skipped: {metrics['low_quality_documents_skipped']}")
        print(f"Processing time: {metrics['processing_time']:.1f}s")
        print(f"Success rate: {self.monitor.metrics.get_success_rate():.1%}")
        print(f"Processing efficiency: {self.monitor.metrics.get_processing_efficiency():.1%}")
        print("="*50)
```
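
The stage order and the bookkeeping in Step 6 can be tried without network access or a Pinecone index. The following is a minimal sketch, not the lesson's actual classes: `Doc`, `InMemoryStore`, `dry_run`, and the `min_length` validation rule are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a scraped document."""
    url: str
    text: str


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the vector store; it only collects documents."""
    docs: list = field(default_factory=list)

    def add_documents(self, docs):
        self.docs.extend(docs)


def dry_run(documents, store, min_length=20):
    """Validate, de-duplicate, and store documents, returning stage counts."""
    # Validation (assumed rule): keep documents whose text is long enough.
    valid = [d for d in documents if len(d.text.strip()) >= min_length]
    invalid = [d for d in documents if len(d.text.strip()) < min_length]

    # De-duplication: keep the first document seen for each distinct text.
    seen, unique = set(), []
    for d in valid:
        key = d.text.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(d)

    store.add_documents(unique)
    return {
        "scraped": len(documents),
        "added": len(unique),
        "duplicates_removed": len(valid) - len(unique),
        "low_quality_skipped": len(invalid),
    }


docs = [
    Doc("https://example.com/a", "Olympiakos beat Panathinaikos in the derby."),
    Doc("https://example.com/b", "Olympiakos beat Panathinaikos in the derby."),
    Doc("https://example.com/c", "Too short"),
]
store = InMemoryStore()
counts = dry_run(docs, store)
print(counts)
```

In this sketch every scraped document ends up in exactly one bucket, so `scraped == added + duplicates_removed + low_quality_skipped`. That identity is a cheap sanity check to run against the real metrics, provided the preprocessing step does not split documents into multiple chunks.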

### Benefits for Our RAG Chatbot:

1. **Seamless Integration**: A single `run_complete_pipeline()` call carries data from scraping through to the vector database
2. **Quality Assurance**: Validation and duplicate detection run before anything reaches the vector database
3. **Performance Optimization**: Batch processing and caching for efficiency
4. **Real-time Monitoring**: Success rate and processing efficiency are tracked per run, so issues surface quickly
5. **Incremental Updates**: Only update content that has changed
6. **Duplicate Prevention**: Avoid storing duplicate content
7. **Error Recovery**: Robust error handling throughout the pipeline
8. **Scalability**: Handle large amounts of data efficiently
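
Incremental updates (item 5) are often implemented by fingerprinting page content and re-processing only the pages whose fingerprint has changed. The sketch below shows one way this might look. `content_fingerprint` and `select_changed` are hypothetical helpers, not the API of the `IncrementalUpdater` used above, and the normalization rule (collapse whitespace, lowercase) is an assumption.

```python
import hashlib


def content_fingerprint(text):
    """Return a SHA-256 hex digest of whitespace-normalized, lowercased text."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def select_changed(pages, known_fingerprints):
    """Return URLs whose content is new or changed, updating the fingerprint map."""
    changed = []
    for url, text in pages.items():
        fp = content_fingerprint(text)
        if known_fingerprints.get(url) != fp:
            known_fingerprints[url] = fp
            changed.append(url)
    return changed


known = {}
# First run: every page is new, so both URLs are selected.
first = select_changed({"u1": "Derby  preview", "u2": "Match report"}, known)
# Second run: u1 differs only in case and spacing, u2 has new text.
second = select_changed({"u1": "derby preview", "u2": "Match report (updated)"}, known)
print(first)
print(second)
```

Normalizing before hashing means cosmetic changes such as extra spaces or different casing do not trigger a re-embedding, while any change to the wording does. In a real deployment the fingerprint map would need to be persisted between scheduler runs.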


---

## 🎯 Summary

This lesson covered the essential web scraping concepts used in our Greek Derby RAG chatbot:

### Key Takeaways:

1. **Web Scraping Fundamentals**: Data extraction, HTML parsing, and content processing
2. **HTML Parsing & CSS Selectors**: Advanced parsing techniques and selector strategies
3. **Anti-Bot Detection**: User agent rotation, rate limiting, and ethical scraping practices
4. **Data Cleaning & Preprocessing**: Content quality assessment and Greek language processing
5. **Error Handling & Resilience**: Circuit breakers, fallback strategies, and self-healing systems
6. **RAG Integration**: End-to-end pipeline from scraping to vector database

### Next Steps:

- **Practice**: Try building your own web scraper with these techniques
- **Explore**: Learn about advanced scraping tools like Scrapy and Playwright
- **Advanced**: Study machine learning approaches for content extraction
- **Production**: Implement monitoring and alerting for scraping systems

### Project Structure:

```
rag-langchain-langgraph/
├── backend/
│   ├── scheduler/
│   │   ├── update_vector_db.py      # Web scraping scheduler
│   │   └── config.py                # Scraping configuration
│   └── standalone-service/
│       └── greek_derby_chatbot.py   # Main scraping logic
├── educational-content/
│   └── 05_web_scraping_concepts.ipynb  # This lesson
└── .env                              # Environment variables
```

### Web Scraping Benefits for Our RAG Chatbot:

1. **Fresh Data**: Content from Gazzetta.gr is as current as the most recent scheduled scrape
2. **Comprehensive Coverage**: Access to vast amounts of Greek football content
3. **Automated Updates**: The scheduler refreshes the vector database without manual content edits
4. **Quality Assurance**: Multiple validation layers ensure high-quality data
5. **Error Resilience**: Robust error handling and recovery mechanisms
6. **Performance**: Optimized processing and caching for efficiency
7. **Scalability**: Can handle multiple sources and large amounts of data
8. **Integration**: Scraped content flows directly into the vector database that the RAG system retrieves from

This architecture provides a solid foundation for building robust, scalable, and production-ready web scraping systems! 🚀
