# 🤖 EthicalScraper - Complete Tutorial

**Welcome to the EthicalScraper interactive tutorial!** 

This Python library performs **ethical web scraping**: it automatically respects robots.txt, rate-limits requests, and exports data to CSV/JSON.

## 🎯 What you'll learn in this tutorial:

- ✅ **How to verify** if a URL can be accessed (robots.txt)
- ✅ **How to perform scraping** ethically and safely
- ✅ **How to process multiple URLs** in batches with parallelization
- ✅ **How to export data** to CSV and JSON automatically
- ✅ **How to analyze results** with detailed statistics
- ✅ **Best practices** for ethical scraping

## ⚠️ IMPORTANT: 
This tutorial uses **safe example sites** (example.com, python.org). 

To use it in **production with your own sites**, you'll need to edit the file:
📄 **`production_config.py`** (copy from `production_config.example.py`)

**Let's get started! 🚀**

## 📦 Step 1: Installation and Configuration

**First, let's install the dependencies and import the necessary libraries.**

Run the cell below to ensure everything is installed:

💡 **Required file**: `requirements.txt` (should already be in the project folder)

In [None]:
# Install dependencies (if not already installed)
!pip3 install requests
# Optional: !pip3 install pandas matplotlib

In [None]:
# Import main classes
import sys
sys.path.append('../src')

from scraper_etico import ScraperEtico
from analyzer import RobotsAnalyzer
from batch_processor import BatchProcessor

# Optional imports - comment out if not installed
try:
    import pandas as pd
    import matplotlib.pyplot as plt
    PANDAS_AVAILABLE = True
except ImportError:
    print("⚠️ Pandas/Matplotlib not installed - some features will be limited")
    print("   To install: pip3 install pandas matplotlib")
    PANDAS_AVAILABLE = False

## 🚀 Step 2: Create Your First Scraper

**Now let's create an EthicalScraper instance and configure it properly.**

⚠️ **IMPORTANT:** Always configure a descriptive user-agent with your real information!

📄 **Related files**: 
- `src/scraper_etico.py` (main code - don't edit)
- `production_config.py` (your settings - edit this for production)

In [None]:
# Create scraper instance
scraper = ScraperEtico(
    user_agent="MyBot/1.0 (example.com/contact)",
    default_delay=1.0  # 1 second between requests
)

print("EthicalScraper initialized successfully!")
print(f"User-Agent: {scraper.user_agent}")
print(f"Default delay: {scraper.default_delay}s")
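
To make `default_delay` more concrete, here is a minimal sketch of how a per-domain delay could be enforced. `DomainRateLimiter` and its `wait` method are hypothetical names used only for illustration, not EthicalScraper's actual implementation. The clock and sleep functions are injectable so the logic can be exercised without really waiting.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Hypothetical helper: enforces a minimum interval between requests per domain."""

    def __init__(self, default_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.default_delay = default_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # domain -> time of the last request

    def wait(self, url, delay=None):
        """Sleep just long enough to honor the delay, then record the request time.

        Returns the number of seconds actually waited.
        """
        domain = urlparse(url).netloc
        delay = self.default_delay if delay is None else delay
        last = self._last_request.get(domain)
        waited = 0.0
        if last is not None:
            remaining = delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_request[domain] = self._clock()
        return waited
```

Calling `limiter.wait(url)` before each request would keep two requests to the same host at least `default_delay` seconds apart, while requests to different hosts are not held back.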

In [None]:
# 🧪 Step 3: First Test - Verify a URL

# Let's test with a safe and reliable site
test_url = "https://example.com/"

print(f"🔍 Testing: {test_url}")
print(f"📄 Checking robots.txt...")

# 1. Check if site allows scraping
can_access = scraper.can_fetch(test_url)
print(f"✅ Robots.txt allows access: {can_access}")

# 2. Check for specific crawl-delay
delay = scraper.get_crawl_delay(test_url)
if delay:
    print(f"⏱️  Site specifies crawl-delay: {delay} seconds")
else:
    print(f"⏱️  Using default delay: {scraper.default_delay} seconds")

# 3. If allowed, make the request
if can_access:
    print(f"\n📡 Making ethical request...")
    response = scraper.get(test_url)
    
    if response:
        print(f"✅ Success!")
        print(f"   Status Code: {response.status_code}")
        print(f"   Page size: {len(response.text):,} characters")
        print(f"   Content-Type: {response.headers.get('content-type', 'N/A')}")
        
        # Show content preview
        preview = response.text[:200].replace('\n', ' ')
        print(f"   Preview: {preview}...")
    else:
        print("❌ Request failed")
else:
    print("❌ Site doesn't allow scraping - respecting robots.txt")
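
For readers curious about what a robots.txt check involves, the sketch below uses Python's standard `urllib.robotparser` on a hypothetical robots.txt, parsed offline. It only illustrates the idea behind `can_fetch` and `get_crawl_delay`; it does not claim to show how EthicalScraper implements them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline (no network access)
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed_home = parser.can_fetch("MyBot/1.0", "https://example.com/")
allowed_private = parser.can_fetch("MyBot/1.0", "https://example.com/private/data")
crawl_delay = parser.crawl_delay("MyBot/1.0")  # None when no Crawl-delay applies

# Fall back to a default delay when the site does not specify one
default_delay = 1.0
effective_delay = crawl_delay if crawl_delay is not None else default_delay
```

Here the home page is allowed, the `/private/` path is not, and the site's crawl-delay of 5 seconds takes the place of the 1-second default.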

## 📊 Advanced robots.txt Analysis

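`RobotsAnalyzer` supports detailed analysis of robots.txt files. The cell below creates an instance, then downloads a robots.txt with `requests` and counts its directives by hand to show what such an analysis involves.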

In [None]:
# Create analyzer instance
analyzer = RobotsAnalyzer()

# Analyze robots.txt from a known site
test_site = "https://www.python.org"

# First, download robots.txt content
try:
    import requests
    response = requests.get(f"{test_site}/robots.txt", timeout=10)
    if response.status_code == 200:
        print(f"🔍 Manual analysis of robots.txt from {test_site}")
        
        # Simple content analysis
        robots_content = response.text
        lines = robots_content.split('\n')
        
        # Count elements
        user_agents = sum(1 for line in lines if line.lower().startswith('user-agent:'))
        disallows = sum(1 for line in lines if line.lower().startswith('disallow:'))
        allows = sum(1 for line in lines if line.lower().startswith('allow:'))
        sitemaps = sum(1 for line in lines if line.lower().startswith('sitemap:'))
        
        print(f"📄 Robots.txt found: ✅")
        print(f"🤖 User-agents defined: {user_agents}")
        print(f"🚫 Total rules: {disallows + allows}")
        print(f"🗺️  Sitemaps: {sitemaps}")
        
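        # Show the non-empty lines among the first 10
        print(f"\n📝 First 10 lines:")
        for line in lines[:10]:
            if line.strip():
                print(f"   {line}")
    else:
        # Report a missing or inaccessible robots.txt instead of staying silent
        print(f"❌ robots.txt not available (HTTP {response.status_code})")

except Exception as e:
    print(f"❌ Error analyzing: {e}")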
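
Counting directives says little about which rules apply to which bot. As a rough sketch, the hypothetical helper below groups `Allow`/`Disallow` rules by user-agent. It is a simplification: directives such as `Sitemap` and `Crawl-delay` are ignored, and it is not meant to reproduce `RobotsAnalyzer`.

```python
def group_rules_by_agent(robots_text):
    """Return {user_agent: [(directive, path), ...]} for Allow/Disallow rules."""
    groups = {}
    current_agents = []
    reading_agents = False  # True while consecutive User-agent lines are being read

    for raw_line in robots_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            if not reading_agents:
                current_agents = []  # a new group starts here
            current_agents.append(value)
            groups.setdefault(value, [])  # register agents that have no rules
            reading_agents = True
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
        # Other directives (Sitemap, Crawl-delay, ...) are ignored in this sketch

    return groups
```

With the content downloaded above, `group_rules_by_agent(robots_content)` would return one entry per user-agent, each with its own list of rules.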

In [6]:
# Show the sitemaps found (if any)
if 'robots_content' in locals():
    print("\n🗺️ Sitemaps found:")
    for line in robots_content.split('\n'):
        if line.lower().startswith('sitemap:'):
            print(f"  - {line.split(':', 1)[1].strip()}")
    
    # Show a few user-agents
    print("\n🤖 Some user-agents:")
    count = 0
    for line in robots_content.split('\n'):
        if line.lower().startswith('user-agent:') and count < 5:
            print(f"  - {line.split(':', 1)[1].strip()}")
            count += 1


🗺️ Sitemaps found:

🤖 Some user-agents:
  - HTTrack
  - puf
  - MSIECrawler
  - Krugle
  - Nutch


## 🔄 Step 4: Batch Processing (Multiple URLs)

**Now let's process multiple URLs at once using BatchProcessor.**

This is useful when you want to monitor multiple sites automatically.

📄 **Related files**:
- `src/batch_processor.py` (processing code - don't edit)
- For production, edit the site list in: `production_config.py` → `PRODUCTION_SITES`
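
One way to parallelize a batch without hammering any single site is to run different domains in parallel while keeping requests to the same domain sequential. The sketch below shows that idea with the standard library. `group_by_domain` and `process_in_batches` are hypothetical names, and nothing here is taken from `BatchProcessor`'s source. A real implementation would also wait between requests to the same host.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def group_by_domain(urls):
    """Group URLs by host so that each host is handled by a single worker."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)
    return dict(groups)


def process_in_batches(urls, fetch, max_workers=2):
    """Run fetch(url) for every URL: parallel across domains, sequential within one.

    `fetch` is any callable supplied by the caller. Exceptions are captured
    per URL so that one failure does not abort the whole batch.
    Returns a list of (url, result, error) tuples.
    """
    def process_domain(domain_urls):
        results = []
        for url in domain_urls:
            try:
                results.append((url, fetch(url), None))
            except Exception as exc:
                results.append((url, None, exc))
        return results

    groups = group_by_domain(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_domain = pool.map(process_domain, groups.values())
        return [item for domain_results in per_domain for item in domain_results]
```

Passing a function that performs the actual request as `fetch` yields one `(url, result, error)` tuple per URL, grouped by domain.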

In [None]:
# List of URLs to test
test_urls = [
    "https://httpbin.org/get",
    "https://www.python.org/about/",
    "https://docs.python.org/3/",
    "https://github.com/python",
    "https://stackoverflow.com/questions"
]

# Create the batch processor (the constructor takes no arguments)
batch_processor = BatchProcessor()

# Configure EthicalScraper with desired parameters
batch_processor.scraper = ScraperEtico(
    user_agent="Tutorial/1.0 (learning-ethical-scraping)",
    default_delay=1.5
)

print(f"📦 Batch processor created")
print(f"🔗 URLs to process: {len(test_urls)}")
print(f"🤖 User-agent: {batch_processor.scraper.user_agent}")
print(f"⏱️  Default delay: {batch_processor.scraper.default_delay}s")

In [None]:
# Execute batch processing
print("🚀 Starting batch processing...\n")

# max_workers is an argument of process_batch(), not of the constructor
job_state = batch_processor.process_batch(
    test_urls,
    max_workers=2,  # number of parallel workers
    show_progress=True
)

print(f"\n✨ Processing completed!")
print(f"📊 Statistics:")
print(f"   Total: {job_state.total_urls} URLs")
print(f"   Processed: {job_state.processed_count}")
print(f"   Success: {len(job_state.completed_urls)}")
print(f"   Failures: {len(job_state.failed_urls)}")
print(f"   Completion: {job_state.completion_percentage:.1f}%")

## 📈 Results Analysis

Let's analyze the results from batch processing.

In [9]:
# Analyze the results
if PANDAS_AVAILABLE:
    # Convert to a DataFrame for analysis
    df = pd.DataFrame([
        {
            'url': resultado.url,
            'domain': resultado.domain,
            'success': resultado.success,
            'robots_allowed': resultado.robots_allowed,
            'crawl_delay': resultado.crawl_delay,
            'status_code': resultado.status_code,
            'response_size': resultado.response_size,
            'error_type': resultado.error_type
        }
        for resultado in job_state.results
    ])
    
    print("📊 Results summary:")
    print(f"✅ Successful URLs: {df['success'].sum()}")
    print(f"❌ Failed URLs: {(~df['success']).sum()}")
    print(f"🤖 URLs allowed by robots.txt: {df['robots_allowed'].sum() if df['robots_allowed'].notna().any() else 'N/A'}")
    
    # Show the table
    print("\n📋 Details:")
    print(df[['url', 'success', 'robots_allowed', 'status_code']].head())
else:
    # Analysis without pandas
    print("📊 Results summary:")
    success_count = sum(1 for r in job_state.results if r.success)
    total_count = len(job_state.results)
    robots_allowed_count = sum(1 for r in job_state.results if r.robots_allowed)
    
    print(f"✅ Successful URLs: {success_count}")
    print(f"❌ Failed URLs: {total_count - success_count}")
    print(f"🤖 URLs allowed by robots.txt: {robots_allowed_count}")
    
    print("\n📋 Details:")
    for resultado in job_state.results[:5]:  # Show the first 5
        status = "✅" if resultado.success else "❌"
        robots_status = "🤖" if resultado.robots_allowed else "🚫"
        print(f"{status}{robots_status} {resultado.url[:50]}... - Status: {resultado.status_code}")

📊 Results summary:
✅ Successful URLs: 5
❌ Failed URLs: 0
🤖 URLs allowed by robots.txt: 5

📋 Details:
✅🤖 https://www.python.org/about/... - Status: 200
✅🤖 https://docs.python.org/3/... - Status: 200
✅🤖 https://github.com/python... - Status: 200
✅🤖 https://stackoverflow.com/questions... - Status: 200
✅🤖 https://httpbin.org/get... - Status: 200


In [10]:
# Create visualizations (if pandas/matplotlib are available)
if PANDAS_AVAILABLE:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    
    # Chart 1: allowed vs blocked by robots.txt.
    # Labels and colors are derived from the index because value_counts()
    # sorts by frequency, not by value.
    allowed = df['robots_allowed'].value_counts()
    ax1.pie(allowed.values,
            labels=['Allowed' if v else 'Blocked' for v in allowed.index],
            autopct='%1.1f%%',
            colors=['#4ecdc4' if v else '#ff6b6b' for v in allowed.index])
    ax1.set_title('URLs: Allowed vs Blocked')
    
    # Chart 2: successful vs failed requests.
    # The DataFrame built above has no 'robots_found' column, so this chart
    # uses 'success' instead.
    outcome = df['success'].value_counts()
    ax2.pie(outcome.values,
            labels=['Success' if v else 'Failure' for v in outcome.index],
            autopct='%1.1f%%',
            colors=['#66bb6a' if v else '#ffa726' for v in outcome.index])
    ax2.set_title('Requests: Success vs Failure')
    
    plt.tight_layout()
    plt.show()
else:
    print("📊 Visualizations unavailable - install pandas and matplotlib")

📊 Visualizations unavailable - install pandas and matplotlib


## 📊 Step 5: Export Results to CSV and JSON

**Now let's export the data to formats you can use in Excel, Python, or other tools.**

📄 **Generated files**:
- `scraping_results_YYYYMMDD_HHMMSS.csv` (Excel/Google Sheets)
- `scraping_results_YYYYMMDD_HHMMSS.json` (programmatic analysis)

📁 **Destination folder**: Files will be saved in the current notebook folder
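
If neither pandas nor the `BatchProcessor` export methods are at hand, the same two formats can be produced with the standard library. The sketch below is one possible approach, using a hypothetical `export_rows` helper and made-up records. It writes to in-memory buffers so it can be run without touching the disk.

```python
import csv
import io
import json


def export_rows(rows, csv_file, json_file):
    """Write a list of flat dicts to CSV and JSON file-like objects."""
    # Collect every key that appears, preserving first-seen order
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)  # missing keys and None are written as empty cells
    json.dump(rows, json_file, ensure_ascii=False, indent=2)


# Hypothetical records for illustration only
rows = [
    {"url": "https://example.com/", "success": True, "status_code": 200},
    {"url": "https://example.org/", "success": False, "status_code": None, "error_type": "timeout"},
]

csv_buffer, json_buffer = io.StringIO(), io.StringIO()
export_rows(rows, csv_buffer, json_buffer)
```

To write real files, replace the buffers with `open(path, "w", newline="", encoding="utf-8")`; `newline=""` is what the `csv` module expects for output files.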

In [None]:
# Export results
from datetime import datetime

# File name with timestamp
file_name = f"scraping_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}"

if PANDAS_AVAILABLE:
    # Export with pandas
    df.to_csv(f"{file_name}.csv", index=False)
    print(f"📄 Results saved to: {file_name}.csv")
else:
    # Export using BatchProcessor methods
    batch_processor.export_to_csv(job_state, f"{file_name}.csv")
    print(f"📄 Results saved to: {file_name}.csv")

# Export to JSON using BatchProcessor
batch_processor.export_to_json(job_state, f"{file_name}.json")
print(f"📄 Results saved to: {file_name}.json")

## 🛡️ Step 6: Ethical Principles - What You MUST Know

### ✅ **EthicalScraper ALWAYS does this automatically:**

- 🔍 **Checks robots.txt** before every access
- ⏱️ **Applies delays** between requests (never overloads a site)
- 🤖 **Identifies the bot** with a clear user-agent
- 📝 **Generates complete logs** for auditing
- 🚨 **Stops when blocked** (HTTP 429, robots.txt)
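
The "stop when blocked" rule can be pictured as a small decision function. The sketch below is an assumption about how such logic might look, not EthicalScraper's code: `decide` and `retry_after_seconds` are hypothetical helpers. The second one reads the standard `Retry-After` header so a log message can say when the server expects clients to come back.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def decide(robots_allowed, status_code=None):
    """Return 'skip' (never request), 'stop' (blocked by the server) or 'continue'."""
    if not robots_allowed:
        return "skip"
    if status_code == 429:
        return "stop"
    return "continue"


def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After header (seconds or HTTP-date) into seconds, or None."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # not a valid HTTP-date
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())
```

For example, `decide(True, 429)` returns `"stop"`, and `retry_after_seconds("120")` returns `120.0`.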

### ❌ **EthicalScraper NEVER:**

- ❌ Ignores robots.txt or terms of use
- ❌ Makes requests without a delay
- ❌ Uses fake browser user-agents
- ❌ Hides the bot's identity
- ❌ Keeps trying when blocked

### 🚨 **Your responsibilities as a user:**

1. **Configure the user-agent** with YOUR real site and email
2. **Use adequate delays** (minimum 1s, 3-5s recommended for gov sites)
3. **Monitor logs** regularly
4. **Respect the terms of use** of each site
5. **Have a legitimate purpose** for scraping

### 📞 **Example of an Ethical User-Agent:**

```python
# ✅ GOOD - Clearly identifies who you are
user_agent = "MeuProjeto/1.0 (+https://github.com/usuario/projeto; contato@email.com)"
user_agent = "PesquisaTCC/1.0 (+https://universidade.br/tcc; aluno@univ.br)"
user_agent = "AnalisePublica/1.0 (+https://empresa.com/pesquisa; pesquisa@empresa.com)"

# ❌ BAD - Too generic
user_agent = "MeuBot/1.0"
user_agent = "Python-requests/2.28"  # Default used by requests
```
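
A quick sanity check can catch a generic user-agent before it reaches production. The heuristic below is only an illustration. The regular expressions and the `is_descriptive_user_agent` name are assumptions, not a standard or part of the library. It looks for a `Name/version` token plus a contact URL or email address.

```python
import re

# Heuristic patterns (assumptions for illustration, not a standard)
_PRODUCT = re.compile(r"^[A-Za-z][\w.-]*/\d+(\.\d+)*")
_CONTACT = re.compile(r"https?://\S+|[\w.+-]+@[\w-]+(\.[\w-]+)+")


def is_descriptive_user_agent(user_agent):
    """Return True if the string has a name/version token and a contact URL or email."""
    has_product = bool(_PRODUCT.search(user_agent))
    has_contact = bool(_CONTACT.search(user_agent))
    return has_product and has_contact
```

A string with a project URL or email passes, while a bare `Name/version` string such as `"MeuBot/1.0"` does not.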

### 📄 **IMPORTANT: Configure this in `production_config.py`**
For production, copy `production_config.example.py` → `production_config.py` and edit:
- `USER_AGENT` - With your real details
- `PRODUCTION_SITES` - With the sites you want to monitor
- `DEFAULT_DELAY` - According to the type of site (gov = 5s+)

## 🔧 Advanced Configuration

In [12]:
# Example of a custom configuration
import logging

scraper_personalizado = ScraperEtico(
    user_agent="MeuProjeto/2.0 (+http://meusite.com/sobre-bot)",
    default_delay=2.0,  # More conservative delay
    timeout=10.0,       # Shorter timeout
    log_level=logging.DEBUG  # More detailed logs
)

# Build a requests session with custom headers
import requests
session = requests.Session()
session.headers.update({
    'User-Agent': scraper_personalizado.user_agent,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'pt-BR,pt;q=0.9,en;q=0.8',
    'From': 'contato@meusite.com'  # Contact email
})

# Use the custom session
scraper_personalizado.session = session

print("🔧 Custom scraper configured!")

2025-09-04 12:58:29 - ScraperEtico - INFO - ScraperEtico initialized with user-agent: MeuProjeto/2.0 (+http://meusite.com/sobre-bot)


🔧 Custom scraper configured!


## 🎯 Practical Use Cases

### 1. Checking a List of Sites

In [13]:
# List of news sites to check
sites_noticias = [
    "https://g1.globo.com/rss",
    "https://folha.uol.com.br/rss",
    "https://estadao.com.br/rss",
]

print("📰 Checking news sites...")
for site in sites_noticias:
    try:
        # can_fetch() checks the site's robots.txt for this URL
        resultado = scraper.can_fetch(site)
        status = "✅ Allowed" if resultado else "❌ Blocked"
        print(f"{status} - {site}")
        
        # Also check the crawl-delay
        delay = scraper.get_crawl_delay(site)
        if delay:
            print(f"   ⏱️  Crawl-delay: {delay}s")
    except Exception as e:
        print(f"❗ Error - {site}: {str(e)}")

2025-09-04 12:58:29 - ScraperEtico - DEBUG - Fetching robots.txt from: https://g1.globo.com/robots.txt


📰 Checking news sites...


2025-09-04 12:58:29 - ScraperEtico - INFO - Successfully fetched robots.txt from https://g1.globo.com/robots.txt
2025-09-04 12:58:29 - ScraperEtico - DEBUG - Robots.txt check for https://g1.globo.com/rss: allowed
2025-09-04 12:58:29 - ScraperEtico - DEBUG - Fetching robots.txt from: https://folha.uol.com.br/robots.txt


✅ Allowed - https://g1.globo.com/rss


2025-09-04 12:58:30 - ScraperEtico - INFO - Successfully fetched robots.txt from https://folha.uol.com.br/robots.txt
2025-09-04 12:58:30 - ScraperEtico - DEBUG - Robots.txt check for https://folha.uol.com.br/rss: allowed
2025-09-04 12:58:30 - ScraperEtico - DEBUG - Fetching robots.txt from: https://estadao.com.br/robots.txt


✅ Allowed - https://folha.uol.com.br/rss


2025-09-04 12:58:30 - ScraperEtico - INFO - Successfully fetched robots.txt from https://estadao.com.br/robots.txt
2025-09-04 12:58:30 - ScraperEtico - DEBUG - Robots.txt check for https://estadao.com.br/rss: allowed


✅ Allowed - https://estadao.com.br/rss


### 2. Comparing Different User-Agents

In [14]:
# Test different user-agents against the same site
site_teste = "https://example.com"
user_agents = [
    "*",  # All bots
    "Googlebot",
    "Bingbot", 
    "MeuBot/1.0"
]

print(f"🤖 Testing different user-agents on {site_teste}")
for ua in user_agents:
    scraper_temp = ScraperEtico(user_agent=ua)
    resultado = scraper_temp.can_fetch(site_teste)
    status = "✅" if resultado else "❌"
    print(f"{status} {ua}: {'Allowed' if resultado else 'Blocked'}")

2025-09-04 12:58:30 - ScraperEtico - INFO - ScraperEtico initialized with user-agent: *


🤖 Testing different user-agents on https://example.com


2025-09-04 12:58:31 - ScraperEtico - INFO - Successfully fetched robots.txt from https://example.com/robots.txt
2025-09-04 12:58:31 - ScraperEtico - INFO - ScraperEtico initialized with user-agent: Googlebot


✅ *: Allowed


2025-09-04 12:58:32 - ScraperEtico - INFO - Successfully fetched robots.txt from https://example.com/robots.txt
2025-09-04 12:58:32 - ScraperEtico - INFO - ScraperEtico initialized with user-agent: Bingbot


✅ Googlebot: Allowed


2025-09-04 12:58:33 - ScraperEtico - INFO - Successfully fetched robots.txt from https://example.com/robots.txt
2025-09-04 12:58:33 - ScraperEtico - INFO - ScraperEtico initialized with user-agent: MeuBot/1.0


✅ Bingbot: Allowed


2025-09-04 12:58:33 - ScraperEtico - INFO - Successfully fetched robots.txt from https://example.com/robots.txt


✅ MeuBot/1.0: Allowed
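
The results above are identical for every user-agent. Verdicts only diverge when a robots.txt defines agent-specific groups. The sketch below shows this with the standard `urllib.robotparser` and a hypothetical robots.txt parsed offline. It is independent of `ScraperEtico`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one agent-specific group and one catch-all group
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/search?q=python"
# An empty Disallow means "nothing is disallowed" for that group
verdicts = {ua: parser.can_fetch(ua, url) for ua in ["Googlebot", "Bingbot", "MeuBot/1.0"]}
```

In this made-up file only `Googlebot` may fetch `/search`. The other two agents fall under the catch-all `*` group and are blocked.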


## 🎓 Congratulations! You completed the tutorial

### 🎉 **What you learned:**

- ✅ How to perform **ethical scraping** respecting robots.txt
- ✅ How to process **multiple sites** in batches  
- ✅ How to **export data** to CSV and JSON automatically
- ✅ How to configure **delays and user-agents** properly
- ✅ The fundamental **ethical principles** of web scraping

### 🚀 **Next steps for production:**

1. **📄 Configure your real credentials in `production_config.py`:**
   ```bash
   cp production_config.example.py production_config.py
   nano production_config.py  # OR use your preferred editor
   ```

2. **🧪 Test your specific sites by editing `tests/examples/my_monitoring.py`:**
   ```bash
   python3 tests/examples/my_monitoring.py
   ```

3. **🚀 Execute production scraping with `run_production.py`:**
   ```bash
   python3 run_production.py
   ```

4. **📊 Monitor and analyze results with `analyze_results.py`:**
   ```bash
   python3 analyze_results.py
   open production_data/monitoring_*.csv
   ```

### 📚 **Important project files:**

- **📖 `README.md`** - Complete documentation
- **🧪 `tests/production_test.py`** - Tests before production  
- **📊 `analyze_results.py`** - Automatic analysis
- **🔧 `production_config.py`** - Your custom settings
- **📝 `production_config.example.py`** - Configuration template
- **🎯 `tests/examples/my_monitoring.py`** - Test with your specific sites
- **⚡ `run_production.py`** - Main production script
- **📋 `production_checklist.txt`** - Checklist before production

### 🛡️ **Always remember:**

> *"With great power comes great responsibility"*

- **Always respect robots.txt** (never try to circumvent it)
- **Use adequate delays** (minimum 3s for gov sites)  
- **Identify your bot clearly** (user-agent with your real data)
- **Have a legitimate purpose** (research, public monitoring)
- **Monitor logs regularly** (`logs/` folder)

### 🆘 **Need help?**

- 📖 **Read `README.md`** - Complete documentation with examples
- 🐛 **Problems?** Open an issue on GitHub
- 💬 **Questions?** Use GitHub Discussions

**Now you're ready to perform ethical web scraping! 🤖✨**

### 💡 **Final tip**: 
Always start testing with **a few sites** (3-5) before scaling to hundreds. EthicalScraper is robust, but being conservative is always better!