# Newspaper Scraper v2 - Modular OOP Architecture

This notebook shows how to use the modular newspaper scraper, which supports four news portals from Mendoza:
- Los Andes
- Diario UNO
- El Sol
- MDZ

## Features
- ✅ Modular OOP architecture with abstract base class
- ✅ Portal-specific scrapers with custom XPath selectors
- ✅ Duplicate detection by URL
- ✅ Comprehensive logging with progress tracking
- ✅ Configurable delays (1s between requests, 2s between portals)
- ✅ Export to CSV and JSON
- ✅ Timestamp field (scraped_at)
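
To make the first three features concrete, here is a minimal sketch of how an abstract base class, a portal-specific subclass, and URL-based duplicate detection could fit together. The names `BaseScraper`, `DemoScraper`, `parse_article`, and `add` are hypothetical and are not taken from `newspapers_scraper_v2`; the real module may be organised differently.

```python
from abc import ABC, abstractmethod
from datetime import datetime, timezone


class BaseScraper(ABC):
    """Hypothetical base class; the real module's class may differ."""

    name = "base"

    def __init__(self):
        # Keyed by URL, so a repeated URL is easy to detect.
        self.articles = {}

    @abstractmethod
    def parse_article(self, url: str) -> dict:
        """Return the fields extracted from one article page."""

    def add(self, url: str) -> bool:
        """Parse and store an article unless its URL was already seen."""
        if url in self.articles:
            return False  # duplicate, skip it
        record = self.parse_article(url)
        record["newspaper"] = self.name
        record["scraped_at"] = datetime.now(timezone.utc).isoformat()
        self.articles[url] = record
        return True


class DemoScraper(BaseScraper):
    """Stand-in for a portal-specific scraper."""

    name = "Demo"

    def parse_article(self, url: str) -> dict:
        # A real subclass would fetch the page and apply its XPath selectors.
        return {"headline": f"Headline for {url}"}


demo = DemoScraper()
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
added = [demo.add(url) for url in urls]
print(added, len(demo.articles))
```

Each portal only has to implement `parse_article`; the shared `add` method handles deduplication and the `scraped_at` timestamp in one place.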

## 1. Imports

In [None]:
# Import the scraper module
from newspapers_scraper_v2 import (
    NewspaperScraperOrchestrator,
    LosAndesScraper,
    DiarioUnoScraper,
    ElSolScraper,
    MDZScraper
)

import pandas as pd
import json

## 2. Scrape All Portals

The orchestrator scrapes all four portals sequentially, pausing between requests and between portals (1s and 2s respectively, as listed under Features).

In [None]:
# Initialize the orchestrator
orchestrator = NewspaperScraperOrchestrator()

# Scrape all portals
results = orchestrator.scrape_all()
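
The sequential run with a pause between portals can be sketched roughly as follows. This is an illustration under stated assumptions, not the orchestrator's actual code: `scrape_sequentially` is a hypothetical helper, each portal is modelled as a callable that returns a dict keyed by URL, and the `sleep` parameter exists only so the demo runs instantly and offline.

```python
import time


def scrape_sequentially(scrapers, portal_delay=2.0, sleep=time.sleep):
    """Run each scraper callable in turn, pausing between portals."""
    results = {}
    for position, scrape in enumerate(scrapers):
        try:
            for url, data in scrape().items():
                # Keep the first record seen for a URL (duplicate detection).
                results.setdefault(url, data)
        except Exception as exc:
            # One failing portal should not abort the whole run.
            print(f"Portal #{position} failed: {exc}")
        if position < len(scrapers) - 1:
            sleep(portal_delay)  # no pause after the last portal
    return results


def broken_portal():
    raise RuntimeError("site unreachable")


# Fake portals and a recording sleep keep the demo instant and offline.
pauses = []
merged = scrape_sequentially(
    [
        lambda: {"https://example.com/a": {"headline": "A"}},
        broken_portal,
        lambda: {
            "https://example.com/a": {"headline": "A again"},
            "https://example.com/b": {"headline": "B"},
        },
    ],
    sleep=pauses.append,
)
print(sorted(merged), pauses)
```

Catching exceptions per portal is a design choice in this sketch: a single unreachable site costs one portal's articles rather than the whole run.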

## 3. View Results as DataFrame

In [None]:
# Convert to DataFrame
df = orchestrator.to_dataframe()

# Display basic info
print(f"Total articles scraped: {len(df)}")
print("\nArticles per newspaper:")
print(df['newspaper'].value_counts())

# Display the DataFrame
df

## 4. Inspect Sample Articles

In [None]:
# Display first few articles with key fields
df[['newspaper', 'headline', 'date', 'author']].head(10)

In [None]:
# View a full article (change the index to see a different one)
article_index = 0
article = df.iloc[article_index]  # look the row up once

print("=" * 80)
print(f"ARTICLE #{article_index}")
print("=" * 80)
print(f"Newspaper: {article['newspaper']}")
print(f"URL: {article['url']}")
print(f"Headline: {article['headline']}")
print(f"Date: {article['date']}")
print(f"Author: {article['author']}")
print(f"\nSummary:\n{article['summary']}")

# Guard against a missing body so slicing does not fail on None/NaN
body = article['body'] if isinstance(article['body'], str) else ""
print(f"\nBody (first 500 chars):\n{body[:500]}...")
print(f"\nScraped at: {article['scraped_at']}")

## 5. Export Data

In [None]:
# Export to CSV
orchestrator.export_csv("news_data.csv")

# Export to JSON
orchestrator.export_json("news_data.json")

print("✅ Data exported successfully!")

## 6. Advanced: Scrape Individual Portal

Each portal scraper can also be run on its own, without the orchestrator.

In [None]:
# Example: Scrape only Los Andes
los_andes = LosAndesScraper()
los_andes_results = los_andes.scrape()

print(f"Scraped {len(los_andes_results)} articles from Los Andes")

# Convert to DataFrame
df_los_andes = pd.DataFrame([
    {"url": url, **data} 
    for url, data in los_andes_results.items()
])

df_los_andes[['headline', 'date']].head()
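
As an alternative to the list comprehension above, a dict keyed by URL can be turned into a DataFrame with `pd.DataFrame.from_dict`. The sketch below uses made-up sample data in the same shape the cell above assumes for `los_andes_results` (URL mapped to a dict of fields).

```python
import pandas as pd

# Made-up records in the assumed {url: {field: value}} shape.
sample_results = {
    "https://example.com/a": {"headline": "A", "date": "2024-01-01"},
    "https://example.com/b": {"headline": "B", "date": "2024-01-02"},
}

df_sample = (
    pd.DataFrame.from_dict(sample_results, orient="index")  # URLs become the index
    .rename_axis("url")                                      # name the index
    .reset_index()                                           # turn it into a column
)
print(df_sample)
```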

## 7. Data Analysis Examples

In [None]:
# Check for missing data
print("Missing data summary:")
print(df.isnull().sum())

# Check for empty strings
print("\nEmpty string counts:")
for col in ['headline', 'summary', 'body', 'date', 'author']:
    empty_count = (df[col] == '').sum()
    print(f"{col}: {empty_count}")
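
The two checks above count `NaN` values and empty strings separately. One way to get a single missing-data count is to convert blank strings to `NaN` first. The sketch below uses a small made-up DataFrame rather than the scraped `df`, and assumes that whitespace-only strings should also count as missing.

```python
import numpy as np
import pandas as pd

# Made-up sample with an empty string, a missing value, and a blank string.
sample = pd.DataFrame({
    "headline": ["Title A", "", "Title C"],
    "author": ["   ", None, "Jane Doe"],
})

# Treat empty or whitespace-only strings as missing values.
cleaned = sample.replace(r"^\s*$", np.nan, regex=True)

missing = cleaned.isna().sum()
print(missing)
```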

In [None]:
# Article length statistics
df['body_length'] = df['body'].str.len()

print("Article body length statistics:")
print(df.groupby('newspaper')['body_length'].describe())

## 8. Configuration

The module exposes its settings as `SCRAPER_CONFIG`, which you can inspect and, if needed, modify.

In [None]:
from newspapers_scraper_v2 import SCRAPER_CONFIG

print("Current configuration:")
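# ensure_ascii=False keeps accented characters readable;
# default=str avoids errors if a value is not JSON-serializable
print(json.dumps(SCRAPER_CONFIG, indent=2, ensure_ascii=False, default=str))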
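
When changing settings, it can be safer to work on a copy than to mutate the shared dict. The sketch below shows one way to do that. The keys `request_delay` and `portal_delay` and the helper `with_overrides` are hypothetical; check the printed configuration for the real key names, and note that whether the scrapers pick up a modified copy depends on how the module reads its configuration.

```python
import copy

# Stand-in for SCRAPER_CONFIG; the real keys may differ.
example_config = {"request_delay": 1.0, "portal_delay": 2.0}


def with_overrides(config, **overrides):
    """Return a modified deep copy, rejecting keys the config does not have."""
    unknown = set(overrides) - set(config)
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    updated = copy.deepcopy(config)
    updated.update(overrides)
    return updated


slower = with_overrides(example_config, request_delay=3.0)
print(slower)
```

Rejecting unknown keys catches typos early, which would otherwise be silently ignored.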