# Web Scraper Toolkit - Google Colab Example

This notebook demonstrates how to use the Web Scraper Toolkit in Google Colab. It includes installation, setup, and examples of basic and advanced scraping.

## 1. Clone the Repository and Install Dependencies

First, let's clone the repository and install the required dependencies.

In [None]:
# Clone the repository
!git clone https://github.com/ahmed202020803/web-scraper-toolkit.git
!cd web-scraper-toolkit && pip install -r requirements.txt

## 2. Set Up the Environment

Now, let's set up the environment by adding the repository to the Python path.

In [None]:
import sys
import os
import logging

# Add the repository to the Python path
sys.path.append('/content/web-scraper-toolkit')

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Create necessary directories
os.makedirs('/content/web-scraper-toolkit/data', exist_ok=True)
os.makedirs('/content/web-scraper-toolkit/config', exist_ok=True)
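
Before moving on, it can help to confirm that Python can actually locate the package. The sketch below is a minimal check, assuming the package is named `web_scraper_toolkit` (the name used by the import later in this notebook). The helper `is_importable` and the constant `REPO_PATH` are hypothetical names introduced only for this illustration.

```python
import importlib.util
import sys

REPO_PATH = "/content/web-scraper-toolkit"  # same path as appended to sys.path above


def is_importable(module_name):
    """Return True if Python can locate a top-level module without importing it."""
    return importlib.util.find_spec(module_name) is not None


print(f"Repository on sys.path: {REPO_PATH in sys.path}")
print(f"web_scraper_toolkit importable: {is_importable('web_scraper_toolkit')}")
```

If the second line prints `False`, the clone step probably failed or the path differs from the one assumed here.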

## 3. Create Sample Configuration Files

Let's create the necessary configuration files for the toolkit.

In [None]:
# Create a sample user agents file if it doesn't exist
user_agents_path = '/content/web-scraper-toolkit/config/user_agents.txt'
if not os.path.exists(user_agents_path):
    with open(user_agents_path, 'w') as f:
        f.write("""
# Chrome
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36

# Firefox
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0
Mozilla/5.0 (X11; Linux i686; rv:89.0) Gecko/20100101 Firefox/89.0

# Safari
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15
        """.strip() + "\n")
    print("Created user agents file")
else:
    print("User agents file already exists")
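
To see how a file in this format could be consumed, here is a minimal sketch that reads it back, skips blank lines and `#` comment lines, and picks one entry at random. How the toolkit itself parses the file is not shown in this notebook, so treat the comment handling as an assumption. `load_user_agents`, `pick_user_agent`, and `agents_file` are hypothetical names.

```python
import os
import random


def load_user_agents(path):
    """Read user-agent strings, skipping blank lines and '#' comment lines."""
    agents = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                agents.append(line)
    return agents


def pick_user_agent(agents):
    """Choose one user agent at random (a simple form of rotation)."""
    return random.choice(agents)


agents_file = "/content/web-scraper-toolkit/config/user_agents.txt"
if os.path.exists(agents_file):
    agents = load_user_agents(agents_file)
    print(f"Loaded {len(agents)} user agents")
    print(f"Random pick: {pick_user_agent(agents)}")
```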

## 4. Import the Web Scraper Toolkit

Now, let's import the Web Scraper Toolkit and create a scraper instance.

In [None]:
from web_scraper_toolkit import Scraper, ScraperConfig

# Create a scraper with default configuration
scraper = Scraper(engine="requests")
print("Scraper initialized successfully")

## 5. Basic Scraping Example

Let's start with a basic scraping example to extract information from a website.

In [None]:
# Define the data schema
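schema = {
    "title": "h1",
    # A <meta> tag has no text; its value lives in the "content" attribute
    "description": {
        "selector": "meta[name='description']",
        "attribute": "content"
    },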
    "paragraphs": {
        "selector": "p",
        "multiple": True
    },
    "links": {
        "selector": "a",
        "attribute": "href",
        "multiple": True
    },
    "images": {
        "selector": "img",
        "attribute": "src",
        "multiple": True
    }
}

# Scrape a website
print("Scraping example.com...")
data = scraper.scrape("https://example.com", schema)

# Print the scraped data
print("\nScraped Data:")
print(f"Title: {data.get('title')}")
print(f"Description: {data.get('description')}")
print(f"Number of paragraphs: {len(data.get('paragraphs', []))}")
print(f"Number of links: {len(data.get('links', []))}")
print(f"Number of images: {len(data.get('images', []))}")

# Export the data to JSON and CSV
print("\nExporting data...")
scraper.export(data, "/content/web-scraper-toolkit/data/example.json")
scraper.export(data, "/content/web-scraper-toolkit/data/example.csv")

print("\nDone!")
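
The schema above mixes two forms: a bare selector string (such as `"h1"`) and a dictionary with `selector`, `attribute`, and `multiple` keys. The sketch below shows one plausible reading of that shorthand, where a bare string means "one element, text content". This is an assumption about the semantics, not the toolkit's actual implementation, and `normalize_field` and `example_schema` are hypothetical names.

```python
def normalize_field(spec):
    """Expand a schema entry into a dictionary with explicit defaults."""
    if isinstance(spec, str):
        # Shorthand: a bare string is treated as the CSS selector
        spec = {"selector": spec}
    return {
        "selector": spec["selector"],
        "attribute": spec.get("attribute"),      # None: assumed to mean the element's text
        "multiple": spec.get("multiple", False), # False: assumed to mean the first match only
    }


example_schema = {
    "title": "h1",
    "links": {"selector": "a", "attribute": "href", "multiple": True},
}

for name, spec in example_schema.items():
    print(name, normalize_field(spec))
```

Writing every field in the expanded form makes it easier to see at a glance which fields return a list and which return a single value.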

## 6. View the Exported Data

Let's view the exported data files.

In [None]:
# View the JSON file
import json
with open("/content/web-scraper-toolkit/data/example.json", "r") as f:
    json_data = json.load(f)

print("JSON Data:")
print(json.dumps(json_data, indent=2))

In [None]:
# View the CSV file
import pandas as pd
csv_data = pd.read_csv("/content/web-scraper-toolkit/data/example.csv")

print("CSV Data:")
csv_data

## 7. Advanced Scraping Example

Now, let's try a more advanced scraping example with custom configuration.

In [None]:
# Create a custom configuration
config = ScraperConfig(
    engine="requests",
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    user_agent_rotation=True,
    respect_robots_txt=True,
    request_delay=2.0,
    max_retries=3,
    timeout=30,
    verify_ssl=True,
    headers={
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Cache-Control": "max-age=0",
    }
)

# Create a scraper with the custom configuration
advanced_scraper = Scraper(config=config)

# Scrape multiple websites
print("Scraping multiple websites...")
urls = [
    "https://example.com",
    "https://example.org",
    "https://example.net"
]

results = advanced_scraper.scrape_multiple(urls, schema)

# Print the scraped data
print("\nScraped Data:")
for i, result in enumerate(results):
    print(f"\nWebsite {i+1}:")
    print(f"URL: {result.get('url')}")
    print(f"Title: {result.get('title')}")
    print(f"Description: {result.get('description')}")
    print(f"Number of paragraphs: {len(result.get('paragraphs', []))}")
    print(f"Number of links: {len(result.get('links', []))}")
    print(f"Number of images: {len(result.get('images', []))}")

# Export the data to JSON and CSV
print("\nExporting data...")
advanced_scraper.export(results, "/content/web-scraper-toolkit/data/multiple_examples.json")
advanced_scraper.export(results, "/content/web-scraper-toolkit/data/multiple_examples.csv")

# Close the scraper
advanced_scraper.close()

print("\nDone!")
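
To make options such as `respect_robots_txt`, `request_delay`, and `max_retries` more concrete, here is a standard-library sketch of what settings like these commonly do. It is not the toolkit's code. It assumes `max_retries` is the total number of attempts, and `allowed_by_robots`, `fetch_with_retries`, and `flaky_fetch` are hypothetical names.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def fetch_with_retries(fetch, url, max_retries=3, request_delay=2.0, sleep=time.sleep):
    """Call fetch(url) up to max_retries times, pausing between failed attempts."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            if attempt < max_retries:
                sleep(request_delay)  # wait before the next attempt
    raise last_error


# robots.txt rules given inline so the example needs no network access
robots_txt = """User-agent: *
Disallow: /private/
"""
print(allowed_by_robots(robots_txt, "ExampleBot", "https://example.com/private/page"))
print(allowed_by_robots(robots_txt, "ExampleBot", "https://example.com/index.html"))

# A stand-in fetch function that fails twice before succeeding
attempts = {"count": 0}


def flaky_fetch(url):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"content of {url}"


# The sleep is replaced with a no-op so the demonstration runs instantly
print(fetch_with_retries(flaky_fetch, "https://example.com", sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the helper easy to test, since the delay can be swapped out without patching `time.sleep`.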

## 8. News Website Scraping Example

Let's create a more practical example by scraping headlines from a news website.

In [None]:
# Create a scraper for news websites
news_scraper = Scraper(engine="requests")

# Define the schema for BBC News
bbc_schema = {
    "headlines": {
        "selector": ".gs-c-promo-heading__title",
        "multiple": True
    },
    "summaries": {
        "selector": ".gs-c-promo-summary",
        "multiple": True
    },
    "article_links": {
        "selector": ".gs-c-promo-heading",
        "attribute": "href",
        "multiple": True
    }
}

# Scrape BBC News
print("Scraping BBC News...")
try:
    bbc_data = news_scraper.scrape("https://www.bbc.com/news", bbc_schema)
    
    # Print the headlines
    print("\nBBC News Headlines:")
    for i, headline in enumerate(bbc_data.get('headlines', [])[:10], start=1):  # Limit to 10 headlines
        print(f"{i}. {headline}")
    
    # Export the data
    news_scraper.export(bbc_data, "/content/web-scraper-toolkit/data/bbc_news.json")
except Exception as e:
    print(f"Error scraping BBC News: {e}")
    print("Note: Website structure may have changed or access might be restricted.")

# Define the schema for CNN
cnn_schema = {
    "headlines": {
        "selector": ".container__headline",
        "multiple": True
    },
    "summaries": {
        "selector": ".container__description",
        "multiple": True
    },
    "article_links": {
        "selector": ".container__link",
        "attribute": "href",
        "multiple": True
    }
}

# Scrape CNN
print("\nScraping CNN...")
try:
    cnn_data = news_scraper.scrape("https://www.cnn.com", cnn_schema)
    
    # Print the headlines
    print("\nCNN Headlines:")
    for i, headline in enumerate(cnn_data.get('headlines', [])[:10], start=1):  # Limit to 10 headlines
        print(f"{i}. {headline}")
    
    # Export the data
    news_scraper.export(cnn_data, "/content/web-scraper-toolkit/data/cnn_news.json")
except Exception as e:
    print(f"Error scraping CNN: {e}")
    print("Note: Website structure may have changed or access might be restricted.")

# Close the scraper
news_scraper.close()

print("\nNews scraping complete!")
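
The `article_links` fields collect raw `href` values, and sites often use relative paths there. If that turns out to be the case, a small post-processing step can resolve them against the page URL. The sketch below uses `urllib.parse.urljoin`. The sample links are made up for illustration, and `absolutize_links` is a hypothetical helper.

```python
from urllib.parse import urljoin


def absolutize_links(base_url, links):
    """Resolve each href against base_url, dropping empty values."""
    return [urljoin(base_url, href) for href in links if href]


# Made-up sample values: one relative path, one absolute URL, one missing href
sample_links = ["/news/world-12345", "https://www.bbc.com/sport", None]
print(absolutize_links("https://www.bbc.com/news", sample_links))
```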

## 9. E-commerce Price Monitoring Example

Let's create a simplified version of the e-commerce price monitoring example.

In [None]:
import datetime
import json
import re

# Create a scraper for e-commerce websites
ecommerce_scraper = Scraper(engine="requests")

# Define a function to clean price strings
def clean_price(price_str):
    """Return the first number in price_str as a float, or None if there is none.

    Assumes "," is a thousands separator (e.g. "$1,299.99"), so formats such as
    "1.299,99" are not handled.
    """
    if not price_str:
        return None
    # Remove currency symbols and thousands separators
    cleaned = price_str.replace("$", "").replace("€", "").replace("£", "").replace(",", "")
    # Extract the first number found
    match = re.search(r"(\d+\.\d+|\d+)", cleaned)
    if match:
        return float(match.group(1))
    return None

# Define product URLs to monitor (using example.com for demonstration)
# In a real scenario, you would use actual e-commerce websites
products = [
    {
        "name": "Example Product",
        "url": "https://example.com",
        "selector_schema": {
            "title": "h1",
            "price": {
                "selector": "p",  # Using paragraph as a placeholder for price
                "processors": [clean_price]
            }
        }
    }
]

# Current timestamp
timestamp = datetime.datetime.now().isoformat()

# Load existing price history or create a new one
price_history_file = "/content/web-scraper-toolkit/data/price_history.json"
try:
    with open(price_history_file, 'r') as f:
        price_history = json.load(f)
except (FileNotFoundError, json.JSONDecodeError):
    price_history = {}

# Scrape each product
for product in products:
    product_name = product["name"]
    product_url = product["url"]
    selector_schema = product["selector_schema"]
    
    print(f"Scraping product: {product_name}")
    
    try:
        # Scrape the product page
        data = ecommerce_scraper.scrape(product_url, selector_schema)
        
        # Add timestamp and URL
        data["timestamp"] = timestamp
        data["url"] = product_url
        
        # Initialize price history for this product if it doesn't exist
        if product_name not in price_history:
            price_history[product_name] = []
        
        # Add the current price to the history
        price_history[product_name].append({
            "timestamp": timestamp,
            "price": data.get("price"),
            "title": data.get("title")
        })
        
        # Print the current data
        print(f"Title: {data.get('title')}")
        print(f"Price: {data.get('price')}")
        
        # Export the current product data
        product_file = f"/content/web-scraper-toolkit/data/{product_name.lower().replace(' ', '_')}.json"
        ecommerce_scraper.export(data, product_file)
        
    except Exception as e:
        print(f"Error scraping {product_name}: {e}")

# Save the updated price history
with open(price_history_file, 'w') as f:
    json.dump(price_history, f, indent=2)

# Close the scraper
ecommerce_scraper.close()

print("\nE-commerce price monitoring complete!")
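
Once the history file holds more than one entry per product, the last two prices can be compared to detect a change. The sketch below works on a list shaped like one product's entry in `price_history`. The sample data is made up, and `price_change` is a hypothetical helper rather than part of the toolkit.

```python
def price_change(history):
    """Return (previous, latest, difference) for the last two known prices.

    Entries without a price are ignored. Returns None if fewer than two
    prices are available.
    """
    prices = [entry["price"] for entry in history if entry.get("price") is not None]
    if len(prices) < 2:
        return None
    previous, latest = prices[-2], prices[-1]
    return previous, latest, round(latest - previous, 2)


# Made-up sample in the same shape as one product's price history
sample_history = [
    {"timestamp": "2024-01-01T09:00:00", "price": 19.99, "title": "Example Product"},
    {"timestamp": "2024-01-02T09:00:00", "price": None, "title": "Example Product"},
    {"timestamp": "2024-01-03T09:00:00", "price": 17.49, "title": "Example Product"},
]

change = price_change(sample_history)
if change is not None:
    previous, latest, difference = change
    print(f"Previous: {previous}, latest: {latest}, change: {difference}")
```

With the placeholder selector used in this notebook, `example.com` yields no numeric price, so a comparison like this only becomes meaningful with a real product page.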

## 10. Conclusion

In this notebook, we've demonstrated how to use the Web Scraper Toolkit in Google Colab. We've covered:

1. Setting up the environment
2. Basic scraping
3. Advanced scraping with custom configuration
4. News website scraping
5. E-commerce price monitoring

The Web Scraper Toolkit's schema-driven interface handles single pages, batches of URLs, and custom request settings with the same few calls, which makes Google Colab a convenient place for quick prototyping and analysis.