
🛍️ Amazon Web Scraper - Complete Documentation

📋 Table of Contents

  1. Overview
  2. Project Structure
  3. Installation & Setup
  4. Configuration
  5. How It Works
  6. Step-by-Step Automation Process
  7. Running the Scraper
  8. Output Files
  9. Troubleshooting
  10. Important Notes

🎯 Overview

This project uses Selenium WebDriver to automate web browsing and extract product information from Amazon search results. The scraper mimics human behavior to search for products and collect data like:

  • Product Title
  • Price
  • Rating
  • Number of Reviews
  • Product URL
  • Image URL

Technology Stack:

  • Python 3.11+
  • Selenium 4.26+ (Browser Automation)
  • BeautifulSoup4 (HTML Parsing)
  • Pandas (Data Processing)
  • Colorama (Terminal Output)

📁 Project Structure

web_scrapping/
│
├── amazon_scraper.py      # Main scraper script
├── config.py              # Configuration file (EDIT THIS)
├── requirements.txt       # Python dependencies
├── README.md              # This documentation
│
└── scraped_data/         # Output directory (created automatically)
    ├── amazon_wireless_headphones_20251107_123456.csv
    ├── amazon_wireless_headphones_20251107_123456.xlsx
    └── amazon_wireless_headphones_20251107_123456.json

🚀 Installation & Setup

Step 1: Navigate to Project Directory

cd D:\PROJECTS\web_scrapping

Step 2: Create Virtual Environment

python -m venv venv

Step 3: Activate Virtual Environment

Windows PowerShell:

.\venv\Scripts\Activate.ps1

Windows CMD:

venv\Scripts\activate.bat

Step 4: Install Dependencies

pip install -r requirements.txt

This will install:

  • selenium - Browser automation
  • webdriver-manager - Automatic driver management
  • pandas - Data manipulation
  • openpyxl - Excel file support
  • beautifulsoup4 - HTML parsing
  • colorama - Colored terminal output
  • And other utilities

Step 5: Verify Installation

python -c "import selenium; print(f'Selenium version: {selenium.__version__}')"

⚙️ Configuration

Edit config.py to Customize Scraping

Open config.py and modify these key settings:

1️⃣ Search Query (MOST IMPORTANT)

SEARCH_QUERY = "wireless headphones"

What to change:

  • Replace "wireless headphones" with your desired search term
  • Examples:
    • "laptop backpack"
    • "gaming mouse"
    • "python programming books"
    • "smart watch"

2️⃣ Maximum Products

MAX_PRODUCTS = 20

What to change:

  • How many products to scrape (default: 20)
  • Recommended: 10-50 products
  • Higher numbers = longer scraping time

3️⃣ Browser Selection

BROWSER = "edge"

Options:

  • "edge" - Microsoft Edge (Recommended)
  • "chrome" - Google Chrome
  • "firefox" - Mozilla Firefox

4️⃣ Headless Mode

HEADLESS = False

Options:

  • False - Show browser window (easier to debug)
  • True - Run in background (faster, no GUI)

5️⃣ Export Format

EXPORT_FORMAT = "all"

Options:

  • "all" - Export to CSV, Excel, and JSON
  • "csv" - CSV file only
  • "excel" - Excel file only
  • "json" - JSON file only

6️⃣ Amazon Domain (For Different Countries)

AMAZON_DOMAIN = "www.amazon.com"

Options:

  • "www.amazon.com" - United States
  • "www.amazon.co.uk" - United Kingdom
  • "www.amazon.in" - India
  • "www.amazon.de" - Germany
  • "www.amazon.ca" - Canada

🔍 How It Works

Architecture Overview

┌─────────────────┐
│   config.py     │  ← User modifies search query
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────────┐
│   amazon_scraper.py                 │
│                                     │
│  1. Initialize Browser (Selenium)   │
│  2. Navigate to Amazon              │
│  3. Find Search Box                 │
│  4. Enter Query & Submit            │
│  5. Wait for Results                │
│  6. Scroll Page (Load More)         │
│  7. Extract Product Data            │
│  8. Export to Files                 │
└─────────────────┬───────────────────┘
                  │
                  ▼
         ┌────────────────┐
         │  scraped_data/ │
         │   - CSV        │
         │   - Excel      │
         │   - JSON       │
         └────────────────┘

Data Flow

  1. User Input → config.py (SEARCH_QUERY)
  2. Browser Launch → Selenium opens Edge/Chrome/Firefox
  3. Amazon Navigation → Goes to amazon.com
  4. Search Execution → Types query and presses Enter
  5. Page Loading → Waits for search results
  6. Data Extraction → Parses HTML for product info
  7. Data Storage → Saves to CSV/Excel/JSON
  8. Cleanup → Closes browser

📖 Step-by-Step Automation Process

STEP 1: Browser Initialization

What happens:

def setup_driver(self):
    # Creates browser instance with anti-detection settings

Technical details:

  • Launches Microsoft Edge browser
  • Sets window to maximized
  • Adds user-agent to mimic real browser
  • Disables automation detection flags
  • Sets timeouts (10s implicit, 20s page load)

Why it matters:

  • Amazon can detect automated bots
  • These settings make Selenium look more human-like
  • Reduces chance of being blocked
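
As a minimal sketch of what such a setup can look like with Selenium 4 (the exact options in amazon_scraper.py may differ, and the user-agent string here is only an example):

from selenium import webdriver
from selenium.webdriver.edge.options import Options

def setup_driver(self):
    options = Options()
    options.add_argument("--start-maximized")  # maximized window
    # Example user-agent; makes requests look like a regular desktop browser
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
    # Hide the "controlled by automated software" automation flags
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)
    self.driver = webdriver.Edge(options=options)  # Selenium 4 manages the driver
    self.driver.implicitly_wait(10)                # implicit wait: 10s
    self.driver.set_page_load_timeout(20)          # page load timeout: 20s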

STEP 2: Navigate to Amazon

What happens:

def navigate_to_amazon(self):
    self.driver.get("https://www.amazon.com")

Technical details:

  • Opens Amazon homepage
  • Waits 2 seconds for page to fully load
  • Handles any initial popups (if present)

Visual:

Browser → Opens → https://www.amazon.com → Loaded ✓
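
As a sketch (assuming AMAZON_DOMAIN and ACTION_DELAY come from config.py):

import time

def navigate_to_amazon(self):
    # Open the configured Amazon homepage and give it a moment to settle
    self.driver.get(f"https://{AMAZON_DOMAIN}")
    time.sleep(ACTION_DELAY)  # 2 seconds by default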

STEP 3: Search for Products

What happens:

def search_products(self):
    # Finds search box by ID: "twotabsearchtextbox"
    # Types your SEARCH_QUERY
    # Presses Enter

Technical details:

  1. Find Search Box:

    search_box = WebDriverWait(self.driver, 10).until(
        EC.presence_of_element_located((By.ID, "twotabsearchtextbox"))
    )
    • Uses explicit wait (up to 10 seconds)
    • Locates Amazon's search input by ID
    • ID twotabsearchtextbox is Amazon's standard search box
  2. Enter Search Query:

    search_box.clear()
    search_box.send_keys(self.search_query)  # From config.py
    • Clears any existing text
    • Types your search query character by character
  3. Submit Search:

    search_box.send_keys(Keys.RETURN)
    • Simulates pressing Enter key
    • Submits the search form
  4. Wait for Results:

    WebDriverWait(self.driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "[data-component-type='s-search-result']"))
    )
    • Waits for product results to appear
    • Looks for elements with attribute data-component-type='s-search-result'

Visual Flow:

Search Box → Type "wireless headphones" → Press Enter → Results Load ✓

STEP 4: Scroll Page

What happens:

def scroll_page(self):
    # Scrolls down 3 times to load lazy-loaded products

Technical details:

  • Amazon uses lazy loading (products load as you scroll)
  • Scrolls to bottom of page
  • Waits 1 second after each scroll
  • Repeats 3 times
  • Scrolls back to top

Why it matters:

  • Some products only appear after scrolling
  • Ensures all products are visible in DOM
  • Helps capture more data

Visual:

Top → Scroll Down → Wait → Scroll Down → Wait → Scroll Down → Back to Top ✓
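
A minimal sketch of this behavior (SCROLL_DELAY is assumed to come from config.py):

import time

def scroll_page(self):
    # Scroll to the bottom three times so lazy-loaded products render
    for _ in range(3):
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(SCROLL_DELAY)  # 1 second by default
    # Return to the top before extraction begins
    self.driver.execute_script("window.scrollTo(0, 0);")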

STEP 5: Extract Product Data

What happens:

def extract_product_data(self):
    # Finds all product containers
    # Loops through each product
    # Extracts: title, price, rating, reviews, URL, image

Technical details:

Finding Products:

products = self.driver.find_elements(By.CSS_SELECTOR, "[data-component-type='s-search-result']")
  • Finds all div elements with attribute data-component-type='s-search-result'
  • Each div represents one product

For Each Product, Extract:

  1. Title:

    title_elem = product.find_elements(By.CSS_SELECTOR, "h2 a span")
    title = title_elem[0].text
    • Looks for <h2><a><span> structure
    • Gets text content
  2. Price:

    price_whole = product.find_elements(By.CSS_SELECTOR, ".a-price-whole")
    price_fraction = product.find_elements(By.CSS_SELECTOR, ".a-price-fraction")
    price = f"{price_whole[0].text}{price_fraction[0].text}"
    • Amazon splits the price into separate whole and fraction elements
    • Example: 29 + 99 → 29.99 (the trailing decimal point is typically included in the whole element's text)
  3. Rating:

    rating_elem = product.find_elements(By.CSS_SELECTOR, ".a-icon-alt")
    rating = rating_elem[0].get_attribute('textContent')
    • Finds star rating (e.g., "4.5 out of 5 stars")
  4. Reviews Count:

    reviews_elem = product.find_elements(By.CSS_SELECTOR, "span.a-size-base.s-underline-text")
    num_reviews = reviews_elem[0].text
    • Gets number like "(1,234)"
  5. Product URL:

    url_elem = product.find_elements(By.CSS_SELECTOR, "h2 a")
    url = url_elem[0].get_attribute('href')
    • Full URL to product page
  6. Image URL:

    image_elem = product.find_elements(By.CSS_SELECTOR, "img.s-image")
    image_url = image_elem[0].get_attribute('src')
    • Direct link to product image

Data Storage:

product_data = {
    'rank': count + 1,
    'title': title,
    'price': price,
    'rating': rating,
    'num_reviews': num_reviews,
    'url': url,
    'image_url': image_url,
    'scraped_at': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
}
self.products_data.append(product_data)
  • Creates dictionary for each product
  • Adds to list

Visual:

Product 1 → Extract → Store
Product 2 → Extract → Store
Product 3 → Extract → Store
...
Product 20 → Extract → Store ✓
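
Because any of these elements can be missing (see Issue 4 under Troubleshooting), each lookup is guarded; a hypothetical helper along these lines keeps the loop tidy:

from selenium.webdriver.common.by import By

def safe_text(product, selector, default="N/A"):
    # Hypothetical helper: return the first match's text, or "N/A" if absent
    elems = product.find_elements(By.CSS_SELECTOR, selector)
    return elems[0].text if elems else default

title = safe_text(product, "h2 a span")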

STEP 6: Export Data

What happens:

def export_data(self):
    # Converts data to Pandas DataFrame
    # Exports to CSV/Excel/JSON based on config

Technical details:

  1. Create Output Directory:

    os.makedirs(OUTPUT_DIR, exist_ok=True)
    • Creates the scraped_data/ folder if it doesn't exist
  2. Generate Filename:

    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    base_filename = f"amazon_{self.search_query.replace(' ', '_')}_{timestamp}"
    • Example: amazon_wireless_headphones_20251107_143022
  3. Convert to DataFrame:

    df = pd.DataFrame(self.products_data)
    • Creates table structure from data
  4. Export Formats:

    CSV:

    df.to_csv(csv_file, index=False, encoding='utf-8')
    • Plain text, comma-separated
    • Opens in Excel, Google Sheets

    Excel:

    df.to_excel(excel_file, index=False, engine='openpyxl')
    • Native Excel format (.xlsx)
    • Preserves formatting

    JSON:

    with open(json_file, 'w', encoding='utf-8') as f:
        json.dump(self.products_data, f, indent=2)
    • Structured data format
    • Good for APIs

Output Location:

scraped_data/
  ├── amazon_wireless_headphones_20251107_143022.csv
  ├── amazon_wireless_headphones_20251107_143022.xlsx
  └── amazon_wireless_headphones_20251107_143022.json
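
Assembled into one method, the export step might look roughly like this sketch (OUTPUT_DIR and EXPORT_FORMAT are the config.py settings):

import json
import os
from datetime import datetime

import pandas as pd

def export_data(self):
    os.makedirs(OUTPUT_DIR, exist_ok=True)  # from config.py
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    base = os.path.join(
        OUTPUT_DIR, f"amazon_{self.search_query.replace(' ', '_')}_{timestamp}"
    )
    df = pd.DataFrame(self.products_data)
    if EXPORT_FORMAT in ("all", "csv"):
        df.to_csv(f"{base}.csv", index=False, encoding='utf-8')
    if EXPORT_FORMAT in ("all", "excel"):
        df.to_excel(f"{base}.xlsx", index=False, engine='openpyxl')
    if EXPORT_FORMAT in ("all", "json"):
        with open(f"{base}.json", 'w', encoding='utf-8') as f:
            json.dump(self.products_data, f, indent=2)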

STEP 7: Cleanup

What happens:

def cleanup(self):
    self.driver.quit()

Technical details:

  • Closes browser window
  • Releases system resources
  • Ensures clean exit
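
A common pattern (a sketch; amazon_scraper.py may structure it differently) is to guard the whole run with try/finally so the browser closes even when a step fails:

scraper = AmazonScraper()
try:
    scraper.setup_driver()
    scraper.navigate_to_amazon()
    # ... search, scroll, extract, export ...
finally:
    scraper.cleanup()  # always quit the browser, even after an error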

🏃 Running the Scraper

Method 1: Default Configuration

# Activate virtual environment
.\venv\Scripts\Activate.ps1

# Run scraper
python amazon_scraper.py

Method 2: Custom Search Query

Option A: Edit config.py

# Open config.py
# Change line:
SEARCH_QUERY = "gaming laptop"  # Your custom query

# Save and run
python amazon_scraper.py

Option B: Create custom script

# quick_search.py
from amazon_scraper import AmazonScraper

# Create scraper instance
scraper = AmazonScraper()

# Modify search query programmatically
scraper.search_query = "python books"
scraper.max_products = 10

# Run
scraper.run()

Expected Output (Terminal):

============================================================
Amazon Web Scraper Initialized
============================================================
Search Query: wireless headphones
Max Products: 20
Browser: EDGE
============================================================

[STEP 1] Setting up EDGE browser...
✓ Browser initialized successfully

[STEP 2] Navigating to Amazon (https://www.amazon.com)...
✓ Successfully loaded Amazon homepage
Current URL: https://www.amazon.com/

[STEP 3] Searching for: 'wireless headphones'...
✓ Found search box
✓ Entered search query
✓ Search results loaded
Current URL: https://www.amazon.com/s?k=wireless+headphones

[STEP 4] Scrolling page to load all products...
✓ Scrolled (iteration 1)
✓ Scrolled (iteration 2)
✓ Scrolled (iteration 3)
✓ Page scrolling complete

[STEP 5] Extracting product data...
✓ Found 60 product elements
  [1] Sony WH-1000XM5 Wireless Noise Canceling Headphon... - $398.00
  [2] Bose QuietComfort 45 Bluetooth Wireless Noise Can... - $279.00
  [3] Apple AirPods Pro (2nd Generation) Wireless Ear B... - $249.00
  ...
  [20] JBL Tune 510BT: Wireless On-Ear Headphones with ... - $29.95

✓ Successfully extracted 20 products

[STEP 6] Exporting data...
✓ Exported to CSV: scraped_data\amazon_wireless_headphones_20251107_143022.csv
✓ Exported to Excel: scraped_data\amazon_wireless_headphones_20251107_143022.xlsx
✓ Exported to JSON: scraped_data\amazon_wireless_headphones_20251107_143022.json

✓ All exports completed successfully!

[STEP 7] Cleaning up...
✓ Browser closed

============================================================
✓ Scraping completed successfully!
Total products scraped: 20
Total time: 45.67 seconds
============================================================

📊 Output Files

CSV Format (*.csv)

rank,title,price,rating,num_reviews,url,image_url,scraped_at
1,"Sony WH-1000XM5","$398.00","4.7 out of 5 stars","(12,345)","https://amazon.com/dp/B09XS7JWHH","https://m.media-amazon.com/images/I/...",2025-11-07 14:30:22
2,"Bose QuietComfort 45","$279.00","4.6 out of 5 stars","(8,901)","https://amazon.com/dp/B098FKXT8L","https://m.media-amazon.com/images/I/...",2025-11-07 14:30:23
...

Best for:

  • Excel analysis
  • Google Sheets import
  • Simple data viewing

Excel Format (*.xlsx)

Opens directly in Microsoft Excel with columns:

  • A: Rank
  • B: Title
  • C: Price
  • D: Rating
  • E: Number of Reviews
  • F: Product URL
  • G: Image URL
  • H: Scraped At

Features:

  • Formatted cells
  • Sortable columns
  • Can create charts/pivot tables

JSON Format (*.json)

[
  {
    "rank": 1,
    "title": "Sony WH-1000XM5 Wireless Noise Canceling Headphones",
    "price": "$398.00",
    "rating": "4.7 out of 5 stars",
    "num_reviews": "(12,345)",
    "url": "https://amazon.com/dp/B09XS7JWHH",
    "image_url": "https://m.media-amazon.com/images/I/...",
    "scraped_at": "2025-11-07 14:30:22"
  },
  {
    "rank": 2,
    "title": "Bose QuietComfort 45",
    ...
  }
]

Best for:

  • API integration
  • Web applications
  • Data processing scripts
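
For quick analysis, any of these files loads straight into pandas, e.g.:

import pandas as pd

# Load a scraped CSV (adjust the filename to your own run's timestamp)
df = pd.read_csv("scraped_data/amazon_wireless_headphones_20251107_143022.csv")
print(df[["title", "price", "rating"]].head())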

🔧 Troubleshooting

Issue 1: Browser Doesn't Open

Symptoms:

  • Error: "WebDriver not found"
  • Browser doesn't launch

Solution:

# Selenium will auto-download the driver, but if it fails:
# Make sure you have the browser installed:
# - Microsoft Edge (recommended)
# - Google Chrome
# - Mozilla Firefox

# Try changing browser in config.py:
BROWSER = "edge"  # or "chrome" or "firefox"

Issue 2: No Products Found

Symptoms:

  • "✓ Found 0 product elements"
  • Empty output files

Possible causes:

  1. Amazon changed their HTML structure
  2. Search query has no results
  3. Page didn't load fully

Solution:

# In config.py, increase timeouts:
PAGE_LOAD_TIMEOUT = 30  # Increase from 20
IMPLICIT_WAIT = 15      # Increase from 10
ACTION_DELAY = 3        # Increase from 2

Issue 3: Getting Blocked by Amazon

Symptoms:

  • CAPTCHA appears
  • "Access Denied" page
  • IP temporarily blocked

Solution:

# In config.py:
ACTION_DELAY = 5        # Slow down actions
SCROLL_DELAY = 2        # Slow down scrolling
MAX_PRODUCTS = 10       # Scrape fewer products

# Consider:
# - Using VPN
# - Waiting between scrapes
# - Running during off-peak hours

Issue 4: Incomplete Data

Symptoms:

  • Some fields show "N/A"
  • Missing prices or ratings

Cause:

  • Amazon's HTML varies by product type
  • Some products don't have all fields

Solution:

  • This is normal behavior
  • Filter data in Excel/CSV afterward
  • Products without key data will show "N/A"

Issue 5: Import Errors

Symptoms:

ModuleNotFoundError: No module named 'selenium'

Solution:

# Make sure virtual environment is activated:
.\venv\Scripts\Activate.ps1

# Reinstall dependencies:
pip install -r requirements.txt

# Verify installation:
pip list

⚠️ Important Notes

Legal & Ethical Considerations

  1. Amazon's Terms of Service:

    • Amazon prohibits automated scraping without permission
    • This tool is for educational purposes only
    • Commercial use may violate Terms of Service
  2. Rate Limiting:

    • Don't scrape too frequently (leave hours between sessions)
    • Keep MAX_PRODUCTS reasonable (≤50)
    • Respect Amazon's servers
  3. Data Usage:

    • Don't republish scraped data
    • Don't use for competitive intelligence
    • Personal research only

Best Practices

  1. Responsible Scraping:

    ACTION_DELAY = 3      # Slower = More polite
    MAX_PRODUCTS = 20     # Don't be greedy
  2. Testing:

    • Start with small numbers (MAX_PRODUCTS = 5)
    • Test different search queries
    • Verify data quality
  3. Maintenance:

    • Amazon changes their HTML frequently
    • You may need to update selectors
    • Check logs if scraping fails

Technical Limitations

  1. Dynamic Content:

    • Some elements load via JavaScript
    • May need longer wait times
    • Not all products may be captured
  2. CAPTCHA:

    • Amazon may show CAPTCHA
    • No automated solution (that's the point!)
    • If CAPTCHA appears, solve manually or restart
  3. IP Blocking:

    • Too many requests = temporary block
    • Usually lifts after 1-24 hours
    • Use responsibly

🎓 Learning Resources

Understanding the Code

Key Concepts:

  1. Selenium WebDriver - Browser automation
  2. CSS Selectors - Finding elements in HTML
  3. Explicit Waits - Waiting for elements to load
  4. BeautifulSoup - Parsing HTML
  5. Pandas - Data manipulation

CSS Selectors Used

# By ID
driver.find_element(By.ID, "twotabsearchtextbox")

# By CSS class
driver.find_elements(By.CSS_SELECTOR, ".a-price-whole")

# By attribute
driver.find_elements(By.CSS_SELECTOR, "[data-component-type='s-search-result']")

# By tag hierarchy
driver.find_elements(By.CSS_SELECTOR, "h2 a span")

Selenium Wait Strategies

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Implicit wait (global) - applies to every element lookup
driver.implicitly_wait(10)

# Explicit wait (specific element) - blocks until the condition is met
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "element-id"))
)

📞 Support

Common Questions:

Q: Can I scrape other websites? A: Yes! Modify the URL and selectors in the code. Each website has a different HTML structure.

Q: How do I change what data is extracted? A: Edit the extract_product_data() method in amazon_scraper.py. Add new CSS selectors for additional fields.

Q: Can I run this on a schedule? A: Yes, use Windows Task Scheduler or create a loop with time delays.

Q: Is there a GUI version? A: Currently CLI only. You could build a GUI using tkinter or PyQt.
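
For example, a simple repeated run could look like this sketch (assuming the AmazonScraper class exposes run() as shown in Option B above):

import time

from amazon_scraper import AmazonScraper

while True:
    scraper = AmazonScraper()
    scraper.run()
    time.sleep(6 * 60 * 60)  # wait six hours between sessions (be polite)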


🚀 Next Steps

  1. Run a Test:

    python amazon_scraper.py
  2. Customize Search:

    • Edit config.py
    • Change SEARCH_QUERY
    • Adjust MAX_PRODUCTS
  3. Analyze Data:

    • Open output files in Excel
    • Create charts
    • Analyze pricing trends
  4. Expand Functionality:

    • Add more data fields
    • Scrape multiple pages
    • Compare prices over time
    • Send results via email

📝 Version History

  • v1.0.0 (2025-11-07)
    • Initial release
    • Basic Amazon search scraping
    • CSV/Excel/JSON export
    • Edge browser support

📄 License

This project is for educational purposes only.

Use responsibly and respect website terms of service.


Happy Scraping! 🎉

Remember: With great scraping power comes great responsibility!
