
🛍️ Amazon Web Scraper - Complete Documentation

📋 Table of Contents

  1. Overview
  2. Project Structure
  3. Installation & Setup
  4. Configuration
  5. How It Works
  6. Step-by-Step Automation Process
  7. Running the Scraper
  8. Output Files
  9. Troubleshooting
  10. Important Notes

🎯 Overview

This project uses Selenium WebDriver to automate web browsing and extract product information from Amazon search results. The scraper mimics human behavior to search for products and collect data like:

  • Product Title
  • Price
  • Rating
  • Number of Reviews
  • Product URL
  • Image URL

Technology Stack:

  • Python 3.11+
  • Selenium 4.26+ (Browser Automation)
  • BeautifulSoup4 (HTML Parsing)
  • Pandas (Data Processing)
  • Colorama (Terminal Output)

📁 Project Structure

web_scrapping/
│
├── amazon_scraper.py      # Main scraper script
├── config.py              # Configuration file (EDIT THIS)
├── requirements.txt       # Python dependencies
├── README.md              # This documentation
│
└── scraped_data/         # Output directory (created automatically)
    ├── amazon_wireless_headphones_20251107_123456.csv
    ├── amazon_wireless_headphones_20251107_123456.xlsx
    └── amazon_wireless_headphones_20251107_123456.json

🚀 Installation & Setup

Step 1: Navigate to Project Directory

cd D:\PROJECTS\web_scrapping

Step 2: Create Virtual Environment

python -m venv venv

Step 3: Activate Virtual Environment

Windows PowerShell:

.\venv\Scripts\Activate.ps1

Windows CMD:

venv\Scripts\activate.bat

Step 4: Install Dependencies

pip install -r requirements.txt

This will install:

  • selenium - Browser automation
  • webdriver-manager - Automatic driver management
  • pandas - Data manipulation
  • openpyxl - Excel file support
  • beautifulsoup4 - HTML parsing
  • colorama - Colored terminal output
  • And other utilities

Step 5: Verify Installation

python -c "import selenium; print(f'Selenium version: {selenium.__version__}')"

⚙️ Configuration

Edit config.py to Customize Scraping

Open config.py and modify these key settings:

1️⃣ Search Query (MOST IMPORTANT)

SEARCH_QUERY = "wireless headphones"

What to change:

  • Replace "wireless headphones" with your desired search term
  • Examples:
    • "laptop backpack"
    • "gaming mouse"
    • "python programming books"
    • "smart watch"

2️⃣ Maximum Products

MAX_PRODUCTS = 20

What to change:

  • How many products to scrape (default: 20)
  • Recommended: 10-50 products
  • Higher numbers = longer scraping time

3️⃣ Browser Selection

BROWSER = "edge"

Options:

  • "edge" - Microsoft Edge (Recommended)
  • "chrome" - Google Chrome
  • "firefox" - Mozilla Firefox

4️⃣ Headless Mode

HEADLESS = False

Options:

  • False - Show browser window (easier to debug)
  • True - Run in background (faster, no GUI)

5️⃣ Export Format

EXPORT_FORMAT = "all"

Options:

  • "all" - Export to CSV, Excel, and JSON
  • "csv" - CSV file only
  • "excel" - Excel file only
  • "json" - JSON file only

6️⃣ Amazon Domain (For Different Countries)

AMAZON_DOMAIN = "www.amazon.com"

Options:

  • "www.amazon.com" - United States
  • "www.amazon.co.uk" - United Kingdom
  • "www.amazon.in" - India
  • "www.amazon.de" - Germany
  • "www.amazon.ca" - Canada

🔍 How It Works

Architecture Overview

┌─────────────────┐
│   config.py     │  ← User modifies search query
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────────┐
│   amazon_scraper.py                 │
│                                     │
│  1. Initialize Browser (Selenium)   │
│  2. Navigate to Amazon              │
│  3. Find Search Box                 │
│  4. Enter Query & Submit            │
│  5. Wait for Results                │
│  6. Scroll Page (Load More)         │
│  7. Extract Product Data            │
│  8. Export to Files                 │
└─────────────────┬───────────────────┘
                  │
                  ▼
         ┌────────────────┐
         │  scraped_data/ │
         │   - CSV        │
         │   - Excel      │
         │   - JSON       │
         └────────────────┘

Data Flow

  1. User Input → config.py (SEARCH_QUERY)
  2. Browser Launch → Selenium opens Edge/Chrome/Firefox
  3. Amazon Navigation → Goes to amazon.com
  4. Search Execution → Types query and presses Enter
  5. Page Loading → Waits for search results
  6. Data Extraction → Parses HTML for product info
  7. Data Storage → Saves to CSV/Excel/JSON
  8. Cleanup → Closes browser

📖 Step-by-Step Automation Process

STEP 1: Browser Initialization

What happens:

def setup_driver(self):
    # Creates browser instance with anti-detection settings

Technical details:

  • Launches Microsoft Edge browser
  • Sets window to maximized
  • Adds user-agent to mimic real browser
  • Disables automation detection flags
  • Sets timeouts (10s implicit, 20s page load)

Why it matters:

  • Amazon can detect automated bots
  • These settings make Selenium look more human-like
  • Reduces chance of being blocked
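
As a minimal sketch of what such a setup can look like with Selenium 4 (the exact options in amazon_scraper.py may differ, and the user-agent string here is only an example):

from selenium import webdriver
from selenium.webdriver.edge.options import Options

def setup_driver(self):
    options = Options()
    options.add_argument("--start-maximized")  # maximized window
    # Example user-agent; makes requests look like a regular desktop browser
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
    # Hide the "controlled by automated software" automation flags
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)
    self.driver = webdriver.Edge(options=options)  # Selenium 4 manages the driver
    self.driver.implicitly_wait(10)                # implicit wait: 10s
    self.driver.set_page_load_timeout(20)          # page load timeout: 20s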

STEP 2: Navigate to Amazon

What happens:

def navigate_to_amazon(self):
    self.driver.get("https://www.amazon.com")

Technical details:

  • Opens Amazon homepage
  • Waits 2 seconds for page to fully load
  • Handles any initial popups (if present)

Visual:

Browser → Opens → https://www.amazon.com → Loaded ✓
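
As a sketch (assuming AMAZON_DOMAIN and ACTION_DELAY come from config.py):

import time

def navigate_to_amazon(self):
    # Open the configured Amazon homepage and give it a moment to settle
    self.driver.get(f"https://{AMAZON_DOMAIN}")
    time.sleep(ACTION_DELAY)  # 2 seconds by default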

STEP 3: Search for Products

What happens:

def search_products(self):
    # Finds search box by ID: "twotabsearchtextbox"
    # Types your SEARCH_QUERY
    # Presses Enter

Technical details:

  1. Find Search Box:

    search_box = WebDriverWait(self.driver, 10).until(
        EC.presence_of_element_located((By.ID, "twotabsearchtextbox"))
    )
    • Uses explicit wait (up to 10 seconds)
    • Locates Amazon's search input by ID
    • ID twotabsearchtextbox is Amazon's standard search box
  2. Enter Search Query:

    search_box.clear()
    search_box.send_keys(self.search_query)  # From config.py
    • Clears any existing text
    • Types your search query character by character
  3. Submit Search:

    search_box.send_keys(Keys.RETURN)
    • Simulates pressing Enter key
    • Submits the search form
  4. Wait for Results:

    WebDriverWait(self.driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "[data-component-type='s-search-result']"))
    )
    • Waits for product results to appear
    • Looks for elements with attribute data-component-type='s-search-result'

Visual Flow:

Search Box → Type "wireless headphones" → Press Enter → Results Load ✓

STEP 4: Scroll Page

What happens:

def scroll_page(self):
    # Scrolls down 3 times to load lazy-loaded products

Technical details:

  • Amazon uses lazy loading (products load as you scroll)
  • Scrolls to bottom of page
  • Waits 1 second after each scroll
  • Repeats 3 times
  • Scrolls back to top

Why it matters:

  • Some products only appear after scrolling
  • Ensures all products are visible in DOM
  • Helps capture more data

Visual:

Top → Scroll Down → Wait → Scroll Down → Wait → Scroll Down → Back to Top ✓
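
A minimal sketch of this behavior (SCROLL_DELAY is assumed to come from config.py):

import time

def scroll_page(self):
    # Scroll to the bottom three times so lazy-loaded products render
    for _ in range(3):
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(SCROLL_DELAY)  # 1 second by default
    # Return to the top before extraction begins
    self.driver.execute_script("window.scrollTo(0, 0);")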

STEP 5: Extract Product Data

What happens:

def extract_product_data(self):
    # Finds all product containers
    # Loops through each product
    # Extracts: title, price, rating, reviews, URL, image

Technical details:

Finding Products:

products = self.driver.find_elements(By.CSS_SELECTOR, "[data-component-type='s-search-result']")
  • Finds all div elements with attribute data-component-type='s-search-result'
  • Each div represents one product

For Each Product, Extract:

  1. Title:

    title_elem = product.find_elements(By.CSS_SELECTOR, "h2 a span")
    title = title_elem[0].text
    • Looks for <h2><a><span> structure
    • Gets text content
  2. Price:

    price_whole = product.find_elements(By.CSS_SELECTOR, ".a-price-whole")
    price_fraction = product.find_elements(By.CSS_SELECTOR, ".a-price-fraction")
    price = f"{price_whole[0].text}{price_fraction[0].text}"
    • Amazon splits the price into separate whole and fraction elements
    • Example: 29 + 99 → 29.99 (the trailing decimal point is typically included in the whole element's text)
  3. Rating:

    rating_elem = product.find_elements(By.CSS_SELECTOR, ".a-icon-alt")
    rating = rating_elem[0].get_attribute('textContent')
    • Finds star rating (e.g., "4.5 out of 5 stars")
  4. Reviews Count:

    reviews_elem = product.find_elements(By.CSS_SELECTOR, "span.a-size-base.s-underline-text")
    num_reviews = reviews_elem[0].text
    • Gets number like "(1,234)"
  5. Product URL:

    url_elem = product.find_elements(By.CSS_SELECTOR, "h2 a")
    url = url_elem[0].get_attribute('href')
    • Full URL to product page
  6. Image URL:

    image_elem = product.find_elements(By.CSS_SELECTOR, "img.s-image")
    image_url = image_elem[0].get_attribute('src')
    • Direct link to product image

Data Storage:

product_data = {
    'rank': count + 1,
    'title': title,
    'price': price,
    'rating': rating,
    'num_reviews': num_reviews,
    'url': url,
    'image_url': image_url,
    'scraped_at': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
}
self.products_data.append(product_data)
  • Creates dictionary for each product
  • Adds to list

Visual:

Product 1 → Extract → Store
Product 2 → Extract → Store
Product 3 → Extract → Store
...
Product 20 → Extract → Store ✓
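
Because any of these elements can be missing (see Issue 4 under Troubleshooting), each lookup is guarded; a hypothetical helper along these lines keeps the loop tidy:

from selenium.webdriver.common.by import By

def safe_text(product, selector, default="N/A"):
    # Hypothetical helper: return the first match's text, or "N/A" if absent
    elems = product.find_elements(By.CSS_SELECTOR, selector)
    return elems[0].text if elems else default

title = safe_text(product, "h2 a span")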

STEP 6: Export Data

What happens:

def export_data(self):
    # Converts data to Pandas DataFrame
    # Exports to CSV/Excel/JSON based on config

Technical details:

  1. Create Output Directory:

    os.makedirs(OUTPUT_DIR, exist_ok=True)
    • Creates the scraped_data/ folder if it doesn't exist
  2. Generate Filename:

    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    base_filename = f"amazon_{self.search_query.replace(' ', '_')}_{timestamp}"
    • Example: amazon_wireless_headphones_20251107_143022
  3. Convert to DataFrame:

    df = pd.DataFrame(self.products_data)
    • Creates table structure from data
  4. Export Formats:

    CSV:

    df.to_csv(csv_file, index=False, encoding='utf-8')
    • Plain text, comma-separated
    • Opens in Excel, Google Sheets

    Excel:

    df.to_excel(excel_file, index=False, engine='openpyxl')
    • Native Excel format (.xlsx)
    • Preserves formatting

    JSON:

    with open(json_file, 'w', encoding='utf-8') as f:
        json.dump(self.products_data, f, indent=2)
    • Structured data format
    • Good for APIs

Output Location:

scraped_data/
  ├── amazon_wireless_headphones_20251107_143022.csv
  ├── amazon_wireless_headphones_20251107_143022.xlsx
  └── amazon_wireless_headphones_20251107_143022.json
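
Assembled into one method, the export step might look roughly like this sketch (OUTPUT_DIR and EXPORT_FORMAT are the config.py settings):

import json
import os
from datetime import datetime

import pandas as pd

def export_data(self):
    os.makedirs(OUTPUT_DIR, exist_ok=True)  # from config.py
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    base = os.path.join(
        OUTPUT_DIR, f"amazon_{self.search_query.replace(' ', '_')}_{timestamp}"
    )
    df = pd.DataFrame(self.products_data)
    if EXPORT_FORMAT in ("all", "csv"):
        df.to_csv(f"{base}.csv", index=False, encoding='utf-8')
    if EXPORT_FORMAT in ("all", "excel"):
        df.to_excel(f"{base}.xlsx", index=False, engine='openpyxl')
    if EXPORT_FORMAT in ("all", "json"):
        with open(f"{base}.json", 'w', encoding='utf-8') as f:
            json.dump(self.products_data, f, indent=2)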

STEP 7: Cleanup

What happens:

def cleanup(self):
    self.driver.quit()

Technical details:

  • Closes browser window
  • Releases system resources
  • Ensures clean exit
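
A common pattern (a sketch; amazon_scraper.py may structure it differently) is to guard the whole run with try/finally so the browser closes even when a step fails:

scraper = AmazonScraper()
try:
    scraper.setup_driver()
    scraper.navigate_to_amazon()
    # ... search, scroll, extract, export ...
finally:
    scraper.cleanup()  # always quit the browser, even after an error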

🏃 Running the Scraper

Method 1: Default Configuration

# Activate virtual environment
.\venv\Scripts\Activate.ps1

# Run scraper
python amazon_scraper.py

Method 2: Custom Search Query

Option A: Edit config.py

# Open config.py
# Change line:
SEARCH_QUERY = "gaming laptop"  # Your custom query

# Save and run
python amazon_scraper.py

Option B: Create custom script

# quick_search.py
from amazon_scraper import AmazonScraper

# Create scraper instance
scraper = AmazonScraper()

# Modify search query programmatically
scraper.search_query = "python books"
scraper.max_products = 10

# Run
scraper.run()

Expected Output (Terminal):

============================================================
Amazon Web Scraper Initialized
============================================================
Search Query: wireless headphones
Max Products: 20
Browser: EDGE
============================================================

[STEP 1] Setting up EDGE browser...
✓ Browser initialized successfully

[STEP 2] Navigating to Amazon (https://www.amazon.com)...
✓ Successfully loaded Amazon homepage
Current URL: https://www.amazon.com/

[STEP 3] Searching for: 'wireless headphones'...
✓ Found search box
✓ Entered search query
✓ Search results loaded
Current URL: https://www.amazon.com/s?k=wireless+headphones

[STEP 4] Scrolling page to load all products...
✓ Scrolled (iteration 1)
✓ Scrolled (iteration 2)
✓ Scrolled (iteration 3)
✓ Page scrolling complete

[STEP 5] Extracting product data...
✓ Found 60 product elements
  [1] Sony WH-1000XM5 Wireless Noise Canceling Headphon... - $398.00
  [2] Bose QuietComfort 45 Bluetooth Wireless Noise Can... - $279.00
  [3] Apple AirPods Pro (2nd Generation) Wireless Ear B... - $249.00
  ...
  [20] JBL Tune 510BT: Wireless On-Ear Headphones with ... - $29.95

✓ Successfully extracted 20 products

[STEP 6] Exporting data...
✓ Exported to CSV: scraped_data\amazon_wireless_headphones_20251107_143022.csv
✓ Exported to Excel: scraped_data\amazon_wireless_headphones_20251107_143022.xlsx
✓ Exported to JSON: scraped_data\amazon_wireless_headphones_20251107_143022.json

✓ All exports completed successfully!

[STEP 7] Cleaning up...
✓ Browser closed

============================================================
✓ Scraping completed successfully!
Total products scraped: 20
Total time: 45.67 seconds
============================================================

📊 Output Files

CSV Format (*.csv)

rank,title,price,rating,num_reviews,url,image_url,scraped_at
1,"Sony WH-1000XM5","$398.00","4.7 out of 5 stars","(12,345)","https://amazon.com/dp/B09XS7JWHH","https://m.media-amazon.com/images/I/...",2025-11-07 14:30:22
2,"Bose QuietComfort 45","$279.00","4.6 out of 5 stars","(8,901)","https://amazon.com/dp/B098FKXT8L","https://m.media-amazon.com/images/I/...",2025-11-07 14:30:23
...

Best for:

  • Excel analysis
  • Google Sheets import
  • Simple data viewing

Excel Format (*.xlsx)

Opens directly in Microsoft Excel with columns:

  • A: Rank
  • B: Title
  • C: Price
  • D: Rating
  • E: Number of Reviews
  • F: Product URL
  • G: Image URL
  • H: Scraped At

Features:

  • Formatted cells
  • Sortable columns
  • Can create charts/pivot tables

JSON Format (*.json)

[
  {
    "rank": 1,
    "title": "Sony WH-1000XM5 Wireless Noise Canceling Headphones",
    "price": "$398.00",
    "rating": "4.7 out of 5 stars",
    "num_reviews": "(12,345)",
    "url": "https://amazon.com/dp/B09XS7JWHH",
    "image_url": "https://m.media-amazon.com/images/I/...",
    "scraped_at": "2025-11-07 14:30:22"
  },
  {
    "rank": 2,
    "title": "Bose QuietComfort 45",
    ...
  }
]

Best for:

  • API integration
  • Web applications
  • Data processing scripts
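
For quick analysis, any of these files loads straight into pandas, e.g.:

import pandas as pd

# Load a scraped CSV (adjust the filename to your own run's timestamp)
df = pd.read_csv("scraped_data/amazon_wireless_headphones_20251107_143022.csv")
print(df[["title", "price", "rating"]].head())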

🔧 Troubleshooting

Issue 1: Browser Doesn't Open

Symptoms:

  • Error: "WebDriver not found"
  • Browser doesn't launch

Solution:

# Selenium will auto-download the driver, but if it fails:
# Make sure you have the browser installed:
# - Microsoft Edge (recommended)
# - Google Chrome
# - Mozilla Firefox

# Try changing browser in config.py:
BROWSER = "edge"  # or "chrome" or "firefox"

Issue 2: No Products Found

Symptoms:

  • "✓ Found 0 product elements"
  • Empty output files

Possible causes:

  1. Amazon changed their HTML structure
  2. Search query has no results
  3. Page didn't load fully

Solution:

# In config.py, increase timeouts:
PAGE_LOAD_TIMEOUT = 30  # Increase from 20
IMPLICIT_WAIT = 15      # Increase from 10
ACTION_DELAY = 3        # Increase from 2

Issue 3: Getting Blocked by Amazon

Symptoms:

  • CAPTCHA appears
  • "Access Denied" page
  • IP temporarily blocked

Solution:

# In config.py:
ACTION_DELAY = 5        # Slow down actions
SCROLL_DELAY = 2        # Slow down scrolling
MAX_PRODUCTS = 10       # Scrape fewer products

# Consider:
# - Using VPN
# - Waiting between scrapes
# - Running during off-peak hours

Issue 4: Incomplete Data

Symptoms:

  • Some fields show "N/A"
  • Missing prices or ratings

Cause:

  • Amazon's HTML varies by product type
  • Some products don't have all fields

Solution:

  • This is normal behavior
  • Filter data in Excel/CSV afterward
  • Products without key data will show "N/A"

Issue 5: Import Errors

Symptoms:

ModuleNotFoundError: No module named 'selenium'

Solution:

# Make sure virtual environment is activated:
.\venv\Scripts\Activate.ps1

# Reinstall dependencies:
pip install -r requirements.txt

# Verify installation:
pip list

⚠️ Important Notes

Legal & Ethical Considerations

  1. Amazon's Terms of Service:

    • Amazon prohibits automated scraping without permission
    • This tool is for educational purposes only
    • Commercial use may violate Terms of Service
  2. Rate Limiting:

    • Don't scrape too frequently (leave hours between sessions)
    • Keep MAX_PRODUCTS reasonable (≤50)
    • Respect Amazon's servers
  3. Data Usage:

    • Don't republish scraped data
    • Don't use for competitive intelligence
    • Personal research only

Best Practices

  1. Responsible Scraping:

    ACTION_DELAY = 3      # Slower = More polite
    MAX_PRODUCTS = 20     # Don't be greedy
  2. Testing:

    • Start with small numbers (MAX_PRODUCTS = 5)
    • Test different search queries
    • Verify data quality
  3. Maintenance:

    • Amazon changes their HTML frequently
    • You may need to update selectors
    • Check logs if scraping fails

Technical Limitations

  1. Dynamic Content:

    • Some elements load via JavaScript
    • May need longer wait times
    • Not all products may be captured
  2. CAPTCHA:

    • Amazon may show CAPTCHA
    • No automated solution (that's the point!)
    • If CAPTCHA appears, solve manually or restart
  3. IP Blocking:

    • Too many requests = temporary block
    • Usually lifts after 1-24 hours
    • Use responsibly

🎓 Learning Resources

Understanding the Code

Key Concepts:

  1. Selenium WebDriver - Browser automation
  2. CSS Selectors - Finding elements in HTML
  3. Explicit Waits - Waiting for elements to load
  4. BeautifulSoup - Parsing HTML
  5. Pandas - Data manipulation

CSS Selectors Used

# By ID
driver.find_element(By.ID, "twotabsearchtextbox")

# By CSS class
driver.find_elements(By.CSS_SELECTOR, ".a-price-whole")

# By attribute
driver.find_elements(By.CSS_SELECTOR, "[data-component-type='s-search-result']")

# By tag hierarchy
driver.find_elements(By.CSS_SELECTOR, "h2 a span")

Selenium Wait Strategies

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Implicit wait (global) - applies to every element lookup
driver.implicitly_wait(10)

# Explicit wait (specific element) - blocks until the condition is met
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "element-id"))
)

📞 Support

Common Questions:

Q: Can I scrape other websites? A: Yes! Modify the URL and selectors in the code. Each website has a different HTML structure.

Q: How do I change what data is extracted? A: Edit the extract_product_data() method in amazon_scraper.py. Add new CSS selectors for additional fields.

Q: Can I run this on a schedule? A: Yes, use Windows Task Scheduler or create a loop with time delays.

Q: Is there a GUI version? A: Currently CLI only. You could build a GUI using tkinter or PyQt.
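
For example, a simple repeated run could look like this sketch (assuming the AmazonScraper class exposes run() as shown in Option B above):

import time

from amazon_scraper import AmazonScraper

while True:
    scraper = AmazonScraper()
    scraper.run()
    time.sleep(6 * 60 * 60)  # wait six hours between sessions (be polite)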


🚀 Next Steps

  1. Run a Test:

    python amazon_scraper.py
  2. Customize Search:

    • Edit config.py
    • Change SEARCH_QUERY
    • Adjust MAX_PRODUCTS
  3. Analyze Data:

    • Open output files in Excel
    • Create charts
    • Analyze pricing trends
  4. Expand Functionality:

    • Add more data fields
    • Scrape multiple pages
    • Compare prices over time
    • Send results via email

📝 Version History

  • v1.0.0 (2025-11-07)
    • Initial release
    • Basic Amazon search scraping
    • CSV/Excel/JSON export
    • Edge browser support

📄 License

This project is for educational purposes only.

Use responsibly and respect website terms of service.


Happy Scraping! 🎉

Remember: With great scraping power comes great responsibility!
