- Overview
- Project Structure
- Installation & Setup
- Configuration
- How It Works
- Step-by-Step Automation Process
- Running the Scraper
- Output Files
- Troubleshooting
- Important Notes
This project uses Selenium WebDriver to automate web browsing and extract product information from Amazon search results. The scraper mimics human behavior to search for products and collect data like:
- Product Title
- Price
- Rating
- Number of Reviews
- Product URL
- Image URL
Technology Stack:
- Python 3.11+
- Selenium 4.26+ (Browser Automation)
- BeautifulSoup4 (HTML Parsing)
- Pandas (Data Processing)
- Colorama (Terminal Output)
web_scrapping/
│
├── amazon_scraper.py # Main scraper script
├── config.py # Configuration file (EDIT THIS)
├── requirements.txt # Python dependencies
├── README.md # This documentation
│
└── scraped_data/ # Output directory (created automatically)
├── amazon_wireless_headphones_20251107_123456.csv
├── amazon_wireless_headphones_20251107_123456.xlsx
└── amazon_wireless_headphones_20251107_123456.json
```
cd D:\PROJECTS\web_scrapping
python -m venv venv
```

Windows PowerShell:

```
.\venv\Scripts\Activate.ps1
```

Windows CMD:

```
venv\Scripts\activate.bat
```

```
pip install -r requirements.txt
```

This will install:

- `selenium` - Browser automation
- `webdriver-manager` - Automatic driver management
- `pandas` - Data manipulation
- `openpyxl` - Excel file support
- `beautifulsoup4` - HTML parsing
- `colorama` - Colored terminal output
- And other utilities

Verify the installation:

```
python -c "import selenium; print(f'Selenium version: {selenium.__version__}')"
```

Open `config.py` and modify these key settings:
```python
SEARCH_QUERY = "wireless headphones"
```

What to change:

- Replace `"wireless headphones"` with your desired search term
- Examples: `"laptop backpack"`, `"gaming mouse"`, `"python programming books"`, `"smart watch"`

```python
MAX_PRODUCTS = 20
```

What to change:

- How many products to scrape (default: 20)
- Recommended: 10-50 products
- Higher numbers = longer scraping time

```python
BROWSER = "edge"
```

Options:

- `"edge"` - Microsoft Edge (recommended)
- `"chrome"` - Google Chrome
- `"firefox"` - Mozilla Firefox

```python
HEADLESS = False
```

Options:

- `False` - Show the browser window (easier to debug)
- `True` - Run in the background (faster, no GUI)

```python
EXPORT_FORMAT = "all"
```

Options:

- `"all"` - Export to CSV, Excel, and JSON
- `"csv"` - CSV file only
- `"excel"` - Excel file only
- `"json"` - JSON file only

```python
AMAZON_DOMAIN = "www.amazon.com"
```

Options:

- `"www.amazon.com"` - United States
- `"www.amazon.co.uk"` - United Kingdom
- `"www.amazon.in"` - India
- `"www.amazon.de"` - Germany
- `"www.amazon.ca"` - Canada
┌─────────────────┐
│ config.py │ ← User modifies search query
└────────┬────────┘
│
▼
┌─────────────────────────────────────┐
│ amazon_scraper.py │
│ │
│ 1. Initialize Browser (Selenium) │
│ 2. Navigate to Amazon │
│ 3. Find Search Box │
│ 4. Enter Query & Submit │
│ 5. Wait for Results │
│ 6. Scroll Page (Load More) │
│ 7. Extract Product Data │
│ 8. Export to Files │
└─────────────────┬───────────────────┘
│
▼
┌────────────────┐
│ scraped_data/ │
│ - CSV │
│ - Excel │
│ - JSON │
└────────────────┘
- User Input → `config.py` (`SEARCH_QUERY`)
- Browser Launch → Selenium opens Edge/Chrome/Firefox
- Amazon Navigation → Goes to amazon.com
- Search Execution → Types query and presses Enter
- Page Loading → Waits for search results
- Data Extraction → Parses HTML for product info
- Data Storage → Saves to CSV/Excel/JSON
- Cleanup → Closes browser
What happens:
```python
def setup_driver(self):
    # Creates browser instance with anti-detection settings
```

Technical details:
- Launches Microsoft Edge browser
- Sets window to maximized
- Adds user-agent to mimic real browser
- Disables automation detection flags
- Sets timeouts (10s implicit, 20s page load)
Why it matters:
- Amazon can detect automated bots
- These settings make Selenium look more human-like
- Reduces chance of being blocked
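For illustration, here is a minimal sketch of such a setup for Edge. The exact flags in `amazon_scraper.py` may differ; these are common Chromium-style anti-detection options, and the user-agent string is just an example:

```python
from selenium import webdriver
from selenium.webdriver.edge.options import Options

def setup_driver():
    options = Options()
    options.add_argument("--start-maximized")
    # Present a regular browser user-agent (example string; adjust as needed)
    options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
    # Common flags that hide the "automated" fingerprint
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    driver = webdriver.Edge(options=options)
    driver.implicitly_wait(10)        # 10 s implicit wait, as documented above
    driver.set_page_load_timeout(20)  # 20 s page-load timeout
    return driver
```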
What happens:
```python
def navigate_to_amazon(self):
    self.driver.get("https://www.amazon.com")
```

Technical details:
- Opens Amazon homepage
- Waits 2 seconds for page to fully load
- Handles any initial popups (if present)
Visual:
Browser → Opens → https://www.amazon.com → Loaded ✓
What happens:
```python
def search_products(self):
    # Finds search box by ID: "twotabsearchtextbox"
    # Types your SEARCH_QUERY
    # Presses Enter
```

Technical details:
Find Search Box:

```python
search_box = WebDriverWait(self.driver, 10).until(
    EC.presence_of_element_located((By.ID, "twotabsearchtextbox"))
)
```

- Uses an explicit wait (up to 10 seconds)
- Locates Amazon's search input by its ID
- `twotabsearchtextbox` is Amazon's standard search box ID
Enter Search Query:

```python
search_box.clear()
search_box.send_keys(self.search_query)  # From config.py
```

- Clears any existing text
- Types the search query character by character
Submit Search:

```python
search_box.send_keys(Keys.RETURN)
```

- Simulates pressing the Enter key
- Submits the search form
Wait for Results:

```python
WebDriverWait(self.driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "[data-component-type='s-search-result']"))
)
```

- Waits for product results to appear
- Looks for elements with the attribute `data-component-type='s-search-result'`
Visual Flow:
Search Box → Type "wireless headphones" → Press Enter → Results Load ✓
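Putting those fragments together, a standalone version of the search step might look like this (written as a plain function rather than the class method in `amazon_scraper.py`, but using the same locators):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def search_products(driver, query):
    # Wait up to 10 s for Amazon's search box, then type and submit the query
    search_box = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "twotabsearchtextbox"))
    )
    search_box.clear()
    search_box.send_keys(query)
    search_box.send_keys(Keys.RETURN)
    # Wait until at least one search-result container is present
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, "[data-component-type='s-search-result']")
        )
    )
```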
What happens:
```python
def scroll_page(self):
    # Scrolls down 3 times to load lazy-loaded products
```

Technical details:
- Amazon uses lazy loading (products load as you scroll)
- Scrolls to bottom of page
- Waits 1 second after each scroll
- Repeats 3 times
- Scrolls back to top
Why it matters:
- Some products only appear after scrolling
- Ensures all products are visible in DOM
- Helps capture more data
Visual:
Top → Scroll Down → Wait → Scroll Down → Wait → Scroll Down → Back to Top ✓
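A minimal sketch of that loop, assuming the common `window.scrollTo` approach via `execute_script` (the actual method in `amazon_scraper.py` may differ in detail):

```python
import time

def scroll_page(driver, scrolls=3, delay=1.0):
    for _ in range(scrolls):
        # Jump to the bottom so lazy-loaded products enter the DOM
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(delay)  # give new content time to render
    # Return to the top before extraction starts
    driver.execute_script("window.scrollTo(0, 0);")
```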
What happens:
```python
def extract_product_data(self):
    # Finds all product containers
    # Loops through each product
    # Extracts: title, price, rating, reviews, URL, image
```

Technical details:

```python
products = self.driver.find_elements(By.CSS_SELECTOR, "[data-component-type='s-search-result']")
```

- Finds all div elements with the attribute `data-component-type='s-search-result'`
- Each div represents one product
Title:

```python
title_elem = product.find_elements(By.CSS_SELECTOR, "h2 a span")
title = title_elem[0].text
```

- Looks for the `<h2><a><span>` structure
- Gets the text content
Price:

```python
price_whole = product.find_elements(By.CSS_SELECTOR, ".a-price-whole")
price_fraction = product.find_elements(By.CSS_SELECTOR, ".a-price-fraction")
price = f"{price_whole[0].text}{price_fraction[0].text}"
```

- Amazon splits the price into whole and fraction parts
- Example: `$29` + `.99` = `$29.99`
Rating:

```python
rating_elem = product.find_elements(By.CSS_SELECTOR, ".a-icon-alt")
rating = rating_elem[0].get_attribute('textContent')
```

- Finds the star rating (e.g., "4.5 out of 5 stars")
Reviews Count:

```python
reviews_elem = product.find_elements(By.CSS_SELECTOR, "span.a-size-base.s-underline-text")
num_reviews = reviews_elem[0].text
```

- Gets a count like "(1,234)"
Product URL:

```python
url_elem = product.find_elements(By.CSS_SELECTOR, "h2 a")
url = url_elem[0].get_attribute('href')
```

- Full URL to the product page
Image URL:

```python
image_elem = product.find_elements(By.CSS_SELECTOR, "img.s-image")
image_url = image_elem[0].get_attribute('src')
```

- Direct link to the product image
```python
product_data = {
    'rank': count + 1,
    'title': title,
    'price': price,
    'rating': rating,
    'num_reviews': num_reviews,
    'url': url,
    'image_url': image_url,
    'scraped_at': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
}
self.products_data.append(product_data)
```

- Creates a dictionary for each product
- Appends it to the list
Visual:
Product 1 → Extract → Store
Product 2 → Extract → Store
Product 3 → Extract → Store
...
Product 20 → Extract → Store ✓
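Note the pattern in the snippets above: `find_elements` (plural) returns an empty list instead of raising when a field is missing, which is how products without a price or rating end up as "N/A" (see Troubleshooting). A sketch of that guard, using a hypothetical `safe_text` helper and assuming a live `driver` from the setup step:

```python
from selenium.webdriver.common.by import By

def safe_text(product, selector, default="N/A"):
    # find_elements returns [] rather than raising, so missing fields fall back to a default
    elems = product.find_elements(By.CSS_SELECTOR, selector)
    return elems[0].text if elems else default

products = driver.find_elements(
    By.CSS_SELECTOR, "[data-component-type='s-search-result']"
)
for count, product in enumerate(products[:20]):
    title = safe_text(product, "h2 a span")
    num_reviews = safe_text(product, "span.a-size-base.s-underline-text")
```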
What happens:
```python
def export_data(self):
    # Converts data to a Pandas DataFrame
    # Exports to CSV/Excel/JSON based on config
```

Technical details:
Create Output Directory:

```python
os.makedirs(OUTPUT_DIR, exist_ok=True)
```

- Creates the `scraped_data/` folder if it doesn't exist
Generate Filename:

```python
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
base_filename = f"amazon_{self.search_query.replace(' ', '_')}_{timestamp}"
```

- Example: `amazon_wireless_headphones_20251107_143022`
Convert to DataFrame:

```python
df = pd.DataFrame(self.products_data)
```

- Creates a table structure from the data
Export Formats:

CSV:

```python
df.to_csv(csv_file, index=False, encoding='utf-8')
```

- Plain text, comma-separated
- Opens in Excel and Google Sheets

Excel:

```python
df.to_excel(excel_file, index=False, engine='openpyxl')
```

- Native Excel format (.xlsx)
- Preserves formatting

JSON:

```python
json.dump(self.products_data, f, indent=2)
```

- Structured data format
- Good for APIs
Output Location:
scraped_data/
├── amazon_wireless_headphones_20251107_143022.csv
├── amazon_wireless_headphones_20251107_143022.xlsx
└── amazon_wireless_headphones_20251107_143022.json
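A self-contained sketch of the export step (standalone function; the real method reads `OUTPUT_DIR` and `EXPORT_FORMAT` from `config.py` rather than taking arguments):

```python
import json
import os
from datetime import datetime

import pandas as pd

def export_data(products_data, search_query, output_dir="scraped_data"):
    os.makedirs(output_dir, exist_ok=True)  # create the folder if needed
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    base = os.path.join(
        output_dir, f"amazon_{search_query.replace(' ', '_')}_{timestamp}"
    )
    df = pd.DataFrame(products_data)
    df.to_csv(f"{base}.csv", index=False, encoding='utf-8')
    df.to_excel(f"{base}.xlsx", index=False, engine='openpyxl')
    with open(f"{base}.json", 'w', encoding='utf-8') as f:
        json.dump(products_data, f, indent=2, ensure_ascii=False)
```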
What happens:
```python
def cleanup(self):
    self.driver.quit()
```

Technical details:
- Closes browser window
- Releases system resources
- Ensures clean exit
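One way to guarantee the browser closes even when an earlier step raises is a `try/finally` block (a sketch; `run()` may already handle this internally):

```python
from amazon_scraper import AmazonScraper

scraper = AmazonScraper()
try:
    scraper.run()
finally:
    scraper.cleanup()  # browser closes even if a step above failed
```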
```
# Activate virtual environment
.\venv\Scripts\Activate.ps1

# Run scraper
python amazon_scraper.py
```

Option A: Edit config.py

```python
# Open config.py and change this line:
SEARCH_QUERY = "gaming laptop"  # Your custom query
```

```
# Save and run
python amazon_scraper.py
```

Option B: Create a custom script

```python
# quick_search.py
from amazon_scraper import AmazonScraper

# Create scraper instance
scraper = AmazonScraper()

# Modify search query programmatically
scraper.search_query = "python books"
scraper.max_products = 10

# Run
scraper.run()
```

============================================================
Amazon Web Scraper Initialized
============================================================
Search Query: wireless headphones
Max Products: 20
Browser: EDGE
============================================================
[STEP 1] Setting up EDGE browser...
✓ Browser initialized successfully
[STEP 2] Navigating to Amazon (https://www.amazon.com)...
✓ Successfully loaded Amazon homepage
Current URL: https://www.amazon.com/
[STEP 3] Searching for: 'wireless headphones'...
✓ Found search box
✓ Entered search query
✓ Search results loaded
Current URL: https://www.amazon.com/s?k=wireless+headphones
[STEP 4] Scrolling page to load all products...
✓ Scrolled (iteration 1)
✓ Scrolled (iteration 2)
✓ Scrolled (iteration 3)
✓ Page scrolling complete
[STEP 5] Extracting product data...
✓ Found 60 product elements
[1] Sony WH-1000XM5 Wireless Noise Canceling Headphon... - $398.00
[2] Bose QuietComfort 45 Bluetooth Wireless Noise Can... - $279.00
[3] Apple AirPods Pro (2nd Generation) Wireless Ear B... - $249.00
...
[20] JBL Tune 510BT: Wireless On-Ear Headphones with ... - $29.95
✓ Successfully extracted 20 products
[STEP 6] Exporting data...
✓ Exported to CSV: scraped_data\amazon_wireless_headphones_20251107_143022.csv
✓ Exported to Excel: scraped_data\amazon_wireless_headphones_20251107_143022.xlsx
✓ Exported to JSON: scraped_data\amazon_wireless_headphones_20251107_143022.json
✓ All exports completed successfully!
[STEP 7] Cleaning up...
✓ Browser closed
============================================================
✓ Scraping completed successfully!
Total products scraped: 20
Total time: 45.67 seconds
============================================================
rank,title,price,rating,num_reviews,url,image_url,scraped_at
1,"Sony WH-1000XM5","$398.00","4.7 out of 5 stars","(12,345)","https://amazon.com/dp/B09XS7JWHH","https://m.media-amazon.com/images/I/...",2025-11-07 14:30:22
2,"Bose QuietComfort 45","$279.00","4.6 out of 5 stars","(8,901)","https://amazon.com/dp/B098FKXT8L","https://m.media-amazon.com/images/I/...",2025-11-07 14:30:23
...
Best for:
- Excel analysis
- Google Sheets import
- Simple data viewing
Opens directly in Microsoft Excel with columns:
- A: Rank
- B: Title
- C: Price
- D: Rating
- E: Number of Reviews
- F: Product URL
- G: Image URL
- H: Scraped At
Features:
- Formatted cells
- Sortable columns
- Can create charts/pivot tables
[
{
"rank": 1,
"title": "Sony WH-1000XM5 Wireless Noise Canceling Headphones",
"price": "$398.00",
"rating": "4.7 out of 5 stars",
"num_reviews": "(12,345)",
"url": "https://amazon.com/dp/B09XS7JWHH",
"image_url": "https://m.media-amazon.com/images/I/...",
"scraped_at": "2025-11-07 14:30:22"
},
{
"rank": 2,
"title": "Bose QuietComfort 45",
...
}
]

Best for:
- API integration
- Web applications
- Data processing scripts
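For example, a short script that loads one of the JSON exports for further processing (the filename is the example from above; substitute the timestamped file your run produced):

```python
import json

# Example path from the docs above; use your own output file
with open("scraped_data/amazon_wireless_headphones_20251107_143022.json",
          encoding="utf-8") as f:
    products = json.load(f)

for p in products[:3]:
    print(p["rank"], p["title"], p["price"])
```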
Symptoms:
- Error: "WebDriver not found"
- Browser doesn't launch
Solution:
```python
# Selenium will auto-download the driver, but if it fails:
# Make sure you have the browser installed:
# - Microsoft Edge (recommended)
# - Google Chrome
# - Mozilla Firefox

# Try changing the browser in config.py:
BROWSER = "edge"  # or "chrome" or "firefox"
```

Symptoms:
- "✓ Found 0 product elements"
- Empty output files
Possible causes:
- Amazon changed their HTML structure
- Search query has no results
- Page didn't load fully
Solution:
```python
# In config.py, increase timeouts:
PAGE_LOAD_TIMEOUT = 30  # Increase from 20
IMPLICIT_WAIT = 15      # Increase from 10
ACTION_DELAY = 3        # Increase from 2
```

Symptoms:
- CAPTCHA appears
- "Access Denied" page
- IP temporarily blocked
Solution:
```python
# In config.py:
ACTION_DELAY = 5   # Slow down actions
SCROLL_DELAY = 2   # Slow down scrolling
MAX_PRODUCTS = 10  # Scrape fewer products

# Consider:
# - Using a VPN
# - Waiting between scrapes
# - Running during off-peak hours
```

Symptoms:
- Some fields show "N/A"
- Missing prices or ratings
Cause:
- Amazon's HTML varies by product type
- Some products don't have all fields
Solution:
- This is normal behavior
- Filter data in Excel/CSV afterward
- Products without key data will show "N/A"
Symptoms:
ModuleNotFoundError: No module named 'selenium'
Solution:
```
# Make sure the virtual environment is activated:
.\venv\Scripts\Activate.ps1

# Reinstall dependencies:
pip install -r requirements.txt

# Verify installation:
pip list
```

- Amazon's Terms of Service:
  - Amazon prohibits automated scraping without permission
  - This tool is for educational purposes only
  - Commercial use may violate the Terms of Service
- Rate Limiting:
  - Don't scrape too frequently (leave hours between sessions)
  - Keep MAX_PRODUCTS reasonable (≤50)
  - Respect Amazon's servers
- Responsible Scraping:

  ```python
  ACTION_DELAY = 3   # Slower = more polite
  MAX_PRODUCTS = 20  # Don't be greedy
  ```

- Data Usage:
  - Don't republish scraped data
  - Don't use it for competitive intelligence
  - Personal research only
- Testing:
  - Start with small numbers (`MAX_PRODUCTS = 5`)
  - Test different search queries
  - Verify data quality
- Maintenance:
  - Amazon changes its HTML frequently
  - You may need to update the selectors
  - Check the logs if scraping fails
- Dynamic Content:
  - Some elements load via JavaScript
  - May need longer wait times
  - Not all products may be captured
- CAPTCHA:
  - Amazon may show a CAPTCHA
  - There is no automated solution (that's the point!)
  - If a CAPTCHA appears, solve it manually or restart
- IP Blocking:
  - Too many requests = temporary block
  - Usually lifts after 1-24 hours
  - Use responsibly
Key Concepts:
- Selenium WebDriver - Browser automation
- CSS Selectors - Finding elements in HTML
- Explicit Waits - Waiting for elements to load
- BeautifulSoup - Parsing HTML
- Pandas - Data manipulation
```python
# By ID
By.ID, "twotabsearchtextbox"

# By CSS class
By.CSS_SELECTOR, ".a-price-whole"

# By attribute
By.CSS_SELECTOR, "[data-component-type='s-search-result']"

# By tag hierarchy
By.CSS_SELECTOR, "h2 a span"
```

```python
# Implicit wait (global)
driver.implicitly_wait(10)

# Explicit wait (specific element)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "element-id"))
)
```

Common Questions:
Q: Can I scrape other websites?
A: Yes! Modify the URL and selectors in the code. Each website has a different HTML structure.

Q: How do I change what data is extracted?
A: Edit the `extract_product_data()` method in `amazon_scraper.py`. Add new CSS selectors for additional fields.

Q: Can I run this on a schedule?
A: Yes, use Windows Task Scheduler or create a loop with time delays (see the sketch below).

Q: Is there a GUI version?
A: Currently CLI only. You could build a GUI using tkinter or PyQt.
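For the scheduling question, a minimal in-process loop might look like this (a sketch; having Windows Task Scheduler invoke `python amazon_scraper.py` is the more robust option):

```python
# schedule_scrape.py — hypothetical loop with long pauses between sessions
import time

from amazon_scraper import AmazonScraper

while True:
    scraper = AmazonScraper()
    scraper.run()
    time.sleep(6 * 60 * 60)  # wait 6 hours, per the rate-limiting advice above
```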
- Run a Test:

  ```
  python amazon_scraper.py
  ```

- Customize the Search:
  - Edit `config.py`
  - Change `SEARCH_QUERY`
  - Adjust `MAX_PRODUCTS`
- Analyze the Data:
  - Open the output files in Excel
  - Create charts
  - Analyze pricing trends
- Expand Functionality:
  - Add more data fields
  - Scrape multiple pages
  - Compare prices over time
  - Send results via email
- v1.0.0 (2025-11-07)
- Initial release
- Basic Amazon search scraping
- CSV/Excel/JSON export
- Edge browser support
This project is for educational purposes only.
Use responsibly and respect website terms of service.
Happy Scraping! 🎉
Remember: With great scraping power comes great responsibility!