This project is a Python-based web scraper designed to extract product listings and details from Amazon.in. It uses the requests library for HTTP requests and BeautifulSoup for parsing HTML content.
- Features
- Technologies Used
- Installation
- Usage
- Project Structure
- Detailed Description
- Contributing
- License
- Product Listing Extraction: Retrieves product name, URL, price, rating, number of reviews, description, ASIN, and manufacturer.
- CSV Export: Exports scraped data to a CSV file for further analysis and processing.
- Configurable Parameters: Allows specifying the search query and the number of pages to scrape.
- Python 3.11.3
- Requests Library
- BeautifulSoup Library
- Clone the repository:
git clone https://github.com/dasdebanna/Amazon-Product-Scraper.git
- Navigate to the project directory:
cd Amazon-Product-Scraper
- Install the required libraries:
pip install requests beautifulsoup4
- Open the scraper.py file.
- Set the url variable to the desired Amazon search results page URL:
url = 'https://www.amazon.in/s?k=product'
- Specify the number of pages to scrape by setting the num_pages variable:
num_pages = 5
- Run the script:
python scraper.py
- The scraped product data will be saved in the product_data.csv file.
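The usage steps do not show how num_pages turns into individual page requests. As a minimal sketch, assuming the search results accept a `page` query parameter (an assumption, not confirmed by this README), the per-page URLs could be built like this. The helper name `page_urls` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(url, num_pages):
    """Return one URL per results page, numbered from 1 to num_pages.

    Assumption: pagination is controlled by a `page` query parameter.
    """
    parts = urlsplit(url)
    # Drop any existing page parameter so it is not duplicated.
    query = [(key, value) for key, value in parse_qsl(parts.query) if key != "page"]
    urls = []
    for page in range(1, num_pages + 1):
        new_query = urlencode(query + [("page", str(page))])
        urls.append(urlunsplit(parts._replace(query=new_query)))
    return urls


urls = page_urls("https://www.amazon.in/s?k=product", 3)
for page_url in urls:
    print(page_url)
```

Each returned URL can then be passed to requests.get in turn, ideally with a short delay between requests.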
- scraper.py: Main script for scraping Amazon product data.
- product_data.csv: Output file containing the scraped product data.
The scraper.py script performs the following steps:
- Fetch HTML Content: Uses the requests library to get the HTML content of the Amazon search results page.
- Parse HTML Content: Utilizes BeautifulSoup to parse the HTML and extract product details.
- Extract Data: Gathers information such as product name, URL, price, rating, number of reviews, description, ASIN, and manufacturer.
- Store Data: Stores the extracted data in a list of dictionaries.
- Export to CSV: Writes the collected data to a CSV file for easy analysis and processing.
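The parse, extract, and store steps above can be sketched as follows. This is an illustration under stated assumptions rather than the actual contents of scraper.py: the helper names `text_or_none` and `parse_products`, the dictionary keys, and the choice of `div` elements with a `data-asin` attribute as result containers are all hypothetical, and the sample HTML is invented so the example runs offline. The point it shows is guarding against missing elements, since calling `.text` on a failed `find` raises an AttributeError.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, name, class_):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(name, class_=class_)
    return tag.get_text(strip=True) if tag else None


def parse_products(html):
    """Parse search-result HTML into a list of dictionaries."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    # Assumption: each result is a div that carries a data-asin attribute.
    for item in soup.find_all("div", attrs={"data-asin": True}):
        link = item.find("a", class_="a-link-normal")
        has_href = link is not None and link.has_attr("href")
        products.append({
            "name": text_or_none(item, "span", "a-size-medium"),
            "url": "https://www.amazon.in" + link["href"] if has_href else None,
            "price": text_or_none(item, "span", "a-price-whole"),
            "asin": item["data-asin"] or None,
        })
    return products


# Invented sample markup; real result pages are far larger and change often.
SAMPLE = """
<div data-asin="B000TEST01">
  <a class="a-link-normal" href="/dp/B000TEST01">
    <span class="a-size-medium a-color-base a-text-normal">Sample Item</span>
  </a>
  <span class="a-price-whole">1,299</span>
</div>
<div data-asin="B000TEST02">
  <span class="a-size-medium a-color-base a-text-normal">No Price Item</span>
</div>
"""

products = parse_products(SAMPLE)
print(products)
```

Fields that cannot be found are stored as None instead of aborting the whole run, which keeps one malformed listing from discarding the rest of the page.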
- Fetching HTML Content:
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
- Extracting Data:
product_name = item.find('span', class_='a-size-medium a-color-base a-text-normal').text
product_url = 'https://www.amazon.in' + item.find('a', class_='a-link-normal')['href']
price = item.find('span', class_='a-price-whole').text
- Writing to CSV:
with open('product_data.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(products)
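The CSV snippet above relies on `fieldnames` and `products` being defined elsewhere. A self-contained sketch of the export step might look like the following. The column names in `FIELDNAMES` and the function name `export_products` are assumptions chosen to mirror the fields listed under Features; they are not taken from scraper.py. The explicit UTF-8 encoding and the `restval` and `extrasaction` arguments are additions for robustness, not something the README states.

```python
import csv

# Assumed column names, mirroring the fields listed in the Features section.
FIELDNAMES = [
    "name", "url", "price", "rating",
    "reviews", "description", "asin", "manufacturer",
]


def export_products(products, path="product_data.csv"):
    """Write a list of product dictionaries to a CSV file at path."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(
            file,
            fieldnames=FIELDNAMES,
            restval="",             # missing keys become empty cells
            extrasaction="ignore",  # unexpected keys are skipped, not errors
        )
        writer.writeheader()
        writer.writerows(products)
```

Calling `export_products(products)` with the list of dictionaries collected during scraping writes product_data.csv in the current directory. Values that contain commas, such as a price of 9,999, are quoted automatically by the csv module.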
Contributions are welcome! Please follow these steps to contribute:
- Fork the repository.
- Create a new branch:
git checkout -b feature-branch
- Make your changes and commit them:
git commit -m "Add new feature"
- Push to the branch:
git push origin feature-branch
- Create a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.