anaeim/ATTARII

Amazon Textual and TAbulaR Information extractIon (ATTARII)

Intro

Amazon product pages contain rich information spread across text and tables. Amazon Textual and TAbulaR Information extractIon (ATTARII) scrapes Amazon product web pages and extracts the sections of interest. These sections fall into two categories:

  • Textual information
    • Product titles
    • Bullet points
    • Product descriptions
  • Tabular information
    • Product detail tables
    • Product overview tables

These sections are marked in the figure below:

[Figure: an Amazon product web page with the textual and tabular sections marked]

Given the URL of an Amazon product web page, ATTARII retrieves the page content with the Selenium WebDriver. It then parses the HTML with the Beautiful Soup library and extracts the desired sections using HTML tags and ids. An excellent tutorial is available for the Beautiful Soup library.
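The parsing step can be illustrated with a minimal sketch. It is not ATTARII's actual code: the function name `extract_sections`, the sample markup, and the element ids (`productTitle`, `feature-bullets`, `productDetails`) are assumptions chosen for illustration. In a real run the HTML string would come from Selenium (for example `driver.page_source`) rather than from a literal.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded product page.
# The ids below are assumptions for illustration, not ATTARII's selectors.
SAMPLE_HTML = """
<html><body>
  <span id="productTitle">  Example Watch  </span>
  <div id="feature-bullets">
    <ul><li>Water resistant</li><li>GPS</li></ul>
  </div>
  <table id="productDetails">
    <tr><th>Brand</th><td>Example</td></tr>
    <tr><th>Color</th><td>Blue</td></tr>
  </table>
</body></html>
"""


def extract_sections(html):
    """Return the title, bullet points, and detail table found in html."""
    soup = BeautifulSoup(html, "html.parser")

    # Textual information: look up the title element by tag and id.
    title_tag = soup.find("span", id="productTitle")
    title = title_tag.get_text(strip=True) if title_tag else None

    # Textual information: collect every list item inside the bullet box.
    bullets_box = soup.find("div", id="feature-bullets")
    bullets = (
        [li.get_text(strip=True) for li in bullets_box.find_all("li")]
        if bullets_box
        else []
    )

    # Tabular information: turn each header/value row into a dict entry.
    details = {}
    table = soup.find("table", id="productDetails")
    if table:
        for row in table.find_all("tr"):
            key, value = row.find("th"), row.find("td")
            if key and value:
                details[key.get_text(strip=True)] = value.get_text(strip=True)

    return {"title": title, "bullets": bullets, "details": details}


sections = extract_sections(SAMPLE_HTML)
```

Each lookup returns `None` or an empty container when the element is missing, so a page that lacks one section does not break extraction of the others.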

Different suppliers and developers may use different HTML tags and ids for the product data. In my tests on the Amazon-PQA dataset, the tool extracted the desired sections for the majority of Amazon products.
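One way to cope with varying tags and ids is to try several candidates in order and keep the first match. The sketch below shows that idea only; whether ATTARII works this way is not stated here, and the helper `find_first_by_id` and the ids in `CANDIDATE_TABLE_IDS` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical candidate ids; real product pages may use different ones.
CANDIDATE_TABLE_IDS = ["productOverview", "productDetails", "techSpecs"]


def find_first_by_id(soup, tag, candidate_ids):
    """Return the first element matching any candidate id, or None."""
    for element_id in candidate_ids:
        element = soup.find(tag, id=element_id)
        if element is not None:
            return element
    return None


# Placeholder markup that only contains the last candidate id.
html = '<div><table id="techSpecs"><tr><td>Weight</td><td>30 g</td></tr></table></div>'
soup = BeautifulSoup(html, "html.parser")

table = find_first_by_id(soup, "table", CANDIDATE_TABLE_IDS)
cells = [td.get_text(strip=True) for td in table.find_all("td")]

# No candidate matches, so the helper reports the section as missing.
missing = find_first_by_id(soup, "table", ["notPresent"])
```

Ordering the candidates from most to least common keeps the typical lookup cheap while still covering less common page layouts.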

Installation

To get started, you'll need Python and pip installed.

  1. Clone the Git repository
     git clone https://github.com/anaeim/ATTARII.git
  2. Navigate to the project directory
     cd ATTARII
  3. Create a directory for the extracted textual and tabular information
     mkdir extracted_info
  4. Install the requirements
     pip install -r requirements.txt

Usage

python ATTARII.py --URL https://www.amazon.com/dp/B08KHR6B3W/ \
    --info-type tabular \
    --verbosity-enabled \
    --dump-info-enabled \
    --dump-info-path extracted_info

The meaning of the flags:

  • --URL: the URL of the Amazon product web page.
  • --info-type: the type of information to extract, either tabular or textual.
  • --verbosity-enabled: displays the extracted information.
  • --dump-info-enabled: saves the extracted information as a JSON file.
  • --dump-info-path: the directory in which to save the extracted information.
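The flags above could be wired up with `argparse` roughly as follows. This is a sketch under assumptions, not the contents of ATTARII.py: the flag names come from the usage example, while the defaults, the helper `dump_info`, the file name `product.json`, and the placeholder `info` dictionary are hypothetical.

```python
import argparse
import json
import os
import tempfile


def build_parser():
    """Build a parser that accepts the flags listed in the usage section."""
    parser = argparse.ArgumentParser(description="Sketch of an ATTARII-style CLI")
    parser.add_argument("--URL", required=True,
                        help="URL of the Amazon product web page")
    parser.add_argument("--info-type", choices=["tabular", "textual"],
                        default="tabular",  # assumed default
                        help="type of information to extract")
    parser.add_argument("--verbosity-enabled", action="store_true",
                        help="display the extracted information")
    parser.add_argument("--dump-info-enabled", action="store_true",
                        help="save the extracted information as JSON")
    parser.add_argument("--dump-info-path", default="extracted_info",
                        help="directory for the saved information")
    return parser


def dump_info(info, directory, filename="product.json"):
    """Write info as JSON into directory; the file name is a placeholder."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(info, handle, indent=2, ensure_ascii=False)
    return path


# Parse an explicit argument list so the sketch runs without a shell.
args = build_parser().parse_args([
    "--URL", "https://www.amazon.com/dp/B08KHR6B3W/",
    "--info-type", "tabular",
    "--dump-info-enabled",
    "--dump-info-path", os.path.join(tempfile.gettempdir(), "extracted_info"),
])

info = {"Brand": "Example"}  # placeholder data, not real scraper output
if args.verbosity_enabled:
    print(info)
if args.dump_info_enabled:
    saved_path = dump_info(info, args.dump_info_path)
```

Using `action="store_true"` means the two `-enabled` flags take no value and default to off, which matches how they appear in the usage command.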

Example

Here is an example of extracted tabular info for the Apple Watch Series 6 on Amazon:

[Figure: extracted tabular information for the Apple Watch Series 6]
