Amazon Textual and TAbulaR Information extractIon (ATTARII)

Intro

Amazon product pages carry rich information spread across text and tables. Amazon Textual and TAbulaR Information extractIon (ATTARII) scrapes an Amazon product web page and extracts the sections of interest. These sections fall into two categories:

Textual information
- Product titles
- Bullet points
- Product descriptions
Tabular information
- Product detail tables
- Product overview tables

These sections are marked in the figure below:

Given the URL of an Amazon product web page, ATTARII retrieves the page content with the WebDriver from the Selenium library. It then parses the HTML content with the Beautiful Soup library and extracts the desired sections by their HTML tags and ids. An excellent tutorial is available for the Beautiful Soup library.
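The retrieve-then-parse flow described above can be sketched roughly as follows. This is a minimal illustration, not ATTARII's actual code: the function names `fetch_html` and `extract_title_and_bullets` are hypothetical, and the element ids `productTitle` and `feature-bullets` are assumptions about the page markup that may not match every product page. The sample HTML is made up so the parsing step can run without a browser.

```python
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Load a page in headless Chrome and return its rendered HTML."""
    # Imported lazily so the parsing helper below also works without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser


def extract_title_and_bullets(html: str) -> dict:
    """Pull the product title and bullet points out of raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find(id="productTitle")        # assumed id
    bullets_box = soup.find(id="feature-bullets")   # assumed id
    bullets = []
    if bullets_box is not None:
        bullets = [li.get_text(strip=True) for li in bullets_box.find_all("li")]
    return {
        "title": title_tag.get_text(strip=True) if title_tag is not None else None,
        "bullets": bullets,
    }


# Made-up markup standing in for a downloaded page.
SAMPLE_HTML = """
<span id="productTitle">  Example Watch  </span>
<div id="feature-bullets"><ul><li>GPS</li><li>Water resistant</li></ul></div>
"""
result = extract_title_and_bullets(SAMPLE_HTML)
print(result)
```

For a live page, `extract_title_and_bullets(fetch_html(url))` would chain the two steps, assuming Chrome and a matching driver are available on the machine.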

Different suppliers and developers may use different HTML tags and ids to include the product data. In my tests on the Amazon-PQA dataset, the tool extracted the desired sections for the majority of Amazon products.
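One common way to cope with varying tags and ids is to try a list of candidate ids in order and use the first one that yields data. The sketch below illustrates that idea for a detail table; it is an assumption about how such a fallback could look, not a description of ATTARII's internals. The ids in `CANDIDATE_TABLE_IDS` and the function name `extract_table` are hypothetical, and the sample markup is invented.

```python
from bs4 import BeautifulSoup

# Hypothetical candidate ids; real pages may use others.
CANDIDATE_TABLE_IDS = [
    "productDetails_techSpec_section_1",
    "productDetails_detailBullets_sections1",
]


def extract_table(html: str, candidate_ids=CANDIDATE_TABLE_IDS) -> dict:
    """Return the first non-empty table found under any candidate id."""
    soup = BeautifulSoup(html, "html.parser")
    for table_id in candidate_ids:
        table = soup.find("table", id=table_id)
        if table is None:
            continue  # this page does not use that id, try the next one
        rows = {}
        for tr in table.find_all("tr"):
            key, value = tr.find("th"), tr.find("td")
            if key is not None and value is not None:
                rows[key.get_text(strip=True)] = value.get_text(strip=True)
        if rows:
            return rows
    return {}  # no candidate matched


# Invented markup that only matches the second candidate id.
SAMPLE_TABLE_HTML = """
<table id="productDetails_detailBullets_sections1">
  <tr><th> Brand </th><td> ExampleBrand </td></tr>
  <tr><th> Color </th><td> Blue </td></tr>
</table>
"""
table = extract_table(SAMPLE_TABLE_HTML)
print(table)
```

Returning an empty dict when nothing matches lets the caller distinguish "section absent on this page" from a parsing error.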

Installation

To get started, you'll need Python and pip installed.

Clone the Git repository

git clone https://github.com/anaeim/ATTARII.git

Navigate to the project directory

cd ATTARII

Create a directory for the extracted textual and tabular information

mkdir extracted_info

Install the requirements

pip install -r requirements.txt

Usage

python ATTARII.py --URL https://www.amazon.com/dp/B08KHR6B3W/ \
    --info-type tabular \
    --verbosity-enabled \
    --dump-info-enabled \
    --dump-info-path extracted_info

The meaning of the flags:

--URL: the URL of the Amazon product web page.
--info-type: the type of information ATTARII extracts. You can choose between tabular and textual data.
--verbosity-enabled: display the extracted information.
--dump-info-enabled: store the extracted information as a JSON file.
--dump-info-path: the directory in which to store the extracted information.

Example

Here is an example of extracted tabular info for the Apple Watch Series 6 on Amazon:
