This is a command-line website parser written in Python. It uses Playwright to search for a product starting from the site's main page, then parses the resulting HTML with BeautifulSoup and collects the product data (name, price, specifications, color, memory, etc.). It supports error handling and saves the results to JSON/CSV.
- Product data collection
- Full product name
- Color
- Storage capacity
- Manufacturer
- Regular price
- Promotional price (if any)
- All product photos: the photos and their links are collected and saved in a list.
- Product code
- Number of reviews
- Screen size
- Display resolution
- Product specifications. All specifications are read from the specifications tab and collected as a dictionary.
- Backend:
- Python programming language;
- Django framework;
- PostgreSQL database (Django ORM).
- Playwright + BeautifulSoup4 + lxml
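The product fields listed above, together with the JSON/CSV output, could be modeled roughly as follows. This is a minimal sketch using only the standard library; the `Product` class, its field names, and the `save_json` / `save_csv` helpers are hypothetical and are not taken from the project's code.

```python
import csv
import json
from dataclasses import asdict, dataclass, field, fields
from pathlib import Path
from typing import Optional


@dataclass
class Product:
    """Hypothetical record mirroring the collected product fields."""
    full_name: str
    color: str
    storage_capacity: str
    manufacturer: str
    regular_price: float
    promo_price: Optional[float] = None  # None when there is no promotion
    photos: list = field(default_factory=list)  # list of photo links
    product_code: str = ""
    review_count: int = 0
    screen_size: str = ""
    display_resolution: str = ""
    specifications: dict = field(default_factory=dict)


def save_json(products, path):
    """Write all products to one JSON file as a list of objects."""
    data = [asdict(p) for p in products]
    Path(path).write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def save_csv(products, path):
    """Write all products to a CSV file, one row per product."""
    fieldnames = [f.name for f in fields(Product)]
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        for product in products:
            row = asdict(product)
            # CSV cells are flat, so nested values are stored as JSON strings.
            row["photos"] = json.dumps(row["photos"], ensure_ascii=False)
            row["specifications"] = json.dumps(
                row["specifications"], ensure_ascii=False
            )
            writer.writerow(row)
```

Storing the photo list and the specification dictionary as JSON strings inside the CSV is one possible choice; it keeps a single row per product at the cost of an extra decoding step when reading the file back.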
- SECRET_KEY=
- ALLOWED_HOSTS=
- DEBUG=
- MEDIA_ROOT=
- STATIC_ROOT=
- POSTGRES_DB=
- POSTGRES_USER=
- POSTGRES_PASSWORD=
- POSTGRES_HOST=
- POSTGRES_PORT=
See the .env.example file for a template of these variables.
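As an illustration of how these variables might be consumed, the sketch below reads them from the process environment into Django-style settings values. It is an assumption-laden example, not the project's actual settings module: the `load_settings` and `env_bool` helpers, the comma-separated format of `ALLOWED_HOSTS`, and the fallback defaults are all hypothetical.

```python
import os


def env_bool(value):
    """Interpret common truthy strings such as 'True' or '1'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(environ=os.environ):
    """Build settings values from the environment variables listed above."""
    hosts = environ.get("ALLOWED_HOSTS", "")
    return {
        # Required: raises KeyError if SECRET_KEY is not set.
        "SECRET_KEY": environ["SECRET_KEY"],
        "DEBUG": env_bool(environ.get("DEBUG", "False")),
        # Assumes a comma-separated list, e.g. "localhost,127.0.0.1".
        "ALLOWED_HOSTS": [h.strip() for h in hosts.split(",") if h.strip()],
        "MEDIA_ROOT": environ.get("MEDIA_ROOT", ""),
        "STATIC_ROOT": environ.get("STATIC_ROOT", ""),
        "DATABASES": {
            "default": {
                "ENGINE": "django.db.backends.postgresql",
                "NAME": environ.get("POSTGRES_DB", ""),
                "USER": environ.get("POSTGRES_USER", ""),
                "PASSWORD": environ.get("POSTGRES_PASSWORD", ""),
                # Defaults below are assumptions for a local setup.
                "HOST": environ.get("POSTGRES_HOST", "localhost"),
                "PORT": environ.get("POSTGRES_PORT", "5432"),
            }
        },
    }
```

Passing the environment in as a parameter makes the function easy to check with a plain dictionary, for example `load_settings({"SECRET_KEY": "dev", "DEBUG": "True"})`.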
- url = " website url "
- search_input.type(text="write product full name", delay=0.3). Important: the product name must be the name of a phone.
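The two settings above fit into a search flow that might look like the sketch below, written against Playwright's synchronous API. The `search_product` and `run` functions and the `input[type='search']` selector are hypothetical placeholders; the real site will need its own selector, and the project's parser may be organized differently.

```python
def search_product(page, url, product_name,
                   search_selector="input[type='search']"):
    """Open `url`, type `product_name` into the search box, return the HTML.

    `page` is expected to behave like a Playwright sync `Page`.
    `search_selector` is a placeholder, not the real site's selector.
    """
    page.goto(url)
    search_input = page.locator(search_selector)
    # Playwright interprets `delay` as milliseconds between key presses.
    search_input.type(product_name, delay=0.3)
    search_input.press("Enter")
    page.wait_for_load_state("domcontentloaded")
    # The returned HTML can then be handed to BeautifulSoup for parsing.
    return page.content()


def run(url, product_name):
    """Launch a headless Chromium browser and perform the search."""
    # Imported lazily so search_product can be exercised without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        try:
            return search_product(page, url, product_name)
        finally:
            browser.close()
```

Because `delay` is measured in milliseconds, a value of 0.3 types almost instantly; a larger value such as 100 would mimic human typing more closely. Newer Playwright releases also provide `press_sequentially` as the preferred replacement for `Locator.type`.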
To get started with the project, follow these steps:
Note: Don't forget to set the environment variables first.
- Clone the repository:
  git clone https://github.com/dalv-oio/playwright_parser.git
- Go to the project directory:
  cd playwright_parser
- Install the required dependencies:
  pip install -r requirements.txt
- Set up the database connection and configuration for the selected database engine, then apply the migrations:
  python manage.py makemigrations
  python manage.py migrate
- Run the parser:
  python manage.py run_parser