Web Crawler for products on Oda.com

dangrasso/oda-crawler

Oda Product Crawler

This tool crawls the product catalogue of the online grocery store Oda. The codebase defines abstractions that can be used to build custom adapters for different crawling use cases.

This project is my solution to a take-home assignment from when I applied for a position at the company in October 2021. For a detailed presentation of the solution (choices, approach, tradeoffs, improvements), check out my notes on the solution.

Prerequisites

  • python >3.9
  • pipenv

With these in place, install the dependencies with:

pipenv install

This project has two dependencies:

  • requests as the HTTP client
  • beautifulsoup4 as the HTML parser

How to Test

python -m unittest discover tests

How to Run

python main.py

The maximum number of page visits is hardcoded, but you can stop the crawl at any moment with CTRL+C.

Before exiting, the program saves its current state to three files:

  • oda_frontier_<DATETIME>.json, for debugging
  • oda_visited_<DATETIME>.json, for debugging
  • oda_products_<DATETIME>.csv, the main application output
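
The stop-and-save behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the `crawl` and `save_state` functions, the `fetch` callback, the breadth-first order, the timestamp format and the product fields are all assumptions. Only the three file name patterns come from this README.

```python
import csv
import json
from collections import deque
from datetime import datetime
from pathlib import Path


def crawl(start_url, fetch, max_visits=100):
    """Visit pages until the frontier is empty, the cap is hit, or CTRL+C.

    `fetch(url)` is a hypothetical callback returning (links, product),
    where product is a dict or None.
    """
    frontier = deque([start_url])
    seen = {start_url}
    visited, products = [], []
    try:
        while frontier and len(visited) < max_visits:
            url = frontier[0]  # peek, so an interrupted fetch keeps the url
            links, product = fetch(url)
            frontier.popleft()
            visited.append(url)
            if product is not None:
                products.append(product)
            for link in links:
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
    except KeyboardInterrupt:
        pass  # CTRL+C: stop crawling and return what was collected so far
    return list(frontier), visited, products


def save_state(frontier, visited, products, out_dir="."):
    """Write the three state files; the timestamp format is an assumption."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {
        "frontier": out / f"oda_frontier_{stamp}.json",
        "visited": out / f"oda_visited_{stamp}.json",
        "products": out / f"oda_products_{stamp}.csv",
    }
    paths["frontier"].write_text(json.dumps(frontier, indent=2), encoding="utf-8")
    paths["visited"].write_text(json.dumps(visited, indent=2), encoding="utf-8")
    # Column names are taken from whatever keys the product dicts carry.
    fieldnames = sorted({key for product in products for key in product})
    with paths["products"].open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(products)
    return paths
```

Peeking at the head of the frontier before fetching means a page whose download is interrupted stays in the saved frontier, so a later run could pick it up again.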

Results

This is the outcome of a full run. Check out the /output folder.

Burndown chart

This chart gives a glimpse of how the crawler discovered and visited all URLs.

frontier_burndown

This shows:

  • an initial discovery phase
  • a peak at around 1500 visits (with the frontier topping out at ~2800 URLs)
  • a smooth, almost linear slope as the crawler works through the products in the frontier
  • a final "bumpy" stretch near the bottom of the frontier, as the remaining products are discovered
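
To illustrate why the frontier first grows and then burns down, here is a toy sketch that records the frontier size after each visit. It assumes a breadth-first traversal and an invented in-memory site (one home page, three category pages, four products each). The real crawler's traversal order and the real site structure are not described here, so this reproduces only the general rise-then-decline shape, not the actual chart.

```python
from collections import deque


def frontier_sizes(start, links_of):
    """Return the frontier size after each visit of a breadth-first crawl.

    `links_of(url)` is a hypothetical callback returning the urls found on a page.
    """
    frontier = deque([start])
    seen = {start}
    sizes = []
    while frontier:
        url = frontier.popleft()
        for link in links_of(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        sizes.append(len(frontier))
    return sizes


# Invented toy site: listing pages add urls, product pages add none.
site = {"home": [f"cat{i}" for i in range(3)]}
for i in range(3):
    site[f"cat{i}"] = [f"prod{i}_{j}" for j in range(4)]

sizes = frontier_sizes("home", lambda url: site.get(url, []))
print(sizes)  # grows while listing pages are visited, then shrinks by one per product
```

Plotting `sizes` against the visit index gives a small-scale version of a burndown chart: discovery pushes the curve up, and each product page with no new links pulls it down by one.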

Visualizations

Using the crawler output, I created some visualizations with https://rawgraphs.io/

All products grouped by categories (and subcategories): mosaic_by_category_nested

All products grouped by category, sized by price: mosaic_price_by_top_category

Guess which category is the red one, where half the products are cheap and the other half are expensive.
