
Crawl all Tiki products and categories, then save them to SQLite database for analysis.


duongtruongtrong/CoderSchool_project_week_2_duong_huy


OBJECTIVE: Build a web crawler that crawls every page of each category on Tiki and stores the data in a SQLite database.

1. Files:

  • duong_crawl_category.py: Get all categories.
  • duong_crawl_items.py: Get all items in all crawled categories.
  • tiki_db.rar: Contains tiki.db, compressed because GitHub does not allow pushing files larger than 100 MB to a repository.
  • tiki_cat.xlsx: Categories table as an Excel file.
  • tiki_products.xlsx: Sample records from tiki_products table.
  • duong_analysis.ipynb: Some charts.

2. Results:

  • 3,191 categories.
  • 708,133 products: All products that can be crawled from Tiki website.

3. Steps to achieve the target:

  • Step 1: Create a SQLite database

    • Create table "categories" in the database

    Columns:

    • id
    • name
    • url
    • level: Level of the category in the category tree. Level 1 = main categories.
    • total_sub_category: Number of sub categories the category contains. A value of 0 means the category has no sub categories.
    • parent_id
    • created_at
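A minimal sketch of how such a table could be created with Python's built-in `sqlite3` module. The column names come from the list above; the column types, the defaults, and the helper name `create_categories_table` are assumptions, not taken from the project's scripts.

```python
import sqlite3

def create_categories_table(conn):
    """Create the categories table if it does not exist yet.

    Column types and defaults are assumptions for illustration.
    """
    conn.execute("""
        CREATE TABLE IF NOT EXISTS categories (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            name TEXT,
            url TEXT,
            level INTEGER,
            total_sub_category INTEGER DEFAULT 0,
            parent_id INTEGER,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()

# An in-memory database keeps the example self-contained;
# the project itself stores its data in tiki.db.
conn = sqlite3.connect(":memory:")
create_categories_table(conn)

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(categories)")]
print(columns)
```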


    • Use a class (OOP) to hold each category temporarily, with a save_to_db method that uses "INSERT" and an update_total_sub_category method that uses "UPDATE":
      • save_to_db: Save each category to table categories as one row.
      • update_total_sub_category: Update total_sub_category of each parent category in table categories.
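The two methods above could look roughly like the sketch below. The method names `save_to_db` and `update_total_sub_category` are from this README; the class name `Category`, the constructor arguments, and the example URLs are assumptions made for illustration.

```python
import sqlite3

class Category:
    """Holds one category in memory before it is written to SQLite."""

    def __init__(self, name, url, level=1, parent_id=None):
        self.cat_id = None  # filled in once the row has been inserted
        self.name = name
        self.url = url
        self.level = level
        self.parent_id = parent_id

    def save_to_db(self, conn):
        # Parameterized INSERT: one category becomes one row.
        cur = conn.execute(
            "INSERT INTO categories (name, url, level, parent_id) VALUES (?, ?, ?, ?)",
            (self.name, self.url, self.level, self.parent_id),
        )
        conn.commit()
        self.cat_id = cur.lastrowid
        return self.cat_id

    def update_total_sub_category(self, conn, total):
        # UPDATE the parent's row once its sub categories are known.
        conn.execute(
            "UPDATE categories SET total_sub_category = ? WHERE id = ?",
            (total, self.cat_id),
        )
        conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE categories (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT, url TEXT, level INTEGER,
        total_sub_category INTEGER DEFAULT 0,
        parent_id INTEGER,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

# Hypothetical categories and URLs, not real crawl output.
parent = Category("Books", "https://example.com/books")
parent.save_to_db(conn)
children = [
    Category("Fiction", "https://example.com/books/fiction", level=2, parent_id=parent.cat_id),
    Category("Comics", "https://example.com/books/comics", level=2, parent_id=parent.cat_id),
]
for child in children:
    child.save_to_db(conn)
parent.update_total_sub_category(conn, len(children))

parent_total = conn.execute(
    "SELECT total_sub_category FROM categories WHERE id = ?", (parent.cat_id,)
).fetchone()[0]
print(parent_total)
```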


  • Step 2: Crawl the links of the sub categories and return a list of the last sub-categories (those with no further sub categories).

    • Get main categories: The categories on tiki.vn homepage


    • Get sub categories: Assign the level of each sub category (= parent_level + 1) and update total_sub_category of its direct parent category.


    • Get all categories: Only the last sub-categories (categories with total_sub_category = 0) are used for crawling items.
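The traversal in Step 2 can be sketched without any network access. In the sketch below, `get_sub_categories` is a hypothetical callable standing in for the HTML scraping, and `FAKE_SITE` is made-up data; only the level rule (parent_level + 1) and the leaf rule (total_sub_category = 0) come from this README.

```python
def collect_categories(main_categories, get_sub_categories):
    """Walk the category tree and return (all_categories, last_sub_categories).

    main_categories: list of (name, url) pairs for level 1.
    get_sub_categories: callable(url) -> list of (name, url) pairs.
    """
    all_categories = []
    stack = [
        {"name": name, "url": url, "level": 1, "parent_url": None}
        for name, url in main_categories
    ]
    while stack:
        cat = stack.pop()
        children = get_sub_categories(cat["url"])
        # The parent's total is known as soon as its children are listed.
        cat["total_sub_category"] = len(children)
        all_categories.append(cat)
        for name, url in children:
            stack.append({
                "name": name,
                "url": url,
                "level": cat["level"] + 1,  # level = parent_level + 1
                "parent_url": cat["url"],
            })
    # Only categories without sub categories are used for crawling items.
    leaves = [c for c in all_categories if c["total_sub_category"] == 0]
    return all_categories, leaves

# Made-up category tree replacing real HTTP requests.
FAKE_SITE = {
    "/books": [("Fiction", "/books/fiction"), ("Comics", "/books/comics")],
    "/books/fiction": [("Sci-fi", "/books/fiction/scifi")],
}

def fake_get_sub_categories(url):
    return FAKE_SITE.get(url, [])

all_categories, leaves = collect_categories([("Books", "/books")], fake_get_sub_categories)
leaf_urls = sorted(c["url"] for c in leaves)
print(leaf_urls)
```

An explicit stack is used instead of recursion so that a deep category tree cannot hit Python's recursion limit.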


  • Step 3: Crawl all items in each last sub-category (category with total_sub_category = 0) and save each item to table tiki_products in the SQLite database.

    Similar to Week 1 project with some updates:

    • Use class "Product" (OOP) to temporarily hold product values parsed from the HTML, instead of a dictionary.
    • Save items to the SQLite database, instead of a dataframe and an Excel file.
    • Extra feature: Resume crawling where the script left off after an interruption, using the SQLite database (categories and tiki_products tables) as the backup source instead of an Excel file:
      • Continue to crawl the next item in a page;
      • Or continue the next page;
      • Or continue the next category.
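One possible way to implement this resume behaviour is to read the last saved row and skip everything up to it. The sketch below is an assumption about how that could work, not the project's actual code: the columns `category_id`, `page`, and `position`, the functions `get_resume_point` and `crawl`, and the stubbed `fetch_page` callable are all hypothetical.

```python
import sqlite3

def get_resume_point(conn):
    """Return (category_id, page, position) of the last saved product, or None."""
    return conn.execute(
        "SELECT category_id, page, position FROM tiki_products ORDER BY id DESC LIMIT 1"
    ).fetchone()

def crawl(conn, leaf_ids, fetch_page):
    """Crawl every leaf category, skipping whatever is already in the database."""
    last = get_resume_point(conn)
    start_cat, start_page, start_pos = last if last else (leaf_ids[0], 1, 0)
    # Continue with the interrupted category, then move on to the next ones.
    for cat_id in leaf_ids[leaf_ids.index(start_cat):]:
        page = start_page if cat_id == start_cat else 1
        while True:
            items = fetch_page(cat_id, page)
            if not items:
                break  # an empty page means the category is finished
            # Only the interrupted page has items that must be skipped.
            skip = start_pos if (cat_id == start_cat and page == start_page) else 0
            for position, name in enumerate(items, start=1):
                if position <= skip:
                    continue
                conn.execute(
                    "INSERT INTO tiki_products (name, category_id, page, position) "
                    "VALUES (?, ?, ?, ?)",
                    (name, cat_id, page, position),
                )
                conn.commit()  # commit per item so an interruption loses nothing
            page += 1

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tiki_products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT, category_id INTEGER, page INTEGER, position INTEGER
    )
""")

# Made-up pages keyed by (category_id, page) instead of real requests.
FAKE_PAGES = {(1, 1): ["a", "b", "c"], (1, 2): ["d"], (2, 1): ["e", "f"]}

def fake_fetch_page(cat_id, page):
    return FAKE_PAGES.get((cat_id, page), [])

# Simulate an earlier run that stopped after the second item of page 1.
conn.execute("INSERT INTO tiki_products (name, category_id, page, position) VALUES ('a', 1, 1, 1)")
conn.execute("INSERT INTO tiki_products (name, category_id, page, position) VALUES ('b', 1, 1, 2)")
conn.commit()

crawl(conn, [1, 2], fake_fetch_page)
names = [row[0] for row in conn.execute("SELECT name FROM tiki_products ORDER BY id")]
print(names)
```

Committing after every item is slower than batching, but it keeps the database an accurate record of progress at the moment of interruption.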

4. Data Analysis:

A. Categories:

It took more than 4 hours to finish crawling all category URLs.

Total sub categories in each main category

"Homes for Life" (Nhà Cửa Đời Sống) has the largest number of sub categories: 600.

Total products in each main category

The large number of sub categories in "Homes for Life" (Nhà Cửa Đời Sống) is due to its large number of products.

Interesting findings

Top 10 sub categories with highest number of products

The number of products in each category usually tops out at 1,000. The Tiki database contains more than that, but the website does not show them.

However, 3 categories break the rule: "Bộ chăn ga, ra, drap" (Blankets, sheets), "Tranh canvas" (Canvas pictures), and "Đồ chơi" (Toys) have over 4,000 products and up to 200 pages.

It is not clear how the script could crawl such data, because pages above 21 cannot be accessed through the website UI.

The products in those 3 categories are duplicates: Tiki places the same product in different places, each with a different product link.
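A duplicate check of this kind can be expressed as a single SQL query that compares row counts with distinct product counts per category. The sketch below uses made-up rows, and the `product_id` column is an assumption about the tiki_products schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tiki_products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        product_id INTEGER,   -- assumed: Tiki's own product identifier
        category_id INTEGER,
        url TEXT
    )
""")

# Made-up rows: product 101 appears three times under different links.
rows = [
    (101, 1, "/a?src=1"),
    (101, 1, "/a?src=2"),
    (101, 1, "/a?src=3"),
    (102, 1, "/b"),
    (201, 2, "/c"),
]
conn.executemany(
    "INSERT INTO tiki_products (product_id, category_id, url) VALUES (?, ?, ?)", rows
)
conn.commit()

def duplicate_summary(conn):
    """Return (category_id, total_rows, unique_products) per category."""
    return conn.execute("""
        SELECT category_id,
               COUNT(*) AS total_rows,
               COUNT(DISTINCT product_id) AS unique_products
        FROM tiki_products
        GROUP BY category_id
        ORDER BY category_id
    """).fetchall()

summary = duplicate_summary(conn)
print(summary)
```

A category where total_rows is much larger than unique_products contains the same products listed under several links.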


The 700k products on the website may be only 10% of the products in the Tiki database.

B. Products:

It took more than 19 hours to finish crawling all products in all categories.

Average current price of each main category

Unit: 10,000,000 VND (10 million VND)

Price after discount.

Electronics (Điện tử - điện lạnh) has the highest average price, over 10 million VND.

Highest price: Harley Davidson FXDRS 2019 (Xe Mô Tô Harley Davidson FXDRS - 2019) - 799,500,000 VND

A sample 0 VND price: Box of 10 Volluto Coffee Capsules - Nespresso (Hộp 10 Viên Nén Capsule Cà Phê Volluto - Nespresso), priced at 0 because it was out of stock.

Lowest price: Zinc Branches (Kẽm Cành) - 1,920 VND

Interesting findings

Product position on a page vs. discount rate (only considering products with discount rate > 0%)

Positions 7, 17, 27, 37, and 47 usually do not have discounts in the 40-60% range. Popular discount percentages: 10, 20, 30, 40, 50.


Product page number vs. discount rate (only considering products with discount rate > 0%)

Discounted products are mostly placed on pages 1 to 6. After page 6, the number of discounted products starts to decrease.
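A count of discounted products per page, like the one behind this finding, could be computed directly in SQLite. This is a minimal sketch on made-up rows; the columns `page` and `discount_percent` are assumptions about the tiki_products schema, and the analysis in duong_analysis.ipynb may be done differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tiki_products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        page INTEGER,              -- assumed: page the product was found on
        discount_percent INTEGER   -- assumed: 0 means no discount
    )
""")

# Made-up rows as (page, discount_percent).
conn.executemany(
    "INSERT INTO tiki_products (page, discount_percent) VALUES (?, ?)",
    [(1, 20), (1, 0), (1, 50), (2, 10), (7, 0)],
)
conn.commit()

def discounted_per_page(conn):
    """Return (page, number_of_discounted_products), ignoring 0% discounts."""
    return conn.execute("""
        SELECT page, COUNT(*)
        FROM tiki_products
        WHERE discount_percent > 0
        GROUP BY page
        ORDER BY page
    """).fetchall()

per_page = discounted_per_page(conn)
print(per_page)
```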
