OBJECTIVE: Build a web crawler that crawls every page of each category on Tiki and stores the data in a SQLite database.
- duong_crawl_category.py: Get all categories.
- duong_crawl_items.py: Get all items in all crawled categories.
- tiki_db.rar: Contains tiki.db, compressed because GitHub does not allow pushing files larger than 100 MB to a repository.
- tiki_cat.xlsx: Categories table as an Excel file.
- tiki_products.xlsx: Sample records from tiki_products table.
- duong_analysis.ipynb: Some charts.
- 3,191 categories.
- 708,133 products: All products that could be crawled from the Tiki website.
Step 1: Create a SQLite database
- Create table "categories" in the database
Columns:
- id
- name
- url
- level: Level of the category in the category tree. Level 1 = the main categories.
- total_sub_category: Number of sub categories that the category contains. A value of 0 means the category has no sub categories.
- parent_id
- created_at
- Use a class (OOP) to hold each category temporarily, with a save_to_db method that uses "INSERT" and an update_total_sub_category method that uses "UPDATE".
- save_to_db: Save each category to the categories table as one row.
- update_total_sub_category: Update total_sub_category of every parent category in the categories table.
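A minimal sketch of Step 1, assuming the column types shown here (the README only lists column names) and an in-memory database in place of tiki.db. The class name `Category`, its attribute names, and the example.com URLs are illustrative; the real script may differ.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory database for the sketch; the project stores data in a file (tiki.db).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS categories (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT,
        url TEXT,
        level INTEGER,
        total_sub_category INTEGER DEFAULT 0,
        parent_id INTEGER,
        created_at TEXT
    )
""")


class Category:
    """Holds one category in memory before it is written to SQLite."""

    def __init__(self, name, url, level=1, parent_id=None):
        self.cat_id = None  # filled in by save_to_db
        self.name = name
        self.url = url
        self.level = level
        self.parent_id = parent_id

    def save_to_db(self, conn):
        """INSERT this category as one row and remember the generated id."""
        cur = conn.execute(
            "INSERT INTO categories (name, url, level, parent_id, created_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (self.name, self.url, self.level, self.parent_id,
             datetime.now(timezone.utc).isoformat()),
        )
        conn.commit()
        self.cat_id = cur.lastrowid
        return self.cat_id

    def update_total_sub_category(self, conn, total):
        """UPDATE the stored sub category count for this (parent) category."""
        conn.execute(
            "UPDATE categories SET total_sub_category = ? WHERE id = ?",
            (total, self.cat_id),
        )
        conn.commit()


# Usage: one main category with a single sub category (placeholder URLs).
parent = Category("Parent", "https://example.com/parent")
parent.save_to_db(conn)
child = Category("Child", "https://example.com/parent/child",
                 level=parent.level + 1, parent_id=parent.cat_id)
child.save_to_db(conn)
parent.update_total_sub_category(conn, 1)
```

Parameterized queries (`?` placeholders) are used so that category names containing quotes do not break the SQL.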
Step 2: Crawl the links of the sub categories and return a list of the last sub-categories.
- Get main categories: The categories on the tiki.vn homepage.
- Get sub categories: Assign the level of each sub category (= parent_level + 1) and update total_sub_category of the direct parent category.
- Get all categories: Only the last sub-categories (category with total_sub_category = 0) are used for crawling items.
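The traversal in Step 2 could look roughly like the sketch below. It is not the code from duong_crawl_category.py: `CategoryNode`, `get_all_categories`, and `fake_fetch` are hypothetical names, and the HTTP request plus HTML parsing is replaced by a small dictionary so the example runs offline.

```python
from dataclasses import dataclass


@dataclass
class CategoryNode:
    name: str
    url: str
    level: int = 1
    total_sub_category: int = 0


def get_all_categories(main_categories, fetch_sub_categories):
    """Walk the category tree and return only the last sub-categories.

    fetch_sub_categories(category) must return a list of (name, url) pairs;
    in the real crawler this would download and parse the category page.
    """
    leaves = []
    stack = list(main_categories)
    while stack:
        cat = stack.pop()
        subs = fetch_sub_categories(cat)
        cat.total_sub_category = len(subs)  # 0 marks a last sub-category
        if not subs:
            leaves.append(cat)
            continue
        for name, url in subs:
            stack.append(CategoryNode(name, url, level=cat.level + 1))
    return leaves


# Stand-in for the website: maps a category URL to its sub categories.
FAKE_SITE = {
    "/home": [("Kitchen", "/home/kitchen"), ("Bedroom", "/home/bedroom")],
    "/home/bedroom": [("Sheets", "/home/bedroom/sheets")],
}


def fake_fetch(cat):
    return FAKE_SITE.get(cat.url, [])


leaves = get_all_categories([CategoryNode("Home", "/home")], fake_fetch)
print(sorted((c.name, c.level) for c in leaves))
# [('Kitchen', 2), ('Sheets', 3)]
```

An explicit stack is used instead of recursion so that a deep category tree cannot hit Python's recursion limit.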
Step 3: Crawl all items in each last sub-category (category with total_sub_category = 0) and save each item to the tiki_products table in the SQLite database.
Similar to the Week 1 project, with some updates:
- Use class "Product" (OOP) to temporarily hold product values parsed from the HTML, instead of a dictionary.
- Save items to the SQLite database, instead of a dataframe and an Excel file.
- Extra features: Resume crawling from where the script stopped when interrupted, using the SQLite database (categories and tiki_products tables) as the backup source instead of an Excel file.
- Continue to crawl the next item in a page;
- Or continue the next page;
- Or continue the next category.
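One way the resume logic could work is sketched below. The README does not list the columns of tiki_products, so `cat_id`, `page`, and `position` are assumed column names, and `get_resume_point` and `should_skip` are hypothetical helpers. The sketch also assumes that categories are crawled in ascending id order.

```python
import sqlite3


def get_resume_point(conn):
    """Return (cat_id, page, next_position) based on the last saved product,
    or None when the table is empty and the crawl starts from the beginning."""
    row = conn.execute(
        "SELECT cat_id, page, position FROM tiki_products "
        "ORDER BY id DESC LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    cat_id, page, position = row
    return cat_id, page, position + 1


def should_skip(resume, cat_id, page, position):
    """True for items that were already saved before the interruption."""
    if resume is None:
        return False
    # Tuples compare element by element: category first, then page, then position.
    return (cat_id, page, position) < resume


# Demo with an in-memory table standing in for tiki.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tiki_products ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, "
    "cat_id INTEGER, page INTEGER, position INTEGER)"
)
conn.executemany(
    "INSERT INTO tiki_products (name, cat_id, page, position) VALUES (?, ?, ?, ?)",
    [("a", 5, 2, 1), ("b", 5, 2, 2), ("c", 5, 2, 3)],
)
resume = get_resume_point(conn)
print(resume)  # (5, 2, 4)
```

If position 4 does not exist on page 2, the crawler would move on to page 3; if page 3 is empty, it would move on to the next category, which matches the three cases listed above.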
It took more than 4 hours to finish crawling all category URLs.
Total sub categories in each main category
"Homes for Life" (Nhà Cửa Đời Sống) has the largest number of sub categories: 600.
Total products in each main category
The large number of sub categories in "Homes for Life" (Nhà Cửa Đời Sống) is due to its large number of products.
Top 10 sub categories with the highest number of products
The number of products in each category usually reaches a maximum of 1,000. The Tiki database contains more than that, but the rest are not shown on the website.
However, 3 categories break this rule: "Bộ chăn ga, ra, drap" (Blankets, sheets), "Tranh canvas" (Canvas pictures), and "Đồ chơi" (Toys) have over 4,000 products and up to 200 pages.
It is not clear how the script could crawl such data, because through the UI on the webpage, pages above 21 cannot be accessed.
The extra products in those 3 categories are only duplicates: Tiki places the same product in different positions, and the product links also differ.
The 700k products on the website may be only 10% of the products in the Tiki database.
It took more than 19 hours to finish crawling all products in all categories.
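The per-category counts and the duplicate check described above could be reproduced with a query like the following sketch. The column names `cat_id` and `name` are assumptions about tiki_products, and treating rows with the same name in the same category as duplicates is only a rough proxy, since the README says the duplicated products have different links.

```python
import sqlite3


def top_categories(conn, limit=10):
    """Return (cat_id, total_rows, distinct_names) for the largest categories."""
    return conn.execute(
        """
        SELECT cat_id,
               COUNT(*) AS total,
               COUNT(DISTINCT name) AS distinct_names
        FROM tiki_products
        GROUP BY cat_id
        ORDER BY total DESC, cat_id
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


# Demo with made-up rows in an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tiki_products (name TEXT, url TEXT, cat_id INTEGER)")
conn.executemany(
    "INSERT INTO tiki_products (name, url, cat_id) VALUES (?, ?, ?)",
    [
        ("Toy car", "/p/1", 1), ("Toy car", "/p/2", 1),
        ("Toy car", "/p/3", 1), ("Doll", "/p/4", 1),
        ("Lamp", "/p/5", 2), ("Desk", "/p/6", 2),
    ],
)
print(top_categories(conn))
# [(1, 4, 2), (2, 2, 2)]
```

A large gap between `total` and `distinct_names` for a category suggests that many of its rows are the same product listed under different links.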
Average current price of each main category
Unit: 10,000,000 VND (10 million VND)
Price after discount.
Electronics (Điện tử - điện lạnh) has the highest average price, over 10 million VND.
Highest price: Harley Davidson FXDRS 2019 (Xe Mô Tô Harley Davidson FXDRS - 2019) - 799,500,000 VND
An example of a 0 VND price: Box of 10 Volluto Coffee Capsules - Nespresso (Hộp 10 Viên Nén Capsule Cà Phê Volluto - Nespresso), because it was out of stock.
Lowest price: Zinc Branches (Kẽm Cành) - 1,920 VND
Product position on a page vs Discount rate (only considering products with a discount rate > 0%)
Positions 7, 17, 27, 37, and 47 usually do not have discounts in the 40-60% range. Popular discount percentages: 10, 20, 30, 40, 50.
Product page number vs Discount rate (only considering products with a discount rate > 0%)
Discounted products are mostly placed on pages 1 to 6. After page 6, the number of discounted products starts to decrease.
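The two discount comparisons can be approximated by counting discounted products per page or per position, as in the sketch below. The columns `page`, `position`, and `discount_pct` are assumed names, and `discounted_counts` is a hypothetical helper; the charts in duong_analysis.ipynb may have been produced differently.

```python
import sqlite3

# Column names cannot be bound as SQL parameters, so they are checked
# against a whitelist before being placed in the query text.
ALLOWED_COLUMNS = {"page", "position"}


def discounted_counts(conn, column):
    """Count products with a discount rate > 0%, grouped by page or position."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unsupported column: {column}")
    sql = (
        f"SELECT {column}, COUNT(*) FROM tiki_products "
        f"WHERE discount_pct > 0 GROUP BY {column} ORDER BY {column}"
    )
    return conn.execute(sql).fetchall()


# Demo with made-up rows in an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tiki_products (page INTEGER, position INTEGER, discount_pct INTEGER)"
)
conn.executemany(
    "INSERT INTO tiki_products (page, position, discount_pct) VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 2, 0), (1, 3, 50), (2, 1, 20), (7, 1, 0)],
)
print(discounted_counts(conn, "page"))      # [(1, 2), (2, 1)]
print(discounted_counts(conn, "position"))  # [(1, 2), (3, 1)]
```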