- Step 1: Crawl the data of one item on page 1.
- Step 2: Loop Step 1 to get all items on page 1.
- Step 3: Loop Step 2 over all pages in the Tiki category until the final page.
- Append the crawled data of every page to an Excel (.xlsx) file.
- Resume from the page where the script left off.
- data_columns.xlsx: The data fields to crawl from the Tiki webpage.
- web_crawl_tiki.py: Get the data of one item on page 1 (Step 1).
- duong_web_crawl_tiki.py: Test the loops for Step 1 and Step 2.
- week_1_duong_jun_craw_tiki_category.py: Complete script; the final result is an .xlsx file.
- tiki_tv_product.xlsx, tiki_tv_product.csv: Final result files.
Author: Jun
Experiment with one item first in web_crawl_tiki.py.
Use BeautifulSoup to crawl data from https://tiki.vn/tivi/c5015?src=c.5015.hamburger_menu_fly_out_banner&page
Tip: Use the bottom bar of the browser's inspector to see the exact nesting level an HTML tag is on.
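Step 1 can be sketched as below. This is a minimal example of parsing one product tile with BeautifulSoup; the HTML snippet, the CSS class "product-item", and the data attributes are assumptions for illustration, so inspect the live Tiki page to confirm the real selectors before running at scale.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for one product tile on the category page; the real page
# at https://tiki.vn/tivi/c5015 must be inspected to confirm the actual
# tag names and attributes ("product-item", "data-title", etc. are assumed).
sample_html = """
<div class="product-list">
  <a class="product-item" data-seller-product-id="111"
     data-title="Smart Tivi 43 inch" data-price="5990000"></a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Grab the first product tile and read its data attributes
first_item = soup.find("a", class_="product-item")
print(first_item["data-title"])  # -> Smart Tivi 43 inch
print(first_item["data-price"])  # -> 5990000
```

For the real script, the same parsing is applied to HTML fetched with requests (e.g. `requests.get(url).text`) instead of a hardcoded string.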
Author: Duong
Created 3 functions:
- get_html(link)
- get_item_list(full_webpage_html)
- get_data(item_html, output_dict): Uses the result of Step 1
Use a for loop
with the above 3 functions:
- Get the HTML from the link
- -> Get the list of items on one page
- -> Loop through the list to get the data of each item
- -> Extend the pre-generated dictionary.
Output dictionary format:
{'data-seller-product-id':[],
'product-sku':[],
'data-title':[],
...
'rating_percentage':[],
'number_of_reviews':[]}
Tip: Use try/except:
if the data can't be crawled, return None or a '0' value for each element so that an error won't stop the script.
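The three functions, the for loop, and the try/except tip above can be sketched as follows. The selector "product-item" and the attribute names are assumptions, and the demo runs on a small HTML snippet instead of a live request so it is self-contained; swap in `get_html` for the real crawl.

```python
import requests
from bs4 import BeautifulSoup

def get_html(link):
    """Get the HTML from the link (a browser User-Agent avoids basic blocking)."""
    response = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    return response.text

def get_item_list(full_webpage_html):
    """Get the list of items on one page (selector is an assumption)."""
    soup = BeautifulSoup(full_webpage_html, "html.parser")
    return soup.find_all("a", class_="product-item")

def get_data(item_html, output_dict):
    """Extract each field; on failure append None so one bad item
    never stops the whole script (the try/except tip above)."""
    for key in output_dict:
        try:
            output_dict[key].append(item_html[key])
        except KeyError:
            output_dict[key].append(None)
    return output_dict

# Pre-generated output dictionary (only two of the keys shown here)
output_dict = {"data-seller-product-id": [], "data-title": []}

# Demo on a small snippet instead of get_html(link), to stay offline
sample_page = """
<a class="product-item" data-seller-product-id="111" data-title="Tivi A"></a>
<a class="product-item" data-seller-product-id="222" data-title="Tivi B"></a>
"""

# The for loop: loop through the item list to get the data of each item
for item in get_item_list(sample_page):
    get_data(item, output_dict)

print(output_dict["data-title"])  # -> ['Tivi A', 'Tivi B']
```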
Author: Duong
Use a while loop
and stop when there are 0 products on the page, i.e. len(list of items on one page) == 0.
One page per loop iteration.
Tip: Create a column in the output file to store the page number: df['page'] = page_number.
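A sketch of that while loop, assuming a `fetch_items(page_number)` helper that stands in for `get_item_list(get_html(link))` so the stop condition can be shown without network access; the fake fetcher below is purely illustrative.

```python
import pandas as pd

def crawl_all_pages(fetch_items, start_page=1):
    """One page per iteration; stop when a page has zero items."""
    frames = []
    page_number = start_page
    while True:
        items = fetch_items(page_number)
        if len(items) == 0:           # 0 products on the page -> past the final page
            break
        df = pd.DataFrame(items)
        df["page"] = page_number      # tip: store the page number on every row
        frames.append(df)
        page_number += 1
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Fake fetcher for the demo: two pages with 3 items each, then an empty page
def fake_fetch(page):
    if page <= 2:
        return [{"data-title": f"Tivi {page}-{i}"} for i in range(3)]
    return []

result = crawl_all_pages(fake_fetch)
print(len(result))           # -> 6
print(result["page"].max())  # -> 2
```

In the real script, `fetch_items` would build the category URL with `&page={page_number}` and return the parsed item list.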
Author: Duong
Why the extra features are needed: the script can be interrupted at any time, especially when crawling hundreds of pages.
- We don't want to lose the data the script has already crawled.
- We don't want to start over from the beginning.
Create a dataframe after each iteration of the while loop
-> Concat that dataframe to a master dataframe -> Export the master dataframe to an Excel file.
Try to read the output file (use try/except)
-> Get the max page number in the file -> The script starts from max page number + 1.
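The resume logic can be sketched like this. The filename matches the result file listed above, and the 'page' column is the one from the earlier tip; the helper name is hypothetical.

```python
import pandas as pd

OUTPUT_FILE = "tiki_tv_product.xlsx"

def get_start_page(path=OUTPUT_FILE):
    """Return (max page number in the file + 1, existing data),
    or (1, empty dataframe) when there is no output file yet."""
    try:
        master_df = pd.read_excel(path)
        return int(master_df["page"].max()) + 1, master_df
    except FileNotFoundError:
        return 1, pd.DataFrame()

start_page, master_df = get_start_page()

# Inside the while loop, after crawling one page into page_df:
#   master_df = pd.concat([master_df, page_df], ignore_index=True)
#   master_df.to_excel(OUTPUT_FILE, index=False)  # data survives interruption
```

Exporting after every page is slower than exporting once at the end, but it is what makes the crawl restartable after an interruption.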