- Step 1: Crawl the data of one item on page 1.
- Step 2: Loop Step 1 to get all items on page 1.
- Step 3: Loop Step 2 over all pages in the Tiki category until the final page.
- Append the crawled data of every page to an Excel (.xlsx) file.
- Resume from the page where the script left off.
- data_columns.xlsx: The data fields to crawl from the Tiki webpage.
- web_crawl_tiki.py: Get the data of one item on page 1 (Step 1).
- duong_web_crawl_tiki.py: Test the loops for Step 1 and Step 2.
- week_1_duong_jun_craw_tiki_category.py: Complete script; the final result is an .xlsx file.
- tiki_tv_product.xlsx, tiki_tv_product.csv: Final result files.
Author: Jun
Experiment with one item first in web_crawl_tiki.py.
Use BeautifulSoup to crawl data from https://tiki.vn/tivi/c5015?src=c.5015.hamburger_menu_fly_out_banner&page
Tip: Use the bottom bar of the browser's inspector to see the exact nesting level an HTML tag is on.
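Step 1 can be sketched as below. This is a minimal example of parsing one product tile with BeautifulSoup; the HTML snippet, the CSS class "product-item", and the data attributes are assumptions for illustration, so inspect the live Tiki page to confirm the real selectors before running at scale.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for one product tile on the category page; the real page
# at https://tiki.vn/tivi/c5015 must be inspected to confirm the actual
# tag names and attributes ("product-item", "data-title", etc. are assumed).
sample_html = """
<div class="product-list">
  <a class="product-item" data-seller-product-id="111"
     data-title="Smart Tivi 43 inch" data-price="5990000"></a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Grab the first product tile and read its data attributes
first_item = soup.find("a", class_="product-item")
print(first_item["data-title"])  # -> Smart Tivi 43 inch
print(first_item["data-price"])  # -> 5990000
```

For the real script, the same parsing is applied to HTML fetched with requests (e.g. `requests.get(url).text`) instead of a hardcoded string.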
Author: Duong
Created 3 functions:
- get_html(link)
- get_item_list(full_webpage_html)
- get_data(item_html, output_dict): Uses the result of Step 1
Use a for loop
with the above 3 functions:
- Get the HTML from the link
- -> Get the list of items on one page
- -> Loop through the list to get the data of each item
- -> Extend the pre-generated dictionary.
Output dictionary format:
{'data-seller-product-id':[],
'product-sku':[],
'data-title':[],
...
'rating_percentage':[],
'number_of_reviews':[]}
Tip: Use try/except:
if the data can't be crawled, return None or a '0' value for each element so that an error won't stop the script.
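The three functions, the for loop, and the try/except tip above can be sketched as follows. The selector "product-item" and the attribute names are assumptions, and the demo runs on a small HTML snippet instead of a live request so it is self-contained; swap in `get_html` for the real crawl.

```python
import requests
from bs4 import BeautifulSoup

def get_html(link):
    """Get the HTML from the link (a browser User-Agent avoids basic blocking)."""
    response = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    return response.text

def get_item_list(full_webpage_html):
    """Get the list of items on one page (selector is an assumption)."""
    soup = BeautifulSoup(full_webpage_html, "html.parser")
    return soup.find_all("a", class_="product-item")

def get_data(item_html, output_dict):
    """Extract each field; on failure append None so one bad item
    never stops the whole script (the try/except tip above)."""
    for key in output_dict:
        try:
            output_dict[key].append(item_html[key])
        except KeyError:
            output_dict[key].append(None)
    return output_dict

# Pre-generated output dictionary (only two of the keys shown here)
output_dict = {"data-seller-product-id": [], "data-title": []}

# Demo on a small snippet instead of get_html(link), to stay offline
sample_page = """
<a class="product-item" data-seller-product-id="111" data-title="Tivi A"></a>
<a class="product-item" data-seller-product-id="222" data-title="Tivi B"></a>
"""

# The for loop: loop through the item list to get the data of each item
for item in get_item_list(sample_page):
    get_data(item, output_dict)

print(output_dict["data-title"])  # -> ['Tivi A', 'Tivi B']
```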
Author: Duong
Use a while loop
and stop when there are 0 products on the page, i.e. len(list of items on one page) == 0.
One page per loop iteration.
Tip: Create a column in the output file to store the page number: df['page'] = page_number.
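A sketch of that while loop, assuming a `fetch_items(page_number)` helper that stands in for `get_item_list(get_html(link))` so the stop condition can be shown without network access; the fake fetcher below is purely illustrative.

```python
import pandas as pd

def crawl_all_pages(fetch_items, start_page=1):
    """One page per iteration; stop when a page has zero items."""
    frames = []
    page_number = start_page
    while True:
        items = fetch_items(page_number)
        if len(items) == 0:           # 0 products on the page -> past the final page
            break
        df = pd.DataFrame(items)
        df["page"] = page_number      # tip: store the page number on every row
        frames.append(df)
        page_number += 1
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Fake fetcher for the demo: two pages with 3 items each, then an empty page
def fake_fetch(page):
    if page <= 2:
        return [{"data-title": f"Tivi {page}-{i}"} for i in range(3)]
    return []

result = crawl_all_pages(fake_fetch)
print(len(result))           # -> 6
print(result["page"].max())  # -> 2
```

In the real script, `fetch_items` would build the category URL with `&page={page_number}` and return the parsed item list.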
Author: Duong
Why the extra features are needed: the script can be interrupted at any time, especially when crawling hundreds of pages.
- We don't want to lose the data the script has already crawled.
- We don't want to start over from the beginning.
Create a dataframe after each iteration of the while loop
-> Concat that dataframe to a master dataframe -> Export the master dataframe to an Excel file.
Try to read the output file (use try/except)
-> Get the max page number in the file -> The script starts from max page number + 1.
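The resume logic can be sketched like this. The filename matches the result file listed above, and the 'page' column is the one from the earlier tip; the helper name is hypothetical.

```python
import pandas as pd

OUTPUT_FILE = "tiki_tv_product.xlsx"

def get_start_page(path=OUTPUT_FILE):
    """Return (max page number in the file + 1, existing data),
    or (1, empty dataframe) when there is no output file yet."""
    try:
        master_df = pd.read_excel(path)
        return int(master_df["page"].max()) + 1, master_df
    except FileNotFoundError:
        return 1, pd.DataFrame()

start_page, master_df = get_start_page()

# Inside the while loop, after crawling one page into page_df:
#   master_df = pd.concat([master_df, page_df], ignore_index=True)
#   master_df.to_excel(OUTPUT_FILE, index=False)  # data survives interruption
```

Exporting after every page is slower than exporting once at the end, but it is what makes the crawl restartable after an interruption.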