<img src='https://haymora.com/blog/wp-content/uploads/2020/05/tiki-hanh-trinh-10-nam-phat-trien.jpg' width=browser_width>

# **Tiki Web Scraping with Selenium**


**Overview**: Build a web crawler that takes a Tiki URL and returns a DataFrame containing product information.

**Requirements** 
1. Scrape at least **3 pages of any single category**.
2. Your program should be able to scrape www.tiki.vn and generate a `.csv` file as a result.
3. The `.csv` file should contain the following information: 
    * Product Name
    * Price
    * URL of the product image (thumbnail)
    * URL of that product page

Follow the guideline below for more details.

## 0.Setting up

### Install and import resources

In [None]:
# install selenium and other resources for crawling data
!pip install selenium
!apt-get update
!apt install chromium-chromedriver


In [None]:
import re
import time
import pandas as pd

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

### Function to start and close driver

In [None]:
# Global driver to use throughout the script
DRIVER = None

# Wrapper to close the driver if it has been created
def close_driver():
    global DRIVER
    if DRIVER is not None:
        DRIVER.quit()  # quit() ends the whole browser session; close() only closes the current window
    DRIVER = None

# Function to (re)start driver
def start_driver(force_restart=False):
    global DRIVER
    
    if force_restart:
        close_driver()
    
    # Setting up the driver
    options = webdriver.ChromeOptions()
    options.add_argument('--headless') # run Chrome in the background, without opening a browser window
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')

    # chromedriver is found on PATH; newer Selenium releases no longer accept the path as a positional argument
    DRIVER = webdriver.Chrome(options=options)


### Setting up categories and links

In [None]:
# Urls
TIKI = 'https://tiki.vn'

In [None]:
#@title <a name="tiki-cats"></a> { form-width:'1px' }
# Pre-defined links to each category
MAIN_CATEGORIES = [
    {'Name': 'Đồ Chơi - Mẹ & Bé',                'URL': 'https://tiki.vn/do-choi-me-be/c2549'},
    {'Name': 'Điện Thoại - Máy Tính Bảng',       'URL': 'https://tiki.vn/dien-thoai-may-tinh-bang/c1789'},
    {'Name': 'Làm Đẹp - Sức Khỏe',               'URL': 'https://tiki.vn/lam-dep-suc-khoe/c1520'},
    {'Name': 'Điện Gia Dụng',                    'URL': 'https://tiki.vn/dien-gia-dung/c1882'},
    {'Name': 'Thời trang nữ',                    'URL': 'https://tiki.vn/thoi-trang-nu/c931'},
    {'Name': 'Thời trang nam',                   'URL': 'https://tiki.vn/thoi-trang-nam/c915'},
    {'Name': 'Giày - Dép nữ',                    'URL': 'https://tiki.vn/giay-dep-nu/c1703'},
    {'Name': 'Giày - Dép nam',                   'URL': 'https://tiki.vn/giay-dep-nam/c1686'},
    {'Name': 'Túi thời trang nữ',                'URL': 'https://tiki.vn/tui-vi-nu/c976'},
    {'Name': 'Túi thời trang nam',               'URL': 'https://tiki.vn/tui-thoi-trang-nam/c27616'},
    {'Name': 'Balo và Vali',                     'URL': 'https://tiki.vn/balo-va-vali/c6000'},
    {'Name': 'Phụ kiện thời trang',              'URL': 'https://tiki.vn/phu-kien-thoi-trang/c27498'},
    {'Name': 'Đồng hồ và Trang sức',             'URL': 'https://tiki.vn/dong-ho-va-trang-suc/c8371'},
    {'Name': 'Laptop - Máy Vi Tính - Linh kiện', 'URL': 'https://tiki.vn/laptop-may-vi-tinh-linh-kien/c1846'},
    {'Name': 'Nhà Cửa - Đời Sống',               'URL': 'https://tiki.vn/nha-cua-doi-song/c1883'},
    {'Name': 'Bách Hóa Online',                  'URL': 'https://tiki.vn/bach-hoa-online/c4384'},
    {'Name': 'Hàng Quốc Tế',                     'URL': 'https://tiki.vn/hang-quoc-te/c17166'},
    {'Name': 'Thiết Bị Số - Phụ Kiện Số',        'URL': 'https://tiki.vn/thiet-bi-kts-phu-kien-so/c1815'},
    {'Name': 'Voucher - Dịch vụ',                'URL': 'https://tiki.vn/voucher-dich-vu/c11312'},
    {'Name': 'Ô Tô - Xe Máy - Xe Đạp',           'URL': 'https://tiki.vn/o-to-xe-may-xe-dap/c8594'},
    {'Name': 'Nhà Sách Tiki',                    'URL': 'https://tiki.vn/nha-sach-tiki/c8322'},
    {'Name': 'Điện Tử - Điện Lạnh',              'URL': 'https://tiki.vn/dien-tu-dien-lanh/c4221'},
    {'Name': 'Thể Thao - Dã Ngoại',              'URL': 'https://tiki.vn/the-thao-da-ngoai/c1975'},
    {'Name': 'Máy Ảnh - Máy Quay Phim',          'URL': 'https://tiki.vn/may-anh/c1801'}
]

## 1.Function to get info from one product

>⚠️ **NOTE:** Sometimes the web element returned by the driver misbehaves because of the way the website is set up: calling `.text` on it returns an empty string even though text is visible inside the element when you check it manually with Inspect. When this happens, use `.get_attribute('innerHTML')` instead of `.text`.
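
The fallback described in the note can be wrapped in a small helper. This is only a sketch: `element_text` is a hypothetical name, and the tag stripping is deliberately crude, which is fine for short labels such as a name or a price but is not a general HTML parser.

```python
import html
import re

def element_text(element):
    """Return the element's text, falling back to innerHTML when .text is empty."""
    text = (element.text or '').strip()
    if text:
        return text
    # Fallback for elements whose .text comes back empty
    inner = element.get_attribute('innerHTML') or ''
    inner = re.sub(r'<[^>]+>', '', inner)  # crude removal of nested tags
    return html.unescape(inner).strip()    # turn entities such as &amp; back into characters
```

Inside `get_product_info_single`, a call such as `element_text(product_item.find_element(By.TAG_NAME, 'h3'))` could then replace a direct `.text` lookup.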

In [None]:
close_driver()

In [None]:
# Function to extract product info from a single product element
def get_product_info_single(product_item):
  info1 = {'name':'',
          'price':'',
          'product_url':'',
          'image':''}
  info1['product_url'] = product_item.get_attribute('href')
  # Look up each field on its own so that one missing element does not blank out the others
  try:
    info1['price'] = product_item.find_element(By.CLASS_NAME, 'price-discount__price').text
  except NoSuchElementException:
    pass
  try:
    info1['name'] = product_item.find_element(By.TAG_NAME, 'h3').text
  except NoSuchElementException:
    pass
  try:
    # Same lookup as thumbnail > image-wrapper > webpimg-container > img, written as one CSS selector
    info1['image'] = product_item.find_element(By.CSS_SELECTOR, '.thumbnail .image-wrapper .webpimg-container img').get_attribute('src')
  except NoSuchElementException:
    pass
  return info1

## 2.Function to scrape info of all products from a Page URL

To make your own life easier, you should use the function for a single product inside this one.

In [None]:
# Function to scrape all products from a page
def get_product_info_from_page(page_url):
  global DRIVER
  storage = []           # Store the info dictionary of each product in this list

  DRIVER.get(page_url)   # Use the driver to load the page
  time.sleep(3)          # Sleep AFTER loading website in order to wait for it to finish

  # Get a list of product elements and print the number of products found
  all_items_elements = DRIVER.find_elements(By.CLASS_NAME, 'product-item')
  print(len(all_items_elements))

  # Loop through list of product elements, read and add each product info into `storage`
  for item in all_items_elements:
    storage.append(get_product_info_single(item))
  return storage
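
Requirement 1 asks for at least three pages of a category, while `get_product_info_from_page` handles one page URL at a time. One way to bridge the two is to generate the page URLs first. The sketch below assumes that Tiki paginates category listings with a `page` query parameter; check this in your browser before relying on it. `build_page_urls` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def build_page_urls(category_url, n_pages=3):
    """Return the URLs of the first n_pages pages of a category.

    ASSUMPTION: the site selects the page through a `page` query parameter.
    """
    parts = urlparse(category_url)
    query = dict(parse_qsl(parts.query))  # keep any query parameters already present
    urls = []
    for page in range(1, n_pages + 1):
        query['page'] = str(page)
        urls.append(urlunparse(parts._replace(query=urlencode(query))))
    return urls

# Possible usage with the functions defined in this notebook:
# for url in build_page_urls(main_cat['URL'], n_pages=3):
#     prod_data.extend(get_product_info_from_page(url))
```

Keeping the URL building separate from the scraping makes it easy to change the number of pages without touching the scraping function.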

## 3.Start scraping

In [None]:
cat_idx  = 1 # Pick any category you like by changing the index
main_cat = MAIN_CATEGORIES[cat_idx]

start_driver(force_restart=True)
print('Scraping', main_cat['Name'])
print('Link:', main_cat['URL'])

prod_data = [] # STORE YOUR PRODUCT INFO DICTIONARIES IN HERE
data = get_product_info_from_page(main_cat['URL'])
prod_data.extend(data)
print(prod_data)
close_driver() # Close driver when we're done

Scraping Điện Thoại - Máy Tính Bảng
Link: https://tiki.vn/dien-thoai-may-tinh-bang/c1789
48
[{'name': 'Máy Tính Bảng Samsung Galaxy Tab S7 FE LTE T735 (4GB/64GB) - Hàng Chính Hãng', 'price': '10.290.000 ₫', 'product_url': 'https://tka.tiki.vn/pixel/pixel?data=djAwMfzNgqPNf0pcB4miJacMm7d_vT4Ncj_74vJEll38kjJujnDmJDd8pmmJN2VZfyLyeYi_MXtp2mgrj0PjlTNN1n230Fhok_LdK9PmRIL3kglDB4B4EJ4xi5DbOU0g4qVmxtlB9tPYyFpgE3j2tBkDZ4UPu8pGMn-fjG1lYDfgiU5kbBU1v3v4UgTxRj0jN2P_KzUsyE24-avou_8BXJP_yb3CT9xEKmpulPirCMv64lzCf8JDvi_nKMjlSRo2puGR1MAbSWlDfhlBhZc_2lW_sg_z8Y7kc672Iabqq5I8U5mFHYZO5UkKEbhG5S7nniPYGFbUhuuph7PrCbLdo_NcJv6wTWQG5bMr9Z6CZ6mM8boAFhduoJggZRxKD2PzCGDgHp0-PRoxPQw4HVhhXminGfrT7BV7Op5l1BPwfs6YBeU0ptHG_1xejJQ-VJg-kM0CyeTgBMAe_Vis6vh5dv7uIqT2Lq33N2SmdNVu5q0fcw7YbdSTXcrmITGbkxMUpW75TAJOA6k-glPHXwRAugfsWpwDe-H76rRIpoVhjiRQHDjrJ9gsLJnI5grv8uK1YXSeh5jntwZBeMeGbi-ZPYTDx8_6SMMKqbO8Mj8QhCg8sLuK7klpiTExG75wPJUW9kcdaRvGWVQQ8FGt09ZmrtR1KQCJMJmQuxEO4LG9iMO4EyFKSmhHe85fducyGK4T242MG7W9zniPYe3eicNNaVML2jXCuBYRZKHb

## 4.Run cell below to save your scraped data into a .csv file
If you've scraped correctly, then the cell should run without error and the information in the table should look reasonable.

In [None]:
# SAVE DATA TO CSV FILE
df = pd.DataFrame(data=prod_data, columns=prod_data[0].keys())
df.to_csv('tiki_products.csv', index=False) # index=False keeps the row numbers out of the file

n_products_to_view = 10 # Change this as you like to check more products
df.head(n_products_to_view)

Unnamed: 0,name,price,product_url,image
0,Máy Tính Bảng Samsung Galaxy Tab S7 FE LTE T73...,10.290.000 ₫,https://tka.tiki.vn/pixel/pixel?data=djAwMfzNg...,https://salt.tikicdn.com/cache/200x200/ts/prod...
1,Điện thoại Samsung M33 5G (8Gb/128GB) - Hàng c...,5.660.000 ₫,https://tka.tiki.vn/pixel/pixel?data=djAwMXvWD...,https://salt.tikicdn.com/cache/200x200/ts/prod...
2,Máy tính bảng Samsung Galaxy Tab A8 (4GB/64GB)...,7.490.000 ₫,https://tka.tiki.vn/pixel/pixel?data=djAwMUZAX...,https://salt.tikicdn.com/cache/200x200/ts/prod...
3,,,https://tka.tiki.vn/pixel/pixel?data=djAwMSxtz...,https://salt.tikicdn.com/cache/200x200/ts/prod...
4,Máy đọc sách All New Kindle Paperwhite 5 (11th...,3.990.000 ₫,https://tiki.vn/may-doc-sach-all-new-kindle-pa...,
5,Điện thoại Xiaomi Redmi 9A (2GB/32GB) - Hàng c...,2.090.000 ₫,https://tiki.vn/dien-thoai-xiaomi-redmi-9a-2gb...,
6,Tecno Pova 2 6GB l 128GB - Điện Thoại Thông Mi...,3.690.000 ₫,https://tiki.vn/tecno-pova-2-6gb-l-128gb-dien-...,
7,,,https://tiki.vn/apple-iphone-13-hang-chinh-han...,
8,Điện Thoại Oppo A16k (3GB/32G) - Hàng Chính Hãng,2.920.000 ₫,https://tiki.vn/dien-thoai-oppo-a16k-3gb-32g-h...,
9,Điện Thoại Samsung Galaxy M32 (8GB/128GB) - Hà...,4.590.000 ₫,https://tiki.vn/dien-thoai-samsung-galaxy-m32-...,
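
The `price` column in the table above holds display strings such as `10.290.000 ₫`, which cannot be sorted or averaged as numbers. A minimal sketch of a converter, assuming the dots are thousands separators and the price has no decimal part; `parse_price` and the `price_vnd` column name are hypothetical.

```python
import re

def parse_price(price_text):
    """Convert a display price such as '10.290.000 ₫' to an int, or None if empty."""
    digits = re.sub(r'\D', '', price_text or '')  # drop dots, spaces and the currency sign
    return int(digits) if digits else None

# Possible usage on the DataFrame built in this notebook:
# df['price_vnd'] = df['price'].map(parse_price)
```

Rows where the price was not found keep `None`, so missing prices stay visible instead of turning into zeros.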


In [None]:
# Overview of the table
df.info()

## 5.Finally, download your hard-earned `.csv` file that you've managed to create

In [None]:
# Just run the code below after you've created your own csv file.
# You can then open this file in a spreadsheet management program, like MS Excel
# and see your results.
from google.colab import files
files.download('tiki_products.csv')


## **⚠️ IMPORTANT ⚠️: SHARE FILE PERMISSION**

After finishing, share this file with ***editing*** permission to the following emails:
- minh.nguyen@coderschool.vn
- ngoc.le@coderschool.vn
- Your personal mentor's email

Your grades will not be processed until we have received this permission. This way, we can add comments and feedback to your project and help you learn better 🙌

## 🎁  **BONUS**
1. Scrape the [category names and links](#tiki-cats) from www.tiki.vn instead of hard-coding them. Make sure that the format stays the same (`list` of `dict`). (1 pt)

1. Change your code to scrape any number of pages you want in a single category. For example, try to scrape every page of the "Điện Thoại - Máy Tính Bảng" category, or any other category you like. You can create a new function, add new arguments, and/or write new statements as you like. (2 pts)

1. Extra information (3 pts). Think carefully about how to add these to your existing functions.
    * What's the rating?
    * Number of units sold?
    * Discount percentage?
    * Does it have TikiNow? <img src="https://salt.tikicdn.com/ts/upload/9f/32/dd/8a8d39d4453399569dfb3e80fe01de75.png">
    * Does it have FreeShip? <img src="https://salt.tikicdn.com/ts/upload/dc/0d/49/3251737db2de83b74eba8a9ad6d03338.png">
    * Is the product from an official seller? <img src="https://salt.tikicdn.com/ts/upload/b9/1f/4b/557eac9c67a4466ccebfa74cde854215.png">
    * Is the product from a trusted seller? <img src="https://salt.tikicdn.com/ts/upload/e0/41/da/bb0fc684a838eff5e264ce0534a148f0.png">
    * Does it have "badge under price" (Rẻ hơn hoàn tiền)? <img src="https://salt.tikicdn.com/ts/upload/51/ac/cc/528e80fe3f464f910174e2fdf8887b6f.png">
    * Can it be paid for in installments? <img src="https://salt.tikicdn.com/ts/upload/ba/4e/6e/26e9f2487e9f49b7dcf4043960e687dd.png">
    * Does it come with free gifts? <img src="https://salt.tikicdn.com/ts/upload/47/35/8c/446f61d046eba9a305d3f39dc0834c4a.png">
    <!-- * Does it have "shocking price" badge? <img src="https://salt.tikicdn.com/ts/upload/75/34/d2/4a9a0958a782da8930cdad8f08afff37.png"> -->
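
For the badge questions in the list above, one possible approach is to look for each badge image inside the product element. This is a sketch under a strong assumption that has not been verified against the live site: that the listing page renders each badge as an `<img>` whose `src` is exactly the URL shown in the list. `BADGES`, `get_badge_flags`, and the dictionary keys are hypothetical names, and only two badges are shown.

```python
CSS_SELECTOR = 'css selector'  # same string value as selenium's By.CSS_SELECTOR

# ASSUMPTION: badge images on the listing page use exactly these URLs
BADGES = {
    'tikinow': 'https://salt.tikicdn.com/ts/upload/9f/32/dd/8a8d39d4453399569dfb3e80fe01de75.png',
    'freeship': 'https://salt.tikicdn.com/ts/upload/dc/0d/49/3251737db2de83b74eba8a9ad6d03338.png',
}

def get_badge_flags(product_item, badges=BADGES):
    """Return {badge_name: bool} telling which badge images appear in the product element."""
    flags = {}
    for name, src in badges.items():
        # find_elements returns an empty list when nothing matches, so no try/except is needed
        matches = product_item.find_elements(CSS_SELECTOR, f'img[src="{src}"]')
        flags[name] = len(matches) > 0
    return flags
```

The returned dictionary can be merged into the product info with `info1.update(get_badge_flags(product_item))`, so each badge becomes its own column in the `.csv` file.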


The .csv result should be similar to this

<img src="https://imgur.com/FpwFdIQ.png">