# Selenium 
Selenium lets you automate a real browser, so a script can mimic a person browsing a website.
With Selenium we can do the following:
- Fill in a search box
- Navigate to the next page
- Load more content by mimicking scrolling

We will again follow an online tutorial, this time scraping **Lazada**:

- https://medium.com/@zfwong.wilson/web-scraping-e-commerce-sites-to-compare-prices-with-python-part-1-360509ee5c62


## Installation

Install Selenium with either conda or pip:
- `conda install -c conda-forge selenium`
- `pip install selenium`

Install the Chrome driver
1. Find out which version of Chrome you are using
    - Steps:
        - Click the 3 dots at the **top right** of your browser.
        - Help
        - About Google Chrome
        - Copy the version
2. Download the Chrome Driver version that matches your browser
    - https://sites.google.com/a/chromium.org/chromedriver/downloads
3. Save the Chrome Driver executable in the same directory as your code

### Import libraries 



In [19]:
# Web Scraping
from selenium import webdriver
from selenium.common.exceptions import *
# Data manipulation
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from selenium.webdriver.common.keys import Keys
from time import sleep




## Scraping data with Selenium

In [41]:
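# Path to the Chrome driver executable saved in step 3 (adjust to your own location).
# The cells below call webdriver.Chrome() without a path, so the driver must also
# be discoverable by Selenium, for example by being on your PATH.
#webdriver_path = '/home/ustad/Documents/untuk_iSaham/belajar_irfan/dec_w3/chromedriver'

webdriver_path = './chromedriver'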


In [78]:
# url
Lazada_url = 'https://www.lazada.com.my'

In [79]:

search_item = 'Nescafe Gold refill 170g'

In [80]:
# Select custom Chrome options
options = webdriver.ChromeOptions()
#options.add_argument('--headless') 
options.add_argument('start-maximized') 
options.add_argument('disable-infobars')
options.add_argument('--disable-extensions')
# Open the Chrome browser
browser = webdriver.Chrome( options=options)
browser.get(Lazada_url)


In [81]:
# The find_element_by_* helpers were removed in Selenium 4.3; use By locators instead
from selenium.webdriver.common.by import By

# finds search bar
search_bar = browser.find_element(By.ID, 'q')

# keys in search item
search_bar.send_keys(search_item)

# presses the Enter key
search_bar.send_keys(Keys.ENTER)

In [82]:
# obtain title and price
item_titles = browser.find_elements(By.CLASS_NAME, 'c16H9d')
item_prices = browser.find_elements(By.CLASS_NAME, 'c13VH6')
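
The results page loads after Enter is pressed, so a lookup that runs too early can return an empty list. Below is a minimal polling sketch in plain Python; the helper name `wait_for` and its parameters are assumptions made for illustration. Selenium ships its own `WebDriverWait` class for the same purpose, which is usually the better choice in real code.

```python
import time


def wait_for(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll)
```

To use it, pass a function that performs the title lookup from the cell above; `wait_for` returns the first non-empty list of elements that lookup produces.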

In [83]:
# place items in lists

# Initialize empty lists
titles_list = []
prices_list = []
# Loop over the item_titles and item_prices
for title in item_titles:
    titles_list.append(title.text)
for price in item_prices:
    prices_list.append(price.text)

In [39]:
# Close the browser (run this only after you have finished the sections below,
# which still use the open browser)
browser.quit()

## Moving between pages
- We have to find the `Next` button and click it until it becomes disabled.

In [85]:
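import time
from selenium.webdriver.common.by import By

while True:
    # Re-locate the Next button on every page
    elm = browser.find_element(By.CLASS_NAME, 'ant-pagination-next')

    print(elm)

    # Stop once the Next button is disabled, i.e. on the last page
    if 'disabled' in elm.get_attribute('class'):
        break

    elm.click()
    # Wait for the next page to load
    time.sleep(3)
print(' done scrapping ')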

<selenium.webdriver.remote.webelement.WebElement (session="0760f3672d68b74cf440a891d6caa477", element="910bfc25-3c98-44ae-92a8-e6d11a359a52")>
<selenium.webdriver.remote.webelement.WebElement (session="0760f3672d68b74cf440a891d6caa477", element="910bfc25-3c98-44ae-92a8-e6d11a359a52")>
 done scrapping 
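
The loop above only clicks through the pages. To collect data on each page, the same pattern can be wrapped in a function that scrapes before it clicks. This is a sketch under stated assumptions: `scrape_all_pages`, `find_next_button`, `scrape_page` and `max_pages` are hypothetical names, and the function relies only on the button object offering `get_attribute` and `click`, as Selenium elements do.

```python
import time


def scrape_all_pages(find_next_button, scrape_page, pause=3.0, max_pages=100):
    """Scrape the current page, then click Next until it is disabled.

    find_next_button: function returning the Next button element
    scrape_page:      function that collects data from the current page
    Returns the number of pages scraped.
    """
    pages = 0
    while pages < max_pages:
        scrape_page()
        pages += 1

        # Re-locate the button on every page; old references can go stale
        button = find_next_button()
        if 'disabled' in (button.get_attribute('class') or ''):
            break

        button.click()
        # Pause so the next page can load and the site is not hammered
        time.sleep(pause)
    return pages
```

With Selenium 4, `find_next_button` could be `lambda: browser.find_element(By.CLASS_NAME, 'ant-pagination-next')`. The `max_pages` cap is a safety net in case the disabled state is never detected.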


## Scrolling down pages
- Some pages only load more content when we scroll down.
- Selenium can do this by running JavaScript in the browser.

In [49]:

# Obtain the scroll height
last_height = browser.execute_script("return document.body.scrollHeight")

# print (last_height)  # print this to check when debugging
while True:
    # Scroll to bottom
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for new content to load
    sleep(2)

    # Obtain height after scrolling
    new_height = browser.execute_script("return document.body.scrollHeight")
    
    # print (new_height)  # print new height after scroll for debugging
    
    # Break the loop if the height is unchanged, which means nothing new was loaded
    if new_height == last_height:
        break
    last_height = new_height
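
The loop above never ends on a page that keeps loading new content. Here is a minimal sketch of the same idea as a function with a cap on the number of scrolls; the name `scroll_to_bottom` and the `max_rounds` parameter are assumptions for illustration, not part of Selenium. The function only needs an object with an `execute_script` method, such as the `browser` used above.

```python
from time import sleep


def scroll_to_bottom(browser, pause=2.0, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = browser.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        # Scroll to the bottom and give new content time to load
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)

        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Height unchanged: nothing more was loaded
            return rounds
        last_height = new_height
    return max_rounds
```

Calling `scroll_to_bottom(browser)` behaves like the cell above, but gives up after 50 scrolls instead of running indefinitely.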
    

## Save data in a DataFrame

In [62]:
# view data 
prices_list

['RM29.90',
 'RM35.15',
 'RM19.99',
 'RM21.99',
 'RM21.30',
 'RM44.00',
 'RM56.00',
 'RM21.50',
 'RM27.90',
 'RM43.11',
 'RM56.99',
 'RM42.99',
 'RM48.50',
 'RM22.00',
 'RM27.50',
 'RM24.90',
 'RM25.40',
 'RM26.70',
 'RM42.99',
 'RM48.50',
 'RM12.66',
 'RM12.99',
 'RM46.00',
 'RM49.00',
 'RM12.99',
 'RM54.99',
 'RM60.00',
 'RM35.99',
 'RM18.50',
 'RM21.99',
 'RM45.00',
 'RM45.00',
 'RM38.00',
 'RM37.98',
 'RM23.50',
 'RM26.50',
 'RM13.60',
 'RM13.72',
 'RM15.00',
 'RM12.99',
 'RM15.99',
 'RM13.20',
 'RM15.00',
 'RM15.00',
 'RM45.00',
 'RM20.88',
 'RM27.00',
 'RM22.63',
 'RM44.00',
 'RM66.00',
 'RM32.00',
 'RM19.80',
 'RM72.00',
 'RM45.00',
 'RM12.66',
 'RM23.90',
 'RM31.25',
 'RM35.65',
 'RM21.45']

In [51]:
dfL = pd.DataFrame(zip(titles_list, prices_list), columns=["ItemName", "Price"])

In [52]:
dfL["Price"] = dfL["Price"].str.replace('RM', '').astype(float)

# This removes any entry with the literal text 'x2' in its title
# (titles written as 'x 2' or 'X 2' are not matched)
dfL = dfL[~dfL['ItemName'].str.contains('x2')]

In [53]:
# Keep only the 170g packs
dfL = dfL[dfL['ItemName'].str.contains('170g')].copy()

dfL['Platform'] = 'Lazada'
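
The `'x2'` filter only matches that exact text, so titles such as `170g X 2` or `170g x 2 packs` are kept. Below is a hedged sketch of a more tolerant check and a price parser in plain Python. The names `is_multipack` and `parse_price` and the regular expression are assumptions for illustration, and the pattern may need tuning for other title formats.

```python
import re

# Matches 'x2', 'x 2' and 'X 2' as a separate token (assumed pattern)
MULTIPACK = re.compile(r'\bx\s*2\b', re.IGNORECASE)


def is_multipack(title):
    """Return True if the title looks like a pack of two."""
    return MULTIPACK.search(title) is not None


def parse_price(text, currency='RM'):
    """Convert 'RM29.90' or '1,175' to a float; return None for empty text."""
    cleaned = text.replace(currency, '').replace(',', '').strip()
    return float(cleaned) if cleaned else None
```

With pandas, these could be applied row by row, for example `dfL[~dfL['ItemName'].apply(is_multipack)]` and `dfL['Price'].apply(parse_price)`.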


In [54]:
dfL.head()

| | ItemName | Price | Platform |
|---|---|---|---|
| 3 | SHOPPA Nescafe Gold Refill Pack - Rich & Smoot... | 21.99 | Lazada |
| 5 | NESCAFE GOLD Refill Twin Pack 170g X 2 | 44.0 | Lazada |
| 6 | Nescafe Gold Refill 170g x 2 packs exp: Mar2022 | 56.0 | Lazada |
| 7 | [READY STOCK] Nestle Nescafe Gold Blend Refill... | 21.5 | Lazada |
| 8 | NESCAFE GOLD Refill 170g | 27.9 | Lazada |


# Tips (if scraping many pages):
- Place a wait between page loads, as the website may block you for behaving like a `robot`.
- If you fear your computer might crash, append your data to a `.csv` file as you go. This way the data collected so far is saved even if your machine crashes.
- Add columns for the `page` and `time` to the `.csv` you are appending to. If your computer crashes, you know where to continue and when it crashed.
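
The last two tips can be sketched with the standard `csv` module. The function names `append_page_to_csv` and `last_saved_page` and the column layout are assumptions for illustration; adapt them to the data you actually scrape.

```python
import csv
import os
from datetime import datetime


def append_page_to_csv(csv_path, page, titles, prices):
    """Append one scraped page to csv_path, writing the header only once."""
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    scraped_at = datetime.now().isoformat(timespec='seconds')
    with open(csv_path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(['page', 'time', 'ItemName', 'Price'])
        for title, price in zip(titles, prices):
            writer.writerow([page, scraped_at, title, price])


def last_saved_page(csv_path):
    """Return the highest page number already saved, or 0 if none."""
    if not os.path.exists(csv_path):
        return 0
    with open(csv_path, newline='', encoding='utf-8') as f:
        pages = [int(row['page']) for row in csv.DictReader(f)]
    return max(pages, default=0)
```

Calling `append_page_to_csv` once per page keeps everything scraped so far on disk, and `last_saved_page` tells a restarted script which page to resume from.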

## Exercise 1: 
- Write Python code that uses Selenium to scrape multiple pages and saves the data in a `.csv` file.

In [1]:
# Web Scraping
from selenium import webdriver
from selenium.common.exceptions import *
# Data manipulation
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from selenium.webdriver.common.keys import Keys
import time


titles_list = []
prices_list = []





from selenium.webdriver.common.by import By


def search():
    # Type the search item into the search bar and press Enter
    search_bar = browser.find_element(By.ID, 'q')
    search_bar.send_keys(search_item)
    search_bar.send_keys(Keys.ENTER)
    # Give the results page time to load
    time.sleep(5)


def scrape():
    # Collect the titles and prices shown on the current results page
    item_titles = browser.find_elements(By.CLASS_NAME, 'c16H9d')
    item_prices = browser.find_elements(By.CLASS_NAME, 'c13VH6')

    # Append to the module-level lists so results accumulate across pages
    for title in item_titles:
        titles_list.append(title.text)
    for price in item_prices:
        prices_list.append(price.text)


def loop_page():
    # Search once, then scrape every results page
    search()
    while True:
        scrape()
        elm = browser.find_element(By.CLASS_NAME, 'ant-pagination-next')

        # Stop on the last page, where the Next button is disabled
        if 'disabled' in elm.get_attribute('class'):
            break

        elm.click()
        # Wait between pages so the next page can load
        time.sleep(5)


def list_to_df(titles_list, prices_list):
    dfL = pd.DataFrame(zip(titles_list, prices_list), columns=["ItemName", "Price"])
    dfL["Price"] = dfL["Price"].str.replace('RM', '').astype(float)

    # This removes any entry with the literal text 'x2' in its title
    dfL = dfL[~dfL['ItemName'].str.contains('x2')]

    # Keep only the 170g packs
    dfL = dfL[dfL['ItemName'].str.contains('170g')].copy()

    dfL['Platform'] = 'Lazada'

    return dfL


def main():
    loop_page()

    print(titles_list, "titles")
    print(prices_list, "prices")

    df_lazada = list_to_df(titles_list, prices_list)

    print(df_lazada)

    df_lazada.to_csv('lazada_akmal.csv', index=False)


Lazada_url = 'https://www.lazada.com.my'

search_item = 'Nescafe Gold refill 170g'

# Select custom Chrome options
options = webdriver.ChromeOptions()
#options.add_argument('--headless')
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
options.add_argument('--disable-extensions')
# Open the Chrome browser
browser = webdriver.Chrome(options=options)
browser.get(Lazada_url)

try:
    main()
finally:
    # Always close the browser, even if scraping fails
    browser.quit()


In [2]:
# Web Scraping
from selenium import webdriver
from selenium.common.exceptions import *
# Data manipulation
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from selenium.webdriver.common.keys import Keys
import time




In [17]:
amazon_url = 'https://www.amazon.com/'
search_item = 'laptop'


In [18]:
# Select custom Chrome options
options = webdriver.ChromeOptions()
options.add_argument('--headless') 
options.add_argument('start-maximized') 
options.add_argument('disable-infobars')
options.add_argument('--disable-extensions')
# Open the Chrome browser
browser = webdriver.Chrome( options=options)
browser.get(amazon_url)

In [19]:
# The find_element_by_* helpers were removed in Selenium 4.3; use By locators instead
from selenium.webdriver.common.by import By

# finds search bar
search_bar = browser.find_element(By.ID, 'twotabsearchtextbox')

# keys in search item
search_bar.send_keys(search_item)

# presses the Enter key
search_bar.send_keys(Keys.ENTER)

In [20]:
# obtain titles, prices and rating counts
item_titles = browser.find_elements(By.XPATH, '//span[@class="a-size-medium a-color-base a-text-normal"]')

item_prices = browser.find_elements(By.CLASS_NAME, 'a-price-whole')

item_global_ratings = browser.find_elements(By.XPATH, '//span[@class="a-size-base"]')

In [21]:
# place items in lists

# Initialize empty lists

titles_list = []
prices_list = []
rating =[]

# Loop over the item_titles and item_prices

for title in item_titles:
    titles_list.append(title.text)

for price in item_prices:
    prices_list.append(price.text)

for i in item_global_ratings :
    rating.append(i.text)



In [22]:
print(titles_list)
len(titles_list)

['Lenovo Chromebook S330 Laptop, 14-Inch FHD (1920 x 1080) Display, MediaTek MT8173C Processor, 4GB LPDDR3, 64GB eMMC, Chrome OS, 81JW0000US, Business Black', 'Lenovo Chromebook Flex 5 13" Laptop, FHD (1920 x 1080) Touch Display, Intel Core i3-10110U Processor, 4GB DDR4 Onboard RAM, 64GB SSD, Intel Integrated Graphics, Chrome OS, 82B80006UX, Graphite Grey', 'HP Chromebook 14-inch FHD Laptop, Intel Celeron N4000, 4 GB RAM, 32 GB eMMC, Chrome (14a-na0050nr, Mineral Silver)', 'Acer Predator Helios 300 Gaming Laptop, Intel i7-10750H, NVIDIA GeForce RTX 2060 6GB, 15.6" Full HD 144Hz 3ms IPS Display, 16GB Dual-Channel DDR4, 512GB NVMe SSD, Wi-Fi 6, RGB Keyboard, PH315-53-72XD', 'HP Chromebook 11-inch Laptop - Up to 15 Hour Battery Life - MediaTek - MT8183 - 4 GB RAM - 32 GB eMMC Storage - 11.6-inch HD Display - with Chrome OS - (11a-na0021nr, 2020 Model, Snow White)', '2020 Lenovo IdeaPad Laptop ComputerAMD A6-9220e 1.6GHz 4GB Memory 64GB eMMC Flash Memory 14" AMD Radeon R4 AC WiFi Microsoft

22

In [23]:
print(prices_list)
len(prices_list)

['279', '404', '279', '1,175', '263', '498', '489', '', '267', '714', '644', '999', '259', '242', '269', '259', '339', '242', '289', '449', '618', '673', '379', '1,279']


24

In [24]:
print(rating)
len(rating)

['1,412', '1,005', '1,951', '1,555', '165', '1,616', '3,515', '', '', '1,143', '1,615', '122', '1,669', '521', '1,061', '1,086', '2,788', '1,541', '4,761', '1,086', '376', '11', '1,544', '451', '9', '54']


26


In [4]:

from selenium.webdriver.common.by import By

amazon_url = 'https://www.amazon.com/'
search_item = 'laptop'

# Select custom Chrome options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
options.add_argument('--disable-extensions')
# Open the Chrome browser
browser = webdriver.Chrome(options=options)
browser.get(amazon_url)

# finds search bar
search_bar = browser.find_element(By.ID, 'twotabsearchtextbox')

# keys in search item
search_bar.send_keys(search_item)

# presses the Enter key
search_bar.send_keys(Keys.ENTER)

# Lists that accumulate results across pages
titles_list = []
prices_list = []
rating = []


def scrape_berita():
    # Collect titles, prices and rating counts from the current results page
    item_titles = browser.find_elements(By.XPATH, '//span[@class="a-size-medium a-color-base a-text-normal"]')

    item_prices = browser.find_elements(By.CLASS_NAME, 'a-price-whole')

    item_global_ratings = browser.find_elements(By.XPATH, '//span[@class="a-size-base"]')

    for title in item_titles:
        titles_list.append(title.text)

    for price in item_prices:
        prices_list.append(price.text)

    for i in item_global_ratings:
        rating.append(i.text)


# Scrape the first three result pages
count = 0

while count < 3:

    print(count)

    scrape_berita()
    # Go to the next page and wait for it to load
    elm = browser.find_element(By.CLASS_NAME, 'a-last')
    elm.click()
    time.sleep(5)
    count += 1






0
1
2


In [5]:
print(titles_list)
len(titles_list)

['Lenovo Chromebook S330 Laptop, 14-Inch FHD (1920 x 1080) Display, MediaTek MT8173C Processor, 4GB LPDDR3, 64GB eMMC, Chrome OS, 81JW0000US, Business Black', 'Fusion5 14.1inch A90B+ Pro 64GB Windows 10 Laptop - 4GB RAM, 64GB Storage, Full HD IPS, Bluetooth, 2MP Webcam, Dual Band WiFi Laptop', 'Acer Predator Helios 300 Gaming Laptop, Intel i7-10750H, NVIDIA GeForce RTX 2060 6GB, 15.6" Full HD 144Hz 3ms IPS Display, 16GB Dual-Channel DDR4, 512GB NVMe SSD, Wi-Fi 6, RGB Keyboard, PH315-53-72XD', 'Acer Aspire 5 Slim Laptop, 15.6 inches Full HD IPS Display, AMD Ryzen 3 3200U, Vega 3 Graphics, 4GB DDR4, 128GB SSD, Backlit Keyboard, Windows 10 in S Mode, A515-43-R19L, Silver', 'Acer Nitro 5 Gaming Laptop, 9th Gen Intel Core i5-9300H, NVIDIA GeForce GTX 1650, 15.6" Full HD IPS Display, 8GB DDR4, 256GB NVMe SSD, Wi-Fi 6, Backlit Keyboard, Alexa Built-in, AN515-54-5812', 'ASUS TUF Gaming Laptop, 15.6” 144Hz Full HD IPS-Type Display, Intel Core i7-9750H Processor, GeForce GTX 1650, 8GB DDR4, 512G

66

In [6]:
print(prices_list)
len(prices_list)

['279', '257', '1,174', '364', '714', '498', '489', '', '', '839', '738', '644', '1,127', '1,018', '368', '449', '489', '298', '280', '249', '263', '668', '999', '895', '379', '242', '1,279', '1,699', '698', '849', '618', '1,099', '673', '839', '449', '498', '289', '295', '729', '312', '1,253', '339', '1,575', '1,169', '1,349', '1,879', '1,342', '629', '12', '1,356', '897', '729', '1,316', '269', '1,099', '239', '646', '556', '1,199', '499', '769', '1,999', '508', '265', '259', '279']


66

In [7]:
print(rating)
len(rating)

['1,415', '169', '1,571', '19,376', '1,624', '1,630', '3,519', '', '', '386', '543', '1,675', '526', '109', '14', '11', '3,519', '2,505', '1,961', '8', '169', '434', '105', '211', '10', '1,092', '55', '98', '116', '38', '1,548', '260', '451', '47', '6', '1,630', '386', '5,599', '5', '1,127', '1,518', '1,415', '4,775', '19', '91', '552', '11', '20', '4', '314', '1', '213', '90', '124', '1,112', '1,082', '2,798', '28', '6', '24', '384', '32', '2,767', '38', '858', '1,146', '57', '1,092']


68

In [10]:
# zip() stops at the shorter list, so any extra entries in the longer list are dropped
dfL = pd.DataFrame(zip(titles_list, prices_list), columns=["ItemName", "Price (USD)"])

# Label every row with the platform it came from
dfL['Platform'] = 'Amazon'


dfL.to_csv('amazon_akmal.csv', index=False)
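
The lists printed above have lengths 66, 66 and 68, and `prices_list` contains empty strings, so the values at a given position do not necessarily belong to the same product. Here is a minimal sketch for making such a mismatch visible before building a DataFrame; the helper names `check_alignment` and `to_rows` and the sample data are hypothetical.

```python
from itertools import zip_longest


def check_alignment(**columns):
    """Return the length of each named list and whether they all match."""
    lengths = {name: len(values) for name, values in columns.items()}
    return lengths, len(set(lengths.values())) == 1


def to_rows(titles, prices, ratings, fill=None):
    """Pad the shorter lists with `fill` instead of silently dropping entries."""
    return list(zip_longest(titles, prices, ratings, fillvalue=fill))


# Hypothetical sample data with one stray rating
titles = ['Laptop A', 'Laptop B']
prices = ['279', '']
ratings = ['1,415', '169', '1,571']

lengths, aligned = check_alignment(titles=titles, prices=prices, ratings=ratings)
rows = to_rows(titles, prices, ratings)
print(lengths, aligned)
print(rows)
```

Padding only exposes the mismatch; it does not tell you which product a stray value belongs to. A more reliable approach is usually to locate each product's container element first and read the title, price and rating from inside it.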
