# Browser Automation Lecture Notes
Date: 2023-07-18

We'll use this notebook to work through some examples and showcase some essential functions in Selenium.

Rather than basic Selenium, we'll use Selenium Wire, which can be used to intercept API calls/network requests.

In [208]:
#!pip install selenium selenium-wire chromedriver-binary-auto

In [1]:
# housekeeping
from pathlib import Path
import os
import random
import time

#selenium
from seleniumwire import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from selenium.common.exceptions import (
    MoveTargetOutOfBoundsException,
    TimeoutException,
    WebDriverException,
)

import chromedriver_binary  # importing this adds the chromedriver executable to PATH

In [2]:
os.makedirs('data/', exist_ok=True)

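### Opens browsers with no cookies, untracked
The automated browser starts with a fresh profile, so there are no cookies to track you by, though your IP address is still visible to the sites you visit.

Selenium controls the browser through a browser-specific driver, which is why we use a Chrome driver here.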

In [3]:
def open_browser():
    """
    Opens a new automated browser window with the usual tell-tale signs of automation disabled
    """
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    
    # remove the most common signs of this being an automated browser
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)

    # open the browser with the new options
    driver = webdriver.Chrome(options=options)
    return driver

In [5]:
driver = open_browser()

In [None]:
# how to visit websites
url = 'https://amazon.com'
driver.get(url)
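
Because the browser was opened through Selenium Wire, the requests made while the page loaded are available on `driver.requests`. A minimal sketch of listing them follows. `summarize_requests` and its `needle` parameter are hypothetical names, and the helper relies only on the `method`, `url`, and `response.status_code` attributes that Selenium Wire exposes on each captured request.

```python
def summarize_requests(requests, needle=""):
    """
    Returns (method, url, status_code) for each captured request whose URL
    contains `needle`. Requests that never got a response have status None.
    """
    rows = []
    for request in requests:
        if needle not in request.url:
            continue
        # `response` is None when nothing came back for this request
        response = request.response
        status = response.status_code if response is not None else None
        rows.append((request.method, request.url, status))
    return rows
```

With the browser from above, `summarize_requests(driver.requests, needle="amazon.com")[:5]` would show the first few matching calls.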

## XPATH
You learned about BeautifulSoup for working with HTML...

XPath is another way to navigate the hierarchical structure of HTML and SVG documents.

It's fast, versatile, and can be used in the developer console and in most programming languages.

## Using XPATH
You can test XPaths in the developer tools under the `console` tab using the function `$x()`. Read about that function [here](https://developer.chrome.com/docs/devtools/console/utilities/#xpath-function).

You can use Selenium's `find_element` [function](https://selenium-python.readthedocs.io/locating-elements.html?highlight=find_element#locating-elements) to find the search box on the Amazon site. It returns the first match for whatever criteria you use.

You can use whichever `By` [option](https://selenium-python.readthedocs.io/api.html?highlight=By#locate-elements-by) you feel most comfortable with. This includes XPath, but also any kind of CSS selector.

## Using XPath inputs:
- `XPATH` is a universal syntax, transferable between the browser console and Python code in Jupyter
- Precision is the name of the game; use `find_elements` for sanity checks


docs: https://selenium-python.readthedocs.io/locating-elements.html
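
One way to run the sanity check mentioned above is to ask for every match and confirm there is exactly one before acting on it. This is a small sketch: `find_unique` is a hypothetical helper, and it assumes `driver` is the browser opened earlier and `by` is one of the `By` options.

```python
def find_unique(driver, by, selector):
    """
    Returns the single element matching `selector`.
    Raises ValueError when the selector matches zero or several elements.
    """
    matches = driver.find_elements(by, selector)
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly 1 match for {selector!r}, found {len(matches)}"
        )
    return matches[0]
```

For example, `find_unique(driver, By.CSS_SELECTOR, 'input#nav-bb-search')` either returns the search box or tells you the selector is not precise enough.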

In [None]:
# find an element; in this case it's the input textbox for search
# ARIA attributes are always helpful to look for: they mark accessible elements
## it also helps to look for stable, unchanging elements.

# docs: https://selenium-python.readthedocs.io/locating-elements.html

search_box = driver.find_element(
    By.CSS_SELECTOR, 
    'input#nav-bb-search'
)
search_box

### typing in our query using selenium

In [None]:
search_term = 'hello kitty'
search_box.send_keys(search_term)

In [None]:
def press_enter(driver):
    """
    Sends the ENTER key to a webdriver instance.
    """
    actions = ActionChains(driver)
    actions.send_keys(Keys.ENTER)
    actions.perform()

### hitting the enter button programmatically

In [None]:
press_enter(driver)

In [None]:
# downloading HTML to file
driver.get("https://www.amazon.com/s?k=hello+kitty")

hk_path = "data/hello_kitty_products.html"
with open(hk_path, "w", encoding='utf-8') as f:
    f.write(driver.page_source)

# driver.get(r'file:' + hk_path)

### load the file in

In [6]:
# THIS IS THE CORRECT URL - FULL PATH, NOT RELATIVE.
driver.get('file:///users/nguyenkim/.dev/lede-2023/class/0718-leon/browser-automation/data/hello_kitty_products.html')
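
The hard-coded path above only works on one machine. A hedged alternative is to let `pathlib` build the absolute `file://` URL from the relative path. `local_file_url` is a hypothetical helper name.

```python
from pathlib import Path


def local_file_url(relative_path):
    """
    Converts a relative path into an absolute file:// URL.
    """
    # resolve() makes the path absolute, as_uri() adds the file:// scheme
    return Path(relative_path).resolve().as_uri()
```

Calling `driver.get(local_file_url("data/hello_kitty_products.html"))` should then load the saved page regardless of where the notebook lives.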


### Parse the products

For each product, let's print the brand name:

Notice we're using `find_elements` (plural). This returns a list of every match, rather than only the first element.

In [7]:
product_tiles = driver.find_elements(
    By.XPATH, 
    './/div[contains(@cel_widget_id, "MAIN-SEARCH_RESULTS-")]' # the contains syntax allows a substring match
)
len(product_tiles)

60

Let's iterate through each product and print the brand name, which is saved as the only header (h2) in the element:

In [8]:
#print first 10 products
for product in product_tiles[:10]:
    print(product.find_element(
        By.TAG_NAME, 'h2'
    ).text)

    

Hello Kitty Rainbow Pullover Hoodie
Lip Smacker Valentine's Day Collection Hello Kitty Lip Balm Tin
ANEIMIAH Stitch Charm Bracelet, Kids Jewelry for Girls Chain Bracelet, Birthday Gifts for Girls- Ohana Means Family
Cannity Hello Kitty Stickers, 50PCS Cute Stickers White Theme Kawaii Cat Stickers for Kids Teens Adults, Vinyl Waterproof Stickers Pack for Laptop Phone Luggage Water Bottles
PaPiJoJo Cute Keychain Kawaii Anime Keychain, Hello Kitty, My Melody,Kuromi,Keroppi, Badtz-Maru, Cinnamoroll, Pompompurin
Wet Brush Original Hello Kitty Detangling Brush - Under My Umbrella - All Hair Types - Ultra-Soft IntelliFlex Bristles Glide Through Tangles with Ease, White, 1 Count
Crocs Unisex-Adult Classic Hello Kitty Clog
Goody x Hello Kitty Ouchless Scrunchie - 6 Count, Assorted - Help Keep Hairs In Place - Hair Accessories to Style With Ease and Keep Your Hair Secured - For All Hair Types - Pain Free
Sanrio Company, Ltd. Hello Kitty Tote Bag Hello Kitty Shopping Bag Gym Bag Hello Kitty Lunch

Although we did this all using Selenium, it's better to save the page source and then parse the saved results in BeautifulSoup, lxml, or whatever parsing software you prefer.

### Annotate the elements we find
Let's find all the ads, and highlight them red on the page.

You can "inject" attributes into elements, including style attributes.

In [193]:
# applying style using function

# def transform(driver, elem, color = 'red'):
#     # setting up initial transition
#     style = f"background-color: {color} !important; "
#                 # "transform: rotate(360deg) !important; "\
#                 # "transition: transform 20s linear;"
#     driver.execute_script(
#         f"arguments[0].setAttribute('style','{style}')", elem
#     )

In [11]:
# testing to see if class gets added

def classed(driver, elem):
    class_to_add = "test"
    driver.execute_script(
            f"arguments[0].setAttribute('class','{class_to_add}')", elem
    )
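
Note that `setAttribute('class', ...)` replaces every class already on the element, which can strip the page's own styling. Below is a sketch of a gentler variant that appends a class through `classList.add` and passes the class name as a script argument instead of formatting it into the JavaScript string. `add_class` is a hypothetical name.

```python
def add_class(driver, elem, class_name="test"):
    """
    Adds `class_name` to `elem` without removing its existing classes.
    """
    # extra positional arguments show up in the script as arguments[1], ...
    driver.execute_script(
        "arguments[0].classList.add(arguments[1]);", elem, class_name
    )
```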

In [50]:
def stain(driver, elem, color = 'red'):
    """
    Injects a style attribute to stain `elem` with `color` (red by default).
    """
    style = f"background-color: {color} !important; "\
                    "transition: all 0.5s linear;"
    driver.execute_script(
            f"arguments[0].setAttribute('style','{style}')", elem
    )

In [66]:
# grabbing our elements

# XPATH for sponsored divs
ads = driver.find_elements(
    By.XPATH, 
    # you can use XPATH to specify the attributes of the children of the node you want...
    './/div[@data-asin and .//a[@aria-label="View Sponsored information or leave ad feedback"]]'
)


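# XPATH for the images inside sponsored divs.
# The //img step has to sit outside the [...] predicate; inside it, the
# expression only filters the divs and still returns divs, not images.
imgs = driver.find_elements(
    By.XPATH, 
    './/div[@data-asin and .//a[@aria-label="View Sponsored information or leave ad feedback"]]//img'
)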

In [63]:
for ad in ads:
    stain(driver, ad)

# for img in imgs:
#     stain(driver, img)

### she spins....she wins
If you want the animation to loop continuously, add a `@keyframes` rule.

In [64]:
def spin_class(driver, elem):
    # adding a class to rotate back and forth
    spin = "rotate"
    driver.execute_script(
            f"arguments[0].setAttribute('class','{spin}')", elem
    )

In [65]:
for ad in ads:
    spin_class(driver, ad)
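
The `rotate` class only has an effect if the page has a CSS rule for it. Here is a minimal sketch, assuming the page defines no such rule: build a `@keyframes` animation and inject it as a `<style>` element. `build_spin_css` and `inject_css` are hypothetical helpers, and the two-second duration is an arbitrary choice.

```python
def build_spin_css(class_name="rotate", seconds=2):
    """
    Returns CSS that makes elements with `class_name` rotate forever.
    """
    return (
        f"@keyframes {class_name}-spin {{ "
        "from { transform: rotate(0deg); } "
        "to { transform: rotate(360deg); } } "
        f".{class_name} {{ animation: {class_name}-spin {seconds}s linear infinite; }}"
    )


def inject_css(driver, css):
    """
    Appends a <style> element containing `css` to the page's <head>.
    """
    driver.execute_script(
        "const style = document.createElement('style');"
        "style.textContent = arguments[0];"
        "document.head.appendChild(style);",
        css,
    )
```

Running `inject_css(driver, build_spin_css())` once should make every element that carries the `rotate` class spin continuously.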

### Get height of document

In [170]:
import pandas as pd

In [171]:
height = driver.execute_script("return document.body.scrollHeight")

In [172]:
height

24060

Get the coordinates and size of each element using the `rect` property.

In [174]:
ad_metadata = []
for elem in ads:
    if elem.is_displayed(): # use this method to only analyze visible elements
        ad_metadata.append(elem.rect)

In [175]:
df = pd.DataFrame(ad_metadata)

In [176]:
df['how_far_down'] = df['y'] / height

In [177]:
df.how_far_down.value_counts()

how_far_down
0.154937    1
0.196222    1
0.307844    1
0.347152    1
0.387291    1
0.428576    1
0.513898    1
0.553206    1
0.592514    1
0.632801    1
Name: count, dtype: int64
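
Since every position is unique, `value_counts` returns one row per ad. A rough alternative view, sketched in plain Python, is to count how many ads fall in each quarter of the page. `count_by_quarter` is a hypothetical helper, and the input is the `how_far_down` column printed above.

```python
def count_by_quarter(fractions):
    """
    Counts values in [0, 0.25), [0.25, 0.5), [0.5, 0.75) and [0.75, 1].
    """
    counts = [0, 0, 0, 0]
    for fraction in fractions:
        # min() keeps a value of exactly 1.0 in the last bucket
        counts[min(int(fraction * 4), 3)] += 1
    return counts


how_far_down = [
    0.154937, 0.196222, 0.307844, 0.347152, 0.387291,
    0.428576, 0.513898, 0.553206, 0.592514, 0.632801,
]
print(count_by_quarter(how_far_down))  # [2, 4, 4, 0]
```

In this run, none of the displayed sponsored results starts in the bottom quarter of the page.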

### Save receipts

In [178]:
# save the page source exactly as the automated browser sees it
source = driver.page_source
with open('data/amazon_selenium_test.html', 'w', encoding='utf-8') as f:
    f.write(source)

In [179]:
# just what's visible
driver.save_screenshot('data/amazon_selenium_test.png')

True

There are ways to take a full-page screenshot, but none of my functions seem to work. Can you take a full-page screenshot?
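
One approach that may work in Chrome is to ask the DevTools protocol for a screenshot clipped to the full content size. This is a sketch only and has not been run against this page. `full_page_screenshot` is a hypothetical helper; `execute_cdp_cmd` is available on Selenium 4 Chromium drivers; and `captureBeyondViewport` is an experimental option of `Page.captureScreenshot` that may behave differently across Chrome versions.

```python
import base64


def full_page_screenshot(driver, path):
    """
    Saves a PNG of the whole page (not just the viewport) to `path`.
    """
    # ask Chrome how large the rendered document is
    metrics = driver.execute_cdp_cmd("Page.getLayoutMetrics", {})
    # older Chrome versions only report `contentSize`
    size = metrics.get("cssContentSize") or metrics["contentSize"]
    result = driver.execute_cdp_cmd(
        "Page.captureScreenshot",
        {
            "format": "png",
            "captureBeyondViewport": True,
            "clip": {
                "x": 0,
                "y": 0,
                "width": size["width"],
                "height": size["height"],
                "scale": 1,
            },
        },
    )
    # the protocol returns the image as a base64-encoded string
    with open(path, "wb") as f:
        f.write(base64.b64decode(result["data"]))
    return path
```

Very tall pages (this one is about 24,000 px) may exceed what Chrome can render in a single image, so treat the output as something to inspect rather than trust.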

### Parsing the results however you like
For me it means using lxml, but you can do the same thing in BeautifulSoup, and I encourage you to do so...

In [200]:
from lxml import etree

In [201]:
with open('data/amazon_selenium_test.html', encoding='utf-8') as f:
    dom = etree.HTML(f.read())

In [204]:
dom

<Element html at 0x114268fc0>

In [205]:
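product_metadata = []
for result in dom.xpath('.//div[contains(@cel_widget_id, "MAIN-SEARCH_RESULTS")]'):
    # this is where you can parse as many fields as you like.
    # some results have fewer than two text nodes under h2, so pad with None
    # to keep the two-name unpacking from raising a ValueError
    texts = result.xpath('.//h2//text()')[:2]
    brand, product_name = (texts + [None, None])[:2]
    product_metadata.append({
        'brand': brand,
        'product_name': product_name
    })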

In [203]:
pd.DataFrame(product_metadata)

Unnamed: 0,brand,product_name
0,Hello Kitty Rainbow Pullover Hoodie,
1,Lip Smacker Valentine's Day Collection Hello K...,
2,"ANEIMIAH Stitch Charm Bracelet, Kids Jewelry f...",
3,"Cannity Hello Kitty Stickers, 50PCS Cute Stick...",
4,"PaPiJoJo Cute Keychain Kawaii Anime Keychain, ...",
5,Wet Brush Original Hello Kitty Detangling Brus...,
6,Crocs Unisex-Adult Classic Hello Kitty Clog,
7,Goody x Hello Kitty Ouchless Scrunchie - 6 Cou...,
8,"Sanrio Company, Ltd. Hello Kitty Tote Bag Hell...",
9,Kerr's Choice Hello Kitty Bag for Girls | Cros...,


### References:
- https://selenium-python.readthedocs.io/locating-elements.html
- https://lxml.de/xpathxslt.html
- https://stackoverflow.com/questions/55796339/selenium-how-to-load-a-local-html-file-on-mac
- https://stackoverflow.com/questions/42900214/how-to-download-a-html-webpage-using-selenium-with-python
- https://stackoverflow.com/questions/13309673/how-to-play-css3-transitions-in-a-loop
- https://stackoverflow.com/questions/16176648/trying-to-do-a-css-transition-on-a-class-change
- https://www.w3schools.com/css/css3_transitions.asp
- https://realpython.com/python-pathlib/
