# Selenium
Selenium automates the extraction of content (scraping) from emails and Skype conversations by recreating the scroll-click-copy-paste routine, something that would otherwise have to be done by an obedient employee.

Messages from Outlook and Skype are formatted into a tabular form:

<img src="images_inkscape/tabular_form_selenium.png" style="width: 1000px;">

***

<font color=red><b>The notebook parallels the <a href="./adv04_selenium_skypeOutlook.pdf">PDF Report</a></b></font>. Use the report as the main learning source, and this notebook for practical examples.

# 1 - Packages
Using selenium requires:
- the `selenium` package for creating a Chrome instance and communicating with it;
- the `bs4` <code>(BeautifulSoup)</code> package for efficient text extraction;
- `Chrome` itself, which must be installed;
- `chromedriver` to launch a Chrome instance that can be controlled through Python.

The downside is that a full Chrome window genuinely has to open, eating into your RAM. Expect your fans to produce an airplane noise ✈


# 2 - Standard call to create a Chrome <font color=red><b>driver</b></font>
This is the standard call that has to start every `selenium` session.

In [1]:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def load_driver():
    # 1 - set options and capabilities for Chrome
    capabilities = {
        'chromeOptions': {
            'useAutomationExtension': False,
            'args': ['--disable-extensions'],
        }
    }

    chrome_options = Options()
    chrome_options.add_experimental_option("prefs", {
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True,
    })

    # 2 - create a driver instance with defined options
    #     (executable_path and desired_capabilities are Selenium 3 keyword arguments)
    driver = webdriver.Chrome(executable_path="./chromedriver",
                              desired_capabilities=capabilities,
                              options=chrome_options)

    driver.maximize_window()
    return driver

# 3 - Accessing webpages
Once the driver has been created successfully, a page is opened with `driver.get(url)`.

In [2]:
# 1 - load driver (the window is maximized inside load_driver)
driver = load_driver()

# 2 - load url
driver.get("https://en.wikipedia.org/wiki/Demon_core")

# 3 - wait briefly, then close the window
time.sleep(2)
driver.close()

# 4 - Structure of HTML
All of selenium's functionality rests on interacting with the `html` of the webpage, which can be viewed with
<center><font color=red size=4><b>Right Click → Inspect → Element Selector</b></font></center>

Below is the html for the selection in the image:
```html


<div id="toc" class="toc">
	<input type="checkbox" role="button" id="toctogglecheckbox" class="toctogglecheckbox" style="display:none">
	<div class="toctitle" lang="en" dir="ltr">
		<h2>Contents</h2>
		<span class="toctogglespan"><label class="toctogglelabel" for="toctogglecheckbox"></label></span>
	</div>
	<ul>
		<li class="toclevel-1 tocsection-1"><a href="#Manufacturing_and_early_history"><span class="tocnumber">1</span> <span class="toctext">Manufacturing and early history</span></a></li>
		<li class="toclevel-1 tocsection-2"><a href="#First_incident"><span class="tocnumber">2</span> <span class="toctext">First incident</span></a></li>
		<li class="toclevel-1 tocsection-3">
			<a href="#Second_incident"><span class="tocnumber">3</span> <span class="toctext">Second incident</span></a>
			<ul>
				<li class="toclevel-2 tocsection-4"><a href="#Medical_studies"><span class="tocnumber">3.1</span> <span class="toctext">Medical studies</span></a></li>
			</ul>
		</li>
	</ul>
</div>
```

***
<img src="images_inkscape/html_example.png" style="width: 1000px;">

***

As seen, `html` has a layered structure. The job is to traverse through these layers, which is done with either:
- `xPaths` to grab elements for clicking, filling, or any other interaction
- `BeautifulSoup` to extract text
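
The layering can be made explicit with a few lines of standard-library Python. The sketch below is only an illustration (it uses neither selenium nor BeautifulSoup); `LayerRecorder` is a hypothetical helper name, and the HTML is a trimmed version of the snippet above.

```python
from html.parser import HTMLParser


class LayerRecorder(HTMLParser):
    """Records (depth, tag) for every opening tag it meets."""

    # tags that never get a closing tag, so they must not deepen the nesting
    VOID_TAGS = {"input", "img", "br", "hr", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.layers = []

    def handle_starttag(self, tag, attrs):
        self.layers.append((self.depth, tag))
        if tag not in self.VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS:
            self.depth -= 1


html = (
    '<div id="toc" class="toc">'
    '<div class="toctitle"><h2>Contents</h2></div>'
    '<ul><li><a href="#First_incident">First incident</a></li></ul>'
    '</div>'
)

recorder = LayerRecorder()
recorder.feed(html)

# one line per tag, indented by how deep it sits in the page
for depth, tag in recorder.layers:
    print("    " * depth + tag)
```

Each extra level of indentation in the printout is one more layer that an xPath or a BeautifulSoup search has to step through.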

# 5 - Extraction with `xPaths`: use when elements need to be interacted with or fully loaded

An `xPath` is a string that directs an HTML search. Consult the `xPath` chapter of the PDF for more details.

<img src="images_inkscape/xpath_anatomy_example.png">

It is built using
- <font color=red><b>Nodes</b></font> e.g. `/div` `/img` `/span` `/button`
- <font color=red><b>Attributes</b></font> e.g. `/div[@id="bannerCnt"]`
- <font color=red><b>Indices (starting from 1)</b></font> e.g. `/div[2]` or `/div[last()-1]`

A wildcard `*` or `@*` matches any node or attribute.
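
These building blocks can be tried without opening a browser. Python's standard-library `xml.etree.ElementTree` understands a limited subset of xPath, which is enough to sketch nodes, attributes, indices and the wildcard on a trimmed, well-formed copy of the table-of-contents HTML from section 4. This is only an illustration: ElementTree paths are relative (hence the leading `.`), whereas selenium evaluates full xPath inside the browser.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the table-of-contents HTML
html = """
<div id="toc" class="toc">
  <ul>
    <li class="toclevel-1 tocsection-1"><a href="#Manufacturing_and_early_history">Manufacturing and early history</a></li>
    <li class="toclevel-1 tocsection-2"><a href="#First_incident">First incident</a></li>
    <li class="toclevel-1 tocsection-3"><a href="#Second_incident">Second incident</a></li>
  </ul>
</div>
""".strip()
root = ET.fromstring(html)  # root is the outer <div>

# Nodes: step down ul -> li -> a
all_links = [a.text for a in root.findall("./ul/li/a")]

# Attributes: keep only the <a> whose href has the requested value
by_attribute = root.find(".//a[@href='#First_incident']").text

# Indices start from 1, so li[1] is the first list item
first = root.find("./ul/li[1]/a").text
second_to_last = root.find("./ul/li[last()-1]/a").text

# Wildcard: * stands in for any node name (here it matches the <ul>)
via_wildcard = [a.text for a in root.findall("./*/li/a")]

print(all_links)
print(by_attribute, "|", first, "|", second_to_last)
```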

***
## Performing the search
With an xPath built, a single <font color=red>webElement</font> is located with
```python
from selenium.webdriver.common.by import By

# find_element_by_xpath(xpath) is the older Selenium 3 spelling of the same call
webelement = driver.find_element(By.XPATH, xpath)
```
and multiple elements are located with
```python
webelementS = driver.find_elements(By.XPATH, xpath)
```
<font color=red><b>Webelements</b></font> are needed for clicking and other forms of interaction.

In [3]:
########################################
# 🍏 Search elements with xPaths
########################################
from selenium.webdriver.common.by import By

driver = load_driver()
driver.get("https://en.wikipedia.org/wiki/Demon_core")

# 1 - get the title (a single webelement)
xpath_title = "//h1[@id='firstHeading']"
title = driver.find_element(By.XPATH, xpath_title)
print("⦿ Title webelement: ", title)

# 2 - get the links on side of page (a list of webelements)
xpath_links = "//div[@id='p-navigation' and @role='navigation']/div/ul/li/a"
links = driver.find_elements(By.XPATH, xpath_links)
print("⦿ Sidebar links:\t", links)

# 3 - get the top link in sidebar (xPath indices start from 1, not 0)
xpath_top = "//div[@id='p-navigation' and @role='navigation']/div/ul/li[1]/a"
top = driver.find_element(By.XPATH, xpath_top)
print("⦿ Top sidebar link:\t", top.text)

⦿ Title webelement:  <selenium.webdriver.remote.webelement.WebElement (session="c9d6e23000f0586e0bbd43e51feb3bbc", element="71748a84-996b-4134-adfd-639d77fe9e06")>
⦿ Sidebar links:	 [<selenium.webdriver.remote.webelement.WebElement (session="c9d6e23000f0586e0bbd43e51feb3bbc", element="00a01135-46f5-4741-b7d0-b63e9c32a6aa")>, <selenium.webdriver.remote.webelement.WebElement (session="c9d6e23000f0586e0bbd43e51feb3bbc", element="7e39d823-1323-4949-9d53-a7568f633a19")>, <selenium.webdriver.remote.webelement.WebElement (session="c9d6e23000f0586e0bbd43e51feb3bbc", element="1ba645ec-faba-488d-aa46-a1906f129db0")>, <selenium.webdriver.remote.webelement.WebElement (session="c9d6e23000f0586e0bbd43e51feb3bbc", element="82e30490-f04e-4759-a250-f09f550a16a1")>, <selenium.webdriver.remote.webelement.WebElement (session="c9d6e23000f0586e0bbd43e51feb3bbc", element="3669d44b-7a5e-4c84-9dd7-232da11be9c1")>, <selenium.webdriver.remote.webelement.WebElement (session="c9d6e23000f0586e0bbd43e51feb3bbc", ele

# 6 - Interaction with Webelements
The most common ways of interacting with webElements are:
- `webelement.click()`
- `webelement.send_keys("What to send")`

In [3]:
########################################
# 🍏 Using the extracted webelements
########################################
from selenium.webdriver.common.by import By

# 1 - click on "main page" link
xpath_main = "//div[@id='p-navigation' and @role='navigation']/div/ul/li[1]/a"
driver.find_element(By.XPATH, xpath_main).click()

# 2 - fill search box
xpath_search = "//*[@id='searchInput']"
search = driver.find_element(By.XPATH, xpath_search)
search.send_keys("1934")

In [None]:
########################################
# 📔 xpaths to automate ebay
########################################
# 1 - create a selenium driver and navigate to the supplied url
url = "https://www.ebay.com"

# 2 - use xpaths to search for a user supplied object
item = input("💰 I want to buy: ")

# 3 - use xpaths to tick the "buy it now" box and click on the top object

# 4 - click buy it now and stop there

# 7 - Extraction with `BeautifulSoup` - use when text needs to be extracted from the webpage

`xPaths` are great for interacting with elements on the page; however, practically any kind of text extraction is easier and less confusing with `BeautifulSoup`.

There is a single preparation stage with BeautifulSoup: convert the HTML into the internal format used by `BeautifulSoup`.

```python
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
```
***

Once formatting is done, the `find` and `find_all` functions can be used to extract elements
```python
single_structure = soup.find("div", attrs={"role": "option"})
multiple_structures = soup.find_all("span", attrs={"role": "option",
                                                   "aria-label": "Reading Pane"})
```

<font color=red><b>Each return is another BeautifulSoup object that can be searched, and thus a recursive call can be set up to navigate down the HTML layers</b></font>

***

Text extraction is run on the elements obtained
```python
# multiple_structures is the result of the find_all call in the example above
unpacked_text = [i.get_text().strip() for i in multiple_structures]
```

Extracting an attribute of a tag (here its `href`) works in a similar way
```python
unpacked_tag = structure.get("href")
```
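
Put together, the calls above can be tried on a static string without opening a browser. This is a minimal sketch, assuming `bs4` is installed; the HTML is a trimmed version of the Wikipedia table of contents from section 4, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div id="toc" class="toc">
  <ul>
    <li class="toclevel-1"><a href="#First_incident"><span class="toctext"> First incident </span></a></li>
    <li class="toclevel-1"><a href="#Second_incident"><span class="toctext"> Second incident </span></a></li>
  </ul>
</div>
"""

# preparation stage: convert the HTML into BeautifulSoup's internal format
soup = BeautifulSoup(html, "html.parser")

# find returns the first match, which can itself be searched again
toc = soup.find("div", attrs={"id": "toc"})
links = toc.find_all("a")

# text extraction and attribute extraction on the matched elements
titles = [a.get_text().strip() for a in links]
hrefs = [a.get("href") for a in links]

print(titles)
print(hrefs)
```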

In [4]:
from bs4 import BeautifulSoup

def extract_elements(soup, search):
    """
    __ Parameters __
    [soup] soup:                object to extract
    [1D-list] search:           one dict per layer, outermost layer first:
                                {"node":        [str] node,
                                 "attributes":  [dict] of attribute - value (None for no filter)}
    __ Description __
    searches through the supplied layers and returns the soup objects
    __ Return __
    [1D-soup] objects that fit criteria
    """
    # 1 - start at the top; with a single layer this is already the final layer
    unpacked_elements = soup

    # 2 - unpack intermediate layers, descending from the previous match each time
    for i in search[:-1]:
        unpacked_elements = unpacked_elements.find(i['node'], attrs=i['attributes'] or {})
        if unpacked_elements is None:
            # an intermediate layer is missing, so nothing can match
            return []

    # 3 - unpack the final layer
    return unpacked_elements.find_all(search[-1]['node'], attrs=search[-1]['attributes'] or {})

In [6]:
import os
import re
import requests
from html import unescape

########################################
# 🍏 Example of instagram scraping (if you have account)
########################################
hashtag = "🍑🍑🍑🍑"

# 1 - load hashtag
driver = load_driver()
driver.get('https://www.instagram.com/explore/tags/' + hashtag)

# 2 - load webpage into BeautifulSoup
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# 3 - extract images located in body -> span -> a
search = [{"node": "body",
           "attributes": None},
          {"node": "span",
           "attributes": None},
          {"node": "a",
           "attributes": None}]
image_structures = extract_elements(soup, search)

# 4 - search for "img" and the following src="https://scontent-hkg3-1.cdninstagram.co..."
image_url = []
for i in image_structures:
    mg = re.search(r'img.*src="(.+?)"', str(i))
    if mg:
        # str(i) writes "&" as "&amp;", so turn the match back into a plain url
        image_url.append(unescape(mg.group(1)))

# 5 - request the first 10 images and save them to disk
os.makedirs("instagram_images", exist_ok=True)
for idx, i in enumerate(image_url[:10]):
    r = requests.get(i)
    with open(f"instagram_images/{idx+1}.jpg", 'wb') as fout:
        fout.write(r.content)
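
Step 4 relies on a regular expression to pull the `src` out of each structure. Below is a minimal standard-library sketch of that step on a hand-written string (the `example.com` URLs are placeholders, not real Instagram addresses). It also shows a possible refinement: a pattern that stays inside a single tag with `[^>]*` picks up every image, whereas `.*` can skip ahead to the last `src` when one string holds several images.

```python
import re

# Hand-written stand-in for str(structure); the URLs are placeholders
markup = (
    '<a href="/p/one/"><img alt="one" src="https://example.com/1.jpg"></a>'
    '<a href="/p/two/"><img alt="two" src="https://example.com/2.jpg"></a>'
)

# Pattern from step 4: the greedy .* runs on to the last src= in the string
greedy = re.search(r'img.*src="(.+?)"', markup).group(1)

# [^>]* cannot cross the closing ">" of a tag, so each <img> is matched on its own
per_tag = re.findall(r'<img[^>]*\ssrc="(.+?)"', markup)

print(greedy)
print(per_tag)
```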

In [None]:
########################################
# 📔 Wikipedia scraping
########################################
# 1 - load driver and direct to wikipedia page
url = "https://en.wikipedia.org/wiki/Main_Page"

# 2 - scrape all the words that are on the page
#   - split them by spaces into a list
word_list = None

# 🗑 Removing junk
# 3 - remove anything in the list after "References" (regexp search for references)
for i in word_list:
    pass

# 4 - keep only a-z words (once again, regexp)
for i in word_list:
    pass

# 8 - Wait functionality
Waiting on elements is a very important procedure, since certain elements of the page may not be loaded in time for clicking or reading operations. This is handled by `WebDriverWait`:

```python
timeout = 50 # timeout given in seconds
web_waiter = WebDriverWait(driver , timeout)
```

***

The most basic check involves checking for the presence of an xpath:
```python
web_waiter.until(EC.presence_of_element_located((By.XPATH, xpath)))
```

***
Most likely, however, a separate class will be defined to wait for a specific event to occur.
- <font color=red><b>Must have a function called</b> <code>__call__(driver)</code> that takes <code>driver</code> as a parameter</font>
- Returns `True` once a condition is satisfied
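
To see why `__call__` matters: `WebDriverWait.until` keeps calling the condition with the driver until the result is truthy or the timeout runs out. The sketch below imitates that loop with the standard library only, so it runs without a browser; `poll_until`, `FakeDriver` and `TextChanged` are hypothetical stand-ins written for this illustration, not selenium classes.

```python
import time


def poll_until(driver, condition, timeout=2.0, poll=0.05):
    """Simplified stand-in for WebDriverWait(driver, timeout).until(condition)."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition(driver)  # this is where __call__(driver) runs
        if result:
            return result
        if time.monotonic() > deadline:
            raise TimeoutError("condition was not satisfied in time")
        time.sleep(poll)


class FakeDriver:
    """Hypothetical driver whose field only shows new text on the third read."""

    def __init__(self):
        self.reads = 0

    def read_text(self):
        self.reads += 1
        return "old" if self.reads < 3 else "new"


class TextChanged:
    """Condition object: True once the text is non-empty and differs from old_text."""

    def __init__(self, old_text):
        self.old_text = old_text

    def __call__(self, driver):
        current = driver.read_text().strip()
        return current != "" and current != self.old_text


driver = FakeDriver()
changed = poll_until(driver, TextChanged("old"))
print(changed, driver.reads)
```

Because the condition is an ordinary object, the values it needs (here `old_text`) are stored in `__init__`, and only the driver is passed in on each poll.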

In [None]:
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class wait_for_change():
    """
    waits on the text of the specified xpath to change from its old value

    __Use-case__
    a field is due to be updated, but is suffering from some lag

    __Usage__
    WebDriverWaitINSTANCE.until(wait_for_change(XPATH, OLD_TEXT))
    """

    def __init__(self, xpath, old_text):
        self.xpath = xpath
        self.old_text = old_text

    def __call__(self, driver):
        try:
            # 1 - get the content on the chosen xpath
            current_text = driver.find_element(By.XPATH, self.xpath).text.strip()

            # 2 - check that the value is non-empty and has changed
            return (current_text != "") and (current_text != self.old_text)

        # 3 - the element was replaced mid-read: report "not yet" so the waiter retries
        except StaleElementReferenceException:
            return False

In [None]:
########################################
# 📔 Integrate the wait check into Google Translate for a set of words
########################################
translate_list = ["The", "industrial", "revolution", "has", "been", "a", "disaster", "for", "human", "race"]
timeout = 50 # timeout given in seconds
translation_list = []

# 1 - load driver and move it to google translate
url = "https://translate.google.com/"

# 2 - create a waiter
waiter = WebDriverWait(driver , timeout)

# 3 - set the input language to english and output to chinese
pass

for i in translate_list:
    # 4 - go through translate list and write the word to input box
    pass

    # 5 - wait for translation to occur using the above "wait_for_change" class
    # waiter.until(wait_for_change("🍄XPATH🍄", "🍄OLDVALUE🍄"))

    # 6 - extract the translation and add it to list

For a specific example of Skype and Outlook bots in action, <a href="./material_selenium_notebook_EXAMPLES.ipynb"><b>see this example notebook</b></a>.