# Using Selenium for browser automation

Selenium is a tool for automating web browsers. Put simply, anything you usually do by hand (clicking links, filling out forms, scrolling, moving the cursor, or pressing keys) can be automated, which can save you a lot of time.

In my experience, Selenium is most useful for downloading many files that each require manual clicks. Instead of clicking every link, or working through every combination of options in several drop-down menus, we used Selenium to automate the data collection. It is also useful for scraping data that is updated constantly: rather than refreshing the page by hand, you let Selenium refresh it for you.

Main source: https://selenium-python.readthedocs.io/index.html

In [1]:
# !pip install selenium
# !pip install --upgrade selenium
# !pip install webdriver-manager

In [3]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

import time
import pandas as pd
from os.path import expanduser

In [4]:
#Indicate path for downloaded files

path = expanduser("~/Library/CloudStorage/Box-Box/web_scraping_session/")
# path = expanduser("~/Downloads")

### Installing a driver manager

Selenium requires a driver to interface with the chosen browser. You can use whichever browser you prefer; for this session we use Google Chrome.

In [4]:
#Import the Chrome driver service and the driver manager
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

In [5]:
#Add a directory for downloaded files
options = webdriver.ChromeOptions()
preferences = {"download.default_directory":path}
options.add_experimental_option("prefs", preferences)

In [6]:
#Open Chrome browser
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)

### Loading website and downloading data from it 

The following are the most frequently used methods:
- **driver.get("url")** - navigate to a link
- **driver.find_element(By.ID, "id")** - locate an element by ID (elements can also be located by XPATH, CLASS_NAME, CSS_SELECTOR, etc.). Best practice is to use the ID when the element has one, since it is short and should be unique within the page. CSS_SELECTOR and XPATH can also pinpoint a single element, but they may break if the page's developer changes the layout. I choose the locator based on how robust I want my code to be in the future
- **element.click()** - simple click
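
The locator choice described above can be written down as an ordered list of fallbacks. This is a minimal sketch, assuming a hypothetical helper `find_first` (it is not part of Selenium) and placeholder locator values. It relies on `find_elements`, which returns an empty list when nothing matches instead of raising an error.

```python
def find_first(driver, locators):
    """Return the first element matched by any (by, value) pair, else None.

    `find_first` is a hypothetical helper, not part of Selenium. It relies on
    driver.find_elements(), which returns an empty list when nothing matches.
    """
    for by, value in locators:
        matches = driver.find_elements(by, value)
        if matches:
            return matches[0]
    return None


# Usage with a live driver (assumes `driver` and `By` from the cells above;
# the locator values are placeholders):
# element = find_first(driver, [
#     (By.ID, "element_id"),                                    # preferred
#     (By.CSS_SELECTOR, '#parent_id a[class="some_class"]'),    # fallback
# ])
```

Listing the ID first keeps the most stable locator as the default, and the fallbacks are only tried when it finds nothing.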

To find the ID or XPATH of an element, open the page in Google Chrome, right-click the element you want, and choose Inspect. You will see the ID if it exists, and you can copy the element's relative and full XPATHs.

If the element you want lacks an ID and its XPATH is dynamic (some paths change every time the page is refreshed), you can use a CSS_SELECTOR that ties the element to a parent element that does have an ID (reference: https://www.w3schools.com/cssref/css_selectors.asp).

For example, **driver.find_elements(By.CSS_SELECTOR, '#element_id a[class="class_of_elements"]')** finds all elements with tag *a* and class=class_of_elements that descend from the element with id=element_id.
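
To make the anatomy of such a selector explicit, here is a small sketch that assembles it from its three parts. The helper name `descendant_selector` and its `exact` flag are my own illustration, not a Selenium API. In CSS, `[class="x"]` matches the whole class attribute, while `[class*="x"]` matches any class attribute that contains `x`.

```python
def descendant_selector(parent_id, tag, class_value, exact=True):
    """Build a CSS selector for `tag` elements that descend from `#parent_id`.

    With exact=True the class attribute must equal `class_value`; with
    exact=False it only has to contain `class_value` as a substring.
    """
    operator = "=" if exact else "*="
    return f'#{parent_id} {tag}[class{operator}"{class_value}"]'


selector = descendant_selector("element_id", "a", "class_of_elements")
print(selector)  # #element_id a[class="class_of_elements"]

# Usage with a live driver (assumes `driver` and `By` from the cells above):
# links = driver.find_elements(By.CSS_SELECTOR, selector)
```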

In [7]:
#Open the link with data
driver.get("https://edgar.jrc.ec.europa.eu/country_profile")
time.sleep(2)

In [8]:
#This element will not be found until you scroll to the section
#element_error = driver.find_element(By.CSS_SELECTOR, '#chart1 g[class*="highcharts-no-tooltip highcharts-button highcharts-contextbutton"]')

In [9]:
#We need to scroll to the section we want. (I am cheating by scrolling to the "Air and Toxic Pollutants" heading,
#which makes the page load the data we want. There are alternatives, such as scrolling by XY coordinates.)
element = driver.find_element(By.XPATH, "/html/body/div[3]/div/div[7]/h2")
actions = ActionChains(driver)
actions.scroll_to_element(element).perform()
time.sleep(3)
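
The comment above mentions alternative ways to scroll. A minimal sketch of two of them, using `driver.execute_script` to run JavaScript in the page; the helper names are hypothetical, and the pixel offsets you would need depend on the page layout.

```python
def scroll_into_view(driver, element):
    """Scroll `element` into the viewport using the browser's scrollIntoView()."""
    driver.execute_script(
        "arguments[0].scrollIntoView({block: 'center'});", element
    )


def scroll_by_pixels(driver, x, y):
    """Scroll the window by an (x, y) offset in pixels."""
    driver.execute_script("window.scrollBy(arguments[0], arguments[1]);", x, y)


# Usage with a live driver (assumes `driver` and `element` from the cell above):
# scroll_into_view(driver, element)
# scroll_by_pixels(driver, 0, 800)   # 800 is an arbitrary example offset
```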

In [10]:
#Locate and click a drop-down list
open_list = driver.find_element(By.CSS_SELECTOR, '#chart1 g[class*="highcharts-no-tooltip highcharts-button highcharts-contextbutton"]')
open_list.click()
time.sleep(2)
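
The fixed `time.sleep` pauses above work, but they either wait longer than needed or not long enough on a slow connection. Selenium's `WebDriverWait` polls a condition until it is met. Below is a sketch, assuming a hypothetical condition `at_least` that waits until enough menu items exist before one is picked by index.

```python
def at_least(by, value, minimum):
    """Build a wait condition that succeeds once `minimum` elements match.

    `at_least` is a hypothetical helper, not part of Selenium. WebDriverWait
    accepts any callable that takes the driver and returns a truthy value.
    """
    def condition(driver):
        found = driver.find_elements(by, value)
        return found if len(found) >= minimum else False
    return condition


def wait_for(driver, condition, timeout=10):
    """Poll `condition` until it is truthy or `timeout` seconds have passed."""
    # Imported here so the helpers can be defined before Selenium is loaded.
    from selenium.webdriver.support.ui import WebDriverWait
    return WebDriverWait(driver, timeout).until(condition)


# Usage with the live driver from the cells above:
# items = wait_for(driver, at_least(By.CSS_SELECTOR,
#                                   '#chart1 li[class=highcharts-menu-item]', 7))
# items[6].click()
```

If the condition is never met, `until` raises a `TimeoutException`. Selenium also ships ready-made conditions in `selenium.webdriver.support.expected_conditions`, such as `element_to_be_clickable`.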

In [11]:
#Click the menu item (index 6 in the list) that downloads the data as a CSV file
download_csv = driver.find_elements(By.CSS_SELECTOR,'#chart1 li[class=highcharts-menu-item]')[6]
download_csv.click()
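
The click returns immediately, so the file may not be on disk yet when the next cell tries to read it. Here is a sketch of a polling helper that uses only the standard library. `wait_for_download` is a hypothetical name, and the check assumes Chrome's convention of giving unfinished downloads a `.crdownload` extension.

```python
import os
import time


def wait_for_download(directory, filename, timeout=30, poll=0.5):
    """Wait until `filename` exists in `directory` and no partial Chrome
    download (*.crdownload) remains, then return the full path.

    Raises TimeoutError if that has not happened after `timeout` seconds.
    """
    target = os.path.join(directory, filename)
    deadline = time.monotonic() + timeout
    while True:
        partial = [f for f in os.listdir(directory) if f.endswith(".crdownload")]
        if os.path.exists(target) and not partial:
            return target
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{filename} was not downloaded within {timeout} s")
        time.sleep(poll)


# Usage (assumes `path` and `pd` from the cells above):
# csv_path = wait_for_download(path, "co2-emissions-by-sector.csv")
# pd.read_csv(csv_path).head()
```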

In [12]:
pd.read_csv(path+"/co2-emissions-by-sector.csv").head()

| | Category | Buildings | Other industrial combustion | Other sectors | Power Industry | Transport | Total CO2/cap |
|---|---|---|---|---|---|---|---|
| 0 | 1970 | 2926555000.0 | 4559329000.0 | 1851515000.0 | 3695783000.0 | 2796015000.0 | 4.28 |
| 1 | 1971 | 2939668000.0 | 4293773000.0 | 1875260000.0 | 3797794000.0 | 2876216000.0 | 4.18 |
| 2 | 1972 | 3059888000.0 | 4427175000.0 | 1991057000.0 | 4061311000.0 | 3046428000.0 | 4.31 |
| 3 | 1973 | 3121674000.0 | 4702212000.0 | 2128932000.0 | 4388431000.0 | 3222686000.0 | 4.47 |
| 4 | 1974 | 3039179000.0 | 4706438000.0 | 2108541000.0 | 4462033000.0 | 3191823000.0 | 4.38 |


In [13]:
#Select a different grouping from the drop-down menu (two ways to pick an option)
group = Select(driver.find_element(By.ID, "select-group"))
group.select_by_value("sec")            #by the option's value attribute
group.select_by_visible_text("Sector")  #by the text shown in the menu
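
When a page has several drop-down menus and you want every combination of their options, the `options` attribute of `Select` lists what is available. The helper names below are hypothetical. The sketch also assumes the menus are independent; if choosing one option reloads another menu, you would need to locate the menus again after each selection.

```python
from itertools import product


def option_values(select):
    """Return the `value` attribute of every option in a Select-like object."""
    return [option.get_attribute("value") for option in select.options]


def all_combinations(selects):
    """Return an iterator with one tuple of option values per combination."""
    return product(*(option_values(s) for s in selects))


def apply_combination(selects, values):
    """Set each drop-down to its value from one combination."""
    for select, value in zip(selects, values):
        select.select_by_value(value)


# Usage with live Select objects (only `group` comes from the cell above;
# any further menus are whatever the page provides):
# menus = [group]
# for combo in list(all_combinations(menus)):
#     apply_combination(menus, combo)
#     ...  # download or scrape the data for this combination
```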

**Exercise:**

**Step I.** In the Greenhouse Gases chart, group by Sector and choose Transport as the sector.

**Step II.** In the drop-down menu, select "View data table" (you do not need to locate the menu again).

**Step III.** Scrape the table using BeautifulSoup.
