# 3.14.35 Advanced Selenium

In [1]:
# import required libraries and modules
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
from selenium.webdriver.chrome.service import Service
import os

## Pagination

Pagination is used in contexts beyond webscraping and broadly refers to splitting up a large piece of content into smaller sections (i.e., pages). For example, when you search something on Google the result list (which could be thousands or millions of results) is split up into smaller 'pages' that contain 20 results each. 

In webscraping, we often need to be able to write our script so that it can deal with pagination and navigate between multiple pages of results or data. In the `Script1_Pagination.py` file we'll find a script that allows us to use pagination to retrieve all the results from the [hockey teams webpage](https://www.scrapethissite.com/pages/forms/). Let's explore this in more detail below: 

In [2]:
### DRIVER SETUP

# initialise the driver (and open up a browser window)
driver = webdriver.Chrome()

# open up a specific web page
driver.get("https://www.scrapethissite.com/pages/forms/")


### DEFINE FUNCTION FOR COLUMN LIST CREATION

# since we will need to repeat this process for the other 5 variables, 
# let's create a function
def create_col(class_name):
    s_column = driver.find_elements(By.CLASS_NAME, class_name)
    column = [col.text for col in s_column]
    return column


As with the previous lesson, our script opens by setting up our Chrome webdriver and navigating to the page we want to scrape. We also define the `create_col` function that we saw in the previous lesson. Previously, when we were only scraping one page of data, we used `create_col` to extract each column of data from the table. Since we now need to do that for every page on the website, we'll create another function called `scrape_page` that does all of this in a single function call.

In [3]:
### DEFINE FUNCTION FOR PAGE SCRAPING

def scrape_page(page):
    # click on the link for the requested page number
    driver.find_element(By.LINK_TEXT, str(page)).click()
    
    # create a list for each column of the table
    names = create_col('name')
    year = create_col('year')
    wins = create_col('wins')
    losses = create_col('losses')
    goals_for = create_col('gf')
    goals_against = create_col('ga')
    
    # create a temporary DataFrame to store the current page
    df_teams_tmp = pd.DataFrame(
        {'names': names, 
         'year': year,
         'wins': wins, 
         'losses': losses, 
         'goals_for': goals_for, 
         'goals_against': goals_against
         })
    
    return df_teams_tmp

Our `scrape_page` function takes `page` as a parameter and uses it to find the link with the matching text and click on it. If you look at the [webpage we're scraping](https://www.scrapethissite.com/pages/forms/), you'll notice that each number in the page navigation at the bottom is actually a link that navigates to a specific page when clicked. Our function clicks the relevant page link based on the argument we pass in, then uses our `create_col` function to scrape the data from that page and wrap it all up in a tidy DataFrame. 


We could then use this function to scrape a specific page of data if we wanted to, as in the line below:

In [4]:
df_teams = scrape_page(1)

However, in our case we want to scrape all of the pages! In order to do this, we first need a way of determining exactly how many pages of data there are on this webpage. If we look at the HTML code for our webpage, we'll see that the page navigation bar is contained within an unordered list `<ul>` tag with a class name of `pagination`. Let's start by first extracting this segment of HTML code.

In [6]:
s_pages = driver.find_element(By.CLASS_NAME, "pagination")

If we look at the text of this HTML element, we'll see that it's a long string of all the page numbers, separated by the new line character `\n`. We can use the `split` method to create a list of the page numbers by splitting on the new line characters.

In [12]:
s_pages.text

'1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n»'

In [13]:
s_pages.text.split('\n')

['1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21',
 '22',
 '23',
 '24',
 '»']

Great! We've nearly got our complete list of numbers. However, you'll notice that the very last element of our list isn't a number but a double arrow `»`, which is the button on the navigation bar that moves to the next page. This won't work with our `scrape_page` function, so we'll adjust the list slightly to remove the last element. 

In [14]:
# List of strings
pages = s_pages.text.split('\n')[:-1]

['1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21',
 '22',
 '23',
 '24']
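Slicing off the last element assumes the arrow is always the final entry in the list. A slightly more defensive sketch is to keep only the entries that consist entirely of digits. The helper name `extract_page_numbers` is hypothetical and not part of the original script.

```python
def extract_page_numbers(pagination_text):
    """Return the numeric page labels found in the pagination bar's text."""
    labels = pagination_text.split("\n")
    # Keep only labels made up entirely of digits, dropping arrows such as '»'
    return [label.strip() for label in labels if label.strip().isdigit()]


sample_text = "1\n2\n3\n»"
print(extract_page_numbers(sample_text))  # ['1', '2', '3']
```

With a live driver this could be called as `extract_page_numbers(s_pages.text)`.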

We can now loop through this list of numbers and use them as arguments for our `scrape_page` function. We'll create an empty DataFrame to hold all the data, and on each iteration we'll append (with `pd.concat`) the scraped page of data to it.

In [17]:
# Initialise an empty dataframe to contain our data
df_teams = pd.DataFrame([])

# Loop through all of the pages and use the scrape page function to append the data to df_teams
for p in pages: 
    # append (with pd.concat()) the temp DataFrame to the empty one
    df_teams = pd.concat([df_teams, scrape_page(p)])
    
df_teams.reset_index(inplace=True, drop=True)

driver.close()

This is just one example of how we could paginate through a website. Another common approach is to use a while loop that checks whether a "next page" button is present and keeps scraping until that button no longer appears (i.e., when you reach the final page). Some websites also use a highly structured URL with a page parameter (e.g., page=1, page=2 etc.), in which case we can update the URL directly.
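The "next page button" approach can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the script from this lesson: `scrape_until_last_page`, `scrape_current_page` and `next_locator` are hypothetical names, and the function only relies on the driver having a `find_elements` method.

```python
def scrape_until_last_page(driver, scrape_current_page, next_locator, max_pages=100):
    """Scrape the current page, then follow the 'next' link until it disappears."""
    results = []
    for _ in range(max_pages):  # safety cap so a misbehaving page cannot loop forever
        results.append(scrape_current_page(driver))
        # find_elements returns an empty list (rather than raising) when nothing matches
        next_links = driver.find_elements(*next_locator)
        if not next_links:
            break  # no 'next' link left: this was the final page
        next_links[0].click()
    return results
```

A call might look like `scrape_until_last_page(driver, my_scraper, (By.CSS_SELECTOR, "a[aria-label='Next']"))`, where `my_scraper` is your own function and the selector is an assumption about the site's markup that you would need to check in the page source.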

## Dealing with Ajax

Ajax stands for "**A**synchronous **Ja**vaScript And **X**ML" and is a technique for transferring data between a server-side database and a client-side browser. It allows for faster, more dynamic and more interactive web pages. 


<img src="https://miro.medium.com/max/1400/1*6PasVp89PTHbDvYJ-Zq8ZQ.png" width="600" style="display: block; margin: 0 auto">



Most of the webpages we have seen so far have been loaded using **synchronous requests** - that is, all of the HTML content for the webpage is loaded in one go the first time we send a request to the web url. With **asynchronous requests**, a request is sent to the web server and then the current page is updated without having to reload the page or be redirected to another web url. This introduces some problems for webscraping as we need a way to wait until the ajax request has been completed and the new data has been loaded onto the webpage before we can go ahead and start scraping it. 

Let's start with setting up our Chrome webdriver and navigating to this [link](https://www.scrapethissite.com/pages/ajax-javascript/) from the scrapethissite.com website

In [2]:
driver = webdriver.Chrome()
driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")


As you'll see on the webpage, the page contains a series of clickable links for different years. When we click these links, we see a brief loading wheel before a table with the top films for that year is displayed. 

Behind the scenes, when we click on one of the year links the website sends an Ajax request to its database to retrieve the relevant data and then loads it onto the screen. If we were to run our code normally to try and retrieve this data, we would run into some problems (see the code below!)

In [31]:
def create_col(class_name):
    s_column = driver.find_elements(By.CLASS_NAME, class_name)
    column = [col.text for col in s_column]
    return column


driver.find_element(By.LINK_TEXT, "2015").click()
film_titles = create_col("film-title")

print(film_titles)

[]


As you can see, what we end up with is an empty list. What's happening here is that our code is running faster than the website can load the relevant data, so when our code executes the line `film_titles = create_col("film-title")`, that part of the page hasn't actually been loaded in yet, and so we end up with an empty list. What we need is a way for our code to wait until the relevant data has been loaded before it continues with the rest of the code.

Fortunately, selenium provides a number of built in functions that can make our code wait until certain conditions have been met. If we check the selenium documentation on [waits](https://selenium-python.readthedocs.io/waits.html), we'll see that selenium can make our code wait for certain events. There are a number of different [built in checks](https://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.support.expected_conditions) that we can do before our code continues.

Depending on the webpage you're scraping, a different check might be required. For example, we could check that the table elements are currently displayed on the page before trying to scrape them. In our case we'll do the reverse. Whenever we click one of the year buttons, a spinning wheel is briefly displayed, so we'll first check that the spinning wheel is no longer visible before proceeding to scrape the page.

In [5]:
# Import some additional libraries
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [None]:
driver.find_element(By.LINK_TEXT, "2015").click()
WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.ID, "loading")))
film_titles = create_col("film-title")

film_titles

The above code block uses `WebDriverWait` to make our code wait for up to a set number of seconds (in this case 10). We combine this with the `until` method, which repeatedly checks a condition and lets the code continue as soon as the condition is met. In this case it waits until the loading wheel is no longer shown on screen. If the condition is still not met when the time limit is reached, selenium raises a `TimeoutException`. We can use this inside a loop that clicks each year button in turn to retrieve the full set of data. 
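Conceptually, an explicit wait is a polling loop: check a condition, and if it is not yet true, sleep briefly and check again until a deadline passes. The sketch below is a simplified illustration of that idea in plain Python. It is not selenium's actual implementation, and `wait_until` is a hypothetical name; selenium's own version passes the driver to the condition and raises its own `TimeoutException` rather than the built-in `TimeoutError` used here.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # condition met: stop waiting immediately
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)  # pause briefly before checking again
```

The key point is that the timeout is an upper bound: if the condition becomes true after half a second, the wait ends after roughly half a second, not after the full 10.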

In [6]:
# Set up browser
driver = webdriver.Chrome()
driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")

def create_col(class_name):
    s_column = driver.find_elements(By.CLASS_NAME, class_name)
    column = [col.text for col in s_column]
    return column

# List of years to loop through
years = ["2010", "2011", "2012", "2013", "2014", "2015"]

#Initialise empty dataframe
df_films = pd.DataFrame([])

# Loop through years, click the relevant button, wait for the content to load
# then create a temporary dataframe, which we then append to the main one
for year in years:
    driver.find_element(By.LINK_TEXT, year).click()
    WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.ID, "loading")))
    
    film_titles = create_col("film-title")
    film_noms = create_col("film-nominations")
    film_awards = create_col("film-awards")
    
    temp_df = pd.DataFrame({
    "year": year,
    "title": film_titles,
    "nominations": film_noms,
    "awards": film_awards
    
    })

    
    df_films = pd.concat([df_films, temp_df])
    
driver.close()

# Display the first rows (as the last expression in the cell so the notebook shows it)
df_films.head()

| | year | title | nominations | awards |
|---|---|---|---|---|
| 0 | 2010 | The King's Speech | 12 | 4 |
| 1 | 2010 | Inception | 8 | 4 |
| 2 | 2010 | The Social Network | 8 | 3 |
| 3 | 2010 | The Fighter | 7 | 2 |
| 4 | 2010 | Toy Story 3 | 5 | 2 |


## Dealing with Infinite Scrolls

<img src="https://cdn-images-1.medium.com/max/1600/0*ADvLjzLzT7SELZDs.gif" width="600" style="display: block; margin: 0 auto">

We've probably all seen infinite scrolling before - most (if not all) of the big social media platforms use this technique to keep you engaged with their content. The basic idea is that all of the content of the webpage is not loaded in one go. Instead a small selection of content is loaded to the screen and as you scroll towards the end of the selection of content more is automatically loaded to the page, leading you to spend hours and hours scrolling through infinite social media posts!

Again, this poses another problem for our webscraping skills - if we wanted to scrape all the data from a page that uses infinite scrolling, we need a way to simulate this scrolling behaviour. Fortunately, selenium can do this for us.

#### An Example with US Politicians

The website [GovTrack](https://www.govtrack.us/) is run by a non-profit organisation that tries to make the US Congress and politics more transparent, open and accessible. It contains heaps of data and information on US politicians, such as their voting history and campaign funding. Today we'll use our webscraping skills to create a list of all of the [Democrat representatives](https://www.govtrack.us/congress/members/current#current_role_party=Democrat).


Let's start with the usual steps of initialising our ChromeDriver and navigating to the webpage.

In [8]:
driver = webdriver.Chrome()
driver.get("https://www.govtrack.us/congress/members/current#current_role_party=Democrat")

The first thing you should notice is another problem we haven't discussed yet: **pop-ups**. These can be handled in selenium the same way as in real life, by clicking the big X button. 

In [9]:
driver.find_element(By.XPATH, '//*[@id="followus_modal"]/div/div/div[1]/button').click()
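Pop-ups do not always appear on every visit, and `find_element` raises an exception when the element is missing. A minimal sketch of a more forgiving approach is shown below; `dismiss_popup` is a hypothetical helper, and it assumes only that the driver offers `find_elements`, which returns an empty list when nothing matches.

```python
def dismiss_popup(driver, locator):
    """Click the pop-up's close button if present; return True if a click happened."""
    # find_elements returns an empty list when no element matches, so nothing is raised
    buttons = driver.find_elements(*locator)
    if not buttons:
        return False  # no pop-up found: nothing to do
    buttons[0].click()
    return True
```

It could be called as `dismiss_popup(driver, (By.XPATH, '//*[@id="followus_modal"]/div/div/div[1]/button'))`. Note that a button which exists in the HTML but is not yet visible may still fail on click, so combining this with an explicit wait may be needed.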

If we take a look at the Democrat Representatives webpage, we'll see that it uses an infinite scroll mechanic to display all of the Democrat representatives. This is where we'll need to get our selenium script to mimic scrolling behaviour. 

The way we do this is by executing some javascript commands within our web browser. The basic process is like this:

- Find what the maximum scroll height of the webpage is.
- Set our current scroll position to the maximum scroll height of the page.
- At this point some more content will load and the scroll height of the webpage will change. 
- Find out what the new height of the webpage is and repeat the process until the webpage doesn't get any bigger (or until you reach a specified page height)

For our selenium script to work, we'll make use of two key javascript lines:

- `document.body.scrollHeight` - This gets the maximum scroll height of the page
- `window.scrollTo()` - This function allows us to simulate the actual scrolling

The first thing to do is to find out what the current scroll height is. To run this through our automated web browser, we'll use the `execute_script` method from selenium:

In [15]:
driver.execute_script("return document.body.scrollHeight")

2016


The next thing to do is to move our scroll bar to the maximum scroll height, which we can do with the `scrollTo()` JavaScript function:

In [16]:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")

At this point, the page has updated and loaded additional content, which means the scroll height has now changed:

In [17]:
driver.execute_script("return document.body.scrollHeight")

3376

What we need then is a loop that continues scrolling until the scroll height no longer increases (i.e., we've reached the bottom of the page). To do this we'll make use of a while loop, and on each iteration compare the new scroll height to the previous scroll height. We'll also see another (simpler) way to get our code to wait until things have loaded, and that's with the `time` module.

In [None]:
import time

In [18]:
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    
    # Add in a brief wait for the content to load 
    time.sleep(2)
    
    # Calculate the new scroll height and compare it with the previous one
    new_height = driver.execute_script("return document.body.scrollHeight")
    
    # If both the last and new height are the same, we've reached the bottom and can exit the loop
    if new_height == last_height:
        break
    
    # Update the value of the last_height with the new height ready to begin the loop again
    last_height = new_height
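The same loop can be wrapped in a reusable function with a cap on the number of scrolls, which is useful on pages that never stop loading content. This is a sketch under stated assumptions: `scroll_to_bottom`, `pause` and `max_scrolls` are hypothetical names, and the function only relies on the driver's `execute_script` method.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_scrolls=50):
    """Scroll down until the page height stops growing or max_scrolls is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        scrolls += 1
        time.sleep(pause)  # give newly requested content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height unchanged: assume we have reached the bottom
        last_height = new_height
    return scrolls
```

A fixed `pause` is a trade-off: too short and slow-loading content looks like the end of the page, too long and the scrape takes unnecessarily long.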

Perfect! We've now managed to scroll to the bottom of the page and can proceed with scraping all of the Democrat names. First, we'll narrow the HTML content down to the container that holds the results.

In [20]:
results = driver.find_element(By.CLASS_NAME, "results")

From this we'll create a result set of all the links (the `<a>` tags), and then use list comprehensions to extract the names and their associated URLs.

In [32]:
# Create the result set
people = results.find_elements(By.TAG_NAME, "a")

In [33]:
# Extract names and links
names = [name.text for name in people]
links = [link.get_attribute("href") for link in people]
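A results container can also hold links that have no visible text (for example, a link wrapped around an image), which would show up as empty names. Whether that happens on this particular page is an assumption that has not been checked here. A minimal sketch of a filter is below, with `extract_named_links` as a hypothetical helper that relies only on each element's `text` attribute and `get_attribute` method.

```python
def extract_named_links(elements):
    """Return (text, href) pairs for link elements that have both visible text and an href."""
    pairs = []
    for element in elements:
        text = element.text.strip()
        href = element.get_attribute("href")
        if text and href:  # skip links with no visible text or no target
            pairs.append((text, href))
    return pairs
```

Calling `extract_named_links(people)` gives a list of pairs that can be passed to `pd.DataFrame(..., columns=["name", "info"])`.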

In [37]:
# Finally, wrap everything up in a dataframe
df = pd.DataFrame({
    "name": names,
    "info": links})

df.head()

| | name | info |
|---|---|---|
| 0 | Adams, Alma | https://www.govtrack.us/congress/members/alma_... |
| 1 | Aguilar, Pete | https://www.govtrack.us/congress/members/pete_... |
| 2 | Allred, Colin | https://www.govtrack.us/congress/members/colin... |
| 3 | Auchincloss, Jake | https://www.govtrack.us/congress/members/jake_... |
| 4 | Baldwin, Tammy | https://www.govtrack.us/congress/members/tammy... |


In [38]:
driver.close()