# Web Scraping with Selenium


In this notebook, I will show how to use Selenium to search Google and parse the results.


**Table of Contents**
1. [Part 1: Test Selenium](#part1)
2. [Part 2: Get to know Selenium commands](#part2)
3. [Part 3: Searching Google (with an example)](#part3)
4. [Part 4: Do your own search](#part4)

<a id="part1"></a>
## Part 1: Test Selenium

In this part, we will make sure that Selenium is installed and can communicate with the Firefox browser.

In [5]:
import selenium 

If you are getting an error, such as: ModuleNotFoundError: No module named 'selenium', then you'll need to install the selenium package.

In [3]:
pip install selenium

Note: you may need to restart the kernel to use updated packages.


**IMPORTANT STEP:** Check the slides for Day 3 to set up the "path" where Geckodriver is stored; otherwise, the code below will raise an error.

Now we create a Firefox instance and tell it to open a web page:

In [17]:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://cs.wellesley.edu')

If the test was successful, you should be able to see a Firefox browser open, and then display the Wellesley CS homepage.

**Note:** In order to continue with the commands in Part 2, DO NOT close the Firefox window that was opened above. Without the browser, the commands will fail.

<a id="part2"></a>

## Part 2: Get to know Selenium commands

One way to see which methods and attributes an object provides is to use Python's built-in function `dir`:

In [5]:
print(dir(browser))

['CONTEXT_CHROME', 'CONTEXT_CONTENT', '__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_authenticator_id', '_check_if_window_handle_is_current', '_file_detector', '_get_cdp_details', '_is_remote', '_mobile', '_shadowroot_cls', '_switch_to', '_unwrap_value', '_web_element_cls', '_wrap_value', 'add_cookie', 'add_credential', 'add_virtual_authenticator', 'back', 'bidi_connection', 'capabilities', 'caps', 'close', 'command_executor', 'context', 'create_web_element', 'current_url', 'current_window_handle', 'delete_all_cookies', 'delete_cookie', 'delete_downloadable_files', 'download_file', 'error_handler', 'execute', 'execut
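
The full `dir` output is long, and much of it consists of internal names that start with an underscore. A minimal sketch of filtering it down to the public names follows. `publicNames` is a hypothetical helper (not part of Selenium), and since it works on any Python object, the example uses a plain string as a stand-in for `browser`.

```python
def publicNames(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith('_')]

# Stand-in example with a string; in this notebook you would call
# publicNames(browser) while the Firefox window is still open.
names = publicNames("hello")
print(len(names), "public names, for example:", names[:5])
```

Calling `publicNames(browser)` should leave only entries such as `current_url`, `title`, and `close`, which are easier to scan than the raw list.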

Here are some useful attributes of the `browser` object that we might want to use:

In [6]:
browser.current_url

'https://www.wellesley.edu/cs'

In [7]:
browser.title

'Computer Science | Wellesley College'

In [8]:
browser.page_source[500:1000] # don't print the entire page, might be too big

'>\n  <meta name="viewport" content="width=device-width, initial-scale=1.0">\n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n<meta property="og:title" content="Computer Science">\n<meta property="og:description" content="\n\n.flex-parent {\n  display: flex;\n  flex-wrap:wrap;\n  justify-content: center;\n}\n.button {\n  border: 1px solid black;\n  color: white;\n  background-color: #002776;\n  padding: 15px 35px;\n  t">\n<meta property="og:image" content="https://www.wellesley.edu/sites/de'

Selenium has its own set of commands for accessing the DOM of a page. Here are some examples:

In [18]:
from selenium.webdriver.common.by import By # By holds the locator strategies (ID, NAME, CLASS_NAME, ...) used to find elements

# Find the element with id="navbar"
try:
    print(browser.find_element(By.ID, 'navbar').text)
except Exception as e:
    print(e)

MyWellesley
Give
ABOUT
ADMISSION & FINANCIAL AID
ACADEMICS
CAMPUS LIFE
ATHLETICS
NEWS
EVENTS
ADMINISTRATION
ALUMNAE


In [10]:
# Find the element with a given class name
try:
    print(browser.find_element(By.CLASS_NAME, 'md-left-sidebar').text)
except Exception as e:
    print(e)

Computer Science
Curriculum
Faculty
Research
News and Events
Resources
Diversity, Equity, and Inclusion
Student Opportunities
Beyond Wellesley
Contact Us
Sohie Lee
Department Co-Chair
Tel: 781.283.3123
Email: slee@wellesley.edu
Orit Shaer
Department Co-Chair
Tel: 781.283.3093
Email: oshaer@wellesley.edu
Narine Emdjian
Academic Administrator/Grants Coordinator
Tel: 781.283.3147
Email: ne100@wellesley.edu


Let's close the browser instance:

In [11]:
browser.close()

<a id="part3"></a>

## Part 3: Searching Google

We'll open a new browser instance and visit Google.

In [22]:
browser = webdriver.Firefox()
browser.get('https://google.com')

We want to access the search box. We know that it is named "q", so we can look it up by its name:

In [23]:
browser.find_element(By.NAME, "q")

<selenium.webdriver.remote.webelement.WebElement (session="96912ee9-6fc2-47b2-8c9d-57c590044fc1", element="d397cd8f-413a-484c-85bb-2c7b5497fb73")>

This result shows us that there is an element named "q". We will assign this element to a variable so that we can interact with it:

In [24]:
inputBox = browser.find_element(By.NAME, "q")
inputBox.send_keys("wellesley college")

Notice how the browser copied the phrase into the search box and Google showed the suggested searches. If we want the search to start, we can send an enter event via code:

In [25]:
from selenium.webdriver.common.keys import Keys
inputBox.send_keys(Keys.ENTER)

This submits the query phrase, and then the results page loads.

It's possible to perform other operations with the keyboard. Here is a [list of keyboard keys](https://www.selenium.dev/selenium/docs/api/py/webdriver/selenium.webdriver.common.keys.html) that Selenium recognizes.

Now that we know how to search Google, here are some simple tasks to try.

### Example: Artist Popularity

Let's take a list of **five** famous artists (Lady Gaga, Rihanna, Taylor Swift, Beyonce, and Britney Spears) and find the number of results (hits) that Google returns for each of them. We will use these numbers to rank the artists. In a sense, the number of hits can serve as a rough signal of an artist's popularity: more hits means more people have created pages mentioning the search phrase.

**Good to know**

- The element of the search page that contains the number of results has id="result-stats".
- The result usually looks like this: 'About 6,120,000 results (1.25 seconds) ', so we will extract the number and turn it into an integer before doing the ranking.
- It takes some time between sending a query and the page loading, so the program should wait between calls. The simplest way to wait is Python's `time.sleep(N)`.
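
A fixed `time.sleep(N)` either waits too long or not long enough. Selenium provides explicit waits through its `WebDriverWait` class for this purpose. The sketch below shows the underlying polling idea in plain Python. `waitUntil` is a hypothetical helper written for illustration, and the default timeout and polling interval are arbitrary choices.

```python
import time

def waitUntil(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Exceptions raised by condition() are treated as "not ready yet".
    Raises TimeoutError if nothing truthy is returned before the timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            value = condition()
            if value:
                return value
        except Exception:
            pass  # e.g. the element is not on the page yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

With an open `browser`, a call such as `waitUntil(lambda: browser.find_element(By.ID, "result-stats"))` would return the element as soon as it appears, instead of always sleeping for a fixed time.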

In [14]:
import time
from selenium.common.exceptions import NoSuchElementException

def getResults(query):
    """Given a query, open a browser instance, search Google for the 
    phrase, then get the result-stats phrase from the page and return it.
    Returns an empty string if the result-stats element is not found.
    """
    browser = webdriver.Firefox()
    try:
        browser.get('https://google.com')
        inputBox = browser.find_element(By.NAME, "q")
        inputBox.clear() # so that in between searches it starts empty
        inputBox.send_keys(query)
        inputBox.send_keys(Keys.ENTER)
        time.sleep(1) # wait for the page to load
        try:
            result = browser.find_element(By.ID, "result-stats").text
        except NoSuchElementException:
            # Occasionally, Google Search shows something else at the top of the page.
            print(f"Couldn't find result for {query}")
            result = ""
    finally:
        browser.quit() # end the session even if an error occurred above
    return result

Next, a helper function that extracts the hit number from the result phrase.

In [15]:
def processHitNumber(phrase):
    """Assumes that the phrase has the following format:
    About 39,600,000 results (0.75 seconds)
    and extracts the number of results.
    """
    hitNumber = phrase.split()[1]
    return int(hitNumber.replace(',', ''))
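
`processHitNumber` relies on the number being the second word of the phrase. If the phrase ever arrived without the leading "About" (an assumption about how the format might vary, not something observed in this notebook), `split()[1]` would pick up the wrong word. A slightly more tolerant sketch uses a regular expression to grab the first run of digits and commas. `parseHitCount` is a hypothetical name, and the function assumes that commas are the thousands separator.

```python
import re

def parseHitCount(phrase):
    """Return the first comma-grouped number in phrase as an int,
    or None if the phrase contains no digits."""
    match = re.search(r'\d[\d,]*', phrase)
    if match is None:
        return None
    return int(match.group().replace(',', ''))

print(parseHitCount('About 39,600,000 results (0.75 seconds)'))  # 39600000
print(parseHitCount('6,120,000 results (1.25 seconds)'))         # 6120000
print(parseHitCount(''))                                         # None
```

Returning `None` for an empty phrase lets the caller skip failed searches without a separate check before parsing.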

We'll iterate over a list of artists and get the results.

In [26]:
artistsAndHits = [] # to store the pairs of (artistName, numberOfHits)

for name in ['Lady Gaga', 'Rihanna', 'Taylor Swift', 
             'Beyonce', 'Britney Spears']:
    
    results = getResults(name)
    
    print(name, '|', results) 
    if results:
        hitnumber = processHitNumber(results)
        artistsAndHits.append((name, hitnumber))
    
    
sorted(artistsAndHits, key=lambda item: item[1], reverse=True)

Lady Gaga | About 152,000,000 results (0.40 seconds) 
Rihanna | About 271,000,000 results (0.54 seconds) 


InvalidSessionIdException: Message: WebDriver session does not exist, or is not active
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:191:5
InvalidSessionIDError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:446:5
assert.that/<@chrome://remote/content/shared/webdriver/Assert.sys.mjs:485:13
assert.session@chrome://remote/content/shared/webdriver/Assert.sys.mjs:37:4
despatch@chrome://remote/content/marionette/server.sys.mjs:315:19
execute@chrome://remote/content/marionette/server.sys.mjs:289:16
onPacket/<@chrome://remote/content/marionette/server.sys.mjs:262:20
onPacket@chrome://remote/content/marionette/server.sys.mjs:263:9
_onJSONObjectReady/<@chrome://remote/content/marionette/transport.sys.mjs:494:20


<a id="part4"></a>
## Part 4: Do your own search

Make a list of universities and colleges (ones that you applied to or were interested in before Wellesley), repeat the search as before, and find out which of them is the most "popular" according to Google.

**Optional:** Do the results correlate with other metrics about these institutions (e.g. rankings, endowment, etc.)? How would you go about finding out?

In [27]:
schoolsandhits = [] # to store the pairs of (schoolName, numberOfHits)

for name in ['Yale', 'Umich', 'UChicago', 
             'UMD', 'Pomona College']:
    
    results = getResults(name)
    
    print(name, '|', results) 
    if results:
        hitnumber = processHitNumber(results)
        schoolsandhits.append((name, hitnumber))
    
    
sorted(schoolsandhits, key=lambda item: item[1], reverse=True)

Yale | About 631,000,000 results (0.64 seconds) 
Umich | About 456,000,000 results (0.66 seconds) 
UChicago | About 429,000,000 results (0.46 seconds) 
UMD | About 94,500,000 results (0.40 seconds) 
Pomona College | About 23,400,000 results (0.33 seconds) 


[('Yale', 631000000),
 ('Umich', 456000000),
 ('UChicago', 429000000),
 ('UMD', 94500000),
 ('Pomona College', 23400000)]

In [30]:
fullschoolsandhits = [] # to store the pairs of (schoolName, numberOfHits)

for name in ['Yale University', 'University of Michigan', 'University of Chicago', 
             'University of Maryland', 'Pomona College']:
    
    results = getResults(name)
    
    print(name, '|', results) 
    if results:
        hitnumber = processHitNumber(results)
        fullschoolsandhits.append((name, hitnumber))
    
    
sorted(fullschoolsandhits, key=lambda item: item[1], reverse=True)

Yale University | About 415,000,000 results (0.52 seconds) 
University of Michigan | About 1,790,000,000 results (0.44 seconds) 
University of Chicago | About 2,900,000,000 results (0.57 seconds) 
University of Maryland | About 1,320,000,000 results (0.50 seconds) 
Pomona College | About 20,700,000 results (0.29 seconds) 


[('University of Chicago', 2900000000),
 ('University of Michigan', 1790000000),
 ('University of Maryland', 1320000000),
 ('Yale University', 415000000),
 ('Pomona College', 20700000)]

In [None]:
# To answer the optional question, I would look up rankings (probably from US News) and endowments (from each school's own site).
# An interesting finding: the number of results depends on which name you search (full university name vs. the more common short name), and this changes the resulting ranking.
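
One way to quantify how much the ranking depends on the search phrase is a rank correlation. The sketch below computes Spearman's rho between the hit counts from the two runs above, pairing each short name with its full name. `ranks` and `spearman` are hypothetical helpers written for this note, and they use the simple formula that is valid only when there are no ties. The same function could compare hit counts against another metric (for example, a ranking or an endowment figure) once those numbers have been collected.

```python
def ranks(values):
    """Rank values from largest (rank 1) to smallest; assumes no ties."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    dSquared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * dSquared / (n * (n ** 2 - 1))

# Hit counts copied from the two runs above, in the same school order:
# Yale, Michigan, Chicago, Maryland, Pomona
shortNameHits = [631000000, 456000000, 429000000, 94500000, 23400000]
fullNameHits = [415000000, 1790000000, 2900000000, 1320000000, 20700000]

rho = spearman(shortNameHits, fullNameHits)
print(round(rho, 2))  # 0.3
```

A value of 0.3 indicates only weak agreement between the two rankings, which matches the observation that Yale drops from first to fourth place when full names are used.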