# Intro to webscraping with Selenium

Hi everyone! 👋 In this notebook, we will begin learning how to "scrape" information from websites using the popular Selenium library. Selenium has bindings for several languages, including Java and Python; we will use the Python version.

But why would you want to scrape anything from the internet in the first place? Web scraping gives you access to novel data, which is typically very up to date. (Compare data scraped from Reddit with employment surveys done by the Bureau of Labor Statistics.)

To use Selenium, you need what's called a "webdriver". Below, I will use the one for Chrome, called ChromeDriver. First, find the version of Chrome installed on your computer. Then go to https://chromedriver.chromium.org/downloads and download the ChromeDriver that matches that Chrome version. Make sure to also select the build that matches your operating system: win32 for Windows users, mac_arm64 for Mac users with an Apple silicon chip, mac_64 for Macs with an Intel chip, and linux64 for Linux users. After you download it, unzip it. You will use the path to the unzipped ChromeDriver file to run it in your Python session.
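
The operating-system choice above can be written down as a small lookup. This is only a sketch: the helper name `chromedriver_platform` is hypothetical, it knows just the four build labels listed above, and it does not download anything.

```python
import platform


def chromedriver_platform(system=None, machine=None):
    """Return the ChromeDriver build label for an OS / CPU combination.

    Hypothetical helper: it only knows the four labels mentioned in the text.
    """
    # Fall back to the current machine when no arguments are given
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()

    if system == "windows":
        return "win32"
    if system == "linux":
        return "linux64"
    if system == "darwin":  # macOS reports itself as "Darwin"
        # Apple silicon Macs report an ARM machine type such as "arm64"
        return "mac_arm64" if "arm" in machine else "mac_64"
    raise ValueError(f"No ChromeDriver label known for {system!r}")


print(chromedriver_platform())
```

Calling it with no arguments reports the label for the machine running the notebook, which you can compare against the download you picked.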

In [1]:
import pandas as pd
# the next two will be useful when we want to mimic a human user by introducing random "waits" while scraping
# both are part of Python's standard library, so there is nothing to install
import time
import random
# this is for calling Selenium's webdriver (pip install selenium if you don't have it)
from selenium import webdriver

In [2]:
# this is how to boot-up the webdriver
# btw, Selenium just recently had an update, so some of the syntax online might be out of date
driver = webdriver.Chrome(service=webdriver.ChromeService('/Users/bilene/Downloads/chromedriver-mac-x64/chromedriver')) # make sure to use your path!

In [3]:
# this below takes the webdriver to a website
# for fun, let's go to a forum post from 2002 on Edmunds.com, a website about cars.
driver.get('https://forums.edmunds.com/discussion/2864')

Take a look at the website and see its structure. There is a forum post, followed by comments. We might be interested in scraping all comments made on this post. (you can see there are 435 pages.)

What we need at this stage is an XPath search for each username you see on the page. You can think of this process as finding a pattern that captures the usernames.

To get a username's XPath, right click on a username in a comment and hit Inspect. That pulls up the developer tools. If you right click and hit Inspect again, the inspector tab will take you to exactly where that username sits in the HTML structure of the page. Next, right click on the highlighted element in the inspector tab, select Copy, then Copy XPath.

Here is the XPath I got. You pass it to the `find_element()` method to return that particular username's element.

In [4]:
userid_element = driver.find_element('xpath','//*[@id="Comment_1726631"]/div/div[2]/div[1]/span[1]/a[2]')

To get the text for the username:

In [5]:
userid = userid_element.text
userid

'merc1'

Similarly, for date:

In [6]:
user_date = driver.find_element('xpath','//*[@id="Comment_1726631"]/div/div[2]/div[2]/span/a/time')

Notice that in the HTML chunk there is a `title` attribute, which we can exploit:

In [7]:
user_date.get_attribute('title')

'March 24, 2002 9:54PM'
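
The `title` value comes back as a plain string. If you want to sort or filter by date later, one option is to parse it with the standard library. This is a minimal sketch that assumes every title follows the same format as the value above and that month names are in English.

```python
from datetime import datetime

# Example value copied from the output above
raw_title = "March 24, 2002 9:54PM"

# %B = full month name, %d = day, %Y = 4-digit year,
# %I = hour on a 12-hour clock, %M = minutes, %p = AM/PM
posted_at = datetime.strptime(raw_title, "%B %d, %Y %I:%M%p")

print(posted_at.isoformat())  # 2002-03-24T21:54:00
```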

Now, let's get the comment itself:

In [8]:
user_comment = driver.find_element('xpath','//*[@id="Comment_1726631"]/div/div[3]/div/div[1]')

In [9]:
user_comment.text

'I personally think that with a few tweaks the C320 could also sit at the top of this group. It\'s still more of a entry-level luxury sedan than Mercedes would lead you to believe. The C320 "Sport" needs to have it\'s suspension retuned to accept 17 inch tires without a harsh ride. MB also needs to stop forcing metal trim and black leather on everyone that wants the sport package. The C240\'s 6-speed wouldn\'t hurt either if offered on the C320, though it needs to be "fixed" first.\n\nThat said, my favorite of the group is (gasp!) the new A4 3.0. The A4 easily has the best interior of any sub-40K car, combine that with stunning looks and a body that is as rigid as any ever made and you have my personal favorite. The single thing I don\'t like about the A4 is the grey trim around the bottom of the car. This trim doesn\'t match any of the body colors, as it should be body color anyway. This is an excellent car and in my opinion the best there is under 40K. The upcoming A4 Convertible, Co

So far, we have been grabbing info on a single comment, but we need info from all comments. To get it, we need to exploit a pattern. I noticed that each comment's `id` has the same structure: `Comment_` followed by some numbers (as seen above with `Comment_1726631`).

We can use the method `find_elements()` (notice the plural in `elements`) with the XPath `contains()` option. This returns every element in the HTML whose `id` contains `Comment_`, which are the comments left by the users.

In [10]:
ids = driver.find_elements('xpath',"//*[contains(@id, 'Comment_')]")

However, `find_elements()` returns a list. So we need to loop over it to grab the `id` attribute:

In [11]:
comment_ids=[]
for i in ids:
    comment_ids.append(i.get_attribute('id'))

In [12]:
# The first comment's id
comment_ids[0]

'Comment_1726631'
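
One thing to keep in mind is that `contains()` matches any `id` with `Comment_` somewhere in it, not only `Comment_` followed by numbers. If a page ever had other elements with such ids, a stricter filter on the collected ids could drop them. The sketch below is an assumption about how you might do that; the helper name `keep_comment_ids` and the two non-comment ids in the sample are made up for illustration.

```python
import re

# Matches "Comment_" followed only by digits, e.g. "Comment_1726631"
COMMENT_ID = re.compile(r"Comment_\d+")


def keep_comment_ids(candidate_ids):
    """Keep only the ids that look like an actual comment id."""
    # `c and ...` also skips None or empty ids
    return [c for c in candidate_ids if c and COMMENT_ID.fullmatch(c)]


# The second and third ids are hypothetical examples of false matches
sample = ["Comment_1726631", "Comment_form", "AddComment_button"]
print(keep_comment_ids(sample))  # ['Comment_1726631']
```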

Now that we have ids for all the comments on the page, we can loop over them to grab username, date, and comment info we need:

In [13]:
# scraping a single page
comments=[]
for x in comment_ids:
    userid_element = driver.find_element('xpath','//*[@id="' + x + '"]/div/div[2]/div[1]/span[1]/a[2]')
    userid = userid_element.text
    
    user_date = driver.find_element('xpath','//*[@id="' + x + '"]/div/div[2]/div[2]/span/a/time')
    date = user_date.get_attribute('title')
    
    user_comment = driver.find_element('xpath','//*[@id="' + x + '"]/div/div[3]/div/div[1]')
    comment = user_comment.text
    
    comments.append([userid,date,comment])

In [14]:
df_single = pd.DataFrame(comments,columns=['userid','date','comment'])
df_single

| | userid | date | comment |
|---|---|---|---|
| 0 | merc1 | March 24, 2002 9:54PM | I personally think that with a few tweaks the ... |
| 1 | fredvh | March 24, 2002 11:06PM | I am debating a new purchase and these two are... |
| 2 | blueguydotcom | March 25, 2002 9:02AM | Great handling, RWD, excellent engine and the ... |
| 3 | hungrywhale | March 25, 2002 3:04PM | And no manual tranny. That may not matter to y... |
| 4 | riez | March 25, 2002 4:44PM | One beauty of BMW 3 Series is that there are s... |
| 5 | blueguydotcom | March 26, 2002 12:20PM | good grief, so you wait 9 months for the manua... |
| 6 | hungrywhale | March 26, 2002 1:02PM | I'll give it a fair shot when the manual comes... |
| 7 | blueguydotcom | March 26, 2002 1:45PM | I understand it's not about the speed. I reall... |
| 8 | hungrywhale | March 26, 2002 4:00PM | It sounds like you know why I won't even put i... |
| 9 | riez | March 26, 2002 6:00PM | Is there such a thing as an automatic-only tru... |
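
The single-page loop builds each XPath by gluing strings together with `+`, and the same three patterns get repeated every time. One way to keep them in a single place is a small builder function. This is just a sketch; the name `comment_xpaths` is hypothetical, and the patterns are the ones copied from the inspector earlier, so they only hold while the page layout stays the same.

```python
def comment_xpaths(comment_id):
    """Build the three XPaths used in this notebook for one comment id."""
    base = f'//*[@id="{comment_id}"]'
    return {
        "userid": f"{base}/div/div[2]/div[1]/span[1]/a[2]",
        "date": f"{base}/div/div[2]/div[2]/span/a/time",
        "comment": f"{base}/div/div[3]/div/div[1]",
    }


paths = comment_xpaths("Comment_1726631")
print(paths["userid"])
```

Inside the loop you could then pass `paths["userid"]`, `paths["date"]`, and `paths["comment"]` to `driver.find_element('xpath', ...)`.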


You might want to write a "comment id grabber" function to have your code more organized:

In [15]:
def get_comment_ids():
    # this is a function that gets comment ids that appear on a single page
    ids = driver.find_elements('xpath',"//*[contains(@id, 'Comment_')]")
    comment_ids=[]
    for i in ids:
        comment_ids.append(i.get_attribute('id'))
    return comment_ids

So far so good! Now, let's consider how we can scale this up to multiple pages. (remember there are 435 of them!!)

One idea is to loop over the website link itself. Many websites follow a structure where the second page's link ends in p2, the third's in p3, and so on. This is not guaranteed, but if there is such a structure, you can exploit it. It turns out Edmunds follows this structure, so we can do:

In [16]:
# scraping multiple pages, idea 1
ls = ['p1','p2','p3']

comments = []
for k in ls:
    print(k) # just so you know which page the crawler is on
    driver.get('https://forums.edmunds.com/discussion/2864/general/x/entry-level-luxury-performance-sedans/' + k)
    
    # wait for a random duration of time between 2 and 3 seconds before going to the next page
    time.sleep(random.uniform(2,3))
    
    # get comment ids first
    comment_ids = get_comment_ids()
    
    # scraping a single page
    for x in comment_ids:
        userid_element = driver.find_element('xpath','//*[@id="' + x + '"]/div/div[2]/div[1]/span[1]/a[2]')
        userid = userid_element.text

        user_date = driver.find_element('xpath','//*[@id="' + x + '"]/div/div[2]/div[2]/span/a/time')
        date = user_date.get_attribute('title')

        user_comment = driver.find_element('xpath','//*[@id="' + x + '"]/div/div[3]/div/div[1]')
        comment = user_comment.text

        comments.append([userid,date,comment])
    

p1
p2
p3


Notice I used a sleep function above before grabbing the comment ids. This serves two purposes:

1. The webdriver is super fast, and it can actually ask for the comment information before the website has had a chance to load it, which might result in data loss.
2. Maybe more importantly, the website may block you (don't worry, this is usually only temporary!) if your activity resembles a bot more than a human. Of course, you should check the website's terms of use to make sure you are not breaking any rules or laws! (Web scraping is still a pretty gray area. Recent court decisions have tended to favor scraping for research rather than for commercial purposes. Given that we are scraping for research here, we should be OK!)
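
For the first purpose, a fixed sleep is a guess about how long the page needs. An alternative is to keep checking until the elements actually show up, and give up after a time limit. Below is a generic sketch of that polling idea in plain Python; the helper name `wait_until` and its default values are my own choices, not part of Selenium. (Selenium also ships its own explicit-wait tool, `WebDriverWait`, in `selenium.webdriver.support.ui`.)

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met in time")
        time.sleep(poll)


# With the notebook's `driver`, the condition could be (not run here):
# wait_until(lambda: driver.find_elements('xpath', "//*[contains(@id, 'Comment_')]"))
# find_elements() returns an empty list, which is falsy, until a match exists.
```

This does not replace the random sleep for the second purpose: you would still pause between pages so the crawler does not hammer the website.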

In [17]:
df_multi = pd.DataFrame(comments,columns=['userid','date','comment'])
df_multi

| | userid | date | comment |
|---|---|---|---|
| 0 | merc1 | March 24, 2002 9:54PM | I personally think that with a few tweaks the ... |
| 1 | fredvh | March 24, 2002 11:06PM | I am debating a new purchase and these two are... |
| 2 | blueguydotcom | March 25, 2002 9:02AM | Great handling, RWD, excellent engine and the ... |
| 3 | hungrywhale | March 25, 2002 3:04PM | And no manual tranny. That may not matter to y... |
| 4 | riez | March 25, 2002 4:44PM | One beauty of BMW 3 Series is that there are s... |
| ... | ... | ... | ... |
| 145 | linard | September 27, 2002 6:01PM | Well, in any case, we don't know the exact fig... |
| 146 | cybersol | September 27, 2002 6:19PM | Actually, stiffness mainly determines how much... |
| 147 | jagboyxtype | September 27, 2002 10:14PM | Actually, this isn't the X-TYPE we have here i... |
| 148 | linard | September 29, 2002 9:58AM | I honestly cannot remember a car I've been in ... |


Looks good! We have all the info from multiple pages. You can imagine creating a list of numbers from 1 to 435, concatenating each with 'p', and looping over all those links to grab the comments from all 435 pages.
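
The list of page links described above can be built in one line. This is a small sketch; the function name `page_urls` is hypothetical, and the base link is the one used in the multi-page loop earlier.

```python
BASE_URL = (
    "https://forums.edmunds.com/discussion/2864/general/x/"
    "entry-level-luxury-performance-sedans/"
)


def page_urls(n_pages, base_url=BASE_URL):
    """Return the links for pages p1 .. p<n_pages>."""
    return [f"{base_url}p{i}" for i in range(1, n_pages + 1)]


urls = page_urls(435)
print(len(urls))   # 435
print(urls[-1])    # the link ending in p435
```

You would then loop over `urls` with `driver.get(...)` exactly as in the three-page example, keeping the random wait between pages.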

Let's now consider an alternative, more robust idea. Here we will have the webdriver click on the "next page" button you see on the website. First, we need to grab the element for that button. Again, right click on the next page arrow button to go to the inspector. You will see that the website put a `title` attribute on it with the value "Next Page". We can use that to grab its location.

Once we have the button's location, we can use `.click()` to have the webdriver literally click on it.

In [18]:
# this is how to click on a website object:
button = driver.find_element('xpath','//*[@title="Next Page"]')
button.click()

Now, let's organize the code a bit more. We can write a function that grabs all the comments on a single page:

In [19]:
# define "scrape comments from a single page" func
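def get_comments(comment_ids=None):
    # use the ids passed in; otherwise grab them from the page the driver is currently on
    # (this avoids silently depending on a global variable named comment_ids)
    if comment_ids is None:
        comment_ids = get_comment_ids()
    comments=[]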
    for x in comment_ids:
        userid_element = driver.find_element('xpath','//*[@id="' + x + '"]/div/div[2]/div[1]/span[1]/a[2]')
        userid = userid_element.text

        user_date = driver.find_element('xpath','//*[@id="' + x + '"]/div/div[2]/div[2]/span/a/time')
        date = user_date.get_attribute('title')

        user_comment = driver.find_element('xpath','//*[@id="' + x + '"]/div/div[3]/div/div[1]')
        comment = user_comment.text

        comments.append([userid,date,comment])
    return comments

In [20]:
# scraping multiple pages, idea 2
# clicking to go to next page,
driver.get('https://forums.edmunds.com/discussion/2864')

# define an empty dataframe that will be populated
df_out = pd.DataFrame()
for j in range(20):
    try:
        #print(j) # again, if you want to see where the crawler is in
        
        # wait between 2-3 seconds
        time.sleep(random.uniform(2,3))
        
        # get comment ids
        comment_ids = get_comment_ids()

        # scrape the data on the page
        comments = get_comments()

        # df with the current page data
        df_single = pd.DataFrame(comments,columns=['userid','date','comment'])

        # df_out that keeps the entire data
        df_out = pd.concat([df_out,df_single])
        
        # df_out.to_csv('dropbox/myfolder/myfile.csv') # good idea to save as it goes

        # this is how to click on a website object:
        button = driver.find_element('xpath','//*[@title="Next Page"]')
        button.click()

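    except Exception as e:
        # most likely there is no "Next Page" button left, or the page layout changed
        print('Error!', e)
        # stop here so the same page is not scraped again on the next iteration
        break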
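
On the "save as it goes" idea from the commented-out line in the loop: rewriting the whole file after every page works, but another option is to append only the new page's rows. The sketch below uses the standard `csv` module; the helper name `append_rows` and the file name are hypothetical, and the demo writes to a temporary folder so it does not touch your files.

```python
import csv
import os
import tempfile


def append_rows(path, rows, header=("userid", "date", "comment")):
    """Append scraped rows to a CSV file, writing the header only once."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        writer.writerows(rows)


# Demo with two fake "pages" of one row each
out_path = os.path.join(tempfile.mkdtemp(), "comments.csv")
append_rows(out_path, [["merc1", "March 24, 2002 9:54PM", "first page row"]])
append_rows(out_path, [["fredvh", "March 24, 2002 11:06PM", "second page row"]])

with open(out_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(len(rows))  # 3: one header line plus two data rows
```

In the loop, you would call `append_rows(your_path, comments)` right after scraping each page, so a crash on page 300 does not cost you the first 299.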


In [23]:
df_out.reset_index(drop=True)

| | userid | date | comment |
|---|---|---|---|
| 0 | merc1 | March 24, 2002 9:54PM | I personally think that with a few tweaks the ... |
| 1 | fredvh | March 24, 2002 11:06PM | I am debating a new purchase and these two are... |
| 2 | blueguydotcom | March 25, 2002 9:02AM | Great handling, RWD, excellent engine and the ... |
| 3 | hungrywhale | March 25, 2002 3:04PM | And no manual tranny. That may not matter to y... |
| 4 | riez | March 25, 2002 4:44PM | One beauty of BMW 3 Series is that there are s... |
| ... | ... | ... | ... |
| 995 | prattster | March 3, 2004 9:06AM | Yeah, I'm surprised about that lol. You really... |
| 996 | speedracer3 | March 3, 2004 9:54AM | "It's been almost a year since I got the BMW a... |
| 997 | qbrozen | March 3, 2004 10:14AM | i never tried the previous gen. |
| 998 | blueguydotcom | March 3, 2004 10:29AM | That's the weird thing. Bimmers are ubiquitous... |


We are done! As you can see, Selenium is a simple but powerful library for scraping information from websites. (I use it all the time in my research!) There are other things you can do with Selenium, including having the webdriver type things in a search bar and return the results, scroll down on a page, and more.

Finally, one caveat: some commonly scraped websites (such as Zillow, Yelp, etc.) might have stronger bot checks. (You may be familiar with CAPTCHA checks!) If that is the case, the website can detect that you are using a crawler and are not a human. There are ways to get around that too, but they require a lot of extra intermediate steps, which hopefully you won't have to deal with in the future. If you do, you can reach out to me and we can discuss.

That is all! Take care 👋

Prof. B