Selenium allows us to control a web browser programmatically, so we can navigate web pages, interact with buttons or forms, and extract the information we need.

Unlike simpler scraping tools, Selenium can handle dynamic content, such as pages that load data with JavaScript, making it ideal for modern websites.

We will start by setting up the environment in Google Colab, configuring Chrome in headless mode, and then learn how to locate and extract elements from web pages.

In [None]:
# !pip install selenium #Installs the Selenium library, which allows Python to control web browsers programmatically



In [None]:
# !apt-get -q update #Updates the package lists on the system to ensure the latest versions of software and dependencies are available.

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:2 https://cli.github.com/packages stable InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:4 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:12 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1,576 kB]
Get:13 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [3,569 kB]
Fetched 5,401 kB in 1s (3,675 kB/s)
Readi

In [None]:
# !apt install -yq chromium-chromedriver #Installs ChromeDriver, a separate executable that Selenium uses to interact with the Chromium browser

Reading package lists...
Building dependency tree...
Reading state information...
chromium-chromedriver is already the newest version (1:85.0.4183.83-0ubuntu2.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [None]:
# !cp /usr/lib/chromium-browser/chromedriver /usr/bin #Copies ChromeDriver to a directory in the system PATH so Selenium can easily find and use it.

cp: '/usr/lib/chromium-browser/chromedriver' and '/usr/bin/chromedriver' are the same file


This setup is essential because Google Colab runs in a cloud environment without a native browser installed, so we explicitly install Chromium and its driver.

In [None]:
# import sys
# sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

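This cell would add the ChromeDriver location to `sys.path`, which is Python's module search path. That path governs `import` statements rather than executable lookup, so it does not by itself help Selenium find the driver binary. Selenium typically locates ChromeDriver through the system PATH, which the copy to /usr/bin above already covers, so the cell is left commented out.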
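One way to confirm that the driver is discoverable is to look it up on the executable search path. The sketch below uses the standard-library `shutil.which`; the helper name `find_driver` is hypothetical, and the binary name `chromedriver` is taken from the install step above.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of an executable found on PATH, or None if it is absent."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver is not on PATH; rerun the install cells above")
else:
    print("chromedriver found at", path)
```

If the lookup returns None, Selenium is unlikely to find the driver on its own unless a path is passed to it explicitly.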

In [1]:
from selenium import webdriver                  #Imports the webdriver module, which allows Python to launch and control web browsers like Chrome or Firefox.
from selenium.webdriver.common.by import By     #Imports the By class, which provides a way to locate elements on a web page (e.g., by ID, name, class, or XPath).

These imports are essential for launching the browser and finding elements on web pages so that Selenium can interact with them programmatically.

In [2]:
chrome_options = webdriver.ChromeOptions()              # Creates a ChromeOptions object to customize how Chrome will run.

chrome_options.add_argument('--headless=new')           # Runs Chrome in headless mode: the browser works in the background without opening a visible window, which suits cloud environments like Colab.
chrome_options.add_argument('--no-sandbox')             # Disables Chrome's sandbox security feature, which can cause startup failures in restricted environments like Colab.
chrome_options.add_argument('--disable-dev-shm-usage')  # Makes Chrome use /tmp instead of /dev/shm for shared memory, avoiding crashes when /dev/shm is small.

chrome_options.add_argument("--incognito")              # Opens the session in incognito mode, so browsing data does not persist between runs.
chrome_options.add_argument("--start-maximized")        # Requests a maximized window; this likely has little effect in headless mode, where no window is shown.

In [3]:
wd = webdriver.Chrome(options=chrome_options)
wd.get("https://www.example.com")
print(wd.page_source)

<html><head>
    <title>Example Domain</title>

    <meta charset="utf-8">
    <meta http-equiv="Content-type" content="text/html; charset=utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustr

In [4]:
wd.get('https://forums.edmunds.com/discussion/50806/general/x/car-subscription-vs-lease-vs-purchase')
print(wd.page_source)

<html lang="en"><head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <style type="text/css">@font-face {font-family:Open Sans;font-style:normal;font-weight:400;src:url(/cf-fonts/s/open-sans/5.0.20/cyrillic/400/normal.woff2);unicode-range:U+0301,U+0400-045F,U+0490-0491,U+04B0-04B1,U+2116;font-display:swap;}@font-face {font-family:Open Sans;font-style:normal;font-weight:400;src:url(/cf-fonts/s/open-sans/5.0.20/greek/400/normal.woff2);unicode-range:U+0370-03FF;font-display:swap;}@font-face {font-family:Open Sans;font-style:normal;font-weight:400;src:url(/cf-fonts/s/open-sans/5.0.20/cyrillic-ext/400/normal.woff2);unicode-range:U+0460-052F,U+1C80-1C88,U+20B4,U+2DE0-2DFF,U+A640-A69F,U+FE2E-FE2F;font-display:swap;}@font-face {font-family:Open Sans;font-style:normal;font-weight:400;src:url(/cf-fonts/s/open-sans/5.0.20/latin-ext/400/normal.woff2);unicode-range:U+0100-02AF,U+0304,U

XPATH

https://www.guru99.com/xpath-selenium.html

(A guide to absolute and relative XPaths in Selenium.)
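The difference between the two path styles can be tried offline before pointing Selenium at a real page. The sketch below is only an illustration: it uses the standard-library `xml.etree.ElementTree`, which supports a limited subset of XPath (and no leading `/` on an element), and the markup in `PAGE` is hypothetical, not the forum's actual HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for a forum page.
PAGE = """
<html>
  <body>
    <div id="Frame">
      <div id="Discussion_1">
        <div class="Message">First post</div>
      </div>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# Absolute-style path: every step down from the root is spelled out.
absolute = root.find("./body/div/div/div")

# Relative-style path: jump to the element with a known id, then step down.
relative = root.find(".//*[@id='Discussion_1']/div")

print(absolute.text)
print(relative.text)
print(absolute is relative)  # both paths reach the same element
```

The absolute-style path stops working as soon as a wrapper element is added or removed, while the id-anchored path keeps working as long as the id stays the same.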

In [5]:
# XPath stands for XML Path Language.

# Absolute path: spells out every step from the document root, so it breaks whenever the page layout changes.
user_message = wd.find_element(by=By.XPATH, value='/html/body/div[1]/div[1]/div[2]/div/div/div/div/div[3]/main/div[2]/div[2]/div/div[2]/div/div[1]')

# Relative path: starts from an element with a known id, which is usually more robust.
# user_message = wd.find_element(by=By.XPATH, value='//*[@id="Discussion_50806"]/div/div[2]/div/div[1]')

comment = user_message.text
print(comment)

Car subscriptions seem to be the "next new thing" in marketing for the automobile industry.

Here's a brief rundown of what they are and how they work:

What are Car Subscriptions?

Does this tempt you with any apparent advantages of how you buy/lease right now?


In [6]:
# Relative XPath to the link that holds the poster's username.
# Equivalent absolute path:
# /html/body/div[1]/div[1]/div[2]/div/div/div/div/div[3]/main/div[3]/div[2]/div/div[1]/div[1]/span[1]/a[2]
user_link = wd.find_element(by=By.XPATH, value='//*[@id="Discussion_50806"]/div/div[1]/div[1]/span[1]/a[2]')
userid = user_link.text

print(userid)

Mr_Shiftright


In [7]:
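# Locate the <time> element inside the comment with id Comment_5482782.
time_element = wd.find_element(by=By.XPATH, value='//*[@id="Comment_5482782"]/div/div[2]/div[2]/span/a/time')
date = time_element.text                          # visible text: month and year only
timestamp = time_element.get_attribute('title')   # title attribute: full date and time (named timestamp to avoid shadowing the time module)

print(date)
print(timestamp)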

April 2018
April 7, 2018 11:38AM
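The full timestamp is still a string at this point. A minimal sketch of turning it into a `datetime` object follows; the format string is inferred from the single value printed above and assumes English month names, so other posts or locales may need a different pattern.

```python
from datetime import datetime

# The string printed above from the title attribute.
title_text = "April 7, 2018 11:38AM"

# %B = full month name, %d = day, %Y = year, %I:%M%p = 12-hour clock with AM/PM.
posted_at = datetime.strptime(title_text, "%B %d, %Y %I:%M%p")

print(posted_at.isoformat())  # 2018-04-07T11:38:00
```

Once parsed, the value can be sorted or compared, which the month-and-year text alone does not allow.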


In [8]:
# pandas is a Python library for data analysis.
# A pandas DataFrame is a two-dimensional data structure, like a table with rows and columns.
import pandas as pd
df = pd.DataFrame(columns=['username', 'date', 'comment'])
df.loc[0, 'username'] = userid
df.loc[0, 'date'] = date
df.loc[0, 'comment'] = comment

df.index.name = "ID"
# head() shows the first five rows by default; df.head(3) would show the first three.
df.head()

ID,username,date,comment
0,Mr_Shiftright,April 2018,"Car subscriptions seem to be the ""next new thi..."


In [9]:
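# sep="\t" writes tab-separated values even though the file is named .csv,
# so the file must be read back with the same separator.
df.to_csv("results.csv", sep="\t")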
# !ls
# from google.colab import files
# files.download("results.csv")
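Because the file is written with a tab separator, whatever reads it back needs the same delimiter. The sketch below illustrates this with the standard-library `csv` module and an in-memory buffer standing in for results.csv; the row content is a hypothetical stand-in, not the scraped data.

```python
import csv
import io

# Hypothetical row mirroring the columns written above.
rows = [{"ID": 0, "username": "Mr_Shiftright", "date": "April 2018",
         "comment": "Car subscriptions seem to be the next new thing"}]

buffer = io.StringIO()  # in-memory stand-in for results.csv
writer = csv.DictWriter(buffer, fieldnames=["ID", "username", "date", "comment"],
                        delimiter="\t")
writer.writeheader()
writer.writerows(rows)

# Reading with the matching delimiter recovers the four columns.
buffer.seek(0)
tab_rows = list(csv.reader(buffer, delimiter="\t"))

# Reading with the default comma delimiter leaves each line as one field.
buffer.seek(0)
comma_rows = list(csv.reader(buffer))

print(len(tab_rows[0]), len(comma_rows[0]))  # 4 1
```

The same applies to pandas: reading the file would need `sep="\t"` passed to `pd.read_csv`.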

In [10]:
# Start a new browser session and navigate to the page
wd = webdriver.Chrome(options=chrome_options)
wd.get('https://forums.edmunds.com/discussion/50806/general/x/car-subscription-vs-lease-vs-purchase')

# Find all elements that carry the class "Message" (By.CLASS_NAME takes a single class name)
comments = wd.find_elements(by=By.CLASS_NAME, value="Message")

# Loop through the list of elements and print the text of each comment
for comment in comments:
    print(comment.text)

Car subscriptions seem to be the "next new thing" in marketing for the automobile industry.

Here's a brief rundown of what they are and how they work:

What are Car Subscriptions?

Does this tempt you with any apparent advantages of how you buy/lease right now?
Zero interest in the car subscription idea.  I like my insurance company and don't want to switch to Liberty Mutual (company Volvo uses).  I'm not interested in being that locked into the dealer's idea of when I might need to switch cars.  For me, ownership only.  No subscription for sure, probably not ever a lease either.
I hadn't thought about being forced into another insurance company, so that's a consideration.
Also here in PA tax is on the monthly payment. So I assume in a subscription scenario you will pay tax on everything not just the car part of the payment.
I don't see what the consumer gains in this arrangement given how much these cost. Even factoring in what I pay for upkeep and insurance, how is this of benefit t
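`By.CLASS_NAME` expects a single class name, so a compound value such as "Message userContent" is better expressed as a CSS selector that chains the classes. The helper below is a hypothetical convenience for building that selector; whether the comment bodies on this page actually carry both classes is an assumption taken from the code comment above.

```python
def classes_to_css(class_attr):
    """Turn a class attribute such as "Message userContent" into ".Message.userContent"."""
    names = class_attr.split()
    if not names:
        raise ValueError("expected at least one class name")
    return "".join("." + name for name in names)


selector = classes_to_css("Message userContent")
print(selector)  # .Message.userContent

# With a live driver this could be used as (not run here):
# wd.find_elements(By.CSS_SELECTOR, selector)
```

A chained selector matches only elements that have every listed class, in any order.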

In [13]:
wd = webdriver.Chrome(options=chrome_options)
wd.get('https://forums.edmunds.com/discussion/50806/general/x/car-subscription-vs-lease-vs-purchase')

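links = wd.find_elements(By.TAG_NAME, "a")

# Keep only links whose href contains "profile"
substring = "profile"
my_set = set()  # a set drops duplicate usernames
for lnk in links:
    full_link = lnk.get_attribute('href')  # None when the <a> has no href
    if full_link and substring in full_link:
        # Drop the first 35 characters, the length of "https://forums.edmunds.com/profile/"
        my_set.add(full_link[35:])

for item in my_set:
    print(item)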

biancar
GamesBx2
Glockenspiel
PlasticBottle
carolsutton_86
Howard_wep
idrachman
storm10
MorePowahhh2Me
isellhondas
Michaell
rkelly17
ksoman
turbo_v6
jasonkimberson
TheMarkusAllen
bryanfox177
qbrozen
PF_Flyer
antoninb
Mr_Shiftright
kyfdx
jdm11
stickguy
leroy111
clawsoncars
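Slicing at a fixed offset works only while every profile link shares the same prefix. A less brittle alternative is to parse the URL and take the path segment after `profile`. The sketch below assumes profile links have the form https://forums.edmunds.com/profile/<username>, which is consistent with the 35-character offset but not confirmed here; `username_from_profile_url` is a hypothetical helper.

```python
from urllib.parse import urlparse, unquote


def username_from_profile_url(url):
    """Return the path segment after /profile/, or None if the URL has no such segment."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if "profile" in parts:
        idx = parts.index("profile")
        if idx + 1 < len(parts):
            return unquote(parts[idx + 1])
    return None


print(username_from_profile_url("https://forums.edmunds.com/profile/Mr_Shiftright"))
print(username_from_profile_url("https://forums.edmunds.com/discussion/50806"))
```

The first call prints the username and the second prints None, so non-profile links can be filtered out with a simple `is not None` check.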


In [14]:
wd = webdriver.Chrome(options=chrome_options)
wd.get('https://forums.edmunds.com/discussion/50806/general/x/car-subscription-vs-lease-vs-purchase')
hrefs = wd.find_elements(By.CLASS_NAME, "Author")
my_set = set()

for lnk in hrefs:
    username = lnk.text.strip()
    if username:
        my_set.add(username)

for item in my_set:
    print(item)

biancar
GamesBx2
Glockenspiel
PlasticBottle
carolsutton_86
Howard_wep
idrachman
storm10
MorePowahhh2Me
isellhondas
Michaell
rkelly17
ksoman
turbo_v6
jasonkimberson
TheMarkusAllen
bryanfox177
qbrozen
PF_Flyer
antoninb
Mr_Shiftright
kyfdx
jdm11
stickguy
leroy111
clawsoncars


In [15]:
import pandas as pd

driver = webdriver.Chrome(options=chrome_options)
driver.get('https://forums.edmunds.com/discussion/50806/general/x/car-subscription-vs-lease-vs-purchase')
elements = driver.find_elements(By.CLASS_NAME, "Comment")

comments = pd.DataFrame(columns=['Date', 'user_id', 'comments'])

for element in elements:
    # Search inside each comment element rather than the whole page.
    author = element.find_element(By.CLASS_NAME, "Author").text
    # By.CLASS_NAME takes a single class name, so a compound value such as
    # "Meta CommentMeta CommentInfo" is not a usable locator here.
    date = element.find_element(By.TAG_NAME, "time").text
    comment = element.find_element(By.CLASS_NAME, "Item-Body").text

    # .loc is the label-based indexer; using the current length as the label appends a new row.
    comments.loc[len(comments.index)] = [date, author, comment]


In [18]:
comments.index.name = "ID"
comments.to_csv('final data.csv')
# !ls
# from google.colab import files
# files.download("final data.csv")
comments.head()

ID,Date,user_id,comments
0,April 2018,biancar,Zero interest in the car subscription idea. I...
1,April 2018,Mr_Shiftright,I hadn't thought about being forced into anoth...
2,April 2018,rkelly17,Also here in PA tax is on the monthly payment....
3,April 2018,PF_Flyer,I don't see what the consumer gains in this ar...
4,April 2018,stickguy,"price aside, it is the convenience they are se..."


PAGINATION

In [19]:
#NUMBERED PAGINATION
url = 'https://forums.edmunds.com/discussion/50806/general/x/car-subscription-vs-lease-vs-purchase'
driver.get(url + '/p2')

# Get all page number links
page_links = driver.find_elements(By.XPATH, '//span[@id="PagerBefore"]/a[contains(@class, "Pager-p")]')
print("Number of page links found:", len(page_links))

for link in page_links:
    print("Text:", link.text, " | URL:", link.get_attribute("href"))

Number of page links found: 2
Text: 1  | URL: https://forums.edmunds.com/discussion/50806/general/x/car-subscription-vs-lease-vs-purchase
Text: 2  | URL: https://forums.edmunds.com/discussion/50806/general/x/car-subscription-vs-lease-vs-purchase/p2
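The output above suggests a simple URL pattern: page 1 is the bare thread URL and page 2 appends `/p2`. Under the assumption that later pages follow the same `/p<N>` pattern, the page URLs can be generated directly instead of being read from the pager. The helper name `page_urls` is hypothetical.

```python
def page_urls(base_url, last_page):
    """Return the URLs for pages 1..last_page of a thread."""
    if last_page < 1:
        raise ValueError("last_page must be at least 1")
    # Page 1 is the bare thread URL; later pages append /p<N>.
    return [base_url if n == 1 else f"{base_url}/p{n}" for n in range(1, last_page + 1)]


url = 'https://forums.edmunds.com/discussion/50806/general/x/car-subscription-vs-lease-vs-purchase'
for page_url in page_urls(url, 2):
    print(page_url)
```

Each generated URL could then be passed to `driver.get` in a loop, running the same comment extraction on every page.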
