<p style="color: darkred; font-size: 50px; text-align: center;"><b>Webscraping and Social Media Scraping</b></p>
<p style="color: darkred; font-size: 30px; text-align: center;">Labs #5. Scraping dynamic pages</p>
<p style="font-size: 20px; text-align: center;">Maciej Świtała, Ewa Weychert</p>
<p style="font-size: 20px; text-align: center;">Spring 2025</p>
<p align="center">
  <img src="img/wne-logo-new-en.jpg" width="498" height="107">
</p>

### Setting up the environment

Let us (install if neccessary and) load the needed libraries.

In [1]:
#!pip install pandas numpy selenium

In [None]:
# FOR DATA PROCESSING:
import pandas as pd
import numpy as np

# FOR MEASURING COMPUTATION TIME, CREATING FIXED DELAYS:
import time

# FOR APPLYING SELENIUM:
import selenium # Python Selenium
from selenium import webdriver # for specifying webdriver

from webdriver_manager.chrome import ChromeDriverManager # chromedriver for automatized access to Chrome
from webdriver_manager.microsoft import EdgeChromiumDriverManager # msedgedriver for automatized access to Microsoft Edge

from selenium.webdriver.chrome.service import Service # needed since Selenium 4.10.0 see: https://github.com/SeleniumHQ/selenium/commit/9f5801c82fb3be3d5850707c46c3f8176e3ccd8e

from selenium.webdriver.support.ui import WebDriverWait # this three enable waiting until sth is displayed on website
from selenium.webdriver.support import expected_conditions as EC # for checking visibility of an element
from selenium.webdriver.common.by import By # for checking element visibility by XPath

# FOR SAVING DATA:
import pickle # pickle format of saved output

def save_object(obj, filename): #  function defined for saving Python objects
    with open(filename, 'wb') as output: # overwrites any existing file
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)

### Browsers' drivers

To proceed with Selenium, browser's webdriver need to be installed - which exactly depends on the browser we want to use. Here, let us try three the most popular browsers: Chrome, edge and Edge.

In [None]:
chromepath = ChromeDriverManager().install(); print(chromepath)
edgepath = EdgeChromiumDriverManager().install(); print(edgepath)

C:\Users\Wojciech\.wdm\drivers\chromedriver\win64\114.0.5735.90\chromedriver.exe
C:\Users\Wojciech\.wdm\drivers\geckodriver\win64\v0.36.0\geckodriver.exe
C:\Users\Wojciech\.wdm\drivers\edgedriver\win64\135.0.3179.18\msedgedriver.exe


### Example: Scrapping news on climate solutions from CNN

Before we start: can we scrap news from the CNN website?

We should investigate the robots.txt file first, see: https://edition.cnn.com/robots.txt. Looks like one can scrap some of the website's subsections.

What about the website's terms of service? Let us investigate it, see: https://edition.cnn.com/2015/01/06/world/terms-service/index.html. No specific statements on webscraping. Yet, clearly we must not collect and process personal data.

It is highly unlikely that we could infringe anyone's copyright by scraping content from this website. Also, it is hard to imagine causing a material damage unless we break the website.

Conclusion: we can scrap it as long as it is consistent with robots.txt and it is not personal data.

Step 1: Open the CNN website with Selenium.

In [8]:
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options

website = "https://edition.cnn.com/climate/solutions/page/1"

edgepath = r"C:\Users\Wojciech\.wdm\drivers\edgedriver\win64\135.0.3179.18\msedgedriver.exe"  # Make sure to set the correct path to your Edge driver
service_edge = Service(executable_path=edgepath)
options_edge = Options()
driver_edge = webdriver.Edge(service=service_edge, options=options_edge)  # opens edge

driver_edge.maximize_window()  # maximizes browser's window
driver_edge.get(website)  # opens a website

Step 2: Accept the cookies.

In [9]:
website = "https://edition.cnn.com/climate/solutions/page/1"

service_Edge = Service(executable_path = edgepath) 
options_Edge = webdriver.EdgeOptions()
driver_Edge = webdriver.Edge(service = service_Edge, options = options_Edge) # opens Edge

driver_Edge.maximize_window() # maximizes browser's window
driver_Edge.get(website) # opens a website

cookies_button_xpath = '''//*[@id="onetrust-accept-btn-handler"]'''

# wait at most 30 seconds until cookies button is visible
WebDriverWait(driver_Edge, 30).until(EC.visibility_of_element_located((By.XPATH, cookies_button_xpath))) 

# + wait random time drawn from specific (strongly right-side-skewed) distribution to better imitate human behavior
time.sleep(np.random.chisquare(1)+3)

content = driver_Edge.find_element("xpath", cookies_button_xpath) # finds the button
content.click() # clicks the button

Step 3: Build a scraper collecting links to subpages with single news.

In [10]:
start = time.time()

website = "https://edition.cnn.com/climate/solutions/page/1" # here we limit to 1st page

service_Edge = Service(executable_path = edgepath) 
options_Edge = webdriver.EdgeOptions()
driver_Edge = webdriver.Edge(service = service_Edge, options = options_Edge) # opens Edge

driver_Edge.maximize_window() # maximizes browser's window
driver_Edge.get(website) # opens a website

cookies_button_xpath = '''//*[@id="onetrust-accept-btn-handler"]''' 

# wait at most 30 seconds until cookies button is visible
WebDriverWait(driver_Edge, 30).until(EC.visibility_of_element_located((By.XPATH, cookies_button_xpath))) 

# + wait random time drawn from specific (strongly right-side-skewed) distribution to better imitate human behavior
time.sleep(np.random.chisquare(1)+3)

content = driver_Edge.find_element("xpath", cookies_button_xpath) # finds the button
content.click() # clicks the button
time.sleep(3) # extra time needed for the button to disappear

# wait at most 30 seconds until webpage is reloaded
WebDriverWait(driver_Edge, 30).until(EC.visibility_of_element_located((By.XPATH, '''//a[contains(@data-link-type,'article')]'''))) 

# + wait random time drawn from specific (strongly right-side-skewed) distribution to better imitate human behavior
time.sleep(np.random.chisquare(1)+3)

# finds all <a> elements on website that include: data-link-type='article'
tags = driver_Edge.find_elements('xpath','''//a[contains(@data-link-type,'article')]''')

# finds all links from the 'tags'
hrefs = []
for tag in tags:
    href = tag.get_attribute('href') # for each <a> finds 'href', i.e., link to subpage
    if(href not in hrefs): # here we handle duplicates
        hrefs.append(href)

driver_Edge.close() # this closes the webdriver
        
print(len(hrefs))

end = time.time()
print(end-start)

29
25.26259970664978


Step 4: Scale the procedure of collecting the links to subpages for the whole section of the website.

In [11]:
start = time.time()

service_Edge = Service(executable_path = edgepath) 
options_Edge = webdriver.EdgeOptions()
driver_Edge = webdriver.Edge(service = service_Edge, options = options_Edge) # opens Edge

driver_Edge.maximize_window() # maximizes browser's window
website = "https://edition.cnn.com/climate/solutions/page/1"

driver_Edge.get(website) # opens a website

cookies_button_xpath = '''//*[@id="onetrust-accept-btn-handler"]'''

# wait at most 30 seconds until cookies button is visible
WebDriverWait(driver_Edge, 30).until(EC.visibility_of_element_located((By.XPATH, cookies_button_xpath))) 

# + wait random time drawn from specific (strongly right-side-skewed) distribution to better imitate human behavior
time.sleep(np.random.chisquare(1)+3)

content = driver_Edge.find_element("xpath", cookies_button_xpath) # finds the button
content.click() # clicks the button
time.sleep(3) # extra time needed for the button to disappear

hrefs = []

for i in range(1,14): # looks like there are 13 pages (state for 06.03.2025, 16:36)

    try:

        # wait at most 30 seconds until webpage is reloaded
        WebDriverWait(driver_Edge, 30).until(EC.visibility_of_element_located((By.XPATH, '''//a[contains(@data-link-type,'article')]'''))) 
        
        # + wait random time drawn from specific (strongly right-side-skewed) distribution to better imitate human behavior
        time.sleep(np.random.chisquare(1)+3)

        # finds all <a> elements on website that include: data-link-type='article'
        tags = driver_Edge.find_elements('xpath','''//a[contains(@data-link-type,'article')]''')

        # finds all links from the 'tags'
        for tag in tags:
            href = tag.get_attribute('href') # for each <a> finds 'href', i.e., link to subpage
            if(href not in hrefs): # here we handle duplicates
                hrefs.append(href)

        website = "https://edition.cnn.com/climate/solutions/page/"+str(i)
        driver_Edge.get(website) # opens a website

    except:
        continue
    
driver_Edge.close() # this closes the webdriver
    
print(len(hrefs))

end = time.time()
print(end-start)

174
102.20039010047913


Step 5: Access the collected links and collect the data.

CAUTION! It appears we do not have enough time to scrap everything. Let us collect first 30 texts.

In [12]:
start = time.time()

service_Edge = Service(executable_path = edgepath) 
options_Edge = webdriver.EdgeOptions()
driver_Edge = webdriver.Edge(service = service_Edge, options = options_Edge) # opens Edge

driver_Edge.maximize_window() # maximizes browser's window
website = hrefs[0]

driver_Edge.get(website) # opens a website

cookies_button_xpath = '''//*[@id="onetrust-accept-btn-handler"]'''

# wait at most 30 seconds until cookies button is visible
WebDriverWait(driver_Edge, 30).until(EC.visibility_of_element_located((By.XPATH, cookies_button_xpath))) 

# + wait random time drawn from specific (strongly right-side-skewed) distribution to better imitate human behavior
time.sleep(np.random.chisquare(1)+3)

content = driver_Edge.find_element("xpath", cookies_button_xpath) # finds the button
content.click() # clicks the button
time.sleep(3) # extra time needed for the button to disappear

texts = []

for href in hrefs[1:30]: # here we limit to first 30 articles

    try:

        article_content_xpath = '''//div[@class='article__content']'''
        
        # wait at most 30 seconds until the article is visible
        WebDriverWait(driver_Edge, 30).until(EC.visibility_of_element_located((By.XPATH, article_content_xpath))) 
        
        # + wait random time drawn from specific (strongly right-side-skewed) distribution to better imitate human behavior
        time.sleep(np.random.chisquare(1)+3)

        content = driver_Edge.find_element("xpath", article_content_xpath) # finds the content
        texts.append(content.text)
        
        driver_Edge.get(href)
    
    except:
        continue
    
driver_Edge.close() # this closes the webdriver
    
print(len(texts))

end = time.time()
print(end-start)

29
211.64213299751282


In [13]:
# here, we save what we scraped as pickle or txt
# obviously we could use some different formats though
save_object(texts, r'outputs/output_CNNclimate.pkl')
save_object(texts, r'outputs/output_CNNclimate.txt')

### Example: setting up more Selenium options

Selenium offers plenty of options that can be useful in different context. Let us focus on those that appear the most popular.

In [14]:
start = time.time()

website = "https://edition.cnn.com/climate/solutions/page/1"

service_Edge = Service(executable_path = edgepath) 
options_Edge = webdriver.EdgeOptions()

############################## CHECK OUT THE OPTIONS ADDED ##############################
options_Edge.add_argument("--start-maximized") # the window will start as maximized
options_Edge.add_argument("--headless") # browser's window will not be displayed when this applied
options_Edge.add_argument("--incognito") # incognito mode
options_Edge.add_argument("--no-sandbox")
# sandboxing is a security feature that helps isolate web pages in their own secure
# "sandbox" to prevent malicious code from affecting the entire system
options_Edge.add_argument("--disable-gpu")
# instructs Chrome to disable the GPU (Graphics Processing Unit) hardware acceleration
options_Edge.add_argument("--disable-notifications")
options_Edge.add_argument("--disable-infobars")
options_Edge.add_argument("--disable-extensions")
options_Edge.add_argument("--disable-web-security")
#########################################################################################

driver_Edge = webdriver.Edge(service = service_Edge, options = options_Edge) # opens Edge

driver_Edge.maximize_window() # maximizes browser's window
driver_Edge.get(website) # opens a website

cookies_button_xpath = '''//*[@id="onetrust-accept-btn-handler"]''' 

# wait at most 30 seconds until cookies button is visible
WebDriverWait(driver_Edge, 30).until(EC.visibility_of_element_located((By.XPATH, cookies_button_xpath))) 
# + wait random time drawn from specific (strongly right-side-skewed) distribution to better imitate human behavior
time.sleep(np.random.chisquare(1)+3)

content = driver_Edge.find_element("xpath", cookies_button_xpath) # finds the button
content.click() # clicks the button

for i in range(1,14): # there are 13 pages, state as for 06.03.2025, 16:36

    # wait at most 30 seconds until webpage is reloaded
    WebDriverWait(driver_Edge, 30).until(EC.visibility_of_element_located((By.XPATH, '''//a[contains(@data-link-type,'article')]'''))) 
    # + wait random time drawn from specific (strongly right-side-skewed) distribution to better imitate human behavior
    time.sleep(np.random.chisquare(1)+3)

    # finds all <a> elements on website that include: data-link-type='article'
    tags = driver_Edge.find_elements('xpath','''//a[contains(@data-link-type,'article')]''')

    # finds all links from the 'tags'
    for tag in tags:
        href = tag.get_attribute('href') # for each <a> finds 'href', i.e., link to subpage
        if(href not in hrefs): # here we handle duplicates
            hrefs.append(href)

    website = "https://edition.cnn.com/climate/solutions/page/"+str(i)
    driver_Edge.get(website) # opens a website
        
texts = []

for href in hrefs[0:30]: # here we limit to first 30 articles
    
    driver_Edge.get(href)

    article_content_xpath = '''//div[@class='article__content']'''
    
    # wait at most 30 seconds until the article is visible
    WebDriverWait(driver_Edge, 30).until(EC.visibility_of_element_located((By.XPATH, article_content_xpath))) 
    # + wait random time drawn from specific (strongly right-side-skewed) distribution to better imitate human behavior
    time.sleep(np.random.chisquare(1)+3)

    content = driver_Edge.find_element("xpath", article_content_xpath) # finds the content
    texts.append(content.text)
    
driver_Edge.close() # this closes the webdriver

print(len(texts))

end = time.time()
print(end-start)

30
303.9829959869385


### Example: UserAgent rotation

The User-Agent header is an HTTP header designed to identify the user agent responsible for executing a request. One can identify a scraper by it. It is possible to keep it changing. 

In [None]:
service_Edge = Service(executable_path = edgepath) 
options_Edge = webdriver.EdgeOptions(); options_Edge.add_argument("--headless")
driver_Edge = webdriver.Edge(service = service_Edge, options = options_Edge)

driver_Edge.get("https://us.cnn.com/")

# returns our current User-Agent
user_agent = driver_Edge.execute_script("return navigator.userAgent;")
print("Current User-Agent:", user_agent)

driver_Edge.quit()

Current User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:135.0) Gecko/20100101 Firefox/135.0


In [None]:
from fake_useragent import UserAgent

# generating fake (random) User-Agent
ua = UserAgent()
user_agent = ua.random

service_Edge = Service(executable_path = edgepath) 
options_Edge = webdriver.EdgeOptions()
options_Edge.add_argument("--headless")
options_Edge.set_preference("general.useragent.override", user_agent) # changing User-Agent to the randomly generated one
driver_Edge = webdriver.Edge(service = service_Edge, options = options_Edge)

driver_Edge.get("https://us.cnn.com/")

user_agent = driver_Edge.execute_script("return navigator.userAgent;")
print("Current User-Agent:", user_agent)

driver_Edge.quit()

Current User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0


### Example: device ID rotation

One can be also identified and banned based on device ID. It is possible to emulate that we use a fake devide.

In [14]:
service_chrome = Service(executable_path = chromepath) 
options_chrome = webdriver.ChromeOptions(); options_chrome.add_argument("--headless")
options_chrome.add_experimental_option("mobileEmulation", {"deviceName": "Galaxy S5"})
driver_chrome = webdriver.Chrome(service = service_chrome, options = options_chrome)

driver_chrome.get("https://us.cnn.com/")

# returns our current User-Agent
user_agent = driver_chrome.execute_script("return navigator.userAgent;")
print("Current User-Agent:", user_agent)

driver_chrome.quit()

Current User-Agent: Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Mobile Safari/537.36


List of available devices: https://github.com/DevExpress/device-specs/blob/master/devices.md.

### Example: IP rotation

In [None]:
service_Edge = Service(executable_path = edgepath) 

options_Edge = webdriver.EdgeOptions()
options_Edge.add_argument("--headless")

driver_Edge = webdriver.Edge(service = service_Edge, options = options_Edge) # opens Edge

driver_Edge.get("https://sslproxies.org/") # webpage with some example IPs

xpath = "/html[1]/body[1]/section[1]/div[1]/div[2]/div[1]/table[1]/tbody[1]/tr[1]/td[1]"

WebDriverWait(driver_Edge, 30).until(EC.visibility_of_element_located((By.XPATH, xpath)))
time.sleep(np.random.chisquare(1)+3)

i = 1; proxy_list = []
while(True):
    try: # try getting all IPs you can one by one
        element1 = driver_Edge.find_element("xpath", "/html[1]/body[1]/section[1]/div[1]/div[2]/div[1]/table[1]/tbody[1]/tr["+str(i)+"]/td[1]")
        element2 = driver_Edge.find_element("xpath", "/html[1]/body[1]/section[1]/div[1]/div[2]/div[1]/table[1]/tbody[1]/tr["+str(i)+"]/td[2]")
        
        proxy_list.append(element1.text+":"+element2.text) # formates to "IP:port"
        
        i += 1 # if no error, then continue
    except: # error means no more rows in table with IPs, then break
        break
        
driver_Edge.close()

In [31]:
' '.join(proxy_list)

'54.248.238.110:80 51.68.175.56:1080 13.59.156.167:3128 3.141.217.225:80 54.179.39.14:3128 3.130.65.162:3128 3.129.184.210:80 3.139.242.184:80 3.37.125.76:3128 3.71.239.218:3128 44.218.183.55:80 15.236.106.236:3128 13.36.87.105:3128 3.212.148.199:3128 51.20.19.159:3128 3.127.62.252:80 3.122.84.99:3128 13.36.104.85:80 8.219.97.248:80 115.72.34.134:10003 91.107.130.145:11000 71.14.218.2:8080 47.252.29.28:11222 45.140.143.77:18080 86.106.132.194:3128 34.244.90.35:80 13.40.63.96:8001 3.96.208.91:3128 103.133.26.119:8080 44.219.175.186:80 204.236.137.68:80 15.207.35.241:1080 54.152.3.36:80 3.97.167.115:3128 3.97.176.251:3128 13.213.114.238:3128 51.17.58.162:3128 15.156.24.206:3128 52.63.129.110:3128 13.246.209.48:1080 204.236.176.61:3128 52.67.10.183:80 52.196.1.182:80 43.201.121.81:80 43.200.77.128:3128 46.51.249.135:3128 3.124.133.93:3128 35.79.120.242:3128 35.76.62.196:80 35.72.118.126:80 217.77.102.18:3128 13.126.79.133:80 3.136.29.104:80 13.56.192.187:80 47.251.122.81:8888 13.55.210.14

In [5]:
len(proxy_list)

100

In [None]:
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.proxy import Proxy, ProxyType

current_proxy_index = 0

def get_next_proxy():
    global current_proxy_index, proxy_list
    
    proxy = proxy_list[current_proxy_index]
    current_proxy_index = (current_proxy_index + 1) % len(proxy_list)
    # increases proxy index by 1 and takes the remains from division with the number of proxies
    # terefore it goes: 0, 1, 2, ..., 98, 99, 0, 1, 2, ...
    
    return proxy

def setup_driver(proxy_address):

    options_Edge = Options()

    options_Edge.set_preference("network.proxy.type", 1)
    options_Edge.set_preference("network.proxy.http", proxy_address.split(':')[0])
    options_Edge.set_preference("network.proxy.http_port", int(proxy_address.split(':')[1]))
    options_Edge.set_preference("network.proxy.ssl", proxy_address.split(':')[0])
    options_Edge.set_preference("network.proxy.ssl_port", int(proxy_address.split(':')[1]))
    options_Edge.set_preference("network.proxy.socks", proxy_address.split(':')[0])
    options_Edge.set_preference("network.proxy.socks_port", int(proxy_address.split(':')[1]))
    options_Edge.set_preference("network.proxy.socks_version", 5)
    options_Edge.set_preference("network.proxy.share_proxy_settings", True)
    options_Edge.set_preference("network.proxy.no_proxies_on", "localhost, 127.0.0.1")

    options_Edge.set_preference("signon.autofillForms", False)
    options_Edge.set_preference("signon.rememberSignons", False)
    options_Edge.set_preference("network.http.use-cache", False)

    options_Edge.add_argument("--headless")

    driver_Edge = webdriver.Edge(options=options_Edge)

    return driver_Edge

Let us try using a proxy. We will start from https://api64.ipify.org to check if our IP changes.

In [None]:
service_Edge = Service(executable_path = edgepath) 

options_Edge = webdriver.EdgeOptions()
options_Edge.add_argument("--headless")

driver_Edge = webdriver.Edge(service = service_Edge, options = options_Edge) # opens Edge

driver_Edge.get("https://api64.ipify.org")

WebDriverWait(driver_Edge, 30).until(EC.visibility_of_element_located((By.XPATH, "/html[1]/body[1]/pre[1]")))
time.sleep(np.random.chisquare(1)+3)

element = driver_Edge.find_element('xpath',"/html[1]/body[1]/pre[1]")
print("Current IP:", element.text)

driver_Edge.close()

Current IP: 78.11.220.225


In [None]:
for i in range(100): 
# we try it as many times as needed for https://api64.ipify.org to be accessed
# neccessary as Internet connection can be unstable, and IP can be in use

    print(i)
    try:
        proxy_address = get_next_proxy()
        driver_Edge = setup_driver(proxy_address)
        
        driver_Edge.get('https://api64.ipify.org')
        
        WebDriverWait(driver_Edge, 30).until(EC.visibility_of_element_located((By.XPATH, "/html[1]/body[1]/pre[1]")))
        time.sleep(np.random.chisquare(1)+3)

        element = driver_Edge.find_element('xpath',"/html[1]/body[1]/pre[1]")
        print("Current IP:", element.text)

        driver_Edge.close()
        break
        
    except:
        driver_Edge.close()
        continue

0
1
2
3
Current IP: 200.174.198.197


It is not efficient as when we try certain IP, probably some other people are using them (as these are freely available online). To make the procedure more efficient, one should introduce a mechanism of building a machine with new IP, e.g. with **scrapoxy**.

### Supplementary materials

More Selenium options can be found here: https://www.seleniumeasy.com/.

### Exercises

##### Exercise 1.

Collect texts of court cases issued before the District Court in Warsaw in 01.01.2020-31.12.2024 using https://orzeczenia.ms.gov.pl/. Fill in the search tool as presented in `img/example.png` and search for the cases (i.e. click the 'Szukaj' button). You can fill certain parts of the website with content executing `element = driver_chrome.find_element(); element.send_keys()` Next, access the subpages with texts of court cases and collect them. Specify the delays wisely as this website involves CAPTCHAs.

##### Exercise 2.

Collect individual opinions on McDonald's restaurant shared by its employees from: https://www.niche.com/places-to-work/mcdonalds-oak-brook-il/reviews/. Use the advanced Selenium options. Check if UserAgent, Device ID or IP rotation is needed as this website blocks when multiple requests from the same UserAgent are sent.