<style>
    .embedded-image {
        float: left;
        margin-left:20px;
        text-align: center;
    }

    .embedded-image img{
        border: 1px solid black !important; 
        margin: 0 auto;
    }

    .jp-CodeCell {
        max-height: 500px;
        overflow: scroll;
        
    }
</style>

Web Scraper Project
==================

## WSSC Water: Water Main Breaks
<img src="images/Wlogocolor-FINAL-01.png" width="50%" style="margin-left:auto; margin-right:auto"/>

## About the Target

<figure class="embedded-image" style="width:25%;">
    <img src="images/wssc_alert_example_narrow.png"/>
    <figcaption style="margin-top: 10px">
        <a href="https://www.wsscwater.com/news/2026/february/emergency-water-main-repair-chillum-0">Source</a>
    </figcaption>
</figure>

WSSC Water provides water and wastewater utilities for much of Montgomery and Prince George's Counties in Maryland

When WSSC needs to make emergency repairs to a water main, they issue an alert, like the one pictured


<!-- > ### Emergency Water Main Repair - Chillum
> **Laurel, MD – February 19, 2026** - WSSC Water is repairing a 6-inch water main at 2713 Nicholson Street in Chillum. Customers in this area will lose water service while the main is being fixed. Repairs will be made as quickly as possible.
> 
> Thank you for your patience. -->

The text of these alerts is nearly identical every time; only the date, location, and pipe diameter change. This will be our target data
<!-- ![](images/wssc_alert_example.png) -->


## Why this data?

Scraping this data would allow us to make a historical map of all water main breaks in the last several years

This could reveal patterns in which areas are likely to need emergency repairs, and perhaps identify correlating factors that could be used to anticipate future problems

## Ethics and legality

I don't anticipate any significant ethical or legal concerns:

- This dataset is compiled from information that has already been explicitly and deliberately broadcast to the public, for the public benefit
- WSSC already publishes maps of current sewer breaks and repairs; this merely adds historical dimension to that
- There are at most a few hundred such alerts; scraping is unlikely to overburden the website with requests

<!-- Links to these alerts can be found on their website, at https://www.wsscwater.com/newsroom 

![](images/wssc_newsroom_page.png) -->


<!-- Alerts have very similar structured format:

> #### Emergency Water Main Repair - Chillum
> **Laurel, MD – February 19, 2026** - WSSC Water is repairing a 6-inch water main at 2713 Nicholson Street in Chillum. Customers in this area will lose water service while the main is being fixed. Repairs will be made as quickly as possible.
> 
> Thank you for your patience.

> #### Emergency Water Main Repair - Chillum
> 
> **Laurel, MD – February 17, 2026** - WSSC Water is repairing an 8-inch water main at 2712 Nicholson St, Chillum. Customers in this area will lose water service while the main is being fixed. Repairs will be made as quickly as possible.
> 
> Thank you for your patience.

> #### Emergency Water Main Repair - Silver Spring
> 
> **Laurel, MD – February 17, 2026** - WSSC Water is repairing a 30-inch water main at 2100 Shorefield Road in Silver Spring. Customers in this area will lose water service while the main is being fixed. Repairs will be made as quickly as possible.
> 
> Thank you for your patience. -->

## Web Scraping Approach

1. Get all links to emergency repair alerts

2. Hit each link to get the text of the alert

3. Extract the specific variables we want (date, location, pipe diameter) from the text

4. Save as CSV
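
The four steps above can be sketched as one small driver function. This is only an outline under assumptions: `run_pipeline` and its parameters are hypothetical names, and the link-gathering and page-scraping steps are passed in as callables so the outline stays independent of how they are implemented.

```python
import csv
from typing import Callable, Iterable, Optional


def run_pipeline(
    get_links: Callable[[], Iterable[str]],
    scrape_page: Callable[[str], Optional[dict]],
    out_path: str,
    fieldnames: list,
) -> int:
    """Run steps 1-4 and return the number of rows written to the CSV."""
    rows = []
    for link in get_links():           # step 1: collect the alert links
        record = scrape_page(link)     # steps 2 and 3: fetch the page, extract fields
        if record is not None:         # skip pages that could not be scraped
            rows.append(record)

    # step 4: save everything as CSV
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Passing the two scraping steps in as functions also makes it easy to try the CSV step with fake data before touching the real site.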



<figure class="embedded-image" style="width:40%;">
    <img src="images/wssc_newsroom_page.png"/>
    <figcaption style="margin-top: 10px">
        <a href="https://www.wsscwater.com/newsroom">Source</a>
    </figcaption>
</figure>

### Step 1: Get the links

We need our scraper to go through the newsroom page, page by page, year by year

We only need the links to news items with the type "Alert" 

There are no alerts prior to 2022, so we begin there and work forward

Because this page uses JavaScript to load in any links not on the first page, we will use Selenium instead of a more traditional scraping tool like Beautiful Soup or Scrapy
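
Since the alerts start in 2022 and the scraper works forward one year at a time, the list of years can be derived from the current date rather than typed in by hand. A minimal sketch, assuming the hypothetical names `FIRST_ALERT_YEAR` and `years_to_scrape`:

```python
import datetime

FIRST_ALERT_YEAR = 2022  # from the text: there are no alerts prior to 2022


def years_to_scrape(today=None):
    """Return every year from the first alert year through the current year."""
    if today is None:
        today = datetime.date.today()
    return list(range(FIRST_ALERT_YEAR, today.year + 1))


print(years_to_scrape(datetime.date(2026, 2, 19)))
# [2022, 2023, 2024, 2025, 2026]
```

This way the loop would not need to be edited when a new year begins.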


### Code:

In [None]:

import pandas as pd
import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def get_current_datetime_as_intstring():
    return datetime.datetime.now().strftime("%m%d%Y%H%M%S")



In [None]:

def get_article_links_with_selenium():

    article_links_list = []

    start_url = "https://www.wsscwater.com/newsroom"
    print(start_url)
    driver = webdriver.Chrome()
    driver.delete_all_cookies()
    driver.get(start_url)
    driver.implicitly_wait(0.5)

    wait = WebDriverWait(driver, timeout=2, poll_frequency=0.2)

    # Click on the News Type checkbox for alerts
    alert_filter = driver.find_element(by=By.XPATH, value="//label[@for='edit-type-253']")
    alert_filter.click()
    
    # Wait to make sure the page reloaded with just the alerts
    wait.until(EC.text_to_be_present_in_element_attribute((By.CSS_SELECTOR, ".pager__item--next a"), "href", "type%5B253%5D=253"))

    for year in range(2022,2027):
        print(f"Current year: {year}")
        # Select the appropriate year
        select_year_element = driver.find_element(by=By.ID, value="edit-year")
        select_year = Select(select_year_element)
        print(f"Selected year: {select_year.all_selected_options[0].text}")
        select_year.select_by_value(f"{year}")
        # Wait to make sure the correct year loaded. If it doesn't load, try selecting the year again (but only once)
        try:
            wait.until(EC.text_to_be_present_in_element_attribute((By.CSS_SELECTOR, ".pager__item--next a"), "href", f"year={year}"))
        except TimeoutException:
            select_year_element = driver.find_element(by=By.ID, value="edit-year")
            select_year = Select(select_year_element)
            print(f"Selected year: {select_year.all_selected_options[0].text}")
            select_year.select_by_value(f"{year}")
            wait.until(EC.text_to_be_present_in_element_attribute((By.CSS_SELECTOR, ".pager__item--next a"), "href", f"year={year}"))


        current_page_num = 0
        
        while(True):
            current_page_num = current_page_num + 1
            alerts = driver.find_elements(by=By.CSS_SELECTOR, value="h3 a")
            repair_alerts = filter(lambda x : x.text.startswith("Emergency Water Main Repair"), alerts)
            for alert in repair_alerts:
                article_links_list.append(alert.get_attribute("href"))

            try:
                next_link = driver.find_element(by=By.CSS_SELECTOR, value=".pager__item--next a")
                next_link.click()

                # For some reason, clicking "next" doesn't take on the first try most of the time. 
                # To work around this, we will wait a few seconds to see if it takes, then try again. 
                # If the second time doesn't work, then we worry  
                try:
                    wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, "li.pager__item.is-active a"), f"{current_page_num+1}"))
                except TimeoutException:
                    next_link = driver.find_element(by=By.CSS_SELECTOR, value=".pager__item--next a")
                    next_link.click()
                    wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, "li.pager__item.is-active a"), f"{current_page_num+1}"))
                continue
            
            # If selenium cannot find a "next page" link, then this was the last page and we can exit the while loop
            except NoSuchElementException:
                break

    # Close the browser now that every year has been paged through
    driver.quit()
    return article_links_list
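

The "try, wait, try once more" workaround appears twice in the function above (once for the year dropdown, once for the "next" link). It could be factored into a small helper. The sketch below is an illustration only: `attempt_twice` is a hypothetical name, and the Selenium calls in the usage comment mirror the ones already used above.

```python
def attempt_twice(action, confirm, expected_errors=(Exception,)):
    """Run action(), then confirm(). If confirm() raises, redo both once."""
    action()
    try:
        return confirm()
    except expected_errors:
        # The first attempt did not take, so repeat it a single time.
        action()
        return confirm()  # a second failure propagates to the caller


# Hypothetical use inside the pagination loop:
# attempt_twice(
#     action=lambda: driver.find_element(By.CSS_SELECTOR, ".pager__item--next a").click(),
#     confirm=lambda: wait.until(EC.text_to_be_present_in_element(
#         (By.CSS_SELECTOR, "li.pager__item.is-active a"), f"{current_page_num + 1}")),
#     expected_errors=(TimeoutException,),
# )
```

Because `action` looks the element up again on each call, the retry clicks a fresh element, just as the original code does.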


In [None]:
article_links_list = get_article_links_with_selenium()

# we'll save a timestamped copy for debugging purposes
timestamp = datetime.datetime.now().strftime("%m%d%Y%H%M%S")
with open(f'output/article_links_{timestamp}.txt', 'w') as f:
    for link in article_links_list:
        # One link per line; without the newline all URLs run together
        f.write(link + "\n")



https://www.wsscwater.com/newsroom
Current year: 2022
Selected year: - Current Year (2026) -
Break!
Current year: 2023
Selected year: 2022
Selected year: 2022
Break!
Current year: 2024
Selected year: 2023
Selected year: 2023
Break!
Current year: 2025
Selected year: 2024
Selected year: 2024
Break!
Current year: 2026
Selected year: 2025
Selected year: 2025
Break!


### Results:

In [12]:

# # We'll also save a copy of the dataframe with a timestamp in the name, so we have a version 
# # of it saved that won't get overwritten if we run the scraper again later
# timestamp = datetime.datetime.now().strftime("%m%d%Y%H%M%S")
# article_links_df.to_csv(f"output/article_links_{timestamp}.csv", index=False)
# article_links_df.head()


In [11]:

article_links_list

['https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-takoma-park',
 'https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-capitol-heights',
 'https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-oxon-hill',
 'https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-silver-spring',
 'https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-greenbelt',
 'https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-gaithersburg',
 'https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-north-bethesda',
 'https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-damascus',
 'https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-germantown',
 'https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-bethesda',
 'https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-suitland',
 'https://www.wsscwater.c

### Step 2: Scrape the Alert Text

Because the individual alert pages don't depend on JavaScript to render their content, this can be done with Requests and Beautiful Soup

In [5]:
from bs4 import BeautifulSoup
import requests, re, time, random


In [None]:
# This function is borrowed from here, with some light tweaking: https://www.datacamp.com/blog/ethical-web-scraping
def fetch_with_retry(url, headers, max_retries=3):
    for attempt in range(max_retries):
        try:
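            # headers must be passed by keyword; the second positional argument is `params`
            response = requests.get(url, headers=headers, timeout=10)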
            response.raise_for_status()  # Raise exception for HTTP errors
            return response
        except requests.RequestException:
            if attempt == max_retries - 1:
                # Last attempt failed, log and give up
                print(f"Failed to fetch {url} after {max_retries} attempts")
                return None
            
            # Wait with exponential backoff + small random offset
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt+1} failed, waiting {wait_time:.2f}s before retry")
            time.sleep(wait_time)
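

To make the backoff in `fetch_with_retry` concrete, the sketch below computes the delays it would sleep between attempts. `backoff_delays` and its `rng` parameter are hypothetical names added for illustration; the formula is the one used above.

```python
import random


def backoff_delays(max_retries=3, rng=random.uniform):
    """Delays slept between attempts: 2**attempt seconds plus up to 1s of jitter."""
    # No sleep follows the final attempt, so there are max_retries - 1 delays.
    return [(2 ** attempt) + rng(0, 1) for attempt in range(max_retries - 1)]


# With the jitter switched off, the base schedule is easy to read:
print(backoff_delays(3, rng=lambda low, high: 0))
# [1, 2]
```

So with the default of three attempts, a page that keeps failing costs roughly 3 to 5 seconds of waiting before the function gives up.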


In [None]:

def scrape_wssc_alert_page(url, headers):
    response = fetch_with_retry(url=url,headers=headers)

    if response is None:
        return None
    
    # Name the parser explicitly so results don't depend on which parsers are installed
    soup = BeautifulSoup(response.content, "html.parser")

    title = soup.select_one("h1").text.strip()
    date = soup.select_one("time").text.strip()
    full_text = "\n".join([item.get_text() for item in soup.select(".node__content .field--type-text-long p")])
    
    alert_re = re.compile(r'(?P<diameter>\d{1,3})-inch water main at (?P<address>[\w,\s\.-]+?)\. Customers', re.IGNORECASE)
    re_search_results = alert_re.search(full_text)

    data = {
        "title": title,
        "date": date,
        "pipe_diameter": re_search_results.group("diameter") if re_search_results is not None else None,
        "address": re_search_results.group("address") if re_search_results is not None else None,
        "full_text": full_text,
    }

    return data


In [8]:

data = []
headers = {
    "USER-AGENT": "wssc_alerts scraper (efurth@montgomerycollege.edu)"
}

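for link in article_links_list:
    print(f"Scraping {link}...")
    record = scrape_wssc_alert_page(link, headers)
    # Skip pages that could not be fetched so the DataFrame only holds real rows
    if record is not None:
        data.append(record)
        print("...Done!")
    time.sleep(3)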


Scraping https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-takoma-park...
...Done!
Scraping https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-capitol-heights...
...Done!
Scraping https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-oxon-hill...
...Done!
Scraping https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-silver-spring...
...Done!
Scraping https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-greenbelt...
...Done!
Scraping https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-gaithersburg...
...Done!
Scraping https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-north-bethesda...
...Done!
Scraping https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-damascus...
...Done!
Scraping https://www.wsscwater.com/news/2022/december/emergency-water-main-repair-germantown...
...Done!
Scraping https://www.wsscwater.com/news/2022

### Step 3: Extract the data from the text

We can use regular expressions to get what we need.

```python
re.compile(r'(?P<diameter>\d{1,3})-inch water main at (?P<address>[\w,\s\.-]+?)\. Customers', re.IGNORECASE)
```
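
As a quick check of the pattern, here it is applied to the body text of the example alert pictured. The variable names `ALERT_RE` and `sample` are just for this illustration.

```python
import re

ALERT_RE = re.compile(
    r'(?P<diameter>\d{1,3})-inch water main at (?P<address>[\w,\s\.-]+?)\. Customers',
    re.IGNORECASE,
)

sample = (
    "WSSC Water is repairing a 6-inch water main at 2713 Nicholson Street "
    "in Chillum. Customers in this area will lose water service while the "
    "main is being fixed."
)

match = ALERT_RE.search(sample)
print(match.group("diameter"))  # 6
print(match.group("address"))   # 2713 Nicholson Street in Chillum

# Text that does not follow the template simply yields no match
print(ALERT_RE.search("Thank you for your patience."))  # None
```

The lazy `+?` in the address group stops at the first ". Customers", so abbreviations with periods inside the address (such as "Ln.") are still captured.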

<figure class="embedded-image" >
    <img src="images/wssc_alert_example_highlighted_elements.png"/>
    <figcaption style="margin-top: 10px">
        <a href="https://www.wsscwater.com/news/2026/february/emergency-water-main-repair-chillum-0">Source</a>
    </figcaption>
</figure>

### Results

In [None]:

df = pd.DataFrame(data)
df.head()

| | title | date | pipe_diameter | address | full_text |
|---|---|---|---|---|---|
| 0 | Emergency Water Main Repair - Takoma Park | December 30, 2022 | 8 | 602 Ethan Allen Avenue in Takoma Park | Laurel, MD – December 30, 2022: WSSC Water is ... |
| 1 | Emergency Water Main Repair - Capitol Heights | December 29, 2022 | 12 | 6180 Old Central Avenue at Rollins Avenue in C... | Laurel, MD – December 29, 2022: WSSC Water is ... |
| 2 | Emergency Water Main Repair - Oxon Hill | December 29, 2022 | 8 | 801 Owens Road in Oxon Hill | Laurel, MD – December 29, 2022: WSSC Water is ... |
| 3 | Emergency Water Main Repair - Silver Spring | December 28, 2022 | 8 | 12001 Old Columbia Pike in Silver Spring | Laurel, MD – December 28, 2022: WSSC Water is ... |
| 4 | Emergency Water Main Repair - Greenbelt | December 28, 2022 | 10 | 9115 Springhill Ln. in Greenbelt | Laurel, MD – December 28, 2022: WSSC Water is ... |


In [10]:
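# index=False keeps the row index out of the file, so reloading it doesn't add an "Unnamed: 0" column
df.to_csv("output/wssc_alerts.csv", index=False)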
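

The scraped `date` and `pipe_diameter` values are strings. For the historical map described earlier they would probably need to be real dates and numbers. A minimal sketch of that conversion for a single row, assuming the date format shown in the results ("December 30, 2022") and an English locale for month names; `clean_row` is a hypothetical helper:

```python
from datetime import date, datetime


def clean_row(row):
    """Return a copy of row with a date object and an integer pipe diameter."""
    cleaned = dict(row)
    # "%B %d, %Y" matches strings such as "December 30, 2022"
    cleaned["date"] = datetime.strptime(row["date"], "%B %d, %Y").date()
    # The diameter is None when the regular expression found no match
    if row["pipe_diameter"] is not None:
        cleaned["pipe_diameter"] = int(row["pipe_diameter"])
    return cleaned


example = {"date": "December 30, 2022", "pipe_diameter": "8"}
print(clean_row(example))
# {'date': datetime.date(2022, 12, 30), 'pipe_diameter': 8}
```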