# Scraping the Press Ganey Blog

**Project Title:** Healthcare Experience Trends Analysis from Press Ganey Blogs (2024-2025)

**Author:** Virginia Wenger

**Date:** 04.02.2025

**Description:** This notebook performs web scraping of Press Ganey's blog articles and applies text analytics to uncover recurring themes and emerging trends in healthcare experience measurement and improvement. The analysis focuses on identifying key themes such as technology integration, patient feedback, and workforce engagement.
    
**Methods Used:**
- Web Scraping (Selenium, BeautifulSoup, Requests)
- Text Preprocessing (NLTK, Regular Expressions)
- N-gram Frequency Analysis
- Named Entity Recognition (NER) using spaCy
- Topic Modeling using Latent Dirichlet Allocation (LDA)
- Sentiment Analysis (TextBlob and Transformer Models)
- Summarization using Transformer Models (T5-small)
- Data Visualization (Matplotlib, Seaborn)
    
**Outcome:**

The analysis highlights five key trends shaping healthcare experience measurement:
- AI & Technology Integration
- Patient-Centered Care
- Workforce Engagement & Safety
- Regulatory Standards
- Health Equity & Social Determinants


In [4]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

In [2]:
# Restart with a fresh session
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0 Safari/537.36")

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Open the blog page
url = "https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights"
driver.get(url)
time.sleep(5)

print(driver.title)  # Check that the session is working

Human Experience in Healthcare Blog | Press Ganey


In [3]:
# Find all <article> elements on the page
articles = driver.find_elements(By.TAG_NAME, 'article')

# Check how many articles were found
print(f"Number of articles found: {len(articles)}")

# Print the text of the first article to verify
if articles:
    print("First article preview:")
    print(articles[0].text)
else:
    print("No articles found.")

Number of articles found: 20
First article preview:
BLOG
AI for safety leaders: How will emerging technology impact our daily work in healthcare?
Read blog


In [14]:
try:
    # Wait until the 'See More' button is clickable
    see_more_button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.ID, 'uf-lazy-loader-load-more'))
    )
    print("See More button found and is clickable.")
except Exception as e:
    print("See More button not found or not clickable.", e)

See More button found and is clickable.


In [16]:
try:
    # Scroll to the 'See More' button before clicking
    driver.execute_script("arguments[0].scrollIntoView(true);", see_more_button)
    time.sleep(2)  # Give time for any animations or dynamic effects

    # Click the button
    see_more_button.click()
    print("Clicked 'See More' button.")
    time.sleep(5)  # Wait for new articles to load
except Exception as e:
    print("Failed to click 'See More' button.", e)


Failed to click 'See More' button. Message: element click intercepted: Element <button id="uf-lazy-loader-load-more" type="button" class="uf-lazy-loader-load-more uf-button is-primary is-margin-centered">...</button> is not clickable at point (599, 32). Other element would receive the click: <div class="mega-dropdown-section">...</div>
  (Session info: chrome=132.0.6834.160)
Stacktrace:
0   chromedriver                        0x000000010067b0d4 cxxbridge1$str$ptr + 2600792
1   chromedriver                        0x00000001006739f0 cxxbridge1$str$ptr + 2570356
2   chromedriver                        0x00000001002143d8 cxxbridge1$string$len + 89376
3   chromedriver                        0x000000010025ed44 cxxbridge1$string$len + 394892
4   chromedriver                        0x000000010025d320 cxxbridge1$string$len + 388200
5   chromedriver                        0x000000010025b1fc cxxbridge1$string$len + 379716
6   chromedriver                        0x000000010025a62c cxxbridge1$strin
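
The error above reports that the click would have landed on the `mega-dropdown-section` element instead of the button. One common workaround, sketched here as an assumption and not as the approach this notebook takes, is to dispatch the click through JavaScript, which does not depend on what sits on top of the button at those screen coordinates. `click_with_js_fallback` is a hypothetical helper; it relies only on `element.click()` and `driver.execute_script(...)`, both part of Selenium's WebDriver API.

```python
def click_with_js_fallback(driver, element):
    """Try a native click first; if it raises, click via JavaScript instead.

    Hypothetical helper: returns "native" or "js" to report which path ran.
    """
    try:
        element.click()
        return "native"
    except Exception:
        # Selenium raises ElementClickInterceptedException when another
        # element covers the target; a JavaScript click bypasses that check.
        driver.execute_script("arguments[0].click();", element)
        return "js"
```

With the objects defined in the cells above, the call would look like `click_with_js_fallback(driver, see_more_button)`.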

In [19]:
while True:
    try:
        # Wait for 'See More' button to be clickable
        see_more_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.ID, 'uf-lazy-loader-load-more'))
        )

        # Scroll to the button and click
        driver.execute_script("arguments[0].scrollIntoView(true);", see_more_button)
        time.sleep(2)
        see_more_button.click()
        print("Clicked 'See More' button.")
        time.sleep(5)  # Wait for articles to load
    except Exception:
        # A wait timeout (button gone) and an intercepted click both end the loop here
        print("No more 'See More' button found or all articles loaded.")
        break


No more 'See More' button found or all articles loaded.
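
The loop above stops on the first exception of any kind, so it cannot tell "everything is loaded" apart from "the click failed". A possible alternative, sketched under the assumption that the article count is a reliable progress signal, is to keep clicking until the count stops growing, with a cap on the number of rounds. `load_until_stable`, `count_items`, and `click_more` are hypothetical names, and the two callables are supplied by the caller.

```python
def load_until_stable(count_items, click_more, max_rounds=50):
    """Call click_more() until count_items() stops growing.

    count_items and click_more are caller-supplied callables; max_rounds
    caps the loop so it cannot run forever. Returns the last count seen
    while the page was still growing.
    """
    previous = count_items()
    for _ in range(max_rounds):
        try:
            click_more()
        except Exception:
            break  # button missing or not clickable: assume nothing more to load
        current = count_items()
        if current <= previous:
            break  # the click added nothing new
        previous = current
    return previous
```

In this notebook, `count_items` could be `lambda: len(driver.find_elements(By.TAG_NAME, 'article'))`, and `click_more` a function that waits for the button, clicks it, and sleeps long enough for the new articles to render before returning.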


In [20]:
# Find all loaded <article> elements
articles = driver.find_elements(By.TAG_NAME, 'article')

# Check how many articles were found after loading
print(f"Total articles loaded: {len(articles)}")

# Preview the first article to verify
if articles:
    print("First article after loading all content:")
    print(articles[0].text)
else:
    print("No articles found after clicking 'See More'.")

Total articles loaded: 40
First article after loading all content:
BLOG
AI for safety leaders: How will emerging technology impact our daily work in healthcare?
Read blog


In [21]:
# Extract links from all loaded articles
article_links = []

for article in articles:
    try:
        link_tag = article.find_element(By.TAG_NAME, 'a')
        link = link_tag.get_attribute('href')
        if link and link.startswith('https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/'):
            article_links.append(link)
    except Exception as e:
        print(f"Error extracting link: {e}")

# Display the collected links
print(f"Collected {len(article_links)} article links.")
for link in article_links[:5]:  # Preview the first 5 links
    print(link)

Collected 40 article links.
https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/ai-safety-leaders
https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/early-updates-healthcare-policy
https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/mcahps-2025-changes
https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/2025-themes
https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/health-outcome-surveys-2026
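
The loop above appends every matching `href`, so a link that appears in more than one card would be scraped twice. The output does not show whether that happens on this page; the sketch below is a precaution. It filters by the same URL prefix and removes repeats while keeping the first-seen order. `unique_article_links` and `BLOG_PREFIX` are hypothetical names.

```python
BLOG_PREFIX = "https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/"


def unique_article_links(links, prefix=BLOG_PREFIX):
    """Keep links under prefix, dropping None values and repeats, in first-seen order."""
    kept = [link for link in links if link and link.startswith(prefix)]
    # dict preserves insertion order, so this removes duplicates stably
    return list(dict.fromkeys(kept))
```

Applied here, `article_links = unique_article_links(article_links)` would leave the list unchanged if every link is already unique.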


In [22]:
# Load the first article to start testing
test_link = article_links[0]  # Select the first article link
driver.get(test_link)
print(f"Loaded article: {test_link}")

# Pause to ensure the page is fully loaded
time.sleep(3)


Loaded article: https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/ai-safety-leaders


In [23]:
# Extract the article title
try:
    title = driver.find_element(By.TAG_NAME, 'h1').text
    print(f"Title: {title}")
except Exception as e:
    print(f"Failed to extract title: {e}")

Title: AI for safety leaders: How will emerging technology impact our daily work in healthcare?


In [24]:
# Extract the publish date
try:
    date = driver.find_element(By.CLASS_NAME, 'uf-datetime').text
    print(f"Publish Date: {date}")
except Exception as e:
    print(f"Failed to extract date: {e}")

Publish Date: 31 January 2025
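
The date comes back as display text (`31 January 2025`). For sorting or filtering by year it helps to convert it to a real date. The sketch below assumes every article uses this day, month name, year layout and that the Python locale is English, since `%B` is locale-dependent. `parse_publish_date` is a hypothetical helper.

```python
from datetime import datetime


def parse_publish_date(text):
    """Parse a date such as '31 January 2025'; return None if it does not match."""
    try:
        return datetime.strptime(text.strip(), "%d %B %Y").date()
    except ValueError:
        return None
```

Returning `None` instead of raising means a string in any other layout does not stop a batch run.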


In [26]:
# Extract the article content
try:
    content_section = driver.find_element(By.ID, 'uf-item-blog-content')
    paragraphs = content_section.find_elements(By.TAG_NAME, 'p')
    
    # Combine all paragraph texts
    content = ' '.join([p.text for p in paragraphs])
    print(f"Content Preview: {content[:500]}...")  # Preview first 500 characters
except Exception as e:
    print(f"Failed to extract content: {e}")


Content Preview: Artificial intelligence (AI) is revolutionizing the healthcare landscape as we speak. Its extraordinary potential to streamline processes, sharpen decision-making, and improve efficiencies is becoming increasingly evident. And its impact on safety leaders is no exception.  While much of the conversation has centered around rapid technological and futuristic innovations, its most profound effects may also be seen and felt in the day-to-day work of improving patient safety by improving  data quali...
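
The preview above contains doubled spaces (for example before "While" and before "data"), which come from joining paragraph texts that already carry stray whitespace. A small cleanup step before text analysis could look like the sketch below. `normalize_whitespace` is a hypothetical helper and is not part of the notebook.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace (spaces, newlines, non-breaking spaces) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()
```

It would be applied as `content = normalize_whitespace(content)` right after the paragraphs are joined.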


In [27]:
# Combine the extracted data
article_data = {
    'url': test_link,
    'title': title,
    'date': date,
    'content': content
}

# Display the complete extracted data
print("\nExtracted Article Data:")
print(article_data)



Extracted Article Data:
{'url': 'https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/ai-safety-leaders', 'title': 'AI for safety leaders: How will emerging technology impact our daily work in healthcare?', 'date': '31 January 2025', 'content': 'Artificial intelligence (AI) is revolutionizing the healthcare landscape as we speak. Its extraordinary potential to streamline processes, sharpen decision-making, and improve efficiencies is becoming increasingly evident. And its impact on safety leaders is no exception.  While much of the conversation has centered around rapid technological and futuristic innovations, its most profound effects may also be seen and felt in the day-to-day work of improving patient safety by improving  data quality and, empowering safety leaders to make more informed decisions, anticipate risks earlier, and develop stronger interventions. Healthcare safety leaders must first recognize their essential role in AI governance. Robust governan

In [28]:
# Function to scrape title, date, and content from each article
def scrape_article_content(links):
    contents = []

    for idx, link in enumerate(links):
        driver.get(link)
        time.sleep(3)  # Wait for the page to load

        print(f"Scraping article {idx + 1} of {len(links)}: {link}")

        # Extract title
        try:
            title = driver.find_element(By.TAG_NAME, 'h1').text
        except Exception:
            title = "No Title Found"
        
        # Extract publish date
        try:
            date = driver.find_element(By.CLASS_NAME, 'uf-datetime').text
        except Exception:
            date = "No Date Found"

        # Extract article content
        try:
            content_section = driver.find_element(By.ID, 'uf-item-blog-content')
            paragraphs = content_section.find_elements(By.TAG_NAME, 'p')
            content = ' '.join([p.text for p in paragraphs])
        except Exception:
            content = "No Content Found"

        # Append the extracted data to the list
        contents.append({
            'url': link,
            'title': title,
            'date': date,
            'content': content
        })

    return contents

# Run the function to scrape all articles
article_data = scrape_article_content(article_links)

# Display a sample of the scraped articles
for article in article_data[:3]:
    print(f"Title: {article['title']}")
    print(f"Date: {article['date']}")
    print(f"Content Preview: {article['content'][:300]}...\n")

Scraping article 1 of 40: https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/ai-safety-leaders
Scraping article 2 of 40: https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/early-updates-healthcare-policy
Scraping article 3 of 40: https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/mcahps-2025-changes
Scraping article 4 of 40: https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/2025-themes
Scraping article 5 of 40: https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/health-outcome-surveys-2026
Scraping article 6 of 40: https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/hx-2025-organizational-resilience
Scraping article 7 of 40: https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/community-advisor
Scraping article 8 of 40: https://info.pressganey.com/press-ganey-blog-healthcare-experience-insights/white-house-conference
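
Because the function above substitutes placeholder strings when an element is missing, a failed extraction looks like ordinary data once it reaches the DataFrame. A quick check after scraping, sketched here with the hypothetical helper `incomplete_records`, lists the URLs whose fields are empty or still hold a placeholder.

```python
PLACEHOLDERS = {"No Title Found", "No Date Found", "No Content Found"}


def incomplete_records(records):
    """Return (url, missing_fields) pairs for records with an empty or placeholder value."""
    problems = []
    for record in records:
        missing = [
            field
            for field in ("title", "date", "content")
            if not record.get(field) or record[field] in PLACEHOLDERS
        ]
        if missing:
            problems.append((record.get("url"), missing))
    return problems
```

Calling `incomplete_records(article_data)` before building the DataFrame shows which pages, if any, need a second pass.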

In [29]:
# Convert the scraped data to a DataFrame
df_articles = pd.DataFrame(article_data)

# Save the DataFrame to a CSV file (create the output directory first if it is missing)
import os
os.makedirs('data', exist_ok=True)
df_articles.to_csv('data/press_ganey_blog_articles.csv', index=False)

# Display the first few articles to verify
print(df_articles.head())


                                                 url  \
0  https://info.pressganey.com/press-ganey-blog-h...   
1  https://info.pressganey.com/press-ganey-blog-h...   
2  https://info.pressganey.com/press-ganey-blog-h...   
3  https://info.pressganey.com/press-ganey-blog-h...   
4  https://info.pressganey.com/press-ganey-blog-h...   

                                               title             date  \
0  AI for safety leaders: How will emerging techn...  31 January 2025   
1  Early updates from the new administration on h...  23 January 2025   
2      MCAHPS 2025: Your guide to the latest changes  22 January 2025   
3               3 critical strategic themes for 2025  21 January 2025   
4  How Health Outcome Surveys will play a crucial...  17 January 2025   

                                             content  
0  Artificial intelligence (AI) is revolutionizin...  
1  As the new administration begins its term, it’...  
2  The Medicare Consumer Assessment of Healthcare...  
3  2