## TOP 5 OUTLETS
#### top_5_outlets.csv
1. Anytime Fitness MacPherson Mall (5/5)- 126 Reviews
2. Anytime Fitness City Square Mall (4.9/5)- 1051 Ratings
3. Anytime Fitness Bedok 85 (4.9/5)- 1047 Ratings, _ Reviews
4. Anytime Fitness Bukit Timah Central (4.9/5)- 960 Reviews
5. Anytime Fitness Buona Vista (4.9/5)- 837 Reviews

## BOTTOM 5 OUTLETS
#### bottom_5_outlets.csv
141. Anytime Fitness Paya Lebar (3.7/5)- 139 Reviews
142. Anytime Fitness Upper Cross Street (3.5/5)- 142 Reviews
143. Anytime Fitness hillV2 (3.2/5)- 112 Reviews
144. Anytime Fitness NEX (3.1/5)- 303 Reviews
145. Anytime Fitness Northpoint City (3/5)- 242 Reviews

In [17]:
!pip install selenium webdriver-manager



### REVIEW EXTRACTION: (1) MACPHERSON MALL
#### 126 Ratings & Reviews

In [None]:
import time
import os
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from selenium.webdriver.common.action_chains import ActionChains
from webdriver_manager.chrome import ChromeDriverManager

# ==============================
# LOAD OUTLET DATA
# ==============================
df_outlets = pd.read_csv("top_5_outlets.csv")  # columns: name, maps_url

# Pick top outlet
df_top1 = df_outlets.head(1)
outlet_name = df_top1.iloc[0]["name"]
outlet_url = df_top1.iloc[0]["maps_url"]

print(f"üß≠ Testing scrape for: {outlet_name}")
print(f"üîó URL: {outlet_url}")

# ==============================
# SETUP CHROME DRIVER
# ==============================
options = ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--lang=en-US")
options.add_argument("--window-size=1920,1080")
# options.add_argument("--headless")  # uncomment to run in background

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options
)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
wait = WebDriverWait(driver, 15)
actions = ActionChains(driver)
print("‚úÖ Chrome WebDriver initialized successfully.")

# ==============================
# OPEN OUTLET PAGE AND CLICK REVIEWS
# ==============================
driver.get(outlet_url)
time.sleep(3)

# Click the "Reviews" button
reviews_button = wait.until(
    EC.element_to_be_clickable((By.XPATH, '//button[contains(@aria-label, "Reviews for")]'))
)
reviews_button.click()
time.sleep(2)

# ==============================
# SCROLL AND SCRAPE REVIEWS
# ==============================
# Find the scrollable container - this is the key fix
scrollable_div = wait.until(
    EC.presence_of_element_located((By.XPATH, '//div[contains(@class, "m6QErb") and contains(@class, "DxyBCb")]'))
)

all_reviews_data = []
seen_review_ids = set()
no_text_rating_count = 0

scroll_pause = 1.5  # Slightly faster
no_new_count = 0
max_no_new = 5  # Increased patience
previous_height = 0

print("\nüîÑ Starting to scroll and collect reviews...")

scroll_iteration = 0
while True:
    scroll_iteration += 1
    
    # Get current scroll height
    current_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
    
    # Find all reviews currently loaded
    review_elements = driver.find_elements(By.XPATH, '//div[@data-review-id]')
    new_reviews = [r for r in review_elements if r.get_attribute("data-review-id") not in seen_review_ids]

    if new_reviews:
        no_new_count = 0
        print(f"üìä Iteration {scroll_iteration}: Found {len(new_reviews)} new reviews (Total: {len(seen_review_ids) + len(new_reviews)})")
    else:
        no_new_count += 1
        print(f"‚è≥ No new reviews found. Waiting... ({no_new_count}/{max_no_new})")
        if no_new_count >= max_no_new:
            print("‚úã Reached end of reviews.")
            break

    for r in new_reviews:
        review_id = r.get_attribute("data-review-id")
        
        # Skip if already processed
        if review_id in seen_review_ids:
            continue
            
        seen_review_ids.add(review_id)
        
        try:
            # Expand truncated review text ("More" button)
            try:
                more_button = r.find_element(By.CLASS_NAME, 'w8nwRe')
                driver.execute_script("arguments[0].click();", more_button)
                time.sleep(0.15)
            except (NoSuchElementException, StaleElementReferenceException):
                pass

            # Extract review data
            author_name = r.find_element(By.CLASS_NAME, 'd4r55').text
            
            # Check if this is an owner response (no rating element)
            rating_elements = r.find_elements(By.CLASS_NAME, 'kvMYJc')
            if not rating_elements:
                continue  # Skip owner responses
            
            rating_element = rating_elements[0]
            rating_text = rating_element.get_attribute('aria-label')
            star_rating = int(rating_text.split(' ')[0])

            # Extract review text (will be empty string if no text is present)
            review_text = r.find_element(By.CLASS_NAME, 'wiI7pd').text.strip()

            if not review_text:
                no_text_rating_count += 1

            try:
                date_element = r.find_element(By.CLASS_NAME, 'rsqApe')
                posting_date = date_element.text
            except NoSuchElementException:
                posting_date = "Date not found"

            all_reviews_data.append({
                "outlet": outlet_name,
                "author": author_name,
                "rating": star_rating,
                "text": review_text,
                "date_posted": posting_date
            })
        except Exception as e:
            continue

    # Scroll down smoothly - KEY FIX: Multiple small scrolls
    for _ in range(3):
        driver.execute_script(
            "arguments[0].scrollBy(0, arguments[0].scrollHeight / 3);", 
            scrollable_div
        )
        time.sleep(0.3)
    
    # Additional wait for lazy loading
    time.sleep(scroll_pause)
    
    # Check if we've actually scrolled
    new_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
    if new_height == previous_height and len(new_reviews) == 0:
        no_new_count += 1
    previous_height = new_height

# ==============================
# SAVE RESULTS
# ==============================
driver.quit()

if all_reviews_data:
    os.makedirs("Reviews/Best", exist_ok=True)
    output_filename = os.path.join("Reviews/Best", f"{outlet_name}_reviews.csv")
    df_reviews = pd.DataFrame(all_reviews_data)
    
    # Remove duplicates based on author + text combination
    initial_count = len(df_reviews)
    df_reviews = df_reviews.drop_duplicates(subset=['author', 'text'], keep='first')
    final_count = len(df_reviews)
    
    if initial_count > final_count:
        print(f"‚ö†Ô∏è  Removed {initial_count - final_count} duplicate reviews")
    
    df_reviews.to_csv(output_filename, index=False)
    print(f"\n‚úÖ Saved {final_count} unique reviews to '{output_filename}'")
    print(f"üìÑ Found {no_text_rating_count} reviews that were ratings only (no text).")
    print(f"üìà Review breakdown by rating:")
    print(df_reviews['rating'].value_counts().sort_index(ascending=False))
else:
    print("‚ùå No reviews were scraped.")


FileNotFoundError: [Errno 2] No such file or directory: 'top_5_outlets.csv'

In [None]:
# Checking that the scrollable container is correctly identified (map reviews side panel)
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

# ==============================
# LOAD OUTLET DATA
# ==============================
df_outlets = pd.read_csv("top_5_outlets.csv")
df_top1 = df_outlets.head(1)
outlet_name = df_top1.iloc[0]["name"]
outlet_url = df_top1.iloc[0]["maps_url"]

print(f"üß≠ Testing: {outlet_name}")
print(f"üîó URL: {outlet_url}\n")

# ==============================
# SETUP CHROME DRIVER
# ==============================
options = ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--lang=en-US")
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options
)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
wait = WebDriverWait(driver, 15)

# ==============================
# OPEN PAGE AND CLICK REVIEWS
# ==============================
driver.get(outlet_url)
time.sleep(3)

reviews_button = wait.until(
    EC.element_to_be_clickable((By.XPATH, '//button[contains(@aria-label, "Reviews for")]'))
)
reviews_button.click()
time.sleep(2)

# ==============================
# METHOD 1: TEST MULTIPLE XPATHS
# ==============================
print("=" * 60)
print("METHOD 1: Testing Different XPath Selectors")
print("=" * 60)

xpaths_to_test = [
    ('//div[@role="main"]', "Main role div"),
    ('//div[contains(@class, "m6QErb")]', "m6QErb class"),
    ('//div[contains(@class, "DxyBCb")]', "DxyBCb class"),
    ('//div[contains(@class, "m6QErb") and contains(@class, "DxyBCb")]', "Both m6QErb and DxyBCb"),
    ('//div[@data-review-id]//ancestor::div[contains(@style, "overflow")]', "Overflow styled ancestor"),
]

for xpath, description in xpaths_to_test:
    try:
        elements = driver.find_elements(By.XPATH, xpath)
        if elements:
            elem = elements[0]
            # Check scrollability properties
            scroll_height = driver.execute_script("return arguments[0].scrollHeight", elem)
            client_height = driver.execute_script("return arguments[0].clientHeight", elem)
            overflow_y = driver.execute_script("return window.getComputedStyle(arguments[0]).overflowY", elem)
            
            is_scrollable = scroll_height > client_height and overflow_y in ['scroll', 'auto']
            
            print(f"\n‚úì {description}")
            print(f"  Found: {len(elements)} element(s)")
            print(f"  scrollHeight: {scroll_height}px")
            print(f"  clientHeight: {client_height}px")
            print(f"  overflow-y: {overflow_y}")
            print(f"  ‚Üí Scrollable: {'YES ‚úì' if is_scrollable else 'NO ‚úó'}")
        else:
            print(f"\n‚úó {description}")
            print(f"  Found: 0 elements")
    except Exception as e:
        print(f"\n‚úó {description}")
        print(f"  Error: {str(e)[:100]}")

# ==============================
# METHOD 2: VISUAL TEST WITH HIGHLIGHTING
# ==============================
print("\n" + "=" * 60)
print("METHOD 2: Visual Highlighting Test")
print("=" * 60)
print("\nüé® Highlighting potential scrollable containers...")
print("   (Watch your browser window!)\n")

# Try to find and highlight the container
try:
    scrollable_div = driver.find_element(By.XPATH, '//div[contains(@class, "m6QErb") and contains(@class, "DxyBCb")]')
    
    # Highlight it with a red border
    driver.execute_script("""
        arguments[0].style.border = '5px solid red';
        arguments[0].style.backgroundColor = 'rgba(255, 0, 0, 0.1)';
    """, scrollable_div)
    
    print("‚úì Red border applied to the detected scrollable container")
    print("  ‚Üí Check your browser to see if it's around the reviews panel")
    time.sleep(3)
    
except Exception as e:
    print(f"‚úó Could not highlight: {e}")

# ==============================
# METHOD 3: SCROLL TEST
# ==============================
print("\n" + "=" * 60)
print("METHOD 3: Actual Scroll Test")
print("=" * 60)

try:
    scrollable_div = driver.find_element(By.XPATH, '//div[contains(@class, "m6QErb") and contains(@class, "DxyBCb")]')
    
    # Count reviews before scroll
    reviews_before = len(driver.find_elements(By.XPATH, '//div[@data-review-id]'))
    print(f"\nüìä Reviews visible before scroll: {reviews_before}")
    
    # Scroll down
    print("üîÑ Scrolling down...")
    driver.execute_script("arguments[0].scrollTo(0, arguments[0].scrollHeight);", scrollable_div)
    time.sleep(2)
    
    # Count reviews after scroll
    reviews_after = len(driver.find_elements(By.XPATH, '//div[@data-review-id]'))
    print(f"üìä Reviews visible after scroll: {reviews_after}")
    
    if reviews_after > reviews_before:
        print(f"‚úì SUCCESS! Loaded {reviews_after - reviews_before} new reviews")
    else:
        print("‚úó WARNING: No new reviews loaded - might be wrong container or already at end")
    
    # Try one more scroll
    print("\nüîÑ Scrolling again...")
    driver.execute_script("arguments[0].scrollTo(0, arguments[0].scrollHeight);", scrollable_div)
    time.sleep(2)
    
    reviews_after_2 = len(driver.find_elements(By.XPATH, '//div[@data-review-id]'))
    print(f"üìä Reviews after 2nd scroll: {reviews_after_2}")
    
    if reviews_after_2 > reviews_after:
        print(f"‚úì SUCCESS! Loaded {reviews_after_2 - reviews_after} more reviews")
    else:
        print("‚Üí No more reviews loaded (might have reached the end)")
        
except Exception as e:
    print(f"‚úó Scroll test failed: {e}")

# ==============================
# METHOD 4: INSPECT ALL SCROLLABLE DIVS
# ==============================
print("\n" + "=" * 60)
print("METHOD 4: Find ALL Scrollable Divs")
print("=" * 60)

all_divs = driver.find_elements(By.TAG_NAME, "div")
scrollable_divs = []

for i, div in enumerate(all_divs):
    try:
        scroll_height = driver.execute_script("return arguments[0].scrollHeight", div)
        client_height = driver.execute_script("return arguments[0].clientHeight", div)
        overflow_y = driver.execute_script("return window.getComputedStyle(arguments[0]).overflowY", div)
        
        if scroll_height > client_height and overflow_y in ['scroll', 'auto']:
            class_name = div.get_attribute("class") or "no-class"
            scrollable_divs.append({
                'index': i,
                'classes': class_name[:80],  # Truncate long class names
                'scrollHeight': scroll_height,
                'clientHeight': client_height
            })
    except:
        continue

print(f"\nFound {len(scrollable_divs)} scrollable divs on page:\n")
for div_info in scrollable_divs[:10]:  # Show first 10
    print(f"  [{div_info['index']}] {div_info['classes']}")
    print(f"      scrollHeight: {div_info['scrollHeight']}px, clientHeight: {div_info['clientHeight']}px\n")

if len(scrollable_divs) > 10:
    print(f"  ... and {len(scrollable_divs) - 10} more")

# ==============================
# KEEP BROWSER OPEN
# ==============================
print("\n" + "=" * 60)
print("‚úã Browser will stay open for 10 seconds for inspection")
print("   Press Ctrl+C to close immediately")
print("=" * 60)

try:
    time.sleep(10)
except KeyboardInterrupt:
    print("\nüëã Closing browser...")

driver.quit()
print("\n‚úÖ Test complete!")

üß≠ Testing: Anytime Fitness MacPherson Mall
üîó URL: https://www.google.com/maps/place/?q=place_id:ChIJX41bNAAX2jER8L9rlgPDC7E

METHOD 1: Testing Different XPath Selectors

‚úì Main role div
  Found: 1 element(s)
  scrollHeight: 652px
  clientHeight: 652px
  overflow-y: visible
  ‚Üí Scrollable: NO ‚úó

‚úì m6QErb class
  Found: 10 element(s)
  scrollHeight: 0px
  clientHeight: 0px
  overflow-y: visible
  ‚Üí Scrollable: NO ‚úó

‚úì DxyBCb class
  Found: 1 element(s)
  scrollHeight: 7070px
  clientHeight: 603px
  overflow-y: auto
  ‚Üí Scrollable: YES ‚úì

‚úì Both m6QErb and DxyBCb
  Found: 1 element(s)
  scrollHeight: 7070px
  clientHeight: 603px
  overflow-y: auto
  ‚Üí Scrollable: YES ‚úì

‚úó Overflow styled ancestor
  Found: 0 elements

METHOD 2: Visual Highlighting Test

üé® Highlighting potential scrollable containers...
   (Watch your browser window!)

‚úì Red border applied to the detected scrollable container
  ‚Üí Check your browser to see if it's around the reviews pan

### REVIEW EXTRACTION: (1) MACPHERSON MALL
<span style="font-size: 15pt; font-weight: bold;">1051 Ratings, 900 Reviews</span> 

<span style="font-size: 15pt; font-weight: bold;">151 Empty Reviews (Ratings only)</span> 

In [None]:
# ==============================
# LOAD OUTLET DATA
# ==============================
df_outlets = pd.read_csv("top_5_outlets.csv") 

# Pick 2nd top outlet
outlet_name = df_outlets.iloc[1]["name"]
outlet_url = df_outlets.iloc[1]["maps_url"]

print(f"üß≠ Testing scrape for: {outlet_name}")
print(f"üîó URL: {outlet_url}")

# ==============================
# SETUP CHROME DRIVER
# ==============================
options = ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--lang=en-US")
options.add_argument("--window-size=1920,1080")
# options.add_argument("--headless")  # uncomment to run in background

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options
)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
wait = WebDriverWait(driver, 15)
actions = ActionChains(driver)
print("‚úÖ Chrome WebDriver initialized successfully.")

# ==============================
# OPEN OUTLET PAGE AND CLICK REVIEWS
# ==============================
driver.get(outlet_url)
time.sleep(3)

# Click the "Reviews" button
reviews_button = wait.until(
    EC.element_to_be_clickable((By.XPATH, '//button[contains(@aria-label, "Reviews for")]'))
)
reviews_button.click()
time.sleep(2)

# ==============================
# SCROLL AND SCRAPE REVIEWS
# ==============================
# Find the scrollable container - this is the key fix
scrollable_div = wait.until(
    EC.presence_of_element_located((By.XPATH, '//div[contains(@class, "m6QErb") and contains(@class, "DxyBCb")]'))
)

all_reviews_data = []
seen_review_ids = set()
no_text_rating_count = 0

scroll_pause = 2  
no_new_count = 0
max_no_new = 5  # Increased patience
previous_height = 0

print("\nüîÑ Starting to scroll and collect reviews...")

scroll_iteration = 0
while True:
    scroll_iteration += 1
    
    # Get current scroll height
    current_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
    
    # Find all reviews currently loaded
    review_elements = driver.find_elements(By.XPATH, '//div[@data-review-id]')
    new_reviews = [r for r in review_elements if r.get_attribute("data-review-id") not in seen_review_ids]

    if new_reviews:
        no_new_count = 0
        print(f"üìä Iteration {scroll_iteration}: Found {len(new_reviews)} new reviews (Total: {len(seen_review_ids) + len(new_reviews)})")
    else:
        no_new_count += 1
        print(f"‚è≥ No new reviews found. Waiting... ({no_new_count}/{max_no_new})")
        if no_new_count >= max_no_new:
            print("‚úã Reached end of reviews.")
            break

    for r in new_reviews:
        review_id = r.get_attribute("data-review-id")
        
        # Skip if already processed
        if review_id in seen_review_ids:
            continue
            
        seen_review_ids.add(review_id)
        
        try:
            # Expand truncated review text ("More" button)
            try:
                more_button = r.find_element(By.CLASS_NAME, 'w8nwRe')
                driver.execute_script("arguments[0].click();", more_button)
                time.sleep(0.15)
            except (NoSuchElementException, StaleElementReferenceException):
                pass

            # Extract review data
            author_name = r.find_element(By.CLASS_NAME, 'd4r55').text
            
            # Check if this is an owner response (no rating element)
            rating_elements = r.find_elements(By.CLASS_NAME, 'kvMYJc')
            if not rating_elements:
                continue  # Skip owner responses
            
            rating_element = rating_elements[0]
            rating_text = rating_element.get_attribute('aria-label')
            star_rating = int(rating_text.split(' ')[0])
            review_text = r.find_element(By.CLASS_NAME, 'wiI7pd').text.strip()

            if not review_text:
                no_text_rating_count += 1

            try:
                date_element = r.find_element(By.CLASS_NAME, 'rsqApe')
                posting_date = date_element.text
            except NoSuchElementException:
                posting_date = "Date not found"

            all_reviews_data.append({
                "outlet": outlet_name,
                "author": author_name,
                "rating": star_rating,
                "text": review_text,
                "date_posted": posting_date
            })
        except Exception as e:
            continue

    # Scroll down smoothly - KEY FIX: Multiple small scrolls
    for _ in range(3):
        driver.execute_script(
            "arguments[0].scrollBy(0, arguments[0].scrollHeight / 3);", 
            scrollable_div
        )
        time.sleep(0.5)
    
    # Additional wait for lazy loading
    time.sleep(scroll_pause)
    
    new_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
    
    # If the scroll height hasn't changed AND we didn't find new reviews, increment the patience counter
    if new_height == previous_height:
        if not new_reviews:
            no_new_count += 1
        # If new_reviews *was* found but the height didn't change, we reset the patience counter
        # because the page likely rendered hidden reviews without scrolling.
    else:
        no_new_count = 0 # Reset patience if scrolling was successful
        
    previous_height = new_height

    # After your existing scroll logic, try scrolling back up occasionally
    if scroll_iteration % 10 == 0:
        driver.execute_script("arguments[0].scrollBy(0, -500);", scrollable_div)
        time.sleep(0.5)
        driver.execute_script("arguments[0].scrollBy(0, 1000);", scrollable_div)    

# ==============================
# SAVE RESULTS
# ==============================
driver.quit()

if all_reviews_data:
    os.makedirs("Reviews/Best", exist_ok=True)
    output_filename = os.path.join("Reviews/Best", f"{outlet_name}_reviews.csv")
    df_reviews = pd.DataFrame(all_reviews_data)
    
    # Remove duplicates based on author + text combination
    initial_count = len(df_reviews)
    df_reviews = df_reviews.drop_duplicates(subset=['author', 'text'], keep='first')
    final_count = len(df_reviews)
    
    if initial_count > final_count:
        print(f"‚ö†Ô∏è  Removed {initial_count - final_count} duplicate reviews")
    
    df_reviews.to_csv(output_filename, index=False)
    print(f"\n‚úÖ Saved {final_count} unique reviews to '{output_filename}'")
    print(f"üìÑ Found {no_text_rating_count} reviews that were ratings only (no text).")
    print(f"üìà Review breakdown by rating:")
    print(df_reviews['rating'].value_counts().sort_index(ascending=False))
else:
    print("‚ùå No reviews were scraped.")

üß≠ Testing scrape for: Anytime Fitness City Square Mall
üîó URL: https://www.google.com/maps/place/?q=place_id:ChIJDzz1mk8Z2jER9spstsimgQE
‚úÖ Chrome WebDriver initialized successfully.

üîÑ Starting to scroll and collect reviews...
üìä Iteration 1: Found 20 new reviews (Total: 20)
üìä Iteration 2: Found 20 new reviews (Total: 30)
‚è≥ No new reviews found. Waiting... (1/5)
üìä Iteration 4: Found 20 new reviews (Total: 40)
üìä Iteration 5: Found 60 new reviews (Total: 90)
üìä Iteration 6: Found 60 new reviews (Total: 120)
üìä Iteration 7: Found 60 new reviews (Total: 150)
üìä Iteration 8: Found 60 new reviews (Total: 180)
üìä Iteration 9: Found 40 new reviews (Total: 190)
üìä Iteration 10: Found 80 new reviews (Total: 250)
üìä Iteration 11: Found 40 new reviews (Total: 250)
üìä Iteration 12: Found 60 new reviews (Total: 290)
üìä Iteration 13: Found 60 new reviews (Total: 320)
üìä Iteration 14: Found 60 new reviews (Total: 350)
üìä Iteration 15: Found 60 new reviews (To

### REVIEW EXTRACTION: (3) BEDOK 85
#### 1047 Ratings, 855 Reviews

In [47]:
# ==============================
# LOAD OUTLET DATA
# ==============================
df_outlets = pd.read_csv("top_5_outlets.csv") 

# Pick 2nd top outlet
outlet_name = df_outlets.iloc[2]["name"]
outlet_url = df_outlets.iloc[2]["maps_url"]

print(f"üß≠ Testing scrape for: {outlet_name}")
print(f"üîó URL: {outlet_url}")

# ==============================
# SETUP CHROME DRIVER
# ==============================
options = ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--lang=en-US")
options.add_argument("--window-size=1920,1080")
# options.add_argument("--headless")  # uncomment to run in background

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options
)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
wait = WebDriverWait(driver, 15)
actions = ActionChains(driver)
print("‚úÖ Chrome WebDriver initialized successfully.")

# ==============================
# OPEN OUTLET PAGE AND CLICK REVIEWS
# ==============================
driver.get(outlet_url)
time.sleep(3)

# Click the "Reviews" button
reviews_button = wait.until(
    EC.element_to_be_clickable((By.XPATH, '//button[contains(@aria-label, "Reviews for")]'))
)
reviews_button.click()
time.sleep(2)

# ==============================
# SCROLL AND SCRAPE REVIEWS
# ==============================
# Find the scrollable container - this is the key fix
scrollable_div = wait.until(
    EC.presence_of_element_located((By.XPATH, '//div[contains(@class, "m6QErb") and contains(@class, "DxyBCb")]'))
)

all_reviews_data = []
seen_review_ids = set()

# TRACKING COUNTERS
total_review_elements_found = 0
empty_text_reviews = 0
extraction_errors = 0
owner_responses_skipped = 0

scroll_pause = 2  
no_new_count = 0
max_no_new = 5
previous_height = 0

print("\nüîÑ Starting to scroll and collect reviews...")

scroll_iteration = 0
while True:
    scroll_iteration += 1
    
    # Get current scroll height
    current_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
    
    # Find all reviews currently loaded
    review_elements = driver.find_elements(By.XPATH, '//div[@data-review-id]')
    new_reviews = [r for r in review_elements if r.get_attribute("data-review-id") not in seen_review_ids]

    if new_reviews:
        no_new_count = 0
        print(f"üìä Iteration {scroll_iteration}: Found {len(new_reviews)} new reviews (Total: {len(seen_review_ids) + len(new_reviews)})")
    else:
        no_new_count += 1
        print(f"‚è≥ No new reviews found. Waiting... ({no_new_count}/{max_no_new})")
        if no_new_count >= max_no_new:
            print("‚úã Reached end of reviews.")
            break

    for r in new_reviews:
        review_id = r.get_attribute("data-review-id")
        
        # Skip if already processed
        if review_id in seen_review_ids:
            continue
            
        seen_review_ids.add(review_id)
        total_review_elements_found += 1
        
        try:
            # Expand truncated review text ("More" button)
            try:
                more_button = r.find_element(By.CLASS_NAME, 'w8nwRe')
                driver.execute_script("arguments[0].click();", more_button)
                time.sleep(0.15)
            except (NoSuchElementException, StaleElementReferenceException):
                pass

            # Re-fetch the element from DOM to avoid stale references
            try:
                r = driver.find_element(By.XPATH, f'//div[@data-review-id="{review_id}"]')
            except NoSuchElementException:
                extraction_errors += 1
                continue

            # Extract author (gracefully handle missing)
            author_elems = r.find_elements(By.CLASS_NAME, 'd4r55')
            author_name = author_elems[0].text if author_elems else ''
            
            # Check if this is an owner response by looking for the "Response from the owner" label
            owner_response_labels = r.find_elements(By.CLASS_NAME, 'fontTitleSmall')
            is_owner_response = any('response from the owner' in label.text.lower() for label in owner_response_labels)
            if is_owner_response:
                owner_responses_skipped += 1
                continue  # Skip owner responses
            
            # Extract rating
            rating_elements = r.find_elements(By.CLASS_NAME, 'kvMYJc')
            if not rating_elements:
                star_rating = None
            else:
                rating_text = rating_elements[0].get_attribute('aria-label') or ''
                try:
                    star_rating = int(rating_text.split(' ')[0])
                except (ValueError, IndexError):
                    star_rating = None
            
            # Extract review text - use find_elements to avoid exceptions
            text_elems = r.find_elements(By.CLASS_NAME, 'wiI7pd')
            review_text = (text_elems[0].text or '').strip() if text_elems else ''
            
            # Track empty text reviews
            if not review_text:
                empty_text_reviews += 1

            # Extract date (gracefully handle missing)
            date_elems = r.find_elements(By.CLASS_NAME, 'rsqApe')
            posting_date = date_elems[0].text if date_elems else "Date not found"

            all_reviews_data.append({
                "outlet": outlet_name,
                "author": author_name,
                "rating": star_rating,
                "text": review_text,
                "date_posted": posting_date,
                "review_id": review_id
            })
        except Exception as e:
            extraction_errors += 1
            continue

    # Scroll down smoothly - KEY FIX: Multiple small scrolls
    for _ in range(3):
        driver.execute_script(
            "arguments[0].scrollBy(0, arguments[0].scrollHeight / 3);", 
            scrollable_div
        )
        time.sleep(0.5)
    
    # Additional wait for lazy loading
    time.sleep(scroll_pause)
    
    new_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
    
    # If the scroll height hasn't changed AND we didn't find new reviews, increment the patience counter
    if new_height == previous_height:
        if not new_reviews:
            no_new_count += 1
        # If new_reviews *was* found but the height didn't change, we reset the patience counter
        # because the page likely rendered hidden reviews without scrolling.
    else:
        no_new_count = 0 # Reset patience if scrolling was successful
        
    previous_height = new_height

    # After your existing scroll logic, try scrolling back up occasionally
    if scroll_iteration % 10 == 0:
        driver.execute_script("arguments[0].scrollBy(0, -500);", scrollable_div)
        time.sleep(0.5)
        driver.execute_script("arguments[0].scrollBy(0, 1000);", scrollable_div)    

# ==============================
# SAVE RESULTS
# ==============================
driver.quit()

if all_reviews_data:
    os.makedirs("Reviews", exist_ok=True)
    output_filename = os.path.join("Reviews", f"{outlet_name}_reviews.csv")
    df_reviews = pd.DataFrame(all_reviews_data)
    
    # Remove duplicates based on review_id (most reliable)
    initial_count = len(df_reviews)
    df_reviews = df_reviews.drop_duplicates(subset=['review_id'], keep='first')
    final_count = len(df_reviews)
    
    if initial_count > final_count:
        print(f"‚ö†Ô∏è  Removed {initial_count - final_count} duplicate reviews")
    
    df_reviews.to_csv(output_filename, index=False)
    
    print(f"\n{'='*60}")
    print(f"üìä SCRAPING SUMMARY FOR: {outlet_name}")
    print(f"{'='*60}")
    print(f"Total review elements found:     {total_review_elements_found}")
    print(f"Owner responses (skipped):       {owner_responses_skipped}")
    print(f"Extraction errors (failed):      {extraction_errors}")
    print(f"Successfully processed:          {total_review_elements_found - extraction_errors - owner_responses_skipped}")
    print(f"  - With text:                   {(total_review_elements_found - extraction_errors - owner_responses_skipped) - empty_text_reviews}")
    print(f"  - Rating-only (no text):       {empty_text_reviews}")
    print(f"Total extracted to list:         {len(all_reviews_data)}")
    print(f"Duplicates removed:              {initial_count - final_count}")
    print(f"Final unique reviews saved:      {final_count}")
    print(f"{'='*60}")
    print(f"üìÑ Review breakdown by rating:")
    print(df_reviews['rating'].value_counts().sort_index(ascending=False))
else:
    print("‚ùå No reviews were scraped.")

üß≠ Testing scrape for: Anytime Fitness Bedok 85
üîó URL: https://www.google.com/maps/place/?q=place_id:ChIJa71bPCI92jERWtJc9IYBGUo
‚úÖ Chrome WebDriver initialized successfully.

üîÑ Starting to scroll and collect reviews...
üìä Iteration 1: Found 20 new reviews (Total: 20)
üìä Iteration 2: Found 20 new reviews (Total: 30)
üìä Iteration 3: Found 40 new reviews (Total: 60)
üìä Iteration 4: Found 60 new reviews (Total: 100)
üìä Iteration 5: Found 60 new reviews (Total: 130)
üìä Iteration 6: Found 60 new reviews (Total: 160)
üìä Iteration 7: Found 60 new reviews (Total: 190)
üìä Iteration 8: Found 60 new reviews (Total: 220)
üìä Iteration 9: Found 60 new reviews (Total: 250)
üìä Iteration 10: Found 60 new reviews (Total: 280)
üìä Iteration 11: Found 60 new reviews (Total: 310)
üìä Iteration 12: Found 60 new reviews (Total: 340)
üìä Iteration 13: Found 60 new reviews (Total: 370)
üìä Iteration 14: Found 20 new reviews (Total: 360)
üìä Iteration 15: Found 60 new reviews (

### REVIEW EXTRACTION: (4) Bukit Timah Central
#### 960 Ratings, 760 Reviews

In [46]:
# ==============================
# LOAD OUTLET DATA
# ==============================
df_outlets = pd.read_csv("top_5_outlets.csv") 

# Pick 2nd top outlet
outlet_name = df_outlets.iloc[3]["name"]
outlet_url = df_outlets.iloc[3]["maps_url"]

print(f"üß≠ Testing scrape for: {outlet_name}")
print(f"üîó URL: {outlet_url}")

# ==============================
# SETUP CHROME DRIVER
# ==============================
options = ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--lang=en-US")
options.add_argument("--window-size=1920,1080")
# options.add_argument("--headless")  # uncomment to run in background

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options
)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
wait = WebDriverWait(driver, 15)
actions = ActionChains(driver)
print("‚úÖ Chrome WebDriver initialized successfully.")

# ==============================
# OPEN OUTLET PAGE AND CLICK REVIEWS
# ==============================
driver.get(outlet_url)
time.sleep(3)

# Click the "Reviews" button
reviews_button = wait.until(
    EC.element_to_be_clickable((By.XPATH, '//button[contains(@aria-label, "Reviews for")]'))
)
reviews_button.click()
time.sleep(2)

# ==============================
# SCROLL AND SCRAPE REVIEWS
# ==============================
# Find the scrollable container - this is the key fix
scrollable_div = wait.until(
    EC.presence_of_element_located((By.XPATH, '//div[contains(@class, "m6QErb") and contains(@class, "DxyBCb")]'))
)

all_reviews_data = []
seen_review_ids = set()

# TRACKING COUNTERS
total_review_elements_found = 0
empty_text_reviews = 0
extraction_errors = 0
owner_responses_skipped = 0

scroll_pause = 2  
no_new_count = 0
max_no_new = 5
previous_height = 0

print("\nüîÑ Starting to scroll and collect reviews...")

scroll_iteration = 0
while True:
    scroll_iteration += 1
    
    # Get current scroll height
    current_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
    
    # Find all reviews currently loaded
    review_elements = driver.find_elements(By.XPATH, '//div[@data-review-id]')
    new_reviews = [r for r in review_elements if r.get_attribute("data-review-id") not in seen_review_ids]

    if new_reviews:
        no_new_count = 0
        print(f"üìä Iteration {scroll_iteration}: Found {len(new_reviews)} new reviews (Total: {len(seen_review_ids) + len(new_reviews)})")
    else:
        no_new_count += 1
        print(f"‚è≥ No new reviews found. Waiting... ({no_new_count}/{max_no_new})")
        if no_new_count >= max_no_new:
            print("‚úã Reached end of reviews.")
            break

    for r in new_reviews:
        review_id = r.get_attribute("data-review-id")
        
        # Skip if already processed
        if review_id in seen_review_ids:
            continue
            
        seen_review_ids.add(review_id)
        total_review_elements_found += 1
        
        try:
            # Expand truncated review text ("More" button)
            try:
                more_button = r.find_element(By.CLASS_NAME, 'w8nwRe')
                driver.execute_script("arguments[0].click();", more_button)
                time.sleep(0.15)
            except (NoSuchElementException, StaleElementReferenceException):
                pass

            # Re-fetch the element from DOM to avoid stale references
            try:
                r = driver.find_element(By.XPATH, f'//div[@data-review-id="{review_id}"]')
            except NoSuchElementException:
                extraction_errors += 1
                continue

            # Extract author (gracefully handle missing)
            author_elems = r.find_elements(By.CLASS_NAME, 'd4r55')
            author_name = author_elems[0].text if author_elems else ''
            
            # Check if this is an owner response by looking for the "Response from the owner" label
            owner_response_labels = r.find_elements(By.CLASS_NAME, 'fontTitleSmall')
            is_owner_response = any('response from the owner' in label.text.lower() for label in owner_response_labels)
            if is_owner_response:
                owner_responses_skipped += 1
                continue  # Skip owner responses
            
            # Extract rating
            rating_elements = r.find_elements(By.CLASS_NAME, 'kvMYJc')
            if not rating_elements:
                star_rating = None
            else:
                rating_text = rating_elements[0].get_attribute('aria-label') or ''
                try:
                    star_rating = int(rating_text.split(' ')[0])
                except (ValueError, IndexError):
                    star_rating = None
            
            # Extract review text - use find_elements to avoid exceptions
            text_elems = r.find_elements(By.CLASS_NAME, 'wiI7pd')
            review_text = (text_elems[0].text or '').strip() if text_elems else ''
            
            # Track empty text reviews
            if not review_text:
                empty_text_reviews += 1

            # Extract date (gracefully handle missing)
            date_elems = r.find_elements(By.CLASS_NAME, 'rsqApe')
            posting_date = date_elems[0].text if date_elems else "Date not found"

            all_reviews_data.append({
                "outlet": outlet_name,
                "author": author_name,
                "rating": star_rating,
                "text": review_text,
                "date_posted": posting_date,
                "review_id": review_id
            })
        except Exception as e:
            extraction_errors += 1
            continue

    # Scroll down smoothly - KEY FIX: Multiple small scrolls
    for _ in range(3):
        driver.execute_script(
            "arguments[0].scrollBy(0, arguments[0].scrollHeight / 3);", 
            scrollable_div
        )
        time.sleep(0.5)
    
    # Additional wait for lazy loading
    time.sleep(scroll_pause)
    
    new_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
    
    # If the scroll height hasn't changed AND we didn't find new reviews, increment the patience counter
    if new_height == previous_height:
        if not new_reviews:
            no_new_count += 1
        # If new_reviews *was* found but the height didn't change, we reset the patience counter
        # because the page likely rendered hidden reviews without scrolling.
    else:
        no_new_count = 0 # Reset patience if scrolling was successful
        
    previous_height = new_height

    # After your existing scroll logic, try scrolling back up occasionally
    if scroll_iteration % 10 == 0:
        driver.execute_script("arguments[0].scrollBy(0, -500);", scrollable_div)
        time.sleep(0.5)
        driver.execute_script("arguments[0].scrollBy(0, 1000);", scrollable_div)    

# ==============================
# SAVE RESULTS
# ==============================
driver.quit()

if all_reviews_data:
    os.makedirs("Reviews", exist_ok=True)
    output_filename = os.path.join("Reviews", f"{outlet_name}_reviews.csv")
    df_reviews = pd.DataFrame(all_reviews_data)
    
    # Remove duplicates based on review_id (most reliable)
    initial_count = len(df_reviews)
    df_reviews = df_reviews.drop_duplicates(subset=['review_id'], keep='first')
    final_count = len(df_reviews)
    
    if initial_count > final_count:
        print(f"‚ö†Ô∏è  Removed {initial_count - final_count} duplicate reviews")
    
    df_reviews.to_csv(output_filename, index=False)
    
    print(f"\n{'='*60}")
    print(f"üìä SCRAPING SUMMARY FOR: {outlet_name}")
    print(f"{'='*60}")
    print(f"Total review elements found:     {total_review_elements_found}")
    print(f"Owner responses (skipped):       {owner_responses_skipped}")
    print(f"Extraction errors (failed):      {extraction_errors}")
    print(f"Successfully processed:          {total_review_elements_found - extraction_errors - owner_responses_skipped}")
    print(f"  - With text:                   {(total_review_elements_found - extraction_errors - owner_responses_skipped) - empty_text_reviews}")
    print(f"  - Rating-only (no text):       {empty_text_reviews}")
    print(f"Total extracted to list:         {len(all_reviews_data)}")
    print(f"Duplicates removed:              {initial_count - final_count}")
    print(f"Final unique reviews saved:      {final_count}")
    print(f"{'='*60}")
    print(f"üìÑ Review breakdown by rating:")
    print(df_reviews['rating'].value_counts().sort_index(ascending=False))
else:
    print("‚ùå No reviews were scraped.")

üß≠ Testing scrape for: Anytime Fitness Bukit Timah Central
üîó URL: https://www.google.com/maps/place/?q=place_id:ChIJU5AB6_sb2jERZXpNUjEz7Fk
‚úÖ Chrome WebDriver initialized successfully.

üîÑ Starting to scroll and collect reviews...
üìä Iteration 1: Found 20 new reviews (Total: 20)
üìä Iteration 2: Found 20 new reviews (Total: 30)
üìä Iteration 3: Found 40 new reviews (Total: 60)
üìä Iteration 4: Found 60 new reviews (Total: 100)
üìä Iteration 5: Found 60 new reviews (Total: 130)
üìä Iteration 6: Found 60 new reviews (Total: 160)
üìä Iteration 7: Found 60 new reviews (Total: 190)
üìä Iteration 8: Found 60 new reviews (Total: 220)
üìä Iteration 9: Found 60 new reviews (Total: 250)
üìä Iteration 10: Found 60 new reviews (Total: 280)
üìä Iteration 11: Found 60 new reviews (Total: 310)
üìä Iteration 12: Found 60 new reviews (Total: 340)
‚è≥ No new reviews found. Waiting... (1/5)
üìä Iteration 14: Found 80 new reviews (Total: 390)
üìä Iteration 15: Found 60 new reviews 

### REVIEW EXTRACTION: (5) Buona Vista
#### 837 Ratings, 733 Reviews

In [37]:
# ==============================
# LOAD OUTLET DATA
# ==============================
df_outlets = pd.read_csv("top_5_outlets.csv") 

# Pick 2nd top outlet
outlet_name = df_outlets.iloc[4]["name"]
outlet_url = df_outlets.iloc[4]["maps_url"]

print(f"üß≠ Testing scrape for: {outlet_name}")
print(f"üîó URL: {outlet_url}")

# ==============================
# SETUP CHROME DRIVER
# ==============================
options = ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--lang=en-US")
options.add_argument("--window-size=1920,1080")
# options.add_argument("--headless")  # uncomment to run in background

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options
)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
wait = WebDriverWait(driver, 15)
actions = ActionChains(driver)
print("‚úÖ Chrome WebDriver initialized successfully.")

# ==============================
# OPEN OUTLET PAGE AND CLICK REVIEWS
# ==============================
driver.get(outlet_url)
time.sleep(3)

# Click the "Reviews" button
reviews_button = wait.until(
    EC.element_to_be_clickable((By.XPATH, '//button[contains(@aria-label, "Reviews for")]'))
)
reviews_button.click()
time.sleep(2)

# ==============================
# SCROLL AND SCRAPE REVIEWS
# ==============================
# Find the scrollable container - this is the key fix
scrollable_div = wait.until(
    EC.presence_of_element_located((By.XPATH, '//div[contains(@class, "m6QErb") and contains(@class, "DxyBCb")]'))
)

all_reviews_data = []
seen_review_ids = set()
no_text_rating_count = 0

scroll_pause = 2  
no_new_count = 0
max_no_new = 5  # Increased patience
previous_height = 0

print("\nüîÑ Starting to scroll and collect reviews...")

scroll_iteration = 0
while True:
    scroll_iteration += 1
    
    # Get current scroll height
    current_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
    
    # Find all reviews currently loaded
    review_elements = driver.find_elements(By.XPATH, '//div[@data-review-id]')
    new_reviews = [r for r in review_elements if r.get_attribute("data-review-id") not in seen_review_ids]

    if new_reviews:
        no_new_count = 0
        print(f"üìä Iteration {scroll_iteration}: Found {len(new_reviews)} new reviews (Total: {len(seen_review_ids) + len(new_reviews)})")
    else:
        no_new_count += 1
        print(f"‚è≥ No new reviews found. Waiting... ({no_new_count}/{max_no_new})")
        if no_new_count >= max_no_new:
            print("‚úã Reached end of reviews.")
            break

    for r in new_reviews:
        review_id = r.get_attribute("data-review-id")
        
        # Skip if already processed
        if review_id in seen_review_ids:
            continue
            
        seen_review_ids.add(review_id)
        
        try:
            # Expand truncated review text ("More" button)
            try:
                more_button = r.find_element(By.CLASS_NAME, 'w8nwRe')
                driver.execute_script("arguments[0].click();", more_button)
                time.sleep(0.15)
            except (NoSuchElementException, StaleElementReferenceException):
                pass

            # Extract review data
            author_name = r.find_element(By.CLASS_NAME, 'd4r55').text
            
            # Check if this is an owner response (no rating element)
            rating_elements = r.find_elements(By.CLASS_NAME, 'kvMYJc')
            if not rating_elements:
                continue  # Skip owner responses
            
            rating_element = rating_elements[0]
            rating_text = rating_element.get_attribute('aria-label')
            star_rating = int(rating_text.split(' ')[0])
            review_text = r.find_element(By.CLASS_NAME, 'wiI7pd').text.strip()

            if not review_text:
                no_text_rating_count += 1

            try:
                date_element = r.find_element(By.CLASS_NAME, 'rsqApe')
                posting_date = date_element.text
            except NoSuchElementException:
                posting_date = "Date not found"

            all_reviews_data.append({
                "outlet": outlet_name,
                "author": author_name,
                "rating": star_rating,
                "text": review_text,
                "date_posted": posting_date
            })
        except Exception as e:
            continue

    # Scroll down smoothly - KEY FIX: Multiple small scrolls
    for _ in range(3):
        driver.execute_script(
            "arguments[0].scrollBy(0, arguments[0].scrollHeight / 3);", 
            scrollable_div
        )
        time.sleep(0.5)
    
    # Additional wait for lazy loading
    time.sleep(scroll_pause)
    
    new_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
    
    # If the scroll height hasn't changed AND we didn't find new reviews, increment the patience counter
    if new_height == previous_height:
        if not new_reviews:
            no_new_count += 1
        # If new_reviews *was* found but the height didn't change, we reset the patience counter
        # because the page likely rendered hidden reviews without scrolling.
    else:
        no_new_count = 0 # Reset patience if scrolling was successful
        
    previous_height = new_height

    # After your existing scroll logic, try scrolling back up occasionally
    if scroll_iteration % 10 == 0:
        driver.execute_script("arguments[0].scrollBy(0, -500);", scrollable_div)
        time.sleep(0.5)
        driver.execute_script("arguments[0].scrollBy(0, 1000);", scrollable_div)    

# ==============================
# SAVE RESULTS
# ==============================
driver.quit()

if all_reviews_data:
    os.makedirs("Reviews", exist_ok=True)
    output_filename = os.path.join("Reviews", f"{outlet_name}_reviews.csv")
    df_reviews = pd.DataFrame(all_reviews_data)
    
    # Remove duplicates based on author + text combination
    initial_count = len(df_reviews)
    df_reviews = df_reviews.drop_duplicates(subset=['author', 'text'], keep='first')
    final_count = len(df_reviews)
    
    if initial_count > final_count:
        print(f"‚ö†Ô∏è  Removed {initial_count - final_count} duplicate reviews")
    
    df_reviews.to_csv(output_filename, index=False)
    print(f"\n‚úÖ Saved {final_count} unique reviews to '{output_filename}'")
    print(f"üìÑ Found {no_text_rating_count} reviews that were ratings only (no text).")
    print(f"üìà Review breakdown by rating:")
    print(df_reviews['rating'].value_counts().sort_index(ascending=False))
else:
    print("‚ùå No reviews were scraped.")

üß≠ Testing scrape for: Anytime Fitness Buona Vista
üîó URL: https://www.google.com/maps/place/?q=place_id:ChIJS1v-9jQb2jER7XzNEUFv8uM
‚úÖ Chrome WebDriver initialized successfully.

üîÑ Starting to scroll and collect reviews...
üìä Iteration 1: Found 20 new reviews (Total: 20)
üìä Iteration 2: Found 20 new reviews (Total: 30)
üìä Iteration 3: Found 40 new reviews (Total: 60)
üìä Iteration 4: Found 60 new reviews (Total: 100)
üìä Iteration 5: Found 60 new reviews (Total: 130)
üìä Iteration 6: Found 60 new reviews (Total: 160)
üìä Iteration 7: Found 60 new reviews (Total: 190)
üìä Iteration 8: Found 60 new reviews (Total: 220)
üìä Iteration 9: Found 60 new reviews (Total: 250)
üìä Iteration 10: Found 60 new reviews (Total: 280)
üìä Iteration 11: Found 60 new reviews (Total: 310)
üìä Iteration 12: Found 60 new reviews (Total: 340)
üìä Iteration 13: Found 40 new reviews (Total: 350)
üìä Iteration 14: Found 60 new reviews (Total: 390)
üìä Iteration 15: Found 40 new review

### *Bonus: REVIEW EXTRACTION- Outlet #6, Havelock Outram, 606 Ratings, 507 Reviews

In [38]:
# ==============================
# LOAD OUTLET DATA
# ==============================
df_outlets = pd.read_csv("top_5_outlets.csv") 

# Pick 2nd top outlet
outlet_name = df_outlets.iloc[5]["name"]
outlet_url = df_outlets.iloc[5]["maps_url"]

print(f"üß≠ Testing scrape for: {outlet_name}")
print(f"üîó URL: {outlet_url}")

# ==============================
# SETUP CHROME DRIVER
# ==============================
options = ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--lang=en-US")
options.add_argument("--window-size=1920,1080")
# options.add_argument("--headless")  # uncomment to run in background

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options
)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
wait = WebDriverWait(driver, 15)
actions = ActionChains(driver)
print("‚úÖ Chrome WebDriver initialized successfully.")

# ==============================
# OPEN OUTLET PAGE AND CLICK REVIEWS
# ==============================
driver.get(outlet_url)
time.sleep(3)

# Click the "Reviews" button
reviews_button = wait.until(
    EC.element_to_be_clickable((By.XPATH, '//button[contains(@aria-label, "Reviews for")]'))
)
reviews_button.click()
time.sleep(2)

# ==============================
# SCROLL AND SCRAPE REVIEWS
# ==============================
# Find the scrollable container - this is the key fix
scrollable_div = wait.until(
    EC.presence_of_element_located((By.XPATH, '//div[contains(@class, "m6QErb") and contains(@class, "DxyBCb")]'))
)

all_reviews_data = []
seen_review_ids = set()
no_text_rating_count = 0

scroll_pause = 2  
no_new_count = 0
max_no_new = 5  # Increased patience
previous_height = 0

print("\nüîÑ Starting to scroll and collect reviews...")

scroll_iteration = 0
while True:
    scroll_iteration += 1
    
    # Get current scroll height
    current_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
    
    # Find all reviews currently loaded
    review_elements = driver.find_elements(By.XPATH, '//div[@data-review-id]')
    new_reviews = [r for r in review_elements if r.get_attribute("data-review-id") not in seen_review_ids]

    if new_reviews:
        no_new_count = 0
        print(f"üìä Iteration {scroll_iteration}: Found {len(new_reviews)} new reviews (Total: {len(seen_review_ids) + len(new_reviews)})")
    else:
        no_new_count += 1
        print(f"‚è≥ No new reviews found. Waiting... ({no_new_count}/{max_no_new})")
        if no_new_count >= max_no_new:
            print("‚úã Reached end of reviews.")
            break

    for r in new_reviews:
        review_id = r.get_attribute("data-review-id")
        
        # Skip if already processed
        if review_id in seen_review_ids:
            continue
            
        seen_review_ids.add(review_id)
        
        try:
            # Expand truncated review text ("More" button)
            try:
                more_button = r.find_element(By.CLASS_NAME, 'w8nwRe')
                driver.execute_script("arguments[0].click();", more_button)
                time.sleep(0.15)
            except (NoSuchElementException, StaleElementReferenceException):
                pass

            # Extract review data
            author_name = r.find_element(By.CLASS_NAME, 'd4r55').text
            
            # Check if this is an owner response (no rating element)
            rating_elements = r.find_elements(By.CLASS_NAME, 'kvMYJc')
            if not rating_elements:
                continue  # Skip owner responses
            
            rating_element = rating_elements[0]
            rating_text = rating_element.get_attribute('aria-label')
            star_rating = int(rating_text.split(' ')[0])
            review_text = r.find_element(By.CLASS_NAME, 'wiI7pd').text.strip()

            if not review_text:
                no_text_rating_count += 1

            try:
                date_element = r.find_element(By.CLASS_NAME, 'rsqApe')
                posting_date = date_element.text
            except NoSuchElementException:
                posting_date = "Date not found"

            all_reviews_data.append({
                "outlet": outlet_name,
                "author": author_name,
                "rating": star_rating,
                "text": review_text,
                "date_posted": posting_date
            })
        except Exception as e:
            continue

    # Scroll down smoothly - KEY FIX: Multiple small scrolls
    for _ in range(3):
        driver.execute_script(
            "arguments[0].scrollBy(0, arguments[0].scrollHeight / 3);", 
            scrollable_div
        )
        time.sleep(0.5)
    
    # Additional wait for lazy loading
    time.sleep(scroll_pause)
    
    new_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
    
    # If the scroll height hasn't changed AND we didn't find new reviews, increment the patience counter
    if new_height == previous_height:
        if not new_reviews:
            no_new_count += 1
        # If new_reviews *was* found but the height didn't change, we reset the patience counter
        # because the page likely rendered hidden reviews without scrolling.
    else:
        no_new_count = 0 # Reset patience if scrolling was successful
        
    previous_height = new_height

    # After your existing scroll logic, try scrolling back up occasionally
    if scroll_iteration % 10 == 0:
        driver.execute_script("arguments[0].scrollBy(0, -500);", scrollable_div)
        time.sleep(0.5)
        driver.execute_script("arguments[0].scrollBy(0, 1000);", scrollable_div)    

# ==============================
# SAVE RESULTS
# ==============================
driver.quit()

if all_reviews_data:
    os.makedirs("Reviews", exist_ok=True)
    output_filename = os.path.join("Reviews", f"{outlet_name}_reviews.csv")
    df_reviews = pd.DataFrame(all_reviews_data)
    
    # Remove duplicates based on author + text combination
    initial_count = len(df_reviews)
    df_reviews = df_reviews.drop_duplicates(subset=['author', 'text'], keep='first')
    final_count = len(df_reviews)
    
    if initial_count > final_count:
        print(f"‚ö†Ô∏è  Removed {initial_count - final_count} duplicate reviews")
    
    df_reviews.to_csv(output_filename, index=False)
    print(f"\n‚úÖ Saved {final_count} unique reviews to '{output_filename}'")
    print(f"üìÑ Found {no_text_rating_count} reviews that were ratings only (no text).")
    print(f"üìà Review breakdown by rating:")
    print(df_reviews['rating'].value_counts().sort_index(ascending=False))
else:
    print("‚ùå No reviews were scraped.")

üß≠ Testing scrape for: Anytime Fitness Havelock Outram
üîó URL: https://www.google.com/maps/place/?q=place_id:ChIJJyOThiMZ2jER9H-usPsO2g0
‚úÖ Chrome WebDriver initialized successfully.

üîÑ Starting to scroll and collect reviews...
‚è≥ No new reviews found. Waiting... (1/5)
üìä Iteration 2: Found 20 new reviews (Total: 20)
üìä Iteration 3: Found 20 new reviews (Total: 30)
üìä Iteration 4: Found 40 new reviews (Total: 60)
üìä Iteration 5: Found 60 new reviews (Total: 100)
üìä Iteration 6: Found 60 new reviews (Total: 130)
üìä Iteration 7: Found 60 new reviews (Total: 160)
üìä Iteration 8: Found 60 new reviews (Total: 190)
üìä Iteration 9: Found 60 new reviews (Total: 220)
üìä Iteration 10: Found 60 new reviews (Total: 250)
üìä Iteration 11: Found 60 new reviews (Total: 280)
üìä Iteration 12: Found 60 new reviews (Total: 310)
üìä Iteration 13: Found 60 new reviews (Total: 340)
üìä Iteration 14: Found 40 new reviews (Total: 350)
üìä Iteration 15: Found 60 new reviews (To

### REVIEW EXTRACTION: BOTTOM 5 OUTLETS
140. ---
*141. Anytime Fitness Paya Lebar (3.7/5)- 139 Reviews, 75 Reviews
*142. Anytime Fitness Upper Cross Street (3.5/5)- 142 Reviews, 89 Reviews
*143. Anytime Fitness hillV2 (3.2/5)- 112 Reviews, 80 Reviews
144. Anytime Fitness NEX (3.1/5)- 303 Reviews, 171 Reviews
145. Anytime Fitness Northpoint City (3/5)- 242 Ratings, 157 Reviews

*(<100 reviews after scraping)

In [None]:
import time
import os
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from selenium.webdriver.common.action_chains import ActionChains
from webdriver_manager.chrome import ChromeDriverManager

# ==============================
# --- 1. SETUP DRIVER (ONLY ONCE) ---
# ==============================
options = ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--lang=en-US")
options.add_argument("--window-size=1920,1080")
options.add_argument("--headless=new")  # Recommended for stable batch processing

print("‚öôÔ∏è Setting up WebDriver...")
try:
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=options
    )
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    wait = WebDriverWait(driver, 20) # Increased wait time for robustness
    actions = ActionChains(driver)
    print("‚úÖ Chrome WebDriver initialized successfully.")
except Exception as e:
    print(f"‚ùå Failed to initialize WebDriver: {e}")
    exit()

# ==============================
# --- 2. SCRAPING FUNCTION ---
# ==============================

def scrape_reviews(outlet_name, outlet_url, driver, wait):
    """Navigates to the outlet, scrapes all reviews, and saves the data."""
    print(f"\n--- üß≠ Starting scrape for: {outlet_name} ---")
    print(f"üîó URL: {outlet_url}")

    all_reviews_data = []
    seen_review_ids = set()
    no_text_rating_count = 0
    scroll_pause = 2.0
    max_no_new = 8 # Increased patience
    
    try:
        # OPEN OUTLET PAGE AND CLICK REVIEWS
        driver.get(outlet_url)
        time.sleep(3)

        # Click the "Reviews" button
        reviews_button = wait.until(
            EC.element_to_be_clickable((By.XPATH, '//button[contains(@aria-label, "Reviews for")]'))
        )
        reviews_button.click()
        time.sleep(2)

        # SCROLL AND SCRAPE REVIEWS
        scrollable_div = wait.until(
            EC.presence_of_element_located((By.XPATH, '//div[contains(@class, "m6QErb") and contains(@class, "DxyBCb")]'))
        )

        no_new_count = 0
        previous_height = 0
        scroll_iteration = 0

        print("üîÑ Starting to scroll and collect reviews...")
        
        while True:
            scroll_iteration += 1
            
            # Find all reviews currently loaded
            review_elements = driver.find_elements(By.XPATH, '//div[@data-review-id]')
            new_reviews = [r for r in review_elements if r.get_attribute("data-review-id") not in seen_review_ids]

            if new_reviews:
                no_new_count = 0
                print(f"üìä Iteration {scroll_iteration}: Found {len(new_reviews)} new reviews (Total: {len(seen_review_ids) + len(new_reviews)})")
            else:
                no_new_count += 1
                print(f"‚è≥ No new reviews found. Waiting... ({no_new_count}/{max_no_new})")
                if no_new_count >= max_no_new:
                    print("‚úã Reached end of reviews.")
                    break
                    
            # Process new reviews
            for r in new_reviews:
                review_id = r.get_attribute("data-review-id")
                
                if review_id in seen_review_ids:
                    continue
                    
                seen_review_ids.add(review_id)
                
                try:
                    # Expand truncated review text ("More" button)
                    try:
                        more_button = r.find_element(By.CLASS_NAME, 'w8nwRe')
                        driver.execute_script("arguments[0].click();", more_button)
                        time.sleep(0.15)
                    except (NoSuchElementException, StaleElementReferenceException):
                        pass

                    # Extract data and skip owner response
                    author_name = r.find_element(By.CLASS_NAME, 'd4r55').text
                    rating_elements = r.find_elements(By.CLASS_NAME, 'kvMYJc')
                    if not rating_elements:
                        continue 
                    
                    rating_element = rating_elements[0]
                    rating_text = rating_element.get_attribute('aria-label')
                    star_rating = int(rating_text.split(' ')[0])

                    # Robust extraction of review text:
                    # - use find_elements to avoid NoSuchElementException when the text node is absent
                    # - normalize text by coercing None -> '' and stripping whitespace
                    text_elems = r.find_elements(By.CLASS_NAME, 'wiI7pd')
                    if text_elems:
                        review_text = (text_elems[0].text or '').strip()
                    else:
                        review_text = ''

                    if review_text == '':
                        no_text_rating_count += 1

                    try:
                        date_element = r.find_element(By.CLASS_NAME, 'rsqApe')
                        posting_date = date_element.text
                    except NoSuchElementException:
                        posting_date = "Date not found"

                    all_reviews_data.append({
                        "outlet": outlet_name,
                        "author": author_name,
                        "rating": star_rating,
                        "text": review_text,
                        "date_posted": posting_date
                    })
                except Exception:
                    continue

            # Scroll down the large jump for loading
            driver.execute_script(
                "arguments[0].scrollBy(0, 5000);", 
                scrollable_div
            )
            time.sleep(scroll_pause)

            # Check for scroll height change
            new_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
            if new_height == previous_height:
                if not new_reviews:
                    no_new_count += 1
            else:
                no_new_count = 0
                
            previous_height = new_height

            # Back-scroll occasionally for stability
            if scroll_iteration % 10 == 0 and scroll_iteration > 0:
                driver.execute_script("arguments[0].scrollBy(0, -200);", scrollable_div) 
                time.sleep(0.5)
                driver.execute_script("arguments[0].scrollBy(0, 400);", scrollable_div)


    except Exception as e:
        print(f"üö® An error occurred while scraping {outlet_name}: {e}")
    
    # --- SAVE RESULTS ---
    if all_reviews_data:
        os.makedirs("Reviews/ Worst", exist_ok=True)
        # Clean the name for a safe filename
        safe_outlet_name = "".join(c for c in outlet_name if c.isalnum() or c in (' ', '_')).rstrip()
        output_filename = os.path.join("Reviews/ Worst", f"{safe_outlet_name}_reviews.csv")
        df_reviews = pd.DataFrame(all_reviews_data)
        
        # Remove duplicates
        initial_count = len(df_reviews)
        df_reviews = df_reviews.drop_duplicates(subset=['author', 'text'], keep='first')
        final_count = len(df_reviews)
        
        if initial_count > final_count:
            print(f"‚ö†Ô∏è  Removed {initial_count - final_count} duplicate reviews")
        
        df_reviews.to_csv(output_filename, index=False)
        print(f"‚úÖ Saved {final_count} unique reviews to '{output_filename}'")
        print(f"üìÑ Found {no_text_rating_count} reviews that were ratings only (no text).")
        # print(df_reviews['rating'].value_counts().sort_index(ascending=False))
    else:
        print(f"‚ùå No reviews were scraped for {outlet_name}.")

# ==============================
# --- 3. MAIN EXECUTION LOOP ---
# ==============================

df_outlets = pd.read_csv("bottom_5_outlets.csv")
print(f"\nüöÄ Found {len(df_outlets)} outlets to scrape.")

# Iterate over each row (each outlet) in the DataFrame
for index, row in df_outlets.iterrows():
    outlet_name = row["name"]
    outlet_url = row["maps_url"]
    
    # Call the scraping function for the current outlet
    scrape_reviews(outlet_name, outlet_url, driver, wait)

# Clean up and close the browser after the loop finishes
driver.quit()
print("\n--- üèÅ All scraping complete. Driver closed. ---")

‚öôÔ∏è Setting up WebDriver...
‚úÖ Chrome WebDriver initialized successfully.

üöÄ Found 5 outlets to scrape.

--- üß≠ Starting scrape for: Anytime Fitness Northpoint City ---
üîó URL: https://www.google.com/maps/place/?q=place_id:ChIJqTFUhpIV2jER8kd6GBxvZ0A
üîÑ Starting to scroll and collect reviews...
üìä Iteration 1: Found 20 new reviews (Total: 20)
‚è≥ No new reviews found. Waiting... (1/8)
üìä Iteration 3: Found 20 new reviews (Total: 30)
üìä Iteration 4: Found 20 new reviews (Total: 40)
‚è≥ No new reviews found. Waiting... (1/8)
üìä Iteration 6: Found 20 new reviews (Total: 50)
‚è≥ No new reviews found. Waiting... (1/8)
üìä Iteration 8: Found 20 new reviews (Total: 60)
üìä Iteration 9: Found 20 new reviews (Total: 70)
üìä Iteration 10: Found 20 new reviews (Total: 80)
üìä Iteration 11: Found 20 new reviews (Total: 90)
üìä Iteration 12: Found 20 new reviews (Total: 100)
üìä Iteration 13: Found 20 new reviews (Total: 110)
üìä Iteration 14: Found 20 new reviews (Total

## RE-SCRAPE BOTTOM 20 OUTLETS

In [3]:
import time
import os
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from selenium.webdriver.common.action_chains import ActionChains
from webdriver_manager.chrome import ChromeDriverManager

# ==============================
# --- 1. SETUP DRIVER (ONLY ONCE) ---
# ==============================
options = ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--lang=en-US")
options.add_argument("--window-size=1920,1080")
options.add_argument("--headless=new")  # Recommended for stable batch processing

print("‚öôÔ∏è Setting up WebDriver...")
try:
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=options
    )
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    wait = WebDriverWait(driver, 20) # Increased wait time for robustness
    actions = ActionChains(driver)
    print("‚úÖ Chrome WebDriver initialized successfully.")
except Exception as e:
    print(f"‚ùå Failed to initialize WebDriver: {e}")
    exit()

# ==============================
# --- 2. SCRAPING FUNCTION ---
# ==============================

def scrape_reviews(outlet_name, outlet_url, driver, wait):
    """Navigates to the outlet, scrapes all reviews, and saves the data."""
    print(f"\n--- üß≠ Starting scrape for: {outlet_name} ---")
    print(f"üîó URL: {outlet_url}")

    all_reviews_data = []
    seen_review_ids = set()
    no_text_rating_count = 0
    scroll_pause = 2.0
    max_no_new = 8 # Increased patience
    
    try:
        # OPEN OUTLET PAGE AND CLICK REVIEWS
        driver.get(outlet_url)
        time.sleep(3)

        # Click the "Reviews" button
        reviews_button = wait.until(
            EC.element_to_be_clickable((By.XPATH, '//button[contains(@aria-label, "Reviews for")]'))
        )
        reviews_button.click()
        time.sleep(2)

        # SCROLL AND SCRAPE REVIEWS
        scrollable_div = wait.until(
            EC.presence_of_element_located((By.XPATH, '//div[contains(@class, "m6QErb") and contains(@class, "DxyBCb")]'))
        )

        no_new_count = 0
        previous_height = 0
        scroll_iteration = 0

        print("üîÑ Starting to scroll and collect reviews...")
        
        while True:
            scroll_iteration += 1
            
            # Find all reviews currently loaded
            review_elements = driver.find_elements(By.XPATH, '//div[@data-review-id]')
            new_reviews = [r for r in review_elements if r.get_attribute("data-review-id") not in seen_review_ids]

            if new_reviews:
                no_new_count = 0
                print(f"üìä Iteration {scroll_iteration}: Found {len(new_reviews)} new reviews (Total: {len(seen_review_ids) + len(new_reviews)})")
            else:
                no_new_count += 1
                print(f"‚è≥ No new reviews found. Waiting... ({no_new_count}/{max_no_new})")
                if no_new_count >= max_no_new:
                    print("‚úã Reached end of reviews.")
                    break
                    
            # Process new reviews
            for r in new_reviews:
                review_id = r.get_attribute("data-review-id")
                
                if review_id in seen_review_ids:
                    continue
                    
                seen_review_ids.add(review_id)
                
                try:
                    # Expand truncated review text ("More" button)
                    try:
                        more_button = r.find_element(By.CLASS_NAME, 'w8nwRe')
                        driver.execute_script("arguments[0].click();", more_button)
                        time.sleep(0.15)
                    except (NoSuchElementException, StaleElementReferenceException):
                        pass

                    # Extract data and skip owner response
                    author_name = r.find_element(By.CLASS_NAME, 'd4r55').text
                    rating_elements = r.find_elements(By.CLASS_NAME, 'kvMYJc')
                    if not rating_elements:
                        continue 
                    
                    rating_element = rating_elements[0]
                    rating_text = rating_element.get_attribute('aria-label')
                    star_rating = int(rating_text.split(' ')[0])

                    # Robust extraction of review text:
                    # - use find_elements to avoid NoSuchElementException when the text node is absent
                    # - normalize text by coercing None -> '' and stripping whitespace
                    text_elems = r.find_elements(By.CLASS_NAME, 'wiI7pd')
                    if text_elems:
                        review_text = (text_elems[0].text or '').strip()
                    else:
                        review_text = ''

                    if review_text == '':
                        no_text_rating_count += 1

                    try:
                        date_element = r.find_element(By.CLASS_NAME, 'rsqApe')
                        posting_date = date_element.text
                    except NoSuchElementException:
                        posting_date = "Date not found"

                    all_reviews_data.append({
                        "outlet": outlet_name,
                        "author": author_name,
                        "rating": star_rating,
                        "text": review_text,
                        "date_posted": posting_date
                    })
                except Exception:
                    continue

            # Scroll down the large jump for loading
            driver.execute_script(
                "arguments[0].scrollBy(0, 5000);", 
                scrollable_div
            )
            time.sleep(scroll_pause)

            # Check for scroll height change
            new_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
            if new_height == previous_height:
                if not new_reviews:
                    no_new_count += 1
            else:
                no_new_count = 0
                
            previous_height = new_height

            # Back-scroll occasionally for stability
            if scroll_iteration % 10 == 0 and scroll_iteration > 0:
                driver.execute_script("arguments[0].scrollBy(0, -200);", scrollable_div) 
                time.sleep(0.5)
                driver.execute_script("arguments[0].scrollBy(0, 400);", scrollable_div)


    except Exception as e:
        print(f"üö® An error occurred while scraping {outlet_name}: {e}")
    
    # --- SAVE RESULTS ---
    if all_reviews_data:
        os.makedirs("Reviews/ Worst", exist_ok=True)
        # Clean the name for a safe filename
        safe_outlet_name = "".join(c for c in outlet_name if c.isalnum() or c in (' ', '_')).rstrip()
        output_filename = os.path.join("Reviews/ Worst", f"{safe_outlet_name}_reviews.csv")
        df_reviews = pd.DataFrame(all_reviews_data)
        
        # Remove duplicates
        initial_count = len(df_reviews)
        df_reviews = df_reviews.drop_duplicates(subset=['author', 'text'], keep='first')
        final_count = len(df_reviews)
        
        if initial_count > final_count:
            print(f"‚ö†Ô∏è  Removed {initial_count - final_count} duplicate reviews")
        
        df_reviews.to_csv(output_filename, index=False)
        print(f"‚úÖ Saved {final_count} unique reviews to '{output_filename}'")
        print(f"üìÑ Found {no_text_rating_count} reviews that were ratings only (no text).")
        # print(df_reviews['rating'].value_counts().sort_index(ascending=False))
    else:
        print(f"‚ùå No reviews were scraped for {outlet_name}.")

# ==============================
# --- 3. MAIN EXECUTION LOOP ---
# ==============================
# changed code
from pathlib import Path
# use notebook CWD and resolve the repo layout reliably
nb_cwd = Path.cwd().resolve()
bottom_20_file = (nb_cwd / '..' / 'Outlets' / 'bottom_20_outlets.csv').resolve()
# fallback to local Outlets if above doesn't exist
if not bottom_20_file.exists():
    bottom_20_file = (nb_cwd / 'Outlets' / 'bottom_20_outlets.csv').resolve()

print("DEBUG: loading bottom_20_outlets from:", bottom_20_file, "exists?", bottom_20_file.exists())
df_outlets = pd.read_csv(str(bottom_20_file))
# ...existing code...
print(f"\nüöÄ Found {len(df_outlets)} outlets to scrape.")

# Iterate over each row (each outlet) in the DataFrame
for index, row in df_outlets.iterrows():
    outlet_name = row["name"]
    outlet_url = row["maps_url"]
    
    # Call the scraping function for the current outlet
    scrape_reviews(outlet_name, outlet_url, driver, wait)

# Clean up and close the browser after the loop finishes
driver.quit()
print("\n--- üèÅ All scraping complete. Driver closed. ---")

‚öôÔ∏è Setting up WebDriver...
‚úÖ Chrome WebDriver initialized successfully.
DEBUG: loading bottom_20_outlets from: /Users/breann/Documents/GitHub/IS434-Anytime-Fitness/Google-Reviews/Outlets/bottom_20_outlets.csv exists? True

üöÄ Found 20 outlets to scrape.

--- üß≠ Starting scrape for: Anytime Fitness Northpoint City ---
üîó URL: https://www.google.com/maps/place/?q=place_id:ChIJqTFUhpIV2jER8kd6GBxvZ0A
üîÑ Starting to scroll and collect reviews...
üìä Iteration 1: Found 20 new reviews (Total: 20)
‚è≥ No new reviews found. Waiting... (1/8)
üìä Iteration 3: Found 20 new reviews (Total: 30)
üìä Iteration 4: Found 20 new reviews (Total: 40)
üìä Iteration 5: Found 20 new reviews (Total: 50)
‚è≥ No new reviews found. Waiting... (1/8)
üìä Iteration 7: Found 20 new reviews (Total: 60)
üìä Iteration 8: Found 20 new reviews (Total: 70)
üìä Iteration 9: Found 20 new reviews (Total: 80)
üìä Iteration 10: Found 20 new reviews (Total: 90)
üìä Iteration 11: Found 20 new reviews (Tota

In [10]:
import time
import os
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from selenium.webdriver.common.action_chains import ActionChains
from webdriver_manager.chrome import ChromeDriverManager

# ==============================
# --- 1. SETUP DRIVER (ONLY ONCE) ---
# ==============================
options = ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--lang=en-US")
options.add_argument("--window-size=1920,1080")
options.add_argument("--headless=new")

print("‚öôÔ∏è Setting up WebDriver...")
try:
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=options
    )
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    wait = WebDriverWait(driver, 20)
    actions = ActionChains(driver)
    print("‚úÖ Chrome WebDriver initialized successfully.")
except Exception as e:
    print(f"‚ùå Failed to initialize WebDriver: {e}")
    exit()

# ==============================
# --- 2. HELPER FUNCTIONS ---
# ==============================

from datetime import datetime, timedelta
import re

def extract_date_posted(review_element):
    """Extract the date posted from review element."""
    try:
        # Look for the date in common locations
        date_selectors = [
            './/span[@class="rsqaWe"]',
            './/span[contains(@class, "rsqApe")]',
            './/span[contains(@class, "ZnRjke")]',
            './/div[contains(@class, "RgQvtc")]',
        ]
        
        for selector in date_selectors:
            try:
                date_elem = review_element.find_element(By.XPATH, selector)
                date_text = date_elem.text.strip()
                if date_text:
                    return date_text
            except:
                continue
        
        return "Date not found"
    except:
        return "Date not found"

def normalize_date(date_text):
    """Convert relative dates like '6 days ago' to 'Month Year' format."""
    if date_text == "Date not found" or not date_text:
        return "Unknown"
    
    date_text = date_text.lower().strip()
    today = datetime.now()
    
    try:
        # Handle 'X days ago'
        days_match = re.search(r'(\d+)\s+days?\s+ago', date_text)
        if days_match:
            days = int(days_match.group(1))
            target_date = today - timedelta(days=days)
            return target_date.strftime('%B %Y')
        
        # Handle 'a day ago'
        if 'a day ago' in date_text:
            target_date = today - timedelta(days=1)
            return target_date.strftime('%B %Y')
        
        # Handle 'X weeks ago'
        weeks_match = re.search(r'(\d+)\s+weeks?\s+ago', date_text)
        if weeks_match:
            weeks = int(weeks_match.group(1))
            target_date = today - timedelta(weeks=weeks)
            return target_date.strftime('%B %Y')
        
        # Handle 'a week ago'
        if 'a week ago' in date_text:
            target_date = today - timedelta(weeks=1)
            return target_date.strftime('%B %Y')
        
        # Handle 'X months ago'
        months_match = re.search(r'(\d+)\s+months?\s+ago', date_text)
        if months_match:
            months = int(months_match.group(1))
            # Simple month subtraction
            month = today.month - months
            year = today.year
            while month <= 0:
                month += 12
                year -= 1
            return datetime(year, month, 1).strftime('%B %Y')
        
        # Handle 'a month ago'
        if 'a month ago' in date_text:
            month = today.month - 1
            year = today.year
            if month <= 0:
                month += 12
                year -= 1
            return datetime(year, month, 1).strftime('%B %Y')
        
        # Handle 'X years ago'
        years_match = re.search(r'(\d+)\s+years?\s+ago', date_text)
        if years_match:
            years = int(years_match.group(1))
            target_date = today - timedelta(days=365*years)
            return target_date.strftime('%B %Y')
        
        # Handle 'a year ago'
        if 'a year ago' in date_text:
            year = today.year - 1
            return datetime(year, today.month, 1).strftime('%B %Y')
        
        # If already a full date format, return as is
        return date_text
    
    except Exception as e:
        return "Unknown"

# ==============================
# --- 3. SCRAPING FUNCTION ---
# ==============================

def scrape_reviews(outlet_name, outlet_url, driver, wait):
    """Navigates to the outlet, scrapes all reviews, and saves the data."""
    print(f"\n--- üß≠ Starting scrape for: {outlet_name} ---")
    print(f"üîó URL: {outlet_url}")

    all_reviews_data = []
    seen_review_ids = set()
    no_text_rating_count = 0
    scroll_pause = 2.0
    max_no_new = 8
    
    try:
        # OPEN OUTLET PAGE AND CLICK REVIEWS
        driver.get(outlet_url)
        time.sleep(4)

        # Click the "Reviews" button
        try:
            reviews_button = wait.until(
                EC.element_to_be_clickable((By.XPATH, '//button[contains(@aria-label, "Reviews")]'))
            )
            reviews_button.click()
            time.sleep(3)
        except Exception as e:
            print(f"‚ö†Ô∏è  Could not find Reviews button: {e}")
            return

        # SCROLL AND SCRAPE REVIEWS
        try:
            scrollable_div = wait.until(
                EC.presence_of_element_located((By.XPATH, '//div[contains(@class, "m6QErb")]'))
            )
            print("‚úÖ Found scrollable container")
        except Exception as e:
            print(f"‚ùå Could not find scrollable container: {e}")
            return

        no_new_count = 0
        previous_height = 0
        scroll_iteration = 0

        print("üîÑ Starting to scroll and collect reviews...")
        
        while True:
            scroll_iteration += 1
            
            # Find all reviews currently loaded
            review_elements = driver.find_elements(By.XPATH, '//div[@data-review-id]')
            new_reviews = [r for r in review_elements if r.get_attribute("data-review-id") not in seen_review_ids]

            if new_reviews:
                no_new_count = 0
                print(f"üìä Iteration {scroll_iteration}: Found {len(new_reviews)} new reviews (Total: {len(seen_review_ids) + len(new_reviews)})")
            else:
                no_new_count += 1
                print(f"‚è≥ No new reviews found. Waiting... ({no_new_count}/{max_no_new})")
                if no_new_count >= max_no_new:
                    print("‚úã Reached end of reviews.")
                    break
                    
            # Process new reviews
            for r in new_reviews:
                review_id = r.get_attribute("data-review-id")
                
                if review_id in seen_review_ids:
                    continue
                    
                seen_review_ids.add(review_id)
                
                try:
                    # Expand truncated review text ("More" button)
                    try:
                        more_button = r.find_element(By.CLASS_NAME, 'w8nwRe')
                        driver.execute_script("arguments[0].click();", more_button)
                        time.sleep(0.15)
                    except (NoSuchElementException, StaleElementReferenceException):
                        pass

                    # Extract author name
                    try:
                        author_name = r.find_element(By.CLASS_NAME, 'd4r55').text
                    except:
                        author_name = "Unknown"

                    # Extract rating
                    try:
                        rating_elements = r.find_elements(By.CLASS_NAME, 'kvMYJc')
                        if not rating_elements:
                            continue 
                        
                        rating_element = rating_elements[0]
                        rating_text = rating_element.get_attribute('aria-label')
                        star_rating = int(rating_text.split(' ')[0])
                    except:
                        continue

                    # Extract review text
                    try:
                        text_elems = r.find_elements(By.CLASS_NAME, 'wiI7pd')
                        if text_elems:
                            review_text = (text_elems[0].text or '').strip()
                        else:
                            review_text = ''
                    except:
                        review_text = ''

                    if review_text == '':
                        no_text_rating_count += 1

                    # Extract date posted
                    posting_date = extract_date_posted(r)
                    normalized_date = normalize_date(posting_date)

                    all_reviews_data.append({
                        "outlet": outlet_name,
                        "author": author_name,
                        "rating": star_rating,
                        "text": review_text,
                        "date_posted": posting_date,
                        "date_posted_normalized": normalized_date,
                        "review_id": review_id
                    })
                    
                except Exception as e:
                    continue

            # Scroll down with a large jump to load more reviews
            driver.execute_script(
                "arguments[0].scrollBy(0, 5000);", 
                scrollable_div
            )
            time.sleep(scroll_pause)

            # Check for scroll height change
            new_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
            if new_height == previous_height:
                if not new_reviews:
                    no_new_count += 1
            else:
                no_new_count = 0
                
            previous_height = new_height

            # Back-scroll occasionally for stability
            if scroll_iteration % 10 == 0 and scroll_iteration > 0:
                driver.execute_script("arguments[0].scrollBy(0, -200);", scrollable_div) 
                time.sleep(0.5)
                driver.execute_script("arguments[0].scrollBy(0, 400);", scrollable_div)

    except Exception as e:
        print(f"üö® An error occurred while scraping {outlet_name}: {e}")
    
    # --- SAVE RESULTS ---
    if all_reviews_data:
        os.makedirs("Reviews/ Worst", exist_ok=True)
        safe_outlet_name = "".join(c for c in outlet_name if c.isalnum() or c in (' ', '_')).rstrip()
        output_filename = os.path.join("Reviews/ Worst", f"{safe_outlet_name}_reviews.csv")
        df_reviews = pd.DataFrame(all_reviews_data)
        
        # Remove duplicates based on review_id
        initial_count = len(df_reviews)
        df_reviews = df_reviews.drop_duplicates(subset=['review_id'], keep='first')
        final_count = len(df_reviews)
        
        if initial_count > final_count:
            print(f"‚ö†Ô∏è  Removed {initial_count - final_count} duplicate reviews")
        
        df_reviews.to_csv(output_filename, index=False)
        print(f"‚úÖ Saved {final_count} unique reviews to '{output_filename}'")
        print(f"üìÑ Found {no_text_rating_count} reviews that were ratings only (no text).")
        
        # Show sample
        if len(df_reviews) > 0:
            print(f"\nüìã Sample data:")
            print(f"   Review ID: {df_reviews['review_id'].iloc[0]}")
            print(f"   Author: {df_reviews['author'].iloc[0]}")
            print(f"   Rating: {df_reviews['rating'].iloc[0]} stars")
            print(f"   Date: {df_reviews['date_posted'].iloc[0]}")
            print(f"   Text: {df_reviews['text'].iloc[0][:100]}...")
    else:
        print(f"‚ùå No reviews were scraped for {outlet_name}.")

# ==============================
# --- 4. MAIN EXECUTION LOOP ---
# ==============================
from pathlib import Path

nb_cwd = Path.cwd().resolve()
bottom_20_file = (nb_cwd / '..' / 'Outlets' / 'bottom_20_outlets.csv').resolve()

if not bottom_20_file.exists():
    bottom_20_file = (nb_cwd / 'Outlets' / 'bottom_20_outlets.csv').resolve()

print("DEBUG: loading bottom_20_outlets from:", bottom_20_file, "exists?", bottom_20_file.exists())
df_outlets = pd.read_csv(str(bottom_20_file))

print(f"\nüöÄ Found {len(df_outlets)} outlets to scrape.")

# Test with just the first outlet for debugging
for index, row in df_outlets.iterrows():
    outlet_name = row["name"]
    outlet_url = row["maps_url"]
    scrape_reviews(outlet_name, outlet_url, driver, wait)
    
    # Break after first outlet for testing
    if index == 0:
        print("\n‚è∏Ô∏è  Stopping after first outlet for testing. Remove this break to scrape all.")
        break

driver.quit()
print("\n--- üèÅ Scraping complete. Driver closed. ---")

‚öôÔ∏è Setting up WebDriver...
‚úÖ Chrome WebDriver initialized successfully.
DEBUG: loading bottom_20_outlets from: /Users/breann/Documents/GitHub/IS434-Anytime-Fitness/Google-Reviews/Outlets/bottom_20_outlets.csv exists? True

üöÄ Found 20 outlets to scrape.

--- üß≠ Starting scrape for: Anytime Fitness Northpoint City ---
üîó URL: https://www.google.com/maps/place/?q=place_id:ChIJqTFUhpIV2jER8kd6GBxvZ0A
‚úÖ Found scrollable container
üîÑ Starting to scroll and collect reviews...
üìä Iteration 1: Found 20 new reviews (Total: 20)
‚è≥ No new reviews found. Waiting... (1/8)
‚è≥ No new reviews found. Waiting... (3/8)
‚è≥ No new reviews found. Waiting... (5/8)
‚è≥ No new reviews found. Waiting... (7/8)
‚è≥ No new reviews found. Waiting... (9/8)
‚úã Reached end of reviews.
‚úÖ Saved 10 unique reviews to 'Reviews/ Worst/Anytime Fitness Northpoint City_reviews.csv'
üìÑ Found 0 reviews that were ratings only (no text).

üìã Sample data:
   Review ID: Ci9DQUlRQUNvZENodHljRjlvT2tSS2NWQk1

In [None]:
import time
import os
import pandas as pd
from datetime import datetime, timedelta
import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from selenium.webdriver.common.action_chains import ActionChains
from webdriver_manager.chrome import ChromeDriverManager

# ==============================
# --- 1. SETUP DRIVER ---
# ==============================
options = ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--lang=en-US")
options.add_argument("--window-size=1920,1080")
options.add_argument("--headless=new")

print("‚öôÔ∏è Setting up WebDriver...")
try:
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=options
    )
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    wait = WebDriverWait(driver, 20)
    actions = ActionChains(driver)
    print("‚úÖ Chrome WebDriver initialized successfully.")
except Exception as e:
    print(f"‚ùå Failed to initialize WebDriver: {e}")
    exit()

# ==============================
# --- 2. HELPER FUNCTIONS ---
# ==============================

def extract_date_posted(review_element):
    """Extract the date posted from review element."""
    try:
        date_selectors = [
            './/span[@class="rsqaWe"]',
            './/span[contains(@class, "rsqApe")]',
            './/span[contains(@class, "ZnRjke")]',
            './/div[contains(@class, "RgQvtc")]',
        ]
        
        for selector in date_selectors:
            try:
                date_elem = review_element.find_element(By.XPATH, selector)
                date_text = date_elem.text.strip()
                if date_text:
                    return date_text
            except:
                continue
        
        return "Date not found"
    except:
        return "Date not found"

def normalize_date(date_text):
    """Convert relative dates like '6 days ago' to 'Month Year' format."""
    if date_text == "Date not found" or not date_text:
        return "Unknown"
    
    date_text = date_text.lower().strip()
    today = datetime.now()
    
    try:
        # Handle 'X days ago'
        days_match = re.search(r'(\d+)\s+days?\s+ago', date_text)
        if days_match:
            days = int(days_match.group(1))
            target_date = today - timedelta(days=days)
            return target_date.strftime('%B %Y')
        
        # Handle 'a day ago'
        if 'a day ago' in date_text:
            target_date = today - timedelta(days=1)
            return target_date.strftime('%B %Y')
        
        # Handle 'X weeks ago'
        weeks_match = re.search(r'(\d+)\s+weeks?\s+ago', date_text)
        if weeks_match:
            weeks = int(weeks_match.group(1))
            target_date = today - timedelta(weeks=weeks)
            return target_date.strftime('%B %Y')
        
        # Handle 'a week ago'
        if 'a week ago' in date_text:
            target_date = today - timedelta(weeks=1)
            return target_date.strftime('%B %Y')
        
        # Handle 'X months ago'
        months_match = re.search(r'(\d+)\s+months?\s+ago', date_text)
        if months_match:
            months = int(months_match.group(1))
            month = today.month - months
            year = today.year
            while month <= 0:
                month += 12
                year -= 1
            return datetime(year, month, 1).strftime('%B %Y')
        
        # Handle 'a month ago'
        if 'a month ago' in date_text:
            month = today.month - 1
            year = today.year
            if month <= 0:
                month += 12
                year -= 1
            return datetime(year, month, 1).strftime('%B %Y')
        
        # Handle 'X years ago'
        years_match = re.search(r'(\d+)\s+years?\s+ago', date_text)
        if years_match:
            years = int(years_match.group(1))
            target_date = today - timedelta(days=365*years)
            return target_date.strftime('%B %Y')
        
        # Handle 'a year ago'
        if 'a year ago' in date_text:
            year = today.year - 1
            return datetime(year, today.month, 1).strftime('%B %Y')
        
        return date_text
    
    except Exception as e:
        return "Unknown"

# ==============================
# --- 3. SCRAPING FUNCTION ---
# ==============================

def scrape_reviews(outlet_name, outlet_url, driver, wait):
    """Navigates to the outlet, scrapes all reviews, and saves the data."""
    print(f"\n--- üß≠ Starting scrape for: {outlet_name} ---")
    print(f"üîó URL: {outlet_url}")

    all_reviews_data = []
    seen_review_ids = set()
    no_text_rating_count = 0
    scroll_pause = 3.0
    max_no_new = 20
    
    try:
        # STEP 1: OPEN OUTLET PAGE
        driver.get(outlet_url)
        time.sleep(4)

        # STEP 2: CLICK REVIEWS BUTTON
        try:
            reviews_button = wait.until(
                EC.element_to_be_clickable((By.XPATH, '//button[contains(@aria-label, "Reviews")]'))
            )
            reviews_button.click()
            time.sleep(3)
            print("‚úÖ Clicked Reviews button")
        except Exception as e:
            print(f"‚ö†Ô∏è  Could not find Reviews button: {e}")
            return

        # STEP 3: FIND SCROLLABLE CONTAINER
        scrollable_div = None
        scroll_selectors = [
            '//div[@aria-label[contains(., "Reviews")]]//div[contains(@class, "m6QErb")]',
            '//div[contains(@class, "m6QErb") and contains(@class, "DxyBCb")]',
            '//div[@class="m6QErb DxyBCb"]',
            '//div[contains(@class, "DxyBCb")]',
        ]
        
        for selector in scroll_selectors:
            try:
                scrollable_div = wait.until(
                    EC.presence_of_element_located((By.XPATH, selector))
                )
                print(f"‚úÖ Found scrollable container")
                break
            except:
                continue
        
        if not scrollable_div:
            print(f"‚ùå Could not find scrollable container")
            return

        # STEP 4: START SCROLLING AND SCRAPING LOOP
        no_new_count = 0
        previous_height = 0
        scroll_iteration = 0

        print("üîÑ Starting to scroll and collect reviews...")
        
        while True:
            scroll_iteration += 1
            
            # Find all review elements currently loaded
            review_elements = driver.find_elements(By.XPATH, '//div[@data-review-id]')
            
            # Filter for new reviews only
            new_reviews = [r for r in review_elements if r.get_attribute("data-review-id") not in seen_review_ids]

            # Log findings
            if new_reviews:
                no_new_count = 0
                print(f"üìä Iteration {scroll_iteration}: Found {len(new_reviews)} new reviews (Total: {len(seen_review_ids) + len(new_reviews)})")
            else:
                no_new_count += 1
                print(f"‚è≥ No new reviews found. Waiting... ({no_new_count}/{max_no_new})")
                if no_new_count >= max_no_new:
                    print("‚úã Reached end of reviews.")
                    break

            # STEP 5: PROCESS NEW REVIEWS
            for r in new_reviews:
                review_id = r.get_attribute("data-review-id")
                
                if review_id in seen_review_ids:
                    continue
                
                seen_review_ids.add(review_id)
                
                try:
                    # Expand truncated review text
                    try:
                        more_button = r.find_element(By.CLASS_NAME, 'w8nwRe')
                        driver.execute_script("arguments[0].click();", more_button)
                        time.sleep(0.15)
                    except (NoSuchElementException, StaleElementReferenceException):
                        pass

                    # Extract author name
                    try:
                        author_name = r.find_element(By.CLASS_NAME, 'd4r55').text
                    except:
                        author_name = "Unknown"

                    # Extract rating
                    try:
                        rating_elements = r.find_elements(By.CLASS_NAME, 'kvMYJc')
                        if not rating_elements:
                            continue 
                        
                        rating_element = rating_elements[0]
                        rating_text = rating_element.get_attribute('aria-label')
                        star_rating = int(rating_text.split(' ')[0])
                    except:
                        continue

                    # Extract review text
                    try:
                        text_elems = r.find_elements(By.CLASS_NAME, 'wiI7pd')
                        if text_elems:
                            review_text = (text_elems[0].text or '').strip()
                        else:
                            review_text = ''
                    except:
                        review_text = ''

                    if review_text == '':
                        no_text_rating_count += 1

                    # Extract and normalize date
                    posting_date = extract_date_posted(r)
                    normalized_date = normalize_date(posting_date)

                    # Add to data list
                    all_reviews_data.append({
                        "outlet": outlet_name,
                        "author": author_name,
                        "rating": star_rating,
                        "text": review_text,
                        "date_posted": posting_date,
                        "date_posted_normalized": normalized_date,
                        "review_id": review_id
                    })
                    
                except Exception as e:
                    continue

            # STEP 6: SCROLL DOWN
            try:
                driver.execute_script(
                    "arguments[0].scrollBy(0, 5000);", 
                    scrollable_div
                )
                time.sleep(scroll_pause)
            except Exception as e:
                print(f"‚ö†Ô∏è  Error scrolling: {e}")

            # STEP 7: CHECK SCROLL HEIGHT
            try:
                new_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
                current_scroll = driver.execute_script("return arguments[0].scrollTop", scrollable_div)
                visible_height = driver.execute_script("return arguments[0].clientHeight", scrollable_div)
                
                scroll_percentage = (current_scroll + visible_height) / new_height * 100 if new_height > 0 else 0
                print(f"   Scroll: {current_scroll}/{new_height} ({scroll_percentage:.1f}%), Visible: {visible_height}")
                
                if new_height == previous_height:
                    if not new_reviews:
                        no_new_count += 1
                else:
                    print(f"   üìà Scroll height increased: {previous_height} ‚Üí {new_height}")
                    no_new_count = 0
                    
                previous_height = new_height
            except:
                pass

            # STEP 8: BACK-SCROLL OCCASIONALLY FOR STABILITY
            if scroll_iteration % 10 == 0 and scroll_iteration > 0:
                try:
                    driver.execute_script("arguments[0].scrollBy(0, -200);", scrollable_div) 
                    time.sleep(0.5)
                    driver.execute_script("arguments[0].scrollBy(0, 400);", scrollable_div)
                except:
                    pass

    except Exception as e:
        print(f"üö® An error occurred while scraping {outlet_name}: {e}")
    
    # STEP 9: SAVE RESULTS
    if all_reviews_data:
        os.makedirs("Reviews/ Worst", exist_ok=True)
        safe_outlet_name = "".join(c for c in outlet_name if c.isalnum() or c in (' ', '_')).rstrip()
        output_filename = os.path.join("Reviews/ Worst", f"{safe_outlet_name}_reviews.csv")
        
        df_reviews = pd.DataFrame(all_reviews_data)
        
        # Remove duplicates based on review_id
        initial_count = len(df_reviews)
        df_reviews = df_reviews.drop_duplicates(subset=['review_id'], keep='first')
        final_count = len(df_reviews)
        
        if initial_count > final_count:
            print(f"‚ö†Ô∏è  Removed {initial_count - final_count} duplicate reviews")
        
        df_reviews.to_csv(output_filename, index=False)
        print(f"‚úÖ Saved {final_count} unique reviews to '{output_filename}'")
        print(f"üìÑ Found {no_text_rating_count} reviews that were ratings only (no text).")
        
        # Show sample
        if len(df_reviews) > 0:
            print(f"\nüìã Sample data:")
            print(f"   Review ID: {df_reviews['review_id'].iloc[0]}")
            print(f"   Author: {df_reviews['author'].iloc[0]}")
            print(f"   Rating: {df_reviews['rating'].iloc[0]} stars")
            print(f"   Date: {df_reviews['date_posted'].iloc[0]}")
            print(f"   Normalized Date: {df_reviews['date_posted_normalized'].iloc[0]}")
    else:
        print(f"‚ùå No reviews were scraped for {outlet_name}.")

# ==============================
# --- 4. MAIN EXECUTION ---
# ==============================
from pathlib import Path

nb_cwd = Path.cwd().resolve()
bottom_20_file = (nb_cwd / '..' / 'Outlets' / 'bottom_20_outlets.csv').resolve()

if not bottom_20_file.exists():
    bottom_20_file = (nb_cwd / 'Outlets' / 'bottom_20_outlets.csv').resolve()

print("DEBUG: loading bottom_20_outlets from:", bottom_20_file, "exists?", bottom_20_file.exists())
df_outlets = pd.read_csv(str(bottom_20_file))

print(f"\nüöÄ Found {len(df_outlets)} outlets to scrape.")

# Scrape all outlets
for index, row in df_outlets.iterrows():
    outlet_name = row["name"]
    outlet_url = row["maps_url"]
    scrape_reviews(outlet_name, outlet_url, driver, wait)

driver.quit()
print("\n--- üèÅ All scraping complete. Driver closed. ---")

‚öôÔ∏è Setting up WebDriver...
‚úÖ Chrome WebDriver initialized successfully.
DEBUG: loading bottom_20_outlets from: /Users/breann/Documents/GitHub/IS434-Anytime-Fitness/Google-Reviews/Outlets/bottom_20_outlets.csv exists? True

üöÄ Found 20 outlets to scrape.

--- üß≠ Starting scrape for: Anytime Fitness Northpoint City ---
üîó URL: https://www.google.com/maps/place/?q=place_id:ChIJqTFUhpIV2jER8kd6GBxvZ0A
‚úÖ Clicked Reviews button
‚úÖ Found scrollable container
üîÑ Starting to scroll and collect reviews...
üìä Iteration 1: Found 20 new reviews (Total: 20)
   Scroll: 5000/6880 (85.4%), Visible: 873
   üìà Scroll height increased: 0 ‚Üí 6880
‚è≥ No new reviews found. Waiting... (1/20)
   Scroll: 6007/10130 (67.9%), Visible: 873
   üìà Scroll height increased: 6880 ‚Üí 10130
üìä Iteration 3: Found 20 new reviews (Total: 30)
   Scroll: 10458/14415 (78.6%), Visible: 873
   üìà Scroll height increased: 10130 ‚Üí 14415
üìä Iteration 4: Found 20 new reviews (Total: 40)
   Scroll: 1

In [17]:
import time
import os
import pandas as pd
from datetime import datetime, timedelta
import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from selenium.webdriver.common.action_chains import ActionChains
from webdriver_manager.chrome import ChromeDriverManager

# ==============================
# --- 1. SETUP DRIVER ---
# ==============================
options = ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--lang=en-US")
options.add_argument("--window-size=1920,1080")
options.add_argument("--headless=new")

print("‚öôÔ∏è Setting up WebDriver...")
try:
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=options
    )
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    wait = WebDriverWait(driver, 20)
    actions = ActionChains(driver)
    print("‚úÖ Chrome WebDriver initialized successfully.")
except Exception as e:
    print(f"‚ùå Failed to initialize WebDriver: {e}")
    exit()

# ==============================
# --- 2. HELPER FUNCTIONS ---
# ==============================

def extract_date_posted(review_element):
    """Extract the date posted from review element."""
    try:
        date_selectors = [
            './/span[@class="rsqaWe"]',
            './/span[contains(@class, "rsqApe")]',
            './/span[contains(@class, "ZnRjke")]',
            './/div[contains(@class, "RgQvtc")]',
        ]
        
        for selector in date_selectors:
            try:
                date_elem = review_element.find_element(By.XPATH, selector)
                date_text = date_elem.text.strip()
                if date_text:
                    return date_text
            except:
                continue
        
        return "Date not found"
    except:
        return "Date not found"

def normalize_date(date_text):
    """Convert relative dates like '6 days ago' to 'Month Year' format."""
    if date_text == "Date not found" or not date_text:
        return "Unknown"
    
    date_text = date_text.lower().strip()
    today = datetime.now()
    
    try:
        # Handle 'X days ago'
        days_match = re.search(r'(\d+)\s+days?\s+ago', date_text)
        if days_match:
            days = int(days_match.group(1))
            target_date = today - timedelta(days=days)
            return target_date.strftime('%B %Y')
        
        # Handle 'a day ago'
        if 'a day ago' in date_text:
            target_date = today - timedelta(days=1)
            return target_date.strftime('%B %Y')
        
        # Handle 'X weeks ago'
        weeks_match = re.search(r'(\d+)\s+weeks?\s+ago', date_text)
        if weeks_match:
            weeks = int(weeks_match.group(1))
            target_date = today - timedelta(weeks=weeks)
            return target_date.strftime('%B %Y')
        
        # Handle 'a week ago'
        if 'a week ago' in date_text:
            target_date = today - timedelta(weeks=1)
            return target_date.strftime('%B %Y')
        
        # Handle 'X months ago'
        months_match = re.search(r'(\d+)\s+months?\s+ago', date_text)
        if months_match:
            months = int(months_match.group(1))
            month = today.month - months
            year = today.year
            while month <= 0:
                month += 12
                year -= 1
            return datetime(year, month, 1).strftime('%B %Y')
        
        # Handle 'a month ago'
        if 'a month ago' in date_text:
            month = today.month - 1
            year = today.year
            if month <= 0:
                month += 12
                year -= 1
            return datetime(year, month, 1).strftime('%B %Y')
        
        # Handle 'X years ago'
        years_match = re.search(r'(\d+)\s+years?\s+ago', date_text)
        if years_match:
            years = int(years_match.group(1))
            target_date = today - timedelta(days=365*years)
            return target_date.strftime('%B %Y')
        
        # Handle 'a year ago'
        if 'a year ago' in date_text:
            year = today.year - 1
            return datetime(year, today.month, 1).strftime('%B %Y')
        
        return date_text
    
    except Exception as e:
        return "Unknown"

# ==============================
# --- 3. SCRAPING FUNCTION ---
# ==============================

def scrape_reviews(outlet_name, outlet_url, driver, wait):
    """Navigates to the outlet, scrapes all reviews, and saves the data."""
    print(f"\n--- üß≠ Starting scrape for: {outlet_name} ---")
    print(f"üîó URL: {outlet_url}")

    all_reviews_data = []
    seen_review_ids = set()
    no_text_rating_count = 0
    scroll_pause = 3.0
    max_no_new = 20
    
    try:
        # STEP 1: OPEN OUTLET PAGE
        driver.get(outlet_url)
        time.sleep(4)

        # STEP 2: CLICK REVIEWS BUTTON
        try:
            reviews_button = wait.until(
                EC.element_to_be_clickable((By.XPATH, '//button[contains(@aria-label, "Reviews")]'))
            )
            reviews_button.click()
            time.sleep(3)
            print("‚úÖ Clicked Reviews button")
        except Exception as e:
            print(f"‚ö†Ô∏è  Could not find Reviews button: {e}")
            return

        # STEP 3: FIND SCROLLABLE CONTAINER
        scrollable_div = None
        scroll_selectors = [
            '//div[@aria-label[contains(., "Reviews")]]//div[contains(@class, "m6QErb")]',
            '//div[contains(@class, "m6QErb") and contains(@class, "DxyBCb")]',
            '//div[@class="m6QErb DxyBCb"]',
            '//div[contains(@class, "DxyBCb")]',
        ]
        
        for selector in scroll_selectors:
            try:
                scrollable_div = wait.until(
                    EC.presence_of_element_located((By.XPATH, selector))
                )
                print(f"‚úÖ Found scrollable container")
                break
            except:
                continue
        
        if not scrollable_div:
            print(f"‚ùå Could not find scrollable container")
            return

        # STEP 4: START SCROLLING AND SCRAPING LOOP
        no_new_count = 0
        previous_height = 0
        scroll_iteration = 0

        print("üîÑ Starting to scroll and collect reviews...")
        
        while True:
            scroll_iteration += 1
            
            # Find all review elements currently loaded
            review_elements = driver.find_elements(By.XPATH, '//div[@data-review-id]')
            
            # Filter for new reviews only
            new_reviews = [r for r in review_elements if r.get_attribute("data-review-id") not in seen_review_ids]

            # Log findings
            if new_reviews:
                no_new_count = 0
                print(f"üìä Iteration {scroll_iteration}: Found {len(new_reviews)} new reviews (Total: {len(seen_review_ids) + len(new_reviews)})")
            else:
                no_new_count += 1
                print(f"‚è≥ No new reviews found. Waiting... ({no_new_count}/{max_no_new})")
                if no_new_count >= max_no_new:
                    print("‚úã Reached end of reviews.")
                    break

            # STEP 5: PROCESS NEW REVIEWS
            for r in new_reviews:
                review_id = r.get_attribute("data-review-id")
                
                if review_id in seen_review_ids:
                    continue
                
                seen_review_ids.add(review_id)
                
                try:
                    # Expand truncated review text
                    try:
                        more_button = r.find_element(By.CLASS_NAME, 'w8nwRe')
                        driver.execute_script("arguments[0].click();", more_button)
                        time.sleep(0.15)
                    except (NoSuchElementException, StaleElementReferenceException):
                        pass

                    # Extract author name
                    try:
                        author_name = r.find_element(By.CLASS_NAME, 'd4r55').text
                    except:
                        author_name = "Unknown"

                    # Extract rating
                    try:
                        rating_elements = r.find_elements(By.CLASS_NAME, 'kvMYJc')
                        if not rating_elements:
                            continue 
                        
                        rating_element = rating_elements[0]
                        rating_text = rating_element.get_attribute('aria-label')
                        star_rating = int(rating_text.split(' ')[0])
                    except:
                        continue

                    # Extract review text
                    try:
                        text_elems = r.find_elements(By.CLASS_NAME, 'wiI7pd')
                        if text_elems:
                            review_text = (text_elems[0].text or '').strip()
                        else:
                            review_text = ''
                    except:
                        review_text = ''

                    if review_text == '':
                        no_text_rating_count += 1

                    # Extract and normalize date
                    posting_date = extract_date_posted(r)
                    normalized_date = normalize_date(posting_date)

                    # Add to data list
                    all_reviews_data.append({
                        "outlet": outlet_name,
                        "author": author_name,
                        "rating": star_rating,
                        "text": review_text,
                        "date_posted": posting_date,
                        "date_posted_normalized": normalized_date,
                        "review_id": review_id
                    })
                    
                except Exception as e:
                    continue

            # STEP 6: SCROLL DOWN
            try:
                driver.execute_script(
                    "arguments[0].scrollBy(0, 5000);", 
                    scrollable_div
                )
                time.sleep(scroll_pause)
            except Exception as e:
                print(f"‚ö†Ô∏è  Error scrolling: {e}")

            # STEP 7: CHECK SCROLL HEIGHT
            try:
                new_height = driver.execute_script("return arguments[0].scrollHeight", scrollable_div)
                current_scroll = driver.execute_script("return arguments[0].scrollTop", scrollable_div)
                visible_height = driver.execute_script("return arguments[0].clientHeight", scrollable_div)
                
                scroll_percentage = (current_scroll + visible_height) / new_height * 100 if new_height > 0 else 0
                print(f"   Scroll: {current_scroll}/{new_height} ({scroll_percentage:.1f}%), Visible: {visible_height}")
                
                if new_height == previous_height:
                    if not new_reviews:
                        no_new_count += 1
                else:
                    print(f"   üìà Scroll height increased: {previous_height} ‚Üí {new_height}")
                    no_new_count = 0
                    
                previous_height = new_height
            except:
                pass

            # STEP 8: BACK-SCROLL OCCASIONALLY FOR STABILITY
            if scroll_iteration % 10 == 0 and scroll_iteration > 0:
                try:
                    driver.execute_script("arguments[0].scrollBy(0, -200);", scrollable_div) 
                    time.sleep(0.5)
                    driver.execute_script("arguments[0].scrollBy(0, 400);", scrollable_div)
                except:
                    pass

    except Exception as e:
        print(f"üö® An error occurred while scraping {outlet_name}: {e}")
    
    # STEP 9: SAVE RESULTS
    if all_reviews_data:
        os.makedirs("Reviews/ Best", exist_ok=True)
        safe_outlet_name = "".join(c for c in outlet_name if c.isalnum() or c in (' ', '_')).rstrip()
        output_filename = os.path.join("Reviews/ Best", f"{safe_outlet_name}_reviews.csv")
        
        df_reviews = pd.DataFrame(all_reviews_data)
        
        # Remove duplicates based on review_id
        initial_count = len(df_reviews)
        df_reviews = df_reviews.drop_duplicates(subset=['review_id'], keep='first')
        final_count = len(df_reviews)
        
        if initial_count > final_count:
            print(f"‚ö†Ô∏è  Removed {initial_count - final_count} duplicate reviews")
        
        df_reviews.to_csv(output_filename, index=False)
        print(f"‚úÖ Saved {final_count} unique reviews to '{output_filename}'")
        print(f"üìÑ Found {no_text_rating_count} reviews that were ratings only (no text).")
        
        # Show sample
        if len(df_reviews) > 0:
            print(f"\nüìã Sample data:")
            print(f"   Review ID: {df_reviews['review_id'].iloc[0]}")
            print(f"   Author: {df_reviews['author'].iloc[0]}")
            print(f"   Rating: {df_reviews['rating'].iloc[0]} stars")
            print(f"   Date: {df_reviews['date_posted'].iloc[0]}")
            print(f"   Normalized Date: {df_reviews['date_posted_normalized'].iloc[0]}")
    else:
        print(f"‚ùå No reviews were scraped for {outlet_name}.")

# ==============================
# --- 4. MAIN EXECUTION ---
# ==============================
from pathlib import Path

nb_cwd = Path.cwd().resolve()
top_20_file = (nb_cwd / '..' / 'Outlets' / 'top_20_outlets.csv').resolve()

if not top_20_file.exists():
    top_20_file = (nb_cwd / 'Outlets' / 'top_20_outlets.csv').resolve()

print("DEBUG: loading top_20_outlets from:",top_20_file, "exists?", top_20_file.exists())
df_outlets = pd.read_csv(str(top_20_file))

print(f"\nüöÄ Found {len(df_outlets)} outlets to scrape.")

# Scrape all outlets
for index, row in df_outlets.iterrows():
    outlet_name = row["name"]
    outlet_url = row["maps_url"]
    scrape_reviews(outlet_name, outlet_url, driver, wait)

driver.quit()
print("\n--- üèÅ All scraping complete. Driver closed. ---")

‚öôÔ∏è Setting up WebDriver...
‚úÖ Chrome WebDriver initialized successfully.
DEBUG: loading top_20_outlets from: /Users/breann/Documents/GitHub/IS434-Anytime-Fitness/Google-Reviews/Outlets/top_20_outlets.csv exists? True

üöÄ Found 20 outlets to scrape.

--- üß≠ Starting scrape for: Anytime Fitness MacPherson Mall ---
üîó URL: https://www.google.com/maps/place/?q=place_id:ChIJX41bNAAX2jER8L9rlgPDC7E
‚úÖ Clicked Reviews button
‚úÖ Found scrollable container
üîÑ Starting to scroll and collect reviews...
üìä Iteration 1: Found 20 new reviews (Total: 20)
   Scroll: 5000/7721 (76.1%), Visible: 873
   üìà Scroll height increased: 0 ‚Üí 7721
‚è≥ No new reviews found. Waiting... (1/20)
   Scroll: 6848/12021 (64.2%), Visible: 873
   üìà Scroll height increased: 7721 ‚Üí 12021
üìä Iteration 3: Found 20 new reviews (Total: 30)
   Scroll: 11848/12987 (98.0%), Visible: 873
   üìà Scroll height increased: 12021 ‚Üí 12987
‚è≥ No new reviews found. Waiting... (1/20)
   Scroll: 12114/17707 (7

KeyboardInterrupt: 