# MSU_J4O'S AMAZING SCRAPER!!!!!

### Setup Instructions
Hey! Before running this scraper, make sure you’ve completed the following:

1. **Install the correct ChromeDriver version**
   - ChromeDriver must match your currently installed version of Google Chrome.
   - To check your version: open `chrome://settings/help` in your browser.
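   - Download the corresponding ChromeDriver from: [https://chromedriver.chromium.org/downloads](https://chromedriver.chromium.org/downloads)
       - For Chrome 115 and newer, drivers are listed on the Chrome for Testing dashboard instead: [https://googlechromelabs.github.io/chrome-for-testing/](https://googlechromelabs.github.io/chrome-for-testing/)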
   - The code finds ChromeDriver with `get_chromedriver_path()`, which only looks in your Downloads folder.
       - **Keep ChromeDriver in your Downloads folder.** On Intel Macs, leave it inside the unzipped `chromedriver-mac-x64` folder.

2. **Set up a Python virtual environment** (recommended)
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows use: venv\Scripts\activate
   ```

3. **Install dependencies from `requirements.txt`**
   ```bash
   pip install -r requirements.txt
   ```
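
The version match from step 1 can also be checked in code. This is a minimal sketch and not part of the scraper: `major_version` and `versions_match` are hypothetical helpers, and the sample strings are made up for illustration. They only imitate the shape of what `chromedriver --version` and `chrome://settings/help` typically show.

```python
import re

def major_version(version_text):
    """Return the first number of a dotted four-part version, or None if absent."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_text)
    return int(match.group(1)) if match else None

def versions_match(chrome_text, driver_text):
    """True when both strings carry the same major version number."""
    chrome_major = major_version(chrome_text)
    driver_major = major_version(driver_text)
    return chrome_major is not None and chrome_major == driver_major

# Illustrative strings only (assumption); paste your own values here.
print(versions_match("Google Chrome 123.0.0.1", "ChromeDriver 123.0.0.2 (abc)"))
print(versions_match("Google Chrome 124.0.0.1", "ChromeDriver 123.0.0.2 (abc)"))
```

Only the major version (the number before the first dot) is compared here, which is a simplification.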

---

### IMPORTS

In [34]:
# === Standard Library ===
import os
import time
import json
import csv
import re
import random
import shutil
from datetime import datetime, timedelta
from urllib.parse import urljoin
from pathlib import Path
import platform

# === Third-Party Libraries ===
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from dotenv import load_dotenv
import pandas as pd
from bs4 import BeautifulSoup

### WHY DID WE MAKE OUR OWN SCRAPER?
We originally tried using two libraries:
- [facebook-scraper](https://github.com/kevinzg/facebook-scraper)
- [facebook-page-scraper](https://pypi.org/project/facebook-page-scraper/)

Unfortunately, both of these tools were unable to bypass Facebook’s security measures (like login walls, dynamic content loading, etc.).

By building our own scraper:
- We can extract the data we need.
- We bypass some protections using a real browser session and cookies.
- And honestly, it's pretty cool that we made it ourselves as a group project.

---

### Setup: ChromeDriver, Cookies, and Facebook Group Page
- Loads the locally installed ChromeDriver and launches a new Chrome browser session.
- Applies previously saved Facebook session cookies to bypass the login process.
    - You can export your cookies using the [ExportThisCookie browser extension](https://exportthiscookie.com/)
- Navigates directly to the specified Facebook group page to begin scraping.
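
Exported cookies often carry `sameSite` values in the browser-extension style (`no_restriction`, `lax`, `strict`, `unspecified`), while Selenium's `add_cookie` only accepts `Strict`, `Lax`, or `None`. The sketch below maps one style to the other instead of forcing every cookie to `Strict`. It assumes your export uses those values; `normalize_cookie` and `SAME_SITE_MAP` are hypothetical names, not part of the scraper.

```python
# Assumed mapping from extension-style export values to the values Selenium accepts.
SAME_SITE_MAP = {
    "strict": "Strict",
    "lax": "Lax",
    "no_restriction": "None",
    "none": "None",
}

def normalize_cookie(cookie):
    """Return a copy of the cookie with a Selenium-compatible sameSite value."""
    cleaned = dict(cookie)  # copy so the loaded JSON is left untouched
    raw = cleaned.pop("sameSite", None)
    mapped = SAME_SITE_MAP.get(str(raw).lower()) if raw is not None else None
    if mapped is not None:
        cleaned["sameSite"] = mapped
    # Unknown values such as "unspecified" are dropped rather than guessed.
    return cleaned

# Made-up cookie for illustration only.
sample = {"name": "example", "value": "123", "sameSite": "no_restriction"}
print(normalize_cookie(sample))
```

In the cookie loop, `driver.add_cookie(normalize_cookie(cookie))` would be one way to use it.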

---

In [39]:
def get_chromedriver_path():
    # Get user's Downloads folder
    downloads = Path.home() / "Downloads"
    
    # Choose driver filename depending on OS
    system = platform.system()
    machine = platform.machine()

    if system == "Windows":
        filename = "chromedriver.exe"
        driver_path = downloads / filename

    elif system == "Darwin" and machine == "x86_64":  # Intel Mac
        driver_path = downloads / "chromedriver-mac-x64" / "chromedriver"

    else:
        filename = "chromedriver"
        driver_path = downloads / filename

    if driver_path.exists():
        return str(driver_path)
    else:
        raise FileNotFoundError(f"ChromeDriver not found at {driver_path}")


In [40]:
# === Path Configuration ===
COOKIES_PATH = "fb_cookies.json"
GROUP_URL = "https://www.facebook.com/groups/1996906587229548/"

# === Set Up Chrome WebDriver ===
options = webdriver.ChromeOptions()

# Only set Chrome binary path if on macOS
if platform.system() == "Darwin":
    mac_chrome_path = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
    if Path(mac_chrome_path).exists():
        options.binary_location = mac_chrome_path

# Optional: uncomment this to run headless
# options.add_argument("--headless")

# Locate ChromeDriver in the Downloads folder (raises FileNotFoundError if missing)
service = Service(get_chromedriver_path())
driver = webdriver.Chrome(service=service, options=options)

# === Load Facebook and Inject Cookies ===
driver.get("https://www.facebook.com/")
time.sleep(3)

with open(COOKIES_PATH, "r") as f:
    cookies = json.load(f)
    for cookie in cookies:
        if "sameSite" in cookie:
            cookie["sameSite"] = "Strict"
        try:
            driver.add_cookie(cookie)
        except Exception as e:
            print("cookie error:", cookie.get("name"), e)

# === Navigate to Target Group ===
driver.get(GROUP_URL)
time.sleep(5)


### What Did This Do?
This section is how the scraper gets into Facebook:
- It uses saved cookies to restore a logged-in session, so no manual login is needed.
- This makes scraping faster and more consistent, and it is less likely to trigger Facebook’s security walls.
- Once on the group page, the scraper can begin scanning for content like posts and timestamps.

---

### Actual Scraper Function

#### What It Does:
- Automatically scrolls through the Facebook group page to load posts.
- Extracts timestamp, post content, and post URL from each visible post.
- Stops scrolling once the page height stops growing (after a minimum number of scrolls), and ends the run after 3 batches in a row with no new posts.
- Saves all scraped data into a clean, structured CSV file.
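
The scroll-and-stop rule can be tried without a browser. The sketch below is a simplified stand-in for `scroll_to_load`: `scroll_until_stalled`, `get_height`, and `scroll_once` are hypothetical names, and the two callbacks replace the Selenium calls that read `document.body.scrollHeight` and scroll to the bottom.

```python
def scroll_until_stalled(get_height, scroll_once, max_scrolls=30, min_scrolls=10):
    """Scroll until the page height stops growing, but at least min_scrolls times.

    Returns the number of scrolls performed.
    """
    last_height = get_height()
    for i in range(max_scrolls):
        scroll_once()
        new_height = get_height()
        # Height unchanged after the minimum: assume nothing new was loaded.
        if new_height == last_height and i + 1 >= min_scrolls:
            return i + 1
        last_height = new_height
    return max_scrolls

# Fake page that grows by 1000 px per scroll for 12 scrolls, then stops growing.
state = {"scrolls": 0}

def fake_scroll():
    state["scrolls"] += 1

def fake_height():
    return 1000 * min(state["scrolls"], 12) + 1000

print(scroll_until_stalled(fake_height, fake_scroll))
```

Enforcing a minimum scroll count guards against stopping on a slow page load, where the height can briefly stay the same even though more posts are on the way.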


In [24]:
# j4o's ultimate facebook group scraper v3.1 (cleaned on 4-8-2025)
# features:
# - relative timestamp conversion (3d → actual date)  (3-29-2025)
# - error resilience against facebook's nonsense (3-29-2025)
# - clean CSV output (3-30-2025)
# - renamed for the group! (4-9-2025)

# --------------------------
# 1. SCROLLING MECHANISM
# --------------------------
def scroll_to_load(driver, max_scrolls=30, wait=2, min_scrolls=10, verbose=True):
    """scrolls the page with smart detection but ensures a minimum number of scrolls"""
    last_height = driver.execute_script("return document.body.scrollHeight")
    stall_counter = 0

    for i in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(wait)

        new_height = driver.execute_script("return document.body.scrollHeight")
        posts_found = len(driver.find_elements(By.CSS_SELECTOR, "div[role='article']"))
        if verbose:
            print(f"Scroll {i+1}/{max_scrolls} | Posts found: {posts_found}")

        if new_height == last_height:
            stall_counter += 1
            if i + 1 >= min_scrolls:
                if verbose:
                    print("No new posts detected - stopping scroll")
                break
        else:
            stall_counter = 0

        last_height = new_height

# --------------------------  
# 2. TIMESTAMP CONVERSION
# --------------------------
def get_timestamp(post):
    """convert facebook's vague time hints into actual dates"""
    time_elem = post.find("a", href=lambda x: x and ("/posts/" in x or "/permalink/" in x))
    if not time_elem:
        return "unknown"

    timestamp = time_elem.get_text(strip=True)

    if "Just now" in timestamp:
        return datetime.now().strftime("%Y-%m-%d %H:%M")

    match = re.match(r"(\d+)\s*([smhdwy])", timestamp)
    if not match:
        return timestamp

    num, unit = int(match.group(1)), match.group(2)
    now = datetime.now()

    delta_map = {
        's': timedelta(seconds=num),
        'm': timedelta(minutes=num),
        'h': timedelta(hours=num),
        'd': timedelta(days=num),
        'w': timedelta(weeks=num),
        'y': timedelta(days=num * 365),
    }

    return (now - delta_map.get(unit, timedelta())).strftime("%Y-%m-%d %H:%M")

# --------------------------
# 3. MAIN EXTRACTION LOGIC
# --------------------------
def get_content(post):
    """extract post text with 'See more' expansion"""
    content_div = post.find("div", {"dir": "auto"})
    if not content_div:
        return ""

    content = content_div.get_text("\n", strip=True)
    see_more = post.find("div", string=re.compile(r"See\s+more", re.I))
    if see_more:
        more_text = see_more.find_next("div")
        if more_text:
            content += "\n" + more_text.get_text("\n", strip=True)

    return content

def get_post_url(post):
    """extract post URL with fallback"""
    time_elem = post.find("a", href=lambda x: x and ("/posts/" in x or "/permalink/" in x))
    if time_elem:
        return urljoin("https://www.facebook.com", time_elem['href'])
    return "unknown"

def extract_posts(driver):
    """More reliable post detection with Facebook's current layout"""
    soup = BeautifulSoup(driver.page_source, "html.parser")
    posts_data = []

    post_selectors = [
        {'role': 'article'},
        {'data-pagelet': re.compile('FeedUnit_')},
        {'class': re.compile('x1yztbdb')}
    ]

    for selector in post_selectors:
        posts = soup.find_all("div", selector)
        if posts:
            break

    for post in posts:
        try:
            post_data = {
                "timestamp": get_timestamp(post),
                "content": get_content(post),
                "post_url": get_post_url(post),
            }
            posts_data.append(post_data)
        except Exception:
            continue

    return posts_data

# --------------------------
# 4. OUTPUT FUNCTIONS
# --------------------------
def save_posts_to_csv(posts, filename="fb_scraped_data.csv"):
    """save our hard-earned data to CSV"""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["timestamp", "content", "post_url"])
        writer.writeheader()
        writer.writerows(posts)

# --------------------------
# 5. USER-FRIENDLY FILENAME HANDLING
# --------------------------
def get_output_filename():
    while True:
        user_input = input("Enter output filename (or press Enter for 'fb_scraped_data.csv'): ").strip()
        if not user_input:
            return "fb_scraped_data.csv"
        if not user_input.lower().endswith('.csv'):
            user_input += '.csv'
        if len(user_input) > 100:
            print("That filename is too long. Try something shorter.")
            continue
        if not re.match(r'^[\w\-\. ]+$', user_input):
            print("Invalid characters in filename. Use only letters, numbers, spaces, hyphens, and periods.")
            continue
        return user_input

# --------------------------
# 6. MAIN EXECUTION FLOW
# --------------------------
def run_scraper(driver):
    try:
        print("\U0001f680 Starting Facebook scraper")
        scrolls_per_batch = 11

        # === This is where you adjust the running time of the program ===
        # Each batch scrolls up to 11 times with a 3 s wait, so roughly 30 s per batch
        total_batches = 30  # 30 batches × ~30s each = ~15 minutes
        wait = 3.0

        all_posts = {}
        stale_batches = 0

        for batch in range(total_batches):
            print(f"Batch {batch+1}/{total_batches} | Scrolling {scrolls_per_batch} times...")
            scroll_to_load(driver, max_scrolls=scrolls_per_batch, wait=wait, verbose=False)

            print("\U0001f50d Extracting posts...")
            new_posts = extract_posts(driver)

            new_count = 0
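            for p in new_posts:
                # Posts without a permalink all report "unknown"; key those on their
                # content so they are not collapsed into a single row
                key = p["post_url"] if p["post_url"] != "unknown" else ("unknown", p["content"])
                if key not in all_posts:
                    all_posts[key] = p
                    new_count += 1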

            total_unique = len(all_posts)
            print(f"New: {new_count} | Total unique: {total_unique}")

            if new_count == 0:
                stale_batches += 1
                print(f"\u26a0\ufe0f No new posts in this batch ({stale_batches}x)")
                if stale_batches >= 3:
                    print("No new content in 3 batches. Stopping early.")
                    break
            else:
                stale_batches = 0

        posts_list = list(all_posts.values())
        print(f"\n✅ Finished scraping: {len(posts_list)} unique posts")

        if posts_list:
            filename = get_output_filename()
            save_posts_to_csv(posts_list, filename)
            print(f"Saved {len(posts_list)} posts to '{filename}'")

            print("\nTop timestamps:")
            df = pd.DataFrame(posts_list)
            print(df['timestamp'].value_counts().head(10))

    except Exception as e:
        print(f"Error: {str(e)}")

### How Do I Make It Run Longer?

With the settings in the `run_scraper` function (`total_batches = 30`), a full run takes roughly 15 minutes. For testing or small-scale use, about 3 minutes (`total_batches = 6`) is enough.

To change the scraping time:

- Find this section in the `run_scraper` function:

  ```python
  # === This is where you adjust the running time of the program ===
  total_batches = 30
  ```

#### Set `total_batches` like this:
- `6` → ~3 minutes  
- `12` → ~6 minutes  
- `20` → ~10 minutes  
- `30` → ~15 minutes+

Each batch:
- Scrolls the page up to 11 times (`scrolls_per_batch = 11`)
- Waits 3 seconds between scrolls (`wait = 3.0`)

> Longer runs **WILL** trigger Facebook's security measures and you may be rate-limited.
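
The run time can be estimated from the batch settings. This is a rough sketch: `estimate_runtime_minutes` is a hypothetical helper, its defaults are taken from the `run_scraper` function (`scrolls_per_batch = 11`, `wait = 3.0`), and it ignores extraction time and early stops, so treat the result as an upper bound on scrolling time only.

```python
def estimate_runtime_minutes(total_batches, scrolls_per_batch=11, wait=3.0):
    """Upper bound on scrolling time in minutes for the given batch settings."""
    return total_batches * scrolls_per_batch * wait / 60

for batches in (6, 12, 20, 30):
    minutes = estimate_runtime_minutes(batches)
    print(f"{batches} batches -> up to ~{minutes:.1f} min")
```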

### Run The Scraper!

In [25]:
run_scraper(driver)

🚀 Starting Facebook scraper
Batch 1/30 | Scrolling 11 times...
🔍 Extracting posts...
New: 10 | Total unique: 10
Batch 2/30 | Scrolling 11 times...
🔍 Extracting posts...
New: 75 | Total unique: 85
Batch 3/30 | Scrolling 11 times...
🔍 Extracting posts...
New: 75 | Total unique: 160
Batch 4/30 | Scrolling 11 times...
🔍 Extracting posts...
New: 34 | Total unique: 194
Batch 5/30 | Scrolling 11 times...
🔍 Extracting posts...
New: 40 | Total unique: 234
Batch 6/30 | Scrolling 11 times...
🔍 Extracting posts...
New: 27 | Total unique: 261
Batch 7/30 | Scrolling 11 times...
🔍 Extracting posts...
New: 60 | Total unique: 321
Batch 8/30 | Scrolling 11 times...
🔍 Extracting posts...
New: 60 | Total unique: 381
Batch 9/30 | Scrolling 11 times...
🔍 Extracting posts...
New: 37 | Total unique: 418
Batch 10/30 | Scrolling 11 times...
🔍 Extracting posts...
New: 2 | Total unique: 420
Batch 11/30 | Scrolling 11 times...
🔍 Extracting posts...
New: 0 | Total unique: 420
⚠️ No new posts in this batch (1x)
Batc

KeyboardInterrupt: 

#### The scraper saves its CSV to your experiments folder, so you can load it below by entering the filename.
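
Reading a saved file back can be sketched with the standard library. The column names come from `save_posts_to_csv`; `load_posts` is a hypothetical helper, and the sample file is written to a temporary folder only so the example runs on its own. In the notebook, `pd.read_csv(filename)` would serve the same purpose.

```python
import csv
import tempfile
from pathlib import Path

def load_posts(filename):
    """Read a scraper CSV back into a list of dicts keyed by column name."""
    with open(filename, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Write a tiny stand-in file with the scraper's three columns, then read it back.
sample_path = Path(tempfile.mkdtemp()) / "fb_scraped_data.csv"
with open(sample_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["timestamp", "content", "post_url"])
    writer.writeheader()
    writer.writerow({
        "timestamp": "unknown",
        "content": "example post\nsecond line",  # made-up content for illustration
        "post_url": "unknown",
    })

posts = load_posts(sample_path)
print(len(posts), "post(s) loaded")
```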