# Reddit Scraper  

## Overview  
The **Reddit Scraper** is a web scraping tool designed to automate the process of logging into Reddit and extracting posts based on specific tags. Utilizing the Selenium WebDriver, this scraper navigates the Reddit website, interacts with the user interface, and collects relevant data from posts for analysis.  

## How It Works  

### Workflow Overview  

1. **Driver Initialization**:  
   - The scraper initializes a Selenium WebDriver instance that can run in headless mode (without a graphical user interface). In the code below the `--headless` flag is commented out, so a visible browser window opens by default.  
   - It sets up necessary options for Chrome, including disabling GPU acceleration and configuring the browser window size.  

2. **Login Process**:  
   - The scraper navigates to the Reddit login page and enters the provided username and password.  
   - It handles the login button, which is located within a Shadow DOM, using JavaScript to ensure successful interaction.  
   - After clicking the login button, it checks the current URL to confirm a successful login.  

3. **Scraping Posts**:  
   - The scraper iterates through a list of specified tags.  
   - For each tag, it navigates to the corresponding Reddit search results page and sets the sorting option to "New" to get the latest posts.  

4. **Data Extraction**:  
   - The scraper scrolls through the search results, extracting relevant data from each post, including:  
     - **Username**: The name of the user who made the post.  
     - **Time Posted**: The timestamp of the post in ISO format.  
     - **Content**: The text content of the post.  
   - It collects data for a specified number of scrolls per tag.  

5. **Navigating Through Posts**:  
   - After extracting data from a set of posts, the scraper scrolls down to load more posts dynamically, continuing the extraction process.  

6. **Data Storage**:  
   - Extracted data is stored in a structured format (list of dictionaries) for further analysis.  
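
Step 3 can be illustrated with a small helper that builds the search URL for one tag. This is a minimal sketch rather than the notebook's own code: the name `build_search_url` is hypothetical, and it assumes Reddit's search page reads the query from a `q` parameter and the ordering from a `sort` parameter.

```python
from urllib.parse import urlencode

# Assumption: Reddit's search page accepts `q` (query) and `sort` (ordering).
REDDIT_SEARCH_URL = "https://www.reddit.com/search/"


def build_search_url(tag, sort="new"):
    """Return a search URL for `tag`, ordered by `sort` (hypothetical helper)."""
    query = urlencode({"q": tag, "sort": sort})  # URL-encodes spaces and symbols
    return f"{REDDIT_SEARCH_URL}?{query}"


# Build one URL per tag; each could then be opened with driver.get(url)
for tag in ["apple", "stock market"]:
    print(build_search_url(tag))
```

Putting the sort order in the URL avoids clicking through the sort menu, although the scraper may still need to use the menu if Reddit ignores the parameter.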

### Example of Extracted Data  

The scraper collects the following types of data for each post:  

| Field       | Description                                  |  
|-------------|----------------------------------------------|  
| Post Number | A unique number for the post                 |  
| Tag         | The tag used to search on Reddit             |  
| Username    | The name of the user who made the post       |  
| Time Posted | The timestamp of when the post was submitted |  
| Date        | The date of extraction (YYYY-MM-DD)          |  
| Content     | The actual content of the post               |  
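
One way to picture a single row of this table in code is a small dataclass converted to a dictionary. The class name `PostRecord` and its field names are hypothetical, chosen to mirror the table, and every value below is a placeholder rather than real scraped data.

```python
from dataclasses import dataclass, asdict


@dataclass
class PostRecord:
    """One scraped post; fields mirror the table above (hypothetical class)."""
    post_number: int
    tag: str
    username: str
    time_posted: str  # ISO-format timestamp of the post
    date: str         # extraction date, YYYY-MM-DD
    content: str


# Placeholder values for illustration only, not real scraped data
record = PostRecord(
    post_number=1,
    tag="example-tag",
    username="example_user",
    time_posted="2025-01-01T12:00:00+00:00",
    date="2025-01-02",
    content="Example post text",
)

# A list of dictionaries, the storage format described under Data Storage
records = [asdict(record)]
print(records[0]["username"])
```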

## Use Cases  

The data extracted by the Reddit Scraper can be utilized for various purposes, including:  

- **Social Media Analysis**: Analyze trends and sentiments around specific topics or communities.  
- **Market Research**: Gather insights into public opinion regarding products or services discussed on Reddit.  
- **Content Monitoring**: Track discussions and feedback related to specific topics or events.  

## Important Notes  

- **Temporary Accounts**: It is recommended to use temporary accounts for scraping to avoid potential violations of Reddit's terms of service.  
- **Dynamic Content**: The scraper relies on specific XPath and CSS selectors to extract data. Changes in Reddit's layout may require updates to these selectors.  
- **Shadow DOM Handling**: The scraper includes methods to interact with elements within the Shadow DOM, which may not be accessible through traditional methods.  
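
The Shadow DOM note can be made concrete with a helper that generates the nested `querySelector(...).shadowRoot` expression from a list of selectors. This is a sketch under stated assumptions: `build_shadow_query` is a hypothetical name, and the selectors in the example are placeholders, not Reddit's real element names.

```python
import json


def build_shadow_query(selectors):
    """Build a JavaScript snippet that walks nested shadow roots.

    The first selector is resolved against `document`; each later selector
    is resolved against the previous element's `shadowRoot`.
    """
    if not selectors:
        raise ValueError("at least one selector is required")
    # json.dumps yields a safely quoted JavaScript string literal
    expression = f"document.querySelector({json.dumps(selectors[0])})"
    for selector in selectors[1:]:
        expression += f".shadowRoot.querySelector({json.dumps(selector)})"
    return f"return {expression};"


# Placeholder selectors for illustration only
script = build_shadow_query(["my-app", "my-drawer", "button.submit"])
print(script)
```

The resulting string can be passed to Selenium's `driver.execute_script(script)`. If an intermediate element is missing, the script fails in the browser, so the caller should be prepared to catch the resulting exception.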

## Conclusion  

The Reddit Scraper provides an efficient way to gather and analyze social media data from the Reddit platform. By automating the login and data extraction processes, it enables users to gain valuable insights into trends and discussions in a timely manner.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.chrome.options import Options
from datetime import datetime
import pandas as pd
import time
import boto3
import yaml

In [5]:
# Initialize WebDriver with or without headless mode
from selenium.webdriver.chrome.service import Service

def init_driver(chromedriver_path):
    chrome_options = Options()
    # chrome_options.add_argument("--headless")  # Uncomment to run without a visible window
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--log-level=3")
    chrome_options.add_argument("--disable-web-security")  # Disable web security
    chrome_options.add_argument("--window-size=1280,1024")
    chrome_options.add_argument("--ignore-certificate-errors")  # Ignore SSL errors
    chrome_options.add_argument("--allow-insecure-localhost")  # Allow insecure localhost connections
    # Pass the driver path through a Service object; the executable_path
    # keyword on webdriver.Chrome is deprecated in Selenium 4
    service = Service(executable_path=chromedriver_path)
    driver = webdriver.Chrome(service=service, options=chrome_options)
    wait = WebDriverWait(driver, 10)
    print("Driver initiated successfully")
    return driver, wait

In [6]:
chromedriver_path = r"D:\coding\freelancing\stockMarket\chromedriver.exe"
driver, wait = init_driver(chromedriver_path)

Driver initiated successfully


In [7]:
# Log in to Reddit
def login_to_reddit(driver, wait, username, password):
    """
    Log in to Reddit with the provided credentials.
    """
    try:
        driver.get("https://www.reddit.com/login/")
        time.sleep(5)
        # Enter username
        username_input = wait.until(EC.presence_of_element_located((By.ID, "login-username")))
        username_input.send_keys(username)

        # Enter password
        password_input = wait.until(EC.presence_of_element_located((By.ID, "login-password")))
        password_input.send_keys(password)
        time.sleep(2)
        try:
            # Locate the login button inside the Shadow DOM using JavaScript
            login_button = driver.execute_script("""
                return document
                    .querySelector("body > shreddit-app > shreddit-overlay-display")
                    .shadowRoot.querySelector("shreddit-signup-drawer")
                    .shadowRoot.querySelector("shreddit-drawer > div > shreddit-async-loader > div > shreddit-slotter")
                    .shadowRoot.querySelector("#login > auth-flow-modal > div.w-100 > faceplate-tracker > button");
            """)
            if not login_button:
                # Treat a missing button as a failure so the fallback runs
                raise Exception("Login button not found in the Shadow DOM.")

            # Click the login button
            driver.execute_script("arguments[0].click();", login_button)
            print("Login button clicked successfully!")
        except Exception:
            # Fallback: locate the button by XPath and click it with JavaScript
            login_button_js = driver.find_element(By.XPATH, '//*[@id="login"]/auth-flow-modal/div[2]/faceplate-tracker/button/span/span')
            driver.execute_script("arguments[0].click();", login_button_js)
            print("Login button clicked using JavaScript!")

        # Wait for the login to complete, whichever click path was used,
        # then confirm the browser has left the login page
        time.sleep(5)
        if "login" in driver.current_url.lower():
            raise Exception("Login failed. Still on login page.")
        print("Login successful!")

    except Exception as e:
        print(f"Error during Reddit login: {e}")


In [None]:
# Credentials and configuration
username = "uname"
password = "pass"
login_to_reddit(driver, wait, username, password)

Login button clicked using JavaScript!
Error during Reddit login: Login failed. Still on login page.


In [9]:
def extract_reddit_post_links(driver, wait, subreddit, scroll_count):
    """
    Extract post links from a specific subreddit.

    Args:
        driver: Selenium WebDriver instance.
        wait: WebDriverWait instance.
        subreddit: The subreddit to scrape.
        scroll_count: Number of scrolls to perform for the subreddit.

    Returns:
        List of post links.
    """
    post_links = []

    print(f"Extracting post links for subreddit: {subreddit}")
    driver.get(f"https://www.reddit.com/r/{subreddit}/")
    time.sleep(5)  # Allow page to load

    for scroll in range(scroll_count):
        print(f"Scrolling {scroll + 1}/{scroll_count} for subreddit: {subreddit}")

        # Loop through posts using index numbers
        for post_index in range(1, 3):  # Adjust range as needed
            try:
                # Locate the post using the index
                post_element = wait.until(
                    EC.presence_of_element_located(
                        (By.XPATH, f"(//shreddit-post)[{post_index}]")
                    )
                )

                # Extract the post link
                link_element = post_element.find_element(By.XPATH, ".//a[contains(@href, '/r/') and @slot='full-post-link']")
                post_link = link_element.get_attribute("href")

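                # Append the link to the list, skipping links already captured
                # (the same post indices are re-read after every scroll)
                if post_link and post_link not in post_links:
                    post_links.append(post_link)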

            except Exception as e:
                print(f"Error processing post index {post_index}: {e}")

        # Scroll for more posts
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)

    print(f"Captured {len(post_links)} links for subreddit: {subreddit}")
    return post_links
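
The fixed `scroll_count` loop above can be complemented by a common alternative: scrolling until the page height stops growing. The sketch below is not part of the original notebook. `scroll_until_stable` is a hypothetical helper that only assumes the `driver` object exposes `execute_script`, as a Selenium WebDriver does.

```python
import time


def scroll_until_stable(driver, max_scrolls=10, pause=3.0):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that loaded new content.
    """
    height_script = "return document.body.scrollHeight"
    last_height = driver.execute_script(height_script)
    loaded = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more posts
        new_height = driver.execute_script(height_script)
        if new_height == last_height:
            break  # the page did not grow, so nothing new was loaded
        last_height = new_height
        loaded += 1
    return loaded
```

Stopping on an unchanged height avoids wasted scrolls on short pages, while `max_scrolls` still caps the work on feeds that keep loading indefinitely.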

In [10]:
def extract_post_data(driver, wait, subreddit):
    """
    Extract post data from a specific subreddit.

    Args:
        driver: Selenium WebDriver instance.
        wait: WebDriverWait instance.
        subreddit: The subreddit to scrape.

    Returns:
        A list of dictionaries containing post data.
    """
    post_data = []
    # Adjust scroll_count according to your requirements
    subreddit_links = extract_reddit_post_links(driver, wait, subreddit, scroll_count=1)

    for index, link in enumerate(subreddit_links):
        print(f"Extracting data for post {index + 1}: {link}")
        data = {
            'Post Number': str(index + 1),
            'Label': 'reddit_data',
            'Sub-reddit': subreddit,
            'Username': 'N/A',
            'Time Posted': 'N/A',
            'Date Posted': 'N/A',
            'Content': 'N/A',
            'Comments': []
        }

        # Extract the user
        try:
            driver.get(link)
            user_element = wait.until(
                EC.presence_of_element_located(
                    (By.XPATH, "//a[contains(@class, 'author-name')]")
                )
            )
            data['Username'] = user_element.text
        except Exception as e:
            print(f"Error extracting username ({type(e).__name__}); keeping 'N/A'")

        # Extract the datetime value for date and time posted
        try:
            time_element = wait.until(
                EC.presence_of_element_located(
                    (By.XPATH, "//faceplate-timeago/time")
                )
            )
            datetime_value = time_element.get_attribute("datetime")
            data['Date Posted'] = datetime_value.split("T")[0]  # YYYY-MM-DD
            data['Time Posted'] = datetime_value.split("T")[1].split("+")[0]  # HH:MM:SS
        except Exception as e:
            print(f"Error extracting date and time ({type(e).__name__}); keeping 'N/A'")

        # Locate the post title element
        try:
            title_element = wait.until(
                EC.presence_of_element_located(
                    (By.XPATH, "//h1[contains(@class, 'font-semibold') and @slot='title']")
                )
            )
            data['Content'] = title_element.text.strip()  # Strip to remove any extra whitespace
        except Exception as e:
            print(f"Error extracting post title ({type(e).__name__}); keeping 'N/A'")

        # Extract comments
        try:
            comments_elements = driver.find_elements(By.XPATH, '//div[@class="md text-14 rounded-[8px] pb-2xs overflow-hidden"]//div[@id="-post-rtjson-content"]/p')
            data['Comments'] = [comment.text for comment in comments_elements]
        except Exception as e:
            print(f"Error extracting comments ({type(e).__name__}); keeping an empty list")

        post_data.append(data)

    return post_data
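
The string splitting used for `Date Posted` and `Time Posted` depends on the exact shape of the `datetime` attribute. A more tolerant option, sketched here under assumptions, is to parse the value with the standard library. `split_iso_timestamp` is a hypothetical helper, the sample timestamp is a placeholder, and the real attribute format on Reddit is not confirmed by the notebook.

```python
from datetime import datetime


def split_iso_timestamp(value):
    """Split an ISO 8601 timestamp into (YYYY-MM-DD, HH:MM:SS) strings."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    # timespec="seconds" drops fractional seconds; time() drops the UTC offset
    return parsed.date().isoformat(), parsed.time().isoformat(timespec="seconds")


# Placeholder timestamp for illustration only
date_posted, time_posted = split_iso_timestamp("2025-01-01T12:34:56.789000+00:00")
print(date_posted, time_posted)
```

Unlike splitting on `"+"`, this also handles a `Z` suffix and removes fractional seconds, so the time portion stays in `HH:MM:SS` form.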


In [11]:
def save_to_yaml(data, base_filename):  
    """  
    Save extracted data to a YAML file with the current date in the filename.  

    Args:  
        data: Data to be saved.  
        base_filename: Base name of the YAML file (without extension).  
    """  
    # Get the current date in the format YYYY-MM-DD  
    date_today = datetime.now().strftime("%Y-%m-%d")
    
    # Create the full filename with date  
    filename = f"{base_filename}_{date_today}.yaml"  
    
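    # Write as UTF-8 and keep non-ASCII characters readable instead of escaped
    with open(filename, 'w', encoding='utf-8') as yaml_file:
        yaml.dump(data, yaml_file, default_flow_style=False, allow_unicode=True)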
    print(f"Data saved to {filename}")

In [12]:
subreddits = ["apple", "tesla"]
all_post_data = []

for subreddit in subreddits:
    post_data = extract_post_data(driver, wait, subreddit)
    all_post_data.extend(post_data)  # Combine data from all subreddits

# Save all data to a single YAML file
save_to_yaml(all_post_data, 'reddit_data')

Extracting post links for subreddit: apple
Scrolling 1/1 for subreddit: apple
Captured 2 links for subreddit: apple
Extracting data for post 1: https://www.reddit.com/r/apple/comments/1iijkk2/ios_app_store_apps_with_screenshotreading_malware/
Extracting data for post 2: https://www.reddit.com/r/apple/comments/1idy9vf/apple_reports_record_q1_2025_earnings_with_12430/
Extracting post links for subreddit: tesla
Scrolling 1/1 for subreddit: tesla
Captured 2 links for subreddit: tesla
Extracting data for post 1: https://www.reddit.com/r/Tesla/comments/1gxe3oj/bedini_sg_energizer_motor/
Extracting data for post 2: https://www.reddit.com/r/Tesla/comments/180wt0g/sad_day_teslas_lab_on_long_island_burned_to_the/
Data saved to reddit_data_2025-02-06.yaml
