## Stocktwits Scraper  

### How It Works  

The **Stocktwits Scraper** is designed to automate the process of logging into the Stocktwits platform and extracting relevant posts based on specific tags or stock tickers. It employs the **Selenium WebDriver** to perform browser automation, allowing it to navigate web pages, wait for elements to load, and extract information from dynamic content.  

#### Workflow Overview  

1. **Login Process**:   
   - The scraper begins by logging into Stocktwits using a provided username and password.   
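   - It navigates to the sign-in page, dismisses the consent ("I Accept") banner if one appears, enters the user credentials, and clicks the login button.
   - It waits for each form element to become available before interacting with it, then pauses for a few seconds after submitting. The code does not verify that the login actually succeeded.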

2. **Navigating to Tags**:  
   - After logging in, the scraper iterates through a list of specified tags or stock tickers.  
   - For each tag, it constructs the appropriate URL and navigates to the corresponding Stocktwits page.  

3. **Scrolling for Posts**:
   - On each pass, the scraper collects the links to the posts currently on the page, then scrolls to the bottom so that more posts load dynamically.
   - The number of passes is set by `scroll_count`, and links are kept in a set so that duplicates are discarded.

4. **Data Extraction**:
   - The scraper then opens each collected post link (up to `max_posts_per_tag` per tag) and extracts:
     - **Post Number**: A sequential identifier for each post, counted across all tags.
     - **Tag**: The specific stock ticker or tag associated with the post.
     - **Username**: The name of the user who authored the post.
     - **Time Posted**: The time the post was made, taken from the `datetime` attribute of the post's `<time>` element.
     - **Date**: The date the post was made, in YYYY-MM-DD format, taken from the same attribute.
     - **Content**: The text body of the post, which may contain comments, discussions, or market insights.
     - **Comments**: Text from message bodies on the post page that contain an `@` mention, joined with newlines.

5. **Storing Data**:  
   - Extracted data is stored in a structured format, specifically a list of dictionaries, with each dictionary representing a single post.  
   - After processing all specified tags, the scraper returns the collected data for further analysis.  
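
The **Time Posted** and **Date** fields both come from a single timestamp string. A minimal sketch of that split is shown here; it assumes the timestamp is an ISO 8601 UTC value such as `2025-01-25T14:03:22Z`, which is an illustrative sample rather than a value captured from the site. The helper name `split_timestamp` is hypothetical.

```python
from datetime import datetime


def split_timestamp(value: str) -> tuple[str, str]:
    """Split an ISO 8601 UTC timestamp into (date, time) strings."""
    # Python versions before 3.11 reject a trailing "Z" in fromisoformat(),
    # so it is normalised to an explicit UTC offset first.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed.strftime("%Y-%m-%d"), parsed.strftime("%H:%M:%S")


# Sample value for illustration only (not scraped data).
date_posted, time_posted = split_timestamp("2025-01-25T14:03:22Z")
print(date_posted, time_posted)  # 2025-01-25 14:03:22
```

Parsing the value instead of slicing the string means a malformed timestamp raises a `ValueError` rather than silently producing a wrong date.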

#### Example of Extracted Data  

The scraper collects the following types of data for each post:  

| Field       | Description                                                      |
|-------------|------------------------------------------------------------------|
| Post Number | A sequential number for the post                                 |
| Label       | A constant source label attached to every record                 |
| Tag         | The stock symbol or tag being discussed                          |
| Username    | The name of the user who made the post                           |
| Time Posted | The time at which the post was submitted                         |
| Date        | The date on which the post was submitted (YYYY-MM-DD)            |
| Content     | The text of the post                                             |
| Comments    | Text of replies on the post page that contain an `@` mention     |
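
To make the record layout concrete, here is a small sketch of one record covering the core fields (Post Number, Tag, Username, Time Posted, Date, Content). The `Post` class and all sample values are hypothetical placeholders, not part of the scraper and not real scraped data.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical container for one scraped post."""
    post_number: int
    tag: str
    username: str
    time_posted: str
    date: str
    content: str

    def to_record(self) -> dict:
        """Return the post as a dictionary keyed by the field names used by the scraper."""
        return {
            "Post Number": self.post_number,
            "Tag": self.tag,
            "Username": self.username,
            "Time Posted": self.time_posted,
            "Date": self.date,
            "Content": self.content,
        }


# Placeholder values for illustration only.
example = Post(1, "CRSP", "example_user", "14:03:22", "2025-01-25", "Example post text")
print(example.to_record())
```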

### Use Cases  

The data extracted by the Stocktwits Scraper can be utilized for various purposes, such as:  

- **Market Analysis**: Analyze sentiment around specific stocks or market trends.  
- **Investor Insights**: Gain perspectives from the community regarding specific investments.  
- **Trend Monitoring**: Track discussions around particular topics or events in real-time.  
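
As a starting point for trend monitoring, the returned list of dictionaries can be summarised with the standard library alone. The sketch below counts posts per tag; the helper name `posts_per_tag` and the sample records are assumptions made for illustration.

```python
from collections import Counter


def posts_per_tag(records):
    """Count how many scraped posts belong to each tag."""
    return Counter(record["Tag"] for record in records)


# Placeholder records; real input would be the list returned by the scraper.
sample = [
    {"Tag": "CRSP", "Content": "placeholder text"},
    {"Tag": "CRSP", "Content": "placeholder text"},
    {"Tag": "SNDL", "Content": "placeholder text"},
]
print(posts_per_tag(sample).most_common())  # [('CRSP', 2), ('SNDL', 1)]
```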

### Conclusion  

The Stocktwits Scraper combines automation with data extraction, making it a valuable tool for anyone interested in gaining insights from social media discussions related to stocks. The ability to gather and analyze this data can provide a competitive edge in understanding market sentiment and investor behavior.

**Important Note**: A temporary account has been created for this scraping task. Please be cautious when using genuine accounts.

In [66]:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.chrome.options import Options
from datetime import datetime
import pandas as pd
import time
import boto3
import yaml

In [67]:
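# Initialize WebDriver with or without headless mode
from selenium.webdriver.chrome.service import Service  # driver path is passed via Service in Selenium 4

def init_driver(chromedriver_path):
    chrome_options = Options()
    # chrome_options.add_argument("--headless")  # uncomment to run without a visible browser window
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--window-size=1280,1024")
    # The executable_path keyword was removed in Selenium 4.10; use a Service object instead.
    driver = webdriver.Chrome(service=Service(chromedriver_path), options=chrome_options)
    wait = WebDriverWait(driver, 10)  # explicit waits time out after 10 seconds
    print("Driver initiated successfully")
    return driver, wait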

In [None]:
chromedriver_path = r"D:\coding\freelancing\stockMarket\chromedriver.exe"
driver, wait = init_driver(chromedriver_path)

In [69]:
# Log in to Stocktwits
def login_to_stocktwits(username, password):
    """
    Log in to Stocktwits with the provided credentials.

    Uses the module-level `driver` and `wait` created by init_driver().
    """
    try:
        driver.get("https://stocktwits.com/signin")
        time.sleep(6)  # Allow the page to load

        # Dismiss the "I Accept" consent banner if it appears
        try:
            accept_button = wait.until(
                EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler")))
            accept_button.click()
        except Exception:
            print("The 'I Accept' button did not appear")

        # Wait for the username field and enter the username
        username_input = wait.until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, "input[data-testid='log-in-username']")))
        username_input.send_keys(username)

        # Wait for the password field and enter the password
        password_input = wait.until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, "input[data-testid='log-in-password']")))
        password_input.send_keys(password)

        # Submit the form
        login_button = wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-testid='log-in-submit']")))
        login_button.click()
        time.sleep(5)  # Give the login request time to complete

        # The result is not verified; this only confirms the form was submitted
        print("Login form submitted.")
    except Exception as e:
        print(f"Error during Stocktwits login: {e}")

In [None]:
# Credentials and configuration
username = "your_user"
password = "pass"
login_to_stocktwits(username, password)
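
Rather than typing credentials into the notebook, they can be read from the environment. The sketch below is one way to do that; the variable names `STOCKTWITS_USERNAME` and `STOCKTWITS_PASSWORD` and the helper `load_credentials` are hypothetical and not part of the original code.

```python
import os


def load_credentials(env=os.environ):
    """Read Stocktwits credentials from environment variables."""
    # STOCKTWITS_USERNAME and STOCKTWITS_PASSWORD are hypothetical variable names.
    try:
        return env["STOCKTWITS_USERNAME"], env["STOCKTWITS_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Usage (requires both variables to be set before the notebook starts):
# username, password = load_credentials()
# login_to_stocktwits(username, password)
```

Keeping credentials out of the notebook also keeps them out of any exported or shared copy of it.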

In [239]:
def scrape_stocktwits(driver, wait, tags, scroll_count, max_posts_per_tag, min_posts_per_tag=0):  
    """  
    Scrape Stocktwits posts and comments for a list of tags/tickers.  

    Args:  
        driver: Selenium WebDriver instance.  
        wait: WebDriverWait instance.  
        tags: List of tags or tickers to scrape.  
        scroll_count: Number of scrolls to load more posts.  
        max_posts_per_tag: Maximum number of posts to scrape per tag.  
        min_posts_per_tag: Minimum number of posts expected per tag; a warning is printed if fewer are scraped.

    Returns:  
        List of dictionaries containing post data.  
    """  
    all_data = []  
    post_number = 1  # Initialize post number globally  

    for tag in tags:  
        print(f"Scraping posts for: {tag}")  
        try:  
            # Navigate to the tag page  
            driver.get(f"https://stocktwits.com/symbol/{tag}")  
            time.sleep(5)  # Allow page to load  

            total_scraped_posts = 0  # Track total posts scraped for the tag  
            post_links = set()  # Use a set to avoid duplicates  

            # Collect post links, scrolling up to scroll_count times
            for scroll in range(scroll_count):
                if len(post_links) >= max_posts_per_tag:
                    break  # Stop scrolling once enough links have been captured

                print(f"Scrolling {scroll + 1}/{scroll_count} for: {tag}")  

                # Wait for the post links to load  
                wait.until(  
                    EC.presence_of_all_elements_located((By.XPATH, "//a[contains(@href, '/message/')]"))  
                )  

                # Capture post links  
                try:  
                    post_links_elements = driver.find_elements(By.XPATH, "//a[contains(@href, '/message/')]")  
                    for link in post_links_elements:  
                        post_links.add(link.get_attribute("href"))  # Add links to the set  
                    print(f"Captured {len(post_links)} post links.")  
                except Exception as e:  
                    print("An error occurred while capturing post links:", str(e))  

                # Scroll for more posts (if needed)  
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  
                time.sleep(3)  # Allow time for the page to load more posts  

            # Scrape each post sequentially using captured links  
            for link in list(post_links)[:max_posts_per_tag]:  
                try:  
                    print(f"Scraping post from link: {link}")  # Debug statement  
                    # Navigate to the post  
                    driver.get(link)  
                    time.sleep(2)  # Allow time for the post to load  

                    # Wait for post content to load  
                    post_element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".RichTextMessage_body__4qUeP")))  

                    # Extract username  
                    username = "N/A"  
                    try:   
                        username_element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,   
                            "span.StreamMessage_username-default__9l3zP.font-semibold.text-lg[aria-label='Username']")))  

                        # Extract the username text  
                        username = username_element.text   
                    except Exception:  
                        username = "N/A"  

                    # Extract time  
                    date_posted, time_posted = "N/A", "N/A"  
                    try:  
                        time_element = wait.until(EC.presence_of_element_located(  
                            (By.CSS_SELECTOR, "div[data-testid='message-header'] time[aria-label='Time this message was posted']")))  
                        datetime_value = time_element.get_attribute("datetime")  
                        date_posted, time_posted = datetime_value.split('T')  
                        time_posted = time_posted[:-1]  # Remove 'Z'  
                    except Exception:  
                        date_posted, time_posted = "N/A", "N/A"  

                    # Extract message content  
                    message_content = "N/A"  
                    try:  
                        post_content_js_path = (  
                            "return document.querySelector('.RichTextMessage_body__4qUeP').innerText;"  
                        )   
                        message_content = driver.execute_script(post_content_js_path)   
                    except Exception:  
                        message_content = "N/A"  

                    # Extract comments  
                    comments = "N/A"  
                    try:  
                        comments_js_path = (  
                            "return Array.from(document.querySelectorAll('.RichTextMessage_body__4qUeP')).filter(el => "  
                            "el.innerText.includes('@')).map(el => el.innerText.replace(/@\\w+\\s/, '').trim()).join('\\n');"  
                        )  
                        comments = driver.execute_script(comments_js_path)  
                    except Exception:  
                        comments = "N/A"  

                    # Append data  
                    all_data.append({  
                        'Post Number': post_number,
                        'label': "stocktwits_data",
                        'Tag': tag,  
                        'Username': username,  
                        'Time Posted': time_posted,  
                        'Date': date_posted,  
                        'Content': message_content,  
                        'Comments': comments
                    })  

                    post_number += 1  # Increment post number  
                    total_scraped_posts += 1  # Increment total scraped posts  

                except Exception as e:  
                    print(f"Error processing post from link {link}: {e}")  
                    continue  # Skip to the next iteration  

            # Check if the minimum posts requirement is met  
            if total_scraped_posts < min_posts_per_tag:  
                print(f"Warning: Only {total_scraped_posts} posts scraped for tag '{tag}', which is less than the minimum required ({min_posts_per_tag}).")  

        except Exception as e:  
            print(f"Error processing tag {tag}: {e}")  

    return all_data
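
One design note on the link collection above: a Python `set` removes duplicates but does not preserve the order in which links were found, so slicing it to `max_posts_per_tag` may not select the newest posts. A possible alternative, sketched here with a hypothetical helper `collect_links` and placeholder URLs, uses dictionary keys to keep first-seen order.

```python
def collect_links(batches, limit):
    """Merge batches of links, dropping duplicates but keeping first-seen order."""
    seen = {}  # dict keys preserve insertion order (Python 3.7+)
    for batch in batches:
        for href in batch:
            if href is not None:  # a missing href attribute is returned as None
                seen.setdefault(href, None)
    return list(seen)[:limit]


# Placeholder URLs for illustration only; each inner list stands for one scroll pass.
batches = [
    ["https://example.com/message/1", "https://example.com/message/2"],
    ["https://example.com/message/2", "https://example.com/message/3"],
]
print(collect_links(batches, limit=2))
```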

In [240]:
# def save_and_upload_data(data, bucket_name):
def save_and_upload_data(data):
    """
    Save data to a date-stamped YAML file.

    Excel export and S3 upload are left as commented-out placeholders below;
    restore the `bucket_name` parameter before enabling the upload.

    Args:
        data: The list of post dictionaries to save.
    """
    try:
        # Save to YAML with UTF-8 encoding
        yaml_filename = f"stocktwits_data_{datetime.now().strftime('%Y-%m-%d')}.yaml"
        with open(yaml_filename, 'w', encoding='utf-8') as yaml_file:
            yaml.dump(data, yaml_file, allow_unicode=True)
        print(f"Data saved to YAML: {yaml_filename}")

        # Save to Excel (if needed, implement this part)
        # excel_filename = f"stocktwits_data_{datetime.now().strftime('%Y-%m-%d')}.xlsx"
        # Implement Excel saving logic here using pandas or openpyxl

        # Upload to S3
        # s3 = boto3.client('s3')
        # # s3.upload_file(excel_filename, bucket_name, excel_filename)
        # s3.upload_file(yaml_filename, bucket_name, yaml_filename)
        # print(f"Files uploaded to S3 bucket: {bucket_name}")

    except Exception as e:
        print(f"Error saving or uploading data: {e}")
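
The Excel step above is left unimplemented. As a stand-in that needs no extra dependencies, the records could be written to CSV, which Excel can open. This is a sketch and not the author's export logic; the helper name `save_to_csv` and the `FIELDS` list are assumptions, with the column names taken from the dictionary keys used in `scrape_stocktwits`.

```python
import csv

# Column order follows the keys of the dictionaries built by the scraper.
FIELDS = ["Post Number", "label", "Tag", "Username", "Time Posted", "Date", "Content", "Comments"]


def save_to_csv(data, filename):
    """Write a list of post dictionaries to a UTF-8 CSV file and return its name."""
    with open(filename, "w", encoding="utf-8", newline="") as csv_file:
        # Unknown keys are ignored; missing keys are written as empty cells.
        writer = csv.DictWriter(csv_file, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(data)
    return filename


# Usage, assuming `scraped_data` is the list returned by scrape_stocktwits():
# save_to_csv(scraped_data, "stocktwits_data.csv")
```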

In [None]:
if __name__ == "__main__":
    # chromedriver_path = r"path_to_your_chromedriver"
    # driver, wait = init_driver(chromedriver_path)
    # bucket_name = "your-s3-bucket-name"
    # tags = ["WISH.TSXV", "SNDL", "CLOV", "CRSP", "RCPT", "AVAV", "ZOM", "PLBY"]
    tags = ["CRSP"]
    try:
        # Scrape Stocktwits
        scraped_data = scrape_stocktwits(driver, wait, tags, scroll_count=1, max_posts_per_tag=2)
        # save_and_upload_data(scraped_data, bucket_name)
        save_and_upload_data(scraped_data)
    except Exception as e:
        print(f"error in main: {e}")

Scraping posts for: CRSP
Scrolling 1/1 for: CRSP
Captured 23 post links.
Scraping post from link: https://stocktwits.com/audiopile/message/601267669
Scraping post from link: https://stocktwits.com/Jupilerke/message/601322936
Data saved to YAML: stocktwits_data_2025-01-25.yaml
