<a href="https://colab.research.google.com/github/davidelgas/DataSciencePortfolio/blob/main/Language_Models/LLM_RAG/RAG_Corpus_Development.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1 Project Description






The project leverages user-generated content from a domain-specific online forum as the training corpus. This data is largely unstructured, with minimal metadata available. The following tools were considered to gather the source text for the corpus:


### Web Scraping
- **Tools:** Beautiful Soup, online SaaS products
    - **Pros:**
        - **Direct Access to Targeted Data:** Enables precise extraction of user-generated content from specific sections or threads within the forum.
        - **Efficiency in Data Collection:** Automated scripts can gather large volumes of data in a short amount of time, making it suitable for assembling significant datasets for NLP.
    - **Cons:**
        - **Potential for Incomplete Data:** May miss embedded content or dynamically loaded data, depending on the website’s structure.
        - **Ethical and Legal Considerations:** Scraping data from forums may raise concerns about user privacy and must adhere to the terms of service of the website.
        - **Highly Platform-Dependent:** Each forum platform has its own page structure, so the HTML schema must be reverse engineered for each site before text can be extracted reliably.
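
To make the "targeted extraction" point concrete, here is a minimal sketch of pulling post text out of forum-style HTML. It uses Python's built-in `html.parser` in place of Beautiful Soup so that it runs without extra installs. The sample HTML is made up for illustration, and the `PostTextExtractor` class is a hypothetical helper; only the `bbWrapper` class name is taken from the scraper later in this notebook.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collects the text inside every <div class="bbWrapper"> element."""

    def __init__(self):
        super().__init__()
        self.posts = []    # finished post texts
        self._depth = 0    # > 0 while inside a post body (counts nested divs)
        self._chunks = []  # text fragments of the post currently being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth > 0:
            self._depth += 1      # nested div inside a post body
        elif "bbWrapper" in classes:
            self._depth = 1       # entering a post body
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:  # leaving the post body
                text = " ".join(self._chunks)
                self.posts.append(" ".join(text.split()))  # collapse whitespace

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


# Made-up sample page, not real forum content
SAMPLE_HTML = """
<article class="message--post"><div class="bbWrapper">First post <b>text</b></div></article>
<article class="message--post"><div class="bbWrapper">Reply with <div class="quote">a quote</div> inside</div></article>
<div class="sidebar">Not a post</div>
"""

parser = PostTextExtractor()
parser.feed(SAMPLE_HTML)
posts = parser.posts
print(posts)
```

The sidebar text is ignored because it sits outside any `bbWrapper` element, which is the kind of precision the first "pro" above refers to. It also shows the platform dependence: the class name has to be discovered by inspecting the page.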

### Forum-specific APIs
- **Tools:** Python (`requests` library for API calls and `json` library for handling responses)
    - **Pros:**
        - **Structured and Reliable Data Retrieval:** APIs provide structured data, making it easier to process and integrate into your project.
        - **Efficient and Direct Access:** Directly accessing the forum's data through its API is efficient, bypassing the need for HTML parsing.
        - **Compliance and Ethical Data Use:** Utilizing APIs respects the forum's data use policies and ensures access is in line with user agreements.
    - **Cons:**
        - **Rate Limiting:** APIs often have limitations on the number of requests that can be made in a certain timeframe, which could slow down data collection.
        - **API Changes:** Dependence on the forum's API structure means that changes or deprecation could disrupt your data collection pipeline.
        - **Access Restrictions:** Some data or functionalities might be restricted or require authentication, posing additional challenges for comprehensive data collection.
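
The rate-limiting concern can be sketched as a retry loop with exponential backoff around a JSON-returning call. Everything named here is an assumption for illustration: `RateLimited`, `fetch_with_backoff`, and the fake endpoint are hypothetical, and no real forum API is being described. A real client would typically raise the rate-limit signal from an HTTP 429 response.

```python
import json
import time


class RateLimited(Exception):
    """Hypothetical signal that the API asked the client to slow down."""


def fetch_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() and parse its JSON body, doubling the wait after each rate limit."""
    for attempt in range(max_attempts):
        try:
            return json.loads(fetch())
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)


# Fake endpoint (assumption): rate-limits the first two calls, then returns JSON text
calls = {"n": 0}


def fake_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return '{"thread_id": 1, "posts": ["hello", "world"]}'


waits = []  # record the requested delays instead of actually sleeping
thread = fetch_with_backoff(fake_fetch, sleep=waits.append)
print(thread["posts"], waits)
```

Passing `sleep` as a parameter keeps the sketch fast to run and easy to test; in real use the default `time.sleep` would apply.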


**Conclusion: I will be using Beautiful Soup to create my corpus.**


# 2 Create Environment

In [None]:
# Access to Google Drive
# This seems to propagate credentials better from its own cell

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#Packages and libraries

# snowflake-connector-python is the package that provides `snowflake.connector`
!pip install snowflake-connector-python

import os
import time
import requests
import pandas as pd
import concurrent.futures
import snowflake.connector
import json

from bs4 import BeautifulSoup
from datetime import datetime

# --- Settings ---
#credentials_path = '/content/drive/Othercomputers/My Mac/Git/credentials/snowflake_credentials.txt'




# 3 Data Collection


In [None]:

# --- Configuration Parameters ---
DEFAULT_BASE_PATH = '/content/drive/Othercomputers/My Mac/Git/Language_Models/datasets/e9/'

# --- Data Fetching Functions ---

def create_urls(base_path: str, filename: str = 'e9_forum_thread_ids.csv', threads: int = 1):
    """
    Creates and records thread IDs to fetch by incrementing from the last known thread ID.

    Args:
        base_path: Directory to store data files
        filename: File to store thread IDs
        threads: Number of new thread IDs to create

    Returns:
        DataFrame with new thread IDs
    """
    file_path = os.path.join(base_path, filename)
    if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
        existing_ids = pd.read_csv(file_path)
        last_thread_id = int(existing_ids['thread_id'].iloc[-1])
        print(f"Existing thread_ids found. Last thread_id: {last_thread_id}")
    else:
        last_thread_id = 0
        print(f"No existing thread_ids. Starting from {last_thread_id}")

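    new_ids = [{'thread_id': tid} for tid in range(last_thread_id + 1, last_thread_id + threads + 1)]
    new_thread_ids = pd.DataFrame(new_ids, columns=['thread_id'])
    # Write the header whenever the file is missing or empty; otherwise the first ID
    # would be read back as the column name on the next run.
    write_header = not (os.path.exists(file_path) and os.path.getsize(file_path) > 0)
    new_thread_ids.to_csv(file_path, mode='a', header=write_header, index=False)

    print(f"Added {len(new_ids)} new thread_ids. Ending at {last_thread_id + len(new_ids)}")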
    return new_thread_ids

def fetch_full_thread_data(df, base_path: str, posts_filename: str = 'e9_forum_posts.csv', decorated_filename: str = 'e9_forum_threads_decorated.csv'):
    """
    Fetches full thread data for the given thread IDs from e9coupe forum.

    Args:
        df: DataFrame with thread_id column
        base_path: Directory to store output files
        posts_filename: File to store individual posts
        decorated_filename: File to store thread metadata
    """
    # Ensure the base path exists
    os.makedirs(base_path, exist_ok=True)

    posts_file = os.path.join(base_path, posts_filename)
    decorated_file = os.path.join(base_path, decorated_filename)

    # Load existing data if available
    existing_posts = pd.read_csv(posts_file) if os.path.exists(posts_file) else pd.DataFrame(columns=['thread_id', 'post_timestamp', 'post_raw'])
    existing_decorated = pd.read_csv(decorated_file) if os.path.exists(decorated_file) else pd.DataFrame(columns=['thread_id', 'thread_title', 'thread_first_post'])

    # Find threads that we haven't processed yet
    existing_thread_ids = set(existing_posts['thread_id'].tolist()) | set(existing_decorated['thread_id'].tolist())
    new_threads = df[~df['thread_id'].isin(existing_thread_ids)]

    if new_threads.empty:
        print("No new threads to fetch.")
        return

    print(f"Fetching data for {len(new_threads)} new threads...")

    post_data = []
    decorated_data = []

    for thread_id in new_threads['thread_id']:
        thread_url = f"https://e9coupe.com/forum/threads/{thread_id}/?page=1"
        try:
            print(f"Fetching thread {thread_id}...")
            # Time out rather than hang indefinitely on an unresponsive server
            response = requests.get(thread_url, timeout=30)

            if response.status_code != 200:
                print(f"Error {response.status_code} fetching {thread_url}")
                continue

            soup = BeautifulSoup(response.text, 'html.parser')
            articles = soup.find_all('article', class_='message--post')

            if not articles:
                print(f"No posts found for thread {thread_id}. Skipping.")
                continue

            post_count = len(articles)
            print(f"Found {post_count} posts in thread {thread_id}")

            # Extract thread title
            title_element = soup.find('title')
            thread_title = title_element.get_text().split('|')[0].strip() if title_element else "No Title"

            # Extract first post content
            first_post_element = soup.find('article', class_='message-body')
            first_post = first_post_element.get_text(strip=True) if first_post_element else "No content"

            decorated_data.append({
                'thread_id': thread_id,
                'thread_title': thread_title,
                'thread_first_post': first_post
            })

            # Extract all posts
            for article in articles:
                timestamp_element = article.find('time')
                content_element = article.find('div', class_='bbWrapper')

                post_data.append({
                    'thread_id': thread_id,
                    'post_timestamp': timestamp_element['datetime'] if timestamp_element else "N/A",
                    'post_raw': content_element.get_text(strip=True) if content_element else "No content"
                })

            # Be respectful to the server
            time.sleep(1)

        except Exception as e:
            print(f"Error fetching thread {thread_id}: {e}")

    # Save new posts
    if post_data:
        new_posts_df = pd.DataFrame(post_data)
        combined_posts = pd.concat([existing_posts, new_posts_df], ignore_index=True)
        combined_posts.to_csv(posts_file, index=False)
        print(f"Saved {len(new_posts_df)} new posts. Total posts: {len(combined_posts)}")

    # Save new thread metadata
    if decorated_data:
        new_decorated_df = pd.DataFrame(decorated_data)
        combined_decorated = pd.concat([existing_decorated, new_decorated_df], ignore_index=True)
        combined_decorated.to_csv(decorated_file, index=False)
        print(f"Saved {len(new_decorated_df)} new decorated threads. Total threads: {len(combined_decorated)}")

def create_forum_corpus(base_path: str, posts_filename: str = 'e9_forum_posts.csv',
                       decorated_filename: str = 'e9_forum_threads_decorated.csv',
                       corpus_filename: str = 'e9_forum_corpus.csv',
                       append_to_main_corpus: bool = True):
    """
    Create or update the forum corpus file by combining posts and thread metadata.

    Args:
        base_path: Directory containing data files
        posts_filename: File containing individual posts
        decorated_filename: File containing thread metadata
        corpus_filename: Output file for the batch corpus
        append_to_main_corpus: Whether to append to the main corpus file

    Returns:
        DataFrame containing the complete corpus
    """
    posts_file = os.path.join(base_path, posts_filename)
    decorated_file = os.path.join(base_path, decorated_filename)
    corpus_file = os.path.join(base_path, corpus_filename)
    main_corpus_file = os.path.join(base_path, 'e9_forum_corpus.csv')

    # Check if the required input files exist
    if not os.path.exists(posts_file) or not os.path.exists(decorated_file):
        print("ERROR: Required input files not found. Cannot create corpus.")
        if not os.path.exists(posts_file):
            print(f"Missing: {posts_file}")
        if not os.path.exists(decorated_file):
            print(f"Missing: {decorated_file}")
        return pd.DataFrame()

    # Read the full posts and decorated files
    print(f"Reading posts from {posts_file}")
    posts_df = pd.read_csv(posts_file)
    print(f"Reading thread metadata from {decorated_file}")
    decorated_df = pd.read_csv(decorated_file)

    print(f"Found {len(posts_df)} posts across {posts_df['thread_id'].nunique()} threads")
    print(f"Found {len(decorated_df)} threads with metadata")

    # Aggregate posts by thread_id
    print("Aggregating posts by thread ID...")
    aggregated = posts_df.groupby('thread_id')['post_raw'].agg(
        lambda x: ' '.join(str(i) for i in x if pd.notna(i))).reset_index()
    aggregated.rename(columns={'post_raw': 'thread_all_posts'}, inplace=True)

    # Ensure data types match for joining
    decorated_df['thread_id'] = decorated_df['thread_id'].astype('int64')
    aggregated['thread_id'] = aggregated['thread_id'].astype('int64')

    # Find threads with both metadata and posts
    common_thread_ids = set(decorated_df['thread_id']) & set(aggregated['thread_id'])
    print(f"Found {len(common_thread_ids)} threads with both metadata and posts")

    # Filter to only include threads with both metadata and posts
    filtered_decorated = decorated_df[decorated_df['thread_id'].isin(common_thread_ids)]
    filtered_aggregated = aggregated[aggregated['thread_id'].isin(common_thread_ids)]

    # Create the corpus by merging
    batch_corpus = pd.merge(filtered_decorated, filtered_aggregated, on='thread_id', how='inner')
    print(f"Created corpus with {len(batch_corpus)} threads")

    # Save the batch corpus
    batch_corpus.to_csv(corpus_file, index=False)
    print(f"Saved batch corpus to {corpus_file}")

    # If append_to_main_corpus is True, update the main corpus file
    if append_to_main_corpus:
        # Load existing main corpus if available
        if os.path.exists(main_corpus_file):
            main_corpus = pd.read_csv(main_corpus_file)
            print(f"Loaded existing main corpus with {len(main_corpus)} threads")

            # Find new threads not already in the main corpus
            existing_main_thread_ids = set(main_corpus['thread_id'].tolist())
            new_threads = batch_corpus[~batch_corpus['thread_id'].isin(existing_main_thread_ids)]

            if new_threads.empty:
                print("No new threads to add to main corpus")
                # Return the complete corpus, matching the docstring and the branch below
                return main_corpus
            else:
                # Append new threads to main corpus
                combined_corpus = pd.concat([main_corpus, new_threads], ignore_index=True)
                combined_corpus.to_csv(main_corpus_file, index=False)
                print(f"Added {len(new_threads)} new threads to main corpus. Total: {len(combined_corpus)}")
                return combined_corpus
        else:
            # If main corpus doesn't exist, create it with the batch corpus
            batch_corpus.to_csv(main_corpus_file, index=False)
            print(f"Created new main corpus with {len(batch_corpus)} threads")

    return batch_corpus

def update_local_corpus(base_path: str, threads_to_add: int = 5, corpus_filename: str = 'e9_forum_corpus_batch.csv'):
    """
    Main function to update the local forum corpus by fetching new threads.

    Args:
        base_path: Directory to store all data files
        threads_to_add: Number of new threads to fetch
        corpus_filename: Filename for the batch corpus

    Returns:
        DataFrame containing the updated corpus
    """
    print("\n=== Starting Local Forum Corpus Update ===\n")

    # Create directory if it doesn't exist
    os.makedirs(base_path, exist_ok=True)

    # Get new thread IDs to fetch
    new_thread_ids = create_urls(base_path, threads=threads_to_add)

    # Fetch data for new threads
    fetch_full_thread_data(new_thread_ids, base_path)

    # Create or update the corpus, and append to main corpus
    forum_corpus_df = create_forum_corpus(base_path, corpus_filename=corpus_filename, append_to_main_corpus=True)

    print("\n=== Local Forum Corpus Update Complete ===\n")
    return forum_corpus_df

def create_corpus_backup(base_path: str, corpus_filename: str = 'e9_forum_corpus.csv'):
    """
    Create a timestamped backup of the current corpus.

    Args:
        base_path: Directory containing the corpus file
        corpus_filename: Filename of the corpus to backup
    """
    corpus_path = os.path.join(base_path, corpus_filename)
    if not os.path.exists(corpus_path):
        print(f"Corpus file not found: {corpus_path}")
        return

    # Create backups directory
    backup_dir = os.path.join(base_path, 'backups')
    os.makedirs(backup_dir, exist_ok=True)

    # Generate timestamped filename
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    backup_filename = f"{os.path.splitext(corpus_filename)[0]}_{timestamp}.csv"
    backup_path = os.path.join(backup_dir, backup_filename)

    # Copy the file
    import shutil
    shutil.copy2(corpus_path, backup_path)

    print(f"Created backup: {backup_path}")

def save_corpus_to_json(base_path: str, corpus_filename: str = 'e9_forum_corpus.csv'):
    """
    Save the corpus to a JSON file with better formatting.

    Args:
        base_path: Directory containing the corpus file
        corpus_filename: Filename of the corpus to convert
    """
    corpus_path = os.path.join(base_path, corpus_filename)
    if not os.path.exists(corpus_path):
        print(f"Corpus file not found: {corpus_path}")
        return

    # Load the corpus
    corpus_df = pd.read_csv(corpus_path)

    if corpus_df.empty:
        print("No data to save.")
        return

    # Create JSON filename based on the CSV filename
    json_filename = f"{os.path.splitext(corpus_filename)[0]}.json"
    json_path = os.path.join(base_path, json_filename)

    # Convert to records format for better JSON structure
    records = corpus_df.to_dict(orient='records')

    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

    print(f"Saved corpus to JSON: {json_path} ({len(records)} threads)")

def fetch_forum_data_in_batches(base_path: str, num_batches: int = 2, threads_per_batch: int = 10):
    """
    Run multiple batch updates to fetch forum data and aggregate into a single corpus.

    Args:
        base_path: Directory to store all data files
        num_batches: Number of batches to run
        threads_per_batch: Number of threads to fetch in each batch

    Returns:
        DataFrame containing the complete corpus
    """
    # Ensure base path exists
    os.makedirs(base_path, exist_ok=True)

    # Create backup of existing corpus if it exists
    main_corpus_path = os.path.join(base_path, 'e9_forum_corpus.csv')
    if os.path.exists(main_corpus_path):
        create_corpus_backup(base_path)

    print(f"\n=== Starting Forum Data Fetching: {num_batches} batches, {threads_per_batch} threads per batch ===\n")

    for batch_num in range(num_batches):
        print(f"\n=== Processing Batch {batch_num + 1}/{num_batches} ===\n")

        batch_filename = f"e9_forum_corpus_batch_{batch_num + 1}.csv"
        update_local_corpus(base_path, threads_to_add=threads_per_batch, corpus_filename=batch_filename)

    # Save the final corpus to JSON format as well
    save_corpus_to_json(base_path)

    # Load and return the final corpus
    if os.path.exists(main_corpus_path):
        final_corpus = pd.read_csv(main_corpus_path)
        print(f"\n=== Forum Data Fetching Complete: {len(final_corpus)} total threads in corpus ===\n")
        return final_corpus
    else:
        print("\n=== Forum Data Fetching Complete, but no corpus was created ===\n")
        return pd.DataFrame()
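
The aggregation step inside `create_forum_corpus` can be illustrated on a few rows. The sketch below uses plain Python in place of the pandas `groupby` and `merge` calls so that it runs anywhere; `build_corpus` is a hypothetical helper and the sample rows are invented. It shows the effect of the inner join: only threads that have both metadata and posts reach the corpus.

```python
from collections import defaultdict

# Invented sample rows using the same column names as the CSV files above
posts = [
    {"thread_id": 1, "post_raw": "Where can I find trim clips?"},
    {"thread_id": 1, "post_raw": "Try the usual parts suppliers."},
    {"thread_id": 2, "post_raw": "Post whose thread has no metadata"},
]
threads = [
    {"thread_id": 1, "thread_title": "Trim clips",
     "thread_first_post": "Where can I find trim clips?"},
    {"thread_id": 3, "thread_title": "No posts scraped",
     "thread_first_post": "No content"},
]


def build_corpus(posts, threads):
    """Join thread metadata with the space-joined text of all posts in the thread."""
    grouped = defaultdict(list)
    for post in posts:
        if post["post_raw"] is not None:  # skip missing post bodies
            grouped[post["thread_id"]].append(str(post["post_raw"]))

    corpus = []
    for thread in threads:
        if thread["thread_id"] in grouped:  # inner join: needs metadata and posts
            all_posts = " ".join(grouped[thread["thread_id"]])
            corpus.append({**thread, "thread_all_posts": all_posts})
    return corpus


corpus = build_corpus(posts, threads)
print(corpus)
```

Threads 2 and 3 are dropped because each is missing one side of the join, which is why the corpus row count can be lower than either input file's thread count.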

# 4 Data Storage

In [None]:
# --- Configuration Parameters ---
DEFAULT_BASE_PATH = '/content/drive/Othercomputers/My Mac/Git/Language_Models/datasets/e9/'
DEFAULT_CREDENTIALS_PATH = '/content/drive/Othercomputers/My Mac/Git/credentials/snowflake_credentials.txt'

# --- Utility Functions ---

def load_credentials(path_to_credentials):
    """
    Load Snowflake credentials from a file and set them as environment variables.

    Args:
        path_to_credentials: Path to the credentials file
    """
    if not os.path.exists(path_to_credentials):
        raise FileNotFoundError(f"Credentials file not found: {path_to_credentials}")

    with open(path_to_credentials, 'r') as file:
        for line_num, line in enumerate(file, start=1):
            line = line.strip()
            if line and '=' in line:
                key, value = line.split('=', 1)
                # Tolerate spaces around the '=' separator
                os.environ[key.strip()] = value.strip()
            else:
                print(f"Skipping invalid line {line_num}: {line}")

    for var in ['USER', 'PASSWORD', 'ACCOUNT']:
        if not os.environ.get(var):
            raise EnvironmentError(f"Missing environment variable: {var}")

def connect_to_snowflake():
    """
    Connect to Snowflake using environment variables.

    Returns:
        Snowflake connection object
    """
    try:
        conn = snowflake.connector.connect(
            user=os.environ.get('USER'),
            password=os.environ.get('PASSWORD'),
            account=os.environ.get('ACCOUNT')
        )
        print(f"Connected to Snowflake account: {os.environ.get('ACCOUNT')}")
        return conn
    except Exception as e:
        # Chain the original exception so the underlying cause stays in the traceback
        raise ConnectionError(f"Failed to connect to Snowflake: {e}") from e

def create_db_schema_table(cur):
    """
    Create the database, schema, and table if they don't exist.

    Args:
        cur: Snowflake cursor
    """
    try:
        cur.execute("CREATE DATABASE IF NOT EXISTS e9_corpus")
        cur.execute("USE DATABASE e9_corpus")
        cur.execute("CREATE SCHEMA IF NOT EXISTS e9_corpus_schema")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS e9_corpus.e9_corpus_schema.e9_forum_corpus (
                THREAD_ID NUMBER(38,0) PRIMARY KEY,
                THREAD_TITLE STRING,
                THREAD_FIRST_POST STRING,
                THREAD_ALL_POSTS STRING
            )
        """)
        print("Database, schema, and table checked/created.")
    except Exception as e:
        print(f"Error creating database/schema/table: {e}")

def fetch_existing_thread_ids(cur):
    """
    Fetch the list of thread IDs that already exist in Snowflake.

    Args:
        cur: Snowflake cursor

    Returns:
        Set of existing thread IDs
    """
    query = "SELECT THREAD_ID FROM e9_corpus.e9_corpus_schema.e9_forum_corpus"
    try:
        cur.execute(query)
        result = cur.fetchall()
        return set(row[0] for row in result)
    except Exception as e:
        print(f"Error fetching existing thread IDs: {e}")
        return set()

def insert_missing_data(cur, df, existing_thread_ids):
    """
    Insert only new data into Snowflake, skipping already loaded threads.

    Args:
        cur: Snowflake cursor
        df: DataFrame with the corpus data
        existing_thread_ids: Set of thread IDs already in the database
    """
    if df.empty:
        print("No data to insert.")
        return

    print(f"Original DataFrame has {len(df)} rows.")
    df.columns = [col.upper() for col in df.columns]

    # Filter only missing threads
    new_df = df[~df['THREAD_ID'].isin(existing_thread_ids)]
    print(f"{len(new_df)} new threads will be inserted into Snowflake.")

    if new_df.empty:
        print("No new threads to insert.")
        return

    # Replace NaN values with None
    new_df = new_df.where(pd.notnull(new_df), None)

    insert_query = """
    INSERT INTO e9_corpus.e9_corpus_schema.e9_forum_corpus
    (THREAD_ID, THREAD_TITLE, THREAD_FIRST_POST, THREAD_ALL_POSTS)
    VALUES (%s, %s, %s, %s)
    """

    # Create a list of tuples
    rows_to_insert = [
        (
            row['THREAD_ID'],
            row['THREAD_TITLE'],
            row['THREAD_FIRST_POST'],
            row['THREAD_ALL_POSTS']
        )
        for _, row in new_df.iterrows()
    ]

    # Batch insert all rows at once
    cur.executemany(insert_query, rows_to_insert)
    print(f"Inserted {len(rows_to_insert)} new threads into Snowflake.")

def upload_corpus_to_snowflake(base_path=DEFAULT_BASE_PATH,
                              credentials_path=DEFAULT_CREDENTIALS_PATH,
                              filename=None):
    """
    Upload the corpus file to Snowflake.

    Args:
        base_path: Directory containing the corpus file
        credentials_path: Path to the Snowflake credentials file
        filename: Name of the corpus file (if None, uses the default e9_forum_corpus.csv)
    """
    # If no specific filename is provided, use the main corpus file
    if filename is None:
        filename = 'e9_forum_corpus.csv'

    file_path = os.path.join(base_path, filename)
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Corpus file not found: {file_path}")

    # Load the corpus
    forum_corpus_df = pd.read_csv(file_path)
    print(f"Loaded {len(forum_corpus_df)} rows from {file_path} to upload.")

    # Set up Snowflake connection
    load_credentials(credentials_path)
    conn = connect_to_snowflake()
    cur = conn.cursor()

    try:
        # Create database objects if needed
        create_db_schema_table(cur)

        # Check what threads already exist
        existing_thread_ids = fetch_existing_thread_ids(cur)
        print(f"Snowflake already has {len(existing_thread_ids)} threads.")

        # Insert only new threads
        insert_missing_data(cur, forum_corpus_df, existing_thread_ids)

        # Commit the transaction
        conn.commit()
        print(f"Data from {filename} committed successfully.")

        # Get the final count
        cur.execute("SELECT COUNT(*) FROM e9_corpus.e9_corpus_schema.e9_forum_corpus")
        final_count = cur.fetchone()[0]
        print(f"Total threads now in Snowflake: {final_count}")

    except Exception as e:
        print(f"Error during upload: {e}")
        conn.rollback()
        raise  # re-raise with the original traceback
    finally:
        cur.close()
        conn.close()

def upload_batch_files_to_snowflake(base_path=DEFAULT_BASE_PATH,
                                   credentials_path=DEFAULT_CREDENTIALS_PATH,
                                   pattern='e9_forum_corpus_batch_*.csv'):
    """
    Upload all batch files to Snowflake.

    Args:
        base_path: Directory containing the batch files
        credentials_path: Path to the Snowflake credentials file
        pattern: Pattern to match batch files
    """
    import glob

    # Find all matching batch files
    batch_files = glob.glob(os.path.join(base_path, pattern))

    if not batch_files:
        print(f"No batch files found matching pattern: {pattern}")
        return

    print(f"Found {len(batch_files)} batch files to upload.")

    # Upload each batch file
    for batch_file in sorted(batch_files):
        filename = os.path.basename(batch_file)
        print(f"\n=== Uploading {filename} ===\n")
        try:
            upload_corpus_to_snowflake(base_path, credentials_path, filename)
        except Exception as e:
            print(f"Error uploading {filename}: {e}")

def upload_main_corpus_to_snowflake(base_path=DEFAULT_BASE_PATH,
                                   credentials_path=DEFAULT_CREDENTIALS_PATH):
    """
    Upload the main corpus file to Snowflake.

    Args:
        base_path: Directory containing the corpus file
        credentials_path: Path to the Snowflake credentials file
    """
    print("\n=== Uploading main corpus file ===\n")
    try:
        upload_corpus_to_snowflake(base_path, credentials_path, 'e9_forum_corpus.csv')
    except Exception as e:
        print(f"Error uploading main corpus: {e}")

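# For Colab execution
if __name__ == "__main__":
    import sys

    # In Colab, create buttons for the user to choose what to upload.
    # 'google.colab' is a module name, so it is looked up in sys.modules;
    # it never appears as a key in globals().
    if 'google.colab' in sys.modules: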
        try:
            from google.colab import output
            import ipywidgets as widgets
            from IPython.display import display

            print("E9 Forum Snowflake Uploader")
            print("---------------------------")
            print(f"Default Base Path: {DEFAULT_BASE_PATH}")
            print(f"Default Credentials Path: {DEFAULT_CREDENTIALS_PATH}")
            print("Choose an upload option:")

            def on_upload_main_clicked(b):
                output.clear()
                upload_main_corpus_to_snowflake()

            def on_upload_batches_clicked(b):
                output.clear()
                upload_batch_files_to_snowflake()

            def on_upload_all_clicked(b):
                output.clear()
                print("Uploading main corpus...")
                upload_main_corpus_to_snowflake()
                print("\nUploading batch files...")
                upload_batch_files_to_snowflake()

            upload_main_button = widgets.Button(
                description='Upload Main Corpus',
                button_style='info',
                tooltip='Upload only the main corpus file'
            )
            upload_main_button.on_click(on_upload_main_clicked)

            upload_batches_button = widgets.Button(
                description='Upload Batch Files',
                button_style='warning',
                tooltip='Upload all batch files'
            )
            upload_batches_button.on_click(on_upload_batches_clicked)

            upload_all_button = widgets.Button(
                description='Upload All',
                button_style='success',
                tooltip='Upload both main corpus and batch files'
            )
            upload_all_button.on_click(on_upload_all_clicked)

            display(widgets.HBox([upload_main_button, upload_batches_button, upload_all_button]))
        except Exception:
            # Fall back to regular execution if widgets not available
            print("Interactive widgets not available. Using default options.")
            print("To upload the main corpus file:")
            print("  upload_main_corpus_to_snowflake()")
            print("To upload all batch files:")
            print("  upload_batch_files_to_snowflake()")
            print("To upload a specific file:")
            print("  upload_corpus_to_snowflake(filename='your_file.csv')")
    else:
        # Not in Colab
        print("To upload the main corpus file:")
        print("  upload_main_corpus_to_snowflake()")
        print("To upload all batch files:")
        print("  upload_batch_files_to_snowflake()")
        print("To upload a specific file:")
        print("  upload_corpus_to_snowflake(filename='your_file.csv')")

To upload the main corpus file:
  upload_main_corpus_to_snowflake()
To upload all batch files:
  upload_batch_files_to_snowflake()
To upload a specific file:
  upload_corpus_to_snowflake(filename='your_file.csv')
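
`load_credentials` reads one `KEY=VALUE` pair per line and requires `USER`, `PASSWORD`, and `ACCOUNT`. The sketch below parses the same format into a dictionary rather than into environment variables, which is a deliberate simplification for illustration. `parse_credentials` is a hypothetical helper and all values are placeholders, not real credentials.

```python
def parse_credentials(text):
    """Parse KEY=VALUE lines into a dict, skipping blank or malformed lines."""
    creds = {}
    for line in text.splitlines():
        line = line.strip()
        if line and "=" in line:
            # Split on the first '=' only, so values may themselves contain '='
            key, value = line.split("=", 1)
            creds[key.strip()] = value.strip()
    return creds


# Placeholder file contents (assumption): not real credentials
SAMPLE = """USER=example_user
PASSWORD=not=a=real=password
ACCOUNT=xy12345

this line has no separator
"""

creds = parse_credentials(SAMPLE)
missing = [key for key in ("USER", "PASSWORD", "ACCOUNT") if not creds.get(key)]
print(creds, missing)
```

Keeping the parsed values in a dictionary avoids overwriting existing environment variables such as `USER`; whether that trade-off suits this notebook is a design choice left open here.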


# 5 Orchestration

In [None]:
# Configuration
BASE_PATH = '/content/drive/Othercomputers/My Mac/Git/Language_Models/datasets/e9/'
CREDENTIALS_PATH = '/content/drive/Othercomputers/My Mac/Git/credentials/snowflake_credentials.txt'
NUM_BATCHES = 2
THREADS_PER_BATCH = 25
MAX_WORKERS = 3

# Create executor for concurrent uploads
executor = concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS)
futures = []

# Process each batch
for batch_num in range(NUM_BATCHES):
    print(f"\n=== Starting batch {batch_num + 1} ===\n")

    # Generate batch filename
    batch_filename = f"e9_forum_corpus_batch_{batch_num + 1}.csv"

    # Fetch data
    forum_corpus_df = update_local_corpus(BASE_PATH, threads_to_add=THREADS_PER_BATCH, corpus_filename=batch_filename)

    # Upload to Snowflake in the background
    future = executor.submit(upload_corpus_to_snowflake, BASE_PATH, CREDENTIALS_PATH, batch_filename)

    # Add callback for result handling
    def create_callback(filename):
        def handle_upload_result(fut):
            try:
                fut.result()
                print(f"Upload completed for {filename}")
            except Exception as e:
                print(f"UPLOAD FAILED for {filename}: {e}")
        return handle_upload_result

    future.add_done_callback(create_callback(batch_filename))
    futures.append(future)

# Wait for all uploads to complete
executor.shutdown(wait=True)
print("\n=== All scraping and uploads complete ===\n")


=== Starting batch 1 ===


=== Starting Local Forum Corpus Update ===

Existing thread_ids found. Last thread_id: 15260
Added 25 new thread_ids. Ending at 15285
Fetching data for 25 new threads...
Fetching thread 15261...
Found 1 posts in thread 15261
Fetching thread 15262...
Found 3 posts in thread 15262
Fetching thread 15263...
Found 1 posts in thread 15263
Fetching thread 15264...
Found 7 posts in thread 15264
Fetching thread 15265...
Found 1 posts in thread 15265
Fetching thread 15266...
Found 9 posts in thread 15266
Fetching thread 15267...
Found 3 posts in thread 15267
Fetching thread 15268...
Found 20 posts in thread 15268
Fetching thread 15269...
Found 2 posts in thread 15269
Fetching thread 15270...
Found 4 posts in thread 15270
Fetching thread 15271...
Found 2 posts in thread 15271
Fetching thread 15272...
Found 20 posts in thread 15272
Fetching thread 15273...
Found 3 posts in thread 15273
Fetching thread 15274...
Found 1 posts in thread 15274
Fetching thread 15275...
Found