<a href="https://colab.research.google.com/github/davidelgas/DataSciencePortfolio/blob/main/Language_Models/NLP_Corpus_Development.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1 Project Description






The project leverages user-generated content from a domain-specific online forum as the training corpus. This data is largely unstructured, with minimal metadata available. The following tools were considered to gather the source text for the corpus:


### Web Scraping
- **Tools:** Beautiful Soup, online SaaS products
    - **Pros:**
        - **Direct Access to Targeted Data:** Enables precise extraction of user-generated content from specific sections or threads within the forum.
        - **Efficiency in Data Collection:** Automated scripts can gather large volumes of data in a short amount of time, making it suitable for assembling significant datasets for NLP.
    - **Cons:**
        - **Potential for Incomplete Data:** May miss embedded content or dynamically loaded data, depending on the website’s structure.
        - **Ethical and Legal Considerations:** Scraping data from forums may raise concerns about user privacy and must adhere to the terms of service of the website.
        - **Highly Platform-Dependent:** Forum-specific solutions produce forum-specific data schemas that must be reverse engineered for successful text extraction.
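
To illustrate the platform-dependence point, the sketch below pulls post text out of a small HTML fragment by matching a forum-specific class name. It uses Python's standard-library `html.parser` rather than Beautiful Soup so that it runs without extra installs. The `PostTextExtractor` class and the sample markup are hypothetical; `bbWrapper` is the class name this notebook targets later.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collect the text inside every <div> carrying a given class."""

    def __init__(self, target_class: str = "bbWrapper"):
        super().__init__()
        self.target_class = target_class
        self.depth = 0    # > 0 while inside a matching <div>
        self.posts = []   # one string per matching <div>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1          # nested <div> inside a post
        elif self.target_class in classes:
            self.depth = 1
            self.posts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.posts[-1] += data


# Hypothetical markup, shaped like the forum pages scraped in this notebook.
sample_html = """
<article class="message message--post">
  <div class="bbWrapper">First post <b>text</b></div>
</article>
<article class="message message--post">
  <div class="bbWrapper">A reply</div>
</article>
"""

parser = PostTextExtractor()
parser.feed(sample_html)
parser.close()
posts = [p.strip() for p in parser.posts]
print(posts)
```

If the forum software renames that class, the extractor returns an empty list without raising, which is why scraped output needs a sanity check after any site update.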

### Forum-specific APIs
- **Tools:** Python (`requests` library for API calls and `json` library for handling responses)
    - **Pros:**
        - **Structured and Reliable Data Retrieval:** APIs provide structured data, making it easier to process and integrate into your project.
        - **Efficient and Direct Access:** Directly accessing the forum's data through its API is efficient, bypassing the need for HTML parsing.
        - **Compliance and Ethical Data Use:** Using the official API generally keeps data access within the forum's data-use policies and user agreements.
    - **Cons:**
        - **Rate Limiting:** APIs often have limitations on the number of requests that can be made in a certain timeframe, which could slow down data collection.
        - **API Changes:** Dependence on the forum's API structure means that changes or deprecation could disrupt your data collection pipeline.
        - **Access Restrictions:** Some data or functionalities might be restricted or require authentication, posing additional challenges for comprehensive data collection.
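
The rate-limiting concern above is commonly handled with retries and exponential backoff. The following is a minimal sketch, assuming a hypothetical zero-argument `fetch` callable that returns a status code and a body. `fetch_with_backoff`, its parameters, and the choice to retry only on HTTP 429 are illustrative and not part of this notebook.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch` until it succeeds, doubling the wait after each 429.

    `fetch` is a hypothetical callable returning (status_code, body);
    in practice it might wrap a call to requests.get.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status != 429 or attempt == max_retries:
            raise RuntimeError(f"Request failed with status {status}")
        # Wait 1x, 2x, 4x, ... the base delay before trying again.
        sleep(base_delay * (2 ** attempt))
    raise RuntimeError("retry loop exited unexpectedly")
```

Passing `sleep` as a parameter keeps the function testable without real waiting.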


**Conclusion: I will be using Beautiful Soup to create my corpus.**


# 2 Create Environment

In [1]:
# Access to Google Drive
# This seems to propagate credentials better from its own cell

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
#Packages and libraries

!pip install snowflake

import os
import time
import requests
import pandas as pd
import concurrent.futures
import snowflake.connector

from bs4 import BeautifulSoup

# --- Settings ---
base_path = '/content/drive/Othercomputers/My Mac/Git/Language_Models/datasets/e9'
credentials_path = '/content/drive/Othercomputers/My Mac/Git/credentials/snowflake_credentials.txt'


Collecting snowflake
  Downloading snowflake-1.4.0-py3-none-any.whl.metadata (2.0 kB)
Collecting snowflake-core==1.4.0 (from snowflake)
  Downloading snowflake_core-1.4.0-py3-none-any.whl.metadata (2.0 kB)
Collecting snowflake-legacy (from snowflake)
  Downloading snowflake_legacy-1.0.0-py3-none-any.whl.metadata (2.5 kB)
Collecting snowflake-connector-python (from snowflake-core==1.4.0->snowflake)
  Downloading snowflake_connector_python-3.15.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (70 kB)
Collecting asn1crypto<2.0.0,>0.24.0 (from snowflake-connector-python->snowflake-core==1.4.0->snowflake)
  Downloading asn1crypto-1.5.1-py2.py3-none-any.whl.metadata (13 kB)
Collecting boto3>=1.24 (from snowflake-connector-python->snowflake-core==1.4.0->snowflake)
  Downloading boto3-1.38.5-py3-none-any.whl.metadata (6.6 kB)
Collecting botocore>=1.24 (f

# 3 Data Collection


In [3]:
# --- Utility Functions ---

def load_credentials(path_to_credentials):
    if not os.path.exists(path_to_credentials):
        raise FileNotFoundError(f"Credentials file not found: {path_to_credentials}")
    with open(path_to_credentials, 'r') as file:
        for line_num, line in enumerate(file, start=1):
            line = line.strip()
            if line and '=' in line:
                key, value = line.split('=', 1)
                # Strip whitespace so a line like "USER = name" still sets USER
                os.environ[key.strip()] = value.strip()
            else:
                print(f"Skipping invalid line {line_num}: {line}")
    for var in ['USER', 'PASSWORD', 'ACCOUNT']:
        if not os.environ.get(var):
            raise EnvironmentError(f"Missing environment variable: {var}")

def connect_to_snowflake():
    try:
        conn = snowflake.connector.connect(
            user=os.environ.get('USER'),
            password=os.environ.get('PASSWORD'),
            account=os.environ.get('ACCOUNT')
        )
        print(f"Connected to Snowflake account: {os.environ.get('ACCOUNT')}")
        return conn
    except Exception as e:
        raise ConnectionError(f"Failed to connect to Snowflake: {e}") from e

def create_db_schema_table(cur):
    try:
        cur.execute("CREATE DATABASE IF NOT EXISTS e9_corpus")
        cur.execute("USE DATABASE e9_corpus")
        cur.execute("CREATE SCHEMA IF NOT EXISTS e9_corpus_schema")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS e9_corpus.e9_corpus_schema.e9_forum_corpus (
                THREAD_ID NUMBER(38,0) PRIMARY KEY,
                THREAD_TITLE STRING,
                THREAD_FIRST_POST STRING,
                THREAD_ALL_POSTS STRING
            )
        """)
        print("Database, schema, and table checked/created.")
    except Exception as e:
        print(f"Error creating database/schema/table: {e}")

def fetch_existing_thread_ids(cur):
    query = "SELECT THREAD_ID FROM e9_corpus.e9_corpus_schema.e9_forum_corpus"
    try:
        cur.execute(query)
        result = cur.fetchall()
        return set(row[0] for row in result)
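    except Exception as e:
        # Re-raise: returning an empty set here would make every thread look
        # new and re-insert the whole corpus as duplicates.
        print(f"Error fetching existing thread IDs: {e}")
        raise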

def insert_missing_data(cur, df, existing_thread_ids):
    """Insert only new data into Snowflake, skipping already loaded threads."""
    if df.empty:
        print("No data to insert.")
        return

    print(f"Original DataFrame has {len(df)} rows.")
    df.columns = [col.upper() for col in df.columns]

    # Filter only missing threads
    new_df = df[~df['THREAD_ID'].isin(existing_thread_ids)]
    print(f"{len(new_df)} new threads will be inserted into Snowflake.")

    if new_df.empty:
        print("No new threads to insert.")
        return

    # Replace NaN values with None
    new_df = new_df.where(pd.notnull(new_df), None)

    insert_query = """
    INSERT INTO e9_corpus.e9_corpus_schema.e9_forum_corpus
    (THREAD_ID, THREAD_TITLE, THREAD_FIRST_POST, THREAD_ALL_POSTS)
    VALUES (%s, %s, %s, %s)
    """

    # Create a list of tuples
    rows_to_insert = [
        (
            row['THREAD_ID'],
            row['THREAD_TITLE'],
            row['THREAD_FIRST_POST'],
            row['THREAD_ALL_POSTS']
        )
        for _, row in new_df.iterrows()
    ]

    # Batch insert all rows at once
    cur.executemany(insert_query, rows_to_insert)

def upload_corpus_to_snowflake(base_path: str, credentials_path: str, filename: str):
    file_path = os.path.join(base_path, filename)
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Corpus file not found: {file_path}")

    forum_corpus_df = pd.read_csv(file_path)
    print(f"Loaded {len(forum_corpus_df)} rows from {file_path} to upload.")

    load_credentials(credentials_path)
    conn = connect_to_snowflake()
    cur = conn.cursor()

    try:
        create_db_schema_table(cur)
        existing_thread_ids = fetch_existing_thread_ids(cur)
        print(f"Snowflake already has {len(existing_thread_ids)} threads.")
        insert_missing_data(cur, forum_corpus_df, existing_thread_ids)
        conn.commit()
        print(f"Data from {filename} committed successfully.")

        cur.execute("SELECT COUNT(*) FROM e9_corpus.e9_corpus_schema.e9_forum_corpus")
        final_count = cur.fetchone()[0]
        print(f"Total threads now in Snowflake: {final_count}")

    except Exception as e:
        print(f"Error during upload: {e}")
        conn.rollback()
        raise
    finally:
        cur.close()
        conn.close()
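
Snowflake does not enforce `PRIMARY KEY` constraints on standard tables, so the duplicate check in `insert_missing_data` is what actually keeps `THREAD_ID` unique. The core of that check, without pandas or a database cursor, can be sketched as below. `select_new_rows` and the sample rows are hypothetical names used only for illustration, and the sketch additionally drops repeats within a batch, which the function above does not do.

```python
from typing import Iterable, List, Set, Tuple

# (thread_id, thread_title, thread_first_post, thread_all_posts)
Row = Tuple[int, str, str, str]


def select_new_rows(rows: Iterable[Row], existing_ids: Set[int]) -> List[Row]:
    """Return rows whose thread_id is not already stored, dropping repeats."""
    seen = set(existing_ids)
    new_rows = []
    for row in rows:
        thread_id = row[0]
        if thread_id in seen:
            continue
        seen.add(thread_id)  # also guards against duplicates inside the batch
        new_rows.append(row)
    return new_rows


# Hypothetical sample data.
rows = [
    (1, "Title A", "first", "first second"),
    (2, "Title B", "first", "first"),
    (2, "Title B", "first", "first"),
    (3, "Title C", "first", "first reply"),
]
new_rows = select_new_rows(rows, existing_ids={1})
print([r[0] for r in new_rows])
```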


In [4]:
df_e9_corpus = pd.read_csv('/content/drive/Othercomputers/My Mac/Git/Language_Models/datasets/e9/e9_forum_threads_decorated.csv')
df_e9_corpus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3407 entries, 0 to 3406
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   thread_id          3407 non-null   int64 
 1   thread_title       3407 non-null   object
 2   thread_first_post  3399 non-null   object
dtypes: int64(1), object(2)
memory usage: 80.0+ KB


# 4 Data Storage

In [5]:
def create_urls(base_path: str, filename: str = 'e9_forum_thread_ids.csv', threads: int = 1):
    file_path = os.path.join(base_path, filename)
    if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
        existing_ids = pd.read_csv(file_path)
        last_thread_id = int(existing_ids['thread_id'].iloc[-1])
        print(f"Existing thread_ids found. Last thread_id: {last_thread_id}")
    else:
        last_thread_id = 0
        print(f"No existing thread_ids. Starting from {last_thread_id}")

    new_ids = [{'thread_id': tid} for tid in range(last_thread_id + 1, last_thread_id + threads + 1)]
    new_thread_ids = pd.DataFrame(new_ids)
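    # Write the header when the file is missing or empty; testing existence alone skips it for an empty file
    write_header = not os.path.exists(file_path) or os.path.getsize(file_path) == 0
    new_thread_ids.to_csv(file_path, mode='a', header=write_header, index=False)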

    print(f"Added {threads} new thread_ids. Ending at {new_ids[-1]['thread_id']}")
    return new_thread_ids

def fetch_full_thread_data(df, base_path: str, posts_filename: str = 'e9_forum_posts.csv', decorated_filename: str = 'e9_forum_threads_decorated.csv'):
    posts_file = os.path.join(base_path, posts_filename)
    decorated_file = os.path.join(base_path, decorated_filename)

    existing_posts = pd.read_csv(posts_file) if os.path.exists(posts_file) else pd.DataFrame(columns=['thread_id', 'post_timestamp', 'post_raw'])
    existing_decorated = pd.read_csv(decorated_file) if os.path.exists(decorated_file) else pd.DataFrame(columns=['thread_id', 'thread_title', 'thread_first_post'])

    existing_thread_ids = set(existing_posts['thread_id'].tolist()) | set(existing_decorated['thread_id'].tolist())
    new_threads = df[~df['thread_id'].isin(existing_thread_ids)]

    post_data = []
    decorated_data = []

    for thread_id in new_threads['thread_id']:
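        # Only page 1 is requested, so posts on later pages of long threads are not collected.
        thread_url = f"https://e9coupe.com/forum/threads/{thread_id}/?page=1"
        try:
            print(f"Fetching thread {thread_id}...")
            # A timeout keeps one stalled request from hanging the whole batch
            response = requests.get(thread_url, timeout=30)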
            if response.status_code != 200:
                print(f"Error {response.status_code} fetching {thread_url}")
                continue

            soup = BeautifulSoup(response.text, 'html.parser')
            articles = soup.find_all('article', class_='message--post')
            if not articles:
                print(f"No posts found for thread {thread_id}. Skipping.")
                continue

            title_element = soup.find('title')
            thread_title = title_element.get_text().split('|')[0].strip() if title_element else "No Title"

            first_post_element = soup.find('article', class_='message-body')
            first_post = first_post_element.get_text(strip=True) if first_post_element else "No content"

            decorated_data.append({
                'thread_id': thread_id,
                'thread_title': thread_title,
                'thread_first_post': first_post
            })

            for article in articles:
                timestamp_element = article.find('time')
                content_element = article.find('div', class_='bbWrapper')
                post_data.append({
                    'thread_id': thread_id,
                    'post_timestamp': timestamp_element['datetime'] if timestamp_element else "N/A",
                    'post_raw': content_element.get_text(strip=True) if content_element else "No content"
                })

            time.sleep(1)

        except Exception as e:
            print(f"Error fetching thread {thread_id}: {e}")

    if post_data:
        new_posts_df = pd.DataFrame(post_data)
        combined_posts = pd.concat([existing_posts, new_posts_df], ignore_index=True)
        combined_posts.to_csv(posts_file, index=False)
        print(f"Saved {len(new_posts_df)} new posts. Total posts: {len(combined_posts)}")

    if decorated_data:
        new_decorated_df = pd.DataFrame(decorated_data)
        combined_decorated = pd.concat([existing_decorated, new_decorated_df], ignore_index=True)
        combined_decorated.to_csv(decorated_file, index=False)
        print(f"Saved {len(new_decorated_df)} new decorated threads. Total threads: {len(combined_decorated)}")

def create_forum_corpus(base_path: str, posts_filename: str = 'e9_forum_posts.csv', decorated_filename: str = 'e9_forum_threads_decorated.csv', corpus_filename: str = 'e9_forum_corpus.csv'):
    posts_file = os.path.join(base_path, posts_filename)
    decorated_file = os.path.join(base_path, decorated_filename)
    corpus_file = os.path.join(base_path, corpus_filename)

    posts_df = pd.read_csv(posts_file)
    decorated_df = pd.read_csv(decorated_file)

    aggregated = posts_df.groupby('thread_id')['post_raw'].agg(lambda x: ' '.join(str(i) for i in x)).reset_index()
    aggregated.rename(columns={'post_raw': 'thread_all_posts'}, inplace=True)

    decorated_df['thread_id'] = decorated_df['thread_id'].astype('int64')
    aggregated['thread_id'] = aggregated['thread_id'].astype('int64')

    decorated_df = decorated_df[decorated_df['thread_id'].isin(aggregated['thread_id'])]

    forum_corpus = pd.merge(decorated_df, aggregated, on='thread_id', how='inner')
    forum_corpus.to_csv(corpus_file, index=False)
    print(f"Saved corpus with {len(forum_corpus)} threads to {corpus_file}")

    return forum_corpus

def update_local_corpus(base_path: str, threads_to_add: int = 5, corpus_filename: str = 'e9_forum_corpus.csv'):
    print("\n=== Starting Local Forum Corpus Update ===\n")
    new_thread_ids = create_urls(base_path, threads=threads_to_add)
    fetch_full_thread_data(new_thread_ids, base_path)
    forum_corpus_df = create_forum_corpus(base_path, corpus_filename=corpus_filename)
    print("\n=== Local Forum Corpus Update Complete ===\n")
    return forum_corpus_df
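
`create_forum_corpus` concatenates every post of a thread into one string before joining it to the thread metadata. The same grouping step is shown below on plain Python data, using `collections.defaultdict` in place of the pandas `groupby`. The `aggregate_posts` helper and the sample posts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_posts(posts: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Join the text of all posts per thread_id, preserving post order."""
    grouped = defaultdict(list)
    for thread_id, post_raw in posts:
        # str() mirrors the notebook's handling of non-string values
        grouped[thread_id].append(str(post_raw))
    return {tid: " ".join(texts) for tid, texts in grouped.items()}


# Hypothetical sample posts as (thread_id, post_raw) pairs.
posts = [
    (4311, "Where can I find a trim clip?"),
    (4312, "Rust repair question"),
    (4311, "Try the parts supplier list."),
]
thread_all_posts = aggregate_posts(posts)
print(thread_all_posts[4311])
```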


# Orchestration

In [11]:
# ====== START BATCH SCRAPE + BACKGROUND UPLOAD ======

num_batches = 10
threads_per_batch = 10
max_workers_upload = 3

executor = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers_upload)

for batch_num in range(num_batches):
    print(f"\n=== Starting batch {batch_num + 1} ===\n")

    batch_filename = f"e9_forum_corpus_batch_{batch_num + 1}.csv"
    forum_corpus_df = update_local_corpus(base_path, threads_to_add=threads_per_batch, corpus_filename=batch_filename)

    future = executor.submit(upload_corpus_to_snowflake, base_path, credentials_path, batch_filename)

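    def handle_upload_result(fut, filename=batch_filename):
        # Bind the filename now; the loop variable will have moved on
        # by the time a background upload finishes.
        try:
            fut.result()
        except Exception as e:
            print(f"UPLOAD FAILED for {filename}: {e}")

    future.add_done_callback(handle_upload_result)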

executor.shutdown(wait=True)
print("\n=== All scraping and uploads complete ===\n")



=== Starting batch 1 ===


=== Starting Local Forum Corpus Update ===

Existing thread_ids found. Last thread_id: 4310
Added 10 new thread_ids. Ending at 4320
Fetching thread 4311...
Fetching thread 4312...
Fetching thread 4313...
Fetching thread 4314...
Fetching thread 4315...
Fetching thread 4316...
Fetching thread 4317...
Fetching thread 4318...
Fetching thread 4319...
Fetching thread 4320...
Saved 56 new posts. Total posts: 23331
Saved 10 new decorated threads. Total threads: 4317
Saved corpus with 4317 threads to /content/drive/Othercomputers/My Mac/Git/Language_Models/datasets/e9/e9_forum_corpus_batch_1.csv

=== Local Forum Corpus Update Complete ===


=== Starting batch 2 ===


=== Starting Local Forum Corpus Update ===

Existing thread_ids found. Last thread_id: 4320
Added 10 new thread_ids. Ending at 4330
Loaded 4317 rows from /content/drive/Othercomputers/My Mac/Git/Language_Models/datasets/e9/e9_forum_corpus_batch_1.csv to upload.
Fetching thread 4321...
Connected to Snowfl
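
As the interleaved output shows, uploads run in the background while the next batch is scraped. A done-callback defined inside a loop reads loop variables when it runs, not when it is defined, so a late-running callback can report the wrong batch name. The sketch below shows one way to pin the value at submission time with `functools.partial`. `fake_upload` and `report` are hypothetical stand-ins for the upload function and its callback.

```python
import concurrent.futures
from functools import partial

failures = []


def fake_upload(name: str) -> None:
    """Hypothetical upload that fails for one specific file."""
    if name == "batch_2.csv":
        raise RuntimeError("simulated upload error")


def report(name: str, fut: concurrent.futures.Future) -> None:
    """Record the file name that was bound when the task was submitted."""
    if fut.exception() is not None:
        failures.append(name)


with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    for batch_num in range(1, 4):
        batch_filename = f"batch_{batch_num}.csv"
        future = executor.submit(fake_upload, batch_filename)
        # partial() freezes batch_filename for this particular future.
        future.add_done_callback(partial(report, batch_filename))

print(failures)
```

Leaving the `with` block waits for all submitted tasks, so `failures` is complete by the time it is printed.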

# OLDER CODE

In [None]:
import os

# Set base paths
base_path = '/content/drive/Othercomputers/My Mac/Git/Language_Models/datasets/e9'
credentials_path = '/content/drive/Othercomputers/My Mac/Git/credentials/snowflake_credentials.txt'

# Make sure base path exists
if not os.path.exists(base_path):
    raise FileNotFoundError(f"Base path does not exist: {base_path}")

if not os.path.exists(credentials_path):
    raise FileNotFoundError(f"Credentials file does not exist: {credentials_path}")

print(f"Base path set to: {base_path}")
print(f"Credentials path set to: {credentials_path}")


Base path set to: /content/drive/Othercomputers/My Mac/Git/Language_Models/datasets/e9
Credentials path set to: /content/drive/Othercomputers/My Mac/Git/credentials/snowflake_credentials.txt


In [None]:
import os
import pandas as pd

def create_urls(base_path: str, filename: str = 'e9_forum_thread_ids.csv', threads: int = 1):
    """
    Create thread_id entries for URLs and append them to a CSV file.

    Args:
        base_path (str): Directory where the CSV file is located.
        filename (str): Name of the CSV file. Defaults to 'e9_forum_thread_ids.csv'.
        threads (int): Number of new thread_ids to add. Defaults to 1.

    Returns:
        last_thread_id (int): The last existing thread_id before adding new ones.
        last_thread_id_processed (int): The last thread_id after adding new ones.
        new_thread_ids (DataFrame): DataFrame containing the newly added thread_ids.
    """
    file_path = os.path.join(base_path, filename)

    # Check for existing file and get the last thread_id
    if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
        existing_ids = pd.read_csv(file_path)
        last_thread_id = int(existing_ids['thread_id'].iloc[-1])
        print(f"Existing file found. Last thread_id: {last_thread_id}")
    else:
        last_thread_id = 0
        print(f"No existing file found. Starting from thread_id: {last_thread_id}")

    # Generate new thread_ids
    new_ids = [{'thread_id': tid} for tid in range(last_thread_id + 1, last_thread_id + threads + 1)]

    new_thread_ids = pd.DataFrame(new_ids)

    # Append new thread_ids to the CSV
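    # Write the header when the file is missing or empty; testing existence alone skips it for an empty file
    write_header = not os.path.exists(file_path) or os.path.getsize(file_path) == 0
    new_thread_ids.to_csv(file_path, mode='a', header=write_header, index=False)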

    # Info messages
    print(f"Adding {threads} new threads.")
    print(f"Ending at thread_id: {new_ids[-1]['thread_id']}")

    return last_thread_id, new_ids[-1]['thread_id'], new_thread_ids


last_id, last_processed_id, new_threads_df = create_urls(base_path, threads=5)

Existing file found. Last thread_id: 5
Adding 5 new threads.
Ending at thread_id: 10


In [None]:
import os
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

def fetch_thread_data(df, base_path: str, filename: str = 'e9_forum_threads.csv'):
    """
    Ingest a DataFrame of thread_ids and fetch thread titles and URLs.
    Saves results to a CSV file, ensuring only unique thread_ids are added.

    Args:
        df (DataFrame): DataFrame containing thread_ids to process.
        base_path (str): Directory where the output CSV is located.
        filename (str): Name of the output CSV file. Defaults to 'e9_forum_threads.csv'.

    Returns:
        DataFrame: Updated DataFrame with all thread data (old + new).
    """
    file_path = os.path.join(base_path, filename)

    # Load existing data if it exists
    existing_thread_ids = set()
    last_thread_id = 0

    if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
        try:
            existing_df = pd.read_csv(file_path)
            existing_thread_ids = set(existing_df['thread_id'].tolist())
            if not existing_df.empty:
                last_thread_id = max(existing_df['thread_id'])
            print(f"Loaded {len(existing_thread_ids)} existing thread IDs from {file_path}")
            print(f"Last thread ID in existing file: {last_thread_id}")
        except Exception as e:
            print(f"Warning: Could not load existing data: {e}")
    else:
        existing_df = pd.DataFrame(columns=['thread_id', 'thread_title', 'thread_url'])

    # Identify new thread IDs to process
    new_thread_ids = [thread_id for thread_id in df['thread_id'] if thread_id not in existing_thread_ids]

    print(f"Found {len(new_thread_ids)} new thread IDs to process.")
    if new_thread_ids:
        print(f"New thread IDs: {new_thread_ids}")

    # Process new threads
    new_data = []
    for thread_id in new_thread_ids:
        thread_url = f"https://e9coupe.com/forum/threads/{thread_id}"
        page_url = f"{thread_url}/?page=1"  # Only processing page 1

        try:
            print(f"Fetching data for thread {thread_id}...")
            response = requests.get(page_url)

            if response.status_code != 200:
                print(f"Error: Got status code {response.status_code} for {page_url}")
                continue

            soup = BeautifulSoup(response.text, 'html.parser')
            title_element = soup.find('title')

            if title_element:
                title = title_element.get_text()
                thread_title = title.split('|')[0].strip()

                new_data.append({
                    'thread_id': thread_id,
                    'thread_title': thread_title,
                    'thread_url': page_url
                })

                print(f"Found thread {thread_id}: '{thread_title}'")
            else:
                print(f"Warning: No title found for thread {thread_id}")

            # Be nice to server
            time.sleep(1)

        except Exception as e:
            print(f"Error processing thread {thread_id}: {e}")

    # Save new data
    if new_data:
        new_df = pd.DataFrame(new_data)

        # Append or create the file
        if os.path.exists(file_path):
            new_df.to_csv(file_path, mode='a', header=False, index=False)
        else:
            new_df.to_csv(file_path, index=False)

        print(f"Added {len(new_data)} new threads to {file_path}")
        new_ids = [item['thread_id'] for item in new_data]
        print(f"New thread IDs added: {new_ids}")
    else:
        print("No new threads to add.")

    # Return updated dataset
    if os.path.exists(file_path):
        return pd.read_csv(file_path)
    else:
        return pd.DataFrame(columns=['thread_id', 'thread_title', 'thread_url'])

all_threads_df = fetch_thread_data(new_threads_df, base_path)

Loaded 10 existing thread IDs from /content/drive/Othercomputers/My Mac/Git/Language_Models/datasets/e9/e9_forum_threads.csv
Last thread ID in existing file: 10
Found 0 new thread IDs to process.
No new threads to add.


In [None]:
import os
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

def fetch_first_post_content(df, base_path: str, filename: str = 'e9_forum_threads_decorated.csv'):
    """
    Fetch the first post content for each thread and save to a CSV,
    ensuring no duplicate processing of already existing threads.

    Args:
        df (DataFrame): DataFrame containing thread_id, thread_url, and thread_title.
        base_path (str): Directory where the output CSV will be saved.
        filename (str): Name of the output CSV file. Defaults to 'e9_forum_threads_decorated.csv'.

    Returns:
        DataFrame: Updated DataFrame with thread_id, thread_title, and first post content.
    """
    file_path = os.path.join(base_path, filename)

    # Step 1: Load existing data if available
    existing_thread_ids = set()
    if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
        existing_df = pd.read_csv(file_path)
        existing_thread_ids = set(existing_df['thread_id'].tolist())
        print(f"Loaded {len(existing_thread_ids)} existing thread IDs from {file_path}")
    else:
        existing_df = pd.DataFrame(columns=['thread_id', 'thread_title', 'thread_first_post'])

    # Step 2: Identify new threads to process
    new_threads = df[~df['thread_id'].isin(existing_thread_ids)]

    print(f"Found {len(new_threads)} new threads to fetch first posts.")

    if new_threads.empty:
        print("No new threads to process.")
        return existing_df

    # Step 3: Fetch first posts
    data = []

    for thread_id, thread_url, thread_title in zip(new_threads['thread_id'], new_threads['thread_url'], new_threads['thread_title']):
        try:
            print(f"Fetching first post for thread {thread_id}...")
            response = requests.get(thread_url)

            if response.status_code != 200:
                print(f"Error: Got status code {response.status_code} for {thread_url}")
                post_content = "Failed to fetch content"
            else:
                soup = BeautifulSoup(response.text, 'html.parser')
                first_post = soup.find('article', class_='message-body')

                if first_post:
                    post_content = first_post.get_text(strip=True)
                else:
                    post_content = "No content found"

        except Exception as e:
            print(f"Error fetching thread {thread_id}: {e}")
            post_content = "Error fetching content"

        data.append({
            'thread_id': thread_id,
            'thread_title': thread_title,
            'thread_first_post': post_content
        })

        # Be kind to the server
        time.sleep(1)

    # Step 4: Save combined results
    if data:
        new_df = pd.DataFrame(data)
        combined_df = pd.concat([existing_df, new_df], ignore_index=True)
        combined_df.to_csv(file_path, index=False)
        print(f"Saved updated decorated thread data to {file_path}")
    else:
        print("No new data fetched.")
        combined_df = existing_df

    return combined_df

decorated_threads_df = fetch_first_post_content(all_threads_df, base_path)

Loaded 10 existing thread IDs from /content/drive/Othercomputers/My Mac/Git/Language_Models/datasets/e9/e9_forum_threads_decorated.csv
Found 0 new threads to fetch first posts.
No new threads to process.


In [None]:
import os
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

def fetch_and_parse_thread(df, base_path: str, filename: str = 'e9_forum_posts.csv'):
    """
    Fetch all posts from threads that haven't been processed yet and save to a CSV.

    Args:
        df (DataFrame): DataFrame containing thread_id and thread_url.
        base_path (str): Directory where the output CSV will be saved.
        filename (str): Name of the output CSV file. Defaults to 'e9_forum_posts.csv'.

    Returns:
        DataFrame: Updated DataFrame with thread_id, post_timestamp, and post_raw content.
    """
    file_path = os.path.join(base_path, filename)

    # Step 1: Load existing data if available
    existing_thread_ids = set()
    if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
        existing_posts = pd.read_csv(file_path)
        existing_thread_ids = set(existing_posts['thread_id'].tolist())
        print(f"Loaded {len(existing_thread_ids)} existing thread IDs from {file_path}")
    else:
        existing_posts = pd.DataFrame(columns=['thread_id', 'post_timestamp', 'post_raw'])

    # Step 2: Identify new threads to process
    new_threads = df[~df['thread_id'].isin(existing_thread_ids)]

    print(f"Found {len(new_threads)} new threads to fetch.")

    if new_threads.empty:
        print("No new threads to process.")
        return existing_posts  # Just return what already exists

    # Step 3: Fetch new thread posts
    post_data = []

    for index, row in new_threads.iterrows():
        thread_id = row['thread_id']
        thread_url = row['thread_url']

        try:
            print(f"Fetching posts for thread {thread_id}...")
            response = requests.get(thread_url)

            if response.status_code != 200:
                print(f"Error: Status code {response.status_code} for {thread_url}")
                continue

            soup = BeautifulSoup(response.text, 'html.parser')
            articles = soup.find_all('article', class_='message--post')

            for article in articles:
                timestamp_element = article.find('time')
                post_timestamp = timestamp_element['datetime'] if timestamp_element else 'N/A'

                content_element = article.find('div', class_='bbWrapper')
                post_content = content_element.get_text(strip=True) if content_element else 'No content found'

                post_data.append({
                    'thread_id': thread_id,
                    'post_timestamp': post_timestamp,
                    'post_raw': post_content
                })

            # Be kind to the server
            time.sleep(1)

        except Exception as e:
            print(f"Error processing thread {thread_id}: {e}")

    # Step 4: Save new posts
    if post_data:
        new_posts_df = pd.DataFrame(post_data)
        new_posts_df['post_raw'] = new_posts_df['post_raw'].astype(str)

        # Append to existing posts
        combined_posts = pd.concat([existing_posts, new_posts_df], ignore_index=True)
        combined_posts.to_csv(file_path, index=False)

        print(f"Added {len(new_posts_df)} new posts. Total posts now: {len(combined_posts)}")
    else:
        print("No new posts fetched.")
        combined_posts = existing_posts

    return combined_posts

all_posts_df = fetch_and_parse_thread(all_threads_df, base_path)

Loaded 10 existing thread IDs from /content/drive/Othercomputers/My Mac/Git/Language_Models/datasets/e9/e9_forum_posts.csv
Found 0 new threads to fetch.
No new threads to process.


In [None]:
import os
import pandas as pd

def create_forum_corpus(e9_forum_posts, e9_forum_threads_decorated, base_path: str, filename: str = 'e9_forum_corpus.csv'):
    """
    Create a final forum corpus combining thread metadata and all posts.

    Args:
        e9_forum_posts (DataFrame): DataFrame with all posts (thread_id, post_timestamp, post_raw).
        e9_forum_threads_decorated (DataFrame): DataFrame with thread_id, thread_title, and first_post.
        base_path (str): Directory where the output CSV will be saved.
        filename (str): Name of the output CSV file. Defaults to 'e9_forum_corpus.csv'.

    Returns:
        DataFrame: Final corpus DataFrame with thread_id, thread_title, first post, and all posts.
    """
    output_path = os.path.join(base_path, filename)

    # Group by thread_id and concatenate all posts
    # str() guards against NaN values that pandas reads back from empty CSV cells
    aggregated_data = e9_forum_posts.groupby('thread_id')['post_raw'].agg(lambda x: ' '.join(str(i) for i in x)).reset_index()

    # Rename column
    aggregated_data.rename(columns={'post_raw': 'thread_all_posts'}, inplace=True)

    # Ensure correct data types
    e9_forum_threads_decorated['thread_id'] = e9_forum_threads_decorated['thread_id'].astype('int64')
    aggregated_data['thread_id'] = aggregated_data['thread_id'].astype('int64')

    # Merge decorated thread info with all posts
    e9_forum_corpus = pd.merge(e9_forum_threads_decorated, aggregated_data, on='thread_id', how='left')

    # Save to CSV
    e9_forum_corpus.to_csv(output_path, index=False)

    # Report where the corpus was saved and how many threads it contains
    print(f"Saved forum corpus to {output_path}")
    print(f"Total threads in corpus: {len(e9_forum_corpus)}")

    return e9_forum_corpus

forum_corpus_df = create_forum_corpus(all_posts_df, decorated_threads_df, base_path)

Saved forum corpus to /content/drive/Othercomputers/My Mac/Git/Language_Models/datasets/e9/e9_forum_corpus.csv
Total threads in corpus: 10
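
To make the aggregation and merge in `create_forum_corpus` concrete, here is a small sketch on toy data. The values are invented for illustration, and sorting by `post_timestamp` before joining is an assumption added here; the notebook joins posts in their stored order.

```python
import pandas as pd

# Toy stand-ins for the posts and decorated-threads DataFrames (hypothetical values)
posts = pd.DataFrame({
    'thread_id': [1, 1, 2],
    'post_timestamp': ['2024-01-02', '2024-01-01', '2024-01-03'],
    'post_raw': ['second post', 'first post', 'only post'],
})
threads = pd.DataFrame({
    'thread_id': [1, 2, 3],
    'thread_title': ['A', 'B', 'C'],
})

# Join each thread's posts into one string, oldest first
all_posts = (
    posts.sort_values(['thread_id', 'post_timestamp'])
         .groupby('thread_id')['post_raw']
         .agg(lambda s: ' '.join(s))
         .reset_index()
         .rename(columns={'post_raw': 'thread_all_posts'})
)

# A left merge keeps every thread, including those without any posts
corpus = threads.merge(all_posts, on='thread_id', how='left')
print(corpus)
```

Thread 3 has no posts, so its `thread_all_posts` value is missing after the left merge. This is why the row count of the corpus matches the number of decorated threads rather than the number of threads with posts.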


In [None]:
def update_local_corpus_and_upload(base_path: str, credentials_path: str, threads_to_add: int = 5):
    """
    Master function to update the local forum corpus and upload to Snowflake.

    Args:
        base_path (str): Path to the local data folder.
        credentials_path (str): Path to Snowflake credentials file.
        threads_to_add (int): Number of new thread IDs to add.
    """
    print("\n=== Starting corpus update process ===\n")

    # Step 1: Create new thread IDs
    print("Step 1: Creating new thread IDs...")
    last_id, last_processed_id, new_threads_df = create_urls(base_path, threads=threads_to_add)

    # Step 2: Fetch basic thread data (titles, URLs)
    print("\nStep 2: Fetching thread titles and URLs...")
    all_threads_df = fetch_thread_data(new_threads_df, base_path)

    # Step 3: Fetch first post content
    print("\nStep 3: Fetching first post content...")
    decorated_threads_df = fetch_first_post_content(all_threads_df, base_path)

    # Step 4: Fetch all posts in threads
    print("\nStep 4: Fetching all posts...")
    all_posts_df = fetch_and_parse_thread(all_threads_df, base_path)

    # Step 5: Build final corpus
    print("\nStep 5: Building the final forum corpus...")
    forum_corpus_df = create_forum_corpus(all_posts_df, decorated_threads_df, base_path)

    # Step 6: Upload to Snowflake
    print("\nStep 6: Uploading forum corpus to Snowflake...")
    upload_corpus_to_snowflake(base_path, credentials_path)

    print("\n=== Corpus update and upload complete! ===\n")


# 4 Data Storage

In [None]:

import os

import pandas as pd
import snowflake.connector

# Load the e9_forum_corpus DataFrame from the CSV file
e9_forum_corpus = pd.read_csv(os.path.join(BASE_PATH, 'e9_forum_corpus_dirty.csv'))

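def load_credentials(path_to_credentials):
    """Load KEY=VALUE pairs from a credentials file into os.environ."""
    with open(path_to_credentials, 'r') as file:
        for line_num, line in enumerate(file, start=1):
            line = line.strip()
            if not line:
                continue  # Skip blank lines silently
            if '=' in line:
                # Split on the first '=' only, so values may themselves contain '='
                key, value = line.split('=', 1)
                os.environ[key.strip()] = value.strip()
            else:
                # Report only the line number so file contents are not echoed to the output
                print(f"Issue with line {line_num} in {path_to_credentials}: expected KEY=VALUE")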

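def connect_to_snowflake():
    """Open a Snowflake connection using the credentials loaded into os.environ."""
    # Note: USER is usually predefined by the shell; load_credentials() overwrites it
    return snowflake.connector.connect(
        user=os.environ.get('USER'),
        password=os.environ.get('PASSWORD'),
        account=os.environ.get('ACCOUNT')
    )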

def create_db_and_schema(cur):
    """Create the database and schema in Snowflake."""
    try:
        cur.execute("CREATE DATABASE IF NOT EXISTS e9_corpus")
        cur.execute("USE DATABASE e9_corpus")
        cur.execute("CREATE SCHEMA IF NOT EXISTS e9_corpus_schema")
        print("Database and schema created successfully.")
    except Exception as e:
        print(f"Error creating database and schema: {e}")

def create_table_if_not_exists(cur):
    """Create the e9_forum_corpus table if it does not exist."""
    try:
        cur.execute("""
        CREATE TABLE IF NOT EXISTS e9_corpus.e9_corpus_schema.e9_forum_corpus (
            THREAD_ID NUMBER(38,0),
            THREAD_TITLE VARCHAR(16777216),
            THREAD_FIRST_POST VARCHAR(16777216),
            THREAD_ALL_POSTS VARCHAR(16777216)
        )
        """)
        print("e9_forum_corpus table created successfully.")
    except Exception as e:
        print(f"Error creating table: {e}")

def insert_data_into_table(cur, df):
    """Insert data from the DataFrame into the e9_forum_corpus table."""
    # The statement is identical for every row, so build it once outside the loop
    insert_command = """
        INSERT INTO e9_corpus.e9_corpus_schema.e9_forum_corpus
        (THREAD_ID, THREAD_TITLE, THREAD_FIRST_POST, THREAD_ALL_POSTS)
        VALUES (%s, %s, %s, %s)
    """
    for _, row in df.iterrows():
        # Replace NaN with None so missing values are stored as NULL
        row = row.where(pd.notnull(row), None)
        try:
            cur.execute(insert_command, (
                row['THREAD_ID'], row['THREAD_TITLE'],
                row['THREAD_FIRST_POST'], row['THREAD_ALL_POSTS']
            ))
        except Exception as e:
            print(f"Error inserting data: {e}")

def fetch_data_from_table(cur):
    """Fetch all data from the e9_forum_corpus table."""
    query = "SELECT * FROM e9_corpus.e9_corpus_schema.e9_forum_corpus"
    cur.execute(query)
    return cur.fetch_pandas_all()

def main():
    # Load Snowflake credentials
    load_credentials(CREDENTIALS_PATH)

    # Connect to Snowflake
    conn = connect_to_snowflake()
    cur = conn.cursor()

    try:
        # Create the database, schema, and table if they don't exist
        create_db_and_schema(cur)
        create_table_if_not_exists(cur)

        # Insert data into the table
        insert_data_into_table(cur, e9_forum_corpus)
        conn.commit()
        print("Data inserted into e9_forum_corpus table.")

        # Fetch data from the table
        e9_forum_corpus_df = fetch_data_from_table(cur)
        e9_forum_corpus_df.info()
    finally:
        # Close cursor and connection even if a step fails or the run is interrupted
        cur.close()
        conn.close()

if __name__ == "__main__":
    main()


Database and schema created successfully.
e9_forum_corpus table created successfully.


KeyboardInterrupt: 
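
The credential loading above writes `KEY=VALUE` pairs into `os.environ` under generic names such as `USER`, which the shell typically already defines. One alternative, sketched here as an assumption rather than the notebook's method, is to parse the file into a plain dictionary and pass the values on explicitly. The helper name `parse_credentials`, the `#` comment handling, and the sample values are all hypothetical:

```python
import os
import tempfile

def parse_credentials(path):
    """Parse KEY=VALUE lines into a dict, ignoring blank lines and '#' comments."""
    creds = {}
    with open(path, 'r') as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith('#') or '=' not in line:
                continue
            # Split on the first '=' only, so values may contain '='
            key, value = line.split('=', 1)
            creds[key.strip()] = value.strip()
    return creds

# Write a throwaway credentials file with made-up values for the demonstration
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("USER=demo_user\nPASSWORD=p=ss=word\n\nACCOUNT=demo_account\n")
    tmp_path = tmp.name

creds = parse_credentials(tmp_path)
os.remove(tmp_path)
print(sorted(creds))
```

The resulting dictionary could then feed the connection call directly, for example `snowflake.connector.connect(user=creds['USER'], password=creds['PASSWORD'], account=creds['ACCOUNT'])`, without touching the process environment.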