<a href="https://colab.research.google.com/github/davidelgas/DataSciencePortfolio/blob/main/NLP_corpus_and_LDA/corpus/noteboooks/NLP_Corpus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Overview

The focus of this project is the creation of a corpus that will be used in several Natural Language Processing (NLP) efforts, including LDA, GRU, LSTM, and Transformer models.

## Corpus Creation

The corpus developed here was assembled by scraping a public forum dedicated to the BMW E9 automobile (www.e9coupe.com). This active forum has been in existence since 2003.
The code was written in Python in Google Colab notebooks and uses Beautiful Soup. Raw text was compiled and stored in a Snowflake database to support multiple NLP projects. Future ideas include supplementing the forum corpus with an existing user's guide specific to this car make and model.

### Create Environment

In [None]:
# This seems to propagate credentials better from its own cell

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip3 install requests
import requests

!pip3 install beautifulsoup4
from bs4 import BeautifulSoup

!pip3 install pandas
import pandas as pd

!pip3 install numpy
import numpy as np

!pip install snowflake-connector-python
import snowflake.connector

import re

import os

import logging

from transformers import BertTokenizer, BertModel, pipeline

import torch



## Create corpus

I'll be scraping posts from my classic car forum for the corpus. I'll be limiting data retrieval while I build the model so I don't impact the site for its users. I'll use Beautiful Soup where possible to parse the content into a dataframe structure. The admin of the forum has been notified that I am experimenting with ways to improve the online community.
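
One way to keep the load on the forum low is to space requests out in time. The sketch below is not part of the original notebook: `throttled`, `fake_fetch`, and the `example.invalid` URLs are hypothetical names used only for illustration, and the one-second default interval is an assumption rather than a limit agreed with the forum admin.

```python
import time


def throttled(fetch, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Wrap `fetch` so consecutive calls start at least `min_interval` seconds apart."""
    last_call = None

    def wrapper(*args, **kwargs):
        nonlocal last_call
        now = clock()
        if last_call is not None:
            remaining = min_interval - (now - last_call)
            if remaining > 0:
                sleep(remaining)
        last_call = clock()
        return fetch(*args, **kwargs)

    return wrapper


# Demonstration with a stand-in fetch function (no network access needed).
def fake_fetch(url):
    return f"<html><title>{url}</title></html>"


polite_fetch = throttled(fake_fetch, min_interval=0.05)
demo_pages = [polite_fetch(f"https://example.invalid/threads/{i}") for i in range(1, 4)]
print(len(demo_pages))
```

In the notebook itself, the same wrapper could be applied to `requests.get`, for example `polite_get = throttled(requests.get, min_interval=1.0)`, and `polite_get` used wherever a page is fetched.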

In [None]:
# Generate the list of thread_ids to scrape and parse
# There are currently approximately 15k threads

# Set the file path to save files
file_path = '/content/drive/MyDrive/Data_sets/e9/e9_forum_thread_ids.csv'

# Set the number of incremental thread_ids to process
threads = 500

# Check if the file exists and has content. If it does, update last_thread_id
if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
    e9_forum_thread_ids = pd.read_csv(file_path)
    last_thread_id = e9_forum_thread_ids['thread_id'].iloc[-1]
    last_thread_id = int(last_thread_id)  # Convert to integer

else:
    last_thread_id = 0

# Function to create URLs from the thread_ids
def create_urls(threads, last_thread_id):
    urls = []
    for thread_id in range(last_thread_id + 1, last_thread_id + threads + 1):
        urls.append({'thread_id': thread_id})
    return urls

urls = create_urls(threads, last_thread_id)

last_thread_id_processed = urls[-1]['thread_id']

# Convert the list of dictionaries into a DataFrame
e9_forum_thread_ids = pd.DataFrame(urls)

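# Save DataFrame to CSV file; write the header only when the file is new or empty
write_header = not (os.path.exists(file_path) and os.path.getsize(file_path) > 0)
e9_forum_thread_ids.to_csv(file_path, mode='a', header=write_header, index=False)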

print("Starting with thread_id " + str(last_thread_id))
print("Processing additional " + str(threads) + " threads")
print("Ending with thread_id " + str(last_thread_id_processed))

Starting with thread_id 15000
Processing additional 500 threads
Ending with thread_id 15500
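
The cell above resumes from the last `thread_id` stored in the CSV and appends the next batch. A minimal standalone sketch of that resume-and-append pattern follows, using only the standard library. The function name `next_thread_ids` and the temporary file are hypothetical; the notebook itself uses pandas for this step.

```python
import csv
import os
import tempfile


def next_thread_ids(path, batch_size):
    """Return the next batch of IDs after the last one recorded in `path`, and append them."""
    last_id = 0
    file_has_rows = os.path.exists(path) and os.path.getsize(path) > 0
    if file_has_rows:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                last_id = int(row["thread_id"])

    batch = list(range(last_id + 1, last_id + batch_size + 1))

    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if not file_has_rows:
            writer.writerow(["thread_id"])  # header only for a new or empty file
        writer.writerows([thread_id] for thread_id in batch)
    return batch


# Demonstration in a throwaway directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "thread_ids.csv")
    first = next_thread_ids(demo_path, 3)
    second = next_thread_ids(demo_path, 3)
    with open(demo_path, newline="") as f:
        header_count = sum(1 for line in f if line.strip() == "thread_id")

print(first, second, header_count)
```

Writing the header only once matters because a repeated header row would be read back as data the next time the file is loaded.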


In [None]:
# Generate the URL and title for each thread

pages = 1

def fetch_thread_data(df, pages=1):
    for index, row in df.iterrows():
        thread_id = row['thread_id']
        thread_url = f"https://e9coupe.com/forum/threads/{thread_id}"
        for i in range(1, pages + 1):
            page_url = f"{thread_url}/?page={i}"  # Construct the page URL
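            response = requests.get(page_url, timeout=30)
            soup = BeautifulSoup(response.text, 'html.parser')
            # Guard against pages that have no <title> element
            title_tag = soup.find('title')
            title = title_tag.get_text() if title_tag else ''
            thread_title = title.split('|')[0].strip()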
            df.at[index, 'thread_title'] = thread_title
            df.at[index, 'thread_url'] = page_url

    return df

# Fetch thread URLs and title
e9_forum_threads = fetch_thread_data(e9_forum_thread_ids)

# Export and save result
file_path = '/content/drive/MyDrive/Data_sets/e9/e9_forum_threads.csv'

# Write the header only when the file does not exist yet, so appended
# batches do not repeat the header row
header = not os.path.exists(file_path)

e9_forum_threads.to_csv(file_path, mode='a', header=header, index=False)

In [None]:
e9_forum_threads.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   thread_id     500 non-null    int64 
 1   thread_title  500 non-null    object
 2   thread_url    500 non-null    object
dtypes: int64(1), object(2)
memory usage: 11.8+ KB


In [None]:
# Extract the first post of each thread

import requests
from bs4 import BeautifulSoup
import pandas as pd

def fetch_first_post_content(df):
    data = []

    for thread_id, thread_url, thread_title in zip(df['thread_id'], df['thread_url'], df['thread_title']):
        response = requests.get(thread_url)
        soup = BeautifulSoup(response.text, 'html.parser')

        first_post = soup.find('article', class_='message-body')
        if first_post:
            post_content = first_post.get_text(strip=True)
        else:
            post_content = "No content found"  # Handle case where no post content is found

        data.append({'thread_id': thread_id, 'thread_title': thread_title, 'thread_first_post': post_content})

    return data

# Fetch first post content
data = fetch_first_post_content(e9_forum_threads)

# Convert to DataFrame
e9_forum_threads_decorated = pd.DataFrame(data)

# Export and save result
file_path = '/content/drive/MyDrive/Data_sets/e9/e9_forum_threads_decorated.csv'

header = not os.path.exists(file_path)

# Export and save result
e9_forum_threads_decorated.to_csv(file_path, mode='a', header=header, index=False)

In [None]:
# Find all posts associated with each thread

def fetch_and_parse_thread(df):
    post_data = []
    for index, row in df.iterrows():
        response = requests.get(row['thread_url'])
        soup = BeautifulSoup(response.text, 'html.parser')
        # Posts are matched by the 'message--post' class on <article> elements
        articles = soup.find_all('article', class_='message--post')
        for article in articles:
            # Use the post timestamp (rather than a post ID) to identify each post
            time_tag = article.find('time')
            post_timestamp = time_tag['datetime'] if time_tag else 'N/A'
            # Skip posts without a body wrapper instead of raising AttributeError
            body = article.find('div', class_='bbWrapper')
            if body is None:
                continue
            content = body.get_text(strip=True)

            post_data.append({
                'thread_id': row['thread_id'],
                'post_timestamp': post_timestamp,
                'post_raw': content
            })

    return post_data

# Fetch all thread post content
post_data = fetch_and_parse_thread(e9_forum_threads)

# Convert to DataFrame
e9_forum_posts = pd.DataFrame(post_data)

# Export and save result
file_path = '/content/drive/MyDrive/Data_sets/e9/e9_forum_posts.csv'

# Write the header only when the file does not exist yet
header = not os.path.exists(file_path)

e9_forum_posts.to_csv(file_path, mode='a', header=header, index=False)

In [None]:
e9_forum_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3682 entries, 0 to 3681
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   thread_id       3682 non-null   int64 
 1   post_timestamp  3682 non-null   object
 2   post_raw        3682 non-null   object
dtypes: int64(1), object(2)
memory usage: 86.4+ KB


In [None]:
# Create the corpus by aggregating all posts into one column
# and merging with the threads df

# Group by THREAD_ID and concatenate the POST_RAW values
aggregated_data = e9_forum_posts.groupby('thread_id')['post_raw'].agg(lambda x: ' '.join(x)).reset_index()

# Rename the column to indicate that it contains concatenated post content
aggregated_data.rename(columns={'post_raw': 'thread_all_posts'}, inplace=True)

# Cast 'thread_id' column to int64 in both DataFrames being merged
e9_forum_threads_decorated['thread_id'] = e9_forum_threads_decorated['thread_id'].astype('int64')
aggregated_data['thread_id'] = aggregated_data['thread_id'].astype('int64')

# Merge the two DataFrames
e9_forum_corpus = pd.merge(e9_forum_threads_decorated, aggregated_data, on='thread_id', how='left')

# Export and save result
e9_forum_corpus.to_csv('/content/drive/MyDrive/Data_sets/e9/e9_forum_corpus.csv', index=False)

In [None]:
e9_forum_corpus.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 0 to 499
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   thread_id          500 non-null    int64 
 1   thread_title       500 non-null    object
 2   thread_first_post  500 non-null    object
 3   thread_all_posts   491 non-null    object
dtypes: int64(1), object(3)
memory usage: 19.5+ KB
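
The `info()` output above reports 491 non-null `thread_all_posts` values for 500 threads. That is consistent with how a left merge behaves: every thread is kept, and threads with no rows in the posts table get a missing value. The plain-Python sketch below illustrates the same group-then-left-join logic on made-up data. It is not the pandas code used in the notebook, and the rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical miniature data standing in for the threads and posts tables.
threads = [
    {"thread_id": 1, "thread_title": "Carb tuning"},
    {"thread_id": 2, "thread_title": "Rust repair"},
    {"thread_id": 3, "thread_title": "Thread without posts"},
]
posts = [
    {"thread_id": 1, "post_raw": "Check the jets."},
    {"thread_id": 1, "post_raw": "Also the float level."},
    {"thread_id": 2, "post_raw": "Cut out the rot first."},
]

# Counterpart of groupby('thread_id')['post_raw'].agg(' '.join)
grouped = defaultdict(list)
for post in posts:
    grouped[post["thread_id"]].append(post["post_raw"])
all_posts = {thread_id: " ".join(texts) for thread_id, texts in grouped.items()}

# Counterpart of a left merge: every thread is kept, missing matches become None
corpus = [
    {**thread, "thread_all_posts": all_posts.get(thread["thread_id"])}
    for thread in threads
]

for row in corpus:
    print(row["thread_id"], row["thread_all_posts"])
```

Downstream steps that expect a string in `thread_all_posts` need to handle these missing values explicitly.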


## Load tables into Snowflake

In [None]:
# Create the db and schema

# Set the snowflake account and login information
path_to_credentials = '/content/drive/MyDrive/credentials/snowflake_credentials'


# Load the credentials
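with open(path_to_credentials, 'r') as file:
    for line in file:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        # Split on the first '=' only, so values containing '=' stay intact
        key, value = line.split('=', 1)
        os.environ[key] = value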

conn = snowflake.connector.connect(
    user=os.environ.get('USER'),
    password=os.environ.get('PASSWORD'),
    account=os.environ.get('ACCOUNT'),
)

# Create a cursor object
cur = conn.cursor()

# Create a database for the corpus and load the tables
try:
    # Create a new database
    cur.execute("CREATE DATABASE IF NOT EXISTS e9_corpus")

    # Use the new database
    cur.execute("USE DATABASE e9_corpus")

    # Create a new schema
    cur.execute("CREATE SCHEMA IF NOT EXISTS e9_corpus_schema")

    print("Database and schema created successfully.")
except Exception as e:
    print(e)

cur.close()

conn.close()

Database and schema created successfully.
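
The credentials file is read as `KEY=VALUE` lines. A small sketch of a parser that returns a dictionary instead of writing to `os.environ` is shown below. `parse_credentials` and the sample values are hypothetical, and the assumption that the file may contain blank lines or `#` comments is mine, not something the notebook states.

```python
import io


def parse_credentials(lines):
    """Parse KEY=VALUE lines into a dict, ignoring blank lines and '#' comments."""
    creds = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # partition splits on the first '=' only, so values may contain '='
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("Expected KEY=VALUE format")
        creds[key.strip()] = value.strip()
    return creds


# Hypothetical file contents; real values would come from the credentials file.
sample = io.StringIO("USER=demo_user\n\nPASSWORD=abc=123\nACCOUNT=xy12345\n")
creds = parse_credentials(sample)
print(sorted(creds))
```

Keeping the values in a dictionary avoids overwriting existing environment variables such as `USER`. The connection could then be opened with `user=creds['USER']`, `password=creds['PASSWORD']`, and `account=creds['ACCOUNT']`.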


In [None]:
# Clean the file

def clean_text(text):
    # Remove special characters and symbols using regex
    cleaned_text = re.sub(r'[^\w\s]', '', str(text))
    return cleaned_text

e9_forum_threads = e9_forum_threads.applymap(clean_text)

# Set the snowflake account and login information
path_to_credentials = '/content/drive/MyDrive/credentials/snowflake_credentials'

# Load the credentials
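with open(path_to_credentials, 'r') as file:
    for line in file:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        # Split on the first '=' only, so values containing '=' stay intact
        key, value = line.split('=', 1)
        os.environ[key] = value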

conn = snowflake.connector.connect(
    user=os.environ.get('USER'),
    password=os.environ.get('PASSWORD'),
    account=os.environ.get('ACCOUNT'),
)

# Create a cursor object
cur = conn.cursor()

# Check if the table exists
try:
    cur.execute("SELECT 1 FROM e9_corpus.e9_corpus_schema.e9_forum_corpus LIMIT 1")
    table_exists = True
except snowflake.connector.errors.ProgrammingError:
    table_exists = False

# If the table does not exist, create it
if not table_exists:
    try:
        cur.execute("""
        CREATE TABLE e9_corpus.e9_corpus_schema.e9_forum_corpus (
            thread_id NUMBER(38,0),
            thread_title VARCHAR(16777216),
            thread_first_post VARCHAR(16777216),
            thread_all_posts VARCHAR(16777216)
        )
        """)
        print("e9_forum_corpus table created successfully.")
    except Exception as e:
        print(e)

# Insert data into e9_forum_corpus table
for index, row in e9_forum_corpus.iterrows():

    row = row.where(pd.notnull(row), None)

    # Prepare the INSERT command with placeholders for the values
    insert_command = """
    INSERT INTO e9_corpus.e9_corpus_schema.e9_forum_corpus
    (thread_id, thread_title, thread_first_post, thread_all_posts)
    VALUES
    (%s, %s, %s, %s)
    """

    # Use the row values as parameters to safely insert the data
    cur.execute(insert_command, (row['thread_id'], row['thread_title'], row['thread_first_post'], row['thread_all_posts']))
    conn.commit()

print("Data inserted into e9_forum_corpus table.")

cur.close()
conn.close()


Data inserted into e9_forum_corpus table.
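
The cell above sends one `INSERT` and one commit per row. A possible alternative is to send rows in batches through the cursor's `executemany` method. The sketch below shows only the batching logic and runs against a stand-in cursor rather than a live Snowflake connection. `insert_in_batches`, `RecordingCursor`, and the sample rows are hypothetical, and the batch size is an arbitrary choice.

```python
def insert_in_batches(cur, sql, rows, batch_size=1000):
    """Send rows to `cur.executemany` in chunks; return the number of rows sent."""
    sent = 0
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        cur.executemany(sql, chunk)
        sent += len(chunk)
    return sent


class RecordingCursor:
    """Stand-in cursor that records calls, used here instead of a live connection."""

    def __init__(self):
        self.calls = []

    def executemany(self, sql, seq_of_params):
        self.calls.append((sql, list(seq_of_params)))


insert_sql = (
    "INSERT INTO e9_corpus.e9_corpus_schema.e9_forum_corpus "
    "(thread_id, thread_title, thread_first_post, thread_all_posts) "
    "VALUES (%s, %s, %s, %s)"
)
# Hypothetical rows; None stands for a missing value (SQL NULL).
demo_rows = [(i, f"title {i}", f"first post {i}", None) for i in range(1, 6)]

demo_cur = RecordingCursor()
total = insert_in_batches(demo_cur, insert_sql, demo_rows, batch_size=2)
print(total, [len(params) for _, params in demo_cur.calls])
```

With the real connection, the rows could be built from the DataFrame (for example with `itertuples(index=False, name=None)` after replacing missing values with `None`), followed by a single `conn.commit()` once all batches have been sent.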


In [None]:
# Confirm dataset in Snowflake

# Set the snowflake account and login information
path_to_credentials = '/content/drive/MyDrive/credentials/snowflake_credentials'

# Load the credentials
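with open(path_to_credentials, 'r') as file:
    for line in file:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        # Split on the first '=' only, so values containing '=' stay intact
        key, value = line.split('=', 1)
        os.environ[key] = value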

conn = snowflake.connector.connect(
    user=os.environ.get('USER'),
    password=os.environ.get('PASSWORD'),
    account=os.environ.get('ACCOUNT'),
)

# Create a cursor object
cur = conn.cursor()

# Select source data
query = """
SELECT * FROM "E9_CORPUS"."E9_CORPUS_SCHEMA"."E9_FORUM_CORPUS";
"""
cur.execute(query)

# Load data into a df.
e9_forum_corpus = cur.fetch_pandas_all()

# Release the cursor and connection, as in the other cells
cur.close()
conn.close()

e9_forum_corpus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15437 entries, 0 to 15436
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   THREAD_ID          15437 non-null  int16 
 1   THREAD_TITLE       15437 non-null  object
 2   THREAD_FIRST_POST  15437 non-null  object
 3   THREAD_ALL_POSTS   15153 non-null  object
dtypes: int16(1), object(3)
memory usage: 392.1+ KB


In [None]:
# This code cell will stop execution of subsequent cells

class StopExecution(Exception):
    def _render_traceback_(self):
        pass  # This will prevent the traceback from being shown

raise StopExecution("Execution stopped by user")

StopExecution: Execution stopped by user

In [None]:
# Clean threads df and add to snowflake

# Clean the file
def clean_text(text):

    cleaned_text = re.sub(r'[^\w\s]', '', str(text))
    return cleaned_text

# Apply the clean_text function to all columns in the DataFrame
e9_forum_threads = e9_forum_threads.applymap(clean_text)

# Set the snowflake account and login information
path_to_credentials = '/content/drive/MyDrive/credentials/snowflake_credentials'

# Load the credentials
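with open(path_to_credentials, 'r') as file:
    for line in file:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        # Split on the first '=' only, so values containing '=' stay intact
        key, value = line.split('=', 1)
        os.environ[key] = value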

conn = snowflake.connector.connect(
    user=os.environ.get('USER'),
    password=os.environ.get('PASSWORD'),
    account=os.environ.get('ACCOUNT'),
)

# Create a cursor object
cur = conn.cursor()

# Create the e9_forum_threads table
try:
  cur.execute("""
  CREATE OR REPLACE TABLE E9_CORPUS.e9_corpus_schema.e9_forum_threads (
    thread_id NUMBER(38,0),
    thread_title VARCHAR(16777216),
    thread_first_post VARCHAR(16777216)
  )
  """)

  # Insert data into e9_forum_threads table.
  # Placeholders let the connector handle quoting, as in the corpus insert above
  insert_command = """
      INSERT INTO E9_CORPUS.e9_corpus_schema.e9_forum_threads
      (thread_id, thread_title, thread_first_post)
      VALUES
      (%s, %s, %s)
      """
  for index, row in e9_forum_threads.iterrows():
      cur.execute(insert_command, (int(row['thread_id']), row['thread_title'], row['thread_first_post']))
  conn.commit()

  print("e9_forum_threads created successfully.")
except Exception as e:
    print(e)

cur.close()
conn.close()

In [None]:
# Clean posts df and add to snowflake

# Clean the file
def clean_text(text):
    # Remove special characters and symbols using regex
    cleaned_text = re.sub(r'[^\w\s]', '', str(text))
    return cleaned_text

# Apply clean_text to the post text only; cleaning every column would also
# strip the '-' and ':' separators from post_timestamp
e9_forum_posts['post_raw'] = e9_forum_posts['post_raw'].map(clean_text)


# Set the snowflake account and login information
path_to_credentials = '/content/drive/MyDrive/credentials/snowflake_credentials'

# Load the credentials
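with open(path_to_credentials, 'r') as file:
    for line in file:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        # Split on the first '=' only, so values containing '=' stay intact
        key, value = line.split('=', 1)
        os.environ[key] = value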

conn = snowflake.connector.connect(
    user=os.environ.get('USER'),
    password=os.environ.get('PASSWORD'),
    account=os.environ.get('ACCOUNT'),
)

# Create a cursor object
cur = conn.cursor()

# Create the e9_forum_posts table
try:
  cur.execute("""
  CREATE OR REPLACE TABLE E9_CORPUS.e9_corpus_schema.e9_forum_posts (
    thread_id NUMBER(38,0),
    post_timestamp VARCHAR(16777216),
    post_raw VARCHAR(16777216)
  )
  """)

  # Insert data into e9_forum_posts table.
  # Placeholders let the connector handle quoting, as in the corpus insert above
  insert_command = """
      INSERT INTO E9_CORPUS.e9_corpus_schema.e9_forum_posts
      (thread_id, post_timestamp, post_raw)
      VALUES
      (%s, %s, %s)
      """
  for index, row in e9_forum_posts.iterrows():
      cur.execute(insert_command, (int(row['thread_id']), row['post_timestamp'], row['post_raw']))
  conn.commit()

  print("e9_forum_posts created successfully.")
except Exception as e:
    print(e)

cur.close()
conn.close()

In [None]:
e9_forum_posts.info()


## Parking Lot

## Create another corpus from user manual.

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW  # The AdamW shipped with transformers is deprecated in favor of PyTorch's
import torch
from sklearn.model_selection import train_test_split


# Read in the manual
with open("/content/drive/MyDrive/Data_sets/e9_manual.txt", "r") as file:
    manual_raw = file.read()

# Initialize tokenizer and model
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained(model_name)

# Load and prepare dataset
file_path = '/content/drive/MyDrive/e9/nlp/df_table_3.csv'
#file_path = '/content/drive/MyDrive/Data_sets/e9_manual.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    lines = file.readlines()  # Assume each line is a separate data entry

# Tokenize each line separately to treat each as a sample
input_ids = []
attention_masks = []
for line in lines:
    tokens = tokenizer(line, max_length=1024, truncation=True, padding="max_length", return_tensors="pt")
    input_ids.append(tokens['input_ids'])
    attention_masks.append(tokens['attention_mask'])

# Convert lists to tensors
input_ids = torch.cat(input_ids)
attention_masks = torch.cat(attention_masks)

# Split the data into train and validation sets
train_inputs, val_inputs, train_masks, val_masks = train_test_split(input_ids, attention_masks, test_size=0.1, random_state=42)

class TextDataset(Dataset):
    def __init__(self, input_ids, masks):
        self.input_ids = input_ids
        self.masks = masks

    def __len__(self):
        return self.input_ids.size(0)

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.masks[idx]
        }

# Create DataLoader for train and validation datasets
train_dataset = TextDataset(train_inputs, train_masks)
val_dataset = TextDataset(val_inputs, val_masks)

train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)  # Reduced batch size
val_loader = DataLoader(val_dataset, batch_size=1)

# Prepare for training
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

# Define the training and evaluation loop
epochs = 4

for epoch in range(epochs):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100  # Exclude padding positions from the loss
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

    # Validation phase
    model.eval()
    total_val_loss = 0
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = input_ids.clone()
            labels[attention_mask == 0] = -100  # Exclude padding positions from the loss
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            total_val_loss += outputs.loss.item()

    avg_val_loss = total_val_loss / len(val_loader)
    print(f'Epoch {epoch}, Validation Loss: {avg_val_loss}')

# Save the fine-tuned model and tokenizer
model_path = '/content/drive/MyDrive/e9/pytorch'
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)


# Below is code from an older effort

In [None]:
# Original effort
# Create list of URLs to scrape and parse

threads = 100
urls = []

def create_urls(threads, page_number=1):
    base_url = 'https://e9coupe.com/forum/threads/'
    # Iterate over thread IDs to generate URLs
    for thread_id in range(1, threads + 1):
        thread_url = f"{base_url}{thread_id}"
        urls.append({'thread_id': thread_id, 'thread_url': thread_url})

create_urls(threads)  # Using the 'threads' variable

# Convert the list of dictionaries into a DataFrame
e9_forum_urls = pd.DataFrame(urls)

# Display the resulting DataFrame
e9_forum_urls.head()

# Export and save result
e9_forum_urls.to_csv('/content/drive/MyDrive/Data_sets/e9/e9_forum_urls.csv', index=False)

In [None]:
# Original effort

# Process URLs into a dataframe of thread ids and thread titles. This will be
# the core table while I decorate the dataframe with additional metadata.

# Each root URL can contain multiple pages. To limit any potential
# impact on this production data source, the data here is limited to the
# first page of each URL, which contains as many as 20 individual member posts.

pages = 1

data = []

df_threads = pd.DataFrame()

def fetch_thread_data(df):
    for index, row in df.iterrows():
        thread_id = row['thread_id']
        url = row['thread_url']
        for i in range(1, pages + 1):
            page_url = f"{url}/?page={i}"  # Construct the page URL
            response = requests.get(page_url)
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'html.parser')
                title = soup.find('title').get_text()
                thread_title = title.split('|')[0].strip()  # Extract the thread title

                data.append({'thread_id': thread_id, 'thread_title': thread_title, 'thread_url': page_url})

    return data

# Fetch thread data
data = fetch_thread_data(e9_forum_urls)

# Convert the list of dictionaries into a DataFrame
e9_forum_threads = pd.DataFrame(data)

e9_forum_threads['thread_id'] = e9_forum_threads['thread_id'].astype(int)

# Display the resulting DataFrame
e9_forum_threads.head()


In [None]:
# Extract the initial content posted when the thread was created.
# This text will be used to create a short description.

def fetch_first_post_content(df):
    data = []

    for thread_id, thread_url in zip(df['thread_id'], df['thread_url']): # ensures pairing
        response = requests.get(thread_url)
        soup = BeautifulSoup(response.text, 'html.parser')

        first_post = soup.find('article', class_='message-body')
        # Fall back to a placeholder when the page has no post body
        post_content = first_post.get_text(strip=True) if first_post else "No content found"

        data.append({'thread_id': thread_id, 'thread_first_post': post_content})

    return data

# Fetch first post content and convert to DataFrame
data = fetch_first_post_content(e9_forum_threads)
first_post_df = pd.DataFrame(data)

# Casting the values I want to join on
e9_forum_threads['thread_id'] = e9_forum_threads['thread_id'].astype(int)
first_post_df['thread_id'] = first_post_df['thread_id'].astype(int)

# Update the df
e9_forum_threads = pd.merge(e9_forum_threads, first_post_df, on='thread_id', how='left')

# Display the resulting DataFrame
e9_forum_threads.head()

# Export and save result
e9_forum_threads.to_csv('/content/drive/MyDrive/Data_sets/e9/e9_forum_threads.csv', index=False)


In [None]:
# Fetch all post data from each thread

# As written this will fetch all the posts on the first page, which is 20
# This might need to be updated to iterate through all page values (1 through n)

def fetch_and_parse_thread(df):
    post_data = []
    processed_posts = set()
    for index, row in df.iterrows():
        response = requests.get(row['thread_url'])
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article', class_='message--post')  # Correct class name as example
        for article in articles:
            post_id = article.get('id', 'N/A')
            numeric_post_id = re.findall(r'\d+', post_id)[0] if re.findall(r'\d+', post_id) else 'N/A'

            if numeric_post_id not in processed_posts:
                processed_posts.add(numeric_post_id)
                content = article.find('div', class_='bbWrapper').get_text(strip=True)
                #timestamp = article.find('time', class_='u-dt').get_text(strip=True) if article.find('time', class_='u-dt') else 'N/A'
                #post_number_element = article.find('ul', class_='message-attribution-opposite').find('li').find_next_sibling('li')
                #post_number = post_number_element.get_text(strip=True) if post_number_element else 'N/A'
                #post_number = post_number.lstrip('#') if post_number != 'N/A' else post_number

                post_data.append({
                    'thread_id': row['thread_id'],  # Corrected to use row's data
                    'post_id': numeric_post_id,
                    'post_raw': content
                })

    return pd.DataFrame(post_data, columns=['thread_id', 'post_id','post_raw'])

# Fetch thread URLs and titles, and store in a DataFrame
e9_forum_posts = fetch_and_parse_thread(e9_forum_threads)

e9_forum_posts['thread_id'] = e9_forum_posts['thread_id'].astype(int)
e9_forum_posts['post_id'] = e9_forum_posts['post_id'].astype(int)

# Display the resulting DataFrame
e9_forum_posts.head()

# Export and save result
e9_forum_posts.to_csv('/content/drive/MyDrive/Data_sets/e9/e9_forum_posts.csv', index=False)
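
The comments in the cell above note that only the first page of each thread is fetched and that the code might need to iterate over pages 1 through n. One possible shape for that loop is sketched below, with the page fetcher passed in as a function so the logic can be tried without network access. `fetch_all_pages`, `fake_fetch_posts`, and the `example.invalid` URL are hypothetical. The stop condition (an empty page) is an assumption: how the forum responds to a page number past the end is not known here, and if it serves the last page again, a different check such as detecting repeated post IDs would be needed.

```python
def fetch_all_pages(thread_url, fetch_posts, max_pages=50):
    """Collect posts page by page until a page comes back empty or max_pages is hit."""
    all_posts = []
    for page in range(1, max_pages + 1):
        page_url = f"{thread_url}/?page={page}"
        page_posts = fetch_posts(page_url)
        if not page_posts:
            break
        all_posts.extend(page_posts)
    return all_posts


# Stand-in for a real page fetcher: a thread with 45 posts, 20 per page.
def fake_fetch_posts(page_url):
    page = int(page_url.rsplit("=", 1)[1])
    start = (page - 1) * 20
    return [f"post {n}" for n in range(start + 1, min(start + 20, 45) + 1)]


demo_posts = fetch_all_pages("https://example.invalid/forum/threads/1", fake_fetch_posts)
print(len(demo_posts))
```

The `max_pages` cap keeps a single long thread from generating an unbounded number of requests.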


In [None]:
# Define a function to clean text
# The problem is that it removes the // from URLs in e9_forum_threads


# Remove URL from e9_forum_threads

e9_forum_threads.drop(columns=['thread_url'], inplace=True)


def clean_text(text):
    # Remove special characters and symbols using regex
    cleaned_text = re.sub(r'[^\w\s]', '', str(text))
    return cleaned_text

# Apply the clean_text function to all columns in the DataFrame
e9_forum_threads = e9_forum_threads.applymap(clean_text)

# Apply the clean_text function to all columns in the DataFrame
#e9_forum_posts = e9_forum_posts.applymap(clean_text)
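
The comment in the cell above points out that `clean_text` strips the `//` from URLs, which is why the URL column is dropped first. The sketch below shows that effect on a sample value and one possible alternative: cleaning only the columns that hold free text. `clean_columns` and the sample row are hypothetical, and plain dictionaries stand in for DataFrame rows.

```python
import re


def clean_text(text):
    # Same pattern as in the notebook: drop everything except word characters and whitespace
    return re.sub(r'[^\w\s]', '', str(text))


def clean_columns(rows, text_columns):
    """Clean only the named columns of each row dict, leaving the others untouched."""
    return [
        {key: clean_text(value) if key in text_columns else value for key, value in row.items()}
        for row in rows
    ]


# Hypothetical row for illustration.
sample_rows = [{
    "thread_id": 42,
    "thread_url": "https://e9coupe.com/forum/threads/42",
    "thread_title": "Carb tuning: where to start?",
}]

print(clean_text(sample_rows[0]["thread_url"]))
cleaned = clean_columns(sample_rows, text_columns={"thread_title"})
print(cleaned[0])
```

Restricting the cleaning to text columns also keeps `thread_id` as an integer instead of converting it to a string.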

In [None]:
# Aggregate threads and posts into one df

# Group by THREAD_ID and concatenate the POST_RAW values
aggregated_data = e9_forum_posts.groupby('thread_id')['post_raw'].agg(lambda x: ' '.join(x)).reset_index()

# Rename the column to indicate that it contains concatenated post content
aggregated_data.rename(columns={'post_raw': 'thread_all_posts'}, inplace=True)


# Convert 'thread_id' column to int64 in both DataFrames
e9_forum_threads['thread_id'] = e9_forum_threads['thread_id'].astype('int64')
aggregated_data['thread_id'] = aggregated_data['thread_id'].astype('int64')

# Merge the two DataFrames
e9_forum_corpus = pd.merge(e9_forum_threads, aggregated_data, on='thread_id', how='left')


# Export and save result
e9_forum_corpus.to_csv('/content/drive/MyDrive/Data_sets/e9/e9_forum_corpus.csv', index=False)


In [None]:
# Define a function to clean text
# The problem is that it removes the // from URLs in e9_forum_threads

def clean_text(text):
    # Remove special characters and symbols using regex
    cleaned_text = re.sub(r'[^\w\s]', '', str(text))
    return cleaned_text

# Apply the clean_text function to all columns in the DataFrame
e9_forum_corpus = e9_forum_corpus.applymap(clean_text)

# Apply the clean_text function to all columns in the DataFrame
#e9_forum_posts = e9_forum_posts.applymap(clean_text)

In [None]:
# This is a summarization of posts
# This includes tokenization of the text

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Initialize the T5 tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

def summarize_text(df):
    sum_text = []  # Initialize the list to hold summaries
    for text in df['post_concat']:
        # Ensure the text is a non-empty string; non-valid entries get an empty summary
        if not isinstance(text, str) or not text.strip():
            sum_text.append("")
            continue

        # Prefixing the input text with "summarize: " as T5 expects
        inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", truncation=True, max_length=512)
        summary_ids = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

        # Decode the generated ids to get the summary text
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        sum_text.append(summary)

    return sum_text

df_threads['post_summary'] = summarize_text(df_threads)

# Display the resulting DataFrame
df_threads.head()

In [None]:
# Find the key words of all posts for a given thread

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to extract keywords using BERT
def bert_extract_keywords(text, tokenizer, model, top_n=5):
    # Tokenize and encode the text
    inputs = tokenizer.encode_plus(text, add_special_tokens=True, return_tensors="pt", truncation=True, max_length=512)
    input_ids = inputs['input_ids'][0]

    # Get the embeddings from the last hidden layer
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.squeeze(0)

    # Compute word importance by summing up the embeddings
    word_importance = torch.sum(embeddings, dim=1)

    # Get the indices of the top n important words
    top_n_indices = word_importance.argsort(descending=True)[:top_n]

    # Filter out indices that are out of range of input_ids
    top_n_indices = [idx for idx in top_n_indices if idx < len(input_ids)]

    # Decode the top n words
    keywords = [tokenizer.decode([input_ids[idx]]) for idx in top_n_indices]

    return keywords

df_threads['post_keywords'] = df_threads['post_summary'].apply(lambda x: bert_extract_keywords(x, tokenizer, model))

# Display the resulting DataFrame
df_threads.head()

In [None]:
# Export and save result
df_threads.to_csv('/content/drive/MyDrive/e9/nlp/df_threads.csv', index=False)

## Create Tables

#### Table 1: Issues

*   ID (Unique ID)
*   Issue
*   Short Description
*   Keywords




In [None]:
#Table 1 Issues

#*   Key Should be from pandas
#*   Issue ID (Foreign Key) Should be taken from the thread_id
#*   Issue Should be the thread title
#*   Short Description Using the thread title for now. Should be the post of the originating thread
#*   Keywords: Should be from the thread post

# Table 1
df_table_1 = df_threads[['thread_id','thread_title','thread_first_post_summary','thread_first_post_keywords']]

# Export and save result
df_table_1.to_csv('/content/drive/MyDrive/Data_sets/df_table_1.csv', index=False)


#### Table 2: Solutions

*   ID (Unique Note ID)
*   Issue
*   Issue ID (Foreign Key linking to ID in Issues table)
*   Solution


In [None]:
#Table 2
#*   Key Should be from pandas
#*   Issue Should be the thread title
#*   Issue ID (Foreign Key) Should be taken from the thread_key
#*   Detailed Solution Should be the concatenated post_raw per thread_id
#*   Keywords Should be from the concatenated post_raw per thread_id

df_table_2 = df_threads[['thread_title','thread_id','post_concat','post_keywords']]

# Export and save result
df_table_2.to_csv('/content/drive/MyDrive/Data_sets/df_table_2.csv', index=False)

#### Table 3: Notes

The key should come from pandas, the Issue ID (foreign key) should be taken from the thread_key, and the unstructured note content should come from the thread posts.


*   ID (Unique Note ID)
*   Issue ID (Foreign Key linking to ID in Issues table)
*   Unstructured Content
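
The three tables described above are linked through the Issue ID foreign key. The sketch below expresses those relationships as a schema, using an in-memory SQLite database purely as a stand-in: the notebook stores data in Snowflake and exports these tables as CSV files. The table names, column names, and sample rows are all hypothetical.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

demo_conn.executescript("""
CREATE TABLE issues (
    id                INTEGER PRIMARY KEY,
    issue             TEXT NOT NULL,
    short_description TEXT,
    keywords          TEXT
);
CREATE TABLE solutions (
    id       INTEGER PRIMARY KEY,
    issue    TEXT NOT NULL,
    issue_id INTEGER NOT NULL REFERENCES issues(id),
    solution TEXT
);
CREATE TABLE notes (
    id                   INTEGER PRIMARY KEY,
    issue_id             INTEGER NOT NULL REFERENCES issues(id),
    unstructured_content TEXT
);
""")

# Hypothetical rows for illustration.
demo_conn.execute("INSERT INTO issues VALUES (1, 'Carb tuning', 'Idle is rough', 'carb idle')")
demo_conn.execute("INSERT INTO solutions VALUES (1, 'Carb tuning', 1, 'Adjust the idle mixture')")
demo_conn.execute("INSERT INTO notes VALUES (1, 1, 'raw thread text')")

# A note pointing at a non-existent issue is rejected by the foreign key.
try:
    demo_conn.execute("INSERT INTO notes VALUES (2, 999, 'orphan note')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```

Here `issues.id` plays the role of `thread_id`, so every solution and note can be traced back to the thread it came from.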

In [None]:
# Table 3: Notes

#*   Key Should be from pandas
#*   Issue ID (Foreign Key) Should be taken from the thread_key
#*   Unstructured Content: Should be from the raw thread posts

df_table_3 = df_threads[['thread_id','post_concat']]


# Export and save result
df_table_3.to_csv('/content/drive/MyDrive/Data_sets/df_table_3.csv',index=False)

In [None]:
df_table_3.info()

In [None]:
# Old code on determining post vs thread



        # Parse the HTML content of the page
#        soup = BeautifulSoup(page.content, "html.parser")
#        results = soup.find_all('div', class_='contentRow-main')

#        for result in results:
#            title_element = result.find('h3', class_='contentRow-title')
#            if title_element and title_element.find('a'):
#                title = title_element.get_text(strip=True)
#                url = title_element.find('a')['href']

                # Check if it's a thread or a post
#                post_info = result.find('div', class_='contentRow-minor').get_text(strip=True)
#                if "Thread" in post_info:
#                    type = "Thread"

 #                   print(f"Title: {title}, URL: {url}, Type: {type}")
 #                   print('--------------------------------------------------')
                # Append the information to a list
#                    url = 'https://e9coupe.com'+url
#                    thread_urls.append({'url': url})
#                    #thread_urls.append({'title': title, 'url': url, 'type': type})


# Create a DataFrame for the URLs
#df_threads = pd.DataFrame(thread_urls)