<a href="https://colab.research.google.com/github/davidelgas/DataSciencePortfolio/blob/main/NLP_corpus_creation%20/noteboooks/NLP_Corpus_Development.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 Data Collection and Preprocessing





In [None]:
# Access to Google Drive
# This seems to propagate credentials better from its own cell

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Install libraries

from IPython.display import Image

# Data Collection
import os

!pip3 install pandas
import pandas as pd

!pip3 install requests
import requests

!pip3 install beautifulsoup4
from bs4 import BeautifulSoup

# The [pandas] extra is needed for cursor.fetch_pandas_all() used later in this notebook
!pip install "snowflake-connector-python[pandas]"
import snowflake.connector

# Data Preprocessing
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

from gensim.parsing.preprocessing import STOPWORDS

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import re

!pip install langdetect
from langdetect import detect

from transformers import BertTokenizer, BertModel, pipeline

import torch

!pip3 install numpy
import numpy as np

!pip install faiss-cpu
import faiss



Data Collection



The project uses user-generated content from a domain-specific online forum as the training corpus. This data is largely unstructured, with minimal metadata available. The following approaches were considered for gathering the source text:


### Web Scraping
- **Tools:** Beautiful Soup, online SaaS products
    - **Pros:**
        - **Direct Access to Targeted Data:** Enables precise extraction of user-generated content from specific sections or threads within the forum.
        - **Efficiency in Data Collection:** Automated scripts can gather large volumes of data in a short amount of time, making it suitable for assembling significant datasets for NLP.
    - **Cons:**
        - **Potential for Incomplete Data:** May miss embedded content or dynamically loaded data, depending on the website’s structure.
        - **Ethical and Legal Considerations:** Scraping data from forums may raise concerns about user privacy and must adhere to the terms of service of the website.
        - **Highly Platform-Dependent:** Forum-specific solutions result in forum-specific data schemas that must be reverse engineered for successful text extraction.
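
One way to act on the ethical and legal point above is to check a site's `robots.txt` before scraping. The sketch below uses the standard library's `urllib.robotparser` on a hypothetical `robots.txt` body so it runs offline; the rules, the `corpus-builder` user agent, and the `example.com` URLs are illustrative assumptions, not the forum's actual policy. A `robots.txt` check is only one signal and does not replace reading the site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration only).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/account/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
user_agent = "corpus-builder"  # hypothetical user agent name

# Thread pages are not covered by the Disallow rule; account pages are.
thread_allowed = parser.can_fetch(user_agent, "https://example.com/forum/threads/1")
account_allowed = parser.can_fetch(user_agent, "https://example.com/forum/account/")

# Seconds the site asks crawlers to wait between requests (None if not set).
crawl_delay = parser.crawl_delay(user_agent)

print(thread_allowed, account_allowed, crawl_delay)
```

In a live run, `RobotFileParser.set_url()` followed by `read()` would fetch the real file instead of the sample string.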

### Forum-specific APIs
- **Tools:** Python (`requests` library for API calls and `json` library for handling responses)
    - **Pros:**
        - **Structured and Reliable Data Retrieval:** APIs provide structured data, making it easier to process and integrate into your project.
        - **Efficient and Direct Access:** Directly accessing the forum's data through its API is efficient, bypassing the need for HTML parsing.
        - **Compliance and Ethical Data Use:** Utilizing APIs respects the forum's data use policies and ensures access is in line with user agreements.
    - **Cons:**
        - **Rate Limiting:** APIs often have limitations on the number of requests that can be made in a certain timeframe, which could slow down data collection.
        - **API Changes:** Dependence on the forum's API structure means that changes or deprecation could disrupt your data collection pipeline.
        - **Access Restrictions:** Some data or functionalities might be restricted or require authentication, posing additional challenges for comprehensive data collection.
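
The rate-limiting concern above is commonly handled by retrying with exponential backoff when the server answers with HTTP 429 (Too Many Requests). The following is a minimal sketch, not the forum's API: `fetch_with_backoff` and its parameters are hypothetical names, and the `fetch` callable stands in for a real `requests.get` call so the example runs without network access.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch() and retry on HTTP 429, doubling the wait each time."""
    status, body = fetch()
    for attempt in range(max_retries):
        if status != 429:
            break
        # Wait base_delay, then 2x, then 4x, ... before the next attempt.
        sleep(base_delay * 2 ** attempt)
        status, body = fetch()
    return status, body


# Simulated responses: two rate-limited replies, then a success.
responses = iter([(429, ""), (429, ""), (200, '{"posts": []}')])
delays = []  # records the waits instead of actually sleeping

result = fetch_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper easy to test; in real use the default `time.sleep` applies.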


**Conclusion: I will be using Beautiful Soup to create my corpus.**


In [None]:
# Generate the list of thread_ids to scrape and parse
# There are currently approximately 15k threads

# Set the file path to save files
file_path = '/content/drive/MyDrive/Colab Notebooks/Data_sets/e9/e9_forum_thread_ids.csv'

# Set the number of incremental thread_ids to process
threads = 500

# Check if the file exists and has content. If it does, update last_thread_id
if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
    e9_forum_thread_ids = pd.read_csv(file_path)
    last_thread_id = e9_forum_thread_ids['thread_id'].iloc[-1]
    last_thread_id = int(last_thread_id)  # Convert to integer

else:
    last_thread_id = 0

# Function to create URLs from the thread_ids
def create_urls(threads, last_thread_id):
    urls = []
    for thread_id in range(last_thread_id + 1, last_thread_id + threads + 1):
        urls.append({'thread_id': thread_id})
    return urls

urls = create_urls(threads, last_thread_id)

last_thread_id_processed = urls[-1]['thread_id']

# Convert the list of dictionaries into a DataFrame
e9_forum_thread_ids = pd.DataFrame(urls)

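# Save DataFrame to CSV file; write the header only when the file is new or empty
write_header = not os.path.exists(file_path) or os.path.getsize(file_path) == 0
e9_forum_thread_ids.to_csv(file_path, mode='a', header=write_header, index=False)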

print("Starting with thread_id " + str(last_thread_id))
print("Processing additional " + str(threads) + " threads")
print("Ending with thread_id " + str(last_thread_id_processed))
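
The cell above resumes from the last saved `thread_id` each time it runs. The sketch below isolates that resume-and-append pattern so it can be tried without Google Drive. It uses the standard library `csv` module and a temporary directory rather than pandas, and the helper names `read_last_thread_id` and `append_next_batch` are hypothetical.

```python
import csv
import os
import tempfile


def read_last_thread_id(path):
    """Return the last thread_id stored in the CSV, or 0 if there is none."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return 0
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return int(rows[-1]["thread_id"]) if rows else 0


def append_next_batch(path, batch_size):
    """Append the next batch_size IDs, writing the header only once."""
    last_id = read_last_thread_id(path)
    new_ids = list(range(last_id + 1, last_id + batch_size + 1))
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["thread_id"])
        writer.writerows([thread_id] for thread_id in new_ids)
    return new_ids


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "thread_ids.csv")
    first_batch = append_next_batch(demo_path, 3)   # starts from an empty file
    second_batch = append_next_batch(demo_path, 3)  # resumes after the first batch
    with open(demo_path, newline="") as f:
        lines = f.read().splitlines()

print(first_batch, second_batch, lines)
```

Writing the header only on the first append keeps the file readable as a single table on later runs.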

In [None]:
# Generate the URL and title for each thread

pages = 1

def fetch_thread_data(df, pages=1):
    for index, row in df.iterrows():
        thread_id = row['thread_id']
        thread_url = f"https://e9coupe.com/forum/threads/{thread_id}"
        for i in range(1, pages + 1):
            page_url = f"{thread_url}/?page={i}"  # Construct the page URL
            # A timeout keeps one unresponsive page from stalling the whole run
            response = requests.get(page_url, timeout=30)
            soup = BeautifulSoup(response.text, 'html.parser')
            # Not every thread_id is guaranteed to return a page with a <title> tag
            title_tag = soup.find('title')
            title = title_tag.get_text() if title_tag else ''
            thread_title = title.split('|')[0].strip()
            df.at[index, 'thread_title'] = thread_title
            df.at[index, 'thread_url'] = page_url

    return df

# Fetch thread URLs and title
e9_forum_threads = fetch_thread_data(e9_forum_thread_ids, pages=pages)

# Export and save result
file_path = '/content/drive/MyDrive/Colab Notebooks/Data_sets/e9/e9_forum_threads.csv'

# Write the header only when the file does not exist yet, so appends do not repeat it
header = not os.path.exists(file_path)

e9_forum_threads.to_csv(file_path, mode='a', columns=['thread_id', 'thread_title', 'thread_url'], header=header, index=False)

In [None]:
# Find the first post in each thread
# I may use this as the question portion of the RAG pipeline

def fetch_first_post_content(df):
    data = []

    for thread_id, thread_url, thread_title in zip(df['thread_id'], df['thread_url'], df['thread_title']):
        response = requests.get(thread_url)
        soup = BeautifulSoup(response.text, 'html.parser')

        first_post = soup.find('article', class_='message-body')
        if first_post:
            post_content = first_post.get_text(strip=True)
        else:
            post_content = "No content found"  # Handle case where no post content is found

        data.append({'thread_id': thread_id, 'thread_title': thread_title, 'thread_first_post': post_content})

    return data

# Fetch first post content
data = fetch_first_post_content(e9_forum_threads)

# Convert to DataFrame
e9_forum_threads_decorated = pd.DataFrame(data)

# Export and save result
file_path = '/content/drive/MyDrive/Colab Notebooks/Data_sets/e9/e9_forum_threads_decorated.csv'

header = not os.path.exists(file_path)

# Export and save result
e9_forum_threads_decorated.to_csv(file_path, mode='a', header=header, index=False)

In [None]:
# Find all posts associated with each thread

def fetch_and_parse_thread(df):
    post_data = []
    for index, row in df.iterrows():
        # A timeout keeps one unresponsive page from stalling the whole run
        response = requests.get(row['thread_url'], timeout=30)
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article', class_='message--post')
        for article in articles:
            # Use the post timestamp to identify the post instead of a post ID
            time_tag = article.find('time')
            post_timestamp = time_tag['datetime'] if time_tag else 'N/A'
            # Guard against posts that have no body wrapper
            body = article.find('div', class_='bbWrapper')
            content = body.get_text(strip=True) if body else ''

            post_data.append({
                'thread_id': row['thread_id'],
                'post_timestamp': post_timestamp,
                'post_raw': content
            })

    return post_data

# Fetch all thread post content
post_data = fetch_and_parse_thread(e9_forum_threads)

# Convert to DataFrame
e9_forum_posts = pd.DataFrame(post_data)

# Export and save result
file_path = '/content/drive/MyDrive/Colab Notebooks/Data_sets/e9/e9_forum_posts.csv'

# Write the header only when the file does not exist yet, so appends do not repeat it
header = not os.path.exists(file_path)

e9_forum_posts.to_csv(file_path, mode='a', header=header, index=False)

In [None]:
# Create the corpus by aggregating all posts into one column
# and merging with the threads df

# Group by THREAD_ID and concatenate the POST_RAW values
aggregated_data = e9_forum_posts.groupby('thread_id')['post_raw'].agg(lambda x: ' '.join(x)).reset_index()

# Rename the column to indicate that it contains concatenated post content
aggregated_data.rename(columns={'post_raw': 'thread_all_posts'}, inplace=True)

# Cast 'thread_id' column to int64 in both DataFrames being merged
e9_forum_threads_decorated['thread_id'] = e9_forum_threads_decorated['thread_id'].astype('int64')
aggregated_data['thread_id'] = aggregated_data['thread_id'].astype('int64')

# Merge the two DataFrames
e9_forum_corpus = pd.merge(e9_forum_threads_decorated, aggregated_data, on='thread_id', how='left')

# Export and save result
e9_forum_corpus.to_csv('/content/drive/MyDrive/Colab Notebooks/Data_sets/e9/e9_forum_corpus.csv', index=False)
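
To make the aggregation and left merge above concrete, here is a small sketch on toy data. The thread titles and post text are invented for illustration. It shows that a thread with no scraped posts survives the left merge with a missing `thread_all_posts` value, which is why nulls are converted to `None` before loading into Snowflake later.

```python
import pandas as pd

# Toy stand-ins for e9_forum_threads_decorated and e9_forum_posts (invented values).
threads = pd.DataFrame({
    "thread_id": [1, 2, 3],
    "thread_title": ["Carb tuning", "Rust repair", "Empty thread"],
})
posts = pd.DataFrame({
    "thread_id": [1, 1, 2],
    "post_raw": ["First post.", "A reply.", "Only post."],
})

# One row per thread, with all of its posts joined by a space.
aggregated = (
    posts.groupby("thread_id")["post_raw"]
    .agg(lambda x: " ".join(x))
    .reset_index()
    .rename(columns={"post_raw": "thread_all_posts"})
)

# how='left' keeps every thread, even those without any posts.
corpus = pd.merge(threads, aggregated, on="thread_id", how="left")
missing = corpus["thread_all_posts"].isna().tolist()

print(corpus)
print(missing)
```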

In [None]:
# Create the db and schema in Snowflake

# Set the snowflake account and login information
path_to_credentials = '/content/drive/MyDrive/credentials/snowflake_credentials'

# Load the credentials
with open(path_to_credentials, 'r') as file:
    for line in file:
        key, value = line.strip().split('=', 1)  # Split on the first '=' only so values may contain '='
        os.environ[key] = value

conn = snowflake.connector.connect(
    user=os.environ.get('USER'),
    password=os.environ.get('PASSWORD'),
    account=os.environ.get('ACCOUNT'),
)

# Create a cursor object
cur = conn.cursor()

# Create a database for the corpus and load the tables
try:
    # Create a new database
    cur.execute("CREATE DATABASE IF NOT EXISTS e9_corpus")

    # Use the new database
    cur.execute("USE DATABASE e9_corpus")

    # Create a new schema
    cur.execute("CREATE SCHEMA IF NOT EXISTS e9_corpus_schema")

    print("Database and schema created successfully.")
except Exception as e:
    print(e)

cur.close()

conn.close()
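
The credential-loading loop above is repeated in several cells. A possible refactoring, sketched below, is a small helper that returns a dictionary instead of writing into `os.environ`, which avoids overwriting existing variables such as `USER`. The function name `load_credentials`, the skipping of blank and `#` comment lines, and the demo values are all assumptions for illustration; the real credentials file format is only known to be `KEY=value` per line.

```python
import os
import tempfile


def load_credentials(path):
    """Read KEY=value lines into a dict, ignoring blanks and '#' comments."""
    credentials = {}
    with open(path, "r") as file:
        for line in file:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            # Split on the first '=' only, so values may contain '='.
            key, value = line.split("=", 1)
            credentials[key.strip()] = value.strip()
    return credentials


# Demo with a throwaway file; the values are placeholders, not real secrets.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "snowflake_credentials")
    with open(demo_path, "w") as f:
        f.write("# demo values only\nUSER=demo_user\n\nPASSWORD=p@ss=word\nACCOUNT=demo_account\n")
    creds = load_credentials(demo_path)

print(sorted(creds))
```

The resulting dictionary could then be passed along as `snowflake.connector.connect(user=creds['USER'], password=creds['PASSWORD'], account=creds['ACCOUNT'])`.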

In [None]:
# Save the data to Snowflake

# Set the snowflake account and login information
path_to_credentials = '/content/drive/MyDrive/credentials/snowflake_credentials'

# Load the credentials
with open(path_to_credentials, 'r') as file:
    for line in file:
        key, value = line.strip().split('=', 1)  # Split on the first '=' only so values may contain '='
        os.environ[key] = value

conn = snowflake.connector.connect(
    user=os.environ.get('USER'),
    password=os.environ.get('PASSWORD'),
    account=os.environ.get('ACCOUNT'),
)

# Create a cursor object
cur = conn.cursor()

# Check if the table exists
try:
    cur.execute("SELECT 1 FROM e9_corpus.e9_corpus_schema.e9_forum_corpus LIMIT 1")
    table_exists = True
except snowflake.connector.errors.ProgrammingError:
    table_exists = False

# If the table does not exist, create it
if not table_exists:
    try:
        cur.execute("""
        CREATE TABLE e9_corpus.e9_corpus_schema.e9_forum_corpus (
            thread_id NUMBER(38,0),
            thread_title VARCHAR(16777216),
            thread_first_post VARCHAR(16777216),
            thread_all_posts VARCHAR(16777216)
        )
        """)
        print("e9_forum_corpus table created successfully.")
    except Exception as e:
        print(e)

# Insert data into e9_forum_corpus table
for index, row in e9_forum_corpus.iterrows():

    row = row.where(pd.notnull(row), None)

    # Prepare the INSERT command with placeholders for the values
    insert_command = """
    INSERT INTO e9_corpus.e9_corpus_schema.e9_forum_corpus
    (thread_id, thread_title, thread_first_post, thread_all_posts)
    VALUES
    (%s, %s, %s, %s)
    """

    # Use the row values as parameters to safely insert the data
    cur.execute(insert_command, (row['thread_id'], row['thread_title'], row['thread_first_post'], row['thread_all_posts']))
    conn.commit()

print("Data inserted into e9_forum_corpus table.")

cur.close()
conn.close()
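
The cell above inserts and commits one row at a time, which can be slow for roughly 15k threads. One alternative, sketched here under assumptions, is to group rows into batches and pass each batch to the cursor's `executemany` method, committing once at the end. The helper names (`clean_value`, `to_batches`, `insert_in_batches`) and the `RecordingCursor` stand-in are hypothetical; the stand-in only records calls so the example runs without a Snowflake account.

```python
import math

INSERT_SQL = """
INSERT INTO e9_corpus.e9_corpus_schema.e9_forum_corpus
(thread_id, thread_title, thread_first_post, thread_all_posts)
VALUES (%s, %s, %s, %s)
"""


def clean_value(value):
    """Convert float NaN (pandas' missing marker) to None, i.e. SQL NULL."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def to_batches(rows, batch_size):
    """Clean every row and split the result into lists of at most batch_size."""
    cleaned = [tuple(clean_value(v) for v in row) for row in rows]
    return [cleaned[i:i + batch_size] for i in range(0, len(cleaned), batch_size)]


def insert_in_batches(cur, rows, batch_size=1000):
    """Send each batch with one executemany call; return the number of batches."""
    batches = to_batches(rows, batch_size)
    for batch in batches:
        cur.executemany(INSERT_SQL, batch)
    return len(batches)


class RecordingCursor:
    """Stand-in for a database cursor that only records executemany calls."""

    def __init__(self):
        self.calls = []

    def executemany(self, sql, params):
        self.calls.append((sql, list(params)))


demo_rows = [
    (1, "Title A", "First post A", "All posts A"),
    (2, "Title B", "First post B", float("nan")),  # thread without posts
    (3, "Title C", "First post C", "All posts C"),
]
demo_cursor = RecordingCursor()
batch_count = insert_in_batches(demo_cursor, demo_rows, batch_size=2)
print(batch_count)
```

With a real connection, the rows could come from `e9_forum_corpus.itertuples(index=False, name=None)`, followed by a single `conn.commit()` after the last batch.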


In [None]:
# Confirm dataset in Snowflake

# Set the snowflake account and login information
path_to_credentials = '/content/drive/MyDrive/credentials/snowflake_credentials'

# Load the credentials
with open(path_to_credentials, 'r') as file:
    for line in file:
        key, value = line.strip().split('=', 1)  # Split on the first '=' only so values may contain '='
        os.environ[key] = value

conn = snowflake.connector.connect(
    user=os.environ.get('USER'),
    password=os.environ.get('PASSWORD'),
    account=os.environ.get('ACCOUNT'),
)

# Create a cursor object
cur = conn.cursor()

# Select source data
query = """
SELECT * FROM "E9_CORPUS"."E9_CORPUS_SCHEMA"."E9_FORUM_CORPUS";
"""
cur.execute(query)

# Load data into a DataFrame
e9_forum_corpus = cur.fetch_pandas_all()
e9_forum_corpus.info()

# Release the cursor and connection
cur.close()
conn.close()