# Data Cleaning and Initial Preprocessing
---

## <ins>Objective</ins>
The purpose of this notebook is to clean and preprocess the raw data extracted from guide articles about Baldur's Gate 3. This includes handling missing or duplicate values, cleaning the raw text data, and standardizing the meta tag columns. These steps will prepare the data for further analysis and advanced preprocessing.

---

## <ins>Workflow</ins>
1. Load the raw scraped data containing:
   - Body content
   - Meta tags: `Meta_Title`, `Meta_Description`, and `Meta_Keywords`.
2. Check for missing or duplicate values in all columns.
3. Clean the raw content:
   - Remove HTML artifacts, excess whitespace, and irrelevant characters.
4. Standardize text in `Content` and meta tags:
   - Convert to lowercase.
   - Remove punctuation and special characters.
5. Remove stopwords from textual columns.
6. Tokenize the cleaned content and meta tags for further analysis.
7. Save the preprocessed data for exploratory analysis and advanced preprocessing.

---

## <ins>Imports</ins>

In [1]:
import re

import nltk
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

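# Uncomment and run the lines below if the NLTK stopword list and the Punkt tokenizer models are not yet downloaded
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('punkt_tab')  # may also be required on newer NLTK releases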

## <ins>Load Data</ins>

In [2]:
data_path = "data/scraped_data_with_meta.pkl"
raw_dataframe = pd.read_pickle(data_path)

## Step 2: Checking for Missing and Duplicate Data

This step ensures data integrity by identifying and handling:
- Missing values (e.g., null or empty fields).
- Columns in which more than 70% of the values are null or empty, which are dropped.
- Duplicate entries, which are removed to avoid redundancy in the analysis.

In [3]:
def check_missing_and_empty(dataframe):
    summary = {}
    for column in dataframe.columns:
        total = len(dataframe)
        missing = dataframe[column].isnull().sum()
        empty = (dataframe[column] == "").sum()
        summary[column] = {
            "Total Rows": total,
            "Missing Values": missing,
            "Empty Strings": empty,
            "Total Missing/Empty": missing + empty,
            "Percentage Missing/Empty": ((missing + empty) / total) * 100
        }
    return pd.DataFrame(summary).T  # Convert summary to a DataFrame for readability

# Check missing and empty values for all columns
missing_summary = check_missing_and_empty(raw_dataframe)

print("Missing and Empty Value Summary:")
print(missing_summary)

Missing and Empty Value Summary:
                  Total Rows  Missing Values  Empty Strings  \
URL                     36.0             0.0            0.0   
Content                 36.0             0.0            0.0   
Meta_Title              36.0             0.0            0.0   
Meta_Description        36.0             0.0            3.0   
Meta_Keywords           36.0             0.0           34.0   

                  Total Missing/Empty  Percentage Missing/Empty  
URL                               0.0                  0.000000  
Content                           0.0                  0.000000  
Meta_Title                        0.0                  0.000000  
Meta_Description                  3.0                  8.333333  
Meta_Keywords                    34.0                 94.444444  
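
The check above counts only exact empty strings (`""`). Scraped fields sometimes hold whitespace-only values, which would not be counted. Below is a minimal sketch of a stricter per-value test; the helper name `is_missing_or_blank` is hypothetical and not part of the original notebook, and the sample values are illustrative only.

```python
import math


def is_missing_or_blank(value):
    """Return True for None, NaN, or strings that are empty after stripping."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str):
        return value.strip() == ""
    return False


# Illustrative values only, not taken from the scraped data
sample_values = ["guide text", "", "   ", None, float("nan"), "\n\t"]
flags = [is_missing_or_blank(value) for value in sample_values]
print(flags)
```

With pandas, the same helper could be applied per column, for example `dataframe[column].apply(is_missing_or_blank).sum()`.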


In [4]:
# Drop columns with more than 70% missing/empty values
threshold = 70  # percent
columns_to_drop = missing_summary[
    missing_summary["Percentage Missing/Empty"] > threshold
].index.tolist()

if columns_to_drop:
    print(f"Dropping columns with more than {threshold}% missing/empty values: {columns_to_drop}")
    raw_dataframe.drop(columns=columns_to_drop, inplace=True)
else:
    print("No columns to drop based on missing/empty values threshold.")

Dropping columns with more than 70% missing/empty values: ['Meta_Keywords']


In [5]:
# Remove duplicates based on the URL column
raw_dataframe.drop_duplicates(subset="URL", inplace=True)
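
`drop_duplicates(subset="URL")` only matches URLs that are identical character for character. If the scraper can record the same article with a trailing slash or a `#fragment`, normalizing the URL first may catch more duplicates. Whether this occurs in this dataset is not known. The sketch below uses only the standard library; the helper `normalize_url` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment, and trim trailing slashes."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


# Placeholder URLs for illustration only
variants = [
    "https://Example.com/guide/",
    "https://example.com/guide#builds",
    "https://example.com/guide",
]
normalized = {normalize_url(url) for url in variants}
print(normalized)
```

One way to use this would be to store `raw_dataframe["URL"].apply(normalize_url)` in a helper column and pass that column as `subset` when dropping duplicates.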

## Step 3: Cleaning Raw Content

The raw scraped content often contains unwanted elements such as HTML tags, scripts, styles, and excess whitespace. This step focuses on:
- Removing all HTML tags to retain plain text.
- Eliminating any lingering JavaScript or CSS code.
- Cleaning up newline characters, tabs, and extra spaces for a cleaner dataset.


In [6]:
# Define a function to clean text, including removing JS and CSS
def clean_text_with_js_css_removal(text):
    if isinstance(text, str):
        # Parse the text using BeautifulSoup
        soup = BeautifulSoup(text, "html.parser")
        
        # Remove <script> and <style> tags along with their content
        for script_or_style in soup(["script", "style"]):
            script_or_style.decompose()
        
        # Get plain text from the HTML
        text = soup.get_text()
        
        # Replace punctuation and special characters with spaces, then tidy whitespace
        text = re.sub(r"[^\w\s]", " ", text)  # Replace anything that is not a word character or whitespace
        text = re.sub(r"\s+", " ", text).strip()  # Collapse runs of whitespace (spaces, tabs, newlines) into one space
        return text.lower()
    return text  # Return as is if not a string

# Apply cleaning to relevant columns
columns_to_clean = ["Content", "Meta_Title", "Meta_Description"]

for column in columns_to_clean:
    raw_dataframe[column] = raw_dataframe[column].apply(clean_text_with_js_css_removal)

print("Cleaned text in 'Content', 'Meta_Title', and 'Meta_Description' columns, including JS and CSS removal.")

Cleaned text in 'Content', 'Meta_Title', and 'Meta_Description' columns, including JS and CSS removal.
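
To see what the two regular expressions in the cleaning function do in isolation, here is a small sketch on a made-up string. It covers only the regex part (no HTML parsing), and the helper name `strip_punctuation_and_collapse` is hypothetical. Note that apostrophes and hyphens become spaces, so possessives and hyphenated terms are split.

```python
import re


def strip_punctuation_and_collapse(text):
    # Same two substitutions as in the cleaning function above
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation -> space
    text = re.sub(r"\s+", " ", text).strip()  # whitespace runs -> single space
    return text.lower()


# Made-up sample, not taken from the scraped articles
sample = "Karlach's build:\n\t12-level guide!"
cleaned = strip_punctuation_and_collapse(sample)
print(cleaned)  # karlach s build 12 level guide
```

Leftover fragments such as the single letter `s` are worth keeping in mind when reviewing token frequencies later.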


## Step 4: Standardizing Text

Standardizing text ensures uniform formatting, which is essential for later steps such as tokenization and analysis. The cleaning function in Step 3 already lowercases the text and collapses newlines and tabs into single spaces, so this step acts mainly as a safeguard:
- Converting all text to lowercase for consistency.
- Stripping any leading or trailing whitespace.

In [7]:
# Define a function to standardize text
def standardize_text(text):
    if isinstance(text, str):
        return text.lower().strip()  # Convert to lowercase and remove leading/trailing whitespace
    return text  # Return as is if not a string

# Apply standardization to relevant columns
columns_to_standardize = ["Content", "Meta_Title", "Meta_Description"]

for column in columns_to_standardize:
    raw_dataframe[column] = raw_dataframe[column].apply(standardize_text)

print("Standardized text in 'Content', 'Meta_Title', and 'Meta_Description' columns.")

Standardized text in 'Content', 'Meta_Title', and 'Meta_Description' columns.


## Step 5: Removing General Stopwords
In this step, we remove standard English stopwords to simplify the text and reduce noise. We will not apply any custom or domain-specific stopword removal at this stage, as this will be addressed during exploratory data analysis (EDA).

In [8]:
# Define a function to remove stopwords
stop_words = set(stopwords.words("english"))
def remove_general_stopwords(text):
    if isinstance(text, str):
        tokens = word_tokenize(text)
        filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
        return " ".join(filtered_tokens)
    return text

# Apply general stopword removal to the relevant columns
raw_dataframe["Content"] = raw_dataframe["Content"].apply(remove_general_stopwords)
raw_dataframe["Meta_Title"] = raw_dataframe["Meta_Title"].apply(remove_general_stopwords)
raw_dataframe["Meta_Description"] = raw_dataframe["Meta_Description"].apply(remove_general_stopwords)

print("Removed general stopwords from 'Content', 'Meta_Title', and 'Meta_Description' columns.")

Removed general stopwords from 'Content', 'Meta_Title', and 'Meta_Description' columns.
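
The filtering logic can be illustrated without the NLTK downloads. In the sketch below, `demo_stop_words` is a tiny hand-picked subset (the notebook uses NLTK's full English list), `str.split()` stands in for `word_tokenize`, and the helper name `remove_stopwords_simple` is hypothetical.

```python
# Tiny illustrative subset; the notebook uses NLTK's full English list
demo_stop_words = {"the", "a", "of", "to", "is", "in", "and"}


def remove_stopwords_simple(text, stop_words):
    # str.split() stands in for word_tokenize to avoid the NLTK dependency here
    tokens = text.split()
    return " ".join(token for token in tokens if token.lower() not in stop_words)


# Made-up sample sentence
sample = "the best build of a paladin in act 1"
filtered = remove_stopwords_simple(sample, demo_stop_words)
print(filtered)  # best build paladin act 1
```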


## Step 6: Tokenization

Tokenization is the process of splitting text into individual components, such as words or sentences. For our analysis, we will tokenize the content into words. This step prepares the data for further preprocessing and text analysis tasks.

In [9]:
# Define a function for tokenization
def tokenize_text(text):
    if isinstance(text, str):
        tokens = word_tokenize(text)
        return tokens
    return text

# Apply tokenization to the relevant columns
raw_dataframe["Tokenized_Content"] = raw_dataframe["Content"].apply(tokenize_text)
raw_dataframe["Tokenized_Meta_Title"] = raw_dataframe["Meta_Title"].apply(tokenize_text)
raw_dataframe["Tokenized_Meta_Description"] = raw_dataframe["Meta_Description"].apply(tokenize_text)

print("Tokenized 'Content', 'Meta_Title', and 'Meta_Description' columns into words.")

Tokenized 'Content', 'Meta_Title', and 'Meta_Description' columns into words.


## Step 7: Save Initial Preprocessed Data

At this stage, the data has been cleaned and tokenized, making it ready for exploratory data analysis (EDA) and advanced preprocessing. The processed data will be saved as a `.pkl` file for seamless loading in the next notebook.

In [10]:
# Save the updated preprocessed data
raw_dataframe.to_pickle("data/initial_preprocessed_data.pkl")
raw_dataframe.to_csv("data/initial_preprocessed_data.csv", index=False)

print("Saved the preprocessed data for further analysis.")

Saved the preprocessed data for further analysis.
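
One caveat when reloading: the pickle file preserves the token lists as Python lists, while a CSV file stores each list as its text representation. The sketch below shows the round trip using the standard `csv` module as a stand-in for `to_csv`/`read_csv`; the token values are made up.

```python
import ast
import csv
import io

tokens = ["wyll", "warlock", "build"]  # made-up tokens for illustration

# Write a single list-valued cell to an in-memory CSV
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Tokenized_Content"])
writer.writerow([tokens])  # the list is written as its string representation

# Read it back: the cell is now a plain string, not a list
buffer.seek(0)
rows = list(csv.reader(buffer))
raw_cell = rows[1][0]
print(type(raw_cell).__name__, raw_cell)

# Parse the string back into a list
restored = ast.literal_eval(raw_cell)
print(restored)
```

With pandas, the same parsing could be done at load time by passing `converters={"Tokenized_Content": ast.literal_eval}` to `pd.read_csv`; loading the `.pkl` file avoids the issue entirely.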
