# Data Cleaning and Initial Preprocessing

## Objective
- The purpose of this notebook is to clean and preprocess the raw data extracted from guide articles about Baldur's Gate 3. This includes handling missing or duplicate values, cleaning the raw text data, and standardizing the meta tag columns. These steps will prepare the data for further analysis and advanced preprocessing.

## Workflow
1. Load the raw scraped data (article content and meta tags)
2. Check for missing or duplicate values in all columns
3. Clean the raw content
4. Standardize text
5. Remove general English stopwords from the text columns
6. Tokenize and lemmatize the cleaned columns, then generate bigrams and trigrams
7. Save the preprocessed data for EDA and advanced preprocessing

## Imports

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import re
from string import punctuation
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from nltk.stem import WordNetLemmatizer

# Uncomment the lines below and run them if the NLTK stopwords, punkt tokenizer,
# and WordNet data (needed by WordNetLemmatizer) are not downloaded yet
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')

## Load Data

In [2]:
data_path = "data/scraped_data_with_meta.pkl"
raw_dataframe = pd.read_pickle(data_path)

In [3]:
raw_dataframe.head()

Unnamed: 0,Google_Question,URL,Content,Meta_Title,Meta_Description,Meta_Keywords
0,How to multiclass?,https://www.ign.com/wikis/baldurs-gate-3/How_t...,Join\nGuide\nChecklists\nMaps\nTask Search\nHo...,How to Multiclass - Baldur's Gate III Guide - IGN,Multiclassing in Baldur&#x2019;s Gate 3 lets y...,
1,How to multiclass?,https://www.thegamer.com/baldurs-gate-3-bg3-mu...,Menu\nSign In Now\nTrending\nTheGamer GOTY 202...,How To Multiclass In Baldur's Gate 3,Multiclassing allows you to use some interesti...,
2,How to multiclass?,https://www.polygon.com/baldurs-gate-3-guides/...,The homepage\nGuides\nGames\nEntertainment\nGU...,How to multiclass and best multiclass builds i...,Multiclassing is one of the best ways to incre...,
3,How to multiclass?,https://www.rockpapershotgun.com/baldurs-gate-...,All 107 Black Myth Wukong bosses\nInfinity Nik...,How to multiclass in Baldur's Gate 3 | Rock Pa...,"Baldur's Gate 3 lets you multiclass, allowing ...",
4,How to install mods?,https://bg3.wiki/wiki/Modding:Installing_mods,Modding:Installing mods\nJump to navigation\nJ...,Modding Resources - bg3.wiki,This page is a hub for everything related to M...,"Installing mods, Baldur's Gate 3 Wiki, BG3 Wik..."


## Checking Data Types of all Columns
Checking data types of all values to confirm we are working with only `str` variables.

In [4]:
type_summary = {}
for column in raw_dataframe.columns:
    types = raw_dataframe[column].map(type).value_counts()
    type_summary[column] = {str(t): count for t, count in types.items()}

for column, type_counts in type_summary.items():
    type_info = ", ".join([f"{count} {t.split('.')[-1]}" for t, count in type_counts.items()])
    print(f"{column}: {type_info}")

Google_Question: 36 <class 'str'>
URL: 36 <class 'str'>
Content: 36 <class 'str'>
Meta_Title: 36 <class 'str'>
Meta_Description: 33 <class 'str'>, 3 <class 'NoneType'>
Meta_Keywords: 33 <class 'NoneType'>, 3 <class 'str'>


All values are `str` or `NoneType`; the `NoneType` entries occur only in `Meta_Description` and `Meta_Keywords`.

## Checking for Missing and Duplicate Data
This step protects data integrity by identifying and handling:
- Missing values (null or empty fields).
- Columns in which more than 70% of the values are null or empty, which are removed.
- Duplicate entries, to avoid redundancy in the analysis.

### Checking Missing Data

In [5]:
def check_missing_and_empty(dataframe):
    summary = {}
    for column in dataframe.columns:
        total = len(dataframe)
        missing = dataframe[column].isnull().sum()
        empty = (dataframe[column] == "").sum()
        summary[column] = {
            "Total Rows": total,
            "Missing Values": missing,
            "Empty Strings": empty,
            "Total Missing/Empty": missing + empty,
            "Percentage Missing/Empty": ((missing + empty) / total) * 100
        }
    return pd.DataFrame(summary).T 

missing_summary = check_missing_and_empty(raw_dataframe)
print("Missing and Empty Value Summary:")
print(missing_summary)

Missing and Empty Value Summary:
                  Total Rows  Missing Values  Empty Strings  \
Google_Question         36.0             0.0            0.0   
URL                     36.0             0.0            0.0   
Content                 36.0             0.0            0.0   
Meta_Title              36.0             0.0            0.0   
Meta_Description        36.0             3.0            0.0   
Meta_Keywords           36.0            33.0            0.0   

                  Total Missing/Empty  Percentage Missing/Empty  
Google_Question                   0.0                  0.000000  
URL                               0.0                  0.000000  
Content                           0.0                  0.000000  
Meta_Title                        0.0                  0.000000  
Meta_Description                  3.0                  8.333333  
Meta_Keywords                    33.0                 91.666667  


### Removing Redundant Columns

In [6]:
threshold = 70  
columns_to_drop = missing_summary[
    missing_summary["Percentage Missing/Empty"] > threshold
].index.tolist()

if columns_to_drop:
    print(f"Dropping columns with more than {threshold}% missing/empty values: {columns_to_drop}")
    raw_dataframe.drop(columns=columns_to_drop, inplace=True)
else:
    print("No columns to drop based on missing/empty values threshold.")

Dropping columns with more than 70% missing/empty values: ['Meta_Keywords']
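
The same percentage check can also be written without an explicit loop. The block below is a minimal sketch on a hypothetical toy frame (the `toy` data and the variable names are illustrative assumptions, not part of the scraped dataset); it combines `isna()` with an empty-string comparison and averages the resulting boolean mask per column.

```python
import pandas as pd

# Hypothetical toy frame standing in for the scraped data
toy = pd.DataFrame({
    "Meta_Title": ["a", "b", "c", "d"],
    "Meta_Description": ["x", None, "y", "z"],
    "Meta_Keywords": [None, None, None, "k"],
})

# True wherever a cell is null or an empty string
missing_mask = toy.isna() | toy.eq("")

# The mean of a boolean column is the fraction of True values
missing_pct = missing_mask.mean() * 100

threshold = 70
columns_to_drop = missing_pct[missing_pct > threshold].index.tolist()
trimmed = toy.drop(columns=columns_to_drop)

print(missing_pct)
print(columns_to_drop)
```

In this toy frame only `Meta_Keywords` (75% missing) exceeds the threshold, so it is the only column removed.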


### Fill in `NoneType` Values with Empty Strings

In [7]:
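# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply
raw_dataframe["Meta_Description"] = raw_dataframe["Meta_Description"].fillna("")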

### Removing Duplicate Rows

In [8]:
raw_dataframe.drop_duplicates(subset="URL", inplace=True)
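
Before dropping duplicates it can help to count them, so the effect of the step is visible. The following is a small sketch on hypothetical data (the `toy` frame and its URLs are made up for illustration); it uses `duplicated` with the same `subset` as the call above.

```python
import pandas as pd

# Hypothetical rows; the third URL repeats the first
toy = pd.DataFrame({
    "Google_Question": ["q1", "q2", "q3"],
    "URL": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",
    ],
})

# duplicated() marks every repeat after the first occurrence
duplicate_mask = toy.duplicated(subset="URL")
n_duplicates = int(duplicate_mask.sum())

# Dropping rows leaves gaps in the index, so it is reset here
deduplicated = toy.drop_duplicates(subset="URL").reset_index(drop=True)

print(f"Duplicate URLs found: {n_duplicates}")
print(deduplicated)
```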

In [9]:
raw_dataframe.head()

Unnamed: 0,Google_Question,URL,Content,Meta_Title,Meta_Description
0,How to multiclass?,https://www.ign.com/wikis/baldurs-gate-3/How_t...,Join\nGuide\nChecklists\nMaps\nTask Search\nHo...,How to Multiclass - Baldur's Gate III Guide - IGN,Multiclassing in Baldur&#x2019;s Gate 3 lets y...
1,How to multiclass?,https://www.thegamer.com/baldurs-gate-3-bg3-mu...,Menu\nSign In Now\nTrending\nTheGamer GOTY 202...,How To Multiclass In Baldur's Gate 3,Multiclassing allows you to use some interesti...
2,How to multiclass?,https://www.polygon.com/baldurs-gate-3-guides/...,The homepage\nGuides\nGames\nEntertainment\nGU...,How to multiclass and best multiclass builds i...,Multiclassing is one of the best ways to incre...
3,How to multiclass?,https://www.rockpapershotgun.com/baldurs-gate-...,All 107 Black Myth Wukong bosses\nInfinity Nik...,How to multiclass in Baldur's Gate 3 | Rock Pa...,"Baldur's Gate 3 lets you multiclass, allowing ..."
4,How to install mods?,https://bg3.wiki/wiki/Modding:Installing_mods,Modding:Installing mods\nJump to navigation\nJ...,Modding Resources - bg3.wiki,This page is a hub for everything related to M...


## Cleaning Raw Content
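The raw scraped content often contains unwanted elements such as HTML tags, scripts, styles, and excess whitespace. This step removes `script` and `style` elements, strips the remaining HTML tags, replaces punctuation with spaces, collapses repeated whitespace, and lowercases the text.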

In [10]:
def clean_text_with_js_css_removal(text):
    if isinstance(text, str):
        soup = BeautifulSoup(text, "html.parser")
        
        for script_or_style in soup(["script", "style"]):
            script_or_style.decompose()
        
        text = soup.get_text()
        text = re.sub(r"[^\w\s]", " ", text)
        text = re.sub(r"\s+", " ", text).strip()
        return text.lower()
    return text

columns_to_clean = ["Content", "Meta_Title", "Meta_Description"]

for column in columns_to_clean:
    raw_dataframe[column] = raw_dataframe[column].apply(clean_text_with_js_css_removal)
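
To see what the cleaning function does to a single value, the sketch below applies the same steps to a hypothetical HTML snippet (the `sample` string and the function name `clean_html_text` are illustrative assumptions). It shows that the HTML entity is decoded first and the resulting apostrophe is then replaced by a space.

```python
import re
from bs4 import BeautifulSoup

def clean_html_text(text):
    """Strip scripts, styles, tags, and punctuation; lowercase the result."""
    soup = BeautifulSoup(text, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()                        # remove the element and its contents
    text = soup.get_text()                     # drops tags, decodes HTML entities
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation -> space
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text.lower()

# Hypothetical sample, not taken from the scraped data
sample = (
    "<p>Baldur&#x2019;s Gate 3: <b>Multiclass</b> Guide!</p>"
    "<script>var x = 1;</script>"
)
cleaned = clean_html_text(sample)
print(cleaned)  # baldur s gate 3 multiclass guide
```

Note the stray `s` left behind by the possessive: replacing punctuation with a space splits `baldur's` into two tokens.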

## Standardizing Text
Standardizing text ensures uniformity in formatting, which is essential for further processing like tokenization and analysis.

In [11]:
def standardize_text(text):
    if isinstance(text, str):
        return text.lower().strip()
    return text

columns_to_standardize = ["Content", "Meta_Title", "Meta_Description"]

for column in columns_to_standardize:
    raw_dataframe[column] = raw_dataframe[column].apply(standardize_text)

## Removing General Stopwords
In this step, we remove standard English stopwords to simplify the text and reduce noise. No custom or domain-specific stopword removal is applied at this stage; that is handled during exploratory data analysis (EDA).

In [12]:
stop_words = set(stopwords.words("english"))

def remove_general_stopwords(text):
    if isinstance(text, str):
        tokens = word_tokenize(text)
        filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
        return " ".join(filtered_tokens)
    return text

raw_dataframe["Content"] = raw_dataframe["Content"].apply(remove_general_stopwords)
raw_dataframe["Meta_Title"] = raw_dataframe["Meta_Title"].apply(remove_general_stopwords)
raw_dataframe["Meta_Description"] = raw_dataframe["Meta_Description"].apply(remove_general_stopwords)
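
The filtering logic itself is simple and can be illustrated without the NLTK data. The sketch below is an assumption-laden stand-in: it uses a tiny hand-written stopword set and `str.split` in place of `stopwords.words("english")` and `word_tokenize`, purely to show the mechanism.

```python
# Hypothetical mini stopword set; the notebook uses NLTK's full English list
TOY_STOP_WORDS = {"how", "to", "in", "the", "a", "s"}

def remove_stopwords(text, stop_words=TOY_STOP_WORDS):
    """Keep only the whitespace-separated tokens not in stop_words."""
    tokens = text.split()  # simplification: word_tokenize is used in the notebook
    return " ".join(t for t in tokens if t.lower() not in stop_words)

result = remove_stopwords("how to multiclass in baldur s gate 3")
print(result)  # multiclass baldur gate 3
```

The leftover `s` from a split possessive disappears here only because the toy set lists it; with a different stopword list it would remain as a token.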

## Tokenization, Lemmatization, and N-Gram Generation
The function below tokenizes each text, lemmatizes the tokens to standardize word forms, and then builds bigrams and trigrams from the lemmatized tokens. The lemmatized tokens themselves serve as the unigrams.

In [13]:
lemmatizer = WordNetLemmatizer()
def process_text_column(text):
    if isinstance(text, str):
        
        # Tokenization
        tokens = word_tokenize(text)
        
        # Lemmatize each token; the lemmatized tokens are the unigrams
        lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
        
        # Generate bigrams and trigrams (using lemmatized tokens)
        bigrams = [" ".join(gram) for gram in ngrams(lemmatized_tokens, 2)]
        trigrams = [" ".join(gram) for gram in ngrams(lemmatized_tokens, 3)]
        
        return lemmatized_tokens, bigrams, trigrams
    return [], [], []

for column in ["Content", "Meta_Title", "Meta_Description"]:
    raw_dataframe[f"Unigrams_{column}"], \
    raw_dataframe[f"Bigrams_{column}"], \
    raw_dataframe[f"Trigrams_{column}"] = zip(*raw_dataframe[column].apply(process_text_column))
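
For intuition about what the n-gram step produces, here is a pure-Python sketch of the sliding window. The helper `make_ngrams` is a hypothetical name for illustration and is not the NLTK implementation; it joins each window of `n` consecutive tokens with a space, as the cell above does.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    # zip over n shifted copies of the list yields each window of length n
    windows = zip(*(tokens[i:] for i in range(n)))
    return [" ".join(window) for window in windows]

tokens = ["multiclass", "baldur", "gate", "3"]
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)   # ['multiclass baldur', 'baldur gate', 'gate 3']
print(trigrams)  # ['multiclass baldur gate', 'baldur gate 3']
```

A text with fewer than `n` tokens yields an empty list, so short titles can have unigrams but no trigrams.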

In [14]:
processed_columns = [
    "Unigrams_Content", "Bigrams_Content", "Trigrams_Content",
    "Unigrams_Meta_Title", "Bigrams_Meta_Title", "Trigrams_Meta_Title",
    "Unigrams_Meta_Description", "Bigrams_Meta_Description", "Trigrams_Meta_Description"
]
raw_dataframe[processed_columns].head()

Unnamed: 0,Unigrams_Content,Bigrams_Content,Trigrams_Content,Unigrams_Meta_Title,Bigrams_Meta_Title,Trigrams_Meta_Title,Unigrams_Meta_Description,Bigrams_Meta_Description,Trigrams_Meta_Description
0,"[join, guide, checklist, map, task, search, mu...","[join guide, guide checklist, checklist map, m...","[join guide checklist, guide checklist map, ch...","[multiclass, baldur, gate, iii, guide, ign]","[multiclass baldur, baldur gate, gate iii, iii...","[multiclass baldur gate, baldur gate iii, gate...","[multiclassing, baldur, gate, 3, let, train, c...","[multiclassing baldur, baldur gate, gate 3, 3 ...","[multiclassing baldur gate, baldur gate 3, gat..."
1,"[menu, sign, trending, thegamer, goty, 2024, i...","[menu sign, sign trending, trending thegamer, ...","[menu sign trending, sign trending thegamer, t...","[multiclass, baldur, gate, 3]","[multiclass baldur, baldur gate, gate 3]","[multiclass baldur gate, baldur gate 3]","[multiclassing, allows, use, interesting, vers...","[multiclassing allows, allows use, use interes...","[multiclassing allows use, allows use interest..."
2,"[homepage, guide, game, entertainment, guide, ...","[homepage guide, guide game, game entertainmen...","[homepage guide game, guide game entertainment...","[multiclass, best, multiclass, build, bg3, pol...","[multiclass best, best multiclass, multiclass ...","[multiclass best multiclass, best multiclass b...","[multiclassing, one, best, way, increase, char...","[multiclassing one, one best, best way, way in...","[multiclassing one best, one best way, best wa..."
3,"[107, black, myth, wukong, boss, infinity, nik...","[107 black, black myth, myth wukong, wukong bo...","[107 black myth, black myth wukong, myth wukon...","[multiclass, baldur, gate, 3, rock, paper, sho...","[multiclass baldur, baldur gate, gate 3, 3 roc...","[multiclass baldur gate, baldur gate 3, gate 3...","[baldur, gate, 3, let, multiclass, allowing, m...","[baldur gate, gate 3, 3 let, let multiclass, m...","[baldur gate 3, gate 3 let, 3 let multiclass, ..."
4,"[modding, installing, mod, jump, navigation, j...","[modding installing, installing mod, mod jump,...","[modding installing mod, installing mod jump, ...","[modding, resource, bg3, wiki]","[modding resource, resource bg3, bg3 wiki]","[modding resource bg3, resource bg3 wiki]","[page, hub, everything, related, modding, bald...","[page hub, hub everything, everything related,...","[page hub everything, hub everything related, ..."


## Verifying Data Types and Word Lengths in Tokenized and N-Gram Columns
Ensure that the new tokenized and n-gram columns are of type `list`. Additionally, verify that all values within these lists are of type `str` and adhere to the correct word lengths: unigrams should contain single-word values, bigrams should contain two-word values, and trigrams should contain three-word values.

### Checking Each Column

In [15]:
type_summary = {}

for column in raw_dataframe.columns:
    # Get the type of each value in the column
    types = raw_dataframe[column].map(type).value_counts()
    # Convert the types to a readable summary
    type_summary[column] = {str(t): count for t, count in types.items()}

# Display the summary
for column, type_counts in type_summary.items():
    type_info = ", ".join([f"{count} {t.split('.')[-1]}" for t, count in type_counts.items()])
    print(f"{column}: {type_info}")

Google_Question: 36 <class 'str'>
URL: 36 <class 'str'>
Content: 36 <class 'str'>
Meta_Title: 36 <class 'str'>
Meta_Description: 36 <class 'str'>
Unigrams_Content: 36 <class 'list'>
Bigrams_Content: 36 <class 'list'>
Trigrams_Content: 36 <class 'list'>
Unigrams_Meta_Title: 36 <class 'list'>
Bigrams_Meta_Title: 36 <class 'list'>
Trigrams_Meta_Title: 36 <class 'list'>
Unigrams_Meta_Description: 36 <class 'list'>
Bigrams_Meta_Description: 36 <class 'list'>
Trigrams_Meta_Description: 36 <class 'list'>


### Checking Tokenized Columns

In [16]:
column_word_lengths = {
    'Unigrams_Content': 1,
    'Bigrams_Content': 2,
    'Trigrams_Content': 3,
    'Unigrams_Meta_Title': 1,
    'Bigrams_Meta_Title': 2,
    'Trigrams_Meta_Title': 3,
    'Unigrams_Meta_Description': 1,
    'Bigrams_Meta_Description': 2,
    'Trigrams_Meta_Description': 3
}

for column, word_length in column_word_lengths.items():
    print(f"Validating column: {column}")
    invalid_rows = []
    
    for index, value in raw_dataframe[column].items():
        if not isinstance(value, list):
            invalid_rows.append((index, "Not a list"))
            continue    
        for item in value:
            if not isinstance(item, str):
                invalid_rows.append((index, f"Non-string item: {item}"))
                break
            if len(item.split()) != word_length:
                invalid_rows.append((index, f"Invalid word length: {item}"))
                break
    if invalid_rows:
        print(f"  Found {len(invalid_rows)} invalid rows:")
        for row_index, issue in invalid_rows:
            print(f"    Row {row_index}: {issue}")
    else:
        print("  All rows are valid.")
    print("\n")

Validating column: Unigrams_Content
  All rows are valid.


Validating column: Bigrams_Content
  All rows are valid.


Validating column: Trigrams_Content
  All rows are valid.


Validating column: Unigrams_Meta_Title
  All rows are valid.


Validating column: Bigrams_Meta_Title
  All rows are valid.


Validating column: Trigrams_Meta_Title
  All rows are valid.


Validating column: Unigrams_Meta_Description
  All rows are valid.


Validating column: Bigrams_Meta_Description
  All rows are valid.


Validating column: Trigrams_Meta_Description
  All rows are valid.




## Save Initial Preprocessed Data

At this stage, the data has been cleaned and tokenized, making it ready for exploratory data analysis (EDA) and advanced preprocessing. The processed data is saved as a `.pkl` file for seamless loading in the next notebook, along with a `.csv` copy.

In [17]:
raw_dataframe.to_pickle("data/initial_preprocessed_data.pkl")
raw_dataframe.to_csv("data/initial_preprocessed_data.csv", index=False)
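
One practical difference between the two output files is that a pickle preserves Python lists, while a CSV stores each list as its string representation. The sketch below demonstrates this on a hypothetical one-row frame written to an in-memory buffer (the `toy` data is an assumption for illustration) and shows one way to recover the lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Hypothetical one-row frame with a list-valued column
toy = pd.DataFrame({"Unigrams_Meta_Title": [["multiclass", "baldur", "gate", "3"]]})

# Round-trip through CSV using an in-memory buffer instead of a file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip the cell is a string, not a list
as_read = reloaded.loc[0, "Unigrams_Meta_Title"]
print(type(as_read).__name__, as_read)

# literal_eval safely parses the string back into a Python list
restored = reloaded["Unigrams_Meta_Title"].apply(ast.literal_eval)
print(restored.iloc[0])
```

For this reason the `.pkl` file is the more convenient input for later notebooks that rely on the tokenized and n-gram columns.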