# Data Cleaning and Preprocessing
---

## <ins>Objective</ins>
- This step prepares the raw scraped data for analysis by resolving inconsistencies and enforcing a uniform format. The key processes are:
    - Removing HTML Tags: Extract meaningful textual content from the HTML structure of the guides.
    - Cleaning Content: Remove extra whitespace, newline characters, and irrelevant artifacts for cleaner data.
    - Tokenization and Preprocessing: Split text into individual tokens (words) and standardize by converting to lowercase, removing stopwords, and eliminating punctuation.
    - Checking for Null and Duplicate Values: Identify and address any missing or duplicate entries in the dataset.
- The resulting dataset is consistent and clean, forming the basis for the analysis in subsequent steps.
---

## <ins>Imports</ins>

In [1]:
import pandas as pd
import re
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

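# Uncomment the lines below and run them if the NLTK stopwords corpus and the
# punkt tokenizer models have not been downloaded yet
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')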
# nltk.download('stopwords')
# nltk.download('punkt')
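
Instead of commenting and uncommenting the download lines, the check can be automated. The sketch below is one possible approach, not part of the original notebook: `ensure_resources` and `NLTK_RESOURCES` are hypothetical names, and the lookup and download functions are passed in so the helper can be exercised without network access.

```python
def ensure_resources(resources, find, download):
    """Download every resource whose lookup path cannot be found.

    resources: mapping of lookup path -> download name
    find:      callable that raises LookupError when a path is missing
    download:  callable that fetches a resource by name
    Returns the list of names that had to be downloaded.
    """
    fetched = []
    for path, name in resources.items():
        try:
            find(path)
        except LookupError:
            download(name)
            fetched.append(name)
    return fetched


# Lookup paths in the form used by NLTK's data loader (assumed layout).
NLTK_RESOURCES = {
    "corpora/stopwords": "stopwords",
    "tokenizers/punkt": "punkt",
}
```

In the notebook this would be called as `ensure_resources(NLTK_RESOURCES, nltk.data.find, nltk.download)`, since `nltk.data.find` raises `LookupError` for a resource that is not installed.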

## <ins>Load Data</ins>

In [2]:
scraped_dataframe = pd.read_pickle('data/scraped_data.pkl')

---
## <ins>Cleaning</ins>
- Define a function that strips HTML tags and collapses extra whitespace in the `Content` column, storing the result in a new `Cleaned_Content` column.

In [3]:
def clean_html(raw_html):
    """
    Strip HTML tags and collapse runs of whitespace into single spaces.
    """
    # Remove HTML tags
    soup = BeautifulSoup(raw_html, "html.parser")
    text = soup.get_text()
    
    # Remove extra whitespace and newlines
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply the cleaning function to the 'Content' column
scraped_dataframe['Cleaned_Content'] = scraped_dataframe['Content'].apply(clean_html)

# Drop the original content column if no longer needed
scraped_dataframe.drop(columns=['Content'], inplace=True)
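
A possible refinement of the cleaning step is sketched below, assuming that script and style blocks should never reach the text and that words from adjacent tags should not be glued together. `clean_html_strict` is a hypothetical name, not part of the original notebook. Whether `get_text()` already skips script content depends on the installed BeautifulSoup version, so removing those tags explicitly keeps the behaviour predictable.

```python
import re

from bs4 import BeautifulSoup


def clean_html_strict(raw_html):
    """Variant of clean_html that drops non-content tags before extracting text."""
    soup = BeautifulSoup(raw_html, "html.parser")

    # Remove tags whose contents are code or styling rather than prose
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # Join text nodes with a space so "<p>a</p><p>b</p>" does not become "ab"
    text = soup.get_text(separator=" ")

    # Collapse whitespace exactly as clean_html does
    return re.sub(r"\s+", " ", text).strip()


sample = "<p>Find Karlach</p><script>track();</script><p>near the river</p>"
print(clean_html_strict(sample))
```

This does not remove site boilerplate that lives in ordinary tags; that would need site-specific selectors, which are outside the scope of this sketch.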

In [4]:
scraped_dataframe

Unnamed: 0,URL,Title,Meta Description,Headings,Cleaned_Content
0,https://www.ign.com/wikis/baldurs-gate-3/Where...,Where to Find and Recruit Karlach - Baldur's G...,"Baldur&apos;s Gate 3 is a massive game, filled...","[Baldur's Gate III Guide, Find in guide, Inter...","Baldur's Gate 3 is a massive game, filled with..."
1,https://www.ign.com/wikis/baldurs-gate-3/Compa...,Companions and Party Members - Baldur's Gate I...,Companions in Baldur&apos;s Gate 3 are unique ...,"[Baldur's Gate III Guide, Find in guide, Inter...",Companions in Baldur's Gate 3 are unique chara...
2,https://www.ign.com/wikis/baldurs-gate-3/All_S...,All Sex and Romance Options - Baldur's Gate II...,If you&apos;re in the mood and trying to creat...,"[Baldur's Gate III Guide, Find in guide, Inter...",If you're in the mood and trying to create a r...
3,https://www.polygon.com/24035856/karlach-bg3-r...,Baldur’s Gate 3: How to get Karlach as fast as...,One easy trick to get a head start on Karlach’...,[All my friends are strategically jumping off ...,One stupid trick to get a head start on Karlac...
4,https://www.polygon.com/baldurs-gate-3-guides/...,How to choose the best class for you in BG3 | ...,Baldur’s Gate 3 makes it hard to find the best...,[How to choose the best class for you in Baldu...,"BG3 features a dozen different, equally awesom..."
5,https://www.polygon.com/baldurs-gate-3-guide-w...,Explore the Overgrown Ruins walkthrough — Bald...,Our Baldur’s Gate 3 guide will help you comple...,[Baldur’s Gate 3 guide: Overgrown Ruins walkth...,How to complete the “Investigate the ruins” qu...
6,https://screenrant.com/baldurs-gate-3-where-to...,Where To Find (& Recruit) Karlach In Baldur’s ...,One of the best additions to any Baldur's Gate...,"[Screen Rant, Where To Find (& Recruit) Karlac...",Your changes have been saved Email is sent Ema...
7,https://screenrant.com/baldurs-gate-3-beginner...,"10 Baldur’s Gate 3 Questions For Beginners, An...",With Baldur's Gate 3 adaption of Dungeons & Dr...,"[Screen Rant, 10 Baldur’s Gate 3 Questions For...",Your changes have been saved Email is sent Ema...
8,https://screenrant.com/baldurs-gate-3-missed-q...,10 Best Baldur's Gate 3 Quests You Probably Mi...,Baldur’s Gate 3 is absolutely packed with ques...,"[Screen Rant, 10 Best Baldur's Gate 3 Quests Y...",Your changes have been saved Email is sent Ema...
9,https://www.gamesradar.com/baldurs-gate-3-karl...,How to find Karlach in Baldur's Gate 3 | Games...,Karlach is a potential party member and compan...,"[How to find Karlach in Baldur's Gate 3, Karla...",How to find Karlach Demonsbane in BG3 and get ...
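
The `Meta Description` column in the output above still contains HTML character references such as `&apos;`, because only `Content` was cleaned. If that column is used later, the references could be decoded with the standard library, as in this minimal sketch (`decode_entities` is a hypothetical helper name):

```python
import html


def decode_entities(text):
    """Convert HTML character references such as &apos; or &amp; to characters."""
    return html.unescape(text)


sample = "Baldur&apos;s Gate 3 is a massive game"
print(decode_entities(sample))  # Baldur's Gate 3 is a massive game
```

Applied to the dataframe, this would look like `scraped_dataframe['Meta Description'].apply(decode_entities)`; whether it is needed depends on how that column is used downstream.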


---
## <ins>Checking for Null and Duplicate Values</ins>
- Although the current dataset appears to contain no `NaN` values or duplicate entries, the checks below are implemented anyway so that the pipeline keeps working and data quality is maintained if more guides are added later.

### Steps:
1. **Identify and Remove Duplicates**:
   - Check for duplicate entries in the dataset using the `URL` and `Cleaned_Content` columns.
   - Drop duplicate rows to ensure each guide entry is unique.

2. **Identify Missing Values**:
   - Check for `NaN` values in the dataset.
   - Remove rows where the `Cleaned_Content` column is empty or missing.

In [5]:
# Remove duplicates
initial_row_count = len(scraped_dataframe)
scraped_dataframe = scraped_dataframe.drop_duplicates(subset=['URL', 'Cleaned_Content'])
print(f"Removed {initial_row_count - len(scraped_dataframe)} duplicate rows.")

# Identify missing values
missing_values = scraped_dataframe.isnull().sum()
print("Missing values before handling:")
print(missing_values)

# Remove rows where 'Cleaned_Content' is missing or empty
scraped_dataframe = scraped_dataframe[
    scraped_dataframe['Cleaned_Content'].notna() & (scraped_dataframe['Cleaned_Content'] != "")
].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning
print(f"Rows after handling missing values: {len(scraped_dataframe)}")

Removed 0 duplicate rows.
Missing values before handling:
URL                 0
Title               0
Meta Description    0
Headings            0
Cleaned_Content     0
dtype: int64
Rows after handling missing values: 12
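
Because the real dataset has no duplicates or missing values, the checks above never fire. The toy example below, with made-up data rather than the scraped guides, shows what they do when problems are present. Note that `drop_duplicates` with a two-column `subset` removes a row only when both `URL` and `Cleaned_Content` match an earlier row.

```python
import pandas as pd

# Made-up rows: one exact duplicate, one empty string, one missing value
toy = pd.DataFrame({
    "URL": ["a", "a", "b", "c"],
    "Cleaned_Content": ["text one", "text one", "", None],
})

# Step 1: drop rows that repeat both the URL and the content
deduped = toy.drop_duplicates(subset=["URL", "Cleaned_Content"])

# Step 2: keep rows whose content is present and not blank
content = deduped["Cleaned_Content"]
kept = deduped[content.notna() & (content.str.strip() != "")].copy()

print(f"Removed {len(toy) - len(deduped)} duplicate rows.")
print(f"Rows after handling missing values: {len(kept)}")
```

The `.str.strip()` call is an addition to the notebook's check: it also treats whitespace-only content as empty, which matters only if the column was not already stripped.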


---
## <ins>Tokenization and Basic Preprocessing</ins>
- Tokenize `Cleaned_Content` into individual words for further processing.
- Remove common stopwords and punctuation, and convert all text to lowercase to standardize the content.
- Store the preprocessed text in a new column called `Processed_Content`.

### Steps:
1. **Tokenization**:
   - Split text into individual tokens (words).
2. **Stopwords and Punctuation Removal**:
   - Remove common stopwords (e.g., "the," "and") and punctuation.
3. **Standardization**:
   - Convert text to lowercase for consistency.

In [6]:
# Define stopwords and punctuation
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

# Tokenization and Preprocessing Function
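def preprocess_text(text):
    """Tokenize, lowercase, and drop stopwords and single-character ASCII punctuation."""
    tokens = word_tokenize(text)
    # string.punctuation covers ASCII only, so typographic quotes (’ “ ”) pass through
    tokens = [word.lower() for word in tokens if word.lower() not in stop_words and word not in punctuation]
    return " ".join(tokens)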

# Apply preprocessing
scraped_dataframe['Processed_Content'] = scraped_dataframe['Cleaned_Content'].apply(preprocess_text)

In [7]:
scraped_dataframe

Unnamed: 0,URL,Title,Meta Description,Headings,Cleaned_Content,Processed_Content
0,https://www.ign.com/wikis/baldurs-gate-3/Where...,Where to Find and Recruit Karlach - Baldur's G...,"Baldur&apos;s Gate 3 is a massive game, filled...","[Baldur's Gate III Guide, Find in guide, Inter...","Baldur's Gate 3 is a massive game, filled with...",baldur 's gate 3 massive game filled branching...
1,https://www.ign.com/wikis/baldurs-gate-3/Compa...,Companions and Party Members - Baldur's Gate I...,Companions in Baldur&apos;s Gate 3 are unique ...,"[Baldur's Gate III Guide, Find in guide, Inter...",Companions in Baldur's Gate 3 are unique chara...,companions baldur 's gate 3 unique characters ...
2,https://www.ign.com/wikis/baldurs-gate-3/All_S...,All Sex and Romance Options - Baldur's Gate II...,If you&apos;re in the mood and trying to creat...,"[Baldur's Gate III Guide, Find in guide, Inter...",If you're in the mood and trying to create a r...,'re mood trying create relationship companions...
3,https://www.polygon.com/24035856/karlach-bg3-r...,Baldur’s Gate 3: How to get Karlach as fast as...,One easy trick to get a head start on Karlach’...,[All my friends are strategically jumping off ...,One stupid trick to get a head start on Karlac...,one stupid trick get head start karlach ’ stor...
4,https://www.polygon.com/baldurs-gate-3-guides/...,How to choose the best class for you in BG3 | ...,Baldur’s Gate 3 makes it hard to find the best...,[How to choose the best class for you in Baldu...,"BG3 features a dozen different, equally awesom...",bg3 features dozen different equally awesome c...
5,https://www.polygon.com/baldurs-gate-3-guide-w...,Explore the Overgrown Ruins walkthrough — Bald...,Our Baldur’s Gate 3 guide will help you comple...,[Baldur’s Gate 3 guide: Overgrown Ruins walkth...,How to complete the “Investigate the ruins” qu...,complete “ investigate ruins ” quest jeffrey p...
6,https://screenrant.com/baldurs-gate-3-where-to...,Where To Find (& Recruit) Karlach In Baldur’s ...,One of the best additions to any Baldur's Gate...,"[Screen Rant, Where To Find (& Recruit) Karlac...",Your changes have been saved Email is sent Ema...,changes saved email sent email already sent ’ ...
7,https://screenrant.com/baldurs-gate-3-beginner...,"10 Baldur’s Gate 3 Questions For Beginners, An...",With Baldur's Gate 3 adaption of Dungeons & Dr...,"[Screen Rant, 10 Baldur’s Gate 3 Questions For...",Your changes have been saved Email is sent Ema...,changes saved email sent email already sent ’ ...
8,https://screenrant.com/baldurs-gate-3-missed-q...,10 Best Baldur's Gate 3 Quests You Probably Mi...,Baldur’s Gate 3 is absolutely packed with ques...,"[Screen Rant, 10 Best Baldur's Gate 3 Quests Y...",Your changes have been saved Email is sent Ema...,changes saved email sent email already sent ’ ...
9,https://www.gamesradar.com/baldurs-gate-3-karl...,How to find Karlach in Baldur's Gate 3 | Games...,Karlach is a potential party member and compan...,"[How to find Karlach in Baldur's Gate 3, Karla...",How to find Karlach Demonsbane in BG3 and get ...,find karlach demonsbane bg3 get companion purc...
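
The `Processed_Content` values above still contain tokens such as `’`, `“`, and `'s`, because the filter only matches single ASCII punctuation characters. One way to tighten it is sketched below. `filter_tokens`, `EXTRA_PUNCT`, and the small stand-in stopword set are assumptions made for illustration; the function takes an already tokenized list, so it could be applied to the output of `word_tokenize`.

```python
import string

# Typographic quotes, dashes, and ellipsis (an assumed list; extend as needed)
EXTRA_PUNCT = "\u2019\u2018\u201c\u201d\u2014\u2013\u2026"
PUNCT_CHARS = string.punctuation + EXTRA_PUNCT


def filter_tokens(tokens, stop_words):
    """Lowercase tokens, strip punctuation from their edges, drop stopwords and empties."""
    kept = []
    for token in tokens:
        word = token.lower().strip(PUNCT_CHARS)
        if word and word not in stop_words:
            kept.append(word)
    return kept


# Tiny stand-in stopword set; the notebook uses stopwords.words('english')
demo_stop_words = {"to", "a", "on", "s"}
demo_tokens = ["One", "stupid", "trick", "to", "get", "a", "head", "start",
               "on", "Karlach", "\u2019", "s", "story", "Baldur", "'s", "Gate"]
print(filter_tokens(demo_tokens, demo_stop_words))
```

Stripping only the edges keeps internal punctuation, so hyphenated terms survive intact. Whether possessive markers should be dropped at all is a modelling choice that depends on the later analysis.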


---
## <ins>Saving Processed Data</ins>
- Save the preprocessed data as both a `.pkl` file and a `.csv` file in the `data/` directory.

In [8]:
scraped_dataframe.to_pickle('data/preprocessed_data.pkl')
scraped_dataframe.to_csv('data/preprocessed_data.csv', index=False)
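
One difference between the two formats is worth keeping in mind: the pickle preserves Python objects, while the CSV stores everything as text. The `Headings` column appears to hold Python lists, so after a CSV round trip it would come back as strings. The sketch below demonstrates this on made-up data in a temporary directory and restores the lists with `ast.literal_eval`; it is an illustration, not part of the original notebook.

```python
import ast
import tempfile
from pathlib import Path

import pandas as pd

# Made-up row with a list-valued column, mirroring the assumed shape of 'Headings'
toy = pd.DataFrame({"URL": ["a"], "Headings": [["Guide", "Find in guide"]]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    csv_path = out_dir / "preprocessed_data.csv"
    toy.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# After the round trip the list is plain text
headings_as_text = reloaded.loc[0, "Headings"]
print(type(headings_as_text).__name__, headings_as_text)

# Parse the text back into a Python list
reloaded["Headings"] = reloaded["Headings"].apply(ast.literal_eval)
print(reloaded.loc[0, "Headings"])
```

Loading `preprocessed_data.pkl` with `pd.read_pickle` avoids this conversion step entirely.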