# Webscraping Data
---

## <ins>Objective</ins>
- This step collects targeted Baldur's Gate 3 guide content from IGN, Polygon, Screen Rant, and GamesRadar+. The scraped data will be used to analyze each site's SEO strategy. Key steps include:
    - Defining Target URLs: Identify and list specific guides to scrape for consistent comparison across sites.
    - Scraping Guide Content: Use the Python libraries `requests` and `BeautifulSoup` to fetch each URL and extract its HTML content.
    - Validating Target URLs: Confirm that every selected URL is accessible and that scraping it complies with the site's robots.txt policy.
    - Storing Data: Save the scraped data in a structured format (Pandas DataFrame) for further cleaning and preprocessing.
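
The robots.txt check mentioned in the steps above could be sketched with the standard library's `urllib.robotparser`. This is a minimal illustration, not the notebook's own validation code: the helper name `is_allowed` and the sample rules are hypothetical, and the rules are not taken from any of the four sites.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "https://example.com/guides/karlach"))
print(is_allowed(sample_rules, "https://example.com/private/drafts"))
```

Against a live site, the same parser can load the rules itself by calling `set_url()` with the site's robots.txt address and then `read()`, before `can_fetch()` is queried.
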
---

## <ins>Imports</ins>

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

---
## <ins>URLs to Scrape</ins>
- List of all the URLs to scrape data from.

In [2]:
urls = [
    # IGN
    "https://www.ign.com/wikis/baldurs-gate-3/Where_to_Find_and_Recruit_Karlach",
    "https://www.ign.com/wikis/baldurs-gate-3/Companions_and_Party_Members",
    "https://www.ign.com/wikis/baldurs-gate-3/All_Sex_and_Romance_Options",
    
    # Polygon
    "https://www.polygon.com/24035856/karlach-bg3-romance-guide-baldurs-gate-3",
    "https://www.polygon.com/baldurs-gate-3-guides/23817654/best-class-choose-classes",
    "https://www.polygon.com/baldurs-gate-3-guide-walkthrough/21514686/overgrown-ruins-walkthrough-explore-investigate-bedchamber-dank-crypt-hooded-figure-sarcophagus",
    
    # Screen Rant
    "https://screenrant.com/baldurs-gate-3-where-to-find-recruit-karlach/",
    "https://screenrant.com/baldurs-gate-3-beginners-questions-answered/",
    "https://screenrant.com/baldurs-gate-3-missed-quests-bg3/",
    
    # GamesRadar+
    "https://www.gamesradar.com/baldurs-gate-3-karlach/",
    "https://www.gamesradar.com/baldurs-gate-3-weapons/",
    "https://www.gamesradar.com/baldurs-gate-3-tips-and-tricks/"
]
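
Because the analysis compares sites against each other, it can help to confirm that the list covers each site evenly. The sketch below groups URLs by hostname using only the standard library. The function name `group_by_site` is an assumption for illustration, and the sample list reuses a few URLs from the list above.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_site(url_list):
    """Map each hostname (without a leading 'www.') to the URLs hosted there."""
    groups = defaultdict(list)
    for url in url_list:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        groups[host].append(url)
    return dict(groups)


sample_urls = [
    "https://www.ign.com/wikis/baldurs-gate-3/Where_to_Find_and_Recruit_Karlach",
    "https://screenrant.com/baldurs-gate-3-missed-quests-bg3/",
    "https://www.gamesradar.com/baldurs-gate-3-karlach/",
    "https://www.gamesradar.com/baldurs-gate-3-weapons/",
]

for site, site_urls in group_by_site(sample_urls).items():
    print(site, len(site_urls))
```
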

---
## <ins>Webscraping</ins>
- Send a browser-like `User-Agent` header so that the sites are less likely to block the requests as bot traffic.

In [3]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
}

- Fetching HTML content and parsing it.

In [4]:
scraped_data = []

for url in urls:
    try:
        # A timeout keeps one unresponsive server from stalling the whole loop
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract title
        title_tag = soup.find('title')
        title = title_tag.text if title_tag else 'N/A'

        # Extract meta description (the tag may be missing or lack a content attribute)
        meta_tag = soup.find('meta', {'name': 'description'})
        meta_description = meta_tag.get('content', 'N/A') if meta_tag else 'N/A'

        # Extract headings (h1, h2, h3)
        headings = [h.text.strip() for h in soup.find_all(['h1', 'h2', 'h3'])]

        # Extract main content paragraphs
        content = " ".join(p.text.strip() for p in soup.find_all('p'))

        scraped_data.append({
            "URL": url,
            "Title": title,
            "Meta Description": meta_description,
            "Headings": headings,
            "Content": content
        })

    except Exception as e:
        # Report the failure and continue with the remaining URLs
        print(f"Error scraping {url}: {e}")
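
One way to check the extraction logic without touching the network is to move it into a function and run it on a small HTML string. The sketch below mirrors the fields collected in the loop above. The function name `extract_fields` and the sample HTML are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup


def extract_fields(html: str) -> dict:
    """Pull the title, meta description, h1-h3 headings and paragraph text from HTML."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")
    meta_tag = soup.find("meta", attrs={"name": "description"})
    return {
        "Title": title_tag.text.strip() if title_tag else "N/A",
        "Meta Description": meta_tag.get("content", "N/A") if meta_tag else "N/A",
        "Headings": [h.text.strip() for h in soup.find_all(["h1", "h2", "h3"])],
        "Content": " ".join(p.text.strip() for p in soup.find_all("p")),
    }


# Hypothetical page, not taken from any of the scraped sites.
sample_html = """
<html><head><title>Sample Guide</title>
<meta name="description" content="A short summary."></head>
<body><h1>Main Heading</h1><p>First paragraph.</p>
<h2>Sub Heading</h2><p>Second paragraph.</p></body></html>
"""

fields = extract_fields(sample_html)
print(fields)
```

A page with no `<title>` or description tag falls back to `'N/A'` for those fields instead of raising an error, which keeps one malformed page from dropping out of the results.
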

---
## <ins>Checking Scraped Data</ins>

In [5]:
scraped_dataframe = pd.DataFrame(scraped_data)
scraped_dataframe

| | URL | Title | Meta Description | Headings | Content |
|---|---|---|---|---|---|
| 0 | https://www.ign.com/wikis/baldurs-gate-3/Where... | Where to Find and Recruit Karlach - Baldur's G... | Baldur&apos;s Gate 3 is a massive game, filled... | [Baldur's Gate III Guide, Find in guide, Inter... | Baldur's Gate 3 is a massive game, filled with... |
| 1 | https://www.ign.com/wikis/baldurs-gate-3/Compa... | Companions and Party Members - Baldur's Gate I... | Companions in Baldur&apos;s Gate 3 are unique ... | [Baldur's Gate III Guide, Find in guide, Inter... | Companions in Baldur's Gate 3 are unique chara... |
| 2 | https://www.ign.com/wikis/baldurs-gate-3/All_S... | All Sex and Romance Options - Baldur's Gate II... | If you&apos;re in the mood and trying to creat... | [Baldur's Gate III Guide, Find in guide, Inter... | If you're in the mood and trying to create a r... |
| 3 | https://www.polygon.com/24035856/karlach-bg3-r... | Baldur’s Gate 3: How to get Karlach as fast as... | One easy trick to get a head start on Karlach’... | [All my friends are strategically jumping off ... | One stupid trick to get a head start on Karlac... |
| 4 | https://www.polygon.com/baldurs-gate-3-guides/... | How to choose the best class for you in BG3 \| ... | Baldur’s Gate 3 makes it hard to find the best... | [How to choose the best class for you in Baldu... | BG3 features a dozen different, equally awesom... |
| 5 | https://www.polygon.com/baldurs-gate-3-guide-w... | Explore the Overgrown Ruins walkthrough — Bald... | Our Baldur’s Gate 3 guide will help you comple... | [Baldur’s Gate 3 guide: Overgrown Ruins walkth... | How to complete the “Investigate the ruins” qu... |
| 6 | https://screenrant.com/baldurs-gate-3-where-to... | Where To Find (& Recruit) Karlach In Baldur’s ... | One of the best additions to any Baldur's Gate... | [Screen Rant, Where To Find (& Recruit) Karlac... | Your changes have been saved Email is sent Em... |
| 7 | https://screenrant.com/baldurs-gate-3-beginner... | 10 Baldur’s Gate 3 Questions For Beginners, An... | With Baldur's Gate 3 adaption of Dungeons & Dr... | [Screen Rant, 10 Baldur’s Gate 3 Questions For... | Your changes have been saved Email is sent Em... |
| 8 | https://screenrant.com/baldurs-gate-3-missed-q... | 10 Best Baldur's Gate 3 Quests You Probably Mi... | Baldur’s Gate 3 is absolutely packed with ques... | [Screen Rant, 10 Best Baldur's Gate 3 Quests Y... | Your changes have been saved Email is sent Em... |
| 9 | https://www.gamesradar.com/baldurs-gate-3-karl... | How to find Karlach in Baldur's Gate 3 \| Games... | Karlach is a potential party member and compan... | [How to find Karlach in Baldur's Gate 3, Karla... | How to find Karlach Demonsbane in BG3 and get ... |
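
The displayed output lists rows 0 through 9, while `urls` contains twelve entries, and the three Screen Rant rows all begin with the same "Your changes have been saved Email is sent" text, which looks like interface text rather than article content. A quick audit along the following lines could flag both kinds of issue before cleaning. This is only a sketch: the function name `audit_scrape`, the 40-character prefix length, and the sample records are assumptions, not part of the notebook.

```python
from collections import Counter


def audit_scrape(expected_urls, records, prefix_len=40):
    """Return (missing, incomplete, repeated_prefix) URL lists for scraped records.

    missing: expected URLs with no record at all (for example, failed requests)
    incomplete: records with an 'N/A' title or description, or empty content
    repeated_prefix: records whose content starts like another record's content
    """
    scraped = {r["URL"] for r in records}
    missing = [u for u in expected_urls if u not in scraped]
    incomplete = [
        r["URL"] for r in records
        if r["Title"] == "N/A"
        or r["Meta Description"] == "N/A"
        or not r["Content"].strip()
    ]
    prefix_counts = Counter(r["Content"][:prefix_len] for r in records if r["Content"])
    repeated_prefix = [
        r["URL"] for r in records
        if r["Content"] and prefix_counts[r["Content"][:prefix_len]] > 1
    ]
    return missing, incomplete, repeated_prefix


# Hypothetical records shaped like the entries of scraped_data.
expected = [f"https://example.com/{name}" for name in "abcd"]
records = [
    {"URL": expected[0], "Title": "A", "Meta Description": "desc",
     "Content": "Your changes have been saved Email is sent Article A text"},
    {"URL": expected[1], "Title": "B", "Meta Description": "N/A",
     "Content": "Your changes have been saved Email is sent Article B text"},
    {"URL": expected[2], "Title": "C", "Meta Description": "desc",
     "Content": "Unique text for C"},
]

missing, incomplete, repeated_prefix = audit_scrape(expected, records)
print(missing, incomplete, repeated_prefix)
```

In the notebook, the same check would be called as `audit_scrape(urls, scraped_data)`.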


---
## <ins>Saving Data</ins>
- Save the data as a `.pkl` file and as a `.csv` file in the `data/` directory, relative to the notebook's working directory.

In [6]:
scraped_dataframe.to_pickle('data/scraped_data.pkl')
scraped_dataframe.to_csv('data/scraped_data.csv', index=False)
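
Two details are worth handling when saving: pandas does not create a missing output folder, and a list column such as `Headings` is written to CSV as its string representation, so it reads back as plain text. The sketch below shows one possible approach, assuming a JSON-encoded `Headings` column is acceptable for the later steps. The one-row frame and the temporary directory are stand-ins used only for illustration.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical one-row frame standing in for scraped_dataframe.
df = pd.DataFrame([{"URL": "https://example.com/guide", "Headings": ["Intro", "Steps"]}])

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder before writing
    csv_path = out_dir / "scraped_data.csv"

    # Serialise the list column as JSON so it can be parsed back reliably.
    df.assign(Headings=df["Headings"].apply(json.dumps)).to_csv(csv_path, index=False)

    # Reading the file back restores real Python lists.
    restored = pd.read_csv(csv_path)
    restored["Headings"] = restored["Headings"].apply(json.loads)

print(restored.loc[0, "Headings"])
```

The pickle file already preserves the lists as Python objects, so this only matters when the CSV copy is the one loaded later.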