# Scrape Your Own DataSet - Virtual Protocol

Before proceeding with this step, ensure that you have created an account with Virtual Protocol. If you haven't already registered, please do so by visiting the following link:

[Create Your Account on Virtual Protocol](https://app.virtuals.io/)

Additionally, for real-time assistance and to join our community of data enthusiasts, please join our Discord server:

[Join Our Discord Community](https://discord.gg/X3PtZpjC)

Once you've set up your account and joined Discord, you can continue with the data scraping tutorial in this notebook.


# VIRTUAL Protocol

"An AI x Metaverse Protocol that is building the future of virtual interactions."

**VIRTUAL Protocol aligns incentives for the decentralized creation and monetization of AI agents for every virtual interaction (gaming, metaverses, online interactions, or beyond).**

![Virtual Protocol Image](https://whitepaper.virtuals.io/~gitbook/image?url=https%3A%2F%2F3396718532-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FazqzRwq4QZNL0p1M51oQ%252Fuploads%252FlunssyB9aApcmlzfmgOM%252FGroup%2520127.png%3Falt%3Dmedia%26token%3D898525bf-9b85-4c76-acc8-9b59688c7d5c&width=768&dpr=4&quality=100&sign=c977a2a48bf06bcca5620502fcc1e916ae68d9eea37b68f53509ce9ee5595a6c)

[Visit the Whitepaper for more details](https://whitepaper.virtuals.io/the-virtuals-ecosystem-explained)


# Useful Links

**Website:** [https://tao.virtuals.io/](https://tao.virtuals.io/)

**Github:** [https://github.com/Virtual-Protocol/tao-vpsubnet/](https://github.com/Virtual-Protocol/tao-vpsubnet/)

**Twitter/X:** [https://twitter.com/virtuals_io](https://twitter.com/virtuals_io)



## Import Libraries

We will use Beautiful Soup to parse the pages, Requests to download them, and Pandas to organize the results.

[**Beautiful Soup**](https://beautiful-soup-4.readthedocs.io/en/latest/) is a Python library for pulling data out of HTML and XML files.

It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

It commonly saves programmers hours or days of work.
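
To make "navigating and searching the parse tree" concrete, here is a minimal sketch that parses a small inline HTML string instead of a live page. The HTML snippet and the variable names are made up for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<html><head><title>Demo page</title></head>
<body>
  <h2 id="intro">Introduction</h2>
  <p class="lead">First paragraph.</p>
  <p>Second paragraph with a <a href="/wiki/Link">link</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Navigate: reach a tag by name
title = soup.title.string

# Search: collect every <p> tag and read its text
paragraphs = [p.get_text() for p in soup.find_all('p')]

# Search by attribute, then read an attribute value
lead = soup.find('p', class_='lead').get_text()
link_href = soup.find('a')['href']

print(title)
print(paragraphs)
print(lead, link_href)
```

The same `find` and `find_all` calls are what the scraping cells later in this notebook rely on.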


[**Pandas**](https://www.w3schools.com/python/pandas/pandas_intro.asp) is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.
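
As a rough illustration of the cleaning and manipulation functions mentioned above, the sketch below builds a tiny DataFrame from made-up rows, trims whitespace, and drops missing and duplicate entries. The column names and values are placeholders, not scraped data.

```python
import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    'topic': ['A', 'B', 'A'],
    'fact': ['  Example fact ', None, '  Example fact '],
})

# Clean: strip surrounding whitespace from the text column
df['fact'] = df['fact'].str.strip()

# Manipulate: drop rows without a fact, then drop exact duplicates
df = df.dropna(subset=['fact']).drop_duplicates().reset_index(drop=True)

print(df)
```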



In [30]:
!pip install beautifulsoup4



In [31]:
!pip install pandas



In [32]:
!pip install requests



## Get Request & Scrape using beautiful soup

This script uses Requests and Beautiful Soup to scrape the page title from https://onepunchman.fandom.com/wiki/Saitama. Pandas is imported here for use in the later cells.


In [58]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# URLs to be scraped
urls = [
    'https://onepunchman.fandom.com/wiki/Saitama',
    # Add more URLs here to scrape additional pages
]

In [59]:
for url in urls:
    response = requests.get(url)
    if response.status_code == 200:
        soup = bs(response.text, 'html.parser')
        # Example: Find and print the title of the page
        title = soup.find('title').text
        print(f"Title of {url}: {title}")
    else:
        print(f"Failed to retrieve data from {url}. Status code: {response.status_code}")

Title of https://onepunchman.fandom.com/wiki/Saitama: Saitama | One-Punch Man Wiki | Fandom


## Access the URL

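The `extract_text_from_url` function defined in the next code cell requests the given URL and, if the request succeeds, returns the extracted text as a dictionary that is later loaded into a Pandas DataFrame. If the server responds with a status code other than 200, the function prints a message and returns `None`.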
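
Checking the status code does not cover failures that never produce a response, such as timeouts or malformed URLs. One possible way to handle those as well is to catch `requests.RequestException`. The sketch below is not part of the original notebook: the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page HTML, or None if the request fails (hypothetical helper)."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx responses into exceptions so one handler covers everything
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Failed to retrieve data from {url}: {exc}")
        return None
    return response.text

# A malformed URL is rejected before any network traffic, so this returns None
print(fetch_html('not-a-url'))
```

A caller can then skip any URL for which the helper returns `None`, in the same way the notebook skips non-200 responses.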


## Collect the facts about the character on the websites

This code collects text from several sections of https://onepunchman.fandom.com/wiki/Saitama (Appearance, Personality, Abilities and Powers, Quotes, and Trivia) and organizes it into a Pandas DataFrame with one labeled column per section.
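
The section extraction works by locating a heading and walking its following siblings until the next heading of the same level. The sketch below shows that walk on a small inline HTML string that mimics the wiki layout (`<span id="...">` inside an `<h2>`). The HTML and the helper name `section_text` are made up for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML imitating a wiki page with two sections
html = """
<h2><span id="Appearance">Appearance</span></h2>
<p>Bald man.</p>
<ul><li>Yellow suit</li></ul>
<h2><span id="Personality">Personality</span></h2>
<p>Indifferent.</p>
"""

def section_text(soup, section_id):
    """Return the text between one <h2> section heading and the next."""
    anchor = soup.find('span', {'id': section_id})
    if anchor is None:
        return None  # no such section on this page
    heading = anchor.find_parent('h2')
    if heading is None:
        return None
    parts = []
    for sibling in heading.find_next_siblings():
        if sibling.name == 'h2':  # the next section starts here
            break
        if sibling.name in ('p', 'ul', 'ol'):
            parts.append(sibling.get_text(' ', strip=True))
    return ' '.join(parts)

soup = BeautifulSoup(html, 'html.parser')
print(section_text(soup, 'Appearance'))   # Bald man. Yellow suit
print(section_text(soup, 'Personality'))  # Indifferent.
print(section_text(soup, 'Quotes'))       # None
```

Returning `None` for a missing section, rather than a placeholder string, is one design option: it lets a later `isnull()` check flag the gap.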



In [60]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# URLs to be scraped
urls = [
    'https://onepunchman.fandom.com/wiki/Saitama'

]

def extract_text_from_url(url):
    # Send a GET request to the page
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Create BeautifulSoup object; parse with 'html.parser'
        soup = bs(response.text, 'html.parser')

        # Define a function to extract section text
        def extract_section_text(section_id):
            try:
                # Find the section header and its following elements until the next section header
                section = soup.find('span', {'id': section_id}).find_parent('h2')
                content = []
                for sibling in section.find_next_siblings():
                    if sibling.name == 'h2':  # Stop if another main section heading is reached
                        break
                    if sibling.name in ['p', 'ul', 'ol']:  # Capture paragraphs and list items
                        content.append(sibling.get_text(strip=True))
                return ' '.join(content)
            except AttributeError:
                return 'Section not found.'

        # Extracting information from different sections
        appearance_text = extract_section_text('Appearance')
        personality_text = extract_section_text('Personality')
        power_text = extract_section_text('Abilities_and_Powers')
        quotes_text = extract_section_text('Quotes')
        trivia_text = extract_section_text('Trivia')


        # Return extracted text
        return {
            'URL': url,
            'Appearance': appearance_text,
            'Personality': personality_text,
            'Abilities and Powers': power_text,
            'Quotes': quotes_text,
            'Trivia': trivia_text


        }
    else:
        print(f"Failed to retrieve data from {url}. Status code: {response.status_code}")
        return None

# Extract information from each URL, calling the scraper only once per URL
results = (extract_text_from_url(url) for url in urls)
data = [result for result in results if result is not None]

# Create DataFrame for structured information
df_info_01 = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df_info_01.to_csv('saitama_info.csv', index=False)

# Print the DataFrame
print(df_info_01)


                                           URL  \
0  https://onepunchman.fandom.com/wiki/Saitama   

                                          Appearance  \
0  Saitama is an ordinary-looking bald man with b...   

                                         Personality  \
0  Saitama's character on the upper surface can b...   

                                Abilities and Powers  \
0  Saitama is the titularOne-Punch Manand the str...   

                                              Quotes  \
0  (ToMarugori)"Having overwhelming power is... b...   

                                              Trivia  
0  Saitama's hero profile number is 03402.[189]Sa...  


In [61]:
df_info_01

| | URL | Appearance | Personality | Abilities and Powers | Quotes | Trivia |
|---|---|---|---|---|---|---|
| 0 | https://onepunchman.fandom.com/wiki/Saitama | Saitama is an ordinary-looking bald man with b... | Saitama's character on the upper surface can b... | Saitama is the titularOne-Punch Manand the str... | (ToMarugori)"Having overwhelming power is... b... | Saitama's hero profile number is 03402.[189]Sa... |
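
The output above contains fused words such as `titularOne-Punch Manand`. A likely cause is that `get_text(strip=True)` strips each text fragment and then joins the fragments with no separator, so text inside inline tags (links, bold) runs into its neighbors. The sketch below, using a made-up HTML fragment, shows the difference when a separator is passed as the first argument.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with an inline link in the middle of a sentence
html = '<p>Saitama is the titular <a href="#">One-Punch Man</a> and the strongest hero.</p>'
p = BeautifulSoup(html, 'html.parser').p

fused = p.get_text(strip=True)        # fragments joined with ''
spaced = p.get_text(' ', strip=True)  # fragments joined with ' '

print(fused)
print(spaced)
```

One trade-off: with a space separator, punctuation that directly follows an inline tag ends up preceded by a space, so a small cleanup pass may still be needed.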


# Data Cleaning

In [62]:
# Load the scraped data from the CSV file
df_combined = pd.read_csv('saitama_info.csv')

# Check for null values in the DataFrame
null_summary = df_combined.isnull().sum()

# Print the summary of null values
print("Summary of null values in each column:")
print(null_summary)

Summary of null values in each column:
URL                     0
Appearance              0
Personality             0
Abilities and Powers    0
Quotes                  0
Trivia                  0
dtype: int64
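
Note that the scraper returns the string `'Section not found.'` when a section is missing, and `isnull()` does not count that string as a null value. A minimal sketch of a check that covers both cases, using made-up rows rather than the scraped file:

```python
import pandas as pd

# Made-up rows: one real null and one placeholder string
df = pd.DataFrame({
    'URL': ['a', 'b'],
    'Appearance': ['Bald.', 'Section not found.'],
    'Trivia': [None, 'Fact.'],
})

# Treat real nulls and the scraper's placeholder text as missing
missing = df.isnull() | df.eq('Section not found.')

print("Missing values per column:")
print(missing.sum())
```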


In [67]:
# Save the cleaned DataFrame to a new CSV file
df_combined.to_csv('saitama_cleaned_info.csv', index=False)
df_combined.to_csv('saitama_cleaned_info.txt', sep='\t', index=False)

### Remove column header and URL.

In [68]:
import csv
import re

# Specify the input CSV file and output TXT file
input_csv = 'saitama_cleaned_info.csv'  # replace with your CSV file
output_txt = 'saitama_cleaned_info.txt'  # replace with your desired TXT file name

# Regular expression pattern to match URLs
url_pattern = re.compile(r'http[s]?://\S+')

# Open the CSV file for reading and the TXT file for writing
with open(input_csv, 'r', newline='', encoding='utf-8') as csv_file, open(output_txt, 'w', encoding='utf-8') as txt_file:
    # Create a CSV reader object
    csv_reader = csv.reader(csv_file)

    # Skip the header row
    next(csv_reader)

    # Iterate over each row in the CSV file
    for row in csv_reader:
        # Remove URLs from each column in the row
        cleaned_row = [url_pattern.sub('', col) for col in row]

        # Convert the cleaned row to a string, join columns with a space, and write to the TXT file
        txt_file.write(' '.join(cleaned_row).strip() + '\n')
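
To see what the URL pattern does on its own, here is a small sketch applied to a made-up sentence. Removing a URL leaves a double space behind, so the example also collapses whitespace; that extra step is an addition for illustration, not something the cell above performs.

```python
import re

url_pattern = re.compile(r'http[s]?://\S+')

# Made-up sample text for illustration
text = 'See https://example.com/page for details'

without_url = url_pattern.sub('', text)
# Collapse the runs of whitespace left behind by the removal
normalized = ' '.join(without_url.split())

print(repr(without_url))
print(repr(normalized))
```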


**No index numbers:** The script does not use an item counter or prefix the sentences with numbers.

**Clean and split sentences:** URLs and text in brackets or parentheses are removed, and each row is split into sentences.

**Write sentences:** Each non-empty sentence is written on its own line in the output file.

In [70]:
import csv
import re
import nltk

# Ensure you have the required NLTK data
# (newer NLTK releases may require nltk.download('punkt_tab') instead)
nltk.download('punkt')

# Specify the input CSV file and output TXT file
input_csv = 'saitama_cleaned_info.csv'  # replace with your CSV file
output_txt = 'saitama_activity_output.txt'  # replace with your desired TXT file name

# Regular expression patterns to match URLs, bracketed text, and double-quoted text
url_pattern = re.compile(r'http[s]?://\S+')
bracket_pattern = re.compile(r'\[.*?\]|\(.*?\)')
# Only double quotes are matched: treating ' as a quote would also strip apostrophes (e.g. in "Saitama's")
quotation_pattern = re.compile(r'"(.*?)"')

# Open the CSV file for reading and the TXT file for writing
with open(input_csv, 'r', newline='', encoding='utf-8') as csv_file, open(output_txt, 'w', encoding='utf-8') as txt_file:
    # Create a CSV reader object
    csv_reader = csv.reader(csv_file)

    # Skip the header row
    next(csv_reader)

    # Iterate over each row in the CSV file
    for row in csv_reader:
        # Remove URLs and text in brackets or parentheses from each column in the row
        cleaned_row = [url_pattern.sub('', col) for col in row]
        cleaned_row = [bracket_pattern.sub('', col) for col in cleaned_row]

        # Remove double quotation marks but keep the text inside
        cleaned_row = [quotation_pattern.sub(r'\1', col) for col in cleaned_row]

        # Combine all columns into a single string
        combined_text = ' '.join(cleaned_row).strip()

        # Split the combined text into sentences
        sentences = nltk.sent_tokenize(combined_text)

        # Write each non-empty sentence as a new line
        for sentence in sentences:
            cleaned_sentence = sentence.strip()
            if cleaned_sentence:  # Check if the sentence is not empty
                txt_file.write(f'{cleaned_sentence}\n')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
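
One thing to watch when stripping quotation marks with a regular expression: a pattern that treats `'` as a quote delimiter also pairs up apostrophes in possessives and contractions. The sketch below compares two patterns on a made-up sentence; the variable names are chosen for illustration.

```python
import re

# Made-up sentence with apostrophes and one double-quoted word
text = 'Saitama\'s power is "overwhelming" and Genos\'s is not.'

# Matches double-quoted OR single-quoted spans
with_single = re.compile(r'"(.*?)"|\'(.*?)\'')
# Matches double-quoted spans only
double_only = re.compile(r'"(.*?)"')

mangled = with_single.sub(r'\1\2', text)
cleaned = double_only.sub(r'\1', text)

print(mangled)
print(cleaned)
```

In the first result the two apostrophes are consumed as if they were a pair of quotes and the real quotation marks are left in place, while the double-quote-only pattern removes the quotation marks and keeps the apostrophes.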


# Activity 1: Scrape Facts from Websites




Visit this link: https://docs.google.com/document/d/1O_0UKeGdoFE3jhLASUQ2LJ1WCkLOfl2rHtpg2_z3nPo/edit?usp=sharing