<a href="https://colab.research.google.com/github/alammobaDar/Data_Scraping/blob/main/%5BStudents%5D_Data_scraping_with_Virtual_Protocol.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scrape Your Own DataSet - Virtual Protocol

Before proceeding with this step, ensure that you have created an account with Virtual Protocol. If you haven't already registered, please do so by visiting the following link:

[Create Your Account on Virtual Protocol](https://app.virtuals.io/)

Additionally, for real-time assistance and to join our community of data enthusiasts, please join our Discord server:

[Join Our Discord Community](https://discord.gg/X3PtZpjC)

Once you've set up your account and joined Discord, you can continue with the data scraping tutorial in this notebook.


# VIRTUAL Protocol

"An AI x Metaverse Protocol that is building the future of virtual interactions."

**VIRTUAL Protocol aligns incentives for the decentralized creation and monetization of AI agents for every virtual interaction (gaming, metaverses, online interactions, or beyond).**

![Virtual Protocol Image](https://whitepaper.virtuals.io/~gitbook/image?url=https%3A%2F%2F3396718532-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FazqzRwq4QZNL0p1M51oQ%252Fuploads%252FlunssyB9aApcmlzfmgOM%252FGroup%2520127.png%3Falt%3Dmedia%26token%3D898525bf-9b85-4c76-acc8-9b59688c7d5c&width=768&dpr=4&quality=100&sign=c977a2a48bf06bcca5620502fcc1e916ae68d9eea37b68f53509ce9ee5595a6c)

[Visit the Whitepaper for more details](https://whitepaper.virtuals.io/the-virtuals-ecosystem-explained)


# Useful Links

**Website:** [https://tao.virtuals.io/](https://tao.virtuals.io/)

**Github:** [https://github.com/Virtual-Protocol/tao-vpsubnet/](https://github.com/Virtual-Protocol/tao-vpsubnet/)

**Twitter/X:** [https://twitter.com/virtuals_io](https://twitter.com/virtuals_io)



## Import Libraries

We will use Beautiful Soup for the scraping part, Requests to download the pages, and Pandas to organize the results.

[**Beautiful Soup**](https://beautiful-soup-4.readthedocs.io/en/latest/) is a Python library for pulling data out of HTML and XML files.

It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

It commonly saves programmers hours or days of work.
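
As a minimal sketch of what navigating and searching the parse tree looks like, the example below parses a small hand-written HTML snippet. The snippet and its contents are made up for illustration and are not taken from any page scraped later in this notebook.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration
html = """
<html><body>
  <h2><span class="mw-headline" id="Background">Background</span></h2>
  <p>First paragraph.</p>
  <p>Second paragraph with a <a href="/wiki/Example">link</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Search: find the first element matching a tag name and attributes
headline = soup.find("span", {"id": "Background"})
heading_text = headline.get_text()

# Search: collect the text of every <p> element in the document
paragraph_texts = [p.get_text() for p in soup.find_all("p")]

# Navigate: read an attribute of the first <a> element
first_link = soup.find("a")["href"]

print(heading_text)
print(paragraph_texts)
print(first_link)
```

The same `find` and `find_all` calls are what the scraping cells later in the notebook rely on; only the HTML source differs.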


[**Pandas**](https://www.w3schools.com/python/pandas/pandas_intro.asp) is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.
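
To make the description above concrete, here is a small sketch that builds a DataFrame from a list of dictionaries and explores it. The records, column names, and variable names are invented for illustration only.

```python
import pandas as pd

# Made-up records: one dictionary per row
records = [
    {"page": "Page A", "text": "A short passage."},
    {"page": "Page B", "text": "A somewhat longer passage of text."},
]

# Each dictionary becomes a row and each key becomes a column
df = pd.DataFrame(records)

# Manipulating: add a column holding the length of each text
df["length"] = df["text"].str.len()

# Exploring: find the page with the longest text
longest_page = df.sort_values("length", ascending=False).iloc[0]["page"]

print(df)
print(longest_page)
```

The scraping cells later in the notebook follow the same pattern: each scraped page becomes one dictionary, and the list of dictionaries becomes a DataFrame.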



In [None]:
!pip install beautifulsoup4



In [None]:
!pip install pandas

In [None]:
!pip install requests



## GET Request & Scrape Using Beautiful Soup

This script uses Requests and Beautiful Soup to scrape the Background, Personality, and Appearance sections of Hinata Hyūga's pages on two Fandom wikis, then organizes the text with Pandas.


In [None]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# URLs to be scraped
urls = [
    'https://naruto.fandom.com/wiki/Hinata_Hy%C5%ABga',
    'https://hero.fandom.com/wiki/Hinata_Hy%C5%ABga'
]



## Access the URL

The `extract_text_from_url` function sends a GET request to the given URL and returns the scraped section text as a dictionary. If the request does not succeed, it prints a message and returns `None` instead of raising an error.
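
The request step on its own can be sketched as follows. This is only an illustration under stated assumptions: the helper name `fetch_html` and the 10-second timeout are hypothetical choices, not part of the notebook's own code.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page HTML as a string, or None if the request fails.

    `fetch_html` and the default timeout are illustrative, not notebook code.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into exceptions so one handler covers them
        response.raise_for_status()
    except requests.RequestException as error:
        print(f"Failed to retrieve data from {url}: {error}")
        return None
    return response.text

# A malformed URL is rejected before any network traffic, so this returns None
result = fetch_html("not-a-valid-url")
print(result)
```

Catching `requests.RequestException` covers connection failures, timeouts, malformed URLs, and the HTTP errors raised by `raise_for_status()`, so the caller only has to check for `None`.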


## Collect the facts about the character on the websites

This code collects the text of the Background, Personality, and Appearance sections from each page and organizes it into a Pandas DataFrame with one row per URL and one column per section.
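
The idea can be tried offline first. The sketch below runs on a made-up HTML string that imitates a wiki layout, where each section is an `<h2>` containing a `<span>` with an id, followed by sibling `<p>` elements. The helper name `section_text` and the sample HTML are assumptions for illustration; real pages may be structured differently.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up HTML that imitates a wiki page with two sections
html = """
<div>
  <h2><span class="mw-headline" id="Background">Background</span></h2>
  <p>Background one.</p>
  <p>Background two.</p>
  <h2><span class="mw-headline" id="Personality">Personality</span></h2>
  <p>Personality one.</p>
</div>
"""

def section_text(soup, section_id):
    """Join the paragraphs between one <h2> heading and the next."""
    headline = soup.find("span", {"id": section_id})
    heading = headline.find_parent("h2") if headline else None
    if heading is None:
        return "Section not found."
    paragraphs = []
    for sibling in heading.find_next_siblings():
        if sibling.name == "h2":  # the next section starts here
            break
        if sibling.name == "p":
            # The " " separator keeps words apart around inline tags
            paragraphs.append(sibling.get_text(" ", strip=True))
    return " ".join(paragraphs)

soup = BeautifulSoup(html, "html.parser")
section_ids = ["Background", "Personality", "Appearance"]
row = {name: section_text(soup, name) for name in section_ids}

# One dictionary per page becomes one row in the DataFrame
df_example = pd.DataFrame([row])
print(df_example)
```

Walking the heading's siblings and stopping at the next `<h2>` keeps each column limited to its own section. The sample has no Appearance section, so that column falls back to the placeholder text.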



In [None]:

# Extract information from both URLs, requesting each page only once
results = [extract_text_from_url(url) for url in urls]
data = [row for row in results if row is not None]

# Create DataFrame for structured information
df_info = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df_info.to_csv('hinata_info.csv', index=False)

# Print the DataFrame
print(df_info)


                                                URL  \
0  https://naruto.fandom.com/wiki/Hinata_Hy%C5%ABga   
1    https://hero.fandom.com/wiki/Hinata_Hy%C5%ABga   

                                          Background  \
0  Because Hiashi was the head of the Hyūga clan,...   
1  After Hinata gained a brutal beating from spar...   

                                         Personality  \
0  Hinata's outlook sees a shift after her appoin...   
1  Back in her childhood, as a result of her clan...   

                                          Appearance  
0  In Part I, Hinata usually wears a cream-colour...  
1  In Part I, her hair was in a short, leveled hi...  


In [None]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# URLs to be scraped
urls = [
    'https://naruto.fandom.com/wiki/Hinata_Hy%C5%ABga',
    'https://hero.fandom.com/wiki/Hinata_Hy%C5%ABga'
]

def extract_text_from_url(url):
    # Send a GET request to the page; the timeout stops a stalled server from hanging the notebook
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException as error:
        print(f"Failed to retrieve data from {url}. Error: {error}")
        return None

    # Stop early if the request was not successful
    if response.status_code != 200:
        print(f"Failed to retrieve data from {url}. Status code: {response.status_code}")
        return None

    # Create BeautifulSoup object; parse with 'html.parser'
    soup = bs(response.text, 'html.parser')

    # Define a function to extract section text
    def extract_section_text(section_id):
        # Find the <h2> heading that wraps the section's headline span
        headline = soup.find('span', {'id': section_id})
        heading = headline.find_parent('h2') if headline else None
        if heading is None:
            return 'Section not found.'

        # Collect the paragraphs that follow the heading, up to the next <h2>
        paragraphs = []
        for sibling in heading.find_next_siblings():
            if sibling.name == 'h2':  # Stop if another section heading is reached
                break
            if sibling.name == 'p':
                paragraphs.append(sibling.get_text(strip=True))
        return ' '.join(paragraphs)

    # Extracting information from different sections
    background_text = extract_section_text('Background')
    personality_text = extract_section_text('Personality')
    appearance_text = extract_section_text('Appearance')

    # Return extracted text
    return {
        'URL': url,
        'Background': background_text,
        'Personality': personality_text,
        'Appearance': appearance_text
    }

# Extract information from both URLs, requesting each page only once
results = [extract_text_from_url(url) for url in urls]
data = [row for row in results if row is not None]

# Create DataFrame for structured information
df_info_01 = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df_info_01.to_csv('hinata_info.csv', index=False)

# Print the DataFrame
print(df_info_01)


                                                URL  \
0  https://naruto.fandom.com/wiki/Hinata_Hy%C5%ABga   
1    https://hero.fandom.com/wiki/Hinata_Hy%C5%ABga   

                                          Background  \
0  Because Hiashi was the head of the Hyūga clan,...   
1  After Hinata gained a brutal beating from spar...   

                                         Personality  \
0  Hinata's outlook sees a shift after her appoin...   
1  Back in her childhood, as a result of her clan...   

                                          Appearance  
0  In Part I, Hinata usually wears a cream-colour...  
1  In Part I, her hair was in a short, leveled hi...  


In [None]:
df_info_01

| | URL | Background | Personality | Appearance |
|---|---|---|---|---|
| 0 | https://naruto.fandom.com/wiki/Hinata_Hy%C5%ABga | Because Hiashi was the head of the Hyūga clan,... | Hinata's outlook sees a shift after her appoin... | In Part I, Hinata usually wears a cream-colour... |
| 1 | https://hero.fandom.com/wiki/Hinata_Hy%C5%ABga | After Hinata gained a brutal beating from spar... | Back in her childhood, as a result of her clan... | In Part I, her hair was in a short, leveled hi... |


## Try Another Website to Scrape

In [None]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# URL to be scraped
url = 'https://greatcharacters.miraheze.org/wiki/Hinata_Hyuga'

def extract_text_from_url(url):
    # Send a GET request to the page
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Create BeautifulSoup object; parse with 'html.parser'
        soup = bs(response.text, 'html.parser')

        # Define a function to extract section text
        def extract_section_text(section_title):
            try:
                # Find the section header
                section = soup.find('span', {'id': section_title})
                if not section:
                    return 'Section not found.'
                # Find the following content until the next heading
                section_content = section.find_parent('h2').find_next_sibling()
                paragraphs = []
                while section_content and section_content.name in ['p', 'ul', 'ol']:
                    if section_content.name == 'p':
                        paragraphs.append(section_content.get_text(strip=True))
                    elif section_content.name in ['ul', 'ol']:
                        list_items = section_content.find_all('li')
                        for item in list_items:
                            paragraphs.append(item.get_text(strip=True))
                    section_content = section_content.find_next_sibling()
                return ' '.join(paragraphs)
            except AttributeError:
                return 'Section not found.'

        # Map each section id on this page to the column it fills in the
        # combined dataset, so the result lines up with the Fandom columns
        section_columns = {
            'Why_She_Rocks': 'Background',
            'Bad_Qualities': 'Personality',
            'Trivia': 'Appearance',
            'Comments': 'Comments'
        }

        # Extract each section and store it under its column name
        extracted_data_renamed = {
            column: extract_section_text(section_id)
            for section_id, column in section_columns.items()
        }
        extracted_data_renamed['URL'] = url

        # Return extracted text
        return extracted_data_renamed
    else:
        print(f"Failed to retrieve data from {url}. Status code: {response.status_code}")
        return None

# Extract information from the URL
data = extract_text_from_url(url)

# Create DataFrame for structured information
if data:
    # Create DataFrame with URL as the first column
    df_info = pd.DataFrame([data])
    df_info = df_info[['URL', 'Background', 'Personality', 'Appearance', 'Comments']]

    # Save the DataFrame to a CSV file
    df_info.to_csv('hinata_greatcharacters_info.csv', index=False)
    # Print the DataFrame
    print(df_info)
else:
    print("No data to save.")


                                                 URL  \
0  https://greatcharacters.miraheze.org/wiki/Hina...   

                                          Background  \
0  Unlike most of the other characters at the sta...   

                                         Personality  \
0  Her act of fainting when she sees Naruto can b...   

                                          Appearance Comments  
0  She shares a birthday (December 27) withOchaco...           


In [None]:
df_info

| | URL | Background | Personality | Appearance | Comments |
|---|---|---|---|---|---|
| 0 | https://greatcharacters.miraheze.org/wiki/Hina... | Unlike most of the other characters at the sta... | Her act of fainting when she sees Naruto can b... | She shares a birthday (December 27) withOchaco... | |


In [None]:
import pandas as pd

# Load the two CSV files into separate DataFrames
df_info_01 = pd.read_csv('hinata_info.csv')
df_info_02 = pd.read_csv('hinata_greatcharacters_info.csv')

# Concatenate the DataFrames
df_combined = pd.concat([df_info_01, df_info_02], ignore_index=True)

# Save the combined DataFrame to a new CSV file
df_combined.to_csv('hinata_combined_info.csv', index=False)

# Print the combined DataFrame
print(df_combined)


                                                 URL  \
0   https://naruto.fandom.com/wiki/Hinata_Hy%C5%ABga   
1     https://hero.fandom.com/wiki/Hinata_Hy%C5%ABga   
2  https://greatcharacters.miraheze.org/wiki/Hina...   

                                          Background  \
0  Because Hiashi was the head of the Hyūga clan,...   
1  After Hinata gained a brutal beating from spar...   
2  Unlike most of the other characters at the sta...   

                                         Personality  \
0  Hinata's outlook sees a shift after her appoin...   
1  Back in her childhood, as a result of her clan...   
2  Her act of fainting when she sees Naruto can b...   

                                          Appearance  Comments  
0  In Part I, Hinata usually wears a cream-colour...       NaN  
1  In Part I, her hair was in a short, leveled hi...       NaN  
2  She shares a birthday (December 27) withOchaco...       NaN  
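
The `Comments` column is `NaN` in every row for two different reasons: the Fandom CSV has no `Comments` column at all, and the empty string scraped from the third site is read back from CSV as a missing value. The sketch below reproduces both effects with made-up frames and in-memory buffers instead of files; the helper name `round_trip` is hypothetical.

```python
import io
import pandas as pd

# Made-up stand-ins: one frame without 'Comments', one with an empty string
df_a = pd.DataFrame([{"URL": "a", "Background": "text"}])
df_b = pd.DataFrame([{"URL": "b", "Background": "text", "Comments": ""}])

def round_trip(df):
    """Write a DataFrame to CSV text and read it back, as the notebook does with files."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    buffer.seek(0)
    return pd.read_csv(buffer)

# The empty string becomes NaN once the CSV is read back
df_b_loaded = round_trip(df_b)

# concat aligns on column names and fills columns missing from a frame with NaN
combined = pd.concat([round_trip(df_a), df_b_loaded], ignore_index=True)
print(combined)
```

If empty strings need to survive the round trip, `pd.read_csv` accepts `keep_default_na=False`, at the cost of treating every empty field as a string.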


In [None]:
df_combined

| | URL | Background | Personality | Appearance | Comments |
|---|---|---|---|---|---|
| 0 | https://naruto.fandom.com/wiki/Hinata_Hy%C5%ABga | Because Hiashi was the head of the Hyūga clan,... | Hinata's outlook sees a shift after her appoin... | In Part I, Hinata usually wears a cream-colour... | NaN |
| 1 | https://hero.fandom.com/wiki/Hinata_Hy%C5%ABga | After Hinata gained a brutal beating from spar... | Back in her childhood, as a result of her clan... | In Part I, her hair was in a short, leveled hi... | NaN |
| 2 | https://greatcharacters.miraheze.org/wiki/Hina... | Unlike most of the other characters at the sta... | Her act of fainting when she sees Naruto can b... | She shares a birthday (December 27) withOchaco... | NaN |


# Data Cleaning

In [None]:
# Load the combined DataFrame
df_combined = pd.read_csv('hinata_combined_info.csv')

# Check for null values in the DataFrame
null_summary = df_combined.isnull().sum()

# Print the summary of null values
print("Summary of null values in each column:")
print(null_summary)

Summary of null values in each column:
URL            0
Background     0
Personality    0
Appearance     0
Comments       3
dtype: int64
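
Since all three `Comments` values are missing, the column carries no information. As a minimal sketch on a made-up frame (the data and variable names are illustrative), `dropna(axis=1, how="all")` removes every column that is entirely null without having to name it:

```python
import pandas as pd

# Made-up frame with one column that is entirely missing
df_example = pd.DataFrame({
    "URL": ["a", "b", "c"],
    "Background": ["x", "y", "z"],
    "Comments": [None, None, None],
})

# Count missing values per column, as in the summary above
null_counts = df_example.isnull().sum()

# Drop only the columns where every value is missing
df_trimmed = df_example.dropna(axis=1, how="all")

print(null_counts)
print(df_trimmed.columns.tolist())
```

Dropping by name, as the next cell does, is more explicit; the `dropna` form is convenient when several scraped columns may turn out empty.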


In [None]:
# Drop the 'Comments' column if it exists
if 'Comments' in df_combined.columns:
    df_combined = df_combined.drop(columns=['Comments'])
    print("'Comments' column dropped.")
else:
    print("'Comments' column does not exist.")

# Save the cleaned DataFrame to a new CSV file
#df_combined.to_csv('hinata_cleaned_info.csv', index=False)

# Print the cleaned DataFrame
print("Cleaned DataFrame:")
print(df_combined)
df_combined



'Comments' column does not exist.
Cleaned DataFrame:
                                                 URL  \
0   https://naruto.fandom.com/wiki/Hinata_Hy%C5%ABga   
1     https://hero.fandom.com/wiki/Hinata_Hy%C5%ABga   
2  https://greatcharacters.miraheze.org/wiki/Hina...   

                                          Background  \
0  Because Hiashi was the head of the Hyūga clan,...   
1  After Hinata gained a brutal beating from spar...   
2  Unlike most of the other characters at the sta...   

                                         Personality  \
0  Hinata's outlook sees a shift after her appoin...   
1  Back in her childhood, as a result of her clan...   
2  Her act of fainting when she sees Naruto can b...   

                                          Appearance  
0  In Part I, Hinata usually wears a cream-colour...  
1  In Part I, her hair was in a short, leveled hi...  
2  She shares a birthday (December 27) withOchaco...  


| | URL | Background | Personality | Appearance |
|---|---|---|---|---|
| 0 | https://naruto.fandom.com/wiki/Hinata_Hy%C5%ABga | Because Hiashi was the head of the Hyūga clan,... | Hinata's outlook sees a shift after her appoin... | In Part I, Hinata usually wears a cream-colour... |
| 1 | https://hero.fandom.com/wiki/Hinata_Hy%C5%ABga | After Hinata gained a brutal beating from spar... | Back in her childhood, as a result of her clan... | In Part I, her hair was in a short, leveled hi... |
| 2 | https://greatcharacters.miraheze.org/wiki/Hina... | Unlike most of the other characters at the sta... | Her act of fainting when she sees Naruto can b... | She shares a birthday (December 27) withOchaco... |


In [None]:
# Save the cleaned DataFrame to a new CSV file
df_combined.to_csv('hinata_cleaned_info.csv', index=False)

# Activity 1: Scrape Facts from Websites




Visit this link: https://docs.google.com/document/d/1O_0UKeGdoFE3jhLASUQ2LJ1WCkLOfl2rHtpg2_z3nPo/edit?usp=sharing