<a href="https://colab.research.google.com/github/edelmode/Bootcamp-VirtualProtocol_WebScraping/blob/main/Konan_Info.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scrape Your Own DataSet - Virtual Protocol

Before proceeding with this step, ensure that you have created an account with Virtual Protocol. If you haven't already registered, please do so by visiting the following link:

[Create Your Account on Virtual Protocol](https://app.virtuals.io/)

Additionally, for real-time assistance and to join our community of data enthusiasts, please join our Discord server:

[Join Our Discord Community](https://discord.gg/X3PtZpjC)

Once you've set up your account and joined Discord, you can continue with the data scraping tutorial in this notebook.


# VIRTUAL Protocol

"An AI x Metaverse Protocol that is building the future of virtual interactions."

**VIRTUAL Protocol aligns incentives for the decentralized creation and monetization of AI agents for every virtual interaction (gaming, metaverses, online interactions, or beyond).**

![Virtual Protocol Image](https://whitepaper.virtuals.io/~gitbook/image?url=https%3A%2F%2F3396718532-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FazqzRwq4QZNL0p1M51oQ%252Fuploads%252FlunssyB9aApcmlzfmgOM%252FGroup%2520127.png%3Falt%3Dmedia%26token%3D898525bf-9b85-4c76-acc8-9b59688c7d5c&width=768&dpr=4&quality=100&sign=c977a2a48bf06bcca5620502fcc1e916ae68d9eea37b68f53509ce9ee5595a6c)

[Visit the Whitepaper for more details](https://whitepaper.virtuals.io/the-virtuals-ecosystem-explained)


# Useful Links

**Website:** [https://tao.virtuals.io/](https://tao.virtuals.io/)

**Github:** [https://github.com/Virtual-Protocol/tao-vpsubnet/](https://github.com/Virtual-Protocol/tao-vpsubnet/)

**Twitter/X:** [https://twitter.com/virtuals_io](https://twitter.com/virtuals_io)



## Import Libraries

We will use Beautiful Soup for parsing, Requests for fetching pages, and Pandas for organizing the results.

[**Beautiful Soup**](https://beautiful-soup-4.readthedocs.io/en/latest/) is a Python library for pulling data out of HTML and XML files.

It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

It commonly saves programmers hours or days of work.
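
As a minimal sketch of what navigating and searching the parse tree looks like, the snippet below parses a small hand-written HTML string (an illustrative stand-in, not a real page) with Python's built-in `html.parser` and pulls out a heading and its paragraphs.

```python
from bs4 import BeautifulSoup

# Hand-written sample HTML used only for illustration
sample_html = """
<html><body>
  <h2>Personality</h2>
  <p>Calm and level-headed.</p>
  <p>Loyal to her friends.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag; find_all() returns every match
heading = soup.find("h2").get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

print(heading)
print(paragraphs)
```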


[**Pandas**](https://www.w3schools.com/python/pandas/pandas_intro.asp) is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.
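
To give a feel for the analyzing and cleaning functions mentioned above, here is a small sketch. The rows are placeholder values rather than scraped data, and the column names simply mirror the ones used later in this notebook.

```python
import pandas as pd

# Illustrative rows; the values are placeholders, not scraped data
rows = [
    {"URL": "https://example.com/a", "Background": "Some text", "Personality": None},
    {"URL": "https://example.com/b", "Background": "Other text", "Personality": "Calm"},
]

df = pd.DataFrame(rows)

# Count missing values per column, then keep only complete rows
missing_per_column = df.isnull().sum()
complete_rows = df.dropna()

print(missing_per_column)
print(complete_rows)
```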



In [None]:
!pip install beautifulsoup4



In [None]:
!pip install pandas



In [None]:
!pip install requests



## Get Request & Scrape using beautiful soup

This notebook uses requests and BeautifulSoup to scrape character information from Fandom wiki pages and processes the data using Pandas.


In [None]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# URLs to be scraped
urls = [
    'https://naruto.fandom.com/wiki/Hinata_Hy%C5%ABga',
    'https://akat.fandom.com/wiki/Konan'  # This is a placeholder; use the second link here
]

## Access the URL

The `extract_text_from_url` function sends a GET request to the given URL, parses the response, and returns the extracted section text as a dictionary. If the request fails, it prints the error and returns `None`.
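
The request step on its own can be sketched as follows. `fetch_soup` is a hypothetical helper name chosen for this example, and the 10-second timeout is an assumption rather than a value taken from this notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Return a BeautifulSoup object for url, or None if the request fails.

    fetch_soup is a hypothetical helper name used for this sketch.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # Raise for 4xx/5xx status codes
    except requests.RequestException as error:
        print(f"Failed to retrieve data from {url}. Error: {error}")
        return None
    return BeautifulSoup(response.text, "html.parser")
```

Catching `requests.RequestException` covers connection errors, timeouts, malformed URLs, and the HTTP errors raised by `raise_for_status()`, so the caller only has to check for `None`.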


## Collect the facts about the character on the websites

This code collects the text of each named section on the page (such as Background, Personality, and Appearance) and organizes it into a Pandas DataFrame with one column per section and one row per URL.
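
The header-based collection can be tried offline on a hand-written HTML fragment. This is a simplified sketch, not the exact function used in the cells of this notebook: `section_text` is a hypothetical name, the sample HTML is made up, and the function stops at the first matching header.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written sample HTML standing in for a wiki page
sample_html = """
<div>
  <h2>Background</h2>
  <p>Grew up during a war.</p>
  <h2>Personality</h2>
  <p>Stoic and calm.</p>
  <p>Rarely speaks.</p>
</div>
"""

HEADERS = ["h2", "h3", "h4"]


def section_text(soup, header_text):
    """Join the <p> siblings that follow the first header containing header_text."""
    for tag in soup.find_all(HEADERS):
        if header_text.lower() in tag.get_text().lower():
            paragraphs = []
            sibling = tag.find_next_sibling()
            # Stop at the next header so text from later sections is not mixed in
            while sibling is not None and sibling.name not in HEADERS:
                if sibling.name == "p":
                    paragraphs.append(sibling.get_text(strip=True))
                sibling = sibling.find_next_sibling()
            return " ".join(paragraphs)
    return "Section not found."


soup = BeautifulSoup(sample_html, "html.parser")
sections = ["Background", "Personality", "Appearance"]
row = {name: section_text(soup, name) for name in sections}
df_sample = pd.DataFrame([row])
print(df_sample)
```

Sections that do not exist in the sample, such as Appearance here, come back as `Section not found.`, which is the same marker that appears in the outputs in this notebook.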



In [None]:

# Extract information from both URLs, requesting each page only once
results = [extract_text_from_url(url) for url in urls]
data = [row for row in results if row is not None]

# Create DataFrame for structured information
df_info = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df_info.to_csv('konan_info.csv', index=False)

# Print the DataFrame
print(df_info)


           Background         Personality          Appearance  \
0  Section not found.  Section not found.                       
1  Section not found.  Section not found.  Section not found.   

             Comments                                               URL  
0  Section not found.  https://naruto.fandom.com/wiki/Hinata_Hy%C5%ABga  
1  Section not found.                https://akat.fandom.com/wiki/Konan  


In [None]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# URLs to be scraped
urls = [
    'https://akat.fandom.com/wiki/Konan'
]

def extract_text_from_url(url):
    # Send a GET request to the page
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Create BeautifulSoup object; parse with 'html.parser'
        soup = bs(response.text, 'html.parser')

        # Define a function to extract section text based on header names
        def extract_section_text(header_text):
            paragraphs = []
            found = False
            for tag in soup.find_all(['h2', 'h3', 'h4']):
                if header_text.lower() in tag.get_text().lower():
                    found = True
                    next_sibling = tag.find_next_sibling()
                    while next_sibling and next_sibling.name != 'h2':
                        if next_sibling.name == 'p':
                            paragraphs.append(next_sibling.get_text(strip=True))
                        next_sibling = next_sibling.find_next_sibling()
            return ' '.join(paragraphs) if found else 'Section not found.'

        # Extracting information from different sections
        background_text = extract_section_text('Background')
        personality_text = extract_section_text('Personality')
        appearance_text = extract_section_text('Appearance')
        abilities_text = extract_section_text('Abilities')

        # Return extracted text
        return {
            'URL': url,
            'Background': background_text,
            'Personality': personality_text,
            'Appearance': appearance_text,
            'Abilities': abilities_text
        }
    else:
        print(f"Failed to retrieve data from {url}. Status code: {response.status_code}")
        return None

# Extract information from the URL, requesting each page only once
results = [extract_text_from_url(url) for url in urls]
data = [row for row in results if row is not None]

# Create DataFrame for structured information
df_info_01 = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df_info_01.to_csv('Konan_info.csv', index=False)

# Print the DataFrame
print(df_info_01)


                                  URL  \
0  https://akat.fandom.com/wiki/Konan   

                                          Background  \
0  [1][2]Konan as a child.When she was young, Kon...   

                                         Personality  \
0  Konan is stoic, calm, and level-headed (much l...   

                                          Appearance  \
0  Konan is a woman who has short blue hair, gray...   

                                           Abilities  
0  [3][4]Konan's first shown origami. [5][6]Konan...  


In [None]:
df_info_01

| | URL | Background | Personality | Appearance | Abilities |
|---|---|---|---|---|---|
| 0 | https://akat.fandom.com/wiki/Konan | [1][2]Konan as a child.When she was young, Kon... | Konan is stoic, calm, and level-headed (much l... | Konan is a woman who has short blue hair, gray... | [3][4]Konan's first shown origami. [5][6]Konan... |
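
The `Background` and `Abilities` values above begin with footnote markers such as `[1][2]`, which are carried along with the paragraph text. One possible cleanup, sketched here with a hypothetical helper named `strip_citations`, removes bracketed numbers with a regular expression.

```python
import re

# Matches footnote markers like [1] or [12]
CITATION_PATTERN = re.compile(r"\[\d+\]")


def strip_citations(text):
    """Remove footnote markers such as [1] or [12] from text."""
    return CITATION_PATTERN.sub("", text).strip()


sample = "[3][4]Konan's first shown origami."
print(strip_citations(sample))
```

Applied to a whole column, this could look like `df_info_01['Background'].apply(strip_citations)`, assuming every value in that column is a string.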


## Try Another Website to Scrape

In [None]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# URLs to be scraped
urls = [
    'https://naruto.fandom.com/wiki/Konan'
]

def extract_text_from_url(url):
    # Send a GET request to the page
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Create BeautifulSoup object; parse with 'html.parser'
        soup = bs(response.text, 'html.parser')

        # Define a function to extract section text based on header names
        def extract_section_text(header_text):
            paragraphs = []
            found = False
            # Look for headers (h2, h3, h4) that contain the header text
            for tag in soup.find_all(['h2', 'h3', 'h4']):
                if header_text.lower() in tag.get_text().lower():
                    found = True
                    next_sibling = tag.find_next_sibling()
                    # Collect text until the next header of the same or higher level
                    while next_sibling and next_sibling.name not in ['h2', 'h3', 'h4']:
                        if next_sibling.name == 'p':
                            paragraphs.append(next_sibling.get_text(strip=True))
                        next_sibling = next_sibling.find_next_sibling()
            return ' '.join(paragraphs) if found else 'Section not found.'

        # Extracting information from different sections
        background_text = extract_section_text('Background')
        personality_text = extract_section_text('Personality')
        appearance_text = extract_section_text('Appearance')
        abilities_text = extract_section_text('Abilities')

        # Return extracted text
        return {
            'URL': url,
            'Background': background_text,
            'Personality': personality_text,
            'Appearance': appearance_text,
            'Abilities': abilities_text
        }
    else:
        print(f"Failed to retrieve data from {url}. Status code: {response.status_code}")
        return None

# Extract information from the URL, requesting each page only once
results = [extract_text_from_url(url) for url in urls]
data = [row for row in results if row is not None]

# Create DataFrame for structured information
df_info_02 = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df_info_02.to_csv('Konan2.csv', index=False)

# Print the DataFrame
print(df_info_02)


                                    URL  \
0  https://naruto.fandom.com/wiki/Konan   

                                          Background  \
0  When she was young, Konan's family was killed ...   

                                         Personality  \
0  Konan was smart, stoic, calm, and level-headed...   

                                          Appearance  \
0  Konan had short, straight blue hair with a bun...   

                                           Abilities  
0  As an S-rank criminal who grew up in war-torn ...  


In [None]:
df_info_02

| | URL | Background | Personality | Appearance | Abilities |
|---|---|---|---|---|---|
| 0 | https://naruto.fandom.com/wiki/Konan | When she was young, Konan's family was killed ... | Konan was smart, stoic, calm, and level-headed... | Konan had short, straight blue hair with a bun... | As an S-rank criminal who grew up in war-torn ... |


In [None]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# URLs to be scraped
urls = [
    'https://hero.fandom.com/wiki/Konan'
]

def extract_text_from_url(url):
    try:
        # Send a GET request to the page
        response = requests.get(url)
        response.raise_for_status()  # Check for request errors

        # Create BeautifulSoup object; parse with 'html.parser'
        soup = bs(response.text, 'html.parser')

        # Define a function to extract section text based on header names
        def extract_section_text(header_text):
            paragraphs = []
            found = False
            # Look for headers (h2, h3, h4) that contain the header text
            for tag in soup.find_all(['h2', 'h3', 'h4']):
                if header_text.lower() in tag.get_text().lower():
                    found = True
                    next_sibling = tag.find_next_sibling()
                    # Collect text until the next header of the same or higher level
                    while next_sibling and next_sibling.name not in ['h2', 'h3', 'h4']:
                        if next_sibling.name == 'p':
                            paragraphs.append(next_sibling.get_text(strip=True))
                        next_sibling = next_sibling.find_next_sibling()
            return ' '.join(paragraphs) if found else 'Section not found.'

        # Extracting information from different sections
        background_text = extract_section_text('Background')
        personality_text = extract_section_text('Personality')
        appearance_text = extract_section_text('Appearance')
        abilities_text = extract_section_text('Abilities')

        # Return extracted text
        return {
            'URL': url,
            'Background': background_text,
            'Personality': personality_text,
            'Appearance': appearance_text,
            'Abilities': abilities_text
        }
    except requests.RequestException as e:
        print(f"Failed to retrieve data from {url}. Error: {e}")
        return None

# Extract information from the URL, requesting each page only once
results = [extract_text_from_url(url) for url in urls]
data = [row for row in results if row is not None]

# Create DataFrame for structured information
df_info_03 = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df_info_03.to_csv('Konan3.csv', index=False)

# Print the DataFrame
print(df_info_03)


                                  URL  \
0  https://hero.fandom.com/wiki/Konan   

                                          Background  \
0  When she was only a child, Konan's parents was...   

                                         Personality  \
0  Konan was a stoic, calm, and level-headed woma...   

                                          Appearance  \
0  Konan has blue hair, gray eyes (amber in the a...   

                                           Abilities  
0  Konan is an S-rank kunoichi and her skills wer...  


In [None]:
df_info_03

| | URL | Background | Personality | Appearance | Abilities |
|---|---|---|---|---|---|
| 0 | https://hero.fandom.com/wiki/Konan | When she was only a child, Konan's parents was... | Konan was a stoic, calm, and level-headed woma... | Konan has blue hair, gray eyes (amber in the a... | Konan is an S-rank kunoichi and her skills wer... |


In [None]:
!pip install selenium

Collecting selenium
  Downloading selenium-4.23.1-py3-none-any.whl.metadata (7.1 kB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.26.0-py3-none-any.whl.metadata (8.8 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.11.1-py3-none-any.whl.metadata (4.7 kB)
Collecting outcome (from trio~=0.17->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Collecting h11<1,>=0.9.0 (from wsproto>=0.14->trio-websocket~=0.9->selenium)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading selenium-4.23.1-py3-none-any.whl (9.4 MB)
Downloading trio-0.26.0-py3-none-any.whl (475 kB)

In [None]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# URLs to be scraped
urls = [
    'https://wiki.sportskeeda.com/naruto/konan'
]

def extract_text_from_url(url):
    try:
        # Send a GET request to the page
        response = requests.get(url)
        response.raise_for_status()  # Check for request errors

        # Create BeautifulSoup object; parse with 'html.parser'
        soup = bs(response.text, 'html.parser')

        # Define a function to extract section text based on header IDs
        def extract_section_text(header_id, class_name):
            paragraphs = []
            header = soup.find(id=header_id)
            if header:
                next_sibling = header.find_next_sibling()
                # Collect text from <p> elements with the specified class name
                while next_sibling and next_sibling.name not in ['h2', 'h3', 'h4']:
                    if next_sibling.name == 'p' and class_name in next_sibling.get('class', []):
                        paragraphs.append(next_sibling.get_text(strip=True))
                    next_sibling = next_sibling.find_next_sibling()
            return ' '.join(paragraphs) if paragraphs else 'Section not found.'

        # Define class name for the paragraphs
        class_name = 'your-paragraph-class-name'  # Replace with the actual class name of <p> elements

        # Extracting information from different sections
        background_text = extract_section_text('konan-1', class_name)
        personality_text = extract_section_text('konan-2', class_name)
        appearance_text = extract_section_text('konan-3', class_name)
        abilities_text = extract_section_text('konan-4', class_name)

        # Return extracted text
        return {
            'URL': url,
            'Background': background_text,
            'Personality': personality_text,
            'Appearance': appearance_text,
            'Abilities': abilities_text
        }
    except requests.RequestException as e:
        print(f"Failed to retrieve data from {url}. Error: {e}")
        return None

# Extract information from the URL, requesting each page only once
results = [extract_text_from_url(url) for url in urls]
data = [row for row in results if row is not None]

# Create DataFrame for structured information
df_info_04 = pd.DataFrame(data)

# Save the DataFrame to its own CSV file so the earlier Konan3.csv is not overwritten
df_info_04.to_csv('Konan4.csv', index=False)

# Print the DataFrame
print(df_info_04)


                                         URL          Background  \
0  https://wiki.sportskeeda.com/naruto/konan  Section not found.   

          Personality          Appearance           Abilities  
0  Section not found.  Section not found.  Section not found.  
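
Every section above comes back as `Section not found.`, which is what the cell produces whenever no `<p>` carries the given class, as is the case while `class_name` still holds its placeholder value. The sketch below shows the same id-and-class lookup on a hand-written fragment. The `id` and `class` values in it are made up for illustration and are not taken from the real page, and `text_after_id` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Hand-written sample; the id and class values are made up for illustration
sample_html = """
<div>
  <h2 id="konan-1">Background</h2>
  <p class="body-text">Orphaned as a child.</p>
  <p class="caption">Image caption to skip.</p>
  <h2 id="konan-2">Personality</h2>
  <p class="body-text">Stoic and calm.</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def text_after_id(soup, header_id, class_name):
    """Collect <p> siblings with class_name that follow the element with header_id."""
    header = soup.find(id=header_id)
    if header is None:
        return "Section not found."
    paragraphs = []
    sibling = header.find_next_sibling()
    while sibling is not None and sibling.name not in ["h2", "h3", "h4"]:
        # A tag's "class" attribute is a list, so membership is checked per class name
        if sibling.name == "p" and class_name in sibling.get("class", []):
            paragraphs.append(sibling.get_text(strip=True))
        sibling = sibling.find_next_sibling()
    return " ".join(paragraphs) if paragraphs else "Section not found."


print(text_after_id(soup, "konan-1", "body-text"))
print(text_after_id(soup, "konan-1", "your-paragraph-class-name"))
```

To adapt this to a real page, inspect its HTML in the browser's developer tools and substitute the actual header ids and paragraph class.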


In [None]:
df_info_04

In [None]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# URLs to be scraped
urls = [
    'https://gakkubi.tumblr.com/post/648446905223970816/ame-trio-analysis-konan'
]

def extract_text_from_url(url):
    try:
        # Send a GET request to the page
        response = requests.get(url)
        response.raise_for_status()  # Check for request errors

        # Create BeautifulSoup object; parse with 'html.parser'
        soup = bs(response.text, 'html.parser')

        # Define a function to extract section text based on header names
        def extract_section_text(header_text):
            paragraphs = []
            found = False
            # Look for headers (h2, h3, h4) that contain the header text
            for tag in soup.find_all(['h2', 'h3', 'h4']):
                if header_text.lower() in tag.get_text().lower():
                    found = True
                    next_sibling = tag.find_next_sibling()
                    # Collect text until the next header of the same or higher level
                    while next_sibling and next_sibling.name not in ['h2', 'h3', 'h4']:
                        if next_sibling.name == 'p':
                            paragraphs.append(next_sibling.get_text(strip=True))
                        next_sibling = next_sibling.find_next_sibling()
            return ' '.join(paragraphs) if found else 'Section not found.'

        # Extracting information from different sections
        background_text = extract_section_text('Background')
        personality_text = extract_section_text('Personality')
        appearance_text = extract_section_text('Appearance')
        abilities_text = extract_section_text('Abilities')

        # Return extracted text
        return {
            'URL': url,
            'Background': background_text,
            'Personality': personality_text,
            'Appearance': appearance_text,
            'Abilities': abilities_text
        }
    except requests.RequestException as e:
        print(f"Failed to retrieve data from {url}. Error: {e}")
        return None

# Extract information from the URL, requesting each page only once
results = [extract_text_from_url(url) for url in urls]
data = [row for row in results if row is not None]

# Create DataFrame for structured information
df_info_05 = pd.DataFrame(data)

# Save the DataFrame to a CSV file (Konan5.csv is the name loaded in the combining step)
df_info_05.to_csv('Konan5.csv', index=False)

# Print the DataFrame
print(df_info_05)


                                                 URL          Background  \
0  https://gakkubi.tumblr.com/post/64844690522397...  Section not found.   

          Personality          Appearance           Abilities  
0  Section not found.  Section not found.  Section not found.  


In [None]:
df_info_05

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,20
0,"[https://hero.fandom.com/wiki/Konan, Naruto]","[https://hero.fandom.com/wiki/Konan, ""Yahiko a...","[https://hero.fandom.com/wiki/Konan, Naruto]","[https://hero.fandom.com/wiki/Konan, White]","[https://hero.fandom.com/wiki/Konan, Naruto]","[https://hero.fandom.com/wiki/Konan, Naruto]","[https://hero.fandom.com/wiki/Konan, Hidden Le...","[https://hero.fandom.com/wiki/Konan, New Gener...","[https://hero.fandom.com/wiki/Konan, Mentors]","[https://hero.fandom.com/wiki/Konan, Hokages]",...,"[https://hero.fandom.com/wiki/Konan, Leaders o...","[https://hero.fandom.com/wiki/Konan, Hidden Sa...","[https://hero.fandom.com/wiki/Konan, Hidden Cl...","[https://hero.fandom.com/wiki/Konan, Hidden Mi...","[https://hero.fandom.com/wiki/Konan, Hidden St...","[https://hero.fandom.com/wiki/Konan, Tailed Be...","[https://hero.fandom.com/wiki/Konan, Others]","[https://hero.fandom.com/wiki/Konan, Boruto: N...","[https://hero.fandom.com/wiki/Konan, New Gener...","[https://hero.fandom.com/wiki/Konan, Non-canon]"


In [None]:
import pandas as pd

# Load the five CSV files into separate DataFrames
df_info_01 = pd.read_csv('Konan_info.csv')
df_info_02 = pd.read_csv('Konan2.csv')
df_info_03 = pd.read_csv('Konan3.csv')
df_info_04 = pd.read_csv('Konan4.csv')
df_info_05 = pd.read_csv('Konan5.csv')

# Concatenate the DataFrames
df_combined = pd.concat([df_info_01, df_info_02, df_info_03, df_info_04, df_info_05], ignore_index=True)

# Save the combined DataFrame to a new CSV file
df_combined.to_csv('konan_combined_info.csv', index=False)

# Print the combined DataFrame
print(df_combined)

                                  URL  \
0  https://akat.fandom.com/wiki/Konan   
1  https://hero.fandom.com/wiki/Konan   
2  https://hero.fandom.com/wiki/Konan   
3  https://hero.fandom.com/wiki/Konan   
4                                 NaN   

                                          Background  \
0  [1][2]Konan as a child.When she was young, Kon...   
1  When she was only a child, Konan's parents was...   
2  When she was only a child, Konan's parents was...   
3  When she was only a child, Konan's parents was...   
4                                                NaN   

                                         Personality  \
0  Konan is stoic, calm, and level-headed (much l...   
1  Konan was a stoic, calm, and level-headed woma...   
2  Konan was a stoic, calm, and level-headed woma...   
3  Konan was a stoic, calm, and level-headed woma...   
4                                                NaN   

                                          Appearance  \
0  Konan is a woman who

In [None]:
df_combined

Unnamed: 0,URL,Background,Personality,Appearance,Abilities,0,1,2,3,4,...,11,12,13,14,15,16,17,18,19,20
0,https://akat.fandom.com/wiki/Konan,"[1][2]Konan as a child.When she was young, Kon...","Konan is stoic, calm, and level-headed (much l...","Konan is a woman who has short blue hair, gray...",[3][4]Konan's first shown origami. [5][6]Konan...,,,,,,...,,,,,,,,,,
1,https://hero.fandom.com/wiki/Konan,"When she was only a child, Konan's parents was...","Konan was a stoic, calm, and level-headed woma...","Konan has blue hair, gray eyes (amber in the a...",Konan is an S-rank kunoichi and her skills wer...,,,,,,...,,,,,,,,,,
2,https://hero.fandom.com/wiki/Konan,"When she was only a child, Konan's parents was...","Konan was a stoic, calm, and level-headed woma...","Konan has blue hair, gray eyes (amber in the a...",Konan is an S-rank kunoichi and her skills wer...,,,,,,...,,,,,,,,,,
3,https://hero.fandom.com/wiki/Konan,"When she was only a child, Konan's parents was...","Konan was a stoic, calm, and level-headed woma...","Konan has blue hair, gray eyes (amber in the a...",Konan is an S-rank kunoichi and her skills wer...,,,,,,...,,,,,,,,,,
4,,,,,,"['https://hero.fandom.com/wiki/Konan', 'Naruto']","['https://hero.fandom.com/wiki/Konan', '""Yahik...","['https://hero.fandom.com/wiki/Konan', 'Naruto']","['https://hero.fandom.com/wiki/Konan', 'White']","['https://hero.fandom.com/wiki/Konan', 'Naruto']",...,"['https://hero.fandom.com/wiki/Konan', 'Leader...","['https://hero.fandom.com/wiki/Konan', 'Hidden...","['https://hero.fandom.com/wiki/Konan', 'Hidden...","['https://hero.fandom.com/wiki/Konan', 'Hidden...","['https://hero.fandom.com/wiki/Konan', 'Hidden...","['https://hero.fandom.com/wiki/Konan', 'Tailed...","['https://hero.fandom.com/wiki/Konan', 'Others']","['https://hero.fandom.com/wiki/Konan', 'Boruto...","['https://hero.fandom.com/wiki/Konan', 'New Ge...","['https://hero.fandom.com/wiki/Konan', 'Non-ca..."
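
Row 4 above is empty in every section column because `pd.concat` aligns frames by column name: a frame whose columns are labelled `0` to `20` shares no columns with frames labelled `URL`, `Background`, and so on. The sketch below reproduces the effect with two tiny illustrative frames and shows one possible guard, which is an assumption of this example rather than part of the original workflow.

```python
import pandas as pd

# Two small illustrative frames with no column names in common
structured = pd.DataFrame([{"URL": "https://example.com/a", "Background": "text"}])
mismatched = pd.DataFrame([["x", "y"]])  # default integer column labels 0 and 1

# Columns are unioned; cells missing from a frame are filled with NaN
combined = pd.concat([structured, mismatched], ignore_index=True)
print(combined)

# One way to guard against this: only combine frames with the expected columns
expected = ["URL", "Background"]
frames = [f for f in (structured, mismatched) if list(f.columns) == expected]
safe = pd.concat(frames, ignore_index=True)
print(safe)
```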


# Data Cleaning

In [None]:
# Load the combined DataFrame
df_combined = pd.read_csv('konan_combined_info.csv')

# Check for null values in the DataFrame
null_summary = df_combined.isnull().sum()

# Print the summary of null values
print("Summary of null values in each column:")
print(null_summary)

Summary of null values in each column:
URL            1
Background     1
Personality    1
Appearance     1
Abilities      1
0              4
1              4
2              4
3              4
4              4
5              4
6              4
7              4
8              4
9              4
10             4
11             4
12             4
13             4
14             4
15             4
16             4
17             4
18             4
19             4
20             4
dtype: int64
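
Because every column in the summary above has at least one null, `dropna(how='any')` along either axis would discard all of the data. A narrower cleanup, sketched here on a small illustrative frame shaped like the combined one, restricts the check to the section columns with `subset` and then removes repeated rows.

```python
import pandas as pd

# Illustrative frame: two complete (and identical) rows plus one row that
# only has a value in a stray column labelled 0
df = pd.DataFrame(
    {
        "URL": ["https://example.com/a", "https://example.com/a", None],
        "Background": ["text", "text", None],
        0: [None, None, "stray"],
    }
)

# how='any' over all columns removes every row here
print(len(df.dropna(axis=0, how="any")))

section_columns = ["URL", "Background"]
cleaned = (
    df.dropna(subset=section_columns)  # keep rows with every section filled
    .loc[:, section_columns]           # discard the stray columns
    .drop_duplicates()                 # collapse repeated scrapes of one page
    .reset_index(drop=True)
)
print(cleaned)
```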


In [None]:
# Every column has at least one null, so dropna(how='any') on either axis
# would discard all the data. Keep only the five section columns instead.
section_columns = ['URL', 'Background', 'Personality', 'Appearance', 'Abilities']
df_cleaned = df_combined[section_columns]
# Drop rows with any null values in those columns
df_cleaned = df_cleaned.dropna(axis=0, how='any')
# Save the cleaned DataFrame to a new CSV file
df_cleaned.to_csv('konan_cleaned_info.csv', index=False)

# Print the summary of missing values to verify
print(df_cleaned.isnull().sum())

# Print the cleaned DataFrame
print("Cleaned DataFrame:")
print(df_cleaned)
df_cleaned



URL            0.0
Background     0.0
Personality    0.0
Appearance     0.0
Abilities      0.0
0              0.0
1              0.0
2              0.0
3              0.0
4              0.0
5              0.0
6              0.0
7              0.0
8              0.0
9              0.0
10             0.0
11             0.0
12             0.0
13             0.0
14             0.0
15             0.0
16             0.0
17             0.0
18             0.0
19             0.0
20             0.0
dtype: float64
Cleaned DataFrame:
                                  URL  \
0  https://akat.fandom.com/wiki/Konan   
1  https://hero.fandom.com/wiki/Konan   
2  https://hero.fandom.com/wiki/Konan   
3  https://hero.fandom.com/wiki/Konan   
4                                 NaN   

                                          Background  \
0  [1][2]Konan as a child.When she was young, Kon...   
1  When she was only a child, Konan's parents was...   
2  When she was only a child, Konan's parents was...   
3

Unnamed: 0,URL,Background,Personality,Appearance,Abilities,0,1,2,3,4,...,11,12,13,14,15,16,17,18,19,20
0,https://akat.fandom.com/wiki/Konan,"[1][2]Konan as a child.When she was young, Kon...","Konan is stoic, calm, and level-headed (much l...","Konan is a woman who has short blue hair, gray...",[3][4]Konan's first shown origami. [5][6]Konan...,,,,,,...,,,,,,,,,,
1,https://hero.fandom.com/wiki/Konan,"When she was only a child, Konan's parents was...","Konan was a stoic, calm, and level-headed woma...","Konan has blue hair, gray eyes (amber in the a...",Konan is an S-rank kunoichi and her skills wer...,,,,,,...,,,,,,,,,,
2,https://hero.fandom.com/wiki/Konan,"When she was only a child, Konan's parents was...","Konan was a stoic, calm, and level-headed woma...","Konan has blue hair, gray eyes (amber in the a...",Konan is an S-rank kunoichi and her skills wer...,,,,,,...,,,,,,,,,,
3,https://hero.fandom.com/wiki/Konan,"When she was only a child, Konan's parents was...","Konan was a stoic, calm, and level-headed woma...","Konan has blue hair, gray eyes (amber in the a...",Konan is an S-rank kunoichi and her skills wer...,,,,,,...,,,,,,,,,,
4,,,,,,"['https://hero.fandom.com/wiki/Konan', 'Naruto']","['https://hero.fandom.com/wiki/Konan', '""Yahik...","['https://hero.fandom.com/wiki/Konan', 'Naruto']","['https://hero.fandom.com/wiki/Konan', 'White']","['https://hero.fandom.com/wiki/Konan', 'Naruto']",...,"['https://hero.fandom.com/wiki/Konan', 'Leader...","['https://hero.fandom.com/wiki/Konan', 'Hidden...","['https://hero.fandom.com/wiki/Konan', 'Hidden...","['https://hero.fandom.com/wiki/Konan', 'Hidden...","['https://hero.fandom.com/wiki/Konan', 'Hidden...","['https://hero.fandom.com/wiki/Konan', 'Tailed...","['https://hero.fandom.com/wiki/Konan', 'Others']","['https://hero.fandom.com/wiki/Konan', 'Boruto...","['https://hero.fandom.com/wiki/Konan', 'New Ge...","['https://hero.fandom.com/wiki/Konan', 'Non-ca..."


In [None]:
from google.colab import files

# Save the cleaned DataFrame to a new CSV file
df_cleaned.to_csv('konan_cleaned_info.csv', index=False)

# Save the cleaned DataFrame to a text file with tab delimiters
df_cleaned.to_csv('konan_cleaned_info.txt', sep='\t', index=False)

# Download the text file
files.download('konan_cleaned_info.txt')


# Activity 1: Scrape Facts from Websites




Visit this link: https://docs.google.com/document/d/1O_0UKeGdoFE3jhLASUQ2LJ1WCkLOfl2rHtpg2_z3nPo/edit?usp=sharing