The following code fetches the page at 'http://pythonscraping.com/pages/page1.html' and prints its raw HTML. Because `read()` returns the response body as bytes, the output appears as a `b'...'` literal with escaped newlines. This is a basic example of web scraping, the process of programmatically extracting data from websites. Scrape responsibly and in accordance with each website's terms of use.

In [None]:
from urllib.request import urlopen
html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())


b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
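
The `b'...'` prefix in the output above indicates bytes rather than text. A minimal sketch of turning such bytes into a string, assuming the page is UTF-8 encoded (an assumption, not something the response shown here confirms); `raw` is a shortened stand-in for the real response body:

```python
# Shortened stand-in for the bytes returned by html.read(); not a live response.
raw = b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n'

# Assumption: the page is UTF-8 encoded.
text = raw.decode('utf-8')

# Printing the decoded string shows real line breaks instead of \n escapes.
print(text)
```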


This code uses `urlopen` from `urllib.request` to fetch and print the content of several webpages. The `urls` list holds three page URLs. The loop opens each one with `urlopen`, reads the response with `html.read()`, and prints the raw HTML (as bytes) together with the URL it came from. As before, follow responsible scraping practices, including each website's terms of use.

In [None]:
from urllib.request import urlopen

# List of URLs
urls = [
    'http://pythonscraping.com/pages/page1.html',
    'http://pythonscraping.com/pages/page2.html',
    'http://pythonscraping.com/pages/page3.html'
]

# Loop through the URLs and fetch their content
for url in urls:
    html = urlopen(url)
    print(f"Content from {url}:\n{html.read()}\n")


Content from http://pythonscraping.com/pages/page1.html:
b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'

Content from http://pythonscraping.com/pages/page2.html:
b'\n<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div class="body" id="fakeLatin">\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitati
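
A loop like the one above stops at the first URL that fails to load. One possible way to keep going is sketched below. The `fetch_all` helper, its `opener` parameter, and `fake_opener` are hypothetical names introduced only for illustration; `fake_opener` stands in for `urlopen` so the example runs without network access.

```python
import io
from urllib.request import urlopen
from urllib.error import URLError

def fetch_all(urls, opener=urlopen, timeout=10):
    """Return a dict mapping each URL to its raw bytes, or None on failure."""
    pages = {}
    for url in urls:
        try:
            # The with-block closes the response once it has been read.
            with opener(url, timeout=timeout) as response:
                pages[url] = response.read()
        except URLError:
            # HTTPError is a subclass of URLError, so both are caught here.
            pages[url] = None
    return pages

# Hypothetical offline stand-in for urlopen, used only for this demonstration.
def fake_opener(url, timeout=10):
    if url.endswith('missing.html'):
        raise URLError('simulated failure')
    return io.BytesIO(b'<html></html>')

pages = fetch_all(
    ['http://pythonscraping.com/pages/page1.html',
     'http://pythonscraping.com/pages/missing.html'],
    opener=fake_opener,
)
print(pages)
```

Calling `fetch_all(urls)` without the `opener` argument would use the real `urlopen`.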

This code imports `urlopen` from `urllib.request` and the `BeautifulSoup` class from the `bs4` (Beautiful Soup 4) library, which is commonly used for parsing and navigating HTML. `urlopen` fetches the page at 'http://www.pythonscraping.com/pages/page1.html', and the `BeautifulSoup` constructor parses the response into an object named `bs`. Finally, `print(bs.h1)` prints the first `<h1>` (level-1 heading) tag on the page, including its markup.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)


<h1>An Interesting Title</h1>
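
The output above is the whole tag, markup included. A small sketch of the difference between the tag object and its text, using an inline copy of the relevant part of the page so that no network access is needed:

```python
from bs4 import BeautifulSoup

# Inline copy of the relevant part of page1.html, so no request is made.
html = '<html><body><h1>An Interesting Title</h1></body></html>'
bs = BeautifulSoup(html, 'html.parser')

heading = bs.h1
print(heading)             # the Tag object, printed with its markup
print(heading.name)        # the tag name
print(heading.get_text())  # only the text inside the tag
```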


This code imports the `BeautifulSoup` class from `bs4` and the `urlopen` function from `urllib.request`, then retrieves the page at 'http://pythonscraping.com/pages/page1.html'. The `bs` object must be created from the fetched content with `bs = BeautifulSoup(html.read(), 'html.parser')` before it is printed; without that line, `print(bs)` raises a `NameError` unless `bs` is still defined from an earlier cell. Printing the `BeautifulSoup` object displays the parsed document, and the object provides methods to navigate and extract information from the page's structure.

In [None]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen('http://pythonscraping.com/pages/page1.html')
# Parse the response so that bs is defined before it is printed
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs)



<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>



This code defines a function named `getTitle` that takes a URL as its argument. It uses `urlopen` from `urllib.request` to fetch the page and returns `None` if an `HTTPError` occurs. It then parses the HTML with BeautifulSoup and looks up the first `<h1>` tag inside `<body>`. If the document has no `<body>`, `bs.body` is `None` and the attribute access raises `AttributeError`, which the function catches and converts to `None`; if `<body>` exists but contains no `<h1>`, `bs.body.h1` is itself `None`. Either way the caller receives `None`. The code then calls the function with a specific URL and assigns the result to `title`. If `title` is `None`, it prints 'Not found topic'; otherwise it prints the tag. This demonstrates how to handle HTTP errors, parse HTML, and extract information with BeautifulSoup.

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None

    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        return None

    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')

if title is None:
    print('Not found topic')
else:
    print(title)


<h1>An Interesting Title</h1>
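
The parsing half of `getTitle` can be tried without any network request. The sketch below uses a hypothetical helper, `extract_title`, which takes markup directly instead of a URL; it illustrates the two ways the lookup can come back empty and is not a replacement for the function above.

```python
from bs4 import BeautifulSoup

def extract_title(markup):
    """Return the first <h1> inside <body>, or None if either is missing."""
    bs = BeautifulSoup(markup, 'html.parser')
    try:
        # bs.body is None when there is no <body>, so .h1 raises AttributeError.
        return bs.body.h1
    except AttributeError:
        return None

with_heading = extract_title('<html><body><h1>An Interesting Title</h1></body></html>')
# <body> exists but has no <h1>: bs.body.h1 is simply None, no exception.
no_heading = extract_title('<html><body><p>No heading here</p></body></html>')
# No <body> at all: the AttributeError branch is taken.
no_body = extract_title('<p>Just a fragment</p>')
print(with_heading, no_heading, no_body)
```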


This code defines the same `getTitle` function: it fetches a page with `urlopen`, returns `None` if an `HTTPError` occurs, parses the HTML with BeautifulSoup, and looks up the first `<h1>` tag inside `<body>`, returning `None` if the lookup fails. The code then calls `getTitle` with a specific URL and stores the result in `title`. If `title` is `None`, it prints 'Not found'; otherwise it prints the tag. This demonstrates error handling and HTML parsing with BeautifulSoup for extracting specific information from a web page.

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None

    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        return None

    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')

if title is None:
    print('Not found')
else:
    print(title)


<h1>An Interesting Title</h1>


This code imports the necessary modules, fetches a page with `urlopen`, and parses it with BeautifulSoup using the specified parser. It then loops over the children of the table whose `id` attribute is 'giftList' and prints each one, displaying the rows of that table. Note that `.children` yields the whitespace strings between rows as well as the `<tr>` tags, which is why blank lines appear in the output. This demonstrates how to access the direct children of a specific element with BeautifulSoup.

In [None]:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
for child in bs.find('table',{'id':'giftList'}).children:
    print(child)




<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
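
The blank lines in the output above come from whitespace strings that `.children` yields alongside the row tags. One way to keep only the tags is sketched below, using a shortened inline stand-in for the gift table (two columns, two gifts) rather than the live page:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened inline stand-in for the giftList table; not the full page3.html.
html = """
<table id="giftList">
<tr><th>Item Title</th><th>Cost</th></tr>
<tr class="gift" id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr class="gift" id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>
"""
bs = BeautifulSoup(html, 'html.parser')
table = bs.find('table', {'id': 'giftList'})

# Keep only Tag children; the newline strings between rows are skipped.
rows = [child for child in table.children if isinstance(child, Tag)]

for row in rows:
    # Collect the stripped text of every header or data cell in the row.
    cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    print(cells)
```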


This code uses BeautifulSoup to illustrate how to navigate between related HTML elements. It defines an HTML string with nested `<div>` elements and parses it into a BeautifulSoup object. It finds the element with the ID 'parent', then locates all elements inside it that have the class 'child' and prints their text in a loop. It next finds the parent of the first child and prints the parent's text, which includes the text of all three children. Lastly, it finds the siblings that follow the first child and prints their text as well. Overall, the example shows how BeautifulSoup can be used to move between children, parents, and siblings.

In [None]:
from bs4 import BeautifulSoup

# Assume html_str holds the HTML text of the page to parse
html_str = """
<html>
  <body>
    <div id="parent">
      <div class="child">First Child</div>
      <div class="child">Second Child</div>
      <div class="child">Third Child</div>
    </div>
  </body>
</html>
"""

# Create a Beautiful Soup object using HTML text
soup = BeautifulSoup(html_str, 'html.parser')

# Find the parent element
parent_element = soup.find('div', id='parent')

# Find all child elements
child_elements = parent_element.find_all('div', class_='child')

# Display the content of child elements
for child in child_elements:
    print(child.text)

# Find parent element based on child elements
parent_of_children = child_elements[0].find_parent()

# Display the content of the parent element
print(parent_of_children.text)

# Find siblings of an element
sibling_elements = child_elements[0].find_next_siblings()

# Display the content of the sibling elements
for sibling in sibling_elements:
    print(sibling.text)


First Child
Second Child
Third Child

First Child
Second Child
Third Child

Second Child
Third Child
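
Navigation also works in the opposite direction. The following sketch reuses the same three-child markup (without the outer `<html>` and `<body>` wrappers) to show `find_previous_siblings()` and the `.parent` attribute:

```python
from bs4 import BeautifulSoup

html_str = """
<div id="parent">
  <div class="child">First Child</div>
  <div class="child">Second Child</div>
  <div class="child">Third Child</div>
</div>
"""
soup = BeautifulSoup(html_str, 'html.parser')
children = soup.find_all('div', class_='child')

# Siblings before the last child, returned closest-first.
previous = [tag.text for tag in children[-1].find_previous_siblings()]
print(previous)

# .parent reaches the enclosing element directly.
print(children[0].parent['id'])
```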


This code snippet demonstrates how to utilize the BeautifulSoup library in Python to parse HTML content and extract specific elements. It starts by defining an HTML content string and then uses BeautifulSoup to parse it. The code employs the `find_all` method to locate all anchor ('a') tags within the parsed HTML and subsequently prints the value of their href attributes. Additionally, it employs the `find` method to extract the text content of the first paragraph ('p') tag and prints it. This example showcases how BeautifulSoup simplifies the process of navigating and extracting data from HTML documents.

In [None]:
from bs4 import BeautifulSoup

# HTML content to parse
html_content = """
<html>
   <body>
       <h1>Welcome to my website</h1>
       <p>This is a paragraph.</p>
       <a href="https://www.example.com">Visit Example</a>
   </body>
</html>
"""

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Using find_all to extract all 'a' tags
all_links = soup.find_all('a')
for link in all_links:
   print(link['href'])  # Print the href attribute of each link

# Using find to extract the first 'p' tag
first_paragraph = soup.find('p').text
print(first_paragraph)


https://www.example.com
This is a paragraph.
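
Indexing a tag with `link['href']` raises `KeyError` when the attribute is absent. A short sketch of the more forgiving `get` method follows; the second anchor in this snippet is hypothetical and was added only to show the missing-attribute case:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second anchor has no href attribute.
html_content = """
<a href="https://www.example.com">Visit Example</a>
<a name="top">Named anchor without href</a>
"""
soup = BeautifulSoup(html_content, 'html.parser')

hrefs = []
for link in soup.find_all('a'):
    # link['href'] would raise KeyError for the second anchor; get returns None.
    hrefs.append(link.get('href'))
print(hrefs)
```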


This code snippet illustrates how to use BeautifulSoup to search for elements by attribute. It demonstrates two scenarios: first, it finds all anchor ('a') tags with a specific href attribute value ('https://www.example.com') and prints their text content. Second, it searches for the first anchor tag with a specific CSS class ('special-link') and prints its href attribute value. Because the sample HTML contains no element with that class, `find` returns `None` and the fallback message is printed instead. This showcases the library's ability to locate elements based on attributes and classes.

In [None]:
from bs4 import BeautifulSoup

# Your HTML content here
html_content = """
<html>
   <body>
       <h1>Welcome to my website</h1>
       <p>This is a paragraph.</p>
       <a href="https://www.example.com">Visit Example</a>
   </body>
</html>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# Using find_all to search for tags with specific attributes
specific_links = soup.find_all('a', href='https://www.example.com')
for link in specific_links:
    print(link.text)

# Using find to search for the first 'a' tag with a specific class
specific_link = soup.find('a', class_='special-link')
if specific_link is not None:
    print(specific_link['href'])
else:
    print("No link with class 'special-link' found.")



Visit Example
No link with class 'special-link' found.
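
To see the class search succeed, the markup needs a matching element. In the sketch below, the second link and its URL are invented for illustration and are not part of the original sample:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link and its URL are invented for illustration.
html_content = """
<a href="https://www.example.com">Visit Example</a>
<a class="special-link" href="https://www.example.com/special">Special</a>
"""
soup = BeautifulSoup(html_content, 'html.parser')

# class_ has a trailing underscore because class is a reserved word in Python.
specific_link = soup.find('a', class_='special-link')
print(specific_link['href'])

# The same search written with an attrs dictionary.
same_link = soup.find('a', attrs={'class': 'special-link'})
print(same_link['href'])
```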


This code snippet showcases BeautifulSoup's navigation features, reusing the `soup` object created in the previous cell. It uses `soup.find('p')` to locate the first 'p' (paragraph) tag and prints its text with `paragraph.text`. It then prints the name of the paragraph's parent tag using `paragraph.parent.name`. A loop iterates over the paragraph's children and prints each one; here the only child is the text itself. Finally, it looks for sibling paragraphs with `paragraph.find_next_sibling('p')` and `paragraph.find_previous_sibling('p')`. The results are stored but not printed, and because the sample HTML contains only one 'p' tag, both are `None`.

In [None]:
# Navigating through parent, children, and siblings
paragraph = soup.find('p')
print(paragraph.text)
print(paragraph.parent.name)  # Print the parent tag's name

for child in paragraph.children:
   print(child)

# Navigating through siblings
next_sibling = paragraph.find_next_sibling('p')
previous_sibling = paragraph.find_previous_sibling('p')


This is a paragraph.
body
This is a paragraph.
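
The sibling lookups above produce no output. A self-contained sketch that prints their results, using the same sample HTML, might look like this:

```python
from bs4 import BeautifulSoup

html_content = """
<html>
   <body>
       <h1>Welcome to my website</h1>
       <p>This is a paragraph.</p>
       <a href="https://www.example.com">Visit Example</a>
   </body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
paragraph = soup.find('p')

# There is no other <p> in the sample, so this lookup finds nothing.
next_p = paragraph.find_next_sibling('p')
# Siblings with other tag names are found as usual.
previous_h1 = paragraph.find_previous_sibling('h1')
next_a = paragraph.find_next_sibling('a')

print(next_p)
print(previous_h1.text)
print(next_a['href'])
```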


This code snippet demonstrates further navigation with BeautifulSoup. It imports the library, provides HTML content as a string, and parses it into a BeautifulSoup object. It then walks up through the ancestors of the 'a' (hyperlink) tag and prints their names, ending with `[document]`, the name of the BeautifulSoup object itself. Next, it iterates over the descendants of the 'body' tag and prints the name of each one that is a tag, skipping text nodes, whose `name` is `None`. Finally, it collects all text strings inside the 'a' tag and its descendants, a useful technique for extracting text from HTML elements.

In [None]:
from bs4 import BeautifulSoup

# Your HTML content here
html_content = """
<html>
   <body>
       <h1>Welcome to my website</h1>
       <p>This is a paragraph.</p>
       <a href="https://www.example.com">Visit Example</a>
   </body>
</html>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# Navigating through ancestors
link = soup.find('a')
ancestors = link.find_parents()
for ancestor in ancestors:
    print(ancestor.name)

# Navigating through descendants
descendants = soup.find('body').descendants
for descendant in descendants:
    if descendant.name is not None:
        print(descendant.name)

# Searching for all text within a tag and its descendants
# (string=True replaces the deprecated text=True argument)
all_text = link.find_all(string=True)
print(all_text)



body
html
[document]
h1
p
a
['Visit Example']
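
For text extraction across a larger part of the document, `stripped_strings` and `get_text` are common alternatives. The sketch below applies them to the same sample HTML; the choice of a single space as separator is only an illustration:

```python
from bs4 import BeautifulSoup

html_content = """
<html>
   <body>
       <h1>Welcome to my website</h1>
       <p>This is a paragraph.</p>
       <a href="https://www.example.com">Visit Example</a>
   </body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
link = soup.find('a')

# All strings inside the link, using the string argument.
link_strings = link.find_all(string=True)
print(link_strings)

# Every non-empty string under <body>, with surrounding whitespace removed.
body_text = list(soup.body.stripped_strings)
print(body_text)

# The same strings joined into one line, separated by single spaces.
joined = soup.body.get_text(' ', strip=True)
print(joined)
```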




