In [3]:
# Import BeautifulSoup from the bs4 library for parsing HTML and XML documents
from bs4 import BeautifulSoup

# Import urllib.request for fetching data across the web
# (importing the submodule also makes the top-level urllib name available)
import urllib.request

# Import the re module for regular expression operations
import re

In [4]:
# Open the URL using urllib's urlopen function
# This URL points to a raw HTML file hosted on GitHub
with urllib.request.urlopen('https://raw.githubusercontent.com/BigDataGal/Data-Mania-Demos/master/IoT-2018.html') as response:
    # Read the content of the response, which is the HTML data from the URL
    html = response.read()

In [20]:
# Parse the HTML content using BeautifulSoup and specify the parser as 'html.parser'
# This creates a BeautifulSoup object that represents the document as a nested data structure
soup = BeautifulSoup(html, 'html.parser')

# Check the type of the soup object to confirm it's a BeautifulSoup object
type(soup)

bs4.BeautifulSoup
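
To make the parse step easier to experiment with, here is a minimal sketch that builds a soup from a small inline document instead of the downloaded page. The markup in `sample_html` and the names `sample_soup`, `title_text`, and `first_p_class` are made up for illustration; only `BeautifulSoup` and `'html.parser'` come from the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
# response.read() returns bytes, so bytes are used here as well.
sample_html = (
    b'<html><head><title>IoT Articles</title></head>'
    b'<body><p class="title"><b>Demo</b></p></body></html>'
)

# BeautifulSoup accepts bytes as well as str and decodes bytes itself
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Tags can be reached as attributes of the soup; .string is the text inside
title_text = sample_soup.title.string

# Attributes are read like dictionary keys; class is multi-valued, so it is a list
first_p_class = sample_soup.p['class']

print(title_text)
print(first_p_class)
```

Attribute-style access such as `sample_soup.p` returns only the first matching tag, which is why the later cells switch to `find_all`.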

## Parsing Data

In [21]:
# Use the prettify method to format the BeautifulSoup object into a nicely indented string
# Print the formatted HTML to make it more readable
print(soup.prettify())

<html>
 <head>
  <title>
   IoT Articles
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    2018 Trends: Best New IoT Device Ideas for Data Scientists and Engineers
   </b>
  </p>
  <p class="description">
   It’s almost 2018 and IoT is on the cusp of an explosive expansion. In this article, I offer you a listing of new IoT device ideas that you can use...
   <br/>
   <br/>
   It’s almost 2018 and IoT is on the cusp of an explosive expansion. In this article, I offer you a listing of new IoT device ideas that you can use to get practice in designing your first IoT applications.
   <h1>
    Looking Back at My Coolest IoT Find in 2017
   </h1>
   Before going into detail about best new IoT device ideas, here’s the backstory.
   <span style="text-decoration: underline;">
    <strong>
     <a class="preview" href="http://bit.ly/LPlNDJj" id="link 1">
      Last month Ericsson Digital invited me
     </a>
    </strong>
   </span>
   to tour the Ericsson Studio in Kista, Sweden. Up un

## Getting Data from a Parse Tree

In [22]:
# Extract all the text from the HTML document using the get_text method
# This removes all the HTML tags and returns only the text content
text_only = soup.get_text()

# Print the extracted text to display the content of the HTML document without any tags
print(text_only)

IoT Articles

2018 Trends: Best New IoT Device Ideas for Data Scientists and Engineers
It’s almost 2018 and IoT is on the cusp of an explosive expansion. In this article, I offer you a listing of new IoT device ideas that you can use...


It’s almost 2018 and IoT is on the cusp of an explosive expansion. In this article, I offer you a listing of new IoT device ideas that you can use to get practice in designing your first IoT applications.
Looking Back at My Coolest IoT Find in 2017
Before going into detail about best new IoT device ideas, here’s the backstory. Last month Ericsson Digital invited me to tour the Ericsson Studio in Kista, Sweden. Up until that visit, IoT had been largely theoretical to me. Of course, I know the usual mumbo-jumbo about wearables and IoT-connected fitness trackers. That stuff is all well and good, but it’s somewhat old hat – plus I am not sure we are really benefiting so much from those, so I’m not that impressed.

It wasn’t until I got to the Ericsson Stu
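
The blank lines in the output above come from whitespace between tags. As a rough sketch, `get_text` also takes `separator` and `strip` arguments that can tidy this up. The sample markup and variable names below are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with stray whitespace between and inside tags
sample_html = '<p>First <b>bold</b> sentence.</p>\n<p>  Second paragraph.  </p>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Default behaviour: text nodes are concatenated exactly as they appear
raw_text = sample_soup.get_text()

# strip=True trims each text node and drops empty ones;
# separator is placed between the remaining pieces
clean_text = sample_soup.get_text(separator=' ', strip=True)

print(repr(raw_text))
print(repr(clean_text))
```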

In [23]:
# Find all occurrences of the <li> tag in the HTML document
# This returns a list of all <li> elements, which are typically used for list items
soup.find_all('li')

[<li><strong>Big Data</strong> &amp; Data Engineering: Sensors that are embedded within IoT devices spin off machine-generated data like it’s going out of style. For IoT to function, the platform must be solidly engineered to handle big data. Be assured, that requires some serious data engineering.</li>,
 <li><strong>Machine Learning</strong> Data Science: While a lot of IoT devices are still operated according to rules-based decision criteria, the age of artificial intelligence is upon us. IoT will increasingly depend on machine learning algorithms to control device operations so that devices are able to autonomously respond to a complex set of overlapping stimuli.</li>,
 <li><strong>Blockchain</strong>-Enabled Security: Above all else, IoT networks must be secure. Blockchain technology is primed to meet the security demands that come along with building and expanding the IoT.</li>,
 <li>Enable built-in sensing to build a weather station that measures ambient temperature and humidity<

In [24]:
# Search for all elements whose id attribute equals 'link 7'
# find_all accepts attribute filters as keyword arguments; an id is meant to be unique within a page
soup.find_all(id='link 7')

[<a class="preview" href="http://www.skyfilabs.com/iot-online-courses" id="link 7">SkyFi</a>]
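
Because an id should identify a single element, `find` is often a closer fit than `find_all`: it returns the first match directly, or `None` when nothing matches. The sketch below uses made-up markup and example.com URLs to show the difference, along with a class filter.

```python
from bs4 import BeautifulSoup

# Hypothetical links modelled on the ones in the article
sample_html = (
    '<a class="preview" href="http://example.com/a" id="link 1">A</a>'
    '<a class="preview" href="http://example.com/b" id="link 2">B</a>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find_all always returns a list-like ResultSet, even for a single match
matches = sample_soup.find_all(id='link 2')

# find returns the first matching Tag, or None if there is no match
first = sample_soup.find(id='link 2')
missing = sample_soup.find(id='link 99')

# class is a reserved word in Python, so the keyword argument is class_
by_class = sample_soup.find_all('a', class_='preview')

print(first['href'])
print(missing)
print(len(by_class))
```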

In [25]:
# Find all occurrences of the <ol> tag in the HTML document
# The <ol> tag is used to define ordered lists in HTML
soup.find_all('ol')

[<ol>
 <li><strong>Big Data</strong> &amp; Data Engineering: Sensors that are embedded within IoT devices spin off machine-generated data like it’s going out of style. For IoT to function, the platform must be solidly engineered to handle big data. Be assured, that requires some serious data engineering.</li>
 <li><strong>Machine Learning</strong> Data Science: While a lot of IoT devices are still operated according to rules-based decision criteria, the age of artificial intelligence is upon us. IoT will increasingly depend on machine learning algorithms to control device operations so that devices are able to autonomously respond to a complex set of overlapping stimuli.</li>
 <li><strong>Blockchain</strong>-Enabled Security: Above all else, IoT networks must be secure. Blockchain technology is primed to meet the security demands that come along with building and expanding the IoT.</li>
 </ol>,
 <ol>
 <li>Enable built-in sensing to build a weather station that measures ambient temperat

In [26]:
# Find all occurrences of both <ol> and <b> tags in the HTML document
# This method allows searching for multiple tags at once by passing a list of tag names
# <ol> tags define ordered lists, and <b> tags are used to bold text in HTML
soup.find_all(['ol', 'b'])

[<b>2018 Trends: Best New IoT Device Ideas for Data Scientists and Engineers</b>,
 <ol>
 <li><strong>Big Data</strong> &amp; Data Engineering: Sensors that are embedded within IoT devices spin off machine-generated data like it’s going out of style. For IoT to function, the platform must be solidly engineered to handle big data. Be assured, that requires some serious data engineering.</li>
 <li><strong>Machine Learning</strong> Data Science: While a lot of IoT devices are still operated according to rules-based decision criteria, the age of artificial intelligence is upon us. IoT will increasingly depend on machine learning algorithms to control device operations so that devices are able to autonomously respond to a complex set of overlapping stimuli.</li>
 <li><strong>Blockchain</strong>-Enabled Security: Above all else, IoT networks must be secure. Blockchain technology is primed to meet the security demands that come along with building and expanding the IoT.</li>
 </ol>,
 <ol>
 <li

In [27]:
# Compile a regular expression pattern that matches the letter 't'
# BeautifulSoup applies the pattern with search(), so any tag whose name contains 't' matches
t = re.compile('t')

# Iterate over all tags in the HTML document whose names match the regular expression pattern
for tag in soup.find_all(t):
    # Print the name of each tag that matches the pattern
    print(tag.name)

html
title
strong
strong
strong
strong
strong
strong
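
The pattern matches anywhere in the tag name, which is why `html` and `strong` appear in the output. As a small sketch on made-up markup, anchoring the pattern with `^` restricts the search to names that start with `t`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample containing several tag names with the letter 't'
sample_html = (
    '<html><head><title>T</title></head><body>'
    '<p><strong>x</strong></p>'
    '<table><tr><td>1</td></tr></table>'
    '</body></html>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Unanchored: any tag name containing 't'
contains_t = [tag.name for tag in sample_soup.find_all(re.compile('t'))]

# Anchored with ^: only tag names beginning with 't'
starts_with_t = [tag.name for tag in sample_soup.find_all(re.compile('^t'))]

print(contains_t)
print(starts_with_t)
```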


In [28]:
# Iterate over all tags in the HTML document
# Passing True to find_all() retrieves all tags, regardless of their name
for tag in soup.find_all(True):
    # Print the name of each tag
    print(tag.name)

html
head
title
body
p
b
p
br
br
h1
span
strong
a
a
a
img
a
span
strong
a
h1
ol
li
strong
li
strong
li
strong
h1
a
a
a
h2
ol
li
li
li
li
li
li
h2
ol
li
li
li
li
li
a
img
h2
ol
li
li
li
li
li
h2
ol
li
li
li
li
span
strong
a
em
p
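
A long list of tag names is easier to read as a tally. One possible approach, sketched here on a tiny made-up document, is to feed the names into `collections.Counter`.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical sample document
sample_html = '<ol><li>a</li><li>b</li></ol><p>c</p>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find_all(True) yields every tag; Counter tallies the names
tag_counts = Counter(tag.name for tag in sample_soup.find_all(True))

# most_common(1) gives the most frequent tag name and its count
print(tag_counts.most_common(1))
```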


In [29]:
# Iterate over all <a> tags in the HTML document
# <a> tags define hyperlinks in HTML
for link in soup.find_all('a'):
    # Print the value of the 'href' attribute for each <a> tag
    # The 'href' attribute contains the URL that the hyperlink points to
    print(link.get('href'))

http://bit.ly/LPlNDJj
http://www.data-mania.com/blog/m2m-vs-iot/
bit.ly/LPlNDJj
http://mat.se/
http://bit.ly/LPlNDJj
https://click.linksynergy.com/deeplink?id=*JDLXjeE*wk&mid=39197&murl=https%3A%2F%2Fwww.udemy.com%2Ftopic%2Finternet-of-things%2F%3Fsort%3Dhighest-rated
http://www.skyfilabs.com/iot-online-courses
https://www.coursera.org/specializations/iot
bit.ly/LPlNDJj
http://bit.ly/LPlNDJj
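
Some of the values above, such as `bit.ly/LPlNDJj`, have no scheme. The sketch below uses only the standard library to add one and remove duplicates. The helper `with_scheme` is hypothetical, and it assumes that a scheme-less href was meant as an absolute address; a browser would instead treat it as a path relative to the current page.

```python
from urllib.parse import urlparse

# A few href values copied from the output above
hrefs = [
    'http://bit.ly/LPlNDJj',
    'bit.ly/LPlNDJj',
    'http://mat.se/',
]

def with_scheme(href, default_scheme='http'):
    """Return href unchanged if it already has a scheme, else prefix one.

    Hypothetical helper: assumes scheme-less values are absolute addresses.
    """
    if urlparse(href).scheme:
        return href
    return f'{default_scheme}://{href}'

normalized = [with_scheme(h) for h in hrefs]

# A set removes duplicates; sorting makes the order predictable
unique = sorted(set(normalized))

print(unique)
```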


In [30]:
# Find all text nodes in the HTML document that contain the substring 'data'
# The regular expression is case-sensitive, so a text node containing only 'Data' is not returned
soup.find_all(string=re.compile('data'))

[' & Data Engineering: Sensors that are embedded within IoT devices spin off machine-generated data like it’s going out of style. For IoT to function, the platform must be solidly engineered to handle big data. Be assured, that requires some serious data engineering.']
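
The result holds text nodes rather than tags, and the match depends on case. A minimal sketch on made-up markup shows how `re.IGNORECASE` widens the match and how `.parent` leads from each text node back to its enclosing tag.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical list modelled on the article's first ordered list
sample_html = (
    '<ul>'
    '<li><strong>Big Data</strong> needs data engineering.</li>'
    '<li>Data Science</li>'
    '<li>Security</li>'
    '</ul>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Case-sensitive: only text nodes containing lowercase 'data'
lower_only = sample_soup.find_all(string=re.compile('data'))

# Case-insensitive: 'Data' now matches as well
any_case = sample_soup.find_all(string=re.compile('data', re.IGNORECASE))

# Each result is a text node; .parent is the tag that contains it
parents = [s.parent.name for s in any_case]

print(lower_only)
print(parents)
```

Text inside `<strong>` is a separate node from the text that follows it, which is why the string in the output above starts after the bolded "Big Data".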