The cells below install and import the required libraries.

In [5]:
!pip install beautifulsoup4==4.10.0 requests


In [4]:
from bs4 import BeautifulSoup  # Parses HTML and XML for web scraping
import requests  # Downloads web pages

# Beautiful Soup Objects

Beautiful Soup is a Python library for pulling data out of HTML and XML files; we will focus on HTML files. It represents the HTML as a set of objects with methods used to parse the HTML. We can navigate the HTML as a tree, filter out what we are looking for, or both.

Consider the following HTML:

In [17]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Abhishek</b></h3>
<p> Salary: $ 94,000,000 </p>
<h3> Akash</h3>
<p> Salary: $89,000,000 </p>
<h3> Mohini </h3>
<p> Salary: $73,200,000</p>
</body>
</html>

We can store it as a string in the variable <code>html</code>:


In [20]:
html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Abhishek</b></h3><p> Salary: $ 94,000,000 </p><h3>Akash</h3><p> Salary: $89,000, 000 </p><h3> Mohini </h3><p> Salary: $73,200, 000</p></body></html>"

To parse a document, pass it into the <code>BeautifulSoup</code> constructor. The constructor returns a <code>BeautifulSoup</code> object, which represents the document as a nested data structure:


In [21]:
soup=BeautifulSoup(html,"html.parser")

First, the document is converted to Unicode (a character set that includes ASCII as a subset), and HTML entities are converted to Unicode characters. Beautiful Soup then transforms the HTML document into a tree of Python objects. The <code>BeautifulSoup</code> object can create other types of objects, such as <code>Tag</code> and <code>NavigableString</code>.
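The entity conversion described above can be checked with a small sketch. The fragment and the names `fragment` and `entity_soup` are made up for illustration and are not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment containing two HTML entities: &eacute; and &amp;
fragment = "<p>Caf&eacute; &amp; Bar</p>"
entity_soup = BeautifulSoup(fragment, "html.parser")

# The parsed text holds the Unicode characters, not the entity codes.
text = entity_soup.p.string
print(text)
```

After parsing, the string should contain the characters themselves, so later string comparisons do not need to know about entity codes.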

We can use the method <code>prettify()</code> to display the HTML in the nested structure:


In [22]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Abhishek
   </b>
  </h3>
  <p>
   Salary: $ 94,000,000
  </p>
  <h3>
   Akash
  </h3>
  <p>
   Salary: $89,000, 000
  </p>
  <h3>
   Mohini
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>



# Tags

Let's say we want the title of the page and the name of the top-paid player. We can use the <code>Tag</code> object, which corresponds to an HTML tag in the original document, for example, the tag <code>title</code>.


In [23]:
tag_object=soup.title
print("tag_object:",tag_object)

tag_object: <title>Page Title</title>


We can see that the tag type is <code>bs4.element.Tag</code>:


In [24]:
print("tag object type:",type(tag_object))

tag object type: <class 'bs4.element.Tag'>


If there is more than one <code>Tag</code> with the same name, the first element with that <code>Tag</code> name is returned. Here, this corresponds to the highest-paid player:


In [25]:
tag_object=soup.h3
tag_object

<h3><b id="boldest">Abhishek</b></h3>
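As a small sketch of the "first match" behaviour, attribute access such as `soup.h3` can be compared with the `find()` method. The names `demo_html` and `demo_soup` are illustrative and use a shortened copy of the document above.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the player HTML used in this notebook.
demo_html = "<html><body><h3><b id='boldest'>Abhishek</b></h3><h3>Akash</h3></body></html>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

first_h3 = demo_soup.find("h3")   # find() returns the first matching tag
same_tag = first_h3 == demo_soup.h3
missing = demo_soup.h4            # there is no <h4> tag in the document

print(same_tag)
print(first_h3.name)
print(missing)
```

Because a missing tag yields `None` rather than an exception, it is worth checking the result before chaining further attribute access onto it.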

The player's name is enclosed in the bold tag <code>b</code>. To reach it, it helps to use the tree representation: we can navigate down the tree using the child attribute to get the name.


# Children, Parents, and Siblings

In [26]:
tag_child=tag_object.b
tag_child

<b id="boldest">Abhishek</b>

In [28]:
tag_parent=tag_child.parent
tag_parent

<h3><b id="boldest">Abhishek</b></h3>

In [29]:
sibling_1=tag_object.next_sibling
sibling_1

<p> Salary: $ 94,000,000 </p>
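In the `html` string above there is no whitespace between tags, which is why `next_sibling` lands directly on the `<p>` tag. The sketch below, using a made-up multi-line fragment (`pretty_html`), illustrates that with line breaks between tags the next sibling is a whitespace string, and that `find_next_sibling()` skips to the next tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a newline between the <h3> and <p> tags.
pretty_html = """<body>
<h3>Abhishek</h3>
<p>Salary: $94,000,000</p>
</body>"""
doc = BeautifulSoup(pretty_html, "html.parser")

h3 = doc.h3
raw_sibling = h3.next_sibling             # the newline text node between the tags
tag_sibling = h3.find_next_sibling("p")   # the next sibling that is a <p> tag

print(repr(raw_sibling))
print(tag_sibling)
```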

# HTML Attributes

A tag may have any number of attributes. The <code>b</code> tag above has an attribute <code>id</code> whose value is <code>boldest</code>. You can access a tag’s attributes by treating the tag like a dictionary:


In [30]:
tag_child['id']

'boldest'

In [31]:
tag_child.attrs #The dictionary can be accessed directly as attrs:

{'id': 'boldest'}

In [32]:
tag_child.get('id')

'boldest'
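The difference between the dictionary-style lookup and `get()` shows up when an attribute is absent. The following is a minimal sketch; the attribute name `title` and the variable `b_tag` are chosen only for illustration.

```python
from bs4 import BeautifulSoup

b_tag = BeautifulSoup("<b id='boldest'>Abhishek</b>", "html.parser").b

missing_value = b_tag.get("title")         # no such attribute: returns None
fallback = b_tag.get("title", "n/a")       # a default can be supplied
has_id = b_tag.has_attr("id")

try:
    b_tag["title"]                         # dictionary-style access raises instead
    raised = False
except KeyError:
    raised = True

print(missing_value, fallback, has_id, raised)
```

`get()` is therefore the safer choice when scraping pages where an attribute such as `href` or `src` may not be present on every tag.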

A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the <code>NavigableString</code> class to contain this text. In our HTML we can obtain the name of the first player by extracting the string of the <code>Tag</code> object <code>tag_child</code> as follows:


In [33]:
tag_string=tag_child.string
tag_string

'Abhishek'

In [34]:
type(tag_string)

bs4.element.NavigableString

A <code>NavigableString</code> is just like a Python string (a Unicode string, to be more precise). The main difference is that it also supports some <code>BeautifulSoup</code> features. We can convert it to a string object in Python:


In [35]:
unicode_string=str(tag_string)
unicode_string

'Abhishek'
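One caveat worth sketching: `.string` only returns text when a tag has a single child. For a tag with mixed content it returns `None`, and `get_text()` can be used instead. The fragment below is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical cell whose content is a text node followed by a nested <b> tag.
cell = BeautifulSoup("<td>Color <b>Name</b></td>", "html.parser").td

direct = cell.string        # None: the cell has two children, not one
combined = cell.get_text()  # concatenates all text inside the cell
nested = cell.b.string      # the nested tag has a single text child

print(direct)
print(combined)
print(nested)
```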

# Filter

Filters allow you to find complex patterns; the simplest filter is a string. In this section we will pass a string to different filter methods, and Beautiful Soup will perform a match against that exact string. Consider the following HTML of rocket launches:


In [36]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td>
    <td>80 kg</td>
  </tr>
</table>

| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |


In [39]:
table="<table><tr><td id='flight' >Flight No</td><td>Launch site</td><td>Payload mass</td></tr><tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td><td>80 kg</td></tr></table>"

In [40]:
table_soup=BeautifulSoup(table,"html.parser")

## find_all

The method signature is <code>find_all(name, attrs, recursive, string, limit, **kwargs)</code>.

When we set the <code>name</code> parameter to a tag name, the method extracts all the tags with that name, together with their children.


In [41]:
table_rows=table_soup.find_all('tr')
table_rows

[<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>,
 <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>]

The result is a <code>ResultSet</code>, which behaves like a Python list; each element is a <code>Tag</code> object:


In [43]:
first_row=table_rows[0]
first_row

<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>

In [44]:
print(type(first_row))

<class 'bs4.element.Tag'>


In [45]:
first_row.td

<td id="flight">Flight No</td>

If we iterate through the list, each element corresponds to a row in the table:


In [46]:
for i,row in enumerate(table_rows):
    print("row",i,"is",row)
    

row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>
row 1 is <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>
row 2 is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
row 3 is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>


In [48]:
for i, row in enumerate(table_rows):
    print("row",i)
    cells=row.find_all('td')
    for j,cell in enumerate(cells):
        print('column',j,"cell",cell)

row 0
column 0 cell <td id="flight">Flight No</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
column 2 cell <td>300 kg</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 cell <td>94 kg</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td>
column 2 cell <td>80 kg</td>
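The nested loops above print whole tags. A possible next step, sketched here on a shortened copy of the launch table (the names `launch_table` and `rows_as_text` are illustrative), is to keep only the cell text in a list of lists.

```python
from bs4 import BeautifulSoup

# Shortened copy of the launch table used above.
launch_table = (
    "<table>"
    "<tr><td id='flight'>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr>"
    "<tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td><td>80 kg</td></tr>"
    "</table>"
)
launch_soup = BeautifulSoup(launch_table, "html.parser")

# One inner list per row; get_text(strip=True) drops surrounding whitespace.
rows_as_text = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in launch_soup.find_all("tr")
]
print(rows_as_text)
```

A list of lists like this can be handed directly to other tools, for example as the rows of a pandas DataFrame.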


If we pass a list as the <code>name</code> parameter, Beautiful Soup matches against any item in that list.


In [49]:
list_input=table_soup.find_all(name=['tr','td'])
list_input

[<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>,
 <td>300 kg</td>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>,
 <td>94 kg</td>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>,
 <td>3</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td>,
 <td>80 kg</td>]

## string
With <code>string</code> you can search for strings instead of tags. Here we find all the strings that exactly match Florida:


In [50]:
table_soup.find_all(string="Florida")

['Florida', 'Florida']
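The other parameters in the `find_all` signature can be sketched in the same way. The example below uses a shortened copy of the launch table and illustrates filtering by attribute, capping the number of results with `limit`, and matching strings with a regular expression instead of an exact string. The variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the launch table used above.
launches = (
    "<table>"
    "<tr><td id='flight'>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr>"
    "<tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr>"
    "</table>"
)
launch_soup = BeautifulSoup(launches, "html.parser")

# Keyword arguments filter on attribute values.
by_id = launch_soup.find_all("td", id="flight")
florida_links = launch_soup.find_all("a", href="https://en.wikipedia.org/wiki/Florida")

# limit stops the search after the given number of matches.
first_two_rows = launch_soup.find_all("tr", limit=2)

# A compiled regular expression matches any string containing the pattern.
masses = launch_soup.find_all(string=re.compile(r"\bkg\b"))

print(by_id)
print(florida_links)
print(len(first_two_rows))
print(masses)
```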

# Downloading and Scraping The Contents of a Web Page

In [61]:
url="https://www.ibm.com"

In [62]:
data=requests.get(url).text

In [63]:
soup=BeautifulSoup(data,"html.parser")
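The download step above assumes the request succeeds. A slightly more defensive variant is sketched below; the helper name `fetch_soup` and the 10-second timeout are assumptions, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download url and parse it, raising an exception on HTTP errors.

    fetch_soup is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")


# Example call (requires network access):
# soup = fetch_soup("https://www.ibm.com")
```

Raising early on a failed request avoids silently parsing an error page as if it were the intended content.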

In [64]:
# Scrape all links
for link in soup.find_all('a',href=True):
    print(link.get('href'))

https://www.ibm.com/cloud?lnk=intro


In [65]:
#Scrape all images
for link in soup.find_all('img'):
    print(link)
    print(link.get('src'))

<img alt="Portraits of IBM consultants" class="bx--image__img" id="image--235469917" loading="lazy" src="/content/dam/adobe-cms/default-images/home-consultants.component.crop-16by9-xl.ts=1695214867398.jpg/content/adobe-cms/in/en/homepage/_jcr_content/root/table_of_contents/simple_image"/>
/content/dam/adobe-cms/default-images/home-consultants.component.crop-16by9-xl.ts=1695214867398.jpg/content/adobe-cms/in/en/homepage/_jcr_content/root/table_of_contents/simple_image
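The `src` value printed above is a relative path. A minimal sketch of turning such paths into absolute URLs with the standard library's `urljoin` follows; the sample markup and the names `base_url` and `sample_page` are made up for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.ibm.com"

# Hypothetical markup mixing relative and absolute references.
sample_page = (
    '<a href="/cloud">Cloud</a>'
    '<a href="https://example.com/x">External</a>'
    '<img src="/content/logo.png"/>'
)
sample_soup = BeautifulSoup(sample_page, "html.parser")

# urljoin leaves absolute URLs untouched and resolves relative ones against base_url.
links = [urljoin(base_url, a["href"]) for a in sample_soup.find_all("a", href=True)]
images = [urljoin(base_url, img["src"]) for img in sample_soup.find_all("img", src=True)]

print(links)
print(images)
```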


# Scrape data from HTML tables

In [66]:
#The below url contains an html table with data about colors and color codes.
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

In [67]:
data=requests.get(url).text

In [68]:
soup=BeautifulSoup(data,"html.parser")

In [69]:
table=soup.find('table')

In [70]:
for row in table.find_all('tr'):
    cols=row.find_all('td')
    color_name=cols[2].string
    color_code=cols[3].string
    print("{}--->{}".format(color_name,color_code))

Color Name--->None
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF


# Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas

In [71]:
import pandas as pd

In [72]:
#The below url contains html tables with data about world population.
url = "https://en.wikipedia.org/wiki/World_population"

In [73]:
data=requests.get(url).text

In [75]:
soup=BeautifulSoup(data,"html.parser")

In [76]:
tables=soup.find_all('table')

In [77]:
len(tables)

30

Assume that we are looking for the `10 most densely populated countries` table. We can look through the tables list and pick the right one based on the data in each table, or we can search for the table name if it appears inside the table, although this second option might not always work.


In [79]:
for index,table in enumerate(tables):
    if ("10 most densely populated countries" in str(table)):
        table_index = index
print(table_index)

7
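The loop above searches the full text of each table and leaves `table_index` undefined when nothing matches. An alternative sketch, assuming the table name lives in a `<caption>` element, is shown below on made-up sample markup; `find_table_index` and `sample_page` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two captioned tables.
sample_page = """
<table><caption>Other table</caption><tr><td>x</td></tr></table>
<table><caption>10 most densely populated countries
<small>(with population above 5 million)</small></caption>
<tr><td>1</td></tr></table>
"""
sample_tables = BeautifulSoup(sample_page, "html.parser").find_all("table")


def find_table_index(tables, caption_text):
    """Return the index of the first table whose caption contains caption_text, else None."""
    for index, table in enumerate(tables):
        caption = table.find("caption")
        if caption is not None and caption_text in caption.get_text():
            return index
    return None


found = find_table_index(sample_tables, "10 most densely populated countries")
not_found = find_table_index(sample_tables, "no such table")
print(found, not_found)
```

Returning `None` for a missing table makes the failure explicit, which helps when the source page changes its layout.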


See if you can locate the name of the table, `10 most densely populated countries`, in the output below.


In [80]:
print(tables[table_index].prettify())

<table class="wikitable sortable" style="text-align:right">
 <caption>
  10 most densely populated countries
  <small>
   (with population above 5 million)
  </small>
  <sup class="reference" id="cite_ref-:10_106-0">
   <a href="#cite_note-:10-106">
    [101]
   </a>
  </sup>
 </caption>
 <tbody>
  <tr>
   <th scope="col">
    Rank
   </th>
   <th scope="col">
    Country
   </th>
   <th scope="col">
    Population
   </th>
   <th scope="col">
    Area
    <br/>
    <small>
     (km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th scope="col">
    Density
    <br/>
    <small>
     (pop/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <span class="mw-image-border" typeof="mw:File">
      <span>
       <img alt="" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/4

In [84]:
# Collect one dictionary per data row, then build the DataFrame once at the end.
data = []
for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # the header row contains only <th> cells, so it has no <td> cells
        rank = col[0].text
        country = col[1].text
        population = col[2].text.strip()
        area = col[3].text.strip()
        density = col[4].text.strip()
        data.append({"Rank": rank, "Country": country, "Population": population, "Area": area, "Density": density})
population_data = pd.DataFrame(data, columns=["Rank", "Country", "Population", "Area", "Density"])
population_data

| | Rank | Country | Population | Area | Density |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5921231 | 719 | 8235 |
| 1 | 2 | Bangladesh | 165650475 | 148460 | 1116 |
| 2 | 3 | \n Palestine[note 3][102]\n\n | 5223000 | 6025 | 867 |
| 3 | 4 | Taiwan[note 4] | 23580712 | 35980 | 655 |
| 4 | 5 | South Korea | 51844834 | 99720 | 520 |
| 5 | 6 | Lebanon | 5296814 | 10400 | 509 |
| 6 | 7 | Rwanda | 13173730 | 26338 | 500 |
| 7 | 8 | Burundi | 12696478 | 27830 | 456 |
| 8 | 9 | India | 1389637446 | 3287263 | 423 |
| 9 | 10 | Netherlands | 17400824 | 41543 | 419 |
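The scraped values are still strings, and some country names carry footnote markers and stray whitespace. One possible cleanup with pandas is sketched below. The small `raw` frame is made up for illustration, and the thousands separators in it are an assumption about how the source page formats numbers.

```python
import pandas as pd

# Hypothetical sample resembling the scraped rows; the comma separators are assumed.
raw = pd.DataFrame({
    "Country": ["Singapore", "\n Palestine[note 3][102]\n\n"],
    "Population": ["5,921,231", "5,223,000"],
    "Density": ["8,235", "867"],
})

clean = raw.copy()

# Remove bracketed footnote markers such as [note 3] or [102], then trim whitespace.
clean["Country"] = clean["Country"].str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()

# Drop thousands separators and convert the numeric columns to integers.
for column in ["Population", "Density"]:
    clean[column] = clean[column].str.replace(",", "", regex=False).astype(int)

print(clean)
```

With numeric dtypes in place, operations such as sorting by density or computing totals behave as expected instead of comparing strings.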
