# Webscraping

First, install the Python libraries used in this study of web scraping:

In [26]:
!pip install bs4
!pip install lxml
!pip install html5lib
# !pip install requests==2.26.0




Import modules and functions

In [27]:
from bs4 import BeautifulSoup  # parses HTML and XML documents for web scraping
import requests  # downloads web pages

---


## Beautiful Soup Objects

Beautiful Soup (BS) is a Python library that pulls data out of HTML and XML files. It represents the HTML as a set of objects whose methods are used to parse it.

We navigate the HTML as a tree and filter what we're looking for.

### The following web page is rendered from HTML code that can be stored as a string

In [31]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000</p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000,000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200,000 </p>
</body>
</html>

In [29]:
html = "<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

To parse a document, pass it into the BeautifulSoup constructor, which represents the document as a nested data structure:

In [30]:
soup = BeautifulSoup(html,"html.parser")

BS transforms a complex HTML document into a tree of Python objects. The method prettify() displays the HTML with its nested structure:

In [32]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>


## Tags

A Tag object corresponds to an HTML <code>tag</code> in the original document, for example the <code>title</code> tag:

In [34]:
tag_object = soup.title
print("tag object:", tag_object)

tag object: <title>Page Title</title>


We can check the type of the object too:

In [35]:
print("tag object type:",type(tag_object))

tag object type: <class 'bs4.element.Tag'>


If there is more than one tag with the same name, the first one is returned:

In [36]:
tag_object = soup.h3
tag_object

<h3><b id="boldest">Lebron James</b></h3>

Because the document is a tree of objects, we can move from a tag to the other "branches" around it: its children, its parent, and its siblings:

In [45]:
# Access the child <b> tag through the attribute b
tag_child = tag_object.b
print("Tag Child:", tag_child)

# Access the parent through the attribute parent
parent_tag = tag_child.parent
print("Parent Tag:", parent_tag)

# The parent of tag_child is the same element as tag_object

# Access the sibling, the paragraph element, through the attribute next_sibling
sibling_1 = tag_object.next_sibling
print("Tag Sibling 1:", sibling_1)

# sibling_1 has a next_sibling of its own, the next <h3> element
sibling_2 = sibling_1.next_sibling
print("Tag Sibling 2:", sibling_2)

# Stephen Curry's salary is the next sibling after that
curry_salary = sibling_2.next_sibling
print("Curry's Salary:", curry_salary)

Tag Child: <b id="boldest">Lebron James</b>
Parent Tag: <h3><b id="boldest">Lebron James</b></h3>
Tag Sibling 1: <p> Salary: $ 92,000,000 </p>
Tag Sibling 2: <h3> Stephen Curry</h3>
Curry's Salary: <p> Salary: $85,000, 000 </p>
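The chain of next_sibling calls above works because the HTML string has no whitespace between tags. A minimal sketch, assuming a hypothetical variant of the same markup with newlines between the elements, shows how <code>find_next_sibling</code> can skip the text nodes that would otherwise get in the way:

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the players markup, with newlines between tags
html_doc = (
    "<body>"
    "<h3>Lebron James</h3>\n<p>Salary: $92,000,000</p>\n"
    "<h3>Stephen Curry</h3>\n<p>Salary: $85,000,000</p>"
    "</body>"
)
doc = BeautifulSoup(html_doc, "html.parser")

first_heading = doc.h3
# With whitespace between tags, next_sibling is the newline string, not the <p>
print(repr(first_heading.next_sibling))

# find_next_sibling("p") skips text nodes and returns the next <p> tag
salaries = {}
for heading in doc.find_all("h3"):
    paragraph = heading.find_next_sibling("p")
    salaries[heading.get_text(strip=True)] = paragraph.get_text(strip=True)
print(salaries)
```

Pairing each heading with its following paragraph this way avoids counting how many siblings lie in between.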


BS uses the <code>NavigableString</code> class to hold the text inside a tag, and we can get that text through the attribute <code>string</code>:

In [53]:
tag_string = tag_child.string
print("String:",tag_string)

print("Type:",type(tag_string))

String: Lebron James
Type: <class 'bs4.element.NavigableString'>
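One detail worth knowing: <code>string</code> only returns text when the tag has a single child. The sketch below uses two made-up table cells to show the difference and the <code>get_text()</code> alternative:

```python
from bs4 import BeautifulSoup

# Two illustrative cells: one mixes a <b> tag with text, one holds plain text
snippet = BeautifulSoup(
    "<td><b>Color</b> Name</td><td>salmon</td>", "html.parser"
)
mixed_cell, plain_cell = snippet.find_all("td")

print(plain_cell.string)      # a single text child, so the text is returned
print(mixed_cell.string)      # more than one child, so .string is None
print(mixed_cell.get_text())  # concatenates all the text below the tag
```

When a cell may contain nested markup, <code>get_text()</code> is usually the safer choice.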


---


## Filter

Consider the following HTML table of rocket launches:

In [55]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td>
    <td>80 kg</td>
  </tr>
</table>

0,1,2
Flight No,Launch site,Payload mass
1,Florida,300 kg
2,Texas,94 kg
3,Florida,80 kg


In [58]:
table = "<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td><td>80 kg</td></tr></table>"

table_bs = BeautifulSoup(table, "html.parser")

### Find All Method

The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.

The method signature is find_all(name, attrs, recursive, string, limit, **kwargs).
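Two parameters of that signature, <code>limit</code> and <code>recursive</code>, are not used elsewhere in this notebook. A small sketch on a simplified, made-up version of the launch table illustrates what they do:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the rocket launch table
rows_html = (
    "<table>"
    "<tr><td>1</td><td>Florida</td></tr>"
    "<tr><td>2</td><td>Texas</td></tr>"
    "<tr><td>3</td><td>Florida</td></tr>"
    "</table>"
)
demo = BeautifulSoup(rows_html, "html.parser")

# limit stops the search after the given number of matches
first_two_rows = demo.find_all("tr", limit=2)

# recursive=False only inspects the direct children of the tag being searched
direct_cells = demo.table.find_all("td", recursive=False)  # <td> are grandchildren
direct_rows = demo.table.find_all("tr", recursive=False)

print(len(first_two_rows), len(direct_cells), len(direct_rows))  # 2 0 3
```

Note that this relies on "html.parser", which keeps the rows as direct children of the table; other parsers may insert a tbody element.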

In [63]:
# Setting the name parameter to a tag name returns every tag with that name, together with its children
table_rows = table_bs.find_all(name = 'tr')
table_rows

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>]

In [64]:
# And each element is a tag object:
first_row = table_rows[0]
first_row.td

<td id="flight">Flight No</td>

In [65]:
# We can obtain the child
first_row.td

<td id="flight">Flight No</td>

If we iterate through the list, each element corresponds to a row in the table:

In [66]:
for i,row in enumerate(table_rows):
    print("row",i,"is",row)

row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
row 1 is <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>
row 2 is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
row 3 is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>


Because each row is a Tag object, we can apply the method <code>find_all</code> to it and collect its table cells, that is, all the children named <code>td</code>. The result is a list in which each element is a Tag object for one cell, and we can iterate through this list as well. The content of a cell can be extracted with the string attribute.

In [67]:
for i, row in enumerate(table_rows):
    print("row", i)
    cells = row.find_all("td")
    for j, cell in enumerate(cells):
        print("column", j, "cell", cell)

row 0
column 0 cell <td id="flight">Flight No</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td>
column 2 cell <td>300 kg</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 cell <td>94 kg</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td>
column 2 cell <td>80 kg</td>
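The nested loop above prints Tag objects. To keep only the text, the same traversal can be written as a nested list comprehension. This is a sketch on a shortened copy of the launch table, not the notebook's own cell:

```python
from bs4 import BeautifulSoup

# Shortened copy of the launch table, with well-formed <a> tags
launch_html = (
    "<table>"
    "<tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr>"
    "<tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a> </td><td>94 kg</td></tr>"
    "</table>"
)
launch_bs = BeautifulSoup(launch_html, "html.parser")

# One inner list per row, one stripped string per cell
rows_as_text = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in launch_bs.find_all("tr")
]
header, *records = rows_as_text
print(header)
print(records)
```

<code>get_text(strip=True)</code> is used instead of <code>string</code> because the launch-site cells wrap their text in a link.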


In [68]:
# Passing a list as name matches any tag whose name is in that list
list_input = table_bs.find_all(name=["tr", "td"])
list_input

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td>,
 <td>300 kg</td>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>,
 <td>94 kg</td>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>,
 <td>3</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td>,
 <td>80 kg</td>]

In [12]:
import html5lib
from bs4 import BeautifulSoup
html = "<table><tr><td>Pizza Place</td><td>Orders</td><td>Slices</td></tr><tr><td>Domino’s Pizza</td><td>10</td><td>100 </td></tr><tr><td>Little Caesars</td><td>12</td><td>144</td></table>"
table = BeautifulSoup(html,"html.parser")
table

<table><tr><td>Pizza Place</td><td>Orders</td><td>Slices</td></tr><tr><td>Domino’s Pizza</td><td>10</td><td>100 </td></tr><tr><td>Little Caesars</td><td>12</td><td>144</td></tr></table>

[<tr><td>Pizza Place</td><td>Orders</td><td>Slices</td></tr>,
 <tr><td>Domino’s Pizza</td><td>10</td><td>100 </td></tr>,
 <tr><td>Little Caesars</td><td>12</td><td>144</td></tr>]

<td>Pizza Place</td>

row 0
column 0 cell <td>Pizza Place</td>
column 1 cell <td>Orders</td>
column 2 cell <td>Slices</td>
row 1
column 0 cell <td>Domino’s Pizza</td>
column 1 cell <td>10</td>
column 2 cell <td>100 </td>
row 2
column 0 cell <td>Little Caesars</td>
column 1 cell <td>12</td>
column 2 cell <td>144</td>


### Attributes

If a keyword argument is not one of the named parameters, it is turned into a filter on the tag’s attributes. More ways of working with attributes are described at the following [link](https://www.crummy.com/software/BeautifulSoup/bs4/doc/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0220ENSkillsNetwork23455606-2021-01-01#css-selectors).

In [69]:
# We can use the id attr to filter
table_bs.find_all(id = "flight")

[<td id="flight">Flight No</td>]

In [70]:
# We can use href to filter all the elements that have links to a specific page
list_input = table_bs.find_all(href = "https://en.wikipedia.org/wiki/Florida")
list_input

[<a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a>]

In [71]:
# Passing True matches every tag that has the attribute, whatever its value
table_bs.find_all(href=True)

[<a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a>]

In [72]:
table_bs.find_all(href=False)

[<table><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr></table>,
 <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td>,
 <a></a>,
 <td>300 kg</td>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>,
 <td>94 k

In [74]:
soup.find_all(id='boldest')

[<b id="boldest">Lebron James</b>]

### String

With string you can search for text instead of tags. Here we find all the elements whose text is Florida:

In [75]:
table_bs.find_all(string="Florida")

['Florida', 'Florida']
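A plain string filter only matches text that is exactly equal to it, so a cell such as "Florida " with a trailing space would be missed. One possible workaround, sketched here on a made-up row, is to pass a compiled regular expression as the string argument:

```python
import re
from bs4 import BeautifulSoup

# Illustrative row: the second cell has a trailing space
cells_html = "<table><tr><td>Florida</td><td>Florida </td><td>Texas</td></tr></table>"
cells_bs = BeautifulSoup(cells_html, "html.parser")

# Exact comparison: only the first cell matches
exact = cells_bs.find_all(string="Florida")

# Regular expression: surrounding whitespace is tolerated
pattern = cells_bs.find_all(string=re.compile(r"^\s*Florida\s*$"))

print(len(exact), len(pattern))  # 1 2
```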

## Find

If you are looking for a single element, you can use the <code>find()</code> method, which returns the first match in the document. Consider the following two tables:

In [76]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td> 
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>

0,1,2
Flight No,Launch site,Payload mass
1,Florida,300 kg
2,Texas,94 kg
3,Florida,80 kg

0,1,2
Pizza Place,Orders,Slices
Domino's Pizza,10,100
Little Caesars,12,144
Papa John's,15,165


In [77]:
# Store HTML as a str and create a BS object
two_tables = "<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>"
two_tables_bs= BeautifulSoup(two_tables, 'html.parser')

# We can find the first table using the tag name table
two_tables_bs.find("table")

<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>

In [78]:
# class is a reserved word in Python, so the keyword argument is class_ (with a trailing underscore); here it selects the second table
two_tables_bs.find("table",class_='pizza')

<table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
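The same class filter can also be written with the <code>attrs</code> dictionary instead of the <code>class_</code> keyword. A minimal sketch, assuming a hypothetical page where the second table carries two classes:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the second table has the classes "pizza" and "menu"
page = BeautifulSoup(
    "<table class='rocket'><tr><td>Florida</td></tr></table>"
    "<table class='pizza menu'><tr><td>Papa John's</td></tr></table>",
    "html.parser",
)

by_keyword = page.find("table", class_="pizza")
by_attrs = page.find("table", attrs={"class": "pizza"})

print(by_keyword is by_attrs)  # both calls return the same tag
print(by_keyword["class"])     # class is multi-valued, so this is a list
```

Either form matches a tag when any one of its classes equals the given value.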

---


## Downloading and Scraping the Contents of a Web Page

We use <code>requests.get</code> to download the contents of the web page as text and store it in a variable called <code>data1</code>:

In [83]:
url = "http://www.ibm.com"
data1 = requests.get(url).text
soup2 = BeautifulSoup(data1, "html.parser")  # create a soup object from the variable data1

In [89]:
# Scrape all links
for link in soup2.find_all('a', href=True):  # in HTML an anchor/link is represented by the tag <a>
    print(link.get('href'))

https://www.ibm.com/thought-leadership/institute-business-value/en-us/c-suite-study/ceo?lnk=hpesls1
#3171776
#3171802
#3171808
#3171812
#3171828
#3171834
https://newsroom.ibm.com/2023-01-04-IBM-Launches-New-Way-to-Partner-through-IBM-Partner-Plus
/br-pt/about
#tab_3171780
#tab_3171784
#tab_3171788
#tab_3171792
#tab_3171796
#tab_3171800
https://www.ibm.com/consulting/br-pt/?lnk=hpptco1
https://www.ibm.com/br-pt/consulting/strategy//?lnk=hpptco2
https://www.ibm.com/br-pt/consulting/strategy//?lnk=hpptco3
https://www.ibm.com/consulting/technology/?lnk=hpptco4
/br-pt/services/operations-consulting?lnk=hpptco5
/strategic-partnerships?lnk=lnk%3Dhpptco6
/br-pt/case-studies/iberdrola?lnk=hpptcs1
https://www.ibm.com/br-pt/employment//?lnk=hpptii1
https://www.ibm.com/br-pt/employment//?lnk=hpptii1
https://research.ibm.com/
https://research.ibm.com/
https://www.ibm.com/impact?lnk=hpptii3
https://www.ibm.com/impact?lnk=hpptii3
/br-pt/about?lnk=hpptai1
#
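Many of the scraped values are relative paths or fragments rather than full URLs. A short sketch using the standard library function <code>urljoin</code> shows one way to resolve them. The base URL is written with a trailing slash here as an assumption, and the three hrefs are copied from the output above:

```python
from urllib.parse import urljoin

base_url = "http://www.ibm.com/"  # assumed base, with a trailing slash
hrefs = [
    "https://research.ibm.com/",  # already absolute
    "/br-pt/about",               # relative to the site root
    "#tab_3171780",               # fragment within the same page
]

# urljoin leaves absolute URLs alone and resolves the others against the base
absolute_links = [urljoin(base_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
```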


---

## Scrape All Image Tags

In [92]:
for link in soup2.find_all('img'): # in html image is represented by the tag <img>
    print(link)
    print(link.get('src'))

<img alt="Pessoa em pé com braços cruzados" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-11-16/XLG%20-%20IBM%20Consulting%20-%20Overview.jpg"/>
//1.cms.s81c.com/sites/default/files/2022-11-16/XLG%20-%20IBM%20Consulting%20-%20Overview.jpg
<img alt="Membros da equipe no trabalho em uma sala de reuniões" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-11-16/XLG%20-%20IBM%20Consulting%20-%20Strategy.jpg"/>
//1.cms.s81c.com/sites/default/files/2022-11-16/XLG%20-%20IBM%20Consulting%20-%20Strategy.jpg
<img alt="Colegas de trabalho olhando para notebooks" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-11-16/XLG%20-%20IBM%20Consulting%20-%20Experience.jpg"/>
//1.cms.s81c.com/sites/default/files/2022-11-16/XLG%20-%20IBM%20Consulting%20-%20Experience.jpg
<img alt="Desenvolvedor de cloud de blusa vermelha escrevendo códigos sentado à mesa" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-11-16/XLG%20-%20

---

## Scrape Data from HTML Tables

In [93]:
#The below url contains an html table with data about colors and color codes.
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Before scraping a web site, examine its contents and the way its data is organized. Open the above URL in your browser and check how many rows and columns the color table has.

In [94]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text
soup = BeautifulSoup(data,"html.parser")

#find a html table in the web page
table = soup.find('table') # in html table is represented by the tag <table>

#Get all rows from the table
for row in table.find_all('tr'): # in html table row is represented by the tag <tr>
    # Get all columns in each row.
    cols = row.find_all('td') # in html a column is represented by the tag <td>
    color_name = cols[2].string # store the value in column 3 as color_name
    color_code = cols[3].string # store the value in column 4 as color_code
    print("{}--->{}".format(color_name,color_code))

Color Name--->None
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
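The loop above indexes cols[2] and cols[3] directly, which fails on any row with fewer than four cells and prints the header row along with the data. A more defensive variant is sketched below on a small made-up table, since the real page is not reproduced here:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the color table, including one short row
colors_html = (
    "<table>"
    "<tr><td>No</td><td>Color</td><td>Color Name</td><td>Hex Code</td></tr>"
    "<tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>"
    "<tr><td colspan='4'>note row</td></tr>"
    "<tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>"
    "</table>"
)
colors_bs = BeautifulSoup(colors_html, "html.parser")

colors = {}
for row in colors_bs.find_all("tr")[1:]:  # skip the header row
    cols = row.find_all("td")
    if len(cols) < 4:                     # ignore rows without enough cells
        continue
    colors[cols[2].get_text(strip=True)] = cols[3].get_text(strip=True)

print(colors)
```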


---

## Scrape Data from HTML Tables into a DataFrame using BeautifulSoup and Pandas

In [96]:
import pandas as pd

# URL of html tables with data about world population.
url = "https://en.wikipedia.org/wiki/World_population"

# Get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text
soup = BeautifulSoup(data,"html.parser")

#find all html tables in the web page
tables = soup.find_all('table') # in html table is represented by the tag <table>

# we can see how many tables were found by checking the length of the tables list
len(tables)

24

If we are looking for the table of the 10 most densely populated countries, we can go through the tables list and pick the right one based on the data in each table, or we can search for the table's title if it appears inside the table, although this option might not always work.

In [97]:
for index,table in enumerate(tables):
    if ("10 most densely populated countries" in str(table)):
        table_index = index
print(table_index)

4
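The search above leaves table_index undefined when no table matches. As an alternative sketch, the lookup can be limited to each table's caption and return None when nothing is found. The two tables and the first caption below are invented for illustration:

```python
from bs4 import BeautifulSoup

# Invented page with two captioned tables
page_html = (
    "<table><caption>Most populous countries</caption>"
    "<tr><td>India</td></tr></table>"
    "<table><caption>10 most densely populated countries "
    "<small>(with population above 5 million)</small></caption>"
    "<tr><td>Singapore</td></tr></table>"
)
page_tables = BeautifulSoup(page_html, "html.parser").find_all("table")

wanted = "10 most densely populated countries"
# next() returns the first matching index, or None if no caption matches
match_index = next(
    (
        i
        for i, t in enumerate(page_tables)
        if t.caption is not None and wanted in t.caption.get_text()
    ),
    None,
)
print(match_index)  # 1
```

Checking for None before indexing the list makes a missing table an explicit case instead of a NameError.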


In [98]:
print(tables[table_index].prettify())

<table class="wikitable sortable" style="text-align:right">
 <caption>
  10 most densely populated countries
  <small>
   (with population above 5 million)
  </small>
  <sup class="reference" id="cite_ref-:10_106-0">
   <a href="#cite_note-:10-106">
    [101]
   </a>
  </sup>
 </caption>
 <tbody>
  <tr>
   <th scope="col">
    Rank
   </th>
   <th scope="col">
    Country
   </th>
   <th scope="col">
    Population
   </th>
   <th scope="col">
    Area
    <br/>
    <small>
     (km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th scope="col">
    Density
    <br/>
    <small>
     (pop/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/23px-Flag_of_Singapore.svg.png" srcset="//upload

In [121]:
population_data = pd.DataFrame(columns=["Rank", "Country", "Population", "Area", "Density"])

for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if (col != []):
        rank = col[0].text
        country = col[1].text
        population = col[2].text.strip()
        area = col[3].text.strip()
        density = col[4].text.strip()
        new_row = pd.DataFrame({"Rank":[rank], "Country":[country], "Population":[population], "Area":[area], "Density":[density]})
        population_data = pd.concat([population_data, new_row], ignore_index=True)

display(population_data)

Unnamed: 0,Rank,Country,Population,Area,Density
0,1,Singapore,5921231,719,8235
1,2,Bangladesh,165650475,148460,1116
2,3,\n Palestine[102]\n\n,5223000,6025,867
3,4,Taiwan,23580712,35980,655
4,5,South Korea,51844834,99720,520
5,6,Lebanon,5296814,10400,509
6,7,Rwanda,13173730,26338,500
7,8,Burundi,12696478,27830,456
8,9,India,1389637446,3287263,423
9,10,Netherlands,17400824,41543,419
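Calling pd.concat once per row copies the DataFrame each time. A common alternative, sketched here on a tiny made-up table that reuses two values from the output above, is to collect the rows as dictionaries and build the DataFrame once. Stripping every cell also removes stray newlines such as those in the Palestine row:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Tiny stand-in table; the header row uses <th>, the data rows use <td>
density_html = (
    "<table><tbody>"
    "<tr><th>Rank</th><th>Country</th><th>Density</th></tr>"
    "<tr><td>1</td><td>\nSingapore\n</td><td>8235</td></tr>"
    "<tr><td>2</td><td>Bangladesh</td><td>1116</td></tr>"
    "</tbody></table>"
)
density_table = BeautifulSoup(density_html, "html.parser").table

records = []
for row in density_table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped
        records.append(dict(zip(["Rank", "Country", "Density"], cells)))

density_df = pd.DataFrame(records)
# The scraped values are strings; convert the numeric column explicitly
density_df["Density"] = pd.to_numeric(density_df["Density"])
print(density_df)
```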


---


## Scrape data from HTML tables into a DataFrame using BeautifulSoup and read_html

Using the same url, data, soup, and tables objects as in the last section, we can use the read_html function to create a DataFrame.

Remember that the table we need is located in tables[table_index].

We can pass the string version of a table to the pandas function read_html. Its optional flavor argument selects the parsing engine, for example bs4. Note that the cell below reads tables[5], a different table from the one at tables[table_index], and leaves flavor at its default.

In [144]:
pd.read_html(str(tables[5]))

[   Rank         Country  Population  Area(km2)  Density(pop/km2)  \
 0     1           India  1389637446    3287263               423   
 1     2        Pakistan   242923845     796095               305   
 2     3      Bangladesh   165650475     148460              1116   
 3     4           Japan   124214766     377915               329   
 4     5     Philippines   114597229     300000               382   
 5     6         Vietnam   103808319     331210               313   
 6     7  United Kingdom    67791400     243610               278   
 7     8     South Korea    51844834      99720               520   
 8     9          Taiwan    23580712      35980               655   
 9    10       Sri Lanka    23187516      65610               353   
 
   Population trend[citation needed]  
 0                           Growing  
 1                   Rapidly growing  
 2                           Growing  
 3                    Declining[103]  
 4                           Growing  
 5   

The function read_html always returns a list of DataFrames so we must pick the one we want out of the list.

In [143]:
population_data_read_html = pd.read_html(str(tables[5]))[0]

population_data_read_html

Unnamed: 0,Rank,Country,Population,Area(km2),Density(pop/km2),Population trend[citation needed]
0,1,India,1389637446,3287263,423,Growing
1,2,Pakistan,242923845,796095,305,Rapidly growing
2,3,Bangladesh,165650475,148460,1116,Growing
3,4,Japan,124214766,377915,329,Declining[103]
4,5,Philippines,114597229,300000,382,Growing
5,6,Vietnam,103808319,331210,313,Growing
6,7,United Kingdom,67791400,243610,278,Growing
7,8,South Korea,51844834,99720,520,Steady
8,9,Taiwan,23580712,35980,655,Steady
9,10,Sri Lanka,23187516,65610,353,Growing
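Newer pandas releases warn when read_html receives a literal HTML string, so wrapping the text in <code>io.StringIO</code> is a safer habit. The sketch below assumes lxml is installed (as in the install cell at the top) and uses a small made-up table instead of the Wikipedia one:

```python
from io import StringIO
import pandas as pd

# Small made-up table with a header row of <th> cells
table_html = """
<table>
  <tr><th>Rank</th><th>Country</th><th>Density</th></tr>
  <tr><td>1</td><td>Singapore</td><td>8235</td></tr>
  <tr><td>2</td><td>Bangladesh</td><td>1116</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per table found
frames = pd.read_html(StringIO(table_html))
density = frames[0]
print(density)
```

With a BeautifulSoup table, the equivalent call would wrap str(table) in StringIO in the same way.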


---

## Scrape data from HTML tables into a DataFrame using read_html

We can also use the read_html function to directly get DataFrames from a url.

In [147]:
dataframe_list = pd.read_html(url, flavor='lxml')

# Number of DataFrames (one per table) found on the page
len(dataframe_list)

24

In [148]:
dataframe_list[5]

Unnamed: 0,Rank,Country,Population,Area(km2),Density(pop/km2),Population trend[citation needed]
0,1,India,1389637446,3287263,423,Growing
1,2,Pakistan,242923845,796095,305,Rapidly growing
2,3,Bangladesh,165650475,148460,1116,Growing
3,4,Japan,124214766,377915,329,Declining[103]
4,5,Philippines,114597229,300000,382,Growing
5,6,Vietnam,103808319,331210,313,Growing
6,7,United Kingdom,67791400,243610,278,Growing
7,8,South Korea,51844834,99720,520,Steady
8,9,Taiwan,23580712,35980,655,Steady
9,10,Sri Lanka,23187516,65610,353,Growing


We can also use the match parameter to select the specific table we want. Only tables containing text that matches the given string or regular expression are read.

In [150]:
pd.read_html(url, match="10 most densely populated countries", flavor='lxml')[0]

Unnamed: 0,Rank,Country,Population,Area(km2),Density(pop/km2)
0,1,Singapore,5921231,719,8235
1,2,Bangladesh,165650475,148460,1116
2,3,Palestine[102],5223000,6025,867
3,4,Taiwan,23580712,35980,655
4,5,South Korea,51844834,99720,520
5,6,Lebanon,5296814,10400,509
6,7,Rwanda,13173730,26338,500
7,8,Burundi,12696478,27830,456
8,9,India,1389637446,3287263,423
9,10,Netherlands,17400824,41543,419
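The effect of match can be checked offline as well. This sketch assumes lxml is installed and uses two invented captioned tables, so the captions and rows are illustrative only:

```python
from io import StringIO
import pandas as pd

# Two invented tables that differ in their captions
two_tables_html = """
<table>
  <caption>Most populous countries</caption>
  <tr><th>Rank</th><th>Country</th></tr>
  <tr><td>1</td><td>India</td></tr>
</table>
<table>
  <caption>10 most densely populated countries</caption>
  <tr><th>Rank</th><th>Country</th></tr>
  <tr><td>1</td><td>Singapore</td></tr>
</table>
"""

# Without match, every table is returned
all_frames = pd.read_html(StringIO(two_tables_html))

# With match, only tables containing the text are returned
matched = pd.read_html(StringIO(two_tables_html), match="densely populated")

print(len(all_frames), len(matched))
print(matched[0])
```

The result is still a list, so the single matching DataFrame is taken with index 0.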


# End :)