In [2]:
from bs4 import BeautifulSoup # this module helps with web scraping
import requests  # this module helps us to download a web page

## Beautiful Soup Objects

Beautiful Soup is a Python library for pulling data out of HTML and XML files; we will focus on HTML files. It represents the HTML as a set of objects with methods for parsing it, so we can navigate the HTML as a tree and/or filter out what we are looking for.

In [3]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>


**Note**: Store the HTML above in a single-line string, as in the next cell.

In [5]:
# Storing the above webpage as a string
html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters. Beautiful Soup then transforms the HTML document into a tree of Python objects. The BeautifulSoup object can create other types of objects. In this lab, we will cover BeautifulSoup and Tag objects, which for the purposes of this lab can be treated as identical, and NavigableString objects.

In [6]:
soup=BeautifulSoup(html,"html.parser")

In [7]:
print(soup)

<!DOCTYPE html>
<html><head><title>Page Title</title></head><body><h3><b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>


We can use the method prettify() to display the HTML in the nested structure:

In [8]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>



## Tags
Let's say we want the title of the page and the name of the top-paid player. We can use the Tag object, which corresponds to an HTML tag in the original document, for example the tag title.

In [9]:
tag_object=soup.title
tag_object

<title>Page Title</title>

We can see that the tag type is bs4.element.Tag:

In [10]:
type(tag_object)

bs4.element.Tag

If there is more than one Tag with the same name, the first element with that Tag name is returned. Here it corresponds to the highest-paid player:

In [11]:
tag_object_1=soup.h3
tag_object_1

<h3><b id="boldest">Lebron James</b></h3>

## Children, Parents, and Siblings
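As stated above, the Tag object is a tree of objects. We can access a child of the tag, or navigate down the branch, as follows. We start from tag_object_1, the first h3 element, because the title tag stored in tag_object has no b child:

In [16]:
tag_child =tag_object_1.b
tag_child

<b id="boldest">Lebron James</b>

We can access the parent with the parent attribute:

In [17]:
parent_tag=tag_child.parent
parent_tag

<h3><b id="boldest">Lebron James</b></h3>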

In [18]:
tag_object.parent

<head><title>Page Title</title></head>

In [20]:
print(tag_object.next_sibling)

None
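Sibling navigation is what links each player's heading to the salary paragraph that follows it. The sketch below is one possible way to pair them, not the lab's own method: it reuses the body of the page from earlier, and the names `players_html`, `players_soup`, and `salaries` are introduced here for illustration only.

```python
import re
from bs4 import BeautifulSoup

# Body of the example page used earlier in this lab.
players_html = (
    "<body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p>"
    "<h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p>"
    "<h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body>"
)
players_soup = BeautifulSoup(players_html, "html.parser")

salaries = {}
for heading in players_soup.find_all("h3"):
    name = heading.get_text(strip=True)
    # find_next_sibling returns the first following sibling that is a <p> tag.
    paragraph = heading.find_next_sibling("p")
    # Keep only the digits so that "$85,000, 000" becomes 85000000.
    salaries[name] = int(re.sub(r"\D", "", paragraph.get_text()))

print(salaries)
```

Using `find_next_sibling("p")` rather than `next_sibling` skips any whitespace text nodes that may sit between the two tags in less compact HTML.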


## HTML Attributes
If a tag has attributes, you can access them by treating the tag like a dictionary. For example, the tag with id="boldest" has an attribute id whose value is boldest. The attrs attribute returns all of the tag's attributes as a dictionary:

In [None]:
tag_child.attrs
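As a small illustration of the dictionary-style access described above, the following sketch parses just the bold tag from the example page. The variable names (`snippet`, `bold_tag`, and so on) are assumptions made for this example.

```python
from bs4 import BeautifulSoup

snippet = BeautifulSoup("<b id='boldest'>Lebron James</b>", "html.parser")
bold_tag = snippet.b

all_attrs = bold_tag.attrs            # every attribute as a dictionary
id_value = bold_tag["id"]             # square brackets raise KeyError if the attribute is missing
missing_value = bold_tag.get("class") # get() returns None if the attribute is missing

print(all_attrs, id_value, missing_value)
```

Prefer `get()` when an attribute may be absent, for example when looping over many tags of mixed shape.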

## Navigable String
A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the NavigableString class to contain this text. In our HTML we can obtain the name of the first player by extracting the string of the Tag object tag_child as follows:

In [None]:
tag_string=tag_child.string

In [None]:
type(tag_string)

In [None]:
unicode_string = str(tag_string)
unicode_string
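The conversion above can be checked in isolation. This sketch, which assumes only the bold tag from the example page, shows that a NavigableString already behaves like a Python string but still remembers its place in the tree, while `str()` gives a plain string with no link back to the document.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

snippet = BeautifulSoup("<b id='boldest'>Lebron James</b>", "html.parser")
name = snippet.b.string

is_navigable = isinstance(name, NavigableString)  # the Beautiful Soup text class
is_str = isinstance(name, str)                    # NavigableString subclasses str
parent_name = name.parent.name                    # the string knows which tag contains it

plain_name = str(name)                            # plain Python string, detached from the tree
print(is_navigable, is_str, parent_name, type(plain_name))
```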

## Filter
Filters allow you to find complex patterns; the simplest filter is a string. In this section we will pass a string to a filter method, and Beautiful Soup will perform a match against that exact string. Consider the following HTML of rocket launches:

In [25]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td>
    <td>80 kg</td>
  </tr>
</table>

0,1,2
Flight No,Launch site,Payload mass
1,Florida,300 kg
2,Texas,94 kg
3,Florida,80 kg


In [26]:
table="<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td><td>80 kg</td></tr></table>"

In [27]:
table_bs=BeautifulSoup(table,"html.parser")

## find_all
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.

The method signature is `find_all(name, attrs, recursive, string, limit, **kwargs)`.

## Name
When we set the name parameter to a tag name, the method will extract all the tags with that name, together with their children.

In [28]:
table_rows=table_bs.find_all('tr')
table_rows

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>]

The result is a ResultSet, a Python iterable that behaves like a list; each element is a Tag object:

In [29]:
first_row =table_rows[0]
first_row

<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>

In [30]:
print(type(first_row))

<class 'bs4.element.Tag'>


In [31]:
first_row.td

<td id="flight">Flight No</td>

If we iterate through the list, each element corresponds to a row in the table:

In [32]:
for i,row in enumerate(table_rows):
    print("row",i,"is",row)

row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
row 1 is <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>
row 2 is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
row 3 is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>


As row is a Tag object, we can apply the method find_all to it and extract the table cells into the object cells using the tag td; these are all the descendants with the name td. The result is a list in which each element corresponds to a cell and is a Tag object, so we can iterate through this list as well. We can extract the content using the string attribute.

In [34]:
for i,row in enumerate(table_rows):
    print("row",i,"is",row)
    cells=row.find_all('td')
    # use a separate loop variable so the cells list is not overwritten
    for j,cell in enumerate(cells):
        print("column",j,"is",cell)

row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
column 0 is <td id="flight">Flight No</td>
column 1 is <td>Launch site</td>
column 2 is <td>Payload mass</td>
row 1 is <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>
column 0 is <td>1</td>
column 1 is <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td>
column 2 is <td>300 kg</td>
row 2 is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
column 0 is <td>2</td>
column 1 is <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 is <td>94 kg</td>
row 3 is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>
column 0 is <td>3</td>
column 1 is <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td>
column 2 is <td>80 kg</td>


If we use a list we can match against any item in that list.


In [35]:
list_input=table_bs.find_all(name=["tr", "td"])
list_input

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td>,
 <td>300 kg</td>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>,
 <td>94 kg</td>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>,
 <td>3</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td>,
 <td>80 kg</td>]
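The signature of find_all lists more parameters than `name`. The sketch below tries a few of them on a small, well-formed stand-in for the launch table; the shortened table and every variable name here are assumptions made for illustration, not part of the lab's data.

```python
from bs4 import BeautifulSoup

# A reduced version of the launch table (illustrative only).
sample = (
    "<table>"
    "<tr><td id='flight'>Flight No</td><td>Launch site</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td></tr>"
    "<tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td></tr>"
    "</table>"
)
sample_bs = BeautifulSoup(sample, "html.parser")

by_id = sample_bs.find_all(id="flight")             # keyword arguments filter on attributes
with_href = sample_bs.find_all("a", href=True)      # True matches any value of the attribute
florida_text = sample_bs.find_all(string="Florida") # string matches text instead of tags
first_two_cells = sample_bs.find_all("td", limit=2) # limit stops after that many results
first_row = sample_bs.find("tr")                    # find returns only the first match

print(by_id, with_href, florida_text, first_two_cells, first_row, sep="\n")
```

Note that `find` returns a single Tag (or None when nothing matches), whereas `find_all` always returns a list-like ResultSet.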

# Downloading And Scraping The Contents Of A Web Page

In [36]:
url = "http://www.ibm.com"

We use *get* to download the contents of the webpage in text format and store them in a variable called *data*:

In [37]:
data=requests.get(url).text

We create a BeautifulSoup object using the BeautifulSoup constructor

In [38]:
soup=BeautifulSoup(data,"html.parser")

In [41]:
# Scrape all links
for link in soup.find_all('a',href=True): # in html anchor/link is represented by the tag <a>
    print(link.get('href'))

https://www.ibm.com/resources/the-data-differentiator/scale-ai?lnk=hpenls1
https://www.ibm.com/watsonx?lnk=hpenls2
https://www.ibm.com/events/reg/flow/ibm/mknmeomb/landing/page/landing?lnk=hpenca1
//mediacenter.ibm.com/id/1_t4tolges
//mediacenter.ibm.com/id/1_t4tolges
//mediacenter.ibm.com/id/1_t4tolges
https://www.ibm.com/in-en/data-fabric?lnk=hpencm1
//mediacenter.ibm.com/id/1_ohfv4i6v
//mediacenter.ibm.com/id/1_ohfv4i6v
//mediacenter.ibm.com/id/1_ohfv4i6v
https://www.ibm.com/in-en/sustainability?lnk=hpencm2
//mediacenter.ibm.com/id/1_4f1czavh
//mediacenter.ibm.com/id/1_4f1czavh
//mediacenter.ibm.com/id/1_4f1czavh
https://www.ibm.com/consulting/?lnk=hpencm3
#tab_3171780
#tab_3171784
#tab_3171788
#tab_3171792
#tab_3171796
#tab_3171800
https://www.ibm.com/consulting/?lnk=hpenco1
https://www.ibm.com/consulting/strategy/?lnk=hpenco2
https://www.ibm.com/consulting/ibmix?lnk=flathl
https://www.ibm.com/consulting/technology/?lnk=hpenco4
https://www.ibm.com/services/operations-consulting?lnk
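The output above mixes absolute URLs, protocol-relative URLs that start with `//`, and in-page fragments that start with `#`. One possible way to turn them all into absolute URLs is `urljoin` from the standard library. In this sketch `raw_links` is a hand-picked sample of three values from the output, and `base_url` is assumed to be the address the page was requested from.

```python
from urllib.parse import urljoin

base_url = "http://www.ibm.com"  # the address the page was downloaded from

# Three representative href values copied from the output above.
raw_links = [
    "https://www.ibm.com/watsonx?lnk=hpenls2",
    "//mediacenter.ibm.com/id/1_t4tolges",
    "#tab_3171780",
]

# urljoin resolves each href against the base, leaving absolute URLs unchanged.
absolute_links = [urljoin(base_url, href) for href in raw_links]

for link in absolute_links:
    print(link)
```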

## Scraping all image tags

In [42]:
for link in soup.find_all('img'):
    print(link.get('src'))

//1.cms.s81c.com/sites/default/files/2023-04-05/CS_The_Modernize_Banking_Question_V1_30_mobile-frame-04-30_drupal.jpg
//1.cms.s81c.com/sites/default/files/2023-04-05/CS_The_Sustainability_Building_Question_30_mobile-07-00_drupal.jpg
//1.cms.s81c.com/sites/default/files/2023-04-05/CS_the_transform_masters_question_30_mobile_frame00_drupal_0.jpg
//1.cms.s81c.com/sites/default/files/2023-05-07/pi-1906698.xl_.jpg
//1.cms.s81c.com/sites/default/files/2023-05-07/ebsf02532.xl_.jpg
//1.cms.s81c.com/sites/default/files/2023-05-07/_l7a6081.xl_.jpg
//1.cms.s81c.com/sites/default/files/2022-11-16/XLG%20-%20IBM%20Consulting%20-%20Technology.jpg
//1.cms.s81c.com/sites/default/files/2022-11-16/XLG%20-%20IBM%20Consulting%20-%20Operations.jpg
//1.cms.s81c.com/sites/default/files/2022-11-16/XLG%20-%20IBM%20Consulting%20-%20Partners.jpg
//1.cms.s81c.com/sites/default/files/2022-11-16/XLG%20-%20Inside%20IBM%20-%20IBM%20Careers.jpg
//1.cms.s81c.com/sites/default/files/2022-11-16/XLG%20-%20Inside%20IBM%20-%

## Scraping data from tables

In [43]:
#The below url contains an html table with data about colors and color codes.
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Before proceeding to scrape a web site, you need to examine the contents, and the way data is organized on the website. Open the above url in your browser and check how many rows and columns are there in the color table.

In [44]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

In [45]:
soup=BeautifulSoup(data,"html.parser")

In [49]:
table=soup.find('table')

In [50]:
# Get all rows from the table
for row in table.find_all('tr'): # in HTML a table row is represented by the tag <tr>
    # Get all cells in each row.
    cols = row.find_all('td') # in HTML a table data cell is represented by the tag <td>
    color_name = cols[2].string # store the value in column 3 as color_name
    color_code = cols[3].string # store the value in column 4 as color_code
    print("{}--->{}".format(color_name,color_code))

Color Name--->None
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
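The first line of the output, `Color Name--->None`, shows that the `string` attribute returned None for one header cell. A likely explanation, stated here as an assumption because the page's markup is not shown, is that the cell contains more than one child node; `string` only returns text when a tag has a single child. The sketch below reproduces the effect with hypothetical markup and shows `get_text` as an alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical header row: the second cell mixes a nested tag with plain text.
header_row = BeautifulSoup(
    "<tr><td>Color Name</td><td><b>Hex</b> Code</td></tr>", "html.parser"
)
plain_cell, nested_cell = header_row.find_all("td")

plain_string = plain_cell.string    # single text child, so the text is returned
nested_string = nested_cell.string  # two children (<b> and text), so this is None
nested_text = nested_cell.get_text(" ", strip=True)  # joins all text pieces with a space

print(plain_string, nested_string, nested_text)
```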


## Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas

In [51]:
import pandas as pd

In [52]:
#The below url contains html tables with data about world population.
url = "https://en.wikipedia.org/wiki/World_population"

Before proceeding to scrape a web site, you need to examine the contents, and the way data is organized on the website. Open the above url in your browser and check the tables on the webpage

In [53]:
data=requests.get(url).text

In [54]:
soup=BeautifulSoup(data,"html.parser")

In [56]:
tables=soup.find_all('table')

In [57]:
# we can see how many tables were found by checking the length of the tables list
len(tables)

29

Assume that we are looking for the table of the 10 most densely populated countries. We can look through the tables list and find the right one based on the data in each table, or we can search for the table name if it appears in the table, but this option might not always work.

In [63]:
for index,table in enumerate(tables):
    # keep the index of the last table whose markup contains the search text
    if "region" in str(table):
        table_index=index
print(table_index)

23
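Searching the raw markup for a word can match navigation boxes and other unrelated tables. A stricter alternative is to compare the header cells of each table with the column names we expect. The helper `find_table_index` below is hypothetical, and the two-table page is a tiny stand-in built from values that appear elsewhere in this lab.

```python
from bs4 import BeautifulSoup

# Two miniature tables standing in for a real page (illustrative only).
page = """
<table><tr><th>Year</th><th>Population</th></tr>
<tr><td>1951</td><td>2584034261</td></tr></table>
<table><tr><th>Rank</th><th>Country</th><th>Density</th></tr>
<tr><td>1</td><td>Singapore</td><td>8235</td></tr></table>
"""
page_tables = BeautifulSoup(page, "html.parser").find_all("table")

def find_table_index(tables, required_headers):
    """Return the index of the first table whose <th> texts include every required header."""
    wanted = {header.lower() for header in required_headers}
    for index, table in enumerate(tables):
        headers = {th.get_text(strip=True).lower() for th in table.find_all("th")}
        if wanted <= headers:  # all wanted headers are present
            return index
    return None  # no table matched

density_index = find_table_index(page_tables, ["Country", "Density"])
missing_index = find_table_index(page_tables, ["Region"])
print(density_index, missing_index)
```

Returning None when nothing matches makes a failed search explicit instead of leaving the index variable undefined.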


In [69]:
print(tables[23].prettify())

<table class="nowraplinks hlist mw-collapsible autocollapse navbox-inner" style="border-spacing:0;background:transparent;color:inherit">
 <tbody>
  <tr>
   <th class="navbox-title" colspan="2" scope="col">
    <link href="mw-data:TemplateStyles:r1129693374" rel="mw-deduplicated-inline-style"/>
    <link href="mw-data:TemplateStyles:r1063604349" rel="mw-deduplicated-inline-style"/>
    <div class="navbar plainlinks hlist navbar-mini">
     <ul>
      <li class="nv-view">
       <a href="/wiki/Template:Lists_of_countries_by_population_statistics" title="Template:Lists of countries by population statistics">
        <abbr style=";;background:none transparent;border:none;box-shadow:none;padding:0;" title="View this template">
         v
        </abbr>
       </a>
      </li>
      <li class="nv-talk">
       <a href="/wiki/Template_talk:Lists_of_countries_by_population_statistics" title="Template talk:Lists of countries by population statistics">
        <abbr style=";;background:none tra

### Converting the table to a DataFrame

In [83]:
# Collect one dictionary per table row, then build the DataFrame once.
# DataFrame.append is deprecated and was removed in pandas 2.0.
rows = []

for row in tables[7].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # header rows contain no <td> cells, so skip them
        rows.append({
            "Rank": col[0].text.strip(),
            "Country": col[1].text.strip(),
            "Population": col[2].text.strip(),
            "Area": col[3].text.strip(),
            "Density": col[4].text.strip(),
        })

population_data = pd.DataFrame(rows, columns=["Rank", "Country", "Population", "Area", "Density"])
population_data

Unnamed: 0,Rank,Country,Population,Area,Density
0,1,Singapore,5921231,719,8235
1,2,Bangladesh,165650475,148460,1116
2,3,Palestine[103],5223000,6025,867
3,4,Taiwan,23580712,35980,655
4,5,South Korea,51844834,99720,520
5,6,Lebanon,5296814,10400,509
6,7,Rwanda,13173730,26338,500
7,8,Burundi,12696478,27830,456
8,9,India,1389637446,3287263,423
9,10,Netherlands,17400824,41543,419


In [84]:
# Collect one dictionary per region row, then build the DataFrame once.
region_rows = []

for row in tables[3].tbody.find_all("tr"):
    col = row.find_all("td")
    if len(col) >= 5:  # skip header rows and rows with missing cells
        region_rows.append({
            "Region": col[0].text.strip(),
            "Density": col[1].text.strip(),
            "Population": col[2].text.strip(),
            "Most populous country": col[3].text.strip(),
            "Most populous city": col[4].text.strip(),
        })

region_popln_data = pd.DataFrame(region_rows)
region_popln_data

Unnamed: 0,Region,Density,Population,Most populous country,Most populous city
0,Asia,104.1,4641,"1,418,459,382 – India","13,515,000 – Tokyo Metropolis(37,400,000 – G..."
1,Africa,44.4,1340,"0,211,401,000 – Nigeria","09,500,000 – Cairo(20,076,000 – Greater Cairo)"
2,Europe,73.4,747,"0,146,171,000 – Russia, approx. 110 million i...","13,200,000 – Moscow(20,004,000 – Moscow metr..."
3,Latin America,24.1,653,"0,214,103,000 – Brazil","12,252,000 – São Paulo City(21,650,000 – São..."
4,Northern America[note 1],14.9,368,"0,332,909,000 – United States","08,804,000 – New York City(23,582,649 – New ..."
5,Oceania,5,42,"0,025,917,000 – Australia","05,367,000 – Sydney"
6,Antarctica,~0,0.004[89],N/A[note 2],"00,001,258 – McMurdo Station"


## Scrape data from HTML tables into a DataFrame using BeautifulSoup and read_html


Using the same `url`, `data`, `soup`, and `tables` objects as in the last section, we can use the `read_html` function to create a DataFrame.

The regional population table used above is located in `tables[3]`.

We can now use the `pandas` function `read_html` and give it the string version of the table as well as the `flavor`, which selects the parsing engine, here `bs4`. The function returns a list of DataFrames, one for each table it finds.


In [91]:
pd.read_html(str(tables[3]), flavor='bs4')

[                     Region Density (inhabitants/km2) Population (millions)  \
 0                      Asia                     104.1                  4641   
 1                    Africa                      44.4                  1340   
 2                    Europe                      73.4                   747   
 3             Latin America                      24.1                   653   
 4  Northern America[note 1]                      14.9                   368   
 5                   Oceania                         5                    42   
 6                Antarctica                        ~0             0.004[89]   
 
                                Most populous country  \
 0                              1,418,459,382 – India   
 1                            0,211,401,000 – Nigeria   
 2  0,146,171,000 – Russia, approx. 110 million in...   
 3                             0,214,103,000 – Brazil   
 4                      0,332,909,000 – United States   
 5              

In [93]:
population_data_read_html = pd.read_html(str(tables[5]), flavor='bs4')[0]

population_data_read_html.head()

Unnamed: 0_level_0,#,Most populous countries,2000,2015,2030[A],Graphs are temporarily unavailable due to technical issues.
Unnamed: 0_level_1,Graphs are temporarily unavailable due to technical issues.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Unnamed: 0_level_2,Graphs are temporarily unavailable due to technical issues.,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Unnamed: 0_level_3,Graphs are temporarily unavailable due to technical issues.,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3
Unnamed: 0_level_4,Graphs are temporarily unavailable due to technical issues.,Unnamed: 1_level_4,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4
Unnamed: 0_level_5,Graphs are temporarily unavailable due to technical issues.,Unnamed: 1_level_5,Unnamed: 2_level_5,Unnamed: 3_level_5,Unnamed: 4_level_5,Unnamed: 5_level_5
Unnamed: 0_level_6,Graphs are temporarily unavailable due to technical issues.,Unnamed: 1_level_6,Unnamed: 2_level_6,Unnamed: 3_level_6,Unnamed: 4_level_6,Unnamed: 5_level_6
Unnamed: 0_level_7,Graphs are temporarily unavailable due to technical issues.,Unnamed: 1_level_7,Unnamed: 2_level_7,Unnamed: 3_level_7,Unnamed: 4_level_7,Unnamed: 5_level_7
Unnamed: 0_level_8,Graphs are temporarily unavailable due to technical issues.,Unnamed: 1_level_8,Unnamed: 2_level_8,Unnamed: 3_level_8,Unnamed: 4_level_8,Unnamed: 5_level_8
Unnamed: 0_level_9,Graphs are temporarily unavailable due to technical issues.,Unnamed: 1_level_9,Unnamed: 2_level_9,Unnamed: 3_level_9,Unnamed: 4_level_9,Unnamed: 5_level_9
Unnamed: 0_level_10,Graphs are temporarily unavailable due to technical issues.,Unnamed: 1_level_10,Unnamed: 2_level_10,Unnamed: 3_level_10,Unnamed: 4_level_10,Unnamed: 5_level_10
Unnamed: 0_level_11,Graphs are temporarily unavailable due to technical issues.,Unnamed: 1_level_11,Unnamed: 2_level_11,Unnamed: 3_level_11,Unnamed: 4_level_11,Unnamed: 5_level_11
0,,Graphs are temporarily unavailable due to tech...,,,,
1,1.0,China[B],1270.0,1376.0,1416.0,
2,2.0,India,1053.0,1311.0,1528.0,
3,3.0,United States,283.0,322.0,356.0,
4,4.0,Indonesia,212.0,258.0,295.0,


## Scrape data from HTML tables into a DataFrame using read_html
We can also use the read_html function to directly get DataFrames from a url.

In [94]:
df=pd.read_html(url,flavor='bs4')

In [96]:
len(df)

26

In [98]:
df[3]

Unnamed: 0,Region,Density (inhabitants/km2),Population (millions),Most populous country,Most populous city (metropolitan area)
0,Asia,104.1,4641,"1,418,459,382 – India","13,515,000 – Tokyo Metropolis (37,400,000 – Gr..."
1,Africa,44.4,1340,"0,211,401,000 – Nigeria","09,500,000 – Cairo (20,076,000 – Greater Cairo)"
2,Europe,73.4,747,"0,146,171,000 – Russia, approx. 110 million in...","13,200,000 – Moscow (20,004,000 – Moscow metro..."
3,Latin America,24.1,653,"0,214,103,000 – Brazil","12,252,000 – São Paulo City (21,650,000 – São ..."
4,Northern America[note 1],14.9,368,"0,332,909,000 – United States","08,804,000 – New York City (23,582,649 – New Y..."
5,Oceania,5,42,"0,025,917,000 – Australia","05,367,000 – Sydney"
6,Antarctica,~0,0.004[89],N/A[note 2],"00,001,258 – McMurdo Station"
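Several scraped values still carry Wikipedia footnote markers such as `[note 1]` and `[89]`, which get in the way of numeric conversion. A minimal sketch of one way to remove them is shown below; the helper name `strip_footnotes` and the regular expression are assumptions made for this example, applied to two values copied from the table above.

```python
import re

# Matches any bracketed marker such as [89], [103] or [note 1].
FOOTNOTE = re.compile(r"\[[^\]]*\]")

def strip_footnotes(text):
    """Remove bracketed footnote markers and surrounding whitespace."""
    return FOOTNOTE.sub("", text).strip()

region = strip_footnotes("Northern America[note 1]")
population_millions = float(strip_footnotes("0.004[89]"))

print(region, population_millions)
```

The same function could be applied to a whole DataFrame column with `Series.map` before converting it to a numeric type.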


We can also use the match parameter to select the specific table we want. Only tables containing text that matches the given string or regular expression are returned.

In [99]:
pd.read_html(url,match='population growth',flavor='bs4')[0]

Unnamed: 0_level_0,Year,Population,Yearly growth,Yearly growth,Density (pop/km2),Urban population,Urban population
Unnamed: 0_level_1,Year,Population,%,Number,Density (pop/km2),Number,%
0,1951,2584034261,1.88%,47603112,17,775067697,30%
1,1952,2630861562,1.81%,46827301,18,799282533,30%
2,1953,2677608960,1.78%,46747398,18,824289989,31%
3,1954,2724846741,1.76%,47237781,18,850179106,31%
4,1955,2773019936,1.77%,48173195,19,877008842,32%
...,...,...,...,...,...,...,...
65,2016,7464022000,1.14%,84225000,50,4060653000,54%
66,2017,7547859000,1.12%,83837000,51,4140189000,55%
67,2018,7631091000,1.10%,83232000,51,4219817000,55%
68,2019,7713468000,1.08%,82377000,52,4299439000,56%
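The table above comes back with a two-level column header, for example `Yearly growth` over `%` and `Number`. One possible way to flatten it into single names is sketched below. The small DataFrame is built by hand from the first row of the output rather than downloaded, and the naming scheme is an assumption chosen for this example.

```python
import pandas as pd

# Rebuild a slice of the two-level header by hand (first data row of the output above).
columns = pd.MultiIndex.from_tuples([
    ("Year", "Year"),
    ("Population", "Population"),
    ("Yearly growth", "%"),
    ("Yearly growth", "Number"),
])
growth = pd.DataFrame([[1951, 2584034261, "1.88%", 47603112]], columns=columns)

# Keep the top label when both levels repeat it, otherwise combine the two levels.
growth.columns = [
    top if top == sub else f"{top} ({sub})"
    for top, sub in growth.columns
]

print(list(growth.columns))
```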
