<font color='green'>**Install libraries**</font>

In [4]:
!pip install lxml==4.6.4
!pip install beautifulsoup4
!pip install html5lib==1.1



<font color='green'>**Import required modules and functions**</font>

In [5]:
from bs4 import BeautifulSoup  # this module helps in web scraping
import requests  # this module helps us to download a web page

***
<h2>Beautiful Soup Objects</h2>

Beautiful Soup is a Python library for pulling data out of HTML and XML files; we will focus on HTML files. It represents the HTML as a set of objects with methods for parsing it, so we can navigate the HTML as a tree and/or filter out what we are looking for.

__[BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0220ENSkillsNetwork23455606-2021-01-01)__

In [6]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

***
<font color='green'> **Store HTML as a string** </font>

In [7]:
html = "<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

To parse a document, pass it into the <code>BeautifulSoup</code> constructor. The resulting <code>BeautifulSoup</code> object represents the document as a nested data structure:

In [8]:
soup = BeautifulSoup(html,"html.parser")

First, the document is converted to Unicode (a character set that extends ASCII), and HTML entities are converted to Unicode characters. Beautiful Soup then transforms the HTML document into a tree of Python objects. The BeautifulSoup object can create other types of objects, such as Tag and NavigableString.
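
To make the entity conversion concrete, here is a minimal sketch. The snippet string and the names `snippet` and `snippet_soup` are made up for illustration and are not part of the lab's data.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment containing two HTML entities: &amp; and &copy;
snippet = "<p>Fish &amp; chips &copy; 2021</p>"
snippet_soup = BeautifulSoup(snippet, "html.parser")

# The parsed text holds the actual Unicode characters, not the entity codes
text = snippet_soup.p.string
print(text)  # Fish & chips © 2021
```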

We can use the method prettify() to display the HTML in the nested structure:

In [9]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>


### Tags 

Let's say we want the title of the page and the name of the top-paid player. We can use the Tag object, which corresponds to an HTML tag in the original document, for example the title tag.

In [10]:
# display title tag
tag_title = soup.title
print("tag title:", tag_title)

tag title: <title>Page Title</title>


In [11]:
# display tag type
print("tag object type:", type(tag_title))

tag object type: <class 'bs4.element.Tag'>


If there is more than one tag with the same name, accessing it as an attribute returns only the first one. Here that is the h3 element for the highest-paid player:

In [12]:
# h3 tag
tag_h3 = soup.h3
print("tag h3:", tag_h3)


tag h3: <h3><b id="boldest">Lebron James</b></h3>


### Children, Parents, and Siblings

In [13]:
# access child tag
tag_child = tag_h3.b
print("tag_child:", tag_child)

tag_child: <b id="boldest">Lebron James</b>


In [14]:
# access parent tag
tag_parent = tag_child.parent
print("tag_parent:", tag_parent)

tag_parent: <h3><b id="boldest">Lebron James</b></h3>


In [15]:
# access sibling 1
tag_sibling1 = tag_h3.next_sibling
print("tag_sibling1:", tag_sibling1)

tag_sibling1: <p> Salary: $ 92,000,000 </p>


In [16]:
# access sibling 2
tag_sibling2 = tag_sibling1.next_sibling
print("tag_sibling2:", tag_sibling2)

tag_sibling2: <h3> Stephen Curry</h3>
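
The HTML string above has no whitespace between tags, so next_sibling lands directly on the next tag. When the source does contain line breaks between tags, the next sibling is often a text node instead. The sketch below uses a made-up two-line fragment (`spaced_html` is a hypothetical name) to show the difference and one way around it, find_next_sibling().

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a newline between the two tags
spaced_html = "<h3>Lebron James</h3>\n<p>Salary: $ 92,000,000</p>"
spaced_soup = BeautifulSoup(spaced_html, "html.parser")

heading = spaced_soup.h3
raw_sibling = heading.next_sibling            # the "\n" text node, not the <p> tag
tag_sibling = heading.find_next_sibling("p")  # skips text nodes and returns the <p> tag

print(repr(raw_sibling))  # '\n'
print(tag_sibling)        # <p>Salary: $ 92,000,000</p>
```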


### HTML Attributes

In [17]:
# get value of tag attribute
tag_child["id"]

'boldest'

In [18]:
# get tag attributes as dictionary
tag_child.attrs

{'id': 'boldest'}

In [19]:
# get value of tag attribute using python get() method
tag_child.get('id')

'boldest'
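
The two access styles differ when an attribute is missing. As a small sketch (the `title` attribute is simply one that this tag does not have), indexing raises a KeyError while get() returns None or a default of our choosing:

```python
from bs4 import BeautifulSoup

bold = BeautifulSoup("<b id='boldest'>Lebron James</b>", "html.parser").b

# get() returns None (or the supplied default) when the attribute is absent
missing = bold.get("title")
with_default = bold.get("title", "n/a")

# Indexing behaves like a dictionary lookup and raises KeyError instead
try:
    bold["title"]
    raised = False
except KeyError:
    raised = True

print(missing, with_default, raised)  # None n/a True
```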

### Navigable String

A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the NavigableString class to contain this text. In our HTML we can obtain the name of the first player by extracting the string of the Tag object tag_child as follows:

In [20]:
# get string value of tag
tag_child_string = tag_child.string
tag_child_string

'Lebron James'

In [21]:
# type Navigable string
type(tag_child_string)

bs4.element.NavigableString
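
One thing to keep in mind is that the string attribute only works when a tag has a single child. The sketch below uses a made-up paragraph (`mixed` is a hypothetical name) that mixes text with a nested tag; in that case string is None, and get_text() is one way to collect all of the text.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with two children: a text node and a <b> tag
mixed = BeautifulSoup("<p>Salary: <b>$92,000,000</b></p>", "html.parser").p

only_string = mixed.string               # None, because <p> has more than one child
all_text = mixed.get_text()              # concatenates every piece of text in the tag
pieces = list(mixed.stripped_strings)    # each piece, with surrounding whitespace removed

print(only_string)  # None
print(all_text)     # Salary: $92,000,000
print(pieces)       # ['Salary:', '$92,000,000']
```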

### Filter

Filters allow you to find complex patterns; the simplest filter is a string. In this section we will pass a string to different filter methods, and Beautiful Soup will perform a match against that exact string. Consider the following HTML of rocket launches:

In [22]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td>
    <td>80 kg</td>
  </tr>
</table>

| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |


In [23]:
table="<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td><td>80 kg</td></tr></table>"

In [24]:
table_bs = BeautifulSoup(table, "html.parser")

### Find All

The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.

The method signature is find_all(name, attrs, recursive, string, limit, **kwargs).
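
Two of these parameters are not used elsewhere in this lab, so here is a brief sketch of what they do. The one-row table and the name `demo` are made up for illustration: limit caps the number of results, and recursive=False restricts the search to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table
demo = BeautifulSoup(
    "<table><tr><td>1</td><td>2</td><td>3</td></tr></table>", "html.parser"
)

# limit stops the search after the given number of matches
first_two = demo.find_all("td", limit=2)

# recursive=False only looks at direct children of the tag being searched.
# The <td> tags are grandchildren of <table>, so nothing is found here...
direct_cells = demo.table.find_all("td", recursive=False)
# ...but the <tr> tag is a direct child, so it is found
direct_rows = demo.table.find_all("tr", recursive=False)

print(first_two)          # [<td>1</td>, <td>2</td>]
print(len(direct_cells))  # 0
print(len(direct_rows))   # 1
```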

### Name

When we set the name parameter to a tag name, the method extracts all the tags with that name, each including its children.

In [25]:
# get all 'tr' tags from table
table_rows = table_bs.find_all('tr')
table_rows

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>]

In [26]:
first_row = table_rows[0]
first_row

<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>

In [27]:
# type is tag
type(first_row)

bs4.element.Tag

In [28]:
# get child tag
first_row_child = first_row.td
first_row_child

<td id="flight">Flight No</td>

In [29]:
# iterate to get all rows of table
for i, row in enumerate(table_rows):
    print("row", i, "is", row)

row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
row 1 is <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>
row 2 is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
row 3 is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>


As row is a Tag object, we can apply the method find_all to it and extract the table cells into the object cells using the tag td; this returns all the descendants with the name td. The result is a list in which each element corresponds to a cell and is a Tag object, so we can iterate through this list as well. We can extract the content of a cell using the string attribute.

In [30]:
for i, row in enumerate(table_rows):
    print("row", i)
    cells = row.find_all('td')
    for j, cell in enumerate(cells):
        print("column", j, "cell", cell)

row 0
column 0 cell <td id="flight">Flight No</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td>
column 2 cell <td>300 kg</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 cell <td>94 kg</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td>
column 2 cell <td>80 kg</td>
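
The loop above prints whole td tags. If we want only the cell contents, one possible approach is to collect the text of every cell into a list of lists. This is a sketch on a shortened, made-up version of the launch table (`launch_html` is a hypothetical name), and it uses get_text(strip=True) rather than string so that cells containing a link are handled as well.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the launch table
launch_html = (
    "<table>"
    "<tr><td>Flight No</td><td>Launch site</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td></tr>"
    "</table>"
)
launch_soup = BeautifulSoup(launch_html, "html.parser")

# One inner list per row, one text value per cell
grid = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in launch_soup.find_all("tr")
]

print(grid)  # [['Flight No', 'Launch site'], ['1', 'Florida']]
```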


If we use a list we can match against any item in that list.

In [31]:
list_input = table_bs.find_all(name=["tr", "td"])
list_input

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td>,
 <td>300 kg</td>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>,
 <td>94 kg</td>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>,
 <td>3</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td>,
 <td>80 kg</td>]

**Filter on Attributes**

If the argument is not recognized, it is turned into a filter on the tag’s attributes. For example, with the id argument, Beautiful Soup filters against each tag’s id attribute. The first td element has an id of flight, so we can filter on that id value.

In [32]:
table_bs.find_all(id='flight')

[<td id="flight">Flight No</td>]

In [34]:
# find all the elements that have links to the Florida Wikipedia page

links_Florida = table_bs.find_all(href = "https://en.wikipedia.org/wiki/Florida")
links_Florida

[<a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a>]

In [35]:
# find all tags with an href value (href = True)

values_href = table_bs.find_all(href = True)
values_href

[<a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a>]

**Search Strings**

With string you can search for strings instead of tags. Here we find all the strings that are exactly Florida:

In [36]:
table_bs.find_all(string = 'Florida')

['Florida', 'Florida']
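
Because a plain string must match exactly, a cell such as "Florida " with a trailing space would be missed. One way to loosen the match is to pass a compiled regular expression instead. The three-cell fragment and the name `sites` below are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical cells: note the trailing space in the second one
sites = BeautifulSoup(
    "<td>Florida</td><td>Florida </td><td>Texas</td>", "html.parser"
)

exact = sites.find_all(string="Florida")              # exact matches only
pattern = sites.find_all(string=re.compile("Florida"))  # any string containing Florida

print(exact)    # ['Florida']
print(pattern)  # ['Florida', 'Florida ']
```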

**Find ()**

The find_all() method scans the entire document looking for results. If you are looking for one element you can use the find() method to find the first element in the document.

In [37]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td> 
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>

| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |

| Pizza Place | Orders | Slices |
|---|---|---|
| Domino's Pizza | 10 | 100 |
| Little Caesars | 12 | 144 |
| Papa John's | 15 | 165 |


In [38]:
two_tables = "<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>"

In [39]:
two_tables_bs = BeautifulSoup(two_tables, 'html.parser')

In [40]:
# find first table

two_tables_bs.find("table")

<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>

In [42]:
# find second table

two_tables_bs.find("table", class_ = 'pizza')

<table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
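
As an alternative to find() with class_, Beautiful Soup also accepts CSS selectors through select_one() and select(). The sketch below is not part of the original lab; it uses a shortened, made-up pair of tables (`pair` is a hypothetical name) to show that both approaches locate the same element.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the two tables
pair = BeautifulSoup(
    "<table class='rocket'><tr><td>Florida</td></tr></table>"
    "<table class='pizza'><tr><td>Orders</td></tr></table>",
    "html.parser",
)

by_find = pair.find("table", class_="pizza")  # keyword filter on the class attribute
by_css = pair.select_one("table.pizza")       # the same lookup written as a CSS selector

print(by_css)  # <table class="pizza"><tr><td>Orders</td></tr></table>
```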

### <font color='green'> Downloading And Scraping The Contents Of A Web Page </font>

In [43]:
# use get() to download the content of a webpage

url = "http://www.ibm.com"
data = requests.get(url).text
data

'<!DOCTYPE html>\n<html lang="en-ph" dir="ltr">\n  <head>\n    <meta charset="utf-8" />\n<script>digitalData = {\n    "page": {\n        "category": [],\n        "pageInfo": {\n            "language": "en-PH",\n            "country": "PH",\n            "publisher": "IBM Corporation",\n            "version": "v19",\n            "ibm": {\n                "contentDelivery": "Drupal 9",\n                "contentProducer": "2598578 - Page Builder",\n                "owner": "Peter Barros/White Plains/IBM",\n                "siteID": "DRUPAL",\n                "contactModuleConfiguration": {\n                    "contactInformationBundleKey": {\n                        "focusArea": "Miscellaneous - About IBM",\n                        "languageCode": "en",\n                        "regionCode": "PH"\n                    },\n                    "contactModuleTranslationKey": {\n                        "languageCode": "en",\n                        "regionCode": "PH"\n                    }\n  

In [55]:
# create soup object using variable 'data'

soup = BeautifulSoup(data, "html.parser")

In [56]:
# scrape all links

for link in soup.find_all('a', href = True):
    print (link.get('href'))

https://www.ibm.com/ph/en
https://www.ibm.com/sitemap/ph/en
https://www.ibm.com/lets-create/new-creators/sg-en/?lnk=phhpv18l1
https://www.ibm.com/it-infrastructure/us-en/resources/hybrid-multicloud-infrastructure-strategy/?lnk=hpv18f1
https://www.ibm.com/sg-en/analytics/data-fabric/?lnk=phhpv18f2
https://www.ibm.com/sg-en/it-infrastructure/power/os/ibm-i?lnk=phhpv18f3
https://www.ibm.com/sg-en/cloud/aiops/?lnk=phhpv18f4
https://www.ibm.com/consulting/sg-en/?lnk=phhpv18f5
https://www.ibm.com/sg-en/cloud/campaign/cloud-simplicity?lnk=phhpv18f6
/ph-en/products/offers-and-discounts?lnk=hpv18t5
/ph-en/cloud/free?lnk=STW_MY_HP_T1_BLK&psrc=NONE&pexp=DEF&lnk2=trial_Cloud
/ph-en/products/cloud-pak-for-data?lnk=STW_MY_HP_T2_BLK&psrc=NONE&pexp=DEF&lnk2=trial_CloudPakData
/ph-en/cloud/watson-assistant?lnk=STW_MY_HP_T3_BLK&psrc=NONE&pexp=DEF&lnk2=trial_WatAssist
/security/identity-access-management/cloud-identity?lnk=STW_MY_HP_T4_BLK&psrc=NONE&pexp=DEF&lnk2=trial_IdentAccMgmtSvc
/products/digital-l
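
Some of the links above are relative (they start with /), so they cannot be requested on their own. A possible way to turn them into absolute URLs is urljoin from the standard library. The sketch below works on a small made-up fragment rather than the live page, and `base_url` and `sample_links` are hypothetical names.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.ibm.com"

# Hypothetical fragment: one absolute link, one relative link, one anchor without href
sample_links = (
    "<a href='https://www.ibm.com/ph/en'>Home</a>"
    "<a href='/ph-en/cloud/free'>Free tier</a>"
    "<a>No link</a>"
)
links_soup = BeautifulSoup(sample_links, "html.parser")

# urljoin leaves absolute URLs unchanged and resolves relative ones against base_url
absolute_links = [
    urljoin(base_url, a["href"]) for a in links_soup.find_all("a", href=True)
]

print(absolute_links)
# ['https://www.ibm.com/ph/en', 'https://www.ibm.com/ph-en/cloud/free']
```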

### <font color='green'> Scrape all images Tags </font>

In [59]:
for link in soup.find_all('img'): # in html image is represented by the tag <img>
    print (link)
    print(link.get('src'))

<img alt="2022 Forrester Consulting study" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-04-04/Original-20220316-26479-Forrester-Modernize-444x320_3.jpg"/>
//1.cms.s81c.com/sites/default/files/2022-04-04/Original-20220316-26479-Forrester-Modernize-444x320_3.jpg
<img alt="Let’s create security that protects your data anywhere" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-04-26/2022031-ls-data-driven-mobile-720x360%20%283%29.png"/>
//1.cms.s81c.com/sites/default/files/2022-04-26/2022031-ls-data-driven-mobile-720x360%20%283%29.png
<img alt="IBM i 7.5 for Power" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-05-06/20220503-f-ibm-i-444x320-26606_2.jpg"/>
//1.cms.s81c.com/sites/default/files/2022-05-06/20220503-f-ibm-i-444x320-26606_2.jpg
<img alt="automation and AIOPS with IBM" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-02-16/automate-five-levers-444x254_4.jpg"/>
//1.cms.s81c.com/sites/def

### <font color='green'>Scrape data from HTML tables </font>

In [91]:
#The below url contains an html table with data about colors and color codes.
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Before proceeding to scrape a website, you need to examine its contents and the way the data is organized. Open the above URL in your browser and check how many rows and columns there are in the color table.

In [63]:
# get the contents of the webpage in text format and store in a variable called data

data = requests.get(url).text

In [62]:
soup = BeautifulSoup (data, "html.parser")

In [64]:
#find html table in the web page

table = soup.find('table') # in html, table is represented by the tag <table>

In [67]:
#Get all rows from the table
for row in table.find_all('tr'): # in html table row is represented by the tag <tr>
    # Get all columns in each row.
    cols = row.find_all('td') # in html a column is represented by the tag <td>
    color_name = cols[2].string # store the value in column 3 as color_name
    color_code = cols[3].string # store the value in column 4 as color_code
    print("{}--->{}".format(color_name,color_code))

Color Name--->None
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
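
The first line of the output pairs a header cell with None. One way to avoid that is to skip the header row and read the cells with get_text(). The sketch below runs on a small made-up table, not on the real page: the layout of `color_html` is an assumption, with a header cell that has more than one child as one example of how string can come back as None.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates a four-column color table
color_html = (
    "<table>"
    "<tr><td>Number</td><td>Color</td><td>Color Name</td>"
    "<td><b>Hex Code</b><br/>#RRGGBB</td></tr>"
    "<tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>"
    "<tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>"
    "</table>"
)
color_table = BeautifulSoup(color_html, "html.parser").find("table")

colors = []
for row in color_table.find_all("tr")[1:]:  # [1:] skips the header row
    cols = row.find_all("td")
    if len(cols) < 4:                       # ignore rows that are too short
        continue
    color_name = cols[2].get_text(strip=True)
    color_code = cols[3].get_text(strip=True)
    colors.append((color_name, color_code))

print(colors)  # [('lightsalmon', '#FFA07A'), ('salmon', '#FA8072')]
```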


### <font color='green'>Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas </font>

In [68]:
import pandas as pd

In [95]:
#The below url contains html tables with data about world population.
url = "https://en.wikipedia.org/wiki/World_population"

Before proceeding to scrape a website, you need to examine its contents and the way the data is organized. Open the above URL in your browser and check the tables on the webpage.

In [70]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

In [71]:
soup = BeautifulSoup(data,"html.parser")

In [82]:
#find all html tables in the web page
tables = soup.find_all('table') # in html table is represented by the tag <table>

In [73]:
# we can see how many tables were found by checking the length of the tables list
len(tables)

26

Assume that we are looking for the 10 most densely populated countries table. We can look through the tables list and find the right one based on the data in each table, or we can search for the table name if it appears in the table, but this option might not always work.

In [74]:
for index,table in enumerate(tables):
    if ("10 most densely populated countries" in str(table)):
        table_index = index
print(table_index)

5
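
If no table contains the text, the loop above never assigns table_index and the print raises a NameError. A possible variation, sketched here on a made-up three-table page (`page_html`, `found_tables`, and `caption_index` are hypothetical names), looks only at each table's caption and falls back to None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical page with three small tables; only the second has the caption we want
page_html = (
    "<table><caption>Other table</caption><tr><td>x</td></tr></table>"
    "<table><caption>10 most densely populated countries "
    "<small>(with population above 5 million)</small></caption>"
    "<tr><td>1</td></tr></table>"
    "<table><tr><td>no caption here</td></tr></table>"
)
found_tables = BeautifulSoup(page_html, "html.parser").find_all("table")

wanted = "10 most densely populated countries"

# next() returns the first matching index, or None if no caption matches
caption_index = next(
    (
        i
        for i, t in enumerate(found_tables)
        if t.caption is not None and wanted in t.caption.get_text()
    ),
    None,
)

print(caption_index)  # 1
```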


See if you can locate the name of the table, 10 most densely populated countries, in the output below.

In [83]:
print(tables[table_index].prettify())

<table class="wikitable sortable" style="text-align:right">
 <caption>
  10 most densely populated countries
  <small>
   (with population above 5 million)
  </small>
 </caption>
 <tbody>
  <tr>
   <th>
    Rank
   </th>
   <th>
    Country
   </th>
   <th>
    Population
   </th>
   <th>
    Area
    <br/>
    <small>
     (km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th>
    Density
    <br/>
    <small>
     (pop/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/23px-Flag_of_Singapore.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/35px-Flag_of_Singapore.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapo

In [86]:
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []

for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # rows that contain only <th> header cells are skipped
        rows.append({
            "Rank": col[0].text,
            "Country": col[1].text,
            "Population": col[2].text.strip(),
            "Area": col[3].text.strip(),
            "Density": col[4].text.strip(),
        })

population_data = pd.DataFrame(rows, columns=["Rank", "Country", "Population", "Area", "Density"])

population_data

|   | Rank | Country | Population | Area | Density |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 172880000 | 143998 | 1201 |
| 2 | 3 | \n Palestine\n\n | 5266785 | 6020 | 847 |
| 3 | 4 | Lebanon | 6856000 | 10452 | 656 |
| 4 | 5 | Taiwan | 23604000 | 36193 | 652 |
| 5 | 6 | South Korea | 51781000 | 99538 | 520 |
| 6 | 7 | Rwanda | 12374000 | 26338 | 470 |
| 7 | 8 | Haiti | 11578000 | 27065 | 428 |
| 8 | 9 | Netherlands | 17730000 | 41526 | 427 |
| 9 | 10 | Israel | 9530000 | 22072 | 432 |
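
Two things stand out in this result: the Palestine entry still carries newline characters, and every value is stored as text. A possible clean-up step is sketched below on a two-row stand-in DataFrame (`raw` and `clean` are hypothetical names) so that it runs without downloading the page.

```python
import pandas as pd

# Two-row stand-in for the scraped data, with every value stored as a string
raw = pd.DataFrame(
    {
        "Rank": ["1", "3"],
        "Country": ["Singapore", "\n Palestine\n\n"],
        "Population": ["5704000", "5266785"],
    }
)

clean = raw.copy()
clean["Country"] = clean["Country"].str.strip()    # drop surrounding whitespace and newlines
for column in ["Rank", "Population"]:
    clean[column] = pd.to_numeric(clean[column])   # convert text to numbers

print(clean["Country"].tolist())     # ['Singapore', 'Palestine']
print(clean["Population"].tolist())  # [5704000, 5266785]
```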


### <font color='green'>Scrape data from HTML tables into a DataFrame using BeautifulSoup and read_html </font>

Using the same url, data, soup, and tables object as in the last section we can use the read_html function to create a DataFrame.

Remember, the table we need is located in tables[table_index].

We can now use the pandas function read_html and give it the string version of the table as well as the flavor, which is the parsing engine bs4.

In [87]:
pd.read_html(str(tables[table_index]), flavor='bs4')

[   Rank      Country  Population  Area(km2)  Density(pop/km2)
 0     1    Singapore     5704000        710              8033
 1     2   Bangladesh   172880000     143998              1201
 2     3    Palestine     5266785       6020               847
 3     4      Lebanon     6856000      10452               656
 4     5       Taiwan    23604000      36193               652
 5     6  South Korea    51781000      99538               520
 6     7       Rwanda    12374000      26338               470
 7     8        Haiti    11578000      27065               428
 8     9  Netherlands    17730000      41526               427
 9    10       Israel     9530000      22072               432]

The function read_html always returns a list of DataFrames, so we must pick the one we want out of the list.

In [90]:
population_data_read_html = pd.read_html(str(tables[table_index]), flavor='bs4')[0]

population_data_read_html

|   | Rank | Country | Population | Area(km2) | Density(pop/km2) |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 172880000 | 143998 | 1201 |
| 2 | 3 | Palestine | 5266785 | 6020 | 847 |
| 3 | 4 | Lebanon | 6856000 | 10452 | 656 |
| 4 | 5 | Taiwan | 23604000 | 36193 | 652 |
| 5 | 6 | South Korea | 51781000 | 99538 | 520 |
| 6 | 7 | Rwanda | 12374000 | 26338 | 470 |
| 7 | 8 | Haiti | 11578000 | 27065 | 428 |
| 8 | 9 | Netherlands | 17730000 | 41526 | 427 |
| 9 | 10 | Israel | 9530000 | 22072 | 432 |
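
Newer pandas releases (2.1 and later) warn when read_html is given a literal HTML string and suggest wrapping it in a file-like object instead. A minimal sketch of that pattern follows; the two-row table and the name `small_table_html` are made up, and the bs4 flavor assumes that beautifulsoup4 and html5lib are installed, as in the install cell of this lab.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with a header row and one data row
small_table_html = (
    "<table>"
    "<tr><th>Rank</th><th>Country</th></tr>"
    "<tr><td>1</td><td>Singapore</td></tr>"
    "</table>"
)

# StringIO makes the string look like a file, which read_html accepts without a warning
frames = pd.read_html(StringIO(small_table_html), flavor="bs4")
small_frame = frames[0]

print(len(frames))
print(small_frame)
```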


### <font color='green'>Scrape data from HTML tables into a DataFrame using read_html </font>

We can also use the read_html function to directly get DataFrames from a url.

In [96]:
dataframe_list = pd.read_html(url, flavor='bs4')

We can see there are 26 DataFrames, just like when we used find_all on the soup object.

In [97]:
len(dataframe_list)

26

Finally we can pick the DataFrame we need out of the list.

In [98]:
dataframe_list[5]

|   | Rank | Country | Population | Area(km2) | Density(pop/km2) |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 172880000 | 143998 | 1201 |
| 2 | 3 | Palestine | 5266785 | 6020 | 847 |
| 3 | 4 | Lebanon | 6856000 | 10452 | 656 |
| 4 | 5 | Taiwan | 23604000 | 36193 | 652 |
| 5 | 6 | South Korea | 51781000 | 99538 | 520 |
| 6 | 7 | Rwanda | 12374000 | 26338 | 470 |
| 7 | 8 | Haiti | 11578000 | 27065 | 428 |
| 8 | 9 | Netherlands | 17730000 | 41526 | 427 |
| 9 | 10 | Israel | 9530000 | 22072 | 432 |


We can also use the match parameter to select the specific table we want. If the table contains a string matching the text it will be read.

In [100]:
pd.read_html(url, match="10 most densely populated countries", flavor='bs4')[0]

|   | Rank | Country | Population | Area(km2) | Density(pop/km2) |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 172880000 | 143998 | 1201 |
| 2 | 3 | Palestine | 5266785 | 6020 | 847 |
| 3 | 4 | Lebanon | 6856000 | 10452 | 656 |
| 4 | 5 | Taiwan | 23604000 | 36193 | 652 |
| 5 | 6 | South Korea | 51781000 | 99538 | 520 |
| 6 | 7 | Rwanda | 12374000 | 26338 | 470 |
| 7 | 8 | Haiti | 11578000 | 27065 | 428 |
| 8 | 9 | Netherlands | 17730000 | 41526 | 427 |
| 9 | 10 | Israel | 9530000 | 22072 | 432 |
