# **Web Scraping**

Created on Fri Jul 29 08:37:28 2022

@author: David K. Jeremiah

<h2><b>Table of Contents</b></h2>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
        <li>Overview of Webscraping</li>
        <li>Beautiful Soup Objects</li>
        <ul>
            <li>Tag</li>
            <li>Parents, Children, and Siblings</li>
            <li>HTML Attributes</li>
            <li>Navigable String</li>
        </ul>
    </ul>
    <ul>
        <li>Filter</li>
        <ul>
            <li>find_all</li>
            <li>find</li>
        </ul>
    </ul>
    <ul>
        <li>Downloading and Scraping a Web Page Content</li>
    </ul>
</div>

## **Overview of Webscraping**
Suppose you want some information from a website, say, more about your favourite cuisine. You could copy and paste the information from Wikipedia into your own file. But what if, as you read, you realize there is a large amount of information you want, and you want it quickly? Copying and pasting by hand then becomes tedious and error-prone. This is where web scraping comes in handy!

Web scraping is the process of automatically extracting information from a website, often in minutes rather than hours. Most of the data on a website is unstructured data in HTML format; scraping converts it into structured data, such as a spreadsheet or a database, so that it can be used in other applications.

To get started we just need a little Python code and the help of two modules named `Requests` and `Beautiful Soup`. 

First, we import the required modules and functions...

In [1]:
# Import the required modules and functions
from bs4 import BeautifulSoup # this module helps with web scraping
import requests # this module helps us download a web page

## **Beautiful Soup Objects**

Beautiful Soup is a Python package for extracting data from HTML and XML files. 

Here, we'll concentrate on HTML files. This is accomplished by representing the HTML as a collection of objects that contain methods for parsing the HTML. We can navigate the HTML as a tree and/or filter for what we want.

Consider the following HTML:

In [3]:
%%html
<!DOCTYPE html>
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h3><b id='boldest'>Lebron James</b></h3>
        <p> Salary: $ 92,000,000 </p>
        <h3> Stephen Curry</h3>
        <p> Salary: $85,000, 000 </p>
        <h3> Kevin Durant </h3>
        <p> Salary: $73,200, 000</p>
    </body>
</html>

We can store it as a string in a variable. Let's call the variable `html`:

In [4]:
html = "<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

To parse this string, we pass it to the `BeautifulSoup` constructor. The resulting `BeautifulSoup` object represents the document as a nested data structure:

In [7]:
soup = BeautifulSoup(html, 'html.parser')

# View output
print(soup)

<!DOCTYPE html>
<html><head><title>Page Title</title></head><body><h3><b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>


Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.

Next, we can use the method <code>prettify()</code> to display the HTML in the nested structure:

In [8]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>


### Tags
Let's say we want the title of the page and the name of the highest-paid player. For this we can use the <code>Tag</code> object, which corresponds to an HTML tag in the original document, for example, the tag <code>title</code>.

In [9]:
# extracting the 'title' tag
tag_object = soup.title

# print result
print(tag_object)

<title>Page Title</title>


We can confirm that what we have stored in the tag_object variable is truly a tag with the `type` function:

In [11]:
print('tag object type: ', type(soup.title))

tag object type:  <class 'bs4.element.Tag'>


Let's try extracting another tag. For example, let's get the highest-paid player, 'Lebron James', whose name is in the first 'h3' tag:

In [12]:
tag_object1 = soup.h3
print(tag_object1)

<h3><b id="boldest">Lebron James</b></h3>


From the above result, we see two tags enclosing the name 'Lebron James'. What if we just want the name without the tags? We can get it with the concepts of `Parents`, `Children` and `Siblings`, which help us navigate the HTML tree.

### Parents, Children, and Siblings
Each HTML document can actually be referred to as a document tree. 

Tags may contain strings as well as other tags. These elements are the tag’s children. We can represent this as a family tree, where each nested tag is a level in the tree.

The `html` tag contains the `head` and `body` tags. In other words, the `head` and `body` tags are descendants of the `html` tag. In particular, they are its children, and the `html` tag is their parent. The `head` and `body` tags are siblings because they are on the same level.

The `title` tag is the child of the `head` tag, so its parent is the `head` tag. The `title` tag is a descendant of the `html` tag but not its child.

The heading (`h3`) and paragraph (`p`) tags are the children of the `body` tag, and because they share that parent they are siblings of each other. The bold tag (`b`) is a child of the first heading tag, so we can reach it from `tag_object1`:

In [17]:
tag_child = tag_object1.b
print(tag_child)

<b id="boldest">Lebron James</b>


You can get back to the parent with the `.parent` attribute:

In [18]:
parent_tag = tag_child.parent
print(parent_tag)

<h3><b id="boldest">Lebron James</b></h3>


The above result is identical to `tag_object1`:

In [19]:
print(tag_object1)

<h3><b id="boldest">Lebron James</b></h3>


The parent of `tag_object1` is the `body` element:

In [20]:
print(tag_object1.parent)

<body><h3><b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body>


The next sibling of `tag_object1` is the paragraph element:

In [22]:
# First sibling in the body element
sibling_1 = tag_object1.next_sibling
print(sibling_1)

<p> Salary: $ 92,000,000 </p>


`sibling_2` is the next heading element, which is a sibling of both `sibling_1` and `tag_object1`:

In [23]:
# Second sibling
sibling_2 = sibling_1.next_sibling
print(sibling_2)

<h3> Stephen Curry</h3>


Using the object `sibling_2` and the attributes `.next_sibling` and `.string`, we find the salary of Stephen Curry:

In [29]:
# Finding the salary of sibling_2
sibling_3 = sibling_2.next_sibling.string
print('Stephen Curry\'s', sibling_3.strip())

Stephen Curry's Salary: $85,000, 000
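
The sibling walk above works because the `html` string has no whitespace between tags. As a minimal sketch, assuming the same markup is written with line breaks (the `pretty_html` string and the `salaries` dictionary are illustrative names, not part of the original notebook), `.next_sibling` returns the whitespace string between tags, while `find_next_sibling` skips over it:

```python
from bs4 import BeautifulSoup

# Same content as the `html` string above, but with line breaks between tags
# (an assumed variation, used only for illustration).
pretty_html = """
<body>
  <h3><b id='boldest'>Lebron James</b></h3>
  <p> Salary: $ 92,000,000 </p>
  <h3> Stephen Curry</h3>
  <p> Salary: $85,000, 000 </p>
  <h3> Kevin Durant </h3>
  <p> Salary: $73,200, 000</p>
</body>
"""
demo_soup = BeautifulSoup(pretty_html, "html.parser")

# With line breaks in the markup, .next_sibling of a tag is the whitespace
# string that follows it, not the next tag.
first_h3 = demo_soup.h3
print(repr(first_h3.next_sibling))

# find_next_sibling("p") skips text nodes and returns the next <p> tag.
salaries = {}
for heading in demo_soup.find_all("h3"):
    paragraph = heading.find_next_sibling("p")
    salaries[heading.get_text(strip=True)] = paragraph.get_text(strip=True)

print(salaries)
```

Pairing each heading with its following paragraph this way avoids counting siblings by hand.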


### HTML Attributes

HTML elements can have attributes, which give additional information about an element or modify its behaviour.

Attributes are always specified in the start (opening) tag, and usually come in name/value pairs like `name="value"`.
For example, the `<img>` tag, used to embed an image in an HTML page, has a `src` attribute that specifies the path to the image to be displayed: `<img src='image.jpg'>`

In our case, the bold tag `<b>` has an attribute `id` whose value is `boldest`. Recall that we have the bold tag stored in the `tag_child` variable:

In [30]:
# print the tag_child variable containing the bold tag
print(tag_child)

<b id="boldest">Lebron James</b>


We see that it has an `id` attribute. Now, let's obtain its value. We can access the value of a tag’s attribute by treating the tag like a dictionary:

In [31]:
# Get the id attribute of the bold tag
print(tag_child['id'])

boldest


You can access that dictionary directly as attrs:

In [32]:
print(tag_child.attrs)

{'id': 'boldest'}


We can also obtain the value of an attribute using the tag's `get()` method, which works like the `get()` method of a Python dictionary:

In [34]:
print(tag_child.get('id'))

boldest
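
The two access styles differ when an attribute is missing. A short sketch, assuming a variant of the bold tag with an extra `class` attribute (added here only for illustration):

```python
from bs4 import BeautifulSoup

# Illustrative markup: the original tag plus an assumed class attribute.
demo = BeautifulSoup(
    "<b id='boldest' class='name top'>Lebron James</b>", "html.parser"
)
bold = demo.b

# .get() returns None (or a supplied default) when the attribute is absent ...
missing = bold.get("href")
fallback = bold.get("href", "no link")

# ... whereas dictionary-style access raises KeyError.
try:
    bold["href"]
    raised = False
except KeyError:
    raised = True

# Multi-valued attributes such as class come back as a list of values.
classes = bold["class"]

print(missing, fallback, raised, classes)
```

Using `get()` is therefore the safer choice when scraping pages where an attribute may not be present on every tag.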


### Navigable String
Earlier, we got Stephen Curry's salary from the paragraph that follows his `h3` tag using the following lines of code:

In [35]:
sibling_3 = sibling_2.next_sibling.string
print('Stephen Curry\'s', sibling_3.strip())

Stephen Curry's Salary: $85,000, 000


Notice that we used a Beautiful Soup attribute called `string`. The object it returns is what Beautiful Soup calls a ***NavigableString***:

In [36]:
# verify the type is Navigable String
print(type(sibling_3))

<class 'bs4.element.NavigableString'>


A **NavigableString** is just like a Python string (a Unicode string, to be more precise). The main difference is that it also supports some Beautiful Soup features, such as navigating the tree. We can convert it to a plain Python string object with `str()`:

In [38]:
unicode_string = str(sibling_3)
print(type(unicode_string))
print(unicode_string)

<class 'str'>
 Salary: $85,000, 000 
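
One caveat with `.string` is that it is `None` when a tag has more than one child, in which case `get_text()` can be used instead. A minimal sketch on made-up markup (the two cells below are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical cells: one holds a single string, the other a string plus a tag.
demo = BeautifulSoup(
    "<table><tr><td>lightsalmon</td><td>Color <b>Name</b></td></tr></table>",
    "html.parser",
)
simple_cell, mixed_cell = demo.find_all("td")

print(simple_cell.string)     # a single string child, so .string returns it
print(mixed_cell.string)      # a string and a <b> tag: .string is None
print(mixed_cell.get_text())  # get_text() concatenates all nested strings
```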


## **Filter**
Filters allow you to find complex patterns; the simplest filter is a string.

In this section we pass a string to different filter methods, and Beautiful Soup performs a match against that exact string. Consider the following HTML of rocket launches:

In [40]:
%%html
<!DOCTYPE html>
<html>
    <table>
        <tr>
            <td>Flight no</td>
            <td>Launch site</td>
            <td>Payload mass</td>
        </tr>
        <tr>
            <td>1</td>
            <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
            <td>300</td>
        </tr>
        <tr>
            <td>2</td>
            <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
            <td>94</td>
        </tr>
        <tr>
            <td>3</td>
            <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
            <td>80</td>
        </tr>
    </table>
</html>        

| Flight no | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 |
| 2 | Texas | 94 |
| 3 | Florida | 80 |


We can store all of that HTML as a string in the variable `table`:

In [41]:
table = """<html>
    <table>
        <tr>
            <td>Flight no</td>
            <td>Launch site</td>
            <td>Payload mass</td>
        </tr>
        <tr>
            <td>1</td>
            <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
            <td>300</td>
        </tr>
        <tr>
            <td>2</td>
            <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
            <td>94</td>
        </tr>
        <tr>
            <td>3</td>
            <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
            <td>80</td>
        </tr>
    </table>
</html>"""

# confirm that we have a string stored in table
print(type(table))

<class 'str'>


Now, we pass table into the BeautifulSoup constructor:

In [43]:
# creating a BeautifulSoup object
table_bs = BeautifulSoup(table, 'html.parser')

### find_all
The `find_all()` method looks through a tag’s descendants and retrieves all descendants that match your filters.

The method signature is `find_all(name, attrs, recursive, string, limit, **kwargs)`.

<h4><i>name</i></h4>
When we set the `name` parameter to a tag name, the method returns every tag with that name, each together with its children.

In [44]:
# find all table rows tag
table_rows = table_bs.find_all('tr')
print(table_rows)

[<tr>
<td>Flight no</td>
<td>Launch site</td>
<td>Payload mass</td>
</tr>, <tr>
<td>1</td>
<td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
<td>300</td>
</tr>, <tr>
<td>2</td>
<td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
<td>94</td>
</tr>, <tr>
<td>3</td>
<td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
<td>80</td>
</tr>]


The result is a `ResultSet`, which behaves just like a Python list; each element is a `Tag` object:

In [45]:
# print the first element of the list, i.e. the first table row
print(table_rows[0])

<tr>
<td>Flight no</td>
<td>Launch site</td>
<td>Payload mass</td>
</tr>


In [46]:
# print the second table row
print(table_rows[1])

<tr>
<td>1</td>
<td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
<td>300</td>
</tr>


The type is `Tag`:

In [47]:
print(type(table_rows[1]))

<class 'bs4.element.Tag'>


We can assign the first table row to a variable, `table_row_1`, and obtain its first `td` child:

In [50]:
table_row_1 = table_rows[0]
print(table_row_1.td)

<td>Flight no</td>


If we iterate through the list, each element corresponds to a row in the table:

In [52]:
for i, row in enumerate(table_rows):
    print("row", i, ":", row)

row 0 : <tr>
<td>Flight no</td>
<td>Launch site</td>
<td>Payload mass</td>
</tr>
row 1 : <tr>
<td>1</td>
<td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
<td>300</td>
</tr>
row 2 : <tr>
<td>2</td>
<td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
<td>94</td>
</tr>
row 3 : <tr>
<td>3</td>
<td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
<td>80</td>
</tr>


As `row` is a `Tag` object, we can apply the method `find_all` to it and extract the table cells into the object `cells` using the tag name `td`; this returns all the descendants named `td`. The result is a list in which each element corresponds to a cell and is a `Tag` object, so we can iterate through this list as well. The content of each cell can then be extracted with the `string` attribute.

In [53]:
for i, row in enumerate(table_rows):
    print("row", i)
    cells = row.find_all('td')
    for j, cell in enumerate(cells):
        print('column', j, "cell", cell)

row 0
column 0 cell <td>Flight no</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
column 2 cell <td>300</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 cell <td>94</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
column 2 cell <td>80</td>
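
The nested loop prints tags; to keep only the text, the same traversal can collect cell contents into plain Python structures. A sketch on a shortened copy of the launch table (the names `table_html`, `rows` and `launches` are illustrative):

```python
from bs4 import BeautifulSoup

# Shortened copy of the rocket launch table used above.
table_html = """<table>
<tr><td>Flight no</td><td>Launch site</td><td>Payload mass</td></tr>
<tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300</td></tr>
<tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94</td></tr>
</table>"""
demo = BeautifulSoup(table_html, "html.parser")

# One inner list per row, one stripped string per cell.
rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in demo.find_all("tr")
]

# Treat the first row as the header and pair it with every data row.
header, records = rows[0], rows[1:]
launches = [dict(zip(header, record)) for record in records]
print(launches)
```

`get_text()` is used rather than `.string` so that cells wrapping their text in an `<a>` tag are handled the same way as plain cells.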


If we pass a list as the `name` parameter, Beautiful Soup matches against any item in that list.

In [54]:
list_input = table_bs.find_all(name = ['tr', 'td'])
print(list_input)

[<tr>
<td>Flight no</td>
<td>Launch site</td>
<td>Payload mass</td>
</tr>, <td>Flight no</td>, <td>Launch site</td>, <td>Payload mass</td>, <tr>
<td>1</td>
<td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
<td>300</td>
</tr>, <td>1</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>, <td>300</td>, <tr>
<td>2</td>
<td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
<td>94</td>
</tr>, <td>2</td>, <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>, <td>94</td>, <tr>
<td>3</td>
<td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
<td>80</td>
</tr>, <td>3</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>, <td>80</td>]


<h4><i>Attributes</i></h4>

We've established that HTML tags have attributes. If we pass an attribute name as a keyword argument, for example `href`, Beautiful Soup will filter against each tag’s `href` attribute:

In [59]:
list_input_ = table_bs.find_all(href="https://en.wikipedia.org/wiki/Florida")
print(list_input_)

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>, <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]


If we set the `href` argument to `True`, the code finds all tags that have an `href` attribute, regardless of its value:

In [60]:
list_input_ = table_bs.find_all(href=True)
print(list_input_)

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>, <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>, <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]
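
Attribute filters are not limited to exact strings and `True`. As a sketch on a small made-up snippet (the `links_html` markup is an assumption for illustration), a compiled regular expression matches part of an attribute value, and the `attrs` dictionary is an alternative to keyword arguments:

```python
import re
from bs4 import BeautifulSoup

# Made-up snippet with two links and one anchor that has no href.
links_html = """
<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>
<a href="https://en.wikipedia.org/wiki/Texas">Texas</a>
<a id="top">no link here</a>
"""
demo = BeautifulSoup(links_html, "html.parser")

# A compiled pattern is searched for inside each href value.
wiki_florida = demo.find_all("a", href=re.compile(r"/wiki/Florida$"))

# The attrs dictionary filters on attribute names and values as well.
by_attrs = demo.find_all(attrs={"id": "top"})

print(wiki_florida)
print(by_attrs)
```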


How about finding all the elements without an `href` attribute?

In [61]:
list_input_ = table_bs.find_all(href=False)
print(list_input_)

[<html>
<table>
<tr>
<td>Flight no</td>
<td>Launch site</td>
<td>Payload mass</td>
</tr>
<tr>
<td>1</td>
<td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
<td>300</td>
</tr>
<tr>
<td>2</td>
<td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
<td>94</td>
</tr>
<tr>
<td>3</td>
<td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
<td>80</td>
</tr>
</table>
</html>, <table>
<tr>
<td>Flight no</td>
<td>Launch site</td>
<td>Payload mass</td>
</tr>
<tr>
<td>1</td>
<td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
<td>300</td>
</tr>
<tr>
<td>2</td>
<td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
<td>94</td>
</tr>
<tr>
<td>3</td>
<td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
<td>80</td>
</tr>
</table>, <tr>
<td>Flight no</td>
<td>Launch site</td>
<td>Payload mass</td>
</tr>, <td>Flight no</td>, <td>Launch site</td>, <td>Payload mass</td>, <tr>
<td>1</td>
<td><a href="https://en.wikipedia.o

<h4><i>string</i></h4>
With `string` you can search for strings instead of tags. Here we find all the strings that are exactly `Texas`:

In [62]:
table_bs.find_all(string="Texas")

['Texas']
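
The remaining parameters in the `find_all` signature can be sketched on a one-row table (the markup below is made up for illustration): `string` also accepts a regular expression, `limit` caps the number of results, and `recursive=False` restricts the search to direct children.

```python
import re
from bs4 import BeautifulSoup

# Made-up one-row table for illustration.
demo = BeautifulSoup(
    "<table><tr><td>Florida</td><td>Texas</td><td>Florida Keys</td></tr></table>",
    "html.parser",
)

# An exact string matches only strings that are equal to it.
exact = demo.find_all(string="Florida")

# A compiled pattern matches any string that contains it.
partial = demo.find_all(string=re.compile("Florida"))

# limit stops the search after the given number of matches.
first_two = demo.find_all("td", limit=2)

# recursive=False looks only at direct children; the <td> tags are nested
# deeper inside <table> and <tr>, so nothing is found.
direct_children = demo.find_all("td", recursive=False)

print(exact, partial, first_two, direct_children)
```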

### find
The `find()` method looks through a tag’s descendants and retrieves the first descendant that matches your filters.

Use this method when you are looking for a single element: it returns the first matching element in the document.

Consider the following two tables:

In [93]:
%%html
<!DOCTYPE html>
<html>
    <body>
        <h3>Rocket Launch</h3>
        <p></p>
        <table class='rocket'>
            <tr>
                <td><b>Flight No</b></td>
                <td><b>Launch site</b></td>
                <td><b>Payload mass</b></td>
            </tr>
            <tr>
                <td>1</td>
                <td>Florida</td>
                <td>300 kg</td>
            </tr>
            <tr>
                <td>2</td>
                <td>Texas</td>
                <td>94 kg</td>
            </tr>
            <tr>
                <td>3</td>
                <td>Florida</td>
                <td>80 kg</td>
            </tr>
        </table>
        <p></p>
        <h3>Pizza Party</h3>
        <table class='pizza'>
            <tr>
                <td><b>Pizza</b></td>
                <td><b>Place Orders</b></td>
                <td><b>Slices</b></td>
            </tr>
            <tr>
                <td>Domino's Pizza</td>
                <td>10</td>
                <td>100</td>
            </tr>
            <tr>
                <td>Little Caesars</td>
                <td>12</td>
                <td>144</td>
            </tr>
            <tr>
                <td>Papa John's</td>
                <td>15</td>
                <td>165</td>
            </tr>
        </table>
    </body>
</html>

| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |

| Pizza | Place Orders | Slices |
|---|---|---|
| Domino's Pizza | 10 | 100 |
| Little Caesars | 12 | 144 |
| Papa John's | 15 | 165 |


We store the HTML as a Python string, assigning it to `table_new`:

In [94]:
table_new = "<html><body><h3>Rocket Launch</h3><p></p><table class='rocket'><tr><td><b>Flight No</b></td><td><b>Launch site</b></td><td><b>Payload mass</b></td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida</td><td>80 kg</td></tr></table><p></p><h3>Pizza Party</h3><table class='pizza'><tr><td><b>Pizza</b></td><td><b>Place Orders</b></td><td><b>Slices</b></td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144</td></tr><tr><td>Papa John's</td><td>15</td><td>165</td></tr></table></body></html>"

In [95]:
table_new_bs = BeautifulSoup(table_new, "html.parser")
print(table_new_bs)

<html><body><h3>Rocket Launch</h3><p></p><table class="rocket"><tr><td><b>Flight No</b></td><td><b>Launch site</b></td><td><b>Payload mass</b></td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida</td><td>80 kg</td></tr></table><p></p><h3>Pizza Party</h3><table class="pizza"><tr><td><b>Pizza</b></td><td><b>Place Orders</b></td><td><b>Slices</b></td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144</td></tr><tr><td>Papa John's</td><td>15</td><td>165</td></tr></table></body></html>


We can find the first table using the tag name <code>table</code>

In [96]:
print(table_new_bs.find(name='table'))

<table class="rocket"><tr><td><b>Flight No</b></td><td><b>Launch site</b></td><td><b>Payload mass</b></td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida</td><td>80 kg</td></tr></table>


We can filter on the `class` attribute to find the second table, the pizza table.

**Note**, however, that because `class` is a keyword in Python, we append an underscore and write `class_`.

In [97]:
print(table_new_bs.find(name="table", class_="pizza"))

<table class="pizza"><tr><td><b>Pizza</b></td><td><b>Place Orders</b></td><td><b>Slices</b></td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144</td></tr><tr><td>Papa John's</td><td>15</td><td>165</td></tr></table>
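
Unlike `find_all`, which returns an empty list when nothing matches, `find` returns `None`. A small sketch, assuming a trimmed-down version of the two tables and a class name (`burger`) that does not exist in it:

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the two tables above.
demo = BeautifulSoup(
    "<table class='rocket'><tr><td>Florida</td></tr></table>"
    "<table class='pizza'><tr><td>Papa John's</td></tr></table>",
    "html.parser",
)

# The attrs dictionary avoids the class_ spelling altogether.
pizza = demo.find("table", attrs={"class": "pizza"})

# No table has the (hypothetical) class "burger", so find returns None.
missing = demo.find("table", class_="burger")

if missing is None:
    print("no such table")
print(pizza.td.string)
```

Checking for `None` before using the result prevents an `AttributeError` further down the line.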


## **Downloading And Scraping a Web Page Content**
We start with the URL of the web page whose contents we want to download:

In [102]:
url = "http://www.google.com"

We use `requests.get` to download the web page, check the status code, and store the contents in text format in a variable called `data`:

In [109]:
# Download the web page and store the response
data = requests.get(url)

# Check whether the download succeeded
print('Status code: ', data.status_code)

# Get the HTML in text format
data = data.text

Status code:  200
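
Printing the status code shows whether the download worked, but the script carries on either way. A possible sketch of a stricter download step follows; `fetch_html` is a hypothetical helper, the 10-second timeout is an arbitrary choice, and the malformed URL is used on purpose so that the failure path runs without any network access:

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the page text, or raise if the download fails."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text


# A deliberately malformed URL: requests rejects it before contacting any server.
try:
    page = fetch_html("not-a-url")
except requests.exceptions.RequestException as error:
    page = None
    print("Download failed:", error)
```

With a valid address such as the `url` above, `fetch_html(url)` would return the same text as `requests.get(url).text`.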


We create a BeautifulSoup object using the BeautifulSoup constructor

In [151]:
# create a soup object using the variable 'data'
soup_new = BeautifulSoup(data, 'html.parser') 

### **Scrape all links**

In [118]:
for link in soup_new.find_all('a', href=True): 
    # in html anchor/link is represented by the tag <a>
    print(link['href'])

http://www.google.com.ng/imghp?hl=en&tab=wi
http://maps.google.com.ng/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=NG&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.com.ng/intl/en/about/products?tab=wh
http://www.google.com.ng/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.com/&ec=GAZAAQ
/advanced_search?hl=en-NG&authuser=0
http://www.google.com/setprefs?sig=0_L5ASQdAyWp1lZdOrfyFXOYKlj9Q%3D&hl=ha&source=homepage&sa=X&ved=0ahUKEwimzpL086D5AhUTAKYKHZojAEoQ2ZgBCAU
http://www.google.com/setprefs?sig=0_L5ASQdAyWp1lZdOrfyFXOYKlj9Q%3D&hl=ig&source=homepage&sa=X&ved=0ahUKEwimzpL086D5AhUTAKYKHZojAEoQ2ZgBCAY
http://www.google.com/setprefs?sig=0_L5ASQdAyWp1lZdOrfyFXOYKlj9Q%3D&hl=yo&source=homepage&sa=X&ved=0ahUKEwimzpL086D5AhUTAKYKHZojAEoQ2ZgBCAc
http://www.google.com/setprefs?sig=0_L5ASQdAyWp1lZdOrfy
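
Some of the links above, such as `/preferences?hl=en`, are relative to the site. A brief sketch of resolving them against the page address with the standard-library `urljoin` (the two-link `sample` string stands in for the downloaded page):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "http://www.google.com"

# Stand-in for the downloaded page: one relative and one absolute link.
sample = (
    '<a href="/preferences?hl=en">Settings</a>'
    '<a href="https://news.google.com/?tab=wn">News</a>'
)
demo = BeautifulSoup(sample, "html.parser")

# urljoin resolves relative links and leaves absolute ones unchanged.
absolute_links = [
    urljoin(base_url, anchor["href"]) for anchor in demo.find_all("a", href=True)
]
print(absolute_links)
```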

### **Scrape all image tags**

In [122]:
for img_tag in soup_new.find_all('img', src=True):
    # in html image is represented by the tag <img>
    print(img_tag)
    print(" ")
    print(img_tag.get('src'))

<img alt="Google" height="92" id="hplogo" src="/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png" style="padding:28px 0 14px" width="272"/>
 
/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png


### **Scrape data from HTML tables**

In [123]:
# The below url contains an html table with data about colors and color codes.
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Before proceeding to scrape a website, let's examine its contents and the way the data is organized. Open the above URL in your browser and check how many rows and columns there are in the color table.

In [126]:
# get the contents of the webpage in text format and store in a variable called data_new
data_new = requests.get(url).text

Next, parse the page with the Beautiful Soup constructor:

In [129]:
soup_ = BeautifulSoup(data_new,"html.parser")

In [134]:
# find a html table in the web page
table = soup_.find(name='table') # in html table is represented by the tag <table>

In [149]:
# Get all rows from the table
for i, row in enumerate(table.find_all('tr')):
    # Get all columns in each row
    cols  = row.find_all('td')
    print("row", i, ":") # print each row
    print(cols[2].string, ":", str(cols[3].string)) # get the colors, and hex code in each row

row 0 :
Color Name : None
row 1 :
lightsalmon : #FFA07A
row 2 :
salmon : #FA8072
row 3 :
darksalmon : #E9967A
row 4 :
lightcoral : #F08080
row 5 :
coral : #FF7F50
row 6 :
tomato : #FF6347
row 7 :
orangered : #FF4500
row 8 :
gold : #FFD700
row 9 :
orange : #FFA500
row 10 :
darkorange : #FF8C00
row 11 :
lightyellow : #FFFFE0
row 12 :
lemonchiffon : #FFFACD
row 13 :
papayawhip : #FFEFD5
row 14 :
moccasin : #FFE4B5
row 15 :
peachpuff : #FFDAB9
row 16 :
palegoldenrod : #EEE8AA
row 17 :
khaki : #F0E68C
row 18 :
darkkhaki : #BDB76B
row 19 :
yellow : #FFFF00
row 20 :
lawngreen : #7CFC00
row 21 :
chartreuse : #7FFF00
row 22 :
limegreen : #32CD32
row 23 :
lime : #00FF00
row 24 :
forestgreen : #228B22
row 25 :
green : #008000
row 26 :
powderblue : #B0E0E6
row 27 :
lightblue : #ADD8E6
row 28 :
lightskyblue : #87CEFA
row 29 :
skyblue : #87CEEB
row 30 :
deepskyblue : #00BFFF
row 31 :
lightsteelblue : #B0C4DE
row 32 :
dodgerblue : #1E90FF


### **Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas**

In [150]:
# import pandas
import pandas as pd

In [152]:
# The below url contains html tables with data about world population.
url = "https://en.wikipedia.org/wiki/World_population"

Before proceeding to scrape a web site, we need to examine the contents, and the way data is organized on the website. 

Therefore, open the above url in your browser and check the tables on the webpage.

In [155]:
# get the contents of the webpage and store in a variable called data_new_2
data_new_2 = requests.get(url)

# Status check
print(data_new_2.status_code)

# Convert contents from data_new_2 to text format
data_new_2 = data_new_2.text

200


In [157]:
# Parse data using Beautiful Soup constructor
soup_new_1 = BeautifulSoup(data_new_2,"html.parser")

In [160]:
# find all html tables in the web page
table_list = soup_new_1.find_all(name="table")

# we can see how many tables were found by checking the length of the tables list
print(len(table_list))

25


We have 25 tables on this web page.

Now, assume that we are looking for the `10 most densely populated countries` table. We can look through the list of tables and find the right one based on the data in each table, or we can search for the table's name if it appears in the table, although this option might not always work.

In [197]:
for index, table in enumerate(table_list):
    if "10 most densely populated countries" in str(table):
        table_index = index

print(table_index)

5
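
The loop above matches the text anywhere in a table and leaves `table_index` undefined when nothing matches. A possible alternative, assuming the title sits in the table's `<caption>` element, is sketched below; `find_table_by_caption` is a hypothetical helper and the two-table document is made up for the demonstration:

```python
from bs4 import BeautifulSoup


def find_table_by_caption(soup, text):
    """Hypothetical helper: first <table> whose <caption> contains text, else None."""
    for table in soup.find_all("table"):
        caption = table.find("caption")
        if caption is not None and text in caption.get_text():
            return table
    return None


# Made-up document with two captioned tables.
demo = BeautifulSoup(
    "<table><caption>World population milestones</caption></table>"
    "<table><caption>10 most densely populated countries "
    "<small>(with population above 5 million)</small></caption></table>",
    "html.parser",
)

match = find_table_by_caption(demo, "10 most densely populated countries")
no_match = find_table_by_caption(demo, "no such table")
print(match is not None, no_match)
```

Returning `None` for a missing table makes the failure explicit instead of surfacing later as a `NameError`.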


Using `table_index`, let's retrieve the '10 most densely populated countries' table and display it:

In [203]:
pop_table = table_list[table_index] # this is the table we need

print(pop_table.prettify())

<table class="wikitable sortable" style="text-align:right">
 <caption>
  10 most densely populated countries
  <small>
   (with population above 5 million)
  </small>
 </caption>
 <tbody>
  <tr>
   <th>
    Rank
   </th>
   <th>
    Country
   </th>
   <th>
    Population
   </th>
   <th>
    Area
    <br/>
    <small>
     (km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th>
    Density
    <br/>
    <small>
     (pop/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/23px-Flag_of_Singapore.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/35px-Flag_of_Singapore.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapo

In [204]:
columns = ["Rank", "Country", "Population", "Area", "Density"]
rows = []

for row in pop_table.tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # header rows contain only <th> cells, so col is empty for them
        rows.append({"Rank": col[0].text,
                     "Country": col[1].text,
                     "Population": col[2].text.strip(),
                     "Area": col[3].text.strip(),
                     "Density": col[4].text.strip()})

# DataFrame.append was removed in pandas 2.0, so build the frame from the list of rows
population_data = pd.DataFrame(rows, columns=columns)

population_data

|   | Rank | Country | Population | Area | Density |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 173150000 | 143998 | 1202 |
| 2 | 3 | \n Palestine\n\n | 5266785 | 6020 | 847 |
| 3 | 4 | Lebanon | 6856000 | 10452 | 656 |
| 4 | 5 | Taiwan | 23604000 | 36193 | 652 |
| 5 | 6 | South Korea | 51781000 | 99538 | 520 |
| 6 | 7 | Rwanda | 12374000 | 26338 | 470 |
| 7 | 8 | Haiti | 11578000 | 27065 | 428 |
| 8 | 9 | Netherlands | 17720000 | 41526 | 427 |
| 9 | 10 | Israel | 9550000 | 22072 | 433 |
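
Values scraped this way are all text, and the `Country` column still carries stray whitespace in the Palestine row. A minimal cleaning sketch on a two-row stand-in frame follows; the comma-separated population strings are an assumption about how the raw cell text might look:

```python
import pandas as pd

# Two-row stand-in for population_data; the thousands separators are assumed.
raw = pd.DataFrame(
    {
        "Country": ["Singapore", "\n Palestine\n\n"],
        "Population": ["5,704,000", "5,266,785"],
    }
)

cleaned = raw.copy()

# Remove the surrounding whitespace and newlines from the country names.
cleaned["Country"] = cleaned["Country"].str.strip()

# Drop the separators, then convert the text to integers.
cleaned["Population"] = pd.to_numeric(
    cleaned["Population"].str.replace(",", "", regex=False)
)

print(cleaned)
print(cleaned.dtypes)
```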


### **Scrape data from HTML tables into a DataFrame using BeautifulSoup and read_html**
Using the same `url`, `data_new_2`, `soup_new_1`, and `table_list` objects as in the last section, we can use the pandas `read_html` function to create a DataFrame.

Remember that the table we need is stored in `pop_table`.

We can now call the pandas function `read_html`, giving it the string version of the table as well as the `flavor`, which selects the parsing engine (here `bs4`).

In [207]:
pd.read_html(str(pop_table), flavor="bs4")

[   Rank      Country  Population  Area(km2)  Density(pop/km2)
 0     1    Singapore     5704000        710              8033
 1     2   Bangladesh   173150000     143998              1202
 2     3    Palestine     5266785       6020               847
 3     4      Lebanon     6856000      10452               656
 4     5       Taiwan    23604000      36193               652
 5     6  South Korea    51781000      99538               520
 6     7       Rwanda    12374000      26338               470
 7     8        Haiti    11578000      27065               428
 8     9  Netherlands    17720000      41526               427
 9    10       Israel     9550000      22072               433]

The function `read_html` always returns a list of DataFrames.

Thus, even though the list contains only one DataFrame here, we must pick it out of the list by index to work with it as a proper DataFrame.

In [208]:
pd.read_html(str(pop_table), flavor="bs4")[0]

|   | Rank | Country | Population | Area(km2) | Density(pop/km2) |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 173150000 | 143998 | 1202 |
| 2 | 3 | Palestine | 5266785 | 6020 | 847 |
| 3 | 4 | Lebanon | 6856000 | 10452 | 656 |
| 4 | 5 | Taiwan | 23604000 | 36193 | 652 |
| 5 | 6 | South Korea | 51781000 | 99538 | 520 |
| 6 | 7 | Rwanda | 12374000 | 26338 | 470 |
| 7 | 8 | Haiti | 11578000 | 27065 | 428 |
| 8 | 9 | Netherlands | 17720000 | 41526 | 427 |
| 9 | 10 | Israel | 9550000 | 22072 | 433 |


### **Scrape data from HTML tables into a DataFrame using read_html**

We can also use the read_html function to directly get DataFrames from a url.

In [210]:
dataframe_list = pd.read_html(url, flavor="bs4")

# print the number of table found in the url
print(len(dataframe_list))

25


We see that we get the same number of tables as when we used `find_all` on the `soup_new_1` object.

Finally we can pick the DataFrame we need out of the list.

In [215]:
print(dataframe_list[table_index])

   Rank      Country  Population  Area(km2)  Density(pop/km2)
0     1    Singapore     5704000        710              8033
1     2   Bangladesh   173150000     143998              1202
2     3    Palestine     5266785       6020               847
3     4      Lebanon     6856000      10452               656
4     5       Taiwan    23604000      36193               652
5     6  South Korea    51781000      99538               520
6     7       Rwanda    12374000      26338               470
7     8        Haiti    11578000      27065               428
8     9  Netherlands    17720000      41526               427
9    10       Israel     9550000      22072               433


Let's get the `Global annual population growth` table, located at index 7:

In [228]:
print(dataframe_list[7])

    Year  Population Yearly growth           Density(pop/km2)  \
    Year  Population             %    Number Density(pop/km2)   
0   1951  2584034261         1.88%  47603112               17   
1   1952  2630861562         1.81%  46827301               18   
2   1953  2677608960         1.78%  46747398               18   
3   1954  2724846741         1.76%  47237781               18   
4   1955  2773019936         1.77%  48173195               19   
..   ...         ...           ...       ...              ...   
65  2016  7464022000         1.14%  84225000               50   
66  2017  7547859000         1.12%  83837000               51   
67  2018  7631091000         1.10%  83232000               51   
68  2019  7713468000         1.08%  82377000               52   
69  2020  7795000000         1.05%  81331000               52   

   Urban population       
             Number    %  
0         775067697  30%  
1         799282533  30%  
2         824289989  31%  
3         850179106
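
This table has a two-level header, which `read_html` represents as MultiIndex columns. One way to flatten them into single names is sketched below on a one-row frame rebuilt by hand from the first row of the output above (only four of the columns are reproduced):

```python
import pandas as pd

# Hand-built frame mimicking part of the two-level header shown above.
columns = pd.MultiIndex.from_tuples(
    [
        ("Year", "Year"),
        ("Population", "Population"),
        ("Yearly growth", "%"),
        ("Yearly growth", "Number"),
    ]
)
growth = pd.DataFrame([[1951, 2584034261, "1.88%", 47603112]], columns=columns)

# Join the two header levels, keeping a single copy when both levels repeat.
growth.columns = [
    top if top == bottom else f"{top} {bottom}" for top, bottom in growth.columns
]
print(list(growth.columns))
```

Flat column names make selections such as `growth["Yearly growth %"]` simpler than indexing with tuples.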

We can also use the `match` parameter to select the specific table we want. If the table contains a string matching the text it will be read.

In [217]:
pd.read_html(url, match="10 most densely populated countries", flavor='bs4')[0]

|   | Rank | Country | Population | Area(km2) | Density(pop/km2) |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 173150000 | 143998 | 1202 |
| 2 | 3 | Palestine | 5266785 | 6020 | 847 |
| 3 | 4 | Lebanon | 6856000 | 10452 | 656 |
| 4 | 5 | Taiwan | 23604000 | 36193 | 652 |
| 5 | 6 | South Korea | 51781000 | 99538 | 520 |
| 6 | 7 | Rwanda | 12374000 | 26338 | 470 |
| 7 | 8 | Haiti | 11578000 | 27065 | 428 |
| 8 | 9 | Netherlands | 17720000 | 41526 | 427 |
| 9 | 10 | Israel | 9550000 | 22072 | 433 |
