### Web Scraping with Python

#### **1. BeautifulSoup**
It is used for web scraping: pulling data out of HTML and XML files. It creates a parse tree from the page source that can be used to extract data in a hierarchical, more readable manner.

In [2]:
from bs4 import BeautifulSoup
import requests
URL = "http://www.example.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
soup

<!DOCTYPE html>
<html lang="en"><head><title>Example Domain</title><meta content="width=device-width, initial-scale=1" name="viewport"/><style>body{background:#eee;width:60vw;margin:15vh auto;font-family:system-ui,sans-serif}h1{font-size:1.5em}div{opacity:0.8}a:link,a:visited{color:#348}</style><body><div><h1>Example Domain</h1><p>This domain is for use in documentation examples without needing permission. Avoid use in operations.<p><a href="https://iana.org/domains/example">Learn more</a></p></p></div></body></head></html>

#### **2. Scrapy**
Scrapy is an open-source, collaborative web crawling framework for Python. It is used to extract data from websites.

In [3]:
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/',]
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'quote': quote.css('span.text::text').get()}

In [4]:
QuotesSpider('hello')

<QuotesSpider 'hello' at 0x2b4b8bd7b60>

#### **3. Selenium**
Selenium is a tool used for controlling web browsers through programs and automating browser tasks.
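
A minimal sketch of browser automation, assuming Selenium 4 and a local Chrome installation are available. The function name `get_heading` and the headless flag are illustrative choices rather than part of this lab. The imports sit inside the function so that the cell can be defined even where Selenium is not installed.

```python
def get_heading(url):
    """Open url in headless Chrome and return the text of the first <h1>."""
    # Imported here so that defining the function does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.find_element(By.TAG_NAME, "h1").text
    finally:
        driver.quit()  # always close the browser, even on errors


# Example call (needs Chrome and network access):
# print(get_heading("http://www.example.com"))
```

Unlike `requests`, this drives a real browser, so it may also see content that a page builds with JavaScript.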

### Applications of Web Scraping
1. Price comparison
2. Email address gathering
3. Social media scraping

### Web Scraping Lab

In [5]:
pip install --upgrade beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [6]:
pip install --upgrade pandas

Note: you may need to restart the kernel to use updated packages.

In [7]:
import warnings
warnings.simplefilter('ignore')

In [8]:
from bs4 import BeautifulSoup
import requests

In [9]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

In [10]:
html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

In [None]:
soup = BeautifulSoup(html, 'html.parser')

In [None]:
soup.prettify()

In [None]:
print(soup.prettify())

### Tags


In [None]:
tag_object = soup.title
print('tag object:', tag_object)

In [None]:
print("tag object type: ", type(tag_object))

In [None]:
soup.h3

In [None]:
tag_object = soup.h3
tag_object

### Children, Parents, and Siblings

In [None]:
tag_child = tag_object.b
tag_child #Lebron James

In [None]:
parent_tag = tag_child.parent
parent_tag

In [None]:
sibling_1 = tag_object.next_sibling
sibling_1

In [None]:
sibling_2 = sibling_1.next_sibling
sibling_2

In [None]:
sibling_2.next_sibling
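
The `html` string in this lab has no whitespace between tags, so `next_sibling` lands directly on the next tag. On pages with line breaks between tags, the next sibling is often a text node instead. A small sketch of that case, using a hypothetical snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a newline between the two tags.
doc = BeautifulSoup("<div><h3>Kevin Durant</h3>\n<p>Salary</p></div>", "html.parser")

heading = doc.h3
print(repr(heading.next_sibling))      # the newline text node between the tags
print(heading.find_next_sibling("p"))  # skips text nodes and returns the <p> tag
print(heading.parent.name)             # name of the enclosing tag
```

`find_next_sibling()` is usually the safer choice when only tags are of interest.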

In [None]:
tag_child['id']

In [None]:
tag_child.attrs

In [None]:
tag_child.get('id')
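
The three access styles above differ mainly in how they treat a missing attribute. A short sketch on a hypothetical tag (the `class` values are made up for illustration):

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup("<b id='boldest' class='name star'>Lebron James</b>", "html.parser")
tag = doc.b

print(tag["id"])             # square brackets raise KeyError if the attribute is missing
print(tag.attrs)             # all attributes as a dictionary
print(tag.get("href"))       # .get() returns None for a missing attribute
print(tag.get("href", "#"))  # or a default value of your choice
print(tag["class"])          # class is multi-valued, so it comes back as a list
```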

### Navigable String
A navigable string is the text content within a tag. BeautifulSoup uses the `NavigableString` class to contain this text.

In [None]:
tag_string = tag_child.string
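
A brief sketch of what `.string` returns, rebuilt here on a one-tag document so it runs on its own:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

doc = BeautifulSoup("<b id='boldest'>Lebron James</b>", "html.parser")
tag_string = doc.b.string

print(type(tag_string))             # the NavigableString class
print(isinstance(tag_string, str))  # NavigableString is a subclass of str
plain = str(tag_string)             # convert to an ordinary Python string
print(plain.upper())
```

Converting with `str()` is useful when the text is stored or passed on outside BeautifulSoup.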

In [1]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td>
    <td>80 kg</td>
  </tr>
</table>

| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |


In [2]:
table="<table><tr><td id='flight' >Flight No</td><td>Launch site</td><td>Payload mass</td></tr><tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td><td>80 kg</td></tr></table>"

In [4]:
from bs4 import BeautifulSoup
table_bs = BeautifulSoup(table, "html.parser")
print(table_bs.prettify())

<table>
 <tr>
  <td id="flight">
   Flight No
  </td>
  <td>
   Launch site
  </td>
  <td>
   Payload mass
  </td>
 </tr>
 <tr>
  <td>
   1
  </td>
  <td>
   <a href="https://en.wikipedia.org/wiki/Florida">
    Florida
   </a>
  </td>
  <td>
   300 kg
  </td>
 </tr>
 <tr>
  <td>
   2
  </td>
  <td>
   <a href="https://en.wikipedia.org/wiki/Texas">
    Texas
   </a>
  </td>
  <td>
   94 kg
  </td>
 </tr>
 <tr>
  <td>
   3
  </td>
  <td>
   <a href="https://en.wikipedia.org/wiki/Florida">
    Florida
   </a>
  </td>
  <td>
   80 kg
  </td>
 </tr>
</table>



In [8]:
table_rows = table_bs.find_all('tr')
table_rows

[<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>,
 <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>]

In [9]:
#first row
table_rows[0]

<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>

In [13]:
type(table_rows[-1])

bs4.element.Tag

In [11]:
table_rows[0:2]

[<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>,
 <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>]

In [12]:
print(type(table_rows))

<class 'bs4.element.ResultSet'>


In [14]:
first_row = table_rows[0]
first_row.td

<td id="flight">Flight No</td>

In [17]:
for i, row in enumerate(table_rows):
    print("row", i, ":----is", row)

row 0 :----is <tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>
row 1 :----is <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>
row 2 :----is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
row 3 :----is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>


In [31]:
for i, row in enumerate(table_rows):
    print("row", i, "\n")
    cells = row.find_all('td')
    for j, cell in enumerate(cells):
        print('Column', j, '---> Cell:', cell)

row 0 

Column 0 ---> Cell: <td id="flight">Flight No</td>
Column 1 ---> Cell: <td>Launch site</td>
Column 2 ---> Cell: <td>Payload mass</td>
row 1 

Column 0 ---> Cell: <td>1</td>
Column 1 ---> Cell: <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
Column 2 ---> Cell: <td>300 kg</td>
row 2 

Column 0 ---> Cell: <td>2</td>
Column 1 ---> Cell: <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
Column 2 ---> Cell: <td>94 kg</td>
row 3 

Column 0 ---> Cell: <td>3</td>
Column 1 ---> Cell: <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td>
Column 2 ---> Cell: <td>80 kg</td>


In [32]:
for i, row in enumerate(table_rows):
    print("rows", i, '\n')
    cells = row.find_all('td')
    for j, cell in enumerate(cells):
        print('Column', j, '--> Cell: ', cell)

rows 0 

Column 0 --> Cell:  <td id="flight">Flight No</td>
Column 1 --> Cell:  <td>Launch site</td>
Column 2 --> Cell:  <td>Payload mass</td>
rows 1 

Column 0 --> Cell:  <td>1</td>
Column 1 --> Cell:  <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
Column 2 --> Cell:  <td>300 kg</td>
rows 2 

Column 0 --> Cell:  <td>2</td>
Column 1 --> Cell:  <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
Column 2 --> Cell:  <td>94 kg</td>
rows 3 

Column 0 --> Cell:  <td>3</td>
Column 1 --> Cell:  <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td>
Column 2 --> Cell:  <td>80 kg</td>
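
The loops above print whole `<td>` tags. When only the cell text is needed, the same traversal can collect plain strings. A minimal sketch on a shortened copy of the table, with `get_text(strip=True)` as one possible way to drop the markup and surrounding whitespace:

```python
from bs4 import BeautifulSoup

table = (
    "<table><tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr>"
    "<tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr></table>"
)
table_bs = BeautifulSoup(table, "html.parser")

# One list of cell texts per table row.
rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in table_bs.find_all("tr")
]
header, records = rows[0], rows[1:]

print(header)
for record in records:
    print(dict(zip(header, record)))  # pair each value with its column name
```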


In [33]:
list_input = table_bs.find_all(name=['tr', 'td'])
list_input

[<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>,
 <td>300 kg</td>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>,
 <td>94 kg</td>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>,
 <td>3</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td>,
 <td>80 kg</td>]

### Attributes
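Pass an attribute as a keyword argument to `find_all()` to match tags by attribute value, for example `table_bs.find_all(id='flight')`.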

In [34]:
table_bs.find_all(id='flight')

[<td id="flight">Flight No</td>]

In [35]:
list_input = table_bs.find_all(href = "https://en.wikipedia.org/wiki/Florida")
list_input

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

In [38]:
# to find all tags with href
table_bs.find_all(href = True)

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

In [48]:
table_bs.find_all('a')


[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

In [51]:
table_bs.find_all(id = 'boldest')

[]

In [52]:
table_bs.find_all(string = 'Florida')

['Florida', 'Florida']

In [53]:
table_bs.find(string = 'Texas')

'Texas'
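
`find_all()` and `find()` differ in what they return when nothing matches, which matters before indexing or chaining calls. A small sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup(
    "<td><a href='/fl'>Florida</a></td>"
    "<td><a href='/tx'>Texas</a></td>"
    "<td><a href='/fl'>Florida</a></td>",
    "html.parser",
)

all_links = doc.find_all("a")            # every match, as a list-like ResultSet
first_only = doc.find_all("a", limit=1)  # stop after the first match
first_link = doc.find("a")               # the first match itself, not a list
missing = doc.find("img")                # None when nothing matches

print(len(all_links), first_only, first_link, missing)
```

Because `find()` can return `None`, it is worth checking the result before accessing attributes on it.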

### Find()

In [54]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td> 
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>


| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |

| Pizza Place | Orders | Slices |
|---|---|---|
| Domino's Pizza | 10 | 100 |
| Little Caesars | 12 | 144 |
| Papa John's | 15 | 165 |


In [55]:
two_tables="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>"

In [56]:
two_tables_bs = BeautifulSoup(two_tables, 'html.parser')
print(two_tables_bs.prettify())

<h3>
 Rocket Launch
</h3>
<p>
 <table class="rocket">
  <tr>
   <td>
    Flight No
   </td>
   <td>
    Launch site
   </td>
   <td>
    Payload mass
   </td>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td>
    Florida
   </td>
   <td>
    300 kg
   </td>
  </tr>
  <tr>
   <td>
    2
   </td>
   <td>
    Texas
   </td>
   <td>
    94 kg
   </td>
  </tr>
  <tr>
   <td>
    3
   </td>
   <td>
    Florida
   </td>
   <td>
    80 kg
   </td>
  </tr>
 </table>
</p>
<p>
 <h3>
  Pizza Party
 </h3>
 <table class="pizza">
  <tr>
   <td>
    Pizza Place
   </td>
   <td>
    Orders
   </td>
   <td>
    Slices
   </td>
  </tr>
  <tr>
   <td>
    Domino's Pizza
   </td>
   <td>
    10
   </td>
   <td>
    100
   </td>
  </tr>
  <tr>
   <td>
    Little Caesars
   </td>
   <td>
    12
   </td>
   <td>
    144
   </td>
  </tr>
  <tr>
   <td>
    Papa John's
   </td>
   <td>
    15
   </td>
   <td>
    165
   </td>
  </tr>
 </table>
</p>



In [57]:
two_tables_bs.find('table')

<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>

In [59]:
two_tables_bs.find("table", class_='pizza')  # class is a reserved keyword in Python, so BeautifulSoup uses class_ instead

<table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
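
The `class_` keyword has an equivalent dictionary form, and both match a tag that carries the class among several. A short sketch on hypothetical tables (the extra `menu` class is invented for illustration):

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup(
    "<table class='rocket'><tr><td>Florida</td></tr></table>"
    "<table class='pizza menu'><tr><td>Little Caesars</td></tr></table>",
    "html.parser",
)

by_keyword = doc.find("table", class_="pizza")
by_attrs = doc.find("table", attrs={"class": "pizza"})  # same filter, dictionary form

print(by_keyword is by_attrs)              # both calls return the same tag
print(by_keyword["class"])                 # the full list of classes on that tag
print(doc.find("table", class_="burger"))  # None: no table has this class
```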

In [60]:
url = "https://web.archive.org/web/20230224123642/https://www.ibm.com/us-en/"

In [61]:
import requests
data = requests.get(url).text

In [63]:
soup = BeautifulSoup(data, "html.parser")
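
The fetch-and-parse steps above can be wrapped in a small helper. This is only a sketch: the name `fetch_soup` and the 10-second timeout are assumptions, not part of the lab.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download url and return a parsed BeautifulSoup tree."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")


# Example call (needs network access):
# soup = fetch_soup("https://web.archive.org/web/20230224123642/https://www.ibm.com/us-en/")
```

Failing early on an error status avoids silently parsing an error page as if it were the real content.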


In [65]:
# scrape all links
for link in soup.find_all('a', href = True):
    print(link.get('href'))

https://web.archive.org/web/20230224123642/https://www.ibm.com/reports/threat-intelligence/
https://web.archive.org/web/20230224123642/https://www.ibm.com/about
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/strategy/?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/ibmix?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/technology/
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/operations/?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/strategic-partnerships
https://web.archive.org/web/20230224123642/https://www.ibm.com/employment/?lnk=flatitem
https://web.archive.org/web/20230224123642/https://www.ibm.com/impact
https://web.archive.org/web/20230224123642/https://research.ibm.com/
https://web.archive.org/web/20230224123642/https://www.ibm.com/
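
The links printed above happen to be absolute, but many pages use relative `href` values. A minimal sketch of resolving them with the standard library's `urljoin`, using a hypothetical base address and snippet:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.example.com/docs/index.html"  # hypothetical page address
page = BeautifulSoup(
    "<a href='/about'>About</a>"
    "<a href='guide.html'>Guide</a>"
    "<a href='https://iana.org/domains/example'>More</a>"
    "<a>No link</a>",
    "html.parser",
)

# href=True skips <a> tags that have no href attribute at all.
links = [urljoin(base_url, a["href"]) for a in page.find_all("a", href=True)]
for link in links:
    print(link)
```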


In [66]:
trade = 'https://tradeconnection.site/'
Data = requests.get(trade).text
# Parse the downloaded HTML (Data), not the URL string (trade)
trade_bs = BeautifulSoup(Data, 'html.parser')


In [67]:
for link in trade_bs.find_all('a', href=True):
    print(link.get('href'))
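
A common pitfall is handing BeautifulSoup the URL itself. It then parses the address as if it were markup, finds no tags, and the link loop prints nothing. A sketch of the difference, with a hypothetical address and a hand-written string standing in for the downloaded page:

```python
import warnings

from bs4 import BeautifulSoup

url = "https://www.example.com/"  # hypothetical address
html = "<html><body><a href='/about'>About</a></body></html>"  # stand-in for requests.get(url).text

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the "looks like a URL" warning
    from_url = BeautifulSoup(url, "html.parser")
from_html = BeautifulSoup(html, "html.parser")

print(from_url.find_all("a"))   # empty: the URL text contains no tags
print(from_html.find_all("a"))  # the link from the page content
```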

In [70]:
for link in soup.find_all('img'):
    print(link)
    print(link.get('src'))

<img alt="Person standing with arms crossed" aria-describedby="bx--image-1" class="bx--image__img" src="https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/0a23e414312bcb6f/08196d0e04260ae5_cropped.jpg.global.sr_16x9.jpg"/>
https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/0a23e414312bcb6f/08196d0e04260ae5_cropped.jpg.global.sr_16x9.jpg
<img alt="Team members at work in a conference room" aria-describedby="bx--image-2" class="bx--image__img" src="https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/06655c075aa3aa29/CaitOppermann_2019_12_06_IBMGarage_DSC3304.jpg.global.m_16x9.jpg"/>
https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/06655c075aa3aa29/CaitOppermann_2019_12_06_IBMGarage_DSC3304.jpg.global.m_16x9.jpg
<img alt="Coworkers looking at laptops" aria-describedby="bx--image-3" class="bx--image__img" src="https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/08f951353c2707b8/052022_CaitOp

In [71]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

### Scrape data from HTML tables


In [72]:
Data = requests.get(url).text

In [73]:
soup = BeautifulSoup(Data, "html.parser")

In [77]:
table = soup.find('table')

In [79]:
# Get all rows from the table
for row in table.find_all('tr'):  # in HTML, a table row is represented by the <tr> tag
    # Get all cells in the row
    cols = row.find_all('td')  # in HTML, a table cell is represented by the <td> tag
    color_name = cols[2].string  # third column: color name
    color_code = cols[3].string  # fourth column: hex color code
    print("{}--->{}".format(color_name, color_code))

Color Name--->None
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
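
The `None` in the first output line comes from `.string`, which returns `None` when a cell holds more than one child node. The sketch below reproduces this on a made-up table. The header markup is an assumption, since the real page's header cell is not shown here. It uses `get_text()` instead and skips the header row.

```python
from bs4 import BeautifulSoup

# Hypothetical table: the last header cell holds two strings split by <br/>.
html = """
<table>
  <tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code<br/>#RRGGBB</td></tr>
  <tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>
  <tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>
</table>
"""
rows = BeautifulSoup(html, "html.parser").find("table").find_all("tr")

header_cell = rows[0].find_all("td")[3]
print(header_cell.string)                     # None: more than one child node
print(header_cell.get_text(" ", strip=True))  # all text pieces joined with a space

colors = {}
for row in rows[1:]:  # skip the header row
    cols = row.find_all("td")
    if len(cols) < 4:  # ignore rows that are too short to index safely
        continue
    colors[cols[2].get_text(strip=True)] = cols[3].get_text(strip=True)
print(colors)
```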


In [82]:
for row in table.find_all('tr'):
    #get all columns in each row
    cols = row.find_all('td')
    color_index = cols[0].string
    color_name = cols[2].string
    color_code = cols[3].string
    print("{}----> {} .....> {}".format(color_index, color_name, color_code))

Number ----> Color Name .....> None
1----> lightsalmon .....> #FFA07A
2----> salmon .....> #FA8072
3----> darksalmon .....> #E9967A
4----> lightcoral .....> #F08080
5----> coral .....> #FF7F50
6----> tomato .....> #FF6347
7----> orangered .....> #FF4500
8----> gold .....> #FFD700
9----> orange .....> #FFA500
10----> darkorange .....> #FF8C00
11----> lightyellow .....> #FFFFE0
12----> lemonchiffon .....> #FFFACD
13----> papayawhip .....> #FFEFD5
14----> moccasin .....> #FFE4B5
15----> peachpuff .....> #FFDAB9
16----> palegoldenrod .....> #EEE8AA
17----> khaki .....> #F0E68C
18----> darkkhaki .....> #BDB76B
19----> yellow .....> #FFFF00
20----> lawngreen .....> #7CFC00
21----> chartreuse .....> #7FFF00
22----> limegreen .....> #32CD32
23----> lime .....> #00FF00
24----> forestgreen .....> #228B22
25----> green .....> #008000
26----> powderblue .....> #B0E0E6
27----> lightblue .....> #ADD8E6
28----> lightskyblue .....> #87CEFA
29----> skyblue .....> #87CEEB
30----> deepskyblue .....> #00B

### Scrape data from HTML tables into a DataFrame using BeautifulSoup and pandas

In [83]:
import pandas as pd

In [84]:
url = "https://en.wikipedia.org/wiki/World_population"

In [85]:
# request the webpage with a browser-like User-Agent and store the response in a variable called data (data.text holds the HTML)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/91.0.4472.124 Safari/537.36"
}

data  = requests.get(url, headers=headers)

In [87]:
soup = BeautifulSoup(data.text, 'html.parser')

In [88]:
tables = soup.find_all('table')

In [89]:
len(tables)

26

In [92]:
for index, table in enumerate(tables):
    if ("10 most densely populated countries" in str(table)):
        table_index = index
print(table_index)

5
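
Searching `str(table)` matches the phrase anywhere in the table, and `table_index` stays undefined if nothing matches. One possible alternative checks only the `<caption>` and returns `None` on a miss. The helper name `find_table_index` and the sample tables are hypothetical.

```python
from bs4 import BeautifulSoup

page = BeautifulSoup(
    "<table><caption>10 most populous countries</caption><tr><td>India</td></tr></table>"
    "<table><tr><td>no caption</td></tr></table>"
    "<table><caption>10 most densely populated countries</caption>"
    "<tr><td>Singapore</td></tr></table>",
    "html.parser",
)


def find_table_index(tables, caption_text):
    """Return the index of the first table whose caption contains caption_text, else None."""
    for index, table in enumerate(tables):
        caption = table.find("caption")
        if caption is not None and caption_text in caption.get_text():
            return index
    return None


tables = page.find_all("table")
table_index = find_table_index(tables, "10 most densely populated countries")
print(table_index)
```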


In [93]:
print(tables[table_index].prettify())

<table class="wikitable sortable" style="text-align:right">
 <caption>
  10 most densely populated countries
  <small>
   (with population above 5 million)
  </small>
  <sup class="reference" id="cite_ref-:10_106-0">
   <a href="#cite_note-:10-106">
    <span class="cite-bracket">
     [
    </span>
    101
    <span class="cite-bracket">
     ]
    </span>
   </a>
  </sup>
 </caption>
 <tbody>
  <tr>
   <th scope="col">
    Rank
   </th>
   <th scope="col">
    Country
   </th>
   <th scope="col">
    Population
   </th>
   <th scope="col">
    Area
    <br/>
    <small>
     (km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th scope="col">
    Density
    <br/>
    <small>
     (pop/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon nowrap">
     <span class="mw-image-border" typeof="mw:File">
      <span>
       <img alt="" class="mw-file-element" data-file-height="600

In [95]:
rows = []

for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # the header row contains only <th> cells, so col is empty for it
        rows.append({
            "Rank": col[0].text.strip(),
            "Country": col[1].text.strip(),
            "Population": col[2].text.strip(),
            "Area": col[3].text.strip(),
            "Density": col[4].text.strip(),
        })

# Build the DataFrame once instead of concatenating inside the loop
population_data = pd.DataFrame(rows, columns=["Rank", "Country", "Population", "Area", "Density"])

population_data

| | Rank | Country | Population | Area | Density |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5921231 | 719 | 8235 |
| 1 | 2 | Bangladesh | 165650475 | 148460 | 1116 |
| 2 | 3 | Palestine[note 3][102] | 5223000 | 6025 | 867 |
| 3 | 4 | Taiwan[note 4] | 23580712 | 35980 | 655 |
| 4 | 5 | South Korea | 51844834 | 99720 | 520 |
| 5 | 6 | Lebanon | 5296814 | 10400 | 509 |
| 6 | 7 | Rwanda | 13173730 | 26338 | 500 |
| 7 | 8 | Burundi | 12696478 | 27830 | 456 |
| 8 | 9 | Israel | 9402617 | 21937 | 429 |
| 9 | 10 | India | 1389637446 | 3287263 | 423 |
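
Every value scraped this way is a string, so numeric columns need converting before any arithmetic. A minimal sketch on three rows from the table above, assuming the scraped text may contain thousands separators (the commas here are added for illustration):

```python
import pandas as pd

population_data = pd.DataFrame(
    {
        "Country": ["Singapore", "Bangladesh", "South Korea"],
        "Population": ["5,921,231", "165,650,475", "51,844,834"],
        "Area": ["719", "148,460", "99,720"],
        "Density": ["8,235", "1,116", "520"],
    }
)

for name in ["Population", "Area", "Density"]:
    cleaned = population_data[name].str.replace(",", "", regex=False)  # drop separators
    population_data[name] = pd.to_numeric(cleaned)                     # strings -> integers

# Sanity check: recompute density from population and area.
population_data["Computed density"] = (
    population_data["Population"] / population_data["Area"]
).round().astype(int)

print(population_data.dtypes)
print(population_data)
```

Recomputing the density gives a quick check that the columns were parsed as intended.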


### Extracting Stock Data using a Python Library


In [None]:
!pip install yfinance

In [None]:
# use the yfinance library to extract stock data
