# Web Scraping


This lab uses Python and several Python libraries. The cell below installs them; uncomment the lines and run the cell if the libraries are not already installed.


In [None]:
# %pip install bs4
# %pip install lxml
# %pip install html5lib
# %pip install requests

Import the required modules and functions:


In [None]:
from bs4 import BeautifulSoup  # this module helps with web scraping
import requests  # this module helps us download a web page

<h2 id="BSO">Beautiful Soup Objects</h2>


Beautiful Soup is a Python library for pulling data out of HTML and XML files; we will focus on HTML files. It does this by representing the HTML as a set of objects with methods for parsing it. We can navigate the HTML as a tree, filter out what we are looking for, or both.

Consider the following HTML:


In [6]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

We can store it as a string in the variable <code>html</code>:


In [7]:
html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

To parse a document, pass it into the <code>BeautifulSoup</code> constructor. The resulting <code>BeautifulSoup</code> object represents the document as a nested data structure:


In [8]:
soup = BeautifulSoup(html, "html.parser")

First, the document is converted to Unicode (a character standard that includes ASCII as a subset), and HTML entities are converted to Unicode characters. Beautiful Soup then transforms the HTML document into a tree of Python objects. Besides the <code>BeautifulSoup</code> object itself, the tree contains other types of objects. In this lab, we will cover <code>BeautifulSoup</code> and <code>Tag</code> objects, which for the purposes of this lab can be treated as identical, and <code>NavigableString</code> objects.
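
As a rough illustration of these object types, the sketch below parses a shortened version of the HTML above and inspects a single element. The variable names `tag` and `text` are only illustrative.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Shortened version of the lab's HTML, kept inline so the example is self-contained.
html = "<html><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p></body></html>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.b       # the first <b> element in the document, a Tag object
text = tag.string  # the text inside that tag, a NavigableString object

print(type(tag).__name__, tag.name, tag["id"])  # Tag b boldest
print(type(text).__name__, str(text))           # NavigableString Lebron James
print(tag.parent.name)                          # h3 (one step up the tree)
```

A `Tag` exposes its name, its attributes (with dictionary-style access) and its neighbours in the tree, while a `NavigableString` holds only the text.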


We can use the method <code>prettify()</code> to display the HTML in the nested structure:


In [9]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>



<h2 id="DSCW">Downloading And Scraping The Contents Of A Web Page</h2> 


We start with the URL of the web page we want to download:


In [31]:
url = "https://www.cvut.cz/"

We use <code>requests.get</code> to download the web page and store its contents as text in a variable called <code>data</code>:


In [32]:
data  = requests.get(url).text 
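
The one-line download above does not check whether the request succeeded. A slightly more defensive variant might look like the sketch below; the helper name `fetch_html`, its `session` parameter and the 10-second timeout are assumptions made for illustration, not part of the lab.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Download a page and return its text, raising on HTTP errors.

    `session` is a hypothetical hook: pass a requests.Session to reuse
    connections, or leave it as None to use the module-level requests.get.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)  # give up if the server is too slow
    response.raise_for_status()                # raise requests.HTTPError on 4xx/5xx
    return response.text


# Example call (requires network access):
# data = fetch_html("https://www.cvut.cz/")
```

Failing loudly on a bad status code avoids handing an error page to the parser by mistake.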

We create a <code>BeautifulSoup</code> object using the <code>BeautifulSoup</code> constructor:


In [34]:
soup = BeautifulSoup(data,"html.parser")  # create a soup object using the variable 'data'

## Scrape all links


In [35]:
for link in soup.find_all('a', href=True):  # in HTML, an anchor/link is represented by the tag <a>
    print(link.get('href'))


https://www.cvut.cz/
/
/
/en
/
/vitejte-na-cvut
/proc-studovat-na-cvut
/veda-a-vyzkum-na-cvut
/partneri-cvut
/reskript-cisare-josefa-i
https://results.cvut.cz/lang/cs
https://www.cvut.cz/ukrajina
http://aktualne.cvut.cz/aktuality
https://aktualne.cvut.cz/aktuality/20231020-beh-17-listopadu-v-obore-hvezda
https://aktualne.cvut.cz/aktuality/20231019-vedeni-cvut-v-praze-odsuzuje-utok-na-izrael-a-nabizi-pomoc
https://aktualne.cvut.cz/aktuality/20231019-pozvanka-na-kolokvium-s-bohdanem-zronkem-z-cez
https://aktualne.cvut.cz/aktuality/20231018-informacni-centrum-zavreno-patek-20-rijna-2023
https://aktualne.cvut.cz/aktuality/20231013-collabothon-2023
https://aktualne.cvut.cz/aktuality/20231002-umeni-za-hudbou-v-betlemske-kapli
https://aktualne.cvut.cz/aktuality/20230929-nova-vystava-v-galerii-jaroslava-fragnera
https://aktualne.cvut.cz/aktuality/20230922-1-reprezentacni-ples-fakulty-elektrotechnicke-cvut
https://aktualne.cvut.cz/aktuality/20230907-zimni-lyzarske-kurzy-2024
https://aktualne.cv
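
Several of the links in the output are relative (for example `/en`). One way to turn them into absolute URLs is `urllib.parse.urljoin` from the standard library. The sketch below uses a small inline HTML snippet instead of the live page, and the names `sample`, `soup_links` and `absolute_links` are illustrative.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.cvut.cz/"

# Small stand-in for the downloaded page: one relative link, one absolute
# link, and one <a> tag without an href attribute.
sample = (
    '<a href="/en">English</a>'
    '<a href="https://results.cvut.cz/lang/cs">Results</a>'
    '<a>no href</a>'
)
soup_links = BeautifulSoup(sample, "html.parser")

# href=True skips anchors that have no href attribute at all.
absolute_links = [
    urljoin(base_url, a["href"]) for a in soup_links.find_all("a", href=True)
]
print(absolute_links)
# ['https://www.cvut.cz/en', 'https://results.cvut.cz/lang/cs']
```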

## Scrape all image tags


In [36]:
for img in soup.find_all('img'):  # in HTML, an image is represented by the tag <img>
    print(img)
    print(img.get('src'))

<img height="1" src="https://www.facebook.com/tr?id=442735193546618&amp;ev=PageView&amp;noscript=1" style="display:none" width="1"/>
https://www.facebook.com/tr?id=442735193546618&ev=PageView&noscript=1
<img alt="" src="https://www.cvut.cz/sites/all/themes/cvut/logo-cs.svg" title="Go to homepage"/>
https://www.cvut.cz/sites/all/themes/cvut/logo-cs.svg
<img alt="Banner" src="https://www.cvut.cz/sites/default/files/content/37112eb8-8a91-49e1-8ce2-77aa35e163c6/cs/deb576fc-fd0e-4fc2-a044-29450c75cc6e.svgz" title="Informace k situaci na Ukrajině"/>
https://www.cvut.cz/sites/default/files/content/37112eb8-8a91-49e1-8ce2-77aa35e163c6/cs/deb576fc-fd0e-4fc2-a044-29450c75cc6e.svgz
<img alt="" height="480" src="https://aktualne.cvut.cz/sites/aktualne/files/styles/large/public/content/756db436-7aa2-4803-827e-a2d4d2565448/e8f14e38-8695-4d72-9d65-8fa708fc8a81.jpg?itok=d6Mbi0M0" width="720"><div class="carousel-caption">
<div class="pub-date"></div>
<h3><span class="field-content"><a href="https://ak

## Scrape data from HTML tables


In [55]:
# The URL below contains an HTML table with data about the team.
url = "http://www.robostav.cz/robostav-team"

Before scraping a website, examine its contents and the way the data is organized. Open the URL above in your browser and check how many rows and columns the team table has.


In [56]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

In [58]:
soup = BeautifulSoup(data,"html.parser")

In [59]:
# find the first HTML table in the web page
table = soup.find('table')  # in HTML, a table is represented by the tag <table>
table

<table cellpadding="4">
<tr>
<td align="center"><font class="small_font_em"><b></b></font></td>
<td align="center"><font class="small_font_em">Pozice</font></td>
<td align="center"><font class="small_font_em">Jméno člena týmu</font></td>
<td align="center"><font class="small_font_em">Pracoviště</font></td>
</tr>
<tr>
<td valign="top"><img class="img_ramka" height="150px" src="http://www.robostav.cz/cvut/robostav/img/tym/usm.jpg" width="111px"/></td>
<td valign="top">Vědecko-výzkumný pracovník<br/>Odborný asistent<br/>Programátor průmyslových robotů<br/></td>
<td valign="top"><b>Ing. Vjačeslav Usmanov, Ph.D.</b></td>
<td valign="top">ČVUT v Praze, Fakulta stavební<br/>K122 - Katedra technologie staveb, B482<br/>Telefon (fakultní): (+420) 224 35 3981
<br/>E-mail: <b><a href="mailto:vyacheslav.usmanov@fsv.cvut.cz">vyacheslav.usmanov@fsv.cvut.cz</a></b>
<br/>www: <b><a href="http://technologie.fsv.cvut.cz/clenove-katedry/vjaceslav-usmanov" target="_blank">web katedry K122</a></b>
</td>
</t

In [69]:
# Get all rows from the table
for row in table.find_all('tr'):  # in HTML, a table row is represented by the tag <tr>
    # Get all cells (columns) in each row
    cols = row.find_all('td')  # in HTML, a table cell is represented by the tag <td>
    team_name = cols[2].string  # the third cell holds the team member's name
    team_position = " ".join(cols[1].strings)  # the second cell holds the position(s), separated by <br/> tags
    print("{}--->{}".format(team_name, team_position))

Jméno člena týmu--->Pozice
Ing. Vjačeslav Usmanov, Ph.D.--->Vědecko-výzkumný pracovník Odborný asistent Programátor průmyslových robotů
Ing. Rostislav Šulc, Ph.D.--->Hlavní koordinátor projektu Odborný asistent
Ing. Michal Kovářík--->Vědecko-výzkumný pracovník Doktorand Specialista na 3D tisk
Ing. Jan Illetško--->Vědecko-výzkumný pracovník Doktorand
doc. Ing. Pavel Svoboda, CSc.--->Vedoucí katedry technologie staveb
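
The loop above assumes that every row has at least three cells. A more defensive sketch, run here on a small made-up table rather than the live page, skips the header row and any row that is too short. The sample HTML and the names `sample_table`, `table_soup` and `members` are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up table with the same column order as the team table:
# photo, position(s), name. The last row is deliberately too short.
sample_table = """
<table>
<tr><td></td><td>Position</td><td>Name</td></tr>
<tr><td><img src="a.jpg"/></td><td>Researcher<br/>Lecturer<br/></td><td><b>Jane Doe</b></td></tr>
<tr><td colspan="3">footer note</td></tr>
</table>
"""
table_soup = BeautifulSoup(sample_table, "html.parser")

members = []
for row in table_soup.find("table").find_all("tr")[1:]:  # [1:] skips the header row
    cols = row.find_all("td")
    if len(cols) < 3:
        continue  # not a data row, so there is nothing to extract
    name = cols[2].get_text(strip=True)
    position = ", ".join(cols[1].stripped_strings)  # one string per <br/>-separated line
    members.append({"name": name, "position": position})

print(members)
# [{'name': 'Jane Doe', 'position': 'Researcher, Lecturer'}]
```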


## Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas


In [70]:
import pandas as pd

In [71]:
# The URL below contains HTML tables with data about world population.
url = "https://en.wikipedia.org/wiki/World_population"

Before scraping a website, examine its contents and the way the data is organized. Open the URL above in your browser and check the tables on the web page.


In [72]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

In [73]:
soup = BeautifulSoup(data,"html.parser")

In [74]:
# find all HTML tables in the web page
tables = soup.find_all('table')  # in HTML, a table is represented by the tag <table>

In [75]:
# we can see how many tables were found by checking the length of the tables list
len(tables)

29

Assume that we are looking for the `10 most densely populated countries` table. We can look through the `tables` list and pick the right one based on the data in each table, or we can search for the table's name if it appears in the table, although this option might not always work.


In [76]:
for index,table in enumerate(tables):
    if ("10 most densely populated countries" in str(table)):
        table_index = index
print(table_index)

7
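
Searching `str(table)` matches the phrase anywhere in the table, including ordinary cells. A narrower alternative is to compare against the `<caption>` element only. The sketch below assumes the table carries its name in a caption; it uses a small made-up page, and `find_table_by_caption` is a hypothetical helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Made-up page: only the second table has the phrase in its caption;
# the third table merely mentions it in a cell.
sample_page = """
<table><caption>Largest cities</caption><tr><td>A</td></tr></table>
<table><caption>10 most densely populated countries <small>(with population above 5 million)</small></caption><tr><td>B</td></tr></table>
<table><tr><td>10 most densely populated countries are listed above</td></tr></table>
"""
page_soup = BeautifulSoup(sample_page, "html.parser")


def find_table_by_caption(tables, phrase):
    """Return the index of the first table whose caption contains phrase, else None."""
    for index, table in enumerate(tables):
        caption = table.find("caption")
        if caption is not None and phrase in caption.get_text():
            return index
    return None


sample_tables = page_soup.find_all("table")
print(find_table_by_caption(sample_tables, "10 most densely populated countries"))  # 1
print(find_table_by_caption(sample_tables, "no such table"))                        # None
```

Returning `None` when nothing matches makes a missing table explicit instead of leaving the index variable undefined.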


See if you can locate the name of the table, `10 most densely populated countries`, in the output below.


In [77]:
print(tables[table_index].prettify())

<table class="wikitable sortable" style="text-align:right">
 <caption>
  10 most densely populated countries
  <small>
   (with population above 5 million)
  </small>
  <sup class="reference" id="cite_ref-:10_106-0">
   <a href="#cite_note-:10-106">
    [102]
   </a>
  </sup>
 </caption>
 <tbody>
  <tr>
   <th scope="col">
    Rank
   </th>
   <th scope="col">
    Country
   </th>
   <th scope="col">
    Population
   </th>
   <th scope="col">
    Area
    <br/>
    <small>
     (km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th scope="col">
    Density
    <br/>
    <small>
     (pop/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <span class="mw-image-border" typeof="mw:File">
      <span>
       <img alt="" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/4

In [80]:
# Collect one dict per table row and build the DataFrame once at the end.
# (DataFrame.append was removed in pandas 2.0, and _append is a private method.)
rows = []

for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # the header row contains only <th> cells, so it is skipped
        rows.append({
            "Rank": col[0].text,
            "Country": col[1].text,
            "Population": col[2].text.strip(),
            "Area": col[3].text.strip(),
            "Density": col[4].text.strip(),
        })

population_data = pd.DataFrame(rows, columns=["Rank", "Country", "Population", "Area", "Density"])

population_data

|   | Rank | Country | Population | Area | Density |
|---|------|---------|------------|------|---------|
| 0 | 1 | Singapore | 5921231 | 719 | 8235 |
| 1 | 2 | Bangladesh | 165650475 | 148460 | 1116 |
| 2 | 3 | \n Palestine[note 3][103]\n\n | 5223000 | 6025 | 867 |
| 3 | 4 | Taiwan[note 4] | 23580712 | 35980 | 655 |
| 4 | 5 | South Korea | 51844834 | 99720 | 520 |
| 5 | 6 | Lebanon | 5296814 | 10400 | 509 |
| 6 | 7 | Rwanda | 13173730 | 26338 | 500 |
| 7 | 8 | Burundi | 12696478 | 27830 | 456 |
| 8 | 9 | India | 1389637446 | 3287263 | 423 |
| 9 | 10 | Netherlands | 17400824 | 41543 | 419 |
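
The scraped values are plain strings, and some country names still carry whitespace and footnote markers such as `[note 3]`. A possible cleanup step is sketched below on a three-row sample. The frame `raw` is a stand-in for `population_data`, and it assumes the scraped numbers contain thousands separators, which may not match the live page.

```python
import pandas as pd

# Stand-in for a few scraped rows; the comma-separated numbers are an assumption.
raw = pd.DataFrame({
    "Country": ["Singapore", "\n Palestine[note 3][103]\n\n", "Taiwan[note 4]"],
    "Population": ["5,921,231", "5,223,000", "23,580,712"],
})

cleaned = raw.copy()

# Remove bracketed footnote markers, then trim the surrounding whitespace.
cleaned["Country"] = (
    cleaned["Country"].str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()
)

# Drop thousands separators and convert the text to integers.
cleaned["Population"] = pd.to_numeric(
    cleaned["Population"].str.replace(",", "", regex=False)
)

print(cleaned)
```

With numeric columns, operations such as sorting by population compare values rather than strings.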
