<a href="https://colab.research.google.com/github/adong-hood/cs200/blob/main/ch_6_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 6.4, Screen/Web Scraping



In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup


Note: Read Section 6.4 of Runestone first, then work through this notebook.

## Preliminary

<p>It helps to know some basic HTML elements, CSS selectors, and how web scraping works in general. </p>

<ul>
<li> <a href = "https://www.tutorialspoint.com/html/index.htm">HTML (id, class, parent-child)</a></li>
<li> <a href = "https://www.w3schools.com/html/html_css.asp">HTML CSS </a></li>
    <li> <a href = "https://www.topcoder.com/thrive/articles/web-scraping-with-beautiful-soup"> Web Scraping Using BeautifulSoup </a></li>    
</ul>

The file world_countries.csv was extracted by someone else from the [CIA website](https://www.cia.gov/the-world-factbook/countries/index.html), with each column coming from a separate web page.

In [None]:
wd = pd.read_csv('http://pluto.hood.edu/~dong/datasets/world_countries.csv')
wd.head()

In [None]:
wd.columns

In this exercise, we will perform similar web scraping to extract all the information on our own.

## Use Beautiful Soup

<a href = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/">Beautiful Soup </a> is a Python library for pulling data out of HTML and XML files.


### Quick Start
Beautiful Soup transforms a page into a BeautifulSoup object, a nested data structure.

Here’s an HTML document I’ll be using as an example throughout this document. It’s part of a story from Alice in Wonderland:

In [None]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""

In [None]:
html_doc

Running the “three sisters” document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

In [None]:
soup = BeautifulSoup(html_doc, 'html.parser')
pretty_html_doc = soup.prettify()
print(pretty_html_doc)

Here are some simple ways to navigate that data structure:

In [None]:
print(f'Title tag: {soup.title}\n')
# <title>The Dormouse's story</title>

print(f'Tag name: {soup.title.name}\n')
# 'title'

print(f'Tag text content: {soup.title.string}')
print(f'Tag text content: {soup.title.text}\n')
# 'The Dormouse's story'

print(f'Parent tag: {soup.title.parent}\n')
print(f'Parent tag name: {soup.title.parent.name}\n')
# 'head'

print(f'First paragraph tag:{soup.p}\n')
# <p class="title"><b>The Dormouse's story</b></p>

print(f"First paragraph class: {soup.p['class']}\n")
# ['title']  (class can hold several values, so Beautiful Soup returns a list)

print(f'First Link:{soup.a}\n')
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(f"All links: {soup.find_all('a')}\n")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(f"the url of the first Link: {soup.find_all('a')[0].get('href')}\n")


print(f'The link with specific id: {soup.find(id="link3")}\n')
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page’s <a> tags:

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))

Another common task is extracting all the text from a page:

In [None]:
print(soup.get_text())
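
The cells later in this notebook rely on <code>select</code>, which takes a CSS selector instead of a tag name. The following is a minimal sketch of three common selector forms, run on a shortened copy of the "three sisters" document so that the block stands on its own. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# A shortened copy of the "three sisters" document, so the block runs on its own.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# tag.class selector: every <a> element whose class is "sister"
sister_names = [a.text for a in soup.select('a.sister')]
print(sister_names)

# #id selector: select_one returns the first match, or None if nothing matches
link2_href = soup.select_one('#link2')['href']
print(link2_href)

# descendant selector: <a> elements anywhere inside <p class="story">
story_links = soup.select('p.story a')
print(len(story_links))
```

<code>select</code> always returns a list, even when there is a single match, which is why the later cells index into the result or loop over it.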

## Request a Web Page

### Use Beautiful Soup for World Factbook Pages

<p>Open a web page from the CIA web site. Use <code>requests</code>, the same way as you did in Section 6.3. </p>

In [None]:
address = 'https://www.cia.gov/the-world-factbook/countries/index.html'
res = requests.get(address)
print(res.status_code)
page = res.text
page[:500]


<p>However, the fill-in-the-blank answers in the Runestone book are based on the 2017 factbook rather than the 2021 data. </p>
<p>You may download all the 2017 web pages and scrape the information offline using the download link in the book. However, in this demo we will use this <a href = "http://pluto.hood.edu/~dong/factbook-2017/index.html">pseudo-CIA site</a>. Follow <a href = "http://pluto.hood.edu/~dong/cs200/datasets/factbook/">this link </a> for the folder structure, which is the same as if you were to download the factbook 2017 files yourself. </p>
<p>Please keep in mind that the processes for online and offline web scraping are almost the same. To work offline, read the downloaded file with <code>open</code> instead of <code>requests</code>; the file path in <code>open</code> is relative to your folder structure.</p>

<p>Open a web page, <code>notesanddefs.html</code>, for scraping. This CIA page serves as a table of contents for all other pages. Use <code>requests</code>, the same way as you did in Section 6.3. </p>

In [None]:
address = 'http://pluto.hood.edu/~dong/factbook-2017/docs/notesanddefs.html'
res = requests.get(address)
page = res.text
page[:300]
#res.status_code

In [None]:
page_content_notesanddefs = BeautifulSoup(page, 'html.parser')
print(page_content_notesanddefs.prettify()[42600:44500])

## Extract All Column Headers


<p>Open <code>page</code> in a text editor (not a browser), examine it, and figure out how to extract the HTML page link for each column in <code>world_countries.csv</code>.

Below is an excerpt from the page that contains the information for "Literacy." A similar structure is repeated for each column/field.</p>

In [None]:
literacy = '''
<a name="2103"></a>
				<div id="2103" name="2103">
					<li style="list-style-type: none; line-height: 20px; padding-bottom: 3px;" >
					<span style="padding: 2px; display:block; background-color:#F8f8e7;" class="category">
						<table width="100%" border="0" cellpadding="0" cellspacing="0" >
							<tr>
								<td style="width: 90%;" >Literacy</td>
                <td align="right" valign="middle">
											<a href="../fields/2103.html#139" title="Field info displayed for all countries in alpha order."> <img src="../graphics/field_listing_on.gif" border="0" style="padding:0px;" > </a>

								</td>
							</tr>
						</table>
					</span>
					<div id="data" class="category_data" style="width: 98%; font-weight: normal; background-color: #fff; padding: 5px; margin-left: 0px; border-top: 1px solid #ccc;" >
					<div class="category_data" style="text-transform:none">

						This entry includes a <em>definition</em> of literacy and Census Bureau percentages for the <em>total population</em>, <em>males</em>, and <em>females</em>. There are no universal definitions and standards of literacy. Unless otherwise specified, all rates are based on the most common definition - the ability to read and write at a specified age. Detailing the standards that individual countries use to assess the ability to read and write is beyond the scope of the <em>Factbook</em>. Information on literacy, while not a perfect measure of educational results, is probably the most easily available and valid for international comparisons. Low levels of literacy, and education in general, can impede the economic development of a country in the current rapidly changing, technology-driven world.</div>
				</div>
			</li>
			</div>
 '''

In [None]:
spans = page_content_notesanddefs.select("span.category") # return a list of span elements with class=category.
#print(spans[139].select('td')[0].text)
for aspan in spans:
    cells = aspan.select('td') # return all the td elements in one span element. There are two in the above segment.
    colname = cells[0].text
    print(colname)
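
The same loop can collect each field's link along with its name. The following is a sketch on a simplified stand-in for <code>notesanddefs.html</code>: the "Literacy" entry follows the excerpt above, while the "No link" entry and the name <code>field_links</code> are made up for illustration.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for notesanddefs.html. "Literacy" follows the excerpt above;
# "No link" is a made-up entry that has no field-listing link.
sample = """
<span class="category"><table><tr>
  <td>Literacy</td>
  <td><a href="../fields/2103.html#139"><img src="x.gif"></a></td>
</tr></table></span>
<span class="category"><table><tr>
  <td>No link</td>
  <td></td>
</tr></table></span>
"""
soup = BeautifulSoup(sample, 'html.parser')

field_links = {}  # field name -> relative path of its field-listing page
for span in soup.select('span.category'):
    cells = span.select('td')
    if len(cells) < 2:
        continue  # not a name/link pair
    link = cells[1].select_one('a')
    if link is not None:  # some fields may have no link at all
        field_links[cells[0].text.strip()] = link['href']

print(field_links)
```

On the real page you would loop over <code>page_content_notesanddefs.select("span.category")</code> instead of the sample string.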

## Extract Data for the <code>Literacy</code> Column

In [None]:
cols = page_content_notesanddefs.select("span.category") # a list of span elements with class=category
for col in cols:
    cells = col.select('td') # all the td elements in one span element; there are two in the excerpt above
    colname = cells[0].text
    if colname == 'Literacy':
        links = cells[1].select('a') # all a elements in the second td element; there is one
        print(links)
        if len(links) > 0:
            fpath = links[0]['href'] # the href attribute holds the relative path ../fields/2103.html#139
            print("Field name:", colname, "\nFile Path:", fpath)
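
The path stored in <code>fpath</code> is relative to the page it was found on. One way to turn it into a full address, instead of typing the address by hand, is <code>urljoin</code> from the standard library. This is a small sketch; the variable names <code>base</code>, <code>full_url</code>, and <code>page_url</code> are only for illustration.

```python
from urllib.parse import urljoin, urldefrag

# Address of the page the link was found on (the notesanddefs.html page loaded earlier)
base = 'http://pluto.hood.edu/~dong/factbook-2017/docs/notesanddefs.html'
# Relative href extracted from the Literacy entry
fpath = '../fields/2103.html#139'

full_url = urljoin(base, fpath)            # resolves "../" against the base address
page_url, fragment = urldefrag(full_url)   # splits off the "#139" anchor

print(full_url)
print(page_url)
print(fragment)
```

The part after <code>#</code> only names a position inside the page, so either <code>full_url</code> or <code>page_url</code> can be passed to <code>requests.get</code>.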

Let's now open and extract Literacy data from this 2103.html file.


In [None]:
address = 'http://pluto.hood.edu/~dong/factbook-2017/fields/2103.html#139'
res = requests.get(address)
page_literacy = res.text
page_content_literacy = BeautifulSoup(page_literacy)
#print(page_content_literacy.prettify())


In [None]:
'''
<tr style="background: #EEEEEE" id="ch">
<td class=country><a href=../geos/ch.html>China</td>
<td class=fieldData>
<strong>definition: </strong>age 15 and over can read and write<br />
<strong>total population: </strong>96.4%<br />
<strong>male: </strong>98.2%<br />
<strong>female: </strong>94.5% (2015 est.)<br />
</td>
</tr>
<tr id=co>
<td class=country><a href=../geos/co.html>Colombia</td>
<td class=fieldData>
<strong>definition: </strong>age 15 and over can read and write<br />
<strong>total population: </strong>94.2%<br />
<strong>male: </strong>94.1%<br />
<strong>female: </strong>94.4% (2015 est.)<br />
</td>
</tr>
'''

### Build the data frame with Country column.

All the files use a 2-letter country code. We will use it to combine information extracted from multiple web pages. We will also extract all country names.

In [None]:
dict_country_name = {}
cols = page_content_literacy.select("td.country")
#print(str(cols[20]).strip().split('../geos/')[1][:2])
for col in cols:
    code = str(col).strip().split('../geos/')[1][:2] # chained into one step. You can split into multiple steps to see how it works.
    name =  col.text
    dict_country_name[code] = name

dict_country_name
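
Slicing the serialized tag works because every link has the same shape. An alternative is to match the code with a regular expression, which returns <code>None</code> instead of a wrong slice when a link looks different. This is a sketch that assumes links of the form <code>../geos/xx.html</code>; the helper <code>country_code_from_href</code> is hypothetical and not part of any library.

```python
import re

# Hypothetical helper pattern: the two lowercase letters before ".html" in a geos link
GEOS_CODE = re.compile(r'geos/([a-z]{2})\.html')

def country_code_from_href(href):
    """Return the two-letter code in a geos link, or None if the link has another shape."""
    match = GEOS_CODE.search(href)
    return match.group(1) if match else None

print(country_code_from_href('../geos/ch.html'))
print(country_code_from_href('../fields/2103.html'))
```

Inside the loop above, this could be called as <code>country_code_from_href(col.a['href'])</code>, assuming every <code>td.country</code> cell contains an <code>a</code> tag.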


In [None]:
#  all_data = {'field name' : {country_code : value} ...}
# So far we have only extracted the country names, so the resulting data frame has a single column.

all_data = {"Country": dict_country_name}
all_data_df = pd.DataFrame(all_data)

# add code as one common column so that we can easily add new columns later
all_data_df['Code'] = all_data_df.index
all_data_df.head()


### Extract and add Literacy Data

In [None]:
'''
<tr style="background: #EEEEEE" id="ch">
<td class=country><a href=../geos/ch.html>China</td>
<td class=fieldData>
<strong>definition: </strong>age 15 and over can read and write<br />
<strong>total population: </strong>96.4%<br />
<strong>male: </strong>98.2%<br />
<strong>female: </strong>94.5% (2015 est.)<br />
</td>
</tr>
<tr id=co>
<td class=country><a href=../geos/co.html>Colombia</td>
<td class=fieldData>
<strong>definition: </strong>age 15 and over can read and write<br />
<strong>total population: </strong>94.2%<br />
<strong>male: </strong>94.1%<br />
<strong>female: </strong>94.4% (2015 est.)<br />
</td></tr>
'''

There is more than one way to extract the literacy data. This is one of them.

In [None]:
dict_literacy = {}
rows = page_content_literacy.select("#fieldListing tr")
for row in rows:
    parts = str(row).split('/geos/')
    cells = row.select('td')
    # skip header rows and any row without a country link or a data cell
    if len(parts) < 2 or len(cells) < 2:
        continue
    code = parts[1][:2]  # two-letter country code taken from the link
    dict_literacy[code] = cells[1].text.strip()  # full text of the fieldData cell
dict_literacy

In [None]:
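dict_literacy = {}
rows = page_content_literacy.select("#fieldListing tr")
for row in rows:
    parts = str(row).split('/geos/')
    cells = row.select('td')
    # skip header rows and any row without a country link or a data cell
    if len(parts) < 2 or len(cells) < 2:
        continue
    code = parts[1][:2]
    # assumes the cell starts with a "definition:" line followed by "total population:",
    # so the text after the second ':' begins with the total rate, e.g. " 96.4%"
    total_rate = float(cells[1].text.strip().split(':')[2].split('%')[0].strip())
    dict_literacy[code] = total_rate
dict_literacy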
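
The chained <code>split</code> calls depend on every cell listing its entries in the same order. A regular expression that looks for the "total population" label directly is one possible alternative. This is a sketch under the assumption that the cell text looks like the excerpt above; the helper <code>parse_total_rate</code> is hypothetical.

```python
import re

# Matches "total population: 96.4%" and captures the number
TOTAL_RATE = re.compile(r'total population:\s*(\d+(?:\.\d+)?)\s*%')

def parse_total_rate(field_text):
    """Return the total-population rate as a float, or None if it is not present."""
    match = TOTAL_RATE.search(field_text)
    return float(match.group(1)) if match else None

# Cell text modeled on the China row in the excerpt above
sample = """definition: age 15 and over can read and write
total population: 96.4%
male: 98.2%
female: 94.5% (2015 est.)"""

print(parse_total_rate(sample))
print(parse_total_rate('NA'))
```

Returning <code>None</code> for cells without a rate keeps the loop from stopping with an error on a single unusual row.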




Add literacy as a new column using <code>map</code>.

In [None]:
all_data_df['Literacy']= all_data_df['Code'].map(dict_literacy)
all_data_df.head()
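
<code>map</code> looks up each value of the <code>Code</code> column in the dictionary, so the order of the dictionary does not matter, and codes without an entry become <code>NaN</code>. The following is a small sketch on toy data; the code <code>xx</code> and the names <code>df</code> and <code>rates</code> are made up for illustration.

```python
import pandas as pd

# Toy data: two of the three codes have a literacy value (values from the excerpt above);
# 'xx' is a made-up code with no entry in the dictionary.
df = pd.DataFrame({'Code': ['ch', 'co', 'xx'],
                   'Country': ['China', 'Colombia', 'Placeholder']})
rates = {'co': 94.2, 'ch': 96.4}

# map looks each Code up in the dictionary; codes without an entry become NaN
df['Literacy'] = df['Code'].map(rates)
print(df)

# count the rows that did not receive a value
n_missing = int(df['Literacy'].isna().sum())
print(n_missing)
```

Checking the number of missing values after each new column is a quick way to notice country codes that did not match.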



## Start your work for Section 6.4 ...
You first need to find which HTML page to use, then extract the infant mortality rate from that page, and finally clean and convert the total rate to numbers before doing any calculation.