# Extracting Data from Websites

References:
- HTML: https://www.w3schools.com/html/html_attributes.asp
- urllib3: https://pypi.org/project/urllib3/
- Beautiful Soup: https://beautiful-soup-4.readthedocs.io/en/latest/

In [32]:
import bs4
import urllib3
import certifi
import numpy as np
import pandas as pd

## What is HTML?

- HTML stands for HyperText Markup Language
- HTML describes the **structure** of a Web page
- HTML consists of a series of **elements**

Below is an example of the structure of a simple HTML page:

```
<html>
    <head>
        <title>Page title</title>
    </head>
    <body>
        <h1>This is a heading </h1>
        <p>This is a paragraph.</p>
        <p>This is another paragraph.</p>
        <a href="https://bfi.uchicago.edu/">This is a link</a>
        <table>
            <tr>
                <th>Column 1</th>
                <th>Column 2</th>
            </tr>
            <tr>
                <td>Value 1</td>
                <td>Value 2</td>
            </tr>
        </table> 
    </body>
</html> 
```
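
To make the nesting concrete, here is a minimal sketch that walks a shortened copy of the page above with Python's standard-library `html.parser` and prints each opening tag indented by its depth. The class name `OutlineParser` and the `SAMPLE` string are illustrative assumptions, not part of the original notebook.

```python
from html.parser import HTMLParser

# Shortened copy of the example page (assumption: trimmed for brevity)
SAMPLE = """
<html>
    <head><title>Page title</title></head>
    <body>
        <h1>This is a heading</h1>
        <p>This is a paragraph.</p>
        <a href="https://bfi.uchicago.edu/">This is a link</a>
    </body>
</html>
"""


class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag, attributes) tuples

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = OutlineParser()
parser.feed(SAMPLE)
for depth, tag, attrs in parser.outline:
    # Indent by four spaces per nesting level; show attributes if any
    print("    " * depth + tag, attrs if attrs else "")
```

Each element is an opening tag, optional attributes (such as `href`), content, and a closing tag; elements nested inside another element are its children.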

## How do we get the HTML file?

We need an **HTTP client**, such as `urllib3`.

In [33]:
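# Create a connection pool that verifies TLS certificates against certifi's CA bundle
pm = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
myurl = "https://www.aviationweather.gov/metar/data?ids=KORD&format=raw&date=0&hours=0"
# Send a GET request; .data holds the response body as bytes
html = pm.urlopen(url=myurl, method="GET").data
# html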
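
The part of the URL after `?` is a query string made of `name=value` pairs. As a small sketch, the same URL can be assembled with the standard-library `urllib.parse.urlencode` instead of being typed by hand. The helper name `build_metar_url` is hypothetical, and the parameter values are simply copied from the URL used here.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.aviationweather.gov/metar/data"


def build_metar_url(airport_code):
    """Return the METAR page URL for one airport (hypothetical helper)."""
    # Same parameters as the hand-written URL; only the airport code varies
    params = {"ids": airport_code, "format": "raw", "date": 0, "hours": 0}
    return BASE_URL + "?" + urlencode(params)


print(build_metar_url("KORD"))
```

Building the query string this way also takes care of escaping characters that are not allowed in a URL.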

## How do we parse an HTML file?

We need an **HTML Parser**, such as `BeautifulSoup`.

In [34]:
# First, make the soup

soup = bs4.BeautifulSoup(html, features='lxml')
# soup

Here are some simple ways to navigate that data structure:

In [35]:
print(soup.title)
print(type(soup.title))
print(soup.title.text)
print(type(soup.title.text))

<title>AWC - METeorological Aerodrome Reports (METARs)</title>
<class 'bs4.element.Tag'>
AWC - METeorological Aerodrome Reports (METARs)
<class 'str'>


In [36]:
soup.title.parent

<head>
<!--[if lt IE 9]>
<meta http-equiv="X-UA-Compatible" content="IE=8" />
<![endif]-->
<title>AWC - METeorological Aerodrome Reports (METARs)</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="900" http-equiv="Refresh"/>
<meta content="en-us" name="DC.language" scheme="DCTERMS.RFC1766"/>
<meta content="Aviation Weather Center Homepage provides comprehensive user-friendly aviation weather Text products and graphics." name="description"/>
<meta content="aviation, weather, icing, turbulence, convection, pirep, metar, taf, airmet, sigmet, satellite, radar, surface, wind, temperature, aloft, airplane, NEXRAD, GOES, WSR-88D, precipitation, rain, snow, sleet, thunderstorm, en-route, prognosis, chart" name="keywords"/>
<meta content="AWC - Aviation Weather Center" name="DC.title"/>
<meta content="Aviation Weather Center Home Page ... METARs Page" name="DC.description"/>
<meta content="NOAA's National Weather Service - Aviation Weather Center Homepag

In [37]:
print(soup.title.parent)
print(type(soup.title.parent))

<head>
<!--[if lt IE 9]>
<meta http-equiv="X-UA-Compatible" content="IE=8" />
<![endif]-->
<title>AWC - METeorological Aerodrome Reports (METARs)</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="900" http-equiv="Refresh"/>
<meta content="en-us" name="DC.language" scheme="DCTERMS.RFC1766"/>
<meta content="Aviation Weather Center Homepage provides comprehensive user-friendly aviation weather Text products and graphics." name="description"/>
<meta content="aviation, weather, icing, turbulence, convection, pirep, metar, taf, airmet, sigmet, satellite, radar, surface, wind, temperature, aloft, airplane, NEXRAD, GOES, WSR-88D, precipitation, rain, snow, sleet, thunderstorm, en-route, prognosis, chart" name="keywords"/>
<meta content="AWC - Aviation Weather Center" name="DC.title"/>
<meta content="Aviation Weather Center Home Page ... METARs Page" name="DC.description"/>
<meta content="NOAA's National Weather Service - Aviation Weather Center Homepag

In [38]:
soup.p

<p clear="both">
<strong>Data at: 1743 UTC 18 Aug 2022</strong></p>

In [39]:
soup.find_all('p')

[<p clear="both">
 <strong>Data at: 1743 UTC 18 Aug 2022</strong></p>,
 <p style="text-align: center; font-size: 10px; color: #1150a0">
     Page loaded: 
   <a href="https://www.time.gov">17:43 UTC</a>  |  
   10:43 AM  Pacific  |  
   11:43 AM  Mountain  |  
   12:43 PM  Central  |  
   01:43 PM  Eastern
   </p>]

In [40]:
soup.find_all('a')

[<a href="https://www.noaa.gov" title="Visit the NOAA website"><img alt="NOAA Logo" border="0" src="/images/layout/noaa_logo.png"/></a>,
 <a href="https://www.weather.gov" title="Visit the NWS website"><img alt="NWS Logo" border="0" src="/images/layout/nws_logo.png"/></a>,
 <a href="/">AVIATION WEATHER CENTER</a>,
 <a href="https://www.noaa.gov">N O A A</a>,
 <a href="https://www.weather.gov">N A T I O N A L   W E A T H E R   S E R V I C E</a>,
 <a href="https://www.commerce.gov">
 <div class="awc_headerright" title="Visit Department of Commerce website">
 </div> <!-- /awc_headerright -->
 </a>,
 <a href="/">AWC Home</a>,
 <a href="https://www.ncep.noaa.gov/">NCEP Home</a>,
 <a href="/">Aviation (AWC)</a>,
 <a href="https://www.cpc.ncep.noaa.gov">Climate (CPC)</a>,
 <a href="https://www.nhc.noaa.gov">Hurricane (NHC)</a>,
 <a href="https://www.emc.ncep.noaa.gov">Modeling (EMC)</a>,
 <a href="https://ocean.weather.gov">Ocean (OPC)</a>,
 <a href="https://www.nco.ncep.noaa.gov">Operations 

In [41]:
soup.find_all("p", clear="both")

[<p clear="both">
 <strong>Data at: 1743 UTC 18 Aug 2022</strong></p>]

In [42]:
tag_list = soup.find_all("p", clear="both")
tag = tag_list[0]
print(tag)
print(tag.next_sibling)
print(tag.next_sibling.next_sibling)
print(tag.next_sibling.next_sibling.next_sibling)
print(tag.next_sibling.next_sibling.next_sibling.next_sibling)

<p clear="both">
<strong>Data at: 1743 UTC 18 Aug 2022</strong></p>


 Data starts here 


<code>KORD 181651Z 27003KT 10SM FEW055 FEW250 27/12 A3003 RMK AO2 SLP164 T02670117</code>
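
Several `.next_sibling` hops are needed here because whitespace between tags and HTML comments count as siblings too. The sketch below reproduces this on a small made-up snippet (the `snippet` string is an assumption modelled on the output above, and the built-in `html.parser` backend is used so that `lxml` is not required). It also shows `find_next_sibling`, which skips straight to the next matching tag.

```python
import bs4

# Made-up snippet with the same shape as the page fragment above
snippet = (
    '<div><p clear="both"><strong>Data at: 1743 UTC</strong></p>\n'
    '<!-- Data starts here -->\n'
    '<code>KORD 181651Z 27003KT</code></div>'
)
mini_soup = bs4.BeautifulSoup(snippet, features="html.parser")
p = mini_soup.find("p", clear="both")

first = p.next_sibling        # the newline between </p> and the comment
second = first.next_sibling   # the HTML comment node
print(repr(first))
print(type(second))

# Jump directly to the next <code> sibling, ignoring text and comments
code = p.find_next_sibling("code")
print(code.text)
```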


In [43]:
soup.find_all("code")

[<code>KORD 181651Z 27003KT 10SM FEW055 FEW250 27/12 A3003 RMK AO2 SLP164 T02670117</code>]

In [44]:
code_tags = soup.find_all("code")
weather_code = code_tags[0]
weather_code.text

'KORD 181651Z 27003KT 10SM FEW055 FEW250 27/12 A3003 RMK AO2 SLP164 T02670117'

Now, let's put the code we wrote into a function:

In [45]:
def get_current_weather(code, pm):
    '''
    Get current weather at a specific airport.
    
    Inputs: 
        code (str): airport code
        pm (urllib3.PoolManager): connection pool used to send the request
    
    Outputs:
        data (str): weather info from the website
    '''
    myurl = "https://www.aviationweather.gov/metar/data?ids=" + \
        code + "&format=raw&date=0&hours=0"
    html = pm.urlopen(url=myurl, method="GET").data
    soup = bs4.BeautifulSoup(html, features='lxml')
    code_tags = soup.find_all("code")
    weather_code = code_tags[0]
    data = weather_code.text
    return data

In [46]:
pm = urllib3.PoolManager(cert_reqs='CERT_REQUIRED',
                         ca_certs=certifi.where())

airports = ["KORD", "KMDW", "KSFO", "KLAX", "KATL"]
data_dict = dict()
for airport in airports:
    data_dict[airport] = get_current_weather(airport, pm)
data_dict

{'KORD': 'KORD 181651Z 27003KT 10SM FEW055 FEW250 27/12 A3003 RMK AO2 SLP164 T02670117',
 'KMDW': 'KMDW 181653Z 00000KT 10SM FEW050 FEW250 28/13 A3003 RMK AO2 SLP159 T02780128',
 'KSFO': 'KSFO 181656Z 33008KT 10SM FEW006 17/12 A2997 RMK AO2 SLP147 T01720122',
 'KLAX': 'KLAX 181653Z 25008KT 6SM BR BKN004 18/16 A2995 RMK AO2 SLP140 VIS SW-W 3/4 VIS NW 1 T01830161',
 'KATL': 'KATL 181652Z 10005KT 10SM SCT022 BKN070 BKN250 26/19 A3000 RMK AO2 SLP151 MDT CU ALQDS T02560194'}

## Another example

Step 1: Make the soup

In [47]:
myurl = "https://en.wikipedia.org/wiki/chicago"
html = pm.urlopen(url=myurl, method="GET").data
soup = bs4.BeautifulSoup(html, features='lxml')

Step 2: Examine the webpage and locate the element

In [48]:
tag_list = soup.find_all("th", colspan="14")
tag_list

[<th colspan="14">Climate data for Chicago (Midway Airport), 1991–2020 normals,<sup class="reference" id="cite_ref-Strange_field_expl_147-0"><a href="#cite_note-Strange_field_expl-147">[a]</a></sup> extremes 1928–present
 </th>,
 <th colspan="14">Climate data for Chicago (O'Hare Int'l Airport), 1991–2020 normals,<sup class="reference" id="cite_ref-Strange_field_expl_147-1"><a href="#cite_note-Strange_field_expl-147">[a]</a></sup> extremes 1871–present<sup class="reference" id="cite_ref-153"><a href="#cite_note-153">[b]</a></sup>
 </th>,
 <th colspan="14">Sunshine data for Chicago
 </th>,
 <th colspan="14" style="background:#f8f9fa;font-weight:normal;font-size:95%;">Source: Weather Atlas<sup class="reference" id="cite_ref-Weather_Atlas_156-0"><a href="#cite_note-Weather_Atlas-156">[154]</a></sup>
 </th>]

In [49]:
tag_list[0].text

'Climate data for Chicago (Midway Airport), 1991–2020 normals,[a] extremes 1928–present\n'

In [50]:
for tag in tag_list:
    if tag.text[:7] == "Climate":
        break
tag.text

'Climate data for Chicago (Midway Airport), 1991–2020 normals,[a] extremes 1928–present\n'

In [51]:
table_tag = tag.parent.parent
# table_tag

In [52]:
table_tag.find_all("th", scope="row")

[<th scope="row">Month
 </th>,
 <th scope="row" style="height: 16px;">Record high °F (°C)
 </th>,
 <th scope="row" style="height: 16px;">Mean maximum °F (°C)
 </th>,
 <th scope="row" style="height: 16px;">Average high °F (°C)
 </th>,
 <th scope="row" style="height: 16px;">Daily mean °F (°C)
 </th>,
 <th scope="row" style="height: 16px;">Average low °F (°C)
 </th>,
 <th scope="row" style="height: 16px;">Mean minimum °F (°C)
 </th>,
 <th scope="row" style="height: 16px;">Record low °F (°C)
 </th>,
 <th scope="row" style="height: 16px;">Average <a href="/wiki/Precipitation" title="Precipitation">precipitation</a> inches (mm)
 </th>,
 <th scope="row" style="height: 16px;">Average snowfall inches (cm)
 </th>,
 <th scope="row" style="height: 16px;">Average precipitation days <span class="nowrap" style="font-size:90%;">(≥ 0.01 in)</span>
 </th>,
 <th scope="row" style="height: 16px;">Average snowy days <span class="nowrap" style="font-size:90%;">(≥ 0.1 in)</span>
 </th>,
 <th scope="row" styl

In [53]:
table_tag.find_all("th", scope="col")

[<th scope="col">Jan
 </th>,
 <th scope="col">Feb
 </th>,
 <th scope="col">Mar
 </th>,
 <th scope="col">Apr
 </th>,
 <th scope="col">May
 </th>,
 <th scope="col">Jun
 </th>,
 <th scope="col">Jul
 </th>,
 <th scope="col">Aug
 </th>,
 <th scope="col">Sep
 </th>,
 <th scope="col">Oct
 </th>,
 <th scope="col">Nov
 </th>,
 <th scope="col">Dec
 </th>,
 <th scope="col" style="border-left-width:medium">Year
 </th>]

In [54]:
row_names = [row.text for row in table_tag.find_all("th", scope="row")]
row_names

['Month\n',
 'Record high °F (°C)\n',
 'Mean maximum °F (°C)\n',
 'Average high °F (°C)\n',
 'Daily mean °F (°C)\n',
 'Average low °F (°C)\n',
 'Mean minimum °F (°C)\n',
 'Record low °F (°C)\n',
 'Average precipitation inches (mm)\n',
 'Average snowfall inches (cm)\n',
 'Average precipitation days (≥ 0.01 in)\n',
 'Average snowy days (≥ 0.1 in)\n',
 'Average ultraviolet index\n']

In [55]:
row_names = [row.text[:-1] for row in table_tag.find_all("th", scope="row")][1:]
row_names

['Record high °F (°C)',
 'Mean maximum °F (°C)',
 'Average high °F (°C)',
 'Daily mean °F (°C)',
 'Average low °F (°C)',
 'Mean minimum °F (°C)',
 'Record low °F (°C)',
 'Average precipitation inches (mm)',
 'Average snowfall inches (cm)',
 'Average precipitation days (≥ 0.01 in)',
 'Average snowy days (≥ 0.1 in)',
 'Average ultraviolet index']

In [56]:
col_names = [col.text[:-1] for col in table_tag.find_all("th", scope="col")]
col_names

['Jan',
 'Feb',
 'Mar',
 'Apr',
 'May',
 'Jun',
 'Jul',
 'Aug',
 'Sep',
 'Oct',
 'Nov',
 'Dec',
 'Year']
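
The `[:-1]` slice used above removes the last character by position, which works here because every label ends in a newline. A slightly more defensive sketch uses `str.strip()`, which removes surrounding whitespace only when it is present. The `labels` list and the helper name `clean_label` are illustrative.

```python
def clean_label(raw):
    """Remove surrounding whitespace (including a trailing newline) if present."""
    return raw.strip()


labels = ["Jan\n", "Feb\n", "Year"]  # the last label has no trailing newline

sliced = [label[:-1] for label in labels]            # positional slice
stripped = [clean_label(label) for label in labels]  # whitespace-aware

print(sliced)
print(stripped)
```

With the slice, a label that lacks the trailing newline loses its final letter; with `strip()` it is left unchanged.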

In [58]:
table_tag.find_all("td")[:10]

[<td style="background: #FF9B37; color:#000000;">67<br/>(19)
 </td>,
 <td style="background: #FF7800; color:#000000;">75<br/>(24)
 </td>,
 <td style="background: #FF4F00; color:#000000;">86<br/>(30)
 </td>,
 <td style="background: #FF3A00; color:#000000;">92<br/>(33)
 </td>,
 <td style="background: #FF1100; color:#FFFFFF;">102<br/>(39)
 </td>,
 <td style="background: #F80000; color:#FFFFFF;">107<br/>(42)
 </td>,
 <td style="background: #EA0000; color:#FFFFFF;">109<br/>(43)
 </td>,
 <td style="background: #FF0A00; color:#FFFFFF;">104<br/>(40)
 </td>,
 <td style="background: #FF1100; color:#FFFFFF;">102<br/>(39)
 </td>,
 <td style="background: #FF3300; color:#000000;">94<br/>(34)
 </td>]

In [60]:
value_tags = table_tag.find_all("td")[:len(row_names)*len(col_names)]
value_tags[:10]

[<td style="background: #FF9B37; color:#000000;">67<br/>(19)
 </td>,
 <td style="background: #FF7800; color:#000000;">75<br/>(24)
 </td>,
 <td style="background: #FF4F00; color:#000000;">86<br/>(30)
 </td>,
 <td style="background: #FF3A00; color:#000000;">92<br/>(33)
 </td>,
 <td style="background: #FF1100; color:#FFFFFF;">102<br/>(39)
 </td>,
 <td style="background: #F80000; color:#FFFFFF;">107<br/>(42)
 </td>,
 <td style="background: #EA0000; color:#FFFFFF;">109<br/>(43)
 </td>,
 <td style="background: #FF0A00; color:#FFFFFF;">104<br/>(40)
 </td>,
 <td style="background: #FF1100; color:#FFFFFF;">102<br/>(39)
 </td>,
 <td style="background: #FF3300; color:#000000;">94<br/>(34)
 </td>]

In [61]:
data = [value.text[:-1] for value in value_tags]
data = np.array(data).reshape(len(row_names), len(col_names))
df = pd.DataFrame(data, columns=col_names, index=row_names)
df

| | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Record high °F (°C) | 67(19) | 75(24) | 86(30) | 92(33) | 102(39) | 107(42) | 109(43) | 104(40) | 102(39) | 94(34) | 81(27) | 72(22) | 109(43) |
| Mean maximum °F (°C) | 53.4(11.9) | 57.9(14.4) | 72.0(22.2) | 81.5(27.5) | 89.2(31.8) | 93.9(34.4) | 96.0(35.6) | 94.2(34.6) | 90.8(32.7) | 82.8(28.2) | 68.0(20.0) | 57.5(14.2) | 97.1(36.2) |
| Average high °F (°C) | 32.8(0.4) | 36.8(2.7) | 47.9(8.8) | 60.0(15.6) | 71.5(21.9) | 81.2(27.3) | 85.2(29.6) | 83.1(28.4) | 76.5(24.7) | 63.7(17.6) | 49.6(9.8) | 37.7(3.2) | 60.5(15.8) |
| Daily mean °F (°C) | 26.2(−3.2) | 29.9(−1.2) | 39.9(4.4) | 50.9(10.5) | 61.9(16.6) | 71.9(22.2) | 76.7(24.8) | 75.0(23.9) | 67.8(19.9) | 55.3(12.9) | 42.4(5.8) | 31.5(−0.3) | 52.4(11.3) |
| Average low °F (°C) | 19.5(−6.9) | 22.9(−5.1) | 32.0(0.0) | 41.7(5.4) | 52.4(11.3) | 62.7(17.1) | 68.1(20.1) | 66.9(19.4) | 59.2(15.1) | 46.8(8.2) | 35.2(1.8) | 25.3(−3.7) | 44.4(6.9) |
| Mean minimum °F (°C) | −3(−19) | 3.4(−15.9) | 14.1(−9.9) | 28.2(−2.1) | 39.1(3.9) | 49.3(9.6) | 58.6(14.8) | 57.6(14.2) | 45.0(7.2) | 31.8(−0.1) | 19.7(−6.8) | 5.3(−14.8) | −6.5(−21.4) |
| Record low °F (°C) | −25(−32) | −20(−29) | −7(−22) | 10(−12) | 28(−2) | 35(2) | 46(8) | 43(6) | 29(−2) | 20(−7) | −3(−19) | −20(−29) | −25(−32) |
| Average precipitation inches (mm) | 2.30(58) | 2.12(54) | 2.66(68) | 4.15(105) | 4.75(121) | 4.53(115) | 4.02(102) | 4.10(104) | 3.33(85) | 3.86(98) | 2.73(69) | 2.33(59) | 40.88(1,038) |
| Average snowfall inches (cm) | 12.5(32) | 10.1(26) | 5.7(14) | 1.0(2.5) | 0.0(0.0) | 0.0(0.0) | 0.0(0.0) | 0.0(0.0) | 0.0(0.0) | 0.1(0.25) | 1.5(3.8) | 7.9(20) | 38.8(99) |
| Average precipitation days (≥ 0.01 in) | 11.5 | 9.4 | 11.1 | 12.0 | 12.4 | 11.1 | 10.0 | 9.3 | 8.4 | 10.8 | 10.2 | 10.8 | 127.0 |


In [62]:
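def get_climate(city):
    '''
    Get the first climate table from a city's Wikipedia page.

    Inputs:
        city (str): city name as it appears in the Wikipedia URL

    Outputs:
        df (DataFrame): climate data, or None if no table is found
    '''
    # Build the URL from the city name (same pattern as the Chicago URL above);
    # the request uses the global PoolManager pm
    myurl = "https://en.wikipedia.org/wiki/" + city.replace(" ", "_")
    html = pm.urlopen(url=myurl, method="GET").data
    soup = bs4.BeautifulSoup(html, features='lxml')
    tag_list = soup.find_all("th", colspan="14")
    try:
        tag = tag_list[0].parent.parent
        rows = [row.text[:-1] for row in tag.find_all("th", scope="row")][1:]
        cols = [col.text[:-1] for col in tag.find_all("th", scope="col")]
        data = [num.text[:-1]
                for num in tag.find_all("td")[:len(rows)*len(cols)]]
        data = np.array(data).reshape(len(rows), len(cols))
        df = pd.DataFrame(data, columns=cols, index=rows)
    except IndexError:
        return None

    return df

In [63]:
df = get_climate("chicago")
df["Jan"]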

Record high °F (°C)                           67(19)
Mean maximum °F (°C)                      53.4(11.9)
Average high °F (°C)                       32.8(0.4)
Daily mean °F (°C)                        26.2(−3.2)
Average low °F (°C)                       19.5(−6.9)
Mean minimum °F (°C)                         −3(−19)
Record low °F (°C)                          −25(−32)
Average precipitation inches (mm)           2.30(58)
Average snowfall inches (cm)                12.5(32)
Average precipitation days (≥ 0.01 in)          11.5
Average snowy days (≥ 0.1 in)                    8.9
Average ultraviolet index                          1
Name: Jan, dtype: object
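
Note the `dtype: object`: every cell is still a string such as `26.2(−3.2)`, and the minus sign is the Unicode character U+2212 rather than the ASCII hyphen, so `float()` cannot read it directly. A minimal sketch of a converter that keeps only the first number in a cell follows; the helper name `leading_number` is hypothetical.

```python
def leading_number(cell):
    """Convert the part of a cell before any '(' to a float."""
    head = cell.split("(")[0]
    # Replace the Unicode minus sign and drop thousands separators
    head = head.replace("\u2212", "-").replace(",", "")
    return float(head)


print(leading_number("26.2(\u22123.2)"))
print(leading_number("\u22123(\u221219)"))
print(leading_number("11.5"))
```

A column could then be converted with, for example, `df["Jan"].map(leading_number)`.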

## Reference

- CMSC 12200