# Solution - Web Scraping using BeautifulSoup

## Exercise

Use the [HTML File](https://raw.githubusercontent.com/dgadiraju/itversity-books/master/Data%20Engineering%20Bootcamp/30%20Basics%20of%20Programming%20using%20Python/11%20Exercise%20-%20Web%20Scraping%20-%20Airports%20Data.html) and load the data into a Data Frame with the following fields.

* IATA Code
* Major city served
* State
* Year
* Air Traffic

**Hint: You can use the Pandas `melt` function to unpivot the data.**

The output should contain 330 records: one per airport and year (30 airports × 11 years).

## Solution

The solution takes the following approach:
* Read the HTML content into a BeautifulSoup object.
* Get the table header values to use as field names.
* Get the table data values and build a dictionary from each row, mapping every field name to its corresponding value.
* Append each dictionary to a list. At this point the table has been converted into a list of dict objects.
* Create a Data Frame with all the columns in the table.
* Drop the unnecessary columns from the Data Frame and unpivot the data.
In [1]:
import requests
from bs4 import BeautifulSoup

airport_url = 'https://raw.githubusercontent.com/dgadiraju/itversity-books/master/Data%20Engineering%20Bootcamp/30%20Basics%20of%20Programming%20using%20Python/11%20Exercise%20-%20Web%20Scraping%20-%20Airports%20Data.html'
airport_page = requests.get(airport_url)

airport_soup = BeautifulSoup(airport_page.content, 'html.parser')

for tr in airport_soup.find_all('tr'):
    th = tr.find_all('th')
    if len(th) > 0:
        for field_name in th:
            print(field_name.get_text())

Rank (2018)
Airports (large hubs)
IATA Code
Major city served
State
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009


In [2]:
field_names = []
for tr in airport_soup.find_all('tr'):
    th = tr.find_all('th')
    if len(th) > 0:
        for field_name in th:
            field_names.append(field_name.get_text())

for field_name in field_names:
    if field_name.isnumeric():
        print(field_name)

2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
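
The numeric check above is what separates the year columns from the identifying columns. As a small sketch, the header list printed earlier can be split into the two groups; the names `year_columns` and `other_columns` are introduced here for illustration only.

```python
# Header values as printed by the first cell
field_names = [
    'Rank (2018)', 'Airports (large hubs)', 'IATA Code',
    'Major city served', 'State',
    '2019', '2018', '2017', '2016', '2015', '2014',
    '2013', '2012', '2011', '2010', '2009'
]

# str.isnumeric() is True only when every character is numeric,
# so 'Rank (2018)' does not qualify but '2018' does
year_columns = [name for name in field_names if name.isnumeric()]
other_columns = [name for name in field_names if not name.isnumeric()]

print(len(year_columns))
print(other_columns)
```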


In [3]:
# Preview the raw <td> cells of the first few data rows
cnt = 0
for tr in airport_soup.find_all('tr'):
    td = tr.find_all('td')
    if len(td) > 0:
        if cnt > 3: break
        for field_value in td:
            print(field_value)
        cnt += 1

<td>1
</td>
<td><span class="nowrap"><a href="/wiki/Hartsfield%E2%80%93Jackson_Atlanta_International_Airport" title="Hartsfield–Jackson Atlanta International Airport">Hartsfield–Jackson Atlanta International Airport</a></span>
</td>
<td>ATL
</td>
<td><a href="/wiki/Atlanta" title="Atlanta">Atlanta</a>
</td>
<td>GA
</td>
<td>
</td>
<td>51,866,464
</td>
<td>50,251,964
</td>
<td>50,501,858
</td>
<td>49,340,732
</td>
<td>46,604,273
</td>
<td>45,308,407
</td>
<td>45,798,809
</td>
<td>44,414,121
</td>
<td>43,130,585
</td>
<td>42,280,868
</td>
<td>2
</td>
<td><a href="/wiki/Los_Angeles_International_Airport" title="Los Angeles International Airport">Los Angeles International Airport</a>
</td>
<td>LAX
</td>
<td><a href="/wiki/Los_Angeles" title="Los Angeles">Los Angeles</a>
</td>
<td>CA
</td>
<td>
</td>
<td>42,626,783
</td>
<td>41,232,432
</td>
<td>39,636,042
</td>
<td>36,351,226
</td>
<td>34,314,197
</td>
<td>32,425,892
</td>
<td>31,326,268
</td>
<td>30,528,737
</td>
<td>28,857,755
</td>
<td>
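
The raw cells above carry nested tags and a trailing newline. A minimal sketch of how `get_text()` flattens such a cell, using two cells copied from the output above (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Two cells copied from the output above, including the trailing newline
cells_html = (
    '<td><a href="/wiki/Atlanta" title="Atlanta">Atlanta</a>\n</td>'
    '<td>51,866,464\n</td>'
)
cells = BeautifulSoup(cells_html, 'html.parser').find_all('td')

for cell in cells:
    raw_text = cell.get_text()           # text of the cell and its nested tags
    clean_text = raw_text.rstrip('\n')   # drop the trailing newline
    print(repr(raw_text), '->', repr(clean_text))
```

The link markup disappears and only the visible text remains, which is why the values can be paired directly with the header names once the newline is stripped.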

In [5]:
airport_traffic = []
for tr in airport_soup.find_all('tr'):
    td = tr.find_all('td')
    if len(td) > 0:
        # Cell text ends with a newline; strip it before pairing with the header names
        field_values = [field_value.get_text().rstrip('\n') for field_value in td]
        airport_traffic.append(dict(zip(field_names, field_values)))

airport_traffic[:3]

[{'Rank (2018)': '1',
  'Airports (large hubs)': 'Hartsfield–Jackson Atlanta International Airport',
  'IATA Code': 'ATL',
  'Major city served': 'Atlanta',
  'State': 'GA',
  '2019': '',
  '2018': '51,866,464',
  '2017': '50,251,964',
  '2016': '50,501,858',
  '2015': '49,340,732',
  '2014': '46,604,273',
  '2013': '45,308,407',
  '2012': '45,798,809',
  '2011': '44,414,121',
  '2010': '43,130,585',
  '2009': '42,280,868'},
 {'Rank (2018)': '2',
  'Airports (large hubs)': 'Los Angeles International Airport',
  'IATA Code': 'LAX',
  'Major city served': 'Los Angeles',
  'State': 'CA',
  '2019': '',
  '2018': '42,626,783',
  '2017': '41,232,432',
  '2016': '39,636,042',
  '2015': '36,351,226',
  '2014': '34,314,197',
  '2013': '32,425,892',
  '2012': '31,326,268',
  '2011': '30,528,737',
  '2010': '28,857,755',
  '2009': '27,439,897'},
 {'Rank (2018)': '3',
  'Airports (large hubs)': "O'Hare International Airport",
  'IATA Code': 'ORD',
  'Major city served': 'Chicago',
  'State': 'IL',
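
One thing to keep in mind with `dict(zip(field_names, field_values))`: `zip` stops at the shorter input, so a row with fewer cells than headers would silently lose its trailing fields. A small sketch with a shortened header list shows the behaviour and a simple length check; the helper name `build_record` is hypothetical and not part of the notebook.

```python
def build_record(field_names, field_values):
    """Pair header names with cell values, refusing rows of the wrong length."""
    if len(field_names) != len(field_values):
        raise ValueError(
            f'expected {len(field_names)} values, got {len(field_values)}'
        )
    return dict(zip(field_names, field_values))


names = ['IATA Code', 'State', '2018']

# A complete row maps cleanly
print(build_record(names, ['ATL', 'GA', '51,866,464']))

# Plain zip silently drops the missing '2018' field
print(dict(zip(names, ['ATL', 'GA'])))

# The checked version reports the problem instead
try:
    build_record(names, ['ATL', 'GA'])
except ValueError as err:
    print(err)
```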

In [6]:
import pandas as pd
airport_traffic_df = pd.DataFrame(airport_traffic)

In [9]:
airport_traffic_df.melt?

Signature:
airport_traffic_df.melt(
    id_vars=None,
    value_vars=None,
    var_name=None,
    value_name='value',
    col_level=None,
    ignore_index=True,
) -> 'DataFrame'
Docstring:
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

This function is useful to massage a DataFrame into a format where one
or more columns are identifier variables (`id_vars`), while all other
columns, considered measured variables (`value_vars`), are "unpivote
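
To make the parameters in the signature above concrete, here is a small sketch on a two-airport frame that reuses the ATL and LAX figures shown earlier. The frame and the names `wide_df` and `long_df` are illustrative. Columns listed in `id_vars` are kept as they are, while every other column is turned into a pair of (`var_name`, `value_name`) values.

```python
import pandas as pd

wide_df = pd.DataFrame({
    'IATA Code': ['ATL', 'LAX'],
    '2018': ['51,866,464', '42,626,783'],
    '2017': ['50,251,964', '41,232,432'],
})

long_df = wide_df.melt(
    id_vars=['IATA Code'],     # columns kept as identifiers
    var_name='Year',           # name for the column holding the old headers
    value_name='Air Traffic'   # name for the column holding the cell values
)

# 2 airports x 2 year columns -> 4 rows
print(long_df)
```

Because `value_vars` is left as `None`, all columns not named in `id_vars` are unpivoted.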

In [12]:
# Drop the columns the exercise does not ask for, then unpivot the year columns into rows
airport_traffic_df_by_year = airport_traffic_df. \
    drop(['Rank (2018)', 'Airports (large hubs)'], axis=1). \
    melt(
        id_vars=['IATA Code', 'Major city served', 'State'],
        var_name='Year',
        value_name='Air Traffic'
    )
airport_traffic_df_by_year

,IATA Code,Major city served,State,Year,Air Traffic
0,ATL,Atlanta,GA,2019,
1,LAX,Los Angeles,CA,2019,
2,ORD,Chicago,IL,2019,
3,DFW,Dallas,TX,2019,
4,DEN,Denver,CO,2019,
...,...,...,...,...,...
325,DCA,"Washington, D.C.",VA,2009,8490288
326,MDW,Chicago,IL,2009,8253620
327,TPA,Tampa,FL,2009,8263294
328,PDX,Portland,OR,2009,6430119
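
The scraped values are strings, and the dictionaries shown earlier still contain thousands separators and an empty 2019 column. A possible clean-up step, which is an assumption and not one of the cells shown above, is sketched here on a few rows in the same shape as the unpivoted frame. `pd.to_numeric` with `errors='coerce'` leaves the empty strings as missing values; `sample_df` is an illustrative name.

```python
import pandas as pd

# Rows in the shape of the unpivoted frame; values taken from the earlier output
sample_df = pd.DataFrame({
    'IATA Code': ['ATL', 'ATL', 'LAX'],
    'Year': ['2019', '2018', '2018'],
    'Air Traffic': ['', '51,866,464', '42,626,783'],
})

# Remove thousands separators, then convert; '' ends up as NaN
sample_df['Air Traffic'] = pd.to_numeric(
    sample_df['Air Traffic'].str.replace(',', '', regex=False),
    errors='coerce'
)
sample_df['Year'] = sample_df['Year'].astype(int)

print(sample_df)
print(len(sample_df))
```

On the full data, `len(airport_traffic_df_by_year)` should equal 330, matching the exercise statement.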
