## Web Scraping using BeautifulSoup

As part of this module we will see how to scrape data from web pages.
* Problem Statement
* Installing Pre-requisites
* Overview of BeautifulSoup
* Getting HTML Content
* Processing HTML Content
* Creating Data Frame
* Processing Data using Data Frame APIs
* Exercise - Airport Traffic

## Problem Statement

Let us define a problem statement for Web Scraping. We will primarily focus on BeautifulSoup.

* Vande Bharat Flight Service is a service sponsored by the Indian Government. It opens up flights at regular intervals amid the Covid-19 pandemic. Here is the [link](https://mea.gov.in/phase-6.htm) to one of the published schedules.
* However, the website is static and we cannot easily figure out the details of flights from a given source to a given destination.
* We want to scrape this static page and extract the information we are looking for. Ideally, we could build a website with some additional filter criteria.
* However, as we just want to explore Web Scraping using BeautifulSoup, we will only go as far as reading the HTML table into a Pandas Data Frame and running some basic queries.
  * Get all Columns
  * Get all unique destinations. They are nothing but Indian airports.
  * Get all the distinct origins with **Country of Origin** as **USA**.
  * Get all the distinct airports in the US from which flights are operated. You have to get all the origins which are not reflected in destinations.
  * Get all the flights from the US that depart after a given date.

## Installing Pre-requisites

We will use multiple Python libraries to perform Web Scraping.
* **requests** to get the content from HTML pages
* **beautifulsoup4** to process HTML tags and extract data
* **pandas** to process data using Data Frame APIs

```
pip install requests
pip install beautifulsoup4
pip install pandas
```

## Overview of BeautifulSoup

Let us get a brief overview of BeautifulSoup.
* We will create a simple HTML table.

In [18]:
%%html
<table>
    <tbody>
        <tr>
            <th>Details</th>
            <th>URL</th>
        </tr>
        <tr>
            <td>Video Content</td>
            <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td>
        </tr>
        <tr>
            <td>Reference Material</td>
            <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td>
        </tr>
    </tbody>
</table>

Details,URL
Video Content,YouTube Channel
Reference Material,GitHub Repository


In [73]:
html_str = """<table>
    <tbody>
        <tr>
            <th>Details</th>
            <th>URL</th>
        </tr>
        <tr>
            <td>Video Content</td>
            <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
            </td>
        </tr>
        <tr>
            <td>Reference Material</td>
            <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
            </td>
        </tr>
    </tbody>
</table>"""

* Create a BeautifulSoup object named `soup`.
* We can access the first occurrence of a tag by using the tag name as an attribute.

In [74]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_str, 'html.parser')
print(soup.prettify())

<table>
 <tbody>
  <tr>
   <th>
    Details
   </th>
   <th>
    URL
   </th>
  </tr>
  <tr>
   <td>
    Video Content
   </td>
   <td>
    <a href="https://www.youtube.com/itversityin">
     YouTube Channel
    </a>
   </td>
  </tr>
  <tr>
   <td>
    Reference Material
   </td>
   <td>
    <a href="https://www.github.com/dgadiraju/itversity-books">
     GitHub Repository
    </a>
   </td>
  </tr>
 </tbody>
</table>


* Accessing first occurrence of `tr`

In [75]:
soup.table.tbody.tr

<tr>
<th>Details</th>
<th>URL</th>
</tr>

* Accessing the value of the first `th`. We can use `string` or `get_text()`.

In [76]:
soup.table.tbody.tr.th.string

'Details'

In [85]:
soup.table.tbody.tr.th.get_text()

'Details'

* Accessing first occurrence of anchor tag

In [77]:
soup.table.tbody.a

<a href="https://www.youtube.com/itversityin">YouTube Channel</a>

* Getting the url from `href` attribute of anchor tag

In [78]:
soup.table.tbody.a['href']

'https://www.youtube.com/itversityin'

* Accessing the value of anchor tag.

In [79]:
soup.table.tbody.a.string

'YouTube Channel'

* Get all anchor tags

In [86]:
soup.table.tbody.find_all('a')

[<a href="https://www.youtube.com/itversityin">YouTube Channel</a>,
 <a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>]

* Get all `td` tags

In [81]:
for td in soup.find_all('td'):
    print(td)

<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>


* Get value from all `td` tags.

In [82]:
# string returns None when a tag has more than one child, e.g. an anchor tag followed by a new line
for td in soup.find_all('td'):
    print(td.string)

Video Content
None
Reference Material
None

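The `None` values above are not caused by the new line character as such. `.string` returns `None` whenever a tag has more than one direct child, and in the cells with links the anchor tag is followed by a whitespace text node. Here is a minimal sketch that makes this visible. The `sample_html` string and the `sample_soup` name are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second cell has an anchor followed by a new line.
sample_html = (
    '<table><tr><td>Video Content</td>'
    '<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>\n</td>'
    '</tr></table>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

cells = sample_soup.find_all('td')
for cell in cells:
    # len(cell.contents) is the number of direct children of the tag
    print(len(cell.contents), repr(cell.string), repr(cell.get_text(strip=True)))
```

The first cell has a single child, so `.string` works. The second cell has two children (the anchor and the new line), so `.string` is `None` while `get_text(strip=True)` still returns the link text.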

In [83]:
# get_text returns all the text inside the tag, even when the tag has more than one child
for td in soup.find_all('td'):
    print(td.get_text())

Video Content
YouTube Channel

Reference Material
GitHub Repository



In [84]:
# Stripping surrounding whitespace, including new line characters
for td in soup.find_all('td'):
    print(td.get_text().strip())

Video Content
YouTube Channel
Reference Material
GitHub Repository


* Get values and URLs from anchor tags as a list of dicts

In [87]:
itversity_details = []
for a in soup.find_all('a'):
    rec = {'description': a.string, 'url': a['href']}
    itversity_details.append(rec)

itversity_details

[{'description': 'YouTube Channel',
  'url': 'https://www.youtube.com/itversityin'},
 {'description': 'GitHub Repository',
  'url': 'https://www.github.com/dgadiraju/itversity-books'}]

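The calls covered in this overview (`find_all`, `get_text` and attribute access on the anchor tag) can be combined into one small helper. This is only a sketch: the function name `table_to_records` and the `html_doc` string are assumptions made for illustration, and the helper expects exactly two `td` cells per data row.

```python
from bs4 import BeautifulSoup

# Same table as above, written compactly for this sketch.
html_doc = """<table>
<tr><th>Details</th><th>URL</th></tr>
<tr><td>Video Content</td><td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td></tr>
<tr><td>Reference Material</td><td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td></tr>
</table>"""


def table_to_records(html):
    """Return one dict per data row: first cell text, link text and link URL."""
    parsed = BeautifulSoup(html, 'html.parser')
    records = []
    for tr in parsed.find_all('tr'):
        cells = tr.find_all('td')
        if len(cells) != 2:
            continue  # skips the header row, which only has th tags
        link = cells[1].find('a')
        if link is None:
            continue  # skips rows without a link in the second cell
        records.append({
            'details': cells[0].get_text(strip=True),
            'description': link.get_text(strip=True),
            'url': link['href'],
        })
    return records


records = table_to_records(html_doc)
print(records)
```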
## Getting HTML Content

We can use the `requests` library to get the content of HTML pages. It is a third-party package rather than part of the Python standard library, so it has to be installed first.
* `requests` provides the `get` function, to which we can pass a web URL: `flights_page = requests.get(flights_url)`
* We can access the content using `flights_page.content`.
* We can pass the content to BeautifulSoup and parse the HTML tags and data for further processing.

In [71]:
import requests

In [None]:
flights_url = 'https://mea.gov.in/phase-6.htm'
flights_page = requests.get(flights_url)

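In practice it can help to set a timeout and to fail early on HTTP errors before handing the content to BeautifulSoup. The sketch below wraps `requests.get` for that purpose. The helper name `fetch_html` and the 30 second default are assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=30):
    """Return the raw bytes of the page at url, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.content


# Usage (performs a network call, so it is left commented out here):
# flights_html = fetch_html('https://mea.gov.in/phase-6.htm')
```

The returned bytes can be passed to `BeautifulSoup` in the same way as `flights_page.content`.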
## Processing HTML Content

We can process the content and extract HTML Tags as well as data using BeautifulSoup.
* We have to pass the content along with the parser name `html.parser` to build the BeautifulSoup object.
* Let us prettify and print the content.

In [None]:
soup = BeautifulSoup(flights_page.content, 'html.parser')

print(soup.prettify())

* Let us extract all the `th` tags. They give us the header of the table.
* We have multiple `tr` tags with `th` and we need to consider only the `tr` tag with 12 `th` elements.
* Here is the code snippet to get the header of the table.

In [None]:
for tr in soup.find_all('tr'):
    th = tr.find_all('th')
    if len(th) == 12:
        for field_name in th:
            print(field_name)

* We can use `field_name.string` to get only the value.

In [None]:
for tr in soup.find_all('tr'):
    th = tr.find_all('th')
    if len(th) == 12:
        for field_name in th:
            print(field_name.string)

* We can also get the actual values from `td` tags.
* We will print the values of only the first 3 rows.

In [None]:
ctr = 0
for tr in soup.find_all('tr'):
    if ctr == 3:
        break
    td = tr.find_all('td')
    if len(td) == 12:
        for field_value in td:
            print(field_value.string)
        ctr += 1

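The idea of keeping only the rows with the expected number of cells can be tried offline on a small table. The sketch below uses a made-up 3-column table in place of the 12-column schedule. The helper name `rows_with` and all cell values are assumptions, and it uses `get_text(strip=True)` rather than `.string` so that cells with nested tags still yield text.

```python
from bs4 import BeautifulSoup

# Hypothetical 3-column table standing in for the 12-column flight schedule.
sample_html = """<table>
<tr><th colspan="3">Schedule</th></tr>
<tr><th>Origin</th><th>Destination</th><th>Dep Date</th></tr>
<tr><td>Chicago</td><td>Delhi</td><td>2-Sep-20</td></tr>
<tr><td>Newark</td><td>Mumbai</td><td>5-Sep-20</td></tr>
</table>"""


def rows_with(parsed, tag_name, expected_count):
    """Return the text of every row that has exactly expected_count cells of tag_name."""
    rows = []
    for tr in parsed.find_all('tr'):
        cells = tr.find_all(tag_name)
        if len(cells) == expected_count:
            rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


sample_soup = BeautifulSoup(sample_html, 'html.parser')
header_rows = rows_with(sample_soup, 'th', 3)  # the title row with one th is skipped
value_rows = rows_with(sample_soup, 'td', 3)
print(header_rows)
print(value_rows)
```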
## Creating Data Frame

Let us build the Data Frame so that we can process the data using Data Frame APIs.
* We will get all the headers into a list **field_names** by using data from `th` tags.
* We will get all the `tr` tags with `td` tags. We will build the list **field_values** one row at a time.
* While processing **table rows** we will build a dict using **field_names** and **field_values**. Using these dicts, we will build a list of dicts.
* Using the list of dicts we will create the Data Frame.

In [None]:
# Build list for field names.
field_names = []

for tr in soup.find_all('tr'):
    th = tr.find_all('th')
    if len(th) == 12:
        for field_name in th:
            field_names.append(field_name.string)

field_names

In [None]:
# If we have a list of tuples with 2 elements each, we can create a dict as below.
l = [(1, 'Hello'), (2, 'World')]
dict(l)

In [None]:
# If we have 2 lists, we can merge them into one list of paired tuples using zip as below.
# We will use this approach to build the dict.
l1 = [1, 2]
l2 = ['Hello', 'World']

dict(zip(l1, l2))

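One caveat when pairing field names with values this way: `zip` stops at the shorter input, so a row with a missing cell produces a dict with missing keys and no error. The sketch below shows the behaviour and one possible guard. The names `sample_names`, `short_row` and `to_record` are assumptions made for illustration.

```python
sample_names = ['Origin', 'Destination', 'Dep Date']
complete_row = ['Chicago', 'Delhi', '2-Sep-20']
short_row = ['Chicago', 'Delhi']  # hypothetical row with a missing cell

print(dict(zip(sample_names, complete_row)))
truncated = dict(zip(sample_names, short_row))  # 'Dep Date' is silently dropped
print(truncated)


def to_record(names, values):
    """Pair names with values, refusing rows whose length does not match."""
    if len(names) != len(values):
        raise ValueError(f'expected {len(names)} values, got {len(values)}')
    return dict(zip(names, values))


record = to_record(sample_names, complete_row)
print(record)
```

Filtering rows by the number of `td` tags before zipping serves the same purpose as the length check in `to_record`.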
In [None]:
# Build list of dicts. Each dict will contain 12 elements with keys from field_names and values from field_values
data = []
for tr in soup.find_all('tr'):
    td = tr.find_all('td')
    field_values = []
    if len(td) == 12:
        for field_value in td:
            field_values.append(field_value.string)
        rec = dict(zip(field_names, field_values))
        data.append(rec)

In [None]:
data[:3]

In [None]:
# Creating Pandas Data Frame
import pandas as pd
df = pd.DataFrame(data)

df

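To see what `pd.DataFrame` does with a list of dicts without fetching the page, here is a small sketch. The two records are made up. Only the column names are taken from the queries used later in this notebook.

```python
import pandas as pd

# Hypothetical records in the same shape as the list of dicts built above.
sample_data = [
    {'Origin': 'Chicago', 'Destination': 'Delhi', 'Dep Date': '2-Sep-20'},
    {'Origin': 'Newark', 'Destination': 'Mumbai', 'Dep Date': '5-Sep-20'},
]

sample_df = pd.DataFrame(sample_data)
print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # column order follows the dict keys
print(sample_df.dtypes)         # scraped values start out as text, not dates or numbers
```

Each dict becomes one row and each key becomes one column, which is why every record needs the same set of keys.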
## Processing Data using Data Frame APIs
Here are the problem statements for which we will try to come up with solutions.
* Get all Columns
* Get all unique destinations. They are nothing but Indian airports.
* Get all the distinct origins with **Country of Origin** as **USA**.
* Get all the distinct airports in the US from which flights are operated. You have to get all the origins which are not reflected in destinations.
* Get all the flights from the US that depart after a given date.

In [None]:
# Get all Columns

df.columns

In [None]:
# Get all unique destinations
unique_destinations = df['Destination'].unique()

unique_destinations

In [None]:
# Get all the distinct origins with Country of Origin as USA.
unique_origins = df.query('`Country of Origin` == "USA"')['Origin'].unique()

unique_origins

In [None]:
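# Get all the distinct airports in the US from which flights are operated.
# These are the origins which are not reflected in destinations.
set(unique_origins).difference(set(unique_destinations))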

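The set difference keeps the values that appear among the origins but not among the destinations. A tiny sketch with made-up values (the names `sample_origins` and `sample_destinations` are assumptions):

```python
# Hypothetical values standing in for unique_origins and unique_destinations.
sample_origins = ['Chicago', 'Newark', 'Delhi']
sample_destinations = ['Delhi', 'Mumbai']

# difference accepts any iterable, so the second list need not be a set
only_origins = set(sample_origins).difference(sample_destinations)
print(sorted(only_origins))
```

Sets are unordered, so sorting the result gives a stable order for display.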
In [None]:
# Get all the flights from the US that depart after a given date.
df.query('`Country of Origin` == "USA"')

In [None]:
import datetime

datetime.datetime.strptime('2-Sep-20', '%d-%b-%y')

In [None]:
df['Dep Date'] = pd.to_datetime(df['Dep Date'], format='%d-%b-%y')
df

In [None]:
df.query('`Country of Origin` == "USA" & `Dep Date`.dt.strftime("%Y-%m-%d") > "2020-09-03"')

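Once **Dep Date** has been converted with `pd.to_datetime`, the dates can also be compared directly instead of being formatted back into strings. The sketch below shows this alternative with boolean masks on made-up rows. Only the column names come from this notebook, and the `cutoff` value is an assumption.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the notebook.
sample_df = pd.DataFrame({
    'Country of Origin': ['USA', 'USA', 'UK'],
    'Origin': ['Chicago', 'Newark', 'London'],
    'Dep Date': ['2-Sep-20', '5-Sep-20', '6-Sep-20'],
})
sample_df['Dep Date'] = pd.to_datetime(sample_df['Dep Date'], format='%d-%b-%y')

cutoff = pd.Timestamp('2020-09-03')
from_us = sample_df['Country of Origin'] == 'USA'
after_cutoff = sample_df['Dep Date'] > cutoff

# Combine both conditions with & and keep the matching rows
result = sample_df[from_us & after_cutoff]
print(result)
```

Comparing datetime values avoids relying on the string format sorting in date order.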
## Exercise - Airport Traffic

Use [HTML File](https://raw.githubusercontent.com/dgadiraju/itversity-books/master/Data%20Engineering%20Bootcamp/30%20Basics%20of%20Programming%20using%20Python/11%20Exercise%20-%20Web%20Scraping%20-%20Airports%20Data.html) and get the data into the Data Frame with these fields. We need to unpivot and get the air traffic by year.

* IATA Code
* Major city served
* State
* Year
* Air Traffic

**Hint: You can use Pandas melt function to unpivot the data**

Output should contain 330 records.
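
As a pointer for the hint, here is a minimal sketch of how `melt` turns year columns into rows. The wide table is made up: the codes, years and numbers are assumptions and do not come from the exercise file.

```python
import pandas as pd

# Hypothetical wide table with one column per year.
wide_df = pd.DataFrame({
    'IATA Code': ['AAA', 'BBB'],
    '2018': [100, 200],
    '2019': [110, 210],
})

long_df = wide_df.melt(
    id_vars=['IATA Code'],     # columns to keep as they are
    var_name='Year',           # name for the column holding the old headers
    value_name='Air Traffic',  # name for the column holding the cell values
)
print(long_df)
```

Each combination of identifier and year becomes its own row, so 2 rows with 2 year columns turn into 4 rows.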