# Lab 3 (Due by 11:59 pm via Canvas/Gradescope)


Due: Tuesday Sep 26 @ 11:59 PM EST

### Submission Instructions
Submit this `ipynb` file to Gradescope (this can also be done via the assignment on Canvas). To ensure that your submitted `ipynb` file reflects your latest code, run a fresh `Kernel > Restart & Run All` just before uploading the file to Gradescope.

### Group Work

You are encouraged to work in groups for this Lab; however, each student should submit their own notebook file to Gradescope. Each Part of the Lab depends on the previous parts, so talking through the problem with your group should speed up both understanding and arriving at a solution.

In [2]:
# you will use the modules below in this lab
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import requests
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

# Part 1: Web Scraping Warm-Up (20 points)

Build a data frame `df_premier` that contains the names and stadiums of all the current English Premier League teams, based [on this website](https://www.premierleague.com/clubs), such that:

    df_premier.head()
    
yields an output:

| name                   | stadium                 |
|------------------------|-------------------------|
| Arsenal                | Emirates Stadium        |
| Aston Villa            | Villa Park              |
| AFC Bournemouth        | Vitality Stadium        |
| Brentford              | Gtech Community Stadium |
| Brighton & Hove Albion | Amex Stadium            |

Make sure you: 
- use BeautifulSoup
- print the `.head()` of the data frame when you are finished

**Hint:** you should need only two `class_` values to accomplish this.
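
As a rough illustration of the `class_` lookup pattern (not the solution), here is a minimal sketch on a made-up HTML snippet. The class names `team-name` and `team-ground` and the snippet itself are hypothetical; the real page uses different markup that you will need to inspect in your browser.

```python
from bs4 import BeautifulSoup
import pandas as pd

# toy HTML: the structure and class names are invented for illustration only
toy_html = """
<ul>
  <li><h2 class="team-name">Team A</h2><div class="team-ground">Ground A</div></li>
  <li><h2 class="team-name">Team B</h2><div class="team-ground">Ground B</div></li>
</ul>
"""

# parse with the parser that ships with Python
soup = BeautifulSoup(toy_html, "html.parser")

# one find_all per class_ value, collecting the text of each matching tag
names = [tag.text.strip() for tag in soup.find_all(class_="team-name")]
grounds = [tag.text.strip() for tag in soup.find_all(class_="team-ground")]

# combine the two parallel lists into a data frame
df_toy = pd.DataFrame({"name": names, "stadium": grounds})
print(df_toy.head())
```

The two lists line up only because every toy entry has both tags; on a real page it is worth checking that the lists have the same length before building the data frame.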

# Part 2: A Trickier Web Scraper

For this problem, we will (together) create a small data set scraped from [flightaware.com](https://flightaware.com/) which includes some details from the current flight schedule at Boston Logan Airport. You will build the first two parts of the data pipeline as functions (Parts 2.1 and 2.2) and then provide a detailed overview/description of the last two parts of the pipeline based on code I have written/provided (Parts 2.3 and 2.4). Please do not take these final two parts lightly!

## Part 2.1: The Scraper Function (20 points)

Complete the function `get_airport_html()` below (including docstring) which visits the url of a given US airport code and grabs the html. Visit [flightaware.com](https://flightaware.com/) and type in a few codes (e.g. BOS, JFK, LAX, RDU) and notice the pattern in the url so that you can pass any airport code to the function as a string. **Make sure to remove the `pass` statement when you are finished**. I have written the code you should run once the function is completed.
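
For reference, the general pattern of building a URL from a string and fetching its HTML might look like the sketch below. The function name `fetch_page`, the `base_url` placeholder, and the `get` parameter are all assumptions made for illustration; the actual URL pattern for the lab is the one you find by exploring the site.

```python
import requests


def fetch_page(code, base_url="https://example.com/airport", get=requests.get):
    """ builds a url from a code and returns the html text found there

    Args:
        code (str): short identifier appended to the url
        base_url (str): hypothetical url prefix (a placeholder, not a real site pattern)
        get (callable): function used to fetch the url (defaults to requests.get)

    Returns:
        html (str): the text of the response
    """
    # build the url with an f-string so any code can be passed in
    url = f"{base_url}/{code}"

    # fetch the page, giving up after 10 seconds
    response = get(url, timeout=10)

    # raise an error for 4xx/5xx responses rather than returning an error page
    response.raise_for_status()

    return response.text
```

Passing the fetch function in as `get` is only a convenience that lets the sketch be tried without a network connection; a plain call to `requests.get` inside the function works just as well.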

In [5]:
def get_airport_html(code):
    
    pass


In [7]:
# when you are done the following code should be run
url_text = get_airport_html('BOS')

## Part 2.2: The Soup Function (20 points)

Complete the function `get_airport_table_soup()` below (including docstring) which takes the html from the previous function and outputs one of four BeautifulSoup objects, depending on the board you are interested in, as defined by the `'id'` attribute:

    - `id='arrivals-board'`
    - `id='departures-board'`
    - `id='enroute-board'`
    - `id='scheduled-board'`
    
The function should take two arguments: the html object from `get_airport_html()` and a string that specifies the `id` you are interested in (by default, the arrivals board).
    
**Make sure to remove the `pass` statement when you are finished.** 
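
Selecting one section of a page by its `id` can be sketched on a toy example as follows. The ids `board-a` and `board-b`, the HTML snippet, and the helper name `pick_section` are hypothetical and chosen only to show the lookup and a default argument.

```python
from bs4 import BeautifulSoup

# toy HTML with two sections distinguished only by their id attribute
toy_html = """
<div id="board-a"><table><tr><td>row from board A</td></tr></table></div>
<div id="board-b"><table><tr><td>row from board B</td></tr></table></div>
"""


def pick_section(html, section_id="board-a"):
    """ parses html and returns the first tag whose id matches

    Args:
        html (str): raw html text
        section_id (str): value of the id attribute to look for

    Returns:
        section (Tag or None): the matching tag, or None if nothing matches
    """
    soup = BeautifulSoup(html, "html.parser")
    # find returns only the first match, or None when the id is absent
    return soup.find(id=section_id)


section = pick_section(toy_html, "board-b")
print(section.td.text)
```

Because `find` returns `None` when no tag has the requested `id`, a misspelled id shows up later as an `AttributeError` on `None`, which is a useful thing to check first when debugging.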

In [8]:
def get_airport_table_soup(html, board='arrivals-board'):
    
    pass


In [10]:
# when you are done the following code should be run (feel free to change the board if you wish)
board_choice = 'arrivals-board'
my_board_soup = get_airport_table_soup(url_text, board_choice)

## Part 2.3: Cleaning The Board (20 points)

Below is the function `clean_board_df()`, which takes the soup object from the previous function and creates a data frame with the following columns:

    - `flight number`: the flight number
    - `aircraft type`: the type of aircraft
    - `airport name`: the name of the originating/destination airport (depending on type of board)
    - `airport code`: the letter code of the originating/destination airport
    - `departure time`: the time of the flight's departure
    - `arrival time`: the time of the flight's arrival

I have written the function and (provided your function from Part 2.2 works) it should work. **DO NOT CHANGE ANYTHING IN THE BODY OF THE FUNCTION.**

**In a markdown cell** create a bullet point list where you explain what each chunk of code does. Your bullet point list should have **FOUR** bullet points/explanations corresponding to the four chunks below the `# EXPLAIN THIS (number)` comments. You do not have to be super detailed, but you must accurately summarize the intention of each code chunk. **Talking to your neighbors/group about this is highly recommended.**

In [11]:
def clean_board_df(soup):    
    """ takes the soup of a board and cleans it, creating a data frame

    Args:
        soup (soup): the soup from get_airport_table_soup

    Returns:
        clean_board_df (data frame): a data frame with six columns corresponding to
            flight number
            aircraft type
            airport name
            airport code
            departure time
            arrival time
    """
    
    # EXPLAIN THIS (1)
    names = soup.find_all('span', attrs = {'title':True})
    flight_number = []
    aircraft_type = []
    airport_name = []
    for idx in range(0, len(names), 3):
        flight_number.append(names[idx].text)
        aircraft_type.append(names[idx+1].text)
        airport_name.append(names[idx+2].text)

    # EXPLAIN THIS (2)
    codes = soup.find_all(attrs = {'dir': 'ltr'})
    airport_code = []
    for idx in range(0, len(codes), 2):
        airport_code.append(codes[idx+1].text.replace("(", "").replace(")", ""))

    # EXPLAIN THIS (3)
    times = soup.find_all(class_='tz')
    departure_time = []
    arrival_time = []
    for idx in range(0, len(times), 2):
        dep_split_string = times[idx].previous_sibling.split('\xa0')
        arr_split_string = times[idx+1].previous_sibling.split('\xa0')
        
        if dep_split_string[0].endswith('a') == True:
            dep_datetime_str = dep_split_string[0][:-1] + ' AM'
            dep_datetime_time = datetime.strptime(dep_datetime_str, '%I:%M %p').time()
            departure_time.append(dep_datetime_time)
        else:
            dep_datetime_str = dep_split_string[0][:-1] + ' PM'
            dep_datetime_time = datetime.strptime(dep_datetime_str, '%I:%M %p').time()
            departure_time.append(dep_datetime_time)
        
        if arr_split_string[0].endswith('a') == True:
            arr_datetime_str = arr_split_string[0][:-1] + ' AM'
            arr_datetime_time = datetime.strptime(arr_datetime_str, '%I:%M %p').time()
            arrival_time.append(arr_datetime_time)
        else:
            arr_datetime_str = arr_split_string[0][:-1] + ' PM'
            arr_datetime_time = datetime.strptime(arr_datetime_str, '%I:%M %p').time()
            arrival_time.append(arr_datetime_time)

    # EXPLAIN THIS (4)
    clean_board_dict = {'flight number': flight_number,
                        'aircraft type': aircraft_type,
                        'airport name': airport_name,
                        'airport code': airport_code,
                        'departure time': departure_time,
                        'arrival time': arrival_time}
    clean_board_df = pd.DataFrame.from_dict(clean_board_dict)
    
    return clean_board_df
    
clean_df = clean_board_df(my_board_soup)
clean_df.head()

|   | flight number | aircraft type | airport name | airport code | departure time | arrival time |
|---|---------------|---------------|--------------|--------------|----------------|--------------|
| 0 | RPA5652 | E75L | Pittsburgh Intl | PIT | 18:49:00 | 20:20:00 |
| 1 | UAL1779 | A320 | Newark Liberty Intl | EWR | 19:27:00 | 20:20:00 |
| 2 | AAL1318 | B738 | Chicago O'Hare Intl | ORD | 17:32:00 | 20:19:00 |
| 3 | AAL1484 | A321 | Charlotte/Douglas Intl | CLT | 18:32:00 | 20:19:00 |
| 4 | DAL1736 | B712 | Cincinnati/Northern Kentucky International Air... | CVG | 18:16:00 | 20:19:00 |


Your answer in this cell:
- Explain Code Chunk 1:
- Explain Code Chunk 2:
- Explain Code Chunk 3:
- Explain Code Chunk 4:

## Part 2.4: Grabbing More Data (20 points)

Below (already written for you) is the function `get_aircraft_info()`, which cycles through the different aircraft types in the data frame from the previous part and adds a column with a count of the aircraft of that type currently operating. **DO NOT CHANGE ANYTHING IN THE BODY OF THE FUNCTION.**

**In a markdown cell** explain why we were able to use `pd.read_html()` instead of `requests.get()` and comment on the values in the new column; is there something off about them?

**Hint:** you may want to take a look at an example url in your browser.

**Note:** I occasionally got an HTTPS error when running this; just wait a minute and try again, it should eventually work...

In [12]:
def get_aircraft_info(clean_df):
    """ takes a data frame of an aircraft board and adds a column with count of aircraft types

    Args:
        clean_df (data frame): the output of clean_board_df

    Returns:
        clean_df (data frame): the same data frame, but with an extra column
    """
 
    # get a list of aircraft types from the initial data frame
    aircraft_type = list(clean_df['aircraft type'])

    #initialize an empty list to count the number of each type
    num_type = []

    # loop through the different types
    for idx in range(len(aircraft_type)):

        # get the url for each type
        craft_url = f'https://flightaware.com/live/aircrafttype/{aircraft_type[idx]}'

        # grab the table from the url
        craft_tables = pd.read_html(craft_url)

        # add the info from the table to the list
        num_type.append(craft_tables[2].shape[0])

    # turn the list into a series and add it to the data frame
    clean_df['num type'] = pd.Series(num_type)

    # return the updated data frame
    return clean_df

final_df = get_aircraft_info(clean_df)

In [13]:
final_df.head()

|   | flight number | aircraft type | airport name | airport code | departure time | arrival time | num type |
|---|---------------|---------------|--------------|--------------|----------------|--------------|----------|
| 0 | RPA5652 | E75L | Pittsburgh Intl | PIT | 18:49:00 | 20:20:00 | 20 |
| 1 | UAL1779 | A320 | Newark Liberty Intl | EWR | 19:27:00 | 20:20:00 | 20 |
| 2 | AAL1318 | B738 | Chicago O'Hare Intl | ORD | 17:32:00 | 20:19:00 | 20 |
| 3 | AAL1484 | A321 | Charlotte/Douglas Intl | CLT | 18:32:00 | 20:19:00 | 20 |
| 4 | DAL1736 | B712 | Cincinnati/Northern Kentucky International Air... | CVG | 18:16:00 | 20:19:00 | 20 |


Your answer in this cell:
