In [1]:
import pandas as pd
from plotly import express as px
import plotly.graph_objects as go
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Airport Visualization using Plotly Express

Can we see any trends between the location of airports on a map and their ticket prices? To answer this question, we will use Plotly Express to visualize flights as well as provide useful visualizations for users of our website.

## Preprocessing

We found an airport codes CSV that contains the names, codes, and locations (longitude and latitude) of airports across the world. We will also read the web-scraped and combined ticket price data.

In [2]:
flights = pd.read_csv('flights.csv')
airports = pd.read_csv('airport-codes_csv.csv')

In [3]:
flights

Unnamed: 0,Price,Company Name,Stops,Duration,Destination,From,Date
0,254,American Airlines,nonstop,4h 38m,ATL,LAX,6/1/23
1,73,Spirit Airlines,1 stop,25h 28m,ATL,LAX,6/1/23
2,209,American Airlines,1 stop,6h 15m,ATL,LAX,6/1/23
3,159,United Airlines,1 stop,6h 55m,ATL,LAX,6/1/23
4,204,United Airlines,1 stop,6h 10m,ATL,LAX,6/1/23
...,...,...,...,...,...,...,...
158633,982,American Airlines,1 stop,21h 55m,SFO,LAX,8/31/23
158634,712,"Spirit Airlines, Sun Country Air",2 stops,31h 15m,SFO,LAX,8/31/23
158635,702,"Spirit Airlines, Sun Country Air",2 stops,32h 27m,SFO,LAX,8/31/23
158636,737,"Spirit Airlines, Sun Country Air",2 stops,32h 14m,SFO,LAX,8/31/23


In [4]:
airports

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"
...,...,...,...,...,...,...,...,...,...,...,...,...
57416,ZYYK,medium_airport,Yingkou Lanqi Airport,0.0,AS,CN,CN-21,Yingkou,ZYYK,YKH,,"122.3586, 40.542524"
57417,ZYYY,medium_airport,Shenyang Dongta Airport,,AS,CN,CN-21,Shenyang,ZYYY,,,"123.49600219726562, 41.784400939941406"
57418,ZZ-0001,heliport,Sealand Helipad,40.0,EU,GB,GB-ENG,Sealand,,,,"1.4825, 51.894444"
57419,ZZ-0002,small_airport,Glorioso Islands Airstrip,11.0,AF,TF,TF-U-A,Grande Glorieuse,,,,"47.296388888900005, -11.584277777799999"


As we can see, there is a lot of information contained in `airports` that we do not need, so we limit the rows to airports in our data set, and the columns to the airport name, code, and coordinates.

In [5]:
#only want destinations + LAX
destinations = set(flights['Destination'])
destinations.add('LAX')

airports = airports[['name', 'iata_code', 'coordinates']]
airports = airports.rename({'name' : 'Airport', 
                            'iata_code' : 'Destination'},
                           axis = 1)

In [6]:
# .copy() so that adding columns later does not trigger SettingWithCopyWarning
airports = airports[airports['Destination'].isin(destinations)].copy()

In [7]:
airports

Unnamed: 0,Airport,Destination,coordinates
27523,Hartsfield Jackson Atlanta International Airport,ATL,"-84.428101, 33.6367"
27952,Denver International Airport,DEN,"-104.672996521, 39.861698150635"
27957,Dallas Fort Worth International Airport,DFW,"-97.038002, 32.896801"
29073,John F Kennedy International Airport,JFK,"-73.7789, 40.639801"
29188,Los Angeles International Airport,LAX,"-118.407997, 33.942501"
29785,Chicago O'Hare International Airport,ORD,"-87.9048, 41.9786"
31332,San Francisco International Airport,SFO,"-122.375, 37.61899948120117"
40449,Daniel K Inouye International Airport,HNL,"-157.924228, 21.32062"


We now extract the longitude and latitude from the `coordinates` column.

In [9]:
airports['longitude'] = airports['coordinates'].str.split(', ').str[0].astype(float)
airports['latitude'] = airports['coordinates'].str.split(', ').str[1].astype(float)

In [10]:
airports

Unnamed: 0,Airport,Destination,coordinates,longitude,latitude
27523,Hartsfield Jackson Atlanta International Airport,ATL,"-84.428101, 33.6367",-84.428101,33.6367
27952,Denver International Airport,DEN,"-104.672996521, 39.861698150635",-104.672997,39.861698
27957,Dallas Fort Worth International Airport,DFW,"-97.038002, 32.896801",-97.038002,32.896801
29073,John F Kennedy International Airport,JFK,"-73.7789, 40.639801",-73.7789,40.639801
29188,Los Angeles International Airport,LAX,"-118.407997, 33.942501",-118.407997,33.942501
29785,Chicago O'Hare International Airport,ORD,"-87.9048, 41.9786",-87.9048,41.9786
31332,San Francisco International Airport,SFO,"-122.375, 37.61899948120117",-122.375,37.618999
40449,Daniel K Inouye International Airport,HNL,"-157.924228, 21.32062",-157.924228,21.32062
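
The extraction above splits the `coordinates` string twice. As a minimal sketch of an alternative, the string can be split once with `expand=True` and then sanity-checked. The frame `toy` is a hypothetical stand-in that reuses the ATL and LAX values shown above.

```python
import pandas as pd

# Hypothetical stand-in for `airports`, using the "longitude, latitude"
# string layout of the `coordinates` column (ATL and LAX values from above).
toy = pd.DataFrame({
    'Destination': ['ATL', 'LAX'],
    'coordinates': ['-84.428101, 33.6367', '-118.407997, 33.942501'],
})

# Split once into two columns (0 = longitude, 1 = latitude), then cast to float.
parts = toy['coordinates'].str.split(', ', expand=True).astype(float)
toy['longitude'] = parts[0]
toy['latitude'] = parts[1]

# Sanity check: valid longitudes lie in [-180, 180], latitudes in [-90, 90].
assert toy['longitude'].between(-180, 180).all()
assert toy['latitude'].between(-90, 90).all()
```

The range check is a cheap way to catch a swapped column order, since this file stores longitude first.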


For later use, we will store the longitude and latitude of LAX and then remove it from `airports`.

In [11]:
#for later use
LAX_lon = airports[airports['Destination'] == 'LAX']['longitude'].iloc[0]
LAX_lat = airports[airports['Destination'] == 'LAX']['latitude'].iloc[0]

#remove LAX
airports = airports[airports['Destination'] != 'LAX']

## Data Visualization

What type of metrics can we examine in our ticket price data? Some options include standard deviation and mean.

As we can see below, there are notable differences between some airports. The `std` and `mean` of HNL are higher than those of the other airports, which we might be able to attribute to the physical distance of the flight.

In [12]:
# string names avoid the pandas FutureWarning raised for np.std / np.mean
flights.groupby('Destination')['Price'].aggregate(['std', 'mean'])

Destination,std,mean
ATL,198.79232,430.719912
DEN,144.175585,345.503717
DFW,143.797519,302.896042
HNL,329.260149,545.536184
JFK,294.188653,494.337262
ORD,166.402091,417.926679
SFO,173.866652,410.857652
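
To make the distance explanation easier to check, here is a rough sketch that computes the great-circle distance from LAX to each destination with the haversine formula. The coordinates and mean prices are copied (rounded) from the tables above. The helper `haversine_km` and the Earth radius constant are our own assumptions and are not part of the original analysis.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# (longitude, latitude) pairs from the airports table above
LAX = (-118.407997, 33.942501)
coords = {
    'ATL': (-84.428101, 33.6367),
    'DEN': (-104.672997, 39.861698),
    'DFW': (-97.038002, 32.896801),
    'JFK': (-73.7789, 40.639801),
    'ORD': (-87.9048, 41.9786),
    'SFO': (-122.375, 37.618999),
    'HNL': (-157.924228, 21.32062),
}
# mean prices from the groupby output above, rounded to two decimals
mean_price = {'ATL': 430.72, 'DEN': 345.50, 'DFW': 302.90, 'HNL': 545.54,
              'JFK': 494.34, 'ORD': 417.93, 'SFO': 410.86}

distance_km = {code: haversine_km(*LAX, lon, lat)
               for code, (lon, lat) in coords.items()}

# List destinations from nearest to farthest next to their mean price.
for code in sorted(distance_km, key=distance_km.get):
    print(f"{code}: {distance_km[code]:6.0f} km, mean ${mean_price[code]:.2f}")
```

HNL and JFK come out as the two farthest destinations, which is consistent with the distance explanation, although seven airports are too few to draw a firm conclusion.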


To better understand these metrics, we define a function that plots any given metric on a map visualization.

In [23]:
def visualize_airports(airports, flights, metric):
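    '''
    Plots each destination airport on a map, colored by `metric` applied to
    `flights['Price']` grouped by destination, with lines drawn to LAX.
    Relies on the global variables `LAX_lon` and `LAX_lat` defined above.
    '''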
    
    lons = []
    lats = []
    
    #to create lines, create list of lines from each airport to LAX
    for lon in airports['longitude']:
        lons.append(lon)
        lons.append(LAX_lon)
    for lat in airports['latitude']:
        lats.append(lat)
        lats.append(LAX_lat)
    
    #colors airports based on metric
    colors = flights.groupby('Destination')['Price'].aggregate([metric])
    metric_name = colors.columns[0]
    colors = airports.join(colors, on = 'Destination')
    colors = list(colors[metric_name])
    
    #plots each destination
    fig1 = px.scatter_mapbox(airports,
                             lat = 'latitude',
                             lon = 'longitude',
                             color = colors,
                             hover_data = {'latitude' : False,
                                           'longitude' : False},
                             labels = {'color' : metric_name},
                             hover_name = 'Airport')
    #plots LAX airport
    fig2 = px.scatter_mapbox(lon = [LAX_lon],
                             lat = [LAX_lat],
                             hover_name = ['Los Angeles International Airport']).add_traces(fig1.data)
    #plots lines from LAX to each destination
    fig3 = px.line_mapbox(airports,
                          lat = lats,
                          lon = lons,
                          color_discrete_sequence = ['red'],
                          zoom = 2,
                          mapbox_style = 'carto-positron').add_traces(fig2.data)
    
    fig3.update_traces(marker={'size' : 20,
                               'opacity' : .5})

    return fig3
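
The two loops at the top of the function build one continuous path that returns to LAX after every destination, which is what makes the lines fan out from a single hub. A minimal sketch of that interleaving follows. The helper name `interleave_with_hub` and the coordinates are made up for illustration.

```python
def interleave_with_hub(values, hub_value):
    """Return [v0, hub, v1, hub, ...] so the path revisits the hub after each stop."""
    path = []
    for value in values:
        path.extend([value, hub_value])
    return path

# Made-up longitudes; -118.4 stands in for LAX_lon.
lons = interleave_with_hub([-84.4, -104.7, -97.0], -118.4)
print(lons)
```

Because latitudes and longitudes are interleaved in the same order, the resulting lists always have equal length (twice the number of destinations).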

## Findings

Looking at the standard deviation, we see the highest variation in HNL (about 329) and JFK (about 294). These are also the two destinations farthest from LAX, so the variation is probably related to geographical distance.

In [24]:
# 'std' is the sample standard deviation (ddof=1), matching the table above
visualize_airports(airports, flights, metric = 'std')

We see similar results for the mean. The most expensive flights are to Hawaii (HNL) and the East Coast (JFK), and the cheapest are to DFW and DEN in the central United States.

In [26]:
visualize_airports(airports, flights, metric = np.mean)

What about certain types of flights? If we examine nonstop flights, we see that the average price is generally lower across airports.

In [27]:
nonstop = flights[flights['Stops'] == 'nonstop']
visualize_airports(airports, nonstop, metric = np.mean)

Finally, we look at 3-stop flights. HNL has a shocking mean of about $1103, which helps explain the high `std` seen earlier. This most likely means that the data set contains a few very expensive 3-stop tickets to Hawaii.

In [28]:
three_stops = flights[flights['Stops'] == '3 stops']
visualize_airports(airports, three_stops, metric = np.mean)
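
Rather than filtering one `Stops` value at a time, the same comparison can be tabulated in one step. The sketch below uses a hypothetical frame, `toy_flights`, with made-up prices and the same column names as `flights`. It also counts the tickets in each group, which is one way to check whether a high mean rests on only a few rows.

```python
import pandas as pd

# Hypothetical rows with the same columns as `flights`; prices are made up.
toy_flights = pd.DataFrame({
    'Destination': ['ATL', 'ATL', 'ATL', 'HNL', 'HNL', 'HNL'],
    'Stops': ['nonstop', '1 stop', '1 stop', 'nonstop', '1 stop', '3 stops'],
    'Price': [200, 150, 250, 300, 400, 1100],
})

# Mean price for every (Destination, Stops) combination.
by_stops = toy_flights.pivot_table(index='Destination', columns='Stops',
                                   values='Price', aggfunc='mean')

# Number of tickets behind each mean; combinations with no rows show up as NaN.
counts = toy_flights.pivot_table(index='Destination', columns='Stops',
                                 values='Price', aggfunc='count')

print(by_stops)
print(counts)
```

Running the same two calls on `flights` should show how many 3-stop HNL tickets contribute to the mean reported above.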