### Open Flights Data Wrangling

To practice, you are going to wrangle data from OpenFlights.  You can read about it here: 

http://openflights.org/data.html

This includes three main files: one for airports, one for airlines, and one for routes.  They can be merged or joined on the appropriate fields.  I have modified the files slightly to include a header row in the .dat files, which makes it a bit easier for you.  

You are required to work through the problems below.  This may take some time.  Be persistent, and ask questions or seek help as needed.  

In [1]:
import pandas as pd
import numpy as np

In [2]:
# These files use \N as a missing value indicator.  When reading the CSVs, we tell
# pandas to treat that value as missing (NA).  The double backslash is required because
# Python would otherwise read \N in a normal string as the start of a Unicode escape sequence.

# Read in the airports data.
airports = pd.read_csv("data/airports.dat", header=None, na_values='\\N')
airports.columns = ["id", "name", "city", "country", "iata", "icao", "latitude", "longitude", "altitude","timezone", "dst", "tz", "type", "source"]

# Read in the airlines data.
airlines = pd.read_csv("data/airlines.dat", header=None, na_values='\\N')
airlines.columns = ["id", "name", "alias", "iata", "icao", "callsign", "country", "active"]

# Read in the routes data.
routes = pd.read_csv("data/routes.dat", header=None, na_values='\\N')
routes.columns = ["airline", "airline_id", "source", "source_id", "dest", "dest_id", "codeshare", "stops", "equipment"]
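To see what the `na_values` argument does, here is a minimal sketch on two hand-made rows rather than the real files.  The rows and the four column names are assumptions made up for illustration; only the `\N` convention comes from the data.

```python
import io
import pandas as pd

# Two hand-made rows in the style of airports.dat; the second uses \N for the IATA code.
raw = '1,"Goroka Airport","Goroka",GKA\n2,"Example Field","Nowhere",\\N\n'

demo = pd.read_csv(io.StringIO(raw), header=None, na_values="\\N")
demo.columns = ["id", "name", "city", "iata"]

# The \N entry is now a missing value rather than the literal text.
print(demo["iata"].isna().tolist())
```

A raw string, `r"\N"`, is another way to write the same two characters without doubling the backslash.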

1) Start by seeing what's in the data.  What columns are there?  What data types are the columns?  

Remember, 'object' means the column holds strings, while numerical values can be floats or ints.  Sometimes numeric data gets read in as strings, which causes problems later.  If that happens, you can use the .astype() method to convert it.  Look it up in the pandas API documentation for more details.
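As a rough sketch of the conversion, the toy frame below has a numeric column that arrived as text.  The values are hypothetical and are not read from the OpenFlights files.

```python
import pandas as pd

# Toy frame with hypothetical values: 'altitude' has arrived as text, not numbers.
df = pd.DataFrame({"name": ["A", "B"], "altitude": ["979", "645"]})
print(df.dtypes)

# Convert the text column to integers.
df["altitude"] = df["altitude"].astype(int)
print(df.dtypes)

# If some entries cannot be parsed, pd.to_numeric with errors="coerce" turns them into NaN.
mixed = pd.to_numeric(pd.Series(["1", "x"]), errors="coerce")
print(mixed.isna().tolist())
```

Calling `.dtypes` on each of the three dataframes is a quick way to answer the data type question for every column at once.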

In [3]:
print(airports)

         id                                         name                city  \
0         1                               Goroka Airport              Goroka   
1         2                               Madang Airport              Madang   
2         3                 Mount Hagen Kagamuga Airport         Mount Hagen   
3         4                               Nadzab Airport              Nadzab   
4         5  Port Moresby Jacksons International Airport        Port Moresby   
5         6                  Wewak International Airport               Wewak   
6         7                           Narsarsuaq Airport        Narssarssuaq   
7         8                      Godthaab / Nuuk Airport            Godthaab   
8         9                        Kangerlussuaq Airport         Sondrestrom   
9        10                               Thule Air Base               Thule   
10       11                             Akureyri Airport            Akureyri   
11       12                          Egi

In [4]:
print(airports.columns)

Index(['id', 'name', 'city', 'country', 'iata', 'icao', 'latitude',
       'longitude', 'altitude', 'timezone', 'dst', 'tz', 'type', 'source'],
      dtype='object')


In [5]:
print(airlines)

         id                                          name  \
0        -1                                       Unknown   
1         1                                Private flight   
2         2                                   135 Airways   
3         3                                 1Time Airline   
4         4  2 Sqn No 1 Elementary Flying Training School   
5         5                               213 Flight Unit   
6         6                 223 Flight Unit State Airline   
7         7                             224th Flight Unit   
8         8                                   247 Jet Ltd   
9         9                                   3D Aviation   
10       10                                   40-Mile Air   
11       11                                        4D Air   
12       12                        611897 Alberta Limited   
13       13                              Ansett Australia   
14       14                          Abacus International   
15       15             

In [6]:
print(airlines.columns)

Index(['id', 'name', 'alias', 'iata', 'icao', 'callsign', 'country', 'active'], dtype='object')


In [7]:
print(routes)

      airline  airline_id source  source_id dest  dest_id codeshare  stops  \
0          2B       410.0    AER     2965.0  KZN   2990.0       NaN      0   
1          2B       410.0    ASF     2966.0  KZN   2990.0       NaN      0   
2          2B       410.0    ASF     2966.0  MRV   2962.0       NaN      0   
3          2B       410.0    CEK     2968.0  KZN   2990.0       NaN      0   
4          2B       410.0    CEK     2968.0  OVB   4078.0       NaN      0   
5          2B       410.0    DME     4029.0  KZN   2990.0       NaN      0   
6          2B       410.0    DME     4029.0  NBC   6969.0       NaN      0   
7          2B       410.0    DME     4029.0  TGK      NaN       NaN      0   
8          2B       410.0    DME     4029.0  UUA   6160.0       NaN      0   
9          2B       410.0    EGO     6156.0  KGD   2952.0       NaN      0   
10         2B       410.0    EGO     6156.0  KZN   2990.0       NaN      0   
11         2B       410.0    GYD     2922.0  NBC   6969.0       

In [8]:
print(routes.columns)

Index(['airline', 'airline_id', 'source', 'source_id', 'dest', 'dest_id',
       'codeshare', 'stops', 'equipment'],
      dtype='object')


2) Select just the routes that go to or from Lexington Bluegrass Airport, and store them in their own dataframe.  

The airport code is LEX.  You should have a much smaller dataframe.  How many inbound routes and how many outbound routes are there? 

In [9]:
routes1 = routes[routes['source'] == 'LEX']
routes1

       airline  airline_id  source  source_id  dest  dest_id  codeshare  stops    equipment
 3588       9E      3976.0     LEX     4017.0   ATL   3682.0        NaN      0          CRJ
 5763       AA        24.0     LEX     4017.0   CLT   3876.0          Y      0      CR7 CRJ
 5764       AA        24.0     LEX     4017.0   DFW   3670.0          Y      0      ERD ER4
 5765       AA        24.0     LEX     4017.0   ORD   3830.0          Y      0      ERD ER4
 9641       AF       137.0     LEX     4017.0   ATL   3682.0          Y      0      CRJ CR9
21095       DL      2009.0     LEX     4017.0   ATL   3682.0        NaN      0      M88 717
21096       DL      2009.0     LEX     4017.0   DCA   3520.0          Y      0          CRJ
21097       DL      2009.0     LEX     4017.0   DTW   3645.0          Y      0  CR7 CRJ CR9
21098       DL      2009.0     LEX     4017.0   LGA   3697.0        NaN      0          ERJ
21099       DL      2009.0     LEX     4017.0   MSP   3858.0          Y      0          CRJ


In [10]:
routes2 = routes[routes['dest'] == 'LEX']
routes2

       airline  airline_id  source  source_id  dest  dest_id  codeshare  stops        equipment
 3569       9E      3976.0     ATL     3682.0   LEX   4017.0        NaN      0              CRJ
 4953       AA        24.0     CLT     3876.0   LEX   4017.0        NaN      0              CR7
 5247       AA        24.0     DFW     3670.0   LEX   4017.0          Y      0          ERD ER4
 6283       AA        24.0     ORD     3830.0   LEX   4017.0          Y      0          ERD ER4
 9097       AF       137.0     ATL     3682.0   LEX   4017.0          Y      0  CR9 M88 CRJ 717
20164       DL      2009.0     ATL     3682.0   LEX   4017.0        NaN      0          M88 717
20534       DL      2009.0     DCA     3520.0   LEX   4017.0          Y      0              CRJ
20638       DL      2009.0     DTW     3645.0   LEX   4017.0        NaN      0              717
21131       DL      2009.0     LGA     3697.0   LEX   4017.0          Y      0              ERJ
21402       DL      2009.0     MSP     3858.0   LEX   4017.0          Y      0              CRJ
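
Question 2 also asks for the counts and for a single dataframe holding both directions.  One possible sketch, using a handful of rows copied from the outputs above as a stand-in for the full `routes` table (the name `routes_demo` is made up for this example):

```python
import pandas as pd

# A few rows copied from the outputs above, standing in for the full routes table.
routes_demo = pd.DataFrame({
    "airline": ["9E", "AA", "DL", "9E", "AA"],
    "source":  ["LEX", "LEX", "LEX", "ATL", "CLT"],
    "dest":    ["ATL", "CLT", "MSP", "LEX", "LEX"],
})

outbound = routes_demo[routes_demo["source"] == "LEX"]
inbound = routes_demo[routes_demo["dest"] == "LEX"]

# Combine both conditions with | to keep routes in either direction.
lex = routes_demo[(routes_demo["source"] == "LEX") | (routes_demo["dest"] == "LEX")]

print(len(outbound), len(inbound), len(lex))
```

On the real data, the same three lines applied to `routes` give the outbound count, the inbound count, and the combined Lexington dataframe.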


3) Now let's look at which airlines operate in and out of Lexington.  To do this, you need to merge the airline dataframe to the route dataframe.  

How many routes does each airline have?  The value_counts() method may be useful for answering this question.  
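
In the routes table the airline key is `airline_id`, while `id` in the airlines table is the airline's own identifier, so one way to join them is with `left_on` and `right_on`.  The sketch below assumes that pairing and uses small toy tables; the airline names are typed in by hand for illustration, not read from airlines.dat.

```python
import pandas as pd

# Toy airlines table; names are typed by hand for illustration.
airlines_demo = pd.DataFrame({
    "id": [24, 137, 2009],
    "name": ["American Airlines", "Air France", "Delta Air Lines"],
})

# Toy routes table using airline codes and ids that appear in the outputs above.
routes_demo = pd.DataFrame({
    "airline": ["AA", "AA", "AA", "AF", "DL", "DL"],
    "airline_id": [24.0, 24.0, 24.0, 137.0, 2009.0, 2009.0],
    "source": ["LEX"] * 6,
    "dest": ["CLT", "DFW", "ORD", "ATL", "ATL", "DCA"],
})

# Keep every route (how="left") and attach the matching airline row.
merged = routes_demo.merge(airlines_demo, left_on="airline_id", right_on="id", how="left")

# Count routes per airline name.
counts = merged["name"].value_counts()
print(counts)
```

A left merge from the routes side keeps one row per route, which is what you want before counting.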

In [33]:
merge1 = pd.merge(airlines, routes1, on='id',  how='right')
merge1

Unnamed: 0,id,name_x,alias,iata_x,icao_x,callsign,country_x,active,airline,airline_id,...,iata_y,icao_y,latitude,longitude,altitude,timezone,dst,tz,type,source_y
0,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,9E,3976.0,...,LEX,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports
1,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,AA,24.0,...,LEX,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports
2,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,AA,24.0,...,LEX,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports
3,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,AA,24.0,...,LEX,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports
4,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,AF,137.0,...,LEX,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports
5,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,DL,2009.0,...,LEX,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports
6,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,DL,2009.0,...,LEX,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports
7,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,DL,2009.0,...,LEX,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports
8,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,DL,2009.0,...,LEX,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports
9,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,DL,2009.0,...,LEX,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports


In [38]:
routes2 = pd.merge(routes2, airports, left_on='source_id', right_on='id', how='left')

In [39]:
merge2 = pd.merge(airlines, routes2, on='id',  how='right')
merge2

Unnamed: 0,id,name_x,alias,iata_x,icao_x,callsign,country_x,active,airline,airline_id,...,iata_y,icao_y,latitude,longitude,altitude,timezone,dst,tz,type,source_y
0,3520.0,Missionair,,,MSN,MISIONAIR,Spain,N,DL,2009.0,...,DCA,KDCA,38.8521,-77.037697,15,-5.0,A,America/New_York,airport,OurAirports
1,3533.0,Monarch Airlines,,,MNH,MONARCH AIR,United States,N,G4,35.0,...,FLL,KFLL,26.072599,-80.152702,9,-5.0,A,America/New_York,airport,OurAirports
2,3550.0,Mountain Air Company,,N4,MTC,MOUNTAIN LEONE,Sierra Leone,N,UA,5209.0,...,IAH,KIAH,29.9844,-95.3414,97,-6.0,A,America/Chicago,airport,OurAirports
3,3617.0,Naturelink Charter,,,NRK,NATURELINK,South Africa,N,G4,35.0,...,PIE,KPIE,27.9102,-82.687401,11,-5.0,A,America/New_York,airport,OurAirports
4,3645.0,New Heights 291,,,NHT,NEWHEIGHTS,South Africa,N,DL,2009.0,...,DTW,KDTW,42.212399,-83.353401,645,-5.0,A,America/New_York,airport,OurAirports
5,3670.0,Search and Rescue 22,,,SRD,,United Kingdom,N,AA,24.0,...,DFW,KDFW,32.896801,-97.038002,607,-6.0,A,America/Chicago,airport,OurAirports
6,3670.0,Search and Rescue 22,,,SRD,,United Kingdom,N,US,5265.0,...,DFW,KDFW,32.896801,-97.038002,607,-6.0,A,America/Chicago,airport,OurAirports
7,3682.0,Nordstree (Australia),,,NDS,,Australia,N,9E,3976.0,...,ATL,KATL,33.6367,-84.428101,1026,-5.0,A,America/New_York,airport,OurAirports
8,3682.0,Nordstree (Australia),,,NDS,,Australia,N,AF,137.0,...,ATL,KATL,33.6367,-84.428101,1026,-5.0,A,America/New_York,airport,OurAirports
9,3682.0,Nordstree (Australia),,,NDS,,Australia,N,DL,2009.0,...,ATL,KATL,33.6367,-84.428101,1026,-5.0,A,America/New_York,airport,OurAirports


4) It looks like there are some international airlines with Lexington routes.  To look at how many routes they have, create a new column in your dataframe called 'International', which is set to Y for an overseas airline and N for a domestic airline.  Calculate the percent of routes with an overseas airline.  
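
A minimal sketch of the flag and the percentage, on a toy frame.  The column name `country` and its values are assumptions for this example; in your merged dataframe the airline's country column may carry a suffix such as `_x`.

```python
import numpy as np
import pandas as pd

# Toy merged frame; the 'country' values are typed by hand for illustration.
demo = pd.DataFrame({
    "airline": ["AA", "DL", "AF", "DL"],
    "country": ["United States", "United States", "France", "United States"],
})

# 'N' for a United States airline, 'Y' for anything else.
demo["International"] = np.where(demo["country"] == "United States", "N", "Y")

# The mean of a boolean column is the fraction of True values.
pct_international = (demo["International"] == "Y").mean() * 100
print(pct_international)
```

Note that a row with a missing country would be labelled 'Y' by this rule, so it is worth checking for NaN values first.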

In [41]:
merge1['International'] = np.where(merge1['country_x'] == 'United States', 'N', 'Y')
merge1

Unnamed: 0,id,name_x,alias,iata_x,icao_x,callsign,country_x,active,airline,airline_id,...,icao_y,latitude,longitude,altitude,timezone,dst,tz,type,source_y,International
0,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,9E,3976.0,...,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports,Y
1,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,AA,24.0,...,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports,Y
2,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,AA,24.0,...,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports,Y
3,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,AA,24.0,...,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports,Y
4,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,AF,137.0,...,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports,Y
5,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,DL,2009.0,...,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports,Y
6,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,DL,2009.0,...,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports,Y
7,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,DL,2009.0,...,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports,Y
8,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,DL,2009.0,...,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports,Y
9,4017,Pont International Airline Services,,,PSI,PONT,Suriname,N,DL,2009.0,...,KLEX,38.036499,-84.605904,979,-5.0,A,America/New_York,airport,OurAirports,Y


In [42]:
merge2['International'] = np.where(merge2['country_x'] == 'United States', 'N', 'Y')
merge2

Unnamed: 0,id,name_x,alias,iata_x,icao_x,callsign,country_x,active,airline,airline_id,...,icao_y,latitude,longitude,altitude,timezone,dst,tz,type,source_y,International
0,3520.0,Missionair,,,MSN,MISIONAIR,Spain,N,DL,2009.0,...,KDCA,38.8521,-77.037697,15,-5.0,A,America/New_York,airport,OurAirports,Y
1,3533.0,Monarch Airlines,,,MNH,MONARCH AIR,United States,N,G4,35.0,...,KFLL,26.072599,-80.152702,9,-5.0,A,America/New_York,airport,OurAirports,N
2,3550.0,Mountain Air Company,,N4,MTC,MOUNTAIN LEONE,Sierra Leone,N,UA,5209.0,...,KIAH,29.9844,-95.3414,97,-6.0,A,America/Chicago,airport,OurAirports,Y
3,3617.0,Naturelink Charter,,,NRK,NATURELINK,South Africa,N,G4,35.0,...,KPIE,27.9102,-82.687401,11,-5.0,A,America/New_York,airport,OurAirports,Y
4,3645.0,New Heights 291,,,NHT,NEWHEIGHTS,South Africa,N,DL,2009.0,...,KDTW,42.212399,-83.353401,645,-5.0,A,America/New_York,airport,OurAirports,Y
5,3670.0,Search and Rescue 22,,,SRD,,United Kingdom,N,AA,24.0,...,KDFW,32.896801,-97.038002,607,-6.0,A,America/Chicago,airport,OurAirports,Y
6,3670.0,Search and Rescue 22,,,SRD,,United Kingdom,N,US,5265.0,...,KDFW,32.896801,-97.038002,607,-6.0,A,America/Chicago,airport,OurAirports,Y
7,3682.0,Nordstree (Australia),,,NDS,,Australia,N,9E,3976.0,...,KATL,33.6367,-84.428101,1026,-5.0,A,America/New_York,airport,OurAirports,Y
8,3682.0,Nordstree (Australia),,,NDS,,Australia,N,AF,137.0,...,KATL,33.6367,-84.428101,1026,-5.0,A,America/New_York,airport,OurAirports,Y
9,3682.0,Nordstree (Australia),,,NDS,,Australia,N,DL,2009.0,...,KATL,33.6367,-84.428101,1026,-5.0,A,America/New_York,airport,OurAirports,Y


5) Actually, it looks like a bunch of these routes are codeshares.  That means they are marketed by this airline, but operated by a different airline.  See the note in the data documentation on openflights.org/data.  The implication of this is that there are duplicates.

Can you figure out which ones are duplicates?  Can you then create a dataframe with only the unique routes?  How many unique inbound and outbound routes are there? 

Remember, someone has to operate the flight, so if all the routes to/from a particular airport are listed as codeshares, then something is funny...

It is also possible that more than one airline actually operates a route between the same two airports. (Having this sort of competition generally means that you will get better fares as a traveler.)  It may not be obvious what is actually in the data set, so dig or do external research as needed.  
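
As a starting point, here are two possible ways to look at duplicates, sketched on five outbound rows copied from the In [9] output above.  Neither is necessarily the right answer: which rows count as "the same route" is an assumption you will need to justify.

```python
import pandas as pd

# Outbound rows copied from the In [9] output above (a subset of columns).
out = pd.DataFrame({
    "airline":   ["9E", "AA", "AF", "DL", "DL"],
    "source":    ["LEX", "LEX", "LEX", "LEX", "LEX"],
    "dest":      ["ATL", "CLT", "ATL", "ATL", "DCA"],
    "codeshare": [None, "Y", "Y", None, "Y"],
})

# Approach A: one row per airport pair, regardless of airline.
unique_pairs = out.drop_duplicates(subset=["source", "dest"])

# Approach B: keep only rows that are not flagged as codeshares.
operated = out[out["codeshare"].isna()]

# Destinations that appear only as codeshares, so no operating row is listed.
only_codeshare = set(unique_pairs["dest"]) - set(operated["dest"])

print(len(unique_pairs), len(operated), sorted(only_codeshare))
```

In this small sample, CLT and DCA show up only as codeshares, which is the "something is funny" case: dropping every codeshare row would lose those routes entirely.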