For this problem, you will be working with flight data from the Bureau of Transportation Statistics. The full data are fairly large for prototyping (~700 MB / 7 million flights), so two truncated files are provided at 'data/2007-005.csv' and 'data/2008-005.csv'. Important: it is strongly recommended that you prototype and develop your code using the truncated data.

Design note: The code you develop for this problem should be general enough to work with more than two years' worth of data, i.e. when there are files for years other than 2007 and 2008. Additionally, all reference outputs are presented for the half-percent annual flight sample (small files).

A1. (4 points) To start, complete the function so that it takes a year as an input argument and loads that year's data into a pandas dataframe. The load should drop every row that has a null in any of these columns: "Year", "Month", "DayofMonth", "DepTime", "Origin", and "Dest", and then return the result.

In [1]:
# A1:Function(4/4)

import pandas as pd

def read_data(year):
    
    filename = "data/" + str(year) + ("-005.csv" if True else ".csv")
    
    flight_yr = pd.read_csv(filename)
    
    flight_yr.dropna(subset=['Year', 'Month', 'DayofMonth', 'DepTime', 'Origin', 'Dest'],inplace=True)
  
    return flight_yr

In [2]:
flight_07 = read_data(2007)
flight_08 = read_data(2008)

In [3]:
flight_07.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,2007,1,2,2,1121.0,1120,1218.0,1240,WN,1482,...,7,10,0,,0,0,0,0,0,0
1,2007,1,2,2,1647.0,1640,1754.0,1750,WN,1663,...,4,9,0,,0,0,0,0,0,0
2,2007,1,2,2,950.0,950,1046.0,1050,WN,389,...,3,8,0,,0,0,0,0,0,0
3,2007,1,2,2,1520.0,1520,1626.0,1645,WN,1781,...,2,5,0,,0,0,0,0,0,0
4,2007,1,2,2,1046.0,1035,1058.0,1050,WN,1073,...,2,13,0,,0,0,0,0,0,0


In [4]:
flight_08.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,2008,1,3,4,1439.0,1425,1720.0,1720,WN,1582,...,7.0,9.0,0,,0,,,,,
1,2008,1,3,4,744.0,740,838.0,840,WN,2111,...,3.0,5.0,0,,0,,,,,
2,2008,1,3,4,1319.0,950,1615.0,1240,WN,3846,...,4.0,25.0,0,,0,200.0,0.0,6.0,0.0,9.0
3,2008,1,3,4,1138.0,1000,1235.0,1105,WN,29,...,3.0,8.0,0,,0,86.0,0.0,0.0,0.0,4.0
4,2008,1,3,4,1100.0,1040,1414.0,1410,WN,1847,...,4.0,6.0,0,,0,,,,,


Check to make sure that the specific columns listed do not contain nulls

In [5]:
erase = ['Year', 'Month', 'DayofMonth', 'DepTime', 'Origin', 'Dest']

In [6]:
for i in erase:
    print(i,":", flight_07[i].isnull().sum())

Year : 0
Month : 0
DayofMonth : 0
DepTime : 0
Origin : 0
Dest : 0


In [7]:
for i in erase:
    print(i,":", flight_08[i].isnull().sum())

Year : 0
Month : 0
DayofMonth : 0
DepTime : 0
Origin : 0
Dest : 0
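
The two loops above can also be written as a single expression per dataframe. The following is a minimal sketch on a small made-up frame; the `toy` data and the `null_counts` helper are assumptions for illustration and are not part of the assignment. It selects the required columns and counts the nulls in each of them with one call.

```python
import pandas as pd

REQUIRED = ['Year', 'Month', 'DayofMonth', 'DepTime', 'Origin', 'Dest']

def null_counts(df, columns=REQUIRED):
    """Return the number of nulls in each of the given columns."""
    return df[columns].isnull().sum()

# Made-up rows for illustration only: row 1 lacks DepTime, row 2 lacks Dest.
toy = pd.DataFrame({
    'Year': [2007, 2007, 2007],
    'Month': [1, 1, 2],
    'DayofMonth': [2, 3, 4],
    'DepTime': [1121.0, None, 950.0],
    'Origin': ['PHL', 'LAX', 'PHL'],
    'Dest': ['LAX', 'PHL', None],
})

before = null_counts(toy)                          # counts on the raw frame
after = null_counts(toy.dropna(subset=REQUIRED))   # counts after dropping rows

print(before)
print(after)
```

In the notebook, the same helper could be applied to `flight_07` and `flight_08`; a result of all zeros corresponds to the output shown above.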


A2. (7 points) Next, complete the new data-loading function so that it adds a column, keyed as "DepartureDate", containing datetime objects that hold the departure dates of the flights.

In [8]:
from datetime import date, datetime, timedelta

In [9]:
def create_dep_datetime(row):
    
    #---your code starts here---
    
    # Cast to int so that datetime() also accepts float-typed or NumPy values
    year = int(row['Year'])
    month = int(row['Month'])
    day = int(row['DayofMonth'])

    #---your code stops here---
    
    return datetime(year, month, day)

Use this function to create 'DepartureDate'

In [10]:
# A2:Function(4/7)

def read_data_parsetimes(year):
    depart = []
    
    filename = "data/" + str(year) + ("-005.csv" if True else ".csv")
    
    flight_info = pd.read_csv(filename)
    flight_info.dropna(subset=['Year', 'Month', 'DayofMonth', 'DepTime', 'Origin', 'Dest'],inplace=True)
  
    
    for i in range(len(flight_info)):
        depart.append(create_dep_datetime(flight_info.iloc[i]))
        
    flight_info['DepartureDate'] = depart
    
    return flight_info

In [11]:
read_data_parsetimes(2008).head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,DepartureDate
0,2008,1,3,4,1439.0,1425,1720.0,1720,WN,1582,...,9.0,0,,0,,,,,,2008-01-03
1,2008,1,3,4,744.0,740,838.0,840,WN,2111,...,5.0,0,,0,,,,,,2008-01-03
2,2008,1,3,4,1319.0,950,1615.0,1240,WN,3846,...,25.0,0,,0,200.0,0.0,6.0,0.0,9.0,2008-01-03
3,2008,1,3,4,1138.0,1000,1235.0,1105,WN,29,...,8.0,0,,0,86.0,0.0,0.0,0.0,4.0,2008-01-03
4,2008,1,3,4,1100.0,1040,1414.0,1410,WN,1847,...,6.0,0,,0,,,,,,2008-01-03


In [12]:
read_data_parsetimes(2007).head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,DepartureDate
0,2007,1,2,2,1121.0,1120,1218.0,1240,WN,1482,...,10,0,,0,0,0,0,0,0,2007-01-02
1,2007,1,2,2,1647.0,1640,1754.0,1750,WN,1663,...,9,0,,0,0,0,0,0,0,2007-01-02
2,2007,1,2,2,950.0,950,1046.0,1050,WN,389,...,8,0,,0,0,0,0,0,0,2007-01-02
3,2007,1,2,2,1520.0,1520,1626.0,1645,WN,1781,...,5,0,,0,0,0,0,0,0,2007-01-02
4,2007,1,2,2,1046.0,1035,1058.0,1050,WN,1073,...,13,0,,0,0,0,0,0,0,2007-01-02
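
The row-by-row loop above works on the truncated files but may be slow on the full ~7-million-flight data. One possible alternative, sketched here under the assumption that the three date columns hold valid integers, is to let `pd.to_datetime` assemble the dates from a frame whose columns are named `year`, `month`, and `day`. The `add_departure_date` helper and the `toy` frame are hypothetical and not part of the assignment.

```python
import pandas as pd

def add_departure_date(df):
    """Return a copy of df with a vectorized 'DepartureDate' column."""
    # pd.to_datetime can assemble dates from columns named year/month/day.
    parts = df[['Year', 'Month', 'DayofMonth']].rename(
        columns={'Year': 'year', 'Month': 'month', 'DayofMonth': 'day'}
    )
    out = df.copy()
    out['DepartureDate'] = pd.to_datetime(parts)
    return out

# Made-up rows for illustration only.
toy = pd.DataFrame({'Year': [2008, 2008], 'Month': [1, 2], 'DayofMonth': [3, 29]})
result = add_departure_date(toy)
print(result)
```

This produces pandas `Timestamp` values in a `datetime64` column rather than calling `create_dep_datetime` per row, so it is a substitute technique, not the approach the exercise asks for.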


A3. (5 points) Now complete the updated function, which must also take an airport code as an input argument. It should return a dataframe of the flights that originated from that airport in the specified year.

In [13]:
def read_data_parsetimes_byorigin(airport, year):
    
    depart = []
    
    filename = "data/" + str(year) + ("-005.csv" if True else ".csv")
    
    flight_info = pd.read_csv(filename)
    flight_info.dropna(subset=['Year', 'Month', 'DayofMonth', 'DepTime', 'Origin', 'Dest'],inplace=True)
  
    
    for i in range(len(flight_info)):
        depart.append(create_dep_datetime(flight_info.iloc[i]))
        
    flight_info['DepartureDate'] = depart
    
    flight_info = flight_info.loc[(flight_info['Origin']==airport) & (flight_info['Year']==year)]
    
    return flight_info

In [14]:
read_data_parsetimes_byorigin("PHL", 2008).head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,DepartureDate
8,2008,1,3,4,1308.0,1310,1521.0,1550,WN,1319,...,8.0,0,,0,,,,,,2008-01-03
36,2008,1,5,6,1553.0,1545,1814.0,1825,WN,455,...,9.0,0,,0,,,,,,2008-01-05
53,2008,1,6,7,2005.0,2010,2120.0,2155,WN,384,...,10.0,0,,0,,,,,,2008-01-06
81,2008,1,8,2,2006.0,2010,2124.0,2155,WN,384,...,11.0,0,,0,,,,,,2008-01-08
134,2008,1,11,5,743.0,745,856.0,900,WN,487,...,26.0,0,,0,,,,,,2008-01-11


A4. (4 points) Using this function, create dataframes holding the flight data for Philadelphia International Airport (PHL) for 2007 and 2008. Then use the .groupby() method to obtain the busiest month of the year for both years. Did this change from 2007 to 2008?

In [15]:
month_07, count_07 = int(), int()

# try using .groupby() and .size() to create a count for months
# and then finding the month with highest number of flights.
#---your code starts here---

phl_2007 = read_data_parsetimes_byorigin("PHL", 2007)

month_count = phl_2007.groupby('Month').size()
#print(month_count) give the list of months with their counts

month_07, count_07 = month_count.idxmax(),  max(month_count)


month_07, count_07

(10, 55)

In [16]:
phl_2008 = read_data_parsetimes_byorigin("PHL", 2008)

month_count = phl_2008.groupby('Month').size()
#print(month_count)

month_08, count_08 = month_count.idxmax(),  max(month_count)


month_08, count_08

(6, 49)

In [17]:
# A4:Inline(1/4)

# Was the busiest month the same month for both years, 2007 and 2008?
# Answer one of: "Yes" or "No"
print("No")

No
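
The counting step used for both years can be checked in isolation. The sketch below runs the same `.groupby()`, `.size()`, and `.idxmax()` sequence on a made-up `toy` frame (an assumption for illustration only), so the expected answer is easy to verify by eye.

```python
import pandas as pd

# Made-up months for eight flights; month 10 appears most often.
toy = pd.DataFrame({'Month': [1, 10, 10, 6, 10, 6, 1, 12]})

month_count = toy.groupby('Month').size()   # Series of counts indexed by month
busiest_month = month_count.idxmax()        # index label with the largest count
busiest_count = month_count.max()           # the largest count itself

print(busiest_month, busiest_count)
```

Note that `idxmax()` reports only the first label when several months share the maximum, which is worth keeping in mind with small samples.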


The helper below converts a (month, day, year) triple into a datetime object; it is used in the final step of the larger function.

In [18]:
def get_dates(data):

    month = data[0]
    day = data[1]
    year = data[2]

    return datetime(year, month, day)

A5. (8 points) Finally, complete the updated function, which takes two integer triples of the form (month, day, year), representing the start and end dates of a range (potentially more than two years long). The function must now return all flights originating from the specified airport within this range of time. If the range spans more than two years, it should load data from all necessary files (i.e., assume more than two year-files exist) and return a single dataframe containing all the data within the specified range of time.

FINAL FUNCTION

In [19]:
def read_data_parsetimes_byorigin_daterange(airport, start, end):
    depart = []
    years = []

    for i in range(start[-1], end[-1]+1): #grabs the years only from the start and end date
        years.append(i)

    flight_data = pd.DataFrame() #initialize dataframe to concatenate later

    for year in years: #loop through the files that contain the year(s) provided in start/end dates
        
        filename = "data/" + str(year) + ("-005.csv" if True else ".csv")
        
        flight_data_for_year = pd.read_csv(filename)
                
        flight_data = pd.concat([flight_data, flight_data_for_year])
        
        flight_data.dropna(subset=['Year', 'Month', 'DayofMonth', 'DepTime', 'Origin', 'Dest'],inplace=True)
        
        
    for i in range(len(flight_data)):
        depart.append(create_dep_datetime(flight_data.iloc[i]))
        
    flight_data['DepartureDate'] = depart
    
    #use the get_dates function to convert the date tuples to datetime objects
    
    start = get_dates(start)
    end = get_dates(end)
    
    #last, filter out the flights between the start and end date that left from the specified airport
    
    flight_range = flight_data.loc[(flight_data['DepartureDate'] >= start) & (flight_data['DepartureDate'] <= end)\
                                  & (flight_data['Origin']==airport)]
    
    
    return flight_range

In [20]:
read_data_parsetimes_byorigin_daterange("LAX",(1,17,2007),(2,3,2008))

#The rows returned fall within the date interval provided

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,DepartureDate
221,2007,1,17,3,1831.0,1830,1940.0,1945,WN,2383,...,13.0,0,,0,0.0,0.0,0.0,0.0,0.0,2007-01-17
222,2007,1,17,3,2129.0,2130,2228.0,2235,WN,367,...,8.0,0,,0,0.0,0.0,0.0,0.0,0.0,2007-01-17
254,2007,1,19,5,758.0,800,917.0,915,WN,2662,...,19.0,0,,0,0.0,0.0,0.0,0.0,0.0,2007-01-19
270,2007,1,20,6,1832.0,1825,1939.0,1925,WN,1031,...,8.0,0,,0,0.0,0.0,0.0,0.0,0.0,2007-01-20
287,2007,1,21,7,915.0,915,1427.0,1425,WN,258,...,11.0,0,,0,0.0,0.0,0.0,0.0,0.0,2007-01-21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3931,2008,2,2,6,1601.0,1600,1817.0,1820,OO,6528,...,14.0,0,,0,,,,,,2008-02-02
4289,2008,2,1,5,1957.0,2000,2120.0,2121,UA,1174,...,15.0,0,,0,,,,,,2008-02-01
4498,2008,2,1,5,1001.0,1000,1112.0,1115,WN,1306,...,6.0,0,,0,,,,,,2008-02-01
4499,2008,2,1,5,1750.0,1630,1906.0,1755,WN,97,...,9.0,0,,0,12.0,0.0,0.0,0.0,59.0,2008-02-01
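
The structure of the function above (load each year in the range, combine, then filter by date and origin) can be sketched without touching any files. In the sketch below, `flights_in_range`, `load_year`, and `fake_load_year` are hypothetical names introduced for illustration; `load_year` stands in for any callable that maps a year to a dataframe that already has a "DepartureDate" column, a role that `read_data_parsetimes` from A2 could plausibly play in the notebook.

```python
import pandas as pd
from datetime import datetime

def flights_in_range(airport, start, end, load_year):
    """Sketch: combine per-year frames, then filter by origin and date.

    start and end are (month, day, year) triples, as in the exercise.
    load_year is a hypothetical callable mapping a year to a dataframe
    that already has a 'DepartureDate' column.
    """
    start_dt = datetime(start[2], start[0], start[1])
    end_dt = datetime(end[2], end[0], end[1])

    # Build one frame per year in the range and concatenate them once.
    frames = [load_year(year) for year in range(start[2], end[2] + 1)]
    flights = pd.concat(frames, ignore_index=True)

    # between() is inclusive on both ends by default.
    in_range = flights['DepartureDate'].between(start_dt, end_dt)
    return flights[in_range & (flights['Origin'] == airport)]

def fake_load_year(year):
    # Made-up stand-in for reading a yearly CSV file.
    return pd.DataFrame({
        'Origin': ['PHL', 'LAX', 'PHL'],
        'DepartureDate': pd.to_datetime(
            [f'{year}-01-15', f'{year}-06-01', f'{year}-12-20']
        ),
    })

result = flights_in_range('PHL', (6, 1, 2006), (1, 31, 2008), fake_load_year)
print(result)
```

Collecting the yearly frames in a list and concatenating once avoids growing a dataframe inside the loop, and the range here deliberately spans three years to exercise the more-than-two-files case.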


Using this function, get all the flight data for flights from PHL for 2007–2008. Then, create a daily count of flights over all of the days in the two years and report the busiest day over the two years.

In [21]:
# A5:Inline(2/8)

busy_day, busy_count = int(), int() #initialize values

all_phl_07_08 = read_data_parsetimes_byorigin_daterange("PHL",(1,1,2007),(12,31,2008))

#group flights by their departure date, grab the counts, and sort from highest to lowest
sorted_flight_dts = all_phl_07_08.groupby('DepartureDate').size().sort_values(ascending=False)

#grab the element at the top of the list
busy_day, busy_count = sorted_flight_dts.index[0], sorted_flight_dts.values[0] 

busy_day, busy_count

(Timestamp('2007-04-23 00:00:00'), 5)
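
With the half-percent sample the top daily count is small (5 here), so more than one day could share it; whether that happens in this data is not shown above. The following sketch, on a made-up `toy` frame that is an assumption for illustration only, shows one way to list every day tied for the maximum instead of reporting a single one.

```python
import pandas as pd

# Made-up departure dates; two days tie for the highest count.
toy = pd.DataFrame({
    'DepartureDate': pd.to_datetime([
        '2007-04-23', '2007-04-23', '2007-04-23',
        '2008-07-04', '2008-07-04', '2008-07-04',
        '2008-01-01',
    ])
})

daily = toy.groupby('DepartureDate').size()           # flights per day
top_count = daily.max()                               # highest daily count
tied_days = daily[daily == top_count].index.tolist()  # every day with that count

print(top_count, tied_days)
```

Applied to `all_phl_07_08`, the same three lines would report all days that reach the maximum count, if there are several.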