# Python data manipulation exercise - Airline Performance Analysis

## Problem

Airlines run on tight schedules and very thin operating margins. On top of this, customers are demanding and expect their flights to always be on time. Carriers try to meet this challenge through detailed planning and prompt execution. However, factors such as weather, technical glitches, and unexpected problems at airports still delay the arrival or departure of aircraft. Since flights hop between multiple cities, a delay at one nodal city adds up and, if not dealt with early, causes significant delays at subsequent stops.

## What is expected from you?
You have been provided with 3 datasets : 
1. airline-performance.csv
2. air-carrier-details.csv
3. airports.csv

Below is a set of questions related to the problem statement at hand. Use your Python data wrangling skills to solve them.

<a id="home"></a>
## Quick reference to solutions : 
* [Solution : Question-1](#q1)
* [Solution : Question-2](#q2)
* [Solution : Question-3](#q3)
* [Solution : Question-4](#q4)
* [Solution : Question-5](#q5)
* [Solution : Question-6](#q6)
* [Solution : Question-7](#q7)
* [Solution : Question-8](#q8)
* [Solution : Question-9](#q9)
* [Solution : Question-10](#q10)

In [None]:
file_path = '../../problem-sets/Set-1/data'

In [None]:
# Let us read the datasets first
import pandas as pd
ar_pf = pd.read_csv(file_path+'/airline-performance.csv')
ar_cr = pd.read_csv(file_path+'/air-carrier-details.csv')
arp = pd.read_csv(file_path+'/airports.csv')

In [None]:
# let's read the columns of airline-performance dataset
ar_pf.columns

<a id="q1"></a>
[Go Back](#home)
## Question : 1

* Which airline carrier is busiest in terms of volume?

Let's analyze the question. We need to find the name of the airline carrier.
* Busiest in terms of volume means the carrier with the highest count of trips
* Each row of the dataset airline-performance corresponds to one trip
* This dataset identifies the carrier through the column **UniqueCarrierCode**, which refers to the dataset air-carrier-details, where the name of the carrier is stored in the column **Description**

In [None]:
# Let's first look at sample data in the dataframe ar_cr
ar_cr.head()

In [None]:
# Looking at the data above, the value required to join with the dataset ar_pf is part of 
# the Description column
# We need to extract the letters after the : symbol, which are then used to join with ar_pf on the 
# column UniqueCarrierCode
# n=1 splits at the first ':' only, so the result always has exactly two columns

ar_cr[['airline_name','airline_code']] = ar_cr['Description'].str.split(':',n=1,expand = True)

In [None]:
ar_cr.head()

Have a look at the reference [here](https://www.geeksforgeeks.org/split-a-text-column-into-two-columns-in-pandas-dataframe/) to understand the above code

In [None]:
# Just for hygiene purposes we strip off all possible whitespaces from the code
ar_cr['airline_code'] = ar_cr['airline_code'].str.strip()

In [None]:
# Let's now look how the dataframe looks like :
ar_cr.head()
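
The split-and-strip step above can be tried on a tiny stand-in table. This is only a sketch: the carrier names and codes below are made up, and `rsplit` is swapped in as an alternative for the case where a carrier name itself contains a `:`.

```python
import pandas as pd

# Toy stand-in for air-carrier-details.csv; every value here is an illustrative assumption
carriers = pd.DataFrame({
    "Description": [
        "Alpha Air Inc.: AA1",
        "Beta Airways: Cargo Division: BT2",  # the name itself contains a ':'
        "Gamma Lines:  GM3 ",                 # stray whitespace around the code
    ]
})

# rsplit with n=1 splits once, starting from the right, so the code is always the last piece
parts = carriers["Description"].str.rsplit(":", n=1, expand=True)
carriers["airline_name"] = parts[0].str.strip()
carriers["airline_code"] = parts[1].str.strip()

print(carriers[["airline_name", "airline_code"]])
```

Whether `split` or `rsplit` is the better choice depends on where extra colons can appear in the real Description column, so it is worth checking the data first.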

In [None]:
# Now all that is left is joining with the ar_pf dataframe on the UniqueCarrierCode column
# We do this using the merge function of pandas
# Why a left join? It keeps every trip in ar_pf, even when its carrier code has no match in ar_cr
dfq1 = pd.merge(ar_pf,ar_cr,left_on='UniqueCarrierCode',right_on='airline_code',how='left')

In [None]:
# Ok, merge is done. Let's see what the first row of the new dataframe looks like
dfq1.head(1)

In [None]:
# Just a check to see which all columns have missing values (we just did a left join above)
dfq1.info()
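
To see what the left join protects against, here is a minimal sketch on made-up data. The frames `flights` and `lookup` are hypothetical stand-ins for `ar_pf` and `ar_cr`; `indicator=True` is a standard `pd.merge` option that adds a `_merge` column telling where each row came from.

```python
import pandas as pd

# Hypothetical stand-ins for ar_pf and ar_cr
flights = pd.DataFrame({"UniqueCarrierCode": ["AA1", "AA1", "ZZ9"],
                        "AirTime": [60, 75, 90]})
lookup = pd.DataFrame({"airline_code": ["AA1", "BT2"],
                       "airline_name": ["Alpha Air", "Beta Airways"]})

# Left join: every trip survives, unmatched carriers get NaN in airline_name
left = pd.merge(flights, lookup, left_on="UniqueCarrierCode",
                right_on="airline_code", how="left", indicator=True)

# Inner join: trips with an unknown carrier code are silently dropped
inner = pd.merge(flights, lookup, left_on="UniqueCarrierCode",
                 right_on="airline_code", how="inner")

unmatched = left[left["_merge"] == "left_only"]
print(len(flights), len(left), len(inner))
print(unmatched[["UniqueCarrierCode", "airline_name"]])
```

With a left join the trip count stays equal to the number of rows in the trips table, which is what a volume question needs.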

In [None]:
# Let's replace missing values in the airline_name column with the string 'NA'
dfq1['airline_name'] = dfq1['airline_name'].fillna('NA')
# Why assign the result back?
# Because fillna returns a new object and leaves the original data untouched unless inplace = True is passed. 
# Assigning back is preferred over inplace = True on a single column, which recent pandas versions warn about
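
The behaviour of `fillna` is easy to check on a small series. The sketch below uses made-up names and only shows that the call returns a new object until the result is assigned back.

```python
import pandas as pd

names = pd.Series(["Alpha Air", None, "Beta Airways"])

# fillna returns a new Series; `names` itself is unchanged at this point
filled = names.fillna("NA")
missing_before = int(names.isna().sum())

# Assigning the result back is what actually updates the variable
names = names.fillna("NA")
missing_after = int(names.isna().sum())

print(missing_before, missing_after)
```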

In [None]:
# Now, we can group by airline_name
# take count of rows using the size function
# sort the data by the count
# and slice out the top row
dfq1.groupby('airline_name').size().sort_values(ascending=False)[:1]

In [None]:
# Let's store the name and value of the top-most airline separately
trip_counts = dfq1.groupby('airline_name').size().sort_values(ascending=False)
airline_name = trip_counts.index[0]
value = trip_counts.values[0]

In [None]:
# Let's print the final result in a readable format
print("Airline %s is the busiest in terms of volume having trip count = %d"%(airline_name,value))

<a id="q2"></a>
[Go Back](#home)
## Question : 2

* Which city is busiest in terms of traffic?
    * Please note that you need to print out the city names against the traffic number and not just the airport name
    * Ex. an airport name entry looks like : **New York, NY: John F Kennedy International**. The city name from this entry is **New York** 

In [None]:
# Step 1 : Take count of trips by source city = This gives the departure count by city
df_dep = dfq1.groupby('OriginCode').size()
print(type(df_dep))
df_dep[:3]

In [None]:
# Step 2 : Repeat the above by destination city = This gives the arrival count by city
df_arr = dfq1.groupby('DestCode').size()
print(type(df_arr))
df_arr[:3]

In [None]:
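# To calculate total traffic, total arrivals and departures need to be added up 
# Let's create a new series which is the sum of the departure series and the arrival series
# fill_value=0 keeps airports that appear in only one of the two series (their total would otherwise be NaN)
df_tot = df_dep.add(df_arr, fill_value=0)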
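
Adding two series aligns them on their index, which matters when an airport shows up only as an origin or only as a destination. A small sketch with made-up airport codes:

```python
import pandas as pd

# Made-up counts: BBB only has departures, CCC only has arrivals
dep = pd.Series({"AAA": 5, "BBB": 2})
arr = pd.Series({"AAA": 4, "CCC": 7})

naive = dep.add(arr)               # labels missing on either side become NaN
safe = dep.add(arr, fill_value=0)  # a missing side is treated as 0

print(naive)
print(safe)
```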

In [None]:
# Now, let's create a dataframe which includes the 3 series : departures, arrivals and totals
# reset_index at the end turns the airport code index into a regular column and numbers the rows from 0
final_df = pd.concat([df_dep,df_arr,df_tot], axis=1).reset_index()

In [None]:
final_df.columns = ['City_Code','#Departures','#Arrivals','Total Traffic']

In [None]:
# Let's look at first few rows of the dataframe
final_df.head()

In [None]:
# join with arp dataframe to fetch names of cities
arp.head()

In [None]:
#again splitting the description column as done before
arp[['city','airport_name']] = arp['Description'].str.split(':',expand = True)

In [None]:
arp.head()

In [None]:
# split the city column further to fetch the city name
# please note the argument n=1 below. It means split only at the "first" occurrence of ","
arp[['city_name','state_code']] = arp['city'].str.split(',',n=1,expand = True)

In [None]:
arp.head()
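
The two splits above can also be done in one pass with a regular expression. This is an alternative sketch, not the approach used in this notebook: it relies on `Series.str.extract` with named groups and assumes every description follows the `City, ST: Airport name` layout from the question. Rows that do not match come back as NaN.

```python
import pandas as pd

# The first entry is the example from the question; the others are illustrative
airports = pd.DataFrame({
    "Description": [
        "New York, NY: John F Kennedy International",
        "Chicago, IL: Chicago O'Hare International",
        "Unknown Field",  # no comma or colon, so it will not match
    ]
})

pattern = r"^\s*(?P<city_name>[^,]+),\s*(?P<state_code>[^:]+):\s*(?P<airport_name>.+)$"
parts = airports["Description"].str.extract(pattern)

# Strip stray whitespace, mirroring the hygiene step used elsewhere in the notebook
parts = parts.apply(lambda col: col.str.strip())
print(parts)
```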

In [None]:
# get rid of columns which are not required
arp.drop(['Description','city','airport_name','state_code'],axis = 1,inplace=True)

In [None]:
arp.head()

In [None]:
# As a hygiene step, we strip any leading or trailing whitespace from city_name
arp['city_name'] = arp['city_name'].str.strip()

In [None]:
# Now all that is left is joining final_df with the arp dataframe on the City_Code column
dfq2 = pd.merge(final_df,arp,left_on='City_Code',right_on='Code',how='left')

In [None]:
dfq2.head()

In [None]:
# We sort the final dataframe to ensure the city with highest traffic comes at the top
dfq2.sort_values(by='Total Traffic',ascending=False)[:1]

In [None]:
# Let's store the name and value of the busiest city separately
# iloc[0] picks the first row of the sorted dataframe; the values are then read by column name
top_city = dfq2.sort_values(by='Total Traffic',ascending=False).iloc[0]
city_name = top_city['city_name']
value = top_city['Total Traffic']

In [None]:
# Let's print the final result in a readable format
print("%s City is the busiest in terms of traffic with value = %d"%(city_name,value))
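
One thing to keep in mind: the counts above are per airport code, and a city can have more than one airport (New York has both JFK and LaGuardia). If the question is read strictly as "per city", the airport rows can be summed by city name. A sketch on made-up traffic numbers, using the same column names as `dfq2`:

```python
import pandas as pd

# Stand-in for dfq2; the traffic numbers are invented for illustration
airport_traffic = pd.DataFrame({
    "City_Code": ["JFK", "LGA", "ORD"],
    "Total Traffic": [100, 80, 150],
    "city_name": ["New York", "New York", "Chicago"],
})

# Sum airport-level traffic per city, then sort so the busiest city comes first
by_city = (airport_traffic.groupby("city_name")["Total Traffic"]
           .sum()
           .sort_values(ascending=False))

print(by_city)
```

Here the busiest single airport is ORD, yet the busiest city is New York once its two airports are combined.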

<a id="q3"></a>
[Go Back](#home)
## Question : 3

* Which carrier has got the highest air time?

In [None]:
# Sum the air time per carrier and sort in descending order
dfq3 = dfq1.groupby('airline_name')['AirTime'].sum().sort_values(ascending=False)

In [None]:
dfq3.head()

In [None]:
# Let's store the name and value of the top-most airline separately
# dfq3 is a sorted series : index[0] is the carrier name and values[0] is its total air time
carrier_name = dfq3.index[0]
value = dfq3.values[0]

In [None]:
# Let's print the final result in a readable format
print("Airline %s has got the highest airtime with value as %d minutes"%(carrier_name,value))

<a id="q4"></a>
[Go Back](#home)
## Question : 4

* List the top 5 cities which are busiest in terms of average flights handled per day in the month of June

In [None]:
# let's first find out how many inbound or outbound flights are there for every city for each day of the month
# Let's look for all departures first, and then arrivals
dep_jun = ar_pf[ar_pf['Month'] == 6].groupby(['OriginCode','DayofMonth']).size()
arr_jun = ar_pf[ar_pf['Month'] == 6].groupby(['DestCode','DayofMonth']).size()

In [None]:
# let's convert both series obtained above to pandas dataframes
dep_jun = pd.DataFrame(dep_jun)
arr_jun = pd.DataFrame(arr_jun)

In [None]:
# let's give a meaningful name to columns in both dataframes
dep_jun.columns =['#Departures']
arr_jun.columns =['#Arrivals']

In [None]:
dep_jun.head()

In [None]:
# Since both dataframes have hierarchical index, in order to perform join between them, the hierarchical indices
# need to be converted to columns first
# We use the reset_index function for this purpose
dep_jun.reset_index(inplace=True)
arr_jun.reset_index(inplace=True)

In [None]:
# We are good to join the 2 dataframes now
final = pd.merge(dep_jun,arr_jun,left_on=['OriginCode','DayofMonth']
                 ,right_on=['DestCode','DayofMonth']
                 ,how='outer')

In [None]:
# let's set all null values in the #Departures and #Arrivals columns to 0
final['#Departures'] = final['#Departures'].fillna(0)
final['#Arrivals'] = final['#Arrivals'].fillna(0)

In [None]:
# Now let's calculate the total traffic column
final['Total_Traffic'] = final['#Departures'] + final['#Arrivals']

In [None]:
# Since either the origin code or the destination code can be blank in a row, let's coalesce the two columns
# to obtain the new column City
final['City'] = final['OriginCode'].combine_first(final['DestCode'])

In [None]:
final.head()

In [None]:
import numpy as np

In [None]:
# Finally, we'll group by city and calculate the total traffic and the number of distinct days on which flights occurred
# Note the double brackets : selecting several columns from a groupby needs a list
result = final.groupby('City')[['Total_Traffic','DayofMonth']].agg({'Total_Traffic':'sum'
                                    ,'DayofMonth':'nunique'})

In [None]:
# let's calculate the average as asked in the question
result['Average'] = result['Total_Traffic']/result['DayofMonth']

In [None]:
# sort the final result by Average in descending order and slice to fetch the top 5
result.sort_values(by='Average',ascending=False)[:5]

In [None]:
# Only thing left out is to join the City column with dfq2 to fetch exact city names
# Join the dataframe result with dfq2 dataframe (as shown above)
# created in the previous question to fetch the city names
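
The remaining join can be sketched as follows. The frames below are small made-up stand-ins: `result` mimics the grouped frame built above (indexed by airport code under the name `City`), and `arp_lookup` is a hypothetical name for the cleaned airports table with the columns `Code` and `city_name`.

```python
import pandas as pd

# Stand-in for the grouped `result` frame; the numbers are invented
result = pd.DataFrame(
    {"Total_Traffic": [600, 300, 90], "DayofMonth": [30, 30, 30]},
    index=pd.Index(["AAA", "BBB", "CCC"], name="City"),
)
result["Average"] = result["Total_Traffic"] / result["DayofMonth"]

# Stand-in for the cleaned airports table (hypothetical city names)
arp_lookup = pd.DataFrame({"Code": ["AAA", "BBB", "CCC"],
                           "city_name": ["Alphaville", "Betatown", "Gammaburg"]})

# Take the top 5 by average, move the index into a column, then fetch the city names
top = (result.sort_values(by="Average", ascending=False)
       .head(5)
       .reset_index()
       .merge(arp_lookup, left_on="City", right_on="Code", how="left"))

print(top[["city_name", "Average"]])
```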

<a id="q5"></a>
[Go Back](#home)
## Question : 5
* Which day of the week is busiest in terms of traffic?
* Please note : Total traffic = Total #arrivals + Total #departures

In [None]:
# type your code below


<a id="q6"></a>
[Go Back](#home)
## Question : 6

* Are weekends busier than weekdays?

In [None]:
# type your code below
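
A hedged sketch of one way to compare weekends with weekdays. Since there are five weekdays and only two weekend days, comparing raw totals would be unfair, so the sketch compares the average number of flights per day type. The `DayOfWeek` column (6 and 7 meaning Saturday and Sunday) and the flight counts are assumptions.

```python
import pandas as pd

# Made-up flight counts per day of week
daily = pd.DataFrame({
    "DayOfWeek": [1, 2, 3, 4, 5, 6, 7],
    "flights":   [100, 110, 105, 120, 130, 80, 90],
})

# Label each day, then average the counts within each label
daily["day_type"] = daily["DayOfWeek"].isin([6, 7]).map({True: "weekend", False: "weekday"})
avg_flights = daily.groupby("day_type")["flights"].mean()

weekends_busier = avg_flights["weekend"] > avg_flights["weekday"]
print(avg_flights)
print("Weekends busier than weekdays?", bool(weekends_busier))
```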


<a id="q7"></a>
[Go Back](#home)
## Question : 7

* Longer duration flights have a larger tendency to have arrival delays. Analyze the hypothesis with appropriate visuals

In [None]:
# type your code below
import matplotlib.pyplot as plt

In [None]:
plt.scatter(dfq1.AirTime, dfq1.ArrDelay)
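
A scatter plot of many trips can be hard to read, so two numeric summaries may help alongside it: the correlation between the two columns, and the mean arrival delay per air-time band. The sketch below uses made-up values for `AirTime` and `ArrDelay`; the band edges are arbitrary assumptions.

```python
import pandas as pd

# Made-up trips (minutes)
flights = pd.DataFrame({
    "AirTime":  [30, 45, 60, 90, 120, 150, 200, 240, 300, 360],
    "ArrDelay": [5, -2, 10, 0, 15, 8, 20, 12, 30, 25],
})

# Pearson correlation: values near 0 suggest no linear relationship
corr = flights["AirTime"].corr(flights["ArrDelay"])

# Mean delay per duration band; the bin edges are an arbitrary choice
bins = [0, 60, 120, 240, 480]
flights["duration_band"] = pd.cut(flights["AirTime"], bins=bins)
band_means = flights.groupby("duration_band", observed=True)["ArrDelay"].mean()

print(round(corr, 2))
print(band_means)
```

On the real data, `band_means` can be drawn as a bar chart to make the trend, or the lack of one, visible at a glance.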

<a id="q8"></a>
[Go Back](#home)
## Question : 8

* Consider the below bucketing logic for "Actual departure Time" :
    * Any flight departing between 4am - 12pm : Morning flight
    * Any flight departing between 12pm - 4pm : Afternoon flight
    * Any flight departing between 4pm - 9pm : Evening flight
    * Any flight departing between 9pm - 4am : Night flight
* Based on the above logic, answer the below questions : 
    * Which time of the day observes highest departure delays?
    * Create a pivot table with time of day in rows and type of delay in columns and #of occurrences in values
    * Which type of delay is most frequent in evening flights?
    * Which airport sees the highest occurrences of security related delay in the morning?

In [None]:
# type your code below


<a id="q9"></a>
[Go Back](#home)
## Question : 9

* Consider the term "delay" as :
    * Any flight arriving more than 15 min later than the expected arrival time is considered "arrival delay"
    * Any flight departing more than 15 min later than the expected departure time is considered "departure delay"
    * A flight is considered delayed when any one of the above conditions is true
    * Based on the above, answer the below questions :
        * Which airline carriers have caused the highest % of delays?
        * Which airports are facing the highest % of delayed flights?

In [None]:
# type your code below


<a id="q10"></a>
[Go Back](#home)
## Question : 10

* List down top 10 cities in terms of total traffic between June and September
* Please note : Total traffic = Total #arrivals + Total #departures

In [None]:
# type your code below
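
A compact sketch for this question on made-up data. It filters on the `Month` column, counts departures and arrivals per airport code with `value_counts`, and adds the two. Mapping the codes to city names would follow the same join used in Question 2.

```python
import pandas as pd

# Made-up trips
flights = pd.DataFrame({
    "Month":      [5, 6, 7, 8, 9, 10, 6],
    "OriginCode": ["AAA", "AAA", "BBB", "AAA", "CCC", "AAA", "BBB"],
    "DestCode":   ["BBB", "BBB", "AAA", "CCC", "AAA", "BBB", "CCC"],
})

# Keep June to September, both months included
summer = flights[flights["Month"].between(6, 9)]

# Departures plus arrivals per airport code
traffic = (summer["OriginCode"].value_counts()
           .add(summer["DestCode"].value_counts(), fill_value=0)
           .sort_values(ascending=False))

top10 = traffic.head(10)
print(top10)
```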
