# 2008 Airline on-time dataset
## by Vincent Khor

## Preliminary Wrangling

> Briefly introduce your dataset here. 


In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import time

%matplotlib inline

> Load in your dataset and describe its properties through the questions below.
Try and motivate your exploration goals through this section.

In [2]:
# Load necessary csv files.
flights = pd.read_csv('flights_2008.csv')

In [3]:
print('There are '+ str(flights.shape[0]) +' rows and ' + str(flights.shape[1]) + ' columns in this dataset.')

There are 2389217 rows and 29 columns in this dataset.


In [4]:
flights.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,2008,1,3,4,1343.0,1325,1451.0,1435,WN,588,...,4.0,9.0,0,,0,16.0,0.0,0.0,0.0,0.0
1,2008,1,3,4,1125.0,1120,1247.0,1245,WN,1343,...,3.0,8.0,0,,0,,,,,
2,2008,1,3,4,2009.0,2015,2136.0,2140,WN,3841,...,2.0,14.0,0,,0,,,,,
3,2008,1,3,4,903.0,855,1203.0,1205,WN,3,...,5.0,7.0,0,,0,,,,,
4,2008,1,3,4,1423.0,1400,1726.0,1710,WN,25,...,6.0,10.0,0,,0,16.0,0.0,0.0,0.0,0.0


In [5]:
# Converting column names to be more descriptive
col_name = {'Year':'year', 'Month':'month', 'DayofMonth':'day_of_month','DayOfWeek':'day_of_week',
            'DepTime':'actual_dep_time','CRSDepTime':'scheduled_dep_time','ArrTime':'actual_arr_time',
            'CRSArrTime':'scheduled_arr_time','UniqueCarrier':'carrier_code','FlightNum':'flight_number',
            'TailNum':'tail_number','ActualElapsedTime':'actual_elapsed_time','CRSElapsedTime':'scheduled_elapsed_time',
            'AirTime':'air_time','ArrDelay':'arr_delay','DepDelay':'dep_delay','Origin':'origin','Dest':'destination',
            'Distance':'distance','TaxiIn':'taxi_in_time','TaxiOut':'taxi_out_time','Cancelled':'cancelled',
            'CancellationCode':'cancellation_code','Diverted':'diverted','CarrierDelay':'carrier_delay',
            'WeatherDelay':'weather_delay','NASDelay':'nas_delay', 'SecurityDelay':'security_delay',
            'LateAircraftDelay':'late_aircraft_delay'}
flights=flights.rename(columns=col_name)

In [6]:
flights.head()

Unnamed: 0,year,month,day_of_month,day_of_week,actual_dep_time,scheduled_dep_time,actual_arr_time,scheduled_arr_time,carrier_code,flight_number,...,taxi_in_time,taxi_out_time,cancelled,cancellation_code,diverted,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2008,1,3,4,1343.0,1325,1451.0,1435,WN,588,...,4.0,9.0,0,,0,16.0,0.0,0.0,0.0,0.0
1,2008,1,3,4,1125.0,1120,1247.0,1245,WN,1343,...,3.0,8.0,0,,0,,,,,
2,2008,1,3,4,2009.0,2015,2136.0,2140,WN,3841,...,2.0,14.0,0,,0,,,,,
3,2008,1,3,4,903.0,855,1203.0,1205,WN,3,...,5.0,7.0,0,,0,,,,,
4,2008,1,3,4,1423.0,1400,1726.0,1710,WN,25,...,6.0,10.0,0,,0,16.0,0.0,0.0,0.0,0.0


In [7]:
# Display basic information
flights.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2389217 entries, 0 to 2389216
Data columns (total 29 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   year                    2389217 non-null  int64  
 1   month                   2389217 non-null  int64  
 2   day_of_month            2389217 non-null  int64  
 3   day_of_week             2389217 non-null  int64  
 4   actual_dep_time         2324775 non-null  float64
 5   scheduled_dep_time      2389217 non-null  int64  
 6   actual_arr_time         2319121 non-null  float64
 7   scheduled_arr_time      2389217 non-null  int64  
 8   carrier_code            2389217 non-null  object 
 9   flight_number           2389217 non-null  int64  
 10  tail_number             2346765 non-null  object 
 11  actual_elapsed_time     2319121 non-null  float64
 12  scheduled_elapsed_time  2388810 non-null  float64
 13  air_time                2319121 non-null  float64
 14  ar

In [8]:
# replace 'day of week' values
day = {1:'Monday',2:'Tuesday',3:'Wednesday',4:'Thursday',5:'Friday',6:'Saturday',7:'Sunday'}
flights['day_of_week'] = flights['day_of_week'].map(day)

In [9]:
# Set day order
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday','Saturday','Sunday']
day_order = pd.api.types.CategoricalDtype(ordered=True, categories=days)
flights['day_of_week'] = flights['day_of_week'].astype(day_order);

In [10]:
flights['day_of_week'].unique()

['Thursday', 'Friday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday']
Categories (7, object): ['Monday' < 'Tuesday' < 'Wednesday' < 'Thursday' < 'Friday' < 'Saturday' < 'Sunday']

In [11]:
flights.head()

Unnamed: 0,year,month,day_of_month,day_of_week,actual_dep_time,scheduled_dep_time,actual_arr_time,scheduled_arr_time,carrier_code,flight_number,...,taxi_in_time,taxi_out_time,cancelled,cancellation_code,diverted,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2008,1,3,Thursday,1343.0,1325,1451.0,1435,WN,588,...,4.0,9.0,0,,0,16.0,0.0,0.0,0.0,0.0
1,2008,1,3,Thursday,1125.0,1120,1247.0,1245,WN,1343,...,3.0,8.0,0,,0,,,,,
2,2008,1,3,Thursday,2009.0,2015,2136.0,2140,WN,3841,...,2.0,14.0,0,,0,,,,,
3,2008,1,3,Thursday,903.0,855,1203.0,1205,WN,3,...,5.0,7.0,0,,0,,,,,
4,2008,1,3,Thursday,1423.0,1400,1726.0,1710,WN,25,...,6.0,10.0,0,,0,16.0,0.0,0.0,0.0,0.0


In [12]:
flights.day_of_week.value_counts()

Wednesday    365560
Tuesday      358942
Friday       350566
Thursday     349831
Monday       347984
Sunday       328237
Saturday     288097
Name: day_of_week, dtype: int64

Columns to convert:
- actual_dep_time
- scheduled_dep_time
- actual_arr_time
- scheduled_arr_time

### Converting to HH:MM 

In [None]:
flights[flights['arr_delay'].isnull()]

In [21]:
# Convert actual_dep_time, scheduled_dep_time, actual_arr_time and scheduled_arr_time
# to a more readable HH:MM:SS format
convert_col = ['actual_dep_time','scheduled_dep_time','actual_arr_time','scheduled_arr_time']

for column in convert_col:
    # Convert to string (missing values become the string 'nan')
    flights[column] = flights[column].astype(str)

    # Remove a trailing .0 if present
    flights[column] = flights[column].str.replace(r'\.0$', '', regex=True)

    # Pad string so it is 4 characters long
    flights[column] = flights[column].str.zfill(4)

    # Convert to format HH:MM:SS (missing values end up as '0n:an:00')
    flights[column] = flights[column].apply(lambda x: "{}:{}:00".format(x[:2], x[2:]))

    # Replace 24:00:00 with 00:00:00
    flights[column] = flights[column].replace({'24:00:00': '00:00:00'})

    # Check if any times are still 24:00:00 instead of 00:00:00
    print(column + ', ' + str(flights[flights[column]=='24:00:00'].shape[0]) + ' still need converting.')

actual_dep_time, 0 still need converting.
scheduled_dep_time, 0 still need converting.
actual_arr_time, 0 still need converting.
scheduled_arr_time, 0 still need converting.


In [22]:
flights.head()

Unnamed: 0,year,month,day_of_month,day_of_week,actual_dep_time,scheduled_dep_time,actual_arr_time,scheduled_arr_time,carrier_code,flight_number,...,taxi_in_time,taxi_out_time,cancelled,cancellation_code,diverted,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2008,1,3,Thursday,13:43:00,13:25:00,14:51:00,14:35:00,WN,588,...,4.0,9.0,0,,0,16.0,0.0,0.0,0.0,0.0
1,2008,1,3,Thursday,11:25:00,11:20:00,12:47:00,12:45:00,WN,1343,...,3.0,8.0,0,,0,,,,,
2,2008,1,3,Thursday,20:09:00,20:15:00,21:36:00,21:40:00,WN,3841,...,2.0,14.0,0,,0,,,,,
3,2008,1,3,Thursday,09:03:00,08:55:00,12:03:00,12:05:00,WN,3,...,5.0,7.0,0,,0,,,,,
4,2008,1,3,Thursday,14:23:00,14:00:00,17:26:00,17:10:00,WN,25,...,6.0,10.0,0,,0,16.0,0.0,0.0,0.0,0.0


In [23]:
flights.sample(30)

Unnamed: 0,year,month,day_of_month,day_of_week,actual_dep_time,scheduled_dep_time,actual_arr_time,scheduled_arr_time,carrier_code,flight_number,...,taxi_in_time,taxi_out_time,cancelled,cancellation_code,diverted,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
985856,2008,2,28,Thursday,13:41:00,13:40:00,15:40:00,15:45:00,MQ,4052,...,4.0,13.0,0,,0,,,,,
872326,2008,2,20,Wednesday,14:49:00,14:50:00,15:03:00,15:04:00,US,196,...,2.0,16.0,0,,0,,,,,
2318929,2008,4,15,Tuesday,12:40:00,12:45:00,14:33:00,14:52:00,AS,338,...,4.0,19.0,0,,0,,,,,
882467,2008,2,28,Thursday,15:38:00,15:45:00,16:52:00,17:18:00,US,1460,...,4.0,13.0,0,,0,,,,,
1466606,2008,3,24,Monday,17:13:00,17:04:00,18:33:00,18:32:00,US,32,...,9.0,14.0,0,,0,,,,,
269635,2008,1,11,Friday,14:07:00,14:13:00,15:05:00,15:14:00,US,248,...,6.0,15.0,0,,0,,,,,
1972402,2008,4,23,Wednesday,06:14:00,06:20:00,07:05:00,07:18:00,OO,6701,...,7.0,9.0,0,,0,,,,,
1347542,2008,3,2,Sunday,20:42:00,19:45:00,21:32:00,20:14:00,OO,6535,...,34.0,7.0,0,,0,0.0,0.0,0.0,0.0,78.0
1389460,2008,3,27,Thursday,14:33:00,12:23:00,17:28:00,14:59:00,OO,6740,...,10.0,36.0,0,,0,0.0,0.0,149.0,0.0,0.0
548044,2008,1,3,Thursday,22:15:00,22:10:00,01:01:00,01:10:00,B6,706,...,5.0,11.0,0,,0,,,,,


Some values converted

In [None]:
flights.info()

Some attempts below


In [None]:
flight = flights.copy()

In [None]:
flight.info()

**This method below works but try making it into a function/loop also for other columns**

Need to convert above to HH:MM.
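
One way to turn the conversion into a reusable function, as the note above suggests, is sketched below. The function name `hhmm_to_time_string` and the toy values are hypothetical, and this version is an alternative to the loop rather than a copy of it: it leaves missing values as NaN instead of turning them into a placeholder string.

```python
import numpy as np
import pandas as pd

def hhmm_to_time_string(series):
    """Convert HHMM values (e.g. 1343.0 or 855) to 'HH:MM:SS' strings.

    Missing values stay missing; 2400 is mapped to 00:00:00.
    """
    numeric = pd.to_numeric(series, errors='coerce')
    hours = (numeric // 100) % 24   # 24 -> 0, so 2400 becomes midnight
    minutes = numeric % 100

    out = pd.Series(np.nan, index=series.index, dtype=object)
    valid = numeric.notna()
    out[valid] = (hours[valid].astype(int).astype(str).str.zfill(2) + ':' +
                  minutes[valid].astype(int).astype(str).str.zfill(2) + ':00')
    return out

# Toy example (illustrative values, not taken from the dataset)
demo = pd.Series([1343.0, 855, 2400, np.nan, 5.0])
print(hhmm_to_time_string(demo).tolist())
```

It could then be applied to each of the four time columns in a loop, for example `flights[column] = hhmm_to_time_string(flights[column])`.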

## update notes
**Need to update variable names in description of variables.**
- merge airport locations?
- update string to time in HH:MM for columns with time

In [None]:
flights.describe()

save dataset as new csv..

In [None]:
# Copy dataset before analysis
flight = flights.copy()

In [None]:
flights.info()

### carrier_delay, weather_delay, nas_delay, security_delay, late_aircraft_delay 

In [None]:
flights.carrier_delay.sort_values()

This shows that some values for delays have been recorded as 0 minutes delay whilst others are NaN. For consistency, NaN will be converted to 0 as there is no delay and there shouldn't be two ways to record no delay.

In [None]:
delay_columns = ['carrier_delay','weather_delay','nas_delay','security_delay','late_aircraft_delay']

for column in delay_columns:
    print('Start: There are '+str(flights[column].isnull().sum()) +' NaN values in the '+str(column)+' column.')
    flights[column] = flights[column].fillna(0)
    print('End: There are '+str(flights[column].isnull().sum()) +' NaN values in the '+str(column)+' column. \n')

In [None]:
flights.info(verbose=True, show_counts=True)

### NaN values in arr_delay and dep_delay

In [None]:
flights[flights['arr_delay'].isnull()]

In [None]:
flights[flights['actual_dep_time']=='0n:an:00']

Looks like for some flights that were not delayed, the recorded value was 0.

In [None]:
# Confirm that rows with 0n:an:00 in actual_dep_time and actual_arr_time have 0 as values for all delays.
# (Missing times became the string '0n:an:00' during the HH:MM:SS conversion.)
flights.loc[(flights['actual_dep_time']=='0n:an:00')&
            (flights['actual_arr_time']=='0n:an:00')&
            (flights['carrier_delay']==0)&
            (flights['weather_delay']==0)&
            (flights['nas_delay']==0)&
            (flights['security_delay']==0)&
            (flights['late_aircraft_delay']==0)]

The `0n:an:00` placeholders (the result of converting missing times to strings) appear on flights with no recorded delays, so they will be replaced with the scheduled_dep_time or scheduled_arr_time.
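
The filter above keeps only rows that meet every condition, so on its own it cannot show that all placeholder rows have zero delays. A rough way to check is to compare row counts, as in the sketch below. The function name, the toy data and the placeholder string passed in are assumptions for illustration, and the comparison against `cancelled` is an extra check that is not part of the original analysis.

```python
import pandas as pd

DELAY_COLUMNS = ['carrier_delay', 'weather_delay', 'nas_delay',
                 'security_delay', 'late_aircraft_delay']

def summarise_placeholder_rows(df, time_col, placeholder):
    """Count placeholder rows, how many of them have every delay equal to 0,
    and how many of them are cancelled flights."""
    is_placeholder = df[time_col] == placeholder
    no_delay = (df[DELAY_COLUMNS] == 0).all(axis=1)
    is_cancelled = df['cancelled'] == 1
    return {
        'placeholder_rows': int(is_placeholder.sum()),
        'placeholder_and_no_delay': int((is_placeholder & no_delay).sum()),
        'placeholder_and_cancelled': int((is_placeholder & is_cancelled).sum()),
    }

# Toy frame (made-up values) using the notebook's column names
toy = pd.DataFrame({
    'actual_dep_time': ['13:43:00', '0n:an:00', '0n:an:00'],
    'cancelled': [0, 1, 1],
    'carrier_delay': [16.0, 0.0, 0.0],
    'weather_delay': [0.0, 0.0, 0.0],
    'nas_delay': [0.0, 0.0, 0.0],
    'security_delay': [0.0, 0.0, 0.0],
    'late_aircraft_delay': [0.0, 0.0, 0.0],
})
print(summarise_placeholder_rows(toy, 'actual_dep_time', '0n:an:00'))
```

If the placeholder and cancelled counts match on the real data, the missing times may belong to flights that never departed, and their zero delays may simply come from the earlier NaN-to-0 fill. That would be worth confirming before substituting the scheduled times.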

In [None]:
flights[flights['actual_dep_time']=='0n:an:00']

In [None]:
# If actual_dep_time is 0n:an:00 and there are no delays, set time as scheduled_dep_time
flights['actual_dep_time'] = np.where(flights['actual_dep_time']=='0n:an:00',flights['scheduled_dep_time'],flights['actual_dep_time'])

In [None]:
# If actual_arr_time is 0n:an:00 and there are no delays, set time as scheduled_arr_time
flights['actual_arr_time'] = np.where(flights['actual_arr_time']=='0n:an:00',flights['scheduled_arr_time'],flights['actual_arr_time'])

In [None]:
flights['actual_arr_time'] = pd.to_timedelta(flights['actual_arr_time'])

### What is the structure of your dataset?

This flight dataset contains 2,389,217 rows of data with 29 different variables that have been recorded.

### What is/are the main feature(s) of interest in your dataset?

I'm most interested in examining which variables are best for predicting airline on-time performance. 

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

The features that will help support my investigation will be time variables as well as reasons for the delays, cancellations and diversions.

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

ideas:
- carrier code distribution
- arr delay
- dep delay
- origin
- destination
    

In [None]:
(flights.cancelled==0).sum()

### Comparing the number of flights vs flights cancelled

In [None]:
# Cancelled flights
plt.figure(figsize=(8,6))

# Plot barchart
sns.countplot(data=flights,x='cancelled', color='green')

# labels
ticks = np.arange(0,25e5+25e4,25e4)
plt.yticks(ticks,['0','250K','500K','750K','1M','1.25M','1.5M','1.75M','2M','2.25M','2.5M'])
plt.ylabel('Frequency')
plt.xlabel('Cancelled')
plt.title('Flights cancelled');

Only a small proportion of flights are cancelled, as shown in the bar chart above, where 0 represents flights that were not cancelled and 1 represents cancelled flights. It would be interesting to see the distribution of flights throughout the week.
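
To put a number on "a small proportion", the share of each value can be computed directly. This is a minimal sketch; the helper name `cancellation_share` and the toy series are made up for illustration.

```python
import pandas as pd

def cancellation_share(cancelled):
    """Return the share of 0 (not cancelled) and 1 (cancelled) flights."""
    return cancelled.value_counts(normalize=True).sort_index()

# Toy data: 1 cancelled flight out of 10 (illustrative only)
toy_cancelled = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
print(cancellation_share(toy_cancelled))
```

On the notebook's data this would be called as `cancellation_share(flights['cancelled'])`.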

### Examining the distribution of flights grouped by day of the week

In [None]:
# Plot graph
plt.figure(figsize=(8,5))
base_colour = sns.color_palette()[0]
sns.countplot(data=flights,x='day_of_week',color=base_colour)

# Labels
tick_locs = np.arange(275000,flights['day_of_week'].value_counts().values.max()+25000,25000)
plt.xticks(rotation=90)
plt.yticks(tick_locs) # set y ticks
plt.ylim([275e3,38e4]) # set y axis limits
plt.xlabel('Day of the week')
plt.ylabel('Frequency')
plt.title('Number of flights by day of the week');

The bar graph above shows that the number of flights differs depending on the day of the week. Weekends have the fewest flights whilst Wednesdays have the most. I will later examine whether the day of the week affects the proportion of delayed, diverted or cancelled flights when performing a bivariate analysis.

### Compare the distribution of cancellation codes.

In [None]:
# Cancellation code count

# Plot graph
plt.figure(figsize=(8,6))
code_order = ['A','B','C','D']
sns.countplot(data=flights,x='cancellation_code',order=code_order,color='brown')

# tick locations
tick_locs = np.arange(0,275e2+25e2,25e2)
plt.yticks(tick_locs)

# Labels
plt.xlabel('Cancellation Code')
plt.ylabel('Frequency')
plt.title('Frequency of cancellation codes');

In [None]:
# Frequency values for each cancellation code
flights.cancellation_code.value_counts()

The graph above shows that the most common cancellation code is A (carrier), followed closely by B (weather). C (NAS) is much less frequent, approximately half as frequent as A and B. D (security) was the least common: there were only 6 instances where a flight was cancelled due to security issues.
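
The code letters can be replaced with the reasons named above so that tables and axis labels are easier to read. The sketch below assumes the mapping given in the text (A carrier, B weather, C NAS, D security); the helper name and toy data are hypothetical.

```python
import pandas as pd

# Mapping taken from the description above
CANCELLATION_LABELS = {'A': 'Carrier', 'B': 'Weather', 'C': 'NAS', 'D': 'Security'}

def labelled_cancellation_counts(codes):
    """Count cancellation codes and index the result by a readable label.

    Missing codes (flights that were not cancelled) are ignored.
    """
    counts = codes.value_counts().reindex(list(CANCELLATION_LABELS), fill_value=0)
    counts.index = [CANCELLATION_LABELS[code] for code in counts.index]
    return counts

# Toy data (illustrative only)
toy_codes = pd.Series(['A', 'B', 'A', None, 'C', 'B', 'A'])
print(labelled_cancellation_counts(toy_codes))
```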

### Comparing flight distances

In [None]:
# Find the min and max flight distances
flights.distance.describe()

In [None]:
# Flight distances

# Select bin size
bins = 10 ** np.arange(-1,3.7+0.1,0.1)

# Plot graph
plt.figure(figsize=(8,6))
plt.hist(data=flights,x='distance', bins=bins)

# set x axis scale
plt.xscale('log')

# set x and y axis labels (custom)
x_ticks = [10,20,50,100,200,500,1000,2000,5000]
x_labels = ['{}'.format(v) for v in x_ticks]
y_ticks = np.arange(0,3e5+5e4,5e4)
y_labels = ['0','50K','100K','150K','200K','250K','300K']

# Labels and title
plt.xticks(x_ticks,x_labels)
plt.yticks(y_ticks,y_labels)
plt.title('Flight distances')
plt.ylabel('Frequency')
plt.xlabel('Distance (miles)')
plt.xlim(5,5000);

The graph above shows that the most common flights were between 500 and 1,000 miles. Some flights were up to 5,000 miles whilst others were less than 50 miles, so there is a large variation in flight distances. It would be useful to see in the bivariate analysis whether flight distance affects whether a flight gets cancelled.
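
The histogram uses bin edges that are evenly spaced in log10 units, so every bar has the same width on the log-scaled axis. The sketch below shows how such edges are built and checks the spacing; `log_bin_edges` and the toy distances are hypothetical names and values.

```python
import numpy as np

def log_bin_edges(low_exp, high_exp, step):
    """Bin edges from 10**low_exp to about 10**high_exp, evenly spaced on a log10 axis."""
    return 10 ** np.arange(low_exp, high_exp + step, step)

# Same exponents as the distance histogram: 10**-1 up to about 10**3.7
edges = log_bin_edges(-1, 3.7, 0.1)

# Each edge is a constant factor (10**0.1, about 1.26) larger than the previous one
print(np.allclose(np.diff(np.log10(edges)), 0.1))

# Toy distances in miles (illustrative only) counted into those bins
toy_distances = np.array([30, 250, 600, 750, 900, 2500])
counts, _ = np.histogram(toy_distances, bins=edges)
print(counts.sum())
```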

### Comparing flight airtimes

In [None]:
# Flight airtimes

# Select bin size
bins = 10 ** np.arange(-1,3+0.05,0.05)

# Plot graph
plt.figure(figsize=(8,6))
plt.hist(data=flights,x='air_time',bins=bins)

# set x axis scale
plt.xscale('log')

# set axis labels (custom)
x_ticks = [10,20,50,100,200,500,1000]
x_labels = ['{}'.format(v) for v in x_ticks]

y_ticks = np.arange(0,18e4+2e4,2e4)
y_labels = ['0','20K','40K','60K','80K','100K','120K','140K','160K','180K']

# Labels and title
plt.xticks(x_ticks,x_labels)
plt.yticks(y_ticks,y_labels)
plt.title('Flight Airtime')
plt.ylabel('Frequency')
plt.xlabel('Airtime (minutes)')
plt.xlim(5,1000);

The airtime for most flights was around 100 minutes. On a log scale, the distribution of airtime appears approximately normal.
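
One rough way to back up the visual impression is to compare the skewness of the raw values with the skewness of their log10 values: a value near 0 after the transform is consistent with a roughly symmetric, bell-shaped distribution on the log scale. This is only an informal check, not a formal test of normality, and the function name and simulated data below are assumptions.

```python
import numpy as np
import pandas as pd

def skew_raw_vs_log(values):
    """Return (skew of raw values, skew of log10 values), ignoring NaN and non-positive values."""
    values = pd.Series(values).dropna()
    values = values[values > 0]
    return values.skew(), np.log10(values).skew()

# Simulated airtimes centred near 100 minutes (not the real data)
rng = np.random.default_rng(0)
toy_air_time = rng.lognormal(mean=np.log(100), sigma=0.5, size=10_000)

raw_skew, log_skew = skew_raw_vs_log(toy_air_time)
print(round(raw_skew, 2), round(log_skew, 2))
```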

### What are the top 15 airport origins and destinations?

In [None]:
# data setup:
origin_freq = flights.origin.value_counts() # flight counts per origin airport
dest_freq = flights.destination.value_counts() # flight counts per destination airport
thres = 15 # number of top airports to keep

# set order
top_origins = origin_freq.index[:thres]
origin_order = origin_freq.index[:thres]

top_dest = dest_freq.index[:thres]
dest_order = dest_freq.index[:thres]

# create subset dataframes
origin_sub = flights.loc[flights['origin'].isin(top_origins)]
dest_sub = flights.loc[flights['destination'].isin(top_dest)]

In [None]:
plt.figure(figsize = [20, 5])
base_colour = sns.color_palette()[0]

# set y labels
tick_locs = np.arange(0,14e4+2e4,2e4)
tick_labels = ['0','20K','40K','60K','80K','100K','120K','140K']

# plot top origins
plt.subplot(1,2,1)
sns.countplot(data=origin_sub,x='origin',order=origin_order,color=base_colour)
plt.yticks(tick_locs,tick_labels)
plt.ylabel('Frequency')
plt.xlabel('Origin')
plt.title('Top airport origins')

# plot top destinations
plt.subplot(1,2,2)
sns.countplot(data=dest_sub,x='destination',order=dest_order,color=base_colour)
plt.yticks(tick_locs,tick_labels)
plt.ylabel('Frequency')
plt.xlabel('Destination')
plt.title('Top airport destinations');

As expected, the top 15 airport origins and destinations are the same. This makes sense, as a plane landing at an airport will usually then depart from it with passengers for another airport, which is the efficient thing to do.

Below are the value counts of each of the top 15 origins and destinations. The numbers for the origins and destinations for each airport are very similar.
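
The two claims, that the top airports are the same and that their counts are similar, can also be checked programmatically. The sketch below is one possible way; `compare_top_airports` and the three-letter toy codes are made up for illustration.

```python
import pandas as pd

def compare_top_airports(df, n=15):
    """Compare the n busiest origins with the n busiest destinations.

    Returns whether both sets contain the same airports, plus a table of
    departures, arrivals and their difference per airport.
    """
    top_origins = df['origin'].value_counts().head(n)
    top_dests = df['destination'].value_counts().head(n)
    same_airports = set(top_origins.index) == set(top_dests.index)
    comparison = pd.DataFrame({'departures': top_origins, 'arrivals': top_dests})
    comparison['difference'] = comparison['departures'] - comparison['arrivals']
    return same_airports, comparison

# Toy data with invented airport codes
toy = pd.DataFrame({
    'origin':      ['AAA', 'AAA', 'BBB', 'CCC', 'BBB', 'AAA'],
    'destination': ['BBB', 'CCC', 'AAA', 'AAA', 'AAA', 'BBB'],
})
same, table = compare_top_airports(toy, n=2)
print(same)
print(table)
```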

In [None]:
origin_sub.origin.value_counts()

In [None]:
dest_sub.destination.value_counts()

### Comparing the distribution of arrival and departure delay times

In [None]:
print('The minimum departure delay is ' + str(int(flights.dep_delay.min())) + ' minutes.')
print('The maximum departure delay is ' + str(int(flights.dep_delay.max())) + ' minutes.')

print('The minimum arrival delay is ' + str(int(flights.arr_delay.min())) + ' minutes.')
print('The maximum arrival delay is ' + str(int(flights.arr_delay.max())) + ' minutes.')

Some flights departed early and some also arrived early. The two histograms below show the difference between the scheduled and actual times for departures and arrivals.

In [None]:
# flight departure 
bins = np.arange(flights.dep_delay.min()-1,flights.dep_delay.max()+1,1)

# Plotting flight departure delays
plt.xlim(flights.dep_delay.min()-1,flights.dep_delay.max()+1)
tick_locs = np.arange(0,180000+20000,20000)
plt.yticks(tick_locs)

# plot graph
plt.hist(data=flights,x='dep_delay',bins=bins);

In [None]:
# flight arrival delays
bins = np.arange(flights.arr_delay.min()-1,flights.arr_delay.max()+1,1)

#set limits and axis
plt.xlim(flights.arr_delay.min()-1,flights.arr_delay.max()+1)
tick_locs = np.arange(0,180000+20000,20000)
plt.yticks(tick_locs)

# plot graph
plt.hist(data=flights,x='arr_delay',bins=bins);

Both of the histograms above show that most flights departed or arrived on time. However, the scales on the two histograms are not the same, and there appear to be some outliers that leave most of the histogram not clearly visible. A further comparison will be made with limits set on the x axis and with the same y-tick scale for easier comparison.
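
As an alternative to picking axis limits by eye, the limits could be derived from percentiles so that a handful of extreme delays do not stretch the axis. This is a sketch under that assumption; `central_range`, the 1st/99th percentile cut-offs and the toy delays are all hypothetical choices.

```python
import pandas as pd

def central_range(values, lower=0.01, upper=0.99):
    """Return the (lower, upper) quantiles of the non-missing values."""
    values = pd.Series(values).dropna()
    return values.quantile(lower), values.quantile(upper)

# Toy delays in minutes with one extreme outlier (illustrative only)
toy_delays = list(range(-10, 90)) + [2000]
low, high = central_range(toy_delays)
print(low, high)
```

The returned pair could be passed to `plt.xlim(low, high)` before plotting.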

In [None]:
# Comparing departure and arrival delay times
plt.figure(figsize = [20, 5])

bins = np.arange(-100,200+1,1) # Select bin size
tick_locs = np.arange(0,18e4+2e4,2e4) # set tick locations
tick_labels = ['0','20K','40K','60K','80K','100K','120K','140K','160K','180K']

# plot departure delay times
plt.subplot(1,2,1)
plt.hist(data=flights,x='dep_delay',bins=bins)
# set labels and axes
plt.yticks(tick_locs,tick_labels)
plt.xlim(-100,200+1)
plt.ylabel('Frequency')
plt.xlabel('Departure Delay (mins)')
plt.title('Departure delays')

# plot arrival delay times
plt.subplot(1,2,2)
plt.hist(data=flights,x='arr_delay',bins=bins)
# set labels and axes
plt.xlim(-100,200+1)
plt.yticks(tick_locs,tick_labels)
plt.ylabel('Frequency')
plt.xlabel('Arrival Delay (mins)')
plt.title('Arrival delays');

Both histograms show that most flights depart and arrive close to their scheduled time. However, flights are more likely to depart on time than to arrive on time. This comparison is easy to make because the y-axis scale is now the same on both histograms. Arrival times also vary more from the schedule than departure times do.

More flights arrive early than depart early. This is expected, as a flight leaving early would likely cause passengers to miss it, whilst arriving early is not an issue. Flights may take less time than scheduled to reach their destination, for example because of favourable weather such as a tailwind or pilots flying at a faster speed, or because airlines are more likely to overestimate the time required, which makes early arrivals more common.

However, flights are also more likely to arrive late. Reasons for this may include bad weather or congestion at the destination airport, meaning the aeroplane takes longer to land.
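
The comparison between departures and arrivals can be quantified by computing the share of flights that are early, exactly on time, or late. The sketch below is one way to do that; `delay_breakdown` and the toy delays are invented for illustration.

```python
import numpy as np
import pandas as pd

def delay_breakdown(delays):
    """Share of flights that are early (< 0), on time (== 0) or late (> 0).

    Missing delays are dropped before the shares are computed.
    """
    delays = pd.Series(delays).dropna()
    return pd.Series({
        'early': (delays < 0).mean(),
        'on_time': (delays == 0).mean(),
        'late': (delays > 0).mean(),
    })

# Toy delays in minutes (illustrative only)
toy_dep = delay_breakdown([-2, 0, 0, 5, np.nan])
toy_arr = delay_breakdown([-10, -3, 4, 12, np.nan])
print(pd.DataFrame({'departure': toy_dep, 'arrival': toy_arr}))
```

On the real data the inputs would be `flights['dep_delay']` and `flights['arr_delay']`.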

> Make sure that, after every plot or related series of plots, that you
include a Markdown cell with comments about what you observed, and what
you plan on investigating next.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

ideas:
- distance vs cancellation - if not update description in univariate
- carrier vs cancellation
- subplots for cancelled vs carrier/weather/nas/security/late aircraft delays
- faceting top airports vs delay/cancellation times
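
For the first idea in the list, one possible approach is to group flights into distance bands and compute the cancellation rate in each band. The sketch below is only a starting point: the function name, the band edges and the toy data are assumptions, not choices made in this notebook.

```python
import pandas as pd

def cancellation_rate_by_distance(df, edges):
    """Mean of the 0/1 `cancelled` flag within each distance band."""
    bands = pd.cut(df['distance'], bins=edges)
    return df.groupby(bands, observed=False)['cancelled'].mean()

# Toy data (made-up distances in miles and cancellation flags)
toy = pd.DataFrame({
    'distance':  [100, 200, 600, 800, 1500, 2500],
    'cancelled': [1,   1,   0,   1,   0,    0],
})
rates = cancellation_rate_by_distance(toy, edges=[0, 500, 1000, 5000])
print(rates)
```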

In [None]:
flights.info()

In [None]:
# Create subset of cancelled flights
flight_cancelled = flights[flights.cancelled==1]
sns.countplot(data=flight_cancelled,x='day_of_week',color=base_colour)

The graph above shows the total number of cancellations by day of the week. By number, there were fewest cancellations on weekends. However, as the number of flights varies by day, it would be more appropriate to calculate the proportion of cancelled flights by day.


Split by day and calculate the average: since `cancelled` is a 0/1 flag, its mean is the proportion of cancelled flights.

In [None]:
# Calculate the overall proportion of cancelled flights
flights.cancelled.mean()

This shows that a flight currently has a 2.7% chance of being cancelled. However, how does this rate differ when we group by day?

In [None]:
day_means = flights.groupby('day_of_week',as_index=False).mean(numeric_only=True)
day_means

In [None]:
# Plot the cancellation rate for each day
ax = day_means.plot.bar(x='day_of_week',y='cancelled',legend=False,figsize=(8,6))

# Labels
plt.xlabel('Day of the Week')
plt.ylabel('Proportion of flights cancelled')
plt.title('Proportion of flights cancelled by day of the week');

This now shows that the rate of cancellations does differ from day to day. Tuesdays and Fridays have the highest proportion of cancelled flights, whilst Saturdays and Sundays have the lowest.

compare whether certain airports have a higher proportion of cancelled flights.
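
A possible starting point for the airport comparison is sketched below. It computes the cancellation rate per origin airport and drops airports with too few flights, since a rate based on a handful of flights is unreliable. The function name, the `min_flights` threshold and the toy airport codes are all hypothetical.

```python
import pandas as pd

def cancellation_rate_by_airport(df, airport_col='origin', min_flights=1000):
    """Flights and cancellation rate per airport, highest rate first.

    Airports with fewer than `min_flights` flights are excluded.
    """
    stats = df.groupby(airport_col)['cancelled'].agg(flights='count', cancel_rate='mean')
    stats = stats[stats['flights'] >= min_flights]
    return stats.sort_values('cancel_rate', ascending=False)

# Toy data with invented airport codes
toy = pd.DataFrame({
    'origin':    ['AAA', 'AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'CCC'],
    'cancelled': [0,     1,     0,     0,     1,     1,     0],
})
airport_stats = cancellation_rate_by_airport(toy, min_flights=2)
print(airport_stats)
```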

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

notes: try plotting long vs lat vs one other variable.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

> At the end of your report, make sure that you export the notebook as an
html file from the `File > Download as... > HTML` menu. Make sure you keep
track of where the exported file goes, so you can put it in the same folder
as this notebook for project submission. Also, make sure you **remove all of
the quote-formatted guide notes** like this one before you finish your report!